
What is a Character Encoding System?

Last Updated : 19 Oct, 2021

Computers work only with binary digits (0 and 1); they have no built-in notion of letters, decimal digits, or punctuation. Character encoding bridges this gap: it is the process of mapping each character (letter, digit, punctuation mark, or symbol) to a unique numeric code that can be stored or transmitted as bits. Text is commonly represented using schemes such as ASCII, ISCII, and Unicode, the last of which is serialized with encoding forms such as UTF-8 and UTF-32. Other kinds of data, such as photos, audio, and video, are also stored as binary, but they use their own formats rather than character encodings. For example, the letter A is represented by the code 65, because every character, symbol, and digit is assigned a unique code by the encoding scheme in use. Some of the commonly used encoding schemes are described below:

1. ASCII: ASCII stands for American Standard Code for Information Interchange. It was developed by the X3 committee of the American Standards Association (ASA) and first published in 1963 as ASA X3.4-1963; it was then revised several times between 1967 and 1986. ASCII is a 7-bit code that defines 128 code points (0 to 127) for letters, digits, punctuation, and control characters. It is usually stored one character per 8-bit byte with the high bit set to 0, and the 256-slot tables often labeled “ASCII” are 8-bit extensions of it. Each character's decimal (Dec) code is simply a number, which the computer stores in binary. For example, the lowercase “h” character (char) has the decimal value 104, which is “01101000” in binary.
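The mapping between characters and their numeric codes can be explored with a short Python sketch. It relies only on the built-in `ord` and `chr` functions; the helper names `to_codes` and `from_codes` are illustrative and not part of any standard.

```python
def to_codes(text):
    """Return the numeric code assigned to each character in text."""
    return [ord(ch) for ch in text]


def from_codes(codes):
    """Rebuild the text from a list of numeric codes."""
    return "".join(chr(code) for code in codes)


print(ord("h"))            # 104
print(chr(104))            # h

codes = to_codes("Az9!")
print(codes)               # [65, 122, 57, 33]
print(from_codes(codes))   # Az9!
```

Converting to codes and back returns the original text, which is the property any character encoding has to guarantee.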

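The ASCII table, together with its common 8-bit extensions, is usually divided into three sections.

  1. Non-printable control codes, between 0 and 31 (code 127, DEL, is also a control code).
  2. Printable ASCII characters (sometimes called lower ASCII), between 32 and 126.
  3. Extended or “higher ASCII” values, between 128 and 255, which are not part of the ASCII standard and differ between 8-bit character sets.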
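As a rough illustration of these sections, the following Python sketch classifies a byte value by range. It assumes the standard 7-bit definition of ASCII, in which 127 (DEL) is a control code and values from 128 to 255 belong to 8-bit extensions; the function name `ascii_section` and the label strings are made up for this example.

```python
def ascii_section(code):
    """Classify a byte value (0-255) into one of the table's sections."""
    if not 0 <= code <= 255:
        raise ValueError("expected a value between 0 and 255")
    if code <= 31 or code == 127:
        return "control (non-printable)"
    if code <= 126:
        return "printable ASCII"
    # 128-255: meaning depends on which 8-bit extension is in use
    return "extended (not part of standard ASCII)"


for code in [10, 65, 126, 127, 233]:
    print(code, ascii_section(code))
```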

ASCII Table for characters:

Letter  ASCII Code  Letter  ASCII Code
a       97          A       65
b       98          B       66
c       99          C       67
d       100         D       68
e       101         E       69
f       102         F       70
g       103         G       71
h       104         H       72
i       105         I       73
j       106         J       74
k       107         K       75
l       108         L       76
m       109         M       77
n       110         N       78
o       111         O       79
p       112         P       80
q       113         Q       81
r       114         R       82
s       115         S       83
t       116         T       84
u       117         U       85
v       118         V       86
w       119         W       87
x       120         X       88
y       121         Y       89
z       122         Z       90
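One pattern visible in the table is that each lowercase letter's code is exactly 32 higher than the code of its uppercase counterpart (97 versus 65, 122 versus 90). The Python sketch below uses that offset to swap the case of ASCII letters; `swap_ascii_case` is a hypothetical helper written for this example, and real programs would normally use `str.upper`, `str.lower`, or `str.swapcase` instead.

```python
CASE_OFFSET = ord("a") - ord("A")  # 97 - 65 = 32


def swap_ascii_case(ch):
    """Swap the case of a single ASCII letter; leave anything else unchanged."""
    if "a" <= ch <= "z":
        return chr(ord(ch) - CASE_OFFSET)
    if "A" <= ch <= "Z":
        return chr(ord(ch) + CASE_OFFSET)
    return ch


print(CASE_OFFSET)                                           # 32
print("".join(swap_ascii_case(c) for c in "Hello, World!"))  # hELLO, wORLD!
```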

2. ISCII: ISCII stands for Indian Script Code for Information Interchange. It is an encoding scheme that can represent the scripts used to write a wide range of Indian languages. To ease transliteration between these writing systems, ISCII uses a single, common encoding mechanism for all of them.

ISCII was standardized in 1991 by the Bureau of Indian Standards (BIS). It is an 8-bit encoding with up to 256 code positions. The first 128 codes (0 to 127) are the same as in ASCII, and the remaining codes (128 to 255) represent characters from Indian scripts.

Advantages include:

  1. It covers the scripts of the vast majority of Indian languages.
  2. The character set is simple and straightforward.
  3. Text can be transliterated easily between the supported scripts.

Disadvantages include:

  1. A special keyboard with ISCII character keys is required.
  2. Unicode later incorporated the ISCII characters, which made ISCII largely obsolete.

3. Unicode: Computers store and process characters as numbers (bit sequences). Before Unicode, hundreds of different encoding schemes, often called code pages, each assigned their own numbers to letters and other characters. Many of these code pages used 8 bits per character and could therefore hold only 256 characters, so no single one could cover all the world's writing systems, and the same byte value could stand for different characters in different code pages. Unicode replaces them with a single standard that assigns a unique number, called a code point, to every character. Its main benefits are:

  1. Unicode enables a single software product or website to target multiple platforms, languages, and countries without re-engineering, resulting in significant cost savings over older character sets.
  2. Unicode data can be exchanged between many different systems without corruption.
  3. Unicode is a universal character set that can represent text in any language, irrespective of device, operating system, or software.
  4. Unicode can act as a pivot for converting between character encodings. Because it is a superset of all other major character sets, text can be converted from one encoding scheme to Unicode and then from Unicode to a different encoding scheme.
  5. Unicode is the most widely used character encoding standard.
  6. The Unicode Standard is kept fully compatible and synchronized with the corresponding versions of ISO/IEC 10646, which defines the Universal Character Set. Unicode 4.0 already defined 96,447 characters, and later versions have added many more, enough to represent the characters of virtually every writing system in use.
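The conversion route described in point 4 (one encoding to Unicode, then Unicode to another encoding) can be sketched in Python, whose `str` type holds Unicode code points. The function name `transcode` is an assumption made for this example; `bytes.decode` and `str.encode` are the standard built-in methods, and `latin-1` and `utf-8` are codec names that ship with Python.

```python
def transcode(data, source, target):
    """Convert encoded bytes from one encoding to another via Unicode."""
    text = data.decode(source)    # bytes -> Unicode code points
    return text.encode(target)    # Unicode code points -> bytes


latin1_bytes = b"caf\xe9"         # "cafe" with an accented e, in Latin-1
utf8_bytes = transcode(latin1_bytes, "latin-1", "utf-8")

print(utf8_bytes)                 # b'caf\xc3\xa9'
# The intermediate Unicode code points, shown in hexadecimal:
print([hex(ord(ch)) for ch in latin1_bytes.decode("latin-1")])
# ['0x63', '0x61', '0x66', '0xe9']
```

The conversion only succeeds when every character in the text also exists in the target encoding; otherwise `encode` raises `UnicodeEncodeError`.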

4. UTF-8: UTF-8 is a variable-width character encoding used for electronic communication. It can encode all 1,112,064 valid Unicode code points using one to four one-byte (8-bit) code units. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. UTF-8 was designed to be backward compatible with ASCII: the first 128 Unicode characters, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as in ASCII, so any valid ASCII text is also valid UTF-8-encoded Unicode.
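The variable width is easy to observe in Python by encoding characters from different parts of the Unicode range. This is a minimal sketch: the sample characters are arbitrary choices, written as escape sequences so the code itself stays plain ASCII, and `utf8_length` is a hypothetical helper.

```python
def utf8_length(ch):
    """Return how many bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


samples = [
    "A",            # U+0041, Latin capital letter A
    "\u00e9",       # U+00E9, Latin small letter e with acute
    "\u20ac",       # U+20AC, euro sign
    "\U0001F600",   # U+1F600, an emoji outside the Basic Multilingual Plane
]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")

# ASCII compatibility: ASCII text has identical bytes in both encodings.
print("Hello".encode("utf-8") == "Hello".encode("ascii"))  # True
```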

Converting Symbols to Binary:

Character  ASCII Code  Byte (binary)
A          65          01000001
a          97          01100001
B          66          01000010
b          98          01100010
Z          90          01011010
0          48          00110000
9          57          00111001
!          33          00100001
?          63          00111111
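The conversion in this table can be reproduced with a few lines of Python. The sketch pads each binary value with leading zeros to a full 8-bit byte; `to_byte` is a hypothetical helper name, and it assumes the input is a single ASCII character.

```python
def to_byte(ch):
    """Return the 8-bit binary string for a single ASCII character."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not an ASCII character")
    return format(code, "08b")  # binary, zero-padded to 8 digits


for ch in "AaBbZ09!?":
    print(ch, ord(ch), to_byte(ch))
```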

5. UTF-32: UTF-32 stands for 32-bit Unicode Transformation Format. It is a fixed-length encoding that uses exactly 32 bits (4 bytes) per Unicode code point, so the number of code points in a UTF-32 string is simply its length in bytes divided by four. The main advantage of UTF-32 is that code points can be indexed directly: finding the Nth code point in a sequence is a constant-time operation, whereas a variable-length encoding requires sequential access to locate the Nth code point. (User-perceived characters, such as “grapheme clusters” and some emojis, can span several code points, so they cannot be indexed directly, and determining the displayed width of a string is more complex.) As a result, UTF-32 is a straightforward replacement in code that, as was commonly done for ASCII, examines each position in a string using an integer index incremented by one.
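A minimal Python sketch of the fixed-width property follows. It assumes little-endian UTF-32 without a byte order mark (Python's `utf-32-le` codec); the helper names `utf32_count` and `utf32_nth` are invented for this example.

```python
def utf32_count(data):
    """Number of code points in UTF-32 encoded bytes (4 bytes each)."""
    return len(data) // 4


def utf32_nth(data, n):
    """Return the n-th code point (0-based) by jumping straight to its offset."""
    start = 4 * n
    if n < 0 or start + 4 > len(data):
        raise IndexError("code point index out of range")
    return chr(int.from_bytes(data[start:start + 4], "little"))


# Four code points: 'h', e with acute, an emoji, and '!'
data = "h\u00e9\U0001F600!".encode("utf-32-le")

print(len(data))                             # 16
print(utf32_count(data))                     # 4
print(utf32_nth(data, 2) == "\U0001F600")    # True
```

The lookup in `utf32_nth` never scans earlier characters, which is the constant-time indexing described above. The trade-off is size: the same four code points take only 8 bytes in UTF-8.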


