Computers do not store letters, numbers, and pictures directly. They convert them into bits, each of which has one of two values, 0 or 1. To store each letter or number and read it back correctly, we need rules that map characters to bit patterns. These rules make up an encoding scheme. We will look at three popular encoding schemes: ASCII, ISCII, and Unicode.
ASCII stands for American Standard Code for Information Interchange. It was introduced in 1963 by the American Standards Association (ASA). ASCII is broadly divided into two sub-categories:
- Standard ASCII: Standard ASCII is a 7-bit code covering the first 128 characters, numbered 0 to 127. It comprises non-printable characters and lower ASCII. The non-printable characters are control codes that do not produce a visible symbol on the screen, such as newline and tab; they occupy codes 0 to 31, plus 127 (delete). Lower ASCII covers the remaining codes, 32 to 126, and contains letters, digits, and special symbols.
- Extended ASCII: Standard ASCII covers unaccented English letters, digits, and common punctuation, which is not enough for most other languages. Extended ASCII uses an eighth bit to add 128 more codes (128 to 255), taking the total to 256. There is no single extended ASCII standard: different 8-bit encodings, such as ISO 8859-1 and the various code pages, assign these upper codes to different characters.
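The ASCII ranges described above can be explored with a short Python sketch. It relies only on built-ins (`ord`, `chr`, `str.encode`); the helper name `is_standard_ascii` and the choice of `latin-1` and `cp437` as two example 8-bit encodings are assumptions made for illustration.

```python
# Print the ASCII code and 7-bit pattern of a few characters.
for ch in ["A", "a", "0", " "]:
    code = ord(ch)                # numeric code of the character
    bits = format(code, "07b")    # standard ASCII fits in 7 bits
    print(repr(ch), code, bits)

# Collect the codes in 0..127 that are control characters (not printable).
control_codes = [c for c in range(128) if not chr(c).isprintable()]


def is_standard_ascii(text: str) -> bool:
    """Return True if every character in text has a code in 0..127."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


# The same byte above 127 maps to different characters in different
# 8-bit "extended ASCII" encodings.
upper_byte = bytes([0xE9])
as_latin1 = upper_byte.decode("latin-1")
as_cp437 = upper_byte.decode("cp437")
print(repr(as_latin1), repr(as_cp437))
```

The last two lines show why the upper 128 codes are ambiguous unless the reader knows which 8-bit encoding the writer used.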
ISCII stands for Indian Script Code for Information Interchange. It was standardized by the Bureau of Indian Standards (BIS) in 1991. It is an 8-bit standard in which the first 128 codes, 0 to 127, are the same as standard ASCII, and the next 128 codes represent characters of Indian scripts. It covers the scripts of the most widely used Indian languages, including Devanagari, Gujarati, Bengali, Oriya, Punjabi (Gurmukhi), Assamese, Kannada, Telugu, Malayalam, and Tamil.
ASCII and its 8-bit extensions were too limited to cover all the languages of the world, so a new encoding scheme was needed. The Unicode Consortium, a non-profit organization, published the first version of the Unicode standard in 1991. That first version defined only about 7,000 characters, but today Unicode covers more than 128,000 characters.
Types of Unicode encoding:
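- UTF-8: A variable-length encoding built from 8-bit units. Each character takes 1 to 4 bytes, and ASCII characters take a single byte with the same value as in ASCII. It is widely used in email and is the dominant encoding on the web and in software.
- UTF-16: A variable-length encoding built from 16-bit units. Each character takes 2 or 4 bytes.
- UTF-32: A fixed-length encoding. Every character takes 4 bytes, i.e. 32 bits.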
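To make the size differences between the three encodings concrete, here is a minimal Python sketch that counts the bytes each one needs for a few sample characters. The helper name `encoded_lengths` and the sample characters are assumptions chosen for illustration; the `-le` codec variants are used so that no byte order mark is added to the count.

```python
# Sample characters: Latin A, e with acute accent, Devanagari A, an emoji.
samples = ["A", "\u00e9", "\u0905", "\U0001F600"]


def encoded_lengths(ch: str) -> dict:
    """Return the code point and the byte length of ch in each UTF form."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),   # no byte order mark
        "utf-32": len(ch.encode("utf-32-le")),   # no byte order mark
    }


for ch in samples:
    print(ch, encoded_lengths(ch))
```

UTF-8 is the most compact for ASCII-heavy text, while UTF-32 trades space for a fixed width per character.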
Why do we need Unicode?
- Unicode allows us to design a single application for many platforms and languages. We do not need to rebuild the same application to launch it in another language.
- This leads to reduced application development costs.
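- It reduces the data corruption that occurs when text is exchanged between systems using incompatible encodings.
- It acts as a single encoding scheme across all languages and platforms.
- Its character set is a superset of the character sets of most other encoding schemes, so text in those encodings can be converted to Unicode, and back again as long as the target encoding supports the characters used.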
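The points about data corruption and conversion can be illustrated with a small Python sketch. The sample string, the helper name `can_encode`, and the choice of `latin-1` and `cp1251` as two example legacy encodings are assumptions made for illustration.

```python
original = "caf\u00e9"  # "cafe" with an acute accent on the e

# Bytes written by a program that uses a legacy single-byte encoding.
latin1_bytes = original.encode("latin-1")

# Reading those bytes with the wrong legacy encoding raises no error,
# but silently produces different text.
misread = latin1_bytes.decode("cp1251")

# Converting legacy bytes to Unicode (stored here as UTF-8) and back is lossless.
as_text = latin1_bytes.decode("latin-1")
utf8_bytes = as_text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8").encode("latin-1")


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text exists in the target encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(misread == original)            # wrong code page changes the text
print(round_trip == latin1_bytes)     # legacy -> Unicode -> legacy
print(can_encode("\u0905", "latin-1"))  # Devanagari A in a Latin encoding
```

The last check shows the limit of converting in the other direction: a legacy encoding can only hold the characters it defines.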