Unicode is a universal character encoding standard that provides a single, comprehensive character set. It is maintained by the Unicode Consortium, a group whose members include multilingual software manufacturers. Unicode simplifies software localization and improves multilingual text processing, and it overcomes the limited character repertoire of ASCII and extended ASCII. Unicode standardizes script behavior, which allows any combination of characters, drawn from any combination of scripts and languages, to co-exist in a single document. Unicode was originally a 2-byte (16-bit) character set. Later versions extend the code space beyond 16 bits, up to U+10FFFF, while remaining compatible with ASCII: the first 128 code points are identical to ASCII, and the first 256 match ISO 8859-1 (Latin-1), one common extended ASCII set. Unicode's single character set can be stored in several encodings, including UTF-7, UTF-8, UTF-16, and UTF-32. Conversion of data among these encodings is lossless, because they all support encoding the same set of characters.
- UTF-8 uses 1 to 4 bytes per character, depending on the character: ASCII characters take only 1 byte, while unusual characters outside the Basic Multilingual Plane take 4.
- UTF-16 uses 2 bytes for most characters, while very unusual characters (those outside the Basic Multilingual Plane) take 4, stored as a surrogate pair.
- UTF-32 uses 4 bytes per character, so the number of characters (code points) in a UTF-32 string can be calculated from the byte count alone, by dividing it by 4.
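The byte counts listed above can be checked with Python's built-in codecs. This is a minimal sketch, not a definitive tool: the helper names `encoded_sizes` and `utf32_length` are made up for illustration, and the little-endian codec variants are used only so that Python does not prepend a byte order mark to the output.

```python
def encoded_sizes(ch):
    """Return the number of bytes one character needs in each encoding."""
    # The "-le" codecs are chosen so no byte order mark is added to the count.
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


def utf32_length(data):
    """Count code points in BOM-less UTF-32 data: each one is exactly 4 bytes."""
    return len(data) // 4


# An ASCII letter, an accented Latin letter, a Devanagari letter, and an emoji.
samples = ["A", "\u00e9", "\u0905", "\U0001F600"]
for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

# Converting between the encodings is lossless: the text round-trips unchanged.
text = "".join(samples)
as_utf16 = text.encode("utf-8").decode("utf-8").encode("utf-16-le")
round_trip = as_utf16.decode("utf-16-le").encode("utf-32-le").decode("utf-32-le")
print(round_trip == text, utf32_length(text.encode("utf-32-le")))
```

Note that `utf32_length` counts code points, which is not always the same as the number of characters a reader perceives, since one visible character can be built from several code points.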
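Code points are written in hexadecimal. A 32-bit notation, U-XXXXXXXX, allows numbering from U-00000000 to U-FFFFFFFF, but Unicode itself only assigns code points from U+0000 to U+10FFFF. Unicode divides the available code space into planes. A plane is a contiguous group of 65,536 code points. In the 32-bit form, the most significant 16 bits identify the plane, and the least significant 16 bits identify one of up to 65,536 characters or symbols within it. Sixteen bits could number 65,536 planes, but the U+10FFFF limit means that only 17 planes (0000 to 0010) exist. Types of Plane –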
- Basic multilingual plane (BMP) – Plane 0000, the basic multilingual plane, is designed to be compatible with the previous 16-bit Unicode. The most significant 16 bits in this plane are all zeroes. It mostly defines the characters of scripts in current use, along with some control and special characters. A code point in this plane is written as U+XXXX, where XXXX is the least significant 16 bits, e.g., U+0900 to U+097F is reserved for Devanagari, U+0980 to U+09FF for Bengali, U+2200 to U+22FF for mathematical operators, etc.
- Supplementary multilingual plane (SMP) – Plane 0001, the supplementary multilingual plane, is designed to provide more codes for those multilingual characters that are not included in the BMP. Example: U+10140 to U+1018F is reserved for Ancient Greek Numbers.
- Supplementary ideographic plane (SIP) – Plane 0002, the supplementary ideographic plane, is designed to provide codes for ideographic symbols, symbols that represent an idea in contrast to a sound, e.g., U+20000 to U+2A6DF is reserved for CJK Unified Ideographs Extension B.
- Supplementary special-purpose plane (SSP) – Plane 000E, the supplementary special-purpose plane, is used for special characters, e.g., U+E0000 to U+E007F is reserved for tags.
- Private use planes (PUPs) – Planes 000F and 0010, the private use planes, carry no standard character assignments. They are used, for example, by fonts internally to refer to auxiliary glyphs.
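Because a plane holds 65,536 (2^16) code points, the plane number of a code point is simply its value shifted right by 16 bits. The sketch below illustrates this with the example ranges from the list above. The names `PLANE_NAMES`, `plane_of`, and `describe` are hypothetical helpers written for this illustration, and the code does not check whether a code point is actually assigned to a character.

```python
# Plane names as described in the list above, keyed by plane number.
PLANE_NAMES = {
    0x0: "Basic Multilingual Plane (BMP)",
    0x1: "Supplementary Multilingual Plane (SMP)",
    0x2: "Supplementary Ideographic Plane (SIP)",
    0xE: "Supplementary Special-purpose Plane (SSP)",
    0xF: "Private Use Plane",
    0x10: "Private Use Plane",
}


def plane_of(code_point):
    """Return the plane number: everything above the low 16 bits."""
    if code_point < 0:
        raise ValueError("code points are non-negative")
    return code_point >> 16


def describe(code_point):
    """Format a code point together with its plane, for display."""
    plane = plane_of(code_point)
    name = PLANE_NAMES.get(plane, "plane not covered in this article")
    return f"U+{code_point:04X} -> plane {plane:04X}: {name}"


# Devanagari, Ancient Greek Numbers, CJK Extension B, tags, private use.
for cp in (0x0905, 0x10140, 0x20000, 0xE0001, 0x10FFFD):
    print(describe(cp))
```

For a character held in a Python string, `plane_of(ord(ch))` gives the same result, since `ord` returns the code point as an integer.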
Advantages of Unicode –
Universal character set: Unicode supports almost all the characters and symbols used in the world’s writing systems, making it a universal character set that can be used to represent text in any language.
Interoperability: Unicode provides interoperability between different computing systems, platforms, and software applications. This means that text encoded in Unicode can be exchanged and displayed correctly across different systems, regardless of the language or script used.
Compatibility: Unicode is compatible with all the major computing platforms, including Windows, macOS, Linux, and mobile devices. This makes it easy to share and display text across different devices and platforms.
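Efficient storage: Unicode offers a choice of encodings with different storage trade-offs. UTF-8 is a variable-length encoding that stores ASCII text in 1 byte per character, so such text is no larger than it would be in plain ASCII. UTF-32 is a fixed-length encoding at 4 bytes per character, which simplifies indexing at the cost of space.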
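How much storage a piece of text needs depends on which Unicode encoding is chosen and on which scripts the text uses. The following sketch compares the encoded size of an English string and a Devanagari string. The function name `size_report` and both sample strings are assumptions chosen for illustration.

```python
def size_report(text):
    """Return the encoded size in bytes of text for each encoding (no BOM)."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}


english = "Hello, world"  # 12 ASCII characters
# "namaste" written in Devanagari: 6 code points, all in the BMP.
hindi = "\u0928\u092e\u0938\u094d\u0924\u0947"

print(len(english), size_report(english))
print(len(hindi), size_report(hindi))
```

In this sample, UTF-8 is the smallest encoding for the ASCII string, while UTF-16 is the smallest for the Devanagari string, so no single encoding is the most compact for every text.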
Disadvantages of Unicode –
Complexity: Unicode is a complex encoding standard that can be difficult to implement and use correctly. It requires a significant amount of knowledge and expertise to correctly encode, store, and display text in Unicode.
Compatibility issues with legacy systems: Some legacy systems and software applications may not support Unicode or may not display Unicode characters correctly. This can cause compatibility issues when exchanging text across different systems.
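Large character set: Unicode’s large character set can be a disadvantage in some applications, where only a small subset of characters is needed. Because characters outside ASCII need 2 to 4 bytes in UTF-8, text can take more space than it would in a legacy single-byte encoding that covers only the needed subset, which results in larger file sizes and increased memory usage.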
Localization: While Unicode supports most of the world’s writing systems, it may not be sufficient for some localization requirements, such as the need for specialized symbols or characters that are unique to a particular language or culture.
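The legacy compatibility issue described above can be reproduced in a few lines. The sketch below assumes a hypothetical legacy system that reads every byte as Latin-1, and another that accepts only ASCII. The sample word and the helper name `to_legacy_ascii` are chosen purely for illustration.

```python
original = "caf\u00e9"  # "cafe" with an acute accent on the final e

# A Unicode-aware sender writes the text as UTF-8 bytes.
utf8_bytes = original.encode("utf-8")

# A legacy reader that assumes Latin-1 turns the 2-byte sequence for the
# accented letter into two unrelated characters (often called mojibake).
misread = utf8_bytes.decode("latin-1")
print(misread)

# The damage is reversible only if the wrong decoding step is known.
repaired = misread.encode("latin-1").decode("utf-8")
print(repaired == original)


def to_legacy_ascii(text):
    """Encode for an ASCII-only system, replacing unsupported characters."""
    # errors="replace" substitutes "?" rather than raising UnicodeEncodeError.
    return text.encode("ascii", errors="replace")


print(to_legacy_ascii(original))
```

Replacing unsupported characters loses information, so this kind of fallback is only suitable when the receiving system truly cannot handle anything beyond ASCII.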
Reference – Unicode – msdn.microsoft; Data Communication and Networking – Forouzan. This article is contributed by Himanshi.