Transliterating non-ASCII characters with Python
Transliteration is the process of writing a word from one language using similarly pronounced letters from the alphabet of another language. It is concerned with how words are pronounced rather than with what they mean. A similar problem arises in computing: many programs and systems handle ASCII characters well but have trouble with non-ASCII characters. Simply dropping the non-ASCII characters is often not acceptable, because it loses information. What is needed is a way to read non-ASCII characters and express them as ASCII text.
The first approach uses the unidecode library. It is a third-party package (installable with pip install Unidecode), not part of the Python standard library. It provides an unidecode() function that takes Unicode text and tries to represent it in ASCII. The function works character by character using lookup tables, so it needs no information about the language or script of the input, and the result is context-free. It accepts a Unicode string and returns the transliteration as a string.
- Import the unidecode library
- Call the unidecode() function with the input text
Sample output of unidecode() for inputs in several scripts:
kozuscek aap niice apnii bhaassaa aur inputt upkrnn cuneN aur likhnaa aarNbh kreN Xie Xie Ni arigatou. ulybat'sia Vladimir Putin
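The two steps of the unidecode approach can be sketched as follows. This is a minimal illustration, not the article's original code: the sample strings and the helper name transliterate_all are hypothetical, and the guarded import is only there so the sketch still loads when the third-party package is not installed.

```python
# Unidecode is a third-party package: pip install Unidecode
try:
    from unidecode import unidecode
except ImportError:
    # Assumption for this sketch: degrade gracefully if the package is absent.
    unidecode = None

# Hypothetical sample inputs in Latin-with-diacritics, Cyrillic and kana.
samples = ["kožušček", "Владимир Путин", "ありがとう。"]


def transliterate_all(texts):
    """Return the ASCII transliteration of each string in texts."""
    if unidecode is None:
        raise RuntimeError("Install the package first: pip install Unidecode")
    return [unidecode(text) for text in texts]


if unidecode is not None:
    for original, ascii_text in zip(samples, transliterate_all(samples)):
        print(f"{original} -> {ascii_text}")
else:
    print("unidecode is not installed; run: pip install Unidecode")
```

Because unidecode() maps each character independently of its neighbours, the result is a rough ASCII approximation rather than a linguistically exact romanization.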
The second approach builds a lookup structure for transliteration by hand. The Unicode value of each non-ASCII character is mapped to a related ASCII representation. Such mappings can be taken from published tables that list scripts and give, for each letter in a script, its Unicode value together with a representable ASCII equivalent. Wikipedia also has a collection of Unicode values and their respective transliterations.
- Create a dictionary with Unicode values as keys and ASCII representations as values
- Transliterate each letter in the text using that dictionary
For example, to transliterate Russian, take all the letters of the Cyrillic script, since Russian is written in Cyrillic. For each letter, pair its Unicode value with its ASCII representation to build a dictionary. That dictionary is then used to transliterate the given text.
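A minimal sketch of the Cyrillic case is given here. The romanization table is an assumption made for illustration: it is deliberately partial and does not follow any particular standard, and the names CYRILLIC_LOWER, build_table and to_ascii are hypothetical. The table is keyed by Unicode code point, and uppercase entries are derived from the lowercase ones.

```python
# Assumed, partial romanization table (lowercase Cyrillic letter -> ASCII).
CYRILLIC_LOWER = {
    "а": "a", "в": "v", "д": "d", "ж": "zh", "и": "i", "л": "l", "м": "m",
    "н": "n", "п": "p", "р": "r", "т": "t", "у": "u", "ш": "sh",
}


def build_table(lower_map):
    """Return a {code point: ASCII string} table covering both letter cases."""
    table = {}
    for letter, ascii_form in lower_map.items():
        table[ord(letter)] = ascii_form
        # Uppercase letter maps to the capitalized form, e.g. "Ш" -> "Sh".
        table[ord(letter.upper())] = ascii_form.capitalize()
    return table


CYRILLIC_TABLE = build_table(CYRILLIC_LOWER)


def to_ascii(text):
    """Transliterate text; characters missing from the table are kept as-is."""
    return text.translate(CYRILLIC_TABLE)


print(to_ascii("Владимир Путин"))  # Vladimir Putin
```

Characters that are not in the table pass through unchanged, which makes gaps in a partial table easy to spot in the output.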
As another example, a dictionary can be created for some letters of the Devanagari script and then applied to a short text.
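The Devanagari example can be sketched as below. This is a reconstruction under stated assumptions, not the original code: the input sentence, the choice of uppercase ASCII forms, and the names DEVANAGARI_TO_ASCII and transliterate are hypothetical, and only the letters needed for the sample sentence are mapped.

```python
# Assumed, partial mapping: Devanagari code point -> ASCII representation.
DEVANAGARI_TO_ASCII = {
    0x0902: "n",   # sign anusvara
    0x0905: "A",   # letter A
    0x0906: "AA",  # letter AA
    0x0907: "I",   # letter I
    0x0909: "U",   # letter U
    0x0914: "AU",  # letter AU
    0x0915: "K",   # letter KA
    0x0916: "KH",  # letter KHA
    0x091A: "C",   # letter CA
    0x091F: "TT",  # letter TTA
    0x0923: "NN",  # letter NNA
    0x0928: "N",   # letter NA
    0x092A: "P",   # letter PA
    0x092D: "BH",  # letter BHA
    0x0930: "R",   # letter RA
    0x0932: "L",   # letter LA
    0x0937: "SS",  # letter SSA
    0x093E: "AA",  # vowel sign AA
    0x093F: "I",   # vowel sign I
    0x0940: "II",  # vowel sign II
    0x0941: "U",   # vowel sign U
    0x0947: "E",   # vowel sign E
}


def transliterate(text, table):
    """Replace each character found in table; leave all others unchanged."""
    # str.translate accepts a dict mapping code points to replacement strings.
    return text.translate(table)


# Hypothetical sample sentence written in Devanagari.
sample = "आप नीचे अपनी भाषा और इनपुट उपकरण चुनें और लिखना आरंभ करें"
print(transliterate(sample, DEVANAGARI_TO_ASCII))
```

A plain dictionary lookup like this ignores the inherent vowel of Devanagari consonants, so the result is a letter-by-letter approximation rather than a phonetic romanization.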
Output:
AAP NIICE APNII BHAASSAA AUR INPUTT UPKRNN CUNEn AUR LIKHNAA AARnBH KREn