
Transliterating non-ASCII characters with Python

Transliteration is the process of writing the words of one language using similarly pronounced letters of another alphabet. It is concerned with how words are pronounced, not with what they mean. A similar need arises in computing: many systems handle ASCII characters reliably but have problems with non-ASCII characters. Simply skipping the non-ASCII characters is often not an option, because it can lead to loss of information. What is needed is a way to read non-ASCII characters and express them as text in ASCII characters.
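The loss of information mentioned above can be seen with Python's built-in string encoding. The following is a minimal sketch, assuming the sample word "kožušček" (the same word used in the first example of this article); it only illustrates the problem and is not a transliteration method.

```python
# Dropping non-ASCII characters during encoding silently loses information.
word = "ko\u017eu\u0161\u010dek"  # "kožušček"

# errors="ignore" discards every character that has no ASCII equivalent
dropped = word.encode("ascii", errors="ignore").decode("ascii")
print(dropped)   # kouek

# errors="replace" keeps the length but puts "?" in place of each such character
replaced = word.encode("ascii", errors="replace").decode("ascii")
print(replaced)  # ko?u??ek
```

In both cases three of the eight letters are lost, which is what transliteration tries to avoid.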

Approach 1:



This approach uses the unidecode library. It is a third-party package, not part of the standard library, and can be installed with pip install unidecode. It provides an unidecode() function that takes Unicode text and tries to represent it in ASCII. The function maps each character independently using lookup tables, so it does not need to be told which script the text is written in. It accepts a Unicode string and returns the transliteration as a string.

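Steps:

Install the package with pip install unidecode.
Import the unidecode function from the unidecode module.
Pass the Unicode string to unidecode() and print the returned ASCII string.
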
Example:




# Import the unidecode function from the unidecode package
# (third-party: pip install unidecode)
from unidecode import unidecode
 
# Get transliteration for following
# non-ASCII text (Latin script with diacritics)
print(unidecode('ko\u017eu\u0161\u010dek'))
 
# Get transliteration for following
# non-ASCII text (Devanagari)
print(unidecode("आप नीचे अपनी भाषा और इनपुट उपकरण चुनें और लिखना आरंभ करें"))
 
# Get transliteration for following
# non-ASCII text (Chinese)
print(unidecode("谢谢你"))
 
# Get transliteration for following
# non-ASCII text (Japanese)
print(unidecode("ありがとう。"))
 
# Get transliteration for following
# non-ASCII text (Russian)
print(unidecode("улыбаться Владимир Путин"))

Output:

kozuscek
aap niice apnii bhaassaa aur inputt upkrnn cuneN aur likhnaa aarNbh kreN
Xie Xie Ni
arigatou.
ulybat'sia Vladimir Putin

Approach 2: 

This approach builds a lookup structure for transliteration. The Unicode values of the non-ASCII characters are paired with related ASCII values. Reference lists of scripts give the Unicode value of each letter in a script together with a representable ASCII character. Wikipedia also has a collection of Unicode values and their respective transliterations.

Steps:

For example, to transliterate Russian, take all the letters of the Cyrillic script, since Russian is written in Cyrillic. For each letter, its Unicode value and its ASCII representation are used to build a dictionary. This dictionary is then used to transliterate the given text.
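The Cyrillic case can be sketched as follows. This is only an illustration: the table name cyrillic_translit_table, the helper transliterate_cyrillic, and the ASCII values chosen for each letter are assumptions made here, and the table covers just a handful of letters rather than the whole script. It relies on the built-in str.translate() method, which accepts a dictionary keyed by Unicode code points.

```python
# Hypothetical, deliberately incomplete mapping for a few Cyrillic letters.
# Keys are Unicode code points (integers), as str.translate() expects;
# values are ASCII strings chosen here purely for illustration.
cyrillic_translit_table = {
    0x0412: "V",  # В
    0x0430: "a",  # а
    0x0432: "v",  # в
    0x0434: "d",  # д
    0x0438: "i",  # и
    0x043B: "l",  # л
    0x043C: "m",  # м
    0x0440: "r",  # р
}

def transliterate_cyrillic(text):
    # Characters without an entry in the table are left unchanged
    return text.translate(cyrillic_translit_table)

print(transliterate_cyrillic("Владимир"))  # Vladimir
```

A complete table would need one entry for every letter of the script, in both upper and lower case.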

As an illustration, the example below takes some letters of the Devanagari script and builds a dictionary from them. The dictionary is then used to transliterate a short text.

Example:




# Create devanagari transliteration dictionary
devanagari_translit_dict = {
    '\u0905': 'A', '\u0906': 'AA', '\u0907': 'I', '\u0908': 'II',
    '\u0909': 'U', '\u090A': 'UU', '\u090F': 'E', '\u0910': 'AI',
    '\u0913': 'O', '\u0914': 'AU', '\u0915': 'K', '\u0916': 'KH',
    '\u0917': 'G', '\u0918': 'GH', '\u0919': 'NG', '\u091A': 'C',
    '\u091B': 'CH', '\u091C': 'J', '\u091D': 'JH', '\u091E': 'NY',
    '\u091F': 'TT', '\u0920': 'TTH', '\u0921': 'DD', '\u0922': 'DDH',
    '\u0923': 'NN', '\u0924': 'T', '\u0925': 'TH', '\u0926': 'D',
    '\u0927': 'DH', '\u0928': 'N', '\u092A': 'P', '\u092B': 'PH',
    '\u092C': 'B', '\u092D': 'BH', '\u092E': 'M', '\u092F': 'Y',
    '\u0930': 'R', '\u0932': 'L', '\u0933': 'LL', '\u0935': 'V',
    '\u0936': 'SH', '\u0937': 'SS', '\u0938': 'S', '\u0939': 'H',
    '\u093E': 'AA', '\u093F': 'I', '\u0940': 'II', '\u0941': 'U',
    '\u0942': 'UU', '\u0947': 'E', '\u0948': 'AI', '\u094B': 'O',
    '\u094C': 'AU', '\u094D': '', '\u0902': 'n'}
 
# Define a function that transliterates text using the given dictionary
def transliterate(text, translit_dict):
    # Replace each character that has a dictionary entry;
    # characters without an entry are kept unchanged
    return ''.join(translit_dict.get(letter, letter) for letter in text)
 
# Input text in devanagari
text = "आप नीचे अपनी भाषा और इनपुट उपकरण चुनें और लिखना आरंभ करें"
 
# Obtain Transliterated text for given input text
transliterated_text = transliterate(text, devanagari_translit_dict)
print(transliterated_text)

Output:

AAP NIICE APNII BHAASSAA AUR INPUTT UPKRNN CUNEn AUR LIKHNAA AARnBH KREn
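Because the dictionary covers only some letters, any character without an entry passes through unchanged. A small check like the one below can list such characters before transliterating. It is a sketch under assumptions: the helper find_unmapped and the three-entry sample_translit_dict are hypothetical names introduced here, and the dictionary is intentionally tiny.

```python
import unicodedata

# Hypothetical subset of a transliteration dictionary, for illustration only
sample_translit_dict = {"\u0915": "K", "\u0930": "R", "\u0947": "E"}

def find_unmapped(text, translit_dict):
    """Return the non-ASCII characters in text that have no dictionary entry."""
    missing = []
    for ch in text:
        if ord(ch) > 127 and ch not in translit_dict and ch not in missing:
            missing.append(ch)
    return missing

# "\u0915\u0930\u0947\u0902" is the last word of the input text above
for ch in find_unmapped("\u0915\u0930\u0947\u0902", sample_translit_dict):
    # unicodedata.name() returns the official Unicode character name
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}")
```

Each reported character is a candidate for a new dictionary entry.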
