Python | Character Encoding

Detecting text that uses a nonstandard character encoding is a very common step in text processing.
Ideally, all text would be encoded as UTF-8 or ASCII, but this is not always the case. When the encoding is not known, it has to be detected and the text converted to a standard encoding before it is processed further.
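
As a minimal illustration of why the encoding matters, the sketch below uses only the standard library and decodes the same bytes with two different codecs. The sample string is an arbitrary choice made for this illustration.

```python
# The same bytes produce different text depending on the codec used to decode them.
text = "caf\u00e9"                   # "cafe" with an accented e; sample string for illustration
raw = text.encode("utf-8")
print(raw)                           # b'caf\xc3\xa9'

as_utf8 = raw.decode("utf-8")        # correct codec: round trip succeeds
as_latin1 = raw.decode("latin-1")    # wrong codec: no error is raised, but the text is garbled

print(as_utf8 == text)               # True
print(as_latin1 == text)             # False
print(len(as_utf8), len(as_latin1))  # 4 5
```

Decoding with the wrong codec often fails silently, which is why the encoding should be detected rather than assumed.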

Charade Installation :
To detect and convert encodings, charade, a Python library, is required. It can be installed using sudo easy_install charade or pip install charade.

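Let’s see the wrapper functions around the charade module. The example further below imports them from a module named encoding, so they are assumed to be saved in a file called encoding.py.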

Code : encoding.detect(string), to detect the encoding

# -*- coding: utf-8 -*-

import charade


def detect(s):
    try:
        # charade.detect() expects bytes, so encode text strings first
        if isinstance(s, str):
            return charade.detect(s.encode())
        # anything else is assumed to be bytes already
        else:
            return charade.detect(s)

    # if the implicit encoding step fails,
    # encode explicitly with 'utf-8'
    except UnicodeDecodeError:
        return charade.detect(s.encode('utf-8'))


The detect() function returns a dictionary with two keys :

confidence : charade's estimated probability (between 0 and 1) that the detected encoding is correct.
encoding   : the name of the detected encoding.
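
The two keys can be combined to decide whether a guess is trustworthy. The following is a minimal sketch, not part of charade: pick_encoding, min_confidence and fallback are hypothetical names, the threshold of 0.5 is an arbitrary assumption, and the result dictionaries are written by hand so the sketch runs without charade installed. It also assumes the detector may report None as the encoding when it cannot make a guess.

```python
def pick_encoding(result, min_confidence=0.5, fallback="utf-8"):
    """Return the detected encoding, or `fallback` when the guess is weak.

    `result` is a dict shaped like the detector output described above:
    {'confidence': float, 'encoding': str or None}.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if encoding is None or confidence < min_confidence:
        return fallback
    return encoding


# Hand-written result dicts (assumed shapes), so no detector is needed here.
strong = {"confidence": 1.0, "encoding": "ascii"}
weak = {"confidence": 0.2, "encoding": "windows-1252"}
unknown = {"confidence": 0.0, "encoding": None}

print(pick_encoding(strong))   # ascii
print(pick_encoding(weak))     # utf-8
print(pick_encoding(unknown))  # utf-8
```

A suitable threshold depends on the data; short strings tend to give low confidence values even when the guess is right.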

Code : encoding.convert(string) to convert the encoding.

# -*- coding: utf-8 -*-
import charade


def convert(s):

    # charade works on bytes, so encode text strings first
    if isinstance(s, str):
        s = s.encode()

    # retrieve the encoding name from the output of detect(),
    # the wrapper defined above in the same module
    encode = detect(s)['encoding']

    if encode == 'utf-8':
        return s.decode()
    else:
        # decode with the detected encoding
        return s.decode(encode)


Code : Example

# importing the encoding.py module defined above
import encoding

# a string containing one non-ASCII character
d1 = encoding.detect('geék')
print("d1 is encoded as : ", d1)

# a pure ASCII string
d2 = encoding.detect('ascii')
print("d2 is encoded as : ", d2)


Output :

d1 is encoded as :  {'confidence': 0.505, 'encoding': 'utf-8'}
d2 is encoded as :  {'confidence': 1.0, 'encoding': 'ascii'}

detect() : It is a wrapper around charade.detect(). charade.detect() expects a bytes object, so the wrapper encodes text strings before detection and handles UnicodeDecodeError exceptions.

convert() : It is a helper built on detect(); charade itself provides no convert() function. It calls detect() first to get the encoding and then returns the decoded string.
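
The conversion step can fail if the detector reports no encoding or guesses wrongly. The following is a rough sketch of a more defensive variant, not the article's convert(): convert_with, fake_detect and the fallback parameter are hypothetical names, and the detector is passed in as an argument so the sketch runs without charade installed. Unlike convert(), it returns text strings unchanged instead of encoding and decoding them again.

```python
def convert_with(detector, data, fallback="utf-8"):
    """Decode `data` to str using the encoding reported by `detector`.

    `detector` is any callable that takes bytes and returns a dict with an
    'encoding' key, for example the detect() wrapper described above.
    """
    if isinstance(data, str):
        return data  # already text, nothing to decode
    encoding = detector(data).get("encoding") or fallback
    try:
        return data.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or a wrong guess: decode with the fallback and
        # substitute U+FFFD for bytes that cannot be decoded.
        return data.decode(fallback, errors="replace")


# Stand-in detector (an assumption for this sketch) that always answers latin-1.
def fake_detect(data):
    return {"confidence": 1.0, "encoding": "latin-1"}


print(convert_with(fake_detect, b"caf\xe9") == "caf\u00e9")  # True
print(convert_with(lambda b: {"encoding": None}, b"geek"))   # geek
```

In real use the first argument would be the detect() wrapper from encoding.py rather than the stand-in.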


