Skip to content
Related Articles

Related Articles

Python | Character Encoding

View Discussion
Improve Article
Save Article
  • Last Updated : 29 May, 2021
View Discussion
Improve Article
Save Article

Finding the text which is having nonstandard character encoding is a very common step to perform in text processing. 
All the text would have been from utf-8 or ASCII encoding ideally but this might not be the case always. So, in such cases when the encoding is not known, such non-encoded text has to be detected and the be converted to a standard encoding. So, this step is important before processing the text further. 
Charade Installation : 
For performing the detection and conversion of encoding, charade – a Python library is required. This module can be simply installed using sudo easy_install charade or pip install charade. 
Let’s see the wrapper function around the charade module. 
Code : encoding.detect(string), to detect the encoding 


# -*- coding: utf-8 -*-
import charade
def detect(s):
        # check it in the charade list
        if isinstance(s, str):
            return charade.detect(s.encode())
        # detecting the string
            return charade.detect(s)
    # in case of error
    # encode with 'utf -8' encoding
    except UnicodeDecodeError:
        return charade.detect(s.encode('utf-8'))

The detect functions will return 2 attributes : 

Confidence : the probability of charade being correct.
Encoding   : which encoding it is. 

Code : encoding.convert(string) to convert the encoding.


# -*- coding: utf-8 -*-
import charade
def convert(s):
    # if in the charade instance
    if isinstance(s, str):
        s = s.encode()
    # retrieving the encoding information
    # from the detect() output
    encode = detect(s)['encoding']
    if encode == 'utf-8':
        return s.decode()
        return s.decode(encoding)

Code : Example 


# importing library
import encoding
d1  = encoding.detect('geek')
print ("d1 is encoded as  : ", d1)
d2  = encoding.detect('ascii')
print ("d2 is encoded as  : ", d2)

Output : 

d1 is encoded as : (confidence': 0.505, 'encoding': 'utf-8')
d2 is encoded as : ('confidence': 1.0, 'encoding': 'ascii')

detect() : It is a charade.detect() wrapper. It encodes the strings and handles the UnicodeDecodeError exceptions. It expects a bytes object so therefore the string is encoded before trying to detect the encoding.
convert() : It is a charade.convert() wrapper. It calls detect() first to get the encoding. Then, it returns a decoded string.

My Personal Notes arrow_drop_up
Recommended Articles
Page :

Start Your Coding Journey Now!