Detecting text that uses a nonstandard character encoding is a common step in text processing.
Ideally all text would be encoded as UTF-8 or ASCII, but this is not always the case. When the encoding is unknown, it has to be detected first, and the text then has to be converted to a standard encoding before it is processed any further.
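To make the problem concrete, here is a minimal sketch that uses only Python's built-in str.encode() and bytes.decode(). The sample word is an arbitrary choice for illustration, not data from any real corpus.

```python
# Sample text chosen purely for illustration.
text = "caf\u00e9"  # "café"

utf8_bytes = text.encode("utf-8")      # b'caf\xc3\xa9'
latin1_bytes = text.encode("latin-1")  # b'caf\xe9'

# Decoding UTF-8 bytes as Latin-1 succeeds silently but produces garbled text.
garbled = utf8_bytes.decode("latin-1")
print(garbled)  # cafÃ©

# Decoding Latin-1 bytes as UTF-8 fails outright.
try:
    latin1_bytes.decode("utf-8")
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True
print(strict_decode_failed)  # True
```

In both directions the bytes are perfectly valid; only the assumed encoding is wrong, which is why the encoding should be detected before decoding.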
Charade Installation :
Detecting and converting the encoding requires charade, a Python library. It can be installed with sudo easy_install charade or pip install charade.
Let’s look at the wrapper functions around the charade module. They are assumed to be saved in a file named encoding.py, so that they can be imported as encoding.
Code : encoding.detect(string), to detect the encoding
# -*- coding: utf-8 -*-
import charade

def detect(s):
    try:
        # charade.detect() works on bytes, so a str is encoded first
        if isinstance(s, str):
            return charade.detect(s.encode())
        else:
            # already a bytes object: detect directly
            return charade.detect(s)
    # in case of error, encode explicitly with 'utf-8' and retry
    except UnicodeDecodeError:
        return charade.detect(s.encode('utf-8'))
The detect() function returns a dictionary with two keys:
confidence : the probability (between 0 and 1) that charade's guess is correct.
encoding : the name of the detected encoding.
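Since the result carries a confidence value, a caller can decide whether to trust the guess. The following is a minimal sketch under stated assumptions: is_trusted() and its 0.8 threshold are hypothetical and are not part of charade, and the dictionaries are hand-written samples in the shape described above, so the snippet runs without charade installed.

```python
def is_trusted(result, threshold=0.8):
    """Return True if the result names an encoding with enough confidence.

    `threshold` is an arbitrary cut-off chosen for illustration.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    return encoding is not None and confidence >= threshold

# Hand-written sample results, shaped like the dictionary detect() returns.
print(is_trusted({"confidence": 1.0, "encoding": "ascii"}))    # True
print(is_trusted({"confidence": 0.505, "encoding": "utf-8"}))  # False
print(is_trusted({"confidence": 0.0, "encoding": None}))       # False
```

A suitable threshold depends on the data; short strings in particular tend to yield low confidence values.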
Code : encoding.convert(string) to convert the encoding.
# -*- coding: utf-8 -*-
import charade

def convert(s):
    # charade works on bytes, so a str is encoded first
    if isinstance(s, str):
        s = s.encode()
    # retrieve the encoding name from the detect() output
    encoding = detect(s)['encoding']
    if encoding == 'utf-8':
        return s.decode()
    else:
        return s.decode(encoding)
Code : Example
# importing the wrapper module
import encoding

d1 = encoding.detect('geek')
print("d1 is encoded as : ", d1)

d2 = encoding.detect('ascii')
print("d2 is encoded as : ", d2)
Output :
d1 is encoded as : {'confidence': 1.0, 'encoding': 'ascii'}
d2 is encoded as : {'confidence': 1.0, 'encoding': 'ascii'}
detect() : A wrapper around charade.detect(). Because charade.detect() expects a bytes object, a string argument is encoded before detection. The wrapper also handles UnicodeDecodeError exceptions by retrying with an explicit UTF-8 encoding.
convert() : Built on top of detect(); charade itself does not provide a convert() function. It calls detect() first to get the encoding, then decodes the bytes with that encoding and returns the resulting string.
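A detected encoding is only a guess, and it can be missing or wrong. As a hedged sketch of how the decoding step could be made more forgiving, the hypothetical helper below (safe_decode() and its fallback parameter are assumptions, not part of charade or of the wrapper above) takes the encoding name as an argument, so it runs without charade installed.

```python
def safe_decode(data, encoding=None, fallback="utf-8"):
    """Decode bytes with a detected encoding, falling back if it is missing or wrong."""
    if isinstance(data, str):
        return data  # already text, nothing to decode
    try:
        return data.decode(encoding or fallback)
    except (UnicodeDecodeError, LookupError):
        # Wrong guess or unknown codec name: keep going and
        # substitute U+FFFD for the bytes that cannot be decoded.
        return data.decode(fallback, errors="replace")

print(safe_decode(b"caf\xe9", "latin-1"))      # café
print(safe_decode(b"abc", "no-such-codec"))    # abc
print(repr(safe_decode(b"caf\xe9", None)))     # 'caf\ufffd'
```

In the wrapper module, the encoding argument would come from detect(data)['encoding']. Replacing undecodable bytes loses information, so whether that is acceptable depends on the application.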