Python | Character Encoding
Detecting text that uses a nonstandard character encoding is a common step in text processing.
Ideally, all text would be encoded as UTF-8 or ASCII, but this is not always the case. When the encoding of a piece of text is unknown, it has to be detected and the text converted to a standard encoding before it can be processed further.
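To make the problem concrete, here is a minimal sketch using only the standard library (the sample string is an illustrative assumption). The same bytes either decode to different text or fail outright, depending on which encoding is assumed.

```python
# Illustrative sample text; any string with a non-ASCII character would do.
text = "café"

# The same text produces different bytes under different encodings.
utf8_bytes = text.encode("utf-8")      # b'caf\xc3\xa9'
latin1_bytes = text.encode("latin-1")  # b'caf\xe9'

# Decoding UTF-8 bytes as Latin-1 succeeds silently but yields garbled text.
garbled = utf8_bytes.decode("latin-1")
print(garbled)  # cafÃ©

# Decoding Latin-1 bytes as UTF-8 fails, because 0xE9 starts a multi-byte
# sequence that is never completed.
try:
    latin1_bytes.decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True
print(failed)  # True
```

The silent case is the more dangerous one, which is why the encoding should be detected rather than guessed.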
Charade Installation :
Detecting and converting encodings here relies on charade, a Python library. It can be installed with sudo easy_install charade or pip install charade.
The wrapper functions below are built around the charade module. They are placed in a module named encoding, which the example at the end imports.
Code : encoding.detect(string), to detect the encoding
Python3
import charade

def detect(s):
    try:
        # charade.detect() expects bytes, so encode str input first
        if isinstance(s, str):
            return charade.detect(s.encode())
        else:
            return charade.detect(s)
    except UnicodeDecodeError:
        # retry with an explicit UTF-8 encoding
        return charade.detect(s.encode('utf-8'))
The detect() function returns a dictionary with two keys :
confidence : the estimated probability that charade's guess is correct.
encoding : the name of the detected encoding.
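Because the confidence is only an estimate, callers may want to check it before trusting the reported encoding. A minimal sketch of such a check follows. The helper name `is_reliable` and the 0.6 threshold are hypothetical choices for illustration, not part of charade, and the result dictionaries are written by hand so the sketch runs without charade installed.

```python
def is_reliable(result, threshold=0.6):
    """Return True if a detection result looks trustworthy.

    `result` is a dict shaped like the one detect() returns, with
    'encoding' and 'confidence' keys. The threshold is an arbitrary
    illustrative value, not a charade default.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    return encoding is not None and confidence >= threshold


# Hand-written result dicts, shaped like detect() output.
print(is_reliable({"confidence": 1.0, "encoding": "ascii"}))    # True
print(is_reliable({"confidence": 0.505, "encoding": "utf-8"}))  # False
print(is_reliable({"confidence": 0.0, "encoding": None}))       # False
```

What counts as a sensible threshold depends on the data, so it is best treated as a tunable parameter.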
Code : encoding.convert(string) to convert the encoding.
Python3
import charade

def convert(s):
    # work on bytes so that the result of detect() can be used to decode
    if isinstance(s, str):
        s = s.encode()
    encoding = detect(s)['encoding']
    if encoding == 'utf-8':
        return s.decode()
    else:
        return s.decode(encoding)
Code : Example
Python3
import encoding

# a string with a non-ASCII character, so it is not detected as plain ASCII
d1 = encoding.detect('abcdé')
print("d1 is encoded as :", d1)

d2 = encoding.detect('ascii')
print("d2 is encoded as :", d2)
Output :
d1 is encoded as : {'confidence': 0.505, 'encoding': 'utf-8'}
d2 is encoded as : {'confidence': 1.0, 'encoding': 'ascii'}
detect() : A wrapper around charade.detect(). Since charade.detect() expects a bytes object, a str is encoded before detection, while bytes are passed through unchanged. If a UnicodeDecodeError is raised, the call is retried with an explicit UTF-8 encoding.
convert() : A helper built on detect() rather than a wrapper around a charade function. It calls detect() first to get the encoding, then decodes the bytes with that encoding and returns the resulting string.
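Detection can fail or guess wrong, in which case decoding with the reported encoding may raise an error. One way to guard against that, sketched here with only the standard library, is to try the detected encoding first and then fall back to a list of candidates. The function name `decode_with_fallback` and the fallback list are hypothetical, and the detected encoding is passed in as a plain argument so that the sketch runs without charade installed.

```python
def decode_with_fallback(data, detected=None, fallbacks=("utf-8", "latin-1")):
    """Decode bytes, trying the detected encoding before the fallbacks.

    `detected` would normally come from detect(data)['encoding'] and may
    be None when detection fails. Returns (text, encoding_used).
    """
    candidates = ([detected] if detected else []) + list(fallbacks)
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Latin-1 bytes wrongly reported as ASCII still decode via the fallbacks.
text, used = decode_with_fallback(b"caf\xe9", detected="ascii")
print(text, used)  # café latin-1
```

Note that latin-1 maps every byte to a character, so placing it last means decoding always succeeds, although the result may not be the intended text.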
Last Updated :
29 May, 2021