Identifying text with a nonstandard character encoding is a common step in text processing.
Ideally all text would be encoded as UTF-8 or ASCII, but this is not always the case. When the encoding is unknown, it has to be detected and the text then converted to a standard encoding before any further processing.
Charade Installation :
Detecting and converting the encoding requires charade, a Python library. It can be installed with sudo easy_install charade or pip install charade.
Let’s look at the wrapper functions around the charade module.
Code : encoding.detect(string), to detect the encoding
The detect function returns a dictionary with 2 keys :
confidence : the probability that charade's guess is correct. encoding : the name of the detected encoding.
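The listing for this wrapper is not reproduced here, so the following is only a minimal sketch of what such a detect() function might look like, not the original code. It assumes that charade may not be installed: it falls back first to chardet (which exposes a detect() function of the same shape) and then to a small trial-decoding stand-in. The helper name _fallback_detect and its fixed confidence values are hypothetical and are not part of either library.

```python
# Prefer charade (used in this article); fall back to chardet if only that is installed.
try:
    import charade as detector
except ImportError:
    try:
        import chardet as detector
    except ImportError:
        detector = None  # neither library available


def _fallback_detect(data):
    """Hypothetical stdlib stand-in: trial decoding, not charade's algorithm."""
    for name in ("ascii", "utf-8"):
        try:
            data.decode(name)
            return {"encoding": name, "confidence": 1.0}
        except UnicodeDecodeError:
            continue
    return {"encoding": None, "confidence": 0.0}


def detect(s):
    """Return a dict with 'encoding' and 'confidence' keys for a str or bytes value."""
    if isinstance(s, str):
        # The detector works on bytes, so text is encoded first.
        s = s.encode("utf-8")
    if detector is None:
        return _fallback_detect(s)
    return detector.detect(s)


print(detect("hello"))
```

The exact dictionary printed depends on which detector is available, so no fixed output is shown.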
Code : encoding.convert(string) to convert the encoding.
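As a rough illustration of the conversion step, the sketch below decodes a bytes value with whatever encoding a detector reports. It is an assumption-laden example rather than the article's original listing: convert() here takes the detector as a parameter, and utf8_or_latin1 is a hypothetical stand-in detector (with made-up confidence values) used only so that the block runs without charade installed.

```python
def convert(s, detect):
    """Return s as str, decoding bytes with the encoding that detect() reports."""
    if isinstance(s, str):
        return s  # already decoded text, nothing to convert
    encoding = detect(s)["encoding"]
    if encoding is None:
        raise ValueError("encoding could not be detected")
    return s.decode(encoding)


def utf8_or_latin1(data):
    """Hypothetical stand-in detector: try UTF-8, otherwise assume Latin-1."""
    try:
        data.decode("utf-8")
        return {"encoding": "utf-8", "confidence": 1.0}
    except UnicodeDecodeError:
        return {"encoding": "latin-1", "confidence": 0.5}


# The same word stored in two different encodings decodes to the same text.
print(convert("café".encode("utf-8"), utf8_or_latin1))
print(convert("café".encode("latin-1"), utf8_or_latin1))
```

In real use, the detect argument would be a charade-based wrapper instead of the stand-in.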
Code : Example
d1 is encoded as : {'confidence': 0.505, 'encoding': 'utf-8'}
d2 is encoded as : {'confidence': 1.0, 'encoding': 'ascii'}
detect() : It is a wrapper around charade.detect(). Since charade.detect() expects a bytes object, a string is encoded before detection is attempted, and any UnicodeDecodeError raised along the way is handled.
convert() : It builds on detect(). It calls detect() first to get the encoding, and then returns the string decoded with that encoding.
Improved By : nidhi_biet