Detecting text with a nonstandard character encoding is a common step in text processing.
Ideally, all text would be UTF-8 or ASCII, but that is not always the case. When the encoding is unknown, it has to be detected and the text converted to a standard encoding before any further processing.
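To see why the encoding matters, here is a small illustration that uses only the Python standard library. The sample word is an assumption chosen for the demonstration; it shows that the same bytes decode correctly, silently wrongly, or not at all depending on the codec.

```python
# The same bytes read very differently depending on the codec used to decode them.
raw = "caf\u00e9".encode("utf-8")      # the word "café" as UTF-8 bytes: b'caf\xc3\xa9'

right = raw.decode("utf-8")            # correct codec: the original text comes back
wrong = raw.decode("latin-1")          # wrong codec: no error, but the text is garbled

print(right)
print(wrong)

try:
    raw.decode("ascii")                # ASCII cannot represent bytes >= 0x80
except UnicodeDecodeError as exc:
    print("ascii failed:", exc.reason)
```

The second case is the dangerous one: decoding with the wrong codec raises no error, so the garbled text flows into later processing unnoticed.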
Charade Installation :
Detecting and converting encodings requires charade, a Python library. It can be installed with sudo easy_install charade or pip install charade.
Let's look at the wrapper functions around the charade module.
Code : encoding.detect(string), to detect the encoding
detect() returns a dictionary with two keys :
confidence : the probability that charade's guess is correct.
encoding : the name of the detected encoding.
Code : encoding.convert(string) to convert the encoding.
Code : Example
d1 is encoded as : {'confidence': 0.505, 'encoding': 'utf-8'}
d2 is encoded as : {'confidence': 1.0, 'encoding': 'ascii'}
detect() : A wrapper around charade.detect(). Because charade.detect() expects a bytes object, the wrapper encodes string input before detection and handles any UnicodeDecodeError raised along the way.
convert() : Built on top of detect(). It calls detect() first to get the encoding, then returns the string decoded with that encoding.
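The two wrappers described above could be sketched roughly as follows. This is a minimal sketch, not the article's original code: the `backend` and `fallback` parameters and the `_load_backend` helper are hypothetical names introduced here, and falling back to chardet assumes it exposes the same detect() interface as charade.

```python
def _load_backend():
    """Return a detect(bytes) -> dict callable (hypothetical helper)."""
    try:
        import charade as lib
    except ImportError:
        import chardet as lib  # assumption: same detect() interface as charade
    return lib.detect


def detect(text, backend=None):
    """Guess the encoding of text; returns a dict with 'encoding' and 'confidence'."""
    if isinstance(text, str):
        # The detector works on bytes, so str input is encoded first.
        text = text.encode("utf-8")
    backend = backend or _load_backend()
    return backend(text)


def convert(data, backend=None, fallback="utf-8"):
    """Decode bytes to str using the detected encoding."""
    if isinstance(data, str):
        return data  # already decoded, nothing to convert
    # The detector may report no encoding at all; use the fallback in that case.
    encoding = detect(data, backend)["encoding"] or fallback
    # errors="replace" keeps going on undecodable bytes instead of raising.
    return data.decode(encoding, errors="replace")
```

With charade (or chardet) installed, `detect(b"some bytes")` returns the dictionary described earlier, and `convert(b"some bytes")` returns the decoded string. Passing the detector in as `backend` is a design choice made here so the wrappers can be exercised with a stand-in function when the library is not available.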