String handling in extension modules is a problem. C strings in extensions might not follow the strict Unicode encoding/decoding rules that Python normally expects. Thus, it’s possible that some malformed C data would pass to Python. A good example might be C strings associated with low-level system calls such as filenames. For instance, what happens if a system call returns a broken string back to the interpreter that can’t be properly decoded.
Normally, Unicode errors are often handled by specifying some sort of error policy, such as strict, ignore, replace, or something similar. However, the downside of these policies is that they irreparably destroy the original string content.
For example, if the malformed data in the example was decoded using one of these policies, you would get results as shown below –
Code #1 :
'Spicy Jalapeño' 'Spicy Jalapeño?'
The surrogateescape error handling policies takes all nondecodable bytes and turns them into the low-half of a surrogate pair (\udcXX where XX is the raw byte value).
Code #2 :
Isolated low surrogate characters such as \udcae never appear in valid Unicode. Thus, this string is technically an illegal representation. In fact, if one tries to pass it to functions that perform output, encoding errors will come out.
Code #3 :
Traceback (most recent call last): File "", line 1, in UnicodeEncodeError: 'utf-8' codec can't encode character '\udcae' in position 14: surrogates not allowed
However, the main point of allowing the surrogate escapes is to allow malformed strings to pass from C to Python and back into C without any data loss. When the string is encoded using surrogateescape again, the surrogate characters are turned back into their original bytes. For example:
'Spicy Jalapeño\udcae' b'Spicy Jalape\xc3\xb1o\xae'
As a general rule, it’s probably best to avoid surrogate encoding whenever possible. The code will be much more reliable if it uses proper encodings. However, sometimes there are situations where one simply don’t have control over the data encoding and one aren’t free to ignore or replace the bad data because other functions may need to use it.
As a final note, many of Python’s system-oriented functions, especially those related to filenames, environment variables, and command-line options, use surrogate encoding. For example, if a function such as
os.listdir() is used on a directory containing an undecodable filename, it will be returned as a string with surrogate escapes.
- Python | C Strings of Doubtful Encoding | Set-1
- ML | One Hot Encoding of datasets in Python
- Run Length Encoding in Python
- Python | Character Encoding
- ML | Label Encoding of datasets in Python
- Python | Encoding Decoding using Matrix
- Python | Remove empty strings from list of strings
- Python | Tokenizing strings in list of strings
- C strings conversion to Python
- Python | Interleaving two strings
- Python | How to sort a list of strings
- Python | Similarity metrics of strings
- Python | Split Sublist Strings
- Python Strings decode() method
- Python 3 Strings | expandtabs() method
If you like GeeksforGeeks and would like to contribute, you can also write an article using contribute.geeksforgeeks.org or mail your article to email@example.com. See your article appearing on the GeeksforGeeks main page and help other Geeks.
Please Improve this article if you find anything incorrect by clicking on the "Improve Article" button below.