Python | C Strings of Doubtful Encoding | Set-2

Last Updated : 07 Jun, 2019

String handling in extension modules is a problem. C strings in extensions might not follow the strict Unicode encoding/decoding rules that Python normally expects. Thus, it’s possible that some malformed C data would pass to Python. A good example might be C strings associated with low-level system calls such as filenames. For instance, what happens if a system call returns a broken string back to the interpreter that can’t be properly decoded.

Normally, Unicode errors are often handled by specifying some sort of error policy, such as strict, ignore, replace, or something similar. However, the downside of these policies is that they irreparably destroy the original string content.
For example, if the malformed data in the example was decoded using one of these policies, you would get results as shown below –

Code #1 :

raw = b'Spicy Jalape\xc3\xb1o\xae'
  
print (raw.decode('utf-8', 'ignore')) 
  
print (raw.decode('utf-8', 'replace')) 

Output :

'Spicy Jalapeño'
'Spicy Jalapeño?'

The surrogateescape error handling policies takes all nondecodable bytes and turns them into the low-half of a surrogate pair (\udcXX where XX is the raw byte value).

Code #2 :

print (raw.decode('utf-8', 'surrogateescape'))

Output :

'Spicy Jalapeño\udcae'

Isolated low surrogate characters such as \udcae never appear in valid Unicode. Thus, this string is technically an illegal representation. In fact, if one tries to pass it to functions that perform output, encoding errors will come out.

Code #3 :

s = raw.decode('utf-8', 'surrogateescape') 
print(s) 

Output :

Traceback (most recent call last):
File "", line 1, in 
UnicodeEncodeError: 'utf-8' codec can't encode 
character '\udcae' in position 14: surrogates not allowed

However, the main point of allowing the surrogate escapes is to allow malformed strings to pass from C to Python and back into C without any data loss. When the string is encoded using surrogateescape again, the surrogate characters are turned back into their original bytes. For example:

Code #4:

print (s) 
print(s.encode('utf-8', 'surrogateescape')) 

Output :

'Spicy Jalapeño\udcae'
b'Spicy Jalape\xc3\xb1o\xae'

As a general rule, it’s probably best to avoid surrogate encoding whenever possible. The code will be much more reliable if it uses proper encodings. However, sometimes there are situations where one simply don’t have control over the data encoding and one aren’t free to ignore or replace the bad data because other functions may need to use it.

As a final note, many of Python’s system-oriented functions, especially those related to filenames, environment variables, and command-line options, use surrogate encoding. For example, if a function such as os.listdir() is used on a directory containing an undecodable filename, it will be returned as a string with surrogate escapes.

Suggest improvement

Python | C Strings of Doubtful Encoding | Set-1

Share your thoughts in the comments

Python | C Strings of Doubtful Encoding | Set-2

Please Login to comment...

Similar Reads

What kind of Experience do you want to share?