Python | C Strings of Doubtful Encoding | Set-2

String handling in extension modules is a problem. C strings in extensions might not follow the strict Unicode encoding/decoding rules that Python normally expects. Thus, it’s possible that some malformed C data would pass to Python. A good example might be C strings associated with low-level system calls such as filenames. For instance, what happens if a system call returns a broken string back to the interpreter that can’t be properly decoded.

Normally, Unicode errors are often handled by specifying some sort of error policy, such as strict, ignore, replace, or something similar. However, the downside of these policies is that they irreparably destroy the original string content.
For example, if the malformed data in the example was decoded using one of these policies, you would get results as shown below –

Code #1 :



filter_none

edit
close

play_arrow

link
brightness_4
code

raw = b'Spicy Jalape\xc3\xb1o\xae'
  
print (raw.decode('utf-8', 'ignore'))
  
print (raw.decode('utf-8', 'replace'))

chevron_right


Output :

'Spicy Jalapeño'
'Spicy Jalapeño?'

The surrogateescape error handling policies takes all nondecodable bytes and turns them into the low-half of a surrogate pair (\udcXX where XX is the raw byte value).

Code #2 :

filter_none

edit
close

play_arrow

link
brightness_4
code

print (raw.decode('utf-8', 'surrogateescape'))

chevron_right


Output :

'Spicy Jalapeño\udcae'

Isolated low surrogate characters such as \udcae never appear in valid Unicode. Thus, this string is technically an illegal representation. In fact, if one tries to pass it to functions that perform output, encoding errors will come out.

Code #3 :

filter_none

edit
close

play_arrow

link
brightness_4
code

s = raw.decode('utf-8', 'surrogateescape')
print(s)

chevron_right


Output :

Traceback (most recent call last):
File "", line 1, in 
UnicodeEncodeError: 'utf-8' codec can't encode 
character '\udcae' in position 14: surrogates not allowed

However, the main point of allowing the surrogate escapes is to allow malformed strings to pass from C to Python and back into C without any data loss. When the string is encoded using surrogateescape again, the surrogate characters are turned back into their original bytes. For example:

Code #4:

filter_none

edit
close

play_arrow

link
brightness_4
code

print (s)
print(s.encode('utf-8', 'surrogateescape'))

chevron_right


Output :

'Spicy Jalapeño\udcae'
b'Spicy Jalape\xc3\xb1o\xae'

As a general rule, it’s probably best to avoid surrogate encoding whenever possible. The code will be much more reliable if it uses proper encodings. However, sometimes there are situations where one simply don’t have control over the data encoding and one aren’t free to ignore or replace the bad data because other functions may need to use it.

As a final note, many of Python’s system-oriented functions, especially those related to filenames, environment variables, and command-line options, use surrogate encoding. For example, if a function such as os.listdir() is used on a directory containing an undecodable filename, it will be returned as a string with surrogate escapes.



My Personal Notes arrow_drop_up

Check out this Author's contributed articles.

If you like GeeksforGeeks and would like to contribute, you can also write an article using contribute.geeksforgeeks.org or mail your article to contribute@geeksforgeeks.org. See your article appearing on the GeeksforGeeks main page and help other Geeks.

Please Improve this article if you find anything incorrect by clicking on the "Improve Article" button below.




Article Tags :

Be the First to upvote.


Please write to us at contribute@geeksforgeeks.org to report any issue with the above content.