Passing Unicode Strings to C Libraries
Suppose you are writing an extension module that needs to pass a Python string to a C library function. The question is how to handle Unicode properly. Existing C libraries do not understand Python's native representation of Unicode, so the main challenge is converting the Python string into a form that the C library can use directly.
To illustrate the solution, two C functions are given below. They operate on string data and print it in hexadecimal for debugging and experimentation.
Code #1 : Uses bytes provided in the form char *, int
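A minimal sketch of such a byte-oriented function, assuming it simply prints every byte as a hexadecimal value (which matches the output shown later in this article). The loop and formatting details are illustrative, not necessarily the original listing.

```c
#include <stdio.h>

/* Print each of the len bytes in s as a hexadecimal value.
   The cast to unsigned char keeps bytes >= 0x80 from being
   sign-extended when they are promoted to int. */
void print_chars(char *s, int len) {
    for (int n = 0; n < len; n++) {
        printf("%02x ", (unsigned char) s[n]);
    }
    printf("\n");
}
```

Called on the UTF-8 bytes of a string, the function prints one value per byte, so a non-ASCII character such as ñ shows up as more than one value.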
Code #2 : Uses wide characters in the form wchar_t *, int
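A comparable sketch for the wide-character variant, again assuming the function only prints each element in hexadecimal. The cast to unsigned int is an addition made here so that the printf format matches its argument on every platform.

```c
#include <stdio.h>
#include <wchar.h>

/* Print each of the len wide characters in s as a hexadecimal value.
   Each element holds one wchar_t unit rather than one encoded byte. */
void print_wchars(wchar_t *s, int len) {
    for (int n = 0; n < len; n++) {
        printf("%x ", (unsigned int) s[n]);
    }
    printf("\n");
}
```

Here a character such as ñ is printed as a single value, because it fits in one wchar_t.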
Python strings need to be converted to a suitable byte encoding, such as UTF-8, for the byte-oriented function print_chars(). The code given below is a simple extension function that serves this purpose.
Code #3 :
For library functions that work with the machine-native wchar_t type, the C extension code can be written as follows:
Code #4 :
The output below shows how the extension functions behave when both are called on the same string, 'Spicy Jalapeño'. Observe that the byte-oriented function print_chars() receives UTF-8 encoded data (ñ arrives as the two bytes c3 b1), whereas print_wchars() receives the Unicode code point values (ñ arrives as the single value f1).
Code #5 :
53 70 69 63 79 20 4a 61 6c 61 70 65 c3 b1 6f
53 70 69 63 79 20 4a 61 6c 61 70 65 f1 6f
Consider the nature of the C library being accessed. For many C libraries, it makes more sense to pass bytes instead of a string. The conversion code given below does so.
Code #6 :
If you still want to pass strings, keep in mind that Python 3 uses an adaptable string representation that does not map directly to C libraries using the standard types char * or wchar_t *. Thus, in order to present string data to C, some kind of conversion is almost always necessary. The s# and u# format codes to PyArg_ParseTuple() perform such conversions safely. Note that u# was removed in Python 3.12; on newer versions, PyUnicode_AsWideCharString() provides the wchar_t conversion.
Whenever such a conversion is made, a copy of the converted data is attached to the original string object so that it can be reused later. The output below shows the reported size of the string object growing after each conversion.
Code #7 :
Size : 87
53 70 69 63 79 20 4a 61 6c 61 70 65 c3 b1 6f
Size : 103
53 70 69 63 79 20 4a 61 6c 61 70 65 f1 6f
Size : 163