Passing Unicode Strings to C Libraries
Last Updated : 02 Apr, 2019
Suppose you are writing an extension module that needs to pass a Python string to a C library function, and the question is how to handle Unicode properly. The main issue is that existing C libraries do not understand Python's native representation of Unicode. The challenge is therefore to convert the Python string into a form that C libraries can readily consume.
To illustrate the solution, here are two C functions that operate on string data and print it for debugging and experimentation.
Code #1 : Uses bytes provided in the form char *, int
#include <stdio.h>

/* Print each byte of s as a hexadecimal value. */
void print_chars(char *s, int len)
{
    int n = 0;
    while (n < len)
    {
        printf("%2x ", (unsigned char) s[n]);
        n++;
    }
    printf("\n");
}
Code #2 : Uses wide characters in the form wchar_t *, int
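#include <stdio.h>
#include <wchar.h>

/* Print each wide character of s as a hexadecimal value. */
void print_wchars(wchar_t *s, int len)
{
    int n = 0;
    while (n < len)
    {
        /* Cast so the argument matches %x whatever the width of wchar_t. */
        printf("%x ", (unsigned int) s[n]);
        n++;
    }
    printf("\n");
}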
Python strings need to be converted to a suitable byte encoding, such as UTF-8, for the byte-oriented function print_chars(). The simple extension function below does this: the "s#" format code encodes the string as UTF-8 and yields a pointer to the bytes together with their length.
Code #3 :
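#define PY_SSIZE_T_CLEAN  /* "#" formats use Py_ssize_t; define before Python.h */
#include <Python.h>

static PyObject *py_print_chars(PyObject *self, PyObject *args)
{
    char *s;
    Py_ssize_t len;
    /* "s#" encodes the str as UTF-8 and returns the buffer and its byte length. */
    if (!PyArg_ParseTuple(args, "s#", &s, &len))
    {
        return NULL;
    }
    print_chars(s, (int) len);
    Py_RETURN_NONE;
}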
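As a rough Python-level illustration of what "s#" hands to C, the sketch below encodes the same kind of string by hand. It uses only built-in string methods, and the variable names are illustrative rather than part of the extension. The point it tries to show is that the length passed to print_chars() is a byte count, which can exceed the number of characters.

```python
# Sketch: the data that the "s#" conversion yields, reproduced in pure Python.
s = 'Spicy Jalape\u00f1o'

utf8 = s.encode('utf-8')   # the same encoding that "s#" applies to a str

char_count = len(s)        # number of code points in the str
byte_count = len(utf8)     # number of bytes the C function would receive

print(char_count, byte_count)
```

Any C code that treats the length as a character count would therefore miscount strings that contain non-ASCII characters.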
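For library functions that work with the machine-native wchar_t type, C extension code can be written as shown below. Note that the "u#" format code used here has been deprecated since Python 3.3 and was removed in Python 3.12; on newer versions, PyUnicode_AsWideCharString() is the supported way to obtain a wchar_t buffer.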
Code #4 :
static PyObject *py_print_wchars(PyObject *self, PyObject *args)
{
    wchar_t *s;
    Py_ssize_t len;
    /* "u#" yields a wide-character buffer and its length in characters. */
    if (!PyArg_ParseTuple(args, "u#", &s, &len))
    {
        return NULL;
    }
    print_wchars(s, (int) len);
    Py_RETURN_NONE;
}
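Because wchar_t is a platform-dependent type, it can help to check its width from Python before relying on it. The following sketch uses the standard ctypes module, which is not part of the original extension, to build a wchar_t buffer from a string. It is meant as an illustration of the data layout under that assumption, not as a replacement for the extension function.

```python
import ctypes

s = 'Spicy Jalape\u00f1o'

# Width of the platform's wchar_t in bytes (commonly 4 on Linux/macOS, 2 on Windows).
wchar_width = ctypes.sizeof(ctypes.c_wchar)

# A NUL-terminated wchar_t array holding a copy of the string.
buf = ctypes.create_unicode_buffer(s)

# Hexadecimal value of each element up to the terminator.
wide_hex = " ".join(f"{ord(ch):x}" for ch in buf.value)

print(wchar_width)
print(wide_hex)
```

For this string every character fits in a single wchar_t element on either width, so the buffer holds one element per character plus the terminator.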
The code below shows how the extension functions behave. Observe that the byte-oriented function print_chars() receives UTF-8 encoded data, whereas print_wchars() receives the Unicode code point values.
Code #5 :
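# print_chars and print_wchars are the extension functions defined above.
# Both print their output themselves and return None.
s = 'Spicy Jalape\u00f1o'
print_chars(s)
print_wchars(s)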
Output :
53 70 69 63 79 20 4a 61 6c 61 70 65 c3 b1 6f
53 70 69 63 79 20 4a 61 6c 61 70 65 f1 6f
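The two output lines above can be reproduced without the extension, which may help when checking what a C function should expect. This is a minimal pure-Python sketch; the variable names are illustrative, and the code point view matches print_wchars() only under the assumption that every character fits in one wchar_t element, as is the case here.

```python
s = 'Spicy Jalape\u00f1o'

# Byte view: one entry per UTF-8 byte, as print_chars() receives it.
as_bytes = " ".join(f"{b:2x}" for b in s.encode('utf-8'))

# Code point view: one entry per character, as print_wchars() receives it.
as_code_points = " ".join(f"{ord(ch):x}" for ch in s)

print(as_bytes)
print(as_code_points)
```

The only difference between the two lines is the character U+00F1, which appears as the single value f1 in the code point view and as the two bytes c3 b1 in UTF-8.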
Consider the nature of the C library being accessed. For many C libraries, it makes more sense to pass bytes instead of a string. The "y#" format code accepts a bytes-like object and passes its contents through without any encoding step, as in the conversion code below.
Code #6 :
static PyObject *py_print_chars(PyObject *self, PyObject *args)
{
    char *s;
    Py_ssize_t len;
    /* "y#" accepts bytes-like objects only; passing a str raises TypeError. */
    if (!PyArg_ParseTuple(args, "y#", &s, &len))
    {
        return NULL;
    }
    print_chars(s, (int) len);
    Py_RETURN_NONE;
}
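With the bytes-based version, the choice of encoding moves to the Python caller. The sketch below imitates that calling convention with a hypothetical pure-Python stand-in for print_chars(); the stand-in is an assumption made so the example runs without compiling the extension, and unlike the real function it also returns the printed line.

```python
def print_chars(data):
    """Hypothetical stand-in for the "y#" version: bytes-like objects only."""
    view = memoryview(data)          # raises TypeError for str, as "y#" does
    line = " ".join(f"{b:2x}" for b in view.tobytes())
    print(line)
    return line                      # returned only to make the sketch checkable

s = 'Spicy Jalape\u00f1o'

utf8_line = print_chars(s.encode('utf-8'))      # caller chooses UTF-8
latin1_line = print_chars(s.encode('latin-1'))  # or a legacy single-byte encoding

try:
    print_chars(s)                   # a str is rejected outright
    rejected = False
except TypeError:
    rejected = True
```

This design keeps the C side free of encoding decisions, at the cost of requiring every caller to encode explicitly.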
If you still want to pass strings, be aware that Python 3 uses an adaptable string representation that does not map directly onto C libraries using the standard types char * or wchar_t *. Thus, in order to present string data to C, some kind of conversion is almost always necessary. The s# and u# format codes to PyArg_ParseTuple() safely perform such conversions.
Whenever such a conversion is made, a copy of the converted data is attached to the original string object so that it can be reused later. The string object grows as a result, as the code below shows.
Code #7 :
import sys

# print_chars and print_wchars are the extension functions defined above.
s = 'Spicy Jalape\u00f1o'
print("Size :", sys.getsizeof(s))
print_chars(s)       # attaches a UTF-8 copy to s
print("Size :", sys.getsizeof(s))
print_wchars(s)      # attaches a wchar_t copy to s
print("Size :", sys.getsizeof(s))
Output :
Size : 87
53 70 69 63 79 20 4a 61 6c 61 70 65 c3 b1 6f
Size : 103
53 70 69 63 79 20 4a 61 6c 61 70 65 f1 6f
Size : 163
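The UTF-8 part of this caching can be observed without the extension. The sketch below is CPython-specific and rests on an assumption: it calls the C API function PyUnicode_AsUTF8AndSize() through ctypes.pythonapi, which is the kind of conversion "s#" performs, rather than going through PyArg_ParseTuple() itself. The exact sizes depend on the Python version and platform, so none are listed.

```python
import ctypes
import sys

# Build the string at run time so the object starts without a cached encoding.
s = ''.join(['Spicy Jalape', '\u00f1', 'o'])

# Declare the C signature: const char *PyUnicode_AsUTF8AndSize(PyObject *, Py_ssize_t *)
as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8AndSize
as_utf8.argtypes = [ctypes.py_object, ctypes.POINTER(ctypes.c_ssize_t)]
as_utf8.restype = ctypes.c_char_p

size_before = sys.getsizeof(s)

nbytes = ctypes.c_ssize_t()
data = as_utf8(s, ctypes.byref(nbytes))   # the UTF-8 copy is cached on s

size_after = sys.getsizeof(s)

print(size_before, size_after, nbytes.value)
```

The reported size grows after the call because the cached UTF-8 buffer is counted as part of the string object.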