Passing NULL-Terminated Strings to C Libraries
Suppose you are writing an extension module that needs to pass a NULL-terminated string to a C library. This article shows how to do that with Python’s Unicode string implementation. Many C libraries have functions that operate on NULL-terminated strings declared as type char *.
The C function below (Code #1) is used to illustrate and test the problem. It simply prints the hex representation of each character so that the passed strings can be easily debugged.
Code #1 :
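#include <stdio.h>

/* Print the hex value of each byte in a NULL-terminated string. */
void print_chars(char *s)
{
    while (*s)
    {
        printf("%2x ", (unsigned char) *s);
        s++;
    }
    printf("\n");
}

int main(void)
{
    print_chars("Hello");
    return 0;
}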
Output :
48 65 6c 6c 6f
There are a few ways to call such a C function from Python. The first is to restrict it to operate only on bytes, using the “y” conversion code to PyArg_ParseTuple(), as shown in the code below.
Code #2 :
static PyObject *py_print_chars(PyObject *self, PyObject *args)
{
    char *s;
    if (!PyArg_ParseTuple(args, "y", &s))
    {
        return NULL;
    }
    print_chars(s);
    Py_RETURN_NONE;
}
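Let’s see how the resulting function operates, and how bytes with embedded NULL bytes and Unicode strings are rejected. Each call is run separately, since the last two raise an exception. The output shown comes from an older Python 3 release; the exact exception types and messages vary by version (Python 3.5 and later raise ValueError for an embedded null byte).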
Code #3 :
print_chars(b'Hello World')
print_chars(b'Hello\x00World')    # rejected: embedded NULL byte
print_chars('Hello World')        # rejected: str is not bytes
Output :
48 65 6c 6c 6f 20 57 6f 72 6c 64
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: must be bytes without null bytes, not bytes
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' does not support the buffer interface
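The reason for rejecting embedded NULL bytes can be modeled in plain Python. The sketch below is not part of the extension module; c_view() is a hypothetical helper that mimics how a C loop such as the one in print_chars() reads a buffer, stopping at the first zero byte. It suggests that everything after an embedded NULL would be silently ignored if such data were accepted.

```python
def c_view(data: bytes) -> str:
    """Hypothetical helper: hex-dump data the way a C loop over a
    NULL-terminated char * would see it (stops at the first zero byte)."""
    seen = []
    for byte in data:
        if byte == 0:          # C treats this byte as the end of the string
            break
        seen.append(f"{byte:2x}")
    return " ".join(seen)


print(c_view(b'Hello World'))
print(c_view(b'Hello\x00World'))   # only the bytes for 'Hello' are visible
```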
To pass Unicode strings instead, use the “s” format code to PyArg_ParseTuple(), as shown below.
Code #4 :
static PyObject *py_print_chars(PyObject *self, PyObject *args)
{
    char *s;
    if (!PyArg_ParseTuple(args, "s", &s))
    {
        return NULL;
    }
    print_chars(s);
    Py_RETURN_NONE;
}
The code above (Code #4) automatically converts all strings to a NULL-terminated UTF-8 encoding, as shown in the code below.
Code #5 :
print_chars('Hello World')
print_chars('Spicy Jalape\u00f1o')
print_chars('Hello\x00World')     # rejected: embedded NULL character
print_chars(b'Hello World')       # rejected: bytes is not str
Output :
48 65 6c 6c 6f 20 57 6f 72 6c 64
53 70 69 63 79 20 4a 61 6c 61 70 65 c3 b1 6f
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: must be str without null characters, not str
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: must be str, not bytes
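The bytes in the second output line are simply the UTF-8 encoding of the string. As a quick check that does not require the extension module, the same hex dump can be produced in pure Python; utf8_hex() is a hypothetical helper introduced only for this illustration.

```python
def utf8_hex(text: str) -> str:
    """Hypothetical helper: hex dump of the UTF-8 encoding of text."""
    return " ".join(f"{byte:2x}" for byte in text.encode("utf-8"))


# The non-ASCII character U+00F1 becomes the two bytes c3 b1.
print(utf8_hex('Spicy Jalape\u00f1o'))
```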
If you are working with a PyObject * and can’t use PyArg_ParseTuple(), the code below shows how to check and extract a suitable char * reference from both a bytes object and a string object.
Code #6 : Conversion from bytes
PyObject *obj;   /* Object to convert; expected to be bytes */
{
    char *s;
    s = PyBytes_AsString(obj);
    if (!s)
    {
        return NULL;   /* TypeError already raised */
    }
    print_chars(s);
}
Code #7 : Conversion to UTF-8 bytes from a string
{
    PyObject *bytes;
    char *s;
    if (!PyUnicode_Check(obj))
    {
        PyErr_SetString(PyExc_TypeError, "Expected string");
        return NULL;
    }
    bytes = PyUnicode_AsUTF8String(obj);
    if (!bytes)
    {
        return NULL;   /* Encoding failed; exception already set */
    }
    s = PyBytes_AsString(bytes);
    print_chars(s);
    Py_DECREF(bytes);
}
Both conversions guarantee NULL-terminated data, but neither checks for embedded NULL bytes elsewhere inside the string. That needs to be checked separately if it is important.
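One place to perform that check is on the Python side, before the object is handed to the extension. The sketch below is an assumption about how such a guard might look, not part of the original module: to_c_bytes() is a hypothetical helper that accepts bytes or str, encodes str as UTF-8, and rejects embedded NULL bytes.

```python
def to_c_bytes(obj) -> bytes:
    """Hypothetical guard: return bytes that are safe to treat as a
    NULL-terminated C string, or raise if that is not possible."""
    if isinstance(obj, str):
        data = obj.encode("utf-8")      # same encoding as Code #7
    elif isinstance(obj, bytes):
        data = obj
    else:
        raise TypeError("Expected bytes or str")
    if b"\x00" in data:                 # an embedded NULL would truncate in C
        raise ValueError("embedded null byte")
    return data


print(to_c_bytes('Spicy Jalape\u00f1o'))
try:
    to_c_bytes('Hello\x00World')
except ValueError as exc:
    print("rejected:", exc)
```

In C, a comparable check can be made by comparing the length reported by PyBytes_AsStringAndSize() with strlen() of the returned pointer.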
Note : There is a hidden memory overhead associated with using the “s” format code to PyArg_ParseTuple() that is easy to overlook. When code uses this conversion, a UTF-8 copy of the string is created and attached to the original string object. If the string contains non-ASCII characters, this increases the size of the string object, and the extra memory stays in use until the string is garbage collected.
Code #8 :
import sys
s = 'Spicy Jalape\u00f1o'
print("Size :", sys.getsizeof(s))
print_chars(s)
print("Size :", sys.getsizeof(s))
Output :
Size : 87
53 70 69 63 79 20 4a 61 6c 61 70 65 c3 b1 6f
Size : 103
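For readers who want to observe this effect without building the extension, the sketch below triggers a cached UTF-8 conversion through ctypes. It assumes CPython, where the C API function PyUnicode_AsUTF8() is reachable through ctypes.pythonapi. The sizes reported depend on the Python version and platform, so none are shown here.

```python
import ctypes
import sys

# Assumption: running on CPython, which exports PyUnicode_AsUTF8().
as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8
as_utf8.argtypes = [ctypes.py_object]
as_utf8.restype = ctypes.c_char_p

s = 'Spicy Jalape\u00f1o'
before = sys.getsizeof(s)
raw = as_utf8(s)            # creates and caches the UTF-8 copy on s
after = sys.getsizeof(s)

print("Size before :", before)
print("UTF-8 bytes :", raw)
print("Size after  :", after)
```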
Last Updated :
29 Mar, 2019