Passing Unicode Strings to C Libraries

Last Updated : 02 Apr, 2019

Suppose you are writing an extension module that needs to pass a Python string to a C library function, and you want Unicode to be handled properly. The main issue is that existing C libraries do not understand Python's native representation of Unicode. The challenge, therefore, is to convert the Python string into a form that C libraries can readily work with.

To illustrate the solution, here are two C functions that operate on string data and print it for debugging and experimentation.

Code #1 : Uses bytes provided in the form char *, int

#include <stdio.h>

/* Print each byte of s as a hexadecimal value */
void print_chars(char *s, int len)
{
    int n = 0;
    while (n < len)
    {
        printf("%2x ", (unsigned char) s[n]);
        n++;
    }
    printf("\n");
}

Code #2 : Uses wide characters in the form wchar_t *, int

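#include <stdio.h>
#include <wchar.h>

/* Print each wide character of s as a hexadecimal value */
void print_wchars(wchar_t *s, int len)
{
    int n = 0;
    while (n < len)
    {
        /* %x expects an unsigned int, so convert explicitly */
        printf("%x ", (unsigned int) s[n]);
        n++;
    }
    printf("\n");
}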

Python strings need to be converted to a suitable byte encoding such as UTF-8 for the byte-oriented function print_chars(). The simple extension function below does this: the s# format code encodes the string as UTF-8 and supplies a pointer to the data together with its length in bytes.

Code #3 :

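#define PY_SSIZE_T_CLEAN  /* make "s#" report the length as Py_ssize_t */
#include <Python.h>

static PyObject *py_print_chars(PyObject *self, PyObject *args)
{
    char *s;
    Py_ssize_t len;
    /* "s#" yields UTF-8 encoded data and its length in bytes */
    if (!PyArg_ParseTuple(args, "s#", &s, &len))
    {
        return NULL;
    }
    print_chars(s, (int) len);  /* print_chars() takes an int length */
    Py_RETURN_NONE;
}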

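For library functions that work with the machine-native wchar_t type, the C extension code can be written as follows. Note that the u# format code used here has been deprecated since Python 3.3 and was removed in Python 3.12, so this version only works with older Python releases; on newer versions, PyUnicode_AsWideCharString() is the documented way to obtain a wchar_t buffer.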

Code #4 :

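static PyObject *py_print_wchars(PyObject *self, PyObject *args)
{
    wchar_t *s;
    Py_ssize_t len;
    /* "u#" yields a wide-character buffer and its length in characters */
    if (!PyArg_ParseTuple(args, "u#", &s, &len))
    {
        return NULL;
    }
    print_wchars(s, (int) len);  /* print_wchars() takes an int length */
    Py_RETURN_NONE;
}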
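
How wide wchar_t is depends on the platform (commonly 4 bytes on Linux and macOS, 2 bytes on Windows), and that determines what a wide-character function actually receives. As a rough check from Python, the ctypes module exposes the native type as ctypes.c_wchar. The sketch below only inspects a buffer; it does not call the extension functions.

```python
import ctypes

s = 'Spicy Jalape\u00f1o'

wchar_size = ctypes.sizeof(ctypes.c_wchar)  # size of the platform's wchar_t
buf = ctypes.create_unicode_buffer(s)       # NUL-terminated wchar_t array

print('sizeof(wchar_t):', wchar_size)
print('elements (including the terminating NUL):', len(buf))
print('total bytes:', ctypes.sizeof(buf))
```

Every character in this particular string fits in a single wchar_t on either kind of platform, so the element count is the string length plus one for the terminator.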

The code below exercises the two extension functions. Note that the byte-oriented function print_chars() receives UTF-8 encoded data, whereas print_wchars() receives the Unicode code point values.

Code #5 :

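# print_chars and print_wchars come from the compiled extension module
s = 'Spicy Jalape\u00f1o'

# Both functions print their output themselves and return None,
# so they are called directly rather than wrapped in print()
print_chars(s)
print_wchars(s)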

Output :

53 70 69 63 79 20 4a 61 6c 61 70 65 c3 b1 6f

53 70 69 63 79 20 4a 61 6c 61 70 65 f1 6f
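
Both lines of output can be reproduced in pure Python, which makes the difference between the two views of the string easier to see. This is only a sketch: the helper names utf8_hex and code_point_hex are hypothetical and are not part of the extension module.

```python
def utf8_hex(text):
    """Hex dump of the UTF-8 bytes, the form print_chars() receives."""
    return ' '.join(f'{b:2x}' for b in text.encode('utf-8'))


def code_point_hex(text):
    """Hex dump of the code points, the form print_wchars() receives
    (assuming wchar_t is wide enough for every character in text)."""
    return ' '.join(f'{ord(ch):x}' for ch in text)


s = 'Spicy Jalape\u00f1o'
print(utf8_hex(s))        # U+00F1 is encoded as the two bytes c3 b1
print(code_point_hex(s))  # U+00F1 stays a single value, f1
```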

Consider the nature of the C library being accessed. For many C libraries, it makes more sense to pass bytes instead of a string. The conversion code below does so.

Code #6 :

static PyObject *py_print_chars(PyObject *self, PyObject *args) 
{ 
    char *s; 
    Py_ssize_t len; 
      
    // accepts bytes and other read-only bytes-like objects (not str, not bytearray)
      
    if (!PyArg_ParseTuple(args, "y#", &s, &len)) 
    { 
        return NULL; 
    } 
    print_chars(s, len); 
    Py_RETURN_NONE; 
} 
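
With a bytes-only interface, the caller chooses the encoding explicitly. Here is a short sketch of what that looks like on the Python side, assuming the C library documents which encoding it expects. The call to print_chars() appears only in a comment because it needs the compiled extension.

```python
s = 'Spicy Jalape\u00f1o'

utf8_data = s.encode('utf-8')      # U+00F1 becomes 2 bytes (c3 b1)
latin1_data = s.encode('latin-1')  # U+00F1 becomes 1 byte (f1)

print(len(utf8_data), utf8_data.hex(' '))
print(len(latin1_data), latin1_data.hex(' '))

# With the "y#" version of the extension function, the call would be:
#     print_chars(utf8_data)
# Passing the str itself, print_chars(s), would raise TypeError.
```

The same text produces different byte sequences under different encodings, which is why leaving the choice to the caller is often the safer design for a byte-oriented library.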

If you still want to pass strings, be aware that Python 3 uses an adaptable string representation that does not map directly onto the standard C types char * or wchar_t *. Thus, in order to present string data to C, some kind of conversion is almost always necessary. The s# and u# format codes to PyArg_ParseTuple() perform such conversions safely.
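
The adaptable representation can be observed from Python. The sketch below is specific to CPython 3.3 and later, where a string stores 1, 2, or 4 bytes per character depending on the largest code point it contains. The helper bytes_per_char is hypothetical; it estimates the width by comparing sys.getsizeof() for two strings of different lengths, so it says nothing about other Python implementations.

```python
import sys


def bytes_per_char(ch, n=100):
    """Estimate storage per character from the size difference
    between two freshly built strings made of the same character."""
    base = ch * 10
    longer = ch * (10 + n)
    return (sys.getsizeof(longer) - sys.getsizeof(base)) // n


for ch in ('a', '\u00f1', '\u20ac', '\U0001f600'):
    print(f'U+{ord(ch):04X}: {bytes_per_char(ch)} byte(s) per character')
```

Because the internal width varies from string to string, there is no single C type that a pointer to the internal buffer could safely be treated as, which is why the format codes convert instead.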
Whenever such a conversion is made, a copy of the converted data is attached to the original string object so that it can be reused later. Code #7 shows the effect on the size reported by sys.getsizeof().

Code #7 :

import sys

# print_chars and print_wchars come from the compiled extension module
s = 'Spicy Jalape\u00f1o'
print("Size :", sys.getsizeof(s))

print_chars(s)   # triggers the UTF-8 conversion
print("Size :", sys.getsizeof(s))

print_wchars(s)  # triggers the wchar_t conversion
print("Size :", sys.getsizeof(s))

Output :

Size : 87
    
53 70 69 63 79 20 4a 61 6c 61 70 65 c3 b1 6f

Size : 103    

53 70 69 63 79 20 4a 61 6c 61 70 65 f1 6f

Size : 163
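
The growth after the first call comes from the cached UTF-8 copy. Assuming CPython and a working ctypes.pythonapi, the same effect can be triggered without the extension by calling the C API function PyUnicode_AsUTF8() directly. The absolute sizes depend on the Python version and platform, so only the difference is meaningful. The helper name utf8_cache_growth is hypothetical.

```python
import ctypes
import sys

# Declare the prototype of the CPython C API function PyUnicode_AsUTF8().
as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8
as_utf8.argtypes = [ctypes.py_object]
as_utf8.restype = ctypes.c_char_p


def utf8_cache_growth(text):
    """Return how many bytes sys.getsizeof(text) grows once the UTF-8
    form of a non-ASCII string has been requested through the C API."""
    before = sys.getsizeof(text)
    as_utf8(text)  # CPython keeps the encoded copy on the string object
    return sys.getsizeof(text) - before


# Build the string at run time so it starts without a cached copy.
s = ''.join(['Spicy Jalape', '\u00f1', 'o'])
print('growth in bytes:', utf8_cache_growth(s))
```

In the output above, the first jump from 87 to 103 is 16 bytes, which is consistent with the 15 UTF-8 bytes of the string plus a terminating NUL.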
