Passing Unicode Strings to C Libraries

Suppose you are writing an extension module that needs to pass a Python string to a C library function. The question is how to handle Unicode properly, because existing C libraries do not understand Python’s native representation of Unicode. The main challenge is therefore to convert the Python string into a form that C libraries can work with, such as an encoded byte string (char *) or a wide-character array (wchar_t *).

To illustrate the solution, here are two C functions that operate on string data and print it for debugging and experimentation.

Code #1 : Uses bytes provided in the form char *, int



#include <stdio.h>

/* Print each of the len bytes in s as a hex value */
void print_chars(char *s, int len)
{
    int n = 0;
    while (n < len)
    {
        printf("%2x ", (unsigned char) s[n]);
        n++;
    }
    printf("\n");
}

Code #2 : Uses wide characters in the form wchar_t *, int

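#include <stdio.h>
#include <wchar.h>

/* Print each of the len wide characters in s as a hex value */
void print_wchars(wchar_t *s, int len)
{
    int n = 0;
    while (n < len)
    {
        /* cast so the argument matches the %x conversion */
        printf("%x ", (unsigned int) s[n]);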
        n++;
    }
    printf("\n");
}

Python strings must be converted to a suitable byte encoding, such as UTF-8, before they are passed to the byte-oriented function print_chars(). The simple extension function below does this.

Code #3 :

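/* Define PY_SSIZE_T_CLEAN before including Python.h: Python 3.10 to 3.12
   require it for "#" formats, and older versions otherwise store the
   length as int instead of Py_ssize_t */
static PyObject *py_print_chars(PyObject *self, PyObject *args)
{
    char *s;
    Py_ssize_t len;
    /* "s#" encodes a str argument as UTF-8 and yields pointer + length */
    if (!PyArg_ParseTuple(args, "s#", &s, &len))
    {
        return NULL;
    }
    print_chars(s, (int) len);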
    Py_RETURN_NONE;
}
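
To see what print_chars() receives without building the extension, the UTF-8 conversion can be reproduced in pure Python. This is only a sketch: str.encode() stands in for the "s#" conversion, and hex_dump is a hypothetical helper name that does not appear in the original code.

```python
def hex_dump(data: bytes) -> str:
    """Format each byte as lowercase hex, separated by spaces."""
    return ' '.join(f'{b:2x}' for b in data)

s = 'Spicy Jalape\u00f1o'
utf8 = s.encode('utf-8')   # same encoding that "s#" applies to a str

print(len(s), len(utf8))   # 14 characters, but 15 bytes
print(hex_dump(utf8))
```

The byte count exceeds the character count because the ñ needs two bytes in UTF-8, which is why the C function must be given the byte length rather than the string length.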

For library functions that work with the machine-native wchar_t type, the C extension code can be written as follows.

Code #4 :

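/* Note: the "u#" format code is deprecated since Python 3.3 and was removed
   in Python 3.12; newer code can use PyUnicode_AsWideCharString() instead */
static PyObject *py_print_wchars(PyObject *self, PyObject *args)
{
    wchar_t *s;
    Py_ssize_t len;
    if (!PyArg_ParseTuple(args, "u#", &s, &len))
    {
        return NULL;
    }
    print_wchars(s, (int) len);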
    Py_RETURN_NONE;
}
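
The wide-character view can be approximated from Python with the standard ctypes module, whose create_unicode_buffer() builds a NUL-terminated wchar_t array. This is an illustrative sketch, not the mechanism the extension itself uses, and the width of wchar_t is platform dependent.

```python
import ctypes

s = 'Spicy Jalape\u00f1o'

# A mutable, NUL-terminated array of wchar_t holding the string
buf = ctypes.create_unicode_buffer(s)

# Width of one wchar_t in bytes (commonly 4 on Linux/macOS, 2 on Windows)
wchar_size = ctypes.sizeof(ctypes.c_wchar)

# Each element is one code point for this string (all characters are in the BMP)
code_points = ' '.join(f'{ord(c):x}' for c in buf.value)

print(wchar_size)
print(code_points)
```

Unlike the UTF-8 case, every character here occupies exactly one array element, so the element count matches the string length (plus the terminator).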

The code below shows how the extension functions behave.

Note that the byte-oriented function print_chars() receives UTF-8 encoded data, whereas print_wchars() receives the Unicode code point values.

Code #5 :

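# print_chars() and print_wchars() come from the compiled extension module
s = 'Spicy Jalape\u00f1o'
print_chars(s)      # the C function prints directly and returns None
print()
print_wchars(s)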

Output :

53 70 69 63 79 20 4a 61 6c 61 70 65 c3 b1 6f

53 70 69 63 79 20 4a 61 6c 61 70 65 f1 6f
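
The two dumps differ only at the ñ: its code point U+00F1 appears as f1 in the wide-character output and as the two bytes c3 b1 in UTF-8. The following sketch works through the two-byte UTF-8 rule by hand and checks the result against Python's own encoder. The helper encode_two_byte is hypothetical and exists only for illustration.

```python
def encode_two_byte(cp: int) -> bytes:
    """Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= cp <= 0x7FF:
        raise ValueError('code point outside the two-byte UTF-8 range')
    lead = 0xC0 | (cp >> 6)      # 110xxxxx carries the top 5 bits
    trail = 0x80 | (cp & 0x3F)   # 10xxxxxx carries the low 6 bits
    return bytes([lead, trail])

encoded = encode_two_byte(0xF1)
print(encoded.hex(' '))                      # c3 b1
print(encoded == '\u00f1'.encode('utf-8'))   # True
```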

Consider the nature of the C library being accessed. For many C libraries, it makes more sense to pass bytes instead of a string. The conversion code below does so.

Code #6 :

static PyObject *py_print_chars(PyObject *self, PyObject *args)
{
    char *s;
    Py_ssize_t len;

    /* "y#" accepts bytes or another read-only bytes-like object;
       a str argument is rejected with TypeError */
    if (!PyArg_ParseTuple(args, "y#", &s, &len))
    {
        return NULL;
    }
    print_chars(s, len);
    Py_RETURN_NONE;
}
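
With a bytes-only interface, the caller chooses the encoding explicitly. The sketch below shows, on the Python side only, how the same string yields different byte sequences (or an error) depending on that choice. The encodings used are examples, not a requirement of the extension.

```python
s = 'Spicy Jalape\u00f1o'

utf8 = s.encode('utf-8')        # the ñ becomes two bytes: c3 b1
latin1 = s.encode('latin-1')    # the ñ becomes one byte: f1

try:
    s.encode('ascii')           # the ñ has no ASCII representation
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(len(utf8), len(latin1), ascii_ok)   # 15 14 False
```

Either bytes object could then be handed to the bytes-accepting print_chars(), so the C library sees exactly the encoding it expects.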

If you still want to pass strings, be aware that Python 3 uses an adaptable string representation that does not map directly onto C libraries using the standard types char * or wchar_t *. Thus, to present string data to C, some kind of conversion is almost always necessary. The s# and u# format codes to PyArg_ParseTuple() safely perform such conversions.
Whenever a conversion is made, a copy of the converted data is attached to the original string object so that it can be reused later, which is why the reported size of the string grows in the code below. The exact sizes depend on the Python version and platform.

Code #7 :

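import sys

# print_chars() and print_wchars() come from the compiled extension module
s = 'Spicy Jalape\u00f1o'
print("Size :", sys.getsizeof(s))

print()
print_chars(s)    # the UTF-8 copy is now cached on the string object
print("\nSize :", sys.getsizeof(s))

print()
print_wchars(s)   # the wchar_t copy is now cached as well
print("\nSize :", sys.getsizeof(s))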

Output :

Size : 87
    
53 70 69 63 79 20 4a 61 6c 61 70 65 c3 b1 6f

Size : 103    

53 70 69 63 79 20 4a 61 6c 61 70 65 f1 6f

Size : 163    
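
The UTF-8 caching can also be observed without the extension module. The sketch below assumes CPython and reaches the C API function PyUnicode_AsUTF8() through ctypes.pythonapi purely for illustration; it is not how the extension above is invoked, and the reported sizes will differ between Python versions.

```python
import ctypes
import sys

# PyUnicode_AsUTF8 is part of the CPython C API (CPython only).
as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8
as_utf8.argtypes = [ctypes.py_object]
as_utf8.restype = ctypes.c_char_p

s = 'Spicy Jalape\u00f1o'

before = sys.getsizeof(s)
encoded = as_utf8(s)          # asks CPython for the UTF-8 form of s
after = sys.getsizeof(s)      # now includes the cached UTF-8 copy

print(encoded)
print(before, after)
```

The cached copy lives as long as the string object does, so repeated conversions of the same string do not encode it again.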

