Unicode Strings Passing to C Libraries

  • Last Updated : 02 Apr, 2019

When writing an extension module that passes a Python string to a C library function, Unicode has to be handled properly. The main issue is that existing C libraries do not understand Python's native representation of Unicode, so the Python string must first be converted into a form the C library can work with.

To illustrate the solution, here are two C functions that operate on string data and print it for debugging and experimentation.


Code #1 : Uses bytes provided in the form char *, int






#include <stdio.h>

/* Print each byte of s as a hexadecimal value. */
void print_chars(char *s, int len)
{
    int n = 0;
    while (n < len)
    {
        printf("%2x ", (unsigned char) s[n]);
        n++;
    }
    printf("\n");
}

 
Code #2 : Uses wide characters in the form wchar_t *, int




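#include <stdio.h>
#include <wchar.h>

/* Print each wide character of s as a hexadecimal value. */
void print_wchars(wchar_t *s, int len)
{
    int n = 0;
    while (n < len)
    {
        /* %x expects an unsigned int, so cast the wchar_t explicitly. */
        printf("%x ", (unsigned int) s[n]);
        n++;
    }
    printf("\n");
}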

Python strings need to be converted to a suitable byte encoding such as UTF-8 for the byte-oriented function print_chars(). The code below is a simple extension function that does this.

Code #3 :




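/* Define before including Python.h so that "#" formats produce
   Py_ssize_t lengths (needed on Python 3.12 and older). */
#define PY_SSIZE_T_CLEAN
#include <Python.h>

static PyObject *py_print_chars(PyObject *self, PyObject *args)
{
    char *s;
    Py_ssize_t len;
    /* "s#" converts a str argument to UTF-8 and yields pointer and length */
    if (!PyArg_ParseTuple(args, "s#", &s, &len))
    {
        return NULL;
    }
    print_chars(s, (int) len);
    Py_RETURN_NONE;
}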
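
The bytes that print_chars() receives through the "s#" conversion can be previewed from pure Python, without building the extension. The sketch below is an illustration only: the helper name utf8_hex is hypothetical, and it assumes the conversion is UTF-8, as described above.

```python
def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as space-separated hex values."""
    data = text.encode("utf-8")
    # format(b, "2x") mirrors the "%2x" conversion used by print_chars()
    return " ".join(format(b, "2x") for b in data)


s = "Spicy Jalape\u00f1o"
print(utf8_hex(s))
# The non-ASCII character takes two bytes, so the byte count exceeds
# the number of code points.
print(len(s), "code points,", len(s.encode("utf-8")), "bytes")
```

The length handed to the C function is therefore a byte count, not a character count.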

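For library functions that work with the machine-native wchar_t type, C extension code can be written as shown below. Note that the "u#" format code used here was deprecated in Python 3.3 and removed in Python 3.12; on current versions, PyUnicode_AsWideCharString() is the documented way to obtain a wchar_t buffer from a string.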

Code #4 :




static PyObject *py_print_wchars(PyObject *self, PyObject *args)
{
    wchar_t *s;
    Py_ssize_t len;
    /* "u#" yields a pointer to wide characters and their count */
    if (!PyArg_ParseTuple(args, "u#", &s, &len))
    {
        return NULL;
    }
    print_wchars(s, (int) len);
    Py_RETURN_NONE;
}
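
The values that a wide-character function sees for this kind of input can also be sketched in pure Python. The helper name code_point_hex is hypothetical, and the sketch assumes every character fits in a single wchar_t, which holds for the example string but not for characters outside the Basic Multilingual Plane on platforms where wchar_t is 16 bits wide.

```python
import ctypes


def code_point_hex(text: str) -> str:
    """Return the code point of each character as space-separated hex."""
    return " ".join(format(ord(ch), "x") for ch in text)


s = "Spicy Jalape\u00f1o"
print(code_point_hex(s))

# Size of wchar_t in bytes on the current platform, as reported by ctypes
print("wchar_t size:", ctypes.sizeof(ctypes.c_wchar))
```

Here the length is a count of wide characters, so it matches len(s) for this string.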

The code below checks how the extension functions work. Note that the byte-oriented function print_chars() receives UTF-8 encoded data, whereas print_wchars() receives the Unicode code point values.

Code #5 :




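# print_chars() and print_wchars() are the extension functions defined above.
# They print their output themselves and return None, so call them directly.
s = 'Spicy Jalape\u00f1o'
print_chars(s)

print_wchars(s)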

Output :

53 70 69 63 79 20 4a 61 6c 61 70 65 c3 b1 6f

53 70 69 63 79 20 4a 61 6c 61 70 65 f1 6f

Consider the nature of the C library being accessed. For many C libraries, it makes more sense to pass bytes instead of a string. The conversion code below does so.

Code #6 :




static PyObject *py_print_chars(PyObject *self, PyObject *args)
{
    char *s;
    Py_ssize_t len;

    /* "y#" accepts bytes and other read-only bytes-like objects, but not str */
    if (!PyArg_ParseTuple(args, "y#", &s, &len))
    {
        return NULL;
    }
    print_chars(s, (int) len);
    Py_RETURN_NONE;
}
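
One reason to accept bytes is that the caller then chooses the encoding explicitly. The sketch below is a pure-Python illustration (the helper name hex_bytes is hypothetical) of how the same string produces different byte sequences under two encodings, either of which could be passed to a bytes-accepting function.

```python
def hex_bytes(data: bytes) -> str:
    """Return the bytes as space-separated two-digit hex values."""
    return " ".join(format(b, "02x") for b in data)


s = "Spicy Jalape\u00f1o"

utf8 = s.encode("utf-8")      # the non-ASCII character becomes two bytes
latin1 = s.encode("latin-1")  # the same character becomes one byte

print(hex_bytes(utf8))
print(hex_bytes(latin1))
```

Which encoding is right depends on what the C library expects, so it helps to make that choice visible at the call site.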

If you still want to pass strings, be aware that Python 3 uses an adaptable string representation that does not map directly onto C libraries using the standard types char * or wchar_t *. Some kind of conversion is therefore almost always necessary to present string data to C. The s# and u# format codes to PyArg_ParseTuple() perform such conversions safely.
Whenever a conversion is made, a copy of the converted data is attached to the original string object so that it can be reused later. This is why the size reported by sys.getsizeof() grows after each call in the code below.

Code #7 :




import sys

# print_chars() and print_wchars() are the extension functions defined above
s = 'Spicy Jalape\u00f1o'
print("Size :", sys.getsizeof(s))

print_chars(s)

print("Size :", sys.getsizeof(s))

print_wchars(s)

print("Size :", sys.getsizeof(s))

Output :

Size : 87
    
53 70 69 63 79 20 4a 61 6c 61 70 65 c3 b1 6f

Size : 103    

53 70 69 63 79 20 4a 61 6c 61 70 65 f1 6f

Size : 163    
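
The adaptable string representation mentioned above can be observed with sys.getsizeof() alone. The sketch below assumes CPython 3.3 or later, where the storage per character depends on the widest code point in the string; the helper name bytes_per_char is hypothetical, and other Python implementations may report different figures.

```python
import sys


def bytes_per_char(ch: str) -> float:
    """Estimate the storage used per character for strings made of ch."""
    short, long = ch * 2, ch * 102
    # The fixed object header cancels out in the difference
    return (sys.getsizeof(long) - sys.getsizeof(short)) / 100


samples = [
    ("ASCII", "a"),
    ("Latin-1", "\u00f1"),
    ("BMP", "\u20ac"),
    ("non-BMP", "\U0001f600"),
]
for label, ch in samples:
    print(label, bytes_per_char(ch))
```

Because the internal width varies from string to string, a C library expecting a fixed char * or wchar_t * layout cannot simply be handed the internal buffer.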


