Python | C Strings of Doubtful Encoding | Set-1

Strings can be converted between C and Python in both directions, but the encoding of the C data is sometimes doubtful or unknown. Suppose some C data is supposed to be UTF-8, but this is not strictly enforced. Such malformed data must be handled so that it neither crashes Python nor destroys the string data in the process.
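
To make the risk concrete, here is a minimal Python sketch of what happens with the default error handling. The variable names (data, strict_error, lossy) are illustrative and not part of the article's extension code; data holds the same 16 bytes as the sdata string defined in Code#1.

```python
# Same bytes as the article's sdata, written as a Python bytes literal.
data = b"Spicy Jalape\xc3\xb1o\xae"

# Strict decoding (the default) rejects the stray 0xae byte.
try:
    data.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# errors="replace" avoids the exception but swaps the bad byte for U+FFFD,
# so the original byte can no longer be recovered from the string.
lossy = data.decode("utf-8", errors="replace")

print("strict decoding failed at byte offset:", strict_error.start)
print("replace round-trips to the same bytes:", lossy.encode("utf-8") == data)
```

Neither outcome is acceptable here: the first raises an exception, and the second silently loses the original byte.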

Code#1 : C data and a function illustrating the problem.

#include <stdio.h>

/* Some dubious string data (malformed UTF-8): the trailing 0xae is a stray byte */
const char *sdata = "Spicy Jalape\xc3\xb1o\xae";
int slen = 16;

/* Print each byte of s as hexadecimal */
void print_chars(const char *s, int len)
{
    int n = 0;
    while (n < len) {
        printf("%2x ", (unsigned char)s[n]);
        n++;
    }
    printf("\n");
}


In the code above, the string sdata is mostly valid UTF-8 (the bytes \xc3\xb1 encode ñ), but the final byte \xae is not valid UTF-8 on its own. Nevertheless, calling print_chars(sdata, slen) in C works fine, because C treats the data as raw bytes.
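
As a cross-check, the same byte-by-byte dump can be sketched in Python. The helper name hex_dump is hypothetical and is not the extension function; it mirrors the "%2x" formatting of the C loop, except that it joins the values with single spaces instead of printing a trailing one.

```python
def hex_dump(data: bytes) -> str:
    """Print and return each byte as width-2 lowercase hex, like "%2x" in C."""
    line = " ".join(f"{b:2x}" for b in data)
    print(line)
    return line


# Same bytes as the article's sdata; len(data) matches slen.
data = b"Spicy Jalape\xc3\xb1o\xae"
dump = hex_dump(data)
print(len(data))
```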



Now suppose one wants to convert the contents of sdata into a Python string and then pass that string back to the print_chars() function through an extension. The code below does this in a way that exactly preserves the original data despite the encoding problems.

Code#2 :

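/* Requires #include <Python.h> at the top of the extension source,
   before any standard headers. */

/* Return the C string back to Python */
static PyObject *py_retstr(PyObject *self, PyObject *args)
{
    if (!PyArg_ParseTuple(args, "")) {
        return NULL;
    }
    /* surrogateescape maps each undecodable byte 0xNN to U+DCNN */
    return PyUnicode_Decode(sdata, slen, "utf-8", "surrogateescape");
}

/* Wrapper for the print_chars() function */
static PyObject *py_print_chars(PyObject *self, PyObject *args)
{
    PyObject *obj, *bytes;
    char *s = NULL;
    Py_ssize_t len;

    if (!PyArg_ParseTuple(args, "U", &obj)) {
        return NULL;
    }
    /* Encode back to bytes; escaped surrogates become the original bytes */
    bytes = PyUnicode_AsEncodedString(obj, "utf-8", "surrogateescape");
    if (bytes == NULL) {
        return NULL;
    }
    /* s points into the bytes object, so use it before the DECREF */
    if (PyBytes_AsStringAndSize(bytes, &s, &len) < 0) {
        Py_DECREF(bytes);
        return NULL;
    }
    print_chars(s, (int)len);
    Py_DECREF(bytes);
    Py_RETURN_NONE;
}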


Code#3 : Using the functions from Code#2 in Python.

# Assumes the extension module exposes py_retstr() as retstr()
# and py_print_chars() as print_chars()
s = retstr()
print(repr(s))

print_chars(s)


'Spicy Jalapeño\udcae'

53 70 69 63 79 20 4a 61 6c 61 70 65 c3 b1 6f ae

Here, one can see that the malformed byte string was decoded into a Python string without errors, and that when passed back into C it was encoded into a byte string containing exactly the same bytes as the original C string.
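
The same round trip can be reproduced in pure Python, without the extension, as a minimal sketch. The variable names are illustrative. ascii() is used for display because a string containing a lone surrogate such as '\udcae' generally cannot be written to a UTF-8 terminal directly.

```python
# Same bytes as the article's sdata.
data = b"Spicy Jalape\xc3\xb1o\xae"

# Decoding: the undecodable byte 0xae becomes the lone surrogate U+DCAE.
text = data.decode("utf-8", errors="surrogateescape")
print(ascii(text))

# Encoding with the same handler turns U+DCAE back into the byte 0xae.
round_tripped = text.encode("utf-8", errors="surrogateescape")
print(round_tripped == data)

# Without the handler, the lone surrogate cannot be encoded at all.
try:
    text.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False
print(strict_encode_ok)
```

The error handler has to be supplied on both sides of the conversion: decoding with surrogateescape and then encoding strictly raises UnicodeEncodeError.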


