How to Handle Unicode Strings in C++?

A Unicode string is a sequence of code points, where each code point is a unique integer that identifies a character in the Unicode standard. Like ASCII, Unicode assigns a number to every character, but where ASCII covers only 128 characters, Unicode aims to cover the characters of virtually every writing system in use. In this article, we will learn how to handle Unicode strings in C++ by converting UTF-8 encoded text stored in a std::string to a wide string.

Example:

Input:
string utf8_string = "Привет, мир!";

Output:
Wide string: Привет, мир!

Handling Unicode Strings in C++

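We can handle Unicode strings by converting them from UTF-8 to wide strings (std::wstring), which are sequences of wchar_t. Wide-string literals can also be written directly in source code using the L prefix. Note that the size of wchar_t is platform-dependent: it is typically 32 bits on Linux and 16 bits on Windows.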

Syntax to Convert Unicode String to std::wstring

We can use the following syntax to convert a UTF-8 encoded string to std::wstring:

wstring_convert<codecvt_utf8<wchar_t>, wchar_t> converter; 
wstring wide_string = converter.from_bytes(utf8_stringName);

Here,

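wstring_convert is a class template that converts between byte (narrow) strings and wide strings. It is deprecated since C++17 but is still shipped by common standard library implementations.
codecvt_utf8<wchar_t> specifies the codecvt facet used for the conversion. It converts between UTF-8 and UCS-2 or UCS-4, depending on the size of wchar_t.
converter.from_bytes(utf8_stringName) converts the UTF-8 encoded string to a wide string using the codecvt facet specified in the converter.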

C++ Program to Convert Unicode Strings to wstring

The following program illustrates how to convert a UTF-8 string to a wide string and print it in C++.

C++

// C++ Program to Handle Unicode Strings

#include <codecvt> // for codecvt_utf8 (deprecated since C++17)
#include <iostream>
#include <locale> // for wstring_convert (deprecated since C++17) and locale
#include <string>
using namespace std;

int main()
{
    // Assuming the source file is saved in UTF-8 encoding
    string utf8_string
        = "Привет, мир!"; // Hello, world! in Russian

    // Convert UTF-8 string to wide string
    wstring_convert<codecvt_utf8<wchar_t>, wchar_t>
        converter;
    wstring wide_string = converter.from_bytes(utf8_string);

    // Use the user's environment locale so that wcout can
    // encode non-ASCII characters for the terminal
    wcout.imbue(locale(""));

    // Print wide string
    wcout << L"Wide string: " << wide_string
          << L" that means Hello, World! in Russian Language"
          << endl;

    return 0;
}

Output

Wide string: Привет, мир! that means Hello, World! in Russian Language

Time Complexity: O(N), where N is the length of the wide string.
Auxiliary Space: O(N + M), where N is the length of the wide string and M is the length of the UTF-8 string in bytes.
