Character encoding plays a major role in how the content of an HTML or XML document is interpreted. A document may contain not only English characters but also characters from other scripts and languages, such as Hebrew, Latin, Greek, and many more. To tell the parser which encoding to use, the document carries a dedicated tag and attribute. For example:
In HTML documents
<meta charset="encoding method name">
In XML documents
<?xml version="1.0" encoding="encoding method name"?>
These declarations tell the browser which encoding to use for parsing. If the proper encoding is not specified, the content is rendered incorrectly, sometimes with the replacement character '�'.
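The effect of a wrong encoding can be reproduced with plain Python, without any parser. The following is a minimal sketch: the sample word and the variable names are illustrative assumptions, not taken from a real page. It encodes a Hebrew word as ISO-8859-8 and then decodes the same bytes once with the right codec and once with the wrong one.

```python
# The Hebrew word "shalom", written with escapes so the source stays ASCII.
text = "\u05e9\u05dc\u05d5\u05dd"

# Bytes as they would be stored in an ISO-8859-8 encoded document.
raw = text.encode("iso-8859-8")

# Decoding with the encoding that was actually used restores the text.
right = raw.decode("iso-8859-8")

# Decoding as UTF-8 fails for these bytes; errors="replace" substitutes
# the replacement character U+FFFD instead of raising an exception.
wrong = raw.decode("utf-8", errors="replace")

# ascii() prints escapes, so the output looks the same on any terminal.
print(ascii(right))
print(ascii(wrong))
```

The second line of output contains `\ufffd`, which is the replacement character a browser would display.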
XML encoding methods
The XML documents can be encoded in one of the formats listed below.
- ISO-8859-1 to ISO-8859-10
Among these methods, UTF-8 is the most common. UTF-16 uses two bytes for most characters (and four bytes for characters outside the Basic Multilingual Plane), and the documents with '0xx' are encoded by this method. Latin1 (ISO-8859-1) covers Western European characters.
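To make the size difference between these encodings concrete, the sketch below measures how many bytes a few sample characters occupy in UTF-8 and UTF-16. The sample characters and the `widths` dictionary are illustrative choices for this example only.

```python
# Sample characters: an ASCII letter, a Hebrew letter, and an emoji
# that lies outside the Basic Multilingual Plane.
samples = {
    "A": "A",
    "aleph": "\u05d0",
    "emoji": "\U0001F600",
}

widths = {}
for name, ch in samples.items():
    utf8_len = len(ch.encode("utf-8"))
    # "utf-16-le" is used so that no byte order mark is added to the count.
    utf16_len = len(ch.encode("utf-16-le"))
    widths[name] = (utf8_len, utf16_len)
    print(name, "UTF-8 bytes:", utf8_len, "UTF-16 bytes:", utf16_len)

# Latin1 stores each Western European character in a single byte.
latin1_len = len("\u00e9".encode("latin-1"))
print("e-acute Latin1 bytes:", latin1_len)
```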
HTML encoding methods
The HTML and HTML5 documents can be encoded by any one of the methods below.
- UTF-16BE (Big Endian)
- UTF-16LE (Little Endian)
- WINDOWS-1250 to WINDOWS-1258
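The difference between the big-endian and little-endian variants of UTF-16 is only the order of the two bytes in each code unit. A short sketch using Python's built-in codecs (the variable names are illustrative):

```python
ch = "A"  # code point U+0041

# Big endian: the most significant byte comes first.
big = ch.encode("utf-16-be")
# Little endian: the least significant byte comes first.
little = ch.encode("utf-16-le")
# Plain "utf-16" prepends a byte order mark (BOM) so a reader can tell
# which order was used; the order itself depends on the platform.
with_bom = ch.encode("utf-16")

print(big)
print(little)
print(with_bom[:2])
```

Because the BOM identifies the byte order, `with_bom.decode("utf-16")` recovers the original character on any platform.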
For HTML5 documents, UTF-8 is the recommended encoding. ISO-8859-1 is mostly used with XHTML documents. Some encodings, such as UTF-7, UTF-32, BOCU-1, and CESU-8, are explicitly discouraged because they cause most characters to be replaced with the replacement character '�'.
BeautifulSoup and encoding
BeautifulSoup, imported from the bs4 package, makes HTML/XML parsing straightforward. It offers a rich set of methods: some select content by tag name or by an attribute present in the tag, some extract content based on the document hierarchy, some print content with the indentation expected for HTML, and so on.
The bs4 module auto-detects the encoding used in a document and converts the content to Unicode. The returned BeautifulSoup object has attributes that give more information about this process. However, the detection is sometimes wrong. If the encoding is known to the user, it is therefore good practice to pass it as an argument. This article describes the various ways in which the encoding can be specified in the bs4 module.
The bs4 module includes a sub-library called Unicode, Dammit that detects the encoding of a document and uses it to convert the content to Unicode. The original_encoding attribute returns the detected encoding.
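Unicode, Dammit can also be used on its own through the `UnicodeDammit` class. The following is a minimal sketch, assuming bs4 is installed; the sample bytes and the list of candidate encodings are illustrative choices. Passing candidates keeps the result from depending on optional detection libraries.

```python
from bs4 import UnicodeDammit

# UTF-8 bytes for the text "Sacre bleu!" with an accented e.
data = "Sacr\u00e9 bleu!".encode("utf-8")

# The second argument is a list of encodings to try first.
dammit = UnicodeDammit(data, ["utf-8", "latin-1"])

# The decoded text, printed with escapes for terminal safety.
print(ascii(dammit.unicode_markup))
# The encoding that was successfully used for decoding.
print(dammit.original_encoding)
```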
Example 1 :
Given an HTML element, parse it and find the encoding method used.
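A minimal sketch of this example, assuming bs4 is installed and using Python's built-in "html.parser" (the markup itself is an illustrative stand-in):

```python
from bs4 import BeautifulSoup

# A bytes literal: the b prefix means the parser receives raw bytes
# and has to work out the encoding itself.
markup = b"<h1>Hello world</h1>"

soup = BeautifulSoup(markup, "html.parser")

print(soup.h1.string)
# The encoding bs4 settled on. The exact name reported for ASCII-only
# input can vary with the optional detection libraries installed.
print(soup.original_encoding)
```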
Here, the HTML string is prefixed with 'b', which makes it a bytes literal. Since it contains only ASCII characters, the ASCII encoding is detected and used by the parser. In real-world situations, the original encoding will usually be the one declared in the HTML document.
Given a URL, parse the contents and find the original encoding method.
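One way to sketch this, assuming bs4 is installed: the helper names `encoding_of_bytes` and `encoding_of_url` are hypothetical, and the standard library's `urlopen` stands in for whichever HTTP client is preferred. To keep the sketch runnable offline, the demonstration at the bottom uses in-memory bytes in place of a downloaded page.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def encoding_of_bytes(raw):
    """Parse raw page bytes and return the encoding bs4 detected."""
    soup = BeautifulSoup(raw, "html.parser")
    return soup.original_encoding


def encoding_of_url(url):
    """Download a page and return its detected encoding (needs network)."""
    with urlopen(url) as response:
        return encoding_of_bytes(response.read())


# Offline stand-in for a downloaded page that declares UTF-8.
sample_page = (
    '<html><head><meta charset="utf-8"></head>'
    "<body><p>\u05e9\u05dc\u05d5\u05dd</p></body></html>"
).encode("utf-8")

detected = encoding_of_bytes(sample_page)
print("Encoded method :", detected)
```

For a live page, `encoding_of_url` would be called with the page address instead.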
Encoded method : utf-8
Verifying the output :
Encoding method : UTF-8
The from_encoding parameter can be passed to the BeautifulSoup() constructor. It tells the bs4 module explicitly which encoding to use. This saves detection time and avoids incorrect parsing due to misprediction.
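A minimal sketch of passing the encoding to the constructor, assuming bs4 is installed. The sample markup is illustrative: a Hebrew word stored as ISO-8859-8 bytes, which is hard to detect reliably from such a short input.

```python
from bs4 import BeautifulSoup

# The Hebrew word "shalom", stored as ISO-8859-8 bytes inside a heading.
hebrew_text = "\u05e9\u05dc\u05d5\u05dd"
raw = ("<h1>" + hebrew_text + "</h1>").encode("iso-8859-8")

# from_encoding tells bs4 which encoding to try first, before guessing.
soup = BeautifulSoup(raw, "html.parser", from_encoding="iso-8859-8")

# ascii() prints escapes, which avoids errors on ASCII-only terminals.
print(ascii(soup.h1.string))
print(soup.original_encoding)
```

Naming the parser explicitly ("html.parser" here) also keeps the behaviour the same across systems.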
If the below warning is generated:
/usr/lib/python3/dist-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html5lib"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
To get rid of this warning, pass the parser name explicitly:
BeautifulSoup([your markup], "html5lib")
Traceback (most recent call last):
File "/home/98e5f50281480cda5f5e31e3bcafb085.py", line 9, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
The GeeksforGeeks online editor tried to encode the output as ASCII and ended up with the error above. Running the same code on a local machine completes without an error, but the content actually corresponds to "ISO-8859-8", so the interpreted characters are not the desired ones. Explicitly specifying the encoding, when it is known, gives the correct output.
When parsed HTML content is written out, the bs4 module delivers it as a UTF-8 encoded document by default, built from whatever encoding was detected on input, which may have been mispredicted. If you want the document to be encoded differently without passing the encoding to the constructor, the following methods can be used:
- prettify(): This method returns the HTML content with proper indentation. The output encoding can be passed as a parameter, so that the document is re-encoded while it is formatted.
By default, the <meta> tag in the output has its encoding set to UTF-8. To prevent this, pass the required encoding to prettify(); the output then looks like this:
b'<html>\n <meta charset="iso-8859-8"/>\n <body>\n <h1>\n \xa2\xf6`\xe0\n </h1>\n </body>\n</html>'
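The difference between calling prettify() with and without an encoding can be sketched as follows, assuming bs4 is installed. The markup is an illustrative stand-in, not the page used above.

```python
from bs4 import BeautifulSoup

# A small document that declares UTF-8 and contains a Hebrew heading.
markup = (
    '<html><meta charset="utf-8"/><body>'
    "<h1>\u05e9\u05dc\u05d5\u05dd</h1></body></html>"
)
soup = BeautifulSoup(markup, "html.parser")

# Without an encoding, prettify() returns an indented str.
as_text = soup.prettify()

# With an encoding, it returns bytes encoded in that encoding.
as_bytes = soup.prettify(encoding="iso-8859-8")

print(type(as_text).__name__, type(as_bytes).__name__)
print(as_bytes)
```

In the printed bytes, the Hebrew letters appear as single-byte escapes, one byte per letter.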
- encode(): The required encoding can be passed explicitly to this method. Characters that cannot be represented in that encoding are replaced with the corresponding numeric XML character references.
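A short sketch of encode(), assuming bs4 is installed. The snowman markup is an illustrative sample: the snowman character (U+2603) exists in UTF-8 but not in ASCII.

```python
from bs4 import BeautifulSoup

# A tag containing the snowman character, U+2603.
markup = "<b>\N{SNOWMAN}</b>"
soup = BeautifulSoup(markup, "html.parser")
tag = soup.b

# UTF-8 can represent the character directly.
utf8_bytes = tag.encode("utf-8")
# ASCII cannot, so bs4 substitutes a numeric character reference.
ascii_bytes = tag.encode("ascii")

print(utf8_bytes)
print(ascii_bytes)
```

The ASCII output contains `&#9731;`, the decimal reference for U+2603, so a browser still displays the right character.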