html5lib and lxml parsers in Python

Parsers in Python:
Parsing means breaking a blob of text into smaller, meaningful parts according to rules that a particular parser defines. Parsers range from plain string methods that process text line by line to libraries such as html5lib, which can parse almost every element of an HTML document into tags and pieces that can then be filtered for various use cases.
This article focuses on two parsers, html5lib and lxml. Before looking at their pros, cons and differences, here is an overview of both libraries.

html5lib: A pure-python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as is implemented by all major web browsers.

lxml: A Pythonic, mature binding for the C libraries libxml2 and libxslt. It is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API.



Key point:
Because html5lib is a pure-Python library, its external dependencies are Python packages only, whereas lxml, being a binding for C libraries, has external C dependencies (libxml2 and libxslt).
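One quick way to see which of the two parsers is usable in the current environment is to look for their importable packages. The following is a minimal sketch that uses only the standard library; the helper name installed_parsers is made up for this illustration and is not part of either library.

```python
import importlib.util


def installed_parsers(names=("html5lib", "lxml")):
    """Map each parser package name to True if it can be imported here."""
    # find_spec returns None when a top-level module is not installed.
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, present in installed_parsers().items():
        print(f"{name}: {'installed' if present else 'missing'}")
```

A package reported as missing has to be installed before BeautifulSoup can use it as a parser.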

Pros and Cons:

html5lib:

  • Implements the HTML5 parsing algorithm, which is heavily influenced by current browsers, so a document is parsed the same way a browser would parse it.
  • Because it follows the HTML5 parsing algorithm, it repairs a lot of broken HTML and adds missing tags so that the result is a complete HTML document.
  • Extremely lenient.
  • Very slow, because it is implemented entirely in Python.

lxml:

  • Very fast, because it is backed by Cython code that wraps C libraries.
  • Fixes some broken HTML, but does not go far enough to produce a complete HTML document.
  • Quite lenient.
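The speed claims in the two lists above can be checked on your own machine. The following is a rough sketch, not a rigorous benchmark: the helper name time_parser, the sample markup and the repeat count are all assumptions chosen for illustration, and absolute timings will vary with hardware and library versions.

```python
import importlib.util
import timeit


def time_parser(markup, parser, repeats=20):
    """Return the seconds needed to parse `markup` `repeats` times.

    Returns None when bs4 or the requested parser package is not installed.
    """
    # For html5lib and lxml, the parser name passed to BeautifulSoup
    # is also the name of the package that must be importable.
    for package in ("bs4", parser):
        if importlib.util.find_spec(package) is None:
            return None
    from bs4 import BeautifulSoup

    return timeit.timeit(lambda: BeautifulSoup(markup, parser), number=repeats)


if __name__ == "__main__":
    # Hypothetical sample input: a list with 500 short items.
    sample = "<ul>" + "<li>item</li>" * 500 + "</ul>"
    for parser in ("html5lib", "lxml"):
        seconds = time_parser(sample, parser)
        if seconds is None:
            print(f"{parser}: not installed, skipped")
        else:
            print(f"{parser}: {seconds:.3f} s for 20 parses")
```

Timing the same markup with both parsers keeps the comparison fair; a larger or messier document may change the gap between them.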

Differences with BeautifulSoup:
To highlight how the two parsers differ when they build a tree from a document that is not well formed, we feed the same invalid fragment to both of them through BeautifulSoup.

<li></p>

html5lib:

from bs4 import BeautifulSoup

# Parse the malformed fragment with the html5lib parser
soup_html5lib = BeautifulSoup("<li></p>", "html5lib")

# Print the repaired document
print(soup_html5lib)

Output:

<html><head></head><body><li><p></p></li></body></html>

What we find:

  • Opening and closing html tags.
  • Opening and closing head tags (empty).
  • Opening and closing body tags.
  • An opening p tag, added to match the closing p tag.
  • A closing li tag, added to match the opening li tag.
  • No tag from the input was removed in the final text of the soup object.

lxml:

from bs4 import BeautifulSoup

# Parse the same malformed fragment with the lxml parser
soup_lxml = BeautifulSoup("<li></p>", "lxml")

# Print the repaired document
print(soup_lxml)

Output:

<html><body><li></li></body></html>

What we find:

  • Opening and closing html tags.
  • No head tags.
  • Opening and closing body tags.
  • A closing li tag, added to match the opening li tag.
  • No p tag: the stray closing p tag was dropped.

The two parsers clearly build different trees from the same input, and html5lib produces the more complete document of the two.
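The same comparison can be made programmatically instead of by reading the printed output. This is a minimal sketch, assuming a hypothetical helper named tree_summary that reports which tags are present after parsing; it skips any parser that is not installed.

```python
import importlib.util


def tree_summary(markup, parser):
    """Report which of the html, head, body, li and p tags exist after parsing.

    Returns None when bs4 or the parser package is not installed.
    """
    for package in ("bs4", parser):
        if importlib.util.find_spec(package) is None:
            return None
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(markup, parser)
    # find() returns None when no matching tag is in the tree.
    tags = ("html", "head", "body", "li", "p")
    return {tag: soup.find(tag) is not None for tag in tags}


if __name__ == "__main__":
    for parser in ("html5lib", "lxml"):
        print(parser, tree_summary("<li></p>", parser))
```

If the outputs shown earlier hold in your environment, the html5lib summary should report every tag as present, while the lxml summary should report head and p as absent.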


