How to parse a local HTML file in Python?
Prerequisites: BeautifulSoup
Parsing means breaking a file or other input into pieces of data that a program can store and use later. When the data we need sits in an HTML file already saved on our computer, we can parse that file directly. Parsing covers several techniques for extracting data from a file. This article walks through the following: modifying the file, removing something from the file, printing data, traversing the file with the recursive child generator method, finding the children of tags, and scraping a web page from a link to extract useful information.
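Before the individual techniques, here is a minimal sketch of the basic step the article relies on: opening a local file and handing it to BeautifulSoup. The page content, the temporary index.html path and the variable names are assumptions made for illustration, and the built-in "html.parser" is used in place of lxml so that nothing extra needs installing.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Hypothetical page content, written to a temporary file so the sketch runs anywhere
SAMPLE_HTML = "<html><head><title>Demo</title></head><body><h1>Hello</h1></body></html>"

with tempfile.TemporaryDirectory() as tmp_dir:
    html_path = Path(tmp_dir) / "index.html"
    html_path.write_text(SAMPLE_HTML, encoding="utf-8")

    # BeautifulSoup accepts an open file object as well as a string
    with open(html_path, "r", encoding="utf-8") as html_file:
        soup = BeautifulSoup(html_file, "html.parser")

# The parsed tree stays usable after the file has been closed
title_text = soup.title.text
heading_text = soup.h1.text
print(title_text, heading_text)
```

Opening the file in a with block closes it automatically once parsing is done.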
Modifying the file
The prettify method re-indents the HTML code fetched from https://festive-knuth-1279a2.netlify.app/ so that it is easier to read. It puts each tag on its own line with consistent indentation, similar to what a formatter in an editor such as VS Code produces.
Example:
Python3
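from bs4 import BeautifulSoup
import requests as req

# Fetch the page, then parse the response text with lxml
Web = req.get('https://festive-knuth-1279a2.netlify.app/')
S = BeautifulSoup(Web.text, 'lxml')
print(S.prettify())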
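The effect of prettify can also be seen without a network request. The sketch below uses a hypothetical one-line HTML string and the built-in "html.parser" (an assumption, swapped in for lxml) to show that prettify returns a new, indented string and leaves the parsed tree as it was.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line markup standing in for a fetched page
raw_html = "<html><body><p>Hi</p></body></html>"

soup = BeautifulSoup(raw_html, "html.parser")

# prettify returns a new string; the parsed tree itself is not changed
pretty = soup.prettify()
print(pretty)

# Strip the indentation to see which pieces ended up on separate lines
pretty_lines = [line.strip() for line in pretty.splitlines()]

# Converting the tree back to a string still gives the compact markup
compact = str(soup)
```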
Removing a tag
A tag can be removed with the decompose method. In this example, the select_one method uses a CSS selector to pick the second li element from the index.html file, decompose removes it from the tree, and prettify prints the remaining body in readable form.
Example:
File Used: index.html
Python3
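from bs4 import BeautifulSoup

# Read the local file; the with block closes it afterwards
with open("index.html", "r") as HTMLFile:
    index = HTMLFile.read()

S = BeautifulSoup(index, 'lxml')

# Select the second li element and remove it from the tree
Tag = S.select_one('li:nth-of-type(2)')
Tag.decompose()
print(S.body.prettify())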
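The removal step can be checked on a small inline list. This is a sketch under assumptions: the three-item list is hypothetical, and "html.parser" stands in for lxml. It also guards against select_one returning None, which happens when the selector matches nothing.

```python
from bs4 import BeautifulSoup

# Hypothetical list with three items
html = "<html><body><ul><li>one</li><li>two</li><li>three</li></ul></body></html>"
soup = BeautifulSoup(html, "html.parser")

# select_one returns the first match, or None when nothing matches
second_item = soup.select_one("li:nth-of-type(2)")
if second_item is not None:
    # decompose removes the tag and its contents from the tree
    second_item.decompose()

remaining = [li.text for li in soup.find_all("li")]
print(remaining)  # ['one', 'three']
```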
Finding tags
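A tag can be accessed directly as an attribute of the parsed object, for example Parse.h1, and printed with print(). Each attribute returns the first tag with that name in the file.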
Example:
Python3
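from bs4 import BeautifulSoup

# Read the local file; the with block closes it afterwards
with open("index.html", "r") as HTMLFile:
    index = HTMLFile.read()

Parse = BeautifulSoup(index, 'lxml')

# Each attribute returns the first tag with that name
print(Parse.head)
print(Parse.h1)
print(Parse.h2)
print(Parse.h3)
print(Parse.li)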
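Two details of attribute access are easy to miss: only the first matching tag is returned, and a tag that does not exist gives None rather than an error. The sketch below illustrates both on a hypothetical inline page, parsed with "html.parser" as a stand-in for lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two h1 tags and no h3 tag
html = "<html><head><title>Demo</title></head><body><h1>First</h1><h1>Second</h1></body></html>"
soup = BeautifulSoup(html, "html.parser")

first_h1 = soup.h1  # only the first h1 is returned
missing = soup.h3   # no h3 in the page, so this is None

print(first_h1)  # <h1>First</h1>
print(missing)   # None
```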
Traversing tags
The recursiveChildGenerator method traverses the file recursively, yielding every element nested inside the parsed document. Checking the name attribute skips plain text so that only the names of tags are printed.
Example:
Python3
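from bs4 import BeautifulSoup

# Read the local file; the with block closes it afterwards
with open("index.html", "r") as HTMLFile:
    index = HTMLFile.read()

S = BeautifulSoup(index, 'lxml')

# Walk every nested element and print the name of each tag
for TraverseTags in S.recursiveChildGenerator():
    if TraverseTags.name:
        print(TraverseTags.name)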
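In Beautiful Soup 4, recursiveChildGenerator is an older name kept for compatibility, and the descendants attribute performs the same traversal. The sketch below uses descendants on a hypothetical inline page, with "html.parser" assumed in place of lxml and an isinstance check in place of the name test.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical page with a few nested tags
html = (
    "<html><head><title>T</title></head>"
    "<body><h1>A</h1><ul><li>x</li><li>y</li></ul></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# descendants yields tags and text nodes in document order; keep the tags only
tag_names = [node.name for node in soup.descendants if isinstance(node, Tag)]
print(tag_names)
# ['html', 'head', 'title', 'body', 'h1', 'ul', 'li', 'li']
```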
Parsing name and text attributes of tags
The name attribute gives the name of a tag and the text attribute gives the text it contains. The example prints both for the ul tag from the file, along with the HTML code of the tag.
Example:
Python3
from bs4 import BeautifulSoup

# Read the local file; the with block closes it afterwards
with open("index.html", "r") as HTMLFile:
    index = HTMLFile.read()

S = BeautifulSoup(index, 'lxml')
print(f'HTML: {S.ul}, name: {S.ul.name}, text: {S.ul.text}')
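The text attribute joins the text of all nested tags with nothing in between, which can be surprising for lists. As a sketch on a hypothetical two-item list (parsed with "html.parser" as an assumption), get_text with a separator keeps the items apart.

```python
from bs4 import BeautifulSoup

# Hypothetical two-item list
html = "<ul><li>x</li><li>y</li></ul>"
soup = BeautifulSoup(html, "html.parser")

list_tag = soup.ul
tag_name = list_tag.name                          # 'ul'
joined_text = list_tag.text                       # 'xy', item texts run together
separated = list_tag.get_text(separator=", ")     # 'x, y'

print(f"name: {tag_name}, text: {joined_text}, separated: {separated}")
```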
Finding Children of a tag
The children attribute gives the direct children of a tag. It also yields the whitespace text that sits between tags, so the condition e.name is not None is added to keep only the names of the tags from the file.
Example:
Python3
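from bs4 import BeautifulSoup

# Read the local file; the with block closes it afterwards
with open("index.html", "r") as HTMLFile:
    index = HTMLFile.read()

S = BeautifulSoup(index, 'lxml')
Attr = S.html

# Keep only real tags; the whitespace between them has no name
Attr_Tag = [e.name for e in Attr.children if e.name is not None]
print(Attr_Tag)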
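To make the whitespace issue concrete, the sketch below parses a hypothetical page that has a newline between its tags. It assumes "html.parser" rather than lxml and uses an isinstance check against Tag, which is one alternative to testing the name.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical page with a newline between the tags
html = "<html>\n<head></head>\n<body></body>\n</html>"
soup = BeautifulSoup(html, "html.parser")

# children includes the newline text nodes as well as the tags
all_children = list(soup.html.children)

# Filtering on Tag leaves only the real tags
tag_names = [child.name for child in all_children if isinstance(child, Tag)]
print(tag_names)  # ['head', 'body']
```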
Finding Children at all levels of a tag:
The descendants attribute gives all the descendants of a tag from the file, that is, its children at every level of nesting rather than only the direct ones.
Example:
Python3
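from bs4 import BeautifulSoup

# Read the local file; the with block closes it afterwards
with open("index.html", "r") as HTMLFile:
    index = HTMLFile.read()

S = BeautifulSoup(index, 'lxml')
Des = S.body

# Collect the names of tags at every level below body
Attr_Tag = [e.name for e in Des.descendants if e.name is not None]
print(Attr_Tag)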
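The difference between direct children and descendants shows up as soon as tags are nested. The following sketch compares the two on a hypothetical body, again assuming "html.parser" in place of lxml.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical body with two levels of nesting
html = "<html><body><div><p>a</p></div><ul><li>x</li></ul></body></html>"
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# Direct children stop at the first level
child_names = [c.name for c in body.children if isinstance(c, Tag)]

# Descendants continue into every nested tag, in document order
descendant_names = [d.name for d in body.descendants if isinstance(d, Tag)]

print(child_names)       # ['div', 'ul']
print(descendant_names)  # ['div', 'p', 'ul', 'li']
```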
Finding all elements of tags
Using find_all():
The find_all method finds every p tag in the file. The example prints the name and text of each one.
Example:
Python3
from bs4 import BeautifulSoup

# Read the local file; the with block closes it afterwards
with open("index.html", "r") as HTMLFile:
    index = HTMLFile.read()

S = BeautifulSoup(index, 'lxml')

# Print the name and text of every p tag
for tag in S.find_all('p'):
    print(f'{tag.name}: {tag.text}')
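find_all returns a list-like result, so it can be counted, sliced, or come back empty. The sketch below shows this on a hypothetical inline page (parsed with "html.parser" as an assumption), including the limit argument for capping the number of matches.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two paragraphs
html = "<html><body><h1>T</h1><p>first</p><p>second</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# All matches, in document order
paragraphs = [p.text for p in soup.find_all("p")]

# limit stops the search after the given number of matches
limited = soup.find_all("p", limit=1)

# A tag that is not present gives an empty result, not an error
no_match = soup.find_all("table")

print(paragraphs)     # ['first', 'second']
print(len(limited))   # 1
print(len(no_match))  # 0
```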
CSS selectors to find elements:
The select method takes a CSS selector and returns a list of every element that matches it. Here it finds the second li element from the file.
Example:
Python3
from bs4 import BeautifulSoup

# Read the local file; the with block closes it afterwards
with open("index.html", "r") as HTMLFile:
    index = HTMLFile.read()

S = BeautifulSoup(index, 'lxml')
print(S.select('li:nth-of-type(2)'))
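The nth-of-type selector counts within each parent, so a page with more than one list gives more than one match. The sketch below shows this on a hypothetical page with two lists, assuming "html.parser" in place of lxml, and contrasts select with select_one.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two separate lists
html = (
    "<html><body>"
    "<ul><li>a</li><li>b</li></ul>"
    "<ol><li>x</li><li>y</li></ol>"
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# select returns every match: the second li inside each list
second_items = [li.text for li in soup.select("li:nth-of-type(2)")]

# select_one returns only the first of those matches
first_match = soup.select_one("li:nth-of-type(2)").text

print(second_items)  # ['b', 'y']
print(first_match)   # b
```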
Last Updated: 16 Mar, 2021