Prerequisites: Introduction to Web Scraping
In this article, we discuss how to use the lxml Python library to scrape data from a webpage. lxml is built on top of libxml2, an XML parsing library written in C. Compared with other Python web scraping tools such as BeautifulSoup and Selenium, lxml has an advantage in performance: it reads and writes even large XML files in very little time, which makes data processing easier and much faster.
We will use the lxml library for web scraping and the requests library for making HTTP requests in Python. Both can be installed from the command line with the pip package installer for Python, for example with pip install lxml requests.
Getting data from an element on a webpage using lxml requires the use of XPath expressions.
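XPath works very much like a path in a traditional file system: just as a file is reached by naming each folder on the way to it, separated by slashes, an element is reached by naming each tag on the way to it.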
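Now consider a simple web page whose body contains headings (<h2>), a paragraph (<p>), and a link (<a>).
Like any HTML document, it can be represented as an XML tree, with <html> at the root, <body> beneath it, and those elements as children of <body>.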
For getting the text inside the <p> tag,
XPath: /html/body/p/text()
Result: This is the first paragraph
For getting the value of the href attribute of the anchor (<a>) tag,
XPath: /html/body/a/@href
For getting the text inside the second <h2> tag,
XPath: /html/body/h2[2]/text()
Result: Hello World
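The queries above can be tried out with lxml on a small page. The sketch below is only an illustration: the HTML is a hypothetical reconstruction that matches the results quoted above, and the first heading, the link text, and the link target are placeholders. Note the [2] predicate, which selects the second <h2>; without it the query returns the text of every <h2> under <body>.

```python
from lxml import html

# Hypothetical page, written only to match the results quoted in the text.
# The first heading, the link text, and the link target are placeholders.
PAGE = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h2>Welcome</h2>
    <p>This is the first paragraph</p>
    <a href="https://www.example.com">Example link</a>
    <h2>Hello World</h2>
  </body>
</html>
"""

# Parse the markup; for a full document this returns the <html> root element
tree = html.fromstring(PAGE)

# xpath() always returns a list, even when only one node matches
paragraph_text = tree.xpath("/html/body/p/text()")
link_target = tree.xpath("/html/body/a/@href")
second_heading = tree.xpath("/html/body/h2[2]/text()")
all_headings = tree.xpath("/html/body/h2/text()")

print(paragraph_text)
print(link_target)
print(second_heading)
print(all_headings)
```

Because xpath() returns a list, a single value is usually taken with an index such as second_heading[0] after checking that the list is not empty.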
To find the XPath for a particular element on a page:
- Right-click the element on the page and click Inspect.
- Right-click the highlighted element in the Elements tab.
- Choose Copy, then Copy XPath.
With the XPath in hand, the approach to scraping the page is:
- Use requests.get to retrieve the web page that contains the data.
- Use html.fromstring from the lxml package to parse the page content.
- Write the correct XPath query and pass it to the xpath method of the parsed tree to get the required element.
Below is a program based on the above approach which uses a particular URL.
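A minimal sketch of this approach might look like the following. The function names parse_xpath and scrape, the timeout parameter, and its default value are assumptions made for this illustration, not part of lxml or requests, and no particular URL is built in.

```python
import requests
from lxml import html


def parse_xpath(page_source, xpath_query):
    """Parse HTML (bytes or str) and return whatever the XPath query matches."""
    tree = html.fromstring(page_source)
    return tree.xpath(xpath_query)


def scrape(url, xpath_query, timeout=10):
    """Download a page and evaluate an XPath query against it.

    Hypothetical helper: the name and the timeout default are illustrative.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on HTTP errors such as 404
    # response.content is raw bytes, so lxml can detect the encoding itself
    return parse_xpath(response.content, xpath_query)
```

It could then be called as scrape("https://www.example.com", "//h1/text()"), where both the URL and the query are placeholders to be replaced with the page and the XPath copied from the browser. Keeping the parsing step in its own function makes it possible to test a query against saved HTML without making a network request.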
Another example, this time for an e-commerce website.
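Product pages usually repeat the same block of markup for every item, so a common pattern is to select each block first and then run short relative queries inside it. The sketch below assumes a hypothetical listing: the tag names, the class names, the products, and the prices are all invented for illustration, and a real site will need the XPaths copied from its own markup.

```python
from lxml import html

# Hypothetical product listing; real sites use their own tags and class names
LISTING = """
<html><body>
  <div class="product">
    <h3 class="name">Laptop</h3>
    <span class="price">$499.00</span>
  </div>
  <div class="product">
    <h3 class="name">Phone</h3>
    <span class="price">$299.50</span>
  </div>
</body></html>
"""

tree = html.fromstring(LISTING)

products = []
# "//" searches the whole document; the predicate filters on the class attribute
for card in tree.xpath('//div[@class="product"]'):
    # A leading "." keeps each query relative to the current product card
    name = card.xpath('./h3[@class="name"]/text()')[0].strip()
    price_text = card.xpath('./span[@class="price"]/text()')[0].strip()
    products.append({"name": name, "price": float(price_text.lstrip("$"))})

for product in products:
    print(product["name"], product["price"])
```

On a live site the LISTING string would be replaced by the content returned from requests.get, and each [0] lookup should be guarded in case a card is missing a name or a price.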