Python | Tools in the world of Web Scraping
Web page scraping can be done using multiple tools or frameworks in Python. There is a variety of options available for scraping data from a web page, each suiting different needs.
First, let’s understand the difference between web scraping and web crawling. Web crawling indexes the information on pages using bots, also known as crawlers. Web scraping, on the other hand, is an automated way of extracting specific information or content using bots, also known as scrapers.
Let’s look at some of the most commonly used web scraping tools for Python 3.
Among all the available frameworks and tools, only urllib comes pre-installed with Python (the Python 2 module urllib2 was split into urllib.request and urllib.error in Python 3). All other tools need to be installed before use. Let’s discuss these tools in detail.
- urllib (urllib2 in Python 2) :
urllib.request is a Python module used for fetching URLs. It offers a very simple interface in the form of the urlopen function, which can fetch URLs over different protocols such as HTTP and FTP.
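As a minimal sketch of the urlopen interface in Python 3, the helper below fetches a URL and returns its body as text. The function name fetch_text, the timeout default, and the User-Agent value are assumptions chosen for illustration, not part of the library.

```python
from urllib.request import Request, urlopen


def fetch_text(url, timeout=10.0):
    """Fetch a URL with urlopen and return the body decoded as text."""
    # A Request object lets us attach headers; this User-Agent is an arbitrary example value.
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        # Use the charset declared in the response, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_text("https://www.python.org/") should return the page's HTML as a string, provided the site is reachable.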
- Requests :
Requests does not come pre-installed with Python. It lets you send HTTP/1.1 requests. You can add headers, form data, multipart files and parameters using simple Python dictionaries and access the response data in the same way.
Installing requests can be done using pip.
pip install requests
# Using requests module
import requests
# get URL (placeholder address)
response = requests.get("https://example.com")
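The dictionary-based interface can be sketched as follows. The endpoint URL, the query parameters, and the User-Agent value are all made up for illustration. The request is only prepared, not sent, so the snippet runs without network access and shows how the dictionaries end up in the final request.

```python
import requests

# Hypothetical endpoint and values, used only for illustration.
URL = "https://example.com/search"
params = {"q": "web scraping", "page": 1}
headers = {"User-Agent": "example-scraper/0.1"}

# Build the request without sending it, to inspect how the dictionaries are applied.
prepared = requests.Request("GET", URL, params=params, headers=headers).prepare()

print(prepared.url)                    # query string built from the params dictionary
print(prepared.headers["User-Agent"])  # header taken from the headers dictionary
```

Sending the same request with requests.get(URL, params=params, headers=headers) returns a Response object whose headers attribute can be read like a dictionary as well.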
- BeautifulSoup :
Beautiful Soup is a parsing library that can use different parsers. Beautiful Soup’s default parser comes from Python’s standard library. It creates a parse tree that can be used to extract data from HTML: a toolkit for dissecting a document and extracting what you need. It automatically converts incoming documents to Unicode and outgoing documents to UTF-8.
pip can be used to install BeautifulSoup :
pip install beautifulsoup4
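# importing BeautifulSoup from
# bs4 module
from bs4 import BeautifulSoup
# importing requests
import requests
# get URL (placeholder address)
response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")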
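To show the parse tree without depending on a live site, the sketch below parses an inline HTML string. The markup and the variable names are invented for illustration, and "html.parser" selects the parser from Python's standard library.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page.
html = """
<html><head><title>Sample page</title></head>
<body>
  <h1 id="main">Tools</h1>
  <a href="/requests">Requests</a>
  <a href="/lxml">lxml</a>
</body></html>
"""

# Build the parse tree with the standard-library parser.
soup = BeautifulSoup(html, "html.parser")

title = soup.title.string                         # text inside <title>
heading = soup.find("h1", id="main").get_text()   # first <h1> with id="main"
links = [a["href"] for a in soup.find_all("a")]   # href of every <a> tag

print(title, heading, links)
```

The same calls work on response.text from a real request; only the source of the HTML string changes.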
- Lxml :
lxml is a high-performance, production-quality HTML and XML parsing library. If you need speed, go for lxml. lxml has many modules, one of which is etree, which is responsible for creating elements and building structures from them.
You can start using lxml by installing it as a Python package with pip :
pip install lxml
# importing etree from lxml module
from lxml import etree
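A short sketch of what etree does: it creates elements, nests them into a structure, serializes the result, and queries it again with XPath. The element names (catalog, book, title) are arbitrary examples.

```python
from lxml import etree

# Create elements and nest them into a small tree.
root = etree.Element("catalog")
book = etree.SubElement(root, "book", id="1")
title = etree.SubElement(book, "title")
title.text = "Web Scraping"

# Serialize the tree to bytes.
xml_bytes = etree.tostring(root)

# Parse the bytes back and query the tree with XPath.
parsed = etree.fromstring(xml_bytes)
titles = parsed.xpath("//book[@id='1']/title/text()")

print(xml_bytes)
print(titles)
```

For scraping, the parsing half is the part used most often: fromstring (or etree.HTML for HTML input) turns downloaded markup into a tree that XPath expressions can search.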
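- Selenium :
Selenium automates web browsers through a driver such as chromedriver, which is useful for pages that load their content with JavaScript.
pip can be used to install Selenium :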
pip install selenium
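# importing webdriver from selenium module
from selenium import webdriver
# path for chromedriver (placeholder, adjust to your system)
chromedriver_path = "/path/to/chromedriver"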
- MechanicalSoup :
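MechanicalSoup automates interaction with websites, such as following links and submitting forms, and is built on Requests and BeautifulSoup.
You can use the following command to install MechanicalSoup :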
pip install MechanicalSoup
# importing mechanicalsoup
import mechanicalsoup
- Scrapy :
Scrapy is an open-source, collaborative web crawling framework for extracting the data you need from websites. It was originally designed for web scraping. It can be used to manage requests, preserve user sessions, follow redirects and handle output pipelines.
There are two ways to install Scrapy :
- Using pip :
pip install scrapy
- Using Anaconda : First install Anaconda or Miniconda, and then use the following command to install Scrapy :
conda install -c conda-forge scrapy
A Scrapy spider is a Python file that imports the scrapy module and defines a parse function, which processes each downloaded response.
Use the following command to run a Scrapy spider :
scrapy runspider samplescapy.py
The modules discussed above are the most commonly used scrapers for Python 3. There are a few more, such as Mechanize and Scrapemark, but these were written for Python 2 and long lacked Python 3 support.