Multithreaded crawler in Python
This article describes how to build a simple multithreaded web crawler in Python.
bs4: Beautiful Soup (bs4) is a Python library for extracting data from HTML and XML files. To install it, run the following command in a terminal.
pip install bs4
requests: This library lets you send HTTP/1.1 requests with very little code. To install it, run the following command in a terminal.
pip install requests
Step 1: First, import all the libraries the crawler needs. With Python 3, every library used here except BeautifulSoup and requests ships with the standard library. If you haven't installed those two yet, install them using the commands given above.
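As a rough sketch, the imports for Step 1 could look like the following. The exact set is an assumption inferred from the steps described in this article (thread pool, queue, URL handling, HTTP requests, HTML parsing), not a confirmed listing.

```python
# Standard-library modules (assumed from the steps in this article).
import multiprocessing                              # name of the running process
from concurrent.futures import ThreadPoolExecutor   # pool of worker threads
from queue import Queue, Empty                      # crawl frontier
from urllib.parse import urljoin, urlparse          # URL handling

# Third-party libraries installed with pip above.
import requests
from bs4 import BeautifulSoup
```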
Step 2: Create a main program. In it, create an object of the class MultiThreadedCrawler, pass the seed URL to its parameterized constructor, and call the run_web_crawler() method.
Step 3: Create a class named MultiThreadedCrawler and initialize all the variables in its constructor. Assign the base URL to the instance variable seed_url, then build the absolute root URL from the scheme (HTTPS) and the network location.
To execute the crawl frontier tasks concurrently, use multithreading. Create a ThreadPoolExecutor object with max workers set to 5, so that at most 5 threads run at a time. To avoid visiting the same web page twice, keep the history of visited pages in a set.
Create a queue to store all the URLs of the crawl frontier, and put the seed URL in it as the first item.
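The constructor described in Step 3 might be sketched as below. Only seed_url, scraped_pages, and the class name come from this article; the attribute names root_url, pool, and crawl_queue are assumptions chosen for illustration, and the scheme is taken from the seed URL itself rather than hard-coded.

```python
from concurrent.futures import ThreadPoolExecutor
from queue import Queue
from urllib.parse import urlparse


class MultiThreadedCrawler:
    def __init__(self, seed_url):
        self.seed_url = seed_url
        # Rebuild the root as scheme://netloc, dropping path and query.
        parts = urlparse(seed_url)
        self.root_url = f"{parts.scheme}://{parts.netloc}"  # assumed name
        # At most 5 worker threads fetch pages at the same time.
        self.pool = ThreadPoolExecutor(max_workers=5)       # assumed name
        # History of visited pages, used to avoid duplicate visits.
        self.scraped_pages = set()
        # Crawl frontier, seeded with the starting URL.
        self.crawl_queue = Queue()                          # assumed name
        self.crawl_queue.put(self.seed_url)
```

For a seed such as https://www.example.com/a/b?x=1, this sketch stores https://www.example.com as the root URL and leaves the full seed URL as the first item in the queue.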
Step 4: Create a method named run_web_crawler(). To keep adding links to the frontier and extracting information, use an infinite while loop, and display the name of the currently executing process.
Get the next URL from the crawl frontier, waiting up to 60 seconds for the lookup, and check whether that URL has already been visited. If it has not, format the current URL, add it to the scraped_pages set to record it in the history of visited pages, and submit the page-scraping method together with the target URL to the thread pool.
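The loop in Step 4 could look roughly like this. It is written as a standalone function so it can run without a network: scrape_page is a stand-in for whatever callable fetches a page, and returning when the queue stays empty for the whole timeout is an assumption about how the loop ends.

```python
import multiprocessing
from concurrent.futures import ThreadPoolExecutor
from queue import Queue, Empty


def run_web_crawler(crawl_queue, scraped_pages, pool, scrape_page, timeout=60):
    """Hand every unvisited URL in the frontier to a worker thread."""
    futures = []
    while True:
        try:
            # Block for up to `timeout` seconds waiting for the next URL.
            target_url = crawl_queue.get(timeout=timeout)
        except Empty:
            # Assumption: an empty frontier after the timeout ends the crawl.
            return futures
        if target_url not in scraped_pages:
            print(multiprocessing.current_process().name, "scraping", target_url)
            scraped_pages.add(target_url)
            # The pool picks a free thread to run scrape_page(target_url).
            futures.append(pool.submit(scrape_page, target_url))
```

Returning the futures is a convenience of this sketch; it lets the caller inspect results or attach callbacks once the loop ends.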
Step 5: Send the request, allowing 3 seconds to establish the connection (the handshake) and up to 30 seconds to wait for the response. Once the request succeeds, return the result.
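One way to express the two timeouts of Step 5 is the (connect, read) tuple accepted by requests.get. The function name scrape_page, the injectable get parameter, and returning None on failure are assumptions made for this sketch.

```python
import requests


def scrape_page(url, get=requests.get):
    """Fetch `url`; return the response, or None if the request fails."""
    try:
        # timeout=(connect, read): 3 s to connect, 30 s to wait for data.
        return get(url, timeout=(3, 30))
    except requests.RequestException:
        # Covers connection errors, timeouts, and invalid URLs.
        return None
```

The get parameter exists only so the function can be exercised with a fake in tests; normal callers would simply use scrape_page(url).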
Step 6: Create a method named scrape_info() and pass the web page data to BeautifulSoup, which organizes and formats messy web data by fixing bad HTML and presents it in an easily traversable structure.
Using the BeautifulSoup object, extract all the text present in the HTML document.
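A minimal sketch of Step 6, assuming the page is passed in as an HTML string and that the built-in html.parser backend is acceptable (the article does not say which parser is used):

```python
from bs4 import BeautifulSoup


def scrape_info(html):
    """Return all visible text of an HTML document as one string."""
    # "html.parser" ships with Python, so no extra parser is required.
    soup = BeautifulSoup(html, "html.parser")
    # Strip each text fragment and join the fragments with single spaces.
    return soup.get_text(separator=" ", strip=True)


print(scrape_info("<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"))
# prints: Title Hello world
```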
Step 7: Create a method named parse_links(). Using the BeautifulSoup object, extract all the anchor tags present in the HTML document. soup.find_all('a', href=True) returns a list of all the anchor tags in the web page that have an href attribute. Store these tags in a list named anchor_Tags. For each anchor tag in anchor_Tags, retrieve the value of its href attribute using link['href']. For each retrieved URL, check whether it is an absolute URL or a relative URL.
- Relative URL: a URL without the protocol name and root URL.
- Absolute URL: a URL with the protocol name, root URL, and document name.
If it is a relative URL, use the urljoin method to turn it into an absolute URL from the base URL and the relative URL. Then check whether the URL has already been visited. If it has not, put it in the crawl queue.
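The link handling in Step 7 might be sketched as follows. The parameter names are hypothetical, and keeping only http and https links is an added assumption (it filters out targets such as mailto:). Note that urljoin leaves absolute URLs unchanged, so both kinds of link can go through the same call.

```python
from queue import Queue
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup


def parse_links(html, root_url, scraped_pages, crawl_queue):
    """Queue every unvisited link found in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    # Only anchor tags that actually carry an href attribute.
    anchor_tags = soup.find_all("a", href=True)
    for link in anchor_tags:
        # Relative URLs are resolved against root_url; absolute ones pass through.
        url = urljoin(root_url, link["href"])
        # Assumption: skip non-HTTP targets such as mailto: or javascript:.
        if urlparse(url).scheme not in ("http", "https"):
            continue
        if url not in scraped_pages:
            crawl_queue.put(url)
```

With a root URL of https://www.example.com, a link to /about would be queued as https://www.example.com/about.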
Step 8: To extract the links, call the parse_links() method and pass it the result. To extract the content, call the scrape_info() method and pass it the result.
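One possible way to wire Step 8 to the thread pool is a completion callback on each submitted job. The article does not say how the two methods are triggered, so the use of add_done_callback, the name post_scrape_callback, and the status-code check are all assumptions of this sketch; the response is modelled as any object with status_code and text attributes.

```python
from concurrent.futures import ThreadPoolExecutor
from types import SimpleNamespace


def make_post_scrape_callback(parse_links, scrape_info):
    """Build a callback that processes a finished fetch job."""
    def post_scrape_callback(future):
        result = future.result()
        # Only successful responses are parsed (assumed policy).
        if result is not None and result.status_code == 200:
            parse_links(result.text)   # extract the links
            scrape_info(result.text)   # extract the content
    return post_scrape_callback


# Usage with a fake response instead of a real HTTP request.
pages = []
callback = make_post_scrape_callback(pages.append, pages.append)
with ThreadPoolExecutor(max_workers=1) as pool:
    job = pool.submit(lambda: SimpleNamespace(status_code=200, text="<p>hi</p>"))
    job.add_done_callback(callback)
```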
Below is the complete implementation: