Web scraping is widely used in industrial applications today. Whether in natural language understanding or data analytics, scraping data from websites is a core component of many such applications. Web scraping means extracting large amounts of contextual text from a set of websites for different uses. This project can also be extended further, for example to topic- or theme-based text summarization, scraping news from news websites, or scraping images to train a model.
To start with, let us discuss the libraries that we are going to use in this project.
- requests: A library for sending HTTP/1.1 requests easily. Using the requests.get method, we can fetch a URL's HTML content.
- urlparse: Provides a standard interface to break a URL down into components such as the network location, addressing scheme, path, etc.
- urljoin: Allows us to join a base URL with a relative URL to form an absolute URL.
- beautifulsoup: A Python library for extracting data from HTML and XML files. We can convert an HTML page into a BeautifulSoup object and then extract HTML tags along with their contents.
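A quick demo of how these pieces fit together. The HTML snippet, base URL, and domains below are made-up examples; in the real crawler, requests.get(url).content would supply the markup that BeautifulSoup parses:

```python
from urllib.parse import urlparse, urljoin

from bs4 import BeautifulSoup

# A static snippet stands in for a fetched page; in the crawler,
# requests.get(input_url).content would supply this markup.
html = ('<html><body><a href="/about">About</a>'
        '<a href="https://example.org/docs">Docs</a></body></html>')
base_url = "https://www.example.com/index.html"

soup = BeautifulSoup(html, "html.parser")
for anchor in soup.find_all("a"):
    absolute = urljoin(base_url, anchor["href"])   # resolve relative links
    print(absolute, "->", urlparse(absolute).netloc)
# prints:
# https://www.example.com/about -> www.example.com
# https://example.org/docs -> example.org
```

Note how urljoin leaves the already-absolute link untouched but resolves the relative one against the base URL.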
Next, we will discuss how to install these libraries. Note that if you have pip3 installed on your system, you need to use pip3 instead of pip.
```
pip install requests
pip install bs4
```
Next, let’s discuss the various aspects and features of the project.
- Given an input URL and a depth up to which the crawler needs to crawl, we extract all the URLs and categorize them into internal and external URLs.
- Internal URLs are those that have the same domain name as the input URL; external URLs are those with a different domain name.
- We check the validity of each extracted URL; only URLs with a valid structure are considered.
- A depth of 0 means that only the input URL is printed. A depth of 1 means that all the URLs inside the input URL are printed, and so on.
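The validity check and the internal/external split can be sketched with urlparse alone. The helper names is_valid and classify are illustrative choices, not part of the project's API:

```python
from urllib.parse import urlparse


def is_valid(url):
    # A URL is structurally valid only if it has both a scheme
    # (e.g. https) and a network location (the domain).
    parsed = urlparse(url)
    return bool(parsed.scheme) and bool(parsed.netloc)


def classify(url, base_domain):
    # Internal links share the input URL's domain; everything else
    # is external.
    return "internal" if urlparse(url).netloc == base_domain else "external"


print(is_valid("https://www.geeksforgeeks.org/machine-learning/"))  # True
print(is_valid("not-a-url"))                                        # False
print(classify("https://www.geeksforgeeks.org/python/",
               "www.geeksforgeeks.org"))                            # internal
```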
- First, we import the installed libraries.
- Then, we create two empty sets called internal_links and external_links, which store internal and external links separately and ensure that they do not contain duplicates.
- We then create a method called level_crawler that takes an input URL, crawls it, and displays all the internal and external links using the following steps:
- Define a set called url to temporarily store the URLs.
- Extract the domain name of the URL using the urlparse library.
- Create a BeautifulSoup object using the HTML parser.
- Extract all the anchor tags from the BeautifulSoup object.
- Get the href attribute from each anchor tag; if it is empty, skip it.
- Using the urljoin method, create the absolute URL.
- Check the validity of the URL.
- If the URL is valid but the input URL's domain does not appear in it, and it is not already in the external links set, add it to the external links set.
- Otherwise, if it is not already in the internal links set, add it there, print it, and put it in the temporary url set.
- Return the temporary url set, which contains the visited internal links. This set will be used later on.
- If the depth is 0, we print the URL as it is. If the depth is 1, we call the level_crawler method defined above.
- Otherwise, we perform a breadth-first search (BFS) traversal, treating the pages reachable from a URL as a tree: at the first level we have the input URL; at the next level, all the URLs inside the input URL; and so on.
- We create a queue and append the input URL to it. We then pop a URL and insert all the URLs inside it into the queue. We do this until all the URLs at a particular level have been parsed, and we repeat the process as many times as the input depth.
The driver code supplies the input URL and the crawl depth:

```
url = "https://www.geeksforgeeks.org/machine-learning/"
depth = 1
```
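Putting the steps above together, the full crawler might look like the sketch below. The names level_crawler, internal_links, and external_links follow the walkthrough; the helper crawl and the set temp_urls are my own naming choices, and the crawl call is left commented out because it fetches live pages:

```python
from urllib.parse import urlparse, urljoin

import requests
from bs4 import BeautifulSoup

internal_links = set()
external_links = set()


def is_valid(url):
    # A URL is structurally valid only if it has a scheme and a domain.
    parsed = urlparse(url)
    return bool(parsed.scheme) and bool(parsed.netloc)


def level_crawler(input_url):
    temp_urls = set()
    current_url_domain = urlparse(input_url).netloc

    # Fetch the page and parse it with BeautifulSoup's HTML parser.
    soup = BeautifulSoup(requests.get(input_url).content, "html.parser")
    for anchor in soup.find_all("a"):
        href = anchor.attrs.get("href")
        if not href:
            continue  # skip empty href attributes

        href = urljoin(input_url, href)  # build an absolute URL
        parsed = urlparse(href)
        # Keep only the scheme, domain, and path.
        href = parsed.scheme + "://" + parsed.netloc + parsed.path
        if not is_valid(href):
            continue

        if current_url_domain not in href:
            if href not in external_links:
                print("External - {}".format(href))
                external_links.add(href)
        elif href not in internal_links:
            print("Internal - {}".format(href))
            internal_links.add(href)
            temp_urls.add(href)
    return temp_urls


def crawl(url, depth):
    if depth == 0:
        print("Input - {}".format(url))
    elif depth == 1:
        level_crawler(url)
    else:
        # BFS over the link tree: crawl every URL discovered at the
        # previous level, one level at a time, depth times.
        queue = [url]
        for _ in range(depth):
            next_queue = []
            for link in queue:
                next_queue.extend(level_crawler(link))
            queue = next_queue


url = "https://www.geeksforgeeks.org/machine-learning/"
depth = 1
# crawl(url, depth)  # uncomment to run against the live site
```

Uncommenting the last line crawls the page one level deep, printing each internal and external link exactly once thanks to the two sets.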