
Web crawling using Breadth First Search at a specified depth

Web scraping is used extensively in industrial applications today. Whether in natural language understanding or data analytics, scraping data from websites is a core part of many such applications. Web scraping means extracting large amounts of contextual text from a set of websites for different uses. This project can also be extended to further use cases such as topic- or theme-based text summarization, news scraping from news websites, scraping images for training a model, and so on.

Libraries Used:

To start with, let us look at the libraries we are going to use in this project:

  • requests, used to download the HTML content of a web page.
  • bs4 (BeautifulSoup), used to parse the HTML and extract the anchor tags.
  • urllib.parse, used to build absolute URLs (urljoin) and split URLs into their components (urlparse); this module ships with Python, so it needs no separate installation.

Installation:

Next, we will install these libraries. Note that if you have pip3 installed on your system, use pip3 instead of pip.

pip install requests
pip install bs4
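
If the installation succeeds, both packages can be imported directly. Below is a quick sanity check (the printed version numbers will depend on your environment):

# Quick check that the installed packages import correctly
import requests
import bs4

print(requests.__version__)
print(bs4.__version__)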

Features:

Next, let’s discuss the various aspects and features of the project.

  1. Given an input URL and a depth up to which the crawler needs to crawl, we extract all the URLs on each page and categorize them into internal and external URLs.
  2. Internal URLs are those which have the same domain name as the input URL. External URLs are those which have a different domain name from the input URL.
  3. We check the validity of every extracted URL; only URLs with a valid structure (both a scheme and a domain) are considered (see the sketch after this list).
  4. A depth of 0 means that only the input URL is printed. A depth of 1 means that all the URLs inside the input URL are printed, and so on.
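
As an illustration of points 2 and 3, the snippet below checks a single URL for validity and classifies it as internal or external relative to an input URL using urlparse. The classify helper and the example URLs are only for illustration; the complete program later uses a simpler substring check on the domain.

# Sketch: validity check plus internal/external classification
# (classify is an illustrative helper, not part of the final program)
from urllib.parse import urlparse

def classify(input_url, candidate_url):
    input_domain = urlparse(input_url).netloc
    parsed = urlparse(candidate_url)

    # A URL is considered valid only if it has both a scheme and a domain
    if not (parsed.scheme and parsed.netloc):
        return "invalid"

    # Same domain as the input URL -> internal, otherwise external
    return "internal" if parsed.netloc == input_domain else "external"

base = "https://www.geeksforgeeks.org/machine-learning/"
print(classify(base, "https://www.geeksforgeeks.org/some-article/"))  # internal
print(classify(base, "https://www.python.org/"))                      # external
print(classify(base, "mailto:someone@example.com"))                   # invalid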

Approach:

  1. First, we import the installed libraries.
  2. Then, we create two empty sets called links_intern and links_extern, which store internal and external links separately and ensure that they do not contain duplicates.
  3. We then create a method called level_crawler which takes an input URL, crawls it, and displays all the internal and external links using the following steps:
    • Define a set called temp_urls to temporarily store the URLs.
    • Extract the domain name of the URL using urlparse.
    • Create a BeautifulSoup object using the HTML parser.
    • Extract all the anchor tags from the BeautifulSoup object.
    • Get the href attribute from each anchor tag; if it is empty or missing, skip it.
    • Using the urljoin method, create the absolute URL.
    • Check the validity of the URL.
    • If the URL is valid, does not contain the input URL's domain, and is not already in the external links set, add it to the external links set.
    • Otherwise, if it contains the input URL's domain and is not already in the internal links set, print it, add it to the internal links set, and put it in the temporary URL set.
    • Return the temporary URL set, which holds the internal links visited; this set is used later for the BFS traversal.
  4. If the depth is 0, we print the input URL as it is. If the depth is 1, we call the level_crawler method defined above.
  5. Otherwise, we perform a breadth first search (BFS) traversal, treating the linked URL pages as a tree structure: at the first level we have the input URL, at the next level all the URLs inside the input URL, and so on.
  6. We create a queue and append the input URL to it. We then pop a URL and insert all the URLs inside it into the queue, repeating until every URL at the current level has been parsed. The whole process is repeated as many times as the input depth (see the sketch after this list).
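
Before looking at the complete program, here is a minimal sketch of the queue-based, level-by-level BFS from steps 5 and 6, using a toy adjacency dictionary in place of real web pages (the names graph and crawl_levels are illustrative, not part of the final program):

# Level-by-level BFS sketch; graph stands in for "links found on a page"
# (level_crawler plays that role in the real program below)
from collections import deque

graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["E"],
    "D": [],
    "E": [],
}

def crawl_levels(start, depth):
    queue = deque([start])
    for level in range(1, depth + 1):
        # Process every URL currently in the queue (one whole level)
        for _ in range(len(queue)):
            node = queue.popleft()
            children = graph[node]  # level_crawler(node) in the real program
            print("depth", level, "->", children)
            queue.extend(children)

crawl_levels("A", 2)
# depth 1 -> ['B', 'C']
# depth 2 -> ['D']
# depth 2 -> ['E']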

Below is the complete program of the above approach:




# Import libraries
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup
import requests
  
  
# Input URL and crawl depth
input_url = "https://www.geeksforgeeks.org/machine-learning/"
depth = 1

# Set for storing urls with same domain
links_intern = set()

# Set for storing urls with different domain
links_extern = set()
  
  
# Method for crawling a url at next level
def level_crawler(input_url):
    temp_urls = set()
    current_url_domain = urlparse(input_url).netloc
  
    # Creates beautiful soup object to extract html tags
    beautiful_soup_object = BeautifulSoup(
        requests.get(input_url).content, "html.parser")
  
    # Access all anchor tags from input 
    # url page and divide them into internal
    # and external categories
    for anchor in beautiful_soup_object.findAll("a"):
        href = anchor.attrs.get("href")
        if href is not None and href != "":
            href = urljoin(input_url, href)
            href_parsed = urlparse(href)
            href = href_parsed.scheme
            href += "://"
            href += href_parsed.netloc
            href += href_parsed.path
            final_parsed_href = urlparse(href)
            is_valid = bool(final_parsed_href.scheme) and bool(
                final_parsed_href.netloc)
            if is_valid:
                if current_url_domain not in href and href not in links_extern:
                    print("Extern - {}".format(href))
                    links_extern.add(href)
                if current_url_domain in href and href not in links_intern:
                    print("Intern - {}".format(href))
                    links_intern.add(href)
                    temp_urls.add(href)
    return temp_urls
  
  
if(depth == 0):
    print("Intern - {}".format(input_url))
  
elif(depth == 1):
    level_crawler(input_url)
  
else:
    # We have used a BFS approach
    # considering the structure as
    # a tree. It uses a queue based
    # approach to traverse
    # links upto a particular depth.
    queue = []
    queue.append(input_url)
    for j in range(depth):
        for count in range(len(queue)):
            url = queue.pop(0)
            urls = level_crawler(url)
            for i in urls:
                queue.append(i)

Input:

url = "https://www.geeksforgeeks.org/machine-learning/"
depth = 1

Output:

Each discovered link is printed with an "Intern - " or "Extern - " prefix, following the print statements in the program above.