Web scraping is widely used in industrial applications today. Whether in natural language understanding or data analytics, scraping data from websites is a key component of many such applications. Web scraping means extracting large amounts of contextual text from a set of websites for different uses. This project can also be extended to further uses such as topic- or theme-based text summarization, scraping news from news websites, scraping images to train a model, etc.
To start with, let us discuss the libraries that we are going to use in this project.
- requests: A library for sending HTTP/1.1 requests easily. Using the requests.get method, we can fetch a URL's HTML content.
- urlparse: Provides a standard interface to break a URL down into components such as the network location, addressing scheme, path, etc.
- urljoin: Allows us to join a base URL with a relative URL to form an absolute URL.
- BeautifulSoup: A Python library for extracting data from HTML and XML files. We can convert an HTML page into a BeautifulSoup object and then extract HTML tags along with their contents.
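As a quick sanity check, the snippet below shows what each of these helpers contributes (a minimal sketch; the example URLs and HTML are arbitrary, and the network-fetching step with requests is skipped here):

```python
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup

# urlparse breaks a URL into its components.
parts = urlparse("https://www.example.com/a/b?x=1")
print(parts.scheme)   # https
print(parts.netloc)   # www.example.com
print(parts.path)     # /a/b

# urljoin resolves a relative link against a base URL.
print(urljoin("https://www.example.com/a/", "b.html"))
# https://www.example.com/a/b.html

# BeautifulSoup parses HTML and lets us pull out tags and attributes.
soup = BeautifulSoup('<a href="/about">About</a>', "html.parser")
print(soup.find("a").get("href"))  # /about
```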
Next, we will install these libraries. Note that if you have pip3 installed on your system, you need to use pip3 instead of pip.

pip install requests
pip install bs4
Next, let’s discuss the various aspects and features of the project.
- Given an input URL and a depth up to which the crawler needs to crawl, we extract all the URLs and categorize them as internal or external.
- Internal URLs are those that have the same domain name as the input URL; external URLs are those with a different domain name.
- We check the validity of each extracted URL; a URL is considered only if it has a valid structure.
- A depth of 0 means that only the input URL is printed; a depth of 1 means that all the URLs inside the input URL are printed, and so on.
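The validity check mentioned above can be sketched as follows (a minimal version, where a URL is treated as structurally valid only when it has both a scheme and a network location; the function name is illustrative):

```python
from urllib.parse import urlparse

def is_valid(url):
    # A URL is considered structurally valid only if it has
    # both a scheme (e.g. https) and a network location (domain).
    parsed = urlparse(url)
    return bool(parsed.scheme) and bool(parsed.netloc)

print(is_valid("https://www.geeksforgeeks.org"))  # True
print(is_valid("not-a-url"))                      # False
```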
- First, we import the installed libraries.
- Then, we create two empty sets called internal_links and external_links, which store internal and external links separately and ensure that they contain no duplicates.
- We then create a method called level_crawler which takes an input URL, crawls it, and displays all the internal and external links using the following steps:
- Define a set called url to temporarily store the URLs crawled at this level.
- Extract the domain name of the URL using the urlparse library.
- Create a BeautifulSoup object from the page's HTML using the HTML parser.
- Extract all the anchor tags from the BeautifulSoup object.
- Get the href attribute from each anchor tag; if it is empty, skip that anchor.
- Using the urljoin method, create the absolute URL.
- Check the validity of the URL.
- If the URL is valid, the input URL's domain does not appear in the href, and the href is not already in the external links set, add it to the external links set.
- Otherwise, add it to the internal links set if it is not already there, print it, and put it in the temporary URL set.
- Return the temporary URL set, which contains the internal links visited at this level. This set will be used later on.
- If the depth is 0, we print the URL as it is. If the depth is 1, we call the level_crawler method defined above.
- Otherwise, we perform a breadth-first search (BFS) traversal, treating the pages reachable from a URL as a tree structure: at the first level we have the input URL, at the next level all the URLs inside the input URL, and so on.
- We create a queue and append the input URL to it. We then pop a URL and insert all the URLs found inside it into the queue. We do this until every URL at a particular level has been parsed, and we repeat the whole process as many times as the input depth.
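The queue-based, level-by-level traversal can be sketched in isolation on a toy link graph (the dictionary below stands in for the links found on each page, so no HTTP requests are needed; names are illustrative):

```python
from collections import deque

# Toy link graph: page -> links found on that page.
pages = {
    "/home": ["/a", "/b"],
    "/a": ["/c"],
    "/b": [],
    "/c": [],
}

def bfs_crawl(start, depth):
    visited = []
    queue = deque([start])
    for _ in range(depth):
        # Process every URL discovered at the previous level
        # before moving one level deeper.
        for _ in range(len(queue)):
            url = queue.popleft()
            visited.append(url)
            queue.extend(pages.get(url, []))
    return visited

print(bfs_crawl("/home", 2))  # ['/home', '/a', '/b']
```

In the real crawler, the lookup into the pages dictionary is replaced by a call to level_crawler, which fetches the page and returns its internal links.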
Below is the complete program of the above approach:
url = "https://www.geeksforgeeks.org/machine-learning/"
depth = 1
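Putting the pieces together, one way to assemble the full crawler from the steps above looks like this (a sketch reconstructed from the description; helper names such as is_valid, level_crawler, and crawl are illustrative, and the final call is left commented out because it performs live HTTP requests):

```python
import requests
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup

# Sets automatically discard duplicate links.
internal_links = set()
external_links = set()

def is_valid(link):
    # A link is structurally valid if it has both a scheme and a domain.
    parsed = urlparse(link)
    return bool(parsed.scheme) and bool(parsed.netloc)

def level_crawler(input_url):
    temp_urls = set()
    current_url_domain = urlparse(input_url).netloc

    # Fetch the page and parse its HTML.
    soup = BeautifulSoup(requests.get(input_url).content, "html.parser")

    for anchor in soup.findAll("a"):
        href = anchor.attrs.get("href")
        if href is None or href == "":
            continue  # skip anchors without an href
        # Resolve relative links against the page URL.
        href = urljoin(input_url, href)
        parsed_href = urlparse(href)
        # Drop query strings and fragments.
        href = parsed_href.scheme + "://" + parsed_href.netloc + parsed_href.path
        if not is_valid(href):
            continue
        if current_url_domain not in href:
            # Different domain: external link.
            if href not in external_links:
                print("External - {}".format(href))
                external_links.add(href)
        else:
            # Same domain: internal link, kept for the next level.
            if href not in internal_links:
                print("Internal - {}".format(href))
                internal_links.add(href)
                temp_urls.add(href)
    return temp_urls

def crawl(input_url, depth):
    if depth == 0:
        print(input_url)
    elif depth == 1:
        level_crawler(input_url)
    else:
        # BFS: crawl level by level up to the given depth.
        queue = [input_url]
        for _ in range(depth):
            next_level = []
            while queue:
                next_level.extend(level_crawler(queue.pop(0)))
            queue = next_level

url = "https://www.geeksforgeeks.org/machine-learning/"
depth = 1
# Uncomment to run a live crawl (performs HTTP requests):
# crawl(url, depth)
```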