BeautifulSoup – Scraping Links from HTML
Prerequisite: Implementing Web Scraping in Python with BeautifulSoup
In this article, we will see how to extract all the links from a web page (given its URL) or from an HTML document using Python.
- bs4 (BeautifulSoup): A Python library that makes it easy to scrape information from web pages by extracting data from HTML and XML files. It is not part of the Python standard library, so it must be installed separately. To install it, type the following command in your terminal.
pip install bs4
- requests: This library lets you send HTTP requests and fetch web page content very easily. It is not part of the Python standard library either, so it must also be installed separately. To install it, type the following command in your terminal.
pip install requests
Steps to be followed:
- Import the required libraries (bs4 and requests)
- Create a function that gets the HTML document from the URL by passing the URL to the requests.get() method.
- Create a parse tree object, i.e. a soup object, by calling BeautifulSoup() with the HTML document fetched above and the name of Python's built-in HTML parser ("html.parser").
- Find all the a (anchor) tags in the BeautifulSoup object, for example with find_all("a"), to collect the links.
- Get the actual URL from each anchor tag object by calling its get() method with "href" as the argument.
- Similarly, you can get the title of each link by calling get() with "title" as the argument; it returns None when the attribute is absent.
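The steps above can be put together roughly as follows. This is only a minimal sketch: the function names get_html_document and extract_links, the 10-second timeout, and the sample HTML are illustrative assumptions rather than part of the original article. The sample string is parsed directly so that the example runs without network access.

```python
import requests
from bs4 import BeautifulSoup


def get_html_document(url):
    """Fetch the page at `url` and return its HTML as text (hypothetical helper)."""
    # The timeout value is an assumption; adjust it to your needs.
    response = requests.get(url, timeout=10)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text


def extract_links(html_document):
    """Return a list of (href, title) pairs, one per anchor tag (hypothetical helper)."""
    # Build the parse tree with Python's built-in HTML parser.
    soup = BeautifulSoup(html_document, "html.parser")
    links = []
    for anchor in soup.find_all("a"):
        # get() returns None when the attribute is missing.
        links.append((anchor.get("href"), anchor.get("title")))
    return links


# Made-up sample document used in place of a live page.
sample_html = """
<html><body>
  <a href="https://example.com/first" title="First page">First</a>
  <a href="/relative/path">Second</a>
  <a name="no-href">Third</a>
</body></html>
"""

for href, title in extract_links(sample_html):
    print(href, title)
```

To scrape a live page instead, pass the result of get_html_document(url) to extract_links(). Note that href values may be relative paths or missing altogether; relative ones can be resolved against the page URL with urllib.parse.urljoin from the standard library.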