How to Scrape Websites with BeautifulSoup and Python?
Have you ever wondered how much data is created on the internet every day, and what you could do if you wanted to work with that data? Unfortunately, this data is not neatly organized like a CSV or JSON file. Fortunately, we can use web scraping to collect data from the internet and use it for our own needs. There are many ways to scrape data, and one of them is BeautifulSoup.
Before learning BeautifulSoup, let’s look at what web scraping is and whether we should do it.
What is Web Scraping?
In layman’s terms, web scraping is the process of gathering data from a website. It is like copying and pasting data from a website into your own file, but done automatically. In technical terms, web scraping is an automated method of obtaining large amounts of data from websites. Most of this data is unstructured data in HTML format, which is then converted into structured data in a spreadsheet or a database so that it can be used in various applications.
Note: For more information, refer to What is Web Scraping and How to Use It?
Legality of Web Scraping
The legality of web scraping is a sensitive topic; depending on how it is used, it can be either a boon or a bane. On one hand, web scraping by good bots enables search engines to index web content and price comparison services to save customers money. On the other hand, web scraping can be repurposed for malicious and abusive ends. It can be combined with other forms of malicious automation, known as “bad bots”, which enable harmful activities such as denial-of-service attacks, competitive data mining, account hijacking, and data theft.
Now that we have covered the basics of web scraping, let’s dive straight into BeautifulSoup, starting with installation.
To install BeautifulSoup on Windows, Linux, or any other operating system, you need the pip package manager. To see how to install pip on your operating system, check out – PIP Installation – Windows || Linux. Now run the command below in the terminal.
pip install beautifulsoup4
Refer to the below articles to know more ways of installing BeautifulSoup if the above method does not work for you.
Inspecting the Website
Before scraping any website, the first thing you need to do is understand the structure of the page, so that you can select the desired data from it. You can do this by right-clicking on the page you want to scrape and selecting Inspect (or Inspect Element).
Note: We will be scraping the Python Programming page for this tutorial.
After clicking Inspect, the browser’s Developer Tools open. Almost all modern browsers come with developer tools built in; we will be using Chrome for this tutorial.
The developer tools let you see the site’s Document Object Model (DOM). If you are not familiar with the DOM, don’t worry: just think of the text displayed as the HTML structure of the page.
Getting the HTML of the Page
After inspecting the HTML of the page, we still need to get that HTML into our Python code so that we can scrape the desired data. For this we can use the requests module, a widely used third-party Python library for making HTTP requests to a specified URL. Installing requests may differ slightly depending on the operating system you are using, but the basic approach everywhere is to open a terminal and run:
pip install requests
Now let’s make a simple GET request using the get() method.
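As a minimal sketch of this step, the helper below wraps requests.get(). The function name fetch_html, the timeout value, and the example.com URL are assumptions made for illustration; substitute the URL of the page you actually want to scrape.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML text of `url`, raising on network or HTTP errors.

    `fetch_html` is a hypothetical helper name, not part of requests.
    """
    # A timeout keeps the script from hanging on an unresponsive server.
    response = requests.get(url, timeout=timeout)
    # Turn 4xx/5xx status codes into an exception instead of returning an error page.
    response.raise_for_status()
    return response.text


# Example call (performs a real network request, so it is left commented out;
# the URL is a placeholder):
# html = fetch_html("https://example.com/")
# print(html[:200])
```

Checking the status before using response.text is a design choice: it avoids silently parsing an error page as if it were the content you asked for.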
Refer to the below tutorial to get detailed and well explained information about the requests module.
Parsing the HTML
After getting the HTML of the page, let’s see how to parse this raw HTML into useful information. First, we create a BeautifulSoup object, specifying the parser we want to use.
Note: The BeautifulSoup library is built on top of HTML parsing libraries such as html5lib, lxml, and html.parser. The parser library can therefore be specified at the same time the BeautifulSoup object is created.
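For this page, printing the title tag, the name of that tag, and the name of its parent tag gives the following output:
<title>Python Programming Language - GeeksforGeeks</title>
title
meta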
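A minimal sketch of creating the soup object with an explicit parser is shown below. To keep it runnable offline, it parses a small inline HTML string (a made-up sample, not the tutorial page) instead of a downloaded page, and it uses html.parser because that parser ships with Python.

```python
from bs4 import BeautifulSoup

# Stand-in for the HTML returned by the GET request (hypothetical sample).
html = """
<html>
  <head><title>Sample Page</title></head>
  <body><p>Hello, soup!</p></body>
</html>
"""

# The second argument names the parser; "html.parser" is in the standard library.
soup = BeautifulSoup(html, "html.parser")

print(soup.title)              # the whole <title> tag
print(soup.title.name)         # the name of that tag
print(soup.title.parent.name)  # the name of the enclosing tag
```

With a real page, the parent of the title tag depends on how the parser handles the surrounding markup, so it may differ from what this small sample produces.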
Now we would like to extract some useful data from the HTML content. The soup object holds all the data in a nested structure that can be navigated programmatically. The page we want to scrape contains a lot of text, so let’s scrape all of that content.
First let’s inspect the webpage we want to scrape.
Finding Elements by Class
In the above image we can see that all the content of the page is under the div with the class entry-content. We will store the result found under this class.
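A sketch of this lookup, using a small made-up HTML snippet in place of the real page (only the class name entry-content comes from the tutorial; the rest of the markup is an assumption):

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mimics the structure described above.
html = """
<div class="header">Site navigation</div>
<div class="entry-content">
  <p>Python is a high-level language.</p>
  <img src="logo.png" alt="logo">
  <p>It is widely used. <a href="/more">Read more</a></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# `class` is a reserved word in Python, so the keyword argument is `class_`.
content = soup.find("div", class_="entry-content")
print(content)
```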
In the above example we have used the find() method. This method returns the first tag that matches the given name and attributes. In our case it finds the div whose class is entry-content. We now have all the content of the page, but you can see that the images and links are scraped as well. So our next task is to extract only the text content from the parsed HTML above.
Let’s again inspect the HTML of our website.
We can see that the content of the page is under <p> tags. Now we have to find all the p tags inside this div. We can use the find_all() method of BeautifulSoup.
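Narrowing the search to paragraphs could look like the sketch below. It again uses a small made-up snippet rather than the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample; only the class name comes from the tutorial.
html = """
<div class="entry-content">
  <p>Python is a high-level language.</p>
  <img src="logo.png" alt="logo">
  <p>It is widely used. <a href="/more">Read more</a></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
content = soup.find("div", class_="entry-content")

# find_all() returns every matching tag inside `content`, in document order.
paragraphs = content.find_all("p")
for paragraph in paragraphs:
    print(paragraph)
```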
We finally get all the content stored under the <p> tag.
Finding Elements by ID
In the above example we found elements by class name; now let’s see how to find elements by id. For this task, let’s scrape the content of the leftbar of the page. The first step is to inspect the page and see which tag the leftbar falls under.
The above image shows that the leftbar falls under the <div> tag with the id main. Now let’s get the HTML content under this tag.
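A sketch of an id lookup on a made-up snippet follows. The id main is taken from the tutorial; the surrounding markup and link targets are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mimics a page with a left sidebar.
html = """
<div id="main">
  <ul class="leftBarList">
    <li><a href="/basics">Python Basics</a></li>
    <li><a href="/oop">Python OOP</a></li>
  </ul>
</div>
<div id="footer">Footer text</div>
"""
soup = BeautifulSoup(html, "html.parser")

# An id should be unique on a page, so find() (first match) is enough here.
main = soup.find("div", id="main")
print(main)
```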
Now let’s inspect more of the page to get the content of the leftbar.
We can see that the list in the leftbar is under the <ul> tag with the class leftBarList, and our task is to find all the li tags under this ul.
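The two-step search described above (first the ul, then its li children) might be sketched as follows, again on a made-up snippet where only the id and class names come from the tutorial:

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mimics a page with a left sidebar.
html = """
<div id="main">
  <ul class="leftBarList">
    <li><a href="/basics">Python Basics</a></li>
    <li><a href="/oop">Python OOP</a></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
main = soup.find("div", id="main")

# Step 1: locate the list inside the main div.
left_bar = main.find("ul", class_="leftBarList")
# Step 2: collect every list item under that list.
items = left_bar.find_all("li")
for item in items:
    print(item)
```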
Refer to the below articles to get detailed information about finding elements.
- Python BeautifulSoup – find all class
- How to extract a div tag and its contents by id with BeautifulSoup?
- Find the siblings of tags using BeautifulSoup
- Extracting an attribute value with beautifulsoup in Python
- BeautifulSoup – Find all <li> in <ul>
- Find text using beautifulSoup then replace in original soup variable
- BeautifulSoup – Search by text inside a tag
- BeautifulSoup – Find tags by CSS class with CSS Selectors
Extracting Text from the tags
In the above examples, you will have noticed that the tags are scraped along with the data. What if we want only the text, without any tags? We cover that in this section. We will use the text property, which returns only the text inside a tag. We will reuse the examples above and remove all the tags from them.
Example 1: Removing the tags from the content of the page
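A sketch of this example on a made-up snippet (the real script would run on the downloaded page instead):

```python
from bs4 import BeautifulSoup

# Hypothetical sample; only the class name comes from the tutorial.
html = """
<div class="entry-content">
  <p>Python is a high-level language.</p>
  <p>It is widely used. <a href="/more">Read more</a></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
content = soup.find("div", class_="entry-content")

# .text drops the markup and keeps only the text, including text of nested tags.
texts = [paragraph.text for paragraph in content.find_all("p")]
for text in texts:
    print(text)
```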
We have now successfully scraped the content of our first website. This script will keep working on any system unless the HTML of the webpage itself changes.
Example 2: Removing the tags from the content of the leftbar.
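The same idea applied to the leftbar list, sketched on a made-up snippet where only the id and class names come from the tutorial:

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mimics a page with a left sidebar.
html = """
<div id="main">
  <ul class="leftBarList">
    <li><a href="/basics">Python Basics</a></li>
    <li><a href="/oop">Python OOP</a></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
left_bar = soup.find("div", id="main").find("ul", class_="leftBarList")

# Keep only the visible text of each list item, without the <li> or <a> tags.
labels = [item.text for item in left_bar.find_all("li")]
for label in labels:
    print(label)
```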
Refer to the below articles to get detailed information about extracting text.
- Show text inside the tags using BeautifulSoup
- Find the text of the given tag using BeautifulSoup
- How to scrape all the text from body tag using Beautifulsoup in Python?
More Topics on BeautifulSoup
- Beautifulsoup – nextSibling
- BeautifulSoup – Remove the contents of tag
- BeautifulSoup – Append to the contents of tag
- How to delete child element in BeautifulSoup?
- Pretty-Printing in BeautifulSoup
- BeautifulSoup – Modifying the tree
- Converting HTML to Text with BeautifulSoup
- How to modify HTML using BeautifulSoup ?
- Change the tag’s contents and replace with the given string using BeautifulSoup
- Remove all style, scripts, and HTML tags using BeautifulSoup
- Insert tags or strings immediately before and after specified tags using BeautifulSoup
- How to parse local HTML file in Python?
- How to use Xpath with BeautifulSoup ?
- BeautifulSoup – Wrap an element in a new tag
- BeautifulSoup – Parsing only section of a document
- How to write the output to HTML file with Python BeautifulSoup?
- Encoding in BeautifulSoup
- How to Scrape Nested Tags using BeautifulSoup?
- Convert XML structure to DataFrame using BeautifulSoup – Python
BeautifulSoup Exercises and Projects
- Get all HTML tags with BeautifulSoup
- Find the title tags from a given html document using BeautifulSoup in Python
- Extract all the URLs that are nested within <li> tags using BeautifulSoup
- Get a list of all the heading tags using BeautifulSoup
- BeautifulSoup – Scraping List from HTML
- BeautifulSoup – Scraping Paragraphs from HTML
- How to Scrape all PDF files in a Website?
- Downloading PDFs with Python using Requests and BeautifulSoup
- How to Extract Weather Data from Google in Python?
- How to Scrape Videos using Python ?