
Scrape books using BeautifulSoup from books.toscrape in Python


Web scraping is the technique of extracting large amounts of data from websites and saving it to a local file on your computer. The simplest form of web scraping is manually copying and pasting data from a web page into a text file or spreadsheet. Sometimes this is the only option, for instance when a website sets up barriers against automated access. In most cases, however, the amount of data required is too large to collect by hand, so web scraping tools are used to automate the process. One such tool is BeautifulSoup.

BeautifulSoup is a Python library for parsing HTML and XML files and pulling data out of them. To install BeautifulSoup, type the command below in the terminal.

pip install BeautifulSoup4

BeautifulSoup only parses HTML, so we also need a web client to fetch the page from the internet.
In Python this can be done with the standard urllib package.
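As a rough sketch of that fetching step, the helper below wraps urlopen so that the connection is closed automatically. The name fetch_html and the default timeout are illustrative assumptions, not part of the original script.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the raw bytes served at url (hypothetical helper)."""
    # The with block closes the connection even if read() raises.
    with urlopen(url, timeout=timeout) as response:
        return response.read()
```

Calling fetch_html with the address of the page to scrape returns bytes that can be handed straight to BeautifulSoup for parsing.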

In this article, our task will be to:

  • Collect the name and price of each product.
  • Save the collected data in .csv format.

As an example, we will collect the title and price of each book from the website Books to Scrape.

BooksToScrape Website with All Products in 1st Page

Approach

  • Import the libraries: BeautifulSoup and urllib.
  • Fetch the HTML of the page using urllib.
  • Parse the fetched HTML using BeautifulSoup.
  • Find the tag that contains each product on the page and extract all occurrences of it.
  • Within each product, find the tags that hold the name and price of the book and extract them.
  • With all the information extracted, print it and save it in a csv file.

Each book and its underlying information.

Below is the implementation. 

Python3




# import the web client, the HTML parser
# and the CSV writer
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

# variable to store website link as string
# (address assumed from the site named in the title)
myurl = "https://books.toscrape.com/"

# fetch the page and read its HTML; the with
# block closes the connection afterwards
with urlopen(myurl) as uClient:
    page_html = uClient.read()

# call BeautifulSoup for parsing
page_soup = BeautifulSoup(page_html, "html.parser")

# grabs all the products under list tag
bookshelf = page_soup.findAll(
    "li", {"class": "col-xs-6 col-sm-4 col-md-3 col-lg-3"})

# create csv file of all products; utf-8 keeps the
# pound sign in the price intact
filename = "Books.csv"
with open(filename, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Book title", "Price"])

    for books in bookshelf:

        # collect the title of the current book
        book_title = books.h3.a["title"]

        # collect the price of the current book
        book_price = books.findAll("p", {"class": "price_color"})
        price = book_price[0].text.strip()

        print("Title of the book: " + book_title)
        print("Price of the book: " + price)

        # csv.writer quotes titles that contain commas
        writer.writerow([book_title, price])


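Output: the script prints the title and price of each book on the page.

CSV file: the same titles and prices are saved in Books.csv, one row per book.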
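One thing worth checking in the CSV file is a title that contains a comma: if fields are joined by plain string concatenation, such a title spills into an extra column. The following is a minimal sketch of how the standard csv module quotes these fields. The titles and prices are made up, and an in-memory buffer stands in for Books.csv.

```python
import csv
import io

# Made-up rows; the first title contains a comma on purpose.
rows = [
    ("Example Title, With a Comma", "£10.00"),
    ("Plain Title", "£20.00"),
]

# Write to an in-memory buffer instead of a real file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Book title", "Price"])
writer.writerows(rows)

# Reading the text back yields exactly two fields per row,
# because the writer wrapped the comma-containing title in quotes.
parsed = list(csv.reader(io.StringIO(buffer.getvalue())))
print(parsed)
```

When writing to a real file, opening it with newline="" and encoding="utf-8" avoids blank lines between rows on Windows and keeps the pound sign readable.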

Let's explain the above code. In this example, each book and its information are contained in an <li> tag, so:

  • The findAll() function (an older name for find_all()) looks for all the <li> tags whose class attribute (it could equally be an id) is "col-xs-6 col-sm-4 col-md-3 col-lg-3" and stores them in the variable bookshelf.

bookshelf = page_soup.findAll("li", {"class": "col-xs-6 col-sm-4 col-md-3 col-lg-3"})

  • bookshelf now holds every matching <li> element, that is, every book. As the picture above shows, each <li> tag with the class "col-xs-6 col-sm-4 col-md-3 col-lg-3" represents one book.
  • Now that we have all the books, we need to choose which information to extract from each one: the title and the price.
  • As the image above shows, the title of each book is stored in the 'title' attribute of the <a> tag, which sits inside the <h3> tag. The following expression extracts the title of each book.
book_title = books.h3.a["title"]
  • For the price, we can see that it is in a <p> tag with the class "price_color", so we use findAll() again.
book_price = books.findAll("p", {"class" : "price_color"})
price = book_price[0].text.strip()
  • findAll() returns a list of every matching <p> tag inside the current book. Index 0 picks the first (and here only) match, and its text is stored in the variable price. The .text attribute gives only the text content, and strip() removes any leading and trailing whitespace.
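To try these extraction steps without touching the network, the sketch below runs them on a small hand-written HTML string. The markup only imitates one product entry and is an assumption; the real page contains more tags. It uses find_all(), the current name for findAll().

```python
from bs4 import BeautifulSoup

# Hand-written HTML imitating a single product entry (assumed layout).
sample_html = """
<ol>
  <li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
    <h3><a href="#" title="Example Book">Example ...</a></h3>
    <p class="price_color">  £12.34  </p>
  </li>
</ol>
"""

page_soup = BeautifulSoup(sample_html, "html.parser")

# Every <li> whose class attribute is exactly this string.
bookshelf = page_soup.find_all(
    "li", {"class": "col-xs-6 col-sm-4 col-md-3 col-lg-3"})

results = []
for books in bookshelf:
    # The full title lives in the 'title' attribute of <a> inside <h3>.
    book_title = books.h3.a["title"]

    # find_all returns a list; index 0 is the first matching <p>.
    book_price = books.find_all("p", {"class": "price_color"})
    price = book_price[0].text.strip()

    results.append((book_title, price))

print(results)
```

Note that the link text is truncated while the title attribute is not, which is why the title is read from the attribute rather than from the text of the <a> tag.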


Last Updated : 21 Nov, 2022