Scrape books from books.toscrape.com using BeautifulSoup in Python

Web scraping is the technique of extracting large amounts of data from websites and saving it to a local file on your computer. The simplest form of web scraping is manually copying and pasting data from a web page into a text file or spreadsheet. Sometimes this is the only option when a website sets up barriers against automated access. In most cases, however, the amount of data required is too large for a person to collect by hand, so web scraping tools are used to automate the process. One such tool is BeautifulSoup.

BeautifulSoup is a Python library for parsing HTML and XML files and pulling data out of them. To install BeautifulSoup, type the command below in the terminal.

pip install BeautifulSoup4

BeautifulSoup only parses HTML, so we also need a web client to download the page from the internet.
In Python this can be done with the urllib package from the standard library.
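
As a minimal sketch of that download step, the helper below wraps urlopen in a with block so the connection is closed even if reading fails. The name fetch_html, the 10-second timeout and the choice to return None on failure are illustrative assumptions, not part of the script used later in this article.

```python
from urllib.error import URLError
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the raw bytes served at `url`, or None if the request fails.

    Hypothetical helper: the name and the timeout value are assumptions.
    """
    try:
        # The with block closes the response even if read() raises.
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except URLError as err:
        # URLError covers DNS failures, refused connections and HTTP errors.
        print("Could not fetch", url, "-", err)
        return None
```

Calling fetch_html('http://books.toscrape.com/index.html') should return the page as bytes, which can be passed straight to BeautifulSoup.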

In this article, our task will be to:

Collect the name of the product and price.
Save the collected data in .csv format.

As an example, we will collect the title and price of each book from the website Books to Scrape (http://books.toscrape.com).

Books to Scrape website with all products on the first page

Approach

Import the libraries: BeautifulSoup and urllib.
Fetch the HTML of the page using urllib.
Parse the fetched HTML using BeautifulSoup.
Find the tags that hold the products on the page and extract them.
Within each product, find the tags that hold the name and price of the book and extract them.
With all the information extracted, print it and save everything in a CSV file.

All books and their underlying information.

Below is the implementation.


# import the web client, the HTML parser
# and the CSV writer
import csv
from urllib.request import urlopen as uReq

from bs4 import BeautifulSoup as soup

# variable to store website link as string
myurl = 'http://books.toscrape.com/index.html'

# grab the website, read the HTML and
# close the connection when done
with uReq(myurl) as uClient:
    page_html = uClient.read()

# call BeautifulSoup for parsing
page_soup = soup(page_html, "html.parser")

# grab all the products under the list tag
bookshelf = page_soup.findAll(
    "li", {"class": "col-xs-6 col-sm-4 col-md-3 col-lg-3"})

# create csv file of all products
filename = "Books.csv"

with open(filename, "w", newline="", encoding="utf-8") as f:
    # csv.writer quotes titles that contain commas,
    # so every row keeps exactly two columns
    writer = csv.writer(f)
    writer.writerow(["Book title", "Price"])

    for books in bookshelf:
        # collect the title of the book
        book_title = books.h3.a["title"]

        # collect the price of the book
        book_price = books.findAll("p", {"class": "price_color"})
        price = book_price[0].text.strip()

        print("Title of the book: " + book_title)
        print("Price of the book: " + price)

        writer.writerow([book_title, price])
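
One detail of the CSV step is worth checking: a title that contains a comma would spread over several columns if the fields were joined by hand with ",". The sketch below shows how Python's built-in csv module quotes such a field automatically. The two rows are made-up sample data, and io.StringIO stands in for a real file so the result can be inspected directly.

```python
import csv
import io

# Made-up sample rows; the second title contains a comma.
rows = [
    ("A Plain Title", "£10.00"),
    ("A Title, With a Comma", "£12.50"),
]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Book title", "Price"])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers two columns per row,
# because the title with the comma was written in quotes.
parsed = list(csv.reader(io.StringIO(csv_text)))
print(parsed[2])
```

When writing to a real file, the csv documentation recommends opening it with newline="" so that no extra blank lines appear on Windows.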

Output :

CSV File:

Let's walk through the code above. In this example, each book and its information are contained in an <li> tag.

The findAll() function looks for all the <li> tags whose class attribute (an id could be matched the same way) is "col-xs-6 col-sm-4 col-md-3 col-lg-3" and stores them in the variable bookshelf.

bookshelf = page_soup.findAll("li", {"class": "col-xs-6 col-sm-4 col-md-3 col-lg-3"})

The variable bookshelf now holds every matching <li> tag, that is, every book. As the picture above shows, each <li> tag with the class "col-xs-6 col-sm-4 col-md-3 col-lg-3" represents one book.
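
A note on how this match behaves: when the class value contains several names separated by spaces, BeautifulSoup compares it with the tag's full class string, so the order of the names matters. The sketch below illustrates this on a simplified, made-up stand-in for the product grid, and also shows matching on a single class name as a possible alternative. It uses find_all(), the current name of findAll(); the two are aliases.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the product grid; not the site's full markup.
html = """
<ol>
  <li class="col-xs-6 col-sm-4 col-md-3 col-lg-3"><h3><a title="Book A">Book A</a></h3></li>
  <li class="col-xs-6 col-sm-4 col-md-3 col-lg-3"><h3><a title="Book B">Book B</a></h3></li>
  <li class="next"><a href="page-2.html">next</a></li>
</ol>
"""
page_soup = BeautifulSoup(html, "html.parser")

# Full class string: matches tags whose class attribute is exactly this.
exact = page_soup.find_all(
    "li", {"class": "col-xs-6 col-sm-4 col-md-3 col-lg-3"})

# Same class names in a different order: no match.
reordered = page_soup.find_all(
    "li", {"class": "col-sm-4 col-xs-6 col-md-3 col-lg-3"})

# A single class name matches any tag whose class list contains it.
single = page_soup.find_all("li", class_="col-lg-3")

print(len(exact), len(reordered), len(single))  # 2 0 2
```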
Now that we have all the books, we need to select the information to extract from each one: the title and the price.
As the image above shows, the title of each book is stored in the title attribute of the <a> tag, which sits inside the <h3> tag. The following expression extracts the title of each book.

book_title = books.h3.a["title"]
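
To make the difference between a tag's attribute and its visible text concrete, here is a small sketch on a made-up book entry. It assumes, as on the listing page, that the link text can be a shortened version of the title while the title attribute holds the full name. The .get() call is shown as an option for attributes that may be missing.

```python
from bs4 import BeautifulSoup

# Made-up entry shaped like one product on the page.
html = (
    '<li><h3><a href="book.html" '
    'title="A Very Long Example Book Title">'
    'A Very Long Example ...</a></h3></li>'
)
books = BeautifulSoup(html, "html.parser").li

# Dotted access walks down to the first matching child tag,
# the same as books.find("h3").find("a").
link = books.h3.a

print(link["title"])     # A Very Long Example Book Title
print(link.text)         # A Very Long Example ...

# Square brackets raise KeyError for a missing attribute;
# .get() returns None instead.
print(link.get("lang"))  # None
```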

The price is in a <p> tag with the class "price_color", so we use findAll() again.

book_price = books.findAll("p", {"class" : "price_color"})
price = book_price[0].text.strip()

findAll() returns a list even when there is a single match, so index 0 selects the first <p> tag with that class inside the current book. Then .text.strip() keeps only the text of that tag and removes any leading and trailing whitespace, and the result is stored in the variable price.
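
The sketch below runs the same price lookup on a made-up book entry. It also shows find(), which returns the first match directly, and one possible way to turn the price text into a number. The sample price and the assumption that the text always starts with "£" are illustrative.

```python
from bs4 import BeautifulSoup

# Made-up entry; the whitespace around the price is deliberate.
html = """
<li>
  <h3><a title="Example Book">Example Book</a></h3>
  <p class="price_color">
    £51.77
  </p>
</li>
"""
books = BeautifulSoup(html, "html.parser").li

# find_all() returns a list, so the first match is at index 0.
book_price = books.find_all("p", {"class": "price_color"})
price = book_price[0].text.strip()

# find() returns the first match directly (or None if there is none).
same_price = books.find("p", {"class": "price_color"}).text.strip()

# Assumes the price text starts with a pound sign.
amount = float(price.lstrip("£"))

print(price == same_price)  # True
print(amount)               # 51.77
```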
