
Scrape books using BeautifulSoup from books.toscrape in Python

Web scraping is the technique of extracting large amounts of data from websites and saving it to a local file on your computer. The simplest form of web scraping is manually copying and pasting data from a web page into a text file or spreadsheet. Sometimes this is the only option when a website sets up barriers against automated access. In most cases, however, the amount of data required is too large to collect by hand, so web scraping tools are used to automate the process. One such tool is BeautifulSoup.

BeautifulSoup is a Python library for parsing HTML and XML files and pulling data out of them. To install BeautifulSoup, type the command below in the terminal.



pip install BeautifulSoup4
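To get a feel for what parsing means before touching a live site, here is a minimal sketch that feeds BeautifulSoup a small HTML string. The markup and its contents are made up for illustration; only the BeautifulSoup constructor, tag navigation, find and get_text come from the library.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, invented for this illustration.
sample_html = """
<html>
  <body>
    <h1>Sample shelf</h1>
    <p class="note">Two books in stock</p>
  </body>
</html>
"""

# Parse the string with Python's built-in HTML parser.
parsed = BeautifulSoup(sample_html, "html.parser")

# Reach the first <h1> by attribute-style navigation.
heading = parsed.h1.get_text()

# Look up a tag by name and class, then read its text.
note = parsed.find("p", {"class": "note"}).get_text()

print(heading)
print(note)
```

The same two ideas, navigating by tag name and searching by class, are what the scraper in this article relies on.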

BeautifulSoup only parses HTML; we still need a web client to fetch the page from the internet. 
In Python, this can be done with the standard-library package urllib.
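As a rough sketch of that fetching step, the helper below wraps urlopen in a with block so the connection is closed automatically. The function name fetch_html is hypothetical, and the commented-out call assumes the BooksToScrape home page is reachable at https://books.toscrape.com/.

```python
from urllib.request import urlopen


def fetch_html(url):
    """Return the raw bytes served at url (hypothetical helper)."""
    # The with block closes the connection even if read() fails.
    with urlopen(url) as response:
        return response.read()


# Example call (needs network access; the URL is an assumption):
# page_html = fetch_html("https://books.toscrape.com/")
```

The returned bytes can be handed straight to BeautifulSoup, which works out the character encoding on its own.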

In this article, our task will be to collect the title and price of each book listed on the website BooksToScrape.

BooksToScrape Website with All Products in 1st Page

Approach

Collect all books on the page along with their underlying information (title and price).

Below is the implementation. 




# import the web client, the HTML parser
# and the CSV writer
import csv
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

# variable to store website link as string
# (the BooksToScrape home page)
myurl = "https://books.toscrape.com/"

# grab website and store in variable uClient
uClient = uReq(myurl)

# read and close HTML
page_html = uClient.read()
uClient.close()

# call BeautifulSoup for parsing
page_soup = soup(page_html, "html.parser")

# grabs all the products under list tag
# (find_all is the current name of the older findAll alias)
bookshelf = page_soup.find_all(
    "li", {"class": "col-xs-6 col-sm-4 col-md-3 col-lg-3"})

# create csv file of all products; csv.writer quotes titles
# that contain commas, and UTF-8 keeps the currency symbol intact
filename = "Books.csv"
with open(filename, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Book title", "Price"])

    for books in bookshelf:

        # collect title of the book
        book_title = books.h3.a["title"]

        # collect price of the book
        book_price = books.find_all("p", {"class": "price_color"})
        price = book_price[0].text.strip()

        print("Title of the book :" + book_title)
        print("Price of the book :" + price)

        writer.writerow([book_title, price])

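Output: the script prints the title and price of each book to the console.

CSV File: the same titles and prices are written to Books.csv, below a header row.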
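One detail worth checking in the CSV file: a book title can itself contain a comma, which would split the row into too many columns if the fields are joined by hand. A possible way around this, sketched below with Python's standard csv module, is to let csv.writer add the quoting. The rows are made-up sample data, not scraped output, and io.StringIO stands in for a real file.

```python
import csv
import io

# Made-up rows for illustration; the second title contains a comma.
rows = [
    ("Plain Title", "10.00"),
    ("Title, With a Comma", "12.50"),
]

# StringIO stands in for open("Books.csv", "w", newline="").
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Book title", "Price"])
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back yields exactly two fields per row,
# because the writer wrapped the comma-containing title in quotes.
parsed_rows = list(csv.reader(io.StringIO(csv_text)))
print(parsed_rows)
```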

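Let’s walk through the above code. Each book and its information are contained in an <li> tag, so we first collect every such tag (findAll, seen in older code, is an alias of find_all and behaves the same way):

bookshelf = page_soup.find_all("li", {"class": "col-xs-6 col-sm-4 col-md-3 col-lg-3"})

Then, for each book, the title is read from the title attribute of the link inside the <h3> tag, and the price is the text of the first <p> tag with the class price_color, stripped of surrounding whitespace:

book_title = books.h3.a["title"]
book_price = books.find_all("p", {"class": "price_color"})
price = book_price[0].text.strip()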
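To try these lookups without hitting the website, the sketch below runs them on a small hand-written snippet. The HTML only imitates the structure described above (an <li> holding an <h3> link and a price_color paragraph); the title and price are invented.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one product entry; values are invented.
sample_html = """
<ul>
  <li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
    <h3><a href="#" title="An Invented Book Title">An Invented ...</a></h3>
    <p class="price_color"> 12.34 </p>
  </li>
</ul>
"""

page_soup = BeautifulSoup(sample_html, "html.parser")
bookshelf = page_soup.find_all(
    "li", {"class": "col-xs-6 col-sm-4 col-md-3 col-lg-3"})

results = []
for books in bookshelf:
    # The full title lives in the link's title attribute,
    # while the visible link text may be shortened.
    book_title = books.h3.a["title"]

    # find_all returns a list, so take the first match
    # and strip the whitespace around the price text.
    book_price = books.find_all("p", {"class": "price_color"})
    price = book_price[0].text.strip()

    results.append((book_title, price))

print(results)
```

Testing the selectors on a fixed snippet like this makes it easier to spot whether a later failure comes from the parsing logic or from a change in the live page.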
