Scrape books using BeautifulSoup from books.toscrape in Python
  • Last Updated : 25 Oct, 2020

Web scraping is the technique of extracting data from websites, typically in large amounts, and saving it to a local file on your computer. The simplest form of web scraping is manually copying and pasting data from a web page into a text file or spreadsheet. Sometimes this is the only option when a website sets up barriers against automated access. In most cases, though, the amount of data required is too large to collect by hand, so web scraping tools are used to automate the process. One such tool is BeautifulSoup.

BeautifulSoup is a Python library for parsing HTML and XML files and pulling data out of them. To install BeautifulSoup, type the command below in the terminal.

pip install BeautifulSoup4

BeautifulSoup only parses HTML, so we also need a web client to fetch the page from the internet.
In Python this can be done with the standard-library package urllib.
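
As a minimal sketch of the fetching step, the helper below wraps urllib's urlopen in a with-block so the connection is closed automatically. The name fetch_html and the 10-second timeout are assumptions made for illustration; they do not come from the article.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the raw bytes served at url (hypothetical helper)."""
    # The with-block closes the connection even if read() raises.
    with urlopen(url, timeout=timeout) as response:
        return response.read()
```

The returned bytes can be passed straight to BeautifulSoup, which works out the character encoding on its own.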

In this article, our task will be to:

  • Collect the name of the product and price.
  • Save the collected data in .csv format.

As an example, we will collect the title and price of each book from the website BooksToScrape.



BooksToScrape Website with All Products in 1st Page

Approach

  • Import the libraries: BeautifulSoup and urllib.
  • Read the HTML of the page using urllib.
  • Parse the HTML using BeautifulSoup.
  • Look for the tag that wraps each product on the page and extract all of them.
  • Within each product, look for the tags that hold the name and price of the book and extract them.
  • With all the information extracted, print it and save it in a CSV file.
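
The parsing steps above can be tried offline on a small piece of HTML before touching the real site. The sketch below assumes a hand-written fragment modeled on the markup described in this article; the titles and prices are placeholders, and the function name extract_books is hypothetical.

```python
from bs4 import BeautifulSoup

# Hand-written fragment modeled on the page structure described in this
# article; titles and prices are placeholders, not scraped data.
SAMPLE_HTML = """
<ol>
  <li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
    <h3><a href="#" title="Example Book One">Example Book ...</a></h3>
    <p class="price_color">£10.00</p>
  </li>
  <li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
    <h3><a href="#" title="Example Book Two">Example Book ...</a></h3>
    <p class="price_color">£12.50</p>
  </li>
</ol>
"""


def extract_books(html):
    """Return a list of (title, price) pairs found in html."""
    page_soup = BeautifulSoup(html, "html.parser")
    # One <li> with this exact class string per product.
    bookshelf = page_soup.findAll(
        "li", {"class": "col-xs-6 col-sm-4 col-md-3 col-lg-3"})
    rows = []
    for book in bookshelf:
        # The full title lives in the title attribute of the link.
        title = book.h3.a["title"]
        price = book.findAll("p", {"class": "price_color"})[0].text.strip()
        rows.append((title, price))
    return rows


rows = extract_books(SAMPLE_HTML)
for title, price in rows:
    print(title, price)
```

Keeping the extraction in a function that takes an HTML string makes it easy to check the selectors without a network connection.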

All books and their underlying information.

Below is the implementation. 

Python3


# import web grabbing client, HTML parser
# and CSV writer
import csv
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

# variable to store website link as string
# (the URL is missing from the original listing; this value is
# assumed from the site named in the article title)
myurl = "http://books.toscrape.com/"

# grab website and store in variable uClient
uClient = uReq(myurl)

# read and close HTML
page_html = uClient.read()
uClient.close()

# call BeautifulSoup for parsing
page_soup = soup(page_html, "html.parser")

# grabs all the products under list tag
bookshelf = page_soup.findAll(
    "li", {"class": "col-xs-6 col-sm-4 col-md-3 col-lg-3"})

# create csv file of all products
filename = "Books.csv"
with open(filename, "w", newline="", encoding="utf-8") as f:

    # csv.writer quotes titles that contain commas
    writer = csv.writer(f)
    writer.writerow(["Book title", "Price"])

    for books in bookshelf:

        # collect title of the current book
        book_title = books.h3.a["title"]

        # collect price of the current book
        book_price = books.findAll("p", {"class": "price_color"})
        price = book_price[0].text.strip()

        print("Title of the book :" + book_title)
        print("Price of the book :" + price)

        writer.writerow([book_title, price])

Output :

CSV File:
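
One caveat about the CSV file: a book title can itself contain a comma, and joining the fields by hand with "," would then split that title across two columns. A minimal sketch of how the standard csv module handles this is shown below. The rows are placeholder data, and io.StringIO stands in for a real file only so the result can be inspected in memory.

```python
import csv
import io

# Placeholder rows; the first title contains a comma on purpose.
rows = [
    ("Example Title, With a Comma", "£10.00"),
    ("Plain Example Title", "£12.50"),
]

buffer = io.StringIO()  # stands in for a file opened with newline=""
writer = csv.writer(buffer)
writer.writerow(["Book title", "Price"])
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back yields two fields per row again,
# because the writer wrapped the comma-containing title in quotes.
parsed = list(csv.reader(io.StringIO(csv_text)))
```

When writing to a real file, the csv documentation recommends opening it with newline="" so that line endings are not translated twice.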

Let’s walk through the above code. In this example, each book and its information are contained in an <li> tag, so:

  • The findAll() function looks for all the <li> tags whose class attribute is "col-xs-6 col-sm-4 col-md-3 col-lg-3" (an id could be matched the same way) and stores them in the variable bookshelf.

bookshelf = page_soup.findAll("li", {"class": "col-xs-6 col-sm-4 col-md-3 col-lg-3"})

  • bookshelf now holds every matching <li> element, one per book. As you can see in the picture above, each <li> tag with class "col-xs-6 col-sm-4 col-md-3 col-lg-3" represents one book.
  • Now that we have all the books, we need to choose which information to extract from each one: the title and the price.
  • As you can see in the image above, the title of each book is stored in the title attribute of the <a> tag nested inside the <h3> tag, so the following expression extracts it.
book_title = books.h3.a["title"]
  • The price is in a <p> tag with class "price_color", so we use findAll() again, this time on the current book.
book_price = books.findAll("p", {"class": "price_color"})
price = book_price[0].text.strip()
  • findAll() returns a list of matches, so index 0 takes the first matching <p> tag of the current book, and its text is stored in the variable price. .text returns only the text inside the tag, and strip() removes any leading and trailing whitespace.
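
To make the role of the index and of strip() concrete, here is a small sketch on a single hand-written book fragment with placeholder values. It also shows find(), which is not used in this article but returns the first match directly. In Beautiful Soup 4, find_all() is the current spelling of findAll().

```python
from bs4 import BeautifulSoup

# Hand-written fragment for one book; the values are placeholders.
book = BeautifulSoup(
    '<li><h3><a href="#" title="Example Book">Example ...</a></h3>'
    '<p class="price_color">  £10.00 </p></li>',
    "html.parser",
).li

# find_all() returns a list-like ResultSet, so [0] picks the first match.
price_from_list = book.find_all("p", {"class": "price_color"})[0].text.strip()

# find() returns the first match directly, or None when nothing matches.
price_tag = book.find("p", {"class": "price_color"})
price_direct = price_tag.text.strip() if price_tag is not None else ""

# Tag attributes are read with dictionary-style indexing.
title = book.h3.a["title"]

print(title, price_from_list, price_direct)
```

Checking for None before reading .text is a cautious habit for pages where some products may lack a price tag; indexing an empty find_all() result with [0] would raise an IndexError instead.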
