
Collecting data with Scrapy

Prerequisites: Scrapy, SQLite3

Scrapy is a web scraping framework that is used to scrape, parse and collect web data. Once our spider has scraped the data, it has to be decided whether to:

  • keep the scraped item as it is,
  • clean or modify it, or
  • drop (discard) it altogether.

Hence for all these operations we have the pipelines.py file, which handles scraped data through various components (known as classes) that are executed sequentially. In this article, we will learn, through the pipelines.py file, how to collect the data scraped by Scrapy into an SQLite3 database.

Initializing Directory and setting up the Project

Let’s first create a Scrapy project. For that, make sure that Python and pip are installed on the system. Then run the commands below one by one to create a Scrapy project similar to the one we will be using in this article.



# To create a folder named GFGScrapy
mkdir GFGScrapy
cd GFGScrapy

# making a virtual env there.
virtualenv .
cd Scripts

# activating it (on Linux/macOS use: source bin/activate from the project folder).
activate
cd ..

Output:

Creating virtual environment

Now install Scrapy inside the activated environment:

pip install scrapy

Now use the commands below to create a Scrapy project and also generate a spider.

# project name is scrapytutorial
scrapy startproject scrapytutorial  
cd scrapytutorial

# the argument is the website we want to crawl
scrapy genspider spider_to_crawl https://quotes.toscrape.com

Once you have created the Scrapy project, the project directory looks like the one given in the image.

Directory structure

Scrapy directory structure

The directory structure sits under a path of the form (sample):

C://<project-name>/<project-name>

In the above image, the project name is scrapytutorial, and it contains the files shown.
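For reference, a freshly generated Scrapy project typically has the following layout (these are the standard template files; only spider_to_crawl.py comes from our genspider command):

scrapytutorial/
├── scrapy.cfg                  # deploy configuration file
└── scrapytutorial/             # the project's Python module
    ├── __init__.py
    ├── items.py                # item definitions
    ├── middlewares.py          # spider and downloader middlewares
    ├── pipelines.py            # item pipelines (where we will store the data)
    ├── settings.py             # project settings
    └── spiders/
        ├── __init__.py
        └── spider_to_crawl.py  # the spider generated above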

Collecting data with Scrapy

Let’s have a look at our spider_to_crawl.py file present inside our spiders folder. This is the file where we write the URL the spider has to crawl, along with a method named parse() that describes what should be done with the data the spider scrapes.

This file is generated automatically by the “scrapy genspider” command used above and is named after the spider. The default generated file is shown below.

Default spider_to_crawl file’s structure
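Since the screenshot is not reproduced here, this is roughly what the generated file contains (the exact allowed_domains/start_urls values depend on the Scrapy version and the argument passed to genspider):

import scrapy


class SpiderToCrawlSpider(scrapy.Spider):
    name = 'spider_to_crawl'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        # the template leaves the parsing logic for us to fill in
        pass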


Now that the project is ready, we can move on to see how the pipelines.py file is implemented to store the data scraped by the spider.

An item pipeline is a component written inside the pipelines.py file that performs operations on the scraped data sequentially. The typical operations we can perform on the scraped items are:

  • cleaning the raw (e.g. HTML) data,
  • validating the scraped fields,
  • checking for and dropping duplicates, and
  • storing the item in a database.

For performing different operations on items we declare separate components (classes in the file), each consisting of the methods used to perform those operations. By default the pipelines file has one class named after the project. We can also create our own classes to define what operations they have to perform. If a pipelines file contains more than one class, we should state their execution order explicitly.

Operations are performed sequentially, so we use the settings.py file to describe the order in which they should run, i.e. which component is applied first and which next. This matters mainly when several operations are performed on the items.

Each component (class) must have a method named process_item(), which Scrapy always calls for every item that passes through the pipeline component.

Syntax:

process_item( self, item, spider )

Parameters:

  • self : the reference to the pipeline object itself.
  • item : the item scraped by the spider.
  • spider : the spider that scraped the item.

This method must return the (modified or unmodified) item object, or raise a DropItem exception if the item is faulty and should be discarded.

This method can also call other methods of the class that modify or store the data.

Apart from this, we can also define our own methods (such as __init__()) for other tasks, like creating a database to store the data or converting the data to other forms.
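As a minimal sketch of what such a component looks like (the class name and the validation rule here are illustrative, not part of the project yet):

from scrapy.exceptions import DropItem


class ExamplePipeline:
    def process_item(self, item, spider):
        # hypothetical validation step: discard items with an empty 'Quote' field
        if not item.get('Quote'):
            raise DropItem("missing Quote in %s" % item)
        # return the item so that later pipeline components receive it
        return item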

Working of pipelines.py

Now let us look at how pipelines.py works:

Registering the components

It is important to register all the components we create in the pipelines file in the settings.py file, under the setting named ITEM_PIPELINES.

Syntax:

ITEM_PIPELINES = {
    'myproject.pipelines.ComponentClassName' : <priority number>,
    # many other components
}

Here the priority number is the order in which Scrapy calls the components; lower numbers run first (values are conventionally kept between 0 and 1000).

Creating Items to be passed over files

One more thing to note: we need to describe what our item will contain in the items.py file. Hence our items.py file contains the code below:




# Define here the models for your scraped items
 
import scrapy
 
class ScrapytutorialItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
     
    # only one field, which holds the Quote text.
    Quote = scrapy.Field()

We will import this file into our spider_to_crawl.py file. In this way, we can create items to be passed to the pipeline. Now we are clear about the idea of pipelines and how they work.

Implementing SQLite3

Now it’s time to see how SQLite3 is used to create databases and tables in Python.
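The core sqlite3 workflow that the pipeline below relies on is: connect to a database file, get a cursor, execute SQL, and commit. A minimal standalone sketch (the file and table names here are placeholders, not the ones used later):

import sqlite3

# connect to (and create, if missing) a database file
conn = sqlite3.connect("example.db")
curr = conn.cursor()

# create a table and insert a row using a parameterized query
curr.execute("CREATE TABLE IF NOT EXISTS demo (Quote text)")
curr.execute("INSERT INTO demo VALUES (?)", ("hello world",))

conn.commit()  # persist the changes
conn.close()   # close the connection when done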

In this way, we can store data in the database. To visualize the collected data we can use an online SQLite viewer (or a desktop tool such as DB Browser for SQLite), since by default the system has no application to open a .db file. If you already have such a helper tool installed, you don’t need this.

Now we are ready to move on to the example. Here we will use all the techniques learned above to create a database of our scraped data. We will scrape the Quotes data from the site mentioned above and store it in our database using SQLite3 in the pipelines.py file: the pipeline will receive data from the spider and insert it into a table in the database it creates. Let’s begin with the code in the spider_to_crawl.py file, where we declare our spider and give it the required URL as input so that the spider can scrape through it.

 spider_to_crawl.py:




import scrapy
 
# importing the items structure described
# in items.py file
from ..items import ScrapytutorialItem
 
 
class SpiderToCrawlSpider(scrapy.Spider):
    name = 'spider_to_crawl'
    #allowed_domains = ['https://quotes.toscrape.com/']
    start_urls = ['https://quotes.toscrape.com/']
 
    def parse(self, response):
       
        # creating an item instance (it behaves like a dictionary)
        items = ScrapytutorialItem()

        # this XPath was found by inspecting the page in the browser's
        # developer tools and picking an appropriate rule
        Quotes_all = response.xpath('//div/div/div/span[1]')

        # the path above is based on the page's structure
 
        for quote in Quotes_all:  # extracting data
            items['Quote'] = quote.css('::text').extract()
            yield items
            # calling pipelines components for further
            # processing.

We now add the pipeline methods below, which are written in the pipelines.py file so that the database will be created.

pipelines.py file




from itemadapter import ItemAdapter
import sqlite3
 
 
class ScrapytutorialPipeline(object):
 
    # init method to initialize the database and
    # create connection and tables
    def __init__(self):
       
        # Creating connection to database
        self.create_conn()
         
        # calling method to create table
        self.create_table()
 
    # create connection method to create database
    # or use database to store scraped data
    def create_conn(self):
       
        # connecting to database.
        self.conn = sqlite3.connect("mydata.db")
         
        # collecting reference to cursor of connection
        self.curr = self.conn.cursor()
 
 
    # Create table method using SQL commands to create table
    def create_table(self):
        self.curr.execute("""DROP TABLE IF EXISTS firsttable""")
        self.curr.execute("""create table firsttable(
                        Quote text
                        )""")
 
    # store items to databases.
    def process_item(self, item, spider):
        self.putitemsintable(item)
        return item
 
    def putitemsintable(self, item):
       
        # extracting item and adding to table using SQL commands.
        self.curr.execute("""insert into firsttable values (?)""", (
            item['Quote'][0],
        ))
        self.conn.commit()  # committing the changes to the database.
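The pipeline above never closes the database connection explicitly. Scrapy also calls an optional close_spider() method on pipeline components when the spider finishes, so if you want to close the connection cleanly, a small addition like this inside the same class (a sketch, not part of the original listing) would work:

    # called by Scrapy once the spider is closed
    def close_spider(self, spider):
        self.conn.close()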

The items.py and settings.py files should look like this:

Items.py and settings.py
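items.py was already shown above; for settings.py the relevant part is the pipeline registration, which for this project would look roughly like this (300 is just the conventional default priority):

ITEM_PIPELINES = {
    'scrapytutorial.pipelines.ScrapytutorialPipeline': 300,
}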

After this, use the given command to scrape and collect the data.

Syntax:

scrapy crawl <spider_name>

After running the command “scrapy crawl spider_to_crawl”, the processing takes place in the following manner:

First the __init__() method is called; it is the constructor that runs when Scrapy instantiates the pipeline component. It in turn calls the other methods, which create the table and initialize the database.

Then, for every item, the process_item() method calls putitemsintable(), which stores the data in the database. After this method executes, the item is returned so that the remaining items can be passed on and processed.

Let’s see the output of the stored data after scraping the quotes.

Output:

Crawling of our spider

Data is stored in table.
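If you prefer to check the table from Python rather than an external viewer, a quick query like the one below (run from the project directory, where mydata.db was created) prints the stored rows:

import sqlite3

conn = sqlite3.connect("mydata.db")
for row in conn.cursor().execute("SELECT * FROM firsttable"):
    print(row)
conn.close()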

Hence, in this way, we are able to collect web data efficiently into a database.

