Collecting data with Scrapy

Last Updated : 24 Apr, 2023

Scrapy is a web scraping library used to scrape, parse, and collect web data. Once a spider has scraped the data, we have to decide whether to:

Keep the data.
Drop the data or items.
Stop and store the processed data items.

For all of these tasks we have the pipelines.py file, which handles the scraped data through a series of components (classes) that are executed sequentially. In this article we will learn how the pipelines.py file is used to collect the data scraped by Scrapy and store it in a SQLite3 database.

Initializing Directory and setting up the Project

Let’s first create a Scrapy project. Make sure that Python and pip are installed on the system, then run the commands below one by one to create a Scrapy project similar to the one used in this article.

Start by creating a virtual environment in a folder named GFGScrapy and activating it.

# create a folder named GFGScrapy
mkdir GFGScrapy
cd GFGScrapy

# create a virtual environment in it
virtualenv .
cd Scripts

# activate it (Windows; on Linux/macOS run: source bin/activate)
activate
cd ..

Output:

Creating virtual environment

Now it’s time to create a Scrapy project. First check whether Scrapy is installed on the system; if it is not, install it with the command below.

Syntax:

pip install scrapy

Now to create a scrapy project use the below-given command and also create a spider.

# project name is scrapytutorial
scrapy startproject scrapytutorial  
cd scrapytutorial

# link is of the website we are looking to crawl
scrapy genspider spider_to_crawl https://quotes.toscrape.com

Once you have created the Scrapy project with the startproject command, the project directory looks like the one shown in the image.

Scrapy directory structure

The directory structure consists of the following path (sample)

C://<project-name>/<project-name>

In the above image, the project name is scrapytutorial and it has many files inside it as shown.

The files we are interested in are the spider_to_crawl.py file (where we describe the methods of our spider) and the pipelines.py file, where we describe the components that handle further processing of the scraped data. In simple terms, this file holds the methods used for further operations on the data.
The third most important file is settings.py, where we register the components created in pipelines.py along with their order.
The next most important file is items.py. It describes the form, or dictionary-like structure, in which data flows from spider_to_crawl.py to pipelines.py. Here we define the keys that will be present in each item.

Collecting data with Scrapy

Let’s have a look at the spider_to_crawl.py file inside the spiders folder. This is the file where we set the URL that our spider has to crawl, along with a method named parse() that describes what should be done with the data scraped by the spider.

This file is generated automatically by the “scrapy genspider” command used above and is named after the spider. The default generated file is shown below.

Default spider_to_crawl file’s structure

Note:

Note that we made some changes to the default file shown above: we commented out the allowed_domains line and edited start_urls (removed “http://”).
We do not need to install SQLite3 separately, as the sqlite3 module ships with Python’s standard library. We can simply import it and start using it.

Now that the project is ready, we can look at how the pipelines.py file is implemented to store the data scraped by the spider.

An item pipeline is a component written in the pipelines.py file that performs operations on the scraped data sequentially. Some of the operations we can perform on the scraped items are listed below:

Parse the scraped files or data.
Store the scraped data in databases.
Convert files from one format to another, e.g. to JSON.

To perform different operations on items we declare separate components (classes in the file), each consisting of methods that perform those operations. By default the pipelines file has one class named after the project. We can also create our own classes and define the operations they perform. If the pipelines file contains more than one class, then we should state their execution order explicitly.

Operations are performed sequentially, so we use the settings.py file to describe the order in which they should run, i.e. which operation is performed first and which next. This matters when we perform several operations on the items.

Each component (class) must have a method named process_item(), which Scrapy calls on that component for every item.

Syntax:

process_item(self, item, spider)

Parameters:

self : Reference to the pipeline object on which the method is called.

item : The item scraped by the spider (one item per call).

spider : The spider that scraped the item.

This method must return the item, modified or unmodified, so that it is passed on to the next component, or raise a DropItem exception if the item should be discarded.

This method can also call other methods of the class that modify or store the data.

Apart from these, we can also define our own methods (such as __init__()) to do other tasks, like creating a database to store data or converting data to other forms.
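
The return-or-drop behaviour of process_item() can be illustrated with a minimal sketch. The class name CleanQuotePipeline and its cleaning rule are hypothetical and not part of this article's project; the fallback DropItem class is only there so that the sketch also runs where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so this sketch also runs where Scrapy is not installed.
    class DropItem(Exception):
        pass


class CleanQuotePipeline:
    """Hypothetical component: trims quote text and drops empty items."""

    def process_item(self, item, spider):
        # Assumes 'Quote' holds a list of strings, as in this article's spider.
        texts = [t.strip() for t in item.get('Quote', []) if t.strip()]
        if not texts:
            # Raising DropItem discards the item; later components never see it.
            raise DropItem("missing quote text")
        item['Quote'] = texts
        # Returning the item passes it on to the next component.
        return item
```

An item with usable text is returned in cleaned form, while an item whose text is empty or only whitespace is dropped.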

Working of pipelines.py

Now let us look at how pipelines.py works:

First, our spider scrapes the web data and, in its parse() method, creates items (described in the items.py file) out of it. These items are then passed to the components in the pipelines.py file.
After receiving an item, Scrapy calls the components of the pipelines file in the sequential order given in the settings.py file. Each component uses its process_item() method to process the item.
Once processing is complete, the next item is passed on from the spider, and the same process continues until the scraping is finished.
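
The flow described above can be imitated in plain Python. This is only a simplified model of the idea, not Scrapy's actual implementation; the classes StripQuote and AddLength, the dictionary named registered, and the function run_pipelines() are all hypothetical.

```python
class StripQuote:
    """Hypothetical component: removes surrounding whitespace."""

    def process_item(self, item, spider):
        item['Quote'] = item['Quote'].strip()
        return item


class AddLength:
    """Hypothetical component: records the length of the quote."""

    def process_item(self, item, spider):
        item['Length'] = len(item['Quote'])
        return item


# Stand-in for the ITEM_PIPELINES setting: component -> priority number.
registered = {AddLength: 800, StripQuote: 300}

# Components are created once and ordered by ascending priority number.
components = [cls() for cls in sorted(registered, key=registered.get)]


def run_pipelines(item, spider=None):
    # Each component receives the item returned by the previous one.
    for component in components:
        item = component.process_item(item, spider)
    return item


result = run_pipelines({'Quote': '  hi  '})
print(result)
```

Because StripQuote has the lower number it runs first, so AddLength measures the already trimmed text.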

Registering the components

It is important to register every component we create in the pipelines file in the settings.py file of the project, under the ITEM_PIPELINES setting.

Syntax:

ITEM_PIPELINES = {

    'myproject.pipelines.component': <priority number>,

    # many other components

}

Here the priority number determines the order in which Scrapy calls the components: by convention it is an integer from 0 to 1000, and components with lower numbers run first.
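
For the project in this article, the registration could look like the fragment below. The dotted path is built from the project name (scrapytutorial) and the pipeline class name used later in pipelines.py; the value 300 is an assumption (it is the number Scrapy's project template suggests), and any valid priority would work for a single component.

```python
# settings.py (fragment)
ITEM_PIPELINES = {
    # 'dotted.path.to.Class': priority number (assumed value)
    'scrapytutorial.pipelines.ScrapytutorialPipeline': 300,
}
```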

Creating Items to be passed over files

One more thing to note is that we will require a description of what our item will contain in items.py file. Hence our items.py file contains the below-given code:

Python3

# Define here the models for your scraped items
 
import scrapy
 
class ScrapytutorialItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
     
    # only one field, which holds the quote.
    Quote = scrapy.Field()

We will need to import this class into our spider_to_crawl.py file. In this way, we can create items to be passed to the pipeline. Now we are clear on the idea of pipelines and how they work.

Implementing SQLite3

Now it’s time to learn how to use SQLite3 to create databases and tables in Python.

First, we write the __init__() method, since Python calls it automatically whenever an object of the class is created. In it we call two other methods, create_conn() and create_table(), which create the database connection and the table respectively.
In create_conn() we use the connect() method of the sqlite3 module to connect to the named database (the file is created if it does not exist) and keep a reference to the connection’s cursor.
In create_table() we write the SQL command and ask the cursor to execute it to create the table.
Finally, Scrapy calls process_item() for every item; it calls another method, which inserts the scraped item’s data into the table created in __init__().
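
Before wiring these steps into a pipeline, they can be tried on their own with the sqlite3 module. This is a minimal sketch: it uses an in-memory database (":memory:") so that no file is written, and the inserted quote text is made-up sample data.

```python
import sqlite3

# ":memory:" keeps the sketch from creating a file on disk.
conn = sqlite3.connect(":memory:")
curr = conn.cursor()

# Create the table, as create_table() does in the pipeline.
curr.execute("CREATE TABLE IF NOT EXISTS firsttable (Quote TEXT)")

# Insert one row; the ? placeholder lets sqlite3 escape the value safely.
curr.execute("INSERT INTO firsttable VALUES (?)", ("An example quote.",))
conn.commit()

# Read the rows back to confirm that the insert worked.
rows = curr.execute("SELECT Quote FROM firsttable").fetchall()
print(rows)

conn.close()
```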

In this way we can store the data in the database. To view the collected data we can use an online SQLite viewer, since most systems have no default program that opens this type of file. If you already have such a tool installed (a database browser, for example), you do not need the online viewer.
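
The stored rows can also be inspected from Python itself, without any extra tool. The helper below is a hypothetical sketch; it assumes the database file and the table firsttable with a Quote column already exist, as created by the pipeline shown later in this article.

```python
import sqlite3


def read_quotes(db_path="mydata.db"):
    """Return all stored quotes as a list of strings (hypothetical helper)."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("SELECT Quote FROM firsttable").fetchall()
    finally:
        # Close the connection even if the query fails.
        conn.close()
    # Each row is a one-element tuple; unwrap it.
    return [row[0] for row in rows]
```

After a crawl, calling read_quotes() from the folder that contains mydata.db returns the scraped quotes.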

Now we are ready to move to the example, in which we use all the techniques described above to create a database of our scraped data. We will scrape quotes from the site mentioned above and store them in our database using SQLite3 in the pipelines.py file. In other words, we will create a pipeline that receives data from the spider and inserts it into a table in the database.

Let’s begin with the code in the spider_to_crawl.py file. Here we declare our spider and give it the required URL so that it can scrape the page.

spider_to_crawl.py:

Python3

import scrapy
 
# importing the items structure described 
# in items.py file
from ..items import ScrapytutorialItem
 
 
class SpiderToCrawlSpider(scrapy.Spider):
    name = 'spider_to_crawl'
    #allowed_domains = ['https://quotes.toscrape.com/']
    start_urls = ['https://quotes.toscrape.com/']
 
    def parse(self, response):
       
        # this XPath was found by pressing ctrl+f in the browser's
        # developer tools and testing expressions against the page
        Quotes_all = response.xpath('//div/div/div/span[1]')

        for quote in Quotes_all:
            # create a fresh item for every quote so that items
            # already yielded are not overwritten
            item = ScrapytutorialItem()

            # extract() returns a list of the matched text nodes
            item['Quote'] = quote.css('::text').extract()

            # yielding the item hands it to the pipeline components
            yield item

We are now adding the pipeline methods below which are to be written in the pipelines.py File so that the database will be created.

pipelines.py file

Python3

from itemadapter import ItemAdapter
import sqlite3
 
 
class ScrapytutorialPipeline(object):
 
    # init method to initialize the database and
    # create connection and tables
    def __init__(self):
       
        # Creating connection to database
        self.create_conn()
         
        # calling method to create table
        self.create_table()
 
    # create connection method to create database
    # or use database to store scraped data
    def create_conn(self):
       
        # connecting to database.
        self.conn = sqlite3.connect("mydata.db")
         
        # collecting reference to cursor of connection
        self.curr = self.conn.cursor()
 
 
    # Create table method using SQL commands to create table
    def create_table(self):
        self.curr.execute("""DROP TABLE IF EXISTS firsttable""")
        self.curr.execute("""create table firsttable(
                        Quote text
                        )""") 
 
    # store items to databases.
    def process_item(self, item, spider):
        self.putitemsintable(item)
        return item
 
    def putitemsintable(self, item):
       
        # extracting item and adding to table using SQL commands.
        self.curr.execute("""insert into firsttable values (?)""", (
            item['Quote'][0],
        ))
        self.conn.commit()  # saving (committing) the inserted row.

The items.py and settings.py files should look like this:

items.py and settings.py

After this use the given command to scrape and collect the data.

Syntax:

scrapy crawl <spider_name>

After running the command “scrapy crawl spider_to_crawl”, processing takes place in the following manner:

In spider_to_crawl.py we wrote the code that tells our spider to go to the given URL, extract the data, create items from it, and pass those items to the pipelines.py file for further processing.
We also defined the item class that holds the data to be passed, in the items.py file of the project.
When the spider crawls, it collects data in item objects and passes them to the pipeline; what happens next is clear from the code above and the hints in its comments. The pipelines.py file creates a database and stores all the incoming items.

Here the __init__() method runs first, because Python calls it automatically when Scrapy creates the pipeline object. It calls the other methods that connect to the database and create the table.

Then process_item() calls a method named putitemsintable(), which stores the data in the database. After this method has run, the item is returned and the spider passes on the next item to be processed.
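
The pipeline in this article never closes its database connection. One possible variant, sketched below, moves the setup into open_spider() and the cleanup into close_spider(), two optional methods that Scrapy calls on a pipeline component when the spider starts and finishes. The class name SQLiteQuotePipeline and the db_path parameter are hypothetical; the table and column names are the ones used above.

```python
import sqlite3


class SQLiteQuotePipeline:
    """Hypothetical variant of ScrapytutorialPipeline."""

    def __init__(self, db_path="mydata.db"):
        self.db_path = db_path

    def open_spider(self, spider):
        # Called once when the spider is opened: connect and create the table.
        self.conn = sqlite3.connect(self.db_path)
        self.curr = self.conn.cursor()
        self.curr.execute("DROP TABLE IF EXISTS firsttable")
        self.curr.execute("CREATE TABLE firsttable (Quote TEXT)")

    def process_item(self, item, spider):
        # Store the first extracted text node of each quote.
        self.curr.execute(
            "INSERT INTO firsttable VALUES (?)", (item['Quote'][0],)
        )
        return item

    def close_spider(self, spider):
        # Called once when the spider is closed: save and release the file.
        self.conn.commit()
        self.conn.close()
```

Committing once in close_spider() instead of after every insert is a design choice in this sketch; it reduces disk writes, but rows are only saved if the crawl shuts down normally.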

Let’s see the output of the stored data after scraping the quotes.

Output: