
How to Convert Scrapy item to JSON?

Last Updated : 20 Jul, 2022

Prerequisite: Scrapy

Scrapy is a web scraping tool used to collect web data; it can also modify and store the data in whatever form we want. Whenever the Scrapy spider scrapes a page, the raw data is converted into Scrapy items, and those items are passed on to pipelines for further processing. In the pipelines, the items are converted to JSON data, which we can either print or save to a file. In this way, we can retrieve JSON data from web-scraped data.

Initializing Directory and Setting Up Project

Let’s first create a Scrapy project. Make sure that Python and pip are installed on the system, then run the commands below one by one to create a Scrapy project similar to the one used in this article.

  • Let’s first create a virtual environment in a folder named GFGScrapy and activate that virtual environment there.
# To create a folder named GFGScrapy
mkdir GFGScrapy
cd GFGScrapy

# making a virtual env there (in the current folder)
virtualenv .
cd Scripts

# activating it (on Windows)
activate
cd ..

After running all these commands, the virtual environment is created and activated.

  • Now it’s time to create a Scrapy project. For that, make sure that Scrapy is installed on the system. If it is not installed, install it using the command below.

Syntax:

pip install scrapy

Now use the commands below to create the Scrapy project and generate a spider.

# project name is scrapytutorial
scrapy startproject scrapytutorial

cd scrapytutorial

# the URL is the website the spider will crawl
scrapy genspider spider_to_crawl https://quotes.toscrape.com/

Once you have created the Scrapy project, the project directory contains the files described below. (Refer to the Scrapy documentation if you want to get more familiar with a Scrapy project's layout.)

The directory structure consists of the following path (sample):

C://<project-name>/<project-name>

Here, the project name is scrapytutorial, and the project folder contains several files, as listed below.
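For reference, a freshly generated project typically looks roughly like this (the exact layout can vary slightly between Scrapy versions):

scrapytutorial/
    scrapy.cfg               # deploy configuration file
    scrapytutorial/          # project's Python module
        __init__.py
        items.py             # item definitions
        middlewares.py       # project middlewares
        pipelines.py         # project pipelines
        settings.py          # project settings
        spiders/             # folder where spiders are kept
            __init__.py
            spider_to_crawl.py   # created by "scrapy genspider"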

The files we are interested in are the spider_to_crawl.py file (where we describe the methods for our spider) and the pipelines.py file (where we describe the components that handle the further processing of the scraped data). In simple terms, pipelines.py describes the methods used for further operations on the data. The third important file is settings.py, where we register our components (created in the pipelines.py file) in order. The next important file is items.py. This file describes the form, or dictionary structure, in which data flows from spider_to_crawl.py to the pipelines.py file. Here we declare the keys that will be present in each item.

Let’s have a look at the spider_to_crawl.py file inside our spiders folder. This is the file where we write the URL that our spider has to crawl, as well as a method named parse(), which describes what should be done with the data scraped by the spider.

This file is automatically generated by the “scrapy genspider” command used above and is named after the spider. Below is the default file it generates.

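A sketch of what the generated file typically looks like (the exact contents can vary slightly between Scrapy versions):

Python3

import scrapy


class SpiderToCrawlSpider(scrapy.Spider):
    name = 'spider_to_crawl'
    allowed_domains = ['https://quotes.toscrape.com/']
    start_urls = ['http://https://quotes.toscrape.com//']

    def parse(self, response):
        pass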

Note that we made some changes to the default file above, i.e., we commented out the allowed_domains line and also edited start_urls (removed the extra “http://”).

Converting Scrapy items to JSON

Pipelines are the components through which we can convert, modify, or store the scraped data items. Let’s first talk about some of their parts.

The default pipelines.py file looks like the one shown below:
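A sketch of the default file (the exact comments may vary by Scrapy version):

Python3

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

from itemadapter import ItemAdapter


class ScrapytutorialPipeline:
    def process_item(self, item, spider):
        return item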

To perform different operations on items, we declare separate components (classes in this file), each consisting of the methods used to perform those operations. By default, the pipelines file has one class named after the project. We can also create our own classes to define the operations they have to perform.

Each component (class) in the pipelines.py file has one default method named process_item().

Syntax:

process_item(self, item, spider):

This method takes three arguments: one is a reference to the self object, another is the item of scraped data sent by the spider, and the third is the spider itself. This method is used to modify or store the data items scraped by the spider; the way the received items are to be modified must be described in this method.

This is the default method, and it is always called on each item passing through a class in the pipelines.py file.
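For instance, a minimal, hypothetical component that simply drops items with an empty Quote field (the class name and field are only illustrative) could look like this:

Python3

from scrapy.exceptions import DropItem


class DropEmptyQuotesPipeline:
    def process_item(self, item, spider):
        # discard items whose 'Quote' field is missing or empty
        if not item.get('Quote'):
            raise DropItem("Missing quote text")
        return item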

Besides the default process_item(), we can also create our own methods to modify the data items or make other changes to them. Since we have to convert our scraped data to JSON format, we need a component (class) that does this work. But before that, we have to do two main things.

1) First, we have to register the name of the pipeline component in our settings.py file. The syntax is given below.

Syntax:

ITEM_PIPELINES = {

    'myproject.pipelines.ComponentName': <priority number>,

    # many other components

}

Here, the priority number determines the order in which the components are called by Scrapy: lower numbers run earlier (values are usually kept in the 0–1000 range).

Hence, for the above project, the entry below is used.

ITEM_PIPELINES = {

    'scrapytutorial.pipelines.ScrapytutorialPipeline': 300,

}

2) The other thing we have to do is declare the format of the item that we use to pass our data to the pipeline. For that, we use the items.py file.

The code below, in our items.py file, creates an item with a single key named “Quote”. We then import this file in our spider_to_crawl.py file (shown in the example).

Python3




# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ScrapytutorialItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    Quote = scrapy.Field()  # only one field, that of Quote


The above code creates an item with only one key; we can create items with many keys.

Now that we have seen how to implement components in the pipelines.py file, how the settings are configured, and how items are declared, we are ready for an example in which we convert our scraped data items to JSON format. To convert the data to JSON we will use Python's json library and its dumps() method. The idea is that we receive the scraped data in the pipelines.py file, open a file, and write all the JSON data into it. The methods involved are:

  • open_spider() is called to open the file (result.json) when the spider starts crawling.
  • close_spider() is called to close the file when the spider is closed and scraping is over.
  • process_item() is always called (since it is the default) and is mainly responsible for converting the data to JSON format and writing it to the file. The idea is similar to how Python web frameworks convert back-end data to JSON and other formats.

Hence, the code in our pipelines.py looks like:

Python3




from itemadapter import ItemAdapter
import json  # json module from the Python standard library


class ScrapytutorialPipeline:
    def process_item(self, item, spider):  # default method
        # calling dumps() to create the JSON data; the item is converted
        # to a dict first, since a Scrapy Item is not directly serializable
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)  # writing the line to the output file
        return item

    def open_spider(self, spider):
        self.file = open('result.json', 'w')

    def close_spider(self, spider):
        self.file.close()
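As an aside, Scrapy can also export items to JSON on its own through feed exports, e.g. scrapy crawl spider_to_crawl -o result.json, without any custom pipeline; here we write the pipeline ourselves to see how the conversion works.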


Our spider_to_crawl.py file looks like this:

Python3




import scrapy
from ..items import ScrapytutorialItem


class SpiderToCrawlSpider(scrapy.Spider):
    name = 'spider_to_crawl'

    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):

        # creating the item object
        items = ScrapytutorialItem()
        Quotes_all = response.xpath('//div/div/div/span[1]')

        # these paths are based on the page's selectors

        for quote in Quotes_all:  # extracting data
            # extract() returns a list of text strings
            items['Quote'] = quote.css('::text').extract()
            yield items


Our settings.py file registers the pipeline component as described earlier (see the excerpt below), and our items.py file is the one defined above with the single Quote field.
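For reference, the relevant part of settings.py for this project (the rest of the generated defaults stay unchanged; the excerpt below is a sketch):

Python3

# settings.py (relevant excerpt)
BOT_NAME = 'scrapytutorial'

SPIDER_MODULES = ['scrapytutorial.spiders']
NEWSPIDER_MODULE = 'scrapytutorial.spiders'

# register our pipeline component; lower numbers run earlier
ITEM_PIPELINES = {
    'scrapytutorial.pipelines.ScrapytutorialPipeline': 300,
}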

After running the command “scrapy crawl spider_to_crawl”, the following steps take place.

  • The spider is crawled, and the result.json file is created (opened). The spider scrapes the web page and collects the data in the Quotes_all variable. Each piece of data is then extracted from the variable and passed to the item declared in the file as the value of the key, i.e., Quote. Finally, in the yield, we send the item to the pipelines.py file for further processing.
  • In the pipelines.py file we receive the item variable from the spider; it is converted to JSON using the dumps() method, and the output is written to the opened file.
  • The file is then closed, and we can see the output.

  • A result.json file is created in the project directory.

Output:
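The result.json file contains one JSON object per line. Assuming the page content has not changed, the first few lines look roughly like this (abridged; the values are lists because extract() returns a list, and non-ASCII quote marks are escaped by json.dumps()):

{"Quote": ["\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d"]}
{"Quote": ["\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d"]}
...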


