Writing Scrapy Python Output to JSON file

Last Updated : 14 Sep, 2021

In this article, we will see how to write Scrapy output to a JSON file in Python.

Using the Scrapy command-line tool

The easiest way to save scraped data to a JSON file is to use the following command:

scrapy crawl <spiderName> -O <fileName>.json

This will generate a file with the given file name containing all the scraped data.

Note that -O overwrites any existing file with that name, whereas -o appends the new content to the existing file. However, appending to a .json file leaves it with invalid JSON. To append data to an existing file, use the JSON Lines format instead:

scrapy crawl <spiderName> -o <fileName>.jl

Note: .jl is the file extension for the JSON Lines format, which stores one JSON object per line.
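The difference between the two formats can be illustrated with Python's standard json module. The following is a minimal sketch, not Scrapy's own code: the two "runs" and their items are placeholder data invented for illustration, and joining the two serialized arrays only approximates what appending to a .json file produces.

```python
import json

# Placeholder items standing in for two separate crawl runs (made-up data).
first_run = [{"text": "placeholder quote 1", "author": "Author A", "tags": ["x"]}]
second_run = [{"text": "placeholder quote 2", "author": "Author B", "tags": []}]

# Appending a second JSON array after the first gives "[...]\n[...]",
# which is not a single valid JSON document.
appended_json = json.dumps(first_run) + "\n" + json.dumps(second_run)
try:
    json.loads(appended_json)
    appended_is_valid = True
except json.JSONDecodeError:
    appended_is_valid = False

# JSON Lines: one object per line, so appended lines remain parseable.
appended_jl = "".join(json.dumps(item) + "\n" for item in first_run + second_run)
items = [json.loads(line) for line in appended_jl.splitlines() if line.strip()]

print(appended_is_valid)  # False
print(len(items))         # 2
```

Because each line of a .jl file is parsed independently, a file that has been appended to several times can still be read line by line.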

Stepwise implementation:

Step 1: Creating the project

To start a new Scrapy project, use the following command:

scrapy startproject tutorial

This will create a tutorial directory containing the scrapy.cfg configuration file and the project's Python package, which includes a spiders directory.

Move to the tutorial directory we created using the following command:

cd tutorial

Step 2: Creating a spider (tutorial/spiders/quotes_spider.py)

Spiders are classes that the user defines and that Scrapy uses to scrape information from one or more websites. Create a file named quotes_spider.py under the tutorial/spiders directory in your project and add the following code for our spider:

Python3

import scrapy


class QuotesSpider(scrapy.Spider):
    # Scrapy identifies the spider by this attribute;
    # it must be called 'name'
    name = "quotes"

    # URLs to start crawling from;
    # the attribute must be called 'start_urls'
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        # Default callback: Scrapy calls 'parse' with the response
        # downloaded for each URL in start_urls
        for quote in response.css('div.quote'):
            # Each yielded dict becomes one item in the output file
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
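As an alternative to passing -O on the command line, the same output can be configured from a Python script through Scrapy's FEEDS setting. The sketch below is one possible way to do this, not part of the original tutorial: the script and the helper run_spider are hypothetical names, the output file name quotes.json is an assumption, and the "overwrite" option assumes a reasonably recent Scrapy release (2.4 or later). The script is meant to be run from the project root, the directory containing scrapy.cfg.

```python
# Hypothetical helper script, placed next to scrapy.cfg.

# Feed export configuration: output path -> export options.
# "overwrite": True mirrors the behaviour of the -O command-line flag.
FEEDS = {
    "quotes.json": {"format": "json", "overwrite": True},
}


def run_spider(spider_name="quotes"):
    """Run the named spider and write its items to the files in FEEDS."""
    # Imported here so that the FEEDS mapping above can be inspected
    # even where Scrapy is not installed.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()  # loads tutorial/settings.py
    settings.set("FEEDS", FEEDS)       # add the JSON feed export
    process = CrawlerProcess(settings)
    process.crawl(spider_name)         # spider is looked up by its 'name'
    process.start()                    # blocks until the crawl finishes
```

Calling run_spider() from the project root should then behave like scrapy crawl quotes -O quotes.json. Changing the "format" value to "jsonlines" and the file name to quotes.jl would give JSON Lines output instead.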