
Scraping a JSON response with Scrapy

Last Updated : 21 Mar, 2023

Scrapy is a popular Python library for web scraping that provides an easy and efficient way to extract data from websites for tasks such as data mining and information processing. Besides acting as a general-purpose web crawler, Scrapy can also be used to retrieve data via APIs.

One of the most common data formats returned by APIs is JSON, which stands for JavaScript Object Notation. In this article, we’ll look at how to scrape a JSON response using Scrapy.

To install Scrapy, run the following command in your terminal:

pip install scrapy
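
If the installation succeeded, the scrapy command will be available; you can confirm it by printing the installed version:

scrapy version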

Example

Now we’ll look at an example that extracts data from the public Bored API endpoint (https://www.boredapi.com/api/activity).

Here’s what the actual data returned looks like:

{
  "activity": "Learn calligraphy",
  "type": "education",
  "participants": 1,
  "price": 0.1,
  "link": "",
  "key": "4565537",
  "accessibility": 0.1
}
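
Before writing the spider, you can preview this response yourself, for example with curl (or by opening the URL in a browser):

curl https://www.boredapi.com/api/activity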

Python3




# import modules
import scrapy
import json


class Spider(scrapy.Spider):
    name = "bored"
    # API endpoint that returns a random activity as JSON
    url = "https://www.boredapi.com/api/activity"

    def start_requests(self):
        # send a single request to the API endpoint
        yield scrapy.Request(self.url, self.parse)

    def parse(self, response):
        # load the JSON response body into a Python dictionary
        data = json.loads(response.text)

        # extract the fields we are interested in
        activity = data["activity"]
        activity_type = data["type"]
        participants = data["participants"]

        # yield the extracted values as an item
        yield {"Activity": activity, "Type": activity_type,
               "Participants": participants}


Explanation:

Here we have a Scrapy spider named Spider. The spider has 3 main parts:

  • The name variable – sets the name of the spider to “bored”.
  • The start_requests method – initiates the request to the API endpoint at “https://www.boredapi.com/api/activity”. The method yields a Scrapy request object and passes it to the parse method.
  • The parse method – handles the response from the API endpoint. It loads the JSON response body into a Python dictionary using the json.loads function, extracts the values of the “activity”, “type”, and “participants” keys into local variables, and finally yields a dictionary with the activity, type, and participants as keys and their corresponding values (a shorter variant using response.json() is sketched after this list).
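
If you are on Scrapy 2.2 or newer, the response object also exposes a json() helper that parses the body for you, so the explicit json.loads call (and the json import) is not needed. Here is a minimal sketch of the same parse method under that assumption; it is a drop-in replacement for the parse method above:

    def parse(self, response):
        # response.json() deserializes the JSON body (available in Scrapy 2.2+)
        data = response.json()
        yield {
            "Activity": data["activity"],
            "Type": data["type"],
            "Participants": data["participants"],
        }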

To run this spider, type the following into your terminal (replace <file name> with the name of the file containing the spider):

scrapy runspider <file name>

Output:

[Screenshot: the output of the above command]

Now, this output will contain a lot of unnecessary log lines, so it is better to store the parsed items in a separate file. You can do this by adding the -o option to the command, followed by the name of the output file. The -L ERROR option suppresses all log output other than error messages.
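
Putting both options together, the command looks like this (activity.json is just an example output file name):

scrapy runspider <file name> -o activity.json -L ERROR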

activity.json looks like this:
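
Since the API returns a random activity on each request, the exact contents will vary from run to run, but with the sample response shown earlier the file would contain a JSON list with the single scraped item:

[
{"Activity": "Learn calligraphy", "Type": "education", "Participants": 1}
]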

 


