Writing Scrapy Python Output to JSON file
In this article, we will see how to write Scrapy output to a JSON file in Python.
Using the Scrapy command line
The easiest way to save scraped data to JSON is with the following command:
scrapy crawl <spiderName> -O <fileName>.json
This will generate a file with the provided file name containing all scraped data.
Note that -O overwrites any existing file with that name, whereas -o appends the new content to the existing file. However, appending to a JSON file makes its contents invalid JSON, so to append data to an existing file, use the JSON Lines format instead:
scrapy crawl <spiderName> -o <fileName>.jl
Note: .jl is the file extension for the JSON Lines format, which stores one JSON object per line.
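To see why appending breaks a .json file but not a .jl file, here is a small sketch that uses only Python's standard json module. The two item lists are made-up stand-ins for the output of two separate crawls, and the strings only approximate what ends up on disk.

```python
import json

# Two batches of items, standing in for the output of two separate crawls
# (hypothetical data, not real scraped output).
first_run = [{"text": "quote one", "author": "A"}]
second_run = [{"text": "quote two", "author": "B"}]

# Appending a second JSON array to a .json file leaves "[...]" followed by
# another "[...]", which is not a single valid JSON document.
appended_json = json.dumps(first_run) + "\n" + json.dumps(second_run)
try:
    json.loads(appended_json)
    json_is_valid = True
except json.JSONDecodeError:
    json_is_valid = False

# JSON Lines stores one object per line, so appended lines stay parseable.
appended_jl = "\n".join(json.dumps(item) for item in first_run + second_run)
items = [json.loads(line) for line in appended_jl.splitlines() if line.strip()]

print("appended .json parses:", json_is_valid)
print("items read from .jl:", len(items))
```

Because each line of a .jl file is parsed independently, the file can also be read one item at a time without loading everything into memory.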
Step 1: Creating the project
To start a new Scrapy project, use the following command:
scrapy startproject tutorial
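This will create a tutorial directory with the following contents:
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # directory where the spiders go
            __init__.py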
Move into the tutorial directory we just created using the following command:
cd tutorial
Step 2: Creating a spider (tutorial/spiders/quotes_spider.py)
Spiders are classes that you define and that Scrapy uses to scrape information from one or more websites. Create a file named quotes_spider.py under the tutorial/spiders directory in your project to hold the spider's code.
This is a simple spider to get the quotes, author names, and tags from the website.
Step 3: Running the program
To run the spider and save the crawled data to JSON, use:
scrapy crawl quotes -O quotes.json
A file named quotes.json is created in the directory where the command was run. This file contains all the scraped data.
It holds one entry for each quote scraped by our spider, with the quote, its author name, and its tags.
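One way to inspect the result is to load the file back with Python's json module. The sketch below assumes that quotes.json is in the current directory and that each item has author and text fields; both helper functions (load_quotes and preview) are illustrative names, not part of Scrapy.

```python
import json
from pathlib import Path


def load_quotes(path):
    """Return the list of scraped items stored in a JSON feed file."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def preview(items, limit=3):
    """Return one short summary line per item, up to `limit` items."""
    lines = []
    for item in items[:limit]:
        # "author" and "text" are assumed field names; .get() avoids a
        # KeyError if the spider uses different keys.
        lines.append(f"{item.get('author')}: {item.get('text')}")
    return lines


if __name__ == "__main__":
    feed = Path("quotes.json")  # assumed output file name
    if feed.exists():
        for line in preview(load_quotes(feed)):
            print(line)
```

If json.load raises a JSONDecodeError here, the file was most likely appended to with -o instead of being overwritten with -O.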