Saving scraped items to JSON and CSV file using Scrapy

Last Updated: 09 Aug, 2021

In this article, we will see how to crawl a webpage with Scrapy and export the scraped data to JSON and CSV formats. We will scrape data from a webpage using a Scrapy spider and export the same data to two different file formats.

Here we will extract data from the link http://quotes.toscrape.com/tag/friendship/. This website is provided by the makers of Scrapy for learning about the library. Let us understand the approach step by step:

Step 1: Create a Scrapy project

Execute the following command at the terminal to create a Scrapy project –

scrapy startproject gfg_friendshipquotes 

This will create a new directory called “gfg_friendshipquotes” in your current directory. Now change into the newly created folder.



The folder structure of ‘gfg_friendshipquotes’ is displayed below. Keep the contents of the configuration files as they are for now.

Step 2: To create a spider file, we use the command ‘genspider’. Note that the genspider command must be executed at the same directory level where the scrapy.cfg file is present. The command is –

scrapy genspider spider_filename “url_of_page_to_scrape” 

Now, execute the following at the terminal:

scrapy genspider gfg_friendquotes “quotes.toscrape.com/tag/friendship/”

This creates a spider Python file called “gfg_friendquotes.py” in the spiders folder:



The default code of the gfg_friendquotes.py file is as follows:

Python




# Import the required library
import scrapy

# Spider class
class GfgFriendquotesSpider(scrapy.Spider):

    # The name of the spider
    name = 'gfg_friendquotes'

    # The domain the spider is allowed to crawl
    allowed_domains = ['quotes.toscrape.com']

    # The URL of the webpage, data from which
    # will get scraped
    start_urls = ['http://quotes.toscrape.com/tag/friendship/']

    # Default callback method, which will hold
    # the code for navigating and gathering
    # the data from tags
    def parse(self, response):
        pass

Step 3: Now, let’s analyze the XPath expressions for the required elements. If you visit the link http://quotes.toscrape.com/tag/friendship/, it looks as follows:

URL of the page that we will scrape

We are going to scrape the friendship quote titles, authors, and tags. When you right-click on the first quote block and select the Inspect option, you can notice that it belongs to the class “quote”. As you hover over the rest of the quote blocks, you can see that all the quotes on the webpage have the CSS class attribute “quote”.

Right-Click, Inspect, check CSS attributes of first Quote block

To extract the text of the quote, right-click on the first quote and select Inspect. The title/text of the quote belongs to the CSS class attribute “text”.

Right-Click first Title, Inspect, check CSS class attributes 

To extract the author name of the quote, right-click on the first name and select Inspect. It belongs to the CSS class “author”. An itemprop attribute with the same value, “author”, is defined here as well. We will use this attribute in our code.

 Right-Click on author name to get its CSS attributes

To extract the tags of the quote, right-click on the first tag and select Inspect. A single tag belongs to the CSS class “tag”. Together, the tags have an itemprop attribute “keywords” defined. They also have a “content” attribute with all the tags in one line. If you observe, the actual text of the tags is present inside <a> hyperlink elements, so fetching from the “content” attribute is easier.

 Right-Click on Tags  to get its CSS attributes
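
Before wiring these expressions into the spider, you can sanity-check the XPath patterns yourself. The sketch below is an illustration, not code from this project: it runs the same patterns against a tiny hand-written fragment (not the live site) using the standard library's ElementTree, which supports this XPath subset. The markup and values are made up.

```python
# Illustrative check of the XPath patterns, run against a hand-written
# fragment (NOT the live site) using only the standard library.
import xml.etree.ElementTree as ET

# Minimal, made-up markup mimicking one quote block on the page
snippet = """
<div>
  <div class='quote'>
    <span class='text'>A sample quote.</span>
    <small itemprop='author'>Some Author</small>
    <meta itemprop='keywords' content='friendship,love' />
  </div>
</div>
"""

root = ET.fromstring(snippet)
# Same patterns as the spider: class='quote', class='text',
# itemprop='author', and itemprop='keywords'/@content
for quote in root.findall(".//*[@class='quote']"):
    title = quote.find(".//*[@class='text']").text
    author = quote.find(".//*[@itemprop='author']").text
    tags = quote.find(".//*[@itemprop='keywords']").get('content')
    print(title, '|', author, '|', tags)
```

The same kind of interactive check can also be done in the Scrapy shell against the real page, but the fragment above keeps the example self-contained.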

The final code for the spider file, after including the XPath expressions, is as follows –

Python3




# Import the required libraries
import scrapy


# Default class created when we run the "genspider" command
class GfgFriendquotesSpider(scrapy.Spider):
    # Name of the spider, as mentioned in the "genspider" command
    name = 'gfg_friendquotes'
    # Domains allowed for scraping
    allowed_domains = ['quotes.toscrape.com']
    # URL(s) to scrape, as mentioned in the "genspider" command;
    # the Scrapy spider starts making requests to the URLs mentioned here
    start_urls = ['http://quotes.toscrape.com/tag/friendship/']

    # Default callback method, responsible for processing the response
    # and returning the scraped output
    def parse(self, response):
        # XPath expression for all the quote elements;
        # all quotes have the CSS class attribute 'quote'
        quotes = response.xpath('//*[@class="quote"]')
        # Loop through the quotes to get the required elements' data
        for quote in quotes:
            # XPath expression to fetch the 'title' of the quote;
            # the title has the CSS class attribute 'text'
            title = quote.xpath('.//*[@class="text"]/text()').extract_first()
            # XPath expression to fetch the 'author name' of the quote;
            # the author name has the itemprop attribute 'author'
            author = quote.xpath(
                './/*[@itemprop="author"]/text()').extract_first()
            # XPath expression to fetch the 'tags' of the quote;
            # the tags have the itemprop attribute 'keywords'
            tags = quote.xpath(
                './/*[@itemprop="keywords"]/@content').extract_first()
            # Yield the scraped item
            yield {'Text': title,
                   'Author': author,
                   'Tags': tags}

Scrapy allows the extracted data to be stored in formats like JSON, CSV, and XML. This tutorial shows two methods of doing so. One can write the following command at the terminal:



scrapy crawl “spider_name” -o store_data_extracted_filename.file_extension

Alternatively, one can export the output to a file, by mentioning FEED_FORMAT and FEED_URI in the settings.py file. 
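
As an aside, in newer Scrapy releases (2.1 and later) the FEED_FORMAT and FEED_URI settings are deprecated in favour of a single FEEDS dictionary. If you are on a recent version, the equivalent settings.py entry would look roughly like this (a sketch, not part of this tutorial's project files):

```python
# Newer-Scrapy equivalent of FEED_FORMAT/FEED_URI (Scrapy >= 2.1):
# one FEEDS dict mapping each output file to its options
FEEDS = {
    'friendshipfeed.json': {'format': 'json'},
}
```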

Creating a JSON file:

For storing the data in a JSON file, one can follow either of the methods mentioned below. Write the following command at the terminal:

scrapy crawl gfg_friendquotes -o friendshipquotes.json

Alternatively, we can mention FEED_FORMAT and FEED_URI in the settings.py file. The settings.py file should be as follows:

Python




BOT_NAME = 'gfg_friendshipquotes'
 
SPIDER_MODULES = ['gfg_friendshipquotes.spiders']
NEWSPIDER_MODULE = 'gfg_friendshipquotes.spiders'
 
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
 
# Desired file format
FEED_FORMAT = "json"
 
# Name of the file where
# data extracted is stored
FEED_URI = "friendshipfeed.json"

Output:

Using either of the methods above, the JSON file is generated in the project folder as:

The extracted data, exported to JSON  files

The expected JSON file looks as follows:
 

The Exported JSON data, crawled by spider code
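
To verify the export programmatically, the file can be loaded back with the standard json module. The snippet below parses an illustrative string shaped like the spider's output; the quote shown is made up, not data from the site.

```python
import json

# Illustrative sample shaped like the spider's JSON export;
# the values are made up, not real data from quotes.toscrape.com
sample = ('[{"Text": "A sample quote.", '
          '"Author": "Some Author", '
          '"Tags": "friendship,love"}]')

# The export is a JSON array of objects, one per yielded item
items = json.loads(sample)
for item in items:
    print(item['Author'], '-', item['Tags'])
```

In practice you would pass the exported file to json.load instead of parsing a string.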

Creating a CSV file:

For storing the data in a CSV file, one can follow either of the methods mentioned below.



Write the following command at the terminal: 

scrapy crawl gfg_friendquotes -o friendshipquotes.csv

Alternatively, we can mention FEED_FORMAT and FEED_URI in the settings.py file. The settings.py file should be as follows:

Python




BOT_NAME = 'gfg_friendshipquotes'
 
SPIDER_MODULES = ['gfg_friendshipquotes.spiders']
NEWSPIDER_MODULE = 'gfg_friendshipquotes.spiders'
 
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
 
# Desired file format
FEED_FORMAT = "csv"
 
# Name of the file where data extracted is stored
FEED_URI = "friendshipfeed.csv"

Output:

The CSV file is generated in the project folder as:

The exported files are created in your scrapy project structure

The exported CSV file looks as follows:

The Exported CSV data, crawled by spider code
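
The exported CSV can likewise be read back with the standard csv module. A minimal sketch, using an in-memory sample in the same Text/Author/Tags layout (the row is made up, not real site data):

```python
import csv
import io

# In-memory sample mirroring the export's header and one made-up row;
# the Tags field is quoted because it contains a comma
sample = ('Text,Author,Tags\n'
          '"A sample quote.","Some Author","friendship,love"\n')

# DictReader maps each row to the header fields, matching the
# keys yielded by the spider
reader = csv.DictReader(io.StringIO(sample))
for row in reader:
    print(row['Text'], '|', row['Author'], '|', row['Tags'])
```

In practice you would open the exported file and pass the file object to csv.DictReader instead of a StringIO.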
