Scraping dynamic content using Python-Scrapy

Suppose we are reading content from a source such as a website and want to save that data on our own device. We could copy it into a notebook or a text file for reuse in later work. Doing so is already a form of scraping: when no ready-made source or database is available, the data is extracted by brute force from documents, sites, and code.

Many tools now exist for scraping a site. This example uses Scrapy, a Python framework for extracting structured data or information from web pages.

Installation 

First, check that Python, Scrapy, and VS Code (or a similar editor) are installed on the computer. After that, there are two ways to set up the project. The first uses a virtual environment (in Python, a venv is an isolated development environment), and the second does not.

With venv: In this case, the source command is used to enter the virtual environment, and Scrapy is installed inside it. The pacman commands assume an Arch-based Linux distribution.

Command to install Python –> sudo pacman -S python

Command to install VS Code –> sudo pacman -S code

Command to install Scrapy system-wide –> sudo pacman -S scrapy

Command to create a virtual environment –> python3.9 -m venv venv

Command to activate the virtual environment –> source venv/bin/activate

Command to install Scrapy into the virtual environment's Python packages –> pip install scrapy

Without venv: In this case, pacman and pip install the packages directly on the system.

Command to install Python –> sudo pacman -S python

Command to install VS Code –> sudo pacman -S code

Command to install Scrapy system-wide –> sudo pacman -S scrapy

Command to install Scrapy into the Python packages –> pip install scrapy
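Whichever route is used, it can help to confirm what the current interpreter actually sees before moving on. The following is a minimal sketch using only the standard library; the function names in_virtualenv and scrapy_available are made up for this illustration and are not part of Scrapy.

```python
import importlib.util
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def scrapy_available() -> bool:
    """Return True when the scrapy package can be found by this interpreter."""
    return importlib.util.find_spec("scrapy") is not None


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("scrapy importable:", scrapy_available())
```

Running the script once before and once after source venv/bin/activate should show the first value change, assuming the environment was created with python -m venv.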

Getting Started

After installing Scrapy, choose a location on your computer for the project, open a terminal there, and run scrapy startproject [name of project], which creates the Scrapy project.

With venv and without venv:

Command to start a Scrapy project –> scrapy startproject example_gfg

After the project directory is created, enter it.

Command to enter the project directory –> cd example_gfg
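To check that the project was generated as expected, the directory can be compared against the files that scrapy startproject normally creates. This is a rough sketch: the list assumes the default project template and the project name example_gfg, and the helper missing_project_files is a hypothetical name introduced here.

```python
from pathlib import Path

# Entries that `scrapy startproject example_gfg` is expected to create
# (default template assumed), relative to the outer example_gfg directory.
EXPECTED = [
    "scrapy.cfg",
    "example_gfg/__init__.py",
    "example_gfg/items.py",
    "example_gfg/middlewares.py",
    "example_gfg/pipelines.py",
    "example_gfg/settings.py",
    "example_gfg/spiders/__init__.py",
]


def missing_project_files(project_root: Path) -> list:
    """Return the expected entries that do not exist under project_root."""
    return [rel for rel in EXPECTED if not (project_root / rel).exists()]


if __name__ == "__main__":
    # Run from inside the outer example_gfg directory.
    print(missing_project_files(Path.cwd()) or "project layout looks complete")
```

An empty result means every expected file was found. Anything listed is missing relative to the directory the script was run from.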

The project contains a directory called spiders. According to the documentation, spiders are the classes that perform the scraping of sites.

Each spider has a name, a start_urls list, and methods.

cd example_gfg/spiders

Python3

import scrapy


class python_Spider(scrapy.Spider):
    # Unique name used to run the spider with `scrapy crawl <name>`
    name = ""
    # URLs that Scrapy requests first
    start_urls = []

The skeleton above is completed by giving the spider a name and a start URL. A spider that extracts the year's events from the Python site could be named python_events, for example. Both the name and the start URLs can be changed to suit the site being scraped.

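Python3

import scrapy


class python_Spider(scrapy.Spider):
    # Name used on the command line: scrapy crawl geeksforgeeks_article
    name = "geeksforgeeks_article"

    # The URL list is empty in the source; add the page(s) to scrape here
    start_urls = [
    ]

    def parse(self, response):
        # Called with the response of each start URL; extraction goes here
        pass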

Scrapy calls the parse method with the response for each requested page, and this method is where data is extracted from the site. To scrape a site, it is necessary to understand requests, responses, and CSS and XPath selectors.
 

  • Request: An object that represents a call for a page or other data.
  • Response: The answer obtained for a Request.
  • Selector: A way of selecting a part or tag of a site's HTML for extraction.
  • Scrapy offers two kinds of selectors:
    • XPath: A query language for navigating documents that are made of tags.
    • CSS: Cascading Style Sheets selectors, which find tags in HTML by name, id, or class.

Inside the loop of the parse method, yield is used to produce a dictionary with the title and link of each item. yield is a reserved word in Python that pauses the function, hands back a value, and resumes from the same point the next time a value is requested.

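Python3

import scrapy


class python_Spider(scrapy.Spider):
    # Must match the name passed to `scrapy crawl`
    name = "geeksforgeeks_article"

    # The URL list is empty in the source; add the page(s) to scrape here
    start_urls = [
    ]

    def parse(self, response):
        # Visit every ordered list (<ol>) on the page
        for item in response.css('ol'):
            # Emit the text and href of the first link inside the list
            yield {
                'title': item.css('a::text').get(),
                'link': item.css('a::attr(href)').get(),
            }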
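
The pausing behavior of yield used in parse can be seen in isolation with a plain generator function. This is a minimal sketch without Scrapy; parse_items and the sample rows are hypothetical stand-ins for what the selectors would extract.

```python
def parse_items(rows):
    """Yield one dictionary per row instead of building a full list first."""
    for title, link in rows:
        # Execution pauses here and resumes on the next request for a value.
        yield {"title": title, "link": link}


# Hypothetical sample data standing in for what the selectors would extract.
rows = [("First article", "/first"), ("Second article", "/second")]

items = parse_items(rows)
first = next(items)       # runs the function body up to the first yield
remaining = list(items)   # resumes after the first yield and collects the rest

print(first)
print(remaining)
```

Scrapy consumes the generator returned by parse in a similar way, taking each yielded dictionary as one scraped item.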


 
 

Test Project with Scrapy Shell

Scrapy has a shell in which CSS selector commands can be tested.

Without venv and with venv:

scrapy shell "https://www.geeksforgeeks.org/data-structures/?ref=shm"
response.css("a").get()
response.css("title").get()
response.css("title::text").get()
response.css("a::text").get() 
response.css("a::attr(href)").get()

Demonstration

  • We wrote the code and tested it in the Scrapy shell.
  • We ran the code, that is, the spider.
  • We covered two ways of developing the project: without venv and with venv.

Without venv: Enter the project directory before running the command.

scrapy crawl geeksforgeeks_article

 

With venv: With the virtual environment activated, run the same command from inside the project directory.
 

scrapy crawl geeksforgeeks_article

 

We can store the data in a file, with the commands below:

scrapy crawl geeksforgeeks_article -O geeksforgeeks_article.csv 

or

scrapy crawl geeksforgeeks_article -o geeksforgeeks_article.csv

The -O option creates the output file and replaces any existing file with that name, while the -o option creates the file if it does not exist and appends to it if it does.
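
The difference between the two options is similar to opening a file in write mode versus append mode. The sketch below is only an analogy written with the standard library, not how Scrapy implements its feed exports; write_rows, count_lines, and the sample rows are made up for the illustration.

```python
import csv
import os
import tempfile


def write_rows(path, rows, overwrite):
    """Write rows to a CSV file, replacing it ("w") or appending to it ("a")."""
    mode = "w" if overwrite else "a"
    with open(path, mode, newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)


def count_lines(path):
    """Count the lines currently stored in the file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "demo.csv")
    write_rows(path, [["Arrays", "/arrays"]], overwrite=True)    # like -O
    write_rows(path, [["Stacks", "/stacks"]], overwrite=False)   # like -o
    lines_after_append = count_lines(path)
    write_rows(path, [["Queues", "/queues"]], overwrite=True)    # like -O again
    lines_after_overwrite = count_lines(path)

print(lines_after_append, lines_after_overwrite)
```

The appended run keeps the earlier row, while the final overwriting run leaves only the newest one.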
 

Outputs:

Output scraping 1

Output scraping 2

 



Last Updated : 26 May, 2021