How to run Scrapy spiders in Python

In this article, we are going to discuss how to schedule Scrapy crawl execution programmatically using Python. Scrapy is a powerful web scraping framework, and it’s often necessary to schedule the execution of a Scrapy crawl at specific intervals. Scheduling Scrapy crawl execution programmatically allows you to automate the process of scraping data and ensures that you have the most up-to-date data.

Required Packages

Install the Scrapy and schedule libraries:

pip install schedule
pip install scrapy

Schedule Scrapy Crawl

In order to schedule Scrapy crawl execution, we will use the schedule library. This library allows us to schedule a task to be executed at a specific time or interval.

Step 1: Create a new folder

Step 2: Inside the folder, start a new project with the following command:

scrapy startproject <project_name>
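
This command generates a standard Scrapy project skeleton. Below is a sketch of the layout you can expect, assuming the project is named "quotes" (the folder names will match whatever name you pass to startproject):

quotes/
    scrapy.cfg            # deploy configuration file
    quotes/               # project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # project middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # folder where your spiders live
            __init__.py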

Step 3: Import the schedule library and create a function that runs the Scrapy crawl.

Python3
import schedule
import time
from scrapy import cmdline
  
def crawl():
    # replace "my_spider" with the name of your spider
    cmdline.execute("scrapy crawl my_spider".split())


Step 4: Use the schedule library to schedule the crawl function to run at a specific interval

In this example, the crawl function is scheduled to run every 5 minutes. schedule.run_pending() checks whether any scheduled tasks are due to run, and time.sleep(1) keeps the loop from consuming all the CPU. You can also schedule the task at a specific time of day with schedule.every().day.at("10:30").do(crawl), and you can clear all scheduled tasks with schedule.clear().

Python3
schedule.every(5).minutes.do(crawl)
  
while True:
    schedule.run_pending()
    time.sleep(1)
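
Note that in many Scrapy versions, scrapy.cmdline.execute() ends by calling sys.exit() once the command finishes, which would stop the scheduling loop after the first crawl. If you want the job to keep firing at every interval, one common workaround is to launch each crawl in a separate process instead. Below is a minimal sketch of that approach using Python's standard subprocess module, reusing the placeholder spider name "my_spider" from above:

Python3

import subprocess
import time

import schedule


def crawl():
    # run the spider in a child process so this scheduler process
    # keeps running after each crawl finishes
    subprocess.run(["scrapy", "crawl", "my_spider"], check=False)


schedule.every(5).minutes.do(crawl)

while True:
    schedule.run_pending()
    time.sleep(1)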


Example 1

Create a new folder. Inside the folder, start a new project. Create a WikiSpider.py file; the code below uses the Scrapy library to create a spider that scrapes data from Wikipedia. The spider, called "WikiSpider", starts at the URL "https://en.wikipedia.org/wiki/Database" and is configured with a number of custom settings, such as the user agent, download delay, and number of concurrent requests. The spider's parse method is called for each downloaded response; it extracts the title of the page and all the paragraphs and writes them to a text file named "wiki.txt". The code also uses the schedule library to run the spider every 30 seconds, with an infinite loop that keeps the scheduler running until it is stopped manually.

Python3
import scrapy
import schedule
import time
from scrapy import cmdline
  
# This class is a spider for scraping data from Wikipedia
class WikiSpider(scrapy.Spider):
    name = "wiki"
    # the starting url for the spider to crawl
    start_urls = ["https://en.wikipedia.org/wiki/Database"]
    # settings for the spider such as user agent, download delay, 
    # and number of concurrent requests
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/89.0.4389.82 Safari/537.36',
        'DOWNLOAD_DELAY': 1,
        'CONCURRENT_REQUESTS': 1,
        'RETRY_TIMES': 3,
        'RETRY_HTTP_CODES': [500, 503, 504, 400, 403, 404, 408],
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
            'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
            'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
        }
    }

    # parse method that is called for each downloaded response
    def parse(self, response):
        # get the title of the page
        title = response.css("title::text").get()
        # get all the paragraphs from the page
        paragraphs = response.css("p::text").getall()
        print(title)
        print(paragraphs)
        # open the file in write mode and save the scraped text
        with open("wiki.txt", "w") as f:
            f.write(title + "\n")
            for para in paragraphs:
                f.write(para + "\n")
  
# function to run the spider
def crawl_wiki():
    cmdline.execute("scrapy runspider WikiSpider.py".split())
  
# schedule the spider to run every 30 seconds
schedule.every(30).seconds.do(crawl_wiki)
  
# infinite loop to run the scheduled spider
while True:
    schedule.run_pending()
    time.sleep(1)


Output: Run the Spider

scrapy runspider WikiSpider.py

On running WikiSpider.py, a wiki.txt file is created containing the content scraped from https://en.wikipedia.org/wiki/Database, with the crawl scheduled every 30 seconds.
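
Because the spider class and the scheduling loop live in the same file, you may want to guard the scheduling code so that it only runs when the file is executed directly (for example with python WikiSpider.py) and not when Scrapy imports the file to locate the spider class. A minimal sketch of that guard, reusing the crawl_wiki function defined above:

Python3

if __name__ == "__main__":
    # start the scheduler only when this file is run directly,
    # not when Scrapy imports it to find the spider
    schedule.every(30).seconds.do(crawl_wiki)
    while True:
        schedule.run_pending()
        time.sleep(1)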

[Output screenshot: wiki.txt is created on running the spider]

Example 2

Here is an example of a Scrapy spider that scrapes quotes from a website and prints the output to the console. The spider is scheduled to run every 30 seconds using the schedule library.

Create a new folder. Inside the folder, start a new project (Quotes). Create a QuotesSpider.py file; the code below uses the Scrapy library to create a spider that scrapes data from a website that contains quotes. The spider, called "QuotesSpider", starts at the URL "http://quotes.toscrape.com/page/1/".
The spider's parse method is called for each downloaded response; it extracts the text, author, and tags of each quote and yields them as a dictionary. It also checks for a next page and follows the link if one exists. The code uses the schedule library to run the spider every 30 seconds, with an infinite loop that keeps the scheduler running until it is stopped manually.

Python3
import scrapy
import schedule
import time
from scrapy import cmdline
  
# This class is a spider for scraping data from the quotes website
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # the starting url for the spider to crawl
    start_urls = [
        "http://quotes.toscrape.com/page/1/",
    ]
    # settings for the spider such as user agent, download delay,
    # and number of concurrent requests
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/89.0.4389.82 Safari/537.36',
        'DOWNLOAD_DELAY': 1,
        'CONCURRENT_REQUESTS': 1,
        'RETRY_TIMES': 3,
        'RETRY_HTTP_CODES': [500, 503, 504, 400, 403, 404, 408],
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
            'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
            'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
        }
    }
    # parse method that is called for each downloaded response
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        # check for the next page and follow the link
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
  
# function to run the spider
def crawl_quotes():
    cmdline.execute("scrapy runspider QuotesSpider.py".split())
  
  
# schedule the spider to run every 30 seconds
schedule.every(30).seconds.do(crawl_quotes)
  
# infinite loop to run the scheduled spider
while True:
    schedule.run_pending()
    time.sleep(1)


Output: Run the Spider

scrapy runspider QuotesSpider.py
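
The yielded items appear in Scrapy's log output in the console. If you would rather persist them, Scrapy's feed exports can write the items to a file instead; for example (the file name quotes.json here is only an illustration):

scrapy runspider QuotesSpider.py -o quotes.json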

[Output screenshot]

This output will be printed to the console every time the spider runs, as specified in the schedule.


