
Scraping Javascript Enabled Websites using Scrapy-Selenium

Scrapy-selenium is a middleware used in web scraping. Scrapy does not support scraping modern sites that use JavaScript frameworks, which is why this middleware is combined with Scrapy to scrape such sites. Scrapy-selenium provides the functionality of Selenium for working with JavaScript-driven websites. Another advantage is access to the underlying driver, through which we can also see what is happening behind the scenes. Since Selenium is an automation tool, it also lets us interact with input tags and scrape pages based on what is typed into an input field; filling input fields becomes much easier with Selenium. Scrapy-selenium was first introduced in 2018 and is open source. An alternative to it is scrapy-splash.
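Before creating the project, the middleware has to be installed; a typical setup with pip (this also assumes Scrapy itself is installed, and that the matching browser driver, geckodriver or chromedriver, is available on the system):

```shell
pip install scrapy scrapy-selenium
```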

scrapy startproject projectname  (projectname is the name of the project)
scrapy genspider spidername example.com




# for Firefox
from shutil import which

SELENIUM_DRIVER_NAME = 'firefox'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('geckodriver')
SELENIUM_DRIVER_ARGUMENTS = ['-headless']

# for Chrome
from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS = ['--headless']

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}
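The `which()` helper used in the settings above comes from Python's standard library: it searches the directories on PATH and returns the absolute path of an executable, or `None` when the executable is not found. A quick stand-alone check (the fake driver name in the last lookup is made up to show the `None` case):

```python
from shutil import which

# which() returns the absolute path of an executable found on PATH, else None
print(which('chromedriver'))  # path string if chromedriver is installed, else None
print(which('geckodriver'))

# a name that is certainly not on PATH resolves to None
print(which('no-such-webdriver-binary-xyz'))  # -> None
```

If either print shows `None`, the corresponding driver is not on PATH and the middleware will fail to start the browser.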

command: scrapy crawl spidername (in this project, scrapy crawl integratedspider)




import scrapy
 
class IntegratedspiderSpider(scrapy.Spider):
    name = 'integratedspider' # name of spider
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']
 
    def parse(self, response):
        pass




import scrapy
from scrapy_selenium import SeleniumRequest

class IntegratedspiderSpider(scrapy.Spider):
    name = 'integratedspider'

    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.geeksforgeeks.org/',
            wait_time=3,
            screenshot=True,
            callback=self.parse,
            dont_filter=True
        )

    def parse(self, response):
        pass
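With `screenshot=True`, scrapy-selenium stores the rendered page's screenshot as PNG bytes under `response.meta['screenshot']`, so the parse callback can save it to disk. A minimal sketch of that saving step; the `meta` dict below is faked with placeholder bytes so the snippet runs stand-alone, while in a real spider you would pass `response.meta` directly:

```python
from pathlib import Path

# stand-in for response.meta, which SeleniumRequest(screenshot=True) fills
meta = {'screenshot': b'\x89PNG placeholder bytes'}

def save_screenshot(meta, filename='page.png'):
    # write the raw PNG bytes that the middleware placed in meta
    Path(filename).write_bytes(meta['screenshot'])
    return filename

saved = save_screenshot(meta, 'page.png')
print(f'saved screenshot to {saved}')
```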




import scrapy
from scrapy_selenium import SeleniumRequest

class IntegratedspiderSpider(scrapy.Spider):
    name = 'integratedspider'

    def start_requests(self):
        yield SeleniumRequest(
            url='https://practice.geeksforgeeks.org/courses/online',
            wait_time=3,
            screenshot=True,
            callback=self.parse,
            dont_filter=True
        )

    def parse(self, response):
        # courses is the list of all nodes matched by this XPath;
        # the XPath points at the cards holding the course details
        courses = response.xpath('//*[@id="active-courses-content"]/div/div/div')

        # course is each course card in the courses list
        for course in courses:
            # the relative XPath of the course name is appended to the card path;
            # text() scrapes the text of the h4 tag that contains the course name
            course_name = course.xpath('.//a/div[2]/div/div[2]/h4/text()').get()

            # course_name is a string containing \n and extra spaces;
            # these \n and extra spaces are removed
            course_name = course_name.split('\n')[1]
            course_name = course_name.strip()

            yield {
                'course Name': course_name
            }
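The cleanup step in the spider above (`split('\n')[1]` followed by `strip()`) can be checked on its own against the kind of string this XPath returns: a name wrapped in newlines and padding (the sample text here is made up):

```python
# raw text as scraped: a leading newline, a padded course name, a trailing line
raw = '\n        Data Structures and Algorithms\n      '

# split on newlines, take the part that holds the name, drop the padding
course_name = raw.split('\n')[1].strip()
print(course_name)  # -> Data Structures and Algorithms
```

Note that this relies on the scraped text always starting with a newline; if it does not, index `[1]` would pick the wrong piece, so a plain `raw.strip()` is a more robust variant when the layout is uncertain.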

Official link: the scrapy-selenium GitHub repository

