Scrapy-selenium is a Scrapy middleware used for web scraping. Scrapy on its own does not execute JavaScript, so it cannot scrape modern sites that are built with JavaScript frameworks. Scrapy-selenium fills that gap by handing the page to Selenium, which renders it in a real browser. Another advantage is the browser driver, which lets us see what is happening behind the scenes. Because Selenium is a browser automation tool, it also makes it easier to fill in input fields and scrape the results that depend on what was entered. Scrapy-selenium was first released in 2018 and is open source. An alternative is scrapy-splash.
Install and Setup Scrapy –
- Install Scrapy.
- Create a project by running
    scrapy startproject projectname
  (projectname is the name of your project).
- Then generate a spider by running
    scrapy genspider spidername example.com
  (replace spidername with your preferred spider name and example.com with the website that you want to scrape). Note: the URL can also be changed later, inside your Scrapy spider.
[Figure: the generated Scrapy spider]
Integrating scrapy-selenium in scrapy project:
- Install scrapy-selenium and add the following to your settings.py file. Use only one of the two driver blocks; the DOWNLOADER_MIDDLEWARES entry is needed in both cases.

    # for firefox
    from shutil import which

    SELENIUM_DRIVER_NAME = 'firefox'
    SELENIUM_DRIVER_EXECUTABLE_PATH = which('geckodriver')
    SELENIUM_DRIVER_ARGUMENTS = ['-headless']

    # for chrome driver
    from shutil import which

    SELENIUM_DRIVER_NAME = 'chrome'
    SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
    SELENIUM_DRIVER_ARGUMENTS = ['--headless']

    # enable the scrapy-selenium downloader middleware
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_selenium.SeleniumMiddleware': 800
    }
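The settings rely on shutil.which to locate the driver executable, and which returns None when the driver is not on PATH. A minimal sketch of a check you could run before crawling is shown here; the helper name find_driver is made up for this illustration and is not part of scrapy-selenium.

```python
from shutil import which


def find_driver(executable_name):
    """Return the absolute path of a driver executable, or None if it is not on PATH."""
    # shutil.which searches the directories listed in the PATH environment variable.
    return which(executable_name)


if __name__ == "__main__":
    # Report which of the two drivers named in settings.py can be found.
    for name in ("chromedriver", "geckodriver"):
        path = find_driver(name)
        if path is None:
            print(f"{name}: not found on PATH")
        else:
            print(f"{name}: {path}")
```

If the driver is reported as missing, SELENIUM_DRIVER_EXECUTABLE_PATH would be None, so it is worth fixing PATH (or writing the full path in settings.py) before running the spider.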
- In this project the Chrome driver is used. ChromeDriver has to be downloaded to match the version of your Chrome browser: open the Help section in Chrome, click About Google Chrome, and check your version. Then download the matching ChromeDriver from the ChromeDriver download site.
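As a rough illustration of the version check, the sketch here compares the major version numbers found in two version strings. The function names are hypothetical, the sample strings are made-up values in the style printed by the browser and driver, and treating "same major version" as a match is a simplifying assumption.

```python
import re


def major_version(version_text):
    """Extract the leading major version number from a version string."""
    match = re.search(r"(\d+)\.\d+", version_text)
    if match is None:
        raise ValueError(f"no version number found in {version_text!r}")
    return int(match.group(1))


def versions_match(browser_version, driver_version):
    """Return True when browser and driver share the same major version."""
    return major_version(browser_version) == major_version(driver_version)


if __name__ == "__main__":
    # Sample strings only; read the real ones from your own installation.
    print(versions_match("Google Chrome 114.0.5735.90", "ChromeDriver 114.0.5735.90"))
```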
[Figure: where to add chromedriver]
[Figure: addition in settings.py file]
[Figure: change to be made in spider file]
To run the project, use the command scrapy crawl spidername (scrapy crawl integratedspider in this project).
spider code before scrapy-selenium:

    import scrapy


    class IntegratedspiderSpider(scrapy.Spider):
        name = 'integratedspider'  # name of spider
        allowed_domains = ['example.com']

        def parse(self, response):
            pass

Important fields in scrapy-selenium:
- name- the variable that holds the spider's name; each spider is identified by this name. The command to run a spider is scrapy crawl spidername (where spidername is the name defined in the spider).
- function start_requests- the first requests to perform are obtained by calling the start_requests() method, which yields a SeleniumRequest for the URL given in its url field, with the parse method as the callback function.
- url- the URL of the site to scrape.
- screenshot- when set to True, Selenium takes a screenshot of the page and the PNG data is added to the response as response.meta['screenshot']. You can also save a screenshot to a file in the project with the driver method get_screenshot_as_file(), passing the filename as the parameter.
- callback- the function that will be called with the response of this request as its first parameter.
- dont_filter- indicates that this request should not be filtered by the scheduler. If the same URL is requested again, it is not dropped as a duplicate, so the same URL can be fetched more than once. The default value is False.
- wait_time- Scrapy does not wait a fixed amount of time between requests. This field sets how many seconds Selenium may wait for the page before the response is passed to the callback; in scrapy-selenium it is used together with wait_until.
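To make the screenshot field more concrete, here is a small sketch of writing the captured image to disk. It assumes that scrapy-selenium stores the PNG bytes under the 'screenshot' key of response.meta when screenshot=True; the helper save_screenshot and the default filename are made up for this illustration.

```python
from pathlib import Path


def save_screenshot(meta, filename="screenshot.png"):
    """Write the PNG bytes stored under meta['screenshot'] to a file.

    Returns the path written, or None when no screenshot is present.
    """
    png_bytes = meta.get("screenshot")
    if png_bytes is None:
        return None
    path = Path(filename)
    path.write_bytes(png_bytes)
    return path
```

Inside a spider callback this could be called as save_screenshot(response.meta, 'page.png'). The helper only needs a dict-like object, so it can be tried without starting a browser.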
General structure of scrapy-selenium spider:

    import scrapy
    from scrapy_selenium import SeleniumRequest


    class IntegratedspiderSpider(scrapy.Spider):
        name = 'integratedspider'

        def start_requests(self):
            yield SeleniumRequest(
                url='https://example.com',  # placeholder: URL of the site to scrape
                wait_time=3,
                screenshot=True,
                callback=self.parse,
                dont_filter=True
            )

        def parse(self, response):
            pass
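The effect of dont_filter=True can be pictured with a toy model of duplicate filtering. This is a simplified sketch, not Scrapy's implementation (Scrapy compares request fingerprints rather than raw URL strings), and the function name schedule is hypothetical.

```python
def schedule(requests):
    """Toy model of duplicate filtering: yield the URLs that would be downloaded.

    `requests` is an iterable of (url, dont_filter) pairs.
    """
    seen = set()
    for url, dont_filter in requests:
        if not dont_filter and url in seen:
            continue  # a repeated URL is dropped unless dont_filter is set
        seen.add(url)
        yield url


if __name__ == "__main__":
    queue = [
        ("https://example.com", False),
        ("https://example.com", False),  # dropped as a duplicate
        ("https://example.com", True),   # kept because dont_filter is True
    ]
    print(list(schedule(queue)))
```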
Project of Scraping with scrapy-selenium:
Scraping online course names from the GeeksforGeeks site using scrapy-selenium.
Getting the XPath of the element we need to scrape –
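To see how a card-level XPath and a relative XPath combine, here is a sketch that applies the two expressions used by the spider in this section to a small made-up fragment. It uses Python's standard-library ElementTree instead of Scrapy selectors, so it reads .text where the spider uses text(). The markup, the constant names, and the function course_names are assumptions chosen to fit the expressions, not the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup shaped to fit the XPath expressions; not the real page.
CARD = (
    "<div><a>"
    "<div>thumbnail</div>"
    "<div><div><div>badge</div><div><h4>\n  {name}\n  </h4></div></div></div>"
    "</a></div>"
)
PAGE = (
    '<html><body><div id="active-courses-content"><div><div>'
    + CARD.format(name="Course A")
    + CARD.format(name="Course B")
    + "</div></div></div></body></html>"
)


def course_names(page_source):
    """Return the stripped h4 text of every course card in the page."""
    root = ET.fromstring(page_source)
    # First expression: every course card under the container with this id.
    cards = root.findall(".//*[@id='active-courses-content']/div/div/div")
    names = []
    for card in cards:
        # Second expression: relative to the current card only (leading ".").
        heading = card.find(".//a/div[2]/div/div[2]/h4")
        if heading is not None and heading.text:
            names.append(heading.text.strip())
    return names


if __name__ == "__main__":
    print(course_names(PAGE))
```

The leading dot in the second expression is what keeps the search inside the current card instead of restarting from the top of the document.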
Code to scrape course data from GeeksforGeeks –

    import scrapy
    from scrapy_selenium import SeleniumRequest


    class IntegratedspiderSpider(scrapy.Spider):
        name = 'integratedspider'

        def start_requests(self):
            yield SeleniumRequest(
                # placeholder: put the URL of the courses page here
                url='<URL of the GeeksforGeeks courses page>',
                wait_time=3,
                screenshot=True,
                callback=self.parse,
                dont_filter=True
            )

        def parse(self, response):
            # courses is a list of all items matched by this xpath
            # this xpath selects the cards containing course details
            courses = response.xpath('//*[@id ="active-courses-content"]/div/div/div')

            # course is each course in the courses list
            for course in courses:
                # xpath of the course name, relative to the course card
                # text() scrapes the text of the h4 tag that contains the course name
                course_name = course.xpath('.//a/div[2]/div/div[2]/h4/text()').get()

                # course_name is a string containing \n and extra spaces
                # these \n and extra spaces are removed
                course_name = course_name.split('\n')[1]
                course_name = course_name.strip()

                yield {
                    'course Name': course_name
                }
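The clean-up in parse assumes that the text starts with a newline and that get() returned a string. A slightly more defensive alternative is sketched here; the helper name clean_course_name is hypothetical, and unlike the spider code it also collapses repeated spaces inside the name.

```python
def clean_course_name(raw_text):
    """Collapse newlines and repeated spaces; return None for missing text."""
    if raw_text is None:
        # .get() returns None when the XPath matches nothing.
        return None
    # str.split() with no arguments splits on any run of whitespace.
    return " ".join(raw_text.split())


if __name__ == "__main__":
    print(clean_course_name("\n      Python Basics\n   "))
```

In the spider, the two clean-up assignments could then be replaced by a single call such as course_name = clean_course_name(course_name).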
Output –
Official link: scrapy-selenium GitHub repo