Scrape content from dynamic websites
To scrape content from a static page, we use BeautifulSoup for parsing and requests to load the page into our Python script, and this works well. If the page is dynamic, however, the server's response contains JavaScript that is meant to be executed in the browser to build the content. The requests library does not execute this JavaScript; it simply returns it as part of the page source.
BeautifulSoup only parses the markup it is given, so it does not see the changes that JavaScript makes to the DOM. Suppose a table is generated by JavaScript: BeautifulSoup on its own will not be able to capture it, while Selenium, which drives a real browser, can.
If we only needed to scrape static websites, bs4 alone would be enough. For dynamically generated webpages, we use Selenium as well.
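A minimal sketch of the problem, assuming a made-up page whose table is built by a script. The HTML string below stands in for what requests would return for a dynamic page; the markup and the `results` id are hypothetical. BeautifulSoup parses the markup as delivered and never runs the script, so there is no table to find.

```python
from bs4 import BeautifulSoup

# Hypothetical page source, standing in for the text that requests would
# return. The table only exists after a browser runs the script.
raw_html = """
<html>
  <body>
    <div id="results"></div>
    <script>
      var table = document.createElement("table");
      document.getElementById("results").appendChild(table);
    </script>
  </body>
</html>
"""

soup = BeautifulSoup(raw_html, "html.parser")

container = soup.find("div", {"id": "results"})
table = soup.find("table")

# The empty container is in the markup, but the script was never executed,
# so no <table> element exists in the parse tree.
print(container is not None)  # True
print(table is None)          # True
```

Calling a method such as `find_all` on that missing table would raise an AttributeError, because `find` returned None.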
Selenium
Selenium is a free (open-source) automated testing framework used to validate web applications across different browsers and platforms. Selenium test scripts can be written in several programming languages, such as Java, C#, and Python. Here, we use Python as our main language.
First up, the installation:
1) Selenium bindings in Python
pip install selenium
2) Web drivers
Selenium requires a web driver to interface with the chosen browser. A web driver is a package that interacts with a web browser, or with a remote web server, through a common wire protocol. Download and install the web driver for the browser of your choice.
Chrome: https://sites.google.com/a/chromium.org/chromedriver/downloads
Firefox: https://github.com/mozilla/geckodriver/releases
Safari: https://webkit.org/blog/6900/webdriver-support-in-safari-10/
BeautifulSoup
BeautifulSoup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
To use BeautifulSoup, install its Python package:
1) BS4 bindings in Python
pip install bs4
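As a rough illustration of what navigating, searching, and modifying the parse tree look like in practice, here is a small sketch. The markup, the `jobs` id, and the job titles are made up for the example.

```python
from bs4 import BeautifulSoup

# Made-up markup used only for illustration.
html = """
<ul id="jobs">
  <li><a href="/jobs/1">Data Engineer</a></li>
  <li><a href="/jobs/2">Web Developer</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Searching: collect every <a> tag inside the list.
links = soup.find("ul", {"id": "jobs"}).find_all("a")
titles = [link.get_text() for link in links]

# Navigating: move from a tag to its parent and read an attribute.
first_link = links[0]
parent_name = first_link.parent.name
first_href = first_link["href"]

# Modifying: replace the text of a tag in the parse tree.
first_link.string = "Senior Data Engineer"

print(titles)                   # ['Data Engineer', 'Web Developer']
print(parent_name, first_href)  # li /jobs/1
print(first_link.get_text())    # Senior Data Engineer
```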
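Suppose the site is dynamic, so that scraping it with requests and BeautifulSoup alone returns None (a NoneType object) for the element we are looking for. The script below first loads the page in a browser through Selenium, then hands the rendered page source to BeautifulSoup.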
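import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Placeholder address: the original does not say which page is scraped.
url = "https://example.com"

# Selenium 4 takes the driver path through a Service object;
# Selenium 3 accepted webdriver.Chrome('./chromedriver') directly.
driver = webdriver.Chrome(service=Service('./chromedriver'))
driver.get(url)

# Give the page's JavaScript time to build the DOM.
time.sleep(5)

# page_source reflects the DOM after the scripts have run.
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")

# find() returns None when no <div id="nameSearch"> exists.
all_divs = soup.find('div', {'id': 'nameSearch'})
job_profiles = all_divs.find_all('a') if all_divs is not None else []

# Print the text of the first ten links.
for job_profile in job_profiles[:10]:
    print(job_profile.text)

# End the browser session.
driver.quit()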
Here’s the video of the scraper in action: Working_scraper_video
Last Updated: 05 Sep, 2020