Scrape content from dynamic websites

Last Updated : 05 Sep, 2020

To scrape content from a static page, we use requests to load the page into our Python script and BeautifulSoup to parse it, and this works well for static pages. If the page is dynamic, however, the response we get from the requests library contains JavaScript that is meant to be executed in the browser to build the content. The requests package does not execute this JS code; it simply returns it as part of the page source.

BeautifulSoup does not see the changes that JavaScript makes to the DOM. Suppose you have a table that is generated by JS: BeautifulSoup on its own will not be able to capture it, while Selenium can, because it drives a real browser that runs the script.
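
To make this concrete, here is a minimal sketch, assuming bs4 is installed. The HTML string is a made-up stand-in for what a plain HTTP client would receive from a page whose table is filled in by JavaScript; the id `jobs` and the script itself are hypothetical and only serve the illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page source, as a plain HTTP client would receive it:
# the table body is empty and a script is supposed to fill it in the browser.
RAW_HTML = """
<html><body>
  <table id="jobs"><tbody></tbody></table>
  <script>
    const body = document.querySelector("#jobs tbody");
    ["Analyst", "Engineer"].forEach(function (title) {
      const row = body.insertRow();
      row.insertCell().textContent = title;
    });
  </script>
</body></html>
"""

soup = BeautifulSoup(RAW_HTML, "html.parser")

# BeautifulSoup only parses the markup; it never runs the script,
# so the rows that JavaScript would add are not in the parse tree.
rows = soup.find_all("tr")

# The script is still there, but only as a piece of text.
script_text = soup.find("script").string

print(len(rows))
print("insertRow" in script_text)
```

The table exists in the parse tree, but it has no rows, which is exactly the situation a browser-driven tool is meant to solve.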

If we only needed to scrape static websites, bs4 alone would be enough. For dynamically generated webpages, we use Selenium as well.

Selenium

Selenium is a free (open-source) automated testing framework used to validate web applications across different browsers and platforms. You can write Selenium test scripts in several programming languages, such as Java, C# and Python. Here, we use Python.

First, the installation:

1) Selenium bindings in Python

pip install selenium

2) Web drivers
Selenium requires a web driver to interface with the chosen browser. A web driver is a package that interacts with the web browser, or with a remote web server, through a wire protocol that is common to all drivers. Download and install the web driver for the browser of your choice.

Chrome: https://sites.google.com/a/chromium.org/chromedriver/downloads
Firefox: https://github.com/mozilla/geckodriver/releases
Safari: https://webkit.org/blog/6900/webdriver-support-in-safari-10/
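
The wire protocol mentioned above is plain HTTP with JSON bodies. In the W3C WebDriver specification, for example, opening a page is a POST to `/session/{session id}/url`. The sketch below only builds such a request as a dictionary, without sending it; the driver address and the session id are assumptions chosen for illustration, and in practice the Selenium bindings build and send these requests for you.

```python
import json

# Assumed values for illustration only: a driver listening locally and a
# placeholder session id (a real one is returned when a session is created).
DRIVER_URL = "http://localhost:9515"
SESSION_ID = "example-session-id"


def navigate_request(driver_url, session_id, page_url):
    """Describe the HTTP request that asks a WebDriver server to open a page."""
    return {
        "method": "POST",
        "url": f"{driver_url}/session/{session_id}/url",
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"url": page_url}),
    }


request = navigate_request(DRIVER_URL, SESSION_ID, "https://www.naukri.com/")
print(request["method"], request["url"])
print(request["body"])
```

Because every driver understands the same kind of request, the same Python script can work with Chrome, Firefox or Safari by swapping only the driver.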

BeautifulSoup

BeautifulSoup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
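
As a small illustration of navigating, searching and modifying the parse tree, here is a sketch that assumes bs4 is installed. The markup, the class name `job` and the id `listing` are made up for the example.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; the class and id names are hypothetical.
HTML = """
<div id="listing">
  <a class="job" href="/jobs/1">Data Analyst</a>
  <a class="job" href="/jobs/2">Web Developer</a>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Searching: collect every matching tag.
links = soup.find_all("a", class_="job")
titles = [link.get_text(strip=True) for link in links]

# Navigating: move from a tag to its parent.
parent_id = links[0].parent["id"]

# Modifying: change an attribute in the parse tree.
links[0]["href"] = "/jobs/1?ref=demo"

print(titles)
print(parent_id)
print(links[0]["href"])
```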

To use BeautifulSoup in Python, install the bs4 package:

pip install bs4

Suppose the site is dynamic, so that scraping it with requests and BeautifulSoup alone returns None for the elements we are looking for.
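
The None result comes from `find()`, which returns None when nothing matches; calling a method on that result then fails. The sketch below shows one way to guard against it. The helper name `link_texts` and the sample markup are hypothetical, and the id `nameSearch` is borrowed from the program that follows.

```python
from bs4 import BeautifulSoup

# Hypothetical static source: the container that JavaScript would create
# is not present in the markup.
STATIC_HTML = "<html><body><div id='root'></div></body></html>"


def link_texts(html, container_id):
    """Return the text of every <a> inside the container, or [] if absent."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", {"id": container_id})
    if container is None:
        # find() returns None when nothing matches; calling
        # .find_all() on None would raise AttributeError.
        return []
    return [a.get_text(strip=True) for a in container.find_all("a")]


print(link_texts(STATIC_HTML, "nameSearch"))
```

An empty result here is a hint that the content is generated in the browser and that the page needs to be rendered first.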

# This program scrapes a page on naukri.com and prints the
# job profiles that are currently listed there.

import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# URL of the page we want to scrape
url = "https://www.naukri.com/top-jobs-by-designations#desigtop600"

# Start the webdriver. The path points to the chromedriver executable.
# (Selenium 4 takes the path through a Service object; older releases
# accepted it as the first positional argument.)
driver = webdriver.Chrome(service=Service('./chromedriver'))

try:
    driver.get(url)

    # Give the page's JavaScript time to run before reading the source
    time.sleep(5)

    # page_source is the HTML of the page after the browser has run the JS
    html = driver.page_source
finally:
    driver.quit()  # close the browser and stop the webdriver

# Now we can simply apply bs4 to the html variable
soup = BeautifulSoup(html, "html.parser")
container = soup.find('div', {'id': 'nameSearch'})

# find() returns None if the element is missing, so guard against it
job_profiles = container.find_all('a') if container is not None else []

# Print the top ten job profiles
for job_profile in job_profiles[:10]:
    print(job_profile.text)
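
A fixed `time.sleep(5)` either wastes time or is too short on a slow connection. Selenium provides explicit waits (`WebDriverWait` together with `expected_conditions`) for this purpose. The sketch below is not that API; it is a hand-rolled polling helper that shows the idea, and it uses a hypothetical `FakeDriver` class in place of a real browser so that it runs without Selenium installed.

```python
import time


def wait_for(condition, timeout=10.0, poll=0.5):
    """Call condition() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition was not met in time")
        time.sleep(poll)


class FakeDriver:
    """Stand-in for a browser: the target markup appears on the third read."""

    def __init__(self):
        self.reads = 0

    @property
    def page_source(self):
        self.reads += 1
        if self.reads < 3:
            return "<html><body>loading</body></html>"
        return "<html><body><div id='nameSearch'></div></body></html>"


def source_when_ready(driver, marker):
    """Return the page source once it contains marker, otherwise None."""
    source = driver.page_source
    return source if marker in source else None


driver = FakeDriver()
html = wait_for(lambda: source_when_ready(driver, "nameSearch"),
                timeout=5, poll=0.01)
print(driver.reads)
```

The helper returns as soon as the marker shows up in the page source, so the script waits only as long as the page actually needs.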

