Scrape content from dynamic websites

To scrape content from a static page, we use BeautifulSoup as our package for scraping, and it works flawlessly for static pages. We use requests to load page into our python script. Now, if the page we are trying to load is dynamic in nature and we request this page by requests library, it would send the JS code to be executed locally. Requests package does not execute this JS code and just gives it as the page source.

BeautifulSoup does not catch the interactions with DOM via Java Script. Let’s suppose, if you have a table that is generated by JS. BeautifulSoup will not be able to capture it, while Selenium can.

If there was just a need to scrape static websites, we would’ve used just bs4. But, for dynamically generated webpages, we use selenium.
Selenium

Selenium is a free (open-source) automated testing framework used to validate web applications across different browsers and platforms. You can use multiple programming languages like Java, C#, Python etc to create Selenium Test Scripts. Here, we use Python as our main language.

First up, the installation :



1) Selenium bindings in python

pip install selenium

2) Web drivers
Selenium requires a web driver to interface with the chosen browser.Web drivers is a package to interact with web browser. It interacts with the web browser or a remote web server through a wire protocol which is common to all. You can check out and install the web drivers of your browser choice.

Chrome: https://sites.google.com/a/chromium.org/chromedriver/downloads
Firefox: https://github.com/mozilla/geckodriver/releases
Safari:    https://webkit.org/blog/6900/webdriver-support-in-safari-10/ 

Beautifulsoup

Beautifulsoup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

To use beautiful soup, we have this wonderful binding of it in python :
1) BS4 bindings in python

pip install bs4

Let’s suppose the site is dynamic and simple scraping leads to returning a Nonetype object.

filter_none

edit
close

play_arrow

link
brightness_4
code

#### This program scrapes naukri.com's page and gives our result as a 
#### list of all the job_profiles which are currently present there. 
  
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
  
#url of the page we want to scrape
  
# initiating the webdriver. Parameter includes the path of the webdriver.
driver = webdriver.Chrome('./chromedriver'
driver.get(url) 
  
# this is just to ensure that the page is loaded
time.sleep(5
  
html = driver.page_source
  
# this renders the JS code and stores all
# of the information in static HTML code.
  
# Now, we could simply apply bs4 to html variable
soup = BeautifulSoup(html, "html.parser")
all_divs = soup.find('div', {'id' : 'nameSearch'})
job_profiles = all_divs.find_all('a')
  
# printing top ten job profiles
count = 0
for job_profile in job_profiles :
    print(job_profile.text)
    count = count + 1
    if(count == 10) :
        break
  
driver.close() # closing the webdriver

chevron_right


Here’s the video of the scraper in action : Working_scraper_video

Output of the code :


Attention geek! Strengthen your foundations with the Python Programming Foundation Course and learn the basics.

To begin with, your interview preparations Enhance your Data Structures concepts with the Python DS Course.




My Personal Notes arrow_drop_up

Check out this Author's contributed articles.

If you like GeeksforGeeks and would like to contribute, you can also write an article using contribute.geeksforgeeks.org or mail your article to contribute@geeksforgeeks.org. See your article appearing on the GeeksforGeeks main page and help other Geeks.

Please Improve this article if you find anything incorrect by clicking on the "Improve Article" button below.


Article Tags :

1


Please write to us at contribute@geeksforgeeks.org to report any issue with the above content.