To scrape content from a static page, BeautifulSoup works flawlessly: we use the requests library to load the page into our Python script and BeautifulSoup to parse it. But if the page we request is dynamic in nature, requests returns the raw HTML along with the embedded JavaScript. The requests package does not execute that JavaScript; it simply hands us the unrendered page source.
BeautifulSoup does not catch interactions with the DOM made via JavaScript. Suppose a table on the page is generated by JS: BeautifulSoup will not be able to capture it, while Selenium can.
If we only needed to scrape static websites, bs4 alone would suffice. For dynamically generated webpages, we use Selenium.
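The limitation above is easy to demonstrate with a small, self-contained sketch (the HTML here is an invented example, not from a real site): BeautifulSoup stores the page's JavaScript as plain text inside the `<script>` tag, but the table that script would have created in a browser never appears in the parse tree.

```python
from bs4 import BeautifulSoup

# Static HTML in which a table would be created by JavaScript at runtime.
html = """
<html>
  <body>
    <div id="data"></div>
    <script>
      // This runs only in a real browser, never in requests/BeautifulSoup:
      document.getElementById("data").innerHTML = "<table><tr><td>42</td></tr></table>";
    </script>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# The <script> tag is present in the tree as plain text...
print(soup.find("script") is not None)   # True
# ...but the table it would have generated does not exist.
print(soup.find("table"))                # None
```

This is exactly what requests hands back for a dynamic page: the script, not its effects.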
Selenium is a free, open-source automated testing framework used to validate web applications across different browsers and platforms. Selenium test scripts can be written in multiple programming languages such as Java, C#, and Python. Here, we use Python as our main language.
First up, the installation:
1) Selenium bindings in python
pip install selenium
2) Web drivers
Selenium requires a web driver to interface with the chosen browser. A web driver is a package that interacts with a web browser or a remote web server through a wire protocol common to all browsers. You can check out and install the web driver for the browser of your choice:
Chrome: https://sites.google.com/a/chromium.org/chromedriver/downloads
Firefox: https://github.com/mozilla/geckodriver/releases
Safari: https://webkit.org/blog/6900/webdriver-support-in-safari-10/
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree, commonly saving programmers hours or days of work.
To use Beautiful Soup, we have this wonderful binding of it in Python:
1) BS4 bindings in python
pip install beautifulsoup4
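Once installed, navigating and searching a parse tree takes only a few lines. A minimal sketch on an invented HTML snippet (the tags and class names are placeholders, not from a real site):

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>Quotes</h1>
  <ul class="quotes">
    <li><a href="/q/1">First quote</a></li>
    <li><a href="/q/2">Second quote</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Navigate and search the parse tree idiomatically.
title = soup.h1.get_text()                        # text of the first <h1>
links = [a["href"] for a in soup.select("ul.quotes a")]   # CSS selector
texts = [a.get_text() for a in soup.find_all("a")]        # all anchor texts

print(title)   # Quotes
print(links)   # ['/q/1', '/q/2']
print(texts)   # ['First quote', 'Second quote']
```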
Now suppose the site is dynamic, so that simple scraping with requests and BeautifulSoup returns None for the element we want.
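The combined approach can be sketched as follows: Selenium drives a real browser so the JavaScript actually runs, then its rendered page_source is handed to BeautifulSoup for parsing. This is an illustrative sketch, not the article's original code; the URL is a placeholder, and the `RUN_BROWSER` flag simply keeps the parsing helper testable without launching Chrome.

```python
from bs4 import BeautifulSoup

def extract_table_rows(html):
    """Parse rendered HTML and return the text of each table row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        # This is what plain requests + BeautifulSoup yields on a JS-built table.
        return None
    return [row.get_text(strip=True) for row in table.find_all("tr")]

RUN_BROWSER = False  # set to True on a machine with Chrome and chromedriver

if RUN_BROWSER:
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")   # no visible browser window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://example.com/dynamic-table")  # placeholder URL
        # page_source now contains the JS-generated content.
        rows = extract_table_rows(driver.page_source)
        print(rows)
    finally:
        driver.quit()
```

Splitting the parsing into a pure function like `extract_table_rows` keeps the BeautifulSoup logic testable on static HTML strings, independent of the browser.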