Scraping data in network traffic using Python
In this article, we will learn how to scrape data from network traffic using Python. The following modules are needed:
- selenium: Selenium is a portable framework for controlling web browsers.
- time: This module provides various time-related functions.
- json: This module is required to work with JSON data.
- browsermobproxy: This module helps us to get the HAR file from network traffic.
There are two ways to scrape the network traffic data.
Method 1: Using selenium’s get_log() method
To start, download and extract the Chrome WebDriver (chromedriver) that matches the version of your Chrome browser, and copy the path to the executable.
- Import DesiredCapabilities from the selenium module and enable performance logging.
- Start up the Chrome WebDriver with executable_path, the modified desired_capabilities, and either the default chrome-options or options with some extra arguments.
- Send a GET request to the website using driver.get() and wait a few seconds for the page to load.
- Get the performance logs using driver.get_log() and store them in a variable.
- Iterate over every log entry and parse it using json.loads() to keep only the Network-related logs.
- Write the filtered logs to a JSON file by converting them to a JSON string using json.dumps().
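The steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes Selenium 3.x, where webdriver.Chrome() still accepts the executable_path and desired_capabilities keyword arguments (newer Selenium 4 releases replace them with Service and Options objects). The function names, the URL, the file name, and the chromedriver path are placeholders chosen for illustration.

```python
import json
import time


def filter_network_logs(entries):
    """Keep only the performance-log events whose method starts with 'Network.'."""
    network_events = []
    for entry in entries:
        # Each entry's "message" field is a JSON string that wraps the actual event.
        event = json.loads(entry["message"])["message"]
        if event.get("method", "").startswith("Network."):
            network_events.append(event)
    return network_events


def capture_network_logs(url, driver_path, wait_seconds=10):
    """Open url in Chrome and return the filtered Network events (hypothetical helper)."""
    # Imported here so that filter_network_logs works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    capabilities = DesiredCapabilities.CHROME.copy()
    capabilities["goog:loggingPrefs"] = {"performance": "ALL"}  # enable performance logging

    # Assumption: Selenium 3.x keyword arguments, as named in the steps above.
    driver = webdriver.Chrome(executable_path=driver_path,
                              desired_capabilities=capabilities)
    try:
        driver.get(url)
        time.sleep(wait_seconds)  # give the page time to finish its requests
        entries = driver.get_log("performance")
    finally:
        driver.quit()
    return filter_network_logs(entries)


def save_as_json(events, path):
    """Write the filtered events to a JSON file."""
    with open(path, "w", encoding="utf-8") as json_file:
        json_file.write(json.dumps(events, indent=2))


# Example usage (needs Chrome and chromedriver; both arguments are placeholders):
# events = capture_network_logs("https://www.example.com", "/path/to/chromedriver")
# save_as_json(events, "network_log.json")
```

Keeping the filtering in its own function means it can be checked against hand-written log entries without launching a browser.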
Method 2: Using browsermobproxy to capture the HAR file from the network tab of the browser
For this, the following requirements need to be satisfied.
- Download and install Java 8.
- Download and extract browsermobproxy and copy the path of its bin folder.
- Install the browsermob-proxy Python package with pip by running this command in a terminal:
pip install browsermob-proxy
- Download and extract the Chrome WebDriver (chromedriver) that matches the version of your Chrome browser, and copy the path to the executable.
- Import the Server class from browsermobproxy and start the server with the copied bin folder path, setting the port to 8090.
- Call the create_proxy method on the Server object to create the proxy object, and set the “trustAllServers” parameter to true.
- Start up the Chrome WebDriver with executable_path and chrome-options that route the browser's traffic through the proxy.
- Create a new HAR using the proxy object, naming it after the domain of the website.
- Send a GET request using driver.get() and wait a few seconds for the page to load properly.
- Write the HAR data of the network traffic from the proxy object to a HAR file by converting it to a JSON string using json.dumps().
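A rough sketch of these steps is shown below. It assumes Selenium 3.x (for the executable_path keyword argument) and the browsermob-proxy Python package. The function names, the paths, the URL, and the HAR name are placeholders for illustration, and the two Chrome flags are one plausible way to point the browser at the proxy rather than the only one.

```python
import json
import time


def build_chrome_arguments(proxy_address):
    """Return Chrome flags that send traffic through the proxy at host:port."""
    return [
        "--ignore-certificate-errors",  # the proxy presents its own certificate for HTTPS sites
        f"--proxy-server={proxy_address}",
    ]


def capture_har(url, bmp_executable, driver_path, har_name, wait_seconds=10):
    """Load url through BrowserMob Proxy and return the HAR as a dict (hypothetical helper)."""
    # Imported here so the small helpers work without these packages installed.
    from browsermobproxy import Server
    from selenium import webdriver

    # bmp_executable is the browsermob-proxy launcher inside the copied bin folder.
    server = Server(bmp_executable, options={"port": 8090})
    server.start()
    driver = None
    try:
        proxy = server.create_proxy(params={"trustAllServers": "true"})

        chrome_options = webdriver.ChromeOptions()
        for argument in build_chrome_arguments(proxy.proxy):
            chrome_options.add_argument(argument)

        # Assumption: Selenium 3.x keyword argument, as named in the steps above.
        driver = webdriver.Chrome(executable_path=driver_path, options=chrome_options)

        proxy.new_har(har_name)   # start recording a new HAR
        driver.get(url)
        time.sleep(wait_seconds)  # give the page time to finish its requests
        return proxy.har
    finally:
        if driver is not None:
            driver.quit()
        server.stop()


def save_har(har, path):
    """Write the HAR dict to a file as a JSON string."""
    with open(path, "w", encoding="utf-8") as har_file:
        har_file.write(json.dumps(har, indent=2))


# Example usage (all paths are placeholders):
# har = capture_har("https://www.example.com", "/path/to/bin/browsermob-proxy",
#                   "/path/to/chromedriver", "example.com")
# save_har(har, "network_log.har")
```

The try/finally block stops the proxy server even if the browser fails to start, so the Java process is not left running between attempts.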