Scraping data in network traffic using Python
In this article, we will learn how to scrape data from network traffic using Python. The following modules are required:
- selenium: Selenium is a portable framework for controlling a web browser.
- time: This module provides various time-related functions.
- json: This module is required to work with JSON data.
- browsermobproxy: This module helps us to get the HAR file from network traffic.
There are two ways to scrape network traffic data.
Method 1: Using selenium’s get_log() method
To start, download and extract the Chrome webdriver that matches the version of your Chrome browser, and copy the path to the executable.
- Import DesiredCapabilities from the selenium module and enable performance logging.
- Start the Chrome webdriver with executable_path, the modified desired_capabilities, and either the default Chrome options or options with extra arguments added.
- Send a GET request to the website using driver.get() and wait a few seconds for the page to load.
- Get the performance logs using driver.get_log() and store them in a variable.
- Iterate over every log entry and parse it using json.loads() to keep only the Network-related entries.
- Write the filtered logs to a JSON file after converting them to a JSON string using json.dumps().
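The steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes Selenium 4, where the logging preference is set on the options object with set_capability() and the driver path is passed through a Service object instead of the desired_capabilities and executable_path arguments described above. The function names, the output file name and the wait time are illustrative choices, and the filter keeps every event whose method starts with "Network.".

```python
import json
import time


def filter_network_logs(raw_logs):
    """Keep only the DevTools events whose method starts with 'Network.'."""
    network_events = []
    for entry in raw_logs:
        # The "message" field of each log entry is itself a JSON string.
        event = json.loads(entry["message"])["message"]
        if event.get("method", "").startswith("Network."):
            network_events.append(event)
    return network_events


def capture_network_logs(url, driver_path, out_file="network_log.json", wait_seconds=10):
    """Load a page in Chrome and write its Network events to a JSON file."""
    # Imported here so the filtering helper works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    # Enable performance logging (Selenium 4 style; an assumption, see above).
    options.set_capability("goog:loggingPrefs", {"performance": "ALL"})

    driver = webdriver.Chrome(service=Service(executable_path=driver_path), options=options)
    try:
        driver.get(url)
        time.sleep(wait_seconds)  # give the page time to finish loading
        raw_logs = driver.get_log("performance")
    finally:
        driver.quit()

    events = filter_network_logs(raw_logs)
    with open(out_file, "w", encoding="utf-8") as f:
        f.write(json.dumps(events, indent=2))
    return events
```

Calling capture_network_logs() with a page URL and the copied driver path should leave a JSON file containing one object per network event. Keeping the filter in its own function makes it possible to try it on saved log entries without launching a browser.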
Method 2: Using browsermobproxy to capture the HAR file from the network tab of the browser
For this, the following requirements need to be satisfied.
- Download and install Java 8.
- Download and extract browsermobproxy, and copy the path of its bin folder.
- Install the browsermob-proxy Python package with pip by running this command in a terminal:
pip install browsermob-proxy
- Download and extract the Chrome webdriver that matches the version of your Chrome browser, and copy the path to the executable.
- Import the Server class from browsermobproxy, start the server using the copied bin folder path, and set the port to 8090.
- Call the server's create_proxy() method to create the proxy object, setting the “trustAllServers” parameter to true.
- Start the Chrome webdriver with executable_path and Chrome options that route the browser's traffic through the proxy.
- Create a new HAR using the proxy object's new_har() method, passing the domain of the website.
- Send a GET request using driver.get() and wait a few seconds for the page to load properly.
- Write the HAR data captured by the proxy object to a HAR file after converting it to a JSON string using json.dumps().
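A rough sketch of these steps is shown below. It assumes Selenium 4 (driver path passed through a Service object) and that bmp_executable points at the browsermob-proxy launcher inside the copied bin folder. The function names, the output file name and the --ignore-certificate-errors flag are illustrative assumptions, not requirements of the library. The small request_urls() helper is only there to show how the captured HAR data can be inspected afterwards.

```python
import json
import time


def request_urls(har):
    """Return the URL of every request recorded in a HAR dictionary."""
    return [entry["request"]["url"] for entry in har["log"]["entries"]]


def capture_har(url, bmp_executable, driver_path, out_file="network_log.har", wait_seconds=10):
    """Load a page through browsermob-proxy and save the recorded HAR."""
    # Imported here so request_urls() works without these packages installed.
    from browsermobproxy import Server
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    server = Server(bmp_executable, options={"port": 8090})
    server.start()
    driver = None
    try:
        proxy = server.create_proxy(params={"trustAllServers": "true"})

        options = webdriver.ChromeOptions()
        # Route all browser traffic through the proxy so it can be recorded.
        options.add_argument(f"--proxy-server={proxy.proxy}")
        # Assumption: accept the proxy's certificate for HTTPS pages.
        options.add_argument("--ignore-certificate-errors")
        driver = webdriver.Chrome(service=Service(executable_path=driver_path), options=options)

        proxy.new_har(url)  # start recording under this reference name
        driver.get(url)
        time.sleep(wait_seconds)  # give the page time to finish loading

        har = proxy.har
        with open(out_file, "w", encoding="utf-8") as f:
            f.write(json.dumps(har, indent=2))
        return har
    finally:
        if driver is not None:
            driver.quit()
        server.stop()
```

After a run, request_urls(har) gives a quick list of everything the page requested, which is often the first thing to check when looking for an API endpoint behind a page.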