In this article, we will learn how to scrape data from network traffic using Python.
Modules Needed
- selenium: Selenium is a portable framework for controlling a web browser.
- time: This module provides various time-related functions.
- json: This module is required to work with JSON data.
- browsermobproxy: This module helps us get the HAR file from network traffic.
There are two ways to scrape network traffic data.
Method 1: Using selenium’s get_log() method
To start, download and extract the Chrome WebDriver matching the version of your Chrome browser, and copy the executable path.
Approach:
- Import the DesiredCapabilities from the selenium module and enable performance logging.
- Start up the Chrome WebDriver with the executable_path, the default chrome-options (or add some arguments to them), and the modified desired_capabilities.
- Send a GET request to the website using driver.get() and wait a few seconds for the page to load.
Syntax:
driver.get(url)
- Get the performance logs using driver.get_log() and store it in a variable.
Syntax:
driver.get_log("performance")
- Iterate over every log and parse it using json.loads() to filter out all the network-related logs.
- Write the filtered logs to a JSON file by converting each to a JSON string using json.dumps().
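The parse-and-filter step above can be sketched on a single hand-written entry. The sample message below is a hypothetical, minimal version of what Chrome's performance log returns; real entries carry many more fields:

```python
import json

# A hypothetical, minimal performance-log entry of the kind
# driver.get_log("performance") returns: each item is a dict
# whose "message" value is itself a JSON string.
sample_log = {
    "message": json.dumps({
        "message": {
            "method": "Network.responseReceived",
            "params": {"response": {"url": "https://example.com/logo.png"}},
        }
    })
}

# Parse the inner JSON string to reach the actual event dict.
network_log = json.loads(sample_log["message"])["message"]

# Keep only network-related events, as in the full example below.
if ("Network.response" in network_log["method"]
        or "Network.request" in network_log["method"]
        or "Network.webSocket" in network_log["method"]):
    print(network_log["method"])  # Network.responseReceived
```

Note the double lookup: json.loads() is needed because Selenium hands back the DevTools event as a JSON string nested inside the outer log dict.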
Example:
Python3
# Import the required modules
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import time
import json

# Main Function
if __name__ == "__main__":

    # Enable Performance Logging of Chrome.
    desired_capabilities = DesiredCapabilities.CHROME
    desired_capabilities["goog:loggingPrefs"] = {"performance": "ALL"}

    # Create the webdriver object and pass the arguments
    options = webdriver.ChromeOptions()

    # Chrome will start in Headless mode
    options.add_argument("--headless")

    # Ignores any certificate errors if there is any
    options.add_argument("--ignore-certificate-errors")

    # Startup the chrome webdriver with executable path and
    # pass the chrome options and desired capabilities as
    # parameters.
    driver = webdriver.Chrome(executable_path="C:/chromedriver.exe",
                              chrome_options=options,
                              desired_capabilities=desired_capabilities)

    # Send a request to the website and let it load;
    # sleeps for 10 seconds.
    driver.get("https://www.geeksforgeeks.org/")
    time.sleep(10)

    # Gets all the logs from performance in Chrome
    logs = driver.get_log("performance")

    # Opens a writable JSON file and writes the logs in it
    with open("network_log.json", "w", encoding="utf-8") as f:
        f.write("[")

        # Iterates every log and parses it using JSON
        for log in logs:
            network_log = json.loads(log["message"])["message"]

            # Checks if the current 'method' key has any
            # Network related value.
            if ("Network.response" in network_log["method"]
                    or "Network.request" in network_log["method"]
                    or "Network.webSocket" in network_log["method"]):

                # Writes the network log to a JSON file by
                # converting the dictionary to a JSON string
                # using json.dumps().
                f.write(json.dumps(network_log) + ",")

        # Close the JSON array with an empty object so the
        # trailing comma stays valid.
        f.write("{}]")

    print("Quitting Selenium WebDriver")
    driver.quit()

    # Read the JSON file and parse it using
    # json.loads() to find the urls containing images.
    json_file_path = "network_log.json"
    with open(json_file_path, "r", encoding="utf-8") as f:
        logs = json.loads(f.read())

    # Iterate the logs
    for log in logs:

        # Except block will be accessed if any of the
        # following keys are missing.
        try:
            # URL is present inside the following keys
            url = log["params"]["request"]["url"]

            # Checks if the extension is .png or .jpg
            if url.endswith(".png") or url.endswith(".jpg"):
                print(url, end="\n\n")
        except KeyError:
            pass
Output:
Method 2: Using browsermobproxy to capture the HAR file from the network tab of the browser
For this, the following requirements need to be satisfied.
- Download and install Java v8.
- Download and extract browsermobproxy and copy the path of the bin folder.
- Install browsermob-proxy using pip with the following command in the terminal:
pip install browsermob-proxy
- Download and extract the Chrome WebDriver according to the version of your Chrome browser and copy the executable path.
Approach:
- Import the Server module from browsermobproxy and start up the server with the copied bin folder path, setting the port to 8090.
- Call the create_proxy method to create the proxy object from the server and set the "trustAllServers" parameter to true.
- Start up the Chrome WebDriver with the executable_path and the chrome-options discussed in the code below.
- Now, create a new HAR file using the proxy object with the domain of the website.
- Send a GET request using driver.get() and wait a few seconds for the page to load properly.
Syntax:
driver.get(url)
- Write the network traffic captured by the proxy object to a HAR file by converting it to a JSON string using json.dumps().
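The HAR written in the last step is plain JSON, so the lookup used later to extract image URLs can be sketched on a hand-made, minimal HAR dictionary (real files from browsermob-proxy follow the same 'log' → 'entries' layout but contain many more fields):

```python
# A hypothetical, stripped-down HAR structure: browsermob-proxy's
# proxy.har produces the same 'log' -> 'entries' nesting.
har = {
    "log": {
        "entries": [
            {"request": {"url": "https://example.com/banner.jpg"}},
            {"request": {"url": "https://example.com/index.html"}},
        ]
    }
}

# Each entry's request URL lives under entry['request']['url'];
# keep only the image URLs.
image_urls = [entry["request"]["url"]
              for entry in har["log"]["entries"]
              if entry["request"]["url"].endswith((".png", ".jpg"))]

print(image_urls)  # ['https://example.com/banner.jpg']
```

This is the same traversal the full example below performs after re-reading the HAR file from disk with json.loads().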
Example:
Python3
# Import the required modules
from selenium import webdriver
from browsermobproxy import Server
import time
import json

# Main Function
if __name__ == "__main__":

    # Enter the path of the bin folder by
    # extracting browsermob-proxy-2.1.4-bin
    path_to_browsermobproxy = "C:\\browsermob-proxy-2.1.4\\bin\\"

    # Start the server with the path and port 8090
    server = Server(path_to_browsermobproxy
                    + "browsermob-proxy", options={'port': 8090})
    server.start()

    # Create the proxy with the following parameter as true
    proxy = server.create_proxy(params={"trustAllServers": "true"})

    # Create the webdriver object and pass the arguments
    options = webdriver.ChromeOptions()

    # Chrome will start in Headless mode
    options.add_argument("--headless")

    # Ignores any certificate errors if there is any
    options.add_argument("--ignore-certificate-errors")

    # Setting up proxy for Chrome
    options.add_argument("--proxy-server={0}".format(proxy.proxy))

    # Startup the chrome webdriver with executable path and
    # the chrome options as parameters.
    driver = webdriver.Chrome(executable_path="C:/chromedriver.exe",
                              chrome_options=options)

    # Create a new HAR file of the following domain
    # using the proxy.
    proxy.new_har("geeksforgeeks.org/")

    # Send a request to the website and let it load;
    # sleeps for 10 seconds.
    driver.get("https://www.geeksforgeeks.org/")
    time.sleep(10)

    # Write it to a HAR file.
    with open("network_log1.har", "w", encoding="utf-8") as f:
        f.write(json.dumps(proxy.har))

    print("Quitting Selenium WebDriver")
    driver.quit()

    # Read the HAR file and parse it using JSON
    # to find the urls containing images.
    har_file_path = "network_log1.har"
    with open(har_file_path, "r", encoding="utf-8") as f:
        logs = json.loads(f.read())

    # Store the network logs from the 'entries' key and
    # iterate them
    network_logs = logs['log']['entries']
    for log in network_logs:

        # Except block will be accessed if any of the
        # following keys are missing
        try:
            # URL is present inside the following keys
            url = log['request']['url']

            # Checks if the extension is .png or .jpg
            if url.endswith('.png') or url.endswith('.jpg'):
                print(url, end="\n\n")
        except KeyError:
            pass

    # Stop the proxy server
    server.stop()
Output: