
Scraping data in network traffic using Python


In this article, we will learn how to scrape data from network traffic using Python.

Modules Needed

  • selenium: Selenium is a portable framework for controlling web browsers.
  • time: This module provides various time-related functions.
  • json: This module is required to work with JSON data.
  • browsermobproxy: This module helps us to get the HAR file from network traffic.

There are two ways to scrape the network traffic data.

Method 1: Using selenium’s get_log() method 

To begin, download and extract the Chrome webdriver that matches the version of your Chrome browser, then copy the path to the executable.

Approach:

  • Import DesiredCapabilities from the selenium module and enable performance logging.
  • Start the Chrome webdriver with executable_path, the Chrome options (the defaults, or with extra arguments added), and the modified desired_capabilities.
  • Send a GET request to the website using driver.get() and wait a few seconds for the page to load.

Syntax:

driver.get(url)

  • Get the performance logs using driver.get_log() and store them in a variable.

Syntax:

driver.get_log("performance")

  • Iterate over every log and parse it using json.loads() to filter all the Network related logs.
  • Write the filtered logs to a JSON file by converting them to a JSON string using json.dumps().
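
The filtering step can be tried without a browser. The sketch below is only an illustration: sample_logs is hypothetical data shaped like the entries returned by driver.get_log("performance"), and filter_network_events and NETWORK_PREFIXES are names introduced here, not part of Selenium. It matches on the start of the method name, which is slightly stricter than the substring test used in the full example.

```python
import json

# Hypothetical sample shaped like the entries returned by
# driver.get_log("performance"); real entries carry many more fields.
sample_logs = [
    {"message": json.dumps({"message": {
        "method": "Network.requestWillBeSent",
        "params": {"request": {"url": "https://example.com/logo.png"}}}})},
    {"message": json.dumps({"message": {
        "method": "Page.loadEventFired",
        "params": {}}})},
    {"message": json.dumps({"message": {
        "method": "Network.responseReceived",
        "params": {"response": {"url": "https://example.com/logo.png"}}}})},
]

# Method-name prefixes that mark an event as Network related.
NETWORK_PREFIXES = ("Network.request", "Network.response", "Network.webSocket")


def filter_network_events(logs):
    """Decode each log entry and keep only the Network related events."""
    events = []
    for entry in logs:
        # Each entry stores the event as a JSON string under "message".
        event = json.loads(entry["message"])["message"]
        if event["method"].startswith(NETWORK_PREFIXES):
            events.append(event)
    return events


network_events = filter_network_events(sample_logs)

# Serialize the filtered events as one JSON array.
print(json.dumps(network_events, indent=2))
```

Collecting the events in a list and serializing once keeps the output a valid JSON array without any manual bracket or comma handling.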

Example:

Python3




# Import the required modules
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import time
import json
  
  
# Main Function
if __name__ == "__main__":
  
    # Enable Performance Logging of Chrome.
    desired_capabilities = DesiredCapabilities.CHROME
    desired_capabilities["goog:loggingPrefs"] = {"performance": "ALL"}
  
    # Create the webdriver object and pass the arguments
    options = webdriver.ChromeOptions()
  
    # Chrome will start in Headless mode
    options.add_argument('headless')
  
    # Ignores any certificate errors if there is any
    options.add_argument("--ignore-certificate-errors")
  
    # Startup the chrome webdriver with executable path and
    # pass the chrome options and desired capabilities as
    # parameters.
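    # Note: executable_path and desired_capabilities follow the
    # Selenium 3 API; recent Selenium 4 releases removed both.
    driver = webdriver.Chrome(executable_path="C:/chromedriver.exe",
                              options=options,
                              desired_capabilities=desired_capabilities)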
  
    # Send a request to the website and let it load
    driver.get("https://www.geeksforgeeks.org/")
  
    # Sleeps for 10 seconds
    time.sleep(10)
  
    # Gets all the logs from performance in Chrome
    logs = driver.get_log("performance")
  
    # Collects the Network related logs in a list
    network_logs = []
  
    # Iterates over every log and parses it using JSON
    for log in logs:
        network_log = json.loads(log["message"])["message"]
  
        # Checks if the current 'method' key has any
        # Network related value.
        if ("Network.response" in network_log["method"]
                or "Network.request" in network_log["method"]
                or "Network.webSocket" in network_log["method"]):
            network_logs.append(network_log)
  
    # Opens a writable JSON file and writes the filtered logs
    # to it as a single JSON array, converting the list to a
    # JSON string using json.dumps().
    with open("network_log.json", "w", encoding="utf-8") as f:
        f.write(json.dumps(network_logs))
  
    print("Quitting Selenium WebDriver")
    driver.quit()
  
    # Read the JSON File and parse it using
    # json.loads() to find the urls containing images.
    json_file_path = "network_log.json"
    with open(json_file_path, "r", encoding="utf-8") as f:
        logs = json.loads(f.read())
  
    # Iterate the logs
    for log in logs:
  
        # Except block will be accessed if any of the
        # following keys are missing.
        try:
            # URL is present inside the following keys
            url = log["params"]["request"]["url"]
        except (KeyError, TypeError):
            continue
  
        # Checks if the extension is .png or .jpg
        if url.endswith((".png", ".jpg")):
            print(url, end='\n\n')


Output:

[Screenshot: console output with the image URLs highlighted]

[Screenshot: network_log.json containing the image URLs]
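
The extension check in the example above looks only at the end of the full URL string, so an image URL that carries a query string or an uppercase extension would be skipped. One possible refinement, sketched here with the standard library, compares the extension of the URL path instead. The names is_image_url and IMAGE_EXTENSIONS and the sample URLs are hypothetical and used only for illustration.

```python
from urllib.parse import urlparse

# Extensions treated as images in this sketch (same two as the article).
IMAGE_EXTENSIONS = (".png", ".jpg")


def is_image_url(url):
    """Return True if the path part of the URL ends in an image extension."""
    # urlparse separates the path from the query string and fragment,
    # so "logo.png?v=2" is still recognised as a .png file.
    path = urlparse(url).path.lower()
    return path.endswith(IMAGE_EXTENSIONS)


# Hypothetical URLs used only to exercise the helper.
sample_urls = [
    "https://example.com/static/logo.png?v=2",
    "https://example.com/photos/BANNER.JPG",
    "https://example.com/index.html",
]

image_urls = [url for url in sample_urls if is_image_url(url)]
for url in image_urls:
    print(url)
```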

Method 2: Using browsermobproxy to capture the HAR file from the network tab of the browser

For this, the following requirements need to be satisfied.

  • Download and install Java 8.
  • Download and extract browsermobproxy and copy the path of its bin folder.
  • Install the browsermob-proxy Python package with pip by running this command in a terminal:

pip install browsermob-proxy

  • Download and extract the Chrome webdriver that matches the version of your Chrome browser, and copy the path to the executable.

Approach:

  • Import the Server class from browsermobproxy, start the server with the copied bin folder path, and set the port to 8090.
  • Call the server's create_proxy() method to create the proxy object and set the "trustAllServers" parameter to true.
  • Start the Chrome webdriver with executable_path and the Chrome options shown in the code below.
  • Create a new HAR using the proxy object, passing the domain of the website.
  • Send a GET request using driver.get() and wait a few seconds for the page to load properly.

Syntax:

driver.get(url)

  • Write the network traffic captured by the proxy object to a HAR file, converting it to a JSON string using json.dumps().
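
A HAR document is plain JSON: the captured requests sit in a list under log → entries, and each entry holds a request object with the URL. The sketch below illustrates that shape with a hypothetical, heavily trimmed HAR dictionary (sample_har) and a helper named request_urls introduced only for this example. A real HAR from the proxy contains many more fields.

```python
import json

# Hypothetical, heavily trimmed HAR structure; a real capture has
# many more keys per entry (timings, headers, response, and so on).
sample_har = {
    "log": {
        "version": "1.2",
        "entries": [
            {"request": {"method": "GET",
                         "url": "https://example.com/"}},
            {"request": {"method": "GET",
                         "url": "https://example.com/banner.jpg"}},
        ],
    }
}


def request_urls(har):
    """Return the request URL of every entry in a HAR dictionary."""
    return [entry["request"]["url"] for entry in har["log"]["entries"]]


# Round-trip through a JSON string, as happens when the HAR is
# written to a file with json.dumps() and read back with json.loads().
har_text = json.dumps(sample_har)
restored_har = json.loads(har_text)

urls = request_urls(restored_har)
print(urls)
```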

Example:

Python3




# Import the required modules
from selenium import webdriver
from browsermobproxy import Server
import time
import json
  
  
# Main Function
if __name__ == "__main__":
  
    # Enter the path of bin folder by
    # extracting browsermob-proxy-2.1.4-bin
    path_to_browsermobproxy = "C:\\browsermob-proxy-2.1.4\\bin\\"
  
    # Start the server with the path and port 8090
    server = Server(path_to_browsermobproxy
                    + "browsermob-proxy", options={'port': 8090})
    server.start()
  
    # Create the proxy with following parameter as true
    proxy = server.create_proxy(params={"trustAllServers": "true"})
  
    # Create the webdriver object and pass the arguments
    options = webdriver.ChromeOptions()
  
    # Chrome will start in Headless mode
    options.add_argument('headless')
  
    # Ignores any certificate errors if there is any
    options.add_argument("--ignore-certificate-errors")
  
    # Setting up Proxy for chrome
    options.add_argument("--proxy-server={0}".format(proxy.proxy))
  
    # Startup the chrome webdriver with executable path and
    # the chrome options as parameters.
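    # Note: executable_path follows the Selenium 3 API; recent
    # Selenium 4 releases removed this keyword argument.
    driver = webdriver.Chrome(executable_path="C:/chromedriver.exe",
                              options=options)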
  
    # Create a new HAR file of the following domain
    # using the proxy.
    proxy.new_har("geeksforgeeks.org/")
  
    # Send a request to the website and let it load
    driver.get("https://www.geeksforgeeks.org/")
  
    # Sleeps for 10 seconds
    time.sleep(10)
  
    # Write it to a HAR file.
    with open("network_log1.har", "w", encoding="utf-8") as f:
        f.write(json.dumps(proxy.har))
  
    print("Quitting Selenium WebDriver")
    driver.quit()
  
    # Read HAR File and parse it using JSON
    # to find the urls containing images.
    har_file_path = "network_log1.har"
    with open(har_file_path, "r", encoding="utf-8") as f:
        logs = json.loads(f.read())
  
    # Store the network logs from 'entries' key and
    # iterate them
    network_logs = logs['log']['entries']
    for log in network_logs:
  
        # Except block will be accessed if any of the
        # following keys are missing
        try:
            # URL is present inside the following keys
            url = log['request']['url']
        except (KeyError, TypeError):
            continue
  
        # Checks if the extension is .png or .jpg
        if url.endswith(('.png', '.jpg')):
            print(url, end="\n\n")


Output:

[Screenshot: console output with the image URLs highlighted]

[Screenshot: network_log1.har containing the image URLs]
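
In the HAR 1.2 format, each entry also records the response MIME type under response → content → mimeType. As an alternative to matching file extensions, images could be selected by that field. This is a minimal sketch under that assumption: image_request_urls and the sample data are hypothetical, and a given capture may leave mimeType empty for some entries.

```python
# Hypothetical trimmed HAR data; the field names follow HAR 1.2.
sample_har = {
    "log": {
        "entries": [
            {"request": {"url": "https://example.com/"},
             "response": {"content": {"mimeType": "text/html"}}},
            {"request": {"url": "https://example.com/img?id=7"},
             "response": {"content": {"mimeType": "image/png"}}},
            {"request": {"url": "https://example.com/pending"}},
        ]
    }
}


def image_request_urls(har):
    """Return request URLs whose response MIME type starts with 'image/'."""
    urls = []
    for entry in har.get("log", {}).get("entries", []):
        # Missing keys fall back to empty values instead of raising.
        content = entry.get("response", {}).get("content", {})
        mime_type = content.get("mimeType", "")
        if mime_type.startswith("image/"):
            urls.append(entry["request"]["url"])
    return urls


image_urls = image_request_urls(sample_har)
print(image_urls)
```

This also catches images served from URLs without a file extension, which the extension check cannot detect.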



Last Updated : 13 Jul, 2021