Open In App

Scraping data in network traffic using Python

Last Updated : 13 Jul, 2021
Improve
Improve
Like Article
Like
Save
Share
Report

In this article, we will learn how to scrap data in network traffic using Python.

Modules Needed

  • selenium: Selenium is a portable framework for controlling web browser.
  • time: This module provides various time-related functions.
  • json: This module is required to work with JSON data.
  • browsermobproxy: This module helps us to get the HAR file from network traffic.

There are two ways by which we can scrap the network traffic data.

Method 1: Using selenium’s get_log() method 

To start with this download and extract the chrome webdriver from here according to the version of your chrome browser and copy the executable path.

Approach:

  • Import the DesiredCapabilities from the selenium module and enable performance logging.
  • Startup the chrome webdriver with executable_path and default chrome-options or add some arguments to it and the modified desired_capabilities.
  • Send a GET request to the website using driver.get() and wait for few seconds to load the page.

Syntax:

driver.get(url)

  • Get the performance logs using driver.get_log() and store it in a variable.

Syntax:

driver.get_log(“performance”)

  • Iterate every log and parse it using json.loads() to filter all the Network related logs.
  • Write the filtered logs to a JSON file by converting to JSON string using json.dumps().

Example:

Python3




# Import the required modules
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import time
import json
  
  
# Main Function
if __name__ == "__main__":
  
    # Enable Performance Logging of Chrome.
    desired_capabilities = DesiredCapabilities.CHROME
    desired_capabilities["goog:loggingPrefs"] = {"performance": "ALL"}
  
    # Create the webdriver object and pass the arguments
    options = webdriver.ChromeOptions()
  
    # Chrome will start in Headless mode
    options.add_argument('headless')
  
    # Ignores any certificate errors if there is any
    options.add_argument("--ignore-certificate-errors")
  
    # Startup the chrome webdriver with executable path and
    # pass the chrome options and desired capabilities as
    # parameters.
    driver = webdriver.Chrome(executable_path="C:/chromedriver.exe",
                              chrome_options=options,
                              desired_capabilities=desired_capabilities)
  
    # Send a request to the website and let it load
    driver.get("https://www.geeksforgeeks.org/")
  
    # Sleeps for 10 seconds
    time.sleep(10)
  
    # Gets all the logs from performance in Chrome
    logs = driver.get_log("performance")
  
    # Opens a writable JSON file and writes the logs in it
    with open("network_log.json", "w", encoding="utf-8") as f:
        f.write("[")
  
        # Iterates every logs and parses it using JSON
        for log in logs:
            network_log = json.loads(log["message"])["message"]
  
            # Checks if the current 'method' key has any
            # Network related value.
            if("Network.response" in network_log["method"]
                    or "Network.request" in network_log["method"]
                    or "Network.webSocket" in network_log["method"]):
  
                # Writes the network log to a JSON file by
                # converting the dictionary to a JSON string
                # using json.dumps().
                f.write(json.dumps(network_log)+",")
        f.write("{}]")
  
    print("Quitting Selenium WebDriver")
    driver.quit()
  
    # Read the JSON File and parse it using
    # json.loads() to find the urls containing images.
    json_file_path = "network_log.json"
    with open(json_file_path, "r", encoding="utf-8") as f:
        logs = json.loads(f.read())
  
    # Iterate the logs
    for log in logs:
  
        # Except block will be accessed if any of the
        # following keys are missing.
        try:
            # URL is present inside the following keys
            url = log["params"]["request"]["url"]
  
            # Checks if the extension is .png or .jpg
            if url[len(url)-4:] == ".png" or url[len(url)-4:] == ".jpg":
                print(url, end='\n\n')
        except Exception as e:
            pass


Output:

The image URL’s are highlighted above.

network_log.json containing the image URL’s

Method 2: Using browsermobproxy to capture the HAR file from the network tab of the browser

For this, the following requirements need to be satisfied.

  • Download and Install Java v8 from here
  • Download and extract browsermobproxy from here and copy the path of bin folder.
  • Install browsermob-proxy using pip using the command in terminal : 

pip install browsermob-proxy

  • Download and extract the chrome webdriver from here, according the version of your chrome browser and copy the executable path.

Approach:

  • Import the Server module from browsermobproxy and start up the Server with the copied bin folder path and set port as 8090.
  • Call the create_proxy method to create the proxy object from Server and set “trustAllServers” parameter as true.
  • Startup the chrome webdriver with executable_path and chrome-options discussed in code below.
  • Now, create a new HAR file using the proxy object with the domain of the website.
  • Send a GET request using driver.get() and wait for few seconds to load it properly.

Syntax:

driver.get(url)

  • Write the HAR file of network traffic from the proxy object to a HAR file by converting it to JSON string using json.dumps().

Example:

Python3




# Import the required modules
from selenium import webdriver
from browsermobproxy import Server
import time
import json
  
  
# Main Function
if __name__ == "__main__":
  
    # Enter the path of bin folder by
    # extracting browsermob-proxy-2.1.4-bin
    path_to_browsermobproxy = "C:\\browsermob-proxy-2.1.4\\bin\\"
  
    # Start the server with the path and port 8090
    server = Server(path_to_browsermobproxy
                    + "browsermob-proxy", options={'port': 8090})
    server.start()
  
    # Create the proxy with following parameter as true
    proxy = server.create_proxy(params={"trustAllServers": "true"})
  
    # Create the webdriver object and pass the arguments
    options = webdriver.ChromeOptions()
  
    # Chrome will start in Headless mode
    options.add_argument('headless')
  
    # Ignores any certificate errors if there is any
    options.add_argument("--ignore-certificate-errors")
  
    # Setting up Proxy for chrome
    options.add_argument("--proxy-server={0}".format(proxy.proxy))
  
    # Startup the chrome webdriver with executable path and
    # the chrome options as parameters.
    driver = webdriver.Chrome(executable_path="C:/chromedriver.exe",
                              chrome_options=options)
  
    # Create a new HAR file of the following domain
    # using the proxy.
    proxy.new_har("geeksforgeeks.org/")
  
    # Send a request to the website and let it load
    driver.get("https://www.geeksforgeeks.org/")
  
    # Sleeps for 10 seconds
    time.sleep(10)
  
    # Write it to a HAR file.
    with open("network_log1.har", "w", encoding="utf-8") as f:
        f.write(json.dumps(proxy.har))
  
    print("Quitting Selenium WebDriver")
    driver.quit()
  
    # Read HAR File and parse it using JSON
    # to find the urls containing images.
    har_file_path = "network_log1.har"
    with open(har_file_path, "r", encoding="utf-8") as f:
        logs = json.loads(f.read())
  
    # Store the network logs from 'entries' key and
    # iterate them
    network_logs = logs['log']['entries']
    for log in network_logs:
  
        # Except block will be accessed if any of the
        # following keys are missing
        try:
            # URL is present inside the following keys
            url = log['request']['url']
  
            # Checks if the extension is .png or .jpg
            if url[len(url)-4:] == '.png' or url[len(url)-4:] == '.jpg':
                print(url, end="\n\n")
        except Exception as e:
            # print(e)
            pass


Output:

The image URL’s are highlighted above.

network_log1.har containing the image URL’s



Like Article
Suggest improvement
Previous
Next
Share your thoughts in the comments

Similar Reads