Web crawling with Python

Last Updated : 02 May, 2024

Web crawling is a powerful technique for gathering information from websites by programmatically fetching their pages and following the URLs they contain to reach the relevant data. Python has various libraries and frameworks that support web crawling. In this article, we will look at web crawling in Python through a series of examples.

Pre-requisites

To run the examples below, you need Python installed along with the `requests`, `beautifulsoup4`, `lxml`, and `pandas` libraries, which can be installed with `pip install requests beautifulsoup4 lxml pandas`.

Web Crawling Using Python

Below are some of the examples by which we can implement web crawling in Python:

Basic Example of Web Crawling

Here, we will see how web crawling works with the help of a simple example. A GET request is made to the GeeksforGeeks website using the `requests` library, and the HTTP status code and the HTML content of the response are printed.

Python
# Import the requests library
import requests

# Define the URL of the website to scrape
URL = "https://www.geeksforgeeks.org/"

# Send a GET request to the specified URL and store the response in 'resp'
resp = requests.get(URL)

# Print the HTTP status code of the response to check if the request was successful
print("Status Code:", resp.status_code)

# Print the HTML content of the response
print("\nResponse Content:")
print(resp.text)

Output:

Status Code: 200
Response Content:
<!DOCTYPE html>
<!--[if IE 7]>
<html class="ie ie7" lang="en-US" prefix="og: http://ogp.me/ns#">
<![endif]-->
<!--[if IE 8]>
<html class="ie ie8" lang="en-US" prefix="og: http://ogp.me/ns#">
<![endif]-->
<!--[if !(IE 7) | !(IE 8) ]><!-->
<html lang="en-US" prefix="og: http://ogp.me/ns#" >
<!-- ... other HTML content ... -->
</html>
........................................
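
The request above fetches only a single page. A crawler usually goes a step further and collects the links found on each page so that it can visit them next. Below is a minimal sketch of such a link collector, assuming the `requests` and `beautifulsoup4` libraries are installed; the starting URL is only an illustration.

Python
# A minimal sketch of a link-collecting crawler (assumes the requests and
# beautifulsoup4 libraries are installed; the starting URL is illustrative)
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def collect_links(url):
    # Fetch the page and parse its HTML
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, "html.parser")

    # Resolve every <a href="..."> to an absolute URL and return the unique set
    return {urljoin(url, a["href"]) for a in soup.find_all("a", href=True)}

links = collect_links("https://www.geeksforgeeks.org/")
print(f"Found {len(links)} links on the page")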

Web Crawling in JSON Format

In this example, we will see how to crawl a JSON endpoint and convert the response into a Python dictionary. A GET request is made to the ISS location API using the `requests` library. If the request is successful (indicated by a status code of 200), the ISS's current location data is fetched and printed. Otherwise, an error message with the corresponding status code is displayed.

Python
# Import the requests library
import requests

# Define the URL for the ISS (International Space Station) location API
URL = "http://api.open-notify.org/iss-now.json"

# Send a GET request to the API and store the response
response = requests.get(URL)

# Check if the request was successful (status code 200 indicates success)
if response.status_code == 200:
    data = response.json()

    # Print the parsed data (ISS location details)
    print("ISS Location Data:")
    print(data)
else:
    print(
        f"Error: Failed to retrieve data. Status code: {response.status_code}")

Output:

ISS Location Data:
{'iss_position': {'latitude': '36.4100', 'longitude': '-12.5948'}, 'message': 'success', 'timestamp': 1704776784}
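
Once the response is parsed with `response.json()`, it is an ordinary Python dictionary, so individual fields can be read with normal key access. Here is a short sketch, assuming the same ISS API response structure shown above.

Python
# Reading individual fields from the parsed JSON dictionary
# (assumes the same ISS API response structure shown above)
import requests

response = requests.get("http://api.open-notify.org/iss-now.json")

if response.status_code == 200:
    data = response.json()

    # 'iss_position' is a nested dictionary with latitude and longitude keys
    position = data["iss_position"]
    print("Latitude:", position["latitude"])
    print("Longitude:", position["longitude"])
else:
    print("Failed to fetch the ISS location.")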

Web Scraping Images with Python

We can also download and extract images from a web page by using web crawling in Python. In this example, a GET request is made to fetch an image from a given URL using the `requests` library. If the request is successful, the image data is saved to a local file named “gfg_logo.png”. Otherwise, a failure message is displayed.

Python
import requests

image_url = "https://media.geeksforgeeks.org/wp-content/uploads/20230505175603/100-Days-of-Machine-Learning.webp"
output_filename = "gfg_logo.png"

response = requests.get(image_url)

if response.status_code == 200:
    with open(output_filename, "wb") as file:
        file.write(response.content)
    print(f"Image downloaded successfully as {output_filename}")
else:
    print("Failed to download the image.")

Output:

[Image: the downloaded GeeksforGeeks image, saved locally as gfg_logo.png]
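
For larger files, it is common to stream the download in chunks instead of loading the whole response into memory at once. Below is a sketch of the same download using `stream=True`; the chunk size of 8192 bytes is an arbitrary choice.

Python
# Streaming the image download in chunks instead of loading it all at once
# (a sketch; the chunk size of 8192 bytes is an arbitrary choice)
import requests

image_url = "https://media.geeksforgeeks.org/wp-content/uploads/20230505175603/100-Days-of-Machine-Learning.webp"
output_filename = "gfg_logo.png"

with requests.get(image_url, stream=True) as response:
    if response.status_code == 200:
        with open(output_filename, "wb") as file:
            # Write the response body piece by piece
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)
        print(f"Image downloaded successfully as {output_filename}")
    else:
        print("Failed to download the image.")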

Crawling Elements Using XPath

We will now see how to crawl web pages and extract the data we want using XPath. Here, we will extract temperature information from a weather website that reports the weather of a particular place; specifically, we will extract the current temperature of Noida using Python code.

In this example, a GET request is made to fetch weather information from a specified URL using the `requests` library. If the request is successful, the HTML content is parsed using `lxml.etree`. The script then searches for the temperature element within the parsed content. If found, it prints the current temperature; otherwise, it indicates that the temperature element was not found or that the webpage fetching failed.

Python
from lxml import etree
import requests

weather_url = "https://weather.com/en-IN/weather/today/l/60f76bec229c75a05ac18013521f7bfb52c75869637f3449105e9cb79738d492"

response = requests.get(weather_url)

if response.status_code == 200:
    dom = etree.HTML(response.text)
    elements = dom.xpath(
        "//span[@data-testid='TemperatureValue' and contains(@class,'CurrentConditions')]")
    
    if elements:
        temperature = elements[0].text
        print(f"The current temperature is: {temperature}")
    else:
        print("Temperature element not found.")
else:
    print("Failed to fetch the webpage.")

Output:

The current temperature is: 12
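
Because the markup of a live website can change at any time, it can help to see the same XPath pattern applied to a small, self-contained HTML snippet. The snippet below is made up for illustration and is not taken from any real site.

Python
# The same XPath pattern applied to a small, self-contained HTML snippet
# (the snippet below is made up for illustration, not taken from a real site)
from lxml import etree

html = """
<div class="CurrentConditions--wrapper">
  <span data-testid="TemperatureValue" class="CurrentConditions--temp">21</span>
</div>
"""

dom = etree.HTML(html)

# Match on the data-testid attribute and a class name containing 'CurrentConditions'
elements = dom.xpath(
    "//span[@data-testid='TemperatureValue' and contains(@class, 'CurrentConditions')]")

if elements:
    print("Extracted temperature:", elements[0].text)  # prints 21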

Reading Tables on the Web Using Pandas

Here, we will see how to read tables on the web using Pandas. In this example, the `pandas` library is used to extract tables from a specified URL using its `read_html` function. If tables are successfully extracted from the webpage, they are printed one by one with a separator. If no tables are found, a message indicating this is displayed.

Python
import pandas as pd

url = "https://www.geeksforgeeks.org/html-tables/"
extracted_tables = pd.read_html(url)

if extracted_tables:
    for idx, table in enumerate(extracted_tables, 1):
        print(f"Table {idx}:")
        print(table)
        print("-" * 50)
else:
    print("No tables found on the webpage.")

Output:

Table 1:
HTML Tags Descriptions
0 <table> Defines the structure for organizing data in r…
1 <tr> Represents a row within an HTML table, contain…
2 <th> Shows a table header cell that typically holds…
3 <td> Represents a standard data cell, holding conte…
4 <caption> Provides a title or description for the entire…
5 <thead> Defines the header section of a table, often c…
6 <tbody> Represents the main content area of a table, s…
7 <tfoot> Specifies the footer section of a table, typic…
8 <col> Defines attributes for table columns that can …
9 <colgroup> Groups together a set of columns in a table to…
--------------------------------------------------
Table 2:
0
0 <!DOCTYPE html> <html> <body> <table> <tr> …
--------------------------------------------------
Table 3:
0
0 <!DOCTYPE html> <html> <body> <table> <tr> …
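
Each entry returned by `pd.read_html` is a regular pandas `DataFrame`, so the extracted tables can be processed or saved like any other. Below is a short sketch that writes each table to its own CSV file; the URL and output file names are illustrative.

Python
# Each extracted table is a pandas DataFrame, so it can be saved like any other
# (the URL and output file names are illustrative)
import pandas as pd

url = "https://www.geeksforgeeks.org/html-tables/"
extracted_tables = pd.read_html(url)

for idx, table in enumerate(extracted_tables, 1):
    filename = f"table_{idx}.csv"
    table.to_csv(filename, index=False)
    print(f"Saved table {idx} with {len(table)} rows to {filename}")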

Crawl a Web Page and Get Most Frequent Words

The task is to count the most frequent words on a web page. First, create a web crawler (scraper) with the help of the `requests` module and the BeautifulSoup module, which will extract data from the web pages and store the words in a list. There might be some undesired words or symbols (such as special symbols and blank spaces) that can be filtered out to simplify the counts and get the desired results. After counting each word, we can also report the most frequent (say, 10 or 20) words.

In this example, a web page is crawled using `requests` and parsed with `BeautifulSoup`. The script focuses on a specific class within the webpage (`'entry-content'`) and extracts words from it. The extracted words are then stripped of special symbols. After cleaning, a dictionary of word frequencies is created, and the 10 most common words are printed.

Python
# Python3 program to count word frequency on a web page
# pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup
from collections import Counter

def start(url):
    # Fetch the webpage using the URL
    source_code = requests.get(url).text
    
    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(source_code, 'html.parser')
    
    # Initialize an empty list to store the words
    wordlist = []
    
    # Find all 'div' elements with the 'entry-content' class and extract text
    for each_text in soup.findAll('div', {'class': 'entry-content'}):
        content = each_text.text
        
        # Convert text to lowercase and split into words
        words = content.lower().split()
        
        # Add words to the wordlist
        for each_word in words:
            wordlist.append(each_word)
    
    # Clean the word list and create the frequency dictionary
    clean_wordlist(wordlist)

def clean_wordlist(wordlist):
    # Initialize an empty list to store clean words
    clean_list = []
    
    # Define symbols to be removed from words
    symbols = "!@#$%^&*()_-+={[}]|\\;:\"<>?/.,"
    
    # Clean the words in the wordlist
    for word in wordlist:
        for symbol in symbols:
            word = word.replace(symbol, '')
        
        # Only add non-empty words to the clean list
        if len(word) > 0:
            clean_list.append(word)
    
    # Create a dictionary and count word frequencies
    create_dictionary(clean_list)

def create_dictionary(clean_list):
    # Create a Counter object to count word frequencies
    word_count = Counter(clean_list)
    
    # Get the 10 most common words
    top = word_count.most_common(10)
    
    # Print the top 10 most common words
    print("Top 10 most frequent words:")
    for word, count in top:
        print(f'{word}: {count}')

if __name__ == "__main__":
    # Replace the URL with the webpage you want to scrape
    url = 'https://example.com'
    
    # Call the start function with the URL
    start(url)

Output:

Top 10 most frequent words:
to: 10
in: 7
is: 6
language: 6
the: 5
programming: 5
a: 5
c: 5
you: 5
of: 4
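
Most of the frequent words above are common English stop words such as "to" and "the". A small helper like the sketch below (the stop-word set is a minimal hand-picked example, not an exhaustive list) filters them out before counting, and could be used in place of `Counter(clean_list)` inside `create_dictionary`.

Python
# Filtering common stop words before counting
# (the stop-word set here is a small hand-picked example, not an exhaustive list)
from collections import Counter

STOP_WORDS = {"a", "an", "the", "to", "in", "is", "of", "and", "you", "it"}

def top_words(clean_list, n=10):
    # Keep only words that are not in the stop-word set, then count them
    filtered = [word for word in clean_list if word not in STOP_WORDS]
    return Counter(filtered).most_common(n)

# Example usage with a small illustrative word list
print(top_words(["python", "is", "a", "programming", "language",
                 "python", "language", "the", "language"]))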


