Web crawling with Python

Last Updated : 09 Jan, 2024

Web crawling is a technique for gathering information from websites by navigating through their pages and collecting the URLs that lead to relevant data. Python has several libraries and frameworks that support web crawling. This article walks through web crawling in Python with several examples.

Web Crawling Using Python

The examples below show several ways to implement web crawling in Python:

Basic Example of Web Crawling

Here, we will see how we can do web crawling with the help of a simple example. In this example, a GET request is made to the GeeksforGeeks website using the `requests` library, and the HTTP status code and the HTML content of the response are printed.

Python3

# Import the requests library
import requests
 
# Define the URL of the website to scrape
URL = "https://www.geeksforgeeks.org/"
 
# Send a GET request to the specified URL and store the response in 'resp'
resp = requests.get(URL)
 
# Print the HTTP status code of the response to check if the request was successful
print("Status Code:", resp.status_code)
 
# Print the HTML content of the response
print("\nResponse Content:")
print(resp.text)

Output:

Status Code: 200
Response Content:
<!DOCTYPE html>
<!--[if IE 7]>
<html class="ie ie7" lang="en-US" prefix="og: http://ogp.me/ns#">
<![endif]-->
<!--[if IE 8]>
<html class="ie ie8" lang="en-US" prefix="og: http://ogp.me/ns#">
<![endif]-->
<!--[if !(IE 7) | !(IE 8)  ]><!-->
<html lang="en-US" prefix="og: http://ogp.me/ns#" >
<!-- ... other HTML content ... -->
</html>
........................................
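
The example above fetches a single page; a crawler additionally needs the links on that page so it can decide where to go next. The following is a minimal sketch of that step using only the standard library. The names `LinkCollector`, `extract_links`, and `sample_html` are hypothetical and introduced here for illustration, and the sketch assumes that only links on the same site as the base URL are of interest.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return absolute, same-site links found in html, without duplicates."""
    parser = LinkCollector()
    parser.feed(html)
    site = urlparse(base_url).netloc
    links = []
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        if urlparse(absolute).netloc == site and absolute not in links:
            links.append(absolute)
    return links


# Hypothetical page content, used only for illustration
sample_html = """
<a href="/python/">Python</a>
<a href="https://www.geeksforgeeks.org/java/">Java</a>
<a href="https://example.com/other">External</a>
<a href="/python/">Python again</a>
"""

print(extract_links(sample_html, "https://www.geeksforgeeks.org/"))
```

In a real crawl, `resp.text` and `URL` from the example above could be passed to `extract_links`, and each returned link fetched in turn, ideally with a delay between requests and a check of the site's robots.txt.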

Web Crawling in JSON Format

This example shows how to fetch JSON data from the web and convert it to a Python dictionary. A GET request is made to the ISS location API using the `requests` library. If the request is successful (indicated by a status code of 200), the response body is parsed with `response.json()` and the ISS's current location data is printed. Otherwise, an error message with the corresponding status code is displayed.

Python3

# Import the requests library
import requests
 
# Define the URL for the ISS (International Space Station) location API
URL = "http://api.open-notify.org/iss-now.json"
 
# Send a GET request to the API and store the response
response = requests.get(URL)
 
# Check if the request was successful (status code 200 indicates success)
if response.status_code == 200:
    data = response.json()
 
    # Print the parsed data (ISS location details)
    print("ISS Location Data:")
    print(data)
else:
    print(
        f"Error: Failed to retrieve data. Status code: {response.status_code}")

Output:

ISS Location Data:
{'iss_position': {'latitude': '36.4100', 'longitude': '-12.5948'}, 'message': 'success', 'timestamp': 1704776784}
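
Once the response is a dictionary, individual fields can be read by key. The sketch below works on a copy of the sample output shown above, so it does not call the API; `parse_iss_location` is a hypothetical helper name. It assumes the coordinates arrive as strings and the timestamp is in Unix seconds, as in that sample.

```python
from datetime import datetime, timezone

# Sample response copied from the output above
data = {
    "iss_position": {"latitude": "36.4100", "longitude": "-12.5948"},
    "message": "success",
    "timestamp": 1704776784,
}


def parse_iss_location(data):
    """Convert the string coordinates and Unix timestamp to native types."""
    position = data["iss_position"]
    latitude = float(position["latitude"])
    longitude = float(position["longitude"])
    observed_at = datetime.fromtimestamp(data["timestamp"], tz=timezone.utc)
    return latitude, longitude, observed_at


latitude, longitude, observed_at = parse_iss_location(data)
print(f"ISS at ({latitude}, {longitude}) on {observed_at:%Y-%m-%d %H:%M:%S} UTC")
```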

Web Scraping Images with Python

We can also download images from a web page with Python. In this example, a GET request is made to fetch an image from a given URL using the `requests` library. If the request is successful, the image data is saved to a local file named "gfg_logo.webp" (the extension matches the WebP format of the source image). Otherwise, a failure message is displayed.

Python3

import requests

image_url = "https://media.geeksforgeeks.org/wp-content/uploads/20230505175603/100-Days-of-Machine-Learning.webp"
# The source image is WebP, so the local file keeps the .webp extension
output_filename = "gfg_logo.webp"

response = requests.get(image_url)

if response.status_code == 200:
    # Write the raw bytes of the response body to disk
    with open(output_filename, "wb") as file:
        file.write(response.content)
    print(f"Image downloaded successfully as {output_filename}")
else:
    print("Failed to download the image.")
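
When many images are downloaded, hard-coding each output name is error-prone, and the extension can end up disagreeing with the actual image format. The following is a minimal sketch of deriving the local file name from the URL instead. `filename_from_url` and the `download.bin` fallback are hypothetical names, and the sketch assumes the last path segment of the URL is a usable file name.

```python
import posixpath
from urllib.parse import unquote, urlparse


def filename_from_url(url, default="download.bin"):
    """Return the last path segment of url, or default if there is none."""
    path = urlparse(url).path
    name = posixpath.basename(unquote(path))
    return name or default


image_url = "https://media.geeksforgeeks.org/wp-content/uploads/20230505175603/100-Days-of-Machine-Learning.webp"
print(filename_from_url(image_url))
```

The returned name could then be passed to `open(..., "wb")` in place of a fixed file name.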


Crawling Elements Using XPath

We will see how to crawl web pages and extract the data we want using XPath. Here, we extract the current temperature of Noida from a weather website that reports conditions for a particular place.

In this example, a GET request is made to fetch weather information from a specified URL using the `requests` library. If the request is successful, the HTML content is parsed using `lxml.etree`. The script then searches for the temperature element within the parsed content. If found, it prints the current temperature; otherwise, it indicates that the temperature element was not found or that the webpage fetching failed.

Python3

from lxml import etree
import requests
 
weather_url = "https://weather.com/en-IN/weather/today/l/60f76bec229c75a05ac18013521f7bfb52c75869637f3449105e9cb79738d492"
 
response = requests.get(weather_url)
 
if response.status_code == 200:
    dom = etree.HTML(response.text)
    elements = dom.xpath(
        "//span[@data-testid='TemperatureValue' and contains(@class,'CurrentConditions')]")
     
    if elements:
        temperature = elements[0].text
        print(f"The current temperature is: {temperature}")
    else:
        print("Temperature element not found.")
else:
    print("Failed to fetch the webpage.")

Output:

The current temperature is: 12
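
To see what the query selects without depending on the live weather page, the sketch below runs a similar lookup on a small hand-written snippet. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath (there is no `contains()`), so the class check is done in Python instead. The sample markup, its class names, and the helper `current_temperature` are invented for illustration and are not the real page structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup used only for illustration
SAMPLE_HTML = """
<html><body>
  <span data-testid="TemperatureValue" class="CurrentConditions--tempValue">12</span>
  <span data-testid="TemperatureValue" class="DailyForecast--temp">9</span>
  <span data-testid="HumidityValue" class="CurrentConditions--humidity">40</span>
</body></html>
"""


def current_temperature(markup):
    """Return the text of the first matching temperature span, or None."""
    root = ET.fromstring(markup.strip())
    # Attribute predicate: every <span> whose data-testid is TemperatureValue
    candidates = root.findall(".//span[@data-testid='TemperatureValue']")
    for span in candidates:
        # Stand-in for XPath's contains(@class, 'CurrentConditions')
        if "CurrentConditions" in span.get("class", ""):
            return span.text
    return None


print(current_temperature(SAMPLE_HTML))
```

Real web pages are rarely well-formed XML, which is why the example above parses them with `lxml.etree.HTML` rather than `ElementTree`.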

Reading Tables on the Web Using Pandas

Here, we will see how to read tables on the web using Pandas. In this example, the `pandas` library is used to extract tables from a specified URL using its `read_html` function. If tables are successfully extracted from the webpage, they are printed one by one with a separator. If no tables are found, a message indicating this is displayed.

Python3

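import pandas as pd

url = "https://www.geeksforgeeks.org/html-tables/"

try:
    # read_html returns a list of DataFrames, one per <table> on the page
    extracted_tables = pd.read_html(url)
except ValueError:
    # read_html raises ValueError when the page contains no tables
    extracted_tables = []

if extracted_tables:
    for idx, table in enumerate(extracted_tables, 1):
        print(f"Table {idx}:")
        print(table)
        print("-" * 50)
else:
    print("No tables found on the webpage.")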

Output:

Table 1:
    HTML Tags  Descriptions
0     <table>  Defines the structure for organizing data in r…
1        <tr>  Represents a row within an HTML table, contain…
2        <th>  Shows a table header cell that typically holds…
3        <td>  Represents a standard data cell, holding conte…
4   <caption>  Provides a title or description for the entire…
5     <thead>  Defines the header section of a table, often c…
6     <tbody>  Represents the main content area of a table, s…
7     <tfoot>  Specifies the footer section of a table, typic…
8       <col>  Defines attributes for table columns that can …
9  <colgroup>  Groups together a set of columns in a table to…
--------------------------------------------------
Table 2:
   0
0  <!DOCTYPE html> <html> <body> <table> <tr> …
--------------------------------------------------
Table 3:
   0
0  <!DOCTYPE html> <html> <body> <table> <tr> …

Crawl a Web Page and Get Most Frequent Words

The task is to find the most frequent words on a web page. First, create a web crawler or scraper with the `requests` module and the Beautiful Soup library, which extracts the text from the page and stores the words in a list. The list may contain undesired symbols (such as special characters and blank spaces), which can be filtered out so that the counts reflect actual words. After counting each word, we can report the most frequent ones (say, the top 10 or 20).

In this example, a web page is fetched using `requests` and parsed with `BeautifulSoup`. The script focuses on a specific class within the webpage (`'entry-content'`) and extracts words from it. The extracted words are then stripped of a fixed set of symbols. After cleaning, a dictionary of word frequencies is created, and the 10 most common words are printed.

Python3

# Python3 program for a word frequency
# counter after crawling/scraping a web page
import requests
from bs4 import BeautifulSoup
from collections import Counter

def start(url):
    wordlist = []
    source_code = requests.get(url).text
    soup = BeautifulSoup(source_code, 'html.parser')

    # Collect the lower-cased words from every 'entry-content' block
    for each_text in soup.find_all('div', {'class': 'entry-content'}):
        content = each_text.text
        wordlist.extend(content.lower().split())

    # Clean and count once, after all blocks have been read
    clean_wordlist(wordlist)

def clean_wordlist(wordlist):
    clean_list = []
    symbols = "!@#$%^&*()_-+={[}]|\\;:\"<>?/., "

    for word in wordlist:
        # Strip every listed symbol from the word
        for symbol in symbols:
            word = word.replace(symbol, '')

        # Keep the word only if something is left after cleaning
        if len(word) > 0:
            clean_list.append(word)
    create_dictionary(clean_list)

def create_dictionary(clean_list):
    # Counter builds the word -> frequency mapping directly
    c = Counter(clean_list)

    # returns the most occurring elements
    top = c.most_common(10)
    print(top)

if __name__ == '__main__':
    url = "https://www.geeksforgeeks.org/programming-language-choose/"
    # starts crawling and prints output
    start(url)

Output:

[('to', 10), ('in', 7), ('is', 6), ('language', 6), ('the', 5),
 ('programming', 5), ('a', 5), ('c', 5), ('you', 5), ('of', 4)]
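
The output above is dominated by common words such as "to", "in", and "is". One possible refinement, sketched below under the assumption that such words are not of interest, is to drop them before counting. `STOP_WORDS` is a deliberately tiny, hypothetical list, and `top_words` and `sample_text` are names introduced here for illustration. The sketch also swaps the symbol-stripping loop for a regular expression that keeps only runs of letters.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list
STOP_WORDS = {"a", "an", "the", "to", "in", "is", "of", "you", "and"}


def top_words(text, n=10, stop_words=STOP_WORDS):
    """Return the n most common words in text, ignoring stop words."""
    # Keep only runs of letters, which also discards symbols and digits
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word not in stop_words)
    return counts.most_common(n)


sample_text = "Python is a language. The C language is fast, and Python is easy to read."
print(top_words(sample_text, n=3))
```

For the sample sentence this prints `[('python', 2), ('language', 2), ('c', 1)]`. In the crawler above, the text of each `entry-content` block could be passed to `top_words` in the same way.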
