
Extract all the URLs that are nested within <li> tags using BeautifulSoup

Last Updated : 07 Jun, 2023

Beautiful Soup is a Python library used for extracting data from HTML and XML files. In this article, we will see how to extract all the URLs from a web page that are nested within <li> tags.

Modules needed and installation:

  • BeautifulSoup: our primary module, used to parse the HTML of the web page and navigate its tags.
pip install bs4
  • Requests: used to perform a GET request to the webpage and fetch its content.

Note: If requests did not come pre-installed in your environment, you can install it manually.

pip install requests

Approach

  1. We will first import the required libraries.
  2. We will perform a GET request to the web page from which we want to collect the URLs.
  3. We will pass the response text to BeautifulSoup and convert it into a soup object.
  4. Using a for loop, we will look for all the <li> tags in the webpage.
  5. If a <li> tag has an anchor tag in it, we will look for the href attribute and store its value in a list. This is the URL we were looking for.
  6. Finally, we print the list that contains all the URLs.

Let’s have a look at the code and see what happens at each significant step.

Step 1: Initialize the Python program by importing all the required libraries and setting up the URL of the web page from which you want to collect the URLs contained in anchor tags.

In the following example, we will take another GeeksforGeeks article on implementing web scraping using BeautifulSoup and extract all the URLs stored in anchor tags nested within <li> tags.

Link to the article: https://www.geeksforgeeks.org/implementing-web-scraping-python-beautiful-soup/

Python3




# Importing libraries
import requests
from bs4 import BeautifulSoup
  
# setting up the URL
URL = 'https://www.geeksforgeeks.org/implementing-web-scraping-python-beautiful-soup/'


Step 2: We will perform a GET request to the desired URL and pass the text of the response into BeautifulSoup to convert it into a soup object. We will set the parser to html.parser. You can set it differently depending on the webpage you are scraping.

Python3




# perform get request to the url
reqs = requests.get(URL)
  
# extract all the text that you received 
# from the GET request  
content = reqs.text
  
# convert the text to a beautiful soup object
soup = BeautifulSoup(content, 'html.parser')
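As a quick illustration of the point above, only the parser string changes when you switch parsers. This is a minimal sketch, assuming the optional lxml and html5lib packages are installed (pip install lxml html5lib); it is not required for the rest of this article.

Python3

# same page content parsed with different parsers
# (assumes lxml and html5lib are installed: pip install lxml html5lib)
soup_lxml = BeautifulSoup(content, 'lxml')       # fast, lenient C-based parser
soup_html5 = BeautifulSoup(content, 'html5lib')  # slowest, but parses like a browser

# every soup object exposes the same API regardless of the parser used
print(len(soup_lxml.find_all('li')), len(soup_html5.find_all('li')))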


Step 3: Create an empty list to store all the URLs that you will receive as your desired output. Run a for loop that iterates over all the <li> tags in the web page. Then, for each <li> tag, check whether it has an anchor tag in it. If that anchor tag has an href attribute, store the value of that href in the list that you created.

Python3




# Empty list to store the output
urls = []
  
# For loop that iterates over all the <li> tags
for h in soup.find_all('li'):

    # looking for an anchor tag inside the <li> tag
    a = h.find('a')
    try:

        # looking for the href attribute inside the anchor tag
        if 'href' in a.attrs:

            # storing the value of href in a separate
            # variable
            url = a.get('href')

            # appending the url to the output list
            urls.append(url)

    # if the <li> does not have an anchor tag, or the anchor
    # tag does not have an href attribute, we pass
    except AttributeError:
        pass
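
For reference, the same kind of result can be produced more compactly with a CSS selector. This is only an alternative sketch, not the article's approach, and note that it collects every matching anchor inside each <li>, not just the first one.

Python3

# alternative sketch: select every <a> that has an href
# and sits anywhere inside an <li>
urls = [a.get('href') for a in soup.select('li a[href]')]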


Step 4: We print the output by iterating over the list of URLs.

Python3




# print all the urls stored in the urls list
for url in urls:
    print(url)


Complete code:

Python3




# Importing libraries
import requests
from bs4 import BeautifulSoup
  
# setting up the URL
URL = 'https://www.geeksforgeeks.org/implementing-web-scraping-python-beautiful-soup/'
  
# perform get request to the url
reqs = requests.get(URL)
  
# extract all the text that you received from
# the GET request
content = reqs.text
  
# convert the text to a beautiful soup object
soup = BeautifulSoup(content, 'html.parser')
  
# Empty list to store the output
urls = []
  
# For loop that iterates over all the <li> tags
for h in soup.findAll('li'):
    
    # looking for anchor tag inside the <li>tag
    a = h.find('a')
    try:
        
        # looking for href inside anchor tag
        if 'href' in a.attrs:
            
            # storing the value of href in a separate variable
            url = a.get('href')
              
            # appending the url to the output list
            urls.append(url)
              
    # if the list does not has a anchor tag or an anchor tag
    # does not has a href params we pass
    except:
        pass
  
# print all the urls stored in the urls list
for url in urls:
    print(url)
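
As an optional refinement that is not part of the article's code or output: href values are often relative paths, and urllib.parse.urljoin from the standard library can resolve them against the page URL so every entry is an absolute link. A minimal sketch, reusing the URL and urls variables from the code above:

Python3

from urllib.parse import urljoin

# resolve relative hrefs (e.g. '/some-page/') against the page URL
absolute_urls = [urljoin(URL, u) for u in urls]

for u in absolute_urls:
    print(u)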


Output:

All the URLs stored in anchor tags nested within <li> tags on the page are printed, one per line.
