Python program to extract Strings between HTML Tags

Last Updated : 17 May, 2023

Given a string and an HTML tag name, extract all the strings enclosed by that tag.

Input : '<b>Gfg</b> is <b>Best</b>. I love <b>Reading CS</b> from it.' , tag = "b"
Output : ['Gfg', 'Best', 'Reading CS']
Explanation : All strings between "b" tags are extracted.

Input : '<h1>Gfg</h1> is <h1>Best</h1> I love <h1>Reading CS</h1>' , tag = "h1"
Output : ['Gfg', 'Best', 'Reading CS']
Explanation : All strings between "h1" tags are extracted.

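Method 1: Using the re module

The re module handles this task directly: findall() returns every substring captured by a regex built from the tag name and the angle-bracket delimiters. The non-greedy group (.*?) stops at the nearest closing tag, so each tag pair yields its own match.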

Python3

# importing re module
import re
 
# initializing string
test_str = '<b>Gfg</b> is <b>Best</b>. I love <b>Reading CS</b> from it.'
 
# printing original string
print("The original string is : " + str(test_str))
 
# initializing tag
tag = "b"
 
# regex to extract required strings
# re.escape guards against regex metacharacters in the tag name
reg_str = "<" + re.escape(tag) + ">(.*?)</" + re.escape(tag) + ">"
res = re.findall(reg_str, test_str)
 
# printing result
print("The Strings extracted : " + str(res))

Output:

The original string is : <b>Gfg</b> is <b>Best</b>. I love <b>Reading CS</b> from it.
The Strings extracted : ['Gfg', 'Best', 'Reading CS']

Time Complexity: O(N), where N is the length of the input string.

Auxiliary Space: O(N)
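
The regex approach can also be wrapped in a reusable function. The following is a minimal sketch, not a full HTML parser: the function name extract_between_tags is hypothetical, and the handling of attributes in the opening tag and of multi-line content are assumptions that go beyond the original example.

```python
import re

def extract_between_tags(text, tag):
    """Return the text found between each <tag>...</tag> pair (hypothetical helper)."""
    # re.escape guards against tag names that contain regex metacharacters.
    name = re.escape(tag)
    # (?:\s[^>]*)? optionally allows attributes, e.g. <b class="x">.
    # re.DOTALL lets '.' match newlines, so multi-line content is captured.
    pattern = re.compile(
        r"<" + name + r"(?:\s[^>]*)?>(.*?)</" + name + r">",
        re.DOTALL,
    )
    return pattern.findall(text)

test_str = '<b>Gfg</b> is <b>Best</b>. I love <b>Reading CS</b> from it.'
print(extract_between_tags(test_str, "b"))
```

Like the pattern above, this sketch does not handle nested tags of the same name; the non-greedy group simply stops at the first closing tag it meets.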

Method 2: Using string manipulation

1. Initialize a string named "test_str" with some HTML content.
2. Initialize a string named "tag" with the name of the tag whose content needs to be extracted.
3. Find the index of the first occurrence of the opening tag in "test_str" using the find() method and store it in a variable named "start_idx".
4. Initialize an empty list named "res" to store the extracted strings.
5. Use a while loop to extract the strings between the tags. The loop runs until there are no more occurrences of the opening tag.
6. Inside the loop, find the index of the closing tag using the find() method and store it in a variable named "end_idx". If the closing tag is not found, exit the loop.
7. Extract the string between the tags using string slicing, and append it to the "res" list.
8. Find the index of the next occurrence of the opening tag using the find() method and update the "start_idx" variable.
9. Repeat steps 6-8 until there are no more occurrences of the opening tag.
10. Print the extracted strings using the print() function. The list is converted to a string using the str() function before being printed.

Python3

# initializing string
test_str = '<b>Gfg</b> is <b>Best</b>. I love <b>Reading CS</b> from it.'
 
# initializing tag
tag = "b"
 
# finding the index of the first occurrence of the opening tag
start_idx = test_str.find("<" + tag + ">")
 
# initializing an empty list to store the extracted strings
res = []
 
# extracting the strings between the tags
while start_idx != -1:
    end_idx = test_str.find("</" + tag + ">", start_idx)
    if end_idx == -1:
        break
    res.append(test_str[start_idx+len(tag)+2:end_idx])
    start_idx = test_str.find("<" + tag + ">", end_idx)
 
# printing the extracted strings
print("The Strings extracted : " + str(res))

Output

The Strings extracted : ['Gfg', 'Best', 'Reading CS']

Time complexity: O(n), where n is the length of the input string.
Auxiliary space: O(m), where m is the number of occurrences of the tag in the input string.
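
The same find()-based scan can be written as a generator, which yields each string as soon as it is found instead of collecting them in a list first. This is a sketch under assumptions: the name iter_between_tags is hypothetical, and, like the loop above, it only recognizes opening tags without attributes.

```python
def iter_between_tags(text, tag):
    """Yield the text between each <tag>...</tag> pair, left to right (hypothetical helper)."""
    open_tag = "<" + tag + ">"
    close_tag = "</" + tag + ">"

    start = text.find(open_tag)
    while start != -1:
        content_start = start + len(open_tag)
        end = text.find(close_tag, content_start)
        if end == -1:
            break  # unmatched opening tag: stop scanning
        yield text[content_start:end]
        # resume the search just past the closing tag
        start = text.find(open_tag, end + len(close_tag))

test_str = '<b>Gfg</b> is <b>Best</b>. I love <b>Reading CS</b> from it.'
print("The Strings extracted : " + str(list(iter_between_tags(test_str, "b"))))
```

Building the opening and closing tag strings once, rather than on every iteration, also keeps the slicing arithmetic readable.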

Method 3: Using recursion

Algorithm:

1. Find the index of the first occurrence of the opening tag.
2. If no opening tag is found, return an empty list.
3. Extract the string between the opening and closing tags using the start index of the opening tag and the end index of the closing tag.
4. Recursively call the function with the remaining string after the current tag.
5. Return the list of extracted strings.

Python3

def extract_strings_recursive(test_str, tag):
    # finding the index of the first occurrence of the opening tag
    start_idx = test_str.find("<" + tag + ">")
 
    # base case
    if start_idx == -1:
        return []
 
    # extracting the string between the opening and closing tags
    end_idx = test_str.find("</" + tag + ">", start_idx)
    # an unmatched opening tag ends the search
    if end_idx == -1:
        return []
    res = [test_str[start_idx+len(tag)+2:end_idx]]
 
    # recursive call to extract strings after the current tag
    res += extract_strings_recursive(test_str[end_idx+len(tag)+3:], tag)
 
    return res
 
# example usage
test_str = '<b>Gfg</b> is <b>Best</b>. I love <b>Reading CS</b> from it.'
tag = "b"
# printing original string
print("The original string is : " + str(test_str))
  
res = extract_strings_recursive(test_str, tag)
print("The Strings extracted : " + str(res))
#This code is contributed by Jyothi Pinjala.

Output

The original string is : <b>Gfg</b> is <b>Best</b>. I love <b>Reading CS</b> from it.
The Strings extracted : ['Gfg', 'Best', 'Reading CS']

Time Complexity:
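The time complexity of this algorithm is O(n·m) in the worst case, where n is the length of the input string and m is the number of tag pairs. Each find() call only scans forward, but every recursive call first copies the remaining string with a slice, which costs time proportional to its length. When the string contains only a few tags, the running time is close to O(n).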

Auxiliary Space:
The auxiliary space is O(n·m) in the worst case: the recursion depth equals the number of tag pairs m, and each active call holds its own sliced copy of the remaining string. The result list adds O(m) entries. Because the depth grows with m, an input with more tag pairs than Python's recursion limit (1000 by default) raises RecursionError.
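
Slicing a string in Python creates a copy, so one way to reduce the cost of the recursive version is to pass a start offset instead of a sliced string. The sketch below illustrates that variant; the function name extract_from and its pos parameter are hypothetical and not part of the original code.

```python
def extract_from(text, tag, pos=0):
    """Recursively collect the text between <tag>...</tag> pairs, starting at index pos."""
    open_tag = "<" + tag + ">"
    close_tag = "</" + tag + ">"

    # base case: no further opening tag at or after pos
    start = text.find(open_tag, pos)
    if start == -1:
        return []

    content_start = start + len(open_tag)
    end = text.find(close_tag, content_start)
    if end == -1:
        return []  # unmatched opening tag: nothing more to extract

    # recurse on the same string, continuing just past the closing tag
    return [text[content_start:end]] + extract_from(text, tag, end + len(close_tag))

test_str = '<b>Gfg</b> is <b>Best</b>. I love <b>Reading CS</b> from it.'
print("The Strings extracted : " + str(extract_from(test_str, "b")))
```

The recursion depth still equals the number of tag pairs, so for inputs with very many tags the iterative approach of Method 2 remains the safer choice.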
