
Python program to extract Strings between HTML Tags


Given a string and the name of an HTML tag, extract every substring enclosed between the opening and closing forms of that tag.

Input : '<b>Gfg</b> is <b>Best</b>. I love <b>Reading CS</b> from it.', tag = "b"
Output : ['Gfg', 'Best', 'Reading CS']
Explanation : All strings between "b" tags are extracted.

Input : '<h1>Gfg</h1> is <h1>Best</h1> I love <h1>Reading CS</h1>', tag = "h1"
Output : ['Gfg', 'Best', 'Reading CS']
Explanation : All strings between "h1" tags are extracted.

Method 1: Using the re module

This task can be performed with the re module. The findall() function returns every substring captured by a regex built from the tag name: the opening tag, a non-greedy group (.*?) that captures the content, and the closing tag.

Python3




# importing re module
import re
 
# initializing string
test_str = '<b>Gfg</b> is <b>Best</b>. I love <b>Reading CS</b> from it.'
 
# printing original string
print("The original string is : " + str(test_str))
 
# initializing tag
tag = "b"
 
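# regex to extract required strings; re.escape() guards against
# tag names that contain regex metacharacters, and the non-greedy
# (.*?) stops at the first closing tag instead of the last one
reg_str = "<" + re.escape(tag) + ">(.*?)</" + re.escape(tag) + ">"
res = re.findall(reg_str, test_str)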
 
# printing result
print("The Strings extracted : " + str(res))


Output:

The original string is : <b>Gfg</b> is <b>Best</b>. I love <b>Reading CS</b> from it.
The Strings extracted : ['Gfg', 'Best', 'Reading CS']

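Time Complexity: O(N) for well-formed input, where N is the length of the input string. If the string contains many opening tags that are never closed, the non-greedy match rescans the rest of the string for each of them, so the worst case is O(N^2).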

Auxiliary Space: O(N)
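
The regex approach above can be wrapped in a reusable function. The following is a minimal sketch, not a definitive implementation: the function name extract_between_tags is hypothetical, and it assumes the tags carry no attributes (a tag such as <b class="x"> would not match). It adds re.DOTALL so that content spanning several lines is also captured, which the plain pattern would miss because "." does not match a newline by default.

```python
import re

def extract_between_tags(text, tag):
    """Return the text between each <tag>...</tag> pair (hypothetical helper)."""
    # re.escape() guards against tag names containing regex metacharacters.
    # re.DOTALL lets '.' match newlines, so multi-line content is captured.
    pattern = re.compile(f"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>", re.DOTALL)
    return pattern.findall(text)

print(extract_between_tags("<b>Gfg</b> is <b>Best</b>", "b"))
print(extract_between_tags("<h1>Reading\nCS</h1>", "h1"))
```

Compiling the pattern once inside the function keeps the call site simple; for repeated calls with the same tag, the compiled pattern could be cached by the caller instead.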

Method 2: Using string manipulation

  1. Initialize a string named “test_str” with some HTML content.
  2. Initialize a string named “tag” with the name of the tag whose content needs to be extracted.
  3. Find the index of the first occurrence of the opening tag in the “test_str” using the “find()” method and store it in a variable named “start_idx”.
  4. Initialize an empty list named “res” to store the extracted strings.
  5. Use a while loop to extract the strings between the tags. The loop will run until there are no more occurrences of the opening tag.
  6. Inside the loop, find the index of the closing tag using the “find()” method and store it in a variable named “end_idx”. If the closing tag is not found, exit the loop.
  7. Extract the string between the tags using string slicing, and append it to the “res” list.
  8. Find the index of the next occurrence of the opening tag using the “find()” method and update the “start_idx” variable.
  9. Repeat steps 6-8 until there are no more occurrences of the opening tag.
  10. Print the extracted strings using the “print()” function. The “res” list is converted to a string using the “str()” function before being concatenated with the message.

Python3




# initializing string
test_str = '<b>Gfg</b> is <b>Best</b>. I love <b>Reading CS</b> from it.'
 
# initializing tag
tag = "b"
 
# finding the index of the first occurrence of the opening tag
start_idx = test_str.find("<" + tag + ">")
 
# initializing an empty list to store the extracted strings
res = []
 
# extracting the strings between the tags
while start_idx != -1:
    end_idx = test_str.find("</" + tag + ">", start_idx)
    if end_idx == -1:
        break
    res.append(test_str[start_idx+len(tag)+2:end_idx])
    start_idx = test_str.find("<" + tag + ">", end_idx)
 
# printing the extracted strings
print("The Strings extracted : " + str(res))


Output

The Strings extracted : ['Gfg', 'Best', 'Reading CS']

Time complexity: O(n), where n is the length of the input string.
Auxiliary space: O(m) list entries, where m is the number of occurrences of the tag in the input string. The extracted substrings themselves can together take up to O(n) characters.
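
The steps of Method 2 can also be packaged as a function so they can be reused with other tags, such as the "h1" example from the introduction. This is a sketch under stated assumptions: the name extract_with_find is hypothetical, and, like the listing above, it only recognizes tags without attributes.

```python
def extract_with_find(text, tag):
    """Collect the text between <tag> and </tag> pairs using str.find (hypothetical helper)."""
    open_tag, close_tag = f"<{tag}>", f"</{tag}>"
    res = []
    start_idx = text.find(open_tag)
    while start_idx != -1:
        # the content begins right after the opening tag
        content_start = start_idx + len(open_tag)
        end_idx = text.find(close_tag, content_start)
        if end_idx == -1:
            break  # unmatched opening tag: stop without adding a partial result
        res.append(text[content_start:end_idx])
        # resume the search after the closing tag
        start_idx = text.find(open_tag, end_idx + len(close_tag))
    return res

print(extract_with_find('<h1>Gfg</h1> is <h1>Best</h1> I love <h1>Reading CS</h1>', 'h1'))
```

Using len(open_tag) and len(close_tag) instead of hand-computed offsets such as len(tag)+2 keeps the index arithmetic in one place.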

Method 3: Using recursion

Algorithm:

  1. Find the index of the first occurrence of the opening tag.
  2. If no opening tag is found, return an empty list.
  3. Extract the string between the opening and closing tags using the start index of the opening tag and the end index of the closing tag.
  4. Recursively call the function with the remaining string after the current tag.
  5. Return the list of extracted strings.

Python3




def extract_strings_recursive(test_str, tag):
    # finding the index of the first occurrence of the opening tag
    start_idx = test_str.find("<" + tag + ">")
 
    # base case
    if start_idx == -1:
        return []
 
    # locating the closing tag; stop if the tag is never closed
    end_idx = test_str.find("</" + tag + ">", start_idx)
    if end_idx == -1:
        return []
 
    # extracting the string between the opening and closing tags
    res = [test_str[start_idx+len(tag)+2:end_idx]]
 
    # recursive call to extract strings after the current tag
    res += extract_strings_recursive(test_str[end_idx+len(tag)+3:], tag)
 
    return res
 
# example usage
test_str = '<b>Gfg</b> is <b>Best</b>. I love <b>Reading CS</b> from it.'
tag = "b"
# printing original string
print("The original string is : " + str(test_str))
  
res = extract_strings_recursive(test_str, tag)
print("The Strings extracted : " + str(res))
#This code is contributed by Jyothi Pinjala.


Output

The original string is : <b>Gfg</b> is <b>Best</b>. I love <b>Reading CS</b> from it.
The Strings extracted : ['Gfg', 'Best', 'Reading CS']

Time Complexity:
Each call searches the remaining string with find() and then slices it, and both operations take time proportional to the length of what remains. With m matched tag pairs in a string of length n, the worst case is therefore O(n*m), which approaches O(n^2) when the string is densely packed with tags. For inputs with only a few tags the running time is close to O(n).

Auxiliary Space:
The recursion depth equals the number of matched tag pairs m, not the length of the string. Each pending call keeps its own slice of the remaining string alive, so the worst case is O(n*m) characters in addition to the O(n) result. Because Python limits recursion depth (1000 frames by default), an input with more tag pairs than that raises a RecursionError.
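
Each recursive call in Method 3 slices the remaining string, which copies it. One way to avoid the copies is to pass a starting position instead, as in the sketch below. The function name extract_recursive_from and its pos and res parameters are hypothetical, introduced only for illustration; the recursion depth is still one frame per matched tag pair, so Python's recursion limit applies here too.

```python
def extract_recursive_from(text, tag, pos=0, res=None):
    """Recursive extraction that passes an offset instead of slicing (hypothetical helper)."""
    if res is None:
        res = []
    open_tag, close_tag = f"<{tag}>", f"</{tag}>"

    # base case: no further opening tag at or after pos
    start_idx = text.find(open_tag, pos)
    if start_idx == -1:
        return res

    # base case: the opening tag is never closed
    content_start = start_idx + len(open_tag)
    end_idx = text.find(close_tag, content_start)
    if end_idx == -1:
        return res

    res.append(text[content_start:end_idx])
    # recurse on the same string, resuming after the closing tag
    return extract_recursive_from(text, tag, end_idx + len(close_tag), res)

test_str = '<b>Gfg</b> is <b>Best</b>. I love <b>Reading CS</b> from it.'
print(extract_recursive_from(test_str, "b"))
```

Because every call works on the same string object and appends to one shared list, the only per-call cost beyond the search itself is the stack frame.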



Last Updated : 17 May, 2023