
HTML Parser in C/C++

In this article, an HTML parser is a program that extracts the readable text from a piece of HTML, leaving the tags (such as <h1>, <span>, <p>) behind.

Examples:



Input: <h1>Geeks for Geeks</h1>
Output: Geeks for Geeks 
Explanation: <h1> and </h1> are the opening and closing heading tags, so they are stripped, leaving “Geeks for Geeks” as the output.

Input: <p>    Geeks for Geeks</p>
Output: Geeks for Geeks
Explanation: <p> and </p> are the opening and closing paragraph tags, so they are stripped. The parser also skips the leading space characters, leaving “Geeks for Geeks” as the output.



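Approach: Let the input string be S of size N. Follow the steps below to solve the problem:

1. Traverse S from the left until the first '>' is found, and set start to the index just after it.
2. Advance start past any space characters.
3. Traverse S from start until the first '<' is found, and set end to the index just before it.
4. Print the characters of S in the range [start, end].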

Below is the implementation of the above approach in C language:




// C program for the above approach
  
#include <stdio.h>
#include <string.h>
  
// Function to parse the HTML code
void parser(const char* S)
{
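    // Store the length of the
    // input string
    int n = (int)strlen(S);
  
    // end defaults to the last index, so input with no
    // closing tag is printed up to the end of the string
    int start = 0, end = n - 1;
    int i, j;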
  
    // Traverse the string
    for (i = 0; i < n; i++) {
        // If S[i] is '>', update
        // start to i+1 and break
        if (S[i] == '>') {
            start = i + 1;
            break;
        }
    }
  
    // Skip the leading blank spaces
    while (S[start] == ' ') {
        start++;
    }
  
    // Traverse the string
    for (i = start; i < n; i++) {
        // If S[i] is '<', update
        // end to i-1 and break
        if (S[i] == '<') {
            end = i - 1;
            break;
        }
    }
  
    // Print the characters in the
    // range [start, end]
    for (j = start; j <= end; j++) {
        printf("%c", S[j]);
    }
  
    printf("\n");
}
  
// Driver Code
int main()
{
    // Given Input
    char input1[] = "<h1>This is a statement</h1>";
    char input2[] = "<h1>         This is a statement with some spaces</h1>";
    char input3[] = "<p> This is a statement with some @ #$ ., / special characters</p>         ";
  
    printf("Parsed Statements:\n");
  
    // Function Call
    parser(input1);
    parser(input2);
    parser(input3);
  
    return 0;
}

Output:
Parsed Statements:
This is a statement
This is a statement with some spaces
This is a statement with some @ #$ ., / special characters

Below is the implementation of the above approach in C++ language:




// C++ program for the
// above approach
#include <cstring>
#include <iostream>
using namespace std;
  
// Function to parse the
// HTML code
void parser(const char* S)
{
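    // Store the length of the
    // input string
    int n = (int)strlen(S);
  
    // end defaults to the last index, so input with no
    // closing tag is printed up to the end of the string
    int start = 0, end = n - 1;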
  
    // Traverse the string
    for (int i = 0; i < n; i++) {
        // If S[i] is '>', update
        // start to i+1 and break
        if (S[i] == '>') {
            start = i + 1;
            break;
        }
    }
  
    // Skip the leading blank spaces
    while (S[start] == ' ') {
        start++;
    }
  
    // Traverse the string
    for (int i = start; i < n; i++) {
        // If S[i] is '<', update
        // end to i-1 and break
        if (S[i] == '<') {
            end = i - 1;
            break;
        }
    }
  
    // Print the characters in the
    // range [start, end]
    for (int j = start; j <= end; j++) {
        cout << S[j];
    }
  
    cout << endl;
}
  
// Driver Code
int main()
{
    // Given Input
    char input1[] = "<h1>This is a statement</h1>";
    char input2[] = "<h1>         This is a statement with  some spaces</h1>";
    char input3[] = "<p> This is a statement with some @ #$ ., / special characters</p>         ";
  
    cout << "Parsed Statements:\n";
  
    // Function Call
    parser(input1);
    parser(input2);
    parser(input3);
    return 0;
}

Output:
Parsed Statements:
This is a statement
This is a statement with  some spaces
This is a statement with some @ #$ ., / special characters

Time Complexity: O(N)
Auxiliary Space: O(1)

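Note: This program parses only one statement at a time. It prints the text between the first '>' and the next '<', so nested tags or several elements in one string are not handled.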
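
The limitation in the note above can be relaxed by scanning the whole string and copying every character that lies outside a tag. The following is only a minimal sketch of that idea, not part of the program above: the name strip_tags and its buffer-based signature are hypothetical, and it assumes that tags never contain a literal '>' inside an attribute value and that comments, scripts, and entities such as &amp; need no special handling.

```c
#include <stddef.h>

// Hypothetical helper (an assumption, not from the original program):
// copies S into out, skipping everything from a '<' up to the next '>'.
// Writes at most out_size - 1 characters plus a terminating '\0'
// and returns the number of characters written.
size_t strip_tags(const char* S, char* out, size_t out_size)
{
    size_t k = 0;   // Next free position in out
    int in_tag = 0; // 1 while the scan is between '<' and '>'

    if (out_size == 0) {
        return 0;
    }

    // Stop at the end of S or when out has room only for '\0'
    for (size_t i = 0; S[i] != '\0' && k + 1 < out_size; i++) {
        if (S[i] == '<') {
            // A tag starts here
            in_tag = 1;
        }
        else if (in_tag) {
            // Inside a tag: drop the character, and leave
            // the tag once '>' is reached
            if (S[i] == '>') {
                in_tag = 0;
            }
        }
        else {
            // Outside any tag: keep the character
            out[k++] = S[i];
        }
    }

    out[k] = '\0';
    return k;
}
```

Called as strip_tags("<h1>Geeks</h1> <p>for Geeks</p>", out, sizeof out) with a sufficiently large char array out, this sketch leaves "Geeks for Geeks" in out. Unlike parser, it does not skip leading spaces, and it writes into a caller-supplied buffer instead of printing, which keeps the auxiliary space at O(1) beyond that buffer and the running time at O(N).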

