HTML Parser in C/C++

An HTML parser is a program that extracts the useful text from HTML, leaving the tags (such as <h1>, <span>, and <p>) behind.

Examples:

Input: <h1>Geeks for Geeks</h1>
Output: Geeks for Geeks
Explanation: <h1> and </h1> are the opening and closing heading tags, so they are stripped, leaving “Geeks for Geeks” as the output.

Input: <p> Geeks for Geeks</p>
Output: Geeks for Geeks
Explanation: <p> and </p> are the opening and closing paragraph tags, so they are stripped. The parser also skips the leading space, leaving “Geeks for Geeks” as the output.

Approach: Let the input string be S of length N. Follow the steps below to solve the problem:

Declare two variables, start and end, to hold the indices of the first and last characters of the statement.
Traverse S using the variable i; when S[i] is equal to '>', set start to i+1 and break out of the loop.
Skip the leading blank spaces: while S[start] is equal to ' ', increment start by 1.
Traverse S again from start using the variable i; when S[i] is equal to '<', set end to i-1 and break out of the loop.
Print the characters of S in the range [start, end].
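
The steps above can also be sketched with strchr from the C standard library, writing the result into a caller-supplied buffer instead of printing it. This is only an illustrative variant, not the implementation that follows: the function name extract_text and its parameters out and cap are hypothetical, and it makes two assumptions the steps do not state, namely that an input without '>' yields an empty result and that a missing '<' means the text runs to the end of the string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the article): copies the text between the
   first '>' and the next '<' of html into out. cap is the size of out,
   including the terminating '\0'. Returns the number of characters written. */
size_t extract_text(const char *html, char *out, size_t cap)
{
    assert(html != NULL && out != NULL);
    if (cap == 0)
        return 0;
    out[0] = '\0';

    /* Find the end of the opening tag. */
    const char *start = strchr(html, '>');
    if (start == NULL)
        return 0;               /* assumption: no tag means nothing to extract */
    start++;

    /* Skip the leading blank spaces. */
    while (*start == ' ')
        start++;

    /* Find the start of the next tag. */
    const char *end = strchr(start, '<');
    if (end == NULL)
        end = start + strlen(start);   /* assumption: text runs to the end */

    /* Copy the half-open range [start, end), truncating if out is too small. */
    size_t len = (size_t)(end - start);
    if (len >= cap)
        len = cap - 1;
    memcpy(out, start, len);
    out[len] = '\0';
    return len;
}
```

With a 64-byte buffer buf, calling extract_text("<p> Geeks for Geeks</p>", buf, sizeof buf) leaves Geeks for Geeks in buf. Returning the text rather than printing it makes the function easier to test and reuse.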

Below is the implementation of the above approach in C:

// C program for the above approach
#include <stdio.h>
#include <string.h>

// Function to parse the HTML code
void parser(char* S)
{
    // Store the length of the input string
    int n = strlen(S);
    int start = 0, end = 0;
    int i, j;

    // Traverse the string
    for (i = 0; i < n; i++) {
        // If S[i] is '>', update start to i+1 and break
        if (S[i] == '>') {
            start = i + 1;
            break;
        }
    }

    // Skip the leading blank spaces
    while (S[start] == ' ') {
        start++;
    }

    // Traverse the string from start
    for (i = start; i < n; i++) {
        // If S[i] is '<', update end to i-1 and break
        if (S[i] == '<') {
            end = i - 1;
            break;
        }
    }

    // Print the characters in the range [start, end]
    for (j = start; j <= end; j++) {
        printf("%c", S[j]);
    }
    printf("\n");
}

// Driver Code
int main(void)
{
    // Given Input
    char input1[] = "<h1>This is a statement</h1>";
    char input2[] = "<h1>         This is a statement with some spaces</h1>";
    char input3[] = "<p> This is a statement with some @ #$ ., / special characters</p>         ";

    printf("Parsed Statements:\n");

    // Function Call
    parser(input1);
    parser(input2);
    parser(input3);

    return 0;
}

Output:

Parsed Statements:
This is a statement
This is a statement with some spaces
This is a statement with some @ #$ ., / special characters

Below is the implementation of the above approach in C++:

// C++ program for the above approach
#include <cstring>
#include <iostream>

using namespace std;

// Function to parse the HTML code
void parser(char* S)
{
    // Store the length of the input string
    int n = strlen(S);
    int start = 0, end = 0;

    // Traverse the string
    for (int i = 0; i < n; i++) {
        // If S[i] is '>', update start to i+1 and break
        if (S[i] == '>') {
            start = i + 1;
            break;
        }
    }

    // Skip the leading blank spaces
    while (S[start] == ' ') {
        start++;
    }

    // Traverse the string from start
    for (int i = start; i < n; i++) {
        // If S[i] is '<', update end to i-1 and break
        if (S[i] == '<') {
            end = i - 1;
            break;
        }
    }

    // Print the characters in the range [start, end]
    for (int j = start; j <= end; j++) {
        cout << S[j];
    }
    cout << endl;
}

// Driver Code
int main()
{
    // Given Input
    char input1[] = "<h1>This is a statement</h1>";
    char input2[] = "<h1>         This is a statement with  some spaces</h1>";
    char input3[] = "<p> This is a statement with some @ #$ ., / special characters</p>         ";

    cout << "Parsed Statements:\n";

    // Function Call
    parser(input1);
    parser(input2);
    parser(input3);

    return 0;
}

Output:

Parsed Statements:
This is a statement
This is a statement with  some spaces
This is a statement with some @ #$ ., / special characters

Time Complexity: O(N)
Auxiliary Space: O(1)

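Note: This program extracts only one statement per call: the text between the first opening tag and the next tag. Nested tags and inputs containing several elements are not handled.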
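
To illustrate what handling more than one statement might look like, here is a minimal sketch that drops every tag in a single pass instead of stopping at the first one. It is not part of the approach above: the function name strip_tags and its parameters out and cap are hypothetical, and the sketch assumes that every '<' opens a tag and every '>' closes one, so comments, scripts, character entities, and attribute values containing '>' are not handled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the article): copies html into out while
   dropping everything from each '<' up to the matching '>'. cap is the size
   of out, including the terminating '\0'. Returns the number of characters
   written. */
size_t strip_tags(const char *html, char *out, size_t cap)
{
    assert(html != NULL && out != NULL);
    if (cap == 0)
        return 0;

    size_t len = 0;
    int in_tag = 0;                 /* 1 while between '<' and '>' */

    /* Stop at the end of the input or when out has room only for '\0'. */
    for (const char *p = html; *p != '\0' && len + 1 < cap; p++) {
        if (*p == '<')
            in_tag = 1;             /* assumption: every '<' opens a tag */
        else if (*p == '>')
            in_tag = 0;             /* a stray '>' outside a tag is dropped too */
        else if (!in_tag)
            out[len++] = *p;        /* ordinary text: keep it */
    }
    out[len] = '\0';
    return len;
}
```

For the input <h1>Title</h1><p>Hello <b>world</b></p> this sketch produces TitleHello world. No separator is inserted between elements, and leading spaces are kept; both would need extra handling in a real tool.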
