HTML | parsing and processing

Last Updated : 28 May, 2020

Before looking at the concept, let's go through the terminology. Parsing means dividing something into its components and describing their syntactic roles. Processing means handling something according to a standard procedure. Together, these two terms describe how an HTML parser generates a DOM tree from a text/html resource.

The parsing model defines the parsing rules for HTML documents, whether or not they are syntactically correct. At each point where the input fails to match the syntax, a parse error occurs. If, at the end of the procedure, a resource is determined to be in the HTML syntax, then it is an HTML document.

OVERVIEW OF THE PARSING MODEL

The input to the HTML parsing process is a stream of code points, which passes through a tokenization stage followed by a tree construction stage to produce a Document object as output. Most of the data handled by the tokenization stage comes from the network, but it can also come from a script running in the user agent, e.g. using the document.write() API. The tokenizer and the tree construction stage share a single set of states, but the tree construction stage is reentrant: while it is handling one token, the tokenizer may be resumed, so further tokens can be emitted and processed before the first token has been fully handled. To handle such cases, a parser has a script nesting level, which must initially be set to 0, and a parser pause flag, which must initially be set to false.
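
The re-entrancy described above can be illustrated with a toy model. This is only a sketch under loud assumptions: ToyParser, handle_token, write and the scripts table are hypothetical names invented for this example, tokens are plain strings, and write() merely stands in for what document.write() does in a real user agent. The parser pause flag is initialized but not otherwise used here.

```python
class ToyParser:
    """Hypothetical model of a re-entrant tree construction stage."""

    def __init__(self):
        self.script_nesting_level = 0   # must initially be 0
        self.parser_pause_flag = False  # must initially be false (unused in this sketch)
        self.scripts = {}               # assumption: token -> callable acting as a script
        self.log = []                   # records the order in which tokens are handled

    def write(self, tokens):
        # Stand-in for document.write(): new tokens are processed immediately,
        # even though an outer token is still being handled.
        for token in tokens:
            self.handle_token(token)

    def handle_token(self, token):
        self.log.append(f"begin {token}")
        script = self.scripts.get(token)
        if script is not None:
            self.script_nesting_level += 1
            script(self)                # may call self.write(), re-entering handle_token
            self.script_nesting_level -= 1
        self.log.append(f"end {token}")


parser = ToyParser()
parser.scripts["script-1"] = lambda p: p.write(["b"])
for token in ["a", "script-1", "c"]:
    parser.handle_token(token)
print(parser.log)
```

In the log, token "b" begins and ends between "begin script-1" and "end script-1", which is the situation the script nesting level exists to track.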

PARSE ERRORS: As mentioned earlier, a resource is checked against the HTML syntax while it is parsed, and each point where it does not match is a parse error. Parse errors concern only the syntax of an HTML document. The error handling for parse errors is well-defined, so a parser can recover from an error and still produce a document. In addition to checking for parse errors, conformance checkers also verify that documents meet the other conformance requirements. If one or more parse error conditions exist in a document, a conformance checker must report at least one of them, and it must not report any if none exist. It may report more than one parse error condition if it encounters more than one in the document.

UNDERSTANDING EACH LAYER

The input byte stream:
The stream of code points that forms the input to the tokenization stage is initially seen by the user agent as a stream of bytes, typically coming from the network or from a local file system. The bytes encode the actual characters according to a particular character encoding, which the user agent uses to decode the bytes into characters.

Given a character encoding, the bytes in the input byte stream must be converted to characters for the tokenizer's input stream. This is done by passing the input byte stream and the character encoding to the decode algorithm.

When the HTML parser is decoding an input byte stream, it uses a character encoding and a confidence that is either tentative, certain, or irrelevant. The encoding used, and the confidence in that encoding, are consulted during parsing to determine whether to change the encoding. If no encoding is necessary, e.g. because the parser is operating on a Unicode stream and doesn't have to use a character encoding at all, then the confidence is irrelevant.
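
The relationship between the encoding and its confidence can be sketched as follows. This is a simplified illustration, not the real encoding sniffing algorithm: decode_input and Decoded are hypothetical names, only a UTF-8 byte order mark is recognized, and the windows-1252 fallback is an assumption chosen for the example.

```python
import codecs
from typing import NamedTuple, Optional, Union


class Decoded(NamedTuple):
    text: str
    encoding: str
    confidence: str  # "tentative", "certain", or "irrelevant"


def decode_input(data: Union[str, bytes], declared_encoding: Optional[str] = None) -> Decoded:
    """Hypothetical helper: turn the parser's input into characters."""
    if isinstance(data, str):
        # Already a Unicode stream: no character encoding is needed at all.
        return Decoded(data, "", "irrelevant")
    if data.startswith(codecs.BOM_UTF8):
        # A byte order mark settles the question in this sketch.
        body = data[len(codecs.BOM_UTF8):]
        return Decoded(body.decode("utf-8", errors="replace"), "utf-8", "certain")
    if declared_encoding is not None:
        # Assumption: the caller learned the encoding from metadata such as
        # a Content-Type header.
        return Decoded(data.decode(declared_encoding, errors="replace"), declared_encoding, "certain")
    # No information: guess, and allow the parser to change its mind later.
    return Decoded(data.decode("windows-1252", errors="replace"), "windows-1252", "tentative")


print(decode_input(b"caf\xe9"))
print(decode_input("<p>already text</p>"))
```

A tentative result is the case in which the parser may later decide to change the encoding.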

Input stream preprocessor:
The input stream consists of the characters pushed into it as the input byte stream is decoded, or by the various APIs that directly manipulate the input stream. Before the tokenization stage, newlines in the input stream are normalized. The next input character is the first character in the input stream that has not yet been consumed (initially, the first character in the input), and the current input character is the last character to have been consumed. The insertion point is the position where content inserted using document.write() is actually inserted. The insertion point is not an absolute offset into the input stream; rather, it is relative to the position of the character immediately after it. Initially, the insertion point is undefined.
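
A minimal sketch of these ideas is shown below. InputStream and its members are hypothetical names for this example, and modelling the insertion point as a list index that is shifted after each insertion is just one assumed way to keep it attached to the character that follows it.

```python
def normalize_newlines(text: str) -> str:
    # Replace CRLF pairs first, then any remaining lone CR, with LF.
    return text.replace("\r\n", "\n").replace("\r", "\n")


class InputStream:
    """Hypothetical model of the preprocessed input stream."""

    def __init__(self, text: str):
        self.chars = list(normalize_newlines(text))
        self.pos = 0                 # index of the next input character
        self.insertion_point = None  # undefined initially

    @property
    def next_input_character(self):
        return self.chars[self.pos] if self.pos < len(self.chars) else None

    @property
    def current_input_character(self):
        return self.chars[self.pos - 1] if self.pos > 0 else None

    def consume(self):
        ch = self.next_input_character
        if ch is not None:
            self.pos += 1
        return ch

    def insert(self, text: str):
        if self.insertion_point is None:
            raise ValueError("insertion point is undefined")
        new_chars = list(normalize_newlines(text))
        i = self.insertion_point
        self.chars[i:i] = new_chars
        # Shift the index so the insertion point stays just before the
        # same character as before the insertion.
        self.insertion_point = i + len(new_chars)


stream = InputStream("a\r\nb")
stream.consume()                     # consumes "a"
stream.insertion_point = stream.pos  # just before the newline
stream.insert("XY")
print("".join(stream.chars))
```

After the insertion, the next input character is the first inserted character, while the insertion point still sits immediately before the newline it preceded earlier.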

Tokenization:
Implementations are expected to act as if they tokenize HTML using the state machine defined in the HTML specification. The state machine starts in the data state. Most states consume a single character and then either switch the state machine to a new state that re-consumes the current input character, switch it to a new state that consumes the next character, or stay in the same state to consume the next character. Some states have more complicated behavior and can consume several characters before switching to another state. In some cases, the tokenizer state is also changed by the tree construction stage.

The output of this step is a series of zero or more of the following tokens: DOCTYPE, start tag, end tag, comment, character, end-of-file. Creating a token and emitting it are distinct actions: a token can be created and filled in over several states before it is emitted. When a token is emitted, it must be handled immediately by the tree construction stage. The tree construction stage can affect the state of the tokenization stage and is even allowed to insert additional characters into the stream.
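
The behavior described above can be sketched with a toy tokenizer. It is a heavy simplification rather than the real state machine: it knows only four states, does not support attributes, comments or DOCTYPEs, and the function names tokenize and is_ascii_alpha are chosen for this example.

```python
EOF = None  # sentinel meaning "no more input"


def is_ascii_alpha(ch):
    return ch is not EOF and ch.isascii() and ch.isalpha()


def tokenize(text):
    """Toy tokenizer yielding (kind, value) tuples."""
    state = "data"  # the state machine starts in the data state
    pos = 0
    tag = None      # token that has been created but not yet emitted
    while True:
        ch = text[pos] if pos < len(text) else EOF
        if state == "data":
            if ch is EOF:
                yield ("end-of-file", None)
                return
            if ch == "<":
                state = "tag open"
            else:
                yield ("character", ch)
            pos += 1
        elif state == "tag open":
            if ch == "/":
                state = "end tag open"
                pos += 1
            elif is_ascii_alpha(ch):
                tag = ["start tag", ""]  # create the token; emit it later
                state = "tag name"       # re-consume ch in the new state
            else:
                yield ("character", "<")
                state = "data"           # re-consume ch in the data state
        elif state == "end tag open":
            if is_ascii_alpha(ch):
                tag = ["end tag", ""]
                state = "tag name"       # re-consume ch in the new state
            else:
                # Simplification: treat "</" as literal text.
                yield ("character", "<")
                yield ("character", "/")
                state = "data"
        elif state == "tag name":
            if ch is EOF:
                # The created tag token is never emitted.
                yield ("end-of-file", None)
                return
            if ch == ">":
                yield (tag[0], tag[1])   # emit the token
                state = "data"
            else:
                tag[1] += ch.lower()
            pos += 1


for token in tokenize("<p>Hi</p>"):
    print(token)
```

Transitions that leave pos unchanged re-consume the current input character, while those that increment pos move on to the next one. The tag token is created in one state and emitted only when ">" is seen.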

Tree construction:
The sequence of tokens from the tokenization stage forms the input to the tree construction stage. The tree construction stage is associated with a Document Object Model (DOM) Document object when the parser is created. The output of this stage consists of dynamically modifying or extending that document's DOM tree. As each token is emitted by the tokenizer, the user agent is expected to follow a defined algorithm to handle it.
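
As a rough illustration of how tokens can extend a tree one at a time, the sketch below builds a nested structure from (kind, value) tuples. It is an assumption-laden simplification: Node and construct_tree are hypothetical names, the token tuples are this example's own format, and the real algorithm's insertion modes, implied tags and void elements are all omitted. A stack of open elements tracks where new nodes are attached.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str  # element name, or "#document" / "#text"
    data: str = ""
    children: list = field(default_factory=list)


def construct_tree(tokens):
    """Toy tree construction: extend the document as each token arrives."""
    document = Node("#document")
    open_elements = [document]  # the last entry is the current node
    for kind, value in tokens:
        current = open_elements[-1]
        if kind == "start tag":
            element = Node(value)
            current.children.append(element)
            open_elements.append(element)
        elif kind == "end tag":
            # Close the nearest matching open element; ignore stray end tags.
            for i in range(len(open_elements) - 1, 0, -1):
                if open_elements[i].name == value:
                    del open_elements[i:]
                    break
        elif kind == "character":
            # Merge adjacent characters into a single text node.
            if current.children and current.children[-1].name == "#text":
                current.children[-1].data += value
            else:
                current.children.append(Node("#text", value))
        elif kind == "end-of-file":
            break
    return document


tokens = [
    ("start tag", "p"), ("character", "H"), ("character", "i"),
    ("start tag", "b"), ("character", "!"), ("end tag", "b"),
    ("end tag", "p"), ("end-of-file", None),
]
tree = construct_tree(tokens)
print(tree.children[0].name, [child.name for child in tree.children[0].children])
```

Each token modifies the tree immediately, which mirrors the requirement that an emitted token be handled before the tokenizer continues.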
