Tokens, Patterns, and Lexemes

  • Difficulty Level : Easy
  • Last Updated : 04 Feb, 2022

A compiler is system software that translates a source program written in a high-level language into a low-level language. Compilation is divided into several phases to simplify the design and development of the compiler. The phases run in sequence, with the output of each phase serving as the input to the next. The phases are as follows:

  1. Lexical Analysis
  2. Syntax Analysis
  3. Semantic Analysis
  4. Intermediate Code Generation
  5. Code Optimization
  6. Storage Allocation
  7. Code Generation

Lexical Analysis Phase: The input to this phase is the source program, which is read from left to right, and the output is a sequence of tokens that is analyzed by the next phase, Syntax Analysis. While scanning the source code, the lexical analyzer discards white space (blank spaces, tabs, carriage returns, line feeds) and comments. In C, preprocessor directives and macros are also gone by the time the tokens reach the parser, although they are handled by the preprocessor rather than simply dropped. The lexical analyzer, or scanner, also helps in error detection: for example, invalid constants and illegal characters are reported in this phase. A misspelled keyword, by contrast, usually scans as an ordinary identifier, so that error tends to be reported by a later phase. Regular expressions are the standard notation for specifying the tokens of a programming language.
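
The removal of white space and comments described above can be sketched in C as follows. This is a minimal illustration under simplifying assumptions, not how any particular compiler does it: the function name skip_blanks_and_comments is hypothetical, and only white space, // line comments, and block comments are handled.

```c
#include <ctype.h>

/* Advance p past white space, // line comments, and block comments.
   Returns a pointer to the first character of the next lexeme, or to
   the terminating '\0' if nothing is left. */
const char *skip_blanks_and_comments(const char *p)
{
    for (;;) {
        if (isspace((unsigned char)*p)) {
            p++;                                  /* space, tab, CR, LF */
        } else if (p[0] == '/' && p[1] == '/') {
            while (*p != '\0' && *p != '\n')      /* line comment */
                p++;
        } else if (p[0] == '/' && p[1] == '*') {
            p += 2;                               /* block comment body */
            while (*p != '\0' && !(p[0] == '*' && p[1] == '/'))
                p++;
            if (*p != '\0')
                p += 2;                           /* closing delimiter */
        } else {
            return p;                             /* start of a lexeme */
        }
    }
}
```

A scanner would typically call such a helper before recognizing each token, so the parser never sees the discarded characters.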

Token

A token is a sequence of characters that is treated as a single unit because it cannot be broken down further. In a language like C, keywords (int, char, float, const, goto, continue, etc.), identifiers (user-defined names), operators (+, -, *, /), delimiters/punctuators such as the comma (,), semicolon (;), and braces ({ }), and strings can all be considered tokens. This phase recognizes three types of tokens: Terminal Symbols (TRM), which are keywords and operators; Literals (LIT); and Identifiers (IDN).
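
As a rough sketch, the three token types can be modeled in C with an enumeration and a small classifier. The names TokenClass and classify_lexeme are hypothetical, the keyword list is only a sample, and grouping punctuators with TRM is an assumption made here for simplicity.

```c
#include <ctype.h>
#include <string.h>

/* The three token types named above. */
typedef enum { TRM, LIT, IDN } TokenClass;

/* Classify one already-isolated lexeme. Only a short keyword list and
   the first character are inspected, so this is an illustration only. */
TokenClass classify_lexeme(const char *lexeme)
{
    static const char *keywords[] = {
        "int", "char", "float", "const", "goto", "continue", "return"
    };
    size_t n = sizeof keywords / sizeof keywords[0];

    for (size_t i = 0; i < n; i++)
        if (strcmp(lexeme, keywords[i]) == 0)
            return TRM;                           /* keyword */
    if (isdigit((unsigned char)lexeme[0]) || lexeme[0] == '"')
        return LIT;                               /* constant or string */
    if (isalpha((unsigned char)lexeme[0]) || lexeme[0] == '_')
        return IDN;                               /* user-defined name */
    return TRM;                  /* operator or punctuator (assumption) */
}
```

The keyword check comes first because a keyword such as int would otherwise look like an identifier.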

Let's see how to count the tokens in a piece of source code (C language):

Example 1:

int a = 10;   //Input Source code 

Tokens
int (keyword), a (identifier), = (operator), 10 (constant), and ; (punctuation: semicolon)

Answer – Total number of tokens = 5

Example 2:

int main() {

  // printf() sends the string inside quotation to
  // the standard output (the display)
  printf("Welcome to GeeksforGeeks!");
  return 0;
}
Tokens
'int', 'main', '(', ')', '{', 'printf', '(', ' "Welcome to GeeksforGeeks!" ', 
')', ';', 'return', '0', ';', '}'

Answer – Total number of tokens = 14

Lexeme

A lexeme is a sequence of characters in the source code that matches the pattern for a token and is therefore recognized by the lexical analyzer as an instance of that token.

Example:

main is a lexeme of type identifier (token)
(, ), {, } are lexemes of type punctuation (token)

Pattern

It specifies a set of rules that a scanner follows to create a token.

Example of Programming Language (C, C++): 

For a keyword to be identified as a valid token, the pattern is the sequence of characters that make the keyword.

For an identifier to be recognized as a valid token, the pattern is the predefined rule that it must start with a letter or an underscore, followed by any number of letters, digits, or underscores.
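
The identifier pattern can be checked with a few lines of C. This is a minimal sketch; the function name matches_identifier_pattern is hypothetical, and it accepts the underscore as well as letters, as C itself does. Note that keywords also match this pattern, so a real scanner would consult its keyword table separately.

```c
#include <ctype.h>
#include <stdbool.h>

/* Return true if s is a letter or underscore followed by any number of
   letters, digits, or underscores. The empty string does not match. */
bool matches_identifier_pattern(const char *s)
{
    if (!(isalpha((unsigned char)s[0]) || s[0] == '_'))
        return false;                     /* bad or missing first character */
    for (const char *p = s + 1; *p != '\0'; p++)
        if (!(isalnum((unsigned char)*p) || *p == '_'))
            return false;                 /* illegal character inside */
    return true;
}
```

For instance, main and _tmp1 match the pattern, while 9lives and a-b do not.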

Difference between Token, Lexeme, and Pattern

| Criteria | Token | Lexeme | Pattern |
| --- | --- | --- | --- |
| Definition | A token is a sequence of characters that is treated as a single unit because it cannot be broken down further. | A lexeme is a sequence of characters in the source code that matches the pattern for a token. | A pattern specifies the set of rules that a scanner follows to create a token. |
| Interpretation of type Keyword | all the reserved keywords of the language | int, goto | The sequence of characters that makes up the keyword. |
| Interpretation of type Identifier | name of a variable, function, etc. | main, a | It must start with a letter or an underscore, followed by letters, digits, or underscores. |
| Interpretation of type Operator | all the operators are considered tokens | +, = | +, = |
| Interpretation of type Punctuation | each kind of punctuation is considered a token, e.g. semicolon, bracket, comma, etc. | (, ), {, } | (, ), {, } |
| Interpretation of type Literal | a literal value, such as a string or boolean literal | "Welcome to GeeksforGeeks!" | any string of characters (except ") between " and " |

The output of Lexical Analysis Phase:

The output of the lexical analyzer is passed to the syntax analyzer as a sequence of tokens rather than a series of lexemes. During syntax analysis, the individual lexeme does not matter; what matters is the category or class to which it belongs.

Example:

z = x + y;
This statement has the below form for syntax analyzer
<id> = <id> + <id>;      //<id>- identifier (token)
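
The substitution shown in this example can be sketched in C. The function name to_token_form is hypothetical, and the sketch assumes the input contains no keywords, so every word is treated as an identifier.

```c
#include <ctype.h>
#include <string.h>

/* Copy src to out, replacing every identifier lexeme with its token
   class "<id>"; all other characters are copied unchanged. Copying
   stops early if out (capacity cap, at least 1) would overflow. */
void to_token_form(const char *src, char *out, size_t cap)
{
    const char *tag = "<id>";
    size_t tag_len = strlen(tag);
    size_t n = 0;

    out[0] = '\0';
    while (*src != '\0' && n + 1 < cap) {
        if (isalpha((unsigned char)*src) || *src == '_') {
            while (isalnum((unsigned char)*src) || *src == '_')
                src++;                        /* consume the whole lexeme */
            if (n + tag_len + 1 > cap)
                break;                        /* no room for the tag */
            memcpy(out + n, tag, tag_len);
            n += tag_len;
        } else {
            out[n++] = *src++;                /* operator, space, or ';' */
        }
        out[n] = '\0';
    }
}
```

Called on the statement z = x + y; with a sufficiently large buffer, it produces the form the syntax analyzer sees: every identifier replaced by its token class and the operators left as they are.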

Besides producing the series of tokens, the lexical analyzer also helps build the symbol table. Typically it enters the identifiers it encounters in the source code; white space and comments are discarded and never reach the table.
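
A symbol table can be sketched in C as a small fixed-size array. This is only an illustration under stated assumptions: the names intern_symbol, MAX_SYMBOLS, and MAX_NAME are hypothetical, the table stores identifier names only, and real compilers usually use hash tables and keep more information per entry.

```c
#include <string.h>

#define MAX_SYMBOLS 64   /* assumed capacity of the table */
#define MAX_NAME    32   /* assumed maximum name length, including '\0' */

static char symbols[MAX_SYMBOLS][MAX_NAME];
static int  symbol_count = 0;

/* Return the index of name in the table, inserting it first if it is
   new. Returns -1 if the table is full or the name is too long. */
int intern_symbol(const char *name)
{
    for (int i = 0; i < symbol_count; i++)
        if (strcmp(symbols[i], name) == 0)
            return i;                         /* already present */
    if (symbol_count == MAX_SYMBOLS || strlen(name) >= MAX_NAME)
        return -1;
    strcpy(symbols[symbol_count], name);      /* length checked above */
    return symbol_count++;
}
```

Each identifier token can then carry its table index instead of its text, so repeated uses of the same name refer to a single entry.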
