
Dictionary Based Tokenization in NLP

Last Updated : 04 Jun, 2023

Natural Language Processing (NLP) is a subfield of artificial intelligence that aims to enable computers to process, understand, and generate human language. One of the critical tasks in NLP is tokenization, which is the process of splitting text into smaller meaningful units, known as tokens. Dictionary-based tokenization is a common method used in NLP to segment text into tokens based on a pre-defined dictionary.

Dictionary-based tokenization splits a text into tokens using a predefined dictionary of multi-word expressions. It is useful when standard word tokenization is not sufficient for an application, such as sentiment analysis or named entity recognition, where a multi-word expression needs to be treated as a single token.

In this context, a dictionary is a list of words, phrases, and other linguistic constructions, optionally stored with related data such as definitions or parts of speech. During tokenization, the words in the text are compared with the dictionary entries, and each sequence of words that matches an entry is emitted as one token. By creating a custom dictionary, we can keep names and phrases together as single tokens.
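The matching idea can be sketched in plain Python. This is only an illustration under simple assumptions (the text is already split into words, matching is exact and case-sensitive, and the longest phrase wins); the helper name `merge_phrases` is hypothetical and is not part of any library.

```python
def merge_phrases(tokens, phrases):
    """Greedy longest-match merge of known multi-word phrases."""
    # Store phrases as tuples of words so they can be compared to token slices.
    phrase_set = {tuple(p.split()) for p in phrases}
    max_len = max((len(p) for p in phrase_set), default=1)

    merged, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, shrinking down to two words.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in phrase_set:
                merged.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No phrase starts here, so keep the word as its own token.
            merged.append(tokens[i])
            i += 1
    return merged


tokens = "He is from Himachal Pradesh".split()
print(merge_phrases(tokens, ["Himachal Pradesh", "Jammu Kashmir"]))
# ['He', 'is', 'from', 'Himachal Pradesh']
```

Trying the longest phrase first means that a longer entry is preferred over a shorter entry that starts at the same word.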

A token in natural language processing is a sequence of characters that is treated as a single unit of meaning. Words, phrases, numbers, and punctuation marks can all be tokens. Several NLP tasks, including text classification, sentiment analysis, machine translation, and named entity recognition, depend on the tokenization process.

Dictionary lookup operates on text that has already been split into words, and that initial split can be done with several methods, including rule-based, machine learning-based, and hybrid tokenization. Rule-based tokenization divides the text into tokens according to surface characteristics such as punctuation, capitalization, and spacing. Machine learning-based tokenization trains a model on a set of training data to separate text into tokens. Hybrid tokenization combines rule-based and machine learning-based methods to increase accuracy and efficiency.
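As a rough illustration of the rule-based approach, the sketch below uses a single regular expression rule. The function name `rule_based_tokenize` and the rule itself are assumptions chosen for the example; real tokenizers apply many more rules.

```python
import re


def rule_based_tokenize(text):
    # Rule: a token is either a run of word characters or a single
    # non-space, non-word character such as a punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


print(rule_based_tokenize("He is from Himachal Pradesh."))
# ['He', 'is', 'from', 'Himachal', 'Pradesh', '.']
```

Note that this rule alone still splits "Himachal Pradesh" into two tokens, which is the gap a dictionary of multi-word expressions is meant to fill.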

Steps needed for implementing Dictionary-based tokenization:

  • Step 1: Collect a dictionary of words and phrases, optionally with their corresponding parts of speech. The dictionary can be created manually or obtained from a pre-existing source such as WordNet or Wikipedia.
  • Step 2: Preprocess the text by removing any noise that the application does not need, such as HTML tags, stop words, or punctuation marks.
  • Step 3: Tokenize the text into words using a whitespace tokenizer or a word tokenizer.
  • Step 4: If the dictionary stores parts of speech, identify the part of speech of each word in the text using a part-of-speech tagger such as the Stanford POS Tagger.
  • Step 5: Segment the text into tokens by comparing the words in the text with the entries in the dictionary. If a sequence of words matches an entry, that entry is used as a single token. Otherwise, the word is kept as a separate token.
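The preparation steps above (collecting the dictionary, removing noise, and splitting into words) can be sketched as follows. The helper names `strip_html` and `build_dictionary` are hypothetical, and the regular expression is a deliberately crude stand-in for a real HTML parser.

```python
import re


def strip_html(text):
    # Crude tag removal for illustration; real HTML needs a proper parser.
    return re.sub(r"<[^>]+>", " ", text)


def build_dictionary(phrases):
    # Turn "Himachal Pradesh" into ("Himachal", "Pradesh"), skipping single words.
    return [tuple(p.split()) for p in phrases if len(p.split()) > 1]


raw = "<p>He is from <b>Himachal Pradesh</b></p>"
words = strip_html(raw).split()  # whitespace tokenization
print(words)
# ['He', 'is', 'from', 'Himachal', 'Pradesh']
print(build_dictionary(["Himachal Pradesh", "India"]))
# [('Himachal', 'Pradesh')]
```

Storing each phrase as a tuple of words matches the input format used by the NLTK example later in this article.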

For example, consider the following sentences:

Jammu Kashmir is an integral part of India.
My name is Pawan Kumar Gunjan.
He is from Himachal Pradesh.

The steps involved in the dictionary-based tokenization of these sentences with NLTK are as follows:

Step 1: Import the necessary libraries

Python3




from nltk import word_tokenize
from nltk.tokenize import MWETokenizer


Step 2: Create a custom dictionary of names and phrases

Collect the multi-word names and phrases that should be kept together. Each entry is a tuple of the words that make up the phrase. Let the dictionary contain the following names and phrases.

Python3




dictionary = [("Jammu", "Kashmir"), 
              ("Pawan", "Kumar", "Gunjan"), 
              ("Himachal", "Pradesh")]


Step 3: Create an instance of MWETokenizer with the dictionary

Python3




Dictionary_tokenizer = MWETokenizer(dictionary, separator=' ')


Step 4: Create a text dataset and tokenize with word_tokenize

Python3




text = """
Jammu Kashmir is an integral part of India.
My name is Pawan Kumar Gunjan.
He is from Himachal Pradesh.
"""
tokens = word_tokenize(text)
tokens


Output:

['Jammu',
 'Kashmir',
 'is',
 'an',
 'integral',
 'part',
 'of',
 'India',
 '.',
 'My',
 'name',
 'is',
 'Pawan',
 'Kumar',
 'Gunjan',
 '.',
 'He',
 'is',
 'from',
 'Himachal',
 'Pradesh',
 '.']

Step 5: Apply dictionary-based tokenization with Dictionary_tokenizer

Python3




dictionary_based_token = Dictionary_tokenizer.tokenize(tokens)
dictionary_based_token


Output:

['Jammu Kashmir',
 'is',
 'an',
 'integral',
 'part',
 'of',
 'India',
 '.',
 'My',
 'name',
 'is',
 'Pawan Kumar Gunjan',
 '.',
 'He',
 'is',
 'from',
 'Himachal Pradesh',
 '.']

We can easily observe the difference between general word tokenization and dictionary-based tokenization: each name or phrase listed in the dictionary now appears as a single token. This is useful when we know the phrases or joint words present in the text document and want to treat them as single tokens.
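Two details of `MWETokenizer` are worth a small sketch: entries can be added after construction with `add_mwe`, and when no `separator` is passed the matched words are joined with an underscore. The example assumes NLTK is installed and uses `str.split()` instead of `word_tokenize` only so that it runs without downloading extra NLTK data.

```python
from nltk.tokenize import MWETokenizer

# No separator argument: matched words are joined with the default "_".
tokenizer = MWETokenizer([("Himachal", "Pradesh")])
tokenizer.add_mwe(("Jammu", "Kashmir"))  # extend the dictionary later

print(tokenizer.tokenize("He is from Himachal Pradesh".split()))
# ['He', 'is', 'from', 'Himachal_Pradesh']

# Matching is exact, so a lowercase variant is left as two tokens.
print(tokenizer.tokenize("he is from himachal pradesh".split()))
# ['he', 'is', 'from', 'himachal', 'pradesh']
```

Because matching is case-sensitive, one option is to normalize the case of both the dictionary and the text before tokenizing, if the application allows it.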

Full code implementation

Python3




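# import the necessary libraries
# (word_tokenize needs NLTK's Punkt data; if it is missing, run
#  nltk.download('punkt') once, or 'punkt_tab' on newer NLTK releases)
from nltk import word_tokenize
from nltk.tokenize import MWETokenizer
  
# custom dictionary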
dictionary = [("Jammu", "Kashmir"), 
              ("Pawan", "Kumar", "Gunjan"), 
              ("Himachal", "Pradesh")]
  
# Create an instance of MWETokenizer with the dictionary
Dictionary_tokenizer = MWETokenizer(dictionary, separator=' ')
  
# Text
text = """
Jammu Kashmir is an integral part of India.
My name is Pawan Kumar Gunjan.
He is from Himachal Pradesh.
"""
  
tokens = word_tokenize(text)
print('General Word Tokenization \n', tokens)
  
dictionary_based_token = Dictionary_tokenizer.tokenize(tokens)
print('Dictionary based tokenization \n', dictionary_based_token)


Output:

General Word Tokenization 
 ['Jammu', 'Kashmir', 'is', 'an', 'integral', 'part', 'of', 'India', '.', 'My', 'name', 'is', 'Pawan', 'Kumar', 'Gunjan', '.', 'He', 'is', 'from', 'Himachal', 'Pradesh', '.']
Dictionary based tokenization 
 ['Jammu Kashmir', 'is', 'an', 'integral', 'part', 'of', 'India', '.', 'My', 'name', 'is', 'Pawan Kumar Gunjan', '.', 'He', 'is', 'from', 'Himachal Pradesh', '.']

