NLP | How tokenizing text, sentence, words works

Natural Language Processing (NLP) is a subfield of computer science, artificial intelligence, information engineering, and human-computer interaction. The field focuses on programming computers to process and analyze large amounts of natural language data. This is difficult because reading and understanding language is far more complex than it seems at first glance.

What is Tokenization?

Tokenization is the process of dividing a text into smaller units known as tokens. Tokens are typically words or subwords in the context of natural language processing (NLP) and computer science. Tokenization is a critical step in many NLP tasks, including text processing, language modelling, and machine translation.

In other words, tokenization splits a string of text into a list of tokens. One can think of tokens as parts of a whole: a word is a token in a sentence, and a sentence is a token in a paragraph. Key points of the article:

  • Tokenizing text into sentences
  • Tokenizing sentences into words
  • Tokenizing sentences using regular expressions

Types of Tokenization

Tokenization can be classified into several types based on how the text is segmented. Here are some types of tokenization:

Word Tokenization:

Word tokenization divides the text into individual words. Many NLP tasks use this approach, in which words are treated as the basic units of meaning.

Example:

Input: "Tokenization is an important NLP task."
Output: ["Tokenization", "is", "an", "important", "NLP", "task", "."]

Sentence Tokenization:

The text is segmented into sentences during sentence tokenization. This is useful for tasks requiring individual sentence analysis or processing.

Example:

Input: "Tokenization is an important NLP task. It helps break down text into smaller units."
Output: ["Tokenization is an important NLP task.", "It helps break down text into smaller units."]

Subword Tokenization:

Subword tokenization entails breaking down words into smaller units, which can be especially useful when dealing with morphologically rich languages or rare words.

Example:

Input: "tokenization"
Output: ["token", "ization"]

Character Tokenization:

This process divides the text into individual characters, which can be useful for character-level language modelling.

Example:

Input: "Tokenization"
Output: ["T", "o", "k", "e", "n", "i", "z", "a", "t", "i", "o", "n"]

Need for Tokenization

Tokenization is a crucial step in text processing and natural language processing (NLP) for several reasons.

  • Effective Text Processing: Tokenization reduces the size of raw text so that it can be handled more easily for processing and analysis.
  • Feature extraction: Text data can be represented numerically for algorithmic comprehension by using tokens as features in machine learning models (see the short sketch after this list).
  • Language Modelling: Tokenization in NLP facilitates the creation of organised representations of language, which is useful for tasks like text generation and language modelling.
  • Information Retrieval: Tokenization is essential for indexing and searching in systems that store and retrieve information efficiently based on words or phrases.
  • Text Analysis: Tokenization is used in many NLP tasks, including sentiment analysis and named entity recognition, to determine the function and context of individual words in a sentence.
  • Vocabulary Management: By generating a list of distinct tokens that stand in for words in the dataset, tokenization helps manage a corpus’s vocabulary.
  • Task-Specific Adaptation: Tokenization can be customised to meet the needs of particular NLP tasks, meaning that it will work best in applications such as summarization and machine translation.
  • Preprocessing Step: This essential preprocessing step transforms unprocessed text into a format appropriate for additional statistical and computational analysis.
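
As a small illustration of the feature-extraction point above (a sketch, not from the original article), tokenized sentences can be turned into simple bag-of-words count vectors:

Python3

from collections import Counter

sentences = [
    ["tokenization", "is", "an", "important", "nlp", "task"],
    ["it", "helps", "break", "down", "text", "into", "smaller", "units"],
]

# vocabulary: one entry per distinct token in the corpus
vocab = sorted({token for sentence in sentences for token in sentence})

# each sentence becomes a vector of token counts over that vocabulary
vectors = []
for sentence in sentences:
    counts = Counter(sentence)
    vectors.append([counts[token] for token in vocab])

print(vocab)
print(vectors)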

Code Implementation for Tokenization

Sentence Tokenization – Splitting a paragraph into sentences.

Python3
from nltk.tokenize import sent_tokenize

# nltk.download('punkt') may be needed once to fetch the pre-trained tokenizer data

text = "Hello everyone. Welcome to GeeksforGeeks. You are studying NLP article."
sent_tokenize(text)


Output: 

['Hello everyone.',
'Welcome to GeeksforGeeks.',
'You are studying NLP article.']

How does sent_tokenize work? The sent_tokenize function uses an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module. This instance is already trained, so it knows well which characters and punctuation mark the beginning and end of a sentence.

PunktSentenceTokenizer – When we have huge chunks of data, it is efficient to load the pre-trained tokenizer directly.

Python3
import nltk.data

# Loading PunktSentenceTokenizer using the English pickle file
tokenizer = nltk.data.load('tokenizers/punkt/PY3/english.pickle')

# 'text' is the same string defined in the previous snippet
tokenizer.tokenize(text)


Output: 

['Hello everyone.',
'Welcome to GeeksforGeeks.',
'You are studying NLP article.']

Tokenizing sentences in a different language – One can also tokenize sentences in other languages by loading a pickle file other than the English one.

Python3
import nltk.data
 
spanish_tokenizer = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle')
 
text = 'Hola amigo. Estoy bien.'
spanish_tokenizer.tokenize(text)


Output: 

['Hola amigo.', 
'Estoy bien.']

Word Tokenization – Splitting a sentence into words.

Python3
from nltk.tokenize import word_tokenize
 
text = "Hello everyone. Welcome to GeeksforGeeks."
word_tokenize(text)


Output: 

['Hello', 'everyone', '.', 'Welcome', 'to', 'GeeksforGeeks', '.']

How does word_tokenize work? The word_tokenize() function is a wrapper that calls tokenize() on an instance of the TreebankWordTokenizer class.

Using TreebankWordTokenizer 

Python3
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
# 'text' is the same string used with word_tokenize above
tokenizer.tokenize(text)


Output:

['Hello', 'everyone.', 'Welcome', 'to', 'GeeksforGeeks', '.']

These tokenizers work by separating words using punctuation and spaces. As the outputs above show, they do not discard the punctuation, which lets the user decide what to do with it during pre-processing.
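
If the punctuation tokens are not needed, they can be filtered out after tokenization. A minimal sketch (not from the original article) using Python's string.punctuation:

Python3

from nltk.tokenize import word_tokenize
import string

text = "Hello everyone. Welcome to GeeksforGeeks."
tokens = word_tokenize(text)

# keep only tokens that are not pure punctuation
[token for token in tokens if token not in string.punctuation]

Output:

['Hello', 'everyone', 'Welcome', 'to', 'GeeksforGeeks']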

PunktWordTokenizer – It splits on punctuation but keeps it attached to the word rather than producing separate punctuation tokens. (Note: newer NLTK releases may no longer expose this tokenizer.)

Python3
from nltk.tokenize import PunktWordTokenizer
 
tokenizer = PunktWordTokenizer()
tokenizer.tokenize("Let's see how it's working.")


Output:

['Let', "'s", 'see', 'how', 'it', "'s", 'working', '.']

WordPunctTokenizer – It separates the punctuation from the words. 

Python3
from nltk.tokenize import WordPunctTokenizer
 
tokenizer = WordPunctTokenizer()
tokenizer.tokenize("Let's see how it's working.")


Output:

['Let', "'", 's', 'see', 'how', 'it', "'", 's', 'working', '.']

Using Regular Expression (RegexpTokenizer)

Python3
from nltk.tokenize import RegexpTokenizer

# keep runs of word characters and apostrophes as tokens
tokenizer = RegexpTokenizer(r"[\w']+")
text = "Let's see how it's working."
tokenizer.tokenize(text)


Output: 

["Let's", 'see', 'how', "it's", 'working']

Using Regular Expression (regexp_tokenize)

Python3
from nltk.tokenize import regexp_tokenize

text = "Let's see how it's working."
# keep runs of word characters and apostrophes as tokens
regexp_tokenize(text, r"[\w']+")


Output: 

["Let's", 'see', 'how', "it's", 'working']

Frequently Asked Questions (FAQs)

Q. 1 What is Tokenization in NLP?

Tokenization is the process of converting a sequence of text into smaller parts known as tokens in the context of Natural Language Processing (NLP) and machine learning. These tokens can be as short as a character or as long as a sentence.

Q. 2 What is Lemmatization in NLP?

Lemmatization is a text pre-processing method that helps natural language processing (NLP) models find similarities by reducing a word to its most basic form. A lemmatization algorithm, for instance, would reduce the word "better" to its lemma, "good".
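
As an illustration (not part of the original article), NLTK's WordNetLemmatizer performs this kind of reduction when given the word's part of speech:

Python3

from nltk.stem import WordNetLemmatizer

# nltk.download('wordnet') may be needed once to fetch the WordNet data
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("better", pos="a"))    # good (as an adjective)
print(lemmatizer.lemmatize("studying", pos="v"))  # study (as a verb)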

Q. 3 Which are most common types of tokenization?

Common forms of tokenization include word tokenization, which divides text into words; sentence tokenization, which divides text into sentences; subword tokenization, which divides words into smaller units; and character tokenization, which divides text into individual characters.



Last Updated : 09 Dec, 2023