
NLP Sequencing

Last Updated : 24 Oct, 2020

In NLP, sequencing is the process of turning a corpus, or body of statements, into sequences of numbers that can be fed to a neural network. Each word in the training sentences is assigned a numeric token, and every sentence is then represented as the list of its words' tokens.

Example:

sentences = [
'I love geeksforgeeks',
'You love geeksforgeeks',
'What do you think about geeksforgeeks?'
]

Word Index: {'geeksforgeeks': 1, 'love': 2, 'you': 3, 'i': 4,
             'what': 5, 'do': 6, 'think': 7, 'about': 8}

Sequences: [[4, 2, 1], [3, 2, 1], [5, 6, 3, 7, 8, 1]]
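
To make the mapping concrete, here is a minimal plain-Python sketch of the idea (the actual Keras Tokenizer implementation is shown further below); the word_index dictionary is simply copied from the example above:

# word index copied from the example above
word_index = {'geeksforgeeks': 1, 'love': 2, 'you': 3, 'i': 4,
              'what': 5, 'do': 6, 'think': 7, 'about': 8}

sentences = [
    'I love geeksforgeeks',
    'You love geeksforgeeks',
    'What do you think about geeksforgeeks?'
]

# lowercase, strip punctuation and replace each word by its index
sequences = [[word_index[w] for w in s.lower().replace('?', '').split()]
             for s in sentences]

print(sequences)  # [[4, 2, 1], [3, 2, 1], [5, 6, 3, 7, 8, 1]]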

Now, if the test set contains words the tokenizer has not seen before, or we need to predict a missing word in a sentence, we can add a simple placeholder (out-of-vocabulary) token.

Let the test set be:

test_data = [
    'i really love geeksforgeeks',
    'Do you like geeksforgeeks'
]

We then define an additional placeholder token for words the tokenizer hasn't seen before. The placeholder gets index 1 by default, and every other word's index is shifted up by one.

Word Index =  {'placeholder': 1, 'geeksforgeeks': 2, 'love': 3, 'you': 4, 'i': 5, 'what': 6, 'do': 7, 'think': 8, 'about': 9}

Sequences =  [[5, 3, 2], [4, 3, 2], [6, 7, 4, 8, 9, 2]]

As the words 'really' and 'like' were not encountered during training, they are simply replaced by the placeholder token, which has index 1.

So, the test sequences become:

Test Sequence = [[5, 1, 3, 2], [7, 4, 1, 2]]
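
Conceptually, this fallback is just a dictionary lookup with a default of 1. A minimal plain-Python sketch of that behaviour (the Keras Tokenizer does the equivalent through its oov_token argument, as shown in the implementation below):

# word index including the placeholder, copied from above
word_index = {'placeholder': 1, 'geeksforgeeks': 2, 'love': 3, 'you': 4,
              'i': 5, 'what': 6, 'do': 7, 'think': 8, 'about': 9}

test_data = [
    'i really love geeksforgeeks',
    'Do you like geeksforgeeks'
]

# unseen words ('really', 'like') fall back to the placeholder index 1
test_seq = [[word_index.get(w, 1) for w in s.lower().split()]
            for s in test_data]

print(test_seq)  # [[5, 1, 3, 2], [7, 4, 1, 2]]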

Code: Implementation with TensorFlow




# importing all the modules required
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
  
# the initial corpus of sentences or the training set
sentences = [
    'I love geeksforgeeks',
    'You love geeksforgeeks',
    'What do you think about geeksforgeeks?'
]
  
tokenizer = Tokenizer(num_words = 100)
  
# the tokenizer also removes punctuations
tokenizer.fit_on_texts(sentences)  
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)
print("Word Index: ", word_index)
print("Sequences: ", sequences)
  
# defining an out-of-vocabulary token, here named "placeholder"
tokenizer = Tokenizer(num_words=100,
                      oov_token="placeholder")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
  
  
sequences = tokenizer.texts_to_sequences(sentences)
print("\nSequences = ", sequences)
  
  
# the test data containing words the tokenizer hasn't encountered before
test_data = [
    'i really love geeksforgeeks',
    'Do you like geeksforgeeks'
]
  
test_seq = tokenizer.texts_to_sequences(test_data)
print("\nTest Sequence = ", test_seq)


Output: 

Word Index:  {'geeksforgeeks': 1, 'love': 2, 'you': 3, 'i': 4, 'what': 5, 'do': 6, 'think': 7, 'about': 8}
Sequences:  [[4, 2, 1], [3, 2, 1], [5, 6, 3, 7, 8, 1]]

Sequences =  [[5, 3, 2], [4, 3, 2], [6, 7, 4, 8, 9, 2]]

Test Sequence =  [[5, 1, 3, 2], [7, 4, 1, 2]]
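
The sequences produced this way have different lengths, while a neural network usually expects inputs of a uniform shape. The pad_sequences utility imported in the code above handles this; a minimal sketch, assuming the test_seq variable from the implementation above (maxlen=5 is an arbitrary choice for illustration):

from tensorflow.keras.preprocessing.sequence import pad_sequences

# pad every sequence to length 5 (zeros are added at the front by default)
padded = pad_sequences(test_seq, maxlen=5)
print("Padded Test Sequence = \n", padded)

# Padded Test Sequence =
#  [[0 5 1 3 2]
#   [0 7 4 1 2]]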

