NLP Sequencing

NLP sequencing is the process of converting a corpus, or body of statements, into sequences of numbers that can later be fed to a neural network. We take a set of sentences, assign each word a numeric token based on the training-set sentences, and represent every sentence as the list of its word tokens.

Example:

sentences = [
'I love geeksforgeeks',
'You love geeksforgeeks',
'What do you think about geeksforgeeks?'
]

Word Index: {'geeksforgeeks': 1, 'love': 2, 'you': 3, 'i': 4,
             'what': 5, 'do': 6, 'think': 7, 'about': 8}

Sequences: [[4, 2, 1], [3, 2, 1], [5, 6, 3, 7, 8, 1]]

Now, if the test set contains words the tokenizer has not seen before, or we have to predict a word in a sentence, we can add a simple placeholder token to stand in for them.

Let the test set be:

test_data = [
    'i really love geeksforgeeks',
    'Do you like geeksforgeeks'
]

We then define an additional placeholder token for words the tokenizer hasn't seen before. The placeholder is assigned index 1 by default, so the indices of all other words shift up by one.

Word Index = {'placeholder': 1, 'geeksforgeeks': 2, 'love': 3, 'you': 4, 'i': 5, 'what': 6, 'do': 7, 'think': 8, 'about': 9}

Sequences =  [[5, 3, 2], [4, 3, 2], [6, 7, 4, 8, 9, 2]]

As the words 'really' and 'like' have not been encountered before, they are simply replaced by the placeholder, which has index 1.

So the test sequence becomes:

Test Sequence = [[5, 1, 3, 2], [7, 4, 1, 2]]
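
Under the hood this is just a word-by-word lookup in the word index, with any unknown word mapped to the placeholder index 1. A minimal sketch of that lookup in plain Python (the word_index dictionary is copied from above):

# a minimal sketch of the lookup the tokenizer performs on the test set;
# index 1 is reserved for the placeholder (OOV) token
word_index = {'placeholder': 1, 'geeksforgeeks': 2, 'love': 3, 'you': 4,
              'i': 5, 'what': 6, 'do': 7, 'think': 8, 'about': 9}

test_data = [
    'i really love geeksforgeeks',
    'Do you like geeksforgeeks'
]

# unknown words fall back to index 1 (the placeholder)
test_seq = [[word_index.get(word, 1) for word in sentence.lower().split()]
            for sentence in test_data]
print(test_seq)  # [[5, 1, 3, 2], [7, 4, 1, 2]]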

Code: Implementation with TensorFlow
# importing all the modules required
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
  
# the initial corpus of sentences or the training set
sentences = [
    'I love geeksforgeeks',
    'You love geeksforgeeks',
    'What do you think about geeksforgeeks?'
]
  
tokenizer = Tokenizer(num_words=100)
  
# the tokenizer lowercases the text and strips punctuation
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)
print("Word Index: ", word_index)
print("Sequences: ", sequences)
  
# defining a placeholder token for out-of-vocabulary words
tokenizer = Tokenizer(num_words=100,
                      oov_token="placeholder")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
  
  
sequences = tokenizer.texts_to_sequences(sentences)
print("\nSequences = ", sequences)
  
  
# the test data, containing words the tokenizer hasn't encountered
test_data = [
    'i really love geeksforgeeks',
    'Do you like geeksforgeeks'
]
  
test_seq = tokenizer.texts_to_sequences(test_data)
print("\nTest Sequence = ", test_seq)


Output: 

Word Index:  {'geeksforgeeks': 1, 'love': 2, 'you': 3, 'i': 4, 'what': 5, 'do': 6, 'think': 7, 'about': 8}
Sequences:  [[4, 2, 1], [3, 2, 1], [5, 6, 3, 7, 8, 1]]

Sequences =  [[5, 3, 2], [4, 3, 2], [6, 7, 4, 8, 9, 2]]

Test Sequence =  [[5, 1, 3, 2], [7, 4, 1, 2]]
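
The pad_sequences helper imported in the code above is not actually used there. As a minimal sketch, it could pad the test sequences to a common length before they are fed to a network (the maxlen value of 5 is an illustrative choice, not part of the original example):

from tensorflow.keras.preprocessing.sequence import pad_sequences

# pad every sequence to length 5, appending zeros to the shorter ones
padded = pad_sequences(test_seq, maxlen=5, padding='post')
print(padded)
# [[5 1 3 2 0]
#  [7 4 1 2 0]]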

