NLP sequencing converts sentences from a corpus into sequences of numeric tokens. A tokenizer is fit on a set of training sentences to build a word index (a vocabulary that maps each word to an integer, with more frequent words receiving smaller indices), and every sentence can then be represented as a list of those integers, ready to be fed to a neural network.
Example:
sentences = [
    'I love geeksforgeeks',
    'You love geeksforgeeks',
    'What do you think about geeksforgeeks?'
]
Word Index: {'geeksforgeeks': 1, 'love': 2, 'you': 3, 'i': 4, 'what': 5, 'do': 6, 'think': 7, 'about': 8}
Sequences: [[4, 2, 1], [3, 2, 1], [5, 6, 3, 7, 8, 1]]
If the test set contains words the tokenizer has never seen before, or we need to encode unknown words at prediction time, we can add a simple placeholder token for them.
Let the test set be:
test_data = [
    'i really love geeksforgeeks',
    'Do you like geeksforgeeks'
]
We therefore define an additional placeholder (out-of-vocabulary) token for words the tokenizer hasn't seen before. The placeholder takes index 1, shifting every other word's index up by one.
Word Index = {'placeholder': 1, 'geeksforgeeks': 2, 'love': 3, 'you': 4, 'i': 5, 'what': 6, 'do': 7, 'think': 8, 'about': 9}
Sequences = [[5, 3, 2], [4, 3, 2], [6, 7, 4, 8, 9, 2]]
Since the words 'really' and 'like' were not encountered during fitting, they are simply replaced by the placeholder, which has index 1.
So, the test sequence now becomes,
Test Sequence = [[5, 1, 3, 2], [7, 4, 1, 2]]
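To make the placeholder behaviour concrete, the same mapping can be reproduced by hand: look each word up in the word index and fall back to index 1 whenever it is missing. The snippet below is a minimal sketch of that idea; the encode helper and the hard-coded word_index are illustrative only, not part of the Keras API.
# Word index produced after fitting with the placeholder token (from above)
word_index = {'placeholder': 1, 'geeksforgeeks': 2, 'love': 3, 'you': 4,
              'i': 5, 'what': 6, 'do': 7, 'think': 8, 'about': 9}

def encode(sentence, word_index, oov_index=1):
    # Lower-case and drop the '?', roughly mirroring the Tokenizer's default filters
    words = sentence.lower().replace('?', '').split()
    # Unknown words fall back to the placeholder index
    return [word_index.get(word, oov_index) for word in words]

test_data = ['i really love geeksforgeeks', 'Do you like geeksforgeeks']
print([encode(s, word_index) for s in test_data])
# [[5, 1, 3, 2], [7, 4, 1, 2]]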
Code: Implementation with TensorFlow
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [
    'I love geeksforgeeks',
    'You love geeksforgeeks',
    'What do you think about geeksforgeeks?'
]

# Build the vocabulary (word index) from the training sentences
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

# Convert each sentence into a sequence of word indices
sequences = tokenizer.texts_to_sequences(sentences)

print("Word Index: ", word_index)
print("Sequences: ", sequences)

# Refit with an out-of-vocabulary placeholder token
tokenizer = Tokenizer(num_words=100, oov_token="placeholder")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)
print("\nSequences = ", sequences)

# Unseen words in the test data are mapped to the placeholder index
test_data = [
    'i really love geeksforgeeks',
    'Do you like geeksforgeeks'
]
test_seq = tokenizer.texts_to_sequences(test_data)
print("\nTest Sequence = ", test_seq)
Output:
Word Index: {'geeksforgeeks': 1, 'love': 2, 'you': 3, 'i': 4, 'what': 5, 'do': 6, 'think': 7, 'about': 8}
Sequences: [[4, 2, 1], [3, 2, 1], [5, 6, 3, 7, 8, 1]]
Sequences = [[5, 3, 2], [4, 3, 2], [6, 7, 4, 8, 9, 2]]
Test Sequence = [[5, 1, 3, 2], [7, 4, 1, 2]]
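The script imports pad_sequences but never uses it. In practice, the variable-length sequences are usually padded to a common length before being fed to a model. A minimal sketch, continuing from the test_seq produced above (the maxlen value of 5 is an arbitrary choice for illustration):
from tensorflow.keras.preprocessing.sequence import pad_sequences  # already imported above

# Pad (or truncate) every sequence to the same length so they can be batched together
padded = pad_sequences(test_seq, maxlen=5, padding='post')
print("\nPadded Test Sequence = \n", padded)
# Each row now has length 5, with zeros appended after the real tokens:
# [[5 1 3 2 0]
#  [7 4 1 2 0]]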