Sentence Autocomplete Using PyTorch

Natural Language Processing (NLP) is one of the most active areas of deep learning, and NLP applications are part of everyday life. In this article, we will see how to autocomplete half-written sentences using deep learning methods. We will also see how to generate clean data for training our NLP model. We will cover the following steps in this article:

Cleaning the text data for training the NLP model
Loading the dataset using PyTorch
Creating the LSTM model
Training an NLP model
Making inferences from the trained model

We have seen applications like Google Keyboard, which recommends what to type next based on the words already written in the draft. However, to recommend the next word, an application like Google's has been trained on billions of written sentences. For our model, we will use Wikipedia sentences, which are freely available to download from the internet.

Dataset for Sentence Autocomplete Model

We will use a Wikipedia dataset that we can download from here. The main problem with the Wikipedia dataset is that it contains special characters, non-meaningful words, and unknown words that we cannot use directly in our model, so we must clean it before training. Since our dataset is an MS Word document, we will use the python-docx library to read it. We can install the library with the following command.

!pip install python-docx

Python Code for Cleaning the Dataset

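We will use Python's re module to remove square-bracketed markers such as [1], together with the text inside the brackets. Since string comparison in Python is case-sensitive, we will also convert all words to lowercase so that, for example, "The" and "the" are treated as the same word. As we are developing this model only for English-speaking audiences, we will remove every word that contains a character outside the English alphabet.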

Python3

import re
import string

import pandas as pd
import torch
from docx import Document

# Read the DOCX file
doc_path = "wikipedia.docx"
doc = Document(doc_path)

# Extract text from paragraphs
text_data = [paragraph.text for paragraph in doc.paragraphs]

# Convert text to lowercase
text_data = [text.lower() for text in text_data]

# Remove square-bracketed markers (and the text inside them) using regex
text_data = [re.sub(r"\[.*?\]", "", text) for text in text_data]

# Keep only words made up entirely of English lowercase letters
english_alphabet = set(string.ascii_lowercase)
text_data = [
    " ".join(word for word in text.split()
             if all(char in english_alphabet for char in word))
    for text in text_data
]

# Remove leading/trailing whitespaces
text_data = [text.strip() for text in text_data]

# Remove empty sentences
text_data = [text for text in text_data if text]

# Create a DataFrame with the cleaned text data
df = pd.DataFrame({"Text": text_data})

# Save the cleaned text data to a CSV file
# (index=False excludes the index column from the output)
output_path = "output.csv"
df.to_csv(output_path, index=False)

print("Text data cleaned and saved to:", output_path)

Output:

Text data cleaned and saved to: output.csv

This code produces a CSV file containing the cleaned dataset, which we can use for training our model. We can download this cleaned CSV file from here.
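To see what the cleaning steps do without needing the .docx file, here is a minimal sketch that applies the same operations to a single string. The helper name clean_paragraph and the sample sentence are made up for illustration; they are not part of the original code.

```python
import re
import string

def clean_paragraph(text):
    """Apply the article's cleaning steps to one paragraph (illustrative helper)."""
    text = text.lower()                      # lowercase everything
    text = re.sub(r"\[.*?\]", "", text)      # drop bracketed markers such as [1]
    allowed = set(string.ascii_lowercase)
    # keep only words whose characters are all in a-z
    words = [w for w in text.split() if all(c in allowed for c in w)]
    return " ".join(words)

# Hypothetical sample sentence, not taken from the dataset
sample = "Paris[1] is the Capital of France, a café city with 2 million people"
cleaned = clean_paragraph(sample)
print(cleaned)
```

Note a side effect of the word filter: a word that touches punctuation (here "france,") is dropped entirely, not just stripped of the punctuation, so the cleaned text loses some valid words.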

Class For Loading The Dataset Using PyTorch

We will define a custom Python class that subclasses torch.utils.data.Dataset for loading the dataset. In this class, the load_words method reads the CSV file and splits the text into words. The get_unique_words method counts how often each word occurs and returns the unique words sorted from most to least frequent. The __len__ method returns the number of training samples in the dataset, whereas the __getitem__ method returns a pair of tensors for a given index: an input sequence of word indexes and a target sequence that is the same window shifted one word ahead.

Python3

from collections import Counter

import pandas as pd
import torch


class TextDataset(torch.utils.data.Dataset):
    def __init__(self, sequence_length):
        # Number of words in each input/target sequence
        self.sequence_length = sequence_length
        self.words = self.load_words()
        self.unique_words = self.get_unique_words()

        # Lookup tables between words and integer indexes
        self.index_to_word = {index: word for index, word
                              in enumerate(self.unique_words)}
        self.word_to_index = {word: index for index, word
                              in enumerate(self.unique_words)}
        self.word_indexes = [self.word_to_index[w] for w in self.words]

    def load_words(self):
        # Read the cleaned CSV written by the cleaning step
        train_df = pd.read_csv("output.csv")
        text = train_df["Text"].str.cat(sep=" ")
        return text.split()

    def get_unique_words(self):
        # Unique words, most frequent first
        word_counts = Counter(self.words)
        return sorted(word_counts, key=word_counts.get, reverse=True)

    def __len__(self):
        return len(self.word_indexes) - self.sequence_length

    def __getitem__(self, index):
        # Input window and the same window shifted one word ahead
        end = index + self.sequence_length
        return (
            torch.tensor(self.word_indexes[index:end]),
            torch.tensor(self.word_indexes[index + 1:end + 1]),
        )
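The vocabulary building and the sliding window used by the dataset class can be sketched on a toy word list with plain Python lists, without the CSV file or tensors. The sentence and the helper get_pair are made up for illustration.

```python
from collections import Counter

# Toy corpus (hypothetical), already cleaned and split into words
words = "the cat sat on the mat the end".split()

# Unique words sorted from most to least frequent, as in get_unique_words
counts = Counter(words)
unique_words = sorted(counts, key=counts.get, reverse=True)
word_to_index = {w: i for i, w in enumerate(unique_words)}
word_indexes = [word_to_index[w] for w in words]

sequence_length = 3

def get_pair(index):
    """Return (input window, target window shifted by one word)."""
    x = word_indexes[index:index + sequence_length]
    y = word_indexes[index + 1:index + sequence_length + 1]
    return x, y

# Same formula as __len__: the last window must leave room for its target
num_samples = len(word_indexes) - sequence_length

print(unique_words)
print(get_pair(0))
print(num_samples)
```

Each target position holds the word that follows the corresponding input position, so the model is trained to predict the next word at every step of the window.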

LSTM Model For Sentence Autocompletion

We will use a long short-term memory (LSTM) network for our model. LSTM networks have an edge over simple RNNs because their three gates (input, forget, and output) help mitigate the vanishing gradient problem. Let us go through the parameters and methods used in this model one by one.
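As a rough illustration of what the three gates do, the sketch below computes a single LSTM step by hand and compares it with PyTorch's nn.LSTMCell. The sizes are arbitrary illustrative values, and the gate order (input, forget, candidate, output) follows how PyTorch stacks the LSTM weights.

```python
import torch
from torch import nn

torch.manual_seed(0)
input_size, hidden_size = 4, 3              # illustrative sizes
cell = nn.LSTMCell(input_size, hidden_size)

x = torch.randn(1, input_size)              # one input vector
h = torch.randn(1, hidden_size)             # previous hidden state
c = torch.randn(1, hidden_size)             # previous cell state

# All four transforms are stacked in one weight matrix:
# input gate, forget gate, candidate values, output gate
gates = (x @ cell.weight_ih.T + cell.bias_ih
         + h @ cell.weight_hh.T + cell.bias_hh)
i, f, g, o = gates.chunk(4, dim=1)
i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
g = torch.tanh(g)

# The forget gate scales the old cell state, the input gate admits new values
c_next = f * c + i * g
# The output gate decides how much of the cell state becomes the hidden state
h_next = o * torch.tanh(c_next)

h_ref, c_ref = cell(x, (h, c))
print(torch.allclose(h_next, h_ref, atol=1e-5),
      torch.allclose(c_next, c_ref, atol=1e-5))
```

The cell state is updated additively (f * c + i * g) instead of being pushed through a squashing function at every step, which is the usual explanation for why gradients survive longer in an LSTM than in a simple RNN.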

__init__(self, dataset): This is the initialization method of the class. It sets up the architecture and parameters of the LSTM model. The key components defined in this method are:
self.lstm_size: The size (number of units) of the LSTM hidden state.
self.embedding_dim: The dimensionality of the word embeddings.
self.num_layers: The number of stacked layers in the LSTM. Since we don’t want our model to be computationally expensive, we will use only 3 layers.
self.embedding: An embedding layer that converts input word indexes to dense word embeddings.
self.lstm: The LSTM layer, created with the specified input size, hidden size, number of layers, and dropout rate.
self.fc: A fully connected (linear) layer at the end of the network that maps the LSTM output to the vocabulary size to generate logits.

forward(self, x, prev_state): This method performs the forward pass of the model. It takes an input tensor x and the previous state of the LSTM prev_state as input. The steps within this method are:
self.embedding(x): Passes the input tensor x through the embedding layer; the resulting word embeddings are stored in embed.
self.lstm(embed, prev_state): Passes the embeddings and the previous state through the LSTM layer, returning the output and the updated state.
self.fc(output): Passes the LSTM output through the fully connected layer to produce the logits (the model’s predictions). The method returns the logits together with the updated LSTM state.

init_state(self, sequence_length): This method initializes the LSTM state with zeros. It takes the length of the input sequence and returns a pair of zero tensors (the hidden state and the cell state), each of shape (num_layers, sequence_length, lstm_size). The middle dimension is sequence_length because nn.LSTM is created here without batch_first=True, so it treats the second dimension of its input as the batch dimension.

Python3

import torch
from torch import nn


class LSTMModel(nn.Module):
    def __init__(self, dataset):
        super().__init__()
        self.lstm_size = 128
        self.embedding_dim = 128
        self.num_layers = 3

        n_vocab = len(dataset.unique_words)

        # Word index -> dense vector
        self.embedding = nn.Embedding(
            num_embeddings=n_vocab,
            embedding_dim=self.embedding_dim,
        )
        # Stacked LSTM with dropout between layers
        self.lstm = nn.LSTM(
            input_size=self.embedding_dim,
            hidden_size=self.lstm_size,
            num_layers=self.num_layers,
            dropout=0.2,
        )
        # LSTM output -> one score (logit) per vocabulary word
        self.fc = nn.Linear(self.lstm_size, n_vocab)

    def forward(self, x, prev_state):
        embed = self.embedding(x)
        output, state = self.lstm(embed, prev_state)
        logits = self.fc(output)
        return logits, state

    def init_state(self, sequence_length):
        # Hidden state and cell state, both initialized to zeros
        return (
            torch.zeros(self.num_layers, sequence_length, self.lstm_size),
            torch.zeros(self.num_layers, sequence_length, self.lstm_size),
        )
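The model above creates nn.LSTM without batch_first=True, so the LSTM treats the first dimension of its input as the time axis. Because the data loader yields tensors of shape (batch_size, sequence_length), a reader who wants the recurrence to run along the word sequence could use a variant like the sketch below. This is an assumption about a possible alternative, not the model trained in this article: the class name BatchFirstLSTM, its constructor arguments, and the toy sizes are all hypothetical.

```python
import torch
from torch import nn

class BatchFirstLSTM(nn.Module):
    """Hypothetical variant of the model with the batch dimension first."""

    def __init__(self, n_vocab, embedding_dim=128, lstm_size=128, num_layers=3):
        super().__init__()
        self.lstm_size = lstm_size
        self.num_layers = num_layers
        self.embedding = nn.Embedding(n_vocab, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, lstm_size, num_layers=num_layers,
                            dropout=0.2, batch_first=True)
        self.fc = nn.Linear(lstm_size, n_vocab)

    def forward(self, x, prev_state):
        embed = self.embedding(x)                  # (batch, seq_len, embedding_dim)
        output, state = self.lstm(embed, prev_state)
        return self.fc(output), state              # logits: (batch, seq_len, n_vocab)

    def init_state(self, batch_size):
        # The state is always (num_layers, batch, hidden), even with batch_first=True
        shape = (self.num_layers, batch_size, self.lstm_size)
        return torch.zeros(shape), torch.zeros(shape)

# Toy check with a made-up vocabulary of 50 words
model_bf = BatchFirstLSTM(n_vocab=50)
x = torch.randint(0, 50, (4, 10))                  # 4 sequences of 10 word indexes
logits, (h, c) = model_bf(x, model_bf.init_state(batch_size=4))
print(logits.shape, h.shape)
```

With this variant, the state would have to be created from the actual batch size (for example inputs.size(0)), since the last batch of an epoch can be smaller than batch_size. The loss values reported in this article come from the original model, not from this variant.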

Initiating Hyperparameters and DataLoader for Training Sentence Autocomplete Model

Next, we define the hyperparameters of the model. Hyperparameters are the configurable settings that determine the behavior of the model during training. In this code, the hyperparameters are defined as follows:

sequence_length: The length of the input sequence.
batch_size: The number of samples processed in each iteration.
learning_rate: The step size at which the optimizer adjusts the model’s parameters.
num_epochs: The number of times the entire dataset is passed through the model during training.

We divide our dataset into two parts, for training and validation, using PyTorch's random_split. We use PyTorch’s DataLoader to create iterators for efficient loading of data during training and validation. The training and validation datasets are passed to their respective data loaders, specifying the batch size and, for the training data, that shuffling is required.
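The split-and-load pattern can be tried on a small stand-in dataset before running it on the real one. In this sketch, the TensorDataset of made-up integers and the sizes are assumptions chosen only to make the shapes easy to read.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Stand-in data: 20 samples, each a sequence of 5 "word indexes"
toy_inputs = torch.arange(100).reshape(20, 5)
toy_targets = toy_inputs + 1
toy_dataset = TensorDataset(toy_inputs, toy_targets)

# 80/20 split, same arithmetic as in the article
train_size = int(0.8 * len(toy_dataset))
val_size = len(toy_dataset) - train_size
train_part, val_part = random_split(toy_dataset, [train_size, val_size])

# Shuffle only the training data
train_batches = DataLoader(train_part, batch_size=8, shuffle=True)
val_batches = DataLoader(val_part, batch_size=8)

first_inputs, first_targets = next(iter(train_batches))
print(len(train_part), len(val_part), first_inputs.shape)
```

Each batch is a pair of tensors of shape (batch_size, sequence_length), which is the shape the training loop receives as inputs and targets.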

To train the model, we use a training loop that iterates over the specified number of epochs. For each epoch, the model is put in training mode using model.train(), and the total loss is initialized. The loop then iterates over batches of data from the training data loader. For every batch, the following steps happen in sequence:

The optimizer’s gradients are reset using optimizer.zero_grad().
The initial state for the LSTM model is obtained with hidden = model.init_state(sequence_length).
The model is called with the inputs and the hidden state: outputs, _ = model(inputs, hidden).
The loss is computed from the predicted outputs and the targets, and the gradients are calculated using loss.backward().

The optimizer then performs a parameter update using optimizer.step(), and the loss is added to the total loss for the epoch. After the loop, the average loss for the epoch is calculated and printed. The model is then switched to evaluation mode with model.eval(), and the same loss is computed on the validation set inside torch.no_grad(), so that no gradients are tracked.
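The training code flattens the model outputs and the targets before computing the loss, because nn.CrossEntropyLoss expects one row of scores per prediction and one class index per row. A minimal sketch of that reshaping, using random tensors and made-up sizes in place of real model outputs:

```python
import torch
from torch import nn

batch_size, seq_len, n_vocab = 4, 10, 50     # illustrative sizes

# Random stand-ins for the model's logits and the target word indexes
logits = torch.randn(batch_size, seq_len, n_vocab, requires_grad=True)
targets = torch.randint(0, n_vocab, (batch_size, seq_len))

criterion = nn.CrossEntropyLoss()

# Collapse the batch and sequence dimensions into a single dimension
flat_logits = logits.view(-1, n_vocab)       # (40, 50): one row per predicted word
flat_targets = targets.view(-1)              # (40,): the correct index for each row

loss = criterion(flat_logits, flat_targets)  # mean loss over all 40 predictions
loss.backward()                              # fills logits.grad

print(flat_logits.shape, flat_targets.shape, loss.item())
```

The loss is averaged over every word position in the batch, so its value does not depend on the batch size or the sequence length.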

Python3

import torch
from torch import nn
from torch.utils.data import DataLoader, random_split

# Hyperparameters
sequence_length = 10
batch_size = 64
learning_rate = 0.001
num_epochs = 10

# Create the dataset
dataset = TextDataset(sequence_length)
n_vocab = len(dataset.unique_words)

# Split the dataset into training and validation sets
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

# Create data loaders
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)

# Create the model
model = LSTMModel(dataset)

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Training loop
for epoch in range(num_epochs):
    model.train()
    total_loss = 0.0

    for inputs, targets in train_loader:
        optimizer.zero_grad()
        hidden = model.init_state(sequence_length)
        outputs, _ = model(inputs, hidden)
        # Flatten to (N, n_vocab) and (N,) as CrossEntropyLoss expects
        loss = criterion(outputs.view(-1, n_vocab), targets.view(-1))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    # Calculate and print the average training loss for the epoch
    average_loss = total_loss / len(train_loader)
    print(f"Epoch [{epoch+1}/{num_epochs}], Average Loss: {average_loss:.4f}")

    # Validation loop
    model.eval()
    val_loss = 0.0

    with torch.no_grad():
        for inputs, targets in val_loader:
            hidden = model.init_state(sequence_length)
            outputs, _ = model(inputs, hidden)
            loss = criterion(outputs.view(-1, n_vocab), targets.view(-1))
            val_loss += loss.item()

    # Calculate and print the average validation loss for the epoch
    average_val_loss = val_loss / len(val_loader)
    print(f"Epoch [{epoch+1}/{num_epochs}], Validation Loss: {average_val_loss:.4f}")

Output:

Epoch [1/10], Average Loss: 6.8103
Epoch [1/10], Validation Loss: 6.5937
Epoch [2/10], Average Loss: 6.4668
Epoch [2/10], Validation Loss: 6.3104
Epoch [3/10], Average Loss: 6.2176
Epoch [3/10], Validation Loss: 6.1290
Epoch [4/10], Average Loss: 6.0840
Epoch [4/10], Validation Loss: 6.0208
Epoch [5/10], Average Loss: 5.9850
Epoch [5/10], Validation Loss: 5.9312
Epoch [6/10], Average Loss: 5.8937
Epoch [6/10], Validation Loss: 5.8397
Epoch [7/10], Average Loss: 5.8056
Epoch [7/10], Validation Loss: 5.7547
Epoch [8/10], Average Loss: 5.7232
Epoch [8/10], Validation Loss: 5.6692
Epoch [9/10], Average Loss: 5.6379
Epoch [9/10], Validation Loss: 5.5762
Epoch [10/10], Average Loss: 5.5473
Epoch [10/10], Validation Loss: 5.4821
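One way to read these numbers is to convert the cross-entropy loss into perplexity, which is the exponential of the loss. The sketch below does this for the final validation loss reported above. The vocabulary size used for the uniform-guess baseline is a made-up value for illustration, since the actual vocabulary size is not shown in the output.

```python
import math

# Validation loss after epoch 10, taken from the output above
final_val_loss = 5.4821

# nn.CrossEntropyLoss uses the natural logarithm, so exp() gives the perplexity
perplexity = math.exp(final_val_loss)
print(f"Perplexity: {perplexity:.1f}")

# Baseline: a model that guesses uniformly over the vocabulary has loss ln(n_vocab).
# 10000 is a hypothetical vocabulary size, not the one from this dataset.
hypothetical_vocab_size = 10000
uniform_loss = math.log(hypothetical_vocab_size)
print(f"Uniform-guess loss for {hypothetical_vocab_size} words: {uniform_loss:.2f}")
```

A perplexity of roughly 240 means the model is, on average, about as uncertain as if it were choosing uniformly among that many words at each position. The loss is still decreasing after 10 epochs, so further training may improve it.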

Making Inference From The Model

To make inferences from the model, we take an incomplete input sentence from the user. The input sentence is preprocessed by splitting it into individual words and converting each word to its corresponding index in the dataset. This is done using a list comprehension that iterates over the words in the input sentence and looks up their indexes in the dataset's word_to_index dictionary. The resulting list of indexes is stored in the input_indexes variable. We convert this list into a tensor and pass it through the model; the word with the highest score at the last position is taken as the predicted next word.

Python3

# Input a sentence
input_sentence = "he is your "

# Preprocess the input sentence: map each word to its index
# (a word that is not in the vocabulary raises a KeyError)
input_indexes = [dataset.word_to_index[word]
                 for word in input_sentence.split()]
input_tensor = torch.tensor(input_indexes, dtype=torch.long).unsqueeze(0)

# Generate the next word
model.eval()
hidden = model.init_state(len(input_indexes))
with torch.no_grad():
    outputs, _ = model(input_tensor, hidden)
predicted_index = torch.argmax(outputs[0, -1, :]).item()
predicted_word = dataset.index_to_word[predicted_index]

# Print the predicted word
print("Input Sentence:", input_sentence)
print("Predicted Next Word:", predicted_word)

Output:

Input Sentence: he is your  
Predicted Next Word: mayor
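The inference code above predicts a single word. To complete more of the sentence, the predicted word can be appended to the input and the prediction repeated. The sketch below shows this loop with the same call pattern as above. To stay self-contained it uses a tiny untrained stand-in model and a five-word vocabulary, both hypothetical, so the words it produces are arbitrary. The helper name complete and the choice to skip unknown words are also assumptions.

```python
import torch
from torch import nn

class TinyLSTM(nn.Module):
    """Untrained stand-in with the same interface as LSTMModel (illustrative)."""

    def __init__(self, n_vocab, size=16, num_layers=1):
        super().__init__()
        self.size, self.num_layers = size, num_layers
        self.embedding = nn.Embedding(n_vocab, size)
        self.lstm = nn.LSTM(size, size, num_layers=num_layers)
        self.fc = nn.Linear(size, n_vocab)

    def forward(self, x, prev_state):
        output, state = self.lstm(self.embedding(x), prev_state)
        return self.fc(output), state

    def init_state(self, sequence_length):
        shape = (self.num_layers, sequence_length, self.size)
        return torch.zeros(shape), torch.zeros(shape)

def complete(model, word_to_index, index_to_word, text, num_words=3):
    """Greedily append num_words predicted words to text."""
    # Skip words the vocabulary does not know instead of raising a KeyError
    words = [w for w in text.lower().split() if w in word_to_index]
    if not words:
        raise ValueError("no known words in the input")
    model.eval()
    with torch.no_grad():
        for _ in range(num_words):
            indexes = torch.tensor([[word_to_index[w] for w in words]])
            logits, _ = model(indexes, model.init_state(len(words)))
            # Highest-scoring word at the last position
            next_index = torch.argmax(logits[0, -1, :]).item()
            words.append(index_to_word[next_index])
    return " ".join(words)

# Made-up vocabulary and untrained model, for demonstration only
vocab = ["he", "is", "your", "friend", "mayor"]
word_to_index = {w: i for i, w in enumerate(vocab)}
index_to_word = {i: w for w, i in word_to_index.items()}
toy_model = TinyLSTM(len(vocab))

result = complete(toy_model, word_to_index, index_to_word, "He is your unknownword")
print(result)
```

With the trained model from this article, the call would presumably be complete(model, dataset.word_to_index, dataset.index_to_word, "he is your"). Always taking the argmax can lead to repetitive output; sampling from the softmax of the logits is a common alternative.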
