
Different Techniques for Sentence Semantic Similarity in NLP

Semantic similarity measures how close two pieces of text, whether words, phrases, or whole sentences, are in terms of their meaning and context.

In this article, we will focus on how the semantic similarity between two sentences is derived. We will cover the following widely used models.



  1. Doc2Vec – an extension of Word2Vec that learns document-level embeddings.
  2. SBERT – a Transformer-based model whose encoder captures the meaning of the words in a sentence and pools them into a sentence embedding.
  3. InferSent – uses a bi-directional LSTM to encode sentences and infer semantics.
  4. USE (Universal Sentence Encoder) – a model trained by Google that generates fixed-size embeddings for sentences, which can be used for any NLP task.

What is Semantic Similarity?

Semantic similarity refers to the degree to which two pieces of text mean the same thing. Unlike lexical similarity, which focuses on the surface structure and wording of phrases, semantic similarity delves into the understanding and meaning of the content. The aim is to measure how closely related or analogous the concepts, ideas, or information conveyed in two texts are.

In NLP, semantic similarity is used in various tasks, such as:



  1. Question answering – enhances QA systems by measuring the semantic similarity between user queries and document content.
  2. Recommendation systems – match users with content by the semantic similarity between user interests and available items.
  3. Summarization – helps identify and condense semantically similar content.
  4. Corpus clustering – helps in grouping documents with similar content.

Common approaches for measuring semantic similarity in natural language processing (NLP) include word embeddings, sentence embeddings, and transformer-based models.
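
Regardless of which embedding technique is used, the resulting vectors are usually compared with cosine similarity, which is what the implementations later in this article do via scipy.spatial.distance.cosine. Below is a minimal NumPy sketch of the metric itself, using two made-up vectors.

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between the vectors: 1 means same direction, 0 means orthogonal
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.5])
print(cosine_similarity(u, v))  # close to 1, the vectors point in nearly the same direction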

Word Embedding

To understand semantic relationships between sentences, one must first be aware of word embeddings. Word embeddings are vectorized representations of words. The simplest form is a one-hot vector; however, one-hot vectors are sparse, very high dimensional, and do not capture meaning. More advanced embeddings such as Word2Vec (skip-gram, CBOW), GloVe, and fastText capture semantic information in a dense, lower-dimensional space.
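
As a quick illustration of what such embeddings buy us, the sketch below loads a small set of pre-trained GloVe vectors through gensim's downloader (the glove-wiki-gigaword-50 name and the ability to download it are assumptions about the environment) and checks a few word similarities.

import gensim.downloader as api

# Downloads ~65 MB of 50-dimensional GloVe vectors on first use
glove = api.load("glove-wiki-gigaword-50")

print(glove.similarity("movie", "film"))    # high: near-synonyms sit close together
print(glove.similarity("movie", "banana"))  # much lower: unrelated concepts
print(glove.most_similar("baby", topn=3))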

Word2Vec

Word2Vec represents words as dense vectors (typically a few hundred dimensions) such that semantically similar words end up close to each other in the vector space. There are two main architectures for Word2Vec:

  1. CBOW (Continuous Bag of Words) – predicts the centre word from its surrounding context words.
  2. Skip-gram – predicts the surrounding context words from the centre word.
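
A minimal gensim sketch of the two architectures on a toy corpus (the corpus and hyperparameters are purely illustrative):

from gensim.models import Word2Vec

toy_corpus = [["the", "movie", "was", "a", "good", "thriller"],
              ["the", "baby", "learned", "to", "walk", "early"],
              ["we", "are", "learning", "nlp"]]

# sg=0 trains CBOW (predict the centre word from its context);
# sg=1 trains skip-gram (predict the context words from the centre word)
cbow = Word2Vec(toy_corpus, vector_size=50, window=2, min_count=1, sg=0, epochs=200)
skip_gram = Word2Vec(toy_corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=200)

print(cbow.wv["movie"].shape)                     # (50,)
print(skip_gram.wv.most_similar("movie", topn=2))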

Doc2Vec

Similar to Word2Vec, Doc2Vec has two types of models: PV-DM (Distributed Memory Model of Paragraph Vectors), which is analogous to CBOW, and PV-DBOW (Distributed Bag of Words version of Paragraph Vectors), which is analogous to skip-gram. We will look at PV-DM, as it generally performs better than the PV-DBOW model.

PV-DM model

PV-DM is an extension of Word2Vec in the sense that, in addition to the word vectors, it learns a paragraph vector which is combined with the context word vectors to predict the next word.

In summary, the algorithm itself has two key stages:

  1. Training – the word vectors, softmax weights, and paragraph vectors are learned jointly on the training documents.
  2. Inference – for a new, unseen document, a paragraph vector is computed by gradient descent while the word vectors and softmax weights are held fixed.

We can then use the learned paragraph vectors to predict particular labels using a standard classifier, e.g., logistic regression, or to compare documents directly.

Python Implementation of Doc2Vec

Below is a simple implementation of Doc2Vec using gensim.

  1. We first tokenize the words in each document and convert them to lowercase.
  2. We then create the TaggedDocument objects required for training the Doc2Vec model. Each document is associated with a unique tag (document ID), which identifies its paragraph vector.
  3. The parameters (vector_size, window, min_count, workers, epochs) control the model’s dimensions, context window size, minimum word count, parallelization, and training epochs.
  4. We then infer a vector representation for a new document that was not part of the training data.
  5. We then calculate the similarity score.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
 
# Sample data
data = ["The movie is awesome. It was a good thriller",
        "We are learning NLP throughg GeeksforGeeks",
        "The baby learned to walk in the 5th month itself"]
 
# Tokenizing the data
tokenized_data = [word_tokenize(document.lower()) for document in data]
 
# Creating TaggedDocument objects
tagged_data = [TaggedDocument(words=words, tags=[str(idx)])
               for idx, words in enumerate(tokenized_data)]
 
 
# Training the Doc2Vec model
# dm=1 selects the PV-DM architecture (gensim's default)
model = Doc2Vec(dm=1, vector_size=100, window=2, min_count=1, workers=4, epochs=1000)
model.build_vocab(tagged_data)
model.train(tagged_data, total_examples=model.corpus_count,
            epochs=model.epochs)
 
# Infer vector for a new document
new_document = "The baby was laughing and palying"
print('Original Document:', new_document)
 
inferred_vector = model.infer_vector(word_tokenize(new_document.lower()))
 
# Find most similar documents
similar_documents = model.dv.most_similar(
    [inferred_vector], topn=len(model.dv))
 
# Print the most similar documents
for index, score in similar_documents:
    print(f"Document {index}: Similarity Score: {score}")
    print(f"Document Text: {data[int(index)]}")
    print()

                    

Output:

Original Document: The baby was laughing and palying
Document 2: Similarity Score: 0.9838361740112305
Document Text: The baby learned to walk in the 5th month itself

Document 0: Similarity Score: 0.9455077648162842
Document Text: The movie is awesome. It was a good thriller

Document 1: Similarity Score: 0.8828089833259583
Document Text: We are learning NLP throughg GeeksforGeeks

SBERT

SBERT adds a pooling operation to the output of BERT to derive a fixed-size sentence embedding. The sentence is tokenized, converted into word embeddings, and passed through the BERT network to obtain contextualized token vectors. The researchers experimented with different pooling strategies and found that mean pooling works best: the token vectors are averaged to produce the sentence embedding.
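
The pooling step can be sketched directly with the Hugging Face transformers library; the checkpoint below is the same MiniLM model used later in this article, and the mean pooling is done by hand to make the idea explicit.

import torch
from transformers import AutoTokenizer, AutoModel

name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(name)
encoder = AutoModel.from_pretrained(name)

encoded = tokenizer(["The movie is awesome. It was a good thriller"],
                    padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    token_vectors = encoder(**encoded).last_hidden_state   # (batch, seq_len, hidden)

# Mean pooling: average the token vectors, ignoring padded positions
mask = encoded["attention_mask"].unsqueeze(-1).float()
sentence_embedding = (token_vectors * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)                             # torch.Size([1, 384])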

SBERT uses three objective functions to update the weights of the BERT model. The network is structured differently depending on the type of training data available, which in turn determines the objective function.

1. Classification Objective Function

This uses pairs of sentences with class labels (for example, NLI data) as training data. The two sentences are passed through the same BERT network with shared weights (a Siamese structure) and pooled into embeddings u and v. The vectors u, v, and the element-wise difference |u − v| are concatenated and fed to a softmax classifier, and cross-entropy loss is used to train the model weights.

SBERT with Classification Objective Function
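
A small PyTorch sketch of this objective (the 384-dimensional embeddings and the three NLI labels are illustrative assumptions):

import torch
import torch.nn as nn

u = torch.randn(1, 384)   # pooled embedding of sentence A from the shared BERT encoder
v = torch.randn(1, 384)   # pooled embedding of sentence B from the same (Siamese) encoder

classifier = nn.Linear(3 * 384, 3)   # e.g. entailment / neutral / contradiction
logits = classifier(torch.cat([u, v, torch.abs(u - v)], dim=-1))

label = torch.tensor([0])                      # gold label for the pair
loss = nn.CrossEntropyLoss()(logits, label)    # backpropagated into the encoder weights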

2. Regression Objective Function

This also uses pairs of sentences with labels as training data, and the network is again structured as a Siamese network. However, instead of the softmax layer, the cosine similarity between the two pooled embeddings is computed, and mean-squared-error loss is used as the objective function to train the BERT model weights.

SBERT with Regression Objective Function
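
A corresponding sketch of the regression objective, again with illustrative sizes and a made-up gold similarity score:

import torch
import torch.nn as nn
import torch.nn.functional as F

u = torch.randn(1, 384)                # pooled embedding of sentence A
v = torch.randn(1, 384)                # pooled embedding of sentence B

predicted = F.cosine_similarity(u, v)  # cosine similarity of the two embeddings
gold = torch.tensor([0.8])             # gold similarity label for the pair
loss = nn.MSELoss()(predicted, gold)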

3. Triplet Objective Function

Here the model is structured as a triplet network: the training data consists of an anchor sentence a, a positive (related) sentence p, and a negative (unrelated) sentence n, and the network is tuned so that the anchor ends up closer to the positive sentence than to the negative one.

Mathematically, we minimize the following loss function:

max(||s_a − s_p|| − ||s_a − s_n|| + ε, 0)

where s_a, s_p, and s_n are the embeddings of the anchor, positive, and negative sentences, ||·|| is a distance metric (e.g., Euclidean distance), and ε is the margin.
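
The same loss written as a short PyTorch sketch (embedding size and margin are illustrative):

import torch

s_a = torch.randn(384)   # embedding of the anchor sentence
s_p = torch.randn(384)   # embedding of the positive (related) sentence
s_n = torch.randn(384)   # embedding of the negative (unrelated) sentence

epsilon = 1.0            # margin
loss = torch.clamp(torch.norm(s_a - s_p) - torch.norm(s_a - s_n) + epsilon, min=0)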

Python Implementation

To implement it, we first need to install the Sentence Transformers framework:

!pip install -U sentence-transformers
#!pip install -U sentence-transformers
 
from scipy.spatial import distance
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
 
# Sample sentence
sentences = ["The movie is awesome. It was a good thriller",
             "We are learning NLP throughg GeeksforGeeks",
             "The baby learned to walk in the 5th month itself"]
 
 
test = "I liked the movie."
print('Test sentence:',test)
test_vec = model.encode([test])[0]
 
 
for sent in sentences:
    # cosine similarity = 1 - cosine distance
    similarity_score = 1 - distance.cosine(test_vec, model.encode([sent])[0])
    print(f'\nFor {sent}\nSimilarity Score = {similarity_score} ')

                    

Output:

Test sentence: I liked the movie.

For The movie is awesome. It was a good thriller
Similarity Score = 0.682051956653595

For We are learning NLP throughg GeeksforGeeks
Similarity Score = 0.0878136083483696

For The baby learned to walk in the 5th month itself
Similarity Score = 0.04816452041268349

InferSent

The structure comprises two components:

  1. Sentence encoder – a bi-directional LSTM (with pooling) that converts a sentence into a fixed-size embedding.
  2. Classifier – a feed-forward network that takes a pair of encoded sentences and predicts their relationship (e.g., an NLI label).

Training Sentence Encoder for Classification

Python Implementation

Implementing an InferSent model is a somewhat lengthy process, as there is no standard Hugging Face API available. InferSent comes pre-trained in two versions: Version 1 is trained with GloVe and Version 2 with fastText word embeddings. We will use Version 2, as it takes less time to download and process.

First, we need to build the InferSent model. The class below does that; it has been sourced from the InferSent GitHub repository.

# InferSent model class, copied from the InferSent GitHub repository
 
%load_ext autoreload
%autoreload 2
%matplotlib inline
from random import randint
import numpy as np
import torch
import time
import torch.nn as nn
class InferSent(nn.Module):
 
    def __init__(self, config):
        super(InferSent, self).__init__()
        self.bsize = config['bsize']
        self.word_emb_dim = config['word_emb_dim']
        self.enc_lstm_dim = config['enc_lstm_dim']
        self.pool_type = config['pool_type']
        self.dpout_model = config['dpout_model']
        self.version = 1 if 'version' not in config else config['version']
 
        self.enc_lstm = nn.LSTM(self.word_emb_dim, self.enc_lstm_dim, 1,
                                bidirectional=True, dropout=self.dpout_model)
 
        assert self.version in [1, 2]
        if self.version == 1:
            self.bos = '<s>'
            self.eos = '</s>'
            self.max_pad = True
            self.moses_tok = False
        elif self.version == 2:
            self.bos = '<p>'
            self.eos = '</p>'
            self.max_pad = False
            self.moses_tok = True
 
    def is_cuda(self):
        # either all weights are on cpu or they are on gpu
        return self.enc_lstm.bias_hh_l0.data.is_cuda
 
    def forward(self, sent_tuple):
        # sent_len: [max_len, ..., min_len] (bsize)
        # sent: (seqlen x bsize x worddim)
        sent, sent_len = sent_tuple
 
        # Sort by length (keep idx)
        sent_len_sorted, idx_sort = np.sort(sent_len)[::-1], np.argsort(-sent_len)
        sent_len_sorted = sent_len_sorted.copy()
        idx_unsort = np.argsort(idx_sort)
 
        idx_sort = torch.from_numpy(idx_sort).cuda() if self.is_cuda() \
            else torch.from_numpy(idx_sort)
        sent = sent.index_select(1, idx_sort)
 
        # Handling padding in Recurrent Networks
        sent_packed = nn.utils.rnn.pack_padded_sequence(sent, sent_len_sorted)
        sent_output = self.enc_lstm(sent_packed)[0]  # seqlen x batch x 2*nhid
        sent_output = nn.utils.rnn.pad_packed_sequence(sent_output)[0]
 
        # Un-sort by length
        idx_unsort = torch.from_numpy(idx_unsort).cuda() if self.is_cuda() \
            else torch.from_numpy(idx_unsort)
        sent_output = sent_output.index_select(1, idx_unsort)
 
        # Pooling
        if self.pool_type == "mean":
            sent_len = torch.FloatTensor(sent_len.copy()).unsqueeze(1).cuda()
            emb = torch.sum(sent_output, 0).squeeze(0)
            emb = emb / sent_len.expand_as(emb)
        elif self.pool_type == "max":
            if not self.max_pad:
                sent_output[sent_output == 0] = -1e9
            emb = torch.max(sent_output, 0)[0]
            if emb.ndimension() == 3:
                emb = emb.squeeze(0)
                assert emb.ndimension() == 2
 
        return emb
 
    def set_w2v_path(self, w2v_path):
        self.w2v_path = w2v_path
 
    def get_word_dict(self, sentences, tokenize=True):
        # create vocab of words
        word_dict = {}
        sentences = [s.split() if not tokenize else self.tokenize(s) for s in sentences]
        for sent in sentences:
            for word in sent:
                if word not in word_dict:
                    word_dict[word] = ''
        word_dict[self.bos] = ''
        word_dict[self.eos] = ''
        return word_dict
 
    def get_w2v(self, word_dict):
        assert hasattr(self, 'w2v_path'), 'w2v path not set'
        # create word_vec with w2v vectors
        word_vec = {}
        with open(self.w2v_path) as f:
            for line in f:
                word, vec = line.split(' ', 1)
                if word in word_dict:
                    word_vec[word] = np.fromstring(vec, sep=' ')
        print('Found %s(/%s) words with w2v vectors' % (len(word_vec), len(word_dict)))
        return word_vec
 
    def get_w2v_k(self, K):
        assert hasattr(self, 'w2v_path'), 'w2v path not set'
        # create word_vec with k first w2v vectors
        k = 0
        word_vec = {}
        with open(self.w2v_path) as f:
            for line in f:
                word, vec = line.split(' ', 1)
                if k <= K:
                    word_vec[word] = np.fromstring(vec, sep=' ')
                    k += 1
                if k > K:
                    if word in [self.bos, self.eos]:
                        word_vec[word] = np.fromstring(vec, sep=' ')
 
                if k > K and all([w in word_vec for w in [self.bos, self.eos]]):
                    break
        return word_vec
 
    def build_vocab(self, sentences, tokenize=True):
        assert hasattr(self, 'w2v_path'), 'w2v path not set'
        word_dict = self.get_word_dict(sentences, tokenize)
        self.word_vec = self.get_w2v(word_dict)
        print('Vocab size : %s' % (len(self.word_vec)))
 
    # build w2v vocab with k most frequent words
    def build_vocab_k_words(self, K):
        assert hasattr(self, 'w2v_path'), 'w2v path not set'
        self.word_vec = self.get_w2v_k(K)
        print('Vocab size : %s' % (K))
 
    def update_vocab(self, sentences, tokenize=True):
        assert hasattr(self, 'w2v_path'), 'warning : w2v path not set'
        assert hasattr(self, 'word_vec'), 'build_vocab before updating it'
        word_dict = self.get_word_dict(sentences, tokenize)
 
        # keep only new words
        for word in self.word_vec:
            if word in word_dict:
                del word_dict[word]
 
        # update vocabulary
        if word_dict:
            new_word_vec = self.get_w2v(word_dict)
            self.word_vec.update(new_word_vec)
        else:
            new_word_vec = []
        print('New vocab size : %s (added %s words)'% (len(self.word_vec), len(new_word_vec)))
 
    def get_batch(self, batch):
        # sent in batch in decreasing order of lengths
        # batch: (bsize, max_len, word_dim)
        embed = np.zeros((len(batch[0]), len(batch), self.word_emb_dim))
 
        for i in range(len(batch)):
            for j in range(len(batch[i])):
                embed[j, i, :] = self.word_vec[batch[i][j]]
 
        return torch.FloatTensor(embed)
 
    def tokenize(self, s):
        from nltk.tokenize import word_tokenize
        if self.moses_tok:
            s = ' '.join(word_tokenize(s))
            s = s.replace(" n't ", "n 't ")  # HACK to get ~MOSES tokenization
            return s.split()
        else:
            return word_tokenize(s)
 
    def prepare_samples(self, sentences, bsize, tokenize, verbose):
        sentences = [[self.bos] + s.split() + [self.eos] if not tokenize else
                     [self.bos] + self.tokenize(s) + [self.eos] for s in sentences]
        n_w = np.sum([len(x) for x in sentences])
 
        # filters words without w2v vectors
        for i in range(len(sentences)):
            s_f = [word for word in sentences[i] if word in self.word_vec]
            if not s_f:
                import warnings
                warnings.warn('No words in "%s" (idx=%s) have w2v vectors. \
                               Replacing by "</s>"..' % (sentences[i], i))
                s_f = [self.eos]
            sentences[i] = s_f
 
        lengths = np.array([len(s) for s in sentences])
        n_wk = np.sum(lengths)
        if verbose:
            print('Nb words kept : %s/%s (%.1f%s)' % (
                        n_wk, n_w, 100.0 * n_wk / n_w, '%'))
 
        # sort by decreasing length
        lengths, idx_sort = np.sort(lengths)[::-1], np.argsort(-lengths)
        sentences = np.array(sentences)[idx_sort]
 
        return sentences, lengths, idx_sort
 
    def encode(self, sentences, bsize=64, tokenize=True, verbose=False):
        tic = time.time()
        sentences, lengths, idx_sort = self.prepare_samples(
                        sentences, bsize, tokenize, verbose)
 
        embeddings = []
        for stidx in range(0, len(sentences), bsize):
            batch = self.get_batch(sentences[stidx:stidx + bsize])
            if self.is_cuda():
                batch = batch.cuda()
            with torch.no_grad():
                batch = self.forward((batch, lengths[stidx:stidx + bsize])).data.cpu().numpy()
            embeddings.append(batch)
        embeddings = np.vstack(embeddings)
 
        # unsort
        idx_unsort = np.argsort(idx_sort)
        embeddings = embeddings[idx_unsort]
 
        if verbose:
            print('Speed : %.1f sentences/s (%s mode, bsize=%s)' % (
                    len(embeddings)/(time.time()-tic),
                    'gpu' if self.is_cuda() else 'cpu', bsize))
        return embeddings
 
    def visualize(self, sent, tokenize=True):
 
        sent = sent.split() if not tokenize else self.tokenize(sent)
        sent = [[self.bos] + [word for word in sent if word in self.word_vec] + [self.eos]]
 
        if ' '.join(sent[0]) == '%s %s' % (self.bos, self.eos):
            import warnings
            warnings.warn('No words in "%s" have w2v vectors. Replacing \
                           by "%s %s"..' % (sent, self.bos, self.eos))
        batch = self.get_batch(sent)
 
        if self.is_cuda():
            batch = batch.cuda()
        output = self.enc_lstm(batch)[0]
        output, idxs = torch.max(output, 0)
        # output, idxs = output.squeeze(), idxs.squeeze()
        idxs = idxs.data.cpu().numpy()
        argmaxs = [np.sum((idxs == k)) for k in range(len(sent[0]))]
 
        # visualize model
        import matplotlib.pyplot as plt
        plt.figure(figsize=(12,12))
        x = range(len(sent[0]))
        y = [100.0 * n / np.sum(argmaxs) for n in argmaxs]
        plt.xticks(x, sent[0], rotation=45)
        plt.bar(x, y)
        plt.ylabel('%')
        plt.title('Visualisation of words importance')
        plt.show()
 
        return output, idxs, argmaxs

                    

Next, we download the fastText word embeddings and the pre-trained InferSent model.

!mkdir fastText
!curl -Lo fastText/crawl-300d-2M.vec.zip https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip
!unzip fastText/crawl-300d-2M.vec.zip -d fastText/
 
!mkdir encoder
!curl -Lo encoder/infersent2.pkl https://dl.fbaipublicfiles.com/infersent/infersent2.pkl

                    


import nltk
nltk.download('punkt')
 
 
MODEL_PATH = 'encoder/infersent2.pkl'
params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0, 'version': 2}
model = InferSent(params_model)
model.load_state_dict(torch.load(MODEL_PATH))
 
 
W2V_PATH = 'fastText/crawl-300d-2M.vec'
model.set_w2v_path(W2V_PATH)
 
# Load embeddings of K most frequent words
model.build_vocab_k_words(K=100000)

                    


We then use the model for inference:

# Sample sentence
from scipy.spatial import distance
sentences = ["The movie is awesome. It was a good thriller",
             "We are learning NLP throughg GeeksforGeeks",
             "The baby learned to walk in the 5th month itself"]
 
 
test = "I liked the movie"
print('Test Sentence:', test)
test_vec = model.encode([test])[0]
 
for sent in sentences:
    similarity_score = 1-distance.cosine(test_vec, model.encode([sent])[0])
    print(f'\nFor {sent}\nSimilarity Score = {similarity_score} ')

                    

Output:

Test Sentence: I liked the movie

For The movie is awesome. It was a good thriller
Similarity Score = 0.5299297571182251

For We are learning NLP throughg GeeksforGeeks
Similarity Score = 0.33156681060791016

For The baby learned to walk in the 5th month itself
Similarity Score = 0.20128820836544037

USE – Universal Sentence Encoder

At a high level, it consists of an encoder that summarizes any sentence to give a sentence embedding which can be used for any NLP task.

The encoder part comes in two forms, and either of them can be used:

  1. Transformer encoder – higher accuracy but more compute-intensive.
  2. Deep Averaging Network (DAN) – averages the word and bi-gram embeddings and passes them through a feed-forward network; faster but slightly less accurate.

Training of the USE

USE is trained on a variety of unsupervised and supervised tasks, such as a Skip-thought-like task, natural language inference (NLI), and more, using the principles illustrated below.

Training of Encoder

Python Implementation

We load the Universal Sentence Encoder’s TF Hub module.

import tensorflow as tf
import tensorflow_hub as hub

# TF Hub module for the Universal Sentence Encoder (v4)
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)
print("module %s loaded" % module_url)


def embed(input):
    return model(input)

                    


We then compute the similarity scores on our sample data below:

from scipy.spatial import distance
 
 
test = ["I liked the movie very much"]
print('Test Sentence:',test)
test_vec = embed(test)
# Sample sentence
sentences = [["The movie is awesome and It was a good thriller"],
        ["We are learning NLP throughg GeeksforGeeks"],
        ["The baby learned to walk in the 5th month itself"]]
 
for sent in sentences:
    similarity_score = 1-distance.cosine(test_vec[0,:],embed(sent)[0,:])
    print(f'\nFor {sent}\nSimilarity Score = {similarity_score} ')

                    

Output

Test Sentence: ['I liked the movie very much']

For ['The movie is awesome and It was a good thriller']
Similarity Score = 0.6519516706466675

For ['We are learning NLP throughg GeeksforGeeks']
Similarity Score = 0.06988027691841125

For ['The baby learned to walk in the 5th month itself']
Similarity Score = -0.01121298223733902

Conclusion

In this article, we explored semantic similarity and its applications. We looked at the architectures of four widely used sentence embedding models for calculating the semantic similarity of sentences, along with their implementation in Python.

