Top K Nearest Words using Edit Distances in NLP

We can find the Top K nearest matching words to the given query input word by using the concept of Edit/Levenstein distance. If any word is the same word as the query input(word) then their Edit distance would be zero and is a perfect match and so on. So we can find the Edit distances between the query word and the words present in our vocabulary pairwise and then sort them according to their edit distances. This will help us arrange the words in the vocabulary in the order of their nearest match to the query. This is very useful to solve Information Retrieval tasks.

Code implementations

If we have a vocabulary of words, then we can find the K nearest matching words to the query with the help of Edit distance. Here Vocabulary can be a List of Strings.

Steps:

Create a function to find the edit distance between two words.
Create a second function to find the k nearest word. It uses the edit distance function to measure the distance between two words.
For the vocabulary, we are using some text and uses text.split() command to store as the list of words.
In the k nearest word-defined function input the query or word, vocabulary, and the number of the nearest word we want to find.

Python3

# code for Edit Distance 

def edit_Distance(s1, s2): 

    r, c = len(s1), len(s2) 

    dp = [[-1 for j in range(0, c+1)]for i in range(0, r+1)] 

    for i in range(0, r+1): 

        dp[i][0] = i 

    for j in range(0, c+1): 

        dp[0][j] = j 

    for i in range(1, r+1): 

        for j in range(1, c+1): 

            if(s1[i-1] == s2[j-1]): 

                dp[i][j] = dp[i-1][j-1] 

            else: 

                dp[i][j] = 1+min(dp[i-1][j], dp[i][j-1], dp[i-1][j-1]) 

    return dp[r] 

# code for finding the Top k matching words 
# here s1 is the query input 

def k_nearest_word(s1, vocabulary, k): 

    k_nearest_word = [] 

    word_dist = {} 

    for word in vocabulary: 

        word_dist[word] = edit_Distance(s1, word) 

    word_dist = dict(sorted(word_dist.items(), key=lambda item: item[1])) 

    ans = [] 

    li = list(word_dist.keys()) 

    for i in range(0, k): 

        ans.append(li[i]) 

    return ans 

# for Testing purposes 

text = ''' 
Pytorch is an open-source deep learning framework available with a Python and C++ interface.  
it resides inside the torch module.  
the data that has to be processed is input in the form of a tensor. 
that allows you to develop and train deep learning models.  
However, as the size and complexity of your models grow,  
the time it takes to train them can become prohibitive. 

'''
# vocabulary=['This','is','salt','saw','Bat','salty','salts'] 

vocabulary = text.split() 

query = 'Pytorch'

k = 3  # top k nearest matching 

li = k_nearest_word(query, vocabulary, k) 

print(li)

Output:

['Pytorch', 'torch', 'Python']

Using Levenshtein library

Steps:

Import the Levenshtein library. We will use it to find the distance between the two strings.
Create a second function to find the k nearest word. It uses the Levenshtein.distance function to measure the distance between two words.
For the vocabulary, we are using some text and uses text.split() command to store as the list of words.
In the k nearest word-defined function input the query or word, vocabulary, and the number of the nearest word we want to find.

Python3

import Levenshtein 

def find_top_k_nearest_words(query_input, words_list, k): 

    distances = [] 

    for word in words_list: 

        distance = Levenshtein.distance(query_input, word) 

        distances.append((word, distance)) 

    sorted_distances = sorted(distances, key=lambda x: x[1]) 

    return [d[0] for d in sorted_distances[:k]] 

nearest_words = find_top_k_nearest_words(query, vocabulary, k) 

print(nearest_words)

Output:

['Pytorch', 'torch', 'Python']

Article Tags :

AI-ML-DS

Machine Learning

NLP

Python

Natural-language-processing