Open In App

Top K Nearest Words using Edit Distances in NLP

We can find the Top K nearest matching words to the given query input word by using the concept of Edit/Levenstein distance. If any word is the same word as the query input(word) then their Edit distance would be zero and is a perfect match and so on. So we can find the Edit distances between the query word and the words present in our vocabulary pairwise and then sort them according to their edit distances. This will help us arrange the words in the vocabulary in the order of their nearest match to the query. This is very useful to solve Information Retrieval tasks.

Code implementations

If we have a vocabulary of words, then we can find the K nearest matching words to the query with the help of Edit distance. Here Vocabulary can be a List of Strings.



Steps:




# code for Edit Distance
def edit_Distance(s1, s2):
    r, c = len(s1), len(s2)
    dp = [[-1 for j in range(0, c+1)]for i in range(0, r+1)]
    for i in range(0, r+1):
        dp[i][0] = i
    for j in range(0, c+1):
        dp[0][j] = j
    for i in range(1, r+1):
        for j in range(1, c+1):
            if(s1[i-1] == s2[j-1]):
                dp[i][j] = dp[i-1][j-1]
            else:
                dp[i][j] = 1+min(dp[i-1][j], dp[i][j-1], dp[i-1][j-1])
    return dp[r]
  
  
# code for finding the Top k matching words
# here s1 is the query input
def k_nearest_word(s1, vocabulary, k):
    k_nearest_word = []
    word_dist = {}
    for word in vocabulary:
        word_dist[word] = edit_Distance(s1, word)
    word_dist = dict(sorted(word_dist.items(), key=lambda item: item[1]))
    ans = []
    li = list(word_dist.keys())
    for i in range(0, k):
        ans.append(li[i])
    return ans
  
  
# for Testing purposes
text = '''
Pytorch is an open-source deep learning framework available with a Python and C++ interface. 
it resides inside the torch module. 
the data that has to be processed is input in the form of a tensor.
that allows you to develop and train deep learning models. 
However, as the size and complexity of your models grow, 
the time it takes to train them can become prohibitive.
  
'''
# vocabulary=['This','is','salt','saw','Bat','salty','salts']
vocabulary = text.split()
query = 'Pytorch'
k = 3  # top k nearest matching
li = k_nearest_word(query, vocabulary, k)
print(li)

Output:

['Pytorch', 'torch', 'Python']

Using Levenshtein library

Steps:




import Levenshtein
  
def find_top_k_nearest_words(query_input, words_list, k):
    distances = []
    for word in words_list:
        distance = Levenshtein.distance(query_input, word)
        distances.append((word, distance))
    sorted_distances = sorted(distances, key=lambda x: x[1])
    return [d[0] for d in sorted_distances[:k]]
  
nearest_words = find_top_k_nearest_words(query, vocabulary, k)
print(nearest_words)

Output:



['Pytorch', 'torch', 'Python']

Article Tags :