Open In App

Top K Nearest Words using Edit Distances in NLP

Last Updated : 08 Jun, 2023
Improve
Improve
Like Article
Like
Save
Share
Report

We can find the Top K nearest matching words to the given query input word by using the concept of Edit/Levenstein distance. If any word is the same word as the query input(word) then their Edit distance would be zero and is a perfect match and so on. So we can find the Edit distances between the query word and the words present in our vocabulary pairwise and then sort them according to their edit distances. This will help us arrange the words in the vocabulary in the order of their nearest match to the query. This is very useful to solve Information Retrieval tasks.

Code implementations

If we have a vocabulary of words, then we can find the K nearest matching words to the query with the help of Edit distance. Here Vocabulary can be a List of Strings.

Steps:

  • Create a function to find the edit distance between two words.
  • Create a second function to find the k nearest word. It uses the edit distance function to measure the distance between two words.
  • For the vocabulary, we are using some text and uses text.split() command to store as the list of words.
  • In the k nearest word-defined function input the query or word, vocabulary, and the number of the nearest word we want to find.

Python3




# code for Edit Distance
def edit_Distance(s1, s2):
    r, c = len(s1), len(s2)
    dp = [[-1 for j in range(0, c+1)]for i in range(0, r+1)]
    for i in range(0, r+1):
        dp[i][0] = i
    for j in range(0, c+1):
        dp[0][j] = j
    for i in range(1, r+1):
        for j in range(1, c+1):
            if(s1[i-1] == s2[j-1]):
                dp[i][j] = dp[i-1][j-1]
            else:
                dp[i][j] = 1+min(dp[i-1][j], dp[i][j-1], dp[i-1][j-1])
    return dp[r]
  
  
# code for finding the Top k matching words
# here s1 is the query input
def k_nearest_word(s1, vocabulary, k):
    k_nearest_word = []
    word_dist = {}
    for word in vocabulary:
        word_dist[word] = edit_Distance(s1, word)
    word_dist = dict(sorted(word_dist.items(), key=lambda item: item[1]))
    ans = []
    li = list(word_dist.keys())
    for i in range(0, k):
        ans.append(li[i])
    return ans
  
  
# for Testing purposes
text = '''
Pytorch is an open-source deep learning framework available with a Python and C++ interface. 
it resides inside the torch module. 
the data that has to be processed is input in the form of a tensor.
that allows you to develop and train deep learning models. 
However, as the size and complexity of your models grow, 
the time it takes to train them can become prohibitive.
  
'''
# vocabulary=['This','is','salt','saw','Bat','salty','salts']
vocabulary = text.split()
query = 'Pytorch'
k = 3  # top k nearest matching
li = k_nearest_word(query, vocabulary, k)
print(li)


Output:

['Pytorch', 'torch', 'Python']

Using Levenshtein library

Steps:

  • Import the Levenshtein library. We will use it to find the distance between the two strings.
  • Create a second function to find the k nearest word. It uses the Levenshtein.distance function to measure the distance between two words.
  • For the vocabulary, we are using some text and uses text.split() command to store as the list of words.
  • In the k nearest word-defined function input the query or word, vocabulary, and the number of the nearest word we want to find.

Python3




import Levenshtein
  
def find_top_k_nearest_words(query_input, words_list, k):
    distances = []
    for word in words_list:
        distance = Levenshtein.distance(query_input, word)
        distances.append((word, distance))
    sorted_distances = sorted(distances, key=lambda x: x[1])
    return [d[0] for d in sorted_distances[:k]]
  
nearest_words = find_top_k_nearest_words(query, vocabulary, k)
print(nearest_words)


Output:

['Pytorch', 'torch', 'Python']


Like Article
Suggest improvement
Share your thoughts in the comments

Similar Reads