Python | Similarity metrics of strings

Measuring how similar two strings are is a common requirement in many areas of computer science, including machine learning, AI, and web development, so it is useful to know several ways to compute it in Python. Let's discuss certain ways in which this can be done.

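Method #1 : Using the naive approach (sum() + zip())

The naive approach compares the two strings position by position. The shorter string is padded with spaces to the length of the longer one, zip() pairs up the characters, sum() counts the positions where they match, and the count is divided by the padded length. Because the comparison is positional, a single inserted or deleted character shifts everything after it and lowers the score sharply.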

# Python3 code to demonstrate
# similarity between strings
# using naive method (sum() + zip())
 
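# Utility function to compute similarity
def similar(str1, str2):
    # Pad the shorter string with spaces so both have the same length.
    # A negative repeat count yields '', so the longer string is unchanged.
    str1 = str1 + ' ' * (len(str2) - len(str1))
    str2 = str2 + ' ' * (len(str1) - len(str2))
    # Fraction of positions at which the characters match
    return sum(1 if i == j else 0
               for i, j in zip(str1, str2)) / len(str1)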
 
# Initializing strings
test_string1 = 'Geeksforgeeks'
test_string2 = 'Geeks4geeks'
 
# using naive method (sum() + zip())
# similarity between strings
res = similar(test_string1, test_string2)
 
# printing the result
print("The similarity between 2 strings is : " + str(res))

Output : 
The similarity between 2 strings is : 0.38461538461538464

Time Complexity: O(n), where n is the length of the longer input string.
Auxiliary Space: O(n), because padding creates a new copy of the shorter string at the length of the longer one.
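The padding step can be avoided. The sketch below is one possible variant, not part of the original method: it uses itertools.zip_longest so that positions past the end of the shorter string are paired with None and never count as matches. The function name positional_similarity and the choice to treat two empty strings as identical are assumptions made for illustration.

```python
from itertools import zip_longest


def positional_similarity(str1, str2):
    """Fraction of positions at which the two strings hold the same character."""
    longest = max(len(str1), len(str2))
    if longest == 0:
        # Assumption: two empty strings are treated as identical.
        return 1.0
    # zip_longest fills the shorter string with None, which never equals a character.
    matches = sum(a == b for a, b in zip_longest(str1, str2))
    return matches / longest


print(positional_similarity('Geeksforgeeks', 'Geeks4geeks'))
```

Unlike space padding, this variant does not count a trailing space in the longer string as a match against the padding.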

Method #2 : Using SequenceMatcher.ratio()

The standard library's difflib module provides the SequenceMatcher class, whose ratio() method returns a similarity score between 0 and 1. It is generally the recommended choice because it needs no custom logic and, unlike the positional comparison in Method #1, it finds matching blocks even when they are shifted by insertions or deletions.

# Python3 code to demonstrate
# similarity between strings
# using SequenceMatcher.ratio()
from difflib import SequenceMatcher
 
# Utility function to compute similarity
def similar(str1, str2):
    return SequenceMatcher(None, str1, str2).ratio()
 
# Initializing strings
test_string1 = 'Geeksforgeeks'
test_string2 = 'Geeks'
 
# using SequenceMatcher.ratio()
# similarity between strings
res = similar(test_string1, test_string2)
 
# printing the result
print("The similarity between 2 strings is : " + str(res))

Output : 
The similarity between 2 strings is : 0.5555555555555556
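To see where this number comes from: the difflib documentation defines ratio() as 2.0 * M / T, where M is the number of matched elements and T is the total number of elements in both sequences. The sketch below recomputes that value from get_matching_blocks(). The helper name ratio_by_hand is hypothetical and is used only for illustration.

```python
from difflib import SequenceMatcher


def ratio_by_hand(str1, str2):
    """Recompute SequenceMatcher.ratio() as 2.0 * M / T."""
    matcher = SequenceMatcher(None, str1, str2)
    # M: total size of all matching blocks (the final dummy block has size 0).
    matches = sum(block.size for block in matcher.get_matching_blocks())
    # T: combined length of both strings.
    total = len(str1) + len(str2)
    # ratio() returns 1.0 when both sequences are empty.
    return 2.0 * matches / total if total else 1.0


print(ratio_by_hand('Geeksforgeeks', 'Geeks'))
print(SequenceMatcher(None, 'Geeksforgeeks', 'Geeks').ratio())
```

Here the only matching block is 'Geeks' (M = 5) and the strings have 13 + 5 = 18 characters in total, so both lines print 10/18.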

Method #3 : Using difflib.ndiff

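You can also use the ndiff function from the difflib library to compare two strings and compute their similarity. ndiff returns a generator of delta lines. When it is given two strings, each character is treated as one item, and each delta line starts with a two-character code: "- " for a character found only in the first string, "+ " for a character found only in the second, and two spaces for a character common to both. Counting these codes gives a simple similarity measure.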

Here is an example of how you can use the ndiff function to compute the similarity between two strings:




import difflib
# Compute the similarity between two strings using ndiff from difflib.
def compute_similarity(input_string, reference_string):
    # ndiff yields one delta line per character, prefixed with "- ", "+ " or "  ".
    diff = difflib.ndiff(input_string, reference_string)
    diff_count = 0
    for line in diff:
        # A "- " prefix marks a character that appears only in input_string,
        # i.e. one that must be deleted to reach reference_string.
        if line.startswith("- "):
            diff_count += 1
    # Similarity is 1 minus the fraction of input_string's characters that were deleted.
    return 1 - (diff_count / len(input_string))
 
input_string = "Geeksforgeeks"
reference_string = "Geeks4geeks"
similarity = compute_similarity(input_string, reference_string)
print(similarity)
#This code is contributed by Edula Vinay Kumar Reddy

Output
0.7692307692307692

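The output of this code is a floating point number between 0 and 1. A value of 0 means every character of the input string had to be deleted, and a value of 1 means none had to be. Note that the measure counts only deletions, so it is asymmetric: it also returns 1 when the reference string merely adds characters to the input string, not only when the two strings are identical. It also requires a non-empty input string, since the result is divided by its length.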
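Because only deleted characters are counted, the score depends on the order of the arguments. The sketch below restates the function above under the hypothetical name deleted_fraction_similarity and calls it in both directions to illustrate this. It is an illustration of the property, not a recommended metric.

```python
import difflib


def deleted_fraction_similarity(input_string, reference_string):
    """1 minus the fraction of input_string's characters that ndiff marks as deleted."""
    deleted = sum(
        1
        for line in difflib.ndiff(input_string, reference_string)
        if line.startswith("- ")
    )
    return 1 - deleted / len(input_string)


# Nothing has to be deleted from "Geeks" to reach "Geeksforgeeks".
forward = deleted_fraction_similarity("Geeks", "Geeksforgeeks")
# Eight of the thirteen characters have to be deleted in the other direction.
backward = deleted_fraction_similarity("Geeksforgeeks", "Geeks")
print(forward, backward)
```

If a symmetric score is needed, SequenceMatcher.ratio() from Method #2 or the Levenshtein-based score from Method #4 is a better fit.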

Method #4 : Using the Levenshtein distance algorithm

This algorithm calculates the minimum number of edits (insertions, deletions, or substitutions) required to transform one string into the other.

  1. Define the levenshtein_distance() function that takes two strings, s and t, as input and returns their Levenshtein distance.
  2. Get the lengths of the two input strings using the len() function and store them in variables m and n.
  3. Check if m is less than n. If it is, swap the strings s and t, and also swap m and n.
  4. Create a 2D list d of size (m+1)x(n+1) to store the distances between all pairs of prefixes of the two input strings.
  5. Initialize the first row of d with values 0 to n.
  6. Initialize the first column of d with values 0 to m.
  7. Use nested loops to compute the distances between all pairs of prefixes of the two input strings.
  8. For each pair of prefixes (i,j) of s and t, if the i-th character of s is equal to the j-th character of t, then d[i][j] is the same as d[i-1][j-1], because no additional edit is needed. Otherwise, d[i][j] is the minimum of d[i-1][j] (deletion), d[i][j-1] (insertion), and d[i-1][j-1] (substitution), plus 1.
  9. The final value in d at position (m,n) is the Levenshtein distance between the two input strings, which we return from the function.
  10. Define the compute_similarity() function that takes two strings, input_string and reference_string, as input.
  11. Call the levenshtein_distance() function with input_string and reference_string as arguments, and store the distance in a variable distance.
  12. Get the maximum length of the two input strings using the max() function and store it in a variable max_length.
  13. Calculate the similarity between the two input strings as 1 - (distance / max_length).
  14. Return the similarity from the compute_similarity() function.
  15. Define input_string and reference_string to be the two strings to be compared.
  16. Call the compute_similarity() function with input_string and reference_string as arguments, and store the similarity in a variable similarity.
  17. Print the similarity using the print() function.




def levenshtein_distance(s, t):
    m, n = len(s), len(t)
    if m < n:
        s, t = t, s
        m, n = n, m
    d = [list(range(n + 1))] + [[i] + [0] * n for i in range(1, m + 1)]
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1]) + 1
    return d[m][n]
 
def compute_similarity(input_string, reference_string):
    distance = levenshtein_distance(input_string, reference_string)
    max_length = max(len(input_string), len(reference_string))
    similarity = 1 - (distance / max_length)
    return similarity
 
input_string = "Geeksforgeeks"
reference_string = "Geeks4geeks"
similarity = compute_similarity(input_string, reference_string)
print(similarity)

Output
0.7692307692307692

Time complexity: O(mn), where m and n are the lengths of the two input strings.
Auxiliary space: O(mn), as we use a matrix of size (m+1)x(n+1) to store the distances between all pairs of prefixes of the two input strings.
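The O(mn) space can be reduced, since each row of the matrix depends only on the row before it. The following is a minimal sketch of that well-known two-row variant, not the implementation shown above. The function name levenshtein_two_rows is hypothetical. It keeps only the previous and current rows, so the auxiliary space drops to O(min(m, n)) while the time stays O(mn).

```python
def levenshtein_two_rows(s, t):
    """Levenshtein distance keeping only two rows of the DP table."""
    # Make t the shorter string so each row has min(m, n) + 1 entries.
    if len(s) < len(t):
        s, t = t, s
    # Row 0: distance from the empty prefix of s to each prefix of t.
    previous = list(range(len(t) + 1))
    for i, char_s in enumerate(s, start=1):
        # First column: deleting the first i characters of s.
        current = [i]
        for j, char_t in enumerate(t, start=1):
            cost = 0 if char_s == char_t else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


print(levenshtein_two_rows("Geeksforgeeks", "Geeks4geeks"))  # prints 3
```

The trade-off is that the full matrix is no longer available, so the sequence of edits cannot be reconstructed, only the distance itself.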

