Dynamic Programming | Set 5 (Edit Distance)
Continuing the dynamic programming series, edit distance is an interesting problem.
Problem: Given two strings of lengths m and n and a set of operations, replace (R), insert (I) and delete (D), all at equal cost, find the minimum number of edits (operations) required to convert one string into the other.
Identifying Recursive Methods:
What is the subproblem in this case? Consider the edit distance between prefixes of the two strings. Denote the prefixes as [1...i] and [1...j] for some 1 <= i <= m and 1 <= j <= n. This is a smaller instance of the original problem; denote its cost as E(i, j). Our goal is to find E(m, n), the minimum cost for the full strings.
In the prefixes, we can align the rightmost characters in three ways: (i, -), (-, j) and (i, j). The hyphen (-) represents no character. An example makes this clearer.
Take the strings SUNDAY and SATURDAY. We want to convert SUNDAY into SATURDAY with minimum edits. Let us pick i = 2 and j = 4, i.e. the prefixes are SU and SATU respectively (assume string indices start at 1). The rightmost characters can be aligned in three different ways.
Case 1: Align characters U and U. They are equal, so no edit is required (had they differed, one replace would be needed). We are left with the problem of i = 1 and j = 3, E(i-1, j-1).
Case 2: Align the rightmost character of the first string with no character from the second string. We need a deletion (D) here. We are left with the problem of i = 1 and j = 4, E(i-1, j).
Case 3: Align the rightmost character of the second string with no character from the first string. We need an insertion (I) here. We are left with the problem of i = 2 and j = 3, E(i, j-1).
Combining all the subproblems, the minimum cost of aligning the prefixes ending at i and j is given by
E(i, j) = min( [E(i-1, j) + D], [E(i, j-1) + I], [E(i-1, j-1) + R if the characters at i and j differ, E(i-1, j-1) if they are the same] )
We are not done yet. What are the base cases?
When both strings are empty, the cost is 0. When only one of the strings is empty, the cost is the length of the other string, since every one of its characters must be inserted or deleted. Mathematically,
E(0, 0) = 0, E(i, 0) = i, E(0, j) = j
Now it is easy to complete the recursive method. Go through the code for the recursive algorithm (EditDistanceRecursion).
Dynamic Programming Method:
We can calculate the complexity of recursive expression fairly easily.
T(m, n) = T(m-1, n-1) + T(m, n-1) + T(m-1, n) + C
The complexity of T(m, n) can be calculated by the successive substitution method or by solving a homogeneous equation in two variables. The result is an exponential-time algorithm. It is evident from the recursion tree that the same subproblems are solved again and again. Some strings result in many overlapping subproblems (try the program below with the strings exponential and polynomial and note the delay in the recursive method).
We can tabulate the repeating subproblems and look them up when they are required again (bottom up). A two-dimensional array indexed by the prefixes of the two strings keeps track of the minimum cost up to the current character comparison. The visualization code will help in understanding the construction of the matrix.
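The same lookup idea can also be applied top-down, by caching the results of the recursion (memoization). This is a close relative of the bottom-up table, not the method used in the program further down. A minimal sketch, assuming unit costs; the names min3, solve and edit_distance_memo are hypothetical:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper names; unit cost for every operation is assumed. */
static int min3(int a, int b, int c)
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* memo has (n + 1) columns; -1 marks a subproblem not solved yet. */
static int solve(const char *X, const char *Y, int i, int j, int *memo, int n)
{
    if (i == 0) return j;            /* E(0, j) = j */
    if (j == 0) return i;            /* E(i, 0) = i */

    int *slot = &memo[i * (n + 1) + j];
    if (*slot != -1)
        return *slot;                /* already computed: reuse it */

    *slot = min3(solve(X, Y, i - 1, j, memo, n) + 1,      /* delete  */
                 solve(X, Y, i, j - 1, memo, n) + 1,      /* insert  */
                 solve(X, Y, i - 1, j - 1, memo, n)
                     + (X[i - 1] != Y[j - 1]));           /* replace */
    return *slot;
}

int edit_distance_memo(const char *X, const char *Y)
{
    int m = (int)strlen(X);
    int n = (int)strlen(Y);
    size_t cells = (size_t)(m + 1) * (size_t)(n + 1);
    int *memo = malloc(cells * sizeof *memo);
    assert(memo != NULL);
    for (size_t k = 0; k < cells; k++)
        memo[k] = -1;

    int cost = solve(X, Y, m, n, memo, n);
    free(memo);
    return cost;
}
```

Each of the (m+1)(n+1) subproblems is solved at most once, so the running time drops to O(mn) at the price of O(mn) memory for the cache.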
The time complexity of the dynamic programming method is O(mn), as we need to construct the table fully. The space complexity is also O(mn). If only the edit cost is needed (and not the alignment itself), O(min(m, n)) space suffices, because only the current row and the previous row have to be kept.
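The two-row idea can be sketched as follows. This is an illustration under the equal-cost assumption, not the program given below; the function name edit_distance_two_rows is hypothetical. Swapping the strings so that the shorter one indexes the columns is what brings the space down to O(min(m, n)).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical function name; unit cost for every operation is assumed. */
int edit_distance_two_rows(const char *X, const char *Y)
{
    size_t m = strlen(X);
    size_t n = strlen(Y);

    /* Keep the rows short: make Y the shorter string. With equal unit
       costs the distance is symmetric, so swapping is safe. */
    if (n > m) {
        const char *ts = X; X = Y; Y = ts;
        size_t tl = m; m = n; n = tl;
    }

    int *prev = malloc((n + 1) * sizeof *prev);
    int *curr = malloc((n + 1) * sizeof *curr);
    assert(prev != NULL && curr != NULL);

    for (size_t j = 0; j <= n; j++)
        prev[j] = (int)j;                  /* row 0: E(0, j) = j */

    for (size_t i = 1; i <= m; i++) {
        curr[0] = (int)i;                  /* column 0: E(i, 0) = i */
        for (size_t j = 1; j <= n; j++) {
            int del = prev[j] + 1;
            int ins = curr[j - 1] + 1;
            int rep = prev[j - 1] + (X[i - 1] != Y[j - 1]);
            int best = del < ins ? del : ins;
            curr[j] = best < rep ? best : rep;
        }
        int *tmp = prev; prev = curr; curr = tmp;   /* reuse both buffers */
    }

    int cost = prev[n];                    /* last completed row */
    free(prev);
    free(curr);
    return cost;
}
```

The trade-off is that the sequence of edits can no longer be recovered, since the earlier rows are overwritten.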
Usually the costs D, I and R are not the same. In that case the problem can be represented as a directed acyclic graph (DAG) with a weight on each edge, and the shortest path gives the edit distance.
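Unequal costs can also be handled directly in the table: filling it row by row amounts to relaxing the edges of that DAG in topological order. A minimal sketch, assuming non-negative integer costs; the EditCosts struct and the function name weighted_edit_distance are hypothetical and exist only for illustration:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type and function names, used only for illustration. */
typedef struct {
    int insert_cost;
    int delete_cost;
    int replace_cost;
} EditCosts;

/* Cost of converting X into Y under the given per-operation costs. */
int weighted_edit_distance(const char *X, const char *Y, EditCosts c)
{
    int m = (int)strlen(X);
    int n = (int)strlen(Y);
    int cols = n + 1;
    int *T = malloc((size_t)(m + 1) * (size_t)cols * sizeof *T);
    assert(T != NULL);

    /* Base cases now scale with the cost of the operation involved. */
    for (int i = 0; i <= m; i++) T[i * cols] = i * c.delete_cost;
    for (int j = 0; j <= n; j++) T[j] = j * c.insert_cost;

    for (int i = 1; i <= m; i++) {
        for (int j = 1; j <= n; j++) {
            int del = T[(i - 1) * cols + j] + c.delete_cost;
            int ins = T[i * cols + (j - 1)] + c.insert_cost;
            int rep = T[(i - 1) * cols + (j - 1)]
                    + (X[i - 1] == Y[j - 1] ? 0 : c.replace_cost);
            int best = del < ins ? del : ins;
            T[i * cols + j] = best < rep ? best : rep;
        }
    }

    int cost = T[m * cols + n];
    free(T);
    return cost;
}
```

With insert = 1, delete = 2 and replace = 5, converting "a" into "b" costs 3 (a delete plus an insert) because that is cheaper than a single replace. Note that with unequal insert and delete costs the distance is no longer symmetric.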
Applications:
There are many practical applications of the edit distance algorithm; refer to the Lucene API for a sample. Another example: display all the words in a dictionary that are in near proximity to a given word/incorrectly spelled word.
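The dictionary example could look roughly like the sketch below, which scans a word list and keeps the entries within a chosen distance of the query. It is a brute-force illustration, not how Lucene does it; the names edit_distance and near_words and the max_dist threshold are assumptions made for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Unit-cost edit distance using a single row (hypothetical helper). */
static int edit_distance(const char *X, const char *Y)
{
    size_t m = strlen(X), n = strlen(Y);
    int *row = malloc((n + 1) * sizeof *row);
    assert(row != NULL);

    for (size_t j = 0; j <= n; j++)
        row[j] = (int)j;

    for (size_t i = 1; i <= m; i++) {
        int diag = row[0];                 /* E(i-1, j-1) */
        row[0] = (int)i;
        for (size_t j = 1; j <= n; j++) {
            int up = row[j];               /* E(i-1, j), before overwrite */
            int best = up + 1;             /* delete */
            if (row[j - 1] + 1 < best)
                best = row[j - 1] + 1;     /* insert */
            int rep = diag + (X[i - 1] != Y[j - 1]);
            if (rep < best)
                best = rep;                /* replace or match */
            row[j] = best;
            diag = up;
        }
    }

    int cost = row[n];
    free(row);
    return cost;
}

/* Stores pointers to the dictionary words within max_dist of query in out
   (which must have room for dict_len entries) and returns how many. */
size_t near_words(const char *query, const char *const dict[], size_t dict_len,
                  int max_dist, const char *out[])
{
    size_t found = 0;
    for (size_t k = 0; k < dict_len; k++) {
        if (edit_distance(query, dict[k]) <= max_dist)
            out[found++] = dict[k];
    }
    return found;
}
```

For the misspelled query "sundey" and the word list {"sunday", "monday", "saturday", "sundae"}, a threshold of 1 keeps only "sunday", while a threshold of 2 also keeps "sundae".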
// Dynamic Programming implementation of edit distance
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Change these strings to test the program
#define STRING_X "SUNDAY"
#define STRING_Y "SATURDAY"

#define SENTINEL (-1)
#define EDIT_COST (1)

// static, so that the C99 inline function always has a definition to link
static inline int min(int a, int b)
{
    return a < b ? a : b;
}

// Returns Minimum among a, b, c
int Minimum(int a, int b, int c)
{
    return min(min(a, b), c);
}

// Returns the cost of converting X into Y.
// The table T has strlen(X)+1 rows and strlen(Y)+1 columns;
// row 0 and column 0 hold the base cases for the empty prefix.
int EditDistanceDP(char X[], char Y[])
{
    // Cost of alignment
    int cost = 0;
    int leftCell, topCell, cornerCell;

    int m = strlen(X)+1;
    int n = strlen(Y)+1;

    // T[m][n]
    int *T = (int *)malloc(m * n * sizeof(int));
    if(T == NULL)
        return SENTINEL; // allocation failed

    // Initialize table
    for(int i = 0; i < m; i++)
        for(int j = 0; j < n; j++)
            *(T + i * n + j) = SENTINEL;

    // Set up base cases
    // T[i][0] = i
    for(int i = 0; i < m; i++)
        *(T + i * n) = i;

    // T[0][j] = j
    for(int j = 0; j < n; j++)
        *(T + j) = j;

    // Fill T row by row, starting from the base cases (bottom-up)
    for(int i = 1; i < m; i++)
    {
        for(int j = 1; j < n; j++)
        {
            // T[i][j-1]: insert Y[j-1]
            leftCell = *(T + i*n + j-1);
            leftCell += EDIT_COST; // insertion

            // T[i-1][j]: delete X[i-1]
            topCell = *(T + (i-1)*n + j);
            topCell += EDIT_COST; // deletion

            // Top-left (corner) cell
            // T[i-1][j-1]
            cornerCell = *(T + (i-1)*n + (j-1) );

            // Add 0 if X[i-1] == Y[j-1], 1 otherwise.
            // The table indices are one ahead of the string indices.
            cornerCell += (X[i-1] != Y[j-1]); // may be replace

            // Minimum cost of current cell
            // Fill in the next cell T[i][j]
            *(T + (i)*n + (j)) = Minimum(leftCell, topCell, cornerCell);
        }
    }

    // Cost is in the last cell, T[m-1][n-1]
    cost = *(T + m*n - 1);
    free(T);
    return cost;
}

// Recursive implementation; m and n are the lengths of the prefixes of X and Y
int EditDistanceRecursion( char *X, char *Y, int m, int n )
{
    // Base cases
    if( m == 0 && n == 0 )
        return 0;

    if( m == 0 )
        return n;

    if( n == 0 )
        return m;

    // Recurse
    int left = EditDistanceRecursion(X, Y, m-1, n) + 1;
    int right = EditDistanceRecursion(X, Y, m, n-1) + 1;
    // The last characters of the prefixes are X[m-1] and Y[n-1]
    int corner = EditDistanceRecursion(X, Y, m-1, n-1) + (X[m-1] != Y[n-1]);

    return Minimum(left, right, corner);
}

int main()
{
    char X[] = STRING_X; // vertical
    char Y[] = STRING_Y; // horizontal

    printf("Minimum edits required to convert %s into %s is %d\n",
           X, Y, EditDistanceDP(X, Y) );
    printf("Minimum edits required to convert %s into %s is %d by recursion\n",
           X, Y, EditDistanceRecursion(X, Y, (int)strlen(X), (int)strlen(Y)));

    return 0;
}
— Venki. Please write comments if you find anything incorrect, or you want to share more information about the topic discussed above.
The program gives wrong output for
s1 = "hello"
s2 = "hellooo"
The output is:
Minimum edits required to convert hello into hellooo is 2
Minimum edits required to convert hello into hellooo is 5 by recursion
There is a problem in the "Minimum" function, so the DP and recursive approaches give different answers.
Please change it as follows:
int Minimum(int a, int b, int c)
{
    int min;
    if( a <= b && a <= c )
        min = a;
    else if( b <= a && b <= c )
        min = b;
    else
        min = c;
    return min;
}
Thanks both of you for pointing out the error. Code is updated.
Can we think of applying these operations only under certain conditions? For example, insert or delete can give min cost if l1l2 delete or replace may be beneficial... Please do reply.
What if we convert "SATURDAY" to "SUNDAY"? Results in both the methods used above are different.
i think in the base cases E(i,0) it should be like i*EDIT_COST instead of i
Yes. In the current program we took all edit operations of same cost.
Correct me if I am wrong, but can this question be solved by first finding the largest common sub-sequence and then subtracting it from the length of the greater string?
No, that will not always lead to optimum alignment.
Can you specify an example? I cannot get my head around this.
Manak, you can take another example given in the content. Consider the words
exponential - ponil = exent
polynomial - ponil = lyom
But the ED(exponential, polynomial) != ED(exent, lyom), here ED stands for Edit Distance.
Practice with few examples, if still not clear, let me know. I would need some time for detailed explanation.
Don't get it. Can you give a more detailed example including how to edit? Thank you.
Got it.
An example is:
abcd
cde
LCS=2 "cd"
but edit distance is 3
In fact, if only deletion and insertion are allowed, then LCS can be applied here.
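For that insert/delete-only case the distance works out to m + n - 2 * LCS(X, Y): every character outside a longest common subsequence is either deleted from the first string or inserted from the second. A rough sketch in C, with hypothetical function names lcs_length and indel_distance:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Length of a longest common subsequence, using a single row
   (hypothetical helper name). */
int lcs_length(const char *X, const char *Y)
{
    size_t m = strlen(X), n = strlen(Y);
    int *row = calloc(n + 1, sizeof *row);   /* row 0 is all zeros */
    assert(row != NULL);

    for (size_t i = 1; i <= m; i++) {
        int diag = 0;                        /* L(i-1, 0) = 0 */
        for (size_t j = 1; j <= n; j++) {
            int up = row[j];                 /* L(i-1, j), before overwrite */
            if (X[i - 1] == Y[j - 1])
                row[j] = diag + 1;           /* extend the common subsequence */
            else
                row[j] = up > row[j - 1] ? up : row[j - 1];
            diag = up;
        }
    }

    int len = row[n];
    free(row);
    return len;
}

/* Edit distance when only insert and delete (unit cost each) are allowed. */
int indel_distance(const char *X, const char *Y)
{
    int m = (int)strlen(X);
    int n = (int)strlen(Y);
    return m + n - 2 * lcs_length(X, Y);
}
```

For abcd and cde the LCS is "cd" (length 2), so the insert/delete-only distance is 4 + 3 - 2 * 2 = 3: delete a, delete b, insert e.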
The following line should be changed
to
since we are at i,j we should be comparing x[i] and y[j]. What say?
Thanks for comment. The indexing is not an error. Please read the content. We use table of size m+1 x n+1. The indices i and j and one step ahead of the string location, so we need to subtract 1.
import java.util.Scanner;

/**
 * @author saurabh
 */
public class EditDistanceDPP {
    char[] s1, s2;

    public EditDistanceDPP() {
        Scanner sc = new Scanner(System.in);
        s1 = sc.nextLine().toCharArray();
        s2 = sc.nextLine().toCharArray();
        System.out.println("Edit distance is : " + editDistance(s1, s2));
    }

    private int editDistance(char[] st1, char[] st2) {
        int[][] s = new int[s1.length + 1][s2.length + 1];
        for (int i = 0; i <= s1.length; i++) {
            for (int j = 0; j <= s2.length; j++) {
                if (i == 0)
                    s[i][j] = j;
                else if (j == 0)
                    s[i][j] = i;
                else
                    s[i][j] = min(s[i-1][j-1] + (st1[i-1] == st2[j-1] ? 0 : 1),
                                  s[i-1][j] + 1,
                                  s[i][j-1] + 1);
            }
        }
        return s[s1.length][s2.length];
    }

    int min(int a, int b, int c) {
        return (a < b ? (a < c ? a : c) : (b < c ? b : c));
    }

    public static void main(String[] args) {
        EditDistanceDPP edd = new EditDistanceDPP();
    }
}
This is a quite simple Dynamic Programming approach with time complexity as O(m*n) and space complexity also as O(m*n)....
Correct me..if anything is wrong in the above code...thanks....
Kindly quote some references for this problem so that it becomes clearer.
Thank you.
Algorithms by Dasgupta is a good reference.
In function display the below changes should be made-->
//(base + r * col)1 should be replaced by *(base + r * col + c)
@Jatin, thanks. It was typo during post update. I have updated the post.
In the documentation of the table inside the program :
leftCell = table[i][j-1] ;
and
topCell = table[i-1][j] ;
It should be,
leftCell = table[i-1][j] ;
and
topCell = table[i][j-1] ;
this link may be helpful
http://www.youtube.com/watch?v=CB425OsE4Fo
"Given strings SUNDAY and SATURDAY. We want to convert SUNDAY into SATURDAY with minimum edits. Let us pick i = 2 and j = 4 i.e. prefix strings are SUN and SATU respectively"
in this line change i=3 or prefix as 'SU'.
Usually the costs D, I and R are not same. In such case the problem can be represented as an acyclic directed graph (DAG) with weights on each edge, and finding shortest path gives edit distance.
How do we construct this graph? Could you please give some basic steps? Just the logic.
http://en.wikipedia.org/wiki/Levenshtein_distance. The improvement of levenshtein text distance is http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
#include <algorithm>
#include <string>
using namespace std;

int EDIT[100][100];

int solve_edit( string a, string b)
{
    for (int j=0;j<=b.size();j++)
    {
        EDIT[0][j]=j;
    }
    for (int i=1;i<=a.size();i++)
    {
        EDIT[i][0]=i;
        for (int j=1;j<=b.size();j++)
        {
            EDIT[i][j]= min( min( EDIT[i][j-1]+1,EDIT[i-1][j]+1),
                             EDIT[i-1][j-1]+ (int)(a[i-1]!=b[j-1]));
        }
    }
    return EDIT[a.size()][b.size()];
}
In the description it is written:
Combining all the subproblems minimum cost of aligning prefix strings ending at i and j given by
E(i, j) = min( [E(i-1, j) + D], [E(i, j-1) + I], [E(i-1, j-1) + I if i,j characters are not same] )
in ---[E(i-1, j-1) + I if i,j characters are not same] )
shouldnt here be replace(R) instead of Insert(I)
else it would be two operations
[E(i-1, j-1) + I +D ... we insert one char from target string and delete from original string
@rajcools, thanks. It should be replace. I will update.
Instead of Using DAG, can't we simply define 3 different Edit Costs: Edit_Insert(ex. 1), Edit_Delete(2), Edit_Remove(5) and use these in the 3 cases??