Sequence Alignment problem

Given as an input two strings, X = x_{1} x_{2}... x_{m}, and Y = y_{1} y_{2}... y_{m}, output the alignment of the strings, character by character, so that the net penalty is minimised. The penalty is calculated as:
1. A penalty of p_{gap} occurs if a gap is inserted between the string.
2. A penalty of p_{xy} occurs for mis-matching the characters of X and Y.

Examples:

Input : X = CG, Y = CA, p_gap = 3, p_xy = 7
Output : X = CG_, Y = C_A, Total penalty = 6

Input : X = AGGGCT, Y = AGGCA, p_gap = 3, p_xy = 2
Output : X = AGGGCT, Y = A_GGCA, Total penalty = 5

Input : X = CG, Y = CA, p_gap = 3, p_xy = 5
Output : X = CG, Y = CA, Total penalty = 5


A brief Note on the history of the problem
The Sequence Alignment problem is one of the fundamental problems of Biological Sciences, aimed at finding the similarity of two amino-acid sequences. Comparing amino-acids is of prime importance to humans, since it gives vital information on evolution and development. Saul B. Needleman and Christian D. Wunsch devised a dynamic programming algorithm to the problem and got it published in 1970. Since then, numerous improvements have been made to improve the time complexity and space complexity, however these are beyond the scope of discussion in this post.

Solution We can use dynamic programming to solve this problem. The feasible solution is to introduce gaps into the strings, so as to equalise the lengths. Since it can be easily proved that the addition of extra gaps after equalising the lengths will only lead to increment of penalty.

Optimal Substructure
It can be observed from an optimal solution, for example from the given sample input, that the optimal solution narrows down to only three candidates.
1. x_{m} and y_{n}.
2. x_{m} and gap.
3. gap and y_{n}.

Proof of Optimal Substructure.
We can easily prove by contradiction. Let X - x_{m} be X^' and Y - y_{n} be Y^'. Suppose that the induced alignment of X^', Y^' has some penalty P, and a competitor alignment has a penalty P^*, with P^* < P.
Now, appending x_{m} and y_{n}, we get an alignment with penalty P^* + p_{xy} < P  + p_{xy}. This contradicts the optimality of the original alignment of X, Y.
Hence, proved.

Let dp[i][j] be the penalty of the optimal alignment of X_{i} and Y_{i}. Then, from the optimal substructure, dp[i][j] = min(dp[i-1][j-1] + p_{xy}, dp[i-1][j] + p_{gap}, dp[i][j-1] + p_{gap}) .
The total minimum penalty is thus, dp[m][n] .

Reconstructing the solution
To Reconstruct,
1. Trace back through the filled table, starting dp[m][n].
2. When (i, j)
…..2a. if it was filled using case 1, go to (i-1, j-1).
…..2b. if it was filled using case 2, go to (i-1, j).
…..2c. if it was filled using case 3, go to (i, j-1).
3. if either i = 0 or j = 0, match the remaining substring with gaps.

Below is the implementation of the above solution.

C++

filter_none

edit
close

play_arrow

link
brightness_4
code

// CPP program to implement sequence alignment
// problem.
#include <bits/stdc++.h>
  
using namespace std;
  
// function to find out the minimum penalty
void getMinimumPenalty(string x, string y, int pxy, int pgap)
{
    int i, j; // intialising variables
      
    int m = x.length(); // length of gene1
    int n = y.length(); // length of gene2
      
    // table for storing optimal substructure answers
    int dp[n+m+1][n+m+1] = {0};
  
    // intialising the table 
    for (i = 0; i <= (n+m); i++)
    {
        dp[i][0] = i * pgap;
        dp[0][i] = i * pgap;
    }    
  
    // calcuting the minimum penalty
    for (i = 1; i <= m; i++)
    {
        for (j = 1; j <= n; j++)
        {
            if (x[i - 1] == y[j - 1])
            {
                dp[i][j] = dp[i - 1][j - 1];
            }
            else
            {
                dp[i][j] = min({dp[i - 1][j - 1] + pxy , 
                                dp[i - 1][j] + pgap    , 
                                dp[i][j - 1] + pgap    });
            }
        }
    }
  
    // Reconstructing the solution
    int l = n + m; // maximum possible length
      
    i = m; j = n;
      
    int xpos = l;
    int ypos = l;
  
    // Final answers for the respective strings
    int xans[l+1], yans[l+1];
      
    while ( !(i == 0 || j == 0))
    {
        if (x[i - 1] == y[j - 1])
        {
            xans[xpos--] = (int)x[i - 1];
            yans[ypos--] = (int)y[j - 1];
            i--; j--;
        }
        else if (dp[i - 1][j - 1] + pxy == dp[i][j])
        {
            xans[xpos--] = (int)x[i - 1];
            yans[ypos--] = (int)y[j - 1];
            i--; j--;
        }
        else if (dp[i - 1][j] + pgap == dp[i][j])
        {
            xans[xpos--] = (int)x[i - 1];
            yans[ypos--] = (int)'_';
            i--;
        }
        else if (dp[i][j - 1] + pgap == dp[i][j])
        {
            xans[xpos--] = (int)'_';
            yans[ypos--] = (int)y[j - 1];
            j--;
        }
    }
    while (xpos > 0)
    {
        if (i > 0) xans[xpos--] = (int)x[--i];
        else xans[xpos--] = (int)'_';
    }
    while (ypos > 0)
    {
        if (j > 0) yans[ypos--] = (int)y[--j];
        else yans[ypos--] = (int)'_';
    }
  
    // Since we have assumed the answer to be n+m long, 
    // we need to remove the extra gaps in the starting 
    // id represents the index from which the arrays
    // xans, yans are useful
    int id = 1;
    for (i = l; i >= 1; i--)
    {
        if ((char)yans[i] == '_' && (char)xans[i] == '_')
        {
            id = i + 1;
            break;
        }
    }
  
    // Printing the final answer
    cout << "Minimum Penalty in aligning the genes = ";
    cout << dp[m][n] << "\n";
    cout << "The aligned genes are :\n";
    for (i = id; i <= l; i++)
    {
        cout<<(char)xans[i];
    }
    cout << "\n";
    for (i = id; i <= l; i++)
    {
        cout << (char)yans[i];
    }
    return;
}
  
// Driver code
int main(){
    // input strings
    string gene1 = "AGGGCT";
    string gene2 = "AGGCA";
      
    // intialsing penalties of different types
    int misMatchPenalty = 3;
    int gapPenalty = 2;
  
    // calling the function to calculate the result
    getMinimumPenalty(gene1, gene2, 
        misMatchPenalty, gapPenalty);
    return 0;
}

chevron_right


Java

filter_none

edit
close

play_arrow

link
brightness_4
code

// Java program to implement 
// sequence alignment problem.
import java.io.*;
import java.util.*;
import java.lang.*;
  
class GFG
{
// function to find out 
// the minimum penalty
static void getMinimumPenalty(String x, String y, 
                              int pxy, int pgap)
{
    int i, j; // intialising variables
      
    int m = x.length(); // length of gene1
    int n = y.length(); // length of gene2
      
    // table for storing optimal
    // substructure answers
    int dp[][] = new int[n + m + 1][n + m + 1];
      
    for (int[] x1 : dp)
    Arrays.fill(x1, 0);
  
    // intialising the table 
    for (i = 0; i <= (n + m); i++)
    {
        dp[i][0] = i * pgap;
        dp[0][i] = i * pgap;
    
  
    // calcuting the 
    // minimum penalty
    for (i = 1; i <= m; i++)
    {
        for (j = 1; j <= n; j++)
        {
            if (x.charAt(i - 1) == y.charAt(j - 1))
            {
                dp[i][j] = dp[i - 1][j - 1];
            }
            else
            {
                dp[i][j] = Math.min(Math.min(dp[i - 1][j - 1] + pxy , 
                                             dp[i - 1][j] + pgap) , 
                                             dp[i][j - 1] + pgap );
            }
        }
    }
  
    // Reconstructing the solution
    int l = n + m; // maximum possible length
      
    i = m; j = n;
      
    int xpos = l;
    int ypos = l;
  
    // Final answers for 
    // the respective strings
    int xans[] = new int[l + 1]; 
    int yans[] = new int[l + 1];
      
    while ( !(i == 0 || j == 0))
    {
        if (x.charAt(i - 1) == y.charAt(j - 1))
        {
            xans[xpos--] = (int)x.charAt(i - 1);
            yans[ypos--] = (int)y.charAt(j - 1);
            i--; j--;
        }
        else if (dp[i - 1][j - 1] + pxy == dp[i][j])
        {
            xans[xpos--] = (int)x.charAt(i - 1);
            yans[ypos--] = (int)y.charAt(j - 1);
            i--; j--;
        }
        else if (dp[i - 1][j] + pgap == dp[i][j])
        {
            xans[xpos--] = (int)x.charAt(i - 1);
            yans[ypos--] = (int)'_';
            i--;
        }
        else if (dp[i][j - 1] + pgap == dp[i][j])
        {
            xans[xpos--] = (int)'_';
            yans[ypos--] = (int)y.charAt(j - 1);
            j--;
        }
    }
    while (xpos > 0)
    {
        if (i > 0) xans[xpos--] = (int)x.charAt(--i);
        else xans[xpos--] = (int)'_';
    }
    while (ypos > 0)
    {
        if (j > 0) yans[ypos--] = (int)y.charAt(--j);
        else yans[ypos--] = (int)'_';
    }
  
    // Since we have assumed the 
    // answer to be n+m long, 
    // we need to remove the extra 
    // gaps in the starting id 
    // represents the index from 
    // which the arrays xans,
    // yans are useful
    int id = 1;
    for (i = l; i >= 1; i--)
    {
        if ((char)yans[i] == '_' && 
            (char)xans[i] == '_')
        {
            id = i + 1;
            break;
        }
    }
  
    // Printing the final answer
    System.out.print("Minimum Penalty in "
                     "aligning the genes = ");
    System.out.print(dp[m][n] + "\n");
    System.out.println("The aligned genes are :");
    for (i = id; i <= l; i++)
    {
        System.out.print((char)xans[i]);
    }
    System.out.print("\n");
    for (i = id; i <= l; i++)
    {
        System.out.print((char)yans[i]);
    }
    return;
}
  
// Driver code
public static void main(String[] args)
{
    // input strings
    String gene1 = "AGGGCT";
    String gene2 = "AGGCA";
      
    // intialsing penalties
    // of different types
    int misMatchPenalty = 3;
    int gapPenalty = 2;
  
    // calling the function to
    // calculate the result
    getMinimumPenalty(gene1, gene2, 
        misMatchPenalty, gapPenalty);
}
}

chevron_right


C#

filter_none

edit
close

play_arrow

link
brightness_4
code

// C# program to implement sequence alignment 
// problem. 
using System; 
    
class GFG 
    // function to find out the minimum penalty 
    public static void getMinimumPenalty(string x, string y, int pxy, int pgap) 
    
        int i, j; // intialising variables 
            
        int m = x.Length; // length of gene1 
        int n = y.Length; // length of gene2 
            
        // table for storing optimal substructure answers 
        int[,] dp = new int[n+m+1,n+m+1];
        for(int q = 0; q < n+m+1; q++)
            for(int w = 0; w < n+m+1; w++)
                dp[q,w] = 0;
        
        // intialising the table  
        for (i = 0; i <= (n+m); i++) 
        
            dp[i,0] = i * pgap; 
            dp[0,i] = i * pgap; 
        }     
        
        // calcuting the minimum penalty 
        for (i = 1; i <= m; i++) 
        
            for (j = 1; j <= n; j++) 
            
                if (x[i - 1] == y[j - 1]) 
                
                    dp[i,j] = dp[i - 1,j - 1]; 
                
                else
                
                    dp[i,j] = Math.Min(Math.Min(dp[i - 1,j - 1] + pxy ,  
                                    dp[i - 1,j] + pgap)    ,  
                                    dp[i,j - 1] + pgap    ); 
                
            
        
        
        // Reconstructing the solution 
        int l = n + m; // maximum possible length 
            
        i = m; j = n; 
            
        int xpos = l; 
        int ypos = l; 
        
        // Final answers for the respective strings 
        int[] xans = new int[l+1];
        int [] yans = new int[l+1]; 
            
        while ( !(i == 0 || j == 0)) 
        
            if (x[i - 1] == y[j - 1]) 
            
                xans[xpos--] = (int)x[i - 1]; 
                yans[ypos--] = (int)y[j - 1]; 
                i--; j--; 
            
            else if (dp[i - 1,j - 1] + pxy == dp[i,j]) 
            
                xans[xpos--] = (int)x[i - 1]; 
                yans[ypos--] = (int)y[j - 1]; 
                i--; j--; 
            
            else if (dp[i - 1,j] + pgap == dp[i,j]) 
            
                xans[xpos--] = (int)x[i - 1]; 
                yans[ypos--] = (int)'_'
                i--; 
            
            else if (dp[i,j - 1] + pgap == dp[i,j]) 
            
                xans[xpos--] = (int)'_'
                yans[ypos--] = (int)y[j - 1]; 
                j--; 
            
        
        while (xpos > 0) 
        
            if (i > 0) xans[xpos--] = (int)x[--i]; 
            else xans[xpos--] = (int)'_'
        
        while (ypos > 0) 
        
            if (j > 0) yans[ypos--] = (int)y[--j]; 
            else yans[ypos--] = (int)'_'
        
        
        // Since we have assumed the answer to be n+m long,  
        // we need to remove the extra gaps in the starting  
        // id represents the index from which the arrays 
        // xans, yans are useful 
        int id = 1; 
        for (i = l; i >= 1; i--) 
        
            if ((char)yans[i] == '_' && (char)xans[i] == '_'
            
                id = i + 1; 
                break
            
        
        
        // Printing the final answer 
        Console.Write("Minimum Penalty in aligning the genes = " + dp[m,n] + "\n"); 
        Console.Write("The aligned genes are :\n"); 
        for (i = id; i <= l; i++) 
        
            Console.Write((char)xans[i]); 
        
        Console.Write("\n"); 
        for (i = id; i <= l; i++) 
        
            Console.Write((char)yans[i]); 
        
        return
    
        
    // Driver code
    static void Main() 
    
        // input strings 
        string gene1 = "AGGGCT"
        string gene2 = "AGGCA"
            
        // intialsing penalties of different types 
        int misMatchPenalty = 3; 
        int gapPenalty = 2; 
        
        // calling the function to calculate the result 
        getMinimumPenalty(gene1, gene2,  
            misMatchPenalty, gapPenalty);
    }
    //This code is contributed by DrRoot_
}

chevron_right


PHP

filter_none

edit
close

play_arrow

link
brightness_4
code

<?php
// PHP program to implement 
// sequence alignment problem.
  
// function to find out
// the minimum penalty
function getMinimumPenalty($x, $y
                           $pxy, $pgap)
{
    $i; $j; // intializing variables
      
    $m = strlen($x); // length of gene1
    $n = strlen($y); // length of gene2
      
    // table for storing optimal
    // substructure answers
    $dp[$n + $m + 1][$n + $m + 1] = array(0);
  
    // intialising the table 
    for ($i = 0; $i <= ($n+$m); $i++)
    {
        $dp[$i][0] = $i * $pgap;
        $dp[0][$i] = $i * $pgap;
    
  
    // calcuting the 
    // minimum penalty
    for ($i = 1; $i <= $m; $i++)
    {
        for ($j = 1; $j <= $n; $j++)
        {
            if ($x[$i - 1] == $y[$j - 1])
            {
                $dp[$i][$j] = $dp[$i - 1][$j - 1];
            }
            else
            {
                $dp[$i][$j] = min($dp[$i - 1][$j - 1] + $pxy
                                  $dp[$i - 1][$j] + $pgap
                                  $dp[$i][$j - 1] + $pgap );
            }
        }
    }
  
    // Reconstructing the solution
    $l = $n + $m; // maximum possible length
      
    $i = $m; $j = $n;
      
    $xpos = $l;
    $ypos = $l;
  
    // Final answers for 
    // the respective strings
    // $xans[$l + 1]; $yans[$l + 1];
      
    while ( !($i == 0 || $j == 0))
    {
        if ($x[$i - 1] == $y[$j - 1])
        {
            $xans[$xpos--] = $x[$i - 1];
            $yans[$ypos--] = $y[$j - 1];
            $i--; $j--;
        }
        else if ($dp[$i - 1][$j - 1] + 
                 $pxy == $dp[$i][$j])
        {
            $xans[$xpos--] = $x[$i - 1];
            $yans[$ypos--] = $y[$j - 1];
            $i--; $j--;
        }
        else if ($dp[$i - 1][$j] + 
                 $pgap == $dp[$i][$j])
        {
            $xans[$xpos--] = $x[$i - 1];
            $yans[$ypos--] = '_';
            $i--;
        }
        else if ($dp[$i][$j - 1] + 
                 $pgap == $dp[$i][$j])
        {
            $xans[$xpos--] = '_';
            $yans[$ypos--] = $y[$j - 1];
            $j--;
        }
    }
    while ($xpos > 0)
    {
        if ($i > 0) $xans[$xpos--] = $x[--$i];
        else $xans[$xpos--] = '_';
    }
    while ($ypos > 0)
    {
        if ($j > 0) 
            $yans[$ypos--] = $y[--$j];
        else 
            $yans[$ypos--] = '_';
    }
  
    // Since we have assumed the
    // answer to be n+m long, 
    // we need to remove the extra
    // gaps in the starting 
    // id represents the index 
    // from which the arrays
    // xans, yans are useful
    $id = 1;
    for ($i = $l; $i >= 1; $i--)
    {
        if ($yans[$i] == '_' && 
            $xans[$i] == '_')
        {
            $id = $i + 1;
            break;
        }
    }
  
    // Printing the final answer
    echo "Minimum Penalty in "
         "aligning the genes = ";
    echo $dp[$m][$n] . "\n";
    echo "The aligned genes are :\n";
    for ($i = $id; $i <= $l; $i++)
    {
        echo $xans[$i];
    }
    echo "\n";
    for ($i = $id; $i <= $l; $i++)
    {
        echo $yans[$i];
    }
    return;
}
  
// Driver code
  
// input strings
$gene1 = "AGGGCT";
$gene2 = "AGGCA";
  
// intialsing penalties
// of different types
$misMatchPenalty = 3;
$gapPenalty = 2;
  
// calling the function 
// to calculate the result
getMinimumPenalty($gene1, $gene2
    $misMatchPenalty, $gapPenalty);
  
// This code is contributed by Abhinav96
?>

chevron_right


Output:

Minimum Penalty in aligning the genes = 5
The aligned genes are :
AGGGCT
A_GGCA

Time Complexity : \mathcal{O}(n*m)
Space Complexity : \mathcal{O}(n*m)



My Personal Notes arrow_drop_up

I am a problem solving enthusiast and I love competitive programming

If you like GeeksforGeeks and would like to contribute, you can also write an article using contribute.geeksforgeeks.org or mail your article to contribute@geeksforgeeks.org. See your article appearing on the GeeksforGeeks main page and help other Geeks.

Please Improve this article if you find anything incorrect by clicking on the "Improve Article" button below.



Improved By : AayushChaturvedi, DrRoot_