Kasai’s Algorithm for Construction of LCP array from Suffix Array
Last Updated: 06 Jan, 2024
Background
Suffix Array: A suffix array is a sorted array of all suffixes of a given string.
Let the given string be “banana”.
0 banana                          5 a
1 anana     Sort the Suffixes     3 ana
2 nana      ---------------->     1 anana
3 ana        alphabetically       0 banana
4 na                              4 na
5 a                               2 nana
The suffix array for “banana” :
suffix[] = {5, 3, 1, 0, 4, 2}
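The definition above can be checked with a very small sketch. It is only a naive illustration (the function name `naive_suffix_array` is made up for this example, and sorting whole suffixes is far slower than the construction algorithms discussed in this article):

```python
def naive_suffix_array(txt):
    """Return the start indices of all suffixes of txt in sorted order."""
    # Sort the start positions by the suffix that begins at each position.
    return sorted(range(len(txt)), key=lambda i: txt[i:])


print(naive_suffix_array("banana"))  # prints [5, 3, 1, 0, 4, 2]
```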
We have discussed the suffix array and its O(n Log n) construction.
Once the suffix array is built, we can use it to search for a pattern in a text efficiently. For example, we can use binary search to find a pattern (the complete code for this is discussed in a separate article).
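As a rough sketch of the binary search idea (not the code from the article referred to above; the helper name `search` and its return convention are assumptions made for this example), each probe compares the pattern with the first m characters of one suffix, which gives the O(m*Logn) bound:

```python
def search(pat, txt, suffix_arr):
    """Return a start index of pat in txt, or -1 if pat does not occur."""
    m = len(pat)
    lo, hi = 0, len(suffix_arr)
    # Find the first suffix whose length-m prefix is >= pat.
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_arr[mid]
        if txt[start:start + m] < pat:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(suffix_arr):
        start = suffix_arr[lo]
        if txt[start:start + m] == pat:
            return start
    return -1


suffix_arr = [5, 3, 1, 0, 4, 2]  # suffix array of "banana"
print(search("ana", "banana", suffix_arr))  # prints 3
print(search("nab", "banana", suffix_arr))  # prints -1
```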
LCP Array
The binary search based solution takes O(m*Logn) time, where m is the length of the pattern to be searched and n is the length of the text. With the help of the LCP array, we can search for a pattern in O(m + Log n) time. For example, if our task is to search for “ana” in “banana”, then m = 3 and n = 6.
The LCP array is an array of size n (like the suffix array). A value lcp[i] indicates the length of the longest common prefix of the suffixes indexed by suffix[i] and suffix[i+1]. lcp[n-1] is not defined, as there is no suffix after suffix[n-1]; it is stored as 0.
txt[0..n-1] = "banana"
suffix[] = {5, 3, 1, 0, 4, 2}
lcp[] = {1, 3, 0, 0, 2, 0}
Suffixes represented by suffix array in order are:
{"a", "ana", "anana", "banana", "na", "nana"}
lcp[0] = Longest Common Prefix of "a" and "ana" = 1
lcp[1] = Longest Common Prefix of "ana" and "anana" = 3
lcp[2] = Longest Common Prefix of "anana" and "banana" = 0
lcp[3] = Longest Common Prefix of "banana" and "na" = 0
lcp[4] = Longest Common Prefix of "na" and "nana" = 2
lcp[5] = Longest Common Prefix of "nana" and None = 0
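The values listed above can be reproduced directly from the definition. The following is a minimal sketch that compares each pair of neighbouring suffixes character by character (the names `common_prefix_len` and `naive_lcp` are made up for this example); it can take quadratic time, which is what Kasai's algorithm avoids:

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def naive_lcp(txt, suffix_arr):
    """lcp[i] = common prefix length of suffix_arr[i] and suffix_arr[i+1]."""
    n = len(suffix_arr)
    lcp = [0] * n  # the last entry has no next suffix and stays 0
    for i in range(n - 1):
        lcp[i] = common_prefix_len(txt[suffix_arr[i]:],
                                   txt[suffix_arr[i + 1]:])
    return lcp


print(naive_lcp("banana", [5, 3, 1, 0, 4, 2]))  # prints [1, 3, 0, 0, 2, 0]
```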
How to construct LCP array?
LCP array construction can be done in two ways:
1) Compute the LCP array as a byproduct of suffix array construction (Manber & Myers Algorithm).
2) Use an already constructed suffix array to compute the LCP values (Kasai’s Algorithm).
There exist algorithms that construct the suffix array in O(n) time, so the LCP array can always be built from the text in O(n) time overall. The implementation below, however, builds the suffix array with a simpler prefix-doubling algorithm that runs a comparison sort in each of its Log n rounds, which costs O(n Log n Log n); only the LCP step (Kasai’s algorithm) is O(n).
Kasai’s Algorithm
In this article, Kasai’s Algorithm is discussed. The algorithm constructs the LCP array from the suffix array and the input text in O(n) time. The idea is based on the fact below:
Let the LCP of the suffix beginning at txt[i] be k. If k is greater than 0, then the LCP of the suffix beginning at txt[i+1] is at least k-1. The reason is that the relative order of the suffixes stays the same when the first character is removed: if we delete the first character from the suffix at i and from its next suffix in the suffix array, the two remaining suffixes still match in at least k-1 characters. For example, the LCP for the suffix “ana” is 3, so the LCP for the suffix “na” is at least 2. See the references for a proof.
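The fact above can be checked by brute force on a small input. The sketch below is only an illustration (the function name `check_kasai_property` is made up for this example): it computes the LCP value for every text position i and tests that the value at i+1 never drops by more than 1.

```python
def check_kasai_property(txt):
    """Return (h, ok): h[i] is the LCP of the suffix starting at i with
    its next suffix in sorted order; ok is True if h[i+1] >= h[i] - 1."""
    n = len(txt)
    sa = sorted(range(n), key=lambda i: txt[i:])  # naive suffix array
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r

    def h(i):
        if rank[i] == n - 1:  # last suffix in sorted order: no next suffix
            return 0
        j = sa[rank[i] + 1]
        k = 0
        while i + k < n and j + k < n and txt[i + k] == txt[j + k]:
            k += 1
        return k

    values = [h(i) for i in range(n)]
    ok = all(values[i + 1] >= values[i] - 1 for i in range(n - 1))
    return values, ok


print(check_kasai_property("banana"))  # prints ([0, 0, 0, 3, 2, 1], True)
```

Because the value can drop by at most 1 per step and can never exceed n, the total number of character comparisons over all positions stays linear in n.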
Below is the implementation of Kasai’s algorithm.
C++
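// C++ program to build the LCP array from a suffix array
// using Kasai's algorithm
#include <bits/stdc++.h>
using namespace std;

// Stores a suffix's starting index and the pair of ranks used for sorting
struct suffix
{
    int index;
    int rank[2];
};

// Returns true if a should come before b: compare first ranks,
// then second ranks on a tie
bool cmp(const suffix& a, const suffix& b)
{
    return (a.rank[0] == b.rank[0]) ? (a.rank[1] < b.rank[1])
                                    : (a.rank[0] < b.rank[0]);
}

// Builds the suffix array of txt (length n) by prefix doubling
vector<int> buildSuffixArray(const string& txt, int n)
{
    vector<suffix> suffixes(n);

    // Rank every suffix by its first two characters
    for (int i = 0; i < n; i++)
    {
        suffixes[i].index = i;
        suffixes[i].rank[0] = txt[i] - 'a';
        suffixes[i].rank[1] = ((i + 1) < n) ? (txt[i + 1] - 'a') : -1;
    }
    sort(suffixes.begin(), suffixes.end(), cmp);

    // ind[i] = current position in suffixes[] of the suffix starting at i
    vector<int> ind(n);

    // Suffixes are sorted by their first k/2 characters; extend to k
    for (int k = 4; k < 2 * n; k = k * 2)
    {
        // Assign new first ranks: equal rank pairs get the same rank
        int rank = 0;
        int prev_rank = suffixes[0].rank[0];
        suffixes[0].rank[0] = rank;
        ind[suffixes[0].index] = 0;
        for (int i = 1; i < n; i++)
        {
            if (suffixes[i].rank[0] == prev_rank &&
                suffixes[i].rank[1] == suffixes[i - 1].rank[1])
            {
                prev_rank = suffixes[i].rank[0];
                suffixes[i].rank[0] = rank;
            }
            else
            {
                prev_rank = suffixes[i].rank[0];
                suffixes[i].rank[0] = ++rank;
            }
            ind[suffixes[i].index] = i;
        }

        // Second rank = first rank of the suffix k/2 characters later
        for (int i = 0; i < n; i++)
        {
            int nextindex = suffixes[i].index + k / 2;
            suffixes[i].rank[1] = (nextindex < n)
                                      ? suffixes[ind[nextindex]].rank[0]
                                      : -1;
        }
        sort(suffixes.begin(), suffixes.end(), cmp);
    }

    vector<int> suffixArr;
    for (int i = 0; i < n; i++)
        suffixArr.push_back(suffixes[i].index);
    return suffixArr;
}

// Builds the LCP array from txt and its suffix array in O(n) time
vector<int> kasai(const string& txt, const vector<int>& suffixArr)
{
    int n = suffixArr.size();
    vector<int> lcp(n, 0);

    // invSuff[i] = position of the suffix starting at i in suffixArr
    vector<int> invSuff(n, 0);
    for (int i = 0; i < n; i++)
        invSuff[suffixArr[i]] = i;

    // k = length of the previous LCP
    int k = 0;

    // Process the suffixes in text order
    for (int i = 0; i < n; i++)
    {
        // The last suffix in sorted order has no next suffix
        if (invSuff[i] == n - 1)
        {
            k = 0;
            continue;
        }

        // j = start of the next suffix in sorted order
        int j = suffixArr[invSuff[i] + 1];

        // The first k characters are already known to match
        while (i + k < n && j + k < n && txt[i + k] == txt[j + k])
            k++;
        lcp[invSuff[i]] = k;

        // The next suffix in text order loses the first character
        if (k > 0)
            k--;
    }
    return lcp;
}

// Prints the first n elements of arr
void printArr(const vector<int>& arr, int n)
{
    for (int i = 0; i < n; i++)
        cout << arr[i] << " ";
    cout << endl;
}

// Driver code
int main()
{
    string str = "banana";
    vector<int> suffixArr = buildSuffixArray(str, str.length());
    int n = suffixArr.size();
    cout << "Suffix Array : \n";
    printArr(suffixArr, n);
    vector<int> lcp = kasai(str, suffixArr);
    cout << "\nLCP Array : \n";
    printArr(lcp, n);
    return 0;
}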
Java
// Java program to build the LCP array from a suffix array
// using Kasai's algorithm
import java.util.Arrays;
import java.util.Vector;

// Stores a suffix's starting index and the pair of ranks used for sorting
class Suffix {
    int index;
    int[] rank = new int[2];
}

public class SuffixArray {
    // Orders suffixes by first rank, then by second rank on a tie
    static int cmp(Suffix a, Suffix b) {
        return (a.rank[0] == b.rank[0])
            ? Integer.compare(a.rank[1], b.rank[1])
            : Integer.compare(a.rank[0], b.rank[0]);
    }

    // Builds the suffix array of txt (length n) by prefix doubling
    static Vector<Integer> buildSuffixArray(String txt, int n) {
        Suffix[] suffixes = new Suffix[n];

        // Rank every suffix by its first two characters
        for (int i = 0; i < n; i++) {
            suffixes[i] = new Suffix();
            suffixes[i].index = i;
            suffixes[i].rank[0] = txt.charAt(i) - 'a';
            suffixes[i].rank[1] = (i + 1) < n ? (txt.charAt(i + 1) - 'a') : -1;
        }
        Arrays.sort(suffixes, SuffixArray::cmp);

        // ind[i] = current position in suffixes[] of the suffix starting at i
        int[] ind = new int[n];

        // Suffixes are sorted by their first k/2 characters; extend to k
        for (int k = 4; k < 2 * n; k = k * 2) {
            // Assign new first ranks: equal rank pairs get the same rank
            int rank = 0;
            int prev_rank = suffixes[0].rank[0];
            suffixes[0].rank[0] = rank;
            ind[suffixes[0].index] = 0;
            for (int i = 1; i < n; i++) {
                if (suffixes[i].rank[0] == prev_rank &&
                    suffixes[i].rank[1] == suffixes[i - 1].rank[1]) {
                    prev_rank = suffixes[i].rank[0];
                    suffixes[i].rank[0] = rank;
                } else {
                    prev_rank = suffixes[i].rank[0];
                    suffixes[i].rank[0] = ++rank;
                }
                ind[suffixes[i].index] = i;
            }

            // Second rank = first rank of the suffix k/2 characters later
            for (int i = 0; i < n; i++) {
                int nextindex = suffixes[i].index + k / 2;
                suffixes[i].rank[1] = (nextindex < n)
                    ? suffixes[ind[nextindex]].rank[0] : -1;
            }
            Arrays.sort(suffixes, SuffixArray::cmp);
        }

        Vector<Integer> suffixArr = new Vector<>();
        for (int i = 0; i < n; i++)
            suffixArr.add(suffixes[i].index);
        return suffixArr;
    }

    // Builds the LCP array from txt and its suffix array in O(n) time
    static Vector<Integer> kasai(String txt, Vector<Integer> suffixArr) {
        int n = suffixArr.size();
        int[] temp = new int[n];
        Vector<Integer> lcp = new Vector<>(n);

        // invSuff[i] = position of the suffix starting at i in suffixArr
        int[] invSuff = new int[n];
        for (int i = 0; i < n; i++)
            invSuff[suffixArr.get(i)] = i;

        // k = length of the previous LCP
        int k = 0;

        // Process the suffixes in text order
        for (int i = 0; i < n; i++) {
            // The last suffix in sorted order has no next suffix
            if (invSuff[i] == n - 1) {
                k = 0;
                continue;
            }

            // j = start of the next suffix in sorted order
            int j = suffixArr.get(invSuff[i] + 1);

            // The first k characters are already known to match
            while (i + k < n && j + k < n && txt.charAt(i + k) == txt.charAt(j + k))
                k++;
            temp[invSuff[i]] = k;

            // The next suffix in text order loses the first character
            if (k > 0)
                k--;
        }
        for (int i = 0; i < n; i++) {
            lcp.add(temp[i]);
        }
        return lcp;
    }

    // Prints the elements of arr
    static void printArr(Vector<Integer> arr, int n) {
        for (Integer value : arr)
            System.out.print(value + " ");
        System.out.println();
    }

    // Driver code
    public static void main(String[] args) {
        String str = "banana";
        Vector<Integer> suffixArr = buildSuffixArray(str, str.length());
        int n = suffixArr.size();
        System.out.println("Suffix Array : ");
        printArr(suffixArr, n);
        Vector<Integer> lcp = kasai(str, suffixArr);
        System.out.println("\nLCP Array : ");
        printArr(lcp, n);
    }
}
Python
# Python program to build the LCP array from a suffix array
# using Kasai's algorithm


class Suffix:
    """Stores a suffix's starting index and the rank pair used for sorting."""

    def __init__(self):
        self.index = 0
        self.rank = [0, 0]


def buildSuffixArray(txt, n):
    """Builds the suffix array of txt (length n) by prefix doubling."""
    suffixes = [Suffix() for _ in range(n)]

    # Rank every suffix by its first two characters
    for i in range(n):
        suffixes[i].index = i
        suffixes[i].rank[0] = ord(txt[i]) - ord('a')
        suffixes[i].rank[1] = ord(txt[i + 1]) - ord('a') if i + 1 < n else -1
    suffixes.sort(key=lambda x: (x.rank[0], x.rank[1]))

    # ind[i] = current position in suffixes of the suffix starting at i
    ind = [0] * n

    # Suffixes are sorted by their first k/2 characters; extend to k
    k = 4
    while k < 2 * n:
        # Assign new first ranks: equal rank pairs get the same rank
        rank = 0
        prev_rank = suffixes[0].rank[0]
        suffixes[0].rank[0] = rank
        ind[suffixes[0].index] = 0
        for i in range(1, n):
            if (suffixes[i].rank[0] == prev_rank and
                    suffixes[i].rank[1] == suffixes[i - 1].rank[1]):
                prev_rank = suffixes[i].rank[0]
                suffixes[i].rank[0] = rank
            else:
                prev_rank = suffixes[i].rank[0]
                rank += 1
                suffixes[i].rank[0] = rank
            ind[suffixes[i].index] = i

        # Second rank = first rank of the suffix k/2 characters later
        for i in range(n):
            nextindex = suffixes[i].index + k // 2
            suffixes[i].rank[1] = (suffixes[ind[nextindex]].rank[0]
                                   if nextindex < n else -1)
        suffixes.sort(key=lambda x: (x.rank[0], x.rank[1]))
        k *= 2

    suffixArr = [suffix.index for suffix in suffixes]
    return suffixArr


def kasai(txt, suffixArr):
    """Builds the LCP array from txt and its suffix array in O(n) time."""
    n = len(suffixArr)
    lcp = [0] * n

    # invSuff[i] = position of the suffix starting at i in suffixArr
    invSuff = [0] * n
    for i in range(n):
        invSuff[suffixArr[i]] = i

    # k = length of the previous LCP
    k = 0

    # Process the suffixes in text order
    for i in range(n):
        # The last suffix in sorted order has no next suffix
        if invSuff[i] == n - 1:
            k = 0
            continue

        # j = start of the next suffix in sorted order
        j = suffixArr[invSuff[i] + 1]

        # The first k characters are already known to match
        while i + k < n and j + k < n and txt[i + k] == txt[j + k]:
            k += 1
        lcp[invSuff[i]] = k

        # The next suffix in text order loses the first character
        if k > 0:
            k -= 1
    return lcp


def printArr(arr):
    print(" ".join(map(str, arr)))


# Driver code
if __name__ == "__main__":
    input_str = "banana"
    suffixArr = buildSuffixArray(input_str, len(input_str))
    n = len(suffixArr)
    print("Suffix Array:")
    printArr(suffixArr)
    lcp = kasai(input_str, suffixArr)
    print("\nLCP Array:")
    printArr(lcp)
C#
// C# program to build the LCP array from a suffix array
// using Kasai's algorithm
using System;
using System.Collections.Generic;

// Stores a suffix's starting index and the pair of ranks used for sorting
public struct Suffix
{
    public int Index;
    public int[] Rank;
}

public class SuffixArray {
    // Orders suffixes by first rank, then by second rank on a tie;
    // returns 0 for equal rank pairs so the comparison is consistent
    private static int CompareSuffixes(Suffix a, Suffix b)
    {
        return (a.Rank[0] == b.Rank[0])
                   ? a.Rank[1].CompareTo(b.Rank[1])
                   : a.Rank[0].CompareTo(b.Rank[0]);
    }

    // Builds the suffix array of txt (length n) by prefix doubling
    public static List<int> BuildSuffixArray(string txt, int n)
    {
        Suffix[] suffixes = new Suffix[n];

        // Rank every suffix by its first two characters
        for (int i = 0; i < n; i++) {
            suffixes[i].Index = i;
            suffixes[i].Rank = new int[2] {
                txt[i] - 'a',
                (i + 1) < n ? txt[i + 1] - 'a' : -1
            };
        }
        Array.Sort(suffixes, CompareSuffixes);

        // ind[i] = current position in suffixes[] of the suffix starting at i
        int[] ind = new int[n];

        // Suffixes are sorted by their first k/2 characters; extend to k
        for (int k = 4; k < 2 * n; k = k * 2) {
            // Assign new first ranks: equal rank pairs get the same rank
            int rank = 0;
            int prevRank = suffixes[0].Rank[0];
            suffixes[0].Rank[0] = rank;
            ind[suffixes[0].Index] = 0;
            for (int i = 1; i < n; i++) {
                if (suffixes[i].Rank[0] == prevRank
                    && suffixes[i].Rank[1] == suffixes[i - 1].Rank[1]) {
                    prevRank = suffixes[i].Rank[0];
                    suffixes[i].Rank[0] = rank;
                }
                else {
                    prevRank = suffixes[i].Rank[0];
                    suffixes[i].Rank[0] = ++rank;
                }
                ind[suffixes[i].Index] = i;
            }

            // Second rank = first rank of the suffix k/2 characters later
            for (int i = 0; i < n; i++) {
                int nextIndex = suffixes[i].Index + k / 2;
                suffixes[i].Rank[1]
                    = (nextIndex < n)
                          ? suffixes[ind[nextIndex]].Rank[0]
                          : -1;
            }
            Array.Sort(suffixes, CompareSuffixes);
        }

        List<int> suffixArr = new List<int>();
        for (int i = 0; i < n; i++)
            suffixArr.Add(suffixes[i].Index);
        return suffixArr;
    }

    // Builds the LCP array from txt and its suffix array in O(n) time
    public static List<int> Kasai(string txt, List<int> suffixArr)
    {
        int n = suffixArr.Count;
        List<int> lcp = new List<int>(new int[n]);

        // invSuff[i] = position of the suffix starting at i in suffixArr
        int[] invSuff = new int[n];
        for (int i = 0; i < n; i++)
            invSuff[suffixArr[i]] = i;

        // k = length of the previous LCP
        int k = 0;

        // Process the suffixes in text order
        for (int i = 0; i < n; i++) {
            // The last suffix in sorted order has no next suffix
            if (invSuff[i] == n - 1) {
                k = 0;
                continue;
            }

            // j = start of the next suffix in sorted order
            int j = suffixArr[invSuff[i] + 1];

            // The first k characters are already known to match
            while (i + k < n && j + k < n
                   && txt[i + k] == txt[j + k])
                k++;
            lcp[invSuff[i]] = k;

            // The next suffix in text order loses the first character
            if (k > 0)
                k--;
        }
        return lcp;
    }

    // Prints the first n elements of arr
    public static void PrintArr(List<int> arr, int n)
    {
        for (int i = 0; i < n; i++)
            Console.Write(arr[i] + " ");
        Console.WriteLine();
    }

    // Driver code
    public static void Main()
    {
        string str = "banana";
        List<int> suffixArr = BuildSuffixArray(str, str.Length);
        int n = suffixArr.Count;
        Console.WriteLine("Suffix Array :");
        PrintArr(suffixArr, n);
        List<int> lcp = Kasai(str, suffixArr);
        Console.WriteLine("\nLCP Array :");
        PrintArr(lcp, n);
    }
}
Javascript
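// JavaScript program to build the LCP array from a suffix array
// using Kasai's algorithm

// Stores a suffix's starting index and the pair of ranks used for sorting
class Suffix {
    constructor(index, rank1, rank2) {
        this.index = index;
        this.rank = [rank1, rank2];
    }
}

// Orders suffixes by first rank, then by second rank on a tie;
// returns 0 for equal rank pairs so the comparison is consistent
function cmp(a, b) {
    return (a.rank[0] === b.rank[0]) ? (a.rank[1] - b.rank[1])
                                     : (a.rank[0] - b.rank[0]);
}

// Builds the suffix array of txt by prefix doubling
function buildSuffixArray(txt) {
    const n = txt.length;
    const suffixes = [];

    // Rank every suffix by its first two characters
    for (let i = 0; i < n; i++) {
        suffixes[i] = new Suffix(i, txt.charCodeAt(i) - 'a'.charCodeAt(0),
            ((i + 1) < n) ? txt.charCodeAt(i + 1) - 'a'.charCodeAt(0) : -1);
    }
    suffixes.sort(cmp);

    // ind[i] = current position in suffixes of the suffix starting at i
    const ind = new Array(n).fill(0);

    // Suffixes are sorted by their first k/2 characters; extend to k
    for (let k = 4; k < 2 * n; k *= 2) {
        // Assign new first ranks: equal rank pairs get the same rank
        let rank = 0;
        let prevRank = suffixes[0].rank[0];
        suffixes[0].rank[0] = rank;
        ind[suffixes[0].index] = 0;
        for (let i = 1; i < n; i++) {
            if (suffixes[i].rank[0] === prevRank &&
                suffixes[i].rank[1] === suffixes[i - 1].rank[1]) {
                prevRank = suffixes[i].rank[0];
                suffixes[i].rank[0] = rank;
            } else {
                prevRank = suffixes[i].rank[0];
                suffixes[i].rank[0] = ++rank;
            }
            ind[suffixes[i].index] = i;
        }

        // Second rank = first rank of the suffix k/2 characters later
        for (let i = 0; i < n; i++) {
            const nextIndex = suffixes[i].index + k / 2;
            suffixes[i].rank[1] = (nextIndex < n) ?
                suffixes[ind[nextIndex]].rank[0] : -1;
        }
        suffixes.sort(cmp);
    }

    const suffixArr = suffixes.map(suffix => suffix.index);
    return suffixArr;
}

// Builds the LCP array from txt and its suffix array in O(n) time
function kasai(txt, suffixArr) {
    const n = suffixArr.length;
    const lcp = new Array(n).fill(0);

    // invSuff[i] = position of the suffix starting at i in suffixArr
    const invSuff = new Array(n).fill(0);
    for (let i = 0; i < n; i++)
        invSuff[suffixArr[i]] = i;

    // k = length of the previous LCP
    let k = 0;

    // Process the suffixes in text order
    for (let i = 0; i < n; i++) {
        // The last suffix in sorted order has no next suffix
        if (invSuff[i] === n - 1) {
            k = 0;
            continue;
        }

        // j = start of the next suffix in sorted order
        let j = suffixArr[invSuff[i] + 1];

        // The first k characters are already known to match
        while (i + k < n && j + k < n && txt[i + k] === txt[j + k])
            k++;
        lcp[invSuff[i]] = k;

        // The next suffix in text order loses the first character
        if (k > 0)
            k--;
    }
    return lcp;
}

function printArr(arr) {
    console.log(arr.join(" "));
}

// Driver code
const str = "banana";
const suffixArr = buildSuffixArray(str);
const n = suffixArr.length;
console.log("Suffix Array:");
printArr(suffixArr);
const lcp = kasai(str, suffixArr);
console.log("\nLCP Array:");
printArr(lcp);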
Output:
Suffix Array :
5 3 1 0 4 2
LCP Array :
1 3 0 0 2 0
Illustration:
txt[] = "banana", suffix[] = {5, 3, 1, 0, 4, 2}
Suffix array represents
{"a", "ana", "anana", "banana", "na", "nana"}
Inverse Suffix Array would be
invSuff[] = {3, 2, 5, 1, 4, 0}
LCP values are evaluated in below order
We first compute the LCP of the first suffix in the text, which is “banana”. We need the next suffix in the suffix array to compute the LCP (remember that lcp[i] is defined as the longest common prefix of suffix[i] and suffix[i+1]). To find the next suffix in suffixArr[], we use invSuff[]. The next suffix is “na”. Since there is no common prefix between “banana” and “na”, the value of LCP for “banana” is 0. “banana” is at index 3 in the suffix array, so we fill lcp[3] as 0.
Next we compute the LCP of the second suffix, which is “anana”. The next suffix of “anana” in the suffix array is “banana”. Since there is no common prefix, the value of LCP for “anana” is 0. It is at index 2 in the suffix array, so we fill lcp[2] as 0.
Next we compute the LCP of the third suffix, which is “nana”. Since there is no next suffix, the value of LCP for “nana” is not defined. We fill lcp[5] as 0.
The next suffix in the text is “ana”. The next suffix of “ana” in the suffix array is “anana”. Since there is a common prefix of length 3, the value of LCP for “ana” is 3. We fill lcp[1] as 3.
Now we compute the LCP for the next suffix in the text, which is “na”. This is where Kasai’s algorithm uses the trick that the LCP value must be at least 2 because the previous LCP value was 3, so the first 2 characters are not compared again. Since there is no character after “na”, the final value of LCP is 2. We fill lcp[4] as 2.
The next suffix in the text is “a”. The LCP value must be at least 1 because the previous value was 2. Since there is no character after “a”, the final value of LCP is 1. We fill lcp[0] as 1.
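The walkthrough above can be reproduced with a small tracing sketch. This is only an illustration: the function name `kasai_trace` and the trace tuple layout (suffix, index in suffix array, LCP value) are assumptions made for this example, and the suffix array is passed in directly.

```python
def kasai_trace(txt, suffix_arr):
    """Run Kasai's algorithm and record the order in which lcp is filled.

    Returns (lcp, trace), where each trace entry is
    (suffix, index in suffix array, LCP value).
    """
    n = len(suffix_arr)
    inv_suff = [0] * n
    for pos, start in enumerate(suffix_arr):
        inv_suff[start] = pos

    lcp = [0] * n
    trace = []
    k = 0
    for i in range(n):  # suffixes are visited in text order
        pos = inv_suff[i]
        if pos == n - 1:  # no next suffix: value is undefined, stored as 0
            k = 0
            trace.append((txt[i:], pos, 0))
            continue
        j = suffix_arr[pos + 1]
        # The first k characters are known to match, so start at offset k.
        while i + k < n and j + k < n and txt[i + k] == txt[j + k]:
            k += 1
        lcp[pos] = k
        trace.append((txt[i:], pos, k))
        if k > 0:
            k -= 1
    return lcp, trace


lcp, trace = kasai_trace("banana", [5, 3, 1, 0, 4, 2])
for suffix, pos, value in trace:
    print(f"{suffix}: lcp[{pos}] = {value}")
print(lcp)  # prints [1, 3, 0, 0, 2, 0]
```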
We will soon be discussing the implementation of search with the help of LCP array and how LCP array helps in reducing time complexity to O(m + Log n).
References:
http://web.stanford.edu/class/cs97si/suffix-array.pdf
http://www.mi.fu-berlin.de/wiki/pub/ABI/RnaSeqP4/suffix-array.pdf
http://codeforces.com/blog/entry/12796