Ukkonen’s Suffix Tree Construction – Part 1

Last Updated : 08 Mar, 2024

Suffix trees are very useful in numerous string processing and computational biology problems. Many books and e-resources cover them theoretically, and a few discuss code implementation. But still, I felt something was missing: it is not easy to implement code that constructs a suffix tree and to use it in applications. This is an attempt to bridge the gap between theory and a complete working implementation. Here we will discuss Ukkonen’s Suffix Tree Construction Algorithm in a detailed, step by step way and in multiple parts, from theory to implementation. We will start with the brute force way, work through the concepts and tricks involved in Ukkonen’s algorithm, and discuss the code implementation in the last part. 
Note: You may find some portions of the algorithm difficult to understand on a first or second reading, and that is perfectly fine. With a few more attempts and some thought, you should be able to understand them. 

The book Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology by Dan Gusfield explains the concepts very well. 

A suffix tree T for an m-character string S is a rooted directed tree with exactly m leaves numbered 1 to m (given that the last character of the string is unique in the string). 
 

  • Root can have zero, one or more children.
  • Each internal node, other than the root, has at least two children.
  • Each edge is labelled with a nonempty substring of S.
  • No two edges coming out of same node can have edge-labels beginning with the same character.

Concatenation of the edge-labels on the path from the root to leaf i gives the suffix of S that starts at position i, i.e. S[i…m]. 

Note: Positions start at 1 (they are not zero indexed); later, in the code implementation, we will use zero-indexed positions. 
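To make the 1-based notation concrete, here is a minimal Python sketch that lists the suffixes S[i..m] of a string. The function name suffixes_one_indexed is made up for this illustration; Python itself is zero indexed, so position i maps to the slice starting at i - 1.

```python
def suffixes_one_indexed(s):
    """Return {i: S[i..m]} using the article's 1-based positions."""
    m = len(s)
    # Position i in the article corresponds to Python index i - 1.
    return {i: s[i - 1:] for i in range(1, m + 1)}


if __name__ == "__main__":
    for position, suffix in suffixes_one_indexed("xabxac").items():
        # The length of a suffix is the string depth of its root-to-leaf path.
        print(position, suffix, len(suffix))
```

Each printed line gives the leaf number, the suffix spelled by the path to that leaf, and its length.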

For string S = xabxac with m = 6, the suffix tree looks like the following: 
 

[Figure 1: suffix tree for S = xabxac]

It has one root node, two internal nodes and 6 leaf nodes. 

The string depth of a path is the number of characters on it. 
String depth of the red path is 1 and it represents suffix c starting at position 6 
String depth of the blue path is 4 and it represents suffix bxac starting at position 3 
String depth of the green path is 2 and it represents suffix ac starting at position 5 
String depth of the orange path is 6 and it represents suffix xabxac starting at position 1 

Edges with labels a (green) and xa (orange) are non-leaf edges (they end at an internal node). All other edges are leaf edges (they end at a leaf). 

If one suffix of S matches a prefix of another suffix of S (which can happen when the last character is not unique in the string), then the path for the first suffix will not end at a leaf. 

For string S = xabxa, with m = 5, the following is the suffix tree: 
 

[Figure 2: implicit suffix tree for S = xabxa]

Here we have 5 suffixes: xabxa, abxa, bxa, xa and a. 
The paths for suffixes ‘xa’ and ‘a’ do not end at a leaf. A tree like the one above (Figure 2) is called an implicit suffix tree, as some suffixes (‘xa’ and ‘a’) are not seen explicitly in the tree. 

To avoid this problem, we append a character that is not already present in the string. We normally use $, # etc. as termination characters. 
The following is the suffix tree for string S = xabxa$ with m = 6; now all 6 suffixes end at a leaf. 
 

[Figure: suffix tree for S = xabxa$]
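The effect of the termination character can be checked without building any tree. The sketch below (the helper name hidden_suffixes is hypothetical) lists the suffixes that are a prefix of another suffix; these are exactly the suffixes whose path would not end at a leaf.

```python
def hidden_suffixes(s):
    """Suffixes of s that are a proper prefix of another suffix of s.

    Such a suffix is spelled out along the path of a longer suffix,
    so its own path does not end at a leaf.
    """
    all_suffixes = [s[i:] for i in range(len(s))]
    return [
        suffix
        for suffix in all_suffixes
        if any(other != suffix and other.startswith(suffix) for other in all_suffixes)
    ]


if __name__ == "__main__":
    print(hidden_suffixes("xabxa"))   # suffixes without their own leaf
    print(hidden_suffixes("xabxa$"))  # empty once a unique terminator is added
```

For xabxa this reports ‘xa’ and ‘a’, and for xabxa$ it reports nothing, matching the two trees above.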

A naive algorithm to build a suffix tree 
Given a string S of length m, enter a single edge for suffix S[1..m]$ (the entire string) into the tree, then successively enter suffix S[i..m]$ into the growing tree, for i increasing from 2 to m. Let N_i denote the intermediate tree that encodes all the suffixes from 1 to i. 
So N_{i+1} is constructed from N_i as follows: 
 

  • Start at the root of N_i.
  • Find the longest path from the root which matches a prefix of S[i+1..m]$.
  • The match ends either at a node (say w) or in the middle of an edge [say (u, v)].
  • If it is in the middle of an edge (u, v), break the edge (u, v) into two edges by inserting a new node w just after the last character on the edge that matched a character in S[i+1..m]$ and just before the first character on the edge that mismatched. The new edge (u, w) is labelled with the part of the (u, v) label that matched S[i+1..m]$, and the new edge (w, v) is labelled with the remaining part of the (u, v) label.
  • Create a new edge (w, i+1) from w to a new leaf labelled i+1, and label the new edge with the unmatched part of suffix S[i+1..m]$.

This takes O(m^2) time to build the suffix tree for a string S of length m. 
Following are a few steps of building the suffix tree for string “xabxa$” using the above algorithm: 
 

[Figures: steps of the naive construction of the suffix tree for xabxa$]
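The naive algorithm above can be sketched in Python as follows. This is only an illustration under some assumptions: the Node class, the function names and the choice to store each edge as a (label, child) pair keyed by its first character are made up for this sketch, and edge labels are stored as plain strings for readability.

```python
class Node:
    """Suffix tree node. children maps first character -> (edge label, child node)."""

    def __init__(self, leaf_number=None):
        self.children = {}
        self.leaf_number = leaf_number  # 1-based suffix start for leaves, else None


def build_naive_suffix_tree(s, terminator="$"):
    """Insert the suffixes of s + terminator one by one, O(m^2) overall."""
    assert terminator not in s, "terminator must not occur in the string"
    text = s + terminator
    root = Node()
    for start in range(len(text)):
        suffix = text[start:]
        node, pos = root, 0  # pos = number of suffix characters matched so far
        while True:
            first = suffix[pos]
            if first not in node.children:
                # Match ended at a node: hang a new leaf edge off it.
                node.children[first] = (suffix[pos:], Node(start + 1))
                break
            label, child = node.children[first]
            k = 0
            while k < len(label) and label[k] == suffix[pos + k]:
                k += 1
            if k == len(label):
                # Whole edge matched: continue from the child node.
                node, pos = child, pos + k
                continue
            # Mismatch inside the edge: split it at a new internal node w.
            w = Node()
            w.children[label[k]] = (label[k:], child)
            w.children[suffix[pos + k]] = (suffix[pos + k:], Node(start + 1))
            node.children[first] = (label[:k], w)
            break
    return root


def collect_suffixes(node, path=""):
    """Return {leaf number: string spelled from the root to that leaf}."""
    if node.leaf_number is not None:
        return {node.leaf_number: path}
    found = {}
    for label, child in node.children.values():
        found.update(collect_suffixes(child, path + label))
    return found


if __name__ == "__main__":
    tree = build_naive_suffix_tree("xabxa")
    for leaf, suffix in sorted(collect_suffixes(tree).items()):
        print(leaf, suffix)
```

Because the terminator is unique, a suffix can never be matched completely along an existing path, so every insertion ends by creating exactly one new leaf. A real implementation would normally store (start, end) indices into the string instead of copying substrings onto the edges.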

Implicit suffix tree 
While generating a suffix tree using Ukkonen’s algorithm, we will see implicit suffix trees in intermediate steps, depending on the characters in string S. In an implicit suffix tree, there is no edge with a $ (or # or any other termination character) label and no internal node with only one edge going out of it. 
To get the implicit suffix tree from the suffix tree for S$: 
 

  • Remove every terminal symbol $ from the edge labels of the tree.
  • Remove any edge that has no label.
  • Remove any node that has only one edge going out of it and merge the edges.

 

[Figure: implicit suffix tree]
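The three steps above can be sketched on a small hand-written tree. The nested-dict representation here is an assumption made for the example: an internal node is a dict mapping edge labels to children, and a leaf is just its leaf number. The tree literal is the suffix tree of xabxa$ written out by hand.

```python
# Suffix tree for "xabxa$": internal node = dict {edge label: child}, leaf = int.
SUFFIX_TREE_XABXA = {
    "xa": {"bxa$": 1, "$": 4},
    "a": {"bxa$": 2, "$": 5},
    "bxa$": 3,
    "$": 6,
}


def to_implicit(node, terminator="$"):
    """Apply the three steps to a nested-dict suffix tree and return a new tree."""
    if not isinstance(node, dict):
        return node  # a leaf stays as it is
    result = {}
    for label, child in node.items():
        label = label.replace(terminator, "")  # step 1: strip the terminator
        if not label:
            continue  # step 2: drop edges left without a label
        child = to_implicit(child, terminator)
        # Step 3: a node with a single outgoing edge is merged into its parent edge.
        while isinstance(child, dict) and len(child) == 1:
            (sub_label, sub_child), = child.items()
            label += sub_label
            child = sub_child
        result[label] = child
    return result


if __name__ == "__main__":
    print(to_implicit(SUFFIX_TREE_XABXA))
```

The result has only three leaf edges (xabxa, abxa and bxa); the suffixes ‘xa’ and ‘a’ no longer have leaves of their own, which is what makes the tree implicit.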

High Level Description of Ukkonen’s algorithm 
Ukkonen’s algorithm constructs an implicit suffix tree T_i for each prefix S[1..i] of S (of length m). 
It first builds T_1 using the 1st character, then T_2 using the 2nd character, then T_3 using the 3rd character, …, and T_m using the mth character. 
Implicit suffix tree T_{i+1} is built on top of implicit suffix tree T_i. 
The true suffix tree for S is built from T_m by adding $. 
At any time, Ukkonen’s algorithm has built the suffix tree for the characters seen so far, so it has the on-line property, which may be useful in some situations. 
Time taken is O(m) once the implementation tricks covered in the later parts are applied; a direct implementation of the high-level description below takes O(m^3) time. 

Ukkonen’s algorithm is divided into m phases (one phase for each character in the string of length m). 
In phase i+1, tree T_{i+1} is built from tree T_i. 

Each phase i+1 is further divided into i+1 extensions, one for each of the i+1 suffixes of S[1..i+1]. 
In extension j of phase i+1, the algorithm first finds the end of the path from the root labelled with substring S[j..i]. 
It then extends the substring by adding the character S[i+1] to its end (if it is not there already). 
In extension 1 of phase i+1, we put string S[1..i+1] in the tree. Here S[1..i] is already present in the tree due to the previous phase i. We just need to add the character S[i+1] to the tree (if not there already). 
In extension 2 of phase i+1, we put string S[2..i+1] in the tree. Here S[2..i] is already present in the tree due to the previous phase i. We just need to add the character S[i+1] to the tree (if not there already). 
In extension 3 of phase i+1, we put string S[3..i+1] in the tree. Here S[3..i] is already present in the tree due to the previous phase i. We just need to add the character S[i+1] to the tree (if not there already). 
… 
In extension i+1 of phase i+1, we put string S[i+1..i+1] in the tree. This is just one character, which may not be in the tree yet (if the character is seen for the first time). If so, we just add a new leaf edge with label S[i+1]. 

High Level Ukkonen’s algorithm 
Construct tree T_1 
For i from 1 to m-1 do 
begin {phase i+1} 
          For j from 1 to i+1 
                    begin {extension j} 
                    Find the end of the path from the root labelled S[j..i] in the current tree. 
                    Extend that path by adding character S[i+1] if it is not there already 
          end; 
end; 
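The phase and extension loops can be tried out directly in Python. To keep the sketch short it makes a deliberate simplification that is not part of Ukkonen’s algorithm: it grows an uncompressed trie (one character per edge, stored as nested dicts) instead of a suffix tree with substring labels. The loop structure is the same as in the pseudocode, and the running time is O(m^3). The function names are made up for this example, and the string is assumed to be non-empty.

```python
def build_suffix_trie_by_phases(s):
    """High-level phases and extensions on an uncompressed trie of nested dicts."""
    root = {}
    root[s[0]] = {}  # T_1: only the first character
    # Zero-indexed i here plays the role of i in the article's phase i+1,
    # which adds the character S[i+1] = s[i].
    for i in range(1, len(s)):
        for j in range(i + 1):  # extensions 1..i+1 (zero-indexed j)
            node = root
            for ch in s[j:i]:  # find the end of the path labelled S[j..i]
                node = node[ch]
            node.setdefault(s[i], {})  # add S[i+1] only if it is not there already
    return root


def contains(root, pattern):
    """True if pattern can be spelled out starting from the root."""
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


if __name__ == "__main__":
    trie = build_suffix_trie_by_phases("xabxac")
    print(contains(trie, "bxa"), contains(trie, "xc"))
```

Walking S[j..i] never fails, because the previous phase already put every suffix of S[1..i] into the structure. After the last phase every substring of the input can be spelled from the root.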

Suffix extension is all about adding the next character into the suffix tree built so far. 
In extension j of phase i+1, the algorithm finds the end of S[j..i] (which is already in the tree due to the previous phase i) and then extends S[j..i] to make sure the suffix S[j..i+1] is in the tree. 

There are 3 extension rules: 
Rule 1: If the path from the root labelled S[j..i] ends at a leaf edge (i.e. S[i] is the last character on the leaf edge), then character S[i+1] is just added to the end of the label on that leaf edge. 

Rule 2: If the path from the root labelled S[j..i] does not end at a leaf (i.e. there are more characters after S[i] on the path) and the next character is not S[i+1], then a new leaf edge with label S[i+1] and leaf number j is created at the end of S[j..i]. 
A new internal node is also created if S[j..i] ends inside (in the middle of) an edge. 

Rule 3: If the path from the root labelled S[j..i] does not end at a leaf (i.e. there are more characters after S[i] on the path) and the next character is S[i+1] (already in the tree), do nothing. 

One important point to note here is that from a given node (root or internal), there is one and only one edge starting with a given character. No two edges going out of a node start with the same character. 
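To see which rule fires in each extension, the sketch below decides the rule from plain substring checks on the prefix read so far. This is not how the algorithm finds the rule (it walks the tree); it is only an equivalent check that is handy for experimenting. The function names are hypothetical and indices are zero based.

```python
def extension_rule(s, i, j):
    """Rule (1, 2 or 3) applied when s[i] is added after beta = s[j:i].

    Zero-indexed: 1 <= i < len(s) selects the phase, 0 <= j <= i the extension.
    """
    prefix, beta, next_char = s[:i], s[j:i], s[i]
    # beta ends at a leaf exactly when it occurs nowhere else in the prefix,
    # i.e. when its first occurrence is the one at the very end.
    if beta and prefix.find(beta) == j:
        return 1
    # Some path continues after beta; Rule 3 if one continues with next_char.
    if beta + next_char in prefix:
        return 3
    return 2


def rules_for_phase(s, i):
    """Rules used by all extensions of the phase that adds s[i]."""
    return [extension_rule(s, i, j) for j in range(i + 1)]


if __name__ == "__main__":
    text = "xabxac"
    for i in range(1, len(text)):
        print("phase", i + 1, "adds", text[i], rules_for_phase(text, i))
```

For xabxac, the phase that adds the second x ends with Rule 3 (x is already in the tree), while the final phase that adds c uses Rule 2 three times and creates the last three leaves.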

Following is a step by step suffix tree construction of string xabxac using Ukkonen’s algorithm: 
 

[Figures: step-by-step construction of the suffix tree for xabxac using Ukkonen’s algorithm]

In the next parts (Part 2, Part 3, Part 4 and Part 5), we will discuss suffix links, active points and a few tricks, and finally the code implementation (Part 6). 


