Counting k-mers via Suffix Array
Pre-requisite: Suffix Array.
What are k-mers?
The term k-mer typically refers to all the possible substrings of length k that are contained in a string. Counting all the k-mers in DNA/RNA sequencing reads is the preliminary step of many bioinformatics applications.
What is a Suffix Array?
A suffix array is a sorted array of all suffixes of a string, usually stored as the starting positions of those suffixes. It is used in, among other things, full-text indices and data compression algorithms. See the prerequisite article on suffix arrays for details.
Problem: We are given a string str and an integer k. We have to find all pairs (substr, i) such that substr is a length-k substring of str that occurs exactly i times.
Steps involved in the approach:
Let’s take the word “banana$” as an example.
Step 1: Compute the suffix array of the given text.
Index  Suffix
6      $
5      a$
3      ana$
1      anana$
0      banana$
4      na$
2      nana$
Step 2: Iterate through the suffix array while keeping a running count, curr_count, of the k-mer currently being tracked.
1. If the length of the current suffix is less than k, skip the iteration, since the suffix cannot contain a k-mer. For example, with k = 2 the iteration is skipped when the current suffix is $.
2. If the current suffix begins with the same length-k substring as the previous suffix of length at least k, increment curr_count. For example, in the fourth iteration the current suffix “anana$” starts with the same length-k substring “an” as the previous suffix “ana$”, so curr_count is incremented.
3. If condition 2 is not satisfied and a previous suffix of length at least k exists, then its length-k prefix together with curr_count is a valid pair: output it and reset curr_count to 1 for the new k-mer. If no such previous suffix exists, there is nothing to output yet. After the last suffix has been processed, output the pair that is still pending.
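Index  Suffix    curr_count  Valid Pair
6      $         1
5      a$        1
3      ana$      1           (a$, 1)
1      anana$    1
0      banana$   2           (an, 2)
4      na$       1           (ba, 1)
2      nana$     1           (na, 2)
Here curr_count is shown as it stands on entering each iteration, and the final pair (na, 2) is emitted once the last suffix has been processed.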
Input  : banana$ (input text), k = 2
Output : (a$, 1)
         (an, 2)
         (ba, 1)
         (na, 2)

Input  : geeksforgeeks, k = 2
Output : (ee, 2)
         (ek, 2)
         (fo, 1)
         (ge, 2)
         (ks, 2)
         (or, 1)
         (rg, 1)
         (sf, 1)
The following is the C code for the approach explained above:
Output:
Input Text: banana$
k-mers:
(a$, 1)
(an, 2)
(ba, 1)
(na, 2)
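Time Complexity: O(s*len_text*log(len_text)), where s is the length of the longest suffix. This bound corresponds to building the suffix array with a comparison sort: the sort makes O(len_text*log(len_text)) comparisons, and each comparison may examine up to s characters. Since the longest suffix is the text itself, s equals len_text in the worst case. The scan in Step 2 adds only O(k*len_text) and does not change the bound.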