Pre-requisite: Suffix Array.
What are k-mers?
The term k-mer typically refers to all the possible substrings of length k that are contained in a string. Counting all the k-mers in DNA/RNA sequencing reads is the preliminary step of many bioinformatics applications.
What is a Suffix Array?
A suffix array is a sorted array of all suffixes of a string; in practice it stores the starting index of each suffix rather than the suffix itself. It is a data structure used in, among other things, full-text indices and data compression algorithms.
Problem: We are given a string str and an integer k. We have to find all pairs (substr, i) such that substr is a length-k substring of str that occurs exactly i times.
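To make the problem statement concrete, here is a minimal brute-force sketch that counts the occurrences of one length-k substring. The function name count_occurrences and its parameters are assumptions made for this illustration; it is a quadratic-time reference for checking results, not the suffix array approach described in this article.

```c
#include <string.h>

/* Count how many times the length-k substring that starts at position pos
   occurs in text, where n is the length of text.
   The caller must ensure pos + k <= n. */
int count_occurrences(const char *text, int n, int k, int pos)
{
    int count = 0;
    /* Try every start position that still has k characters available. */
    for (int j = 0; j + k <= n; j++) {
        if (strncmp(text + pos, text + j, (size_t)k) == 0)
            count++;
    }
    return count;
}
```

For example, with text "banana$", n = 7 and k = 2, position 1 holds the substring "an", which occurs twice, so the pair for it is (an, 2).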
Steps involved in the approach:
Let’s take the word “banana$” with k = 2 as an example.
Step 1: Compute the suffix array of the given text.
Index   Suffix
6       $
5       a$
3       ana$
1       anana$
0       banana$
4       na$
2       nana$
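As a rough illustration of Step 1, the sketch below builds the suffix array by sorting suffix start positions with the standard qsort and strcmp. The function name build_suffix_array_naive and the file-scope pointer sa_text are assumptions made for this example, not the article's own code. Faster construction algorithms exist; this version only aims to be easy to follow.

```c
#include <stdlib.h>
#include <string.h>

/* Text whose suffixes are being sorted. qsort's comparator takes no
   context argument, so this sketch passes the text through a
   file-scope pointer (an assumption of this example). */
static const char *sa_text;

/* Order two start positions by the suffixes that begin there. */
static int compare_suffixes(const void *a, const void *b)
{
    int i = *(const int *)a;
    int j = *(const int *)b;
    return strcmp(sa_text + i, sa_text + j);
}

/* Fill sa[0..n-1] with the start positions of all suffixes of text in
   lexicographic order, where n = strlen(text). Returns n.
   sa must have room for n entries. */
size_t build_suffix_array_naive(const char *text, int *sa)
{
    size_t n = strlen(text);
    for (size_t i = 0; i < n; i++)
        sa[i] = (int)i;
    sa_text = text;
    qsort(sa, n, sizeof sa[0], compare_suffixes);
    return n;
}
```

Called on "banana$" with a 7-element array, this fills the array with 6, 5, 3, 1, 0, 4, 2, which is the order listed in Step 1.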
Step 2: Iterate through the suffix array, keeping a running count “curr_count” of how many suffixes so far begin with the current length-k substring.
1. If the length of the current suffix is less than k, skip the iteration. For example, with k = 2 the iteration is skipped when the current suffix is “$”.
2. If the current suffix begins with the same length-k substring as the previous suffix, increment curr_count. For example, in the fourth iteration the current suffix “anana$” starts with the same length-k substring “an” as the previous suffix “ana$”, so curr_count is incremented.
3. If condition 2 is not satisfied and the length of the previous suffix is at least k, then the previous substring together with its curr_count is a valid pair: output it and reset curr_count to 1. Otherwise, skip that iteration. The pair for the last substring is output after the final iteration.
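Index   Suffix     curr_count   Valid Pair
6       $          1
5       a$         1
3       ana$       1            (a$, 1)
1       anana$     1
0       banana$    2            (an, 2)
4       na$        1            (ba, 1)
2       nana$      1            (na, 2)

Here curr_count is the value on entering each iteration, that is, the count accumulated so far for the previous substring. The last pair, (na, 2), is output once the scan reaches the end of the array.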
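The scan in Step 2 can be sketched as follows, assuming the suffix array has already been computed and is passed in as an array of start positions. The function name count_kmers_from_sa and the output arrays kmer_start and kmer_count are hypothetical names chosen for this illustration. Instead of printing a pair when a run of equal substrings ends, this version accumulates each run's count in place, which gives the same pairs in the same order.

```c
#include <assert.h>
#include <string.h>

/* Scan a suffix array and record every distinct length-k substring of
   text together with its number of occurrences.
   sa holds the n = strlen(text) suffix start positions in sorted order.
   kmer_start[i] receives a start position of the i-th distinct k-mer
   and kmer_count[i] its count; both need room for n entries.
   Returns the number of (substring, count) pairs found. */
int count_kmers_from_sa(const char *text, const int *sa, int n, int k,
                        int *kmer_start, int *kmer_count)
{
    assert(k > 0);
    int pairs = 0;

    for (int i = 0; i < n; i++) {
        int cur = sa[i];

        /* A suffix shorter than k contains no length-k substring. */
        if (n - cur < k)
            continue;

        if (pairs > 0 &&
            strncmp(text + kmer_start[pairs - 1], text + cur, (size_t)k) == 0) {
            /* Same length-k prefix as the run in progress. */
            kmer_count[pairs - 1]++;
        } else {
            /* A new length-k prefix starts a new run. */
            kmer_start[pairs] = cur;
            kmer_count[pairs] = 1;
            pairs++;
        }
    }
    return pairs;
}
```

Each pair can then be printed with printf("(%.*s, %d)\n", k, text + kmer_start[i], kmer_count[i]). Skipping a short suffix without closing the current run is safe, because a suffix shorter than k can never sort between two suffixes that share the same length-k prefix.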
Input  : banana$          // Input text, k = 2
Output : (a$, 1)          // k-mers
         (an, 2)
         (ba, 1)
         (na, 2)

Input  : geeksforgeeks    // Input text, k = 2
Output : (ee, 2)
         (ek, 2)
         (fo, 1)
         (ge, 2)
         (ks, 2)
         (or, 1)
         (rg, 1)
         (sf, 1)
Running a C implementation of the approach explained above on banana$ with k = 2 prints:
Input Text: banana$
k-mers:
(a$, 1)
(an, 2)
(ba, 1)
(na, 2)
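Time Complexity: O(s*len_text*log(len_text)), where s is the length of the longest suffix. This bound corresponds to building the suffix array by comparison sorting: O(len_text*log(len_text)) comparisons, each inspecting up to s characters. Since the longest suffix is the text itself, s = len_text. The scan in Step 2 adds only O(k*len_text).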
Improved By : zhewu