Pre-requisite: Suffix Array.
What are k-mers?
The term k-mer typically refers to all the possible substrings of length k that are contained in a string. Counting all the k-mers in DNA/RNA sequencing reads is the preliminary step of many bioinformatics applications.
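As a small illustration of the definition, the sketch below slides a window of length k over a string and prints each window. The function names kmer_window_count and print_all_kmers are illustrative choices, not part of any library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Number of length-k windows in text: n - k + 1, or 0 when k is 0
   or longer than the text. */
size_t kmer_window_count(const char *text, size_t k)
{
    assert(text != NULL);
    size_t n = strlen(text);
    return (k == 0 || k > n) ? 0 : n - k + 1;
}

/* Print every length-k window of text, one per line, in order of
   starting position. Repeated k-mers are printed once per occurrence. */
void print_all_kmers(const char *text, size_t k)
{
    size_t windows = kmer_window_count(text, k);
    for (size_t i = 0; i < windows; i++)
        printf("%.*s\n", (int)k, text + i);
}
```

For example, print_all_kmers("banana$", 2) prints ba, an, na, an, na, a$ on separate lines; counting k-mers means grouping these repeated windows.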
What is a Suffix Array?
A suffix array is a sorted array of all suffixes of a string, usually stored as the starting indices of those suffixes. It is used, among other applications, in full-text indices and data compression algorithms.
Problem: We are given a string str and an integer k. We have to find all pairs (substr, i) such that substr is a length-k substring of str that occurs exactly i times.
Steps involved in the approach:
Let’s take the text “banana$” with k = 2 as an example.
Step 1: Compute the suffix array of the given text.
    Index   Suffix
    6       $
    5       a$
    3       ana$
    1       anana$
    0       banana$
    4       na$
    2       nana$
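Step 1 can be sketched in C as follows, assuming the suffix array is built by simply sorting the suffix start positions with qsort and strcmp rather than with a faster dedicated construction. The names build_suffix_array, compare_suffixes and g_text are illustrative.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Text being indexed. qsort's comparator takes no context argument,
   so this sketch passes the text through a file-scope pointer. */
static const char *g_text;

/* Order two suffix start positions by the suffixes they denote. */
static int compare_suffixes(const void *a, const void *b)
{
    int i = *(const int *)a;
    int j = *(const int *)b;
    return strcmp(g_text + i, g_text + j);
}

/* Return a malloc'd array holding the n suffix start positions of text
   in lexicographic (byte-wise) order, or NULL if allocation fails.
   The caller frees the result. */
int *build_suffix_array(const char *text, int n)
{
    assert(text != NULL && n > 0);
    int *sa = malloc((size_t)n * sizeof *sa);
    if (sa == NULL)
        return NULL;
    for (int i = 0; i < n; i++)
        sa[i] = i;
    g_text = text;
    qsort(sa, (size_t)n, sizeof *sa, compare_suffixes);
    return sa;
}
```

For "banana$" this yields 6, 5, 3, 1, 0, 4, 2, because '$' compares lower than every letter in ASCII.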
Step 2: Iterate through the suffix array while maintaining a counter curr_count, the number of occurrences of the current k-mer seen so far.
1. If the length of the current suffix is less than k, skip the iteration. For example, with k = 2 the iteration is skipped when the current suffix is “$”.
2. If the current suffix begins with the same length-k substring as the previous suffix, increment curr_count. For example, during the fourth iteration the current suffix “anana$” starts with the same length-k substring “an” as the previous suffix “ana$”, so curr_count is incremented.
3. If condition 2 is not satisfied and the length of the previous suffix is at least k, the length-k prefix of the previous suffix together with curr_count is a valid pair: output it and reset curr_count to 1. If the previous suffix is shorter than k, nothing is output.
4. After the last suffix has been processed, output the pair that is still pending. This is how (na, 2) is produced in the example.
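    Index   Suffix     curr_count   Valid Pair
    6       $          1
    5       a$         1
    3       ana$       1            (a$, 1)
    1       anana$     1
    0       banana$    2            (an, 2)
    4       na$        1            (ba, 1)
    2       nana$      1            (na, 2)
The curr_count column shows the counter as it stands at the start of each iteration, before it is incremented or reset.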
Input : banana$, k = 2
Output : (a$, 1)
         (an, 2)
         (ba, 1)
         (na, 2)

Input : geeksforgeeks, k = 2
Output : (ee, 2)
         (ek, 2)
         (fo, 1)
         (ge, 2)
         (ks, 2)
         (or, 1)
         (rg, 1)
         (sf, 1)
The following is the C code for the approach explained above:
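A minimal sketch of both steps in C, assuming the suffix array is built by sorting suffix start positions with qsort and strcmp. The names kmer_count, count_kmers and print_kmers are illustrative and are not taken from the original listing. Each k-mer is reported as a start position into the text plus a count.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One result: the k-mer text[start .. start+k) and its occurrence count. */
struct kmer_count { int start; int count; };

static const char *g_text;   /* text seen by the qsort comparator */

static int compare_suffixes(const void *a, const void *b)
{
    return strcmp(g_text + *(const int *)a, g_text + *(const int *)b);
}

/* Write up to max_out (k-mer, count) pairs into out, in sorted k-mer
   order. Returns the number of pairs written, or -1 if allocation fails. */
int count_kmers(const char *text, int k, struct kmer_count *out, int max_out)
{
    assert(text != NULL && k > 0 && out != NULL);
    int n = (int)strlen(text);
    if (n < k)
        return 0;

    /* Step 1: suffix array by comparison sort (naive construction). */
    int *sa = malloc((size_t)n * sizeof *sa);
    if (sa == NULL)
        return -1;
    for (int i = 0; i < n; i++)
        sa[i] = i;
    g_text = text;
    qsort(sa, (size_t)n, sizeof *sa, compare_suffixes);

    /* Step 2: scan the sorted suffixes, grouping equal length-k prefixes. */
    int found = 0;
    int prev = -1;        /* start of the last suffix of length >= k */
    int curr_count = 0;
    for (int i = 0; i < n; i++) {
        int start = sa[i];
        if (n - start < k)
            continue;                     /* suffix shorter than k */
        if (prev >= 0 && strncmp(text + prev, text + start, (size_t)k) == 0) {
            curr_count++;                 /* same k-mer as before */
            continue;
        }
        if (prev >= 0 && found < max_out) {
            out[found].start = prev;      /* previous k-mer is complete */
            out[found].count = curr_count;
            found++;
        }
        prev = start;
        curr_count = 1;
    }
    if (prev >= 0 && found < max_out) {   /* flush the pending k-mer */
        out[found].start = prev;
        out[found].count = curr_count;
        found++;
    }
    free(sa);
    return found;
}

/* Print each pair as "(kmer, count)", one per line. */
void print_kmers(const char *text, int k, const struct kmer_count *pairs, int n_pairs)
{
    for (int i = 0; i < n_pairs; i++)
        printf("(%.*s, %d)\n", k, text + pairs[i].start, pairs[i].count);
}
```

This sketch compares each suffix with the previous suffix of length at least k rather than with the immediately preceding entry of the suffix array, so a short suffix sitting between two groups does not cause a pending k-mer to be dropped.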
Output:
Input Text: banana$
k-mers:
(a$, 1)
(an, 2)
(ba, 1)
(na, 2)
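Time Complexity: O(s*len_text*log(len_text)), where s is the length of the longest suffix. Assuming the suffix array is built with a comparison sort, sorting takes O(len_text*log(len_text)) suffix comparisons and each comparison may examine up to s characters. Since the longest suffix is the text itself, s equals len_text in the worst case. The scan in Step 2 adds only O(k*len_text) for the prefix comparisons.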
Improved By : zhewu