Related Articles

# Counting k-mers via Suffix Array

• Last Updated : 17 May, 2019

Pre-requisite: Suffix Array.

What are k-mers?
The term k-mer typically refers to all the possible substrings of length k that are contained in a string. Counting all the k-mers in DNA/RNA sequencing reads is the preliminary step of many bioinformatics applications.

What is a Suffix Array?
A suffix array is a sorted array of all suffixes of a string. It is a data structure used, among others, in full text indices, data compression algorithms. More information can be found here.

Problem: We are given a string str and an integer k. We have to find all pairs (substr, i) such that substr is a length – k substring of str that occurs exactly i times.

Steps involved in the approach:
Let’s take the word “banana\$” as an example.
Step 1: Compute the suffix array of the given text.

```          6     \$
5     a\$
3     ana\$
1     anana\$
0     banana\$
4     na\$
2     nana\$
```

Step 2: Iterate through the suffix array keeping “curr_count”.
1. If the length of current suffix is less than k, then skip the iteration. That is, if k = 2, then iteration would be skipped when current suffix is \$.
2. If the current suffix begins with the same length – k substring as the previous suffix, then increment curr_count. For example, during fourth iteration current suffix “anana\$” starts with same substring of length k “an” as previous suffix “ana\$” started with. So, we will increment curr_count in this case.
3. If condition 2 is not satisfied, then if length of previous suffix is equal to k, then that it is a valid pair and we will output it along with its current count, otherwise, we will skip that iteration.

```                 curr_count  Valid Pair
6     \$           1
5     a\$          1
3     ana\$        1         (a\$, 1)
1     anana\$      1
0     banana\$     2         (an, 2)
4     na\$         1         (ba, 1)
2     nana\$       1         (na, 2)
```

Examples:

```Input : banana\$ // Input text
Output : (a\$, 1) // k- mers
(an, 2)
(ba, 1)
(na, 2)

Input : geeksforgeeks
Output : (ee, 2)
(ek, 2)
(fo, 1)
(ge, 2)
(ks, 2)
(or, 1)
(sf, 1)
```

The following is the C code for approach explained above:

 `// C program to solve K-mer counting problem``#include ``#include ``#include `` ` `// Structure to store data of a rotation``struct` `rotation {``    ``int` `index;``    ``char``* suffix;``};`` ` `// Compares the rotations and``// sorts the rotations alphabetically``int` `cmpfunc(``const` `void``* x, ``const` `void``* y)``{``    ``struct` `rotation* rx = (``struct` `rotation*)x;``    ``struct` `rotation* ry = (``struct` `rotation*)y;``    ``return` `strcmp``(rx->suffix, ry->suffix);``}`` ` `// Takes input_text and its length as arguments``// and returns the corresponding suffix array``char``** computeSuffixArray(``char``* input_text, ``                               ``int` `len_text)``{``    ``int` `i;`` ` `    ``// Array of structures to store rotations``    ``// and their indexes``    ``struct` `rotation suff[len_text];`` ` `    ``// Structure is needed to maintain old ``    ``// indexes of rotations after sorting them``    ``for` `(i = 0; i < len_text; i++) {``        ``suff[i].index = i;``        ``suff[i].suffix = (input_text + i);``    ``}`` ` `    ``// Sorts rotations using comparison function``    ``// defined above``    ``qsort``(suff, len_text, ``sizeof``(``struct` `rotation), cmpfunc);`` ` `    ``// Stores the suffixes of sorted rotations``    ``char``** suffix_arr = ``       ``(``char``**)``malloc``(len_text * ``sizeof``(``char``*));`` ` `    ``for` `(i = 0; i < len_text; i++) {``        ``suffix_arr[i] = ``        ``(``char``*)``malloc``((len_text + 1) * ``sizeof``(``char``));``        ``strcpy``(suffix_arr[i], suff[i].suffix);``    ``}`` ` `    ``// Returns the computed suffix array``    ``return` `suffix_arr;``}`` ` `// Takes suffix array, its size and valid length as``// arguments and outputs the valid pairs of k - mers``void` `findValidPairs(``char``** suffix_arr, ``int` `n, ``int` `k)``{``    ``int` `curr_count = 1, i;``    ``char``* prev_suff = (``char``*)``malloc``(n * ``sizeof``(``char``));`` ` `    ``// Iterates over the suffix array,``    ``// keeping a current count``    ``for` `(i = 0; i < n; i++) {`` ` `        ``// Skipping the current suffix``        ``// if it has length < valid length``        ``if` `(``strlen``(suffix_arr[i]) < k) {`` ` `            ``if` `(i != 0 && ``strlen``(prev_suff) == k) {``                ``printf``(``"(%s, %d)\n"``, prev_suff, curr_count);``                ``curr_count = 1;}`` ` `            ``strcpy``(prev_suff, suffix_arr[i]);``            ``continue``;``        ``}`` ` `        ``// Incrementing the curr_count if first``        ``// k chars of prev_suff and current suffix``        ``// are same``        ``if` `(!(``memcmp``(prev_suff, suffix_arr[i], k))) {``            ``curr_count++;``        ``}``        ``else` `{`` ` `            ``// Pair is valid when i!=0 (as there is``            ``// no prev_suff for i = 0) and when strlen``            ``// of prev_suff is k``            ``if` `(i != 0 && ``strlen``(prev_suff) == k) {``                ``printf``(``"(%s, %d)\n"``, prev_suff, curr_count);``                ``curr_count = 1;``                ``memcpy``(prev_suff, suffix_arr[i], k);``                ``prev_suff[k] = ``'\0'``;``            ``}``            ``else` `{``                ``memcpy``(prev_suff, suffix_arr[i], k);``                ``prev_suff[k] = ``'\0'``;``                ``continue``;``            ``}``        ``}`` ` `        ``// Modifying prev_suff[i] to current suffix``        ``memcpy``(prev_suff, suffix_arr[i], k);``        ``prev_suff[k] = ``'\0'``;``    ``}`` ` `    ``// Printing the last valid pair``    ``printf``(``"(%s, %d)\n"``, prev_suff, curr_count);``}`` ` `// Driver program to test functions above``int` `main()``{``    ``char` `input_text[] = ``"geeksforgeeks"``;``    ``int` `k = 2;``    ``int` `len_text = ``strlen``(input_text);`` ` `    ``// Computes the suffix array of our text``    ``printf``(``"Input Text: %s\n"``, input_text);``    ``char``** suffix_arr = ``      ``computeSuffixArray(input_text, len_text);`` ` `    ``// Finds and outputs all valid pairs``    ``printf``(``"k-mers: \n"``);``    ``findValidPairs(suffix_arr, len_text, k);`` ` `    ``return` `0;``}`

Output:

```Input Text: banana\$
k-mers:
(a\$, 1)
(an, 2)
(ba, 1)
(na, 2)
```

Time Complexity: O(s*len_text*log(len_text)), assuming s is the length of the longest suffix.

Sources:
1. Suffix Array Wikipedia
2. Suffix Array CMU

Attention reader! Don’t stop learning now. Get hold of all the important DSA concepts with the DSA Self Paced Course at a student-friendly price and become industry ready.  To complete your preparation from learning a language to DS Algo and many more,  please refer Complete Interview Preparation Course.

In case you wish to attend live classes with experts, please refer DSA Live Classes for Working Professionals and Competitive Programming Live for Students.

My Personal Notes arrow_drop_up