Counting k-mers via Suffix Array

Pre-requisite: Suffix Array.

What are k-mers?
The term k-mer typically refers to all the possible substrings of length k that are contained in a string. Counting all the k-mers in DNA/RNA sequencing reads is the preliminary step of many bioinformatics applications.

What is a Suffix Array?
A suffix array is a sorted array of all suffixes of a string. It is a data structure used, among others, in full text indices, data compression algorithms. More information can be found here.

Problem: We are given a string str and an integer k. We have to find all pairs (substr, i) such that substr is a length – k substring of str that occurs exactly i times.

Steps involved in the approach:
Let’s take the word “banana$” as an example.
Step 1: Compute the suffix array of the given text.



          6     $   
          5     a$
          3     ana$
          1     anana$
          0     banana$
          4     na$                    
          2     nana$

Step 2: Iterate through the suffix array keeping “curr_count”.
1. If the length of current suffix is less than k, then skip the iteration. That is, if k = 2, then iteration would be skipped when current suffix is $.
2. If the current suffix begins with the same length – k substring as the previous suffix, then increment curr_count. For example, during fourth iteration current suffix “anana$” starts with same substring of length k “an” as previous suffix “ana$” started with. So, we will increment curr_count in this case.
3. If condition 2 is not satisfied, then if length of previous suffix is equal to k, then that it is a valid pair and we will output it along with its current count, otherwise, we will skip that iteration.

                 curr_count  Valid Pair
 6     $           1                     
 5     a$          1
 3     ana$        1         (a$, 1)
 1     anana$      1
 0     banana$     2         (an, 2)
 4     na$         1         (ba, 1)               
 2     nana$       1         (na, 2)

Examples:

Input : banana$ // Input text
Output : (a$, 1) // k- mers
         (an, 2)
         (ba, 1)
         (na, 2)
 
Input : geeksforgeeks
Output : (ee, 2) 
         (ek, 2)
         (fo, 1)
         (ge, 2)
         (ks, 2)
         (or, 1)
         (sf, 1)

The following is the C code for approach explained above:

filter_none

edit
close

play_arrow

link
brightness_4
code

// C program to solve K-mer counting problem
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
  
// Structure to store data of a rotation
struct rotation {
    int index;
    char* suffix;
};
  
// Compares the rotations and
// sorts the rotations alphabetically
int cmpfunc(const void* x, const void* y)
{
    struct rotation* rx = (struct rotation*)x;
    struct rotation* ry = (struct rotation*)y;
    return strcmp(rx->suffix, ry->suffix);
}
  
// Takes input_text and its length as arguments
// and returns the corresponding suffix array
char** computeSuffixArray(char* input_text, 
                               int len_text)
{
    int i;
  
    // Array of structures to store rotations
    // and their indexes
    struct rotation suff[len_text];
  
    // Structure is needed to maintain old 
    // indexes of rotations after sorting them
    for (i = 0; i < len_text; i++) {
        suff[i].index = i;
        suff[i].suffix = (input_text + i);
    }
  
    // Sorts rotations using comparison function
    // defined above
    qsort(suff, len_text, sizeof(struct rotation), cmpfunc);
  
    // Stores the suffixes of sorted rotations
    char** suffix_arr = 
       (char**)malloc(len_text * sizeof(char*));
  
    for (i = 0; i < len_text; i++) {
        suffix_arr[i] = 
        (char*)malloc((len_text + 1) * sizeof(char));
        strcpy(suffix_arr[i], suff[i].suffix);
    }
  
    // Returns the computed suffix array
    return suffix_arr;
}
  
// Takes suffix array, its size and valid length as
// arguments and outputs the valid pairs of k - mers
void findValidPairs(char** suffix_arr, int n, int k)
{
    int curr_count = 1, i;
    char* prev_suff = (char*)malloc(n * sizeof(char));
  
    // Iterates over the suffix array,
    // keeping a current count
    for (i = 0; i < n; i++) {
  
        // Skipping the current suffix
        // if it has length < valid length
        if (strlen(suffix_arr[i]) < k) {
  
            if (i != 0 && strlen(prev_suff) == k) {
                printf("(%s, %d)\n", prev_suff, curr_count);
                curr_count = 1;}
  
            strcpy(prev_suff, suffix_arr[i]);
            continue;
        }
  
        // Incrementing the curr_count if first
        // k chars of prev_suff and current suffix
        // are same
        if (!(memcmp(prev_suff, suffix_arr[i], k))) {
            curr_count++;
        }
        else {
  
            // Pair is valid when i!=0 (as there is
            // no prev_suff for i = 0) and when strlen
            // of prev_suff is k
            if (i != 0 && strlen(prev_suff) == k) {
                printf("(%s, %d)\n", prev_suff, curr_count);
                curr_count = 1;
                memcpy(prev_suff, suffix_arr[i], k);
                prev_suff[k] = '\0';
            }
            else {
                memcpy(prev_suff, suffix_arr[i], k);
                prev_suff[k] = '\0';
                continue;
            }
        }
  
        // Modifying prev_suff[i] to current suffix
        memcpy(prev_suff, suffix_arr[i], k);
        prev_suff[k] = '\0';
    }
  
    // Printing the last valid pair
    printf("(%s, %d)\n", prev_suff, curr_count);
}
  
// Driver program to test functions above
int main()
{
    char input_text[] = "geeksforgeeks";
    int k = 2;
    int len_text = strlen(input_text);
  
    // Computes the suffix array of our text
    printf("Input Text: %s\n", input_text);
    char** suffix_arr = 
      computeSuffixArray(input_text, len_text);
  
    // Finds and outputs all valid pairs
    printf("k-mers: \n");
    findValidPairs(suffix_arr, len_text, k);
  
    return 0;
}

chevron_right


Output:

Input Text: banana$ 
k-mers: 
(a$, 1)
(an, 2)
(ba, 1)
(na, 2)

Time Complexity: O(s*len_text*log(len_text)), assuming s is the length of the longest suffix.

Sources:
1. Suffix Array Wikipedia
2. Suffix Array CMU



My Personal Notes arrow_drop_up

Check out this Author's contributed articles.

If you like GeeksforGeeks and would like to contribute, you can also write an article using contribute.geeksforgeeks.org or mail your article to contribute@geeksforgeeks.org. See your article appearing on the GeeksforGeeks main page and help other Geeks.

Please Improve this article if you find anything incorrect by clicking on the "Improve Article" button below.



Improved By : zhewu