
Inverting the Burrows – Wheeler Transform

Prerequisite: Burrows – Wheeler Data Transform Algorithm 

Why inverse of BWT? The main idea behind it: 

1. The remarkable thing about the BWT algorithm is that the transform is invertible with minimal data overhead. 

2. Computing the inverse of BWT means undoing the BWT to recover the original string. The naive approach is slow and memory intensive because it requires storing all |text| cyclic rotations of the string text. 

3. Let’s discuss a faster algorithm that needs only two inputs: 

  1. bwt_arr[] which is the last column of sorted rotations list given as “annb$aa”
  2. ‘x’ which is the row index at which our original string “banana$” appears in the sorted rotations list. We can see that ‘x’ is 4 in the example below.
 Row Index    Original Rotations    Sorted Rotations
 ~~~~~~~~~    ~~~~~~~~~~~~~~~~~~    ~~~~~~~~~~~~~~~~
    0             banana$               $banana
    1             anana$b               a$banan
    2             nana$ba               ana$ban
    3             ana$ban               anana$b
   *4             na$bana               banana$
    5             a$banan               na$bana
    6             $banana               nana$ba
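
To make the two inputs concrete, here is a minimal sketch of how bwt_arr[] and ‘x’ could be produced from the original string by the naive forward transform. The function name `bwt_with_index` is illustrative and not part of the article's code; the sketch assumes the text ends with a unique sentinel such as ‘$’ that sorts before every other character.

```python
def bwt_with_index(text):
    """Naive forward BWT: returns (last column, row index of the original text)."""
    n = len(text)
    # All cyclic left-rotations of text; rotation j is text shifted j characters left.
    rotations = [text[j:] + text[:j] for j in range(n)]
    sorted_rotations = sorted(rotations)
    # The BWT is the last column of the sorted rotations.
    bwt_arr = "".join(row[-1] for row in sorted_rotations)
    # x is the row at which the unshifted text appears.
    x = sorted_rotations.index(text)
    return bwt_arr, x


bwt_arr, x = bwt_with_index("banana$")
print(bwt_arr, x)  # annb$aa 4
```

This is the memory-hungry construction the faster inversion avoids; it is shown only to explain where the two inputs come from.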

4. An important observation: if the jth original rotation (the original string shifted j characters to the left) is the ith row in the sorted order, then l_shift[i] records the row of the sorted order at which the (j+1)st original rotation appears. For example, the 0th original rotation “banana$” is row 4 of the sorted order, and since l_shift[4] is 3, the next original rotation “anana$b” is row 3 of the sorted order.

Row Index  Original Rotations  Sorted Rotations l_shift 
~~~~~~~~~ ~~~~~~~~~~~~~~~~~~  ~~~~~~~~~~~~~~~~  ~~~~~~~
   0           banana$         $banana           4
   1           anana$b         a$banan           0
   2           nana$ba         ana$ban           5
   3           ana$ban         anana$b           6
  *4           na$bana         banana$           3
   5           a$banan         na$bana           1
   6           $banana         nana$ba           2
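
The definition of l_shift[] can be checked directly when the original string is known. The sketch below is only a verification aid (the decoder never has access to the original text); the helper name `l_shift_from_rotations` and the variables `order` and `row_of` are assumptions introduced for illustration.

```python
def l_shift_from_rotations(text):
    """Builds l_shift[] straight from its definition, using the original text."""
    n = len(text)
    rotations = [text[j:] + text[:j] for j in range(n)]
    # order[i] = j means sorted row i holds the j-th original rotation.
    order = sorted(range(n), key=lambda j: rotations[j])
    # row_of[j] = sorted row in which the j-th original rotation appears.
    row_of = [0] * n
    for i, j in enumerate(order):
        row_of[j] = i
    # Row i holds rotation j, so l_shift[i] is the row of rotation j + 1 (cyclically).
    return [row_of[(j + 1) % n] for j in order]


print(l_shift_from_rotations("banana$"))  # [4, 0, 5, 6, 3, 1, 2]
```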

5. Our job is to deduce l_shift[] from the information available to us, namely bwt_arr[] and ‘x’, and then use it to compute the inverse of BWT.

How to compute l_shift[]?

1. We know the BWT, which is “annb$aa”. This implies that we know all the characters of our original string, even though they are permuted into the wrong order.

2. By sorting bwt_arr[], we can reconstruct the first column of the sorted rotations list; we call it sorted_bwt[].

 Row Index    Sorted Rotations   bwt_arr    l_shift  
 ~~~~~~~~~    ~~~~~~~~~~~~~~~~~~~~~~~~~~    ~~~~~~~   
     0         $  ?  ?  ?  ?  ?  a             4
     1         a  ?  ?  ?  ?  ?  n
     2         a  ?  ?  ?  ?  ?  n
     3         a  ?  ?  ?  ?  ?  b
    *4         b  ?  ?  ?  ?  ?  $             3
     5         n  ?  ?  ?  ?  ?  a
     6         n  ?  ?  ?  ?  ?  a

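3. Since ‘$’ occurs only once in sorted_bwt[] and rotations are formed using cyclic wrap-around, shifting row 0 (which starts with ‘$’) one character to the left gives the only rotation that ends with ‘$’. That is row 4, because bwt_arr[4] = ‘$’, so l_shift[0] = 4. Similarly, ‘b’ occurs once and bwt_arr[3] = ‘b’, so l_shift[4] = 3. 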

4. But, because ‘n’ appears twice, it seems ambiguous whether l_shift[5] = 1 and l_shift[6] = 2 or whether l_shift[5] = 2 and l_shift[6] = 1. 

5. The rule that resolves this ambiguity: if rows i and j both start with the same letter and i < j, then l_shift[i] < l_shift[j]. This implies l_shift[5] = 1 and l_shift[6] = 2. Continuing in the same fashion, l_shift[] works out to the following.

 Row Index    Sorted Rotations   bwt_arr    l_shift  
 ~~~~~~~~~    ~~~~~~~~~~~~~~~~~~~~~~~~~~    ~~~~~~~   
     0         $  ?  ?  ?  ?  ?  a             4
     1         a  ?  ?  ?  ?  ?  n             0
     2         a  ?  ?  ?  ?  ?  n             5
     3         a  ?  ?  ?  ?  ?  b             6
    *4         b  ?  ?  ?  ?  ?  $             3
     5         n  ?  ?  ?  ?  ?  a             1
     6         n  ?  ?  ?  ?  ?  a             2
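
The tie-breaking rule amounts to a stable sort: listing the positions of bwt_arr[] in order of their characters, with equal characters kept in their original order, yields l_shift[] directly. A minimal sketch, relying on Python's `sorted()` being stable; the function name `compute_l_shift` is illustrative.

```python
def compute_l_shift(bwt_arr):
    """l_shift[i] = position in bwt_arr of the character that heads sorted row i."""
    # sorted() is stable, so equal characters keep their increasing index order,
    # which is exactly the rule l_shift[i] < l_shift[j] for i < j.
    return sorted(range(len(bwt_arr)), key=lambda i: bwt_arr[i])


print(compute_l_shift("annb$aa"))  # [4, 0, 5, 6, 3, 1, 2]
```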

Why is the ambiguity resolving rule valid? 

  1. The rotations are sorted, so row 5 is lexicographically less than row 6. 
  2. Thus, the five unknown characters in row 5 must be less than the five unknown characters in row 6 (as both rows start with ‘n’ and end with ‘a’). 
  3. We also know that, of the two rows that end with ‘n’, row 1 comes before row 2. 
  4. Shifting rows 5 and 6 one character to the left moves the leading ‘n’ to the end, so their five unknown characters are precisely the first five characters of rows 1 and 2. The shifted rows must keep their relative order, or this would contradict the fact that the rotations were sorted. 
  5. Thus, l_shift[5] = 1 and l_shift[6] = 2. 

Implementation steps: 

1. Sort BWT: Using qsort(), we arrange the characters of bwt_arr[] in sorted order and store the result in sorted_bwt[].

2. Compute l_shift[]: 

i. We take an array of pointers, struct node *arr[], each of which points to the head of a linked list. 

ii. For each distinct character of bwt_arr[], the linked list headed at that character's entry stores, in increasing order, the indices at which the character occurs in bwt_arr[].

   i        *arr[128]           Linked Lists
~~~~~~~~~    ~~~~~~~~~      ~~~~~~~~~~~~~~~~~~~~~~  
   36          $     ----->    4 -> NULL
   97          a     ----->    0 -> 5 -> 6 -> NULL
   110         n     ----->    1 -> 2 -> NULL
   98          b     ----->    3 -> NULL

iii. Walking through sorted_bwt[] from left to right, we take the front node of the current character's linked list, store its index as the next l_shift[] value, and advance that list's head.

     int[] l_shift = { 4, 0, 5, 6, 3, 1, 2 };
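
Steps i to iii can be sketched with one queue per character in place of hand-written linked lists. This is an illustrative variant, not the article's C code: `compute_l_shift_buckets` is a made-up name, and `collections.deque` stands in for the linked list so that removing the front element is O(1).

```python
from collections import defaultdict, deque


def compute_l_shift_buckets(bwt_arr):
    # Steps i and ii: one queue per distinct character, holding the
    # indices at which that character occurs in bwt_arr, in order.
    buckets = defaultdict(deque)
    for i, ch in enumerate(bwt_arr):
        buckets[ch].append(i)
    # Step iii: walk sorted_bwt and pop the front index of each character's queue.
    sorted_bwt = sorted(bwt_arr)
    return [buckets[ch].popleft() for ch in sorted_bwt]


print(compute_l_shift_buckets("annb$aa"))  # [4, 0, 5, 6, 3, 1, 2]
```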

3. Iterating as many times as the string is long, we decode the BWT by setting x = l_shift[x] and outputting bwt_arr[x]. Starting from x = 4, the first two iterations are:

     x = l_shift[4] 
     x = 3
     bwt_arr[3] = 'b'

     x = l_shift[3] 
     x = 6
     bwt_arr[6] = 'a'
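
The remaining iterations follow the same pattern. A short trace, using the l_shift[] values and the starting index x = 4 from the example, prints every step; the variable `decoded` is introduced here only to collect the output.

```python
bwt_arr = "annb$aa"
l_shift = [4, 0, 5, 6, 3, 1, 2]
x = 4  # row of the original string in the sorted rotations

decoded = []
for step in range(len(bwt_arr)):
    # Row l_shift[x] is row x shifted left by one, so its last
    # character (bwt_arr[l_shift[x]]) is the first character of row x.
    x = l_shift[x]
    decoded.append(bwt_arr[x])
    print(f"step {step}: x = {x}, bwt_arr[x] = {bwt_arr[x]!r}")

print("".join(decoded))  # banana$
```

After the last step x is back at 4, which reflects that the walk visits every row of the cycle exactly once.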

Examples:

Input : annb$aa // Burrows - Wheeler Transform
        4 // Row index at which original message 
          // appears in sorted rotations list 
Output : banana$

Input : ard$rcaaaabb
        3
Output : abracadabra$
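
Both examples can be reproduced with a compact sketch that takes the row index as a parameter instead of hard-coding it. The function name `inverse_bwt` is an assumption for illustration, and a stable sort is used here in place of the linked lists described above.

```python
def inverse_bwt(bwt_arr, x):
    """Recovers the original string from its BWT and the row index x."""
    n = len(bwt_arr)
    # Stable sort of positions by character gives l_shift[].
    l_shift = sorted(range(n), key=lambda i: bwt_arr[i])
    out = []
    for _ in range(n):
        x = l_shift[x]
        out.append(bwt_arr[x])
    return "".join(out)


print(inverse_bwt("annb$aa", 4))       # banana$
print(inverse_bwt("ard$rcaaaabb", 3))  # abracadabra$
```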

The following C code implements the steps explained above: 

C




// C program to find inverse of Burrows
// Wheeler transform
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
 
// Structure to store info of a node of
// linked list
struct node {
    int data;
    struct node* next;
};
 
// Compares two characters of bwt_arr[] so that
// qsort() sorts them in ascending order
int cmpfunc(const void* a, const void* b)
{
    const unsigned char ca = *(const unsigned char*)a;
    const unsigned char cb = *(const unsigned char*)b;
    return (ca > cb) - (ca < cb);
}
 
// Creates the new node
struct node* getNode(int i)
{
    struct node* nn =
        (struct node*)malloc(sizeof(struct node));
    nn->data = i;
    nn->next = NULL;
    return nn;
}
 
// Does insertion at end in the linked list
void addAtLast(struct node** head, struct node* nn)
{
    if (*head == NULL) {
        *head = nn;
        return;
    }
    struct node* temp = *head;
    while (temp->next != NULL)
        temp = temp->next;
    temp->next = nn;
}
 
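// Stores the index held by the front node in l_shift[index],
// then removes and frees that node
void computeLShift(struct node** head, int index,
                   int* l_shift)
{
    struct node* front = *head;
    l_shift[index] = front->data;
    *head = front->next;
    free(front);
}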
 
void invert(char bwt_arr[])
{
    int i, len_bwt = strlen(bwt_arr);
    // +1 leaves room for the terminating '\0' copied by strcpy()
    char* sorted_bwt = (char*)malloc((len_bwt + 1) * sizeof(char));
    strcpy(sorted_bwt, bwt_arr);
    int* l_shift = (int*)malloc(len_bwt * sizeof(int));
 
    // Index at which original string appears
    // in the sorted rotations list
    // (hard-coded for the example "annb$aa")
    int x = 4;
 
    // Sorts the characters of bwt_arr[] alphabetically
    qsort(sorted_bwt, len_bwt, sizeof(char), cmpfunc);
 
    // Array of pointers that act as head nodes
    // to linked lists created to compute l_shift[]
    struct node* arr[128] = { NULL };
 
    // Takes each distinct character of bwt_arr[] as head
    // of a linked list and appends to it the new node
    // whose data part contains index at which
    // character occurs in bwt_arr[]
    for (i = 0; i < len_bwt; i++) {
        struct node* nn = getNode(i);
        addAtLast(&arr[bwt_arr[i]], nn);
    }
 
    // Walks through sorted_bwt[] and, for each character,
    // takes the front node of its linked list to find l_shift[]
    for (i = 0; i < len_bwt; i++)
        computeLShift(&arr[sorted_bwt[i]], i, l_shift);
 
    printf("Burrows - Wheeler Transform: %s\n", bwt_arr);
    printf("Inverse of Burrows - Wheeler Transform: ");
    // Decodes the bwt
    for (i = 0; i < len_bwt; i++) {
        x = l_shift[x];
        printf("%c", bwt_arr[x]);
    }
    printf("\n");
 
    // Releases the working buffers
    free(sorted_bwt);
    free(l_shift);
}
 
// Driver program to test functions above
int main()
{
    char bwt_arr[] = "annb$aa";
    invert(bwt_arr);
    return 0;
}


C++




#include <bits/stdc++.h>
using namespace std;
 
class InvertBWT {
public:
    static void invert(const string& bwtArr) {
        int lenBwt = bwtArr.length();
        string sortedBwt = bwtArr;
        sort(sortedBwt.begin(), sortedBwt.end());
        vector<int> lShift(lenBwt);
 
        // Index at which original string appears
        // in the sorted rotations list
        int x = 4;
 
        // arr[c] holds, in increasing order, the indices
        // at which character c occurs in bwtArr
        vector<deque<int>> arr(128);
        for (int i = 0; i < lenBwt; i++) {
            arr[bwtArr[i]].push_back(i);
        }
 
        // For each character of sortedBwt, the front of
        // its list is the next lShift value
        for (int i = 0; i < lenBwt; i++) {
            lShift[i] = arr[sortedBwt[i]].front();
            arr[sortedBwt[i]].pop_front();
        }
 
        // Decodes the bwt
        string decoded;
        for (int i = 0; i < lenBwt; i++) {
            x = lShift[x];
            decoded += bwtArr[x];
        }
 
        cout << "Burrows - Wheeler Transform: " << bwtArr << endl;
        cout << "Inverse of Burrows - Wheeler Transform: " << decoded << endl;
    }
};
 
// Driver code
int main() {
    string bwtArr = "annb$aa";
    InvertBWT::invert(bwtArr);
    return 0;
}


Java




import java.io.*;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
 
class Node {
    public char Data;
    public Node Next;
 
    public Node(char data) {
        Data = data;
        Next = null;
    }
}
 
class InvertBWT {
    static void invert(String bwtArr) {
        int lenBwt = bwtArr.length();
        String sortedBwt = new String(bwtArr.chars().sorted().toArray(), 0, lenBwt);
        int[] lShift = new int[lenBwt];
 
        // Index at which original string appears
        // in the sorted rotations list
        int x = 4;
 
        // Array of lists to compute l_shift
        List<Integer>[] arr = new ArrayList[128];
        for (int i = 0; i < arr.length; i++) {
            arr[i] = new ArrayList<Integer>();
        }
 
        // Adds each character of bwtArr to a linked list
        // and appends to it the new node whose data part
        // contains index at which character occurs in bwtArr
        for (int i = 0; i < lenBwt; i++) {
            arr[bwtArr.charAt(i)].add(i);
        }
 
        // Adds each character of sortedBwt to a linked list
        // and finds lShift
        for (int i = 0; i < lenBwt; i++) {
            lShift[i] = arr[sortedBwt.charAt(i)].get(0);
            arr[sortedBwt.charAt(i)].remove(0);
        }
 
        // Decodes the bwt
        char[] decoded = new char[lenBwt];
        for (int i = 0; i < lenBwt; i++) {
            x = lShift[x];
            decoded[lenBwt-1-i] = bwtArr.charAt(x);
        }
        String decodedStr = new String(decoded);
 
        System.out.printf("Burrows - Wheeler Transform: %s\n", bwtArr);
        System.out.printf("Inverse of Burrows - Wheeler Transform: %s\n", new StringBuilder(decodedStr).reverse().toString());
    }
     
// Driver code
 public static void main(String[] args) {
        String bwtArr = "annb$aa";
        invert(bwtArr);
    }
 
}


Python3




# Python program to find inverse of Burrows - Wheeler transform
 
def invert(bwt_arr):
    len_bwt = len(bwt_arr)
    sorted_bwt = sorted(bwt_arr)
    l_shift = [0] * len_bwt
 
    # Index at which original string appears
    # in the sorted rotations list
    x = 4
 
    # Array of lists to compute l_shift
    arr = [[] for i in range(128)]
 
    # Adds each character of bwt_arr to a linked list
    # and appends to it the new node whose data part
    # contains index at which character occurs in bwt_arr
    for i in range(len_bwt):
        arr[ord(bwt_arr[i])].append(i)
 
    # Adds each character of sorted_bwt to a linked list
    # and finds l_shift
    for i in range(len_bwt):
        l_shift[i] = arr[ord(sorted_bwt[i])].pop(0)
 
    # Decodes the bwt
    decoded = [''] * len_bwt
    for i in range(len_bwt):
        x = l_shift[x]
        decoded[len_bwt-1-i] = bwt_arr[x]
    decoded_str = ''.join(decoded)
 
    print("Burrows - Wheeler Transform:", bwt_arr)
    print("Inverse of Burrows - Wheeler Transform:", decoded_str[::-1])
 
# Driver program to test functions above
if __name__ == "__main__":
    bwt_arr = "annb$aa"
    invert(bwt_arr)
 
# This code is contributed by Prince


C#




using System;
using System.Collections.Generic;
using System.Linq;
 
public class Node {
    public char Data { get; set; }
    public Node Next { get; set; }
 
    public Node(char data) {
        Data = data;
        Next = null;
    }
}
 
public class InvertBWT {
    public static void Main(string[] args) {
        string bwtArr = "annb$aa";
        Invert(bwtArr);
    }
 
    public static void Invert(string bwtArr) {
        int lenBwt = bwtArr.Length;
        string sortedBwt = new string(bwtArr.ToCharArray().OrderBy(c => c).ToArray());
        int[] lShift = new int[lenBwt];
 
        // Index at which original string appears
        // in the sorted rotations list
        int x = 4;
 
        // Array of lists to compute l_shift
        List<int>[] arr = new List<int>[128];
        for (int i = 0; i < arr.Length; i++) {
            arr[i] = new List<int>();
        }
 
        // Adds each character of bwtArr to a linked list
        // and appends to it the new node whose data part
        // contains index at which character occurs in bwtArr
        for (int i = 0; i < lenBwt; i++) {
            arr[bwtArr[i]].Add(i);
        }
 
        // Adds each character of sortedBwt to a linked list
        // and finds lShift
        for (int i = 0; i < lenBwt; i++) {
            lShift[i] = arr[sortedBwt[i]][0];
            arr[sortedBwt[i]].RemoveAt(0);
        }
 
        // Decodes the bwt
        char[] decoded = new char[lenBwt];
        for (int i = 0; i < lenBwt; i++) {
            x = lShift[x];
            decoded[lenBwt-1-i] = bwtArr[x];
        }
        string decodedStr = new string(decoded);
 
        Console.WriteLine("Burrows - Wheeler Transform: {0}", bwtArr);
        Console.WriteLine("Inverse of Burrows - Wheeler Transform: {0}", new string(decodedStr.ToCharArray().Reverse().ToArray()));
    }
}


Javascript




// JavaScript program to find inverse of Burrows - Wheeler transform
 
function invert(bwt_arr) {
    const len_bwt = bwt_arr.length;
    const sorted_bwt = bwt_arr.split('').sort().join('');
    const l_shift = Array(len_bwt).fill(0);
 
    // Index at which original string appears
    // in the sorted rotations list
    let x = 4;
 
    // Array of lists to compute l_shift
    const arr = Array(128).fill().map(() => []);
 
    // Adds each character of bwt_arr to a linked list
    // and appends to it the new node whose data part
    // contains index at which character occurs in bwt_arr
    for (let i = 0; i < len_bwt; i++) {
        arr[bwt_arr.charCodeAt(i)].push(i);
    }
 
    // Adds each character of sorted_bwt to a linked list
    // and finds l_shift
    for (let i = 0; i < len_bwt; i++) {
        l_shift[i] = arr[sorted_bwt.charCodeAt(i)].shift();
    }
 
    // Decodes the bwt
    const decoded = Array(len_bwt).fill('');
    for (let i = 0; i < len_bwt; i++) {
        x = l_shift[x];
        decoded[len_bwt-1-i] = bwt_arr.charAt(x);
    }
    const decoded_str = decoded.join('');
 
    console.log("Burrows - Wheeler Transform:", bwt_arr);
    console.log("Inverse of Burrows - Wheeler Transform:", decoded_str.split('').reverse().join(''));
}
 
 
// Driver program to test functions above
const bwt_arr = "annb$aa";
invert(bwt_arr);
 
// This code is contributed by adityashatmfh


Output

Burrows - Wheeler Transform: annb$aa
Inverse of Burrows - Wheeler Transform: banana$

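Time Complexity: O(n log n) for the sorting step, where n is the length of the BWT string. Note that addAtLast() in the C code walks to the end of the list on every insertion, so building the lists can take O(n^2) time in the worst case (for example, when most characters are equal); keeping a tail pointer per list makes each append O(1). 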

Exercise: Implement the inverse of the Burrows – Wheeler Transform in O(n) time. 
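
One possible direction for the exercise, offered as a hedged sketch and not as the intended solution: because the alphabet is small, the comparison sort can be replaced by a counting sort, which computes l_shift[] in O(n + 128) time. The function name `compute_l_shift_linear` and the 128-entry tables (which assume ASCII input, as the article's arr[128] does) are illustrative choices.

```python
def compute_l_shift_linear(bwt_arr):
    """Counting-sort construction of l_shift[]; assumes ASCII characters."""
    counts = [0] * 128
    for ch in bwt_arr:
        counts[ord(ch)] += 1

    # start[c] = first row of the sorted rotations that begins with character code c.
    start = [0] * 128
    total = 0
    for c in range(128):
        start[c] = total
        total += counts[c]

    # Scanning bwt_arr left to right hands out rows in increasing order
    # per character, which is the same tie-breaking rule as before.
    l_shift = [0] * len(bwt_arr)
    for i, ch in enumerate(bwt_arr):
        l_shift[start[ord(ch)]] = i
        start[ord(ch)] += 1
    return l_shift


print(compute_l_shift_linear("annb$aa"))  # [4, 0, 5, 6, 3, 1, 2]
```

The decoding loop is already linear, so with this construction the whole inversion runs in O(n) time for a fixed alphabet.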



Last Updated : 20 Apr, 2023