Burrows – Wheeler Data Transform Algorithm

Last Updated : 08 Dec, 2023

What is the Burrows-Wheeler Transform?

The Burrows-Wheeler Transform (BWT) is a data transformation algorithm that restructures data so that the transformed message is more compressible. Technically, it is a reversible permutation of the characters of a string, obtained by lexicographic sorting. It is the first of three steps performed in succession in the Burrows-Wheeler data compression algorithm, which forms the basis of the Unix compression utility bzip2.

Why BWT? The main idea behind it.

An important application of BWT is found in the biological sciences, where genomes (long strings over the alphabet A, C, T, G) don’t have many runs but do have many repeats. The idea of the BWT is to build an array whose rows are all cyclic shifts of the input string in dictionary order and to return the last column of that array, which tends to have long runs of identical characters. Once identical characters have been clustered together, the string becomes more compressible for other algorithms such as run-length encoding and Huffman coding. The remarkable thing about BWT is that the transform is reversible with minimal data overhead.
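To make the clustering effect concrete, here is a minimal Python sketch. The helper names `bwt_by_rotations` and `run_lengths` and the sample text are illustrative assumptions, not part of the original article. The sketch counts runs of identical characters before and after the transform for a string that has many repeats but no runs.

```python
from itertools import groupby


def bwt_by_rotations(text):
    """Naive BWT: sort all cyclic rotations and read off the last column."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def run_lengths(s):
    """Run-length encode s as a list of (character, run length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


# Many repeats of "ab", but every run in the raw text has length 1.
text = "abababab$"
transformed = bwt_by_rotations(text)

print(len(run_lengths(text)), "runs before the transform")
print(len(run_lengths(transformed)), "runs after the transform")
print(run_lengths(transformed))
```

For this input the transform groups all the b's and all the a's together, so a run-length encoder has far fewer runs to store. Real data will usually cluster less perfectly than this hand-picked example.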

Steps involved in BWT algorithm

Let’s take the word “banana$” as an example.

Step 1: Form all cyclic rotations of the given text.

                                     banana$ 
       $    b                        $banana 
    a           a                    a$banan
   Cyclic rotations    ---------->   na$bana
    n         n                      ana$ban 
          a                          nana$ba
                                     anana$b

Step 2: Sort the rotations lexicographically. The ‘$’ sign is treated as the first letter lexicographically, even before ‘a’.

banana$                    $banana
$banana                    a$banan
a$banan       Sorting      ana$ban
na$bana      ---------->   anana$b 
ana$ban    alphabetically  banana$
nana$ba                    na$bana
anana$b                    nana$ba

Step 3: The last column is what we output as BWT.

BWT(banana$) = annb$aa
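The three steps can be sketched directly in Python. This is a naive illustration under the assumption that ‘$’ sorts before every other character in the text, which holds for Python's default string ordering and lowercase letters. The function names `bwt_matrix` and `bwt_last_column` are made up for this example.

```python
def bwt_matrix(text):
    """Steps 1 and 2: build all cyclic rotations of text and sort them."""
    n = len(text)
    rotations = [text[i:] + text[:i] for i in range(n)]  # Step 1
    return sorted(rotations)                             # Step 2


def bwt_last_column(text):
    """Step 3: concatenate the last character of every sorted rotation."""
    return "".join(row[-1] for row in bwt_matrix(text))


for row in bwt_matrix("banana$"):
    print(row)
print("BWT:", bwt_last_column("banana$"))  # BWT: annb$aa
```

Storing all n rotations takes O(n^2) memory, so this form is only suitable for small inputs.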

Examples:

Input: text = “banana$”
Output: Burrows-Wheeler Transform = “annb$aa”

Input: text = “abracadabra$”
Output: Burrows-Wheeler Transform = “ard$rcaaaabb”

Why is the last column considered the BWT?

The last column has better symbol clustering than any other column.
If we only have the BWT of our string, we can recover all of the cyclic rotations entirely. The other columns don’t possess this characteristic, which is essential for computing the inverse of the BWT.
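The recoverability claim can be illustrated with a naive inverse, sketched below. It assumes the original text ended with a unique ‘$’ terminator, and the function name `inverse_bwt` is hypothetical. The idea is to regrow the sorted rotation table one column at a time: prepend the BWT column to every row, sort again, and repeat n times.

```python
def inverse_bwt(last_column, terminator="$"):
    """Rebuild the original text from its BWT by regrowing the rotation table."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        # Prepend the BWT column to every row, then restore sorted order.
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The original text is the one rotation that ends with the terminator.
    for row in table:
        if row.endswith(terminator):
            return row
    raise ValueError("terminator not found in the transform")


print(inverse_bwt("annb$aa"))  # banana$
```

This version sorts n strings n times, so it is meant as a demonstration of reversibility rather than an efficient decoder.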

Why is the ‘$’ sign embedded in the text?

We can compute the BWT even if our text is not concatenated with any EOF character (‘$’ here). The ‘$’ sign becomes important when computing the inverse of the BWT.

Way of implementation

Let’s take “banana$” as our input_text and create a character array bwt_arr for our output.
Let’s get all the suffixes of “banana$” and compute its suffix array suffix_arr, which stores the starting index of each suffix in sorted order.

0 banana$                6 $   
1 anana$                 5 a$
2 nana$      Sorting     3 ana$
3 ana$     ---------->   1 anana$
4 na$     alphabetically 0 banana$
5 a$                     4 na$
6 $                      2 nana$
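Sorting suffixes can stand in for sorting rotations because the unique, smallest ‘$’ decides every comparison before the wrapped-around part of a rotation is reached. The following Python sketch checks this for the example text. The helper names `suffix_order` and `rotation_order` are assumptions made for illustration.

```python
def suffix_order(text):
    """Start indexes of the suffixes of text in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def rotation_order(text):
    """Start indexes of the cyclic rotations of text in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:] + text[:i])


text = "banana$"
print(suffix_order(text))    # [6, 5, 3, 1, 0, 4, 2]
print(rotation_order(text))  # [6, 5, 3, 1, 0, 4, 2]
```

Without such a terminator the two orders can differ. For the text "abab", for instance, the suffix order starts with index 2 while the rotation order starts with index 0.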

Iterating over suffix_arr, let’s now add to our output array bwt_arr the last character of each rotation.
The last character of the rotation of input_text that starts at position suffix_arr[i] is input_text[(suffix_arr[i] - 1 + n) % n], where n is the number of elements in suffix_arr.

bwt_arr[0] 
  = input_text[(suffix_arr[0] - 1 + 7) % 7] 
  = input_text[5] 
  = a
bwt_arr[1] 
  = input_text[(suffix_arr[1] - 1 + 7) % 7] 
  = input_text[4] 
  = n

The following code implements the approach explained above.

C++

// CPP program to find Burrows Wheeler transform
// of a given text
#include <bits/stdc++.h>
using namespace std;
 
// Structure to store data of a rotation
struct rotation {
    int index;
    char* suffix;
};
 
// Compares the rotations and
// sorts the rotations alphabetically
int cmpfunc(const void* x, const void* y)
{
    struct rotation* rx = (struct rotation*)x;
    struct rotation* ry = (struct rotation*)y;
    return strcmp(rx->suffix, ry->suffix);
}
 
// Takes text to be transformed and its length as
// arguments and returns the corresponding suffix array
int* computeSuffixArray(char* input_text, int len_text)
{
    // Array of structures to store rotations and
    // their indexes
    struct rotation suff[len_text];
 
    // Structure is needed to maintain old indexes of
    // rotations after sorting them
    for (int i = 0; i < len_text; i++) {
        suff[i].index = i;
        suff[i].suffix = (input_text + i);
    }
 
    // Sorts rotations using comparison
    // function defined above
    qsort(suff, len_text, sizeof(struct rotation), cmpfunc);
 
    // Stores the indexes of sorted rotations
    int* suffix_arr = (int*)malloc(len_text * sizeof(int));
    for (int i = 0; i < len_text; i++)
        suffix_arr[i] = suff[i].index;
 
    // Returns the computed suffix array
    return suffix_arr;
}
 
// Takes suffix array and its size
// as arguments and returns the
// Burrows - Wheeler Transform of given text
char* findLastChar(char* input_text, int* suffix_arr, int n)
{
    // Allocates n characters plus one for the terminating '\0'
    char* bwt_arr = (char*)malloc((n + 1) * sizeof(char));
    // Iterates over the suffix array to find
    // the last char of each cyclic rotation
    int i;
    for (i = 0; i < n; i++) {
        // Computes the last char which is given by
        // input_text[(suffix_arr[i] + n - 1) % n]
        int j = suffix_arr[i] - 1;
        if (j < 0)
            j = j + n;
 
        bwt_arr[i] = input_text[j];
    }
 
    bwt_arr[i] = '\0';
 
    // Returns the computed Burrows - Wheeler Transform
    return bwt_arr;
}
 
// Driver program to test functions above
int main()
{
    char input_text[] = "banana$";
    int len_text = strlen(input_text);
 
    // Computes the suffix array of our text
    int* suffix_arr
        = computeSuffixArray(input_text, len_text);
 
    // Adds to the output array the last char of each
    // rotation
    char* bwt_arr
        = findLastChar(input_text, suffix_arr, len_text);
 
    cout << "Input text : " << input_text << endl;
    cout << "Burrows - Wheeler Transform : " << bwt_arr
         << endl;
    // Releases the buffers allocated with malloc
    free(suffix_arr);
    free(bwt_arr);
    return 0;
}
 
// This code is contributed by Susobhan Akhuli

C

// C program to find Burrows Wheeler transform
// of a given text
 
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
 
// Structure to store data of a rotation
struct rotation {
    int index;
    char* suffix;
};
 
// Compares the rotations and
// sorts the rotations alphabetically
int cmpfunc(const void* x, const void* y)
{
    struct rotation* rx = (struct rotation*)x;
    struct rotation* ry = (struct rotation*)y;
    return strcmp(rx->suffix, ry->suffix);
}
 
// Takes text to be transformed and its length as
// arguments and returns the corresponding suffix array
int* computeSuffixArray(char* input_text, int len_text)
{
    // Array of structures to store rotations and
    // their indexes
    struct rotation suff[len_text];
 
    // Structure is needed to maintain old indexes of
    // rotations after sorting them
    for (int i = 0; i < len_text; i++) {
        suff[i].index = i;
        suff[i].suffix = (input_text + i);
    }
 
    // Sorts rotations using comparison
    // function defined above
    qsort(suff, len_text, sizeof(struct rotation),
        cmpfunc);
 
    // Stores the indexes of sorted rotations
    int* suffix_arr
        = (int*)malloc(len_text * sizeof(int));
    for (int i = 0; i < len_text; i++)
        suffix_arr[i] = suff[i].index;
 
    // Returns the computed suffix array
    return suffix_arr;
}
 
// Takes suffix array and its size
// as arguments and returns the
// Burrows - Wheeler Transform of given text
char* findLastChar(char* input_text,
                int* suffix_arr, int n)
{
    // Allocates n characters plus one for the terminating '\0'
    char* bwt_arr = (char*)malloc((n + 1) * sizeof(char));
    // Iterates over the suffix array to find
    // the last char of each cyclic rotation
    int i;
    for (i = 0; i < n; i++) {
        // Computes the last char which is given by
        // input_text[(suffix_arr[i] + n - 1) % n]
        int j = suffix_arr[i] - 1;
        if (j < 0)
            j = j + n;
 
        bwt_arr[i] = input_text[j];
    }
 
    bwt_arr[i] = '\0';
 
    // Returns the computed Burrows - Wheeler Transform
    return bwt_arr;
}
 
// Driver program to test functions above
int main()
{
    char input_text[] = "banana$";
    int len_text = strlen(input_text);
 
    // Computes the suffix array of our text
    int* suffix_arr
        = computeSuffixArray(input_text, len_text);
 
    // Adds to the output array the last char
    // of each rotation
    char* bwt_arr
        = findLastChar(input_text, suffix_arr, len_text);
 
    printf("Input text : %s\n", input_text);
    printf("Burrows - Wheeler Transform : %s\n",
        bwt_arr);
    // Releases the buffers allocated with malloc
    free(suffix_arr);
    free(bwt_arr);
    return 0;
}

Java

import java.util.Arrays;
 
class Rotation implements Comparable<Rotation> {
    int index;
    String suffix;
 
    public Rotation(int index, String suffix)
    {
        this.index = index;
        this.suffix = suffix;
    }
 
    @Override public int compareTo(Rotation o)
    {
        return this.suffix.compareTo(o.suffix);
    }
}
 
public class BurrowsWheelerTransform {
 
    public static int[] computeSuffixArray(String inputText)
    {
        int lenText = inputText.length();
 
        Rotation[] suff = new Rotation[lenText];
 
        for (int i = 0; i < lenText; i++) {
            suff[i]
                = new Rotation(i, inputText.substring(i));
        }
 
        Arrays.sort(suff);
 
        int[] suffixArr = new int[lenText];
        for (int i = 0; i < lenText; i++) {
            suffixArr[i] = suff[i].index;
        }
 
        return suffixArr;
    }
 
    public static String findLastChar(String inputText,
                                      int[] suffixArr)
    {
        int n = inputText.length();
 
        StringBuilder bwtArr = new StringBuilder();
        for (int i = 0; i < n; i++) {
            int j = suffixArr[i] - 1;
            if (j < 0) {
                j = j + n;
            }
            bwtArr.append(inputText.charAt(j));
        }
 
        return bwtArr.toString();
    }
 
    public static void main(String[] args)
    {
        String inputText = "banana$";
 
        int[] suffixArr = computeSuffixArray(inputText);
 
        String bwtArr = findLastChar(inputText, suffixArr);
 
        System.out.println("Input text : " + inputText);
        System.out.println("Burrows - Wheeler Transform : "
                           + bwtArr);
    }
}

Python3

# Python program to find Burrows-Wheeler Transform of a given text


# Takes text to be transformed and its length as arguments
# and returns the corresponding suffix array
def compute_suffix_array(input_text, len_text):
    # List of (index, suffix) pairs, so that the original index
    # of each suffix is kept after sorting
    suff = [(i, input_text[i:]) for i in range(len_text)]
 
    # Sorts the pairs lexicographically by suffix text
    suff.sort(key=lambda x: x[1])
 
    # Stores the indexes of sorted rotations
    suffix_arr = [i for i, _ in suff]
 
    # Returns the computed suffix array
    return suffix_arr
 
# Takes suffix array and its size as arguments
# and returns the Burrows-Wheeler Transform of given text
 
 
def find_last_char(input_text, suffix_arr, n):
    # Iterates over the suffix array to
    # find the last char of each cyclic rotation
    bwt_arr = ""
    for i in range(n):
        # Computes the last char which is given by 
        # input_text[(suffix_arr[i] + n - 1) % n]
        j = suffix_arr[i] - 1
        if j < 0:
            j = j + n
        bwt_arr += input_text[j]
 
    # Returns the computed Burrows-Wheeler Transform
    return bwt_arr
 
 
# Driver program to test functions above
input_text = "banana$"
len_text = len(input_text)
 
# Computes the suffix array of our text
suffix_arr = compute_suffix_array(input_text, len_text)
 
# Adds to the output array the last char of each rotation
bwt_arr = find_last_char(input_text, suffix_arr, len_text)
 
print("Input text :", input_text)
print("Burrows - Wheeler Transform :", bwt_arr)
 
# This code is contributed by Susobhan Akhuli

C#

using System;
using System.Collections.Generic;
 
public class BurrowsWheelerTransform
{
  // Structure to store data of a rotation
  public struct Rotation
  {
    public int Index;
    public string Suffix;
  }
 
  // Compares the rotations by ordinal (character code) order, so the
  // result does not depend on the current culture
  private static int CompareRotations(Rotation x, Rotation y)
  {
    return string.CompareOrdinal(x.Suffix, y.Suffix);
  }
 
  // Takes text to be transformed and its length as
  // arguments and returns the corresponding suffix array
  private static int[] ComputeSuffixArray(string inputText)
  {
    int lenText = inputText.Length;
 
    // Array of structures to store rotations and their indexes
    Rotation[] suff = new Rotation[lenText];
 
    // Structure is needed to maintain old indexes of rotations
    // after sorting them
    for (int i = 0; i < lenText; i++)
    {
      suff[i].Index = i;
      suff[i].Suffix = inputText.Substring(i);
    }
 
    // Sorts rotations using comparison function defined above
    Array.Sort(suff, CompareRotations);
 
    // Stores the indexes of sorted rotations
    int[] suffixArr = new int[lenText];
    for (int i = 0; i < lenText; i++)
    {
      suffixArr[i] = suff[i].Index;
    }
 
    // Returns the computed suffix array
    return suffixArr;
  }
 
  // Takes suffix array and its size as arguments
  // and returns the Burrows-Wheeler Transform of given text
  private static string FindLastChar(string inputText, int[] suffixArr)
  {
    int n = suffixArr.Length;
 
    // Iterates over the suffix array to find the last char of each cyclic rotation
    char[] bwtArr = new char[n];
    for (int i = 0; i < n; i++)
    {
      // Computes the last char which is given by inputText[(suffixArr[i] + n - 1) % n]
      int j = suffixArr[i] - 1;
      if (j < 0)
      {
        j += n;
      }
 
      bwtArr[i] = inputText[j];
    }
 
    // Returns the computed Burrows-Wheeler Transform
    return new string(bwtArr);
  }
 
  // Driver program to test functions above
  public static void Main(string[] args)
  {
    string inputText = "banana$";
    int lenText = inputText.Length;
 
    // Computes the suffix array of our text
    int[] suffixArr = ComputeSuffixArray(inputText);
 
    // Adds to the output array the last char of each rotation
    string bwtArr = FindLastChar(inputText, suffixArr);
 
    Console.WriteLine($"Input text : {inputText}");
    Console.WriteLine($"Burrows-Wheeler Transform : {bwtArr}");
  }
}
// This code is contributed by divyansh2212

Javascript

// Compares the rotations and sorts the rotations alphabetically
 
function cmp_func(x, y) {
    return (x[1] > y[1]) - (x[1] < y[1]);
}
//Takes text to be transformed and its length as arguments
//and returns the corresponding suffix array
 
function compute_suffix_array(input_text, len_text) {
    let suff = [];
    for (let i = 0; i < len_text; i++) {
        suff.push([i, input_text.slice(i)]);
    }
    suff.sort(cmp_func);
    let suffix_arr = suff.map(item => item[0]);
    return suffix_arr;
}
// Takes suffix array and its size as arguments
// and returns the Burrows-Wheeler Transform of given text
function find_last_char(input_text, suffix_arr, n) {
    let bwt_arr = "";
    for (let i = 0; i < n; i++) {
        let j = suffix_arr[i] - 1;
        if (j < 0) {
            j = j + n;
        }
        bwt_arr += input_text[j];
    }
    return bwt_arr;
}
// Driver program to test functions above
 
let input_text = "banana$";
let len_text = input_text.length;
let suffix_arr = compute_suffix_array(input_text, len_text);
let bwt_arr = find_last_char(input_text, suffix_arr, len_text);
 
console.log("Input text: " + input_text);
console.log("Burrows-Wheeler Transform: " + bwt_arr);

PHP

<?php
// Comparison function for sorting rotations
function cmp_func($x, $y) {
    return ($x[1] > $y[1]) - ($x[1] < $y[1]);
}
 
// Function to compute the suffix array
function compute_suffix_array($input_text, $len_text) {
    $suff = [];
    // Array of structures to store rotations and their indexes
    for ($i = 0; $i < $len_text; $i++) {
        $suff[] = [$i, substr($input_text, $i)];
    }
 
    // Sort rotations using comparison function
    usort($suff, function($a, $b) {
        return strcmp($a[1], $b[1]);
    });
 
    // Store the indexes of sorted rotations
    $suffix_arr = array_map(function($x) { return $x[0]; }, $suff);
 
    // Return the computed suffix array
    return $suffix_arr;
}
 
// Function to find the last char of each cyclic rotation
function find_last_char($input_text, $suffix_arr, $n) {
    $bwt_arr = "";
    // Iterate over the suffix array to find the last char of each cyclic rotation
    for ($i = 0; $i < $n; $i++) {
        // Compute the last char
        $j = $suffix_arr[$i] - 1;
        if ($j < 0) {
            $j = $j + $n;
        }
        $bwt_arr .= $input_text[$j];
    }
 
    // Return the computed Burrows-Wheeler Transform
    return $bwt_arr;
}
 
// Driver code to test functions above
$input_text = "banana$";
$len_text = strlen($input_text);
 
// Compute the suffix array of our text
$suffix_arr = compute_suffix_array($input_text, $len_text);
 
// Add to the output array the last char of each rotation
$bwt_arr = find_last_char($input_text, $suffix_arr, $len_text);
 
echo "Input text: ", $input_text, "\n";
echo "Burrows-Wheeler Transform: ", $bwt_arr, "\n";
?>

Output

Input text : banana$
Burrows - Wheeler Transform : annb$aa

Time Complexity: $O(n^2 \log n)$. This comes from the method used above to build the suffix array: the $O(n \log n)$ comparison sort performs string comparisons that each take $O(n)$ time.

Exercise: