Canonical Huffman Coding

Last Updated : 16 Apr, 2024

Huffman Coding is a lossless data compression algorithm where each character in the data is assigned a variable length prefix code. The least frequent character gets the largest code and the most frequent one gets the smallest code. Encoding the data using this technique is very easy and efficient. However, decoding the bitstream generated using this technique is inefficient.Decoders(or Decompressors)require the knowledge of the encoding mechanism used in order to decode the encoded data back to the original characters.

Hence information about the encoding process needs to be passed to the decoder along with the encoded data as a table of characters and their corresponding codes. In regular Huffman coding of a large data, this table takes up a lot of memory space and also if a large no. of unique characters are present in the data then the compressed(or encoded) data size increases because of the presence of the codebook. Therefore to make the decoding process computationally efficient and still maintain a good compression ratio, Canonical Huffman codes were introduced.

In Canonical Huffman coding, the bit lengths of the standard Huffman codes generated for each symbol is used. The symbols are sorted first according to their bit lengths in non-decreasing order and then for each bit length, they are sorted lexicographically. The first symbol gets a code containing all zeros and of the same length as that of the original bit length. For the subsequent symbols, if the symbol has a bit length equal to that of the previous symbol, then the code of the previous symbol is incremented by one and assigned to the present symbol.

Otherwise, if the symbol has a bit length greater than that of the previous symbol, after incrementing the code of the previous symbol is zeros are appended until the length becomes equal to the bit length of the current symbol and the code is then assigned to the current symbol.
This process continues for the rest of the symbols.

The following example illustrates the process:

Consider the following data:

Character	Frequency
a	10
b	1
c	15
d	7

Standard Huffman Codes Generated with bit lengths:

Character	Huffman Codes	Bit lengths
a	11	2
b	100	3
c	0	1
d	101	3

Step 1: Sort the data according to bit lengths and then for each bit length sort the symbols lexicographically.

Character	Bit lengths
c	1
a	2
b	3
d	3

Step 2: Assign the code of the first symbol with the same number of ‘0’s as the bit length.
Code for ‘c’:0
Next symbol ‘a’ has bit length 2 > bit length of the previous symbol ‘c’ which is 1.Increment the code of the previous symbol by 1 and append (2-1)=1 zeros and assign the code to ‘a’.
Code for ‘a’:10
Next symbol ‘b’ has bit length 3 > bit length of the previous symbol ‘a’ which is 2.Increment the code of the previous symbol by 1 and append (3-2)=1 zeros and assign the code to ‘b’.
Code for ‘b’:110
Next symbol ‘d’ has bit length 3 = bit length of the previous symbol ‘b’ which is 3.Increment the code of the previous symbol by 1 and assign it to ‘d’.
Code for ‘d’:111
Step 3: Final result.

Character	Canonical Huffman Codes
c	0
a	10
b	110
d	111

The basic advantage of this method is that the encoding information passed to the decoder can be made more compact and memory efficient. For example, one can simply pass the bit lengths of the characters or symbols to the decoder. The canonical codes can be generated easily from the lengths as they are sequential.
For generating Huffman codes using Huffman Tree refer here.

Approach: A simple and efficient approach is to generate a Huffman tree for the data and use a data structure similar to TreeMap in java to store the symbols and bit lengths such that the information always remains sorted. The canonical codes can then be obtained using incrementation and bitwise left shift operations.

Implementation:

C++

#include <bits/stdc++.h>
using namespace std;

// Nodes of Huffman tree
class Node {
public:
    int data;
    char c;
    Node* left;
    Node* right;
};

// Comparator class helps to compare the node
// on the basis of one of its attributes.
// Here we will be compared
// on the basis of data values of the nodes.
class Pq_compare {
public:
    int operator() (Node* a, Node* b) {
        return a->data - b->data;
    }
};

class Canonical_Huffman {
    // Treemap to store the
    // code lengths(sorted) as keys
    // and corresponding(sorted)
    // set of characters as values
public:
    static map<int, set<char>> data;

    Canonical_Huffman() {
        data = map<int, set<char>>();
    }
    
    // Recursive function
    // to generate code lengths
    // from regular Huffman codes
    static void code_gen(Node* root, int code_length) {
        if (root == nullptr)
            return;

        // base case; if the left and right are null
        // then its a leaf node.
        if (root->left == nullptr && root->right == nullptr) {
            // check if key is present or not.
            // If not present add a new treeset
            // as value along with the key
            data[code_length].insert(root->c);
            return;
        }

        // Add 1 when going left or right.
        code_gen(root->left, code_length + 1);
        code_gen(root->right, code_length + 1);
    }

    static void testCanonicalHC(int n, char chararr[], int freq[]) {
        // min-priority queue(min-heap).
        priority_queue<Node*, vector<Node*>, Pq_compare> q;

        for (int i = 0; i < n; i++) {
            // creating a node object
            // and adding it to the priority-queue.
            Node* node = new Node();
            node->c = chararr[i];
            node->data = freq[i];
            node->left = nullptr;
            node->right = nullptr;

            // add functions adds
            // the node to the queue.
            q.push(node);
        }

        // Create a root node
        Node* root = nullptr;

        // extract the two minimum values
        // from the heap each time until
        // its size reduces to 1, extract until
        // all the nodes are extracted.
        while (q.size() > 1) {
            // first min extract.
            Node* x = q.top();
            q.pop();

            // Second min extract
            Node* y = q.top();
            q.pop();

            // new node f which is equal
            Node* nodeobj = new Node();
            
            // to the sum of the frequency of the two nodes 
            // assigning values to the f node
            nodeobj->data = x->data + y->data;
            nodeobj->c = '-';
            // first extracted node as left child.
            nodeobj->left = x;
            // second extracted node as the right child.
            nodeobj->right = y;
            // marking the f node as the root node
            root = nodeobj;
            // add this node to the priority-queue.
            q.push(nodeobj);
        }

        // creating a canonical Huffman object
        Canonical_Huffman obj = Canonical_Huffman();
        
        // generate code lengths by traversing the tree
        code_gen(root, 0);

        // Object array to the store the keys
        auto arr = data;
        
        // Set initial canonical code = 0
        int c_code = 0, curr_len = 0, next_len = 0;

        for (auto it = arr.begin(); it != arr.end(); it++) {
            set<char> s = it->second;
            
            // code length of current character
            curr_len = it->first;

            for (auto i = s.begin(); i != s.end(); i++) {
                // Display the canonical codes
                cout << *i << ":" << bitset<32>(c_code).to_string().substr(32 - curr_len, 32) << endl;
                
                // if values set is not
                // completed or if it is
                // the last set set code length
                // of next character as current
                // code length
                if (next(i) != s.end() || next(it) == arr.end())
                    next_len = curr_len;
                else
                    next_len = next(it)->first;
                
                // Generate canonical code
                // for next character using
                // regular code length of next
                // character
                c_code = (c_code + 1) << (next_len - curr_len);
            }
        }
    }
};

map<int, set<char>> Canonical_Huffman::data;

// Driver code
int main() {
    int n = 4;
    char chararr[] = {'a', 'b', 'c', 'd'};
    int freq[] = {10, 1, 15, 7};
    Canonical_Huffman::testCanonicalHC(n, chararr, freq);
    return 0;
}

Java

import java.util.*;

// Node class to store data and its frequency
class Node implements Comparable<Node> {
    char data;
    int freq;
    Node left, right;

    // Constructor
    Node(char c, int f) {
        data = c;
        freq = f;
        left = right = null;
    }

    // Comparator: less_than
    public int compareTo(Node other) {
        return this.freq - other.freq;
    }
}

public class CanonicalHuffman {
    // Function to generate Huffman codes
    static void codeGen(Node root, StringBuilder codeLength, Map<Integer, List<Character>> codeMap) {
        if (root == null) return;
        if (root.left == null && root.right == null) {
            codeMap.computeIfAbsent(codeLength.length(), k -> new ArrayList<>()).add(root.data);
            return;
        }
        codeGen(root.left, codeLength.append('0'), codeMap);
        codeLength.deleteCharAt(codeLength.length() - 1);
        codeGen(root.right, codeLength.append('1'), codeMap);
        codeLength.deleteCharAt(codeLength.length() - 1);
    }

    // Main function implementing Huffman coding
    static void testCanonicalHC(char[] charArr, int[] freq) {
        // Priority queue to store heap tree
        PriorityQueue<Node> q = new PriorityQueue<>();
        for (int i = 0; i < charArr.length; i++) {
            q.add(new Node(charArr[i], freq[i]));
        }

        while (q.size() > 1) {
            Node left = q.poll();
            Node right = q.poll();
            Node merged = new Node('-', left.freq + right.freq);
            merged.left = left;
            merged.right = right;
            q.add(merged);
        }

        Node root = q.poll();
        Map<Integer, List<Character>> codeMap = new HashMap<>();
        codeGen(root, new StringBuilder(), codeMap);

        // Generate Canonical Huffman codes
        Map<Character, String> canonicalMap = new TreeMap<>();
        int cCode = 0;
        for (int length : new TreeSet<>(codeMap.keySet())) {
            List<Character> chars = codeMap.get(length);
            for (char ch : chars) {
                canonicalMap.put(ch, String.format("%" + length + "s", Integer.toBinaryString(cCode++)).replace(' ', '0'));
            }
            cCode <<= 1;
        }

        // Print Canonical Huffman codes
        for (char ch : canonicalMap.keySet()) {
            System.out.println(ch + ": " + canonicalMap.get(ch));
        }
    }

    // Driver code
    public static void main(String[] args) {
        char[] charArr = {'a', 'b', 'c', 'd'};
        int[] freq = {10, 1, 15, 7};
        testCanonicalHC(charArr, freq);
    }
}

Python3

import heapq
from collections import defaultdict

# Node class to store data and its frequency
class Node:
    def __init__(self, char, freq):
        self.char = char
        self.freq = freq
        self.left = None
        self.right = None

    # Defining comparators less_than and equals
    def __lt__(self, other):
        return self.freq < other.freq

    def __eq__(self, other):
        if(other == None):
            return False
        if(not isinstance(other, Node)):
            return False
        return self.freq == other.freq

# Function to generate Huffman codes
def code_gen(root, code_length, code_map):
    if root is None:
        return

    if root.left is None and root.right is None:
        code_map[len(code_length)].append(root.char)

    code_gen(root.left, code_length + '0', code_map)
    code_gen(root.right, code_length + '1', code_map)

# Main function implementing Huffman coding
def testCanonicalHC(chararr, freq):
    # Priority queue to store heap tree
    q = [Node(chararr[i], freq[i]) for i in range(len(chararr))]
    heapq.heapify(q)

    while len(q) > 1:
        left = heapq.heappop(q)
        right = heapq.heappop(q)

        merged = Node(None, left.freq + right.freq)
        merged.left = left
        merged.right = right

        heapq.heappush(q, merged)

    root = heapq.heappop(q)
    code_map = defaultdict(list)
    code_gen(root, "", code_map)

    # Generate Canonical Huffman codes
    canonical_map = {}
    c_code = 0
    for length in sorted(code_map.keys()):
        for char in sorted(code_map[length]):
            canonical_map[char] = bin(c_code)[2:].zfill(length)
            c_code += 1
        c_code <<= 1

    # Print Canonical Huffman codes
    for char in sorted(canonical_map, key=lambda x: (len(canonical_map[x]), x)):
        print(f"{char}: {canonical_map[char]}")

# Driver code
if __name__ == "__main__":
    chararr = ['a', 'b', 'c', 'd']
    freq = [10, 1, 15, 7]
    testCanonicalHC(chararr, freq)

Output

c:0
a:10
b:110
d:111

Suggest improvement

Huffman Coding using Priority Queue

Huffman Decoding

Share your thoughts in the comments

Canonical Huffman Coding

Please Login to comment...

Similar Reads

What kind of Experience do you want to share?