Huffman coding is a lossless data compression algorithm in which each character in the data is assigned a variable-length prefix code. The least frequent character gets the longest code and the most frequent one gets the shortest code. Encoding data with this technique is easy and efficient. However, decoding the bitstream generated by this technique is inefficient. Decoders (or decompressors) need to know the encoding that was used in order to map the encoded data back to the original characters. Hence, information about the encoding has to be passed to the decoder along with the encoded data, as a table of characters and their corresponding codes (the codebook). In regular Huffman coding of large data, this table takes up a lot of memory, and if the data contains a large number of unique characters, the codebook also increases the size of the compressed (encoded) output. Canonical Huffman codes were introduced to make decoding computationally efficient while still maintaining a good compression ratio.
In canonical Huffman coding, the bit lengths of the standard Huffman codes generated for each symbol are used. The symbols are sorted first by bit length in non-decreasing order and then, within each bit length, lexicographically. The first symbol gets a code consisting entirely of zeros, with the same length as its original bit length. For each subsequent symbol, if its bit length equals that of the previous symbol, the code of the previous symbol is incremented by one and assigned to the current symbol. Otherwise, if its bit length is greater than that of the previous symbol, the code of the previous symbol is incremented by one, zeros are appended until the length equals the bit length of the current symbol, and the result is assigned to the current symbol.
This process continues for the rest of the symbols.
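The assignment rule described above can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the function name `canonical_codes` and the dictionary-based input are illustrative assumptions, and the bit lengths are assumed to come from a valid Huffman tree. Appending zeros is done with a left shift.

```python
def canonical_codes(bit_lengths):
    """Assign canonical codes from a {symbol: bit length} mapping.

    Hypothetical helper for illustration; the lengths are assumed to
    come from a valid Huffman tree.
    """
    # Sort by bit length first, then lexicographically within each length.
    ordered = sorted(bit_lengths.items(), key=lambda item: (item[1], item[0]))

    codes = {}
    code = 0          # the first symbol gets all zeros
    prev_len = None
    for symbol, length in ordered:
        if prev_len is not None:
            # Increment the previous code, then append zeros (left shift)
            # when the current symbol needs a longer code.
            code = (code + 1) << (length - prev_len)
        # Render the integer as a bit string padded to the required length.
        codes[symbol] = format(code, "0{}b".format(length))
        prev_len = length
    return codes
```

When the bit length does not change, the shift amount is zero, so the same expression covers both cases of the rule.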
The following example illustrates the process:
Consider the following data:
Standard Huffman Codes Generated with bit lengths:
|Character||Huffman Codes||Bit lengths|
Step 1: Sort the data according to bit lengths and then, for each bit length, sort the symbols lexicographically. With the bit lengths used in the steps that follow, the order is c (1), a (2), b (3), d (3).
Step 2: Assign the first symbol a code made up of as many ‘0’s as its bit length.
Code for ‘c’: 0
The next symbol ‘a’ has bit length 2, which is greater than the bit length of the previous symbol ‘c’ (1). Increment the code of the previous symbol by 1, append (2-1)=1 zero, and assign the result to ‘a’.
Code for ‘a’: 10
The next symbol ‘b’ has bit length 3, which is greater than the bit length of the previous symbol ‘a’ (2). Increment the code of the previous symbol by 1, append (3-2)=1 zero, and assign the result to ‘b’.
Code for ‘b’: 110
The next symbol ‘d’ has bit length 3, which equals the bit length of the previous symbol ‘b’ (3). Increment the code of the previous symbol by 1 and assign it to ‘d’.
Code for ‘d’: 111
Step 3: Final result.
|Character||Canonical Huffman Codes|
|c||0|
|a||10|
|b||110|
|d||111|
The basic advantage of this method is that the encoding information passed to the decoder can be made more compact and memory efficient. For example, one can simply pass the bit lengths of the characters or symbols to the decoder. The canonical codes can be generated easily from the lengths as they are sequential.
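To illustrate why passing only the bit lengths is enough, the sketch below rebuilds the code table on the decoder side and uses it to decode a bit string. The function names `codes_from_lengths` and `decode`, and the use of a string of '0'/'1' characters as the bitstream, are assumptions made for this example. The table lookup is kept simple for readability rather than tuned for speed.

```python
def codes_from_lengths(bit_lengths):
    """Rebuild the canonical {code string: symbol} table from bit lengths alone."""
    ordered = sorted(bit_lengths.items(), key=lambda item: (item[1], item[0]))
    table, code, prev_len = {}, 0, None
    for symbol, length in ordered:
        if prev_len is not None:
            # Increment, then left-shift to append any extra zeros.
            code = (code + 1) << (length - prev_len)
        table[format(code, "0{}b".format(length))] = symbol
        prev_len = length
    return table


def decode(bits, bit_lengths):
    """Decode a string of '0'/'1' characters using only the bit lengths."""
    table = codes_from_lengths(bit_lengths)
    decoded, current = [], ""
    for bit in bits:
        current += bit
        # Prefix codes: the first complete match is the next symbol.
        if current in table:
            decoded.append(table[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code")
    return "".join(decoded)


# Bit lengths from the example: c=1, a=2, b=3, d=3.
lengths = {"a": 2, "b": 3, "c": 1, "d": 3}
print(decode("0101101110", lengths))  # prints "cabdc"
```

Here the decoder receives four small integers instead of four explicit code strings; the saving grows with the number of symbols.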
For generating Huffman codes using a Huffman tree, refer to the article on standard Huffman coding.
Approach: A simple and efficient approach is to generate a Huffman tree for the data and store the symbols and their bit lengths in a data structure similar to TreeMap in Java, so that the information always remains sorted. The canonical codes can then be obtained using increment and bitwise left shift operations.
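A possible sketch of this approach in Python is shown below. It is not the article's original program: the function names are illustrative, a dictionary iterated in sorted key order stands in for the TreeMap, and the frequencies in the usage lines are hypothetical values chosen only because they reproduce the bit lengths of the example (c=1, a=2, b=3, d=3).

```python
import heapq
from itertools import count


def huffman_bit_lengths(freq):
    """Return {symbol: bit length} for a {symbol: frequency} mapping."""
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    tie = count()  # tie-breaker so the heap never compares the symbol lists
    heap = [(f, next(tie), [s]) for s, f in freq.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freq, 0)
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)
        f2, _, syms2 = heapq.heappop(heap)
        # Every symbol under the merged node moves one level deeper.
        for s in syms1 + syms2:
            lengths[s] += 1
        heapq.heappush(heap, (f1 + f2, next(tie), syms1 + syms2))
    return lengths


def canonical_codes_from_frequencies(freq):
    lengths = huffman_bit_lengths(freq)
    # Group symbols by bit length; sorted iteration plays the TreeMap's role.
    by_length = {}
    for symbol, length in lengths.items():
        by_length.setdefault(length, []).append(symbol)

    codes, code, prev_len = {}, 0, None
    for length in sorted(by_length):
        for symbol in sorted(by_length[length]):
            if prev_len is not None:
                # Increment, then left-shift to append the extra zeros.
                code = (code + 1) << (length - prev_len)
            codes[symbol] = format(code, "0{}b".format(length))
            prev_len = length
    return codes


# Hypothetical frequencies (an assumption, not taken from the article).
freq = {"a": 10, "b": 1, "c": 15, "d": 7}
for symbol, code in canonical_codes_from_frequencies(freq).items():
    print(symbol + ":" + code)
```

Tracking only the depth of each symbol avoids building explicit tree nodes, since the canonical assignment needs nothing but the bit lengths.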
For the example above, this yields the codes c:0 a:10 b:110 d:111