Arithmetic Encoding and Decoding Using MATLAB

Last Updated : 01 Jun, 2022

Arithmetic coding is a type of entropy encoding utilized in lossless data compression. Ordinarily, a string of characters, for example, the words “hey” is represented for utilizing a fixed number of bits per character.

In the most straightforward case, the probability of every symbol occurring is equivalent. For instance, think about a set of three symbols, A, B, and C, each similarly prone to happen. Straightforward block encoding would require 2 bits for every symbol, which is inefficient as one of the bit varieties is rarely utilized. In other words, A = 00, B = 01, and C = 10, however, 11 is unused. An increasingly productive arrangement is to speak to a succession of these three symbols as a rational number in base 3 where every digit speaks to a symbol. For instance, the arrangement “ABBCAB” could become 0.011201. In arithmetic coding as an incentive in the stretch [0, 1). The subsequent stage is to encode this ternary number utilizing a fixed-guide paired number of adequate exactness toward recuperating it, for example, 0.00101100102 — this is just 10 bits. This is achievable for long arrangements in light of the fact that there are productive, set up calculations for changing over the base of subjectively exact numbers.

When all is said and done, arithmetic coders can deliver close ideal output for some random arrangement of symbols and probabilities (the ideal value is – log2P bits for every symbol of likelihood P). Compression algorithms that utilize arithmetic coding go by deciding a structure of the data – fundamentally an expectation of what examples will be found in the symbols of the message. The more precise this prediction is, the closer to ideal the output will be.

When all is said and done, each progression of the encoding procedure, aside from the absolute last, is the equivalent; the encoder has fundamentally only three bits of information to consider: The following symbol that should be encoded. The current span (at the very beginning of the encoding procedure, the stretch is set to [0,1], yet that will change) The probabilities the model allows to every one of the different symbols that are conceivable at this stage (as referenced prior, higher-request or versatile models imply that these probabilities are not really the equivalent in each progression.) The encoder isolates the current span into sub-spans, each speaking to a small amount of the current span relative to the likelihood of that symbol in the current setting. Whichever stretch relates to the real symbol that is close to being encoded turns into the span utilized in the subsequent stage. At the point when the sum total of what symbols have been encoded, the subsequent span unambiguously recognizes the arrangement of symbols that delivered it. Any individual who has a similar last span and model that is being utilized can remake the symbol succession that more likely than not entered the encoder to bring about that last stretch. It isn’t important to transmit the last stretch, in any case; it is just important to transmit one division that exists in that span. Specifically, it is just important to transmit enough digits (in whatever base) of the part so all divisions that start with those digits fall into the last stretch; this will ensure that the subsequent code is a prefix code.

Encoding

MATLAB

% input string
str = 'GeeksforGeeks';
fprintf('The entered string is : %s\n', str);
 
% length of the string
len = length(str);
fprintf('The length of the string is : %d\n', len);
 
% get unique characters from the string
u = unique(str);
fprintf('The unique characters are : %s\n', u);
 
% length of the unique characters string
len_unique = length(u);
fprintf('The length of unique character string is : %d\n', len_unique);
 
% General lookup table
 
% get zeros of length of unique characters
z = zeros(1, len_unique);
p = zeros(1, len_unique);
 
for i = 1 : len_unique
 
   % in 'z' variable we will find the 
   % occurrence of each characters from 'str'  
   z(i) = length(findstr(str, u(i)));
 
   % in 'p' variable we will get 
   % probability of those occurrences
   p(i) = z(i) / len;
end
display(z);
display(p);
 
% in 'cpr' variable we will get the cumulative 
% summation of 'p' from '1' till last value of 'p'
cpr = cumsum(p);
 
% in 'newcpr' variable we are taking 
% 'cpr' from '0' till last value of 'p' 
newcpr = [0 cpr];
 
display(cpr);
display(newcpr);
 
% make table till 'len_unique' size
for i = 1 : len_unique
 
   % in first column we are then 
   % placing 'newcpr' values
   interval(i, 1) = newcpr(i);
 
   % in second column we are 
   % placing 'cpr' values
   interval(i, 2) = cpr(i);
end
 
% Displaying the lookup table
display('The lookip table is : ')
display(interval);
 
% Encoder Table
 
low = 0;
high = 1;
for i = 1 : len
   for j = 1 : len_unique
 
       % if the value from 'str'
       % matches with 'u' then
       if str(i) == u(j);
           pos = j;
           j = j + 1;  
 
           % displaying the matched length 
           % of unique characters
           display(pos);
 
           % getting the tag value 
           % of the matched character
           range = high - low;
           high = low + (range .* interval(pos, 2));
           low = low + (range .* interval(pos, 1));
           i = i + 1;
           break
       end
   end
end
 
% displaying tag value
tag = low;
display(tag);

Output :

The entered string is : GeeksforGeeks
The length of the string is : 13
The unique characters are : Gefkors
The length of unique character string is : 7

z =

     2     4     1     2     1     1     2


p =

    0.1538    0.3077    0.0769    0.1538    0.0769    0.0769    0.1538


cpr =

    0.1538    0.4615    0.5385    0.6923    0.7692    0.8462    1.0000


newcpr =

         0    0.1538    0.4615    0.5385    0.6923    0.7692    0.8462    1.0000

The lookip table is : 

interval =

         0    0.1538
    0.1538    0.4615
    0.4615    0.5385
    0.5385    0.6923
    0.6923    0.7692
    0.7692    0.8462
    0.8462    1.0000


pos =

     1


pos =

     2


pos =

     2


pos =

     4


pos =

     7


pos =

     3


pos =

     5


pos =

     6


pos =

     1


pos =

     2


pos =

     2


pos =

     4


pos =

     7


tag =

    0.0409

Decoding

MATLAB

str = '';
 
for i = 1 : len
   for j = 1 : len_unique
      
       % If tag value falls between a certain 
       % range in lookup table then
       if tag > interval(j, 1) && tag < interval(j, 2)
           pos = j;
           tag = (tag - interval(pos, 1)) / p(j);
 
           % Getting the matched tag 
           % value character
           decoded_str = u(pos);
 
           % String concatenating 
           % 'decoded_str' into 'str'
           str = strcat(str, decoded_str);
           break
       end
   end
end
 
% Displaying final decoded string
disp(str)