# Introduction to Data Compression

In this article, we will give an overview of data compression, illustrate coding redundancy with an example, and cover the concept of entropy. Let's discuss each one by one.

**Overview:**

Data compression is an important area of research. It deals with the art and science of storing information in a compact form. Many compression packages are used to compress files before storage or transmission. Compression reduces the cost of storage, increases the speed of algorithms that process the data, and reduces the transmission cost. Compression is achieved by removing redundancy, that is, the repetition of unnecessary data. Coding redundancy refers to the redundancy caused by suboptimal coding techniques.

**Method illustration:**

- To illustrate coding redundancy, let's assume that there are six symbols, and a fixed-length binary code is used to assign a unique codeword to each of these symbols, as shown in the following table.
- A fixed-length binary code requires at least three bits to encode six symbols, yet the codes 110 and 111 are never used. This clearly shows that the fixed-length binary code is not efficient, and hence a more efficient code is required to assign a unique codeword to each symbol.

| Symbols | W1 | W2 | W3 | W4 | W5 | W6 |
|---|---|---|---|---|---|---|
| Probability | 0.3 | 0.3 | 0.1 | 0.1 | 0.08 | 0.02 |
| Binary code | 000 | 001 | 010 | 011 | 100 | 101 |

- An efficient code is one that uses the minimum number of bits to represent the information. The disadvantage of the binary code above is that it is a fixed-length code; a Huffman code is better, as it is a variable-length code that assigns shorter codewords to more probable symbols.
- Coding techniques are related to the concepts of entropy and information content, which are studied in the field of information theory. The uncertainty present in a message is called its information content. The information content of a symbol with probability p_i is given as

I(p_i) = log2(1/p_i) = -log2(p_i)
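As a quick sketch, the formula above can be evaluated for the symbol probabilities in the table (the symbol names `W1`–`W6` follow the table; the dictionary layout is just for illustration):

```python
from math import log2

# Probabilities from the table above.
p = {"W1": 0.3, "W2": 0.3, "W3": 0.1, "W4": 0.1, "W5": 0.08, "W6": 0.02}

for sym, pi in p.items():
    # Information content in bits: rarer symbols carry more information.
    print(f"{sym}: -log2({pi}) = {-log2(pi):.3f} bits")
```

Note how the rarest symbol W6 carries about 5.64 bits of information, while the most probable symbols W1 and W2 carry only about 1.74 bits each.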

**Entropy:**

- Entropy is defined as a measure of the average information content (i.e., the uncertainty) present in the information. It is given as follows:

H = -∑ p_i log2(p_i)

- Entropy is a positive quantity and specifies the minimum average number of bits per symbol necessary to encode the information. Thus, coding redundancy is given as the difference between the average number of bits used for coding and the entropy.

coding redundancy = Average number of bits - Entropy
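Applying these two formulas to the table above gives concrete numbers (a small sketch; note that the probabilities as printed in the table sum to 0.9, which we use as-is):

```python
from math import log2

# Probabilities from the table above, used exactly as printed.
probs = [0.3, 0.3, 0.1, 0.1, 0.08, 0.02]

# Entropy: H = -sum(p_i * log2(p_i))
entropy = -sum(p * log2(p) for p in probs)

# The fixed-length binary code in the table spends 3 bits on every symbol.
avg_bits = 3
redundancy = avg_bits - entropy

print(f"entropy    ≈ {entropy:.3f} bits")     # ≈ 2.111
print(f"redundancy ≈ {redundancy:.3f} bits")  # ≈ 0.889
```

So roughly 0.89 of the 3 bits spent per symbol by the fixed-length code is redundant.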

- By removing redundancy, any information can be stored in a compact manner. This is the basis of data compression.
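To see how a variable-length code removes coding redundancy, here is a minimal sketch of Huffman's algorithm using Python's `heapq` (this implementation is not from the article; the symbols and probabilities are taken from the table above, and exact codewords depend on how ties between equal probabilities are broken):

```python
import heapq

def huffman_codes(probs):
    """Build Huffman codewords for a {symbol: probability} mapping."""
    # Each heap entry: (weight, tie-breaker id, partial code table).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Repeatedly merge the two least probable subtrees,
        # prefixing their codewords with 0 and 1 respectively.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

# Probabilities from the table above, used as printed.
probs = {"W1": 0.3, "W2": 0.3, "W3": 0.1, "W4": 0.1, "W5": 0.08, "W6": 0.02}
codes = huffman_codes(probs)
avg_bits = sum(p * len(codes[s]) for s, p in probs.items())

print(codes)     # frequent symbols get shorter codewords than rare ones
print(avg_bits)  # lower than the fixed code's 3 bits per symbol
```

The frequent symbols W1 and W2 receive 2-bit codewords while the rare ones receive 3-bit codewords, so the average cost per symbol drops below the 3 bits of the fixed-length code, which is exactly the redundancy removal described above.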
