Data Reduction in Data Mining
Prerequisite – Data Mining
Data reduction techniques produce a condensed representation of the original data that is much smaller in volume yet preserves the integrity and analytical quality of the original data.
Methods of data reduction:
These are explained below.
1. Data Cube Aggregation:
This technique aggregates data into a simpler form. For example, suppose you gathered quarterly revenue figures for your company for the years 2012 to 2014. If the analysis concerns annual sales rather than quarterly figures, the data can be aggregated so that the result records total sales per year instead of per quarter, reducing the volume of data without losing the information the analysis needs.
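The quarterly-to-annual roll-up can be sketched in a few lines of plain Python. The revenue figures below are made up for illustration:

```python
# Hypothetical quarterly revenue, keyed by (year, quarter) -- illustrative
# numbers only. Aggregation collapses them into one total per year.
quarterly_sales = {
    (2012, 1): 224, (2012, 2): 408, (2012, 3): 350, (2012, 4): 586,
    (2013, 1): 310, (2013, 2): 290, (2013, 3): 420, (2013, 4): 380,
}

annual_sales = {}
for (year, _quarter), amount in quarterly_sales.items():
    annual_sales[year] = annual_sales.get(year, 0) + amount

print(annual_sales)  # eight quarterly values reduced to two yearly totals
```

The eight stored values shrink to two, yet the annual-sales question is still fully answerable.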
2. Dimensionality Reduction:
Whenever we encounter data in which some attributes are only weakly relevant, we retain just the attributes required for our analysis. Dimensionality reduction shrinks the data set by eliminating outdated or redundant features.
 Stepwise Forward Selection –
The selection begins with an empty set of attributes; at each step, the best of the remaining original attributes is added to the set based on its relevance, measured for example by a statistical significance test (a p-value). Suppose the data set contains the following attributes, some of which are redundant.
Initial attribute set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }
Step 1: {X1}
Step 2: {X1, X2}
Step 3: {X1, X2, X5}
Final reduced attribute set: {X1, X2, X5}
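The greedy procedure above can be sketched as follows. The `score` function here is a toy additive relevance score standing in for a real statistical test; the relevance values are assumptions chosen so the trace matches the example:

```python
def forward_selection(attributes, score, k):
    """Stepwise forward selection: start from an empty set and repeatedly
    add the attribute that most improves the subset score."""
    selected = []
    remaining = list(attributes)
    while len(selected) < k and remaining:
        best = max(remaining, key=lambda a: score(selected + [a]))
        # stop early if no candidate improves on the current subset
        if selected and score(selected + [best]) <= score(selected):
            break
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy relevance scores (assumed for illustration, not from the article):
relevance = {"X1": 0.9, "X2": 0.7, "X3": 0.1, "X4": 0.05, "X5": 0.6, "X6": 0.02}
score = lambda subset: sum(relevance[a] for a in subset)

print(forward_selection(["X1", "X2", "X3", "X4", "X5", "X6"], score, k=3))
# -> ['X1', 'X2', 'X5'], matching the worked trace above
```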
 Stepwise Backward Selection –
This selection starts with the complete set of attributes in the original data and, at each step, eliminates the worst remaining attribute in the set. Suppose the data set contains the following attributes, some of which are redundant.
Initial attribute set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: {X1, X2, X3, X4, X5, X6}
Step 1: {X1, X2, X3, X4, X5}
Step 2: {X1, X2, X3, X5}
Step 3: {X1, X2, X5}
Final reduced attribute set: {X1, X2, X5}
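The backward direction is the mirror image: drop whichever attribute's removal hurts the score least. As before, the additive relevance scores are assumed for illustration only:

```python
def backward_elimination(attributes, score, k):
    """Stepwise backward selection: start from the full attribute set and
    repeatedly remove the 'worst' attribute -- the one whose removal
    leaves the highest-scoring subset."""
    selected = list(attributes)
    while len(selected) > k:
        worst = max(selected,
                    key=lambda a: score([x for x in selected if x != a]))
        selected.remove(worst)
    return selected

# Toy relevance scores (assumed for illustration):
relevance = {"X1": 0.9, "X2": 0.7, "X3": 0.1, "X4": 0.05, "X5": 0.6, "X6": 0.02}
score = lambda subset: sum(relevance[a] for a in subset)

print(backward_elimination(["X1", "X2", "X3", "X4", "X5", "X6"], score, k=3))
# -> ['X1', 'X2', 'X5'], the same final set as forward selection here
```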
 Combination of Forward and Backward Selection –
Combining both approaches lets us eliminate the worst attribute and select the best one at each step, saving time and making the process faster.
3. Data Compression:
The data compression technique reduces the size of files using different encoding mechanisms, such as Huffman encoding and run-length encoding. Based on the compression technique used, it can be divided into two types.

 Lossless Compression –
Encoding techniques such as run-length encoding give a simple, modest reduction in data size. Lossless compression algorithms can restore the exact original data from the compressed data.
 Lossy Compression –
Methods such as the discrete wavelet transform and principal component analysis (PCA) are examples of lossy compression. For example, the JPEG image format uses lossy compression: the decompressed image is not identical to the original, but its meaning is preserved. In lossy compression, the decompressed data may differ from the original data but remains useful enough to retrieve information from.
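The lossless round trip is easy to see with run-length encoding, which stores each run of repeated values as a (value, count) pair:

```python
def rle_encode(data):
    """Run-length encoding: collapse each run of equal values
    into a [value, count] pair."""
    encoded = []
    for value in data:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded

def rle_decode(encoded):
    """Expand the pairs back out -- the original is restored exactly,
    which is what makes the scheme lossless."""
    return [value for value, count in encoded for _ in range(count)]

data = list("AAAABBBCCD")
packed = rle_encode(data)
print(packed)                      # [['A', 4], ['B', 3], ['C', 2], ['D', 1]]
assert rle_decode(packed) == data  # round trip recovers the data exactly
```

Note that RLE only shrinks data with long runs; on data without repetition the "compressed" form can be larger than the input.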
4. Numerosity Reduction:
In this technique, the actual data is replaced with a mathematical model or a smaller representation of the data. For parametric methods (such as regression), only the model parameters need to be stored; non-parametric methods such as clustering, histograms, and sampling store a reduced representation of the data instead.
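A minimal sketch of the parametric case: if the data is roughly linear (an assumption here, with made-up points), a least-squares line lets us store just two parameters instead of every observation:

```python
# Hypothetical roughly-linear data points (illustrative values):
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

# Ordinary least-squares fit of y = slope * x + intercept:
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

# Only (slope, intercept) need to be stored in place of all the points.
print(round(slope, 2), round(intercept, 2))  # -> 1.99 0.09
```

Any y-value can then be approximated from its x, at the cost of the small residual error the model leaves behind.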
5. Discretization & Concept Hierarchy Operation:
Data discretization techniques divide the range of a continuous attribute into intervals. Many constant values of the attribute are replaced by labels for small intervals, so that mining results can be presented in a concise, easily understandable way.
 Top-down discretization –
If you first choose one or a few points (so-called breakpoints or split points) to divide the whole range of attribute values, and then repeat this recursively on the resulting intervals, the process is known as top-down discretization, also called splitting.
 Bottom-up discretization –
If you first consider all the constant values as split points and then discard some by merging neighbouring values into intervals, the process is called bottom-up discretization, also known as merging.
Concept Hierarchies:
A concept hierarchy reduces data size by collecting and replacing low-level concepts (such as the numeric value 43 for age) with high-level concepts (categorical labels such as middle aged or senior).
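The age example maps to a one-function sketch. The band boundaries below are assumptions for illustration; a real hierarchy would come from domain knowledge:

```python
def age_concept(age):
    """Map a raw age to a higher-level concept.
    The band boundaries (20, 45) are illustrative assumptions."""
    if age < 20:
        return "youth"
    elif age < 45:
        return "middle aged"
    else:
        return "senior"

ages = [12, 43, 67, 30]
print([age_concept(a) for a in ages])
# -> ['youth', 'middle aged', 'senior', 'middle aged']
```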
For numeric data, the following techniques can be applied:
 Binning –
Binning converts a numerical variable into a categorical counterpart. The number of categories depends on the number of bins specified by the user.
 Histogram analysis –
Like binning, histogram analysis partitions the values of an attribute X into disjoint ranges called buckets. There are several partitioning rules:
 Equal-frequency partitioning: partition the values so that each bucket contains roughly the same number of occurrences from the data set.
 Equal-width partitioning: partition the values into intervals of fixed width determined by the number of bins, e.g. for a set of values ranging from 0 to 20.
 Clustering: group similar data values together.
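Equal-width partitioning can be sketched directly; the sample values below are assumptions matching the 0-to-20 range mentioned above:

```python
def equal_width_bins(values, num_bins):
    """Equal-width partitioning: split [min, max] into num_bins intervals
    of the same width and label each value with its bin index."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    labels = []
    for v in values:
        # the maximum value would land in bin num_bins, so clamp it
        idx = min(int((v - lo) / width), num_bins - 1)
        labels.append(idx)
    return labels

values = [0, 3, 7, 11, 14, 19, 20]
print(equal_width_bins(values, 2))  # -> [0, 0, 0, 1, 1, 1, 1]
```

With two bins over the range 0-20, values below 10 fall into bin 0 and the rest into bin 1, turning seven distinct numbers into two categorical labels.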