Numerosity Reduction in Data Mining

Prerequisite: Data preprocessing

Why Data Reduction?
Data reduction shrinks the volume of a data set so that analysis becomes feasible, while preserving the integrity of the original data. Many techniques can be used for data reduction; numerosity reduction is one of them.

Numerosity Reduction:
Numerosity reduction is a data reduction technique that replaces the original data with a smaller form of data representation. There are two kinds of numerosity reduction techniques: parametric and non-parametric methods.



Parametric Methods –

In parametric methods, the data is represented by a model. The model is used to estimate the data, so only the model parameters need to be stored instead of the actual data. Regression and log-linear models are used for this purpose.

Regression:
Regression can be simple linear regression or multiple linear regression. When there is only a single independent attribute, the model is called simple linear regression; when there are multiple independent attributes, it is called multiple linear regression.
In linear regression, the data are modeled to fit a straight line. For example, a random variable y can be modeled as a linear function of another random variable x with the equation y = ax + b,
where a and b (the regression coefficients) specify the slope and y-intercept of the line, respectively.

In multiple linear regression, y is modeled as a linear function of two or more predictor (independent) variables.
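As a rough illustration of the simple linear case, the sketch below fits y = ax + b by ordinary least squares and keeps only the two coefficients as the reduced representation. The function name fit_line and the sample values are hypothetical, chosen only for this example.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b; returns the pair (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx              # slope
    b = mean_y - a * mean_x    # y-intercept
    return a, b


# Hypothetical data lying roughly on a straight line
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

a, b = fit_line(xs, ys)

# Only a and b need to be stored; the y values are estimated on demand.
estimated_ys = [a * x + b for x in xs]
print(a, b)
```

Here five (x, y) pairs are replaced by two parameters. The estimates are approximations, so this works best when the data really is close to linear.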
 
Log-Linear Model:
A log-linear model can be used to estimate the probability of each data point in a multidimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations. This allows a higher-dimensional data space to be constructed from lower-dimensional spaces.
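The idea can be sketched with the simplest log-linear model, the independence model for two attributes, where log p(i, j) = log p(i) + log p(j). This is only a minimal illustration: practical log-linear models usually add interaction terms, and the counts below are hypothetical.

```python
import math

# Hypothetical joint counts for two discretized attributes
# (rows: attribute A with 2 values, columns: attribute B with 3 values)
counts = [
    [20, 10, 10],
    [30, 15, 15],
]

total = sum(sum(row) for row in counts)

# Lower-dimensional summaries: one marginal distribution per attribute
row_marginal = [sum(row) / total for row in counts]
col_marginal = [
    sum(counts[i][j] for i in range(len(counts))) / total
    for j in range(len(counts[0]))
]


def estimate_probability(i, j):
    """Estimate p(A=i, B=j) as exp(log p(A=i) + log p(B=j))."""
    return math.exp(math.log(row_marginal[i]) + math.log(col_marginal[j]))


# 2 + 3 marginal values are stored instead of the 2 * 3 joint cells
print(estimate_probability(0, 0), counts[0][0] / total)
```

The saving grows with the number of attributes and values, since the joint table grows multiplicatively while the marginals grow additively. The estimate is exact only when the attributes are in fact independent, as they are in this toy table.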

Regression and log-linear models can both be used on sparse data, although their application may be limited.

Non-Parametric Methods –

Non-parametric methods store reduced representations of the data without assuming a model. They include histograms, clustering, sampling and data cube aggregation.

Histograms:
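A histogram represents data in terms of frequency. It uses binning to approximate the data distribution: the values of an attribute are partitioned into disjoint buckets (bins), and only the count per bucket is stored. It is a popular form of data reduction.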
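A minimal sketch of equal-width binning is shown below. The function name equal_width_histogram and the price list are hypothetical. Equal-width is only one way to choose buckets; equal-frequency buckets are another common choice.

```python
def equal_width_histogram(values, num_bins):
    """Return (bin_edges, bin_counts) for equal-width bins over values."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("values must span a non-zero range")
    width = (hi - lo) / num_bins
    bin_counts = [0] * num_bins
    for v in values:
        # The maximum value is clamped into the last bin
        idx = min(int((v - lo) / width), num_bins - 1)
        bin_counts[idx] += 1
    bin_edges = [lo + i * width for i in range(num_bins + 1)]
    return bin_edges, bin_counts


# Hypothetical attribute values (for example, item prices)
prices = [1, 1, 5, 5, 5, 8, 10, 10, 12, 14,
          15, 18, 20, 21, 25, 25, 28, 30, 30, 30]

edges, counts = equal_width_histogram(prices, 3)
print(edges)
print(counts)  # 20 values reduced to 3 bucket counts: [8, 5, 7]
```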

Clustering:
Clustering partitions the data into groups (clusters) so that objects within a cluster are similar to one another and dissimilar to objects in other clusters. In data reduction, the cluster representations are used to replace the actual data. Clustering also helps to detect outliers in the data.
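The sketch below uses a small one-dimensional k-means loop as one possible clustering method (the article does not prescribe a specific algorithm) and keeps only each cluster's centroid and size. The data, the initial centroids and the name kmeans_1d are assumptions made for illustration.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Very small 1-D k-means; returns (centroids, clusters)."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            # Assign each value to its nearest centroid
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Recompute centroids; an empty cluster keeps its old centroid
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical values forming three well-separated groups
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 20.0, 21.0, 22.0]

centroids, clusters = kmeans_1d(values, centroids=[1.0, 10.0, 20.0])

# Reduced representation: (centroid, cluster size) instead of every value
summary = [(c, len(members)) for c, members in zip(centroids, clusters)]
print(summary)  # [(2.0, 3), (11.0, 3), (21.0, 3)]
```

How well the summary stands in for the data depends on how naturally the data falls into clusters; for smeared data the centroids are a much coarser approximation.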

Sampling:
Sampling can be used for data reduction because it allows a large data set to be represented by a much smaller random data sample (or subset).
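For instance, a simple random sample without replacement can be drawn with Python's standard random module. The helper name simple_random_sample, the population and the seed are hypothetical; other schemes such as stratified or cluster sampling are not shown.

```python
import random


def simple_random_sample(data, sample_size, seed=None):
    """Draw sample_size items without replacement from data."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(data, sample_size)


# Hypothetical data set of 1000 records, represented here by their ids
population = list(range(1, 1001))

sample = simple_random_sample(population, 50, seed=42)
print(len(sample))  # 50 records stand in for the original 1000
```

Each record has the same chance of being selected, so aggregate estimates computed on the sample approximate those of the full data set, with an error that depends on the sample size.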

Data Cube Aggregation:
Data cube aggregation summarizes data from a detailed level to a coarser level, for example by rolling quarterly sales up into annual totals. The resulting data set is smaller in volume, without losing the information needed for the analysis task.
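As a small sketch, the snippet below rolls hypothetical quarterly sales records up to one total per year, dropping the quarter dimension. The figures and variable names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical detailed records: (year, quarter, sales amount)
quarterly_sales = [
    (2021, "Q1", 100), (2021, "Q2", 150), (2021, "Q3", 120), (2021, "Q4", 130),
    (2022, "Q1", 110), (2022, "Q2", 160), (2022, "Q3", 140), (2022, "Q4", 190),
]

# Aggregate away the quarter dimension by summing per year
annual_sales = defaultdict(int)
for year, _quarter, amount in quarterly_sales:
    annual_sales[year] += amount

print(dict(annual_sales))  # {2021: 500, 2022: 600}
```

Eight records become two. This is sufficient if the analysis only needs annual figures, but the quarterly detail cannot be recovered from the aggregate.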


