Numerosity Reduction in Data Mining
Prerequisite: Data preprocessing
Why Data Reduction?
Data reduction shrinks a data set so that analysis becomes feasible. The reduced representation must be much smaller in volume while preserving the integrity of the original data. Many techniques can be used for data reduction, and numerosity reduction is one of them.
Numerosity reduction is a data reduction technique that replaces the original data with a smaller form of data representation. There are two types of numerosity reduction methods:
Parametric Methods –
In parametric methods, the data is represented by a model. The model is used to estimate the data, so that only the model parameters need to be stored instead of the actual data.
Regression and log-linear methods are used for creating such models.
Regression can be simple linear regression or multiple linear regression. When there is only a single independent attribute, the model is called simple linear regression; when there are multiple independent attributes, it is called multiple linear regression.
In linear regression, the data are modeled to fit a straight line. For example, a random variable y can be modeled as a linear function of another random variable x with the equation
y = ax + b
where a and b (the regression coefficients) specify the slope and y-intercept of the line, respectively.
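To illustrate how a fitted line stands in for the raw data, here is a minimal sketch of an ordinary least-squares fit for y = ax + b. The function name fit_line and the sample values are assumptions made for illustration; only the two coefficients need to be kept after fitting.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b; returns the pair (a, b)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = sxy / sxx            # slope
    b = y_bar - a * x_bar    # y-intercept
    return a, b

# Illustrative data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

a, b = fit_line(xs, ys)                 # a is 2.0, b is 1.0
estimate = [a * x + b for x in xs]      # data rebuilt from the two parameters
```

With real data the points rarely lie exactly on the line, so the stored parameters give an approximation of the original values rather than an exact copy.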
In multiple linear regression, y is modeled as a linear function of two or more predictor (independent) variables.
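The same idea extends to several predictors. The sketch below fits an intercept and one weight per predictor by solving the normal equations with a small Gauss-Jordan routine. The helper names solve and fit_multiple, the coefficient layout and the sample rows are assumptions for illustration, and the approach assumes the predictors are not linearly dependent.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_multiple(rows, ys):
    """Least squares for y = c0 + c1*x1 + ... + ck*xk; returns [c0, ..., ck]."""
    X = [[1.0] + list(r) for r in rows]              # leading 1 for the intercept
    k = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    Xty = [sum(x[i] * y for x, y in zip(X, ys)) for i in range(k)]
    return solve(XtX, Xty)

# Illustrative data generated from y = 1 + 2*x1 + 3*x2.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

coef = fit_multiple(rows, ys)   # approximately [1.0, 2.0, 3.0]
```

Five rows with two predictors each are summarized here by three numbers; the saving grows with the number of rows.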
A log-linear model can be used to estimate the probability of each data point in a multidimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations. This allows a higher-dimensional data space to be constructed from lower-dimensional spaces.
Regression and log-linear models can both be used on sparse data, although their application may be limited.
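One of the simplest log-linear models is the independence model, in which the log of a joint probability is the sum of one term per attribute. The sketch below uses it to estimate two-dimensional cell probabilities from two one-dimensional marginals. The attribute names (age, income) and the records are hypothetical, and real log-linear models usually add interaction terms that this sketch leaves out.

```python
from collections import Counter
from math import exp, log

# Hypothetical discretized records: (age group, income level).
records = (
    [("young", "low")] * 3 + [("young", "high")] * 1
    + [("old", "low")] * 3 + [("old", "high")] * 1
)
n = len(records)

# Only the one-dimensional marginal counts are stored.
age_counts = Counter(age for age, _ in records)
income_counts = Counter(income for _, income in records)

def estimated_prob(age, income):
    """Independence model: log p(age, income) = log p(age) + log p(income)."""
    log_p = log(age_counts[age] / n) + log(income_counts[income] / n)
    return exp(log_p)

p_young_low = estimated_prob("young", "low")   # 0.5 * 0.75 = 0.375
```

For d attributes with many distinct values each, storing the marginals takes far less space than storing every cell of the joint table.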
Non-Parametric Methods –
Non-parametric methods for storing reduced representations of the data include histograms, clustering, sampling and data cube aggregation.
A histogram represents data in terms of frequency. It uses binning to approximate the data distribution and is a popular form of data reduction.
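As a rough sketch of binning, the following builds an equal-width histogram and keeps only the bin ranges and their counts. The function name equal_width_histogram and the price values are assumptions for illustration, and the code assumes the values are not all identical (otherwise the bin width would be zero).

```python
from collections import Counter

def equal_width_histogram(values, num_bins):
    """Return a list of ((low, high), count) pairs for equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = Counter()
    for v in values:
        # The maximum value is placed in the last bin.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return [((lo + i * width, lo + (i + 1) * width), counts[i])
            for i in range(num_bins)]

# Illustrative price list.
prices = [0, 2, 5, 5, 8, 10, 12, 14, 15, 15, 18, 20, 21, 25, 25, 28, 30]

hist = equal_width_histogram(prices, 3)
bin_counts = [count for _, count in hist]   # [5, 6, 6]
```

Seventeen values are replaced by three (range, count) pairs. Equal-frequency bins are a common alternative when the data is skewed.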
Clustering partitions the data into groups (clusters). In data reduction, the cluster representations are used to replace the actual data. Clustering also helps to detect outliers in the data.
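The source does not name a clustering algorithm, so the sketch below uses a basic one-dimensional k-means as a stand-in. The function name kmeans_1d, the hand-picked starting centroids and the data values are all assumptions for illustration. Each cluster is then summarized by its centroid and size.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Basic 1-D k-means; returns the final centroids and their clusters."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Illustrative data with one value far from the rest.
values = [1, 2, 3, 10, 11, 12, 50]
centroids, clusters = kmeans_1d(values, centroids=[0.0, 10.0, 40.0])

# Reduced representation: one (centroid, size) pair per cluster.
summary = [(c, len(members)) for c, members in zip(centroids, clusters)]
# summary is [(2.0, 3), (11.0, 3), (50.0, 1)]
```

The single-member cluster around 50 hints at how clustering can surface a possible outlier, although a small cluster alone is not proof of one.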
Sampling can be used for data reduction because it allows a large data set to be represented by a much smaller random data sample (or subset).
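A minimal sketch of one common scheme, a simple random sample without replacement, is shown below. The population, the sample size and the seed are arbitrary choices made for illustration.

```python
import random

# Illustrative population of 10,000 values.
population = list(range(1, 10001))

rng = random.Random(42)                 # fixed seed so the run is repeatable
sample = rng.sample(population, 100)    # 100 distinct items, no replacement

# A statistic computed on the sample approximates the one on the full data.
estimated_mean = sum(sample) / len(sample)
true_mean = sum(population) / len(population)
```

The cost of drawing the sample depends on the sample size rather than on the size of the full data set, which is what makes sampling attractive for large data.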
Data Cube Aggregation:
Data cube aggregation involves moving the data from a detailed level to a more summarized level with fewer dimensions. The resulting data set is smaller in volume, without loss of the information necessary for the analysis task.
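As a small sketch of this kind of roll-up, the code below aggregates quarterly sales into annual totals, which removes the quarter dimension. The sales figures are made up for illustration.

```python
from collections import defaultdict

# Hypothetical detailed data: (year, quarter, sales amount).
quarterly_sales = [
    (2022, "Q1", 100), (2022, "Q2", 150), (2022, "Q3", 200), (2022, "Q4", 250),
    (2023, "Q1", 120), (2023, "Q2", 180), (2023, "Q3", 220), (2023, "Q4", 280),
]

# Aggregate away the quarter dimension, keeping one total per year.
annual_sales = defaultdict(int)
for year, _quarter, amount in quarterly_sales:
    annual_sales[year] += amount

annual_sales = dict(annual_sales)   # {2022: 700, 2023: 800}
```

Eight rows become two. The annual totals are exact, so nothing is lost as long as the analysis only needs yearly figures.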