Prerequisite: Data preprocessing
Why Data Reduction?
Data reduction reduces the size of the data and makes it suitable and feasible for analysis. The volume of the data is reduced, but its integrity must be preserved. There are many techniques that can be used for data reduction. Numerosity reduction is one of them.
Numerosity reduction is a data reduction technique that replaces the original data with a smaller form of data representation. There are two types of techniques for numerosity reduction:
Parametric Methods –
For parametric methods, the data is represented using some model. The model is used to estimate the data, so that only the model parameters need to be stored instead of the actual data.
Regression and log-linear methods are used for creating such models.
Regression can be simple linear regression or multiple linear regression. When there is only a single independent attribute, the regression model is called simple linear regression; when there are multiple independent attributes, it is called multiple linear regression.
In linear regression, the data are modeled to fit a straight line. For example, a random variable y can be modeled as a linear function of another random variable x with the equation
y = ax+b
where a and b (the regression coefficients) specify the slope and y-intercept of the line, respectively.
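As a minimal sketch of how the line equation above reduces storage, the following fits a and b with NumPy and keeps only those two numbers. The data set, the noise level and the variable names are assumptions made for illustration, not part of the original article.

```python
import numpy as np

# Hypothetical data: 1,000 (x, y) pairs that lie roughly on a straight line.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 1000)
y = 2.5 * x + 1.0 + rng.normal(scale=0.1, size=x.size)

# Fit y = a*x + b by least squares; for deg=1, polyfit returns [slope, intercept].
a, b = np.polyfit(x, y, deg=1)

# Only a and b need to be stored; y is estimated from x when it is needed.
y_estimated = a * x + b
max_error = np.max(np.abs(y - y_estimated))
print(f"a={a:.3f}, b={b:.3f}, max error={max_error:.3f}")
```

The 1,000 y values are replaced by two coefficients, at the cost of a small estimation error for each point.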
In multiple linear regression, y is modeled as a linear function of two or more predictor (independent) variables.
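The multiple-predictor case can be sketched the same way. The example below assumes two hypothetical predictors and noise-free data, and solves the least-squares problem with `numpy.linalg.lstsq`; the coefficient names b0, b1 and b2 are illustrative only.

```python
import numpy as np

# Hypothetical data: y depends linearly on two predictor variables x1 and x2.
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 10.0, size=(500, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 4.0  # noise-free to keep the sketch simple

# Append a column of ones so that the last coefficient acts as the intercept.
A = np.column_stack([X, np.ones(len(X))])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
b1, b2, b0 = coeffs

# Three stored coefficients stand in for the 500 stored y values.
y_estimated = A @ coeffs
```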
A log-linear model can be used to estimate the probability of each data point in a multidimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations. This allows a higher-dimensional data space to be constructed from lower-dimensional attributes.
Regression and log-linear models can both be used on sparse data, although their application may be limited.
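One of the simplest log-linear models is the independence model for two discretized attributes, in which the log of each joint probability is the sum of two one-dimensional terms. The sketch below assumes a small hypothetical table of counts and rebuilds the two-dimensional distribution from its one-dimensional marginals; real log-linear models can also include interaction terms, which are omitted here.

```python
import numpy as np

# Hypothetical counts for two discretized attributes
# (rows: age band, columns: income band).
counts = np.array([[30, 10],
                   [60, 20],
                   [90, 30]], dtype=float)
total = counts.sum()

# Lower-dimensional summaries: one marginal distribution per attribute.
p_row = counts.sum(axis=1) / total
p_col = counts.sum(axis=0) / total

# Independence log-linear model: log p(i, j) = log p_row[i] + log p_col[j].
log_p = np.log(p_row)[:, None] + np.log(p_col)[None, :]
p_estimated = np.exp(log_p)

# Storage drops from rows * columns cells to rows + columns marginal values.
print(p_estimated)
```

The counts here were chosen so that the attributes are exactly independent, so the estimate matches the observed proportions; for data with dependence between attributes, this model would only approximate them.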
Non-Parametric Methods –
These methods store reduced representations of the data. They include histograms, clustering, sampling and data cube aggregation.
A histogram represents the data in terms of frequency. It uses binning to approximate the data distribution and is a popular form of data reduction.
Clustering partitions the whole data set into groups (clusters). In data reduction, the cluster representation of the data is used to replace the actual data. It also helps to detect outliers in the data.
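As a small illustration of binning, the following sketch summarizes a hypothetical list of prices with three equal-width bins using `numpy.histogram`. The values and the bin range are assumptions chosen for the example.

```python
import numpy as np

# Hypothetical attribute values (for example, item prices).
prices = np.array([1, 1, 5, 5, 5, 8, 8, 10, 10, 10, 12, 14, 14, 15, 15,
                   18, 18, 20, 20, 21, 25, 25, 28, 30, 30])

# Three equal-width bins over the range [0, 30]. Each bin includes its left
# edge and excludes its right edge, except the last bin, which includes both.
freq, edges = np.histogram(prices, bins=3, range=(0, 30))

# Only the bin edges and the per-bin frequencies are kept.
for left, right, count in zip(edges[:-1], edges[1:], freq):
    print(f"{left:.0f}-{right:.0f}: {count}")
```

The 25 original values are reduced to three frequencies plus the bin edges.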
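The idea of replacing data with a cluster representation can be sketched with k-means, used here as one possible clustering method (the article does not prescribe a specific algorithm). The helper `kmeans`, the two synthetic blobs and all parameter values are hypothetical.

```python
import numpy as np

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points (skip empty clusters).
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels

# Hypothetical data: two well-separated groups of 200 points each.
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.5, size=(200, 2))
blob_b = rng.normal(loc=(10.0, 10.0), scale=0.5, size=(200, 2))
points = np.vstack([blob_a, blob_b])

centroids, labels = kmeans(points, k=2)
sizes = np.bincount(labels, minlength=2)

# Reduced representation: one centroid and one count per cluster.
print(centroids, sizes)
```

Points that lie far from every centroid are candidates for outliers, which is how the same representation supports outlier detection.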
Sampling can be used for data reduction because it allows a large data set to be represented by a much smaller random data sample (or subset).
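A simple random sample without replacement is one common sampling scheme. The sketch below assumes a hypothetical attribute with 100,000 values and draws a 1% sample with NumPy's random generator.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical large attribute with 100,000 values.
data = rng.normal(loc=50.0, scale=10.0, size=100_000)

# Simple random sample without replacement: 1% of the values.
sample = rng.choice(data, size=1_000, replace=False)

# The sample mean approximates the mean of the full data set.
print(f"full mean={data.mean():.2f}, sample mean={sample.mean():.2f}")
```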
Data Cube Aggregation:
Data cube aggregation moves the data from a detailed level to a summarized level with fewer dimensions. The resulting data set is smaller in volume, without loss of the information necessary for the analysis task.
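For example, quarterly sales can be rolled up into annual sales when the analysis only needs yearly totals. The records, the branch name and the field layout below are hypothetical.

```python
from collections import defaultdict

# Hypothetical quarterly sales records: (year, quarter, branch, amount).
quarterly_sales = [
    (2021, "Q1", "A", 200), (2021, "Q2", "A", 300),
    (2021, "Q3", "A", 250), (2021, "Q4", "A", 350),
    (2022, "Q1", "A", 400), (2022, "Q2", "A", 300),
    (2022, "Q3", "A", 500), (2022, "Q4", "A", 450),
]

# Roll up the quarter dimension: keep one total per (year, branch).
annual_sales = defaultdict(int)
for year, _quarter, branch, amount in quarterly_sales:
    annual_sales[(year, branch)] += amount

for (year, branch), amount in sorted(annual_sales.items()):
    print(year, branch, amount)
```

Eight detailed rows become two aggregated rows, and the yearly totals needed for the analysis are preserved.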