What is Data Redundancy?
During data integration in data mining, data from several data stores is combined. This can lead to redundancy in the data. An attribute (a column or feature of the data set) is called redundant if it can be derived from another attribute or set of attributes. Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set.
For example, consider a data set with three attributes: person_name, is_male and is_female.
- is_male is 1 if the corresponding person is male, else it is 0.
- is_female is 1 if the corresponding person is female, else it is 0.
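To make these attributes concrete, here is a minimal Python sketch. The sample rows and the helper names `is_derivable` and `drop_attribute` are hypothetical, introduced only for illustration. It checks whether is_female can be derived from is_male in every row and, if so, drops it.

```python
# Hypothetical sample rows; the names and values are made up for illustration.
rows = [
    {"person_name": "Asha", "is_male": 0, "is_female": 1},
    {"person_name": "Ravi", "is_male": 1, "is_female": 0},
    {"person_name": "Meena", "is_male": 0, "is_female": 1},
    {"person_name": "Tom", "is_male": 1, "is_female": 0},
]


def is_derivable(rows, source, target, rule):
    """Return True if rule(row[source]) reproduces row[target] in every row."""
    return all(rule(row[source]) == row[target] for row in rows)


def drop_attribute(rows, name):
    """Return copies of the rows without the attribute `name`."""
    return [{k: v for k, v in row.items() if k != name} for row in rows]


# is_female is redundant if it always equals 1 - is_male.
redundant = is_derivable(rows, "is_male", "is_female", lambda v: 1 - v)
reduced = drop_attribute(rows, "is_female") if redundant else rows

print(redundant)
print(reduced[0])
```

The check only holds for the rows it has seen. On real data, a rule like this should be validated against the full data set before an attribute is dropped.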
If a person is not male (i.e. is_male is 0 for the corresponding person_name), then the person is necessarily female, since this data set records only two values, male and female. The two attributes are therefore highly correlated, and either one determines the other. Hence one of them is redundant and can be dropped without any loss of information.
Detection of Data Redundancy –
Redundancies can be detected using the following methods:
- χ2 test (used for nominal, i.e. categorical or qualitative, data)
- Correlation coefficient and covariance (used for numeric, i.e. quantitative, data)
χ2 Test for Nominal Data –
This test is performed on nominal data. Suppose a data set has two attributes A and B. A contingency table is built in which each cell counts the data tuples for one pair of values (A = a_i, B = b_j).
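The formula used for this test is:
$$\chi^2 = \sum_{i}\sum_{j} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}, \qquad e_{ij} = \frac{\mathrm{count}(A = a_i) \times \mathrm{count}(B = b_j)}{n}$$
Here o_ij is the observed (actual) count of the joint event (A = a_i, B = b_j), e_ij is its expected count computed from the row and column totals of the contingency table, and n is the number of tuples.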
The χ2 test checks the hypothesis that A and B are independent. If this hypothesis can be rejected at the chosen significance level, we can say that A and B are statistically correlated, and one of them (either A or B) can be considered for removal.
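The test can be sketched in plain Python as follows. The function name `chi_square` and the 2 x 2 table of counts are assumptions made for illustration, and the sketch assumes every row and column total is non-zero. Each expected count is computed as (row total × column total) / n, and the statistic is compared with 3.841, the χ2 critical value for 1 degree of freedom at the 0.05 significance level.

```python
def chi_square(observed):
    """Return (chi2, degrees_of_freedom) for a contingency table.

    `observed` is a list of rows; each row is a list of observed counts.
    Assumes every row total and column total is non-zero.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count if the two attributes were independent.
            e = row_totals[i] * col_totals[j] / n
            chi2 += (o - e) ** 2 / e

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof


# Hypothetical counts: rows are the values of B, columns the values of A.
observed = [
    [250, 200],
    [50, 1000],
]

chi2, dof = chi_square(observed)

# Critical value of the chi-square distribution for dof = 1 at alpha = 0.05.
CRITICAL_0_05_DOF_1 = 3.841
reject_independence = chi2 > CRITICAL_0_05_DOF_1

print(round(chi2, 2), dof, reject_independence)
```

For tables with more than one degree of freedom, the critical value has to be looked up for that number of degrees of freedom; the constant above applies only to the 2 x 2 case.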
Correlation Coefficient for Numeric Data –
This test is used for numeric data. In this case, the correlation between two attributes (say A and B) is computed with Pearson's product-moment coefficient, also known as the correlation coefficient.
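The formula used is:
$$r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A\,\sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\,\bar{A}\,\bar{B}}{n\,\sigma_A\,\sigma_B}$$
Here n is the number of tuples, a_i and b_i are the respective values of A and B in tuple i, \bar{A} and \bar{B} are the mean values of A and B, and σ_A and σ_B are their standard deviations.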
Conclusion: The correlation coefficient lies between -1 and +1. The closer its absolute value is to 1, the more strongly the attributes are (linearly) correlated, and one of them (either A or B) can be discarded. If the coefficient is 0, the attributes have no linear correlation, although this alone does not prove that they are independent. If it is negative, the attributes are negatively correlated, i.e. as the value of one attribute increases, the value of the other decreases.
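A minimal Python sketch of the coefficient is given below. The function name `pearson` and the sample values are hypothetical, and the standard deviations are computed by dividing by n (population form). The first pair of attributes is a height stored in two units, so one is an exact multiple of the other; the second pair is a set of 0/1 flags in which each value is 1 minus the other.

```python
import math


def pearson(a, b):
    """Return Pearson's correlation coefficient of two equal-length lists."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("a and b must be non-empty and of equal length")

    mean_a = sum(a) / n
    mean_b = sum(b) / n

    # Population standard deviations (divide by n, not n - 1).
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    if std_a == 0 or std_b == 0:
        raise ValueError("correlation is undefined for a constant attribute")

    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    return covariance / (std_a * std_b)


# Hypothetical numeric attributes: the same height in centimetres and inches.
height_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
height_in = [h / 2.54 for h in height_cm]

# Hypothetical 0/1 flags where each value is 1 minus the other.
is_male = [0, 1, 0, 1]
is_female = [1, 0, 1, 0]

r_height = pearson(height_cm, height_in)
r_gender = pearson(is_male, is_female)

print(r_height, r_gender)
```

In both pairs the absolute value of the coefficient is 1 (up to floating-point rounding), so one attribute of each pair carries no extra information. The second pair shows that a strongly negative coefficient can signal redundancy just as a strongly positive one does.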