Redundancy and Correlation in Data Mining

Prerequisites: Chi-square test, Covariance and Correlation

What is Data Redundancy?

Data integration in data mining combines data from multiple data stores, which can introduce redundancy. An attribute (a column or feature of the data set) is called redundant if it can be derived from another attribute or set of attributes. Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set.

Example –
We have a data set with three attributes: person_name, is_male, and is_female.

  • is_male is 1 if the corresponding person is male; otherwise it is 0.
  • is_female is 1 if the corresponding person is female; otherwise it is 0.

If a person is not male (i.e. is_male is 0 for that person_name), then the person must be female, since this data set records only two classes, male and female. The two attributes are therefore perfectly (negatively) correlated, and either one determines the other. One of them is redundant and can be dropped without any loss of information.
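The example above can be checked with a short sketch. The records, the names, and the helper function pearson are made up for illustration; the code only assumes that every row takes one of the two classes.

```python
from math import sqrt

# Hypothetical toy records; names and values are invented for illustration.
records = [
    {"person_name": "Asha",  "is_male": 0, "is_female": 1},
    {"person_name": "Ravi",  "is_male": 1, "is_female": 0},
    {"person_name": "Karan", "is_male": 1, "is_female": 0},
    {"person_name": "Meera", "is_male": 0, "is_female": 1},
]

# is_female is redundant if it can be recomputed from is_male in every row.
derivable = all(row["is_female"] == 1 - row["is_male"] for row in records)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


r = pearson(
    [row["is_male"] for row in records],
    [row["is_female"] for row in records],
)

# Drop the redundant attribute; is_male alone still carries the information.
reduced = [{k: v for k, v in row.items() if k != "is_female"} for row in records]

print(derivable, r)
print(reduced[0])
```

A correlation of magnitude 1 together with an exact derivation rule is the clearest case of redundancy; with real data the relationship is usually weaker, and deciding how strong it must be before dropping an attribute is a judgment call.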

Detection of Data Redundancy –

Redundancies can be detected using the following methods: