Redundancy and Correlation in Data Mining

Last Updated : 01 Feb, 2023

Prerequisites: Chi-square test, covariance and correlation

What is Data Redundancy?

During data integration in data mining, data from several data stores is combined, which can introduce redundancy. An attribute (a column or feature of the data set) is called redundant if it can be derived from another attribute or set of attributes. Inconsistent naming of attributes or dimensions can also cause redundancy in the resulting data set.

More generally, data redundancy refers to the duplication of data in a computer system. This duplication can occur at the hardware or the software level, and it can be intentional or unintentional. When it is intentional, its main purpose is to provide a backup copy of the data in case the primary copy is lost or corrupted, which helps preserve the availability and integrity of the data after a failure.

Advantages of data redundancy include:

- Increased data availability and reliability, as there are multiple copies of the data that can be used in case the primary copy is lost or becomes unavailable.
- Improved data integrity, as multiple copies of the data can be compared to detect and correct errors.
- Increased fault tolerance, as the system can continue to function even if one copy of the data is lost or corrupted.
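The integrity benefit can be sketched as a majority vote across copies. This is only a minimal illustration, not how any particular storage system works; the `repair_by_majority` helper and the replica values are hypothetical.

```python
from collections import Counter

def repair_by_majority(copies):
    """Return the value held by a strict majority of copies, or None if there is none."""
    value, count = Counter(copies).most_common(1)[0]
    return value if count > len(copies) / 2 else None

# Three hypothetical replicas of one record; the second copy is corrupted.
replicas = ["balance=100", "balance=1O0", "balance=100"]

recovered = repair_by_majority(replicas)
# Indices of the copies that disagree with the majority value.
corrupted = [i for i, copy in enumerate(replicas) if copy != recovered]

print(recovered, corrupted)
```

With only two copies that disagree, no strict majority exists and the helper returns None: the error is detected but cannot be corrected.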

Disadvantages of data redundancy include:

- Increased storage requirements, as multiple copies of the data must be maintained.
- Increased complexity of the system, as managing multiple copies of the data can be difficult and time-consuming.
- Increased risk of data inconsistencies, as multiple copies of the data may become out of sync if updates are not properly propagated to all copies.
- Reduced performance, as the system may have to perform additional work to maintain and access multiple copies of the data.

Example: Consider a data set with three attributes: person_name, is_male, and is_female.

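is_male is 1 if the corresponding person is male, otherwise 0.
is_female is 1 if the corresponding person is female, otherwise 0.

In this data set each attribute can be derived from the other (is_female = 1 - is_male), so one of the two is redundant and can be dropped.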
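The example can be checked mechanically by testing whether one attribute determines the other. The sketch below assumes a handful of made-up rows and a hypothetical `is_derivable` helper; it only shows that the dependency holds on the sample, not that it holds in general.

```python
# Hypothetical rows for the example data set (names and values are made up).
rows = [
    {"person_name": "Asha", "is_male": 0, "is_female": 1},
    {"person_name": "Ben", "is_male": 1, "is_female": 0},
    {"person_name": "Chen", "is_male": 1, "is_female": 0},
    {"person_name": "Dina", "is_male": 0, "is_female": 1},
]

def is_derivable(rows, source, target):
    """True if every value of `source` maps to exactly one value of `target`."""
    mapping = {}
    for row in rows:
        # Remember the first target seen for this source value, then compare.
        seen = mapping.setdefault(row[source], row[target])
        if seen != row[target]:
            return False
    return True

print(is_derivable(rows, "is_male", "is_female"))    # True
print(is_derivable(rows, "is_male", "person_name"))  # False
```

Because is_female is fully determined by is_male here, keeping both columns adds no information.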

Redundancy between two attributes can be detected with correlation analysis:

- χ² test (used for nominal, i.e. categorical or qualitative, data)
- Correlation coefficient and covariance (used for numeric, i.e. quantitative, data)
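Both measures can be sketched in plain Python. The helpers `chi_square` and `pearson_r` and all sample values below are illustrative assumptions; in practice a statistics library would normally be used, and the χ² statistic would be compared against a critical value for the chosen significance level.

```python
from collections import Counter
from math import sqrt

def chi_square(xs, ys):
    """Pearson chi-square statistic for two categorical attributes."""
    n = len(xs)
    observed = Counter(zip(xs, ys))          # joint counts per (x, y) pair
    x_totals, y_totals = Counter(xs), Counter(ys)
    stat = 0.0
    for x, x_count in x_totals.items():
        for y, y_count in y_totals.items():
            expected = x_count * y_count / n  # count expected under independence
            stat += (observed[(x, y)] - expected) ** 2 / expected
    return stat

def pearson_r(xs, ys):
    """Pearson correlation coefficient: covariance divided by both standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)

# Made-up sample values for the two attributes from the example.
is_male = [1, 0, 1, 0, 1, 0]
is_female = [0, 1, 0, 1, 0, 1]
chi2 = chi_square(is_male, is_female)

# Made-up numeric attributes: the same height in two units.
height_cm = [150.0, 160.0, 170.0, 180.0]
height_in = [h / 2.54 for h in height_cm]
r = pearson_r(height_cm, height_in)

print(chi2, r)
```

For this sample the χ² statistic is 6.0, which exceeds 3.841 (the 0.05 critical value for 1 degree of freedom), so is_male and is_female are not independent. The correlation between the two height columns is 1 up to floating-point rounding, which marks one of them as redundant.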