Tuple Duplication in Data Mining

Data integration is a data preprocessing technique that combines data from multiple heterogeneous data sources into a coherent data store and provides a unified view of the data. These sources may include multiple data cubes, databases, or flat files.

Data integration approaches are formally defined as a triple <G, S, M> (a minimal sketch follows the list), where:

  • G stands for the global schema,
  • S stands for the set of heterogeneous source schemas,
  • M stands for the mappings between queries over the source schemas and the global schema.
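
To make the triple concrete, here is a minimal Python sketch under assumed names: the sources crm_db and sales_csv, their attribute names, and the translate helper are all hypothetical, chosen only to illustrate how M rewrites a global-schema attribute into a source-schema one.

```python
# A minimal sketch of the <G, S, M> triple with hypothetical schemas.

# G: the global schema that the unified view exposes.
G = ["customer_id", "annual_revenue"]

# S: two heterogeneous source schemas being integrated.
S = {
    "crm_db":    ["cust_id", "yearly_rev"],
    "sales_csv": ["id", "revenue_per_year"],
}

# M: maps each global attribute to its counterpart in every source.
M = {
    "customer_id":    {"crm_db": "cust_id",    "sales_csv": "id"},
    "annual_revenue": {"crm_db": "yearly_rev", "sales_csv": "revenue_per_year"},
}

def translate(global_attr, source):
    """Rewrite a global-schema attribute into a source-schema attribute."""
    return M[global_attr][source]

print(translate("annual_revenue", "crm_db"))  # -> yearly_rev
```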

[Figure: Data Integration in Data Mining]

When data is integrated from several databases or applications, redundant attributes often occur. Redundancy and tuple duplication are two important issues in data integration during data mining. An attribute such as annual revenue may be redundant if it can be derived from attributes in other relations. Duplicate tuples increase the size of the database and make it more complex to manage, and both duplicate tuples and redundant attributes introduce inconsistencies into the database or data set. Duplicate tuples generally arise from inaccurate data entry or from updating files that hold similar occurrences of the same data.
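
One common way to detect a derived, and therefore redundant, numeric attribute is correlation analysis. The sketch below uses pandas to flag a pair of attributes as likely redundant when their Pearson correlation is very high; the column names and the 0.95 threshold are assumptions for illustration, not fixed rules.

```python
# A sketch of redundancy detection via the Pearson correlation coefficient.
import pandas as pd

df = pd.DataFrame({
    "quarterly_revenue": [100, 150, 200, 250],
    "annual_revenue":    [400, 600, 800, 1000],  # derived: 4 * quarterly
    "employees":         [10, 12, 9, 15],
})

corr = df["quarterly_revenue"].corr(df["annual_revenue"])
if abs(corr) > 0.95:  # threshold is an assumption; tune per data set
    print(f"correlation {corr:.2f}: annual_revenue is likely redundant")
```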

If the relations in the database contain denormalized tables, this also causes data redundancy. Inconsistencies in attribute or dimension naming across different entities can likewise introduce redundancy into the data set. Duplicate tuples also appear in relational databases when the same real-world entity is recorded under different attribute names, which may result from differences in the representation or scaling of data values.

S.No.   Petal Length   Petal Width   Sepal Length   Sepal Width
01.     3.4            5.6           4.7            4.5
02.     4.4            5.8           6.7            5.9
03.     5.9            6.9           7.8            5.8
04.     3.4            5.6           4.7            4.5

Consider the above table of a flower data set, which is a set of attribute values. The first and last tuples have identical values, so the last tuple is considered a duplicate. A tuple counts as a duplicate if all the attribute values of two rows are the same.
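
A quick way to apply this rule in practice is pandas' duplicated(), which marks a row as a duplicate only when every attribute value matches an earlier row. The sketch below rebuilds the flower table above; the column names are assumed for illustration.

```python
# Detecting duplicate tuples in the flower table with pandas.
import pandas as pd

flowers = pd.DataFrame(
    [[3.4, 5.6, 4.7, 4.5],
     [4.4, 5.8, 6.7, 5.9],
     [5.9, 6.9, 7.8, 5.8],
     [3.4, 5.6, 4.7, 4.5]],  # same values as row 0 -> duplicate
    columns=["petal_length", "petal_width", "sepal_length", "sepal_width"],
)

print(flowers.duplicated())           # True only for the last row
print(flowers[flowers.duplicated()])  # shows the duplicate tuple itself
```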

Redundancies between attributes and duplicate tuples must both be detected. Duplicate tuples each yield the same result individually, so they add no new information, yet they still degrade the overall performance of machine learning algorithms trained on a data set that contains them. Duplicate tuples may also make database maintenance more difficult.

Removing duplicate tuples is therefore considered a primary step of data preprocessing in most applications. Before performing operations on the data set or building a model, the data set is cleaned to remove noisy data; duplicate tuples are noise that reduces the accuracy of the models.

In some situations duplicate tuples cause serious problems. For example, a purchase order database may contain attributes such as the purchaser's name and address. If another purchaser has the same name and, due to a technical error, the two records also share the same address, it becomes difficult to identify which customer actually ordered the product. Duplicate tuples are handled by removing them from the data set during the data cleaning process in data mining; removal is the only way to resolve the redundancy they cause.
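
As a sketch of that cleaning step, pandas' drop_duplicates() removes fully identical tuples while keeping the first occurrence. The purchaser, address, and item columns below are hypothetical, echoing the purchase order example above.

```python
# Removing duplicate tuples during data cleaning with pandas.
import pandas as pd

orders = pd.DataFrame({
    "purchaser": ["A. Rao", "B. Lee", "A. Rao"],
    "address":   ["12 Oak St", "9 Elm Ave", "12 Oak St"],
    "item":      ["laptop", "monitor", "laptop"],
})

# Keep the first occurrence of each fully identical tuple.
cleaned = orders.drop_duplicates(keep="first").reset_index(drop=True)
print(cleaned)  # the second "A. Rao" order is dropped
```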

