Data Integration in Data Mining

  • Difficulty Level : Basic
  • Last Updated : 31 May, 2021

Data integration is a data preprocessing technique that combines data from multiple heterogeneous data sources into a coherent data store and provides a unified view of that data. These sources may include multiple data cubes, databases, or flat files.

A data integration approach is formally defined as a triple <G, S, M> where, 
G stands for the global schema, 
S stands for the heterogeneous set of source schemas, 
M stands for the mapping between queries over the source schemas and queries over the global schema. 
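
As a rough illustration of the triple, the sketch below stores G, S, and M as plain Python structures. It simplifies M to an attribute-to-attribute renaming instead of a full query mapping, and every source name, attribute name, and record in it is hypothetical.

```python
# Hypothetical example: all source and attribute names are made up.

# G: the global schema, i.e. the attributes exposed to the user.
G = ["customer_id", "name", "city"]

# S: the source schemas, one per heterogeneous source.
S = {
    "crm_db": ["cust_number", "full_name", "town"],
    "sales_file": ["id", "customer", "city"],
}

# M: the mapping from each source attribute to a global attribute
# (simplified here to renaming; real mappings relate queries).
M = {
    "crm_db": {"cust_number": "customer_id", "full_name": "name", "town": "city"},
    "sales_file": {"id": "customer_id", "customer": "name", "city": "city"},
}


def to_global(source, row):
    """Rename the attributes of a source row to their global-schema names."""
    mapping = M[source]
    return {mapping[attr]: value for attr, value in row.items()}


crm_row = {"cust_number": 7, "full_name": "Asha Rao", "town": "Pune"}
sales_row = {"id": 7, "customer": "Asha Rao", "city": "Pune"}

unified = [to_global("crm_db", crm_row), to_global("sales_file", sales_row)]
print(unified)
```

Both rows end up with the same attribute names, so they can be compared or stored together under the global schema.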

There are two major approaches to data integration: the “tight coupling approach” and the “loose coupling approach”. 

Tight Coupling: 

  • Here, a data warehouse is treated as an information retrieval component.
  • In this coupling, data is combined from different sources into a single physical location through the process of ETL – Extraction, Transformation, and Loading.
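
The ETL steps can be sketched roughly as follows. This is only a minimal illustration: an in-memory SQLite database stands in for the data warehouse, and the two sources, their column names, and the `sales` table are all assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical sources: a CSV export and a list of records from another system.
csv_source = io.StringIO("id,name,amount_usd\n1,Asha,10.50\n2,Ben,20.00\n")
api_source = [{"customer": "Chen", "customer_id": 3, "amount_cents": 1575}]


def extract():
    """Extract: read the raw rows from each source."""
    return list(csv.DictReader(csv_source)), list(api_source)


def transform(csv_rows, api_rows):
    """Transform: bring both sources to one schema (id, name, amount_usd)."""
    rows = [(int(r["id"]), r["name"], float(r["amount_usd"])) for r in csv_rows]
    rows += [
        (r["customer_id"], r["customer"], r["amount_cents"] / 100)
        for r in api_rows
    ]
    return rows


def load(conn, rows):
    """Load: write the unified rows into the warehouse table."""
    conn.execute("CREATE TABLE sales (id INTEGER, name TEXT, amount_usd REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # stands in for the data warehouse
load(warehouse, transform(*extract()))

# Queries now run against the single physical store, not against the sources.
result = warehouse.execute(
    "SELECT id, name, amount_usd FROM sales ORDER BY id"
).fetchall()
print(result)
```

After loading, every query is answered from the warehouse copy, which is what makes this coupling "tight".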

Loose Coupling:  

  • Here, an interface is provided that takes the query from the user, transforms it in a way the source database can understand, and then sends the query directly to the source databases to obtain the result.
  • The data remains only in the actual source databases.
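
A loose-coupling interface can be sketched as a small mediator that rewrites one user-level request into a query per source. The two in-memory SQLite databases, their table and column names, and the `query_all_customers` helper are hypothetical and chosen only to keep the example self-contained.

```python
import sqlite3

# Two hypothetical source databases with different schemas.
src_a = sqlite3.connect(":memory:")
src_a.execute("CREATE TABLE customers (cust_number INTEGER, full_name TEXT)")
src_a.execute("INSERT INTO customers VALUES (1, 'Asha')")

src_b = sqlite3.connect(":memory:")
src_b.execute("CREATE TABLE clients (id INTEGER, name TEXT)")
src_b.execute("INSERT INTO clients VALUES (2, 'Ben')")

# The user request "all customers as (id, name)" translated for each source.
SOURCE_QUERIES = [
    (src_a, "SELECT cust_number, full_name FROM customers"),
    (src_b, "SELECT id, name FROM clients"),
]


def query_all_customers():
    """Send the translated query to every source and merge the answers."""
    rows = []
    for conn, sql in SOURCE_QUERIES:
        rows.extend(conn.execute(sql).fetchall())
    return sorted(rows)


customers = query_all_customers()
print(customers)
```

No data is copied ahead of time: each call goes back to the source databases, so the result always reflects their current contents.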

Issues in Data Integration: 
There are three issues to consider during data integration: schema integration, redundancy, and detection and resolution of data value conflicts. These are explained in brief below. 

1. Schema Integration: 

  • Integrate metadata from different sources.
  • Matching the same real-world entities across multiple sources is referred to as the entity identification problem.
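
One simple way to approach entity identification is to match records on a normalized shared attribute. The sketch below assumes, purely for illustration, that both sources store an email address; the attribute names `customer_id` and `cust_number` and all records are hypothetical.

```python
# Hypothetical records describing customers in two different sources.
source_a = [
    {"customer_id": 101, "email": "Asha@Example.com"},
    {"customer_id": 102, "email": "ben@example.com"},
]
source_b = [
    {"cust_number": "A-9", "email": " asha@example.com "},
    {"cust_number": "A-7", "email": "chen@example.com"},
]


def normalize(email):
    """Remove surrounding spaces and case differences before comparing."""
    return email.strip().lower()


def match_entities(rows_a, rows_b):
    """Pair up identifiers that refer to the same real-world customer."""
    index_b = {normalize(r["email"]): r for r in rows_b}
    matches = []
    for row in rows_a:
        other = index_b.get(normalize(row["email"]))
        if other is not None:
            matches.append((row["customer_id"], other["cust_number"]))
    return matches


matches = match_entities(source_a, source_b)
print(matches)
```

Real sources rarely share such a clean key, so in practice metadata such as attribute names, types, and value ranges is also used to decide which records and attributes correspond.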

2. Redundancy: 

  • An attribute may be redundant if it can be derived or obtained from another attribute or set of attributes.
  • Inconsistencies in attribute naming can also cause redundancies in the resulting data set.
  • Some redundancies can be detected by correlation analysis.
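
For numeric attributes, correlation analysis can be sketched with the Pearson correlation coefficient. The attribute names, the values, and the 0.9 threshold below are assumptions picked for the example, not recommended settings.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def redundant_pairs(columns, threshold=0.9):
    """Return attribute pairs whose absolute correlation reaches the threshold."""
    pairs = []
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        if abs(pearson(col_a, col_b)) >= threshold:
            pairs.append((name_a, name_b))
    return pairs


# Hypothetical integrated data set: monthly_salary is annual_salary / 12.
columns = {
    "annual_salary": [36000, 48000, 60000, 84000],
    "monthly_salary": [3000, 4000, 5000, 7000],
    "age": [41, 25, 52, 33],
}

redundant = redundant_pairs(columns)
print(redundant)
```

A pair that is flagged this way is only a candidate: a strong correlation suggests that one attribute may be derivable from the other, but it does not prove it.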

3. Detection and resolution of data value conflicts: 

  • This is the third important issue in data integration.
  • Attribute values from different sources may differ for the same real-world entity.
  • An attribute in one system may be recorded at a lower level of abstraction than the “same” attribute in another.
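
Both kinds of conflict can be sketched in a few lines: values stored in different units are converted to one unit, and values stored at a lower level of abstraction are rolled up before they are compared. All records, field names, and region names here are hypothetical.

```python
from collections import defaultdict

LB_TO_KG = 0.45359237  # one pound expressed in kilograms


def to_kg(record):
    """Resolve a unit conflict by converting every weight to kilograms."""
    factor = 1.0 if record["unit"] == "kg" else LB_TO_KG
    return round(record["weight"] * factor, 2)


# The same item as recorded by two hypothetical sources.
metric_record = {"item": "widget", "weight": 2.0, "unit": "kg"}
imperial_record = {"item": "widget", "weight": 4.4, "unit": "lb"}
same_weight = to_kg(metric_record) == to_kg(imperial_record)


def roll_up(rows):
    """Aggregate branch-level sales to region level."""
    totals = defaultdict(int)
    for region, _branch, amount in rows:
        totals[region] += amount
    return dict(totals)


# One source records sales per branch, the other per region.
branch_sales = [
    ("north", "branch_1", 100),
    ("north", "branch_2", 150),
    ("south", "branch_3", 80),
]
region_sales = {"north": 250, "south": 90}

rolled = roll_up(branch_sales)
# Regions where the two sources still disagree after the roll-up.
conflicts = {r for r in rolled if rolled[r] != region_sales.get(r)}
print(same_weight, rolled, conflicts)
```

Detection only shows where the sources disagree; deciding which value to trust still needs a rule, such as preferring the more recent or the more reliable source.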