Data Integration in Data Mining

  • Difficulty Level : Basic
  • Last Updated : 30 Jun, 2022

Data Integration is a data preprocessing technique that combines data from multiple heterogeneous data sources into a coherent data store and provides a unified view of the data. These sources may include multiple data cubes, databases, or flat files. 

A data integration system is formally defined as a triple <G, S, M> where, 
G stands for the global schema, 
S stands for the set of heterogeneous source schemas, 
M stands for the mapping between queries over the source schemas and queries over the global schema. 
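
To make the triple concrete, here is a minimal sketch in Python. It simplifies M to a per-source attribute renaming rather than a full query mapping, and the source names (crm, billing) and attribute names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Integration:
    global_schema: set    # G: attribute names of the unified view
    source_schemas: dict  # S: source name -> set of source attribute names
    mappings: dict        # M (simplified): source name -> {source attr: global attr}

    def to_global(self, source, record):
        """Rewrite one source record into global-schema terms using M."""
        mapping = self.mappings[source]
        return {mapping[attr]: value
                for attr, value in record.items() if attr in mapping}


# Hypothetical sources: a CRM database and a billing flat file.
system = Integration(
    global_schema={"customer_id", "name"},
    source_schemas={
        "crm": {"cust_id", "full_name"},
        "billing": {"cust_number", "customer"},
    },
    mappings={
        "crm": {"cust_id": "customer_id", "full_name": "name"},
        "billing": {"cust_number": "customer_id", "customer": "name"},
    },
)

row = system.to_global("billing", {"cust_number": 7, "customer": "Ada"})
print(row)  # {'customer_id': 7, 'name': 'Ada'}
```

A real system would usually express M as views or queries, not simple renames; the dictionary form is only the smallest case that shows how G, S, and M relate.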


There are two major approaches to data integration: the “tight coupling approach” and the “loose coupling approach”. 

Tight Coupling: 

  • Here, a data warehouse is treated as an information retrieval component.
  • In this coupling, data is combined from different sources into a single physical location through the process of ETL – Extraction, Transformation, and Loading.
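
The ETL steps above can be sketched as follows. This is an illustrative example only: the two sources are hypothetical in-memory lists, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3

# Extract: two hypothetical sources with different layouts.
crm_rows = [{"cust_id": 1, "full_name": "ada lovelace"}]
billing_rows = [("2", "GRACE HOPPER")]  # (cust_number, customer)


def transform(crm, billing):
    """Bring both sources to one layout: (customer_id as int, name in title case)."""
    unified = [(int(r["cust_id"]), r["full_name"].title()) for r in crm]
    unified += [(int(num), name.title()) for num, name in billing]
    return unified


def load(rows):
    """Load the unified rows into one physical store (in-memory SQLite here)."""
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute(
        "CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT)"
    )
    warehouse.executemany("INSERT INTO customer VALUES (?, ?)", rows)
    warehouse.commit()
    return warehouse


warehouse = load(transform(crm_rows, billing_rows))
result = warehouse.execute(
    "SELECT customer_id, name FROM customer ORDER BY customer_id"
).fetchall()
print(result)  # [(1, 'Ada Lovelace'), (2, 'Grace Hopper')]
```

After loading, queries run against the warehouse copy only; the original sources are not consulted again until the next ETL run.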

Loose Coupling:  

  • Here, an interface is provided that takes the query from the user, transforms it in a way the source database can understand, and then sends the query directly to the source databases to obtain the result.
  • The data remains only in the actual source databases; nothing is copied into a central store.
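
A minimal sketch of such an interface (often called a mediator) might look like the following. The Source and Mediator classes and all field names are hypothetical; real sources would be separate databases reached over a network, not Python lists.

```python
class Source:
    """A hypothetical source that keeps its own data and answers its own queries."""

    def __init__(self, rows, id_field, name_field):
        self.rows = rows
        self.id_field = id_field
        self.name_field = name_field

    def query(self, field, value):
        return [r for r in self.rows if r.get(field) == value]


class Mediator:
    """Accepts a query in global terms and forwards it to every source."""

    def __init__(self, sources):
        self.sources = sources

    def find_by_customer_id(self, customer_id):
        results = []
        for src in self.sources:
            # Translate the global attribute into the source's own attribute name.
            for r in src.query(src.id_field, customer_id):
                # Translate each answer back into global-schema terms.
                results.append(
                    {"customer_id": r[src.id_field], "name": r[src.name_field]}
                )
        return results


crm = Source([{"cust_id": 1, "full_name": "Ada"}], "cust_id", "full_name")
billing = Source(
    [{"cust_number": 1, "customer": "Ada L."},
     {"cust_number": 2, "customer": "Grace"}],
    "cust_number", "customer",
)

mediator = Mediator([crm, billing])
answer = mediator.find_by_customer_id(1)
print(answer)
```

Unlike the ETL case, every call reaches the sources at query time, so answers are always current but depend on each source being available.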

Issues in Data Integration: 
There are three issues to consider during data integration: Schema Integration, Redundancy Detection, and resolution of data value conflicts. These are explained in brief below. 

1. Schema Integration: 

  • Integrate metadata from different sources.
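  • Matching equivalent real-world entities across multiple sources is referred to as the entity identification problem. For example, customer_id in one database and cust_number in another may refer to the same attribute.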
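
One way to use metadata for this is to compare attribute descriptions other than the name. The sketch below is a simple heuristic under assumed metadata fields (type, min, max, nullable); the attribute names are hypothetical, and a match only suggests a candidate that a person should confirm.

```python
# Hypothetical metadata for three attributes from different sources.
crm_meta = {"name": "cust_id", "type": "int", "min": 1, "max": 99999, "nullable": False}
billing_meta = {"name": "cust_number", "type": "int", "min": 1, "max": 99999, "nullable": False}
orders_meta = {"name": "order_total", "type": "float", "min": 0.0, "max": 1e6, "nullable": True}


def likely_same_attribute(a, b):
    """Compare metadata other than the name, since names often differ across sources."""
    keys = ("type", "min", "max", "nullable")
    return all(a[k] == b[k] for k in keys)


same = likely_same_attribute(crm_meta, billing_meta)
different = likely_same_attribute(crm_meta, orders_meta)
print(same, different)  # True False
```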

2. Redundancy: 

  • An attribute may be redundant if it can be derived or obtained from another attribute or set of attributes.
  • Inconsistencies in attribute naming can also cause redundancies in the resulting data set, because the same attribute may appear under different names.
  • Some redundancies can be detected by correlation analysis.
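
As an illustration of correlation analysis on numeric attributes, the sketch below computes the Pearson correlation coefficient by hand. The attribute names and values are made up for the example; a coefficient close to +1 or -1 suggests that one attribute may be derivable from the other.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical attributes from an integrated customer table.
annual_income = [24000, 36000, 60000, 48000]
monthly_income = [2000, 3000, 5000, 4000]  # exactly annual_income / 12
age = [30, 52, 41, 25]

r_income = pearson(annual_income, monthly_income)
r_age = pearson(annual_income, age)
print(round(r_income, 6))  # 1.0, so monthly_income is a candidate for removal
print(abs(r_age) < 0.5)    # True, no strong linear relationship in this sample
```

A high correlation only flags a candidate; whether an attribute is truly redundant still depends on what it means in each source.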

3. Detection and resolution of data value conflicts: 

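  • Attribute values from different sources may differ for the same real-world entity, for example because of differences in representation, scaling, or units (a weight stored in kilograms in one system and in pounds in another).
  • An attribute in one system may be recorded at a lower level of abstraction than the “same” attribute in another, for example total sales per branch in one database and per region in another.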
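
A unit mismatch is one kind of value conflict that can be checked mechanically. The sketch below assumes two hypothetical sources that record product weights in different units, converts both to kilograms, and reports products whose values still disagree beyond an arbitrary tolerance.

```python
LB_TO_KG = 0.45359237  # exact definition of the international pound


def to_kg(value, unit):
    """Normalize a weight to kilograms."""
    if unit == "kg":
        return value
    if unit == "lb":
        return value * LB_TO_KG
    raise ValueError(f"unknown unit: {unit}")


def find_conflicts(a, b, tolerance=0.01):
    """Return keys whose normalized weights differ by more than `tolerance` kg."""
    conflicts = []
    for key in a.keys() & b.keys():
        if abs(to_kg(*a[key]) - to_kg(*b[key])) > tolerance:
            conflicts.append(key)
    return sorted(conflicts)


# Hypothetical product weights: product id -> (value, unit).
source_a = {"P1": (10.0, "kg"), "P2": (5.0, "kg")}
source_b = {"P1": (22.0462, "lb"), "P2": (12.0, "lb")}

conflicts = find_conflicts(source_a, source_b)
print(conflicts)  # ['P2']
```

P1 agrees once both values are in the same unit, so only P2 is a genuine conflict that needs a resolution rule, such as preferring the more trusted source.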