Data Integration in Data Mining

Data integration is a data preprocessing technique that combines data from multiple heterogeneous sources into a coherent data store and provides a unified view of the data. These sources may include multiple data cubes, databases, or flat files.

A data integration approach is formally defined as a triple <G, S, M>, where:
G stands for the global schema,
S stands for the heterogeneous set of source schemas,
M stands for the mapping between queries over the source schemas and queries over the global schema.
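
As a rough illustration of the triple, the sketch below represents G, S and M as plain Python structures. The source names (`crm`, `billing`), the attribute names and the helper `to_global` are hypothetical, and M is simplified here to an attribute-level renaming rather than a full query-to-query mapping.

```python
# Hypothetical global schema G: the attributes exposed in the unified view.
G = ["customer_id", "name"]

# Hypothetical source schemas S: each source names its attributes differently.
S = {
    "crm": ["cust_id", "full_name"],
    "billing": ["customer_number", "name"],
}

# Simplified mapping M: which source attribute feeds each global attribute.
M = {
    "crm": {"customer_id": "cust_id", "name": "full_name"},
    "billing": {"customer_id": "customer_number", "name": "name"},
}

def to_global(source, record):
    """Rename one source record into the global schema using M."""
    return {g: record[M[source][g]] for g in G}

print(to_global("crm", {"cust_id": 7, "full_name": "Ada"}))
```

A record from either source comes out with the same global attribute names, which is what lets later steps treat the sources uniformly.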





There are two major approaches to data integration: the “tight coupling approach” and the “loose coupling approach”.

Tight Coupling:



  • Here, a data warehouse is treated as an information retrieval component.
  • In this coupling, data from different sources is combined into a single physical location through the process of ETL (Extraction, Transformation and Loading).
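
The ETL steps can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the two source lists, their attribute names and the `customer` table are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical source data, standing in for two separate systems.
crm_rows = [{"cust_id": 1, "full_name": " ada lovelace "}]
billing_rows = [{"customer_number": 2, "name": "ALAN TURING"}]

def extract():
    """Extract: read raw rows from each source."""
    return crm_rows, billing_rows

def transform(crm, billing):
    """Transform: rename attributes and normalize name formatting."""
    unified = [(r["cust_id"], r["full_name"].strip().title()) for r in crm]
    unified += [(r["customer_number"], r["name"].strip().title()) for r in billing]
    return unified

def load(rows, conn):
    """Load: write the unified rows into one physical store."""
    conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO customer VALUES (?, ?)", rows)
    conn.commit()

warehouse = sqlite3.connect(":memory:")  # assumption: SQLite plays the warehouse
load(transform(*extract()), warehouse)
print(warehouse.execute("SELECT * FROM customer ORDER BY customer_id").fetchall())
```

After loading, queries run against the single `customer` table and never touch the original sources.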

Loose Coupling:

  • Here, an interface is provided that takes the user's query, transforms it into a form the source databases can understand, and then sends it directly to the source databases to obtain the result.
  • The data remains only in the actual source databases.
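
A loose-coupling interface might look roughly like the sketch below. The two in-memory SQLite databases, their table and column names, and the functions `make_source` and `find_customer` are all hypothetical; the point is only that the query is rewritten per source and no data is copied into a central store.

```python
import sqlite3

def make_source(table, id_col, rows):
    """Create a hypothetical in-memory source database with its own naming."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE {table} ({id_col} INTEGER, name TEXT)")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    return conn

# Each source keeps its own data; the interface only knows how to query it.
SOURCES = [
    {"conn": make_source("clients", "cust_id", [(1, "Ada")]),
     "table": "clients", "id_col": "cust_id"},
    {"conn": make_source("accounts", "customer_number", [(2, "Alan")]),
     "table": "accounts", "id_col": "customer_number"},
]

def find_customer(customer_id):
    """Rewrite a global lookup into each source's vocabulary and merge results."""
    results = []
    for src in SOURCES:
        sql = (f"SELECT {src['id_col']}, name FROM {src['table']} "
               f"WHERE {src['id_col']} = ?")
        results.extend(src["conn"].execute(sql, (customer_id,)).fetchall())
    return results

print(find_customer(2))
```

Compared with the tight-coupling approach, results are always as fresh as the sources, at the cost of querying every source on each request.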

Issues in Data Integration:
There are a number of issues to consider during data integration: schema integration, redundancy, and detection and resolution of data value conflicts. These are explained in brief below.

1. Schema Integration:

  • Integrate metadata from different sources.
  • Matching equivalent real-world entities from multiple sources is referred to as the entity identification problem.
  • For example, how can the data analyst or the computer be sure that customer id in one database and customer number in another refer to the same attribute?
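
One way to gather evidence for the customer id question is to compare the metadata and values of the two attributes. The sketch below is a simple heuristic, not a definitive test: it checks that both columns hold the same data type and measures how much their value sets overlap. The sample values and function names are hypothetical.

```python
def attribute_overlap(values_a, values_b):
    """Jaccard overlap between the value sets of two attributes."""
    a, b = set(values_a), set(values_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def same_type(values_a, values_b):
    """True if all values in both columns share a single Python type."""
    types = {type(v) for v in values_a} | {type(v) for v in values_b}
    return len(types) == 1

# Hypothetical samples of the two attributes from the example above.
customer_id = [101, 102, 103, 104]
customer_number = [102, 103, 104, 105]

print(same_type(customer_id, customer_number))
print(attribute_overlap(customer_id, customer_number))
```

A matching type and a high overlap suggest, but do not prove, that the two attributes describe the same thing; a human check of the metadata is usually still needed.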

2. Redundancy:

  • An attribute may be redundant if it can be derived from another attribute or set of attributes.
  • Inconsistencies in attribute naming can also cause redundancies in the resulting data set.
  • Some redundancies can be detected by correlation analysis.
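
For numeric attributes, correlation analysis is often done with the Pearson correlation coefficient. The sketch below computes it by hand and flags a pair as a redundancy candidate when the absolute value exceeds a threshold. The attribute names, sample data and the 0.95 threshold are assumptions chosen for illustration, and the function assumes neither attribute is constant.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Assumes neither attribute is constant, so the denominator is non-zero.
    return cov / math.sqrt(var_x * var_y)

# Hypothetical attributes: annual revenue is derivable from monthly revenue.
monthly_revenue = [10.0, 20.0, 30.0, 40.0]
annual_revenue = [m * 12 for m in monthly_revenue]

r = pearson(monthly_revenue, annual_revenue)
THRESHOLD = 0.95  # assumed cut-off, to be tuned per data set
if abs(r) > THRESHOLD:
    print("annual_revenue is a candidate redundant attribute")
```

A high correlation only marks a candidate; whether one attribute can actually be dropped still depends on what the attributes mean.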

3. Detection and resolution of data value conflicts:

  • This is the third important issue in data integration.
  • Attribute values from different sources may differ for the same real-world entity.
  • An attribute in one system may be recorded at a lower level of abstraction than the “same” attribute in another.
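
Both kinds of conflict can be sketched briefly. In the hypothetical example below, one source records weight in kilograms and another in pounds, so values are converted to a common unit before comparison. Likewise, branch-level sales are rolled up so they can be compared with a region-level figure. All records, names and the tolerance are assumptions.

```python
LB_TO_KG = 0.45359237  # definition of the international pound in kilograms

# Hypothetical records for the same product from two sources.
source_a = {"product": "widget", "weight": 2.0, "unit": "kg"}
source_b = {"product": "widget", "weight": 4.4, "unit": "lb"}

def weight_in_kg(record):
    """Convert a record's weight to kilograms."""
    if record["unit"] == "kg":
        return record["weight"]
    if record["unit"] == "lb":
        return record["weight"] * LB_TO_KG
    raise ValueError(f"unknown unit: {record['unit']}")

def values_agree(a, b, tolerance=0.01):
    """True if two weights match within a tolerance once units are unified."""
    return abs(weight_in_kg(a) - weight_in_kg(b)) <= tolerance

print(values_agree(source_a, source_b))

# Different abstraction levels: one system stores sales per branch,
# another stores a single total per region.
branch_sales = {"north-1": 100, "north-2": 150}
region_sales = {"north": 250}
rolled_up = sum(branch_sales.values())
print(rolled_up == region_sales["north"])
```

If the converted or rolled-up values still disagree, the conflict is genuine and needs a resolution rule, such as preferring the more trusted source.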
