Data Integration in Data Mining

Data integration is a data preprocessing technique that combines data from multiple heterogeneous sources into a coherent data store and provides a unified view of the data. These sources may include multiple data cubes, databases, or flat files.

A data integration system is formally defined as a triple <G, S, M>, where:
G stands for the global schema,
S stands for the heterogeneous set of source schemas,
M stands for the mapping between queries over the source schemas and queries over the global schema.
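
As a rough illustration of the triple, the sketch below models G as a list of attribute names, S as a dictionary of source schemas, and M as a per-source attribute renaming. This is a simplification: in the formal definition M maps queries, not just attribute names. The class name `IntegrationSystem`, the source names `crm` and `billing`, and all attribute names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class IntegrationSystem:
    global_schema: List[str]                 # G: attributes of the unified view
    source_schemas: Dict[str, List[str]]     # S: attributes of each source
    mappings: Dict[str, Dict[str, str]]      # M: source attribute -> global attribute

    def to_global(self, source: str, record: Dict[str, Any]) -> Dict[str, Any]:
        """Rewrite one source record in terms of the global schema."""
        mapping = self.mappings[source]
        return {mapping[attr]: value
                for attr, value in record.items()
                if attr in mapping}


# Hypothetical example: two sources that name the same attributes differently.
system = IntegrationSystem(
    global_schema=["customer_id", "name"],
    source_schemas={
        "crm": ["cust_number", "full_name"],
        "billing": ["customer_id", "name"],
    },
    mappings={
        "crm": {"cust_number": "customer_id", "full_name": "name"},
        "billing": {"customer_id": "customer_id", "name": "name"},
    },
)

unified = system.to_global("crm", {"cust_number": 7, "full_name": "Ada"})
```

Here `unified` holds the CRM record expressed with the global attribute names, which is the unified view the definition refers to.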

There are two major approaches to data integration: the “tight coupling” approach and the “loose coupling” approach.

Tight Coupling:

  • Here, a data warehouse is treated as an information retrieval component.
  • In this approach, data from different sources is combined into a single physical location through the ETL process: Extraction, Transformation, and Loading.
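
The ETL steps can be sketched as follows. This is a minimal example that uses an in-memory SQLite database as a stand-in for the warehouse; the two sources, the `customer` table, and the column names are assumptions made for illustration.

```python
import sqlite3


def extract():
    """Extraction: read raw rows from two hypothetical sources."""
    crm_rows = [
        {"cust_number": 1, "full_name": " ada lovelace "},
        {"cust_number": 2, "full_name": "GRACE HOPPER"},
    ]
    billing_rows = [(1, 120.5), (2, 80.0)]  # (customer id, total billed)
    return crm_rows, billing_rows


def transform(crm_rows, billing_rows):
    """Transformation: clean names and join the two sources on customer id."""
    totals = dict(billing_rows)
    return [
        (row["cust_number"],
         row["full_name"].strip().title(),
         totals.get(row["cust_number"], 0.0))
        for row in crm_rows
    ]


def load(conn, rows):
    """Loading: store the unified rows in one physical location."""
    conn.execute(
        "CREATE TABLE customer ("
        "customer_id INTEGER PRIMARY KEY, name TEXT, total_billed REAL)"
    )
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # stand-in for the data warehouse
load(warehouse, transform(*extract()))

result = warehouse.execute(
    "SELECT customer_id, name, total_billed FROM customer ORDER BY customer_id"
).fetchall()
```

After loading, queries run against the warehouse only; the original sources are no longer consulted.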

Loose Coupling:

  • Here, an interface takes the user's query, transforms it into a form the source databases can understand, and sends it directly to the source databases to obtain the result.
  • The data remains only in the actual source databases.
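
A loose-coupling interface can be sketched as a small mediator that rewrites one global query into each source's own SQL and merges the answers, without copying the data anywhere. The two in-memory SQLite sources, their table and column names, and the function `find_customer` are hypothetical.

```python
import sqlite3


def make_source(ddl, insert_sql, rows):
    """Create an independent in-memory database acting as one source."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn


crm = make_source(
    "CREATE TABLE clients (cust_number INTEGER, full_name TEXT)",
    "INSERT INTO clients VALUES (?, ?)",
    [(1, "Ada Lovelace")],
)
shop = make_source(
    "CREATE TABLE buyers (customer_id INTEGER, name TEXT)",
    "INSERT INTO buyers VALUES (?, ?)",
    [(2, "Grace Hopper")],
)

# Each source is paired with the global query rewritten in its own schema.
SOURCES = [
    (crm, "SELECT cust_number, full_name FROM clients WHERE cust_number = ?"),
    (shop, "SELECT customer_id, name FROM buyers WHERE customer_id = ?"),
]


def find_customer(customer_id):
    """Send the translated query to every source and merge the results."""
    results = []
    for conn, sql in SOURCES:
        for cid, name in conn.execute(sql, (customer_id,)):
            results.append({"customer_id": cid, "name": name})
    return results


found = find_customer(2)
```

The caller sees a single `customer_id`/`name` view, while each row is fetched from its source database at query time.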

Issues in Data Integration:
There are a number of issues to consider during data integration: schema integration, redundancy, and detection and resolution of data value conflicts. These are explained briefly below.

1. Schema Integration:

  • Integrate metadata from different sources.
  • Matching real-world entities across multiple sources is referred to as the entity identification problem.
  • For example, how can the data analyst and the computer be sure that customer id in one database and customer number in another refer to the same attribute?
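
One simple way to propose attribute matches from metadata is to normalize attribute names with a synonym table and require matching data types. The sketch below is only a heuristic under those assumptions; the synonym table, the schemas, and the function names are hypothetical, and real matches usually still need review by an analyst.

```python
import re

# Hypothetical synonym table used to normalize attribute names.
SYNONYMS = {"number": "id", "no": "id", "num": "id", "cust": "customer"}


def normalize(name):
    """Split a name into lowercase word tokens and replace known synonyms."""
    tokens = re.findall(r"[a-z]+", name.lower())
    return tuple(SYNONYMS.get(token, token) for token in tokens)


def match_attributes(schema_a, schema_b):
    """Pair attributes whose normalized names and data types agree.

    Each schema maps an attribute name to its data type (the metadata).
    """
    matches = []
    for attr_a, type_a in schema_a.items():
        for attr_b, type_b in schema_b.items():
            if normalize(attr_a) == normalize(attr_b) and type_a == type_b:
                matches.append((attr_a, attr_b))
    return matches


sales = {"customer_id": "int", "order_date": "date"}
support = {"customer_number": "int", "ticket_date": "date"}

candidates = match_attributes(sales, support)
```

With these inputs, `customer_id` and `customer_number` are proposed as the same attribute, while `order_date` and `ticket_date` are not.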

2. Redundancy:

  • An attribute may be redundant if it can be derived from another attribute or set of attributes.
  • Inconsistencies in attribute naming can also cause redundancies in the resulting data set.
  • Some redundancies can be detected by correlation analysis.
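
For numeric attributes, correlation analysis can be sketched with the Pearson correlation coefficient: a pair whose coefficient is close to +1 or -1 is a candidate for redundancy. The 0.95 threshold, the column names, and the sample values below are assumptions chosen for illustration.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def redundant_pairs(columns, threshold=0.95):
    """Return attribute pairs whose absolute correlation reaches the threshold."""
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(pearson(columns[a], columns[b])) >= threshold
    ]


# Hypothetical data: monthly_salary is derivable from annual_salary.
columns = {
    "annual_salary": [24000, 36000, 48000, 60000],
    "monthly_salary": [2000, 3000, 4000, 5000],
    "age": [25, 52, 31, 44],
}

flagged = redundant_pairs(columns)
```

A high correlation only suggests redundancy; whether one attribute can really be dropped is still a judgment about the data.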

3. Detection and resolution of data value conflicts:

  • This is the third important issue in data integration.
  • Attribute values from different sources may differ for the same real-world entity.
  • An attribute in one system may be recorded at a lower level of abstraction than the “same” attribute in another.
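
Two common ways to resolve such conflicts are converting values to a common unit and rolling lower-level records up to the coarser level used by the other source. The sketch below assumes, as a made-up example, weights recorded in kilograms or pounds and sales recorded per day in one system but per month in another.

```python
from collections import defaultdict

KG_PER_LB = 0.45359237  # exact definition of the international pound


def to_kg(value, unit):
    """Convert a weight to kilograms so values from both sources are comparable."""
    if unit == "kg":
        return value
    if unit == "lb":
        return value * KG_PER_LB
    raise ValueError(f"unknown unit: {unit}")


def roll_up_to_month(daily_sales):
    """Aggregate (YYYY-MM-DD, amount) records to monthly totals."""
    monthly = defaultdict(float)
    for date, amount in daily_sales:
        monthly[date[:7]] += amount  # 'YYYY-MM' prefix identifies the month
    return dict(monthly)


# Hypothetical daily records from the finer-grained source.
daily = [("2024-01-05", 100.0), ("2024-01-20", 50.0), ("2024-02-01", 25.0)]

monthly_totals = roll_up_to_month(daily)
weight_kg = to_kg(10, "lb")
```

After these steps the two sources describe the same entity with the same unit and at the same level of abstraction, so remaining differences point to genuine conflicts.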

