Data integration is a data preprocessing technique that combines data from multiple heterogeneous sources into a coherent data store and provides a unified view of the data. These sources may include multiple data cubes, databases, or flat files.
A data integration approach is formally defined as a triple <G, S, M>, where:
G stands for the global schema,
S stands for the heterogeneous set of source schemas,
M stands for the mapping between queries over the source schemas and the global schema.
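To make the triple more concrete, here is a minimal sketch in Python. It simplifies M to plain attribute renamings rather than full query mappings, and every source, relation, and attribute name in it is a hypothetical chosen for illustration.

```python
# Hypothetical illustration of the <G, S, M> triple; all names are assumptions.

# G: the global schema, one unified relation exposed to users.
G = {"customer": ["customer_id", "name", "city"]}

# S: heterogeneous source schemas, each with its own attribute names.
S = {
    "crm_db": {"clients": ["cust_number", "full_name", "town"]},
    "billing_csv": {"accounts": ["customer_id", "name", "city"]},
}

# M: mappings from each source relation to the global relation,
# simplified here to source-attribute -> global-attribute renamings.
M = {
    ("crm_db", "clients"): {
        "cust_number": "customer_id",
        "full_name": "name",
        "town": "city",
    },
    ("billing_csv", "accounts"): {
        "customer_id": "customer_id",
        "name": "name",
        "city": "city",
    },
}


def to_global(source, relation, row):
    """Translate one source row (a dict) into the global schema using M."""
    mapping = M[(source, relation)]
    return {mapping[attr]: value for attr, value in row.items()}


unified = to_global(
    "crm_db", "clients", {"cust_number": 7, "full_name": "Ada", "town": "Pune"}
)
print(unified)
```

A row from either source comes out with the attribute names of the global schema, which is the "unified view" that the definition describes.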
There are two major approaches to data integration: the “tight coupling” approach and the “loose coupling” approach.
Tight Coupling:
- Here, a data warehouse is treated as an information retrieval component.
- In this coupling, data from different sources is combined into a single physical location through the process of ETL (Extraction, Transformation, and Loading).
Loose Coupling:
- Here, an interface takes the query from the user, transforms it into a form the source databases can understand, and then sends it directly to the source databases to obtain the result.
- The data remains only in the actual source databases.
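The difference between the two approaches can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the two in-memory databases, their table and column names, and the helper functions etl_into_warehouse and federated_customers are all hypothetical, and a real system would involve far more transformation logic.

```python
import sqlite3

# Two hypothetical source databases with different column names (assumptions).
src_a = sqlite3.connect(":memory:")
src_a.execute("CREATE TABLE clients (cust_number INTEGER, full_name TEXT)")
src_a.execute("INSERT INTO clients VALUES (1, 'Ada')")

src_b = sqlite3.connect(":memory:")
src_b.execute("CREATE TABLE accounts (customer_id INTEGER, name TEXT)")
src_b.execute("INSERT INTO accounts VALUES (2, 'Grace')")

# One query per source, each returning rows shaped as (customer_id, name).
SOURCE_QUERIES = [
    (src_a, "SELECT cust_number, full_name FROM clients"),
    (src_b, "SELECT customer_id, name FROM accounts"),
]


def etl_into_warehouse():
    """Tight coupling: copy data from every source into one physical store."""
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute("CREATE TABLE customer (customer_id INTEGER, name TEXT)")
    for conn, query in SOURCE_QUERIES:
        rows = conn.execute(query).fetchall()  # extract, renaming via the query
        warehouse.executemany("INSERT INTO customer VALUES (?, ?)", rows)  # load
    warehouse.commit()
    return warehouse


def federated_customers():
    """Loose coupling: query each source at request time; nothing is copied."""
    results = []
    for conn, query in SOURCE_QUERIES:
        results.extend(conn.execute(query).fetchall())
    return sorted(results)


warehouse = etl_into_warehouse()
tight = sorted(
    warehouse.execute("SELECT customer_id, name FROM customer").fetchall()
)
loose = federated_customers()
print(tight)
print(loose)
```

Both paths return the same rows here. The difference is where the data lives: the warehouse holds its own copy, which can go stale until the next load, while the federated query always reads the sources directly and pays that cost on every request.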
Issues in Data Integration:
There are a number of issues to consider during data integration: schema integration, redundancy, and detection and resolution of data value conflicts. These are explained briefly below.
1. Schema Integration:
- Integrate metadata from different sources.
- Real-world entities from multiple sources must be matched; this is referred to as the entity identification problem.
For example, how can the data analyst and the computer be sure that customer id in one database and customer number in another refer to the same attribute?
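One way to approach the entity identification problem is to compare attribute metadata such as data type, value range, and null rules. The sketch below is a simple heuristic, not a standard algorithm: the metadata dictionaries, the similarity measure, and the 0.75 threshold are all assumptions made for illustration.

```python
# Hypothetical metadata for three attributes from different databases.
meta_a = {"name": "customer_id", "type": "int", "min": 1, "max": 99999, "nullable": False}
meta_b = {"name": "cust_number", "type": "int", "min": 1, "max": 99999, "nullable": False}
meta_c = {"name": "order_total", "type": "float", "min": 0.0, "max": 1e6, "nullable": True}


def metadata_similarity(m1, m2):
    """Fraction of metadata properties (other than the name) that agree."""
    keys = [k for k in m1 if k != "name"]
    matches = sum(1 for k in keys if m1[k] == m2.get(k))
    return matches / len(keys)


def likely_same_attribute(m1, m2, threshold=0.75):
    """Flag a candidate match; an analyst should still confirm it."""
    return metadata_similarity(m1, m2) >= threshold


print(likely_same_attribute(meta_a, meta_b))
print(likely_same_attribute(meta_a, meta_c))
```

Matching metadata only suggests that two attributes may be the same. Two unrelated integer columns can share a type and range, so the result is best treated as a shortlist for manual review.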
2. Redundancy:
- An attribute may be redundant if it can be derived or obtained from another attribute or set of attributes.
- Inconsistencies in attribute naming can also cause redundancies in the resulting data set.
- Some redundancies can be detected by correlation analysis.
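As a rough illustration of correlation analysis, the sketch below computes the Pearson correlation coefficient between numeric attributes in plain Python. The sample columns (monthly_salary, annual_salary, age) are made-up values, and annual_salary is deliberately derived from monthly_salary to mimic a redundant attribute.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up sample data; annual_salary is derived from monthly_salary.
monthly_salary = [3000, 4200, 5100, 6100]
annual_salary = [12 * m for m in monthly_salary]
age = [41, 25, 38, 29]

r_redundant = pearson(monthly_salary, annual_salary)
r_other = pearson(monthly_salary, age)
print(round(r_redundant, 3), round(r_other, 3))
```

A coefficient close to +1 or -1 suggests that one attribute carries almost the same information as the other and may be a candidate for removal. A high correlation does not prove that one attribute is derived from the other, so the result should be checked against what the attributes actually mean.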
3. Detection and resolution of data value conflicts:
- This is the third important issue in data integration.
- Attribute values from different sources may differ for the same real-world entity.
- An attribute in one system may be recorded at a lower level of abstraction than the “same” attribute in another.
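Both kinds of conflict can be sketched in a few lines of Python. The example assumes, purely for illustration, that one source records a product weight in kilograms and another in pounds, and that one source stores sales per branch while another expects totals per region. All records and helper names are hypothetical.

```python
LB_TO_KG = 0.45359237  # definition of the international avoirdupois pound

# Hypothetical records for the same product from two sources.
source_a = {"product_id": 17, "weight": 2.0, "unit": "kg"}
source_b = {"product_id": 17, "weight": 4.4, "unit": "lb"}


def weight_in_kg(record):
    """Normalize a weight to kilograms so values can be compared."""
    if record["unit"] == "kg":
        return record["weight"]
    if record["unit"] == "lb":
        return record["weight"] * LB_TO_KG
    raise ValueError(f"unknown unit: {record['unit']}")


def values_agree(a, b, tolerance_kg=0.01):
    """Treat two weights as consistent if they differ by at most the tolerance."""
    return abs(weight_in_kg(a) - weight_in_kg(b)) <= tolerance_kg


# Abstraction-level conflict: branch-level sales vs. region-level totals.
branch_sales = [
    ("north", "branch_1", 120),
    ("north", "branch_2", 80),
    ("south", "branch_3", 50),
]


def roll_up_to_region(rows):
    """Aggregate branch-level rows to the coarser region level."""
    totals = {}
    for region, _branch, amount in rows:
        totals[region] = totals.get(region, 0) + amount
    return totals


print(values_agree(source_a, source_b))
print(roll_up_to_region(branch_sales))
```

In both cases the values are brought to a common representation before they are compared or merged: a common unit for the weights, and a common level of aggregation for the sales figures.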