
Entity Identification Problem in Data Mining

Data mining is now used almost everywhere large amounts of data are stored and processed. Data integration is one of the major tasks of data preprocessing: it combines multiple databases or data files into a single, consistent data store. It is usually performed to build data sets for machine learning algorithms and to derive statistical information from the data during mining. The data may come from many sources, such as banking transactions, invoices, customer records, Twitter, blog postings, image, audio, or video data, electronic data interchange (EDI) files, spreadsheets, and sensor data.

Data mining often requires data integration, which merges data from multiple sources into a coherent data store, as in data warehousing. These sources may include multiple databases, data cubes, or flat files. Several issues must be considered during data integration, including schema integration and object matching.



Careful integration helps reduce and avoid redundancies and inconsistencies in the resulting data set, which in turn can improve the accuracy and speed of the subsequent data mining process. The semantic heterogeneity and structure of data pose great challenges in data integration. How can schemas and objects from different sources be matched? In other words, how can equivalent real-world entities from multiple data sources be matched up? This is known as the entity identification problem. For example, an attribute named customer_id in one database and cust_number in another may refer to the same thing.
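
As a rough illustration of entity identification, the Python sketch below maps two sources onto one shared schema and then pairs records that appear to describe the same customer. The source names, attribute names, sample rows, and the choice of a normalized email address as the matching key are all assumptions made for this example. Real systems usually need richer matching rules than a single exact key.

```python
# Hypothetical rows from two sources that describe customers
# under different attribute names and formatting.
crm_rows = [
    {"customer_id": 101, "name": "Alice Smith", "email": "Alice.Smith@example.com"},
    {"customer_id": 102, "name": "Bob Jones", "email": "bob.jones@example.com"},
]
billing_rows = [
    {"cust_number": "B-7", "full_name": "SMITH, ALICE",
     "email_address": " alice.smith@example.com "},
    {"cust_number": "B-9", "full_name": "Carol White",
     "email_address": "carol.white@example.com"},
]

# Schema integration: map each source's attribute names onto one shared schema.
CRM_TO_COMMON = {"customer_id": "source_id", "name": "name", "email": "email"}
BILLING_TO_COMMON = {"cust_number": "source_id", "full_name": "name",
                     "email_address": "email"}


def to_common(row, mapping):
    """Rename the attributes of one row according to the schema mapping."""
    return {mapping[key]: value for key, value in row.items() if key in mapping}


def match_key(record):
    """Assumed matching rule: the email address, trimmed and lower-cased."""
    return record["email"].strip().lower()


def match_entities(left_rows, right_rows):
    """Return (left_id, right_id) pairs whose matching keys are equal."""
    index = {match_key(row): row for row in left_rows}
    pairs = []
    for row in right_rows:
        left = index.get(match_key(row))
        if left is not None:
            pairs.append((left["source_id"], row["source_id"]))
    return pairs


crm = [to_common(row, CRM_TO_COMMON) for row in crm_rows]
billing = [to_common(row, BILLING_TO_COMMON) for row in billing_rows]
matches = match_entities(crm, billing)
print(matches)  # [(101, 'B-7')]
```

The name fields differ ("Alice Smith" versus "SMITH, ALICE"), so matching on names alone would miss this pair. That is why the sketch relies on a normalized key instead.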



Data is usually collected from multiple sources into a coherent store, and it can differ in dimensions and data types. The same attribute may also be represented differently or recorded on a different scale in each source.
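
To make the point about differing representations and scales concrete, here is a minimal Python sketch that converts records from two sources into one common form. The field names, units, and date formats are assumptions chosen for illustration. They are not taken from any particular system.

```python
from datetime import datetime

LB_TO_KG = 0.45359237  # one pound in kilograms


def to_kg(value, unit):
    """Convert a weight given in 'kg' or 'lb' to kilograms."""
    if unit == "kg":
        return float(value)
    if unit == "lb":
        return float(value) * LB_TO_KG
    raise ValueError(f"unsupported unit: {unit}")


def to_iso_date(text, date_format):
    """Parse a date string in the given format and return it as YYYY-MM-DD."""
    return datetime.strptime(text, date_format).date().isoformat()


def harmonize(record, unit, date_format):
    """Bring one source record into the shared representation."""
    return {
        "weight_kg": round(to_kg(record["weight"], unit), 3),
        "shipped_on": to_iso_date(record["shipped"], date_format),
    }


# Hypothetical source A stores kilograms and ISO dates;
# hypothetical source B stores pounds and day/month/year dates.
record_a = harmonize({"weight": "2.5", "shipped": "2023-01-31"}, "kg", "%Y-%m-%d")
record_b = harmonize({"weight": "5.0", "shipped": "31/01/2023"}, "lb", "%d/%m/%Y")

print(record_a)
print(record_b)
```

Once both records share one unit and one date format, their values can be compared directly during integration.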

Issues in Data Integration:

Data integration techniques:
