KDD Process in Data Mining

Last Updated : 23 May, 2023

In computer science, “data mining” is also referred to as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. Data mining, often treated as a synonym for Knowledge Discovery in Databases (KDD), refers to the nontrivial extraction of implicit, previously unknown, and potentially useful information from data stored in databases.

Data mining is needed to extract useful information from large datasets and to use it for predictions or better decision-making. Today, it is used almost everywhere that large amounts of data are stored and processed.

For example: the banking sector, market basket analysis, and network intrusion detection.

KDD Process

KDD (Knowledge Discovery in Databases) is a process that involves the extraction of useful, previously unknown, and potentially valuable information from large datasets. The KDD process is iterative: it usually takes multiple passes through the steps below to extract accurate knowledge from the data. The following steps are included in the KDD process:

Data Cleaning

Data cleaning is defined as the removal of noisy and irrelevant data from the collection.

Cleaning in case of missing values.
Cleaning noisy data, where noise is a random error or variance in a measured variable.
Cleaning with data discrepancy detection and data transformation tools.
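
The first two cleaning tasks can be sketched in a few lines of Python. This is only an illustration under assumptions that are not part of the article: missing values are marked as None and filled with the median, and noise is smoothed by bin means. The names fill_missing, smooth_by_bin_means, and the ages list are hypothetical.

```python
from statistics import median

def fill_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size, and replace
    each value with the mean of its bin. The result is in sorted order."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed

# Hypothetical attribute with two missing readings.
ages = [4, 8, None, 15, 21, 21, None, 24, 25]
filled = fill_missing(ages)
smoothed = smooth_by_bin_means(filled, bin_size=3)
print(filled)
print(smoothed)
```

Median filling and bin-mean smoothing are just two common choices; dropping incomplete rows or fitting a regression are equally valid alternatives depending on the data.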

Data Integration

Data integration is defined as combining heterogeneous data from multiple sources into a common store (a data warehouse). It is carried out with data migration tools, data synchronization tools, and the ETL (Extract, Transform, Load) process.
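
As a rough sketch of the idea, the snippet below merges two sources that describe the same customers under different schemas into one unified record per customer. The source names (crm_rows, billing_rows), their field names, and the integrate function are all hypothetical; a real warehouse load would normally go through dedicated ETL tooling.

```python
# Two hypothetical sources with different field names for the same key.
crm_rows = [
    {"cust_id": 1, "full_name": "Ana Ruiz"},
    {"cust_id": 2, "full_name": "Ben Cole"},
]
billing_rows = [
    {"customer": 1, "total_spent": 120.0},
    {"customer": 3, "total_spent": 45.5},
]

def integrate(crm, billing):
    """Combine both sources into one record per customer id.
    Customers present in only one source are kept, with None
    for the fields the other source would have supplied."""
    unified = {}
    for row in crm:
        unified[row["cust_id"]] = {
            "customer_id": row["cust_id"],
            "name": row["full_name"],
            "total_spent": None,
        }
    for row in billing:
        record = unified.setdefault(
            row["customer"],
            {"customer_id": row["customer"], "name": None, "total_spent": None},
        )
        record["total_spent"] = row["total_spent"]
    return [unified[key] for key in sorted(unified)]

warehouse = integrate(crm_rows, billing_rows)
for record in warehouse:
    print(record)
```

Keeping unmatched customers (an outer-join style merge) is a design choice made here so that gaps stay visible for the cleaning step instead of being silently dropped.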

Data Selection

Data selection is defined as the process of deciding which data is relevant to the analysis and retrieving it from the data collection. Methods such as neural networks, decision trees, Naive Bayes, clustering, and regression can be used for this.
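
In its simplest form, selection means keeping only the task-relevant rows and attributes. The sketch below assumes a hypothetical analysis of adult income, so the records list, the select function, and the age threshold are illustrative assumptions and not part of the KDD definition.

```python
# Hypothetical collection with one attribute that is irrelevant to the task.
records = [
    {"id": 1, "age": 34, "income": 52000, "favorite_color": "red"},
    {"id": 2, "age": 16, "income": 0, "favorite_color": "blue"},
    {"id": 3, "age": 51, "income": 73000, "favorite_color": "green"},
]

def select(records, attributes, predicate):
    """Keep the rows that satisfy predicate, then project each
    kept row onto the requested attributes."""
    return [
        {name: row[name] for name in attributes}
        for row in records
        if predicate(row)
    ]

selected = select(records, ["age", "income"], lambda row: row["age"] >= 18)
print(selected)
```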

Data Transformation

Data transformation is defined as the process of transforming data into the form required by the mining procedure. It is a two-step process:

Data mapping: Assigning elements from the source base to the destination to capture transformations.
Code generation: Creating the actual transformation program.
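
The two steps can be illustrated with a small sketch: a mapping table declares how each destination field is derived from a source field, and a function built from that table plays the role of the generated transformation program. The field names, build_transform, and the min-max scaling step are assumptions added for illustration.

```python
def min_max_scale(values):
    """Rescale numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Data mapping: destination field -> (source field, conversion function).
mapping = {
    "name": ("cust_name", str.strip),
    "income_usd": ("income", float),
}

def build_transform(mapping):
    """Build the transformation program from the declared mapping."""
    def transform(row):
        return {
            dest: convert(row[src])
            for dest, (src, convert) in mapping.items()
        }
    return transform

transform = build_transform(mapping)

# Hypothetical source rows with untrimmed names and incomes stored as text.
source_rows = [
    {"cust_name": " Ana ", "income": "52000"},
    {"cust_name": "Ben", "income": "31000"},
    {"cust_name": "Chi", "income": "73000"},
]
rows = [transform(row) for row in source_rows]
scaled = min_max_scale([row["income_usd"] for row in rows])
print(rows)
print(scaled)
```

Keeping the mapping as data and deriving the program from it means a schema change only requires editing the mapping table.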

Data Mining

Data mining is defined as the application of techniques to extract potentially useful patterns. It transforms task-relevant data into patterns and decides the purpose of the model using classification or characterization.
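
As one concrete instance of pattern extraction, the sketch below counts item pairs that occur together in market baskets and keeps those whose support reaches a threshold. It is a brute-force illustration, not an optimized algorithm such as Apriori; the transactions, the frequent_pairs function, and the 0.5 threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "milk"},
    {"bread", "eggs"},
]

def frequent_pairs(transactions, min_support):
    """Return item pairs whose support (the fraction of transactions
    containing both items) is at least min_support."""
    counts = Counter()
    for basket in transactions:
        # Sorting gives every pair one canonical ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    total = len(transactions)
    return {
        pair: count / total
        for pair, count in counts.items()
        if count / total >= min_support
    }

pairs = frequent_pairs(transactions, min_support=0.5)
print(pairs)
```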

Pattern Evaluation

Pattern evaluation is defined as identifying the truly interesting patterns that represent knowledge, based on given interestingness measures. It finds an interestingness score for each pattern and uses summarization and visualization to make the data understandable to the user.
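
Support, confidence, and lift are commonly used interestingness measures for association rules. The sketch below scores one hypothetical rule, bread -> milk, on made-up transactions; the rule_measures function and the data are assumptions for illustration only.

```python
# Hypothetical market baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "milk"},
    {"bread", "eggs"},
]

def rule_measures(transactions, antecedent, consequent):
    """Score the rule antecedent -> consequent.
    support    = P(A and B)
    confidence = P(B | A)
    lift       = confidence / P(B)
    """
    total = len(transactions)
    count_a = sum(1 for t in transactions if antecedent <= t)
    count_b = sum(1 for t in transactions if consequent <= t)
    count_ab = sum(1 for t in transactions if (antecedent | consequent) <= t)
    if count_a == 0 or count_b == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    support = count_ab / total
    confidence = count_ab / count_a
    lift = confidence / (count_b / total)
    return support, confidence, lift

support, confidence, lift = rule_measures(transactions, {"bread"}, {"milk"})
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.4f}")
```

Here the rule has high confidence, yet its lift is below 1: milk is no more likely with bread than without it. That is why a single measure is rarely enough to judge a pattern.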

Knowledge Representation

This involves presenting the results in a way that is meaningful and can be used to make decisions.

Note: KDD is an iterative process in which evaluation measures can be enhanced, mining can be refined, and new data can be integrated and transformed in order to get different and more appropriate results. Preprocessing of databases consists of data cleaning and data integration.

Advantages of KDD

Improves decision-making: KDD provides valuable insights and knowledge that can help organizations make better decisions.
Increased efficiency: KDD automates repetitive and time-consuming tasks and makes the data ready for analysis, which saves time and money.
Better customer service: KDD helps organizations gain a better understanding of their customers’ needs and preferences, which can help them provide better customer service.
Fraud detection: KDD can be used to detect fraudulent activities by identifying patterns and anomalies in the data that may indicate fraud.
Predictive modeling: KDD can be used to build predictive models that can forecast future trends and patterns.

Disadvantages of KDD

Privacy concerns: KDD can raise privacy concerns as it involves collecting and analyzing large amounts of data, which can include sensitive information about individuals.
Complexity: KDD can be a complex process that requires specialized skills and knowledge to implement and interpret the results.
Unintended consequences: KDD can lead to unintended consequences, such as bias or discrimination, if the data or models are not properly understood or used.
Data quality: The KDD process depends heavily on the quality of the data. If the data is not accurate or consistent, the results can be misleading.
High cost: KDD can be an expensive process, requiring significant investments in hardware, software, and personnel.
Overfitting: The KDD process can lead to overfitting, a common problem in machine learning where a model learns the detail and noise in the training data to the extent that it hurts the model's performance on new, unseen data.

Difference between KDD and Data Mining

Parameter	KDD	Data Mining
Definition	KDD refers to a process of identifying valid, novel, potentially useful, and ultimately understandable patterns and relationships in data.	Data Mining refers to a process of extracting useful and valuable information or patterns from large data sets.
Objective	To find useful knowledge from data.	To extract useful information from data.
Techniques Used	Data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge representation and visualization.	Association rules, classification, clustering, regression, decision trees, neural networks, and dimensionality reduction.
Output	Structured information, such as rules and models, that can be used to make decisions or predictions.	Patterns, associations, or insights that can be used to improve decision-making or understanding.
Focus	Focus is on the discovery of useful knowledge, rather than simply finding patterns in data.	Focus is on the discovery of patterns or relationships in data.
Role of domain expertise	Domain expertise is important in KDD, as it helps in defining the goals of the process, choosing appropriate data, and interpreting the results.	Domain expertise is less critical in data mining, as the algorithms are designed to identify patterns without relying on prior knowledge.
