Data Mining: Data Attributes and Quality
Prerequisite – Data Mining
Data: a collection of data objects and their attributes.
- An attribute is a property or characteristic of an object, e.g. a person's hair colour or the humidity of the air.
- A set of attributes describes an object. An object is also referred to as a record, an instance, or an entity.
Different types of attributes or data types:
In data mining, understanding the different types of attributes or data types is essential as it helps to determine the appropriate data analysis techniques to use. The following are the different types of data:
Nominal data: This type of data is also referred to as categorical data. Nominal data is qualitative and cannot be measured or compared numerically. The values represent categories with no inherent order or hierarchy. Examples include gender, race, religion, and occupation. Nominal data is used in data mining for classification and clustering tasks.
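Because nominal categories have no order, they are usually turned into independent binary columns before mining. A minimal sketch of one-hot encoding, using a hypothetical hair-colour attribute:

```python
# One-hot encoding of a nominal attribute (hypothetical "hair colour" values).
# Each distinct category becomes its own 0/1 column, with no implied ordering.
colours = ["black", "brown", "black", "red"]
categories = sorted(set(colours))          # ['black', 'brown', 'red']
encoded = [[1 if c == cat else 0 for cat in categories] for c in colours]
print(encoded)  # [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

Libraries such as pandas or scikit-learn provide the same transformation ready-made; the point here is only that no column is "greater" than another.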
Ordinal data: This type of data is also categorical, but with an inherent order or hierarchy. Ordinal data represents qualitative values that can be ranked in a particular order. For instance, education level can be ranked from primary to tertiary, and social status from low to high. In ordinal data the distance between values is not uniform: it cannot be said that the difference between high and medium social status is the same as the difference between medium and low. Ordinal data is used in data mining for ranking and classification tasks.
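Unlike nominal data, ordinal values can be mapped to ranks that preserve their order. A small sketch, using the education-level example above (the numeric codes are an assumption; only their ordering is meaningful, not the gaps between them):

```python
# Map an ordinal attribute (education levels) to integer ranks.
# The order is meaningful, but the spacing between ranks is not.
order = {"primary": 0, "secondary": 1, "tertiary": 2}
levels = ["tertiary", "primary", "secondary"]
ranks = [order[lvl] for lvl in levels]
print(ranks)  # [2, 0, 1]
```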
Binary data: This type of data has only two possible values, often represented as 0 and 1. Binary data is commonly used in classification tasks where the target variable has only two possible outcomes. Examples include yes/no, true/false, and pass/fail. Binary data is used in data mining for classification and association rule mining tasks.
Interval data: This type of data represents quantitative values with equal intervals between consecutive values. Interval data has no absolute zero point, so ratios of values are not meaningful. Examples include temperature in Celsius or Fahrenheit, IQ scores, and calendar dates. Interval data is used in data mining for clustering and prediction tasks.
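The "no absolute zero" point is easy to see with temperature: dividing two Celsius readings gives a misleading ratio, while converting to Kelvin (a ratio scale with a true zero) gives the physically meaningful one. A quick illustration:

```python
# Celsius is an interval scale: 20 degrees C is not "twice as hot" as 10 degrees C.
# Kelvin has a true zero, so ratios of Kelvin values are meaningful.
c1, c2 = 10.0, 20.0
naive_ratio = c2 / c1                        # 2.0 (misleading)
true_ratio = (c2 + 273.15) / (c1 + 273.15)   # about 1.035
print(round(true_ratio, 3))  # 1.035
```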
Ratio data: This type of data is similar to interval data, but with an absolute zero point. With ratio data it is possible to compute the ratio of two values, which makes meaningful comparisons possible: a weight of 80 kg really is twice a weight of 40 kg. Examples include height, weight, and income. Ratio data is used in data mining for prediction and association rule mining tasks.
Text data: This type of data represents unstructured data in the form of text, such as social media posts, customer reviews, and news articles. Text data is used in data mining for sentiment analysis, text classification, and topic modeling tasks.
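Most text-mining tasks start by reducing raw text to word counts (a "bag of words"). A minimal sketch over a hypothetical review snippet:

```python
from collections import Counter

# Bag-of-words counting: the usual first step before text
# classification, sentiment analysis, or topic modeling.
review = "great product great price fast shipping"
counts = Counter(review.split())
print(counts.most_common(1))  # [('great', 2)]
```

Real pipelines add lowercasing, punctuation stripping, and stop-word removal on top of this.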
Data Quality: Why do we preprocess the data?
Data preprocessing is an essential step in data mining and machine learning as it helps to ensure the quality of data used for analysis. There are several factors that are used for data quality assessment, including:
Incompleteness: This refers to missing data or information in the dataset. Missing data can result from errors during data entry or data loss during transmission. Preprocessing techniques, such as imputation, can be used to fill in missing values and restore the completeness of the dataset.
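A common simple imputation is to replace each missing value with the mean of the observed values. A sketch in plain Python, with `None` standing in for a missing entry:

```python
# Mean imputation: fill missing values (None) with the mean of observed ones.
values = [4.0, None, 6.0, None, 8.0]
observed = [v for v in values if v is not None]
mean = sum(observed) / len(observed)                 # 6.0
complete = [mean if v is None else v for v in values]
print(complete)  # [4.0, 6.0, 6.0, 6.0, 8.0]
```

Mean imputation is only one option; median imputation or model-based methods are often preferred when the data is skewed.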
Inconsistency: This refers to conflicting or contradictory data in the dataset, which can result from errors in data entry, data integration, or data storage. Preprocessing techniques, such as data cleaning and data integration, can be used to detect and resolve inconsistencies in the dataset.
Noise: This refers to random or irrelevant data in the dataset, which can result from errors during data collection or data entry. Preprocessing techniques, such as data smoothing and outlier detection, can be used to remove noise from the dataset.
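One basic smoothing technique is a moving average: each point is replaced by the mean of a small window of neighbours, damping random fluctuations. A sketch with a window of 3 (the window size is an arbitrary choice for illustration):

```python
# Smooth a noisy series with a simple moving average of window 3.
signal = [1.0, 9.0, 2.0, 8.0, 3.0]
window = 3
smoothed = [sum(signal[i:i + window]) / window
            for i in range(len(signal) - window + 1)]
print([round(s, 3) for s in smoothed])  # [4.0, 6.333, 4.333]
```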
Outliers: These are data points that differ significantly from the other data points in the dataset. Outliers can result from errors in data collection, data entry, or data transmission. Preprocessing techniques, such as outlier detection and removal, can be used to identify and remove outliers from the dataset.
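A simple detection rule flags points that lie more than a chosen number of standard deviations from the mean. A sketch using a z-score threshold of 2 (the threshold is a judgment call, not a fixed rule):

```python
import statistics

# Z-score outlier detection: flag points more than 2 standard
# deviations from the mean of the sample.
data = [10, 12, 11, 13, 12, 95]
mean = statistics.mean(data)
sd = statistics.stdev(data)
outliers = [x for x in data if abs(x - mean) / sd > 2]
print(outliers)  # [95]
```

The IQR (interquartile range) rule is a common alternative that is less sensitive to the outliers themselves.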
Redundancy: This refers to the presence of duplicate or overlapping data in the dataset, which can result from data integration or data storage. Preprocessing techniques, such as data deduplication, can be used to remove redundant data from the dataset.
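Deduplication at its simplest keeps the first occurrence of each record and drops exact repeats. A sketch over hypothetical (name, age) records:

```python
# Deduplicate records while preserving first-seen order.
records = [("alice", 30), ("bob", 25), ("alice", 30)]
seen, unique = set(), []
for rec in records:
    if rec not in seen:
        seen.add(rec)
        unique.append(rec)
print(unique)  # [('alice', 30), ('bob', 25)]
```

Near-duplicate detection (e.g. the same person with a typo in the name) requires fuzzier matching than this exact-equality check.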
Data format: This refers to the structure and format of the data in the dataset. Data may arrive in different formats, such as text, numerical, or categorical. Preprocessing techniques, such as data transformation and normalization, can be used to convert data into a consistent format for analysis.
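A typical normalization step is min-max scaling, which rescales a numeric attribute to the range [0, 1] so that attributes measured on different scales contribute comparably to distance-based methods. A minimal sketch:

```python
# Min-max normalization: rescale a numeric attribute to [0, 1].
values = [2.0, 4.0, 6.0, 10.0]
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
print(normalized)  # [0.0, 0.25, 0.5, 1.0]
```

Z-score standardization (subtract the mean, divide by the standard deviation) is the usual alternative when outliers would distort the min and max.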