Data Mining: Data Attributes and Quality
Prerequisite – Data Mining
Data: A collection of data objects and their attributes.
- An attribute is a property or characteristic of an object, for example a person's hair colour or the humidity of the air.
- A set of attributes defines an object. An object is also referred to as a record, instance, or entity.
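As a minimal sketch of the idea that a set of attributes defines an object, the record below is modelled as a Python dataclass. The class name `Person` and its fields are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, asdict


@dataclass
class Person:
    """A data object (record) described by its attributes."""
    hair_colour: str  # hypothetical attribute
    age: int          # hypothetical attribute


# One object (record, instance, entity) and its attribute set.
person = Person(hair_colour="brown", age=30)
attributes = asdict(person)
print(attributes)
```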
Different types of attributes or data types:
- Nominal Attribute:
Nominal attribute values provide only enough information to distinguish one object from another, such as a student roll number or a person's sex.
- Ordinal Attribute:
Ordinal attribute values provide enough information to order the objects, such as rankings, grades, or height recorded as short, medium, or tall.
- Binary Attribute:
A binary attribute takes only the values 0 and 1, where 0 means a feature is absent and 1 means it is present.
- Numeric Attribute:
A numeric attribute is quantitative: it is a measurable quantity represented by integer or real values. Numeric attributes are of two types:
Interval Scaled attribute:
It is measured on a scale of equal-sized units. The values are ordered and the differences between them are meaningful, but the zero point is arbitrary, so ratios are not. Temperature in °C or °F is an example.
Ratio Scaled attribute:
Both differences and ratios are meaningful, because the scale has a true zero point. Examples are age, length, and weight.
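The difference between interval-scaled and ratio-scaled attributes can be checked with a little arithmetic. The sketch below is only an illustration, and the helper names are assumptions. It shows that the ratio of two temperatures changes when the unit changes from °C to °F, while the ratio of two lengths stays the same when the unit changes from metres to centimetres.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


def metres_to_centimetres(metres):
    """Convert a length from metres to centimetres."""
    return metres * 100


# Interval scale: the zero point is arbitrary, so the ratio depends on the unit.
ratio_celsius = 20 / 10
ratio_fahrenheit = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)

# Ratio scale: there is a true zero, so the ratio is the same in any unit.
ratio_metres = 2 / 1
ratio_centimetres = metres_to_centimetres(2) / metres_to_centimetres(1)

print(ratio_celsius, ratio_fahrenheit)    # the two ratios differ
print(ratio_metres, ratio_centimetres)    # the two ratios agree
```

This is why a statement such as "twice as long" is meaningful, while "twice as hot" in °C or °F is not.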
Data Quality: Why do we preprocess the data?
Many factors determine data quality. Incomplete and inconsistent data are common in large real-world databases. Factors used for data quality assessment are:
- Accuracy:
Data can be flawed or inaccurate for many reasons. For example, attribute values may be incorrect because of human or computer errors.
- Completeness:
Data can be incomplete for several reasons. Attributes of interest, such as customer information in sales and transaction data, may not always be available.
- Consistency:
Incorrect data can also result from inconsistencies in naming conventions or data codes, or from inconsistent formats for input fields. Duplicate tuples also need to be cleaned.
- Timeliness:
Timeliness also affects data quality. For example, at the end of the month several sales representatives may fail to file their sales records on time, and several corrections and adjustments flow in after the month ends. For a time after each month, the data stored in the database is therefore incomplete.
- Believability:
It reflects how much users trust the data.
- Interpretability:
It reflects how easily users can understand the data.
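Some of these factors can be checked mechanically. The following is a minimal sketch, not a definitive method, of simple checks for completeness and consistency. The sample records, the field names, and the function names are all hypothetical and used only for illustration.

```python
from collections import Counter

# Hypothetical sales records; None marks a missing value.
records = [
    {"customer": "Alice", "country": "US", "amount": 120.0},
    {"customer": "Bob", "country": "usa", "amount": None},
    {"customer": "Alice", "country": "US", "amount": 120.0},  # duplicate tuple
    {"customer": None, "country": "U.S.", "amount": 75.5},
]


def completeness(rows, field):
    """Return the fraction of rows in which `field` is present (not None)."""
    if not rows:
        return 0.0
    present = sum(1 for row in rows if row.get(field) is not None)
    return present / len(rows)


def duplicate_rows(rows):
    """Return each row that occurs more than once, listed a single time."""
    # Sort items by field name so equal rows produce the same hashable key.
    keys = (tuple(sorted(row.items(), key=lambda item: item[0])) for row in rows)
    counts = Counter(keys)
    return [dict(key) for key, count in counts.items() if count > 1]


def distinct_codes(rows, field):
    """Return the set of distinct non-missing values used for `field`."""
    return {row[field] for row in rows if row.get(field) is not None}


print(completeness(records, "amount"))     # share of rows with an amount
print(duplicate_rows(records))             # rows that need deduplication
print(distinct_codes(records, "country"))  # several spellings of one code
```

Several spellings of what is probably the same country code point to a naming inconsistency, and a low completeness value for a field points to missing data that must be handled before mining.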