Prerequisite – Data Mining
Data: It refers to the data objects and their attributes, and to how they are stored.
- An attribute is a property or characteristic of an object, for example a person's hair colour or the humidity of the air.
- A set of attributes defines an object. An object is also referred to as a record, instance, or entity.
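As a minimal sketch of the idea that a set of attributes defines an object, one record can be modelled as a small Python dataclass. The class name `Person` and its fields are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, asdict


@dataclass
class Person:
    # Each field is one attribute of the data object (names are illustrative).
    hair_colour: str
    age: int
    height_cm: float


# One data object (record / instance / entity) described by its attribute set.
record = Person(hair_colour="brown", age=30, height_cm=172.5)

# View the object as a mapping from attribute name to attribute value.
attributes = asdict(record)
print(attributes)
```

A data set is then simply a collection of such records that share the same attributes.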
Different types of attributes or data types:
- Nominal Attribute:
Nominal attributes provide only enough information to distinguish one object from another, such as a student's roll number or a person's sex.
- Ordinal Attribute:
The values of an ordinal attribute provide enough information to order the objects, such as rankings, grades, or height categories (short, medium, tall).
- Binary Attribute:
These take only two values, 0 and 1, where 0 means the feature is absent and 1 means it is present.
- Numeric Attribute:
It is quantitative: the quantity can be measured and is represented as an integer or real value. Numeric attributes are of two types:
- Interval-Scaled Attribute:
It is measured on a scale of equal-sized units. The values have an order, and the difference between two values is meaningful, as with temperature in °C or °F. There is no true zero point, so ratios are not meaningful.
- Ratio-Scaled Attribute:
Both differences and ratios are meaningful because the scale has a true zero point, for example age, length, and weight.
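The attribute types above differ in which operations give meaningful results. The following sketch illustrates this with plain Python values. All names and sample values, including the grading scale `GRADE_ORDER`, are assumptions made up for the example.

```python
# Nominal: only equality and inequality are meaningful.
roll_a, roll_b = "R-101", "R-102"
same_student = roll_a == roll_b

# Ordinal: order is meaningful, but the spacing between values is not.
GRADE_ORDER = ["D", "C", "B", "A"]  # hypothetical scale, worst to best


def grade_rank(grade):
    """Return the position of a grade on the assumed scale."""
    return GRADE_ORDER.index(grade)


a_beats_b = grade_rank("A") > grade_rank("B")

# Binary: 0 means the feature is absent, 1 means it is present.
has_feature = 1


# Interval-scaled: differences are meaningful, ratios are not.
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


temp_morning_c, temp_noon_c = 10.0, 20.0
diff_c = temp_noon_c - temp_morning_c
ratio_c = temp_noon_c / temp_morning_c
ratio_f = c_to_f(temp_noon_c) / c_to_f(temp_morning_c)
# ratio_c and ratio_f differ, so "twice as hot" depends on the unit chosen.

# Ratio-scaled: a true zero makes ratios independent of the unit.
len_a_m, len_b_m = 2.0, 4.0
ratio_m = len_b_m / len_a_m
ratio_cm = (len_b_m * 100) / (len_a_m * 100)

print(same_student, a_beats_b, diff_c, ratio_c, ratio_f, ratio_m, ratio_cm)
```

The temperature ratio changes when the unit changes, while the length ratio stays the same. That is the practical difference between an interval scale and a ratio scale.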
Data Quality: Why do we preprocess the data?
Many factors determine data quality. Incomplete and inconsistent data are common properties of large real-world databases. The factors used to assess data quality are:
- Accuracy:
There are many possible reasons for inaccurate data, i.e. attributes with incorrect values, which may come from human or computer errors.
- Completeness:
Data can be incomplete for several reasons. Attributes of interest, such as customer information in sales and transaction data, may not always be available.
- Consistency:
Inconsistent data can result from inconsistencies in naming conventions or data codes, or from inconsistent formats for input fields. Duplicate tuples also need data cleaning.
- Timeliness:
Timeliness also affects the quality of the data. For example, several sales representatives may fail to file their sales records on time at the end of the month, and a number of corrections and adjustments flow in after the month ends. For a period after each month, the data stored in the database is incomplete.
- Believability:
It reflects how much users trust the data.
- Interpretability:
It reflects how easily users can understand the data.
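Some of the problems described above, such as missing values, incorrect values, inconsistent codes, and duplicate tuples, can be detected with simple checks. The sketch below uses a small list of made-up sales records. The field names, the rule that an amount must be non-negative, and the `COUNTRY_CODES` mapping are all assumptions for illustration, not part of any real data set.

```python
# Hypothetical sales records; every field name and value is made up.
records = [
    {"id": 1, "customer": "Asha", "country": "IN", "amount": 120.0},
    {"id": 2, "customer": None, "country": "India", "amount": 80.0},
    {"id": 3, "customer": "Ravi", "country": "in", "amount": -5.0},
    {"id": 1, "customer": "Asha", "country": "IN", "amount": 120.0},
]
FIELDS = ("id", "customer", "country", "amount")

# Missing values: records where any attribute of interest is absent.
incomplete = [r["id"] for r in records if any(r[f] is None for f in FIELDS)]

# Incorrect values: assumed rule that a sale amount cannot be negative.
inaccurate = [
    r["id"] for r in records if r["amount"] is not None and r["amount"] < 0
]

# Inconsistent codes: map variant spellings to one assumed canonical code.
COUNTRY_CODES = {"in": "IN", "india": "IN"}


def normalise_country(value):
    """Return the canonical code for a country value, if one is known."""
    return COUNTRY_CODES.get(value.strip().lower(), value)


countries = {normalise_country(r["country"]) for r in records}

# Duplicate tuples: records whose attribute values all match an earlier one.
seen = set()
duplicates = []
for r in records:
    key = tuple(r[f] for f in FIELDS)
    if key in seen:
        duplicates.append(r["id"])
    seen.add(key)

print(incomplete, inaccurate, countries, duplicates)
```

Checks like these only flag candidate problems. Deciding how to repair or drop the affected records is a separate preprocessing step.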