Data Mining: Data Attributes and Quality

Prerequisite – Data Mining 
Data: A data set is a collection of data objects together with the attributes that describe them. 
 

  • An attribute is a property or characteristic of an object, for example a person’s hair colour or the humidity of the air.
  • A set of attributes describes an object. An object is also referred to as a record, instance, or entity.
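As a minimal sketch of the idea that a set of attributes describes an object, the snippet below models one data object as a record. The `Person` class and its field names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, asdict, fields


@dataclass
class Person:
    """A hypothetical data object; each field is one attribute."""
    roll_no: int
    hair_colour: str
    age: int


# One object (record / instance / entity) described by its attribute values.
person = Person(roll_no=17, hair_colour="black", age=21)

# The attribute set that defines this kind of object.
attribute_names = [f.name for f in fields(Person)]

# The same object viewed as a plain record mapping attribute -> value.
record = asdict(person)
```

A whole data set would then be a list of such records, one per object.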

Different types of attributes or data types: 
 

  1. Nominal Attribute: 
    A nominal attribute provides only enough information to distinguish one object from another, such as a student roll number or a person’s sex. Its values are labels with no meaningful order. 
     
  2. Ordinal Attribute: 
    An ordinal attribute provides enough information to order the objects, such as rankings, grades, or height recorded as short, medium, or tall.
  3. Binary Attribute: 
    A binary attribute takes only two values, 0 and 1, where 0 usually means a feature is absent and 1 means it is present.
  4. Numeric Attribute: 
    A numeric attribute is quantitative: it is a measurable quantity represented by integer or real values. There are two types of numeric attribute:
     
    • Interval-Scaled Attribute: 
      It is measured on a scale of equal-sized units, so the values have an order and the differences between them can be compared, as with temperature in °C or °F. The scale has no true zero point, so ratios are not meaningful.
    • Ratio-Scaled Attribute: 
      It has a true zero point, so both differences and ratios are meaningful, for example age, length, and weight.
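One way to see the difference between the attribute types is to list which operations are meaningful for each. The sketch below is illustrative only: the names `AttributeType`, `MEANINGFUL_OPS`, `supports`, and `GRADE_ORDER` are assumptions made for this example, and a binary attribute is treated here as a nominal attribute with two values.

```python
from enum import Enum


class AttributeType(Enum):
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    INTERVAL = "interval"
    RATIO = "ratio"


# Each type supports everything the previous one does, plus one more operation.
MEANINGFUL_OPS = {
    AttributeType.NOMINAL: {"equality"},
    AttributeType.ORDINAL: {"equality", "order"},
    AttributeType.INTERVAL: {"equality", "order", "difference"},
    AttributeType.RATIO: {"equality", "order", "difference", "ratio"},
}


def supports(attr_type, operation):
    """Return True if the operation is meaningful for the attribute type."""
    return operation in MEANINGFUL_OPS[attr_type]


# Ordinal: grades can be ranked, but "A minus B" has no defined size.
GRADE_ORDER = ["D", "C", "B", "A"]  # assumed order, lowest to highest


def grade_rank(grade):
    return GRADE_ORDER.index(grade)


# Interval vs. ratio: Celsius has no true zero, Kelvin does.
def celsius_to_kelvin(celsius):
    return celsius + 273.15


ratio_celsius = 20.0 / 10.0  # 2.0, but 20 °C is not "twice as hot" as 10 °C
ratio_kelvin = celsius_to_kelvin(20.0) / celsius_to_kelvin(10.0)  # about 1.035
```

The temperature difference of 10 degrees is the same on both scales, but only the Kelvin ratio describes a physical relationship, which is why Celsius counts as interval-scaled and not ratio-scaled.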

Data Quality: Why do we preprocess the data? 
Several factors determine data quality. Incomplete and inconsistent data are common in large real-world databases, which is why data are preprocessed before mining. The factors used to assess data quality are: 
 

  • Accuracy: 
    Data can be inaccurate for many reasons. Attribute values may be incorrect because of human error or computer error. 
     
  • Completeness: 
    Data can be incomplete for several reasons. Attributes of interest, such as customer information in sales and transaction data, may not always be available. 
     
  • Consistency: 
    Inconsistent data can result from inconsistencies in naming conventions or data codes, or from inconsistent formats in input fields. Duplicate tuples also need to be cleaned. 
     
  • Timeliness: 
    Timeliness also affects data quality. For example, several sales representatives may fail to file their sales records on time at the end of the month, and corrections and adjustments may continue to arrive after the month ends. For a period after each month, the data stored in the database are therefore incomplete. 
     
  • Believability: 
    It reflects how much users trust the data. 
     
  • Interpretability: 
    It reflects how easily users can understand the data.
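Some of these factors can be checked mechanically. The following is a minimal sketch, assuming a small hypothetical list of sales records and an assumed validity rule (amounts must not be negative). The records, field names, and helper functions are all invented for illustration. Believability and interpretability depend on the users and are not covered here.

```python
from collections import Counter
from datetime import date

# Hypothetical sales records; None marks a missing value.
records = [
    {"id": 1, "customer": "Asha", "country": "IN", "amount": 120.0, "filed": date(2024, 1, 30)},
    {"id": 2, "customer": None, "country": "India", "amount": 80.0, "filed": date(2024, 2, 3)},
    {"id": 3, "customer": "Ravi", "country": "IN", "amount": -15.0, "filed": date(2024, 1, 31)},
    {"id": 3, "customer": "Ravi", "country": "IN", "amount": -15.0, "filed": date(2024, 1, 31)},
]


def missing_fraction(rows, attribute):
    """Completeness: share of rows in which the attribute is missing."""
    return sum(1 for r in rows if r.get(attribute) is None) / len(rows)


def invalid_amounts(rows):
    """Accuracy: ids of rows that break the assumed rule amount >= 0."""
    return [r["id"] for r in rows if r["amount"] is not None and r["amount"] < 0]


def duplicate_count(rows):
    """Consistency: number of rows that exactly repeat another row."""
    counts = Counter(tuple(sorted(r.items())) for r in rows)
    return sum(c - 1 for c in counts.values())


def distinct_codes(rows, attribute):
    """Consistency: distinct codes in use, to spot mixed conventions."""
    return {r[attribute] for r in rows if r[attribute] is not None}


def late_records(rows, month_end):
    """Timeliness: ids of rows filed after the end of the month."""
    return [r["id"] for r in rows if r["filed"] > month_end]


customer_missing = missing_fraction(records, "customer")
bad_amounts = invalid_amounts(records)
duplicates = duplicate_count(records)
country_codes = distinct_codes(records, "country")
late = late_records(records, date(2024, 1, 31))
```

In this sample, the mixed country codes "IN" and "India" illustrate an inconsistent data code, and the record filed on 3 February illustrates data that arrives after the month has closed.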

 

Improved By : shiksharanchi2000
