Attribute subset selection is a data reduction technique used in the data mining process. Data reduction shrinks the data set so that it can be analysed more efficiently.
Need of Attribute Subset Selection-
A data set may have a large number of attributes, some of which are irrelevant or redundant. The goal of attribute subset selection is to find a minimum set of attributes such that dropping the irrelevant ones has little effect on the utility of the data while reducing the cost of analysis. Mining a reduced data set also makes the discovered patterns easier to understand.
Process of Attribute Subset Selection-
A brute-force approach would evaluate every subset of the attributes. For a data set with n attributes there are 2^n possible subsets, so this quickly becomes too expensive as n grows.
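To make the cost of the brute-force approach concrete, here is a minimal sketch that enumerates every subset of a small attribute list and picks the best one. The attribute names and the `toy_score` function are hypothetical placeholders; in practice the score would come from evaluating a model or a statistical measure on each subset.

```python
from itertools import combinations

def all_subsets(attributes):
    """Return every subset of `attributes`, including the empty set."""
    subsets = []
    for size in range(len(attributes) + 1):
        subsets.extend(combinations(attributes, size))
    return subsets

def brute_force_best(attributes, score):
    """Score every non-empty subset and return the highest-scoring one."""
    candidates = [s for s in all_subsets(attributes) if s]
    return max(candidates, key=score)

# Hypothetical scoring function (an assumption for illustration only):
# reward two "useful" attributes and penalise the size of the subset.
USEFUL = {"age", "income"}

def toy_score(subset):
    return 2 * len(USEFUL & set(subset)) - len(subset)

attributes = ["age", "income", "zip_code", "customer_id"]
print(len(all_subsets(attributes)))             # number of subsets: 2 ** 4
print(brute_force_best(attributes, toy_score))  # best subset under toy_score
```

With only four attributes there are 16 subsets to score; every extra attribute doubles that number, which is why greedy methods are preferred.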
A more practical way to do the task is to use statistical significance tests to identify the best (or worst) attributes. These tests typically assume that the attributes are independent of one another. This is a greedy approach: a significance level is chosen (5% is a common conventional choice), a model is fitted, and the attributes whose p-value (probability value) is higher than the significance level are discarded. The model is then fitted and tested again, and the procedure is repeated until every remaining attribute has a p-value less than or equal to the significance level. The result is a reduced data set from which the attributes judged irrelevant at that level have been removed.
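The idea of keeping only attributes whose p-value is at or below a significance level can be sketched as follows. Note that this is a simplified stand-in, not the exact procedure described above: instead of refitting a model repeatedly, it tests each attribute once and on its own, using a permutation test on the Pearson correlation with the target. The function names, the toy data and the choice of test are all assumptions made for illustration.

```python
import random
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either column is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5

def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Estimate how often shuffled targets correlate at least as strongly."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed - 1e-12:
            hits += 1
    # The +1 terms keep the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)

def filter_by_significance(columns, target, alpha=0.05):
    """Keep attributes whose estimated p-value is at most `alpha`."""
    p_values = {name: permutation_p_value(col, target)
                for name, col in columns.items()}
    kept = [name for name, p in p_values.items() if p <= alpha]
    return kept, p_values

# Toy data (hypothetical): one attribute tracks the target exactly,
# the other has zero correlation with it.
target = [1, 2, 3, 4, 5, 6, 7, 8]
columns = {
    "linear": [2, 4, 6, 8, 10, 12, 14, 16],
    "unrelated": [1, 0, 0, 1, 1, 0, 0, 1],
}
kept, p_values = filter_by_significance(columns, target)
print(kept)
print(p_values)
```

Because each attribute is tested separately, this sketch relies on the same independence assumption mentioned above and will not detect attributes that are only useful in combination.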
Methods of Attribute Subset Selection-
1. Stepwise Forward Selection.
2. Stepwise Backward Elimination.
3. Combination of Forward Selection and Backward Elimination.
4. Decision Tree Induction.
All the above methods are greedy approaches for attribute subset selection.
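As a rough illustration of what "greedy" means here, the sketch below implements the first two methods in the list above against a generic `score` function. The score is an assumption: it stands in for whatever relevance measure is used (for example, one derived from p-values), and the `usefulness` function and attribute names are hypothetical.

```python
def forward_selection(attributes, score):
    """Start empty; repeatedly add the attribute that improves the score most."""
    selected = []
    best = score(selected)
    remaining = list(attributes)
    while remaining:
        trials = [(score(selected + [a]), a) for a in remaining]
        top_score, top_attr = max(trials, key=lambda pair: pair[0])
        if top_score <= best:       # no remaining attribute helps
            break
        selected.append(top_attr)
        remaining.remove(top_attr)
        best = top_score
    return selected

def backward_elimination(attributes, score):
    """Start with all attributes; repeatedly drop the least useful one."""
    selected = list(attributes)
    best = score(selected)
    while selected:
        trials = [(score([a for a in selected if a != r]), r) for r in selected]
        top_score, worst_attr = max(trials, key=lambda pair: pair[0])
        if top_score < best:        # every possible removal hurts
            break
        selected.remove(worst_attr)
        best = top_score
    return selected

# Hypothetical relevance weights and a small penalty per attribute kept.
WEIGHTS = {"age": 3.0, "income": 2.0, "zip_code": 0.0, "customer_id": 0.0}

def usefulness(subset):
    return sum(WEIGHTS[a] for a in subset) - 0.5 * len(subset)

attributes = ["age", "income", "zip_code", "customer_id"]
print(forward_selection(attributes, usefulness))
print(backward_elimination(attributes, usefulness))
```

Each step makes the locally best choice and never revisits it, so the number of subsets scored grows roughly with n^2 rather than 2^n, at the price of possibly missing the globally best subset.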
- Stepwise Forward Selection: This procedure starts with an empty set of attributes as the reduced set. In each iteration, the most relevant of the remaining attributes (the one with the minimum p-value) is added to the reduced set.
- Stepwise Backward Elimination: This procedure starts with the full set of attributes. In each iteration, it eliminates one attribute whose p-value is higher than the significance level.
- Combination of Forward Selection and Backward Elimination: Stepwise forward selection and backward elimination are combined, so that at each step the procedure adds the best of the remaining attributes and removes the worst of the current ones. This combined technique is commonly used for attribute selection.
- Decision Tree Induction: This approach uses a decision tree for attribute selection. It constructs a flowchart-like structure in which each internal node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each leaf node holds a class prediction. Attributes that do not appear in the tree are considered irrelevant and are discarded.
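The decision tree idea can be sketched in a few lines: grow a tree, record which attributes it actually tests, and treat the rest as irrelevant. The version below assumes an ID3-style tree with information gain on categorical attributes, which is one common choice and not necessarily the only one; the weather-style toy data and all function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction from splitting on categorical attribute `attr`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def attributes_used_by_tree(rows, labels, attrs):
    """Grow an ID3-style tree and return the set of attributes it tests."""
    if len(set(labels)) <= 1 or not attrs:
        return set()                       # pure node, or nothing left to test
    gains = {a: information_gain(rows, labels, a) for a in attrs}
    best = max(attrs, key=lambda a: gains[a])
    if gains[best] <= 1e-12:
        return set()                       # no attribute separates the classes
    used = {best}
    rest = [a for a in attrs if a != best]
    for value in {row[best] for row in rows}:
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in pairs]
        sub_labels = [l for _, l in pairs]
        used |= attributes_used_by_tree(sub_rows, sub_labels, rest)
    return used

# Toy data: the label depends on outlook and windy only; weekday is noise.
attrs = ["outlook", "windy", "weekday"]
rows = [dict(zip(attrs, combo))
        for combo in product(["sunny", "rainy"], ["yes", "no"], ["mon", "tue"])]
labels = ["no" if r["outlook"] == "rainy" and r["windy"] == "yes" else "yes"
          for r in rows]

used = attributes_used_by_tree(rows, labels, attrs)
print(sorted(used))                # attributes kept
print(sorted(set(attrs) - used))   # attributes discarded as irrelevant
```

Here the tree never needs to test `weekday`, so it would be dropped from the reduced data set.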