Attribute subset Selection is a technique which is used for data reduction in data mining process. Data reduction reduces the size of data so that it can be used for analysis purposes more efficiently.
Need of Attribute Subset Selection-
The data set may have a large number of attributes. But some of those attributes can be irrelevant or redundant. The goal of attribute subset selection is to find a minimum set of attributes such that dropping of those irrelevant attributes does not much affect the utility of data and the cost of data analysis could be reduced. Mining on a reduced data set also makes the discovered pattern easier to understand.
Process of Attribute Subset Selection-
The brute force approach can be very expensive in which each subset (2^n possible subsets) of the data having n attributes can be analysed.
The best way to do the task is to use the statistical significance tests such that best (or worst) attributes can be recognized. Statistical significance test assumes that attributes are independent of one another. This is a kind of greedy approach in which a significance level is decided (statistically ideal value of significance level is 5%) and the models are tested again and again until p-value (probability value) of all attributes is less than or equal to the selected significance level. The attributes having p-value higher than significance level are discarded. This procedure is repeated again and again until all the attribute in data set has p-value less than or equal to the significance level. This gives us the reduced data set having no irrelevant attributes.
Methods of Attribute Subset Selection-
1. Stepwise Forward Selection.
2. Stepwise Backward Elimination.
3. Combination of Forward Selection and Backward Elimination.
4. Decision Tree Induction.
All the above methods are greedy approaches for attribute subset selection.
- Stepwise Forward Selection: This procedure start with an empty set of attributes as the minimal set. The most relevant attributes are chosen(having minimum p-value) and are added to the minimal set. In each iteration, one attribute is added to a reduced set.
- Stepwise Backward Elimination: Here all the attributes are considered in the initial set of attributes. In each iteration, one attribute is eliminated from the set of attributes whose p-value is higher than significance level.
- Combination of Forward Selection and Backward Elimination: The stepwise forward selection and backward elimination are combined so as to select the relevant attributes most efficiently. This is the most common technique which is generally used for attribute selection.
- Decision Tree Induction: This approach uses decision tree for attribute selection. It constructs a flow chart like structure having nodes denoting a test on an attribute. Each branch corresponds to the outcome of test and leaf nodes is a class prediction. The attribute that is not the part of tree is considered irrelevant and hence discarded.
- Difference between Data Warehousing and Data Mining
- Types of Sources of Data in Data Mining
- Data Mining
- Data Mining | Set 2
- Data Normalization in Data Mining
- Data Integration in Data Mining
- Data Preprocessing in Data Mining
- KDD Process in Data Mining
- Numerosity Reduction in Data Mining
- Redundancy and Correlation in Data Mining
- Relationship between Data Mining and Machine Learning
- Frequent Item set in Data set (Association Rule Mining)
- Web Mining
- Characteristics of Biological Data (Genome Data Management)
- Difference between a Data Analyst and a Data Scientist
If you like GeeksforGeeks and would like to contribute, you can also write an article using contribute.geeksforgeeks.org or mail your article to firstname.lastname@example.org. See your article appearing on the GeeksforGeeks main page and help other Geeks.
Please Improve this article if you find anything incorrect by clicking on the "Improve Article" button below.