
Feature Selection with the Caret R Package

Last Updated : 21 Aug, 2023

Caret (Classification And REgression Training) is an R package that provides a unified interface for machine learning tasks such as data preprocessing, model training, and performance evaluation.

  • One of the tasks Caret can help with is feature selection, which involves selecting a subset of relevant features from a larger set of potential predictors.
  • The Caret R package is a popular machine learning package that provides a streamlined interface for building and tuning predictive models.
  • An important aspect of building predictive models is feature selection, which involves choosing the subset of available features that are most relevant to the target variable.
  • Caret provides several functions to help with feature selection, including feature importance rankings (see the varImp() sketch after this list) and wrapper methods that evaluate subsets of features.
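
Caret's varImp() function produces such a model-based importance ranking. Below is a minimal sketch; the random forest fit here is purely illustrative and is not part of the original walkthrough:

R

library(caret)
data(iris)

# Fit a quick random forest and rank the predictors by importance
fit <- train(Species ~ ., data = iris, method = "rf")
varImp(fit)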

Steps:

1. Load the required packages and data:

Before performing feature selection, we need to load the Caret package and the dataset we want to work with.

Example:

R

library(caret)
data(iris)


2. Split the data into training and testing sets:

It is important to split the data into training and testing sets so that model performance can be evaluated on data the model has not seen, which guards against overfitting. You can use the createDataPartition function from the Caret package to split the data randomly.

Example:

R

set.seed(123)
trainIndex <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train <- iris[trainIndex,]
test <- iris[-trainIndex,]


The set.seed(123) call fixes the random number generator seed so that the partitioning is reproducible.

Three arguments are passed to the createDataPartition() function: the variable to partition (iris$Species), the proportion of data to include in the training set (p = 0.7), and whether to return the partition as a list (list = FALSE).

The resulting trainIndex variable contains the indices of the rows of iris that belong to the training set; the remaining rows form the testing set stored in test.
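
Because createDataPartition samples within each level of iris$Species, the split is stratified, so the class proportions are preserved in both sets. A quick sanity check (a small sketch, not part of the original example):

R

# Each of the three species contributes 35 rows to the training set (70% of 50)
table(train$Species)
prop.table(table(train$Species))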

3. Define the feature selection method and control parameters:

Caret supports several feature selection methods, such as recursive feature elimination (RFE), genetic algorithms (GA), and Boruta, and you can define the control parameters for each method. For example, to use RFE with random forest (RF) as the base model and cross-validation (CV) for performance evaluation, we can define the control parameters as follows:

R

control <- rfeControl(functions = rfFuncs, method = "cv", number = 10)


In this example, we use the rfFuncs function set, which contains RF-specific feature selection functions, and we set the method to “cv” to use CV for performance evaluation with 10 folds.
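
Caret ships several other built-in function sets that can be swapped in the same way, for example treebagFuncs (bagged trees) or caretFuncs (which wraps any model available through train()). A brief sketch:

R

control_bag   <- rfeControl(functions = treebagFuncs, method = "cv", number = 10)
control_caret <- rfeControl(functions = caretFuncs, method = "cv", number = 10)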

4. Define the model training method and control parameters:

We also need to define the model training method and control parameters for the final model. For example, to evaluate the final RF model with 10-fold CV, define the control parameters as follows (RF-specific options such as the number of trees, e.g. ntree = 500, are passed to train() itself in step 7):
R

fitControl <- trainControl(method = "cv", number = 10, verboseIter = FALSE)



5. Train the feature selection model:

After defining the feature selection method and control parameters, we can train the feature selection model using the rfe function. For example, to run RFE over candidate subset sizes 1 to 4, you can train the model as follows:

R

rfe_model <- rfe(train[, 1:4], train$Species, sizes = 1:4, rfeControl = control)


In this example, we use the first four columns of the training set as predictors (train[, 1:4]) and the Species column as the outcome (train$Species). We also set the sizes parameter to evaluate feature subsets of sizes 1 to 4.

6. Inspect the results:

R

print(rfe_model)


This prints a summary of the feature selection results, including the optimal number of features and the names of the features selected.
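
The selected variable names and the performance profile across subset sizes can also be inspected directly; a short sketch using caret's helper functions:

R

# Names of the variables in the winning subset
predictors(rfe_model)

# Resampled accuracy versus number of features
plot(rfe_model, type = c("g", "o"))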

7. Use the selected features to train a model on the training data:

R

# Tuning grid for the random forest's mtry parameter: try every value from
# 1 up to the number of selected features (an illustrative choice)
rfParams <- expand.grid(mtry = seq_along(rfe_model$optVariables))
rfModel <- train(train[, rfe_model$optVariables], train$Species,
                 method = "rf", trControl = fitControl,
                 tuneGrid = rfParams, ntree = 500)


The rfe_model$optVariables element contains the names of the optimal feature subset selected by RFE, so the final model is trained only on those columns.

8. Evaluate the performance of the model on the test data:

R

predictions <- predict(rfModel, test[, rfe_model$optVariables])
confusionMatrix(predictions, test$Species)


This will generate a confusion matrix that summarizes the performance of the model on the test data.
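
Individual metrics can also be pulled out of the object returned by confusionMatrix; a small sketch:

R

cm <- confusionMatrix(predictions, test$Species)
cm$overall["Accuracy"]          # overall accuracy on the test set
cm$byClass[, "Sensitivity"]     # per-class sensitivity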

A complete example using correlation-based feature selection:

R

library(caret)
data(iris)

# Split data into training and testing sets
set.seed(123)
trainIndex <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train <- iris[trainIndex, ]
test <- iris[-trainIndex, ]

# Flag highly correlated predictors (findCorrelation returns the
# indices of columns it recommends removing)
corProfile <- findCorrelation(cor(train[, -5]), cutoff = 0.75, names = FALSE)

# Print the flagged features
train[, -5][, corProfile]

# Train a random forest on the flagged features and evaluate on test data
rfFit <- train(train[, corProfile], train$Species, method = "rf")
predictions <- predict(rfFit, test[, corProfile])
confusionMatrix(predictions, test$Species)


Output (the flagged columns of the training set, truncated):

    Petal.Length Petal.Width
3            1.3         0.2
4            1.5         0.2
5            1.4         0.2
7            1.4         0.3
8            1.5         0.2
...
147          5.0         1.9
148          5.2         2.0

The confusionMatrix() call then prints the confusion matrix and per-class statistics for the test set.
This example uses correlation-based feature selection with a random forest classifier. We split the iris dataset into training and testing sets, then use the findCorrelation function with a cutoff of 0.75 to flag highly correlated features. Note that findCorrelation returns the indices of columns it recommends removing to reduce pairwise correlations; this example uses those flagged columns directly as the feature subset to train a random forest classifier and make predictions on the test data.
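
As a variant, you may prefer to drop the flagged columns rather than keep them, which is the more common reading of findCorrelation's output; a hedged sketch building on the objects above:

R

# Keep only the predictors that findCorrelation did NOT flag
predictorNames <- names(train)[-5]
keep <- setdiff(predictorNames, predictorNames[corProfile])
rfFit2 <- train(train[, keep], train$Species, method = "rf")
confusionMatrix(predict(rfFit2, test[, keep]), test$Species)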


