
Random Forest Classifier using Scikit-learn

In this article, we will build a Random Forest Classifier with Python's Scikit-Learn library, using the well-known Iris dataset as a working example.

Random Forest

Random Forest, or Random Decision Forest, is a supervised machine learning algorithm that uses decision trees for classification, regression, and other tasks. Random Forests are particularly well-suited for handling large and complex datasets, dealing with high-dimensional feature spaces, and providing insights into feature importance. The algorithm's ability to maintain high predictive accuracy while minimizing overfitting makes it a popular choice across domains such as finance, healthcare, and image analysis.



Random Forest Classifier

The Random Forest classifier creates a set of decision trees, each trained on a randomly selected subset of the training set, and then collects the votes from the individual trees to decide the final prediction.


Additionally, the random forest classifier can handle both classification and regression tasks, and its ability to provide feature importance scores makes it a valuable tool for understanding the significance of different variables in the dataset.



How Random Forest Classification works

Random Forest Classification is an ensemble learning technique designed to enhance the accuracy and robustness of classification tasks. The algorithm builds a multitude of decision trees during training and outputs the class that is the mode of the classes predicted by the individual trees. Each decision tree in the random forest is constructed using a subset of the training data and a random subset of features, introducing diversity among the trees and making the model more robust and less prone to overfitting.

The random forest algorithm employs a technique called bagging (Bootstrap Aggregating) to create these diverse subsets.
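As a rough sketch of what bagging does (a toy example with NumPy; the array names here are illustrative, not part of scikit-learn's API):

import numpy as np

rng = np.random.default_rng(0)
X = np.arange(10).reshape(10, 1)  # toy feature matrix with 10 samples
y = np.arange(10)                 # toy labels

# a bootstrap sample draws n indices with replacement;
# each tree in the forest is trained on a different such sample
idx = rng.integers(0, len(X), size=len(X))
X_boot, y_boot = X[idx], y[idx]
print(idx)  # some indices repeat, others are left out ("out-of-bag" samples)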

During the training phase, each tree is built by recursively partitioning the data based on the features. At each split, the algorithm selects the best feature from the random subset, optimizing for information gain or Gini impurity. The process continues until a predefined stopping criterion is met, such as reaching a maximum depth or having a minimum number of samples in each leaf node.
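For intuition, the Gini impurity of a node is 1 - sum(p_k^2), where p_k is the fraction of samples of class k in the node; 0 means the node is pure. A quick hand computation in plain Python (illustrative only, not scikit-learn's internal code):

# Gini impurity of a node: 1 - sum over classes of p_k squared
def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

print(gini([0, 0, 0, 0]))  # 0.0   -> pure node
print(gini([0, 0, 1, 1]))  # 0.5   -> maximally mixed for two classes
print(gini([0, 0, 0, 1]))  # 0.375 -> mostly pure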

Once the random forest is trained, it makes predictions by letting each tree "vote" for a class; the class with the most votes becomes the predicted class for the input data.
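Conceptually, the voting step looks like the following sketch (the per-tree predictions here are made up for illustration); note that scikit-learn's RandomForestClassifier actually averages the trees' predicted class probabilities rather than counting hard votes, but the effect is similar:

from collections import Counter

# hypothetical class predictions from five trees for one input sample
tree_votes = [2, 1, 2, 2, 0]
predicted_class = Counter(tree_votes).most_common(1)[0][0]
print(predicted_class)  # 2, since it receives three of the five votes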

Feature Selection in Random Forests

Feature selection in Random Forests is inherently embedded in the construction of individual decision trees and the aggregation process.

During the training phase, each decision tree is built using a random subset of features, contributing to diversity among the trees. This process, known as feature bagging, helps prevent the dominance of any single feature and promotes a more robust model.

The algorithm evaluates various subsets of features at each split point, selecting the best feature for node splitting based on criteria such as information gain or Gini impurity. Consequently, Random Forests naturally incorporate a form of feature selection, ensuring that the ensemble benefits from a diverse set of features to enhance generalization and reduce overfitting.
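In scikit-learn, this per-split feature sampling is controlled by the max_features parameter of RandomForestClassifier; for classification it defaults to the square root of the number of features:

from sklearn.ensemble import RandomForestClassifier

# consider only sqrt(n_features) candidate features at each split
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt")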

Random Forest Classifier in Machine Learning

Step 1: Loading the dataset

# importing required libraries
# importing Scikit-learn library and datasets package
from sklearn import datasets
 
# Loading the iris plants dataset (classification)
iris = datasets.load_iris()

Step 2: Checking the dataset content and the feature names present in it.

print(iris.target_names)

Output:

['setosa' 'versicolor' 'virginica']

print(iris.feature_names)

Output:

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Step 3: Train Test Split

# dividing the dataset into two parts, i.e. training and test datasets
X, y = datasets.load_iris(return_X_y=True)

# splitting arrays or matrices into random train and test subsets
from sklearn.model_selection import train_test_split

# i.e. 70 % training dataset and 30 % test dataset
# (pass random_state=<int> to train_test_split for a reproducible split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

Step 4: Importing the Random Forest Classifier module and creating a DataFrame of the dataset.

# importing random forest classifier from ensemble module
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
# creating dataframe of IRIS dataset
data = pd.DataFrame({'sepallength': iris.data[:, 0], 'sepalwidth': iris.data[:, 1],
                     'petallength': iris.data[:, 2], 'petalwidth': iris.data[:, 3],
                     'species': iris.target})

Overview of the Dataset

# printing the first five rows of the dataset
print(data.head())

Output:

   sepallength  sepalwidth  petallength  petalwidth  species
0          5.1         3.5          1.4         0.2        0
1          4.9         3.0          1.4         0.2        0
2          4.7         3.2          1.3         0.2        0
3          4.6         3.1          1.5         0.2        0
4          5.0         3.6          1.4         0.2        0

Step 5: Training the Model

# creating a RF classifier
clf = RandomForestClassifier(n_estimators=100)

# training the model on the training dataset;
# the fit function trains the model using the training sets as parameters
clf.fit(X_train, y_train)

# performing predictions on the test dataset
y_pred = clf.predict(X_test)

# the metrics module is used to measure accuracy or error
from sklearn import metrics

# using the metrics module for accuracy calculation
print("ACCURACY OF THE MODEL:", metrics.accuracy_score(y_test, y_pred))

Output:

ACCURACY OF THE MODEL: 0.9238095238095239
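Since train_test_split above is random, the accuracy will vary from run to run. For a more stable estimate, cross-validation can be used; a minimal sketch reusing the X and y arrays from Step 3:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy on the full dataset
scores = cross_val_score(RandomForestClassifier(n_estimators=100), X, y, cv=5)
print("Mean CV accuracy:", scores.mean())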

Step 6: Predictions

# predicting which type of flower it is.
clf.predict([[3, 3, 2, 2]])

Output:

array([0])

This implies that it is the Setosa flower type, since the dataset contains three species or classes: Setosa, Versicolor, and Virginica.
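To print the species name directly instead of the class index, the predicted index can be used to look up iris.target_names:

# mapping the predicted class index back to the species name
prediction = clf.predict([[3, 3, 2, 2]])
print(iris.target_names[prediction[0]])  # 'setosa'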

Check the important features

Now we will also identify the important features in the Iris dataset using the following lines of code.

# using the feature_importances_ attribute of the trained classifier
import pandas as pd
feature_imp = pd.Series(clf.feature_importances_, index=iris.feature_names).sort_values(ascending=False)
feature_imp

Output:

petal length (cm)    0.440050
petal width (cm)     0.423437
sepal length (cm)    0.103293
sepal width (cm)     0.033220
dtype: float64
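These scores can also be visualized as a bar chart; a minimal sketch, assuming matplotlib is installed:

import matplotlib.pyplot as plt

# horizontal bar chart of the feature importance scores
feature_imp.plot(kind='barh')
plt.xlabel('Feature importance score')
plt.title('Feature importances in the Random Forest model')
plt.tight_layout()
plt.show()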

Random Forests in Python’s Scikit-Learn library come with a set of hyperparameters that allow you to fine-tune the behavior of the model. Understanding and selecting appropriate hyperparameters is crucial for optimizing model performance.
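A common way to search over these hyperparameters is GridSearchCV; the sketch below uses an illustrative grid, not a recommended one:

from sklearn.model_selection import GridSearchCV

# illustrative hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5],
}
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)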

Random Forest Classifier Parameters
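Some of the most commonly tuned parameters of sklearn.ensemble.RandomForestClassifier are:

  • n_estimators: the number of trees in the forest (default 100).
  • criterion: the split quality measure, "gini" (default) or "entropy".
  • max_depth: the maximum depth of each tree; None (default) grows trees until leaves are pure or too small to split.
  • min_samples_split: the minimum number of samples required to split an internal node (default 2).
  • min_samples_leaf: the minimum number of samples required at a leaf node (default 1).
  • max_features: the number of features considered at each split ("sqrt" by default for classification).
  • bootstrap: whether bootstrap samples are used when building trees (default True).
  • random_state: seed for reproducible results.
  • n_jobs: the number of CPU cores used for fitting and prediction (-1 uses all cores).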

Advantages of Random Forest Classifier
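  • Reduces overfitting relative to a single decision tree by averaging many de-correlated trees.
  • Handles both classification and regression tasks.
  • Works well with high-dimensional data and provides feature importance scores out of the box.
  • Robust to noise and outliers, and requires little feature scaling or preprocessing.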

Disadvantages of Random Forest Classifier
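  • Slower to train and predict than a single decision tree, especially with many trees.
  • The resulting ensemble is much harder to interpret than an individual tree.
  • Large forests can consume significant memory.
  • Tends to perform worse than specialized models on very sparse, high-dimensional data such as text.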

Conclusion

In conclusion, Random Forests, with their ensemble of decision trees, stand out as a robust solution for various machine learning tasks, showcasing their versatility and effectiveness.

Frequently Asked Questions (FAQs)

Q. What is the random forest classifier?

Random Forest Classifier is an ensemble learning method using multiple decision trees for classification tasks, improving accuracy. It excels in handling complex data, mitigating overfitting, and providing robust predictions with feature importance.

Q. Can random forest be used for regression?

Random Forest can be used for both regression and classification tasks, making it a versatile machine learning algorithm.
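For regression, scikit-learn provides RandomForestRegressor with an analogous interface; a minimal sketch on synthetic data (for illustration only):

from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

# synthetic regression data
X_reg, y_reg = make_regression(n_samples=100, n_features=4, random_state=0)
reg = RandomForestRegressor(n_estimators=100).fit(X_reg, y_reg)
print(reg.predict(X_reg[:1]))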

Q. What is the principle of random forest?

Random Forest builds multiple decision trees using random subsets of the dataset and combines their outputs for improved accuracy.

Q. What are the applications of random forest?

Real applications include:

  • Medical Diagnosis: Identifying diseases based on patient data.
  • Finance: Credit scoring for risk assessment in lending.
