Random Forest Classifier using Scikit-learn

In this article, we will build a Random Forest Classifier using Python's Scikit-learn library, with the well-known Iris dataset as the example. Random forest (or random decision forest) is a supervised machine learning algorithm, built on decision trees, that is used for classification, regression, and other tasks.
A random forest classifier builds a set of decision trees (DT), each from a randomly selected subset of the training set. It then collects the votes from the different decision trees to decide the final prediction.
We will use the Iris flower dataset to train and test a model that classifies the type of flower.
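The idea of training trees on random subsets and letting them vote can be sketched by hand. This is a minimal illustration under stated assumptions, not scikit-learn's actual implementation: the number of trees (25), the seed, and all variable names are chosen only for this example. Note also that scikit-learn's RandomForestClassifier combines trees by averaging their predicted class probabilities rather than by counting hard votes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only

# Train each tree on a bootstrap sample (rows drawn with replacement)
trees = []
for i in range(25):
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X[idx], y[idx]))

# One row of predictions per tree: shape (n_trees, n_samples)
all_votes = np.array([t.predict(X) for t in trees])

# Majority vote per sample: count the votes for each class, keep the winner
forest_pred = np.array(
    [np.bincount(col, minlength=3).argmax() for col in all_votes.T]
)

print("Agreement with the true labels:", (forest_pred == y).mean())
```

The sketch scores the ensemble on the same rows it was trained on, so the printed figure is optimistic. It is only meant to show the voting mechanics.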

Code: Loading dataset

# importing required libraries 
# importing Scikit-learn library and datasets package
from sklearn import datasets  
  
# Loading the iris plants dataset (classification)
iris = datasets.load_iris()    

Code: Checking the target names and feature names present in the dataset.

print(iris.target_names)

Output:

['setosa' 'versicolor' 'virginica']

Code:

print(iris.feature_names)

Output:

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
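Beyond the names, it can help to check how much data there is and how it is distributed across the classes. The following is a small sketch using only attributes of the object returned by load_iris; the loop variable names are illustrative.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# 150 samples, each described by 4 numeric features
print(iris.data.shape)  # (150, 4)

# Number of samples per class, labelled with the species names
class_counts = np.bincount(iris.target)
for name, count in zip(iris.target_names, class_counts):
    print(name, count)  # each of the three species has 50 samples
```

The classes are perfectly balanced, which is one reason plain accuracy is a reasonable metric for this dataset.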

Code:

# separating the dataset into the feature matrix X and the target labels y
X, y = datasets.load_iris(return_X_y=True)

# Splitting arrays or matrices into random train and test subsets
from sklearn.model_selection import train_test_split
# test_size=0.70 holds out 70 % of the rows for testing and leaves 30 % for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.70)
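Because train_test_split shuffles the rows randomly, each run produces a different split and therefore a slightly different accuracy. One way to make the split repeatable is sketched below. random_state and stratify are existing parameters of train_test_split, but the seed value 42 is an arbitrary assumption.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# random_state fixes the shuffle; stratify=y keeps the class proportions
# the same in the training and test subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.70, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (45, 4) (105, 4)
print(np.bincount(y_train))         # [15 15 15]
```

With only 45 training rows, stratifying guards against a split in which one species is barely represented in the training data.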

Code: Importing required libraries and random forest classifier module.

# importing random forest classifier from the ensemble module
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
# creating a DataFrame of the Iris dataset
data = pd.DataFrame({'sepallength': iris.data[:, 0], 'sepalwidth': iris.data[:, 1],
                     'petallength': iris.data[:, 2], 'petalwidth': iris.data[:, 3],
                     'species': iris.target})

Code: Looking at the dataset

# printing the first 5 rows of the Iris DataFrame
print(data.head())

Output:

   sepallength  sepalwidth  petallength  petalwidth  species
0          5.1         3.5          1.4         0.2        0
1          4.9         3.0          1.4         0.2        0
2          4.7         3.2          1.3         0.2        0
3          4.6         3.1          1.5         0.2        0
4          5.0         3.6          1.4         0.2        0
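The species column holds integer codes, which are harder to read than names. As a hedged sketch, the codes can be translated by indexing iris.target_names with them. The column name species_name is an assumption made for this example, and the DataFrame here uses the dataset's own feature names as column headers.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Build the DataFrame directly from the feature matrix
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data["species"] = iris.target

# Indexing the array of names with the integer codes gives one name per row
data["species_name"] = iris.target_names[iris.target]

print(data[["species", "species_name"]].drop_duplicates())
```

Printing the distinct pairs shows which code corresponds to which species.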

Code:

# creating a RF classifier
clf = RandomForestClassifier(n_estimators = 100)  
  
# Training the model on the training dataset
# fit function is used to train the model using the training sets as parameters
clf.fit(X_train, y_train)
  
# performing predictions on the test dataset
y_pred = clf.predict(X_test)
  
# metrics are used to find accuracy or error
from sklearn import metrics  
print()
  
# using metrics module for accuracy calculation
print("ACCURACY OF THE MODEL: ", metrics.accuracy_score(y_test, y_pred))

Output:

ACCURACY OF THE MODEL: 0.9238095238095239
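The exact accuracy changes from run to run, because both the split and the forest are random. A single accuracy figure also hides which classes get confused with each other. The sketch below uses a confusion matrix to show that. The seeds are arbitrary assumptions added for repeatability, so the numbers it prints will not match the output above.

```python
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.70, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are the true classes, columns are the predicted classes
cm = metrics.confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(cm)

# Accuracy is the share of predictions on the diagonal
accuracy = cm.trace() / cm.sum()
print(accuracy)
```

Off-diagonal entries count the misclassified samples, so the matrix shows at a glance which pairs of species the model mixes up.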

 

Code: predicting the type of flower from the data set

# predicting which type of flower it is.
clf.predict([[3, 3, 2, 2]])

Output:

array([0])

This implies that the flower is of the setosa type: the dataset has three species, or classes (setosa, versicolor, and virginica), encoded as 0, 1, and 2 respectively. Next, we will find out which features are the most important in the Iris dataset by using the following lines of code.
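To avoid reading the class code by eye, the predicted index can be looked up in iris.target_names, and predict_proba shows how confident the forest is. This is a minimal sketch. For brevity it trains on the full dataset with an arbitrary seed, which is an assumption of this example rather than the setup used in the article.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

iris = datasets.load_iris()
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(iris.data, iris.target)

sample = [[3, 3, 2, 2]]

# predict returns the class code; target_names turns it into a species name
pred = clf.predict(sample)
species = iris.target_names[pred[0]]
print(species)

# One probability per class, in the same order as iris.target_names
proba = clf.predict_proba(sample)[0]
for name, p in zip(iris.target_names, proba):
    print(name, round(p, 3))
```

The predicted class is the one with the highest averaged probability across the trees.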

Code:

# importing random forest classifier from the ensemble module
from sklearn.ensemble import RandomForestClassifier
# Create a Random forest Classifier
clf = RandomForestClassifier(n_estimators = 100)
  
# Train the model using the training sets
clf.fit(X_train, y_train)

Code: Calculating feature importance

# using the feature importance variable
import pandas as pd
feature_imp = pd.Series(clf.feature_importances_, index = iris.feature_names).sort_values(ascending = False)
feature_imp

Output:

petal width (cm)     0.458607
petal length (cm)    0.413859
sepal length (cm)    0.103600
sepal width (cm)     0.023933
dtype: float64
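The importances are normalized, so they sum to 1, and they are averages over the individual trees. One hedged way to judge how stable the ranking is: compute the spread of each feature's importance across the trees via clf.estimators_. The seed, the choice to train on the full dataset, and the column names in the summary table are assumptions of this sketch, so its numbers will differ from the output above.

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

iris = datasets.load_iris()
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(iris.data, iris.target)

feature_imp = pd.Series(clf.feature_importances_, index=iris.feature_names)

# Standard deviation of each feature's importance across the fitted trees
spread = pd.Series(
    np.std([tree.feature_importances_ for tree in clf.estimators_], axis=0),
    index=iris.feature_names,
)

summary = pd.DataFrame({"importance": feature_imp, "std_across_trees": spread})
print(summary.sort_values("importance", ascending=False))
print(feature_imp.sum())
```

A large spread relative to the importance itself suggests the ranking of that feature could change from one fitted forest to another.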


