Random Forest Classifier using Scikit-learn
In this article, we will see how to build a Random Forest Classifier using Python's Scikit-learn library. To do this, we use the Iris dataset, which is a common and well-known dataset. Random Forest (or Random Decision Forest) is a supervised machine learning algorithm built on decision trees and used for classification, regression, and other tasks.
The Random Forest classifier creates a set of decision trees, each trained on a randomly selected subset of the training set, and then collects the votes from the different decision trees to decide the final prediction, as sketched below.
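To make the voting idea concrete, here is a minimal, purely illustrative sketch that trains a few decision trees on bootstrap samples of the Iris data and takes a majority vote. It is not how the article's model is built, and note that scikit-learn's RandomForestClassifier actually averages class probabilities internally rather than taking a hard vote.

# Illustrative sketch only: hand-rolled "forest" of decision trees with a majority vote
import numpy as np
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)
rng = np.random.RandomState(0)

trees = []
for _ in range(5):
    # bootstrap sample: draw rows with replacement from the training data
    idx = rng.randint(0, len(X), len(X))
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# each tree votes on a sample; the majority class wins
sample = X[50].reshape(1, -1)
votes = [tree.predict(sample)[0] for tree in trees]
print("votes:", votes, "-> majority:", np.bincount(votes).argmax())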
In this classification example, we will use the Iris flower dataset to train and test the model. We will build a model to classify the type of flower.
# importing required libraries
# importing the Scikit-learn library and the datasets package
from sklearn import datasets

# Loading the iris plants dataset (classification)
iris = datasets.load_iris()
Code: checking our dataset content and the feature names present in it.
print(iris.target_names)
Output:
['setosa' 'versicolor' 'virginica']
Code:
print(iris.feature_names)
Output:
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Code:
# dividing the dataset into two parts, i.e. training dataset and test dataset
X, y = datasets.load_iris(return_X_y=True)

# Splitting arrays or matrices into random train and test subsets
from sklearn.model_selection import train_test_split

# i.e. 70 % training dataset and 30 % test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
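As an optional variation (not part of the original snippet), passing random_state makes the split reproducible and stratify keeps the class proportions the same in both subsets:

# optional: reproducible, stratified split (random_state and stratify are assumptions,
# not used in the original code)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)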
Code: Importing required libraries and random forest classifier module.
# importing the random forest classifier from the ensemble module
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# creating a dataframe of the IRIS dataset
data = pd.DataFrame({'sepallength': iris.data[:, 0],
                     'sepalwidth': iris.data[:, 1],
                     'petallength': iris.data[:, 2],
                     'petalwidth': iris.data[:, 3],
                     'species': iris.target})
Code: Looking at the dataset
# printing the top 5 rows of the iris dataset
print(data.head())
Output:
   sepallength  sepalwidth  petallength  petalwidth  species
0          5.1         3.5          1.4         0.2        0
1          4.9         3.0          1.4         0.2        0
2          4.7         3.2          1.3         0.2        0
3          4.6         3.1          1.5         0.2        0
4          5.0         3.6          1.4         0.2        0
Code:
# creating an RF classifier
clf = RandomForestClassifier(n_estimators=100)

# Training the model on the training dataset
# the fit function is used to train the model using the training sets as parameters
clf.fit(X_train, y_train)

# performing predictions on the test dataset
y_pred = clf.predict(X_test)

# the metrics module is used to find accuracy or error
from sklearn import metrics
print()

# using the metrics module for accuracy calculation
print("ACCURACY OF THE MODEL:", metrics.accuracy_score(y_test, y_pred))
Output:
ACCURACY OF THE MODEL: 0.9238095238095239
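Accuracy alone hides which classes are confused with each other. As a possible follow-up (not part of the original article), a confusion matrix and per-class report from the same metrics module show where the remaining misclassifications occur:

# optional: confusion matrix and per-class precision/recall for the same predictions
from sklearn.metrics import confusion_matrix, classification_report

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=iris.target_names))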
Code: predicting the type of flower from the data set
# predicting which type of flower it is
clf.predict([[3, 3, 2, 2]])
Output:
array([0])
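The prediction is returned as a numeric class index. As a small optional sketch (not part of the original code), iris.target_names can translate that index back into a species name:

# optional: map the numeric prediction back to the species name
predicted_class = clf.predict([[3, 3, 2, 2]])[0]
print(iris.target_names[predicted_class])   # -> 'setosa'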
This implies that it is the setosa flower type, as we have three species or classes in our dataset: Setosa, Versicolor, and Virginica. Now we will also find the important features in the IRIS dataset by using the following lines of code.
Code:
# importing the random forest classifier from the ensemble module
from sklearn.ensemble import RandomForestClassifier

# Create a Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100)

# Train the model using the training sets
clf.fit(X_train, y_train)
Code: Calculating feature importance
# using the feature_importances_ attribute of the trained classifier
import pandas as pd

feature_imp = pd.Series(clf.feature_importances_,
                        index=iris.feature_names).sort_values(ascending=False)
feature_imp
Output:
petal width (cm)     0.458607
petal length (cm)    0.413859
sepal length (cm)    0.103600
sepal width (cm)     0.023933
dtype: float64
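As an optional extra step (assuming matplotlib is installed), the same Series can be plotted as a bar chart to compare the feature importances visually:

# optional visualization of the feature importances computed above
import matplotlib.pyplot as plt

feature_imp.plot(kind="barh")
plt.xlabel("Feature importance score")
plt.title("Feature importances in the Iris random forest")
plt.tight_layout()
plt.show()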