Calculate ROC AUC for Classification Algorithm Such as Random Forest

Last Updated : 21 Mar, 2023

Receiver Operating Characteristic (ROC) curves and the Area Under the Curve (AUC) are popular evaluation metrics for classification algorithms. In this article, we will discuss how to calculate the ROC AUC for a Random Forest classifier.

ROC AUC is a metric that quantifies the ability of a binary classifier to distinguish between positive and negative classes. The ROC curve is a graph of the true positive rate (TPR) against the false positive rate (FPR) for different classification thresholds. TPR is the ratio of true positives to the total number of positive examples, while FPR is the ratio of false positives to the total number of negative examples. The AUC is the area under the ROC curve, ranging from 0.0 to 1.0, with a higher value indicating better classifier performance.
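As a quick illustration of these two rates, the sketch below computes TPR and FPR from a small, made-up confusion matrix (the counts are hypothetical and not taken from the dataset used later in this article):

Python

# Hypothetical confusion-matrix counts, for illustration only
tp, fn = 40, 10   # positives: 40 correctly flagged, 10 missed
fp, tn = 5, 45    # negatives: 5 incorrectly flagged, 45 correctly rejected

tpr = tp / (tp + fn)   # true positive rate = 40 / 50 = 0.8
fpr = fp / (fp + tn)   # false positive rate = 5 / 50 = 0.1

print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")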

Explanation:

Step 1: Import required modules.

Python




from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt


Here we import the required modules: the RandomForestClassifier class from the sklearn.ensemble module and the roc_curve and roc_auc_score functions from the sklearn.metrics module. We also import the load_breast_cancer function from the sklearn.datasets module to load the breast cancer dataset, and the train_test_split function from the sklearn.model_selection module to split the dataset into training and test sets. Finally, we import the pyplot module from the matplotlib library to plot the ROC curve.

Step 2: Load and split the Breast cancer dataset.

Load the datasets and separate features and target values, then split the train and test datasets.

Python




df = load_breast_cancer(as_frame=True)
df = df.frame

# Separate the features and the target column
x = df.drop('target', axis=1)
y = df['target']

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3)


Step 3: Train a Random Forest classifier.

Python




# Train a Random Forest classifier
rf = RandomForestClassifier(n_estimators=5, max_depth=2)
rf.fit(X_train, y_train)


Here we train a Random Forest classifier using the RandomForestClassifier function with 5 estimators and a maximum depth of 2. We fit the classifier to the training data using the fit method.

Step 4: Get predicted class probabilities for the test set.

Python




# Get predicted class probabilities for the test set
y_pred_prob = rf.predict_proba(X_test)[:, 1]


Here we use the predict_proba method of the Random Forest classifier to obtain the predicted class probabilities for the test set. The method returns an array of shape (n_samples, n_classes), where n_samples is the number of samples in the test set, and n_classes is the number of classes in the problem. Since we’re using a binary classifier, n_classes is equal to 2, and we’re interested in the probability of the positive class, which is the second column of the array. Therefore, we use the [:, 1] indexing to obtain a one-dimensional array of positive class probabilities.
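As an optional sanity check (not part of the original walkthrough), you can inspect the full probability array returned by predict_proba and confirm which column corresponds to the positive class:

Python

# Optional check: shape and column order of the probability array
proba = rf.predict_proba(X_test)
print(proba.shape)   # (n_samples, 2) for this binary problem
print(proba[:3])     # each row sums to 1
print(rf.classes_)   # column order follows classes_, so [:, 1] is the class labelled 1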

Step 5: Compute the false positive rate (FPR) and true positive rate (TPR) for different classification thresholds.

Python




# Compute the false positive rate (FPR) 
# and true positive rate (TPR) for different classification thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob, pos_label=1)


Here we use the roc_curve function from the sklearn.metrics module to compute the false positive rate (FPR) and true positive rate (TPR) for different classification thresholds. The function takes as input the true labels of the test set (y_test) and the predicted class probabilities of the positive class (y_pred_prob). It returns three arrays: fpr, which contains the FPR values for different thresholds; tpr, which contains the TPR values for different thresholds; and thresholds, which contains the threshold values.
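To make these three arrays more concrete, the optional sketch below prints a few (threshold, FPR, TPR) triples and re-derives one point on the curve by hand (this is only a verification aid, not part of the original steps):

Python

import numpy as np

# A few (threshold, FPR, TPR) triples returned by roc_curve
for t, f, r in list(zip(thresholds, fpr, tpr))[:5]:
    print(f"threshold={t:.3f}  FPR={f:.3f}  TPR={r:.3f}")

# Re-derive one point on the curve by classifying at a single threshold
y_true = np.asarray(y_test).ravel()
t = thresholds[len(thresholds) // 2]            # pick an arbitrary threshold
y_pred = (y_pred_prob >= t).astype(int)         # positive if probability >= threshold
manual_tpr = ((y_pred == 1) & (y_true == 1)).sum() / (y_true == 1).sum()
manual_fpr = ((y_pred == 1) & (y_true == 0)).sum() / (y_true == 0).sum()
print(f"manual FPR={manual_fpr:.3f}, manual TPR={manual_tpr:.3f}")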

Step 6: Compute the ROC AUC score.

Python




# Compute the ROC AUC score
roc_auc = roc_auc_score(y_test, y_pred_prob)
roc_auc


Output:

0.9787264420331239

Here we use the roc_auc_score function from the sklearn.metrics module to compute the ROC AUC score. The function takes as input the true labels of the test set (y_test) and the predicted class probabilities of the positive class (y_pred_prob). It returns a scalar value representing the area under the ROC curve.
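Since the AUC is literally the area under the (FPR, TPR) points computed in Step 5, you can cross-check the score with sklearn's trapezoidal integration helper (an optional verification, not part of the original walkthrough):

Python

from sklearn.metrics import auc

# auc() integrates the curve with the trapezoidal rule,
# so this should match roc_auc_score up to floating-point error
print(auc(fpr, tpr))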

Step 7: Plot the ROC curve.

Python




# Plot the ROC curve
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
# roc curve for tpr = fpr 
plt.plot([0, 1], [0, 1], 'k--', label='Random classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()


Output:

ROC Curve for Breast Cancer

Here we use the plot function of the pyplot module to plot the ROC curve. We pass the FPR values on the x-axis and the TPR values on the y-axis. We also add a label to the plot with the ROC AUC score as area. We plot a dashed line to represent a random classifier, which has an ROC curve of a straight line from (0,0) to (1,1). We add axis labels and a title to the plot, and a legend showing the ROC AUC score and the random classifier line.

Plot explanation:
The ROC curve is a plot of the true positive rate (TPR) on the y-axis against the false positive rate (FPR) on the x-axis, for different classification thresholds. The ROC curve shows how well the classifier can distinguish between positive and negative classes for different threshold values. A perfect classifier would have a TPR of 1 and an FPR of 0, which corresponds to the top-left corner of the plot. On the other hand, a random classifier would have an ROC curve of a straight line from (0,0) to (1,1), which is the dashed line in the plot. The closer the ROC curve is to the top-left corner, the better the classifier performs.

The ROC curve can be used to choose the best threshold for the classifier, depending on the trade-off between TPR and FPR. A threshold closer to 1 produces fewer positive predictions, giving a lower FPR but also a lower TPR, while a threshold closer to 0 produces more positive predictions, giving a higher TPR at the cost of a higher FPR.
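One common way to pick such a threshold (shown here only as an illustration; it is not part of the original article) is Youden's J statistic, which selects the threshold that maximizes TPR - FPR:

Python

import numpy as np

# Youden's J statistic: the threshold where TPR - FPR is largest
j_scores = tpr - fpr
best_idx = np.argmax(j_scores)
best_threshold = thresholds[best_idx]
print(f"Best threshold by Youden's J: {best_threshold:.3f} "
      f"(TPR={tpr[best_idx]:.3f}, FPR={fpr[best_idx]:.3f})")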

Step 8: Plot the predicted class probabilities.

Python




# Plot the predicted class probabilities
plt.hist(y_pred_prob, bins=10)
plt.xlim(0, 1)
plt.title('Histogram of predicted probabilities')
plt.xlabel('Predicted probability of the positive class')
plt.ylabel('Frequency')
plt.show()


Output:

Predicted Probabilities

Here we use the hist function of the pyplot module to plot a histogram of the predicted class probabilities for the positive class. We pass the predicted class probabilities and the number of bins as arguments to the function. We also set the x-axis limits, add axis labels and a title to the plot, and show the plot.

ROC Curve for Multi-Class Classification

Here we use the iris dataset from sklearn.datasets, which has 3 classes. Since the ROC curve is defined for binary classification, we use OneVsRestClassifier from sklearn.multiclass with a Random Forest as the base classifier, and then plot one ROC curve per class (each class vs the rest).

Python




from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
  
  
# Load the iris dataset
iris = load_iris()
  
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(iris.data,
                                                    iris.target,
                                                    test_size=0.5,
                                                    random_state=23)
  
# Train a Random Forest classifier
clf = OneVsRestClassifier(RandomForestClassifier())
  
# fit model
clf.fit(X_train, y_train)
  
# Get predicted class probabilities for the test set
y_pred_prob = clf.predict_proba(X_test)
  
# Compute the ROC AUC score
roc_auc = roc_auc_score(y_test, y_pred_prob, multi_class='ovr')
print('ROC AUC Score :',roc_auc)
  
# roc curve for Multi classes
colors = ['orange','red','green']
for i in range(len(iris.target_names)):    
    fpr, tpr, thresh = roc_curve(y_test, y_pred_prob[:,i], pos_label=i)
    plt.plot(fpr, tpr, linestyle='--',color=colors[i], label=iris.target_names[i]+' vs Rest')
# roc curve for tpr = fpr 
plt.plot([0, 1], [0, 1], 'k--', label='Random classifier')
plt.title('Multiclass (Iris) ROC curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive rate')
plt.legend()
plt.show()


Output:

ROC AUC Score : 0.9795855072463767
ROC Curve for OneVsRest Multiclass Classification

Conclusion:

In conclusion, calculating the ROC AUC score for a Random Forest classifier is a straightforward process in Python. The sklearn.metrics module provides functions for computing the ROC curve, the ROC AUC score, and the PR curve. The ROC curve and the PR curve are useful tools for evaluating the performance of binary classifiers, and they can help to choose the best threshold for the classifier based on the trade-off between different evaluation metrics.

The PR (precision-recall) curve is another evaluation metric for binary classification problems. The PR curve is a plot of the precision (y-axis) against the recall (x-axis), for different classification thresholds. Precision is defined as the number of true positives divided by the number of true positives plus false positives, while recall is defined as the number of true positives divided by the number of true positives plus false negatives. The PR curve shows how well the classifier can predict the positive class while minimizing the false positives.
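As an illustration (not part of the original walkthrough), the PR curve for the binary breast-cancer model can be plotted with precision_recall_curve and average_precision_score from sklearn.metrics. Note that the multiclass example above reassigns y_test and y_pred_prob, so this assumes the binary variables from Steps 2-4 are still in scope:

Python

from sklearn.metrics import precision_recall_curve, average_precision_score

# Precision/recall pairs for different thresholds, using the
# breast-cancer y_test and y_pred_prob from Steps 2-4
precision, recall, pr_thresholds = precision_recall_curve(y_test, y_pred_prob)
ap = average_precision_score(y_test, y_pred_prob)

plt.plot(recall, precision, label='PR curve (AP = %0.2f)' % ap)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc="lower left")
plt.show()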

Compared to the ROC curve, the PR curve is more suitable for imbalanced datasets, where the number of samples in the positive class is much smaller than the number of samples in the negative class. The PR curve is also useful when the cost of false positives and false negatives is different, as it can help to choose the best threshold for the classifier based on the precision-recall trade-off.

It is important to note that the ROC AUC should not be the only metric used to evaluate the performance of a classifier. Other metrics such as precision, recall, and F1 score may also be useful depending on the specific requirements of your application. Additionally, it’s important to consider the overall distribution of positive and negative examples in your data and the potential impact of imbalanced classes on your evaluation metrics.
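For completeness, these threshold-dependent metrics can be computed from the hard predictions of the binary Random Forest model (again assuming the breast-cancer rf, X_test and y_test from the earlier steps are still in scope; this snippet is illustrative only):

Python

from sklearn.metrics import precision_score, recall_score, f1_score

# Hard class predictions (for a binary forest this corresponds
# to a 0.5 probability threshold)
y_pred = rf.predict(X_test)

print('Precision:', precision_score(y_test, y_pred))
print('Recall   :', recall_score(y_test, y_pred))
print('F1 score :', f1_score(y_test, y_pred))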


