
Confusion Matrix In R

In machine learning and statistical classification, the confusion matrix serves as a fundamental tool for evaluating the performance of a predictive model. It provides a concise summary of the classification results produced by a model, revealing the number of true positives, true negatives, false positives, and false negatives. In the R programming language, creating and interpreting a confusion matrix is straightforward, thanks to the various packages and functions designed for this purpose.

What is a Confusion Matrix?

A confusion matrix is a tabular representation of the performance of a classification model. It compares the labels predicted by a model with the actual labels from the dataset. The matrix is organized into rows and columns, with one axis for the actual class and the other for the predicted class; the orientation varies by tool (the caret package used below prints predictions in the rows and actual labels in the columns). The four essential components of a confusion matrix are:

  1. True Positive (TP): Instances that are correctly predicted as positive.
  2. True Negative (TN): Instances that are correctly predicted as negative.
  3. False Positive (FP): Negative instances that are incorrectly predicted as positive.
  4. False Negative (FN): Positive instances that are incorrectly predicted as negative.
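
To make these counts concrete, here is a minimal base-R sketch, using made-up label vectors, that tabulates predictions against actual labels and reads the four components off the resulting 2x2 table:

# Made-up binary labels: 1 = positive, 0 = negative
actual <- factor(c(1, 0, 1, 1, 0, 0, 1, 0), levels = c(0, 1))
predicted <- factor(c(1, 0, 0, 1, 0, 1, 1, 0), levels = c(0, 1))

# Cross-tabulate: rows = actual class, columns = predicted class
cm <- table(Actual = actual, Predicted = predicted)
print(cm)

# Read the four components off the table
TN <- cm["0", "0"]  # actual negative, predicted negative
FP <- cm["0", "1"]  # actual negative, predicted positive
FN <- cm["1", "0"]  # actual positive, predicted negative
TP <- cm["1", "1"]  # actual positive, predicted positive

# A few familiar metrics derived from the counts
accuracy <- (TP + TN) / sum(cm)
sensitivity <- TP / (TP + FN)  # recall / true positive rate
specificity <- TN / (TN + FP)  # true negative rate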

Creating a Confusion Matrix in R

R offers several packages for working with confusion matrices, including caret, MLmetrics, and yardstick. The examples in this article use the caret package; for comparison, the sketch below shows how the same task looks in yardstick.
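
As a minimal sketch of the yardstick alternative (assuming the yardstick package is installed), its conf_mat() function builds the matrix from a data frame holding the truth and estimate columns; the labels below are made up:

library(yardstick)

# Made-up labels collected in a data frame, as yardstick expects
results <- data.frame(
  truth = factor(c(1, 0, 1, 1, 0, 0, 1, 0)),
  estimate = factor(c(1, 0, 0, 1, 0, 1, 1, 0))
)

# Build and print the confusion matrix
conf_mat(results, truth = truth, estimate = estimate)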

Binary Classification

In this example, we'll use a simple binary classification scenario to create and interpret a confusion matrix.

# Load required libraries
library(caret)

# Generate example data; the predicted labels here match the actual
# labels exactly, so the resulting "model" is perfect
actual <- factor(c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0))
predicted <- factor(c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0))

# Create confusion matrix (first argument: predictions, second: actual labels)
conf_matrix <- confusionMatrix(data = predicted, reference = actual)

# Print confusion matrix
print(conf_matrix)

Output:

Confusion Matrix and Statistics

          Reference
Prediction 0 1
         0 5 0
         1 0 5
                                     
               Accuracy : 1          
                 95% CI : (0.6915, 1)
    No Information Rate : 0.5        
    P-Value [Acc > NIR] : 0.0009766  
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         
                                     
            Sensitivity : 1.0        
            Specificity : 1.0        
         Pos Pred Value : 1.0        
         Neg Pred Value : 1.0        
             Prevalence : 0.5        
         Detection Rate : 0.5        
   Detection Prevalence : 0.5        
      Balanced Accuracy : 1.0        
                                     
       'Positive' Class : 0   

The output displays the confusion matrix along with various performance metrics such as accuracy, sensitivity (recall), specificity, and positive predictive value (precision). Note that caret treats the first factor level (here 0) as the positive class by default; this can be changed with the positive argument of confusionMatrix().
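
The object returned by confusionMatrix() is not just printed text: its pieces can be extracted programmatically, which is useful when logging metrics or comparing models. A short sketch using the conf_matrix object created above:

# The raw contingency table
conf_matrix$table

# Overall metrics, returned as a named numeric vector
conf_matrix$overall["Accuracy"]
conf_matrix$overall["Kappa"]

# Per-class metrics (sensitivity, specificity, precision, ...)
conf_matrix$byClass["Sensitivity"]
conf_matrix$byClass["Pos Pred Value"]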

Multi-class Classification

In this example, we'll work with a multi-class classification scenario using the famous Iris dataset.

# Load required libraries
library(caret)

# Load Iris dataset
data(iris)

# Split data into training and testing sets
set.seed(123)
train_index <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_data <- iris[train_index, ]
test_data <- iris[-train_index, ]

# Train a model (e.g., using a decision tree)
model <- train(Species ~ ., data = train_data, method = "rpart")

# Make predictions on test data
predicted <- predict(model, test_data)

# Create confusion matrix (first argument: predictions, second: actual labels)
conf_matrix <- confusionMatrix(data = predicted, reference = test_data$Species)

# Print confusion matrix
print(conf_matrix)

Output:

Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         10         2
  virginica       0          0         8

Overall Statistics

               Accuracy : 0.9333
                 95% CI : (0.7793, 0.9918)
    No Information Rate : 0.3333
    P-Value [Acc > NIR] : 8.747e-12

                  Kappa : 0.9

 Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            1.0000           0.8000
Specificity                 1.0000            0.9000           1.0000
Pos Pred Value              1.0000            0.8333           1.0000
Neg Pred Value              1.0000            1.0000           0.9091
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3333           0.2667
Detection Prevalence        0.3333            0.4000           0.2667
Balanced Accuracy           1.0000            0.9500           0.9000

Once you have created the confusion matrix, interpreting it is crucial for understanding the performance of your model. In the output above, the diagonal holds the correct predictions: all 10 setosa and all 10 versicolor flowers in the test set are classified correctly, while 2 of the 10 virginica flowers are misclassified as versicolor. This is reflected in the per-class statistics: sensitivity for virginica drops to 0.8 (2 of its 10 instances are missed), and the positive predictive value for versicolor drops to 0.8333 (2 of the 12 versicolor predictions are wrong), while every metric for setosa remains a perfect 1.0.
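
A visual rendering often makes misclassification patterns easier to spot. One possible sketch (assuming the ggplot2 package is installed) converts the table stored in conf_matrix$table from the Iris example into a data frame and draws it as a heat map:

library(ggplot2)

# Convert the contingency table into a long data frame
# with columns Prediction, Reference, and Freq
cm_df <- as.data.frame(conf_matrix$table)

# Heat map of counts, with the count printed in each cell
ggplot(cm_df, aes(x = Reference, y = Prediction, fill = Freq)) +
  geom_tile(color = "white") +
  geom_text(aes(label = Freq)) +
  scale_fill_gradient(low = "white", high = "steelblue") +
  labs(title = "Confusion Matrix", x = "Actual class", y = "Predicted class")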

Conclusion

The confusion matrix is a powerful tool for evaluating the performance of classification models in R. By providing a detailed breakdown of prediction outcomes, it enables data scientists and machine learning practitioners to assess the strengths and weaknesses of their models effectively. With the help of R packages like caret, creating and interpreting confusion matrices becomes an integral part of the model evaluation process, contributing to more informed decision-making and model refinement.
