How to calculate the F1 score and other custom metrics in PyTorch?

Last Updated : 17 Apr, 2024

Evaluating deep learning models goes beyond just training them; it means rigorously checking their performance to ensure they’re accurate, reliable, and efficient for real-world use. This evaluation is critical because it tells us how well a model has learned and how effective it might be in real-life situations. Custom metrics are essential here, especially when a standard metric like accuracy is not enough (for example, on imbalanced data) or when the task calls for a measure that is easier to interpret. Here, we will see how to use PyTorch to calculate the F1 score and other metrics.

What are Evaluation Metrics?

Evaluation metrics are quantitative measures used to assess the performance of machine learning models. These metrics provide insights into how well a model is performing and can help guide decisions on model selection, parameter tuning, and feature engineering.

Precision: Measures the proportion of true positive predictions among all positive predictions made by the model.
Recall: Measures the proportion of true positive predictions among all actual positive instances in the dataset.
F1 Score: Harmonic mean of precision and recall, providing a balanced measure of a model’s performance.
ROC AUC: Area under the Receiver Operating Characteristic curve, which illustrates the trade-off between true positive rate and false positive rate for different thresholds of a binary classifier.
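
As a small worked example of these definitions, the sketch below computes precision, recall, and F1 from confusion counts. The counts (8 true positives, 2 false positives, 4 false negatives) and the helper name `prf_from_counts` are made up for illustration and do not come from the model trained later in this article.

```python
def prf_from_counts(tp, fp, fn):
    """Return (precision, recall, f1); any ratio with a zero denominator is 0.0."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return precision, recall, f1

# Hypothetical counts chosen only to illustrate the formulas
precision, recall, f1 = prf_from_counts(tp=8, fp=2, fn=4)
print(f"Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}")
```

With these counts, precision is 8/10 = 0.800, recall is 8/12 ≈ 0.667, and F1 is 16/22 ≈ 0.727. The F1 score sits closer to the lower of the two values, which is the usual behaviour of a harmonic mean.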

Implementation of _______

Dataset Loading

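This code loads the CIFAR-10 dataset with torchvision, which provides separate training and test splits through the train argument. Each image is converted to a tensor and normalized. To turn the 10-class dataset into a binary classification problem, only the samples with labels 0 and 1 (airplane and automobile) are kept using Subset. The filtered data is then wrapped in DataLoader objects with a batch size of 64, shuffling only the training set.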

Python3

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader, Subset

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load CIFAR-10 dataset; normalize each of the three RGB channels to roughly [-1, 1]
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])
train_data = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
test_data = datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)

# Keep only classes 0 (airplane) and 1 (automobile) for binary classification
binary_train_data = Subset(train_data, [i for i in range(len(train_data)) if train_data.targets[i] <= 1])
binary_test_data = Subset(test_data, [i for i in range(len(test_data)) if test_data.targets[i] <= 1])

train_loader = DataLoader(dataset=binary_train_data, batch_size=64, shuffle=True)
test_loader = DataLoader(dataset=binary_test_data, batch_size=64, shuffle=False)
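
To see the filtering step in isolation, here is a minimal sketch on a tiny in-memory dataset instead of CIFAR-10. The tensor shapes, the four-class label vector, and the variable names are assumptions for illustration; only `TensorDataset`, `Subset`, and `DataLoader` are actual PyTorch utilities.

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

# Hypothetical toy data: 12 samples, 5 features, labels drawn from 4 classes
features = torch.randn(12, 5)
labels = torch.tensor([0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3])
toy_data = TensorDataset(features, labels)

# Keep only the indices whose label is 0 or 1, as the CIFAR-10 code does
keep_indices = [i for i, y in enumerate(labels.tolist()) if y <= 1]
binary_toy_data = Subset(toy_data, keep_indices)

toy_loader = DataLoader(binary_toy_data, batch_size=4, shuffle=False)
seen_labels = torch.cat([batch_labels for _, batch_labels in toy_loader])
print(len(binary_toy_data), seen_labels.tolist())
```

Subset does not relabel anything, so the kept classes 0 and 1 map directly onto the two output logits of a binary classifier.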

Model Building

Next, we define and train a simple MLP model in PyTorch using the following steps:

Define the MLP model: The MLP class inherits from nn.Module, which is the base class for all neural network modules in PyTorch. In the constructor (__init__), the model is defined with two fully connected (linear) layers (fc1 and fc2) separated by a ReLU activation function (relu).
Forward pass: The forward method defines how the input x is processed through the layers of the network. The input x is flattened (view(-1, 3 * 32 * 32)) to match the input size expected by the first linear layer (fc1), then passed through the activation function (relu), and finally processed by the second linear layer (fc2).
Move model to device: The model is moved to the specified device (e.g., GPU if available) using the to method.
Define loss and optimizer: CrossEntropyLoss is used as the loss function. It is designed for multi-class classification and works here because the model outputs two logits, one per class. The Adam optimizer is used to update the model parameters based on the computed gradients.
Training loop: The model is trained for num_epochs epochs. In each epoch, the training data (train_loader) is iterated over in batches. For each batch, the images and labels are moved to the specified device. The model then makes predictions (outputs) on the input images, and the loss is computed from the predicted outputs and the actual labels. The gradients are cleared with zero_grad, computed with backward, and applied to the model parameters with step.
Print epoch and loss: At the end of each epoch, the epoch number and the loss of the last batch in that epoch are printed.

Python3

# Define MLP model
class MLP(nn.Module):
    def __init__(self):
        super(MLP, self).__init__()
        self.fc1 = nn.Linear(3 * 32 * 32, 512)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(512, 2)

    def forward(self, x):
        out = self.fc1(x.view(-1, 3 * 32 * 32))
        out = self.relu(out)
        out = self.fc2(out)
        return out

model = MLP().to(device)

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train the model
num_epochs = 5
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        images, labels = images.to(device), labels.to(device)
        
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)
        
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    
    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {loss.item()}')
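
The loop above reports the loss of the final batch in each epoch, which can fluctuate from one epoch to the next. One possible alternative, sketched here on random stand-in data rather than CIFAR-10, is to average the loss over all samples in the epoch. The toy tensors, the small `nn.Sequential` model, and the names prefixed with `toy_` are assumptions made for this example.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-in data: 64 random samples with 10 features and binary labels
toy_loader = DataLoader(
    TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,))),
    batch_size=16, shuffle=True)

toy_model = nn.Sequential(nn.Linear(10, 8), nn.ReLU(), nn.Linear(8, 2))
toy_criterion = nn.CrossEntropyLoss()
toy_optimizer = optim.Adam(toy_model.parameters(), lr=0.001)

epoch_losses = []
for epoch in range(3):
    running_loss, n_samples = 0.0, 0
    for inputs, targets in toy_loader:
        loss = toy_criterion(toy_model(inputs), targets)
        toy_optimizer.zero_grad()
        loss.backward()
        toy_optimizer.step()
        # The criterion returns the batch mean, so weight it by the batch size
        running_loss += loss.item() * inputs.size(0)
        n_samples += inputs.size(0)
    epoch_losses.append(running_loss / n_samples)
    print(f'Epoch {epoch + 1}, average loss: {epoch_losses[-1]:.4f}')
```

Weighting each batch loss by its size keeps the average correct even when the last batch is smaller than the others.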

Calculating Model Metrics: Precision, Recall, F1-score, ROC AUC

Set model to evaluation mode: model.eval() is used to set the model to evaluation mode. This disables any operations like dropout that are only meant to be applied during training.
Iterate over test dataset: The code iterates over the test dataset (test_loader) and makes predictions using the trained model for each batch of images. The predicted labels (predicted) are the indices of the largest value along the second dimension of the output tensor (outputs), which torch.max returns together with the values themselves.
Collect predictions and labels in lists: The predicted and true labels of each batch are appended to the lists y_pred and y_true so that the metrics can be computed over the whole test set.
Convert lists to tensors: The lists y_pred and y_true are converted back to tensors (y_pred_tensor and y_true_tensor, respectively) for further calculation.
Calculate precision, recall, and F1 score: Treating class 1 as the positive class, True Positives (TP), False Positives (FP), and False Negatives (FN) are counted from the predicted and true labels. Precision, recall, and F1 score are then calculated from these counts, falling back to 0 when a denominator is zero.
Print the results: Precision, recall, and F1 score are printed to the console.

This approach to a binary classification problem covers dataset loading, model definition and training, and evaluation using custom metrics. Precision, recall, and F1 score give a fuller picture of the model’s performance than accuracy alone: they show how reliably the model predicts the positive class and how it trades off false positives against false negatives. ROC AUC needs the model’s predicted scores rather than hard labels, so it is not computed by the code in this section.

Python3

# Evaluate the model using PyTorch
model.eval()
y_true = []
y_pred = []

# Disable gradient tracking, which is not needed during inference
with torch.no_grad():
    for images, labels in test_loader:
        images = images.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs, 1)
        y_pred.extend(predicted.cpu().tolist())
        y_true.extend(labels.tolist())

# Convert lists to tensors for calculation
y_true_tensor = torch.tensor(y_true)
y_pred_tensor = torch.tensor(y_pred)

# Calculating precision, recall, and F1 score using PyTorch
TP = ((y_pred_tensor == 1) & (y_true_tensor == 1)).sum().item()
FP = ((y_pred_tensor == 1) & (y_true_tensor == 0)).sum().item()
FN = ((y_pred_tensor == 0) & (y_true_tensor == 1)).sum().item()

precision = TP / (TP + FP) if TP + FP > 0 else 0
recall = TP / (TP + FN) if TP + FN > 0 else 0
f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1 Score: {f1}')

Output:

Epoch 1/5, Loss: 0.29176822304725647
Epoch 2/5, Loss: 0.43448692560195923
Epoch 3/5, Loss: 0.0890989825129509
Epoch 4/5, Loss: 0.28986942768096924
Epoch 5/5, Loss: 0.3814219832420349
Precision: 0.8680154142581888
Recall: 0.901
F1 Score: 0.8842001962708538
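
The code above stops at F1 and does not compute ROC AUC, which needs a continuous score per sample rather than a hard label. A minimal sketch in plain PyTorch follows, assuming the score is the softmax probability of class 1. The function name `binary_roc_auc` and the example logits are hypothetical. The function uses the pairwise (Mann-Whitney) formulation: the fraction of positive/negative pairs in which the positive sample receives the higher score, with ties counted as one half.

```python
import torch

def binary_roc_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    if pos.numel() == 0 or neg.numel() == 0:
        return float('nan')  # undefined when only one class is present
    # Compare every positive score with every negative score
    diff = pos.unsqueeze(1) - neg.unsqueeze(0)
    wins = (diff > 0).sum().item() + 0.5 * (diff == 0).sum().item()
    return wins / (pos.numel() * neg.numel())

# Made-up logits for four samples, with two output columns like the MLP above
example_logits = torch.tensor([[2.0, -1.0], [0.5, 0.1], [0.6, 0.0], [-1.0, 2.0]])
example_labels = torch.tensor([0, 0, 1, 1])
example_scores = torch.softmax(example_logits, dim=1)[:, 1]
print(f'ROC AUC: {binary_roc_auc(example_labels, example_scores):.2f}')
```

In this example, three of the four positive/negative pairs are ordered correctly, so the result is 0.75. To apply the same idea to the trained model, the scores could be collected in the evaluation loop with `torch.softmax(outputs, dim=1)[:, 1]` alongside the predicted labels. The pairwise comparison builds a matrix of size (positives × negatives), which is fine for a test set of a few thousand samples but may need a rank-based variant for much larger ones.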

Conclusion:

This article covers a binary classification problem using PyTorch, from dataset loading to model evaluation. We loaded CIFAR-10, kept two of its classes, and trained a Multilayer Perceptron (MLP) model with a ReLU non-linearity and the Adam optimizer. Evaluating beyond accuracy, we calculated precision, recall, and F1 score from the counts of true positives, false positives, and false negatives, and introduced ROC AUC as a threshold-independent measure. Together, these steps highlight the significance of data preparation, model architecture understanding, and diverse evaluation metrics in successful machine learning projects.
