
Swish Activation Function in PyTorch

Last Updated : 25 Oct, 2023

Activation functions are a fundamental component of artificial neural networks. They introduce non-linearity into the model, allowing it to learn complex relationships in the data. One such activation function, the Swish activation function, has gained attention for its unique properties and potential advantages over the widely used Rectified Linear Unit (ReLU) activation. In this article, we’ll delve into the Swish activation function, provide the mathematical formula, explore its advantages over ReLU, and demonstrate its implementation using PyTorch.

Swish Activation Function

The Swish activation function, introduced by researchers at Google in 2017, is defined mathematically as follows:

Swish(x) = x * sigmoid(x)

Where:

  • x: The input value to the activation function.
  • sigmoid(x): The sigmoid function, 1 / (1 + e^(-x)), which maps any real-valued number into the range (0, 1), transitioning smoothly from 0 toward 1 as x increases.

The Swish activation combines a linear component (the input x) with a non-linear component (the sigmoid function), resulting in a smooth and differentiable activation function.
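
As a quick numeric illustration (a minimal sketch using plain PyTorch tensor operations), Swish leaves large positive inputs almost unchanged, drives large negative inputs toward zero, and passes through zero at the origin:

Python3

import torch

x = torch.tensor([-2.0, 0.0, 2.0])
swish = x * torch.sigmoid(x)
print(swish)  # approximately tensor([-0.2384, 0.0000, 1.7616])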

Where to Use Swish Activation?

Swish can be used in various neural network architectures, including feedforward neural networks, convolutional neural networks (CNNs), and recurrent neural networks (RNNs). Its advantages become particularly apparent in deep networks where it can help mitigate the vanishing gradient problem.
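
As a sketch of what this looks like in practice (the layer sizes below are illustrative assumptions for a 28×28 grayscale input, not taken from the article), Swish can simply replace ReLU wherever an activation appears. Recent PyTorch releases also ship Swish built in as nn.SiLU:

Python3

import torch.nn as nn

# Illustrative CNN block using Swish (nn.SiLU) in place of ReLU
cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),
    nn.SiLU(),  # Swish activation
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.SiLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),
)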

Advantages of Swish Activation Function over ReLU

Now, let’s explore the advantages of the Swish activation function compared to the popular ReLU activation.

Smoothness and Differentiability

Swish is a smooth and differentiable function due to the presence of the sigmoid component. This property makes it well-suited for gradient-based optimization techniques such as stochastic gradient descent (SGD) and backpropagation. In contrast, ReLU is not differentiable at x = 0, which can lead to optimization challenges.
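
A small autograd sketch makes the difference concrete (the values below follow directly from the derivative d/dx[x * sigmoid(x)] = sigmoid(x) + x * sigmoid(x) * (1 - sigmoid(x))):

Python3

import torch

# Swish has a well-defined gradient at x = 0 (equal to 0.5)
x = torch.tensor(0.0, requires_grad=True)
(x * torch.sigmoid(x)).backward()
print(x.grad)  # tensor(0.5000)

# ReLU's derivative is undefined at x = 0; PyTorch returns the subgradient 0
x = torch.tensor(0.0, requires_grad=True)
torch.relu(x).backward()
print(x.grad)  # tensor(0.)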

Improved Learning in Deep Networks

In deep neural networks, Swish can potentially enable better learning and convergence compared to ReLU. The smoothness of Swish helps gradients flow more smoothly through the network, reducing the likelihood of vanishing gradients during training. This is especially beneficial in very deep networks.
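
One way to inspect this for yourself (a sketch only; the numbers depend on depth, initialization, and input, and are not results from the article) is to build a deep stack of layers with Swish activations and check the gradient that reaches the first layer after a backward pass:

Python3

import torch
import torch.nn as nn

# Build a 20-layer fully connected stack with Swish (nn.SiLU) activations
layers = []
for _ in range(20):
    layers += [nn.Linear(128, 128), nn.SiLU()]
deep_net = nn.Sequential(*layers)

# Run one forward/backward pass and check the first layer's gradient norm
x = torch.randn(32, 128)
deep_net(x).sum().backward()
print(deep_net[0].weight.grad.norm())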

Similar Computational Cost

Swish is slightly more expensive per element than ReLU because it evaluates a sigmoid, but both are simple element-wise operations, so the extra cost is negligible compared to the linear and convolutional layers that dominate training and inference.
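
If you want to check this on your own hardware (a rough sketch; timings vary with device and tensor size), you can time both activations directly:

Python3

import time
import torch

x = torch.randn(1000, 1000)

start = time.perf_counter()
for _ in range(1000):
    _ = torch.relu(x)
print(f"ReLU:  {time.perf_counter() - start:.3f} s")

start = time.perf_counter()
for _ in range(1000):
    _ = x * torch.sigmoid(x)
print(f"Swish: {time.perf_counter() - start:.3f} s")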

Implementation Using PyTorch

Now, let’s see how to implement the Swish activation function using PyTorch. We’ll create a custom Swish module and integrate it into a simple neural network.

Let’s start with importing the necessary libraries.

Python3




import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torchvision import datasets, transforms
from torch.utils.data import DataLoader


Once we are done importing the libraries, we can define the custom Swish activation.

The following code defines a class that inherits from PyTorch's nn.Module base class. Inside the class, there is a forward method that defines how the module processes the input data: it takes the input tensor as an argument and returns the output tensor after applying the Swish activation.

Python3




# Swish Function
class Swish(nn.Module):
    def forward(self, x):
        return x * torch.sigmoid(x)
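
As a quick sanity check, the module can be applied to a small tensor. Note that recent PyTorch versions also provide this activation built in as nn.SiLU / F.silu, which could be used instead of the custom class:

Python3

swish = Swish()
x = torch.tensor([-1.0, 0.0, 1.0])
print(swish(x))   # tensor([-0.2689,  0.0000,  0.7311])
print(F.silu(x))  # built-in equivalent (SiLU)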


After defining the Swish class, we proceed with defining the neural network model.

In the following code snippet, we define a neural network model in PyTorch designed for an image classification task.

  • The input layer takes a flattened 28×28 (784-pixel) image.
  • The hidden layers
    • The first hidden layer consists of 256 neurons. It takes the flattened input and applies a linear transformation to produce its output.
    • The second hidden layer consists of 128 neurons and takes the 256-dimensional output from the previous layer to produce a 128-dimensional output.
    • The Swish activation function is applied after both hidden layers to introduce non-linearity into the network.
  • The output layer consists of 10 neurons to perform classification into 10 classes.

Python3




# Define the neural network model
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(28 * 28, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 10)
        self.swish = Swish()
 
    def forward(self, x):
        x = x.view(-1, 28 * 28)
        x = self.fc1(x)
        x = self.swish(x)
        x = self.fc2(x)
        x = self.swish(x)
        x = self.fc3(x)
        return x


To set up the neural network for training, we create an instance of the model and define the loss function, the optimizer, and the data transformations.

Python3




# Create an instance of the model
model = Net()
 
# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
 
# Define data transformations
transform = transforms.Compose([
    transforms.ToTensor(),
])
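
As an optional addition (not part of this example's pipeline), MNIST inputs are often standardized with the dataset's commonly quoted mean and standard deviation:

Python3

# Optional: standardize inputs using MNIST's commonly quoted mean/std
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
])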


Once we are done with this step, we can train and evaluate the model on a dataset. Let's load the MNIST dataset and create data loaders for training and testing using the following code.

Python3




# Load the MNIST dataset
train_dataset = datasets.MNIST('', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST('', train=False, download=True, transform=transform)
 
# Create data loaders
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)


With these data loaders in place, we can proceed with the training loop to iterate through batches of training data.

In the following code, we have executed the training loop for the neural network. The loop will repeat for 5 epochs, during which the model’s weights are updated to minimize the loss and improve its performance on the training data.

Python3




# Training loop
num_epochs = 5
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
 
model.to(device)
 
for epoch in range(num_epochs):
    model.train()
    total_loss = 0.0
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
     
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {total_loss / len(train_loader)}")


Output:

Epoch 1/5, Loss: 1.6938323568503062
Epoch 2/5, Loss: 0.4569567457397779
Epoch 3/5, Loss: 0.3522500048557917
Epoch 4/5, Loss: 0.31695075702369213
Epoch 5/5, Loss: 0.2961081813474496

The last step is to evaluate the model on the test set.

Python3




# Evaluation loop
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for data, target in test_loader:
        data, target = data.to(device), target.to(device)
        outputs = model(data)
        _, predicted = torch.max(outputs.data, 1)
        total += target.size(0)
        correct += (predicted == target).sum().item()
 
print(f"Accuracy on test set: {100 * correct / total}%")


Output:

Accuracy on test set: 92.02%

Conclusion

The Swish activation function offers a promising alternative to traditional activation functions like ReLU. Its smoothness, differentiability, and potential to improve learning in deep networks make it a valuable tool for modern neural network architectures. By implementing Swish in PyTorch, you can harness its benefits and explore its effectiveness in various machine learning tasks.


