
Custom Optimizers in PyTorch

Last Updated : 05 Feb, 2023

In PyTorch, an optimizer is a specific implementation of the optimization algorithm that is used to update the parameters of a neural network. The optimizer updates the parameters in such a way that the loss of the neural network is minimized. PyTorch provides various built-in optimizers such as SGD, Adam, Adagrad, etc. that can be used out of the box. However, in some cases, the built-in optimizers may not be suitable for a particular problem or may not perform well. In such cases, one can create their own custom optimizer.

A custom optimizer in PyTorch is a class that inherits from the torch.optim.Optimizer base class. The custom optimizer should implement the __init__ and step methods. The __init__ method initializes the optimizer's internal state, and the step method updates the parameters of the model.

Creating a Custom Optimizer:

In PyTorch, creating a custom optimizer is a two-step process: first, we create a class that inherits from the torch.optim.Optimizer class and overrides the following methods; second, we instantiate it with the model's parameters and use it in the training loop just like a built-in optimizer. A minimal skeleton of such a class follows the list.

  • __init__(self, params): This method initializes the optimizer and registers the model parameters along with the hyperparameters (the base class stores them in the param_groups attribute).
  • step(): This method is used to perform a single optimization step. It should update the model parameters based on the current gradients.
  • zero_grad(): This method sets the gradients of all parameters to zero. It is already provided by the torch.optim.Optimizer base class, so it rarely needs to be overridden.
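
For orientation, here is a minimal skeleton of such a class. It is only a sketch: the class name MySGD and the plain gradient-descent update rule are illustrative placeholders, not part of the momentum example developed below.

Python3

# Minimal skeleton of a custom optimizer (plain gradient descent)
import torch

class MySGD(torch.optim.Optimizer):
    def __init__(self, params, lr=1e-2):
        # The base class stores the parameters and defaults in self.param_groups
        super().__init__(params, defaults=dict(lr=lr))

    def step(self):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                # Gradient-descent update: p <- p - lr * grad
                p.data -= group['lr'] * p.grad.data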

Init Method:

The __init__ method is used to initialize the optimizer's internal state. In this method, we define the hyperparameters of the optimizer and set up its internal state. For example, let's say we want to create a custom optimizer that implements the Momentum optimization algorithm. The __init__ method for this optimizer would look like the code below.

In the example below, we define the hyperparameters of the optimizer to be the learning rate lr and the momentum. We then call super() to initialize the internal state of the optimizer. We also set up a state dictionary that we will use to store the velocity vector for each parameter.

Python3




# Import the necessary libraries
import torch
import torch.nn as nn

# MomentumOptimizer
class MomentumOptimizer(torch.optim.Optimizer):

    # Init Method:
    def __init__(self, params, lr=1e-3, momentum=0.9):
        super(MomentumOptimizer, self).__init__(params, defaults={'lr': lr})
        self.momentum = momentum
        self.state = dict()
        for group in self.param_groups:
            for p in group['params']:
                self.state[p] = dict(mom=torch.zeros_like(p.data))

    # Step Method
    def step(self):
        for group in self.param_groups:
            for p in group['params']:
                # Skip parameters that did not receive a gradient
                if p.grad is None:
                    continue
                if p not in self.state:
                    self.state[p] = dict(mom=torch.zeros_like(p.data))
                # Compute the new velocity and store it back in the state
                mom = self.momentum * self.state[p]['mom'] - group['lr'] * p.grad.data
                self.state[p]['mom'] = mom
                # Apply the velocity to the parameter
                p.data += mom


The Step Method:

The step method is used to update the parameters of the model. It takes no required arguments; it uses the optimizer's internal state and the current gradients to update the model parameters. In the case of our MomentumOptimizer, the step method is the one shown in the code above.

In that step method, we iterate over all the parameters in the model, skip any parameter that has no gradient, and check whether the parameter is in the state dictionary. If it is not, we add it with an initial velocity vector of zero. We then compute the new velocity vector from the momentum term and the learning rate, store it back in the state dictionary, and update the parameter's value using this velocity vector.

Using the custom optimizer is similar to using the built-in optimizers, in that we instantiate it and pass in the model’s parameters and the hyperparameters.
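
For example, assuming a model has already been defined:

Python3

# Instantiate the custom optimizer just like a built-in one
optimizer = MomentumOptimizer(model.parameters(), lr=1e-3, momentum=0.9)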

Illustration 1:

Let's create a simple training loop that shows how to use the custom optimizer to train a model. The loop performs the following steps, shown in compact form right after the list:

  1. Initialize the gradients of the model’s parameters to zero using the optimizer’s zero_grad method.
  2. Compute the forward pass of the model on some input data and calculate the loss.
  3. Compute the gradients of the model’s parameters with respect to the loss using the backward method.
  4. Call the step method of the optimizer to update the model’s parameters based on the current gradients and the optimizer’s internal state.
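
In compact form (the names model, criterion, optimizer, X and y are assumed to be defined exactly as in the steps below), the loop looks like this:

Python3

# Compact version of the four steps listed above
for i in range(2500):
    optimizer.zero_grad()          # 1. reset gradients
    y_pred = model(X)              # 2. forward pass
    loss = criterion(y_pred, y)    #    and loss
    loss.backward()                # 3. backpropagate
    optimizer.step()               # 4. update parameters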

Step 1: Import the necessary libraries:

Python3




# Import the necessary libraries
import torch
import torch.nn as nn
# To plot the figure
import matplotlib.pyplot as plt


Step 2: Define a custom optimizer class that inherits from torch.optim.Optimizer. In this example, we will create a custom optimizer that implements the Momentum optimization algorithm.

Python3




# MomentumOptimizer
class MomentumOptimizer(torch.optim.Optimizer):

    # Init Method:
    def __init__(self, params, lr=1e-3, momentum=0.9):
        super(MomentumOptimizer, self).__init__(params, defaults={'lr': lr})
        self.momentum = momentum
        self.state = dict()
        for group in self.param_groups:
            for p in group['params']:
                self.state[p] = dict(mom=torch.zeros_like(p.data))

    # Step Method
    def step(self):
        for group in self.param_groups:
            for p in group['params']:
                # Skip parameters that did not receive a gradient
                if p.grad is None:
                    continue
                if p not in self.state:
                    self.state[p] = dict(mom=torch.zeros_like(p.data))
                # Compute the new velocity and store it back in the state
                mom = self.momentum * self.state[p]['mom'] - group['lr'] * p.grad.data
                self.state[p]['mom'] = mom
                # Apply the velocity to the parameter
                p.data += mom


Step 3: Define a simple model, loss function and also initialize an instance of the custom optimizer:

Python3




# Define a simple model
model = nn.Linear(2, 2)
  
# Define a loss function
criterion = nn.MSELoss()
  
# Define the optimizer
optimizer = MomentumOptimizer(model.parameters(), lr=1e-3, momentum=0.9)


Step 4: Generate some random data to train the model

Python3




# Generate some random data
X = torch.randn(100, 2)
y = torch.randn(100, 1)


Step 5: Train the model with the custom optimizer and plot the training loss.

Python3




# Training loop
for i in range(2500):
    optimizer.zero_grad()
    y_pred = model(X)
    loss = criterion(y_pred, y)
      
    # Plot losses
    if i % 100 == 0:
        plt.plot(i, loss.item(), 'ro-')
      
    loss.backward()
    optimizer.step()
      
plt.title('Losses over iterations')
plt.xlabel('iterations')
plt.ylabel('Losses')
plt.show()


Output:

[Figure: Losses over iterations]

You will notice that your custom optimizer is correctly updating the parameters of the model and minimizing the loss function.

Note: The above loop is an example of how to use the custom optimizer; it should help you understand how the optimizer's step method works.

Customizing Optimizers:

There are many ways to customize optimizers in PyTorch. Some of them are as follows:

Changing the learning rate schedule:

The learning rate of the optimizer can be changed during training using a learning rate scheduler. PyTorch provides several built-in schedulers such as torch.optim.lr_scheduler.StepLR and torch.optim.lr_scheduler.ExponentialLR. We can also create our own scheduler by inheriting from the torch.optim.lr_scheduler._LRScheduler class; a sketch of such a custom scheduler follows the StepLR example below.

In the code below, we use the torch.optim.lr_scheduler.StepLR scheduler, which multiplies the learning rate by a factor of gamma every step_size calls to scheduler.step() (here, every step_size epochs).

Python3




# Initialize an optimizer with a fixed learning rate
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
  
# Create a learning rate scheduler
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
  
num_epochs = 200
# In the training loop
for i in range(num_epochs):
    # Perform the training step
    optimizer.zero_grad()
      
    y_pred = model(X)
    loss = criterion(y_pred, y)
      
    loss.backward()
    optimizer.step()
    # Update the learning rate
    scheduler.step()


Adding regularization

To add regularization to the optimizer, we can modify the step() method to include the regularization term in the update of the model parameters. For example, we can add L1 or L2 regularization by modifying the step() method to include a term that penalizes the absolute or squared values of the parameters respectively.

Python3




# math is needed for the bias-correction term
import math

# Define custom optimizer
class MyAdam(torch.optim.Adam):
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), weight_decay=0):
        super().__init__(params, lr=lr, betas=betas)
        self.weight_decay = weight_decay

    def step(self):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                if grad.is_sparse:
                    raise RuntimeError("Adam does not support sparse gradients")

                state = self.state[p]

                # State initialization
                if len(state) == 0:
                    state["step"] = 0
                    # Exponential moving average of gradient values
                    state["exp_avg"] = torch.zeros_like(p.data)
                    # Exponential moving average of squared gradient values
                    state["exp_avg_sq"] = torch.zeros_like(p.data)

                exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"]
                beta1, beta2 = group["betas"]

                state["step"] += 1

                # Weight decay (L2 regularization): add weight_decay * p to the gradient
                if self.weight_decay != 0:
                    grad = grad.add(p.data, alpha=self.weight_decay)

                # Decay the first and second moment running average coefficient
                exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
                exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

                denom = exp_avg_sq.sqrt().add_(group["eps"])

                bias_correction1 = 1 - beta1 ** state["step"]
                bias_correction2 = 1 - beta2 ** state["step"]
                step_size = group["lr"] * math.sqrt(bias_correction2) / bias_correction1

                p.data.addcdiv_(exp_avg, denom, value=-step_size)

# Optimizer
optimizer = MyAdam(model.parameters(), weight_decay=0.00002)


In the above code, we create a custom Adam optimizer that includes weight decay regularization by adding a weight_decay parameter to the optimizer and modifying the step() method to include the weight decay term in the parameter update. The weight decay term is applied to the gradient by grad = grad.add(p.data, alpha=self.weight_decay), which penalizes large parameter values by shrinking them at every update.
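
The example above implements L2-style weight decay. For L1 regularization, the penalty added to the gradient is instead proportional to the sign of the parameters. Below is a minimal sketch of this idea, using a plain SGD-style update rather than the Adam variant above (the class name SGDWithL1 and its hyperparameters are illustrative choices):

Python3

# Sketch: SGD-style optimizer with an L1 penalty folded into the gradient
class SGDWithL1(torch.optim.Optimizer):
    def __init__(self, params, lr=1e-2, l1=1e-4):
        super().__init__(params, defaults=dict(lr=lr, l1=l1))

    def step(self):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                # L1 penalty: add l1 * sign(p) to the gradient
                grad = grad.add(torch.sign(p.data), alpha=group['l1'])
                p.data -= group['lr'] * grad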

Implementing a new optimization algorithm: 

PyTorch provides several built-in optimization algorithms, such as SGD, Adam, and Adagrad. However, there are many other optimization algorithms that are not included in the library. By creating a custom optimizer, we can implement any optimization algorithm that we want.

Python3




class MyOptimizer(torch.optim.Optimizer):
    def __init__(self, params, lr=0.01):
        defaults = dict(lr=lr)
        super(MyOptimizer, self).__init__(params, defaults)
  
    def step(self):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                # Update using the squared gradient instead of the gradient itself
                p.data = p.data - group['lr'] * p.grad.data ** 2
  
optimizer = MyOptimizer(model.parameters(), lr=0.001)


In this example, we created a new optimization algorithm called MyOptimizer that updates the parameters based on the squared gradient values instead of the gradients themselves. Note that squaring discards the sign of the gradient, so this update rule is purely illustrative.

Using multiple optimizers:

 In some cases, we may want to use different optimizers for different parts of the model. For example, we may want to use Adam for the parameters of the convolutional layers, and SGD for the parameters of the fully-connected layers. This can be achieved by creating multiple instances of the optimizer, one for each set of parameters.

Python3




# Define different optimizers for different parts of the model
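# (Assumes a model that exposes conv_layers and fc_layers submodules, e.g. a CNN)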
params1 = model.conv_layers.parameters()
params2 = model.fc_layers.parameters()
  
optimizer1 = torch.optim.Adam(params1)
optimizer2 = torch.optim.SGD(params2, lr=0.01)
  
# In the training loop
for i in range(num_epochs):
    # Perform the training step
    ...
    optimizer1.zero_grad()
    optimizer2.zero_grad()
    loss.backward()
    optimizer1.step()
    optimizer2.step()


In this example, we are using Adam optimizer for the parameters of the convolutional layers, and SGD optimizer with a fixed learning rate of 0.01 for the parameters of the fully-connected layers. This can help fine-tune the training of specific parts of the model.

Illustration 2: 

Build a handwritten digit classification model using a custom optimizer.

Step 1: 

Import the necessary libraries

Python3




import torch
import torch.nn as nn
from torch.optim import Optimizer
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
from torch.utils.tensorboard import SummaryWriter
import math
import matplotlib.pyplot as plt
  
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


Step 2: 

Now, we’ll load the MNIST dataset, and create a data loader for it.

Python3




# Loading the dataset
dataset = MNIST(root='.', train=True, download=True, transform=ToTensor())
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
print(dataloader.dataset)  # display the dataset summary shown below


Output:

Dataset MNIST
    Number of datapoints: 60000
    Root location: .
    Split: Train
    StandardTransform
Transform: ToTensor()

Step 3:

Let’s visualize the first batch of our dataset.

Python3




# Visualize the first batch of images and their labels
for i, batch in enumerate(dataloader):
    figure = plt.figure(figsize=(16, 16))
    img, label = batch
    for j in range(img.shape[0]):
        figure.add_subplot(8, 8, j + 1)
        plt.imshow(img[j].squeeze(), cmap="gray")
        plt.title(label[j].item())
        plt.axis("off")

    plt.show()
    break


Output:

[Figure: First batch of input images]

Step 4: 

Next, we’ll define our model architecture, a simple fully connected network with two hidden layers

Python3




class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(28*28, 512)
        self.fc2 = nn.Linear(512, 512)
        self.fc3 = nn.Linear(512, 10)
  
    def forward(self, x):
        x = x.view(-1, 28*28)
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x
        
# Model
model = Net().to(device)


Step 5:

We'll define our loss function; in this case, we'll use the cross-entropy loss.

Python3




# Loss functions
loss_fn = nn.CrossEntropyLoss()


Step 6:

Next, we’ll define our custom optimizer

Python3




# Define custom optimizer
class MyAdam(torch.optim.Adam):
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), weight_decay=0):
        super().__init__(params, lr=lr, betas=betas)
        self.weight_decay = weight_decay
  
    def step(self):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                if grad.is_sparse:
                    raise RuntimeError("Adam does not support sparse gradients")
  
                state = self.state[p]
  
                # State initialization
                if len(state) == 0:
                    state["step"] = 0
                    # Exponential moving average of gradient values
                    state["exp_avg"] = torch.zeros_like(p.data)
                    # Exponential moving average of squared gradient values
                    state["exp_avg_sq"] = torch.zeros_like(p.data)
  
                exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"]
                beta1, beta2 = group["betas"]
  
                state["step"] += 1
  
                if self.weight_decay != 0:
                    grad = grad.add(p.data, alpha=self.weight_decay)
  
                # Decay the first and second moment running average coefficient
                exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
                exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
  
                denom = exp_avg_sq.sqrt().add_(group["eps"])
  
                bias_correction1 = 1 - beta1 ** state["step"]
                bias_correction2 = 1 - beta2 ** state["step"]
                step_size = group["lr"] * math.sqrt(bias_correction2) / bias_correction1
  
                p.data.addcdiv_(exp_avg, denom, value=-step_size)
  
# Optimizer
optimizer = MyAdam(model.parameters(), weight_decay=0.00001)


Step 7:

Now, train the model with the custom optimizer and plot the training loss.

Python3




# Training loop
num_epochs = 10
for i in range(num_epochs):
    for inputs, labels in dataloader:
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = model(inputs)
        loss = loss_fn(outputs, labels)
  
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        #scheduler.step()
          
    plt.plot(i,loss.item(),'ro-')
    print(i,'>> Loss :', loss.item())
  
plt.title('Losses over iterations')
plt.xlabel('iterations')
plt.ylabel('Losses')
plt.show()


Output:

0 >> Loss : nan
1 >> Loss : 1.2611686178923354e-44
2 >> Loss : nan
3 >> Loss : 8.407790785948902e-45
4 >> Loss : nan
5 >> Loss : 1.401298464324817e-45
6 >> Loss : nan
7 >> Loss : 0.0
8 >> Loss : nan
9 >> Loss : 1.401298464324817e-45
[Figure: Training losses]

Note: Losses will differ between runs and devices.

Conclusion:

Creating custom optimizers in PyTorch is a powerful technique that allows us to fine-tune the training process of a machine learning model. By inheriting from the torch.optim.Optimizer class and implementing the __init__ and step methods (zero_grad is inherited from the base class), we can implement our own optimization algorithm, add regularization, change the learning rate schedule, or use multiple optimizers. Custom optimizers can help improve the performance of a model and make it more suitable for a specific problem.


