
Optimizers in TensorFlow

Last Updated : 21 Dec, 2022

Optimizers are algorithms that adjust a model's weights and other parameters to decrease the loss (error). By minimizing the loss function, they help the model reach better accuracy faster. 

Optimizers in TensorFlow

Optimizer is the base class in TensorFlow that all other optimizers extend. It is initialized with the parameters of the model, but no tensor is given to it directly. The base optimizer provided by TensorFlow is: 

tf.train.Optimizer - TensorFlow 1.x
tf.compat.v1.train.Optimizer - TensorFlow 2.x

This class is never instantiated directly; instead, its subclasses are used.
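
For example, assuming a TensorFlow 2.x installation, the following minimal sketch instantiates one of the subclasses (Adam, chosen here purely for illustration) and checks that it derives from the Keras Optimizer base class:

import tensorflow as tf

# Instantiate a concrete subclass (Adam here), never the base Optimizer class itself
opt = tf.keras.optimizers.Adam(learning_rate=0.001)

# All built-in optimizers derive from the Keras Optimizer base class
print(isinstance(opt, tf.keras.optimizers.Optimizer))  # True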

Gradient Descent algorithm

Before looking at the individual optimizers, let's first review the algorithm on top of which the others are built, i.e. gradient descent. Gradient descent links the weights to the loss function: since the gradient measures how the loss changes with respect to each weight, the algorithm uses partial derivatives to decide how each weight should be adjusted (for example, add 0.7 or subtract 0.27) in order to minimize the loss. An obstacle arises on large, multi-dimensional datasets, where it can get stuck in a local minimum instead of reaching the global minimum. 

Syntax: tf.compat.v1.train.GradientDescentOptimizer(learning_rate, 
                                                    use_locking,
                                                    name='GradientDescent')
Parameters: 
learning_rate: rate at which the algorithm updates the parameters. 
               Tensor or float type of value. 
use_locking: Use locks for update operations if True.
name: Optional name for the operation.
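
To illustrate what an optimizer does under the hood, here is a minimal sketch (assuming TensorFlow 2.x) that performs plain gradient descent on a toy quadratic loss; tf.keras.optimizers.SGD with no momentum reduces to ordinary gradient descent, and the variable and loss below are made up for the example:

import tensorflow as tf

# A single trainable parameter, initialized away from the optimum
w = tf.Variable(5.0)

# SGD with no momentum is plain gradient descent
opt = tf.keras.optimizers.SGD(learning_rate=0.1)

for step in range(50):
    with tf.GradientTape() as tape:
        loss = (w - 2.0) ** 2                 # toy loss with its minimum at w = 2
    grads = tape.gradient(loss, [w])          # partial derivative of loss w.r.t. w
    opt.apply_gradients(zip(grads, [w]))      # w <- w - learning_rate * gradient

print(w.numpy())  # close to 2.0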

TensorFlow Keras Optimizer Classes

TensorFlow primarily supports nine optimizer classes, all built on top of its base class (Optimizer):

  • Gradient Descent
  • SGD
  • AdaGrad
  • RMSprop
  • Adadelta
  • Adam
  • AdaMax
  • NAdam
  • FTRL

SGD Optimizer (Stochastic Gradient Descent)

Stochastic Gradient Descent (SGD) performs a parameter update for every training example. On huge datasets, SGD performs redundant computations, and its frequent updates have high variance, causing the objective function to fluctuate heavily.  

Syntax: tf.keras.optimizers.SGD(learning_rate=0.01,
                                momentum=0.0, 
                                nesterov=False, 
                                name='SGD', 
                                **kwargs)
Parameters: 
learning_rate: rate at which the algorithm updates the parameters. 
               Tensor or float type of value. Default value is 0.01.
momentum: accelerates gradient descent in the appropriate
          direction. Float type of value. Default value is 0.0.
nesterov: Whether or not to apply Nesterov momentum.
          Boolean type of value. Default value is False.
name: Optional name for the operation.
**kwargs: Keyword arguments of variable length.
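
A short usage sketch (assuming TensorFlow 2.x; the toy data and model below are made up for illustration) compiling a small Keras model with SGD plus Nesterov momentum:

import numpy as np
import tensorflow as tf

# Toy regression data, made up for illustration
x_train = np.random.rand(256, 8).astype("float32")
y_train = np.random.rand(256, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])

# SGD with momentum and Nesterov acceleration
sgd = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
model.compile(optimizer=sgd, loss="mse")
model.fit(x_train, y_train, batch_size=32, epochs=2, verbose=0)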

Advantages: 

  1. Requires less memory.
  2. Updates the model parameters frequently.
  3. When momentum is used, it helps reduce noise in the updates.

Disadvantages: 

  1. High Variance
  2. Computationally Expensive

AdaGrad Optimizer

AdaGrad stands for Adaptive Gradient Algorithm. The AdaGrad optimizer adapts the learning rate for individual parameters, i.e. some weights may have learning rates different from others, which makes it well suited to sparse features. 

Syntax: tf.keras.optimizers.Adagrad(learning_rate=0.001,
                                     initial_accumulator_value=0.1,
                                     epsilon=1e-07,
                                     name="Adagrad",
                                     **kwargs)
Parameters: 
learning_rate: rate at which the algorithm updates the parameters. 
               Tensor or float type of value. Default value is 0.001.
initial_accumulator_value: Starting value for the per-parameter 
                           accumulators (momentum values). Floating point 
                           type of value. Must be non-negative. 
                           Default value is 0.1.
epsilon: Small value used to maintain numerical stability. 
         Floating point type of value. Default value is 1e-07.
name: Optional name for the operation.
**kwargs: Keyword arguments of variable length.
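
A hedged sketch of where Adagrad tends to fit well (assuming TensorFlow 2.x; the sparse-feature data, model, and learning rate below are hypothetical illustrative choices): training an embedding-based model, where per-parameter learning rates help rarely seen features:

import numpy as np
import tensorflow as tf

# Hypothetical sparse-feature setup: integer ids fed into an embedding layer
vocab_size, num_samples = 1000, 512
x_ids = np.random.randint(0, vocab_size, size=(num_samples, 10)).astype("int32")
y = np.random.randint(0, 2, size=(num_samples, 1)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,), dtype="int32"),
    tf.keras.layers.Embedding(vocab_size, 8),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Adagrad keeps a per-parameter accumulator, so rarely updated embedding rows
# retain a comparatively larger effective learning rate
opt = tf.keras.optimizers.Adagrad(learning_rate=0.01,          # illustrative, not the default
                                  initial_accumulator_value=0.1)
model.compile(optimizer=opt, loss="binary_crossentropy")
model.fit(x_ids, y, epochs=2, verbose=0)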

Advantages:

  1. Best suited for sparse datasets.
  2. The learning rate adapts with each iteration.

Disadvantages:

  1. The effective learning rate keeps shrinking as squared gradients accumulate over training, which can stall learning in deep networks.
  2. May result in the dead-neuron problem.

RMSprop Optimizer

RMSprop stands for Root Mean Square Propagation. Instead of letting squared gradients accumulate indefinitely, the RMSprop optimizer keeps a discounted (moving) average over a recent window of gradients. It can be considered an updated version of AdaGrad with a few improvements. RMSprop uses plain momentum instead of Nesterov momentum.

Syntax: tf.keras.optimizers.RMSprop(learning_rate=0.001, 
                                    rho=0.9, 
                                    momentum=0.0, 
                                    epsilon=1e-07, 
                                    centered=False,
                                    name='RMSprop', 
                                    **kwargs)
Parameters:
learning_rate: rate at which the algorithm updates the parameters. 
               Tensor or float type of value. Default value is 0.001.
rho: Discounting factor for the history of gradients. Default value is 0.9.
momentum: accelerates RMSprop in the appropriate direction. 
          Float type of value. Default value is 0.0.
epsilon: Small value used to maintain numerical stability. 
         Floating point type of value. Default value is 1e-07.
centered: If True, gradients are normalized by the estimated variance of 
          the gradient. Boolean type of value. Setting it to True may
          help with training the model, but it is computationally 
          more expensive. Default value is False.
name: Optional name for the operation.
**kwargs: Keyword arguments of variable length.
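
A minimal sketch (assuming TensorFlow 2.x) of constructing RMSprop with its main knobs and passing it to model.compile; the tiny model and the hyperparameter values are illustrative only:

import tensorflow as tf

# Centered RMSprop with momentum; the values are illustrative, not tuned
opt = tf.keras.optimizers.RMSprop(learning_rate=0.001,
                                  rho=0.9,
                                  momentum=0.9,
                                  centered=True)

model = tf.keras.Sequential([tf.keras.Input(shape=(4,)),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer=opt, loss="mse")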

Advantages:

  1. The learning rate is adjusted automatically.
  2. A separate learning rate is maintained for every parameter.

Disadvantage: Slow learning

Adadelta Optimizer

The Adaptive Delta (Adadelta) optimizer is an extension of AdaGrad (similar in spirit to RMSprop); however, Adadelta removes the need for a manually set learning rate by replacing it with an exponential moving average of squared deltas (the differences between current and updated weights). It also tries to eliminate the decaying-learning-rate problem.

Syntax: tf.keras.optimizers.Adadelta(learning_rate=0.001, 
                                     rho=0.95, 
                                     epsilon=1e-07, 
                                     name='Adadelta',
                                     **kwargs)
Parameters:
learning_rate: rate at which the algorithm updates the parameters. 
               Tensor or float type of value. Default value is 0.001.
rho: Decay rate. Tensor or floating point type of value.
     Default value is 0.95.
epsilon: Small value used to maintain numerical stability. 
         Floating point type of value. Default value is 1e-07.
name: Optional name for the operation.
**kwargs: Keyword arguments of variable length.
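
A minimal sketch (assuming TensorFlow 2.x) showing Adadelta adapting its own step sizes on a toy quadratic loss; the variable, loss, and learning_rate=1.0 (which mimics the paper's learning-rate-free formulation, whereas the Keras default is 0.001) are choices made for illustration:

import tensorflow as tf

w = tf.Variable(5.0)

# rho controls the decay of the running averages of squared gradients and updates
opt = tf.keras.optimizers.Adadelta(learning_rate=1.0, rho=0.95)

for _ in range(200):
    with tf.GradientTape() as tape:
        loss = (w - 2.0) ** 2
    opt.apply_gradients(zip(tape.gradient(loss, [w]), [w]))

print(w.numpy())  # drifts toward 2.0; step sizes adapt as statistics accumulate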

Advantage: Setting of default learning rate is not required.

Disadvantage: Computationally expensive

Adam Optimizer

Adaptive Moment Estimation (Adam) is among the most widely used optimization techniques today. It computes an adaptive learning rate for each parameter and combines the advantages of RMSprop and momentum: it stores an exponentially decaying average of past gradients as well as of past squared gradients. 

Syntax: tf.keras.optimizers.Adam(learning_rate=0.001, 
                                 beta_1=0.9, 
                                 beta_2=0.999, 
                                 epsilon=1e-07, 
                                 amsgrad=False,
                                 name='Adam', 
                                 **kwargs)
Parameters:
learning_rate: rate at which the algorithm updates the parameters. 
               Tensor or float type of value. Default value is 0.001.
beta_1: Exponential decay rate for the 1st moment estimates. Constant float 
        tensor or float type of value. Default value is 0.9.
beta_2: Exponential decay rate for the 2nd moment estimates. Constant float 
        tensor or float type of value. Default value is 0.999.
epsilon: Small value used to maintain numerical stability. 
         Floating point type of value. Default value is 1e-07.
amsgrad: Whether to use the AMSGrad variant or not. 
         Default value is False.
name: Optional name for the operation.
**kwargs: Keyword arguments of variable length.
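
A brief sketch (assuming TensorFlow 2.x) that enables the AMSGrad variant when building Adam and passes it to model.compile; the small classification model is made up for illustration:

import tensorflow as tf

# Adam with the AMSGrad variant enabled; the beta values shown are the defaults
opt = tf.keras.optimizers.Adam(learning_rate=0.001,
                               beta_1=0.9,
                               beta_2=0.999,
                               amsgrad=True)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer=opt,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])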

Advantages:

  1. Easy Implementation
  2. Requires less memory
  3. Computationally efficient

Disadvantages:

  1. Can have weight decay problem
  2. Sometimes may not converge to an optimal solution

AdaMax Optimizer

AdaMax is a variant of the Adam optimizer. It is built on adaptive estimates of lower-order moments, generalized to the infinity norm. For models that rely heavily on embeddings, AdaMax is sometimes considered better than Adam. 

Syntax: tf.keras.optimizers.Adamax(learning_rate=0.001, 
                                   beta_1=0.9, 
                                   beta_2=0.999, 
                                   epsilon=1e-07,
                                   name='Adamax', 
                                   **kwargs)
Parameters:
learning_rate: rate at which the algorithm updates the parameters. 
               Tensor or float type of value. Default value is 0.001.
beta_1: Exponential decay rate for the 1st moment estimates. Constant float 
        tensor or float type of value. Default value is 0.9.
beta_2: Exponential decay rate for the exponentially weighted infinity norm. 
        Constant float tensor or float type of value. 
        Default value is 0.999.
epsilon: Small value used to maintain numerical stability. 
         Floating point type of value. Default value is 1e-07.
name: Optional name for the operation.
**kwargs: Keyword arguments of variable length.
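
A hedged sketch (assuming TensorFlow 2.x) of the kind of embedding-heavy model for which Adamax is sometimes preferred; the layer sizes are arbitrary illustrative choices:

import tensorflow as tf

# An embedding-heavy text-style model; layer sizes are arbitrary
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,), dtype="int32"),
    tf.keras.layers.Embedding(input_dim=5000, output_dim=32),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.Adamax(learning_rate=0.001),
              loss="binary_crossentropy")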

Advantages: 

  1. The use of the infinity norm makes the algorithm more stable.
  2. Requires less tuning of hyperparameters.

Disadvantage: Generalization Issue

NAdam Optimizer

NAdam is short for Nesterov-accelerated Adam. NAdam uses Nesterov momentum for the gradient update instead of the vanilla momentum used by Adam. 

Syntax: tf.keras.optimizers.Nadam(learning_rate=0.001, 
                                  beta_1=0.9, 
                                  beta_2=0.999, 
                                  epsilon=1e-07,
                                  name='Nadam', 
                                  **kwargs)
Parameters:
learning_rate: rate at which the algorithm updates the parameters. 
               Tensor or float type of value. Default value is 0.001.
beta_1: Exponential decay rate for the 1st moment estimates. Constant float 
        tensor or float type of value. Default value is 0.9.
beta_2: Exponential decay rate for the 2nd moment estimates. 
        Constant float tensor or float type of value. 
        Default value is 0.999.
epsilon: Small value used to maintain numerical stability. 
         Floating point type of value. Default value is 1e-07.
name: Optional name for the operation.
**kwargs: Keyword arguments of variable length.
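
A minimal sketch (assuming TensorFlow 2.x) using Nadam as a drop-in replacement for Adam in model.compile; the model is illustrative only:

import tensorflow as tf

# Nadam used as a drop-in replacement for Adam
model = tf.keras.Sequential([tf.keras.Input(shape=(32,)),
                             tf.keras.layers.Dense(10)])
model.compile(optimizer=tf.keras.optimizers.Nadam(learning_rate=0.001,
                                                  beta_1=0.9,
                                                  beta_2=0.999),
              loss=tf.keras.losses.MeanSquaredError())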

Advantages: 

  1. Gives better results for gradients with high curvature or noisy gradients.
  2. Learns faster

Disadvantage: Sometimes may not converge to an optimal solution 

FTRL Optimizer

Follow The Regularized Leader (FTRL) is an optimization algorithm best suited for shallow models with large, sparse feature spaces. This version supports both shrinkage-type L2 regularization (an L2 penalty added to the loss function) and online L2 regularization.  

Syntax: tf.keras.optimizers.Ftrl(learning_rate=0.001, 
                                 learning_rate_power=-0.5, 
                                 initial_accumulator_value=0.1,
                                 l1_regularization_strength=0.0, 
                                 l2_regularization_strength=0.0,
                                 name='Ftrl', 
                                 l2_shrinkage_regularization_strength=0.0, 
                                 beta=0.0,
                                 **kwargs)
Parameters:
learning_rate: rate at which the algorithm updates the parameters. 
               Tensor or float type of value. Default value is 0.001.
learning_rate_power: Controls how the learning rate decreases during 
                     training. Float type of value. Must be less
                     than or equal to 0. Default value is -0.5.
initial_accumulator_value: Initial value for the accumulators. Must be
                           greater than or equal to zero.
                           Default value is 0.1.
l1_regularization_strength: L1 (sparsity) penalty. Only positive values 
                            or 0 are allowed. Float type of value. 
                            Default value is 0.0.
l2_regularization_strength: L2 (stabilization) penalty. Only positive 
                            values or 0 are allowed. Float type of value. 
                            Default value is 0.0.
name: Optional name for the operation.
l2_shrinkage_regularization_strength: Shrinkage-type L2 (magnitude) penalty, 
                           added to the loss function. Only positive values 
                           or 0 are allowed. Float type of value. 
                           Default value is 0.0.
beta: Float type of value. Default value is 0.0.
**kwargs: Keyword arguments of variable length.
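
A hedged sketch (assuming TensorFlow 2.x) of the typical FTRL use case: a wide, sparse linear model with L1/L2 regularization. The multi-hot toy data and the regularization strengths are made up for illustration:

import numpy as np
import tensorflow as tf

# Hypothetical wide, sparse input: multi-hot encoded features
num_features, num_samples = 10000, 256
x = (np.random.rand(num_samples, num_features) < 0.01).astype("float32")
y = np.random.randint(0, 2, size=(num_samples, 1)).astype("float32")

# A single linear ("wide") layer, the setting where FTRL is typically applied
model = tf.keras.Sequential([tf.keras.Input(shape=(num_features,)),
                             tf.keras.layers.Dense(1, activation="sigmoid")])

opt = tf.keras.optimizers.Ftrl(learning_rate=0.001,
                               l1_regularization_strength=0.01,   # encourages sparse weights
                               l2_regularization_strength=0.01)   # illustrative strengths
model.compile(optimizer=opt, loss="binary_crossentropy")
model.fit(x, y, epochs=1, verbose=0)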

Advantage: Can minimize the loss function effectively, especially for sparse, shallow models.

Disadvantages:

  1. Cannot achieve adequate stability if the regularization strength is too small.
  2. If the regularization strength is too large, the solution ends up far from the optimal decision.

