
Optimization Rule in Deep Neural Networks

Last Updated : 13 Dec, 2023

In machine learning, the loss function and the optimizer are two components that work together to improve a model's performance. A loss function evaluates how well the model is doing by measuring the difference between its predicted and actual outputs; examples include log loss, hinge loss, and mean squared error. The optimizer improves the model by adjusting its parameters so that the loss function value decreases; examples include SGD, RMSProp, and Adam. In a neural network, the optimizer's job is to find the combination of weights and biases that gives the model the best chance of making accurate predictions.

Optimization Rule in Deep Neural Networks

There are various optimization techniques for adjusting model weights and learning rates, such as Gradient Descent, Stochastic Gradient Descent, Stochastic Gradient Descent with momentum, Mini-Batch Gradient Descent, AdaGrad, RMSProp, AdaDelta, and Adam. These techniques play a critical role in the training of neural networks, as they improve the model by adjusting its parameters to minimize the loss function value. Choosing the best optimizer depends on the application.

Before we proceed, it's essential to be familiar with a few terms:

  1. An epoch is the number of times the algorithm iterates over the entire training dataset.
  2. Batch size is the number of samples used for a single update of the model parameters.
  3. A sample is a single record of data in a dataset.
  4. Learning rate is a parameter determining the scale of the model weight updates.
  5. Weights and biases are the learnable parameters of a model that regulate the signal between two neurons.

Gradient Descent

A derivative or gradient indicates the direction of steepest increase of a function, so the negative of the derivative or gradient indicates the direction of decrease. This fact is used to minimize the value of the function.

In gradient descent, we initialize the parameters with random values. Then:

  1. We calculate the derivative/gradient of the loss with respect to each parameter.
  2. We take a step in the direction of the negative derivative/gradient, scaled by a learning rate. The learning rate controls the descent: too large a learning rate may cause oscillations, while too small a learning rate leads to slow convergence, so choosing a good value for the learning rate is critical.
  3. We repeat this iteratively until a convergence criterion is reached.

Formula :

θ_{(k+1)} = θ_k - α ∇J(θ_k)

where,

  • θ(k+1) is the updated parameter vector at the (k+1)th iteration.
  • θk is the current parameter vector at the kth iteration.
  • α is the learning rate, which is a positive scalar that determines the step size for each iteration.
  • ∇J(θk) is the gradient of the cost or loss function J with respect to the parameters θk

In Gradient Descent, a single step is taken by considering the entire training dataset: the gradients of all training examples are averaged, and this mean gradient is used to update the parameters. This constitutes one step of Gradient Descent within a single epoch or iteration.
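To make the update rule concrete, below is a minimal NumPy sketch of batch gradient descent on a toy quadratic loss. The loss function, learning rate, and stopping tolerance are illustrative choices, not part of any particular library.

```python
import numpy as np

def gradient_descent(grad_fn, theta0, lr=0.1, tol=1e-6, max_iters=1000):
    """Repeatedly apply theta <- theta - lr * gradient until the step is tiny."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):
        step = lr * grad_fn(theta)          # alpha * grad J(theta_k)
        theta = theta - step                # move against the gradient
        if np.linalg.norm(step) < tol:      # simple convergence criterion
            break
    return theta

# Toy loss J(theta) = ||theta - 3||^2, whose gradient is 2 * (theta - 3).
grad_J = lambda theta: 2.0 * (theta - 3.0)
print(gradient_descent(grad_J, theta0=[0.0, 0.0]))  # converges near [3., 3.]
```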

Advantages

  • Easy to implement and compute

Disadvantages

  • Chances of getting stuck in local minima.
  • If the dataset is too large, each step becomes computationally expensive and requires a large amount of memory.

Gradient descent with Armijo Goldstein condition:

It's a variant of gradient descent in which we ensure that each step produces a sufficient decrease in the objective function while avoiding unnecessarily small steps. Here the step size is determined through a line search that must satisfy the Armijo condition. The process is as follows (a short code sketch follows the steps):

  1. Initialization: We set an initial guess x for the minimizer of the function f(x).
  2. Gradient: We compute the gradient of the objective function, ∇f(x).
  3. Line Search: We start with a large step size α and check whether the reduction in the function value (old value minus updated value) satisfies the condition below, known as the Armijo condition:
    f(x^{t-1}) - f(x^{t-1} - α∇f(x^{t-1})) \ge c α ||∇f(x^{t-1})||^2
    Here,
    • x^t is the value we are trying to find at time step t, and x^{t-1} is the value at step t-1.
    • α is the step size.
    • c is a constant between 0 and 1.
    • If we do not get the required reduction, we shrink the step size by a factor β ∈ (0, 1) and test again, iterating until the Armijo condition is satisfied.
    • Why this value? A first-order Taylor expansion predicts a decrease in f(x) of about "step size * ||∇f(x)||^2". This predicted value is generally not attainable in practice, which is why we only demand the fraction c of it.
  4. Update: Update the solution parameters with the chosen step size.
  5. Convergence Check: This can be done by examining the magnitude of the gradient, the change in the objective function value, or other convergence criteria.
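Below is a minimal sketch of one backtracking line-search step using the Armijo sufficient-decrease condition above. The constants c and β, the initial step size, and the toy objective are illustrative choices, not prescribed values.

```python
import numpy as np

def armijo_step(f, grad_f, x, alpha0=1.0, c=1e-4, beta=0.5, max_backtracks=50):
    """Shrink the step size until the Armijo sufficient-decrease condition holds."""
    g = grad_f(x)
    alpha = alpha0
    for _ in range(max_backtracks):
        # Armijo: f(x) - f(x - alpha * g) >= c * alpha * ||g||^2
        if f(x) - f(x - alpha * g) >= c * alpha * np.dot(g, g):
            break
        alpha *= beta                     # required decrease not met -> smaller step
    return x - alpha * g                  # take the accepted step

# Illustrative use on f(x) = ||x||^2.
f = lambda x: np.dot(x, x)
grad_f = lambda x: 2.0 * x
x = np.array([4.0, -2.0])
for _ in range(20):
    x = armijo_step(f, grad_f, x)
print(x)  # approaches the minimizer [0., 0.]
```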

Gradient descent with Armijo Full Relaxation condition:

It is an optimization algorithm that combines the Armijo line-search condition with curvature information. It uses both first-derivative (gradient) and second-derivative (Hessian) information to find a step size that ensures a sufficient decrease in the objective function while accounting for the curvature of the function.

  1. Initialization: We set an initial guess x for the minimizer of the function f(x).
  2. Gradient: We compute the gradient of the objective function, ∇f(x).
  3. Line Search: Here the step size should satisfy the condition below:
    f(x^{t-1}) - f(x^{t-1} - α∇f(x^{t-1})) \ge c α ||∇f(x^{t-1})||^2 + \frac{b}{2} α^2 ∇f(x^{t-1})^T H(x^{t-1}) ∇f(x^{t-1})
    Here,
    • H(x) is the Hessian.
    • 0 < c < b < 1 are constants that determine how much the function must decrease and how much the curvature of the function is taken into account.
    • If we do not get the required reduction, we shrink the step size by a factor β ∈ (0, 1) iteratively until the above condition is satisfied.
  4. Update: Update the solution parameters with the chosen step size.
  5. Convergence Check: This can be done by examining the magnitude of the gradient, the change in the objective function value, or other convergence criteria.
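A sketch of the same backtracking idea with the curvature-aware condition above. The constants c, b, and β are illustrative, and the Hessian of the toy quadratic is supplied explicitly; this is only one way the condition could be wired up.

```python
import numpy as np

def armijo_curvature_step(f, grad_f, hess_f, x, alpha0=1.0,
                          c=1e-4, b=0.9, beta=0.5, max_backtracks=50):
    """Backtracking step whose required decrease also includes a Hessian term."""
    g, H = grad_f(x), hess_f(x)
    curvature = g @ H @ g                       # grad^T * Hessian * grad
    alpha = alpha0
    for _ in range(max_backtracks):
        required = c * alpha * np.dot(g, g) + 0.5 * b * alpha**2 * curvature
        if f(x) - f(x - alpha * g) >= required:
            break
        alpha *= beta                           # shrink the step and retest
    return x - alpha * g

# Illustrative use on the quadratic f(x) = x^T A x, whose Hessian is 2A.
A = np.array([[2.0, 0.0], [0.0, 1.0]])
f = lambda x: x @ A @ x
grad_f = lambda x: 2.0 * A @ x
hess_f = lambda x: 2.0 * A
x = np.array([3.0, -1.5])
for _ in range(30):
    x = armijo_curvature_step(f, grad_f, hess_f, x)
print(x)  # approaches [0., 0.]
```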

Stochastic Gradient Descent (SGD):

It’s a variation of the Gradient Descent algorithm. In Gradient Descent, we analyze the entire dataset in each step, which may not be efficient when dealing with very large datasets. To address this issue, we have Stochastic Gradient Descent (SGD). In Stochastic Gradient Descent, we process just one example at a time to perform a single step. So, if the dataset contains 10000 rows, SGD will update the model parameters 10000 times in a single cycle through the dataset, as opposed to just once in the case of Gradient Descent.

Here’s the process:

  1. Select an example from the dataset.
  2. Calculate its gradient.
  3. Utilize the calculated gradient from step 2 to update the model weights.
  4. Repeat steps 1 to 3 for all examples in the training dataset.
  5. Completing a full pass through all the examples constitutes one epoch.
  6. Repeat this entire process for several epochs as specified during training.
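The loop below is a minimal NumPy sketch of this procedure for a simple linear-regression loss: one parameter update per training example, repeated for several epochs. The synthetic data, learning rate, and epoch count are illustrative only.

```python
import numpy as np

def sgd(X, y, lr=0.01, epochs=5, seed=0):
    """Plain SGD: update the weights after every single training example."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):                      # one full pass over the data = one epoch
        for i in rng.permutation(len(X)):        # visit examples in random order
            error = X[i] @ w - y[i]              # residual for this single example
            w -= lr * error * X[i]               # gradient of 0.5 * error^2 w.r.t. w
    return w

# Synthetic regression data: y ~ 2*x0 - 1*x1 (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + 0.01 * rng.normal(size=200)
print(sgd(X, y))  # roughly [2., -1.]
```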

Advantages

  • Requires less memory
  • The noisy updates may help the model escape shallow local minima and find new ones.

Disadvantages

  • The SGD algorithm is noisier and takes more iterations to converge compared to gradient descent.

Mini Batch Stochastic Gradient Descent:

We utilize mini-batch stochastic gradient descent, in which each update uses a predetermined number of training examples that is smaller than the full dataset. This approach combines the advantages of the previously mentioned variants. In one epoch, after creating the fixed-size mini-batches, we execute the following steps (a short code sketch follows the list):

  1. Select a mini-batch.
  2. Compute the mean gradient of the mini-batch.
  3. Apply the mean gradient obtained in step 2 to update the model’s weights.
  4. Repeat steps 1 to 3 for all the mini-batches that have been created.
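A minimal sketch of the same kind of regression problem trained with mini-batches; the batch size, learning rate, and epoch count are illustrative choices.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.05, epochs=20, batch_size=32, seed=0):
    """Mini-batch SGD: each update uses the mean gradient of one mini-batch."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)                     # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]      # indices of one mini-batch
            errors = X[idx] @ w - y[idx]
            w -= lr * X[idx].T @ errors / len(idx)     # mean gradient over the batch
    return w

# Illustrative data: y ~ 1.5*x0 + 0.5*x1.
rng = np.random.default_rng(1)
X = rng.normal(size=(512, 2))
y = X @ np.array([1.5, 0.5]) + 0.01 * rng.normal(size=512)
print(minibatch_sgd(X, y))  # roughly [1.5, 0.5]
```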

Advantages

  • Requires a medium amount of memory
  • Less time required to converge when compared to SGD

Disadvantage

  • May get stuck at local minima

SGD with Momentum:

In Stochastic Gradient Descent, we don’t calculate the precise derivative of our loss function. Instead, we estimate it using a small batch. This results in “noisy” derivatives, which implies that we don’t always move in the optimal direction. To address this issue, Momentum was introduced to mitigate the noise in SGD. It speeds up convergence towards the relevant direction and diminishes fluctuations in irrelevant directions.

The concept behind Momentum is to denoise the derivatives by using an exponentially weighted average, which assigns more weight to recent updates than to older ones.

Update for the momentum term (often denoted as “v” or “m”):

v_{(t+1)} = β * v_t + (1 - β) * ∇J(θ_t)

Here

  • v(t+1) is the updated momentum at time t+1.
  • vt is the momentum at time t.
  • β is the momentum coefficient (typically a value between 0 and 1).
  • ∇J(θt) is the gradient of the cost or loss function with respect to the parameters at time t.

Then, we update the parameters using the momentum term

Formula : θ_{(t+1)} = θ_t - α * v_{(t+1)}

  • θ(t+1) is the updated parameter vector at time t+1.
  • θt is the current parameter vector at time t.
  • α is the learning rate.
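Below is a minimal sketch of one momentum update applied repeatedly to a toy quadratic loss, following the two formulas above; the learning rate, β, and iteration count are illustrative.

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.05, beta=0.9):
    """One SGD-with-momentum update: smooth the gradient, then step with it."""
    v = beta * v + (1 - beta) * grad       # v_{t+1} = beta * v_t + (1 - beta) * grad
    theta = theta - lr * v                 # theta_{t+1} = theta_t - alpha * v_{t+1}
    return theta, v

# Illustrative use on J(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([5.0, -3.0])
v = np.zeros_like(theta)
for _ in range(500):
    theta, v = momentum_step(theta, v, grad=2.0 * theta)
print(theta)  # approaches [0., 0.]
```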

Advantages

  • Mitigates parameter oscillations and reduces parameter variance.
  • Achieves faster convergence compared to standard gradient descent.

Disadvantage

  • Introduces an additional hyper-parameter that must be chosen manually and with precision

AdaGrad

AdaGrad, short for adaptive gradient, signifies that the learning rates are adjusted or adapted over time based on previous gradients. A limitation of the previously discussed optimizers is that a single fixed learning rate is used for all parameters throughout training. This can hinder the training of features that exhibit small average gradients, causing them to learn at a slower pace. While one potential solution is to set a different learning rate for each feature, this quickly becomes complex. AdaGrad addresses the issue by implementing the idea that the more a parameter has been updated in the past, the less it will be updated in the future. This gives other features, such as sparse features, an opportunity to catch up. As an optimizer, AdaGrad dynamically adjusts the learning rate for each parameter at every time step t.

For each parameter θ:

  • Initialize a sum of squared gradients variable to zero:
    • G0 = 0
  • At each time step t:
    • Compute the gradient of the cost or loss function with respect to the parameter θ at time t: ∇J(θt).
    • Update the sum of squared gradients:
      G_t = G_{t-1} + (∇J(θ_t))^2
    • Update the parameter θ using the following formula:
      θ_{(t+1)} = θ_t - (α / √(G_t + ε)) * ∇J(θ_t)
      Where
      • Gt is the sum of squared gradients at time t.
      • θt is the current parameter at time t.
      • θ(t+1) is the updated parameter at time t+1.
      • α (alpha) is the learning rate, which is a positive scalar.
      • ∇J(θt) is the gradient of the cost or loss function with respect to the parameter θt at time t.
      • ε (epsilon) is a small constant added to the denominator to prevent division by zero. It is typically a very small value, such as 1e-8.
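A minimal sketch of the AdaGrad update above on a toy quadratic loss; the learning rate and iteration count are illustrative choices.

```python
import numpy as np

def adagrad_step(theta, G, grad, lr=0.5, eps=1e-8):
    """One AdaGrad update: accumulate squared gradients, scale the step per parameter."""
    G = G + grad ** 2                                # G_t = G_{t-1} + grad^2
    theta = theta - lr * grad / np.sqrt(G + eps)     # per-parameter adaptive step
    return theta, G

# Illustrative use on J(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([4.0, -2.0])
G = np.zeros_like(theta)
for _ in range(2000):
    theta, G = adagrad_step(theta, G, grad=2.0 * theta)
print(theta)  # approaches [0., 0.], but more and more slowly as G keeps growing
```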

Advantages:

  • Adaptive learning rates facilitate effective training of all features.

Disadvantages:

  • With a large number of iterations, the learning rate diminishes to extremely small values, causing slow convergence.

RMSProp

The challenge with AdaGrad lies in its notably slow convergence. This is primarily due to the fact that the sum of squared gradients only accumulates and never diminishes. To address this limitation, RMSProp, short for Root Mean Square Propagation, introduces a decay factor. More precisely, it transforms the sum of squared gradients into a decayed sum of squared gradients. The decay rate indicates that only recent gradient squared values are relevant, while those from the distant past are effectively disregarded. Instead of accumulating all previously squared gradients, RMSProp restricts the window of accumulated past gradients to a fixed size ‘w’. It achieves this by using an exponentially moving average instead of the sum of all gradients.

  • Initialize a moving average of squared gradients variable:
    • E[g^2]_0 = 0
  • Set a decay rate (typically close to 1), denoted as γ (gamma).
  • At each time step t:
    • Compute the gradient of the cost or loss function with respect to the parameter θ at time t: ∇J(θt).
    • Update the moving average of squared gradients:
      E[g^2]_t = γ * E[g^2]_{(t-1)} + (1 - γ) * (∇J(θ_t))^2
    • Update the parameter θ using the following formula:
      θ_{(t+1)} = θ_t - (α / √(E[g^2]_t + ε)) * ∇J(θ_t)
      Where,
      • E[g^2]_t is the moving average of squared gradients at time t.
      • θt is the current parameter at time t.
      • θ(t+1) is the updated parameter at time t+1.
      • α (alpha) is the learning rate, which is a positive scalar.
      • ∇J(θt) is the gradient of the cost or loss function with respect to the parameter θt at time t.
      • γ (gamma) is the decay rate, typically close to 1.
      • ε (epsilon) is a small constant added to the denominator to prevent division by zero. It is typically a very small value, such as 1e-8.
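A minimal sketch of the RMSProp update above on the same kind of toy loss; the learning rate and decay rate γ used here are illustrative.

```python
import numpy as np

def rmsprop_step(theta, avg_sq, grad, lr=0.01, gamma=0.9, eps=1e-8):
    """One RMSProp update: a decayed average of squared gradients sets the scale."""
    avg_sq = gamma * avg_sq + (1 - gamma) * grad ** 2   # E[g^2]_t
    theta = theta - lr * grad / np.sqrt(avg_sq + eps)
    return theta, avg_sq

# Illustrative use on J(theta) = ||theta||^2.
theta = np.array([4.0, -2.0])
avg_sq = np.zeros_like(theta)
for _ in range(1000):
    theta, avg_sq = rmsprop_step(theta, avg_sq, grad=2.0 * theta)
print(theta)  # ends up close to [0., 0.]; the effective step no longer shrinks away
```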

Advantages:

  • Prevents the effective learning rate from decaying to zero, allowing training to continue without premature stopping.

Disadvantages:

  • Involves slightly higher computational complexity, since an additional moving average must be computed and stored for every parameter.

Adam

Adam, which stands for Adaptive Moment Estimation, combines the strengths of both Momentum and RMSProp. Adam has been the preferred choice for many deep learning applications in recent years.

For each parameter θ:

  • Initialize the first moment vector (mean of gradients) m0 to zeros:
    • m0 = 0
  • Initialize the second moment vector (uncentered variance of gradients) v0 to zeros:
    • v0 = 0
  • Set the exponential decay rates for the moments (typically close to 1), denoted as β₁ (beta_1) and β₂ (beta_2).
  • Set the small constant ε (epsilon) to prevent division by zero, typically a small value like 1e-8.
  • At each time step t:
    • Compute the gradient of the cost or loss function with respect to the parameter θ at time t: ∇J(θ_t).
    • Update the moving average of the gradients (first moment):
      m_t = β₁ * m_{(t-1)} + (1 - β₁) * ∇J(θ_t)
    • Update the moving average of squared gradients :
      v_t = β₂ * v_{(t-1)} + (1 - β₂) * (∇J(θ_t))^2
    • Correct for bias in the moment estimates:
      • m̂_t = m_t / (1 - β₁^t)
      • v̂_t = v_t / (1 - β₂^t)
    • Update the parameter θ using the following formula:
      θ_{(t+1)} = θ_t - (α / (√v̂_t + ε)) * m̂_t
    • Where:
      • θt is the current parameter at time t.
      • θ(t+1) is the updated parameter at time t+1.
      • α (alpha) is the learning rate, which is a positive scalar.
      • ∇J(θt) is the gradient of the cost or loss function with respect to the parameter θt at time t.
      • β₁ (beta_1) and β₂ (beta_2) are the exponential decay rates for the first and second moments, typically close to 1.
      • ε (epsilon) is a small constant added to the denominator to prevent division by zero, typically a very small value like 1e-8.
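A minimal sketch of the Adam update above, again on a toy quadratic loss; the learning rate and the moment decay rates β₁ and β₂ are illustrative defaults.

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum-style first moment plus RMSProp-style second moment."""
    m = beta1 * m + (1 - beta1) * grad            # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                  # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Illustrative use on J(theta) = ||theta||^2.
theta = np.array([4.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 1001):
    theta, m, v = adam_step(theta, m, v, grad=2.0 * theta, t=t)
print(theta)  # ends up near [0., 0.]
```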

Advantages:

  • The method is fast and converges rapidly.

Disadvantages:

  • Requires more memory, since first- and second-moment estimates must be stored for every parameter, which also makes it more computationally costly.

Comparison with SGD Optimizer

Let us see how each of the subsequent optimizers tackled different issues of SGD, finally leading to Adam, which is now a widely used optimizer.

  • Mini-batch SGD is less noisy than SGD; however, this comes at an increased computation and memory cost per update. It also suffers from the same problems of local minima and a fixed learning rate.
[Figure: Updating processes during SGD and Mini-Batch Gradient Descent]

  • The use of a momentum term in SGD with momentum helps to denoise the gradients and converge faster compared to SGD without momentum. However, it still uses a fixed learning rate.
[Figure: Updating process of SGD with momentum vs SGD without momentum]

  • AdaGrad, an extension of the SGD algorithm, uses an adaptive learning rate that is adjusted automatically per parameter, which can increase prediction accuracy. However, AdaGrad is slow to converge because it keeps accumulating the squared gradients.
  • RMSProp modifies AdaGrad so that the squared gradients are accumulated into an exponentially weighted average. RMSProp discards gradients from the distant past and preserves only recent knowledge of the gradient. This makes convergence faster.
  • Adam is a blend of RMSProp and Momentum. The fixed learning rate issue is resolved using the adaptive learning rate of RMSProp, and the issue of local minima is addressed using Momentum. Due to its overall performance, Adam is often recommended as the default optimizer for various applications. However, Adam uses more memory.

Conclusion

Each optimizer exhibits unique strengths and weaknesses, and the optimal choice depends on the particular deep learning task and the characteristics of the dataset. The selection of an optimizer can profoundly influence the speed and quality of convergence during training, ultimately impacting the final performance of the deep learning model.


