
Intuition behind Adagrad Optimizer

Last Updated : 26 Nov, 2020

Adagrad stands for Adaptive Gradient Optimizer. Optimizers such as Gradient Descent, Stochastic Gradient Descent (SGD) and mini-batch SGD all minimize the loss function by updating the weights. The weight update formula is as follows:

w_{\text{new}} = w_{\text{old}} - \eta \frac{\partial L}{\partial w_{\text{old}}}

In terms of iterations, this formula can be written as:

w_{t} = w_{t-1} - \eta \frac{\partial L}{\partial w_{t-1}}

where

w_t = value of w at the current iteration, w_{t-1} = value of w at the previous iteration, and η = the learning rate.
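As a quick illustration, here is a minimal NumPy sketch of this plain update rule. The sgd_step helper and the quadratic loss L(w) = w² are hypothetical choices for the example, not part of any library.

import numpy as np

# Plain gradient-descent update: w_t = w_{t-1} - eta * dL/dw, using one fixed
# learning rate eta shared by every weight.
def sgd_step(w, grad, eta=0.01):
    return w - eta * grad

# Toy example: minimize L(w) = w^2, whose gradient is 2w.
w = np.array([5.0])
for _ in range(100):
    w = sgd_step(w, grad=2 * w, eta=0.1)
print(w)  # steadily approaches 0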

In SGD and mini-batch SGD, the value of η is the same for every weight, or say for every parameter; typically, η = 0.01. The core idea of the Adagrad optimizer is that each weight gets its own learning rate (η). This modification matters because real-world datasets mix sparse features (for example, in Bag of Words most feature values are zero) with dense features (most values are non-zero), so keeping the same learning rate for all the weights is not good for optimization. The weight update formula for Adagrad looks like:

w_{t} = w_{t-1} - \eta_{t}^{\prime} \frac{\partial L}{\partial w_{t-1}}

where \eta_{t}^{\prime} denotes a different learning rate for each weight at each iteration, defined as:

\eta_{t}^{\prime} = \frac{\eta}{\sqrt{\alpha_{t} + \epsilon}}

Here, η is a constant and ε (epsilon) is a small positive number that avoids a divide-by-zero error in case α_t becomes 0; without it the update would be undefined and the weights could not be adjusted sensibly. α_t is the sum of the squared gradients up to iteration t:

\alpha_{t} = \sum_{i=1}^{t} g_{i}^{2}, \quad g_{i} = \frac{\partial L}{\partial w_{i-1}}

g_i is the derivative of the loss with respect to the weight at iteration i, and g_i² is always non-negative since it is a square term. This means α_t remains positive and can only grow, i.e. α_t ≥ α_{t-1}.

It can be seen from the formula that α_t and \eta_t^' are inversely related: as α_t increases, \eta_t^' decreases. So as the number of iterations grows, the learning rate shrinks adaptively, and there is no need to select it manually.
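Putting the formulas above together, here is a minimal NumPy sketch of the Adagrad update. The adagrad_step helper and the toy sparse/dense example are illustrative assumptions, not a reference implementation.

import numpy as np

def adagrad_step(w, grad, alpha, eta=0.01, eps=1e-8):
    """One Adagrad update: alpha accumulates squared gradients per weight,
    giving each weight its own effective learning rate eta / sqrt(alpha + eps)."""
    alpha = alpha + grad ** 2               # alpha_t = alpha_{t-1} + g_t^2
    eta_t = eta / np.sqrt(alpha + eps)      # per-weight learning rate eta'_t
    w = w - eta_t * grad                    # w_t = w_{t-1} - eta'_t * g_t
    return w, alpha

# Toy example on L(w) = w1^2 + w2^2, where the second weight only receives a
# gradient every 10th step (mimicking a sparse feature). Its alpha grows more
# slowly, so it keeps a larger effective learning rate than the dense weight.
w = np.array([5.0, 5.0])
alpha = np.zeros_like(w)
for step in range(1, 101):
    grad = 2 * w
    if step % 10 != 0:
        grad[1] = 0.0                       # sparse weight: gradient usually zero
    w, alpha = adagrad_step(w, grad, alpha, eta=0.5)

print("weights:", w)
print("effective learning rates:", 0.5 / np.sqrt(alpha + 1e-8))

The accumulator alpha is threaded through explicitly here so the per-weight history is easy to inspect; library implementations keep it as internal optimizer state.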

Advantages of Adagrad:

  • No manual tuning of the learning rate required.
  • Faster convergence
  • More reliable

One main disadvantage of the Adagrad optimizer is that α_t keeps growing as the number of iterations increases, so \eta_t^' becomes very small. The update then barely changes the weights, making the new weight almost equal to the old one, which leads to slow convergence.
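To see this numerically, here is a small sketch that assumes a constant gradient magnitude of 1 (so α_t = t); the effective learning rate then falls roughly as 1/√t.

import numpy as np

eta, eps = 0.01, 1e-8
for t in [1, 10, 100, 1000, 10000]:
    alpha_t = float(t)                      # with |g_i| = 1, alpha_t = t
    print(t, eta / np.sqrt(alpha_t + eps))  # 0.01, 0.0032, 0.001, 0.00032, 0.0001
# After many iterations the update eta'_t * g_t is tiny, so training stalls.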

