Gradient descent is an iterative optimization algorithm used to find a minimum of a function. The general idea is to initialize the parameters to random values and then, at each iteration, take a small step in the direction opposite to the gradient (the “slope”), that is, downhill. Gradient descent is widely used in supervised learning to minimize the error function and find the optimal values of the parameters.
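As a minimal sketch of the basic update, the snippet below minimizes a one-dimensional function. The function name `gradient_descent`, the test function f(w) = (w - 3)², the learning rate and the iteration count are all assumptions chosen for illustration. A fixed starting point is used instead of a random one so that the run is reproducible.

```python
def gradient_descent(grad, w0, alpha=0.1, n_iters=100):
    """Minimize a 1-D function given its derivative `grad` (illustrative helper)."""
    w = w0
    for _ in range(n_iters):
        dw = grad(w)        # slope at the current point
        w = w - alpha * dw  # small step against the slope
    return w


# Hypothetical test function: f(w) = (w - 3)^2, so f'(w) = 2 * (w - 3).
# Its minimum is at w = 3.
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(w_min)
```

With these settings the distance to the minimum shrinks by a factor of 0.8 per step, so the printed value should be very close to 3.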

Various extensions of the gradient descent algorithm have been proposed. Some of them are discussed below:

**Momentum method**: This method accelerates gradient descent by using an exponentially weighted average of the gradients instead of the raw gradient. Averaging helps the algorithm converge to a minimum faster, because gradient components that flip sign from one iteration to the next largely cancel out, while components that point consistently in the same direction are preserved. The pseudocode for the momentum method is given below.

    V = 0
    for each iteration i:
        compute dW
        V = β V + (1 - β) dW
        W = W - α V

V and dW are analogous to velocity and acceleration respectively: each new gradient dW changes the accumulated velocity V, and β acts like friction. α is the learning rate, and β is typically set to 0.9.
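The momentum update above could be sketched in plain Python as follows. The helper `momentum_descent`, the two-parameter test function f(w) = w₀² + 10 w₁², the starting point and the iteration count are assumptions made for illustration. Only β = 0.9 comes from the text.

```python
def momentum_descent(grad, w0, alpha=0.1, beta=0.9, n_iters=500):
    """Gradient descent with momentum over a list of parameters (illustrative)."""
    w = list(w0)
    v = [0.0] * len(w)  # V = 0
    for _ in range(n_iters):
        dw = grad(w)  # compute dW
        # V = beta * V + (1 - beta) * dW, applied element-wise
        v = [beta * vj + (1 - beta) * gj for vj, gj in zip(v, dw)]
        # W = W - alpha * V
        w = [wj - alpha * vj for wj, vj in zip(w, v)]
    return w


def grad_f(w):
    """Gradient of the hypothetical test function f(w) = w[0]**2 + 10 * w[1]**2."""
    return [2 * w[0], 20 * w[1]]


w_final = momentum_descent(grad_f, [5.0, 5.0])
print(w_final)
```

Both parameters should end up very close to 0, the minimum of the test function, even though the two directions have quite different curvature.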

**RMSprop**: RMSprop was proposed by Geoffrey Hinton of the University of Toronto. The intuition is to apply an exponentially weighted average to the second moment of the gradients (dW²) and to scale each step by the square root of that average. The pseudocode for this is as follows:

    S = 0
    for each iteration i:
        compute dW
        S = β S + (1 - β) dW²
        W = W - α dW / (√S + ε)

Here ε is a small constant that prevents division by zero.

**Adam Optimization**: The Adam optimization algorithm combines the momentum method and RMSprop, along with bias correction. The pseudocode for this approach is as follows:

    V = 0
    S = 0
    for each iteration i = 1, 2, ...:
        compute dW
        V = β₁ V + (1 - β₁) dW
        S = β₂ S + (1 - β₂) dW²
        V_corr = V / (1 - β₁^i)
        S_corr = S / (1 - β₂^i)
        W = W - α V_corr / (√S_corr + ε)

The bias-corrected values V_corr and S_corr are used only in the parameter update. The running averages V and S themselves are carried over to the next iteration uncorrected.

Kingma and Ba, who proposed Adam, recommended the following values for the hyperparameters.

α = 0.001, β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸
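The RMSprop and Adam updates described above can be sketched for a single scalar parameter as shown below. The function names, the test function f(w) = (w - 3)² and the iteration count are assumptions for illustration. The RMSprop defaults (α = 0.01, β = 0.9, ε = 10⁻⁸) are also assumptions, since the text gives no values for them. The Adam defaults follow the recommended values listed above, although the demo call passes a larger α = 0.01 so that this toy problem converges within a few thousand steps.

```python
import math


def rmsprop(grad, w, alpha=0.01, beta=0.9, eps=1e-8, n_iters=2000):
    """RMSprop for one scalar parameter (illustrative; defaults are assumptions)."""
    s = 0.0
    for _ in range(n_iters):
        dw = grad(w)
        s = beta * s + (1 - beta) * dw ** 2        # running average of dW^2
        w = w - alpha * dw / (math.sqrt(s) + eps)  # step scaled by sqrt(S)
    return w


def adam(grad, w, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, n_iters=2000):
    """Adam for one scalar parameter (illustrative sketch)."""
    v, s = 0.0, 0.0
    for i in range(1, n_iters + 1):  # i starts at 1 so that 1 - beta**i is nonzero
        dw = grad(w)
        v = beta1 * v + (1 - beta1) * dw       # momentum term
        s = beta2 * s + (1 - beta2) * dw ** 2  # RMSprop term
        v_corr = v / (1 - beta1 ** i)          # bias correction
        s_corr = s / (1 - beta2 ** i)
        w = w - alpha * v_corr / (math.sqrt(s_corr) + eps)
    return w


def grad_f(w):
    """Derivative of the hypothetical test function f(w) = (w - 3)**2."""
    return 2 * (w - 3)


w_rms = rmsprop(grad_f, 0.0)
w_adam = adam(grad_f, 0.0, alpha=0.01)
print(w_rms, w_adam)
```

Both results should land near the minimum at w = 3. With a constant learning rate, RMSprop typically keeps hovering within a small neighbourhood of the minimum rather than settling on it exactly, so an approximate match is what to expect. For vector parameters the same updates are applied element-wise.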
