NumPy Gradient Descent Optimizer of Neural Networks
In differential calculus, the derivative of a function tells us how much the output changes with a small nudge in the input variable. This idea extends to multivariable functions, where the gradient collects the partial derivatives with respect to each input. This article shows an implementation of the gradient descent algorithm using NumPy. The idea is simple: start at an arbitrary point, repeatedly move in the direction of the negative gradient (the direction of steepest descent), and return a point that is as close to the minimum as possible.
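In symbols, one step of this procedure is commonly written as follows. The notation is chosen here for illustration: $x_k$ is the current point, $\eta$ is the learning rate, and $\nabla f$ is the gradient of the function being minimized.

```latex
x_{k+1} = x_k - \eta \, \nabla f(x_k)
```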
GD() is a user-defined function employed for this purpose. It takes the following parameters:
- gradient is a function (or any Python callable) that takes a vector and returns the gradient of the function we are trying to minimize.
- start is the arbitrary starting point we give to the function. It is a single number for one independent variable, or a list or NumPy array for the multivariable case.
- learn_rate is the learning rate, which controls the magnitude by which the vector gets updated.
- n_iter is the number of iterations the operation should run.
- tol is the tolerance level that specifies the minimum movement in each iteration; if an update is no larger than this, the algorithm stops.
Given below is the implementation to produce our required functionality.
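A minimal sketch of such a function, assuming the parameter names listed above. The default values for n_iter and tol, and the choice to return a new value rather than modify start in place, are assumptions made for illustration.

```python
import numpy as np

def GD(gradient, start, learn_rate, n_iter=50, tol=1e-6):
    """Gradient descent sketch; the n_iter and tol defaults are assumed."""
    # Convert to a float array so scalars, lists, and arrays all work.
    vector = np.asarray(start, dtype=float)
    for _ in range(n_iter):
        # Move against the gradient, scaled by the learning rate.
        step = -learn_rate * gradient(vector)
        # Stop early once every component of the update is within tol.
        if np.all(np.abs(step) <= tol):
            break
        vector = vector + step
    return vector

# Example call: f(v) = v**2 has gradient 2*v and its minimum at v = 0.
result = GD(gradient=lambda v: 2 * v, start=10.0, learn_rate=0.2)
print("Approximate minimum:", result)
```

The values printed by a call like this depend on the gradient, the starting point, and the hyper-parameters passed in.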
The vector notation of global minima:[9.5 9.25 8.75 8.25 7.75 7.5 ]
The vector notation of global minima: [2.0539126e-15 2.0539126e-15]
Let's look at the relevant concepts used in this function in detail.
Tolerance Level Application
The line of code below enables GD() to terminate early and return before n_iter iterations are completed if the update is less than or equal to the tolerance level. This is particularly useful near a local minimum or a saddle point, where the gradient is very small and each step barely moves the vector. Stopping early in that situation avoids spending iterations that change the result very little.
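The early-exit check might look like the following sketch. The function f(v) = v**2, the starting point, and the hyper-parameter values are assumptions chosen for illustration, and iterations_used is a hypothetical counter added only to show that the loop stops before n_iter is reached.

```python
import numpy as np

def gradient(v):
    return 2 * v  # gradient of the assumed function f(v) = v**2

vector = np.array([10.0, 10.0])
learn_rate, n_iter, tol = 0.2, 50, 1e-6  # assumed values

iterations_used = 0
for _ in range(n_iter):
    step = -learn_rate * gradient(vector)
    # Tolerance check: leave the loop once no component moves by more than tol.
    if np.all(np.abs(step) <= tol):
        break
    vector = vector + step
    iterations_used += 1

print(iterations_used, "updates applied out of a budget of", n_iter)
```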
Learning Rate Usage (Hyper-parameter)
- The learning rate is a crucial hyper-parameter because it affects the behavior of the gradient descent algorithm. For example, if we change the learning rate from 0.2 to 0.7, we get another solution that is very close to 0, but because of the high learning rate each update makes a large change in x, i.e. it overshoots the minimum multiple times and oscillates before settling near zero. Such oscillation can increase the convergence time of the algorithm, and a learning rate that is too large can prevent convergence altogether.
- A small learning rate can lead to slow convergence, and to make matters worse, if the number of iterations is small, the algorithm might even return before it finds the minimum.
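The overshooting described in the first point can be made visible by recording every point the algorithm visits. The sketch below assumes the function f(v) = v**2 and a starting point of 10.0; gd_path is a hypothetical helper written for this illustration.

```python
import numpy as np

def gd_path(gradient, start, learn_rate, n_iter=50, tol=1e-6):
    """Hypothetical helper: run gradient descent and record each visited point."""
    vector = np.asarray(start, dtype=float)
    path = [vector]
    for _ in range(n_iter):
        step = -learn_rate * gradient(vector)
        if np.all(np.abs(step) <= tol):
            break
        vector = vector + step
        path.append(vector)
    return np.array(path)

grad = lambda v: 2 * v  # gradient of the assumed function f(v) = v**2

path_small = gd_path(grad, start=10.0, learn_rate=0.2)
path_large = gd_path(grad, start=10.0, learn_rate=0.7)

print(path_small[:5])  # approaches 0 from one side
print(path_large[:5])  # jumps across 0 on every step
```

With the larger rate, consecutive points have opposite signs, which is the oscillation around the minimum.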
Given below is an example to show how the learning rate affects our result.
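One way to reproduce this effect is sketched here, assuming the function f(v) = v**2, a starting point of 10.0, a very small learning rate of 0.001, and a budget of 50 iterations. All of these values are assumptions chosen for illustration.

```python
import numpy as np

def GD(gradient, start, learn_rate, n_iter=50, tol=1e-6):
    """Gradient descent sketch; the n_iter and tol defaults are assumed."""
    vector = np.asarray(start, dtype=float)
    for _ in range(n_iter):
        step = -learn_rate * gradient(vector)
        if np.all(np.abs(step) <= tol):
            break
        vector = vector + step
    return vector

# The minimum of v**2 is at 0, but the steps are too small to get there in time.
result = GD(gradient=lambda v: 2 * v, start=10.0, learn_rate=0.001, n_iter=50)
print(result)  # still far from 0 after 50 iterations
```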
The value returned by the algorithm is not even close to 0. This indicates that our algorithm returns before converging to the global minimum.