SPSA (Simultaneous Perturbation Stochastic Approximation) Algorithm using Python
SPSA is an algorithm of optimisation invented by James C. Spall specially useful for noisy cost functions and the ones which the exact gradient is not available. The general Idea is to increase the quality of the answer by estimating the gradient of the cost functions and using it to update the parameters in such way that the cost function is improved (lower its value).
Random Perturbation
The derivative of the Cost function is not calculated, it is approximated by a random perturbation made to all parameters. The perturbation is made by adding (and subtracting) to the parameters vector w a vector composed by random variables of a Bernoulli-like distribution in each of its components. The distribution is such that each variable is 1 or -1 with equal probability.
The perturbation made by the addition creates the w_plus vector, and the one made by the subtraction generates the w_minus vector. See equations below:
The cost function is now evaluated at the point of this two vectors, and then the slope(gradient) is measured (approximated).
Hyper Parameters
- N is the number of iterations.
- magnitude_g0 is a estimation of the first values of the approximated gradient of the cost function.
- The learning rate of an iteration is a/(A+k)α , where a, α (alpha) and A are constants and k is the iteration number (note that the learning rate decays each iteration).
- a is the numerator of the learning rate.
- A is a stabilisation factor of the learning rate (it avoids a bigger difference between the learning rate in the first and in the last iterations) .
- α is related to the speed that the learning rate of the decays.
- A way to choose a proposed by Spall is one that satisfies the following equation, where delta_w0 is the smallest of the desired change magnitudes among the elements of the parameters in the early iterations:
- c and γ are related with amplitude of the perturbation made to the parameters.
- c is such that the greater its value is the stronger will be the perturbation suffered (the parameters will be more strongly modified before calculating the cost function to approximate it).
- γ (gamma) is a constant that regulates how fast the ck value will decay (greater γ implies faster decay).
- It is a great choice to c’s value the standard deviation of the measurement noise related to the loss. If there isn’t any noise, a small c should work well.
- δ (delta) is a vector of random variables of a Bernoulli-like distribution (δk is the vector of the k-th iteration) used to disturb the parameters.
- if X is a random variable in the vector, X has a 50% chance of being -1 and 50% of being 1.
- For example: δ = (1, 1, -1, 1, -1) would be a possible vector for 6 parameters a possible random vector used to disturb 6 parameters would be: (0.01, 0.01, -0.01, 0.01, -0.01).
- w is a vector of parameters (weights), which we use as input of the cost function.
Pseudo Code for the SPSA
A <−−− 0.01N
a <−−− 0.1(A + 1)α/magnitude_g0
for k - 1,2,...,N do
ak <−−− a/(k + A)α
ck <−−− c/kγ
δk <−−− special_random_vector
g <−−− (L(w + ckδk) - L(w - ckδk))
w <−−− w - akg
end for
Implementation SPSA
Python3
import numpy as np
import matplotlib.pyplot as plt
np.random.seed( 42 )
|
Below is a helper function to calculate the output of a polynomial, given the coefficients in the array a and the x value (point to measure).
Python3
def polynomial(a, x):
N = len (a)
S = 0
for k in range (N):
S + = a[k] * x * * k
return S
|
Below function will calculate the mean squared error between the predicted values and the real values. But the final error will be passed with some noise added/subtracted from it.
Python3
def Loss(parameters, X, Y):
Y_pred = polynomial(parameters, X)
L = ((Y_pred - Y) * * 2 ).mean()
noise = 5 * np.random.random()
return L + noise
|
Below function will calculate the gradients given the loss, weights or coefficients of the functions and
Python3
def grad(L, w, ck):
p = len (w)
deltak = np.random.choice([ - 1 , 1 ], size = p)
ck_deltak = ck * deltak
DELTA_L = L(w + ck_deltak) - L(w - ck_deltak)
return (DELTA_L) / ( 2 * ck_deltak)
|
By using the below helper function we will choose the constants that are required for the algorithm to work well.
Python3
def initialize_hyperparameters(alpha, lossFunction, w0, N_iterations):
c = 1e - 2
A = N_iterations * 0.1
magnitude_g0 = np. abs (grad(lossFunction, w0, c).mean())
a = 2 * ((A + 1 ) * * alpha) / magnitude_g0
return a, A, c
|
Below functions are the main optimization algorithm and make use of all the helper functions we have defined above.
Python3
def SPSA(LossFunction, parameters, alpha = 0.602 ,\
gamma = 0.101 , N_iterations = int ( 1e5 )):
w = parameter
a, A, c = initialize_hyperparameters(
alpha, LossFunction, w, N_iterations)
for k in range ( 1 , N_iterations):
ak = a / ((k + A) * * (alpha))
ck = c / (k * * (gamma))
gk = grad(LossFunction, w, ck)
w - = ak * gk
return w
|
Now, we will show the working of the algorithm we have implemented above with an example.
Python3
X = np.linspace( 0 , 10 , 100 )
Y = 1 * X * * 2 - 4 * X + 3
noise = 3 * np.random.normal(size = len (X))
Y + = noise
plt.title( "polynomial with noise" )
plt.plot(X, Y, 'go' )
plt.show()
|
Output:
Set of random points with noise
Before training
Python3
parameters = ( 2 * np.random.random( 3 ) - 1 ) * 10
plt.title( "Before training" )
plt.plot(X, polynomial(parameters, X), "bo" )
plt.plot(X, Y, 'go' )
plt.legend([ "predicted value" , "true value" ])
plt.show()
|
Output:
Predictions of the function before training
After training
Python3
parameters = SPSA(LossFunction = lambda parameters: Loss(parameters, X, Y),
parameters = parameters)
plt.title( "After training" )
plt.plot(X, polynomial(parameters, X), "bo" )
plt.plot(X, Y, 'go' )
plt.legend([ "predicted value" , "true value" ])
plt.show()
|
Output:
Predictions after learning the weights
Last Updated :
09 Feb, 2023
Like Article
Save Article
Share your thoughts in the comments
Please Login to comment...