SPSA (Simultaneous Perturbation Stochastic Approximation) Algorithm using Python

SPSA (Simultaneous Perturbation Stochastic Approximation) is an optimisation algorithm invented by James C. Spall. It is especially useful for noisy cost functions and for cost functions whose exact gradient is not available. The general idea is to estimate the gradient of the cost function from function evaluations alone and to use that estimate to update the parameters so that the cost function is improved (its value is lowered).

Random Perturbation

The derivative of the cost function is not calculated analytically; it is approximated by a random perturbation applied to all parameters simultaneously. The perturbation is made by adding (and subtracting) to the parameter vector w a vector whose components are random variables with a Bernoulli-like distribution: each component is +1 or -1 with equal probability.

The perturbation made by the addition creates the w_plus vector, and the one made by the subtraction generates the w_minus vector.  See equations below:

\begin{cases} w_{plus} = w + \text{random\_perturbation} \\ w_{minus} = w - \text{random\_perturbation} \end{cases}

The cost function is then evaluated at these two points, and the slope (gradient) is approximated:

\frac{\partial L}{\partial w} \approx \frac{L(w_{plus}) - L(w_{minus})}{2 \times \text{perturbation\_amplitude}}
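
To make this estimate concrete, here is a minimal sketch (not part of the original article; the cost function, parameter values and perturbation amplitude are arbitrary illustrative choices) that builds w_plus and w_minus and evaluates the approximation.

Python3

import numpy as np

# A minimal sketch of one simultaneous-perturbation gradient estimate.
# The cost function, parameters and amplitude are illustrative assumptions.
L = lambda w: (w**2).sum()              # simple cost function: L(w) = sum(w_i^2)
w = np.array([1.0, -2.0, 0.5])          # current parameters

perturbation_amplitude = 0.01
delta = np.random.choice([-1, 1], size=w.shape)   # Bernoulli +/-1 components
random_perturbation = perturbation_amplitude * delta

w_plus = w + random_perturbation
w_minus = w - random_perturbation

# elementwise: (L(w_plus) - L(w_minus)) / (2 * random_perturbation)
g_hat = (L(w_plus) - L(w_minus)) / (2 * random_perturbation)

# g_hat is a noisy estimate: it matches the true gradient 2*w only on
# average over many draws of delta, not for a single draw.
print(g_hat)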

Hyper Parameters

  •  N is the number of iterations. 
  • magnitude_g0 is an estimate of the magnitude of the first approximated gradients of the cost function.
  • The learning rate at iteration k is a/(A+k)^α, where a, α (alpha), and A are constants and k is the iteration number (note that the learning rate decays at each iteration).
    • a is the numerator of the learning rate.
    • A is a stabilisation factor for the learning rate (it prevents too large a gap between the learning rate in the first and in the last iterations).
    • α controls how quickly the learning rate decays.
    • One way to choose a, proposed by Spall, is to satisfy the following equation, where Δw0 is the smallest of the desired change magnitudes among the parameters in the early iterations (a worked sketch of these schedules is given after this list):

\frac{a}{\left( A+1 \right)^{\alpha}} \cdot \text{magnitude\_g0} \approx \Delta w_0

  • c and γ control the amplitude of the perturbation applied to the parameters.
    • c is such that the greater its value, the stronger the perturbation (the parameters are modified more strongly before the cost function is evaluated to approximate the gradient).
    • γ (gamma) is a constant that regulates how fast the ck value decays (a greater γ implies faster decay).
    • A good choice for c is the standard deviation of the measurement noise of the loss. If there is no noise, a small positive c should work well.
  • δ (delta) is a vector of random variables of a Bernoulli-like distribution (δk is the vector of the k-th iteration) used to disturb the parameters.
    • if X is a random variable in the vector, X has a 50% chance of being -1 and 50% of being 1.
    • For example, δ = (1,  1,  -1,  1,  -1) is a possible vector for 5 parameters; scaled by a perturbation amplitude of 0.01, the perturbation applied to those 5 parameters would be (0.01,  0.01,  -0.01,  0.01,  -0.01).
  • w is a vector of parameters (weights), which we use as input of the cost function.
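
To make these schedules concrete, the short sketch below (not part of the original article; delta_w0 and magnitude_g0 are assumed numbers, not prescriptions) computes the decaying gains ak and ck and picks a using Spall's rule.

Python3

import numpy as np

# Illustrative hyperparameter choices (assumptions for this sketch only)
N = 1000                      # number of iterations
alpha, gamma = 0.602, 0.101   # decay exponents
c = 0.01                      # numerator of the perturbation amplitude
A = 0.1 * N                   # stabilisation factor, about 10% of N

# Spall's rule: choose a so that the first update changes the parameters
# by roughly delta_w0, given an estimate of the first gradient magnitudes.
delta_w0 = 0.1                # desired magnitude of the first change (assumed)
magnitude_g0 = 2.0            # estimated magnitude of the first gradients (assumed)
a = delta_w0 * (A + 1)**alpha / magnitude_g0

# Decaying gain sequences used at iteration k
k = np.arange(1, N + 1)
ak = a / (k + A)**alpha       # learning rate, decays each iteration
ck = c / k**gamma             # perturbation amplitude, decays each iteration

print("learning rate:     ", ak[0], "->", ak[-1])
print("perturbation ampl.:", ck[0], "->", ck[-1])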

Pseudocode for SPSA

A <−−− 0.01N
a <−−− 0.1(A + 1)^α / magnitude_g0
for k = 1, 2, ..., N do
    ak <−−− a/(k + A)^α
    ck <−−− c/k^γ
    δk <−−− special_random_vector
    g <−−− (L(w + ck·δk) − L(w − ck·δk)) / (2·ck·δk)
    w <−−− w − ak·g
end for

Implementation of SPSA

Python3

import numpy as np
import matplotlib.pyplot as plt
 
# Fixing the seed to get reproducible results
np.random.seed(42)

                    

Below is a helper function that calculates the output of a polynomial, given the coefficients in the array a and the value x at which to evaluate it.

Python3

def polynomial(a, x):
 
    N = len(a)
    S = 0
    for k in range(N):
        S += a[k]*x**k
    return S

                    

The function below calculates the mean squared error between the predicted values and the real values, and returns it with some random noise added, to simulate a noisy measurement of the loss.

Python3

def Loss(parameters, X, Y):
 
    # Predictions of our model
    Y_pred = polynomial(parameters, X)
     
    # mse (mean squared error)
    L = ((Y_pred - Y)**2).mean()
     
    # Random noise in the range [0, 5)
    noise = 5*np.random.random()
    return L + noise

                    

The function below approximates the gradient, given the loss function, the current weights (coefficients of the polynomial), and the perturbation amplitude ck.

Python3

def grad(L, w, ck):
     
    # number of parameters
    p = len(w)
     
    # bernoulli-like distribution
    deltak = np.random.choice([-1, 1], size=p)
     
    # simultaneous perturbations
    ck_deltak = ck * deltak
 
    # gradient approximation
    DELTA_L = L(w + ck_deltak) - L(w - ck_deltak)
 
    return (DELTA_L) / (2 * ck_deltak)
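
As a quick sanity check (not part of the original article), we can call grad on a simple noiseless quadratic loss whose true gradient is known. A single SPSA estimate is noisy, but the average over many estimates approaches the true gradient.

Python3

# Sanity check of grad on a noiseless quadratic loss (illustrative only).
# For L(w) = sum(w_i^2) the true gradient at w0 is 2*w0; a single SPSA
# estimate is noisy, but the average of many estimates is close to it.
quadratic = lambda w: (w**2).sum()
w0 = np.array([1.0, -2.0, 0.5])

estimates = np.array([grad(quadratic, w0, 0.01) for _ in range(1000)])
print("averaged estimate:", estimates.mean(axis=0))
print("true gradient:    ", 2*w0)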

                    

The helper function below chooses the constants required for the algorithm to work well.

Python3

def initialize_hyperparameters(alpha, lossFunction, w0, N_iterations):
 
    c = 1e-2   # a small number
 
    # A is <= 10% of the number of iterations
    A = N_iterations*0.1
 
    # order of magnitude of first gradients
    magnitude_g0 = np.abs(grad(lossFunction, w0, c).mean())
 
    # the factor 2 in front is an estimate of the
    # desired initial change of the parameters;
    # different problems might need other choices
    a = 2*((A+1)**alpha)/magnitude_g0
 
    return a, A, c

                    

The function below is the main optimisation routine; it makes use of all the helper functions defined above.

Python3

# optimization algorithm
def SPSA(LossFunction, parameters, alpha=0.602,\
         gamma=0.101, N_iterations=int(1e5)):
     
    # model's parameters
    w = parameters
 
    a, A, c = initialize_hyperparameters(
      alpha, LossFunction, w, N_iterations)
 
    for k in range(1, N_iterations + 1):
 
        # update ak and ck
        ak = a/((k+A)**(alpha))
        ck = c/(k**(gamma))
 
        # estimate gradient
        gk = grad(LossFunction, w, ck)
 
        # update parameters
        w -= ak*gk
 
    return w

                    

Now we will demonstrate the algorithm implemented above with an example.

Python3

# Y is the polynomial to be approximated
X = np.linspace(0, 10, 100)
Y = 1*X**2 - 4*X + 3
 
noise = 3*np.random.normal(size=len(X))
Y += noise
 
# plot polynomial
plt.title("polynomial with noise")
plt.plot(X, Y, 'go')
plt.show()

                    

Output:

Set of random points with noise

Before training

Python3

# Initial parameters are randomly
# chosen in the range [-10, 10)
parameters = (2*np.random.random(3) - 1)*10
 
plt.title("Before training")
 
# Compare true and predicted values before
# training
plt.plot(X, polynomial(parameters, X), "bo")
plt.plot(X, Y, 'go')
plt.legend(["predicted value", "true value"])
plt.show()

                    

Output:

Predictions of the function before training

After training

Python3

# Training with SPSA
parameters = SPSA(LossFunction = lambda parameters: Loss(parameters, X, Y),
                  parameters = parameters)
 
plt.title("After training")
plt.plot(X, polynomial(parameters, X), "bo")
plt.plot(X, Y, 'go')
plt.legend(["predicted value", "true value"])
plt.show()

                    

Output:

Predictions after learning the weights
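
As a final check (not in the original article), we can print the learned coefficients and compare them with the true polynomial 1·x² − 4x + 3, i.e. the coefficient array (3, −4, 1) in ascending order of powers. Because of the added noise and the stochastic updates, the values will only match approximately.

Python3

# Compare the learned coefficients with the true ones (3, -4, 1, in
# ascending order of powers); they should match only approximately
# because of the measurement noise and the stochastic updates.
print("learned coefficients:", parameters)
print("true coefficients:   ", np.array([3, -4, 1]))
print("final (noisy) loss:  ", Loss(parameters, X, Y))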


