
Adam Optimizer in TensorFlow

Optimizers are algorithms or methods used to change or tune the attributes of a neural network, such as the layer weights and the learning rate, in order to reduce the loss and, in turn, improve the model. In this article, I am going to talk about the Adam optimizer and its implementation in TensorFlow.

Before starting the discussion let’s talk a little about momentum and RMSprop.



Momentum Optimizer

The momentum optimizer is an extension of the standard gradient descent algorithm. Plain gradient descent tends to oscillate: it takes large steps back and forth along the steep direction of the loss surface while making only slow progress along the shallow direction, which slows the algorithm down. Momentum dampens these oscillations and keeps the updates moving along a consistent direction, which speeds up the convergence of our method. This also allows us to choose a higher learning rate, because the size of the oscillations in the steep (y) direction is kept in check.

The following is the formula for momentum:

v_t = beta * v_(t-1) + (1 - beta) * g_t
w_t = w_(t-1) - eta * v_t
where w is the weight, v is the moving average of the gradients (the momentum term), beta is the momentum factor, g is the gradient value and eta is the learning rate.
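
As a quick illustration, here is a minimal NumPy sketch of a single momentum update following the formula above; the function and variable names are only illustrative:

import numpy as np

def momentum_step(w, v, g, eta=0.01, beta=0.9):
    # Moving average of the gradients (the momentum term)
    v = beta * v + (1 - beta) * g
    # Move the weights along the accumulated direction
    w = w - eta * v
    return w, v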

RMSprop Optimizer

The RMSprop optimizer is comparable to the gradient descent algorithm with momentum. It limits the oscillations in the vertical direction, so we can increase the learning rate and our algorithm can take bigger strides in the horizontal direction and converge more quickly. Where RMSprop differs from gradient descent is in how the gradients are used to compute the weight update. The RMSprop update is given by the following formulae:

s_t = beta * s_(t-1) + (1 - beta) * g_t^2
w_t = w_(t-1) - eta * g_t / (sqrt(s_t) + epsilon)
where w is the weight, s is the moving average of the squared gradients, beta is the decay factor, g is the gradient value and eta is the learning rate. Epsilon is a very small value that we use to avoid division by zero.
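
A minimal NumPy sketch of one RMSprop update following the formulae above (again, the names are only illustrative):

import numpy as np

def rmsprop_step(w, s, g, eta=0.001, beta=0.9, epsilon=1e-8):
    # Moving average of the squared gradients
    s = beta * s + (1 - beta) * g ** 2
    # Scale each parameter's step by its recent gradient magnitude
    w = w - eta * g / (np.sqrt(s) + epsilon)
    return w, s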

Now that we have an understanding of momentum and RMSprop optimization algorithms, let’s take a closer look at how the Adam algorithm works.

Adam Optimizer

Adam (Adaptive Moment Estimation) is an adaptive optimization algorithm that was created specifically for deep neural network training. It can be viewed as a fusion of momentum-based stochastic gradient descent and RMSprop. It scales the learning rate using squared gradients, similar to RMSprop, and leverages momentum by using the gradient’s moving average rather than the gradient itself, similar to SGD with momentum.

To estimate the moments, Adam uses exponential moving averages computed on the gradients evaluated on the current mini-batch. Mathematically, this can be written as:

m_t = beta_1 * m_(t-1) + (1 - beta_1) * g_t
v_t = beta_2 * v_(t-1) + (1 - beta_2) * g_t^2
where m and v are the moving averages of the gradient and of the squared gradient, and g is the gradient value. The betas are hyperparameters whose good default values, as suggested in the paper, are 0.9 and 0.999 respectively.

Because m and v are initialized to zero, the moving averages are biased towards zero during the first steps. To make their expected values match the expected values of the gradient and the squared gradient, Adam applies a bias correction to the moments:

m_hat_t = m_t / (1 - beta_1^t)
v_hat_t = v_t / (1 - beta_2^t)

Using all this information, Adam updates the weights with the following formula, which is quite similar to the one used by RMSprop:

w_t = w_(t-1) - eta * m_hat_t / (sqrt(v_hat_t) + epsilon)

where w is the weight, eta is the learning rate and epsilon is a very small value, usually 10^-8, which we use to avoid division by zero.
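
Putting the pieces together, here is a minimal NumPy sketch of one Adam step that mirrors the formulas above; it illustrates the algorithm and is not TensorFlow’s internal implementation:

import numpy as np

def adam_step(w, m, v, g, t, eta=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8):
    # Moving averages of the gradient and the squared gradient
    m = beta_1 * m + (1 - beta_1) * g
    v = beta_2 * v + (1 - beta_2) * g ** 2
    # Bias correction (t is the step number, starting at 1)
    m_hat = m / (1 - beta_1 ** t)
    v_hat = v / (1 - beta_2 ** t)
    # RMSprop-style update using the corrected moments
    w = w - eta * m_hat / (np.sqrt(v_hat) + epsilon)
    return w, m, v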

Adam Optimizer in TensorFlow

You can pass the string value "adam" to the optimizer argument of the model.compile() function, like:

model.compile(optimizer="adam")

This creates an Adam optimizer object with the default values for the betas and the learning rate. If you want to configure these hyperparameters yourself, you can use the Adam class provided in tf.keras.optimizers.

It has the following syntax:

Adam(learning_rate, beta_1, beta_2, epsilon, amsgrad, name)

The following is the description of the parameters given above:

learning_rate: the step size used for the weight updates; defaults to 0.001.
beta_1: the exponential decay rate for the first moment estimates; defaults to 0.9.
beta_2: the exponential decay rate for the second moment estimates; defaults to 0.999.
epsilon: a small constant added for numerical stability; defaults to 1e-07.
amsgrad: boolean, whether to apply the AMSGrad variant of the algorithm; defaults to False.
name: an optional name for the operations created by the optimizer; defaults to "Adam".
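
If you want non-default values, you can construct the optimizer object yourself and pass it to model.compile later on. The hyperparameter values below simply restate the defaults and are only meant to show the call:

import tensorflow as tf

# Adam optimizer with explicitly spelled-out hyperparameters
opt = tf.keras.optimizers.Adam(learning_rate=0.001,
                               beta_1=0.9,
                               beta_2=0.999,
                               epsilon=1e-07,
                               amsgrad=False)

# later: model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])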

Let us go through an example in TensorFlow to better understand the usage of the Adam optimizer. First, import the library:

import tensorflow as tf

                    

Now let’s create the model. For this purpose, I am using a very simple neural network with two Dense layers. The following piece of code defines the architecture of the model:

def createModel(input_shape):
    # Input layer with the given shape
    X_input = tf.keras.layers.Input(input_shape)
    # Hidden Dense layer with 10 units and ReLU activation
    X = tf.keras.layers.Dense(10, activation='relu')(X_input)
    # Output Dense layer with 2 units and softmax activation
    X_output = tf.keras.layers.Dense(2, activation='softmax')(X)
    model = tf.keras.Model(inputs=X_input, outputs=X_output)
    return model

                    

We can simply create the model now.

model = createModel((10, 10))

                    

The model summary is as follows:

print(model.summary())

                    

Output:

 

Now, let’s print out the weights of the model before training.

print('Initial Layer Weights')
print()
for i in range(1, len(model.layers)):
    print('Weight for Layer '+str(i)+': ')
    print(model.layers[i].get_weights()[0])
    print()

                    

Output:

 

Let’s get some dummy data to pass on to the model.

tf.random.set_seed(5)
X = tf.random.normal((2, 10, 10))
Y = tf.random.normal((2, 10, 2))

                    

Now it’s time to compile the model. I have used the 'adam' optimizer, the 'categorical_crossentropy' loss, and the 'accuracy' metric.

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

                    

The optimizer has the following configuration.

print(model.optimizer.get_config())

                    

Output:

{'name': 'Adam',
 'learning_rate': 0.001,
 'decay': 0.0, 'beta_1': 0.9,
 'beta_2': 0.999,
 'epsilon': 1e-07,
 'amsgrad': False}

Now let’s fit the dataset to the model.

model.fit(X,Y)

                    

Output:

1/1 [==============================] - 2s 2s/step - loss: -0.2437 - accuracy: 0.6500
<keras.callbacks.History at 0x7eff2b3868d0>
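
Under the hood, model.fit runs optimization steps roughly like the following sketch, which drives the same Adam optimizer manually with tf.GradientTape on the model, loss, and dummy data from above; this is an illustration of the mechanism, not a replica of Keras internals:

# One manual training step with Adam (illustrative sketch)
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
loss_fn = tf.keras.losses.CategoricalCrossentropy()

with tf.GradientTape() as tape:
    predictions = model(X, training=True)  # forward pass
    loss = loss_fn(Y, predictions)         # compute the loss

# Compute gradients and let Adam apply the weight update
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))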

The model has now been trained. Let’s check the weights after training.

print('Final Layer Weights')
print()
for i in range(1, len(model.layers)):
    print('Weight for Layer '+str(i)+': ')
    print(model.layers[i].get_weights()[0])
    print()

                    

Output:

 

