LSTM – Derivation of Back propagation through time

Last Updated : 27 Dec, 2021

LSTM (Long Short-Term Memory) is a type of RNN (recurrent neural network), a widely used deep learning architecture that is well suited to prediction and classification on sequential, time-dependent data. In this article, we derive the backpropagation through time (BPTT) algorithm and find the gradient of every weight at a particular timestamp.
As the name suggests, backpropagation through time is similar to backpropagation in a DNN (deep neural network), but because each step of an RNN or LSTM depends on the previous one, the chain rule has to be applied through that time dependency.

Let the input to the LSTM cell at time t be x_t, the cell states at times t-1 and t be c_t-1 and c_t, and the outputs at times t-1 and t be h_t-1 and h_t. The initial values of c_t and h_t at t = 0 are zero.

Step 1 : Initialization of the weights.

Weights for the different gates are:
Input gate : w_xi, w_hi, b_i, w_xg, w_hg, b_g

Forget gate : w_xf, b_f, w_hf 

Output gate : w_xo, b_o, w_ho
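
As a minimal sketch of Step 1, assuming a scalar LSTM cell (one input feature, one hidden unit), the twelve parameters w_xi, w_hi, b_i, w_xg, w_hg, b_g, w_xf, w_hf, b_f, w_xo, w_ho, b_o could be initialized as follows. The function name `init_weights`, the dictionary layout, the seed, and the uniform range are illustrative assumptions, not something the derivation prescribes.

```python
import random

def init_weights(seed=0, scale=0.1):
    """Return the 12 scalar LSTM parameters, keyed by name."""
    rng = random.Random(seed)
    names = ["w_xi", "w_hi", "b_i",   # input gate (sigmoid part)
             "w_xg", "w_hg", "b_g",   # input gate (tanh candidate part)
             "w_xf", "w_hf", "b_f",   # forget gate
             "w_xo", "w_ho", "b_o"]   # output gate
    # Assumption: uniform values in [-scale, scale]; other schemes also work.
    return {name: rng.uniform(-scale, scale) for name in names}

weights = init_weights()
print(sorted(weights))
```

Keeping the parameters in a dictionary keyed by the same names used in the equations makes it easy to match each gradient to its weight later on.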

Step 2 : Passing through the different gates.

Inputs: x_t, h_t-1 and c_t-1 are given to the LSTM cell.

      Passing through the input gate:

          Z_g = w_xg * x_t + w_hg * h_t-1 + b_g
          g = tanh(Z_g)
          Z_i = w_xi * x_t + w_hi * h_t-1 + b_i
          i = sigmoid(Z_i)

          Input_gate_out = g * i

      Passing through the forget gate:

          Z_f = w_xf * x_t + w_hf * h_t-1 + b_f
          f = sigmoid(Z_f)

          Forget_gate_out = f

      Passing through the output gate:

          Z_o = w_xo * x_t + w_ho * h_t-1 + b_o
          o = sigmoid(Z_o)

          Out_gate_out = o
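
The gate computations of Step 2 can be sketched in a few lines of Python. This is a hedged illustration for a scalar cell: the function name `gates`, the dictionary of weights, and the value 0.5 used for every parameter are assumptions made only for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gates(x_t, h_prev, w):
    """Compute g, i, f, o for one time step of a scalar LSTM cell."""
    g = math.tanh(w["w_xg"] * x_t + w["w_hg"] * h_prev + w["b_g"])
    i = sigmoid(w["w_xi"] * x_t + w["w_hi"] * h_prev + w["b_i"])
    f = sigmoid(w["w_xf"] * x_t + w["w_hf"] * h_prev + w["b_f"])
    o = sigmoid(w["w_xo"] * x_t + w["w_ho"] * h_prev + w["b_o"])
    return g, i, f, o

# Hypothetical weights: every parameter set to 0.5 for illustration.
w = {k: 0.5 for k in ["w_xg", "w_hg", "b_g", "w_xi", "w_hi", "b_i",
                      "w_xf", "w_hf", "b_f", "w_xo", "w_ho", "b_o"]}
g, i, f, o = gates(x_t=1.0, h_prev=0.0, w=w)
print(g, i, f, o)
```

With these values every pre-activation equals 1, so g is tanh(1) and the three sigmoid gates share the value sigmoid(1).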

Step 3 : Calculating the output h_t and current cell state c_t.

  Calculating the current cell state c_t :
          c_t = (c_t-1 * forget_gate_out) + input_gate_out

  Calculating the output h_t :
          h_t = out_gate_out * tanh(c_t)
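
Step 3 translates directly into code. The sketch below assumes scalar gate outputs; the function name `cell_state_and_output` and the sample gate values are hypothetical.

```python
import math

def cell_state_and_output(c_prev, g, i, f, o):
    """Combine the gate outputs into the new cell state c_t and output h_t."""
    c_t = c_prev * f + g * i      # keep part of the old state, add the gated candidate
    h_t = o * math.tanh(c_t)      # the output gate scales the squashed cell state
    return c_t, h_t

# Hypothetical gate values, chosen only for illustration.
c_t, h_t = cell_state_and_output(c_prev=0.0, g=0.5, i=0.5, f=0.9, o=1.0)
print(c_t, h_t)
```

Because c_prev is zero here (as at t = 0), the forget gate has no effect and c_t is just g * i.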

Step 4 : Calculating the gradients at timestamp t with backpropagation through time, using the chain rule.

  Let the gradient passed down to this cell from above be:
      E_delta = dE/dh_t

      If we use MSE (mean squared error) for the error, written as E = (1/2) * (y - h(x))², then
      E_delta = dE/dh(x) = (h(x) - y)
      Here y is the original (target) value and h(x) is the predicted value.
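
A quick numerical check of E_delta, assuming the error is written as E = 0.5 * (y - h)² for a single prediction h. Differentiating with respect to h gives h - y; the opposite sign, y - h, is the negative gradient, which is the direction a descent step moves in. The function names and sample values are illustrative.

```python
def half_squared_error(y, h):
    return 0.5 * (y - h) ** 2

def e_delta(y, h):
    """dE/dh for E = 0.5 * (y - h)**2."""
    return h - y

# Hypothetical target and prediction.
y, h, eps = 1.0, 0.3, 1e-6

# Central finite difference of E with respect to h.
numeric = (half_squared_error(y, h + eps) - half_squared_error(y, h - eps)) / (2 * eps)
print(e_delta(y, h), numeric)
```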
   
  Gradient with respect to output gate

      dE/do = (dE/dh_t) * (dh_t/do) = E_delta * (dh_t/do)
      dE/do = E_delta * tanh(c_t)

  Gradient with respect to c_t

      dE/dc_t = (dE/dh_t) * (dh_t/dc_t) = E_delta * (dh_t/dc_t)
      dE/dc_t = E_delta * o * (1-tanh²(c_t))

  Gradient with respect to input gate dE/di, dE/dg

      dE/di = (dE/dc_t) * (dc_t/di)
      dE/di = E_delta * o * (1-tanh²(c_t)) * g
      Similarly,
      dE/dg = E_delta * o * (1-tanh²(c_t)) * i
       
  Gradient with respect to forget gate

      dE/df = (dE/dc_t) * (dc_t/df)
      dE/df = E_delta * o * (1-tanh²(c_t)) * c_t-1

  Gradient with respect to c_t-1

      dE/dc_t-1 = (dE/dc_t) * (dc_t/dc_t-1)
      dE/dc_t-1 = E_delta * o * (1-tanh²(c_t)) * f
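
The gate gradients can be checked numerically. The sketch below evaluates dE/do, dE/dc_t, dE/di, dE/dg, dE/df and dE/dc_t-1 for a scalar cell and compares dE/df with a central finite difference. It assumes E = h_t, so that E_delta = 1; the function names and the sample gate values are hypothetical.

```python
import math

def h_from_gates(c_prev, g, i, f, o):
    """Forward pass from gate outputs to h_t."""
    return o * math.tanh(c_prev * f + g * i)

def gate_gradients(e_delta, c_prev, g, i, f, o):
    """Gradients of E with respect to the gate outputs and cell states."""
    c_t = c_prev * f + g * i
    dc = e_delta * o * (1.0 - math.tanh(c_t) ** 2)   # dE/dc_t
    return {
        "o": e_delta * math.tanh(c_t),   # dE/do
        "c_t": dc,
        "i": dc * g,                     # dE/di
        "g": dc * i,                     # dE/dg
        "f": dc * c_prev,                # dE/df
        "c_prev": dc * f,                # dE/dc_t-1
    }

# Hypothetical gate values and previous cell state.
vals = dict(c_prev=0.4, g=0.3, i=0.6, f=0.8, o=0.7)
grads = gate_gradients(e_delta=1.0, **vals)

# Finite-difference check of dE/df, taking E = h_t so that E_delta = 1.
eps = 1e-6
up = h_from_gates(**{**vals, "f": vals["f"] + eps})
down = h_from_gates(**{**vals, "f": vals["f"] - eps})
numeric_df = (up - down) / (2 * eps)
print(grads["f"], numeric_df)
```

The same perturbation trick works for any other entry of `vals`, which is a convenient way to catch sign or index mistakes in a hand derivation.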
 
  Gradient with respect to output gate weights:

    dE/dw_xo = dE/do * (do/dw_xo) = E_delta * tanh(c_t) * sigmoid(Z_o) * (1-sigmoid(Z_o)) * x_t
    dE/dw_ho = dE/do * (do/dw_ho) = E_delta * tanh(c_t) * sigmoid(Z_o) * (1-sigmoid(Z_o)) * h_t-1
    dE/db_o  = dE/do * (do/db_o)  = E_delta * tanh(c_t) * sigmoid(Z_o) * (1-sigmoid(Z_o))

  Gradient with respect to forget gate weights:

    dE/dw_xf = dE/df * (df/dw_xf) = E_delta * o * (1-tanh²(c_t)) * c_t-1 * sigmoid(Z_f) * (1-sigmoid(Z_f)) * x_t
    dE/dw_hf = dE/df * (df/dw_hf) = E_delta * o * (1-tanh²(c_t)) * c_t-1 * sigmoid(Z_f) * (1-sigmoid(Z_f)) * h_t-1
    dE/db_f  = dE/df * (df/db_f)  = E_delta * o * (1-tanh²(c_t)) * c_t-1 * sigmoid(Z_f) * (1-sigmoid(Z_f))

  Gradient with respect to input gate weights:

    dE/dw_xi = dE/di * (di/dw_xi) = E_delta * o * (1-tanh²(c_t)) * g * sigmoid(Z_i) * (1-sigmoid(Z_i)) * x_t
    dE/dw_hi = dE/di * (di/dw_hi) = E_delta * o * (1-tanh²(c_t)) * g * sigmoid(Z_i) * (1-sigmoid(Z_i)) * h_t-1
    dE/db_i  = dE/di * (di/db_i)  = E_delta * o * (1-tanh²(c_t)) * g * sigmoid(Z_i) * (1-sigmoid(Z_i))

    dE/dw_xg = dE/dg * (dg/dw_xg) = E_delta * o * (1-tanh²(c_t)) * i * (1-tanh²(Z_g)) * x_t
    dE/dw_hg = dE/dg * (dg/dw_hg) = E_delta * o * (1-tanh²(c_t)) * i * (1-tanh²(Z_g)) * h_t-1
    dE/db_g  = dE/dg * (dg/db_g)  = E_delta * o * (1-tanh²(c_t)) * i * (1-tanh²(Z_g))
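
To confirm the weight gradients, the sketch below implements the forward pass and the twelve gradient formulas for a scalar cell, then compares each one with a central finite difference. It assumes E = h_t (so E_delta = 1); the function names, weight values, and inputs are hypothetical and chosen only for the check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, x_t, h_prev, c_prev):
    """Return g, i, f, o, c_t, h_t for one time step."""
    z = {k: w["w_x" + k] * x_t + w["w_h" + k] * h_prev + w["b_" + k] for k in "gifo"}
    g, i, f, o = math.tanh(z["g"]), sigmoid(z["i"]), sigmoid(z["f"]), sigmoid(z["o"])
    c_t = c_prev * f + g * i
    return g, i, f, o, c_t, o * math.tanh(c_t)

def weight_gradients(w, x_t, h_prev, c_prev, e_delta):
    g, i, f, o, c_t, _ = forward(w, x_t, h_prev, c_prev)
    dc = e_delta * o * (1.0 - math.tanh(c_t) ** 2)        # dE/dc_t
    # dE/dZ per gate: gate gradient times the activation derivative.
    # sigmoid'(Z) = s * (1 - s) and tanh'(Z) = 1 - tanh(Z)**2.
    dz = {
        "o": e_delta * math.tanh(c_t) * o * (1.0 - o),
        "f": dc * c_prev * f * (1.0 - f),
        "i": dc * g * i * (1.0 - i),
        "g": dc * i * (1.0 - g ** 2),
    }
    grads = {}
    for k, d in dz.items():
        grads["w_x" + k] = d * x_t
        grads["w_h" + k] = d * h_prev
        grads["b_" + k] = d
    return grads

# Hypothetical weights (0.1, 0.2, ..., 1.2) and inputs.
names = [p + k for k in "gifo" for p in ("w_x", "w_h", "b_")]
w = {name: 0.1 * (n + 1) for n, name in enumerate(names)}
x_t, h_prev, c_prev = 0.5, -0.3, 0.2
grads = weight_gradients(w, x_t, h_prev, c_prev, e_delta=1.0)

# Finite-difference check with E = h_t, perturbing one weight at a time.
eps = 1e-6
max_err = 0.0
for name in names:
    hi = forward({**w, name: w[name] + eps}, x_t, h_prev, c_prev)[5]
    lo = forward({**w, name: w[name] - eps}, x_t, h_prev, c_prev)[5]
    max_err = max(max_err, abs((hi - lo) / (2 * eps) - grads[name]))
print(max_err)
```

Each weight gradient has the same shape: the gradient at the gate's pre-activation multiplied by x_t, h_t-1, or 1 for the bias, which is why the loop at the end of `weight_gradients` can handle all four gates uniformly.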

Together, the expressions above give the gradient of the error with respect to every weight at timestamp t.

Using these gradients, we can update the weights associated with the input gate, output gate, and forget gate.
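
One common way to apply the update is plain gradient descent, where each weight moves a small step against its gradient. The article does not fix an optimizer, so the rule, the function name `sgd_update`, the learning rate, and the sample numbers below are assumptions for illustration.

```python
def sgd_update(weights, grads, learning_rate=0.1):
    """Return new weights after one gradient-descent step."""
    return {name: value - learning_rate * grads[name]
            for name, value in weights.items()}

# Hypothetical weights and gradients for two of the parameters.
weights = {"w_xo": 0.5, "b_o": 0.0}
grads = {"w_xo": 0.2, "b_o": -0.1}
updated = sgd_update(weights, grads)
print(updated)
```

A positive gradient lowers the weight and a negative gradient raises it, so here w_xo decreases while b_o increases.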
