Today, different Machine Learning techniques are used to handle different types of data. One of the most difficult types of data to handle and forecast is sequential data. Sequential data differs from other types of data in that, while the features of a typical dataset can be assumed to be order-independent, this cannot be assumed for a sequential dataset. To handle such data, the concept of Recurrent Neural Networks was conceived. A Recurrent Neural Network differs from other Artificial Neural Networks in its structure. While other networks “travel” in a linear direction during the feed-forward process or the back-propagation process, the Recurrent Network follows a recurrence relation instead of a feed-forward pass and uses Back-Propagation Through Time to learn.
The Recurrent Neural Network consists of multiple fixed activation function units, one for each time step. Each unit has an internal state, called the hidden state of the unit. This hidden state signifies the past knowledge that the network holds at a given time step. It is updated at every time step to reflect the change in the knowledge of the network about the past. The hidden state is updated using the following recurrence relation:-
$$h_t = f_W(h_{t-1}, x_t)$$
- $h_t$: The new hidden state
- $h_{t-1}$: The old hidden state
- $x_t$: The current input
- $f_W$: The fixed function with trainable weights
Note: To make the concepts of a Recurrent Neural Network easier to understand, it is typically illustrated in its unrolled form, and this norm will be followed in this post.
At each time step, the new hidden state is calculated using the recurrence relation given above. This newly generated hidden state is in turn used to generate the next hidden state, and so on.
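The recurrence described above can be sketched in a few lines of Python. This is only an illustration: `unroll` and `toy_update` are hypothetical names, and the toy update is an assumed stand-in for the fixed function with trainable weights, not an actual network.

```python
from typing import Callable, List, Sequence


def unroll(f: Callable[[float, float], float], h0: float, inputs: Sequence[float]) -> List[float]:
    """Apply the same update f at every time step, feeding each new hidden state back in."""
    states = []
    h = h0
    for x in inputs:
        h = f(h, x)  # new hidden state from the old hidden state and the current input
        states.append(h)
    return states


def toy_update(h_prev: float, x_t: float) -> float:
    # Toy stand-in for the trainable function (an assumption made for illustration).
    return 0.5 * h_prev + x_t


hidden_states = unroll(toy_update, 0.0, [1.0, 1.0, 1.0])
print(hidden_states)  # [1.0, 1.5, 1.75]
```

The same function is reused at every step; only the hidden state and the input change from one step to the next.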
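The basic work-flow of a Recurrent Neural Network is as follows: starting from an initial hidden state, the network processes the input sequence one time step at a time, and the hidden state produced at each step is passed on to the next step.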
Note that $h_0$ is the initial hidden state of the network. Typically, it is a vector of zeros, but it can have other values as well. One method is to encode presumptions about the data into the initial hidden state of the network. For example, for a problem of determining the tone of a speech given by a renowned person, the tones of that person’s past speeches may be encoded into the initial hidden state. Another technique is to make the initial hidden state a trainable parameter. Although these techniques add some nuance to the network, initializing the hidden state vector to zeros is typically an effective choice.
Working of each Recurrent Unit:
- Take as input the previous hidden state vector and the current input vector.
Note that since the hidden state and the current input are treated as vectors, each element of a vector lies along its own dimension, which is orthogonal to the other dimensions. Thus, when two vectors are multiplied together, a pair of elements contributes a non-zero value only when both elements are non-zero and lie in the same dimension.
- Multiply the hidden state vector by the hidden state weight matrix and, similarly, multiply the current input vector by the current input weight matrix (both are matrix-vector products). This generates the parameterized hidden state vector and the parameterized current input vector.
Note that the weights for the different vectors are stored in trainable weight matrices.
- Perform the vector addition of the two parameterized vectors and then calculate the element-wise hyperbolic tangent to generate the new hidden state vector.
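The steps above can be sketched with NumPy. This is a minimal sketch that assumes the common formulation $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$, in which the weights are matrices applied by matrix-vector multiplication. The names `rnn_step`, `W_hh` and `W_xh`, the sizes, and the random initial values are all illustrative assumptions.

```python
import numpy as np


def rnn_step(h_prev, x_t, W_hh, W_xh):
    """One recurrent unit: weight both vectors, add them, apply tanh element-wise."""
    weighted_hidden = W_hh @ h_prev  # parameterized hidden state vector
    weighted_input = W_xh @ x_t      # parameterized current input vector
    return np.tanh(weighted_hidden + weighted_input)


rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3  # illustrative sizes
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))

h_prev = np.zeros(hidden_size)  # zero initial hidden state
x_t = rng.normal(size=input_size)
h_t = rnn_step(h_prev, x_t, W_hh, W_xh)
print(h_t.shape)  # (4,)
```

Because of the hyperbolic tangent, every element of the new hidden state lies strictly between -1 and 1.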
During the training of the recurrent network, the network also generates an output at each time step. This output is used to train the network using gradient descent.
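A possible sketch of the unrolled forward pass that produces one output per time step is given below. The source does not specify how the output is computed from the hidden state, so the linear readout `W_hy @ h` is an assumption, as are the function name `rnn_forward` and all sizes.

```python
import numpy as np


def rnn_forward(inputs, h0, W_hh, W_xh, W_hy):
    """Unrolled forward pass: one hidden state and one output per time step."""
    h = h0
    hidden_states, outputs = [], []
    for x_t in inputs:                      # iterate over the time steps in order
        h = np.tanh(W_hh @ h + W_xh @ x_t)  # update the hidden state
        hidden_states.append(h)
        outputs.append(W_hy @ h)            # assumed linear readout of the hidden state
    return np.stack(hidden_states), np.stack(outputs)


rng = np.random.default_rng(1)
num_steps, input_size, hidden_size, output_size = 5, 3, 4, 2  # illustrative sizes
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))
inputs = rng.normal(size=(num_steps, input_size))

hidden_states, outputs = rnn_forward(inputs, np.zeros(hidden_size), W_hh, W_xh, W_hy)
print(hidden_states.shape, outputs.shape)  # (5, 4) (5, 2)
```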
The Back-Propagation involved is similar to the one used in a typical Artificial Neural Network with some minor changes. These changes are noted as:-
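Let the predicted output of the network at any time step $t$ be $\hat{y}_t$ and the actual output be $y_t$. Then the error at each time step, $E_t$, is given by a loss function $\mathcal{L}$ that compares the two:-
$$E_t = \mathcal{L}(\hat{y}_t, y_t)$$
The total error is given by the summation of the errors at all the time steps.
$$E = \sum_t E_t$$
Similarly, the value $\frac{\partial E}{\partial W}$ can be calculated as the summation of the gradients at each time step.
$$\frac{\partial E}{\partial W} = \sum_t \frac{\partial E_t}{\partial W}$$
Using the chain rule of calculus and using the fact that the output at a time step $t$ is a function of the current hidden state $h_t$ of the recurrent unit, the following expression arises:-
$$\frac{\partial E_t}{\partial W} = \frac{\partial E_t}{\partial \hat{y}_t}\,\frac{\partial \hat{y}_t}{\partial h_t}\,\frac{\partial h_t}{\partial W}$$
Here $\frac{\partial h_t}{\partial W}$ itself unrolls back through the earlier hidden states, because $h_t$ depends on $h_{t-1}$, which in turn depends on $W$.
Note that the weight matrix $W$ used in the above expression is different for the input vector and the hidden state vector and is only used in this manner for notational convenience.
Thus the following expression arises:-
$$\frac{\partial E}{\partial W} = \sum_t \frac{\partial E_t}{\partial \hat{y}_t}\,\frac{\partial \hat{y}_t}{\partial h_t}\,\frac{\partial h_t}{\partial W}$$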
Thus, Back-Propagation Through Time only differs from typical Back-Propagation in the fact that the errors at each time step are summed up to calculate the total error.
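To make the summation over time steps concrete, here is a minimal NumPy sketch of Back-Propagation Through Time for a small network. It rests on several assumptions that are not taken from the text above: a squared-error loss at each time step, a linear output layer `W_hy`, and the hypothetical function names `forward` and `bptt`.

```python
import numpy as np


def forward(xs, ys, h0, W_hh, W_xh, W_hy):
    """Run the network over the sequence; return hidden states, outputs and total error."""
    hs, y_hats, total_error = [h0], [], 0.0
    for x_t, y_t in zip(xs, ys):
        h = np.tanh(W_hh @ hs[-1] + W_xh @ x_t)
        y_hat = W_hy @ h
        total_error += 0.5 * np.sum((y_hat - y_t) ** 2)  # sum of per-step errors
        hs.append(h)
        y_hats.append(y_hat)
    return hs, y_hats, total_error


def bptt(xs, ys, h0, W_hh, W_xh, W_hy):
    """Accumulate the gradients of the total error over all time steps."""
    hs, y_hats, total_error = forward(xs, ys, h0, W_hh, W_xh, W_hy)
    dW_hh, dW_xh, dW_hy = np.zeros_like(W_hh), np.zeros_like(W_xh), np.zeros_like(W_hy)
    dh_next = np.zeros_like(h0)            # gradient arriving from later time steps
    for t in reversed(range(len(xs))):
        dy = y_hats[t] - ys[t]             # derivative of the squared error w.r.t. the output
        dW_hy += np.outer(dy, hs[t + 1])
        dh = W_hy.T @ dy + dh_next         # error from this step plus error from the future
        dz = (1.0 - hs[t + 1] ** 2) * dh   # back through the tanh
        dW_hh += np.outer(dz, hs[t])       # hs[t] is the previous hidden state
        dW_xh += np.outer(dz, xs[t])
        dh_next = W_hh.T @ dz              # pass the gradient one step back in time
    return total_error, dW_hh, dW_xh, dW_hy


rng = np.random.default_rng(2)
T, n_in, n_h, n_out = 4, 3, 5, 2  # illustrative sizes
W_hh = rng.normal(scale=0.5, size=(n_h, n_h))
W_xh = rng.normal(scale=0.5, size=(n_h, n_in))
W_hy = rng.normal(scale=0.5, size=(n_out, n_h))
xs, ys, h0 = rng.normal(size=(T, n_in)), rng.normal(size=(T, n_out)), np.zeros(n_h)

total_error, dW_hh, dW_xh, dW_hy = bptt(xs, ys, h0, W_hh, W_xh, W_hy)
print(dW_hh.shape, dW_xh.shape, dW_hy.shape)  # (5, 5) (5, 3) (2, 5)
```

Each gradient is built up with `+=` inside the backward loop, which is the code counterpart of summing the per-step gradients. A finite-difference comparison on a single weight is a simple way to sanity-check such an implementation.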
Although the basic Recurrent Neural Network is fairly effective, it can suffer from a significant problem. For deep networks, the Back-Propagation process can lead to the following issues:-
- Vanishing Gradients: This occurs when the gradients become very small and tend towards zero.
- Exploding Gradients: This occurs when the gradients become too large due to back-propagation.
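Both issues come from the same mechanism: a gradient passed back through many time steps is multiplied by a similar factor at every step. The following toy sketch uses a single scalar factor as an assumed stand-in for the per-step derivative of a real network; `backprop_factor` is a hypothetical name.

```python
def backprop_factor(per_step_factor: float, num_steps: int) -> float:
    """Multiply a unit gradient by the same factor once per time step."""
    grad = 1.0
    for _ in range(num_steps):
        grad *= per_step_factor
    return grad


vanishing = backprop_factor(0.5, 50)  # factor below 1: the gradient shrinks towards zero
exploding = backprop_factor(1.5, 50)  # factor above 1: the gradient grows very large
print(f"{vanishing:.2e} {exploding:.2e}")
```

After 50 steps the first value is below 1e-12 and the second is above 1e8, even though the two per-step factors differ only modestly.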
The problem of Exploding Gradients may be mitigated with a hack: putting a threshold on the gradients being passed back in time. However, this is not seen as a real solution to the problem and may also reduce the efficiency of the network. To deal with such problems, two main variants of Recurrent Neural Networks were developed: Long Short Term Memory Networks and Gated Recurrent Unit Networks.
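One common way to put a threshold on the gradients is to rescale a gradient whenever its norm exceeds a chosen limit. The sketch below shows that variant only as an example; the function name `clip_by_norm` and the threshold value are assumptions, and the text above does not say which form of thresholding is meant.

```python
import numpy as np


def clip_by_norm(grad, threshold):
    """Rescale grad so that its L2 norm never exceeds threshold; direction is preserved."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        return grad * (threshold / norm)
    return grad


large_grad = np.array([3.0, 4.0])  # L2 norm is 5
clipped = clip_by_norm(large_grad, threshold=1.0)
print(np.linalg.norm(clipped))     # approximately 1.0
```

Gradients whose norm is already below the threshold pass through unchanged, so the clipping only takes effect when a gradient has grown too large.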