Getting Started with Transformers
The Transformer is a deep learning model proposed in 2017, used primarily for NLP tasks. If you are working on Natural Language Processing tasks such as text summarization, translation, or emotion prediction, you will encounter this term very often.
RNNs suffer from the vanishing gradient problem, which causes long-term memory loss. An RNN processes text sequentially, word by word. Take a long sentence such as: 'XYZ has been to France in 2019, when there were no cases of covid, and there he met the president of that country.' If we now ask which place 'that country' refers to, an RNN will not be able to recall that the country was 'France', because it encountered the word 'France' many steps earlier. Because processing is sequential, the model effectively learns at the word level rather than over the sentence as a whole. The gradients carry the information used in the RNN parameter updates, and when the gradients become vanishingly small, no real learning is done.
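The "vanishing" part can be sketched with a toy calculation. Backpropagating through many time steps multiplies the gradient by a per-step factor; the factor 0.5 below is invented purely for illustration, standing in for the derivative of a saturated activation:

```python
# Toy illustration of the vanishing gradient: backpropagating through
# T time steps multiplies the gradient by a per-step factor. If that
# factor is below 1, the gradient shrinks exponentially with sequence
# length, so early words stop contributing to learning.
def gradient_after_steps(per_step_factor, num_steps):
    grad = 1.0
    for _ in range(num_steps):
        grad *= per_step_factor
    return grad

short = gradient_after_steps(0.5, 5)    # 0.03125 — still usable
long = gradient_after_steps(0.5, 50)    # ~8.9e-16 — effectively zero
print(short, long)
```

This is why the word 'France' from fifty steps back contributes almost nothing to the update when the model finally reaches 'that country'.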
LSTMs, by adding a few more memory cells and mitigating the vanishing gradient issue, resolved the long-term memory loss problem to some extent. But the problem of sequential processing remained: like an RNN, an LSTM cannot process the whole sentence at once. Rather than processing words in parallel, it processes them one by one, and this cannot be addressed within LSTMs because it is inherent to their sequential design. LSTMs also use static embeddings, meaning that a word is mapped to some n-dimensional vector without knowing its context. But if the context changes, the meaning changes too.
For example, take the word 'point', used in two different contexts:
- The needle has a sharp point.
- It is not polite to point at people.
Here, the word 'point' has a different meaning in each sentence, but when a static embedding is produced, the context is not taken into consideration. Therefore, there was a need for a different architecture: the Transformer, proposed in the paper Attention Is All You Need.
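The static-embedding limitation above can be sketched in a few lines. The lookup table and its 3-dimensional vectors are invented for illustration; real static embeddings (e.g. word2vec) have hundreds of dimensions, but the same flaw:

```python
# Sketch of the static-embedding limitation: a lookup table assigns one
# fixed vector per word, so 'point' gets the same vector in both
# sentences regardless of context. The vectors below are made up.
static_embeddings = {
    "needle": [0.9, 0.1, 0.0],
    "sharp":  [0.7, 0.2, 0.1],
    "point":  [0.4, 0.4, 0.2],
    "polite": [0.1, 0.8, 0.1],
}

vec_1 = static_embeddings["point"]  # from 'a sharp point' (noun)
vec_2 = static_embeddings["point"]  # from 'to point at people' (verb)
print(vec_1 == vec_2)  # True — the two usages are indistinguishable
```

A Transformer instead recomputes each word's representation from its surrounding words, so the two occurrences of 'point' end up with different vectors.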
Neural networks can be broadly classified into two groups: feedforward and feedback networks. Transformers are based on feedforward networks, which means that information moves from the input to the output and there is no feedback loop. In contrast, LSTMs use feedback networks: information can flow in both directions along a feedback path, i.e. the memory can be reused for new predictions.
Now, coming to the architecture of the Transformer. The encoder and decoder are the building blocks of a Transformer. The encoder block turns the sequence of input words into vectors, and the decoder converts those vectors back into a sequence. For example, a sentence in French translated into its English equivalent:
Je suis étudiant –> I am a student.
The encoder architecture has two layers: self-attention and feed-forward. The encoder's inputs first pass through a self-attention layer, and the outputs of that layer are then fed to a feed-forward neural network. Sequential data has temporal characteristics: each word holds some position with respect to the others. For example, take the sentence 'The cat didn't chase the mouse, because it was not hungry.' We can easily tell that 'it' refers to the cat, but that is not as simple for an algorithm. When the model is processing the word 'it', self-attention allows it to associate 'it' with 'cat'. Self-attention is the method of reformulating a word's representation based on all the other words of the sentence.
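The self-attention step can be sketched numerically. This is a minimal version of the scaled dot-product attention from the paper; the sequence length, embedding size, and random weights below are illustrative stand-ins, not the real model's values:

```python
import numpy as np

# Minimal sketch of scaled dot-product self-attention for one sentence.
# Each word vector is projected to a query, key, and value; the attention
# weights (a softmax over query-key dot products) decide how much every
# other word contributes to the new representation of each word.
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8            # 4 words, 8-dimensional embeddings

x = rng.normal(size=(seq_len, d_model))        # word embeddings
W_q = rng.normal(size=(d_model, d_model))      # query projection
W_k = rng.normal(size=(d_model, d_model))      # key projection
W_v = rng.normal(size=(d_model, d_model))      # value projection

Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / np.sqrt(d_model)            # similarity of each word pair
scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
output = weights @ V                           # context-aware word vectors

print(weights.shape)          # (4, 4): one weight per word pair
print(weights.sum(axis=-1))   # each row sums to 1
```

Each row of `weights` says how strongly one word attends to every word in the sentence, which is exactly how 'it' can end up attending strongly to 'cat'.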
The decoder architecture has three layers: self-attention, encoder-decoder attention, and feed-forward. The decoder has the same self-attention and feed-forward layers as the encoder, but between them sits an encoder-decoder attention layer that helps the decoder focus on the relevant parts of the input sentence.
There are six encoders and six decoders stacked in the Transformer architecture. In the bottom encoder, word embedding is performed: each word is transformed into a vector of size 512. The input to every other encoder is the output of the encoder directly below it. Probing studies suggest that the successive encoder layers rediscover something like the classical NLP pipeline: the lower layers capture part-of-speech tags, then constituents, dependencies, semantic roles, coreference, and relations, in roughly that order.
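The data flow through the stack can be sketched as follows. The layer bodies here are deliberate placeholders (a small random linear map plus ReLU) standing in for the real attention and feed-forward sublayers; only the wiring and the 512-dimensional shape reflect the actual architecture:

```python
import numpy as np

# Sketch of the encoder stack: the bottom encoder receives the word
# embeddings (size 512 in the paper), and each of the six encoders feeds
# its output to the encoder directly above it. Each layer preserves the
# (sequence length, 512) shape, which is what makes stacking possible.
rng = np.random.default_rng(0)
d_model, num_layers, seq_len = 512, 6, 3

embeddings = rng.normal(size=(seq_len, d_model))   # bottom-encoder input
layers = [rng.normal(scale=0.05, size=(d_model, d_model))
          for _ in range(num_layers)]

x = embeddings
for W in layers:                  # each encoder consumes the one below
    x = np.maximum(x @ W, 0.0)    # placeholder for attention + feed-forward

print(x.shape)  # (3, 512): the shape is preserved through the stack
```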
The very last layer is a softmax layer, which assigns a probability to each word in the vocabulary; all of these probabilities sum up to 1.
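This final step can be sketched directly. The tiny four-word vocabulary and the raw scores (logits) below are invented for illustration; a real model's vocabulary has tens of thousands of entries:

```python
import math

# Sketch of the final softmax: turn the model's raw scores (logits),
# one per vocabulary word, into probabilities that sum to 1.
vocab = ["I", "am", "a", "student"]
logits = [2.0, 1.0, 0.5, 3.0]     # invented scores for illustration

exps = [math.exp(z - max(logits)) for z in logits]  # stabilised exponentials
probs = [e / sum(exps) for e in exps]

print(sum(probs))                       # 1.0 (up to floating point)
print(vocab[probs.index(max(probs))])   # highest-probability word
```

The word with the highest probability is the one the model emits at that position.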