ML – Attention mechanism

  • Last Updated : 16 Jul, 2020

Introduction:
Assuming that we are already familiar with how vanilla Seq2Seq or Encoder-Decoder models work, let us focus on how to take them up a notch and improve the accuracy of our predictions. We'll consider the classic example of Machine Translation.

Motivation:
In a Seq2Seq model, the encoder reads the input sentence once and encodes it into a single vector. At each time step, the decoder uses this embedding to produce an output. But humans don't translate a sentence like this. We don't memorize the input and then try to recreate it; if we did, we would likely forget certain words. Also, is the entire sentence important at every time step, while producing every word? No, only certain words are. Ideally, we want to feed the decoder only the relevant information (the encoding of the relevant words).
“Learn to pay attention only to certain important parts of the sentence.”

Goal:
Our goal is to come up with a probability distribution that says, at each time step, how much importance or attention should be paid to each of the input words.

How does it work:
Consider the following Encoder-Decoder architecture with Attention.

[Figure: Encoder-Decoder with Attention]

We can observe 3 sub-parts/components in the above diagram:

  • Encoder
  • Attention
  • Decoder

Encoder:

[Figure: Encoder]


Contains an RNN layer (which can be an LSTM or a GRU):
  1. There are 4 inputs: x_{0}, x_{1}, x_{2}, x_{3}
  2. Each input goes through an Embedding Layer.
  3. Each input generates a hidden representation.
  4. This generates the outputs for the Encoder: h_{0}, h_{1}, h_{2}, h_{3}
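
As a rough illustration (not part of the original article), here is a minimal NumPy sketch of such an encoder; the vocabulary size, embedding and hidden dimensions, and the plain tanh RNN cell are all assumptions made for the example:

import numpy as np

np.random.seed(0)
vocab_size, embed_dim, hidden_dim = 10, 8, 16        # assumed toy sizes

# Assumed parameters: an embedding table and a simple tanh RNN cell
W_embed = 0.1 * np.random.randn(vocab_size, embed_dim)
W_xh = 0.1 * np.random.randn(embed_dim, hidden_dim)
W_hh = 0.1 * np.random.randn(hidden_dim, hidden_dim)

def encode(token_ids):
    """Pass the inputs x_0..x_3 through the embedding layer and the RNN,
    returning the hidden representations h_0..h_3."""
    h = np.zeros(hidden_dim)
    hidden_states = []
    for t in token_ids:                              # x_0, x_1, x_2, x_3
        x = W_embed[t]                               # embedding lookup
        h = np.tanh(x @ W_xh + h @ W_hh)             # hidden representation
        hidden_states.append(h)
    return np.stack(hidden_states)

H = encode([1, 4, 2, 7])                             # rows are h_0, h_1, h_2, h_3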

Attention:

  • Our goal is to generate the context vectors.
  • For example, the context vector C_{1} tells us how much importance/attention should be given to each of the inputs x_{0}, x_{1}, x_{2}, x_{3}.
  • This layer in turn contains 3 sub-parts:
      • Feed Forward Network
      • Softmax Calculation
      • Context vector generation
[Figure: Attention]

Feed Forward Network:

[Figure: Feed Forward Network]


Each of A_{00}, A_{01}, A_{02}, A_{03} is a simple feed-forward neural network with one hidden layer. The inputs to this feed-forward network are:
  • The previous Decoder state (S_{0})
  • The outputs of the Encoder states (h_{i})

Each unit generates an output e_{0i} = g(S_{0}, h_{i}), giving e_{00}, e_{01}, e_{02}, e_{03}.
Here g can be any activation function such as sigmoid, tanh or ReLU.
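
Continuing the encoder sketch above (again, only an illustration), one such scoring network can be written with a single hidden layer of an assumed size attn_dim and tanh as g:

attn_dim = 12                                        # assumed hidden-layer size
W_s = 0.1 * np.random.randn(hidden_dim, attn_dim)    # acts on the decoder state S_0
W_h = 0.1 * np.random.randn(hidden_dim, attn_dim)    # acts on an encoder output h_i
v = 0.1 * np.random.randn(attn_dim)

def score(s_prev, h_i):
    """One-hidden-layer feed-forward network: e_0i = g(S_0, h_i)."""
    return v @ np.tanh(s_prev @ W_s + h_i @ W_h)

s0 = np.zeros(hidden_dim)                            # previous decoder state S_0
e = np.array([score(s0, h_i) for h_i in H])          # e_00, e_01, e_02, e_03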

Softmax Calculation:

[Figure: Softmax calculation]


E_{0i} = \frac{\exp(e_{0i})}{\sum_{k=0}^{3}\exp(e_{0k})}
These E_{00}, E_{01}, E_{02}, E_{03} are called the attention weights. They decide how much importance should be given to the inputs x_{0}, x_{1}, x_{2}, x_{3}.
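
In the running sketch, the attention weights are just a softmax over the scores (the max-subtraction is a standard numerical-stability trick, not part of the formula above):

def softmax(x):
    """Turn the scores e_0i into attention weights E_0i that sum to 1."""
    z = np.exp(x - np.max(x))      # subtract the max for numerical stability
    return z / z.sum()

E = softmax(e)                     # E_00, E_01, E_02, E_03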

Context Vector Generation:

[Figure: Context vector generation]


C_{0} = E_{00} \ast h_{0} + E_{01} \ast h_{1} + E_{02} \ast h_{2} + E_{03} \ast h_{3}.
We find C_{1}, C_{2}, C_{3} in the same way and feed them to the corresponding RNN units of the Decoder layer.

So the context vector is the weighted sum of the Encoder's outputs, with the weights given by the probability distribution, which is nothing but the attention paid to the input words.
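
In the running sketch this weighted sum is a single line:

# C_0 = E_00*h_0 + E_01*h_1 + E_02*h_2 + E_03*h_3
C0 = E @ H                         # weights (4,) times encoder outputs (4, hidden_dim)
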
Decoder:
We feed these Context Vectors to the RNN units of the Decoder layer. Each decoder unit produces an output, which is the translated word for that time step.
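
A hypothetical decoder step, to close the sketch: the context vector and the previous decoder state update a tanh RNN cell, and a linear layer over an assumed target vocabulary gives the distribution for the output word (all names and sizes here are illustrative, not prescribed by the article):

target_vocab = 10                                       # assumed target vocabulary size
W_cs = 0.1 * np.random.randn(hidden_dim, hidden_dim)    # acts on the context vector
W_ss = 0.1 * np.random.randn(hidden_dim, hidden_dim)    # acts on the previous decoder state
W_out = 0.1 * np.random.randn(hidden_dim, target_vocab)

def decode_step(s_prev, context):
    """One decoder RNN step: update the state with the context vector and
    predict a probability distribution over the target vocabulary."""
    s_new = np.tanh(context @ W_cs + s_prev @ W_ss)
    return s_new, softmax(s_new @ W_out)

s1, p_word = decode_step(s0, C0)    # p_word: probabilities of the first translated word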

Observation:
If we knew the true attention weights E_{ij}, it would be easy to compute the error and adjust the parameters to minimise this loss. But in practice, we do not have them: someone would need to manually annotate each output word with the set of contributing input words, which is not feasible.

Then why should this model work?
This model is better than the alternatives because we are asking it to make an informed choice. Given enough data, the model should be able to learn these attention weights on its own, just as humans do. And indeed, attention-based models work better than vanilla Encoder-Decoder models.



