
Self-attention in NLP

Self-attention was proposed by researchers at Google Research and Google Brain. It was proposed to address the challenges faced by encoder-decoder models in dealing with long sequences. The authors also provide two variants of attention and the Transformer architecture. This Transformer architecture achieves state-of-the-art results on the WMT translation task.

Encoder-Decoder model:



The encoder-decoder model was presented in two different papers, which differ in the relationship they assume between the input and output lengths.

From a high level, the model comprises two sub-models: an encoder and a decoder.



Encoder-decoder architecture

Attention mechanism:

Attention was proposed by the authors of the encoder-decoder network as an extension of it. It was proposed to overcome the limitation of the encoder-decoder model, which encodes the input sequence to one fixed-length vector from which each output time step is decoded. This issue is believed to be more of a problem when decoding long sequences.
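
As a rough illustration (a minimal NumPy sketch under simplified assumptions, not the exact formulation of the papers above), the snippet below shows the core idea: at each decoding step a fresh context vector is built as a softmax-weighted sum of all encoder states, instead of compressing the whole input into a single fixed-length vector. The names and shapes are made up for the example.

# Minimal sketch of attention over encoder states (toy shapes, random values).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_context(decoder_state, encoder_states):
    """decoder_state: (d,), encoder_states: (seq_len, d) -> context vector (d,)."""
    scores = encoder_states @ decoder_state   # one alignment score per input position
    weights = softmax(scores)                 # attention distribution over the input
    return weights @ encoder_states           # weighted sum of encoder states

encoder_states = np.random.randn(6, 8)        # 6 input tokens, hidden size 8
decoder_state = np.random.randn(8)
context = attention_context(decoder_state, encoder_states)
print(context.shape)                          # (8,)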

Self-attention mechanism:

Self-Attention

The attention mechanism allows the output to focus on the input while producing the output, whereas the self-attention model allows inputs to interact with each other (i.e., it calculates the attention of all other inputs with respect to one input).

The above procedure is applied to all the input sequences. Mathematically, the self-attention matrix for input matrices (Q, K, V) is calculated as:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

where Q, K, and V are the matrices formed by stacking the query, key, and value vectors, and d_k is the dimension of the key vectors.
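
The following is a minimal NumPy sketch of the scaled dot-product self-attention formula above; the toy shapes, random projection weights, and helper names are assumptions made for illustration, not values from the paper.

# Scaled dot-product self-attention on a toy input (4 tokens, dimension 8).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    """Q, K, V: (seq_len, d_k) matrices stacking the query/key/value vectors."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (seq_len, seq_len) pairwise scores
    weights = softmax(scores)         # each row sums to 1
    return weights @ V                # (seq_len, d_k) attended outputs

X = np.random.randn(4, 8)             # 4 token embeddings
W_q, W_k, W_v = (np.random.randn(8, 8) for _ in range(3))  # learned projections (random here)
out = self_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)                      # (4, 8)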

Multi-headed attention:

Multi-headed attention

In the attention paper, the authors proposed another type of attention mechanism called multi-headed attention. The step-by-step process to calculate multi-headed self-attention is:

1. Project the input into separate query, key, and value matrices for each head, using that head's own learned weight matrices.
2. Apply scaled dot-product attention independently (and in parallel) for each head.
3. Concatenate the outputs of all the heads.
4. Project the concatenated result with a final learned weight matrix W_O.

Mathematically, multi-head attention can be represented by:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
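
Below is a minimal NumPy sketch that follows the multi-head formula above: each head applies its own learned projections, the head outputs are concatenated, and the result is projected with W_O. The head count and dimensions are toy values for illustration, not the paper's configuration.

# Multi-head self-attention built from per-head scaled dot-product attention.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head_self_attention(X, W_q, W_k, W_v, W_o):
    """X: (seq_len, d_model); W_q/W_k/W_v: lists of per-head (d_model, d_k) matrices."""
    heads = [
        scaled_dot_product_attention(X @ wq, X @ wk, X @ wv)
        for wq, wk, wv in zip(W_q, W_k, W_v)
    ]
    return np.concatenate(heads, axis=-1) @ W_o   # concat heads, then final projection

seq_len, d_model, n_heads = 4, 16, 2
d_k = d_model // n_heads
X = np.random.randn(seq_len, d_model)
W_q = [np.random.randn(d_model, d_k) for _ in range(n_heads)]
W_k = [np.random.randn(d_model, d_k) for _ in range(n_heads)]
W_v = [np.random.randn(d_model, d_k) for _ in range(n_heads)]
W_o = np.random.randn(n_heads * d_k, d_model)
print(multi_head_self_attention(X, W_q, W_k, W_v, W_o).shape)   # (4, 16)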

Attention in Transformer architecture:

The Transformer architecture uses multi-headed attention in three places:

Self-attention in the encoder

Masked self-attention in the decoder (a minimal masking sketch is given after this list)

Encoder-decoder attention in the decoder, where the queries come from the decoder and the keys and values come from the encoder output
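
The sketch below illustrates only the masking used in decoder self-attention: positions are blocked from attending to later positions by setting their scores to negative infinity before the softmax. Encoder self-attention uses the same computation without the mask. This is a simplified NumPy illustration with toy shapes, not the Transformer implementation itself.

# Causal (masked) self-attention: each position attends only to itself and earlier positions.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)   # True above the diagonal
    scores = np.where(mask, -np.inf, scores)                  # block attention to future tokens
    return softmax(scores) @ V

Q = K = V = np.random.randn(4, 8)
print(masked_self_attention(Q, K, V).shape)                   # (4, 8)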

Complexity and Results:

The advantage of using self-attention layers in NLP tasks is that a self-attention layer connects all positions with a constant number of sequential operations, and it is cheaper per layer than a recurrent layer whenever the sequence length n is smaller than the representation dimension d. Below is the table comparing the complexity of different layer types:

Layer Type                    | Complexity per Layer | Sequential Operations | Maximum Path Length
Self-Attention                | O(n^2 * d)           | O(1)                  | O(1)
Recurrent                     | O(n * d^2)           | O(n)                  | O(n)
Convolutional                 | O(k * n * d^2)       | O(1)                  | O(log_k(n))
Self-Attention (restricted)   | O(r * n * d)         | O(1)                  | O(n / r)

Results:

On the WMT 2014 English-to-German and English-to-French translation tasks, the Transformer outperformed the best previously reported models (including ensembles) while requiring significantly less training cost.

References:

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. "Attention Is All You Need." Advances in Neural Information Processing Systems (NIPS), 2017. arXiv:1706.03762.

