
Self-Attention in NLP

Self-attention was proposed by researchers at Google Research and Google Brain to address the challenges the encoder-decoder architecture faces when dealing with long sequences. The authors also introduced two variants of attention along with the Transformer architecture built on them, which generates state-of-the-art results on WMT translation tasks.

Attention Mechanism in NLP

Deep learning has recently made significant progress in the area of attention mechanisms, particularly for natural language processing tasks like machine translation, image captioning and dialogue generation. The purpose of this mechanism is to improve the performance of the encoder-decoder (seq2seq) RNN model. In this blog post, I'll attempt to describe the attention mechanism and how it leads to the self-attention used in Transformers.



Attention was proposed by the authors of the encoder-decoder network as an extension of it. It was proposed to overcome the limitation of the encoder-decoder model, which encodes the entire input sequence into a single fixed-length vector from which each output time step is decoded. This issue is believed to be more of a problem when decoding long sequences.

Encoder-Decoder Model

The encoder-decoder model was presented in two different papers, which differ mainly in the relationship they assume between the input and output lengths.



At a high level, the model comprises two sub-models: an encoder and a decoder.

Encoder-decoder architecture
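To make this concrete, below is a minimal sketch of the encoder-decoder idea using PyTorch GRUs. The class names, layer sizes and vocabulary sizes are illustrative assumptions rather than the configuration of any particular paper:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Reads the source sequence and compresses it into a single context vector."""
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):                       # src: (batch, src_len)
        _, hidden = self.rnn(self.embed(src))     # hidden: (1, batch, hid_dim)
        return hidden                             # the fixed-length context vector

class Decoder(nn.Module):
    """Generates the target sequence step by step, conditioned on the context vector."""
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tgt, hidden):               # tgt: (batch, tgt_len)
        output, hidden = self.rnn(self.embed(tgt), hidden)
        return self.out(output), hidden           # logits over the target vocabulary

# Toy usage: a batch of 2 source sequences of length 5, target sequences of length 7.
src = torch.randint(0, 1000, (2, 5))
tgt = torch.randint(0, 1000, (2, 7))
context = Encoder(1000)(src)
logits, _ = Decoder(1000)(tgt, context)
print(logits.shape)   # torch.Size([2, 7, 1000])
```

Note how the entire source sequence is squeezed into the single `context` tensor; this is exactly the fixed-length bottleneck that attention was introduced to remove.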

Attention Layer in Transformer

Transformer

Self-Attention Mechanism

Natural language processing (NLP) tasks have been revolutionised by transformer-based models, in which the self-attention mechanism is an essential component. The self-attention mechanism, which was first presented in the “Attention is All You Need” paper by Vaswani et al., enables models to dynamically determine the relative importance of various words in a sequence, improving the ability to capture long-range dependencies.

Multi-headed attention

The attention mechanism allows the output to focus on the input while producing the output, whereas the self-attention model allows inputs to interact with each other, i.e., it calculates the attention of every input with respect to every other input. Concretely, each input is projected into a query, a key and a value vector; every query is compared against every key with a dot product, the scores are scaled and passed through a softmax, and the resulting weights are used to form a weighted sum of the value vectors.

This procedure is applied across all positions of the input sequence. Mathematically, the self-attention output for the input matrices (Q, K, V) is calculated as:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

where Q, K and V are the matrices formed by stacking the query, key and value vectors, and d_k is the dimensionality of the key vectors.
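As a concrete illustration, here is a minimal NumPy sketch of this formula. The random projection matrices stand in for learned weights, and the token count and dimensions are arbitrary assumptions:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention: softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v               # project inputs to queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # similarity of every position with every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted sum of the value vectors

# Toy example: 4 tokens with 8-dimensional embeddings, d_k = d_v = 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)         # (4, 8)
```

Each row of the output is a mixture of all the value vectors, weighted by how strongly that position attends to every other position.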

Scaled Dot-Product and Multi-Head Attention

In the attention paper, the authors also proposed another type of attention mechanism called multi-headed attention. The step-by-step process to calculate multi-headed self-attention is:

1. Linearly project the queries, keys and values h times with different learned projection matrices.
2. Apply scaled dot-product attention to each of the h projections in parallel, producing h output "heads".
3. Concatenate the h heads.
4. Project the concatenated output with one more learned matrix to obtain the final result.

Mathematically, multi-head attention can be represented by:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) Wᴼ,  where head_i = Attention(Q W_iᵠ, K W_iᴷ, V W_iⱽ)

Each head applies the scaled dot-product attention defined above to its own learned projections of Q, K and V.
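For illustration, the sketch below runs multi-headed self-attention on a toy batch using PyTorch's built-in nn.MultiheadAttention module; the model dimension, number of heads and sequence length are arbitrary choices:

```python
import torch
import torch.nn as nn

d_model, n_heads, seq_len, batch = 512, 8, 10, 2

# The module splits d_model across 8 heads of size 64 each, applies scaled
# dot-product attention per head, then concatenates and projects the heads.
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)

x = torch.randn(batch, seq_len, d_model)   # a batch of token embeddings
# Self-attention: queries, keys and values all come from the same input x.
output, attn_weights = mha(x, x, x)

print(output.shape)        # torch.Size([2, 10, 512])
print(attn_weights.shape)  # torch.Size([2, 10, 10]) -- averaged over the heads by default
```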

The Transformer architecture uses multi-headed attention in three places:

1. Encoder-decoder attention, where the queries come from the previous decoder layer and the keys and values come from the encoder output.
2. Self-attention in the encoder, where the queries, keys and values all come from the previous encoder layer.
3. Masked self-attention in the decoder, where each position may only attend to earlier positions in the output sequence.

Complexity and Results

The advantage of using self-attention layers for NLP tasks is that they are computationally cheaper than the alternatives: a self-attention layer connects all positions with a constant number of sequential operations, and its per-layer complexity is lower than that of a recurrent layer whenever the sequence length n is smaller than the representation dimensionality d. The table below compares the complexity of the different layer types (n is the sequence length, d the representation dimension, k the convolution kernel width and r the size of the restricted neighbourhood):

| Layer Type | Complexity per Layer | Sequential Operations | Maximum Path Length |
|---|---|---|---|
| Self-Attention | O(n² · d) | O(1) | O(1) |
| Recurrent | O(n · d²) | O(n) | O(n) |
| Convolutional | O(k · n · d²) | O(1) | O(log_k(n)) |
| Self-Attention (restricted) | O(r · n · d) | O(1) | O(n / r) |

Results

On the WMT 2014 English-to-German and English-to-French translation tasks, the Transformer achieved state-of-the-art BLEU scores while requiring a fraction of the training cost of the previous best models.


Frequently Asked Questions (FAQs)

Q. What is self-attention in natural language processing (NLP)?

NLP models, especially Transformer models, use a mechanism called self-attention, typically implemented as scaled dot-product attention. When generating predictions, it enables the model to assign varying weights to different words within a sequence, according to how relevant each word is to the word currently being considered.

Q. How does self-attention work?

Each word in a sequence has three vectors associated with it in self-attention: a Query (Q), a Key (K) and a Value (V). The attention score between two words is calculated by taking the dot product of one word's query with the other word's key and dividing the result by the square root of the key vector's dimensionality. These scores are passed through a softmax and used to weight the value vectors, and the resulting weighted sum is the self-attention mechanism's output.

Q. How is self-attention used in transformers?

Vaswani et al. introduced Transformers, which use self-attention as a fundamental building block. The encoder and decoder in the model are made up of several layers of self-attention mechanisms. The model can capture complex dependencies and process input sequences in parallel thanks to the self-attention mechanism.
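As a brief sketch, this is how a stack of self-attention layers can be assembled with PyTorch's nn.TransformerEncoderLayer and nn.TransformerEncoder; the hyperparameters below are illustrative, not those of any specific published model:

```python
import torch
import torch.nn as nn

# Each encoder layer contains multi-head self-attention followed by a
# position-wise feed-forward network; six of them are stacked here.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

tokens = torch.randn(2, 10, 512)   # (batch, sequence length, model dimension)
print(encoder(tokens).shape)       # torch.Size([2, 10, 512]) -- all positions processed in parallel
```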

Q. Are there any challenges or limitations with self-attention?

Self-attention can become computationally expensive, particularly as the sequence length grows, because its cost scales quadratically with the number of tokens. Methods such as multi-head attention and scaled dot-product attention address some of these problems, and benchmarks such as the Long-Range Arena (LRA) have been proposed to evaluate the efficiency of attention variants on very long sequences.

