Dilated and Global Sliding Window Attention
To perform various NLP tasks, transformer-based models such as BERT and SpanBERT are widely used. However, these models have limited capabilities due to the quadratic cost of their full self-attention mechanism, and they tend to fail when reading inputs containing long texts. To address this, the Longformer (Long-Document Transformer) was introduced in 2020.
The Longformer architecture needs a self-attention component capable of reading long spans of text. A standard transformer's full self-attention, however, takes O(n²) time and O(n²) memory in the input length n, which is very inefficient (as shown in Fig 1). This is where sparse attention patterns are introduced to make the process efficient. The sliding window attention pattern (as discussed in the previous article) reduces this cost so that it scales linearly with n. This pattern has two variations, which are discussed in this article.
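The complexity difference can be made concrete with a rough count of attended token pairs (a sketch with illustrative numbers, not the Longformer implementation):

```python
# Full self-attention scores all n*n token pairs, while a sliding
# window of width w scores only about n*w pairs — linear in n.
n, w = 4096, 512                 # illustrative sequence length and window
full_pairs = n * n               # quadratic in n
window_pairs = n * (w + 1)       # each token sees w neighbours plus itself
print(full_pairs / window_pairs) # roughly n / w, i.e. ~8x fewer pairs here
```

The gap widens as n grows: doubling n doubles the windowed cost but quadruples the full-attention cost.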
Sliding Window Attention:
Sliding window attention restricts each token to attending only a fixed-size window of neighbouring tokens, much like sliding a fixed-size kernel over an input with a fixed step size. It is used to improve the efficiency of the Longformer. Comparing the sliding window attention pattern (Fig 2) to the full-connection pattern (Fig 1), it can easily be observed that the former is much more efficient than the latter.
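A minimal NumPy sketch of the banded mask this pattern induces (the sequence length and window size here are illustrative choices, not values from the paper):

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where token i may attend to token j only when
    |i - j| <= window // 2, i.e. a band around the diagonal."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window // 2

mask = sliding_window_mask(seq_len=8, window=4)
# Each row has at most window + 1 = 5 True entries, so the total
# attention work grows linearly with seq_len instead of quadratically.
```

In a real model this mask is applied to the attention scores before the softmax (masked positions are set to -inf); the banded structure is what replaces the dense n × n pattern of Fig 1.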
There are two types of sliding window attention models:
- Dilated SWA
- Global SWA
Dilated Sliding Window Attention:
The dilation used here borrows its concept from dilated CNNs. Adding a dilation on top of the sliding window helps achieve better coverage of the input while keeping the computational cost the same as before. The dilation rate can be tuned to the task: small text inputs can be parsed with a low dilation rate at minimal computational cost, while larger inputs can be traversed by increasing this parameter. With a dilation rate of d, d − 1 gaps are introduced between attended positions. Fig 3 depicts the increase in the receptive field with a dilation rate of 2 (one gap introduced between attended tokens).
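A sketch of the dilated variant of the mask, again with illustrative sizes: only every `dilation`-th position inside the band is kept, so the same number of attended tokens spans a wider range.

```python
import numpy as np

def dilated_window_mask(seq_len: int, window: int, dilation: int) -> np.ndarray:
    """Token i attends to positions i + k*dilation for k in
    [-window//2, window//2]; dilation - 1 positions are skipped
    between attended tokens, enlarging the receptive field."""
    diff = np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :]
    in_band = np.abs(diff) <= (window // 2) * dilation
    on_stride = diff % dilation == 0  # keep only every dilation-th offset
    return in_band & on_stride

# With dilation=2, token 6 attends to positions 2, 4, 6, 8, 10 —
# the same 5 tokens as before, but spread over twice the range.
mask = dilated_window_mask(seq_len=12, window=4, dilation=2)
```

Setting `dilation=1` recovers the plain sliding window mask, which is one way to check the construction.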
Global Sliding Window Attention:
This attention pattern deals with task-specific cases where particular tokens of the input (for example, a classification token or question tokens) must see the whole sequence. The symmetric nature of this pattern helps in finding particular sequences: a global token attends to all tokens along its row, and all tokens attend to it along its column, thus giving global attention to such details. Fig 4 shows how such special cases are handled by this pattern while maintaining the sliding window protocol.
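The row-and-column symmetry described above can be sketched by adding full rows and columns to the banded mask (the choice of position 0 as the global token is a hypothetical example):

```python
import numpy as np

def global_window_mask(seq_len: int, window: int, global_idx) -> np.ndarray:
    """Sliding window band plus symmetric global attention: tokens at
    global_idx attend to every position (full row) and every position
    attends to them (full column)."""
    idx = np.arange(seq_len)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window // 2
    mask[global_idx, :] = True   # global tokens see the whole sequence
    mask[:, global_idx] = True   # the whole sequence sees global tokens
    return mask

# E.g. making position 0 (a hypothetical [CLS]-style token) global:
mask = global_window_mask(seq_len=8, window=2, global_idx=[0])
```

Because only a few tokens are global, the extra rows and columns add O(n) work per global token, preserving the overall linear scaling.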
The above-discussed attention patterns increase the efficiency of the Longformer, making it a suitable alternative to the standard Transformer for long documents. These attention patterns also give good results on autoregressive language modelling, where the prior tokens are taken as input to predict the next token.