
Dilated and Global Sliding Window Attention

Last Updated : 31 Jan, 2024

“Dilated” and “Global Sliding Window” attention are adaptations of the attention mechanism used in neural networks, particularly in natural language processing and computer vision.

Prerequisites: Attention Mechanism | ML, Sliding Window Attention, Dilated CNN

Transformer-based models such as BERT and SpanBERT have been used to carry out numerous Natural Language Processing tasks. However, their self-attention mechanism limits their potential: these models frequently fail to process and comprehend data that contains lengthy texts. In 2020, the Longformer (Long-Document Transformer) was introduced to address this limitation. Longformer seeks to resolve the problems posed by long sequences, i.e., inputs longer than 512 tokens. To achieve this, it adapts a CNN-like pattern called sliding window attention, which covers lengthy input texts efficiently by combining sparse attention with a sliding window approach to manage long sequences.

What is Longformer?

Longformer is a transformer-based model designed to handle long sequences efficiently. Its sliding window attention mechanism reduces the quadratic complexity of conventional self-attention by letting each token attend to only a subset of the other tokens. At the same time, Longformer preserves a wider context for each token by including a global attention component that captures dependencies beyond the window size. It is a scalable technique for handling long-range dependencies in natural language processing and has been successfully applied to a variety of tasks, including document classification, question answering, and text generation.

The Longformer architecture has a self-attention component capable of reading long spans of text. However, full self-attention scales as O(n²) in time and memory with the input length n, which is very inefficient (as shown in the figure below). This is where alternative attention patterns are introduced to make the process efficient. The sliding window attention model (discussed in the previous article) makes the process far more efficient. This model has two variations, which are discussed in this article.

Figure: Full O(n²) attention connections

What is Sliding Window Attention?

Sliding window attention is an attention pattern in which each token attends only to a fixed-size window of neighbouring tokens, much like sliding a fixed-size filter over an m × n image with a fixed stride. It is used to improve the efficiency of the Longformer. Comparing the sliding window attention pattern (figure below) to the full-connection pattern (figure above), it is easy to see that this method is far more efficient than full self-attention.

Figure: Sliding window attention connections
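To make the efficiency gain concrete, the following is a minimal Python sketch (an illustrative example with arbitrary values for the sequence length n and window size w, not Longformer's actual code) that counts how many query-key pairs are evaluated under full self-attention versus a sliding window:

```python
import numpy as np

def full_attention_pairs(n):
    # full self-attention: every token attends to every other token
    return n * n

def sliding_window_pairs(n, w):
    # sliding window: each token attends only to the w // 2 tokens
    # on either side of it (clipped at the sequence boundaries)
    half = w // 2
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) <= half
    return int(mask.sum())

n, w = 4096, 512
print(full_attention_pairs(n))     # 16,777,216 pairs, grows as O(n^2)
print(sliding_window_pairs(n, w))  # about 2 million pairs, grows as O(n * w)
```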

There are two types of sliding window attention models:

  1. Dilated Sliding Window Attention
  2. Global Sliding Window Attention

Dilated Attention and Global Sliding Window Attention are two attention mechanisms that have been proposed to improve the performance and efficiency of transformer-based models in natural language processing tasks.

Dilated Sliding Window Attention in Deep Learning

Dilated attention, also known as sparse attention or fixed-pattern attention, introduces sparsity into the transformer's self-attention mechanism by skipping specific attention connections. To achieve this sparsity, the attention pattern is dilated so that not all tokens attend to each other.

In conventional self-attention, each token attends to every other token in the sequence. Dilated attention, on the other hand, skips some tokens, creating gaps (dilations) in the connection pattern. As the number of attention connections to compute decreases, so does the computational cost of self-attention. With a carefully planned dilation pattern, dilated attention can capture long-range relationships more effectively.

The concept of dilated sliding window attention is similar to the dilated CNN used in image recognition, object detection, and semantic segmentation, where dilation increases the receptive field of a filter and captures features at different scales. Adding dilation on top of the sliding window algorithm gives better coverage of the input while keeping the computational cost the same as before. The dilation rate can be tuned to the input: small text inputs can be parsed with a low dilation rate at minimal computational cost, while larger input texts can be traversed by increasing this parameter. The figure below depicts the increase in the receptive field when a dilation rate of 2 is introduced (a dilation rate of d leaves d − 1 gaps between consecutive attended positions).
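As a quick illustration (a toy Python sketch; the token index, window size, and dilation values below are arbitrary example choices), listing the positions a single token attends to shows how a dilation rate of 2 doubles the span covered by the same number of connections:

```python
def attended_positions(i, w, dilation, seq_len):
    # positions attended by the token at index i for window size w
    half = w // 2
    return [i + k * dilation
            for k in range(-half, half + 1)
            if 0 <= i + k * dilation < seq_len]

print(attended_positions(i=10, w=5, dilation=1, seq_len=32))  # [8, 9, 10, 11, 12] -> span of 4
print(attended_positions(i=10, w=5, dilation=2, seq_len=32))  # [6, 8, 10, 12, 14] -> span of 8
```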

Figure: Dilated sliding window attention connections

In Transformers, a dilated sliding window is a technique for incorporating local context while processing sequential data. The typical attention mechanism captures relationships between all positions in the input sequence, which is computationally expensive for long sequences. With dilated sliding windows, the attention mechanism is restricted to a subset of positions within a certain window size, skipping some positions in between.

Here’s how the dilated sliding window technique works in Transformers (a code sketch follows the list):

  • Window Size: Define the size of the window that determines the local context to be considered. For example, if the window size is 5, the attention mechanism will only consider the 5 surrounding positions for each position.
  • Dilation Factor: Introduce a dilation factor that specifies the gap between the positions included in the window. A dilation factor of 1 means adjacent positions are considered, while a larger dilation factor skips positions in between. For example, with a dilation factor of 2, positions 1, 3, 5, etc., may be considered.
  • Attention Calculation: Apply the attention mechanism within the defined window for each position. The attention mechanism computes the attention weights between the current position and the positions within the window.
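Putting these steps together, here is a minimal Python sketch of dilated sliding window attention (an illustration under assumed example values for the window size and dilation factor, not Longformer's actual implementation). It builds a dilated window mask and applies masked scaled dot-product attention:

```python
import numpy as np

def dilated_window_mask(seq_len, window_size, dilation):
    # position i may attend to position j only if j lies on the dilated
    # grid i + k * dilation for k in [-window_size // 2, window_size // 2]
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    half = window_size // 2
    for i in range(seq_len):
        for k in range(-half, half + 1):
            j = i + k * dilation
            if 0 <= j < seq_len:
                mask[i, j] = True
    return mask

def masked_attention(q, k, v, mask):
    # scaled dot-product attention restricted to the allowed pairs
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(mask, scores, -1e9)  # block disallowed pairs
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# toy example: 12 tokens, 8-dim embeddings, window size 5, dilation 2
rng = np.random.default_rng(0)
x = rng.normal(size=(12, 8))
mask = dilated_window_mask(seq_len=12, window_size=5, dilation=2)
out = masked_attention(x, x, x, mask)
print(out.shape)  # (12, 8)
```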

By using dilated sliding windows, Transformers can focus on capturing dependencies and relationships within a local context while reducing the computational complexity associated with considering all positions in the input sequence. This technique can be particularly beneficial for long sequences, as it allows the model to capture both local and global dependencies efficiently.

It’s worth noting that while dilated sliding windows have been explored in the context of CNNs and self-attention mechanisms in Transformers, they are not a standard feature of the original Transformer model proposed in the “Attention is All You Need” paper by Vaswani et al. However, researchers have explored variations and extensions of Transformers that incorporate dilated sliding windows to address specific needs or constraints in certain applications.

Global Sliding Window Attention in Deep Learning

Global sliding window attention is an attention mechanism used in transformer-based models to address the quadratic complexity of traditional self-attention, in which attention weights are computed for all pairs of tokens in the sequence. It limits the attention computation to a fixed-size window that slides across the sequence, reducing the computational complexity while still capturing contextual information within a limited context window.

In global sliding window attention, instead of attending to all tokens, a fixed-size attention window attends to a subset of tokens within its range and slides to cover the entire sequence. By limiting the attention computation to a fixed-size window, the time complexity is reduced from quadratic to roughly linear in the sequence length, making it more efficient for long sequences. The mechanism captures contextual information within a limited context window, trading the ability to model dependencies over the entire sequence against the cost of computation.

This attention model also handles task-specific cases where a particular piece of text in the input must be attended to globally. The symmetric nature of global attention means that for such designated tokens the full row and column of the attention pattern are kept, so these tokens attend to, and are attended by, every other token in the input. The figure below shows how such cases are handled while maintaining the sliding window pattern.

Figure: Global sliding window attention connections

The attention models discussed above increase the efficiency of the Longformer, making it a suitable alternative to the standard Transformer. These attention patterns also give better results on autoregressive language modelling, where the preceding words are taken as input to predict the next token.

Here’s how global sliding window attention works (a code sketch follows the list):

  • Window Size: Determine the desired size of the attention window. This size determines the number of tokens that the window will consider at a time during the attention computation. For example, if the window size is set to 5, the attention window will consider 5 tokens at a time.
  • Window Movement: The attention window starts at the beginning of the input sequence and slides across the sequence. At each step, it moves by a fixed stride length. For example, if the stride length is set to 1, the window moves one token at a time.
  • Attention Computation: For each position of the attention window, the self-attention mechanism calculates attention weights between the tokens within the window and uses them to compute weighted representations.
  • Weighted Representation: The attention weights obtained from the attention computation are used to calculate a weighted representation for each token within the window. This is typically done by taking a weighted sum of the token embeddings, where the attention weights serve as the weights for the sum.
  • Contextual Information: The weighted representations capture the contextual information for each token within the window, considering the interactions and dependencies between tokens in the local context.
  • Sliding Window Coverage: The sliding window continues to move across the sequence until it covers the entire input. The attention computation is performed for each position of the window, providing contextual information for tokens at different positions.
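The steps above can be illustrated with a short Python sketch (a simplified example; treating position 0 as a global token, e.g. a [CLS]-style token, is an assumption made for illustration). It builds a sliding-window mask and then adds symmetric global attention for the designated positions:

```python
import numpy as np

def global_sliding_window_mask(seq_len, window_size, global_positions):
    # local band: each token attends to window_size // 2 neighbours per side
    half = window_size // 2
    idx = np.arange(seq_len)
    mask = np.abs(idx[:, None] - idx[None, :]) <= half
    # global tokens attend to every token, and every token attends to them
    for g in global_positions:
        mask[g, :] = True
        mask[:, g] = True
    return mask

# toy example: 10 tokens, window size 3, token 0 marked as global
mask = global_sliding_window_mask(seq_len=10, window_size=3, global_positions=[0])
print(mask.astype(int))
```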

By limiting the attention computation to a fixed-size window, global sliding window attention reduces the time complexity from quadratic to roughly linear in the sequence length. This makes it more efficient for long sequences, as it avoids attending to all tokens simultaneously.

It’s important to note that the choice of window size and stride length in Global Sliding Window Attention affects the trade-off between computational efficiency and the model’s ability to capture long-range dependencies. A larger window size allows for more global context but increases the computational cost, while a smaller window size provides more local context but may limit the capture of long-range dependencies.

Global Sliding Window Attention offers a compromise between efficiency and modeling capacity, making it suitable for tasks where balancing computation and capturing contextual information within a limited context window is important.

Advantages and Disadvantages

Both global sliding window and dilated attention aim to increase the scalability and effectiveness of the self-attention process in transformer-based models. They provide alternatives to the traditional self-attention mechanism while balancing computational requirements and capturing long-range dependencies in the input sequences.

Dilated attention allows the network to capture features at different scales without increasing the number of parameters, but it can be computationally expensive. Global sliding window attention is computationally efficient and can be applied to long input sequences, but it may not be as effective at capturing features at different scales as dilated attention.

Dilated and Global Sliding Window Attention - FAQs

Q. What is Attention Mechanism?

The attention mechanism allows a model to selectively focus on specific parts of the input sequence or image. Its components comprise queries, key-value pairs, and attention scores used to weigh and process information.
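For reference, the standard scaled dot-product attention combines these components as Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k)·V, where Q, K, and V are the query, key, and value matrices and d_k is the dimensionality of the keys.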

Q. What is Dilated Attention?

Dilated attention borrows the idea behind dilated convolutions, expanding the receptive field without increasing the number of parameters, and applies it to how the model attends to different parts of input sequences or images.

Q. What is sliding window attention?

Sliding Window Attention is a distinct attention mechanism applied in natural language processing scenarios involving sequential input, such as word sequences. The mechanism involves partitioning the input sequence into overlapping segments or “windows.” Subsequently, attention scores are computed independently for each window, signifying the model’s emphasis on different windows during the prediction process.


