Transformer XL: Beyond a Fixed-Length Context


Transformer XL is short for Transformer Extra Long. The Transformer-XL model was introduced in the paper titled “Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context,” authored by Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. Natural language processing has experienced significant progress, and Transformer XL has been a key influence in reshaping the landscape of sequence modeling.

This article explores the key features of the Transformer XL model: the segment-level recurrence mechanism and relative positional encoding.

Transformer

The Transformer was originally developed to solve the problem of sequence-to-sequence tasks, such as machine translation, but has since become a foundational model for various natural language processing (NLP) tasks. The key features of the Transformer are discussed below:

  • Self-attention: A distinctive aspect of the transformer is its utilization of a self-attention mechanism, enabling the model to assign varying weights to words in a sequence based on their relevance to one another. This facilitates the parallel capture of long-range dependencies. The implementation involves multiple self-attention heads, providing the model with diverse perspectives on word relationships.
  • Positional Encodings: Since all inputs are processed in parallel, the transformer layers have no inherent notion of token order. To address this, positional encodings are added to the input embeddings (see the sketch below).
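
As a concrete illustration of these two ideas, below is a minimal NumPy sketch of sinusoidal positional encodings and a single self-attention head; the shapes, random weights, and function names are illustrative assumptions rather than any library's API.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # U[pos, 2i] = sin(pos / 10000^(2i/d)), U[pos, 2i+1] = cos(pos / 10000^(2i/d))
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))  # (seq_len, d_model)

def self_attention(X, W_q, W_k, W_v):
    # X: (seq_len, d_model) token embeddings with positional encodings added
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # pairwise relevance of tokens
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)          # softmax over the key positions
    return weights @ V                                 # each output mixes all positions

seq_len, d_model = 8, 16
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model)) + sinusoidal_positional_encoding(seq_len, d_model)
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)                 # (seq_len, d_model)
```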

Language Modelling and Limitations of Vanilla Transformer

Language modelling is a fundamental task in natural language processing (NLP) and machine learning. It estimates the likelihood of observing a particular sequence in a given language. Language models take into account the context of a word within a sequence. The probability of a word depends on the preceding words, capturing the dependencies and structure of the language. Language models are evaluated based on perplexity metrics, which measure how well the model predicts a given sequence. Lower perplexity indicates better performance.
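
As a quick worked example, perplexity is just the exponential of the average negative log-likelihood the model assigns to the observed tokens; the probabilities below are made-up values for illustration only.

```python
import math

# Hypothetical per-token probabilities that a language model assigns to an
# observed three-token sequence (values are made up purely for illustration).
token_probs = [0.20, 0.05, 0.10]

# Perplexity = exp(average negative log-likelihood over the sequence).
avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_nll)
print(f"perplexity = {perplexity:.2f}")  # lower perplexity = better model
```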

The utilization of transformers for language modelling has emerged as a critical element in the field of natural language processing, empowering models to comprehend and produce text that closely resembles human language.

For language modelling, Transformers are currently implemented with a fixed-length context, i.e. a long text sequence is truncated into fixed-length segments of a few hundred tokens, and each segment is processed separately. In the vanilla transformer architecture, there is no information flow across segments. Each segment is processed independently.
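
A rough plain-Python sketch of this segmentation (with a toy list of token IDs standing in for a tokenized corpus) makes the independence of segments explicit:

```python
def split_into_segments(token_ids, segment_len):
    # Vanilla Transformer language modelling: chop the corpus into consecutive
    # fixed-length chunks, ignoring sentence or semantic boundaries.
    return [token_ids[i:i + segment_len]
            for i in range(0, len(token_ids), segment_len)]

corpus = list(range(10))                         # stand-in for a tokenized corpus
segments = split_into_segments(corpus, segment_len=4)
print(segments)                                  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
# Each segment is modelled independently: token 4 cannot attend to tokens 0-3,
# no matter how relevant they are -- this is the context-fragmentation problem.
```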

Limitations of Vanilla Transformer

Two critical limitations of the vanilla transformer architecture for the language modelling task are:

  • Longer Term Dependency: Transformers are limited by a fixed-length context in the setting of language modeling. They cannot model dependencies that are longer than the fixed input length.
  • Context Fragmentation: The input corpus is divided into segments based on the model's fixed input length. These fixed-length segments are created by selecting consecutive chunks of tokens without respecting sentence or any other semantic boundaries. The resulting context fragmentation leads to inefficient training and optimization.

Transformer XL

Transformer XL is an extension of the vanilla transformer architecture designed to address the language-modelling challenges highlighted above. It introduces two key features:

Segment-Level Recurrence Mechanism

In a standard Transformer, the hidden state at a given position is a vector that encodes information about the token at that position and its relationships with other tokens in the sequence. The hidden state is updated through self-attention mechanisms and feedforward layers in each layer of the Transformer.

The segment-level recurrence mechanism involves updating the hidden states not only within the current segment but also by attending to the hidden states cached from previous segments. This enables the model to extend its effective context window beyond the current segment. Let us understand this mathematically.

Let,

  • Sτ and Sτ+1 be two consecutive segments
  • L be the length of each segment
  • D be the hidden dimension of the layer
  • n denote the layer index

Now the hidden state fed into the nth layer for segment Sτ+1 depends not only on the layer n−1 hidden state of Sτ+1 but also on the layer n−1 hidden state of Sτ. The two hidden states are concatenated along the length dimension. This is expressed as

\tilde{h}^{n-1}_{\tau+1} = \left[ \mathrm{SG}(h^{n-1}_{\tau}) \oplus h^{n-1}_{\tau+1} \right]

Here we take the hidden state from the previous layer of the current segment and the hidden state from the previous layer of the previous segment and concatenate them. SG denotes a stop-gradient: the gradient is not backpropagated through the previous segment's hidden state.

This extended hidden state is used only for the key and value calculations of the QKV matrices:

q^n_{\tau+1} = h^{n-1}_{\tau+1}W_q^{\top}

k^n_{\tau+1} = \tilde{h}^{n-1}_{\tau+1}W_k^{\top}

v^n_{\tau+1} = \tilde{h}^{n-1}_{\tau+1}W_v^{\top}

Note that the extended hidden state is used only for K and V; the query calculation depends only on the previous-layer hidden state of the current segment. The gradient remains within a segment, but the additional cached history allows the network to model long-term dependencies and avoid context fragmentation.
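
The following PyTorch-style sketch shows a single attention layer with this cached-memory recurrence. The linear projections, shapes, and the omission of multiple heads and the causal mask are simplifying assumptions for illustration, not the reference implementation.

```python
import torch

d_model, L = 16, 4
W_q = torch.nn.Linear(d_model, d_model, bias=False)
W_k = torch.nn.Linear(d_model, d_model, bias=False)
W_v = torch.nn.Linear(d_model, d_model, bias=False)

def layer_with_memory(h_curr, h_mem):
    # h_curr: layer n-1 hidden states of segment tau+1, shape (L, d_model)
    # h_mem:  cached layer n-1 hidden states of segment tau (no gradient flows back)
    h_tilde = torch.cat([h_mem.detach(), h_curr], dim=0)    # SG(h_tau) concat h_tau+1
    q = W_q(h_curr)        # queries come from the current segment only
    k = W_k(h_tilde)       # keys and values see the extended (cached + current) context
    v = W_v(h_tilde)
    attn = torch.softmax(q @ k.T / d_model ** 0.5, dim=-1)  # causal mask omitted here
    return attn @ v        # (L, d_model): feeds layer n and the next segment's cache

h_mem  = torch.randn(L, d_model)   # hidden states cached from segment tau
h_curr = torch.randn(L, d_model)   # hidden states of segment tau+1
out = layer_with_memory(h_curr, h_mem)
```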

With this recurrence mechanism applied to every two consecutive segments of a corpus, it essentially creates a segment-level recurrence in the hidden states. Notice that the recurrent dependency between h^{n}_{\tau+1} and h^{n-1}_{\tau} shifts one layer downwards per segment. This can be visualized as below:

Figure: Segment-level recurrence mechanism (Training Phase)

Figure: Segment-level recurrence mechanism (Evaluation Phase)

Relative Positional Encoding

In the original transformer paper, the positional encoding vector (U) is added to the embedding vector (E). The result is multiplied by the weight matrices Wq and Wk to obtain the Q and K vectors.

The attention score between tokens i and j is obtained by taking the dot product of the query vector of token i with the key vector of token j.

This attention score between two tokens at positions i and j in the original transformer architecture can be mathematically decomposed in terms of the U and E vectors as below.

A_{ij} = E_{x_i} ^ T W_q^TW_kE_{x_j} +  E_{x_i} ^ T W_q^TW_kU_{j} +  U_{i} ^ T W_q^TW_kE_{x_j} + U_{i} ^ T W_q^TW_kU_{j}
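
As a quick numerical check of this decomposition, the NumPy sketch below expands the absolute-position attention score into the four terms and verifies that they sum to the full query-key dot product; all vectors and matrices are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
E_i, E_j = rng.normal(size=(2, d))       # embeddings of the tokens at positions i and j
U_i, U_j = rng.normal(size=(2, d))       # absolute positional encodings for i and j
W_q, W_k = rng.normal(size=(2, d, d))

A = W_q.T @ W_k                          # shared bilinear form W_q^T W_k
full = (E_i + U_i) @ A @ (E_j + U_j)     # score of query (E_i + U_i) against key (E_j + U_j)
terms = [E_i @ A @ E_j,                  # (1) content-content
         E_i @ A @ U_j,                  # (2) content-position
         U_i @ A @ E_j,                  # (3) position-content
         U_i @ A @ U_j]                  # (4) position-position
assert np.isclose(full, sum(terms))      # the four terms add up to the full score
```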

Here:

  • Aij is the attention score between the tokens at positions i and j
  • Exi and Exj are the embedding vectors for the tokens at positions i and j
  • Ui and Uj are the positional encoding vectors at positions i and j
  • Wq and Wk are the query and key weight matrices

The attention score in the Transformer XL architecture is reformulated as below:

A^{rel}_{ij} = E_{x_i} ^ T W_q^TW_{k,E}E_{x_j} +  E_{x_i} ^ T W_q^TW_{k,R}R_{i-j} +  u^T W_{k,E}E_{x_j} + v^T W_{k,R}R_{i-j}

  1. The absolute encoding Uj is replaced by Ri-j, a relative positional embedding based on the distance between i and j rather than the absolute position of j.
  2. The term U_i^T W_q^T, which appears in parts 3 and 4, is the same for every query position. The authors replace it with two new learnable vectors u and v, representing a global content bias (since the third part contains E_{x_j}) and a global positional bias (since the fourth part contains R_{i-j}).
  3. The weight matrix Wk is split into two weight matrices, Wk,E and Wk,R, which produce the content-based key vectors and the position-based key vectors respectively.

The four terms can be intuitively understood as follows (a small numerical sketch follows the list):

  1. Content-Based Addressing (1st Term): Think of this like focusing on the actual meaning or content of the information. It’s as if the model is paying attention to what the words or tokens actually represent.
  2. Content-Dependent Positional Bias (2nd Term): Imagine that there’s a slight bias or preference based on where a word is located in a sequence. This bias is influenced by the specific meaning or content of the words.
  3. Global Content Bias (3rd Term): This is like having an overall preference or inclination towards certain types of information across the entire set of data. It suggests a broader, more general influence on the model’s attention.
  4. Global Positional Bias (4th Term): Similar to the content-dependent bias, but this one is not influenced by the specific meaning of words. It’s more about a general tendency based on the position of words in a global or overall sense.
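
Tying the four terms back to the formula, here is a minimal NumPy sketch of the relative attention score for a single (i, j) pair. R_{i-j}, u, v, W_{k,E}, and W_{k,R} are random placeholders here; in the actual model, R_{i-j} is a sinusoidal encoding of the offset and the others are learned parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
E_i, E_j = rng.normal(size=(2, d))       # content embeddings of tokens i and j
R_rel    = rng.normal(size=d)            # placeholder for R_{i-j}; sinusoidal in the model
u, v     = rng.normal(size=(2, d))       # learned global content / positional biases
W_q, W_kE, W_kR = rng.normal(size=(3, d, d))

q_i = W_q @ E_i
A_rel = (q_i @ (W_kE @ E_j)              # 1. content-based addressing
         + q_i @ (W_kR @ R_rel)          # 2. content-dependent positional bias
         + u @ (W_kE @ E_j)              # 3. global content bias
         + v @ (W_kR @ R_rel))           # 4. global positional bias
```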

Performance of Transformer XL

As per the paper:

  • Transformer-XL can process up to 6,400 tokens in one batch, compared to 512 tokens for the original Transformer. This means that it can capture more long-term dependencies and generate more coherent and diverse texts.
  • Transformer-XL learns dependency that is 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformers during evaluation.
  • Transformer-XL reduces the previous state-of-the-art (SOTA) perplexity from 20.5 to 18.3.

Drawbacks of Transformer XL

  • Transformer-XL has about 40% more parameters than the original Transformer, which means that it needs more data.
  • The recurrence mechanism requires additional memory to cache the hidden states of previous segments.

Conclusion

The original Transformer model uses fixed-length sequence segments and absolute positional encoding, assigning a fixed vector to each token based on its position in the sequence. However, this approach has limitations, such as restricting the model’s effectiveness for longer sequences and overlooking relative distances between tokens.

To overcome these challenges, Transformer XL introduces a segment-level recurrence mechanism and relative positional encoding. The segment-level recurrence mechanism reuses the hidden states of the previous segment's layers. The relative encoding method employs unique vectors for each token pair, determined by their relative distance. These vectors are incorporated into the attention score, which measures how much each token attends to the others. This enhancement enables the model to capture the context of each token, irrespective of its absolute position, and to handle longer sequences more effectively without information loss.


