
FNet: A Transformer Without Attention Layer

Last Updated : 11 Dec, 2023

This article delves into FNet, a transformative architecture that reimagines the traditional transformer by discarding attention mechanisms entirely. Let’s begin the journey to explore FNet, but first, let’s look at the limitations of transformers.

What is FNet?

In contrast to conventional transformer architectures such as BERT, FNet replaces the self-attention mechanism with an unparameterized Fourier Transform. The Fourier Transform is a mathematical operation frequently applied in signal processing and image analysis.

By applying the Fourier Transform to input sequences, FNet provides a different method of capturing dependencies between tokens in a sequence. The authors contend that this approach can be more efficient and scalable than self-attention, particularly for longer sequences.

Limitations of Transformers

Transformer-based models are good at understanding and processing sequences because of an operation called "self-attention": each token looks at every other token in the sequence and measures how strongly it relates to each of them.

However, self-attention requires on the order of N^2 operations for a sequence of length N, i.e., it scales quadratically. This takes a lot of time and computing power, and the quadratic scaling with input size makes transformers inefficient for long sequences, imposing a practical limit on the sequence length that can be processed. The standard BERT model, for example, can process 512 tokens; any longer document has to be either truncated or chunked, which leads to information loss or cascading errors, respectively.
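To make the quadratic cost concrete, here is a minimal NumPy sketch (illustrative only; the names and shapes are not taken from any particular implementation) that computes a single attention-weight matrix. For N = 512 tokens the score matrix alone already has N × N = 262,144 entries.

```python
import numpy as np

def attention_weights(q, k):
    # The score matrix is N x N: every token is compared with every other
    # token, so compute and memory grow quadratically with sequence length N.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                 # shape (N, N)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

n, d = 512, 64                                    # 512 tokens, 64-dim head
rng = np.random.default_rng(0)
q, k = rng.normal(size=(n, d)), rng.normal(size=(n, d))
print(attention_weights(q, k).shape)              # (512, 512) -> N^2 entries
```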

Many methods have been developed to reduce this quadratic scaling. Approaches such as Longformer, Big Bird, and Performer speed up the computation by sparsifying or approximating the self-attention matrix, thereby scaling linearly with the input size. However, they all keep the concept of self-attention as it is.

One method that has recently come up is FNet, which replaces the self-attention layer completely. Let’s see the architecture and how it works.

Understanding the FNet Architecture

To grasp FNet’s architecture, familiarity with the Discrete Fourier transform (DFT) is crucial. The DFT transforms a sequence of discrete values into complex numbers, representing amplitude and phase for each frequency component in the original sequence. FNet adopts this concept and applies it to the input sequence through a Fourier mixing sublayer within each encoder block.

In contrast to traditional transformers, FNet's mixing sublayer does not rely on matrices with learnable parameters. Instead, the DFT is applied along the sequence and hidden dimensions, and only the real part of the result is retained, eliminating the need to handle complex numbers. This choice not only reduces memory requirements but also accelerates computation, especially when compared to the quadratic scaling of self-attention operations.

Discrete Fourier Transform

Given a sequence of N discrete values (x0, x1, …, xN-1), the DFT produces a corresponding sequence (X0, X1, …, XN-1), where each Xk is a complex number that represents the amplitude and phase of the corresponding frequency component in the original sequence.

The formula for the DFT is as follows:

X_k = \sum_{n=0}^{N-1} x_n \, e^{-\frac{2\pi i}{N}nk}

Here,

  • x_n is the n-th value of the input sequence of length N
  • X_k is the k-th value of the output sequence of length N
  • i is the imaginary unit
  • N is the number of samples in the sequence
  • k = 0, 1, 2, …, N-1
  • Note: each X_k is a complex number with a real and an imaginary part

Methods to Calculate the DFT

Direct Computation:

  • Use the definition of the DFT directly, summing over all input values. Computing each output X_k requires N multiplications and additions, so for a sequence of length N the full transform takes O(N^2) operations.

Fast Fourier Transform (FFT):

  • Cooley-Tukey Algorithm: The FFT is an efficient algorithm for computing the DFT, and the Cooley-Tukey algorithm is a widely used FFT variant. It takes a divide-and-conquer approach that reduces the computation time to O(N log N).
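The two methods can be contrasted with a short NumPy sketch (illustrative, not from the FNet paper): a direct O(N^2) evaluation of the DFT formula above agrees numerically with NumPy's FFT-based np.fft.fft.

```python
import numpy as np

def dft_direct(x):
    # Direct evaluation of X_k = sum_n x_n * exp(-2*pi*i*n*k / N):
    # each of the N outputs needs N multiply-adds, hence O(N^2) overall.
    n_samples = len(x)
    n = np.arange(n_samples)
    k = n.reshape(-1, 1)
    return (x * np.exp(-2j * np.pi * k * n / n_samples)).sum(axis=1)

x = np.random.default_rng(0).normal(size=256)
X_direct = dft_direct(x)
X_fft = np.fft.fft(x)                 # FFT-based, O(N log N)
print(np.allclose(X_direct, X_fft))   # True
print(X_fft[:2])                      # complex outputs: real + imaginary parts
```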

FNet Architecture

FNet is an attention-free Transformer architecture, described below:

  • Input: The authors keep the input the same as in the Transformer. Tokens are embedded with word embeddings and combined with absolute position embeddings of the tokens and type embeddings of the sentence.
  • Encoder Block: Each block consists of a Fourier mixing sublayer followed by a feed-forward sublayer. Essentially, the self-attention sublayer of the Transformer encoder layer is replaced with a Fourier sublayer.
    The DFT is applied to the embedded input: one 1D DFT along the sequence length, Fseq, and one 1D DFT along the hidden (embedding) dimension, Fh. Mathematically this can be written as:
    y = \mathbb{R} (F_{seq}(F_{h}(x)))
    Here \mathbb{R} denotes the real part of the DFT output.
    The DFT produces complex numbers, but only the real part of the result is kept, so there is no need to modify the (nonlinear) feed-forward sublayers or output layers to handle complex numbers.
    Key points in the Encoder Block (see the sketch after this list):
    • The above equation contains no matrices, so the mixing sublayer has no learnable parameters, unlike the Transformer encoder, which must learn the Q, K, and V matrices. This reduces the memory footprint.
    • The DFT computed with the FFT takes O(N log N) time, compared to the O(N^2) of Transformer self-attention. This increases the computation speed.
  • Dense Layer: The output of the encoder blocks can then be used for various tasks, similar to how BERT uses its encoder output. For example, for sentence classification the encoder output can be passed through a dense layer and then a softmax layer for the final projection.
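Below is a minimal NumPy sketch of a single FNet encoder block, assuming BERT-style residual connections with post-layer normalization and a simple ReLU feed-forward sublayer; all names, shapes, and initializations are illustrative and not taken from the official implementation.

```python
import numpy as np

def fourier_mixing(x):
    # FNet mixing sublayer: 1D DFT along the hidden dimension, then 1D DFT
    # along the sequence dimension, keeping only the real part of the result.
    return np.fft.fft(np.fft.fft(x, axis=-1), axis=-2).real

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def feed_forward(x, w1, b1, w2, b2):
    # Position-wise feed-forward sublayer (ReLU used here for simplicity).
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def fnet_encoder_block(x, w1, b1, w2, b2):
    # Residual connection + layer norm around each sublayer.
    x = layer_norm(x + fourier_mixing(x))
    x = layer_norm(x + feed_forward(x, w1, b1, w2, b2))
    return x

# Toy usage with random embeddings and weights.
seq_len, d_model, d_ff = 8, 16, 32
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))        # embedded input
w1 = rng.normal(size=(d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
w2 = rng.normal(size=(d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)
print(fnet_encoder_block(x, w1, b1, w2, b2).shape)   # (8, 16)
```

Note that the Fourier mixing step itself introduces no trainable weights; only the feed-forward sublayer (and the layer norms, if parameterized) carry parameters.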

Interpretation of FNet Key Terms

  • Fourier Transform as a Mixing Mechanism: The Fourier Transform can be described as an effective token-mixing mechanism, in which each Fourier Transform operation allows every element of the input to influence every other element in a meaningful way. This gives the feed-forward sublayers access to all tokens.
  • Multiplication and Convolution: Multiplying by the feed-forward sublayer coefficients in the frequency domain is equivalent to convolving in the time domain (the convolution theorem). FNet can therefore be thought of as alternating between multiplications and convolutions, as verified in the sketch below.
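The convolution theorem mentioned above can be checked directly with NumPy (an illustrative verification, not code from the paper): element-wise multiplication in the frequency domain matches circular convolution in the token (time) domain.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=16)   # a "token" signal
h = rng.normal(size=16)   # a filter / coefficient vector

# Multiply element-wise in the frequency domain, then transform back ...
via_frequency = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)))

# ... which equals circular convolution computed directly in the time domain.
n = len(x)
direct = np.array([sum(x[m] * h[(k - m) % n] for m in range(n)) for k in range(n)])

print(np.allclose(via_frequency, direct))   # True
```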

Performance

FNet is an encoder-based architecture. Since BERT is the state-of-the-art model among encoder-based architectures, all performance benchmarks were carried out against it.

On the GLUE benchmark, the results are as follows:

  • FNet-base underperforms BERT-base by 7.5-8%, but FNet trains significantly faster than BERT: 80% faster on GPUs and 70% faster on TPUs.
  • The gap between BERT and FNet shrinks to just 3% for the Large models.
  • In a hybrid model, the authors replaced the final two Fourier sublayers of FNet with self-attention sublayers, with only a limited speed penalty. The hybrid FNet-base model underperformed BERT by 3%, and the hybrid FNet-large underperformed BERT by 1%.
  • FNet is very competitive with the "efficient Transformers" (e.g., Linformer and Performer) evaluated on the Long-Range Arena benchmark, matching the accuracy of the most accurate models.

Applications

  • Considering its lightweight size, FNet can be effective as a distilled model deployed in resource-constrained settings such as production services or edge devices.
  • Since it matches the performance of efficient transformers, it also provides an alternative for long-sequence processing.

Limitations

Despite its speed and memory advantages, FNet is not without limitations.

  • The FNet architecture was developed with a focus on the encoder part of the transformer; it has not yet been developed and benchmarked for decoder-only or encoder-decoder setups. As such, FNet can currently be used only for tasks where the encoder block is sufficient, such as text classification, and not for more complex tasks like sequence-to-sequence conversion.
  • The speed and memory gains come at an accuracy cost: the base model underperforms the state-of-the-art model by a noticeable margin.

Conclusion

FNet provides an alternative to the self-attention mechanism of the Transformer architecture by replacing it entirely with the DFT. This unparameterized mixing mechanism not only reduces the memory footprint but also accelerates computation significantly. With its competitive performance and lightweight nature, FNet opens avenues for deployment in resource-constrained settings and presents an alternative for those seeking efficient long-sequence processing solutions.


