
FNet: A Transformer Without Attention Layer

This article delves into FNet, a transformative architecture that reimagines the traditional transformer by discarding attention mechanisms entirely. Let’s begin the journey to explore FNet, but first, let’s look at the limitations of transformers.

What is FNet?

In contrast to conventional transformer architectures such as the original Transformer and BERT, FNet uses an unparameterized Fourier Transform in place of the self-attention mechanism. The Fourier Transform is a mathematical operation frequently used in signal processing and image analysis.



By applying the Fourier Transform to its input sequences, FNet provides a different way of capturing dependencies between tokens. The authors contend that this approach can be more efficient and scalable than self-attention, particularly for longer sequences.

Limitation of Transformers

Transformer-based models are good at understanding and processing sequences because they use an operation called “self-attention”: each token in the sequence looks at every other token and measures how strongly the two are related.



However, for a sequence of length N, self-attention has to compute a score between every pair of tokens, which requires on the order of N² operations. This takes a lot of time and computing power, and the quadratic scaling with input size makes the transformer inefficient for processing long sequences. It also imposes a hard limit on the sequence length that can be processed: the standard BERT model handles at most 512 tokens, so any longer document has to be either truncated or chunked, leading to information loss or cascading errors, respectively.
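To make the quadratic cost concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention. The array names and sizes are illustrative, not taken from any particular model; the point is that the intermediate `scores` matrix is N × N, which is where the quadratic time and memory come from.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention for one sequence.

    x: (N, d) token representations; w_q/w_k/w_v: (d, d) projection matrices.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v             # each (N, d)
    scores = q @ k.T / np.sqrt(x.shape[-1])         # (N, N) -- quadratic in N
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v                              # (N, d)

# Toy example: sequence length N = 512, hidden size d = 64.
rng = np.random.default_rng(0)
N, d = 512, 64
x = rng.normal(size=(N, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (512, 64); the intermediate scores matrix was 512 x 512
```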

Many methods have been developed to reduce this quadratic scaling. Approaches such as Longformer, Big Bird, and Performer speed up the computation by sparsifying or approximating the self-attention matrix, thereby scaling linearly with the input size. However, they all retain the concept of self-attention.

One recent method, FNet, replaces the self-attention layer completely. Let’s look at its architecture and how it works.

Understanding the FNet Architecture

To grasp FNet’s architecture, familiarity with the Discrete Fourier transform (DFT) is crucial. The DFT transforms a sequence of discrete values into complex numbers, representing amplitude and phase for each frequency component in the original sequence. FNet adopts this concept and applies it to the input sequence through a Fourier mixing sublayer within each encoder block.

In stark contrast to traditional transformers, FNet’s encoder blocks do not rely on matrices with learnable parameters. Instead, the DFT is employed along the sequence and hidden dimensions, eliminating the need for complex number handling by retaining only the real part of the result. This strategic choice not only reduces memory requirements but also accelerates computation speed, especially when compared to the quadratic scaling of self-attention operations.
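In code, the Fourier mixing described above reduces to two nested FFTs followed by taking the real part. Here is a minimal NumPy sketch; the array shapes are illustrative assumptions:

```python
import numpy as np

def fourier_mixing(x):
    """Mix tokens by applying the DFT along the hidden dimension, then along
    the sequence dimension, and keeping only the real part of the result.
    No learnable parameters are involved."""
    return np.fft.fft(np.fft.fft(x, axis=-1), axis=-2).real

# x: (sequence_length, hidden_dim) embeddings for one example.
x = np.random.randn(128, 64)
mixed = fourier_mixing(x)
print(mixed.shape)  # (128, 64) -- same shape as the input; every output
                    # position now depends on every input token
```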

Discrete Fourier Transform

Given a sequence of N discrete values (x_0, x_1, …, x_{N-1}), the DFT produces a corresponding sequence (X_0, X_1, …, X_{N-1}), where each X_k is a complex number that represents the amplitude and phase of the corresponding frequency component in the original sequence.

The formula for the DFT is as follows:

X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i k n / N}, \quad k = 0, 1, \dots, N-1

Here, x_n is the n-th value of the input sequence, X_k is the k-th frequency component of the output, N is the length of the sequence, and i is the imaginary unit.

Methods to calculate DFT

Direct Computation: Evaluate the summation in the DFT formula explicitly for each of the N output values. Since every X_k requires a sum over all N inputs, this takes O(N²) operations.

Fast Fourier Transform (FFT): A divide-and-conquer algorithm (such as Cooley–Tukey) that computes the same DFT in O(N log N) operations by recursively splitting the sequence into smaller DFTs. The FNet authors report using the FFT on GPUs and a pre-computed DFT matrix multiplication on TPUs, where the direct approach is faster for typical sequence lengths. A short NumPy sketch comparing both approaches follows below.
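The sketch below implements the direct O(N²) computation as a product with the N × N DFT matrix and checks it against NumPy's FFT routine, which computes the same transform in O(N log N). The function name and test size are illustrative.

```python
import numpy as np

def dft_direct(x):
    """Direct O(N^2) DFT: multiply by the N x N DFT matrix W, where
    W[k, n] = exp(-2j * pi * k * n / N)."""
    n_points = len(x)
    k = np.arange(n_points).reshape(-1, 1)   # row index k
    n = np.arange(n_points).reshape(1, -1)   # column index n
    dft_matrix = np.exp(-2j * np.pi * k * n / n_points)
    return dft_matrix @ x

x = np.random.randn(256)
direct = dft_direct(x)
fast = np.fft.fft(x)                         # FFT: O(N log N), same transform
print(np.allclose(direct, fast))             # True (up to floating-point error)
```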

FNet Architecture

FNet is an attention-free Transformer encoder architecture. Its main components are:

Input embeddings: token, position, and segment/type embeddings, as in BERT.

Encoder blocks: a stack of identical blocks in which the self-attention sublayer is replaced by the Fourier mixing sublayer described above (a 2D DFT over the sequence and hidden dimensions, keeping only the real part). Each block also contains the usual position-wise feed-forward sublayer, with a residual connection and layer normalization around each sublayer.

Output layer: a dense projection on top of the final encoder block for the downstream task.

A minimal code sketch of a single encoder block is shown below.
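The following PyTorch sketch shows one encoder block under the description above. The class and argument names (FNetEncoderBlock, d_model, d_ff) are our own choices for illustration, not from any official implementation.

```python
import torch
import torch.nn as nn

class FNetEncoderBlock(nn.Module):
    """One FNet-style encoder block: Fourier mixing + feed-forward,
    each followed by a residual connection and layer normalization."""

    def __init__(self, d_model=256, d_ff=1024, dropout=0.1):
        super().__init__()
        self.mixing_norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),
        )
        self.ff_norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        # Fourier mixing: DFT over the hidden dim, then the sequence dim,
        # keeping only the real part. No learnable parameters here.
        mixed = torch.fft.fft(torch.fft.fft(x, dim=-1), dim=-2).real
        x = self.mixing_norm(x + mixed)       # residual + layer norm
        x = self.ff_norm(x + self.ff(x))      # feed-forward sublayer
        return x

# Toy forward pass: batch of 2 sequences, 128 tokens, hidden size 256.
block = FNetEncoderBlock(d_model=256)
tokens = torch.randn(2, 128, 256)
print(block(tokens).shape)  # torch.Size([2, 128, 256])
```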

Interpretation of FNet Key Terms

Performance

FNet is an encoder-based architecture. Since BERT is the state-of-the-art model among encoder-based architectures, all performance benchmarks were carried out with respect to it.

On the GLUE benchmark, FNet reaches roughly 92–97% of the accuracy of its BERT counterpart, while training about 80% faster on GPUs and about 70% faster on TPUs at the standard 512-token input length.

Applications

Limitations

Despite its speed and memory advantages, FNet is not without limitations. It trades some accuracy for speed, typically scoring a few points below a comparably sized BERT on GLUE, and because the Fourier mixing is unparameterized, it cannot learn task-specific token interactions the way attention can.

Conclusion

FNet provides an alternative to the self-attention mechanism of the Transformer architecture by replacing it entirely with the Discrete Fourier Transform. This unparameterized mixing not only reduces the memory footprint but also accelerates computation significantly. With its competitive performance and lightweight nature, FNet opens avenues for deployment in resource-constrained settings and presents an alternative for those seeking efficient long-sequence processing.

