
Factorized Dense Synthesizer


Transformer models have been a huge success across a wide range of NLP tasks. This success has led transformers to largely replace the earlier auto-regressive recurrent neural network architectures in many state-of-the-art systems. At the core of the transformer architecture is a method called Query-Key dot-product attention, and the success of the transformer is mostly attributed to this self-attention mechanism.

The Factorized Dense Synthesizer is a type of attention model proposed in the paper 'SYNTHESIZER: RETHINKING SELF-ATTENTION FOR TRANSFORMER MODELS' by Google Research. The motivation behind it is to reduce the computational cost of the architecture by proposing alternatives to dot-product self-attention.

Self-attention architecture

The main role of dot-product self-attention in the transformer architecture is self-alignment, i.e., calculating the relative importance of a single token with respect to all other tokens in the sequence. The paper proposes the Synthesizer, a model that learns the self-alignment matrix instead of computing it from pairwise dot products. It thus performs self-alignment not only without dot-product self-attention, but removes content-based, memory-like self-attention altogether.

Architecture:

The authors propose four types of Synthesizer models in the paper; half of them are factorized models:

  • Dense Synthesizer
  • Random Synthesizer
  • Factorized Dense Synthesizer
  • Factorized Random Synthesizer

In this article, we will discuss the Dense Synthesizer and its factorized model.

Dense Synthesizer architecture

Dense Synthesizer

The Dense Synthesizer conditions the attention on each input token independently. It accepts an input X ∈ R^{l×d} and produces an output Y ∈ R^{l×d}, where l refers to the sequence length and d refers to the model dimensionality. A parameterized function F(·) projects each input token X_i from d dimensions to l dimensions:

B_i = F(X_i)

where F(x) is a projection operation and can be defined as:

F(X_i) = W_2(\sigma_R (W_1(X_i) + b_1)) + b_2

where \sigma_R is the ReLU activation function, W_1 ∈ R^{d×d}, W_2 ∈ R^{d×l}, and b_1, b_2 are bias terms. The synthesized matrix B therefore lies in R^{l×l}. We can now compute Y with the following relation:

Y = softmax(B) G(X)

where G is another parameterized function, analogous to the value projection V in the transformer's self-attention. This approach removes the Query-Key dot product by replacing QK^T with the parameterized function F.
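
Below is a minimal, single-head PyTorch sketch of this idea. It is not the authors' implementation; the class name DenseSynthesizer, the single-head setup, and the fixed maximum sequence length are assumptions made purely for illustration.

import torch
import torch.nn as nn


class DenseSynthesizer(nn.Module):
    """Single-head Dense Synthesizer for a fixed maximum sequence length l."""

    def __init__(self, d_model: int, seq_len: int):
        super().__init__()
        # F(X): two-layer MLP projecting each d-dimensional token to l attention logits
        self.w1 = nn.Linear(d_model, d_model)   # W_1: (d, d)
        self.w2 = nn.Linear(d_model, seq_len)   # W_2: (d, l)
        # G(X): analogous to the value projection V in standard self-attention
        self.g = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, l, d)
        b = self.w2(torch.relu(self.w1(x)))     # B = F(X), shape (batch, l, l)
        attn = torch.softmax(b, dim=-1)         # synthesized attention weights
        return attn @ self.g(x)                 # Y = softmax(B) G(X), shape (batch, l, d)


# Usage: 8 sequences of length 64 with model dimension 128
x = torch.randn(8, 64, 128)
y = DenseSynthesizer(d_model=128, seq_len=64)(x)
print(y.shape)  # torch.Size([8, 64, 128])

Note that W_2 maps to a fixed sequence length l, so the number of parameters grows with l; this is exactly the overhead the factorized variant below tries to reduce.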

Factorized Dense Synthesizer

The Dense Synthesizer defined above adds an overhead of d·l parameters to the model. Even though it removes the dot product and saves some computation, it becomes cumbersome to train when the sequence length l is large. To solve this challenge, the authors propose another model called the Factorized Dense Synthesizer, in which the outputs of the synthesizing function are factorized and then trained. This reduces the number of parameters and also helps prevent overfitting. It can be expressed as follows:

A, B = F_A (X_i), F_B (X_i)

where F_A projects the input X_i to a dimensions and F_B projects X_i to b dimensions, such that a·b = l. The output of the factorized model is given by the following equation:

Y = softmax(C) G(X)

where C = H_A(A) * H_B(B), and H_A and H_B are tiling functions that simply duplicate a vector k times, i.e., map R^l to R^{l·k}. In this case, H_A maps R^a to R^{a·b} and H_B maps R^b to R^{b·a}.
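
A minimal PyTorch sketch of the factorized variant is given below. The class name, the choice of single linear layers for F_A and F_B, and the concrete tiling scheme (repeat_interleave for H_A and repeat for H_B, so that the element-wise product covers every pairing of the two sets of logits) are my assumptions for illustration rather than the authors' exact implementation.

import torch
import torch.nn as nn


class FactorizedDenseSynthesizer(nn.Module):
    """Factorized Dense Synthesizer: F_A and F_B emit a and b logits with a * b = l."""

    def __init__(self, d_model: int, a: int, b: int):
        super().__init__()
        self.a, self.b = a, b                 # factorization of the sequence length, l = a * b
        self.f_a = nn.Linear(d_model, a)      # F_A: d -> a
        self.f_b = nn.Linear(d_model, b)      # F_B: d -> b
        self.g = nn.Linear(d_model, d_model)  # G(X): value-like projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, l, d) with l == a * b
        a_logits = self.f_a(x)                            # A, shape (batch, l, a)
        b_logits = self.f_b(x)                            # B, shape (batch, l, b)
        # Tiling functions H_A, H_B: duplicate each logit vector up to length l = a * b
        h_a = a_logits.repeat_interleave(self.b, dim=-1)  # (batch, l, a*b)
        h_b = b_logits.repeat(1, 1, self.a)               # (batch, l, b*a)
        c = h_a * h_b                                     # C, synthesized logits (batch, l, l)
        return torch.softmax(c, dim=-1) @ self.g(x)       # Y = softmax(C) G(X)


# Usage: sequence length l = 64 factorized as a = 8, b = 8, model dimension 128
x = torch.randn(2, 64, 128)
y = FactorizedDenseSynthesizer(d_model=128, a=8, b=8)(x)
print(y.shape)  # torch.Size([2, 64, 128])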

Mixtures of Synthesizers

We can combine all the proposed synthetic attention variants in an additive fashion. The expression for this is:

Y = softmax(a_1 S_1(X) + a_2 S_2(X) + ... + a_N S_N(X)) G(X)

where each S_i is a parameterized synthesizing function and the a_i are trainable weights such that \sum_i a_i = 1.

In the case of mixing the Factorized Random Synthesizer with the standard Dense Synthesizer, the expression becomes:

Y = softmax(a_1 R_1 R_2^{T} + a_2 F(X))G(X)
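
As a sketch of such a mixture, the snippet below combines a factorized Random branch (the input-independent low-rank term R_1 R_2^T) with a Dense branch F(X). The class name, the low-rank factor k, and the use of a softmax to keep the mixing weights summing to 1 are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn as nn


class MixedSynthesizer(nn.Module):
    """Additive mixture of a factorized Random branch (R_1 R_2^T) and a Dense branch F(X)."""

    def __init__(self, d_model: int, seq_len: int, k: int = 8):
        super().__init__()
        # Factorized random attention logits R_1 R_2^T: learned, independent of the input
        self.r1 = nn.Parameter(torch.randn(seq_len, k) * 0.02)
        self.r2 = nn.Parameter(torch.randn(seq_len, k) * 0.02)
        # Dense Synthesizer branch F(X)
        self.w1 = nn.Linear(d_model, d_model)
        self.w2 = nn.Linear(d_model, seq_len)
        # G(X): value-like projection
        self.g = nn.Linear(d_model, d_model)
        # Mixing weights a_1, a_2, normalized with a softmax so that they sum to 1
        self.mix = nn.Parameter(torch.zeros(2))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a1, a2 = torch.softmax(self.mix, dim=0)
        random_logits = self.r1 @ self.r2.t()              # (l, l), shared across the batch
        dense_logits = self.w2(torch.relu(self.w1(x)))     # (batch, l, l)
        logits = a1 * random_logits + a2 * dense_logits    # a_1 R_1 R_2^T + a_2 F(X)
        return torch.softmax(logits, dim=-1) @ self.g(x)   # Y = softmax(...) G(X)


# Usage
x = torch.randn(2, 64, 128)
print(MixedSynthesizer(d_model=128, seq_len=64)(x).shape)  # torch.Size([2, 64, 128])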

Parameter Cost:

In terms of parameter cost, self-attention requires \theta = 2d^{2} parameters, the Dense Synthesizer requires \theta = d^{2} + d \cdot l, and the Factorized Dense Synthesizer requires \theta = d(d + k_1 + k_2), where l refers to the sequence length, d is the dimensionality of the model, and k_1, k_2 are the factors used in the factorization (k_1 \cdot k_2 = l).
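
As a back-of-the-envelope check of these counts, the snippet below plugs in illustrative values (d = 64, l = 512, k_1 = 16, k_2 = 32, chosen only to make the comparison concrete):

# Parameter comparison per layer, following the formulas above (ignoring G(X) and biases)
d, l, k1, k2 = 64, 512, 16, 32        # k1 * k2 = l

self_attention = 2 * d * d            # 2 d^2            -> 8,192
dense_synthesizer = d * d + d * l     # d^2 + d * l      -> 36,864
factorized_dense = d * (d + k1 + k2)  # d(d + k1 + k2)   -> 7,168

print(self_attention, dense_synthesizer, factorized_dense)

Even for this modest sequence length, the factorized variant uses far fewer parameters than the plain Dense Synthesizer, and the gap widens as l grows.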

References:

  • Tay, Yi, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. "Synthesizer: Rethinking Self-Attention for Transformer Models." arXiv preprint arXiv:2005.00743 (2020).

