Getting Started with Transformers
The Transformer is a deep learning model proposed in 2017, used primarily for NLP tasks. If you are working on Natural Language Processing tasks such as text summarization, translation, or emotion prediction, you will encounter this term very often.
RNNs suffer from the vanishing gradient problem, which causes long-term memory loss. An RNN processes text sequentially, word by word. Take a long sentence such as: 'XYZ has been to France in 2019, when there were no cases of covid, and there he met the president of that country.' If we now ask which place 'that country' refers to, an RNN will not be able to recall that the country was 'France', because it encountered the word 'France' many steps earlier. Because processing is sequential, the model effectively learns at the word level rather than over the sentence as a whole. The gradients carry the information used in the RNN parameter updates, and when the gradients become vanishingly small, no real learning is done.
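The "vanishing" part can be sketched with a toy calculation. Backpropagating through many time steps multiplies the gradient by a per-step factor; the factor 0.5 below is invented purely for illustration, standing in for the derivative of a saturated activation:

```python
# Toy illustration of the vanishing gradient: backpropagating through
# T time steps multiplies the gradient by a per-step factor. If that
# factor is below 1, the gradient shrinks exponentially with sequence
# length, so early words stop contributing to learning.
def gradient_after_steps(per_step_factor, num_steps):
    grad = 1.0
    for _ in range(num_steps):
        grad *= per_step_factor
    return grad

short = gradient_after_steps(0.5, 5)    # 0.03125 — still usable
long = gradient_after_steps(0.5, 50)    # ~8.9e-16 — effectively zero
print(short, long)
```

This is why the word 'France' from fifty steps back contributes almost nothing to the update when the model finally reaches 'that country'.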
LSTMs, by adding a few more memory cells and mitigating the vanishing gradient issue, resolved the long-term memory loss problem to some extent. But the problem of sequential processing remained: like an RNN, an LSTM cannot process the whole sentence at once. Rather than processing words in parallel, it processes them one by one, and this cannot be addressed within LSTMs because it is inherent to their sequential design. LSTMs also use static embeddings, meaning that a word is mapped to some n-dimensional vector without knowing its context. But if the context changes, the meaning changes too.
For example, take the word 'point', used in two different contexts:
- The needle has a sharp point.
- It is not polite to point at people.
Here, the word 'point' has a different meaning in each sentence, but when a static embedding is produced, the context is not taken into consideration. Therefore, there was a need for a different architecture: the Transformer, proposed in the paper Attention Is All You Need.
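The static-embedding limitation above can be sketched in a few lines. The lookup table and its 3-dimensional vectors are invented for illustration; real static embeddings (e.g. word2vec) have hundreds of dimensions, but the same flaw:

```python
# Sketch of the static-embedding limitation: a lookup table assigns one
# fixed vector per word, so 'point' gets the same vector in both
# sentences regardless of context. The vectors below are made up.
static_embeddings = {
    "needle": [0.9, 0.1, 0.0],
    "sharp":  [0.7, 0.2, 0.1],
    "point":  [0.4, 0.4, 0.2],
    "polite": [0.1, 0.8, 0.1],
}

vec_1 = static_embeddings["point"]  # from 'a sharp point' (noun)
vec_2 = static_embeddings["point"]  # from 'to point at people' (verb)
print(vec_1 == vec_2)  # True — the two usages are indistinguishable
```

A Transformer instead recomputes each word's representation from its surrounding words, so the two occurrences of 'point' end up with different vectors.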
Neural networks can be broadly classified into two groups: feedforward and feedback networks. Transformers are based on feedforward networks, which means that information moves from the input to the output and there is no feedback loop. In contrast, LSTMs use feedback networks: information can flow in both directions along a feedback path, i.e. the memory can be reused for new predictions.
Now, coming to the architecture of the Transformer. The encoder and decoder are the building blocks of a Transformer. The encoder block turns the sequence of input words into vectors, and the decoder converts those vectors back into a sequence. For example, a sentence in French translated into its English equivalent:
Je suis étudiant –> I am a student.
The encoder architecture has two layers: self-attention and feed-forward. The encoder's inputs first pass through a self-attention layer, and the outputs of that layer are then fed to a feed-forward neural network. Sequential data has temporal characteristics: each word holds some position with respect to the others. For example, take the sentence 'The cat didn't chase the mouse, because it was not hungry.' We can easily tell that 'it' refers to the cat, but that is not as simple for an algorithm. When the model is processing the word 'it', self-attention allows it to associate 'it' with 'cat'. Self-attention is the method of reformulating a word's representation based on all the other words of the sentence.
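The self-attention step can be sketched numerically. This is a minimal version of the scaled dot-product attention from the paper; the sequence length, embedding size, and random weights below are illustrative stand-ins, not the real model's values:

```python
import numpy as np

# Minimal sketch of scaled dot-product self-attention for one sentence.
# Each word vector is projected to a query, key, and value; the attention
# weights (a softmax over query-key dot products) decide how much every
# other word contributes to the new representation of each word.
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8            # 4 words, 8-dimensional embeddings

x = rng.normal(size=(seq_len, d_model))        # word embeddings
W_q = rng.normal(size=(d_model, d_model))      # query projection
W_k = rng.normal(size=(d_model, d_model))      # key projection
W_v = rng.normal(size=(d_model, d_model))      # value projection

Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / np.sqrt(d_model)            # similarity of each word pair
scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
output = weights @ V                           # context-aware word vectors

print(weights.shape)          # (4, 4): one weight per word pair
print(weights.sum(axis=-1))   # each row sums to 1
```

Each row of `weights` says how strongly one word attends to every word in the sentence, which is exactly how 'it' can end up attending strongly to 'cat'.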
The decoder architecture has three layers: self-attention, encoder-decoder attention, and feed-forward. The decoder has the same self-attention and feed-forward layers as the encoder, but between them sits an encoder-decoder attention layer that helps the decoder focus on the relevant parts of the input sentence.
There are six encoders and six decoders stacked in the Transformer architecture. In the bottom encoder, word embedding is performed: each word is transformed into a vector of size 512. The input to every other encoder is the output of the encoder directly below it. Probing studies suggest that the successive encoder layers rediscover something like the classical NLP pipeline: the lower layers capture part-of-speech tags, then constituents, dependencies, semantic roles, coreference, and relations, in roughly that order.
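The data flow through the stack can be sketched as follows. The layer bodies here are deliberate placeholders (a small random linear map plus ReLU) standing in for the real attention and feed-forward sublayers; only the wiring and the 512-dimensional shape reflect the actual architecture:

```python
import numpy as np

# Sketch of the encoder stack: the bottom encoder receives the word
# embeddings (size 512 in the paper), and each of the six encoders feeds
# its output to the encoder directly above it. Each layer preserves the
# (sequence length, 512) shape, which is what makes stacking possible.
rng = np.random.default_rng(0)
d_model, num_layers, seq_len = 512, 6, 3

embeddings = rng.normal(size=(seq_len, d_model))   # bottom-encoder input
layers = [rng.normal(scale=0.05, size=(d_model, d_model))
          for _ in range(num_layers)]

x = embeddings
for W in layers:                  # each encoder consumes the one below
    x = np.maximum(x @ W, 0.0)    # placeholder for attention + feed-forward

print(x.shape)  # (3, 512): the shape is preserved through the stack
```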
The very last layer is a softmax layer, which assigns a probability to each word in the vocabulary; all of these probabilities sum up to 1.
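This final step can be sketched directly. The tiny four-word vocabulary and the raw scores (logits) below are invented for illustration; a real model's vocabulary has tens of thousands of entries:

```python
import math

# Sketch of the final softmax: turn the model's raw scores (logits),
# one per vocabulary word, into probabilities that sum to 1.
vocab = ["I", "am", "a", "student"]
logits = [2.0, 1.0, 0.5, 3.0]     # invented scores for illustration

exps = [math.exp(z - max(logits)) for z in logits]  # stabilised exponentials
probs = [e / sum(exps) for e in exps]

print(sum(probs))                       # 1.0 (up to floating point)
print(vocab[probs.index(max(probs))])   # highest-probability word
```

The word with the highest probability is the one the model emits at that position.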