Transformer Neural Network In Deep Learning – Overview

Last Updated : 02 Oct, 2022

In this article, we are going to learn about Transformers. We'll start with an overview of Deep Learning and its applications. Moving ahead, we shall see how sequential data can be processed using Deep Learning and how the models have improved over the years.

Deep Learning

So what exactly is Deep Learning? Before we answer that, let's quickly walk through the chronology, starting with AI. Artificial intelligence (AI) is the broadest of the three terms. It is an area of computer science that emphasizes the creation of intelligent machines that work and react like human beings. In short, we are trying to give machines the capability to imitate intelligent human behaviour. Then we have Machine Learning (ML), the science of getting computers to act by learning from previous data. Deep Learning is a subset of Machine Learning that makes use of neural networks. Neural networks are a set of algorithms and techniques loosely modelled on the human brain and designed to solve complex and advanced machine learning problems.

So what exactly is Deep Learning? Deep Learning is part of a broad family of ML methods based on learning patterns and representations from data, as opposed to the task-specific algorithms of classical Machine Learning. Deep Learning algorithms can be supervised, semi-supervised, or unsupervised. As mentioned earlier, Deep Learning is inspired by the human brain and how it perceives information through the interaction of neurons.

So why should we choose Deep Learning for various tasks? The big advantage is that it can extract more features, and when we have more features and can work with a huge amount of data at the same time, a model can perceive an object much as a human being does. For example, if you want to classify between a pen and a pencil, as a human being you already know the difference because you have looked at pens and pencils a number of times, so you can classify them with ease. The reason is that you know the features of a pen and the features of a pencil. Deep Learning works similarly: the more data you feed it, the more dimensions it can analyze and the more it can learn. One of the most popular applications of Deep Learning is image classification. It can be something as simple as distinguishing between two different animals, or something as complicated as hiding data or running self-driving cars using classification.

The next type of application of Deep Learning is working with sequential data. Sequential data refers to something like time-series data or natural language. We call it sequential because each word or feature depends on the ones that came before it. Take the question 'What time is it?'. If you hear only 'it is', or only 'what time is', you have just a fragment of the sentence. To make sense of the current word, you obviously have to know what came before it. To handle this, we use Recurrent Neural Networks (RNNs), and there are various versions of RNN.

Moving on to the next application: GANs. GANs, which stands for generative adversarial networks, are an unsupervised Deep Learning technique. A common application seen in recent days is deepfakes, among many others. Finally, we come to performing classification and regression tasks using a multi-layer perceptron (MLP). If you are well versed in Machine Learning, you will remember that for these tasks we had algorithms like decision trees, random forests, or something as simple as linear regression or logistic regression. When we perform classification using an MLP, we can often get high accuracy, in many cases comparable to or better than SVMs and decision trees.

Natural Language Processing (NLP) Using RNN

So now that we know what Deep Learning is and why we use it, let's narrow down to how we can process natural language data using RNNs. So what are RNNs? RNN stands for Recurrent Neural Network, and we usually use it to deal with sequential data. Sequential data can be something like time-series data or textual data of any format. So why should one use an RNN? Because it has a concept of internal memory. An RNN can remember important things about the input it has received, which allows it to be quite precise in predicting what comes next. This is the reason RNNs are preferred for sequential data. Examples of sequence data include time series, speech, text, financial data, audio, video, weather, and many more.

Although RNNs were the state-of-the-art algorithm for dealing with sequential data, they come with their own drawbacks. Because of the complexity of the algorithm, the network is pretty slow to train, and with a huge number of dimensions, training is long and difficult. The most decisive drawback, and the one that drove later improvements to RNNs, is the vanishing gradient. What is the vanishing gradient? As the error is propagated back through more and more time steps, the gradient shrinks towards zero, so the influence of earlier inputs is lost. Due to this, we cannot work with large or long sequences of data.
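To make the vanishing gradient concrete, here is a minimal sketch in Python. It assumes a toy RNN with a single scalar hidden state and hand-picked weights (`w_h` and `w_x` are illustrative values, not taken from any trained model). The gradient of the final state with respect to the initial state is a product of one factor per time step, and each factor here is smaller than 1, so the product shrinks quickly as the sequence gets longer.

```python
import math


def rnn_forward(inputs, w_h=0.5, w_x=1.0, h0=0.0):
    """Run a toy scalar RNN: h_t = tanh(w_h * h_{t-1} + w_x * x_t)."""
    states = []
    h = h0
    for x in inputs:
        h = math.tanh(w_h * h + w_x * x)  # the hidden state carries memory forward
        states.append(h)
    return states


def grad_wrt_initial_state(inputs, w_h=0.5, w_x=1.0, h0=0.0):
    """Chain rule: d h_T / d h_0 is the product of w_h * (1 - h_t^2) over all steps."""
    grad = 1.0
    for h in rnn_forward(inputs, w_h, w_x, h0):
        grad *= w_h * (1.0 - h * h)  # derivative of tanh is 1 - tanh^2
    return grad


# Assumed toy inputs: the same small value repeated for 5 and for 50 time steps.
short_grad = grad_wrt_initial_state([0.1] * 5)
long_grad = grad_wrt_initial_state([0.1] * 50)
print("gradient after 5 steps: ", short_grad)
print("gradient after 50 steps:", long_grad)
```

With these assumed weights, the gradient over 50 steps is many orders of magnitude smaller than over 5 steps, which is why the earliest inputs barely influence learning.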

To overcome this, several upgrades to the recurrent neural network were developed, starting with the bi-directional recurrent neural network. A bi-directional RNN connects two hidden layers of opposite directions to the same output. With this form of generative Deep Learning, the output can get information from past and future states simultaneously. So why do we need a bi-directional RNN? It duplicates the RNN processing chain so that the input is processed in both forward and reverse time order, thus allowing the network to look into future context as well. The next one is long short-term memory, usually referred to as LSTM, an artificial recurrent neural network architecture used in the field of Deep Learning. Unlike standard feedforward neural networks, LSTM has feedback connections. It can process not only single data points but also entire sequences of data. With LSTM, we can feed in a longer sequence than we could with plain or bi-directional RNNs.
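The bi-directional idea described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the hidden state is a single scalar, the weights are hand-picked, and the helper names (`run_rnn`, `bidirectional_rnn`) are hypothetical. One pass reads the sequence left to right, a second pass reads it right to left, and the two states for each position are paired up.

```python
import math


def run_rnn(inputs, w_h, w_x):
    """Toy scalar RNN pass: h_t = tanh(w_h * h_{t-1} + w_x * x_t)."""
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states


def bidirectional_rnn(inputs, fwd_weights=(0.5, 1.0), bwd_weights=(0.5, 1.0)):
    """Pair a forward-in-time state with a backward-in-time state per position."""
    forward = run_rnn(inputs, *fwd_weights)
    # Process the reversed sequence, then flip the states back into input order.
    backward = run_rnn(inputs[::-1], *bwd_weights)[::-1]
    return list(zip(forward, backward))


states = bidirectional_rnn([1.0, 0.0, 2.0])
for position, (past_view, future_view) in enumerate(states):
    print(position, past_view, future_view)
```

At position 0 the forward state has seen only the first input, while the backward state has already seen the whole rest of the sequence, which is the "future context" mentioned above.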

So why is LSTM better than RNN? When we move from RNN to LSTM, we introduce more control over what the network keeps and what it discards from the sequence of data we provide. Thus, LSTM gives us more controllability and, as a result, better results.
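The extra control comes from gates. Below is a minimal sketch of a single LSTM step in Python, assuming scalar states and a hypothetical parameter dictionary `p` with hand-picked values. It follows the commonly used formulation with a forget gate, an input gate, an output gate, and a candidate value; real implementations use vectors and weight matrices instead of scalars.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step on scalars; `p` holds the (assumed) weights and biases."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g   # keep part of the old memory, add part of the new
    h = o * math.tanh(c)     # expose part of the memory as the hidden state
    return h, c


base = {"wf": 1.0, "uf": 1.0, "wi": 1.0, "ui": 1.0,
        "wo": 1.0, "uo": 1.0, "bo": 0.0,
        "wg": 1.0, "ug": 1.0, "bg": 0.0}

# Forget gate forced open, input gate forced shut: the old memory is kept.
keep_params = dict(base, bf=100.0, bi=-100.0)
h_keep, c_keep = lstm_step(0.5, 0.0, 2.0, keep_params)

# Forget gate forced shut, input gate forced open: the memory is overwritten.
overwrite_params = dict(base, bf=-100.0, bi=100.0)
h_over, c_over = lstm_step(0.5, 0.0, 2.0, overwrite_params)

print("kept memory:       ", c_keep)
print("overwritten memory:", c_over)
```

Because the cell state can be carried forward almost unchanged when the forget gate is open, information from early in the sequence has a path to survive many time steps.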

The next type of recurrent neural network is the gated recurrent unit, also referred to as GRU. It is a type of RNN that in certain cases has advantages over LSTM. A GRU uses less memory and is faster than an LSTM, but LSTMs tend to be more accurate on datasets with longer sequences. So the trend here is that models should be capable of remembering and taking in longer input sequences.
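One rough way to see why a GRU is lighter than an LSTM is to count parameters. The sketch below assumes the textbook formulation in which each gate or candidate block has one input weight matrix, one recurrent weight matrix, and one bias vector; an LSTM has four such blocks and a GRU has three. The layer sizes are arbitrary example values, and actual library implementations may count biases slightly differently.

```python
def gated_rnn_params(input_size, hidden_size, num_blocks):
    """Parameter count for `num_blocks` gate/candidate blocks of one recurrent layer."""
    # Each block: weights over [input, previous hidden state] plus one bias vector.
    per_block = hidden_size * (input_size + hidden_size) + hidden_size
    return num_blocks * per_block


# Assumed example sizes: 128-dimensional inputs, 256 hidden units.
lstm_params = gated_rnn_params(128, 256, num_blocks=4)  # forget, input, output, candidate
gru_params = gated_rnn_params(128, 256, num_blocks=3)   # update, reset, candidate

print("LSTM parameters:", lstm_params)
print("GRU parameters: ", gru_params)
print("GRU / LSTM ratio:", gru_params / lstm_params)
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same size, which is consistent with it needing less memory and less computation per step.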

Transformers

The game-changer for sequential data came with something called Transformers, introduced in a paper built on the idea that attention is everything. So let's take a look at this. The paper 'Attention Is All You Need' introduces an architecture called the Transformer. Like LSTM-based models, the Transformer is an architecture for transforming one sequence into another with the help of two parts, the encoder and the decoder. But it differs from the previously described sequence-to-sequence models because it does not work like GRUs: it does not use recurrent neural networks at all. Until then, recurrent neural networks were one of the best ways to capture the time dependencies in a sequence. However, the team presenting 'Attention Is All You Need' showed that an architecture with only attention mechanisms, without any RNN, can improve results on translation and other NLP tasks. An example of a model built on this architecture is Google's BERT.
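The core operation behind this attention mechanism is scaled dot-product attention, softmax(Q Kᵀ / sqrt(d_k)) V. Here is a minimal sketch in plain Python, assuming queries, keys, and values are given as small lists of lists and ignoring multi-head projections, masking, and batching. The toy vectors at the bottom are made-up values for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """For each query, return a weighted average of the values plus the weights."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Blend the value vectors using the attention weights.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy example: one query that matches the first key more closely than the second.
outputs, weights = scaled_dot_product_attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0, 0.0], [0.0, 10.0]],
)
print("attention weights:", weights[0])
print("output:", outputs[0])
```

Every query looks at every key in one step, so no hidden state has to be passed along the sequence, which is what lets the architecture drop recurrence.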

So what exactly is a Transformer? Both the encoder and the decoder are composed of modules that can be stacked on top of each other multiple times. The inputs and outputs are first embedded into an n-dimensional space, since we cannot use the raw words directly, so we have to encode whatever inputs we provide. One slight but important part of the model is the positional encoding of the different words. Since we have no recurrent network that can remember how the sequence was fed into the model, we need to somehow give every word or part of the sequence a relative position, since a sequence depends on the order of its elements. These positions are added to the embedded representation of each word. So this was a brief overview of Transformers.
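One common way to build these position signals is the sinusoidal encoding from 'Attention Is All You Need': PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The sketch below assumes that scheme and uses plain Python lists; the function names and the toy embeddings are illustrative, and other models use learned position embeddings instead.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal position table: one row of d_model values per position."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_positions(embeddings):
    """Add the position row to the embedding of the word at that position."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb_row, pe_row)]
            for emb_row, pe_row in zip(embeddings, pe)]


# Toy example: the same word embedding appearing at two different positions.
same_word_twice = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
for row in add_positions(same_word_twice):
    print(row)
```

After the addition, the two identical embeddings are no longer identical, so the model can tell the first occurrence of the word from the second.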

Language Models

Let's move ahead and look at some popular language models. We'll start with OpenAI's GPT-3. The successor to GPT and GPT-2, GPT-3 is one of the most controversial pre-trained models. This large-scale transformer-based language model from OpenAI has 175 billion parameters, which is 10 times more than any previous non-sparse language model. It achieves strong performance on many NLP datasets, covering tasks such as translation and question answering, as well as several other tasks.

Then we have Google's BERT. It stands for Bidirectional Encoder Representations from Transformers. It is a pre-trained NLP model developed by Google in 2018. With it, anyone in the world can train their own question-answering model in about 30 minutes on a single Cloud TPU, or in a few hours using a single GPU. The company showcased its performance on 11 NLP tasks, including the very competitive Stanford Question Answering Dataset (SQuAD). Unlike other language models, BERT was pre-trained only on 2,500 million words of Wikipedia and 800 million words of BooksCorpus, and it has been successfully used as a pre-trained model in deep neural networks. According to the researchers, BERT achieved about 93% accuracy, which surpassed previous language models.

Next, we have ELMo. ELMo, short for Embeddings from Language Models, is a deep contextualized word representation that models the syntax and semantics of words as well as their linguistic context. The model, developed by AllenNLP, has been pre-trained on a huge text corpus and learns its functions from deep bi-directional language models (biLM). ELMo can easily be added to existing models, which drastically improves performance across a wide range of NLP problems, including question answering, textual entailment, and sentiment analysis.
