seq2seq model in Machine Learning
Seq2seq was first introduced for machine translation, by Google. Before that, the translation worked in a very naïve way. Each word that you used to type was converted to its target language giving no regard to its grammar and sentence structure. Seq2seq revolutionized the process of translation by making use of deep learning. It not only takes the current word/input into account while translating but also its neighborhood.
Nowadays, it is used for a variety of different applications such as image captioning, conversational models, text summarization, etc.
As the name suggests, seq2seq takes as input a sequence of words(sentence or sentences) and generates an output sequence of words. It does so by use of the recurrent neural network (RNN). Although the vanilla version of RNN is rarely used, its more advanced version i.e. LSTM or GRU is used. This is because RNN suffers from the problem of vanishing gradient. LSTM is used in the version proposed by Google. It develops the context of the word by taking 2 inputs at each point in time. One from the user and the other from its previous output, hence the name recurrent (output goes as input).
It mainly has two components i.e encoder and decoder, and hence sometimes it is called the Encoder-Decoder Network.
Encoder: It uses deep neural network layers and converts the input words to corresponding hidden vectors. Each vector represents the current word and the context of the word.
Decoder: It is similar to the encoder. It takes as input the hidden vector generated by the encoder, its own hidden states, and the current word to produce the next hidden vector and finally predict the next word.
Apart from these two, many optimizations have to lead to other components of seq2seq:
- Attention: The input to the decoder is a single vector that has to store all the information about the context. This becomes a problem with large sequences. Hence the attention mechanism is applied which allows the decoder to look at the input sequence selectively.
- Beam Search: The highest probability word is selected as the output by the decoder. But this does not always yield the best results, because of the basic problem of greedy algorithms. Hence beam search is applied which suggests possible translations at each step. This is done by making a tree of top k-results.
- Bucketing: Variable-length sequences are possible in a seq2seq model because of the padding of 0’s which is done to both input and output. However, if the max length set by us is 100 and the sentence is just 3 words long it causes huge wastage of space. So we use the concept of bucketing. We make buckets of different sizes like (4, 8) (8, 15), and so on, where 4 is the max input length defined by us and 8 is the max output length defined.