Explanation of BERT Model – NLP

BERT (Bidirectional Encoder Representations from Transformers) is a natural language processing model proposed by researchers at Google Research in 2018. When it was proposed, it achieved state-of-the-art accuracy on many NLP and NLU tasks such as:

  • General Language Understanding Evaluation (GLUE)
  • Stanford Question Answering Dataset (SQuAD) v1.1 and v2.0
  • Situations With Adversarial Generations (SWAG)

A few days after release, Google open-sourced the code along with two versions of the pre-trained model, BERT-Base and BERT-Large, both trained on a massive dataset. BERT also builds on several earlier NLP ideas and architectures, such as semi-supervised training, the OpenAI Transformer, ELMo embeddings, ULMFiT, and the Transformer itself.

BERT Model Architecture:
BERT is released in two sizes, BERT-Base and BERT-Large. The Base model is used to measure the performance of the architecture on a footing comparable to other architectures, while the Large model produces the state-of-the-art results reported in the research paper.

Semi-supervised Learning:
One of the main reasons for BERT's strong performance on different NLP tasks is its use of semi-supervised learning: the model is first trained on a general language modelling task, which lets it learn the patterns of the language. After this pre-training, BERT has language-processing capabilities that can be used to empower other models that we then build and train using supervised learning.
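To make this two-stage idea concrete, here is a minimal sketch. It assumes the Hugging Face transformers library and the public bert-base-uncased checkpoint; both are assumptions for illustration, not part of the original paper.

from transformers import BertForMaskedLM, BertForSequenceClassification

# Stage 1 (already done by Google): the checkpoint was pre-trained on
# unlabeled text with a masked language modelling objective.
pretrained_lm = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Stage 2: reuse the same pre-trained encoder weights and fine-tune them
# with ordinary supervised learning on a (hypothetical) 2-class labeled task.
classifier = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)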

BERT is basically an encoder stack of the Transformer architecture. A Transformer is an encoder-decoder network that uses self-attention on the encoder side and attention on the decoder side. BERT-Base has 12 layers in the encoder stack while BERT-Large has 24, compared with the 6 encoder layers of the Transformer described in the original paper. The BERT architectures (Base and Large) also have larger hidden sizes (768 and 1024 units respectively) and more attention heads (12 and 16 respectively) than the original Transformer, which uses 512 hidden units and 8 attention heads. In total, BERT-Base contains 110M parameters while BERT-Large contains 340M parameters.

BERT-Base and BERT-Large architecture
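These sizes are easy to verify programmatically. The sketch below assumes the Hugging Face transformers library, whose BertConfig objects describe the published checkpoints (an assumption for illustration; the paper itself does not use this library).

from transformers import BertConfig

base = BertConfig.from_pretrained("bert-base-uncased")
large = BertConfig.from_pretrained("bert-large-uncased")

for name, cfg in [("BERT-Base", base), ("BERT-Large", large)]:
    print(name, cfg.num_hidden_layers, "layers,",
          cfg.hidden_size, "hidden units,",
          cfg.num_attention_heads, "attention heads")
# Expected: 12 layers / 768 hidden / 12 heads for Base,
#           24 layers / 1024 hidden / 16 heads for Large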

The model takes the [CLS] token as its first input token, followed by the sequence of words. Here [CLS] is a special classification token. The input is passed up through the encoder layers: each layer applies self-attention, passes the result through a feed-forward network, and then hands it off to the next encoder.



The model outputs a vector of hidden size for each input token (768 for BERT-Base). If we want to build a classifier on top of this model, we can take the output corresponding to the [CLS] token.

BERT output as Embeddings


Now, this output vector can be used to perform a number of downstream tasks such as classification, translation, etc.
For example, the paper achieves great results by using just a single-layer neural network on top of BERT for the classification task.
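The following is a minimal sketch of such a single-layer classifier on top of BERT, assuming PyTorch and the Hugging Face transformers library; the two-class task and the example sentence are illustrative, not from the paper.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
classifier = torch.nn.Linear(bert.config.hidden_size, 2)  # 768 -> 2 classes

# The tokenizer automatically prepends the [CLS] token.
inputs = tokenizer("BERT makes transfer learning easy.", return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs)

cls_vector = outputs.last_hidden_state[:, 0]  # hidden state of the [CLS] token
logits = classifier(cls_vector)               # shape: (1, 2)
print(cls_vector.shape, logits.shape)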

ELMo Word Embeddings:
Word embeddings map a word to a vector of numerical values that represents its meaning. There are many popular word embeddings such as Word2Vec, GloVe, etc. ELMo differs from these because it assigns an embedding to a word based on its context, i.e. contextualized word embeddings: to generate the embedding of a word, ELMo looks at the entire sentence instead of using a fixed embedding for that word.
ELMo uses a bidirectional LSTM trained on a language modelling task to create these embeddings. The model is trained on a massive dataset in the language of our target task, and can then be used as a component in other architectures that need to perform specific language tasks.

ELMo Contextualized Embeddings Architecture


ELMo gained its language understanding from being trained to predict the next word in a sequence of words, a task called language modelling. This is convenient because we have vast amounts of text data that such a model can learn from without needing labels.
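To make the idea of context-dependent embeddings concrete, here is a toy sketch (not ELMo itself) showing how a bidirectional LSTM gives the same word different vectors in different sentences; the vocabulary and layer sizes are made up for illustration.

import torch

vocab = {"the": 0, "river": 1, "money": 2, "bank": 3}
embed = torch.nn.Embedding(len(vocab), 8)
bilstm = torch.nn.LSTM(input_size=8, hidden_size=8,
                       bidirectional=True, batch_first=True)

sent_a = torch.tensor([[vocab["the"], vocab["river"], vocab["bank"]]])
sent_b = torch.tensor([[vocab["the"], vocab["money"], vocab["bank"]]])

out_a, _ = bilstm(embed(sent_a))  # (1, 3, 16): forward + backward states
out_b, _ = bilstm(embed(sent_b))

# "bank" is the final token in both sentences, yet its output vector
# differs because the surrounding context differs.
print(torch.allclose(out_a[0, 2], out_b[0, 2]))  # False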

ULM-Fit: Transfer Learning In NLP:
ULMFiT introduced a language model and a procedure to effectively fine-tune that language model for a specific task. This enables NLP architectures to perform transfer learning on a pre-trained model, similar to what is done in many computer vision tasks.

OpenAI Transformer: Pre-training:

The Transformer architecture described above contains both an encoder and a decoder, which suits tasks like machine translation; but for tasks like sentence classification and next-word prediction, the full encoder-decoder is not needed. The OpenAI Transformer therefore trains only the decoder stack. Training only decoders works particularly well for the next-word-prediction task, because the decoder masks future tokens (words), which is exactly what that task requires.
The model has a stack of 12 decoder layers. Since there is no encoder, these decoder layers do not have an encoder-decoder attention sublayer; they only have self-attention layers.
We can train this model on the language modelling (next-word prediction) task by providing it with a large amount of unlabeled data, such as a collection of books.


OpenAI Transformer next-word prediction
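A minimal sketch of this next-word-prediction objective is shown below, using GPT-2 from the Hugging Face transformers library as a readily available decoder-only stand-in for the original OpenAI Transformer (the checkpoint choice and library are assumptions for illustration).

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("The quick brown fox jumps over the lazy", return_tensors="pt")
with torch.no_grad():
    # Passing labels=input_ids makes the model compute the shifted
    # next-token cross-entropy loss; the causal self-attention mask
    # hides future tokens from every position.
    out = model(**inputs, labels=inputs["input_ids"])

print("LM loss:", out.loss.item())
next_id = out.logits[0, -1].argmax().item()
print("Predicted next word:", tokenizer.decode([next_id]))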

Now that the OpenAI Transformer has some understanding of language, it can be used to perform downstream tasks like sentence classification. Below is an architecture for classifying a sentence as "Spam" or "Not Spam".


OpenAI Transformer sentence classification task
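As a rough sketch of this setup (assuming PyTorch, the Hugging Face transformers library, and GPT-2 as a decoder-only stand-in), a small linear head can be placed on the hidden state of the last token; the spam labels and example text are illustrative only.

import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt = GPT2Model.from_pretrained("gpt2")
head = torch.nn.Linear(gpt.config.hidden_size, 2)  # 2 classes: Spam / Not Spam

inputs = tokenizer("Win a free prize, click now!", return_tensors="pt")
with torch.no_grad():
    hidden = gpt(**inputs).last_hidden_state       # (1, seq_len, hidden_size)

logits = head(hidden[:, -1])                       # classify from the last token
print(logits.softmax(-1))                          # untrained head: roughly uniform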

Results: BERT reports fine-tuned results on 11 NLP tasks. Here, we discuss some of those results on benchmark NLP tasks.

  • GLUE:
    The General Language Understanding Evaluation (GLUE) benchmark is a collection of different natural language understanding tasks. These include MNLI (Multi-Genre Natural Language Inference), QQP (Quora Question Pairs), QNLI (Question Natural Language Inference), SST-2 (The Stanford Sentiment Treebank), CoLA (Corpus of Linguistic Acceptability), etc. Both BERT-Base and BERT-Large outperform previous models by a good margin (4.5% and 7% average improvement respectively). Below are the results of BERT-Base and BERT-Large compared to other models:

    Results of BERT on the GLUE benchmark tasks

  • SQuAD v1.1 Dataset
    The Stanford Question Answering Dataset is a collection of 100k crowd-sourced question-answer pairs. A data point contains a question and a passage from Wikipedia that contains the answer. The task is to predict the answer text span from the passage (a short code sketch of this task is given after this list).
    The best performing BERT (with ensembling and TriviaQA) outperforms the top leaderboard system by 1.5 F1 score as an ensemble and by 1.3 F1 score as a single system. In fact, a single BERT-Base outperforms the top ensemble system in terms of F1 score.
  • SWAG (Situations With Adversarial Generations)
    The SWAG dataset contains 113k sentence-completion examples that evaluate grounded commonsense inference. Given a sentence, the task is to choose the most plausible continuation among four choices.
    BERT-Large outperforms the OpenAI GPT baseline by 8.3%. It even performs better than an expert human.
    The results on the SWAG dataset are given below:

    Results on SWAG dataset
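As mentioned in the SQuAD item above, here is a minimal sketch of the SQuAD-style extractive QA task using the Hugging Face transformers pipeline; the fine-tuned checkpoint name is an assumption for illustration and is not part of the original paper.

from transformers import pipeline

qa = pipeline("question-answering",
              model="bert-large-uncased-whole-word-masking-finetuned-squad")

result = qa(
    question="Where was BERT developed?",
    context="BERT was proposed by researchers at Google Research in 2018.",
)
print(result["answer"], result["score"])  # the answer is a span copied from the context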

Conclusion:
BERT was able to improve the accuracy (or F1 score) on many natural language processing and language modelling tasks. The main breakthrough of this paper is that it allows semi-supervised learning to be used for many NLP tasks, enabling transfer learning in NLP. BERT is also used in Google Search; as of December 2019, it was used for over 70 languages.
