BERT (Bidirectional Encoder Representations from Transformers) is a natural language processing model proposed by researchers at Google Research in 2018. When it was proposed, it achieved state-of-the-art accuracy on many NLP and NLU tasks, such as:
- General Language Understanding Evaluation (GLUE)
- Stanford Question Answering Dataset (SQuAD) v1.1 and v2.0
- Situations With Adversarial Generations (SWAG)
A few days after the release, the authors open-sourced the code along with two versions of the pre-trained model, BERTBASE and BERTLARGE, which are trained on a massive dataset. BERT also builds on many earlier NLP ideas and architectures, such as semi-supervised training, the OpenAI Transformer, ELMo embeddings, ULM-Fit, and the Transformer.
- BERT Model Architecture: BERT is released in two sizes, BERTBASE and BERTLARGE. The BASE model is sized so that its performance can be compared with other architectures, while the LARGE model produces the state-of-the-art results reported in the research paper.
- Semi-Supervised Learning: The model is first trained on a specific task that enables it to understand the patterns of the language. After this training, the model (BERT) has language processing capabilities that can be used to empower other models that we build and train using supervised learning.
How does BERT work?
BERT is basically the encoder stack of the Transformer architecture. The Transformer is an encoder-decoder network that uses self-attention on the encoder side and attention on the decoder side. BERTBASE has 12 layers in the encoder stack, while BERTLARGE has 24. Both are deeper than the Transformer described in the original paper (6 encoder layers). The BERT architectures also have larger hidden sizes (768 for BASE and 1024 for LARGE) and more attention heads (12 and 16 respectively) than the original Transformer, which uses 512 hidden units and 8 attention heads. BERTBASE contains 110M parameters, while BERTLARGE has 340M parameters.
BERTBASE and BERT LARGE architecture.
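The parameter counts quoted above can be roughly reproduced from the layer counts and hidden sizes. The following is a minimal sketch, not the authors' code. It assumes a 30,522-token vocabulary, 512 positions, 2 segment types, and a feed-forward inner size of 4x the hidden size; these values come from the released English checkpoints rather than from this article. `BertConfig` and `count_parameters` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BertConfig:
    """Sizes needed to count parameters (illustrative, not a library class)."""
    num_layers: int
    hidden_size: int
    num_heads: int  # splits the hidden size; does not change the count
    # Assumptions taken from the released English uncased checkpoints.
    vocab_size: int = 30522
    max_positions: int = 512
    type_vocab_size: int = 2

    @property
    def intermediate_size(self) -> int:
        # Assumption: feed-forward inner size is 4x the hidden size.
        return 4 * self.hidden_size


def linear(n_in: int, n_out: int) -> int:
    """Weights plus biases of one dense layer."""
    return n_in * n_out + n_out


def count_parameters(cfg: BertConfig) -> int:
    h = cfg.hidden_size
    layer_norm = 2 * h  # scale and shift vectors
    embeddings = (
        (cfg.vocab_size + cfg.max_positions + cfg.type_vocab_size) * h
        + layer_norm
    )
    # Query, key, value and output projections, then a layer norm.
    attention = 4 * linear(h, h) + layer_norm
    feed_forward = (
        linear(h, cfg.intermediate_size)
        + linear(cfg.intermediate_size, h)
        + layer_norm
    )
    pooler = linear(h, h)  # dense layer applied to the CLS output
    return embeddings + cfg.num_layers * (attention + feed_forward) + pooler


BERT_BASE = BertConfig(num_layers=12, hidden_size=768, num_heads=12)
BERT_LARGE = BertConfig(num_layers=24, hidden_size=1024, num_heads=16)

if __name__ == "__main__":
    for name, cfg in [("BERTBASE", BERT_BASE), ("BERTLARGE", BERT_LARGE)]:
        print(f"{name}: {count_parameters(cfg) / 1e6:.1f}M parameters")
```

Under these assumptions the totals come out at roughly 109M and 335M, which is consistent with the rounded 110M and 340M figures.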
The model takes a special CLS (classification) token as its first input, followed by the sequence of words. The input is then passed up through the encoder layers. Each layer applies self-attention, passes the result through a feedforward network, and then hands it off to the next encoder. For each input position, the model outputs a vector of the hidden size (768 for BERTBASE). If we want to build a classifier on top of this model, we can take the output corresponding to the CLS token.
BERT output as Embeddings
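The flow described above (CLS token first, then self-attention and a feedforward step in every layer, then reading the CLS position) can be sketched in plain Python. This is a toy illustration under heavy simplifications, not BERT itself: the embeddings are a hypothetical deterministic stand-in, attention uses a single head with no learned projections, the feedforward step has no weights, and residual connections and layer normalization are left out.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def toy_embedding(token, dim):
    """Hypothetical stand-in for a learned embedding table."""
    seed = sum(ord(ch) for ch in token)
    return [math.sin(seed * (i + 1)) for i in range(dim)]


def self_attention(vectors):
    """Single-head scaled dot-product self-attention with Q = K = V.

    Real BERT layers use learned projections and several heads.
    """
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * vec[i] for w, vec in zip(weights, vectors))
             for i in range(dim)]
        )
    return outputs


def feed_forward(vector):
    """Placeholder position-wise network (no learned weights)."""
    return [math.tanh(x) for x in vector]


def encode(words, dim=8, num_layers=2):
    """Prepend [CLS], embed, then run the toy encoder layers."""
    tokens = ["[CLS]"] + list(words)
    hidden = [toy_embedding(tok, dim) for tok in tokens]
    for _ in range(num_layers):
        hidden = [feed_forward(vec) for vec in self_attention(hidden)]
    return tokens, hidden


if __name__ == "__main__":
    tokens, hidden = encode(["the", "cat", "sat"])
    cls_vector = hidden[0]  # output at the [CLS] position
    print(tokens, len(cls_vector))
```

Every position attends to every other position in both directions, which is why the vector at the CLS position can summarize the whole sequence.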
This output vector can now be used to perform a number of tasks such as classification, translation, etc. For example, the paper achieves great results on the classification task by placing just a single-layer neural network on top of the BERT model.
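A single-layer classifier on top of the CLS output amounts to one linear map followed by a softmax. The sketch below assumes a 768-dimensional CLS vector and two labels. The vector and the weights are random stand-ins rather than trained values, and `classify` is an illustrative name.

```python
import math
import random


def softmax(logits):
    """Convert logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(cls_vector, weights, bias):
    """One linear layer (W h + b) followed by softmax."""
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


def random_head(hidden_size, num_labels, seed=0):
    """Untrained weights; in practice these are learned during fine-tuning."""
    rng = random.Random(seed)
    weights = [
        [rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
        for _ in range(num_labels)
    ]
    return weights, [0.0] * num_labels


if __name__ == "__main__":
    rng = random.Random(1)
    cls_vector = [rng.uniform(-1.0, 1.0) for _ in range(768)]  # stand-in
    weights, bias = random_head(hidden_size=768, num_labels=2)
    print(classify(cls_vector, weights, bias))
```

During fine-tuning, this layer and the BERT weights beneath it are trained together on the labeled task.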
ELMo Word Embeddings
A word embedding is the projection of a word to a vector of numerical values based on its meaning. There are many popular word embeddings, such as Word2vec and GloVe. ELMo differs from these because it assigns an embedding to a word based on its context, i.e. it produces contextualized word embeddings. To generate the embedding of a word, ELMo looks at the entire sentence instead of using a fixed embedding for that word. ELMo uses a bidirectional LSTM trained on a specific task to create those embeddings. This model is trained on a massive dataset in the language of our dataset, and then we can use it as a component in other architectures that are required to perform specific language tasks.
Elmo Contextualize Embeddings Architecture
ELMo gained its language understanding from being trained to predict the next word in a sequence of words, a task called language modeling. This is convenient because we have vast amounts of text data that such a model can learn from without needing labels.
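To see why language modeling needs no labels, note that the training targets come from the text itself. The sketch below turns a raw sentence into (context, next word) pairs. It assumes whitespace tokenization and a fixed context window, both chosen only for illustration; `next_word_examples` is a hypothetical helper.

```python
def next_word_examples(text, context_size=3):
    """Build (context, next_word) training pairs from unlabeled text."""
    tokens = text.lower().split()
    examples = []
    for i in range(1, len(tokens)):
        # Up to `context_size` preceding tokens form the input.
        context = tokens[max(0, i - context_size):i]
        examples.append((context, tokens[i]))
    return examples


if __name__ == "__main__":
    for context, target in next_word_examples("the cat sat on the mat", 2):
        print(context, "->", target)
```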
ULM-Fit: Transfer Learning In NLP
ULM-Fit introduces a new language model and a process to effectively fine-tune that language model for a specific task. This enables NLP architectures to perform transfer learning on a pre-trained model, similar to what is done in many computer vision tasks.
Open AI Transformer: Pre-Training
The Transformer architecture above is a full encoder-decoder network. That design is good for certain tasks like machine translation, but the whole network is not needed for tasks like sentence classification and next-word prediction. In the OpenAI Transformer, only the decoder is trained. Training a decoder works well for next-word prediction because the decoder masks future tokens (words), which matches this task. The model is a stack of 12 decoder layers. Since there is no encoder, these decoder layers have no encoder-decoder attention and only use self-attention. We can train this model on the language modeling (next-word prediction) task by providing it with a large amount of unlabeled text, such as a collection of books.
OpenAI transformers next word Prediction
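The masking of future tokens mentioned above is usually implemented as a causal mask on the attention scores. The sketch below is a minimal illustration in plain Python and is not taken from the OpenAI code: position i may attend to positions 0..i only, and masked scores are set to negative infinity before the softmax so that they receive zero weight.

```python
import math


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_attention_weights(scores):
    """Row-wise softmax over raw attention scores with future positions hidden."""
    n = len(scores)
    mask = causal_mask(n)
    weights = []
    for i in range(n):
        row = [scores[i][j] if mask[i][j] else -math.inf for j in range(n)]
        peak = max(row)  # finite, because the diagonal is never masked
        exps = [math.exp(s - peak) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


if __name__ == "__main__":
    uniform_scores = [[0.0] * 4 for _ in range(4)]
    for row in masked_attention_weights(uniform_scores):
        print([round(w, 2) for w in row])
```

With uniform scores, each row spreads its weight evenly over the current and earlier positions and gives zero weight to later ones.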
Now that the OpenAI Transformer has some understanding of language, it can be used to perform downstream tasks like sentence classification. Below is an architecture for classifying a sentence as “Spam” or “Not Spam”.
OpenAI transformers Sentence Classification Task
Results: The paper reports fine-tuning results for BERT on 11 NLP tasks. Here, we discuss some of those results on benchmark NLP tasks.
- GLUE: The General Language Understanding Evaluation benchmark is a collection of different natural language understanding tasks. These include MNLI (Multi-Genre Natural Language Inference), QQP (Quora Question Pairs), QNLI (Question Natural Language Inference), SST-2 (The Stanford Sentiment Treebank), CoLA (Corpus of Linguistic Acceptability), etc. Both BERTBASE and BERTLARGE outperform previous models by a good margin (4.5% and 7% average accuracy improvement respectively). Below are the results of BERTBASE and BERTLARGE as compared to other models:
Result of BERT on GLUE NLP task
- SQuAD v1.1: The Stanford Question Answering Dataset is a collection of 100k crowdsourced question-answer pairs. A data point contains a question and a passage from Wikipedia that contains the answer. The task is to predict the answer text span in the passage. The best-performing BERT system (with ensembling and TriviaQA) outperforms the top leaderboard system by 1.5 F1 as an ensemble and by 1.3 F1 as a single system. In fact, a single BERT model outperforms the top ensemble system in terms of the F1 score.
- SWAG: The Situations With Adversarial Generations dataset contains 113k sentence completion examples that evaluate grounded commonsense inference. Given a sentence, the task is to choose the most plausible continuation among four choices. BERTLARGE outperforms OpenAI GPT by 8.3%. It even performs better than an expert human. The results on the SWAG dataset are given below:
Results on the SWAG dataset
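The F1 score quoted for SQuAD measures token overlap between the predicted answer span and the reference answer. The following is a simplified sketch of that idea. It only lowercases and splits on whitespace, whereas the official evaluation script applies further normalization and compares against several reference answers; `span_f1` is an illustrative name.

```python
from collections import Counter


def span_f1(prediction, reference):
    """Token-overlap F1 between a predicted and a reference answer span."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(span_f1("in 2018", "2018"))
    print(span_f1("Google Research", "google research"))
```

A prediction that contains the right answer plus extra words keeps full recall but loses precision, so it scores below 1.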
BERT improved the accuracy (or F1 score) on many natural language processing and language modeling tasks. The main breakthrough of the paper is enabling semi-supervised learning for many NLP tasks, which makes transfer learning possible in NLP. BERT is also used in Google Search; as of December 2019, it was used in 70 languages. Below are some examples of search queries in Google before and after using BERT.