BERT was proposed by researchers at Google AI in 2018. It transformed NLP in much the same way that AlexNet transformed computer vision in 2012, because it makes it possible to leverage the large amounts of available text data by training the model in a self-supervised way.
ALBERT was proposed by researchers at Google Research in 2019. The goal of the paper is to improve the training and results of the BERT architecture using techniques such as parameter sharing, factorization of the embedding matrix, and an inter-sentence coherence loss.
The backbone of the ALBERT architecture is similar to BERT: a stack of Transformer encoder layers with the GELU (Gaussian Error Linear Unit) activation function. However, ALBERT makes three main changes that are not present in BERT:
- Factorization of the embedding matrix: In BERT and its successors such as XLNet and RoBERTa, the input-layer embedding size (E) is tied to the hidden-layer size (H), so E = H. In ALBERT, the authors decouple the two by splitting the embedding matrix into two smaller matrices. The reasoning is that input-level embeddings only need to learn context-independent representations, whereas hidden-layer embeddings need to learn context-dependent ones, so E can be much smaller than H. For a vocabulary of size V, this reduces the embedding parameters from O(V × H) to O(V × E + E × H). This step leads to an 80% reduction in parameters, with a minor drop in performance compared to BERT.
- Cross-layer parameter sharing: The authors also propose sharing parameters across the layers of the model to improve efficiency and reduce redundancy. The paper argues that because BERT, XLNet, and RoBERTa stack encoder layers on top of one another, each with its own parameters, the model ends up learning similar operations in different layers. The authors consider three types of parameter sharing:
  - Share only the feed-forward network parameters.
  - Share only the attention parameters.
  - Share all parameters. This is the default setting used by the authors unless stated otherwise.

  This step leads to a 70% reduction in the overall number of parameters.
- Inter-sentence coherence prediction: Like BERT, ALBERT uses the masked language model (MLM) objective in training. However, instead of the NSP (Next Sentence Prediction) loss, ALBERT uses a new loss called SOP (Sentence Order Prediction). NSP is a binary classification loss for predicting whether two segments appear consecutively in the original text. Its disadvantage is that it mixes topic prediction with coherence prediction: its negative examples come from a different document, so the model can often solve the task just by detecting a change of topic. SOP, in contrast, focuses only on inter-sentence coherence: its negative examples are the same two consecutive segments with their order swapped.
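The difference between the two losses is easiest to see in how their training pairs are built. The following is a minimal sketch, not the authors' data pipeline: the function names, the 0/1 label convention, and the 50/50 sampling via a seeded `random.Random` are assumptions made for illustration.

```python
import random


def make_sop_example(first, second, rng):
    """Build one SOP pair from two consecutive segments of the same document.

    Label 1 (assumed convention): segments are in their original order.
    Label 0: the same two segments with their order swapped.
    """
    if rng.random() < 0.5:
        return first, second, 1
    return second, first, 0


def make_nsp_example(first, second, other_doc_segment, rng):
    """Build one NSP-style pair for comparison.

    The negative example pairs `first` with a segment taken from a
    different document, so topic alone can give the answer away.
    """
    if rng.random() < 0.5:
        return first, second, 1
    return first, other_doc_segment, 0


rng = random.Random(0)  # seeded only to make the illustration repeatable
s1 = "She opened the door."
s2 = "The room behind it was empty."
unrelated = "Interest rates rose again in March."

print(make_sop_example(s1, s2, rng))
print(make_nsp_example(s1, s2, unrelated, rng))
```

In the SOP case both the positive and the negative pair contain exactly the same sentences, so the only usable signal is their order, which is why the loss targets coherence rather than topic.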
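ALBERT is released in four different model sizes:

| Model Size | Parameters | Encoder Layers (L) | Embedding (E) | Hidden units (H) |
|---|---|---|---|---|
| Base | 12 M | 12 | 128 | 768 |
| Large | 18 M | 24 | 128 | 1024 |
| X Large | 60 M | 24 | 128 | 2048 |
| XX Large | 235 M | 12 | 128 | 4096 |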
As the table above shows, ALBERT models have far fewer parameters than the corresponding BERT models because of the architectural changes described above. For example, BERT base has 9x more parameters than ALBERT base, and BERT large has 18x more parameters than ALBERT large.
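To see where the savings come from, the two parameter-related changes (factorized embeddings and cross-layer sharing) can be turned into a rough back-of-the-envelope count. This is a sketch under stated assumptions, not the official parameter accounting: it assumes a vocabulary of 30,000 tokens for both models and a feed-forward size of 4 × H, and it ignores position and segment embeddings, the embedding LayerNorm, the pooler, and the pre-training heads. All function names are hypothetical.

```python
def encoder_layer_params(hidden: int, intermediate: int) -> int:
    """Weights and biases of one Transformer encoder layer."""
    # Query, key, value and output projections, each H x H plus a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers: H -> intermediate and intermediate -> H.
    feed_forward = (hidden * intermediate + intermediate) + (
        intermediate * hidden + hidden
    )
    # Two LayerNorms, each with a scale and a bias vector of size H.
    layer_norms = 2 * (2 * hidden)
    return attention + feed_forward + layer_norms


def embedding_params(vocab: int, embed: int, hidden: int) -> int:
    """Token embedding table, plus an E -> H projection when E != H."""
    table = vocab * embed
    projection = 0 if embed == hidden else embed * hidden + hidden
    return table + projection


def model_params(vocab: int, embed: int, hidden: int, layers: int,
                 share_layers: bool) -> int:
    """Approximate encoder size; shared layers are counted only once."""
    per_layer = encoder_layer_params(hidden, 4 * hidden)
    distinct_layers = 1 if share_layers else layers
    return embedding_params(vocab, embed, hidden) + distinct_layers * per_layer


VOCAB = 30_000  # assumed vocabulary size, used for both models

# BERT-style: E tied to H, every layer has its own parameters.
bert_base = model_params(VOCAB, embed=768, hidden=768, layers=12,
                         share_layers=False)
# ALBERT-style: E = 128, all 12 layers share one set of parameters.
albert_base = model_params(VOCAB, embed=128, hidden=768, layers=12,
                           share_layers=True)

print(f"BERT-base-like:   {bert_base / 1e6:.1f}M parameters")
print(f"ALBERT-base-like: {albert_base / 1e6:.1f}M parameters")
print(f"ratio: {bert_base / albert_base:.1f}x")
```

Under these simplifications the estimate comes out at roughly 108M versus 11M parameters, in line with the roughly 9x gap quoted for the base models. Note that sharing reduces the number of stored parameters, not the number of layers the input passes through.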
Like BERT, ALBERT is pre-trained on the English Wikipedia and BookCorpus datasets, which together contain 16 GB of uncompressed data.
- In this implementation, we will use a pre-trained ALBERT model from TF-Hub together with the ALBERT GitHub repository. We will run the model on the Microsoft Research Paraphrase Corpus (MRPC) dataset from the GLUE benchmark.
Results & Conclusion:
Despite having far fewer parameters, ALBERT has achieved state-of-the-art results on many NLP tasks. Below are the results of ALBERT on the GLUE benchmark datasets.
Below are the results of the ALBERT-xxl model on the SQuAD and RACE benchmark datasets.
Here, ALBERT (1M) denotes the model trained for 1M steps, whereas ALBERT (1.5M) denotes the model trained for 1.5M steps.
The authors have since released a new version of ALBERT (V2), which improves the average accuracy of the BASE, LARGE, and X-LARGE models compared to V1.