Open AI GPT-3

Last Updated : 09 Nov, 2022

GPT-3 was proposed by researchers at OpenAI as the next model in the GPT series, in the paper titled “Language Models are Few-Shot Learners”. It has 175 billion parameters, which is 10x more than any previous non-sparse language model. It can perform a wide range of tasks, from machine translation to code generation.

The model is not available for download as of now because of concerns about misuse. Instead, OpenAI offers access to GPT-3 through a paid API, which is currently available in beta.

GPT-3 is evaluated under three settings:
Zero-shot learning: The model predicts the answer without any training (no gradient updates). It is given only a description of the task and the input, and must predict the output from them.
One-shot learning: The model predicts the answer after seeing a single example of the task. The example is shown in the prompt as a demonstration, but it is not used for training. One-shot learning is also common in computer vision, for example in Siamese networks, where one training example and one test example are passed through a neural network and the distance between them is calculated.
Few-shot learning: The model predicts the answer after seeing only a few examples of the task. It is given a task description together with those examples.

Zero-shot, one-shot and few-shot learning

The settings above are all forms of in-context learning: the model is given a task description and, optionally, some examples in its input, and must then perform the task on the test examples without any weight updates. This is the approach GPT-3 mainly relies on.
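The difference between the three settings comes down to how many demonstrations are placed in the prompt. The following is a minimal sketch, not OpenAI's actual prompt format: the helper `build_prompt`, the `=>` separator, and the translation examples are all assumptions chosen for illustration.

```python
def build_prompt(task_description, demonstrations, query):
    """Assemble an in-context learning prompt (hypothetical format).

    demonstrations is a list of (input, output) pairs:
    0 pairs -> zero-shot, 1 pair -> one-shot, several pairs -> few-shot.
    """
    lines = [task_description]
    for source, target in demonstrations:
        lines.append(f"{source} => {target}")
    # The final line is left open for the model to complete.
    lines.append(f"{query} =>")
    return "\n".join(lines)


task = "Translate English to French:"
demos = [("sea otter", "loutre de mer"), ("cheese", "fromage")]

zero_shot = build_prompt(task, [], "peppermint")
one_shot = build_prompt(task, demos[:1], "peppermint")
few_shot = build_prompt(task, demos, "peppermint")

print(few_shot)
```

In all three cases the model's weights stay fixed; only the text of the prompt changes.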

Fine-tuning: In this process, a pre-trained model is trained further on a dataset for the target task, which is usually large. The weights are changed by performing gradient updates after every batch (or every example), as in the usual training of neural networks.
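To make the contrast with in-context learning concrete, here is a toy sketch of what "performing gradient updates" means. It fits a single weight with per-example gradient descent on a squared-error loss. The function `fine_tune`, the data, and the hyperparameters are assumptions for illustration and have nothing to do with GPT-3's real training code.

```python
def fine_tune(weight, examples, learning_rate=0.05, epochs=50):
    """Per-example gradient descent on the toy model y = weight * x."""
    for _ in range(epochs):
        for x, y in examples:
            prediction = weight * x
            # Gradient of (prediction - y)^2 with respect to weight.
            gradient = 2 * (prediction - y) * x
            # Unlike in-context learning, the weight itself is changed.
            weight -= learning_rate * gradient
    return weight


# Toy task: learn y = 2x from three labelled examples.
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
tuned_weight = fine_tune(0.0, examples)
print(tuned_weight)
```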

Architecture: GPT-3 was trained in eight sizes, with the number of parameters ranging from 125 million to 175 billion. Below are the architectural details of the different GPT-3 models.

Model Name	n_params	n_layers	d_model	n_heads	d_head	Batch Size (tokens)	Learning Rate
GPT-3 small	125 M	12	768	12	64	0.5 M	6 * 10^-4
GPT-3 Medium	350 M	24	1024	16	64	0.5 M	3 * 10^-4
GPT-3 Large	760 M	24	1536	16	96	0.5 M	2.5 * 10^-4
GPT-3 XL	1.3 B	24	2048	24	128	1 M	2 * 10^-4
GPT-3 2.7 B	2.7 B	32	2560	32	80	1 M	1.6 * 10^-4
GPT-3 6.7 B	6.7 B	32	4096	32	128	2 M	1.2 * 10^-4
GPT-3 13 B	13 B	40	5140	40	128	2 M	1 * 10^-4
GPT-3 175 B	175 B	96	12288	96	128	3.2 M	0.6 * 10^-4

n_params: Number of trainable parameters in the model.
n_layers: Number of layers in the model.
d_model: Number of units in each bottleneck layer.
n_heads: Number of attention heads.
d_head: Dimension of each attention head.
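The n_params column can be roughly reproduced from n_layers and d_model. The sketch below uses a common rule of thumb for decoder-only Transformers, about 12 * n_layers * d_model^2 weights in the attention and feed-forward blocks plus a token embedding matrix. Both the rule of thumb and the vocabulary size of 50257 are assumptions that are not stated in this article, so treat the output as an estimate rather than the exact published count.

```python
def approx_params(n_layers, d_model, vocab_size=50257):
    """Rough parameter estimate for a decoder-only Transformer.

    Assumes ~4 * d_model^2 attention weights and ~8 * d_model^2
    feed-forward weights per layer; biases and layer norms are ignored.
    """
    block_params = 12 * n_layers * d_model ** 2
    embedding_params = vocab_size * d_model
    return block_params + embedding_params


# (n_layers, d_model) taken from the table above.
models = {
    "GPT-3 small": (12, 768),
    "GPT-3 6.7 B": (32, 4096),
    "GPT-3 175 B": (96, 12288),
}

for name, (n_layers, d_model) in models.items():
    estimate = approx_params(n_layers, d_model)
    print(f"{name}: ~{estimate / 1e9:.2f} B parameters")
```

The estimates land within a few percent of the table values, which shows that almost all of the parameters sit in the stacked layers rather than in the embeddings.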

Result Details:

Language Modeling: For the language modeling task, GPT-3 is evaluated on the Penn Treebank dataset in the zero-shot setting. The largest GPT-3 model improved on the state of the art (SOTA) by 15 points of perplexity. GPT-3 is also evaluated on three other language modeling datasets.
- LAMBADA dataset: The LAMBADA dataset tests the modeling of long-range dependencies in text. The task is to predict the last word of a sentence, which requires reading a paragraph of context. On LAMBADA, few-shot GPT-3 improves accuracy by 18% over the previous SOTA, and even zero-shot GPT-3 is 8% more accurate than the previous SOTA.
- HellaSwag dataset: The HellaSwag dataset involves picking the best ending to a story or set of instructions. The examples were adversarially mined so that they are difficult for language models while remaining easy for humans. On HellaSwag, few-shot GPT-3 reaches 79.3% accuracy, which is below the previous state of the art (85.6%).
- StoryCloze: The StoryCloze 2016 dataset involves selecting the correct ending sentence for five-sentence stories. Few-shot GPT-3 obtains 87.7% accuracy, which is close to, but still below, the state of the art (91%).
Closed Book Question Answering: This task measures GPT-3’s ability to answer questions without being given any auxiliary data to search for answers, so the model has to rely on the broad factual knowledge stored in its parameters. GPT-3 is evaluated on three datasets (NaturalQS, WebQS, and TriviaQA) in the zero-shot, one-shot, and few-shot settings. The results are as follows: on TriviaQA, few-shot GPT-3 (71.2%) slightly exceeds the previous state of the art, but on NaturalQS and WebQS it still lags behind the Retrieval-Augmented Generation (RAG) model.
Translation: Most of GPT-3’s training data comes from the filtered Common Crawl dataset, so most of it is in English (93%), with only 7% in other languages. In the zero-shot setting, where only a description of the task is provided, GPT-3 underperforms previous unsupervised Neural Machine Translation (NMT) models. However, the authors noticed that the BLEU score increases by 7 points on average when a single example is provided, with a further increase of 4 BLEU from the one-shot to the few-shot setting. The authors also concluded that GPT-3 lags behind state-of-the-art NMT models when translating from English into other languages, but reaches or approaches state-of-the-art results when translating into English.
Winograd Style Task: In this task, the goal of the model is to determine which word a pronoun refers to when the pronoun is grammatically ambiguous but its meaning is clear to a human. Recent models have achieved human-level accuracy on the Winograd task, and GPT-3 also obtains accuracy close to the previous state of the art. On the bigger Winogrande dataset, however, there is still room for improvement in comparison to the previous state of the art.
Common Sense Reasoning: To capture physical and scientific reasoning, the model is evaluated on three datasets:
- PhysicalQA (PIQA): Contains common sense questions about how the physical world works and is intended as a probe of grounded understanding of the world. GPT-3 achieves 81.0% accuracy zero-shot, 80.5% accuracy one-shot, and 82.8% accuracy few-shot. This is better than the previous state-of-the-art accuracy of a fine-tuned RoBERTa.
- ARC: Contains multiple-choice questions collected from 3rd to 9th-grade science exams. On the Challenge version of the dataset, GPT-3 achieves 51.4% accuracy in the zero-shot setting, 53.2% in the one-shot setting, and 51.5% in the few-shot setting.
- OpenBookQA: On OpenBookQA, GPT-3 improves significantly from the zero-shot to the few-shot setting but is still over 20 points short of the overall state of the art (SOTA). GPT-3’s few-shot performance is similar to a fine-tuned BERT Large baseline on the leaderboard.
Reading Comprehension: For reading comprehension, GPT-3 is evaluated on 5 different datasets. Its results are close to the state of the art on the Conversational Question Answering (CoQA) dataset. However, on the other four datasets (DROP, QuAC, the Stanford Question Answering Dataset (SQuAD), and ReAding Comprehension from Examinations (RACE)), GPT-3 lags behind the state of the art by a large margin.
SuperGLUE: To summarize the results on NLP tasks and compare with popular models such as BERT and RoBERTa in a more systematic way, the authors also evaluate GPT-3 on a standardized collection of datasets, the SuperGLUE benchmark. Below are the results of GPT-3 on this benchmark.

Results on SuperGLUE benchmarks

NLI: Natural Language Inference (NLI) concerns the ability to understand the relationship between two sentences. In practice, this task is usually structured as a two- or three-class classification problem in which the model decides whether the second sentence logically follows from the first, contradicts the first, or is possibly true (neutral). SuperGLUE contains an NLI dataset (RTE), for which the results are provided above. GPT-3 is also tested on another NLI dataset called ANLI (Adversarial Natural Language Inference). This dataset contains three rounds of adversarially mined questions (R1, R2, and R3). The largest GPT-3 model gives ~40% accuracy on R3, which is well below the state of the art (48.3%).
Synthetic and Qualitative Task: To probe the abilities of GPT-3, the authors give it tasks that require it to recognize, on the fly, patterns that it is unlikely to have seen in training. First, the authors test GPT-3’s ability to perform arithmetic. Second, they evaluate GPT-3 on tasks that involve rearranging or unscrambling the letters in a word (such as anagrams and reversed words). Third, they test GPT-3’s ability to solve SAT-style analogy problems in the few-shot setting. Finally, GPT-3 is evaluated on several qualitative tasks, including using new words in a sentence, correcting English grammar, and news article generation.
- On the arithmetic tasks, few-shot GPT-3 gives almost 100% correct results on 2-digit addition and subtraction, but accuracy drops as the number of digits increases.
- GPT-3 also demonstrated impressive results on news article generation. The generated articles were shown to human readers, who had to decide whether each article was real or generated. Readers identified the articles generated by the GPT-3 175B model correctly only 52% of the time (compared to 50% for random guessing). Below is the generated article that readers found hardest to identify as machine-written (12% accuracy).

Title: United Methodists Agree to Historic Split
Subtitle: Those who oppose gay marriage will form their own denomination
Article: After two days of intense debate, the United Methodist Church
has agreed to a historic split - one that is expected to end in the
creation of a new denomination, one that will be "theologically and
socially conservative," according to The Washington Post. The majority of
delegates attending the church's annual General Conference in May voted to
strengthen a ban on the ordination of LGBTQ clergy and to write new rules
that will "discipline" clergy who officiate at same-sex weddings. But
those who opposed these measures have a new plan: They say they will form a
separate denomination by 2020, calling their church the Christian Methodist
denomination.
The Post notes that the denomination, which claims 12.5 million members, was
in the early 20th century the "largest Protestant denomination in the U.S.,"
but that it has been shrinking in recent decades. The new split will be the
second in the church's history. The first occurred in 1968, when roughly
10 percent of the denomination left to form the Evangelical United Brethren
Church. The Post notes that the proposed split "comes at a critical time
for the church, which has been losing members for years," which has been
"pushed toward the brink of a schism over the role of LGBTQ people in the
church." Gay marriage is not the only issue that has divided the church. In
2016, the denomination was split over ordination of transgender clergy, with
the North Pacific regional conference voting to ban them from serving as
clergy, and the South Pacific regional conference voting to allow them.

Datasets Used: Five different datasets are used in training. The biggest of them is the Common Crawl dataset, which contains nearly a trillion words before filtering; after filtering and preprocessing, it yields about 400 billion tokens. The other datasets are an expanded version of the WebText dataset, two internet-based book corpora, and English Wikipedia.

Dataset	Quantity (Num Tokens)	Weight in Training Mix
Common Crawl Dataset (filtered)	410 billion	60%
WebText 2	19 billion	22%
Books1	12 billion	8%
Books2	55 billion	8%
Wikipedia	3 billion	3%
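The weights in the training mix are not proportional to dataset size. The short sketch below compares each dataset's share of the total tokens with its weight in the mix, and then draws training sources according to those weights. The dictionary layout, the helper `sample_sources`, and the fixed random seed are assumptions for illustration only; the numbers are taken from the table above.

```python
import random

# name -> (number of tokens, weight in training mix), from the table above.
datasets = {
    "Common Crawl (filtered)": (410e9, 0.60),
    "WebText 2": (19e9, 0.22),
    "Books1": (12e9, 0.08),
    "Books2": (55e9, 0.08),
    "Wikipedia": (3e9, 0.03),
}

total_tokens = sum(tokens for tokens, _ in datasets.values())

# Ratio > 1 means the dataset is sampled more often than its size suggests.
oversampling = {}
for name, (tokens, weight) in datasets.items():
    share_by_size = tokens / total_tokens
    oversampling[name] = weight / share_by_size
    print(f"{name}: {share_by_size:.1%} of tokens, {weight:.0%} of mix")


def sample_sources(count, seed=0):
    """Pick which dataset each training sequence is drawn from."""
    rng = random.Random(seed)
    names = list(datasets)
    weights = [weight for _, weight in datasets.values()]
    # random.choices accepts weights that do not sum exactly to 1.
    return rng.choices(names, weights=weights, k=count)


print(sample_sources(5))
```

Under these numbers, Common Crawl is sampled less often than its size would suggest, while the smaller datasets such as Wikipedia and WebText 2 are sampled more often.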

Training Details:

All versions of GPT-3 are pre-trained with the Adam optimizer, using β₁ = 0.9, β₂ = 0.95, and ε = 10^-8. The batch size is increased linearly from a small value (32k tokens) to its full value over the first 4-12 billion tokens of training. The data is sampled without replacement during training to minimize overfitting.
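The two mechanisms described above can be sketched in a few lines of plain Python. `adam_step` is the standard Adam update for a single scalar parameter with the hyperparameters listed above, and `batch_size_at` is a linear batch-size ramp. Both function names are hypothetical, and the learning rate, the ramp length of 4 billion tokens, and the reading of "32k" as 32,000 are assumptions; this is not the authors' training code.

```python
import math


def adam_step(param, grad, m, v, step, lr, beta1=0.9, beta2=0.95, eps=1e-8):
    """One Adam update for a scalar parameter; step counts from 1."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** step)           # bias correction
    v_hat = v / (1 - beta2 ** step)
    param -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


def batch_size_at(tokens_seen, full_batch, ramp_tokens, start_batch=32_000):
    """Linearly ramp the batch size (in tokens) up to full_batch."""
    if tokens_seen >= ramp_tokens:
        return full_batch
    fraction = tokens_seen / ramp_tokens
    return int(start_batch + fraction * (full_batch - start_batch))


# One update on a toy parameter (lr chosen only for illustration).
param, m, v = adam_step(param=1.0, grad=0.5, m=0.0, v=0.0, step=1, lr=0.1)
print(param)

# Batch size for the 175 B model (3.2 M tokens) at a few points in training.
for tokens_seen in (0, 2_000_000_000, 4_000_000_000):
    print(tokens_seen, batch_size_at(tokens_seen, 3_200_000, 4_000_000_000))
```

On the first step the bias-corrected update has a magnitude of roughly lr regardless of the gradient's scale, which is the usual behaviour of Adam.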

Limitations:

Despite its strong improvements in qualitative and quantitative results, GPT-3 also has some limitations:

GPT-3 suffers from the same problems as other NLP models. Despite the model's size, GPT-3 samples sometimes repeat themselves semantically at the document level, start to lose coherence over sufficiently long passages, contradict themselves, and occasionally contain non-sequitur sentences or paragraphs.
GPT-3 is an autoregressive language model, and the experiments do not include any bidirectional architectures or other training objectives such as denoising. This could be a possible explanation for GPT-3’s comparatively poor few-shot performance on a few of the tasks, such as WIC (which involves comparing the use of a word in two sentences), ANLI (which involves comparing two sentences to see if one implies the other), and several reading comprehension tasks (e.g. QuAC and RACE).
While GPT-3 takes a step towards test-time sample efficiency closer to that of humans (one-shot or zero-shot), it still requires much more text during pre-training than a human sees in their lifetime.
GPT-3 also reflects common biases in its training data, such as biases related to race, gender, and religion.
- Bias towards Gender: To test for gender bias, the authors examined the gender associations of different occupations. The results are:
  - 83% of the 388 occupations evaluated were more likely to be associated with a male identifier by GPT-3. This includes physically labor-intensive jobs as well as jobs that require high levels of education.
  - Most of the descriptive words associated with women relate to appearance, while the descriptive words associated with men are more diverse.
- Bias towards Race: Across the model sizes, the authors noticed that text generated about Asian people had comparatively positive sentiment, while text generated about Black people had comparatively negative sentiment.
- Bias towards Religion: To evaluate biases related to religion, the authors generated 800 model outputs of length ≈50 from prompts containing the name of a religion. They found that some words are more strongly associated with a particular religion than with others.