[Data Collection and Curation Index] Gebru et al. (2021), OPT
1. Introduction
Recent rapid progress in large language models has greatly expanded their potential applications in natural language processing. These models are typically based on the Transformer architecture and are pre-trained on massive amounts of text. Pre-trained transformers can achieve strong performance across a wide range of NLP tasks, but training and maintaining large-scale models consumes substantial compute and energy. This paper introduces the Open Pretrained Transformers (OPT) models, which aim to roughly match the performance of GPT-3 while applying more efficient data collection and training methods.
2. Method
2.1 Model Architecture
OPT uses a decoder-only Transformer architecture, with models ranging from 125M to 175B parameters. The architecture largely follows structures proposed in prior work, with a few optimizations added for more efficient computation. The configuration of each model (number of layers, attention heads, and embedding size) is summarized in Table 1.
Weight initialization follows the Megatron-LM settings: weights are drawn from a normal distribution with mean 0 and standard deviation 0.006. The standard deviation of the output-layer weights is scaled by \(\frac{1}{\sqrt{2L}}\), which improves training stability in deep networks.
2.2 Training Setup
Training uses the AdamW optimizer with \(\beta_1 = 0.9\), \(\beta_2 = 0.95\), and a weight decay of 0.1. The learning rate follows a schedule that increases linearly during warmup and then gradually decays. A dropout of 0.1 is applied to prevent overfitting, and gradient clipping is used to control exploding gradients during training.
\[\text{LR}(\text{step}) = \text{LR}_{\text{max}} \times \left(1 - 0.9 \cdot \frac{\text{step}}{\text{total steps}}\right)\]

After the warmup phase, the learning rate decays linearly to 10% of its maximum value.

2.3 Pre-training Data
The data used for pre-training combines several datasets, including subsets of the RoBERTa corpus, the Pile, and PushShift.io Reddit.
These datasets were chosen to cover a broad range of topics while minimizing duplication. During preprocessing, the data was tokenized and partially filtered into a form suitable for training.
3. Evaluation
The OPT models were evaluated on a variety of NLP benchmarks. In particular, they achieve performance comparable to GPT-3 on natural language understanding (NLU) tasks, and better results on some tasks. The model's interpretability and consistency were also treated as important evaluation criteria.
3.1 Prompting and Few-shot Evaluation
This paper evaluates the performance of the OPT models against GPT-3 on a variety of NLP tasks, using a range of datasets and benchmarks.
Evaluation criterion: accuracy is used as the primary metric; for certain tasks, the problem is reformulated as a multiple-choice question.
\[\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}\]

Performance differences across tasks can be described with the following probabilistic model.
\[P(\text{correct}) = \frac{e^{\beta \cdot x_i}}{\sum_j e^{\beta \cdot x_j}}\]

where \(x_i\) is the input for a given task and \(\beta\) denotes the model parameters.
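This expression is a softmax over per-candidate scores. As a rough numerical illustration only (the scores below are made up; in the actual evaluation, candidate answers are scored with the language model rather than a single parameter vector \(\beta\)):

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of real-valued scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical per-candidate scores (the beta . x_j terms) for a 4-way multiple-choice item.
scores = [2.1, 0.3, -0.5, 1.2]
probs = softmax(scores)
print(probs)                                          # probability assigned to each candidate
print(max(range(len(probs)), key=probs.__getitem__))  # index of the predicted answer
```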
3.2 Dialogue
OPT-175B was evaluated on several open-source dialogue datasets, since large language models are considered an important component of recent dialogue systems.
Evaluation criteria: Perplexity and Unigram F1 are used to measure the model's dialogue generation ability.
\[\text{Perplexity} = e^{-\frac{1}{N} \sum_{i=1}^N \log P(w_i)}\]

The Unigram F1 score measures the word-level overlap between the model's generated responses and the references, and is computed with the following formula.
\[\text{F1} = 2 \cdot \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}\]

where precision is the fraction of predicted words that are correct, and recall is the fraction of the relevant words in the reference that the model actually predicted.
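Unigram F1 can be computed with nothing more than token counting. A minimal sketch, assuming whitespace tokenization and lowercasing (the official ConvAI2 scorer may normalize text differently):

```python
from collections import Counter

def unigram_f1(prediction: str, reference: str) -> float:
    """Unigram F1 between a generated reply and a reference reply (whitespace tokens)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(unigram_f1("i love hiking in the mountains", "i really love hiking"))
```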
4. Bias and Toxicity Evaluation
To understand the potential harms of OPT-175B, a series of benchmarks related to hate speech detection, stereotype awareness, and toxic content generation is evaluated. Although these benchmarks have clear limitations (Blodgett et al., 2021; Jacobs and Wallach, 2021), quantitative measurement against them provides a first step toward understanding the limitations of OPT-175B. The comparisons in this evaluation are made primarily against GPT-3 Davinci.
4.1 Hate Speech Detection
Using the ETHOS dataset provided by Mollas et al. (2020) and adapted by Chiu and Alexander (2021), the ability of OPT-175B to determine whether an English sentence is racist or sexist (or neither) is measured. In the zero-, one-, and few-shot binary settings, the model is presented with text, asked to consider whether it is racist or sexist, and asked to provide a yes/no response. In the few-shot multiclass setting, the model is asked to provide a yes/no/neither response.
\[F1_{\text{score}} = \frac{2 \times (\text{Precision} \times \text{Recall})}{\text{Precision} + \text{Recall}}\]

According to the results in Table 3, OPT-175B outperforms Davinci in all settings.
Speculating on the causes of these results: (1) evaluation through the Davinci API may include more advanced safety-control mechanisms than the original 175B GPT-3 model used in Brown et al. (2020); and (2) the presence of unmoderated social media discussions in the pre-training dataset may have provided additional inductive bias that helps with such classification tasks.
4.2 CrowS-Pairs
CrowS-Pairs (Nangia et al., 2020) is a crowdsourced benchmark developed for masked language models, which aims to measure intrasentence bias in nine categories: gender, religion, race/color, sexual orientation, age, nationality, disability, physical appearance, and socioeconomic status.
Each example consists of a pair of sentences representing a stereotype or an anti-stereotype about a particular group. A higher score indicates that the model prefers the stereotypical expression.
\[\text{Bias Score} = \frac{\sum \text{Stereotype Preference}}{\text{Total Evaluations}}\]

According to Table 4, OPT-175B exhibits more stereotypical bias than Davinci in almost every category. As Nangia et al. (2020) showed, the Pushshift.io Reddit corpus has a higher incidence of stereotypical and discriminatory text than other corpora (e.g., Wikipedia), so OPT-175B may have learned more discriminatory associations, which directly affects its performance on CrowS-Pairs.
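For an autoregressive model, one simple way to obtain such a score is to check, for each sentence pair, which sentence the model assigns the higher likelihood. The sketch below assumes an external `log_likelihood` scoring function and is not the exact CrowS-Pairs metric, which was originally formulated with pseudo-likelihoods for masked language models:

```python
def bias_score(pairs, log_likelihood):
    """Fraction of pairs for which the model assigns a higher likelihood to the
    stereotypical sentence than to its anti-stereotypical counterpart.
    `pairs` is an iterable of (stereotype_sentence, anti_stereotype_sentence) tuples,
    and `log_likelihood` is any callable returning a sentence-level log-likelihood."""
    pairs = list(pairs)
    preferred = sum(1 for stereo, anti in pairs
                    if log_likelihood(stereo) > log_likelihood(anti))
    return 100.0 * preferred / len(pairs)   # reported as a percentage; 50 means no preference

# Hypothetical usage with a model-specific scoring function:
# score = bias_score(crows_pairs_examples, log_likelihood=model_sentence_loglik)
```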
4.3 StereoSet
Following Lieber et al. (2021) and Artetxe et al. (2021), StereoSet (Nadeem et al., 2021) is used to measure stereotypical bias in four categories: profession, gender, religion, and race. Like CrowS-Pairs, it includes intrasentence measurements, but it also includes intersentence measurements to test the model's ability to incorporate additional context. To account for the potential trade-off between bias detection and language modeling ability, StereoSet includes a Language Modeling Score (LMS) and a Stereotype Score (SS), which are combined to form the Idealized Context Association Test (ICAT) score.
Large language models (LLMs) trained on massive text collections have shown surprising emergent capabilities to generate text and perform zero- and few-shot learning (Brown et al., 2020; Lieber et al., 2021; Smith et al., 2022; Rae et al., 2021; Chowdhery et al., 2022). While in some cases the public can interact with these models through paid APIs, full model access is currently limited to only a few highly resourced labs.2 This restricted access has limited researchers’ ability to study how and why these large language models work, hindering progress on improving known challenges in areas such as robustness, bias, and toxicity.
2 Exceptions include work by EleutherAI, who released dense models up to 20B in size (Black et al., 2022), Salesforce (Nijkamp et al., 2022), and Meta AI, who released dense models up to 13B and sparse models up to 1.1T (Artetxe et al., 2021). There is also ongoing work from the BigScience workshop (https://bigscience.huggingface.co/), which aims to open source very large multilingual language models and datasets.
In this technical report, we present Open Pretrained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We train the OPT models to roughly match the performance and sizes of the GPT-3 class of models, while also applying the latest best practices in data collection and efficient training. Our aim in developing this suite of OPT models is to enable reproducible and responsible research at scale, and to bring more voices to the table in studying the impact of these LLMs. Definitions of risk, harm, bias, and toxicity, etc., should be articulated by the collective research community as a whole, which is only possible when models are available for study. We are releasing all of our models between 125M and 66B parameters, and will provide full research access to OPT-175B upon request. Access will be granted to academic researchers; those affiliated with organizations in government, civil society, and academia; and those in industry research laboratories. We are also releasing both the logbook of our model creation as well as our codebase, metaseq,3 which enabled training OPT-175B on 992 80GB A100 GPUs, reaching 147 TFLOP/s utilization per GPU. From this implementation, and from using the latest generation of NVIDIA hardware, we are able to develop OPT-175B using only 1/7th the carbon footprint of GPT-3. While this is a significant achievement, the energy cost of creating such a model is still nontrivial, and repeated efforts to replicate a model of this size will only amplify the growing compute footprint of these LLMs.
We believe the entire AI community — academic researchers, civil society, policymakers, and industry — must work together to develop clear guidelines around responsible AI in general and responsible LLMs in particular, given their centrality in many downstream language applications. A much broader segment of the AI community needs access to these models in order to conduct reproducible research and collectively drive the field forward. With the release of OPT-175B and smaller-scale baselines, we hope to increase the diversity of voices defining the ethical considerations of such technologies.
Table 1: Model architecture details. We report the number of layers (#L), number of attention heads (#H), and the embedding size (d_model). We also report the peak Learning Rate (LR) and global batch size in number of tokens (Batch).
We present results on eight Transformer language models ranging from 125 million to 175 billion parameters. Architectural details are displayed in Table 1. In the interest of transparency, and to reduce risk of training instabilities, our models and hyperparameters largely follow Brown et al. (2020), with variations in batch size mostly to obtain increased computational efficiency.
For weight initialization, we follow the same settings provided in the Megatron-LM codebase,4 using a normal distribution with zero mean and standard deviation of 0.006. The standard deviation for output layers is scaled by a \(1.0/\sqrt{2L}\) term, where L is the total number of layers. All bias terms are initialized as 0, and all models are trained with ReLU activation and a sequence length of 2048.
4 https://github.com/NVIDIA/Megatron-LM/blob/main/examples/pretrain_gpt3_175B.sh
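The initialization described above can be sketched in a few lines of PyTorch. The parameter-name checks used to identify the per-block output projections (`out_proj`, `fc2`) are assumptions for illustration, not the names used in metaseq:

```python
import math
import torch.nn as nn

def init_opt_weights(model: nn.Module, num_layers: int, std: float = 0.006) -> None:
    """Sketch of the described initialization: N(0, 0.006) weights, zero biases,
    and output-projection std scaled by 1 / sqrt(2 * num_layers)."""
    for name, param in model.named_parameters():
        if name.endswith("bias"):
            nn.init.zeros_(param)  # all bias terms are initialized to 0
        elif "out_proj" in name or "fc2" in name:
            # assumed names for the attention/MLP output projections of each block
            nn.init.normal_(param, mean=0.0, std=std / math.sqrt(2 * num_layers))
        elif param.ndim >= 2:
            nn.init.normal_(param, mean=0.0, std=std)
        # remaining 1-D parameters (e.g. LayerNorm weights) keep their defaults
```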
We use an AdamW optimizer (Loshchilov and Hutter, 2017) with (β1, β2) set to (0.9, 0.95), and weight decay of 0.1. We follow a linear learning rate schedule, warming up from 0 to the maximum learning rate over the first 2000 steps in OPT-175B, or over 375M tokens in our smaller baselines, and decaying down to 10% of the maximum LR over 300B tokens. A number of mid-flight changes to LR were also required (see Section 2.5). Our batch sizes range from 0.5M to 4M depending on the model size (see Table 1) and are kept constant throughout the course of training.
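A minimal sketch of this schedule as a plain Python function; the step counts below are illustrative stand-ins (the paper specifies warmup over 2000 steps, or 375M tokens for smaller models, and decay to 10% of the peak over 300B tokens):

```python
def opt_lr(step: int, max_lr: float, warmup_steps: int = 2000,
           decay_steps: int = 140_000, min_ratio: float = 0.1) -> float:
    """Linear warmup to max_lr, then linear decay to min_ratio * max_lr."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return max_lr * (1.0 - (1.0 - min_ratio) * progress)

# Example: peak LR of 1.2e-4, queried at a few points in training.
for s in (0, 1000, 2000, 70_000, 142_000):
    print(s, opt_lr(s, max_lr=1.2e-4))
```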
We use a dropout of 0.1 throughout, but we do not apply any dropout to embeddings. We clip gradient norms at 1.0, except for some mid-flight changes that reduce this threshold down from 1.0 to 0.3 (see Section 2.5). We also include a gradient predivide factor to reduce the risk of over/underflows when computing the gradient across all ranks (splitting the division by the world size of N into two division operations by √N).
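A sketch of what a gradient predivide factor looks like in plain PyTorch data parallelism, assuming the division by the world size N is split into two divides by √N; the actual metaseq/FSDP implementation handles this internally and differs in detail:

```python
import math
import torch.distributed as dist

def allreduce_gradients_with_predivide(params, world_size: int) -> None:
    """Average gradients across ranks while splitting the divide-by-N into two
    divides by sqrt(N), reducing the risk of FP16 over/underflow.
    Assumes torch.distributed has already been initialized."""
    predivide = math.sqrt(world_size)
    for p in params:
        if p.grad is None:
            continue
        p.grad.div_(predivide)                         # first half of the division
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # sum over all ranks
        p.grad.div_(predivide)                         # second half completes the average

# Gradient clipping is then applied to the averaged gradients, e.g.:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```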
The pre-training corpus contains a concatenation of datasets used in RoBERTa (Liu et al., 2019b), the Pile (Gao et al., 2021a), and PushShift.io Reddit (Baumgartner et al., 2020; Roller et al., 2021). All corpora were previously collected or filtered to contain predominantly English text, but a small amount of non-English data is still present within the corpus via CommonCrawl.
We removed duplicated documents across all datasets by filtering out documents via MinhashLSH (Rajaraman and Ullman, 2011) with a Jaccard similarity ≥ .95. We found the Pile was particularly full of duplicate documents, and advise future researchers using the Pile to perform additional de-duplication processing.
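The paper does not specify the deduplication implementation; the sketch below uses the open-source `datasketch` library with whitespace tokenization as one plausible way to filter near-duplicates at a Jaccard threshold of 0.95:

```python
from datasketch import MinHash, MinHashLSH  # third-party library; one possible implementation

def minhash(text: str, num_perm: int = 128) -> MinHash:
    """MinHash signature of a document, using whitespace tokens as the set elements."""
    m = MinHash(num_perm=num_perm)
    for token in text.split():
        m.update(token.encode("utf8"))
    return m

def deduplicate(docs: dict[str, str], threshold: float = 0.95) -> list[str]:
    """Keep only document ids whose signature does not collide with an
    already-kept document at estimated Jaccard similarity >= threshold."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for doc_id, text in docs.items():
        sig = minhash(text)
        if not lsh.query(sig):       # no near-duplicate already indexed
            lsh.insert(doc_id, sig)
            kept.append(doc_id)
    return kept
```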
We tokenize all corpora using the GPT-2 byte level BPE tokenizer (Sennrich et al., 2016; Radford et al., 2019; Brown et al., 2020). Our final corpus contains roughly 180B tokens.
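For reference, the same byte-level BPE vocabulary is available through the Hugging Face `transformers` package (metaseq ships its own copy of the tokenizer, so this is only a convenient stand-in):

```python
from transformers import GPT2TokenizerFast  # Hugging Face implementation of the GPT-2 byte-level BPE

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
ids = tokenizer.encode("Open Pretrained Transformers use byte-level BPE.")
print(len(ids), ids[:5])          # token count and a few token ids
print(tokenizer.decode(ids))      # round-trips back to the original string
```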
RoBERTa We included the BookCorpus (Zhu et al., 2015) and Stories (Trinh and Le, 2018) subsets of the RoBERTa corpus and utilized an updated version of CCNews, containing news stories crawled through September 28, 2021. This CCNews v2 corpus was preprocessed the same way as the original RoBERTa CCNews (Liu et al., 2019b).
The Pile We included a subset of the Pile (Gao et al., 2021a), including: CommonCrawl, DM Mathematics, Project Gutenberg, HackerNews, OpenSubtitles, OpenWebText2, USPTO and Wikipedia. Other subsets of the Pile were eliminated as we found they increased the risk of instabilities, as measured by tendency to cause spikes in gradient norms at the 1.3B scale, or were otherwise deemed unsuitable. All subsets went through additional ad-hoc whitespace normalization.
PushShift.io Reddit We included a subset of the Pushshift.io corpus produced by Baumgartner et al. (2020) and previously used by Roller et al. (2021). To convert the conversational trees into language-model-accessible documents, we extracted the longest chain of comments in each thread and discarded all other paths in the tree. This reduced the corpus by about 66%.
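Extracting the longest comment chain is a straightforward longest-path walk over the reply tree. A small illustrative sketch with a hypothetical thread structure (the real Pushshift.io records are JSON objects with parent ids, so some preprocessing is assumed):

```python
def longest_chain(thread: dict[str, list[str]], root: str) -> list[str]:
    """Return the longest root-to-leaf chain of comment ids in a thread,
    where `thread` maps a comment id to the list of its reply ids."""
    children = thread.get(root, [])
    if not children:
        return [root]
    best = max((longest_chain(thread, child) for child in children), key=len)
    return [root] + best

# Hypothetical thread: root post "p" with two reply branches.
thread = {"p": ["c1", "c2"], "c1": ["c3"], "c3": [], "c2": []}
print(longest_chain(thread, "p"))  # ['p', 'c1', 'c3']; all other paths are discarded
```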
We trained OPT-175B on 992 80GB A100 GPUs, by utilizing Fully Sharded Data Parallel (Artetxe et al., 2021) with Megatron-LM Tensor Parallelism (Shoeybi et al., 2019). We achieve utilization of up to 147 TFLOP/s per GPU. We keep Adam state in FP32, since we shard it across all hosts, while the model weights remain in FP16. To avoid underflows, we used dynamic loss scaling, as described in Micikevicius et al. (2017).
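Dynamic loss scaling as described by Micikevicius et al. (2017) is available off the shelf in PyTorch AMP; the sketch below shows the general pattern with an assumed HF-style `model`, `optimizer`, and iterable of `batches`, not the metaseq training loop:

```python
import torch

def train_steps(model, optimizer, batches):
    """Sketch of FP16 training with dynamic loss scaling via PyTorch AMP."""
    scaler = torch.cuda.amp.GradScaler(init_scale=2.0**16)
    for batch in batches:
        optimizer.zero_grad()
        with torch.cuda.amp.autocast(dtype=torch.float16):
            loss = model(**batch).loss      # assumes an HF-style model returning .loss
        scaler.scale(loss).backward()       # scale up before backward to avoid FP16 underflow
        scaler.step(optimizer)              # unscales grads; skips the step on inf/nan
        scaler.update()                     # grows/shrinks the scale based on recent overflows
```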
Here we describe significant training process adjustments that arose during OPT-175B pre-training.
Hardware Failures We faced a significant number of hardware failures in our compute cluster while training OPT-175B. In total, hardware failures contributed to at least 35 manual restarts and the cycling of over 100 hosts over the course of 2 months. During manual restarts, the training run was paused, and a series of diagnostics tests were conducted to detect problematic nodes. Flagged nodes were then cordoned off and training was resumed from the last saved checkpoint. Given the difference between the number of hosts cycled out and the number of manual restarts, we estimate 70+ automatic restarts due to hardware failures.
Loss Divergences Loss divergences were also an issue in our training run. When the loss diverged, we found that lowering the learning rate and restarting from an earlier checkpoint allowed for the job to recover and continue training. We noticed a correlation between loss divergence, our dynamic loss scalar crashing to 0, and the l2-norm of the activations of the final layer spiking. These observations led us to pick restart points for which our dynamic loss scalar was still in a “healthy” state (≥ 1.0), and after which our activation norms would trend downward instead of growing unboundedly. Our empirical LR schedule is shown in Figure 1. Early in training, we also noticed that lowering gradient clipping from 1.0 to 0.3 helped with stability; see our released logbook for exact details. Figure 2 shows our validation loss with respect to training iterations.
Figure 1: Empirical LR schedule. We found that lowering learning rate was helpful for avoiding instabilities.
Figure 2: Validation Perplexity. Our mid-flight LR changes had clear effects on validation perplexity.
We evaluate our model on 16 standard NLP tasks utilized in the literature: HellaSwag (Zellers et al., 2019), StoryCloze (Mostafazadeh et al., 2016), PIQA (Bisk et al., 2020), ARC Easy and Challenge (Clark et al., 2018), OpenBookQA (Mihaylov et al., 2018), Winograd (Levesque et al., 2011), WinoGrande (Sakaguchi et al., 2020), and SuperGLUE (Wang et al., 2019). We follow GPT-3 (Brown et al., 2020) by using their prompts and overall experimental setup. We compare primarily to GPT-3, having aimed to re-implement their evaluation settings, but include reported performance of other LLMs on a per-task basis when available (Lieber et al., 2021; Rae et al., 2021; Hoffmann et al., 2022; Black et al., 2022).
We report performance in accuracy (omitting F1 for MultiRC and ReCoRD for consistency in evaluation metrics). For the Winograd Schema Challenge (WSC) task in the SuperGLUE benchmark, we follow Brown et al. (2020) and formulate the task as multiple choice questions, which is known to affect performance (Liu et al., 2020).
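Casting a task as multiple choice typically means scoring each candidate completion with the language model and picking the most likely one. A sketch under that assumption, using a Hugging Face-style causal LM and ignoring tokenization boundary effects and length normalization:

```python
import torch

@torch.no_grad()
def score_choice(model, tokenizer, context: str, choice: str) -> float:
    """Sum of log-probabilities the LM assigns to `choice` tokens given `context`."""
    ctx_ids = tokenizer.encode(context)
    full_ids = tokenizer.encode(context + choice)
    input_ids = torch.tensor([full_ids])
    logits = model(input_ids).logits                     # assumes an HF-style causal LM
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1) # logit at position t predicts token t+1
    choice_positions = range(len(ctx_ids) - 1, len(full_ids) - 1)
    return sum(logprobs[pos, full_ids[pos + 1]].item() for pos in choice_positions)

def predict(model, tokenizer, context: str, choices: list[str]) -> int:
    """Index of the highest-scoring candidate completion."""
    return max(range(len(choices)),
               key=lambda i: score_choice(model, tokenizer, context, choices[i]))
```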
Zero-shot Overall average zero-shot performance across all 14 tasks may be seen in Figure 3. Overall, we see our average performance follows the trend of GPT-3. However, performance can vary radically across the tasks: for a full breakdown, see Appendix A. Note that we intentionally removed MultiRC and WIC from these averages, as these datasets seem to systematically favor GPT-3 or OPT disproportionately.
Our performance roughly matched GPT-3 for 10 tasks, and underperformed in 3 tasks (ARC Challenge and MultiRC). In 3 tasks (CB, BoolQ, WSC), we find both GPT and OPT models display unpredictable behavior with respect to scale, likely due to the small size of the validation set in these 3 tasks (56, 277, and 104 examples, respectively). In WIC, we see that the OPT models always outperform the GPT-3 models, though the numbers reported by Brown et al. (2020) also seem questionable, given WIC being a binary classification task.5 For MultiRC, we are unable to replicate the GPT-3 results using the Davinci API6 within our evaluation setup, suggesting differences in the methods of evaluation on this task. For BoolQ and WSC, we note that both OPT and GPT models seem to hover around majority-class accuracy, suggesting small perturbations in probability masses may be dominating the evaluations.
5 Brown et al. (2020) reports 0% accuracy on WIC, which implies 100% accuracy if the classification was inverted.
Figure 3: Zero-shot NLP Evaluation Averages. Across a variety of tasks and model sizes, OPT largely matches the reported averages of GPT-3. However, performance varies greatly per task: see Appendix A.
Figure 4: Multi-shot performance. OPT performance for one- and few-shot lags behind GPT-3 models, but performance depends heavily per task; see Appendix A.
Chinchilla (Hoffmann et al., 2022) and Gopher (Rae et al., 2021) perform roughly consistently with others for their parameter sizes, while PaLM (Chowdhery et al., 2022) generally performs better across all settings, even when controlling for number of parameters. We speculate the high performance of PaLM comes predominantly from higher quality and diversity of pre-training data.
One-shot and Few-shot Average multi-shot in-context performance is shown in Figure 4 (again, omitting MultiRC and WIC), with detailed performances shown in Appendix A. Across the average of all metrics, we find that OPT models perform similarly to GPT-3 models. However, as with zero-shot, breaking down these results per task shows a different story: in the same set of 10 datasets as zero-shot, we see similar performance across the two models. Some of the remaining datasets show inconsistent performance with respect to model size for both OPT and GPT-3 models (BoolQ, CB, WSC, RTE). In MultiRC, we consistently see underperformance of OPT models compared to GPT-3 models. Similar to our zero-shot evaluation, we hypothesize our one- and few-shot evaluation setup may differ significantly from Brown et al. (2020).
Given that LLMs are known to be an integral component of modern dialogue models (Adiwardana et al., 2020; Roller et al., 2021; Thoppilan et al., 2022; Rae et al., 2021; Chowdhery et al., 2022), we additionally evaluate OPT-175B on several open source dialogue datasets. In particular, we follow Roller et al. (2021), and evaluate on ConvAI2 (Dinan et al., 2020b), Wizard of Wikipedia (Dinan et al., 2019b), Empathetic Dialogues (Rashkin et al., 2019), and Blended Skill Talk (Smith et al., 2020). We additionally evaluate on the more recent Wizard of Internet dataset (Komeili et al., 2021). We focus our comparisons primarily against existing open source dialogue models including the fine-tuned BlenderBot 1 (Roller et al., 2021) and its pre-training counterpart Reddit 2.7B. We also compare against the fine-tuned R2C2 BlenderBot, a 2.7B parameter BlenderBot-like model trained by Shuster et al. (2022).
We report Perplexity and Unigram F1 (UF1) overlap, following the metrics of the ConvAI2 competition (Dinan et al., 2020b). To control for different tokenization in each of the models, we normalize all perplexities to be in the space of the GPT-2 tokenizer (Radford et al., 2019). We also note which models are supervised with respect to these dialogue tasks and which are unsupervised. For OPT-175B, all generations are performed using greedy decoding up to a maximum of 32 tokens. We do not attempt to prompt the model at all except for alternating “Person 1:” and “Person 2:” lines of dialogue. The remaining models use the generation parameters found in BlenderBot 1.
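The generation setup described here (greedy decoding, at most 32 new tokens, alternating "Person 1:" / "Person 2:" lines) can be reproduced with the public OPT checkpoints; the sketch below substitutes a smaller released model for OPT-175B:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# A smaller released OPT checkpoint stands in for OPT-175B in this sketch.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")

prompt = "Person 1: Hi, do you have any hobbies?\nPerson 2:"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32, do_sample=False)  # greedy, <= 32 tokens
reply = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(reply)
```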
Results are shown in Table 2. We see that OPT-175B significantly outperforms the also-unsupervised Reddit 2.7B model on all tasks, and performs competitively with the fully supervised BlenderBot 1 model, especially on the ConvAI2 dataset. On the Wizard of Internet dataset, which is fully unsupervised for all models, we see that OPT-175B obtains the lowest perplexity but still has lower UF1 than the models with Wizard of Wikipedia supervision.
We were somewhat surprised that the evaluations of the unsupervised OPT-175B model were as competitive as BlenderBot 1 on the ConvAI2 dataset. This may indicate leakage of the ConvAI2 dataset into the general pre-training corpus or even into the validation data as evaluated in Table 2. To address concerns of leakage, we searched our pre-training corpus for the first conversation in the ConvAI2 dataset, but we did not find any overlap. We additionally evaluated OPT-175B on the ConvAI2 hidden test set, which has never been publicly released, and achieved 10.7 ppl and .185 UF1, matching the performance of the validation set. Furthermore, we evaluated OPT-175B on a subset of the ConvAI2-like MultiSessionChat (MSC) dataset (Xu et al., 2021b) and obtained a perplexity of 9.7 and UF1 of .177, indicating the model is generalizing well across multiple PersonaChat-like datasets. Since both MSC and WoI datasets were released after the CommonCrawl snapshot used in the pre-training corpus, there is minimal risk of leakage. We conclude that OPT-175B has a strong ability to maintain a consistent persona across conversations, a behavior also highlighted in LaMDA (Thoppilan et al., 2022).
To understand the potential harm of OPT-175B, we evaluate a series of benchmarks related to hate speech detection, stereotype awareness, and toxic content generation. While there may be shortcomings in these benchmarks (Blodgett et al., 2021; Jacobs and Wallach, 2021), these measurements provide a first step towards understanding the limitations of OPT-175B. We compare primarily against GPT-3 Davinci, as these benchmarks were not yet available to be included in Brown et al. (2020).
Using the ETHOS dataset provided in Mollas et al. (2020) and instrumented by Chiu and Alexander (2021), we measure the ability of OPT-175B to identify whether or not certain English statements are racist or sexist (or neither). In the zero-, one-, and few-shot binary cases, the model is presented with text and asked to consider whether the text is racist or sexist and provide a yes/no response. In the few-shot multiclass setting, the model is asked to provide a yes/no/neither response.
Table 2: Dialogue Evaluations. OPT-175B, in a fully unsupervised setting, performs competitively against fully supervised models.
Table 3: Hate speech detection. F1 scores of detecting hate speech between Davinci and OPT-175B. OPT-175B considerably outperforms Davinci in all settings.
Results are presented in Table 3. With all of our one-shot through few-shot configurations, OPT-175B performs considerably better than Davinci. We speculate this occurs from two sources: (1) evaluating via the Davinci API may be bringing in safety control mechanisms beyond the original 175B GPT-3 model used in Brown et al. (2020); and (2) the significant presence of unmoderated social media discussions in the pre-training dataset has provided additional inductive bias to aid in such classification tasks.
Developed for masked language models, CrowS-Pairs (Nangia et al., 2020) is a crowdsourced benchmark aiming to measure intrasentence level biases in 9 categories: gender, religion, race/color, sexual orientation, age, nationality, disability, physical appearance, and socioeconomic status. Each example consists of a pair of sentences representing a stereotype, or anti-stereotype, regarding a certain group, with the goal of measuring model preference towards stereotypical expressions. Higher scores indicate higher bias exhibited by a model.
Table 4: CrowS-Pairs evaluation. Lower is better for all categories, indicating more fairness. The OPT-175B model performs worse than Davinci in most categories.
When compared with Davinci in Table 4, OPT-175B appears to exhibit more stereotypical biases in almost all categories except for religion. Again, this is likely due to differences in training data; Nangia et al. (2020) showed that the Pushshift.io Reddit corpus has a higher incidence rate for stereotypes and discriminatory text than other corpora (e.g. Wikipedia). Given this is a primary data source for OPT-175B, the model may have learned more discriminatory associations, which directly impacts its performance on CrowS-Pairs.
Following Lieber et al. (2021) and Artetxe et al. (2021), we use StereoSet (Nadeem et al., 2021) to measure stereotypical bias across 4 categories: profession, gender, religion, and race. In addition to intrasentence measurement (similar to CrowS-Pairs), StereoSet includes measurement at the intersentence level to test a model’s ability to incorporate additional context. To account for a potential trade-off between bias detection and language modeling capability, StereoSet includes two metrics: Language Modeling Score (LMS) and Stereotype Score (SS), which are then combined to form the Idealized Context Association Test score (ICAT). Unlike Lieber et al. (2021), we normalize scores by token count, rather than character count, which they report improves metrics for several models.
Table 5: StereoSet Evaluations. Davinci and OPT-175B perform similarly across all evaluations.
Results are shown in Table 5. We see that Davinci and OPT-175B exhibit similar scores on aggregate (overall ICAT is very close between the two). In particular, Davinci outperforms in the areas of profession and race, while OPT-175B outperforms in the areas of gender and religion. OPT-175B performs better across the board on the SS metric, while Davinci generally outperforms on the LMS metric.
We evaluate the tendency of OPT-175B to respond with toxic language via the RealToxicityPrompts (Gehman et al., 2020) dataset. Following PaLM (Chowdhery et al., 2022), we sample 25 generations of 20 tokens using nucleus sampling (Holtzman et al., 2020) (p = 0.9) for each of 10,000 randomly sampled prompts from RTP, and report mean toxicity probabilities of the continuations, stratified across bucketed toxicities of the original prompts. For comparison, we report bucketed toxicity rates from Davinci and PaLM.
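The sampling protocol (25 continuations of 20 tokens with nucleus sampling at p = 0.9 per prompt) looks roughly like the following, again with a smaller public OPT checkpoint standing in for OPT-175B; toxicity scoring of the continuations (e.g., with a classifier such as the Perspective API) is a separate step not shown here:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")  # stand-in for OPT-175B
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")

def sample_continuations(prompt: str, n: int = 25, length: int = 20, p: float = 0.9):
    """Nucleus-sample `n` continuations of `length` tokens for one RTP prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, do_sample=True, top_p=p,
                             max_new_tokens=length, num_return_sequences=n)
    start = inputs["input_ids"].shape[1]
    return [tokenizer.decode(o[start:], skip_special_tokens=True) for o in outputs]

continuations = sample_continuations("The weather today is")
print(len(continuations), continuations[0])
```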
Results are shown in Figure 5. Overall, we see that OPT-175B has a higher toxicity rate than either PaLM or Davinci. We also observe that all 3 models have increased likelihood of generating toxic continuations as the toxicity of the prompt increases, which is consistent with the observations of Chowdhery et al. (2022). As with our experiments in hate speech detection, we suspect the inclusion of unmoderated social media texts in the pre-training corpus raises model familiarity with, and therefore propensity to generate and detect, toxic text. This strong awareness of toxic language may or may not be desirable depending on the specific requirements of downstream applications. Future applications of OPT-175B should consider this aspect of the model, and take additional mitigations, or avoid usage entirely as appropriate.
Figure 5: RealToxicityPrompts. OPT-175B is more likely to generate toxic responses than either Davinci or PaLM. Consistent with prior work, toxicity rates increase as prompt toxicity increases.
Finally, we compare OPT-175B on two Dialogue Safety evaluations. The first, SaferDialogues (Ung et al., 2021), measures the ability to recover from explicit safety failures, usually in the form of apologizing or recognizing its mistake. The second, the Safety Bench Unit Tests (Dinan et al., 2021), measures how unsafe a model’s response is, stratified across 4 levels of topic sensitivity: Safe, Realistic, Unsafe, and Adversarial. As with the other dialogue evaluations (Section 3.2), we compare to several existing open source dialogue models.
Results for both experiments are shown in Table 6. We observe that OPT-175B has similar performance as the Reddit 2.7B model across both SaferDialogues and the Unit Tests, with OPT-175B performing marginally better in the Safe and Adversarial settings. Consistent with Roller et al. (2021) and Xu et al. (2020), we find that the models finetuned on curated dialogue datasets (BlenderBot 1, R2C2) have overall lower toxicity. We conclude that future experimentation of OPT-175B for dialogue should contain explicit fine-tuning on curated datasets in order to improve the safety profile.
Table 6: Dialogue Responsible AI evaluations. OPT-175B is roughly on par with the Reddit 2.7B model, but performs worse in the Unsafe setting.
In Sections 3.1 and 4, we carried out extensive evaluation of all released models at varying scales. We saw parity in performance for standard evaluation datasets used in the GPT-3 models. Moreover, we performed safety, bias, and inclusion evaluations, again seeing largely comparable performance with some variations in toxicity and hate speech detection. However, such evaluations may not fully characterize the complete limitations of these models. In general, we qualitatively observe that OPT-175B suffers from the same limitations noted in other LLMs (Brown et al., 2020; Lieber et al., 2021; Thoppilan et al., 2022; Rae et al., 2021; Smith et al., 2022; Chowdhery et al., 2022; Bender et al., 2021).
In particular, we found OPT-175B does not work well with declarative instructions or point-blank interrogatives. Prompting with such instructions tends to produce a simulation of a dialogue beginning with such an instruction, rather than an execution of the instruction. Future work into instruction learning, in the vein of InstructGPT (Ouyang et al., 2022), may alleviate these limitations.
OPT-175B also tends to be repetitive and can easily get stuck in a loop. While sampling can reduce the incidence rate of repetitive behavior (Holtzman et al., 2020), we anecdotally found it did not eliminate it entirely when only one generation is sampled. Future work may wish to incorporate more modern strategies for reducing repetition and improving diversity, such as unlikelihood training (Welleck et al., 2020) or best-first decoding (Meister et al., 2020).
Similar to other LLMs, OPT-175B can produce factually incorrect statements (Adiwardana et al., 2020; Brown et al., 2020; Roller et al., 2021; Rae et al., 2021; Chowdhery et al., 2022; Thoppilan et al., 2022). This can be particularly harmful in applications where information accuracy is critical, such as healthcare and scientific discovery (Weidinger et al., 2021b). Recently, several efforts have reported that retrieval-augmented models can improve factual correctness of LLMs (Lewis et al., 2020; Komeili et al., 2021; Thoppilan et al., 2022; Borgeaud et al., 2021; Shuster et al., 2022; Nakano et al., 2021). We believe OPT-175B will also benefit from retrieval-augmentation in future iterations.
As shown in Section 4, we also find OPT-175B has a high propensity to generate toxic language and reinforce harmful stereotypes, even when provided with a relatively innocuous prompt (Gehman et al., 2020), and adversarial prompts are trivial to find (Dinan et al., 2021). There has been a great deal of work on mitigations for toxicity and biases (Dathathri et al., 2019; Dinan et al., 2019a; Sheng et al., 2019; Dinan et al., 2020a; Liu et al., 2019a; Krause et al., 2020; Xu et al., 2020; Liang et al., 2021; Dinan et al., 2021; Xu et al., 2021a; Dhamala et al., 2021; Schick et al., 2021; Ouyang et al., 2022). Depending on downstream applications, future uses of OPT-175B may need to employ these or novel mitigation approaches, especially before any real world deployment. Given our primary goal as a replication of GPT-3, we choose not to apply these mitigations in this first release.
In summary, we still believe this technology is premature for commercial deployment. Despite including data sheets and model cards, we believe more scrutiny should be afforded to the training data with additional data characterization and selection criteria in order to use data responsibly. The current practice is to feed the model with as much data as possible and minimal selection within these datasets. Despite having comprehensive evaluations, we would ideally have more streamlined and consistent evaluation setups to ensure replicability and reproducibility of evaluation scenarios. Differences in prompting styles and number of shots for in-context learning could create variations that lead to different results. We hope that the public release of the OPT models will enable many more researchers to work on these important issues.
Following the recommendations for individual researchers generated by the Partnership for AI,7 along with the governance guidance outlined by NIST,8 we are disclosing all of the details involved in training OPT-175B through our logbook,9 our code, and providing researchers access to model weights for OPT-175B, along with a suite of smaller baselines mirroring the setup for OPT-175B. We aim to be fully accountable for the development lifecycle of OPT-175B, and only through increasing transparency around LLM development can we start understanding the limitations and risks of LLMs before broader deployment occurs.
By sharing a detailed account of our day-to-day training process, we disclose not only how much compute was used to train the current version of OPT-175B, but also the human overhead required when underlying infrastructure or the training process itself becomes unstable at scale. These details are generally omitted from previous publications, likely due to the inability to fully ablate changes made mid-flight (without drastically increasing the compute budget). We hope that by revealing how certain ad-hoc design decisions were made, we can improve upon these practices in the future, and collectively increase the experimental robustness in developing models at this scale.
Outside of these notes, the metaseq codebase itself is the final source of truth in many of our implementation details. By releasing our development codebase, we aim to shed light on any implementation detail that may have been omitted from being explicitly enumerated in this paper, as it is either considered a detail of standard practice in the field, or is simply a detail we failed to account for. This current codebase is also the only known open-source implementation of training a decoder-only transformer that is ≥175B parameters without the use of pipeline parallelism on NVIDIA GPUs. To enable experimentation at 175B scale, we are providing researchers with direct access to the parameters of OPT-175B. The reasoning here is twofold: enable Responsible AI research into LLMs while simultaneously reducing the environmental impact of pursuing research at this scale. There is a growing body of work detailing ethical and social risks from deploying language models with emergent capabilities at scale (Weidinger et al., 2021a; Bommasani et al., 2021; Dinan et al., 2021; Kenton et al., 2021). By limiting access to OPT-175B to the research community with a non-commercial license, we aim to focus development efforts on quantifying the limitations of the LLMs first, before broader commercial deployment occurs.
7 https://partnershiponai.org/paper/responsible-publication-recommendations/
8 https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.1270.pdf
9 https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/OPT175B_Logbook.pdf
Furthermore, there exists significant compute and carbon cost to reproduce models of this size. While OPT-175B was developed with an estimated carbon emissions footprint (CO2eq) of 75 tons,10 GPT-3 was estimated to use 500 tons (Patterson et al., 2021), while Gopher required 380 tons (Rae et al., 2021). These estimates are not universally reported, and the accounting methodologies for these calculations are also not standardized. In addition, model training is only one component of the overall carbon footprint of AI systems; we must also consider experimentation and eventual downstream inference cost, all of which contribute to the growing energy footprint of creating large-scale models (Wu et al., 2022). By releasing our logbook, we hope to highlight the gap between a theoretical carbon cost estimate that assumes no hardware failures or training instabilities, versus one that aims to include the entire LLM development lifecycle. We need to understand the manufacturing (or embodied) carbon of these systems (Gupta et al., 2021) as they grow increasingly more complex, and we hope that our paper can help future work in defining additional factors to consider when measuring the impact of scale on the environment.
Similarly, by producing a set of baselines across a wide range of scales, we hope to enable the broader research community to study the impact and limitations of these models with respect to scale alone. As reported in Hoffmann et al. (2022), many of these LLMs may have been under-trained as a function of the amount of training data used, which implies that incorporating more data and continuing to train these baseline models may continue to improve performance. There is also evidence that step-function changes in capabilities may occur at a scale that is much smaller than 175B (Wei et al., 2021), indicating a need to examine a wider range of scales for different research applications.
10 With ablations, baselines and downtime, our own estimate of the total cost is roughly 2× higher.
Since the publication of the Transformer architecture (Vaswani et al., 2017) and BERT (Devlin et al., 2019), the field of NLP has experienced a massive shift towards the use of LLMs with self-supervised pre-training. Multiple masked language models, including T5 (Raffel et al., 2020) and Megatron-LM (Shoeybi et al., 2019), have shown consistent improvements through scale. These scaling gains come not only from growing the total number of parameters in the models, but also the amount and quality of pre-training data (Liu et al., 2019b; Hoffmann et al., 2022).
Auto-regressive language models (Mikolov et al., 2009) have seen the largest growth in model size, from 117M parameters (Radford et al., 2018) to over 500B parameters (Smith et al., 2022; Chowdhery et al., 2022). The resulting massive improvement in generative fluency and quality was first characterized in GPT-2 (Radford et al., 2019) and further improved with GPT-3 (Brown et al., 2020) and later models. Although a variety of very large (over 100B parameters) generative models have now been trained (Lieber et al., 2021; Rae et al., 2021; Thoppilan et al., 2022; Smith et al., 2022; Chowdhery et al., 2022), they are all closed source and accessible only internally or via paid API services. There are a few notable efforts towards open sourcing LLMs from non-profit research organizations including EleutherAI (Black et al., 2022) and BigScience.11 These models differ from the OPT models in pre-training data, target languages and model scale, making it possible for the community to compare different pre-training strategies.
Since Brown et al. (2020), the primary evaluation criterion for LLMs has been prompt-based (Black et al., 2022; Rae et al., 2021; Chowdhery et al., 2022), as is also performed in this paper. This is largely due to the convenience of evaluating on many tasks without specialized task-specific fine-tuning. Prompting itself has a long history: cloze evaluations go back several decades (Chambers and Jurafsky, 2008; Mostafazadeh et al., 2016). More recently, prompting or masked infilling has been used to probe models for knowledge (Petroni et al., 2019) or perform a variety of NLP tasks (Radford et al., 2019; Brown et al., 2020). There has also been work on eliciting prompting behavior in smaller models (Schick and Schütze, 2020; Gao et al., 2021b; Li and Liang, 2021; Lester et al., 2021; Scao and Rush, 2021), improving the flexibility of prompting (Shin et al., 2020), and understanding why and how prompting works (Liu et al., 2021; Min et al., 2022).
11 https://huggingface.co/bigscience/tr11-176B-ml-logs/tensorboard
Recent efforts have shown gains by fine-tuning models to directly respond to instruction-style prompting (Wei et al., 2021; Min et al., 2021; Sanh et al., 2021; Ouyang et al., 2022). However, effective prompt engineering remains an open research challenge. Results vary significantly and unpredictably with the selection of the prompt (Lu et al., 2021), and models do not seem to understand the prompts as fully as we expect (Webson and Pavlick, 2021). Furthermore, it is challenging to write prompts without a development set, which leads to questions about the extent to which we are actually achieving zero- or few-shot learning in practice (Perez et al., 2021). We do not attempt to address these concerns of prompting, and instead only aim to provide evaluation of OPT-175B in existing settings. However, we hope the full release of OPT-175B will enable others to better study these challenges in the future.
In this technical report, we introduced OPT, a collection of auto-regressive language models ranging in size from 125M to 175B parameters. Our goal was to replicate the performance and sizes of the GPT-3 class of models, while also applying the latest best practices in data curation and training efficiency. We described training details, evaluated performance in a number of NLP and dialogue settings, and characterized behaviors with respect to bias, toxicity and hate speech. We also described many other limitations the models have, and discussed a wide set of considerations for responsibly releasing the models. We believe the entire AI community would benefit from working together to develop guidelines for responsible LLMs, and we hope that broad access to these types of models will increase the diversity of voices defining the ethical considerations of such technologies.
Figure 6: Zero-shot NLP Evaluations. Full evaluations on all 16 NLP tasks, with comparisons where available. We find that across most tasks, GPT-3 models and OPT models perform similarly, but some tasks display highly erratic behavior.
Figure 7: Multi-shot NLP Evaluations. Full evaluations on all 16 NLP tasks, with comparisons to the GPT-3 reported performance. As with zero-shot, performance is roughly similar for most tasks, with some tasks demonstrating erratic behavior.
We follow the recommendations of Gebru et al. (2021) and provide a data card for the dataset used to train the OPT models.
Following Mitchell et al. (2018), we provide a model card for OPT-175B.
13 https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/MODEL_LICENSE.md
For all sample outputs, the initial prompt is given in bold and the remainder is the continuation. These example outputs were intentionally selected to highlight both successes and failures of the OPT-175B model.
Figure 8: Poetry generation. We have observed the model can write entertaining poetry on topics such as dodos, samosas, and performance reviews. However, we struggled to get the model to observe rhyme or meter.
Figure 9: Conversation generation. OPT-175B adopts a patriotic personality when prompted as the Statue of Liberty. However, the model also devolves into somewhat simple and linguistically repetitive generations further into the conversation.
Figure 10: Basic few-shot translation example. OPT was not intentionally trained to be multilingual, but we found anecdotally it has limited success with simple translations in German, Spanish, French, and Chinese.
Figure 11: Paper writing example. Prompting with “1. Introduction” generally yielded more interesting results compared to prompting with “Abstract.” Our prompt here was inspired by the first sentence of the seminal ResNet work (He et al., 2016).
Figure 12: Arithmetic. We observe mistakes when extending from addition to other operations.
Figure 13: Python programming. Simply switching out a variable name can alter the generated output.