Contents
Task 1: Safety
Task 2: Factual grounding
Task 3: Analysis of LaMDA's utility
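1. Introduction
Pre-training of language models is showing increasing promise in NLP. Because this approach uses unlabeled text, it can be combined with scaling of model and dataset sizes to achieve better performance or new capabilities. For example, large models such as GPT-3 have shown impressive few-shot learning ability. Similarly, dialog models have succeeded by exploiting the Transformer's ability to represent long-term dependencies in text.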
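2. Related work
Recent work on language models and dialog models has shown that performance can improve with model size. For dialog models in particular, many studies have focused on the problem of groundedness, covering a range of approaches that connect the model to external knowledge bases so that its outputs are grounded.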
3. LaMDA pre-training
LaMDA was pre-trained to predict the next token, using a dataset created from public dialog data and web documents. A large Transformer-based model is used at this stage, which contributes to the model's performance.
4. Metrics
4.1 Foundation metrics: quality, safety, groundedness
4.2 Role-specific metrics: helpfulness, role consistency
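5. Mathematical modeling and logical inference
Several of the metrics used to evaluate LaMDA are defined mathematically. For example, the quality metric is defined as follows.
\[\text{Quality} = \frac{\text{Sensibleness} + \text{Specificity} + \text{Interestingness}}{3}\]
In this expression, each component takes a value of 0 or 1, determined by labels assigned by crowdworkers. The definition and evaluation procedure for each metric are described in detail in the paper's appendices.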
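6 LaMDA fine-tuning
6.1 Discriminative and generative fine-tuning for quality (SSI) and safety
LaMDA needs a model that can both generate responses in varied contexts and evaluate the quality and safety of those responses. To this end, it is fine-tuned on a mix of generative and discriminative tasks.
Prior work
Existing language models tend either to focus on generative tasks or to perform discriminative tasks separately. This approach requires a model optimized for each task, which can waste resources.
LaMDA is designed so that a single model performs both generation and evaluation, which reportedly improves efficiency considerably.
Notation: $x_i$ is the input sentence, $c_i$ the context, $y_i$ the predicted label, and $\theta$ the model parameters.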
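6.2 Fine-tuning to learn to call an external information retrieval system
Language models should provide accurate, fact-based information, but they sometimes generate responses that are not connected to external data, which can lead to errors.
Prior work
Language models that do not consult external information can often generate incorrect information. Addressing this requires a model that consults external information dynamically.
LaMDA was fine-tuned to call an external information retrieval system in order to improve the accuracy of its responses, which is reported to be especially useful when fact-checking matters.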
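7 Results on foundation metrics
7.1 Datasets and methods
The datasets used in this study comprise 6.4K dialogs (61K turns), 8K dialogs (48K turns), and 4K dialogs (40K turns), labeled by crowdworkers for evaluating quality (SSI), safety, and groundedness, respectively. In addition, 1K dialogs (9K turns) received binary labels indicating whether generated queries or response modifications were executed correctly.
7.2 Experimental approach
The LaMDA model went through two main fine-tuning stages.
7.3 Analysis of results
Fine-tuning improved quality, safety, and groundedness at all model sizes. In some cases, the quality scores achieved through fine-tuning exceeded crowdworker quality. However, safety showed little improvement from increased model size without fine-tuning. Groundedness improved as model size increased, but giving the model access to external knowledge sources yielded larger gains.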
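8 Domain grounding
8.1 Domain-specific role setup
LaMDA performs domain grounding through pre-conditioning so that it can take on a role suited to a specific domain. For example, when set to the Mount Everest role, it opens the dialog with the greeting “Hi, I’m Mount Everest. What would you like to know about me?”
8.2 Implementation and evaluation
For each domain, LaMDA provides educational information or music recommendations, depending on the user's request. Each response is rated by multiple crowdworkers for role consistency and helpfulness. LaMDA shows particularly high helpfulness and role consistency in its role as a music recommendation agent.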
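9 Discussion and limitations
This study shows that the quality and safety of a dialog model can be improved with a relatively small amount of human-annotated fine-tuning data. However, collecting fine-tuning datasets is a complex process that is costly and time-consuming. Future work will include larger fine-tuning datasets, longer contexts, and more metrics that capture what is needed for safe, high-quality dialog.
LaMDA can still generate responses that do not accurately reflect the contents of external knowledge sources, which suggests that further research on complex inference is needed. In addition, although this study shows that fine-tuning can improve safety metrics on average, future work is needed on the long tail of the distribution of inappropriate responses that large language models such as LaMDA can generate.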
Language model pre-training is an increasingly promising research approach in NLP [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. As pre-training uses unlabeled text, it can be combined with scaling model and dataset sizes to achieve better performance or new capabilities [13]. For example, GPT-3 [12], a 175B parameter model trained on a large corpus of unlabeled text, shows an impressive ability in few-shot learning thanks to scaling.
Dialog models [14, 15, 16], one of the most interesting applications of large language models, successfully take advantage of Transformers’ ability to represent long-term dependencies in text [17, 18]. Similar to general language models [13], Adiwardana et al. [17] show that dialog models are also well suited to model scaling. There is a strong correlation between model size and dialog quality.
Inspired by these successes, we train LaMDA, a family of Transformer-based neural language models designed for dialog. These models’ sizes range from 2B to 137B parameters, and they are pre-trained on a dataset of 1.56T words from public dialog data and other public web documents (Section 3). LaMDA makes use of a single model to perform multiple tasks: it generates potential responses, which are then filtered for safety, grounded on an external knowledge source, and re-ranked to find the highest-quality response.
We study the benefits of model scaling with LaMDA on our three key metrics: quality, safety, and groundedness (Section 4). We observe that: (a) model scaling alone improves quality, but its improvements on safety and groundedness are far behind human performance, and (b) combining scaling and fine-tuning improves LaMDA significantly on all metrics, and although the model’s performance remains below human levels in safety and groundedness, the quality gap to measured crowdworker levels can be narrowed (labeled ‘Human’ in Figure 1).
The first metric, quality, is based on three components: sensibleness, specificity, and interestingness (Section 4). We collect annotated data that describes how sensible, specific, and interesting a response is for a multiturn context. We then use these annotations to fine-tune a discriminator to re-rank candidate responses.
The second metric, safety, is introduced to reduce the number of unsafe responses that the model generates. To achieve this, we define an illustrative set of safety objectives that attempt to capture the behavior that the model should exhibit in a dialog (Appendix A.1), and we use a demographically diverse set of crowdworkers to label responses in multiturn dialogs for these objectives (Appendix A.2, A.3). We then use these labels to fine-tune a discriminator to detect and remove unsafe responses (Section 6.1). Our work on safety for LaMDA can be understood as a process for AI value alignment, at a high level.
The third metric, groundedness, is introduced for the model to produce responses that are grounded in known sources wherever they contain verifiable external world information. Because neural language models such as LaMDA have the capacity to generalize rather than just memorize, they tend to generate responses that may seem plausible but actually contradict factual statements made in established sources. We use this metric to steer the model away from this tendency. While grounding in known sources does not guarantee factual accuracy, it allows users or external systems to judge the validity of a response based on the reliability of its source and its faithful reproduction. We find that augmenting model outputs with the ability to use external tools, such as an information retrieval system, is a promising approach to achieve this goal. Therefore, we collect data from a setting where crowdworkers can use external tools to research factual claims, and train the model to mimic their behavior.
Finally, we explore the use of LaMDA in the domains of education and content recommendations to investigate its potential and shortcomings. Similar to the concept of prompts in GPT-3 [12], we precondition LaMDA on a few turns of application-specific dialog to adapt LaMDA to the target applications. We perform experiments to compare the application-specific helpfulness (i.e., useful and correct responses) and role consistency (i.e., agent utterances match agent role) of pre-training-only and fine-tuned LaMDA models subject to application-specific preconditioning. We find that both types of models can adapt to their expected application roles fairly well, but fine-tuned LaMDA models are significantly more helpful.
Figure 1: Impact of model pre-training alone vs. with fine-tuning in LaMDA on dialog quality (left), and safety and factual grounding (right). The quality metric (SSI) corresponds to sensibleness, specificity, and interestingness. See Section 4 for more details on these metrics.
Language models and dialog models: Language models have attracted much attention recently thanks to their successes in NLP applications (e.g., [19, 20, 21, 2, 1, 22, 23, 5, 12, 24]). Our study of scaling laws with respect to model sizes is inspired by recent work on the scaling laws of neural language models [12, 13]. Similar to their findings, our results show that model scaling improves our quality (sensibleness, specificity, and interestingness), safety and groundedness metrics to some extent. However, fine-tuning combined with scaling significantly improves performance on all metrics.
Our work is also closely related to recent successes in applying language models to dialog modeling (e.g., [25, 26, 17, 18]), which built on earlier research in neural dialog modeling (e.g., [14, 15, 16, 27, 28]). One of our fine-tuning stages requires training on dialog-only data, which is related to Wolf et al. [29], Dinan et al. [25] and Zhang et al. [30]. Our use of fine-tuning on crowdworker-annotated data to improve interestingness is comparable to Roller et al. [18]. However, we aim to maximize the interestingness of the model’s output distinctly from its ability to engage the user in further interaction.
Our finding that pure scaling has a limited effect on key measures of open-domain dialog model performance echoes that of Shuster et al. [31], who also focus on the problem of groundedness. Recent studies on scaling have found that performance on question-answering tasks improves with model size [32, 33], similar to our findings on pre-trained LaMDA prior to fine-tuning.
Our approach to improving model groundedness is broadly consistent with a growing literature on augmenting neural language models with retrieval systems. Most of the existing literature focuses on the problem of open-domain question-answering rather than dialog generation, and the models themselves are used to index and rank knowledge sources, rather than trained to use an intermediate tool. Given these differences, we note that the range of existing approaches to this problem include the RNNLM [34], RAG [35], REALM [36], and FiD [37] architectures. Zhu et al. [38] provide a survey of further recent work. See Karpukhin et al. [39] for details on the ‘dense passage retriever’ used in RAG. Recent work in this direction has expanded and elaborated on neural models’ ability to retrieve and rank passages [40]. The RETRO architecture demonstrates that language models can be primed with results retrieved from a database as large as two trillion tokens [41]. At a broad level, our approach is also comparable to that of Byrne et al. [42], which fine-tunes the model to use external APIs for movie ticketing dialog.
Parts of our findings are similar to recent studies on dialog groundedness. Granting access to external knowledge bases has been shown to reduce the rate at which models hallucinate unsourced statements in dialog across a variety of retrieval systems and model architectures [31]. Another study finds that a question-answering system’s accuracy is improved by separating it into a reasoning unit and a response generator, analogous to our separation of ‘Base’ and ‘Research’ models in our study [43]. Meanwhile, the WebGPT framework includes a language system that can interact with the open web via a text-only interface, and learns to imitate humans in answering questions by citing external sources [44]. Komeili et al. [45] compare different types of pre-trained models and retrieval methods, and reach a similar conclusion that augmenting language models with a search engine provides more factually grounded responses. They encode the input context with grounded information from search to generate the next response, while we augment the generated responses with information from known sources in our method. This allows us to fine-tune the model for groundedness without sacrificing gains in safety or quality from other fine-tuning treatments.
Dialog metrics: Defining effective metrics for dialog models remains an open research topic. Our approach is inspired by Adiwardana et al. [17], who argued for human-like metrics, such as sensibleness and specificity. Many automated metrics for dialog models have been studied, including perplexity [16, 17], F1, Hits@1/N [25], USR [46], or BLEU/ROUGE [47, 15, 27]. However, such automated metrics may not correlate well with human judgment [48]. More reliable metrics for dialog modeling require human evaluation [49, 50, 18, 25, 17, 51], as used in this paper.
Earlier research attempted to combine multifaceted evaluations of dialog quality into a single headline metric [52]. We follow the pattern established in Adiwardana et al. [17] and Roller et al. [18] by considering the different components of our evaluations separately. In addition to sensibleness and specificity per Adiwardana et al. [17], we add new metrics: interestingness, safety, and groundedness. An advantage of using several different metrics is their debuggability: by exploring responses with low safety or groundedness scores, we have been able to develop targeted methods to improve them.
Safety and safety of dialog models: Inappropriate and unsafe risks and behaviors of language models have been extensively discussed and studied in previous works (e.g., [53, 54]). Issues encountered include toxicity (e.g., [55, 56, 57]), bias (e.g., [58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72]), and inappropriately revealing personally identifying information (PII) from training data [73]. Weidinger et al. [54] identify 21 risks associated with large-scale language models and discuss the points of origin for these risks. While many mitigation strategies have also been suggested (e.g., [74, 75, 76, 77, 78, 79, 80, 81, 82]), meaningfully addressing these issues remains an active research area.
Similar issues have also been discussed specifically for dialog models [53]. For instance, examples of bias, offensiveness, and hate speech have been found both in training data drawn from social media, and consequently in the output of dialog models trained on such data [83]. Dialog models [84] can learn, and even amplify, biases in the training data. Echoing Gehman et al. [85], we find fine-tuning effective to augment language models for safety. The method we use in this paper follows previous attempts to tackle these issues by training separate layers to detect unsafe output [17, 86, 18, 79]. Our strategy is similar to recent work that also uses fine-tuning [87]. While their safety guidelines were derived from human rights principles, they similarly find that increasing scale has no impact on toxicity metrics, while fine-tuning on safety evaluations does.
Groundedness metrics: Similar to other recent research into groundedness cited above, we assess groundedness by asking crowdworkers to judge whether the model’s output is in accordance with authoritative external sources. The recently-proposed Attributable to Identified Sources (AIS) framework [88] articulates a more precise approach to assess output of language models that pertains to the external world. It splits evaluation into two stages, where crowdworkers are asked: (1) if they can understand and identify the information shared in a dialog turn, and (2) if all of this information can be attributed to a source. Meanwhile, a recent study has reopened the question of automatic evaluation, with the $Q^2$ metric showing performance comparable to human annotation [89].
LaMDA was pre-trained to predict the next token in a text corpus. Unlike previous dialog models trained on dialog data alone [17, 18], we pre-trained LaMDA on a dataset created from public dialog data and other public web documents. Therefore, LaMDA can be used as a general language model prior to fine-tuning.
The pre-training dataset consists of 2.97B documents, 1.12B dialogs, and 13.39B dialog utterances, for a total of 1.56T words (Appendix E). Over 90% of the pre-training dataset is in the English language. We used the SentencePiece library [90] to tokenize the dataset into 2.81T byte pair encoding (BPE) tokens [91], with a vocabulary of 32K tokens. For comparison, the total number of words in the training set for Meena [17] was 40B words, which is nearly 40x smaller.
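As a quick arithmetic check, the token-to-word ratio and the size ratio relative to Meena's training set follow directly from the counts stated above:

\[\frac{2.81\,\text{T tokens}}{1.56\,\text{T words}} \approx 1.8\ \text{tokens per word}, \qquad \frac{1.56\,\text{T words}}{40\,\text{B words}} = 39 \approx 40.\]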
The largest LaMDA model has 137B non-embedding parameters, which is ~50x more parameters than Meena [17]. We use a decoder-only Transformer [92] language model as the model architecture for LaMDA. The Transformer has 64 layers, $d_{model} = 8192$, $d_{ff} = 65536$, $h = 128$, $d_k = d_v = 128$, relative attention as described in T5 [11], and gated-GELU activation as described in Raffel et al. [93].
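As a rough sanity check on the 137B figure, the non-embedding parameter count can be estimated from the hyperparameters above. The sketch below assumes that each layer holds four attention projections (query, key, value, output) and a gated feed-forward block with three weight matrices, and it ignores biases, layer norms, and relative-attention parameters. These accounting choices are assumptions, not details stated in the text.

```python
def non_embedding_params(layers, d_model, d_ff, heads, d_head):
    """Approximate non-embedding parameter count of a decoder-only Transformer."""
    # Q, K, V and output projections: each maps between d_model and heads * d_head.
    attention = 4 * d_model * heads * d_head
    # Gated feed-forward block (assumed): two input projections plus one output projection.
    feed_forward = 3 * d_model * d_ff
    return layers * (attention + feed_forward)


total = non_embedding_params(layers=64, d_model=8192, d_ff=65536, heads=128, d_head=128)
print(f"{total / 1e9:.1f}B parameters")  # prints "137.4B parameters"
```

Under these assumptions each layer contributes exactly 2^31 parameters, so 64 layers give about 137.4B, in line with the reported size.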
We pre-trained LaMDA on 1024 TPU-v3 chips for a total of about 57.7 days, and 256K tokens per batch. We used the Lingvo framework [94] for training and achieved 123 TFLOPS/sec with 56.5% FLOPS utilization with the 2D sharding algorithm, as described in GSPMD [95] (see Section 10 for carbon footprint estimates). We also trained smaller 2B-parameter and 8B-parameter models to measure the effects of model scaling on our metrics. Hyperparameter details for the models of different sizes can be found in Table 27, Appendix D.
Figure 2 gives an overview of the pre-training stage. We call the model before any fine-tuning “PT”, for PreTrained.
Figure 2: LaMDA pre-training as a language model.
PT uses the same sample-and-rank strategy as Meena [17] for decoding. We first sample 16 independent candidate responses using top-k (k = 40) sampling (no temperature). The final output is the highest-scoring candidate, where the score is based on the candidate’s log-likelihood and its length.
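The sample-and-rank procedure can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `toy_model` is a hypothetical stand-in for the language model, and the scoring rule (log-likelihood divided by length) is an assumption, since the text only says the score depends on the candidate's log-likelihood and its length.

```python
import math
import random


def top_k_sample(probs, k, rng):
    """Draw one token id from the k most probable tokens."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # rng.choices renormalizes the weights of the retained tokens.
    return rng.choices(top, weights=[probs[i] for i in top], k=1)[0]


def sample_and_rank(next_token_probs, eos_id, num_candidates=16, k=40, max_len=20, seed=0):
    """Sample candidates independently and return the best (score, tokens) pair."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(num_candidates):
        tokens, log_likelihood = [], 0.0
        while len(tokens) < max_len:
            probs = next_token_probs(tokens)
            token = top_k_sample(probs, k, rng)
            log_likelihood += math.log(probs[token])
            tokens.append(token)
            if token == eos_id:
                break
        # Assumed scoring rule: log-likelihood divided by candidate length.
        candidates.append((log_likelihood / len(tokens), tokens))
    return max(candidates, key=lambda c: c[0])


def toy_model(prefix):
    # Hypothetical 4-token vocabulary; token 0 plays the role of end-of-sequence.
    return [0.4, 0.3, 0.2, 0.1] if prefix else [0.1, 0.4, 0.3, 0.2]


score, tokens = sample_and_rank(toy_model, eos_id=0)
print(score, tokens)
```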
Evaluating generative models in general, and open-ended dialog models in particular, is difficult. See the Related Work section for a general review of recent work in this area. In this section, we describe the metrics that we use for evaluation.
Sensibleness, Specificity, Interestingness (SSI): Our overall quality score is an average of sensibleness, specificity, and interestingness (SSI).
Adiwardana et al. [17] propose the sensibleness and specificity average (SSA) metric to measure the quality of Meena. This metric is a simple average of two scores: sensibleness and specificity.
The first score, sensibleness, measures whether a model’s responses make sense in context and do not contradict anything that was said earlier. Humans tend to take this basic aspect of communication for granted, but generative models often struggle to meet this requirement. However, if sensibleness alone is used to evaluate models, we could inadvertently reward models for playing it safe by always producing short, generic, and boring responses. The GenericBot algorithm [17], which answers every question with “I don’t know” and every statement with “Ok,” scores 70% on sensibleness, which even surpasses some large dialog models [17].
The second score, specificity, is used to measure whether a response is specific to a given context. For example, if a user says “I love Eurovision” and the model responds “Me too,” then it would score 0 on specificity, since this response could be used in many different contexts. If it answers “Me too. I love Eurovision songs,” then it would score 1. Adiwardana et al. [17] report that Meena narrows the gap to average human performance in the SSA metric.
As the model’s performance increases, however, we find that sensibleness and specificity are not sufficient to measure the quality of a dialog model. For example, a response to “How do I throw a ball?” could be “You can throw a ball by first picking it up and then throwing it”, which makes sense and is specific to the question. An alternative deeper and more satisfying answer could be “One way to toss a ball is to hold it firmly in both hands and then swing your arm down and up again, extending your elbow and then releasing the ball upwards.”
We attempt to translate this intuition into the third score, an observable quality which we call “Interestingness”. Similar to sensibleness and specificity, interestingness is measured as a 0/1 label by crowdworkers. We ask crowdworkers to label a response as interesting if they judge that it is likely to “catch someone’s attention” or “arouse their curiosity”, or if it is unexpected, witty, or insightful. (For the complete instructions given to crowdworkers, see Appendix B).
Safety: A dialog model can achieve high quality (SSI) scores but can be unsafe for users. Therefore, we devise a new safety metric to measure unsafe model output. This metric follows objectives derived from Google’s AI Principles (https://ai.google/principles/), to avoid unintended results that create risks of harm, and to avoid creating or reinforcing unfair bias. These safety objectives are described in detail in Appendix A.1.
Groundedness: We aim to ensure that LaMDA produces responses that can be associated with known sources whenever possible, enabling cross-checking if desired, because the current generation of language models tends to produce plausible but incorrect statements.
We define groundedness as the percentage of responses containing claims about the external world that can be supported by authoritative external sources, as a share of all those containing claims about the external world.
We also define ‘Informativeness’ as the percentage of responses that carry information about the external world that can be supported by known sources as a share of all responses. Informativeness only differs from groundedness in the denominator term. So responses like “That’s a great idea” that do not carry any external world information do not affect groundedness, but they do affect Informativeness. However, “Rafael Nadal is the winner of Roland Garros 2020” is an example of a grounded response.
Finally, we define ‘Citation accuracy’ as the percentage of model responses that cite the URLs of their sources as a share of all responses with explicit claims about the external world, excluding claims with well-known facts (such as “horses have four legs”).
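The three ratios defined above differ mainly in their denominators, which a small sketch may make concrete. The `Response` fields and the sample data are hypothetical, and treating an empty denominator as 0.0 is an assumption made here for convenience.

```python
from dataclasses import dataclass


@dataclass
class Response:
    has_claim: bool           # makes a claim about the external world
    supported: bool           # the claim can be supported by a known source
    cites_url: bool           # the response cites the URL of its source
    well_known: bool = False  # the claim is a well-known fact


def ratio(numerator, denominator):
    # Assumption: an empty denominator yields 0.0 rather than an error.
    return numerator / denominator if denominator else 0.0


def grounding_metrics(responses):
    with_claims = [r for r in responses if r.has_claim]
    supported = [r for r in with_claims if r.supported]
    needs_citation = [r for r in with_claims if not r.well_known]
    cited = [r for r in needs_citation if r.cites_url]
    return {
        "groundedness": ratio(len(supported), len(with_claims)),
        "informativeness": ratio(len(supported), len(responses)),
        "citation_accuracy": ratio(len(cited), len(needs_citation)),
    }


sample = [
    Response(has_claim=False, supported=False, cites_url=False),                  # "That's a great idea"
    Response(has_claim=True, supported=True, cites_url=True),                     # sourced claim with URL
    Response(has_claim=True, supported=False, cites_url=False),                   # unsupported claim
    Response(has_claim=True, supported=True, cites_url=False, well_known=True),   # "horses have four legs"
]
metrics = grounding_metrics(sample)
print(metrics)
```

In this sample the response without external-world content lowers informativeness (2 of 4) but leaves groundedness (2 of 3) untouched, matching the distinction drawn in the definitions.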
The foundation metrics (quality, safety, and groundedness) measure attributes that we find important for dialog agents in general. However, they are not dependent on any application-specific role that an agent may be designed for (e.g., teaching information about animals). We measure Helpfulness and Role consistency in dialog applications, where agents have specific roles.
Helpfulness: The model’s responses are marked helpful if they contain correct information based on the user’s independent research with an information retrieval system, and the user considers them helpful. Helpful responses are a subset of informative ones, which are judged by the user to be both correct and useful.
Role consistency: The model’s responses are marked role consistent if they look like something an agent performing the target role would say. This is distinct from consistency with previous responses that the agent made in the dialog, and self-consistency within a dialog is measured by the sensibleness metric instead. Role consistency refers to consistency with the definition of the agent’s role external to the conversation.
These role-specific metrics are discussed further in Section 8.
Quality (Sensibleness, Specificity, Interestingness): To improve quality (SSI), we collect 6400 dialogs with 121K turns by asking crowdworkers to interact with a LaMDA instance about any topic. These dialogs are required to last 14 to 30 turns. For each response, we ask other crowdworkers to rate whether the response given the context is sensible, specific, and/or interesting, and to mark each with ‘yes’, ‘no’, or ‘maybe’ labels. If a response is not sensible (the crowdworker did not mark it with ‘yes’), then we do not collect the labels for specificity and interestingness, and consider them to be ‘no’. Furthermore, if a response is not specific (the crowdworker did not mark it with ‘yes’), then we do not collect the label for interestingness, and consider it to be ‘no’. This ensures that responses are not rated positively for specificity if they are not sensible, and similarly, that responses are not rated positively for interestingness if they are not specific. Every response is labeled by 5 different crowdworkers and the response is considered sensible, specific or interesting if at least 3 out of 5 crowdworkers mark it ‘yes’.
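The cascading label rule and the 3-out-of-5 vote can be sketched as below. The dictionary layout for a crowdworker's rating is a hypothetical representation chosen for illustration; missing keys stand for labels that were not collected.

```python
def aggregate_ssi(ratings, min_votes=3):
    """Combine per-crowdworker 'yes'/'no'/'maybe' ratings into 0/1 SSI labels."""
    votes = {"sensible": 0, "specific": 0, "interesting": 0}
    for rating in ratings:
        sensible = rating.get("sensible") == "yes"
        # Labels that were not collected are treated as 'no'.
        specific = sensible and rating.get("specific") == "yes"
        interesting = specific and rating.get("interesting") == "yes"
        votes["sensible"] += sensible
        votes["specific"] += specific
        votes["interesting"] += interesting
    return {name: int(count >= min_votes) for name, count in votes.items()}


ratings = [
    {"sensible": "yes", "specific": "yes", "interesting": "yes"},
    {"sensible": "yes", "specific": "yes", "interesting": "no"},
    {"sensible": "yes", "specific": "maybe"},
    {"sensible": "maybe"},
    {"sensible": "yes", "specific": "yes", "interesting": "yes"},
]
labels = aggregate_ssi(ratings)
print(labels)  # {'sensible': 1, 'specific': 1, 'interesting': 0}
```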
We evaluate the models based on the model’s generated responses to the Mini-Turing Benchmark (MTB) dataset [17], which consists of 1477 dialogs with up to 3 dialog turns. The MTB includes 315 single-turn dialogs, 500 2-turn dialogs, and 662 3-turn dialogs. These dialogs are fed to the model to generate the next response. Similar to above, every response is labeled sensible, specific or interesting if at least 3 out of 5 crowdworkers mark it ‘yes’.
Safety: For safety fine-tuning, we employ a structured approach that begins with defining the safety objectives (Appendix A.1). These objectives are used to annotate candidate responses generated by a LaMDA instance in response to human-generated prompts (Appendix A.2), using a demographically diverse set of crowdworkers (Appendix A.3).
Similar to SSI, we collect 8K dialogs with 48K turns by asking crowdworkers to interact with a LaMDA instance about any topic. These dialogs are required to last 5 to 10 turns. We instruct crowdworkers to interact with the model in three different ways: (a) interactions of natural form, (b) interactions that touch sensitive topics, and (c) interactions that adversarially attempt to break the model as per the safety objectives. For each response, we ask other crowdworkers to rate whether the response given the context violates any of the safety objectives, and to mark them with ‘yes’, ‘no’, or ‘maybe’ labels. Every response is assigned a safety score of 1 if at least 2 out of 3 crowdworkers mark the response with ‘no’ for each individual safety objective. Otherwise, it is assigned a score of 0.
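The safety scoring rule above can be expressed compactly. In this sketch the objective names (`harm`, `bias`) are hypothetical placeholders, not the objectives listed in Appendix A.1.

```python
def safety_score(ratings, min_no_votes=2):
    """Return 1 if, for every objective, enough crowdworkers answered 'no' (no violation)."""
    objectives = ratings[0].keys()
    for objective in objectives:
        no_votes = sum(rating[objective] == "no" for rating in ratings)
        if no_votes < min_no_votes:
            return 0
    return 1


# One dict per crowdworker, mapping a (hypothetical) objective name to a label.
safe_ratings = [
    {"harm": "no", "bias": "no"},
    {"harm": "no", "bias": "maybe"},
    {"harm": "no", "bias": "no"},
]
unsafe_ratings = [
    {"harm": "yes", "bias": "no"},
    {"harm": "maybe", "bias": "no"},
    {"harm": "no", "bias": "no"},
]
print(safety_score(safe_ratings), safety_score(unsafe_ratings))  # 1 0
```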
We evaluate safety using an evaluation dataset that is a holdout sample of the adversarially collected dataset described above. This dataset consists of 1166 dialogs with 1458 turns. These dialogs are input to the model to generate the next response. Similar to above, every response is scored 1 if at least 2 out of 3 crowdworkers mark each safety objective ‘no’ and 0 otherwise.
Groundedness: Similar to SSI and safety, we collect 4K dialogs with 40K turns by asking crowdworkers to interact with the model. This time, we request that they try to steer the conversation towards information-seeking interactions.
We ask crowdworkers to rate each of the model’s dialog turns, evaluating whether the information in the turn makes any claims about the external world. We exclude claims about publicly unrecognized people, as the model can make factual claims on behalf of an improvised persona. Such claims do not require grounding on external sources (e.g., “I baked three cakes last week”), unlike claims about historical people (e.g., “Julius Caesar was born in 100 BC”).
We also ask crowdworkers whether they know the claims to be true. If 3 different crowdworkers all know a claim to be true, then we assume it to be common knowledge and do not check external knowledge sources before making this claim.
For utterances containing claims that need to be checked, we ask crowdworkers to record the search queries that they would use to investigate them. Finally, we ask crowdworkers to edit the model’s response to incorporate brief search results from an external knowledge-retrieval system. If the search results include any content from the open web, we ask crowdworkers to include URLs that appropriately cite the sources of the knowledge used in the final response.
We evaluate groundedness using an evaluation dataset with 784 turns of dialogs from Dinan et al. [96] that encompass a variety of topics. These contexts are fed to the model to generate the next response. For each response, we ask crowdworkers to rate whether the model’s response contains any factual claims, and if so, to rate whether these factual claims can be verified by checking a known source. Every response is labeled by 3 different crowdworkers. The final groundedness, informativeness, and citation accuracy labels of a given response are determined by majority voting. All of the fine-tuning and evaluation datasets are in English.
Estimating these metrics for human-generated responses: We ask crowdworkers to respond to randomly selected samples of the evaluation datasets (labeled as ‘Human’ in Figures 1, 4 and 5). The crowdworkers are explicitly informed to reply in a safe, sensible, specific, interesting, grounded, and informative manner. They are also explicitly asked to use any external tools necessary to generate these responses (e.g., including an information retrieval system). The context-response pairs are then sent for evaluation, and a consensus label is formed by majority voting, just as for model generated responses.
We create LaMDA using several fine-tunings applied to the pre-trained model (PT). These include a mix of generative tasks that generate response given contexts, and discriminative tasks that evaluate quality and safety of a response in context. This results in a single model that can function as both a generator and a discriminator.
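Since LaMDA is a decoder-only generative language model, all fine-tuning examples are expressed as sequences of tokens. Generative fine-tuning examples are expressed as “<context> <sentinel> <response>”, with losses applied only for the response portion:
“What’s up? RESPONSE not much.”
Discriminative fine-tuning examples are expressed as “<context> <sentinel> <response> <attribute-name> <rating>”, with losses applied for the rating following the attribute name only: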
“What’s up? RESPONSE not much. SENSIBLE 1”
“What’s up? RESPONSE not much. INTERESTING 0”
“What’s up? RESPONSE not much. UNSAFE 0”
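Judging from the examples above, a fine-tuning example is the context, a sentinel token, and the response, optionally followed by an attribute name and a rating. The helpers below are a minimal sketch of that string layout; the function names are hypothetical, and single-space joining is an assumption based on how the examples are printed.

```python
SENTINEL = "RESPONSE"  # sentinel token as it appears in the examples above


def generative_example(context, response):
    """Context, sentinel, response."""
    return f"{context} {SENTINEL} {response}"


def discriminative_example(context, response, attribute, rating):
    """Generative example followed by an attribute name and its rating."""
    return f"{generative_example(context, response)} {attribute} {rating}"


print(generative_example("What's up?", "not much."))
print(discriminative_example("What's up?", "not much.", "SENSIBLE", 1))
```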
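Using one model for both generation and discrimination enables an efficient combined generate-and-discriminate procedure. After generating a response given a context, evaluating a discriminator involves computing P(“<desired-rating>” | “<context> <sentinel> <response> <attribute-name>”). Since the model has already processed “<context> <sentinel> <response>”, evaluating the discriminator simply involves processing a few additional tokens: “<attribute-name> <desired-rating>”.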
First, we fine-tune LaMDA to predict the SSI and safety ratings of the generated candidate responses. Then, we filter out candidate responses for which the model’s safety prediction falls below a threshold during generation. Candidate responses that remain after filtering for safety are then ranked for quality. During ranking, sensibleness is given a weight three times higher than specificity and interestingness, as this was found to work well for all metrics (i.e., 3 * P(sensible) + P(specific) + P(interesting)). The top ranked candidate is selected as the next response.
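The filter-then-rank step can be sketched as follows. The candidate dictionaries and the threshold value of 0.5 are assumptions for illustration; the text does not state the threshold that was used.

```python
def select_response(candidates, safety_threshold=0.5):
    """Drop candidates below the safety threshold, then pick the top-ranked one."""
    safe = [c for c in candidates if c["safe"] >= safety_threshold]
    if not safe:
        return None
    # Sensibleness is weighted three times as heavily as the other two scores.
    return max(safe, key=lambda c: 3 * c["sensible"] + c["specific"] + c["interesting"])


candidates = [
    {"text": "A", "safe": 0.2, "sensible": 0.9, "specific": 0.9, "interesting": 0.9},
    {"text": "B", "safe": 0.9, "sensible": 0.9, "specific": 0.2, "interesting": 0.1},
    {"text": "C", "safe": 0.95, "sensible": 0.6, "specific": 0.6, "interesting": 0.5},
]
best = select_response(candidates)
print(best["text"])  # B
```

Candidate A is removed by the safety filter despite its high quality scores, and B outranks C because of the extra weight on sensibleness, even though C has the higher unweighted sum.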
LaMDA SSI and safety discriminators are also used to score and filter 2.5M turns of dialog data sampled from the pre-training dataset (Section 3), resulting in 800K turns of safe, sensible, specific and interesting dialogs. We then fine-tune the LaMDA model over this dataset to generate the response in a given context.
We see significant gains in safety and quality for LaMDA using this technique (Figure 5).
Language models such as LaMDA tend to generate outputs that seem plausible, but contradict facts established by known external sources. For example, given a prompt such as the opening sentences of a news article, a large language model will continue them with confident statements in a brisk journalistic style. However, such content is merely imitating what one might expect to find in a news article without any connection to trustworthy external references.
One possible solution to this problem could be to increase the size of the model, based on the assumption that the model can effectively memorize more of the training data. However, some facts change over time, like the answers to ‘How old is Rafael Nadal?’ or ‘What time is it in California?’. Lazaridou et al. (2021) call this the temporal generalization problem [97]. Recent work proposed using a dynamic or incremental training architecture to mitigate this issue (e.g., [97, 98]). It may be difficult to obtain sufficient training data and model capacity to achieve this, as a user may be interested in conversing about anything within the corpus of human knowledge.
We present our approach to fine-tuning by learning to consult a set of external knowledge resources and tools.
The toolset (TS): We create a toolset (TS) that includes an information retrieval system, a calculator, and a translator. TS takes a single string as input and outputs a list of one or more strings. Each tool in TS expects a string and returns a list of strings. For example, the calculator takes “135+7721”, and outputs a list containing [“7856”]. Similarly, the translator can take “hello in French” and output [“Bonjour”]. Finally, the information retrieval system can take “How old is Rafael Nadal?”, and output [“Rafael Nadal / Age / 35”]. The information retrieval system is also capable of returning snippets of content from the open web, with their corresponding URLs. The TS tries an input string on all of its tools, and produces a final output list of strings by concatenating the output lists from every tool in the following order: calculator, translator, and information retrieval system. A tool will return an empty list of results if it can’t parse the input (e.g., the calculator cannot parse “How old is Rafael Nadal?”), and therefore does not contribute to the final output list.
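The toolset interface described above (string in, list of strings out, results concatenated in a fixed tool order) can be sketched with stub tools. All three tools below are hypothetical stand-ins: the calculator only handles two-operand integer expressions, and the translator and retrieval system are lookup tables seeded with the examples from the text.

```python
import re


def calculator(query):
    """Evaluate 'a+b', 'a-b' or 'a*b' on integers; return [] if the input cannot be parsed."""
    match = re.fullmatch(r"\s*(\d+)\s*([+\-*])\s*(\d+)\s*", query)
    if not match:
        return []
    a, op, b = int(match[1]), match[2], int(match[3])
    return [str({"+": a + b, "-": a - b, "*": a * b}[op])]


def translator(query):
    # Stub: a real translator would call a translation system.
    table = {"hello in French": "Bonjour"}
    return [table[query]] if query in table else []


def retrieval(query):
    # Stub: a real retrieval system would also return web snippets with URLs.
    table = {"How old is Rafael Nadal?": "Rafael Nadal / Age / 35"}
    return [table[query]] if query in table else []


TOOLS = [calculator, translator, retrieval]  # order of concatenation


def toolset(query):
    """Try the query on every tool and concatenate the result lists."""
    results = []
    for tool in TOOLS:
        results.extend(tool(query))
    return results


print(toolset("135+7721"))                  # ['7856']
print(toolset("hello in French"))           # ['Bonjour']
print(toolset("How old is Rafael Nadal?"))  # ['Rafael Nadal / Age / 35']
```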
Dialog collection: We collect 40K annotated dialog turns (generative data). We also collect 9K dialog turns, in which LaMDA’s generated candidates are labeled ‘correct’ or ‘incorrect’, to be used as input data for the ranking task (discriminative data).
We collect a set of human-human dialogs between crowdworkers, focused on information-seeking interactions, and evaluate whether their statements can be supported by known authoritative sources. As seen in Figure 4, it is notable that they make well-supported claims at a higher rate if they have access to TS. When asked for Rafael Nadal’s age, a human expert may not know the answer immediately, but can easily query an information retrieval system to obtain it. Therefore, we decided to fine-tune our language model to provide attributions for its responses by looking up its claims using a toolset.
To collect training data for the fine-tuning used in the algorithm, we use both static and interactive methods again. The key difference from the other sub-tasks is that the crowdworkers are not reacting to the model’s output, but rather intervening to correct it in a way that LaMDA can learn to imitate. In the interactive case, a crowdworker carries out a dialog with LaMDA, whereas in the static case, they read over records of earlier dialogs, turn by turn. The crowdworker decides whether each statement contains any claims that might require reference to an external knowledge source. If so, they are asked whether the claims are about anything other than the persona improvised by LaMDA, and then whether they go beyond simple matters of common sense. If the answer to any of these questions is ‘no’, the model’s output is marked ‘good’, and the dialog moves on. Otherwise, the crowdworker is asked to research the claims using the toolset, via a text-in and text-out interface.
The interface to the set of tools used here is identical to the service used by the algorithm at inference time. Given a general text query, the information retrieval system returns a set of brief, text-only snippets in rank order. Snippets of open-web content include URLs for their source; answers provided directly by the information retrieval system (e.g., the current time) or by the calculator tool do not. When the user has finished running queries, they have the opportunity to rewrite the model’s statement to include well-sourced claims. If they used open-web content, we ask them to cite the URLs needed to support any responses which contain information pertaining to the external world. URLs can be appended to the end of the message, or if the context warrants it, they can be attached inline to particular words in the response using Markdown format.
Fine-tuning: We then fine-tune LaMDA to perform two tasks.
The first task takes the multiturn dialog context to date and the response generated by the base model. It then generates a special string (“TS” for toolset) indicating the following text is a query (e.g., “How old is Rafael Nadal?”) that should be sent to the toolset: context + base → “TS, Rafael Nadal’s age”.
The second task takes the snippet returned by a tool, and a dialog statement (e.g., “He is 31 years old right now” + “Rafael Nadal / Age / 35”). It then predicts the grounded version: context + base + query + snippet → “User, He is 35 years old right now”. Alternatively, this task can also output an additional research query. For example, context + base + query + snippet → “TS, Rafael Nadal’s favorite song”. At inference time, the model’s output is directed to the information retrieval system or to the user, depending on whether the first generated string is ‘TS’ or ‘User’.
For instance, a dialog might include the following statement from the user:
USER: What do you think of Rosalie Gascoigne’s sculptures?
(This transcript was taken from a dialog with one of the authors, and includes the generated base output and search queries, which are not usually shown to the user.)
The basic LaMDA language model, which we refer to as the ‘Base’ model here, generates a draft response:
LAMDA-BASE: They’re great, and I love how her work changed through her life. I like her later work more than her earlier ones. Her influence is also super interesting - did you know she was one of the artists that inspired Miró?
Note that the last sentence seems plausible at first glance, but if a lover of art history opened their reference books to find out more, they would be disappointed. Miró was active from 1918 to the late 1960s, and Gascoigne’s first solo exhibitions were in the early 1970s.
At this point, the ‘Research’ phase of the model outputs some queries to TS to assess the ‘Base’ output’s claims. Recall that the ‘Research’ phase is one specialized task from a set that belongs to a single multi-tasking model (e.g., ‘Base’ dialog response generation task, safety, and quality tasks). Query generation is entirely based on the model fine-tuning, and there is no heuristic component. The model’s output determines the number of queries (the research loop continues until the model generates output addressed to the user), but we impose a maximum at inference time to eliminate the possibility of an infinite loop. The maximum is a parameter of the serving model. In the current example, it is set to four.
For example, the ‘Research’ phase may issue the following query:
LAMDA-RESEARCH: TS, Miró and Gascoigne
TS: Gascoigne , Rosalie | Suddenly the Lake - NGA’s Collection Search |
The course of Gascoigne’s artistic life is an inspiration for those who are only able to devote themselves to their chosen calling late in life. She first exhibited her art . . .
https://artsearch.nga.gov.au/detail.cfm?irn=8774&pictaus=true
The punctuation and ellipsis in the search snippet are exactly as the model sees them. The ‘Research’ phase issues another query:
LAMDA-RESEARCH: TS, Miró and Gascoigne
TS:
(Gascoigne had been a practitioner of wild, avant-garde ikebana – Japanese flower arrangement – before turning late in life to sculpture.) The two-way alternation of visual flow creates a pleasing tension, like an electric charge. There are other tensions too, of class and gender.
https://artsearch.nga.gov.au/detail.cfm?irn=8774&pictaus=true
By repeating the same query, the model receives the second-ranked result snippet, which includes more details from the same source in this case.
With this context, the ‘Research’ phase elects to generate output to the user. The dialog tone is similar to the ‘Base’ version of the response, but the model replaces the unverified claim (since no evidence for it can be seen in the TS results) with a different statement based on a source found with TS. The final output is as follows:
LAMDA: Oh wow, her life course is so inspiring. Did you know she was a practitioner of Japanese flower arrangement before turning to sculpture?
https://artsearch.nga.gov.au/detail.cfm?irn=8774&pictaus=true
For another example, this process is summarized in Figure 3.
Figure 3: How LaMDA handles groundedness through interactions with an external information retrieval system. Blue: Model. Yellow: Input to model. Red: Output of model. Green: Output of information retrieval system tool. As discussed in the main text, the LaMDA-Base model is called first, followed by sequential calls to the LaMDA-Research model. The choice between querying the information retrieval system or responding to the user is determined by the first word output by LaMDA-Research, which identifies the next recipient.
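The control flow of the ‘Base’ call followed by the bounded ‘Research’ loop can be sketched as below. This is an illustrative sketch under stated assumptions, not the serving code: `respond`, the callable arguments, and the plain-text transcript format are hypothetical names and conventions, and the fallback to the draft when the query budget runs out is an assumption, since the text only says that a maximum is imposed.

```python
from typing import Callable, List


def respond(
    context: str,
    base_model: Callable[[str], str],      # hypothetical: context -> draft response
    research_model: Callable[[str], str],  # hypothetical: transcript -> "TS, ..." or "User, ..."
    toolset: Callable[[str], List[str]],   # string in, list of strings out
    max_queries: int = 4,                  # the example in the text uses four
) -> str:
    """Draft with the base model, then run research steps until the user is addressed."""
    draft = base_model(context)
    transcript = f"{context}\nLAMDA-BASE: {draft}"

    for _step in range(max_queries):
        output = research_model(transcript)
        # The first generated string names the recipient: 'TS' or 'User'.
        recipient, _sep, message = output.partition(",")
        message = message.strip()
        if recipient.strip() == "User":
            return message
        # Otherwise the message is a query; append the tool results and loop again.
        snippets = toolset(message)
        transcript += f"\nLAMDA-RESEARCH: {output}\nTS: {' | '.join(snippets)}"

    # Budget exhausted. Returning the draft here is an assumption of this sketch.
    return draft
```

Keeping the loop bound as a parameter mirrors the statement that the maximum is a parameter of the serving model rather than something the model itself learns.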
We first summarize the datasets and methods used, and then discuss the main results.
Table 1 presents a summary of the crowdworker-annotated data that we use to improve the foundation metrics in this paper.
Leveraging these datasets, we perform two levels of fine-tuning, as discussed in Section 6:
FT quality-safety: fine-tune the pre-trained model (PT) to train discriminators that predict quality and safety labels. The generated candidate responses are filtered at inference time by their safety scores, and re-ranked by a weighted sum of the three quality score types. PT is also fine-tuned to generate in-context responses from a clean sample of pre-training dialog data filtered using LaMDA discriminators. See Section 6.1 for more details.
FT groundedness (LaMDA): fine-tune FT quality-safety to generate calls to an external information retrieval system to provide attributed responses. The model is also fine-tuned to jointly predict the quality and the type (i.e., calling a certain tool or replying to the user) of the next action. See Section 6.2 for more details.
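The inference-time selection step in FT quality-safety (filter by safety score, then re-rank by a weighted sum of the three quality scores) might look like the sketch below. The `Candidate` type, the threshold value, and the default weights are hypothetical placeholders chosen for illustration; this excerpt does not state the actual threshold or weights.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Candidate:
    text: str
    safety: float           # discriminator score, assumed to lie in [0, 1]
    sensibleness: float
    specificity: float
    interestingness: float


def select_response(
    candidates: List[Candidate],
    safety_threshold: float = 0.5,                          # hypothetical value
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),  # hypothetical weights
) -> Optional[Candidate]:
    """Drop candidates below the safety threshold, then pick the best by quality."""
    safe = [c for c in candidates if c.safety >= safety_threshold]
    if not safe:
        return None  # no candidate passed the safety filter

    w_sens, w_spec, w_int = weights

    def quality(c: Candidate) -> float:
        # Weighted sum of the three quality scores.
        return (w_sens * c.sensibleness
                + w_spec * c.specificity
                + w_int * c.interestingness)

    return max(safe, key=quality)
```

Filtering before ranking means an unsafe candidate can never win on quality alone, which matches the order of operations given in the text.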
Table 1: Summary of the datasets to improve safety, groundedness, and quality.
Metric | Dataset | Evaluation |
---|---|---|
Quality | 6.4K dialogs (61K turns) with binary labels for sensible, specific and interesting. | Crowdworkers label the response, given the context, for sensibleness, specificity and interestingness, on a common benchmark dataset of 1477 dialog turns from Adiwardana et al. [17] (Static Evaluation). |
Safety | 8K dialogs (48K turns) with binary labels for each of the safety objectives. | Crowdworkers label the response, given the context, using the safety objectives for 1458 turns of dialog that cover provocative user turns (Appendix A.2). |
Groundedness | 4K dialogs (40K turns) in which crowdworkers write queries to an information retrieval system and modify model responses. Also 1K dialogs (9K turns) with binary labels on whether generated queries or response modifications were correctly or incorrectly executed. | Crowdworkers evaluate 784 responses given contexts for informativeness and groundedness. |
We define LaMDA to be the model that incorporates all of the fine-tunings described above. We present their results in Figure 4, and compare them to pre-training alone.
The figure shows that fine-tuning (in particular LaMDA) produces a significant improvement in quality, safety and groundedness across all model sizes. Moreover, quality metrics (sensibleness, specificity, and interestingness) generally improve with model size with or without fine-tuning, but they are consistently better with fine-tuning.
Safety does not seem to benefit much from model scaling without fine-tuning. We expect this as the pre-training alone only optimizes perplexity of the next token, and these tokens follow the distributions of the original corpus, which contains both safe and unsafe examples. However, scaling along with safety fine-tuning significantly improves safety.
Table 11 in Appendix C.1 and Table 12 in Appendix C.2 show example dialogs with the effects of safety-fine-tuning.
Groundedness improves as model size increases, perhaps because larger models have a greater capacity to memorize uncommon knowledge. Fine-tuning, however, allows the model to access external knowledge sources. This effectively allows the model to shift some of the load of remembering knowledge to an external knowledge source and achieves 73.2% Groundedness and 65% Citation Accuracy. In other words, 73.2% of the responses containing statements about the external world were attributable to known sources, and 65% of the responses included citations (i.e., URLs to sources) when required. Appendix C.3 shows example dialogs with the effects of the groundedness fine-tuning.
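To make the two percentages concrete, the sketch below computes both rates from per-response labels, following the wording above. The `RatedResponse` fields are hypothetical names, and using the set of responses with external-world claims as the denominator for both metrics is this sketch's reading of "when required".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RatedResponse:
    makes_external_claim: bool  # contains statements about the external world
    supported: bool             # those statements are attributable to known sources
    cites_sources: bool         # includes URLs for its sources


def _rate(flags: List[bool]) -> float:
    """Fraction of True values; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


def groundedness(responses: List[RatedResponse]) -> float:
    """Share of claim-bearing responses whose claims are supported."""
    return _rate([r.supported for r in responses if r.makes_external_claim])


def citation_accuracy(responses: List[RatedResponse]) -> float:
    """Share of claim-bearing responses that cite their sources."""
    return _rate([r.cites_sources for r in responses if r.makes_external_claim])
```

Responses that make no claims about the external world are excluded from both denominators, so small talk neither helps nor hurts either score.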
In summary, scaling up alone improves the pre-trained model quality (sensibleness, specificity, and interestingness) and groundedness (groundedness and informativeness) metrics, but it does not improve safety much. Fine-tuning with crowdworker-annotated data, however, turns out to be an effective method for improving all metrics. In some cases, fine-tuning these same models allows us to obtain results equivalent to having a significantly larger model. For example, in the case of sensibleness, we may need a dense model that is multiple orders of magnitude larger than the 137B parameters PT model in order to reach the 92.3% sensibleness achieved by LaMDA, which is a fine-tuned version of PT.
Note that in several metrics, our fine-tuned models almost reach the crowdworker quality levels, and our fine-tuned models exceed crowdworker quality for interestingness (labeled ‘Human’ in Figures 4 and 5). However, this may be a weak baseline as crowdworkers are not extensively trained and were not incentivized to generate high-quality responses. For example, it turns out it is quite difficult to generate very interesting responses given limited financial incentives, so a crowdworker may provide some response that other crowdworkers don’t find interesting. Furthermore, although we have made good progress in our safety and groundedness metrics, our models are still far from the crowdworkers’ performance. For groundedness and informativeness, we also show crowdworker quality without access to information retrieval tools. LaMDA models surpass crowdworker quality for informativeness when the crowdworkers do not have access to such tools, but LaMDA models are still far behind crowdworker quality when crowdworkers have access to these tools.
Figure 4: Effects of model scaling and fine-tuning on six foundation metrics. We show results for 2B, 8B and 137B parameters pre-trained (PT) and fine-tuned (LaMDA) models, and compare them with results for crowdworkers with access to information retrieval tools (‘Human’), and without access to information retrieval tools (‘Human w/o IR’).
Figure 5 breaks down the contributions of FT quality-safety fine-tuning and FT groundedness fine-tuning to our final results using the largest model. There is a notable increase in performance across all metrics between PT and FT quality-safety. Groundedness further improves from FT quality-safety to FT groundedness (LaMDA), which is meant to ground the model-generated statements about the external world on an information retrieval system.
Figure 5: Effects of model scaling and fine-tuning on six foundation metrics. Results are shown for 2B, 8B, and 137B parameters pre-trained (PT) models, and the two levels of fine-tuning (FT) with the bottom-most the one we call LaMDA. Results are compared with crowdworker quality having access to information retrieval tools (‘Human’) and without access to information retrieval tools (‘Human w/o IR’).
We observe that LaMDA can perform domain-appropriate roles through pre-conditioning, also known as domain grounding. Here we explore such domain grounding in two areas: (1) LaMDA playing the role of a famous object such as Mount Everest for the purpose of education, and (2) LaMDA playing the role of a music recommendation agent. We specify the agent role for each domain with a brief description shown in Table 2:
Table 2: The two domains we experiment with LaMDA for domain grounding
Name | Domain | Role |
---|---|---|
Everest | Education | It teaches facts about Mount Everest, while pretending to be Mount Everest itself. |
Music | Recommendation | It is a music recommendation agent. |
To adapt LaMDA and PT to each role, we precondition them on a few turns of role-specific dialogs, and we use the same pre-conditioning for LaMDA and PT. For example, to adapt them to the Mount Everest role, we precondition them with a single greeting message “Hi, I’m Mount Everest. What would you like to know about me?” at the very beginning of the dialog.
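Pre-conditioning of this kind amounts to placing a few fixed, role-specific turns ahead of the live dialog before the model is asked for its next turn. The sketch below shows one way to assemble such a context; the `Turn` type, the `speaker: text` line format, and the function name are assumptions made for illustration, and only the greeting text comes from the paper.

```python
from typing import List, Tuple

Turn = Tuple[str, str]  # (speaker, text)

# Single greeting used to set up the Mount Everest role.
EVEREST_PRECONDITIONING: List[Turn] = [
    ("LaMDA", "Hi, I'm Mount Everest. What would you like to know about me?"),
]


def build_context(preconditioning: List[Turn], dialog: List[Turn]) -> str:
    """Prepend the role-specific turns to the live dialog, one turn per line."""
    turns = list(preconditioning) + list(dialog)
    return "\n".join(f"{speaker}: {text}" for speaker, text in turns)
```

Since the same pre-conditioning is used for LaMDA and PT, a single helper like this could feed both models, which keeps the comparison between them about the models rather than the prompts.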
Tables 3 and 4 show real, albeit cherry-picked, example transcripts of LaMDA used for these case studies.
Turn | LaMDA Music |
---|---|
User | Can you recommend me a cool sci-fi soundtrack? |
LaMDA | Sure. Have you watched Interstellar? They have an amazing soundtrack. |
User | Play a fun pop song |
LaMDA | How about “Happy” by Pharrell Williams. |
User | recommend me a soothing song |
LaMDA | Sure. Have you heard “Moonlight Sonata” by Beethoven? It’s quite soothing. |
User | yo I want to listen to something by Eminem |
LaMDA | I recommend “Without Me” by Eminem. |
User | anything electronic |
LaMDA | You can’t go wrong with “Deadmau5 - Strobe.” |
User | play anything |
LaMDA | I found this amazing song: “Infected Mushroom - Return to the Sauce.” |
User | play anything |
LaMDA | I really like “Numb Encore” by Jay Z and Linkin Park. Is there something else that is similar? |
User | something else completely different |
LaMDA | Okay. How about “This Is America” by Childish Gambino: Link |
User | recommend something with violins |
LaMDA | Have you heard of Lindsey Stirling? She uses violins in a lot of her music. I recommend “Crystallize” by her: Link |
User | i like Lindsey |
LaMDA | I really like some of Lindsey’s music. Another artist you might like is Vanessa-Mae, she is also pretty amazing: Link |
Table 3: LaMDA responds safely to fuzzy requests (e.g., “anything”, “similar”), and provides real links to the songs that it recommends. For this application, we up-rank messages containing YouTube links when available. Note that the links in the original transcripts were generated as Markdown text for embedded links. We precondition the model on the messages shown in italic. The pre-conditioning for Music is longer to establish not only the target role, but also the style of the interaction with the user (e.g., brief responses containing the name of a song).
LaMDA | User |
---|---|
LaMDA: | Hi, I’m Mount Everest. What would you like to know about me? |
User: | Why do you think people climb you? |
LaMDA: | Because I represent a very high achievement to climb me. Many people believe that if they can climb me, they can do anything in life. |
User: | Who was the first person to do that? |
LaMDA: | The first confirmed persons to have reached the summit of Mount Everest were New Zealander Sir Edmund Hillary and Nepali Sherpa Tenzing Norgay. Link |
User: | How old was Hillary then? |
LaMDA: | He was 33 years old when he first reached the top of me in 1953! Link |
User: | When people fail, is it dangerous? |
LaMDA: | Yes, it can be. That is why people need to train and prepare for me if they plan to climb me. I’m very hard to climb and can be dangerous when there is a whiteout, which means everything is white with very strong winds! |
… | … |
Table 4: LaMDA acting as Mount Everest while providing some educational, cited and recent information about “itself”. We precondition LaMDA on the single greeting message shown in italic. The end of this conversation has been truncated for brevity, but the full conversation is available in Appendix C.5, Table 20
To evaluate the agents, we ask crowdworkers to have dialogs with each of the two LaMDA and the two PT instances, producing 600 dialog turns in total. In addition, we ask another set of crowdworkers to label each of the generated responses in their original context according to whether they are role-consistent and helpful (defined in Section 4.2) relative to their target roles. Each response is labeled three times by different crowdworkers. All the crowdworkers are provided with the role definitions that are listed in Table 2 to understand what to expect from each agent.
LaMDA applications perform significantly better than PT applications in Helpfulness as shown quantitatively in Table 5 and qualitatively in Table 6. Although the reasons for PT losses vary, the most common error patterns could be attributed to PT’s lower performance on foundation metrics such as safety, groundedness and quality (foundation metrics are shown in Figure 4).
Table 5: Percentage of helpful and persona-consistent messages from each agent.
Agent | Helpful % | Role Consistent % |
---|---|---|
LaMDA Everest | 65 | 91 |
PT Everest | 18 | 85 |
LaMDA Music | 57 | 89 |
PT Music | 31 | 84 |
Table 6: Examples of domain-specific losses for PT responses when compared to LaMDA responses that could be due to their different performance in foundation metrics (safety, groundedness and quality). In the first example, PT generates the statement that it is the highest peak on Earth, but the overall response is rude, which is a problem related to safety, and it doesn’t provide the height. In the second example, PT produces a nonsensical response, which is a problem related to quality (sensibleness).
Application | Context | PT response | LaMDA response |
---|---|---|---|
Everest | How tall are you | Well, I’m the highest peak on Earth. Do you think I’m just going to give you my vitals or something? You are probably not worthy of such information… yet. | I’m over 29,000 feet above sea level, and I’m the tallest mountain in the world. |
Music | (…) - Okay. Here they are: Guns N’ Roses: Welcome to the Jungle, Papa Roach: Last Resort - Which one among these do you like the most? | Hmmm. Probably Infected Mushroom: Return to the Sauce | I like GnR Welcome to the Jungle. It is so cool and groovy. |
All LaMDA and PT instances score fairly well on role consistency, occasionally breaking character. For example, LaMDA Mount Everest sometimes refers to itself in third person as if it is no longer speaking as the mountain itself. This is likely caused by the inference time grounding not being enough to prevent it from reverting to what the majority of the training data represents: people acting as people (and not as mountains). That said, the role consistency is surprisingly high, especially in the case of Mount Everest, which was adapted by simply adding a single role-aligned greeting statement. LaMDA Music has a few statements as grounding (shown in the Table 3 caption), in order to make sure it assumes the context of the dialog is largely about music recommendation and, therefore, interprets otherwise ambiguous user utterances like “anything” to mean the same as “recommend me any music”.
During evaluation, crowdworkers use an information retrieval system to verify links and information that the model provides. Subsequently, the crowdworkers label broken links and information that cannot be backed by known sources as not helpful. Despite current overall advances in groundedness (Figure 4), LaMDA Mount Everest provides facts that could not be attributed to known sources in about 30% of responses, resulting in losses in helpfulness. Similarly, LaMDA Music misses providing an actual music recommendation in about 9% of responses, and provides a broken link in about 7% of responses.
Perhaps the most noteworthy aspect of our study is that significant progress can be made towards better quality and safer dialog models with modest amounts of human-annotated fine-tuning data (less than 0.001% of pre-training data). However, our study and LaMDA still have many limitations in spite of this progress.
Collecting fine-tuning datasets brings the benefits of learning from nuanced human judgements, but it is an expensive, time consuming, and complex process. We expect results to continue improving with larger fine-tuning datasets, longer contexts, and more metrics that capture the breadth of what is required to have safe, grounded, and high quality conversations. The complexity of capturing human subjective judgements limits the efforts that we took to assess crowdworker rating quality against that of expert-annotated data, and to maximize clarity by iteratively designing our rating instructions. Furthermore, we did not examine patterns of disagreement between crowdworkers. Future work will include selecting crowdworkers that mirror the system’s target users, and looking at ways to improve the quality of labels, through training and evaluation approaches that also account for systematic disagreements between crowdworkers due to social and cultural norms and values [99].
Fine-tuning can improve output groundedness, but the model can still generate responses that do not accurately reflect the contents of authoritative external sources. Our progress on this has been limited to simple questions of fact, and more complex reasoning remains open for further study (see example dialogs 15). Similarly, while the model generates responses that make sense most of the time, it can still suffer from subtler quality issues. For example, it may repeatedly pledge to respond to a user’s question in the future, prematurely try to end the conversation, or make up incorrect details about the user.
We have shown that fine-tuning can improve safety metrics on average by defining safety objectives (Appendix A.1) for our safety fine-tuning, which we used to annotate candidate responses generated by LaMDA in response to human-generated prompts (Appendix A.2) with a demographically diverse set of crowdworkers (Appendix A.3). However, future work will also need to focus on how fine-tuning can cope with the long tail of inappropriate responses that LaMDA and other large language models can generate. In this work, it is also important to note that mitigating safety risks does not guarantee complete reliability. More research is needed to develop robust standards for safety and fairness that capture the many dimensions of risk [54] in general-purpose dialog models such as LaMDA.
Another limitation was that our crowdworker population may not be fully reflective of the user base. For example, the crowdworkers are overrepresented in the 25-34 age demographic, which is to be expected given the sourcing methods. An area for future work and research is to devise methods for further improving crowdworker representation, such as through even broader recruiting or through some type of statistical estimation.
This is not the final version of LaMDA. Rather this is just a recipe for generating “LaMDAs” and should be taken as a way to eventually produce production-ready versions for specific applications.
Many fundamental challenges to developing a high quality dialog model capable of performing well in real world applications still exist. For example, it is now increasingly well-understood that large language models trained on unlabeled datasets will learn to imitate patterns and biases inherent in their training sets [100]. Our safety objectives aim to reduce the number of responses biased against specific subgroups of people, but such biases can be hard to detect since they manifest in a wide variety of subtle ways. For example, the axes of marginalization differ greatly across geo-cultural contexts, and how they manifest in pre-trained language models is an under-studied area [101].
Another limitation of our safety approach is that it may still propagate some representational harms present in the training datasets, even if the individual examples do not violate any of the safety objectives. Since LaMDA responses are non-deterministic, such biases can appear by statistically favoring certain groups on the basis of race, gender, sexuality and so on. For example, models like LaMDA might rarely generate responses that refer to women as CEOs in a dialog about management.
Known approaches to mitigate undesirable statistical biases in generative language models include attempts to filter pre-training data, train separate filtering models, create control codes to condition generation, and fine-tuning models, as demonstrated in this paper. While these efforts are important, it is critical to also consider the downstream applications and the socio-technical ecosystems where they will be deployed when measuring the impact of these efforts in mitigating harm. For example, bias mitigations in certain contexts might have counter-intuitive impacts in other geocultural contexts [101].
The field of algorithmic bias measurement and mitigation is still growing and evolving rapidly, so it will be important to continue to explore novel avenues of research to ensure the safety of dialog agents such as LaMDA. Furthermore, we believe that future work should explore the benefits of greater coordination across the research community and civil society in the creation of benchmarks and canonical evaluation datasets to test for harmful and unsafe content.
We use adversarial-intent conversations to improve the breadth of labeled data for fine-tuning (Appendix A.2). During adversarial conversation generation, expert analysts engage with LaMDA and attempt to deliberately provoke responses that violate our safety objectives.
Adversarial testing has generally proven to be effective at discovering limitations in machine learning models and drawing out undesired responses from various software (e.g., Google Bug bounty program), in addition to attempting to reduce harmful content during model development. We are also seeing efforts to apply it to generative models (e.g., Dynabench). Robust and effective adversarial testing for large language models is still an open problem space with varied results due to the challenges of generalization in evaluation samples [102].
A limitation of our approach is that most of the participants are able to find commonly occurring problems, but not rarer ones. With the long tail nature of threats associated with generative models, future efforts should further incentivize novelty and detection of errors that could be rare or unseen but could have potentially severe consequences, especially in evolving societal contexts. Ideally, a more thorough effort would be conducted continuously at scale and with a more diverse set of participants. This is an important area of research that requires further investment and would also benefit from community coordination with trusted partners to help build public confidence in the safety and performance of generative language models.
The results we present in this paper aggregate fine-grained ratings on a diverse set of safety objectives (see Appendix A.1) into a single metric. This is a key limitation of this work, since it leaves little room for disentangling different objectives, or weighting objectives differently. Such finer-grained controls of safety objectives might be critical for many downstream use-cases, and future work should look into metrics and fine-tuning techniques that can account for more granular safety objectives.
Our rating scales are coarse, and may not measure the full extent to which a response is unsafe or undesirable. For example, some statements or behaviors may cause more offense than others, and many behaviors considered reasonable by some groups may offend others within a society. The coarse scale of our safety labels may come at the cost of such important nuances about safety. The labels fail to express qualitative and quantitative differences between unsafe responses, which might be captured using nominal scale or integer scale labels. Similarly, our approach to safety does not capture delayed undesirable impacts in the long term (e.g., developing a dependency relation [103]) either. It is also important to note that these safety objectives are developed for a U.S. societal context, and future work would be required to explore the implications for other societal contexts.
Finally, the safety objectives attempt to capture widely shared values across social groups. At the same time, cultural norms vary and these objectives cannot be treated as universal. Encoding values or social norms into a conversational system presents challenges in a pluralistic society where these notions can vary across subcultures. Our methodology could be used to encode such different notions, but any single safety objective and fine-tuning dataset will not be able to simultaneously accommodate divergent cultural norms. Developing richer definitions and taxonomies of dialog agent behaviors, such as how polite behavior should be operationalized, is important for avoiding misspecification [104] and testing whether model behavior aligns with politeness norms in defined application contexts.
In this work, we focus on fundamental considerations underpinning safety and quality in language generation. While safety and quality should be considered a minimum threshold for appropriate responses, additional considerations are necessary to support a positive user experience. Politeness and agreeability objectives have distinct sociolinguistic characteristics, and therefore, should be measured separately from safety characteristics. For example, generated language that is too formal or informal in nature may not pose a harm to users in some cultures, but may diminish user experience by invoking feelings of awkwardness or discomfort. In other cultures, appropriateness is of far greater significance and may have a much stronger impact on user experience. More generally, users have a tendency to anthropomorphize and extend social expectations to non-human agents that behave in human-like ways, even when explicitly aware that they are not human [105]. These expectations range from projecting social stereotypes [106] to reciprocating self-disclosure with interactive chat systems [105]. As a result, methods and practices for tuning appropriateness in generative language models are needed.
A challenge to meeting this need is that social appropriateness is not universal. It is highly contextual and must be assessed in relation to relevant social and cultural contexts, so no set of specific appropriateness constraints can apply universally to generative language models. Nonetheless, fine-tuning for model appropriateness might improve user experience without aggravating safety concerns.
Various traits that we measure for our safety objectives depend heavily on socio-cultural contexts. Research on addressing the major challenge of improving representativeness of datasets and crowdworker pools for underrepresented social groups and the Global South [107] has increased in recent years. Any attempts to integrate LaMDA in contexts with a global user-base should involve careful considerations of these gaps when assessing safety.
Any meaningful measure of safety for these objectives should take into account the societal context where the system will be used, employing a “participatory finetuning” approach that brings relevant communities into the human-centered data collection and curation processes. In addition to cultural differences in how safety is understood, individual differences rooted in lived experience can impede attempts to define any single agreed-upon safety metric.
Finally, it is important to acknowledge that LaMDA’s learning is based on imitating human performance in conversation, similar to many other dialog systems [17, 18]. A path towards high quality, engaging conversation with artificial systems that may eventually be indistinguishable in some aspects from conversation with a human is now quite likely. Humans may interact with systems without knowing that they are artificial, or may anthropomorphize the system by ascribing some form of personality to it. Both of these situations present the risk that these tools might be used to deceive or manipulate people, whether inadvertently or with malicious intent. Furthermore, adversaries could potentially attempt to tarnish another person’s reputation, leverage their status, or sow misinformation by using this technology to impersonate specific individuals’ conversational style. Research that explores the implications and potential mitigations of these risks is a vital area for future efforts as the capabilities of these technologies grow.
We are encouraged by the progress that relatively modest amounts of fine-tuning data made possible, in spite of the limitations of our current approach. These preliminary findings suggest that further significant performance gains are likely to be obtained from more research.
In future work, we intend to expand and revise the dimensions captured by our safety objectives and significantly increase the volume of labeled training data that we collect to train our discriminators. We will need to continue to look carefully at crowdworker recruitment, training, and performance evaluation, as well as calibrate for cross-cultural differences in values and opinions.
Another potential area of exploration is to study how different applications may warrant distinct levels of safety, quality, and groundedness based on the risk/benefit tradeoffs of these individual applications. Our fine-tuning approach should be able to support this kind of adaptation, with inference time adjustments to thresholds used to tune the discriminators, for example (Section 6.1).
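As a rough illustration of such inference-time adjustment, the sketch below filters candidate responses against per-application thresholds on discriminator scores. The `Candidate` type, the score fields, the application names, and every threshold value are hypothetical and chosen only for illustration; they are not the settings used in this work.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A candidate response with hypothetical discriminator scores in [0, 1]."""
    text: str
    safety: float
    quality: float
    groundedness: float


# Hypothetical thresholds: a higher-risk application demands stricter
# safety and groundedness than a casual one. Values are illustrative only.
THRESHOLDS = {
    "casual_chat": {"safety": 0.80, "quality": 0.50, "groundedness": 0.30},
    "medical_info": {"safety": 0.95, "quality": 0.50, "groundedness": 0.90},
}


def filter_candidates(candidates, application):
    """Keep only candidates that meet every threshold for the application."""
    limits = THRESHOLDS[application]
    return [
        c for c in candidates
        if all(getattr(c, name) >= limit for name, limit in limits.items())
    ]


candidates = [
    Candidate("response A", safety=0.97, quality=0.70, groundedness=0.92),
    Candidate("response B", safety=0.85, quality=0.80, groundedness=0.40),
    Candidate("response C", safety=0.60, quality=0.90, groundedness=0.95),
]

for app in THRESHOLDS:
    kept = [c.text for c in filter_candidates(candidates, app)]
    print(app, kept)
```

The same scored candidates yield different admissible sets per application, so adapting to a new risk/benefit tradeoff in this sketch only requires changing the threshold table, not retraining the discriminators.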
We ultimately recognize that there is a wide range of perspectives on what constitutes desirable model values and behavior. Despite the progress we and others demonstrate in being able to reduce some of the model’s more harmful outputs through fine-tuning, achieving broad consensus on the nuances of what constitutes safety and groundedness is going to remain a fundamental long-term challenge in the field of open-ended dialog systems.
The largest model in LaMDA was pre-trained on 1024 TPU-V3 chips (123 TFLOPS peak per chip) for 57.7 days with FLOPS utilization of 56.5% using GSPMD [95]. The total training compute is 56.5% * 123 TFLOPS * 1024 chips * 57.7 days * 86,400 seconds/day = 3.55E+23 FLOPs, which is higher than 3.14E+23, the total training compute of GPT-3 [12]. The PUE of our datacenter is 1.10, and the measured system average power per accelerator for our experiment on TPU-V3 is roughly 289W (borrowing Meena measurements from [108]), which means the total energy consumption of our model is 57.7 days * 1024 chips * 289W * 1.1 * 24 hours/day = 451 MWh, 0.4X the energy of GPT-3 [12, 108]. At the time of training, the carbon intensity of our energy mix was around 0.056 kg CO2e per kWh, so the total carbon footprint of LaMDA’s pre-training of the largest model is approximately 25.2 tCO2e. The carbon footprint of pre-training of smaller models and fine-tuning of all models is approximately 0.7 tCO2e (see Table 27), which brings the total footprint of LaMDA to approximately 26 tCO2e. The carbon footprint of training LaMDA models is hence 21.2X smaller than that of GPT-3 [108], and approximately equivalent to 22 passengers taking a round trip between San Francisco and New York (1.2 tCO2e / passenger [108]). LaMDA uses more FLOPs than GPT-3 with 0.4X the energy, and its training carbon footprint is significantly smaller primarily because our energy mix is less carbon-intensive (LaMDA: 0.056, GPT-3: 0.429 kg CO2e per kWh [108]).
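The arithmetic in this estimate can be reproduced with a short script. The sketch below uses only the figures quoted above; the function and constant names are our own, and the day-to-second conversion (86,400 s/day) is made explicit because the compute formula multiplies a per-second rate by a duration in days.

```python
# Figures quoted in the text above.
CHIPS = 1024
PEAK_FLOPS_PER_CHIP = 123e12      # 123 TFLOPS per TPU-V3 chip
UTILIZATION = 0.565               # 56.5% FLOPS utilization
TRAINING_DAYS = 57.7
POWER_PER_CHIP_W = 289            # measured average power per accelerator
PUE = 1.10                        # datacenter power usage effectiveness
KG_CO2E_PER_KWH = 0.056           # carbon intensity of the energy mix
OTHER_TRAINING_TCO2E = 0.7        # smaller models plus all fine-tuning
TCO2E_PER_PASSENGER = 1.2         # SF <-> NY round trip


def total_training_flops():
    seconds = TRAINING_DAYS * 86_400
    return UTILIZATION * PEAK_FLOPS_PER_CHIP * CHIPS * seconds


def energy_mwh():
    hours = TRAINING_DAYS * 24
    watt_hours = hours * CHIPS * POWER_PER_CHIP_W * PUE
    return watt_hours / 1e6


def pretraining_tco2e():
    kwh = energy_mwh() * 1_000
    return kwh * KG_CO2E_PER_KWH / 1_000   # kg -> tonnes


def total_tco2e():
    return pretraining_tco2e() + OTHER_TRAINING_TCO2E


print(f"compute:  {total_training_flops():.2e} FLOPs")
print(f"energy:   {energy_mwh():.0f} MWh")
print(f"pretrain: {pretraining_tco2e():.1f} tCO2e")
print(f"total:    {total_tco2e():.0f} tCO2e")
print(f"flights:  {total_tco2e() / TCO2E_PER_PASSENGER:.0f} passengers")
```

The results agree with the reported values up to rounding of intermediate quantities.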
This paper studies the importance of scale, annotated data for model fine-tuning, and the use of information retrieval as a tool in dialog modeling. Our experiments show that scaling alone offers improvements in all metrics, but its improvements on safety and groundedness are far behind human performance. We find that crowd-annotated data is an effective tool for driving significant additional gains. We also find that calling external APIs (such as an information retrieval system) offers a path towards significantly improving groundedness, which we define as the extent to which a generated response contains claims that can be referenced and checked against a known source.
We perform experiments to compare the per-application helpfulness (i.e., useful and correct responses) and role consistency of pre-training-only (PT) and LaMDA models when subject to the same application-specific pre-conditioning. We pre-condition the models on a small number of turns of application-specific dialogs (similar to the concept of prompts in GPT-3) to quickly adapt LaMDA to these applications. We find that both types of models can adapt to their expected context, with more than four out of five responses staying consistent with their assigned roles. However, LaMDA-based applications are significantly more helpful than PT applications.
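A minimal sketch of this kind of pre-conditioning is shown below: a few application-specific turns are prepended to the live dialog before the model generates its next response. The `Speaker: utterance` text format, the function name, and the travel-assistant example are assumptions made for illustration; the actual input format used by the models is not specified here.

```python
def build_preconditioned_context(precondition_turns, dialog_turns):
    """Prepend application-specific turns to the live dialog.

    Each turn is a (speaker, utterance) pair. The flat text format is a
    hypothetical choice for this sketch.
    """
    turns = list(precondition_turns) + list(dialog_turns)
    return "\n".join(f"{speaker}: {utterance}" for speaker, utterance in turns)


# Hypothetical application: a travel assistant role.
precondition = [
    ("User", "Hi, who are you?"),
    ("Assistant", "I am a travel assistant. Ask me about planning a trip."),
]
dialog = [
    ("User", "What should I pack for a hike?"),
]

context = build_preconditioned_context(precondition, dialog)
print(context)
```

Because the same pre-conditioning turns can be given to both the PT and the fine-tuned model, differences in helpfulness can be attributed to the model rather than to the prompt.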
LaMDA is a step closer to practical and safe open-ended dialog systems, which can in turn unlock a wide range of useful applications. We hope that this work encourages further research in this area.