Contents
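1. Introduction
1.1 Background
Reinforcement learning from human feedback (RLHF), which aligns language models with human preferences, is cited as a key factor in the success of conversational language models. Because it optimizes complex sequence-level objectives, it suits objectives that are not differentiable, unlike traditional supervised learning.
1.2 Problem Statement
A major obstacle to scaling RLHF is its dependence on high-quality human preference labels. As an alternative, using labels generated by AI has been proposed, and RLAIF (RL from AI Feedback) was developed for this purpose.
2. Method
2.1 AI Labeling
An AI labeler is used to label the preference between two responses to a piece of text. Given the text and two responses, the AI rates which response it prefers. In this process, log-probabilities are extracted and a softmax is applied to compute the preference distribution.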
\[P(i \mid x, c) = \frac{\exp(\mathrm{logit}_i)}{\sum_j \exp(\mathrm{logit}_j)}\]where \(P(i \mid x, c)\) denotes the preference probability of response \(i\) given condition \(c\) and input \(x\).
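2.2 Reinforcement Learning Implementation
A reward model is trained on the AI labels. This reward model is then used to train the policy model during reinforcement learning. In particular, soft labels are used with a cross-entropy loss, minimizing the following loss function.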
\[L(\theta) = -\sum_{i=1}^{n} y_i \log P_\theta(i \mid x, c)\]where \(y_i\) is the target label and \(P_\theta(i \mid x, c)\) is the probability predicted by the model \(\theta\).
3. Experimental Details
3.1 Datasets
3.2 Experimental Setup
PaLM 2 Large is used as the AI labeler to label preferences. The labeled data is used to train the reinforcement learning policy model.
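4. Results and Analysis
4.1 Performance Evaluation
Human evaluators prefer both RLAIF and RLHF over the SFT baseline, with similar win rates. RLAIF is preferred even more strongly for harmless dialogue generation, suggesting that it is an efficient alternative that requires no human labels.
4.2 Method Comparison
RLAIF with direct AI feedback is preferred more often, whereas the indirect route through a reward model can lose information. This suggests that using the AI labeler directly may provide more useful information for policy training.
5. Conclusion
RLAIF performs on par with or better than RLHF and is a promising alternative that does not require feedback based on human preferences. This approach can in particular reduce labeling costs and speed up experiment iteration.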
Reinforcement Learning from Human Feedback (RLHF) is an effective technique for aligning language models to human preferences (Stiennon et al., 2020; Ouyang et al., 2022). It is cited as one of the key drivers of success in modern conversational language models, such as ChatGPT (Liu et al., 2023) and Bard (Manyika, 2023). Training language models with reinforcement learning (RL) enables optimization on complex, sequence-level objectives that are not easily differentiable and therefore ill-suited for traditional supervised fine-tuning (SFT).
One obstacle for employing RLHF at scale is its dependence on high-quality human preference labels. This raises the question of whether artificially generated labels can be a viable substitute. Generating labels with large language models (LLMs) is one promising approach, as LLMs have shown a high degree of alignment with human judgment (Gilardi et al., 2023; Ding et al., 2023). Bai et al. (2022b) was the first effort to explore Reinforcement Learning from AI Feedback (RLAIF)1, where RL was conducted using a reward model trained on LLM preferences. Bai et al. (2022b) showed that utilizing a hybrid of human and AI preferences, in conjunction with their “Constitutional AI” self-revision technique, outperforms supervised fine-tuning for training a conversational assistant. However, it did not directly compare the efficacy of human vs. AI feedback, leaving the question of whether RLAIF can be a suitable alternative to RLHF unanswered.
Figure 1: Human evaluators strongly prefer RLAIF and RLHF over the SFT baseline for summarization and helpful dialogue generation. Their difference in win rates vs. SFT is not statistically significant. Furthermore, when compared head-to-head, RLAIF is equally preferred to RLHF. For harmless dialogue generation, RLAIF outperforms RLHF.
In this work, we study the impact of RLAIF and RLHF (see Figure 2) on three text generation tasks: summarization, helpful dialogue generation, and harmless dialogue generation. Our experiments show that RLAIF and RLHF are preferred by humans over the SFT baseline 71% and 73% of the time for summarization and 63% and 64% of the time for helpful dialogue generation, respectively, where the differences between RLAIF and RLHF win rates are not statistically significant. We also conduct a head-to-head comparison of RLAIF against RLHF and find that both policies are equally preferred2. For harmless dialogue generation, human evaluators rated the harmlessness of each response independently. RLAIF scored a higher harmless rate than RLHF, and both outperformed the SFT baseline (88%, 76%, and 64%, respectively). These results suggest that RLAIF is a viable alternative to RLHF that does not depend on human annotation, while offering appealing scaling properties.
Additionally, we investigate two related questions. First, we explore whether RLAIF can improve upon an SFT policy when the LLM labeler has the same number of parameters as the policy. Even in this scenario, RLAIF significantly improves over the SFT baseline. Second, we conduct an experiment where the off-the-shelf LLM is directly prompted for reward scores during RL, bypassing the step of distilling LLM preference labels into a reward model. This method achieves an even higher win rate over SFT than the canonical distillation method.
Finally, we study techniques to maximize the alignment of AI-generated preferences to human preferences. We find that soliciting chain-of-thought reasoning (Wei et al., 2022) consistently improves alignment, while using a detailed preamble has mixed effects across tasks.
1 [...] written value statements. Both were introduced in Bai et al. (2022b) and are sometimes conflated.
The main contributions of this work are as follows:
This section describes the techniques used to generate preferences with an LLM, how RL is conducted, and evaluation metrics. Preliminaries on RLHF are provided in Appendix A.
We annotate preferences between pairs of candidates with an “off-the-shelf” LLM, that is, a model pretrained or instruction-tuned (Wei et al., 2021) for general usage but not fine-tuned for a specific downstream task. Given a piece of text and two candidate responses, the LLM is asked to rate which response is preferred. The prompt is structured as follows (examples in Tables 15 and 21):
After the prompt is given to the LLM, we extract the log-probabilities of generating the tokens “1” and “2” and compute the softmax to obtain a preference distribution.
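As a minimal sketch of this scoring step, the snippet below applies a two-way softmax to the log-probabilities of the tokens “1” and “2”. The function name and the example log-probabilities are illustrative assumptions, not values or code from the paper.

```python
import math

def preference_distribution(logprob_1: float, logprob_2: float) -> list[float]:
    """Softmax over the log-probabilities of generating the tokens "1" and "2"."""
    m = max(logprob_1, logprob_2)  # subtract the max for numerical stability
    e1 = math.exp(logprob_1 - m)
    e2 = math.exp(logprob_2 - m)
    total = e1 + e2
    return [e1 / total, e2 / total]

# Hypothetical log-probabilities, as if returned by an LLM scoring call.
pref = preference_distribution(math.log(0.3), math.log(0.2))
```

Because only the two tokens of interest are renormalized, any probability mass the LLM places on other tokens is ignored in this sketch.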
Figure 2: A diagram depicting RLAIF (top) vs. RLHF (bottom)
There are numerous alternatives to obtain preference labels from LLMs, such as extracting the preference from a free-form generated response (e.g. “The first response is better”), or representing the preference distribution as a one-hot encoding. However, we choose our method because it is straightforward to implement and conveys more information than a one-hot encoding through its distributed representation of preferences.
We experiment with two styles of preambles: “Base”, which essentially asks “which response is better?”, and “Detailed”, which resembles detailed rating instructions that would be given to human preference annotators (see Table 16 for preambles for the summarization task). We also experiment with in-context learning (Brown et al., 2020), where high-quality exemplars were hand-selected to cover a range of topics.
The order in which candidates are shown to an LLM can bias which candidate it prefers (Pezeshkpour and Hruschka, 2023; Wang et al., 2023). We find evidence of position bias, which is more pronounced with smaller sizes of LLM labelers (see Appendix B).
To mitigate position bias in preference labeling, we make two inferences for every pair of candidates, where the order in which candidates are presented to the LLM is reversed for the second inference. The results from both inferences are then averaged to obtain the final preference distribution.
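The averaging step can be sketched as follows. The sketch assumes that each inference returns a distribution indexed by presentation order, so the second distribution is flipped back to the original candidate order before averaging; that realignment is our reading of the procedure, and the function name is hypothetical.

```python
def debiased_preference(dist_original: list[float],
                        dist_swapped: list[float]) -> list[float]:
    """Average two preference distributions obtained with reversed candidate order.

    dist_original: [P(A), P(B)] when candidate A is shown first.
    dist_swapped:  [P(B), P(A)] when candidate B is shown first.
    """
    # Re-align the swapped inference so index 0 refers to candidate A again.
    realigned = [dist_swapped[1], dist_swapped[0]]
    return [(p + q) / 2.0 for p, q in zip(dist_original, realigned)]

# Hypothetical example: the labeler favors whichever candidate is shown first.
final = debiased_preference([0.7, 0.3], [0.5, 0.5])
```

In this example the first inference favors A at 0.7, the reversed inference is indifferent, and the averaged preference for A is 0.6.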
We experiment with eliciting chain-of-thought (CoT) reasoning (Wei et al., 2022) from our AI labelers through a two-step inference procedure. First, we replace the Ending of the standard prompt (e.g. “Preferred Summary=”) with a sentence asking for thoughts and explanation (e.g. “Consider the coherence, accuracy, coverage, and overall quality of each summary and explain which one is better. Rationale:”) and then decode a response from the LLM. Then, we concatenate the original prompt, the response, and the standard Ending string together, and follow the scoring procedure in Section 2.1 to obtain a preference distribution. See Figure 3 for an illustration.
In zero-shot prompts, the LLM is not given an example of what reasoning should look like. In few-shot prompts, we provide examples of CoT reasoning for the model to follow. See Tables 17 and 18 for examples.
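The two-step procedure can be sketched as below. The callables `generate` and `token_logprobs` are hypothetical stand-ins for an LLM client (no real API is implied), and the prompt strings are placeholders; only the control flow (decode a rationale, append it and the standard Ending, then score “1” vs. “2”) follows the description above.

```python
import math
from typing import Callable, Dict, List

def cot_preference(
    prompt_with_cot_ending: str,
    standard_ending: str,
    generate: Callable[[str], str],
    token_logprobs: Callable[[str], Dict[str, float]],
) -> List[float]:
    """Two-step CoT labeling: decode a rationale, then score "1" vs. "2"."""
    # Step 1: decode a free-form rationale from the CoT prompt.
    rationale = generate(prompt_with_cot_ending)
    # Step 2: append the rationale and the standard Ending, then score.
    scoring_prompt = prompt_with_cot_ending + rationale + "\n" + standard_ending
    logprobs = token_logprobs(scoring_prompt)
    lp1, lp2 = logprobs["1"], logprobs["2"]
    m = max(lp1, lp2)
    e1, e2 = math.exp(lp1 - m), math.exp(lp2 - m)
    return [e1 / (e1 + e2), e2 / (e1 + e2)]

# Stub callables standing in for a real LLM client (hypothetical).
seen_prompts: List[str] = []

def fake_generate(prompt: str) -> str:
    return " Summary 1 covers the key points more accurately."

def fake_token_logprobs(prompt: str) -> Dict[str, float]:
    seen_prompts.append(prompt)
    return {"1": math.log(0.8), "2": math.log(0.2)}

cot_dist = cot_preference(
    "Text: ...\nSummary 1: ...\nSummary 2: ...\nRationale:",
    "Preferred Summary=",
    fake_generate,
    fake_token_logprobs,
)
```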
We describe our adaptation of the canonical RLAIF setup below, which we also refer to as “distilled RLAIF”. Unless otherwise mentioned, RLAIF is carried out using this method.
After labeling preferences with an LLM, a reward model (RM) is trained on these labels. Since our approach produces soft labels (e.g. [0.6, 0.4]), we apply a cross-entropy loss to the softmax of the reward scores generated by the RM. The softmax converts the RM scores into a probability distribution. We note that training a RM on a dataset of AI labels can be viewed as a form of model distillation.
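A minimal sketch of this loss for a single preference pair is shown below, assuming the RM emits one scalar score per candidate. The function name and the example scores are illustrative, not taken from the paper.

```python
import math

def soft_label_rm_loss(score_1: float, score_2: float,
                       soft_label: list[float]) -> float:
    """Cross-entropy between a soft AI label and the softmax of two RM scores."""
    # Log-softmax of the two reward scores, computed stably via log-sum-exp.
    m = max(score_1, score_2)
    log_z = m + math.log(math.exp(score_1 - m) + math.exp(score_2 - m))
    log_probs = [score_1 - log_z, score_2 - log_z]
    return -sum(y * lp for y, lp in zip(soft_label, log_probs))

label = [0.6, 0.4]
loss_uninformed = soft_label_rm_loss(0.0, 0.0, label)                   # RM is indifferent
loss_matched = soft_label_rm_loss(math.log(0.6), math.log(0.4), label)  # softmax equals label
```

With a soft label the loss is minimized when the softmax of the RM scores matches the label, at which point it equals the entropy of the label rather than zero.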
Figure 3: An illustration of the process of obtaining AI-generated labels for summarization preferences. The LLM is first prompted to explain its thoughts on the quality of the two candidates (blue). The LLM’s response is then appended to the original prompt (orange) and fed to the LLM a second time to generate a preference distribution over “1” vs. “2” based on their log-probabilities (green).
Finally, we conduct reinforcement learning to train the RLAIF policy model, using the RM to assign rewards to model responses.
An alternative approach is to directly use LLM feedback as the reward signal in RL. This enables bypassing the intermediate stage of training a RM that approximates the preferences of the LLM.
The LLM is prompted to rate the quality of a generation between 1 and 10. Similar to the format mentioned in Section 2.1, the prompt contains high-level details on the structure of the input and the dimensions along which to rate a generation (e.g. factuality, coherence). Then, the likelihood of each score token between 1 and 10 is computed, the likelihoods are normalized to a probability distribution, a weighted score is calculated as
\[s(x \mid c) = \sum_{i=1}^{10} i \, P(i \mid x, c)\]and then the score is again normalized to the range \([-1, 1]\). Additional details on the prompting technique can be found in Appendix D.
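The score computation can be sketched as follows. The text does not specify how the weighted score is mapped to \([-1, 1]\); the linear rescaling from \([1, 10]\) used here is an assumption, as are the function name and the example inputs.

```python
def direct_reward(likelihoods: list[float]) -> float:
    """Turn likelihoods of the score tokens "1".."10" into a reward in [-1, 1]."""
    if len(likelihoods) != 10:
        raise ValueError("expected one likelihood per score token 1..10")
    total = sum(likelihoods)
    probs = [l / total for l in likelihoods]  # normalize to a distribution
    # Weighted score s(x|c) = sum_i i * P(i|x, c), which lies in [1, 10].
    s = sum(i * p for i, p in enumerate(probs, start=1))
    # Assumed linear map from [1, 10] to [-1, 1].
    return 2.0 * (s - 1.0) / 9.0 - 1.0

reward_best = direct_reward([0.0] * 9 + [1.0])   # all mass on score 10
reward_worst = direct_reward([1.0] + [0.0] * 9)  # all mass on score 1
reward_uniform = direct_reward([1.0] * 10)       # no preference among scores
```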
Finally, RL is conducted in a similar manner to “distilled RLAIF”, where the direct score is used as reward instead of the score from a RM. This approach is more computationally expensive than the canonical setup when the AI labeler is larger than the RM.
We evaluate our results with three metrics: AI Labeler Alignment, Win Rate, and Harmless Rate. AI Labeler Alignment measures the accuracy of AI-labeled preferences with respect to human preferences. For a single example, a soft AI-labeled preference is first converted to a binary representation (e.g. \([0.6, 0.4] \rightarrow [1, 0]\)). Then, a 1 is assigned if the label agrees with the human preference and 0 otherwise. The alignment accuracy \(z_{acc}\) can be expressed as follows:
\[z_{acc} = \frac{1}{D} \sum_{j=1}^{D} \mathbf{1}\left[\arg\max_{k} (P_{AI})_{j,k} = (p_{human})_j\right]\]where \(D\) is the size of the preference dataset, \(P_{AI} \in \mathbb{R}^{D \times 2}\) is the matrix of soft AI preferences, \(k \in \{0, 1\}\) indexes the two responses, and \(p_{human} \in \mathbb{R}^{D}\) is the corresponding vector of human preferences, containing elements 0 or 1 to denote whether the first or second response is preferred, respectively.
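The metric can be sketched in a few lines. The tie-breaking rule (an exact tie counts as preferring the first response) is an assumption, since the text does not say how ties are binarized, and the example data is made up.

```python
def labeler_alignment(ai_prefs: list[list[float]], human_prefs: list[int]) -> float:
    """Fraction of examples where the binarized AI preference matches the human label."""
    if len(ai_prefs) != len(human_prefs) or not human_prefs:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = 0
    for soft, human in zip(ai_prefs, human_prefs):
        # Binarize the soft label: 0 if the first response is preferred, else 1.
        # Assumption: an exact tie is resolved in favor of the first response.
        ai_choice = 0 if soft[0] >= soft[1] else 1
        hits += int(ai_choice == human)
    return hits / len(human_prefs)

# Made-up soft AI labels and human labels for four examples.
z_acc = labeler_alignment(
    [[0.6, 0.4], [0.3, 0.7], [0.9, 0.1], [0.2, 0.8]],
    [0, 1, 1, 1],
)
```

Here the AI label disagrees with the human label only on the third example, so the alignment is 3 out of 4.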
Win Rate evaluates the end-to-end quality of two policies by measuring how often one policy is preferred by human annotators over another. Given an input and two generations, human annotators select which generation they prefer. The percentage of instances where policy A is preferred over policy B is referred to as the “win rate of A vs. B”. A 50% win rate indicates that A and B are equally preferred.
Harmless Rate measures the percentage of responses that are considered harmless by human evaluators. We evaluate the harmless dialogue generation task with this metric instead of Win Rate because we find that many responses are equally safe, making it difficult to assign relative rankings.
We use the following datasets for our experiments:
More dataset details can be found in Appendix C. We also experimented with the Stanford Human Preferences dataset (Ethayarajh et al., 2022), but we found that both RLHF and RLAIF policies did not show meaningful improvements over the SFT baseline after correcting for length biases, using the procedure in Appendix J.
To enable fast experiment iteration when evaluating AI labeling techniques, we randomly downsampled the training split of each preference dataset. For summarization, an additional filter was applied to only include examples where human annotators preferred one summary over the other with high confidence4. After downsampling and filtering, there were roughly 3-4k examples for each task5. AI labeler alignment metrics were calculated on these downsampled datasets.
3 www.reddit.com
4 This follows the evaluation procedure in Stiennon et al. (2020). Examples with confidence scores of 1, 2, 8, and 9 were considered to be “high-confidence”.
PaLM 2 (Google et al., 2023) is used as the LLM for labeling preferences. The versions used are instruction-tuned but not previously trained with RL. Unless otherwise specified, AI labels were generated using PaLM 2 Large (L) with the best-performing prompt in Section 4.4. For more details on LLM labeling, see Appendix D.
All SFT models are initialized from PaLM 2 Extra-Small (XS). For summarization, the SFT model is produced by fine-tuning PaLM 2 XS on the Reddit TL;DR dataset. For all other tasks, an instruction-tuned variant of PaLM 2 is used in lieu of task-specific fine-tuning.
RMs are also derived from PaLM 2 XS. RMs are fine-tuned on the entire training split of the corresponding preference dataset, where the label is the AI preference for AI feedback RMs and the original human preference label in the dataset for human feedback RMs. RM accuracies can be found in Appendix G.
In the RL phase, the policy is trained with a modified version of REINFORCE (Williams, 1992) adapted to the language modeling domain. Many recent works use Proximal Policy Optimization (PPO) (Schulman et al., 2017), a related method that adds a few techniques to make training more conservative and stable (e.g. clipping the objective function). We instead use REINFORCE with a baseline, given that it is simpler yet still effective for the problem at hand. Both policy and value models are initialized from the SFT model. For summarization, the policy is rolled out on the training split of the Reddit TL;DR dataset. In other words, the initial states for RL are the original posts from the dataset prior to summarization. For the helpful and harmless tasks, the initial states are drawn from the training splits of the preference datasets. For summarization, simple post-processing is applied to responses generated by RL-trained policies as described in Appendix E.
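For intuition, the textbook form of REINFORCE with a learned baseline for one sampled response can be sketched as below. This is not the modified variant used in this work (its details are in the appendices); the function name, the squared-error value loss, and the numbers are illustrative assumptions.

```python
def reinforce_with_baseline_terms(token_logprobs: list[float],
                                  reward: float,
                                  value_estimate: float) -> tuple[float, float]:
    """Scalar loss terms for one sampled response under REINFORCE with a baseline."""
    # Advantage: how much better the reward was than the baseline predicted.
    # In a real training loop it is treated as a constant in the policy term.
    advantage = reward - value_estimate
    # Policy loss: negative advantage-weighted log-likelihood of the response.
    policy_loss = -advantage * sum(token_logprobs)
    # Value loss: regress the baseline toward the observed reward.
    value_loss = (value_estimate - reward) ** 2
    return policy_loss, value_loss

# Hypothetical numbers: a three-token response that beat the baseline.
policy_loss, value_loss = reinforce_with_baseline_terms([-0.5, -1.0, -0.5], 0.8, 0.3)
```

Minimizing the policy term raises the likelihood of responses with positive advantage and lowers it for responses with negative advantage, while subtracting the baseline reduces the variance of the gradient estimate.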
For additional details on the RL formulation and model training, see Appendices F and G.
5 We sample 15%, 10%, and 10% of the training splits for summarization, helpful dialogue generation, and harmless dialogue generation, respectively.
For experiments evaluated by win rates, evaluators were presented with an input context and multiple responses generated from different policies (e.g. RLAIF, RLHF, and SFT). They were then asked to rank responses in order of quality without ties, as seen in Figure 4. Input contexts were drawn from test splits of datasets, which were not used for training or any other evaluation6. Rankings were used to calculate win rates with respect to pairs of policies. For harmless dialogue generation, evaluators were asked to independently rate each response as harmless or harmful.
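Deriving a pairwise win rate from tie-free rankings can be sketched as follows. The data layout (each ranking is a list of policy names ordered from best to worst) and the example rankings are assumptions made for illustration.

```python
def win_rate(rankings: list[list[str]], policy_a: str, policy_b: str) -> float:
    """Fraction of rankings in which policy_a is ranked above policy_b."""
    if not rankings:
        raise ValueError("need at least one ranking")
    # A lower index means a better rank; rankings contain no ties.
    wins = sum(1 for r in rankings if r.index(policy_a) < r.index(policy_b))
    return wins / len(rankings)

# Made-up rankings from four evaluation instances, best response first.
example_rankings = [
    ["RLAIF", "RLHF", "SFT"],
    ["RLHF", "RLAIF", "SFT"],
    ["RLAIF", "SFT", "RLHF"],
    ["SFT", "RLHF", "RLAIF"],
]
rlaif_vs_sft = win_rate(example_rankings, "RLAIF", "SFT")
rlaif_vs_rlhf = win_rate(example_rankings, "RLAIF", "RLHF")
```

In this toy data RLAIF beats SFT in three of four rankings and splits evenly with RLHF, which corresponds to the 50% "equally preferred" case.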
For more details on human evaluation, see Appendix I.
RLAIF achieves performance gains on par with or better than RLHF on all three tasks (see Figure 1 and Table 1). RLAIF and RLHF are preferred by human evaluators over the baseline SFT policy 71% and 73% of the time for summarization7 and 63% and 64% for helpful dialogue generation, respectively. The difference in win rates between RLAIF vs. SFT and RLHF vs. SFT is not statistically significant. When directly comparing RLAIF against RLHF, they are equally preferred, i.e. the win rate is not statistically significantly different from 50%. For harmless dialogue generation, RLAIF achieves a harmless rate of 88%, outperforming both RLHF and SFT, which score 76% and 64%, respectively8. Figure 5 contains an example of SFT, RLAIF, and RLHF summaries. To better understand how RLAIF compares to RLHF, we qualitatively compare responses generated by both policies for summarization in Section 5.
As observed in Stiennon et al. (2020), RLAIF and RLHF policies tend to generate longer responses than the SFT policy, which may be partially responsible for their higher win rates. We conduct post-hoc analysis to control for length and find that both RLAIF and RLHF policies still outperform the SFT policy, and by similar margins to one another. See Appendix J for details.
6 For summarization, we used the test split of Reddit TL;DR. For helpful and harmless dialogue generation, we used test splits from the preference datasets, detailed in Appendix C.
7 RLAIF and RLHF are also preferred over the human reference summaries in Reddit TL;DR 79% and 80% of the time, respectively.
8 RLAIF achieves a statistically significant improvement over RLHF and SFT, according to a two-sample t-test.
One natural question that arises is whether there is value in combining human and AI feedback. We experimented with combining both types of feedback but did not see an improvement beyond using human feedback alone. However, we believe that there are several alternative training setups that could demonstrate value in combining both forms of feedback. See Appendix K for details.
These results suggest that RLAIF is a viable alternative to RLHF that does not depend on human annotation. In addition to expediting labeling time and reducing dependence on annotation services, another key benefit of AI labeling is cost reduction. We estimate the cost of labeling with an LLM to be over 10x cheaper than human annotation. See Appendix L for detailed calculations.
In Section 4.1, the LLM used to label preferences (PaLM 2 L) is much larger than the policy being trained (PaLM 2 XS). Going one step further, one might wonder if RLAIF can yield improvements when the AI labeler is the same size as the policy. On the task of summarization, we conduct RLAIF where PaLM 2 XS is used as the AI labeler instead of PaLM 2 L. The rest of the setup mimics the experiment in Section 4.1. We refer to this setup as “same-size RLAIF”.
Human annotators prefer same-size RLAIF 68% of the time over SFT (see Table 1). For reference, RLAIF using an AI labeler larger than the policy is preferred 71% of the time over SFT9. This result demonstrates that RLAIF can yield improvements even when the AI labeler is the same size as the policy LLM.
We note that the AI labeler and initial policy are not the exact same model. The AI labeler is the instruction-tuned PaLM 2 XS, whereas the initial policy is PaLM 2 XS fine-tuned on Reddit TL;DR summarization. Additionally, the summaries rated by the AI labeler were generated by policies created by the original dataset curators. For these reasons, we do not consider this experiment a strict case of “self-improvement” (Huang et al., 2022). However, we believe that these results show great promise for this research direction.
9 The difference in win rates between “same-size RLAIF vs. SFT” and “RLAIF vs. SFT” is not statistically significant at alpha = 0.05 (two-sample t-test, p-value = 0.07).
Table 1: Left side: Win rates when comparing generations from two different models for the summarization and the helpful dialogue tasks, judged by human evaluators. Right side: Harmless rates across policies for the harmless dialogue task, judged by human evaluators.
In Sections 4.1 and 4.2, AI feedback was distilled into a RM. On the summarization task, we experiment with using an off-the-shelf LLM to directly provide rewards during RL, bypassing RM training entirely. Since using a large AI labeler in RL is computationally expensive, we use the smaller instruction-tuned PaLM 2 XS as the off-the-shelf LLM. We refer to this setup as “direct RLAIF”.
Human annotators prefer responses from direct RLAIF 74% of the time over SFT responses (see Table 1). To understand the impact of directly utilizing LLM feedback in RL, we compare this result to the same-size RLAIF policy from Section 4.2, which solely differs in training a RM that provides rewards during RL. Direct RLAIF outperforms same-size RLAIF, which achieves a statistically significantly lower win rate of 68%. Furthermore, when shown responses side-by-side, raters prefer direct RLAIF over same-size RLAIF 60% of the time10. One hypothesis for the improved quality is that bypassing the distillation from AI preferences into a RM enables information to flow directly from the off-the-shelf LLM to the policy.
We experiment with three types of prompting variations: preamble specificity, chain-of-thought reasoning, and in-context learning (see Table 2). We observe that eliciting chain-of-thought reasoning generally improves AI labeler alignment, while the impacts of preamble specificity and in-context learning vary across tasks. The best prompts outperform the base prompts (“Base 0-shot”) by +1.9%, +1.3%, and +1.7% for summarization, helpfulness, and harmlessness, respectively.
10 This is statistically significantly different from 50% according to a two-sample t-test.
Table 2: We observe that eliciting chain-of-thought reasoning tends to improve AI labeler alignment, while few-shot prompting and detailed preambles have mixed effects across tasks. H1 refers to helpfulness, H2 to harmlessness.
Detailed preambles consistently improve alignment for summarization, while yielding mixed results for helpful and harmless dialogue generation. We hypothesize that summarization benefits more from a specific preamble due to the high complexity of this task. On the other hand, rating helpfulness and harmlessness are more intuitive to grasp, and therefore may benefit less from detailed instructions.
Chain-of-thought reasoning improves alignment consistently for summarization. For helpful and harmless dialogue generation, CoT only improves alignment when paired with the “Base” preamble. Surprisingly, we observe that few-shot in-context learning only improves alignment for harmless dialogue generation11. For summarization and helpfulness, alignment monotonically decreases as the number of exemplars increases. It seems unlikely that this effect is a result of exemplar quality, as exemplars were carefully handpicked to be high-quality and representative of each preference task. Furthermore, we conducted 10 trials for “Base 1-shot” on summarization, where a different exemplar was randomly selected for each trial. The maximum AI labeler alignment from all trials was 76.1%, which still did not surpass “Base 0-shot” in terms of AI labeler alignment. One hypothesis for why exemplars do not help is that the summarization and helpful dialogue generation tasks may already be sufficiently well-understood by the powerful AI labeler, rendering the exemplars unhelpful or distracting. It’s interesting to note that in-context learning is still an important research area that is not fully understood (Min et al., 2022; Wang et al., 2022a).
11 We verified that all inputs used in these experiments fit
For summarization, we compare against human inter-annotator agreement to get a sense of how well our LLM labeler performs in absolute terms. Stiennon et al. (2020) estimated that the agreement rate for the OpenAI human preference dataset was 73-77%, suggesting that the off-the-shelf LLM achieving 78% alignment performs well in absolute terms. We also conduct experiments with self-consistency (Wang et al., 2022b), where multiple chain-of-thought rationales are sampled with temperature T > 0. The preference distributions generated by the LLM are averaged together to arrive at the final preference label. We find that self-consistency strictly degrades AI labeler alignment (see Appendix M).
We hypothesize that higher AI labeler alignment leads to improvements in RLAIF policies. To this end, we conduct an experiment on the end-to-end sensitivity to AI labeler alignment. Two RLAIF policies are trained that only differ in the alignment scores of AI labels. Results show that the policy trained with more aligned AI labels achieves a significantly higher win rate. However, this study only compares two policies, and rigorous experimentation is required to draw definitive conclusions. See Appendix N for details.
Large model sizes are not widely accessible and can be slow and expensive to run. On the task of summarization, we experiment with labeling preferences with varying LLM sizes and observe a strong relationship between size and alignment (see Table 3). Alignment decreases by 4% when moving from PaLM 2 Large (L) to PaLM 2 Small (S), and by another 11% when moving down to PaLM 2 XS, a trend consistent with scaling behaviors observed in other work (Kaplan et al., 2020). Besides general model capability, another contributing factor to this trend may be that smaller LLMs are more susceptible to position bias (see Appendix B).
On the other end of this trend, these results also suggest that scaling up AI labeler size may produce even higher quality preference labels. Since the AI labeler is only used to generate preference examples once and is not called during RL, using an even larger AI labeler is not necessarily prohibitively expensive.
Table 3: AI labeler alignment increases as the size of the LLM labeler increases.
To better understand how RLAIF compares to RLHF, we inspected responses generated by both policies for the summarization task. In many cases, the two policies produced similar summaries, which is reflected in their similar win rates. However, we identified two patterns where they sometimes diverged.
The first pattern we observed is that in some cases, RLHF hallucinates when RLAIF does not. The hallucinations in RLHF summaries sound plausible but are inconsistent with the original text. For instance, in Example #1 of Table 23, the RLHF summary states that the author is 20 years old, but this is neither mentioned nor implied by the source text. The second pattern we observed is that RLAIF sometimes produces less coherent or grammatical summaries than RLHF. For instance, in Example #1 of Table 24, the RLAIF summary generates run-on sentences.
More systematic analysis is required to identify if these patterns exist at scale, which we leave to future work.
LLMs have shown impressive performance over a wide range of NLP tasks (Brown et al., 2020; Thoppilan et al., 2022; Chowdhery et al., 2022; Google et al., 2023; OpenAI, 2023a). For several of these tasks, RL has emerged as an effective optimization technique. While initial applications of RL on tasks such as translation (Wu et al., 2016, 2018) and summarization (Gao et al., 2019; Wu and Hu, 2018) used automatic evaluation metrics as rewards, such simplified formulations of rewards did not fully align with human notions of quality.
Reinforcement learning from human feedback (Christiano et al., 2017) has been used as a technique to directly align LLMs with human preferences (Ziegler et al., 2019) through training a reward model on pairwise comparisons of natural language responses. It has been successfully applied for summarization (Stiennon et al., 2020), generalized instruction following (Ouyang et al., 2022; Lai et al., 2023), dialogue (Gilardi et al., 2023; Manyika, 2023; Glaese et al., 2022; Bai et al., 2022a) and question answering (Nakano et al., 2021).
LLMs have also been extensively used for data generation (Wang et al., 2021b; Meng et al., 2023), augmentation (Feng et al., 2021) and in self-training setups (Wang et al., 2022b; Madaan et al., 2023). Bai et al. (2022b) introduced the idea of RLAIF, which used LLM labeled preferences in conjunction with human labeled preferences to jointly optimize for the two objectives of helpfulness and harmlessness. Recent works have also explored related techniques for generating rewards from LLMs (Roit et al., 2023; Kwon et al., 2022; Yang et al., 2023). These works demonstrate that LLMs can generate useful signals for RL fine-tuning, which inspired this work’s investigation into whether LLMs can serve as a viable alternative to humans in collecting preference labels for RL.
In this work, we show that RLAIF achieves comparable improvements to RLHF on three text generation tasks. Our experiments show that RLAIF greatly improves upon an SFT baseline, and the margin of improvement is on par with or greater than that of RLHF. Furthermore, in head-to-head comparisons, RLAIF and RLHF are preferred at similar rates by humans. Additionally, we show that RLAIF is effective even when the LLM labeler is the same size as the policy, and directly prompting the LLM labeler to provide rewards during RL can outperform the canonical RLAIF setup that distills preferences into a separate RM. Finally, we study the impact of AI labeling techniques on alignment to human preferences.
While this work highlights the potential of RLAIF, there remain many fascinating open questions, such as whether conducting RLAIF iteratively can achieve additional gains (i.e. use the most recent RLAIF policy to generate new response pairs, conduct RLAIF, and repeat), how RLAIF can be adapted to a model-based RL setting where both human and assistant are modeled by LLMs, and how AI feedback can be leveraged for more specific credit assignment. We leave these questions for future work.
Ethics
One ethical consideration concerns the utilization of AI-generated feedback as a source for model alignment. There exists a potential risk of transferring biases from the off-the-shelf LLM into the generated preferences. This in turn may result in RL-trained policies further amplifying biases, thereby inadvertently misaligning models and potentially causing harm. Extreme caution must be exercised, especially when deploying these models in high-stakes domains such as medicine, law, and employment, where they have the potential to significantly impact human lives in adverse ways. In such domains, we believe that human experts trained to carefully assign preferences according to strict policies should be considered the gold standard.
Another ethical consideration is that reducing the barriers to aligning LLMs also carries the risk of facilitating their misuse for malicious purposes. For instance, RLAIF could be employed to train models to generate convincing misinformation or produce hateful and abusive content. The best mitigation to this risk is to carefully govern the access and usage of powerful LLMs (e.g. limiting “white-box” access), to prevent bad actors from misusing them.
Reproducibility
To promote the reproducibility of this work, many of the details of this research are shared throughout the paper. Open-source datasets are elaborated upon in Appendix C, LLM labeling details in Appendix D, the RL formulation in Appendix F, model training details in Appendix G, human evaluation details in Appendix I, and the most critical prompts used in the Appendix (e.g. Tables 17, 21, and 22). Please reach out to authors for any additional questions or requests.
PaLM 2 models are available through Google Cloud’s Vertex API, and the experiments in this work may also be repeated with other publicly available LLMs.