Contents
1. Introduction
Recent large language models (LLMs), trained with billions of parameters, have shown improved performance across natural language processing, mathematical reasoning, programming, and other domains. A key part of training such models is incorporating user feedback to improve human-model interaction. Techniques used for this purpose include reinforcement learning from human feedback (RLHF), contrastive learning, and language modeling, each of which aims to use preference data efficiently.
However, these methods can suffer from a gap between the model and the data distribution that emerges during training, which degrades performance. To address this, the paper proposes Adversarial Preference Optimization (APO), an adversarial learning framework. Building on the idea of generative adversarial networks (GANs), APO optimizes the model through a dynamic interplay between the reward model and the LLM.
2. Preliminaries
2.1 Human Preference Alignment
The LLM adjusts its response-generation policy \(\pi_\theta(y|x)\) based on human preference data so that it generates responses users prefer. A reward model \(r_\phi(x, y)\) plays a central role in this process and is trained with the Bradley-Terry model as follows.
\[\mathcal{L}_{\text{BT}}(r_\phi; \mathcal{D}_P) = -\mathbb{E}_{(x, y_{\text{good}}, y_{\text{bad}}) \in \mathcal{D}_P} \left[ \log \frac{1}{1 + \exp(-(r_\phi(x, y_{\text{good}}) - r_\phi(x, y_{\text{bad}})))} \right]\]With this loss, the reward model learns the preference between good and bad responses, and the LLM's response-generation policy is then optimized on top of it.
2.2 Generative Adversarial Networks (GANs)
A GAN is trained with a generator that produces realistic-looking data samples and a discriminator that distinguishes the generated samples from real data. The paper applies this principle to LLM response generation: the reward model plays the role of the discriminator, and the response-generation policy plays the role of the generator.
3. Adversarial Preference Optimization (APO)
The APO framework sets up an adversarial game in which the LLM tries to generate responses that receive high reward scores, while the reward model tries to enlarge the score gap between golden (annotated) responses and generated responses. KL divergence is used to regularize the updates of both the policy and the reward model, which prevents over-fitting and helps the model keep generating diverse responses.
3.1 Reward Optimization Step
When updating the reward model, the following objective is optimized.
\[L_{\text{APO-RM}}(r_\phi) = \mathbb{E}_{(x, y) \sim P_{\text{gold}}(x, y)}[r_\phi(x, y)] - \mathbb{E}_{(x, y) \sim P_\theta(x, y)}[r_\phi(x, y)]\]This maximizes the reward-score gap between golden data and the generated responses.
3.2 Policy Optimization Step
When updating the policy, the following objective is optimized.
\[\max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D}}\big[\mathbb{E}_{y \sim \pi_\theta(y|x)}[r_\phi(x, y)]\big] - \beta \text{KL}[\pi_\theta(y|x) \,\|\, \pi_{\text{ref}}(y|x)]\]Through this objective, the LLM is trained to generate responses that receive high scores from the reward model.
4. Experiments
To verify the effectiveness of the APO framework, experiments were conducted on the Helpful&Harmless dataset with Alpaca as the base model. The results show that APO outperforms existing methods, especially in preference accuracy and calibration error.
5. Conclusion
The APO framework provides an efficient and effective approach to preference alignment of large language models.
Learned from massive textual data with billions of parameters, large language models (LLMs), such as ChatGPT (OpenAI, 2023a) and LLaMA-2 (Touvron et al., 2023b), have shown remarkable AI capabilities, especially in the domains of natural language processing (Jiao et al., 2023; Han et al., 2023), logical (mathematical) reasoning (Liu et al., 2023a; Frieder et al., 2023), and programming (Surameery & Shakor, 2023; Tian et al., 2023). Among the training techniques that push LLMs to such excellent performance, human preference alignment fine-tunes LLMs to follow users’ feedback, which has been widely recognized as essential for improving human-model interaction (Ouyang et al., 2022; Yuan et al., 2023; Rafailov et al., 2023; Dong et al., 2023). However, obtaining highly qualified human feedback requires meticulous annotations of all manner of query-response pairs on various topics (Askell et al., 2021), which is rather challenging and forms a sharp contrast to the easy access to the enormous unsupervised text used for pretraining. Hence, the limitation of preference data collection raises demands for the learning efficiency of preference alignment methods (Yuan et al., 2023; Sun et al., 2023).
To utilize preference data, current human feedback aligning methods are proposed mainly from three perspectives (Wang et al., 2023b): reinforcement learning (Ouyang et al., 2022), contrastive learning (Yuan et al., 2023; Rafailov et al., 2023; Liu et al., 2023c), and language modeling (Dong et al., 2023; Touvron et al., 2023b; Wang et al., 2023a). Reinforcement learning with human feedback (RLHF) (Kreutzer et al., 2018; Ziegler et al., 2019) is the earliest exploration and has become the mainstream approach for LLMs’ preference optimization (Ouyang et al., 2022; Touvron et al., 2023b). RLHF first learns a reward model (RM) from the human preference data, then optimizes the expected reward score of the LLM’s outputs via the Proximal Policy Optimization (PPO) algorithm (Schulman et al., 2017). Although widely used, RLHF has been criticized as not only unstable during fine-tuning but also complicated in implementation and computational resource consumption (Yuan et al., 2023; Rafailov et al., 2023). For more efficient and steady training, instead of directly optimizing the non-differentiable rewards, contrastive learning methods (Yuan et al., 2023; Rafailov et al., 2023; Zhao et al., 2023) enlarge the likelihood gap between positive and negative response pairs, where the positive and negative labels can be either annotated by humans or predicted by reward models. Alternatively, language-modeling-based methods (Dong et al., 2023; Liu et al., 2023b; Wang et al., 2023a) keep using the language modeling loss to align preference, but with different data preparation strategies. For example, rejection sampling (Dong et al., 2023; Touvron et al., 2023b) selects the responses with top reward scores as the language-modeling fine-tuning data, while Wang et al. (2023a) and Liu et al. (2023b) add different prompts to different responses based on the corresponding preference levels.
Although contrastive-learning-based and language-modeling-based methods have partly alleviated the inefficiency of RLHF, the sampling distribution shifting problem (Touvron et al., 2023b) still hinders the alignment effectiveness: after a few steps of preference alignment updates, a distribution gap emerges between LLM-generated samples and the preference-annotated data. Consequently, the reward model rapidly degrades on newly generated LLM responses unless it is additionally trained on new samples from the shifted distribution. To address this problem, most of the aforementioned methods (Ouyang et al., 2022; Dong et al., 2023; Yuan et al., 2023) require additional annotation of human feedback on newly generated responses (Touvron et al., 2023b) after a few LLM updating steps, which leads to increasingly massive manpower costs (Askell et al., 2021). Besides, the vast time consumption of extra manual annotation also significantly slows down the feedback alignment learning process.
To reduce the manual annotation efforts and further improve the preference optimization efficiency, we propose a novel adversarial learning framework called Adversarial Preference Optimization (APO). Inspired by generative adversarial networks (GANs) (Goodfellow et al., 2014; Arjovsky et al., 2017), we conduct an adversarial game between the RM and the LLM agent: the LLM generates responses to maximize the expected reward score, while the RM aims to distinguish the score difference between golden and sampled responses. To verify the effectiveness of our APO framework, we conduct experiments on the Helpful&Harmless (Bai et al., 2022) datasets with Alpaca (Taori et al., 2023) as the base LLM. With the same amount of human preference data, both the LLM agent and the reward model receive additional performance gains through the APO game, compared with the naive rejection sampling baselines.
Human preference alignment aims to fine-tune the response-generation policy \(\pi_\theta(y|x)\) of an LLM agent with a group of human preference data \(\mathcal{D}_P = \{(x, y_{\text{good}}, y_{\text{bad}})\}\), so that the LLM agent can generate more human-preferred responses to improve the human-model interaction quality. To achieve this, a reward model (RM) (Christiano et al., 2017; Ouyang et al., 2022) \(r_\phi(x, y)\) is usually utilized to evaluate the quality of responses from \(\pi_\theta(y|x)\), by learning from the human preference data \(\mathcal{D}_P\) with the Bradley-Terry (BT) ranking loss (Bradley & Terry, 1952):
\[\mathcal{L}_{\text{BT}}(r_\phi; \mathcal{D}_P) = -\mathbb{E}_{(x, y_{\text{good}}, y_{\text{bad}}) \in \mathcal{D}_P} \left[ \log \sigma\big(r_\phi(x, y_{\text{good}}) - r_\phi(x, y_{\text{bad}})\big) \right]\]where \(\sigma(\cdot)\) is the Sigmoid activation function (Han & Moraga, 1995). If we denote \(y \succ y'\) as “response \(y\) is preferred to \(y'\)”, then a model-predicted probability \(Q_{r_\phi}(y \succ y'|x)\) can be induced from the reward scores \(r_\phi(x, y), r_\phi(x, y')\) with the following parameterization:
\[Q_{r_\phi}(y \succ y'|x) = \sigma(r_\phi(x, y) - r_\phi(x, y'))\]With equation 2, training the RM with the Bradley-Terry ranking loss can be explained as log-likelihood maximization of \(Q_{r_\phi}\), i.e., \(\mathcal{L}_{\text{Ranking}}(r_\phi; \mathcal{D}_P) = -\mathbb{E}_{\mathcal{D}_P} [\log Q_{r_\phi}(y_{\text{good}} \succ y_{\text{bad}}|x)]\). With a learned reward model \(r_\phi(x, y)\), human preference alignment methods (Ouyang et al., 2022; Rafailov et al., 2023; Liu et al., 2023c) aim to maximize the reward expectation of generated responses with the following objective:
\[\mathcal{L}_{\text{Pref}}(\pi_\theta; r_\phi, \mathcal{D}_P) = \mathbb{E}_{x \sim \mathcal{D}_P}\big[\mathbb{E}_{y \sim \pi_\theta(y|x)}[r_\phi(x, y)]\big] - \beta \text{KL}[\pi_\theta(y|x) \,\|\, \pi_{\text{ref}}(y|x)]\]where \(\pi_{\text{ref}}(y|x)\) is the base reference policy, commonly set to the supervised fine-tuned (SFT) language model (Ouyang et al., 2022), and \(\beta > 0\) is a hyper-parameter re-weighting the reward expectation and the KL-divergence (Kullback, 1997) regularizer. In practice, the learning policy \(\pi_\theta(y|x)\) is also initialized from the reference \(\pi_{\text{ref}}(y|x)\). The regularizer \(\text{KL}[\pi_\theta(y|x) \,\|\, \pi_{\text{ref}}(y|x)]\) in equation 3 prevents \(\pi_\theta(y|x)\) from degenerating to repeating a single response with the highest reward score and preserves the generation diversity. Since the sampled responses \(y\) are discrete, it is challenging to directly back-propagate gradients from the reward \(r_\phi(x, y)\) to the policy \(\pi_\theta(y|x)\). The typical solution to the preference optimization in equation 3 is reinforcement learning (RLHF) (Ouyang et al., 2022), especially with the proximal policy optimization (PPO) algorithm (Schulman et al., 2017).
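For concreteness, here is a minimal PyTorch-style sketch of the two quantities above: the Bradley-Terry ranking loss for the RM and a sample-based estimate of the KL-regularized reward objective. The tensor interfaces are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def bt_ranking_loss(r_good: torch.Tensor, r_bad: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry ranking loss: -E[log sigmoid(r(x, y_good) - r(x, y_bad))].

    r_good / r_bad: shape (batch,), scalar reward scores for the preferred
    and dispreferred responses of the same batch of queries.
    """
    # logsigmoid is numerically stabler than log(sigmoid(.)).
    return -F.logsigmoid(r_good - r_bad).mean()

def kl_regularized_reward(rewards: torch.Tensor,
                          logp_policy: torch.Tensor,
                          logp_ref: torch.Tensor,
                          beta: float = 0.1) -> torch.Tensor:
    """Monte-Carlo estimate of E[r_phi(x, y)] - beta * KL[pi_theta || pi_ref].

    rewards:     (batch,) reward scores of responses y ~ pi_theta(y|x).
    logp_policy: (batch,) sequence log-prob log pi_theta(y|x) of those samples.
    logp_ref:    (batch,) sequence log-prob log pi_ref(y|x) of the same samples.
    """
    # KL[pi_theta || pi_ref] = E_{y~pi_theta}[log pi_theta(y|x) - log pi_ref(y|x)],
    # so the per-sample log-ratio is a single-sample estimate of the KL term.
    kl_estimate = logp_policy - logp_ref
    # This estimate is what RLHF maximizes (e.g., with PPO), since gradients
    # cannot flow through the discrete sampling of y directly.
    return (rewards - beta * kl_estimate).mean()
```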
However, RLHF has been recognized as practically suffering from implementation complexity and training instability (Yuan et al., 2023). Hence, recent studies (Rafailov et al., 2023; Yuan et al., 2023; Dong et al., 2023; Liu et al., 2023c) try to avoid the reinforcement learning scheme during preference optimization. More specifically, DPO (Rafailov et al., 2023) finds a connection between the reward model and LLM’s optimal solution, then replaces the reward model with the likelihood ratio of the policy and its reference:
\[\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}, \mathcal{D}_P) = -\mathbb{E}_{(x, y_{\text{good}}, y_{\text{bad}}) \in \mathcal{D}_P}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_{\text{good}}|x)}{\pi_{\text{ref}}(y_{\text{good}}|x)} - \beta \log \frac{\pi_\theta(y_{\text{bad}}|x)}{\pi_{\text{ref}}(y_{\text{bad}}|x)}\right)\right]\]Analogously, other methods consider human feedback learning from the perspective of contrastive learning. For example, RRHF (Yuan et al., 2023) proposes a ranking loss as:
\[\mathcal{L}_{\text{RRHF}}(\pi_\theta; r_\phi, \mathcal{D}) = \mathbb{E}_{(x, y_{\text{best}}, y) \in \mathcal{D}} \left[ \max\big(0,\; \log \pi_\theta(y|x) - \log \pi_\theta(y_{\text{best}}|x)\big) \right]\]where \(y_{\text{best}}\) is the corresponding response to \(x\) with the highest reward, and the preference data \(\mathcal{D}\) can be built from the human annotation \(\mathcal{D}_P\) or RM ranking results. Additionally, Zhao et al. (2023) propose a ranking loss similar to equation 5 with a margin relaxation to the log-likelihood difference. Moreover, rejection sampling (RJS) methods (Touvron et al., 2023b; Liu et al., 2023c) directly conduct supervised fine-tuning (SFT) on \(y_{\text{best}}\) to further simplify the human preference alignment process. The rejection sampling optimization (RJS) loss can be written as:
\[\mathcal{L}_{\text{RJS}}(\pi_\theta; r_\phi, \mathcal{D}_P) = -\mathbb{E}_{x \sim \mathcal{D}_P}[\log \pi_\theta(y_{\text{best}}|x)]\]where \(y_{\text{best}} = \arg \max_{1 \leq s \leq S}\{r_\phi(x, y_s)\}\) is the sampled response with the highest reward score among \(S\) candidates drawn from \(\pi_\theta(y|x)\).
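As a small illustration of the rejection-sampling step, the following sketch picks \(y_{\text{best}}\) among \(S\) sampled responses by reward score; `sample_responses` and `reward_fn` are hypothetical stand-ins for the LLM sampler and the reward model.

```python
from typing import Callable, List, Tuple

def select_best_response(query: str,
                         sample_responses: Callable[[str, int], List[str]],
                         reward_fn: Callable[[str, str], float],
                         num_samples: int = 4) -> Tuple[str, float]:
    """Rejection sampling selection: y_best = argmax_{1<=s<=S} r_phi(x, y_s)."""
    candidates = sample_responses(query, num_samples)          # y_1, ..., y_S ~ pi_theta(y|x)
    scored = [(response, reward_fn(query, response)) for response in candidates]
    best_response, best_score = max(scored, key=lambda pair: pair[1])
    # In RJS fine-tuning, (query, best_response) then serves as an SFT target.
    return best_response, best_score
```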
Generative adversarial networks (GANs) (Goodfellow et al., 2014) are a classical group of unsupervised machine learning approaches that can fit complicated real-data distributions in an adversarial learning scheme. More specifically, GANs use a discriminator \(D(\cdot)\) and a generator \(G(\cdot)\) to play a min-max game: the generator tries to cheat the discriminator with real-looking generated samples, while the discriminator aims to distinguish the true data and the samples. The GANs’ objective is:
\[\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]\]where \(z\) is a random vector drawn from the prior \(p_z(z)\) to induce the generated sample distribution. Optimizing equation 7 can be theoretically shown to minimize the Jensen-Shannon divergence between the distributions of real data and generated samples (Goodfellow et al., 2014). Moreover, Arjovsky et al. (2017) replace the Jensen-Shannon divergence with the Wasserstein distance (Villani, 2009) and propose the Wasserstein GAN objective:
\[\min_G \max_{f: \|f\|_L \leq K} \mathbb{E}_{x \sim p_{\text{data}}(x)}[f(x)] - \mathbb{E}_{x \sim p_g(x)}[f(x)]\]where \(\|f\|_L \leq K\) requires \(f(\cdot)\) to be a K-Lipschitz continuous function. Wasserstein GANs have been recognized as having higher training stability than the original GANs (Arjovsky et al., 2017).
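To make the later analogy with the reward-model update concrete, here is a minimal sketch of the Wasserstein critic step: the critic is trained to enlarge the average score gap between real and generated samples, with weight clipping as the crude Lipschitz enforcement used in the original WGAN recipe. The `critic` module is a generic placeholder.

```python
import torch
import torch.nn as nn

def wgan_critic_loss(critic: nn.Module,
                     real: torch.Tensor,
                     fake: torch.Tensor) -> torch.Tensor:
    """Negative of E_real[f(x)] - E_fake[f(x)]; minimizing it maximizes the score gap."""
    return -(critic(real).mean() - critic(fake).mean())

def clip_weights(critic: nn.Module, bound: float = 0.01) -> None:
    """Rough K-Lipschitz enforcement via weight clipping (Arjovsky et al., 2017)."""
    with torch.no_grad():
        for param in critic.parameters():
            param.clamp_(-bound, bound)
```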
In natural language generation, GANs have also been empirically explored (Zhang et al., 2016; 2017), where a text generator samples real-looking text and a discriminator judges between the true data and the textual samples. As introduced in Section 2.1, the response-generation policy \(\pi_\theta(y|x)\) can be regarded as the generator of a conditional text GAN (Mirza & Osindero, 2014). Besides, the reward model \(r_\phi(x, y)\) plays an analogous role to a discriminator, judging the quality of generated responses.
Figure 1: The APO framework. In the RM updating step, the reward model learns by distinguishing the difference between the manually annotated golden responses and the LLM-generated responses. In the LLM updating step, the LLM agent updates to generate higher-quality responses with the feedback from the reward model.
We begin with a revisit of the human preference alignment objective (equation 3) in a mathematical optimization form:
\[\max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D}}\big[\mathbb{E}_{y \sim \pi_\theta(y|x)}[r_\phi(x, y)]\big], \quad \text{s.t.} \ \text{KL}[\pi_\theta(y|x) \,\|\, \pi_{\text{ref}}(y|x)] < \epsilon^2\]where we aim to maximize the expected reward value with respect to the generation policy \(\pi_\theta(y|x)\), under a KL-divergence constraint with the reference policy \(\pi_{\text{ref}}(y|x)\). Applying the method of Lagrange multipliers (Beavis & Dobbs, 1990) to equation 9, one can easily obtain the widely-used preference optimization objective in equation 3. As discussed in Section 1, the above optimization form becomes ineffective after several policy updating steps, because the generated sample distribution diverges from the preference data distribution on which the RM \(r_\phi(x, y)\) was trained. To address this problem, we aim to update the reward model correspondingly during the policy fine-tuning.
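To spell out the Lagrangian step, here is a brief sketch, in which the multiplier is identified with the hyper-parameter \(\beta\) of equation 3 and the constant term involving the constraint bound \(\epsilon^2\) is dropped since it does not affect the maximizer:
\[\begin{aligned}
&\max_{\pi_\theta}\ \mathbb{E}_{x \sim \mathcal{D}}\big[\mathbb{E}_{y \sim \pi_\theta(y|x)}[r_\phi(x, y)]\big] \quad \text{s.t.}\ \ \text{KL}[\pi_\theta(y|x) \,\|\, \pi_{\text{ref}}(y|x)] < \epsilon^2 \\
\Longrightarrow\ &\max_{\pi_\theta} \min_{\beta \ge 0}\ \mathbb{E}_{x \sim \mathcal{D}}\big[\mathbb{E}_{y \sim \pi_\theta(y|x)}[r_\phi(x, y)]\big] - \beta\big(\text{KL}[\pi_\theta(y|x) \,\|\, \pi_{\text{ref}}(y|x)] - \epsilon^2\big) \\
\approx\ &\max_{\pi_\theta}\ \mathbb{E}_{x \sim \mathcal{D}}\big[\mathbb{E}_{y \sim \pi_\theta(y|x)}[r_\phi(x, y)]\big] - \beta\,\text{KL}[\pi_\theta(y|x) \,\|\, \pi_{\text{ref}}(y|x)],
\end{aligned}\]
which recovers equation 3 when \(\beta\) is treated as a fixed hyper-parameter.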
Inspired by generative adversarial networks (GANs) (Goodfellow et al., 2014), we design an adversarial learning framework to align human preferences:
\[\min_{\pi_\theta} \max_{r_\phi} \ \mathbb{E}_{(x, y) \sim P_{\text{gold}}(x, y)}[r_\phi(x, y)] - \mathbb{E}_{(x, y) \sim P_\theta(x, y)}[r_\phi(x, y)], \quad \text{s.t.} \ \text{KL}[\pi_\theta(y|x) \,\|\, \pi_{\text{ref}}(y|x)] < \epsilon^2, \ \ \text{KL}[P(y \succ y'|x) \,\|\, Q_{r_\phi}(y \succ y'|x)] < \eta^2\]where \(P_\theta(x, y) = P_D(x) \pi_\theta(y|x)\) is the joint distribution of input queries and generated responses, and \(P_{\text{gold}}(x, y)\) denotes the annotated golden data distribution.
Based on equation 10, we conduct an adversarial game, in which the policy \(\pi_\theta(y|x)\) needs to improve its response quality to get a higher expected reward, while the reward model \(r_\phi(x, y)\) tries to enlarge the scoring gap between the golden responses and the generations from \(\pi_\theta(y|x)\). We call this novel optimization problem Adversarial Preference Optimization (APO).
Besides, following the original preference alignment objective, we add two KL-divergence regularizers, on \(\pi_\theta\) and on \(r_\phi\), to prevent over-fitting and degeneration. Here \(P(y \succ y'|x)\) denotes the ground-truth human preference probability, and \(Q_{r_\phi}(y \succ y'|x)\) is described in equation 2.
Note that we use the reverse \(\text{KL}[\pi_\theta \,\|\, \pi_{\text{ref}}]\) to constrain the generative model \(\pi_\theta\) but the forward \(\text{KL}[P \,\|\, Q_{r_\phi}]\) for the discriminative model \(r_\phi\). We provide an intuitive explanation for this separate forward-reverse KL regularization design: the reverse \(\text{KL}[\pi_\theta \,\|\, \pi_{\text{ref}}]\) can be estimated with \(\pi_\theta\)-generated samples, paying more attention to the generation quality, while the forward \(\text{KL}[P \,\|\, Q_{r_\phi}]\) is practically estimated with ground-truth preference data, focusing on the preference-fitting ability of the reward model.
To play the above adversarial game, we alternately update one of \(\pi_\theta(y|x)\) and \(r_\phi(x, y)\) with the other’s parameters fixed. Next, we will provide detailed descriptions of APO’s reward optimization step and policy optimization step separately.
In the reward optimization step, we fix the generator \(\pi_\theta(y|x)\) and update the reward model \(r_\phi(x, y)\). Note that the term \(\text{KL}[\pi_\theta(y|x) \,\|\, \pi_{\text{ref}}(y|x)]\) in equation 10 does not depend on \(r_\phi\), so we can simplify the objective for reward model updates:
\[L_{\text{APO-RM}}(r_\phi) = \mathbb{E}_{(x, y) \sim P_{\text{gold}}(x, y)}[r_\phi(x, y)] - \mathbb{E}_{(x, y) \sim P_\theta(x, y)}[r_\phi(x, y)], \quad \text{s.t.} \ \text{KL}[P(y \succ y'|x) \,\|\, Q_{r_\phi}(y \succ y'|x)] < \eta^2\]Equation 11 indicates that the reward model enlarges the expected score gap between golden answers and generated responses to challenge \(\pi_\theta(y|x)\) for better generation quality. Note that equation 11 has a similar form to the objective of Wasserstein GANs (equation 8), which can be intuitively explained as the calculation of the Wasserstein distance between the distributions \(P_\theta\) and \(P_{\text{gold}}\). However, rigorously speaking, equation 11 is not a Wasserstein distance, because \(r_\phi(x, y)\) does not satisfy the Lipschitz continuity described in Arjovsky et al. (2017). We provide more discussion about the connections between APO and W-GANs in the supplementary materials.
To practically conduct the APO RM training, we first collect a set of user queries \(\{x_m\}_{m=1}^{M} \sim P_D(x)\) and annotate each \(x_m\) with a golden response \(y_{\text{gold},m}\), so that each pair \((x_m, y_{\text{gold},m})\) can be regarded as a sample drawn from \(P_{\text{gold}}(x, y)\). Meanwhile, we generate \(y_{\text{sample},m} \sim \pi_\theta(y|x_m)\), so that \((x_m, y_{\text{sample},m}) \sim P_\theta(x, y) = P_D(x)\pi_\theta(y|x)\). With \(\mathcal{D}_{\text{APO}} = \{(x_m, y_{\text{gold},m}, y_{\text{sample},m})\}_{m=1}^{M}\) being our APO sample set, the RM learning objective in equation 11 can be calculated as:
\[L_{\text{APO-RM}}(r_\phi; \mathcal{D}_{\text{APO}}) = \mathbb{E}_{(x, y_{\text{gold}}, y_{\text{sample}}) \in \mathcal{D}_{\text{APO}}}\big[r_\phi(x, y_{\text{gold}}) - r_\phi(x, y_{\text{sample}})\big]\]
Note that equation 12 also calculates the reward difference between pairs of responses like the Bradley-Terry (BT) loss does.
Hence, for training stability, we can empirically use the BT loss to optimize equation 12 instead:
\[L_{\text{Ranking}}(r_\phi; \mathcal{D}_{\text{APO}}) = -\mathbb{E}_{(x, y_{\text{gold}}, y_{\text{sample}}) \in \mathcal{D}_{\text{APO}}}\big[\log \sigma\big(r_\phi(x, y_{\text{gold}}) - r_\phi(x, y_{\text{sample}})\big)\big]\]With a Lagrange multiplier \(\beta_2 > 0\), we can further convert the constrained maximization in equation 11 into an unconstrained loss minimization:
\[L_{\text{APO-RM}}(r_\phi) = L_{\text{Ranking}}(r_\phi; \mathcal{D}_{\text{APO}}) + \beta_2 \, \text{KL}[P(y \succ y'|x) \,\|\, Q_{r_\phi}(y \succ y'|x)]\]Note that \(\text{KL}[P \,\|\, Q_{r_\phi}] = \mathbb{E}_{P(y \succ y'|x)}[\log P - \log Q_{r_\phi}] = -H\big(P(y \succ y'|x)\big) - \mathbb{E}_{P(y \succ y'|x)}[\log Q_{r_\phi}]\), where the entropy \(H\big(P(y \succ y'|x)\big)\) of the ground-truth human preference is a constant for the \(r_\phi\) updating. As introduced in equation 2, with a group of preference data \(\mathcal{D}_P = \{(x_n, y_{\text{good},n}, y_{\text{bad},n})\}\) representing samples of \(P(y \succ y'|x)\), we have
\[-\mathbb{E}_{P(y \succ y'|x)}[\log Q_{r_\phi}(y \succ y'|x)] = L_{\text{Ranking}}(r_\phi; \mathcal{D}_P)\]Therefore, the overall APO RM learning objective can be written as:
\[L_{\text{APO-RM}}(r_\phi) = L_{\text{Ranking}}(r_\phi; \mathcal{D}_{\text{APO}}) + \beta_2 L_{\text{Ranking}}(r_\phi; \mathcal{D}_P)\]The APO RM loss involves two datasets, \(\mathcal{D}_{\text{APO}}\) and \(\mathcal{D}_P\), which in practice have different sizes, because golden responses consume much more annotation resources than pair-wise response comparisons. In experiments, we find that the re-weighting parameter \(\beta_2\) needs to be relatively large to avoid over-fitting on the smaller golden annotation set \(\mathcal{D}_{\text{APO}}\). We conduct more detailed ablation studies in the experimental part.
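Putting the two ranking terms together, the following is a minimal sketch of the overall APO RM loss, reusing the Bradley-Terry form on both the golden-vs-sampled comparison set \(\mathcal{D}_{\text{APO}}\) and the annotated preference set \(\mathcal{D}_P\); the tensor inputs are placeholder reward scores rather than the paper's data pipeline.

```python
import torch
import torch.nn.functional as F

def apo_rm_loss(r_gold: torch.Tensor, r_sample: torch.Tensor,
                r_good: torch.Tensor, r_bad: torch.Tensor,
                beta2: float = 1.0) -> torch.Tensor:
    """L_APO-RM = L_Ranking(D_APO) + beta2 * L_Ranking(D_P).

    r_gold / r_sample: reward scores of golden vs. LLM-sampled responses (D_APO pairs).
    r_good / r_bad:    reward scores of preferred vs. dispreferred responses (D_P pairs).
    beta2:             re-weighting factor; kept relatively large when D_APO is much
                       smaller than D_P to avoid over-fitting the golden annotations.
    """
    loss_apo = -F.logsigmoid(r_gold - r_sample).mean()    # golden vs. generated gap
    loss_rank = -F.logsigmoid(r_good - r_bad).mean()      # standard preference ranking
    return loss_apo + beta2 * loss_rank
```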
In the policy optimization step, we fix the reward model \(r_\phi(x, y)\) and update the policy \(\pi_\theta(y|x)\). Since the term \(\mathbb{E}_{(x,y) \sim P_{\text{gold}}(x,y)}[r_\phi(x, y)]\) and the constraint \(\text{KL}[P(y \succ y'|x) \,\|\, Q_{r_\phi}(y \succ y'|x)]\) do not depend on the policy \(\pi_\theta(y|x)\), we only need to optimize:
\[\max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D}}\big[\mathbb{E}_{y \sim \pi_\theta(y|x)}[r_\phi(x, y)]\big] - \beta \text{KL}[\pi_\theta(y|x) \,\|\, \pi_{\text{ref}}(y|x)]\]which is equivalent to the original preference optimization in equation 3. Naturally, previous preference aligning methods, such as PPO (Ouyang et al., 2022), DPO (Rafailov et al., 2023), RRHF (Yuan et al., 2023), and RJS (Dong et al., 2023; Liu et al., 2023c), remain qualified for the optimization in equation 16 and compatible with our APO framework. To preliminarily validate the effectiveness of our APO framework, we first select rejection sampling (RJS) as the LLM updating algorithm, for its implementation simplicity and training stability. Experiments of APO with other preference optimization methods are still in progress.
Algorithm 1: Adversarial preference optimization (APO) with rejection sampling (RJS).
Parameters: reward model \(r_\phi(x, y)\), policy \(\pi_\theta(y|x)\).
Data: LLM training queries \(\mathcal{D}_Q = \{x_l\}\), annotated golden responses \(\mathcal{D}_{\text{gold}} = \{(x_m, y_{\text{gold},m})\}\), preference comparisons \(\mathcal{D}_P = \{(x_n, y_{\text{good},n}, y_{\text{bad},n})\}\).
For each rejection sampling round:
1. Generate response samples \(y_{\text{sample},m} \sim \pi_\theta(y|x_m)\) for the annotated queries.
2. Collect the APO comparison set \(\mathcal{D}_{\text{APO}} = \{(x_m, y_{\text{gold},m}, y_{\text{sample},m})\}\).
3. Update \(r_\phi\) with the APO RM loss:
\[L_{\text{APO-RM}}(r_\phi; \mathcal{D}_{\text{APO}}) = \mathbb{E}_{(x, y_{\text{gold}}) \sim P_{\text{gold}}(x, y)}[r_\phi(x, y_{\text{gold}})] - \mathbb{E}_{(x, y_{\text{sample}}) \sim P_\theta(x, y)}[r_\phi(x, y_{\text{sample}})]\]
4. Update \(\pi_\theta\) with rejection sampling (equation 6) on the responses selected by the updated \(r_\phi\).
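A schematic sketch of one APO round following Algorithm 1, with rejection sampling as the LLM update; every helper passed in here (the sampler, the reward function, the RM training step, and the SFT step) is a hypothetical placeholder for the actual training code.

```python
from typing import Callable, Dict, List, Tuple

def apo_round(queries: List[str],
              golden: Dict[str, str],                        # x_m -> y_gold,m
              sample_one: Callable[[str], str],              # y ~ pi_theta(y|x)
              sample_many: Callable[[str, int], List[str]],
              reward_fn: Callable[[str, str], float],
              rm_update_step: Callable[[List[Tuple[str, str, str]]], None],
              sft_step: Callable[[List[Tuple[str, str]]], None],
              num_samples: int = 4) -> None:
    """One adversarial round: update the RM on golden-vs-sampled pairs, then RJS-update the LLM."""
    # 1) Build the APO comparison set D_APO = {(x, y_gold, y_sample)}.
    d_apo = [(x, golden[x], sample_one(x)) for x in queries if x in golden]
    # 2) RM update step: enlarge the score gap between golden and generated responses.
    rm_update_step(d_apo)
    # 3) LLM update step (rejection sampling): SFT on the highest-reward sample per query.
    sft_data = []
    for x in queries:
        candidates = sample_many(x, num_samples)
        best = max(candidates, key=lambda y: reward_fn(x, y))
        sft_data.append((x, best))
    sft_step(sft_data)
```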
In this section, we verify the effectiveness of the APO framework on the Helpful&Harmless (HH) dataset (Bai et al., 2022) with Alpaca (Taori et al., 2023) as the base SFT model and rejection sampling (RJS) (Dong et al., 2023) as the LLM updating algorithm. The overall training scheme is described in Algorithm 1.
Data Preparation We use the Helpful&Harmless (HH) set (Bai et al., 2022) to verify the effectiveness. Each query in the HH set is answered with two responses. Annotators are asked to label each response as “chosen” or “rejected” based on the interaction quality. Following the data preprocessing in Cheng et al. (2023), we clean both the HH training and testing sets by removing queries whose two responses are identical or receive the same score. After the cleaning, the HH training set contains 43.8K helpfulness-training queries and 42.5K harmlessness-training queries, while the HH testing set includes 2.3K helpfulness-testing queries and 2.3K harmlessness-testing queries. Next, we describe the usage of the cleaned HH data as shown in Table 1:
Table 1: Data preparation and usage. The original HH training set is used to learn a testing RM that automatically evaluates the quality of LLM responses. The split HHRM set is used to train the baseline RMs and APO RMs. Queries in the HHLLM set are utilized to update the LLM agent. Both the RMs’ and the LLMs’ performance is evaluated on the HHTest set.
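A small sketch of the cleaning filter described above, assuming each HH example carries two candidate responses with scalar quality scores; the field names are hypothetical.

```python
from typing import Dict, List

def clean_hh_examples(examples: List[Dict]) -> List[Dict]:
    """Drop queries whose two responses are identical or receive the same score."""
    cleaned = []
    for example in examples:
        # Hypothetical fields: two response texts and their annotated scores.
        if example["response_a"] == example["response_b"]:
            continue
        if example["score_a"] == example["score_b"]:
            continue
        cleaned.append(example)
    return cleaned
```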
Evaluation To evaluate the performance of RMs and LLMs, we consider the following metrics: for RMs, the preference accuracy and the expected calibration error (ECE) on the held-out HH sets; for LLMs, the average reward score of generated responses judged by the testing RMs, together with GPT-4 pairwise comparisons.
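Since RM evaluation relies on preference accuracy and expected calibration error (ECE), here is a minimal binned-ECE sketch over the RM's predicted preference probabilities \(Q_{r_\phi}(y_{\text{good}} \succ y_{\text{bad}}|x)\); the equal-width binning is a standard choice and not necessarily the paper's exact protocol.

```python
import numpy as np

def expected_calibration_error(probs: np.ndarray, labels: np.ndarray, n_bins: int = 10) -> float:
    """Binned ECE: weighted average of |accuracy - confidence| over probability bins.

    probs:  predicted probability that the first response is preferred, shape (N,).
    labels: 1 if the first response is actually preferred, else 0, shape (N,).
    """
    confidences = np.where(probs >= 0.5, probs, 1.0 - probs)   # confidence of predicted class
    predictions = (probs >= 0.5).astype(int)
    correct = (predictions == labels).astype(float)
    # Map confidences in [0.5, 1.0] to equal-width bin indices 0..n_bins-1.
    bin_ids = np.clip(((confidences - 0.5) / 0.5 * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            # Bin weight is the fraction of examples falling into the bin.
            ece += in_bin.mean() * abs(correct[in_bin].mean() - confidences[in_bin].mean())
    return float(ece)
```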
Training Details We describe the training details for RMs and LLMs separately: the RM setups are summarized in Table 2 and the LLM setups in Table 3.
Table 2: Training setups and performance of reward models.
As described in Algorithm 1, we conduct three rounds of rejection sampling with Alpaca-7B as the initial SFT model and RMBase as the baseline RM. In Table 2, we show the preference accuracy and expected calibration error (ECE) on both the HHTest and HHDev sets. From the results, we find that the APO RM uniformly achieves better preference accuracy, but raises the calibration error at the same time. To further visualize the relation between the preference accuracy and the calibration error during APO RM training, we plot every RM’s performance on HHDev in Figure 2, with the negative ECE score as the X-axis and preference accuracy as the Y-axis. The closer an RM is located to the upper-right corner of the plot, the better its performance is. Compared to RMAPO trained from RMBase each round, the sequentially updated RMAPO-seq can continuously achieve higher preference accuracy, especially on the validation set. However, the calibration errors also significantly increase at the same time, indicating that the RMs become more and more over-fitted on the HHRM training set. In contrast, updating RMAPO from RMBase in each round can stably control the calibration error with a small loss in preference accuracy. Without the APO sample data DAPO, the RMBase-AB used in the ablation study shows an apparent performance gap compared to the APO RMs, which supports the effectiveness of our adversarial comparison between the golden annotations and model generations.
In Table 3, we provide the training setups and performance of LLMs during the three RJS rounds. For the RJS baselines, we fix RMBase as the rejection RM to select the highest-score responses. For LLMAPO, we use the corresponding RMAPO for response selection. After each round of training, we let the updated LLM respond to the queries in the HHTest set, then use the testing RMAll and RMTest to infer the average scores of the LLM responses. From the results, both RJS and APO achieve significantly higher average scores round by round. APO-trained LLMs uniformly outperform the RJS baselines in every training round. From the right plot in Figure 2, the performance gap between APO and RJS visibly enlarges as the training rounds increase. Notably, although sequential APO RM training can cause much higher calibration errors, in the second round LLMAPO-v2seq achieves the highest average score compared with both LLMRJS-v2 and LLMAPO-v2. However, when the training continues to the third round, the sequentially trained RM becomes severely over-fitted and the performance score decreases. This phenomenon provides insight into the importance of balancing preference accuracy and probability calibration in RM training. We are conducting more experiments to study the impact of the accuracy-calibration trade-off.
Figure 2: Left: Performance of RMs on the validation set. Right: Average RM scores of LLM responses on the HH testing set.
Figure 3: GPT-4 comparison results between first-round APO-v1 and RJS-v1 on the HH testing set.
Besides RM average scores as the automatic evaluation, we also use GPT-4 to compare the responses from LLMRJS-v1 and LLMAPO-v1 for further verification of APO’s effectiveness. As described in Section 4.1, we query GPT-4 with crafted prompts for comprehensive judgments. The results are summarized in Figure 3, where our LLMAPO-v1 has a notably higher win rate.
We proposed an adversarial preference optimization (APO) framework for aligning LLMs with human feedback. Instead of updating the LLM agent with a fixed reward model (RM), our APO updates both the RM and the LLM alternately via an adversarial game, where the RM is dedicated to distinguishing the difference between LLM responses and the golden annotations, and the LLM aims to maximize the expected score under the RM's judgment. We empirically verify the effectiveness of APO with the Alpaca SFT model on the Helpful&Harmless set. We discovered that through APO training, the RM can continuously gain accuracy improvements with the same amount of preference training data. Compared to the vanilla rejection sampling (RJS) method, the APO-enhanced RJS uniformly achieves better response quality in terms of both the RM average score and GPT-4 evaluation. We believe that, if applied to practical LLM training scenarios, the APO framework can significantly reduce annotation costs and improve preference optimization efficiency.