
Adversarial Preference Optimization

  • Related Project: Private
  • Category: Paper Review
  • Date: 2024-01-14


  • url: https://arxiv.org/abs/2311.08045
  • pdf: https://arxiv.org/pdf/2311.08045
  • abstract: Human preference alignment is a crucial training step to improve the interaction quality of large language models (LLMs). Existing aligning methods depend on manually annotated preference data to guide the LLM optimization directions. However, in practice, continuously updating LLMs raises a distribution gap between model-generated samples and human-preferred responses, which hinders model fine-tuning efficiency. To mitigate this issue, previous methods require additional preference annotation on generated samples to adapt the shifted distribution, which consumes a large amount of annotation resources. Targeting more efficient human preference optimization, we propose an adversarial preference optimization (APO) framework, where the LLM agent and the preference model update alternatively via a min-max game. Without additional annotation, our APO method can make a self-adaption to the generation distribution gap through the adversarial learning process. In experiments, we empirically verify the effectiveness of APO in improving LLM’s helpfulness and harmlessness compared with rejection sampling baselines.


TL;DR


  • Preference optimization for large language models: the efficiency of adversarial learning compared with reinforcement learning
  • An APO framework for incorporating and improving human preferences
  • Experimental validation of the method and performance comparison against baselines

1. Introduction

Recent large language models (LLMs), trained with billions of parameters, show improved performance across areas such as natural language processing, mathematical reasoning, and programming. A key aspect of training these models is incorporating user feedback to improve human-model interaction. Techniques used for this include reinforcement learning from human feedback (RLHF), contrastive learning, and language modeling, each of which aims to use preference data efficiently.

However, these methods suffer from a gap that emerges during training between the model's generation distribution and the preference data distribution, which can degrade performance. To resolve this, the paper proposes Adversarial Preference Optimization (APO), an adversarial learning framework. Building on the idea of generative adversarial networks (GANs), APO optimizes the model through a dynamic interplay between the reward model and the LLM.


2. Preliminaries

2.1 Human Preference Alignment

The LLM is trained on human preference data to adjust its response-generation policy \(\pi_\theta(y \mid x)\) so that it produces responses preferred by users. A reward model \(r_\phi(x, y)\) plays a central role in this process and is trained with the Bradley-Terry model as follows.

\[\mathcal{L}_{\text{BT}}(r_\phi; \mathcal{D}_P) = -\mathbb{E}_{(x, y_{\text{good}}, y_{\text{bad}}) \in \mathcal{D}_P} \left[ \log \frac{1}{1 + \exp(-(r_\phi(x, y_{\text{good}}) - r_\phi(x, y_{\text{bad}})))} \right]\]

Through this loss, the reward model learns the preference between good and bad responses, and the LLM's response-generation policy is then optimized on the basis of these reward scores.
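As a quick worked example with made-up scores: if the reward model assigns a gap of \(r_\phi(x, y_{\text{good}}) - r_\phi(x, y_{\text{bad}}) = 1.0\), the predicted probability that \(y_{\text{good}}\) is preferred is \(\sigma(1.0) \approx 0.73\), giving a loss of \(-\log 0.73 \approx 0.31\); the loss shrinks as the reward gap between the preferred and rejected responses grows.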

2.2 Generative Adversarial Networks (GANs)

A GAN is trained with a generator that produces realistic-looking data samples and a discriminator that distinguishes generated samples from real data. The paper applies this principle to LLM response generation: the reward model plays the role of the discriminator, and the response-generation policy plays the role of the generator.


3. Adversarial Preference Optimization (APO)

The APO framework sets up an adversarial game in which the LLM tries to generate responses that receive high reward scores, while the reward model tries to enlarge the score gap between golden responses and generated responses. KL divergence is used to regularize the updates of both the policy and the reward model, which helps prevent over-fitting and preserves response diversity.

3.1 Reward Optimization Step

When updating the reward model, the following objective is optimized.

\[\max_{r_\phi}\; \mathbb{E}_{(x, y) \sim P_{\text{gold}}(x, y)}[r_\phi(x, y)] - \mathbb{E}_{(x, y) \sim P_\theta(x, y)}[r_\phi(x, y)]\]

This maximizes the reward-score gap between the golden data \(P_{\text{gold}}\) and the model-generated responses \(P_\theta\).

3.2 Policy Optimization Step

When updating the policy, the following objective is optimized.

\[\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}[r_\phi(x, y)] - \beta\, \text{KL}[\pi_\theta(y \mid x) \,\|\, \pi_{\text{ref}}(y \mid x)]\]

Through this, the LLM is trained to generate responses that receive high scores from the reward model.


4. Experiments

To verify the effectiveness of the APO framework, experiments were conducted on the Helpful&Harmless dataset with the Alpaca model. The results show that APO outperforms the rejection sampling baselines, in particular achieving higher preference accuracy, while also exposing a trade-off with calibration error.


5. Conclusion

The APO framework provides an efficient and effective approach to preference alignment for large language models.


1 INTRODUCTION

Learned from massive textual data with billions of parameters, large language models (LLMs), such as ChatGPT (OpenAI, 2023a) and LLaMA-2 (Touvron et al., 2023b), have shown remarkable AI capabilities, especially in the domains of natural language processing (Jiao et al., 2023; Han et al., 2023), logical (mathematical) reasoning (Liu et al., 2023a; Frieder et al., 2023), and programming (Surameery & Shakor, 2023; Tian et al., 2023). Among the training techniques that push LLMs to such excellent performance, human preference alignment fine-tunes LLMs to follow users' feedback, which has been widely recognized as essential for improving human-model interaction (Ouyang et al., 2022; Yuan et al., 2023; Rafailov et al., 2023; Dong et al., 2023). However, obtaining highly qualified human feedback requires meticulous annotation of all manner of query-response pairs across various topics (Askell et al., 2021), which is rather challenging and forms a sharp contrast to the easy access to enormous unsupervised pretraining text. Hence, the limitation of preference data collection raises demands for the learning efficiency of preference alignment methods (Yuan et al., 2023; Sun et al., 2023).

To utilize preference data, current human feedback aligning methods are proposed mainly from three perspectives (Wang et al., 2023b): reinforcement learning (Ouyang et al., 2022), contrastive learning (Yuan et al., 2023; Rafailov et al., 2023; Liu et al., 2023c), and language modeling (Dong et al., 2023; Touvron et al., 2023b; Wang et al., 2023a). Reinforcement learning with human feedback (RLHF) (Kreutzer et al., 2018; Ziegler et al., 2019) is the earliest exploration and has become the mainstream approach for LLMs' preference optimization (Ouyang et al., 2022; Touvron et al., 2023b). RLHF first learns a reward model (RM) from the human preference data, then optimizes the expected reward score of the LLM's outputs via the Proximal Policy Optimization (PPO) algorithm (Schulman et al., 2017). Although widely used, RLHF has been criticized as not only unstable during fine-tuning but also complicated in implementation and computational resource consumption (Yuan et al., 2023; Rafailov et al., 2023). For more efficient and steady training, instead of directly optimizing the non-differentiable rewards, contrastive learning methods (Yuan et al., 2023; Rafailov et al., 2023; Zhao et al., 2023) enlarge the likelihood gap between positive and negative response pairs, where the positive and negative labels can be either annotated by humans or predicted by reward models. Alternatively, language-modeling-based methods (Dong et al., 2023; Liu et al., 2023b; Wang et al., 2023a) still use the language modeling loss to align preferences, but with different data preparation strategies. For example, rejection sampling (Dong et al., 2023; Touvron et al., 2023b) selects responses with top reward scores as the language modeling fine-tuning data, while Wang et al. (2023a) and Liu et al. (2023b) add different prompts to different responses based on the corresponding preference levels.

Although contrastive-learning-based and language-modeling-based methods have partly alleviated the inefficiency of RLHF, the sampling distribution shifting problem (Touvron et al., 2023b) still hinders the alignment effectiveness: after a few steps of preference alignment updates, a distribution gap emerges between LLM-generated samples and the preference-annotated data. Consequently, the reward model rapidly performs worse on newly generated LLM responses unless it is additionally trained on new samples from the shifted distribution. To address this problem, most of the aforementioned methods (Ouyang et al., 2022; Dong et al., 2023; Yuan et al., 2023) require additional annotation of human feedback on newly generated responses (Touvron et al., 2023b) after a few LLM updating steps, which leads to increasingly massive manpower costs (Askell et al., 2021). Besides, the vast time consumption of extra manual annotation also significantly slows down the feedback alignment learning process.

To reduce the manual annotation efforts and further improve the preference optimization efficiency, we propose a novel adversarial learning framework called Adversarial Preference Optimization (APO). Inspired by generative adversarial networks (GANs) (Goodfellow et al., 2014; Arjovsky et al., 2017), we conduct an adversarial game between the RM and the LLM agent: the LLM generates responses to maximize the expected reward score, while the RM aims to distinguish the score difference between golden and sampled responses. To verify the effectiveness of our APO framework, we conduct experiments on the Helpful&Harmless (Bai et al., 2022) datasets with Alpaca (Taori et al., 2023) as the base LLM. With the same amount of human preference data, both the LLM agent and the reward model receive additional performance gains through the APO game, compared with the naive rejection sampling baselines.

2 PRELIMINARY

2.1 HUMAN PREFERENCE ALIGNMENT

Human preference alignment aims to fine-tune the response-generation policy \(\pi_\theta(y\\|x)\) of an LLM agent with a group of human preference data \(\mathcal{D}_P = \{(x, y_{\text{good}}, y_{\text{bad}})\}\), so that the LLM agent can generate more human-preferred responses to improve the human-model interaction quality. To achieve this, a reward model (RM) (Christiano et al., 2017; Ouyang et al., 2022) \(r_\phi(x, y)\) is usually utilized to evaluate the quality of responses from \(\pi_\theta(y\\|x)\), by learning from the human preference data \(\mathcal{D}_P\) with the Bradley-Terry (BT) ranking loss (Bradley & Terry, 1952):

\[\mathcal{L}_{\text{BT}}(r_\phi; \mathcal{D}_P) = -\mathbb{E}_{(x, y_{\text{good}}, y_{\text{bad}}) \in \mathcal{D}_P} \left[ \log \frac{1}{1 + \exp(-(r_\phi(x, y_{\text{good}}) - r_\phi(x, y_{\text{bad}})))} \right]\]

where \(\sigma(\cdot)\) is the Sigmoid activation function (Han & Moraga, 1995). If we denote \(y \succ y'\) as “response \(y\) is preferred to \(y'\)”, then a model-predicted probability \(Q_{r_\phi}(y \succ y'\\|x)\) can be induced by reward scores \(r_\phi(x, y), r_\phi(x, y')\) with the following parameterization:

\[Q_{r_\phi}(y \succ y'\\|x) = \sigma(r_\phi(x, y) - r_\phi(x, y'))\]
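As a concrete reference, here is a minimal PyTorch sketch of the Bradley-Terry ranking loss and the induced preference probability; the function names and toy scores are illustrative, not taken from the paper's implementation.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_good: torch.Tensor, r_bad: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry ranking loss: -log sigma(r_good - r_bad), averaged over the batch."""
    return -F.logsigmoid(r_good - r_bad).mean()

def preference_probability(r_y: torch.Tensor, r_y_prime: torch.Tensor) -> torch.Tensor:
    """Model-predicted probability Q_{r_phi}(y > y' | x) = sigma(r(x, y) - r(x, y'))."""
    return torch.sigmoid(r_y - r_y_prime)

# Toy reward scores for a batch of three annotated comparisons.
r_good = torch.tensor([1.2, 0.3, 2.0])
r_bad = torch.tensor([0.7, 0.9, -0.5])
print(bradley_terry_loss(r_good, r_bad))      # scalar ranking loss
print(preference_probability(r_good, r_bad))  # per-pair preference probabilities
```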

With equation 2, training the RM with the Bradley-Terry ranking loss can be explained as log-likelihood maximization of \(Q_{r_\phi}\), i.e., \(\mathcal{L}_{\text{Ranking}}(r_\phi; \mathcal{D}_P) = -\mathbb{E}_{\mathcal{D}_P} [\log Q_{r_\phi}(y_{\text{good}} \succ y_{\text{bad}} \mid x)]\). With a learned reward model \(r_\phi(x, y)\), human preference alignment methods (Ouyang et al., 2022; Rafailov et al., 2023; Liu et al., 2023c) aim to maximize the reward expectation of generated responses with the following objective:

\[\mathcal{L}_{\text{Pref}}(\pi_\theta; r_\phi, \mathcal{D}_P) = \mathbb{E}_{x \sim \mathcal{D}_P}[\mathbb{E}_{y \sim \pi_\theta(y\\|x)}[r_\phi(x, y)]] - \beta \text{KL}[\pi_\theta(y\\|x) \parallel \pi_{\text{ref}}(y\\|x)]\]

where \(\pi_{\text{ref}}(y\\|x)\) is the base reference policy commonly set as the supervised fine-tuned (SFT) language model (Ouyang et al., 2022), and \(\beta > 0\) is a hyper-parameter re-weighting the reward expectation and the KL-divergence (Kullback, 1997) regularizer. Practically the learning policy \(\pi_\theta(y\\|x)\) is also initialized from the reference \(\pi_{\text{ref}}(y\\|x)\). The regularizer \(\text{KL}[\pi_\theta(y\\|x) \parallel \pi_{\text{ref}}(y\\|x)]\) in equation 3 prevents \(\pi_\theta(y\\|x)\) from degenerating to repeat a single response with the highest reward score, and preserves the generation diversity. Since the sampled responses \(y\) are discrete, it is challenging to directly back-propagate gradients from reward \(r_\phi(x, y)\) back to policy \(\pi_\theta(y\\|x)\). The typical solution to the preference optimization in equation 3 is reinforcement learning (RLHF) (Ouyang et al., 2022), especially with the proximal policy optimization (PPO) algorithms (Schulman et al., 2017).
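Below is a small sketch of how the regularized objective in equation 3 can be estimated from a batch of sampled responses; the assumption that sequence-level log-probabilities are available for both policy and reference, the per-sample log-ratio KL estimator, and all tensor values are illustrative, not the paper's implementation.

```python
import torch

def preference_objective_estimate(rewards: torch.Tensor,
                                  logp_policy: torch.Tensor,
                                  logp_ref: torch.Tensor,
                                  beta: float = 0.1) -> torch.Tensor:
    """Monte Carlo estimate of E[r(x, y)] - beta * KL[pi_theta || pi_ref] over
    responses y sampled from the current policy. The KL term is approximated by
    the mean log-ratio log pi_theta(y|x) - log pi_ref(y|x) on those samples."""
    kl_estimate = (logp_policy - logp_ref).mean()
    return rewards.mean() - beta * kl_estimate

# Toy tensors standing in for RM scores and sequence log-probabilities.
rewards = torch.tensor([0.8, 1.1, 0.4])
logp_policy = torch.tensor([-42.0, -37.5, -51.2])
logp_ref = torch.tensor([-44.1, -38.0, -50.3])
print(preference_objective_estimate(rewards, logp_policy, logp_ref))
```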

However, RLHF has been recognized as practically suffering from implementation complexity and training instability (Yuan et al., 2023). Hence, recent studies (Rafailov et al., 2023; Yuan et al., 2023; Dong et al., 2023; Liu et al., 2023c) try to avoid the reinforcement learning scheme during preference optimization. More specifically, DPO (Rafailov et al., 2023) finds a connection between the reward model and LLM’s optimal solution, then replaces the reward model with the likelihood ratio of the policy and its reference:

\[\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}, \mathcal{D}_P) = -\mathbb{E}_{(x, y_{\text{good}}, y_{\text{bad}}) \in \mathcal{D}_P}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_{\text{good}} \mid x)}{\pi_{\text{ref}}(y_{\text{good}} \mid x)} - \beta \log \frac{\pi_\theta(y_{\text{bad}} \mid x)}{\pi_{\text{ref}}(y_{\text{bad}} \mid x)}\right)\right]\]

Analogously, other methods consider human feedback learning from the perspective of contrastive learning. For example, RRHF (Yuan et al., 2023) proposes a ranking loss:

\[\mathcal{L}_{\text{RRHF}}(r_\phi; \mathcal{D}_P) = -\mathbb{E}_{(x, y_{\text{best}}, y) \in \mathcal{D}_P} \left[ \log \sigma(r_\phi(x, y_{\text{best}}) - r_\phi(x, y)) \right]\]

where \(y_{\text{best}}\) is the corresponding response to \(x\) with the highest reward, and the preference data \(\mathcal{D}\) can be built from human annotation \(\mathcal{D}_P\) or RM ranking results. Additionally, Zhao et al. (2023) propose a ranking loss similar to equation 5 with a margin relaxation to the log-likelihood difference. Moreover, rejection sampling (RJS) methods (Touvron et al., 2023b; Liu et al., 2023c) directly conduct supervised fine-tuning (SFT) on \(y_{\text{best}}\) to further simplify the human preference alignment process. The rejection sampling optimization (RJS) loss can be written as:

\[\mathcal{L}_{\text{RJS}}(\pi_\theta; r_\phi, \mathcal{D}_P) = -\mathbb{E}_{x \sim \mathcal{D}_P}[\log \pi_\theta(y_{\text{best}} \mid x)]\]

where \(y_{\text{best}} = \arg \max_{1 \leq s \leq S}\{r_\phi(x, y_s)\}\) is the sampled response with the highest reward score.
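A minimal sketch of this rejection sampling selection step, assuming hypothetical `generate` and `reward` callables: it scores S candidates and keeps the argmax, which is then used as the SFT target for the prompt.

```python
import random
from typing import Callable, List, Tuple

def rejection_sample(prompt: str,
                     generate: Callable[[str], str],
                     reward: Callable[[str, str], float],
                     num_samples: int = 4) -> Tuple[str, float]:
    """Sample several candidate responses and keep the one with the highest
    reward score, i.e. y_best = argmax_s r(x, y_s); the (prompt, y_best) pair
    is then used as a supervised fine-tuning target."""
    candidates: List[str] = [generate(prompt) for _ in range(num_samples)]
    scored = [(reward(prompt, y), y) for y in candidates]
    best_score, best_response = max(scored, key=lambda t: t[0])
    return best_response, best_score

# Toy stand-ins for the LLM sampler and the reward model.
toy_generate = lambda x: "draft answer " + str(random.randint(0, 9))
toy_reward = lambda x, y: random.random()
print(rejection_sample("How do I stay safe online?", toy_generate, toy_reward))
```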

2.2 GENERATIVE ADVERSARIAL NETWORKS

Generative adversarial networks (GANs) (Goodfellow et al., 2014) are a classical group of unsupervised machine learning approaches that can fit complicated real-data distributions in an adversarial learning scheme. More specifically, GANs use a discriminator \(D(\cdot)\) and a generator \(G(\cdot)\) to play a min-max game: the generator tries to cheat the discriminator with real-looking generated samples, while the discriminator aims to distinguish the true data and the samples. The GANs’ objective is:

\[\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]\]

where \(z\) is a random vector from the prior \(p_z(z)\) that induces the generated sample distribution. The objective in equation 7 can be theoretically related to the Jensen-Shannon divergence between the distributions of real data and generated samples (Goodfellow et al., 2014). Moreover, Arjovsky et al. (2017) replace the Jensen-Shannon divergence with the Wasserstein distance (Villani, 2009) and propose the Wasserstein GAN objective:

\[\min_G \max_{f: \\|f\\|_L \leq K} \mathbb{E}_{x \sim p_{\text{data}}(x)}[f(x)] - \mathbb{E}_{x \sim p_g(x)}[f(x)]\]

where \(\|f\|_L \leq K\) requires \(f(\cdot)\) to be a K-Lipschitz continuous function. Wasserstein GANs have been recognized as having higher training stability than the original GANs (Arjovsky et al., 2017).
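For reference, a tiny sketch of the Wasserstein critic objective in equation 8 (the Lipschitz constraint, enforced in practice by weight clipping or a gradient penalty, is omitted here); the same gap-between-expectations form reappears in the APO reward objective below.

```python
import torch

def wgan_critic_objective(f_real: torch.Tensor, f_fake: torch.Tensor) -> torch.Tensor:
    """Critic objective of equation 8: E_{p_data}[f(x)] - E_{p_g}[f(x)].
    The critic ascends this value; the generator descends it."""
    return f_real.mean() - f_fake.mean()

# Toy critic scores for a batch of real and generated samples.
f_real = torch.tensor([0.9, 1.4, 0.7])
f_fake = torch.tensor([0.2, -0.1, 0.5])
print(wgan_critic_objective(f_real, f_fake))
```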

In natural language generation, GANs have also been empirically explored (Zhang et al., 2016; 2017), where a text generator samples real-looking text and a discriminator makes judgment between the true data and textual samples. As introduced in Section 2.1, the response-generation policy \(\pi_\theta(y\\|x)\) can be regarded as a generator of a conditional text GAN (Mirza & Osindero, 2014). Besides, the reward model \(r_\phi(x, y)\) plays an analogous role as a discriminator to judge the quality of generated responses.

Figure 1: The APO framework. In the RM updating step, the reward model learns by distinguishing the difference between the manually annotated golden responses and the LLM-generated responses. In the LLM updating step, the LLM agent updates to generate higher-quality responses with the feedback from the reward model.

3 ADVERSARIAL PREFERENCE OPTIMIZATION

We begin by revisiting the human preference alignment objective (equation 3) in a constrained mathematical optimization form:

\[\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}[r_\phi(x, y)], \quad \text{s.t. } \text{KL}[\pi_\theta(y \mid x) \,\|\, \pi_{\text{ref}}(y \mid x)] < \epsilon^2\]

where we aim to maximize the expected reward value with respect to the generation policy \(\pi_\theta(y \mid x)\), under a KL-divergence constraint with the reference policy \(\pi_{\text{ref}}(y \mid x)\). Applying the method of Lagrange multipliers (Beavis & Dobbs, 1990) to equation 9, one can easily obtain the widely-used preference optimization objective in equation 3. As discussed in Section 1, the above optimization form becomes ineffective after several policy updating steps, because the generated sample distribution diverges from the preference data distribution on which the RM \(r_\phi(x, y)\) was trained. To address this problem, we aim to update the reward model correspondingly during the policy fine-tuning.
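To make this step explicit, the standard Lagrange-multiplier argument (spelled out here for completeness, not quoted from the paper) converts the constrained problem into the regularized one:

\[\max_{\pi_\theta}\; \mathbb{E}[r_\phi(x, y)] \;\;\text{s.t.}\;\; \text{KL}[\pi_\theta \,\|\, \pi_{\text{ref}}] < \epsilon^2 \;\;\Longrightarrow\;\; \max_{\pi_\theta} \min_{\beta \geq 0}\; \mathbb{E}[r_\phi(x, y)] - \beta\left(\text{KL}[\pi_\theta \,\|\, \pi_{\text{ref}}] - \epsilon^2\right),\]

and for any fixed multiplier \(\beta > 0\) the constant \(\beta\epsilon^2\) can be dropped, which recovers the regularized objective in equation 3.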

Inspired by generative adversarial networks (GANs) (Goodfellow et al., 2014), we design an adversarial learning framework to align human preferences:

\[\min_{r_\phi} \max_{\pi_\theta}\; \mathbb{E}_{(x, y) \sim P_\theta(x, y)}[r_\phi(x, y)] - \mathbb{E}_{(x, y) \sim P_{\text{gold}}(x, y)}[r_\phi(x, y)],\]

\[\text{s.t.} \quad \text{KL}[\pi_\theta(y \mid x) \,\|\, \pi_{\text{ref}}(y \mid x)] < \epsilon^2, \qquad \text{KL}[P(y \succ y' \mid x) \,\|\, Q_{r_\phi}(y \succ y' \mid x)] < \eta^2,\]

where \(P_\theta(x, y) = P_D(x)\, \pi_\theta(y \mid x)\) is the joint distribution of input queries and generated responses, and \(P_{\text{gold}}(x, y)\) denotes the annotated golden data distribution.

Based on equation 10, we conduct an adversarial game, in which the policy \(\pi_\theta(y \mid x)\) needs to improve its response quality to get a higher expected reward, while the reward model \(r_\phi(x, y)\) tries to enlarge the scoring gap between the golden responses and the generations from \(\pi_\theta(y \mid x)\). We call this novel optimization problem Adversarial Preference Optimization (APO).

Besides, following the original preference alignment objective, we add two KL-divergence regularizers to \(\pi_\theta\) and \(r_\phi\) to prevent over-fitting and degeneration. Here \(P(y \succ y' \mid x)\) denotes the ground-truth human preference probability, and \(Q_{r_\phi}(y \succ y' \mid x)\) is defined in equation 2.

Note that we use the reverse \(\text{KL}[\pi_\theta \,\|\, \pi_{\text{ref}}]\) to constrain the generative model \(\pi_\theta\), but the forward \(\text{KL}[P \,\|\, Q_{r_\phi}]\) for the discriminative model \(r_\phi\). We provide an intuitive explanation for this separate forward-reverse KL regularization design: the reverse \(\text{KL}[\pi_\theta \,\|\, \pi_{\text{ref}}]\) can be estimated with \(\pi_\theta\)-generated samples, paying more attention to generation quality, while the forward \(\text{KL}[P \,\|\, Q_{r_\phi}]\) is practically estimated with ground-truth preference data, focusing on the preference-fitting ability of the reward model.

To play the above adversarial game, we alternately update one of \(\pi_\theta(y \mid x)\) and \(r_\phi(x, y)\) with the other's parameters fixed. Next, we provide detailed descriptions of APO's reward optimization step and policy optimization step separately.
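A schematic sketch of this alternating scheme follows; `update_reward_model`, `update_policy`, and `sample_response` are hypothetical placeholders for whatever RM trainer and policy optimizer (RJS, PPO, DPO, ...) are plugged in.

```python
from typing import Callable, List, Tuple

def apo_training_loop(update_reward_model: Callable[[List[Tuple[str, str, str]]], None],
                      update_policy: Callable[[List[str]], None],
                      sample_response: Callable[[str], str],
                      gold_data: List[Tuple[str, str]],
                      llm_queries: List[str],
                      num_rounds: int = 3) -> None:
    """Alternating APO game (a sketch): in each round the RM is updated on
    (query, golden, sampled) comparisons with the policy fixed, then the policy
    is updated against the refreshed RM with the RM fixed."""
    for _ in range(num_rounds):
        # RM step: build D_APO by pairing golden responses with fresh policy samples.
        d_apo = [(x, y_gold, sample_response(x)) for x, y_gold in gold_data]
        update_reward_model(d_apo)
        # Policy step: optimize the expected reward on the LLM query set.
        update_policy(llm_queries)

# Toy usage with no-op stubs.
apo_training_loop(lambda d: None, lambda q: None, lambda x: "draft answer",
                  gold_data=[("hi", "hello!")], llm_queries=["hi"], num_rounds=1)
```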

3.1 REWARD OPTIMIZATION STEP

In the reward optimization step, we fix the generator \(\pi_\theta(y \mid x)\) and update the reward model \(r_\phi(x, y)\). Note that the term \(\text{KL}[\pi_\theta(y \mid x) \,\|\, \pi_{\text{ref}}(y \mid x)]\) in equation 10 does not depend on \(r_\phi\), so we can simplify the objective for reward model updates:

\[\max_{r_\phi}\; \mathbb{E}_{(x, y) \sim P_{\text{gold}}(x, y)}[r_\phi(x, y)] - \mathbb{E}_{(x, y) \sim P_\theta(x, y)}[r_\phi(x, y)], \quad \text{s.t. } \text{KL}[P(y \succ y' \mid x) \,\|\, Q_{r_\phi}(y \succ y' \mid x)] < \eta^2\]

Equation 11 indicates that the reward model enlarges the expected score gap between golden answers and generated responses, so as to challenge \(\pi_\theta(y \mid x)\) for better generation quality. Note that equation 11 has a similar form to the objective of Wasserstein GANs (equation 8), and can be intuitively explained as computing a Wasserstein-like distance between the distributions \(P_\theta\) and \(P_{\text{gold}}\). However, strictly speaking, equation 11 is not a Wasserstein distance because \(r_\phi(x, y)\) does not satisfy the Lipschitz continuity described in Arjovsky et al. (2017). We provide more discussion of the connections between APO and W-GANs in the supplementary materials.

To practically conduct the APO RM training, we first collect a set of user queries \(\{x_m\}_{m=1}^M \sim P_D(x)\), then annotate each \(x_m\) with a golden response \(y_{\text{gold},m}\), so that each \((x_m, y_{\text{gold},m})\) can be regarded as a sample drawn from \(P_{\text{gold}}(x, y)\). Meanwhile, we generate \(y_{\text{sample},m} \sim \pi_\theta(y \mid x_m)\), so that \((x_m, y_{\text{sample},m}) \sim P_\theta(x, y) = P_D(x)\,\pi_\theta(y \mid x)\). With \(\mathcal{D}_{\text{APO}} = \{(x_m, y_{\text{gold},m}, y_{\text{sample},m})\}_{m=1}^M\) being our APO sample set, the RM learning objective in equation 11 can be estimated as:

\[\mathcal{L}_{\text{APO-RM}}(r_\phi; \mathcal{D}_{\text{APO}}) = \mathbb{E}_{(x, y_{\text{gold}}, y_{\text{sample}}) \in \mathcal{D}_{\text{APO}}}\left[r_\phi(x, y_{\text{gold}}) - r_\phi(x, y_{\text{sample}})\right]\]

Note that equation 12 also calculates the reward difference between pairs of responses like the Bradley-Terry (BT) loss does.

Hence, for training stability, we can empirically use the BT loss to optimize equation 12 instead:

\[\mathcal{L}_{\text{APO-RM}}(r_\phi; \mathcal{D}_{\text{APO}}) = -\mathbb{E}_{(x, y_{\text{gold}}, y_{\text{sample}}) \in \mathcal{D}_{\text{APO}}}\left[\log \sigma\!\left(r_\phi(x, y_{\text{gold}}) - r_\phi(x, y_{\text{sample}})\right)\right]\]

With a Lagrange multiplier \(\beta_2 > 0\), we can convert the KL constraint in equation 11 into a regularizer:

\[\mathcal{L}_{\text{APO-RM}}(r_\phi) = \mathcal{L}_{\text{Ranking}}(r_\phi; \mathcal{D}_{\text{APO}}) + \beta_2\, \text{KL}[P(y \succ y' \mid x) \,\|\, Q_{r_\phi}(y \succ y' \mid x)]\]

Note that \(\text{KL}[P \,\|\, Q_{r_\phi}] = \mathbb{E}_{P(y \succ y' \mid x)}[\log P - \log Q_{r_\phi}] = -H(y \succ y' \mid x) - \mathbb{E}_{P(y \succ y' \mid x)}[\log Q_{r_\phi}]\), where the entropy \(H(y \succ y' \mid x)\) of the ground-truth human preference is a constant for \(r_\phi\) updating, so only the cross-entropy term matters for optimization. As introduced in equation 2, with a group of preference data \(\mathcal{D}_P = \{(x_n, y_{\text{good},n}, y_{\text{bad},n})\}\) representing samples of \(P(y \succ y' \mid x)\), we have

\[-\mathbb{E}_{P(y \succ y' \mid x)}[\log Q_{r_\phi}(y \succ y' \mid x)] = \mathcal{L}_{\text{Ranking}}(r_\phi; \mathcal{D}_P)\]

Therefore, the overall APO RM learning objective can be written as:

\[\mathcal{L}_{\text{APO-RM}}(r_\phi) = \mathcal{L}_{\text{Ranking}}(r_\phi; \mathcal{D}_{\text{APO}}) + \beta_2\, \mathcal{L}_{\text{Ranking}}(r_\phi; \mathcal{D}_P)\]

The APO RM loss involves two datasets, \(\mathcal{D}_{\text{APO}}\) and \(\mathcal{D}_P\), which in practice have different sizes, because golden responses consume far more annotation resources than pairwise response comparisons. In experiments, we find that the re-weighting parameter \(\beta_2\) needs to be relatively large to avoid over-fitting on the smaller golden annotation set \(\mathcal{D}_{\text{APO}}\). We conduct more detailed ablation studies in the experimental section.
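A minimal sketch of this combined APO RM objective, assuming reward scores have already been computed for the \(\mathcal{D}_{\text{APO}}\) pairs (golden vs. sampled) and the \(\mathcal{D}_P\) pairs (preferred vs. rejected); \(\beta_2 = 10\) mirrors the value reported in the training details below, and all scores here are toy values.

```python
import torch
import torch.nn.functional as F

def ranking_loss(r_preferred: torch.Tensor, r_other: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry ranking loss -log sigma(r_preferred - r_other)."""
    return -F.logsigmoid(r_preferred - r_other).mean()

def apo_rm_loss(r_gold: torch.Tensor, r_sample: torch.Tensor,
                r_good: torch.Tensor, r_bad: torch.Tensor,
                beta2: float = 10.0) -> torch.Tensor:
    """Overall APO RM objective: ranking loss on D_APO (golden vs. sampled responses)
    plus beta2 times the ranking loss on the human preference set D_P."""
    return ranking_loss(r_gold, r_sample) + beta2 * ranking_loss(r_good, r_bad)

# Toy reward scores for two D_APO pairs and two D_P pairs.
print(apo_rm_loss(torch.tensor([2.0, 1.5]), torch.tensor([0.5, 1.0]),
                  torch.tensor([1.0, 0.2]), torch.tensor([0.3, -0.1])))
```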

3.2 POLICY OPTIMIZATION STEP

In the policy optimization step, we fix the reward model \(r_\phi(x, y)\) and update the policy \(\pi_\theta(y \mid x)\). Since the term \(\mathbb{E}_{(x,y) \sim P_{\text{gold}}(x,y)}[r_\phi(x, y)]\) and the constraint \(\text{KL}[P(y \succ y' \mid x) \,\|\, Q_{r_\phi}(y \succ y' \mid x)]\) do not depend on the policy \(\pi_\theta(y \mid x)\), we only need to optimize:

\[\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}[r_\phi(x, y)] - \beta\, \text{KL}[\pi_\theta(y \mid x) \,\|\, \pi_{\text{ref}}(y \mid x)]\]

Algorithm 1 Adversarial preference optimization (APO) with rejection sampling (RJS).

Parameters: reward model \(r_\phi(x, y)\), policy \(\pi_\theta(y \mid x)\).
Data: LLM training queries \(\mathcal{D}_Q = \{x_l\}\), annotated golden responses \(\mathcal{D}_{\text{gold}} = \{(x_m, y_{\text{gold},m})\}\), preference comparisons \(\mathcal{D}_P = \{(x_n, y_{\text{good},n}, y_{\text{bad},n})\}\).
For each rejection sampling round:

  1. Generate a response sample \(y_{\text{sample},m} \sim \pi_\theta(y \mid x_m)\) for each query in \(\mathcal{D}_{\text{gold}}\).
  2. Collect the APO comparison set \(\mathcal{D}_{\text{APO}} = \{(x_m, y_{\text{gold},m}, y_{\text{sample},m})\}\).
  3. Update \(r_\phi\) with the APO RM loss:

\[\mathcal{L}_{\text{APO-RM}}(r_\phi; \mathcal{D}_{\text{APO}}) = \mathbb{E}_{(x, y_{\text{gold}}, y_{\text{sample}}) \in \mathcal{D}_{\text{APO}}}\left[r_\phi(x, y_{\text{gold}}) - r_\phi(x, y_{\text{sample}})\right]\]

  4. Update \(\pi_\theta\) by rejection sampling: for each query in \(\mathcal{D}_Q\), sample candidate responses from \(\pi_\theta\), select the highest-reward response under the updated \(r_\phi\), and fine-tune \(\pi_\theta\) on it with the SFT loss.

which is equivalent to the original preference optimization objective in equation 3. Naturally, previous preference alignment methods, such as PPO (Ouyang et al., 2022), DPO (Rafailov et al., 2023), RRHF (Yuan et al., 2023), and RJS (Dong et al., 2023; Liu et al., 2023c), remain applicable to the optimization in equation 16 and are compatible with our APO framework. To preliminarily validate the effectiveness of the APO framework, we first select rejection sampling (RJS) as the LLM updating algorithm for its implementation simplicity and training stability. Experiments combining APO with other preference optimization methods are still in progress.

4 EXPERIMENTS

In this section, we verify the effectiveness of the APO framework on the Helpful&Harmless (HH) dataset (Bai et al., 2022) with Alpaca (Taori et al., 2023) as the base SFT model and rejection sampling (RJS) (Dong et al., 2023) as the LLM updating algorithm. The overall training scheme is described in Algorithm 1.

4.1 EXPERIMENTAL SETUPS

Data Preparation We use the Helpful&Harmless (HH) set (Bai et al., 2022) to verify the effectiveness of APO. Each query in the HH set is answered with two responses, and annotators label each response as "chosen" or "rejected" based on the interaction quality. Following the data preprocessing in Cheng et al. (2023), we clean both the HH training and testing sets by removing queries with two identical responses or two identical scores. After cleaning, the HH training set contains 43.8K helpfulness-training queries and 42.5K harmlessness-training queries, while the HH testing set includes 2.3K helpfulness-testing queries and 2.3K harmlessness-testing queries. Next, we describe the usage of the cleaned HH data as shown in Table 1:

  • Training Data: For separately updating the RM and LLM, we merge the helpful and harmless training sets, then randomly split them into an RM training set (HHRM, 20K queries) and an LLM training set (HHLLM, 66K queries). HHRM is used to learn the rejection sampling RM baseline RMBase and to further update the APO RMAPO. In HHLLM, we only use the instruction queries as prompts for LLMs to sample responses and to update through preference alignment.
  • Annotated Golden Data: Due to the annotation resource limitation, instead of manually labeling, we call GPT-4 (OpenAI, 2023b) API with the queries in HHRM set to collect responses as the simulated golden annotation. Since GPT-4 has been widely recognized as the state-of-the-art LLM, we intend to check how close an Alpaca-7B model can approach GPT-4’s performance. The data collection prompts and details are shown in Appendix A.
  • Testing & Validation Data: Note that we only utilize the queries in HHLLM for LLM policy updating. To make further usage of the 66K comparison data, we randomly select 10K response pairs from HHLLM to build a validation set HHDev for RMs. Besides, both evaluations of RMs and LLMs are conducted on the original HH testing data (HHTest), where response pairs are prepared for RMs preference tests and instruction queries are utilized for LLMs generating responses.

Table 1: Data preparation and usage. The original HH training set is used to learn a testing RM to automatically evaluate the quality of LLM responses. The split HHRM set is for training of baseline RMs and APO RMs. Queries in HHLLM set are utilized to update the LLM agent. Both RM and LLM’s performance are evaluated on HHTest set.

Evaluation To evaluate the performance of RMs and LLMs, we consider the following metrics:

  • Preference Accuracy: For RM evaluation, we first calculate the preference accuracy on HHTest. If an RM \(r(x, y)\) outputs \(r(x, y_{\text{good}}) > r(x, y_{\text{bad}})\) for an annotated comparison \((x, y_{\text{good}}, y_{\text{bad}})\), we denote a correct prediction. The preference accuracy is computed as the proportion of correct predictions within all testing response pairs.
  • Probability Calibration: The preference accuracy provides pairwise comparisons of responses but does not reflect the degree of preference for each response. Following Bai et al. (2022), we check the probability calibration to test whether the learned RMs faithfully represent the human preference distribution. Specifically, we consider the RM performance separately in \(B\) bins, where each bin \(D_b\) collects testing preference samples \((x, y, y')\) with RM-predicted probability \(Q_{r_\phi}(y \succ y' \mid x) \in \left[\frac{b-1}{B}, \frac{b}{B}\right]\), for \(b = 1, 2, \ldots, B\). The expected calibration error (ECE) is calculated as \(\text{ECE}(r_\phi) = \sum_{b=1}^B |o_b - e_b|\), where \(o_b = \frac{1}{|D_b|} \sum_{(x, y, y') \in D_b} \mathbb{1}_{\{y \succ y'\}}\) is the ground-truth fraction of \(y \succ y'\) tuples in \(D_b\), and \(e_b = \frac{1}{|D_b|} \sum_{(x, y, y') \in D_b} Q_{r_\phi}(y \succ y' \mid x)\) is the mean of the RM-predicted probabilities within \(D_b\) (a small computational sketch of this metric follows this list).
  • RM Average Score: To automatically evaluate the performance of LLM agents, we use two well-learned reward models, RMAll and RMTest. RMTest is trained on the entire HH training set, while RMAll is trained with two additional preference sets (WebGPT (Nakano et al., 2021) and GPT4LLM (Peng et al., 2023)) following the same setup as in Cheng et al. (2023). Average scores of both RMAll and RMTest are reported on the HH testing set.
  • Automatic Evaluation: Due to the annotation limitation, we use GPT-4 (OpenAI, 2023b) as an annotator to provide evaluation. To avoid position bias and make the annotation more credible, we employ position-swap (Zheng et al., 2023) and chain-of-thought (Wei et al., 2022) techniques. For content assessment, we primarily consider helpfulness and harmlessness. The evaluation prompts can be found in Appendix B.
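A small computational sketch of the ECE metric as written above (an unweighted sum of per-bin gaps; some implementations additionally weight each bin by its sample fraction); the probabilities and labels are toy values.

```python
import numpy as np

def expected_calibration_error(pred_probs: np.ndarray, labels: np.ndarray, num_bins: int = 10) -> float:
    """Bin predictions Q(y > y' | x) into equal-width bins and sum |o_b - e_b|,
    where o_b is the empirical fraction of y > y' in bin b and e_b is the mean
    predicted probability in that bin (empty bins are skipped)."""
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (pred_probs >= lo) & ((pred_probs < hi) if hi < 1.0 else (pred_probs <= hi))
        if mask.any():
            ece += abs(labels[mask].mean() - pred_probs[mask].mean())
    return ece

# Toy example: predicted preference probabilities and 0/1 ground-truth outcomes.
probs = np.array([0.9, 0.8, 0.55, 0.3, 0.2, 0.65])
wins = np.array([1, 1, 0, 0, 1, 1])
print(expected_calibration_error(probs, wins, num_bins=5))
```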

Training Details We describe the training details for RMs and LLMs separately:

  • RM Training Details: We follow the training setups in Cheng et al. (2023). The testing RMAll, RMTest, and the rejection sampling RM baseline RMBase are initialized with the pretrained LLaMA-7B (Touvron et al., 2023b) model and fine-tuned with a learning rate of \(1 \times 10^{-6}\). For APO RM training, we explore two different setups: (1) in each round, APO \(\text{RM}_{\text{APO}}\) is fine-tuned with APO data \(D_{\text{APO}}\) based on the baseline RMBase as the initial checkpoint; (2) in round \(R\), APO \(\text{RM}_{\text{APO-v}(R)_{\text{seq}}}\) is sequentially updated with \(D_{\text{APO}}\) based on the former round’s checkpoint \(\text{RM}_{\text{APO-v}(R-1)_{\text{seq}}}\). The learning rate of \(\text{RM}_{\text{APO}}\) and \(\text{RM}_{\text{APO-seq}}\) is set to \(1 \times 10^{-8}\), while the re-weighting parameter \(\beta_2\) is 10. For the ablation study, we also train an \(\text{RM}_{\text{Base-AB}}\) with the same setups as \(\text{RM}_{\text{APO-v1}}\) but without any comparison data from \(D_{\text{APO}}\). All RM training batch sizes are set to 64. The max input sequence length is 512. All reward models are fine-tuned with one epoch.
  • LLM Training Details: Our LLM is initialized with Alpaca (Taori et al., 2023), which is an instruction-tuned LLaMA-7B model (Touvron et al., 2023a). To fine-tune the LLM, we set the queries in the HHLLM training set as the SFT sources and the RM-selected responses as the SFT targets. We follow the training setups in Alpaca (Taori et al., 2023) and update the LLM round-by-round with decreasing learning rates (i.e., the first round with \(5 \times 10^{-6}\), the second round with \(2 \times 10^{-6}\), and the third round with \(9 \times 10^{-7}\)). The batch size is 128 and the max input length is 1024. Each round is updated with one training epoch.

Table 2: Training setups and performance of reward models.

4.2 REWARD MODEL PERFORMANCE

As described in Algorithm 1, we conduct three rounds of rejection sampling with Alpaca-7B as the initial SFT model and RMBase as the baseline RM. In Table 2, we show the preference accuracy and expected calibration error (ECE) on both the HHTest and HHDev sets. From the results, we find that the APO RM uniformly achieves better preference accuracy but meanwhile raises the calibration error. To further visualize the relation between preference accuracy and calibration error during APO RM training, we plot every RM's performance on HHDev in Figure 2, with the negative ECE score as the X-axis and the preference accuracy as the Y-axis. The closer an RM is located to the upper-right corner of the plot, the better its performance. Compared to RMAPO, which is trained from RMBase in each round, the sequentially updated RMAPO-seq continuously achieves higher preference accuracy, especially on the validation set. However, the calibration errors also increase significantly at the same time, indicating that the RMs become more and more over-fitted on the HHRM training set. In contrast, updating RMAPO from RMBase in each round stably controls the calibration error at the cost of a small loss in preference accuracy. Without the APO sample data DAPO, the ablation RM RMBase-AB shows an apparent performance gap compared to the APO RMs, which supports the effectiveness of our adversarial comparison between golden annotations and model generations.

4.3 LLM AGENT PERFORMANCE

In Table 3, we provide the training setups and performance of the LLMs during the three RJS rounds. For the RJS baselines, we fix RMBase as the rejection RM to select the highest-scoring responses. For LLMAPO, we use the corresponding RMAPO for response selection. After each round of training, we let the updated LLM respond to the queries in the HHTest set, then use the testing RMAll and RMTest to compute average scores of the LLM responses. From the results, both RJS and APO achieve significantly higher average scores round by round, and APO-trained LLMs uniformly outperform the RJS baselines in every training round. From the right plot in Figure 2, the performance gap between APO and RJS visibly widens as the number of training rounds increases. Notably, although sequential APO RM training causes much higher calibration errors, in the second round LLMAPO-v2seq achieves the highest average score compared with both LLMRJS-v2 and LLMAPO-v2. However, when training continues to the third round, the sequentially trained RM becomes severely over-fitted and the performance score decreases. This phenomenon provides insight into the importance of balancing preference accuracy and probability calibration in RM training. We are conducting more experiments to study the impact of the accuracy-calibration trade-off.

Figure 2: Left: Performance of RMs on the validation set. Right: Average RM scores of LLM responses on the HH testing set.

Figure 3: GPT-4 comparison results between first-round APO-v1 and RJS-v1 on the HH testing set.

Besides RM average scores as the automatic evaluation, we also use GPT-4 to compare the responses from LLMRJS-v1 and LLMAPO-v1 for further verification of APO’s effectiveness. As described in Section 4.1, we query GPT-4 with crafted prompts for comprehensive judgments. The results are summarized in Figure 3, where our LLMAPO-v1 has a notably higher win rate.

5 CONCLUSION

We proposed an adversarial preference optimization (APO) framework for aligning LLMs with human feedback. Instead of updating the LLM agent with a fixed reward model (RM), our APO updates both the RM and the LLM alternately via an adversarial game, where the RM is dedicated to distinguishing the difference between LLM responses and the golden annotations, and the LLM aims to maximize the expected score under the RM's judgment. We empirically verify the effectiveness of APO with the Alpaca SFT model on the Helpful&Harmless set. We find that through APO training, the RM continuously gains accuracy improvements with the same amount of preference training data. Compared to the vanilla rejection sampling (RJS) method, the APO-enhanced RJS uniformly achieves better response quality in terms of both the RM average score and GPT-4 evaluation. We believe that, if applied to practical LLM training scenarios, the APO framework can significantly reduce annotation costs and improve preference optimization efficiency.


post contain ""

    No matching posts found containing ""