MinWoo(Daniel) Park | Tech Blog

DPO | Self-Play Fine-Tuning

  • Related Project: Private
  • Category: Paper Review
  • Date: 2024-01-10

Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

  • url: https://arxiv.org/abs/2401.01335
  • pdf: https://arxiv.org/pdf/2401.01335
  • abstract: Harnessing the power of human-annotated data through Supervised Fine-Tuning (SFT) is pivotal for advancing Large Language Models (LLMs). In this paper, we delve into the prospect of growing a strong LLM out of a weak one without the need for acquiring additional human-annotated data. We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN), which starts from a supervised fine-tuned model. At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself. More specifically, the LLM generates its own training data from its previous iterations, refining its policy by discerning these self-generated responses from those obtained from human-annotated data. Our method progressively elevates the LLM from a nascent model to a formidable one, unlocking the full potential of human-annotated demonstration data for SFT. Theoretically, we prove that the global optimum to the training objective function of our method is achieved only when the LLM policy aligns with the target data distribution. Empirically, we evaluate our method on several benchmark datasets including the HuggingFace Open LLM Leaderboard, MT-Bench, and datasets from Big-Bench. Our results show that SPIN can significantly improve the LLM’s performance across a variety of benchmarks and even outperform models trained through direct preference optimization (DPO) supplemented with extra GPT-4 preference data. This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.




TL;DR


  • LLM self-improvement: proposes SPIN, a new method that lets Large Language Models (LLMs) improve themselves.
  • Self-play mechanism: improves model performance through self-play, without additional human-annotated data.
  • Mathematical and theoretical grounding: validates the method's effectiveness through mathematical modeling and theoretical argument.

1 Introduction

Large Language Models (LLMs) show remarkable capabilities in areas such as mathematical reasoning, code generation, and text generation. They have made particularly strong progress on mathematical problem solving and are bringing the era of artificial general intelligence (AGI) closer. However, these gains typically require large amounts of human-annotated data, which is costly to obtain. This work proposes Self-Play fIne-tuNing (SPIN), a new method for fine-tuning an LLM without additional human-annotated data, allowing the model to improve its own performance through self-play.

1.1 Problem Definition and Background

Modern LLMs perform well across many domains, but improving them further requires vast amounts of human-annotated data, which is expensive and sometimes infeasible to collect. This paper therefore proposes a method that lets an LLM improve itself without additional human-annotated data.

1.2 Prior Work and Existing Methods

Prior work has mainly relied on Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF). Both depend heavily on human-annotated data and are costly and time-consuming. The SPIN method proposed here aims to overcome these limitations through self-play.


2 Related Work

2.1 Self-Play

Self-play, used mainly in reinforcement learning, lets a model learn by playing against itself. The approach was highly successful in AlphaGo Zero, and this paper applies it to improving LLMs.

2.2 Supervised Fine-Tuning (SFT)

SFT fine-tunes a pre-trained model for a specific task. It generally requires high-quality human-annotated data and trains the model to fit that data.

2.3 RL Fine-Tuning

RL fine-tuning further adapts a model to a specific task by maximizing a reward function, typically learned from human feedback.


3 Problem Setting

3.1 Mathematical Model and Notation

An LLM is defined as follows: \(p_{\theta}\) denotes the model with parameters \(\theta\), \(x = [x_1, \ldots, x_n]\) is the input prompt, and \(y = [y_1, \ldots, y_m]\) is the output response. The response is sampled from the conditional distribution \(p_{\theta}(\cdot \mid x)\), which factorizes as \(p_{\theta}(y \mid x) = \prod_{j=1}^{m} p_{\theta}(y_j \mid y_1, y_2, \ldots, y_{j-1}, x)\). This autoregressive factorization constitutes a Markov process in which each token is generated conditioned only on the previously generated tokens.

3.2 The SFT Objective

SFT minimizes the negative log-likelihood loss \(\mathcal{L}_{\text{SFT}}(\theta) = -\mathbb{E}_{x \sim q(\cdot)} \mathbb{E}_{y \sim p_{\text{data}}(\cdot \mid x)} [\log p_{\theta}(y \mid x)]\). This loss encourages the model to match the distribution of the training data.

3.3 The RL Fine-Tuning Objective

RL fine-tuning maximizes the objective \(\mathbb{E}_{x \sim q(\cdot)} \mathbb{E}_{y \sim p_{\theta}(\cdot \mid x)} [r(x, y)] - \lambda \, \text{KL}(p_{\theta}(\cdot \mid x) \,\|\, p_{\text{ref}}(\cdot \mid x))\), where \(\lambda\) is a regularization parameter that keeps the model from drifting too far from the reference model.

Applying SPIN, the LLM improved its performance steadily without using human-annotated data; the experiments show that the method can effectively strengthen an LLM without such data.


4. Method

  • Optimizes the language model through iterative self-play training
  • Introduces a new training technique to address inefficient use of existing data
  • Designs the algorithm carefully on the basis of mathematical reasoning and optimization theory

This work introduces a new fine-tuning method that improves LLM performance without relying on additional human or AI feedback. Consider a high-quality supervised fine-tuning (SFT) dataset \(\mathcal{S}_{\text{SFT}} = \{(x, y)\}_n\), where \(x\) is sampled from the marginal distribution \(q(x)\) and \(y \sim p_{\text{data}}(y \mid x)\).

Given a supervised fine-tuned LLM \(p_{\theta_0}\), applying the SFT approach again is ineffective and can even degrade performance. Moreover, without human or AI feedback it may be infeasible to obtain a preference dataset for RL fine-tuning (e.g., RLHF and RLAF), which blocks that route as well.

4.1 Training the Main Player

This section describes how the main player is trained to distinguish LLM responses from human responses. Motivated by the integral probability metric (IPM), the objective drives the main player \(f_{t+1}\) to maximize the expected value gap between the target data distribution \(p_{\text{data}}\) and the opponent player's distribution \(p_{\theta_t}\):

\[\mathbb{E}_{x \sim q(\cdot),\, y \sim p_{\text{data}}(\cdot \mid x)} \left[ f_{t+1}(x, y) \right] - \mathbb{E}_{x \sim q(\cdot),\, y' \sim p_{\theta_t}(\cdot \mid x)} \left[ f_{t+1}(x, y') \right]\]

Here \(\mathcal{F}_t\) is a sequence of expressive function classes specified later. The value of \(f_{t+1}(x, y)\) reflects the main player's degree of belief that \(y\) came from \(p_{\text{data}}\).

Updating the Opponent Player

After optimizing \(f_{t+1}\), we obtain the opponent player's parameters \(\theta_{t+1}\). Given two responses \(y\) and \(y'\) to the same prompt \(x\), \(f_{t+1}\) infers that the one with the higher value came from the real data distribution. The opponent player's goal is to find a better LLM whose responses the main player cannot distinguish from real data, which is expressed as the following optimization problem:

\[\max_{\theta} \; \mathbb{E}_{x \sim q(\cdot),\, y \sim p_{\theta}(\cdot \mid x)} \left[ f_{t+1}(x, y) \right] - \lambda \, \text{KL}\big(p_{\theta}(\cdot \mid x) \,\|\, p_{\theta_t}(\cdot \mid x)\big)\]


5. Theoretical Analysis

This section provides a theoretical analysis of Algorithm 1 from Section 4. Under monotonicity and convexity assumptions on the loss function \(\ell\), the global optimum of the training objective is attained if and only if the parameter \(\theta_t\) generates the target data distribution.

Assumption 5.1

The loss function \(\ell(t) : \mathbb{R} \rightarrow \mathbb{R}\) is monotonically decreasing, i.e., \(\ell'(t) \leq 0\) for all \(t\), with \(\ell'(0) < 0\). In addition, \(\ell(t)\) is convex.

This assumption covers loss functions commonly used in machine learning, such as the logistic loss \(\ell(t) = \log(1 + \exp(-t))\). Under it, the main theorem guarantees that the iteratively updated parameters \(\theta_t\) converge to the global optimum.

Theorem 5.1

(Convergence to the Global Optimum) Under Assumption 5.1, the parameter \(\theta_t\) attains the global optimum if and only if the distribution it generates matches the data distribution.

Proof sketch: under the stated assumptions, the iterative process of Algorithm 1 drives the parameters \(\theta_t\) toward values for which the model generates data matching the target distribution \(p_{\text{data}}\).


6. Experiments

  • Base model and dataset: zephyr-7b-sft-full and Ultrachat200k
  • Iterative training with dataset growth: from 50k to 100k examples
  • Multi-faceted evaluation: HuggingFace Open LLM Leaderboard

6.1 Experimental Setup

Model and Datasets

This study uses zephyr-7b-sft-full, which is built on the pre-trained Mistral-7B (Jiang et al., 2023) and further fine-tuned by HuggingFace on the Ultrachat200k dataset. Ultrachat200k is a high-quality subset of 200k dialogues selected from UltraChat (Ding et al., 2023), a corpus of roughly 1.4M dialogues.

Procedure

From Ultrachat200k, 50k prompts are randomly selected and the base model generates synthetic responses for them. Further training then follows the optimization method described in Section 4.1. Across iterations, the synthetic data from the most recent iteration is combined with the newly generated data, giving a dataset of 50k examples at iteration 0 and 100k at iterations 1, 2, and 3 (see the sketch below). The model is trained for 2 epochs at each iteration.
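The iteration-wise dataset bookkeeping described above can be sketched as follows; response generation is mocked with placeholder strings (a hypothetical `mock_generate` helper), since the point here is only the 50k to 100k growth per iteration.

```python
# Sketch of the iteration-wise dataset construction: 50k fresh synthetic responses
# at iteration 0, then each later iteration combines its newly generated 50k with
# the 50k from the most recent iteration (100k total).  mock_generate is a
# hypothetical placeholder for sampling responses from the iteration-t model.
def mock_generate(iteration, prompts):
    return [(p, f"<response from model at iteration {iteration}>") for p in prompts]

def build_iteration_datasets(prompts, num_iterations=4):
    datasets, previous = [], []
    for t in range(num_iterations):
        fresh = mock_generate(t, prompts)
        datasets.append(fresh if t == 0 else fresh + previous)
        previous = fresh
    return datasets

prompts_50k = [f"prompt-{i}" for i in range(50_000)]
print([len(d) for d in build_iteration_datasets(prompts_50k)])  # [50000, 100000, 100000, 100000]
```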

Evaluation

Evaluation uses the widely adopted HuggingFace Open LLM Leaderboard as the benchmark suite. The leaderboard comprises six datasets covering different LLM capabilities: commonsense reasoning (Arc, Clark et al., 2018; HellaSwag, Zellers et al., 2019; Winogrande, Sakaguchi et al., 2021), multi-task language understanding (MMLU, Hendrycks et al., 2020), human-falsehood mimicry (TruthfulQA, Lin et al., 2021), and math problem solving (GSM8k, Cobbe et al., 2021). Models are prompted with few-shot in-context examples plus the question, and the average score across all datasets is reported.

This setup provides a systematic way to improve the model's generalization and performance through iterative data generation and training.

6.2 SPIN Effectively Improves Benchmark Performance

  • Self-play training improves language-model performance
  • Analysis of the effect of iterative training and dataset growth
  • Consistent gains confirmed across a variety of benchmarks

SPIN starts from responses generated by the base model zephyr-7b-sft-full. In the initial iteration, the model is retrained on responses generated for the existing dataset and improves gradually; in later iterations, training continues with newly generated responses. The process can be described by the following optimization problem:

\[\max_{\theta} \; \mathbb{E}_{x \sim q(\cdot),\, y \sim p_{\theta}(\cdot \mid x)} \left[ f_{t+1}(x, y) \right] - \lambda \, \text{KL}\big(p_{\theta}(\cdot \mid x) \,\|\, p_{\theta_t}(\cdot \mid x)\big)\]

The \(\lambda \, \text{KL}\) term constrains the model from drifting too far from the previous iteration's distribution.

On the HuggingFace Open LLM Leaderboard, SPIN improved the average score by 2.66% over the baseline model, with gains of more than 5% on the TruthfulQA and GSM8k benchmarks in particular. These results indicate that SPIN can meaningfully improve model performance while reusing only the existing dataset.

6.3 Ablation Studies

Effect of Data Size and Iterative Training

To study SPIN's effect, experiments were run with training datasets of varying sizes. Performance improved as the data size grew, showing that SPIN learns more effectively with more data. In contrast, additional epochs of standard SFT did not yield substantial gains, indicating that SPIN's self-play training uses the data more efficiently than the existing approach.

Performance on Additional Tasks

Beyond the HuggingFace Open LLM Leaderboard, SPIN was evaluated on additional tasks such as MT-Bench, Big-Bench, and OpenBookQA. It showed consistent gains across these benchmarks as well and, on MT-Bench, surpassed vicuna-13b-v1.5. These results demonstrate SPIN's generality and learning efficiency.


7. Conclusion

The proposed SPIN method shows that a language model can be improved without additional human input or AI feedback. Through iterative self-play training, the model progressively improves itself, which can be usefully applied to a wide range of language-processing tasks.


1 Introduction

Large Language Models (LLMs) have begun a groundbreaking era in artificial general intelligence (AGI), demonstrating extraordinary capabilities across a wide range of domains that require intricate reasoning and specialized knowledge. These models excel in areas such as mathematical reasoning/problem solving (Cobbe et al., 2021; Wei et al., 2022; Lewkowycz et al., 2022), code generation/programming (Chen et al., 2021; Austin et al., 2021; Li et al., 2022), text generation (Bubeck et al., 2023; Anil et al., 2023; Touvron et al., 2023), summarization and creative writing, among others. A significant advancement in LLMs is the post-pretraining alignment with the more desirable behaviors (Mishra et al., 2021; Victor et al., 2022; Chung et al., 2022; Thoppilan et al., 2022), a process often reliant on the costly human-annotated data. Typical alignment methods include Supervised Fine-Tuning (SFT) (Ouyang et al., 2022; Tunstall et al., 2023a) based on human demonstrations, and Reinforcement Learning from Human Feedback (RLHF) (Christiano et al., 2017; Ziegler et al., 2019; Stiennon et al., 2020; Bai et al., 2022a) based on human preferences.

All the aforementioned alignment methods require a substantial volume of human annotated data. Therefore, there is increasing interest in developing fine-tuning methods that can effectively utilize human data, thereby streamlining the alignment process. This motivates us to study fine-tuning LLMs without the need for additional human-annotated data beyond the fine-tuning dataset. Our study is also related to the broader goal of converting weak models to strong models without the requirement for extra training data, which is of central interest in machine learning that can be traced back to the boosting algorithms (Kearns and Valiant, 1994; Schapire, 1990; Freund, 1995; Freund and Schapire, 1997). The self-training algorithm (Vapnik, 1999; Grandvalet and Bengio, 2004; Lee, 2013) has also been proved to be able to convert weak learners to strong learners in mixture models without the need for additional labeled data (Frei et al., 2022; Kou et al., 2022). However, the pursuit of autonomously enhancing a weak LLM without external guidance is both intriguing and understudied. This raises the following question:

Can we empower a weak LLM to improve itself without acquiring additional human annotated data?

In this paper, we answer this question affirmatively. Inspired by the success of self-play mechanisms (Samuel, 2000) in games, exemplified by AlphaGo Zero (Silver et al., 2017b), AlphaZero (Silver et al., 2017a), with historical roots traced back to TD-Gammon (Tesauro et al., 1995), we propose to convert a weak LLM to a strong one through the lens of self-play, where the model is enhanced by playing against itself without requiring any direct supervision. In particular, we propose a novel fine-tuning method called Self-Play fIne-tuNing (SPIN), which begins from a supervised fine-tuned model. SPIN allows the LLM to engage in self-play, eliminating the need for an expert annotator such as a human or more advanced LLMs like GPT-4. In detail, with the LLM from the previous iteration t, denoted by pθt, we employ it to generate responses y′ to the prompts x in the human-annotated SFT dataset. The subsequent objective is to find a new LLM pθt+1, capable of distinguishing the responses y′ generated by pθt from the responses y generated by humans. This process can be seen as a two-player game: the main player, or the new LLM pθt+1, seeks to discern between the responses of the opponent player pθt and human-generated responses, while the opponent, or the old LLM pθt, generates responses as similar as possible to those in the human-annotated SFT dataset. The new LLM pθt+1 is obtained by fine-tuning the old LLM pθt to prefer responses from pdata over pθt, resulting in a distribution pθt+1 that is more aligned with pdata. In the next iteration, the newly obtained LLM pθt+1 becomes the opponent for response generation, with the self-play process aiming for the LLM to eventually converge to pθ∗ = pdata, so that the strongest possible LLM can no longer differentiate the responses generated by its previous version and those generated by the human.

Interestingly, our method exhibits similarity with the recently introduced direct preference optimization (DPO) method (Rafailov et al., 2023), with the notable distinction being the self-play nature of our method. Consequently, our approach stands out by eliminating the need for extra human preference data, a requirement present in the DPO method. Additionally, the self-play mechanism in our method resembles the idea of generative adversarial networks (GAN) (Goodfellow et al., 2014; Arjovsky et al., 2017), albeit that both the discriminator (main player) and the generator (the opponent) in our method are instances of the same LLM from different iterations. Theoretically, we prove that our method converges when the distribution of the LLM is identical to the target data distribution, i.e., pθt = pdata. Our experimental results on zephyr-7b-sft-full (Tunstall et al., 2023a), a fine-tuned LLM based on Mistral-7B (Jiang et al., 2023), show that while continued training using SFT on its own SFT dataset Ultrachat200k (Ding et al., 2023) reaches a performance plateau or even diminished evaluation scores, our method consistently improves zephyr-7b-sft-full across successive iterations while leveraging only a 50k subset of Ultrachat200k dataset. Ultimately, SPIN effectively improves the base model’s average score from 58.14 to 63.16 on the HuggingFace Open LLM Leaderboard (Beeching et al., 2023) with remarkable 10%+ improvement in scores on GSM8k and TruthfulQA, and from 5.94 to 6.78 on MT-Bench (Zheng et al., 2023). Notably, SPIN achieves results that are even comparable to models trained on additional 62k preference dataset (Tunstall et al., 2023a) on Open LLM leaderboard and MT-Bench.

Concurrent to our work, Singh et al. (2023) proposed the use of synthetic data with binary feedback in self-training, reducing the reliance on human data. In contrast, our approach eliminates the need for additional binary feedback from humans or an extra reward model thanks to the self-play mechanism. Additionally, Burns et al. (2023) employed a weak LLM model as the guidance to train stronger LLMs in a fashion of weak-to-strong generation. Unlike Burns et al. (2023), which necessitates both a weak supervisor and a strong model, our SPIN operates effectively with a single LLM.

Notation. We use lowercase letters and lowercase boldface letters to denote scalars and vectors, respectively. We use [N] to denote the index set {1, . . . , N}. In the function space, let F be the function class. The symbol pdata designates the target data distribution, while p represents the conditional probability of the LLM's response (i.e., the LLM policy).

2 Related Work

Self-Play. Self-play (Samuel, 1959; Tesauro et al., 1995), where the algorithm learns by playing against itself, has gained notable attention due to its effectiveness in multi-agent reinforcement learning (MARL). This method involves agents engaging in interactions with copies of themselves, enabling an increasing level of challenge and complexity within the learning environment. A fundamental work in the field of self-play is AlphaGo Zero (Silver et al., 2017b), which demonstrated exceptional performance against human players using a self-play learning scheme. Subsequent research has expanded upon the concept of self-play, exploring various adaptations and implementations (Anthony et al., 2017; Lanctot et al., 2017; Bansal et al., 2018; Hernandez-Leal et al., 2018; Muller et al., 2019; Vinyals et al., 2019). Our method takes the self-play approach akin to AlphaGo Zero, which can convert a weak model to a strong one without additional human-annotated data. While the effectiveness of self-play in MARL is well-established, to our knowledge, our work is the first to apply this approach to the enhancement of LLMs.

Synthetic Data for LLMs. In the context of supervised fine-tuning (SFT) of LLMs, human-crafted data has proven to be a remarkably effective source that enhances the performance of LLMs on tasks such as code generation (Roziere et al., 2023; Yang et al., 2023) and mathematical reasoning (Yuan et al., 2023; Luo et al., 2023). While human data typically exhibits high quality, acquiring sufficient amount of such data poses a challenge in cost. In light of this consideration, the use of synthetic data has become increasingly popular and considered as a proxy for human data. This approach primarily leverages advanced LLMs such as the GPT series (Radford et al., 2019; Brown et al., 2020; OpenAI, 2023) as the guidance to generate high-quality data (Josifoski et al., 2023; Taori et al., 2023; Chiang et al., 2023; Li et al., 2023). Recent research has also highlighted the rephrasing capability of LLMs in prompting for better LLM response (Deng et al., 2023; Prasad et al., 2023) as well as augmenting synthetic data for more effective SFT (Yu et al., 2023; Liu et al., 2023). In contrast to prior studies that utilized more advanced models for synthetic data generation when pretraining or fine-tuning a target model, our approach directly generates synthetic data from the target model itself.

Curriculum Learning. In deep learning, it has been observed that training models using data samples arranged in a strategically meaningful order can lead to improved performance compared to training on randomly shuffled data. This approach is commonly known as curriculum learning (Bengio et al., 2009; Soviany et al., 2022). Initial studies in curriculum learning introduced efficient algorithms that adhere to an 'easy-to-hard' progression (Spitkovsky et al., 2009; Kumar et al., 2010; Lee and Grauman, 2011; Zhang et al., 2015). In the field of Natural Language Processing (NLP), criteria such as sentence length and term frequency are commonly utilized (Cirik et al., 2016; Zhang et al., 2018; Liu et al., 2018). More recent developments include the application of curriculum learning algorithms in multi-modal learning (Liu et al., 2021; Wu et al., 2022). Our work shares a similar idea to curriculum learning, wherein the training data evolves iteratively, beginning with responses that are easy to distinguish from human-annotated data and gradually progressing to more challenging instances.

3 Problem Setting and Preliminaries

We consider a Large Language Model (LLM) parameterized by \(\theta\) and denoted by \(p_{\theta}\). The model takes as input a sequence \(x = [x_1, \ldots, x_n]\), commonly referred to as the prompt, to generate the corresponding response \(y = [y_1, \ldots, y_m]\). The response \(y\) is therefore considered as a sample from the conditional probability distribution \(p_{\theta}(\cdot \mid x)\).

In LLMs, \(x_i\) and \(y_j\) represent individual tokens from a predetermined vocabulary within the sequences \(x\) and \(y\), respectively. The autoregressive model \(p_{\theta}\) generates tokens sequentially for a given position, leveraging only the sequence of previously generated tokens. This model therefore constitutes a Markov process, where the conditional probability distribution \(p_{\theta}(y \mid x)\) can be expressed through a decomposition as follows:

\[p_{\theta}(y \mid x) = p_{\theta}(y_1, y_2, \ldots, y_m \mid x) = \prod_{j=1}^{m} p_{\theta}(y_j \mid y_1, y_2, \ldots, y_{j-1}, x)\]
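As a concrete illustration of this factorization, the minimal sketch below scores a response under a causal LM by summing the per-token conditional log-probabilities; GPT-2 is used only as a small stand-in model, and the prompt/response strings are made up.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Score a response y under a causal LM as the sum of per-token conditional
# log-probabilities, mirroring p_theta(y|x) = prod_j p_theta(y_j | y_<j, x).
# GPT-2 is a small stand-in model; prompt and response strings are made up.
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def response_log_prob(prompt: str, response: str) -> float:
    # Assumes tokenizing prompt+response keeps the prompt tokens as a prefix.
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(full_ids).logits                        # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)   # position j predicts token j+1
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[0, prompt_len - 1:].sum().item()         # response tokens only

print(response_log_prob("### Instruction: Say hello.\n\n### Response:", " Hello!"))
```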

3.1 Supervised Fine-Tuning

Supervised fine-tuning (SFT) is employed to tailor a pre-trained LLM to specific downstream tasks, leveraging a relatively smaller dataset of labeled examples in comparison to the large-scale pre-training data (Ouyang et al., 2022; Yu et al., 2023). In this context, we consider a specific task where the prompts, denoted by \(x\), are derived from a specified distribution \(q(\cdot)\). The notation \(p_{\text{data}}(\cdot \mid x)\) then represents the probability distribution of the associated high-quality responses \(y\) from the training data. Consequently, SFT involves training the LLM to minimize the following negative log-likelihood loss associated with these distributions:

\[\mathcal{L}_{\text{SFT}}(\theta) = -\mathbb{E}_{x \sim q(\cdot)} \mathbb{E}_{y \sim p_{\text{data}}(\cdot \mid x)} [\log p_{\theta}(y \mid x)]\]

It should be noted that excluding \(x \sim q(\cdot)\) from the expectation term yields the typical cross-entropy loss, expressed as:

\[-\mathbb{E}_{y \sim p_{\text{data}}(\cdot \mid x)} [\log p_{\theta}(y \mid x)]\]

\(\mathcal{L}_{\text{SFT}}(\theta)\) attains its minimum when the model's predictive distribution \(p_{\theta}(y \mid x)\) aligns perfectly with the distribution of the labeled high-quality responses \(p_{\text{data}}(y \mid x)\).

Consequently, the LLM after SFT is anticipated to generate responses that closely resemble those from \(p_{\text{data}}(y \mid x)\). This procedure is therefore expected to significantly enhance the model's performance in generating appropriate responses for a specific task.
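A minimal sketch of one SFT step under this objective is shown below, again with GPT-2 as a stand-in model and a made-up (prompt, response) pair; prompt positions are masked so the loss averages \(-\log p_{\theta}(y_j \mid y_{<j}, x)\) over response tokens only.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# One SFT step: minimize -log p_theta(y|x) averaged over response tokens.
# Prompt positions are labeled -100 so the built-in cross-entropy ignores them.
# GPT-2 and the (prompt, response) pair are stand-ins for illustration.
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
opt = torch.optim.AdamW(lm.parameters(), lr=1e-5)

prompt, response = "### Instruction: Name a color.\n\n### Response:", " Blue."
ids = tok(prompt + response, return_tensors="pt").input_ids
prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]

labels = ids.clone()
labels[:, :prompt_len] = -100          # do not penalize the prompt tokens
loss = lm(ids, labels=labels).loss     # = -(1/m) sum_j log p_theta(y_j | y_<j, x)
loss.backward()
opt.step()
print(float(loss))
```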

3.2 RL Fine-Tuning

RL fine-tuning (Christiano et al., 2017; Bai et al., 2022a; Gao et al., 2023a) offers another method for enhancing the specific capabilities of general-purpose pre-trained models. Typically, RL fine-tuning is employed subsequent to SFT to achieve improved alignment for LLMs (Tunstall et al., 2023a).

For a given sequence pair \((x, y)\), RL fine-tuning necessitates a deterministic reward function \(r(x, y)\). The higher the reward \(r(x, y)\), the better the response \(y\) is to the given prompt \(x\). The objective of the RL fine-tuning process is then to maximize the following objective function:

\[\mathbb{E}_{x \sim q(\cdot)} \mathbb{E}_{y \sim p_{\theta}(\cdot \mid x)} [r(x, y)] - \lambda \, \text{KL}(p_{\theta}(\cdot \mid x) \,\|\, p_{\text{ref}}(\cdot \mid x))\]

where the Kullback-Leibler (KL) regularization enforces the new model \(p_{\theta}\) to be close to the reference model \(p_{\text{ref}}\), and \(\lambda > 0\) is the regularization parameter to control the deviation of the new model \(p_{\theta}\) from the reference model \(p_{\text{ref}}\). In practice, the reference model \(p_{\text{ref}}\) is often initialized as the supervised fine-tuned model. The inclusion of KL regularization is vital for preventing excessive deviation from the reference model, which in turn reduces the risk of mode collapse.
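The toy sketch below illustrates the reward-minus-KL trade-off with a single-token categorical "policy", where both the expected reward and the KL term can be computed exactly; the reward values and \(\lambda\) are arbitrary choices for illustration, not anything from the paper.

```python
import torch

# Toy view of the RL fine-tuning objective E[r(x,y)] - lambda * KL(p_theta || p_ref):
# the "response" is a single token from a categorical policy, so both terms are exact.
# The reward vector and lambda are arbitrary illustrative choices.
torch.manual_seed(0)
vocab = 8
reward = torch.randn(vocab)                      # toy r(x, y) per possible response
theta = torch.zeros(vocab, requires_grad=True)   # tunable policy logits (p_theta)
p_ref = torch.full((vocab,), 1.0 / vocab)        # uniform reference policy (p_ref)
lam = 0.5

opt = torch.optim.Adam([theta], lr=0.1)
for _ in range(200):
    p = torch.softmax(theta, dim=-1)
    expected_reward = (p * reward).sum()
    kl = (p * (p.log() - p_ref.log())).sum()
    loss = -(expected_reward - lam * kl)         # maximize objective <=> minimize negative
    opt.zero_grad(); loss.backward(); opt.step()

print("expected reward:", float((torch.softmax(theta, -1) * reward).sum()))
```

A larger \(\lambda\) keeps the learned policy closer to the uniform reference, trading reward for stability, which is exactly the role of the KL regularization described above.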

Meanwhile, the primary challenge in RL fine-tuning lies in finding a good reward function. Typically, this function requires training on a preference dataset. The compilation of such a dataset demands significant resources, often involving comprehensive evaluations either by human annotators, i.e., reinforcement learning from human feedback (RLHF) (Christiano et al., 2017; Bai et al., 2022a) or strong AI agents, i.e., reinforcement learning from AI feedback (RLAF) (Bai et al., 2022b).

4 Method

In this section, we introduce a new fine-tuning method for enhancing the performance of LLMs without relying on additional human or AI feedback. Consider a high-quality supervised fine-tuning (SFT) dataset \(\mathcal{S}_{\text{SFT}} = \{(x, y)\}_n\), where \(x\) is sampled from the marginal distribution \(q(x)\) and \(y \sim p_{\text{data}}(y \mid x)\).

Given a supervised fine-tuned LLM \(p_{\theta_0}\), further application of the SFT approach with \(\mathcal{S}_{\text{SFT}}\) will be ineffective and potentially lead to worse performance. In addition, without human and/or AI feedback, it becomes infeasible to acquire a preference dataset for RL fine-tuning (e.g., RLHF and RLAF). This hinders the application of RL fine-tuning techniques.

Figure 1: Example of ground truth completion compared to the fine-tuned model generation at iteration 0 and 1.

We can observe that the model generation at iteration 0, although fluent, incorrectly quantifies transportation preferences with specific percentages that are potentially hallucinations. The model generation at iteration 1 provides a qualitative summary of the transportation forms at Southampton without specific percentage, aligning more closely with the ground truth while adding more details.

In iteration \(t + 1\), the opponent is the LLM from the previous iteration, denoted by \(p_{\theta_t}\), which generates responses \(y'\) for those prompts \(x\) in the SFT dataset according to \(p_{\theta_t}(\cdot \mid x)\). Our method, therefore, consists of the following two steps at iteration \(t + 1\): (1) training the main player, and (2) updating the opponent player.

Training the Main Player

We begin by illustrating how we expect the main player to be trained to distinguish LLM responses from human responses. Motivated by the integral probability metric (IPM) (Müller, 1997), we formulate our objective function such that the main player \(f_{t+1}\) maximizes the expected value gap between the target data distribution \(p_{\text{data}}\) and the opponent player’s distribution \(p_{\theta_t}\):

\[\mathbb{E}_{x \sim q(\cdot),\, y \sim p_{\text{data}}(\cdot \mid x)} \left[ f_{t+1}(x, y) \right] - \mathbb{E}_{x \sim q(\cdot),\, y' \sim p_{\theta_t}(\cdot \mid x)} \left[ f_{t+1}(x, y') \right]\]

where \(\mathcal{F}_t\) is a sequence of highly expressive function classes that we will determine in later deduction. The subscript \(t\) in \(\mathcal{F}_t\) indicates that the function class depends on \(p_{\theta_t}\). Given such an \(f_{t+1}\) and a response sequence \(y\) to the prompt \(x\), the value of \(f_{t+1}(x, y)\) reflects the main player's degree of belief that \(y\) originates from \(p_{\text{data}}\) rather than \(p_{\theta_t}\). Ideally, the main player \(f_{t+1}\) should yield a high value when \(y \sim p_{\text{data}}(\cdot \mid x)\) and a low value when \(y' \sim p_{\theta_t}(\cdot \mid x)\).

Instead of solving the above equation directly, we can also solve the following more general optimization problem:

\[\min_{f_{t+1} \in \mathcal{F}_t} \; \mathbb{E}_{x \sim q(\cdot),\, y \sim p_{\text{data}}(\cdot \mid x)} \left[ \ell \left( f_{t+1}(x, y) \right) \right] + \mathbb{E}_{x \sim q(\cdot),\, y' \sim p_{\theta_t}(\cdot \mid x)} \left[ \ell \left( -f_{t+1}(x, y') \right) \right]\]

where \(\ell(\cdot)\) is a loss function that is both monotonically decreasing and convex. For example, a linear loss function \(\ell(t) = -t\) reduces the above to the minimization version of the earlier equation. However, using a linear loss function results in an unbounded objective value, which, during continuous training, leads to a negative infinite value of \(f(x, y')\) on the opponent player’s responses. Therefore, in our work, we choose the logistic loss function \(\ell(t) := \log(1 + \exp(-t))\) for its non-negativity, smoothness, and exponentially decaying tail as \(t \to \infty\). Such a choice of loss function aids in preventing the excessive growth in the absolute value of \(f\). It is worth noting that the objective function defined above with a linear loss reduces to a similar IPM framework as in Wasserstein Generative Adversarial Networks (WGAN) (Arjovsky et al., 2017). However, our approach differs in both the choice of the function class \(\mathcal{F}_t\) and the training procedure.
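The sketch below trains a generic score function \(f\) with exactly this logistic objective, using a small MLP over made-up Gaussian features as a stand-in for the main player; the paper instead ties \(f\) to the LLM itself, so treat the parameterization here as an assumption for illustration only.

```python
import torch
import torch.nn.functional as F

# Train a generic score function f with the logistic objective
#   E[l(f(x, y))] + E[l(-f(x, y'))],  l(t) = log(1 + exp(-t)),
# so that f is high on "human" pairs and low on "model" pairs.  The MLP over
# made-up Gaussian features is a stand-in; the paper ties f to the LLM itself.
torch.manual_seed(0)
f = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
opt = torch.optim.Adam(f.parameters(), lr=1e-2)

feats_human = torch.randn(256, 16) + 1.0   # stand-in features of (x, y) with y ~ p_data
feats_model = torch.randn(256, 16) - 1.0   # stand-in features of (x, y') with y' ~ p_theta_t

for _ in range(100):
    # l(f) = softplus(-f) on human pairs, l(-f) = softplus(f) on model pairs
    loss = F.softplus(-f(feats_human)).mean() + F.softplus(f(feats_model)).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# f now assigns noticeably higher scores to the "human" side than to the "model" side.
print(f(feats_human).mean().item(), f(feats_model).mean().item())
```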

Updating the Opponent Player

Previously, we discussed the training of \(f_{t+1}\) given the opponent player’s distribution \(p_{\theta_t}\). Now suppose we have optimized our main player \(f_{t+1}\) to distinguish \(p_{\text{data}}\) from \(p_{\theta_t}\) within a certain function class \(\mathcal{F}_t\). We elaborate on how we get parameter \(\theta_{t+1}\) of the opponent player. Specifically, when presented with two responses \(y\) and \(y'\) to the same prompt \(x\), \(f_{t+1}\) assesses the values \(f_{t+1}(x, y)\) and \(f_{t+1}(x, y')\). It then infers that the response with the higher value is from the real data distribution \(p_{\text{data}}\) and the response with the lower value is attributed to the LLM \(p_{\theta_t}\). Subsequently, the objective of the opponent player is to find a better LLM that generates responses indistinguishable from \(p_{\text{data}}\) for the main player. This is achieved by maximizing the expected value:

\[\mathbb{E}_{x \sim q(\cdot),\, y \sim p_{\theta}(\cdot \mid x)} \left[ f_{t+1}(x, y) \right]\]

In addition, to prevent excessive deviation of \(p_{\theta_{t+1}}\) and stabilize the self-play, we incorporate a Kullback-Leibler (KL) regularization term. Putting these together gives rise to the following optimization problem:

\[\max_{\theta} \; \mathbb{E}_{x \sim q(\cdot),\, y \sim p_{\theta}(\cdot \mid x)} \left[ f_{t+1}(x, y) \right] - \lambda \, \text{KL}\big(p_{\theta}(\cdot \mid x) \,\|\, p_{\theta_t}(\cdot \mid x)\big)\]


End-to-end Training Objective

We integrate the two steps discussed above into a single end-to-end training objective with an update rule for \(\theta_{t+1}\). The opponent player's KL-regularized problem admits a closed-form solution proportional to \(p_{\theta_t}(y \mid x) \exp\!\left(\lambda^{-1} f_{t+1}(x, y)\right)\); requiring this solution to remain an LLM \(p_{\theta}\) amounts to parameterizing the main player as \(f_{t+1}(x, y) = \lambda \log \frac{p_{\theta}(y \mid x)}{p_{\theta_t}(y \mid x)}\). Plugging this parameterization into the main player's objective arrives at the update rule:

\[\theta_{t+1} = \arg \min_{\theta \in \Theta} \mathcal{L}_{\text{SPIN}}(\theta, \theta_t)\]

where \(\mathcal{L}_{\text{SPIN}}\) is the training objective defined as follows:

\[\mathcal{L}_{\text{SPIN}}(\theta, \theta_t) = \mathbb{E}_{x \sim q(\cdot),\, y \sim p_{\text{data}}(\cdot \mid x),\, y' \sim p_{\theta_t}(\cdot \mid x)} \left[ \ell\!\left( \lambda \log \frac{p_{\theta}(y \mid x)}{p_{\theta_t}(y \mid x)} - \lambda \log \frac{p_{\theta}(y' \mid x)}{p_{\theta_t}(y' \mid x)} \right) \right]\]

Namely, the opponent player chosen from the previous iteration \(t\) is employed to train the main player at iteration \(t + 1\), resulting in the LLM parameterized by \(\theta_{t+1}\). Then we determine the next opponent player at iteration \(t + 1\) by directly copying the LLM parameter \(\theta_{t+1}\), which is then used in training the main player at iteration \(t + 2\). The detailed algorithm is presented in Algorithm 1.
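A minimal sketch of the resulting per-pair loss is shown below; it assumes the summed response log-probabilities \(\log p(y \mid x)\) under the current model and the frozen opponent are already available, and the numeric values in the usage example are made up.

```python
import torch
import torch.nn.functional as F

# Per-pair SPIN loss with the logistic l and the log-ratio main player.
# Inputs are summed response log-probabilities log p(y|x) under the current
# model (theta) and the frozen opponent (theta_t); the numbers below are made up.
def spin_loss(logp_real, logp_real_old, logp_gen, logp_gen_old, lam=0.1):
    margin = lam * ((logp_real - logp_real_old) - (logp_gen - logp_gen_old))
    return F.softplus(-margin).mean()     # l(t) = log(1 + exp(-t))

loss = spin_loss(torch.tensor([-42.0]), torch.tensor([-45.0]),   # human response y
                 torch.tensor([-38.0]), torch.tensor([-37.0]))   # self-generated y'
print(float(loss))
```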

Remark 4.1

The SPIN objective above (Equation 4.7 in the paper) bears resemblance to direct preference optimization (DPO) (Rafailov et al., 2023) for RL fine-tuning. However, SPIN exhibits significant distinctions from DPO. Specifically, SPIN is applied to supervised fine-tuning (SFT) and relies solely on the SFT dataset, represented by pairs \((x, y)\). In sharp contrast, DPO is designed for RL fine-tuning and necessitates a preference dataset, represented by \((x, y_w, y_l)\), where \(y_w\) and \(y_l\) denote the winner (chosen) and loser (rejected) responses, respectively. DPO demands that, at the instance level, \(y_w\) is superior to \(y_l\). In comparison, our method requires that, at the distribution level, the target \(p_{\text{data}}\) should be distinguishable from the weak LLM \(p_{\theta}\) before it becomes a strong one. In terms of algorithm design, DPO implements a single-iteration approach, while our method facilitates an iterative self-play strategy, as outlined in Algorithm 1.

Algorithm 1: Self-Play Fine-Tuning (SPIN)

Input: SFT dataset \(\{(x_i, y_i)\}_{i \in [N]}\); initial supervised fine-tuned model \(p_{\theta_0}\)

for \(t = 0, \ldots, T - 1\) do

  • Generate synthetic responses \(y'_i \sim p_{\theta_t}(\cdot \mid x_i)\) for \(i = 1, \ldots, N\)
  • Update \(\theta_{t+1} = \arg \min_{\theta \in \Theta} \mathcal{L}_{\text{SPIN}}(\theta, \theta_t)\)
  • Copy \(\theta_{t+1}\) to set the next opponent player

end for
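To make the loop concrete without any of the heavy machinery, the following self-contained toy replaces the LLM with a single categorical distribution over six "responses" (an assumption purely for illustration) and runs a few SPIN-style iterations of generate, discriminate via the logistic loss, and update.

```python
import torch
import torch.nn.functional as F

# Toy run of Algorithm 1 with the "LLM" replaced by one categorical distribution
# over 6 responses (no prompts or tokens).  Each iteration freezes the current
# model as the opponent, samples y ~ p_data and y' ~ p_theta_t, and updates theta
# with the logistic SPIN-style loss on the log-ratio margin.
torch.manual_seed(0)
V, lam, iterations, steps = 6, 1.0, 4, 300
p_data = torch.tensor([0.40, 0.30, 0.15, 0.10, 0.04, 0.01])   # target "data" distribution
theta = torch.zeros(V, requires_grad=True)                     # p_theta_0 is uniform

for t in range(iterations):
    old_logp = torch.log_softmax(theta.detach(), dim=-1)       # frozen opponent p_theta_t
    y_real = torch.multinomial(p_data, 2048, replacement=True)          # y  ~ p_data
    y_gen = torch.multinomial(old_logp.exp(), 2048, replacement=True)   # y' ~ p_theta_t
    opt = torch.optim.Adam([theta], lr=0.05)
    for _ in range(steps):
        logp = torch.log_softmax(theta, dim=-1)
        margin = lam * ((logp[y_real] - old_logp[y_real]) - (logp[y_gen] - old_logp[y_gen]))
        loss = F.softplus(-margin).mean()
        opt.zero_grad(); loss.backward(); opt.step()

print("target :", p_data.tolist())
print("learned:", [round(v, 2) for v in torch.softmax(theta.detach(), -1).tolist()])
```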

5 Theoretical Analysis


In this section, we provide a theoretical analysis for Algorithm 1 in Section 4. Under the monotonicity and convexity assumptions of the objective function \(\ell\), we show that the global optimum is obtained if and only if the parameter \(\theta_t\) generates the data distribution. We summarize our assumptions as follows:

Assumption 5.1

The loss function \(\ell(t) : \mathbb{R} \rightarrow \mathbb{R}\) is monotonically decreasing, i.e., \(\forall t, \ell'(t) \leq 0\) and satisfies \(\ell'(0) < 0\). In addition, \(\ell(t)\) is a convex function.

Assumption 5.1 holds for a wide range of loss functions commonly used in machine learning, including correlation loss \(\ell(t) = 1 - t\), hinge loss \(\ell(t) = \max(0, 1 - t)\), exponential loss \(\ell(t) = \exp(-t)\), and logistic loss \(\ell(t) = \log(1 + \exp(-t))\).
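A quick numerical sanity check (not a proof) that these listed losses are non-increasing and convex on a grid:

```python
import torch
import torch.nn.functional as F

# Numerical sanity check (not a proof) that the listed losses are monotonically
# non-increasing and convex on a grid, as required by Assumption 5.1.
t = torch.linspace(-4.0, 4.0, steps=801)
losses = {
    "correlation l(t) = 1 - t": 1.0 - t,
    "hinge l(t) = max(0, 1 - t)": torch.clamp(1.0 - t, min=0.0),
    "exponential l(t) = exp(-t)": torch.exp(-t),
    "logistic l(t) = log(1 + exp(-t))": F.softplus(-t),
}
for name, ell in losses.items():
    first_diff = ell[1:] - ell[:-1]                       # decreasing  => <= 0
    second_diff = ell[2:] - 2.0 * ell[1:-1] + ell[:-2]    # convex      => >= 0
    print(f"{name}: decreasing={bool((first_diff <= 1e-6).all())}, "
          f"convex={bool((second_diff >= -1e-6).all())}")
```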

Under Assumption 5.1, we present the following theorem, which is pivotal in understanding the optimization dynamics of our method.

Theorem 5.1

(Convergence to the Global Optimum) Under Assumption 5.1, the parameter \(\theta_t\) converges to the global optimum if and only if the data distribution generated by \(p_{\theta_t}\) matches the target data distribution \(p_{\text{data}}\).

Proof sketch: under the given assumptions, the iterative process described in Algorithm 1 ensures that the parameters \(\theta_t\) converge to values for which the LLM generates data aligned with the target distribution \(p_{\text{data}}\).

6 Experiments

6.1 Experiment Setup

Model and Datasets. In this study, we adopt zephyr-7b-sft-full as our base model. This model derives from the pre-trained Mistral-7B (Jiang et al., 2023) and has been further fine-tuned on the SFT dataset Ultrachat200k1 by HuggingFace. Ultrachat200k represents a high-quality 200k subset of the larger UltraChat (Ding et al., 2023) corpus, which comprises approximately 1.4M dialogues produced using OpenAI's Turbo APIs. From Ultrachat200k, we randomly sample 50k prompts and use the base model to generate the synthetic responses. We subsequently follow the optimization method described in Section 4.1 for further training. In multiple iterations, we leverage the synthetic data from the most recent iteration and add it to the newly generated synthetic data, therefore resulting in a synthetic dataset size of 50k at iteration 0 and 100k at iterations 1, 2 and 3. At each iteration, we train our model for 2 epochs.

Evaluation. We employ the widely used Huggingface Open LLM Leaderboard (Beeching et al., 2023) as our evaluation benchmark, using the same Language Model Evaluation Harness library (Gao et al., 2023b). This leaderboard encompasses 6 different datasets, each focusing on a specific capability of LLMs. Collectively, these datasets provide a thorough assessment framework, evaluating LLMs on commonsense reasoning (Arc (Clark et al., 2018), HellaSwag (Zellers et al., 2019), Winogrande (Sakaguchi et al., 2021)), multi-task language understanding (MMLU (Hendrycks et al., 2020)), human falsehood mimicry (TruthfulQA (Lin et al., 2021)) and math problem solving (GSM8k (Cobbe et al., 2021)). In evaluation, the language models are prompted with few-shot in-context examples and the question. We follow the standard approach and report the average score across all datasets. In Table 1, we detail the evaluation setting adopted by both the leaderboard and our experiments. We leave further implementation details to Appendix A.

Table 1: Detailed information of HuggingFace Open LLM Leaderboard. For each evaluation dataset, we present the number of few-shot examples and metric adopted for evaluation.

6.2 SPIN Effectively Improves Benchmark Performance

We demonstrate the effectiveness of SPIN using HuggingFace Open LLM Leaderboard as a wide range of evaluation. In Table 2, we compare the performance of our fine-tuned model by SPIN after iterations 0 to 3 with the base model zephyr-7b-sft-full. We can observe that SPIN exhibits remarkable effectiveness in improving the model’s performance by further leveraging the SFT dataset, on which the base model has already been fully fine-tuned. At iteration 0, where model responses are generated from zephyr-7b-sft-full, we observe an overall improvement of 2.66% on the average score. The improvement is particularly significant on the TruthfulQA and GSM8k benchmarks, with improvement exceeding 5% and 10% respectively. At iteration 1, we employ the LLM model from iteration 0 to generate new responses for SPIN, adhering to the procedure outlined in Algorithm 1. This iteration yields further enhancements of 1.32% on average, and especially significant on the Arc Challenge and TruthfulQA benchmarks. Subsequent iterations continue this trend of incremental improvement across various tasks. Meanwhile, the improvement at iteration t + 1 is naturally smaller than that at iteration t. As the iterative training progresses, the degree of improvement gradually approaches zero, suggesting that the model has reached a limiting point in the last iteration.

1 https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k

Figure 2: The average score of SPIN at different iterations on the HuggingFace Open LLM leaderboard datasets. For “SFT”, we report the performance of our base model zephyr-7b-sft-full, which has been fine-tuned on the same dataset we use to generate synthetic data.

Table 2: Test performance of SPIN based on zephyr-7b-sft-full across HuggingFace Open LLM Leaderboard datasets. We also denote the average improvement over last iteration in the Average column.

  • Comparison with DPO. zephyr-7b-beta is a model derived from zephyr-7b-sft-full, trained with DPO on approximately 62k preference data. This data, the UltraFeedback Binarized dataset2, comprises both chosen and rejected completions evaluated by GPT-4. We note that DPO requires either human input or advanced language model feedback to determine the preference, making data generation a rather expensive procedure. In contrast, our SPIN only requires the initial model itself. Moreover, unlike DPO, which requires a new data source, our method exclusively leverages the existing SFT dataset. In Figure 3, we show the performance comparison of SPIN at iterations 0 and 1 (employing 50k SFT data) with DPO training, from the same SFT checkpoint. We can observe that, while DPO leverages more data from new sources, SPIN based on the existing SFT data can already achieve comparable average performance to DPO training at iteration 0. From iteration 1, SPIN even surpasses the performance of DPO on the leaderboard benchmark.

2 https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized

Figure 3: Performance comparison with DPO training across the six benchmark datasets. Self-play at iteration 0 achieves comparable performance to DPO training with 62k new data. At iteration 1, self-play has already surpassed DPO training on the majority of datasets.

6.3 Ablation Studies

In this subsection, we examine the effect of synthetic dataset size and training epochs within an iteration. Our analysis demonstrates the effectiveness of the synthetic data used by SPIN compared to the SFT data, as well as the necessity of iterative training in SPIN. Furthermore, to comprehensively assess the performance improvements of SPIN, we perform additional evaluations on benchmark tasks distinct from those in the Open LLM leaderboard.

Training Size. We investigate the effect of varying training data size on the performance of SPIN. In Figure 4, we demonstrate the effect of training size for SPIN during iteration 0 and additionally compare with SFT with the full original dataset. Specifically, for the SFT baseline, we fully fine-tune Mistral-7B on Ultrachat200k for three epochs and report first epoch performance as the starting point (with x-axis 0) in the figure for SFT. For SPIN, we report the zephyr-7b-sft-full checkpoint as the starting point, which has also been fine-tuned on Ultrachat200k for one epoch. We select the training size of SPIN at iteration 0 to be 14k, 26k, and 50k and generate the data accordingly, ensuring that the larger dataset encompasses the smaller dataset. The performance of SPIN was then evaluated after 1 epoch of self-play fine-tuning for each training size. We can observe that, while SPIN results in notable improvement with increasing training sizes, SFT on further epochs 2 and 3 fails to yield more than 1% improvement. Lastly, in Table 3, we also show the performance of SFT from zephyr-7b-sft-full on Ultrachat200k for one epoch. While self-play fine-tuning with synthetic data from zephyr-7b-sft-full effectively improves its performance, simply fine-tuning it again on the SFT data leads to degraded performance, as similarly observed in Figure 4.

Table 3: Test performance of zephyr-7b-sft-full fine-tuned on Ultrachat200k for 1 more epoch across HuggingFace Open LLM benchmark datasets. SFT fails to further leverage the fine-tuning data for performance enhancement and even results in degraded performance.

Figure 4: The scaling effect of training size of SPIN compared to SFT on the average score of Open LLM Leaderboard. For SPIN, we consider training data of sizes 14k, 26k and 50k where the larger dataset contains the smaller dataset. The starting point for SPIN (with x-axis 0) is the zephyr-7b-sft-full checkpoint, which has been fine-tuned on Ultrachat200k for 1 epoch. We report the model performance trained for 1 epoch with SPIN on the varying sizes of dataset. We additionally compare with SFT, where we fine-tune Mistral-7B on Ultrachat200k for 3 consecutive epochs and report the model performance at the first epoch as the starting point (with x-axis 0).

  • Iterative Training vs. Training for More Epochs. We further study the training within iteration 0 and compare with the performance achieved in iteration 1, particularly contrasting the test performance obtained from extended training duration with that from the next iteration. Figure 5 depicts the performance trajectory of the model trained using SPIN over multiple epochs at iteration 0. It is evident that the most substantial improvement occurs during the first two epochs, followed by only modest gains in subsequent epochs. Notably, SPIN exhibits robustness and stability; extending the training duration does not diminish performance but rather maintains a rather consistent level. Nevertheless, the observation suggests an inherent limitation to the performance achievable within a single iteration, thereby underscoring the necessity for iterative training. As shown by the test performance achieved at iteration 1 in the figures, extending the training in iteration 0 fails to reach the performance comparable to iteration 1.

  • Further Investigation on More Tasks. Here, we further investigate the performance of SPIN on a broader variety of tasks, including MT-Bench (Zheng et al., 2023), Big-Bench (bench authors, 2023) and OpenBookQA (Mihaylov et al., 2018) in addition to the Open LLM Leaderboard tasks. Specifically, we use the following tasks from Big-Bench-Hard for a more comprehensive evaluation, including Causal Judgment (causal reasoning), Sports Understanding (commonsense reasoning) and Formal Fallacies (logical reasoning). In Table 4, we show the resulting scores of SPIN on MT-Bench as well as those tasks from Big-Bench. In Figure 6, we detail the model performances on MT-Bench with regard to different types of questions. We can see a notably robust improvement in the performance of SPIN on various tasks besides the HuggingFace Benchmark, without major degradation. Notably, on MT-Bench, the model fine-tuned by SPIN has surpassed the performance of vicuna-13b-v1.5 (Chiang et al., 2023) with a score of 6.57.

Figure 5: The SPIN training dynamics of zephyr-7b-sft-full on the 50k synthetic data with regard to the number of training epochs during iteration 0. We can observe that iterative training is pivotal as training for more epochs during iteration 0 reaches a limit and cannot surpass iteration 1.

Table 4: Test performance on other reasoning benchmark datasets for SPIN at different iterations and zephyr-7b-sft-full. We report the average score for MT-Bench and the accuracy score for Big Bench datasets under standard few-shot CoT evaluation. On OpenBookQA, we report acc_norm with 1-shot example as used in Anil et al. (2023). As similar to Open LLM Leaderboard evaluation, we observe a steady improvement in performance on the other benchmark tasks, with no significant degradation.

7 Conclusion and Discussion

This paper introduces a novel fine-tuning method, SPIN, to convert a weak LLM to a strong LLM by unleashing the full power of human-annotated data. Central to this method is a self-play mechanism, wherein a main player (the LLM) is fine-tuned to differentiate the responses of the opponent player (the LLM from the previous iteration) from the target data distribution, and the LLM is iteratively aligned with the target data distribution. Therefore, SPIN facilitates the LLM's iterative self-evaluation and enhancement through self-play. In comparison to supervised fine-tuning and RL fine-tuning methods, SPIN enables the LLM to self-improve without additional human data or feedback from stronger LLMs. Empirical results demonstrate that SPIN significantly enhances LLM performance across diverse benchmarks, even outperforming models trained with additional human data or AI feedback.

  • Limitation and Future Work. Our theoretical results demonstrate that the optimization process of SPIN converges if and only if the LLM’s distribution aligns with pdata. Therefore, our study focuses on a fixed target data distribution generated by humans, which inherently imposes a ceiling on the performance of fine-tuned LLM. Exploring the dynamically changing target data distribution is an important direction to overcome this limitation and elevate the LLM’s performance beyond this ceiling or even to a super-human level. Moreover, considering the resource demands of synthetic data generation, another promising avenue for further exploration is to reduce the volume of required synthetic data.

Figure 6: Model performance on MT-Bench. We compare SPIN across different iterations with the base SFT model. Starting from iteration 1, our fine-tuned model by SPIN robustly outperforms the SFT checkpoint on all evaluation aspects.

Appendix

A Experiment Details

A.1 Hyperparameters and Implementation Details

We use the Alignment Handbook library (Tunstall et al., 2023b) as the codebase for our self-play fine-tuning method SPIN, which includes DeepSpeed ZeRO-3 (Rajbhandari et al., 2020) and FlashAttention-2 (Dao, 2023) to reduce training costs. We train our models with the RMSProp (Hinton et al., 2012) optimizer with no weight decay for all iterations, as commonly used in fine-tuning LLMs for alignment, with a global batch size of 64, 10% warmup steps, and bfloat16 precision.

We set the peak learning rate to be \(5 \times 10^{-7}\) for iterations 0 and 1, and decay this peak learning rate to \(1 \times 10^{-7}\) for iterations 2 and 3 as we approach the end of self-play fine-tuning. Lastly, we choose \(\beta = 0.1\) and a maximum sequence length of 2048 tokens as in Tunstall et al. (2023b). We note that at the last iteration (iter-3) where the model is close to convergence, we increase the value of \(\beta\) to 5.0.
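The reported settings can be collected into a small configuration sketch; the `nn.Linear` model below is only a placeholder, since in practice these hyperparameters are applied to zephyr-7b-sft-full through the Alignment Handbook training stack rather than a hand-rolled loop.

```python
import torch

# The reported settings gathered into a sketch.  The nn.Linear "model" is a
# placeholder: in practice these hyperparameters apply to zephyr-7b-sft-full
# trained through the Alignment Handbook, not this hand-rolled loop.
SPIN_CONFIG = {
    0: dict(peak_lr=5e-7, beta=0.1),
    1: dict(peak_lr=5e-7, beta=0.1),
    2: dict(peak_lr=1e-7, beta=0.1),
    3: dict(peak_lr=1e-7, beta=5.0),   # beta is raised to 5.0 at the final iteration
}
GLOBAL_BATCH_SIZE = 64
WARMUP_FRACTION = 0.10
MAX_SEQ_LEN = 2048
DTYPE = torch.bfloat16

model = torch.nn.Linear(8, 8)          # placeholder for the LLM parameters
for it, cfg in SPIN_CONFIG.items():
    opt = torch.optim.RMSprop(model.parameters(), lr=cfg["peak_lr"], weight_decay=0.0)
    print(f"iteration {it}: lr={cfg['peak_lr']}, beta={cfg['beta']}, "
          f"batch={GLOBAL_BATCH_SIZE}, warmup={WARMUP_FRACTION:.0%}, "
          f"max_len={MAX_SEQ_LEN}, dtype={DTYPE}")
```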

We use the Accelerate library (Gugger et al., 2022) to generate our synthetic data using distributed inference with multiple GPUs with a global batch size of 64. We consider the prompting template:

### Instruction: {prompt}

### Response:

as commonly used in Taori et al. (2023). For Ultrachat200k containing multi-round conversations, we only sample the first round as our prompt and ground truth completion pairs.
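A minimal sketch of synthetic-response generation with this template is shown below; GPT-2 stands in for zephyr-7b-sft-full, the sampling settings are illustrative assumptions, and distributed inference with Accelerate is omitted.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Generate a synthetic response with the prompting template above.  GPT-2 stands
# in for zephyr-7b-sft-full, and the sampling settings are illustrative defaults;
# batched/distributed generation with Accelerate is omitted for brevity.
TEMPLATE = "### Instruction: {prompt}\n\n### Response:"

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def generate_synthetic(prompt: str, max_new_tokens: int = 64) -> str:
    inputs = tok(TEMPLATE.format(prompt=prompt), return_tensors="pt")
    with torch.no_grad():
        out = lm.generate(**inputs, max_new_tokens=max_new_tokens,
                          do_sample=True, top_p=0.9,
                          pad_token_id=tok.eos_token_id)
    return tok.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print(generate_synthetic("Describe the main forms of transportation in Southampton."))
```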

A.2 Generation Examples

In Tables 5 and 6, we further provide the generation examples of our fine-tuned model by SPIN from different iterations. We can observe an improvement in response quality as compared to the generation of the SFT checkpoint. Meanwhile, the model generations at higher iterations typically become more concise than at iteration 0 and resemble the ground truth completion better.
