
Model | Zephyr

  • Related Project: Private
  • Category: Paper Review
  • Date: 2023-11-02

Zephyr: Direct Distillation of LM Alignment

  • url: https://arxiv.org/abs/2310.16944
  • pdf: https://arxiv.org/pdf/2310.16944
  • abstract: We aim to produce a smaller language model that is aligned to user intent. Previous research has shown that applying distilled supervised fine-tuning (dSFT) on larger models significantly improves task accuracy; however, these models are unaligned, i.e. they do not respond well to natural prompts. To distill this property, we experiment with the use of preference data from AI Feedback (AIF). Starting from a dataset of outputs ranked by a teacher model, we apply distilled direct preference optimization (dDPO) to learn a chat model with significantly improved intent alignment. The approach requires only a few hours of training without any additional sampling during fine-tuning. The final result, Zephyr-7B, sets the state-of-the-art on chat benchmarks for 7B parameter models, and requires no human annotation. In particular, results on MT-Bench show that Zephyr-7B surpasses Llama2-Chat-70B, the best open-access RLHF-based model. Code, models, data, and tutorials for the system are available at https://github.com/huggingface/alignment-handbook.

Contents

TL;DR


  1. Methods for improving model performance: distillation-focused training on data, optimization with AI feedback, and direct preference optimization (dDPO).
  2. Research method and measurement: benchmarking aimed at improving the intent alignment and efficiency of machine-learning models.
  3. Result analysis: the Zephyr-7B model shows strong performance across a range of benchmarks.

1 Introduction

The capabilities of small, open large language models (LLMs) have grown rapidly in recent years. These models have evolved from early GPT-2-like models to accurate, compact models trained on far more tokens. It has also been shown that their accuracy can be further improved through distilled supervised fine-tuning, in which the outputs of a more capable teacher model are used to train a student model. However, such models still fall short of the teacher's performance and tend not to be aligned with user intent. This work proposes dDPO, a new approach that pursues the alignment objective through AI feedback (AIF).

  • (Problem) Zephyr aims to produce a smaller language model that is aligned to user intent. Prior work showed that distilled supervised fine-tuning (dSFT) from a larger teacher improves task accuracy, but the resulting models remain unaligned and do not respond well to natural prompts; this work proposes a way to address that.
  • (Method) To fix this shortcoming, the authors experiment with preference data from AI Feedback (AIF): using a dataset of outputs ranked by a teacher model, they apply distilled direct preference optimization (dDPO) and observe improved intent alignment.
  • (Results) Zephyr-7B sets the state of the art on chat benchmarks among 7B-parameter models, requires no human annotation, and surpasses Llama2-Chat-70B on MT-Bench.
  • Note: see also the Reward Model part (Section 3.2.1) of Qwen.


2 Related Work

Advances in LLMs have served as a starting point for research on efficient fine-tuning, longer prompt contexts, retrieval-augmented generation (RAG), and quantization. To improve the performance of smaller models, the field has increasingly distilled knowledge from larger models, mostly through distillation pipelines focused on the dSFT stage.


3 Method

3.1 dSFT (Distilled Supervised Fine-Tuning)

dSFT is the initial stage, in which the student model \(\pi_\theta\) is trained on a distilled dataset. Given a set of seed prompts \(x_1^0, \ldots, x_J^0\), the teacher model generates instruction-following responses, and these responses are used to train the student. Mathematically, the student is optimized so that \(\pi_\theta(y \mid x)\) maximizes the probability of the teacher's response \(y\) for each prompt \(x\).
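
As a rough illustration, this objective is ordinary next-token cross-entropy on the teacher's responses. Below is a minimal PyTorch sketch, assuming a Hugging Face-style causal LM and labels in which prompt tokens are already masked with -100 (the helper name is hypothetical, not the authors' code):

```python
import torch
import torch.nn.functional as F


def dsft_loss(student, input_ids, labels):
    """Token-level negative log-likelihood of the teacher's response.

    `labels` equals `input_ids` with prompt tokens set to -100 so that only
    the response tokens contribute to the loss (standard SFT masking).
    """
    logits = student(input_ids=input_ids).logits          # (B, T, V)
    # Shift so that token t is predicted from tokens < t.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
```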

3.2 AIF (AI Feedback through Preferences)

AI feedback is the process of constructing a training dataset based on preferences. The teacher model scores the quality of the generated responses; the highest-scoring response is kept as the preferred (chosen) response and a lower-scoring one as the dispreferred (rejected) response. The dataset is defined as follows.

\[D = \{(x, y_w, y_l) \mid x \in X,\; y_w, y_l \in Y,\; \mathrm{score}(y_w) > \mathrm{score}(y_l)\}\]

3.3 dDPO (Distilled Direct Preference Optimization)

dDPO further optimizes the model using the preference data produced in the AIF step. The key idea is to express the reward function \(r_\theta(x, y)\) in terms of the optimal policy \(\pi^*\) and the reference policy \(\pi_\text{dSFT}\), so that preferences can be optimized directly without training a separate reward model. The optimal policy takes the following form.

\[\pi^*(y \mid x) = \frac{\pi_\text{dSFT}(y \mid x)\,\exp\!\left(r_\theta(x, y)/\beta\right)}{Z(x)}, \qquad Z(x) = \sum_{y'} \pi_\text{dSFT}(y' \mid x)\,\exp\!\left(r_\theta(x, y')/\beta\right)\]
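
Rearranging this expression, as in the DPO derivation (Rafailov et al., 2023), gives the implicit reward that dDPO optimizes without ever fitting a separate reward model; a sketch of that step, with \(Z(x)\) constant in \(y\):

```latex
% Solving the optimal-policy expression for the reward gives a log-ratio of
% policies plus a term that does not depend on y, so it cancels whenever two
% responses to the same prompt are compared:
r_\theta(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_\text{dSFT}(y \mid x)} + \beta \log Z(x)
```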


4 Experimental Details

In the experiments, Zephyr-7B was evaluated on a variety of benchmarks. The model can be trained in a few hours on 16 A100 GPUs; after training with dSFT and dDPO, it was evaluated on several dialogue benchmarks.


5 Results and Observations

Compared with dSFT models, Zephyr-7B shows improved performance on the MT-Bench and AlpacaEval benchmarks, which suggests that the dDPO approach plays an important role in intent alignment. The authors also evaluate how performance changes under different training recipes.


[Reference 1] Post from Thomas Wolf, Co-founder at Hugging Face

A post about the development of the Zephyr project, by Thomas Wolf, Co-founder at Hugging Face.

There is a beautiful story that just happened in AI so let me share it for a lighter tone weekend post among all the doom stories in our AI field this week.

It’s a story of people on three continents building and sharing in the open a new small efficient and state-of-the-art AI model. It started a couple of months ago when a new team in the AI scene released their first model from their headquarters in Paris (France): Mistral 7B. Impressive model, small and very strong performances in the benchmarks, better than all previous models of this size.

And open source! So you could build on top of it.

Lewis in Bern (Switzerland) and Ed (in Lyon, in the South of France), both from the H4 team, a team of researchers in model fine-tuning and alignment, were talking about it over a coffee, in one of these gatherings that often happen at Hugging Face to break the distance between people (literal distance as HF is a remote company). What about fine-tuning it using this new DPO method that a research team from Stanford in California just posted on Arxiv, says one? Hey, that’s a great idea, replies the other. We’ve just built a great code base (with Nathan, Nazneen, Costa, Younes and all the H4 team and TRL community), let’s use it!

The next day they start diving in the datasets openly shared on the HF hub and stumble upon two interesting large and good quality fine-tuning datasets recently open-sourced by OpenBMB, a Chinese team from Tsinghua: UltraFeedback and UltraChat.

A few rounds of training experiments confirm the intuition, the resulting model is super strong, by far the strongest they have ever seen in their benchmarks from Berkeley and Stanford (LMSYS and Alpaca). Join Clementine, the big boss of the open evaluation leaderboard. Her deep dive into the model capabilities confirms the results: impressive performance. But the H4 team also hosts a famous faculty member, Pr. Sasha Rush, Associate Professor at Cornell University in his daytime, hacker at HF in his nighttime. Joining the conversation, he proposes to quickly draft a research paper to organize and share all the details with the community.

A few days later, the model, called Zephyr (a wind like Mistral), paper, and all details are shared with the world. Quickly, other companies everywhere in the world start to use it. LlamaIndex, a famous data framework and community, shares how the model blew their expectations on real-life use-case benchmarks, while researchers and practitioners discuss the paper and work on the Hugging Face hub.

All this happened in just a few weeks catalyzed by open access to knowledge, models, research, and datasets released all over the world (Europe, California, China) and by the idea that people can build upon one another work in AI to bring real-world value with efficient and open models.

Stories like this are numerous everywhere around us and make me really proud of the AI community and see how we can build amazingly useful things together.

*Source: LinkedIn post by Thomas Wolf, Co-founder of Hugging Face


1 INTRODUCTION

Smaller, open large language models (LLMs) have greatly increased in ability in recent years, from early GPT-2-like models (Wang & Komatsuzaki, 2021) to accurate and compact models (Touvron et al., 2023; Penedo et al., 2023; Jiang et al., 2023) that are trained on significantly more tokens than the “compute-optimal” amount suggested by the Chinchilla scaling laws (De Vries, 2023). In addition, researchers have shown that these models can be further trained through distilled supervised fine-tuning (dSFT) based on proprietary models to increase their accuracy (Taori et al., 2023). In this approach, the output of a more capable teacher model is used as supervised data for the student model.

Distillation has proven to be an effective tool for improving open models on a range of different tasks (Chiang et al., 2023); however, it does not reach the performance of the teacher models (Gudibande et al., 2023). Users have noted that these models are not “intent aligned”, i.e. they do not behave in a manner that aligns with human users’ preferences. This property often leads to outputs that do not provide correct responses to queries.

Intention alignment has been difficult to quantify, but recent work has led to the development of benchmarks like MT-Bench (Zheng et al., 2023) and AlpacaEval (Li et al., 2023) that specifically target this behavior. These benchmarks yield scores that correlate closely with human ratings of model outputs and confirm the qualitative intuition that proprietary models perform better than open models trained with human feedback, which in turn perform better than open models trained with distillation. This motivates careful collection of human feedback for alignment, often at enormous cost at scale, such as in LLAMA2-CHAT (Touvron et al., 2023).

In this work, we consider the problem of aligning a small open LLM entirely through distillation. The main step is to utilize AI Feedback (AIF) from an ensemble of teacher models as preference data, and apply distilled direct preference optimization as the learning objective (Rafailov et al., 2023). We refer to this approach as dDPO. Notably, it requires no human annotation and no sampling compared to using other approaches like proximal policy optimization (PPO) (Schulman et al., 2017). Moreover, by utilizing a small base LM, the resulting chat model can be trained in a matter of hours on 16 A100s (80GB).

To validate this approach, we construct Zephyr-7B, an aligned version of Mistral-7B (Jiang et al., 2023). We first use dSFT, based on the UltraChat (Ding et al., 2023) dataset. Next we use the AI feedback data collected in the UltraFeedback dataset (Cui et al., 2023). Finally, we apply dDPO based on this feedback data. Experiments show that this 7B parameter model can achieve performance comparable to 70B-parameter chat models aligned with human feedback. Results show improvements both in terms of standard academic benchmarks as well as benchmarks that take into account conversational capabilities. Analysis shows that the use of preference learning is critical in achieving these results. Models, code, and instructions are available at https://github.com/huggingface/alignment-handbook.

We note an important caveat for these results. We are primarily concerned with intent alignment of models for helpfulness. The work does not consider safety considerations of the models, such as whether they produce harmful outputs or provide illegal advice (Bai et al., 2022). Because distillation only works with the outputs of publicly available models, addressing safety is technically more challenging here, owing to the added difficulty of curating that type of synthetic data; it remains an important subject for future work.

2 RELATED WORK

There has been significant growth in the number of open large language models (LLMs) that have served as artifacts for the research community to study and use as a starting model for building chatbots and other applications. After the release of ChatGPT, the LLaMA model (Touvron et al., 2023) opened the doors to a wide range of research on efficient fine-tuning, longer prompt context, retrieval augmented generation (RAG), and quantization. After LLaMA, there has been a continuous stream of open-access, text-based LLMs including MosaicML’s MPT (ML, 2023), Together AI’s RedPajama-INCITE (AI, 2023), TII’s Falcon (Penedo et al., 2023), Meta’s Llama 2 (Touvron et al., 2023), and Mistral 7B (Jiang et al., 2023). Zephyr uses Mistral 7B as the starting point due to its strong performance.

Figure 2: The three steps of our method: (1) large scale, self-instruct-style dataset construction (UltraChat), followed by distilled supervised fine-tuning (dSFT), (2) AI Feedback (AIF) collection via an ensemble of chat model completions, followed by scoring by GPT-4 (UltraFeedback) and binarization into preferences, and (3) distilled direct preference optimization (dDPO) of the dSFT model utilizing the feedback data.

With the development of open models, researchers have worked on approaches to improve small model performance by distillation from larger models. This trend started with the self-instruct method (Wang et al., 2023) and the Alpaca model (Taori et al., 2023), which was followed by Vicuna (Chiang et al., 2023) and other distilled models. These works primarily focused on distilling the SFT stage of alignment, whereas we focus on both SFT and preference optimization. Some models such as WizardLM (Xu et al.) have explored methods beyond dSFT. Contemporaneously with this work, Xwin-LM (Team, 2023) introduced an approach that distilled preference optimization through PPO (Schulman et al., 2017). We compare to these approaches in our experiments.

Tools for benchmarking and evaluating LLMs have greatly evolved to keep up with the pace of innovation in generative AI. Powerful LLMs such as GPT-4 and Claude are used as evaluators to judge model responses by scoring model outputs or ranking responses in a pairwise setting. The LMSYS chatbot arena benchmarks LLMs in anonymous, randomized battles using crowdsourcing (Zheng et al., 2023). The models are ranked based on their Elo ratings on the leaderboard. AlpacaEval is an example of another such leaderboard that compares models in a pairwise setting but instead uses bigger LLMs such as GPT-4 and Claude in place of humans (Dubois et al., 2023). In a similar spirit, MT-Bench uses GPT-4 to score model responses on a scale of 1-10 for multi-turn instructions across task categories such as reasoning, roleplay, math, coding, writing, humanities, STEM and extraction (Zheng et al., 2023). The Hugging Face Open LLM leaderboard (Beeching et al., 2023), the Chain-of-Thought Hub (Fu et al., 2023), ChatEval (Sedoc et al., 2019), and FastEval (fas, 2023) are examples of other tools for evaluating chat models. We present results by evaluating on MT-Bench, AlpacaEval, and the Hugging Face Open LLM Leaderboard.

3 METHOD

The goal of this work is to align an open-source large-language model to the intent of the user. Throughout the work we assume access to a larger teacher model \(\pi_T\) which can be queried by prompted generation. Our goal is to produce a student model \(\pi_\theta\), and our approach follows similar stages as InstructGPT (Ouyang et al., 2022), as shown in Figure 2.

Distilled Supervised Fine-Tuning (dSFT) Starting with a raw LLM, we first need to train it to respond to user prompts. This step is traditionally done through supervised fine-tuning (SFT) on a dataset of high-quality instructions and responses (Chung et al., 2022; Sanh et al., 2021). Given access to a teacher language model, we can instead have the teacher generate instructions and responses (Taori et al., 2023), and train the student directly on these. We refer to this as distilled SFT (dSFT). Approaches to dSFT follow the self-instruct protocol (Wang et al., 2023). Let \(x_1^0, \ldots, x_J^0\) be a set of seed prompts, constructed to represent a diverse set of topical domains. A dataset is constructed through iterative self-prompting, where the teacher is used both to respond to an instruction and to refine the instruction based on the response.

AI Feedback through Preferences (AIF) Human feedback (HF) can provide additional signal to align LLMs. Human feedback is typically given through preferences on the quality of LLM responses (Ouyang et al., 2022). For distillation, we instead use AI preferences from the teacher model on generated outputs from other models. We follow the approach of UltraFeedback (Cui et al., 2023) which uses the teacher to provide preferences on model outputs. As with SFT, the system starts with a set of prompts \(x_1, \ldots, x_J\). Each prompt \(x\) is fed to a collection of four models \(\pi_1, \ldots, \pi_4\), e.g. Claude, Falcon, Llama, etc., each of which yields a response \(y_1 \sim \pi_1(\cdot|x), \ldots, y_4 \sim \pi_4(\cdot|x)\). These responses are then fed to the teacher model, e.g. GPT-4, which gives a score for the response \(s_1 \sim \pi_T (\cdot|x, y_1), \ldots, s_4 \sim \pi_T (\cdot|x, y_4)\). After collecting the scores for a prompt \(x\), we save the highest scoring response as \(y_w\) and a random lower-scoring response as \(y_l\). The final feedback dataset \(D\) consists of a set of these triples \((x, y_w, y_l)\).
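
A minimal sketch of this binarization step (the function and field names are hypothetical; the real UltraFeedback pipeline differs in detail):

```python
import random


def binarize(prompt, responses, scores, rng=None):
    """Turn teacher scores over N candidate responses into one (x, y_w, y_l) triple.

    The highest-scoring response becomes the "chosen" y_w; one of the remaining,
    lower-scoring responses is picked at random as the "rejected" y_l.
    """
    rng = rng or random.Random(0)
    ranked = sorted(zip(responses, scores), key=lambda rs: rs[1], reverse=True)
    y_w, _ = ranked[0]
    y_l, _ = rng.choice(ranked[1:])
    return {"prompt": prompt, "chosen": y_w, "rejected": y_l}
```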

Distilled Direct Preference Optimization (dDPO) The goal of the final step is to refine \(\pi_{dSFT}\) by maximizing the likelihood of ranking the preferred \(y_w\) over \(y_l\) in a preference model. The preference model is determined by a reward function \(r_{\theta}(x, y)\) which utilizes the student language model \(\pi_{\theta}\). Past work using AI feedback has primarily focused on using RL methods such as proximal policy optimization (PPO) to optimize \(\theta\) with respect to this reward. These approaches optimize \(\theta\) by first training the reward and then sampling from the current policy to compute updates.

Direct preference optimization (DPO) uses a simpler approach to directly optimize the preference model from the static data (Rafailov et al., 2023). The key observation is that the optimal reward function can be written in terms of the optimal LLM policy \(\pi^*\) and the original LLM policy \(\pi_{dSFT}\). Under an appropriate choice of preference model, they show that, for a constant \(\beta\) and partition function \(Z\),

\[\pi^*(y \mid x) \propto \pi_{dSFT}(y \mid x)\, \exp\!\left(\frac{r_{\theta}(x, y)}{\beta}\right)\]

By plugging this function of the reward into the preference model, the authors show that the objective can be written as,

\[\pi_\theta = \max_{\pi}\; \mathbb{E}_{(x, y_w, y_l) \sim D} \left[ \log \sigma\!\left( \beta \log \frac{\pi(y_w \mid x)}{\pi_{dSFT}(y_w \mid x)} - \beta \log \frac{\pi(y_l \mid x)}{\pi_{dSFT}(y_l \mid x)} \right) \right] \tag{1}\]

While this term looks complex, it implies a simple training procedure (a short sketch follows the list below). Starting with the dSFT version of the model, we iterate through each AIF triple \((x, y_w, y_l)\).

  1. Compute the probability for \((x, y_w)\) and \((x, y_l)\) from the dSFT model (forward-only).
  2. Compute the probability for \((x, y_w)\) and \((x, y_l)\) from the dDPO model.
  3. Compute Eq 1 and backpropagate to update.
  4. Repeat.
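
A minimal PyTorch sketch of steps 1-3, assuming Hugging Face-style causal LMs and a batch dict with tokenized chosen/rejected sequences (the helper names and field names are hypothetical; this is not the authors' implementation):

```python
import torch
import torch.nn.functional as F


def sequence_logprob(model, input_ids, labels):
    """Sum of log-probabilities of the response tokens (positions with labels != -100)."""
    logits = model(input_ids=input_ids).logits[:, :-1, :]
    labels = labels[:, 1:]
    mask = labels != -100
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = torch.gather(logp, 2, labels.clamp(min=0).unsqueeze(-1)).squeeze(-1)
    return (token_logp * mask).sum(-1)


def dpo_loss(policy, ref, batch, beta=0.1):
    # Step 1: probabilities of (x, y_w) and (x, y_l) under the frozen dSFT model.
    with torch.no_grad():
        ref_w = sequence_logprob(ref, batch["w_ids"], batch["w_labels"])
        ref_l = sequence_logprob(ref, batch["l_ids"], batch["l_labels"])
    # Step 2: the same quantities under the model being optimized (dDPO).
    pol_w = sequence_logprob(policy, batch["w_ids"], batch["w_labels"])
    pol_l = sequence_logprob(policy, batch["l_ids"], batch["l_labels"])
    # Step 3: Eq. 1 -- maximize log-sigmoid of the scaled log-ratio margin.
    margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
    return -F.logsigmoid(margin).mean()
```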

4 EXPERIMENTAL DETAILS

We conduct all of our fine-tuning experiments using Mistral 7B (Jiang et al., 2023), which is the current state-of-the-art base LM at the 7B parameter scale, and matches the performance of much larger models like LLaMA 34B on many NLP benchmarks. We use the Transformer Reinforcement Learning (TRL) library for fine-tuning (von Werra et al., 2020), in conjunction with DeepSpeed ZeRO3 (Rajbhandari et al., 2020) and FlashAttention-2 (Dao, 2023) to optimize memory and improve training speed. All models are trained with the AdamW optimizer and no weight decay. We did not experiment with parameter-efficient techniques such as LoRA (Hu et al., 2021), but expect similar results to hold with these methods. All experiments were run on 16 A100s using bfloat16 precision and typically took 2-4 hours to complete. For the full set of hyperparameters and instructions on how to train the models, see: https://github.com/huggingface/alignment-handbook.

4.1 DATASETS

We focus on two dialogue datasets that have been distilled from a mix of open and proprietary models, and have previously been shown to produce strong chat models like the UltraLM (Ding et al., 2023):

  • UltraChat (Ding et al., 2023) is a self-refinement dataset consisting of 1.47M multi-turn dialogues generated by GPT-3.5-TURBO over 30 topics and 20 different types of text material. We initially ran dSFT over the whole corpus, but found the resulting chat model had a tendency to respond with incorrect capitalization and would preface its answers with phrases such as “I don’t have personal experiences”, even for straightforward questions like “How do I clean my car?”. To handle these issues in the training data, we applied truecasing heuristics to fix the grammatical errors (approximately 5% of the dataset), as well as several filters to focus on helpfulness and remove the undesired model responses. The resulting dataset contains approximately 200k examples.

  • UltraFeedback (Cui et al., 2023) consists of 64k prompts, each of which have four LLM responses that are rated by GPT-4 according to criteria like instruction-following, honesty, and helpfulness. We construct binary preferences from UltraFeedback by selecting the highest mean score as the “chosen” response and one of the remaining three at random as “rejected”. We opted for random selection instead of selecting the lowest-scored response to encourage diversity and make the DPO objective more challenging. As noted above, this step is computed offline and does not involve any sampling from the reference model.

We make the pre-processed datasets available on the Hugging Face Hub (https://huggingface.co/collections/HuggingFaceH4/).

4.2 EVALUATION

Our main evaluations are on single-turn and multi-turn chat benchmarks that measure a model’s ability to follow instructions and respond to challenging prompts across a diverse range of domains:

  • MT-Bench (Zheng et al., 2023) is a multi-turn benchmark that consists of 160 questions across eight different areas of knowledge. In this benchmark, the model must answer an initial question, and then provide a second response to a predefined followup question. Each model response is then rated by GPT-4 on a scale from 1-10, with the final score given by the mean over the two turns.

  • AlpacaEval (Li et al., 2023) is a single-turn benchmark where a model must generate a response to 805 questions on different topics, mostly focused on helpfulness. Models are also scored by GPT-4, but the final metric is the pairwise win-rate against a baseline model (text-davinci-003); a brief sketch of both aggregate metrics follows below.
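
For concreteness, a tiny sketch of how these two aggregates are computed from judge outputs (toy values; not the official evaluation harnesses):

```python
def mt_bench_score(turn1_scores, turn2_scores):
    """MT-Bench-style aggregate: mean GPT-4 rating (1-10) over both turns."""
    scores = list(turn1_scores) + list(turn2_scores)
    return sum(scores) / len(scores)


def alpaca_eval_win_rate(judgements, model_label="model"):
    """AlpacaEval-style aggregate: % of prompts where the judge prefers the
    candidate model over the text-davinci-003 baseline."""
    wins = sum(1 for j in judgements if j == model_label)
    return 100.0 * wins / len(judgements)


# Toy usage with made-up judge outputs:
print(mt_bench_score([8, 6], [7, 5]))                        # 6.5
print(alpaca_eval_win_rate(["model", "baseline", "model"]))  # 66.66...
```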

We also evaluate Zephyr-7B on the Open LLM Leaderboard (Beeching et al., 2023), which measures the performance of LMs across four multiclass classification tasks: ARC (Clark et al., 2018), HellaSwag (Zellers et al., 2019), MMLU (Hendrycks et al., 2021), and TruthfulQA (Lin et al., 2022). Although this leaderboard does not directly measure the conversational quality of chat models, it does provide a useful signal to validate whether fine-tuning has introduced regressions on the base model’s reasoning and truthfulness capabilities.

Across all benchmarks, we compare Zephyr-7B against a variety of open and proprietary models, each with different alignment procedures. To facilitate comparison across open model sizes, we group our comparisons in terms of 7B models (XWIN-LM (Team, 2023), MISTRAL-INSTRUCT (Jiang et al., 2023), MPT-CHAT (ML, 2023), and STABLELM-α), as well as larger models up to 70B parameters (LLAMA2-CHAT (Touvron et al., 2023), VICUÑA (Chiang et al., 2023), WizardLM (Xu et al.), and GUANACO (Dettmers et al., 2023)). For the chat benchmarks, we also compare against proprietary models, including CLAUDE 2, GPT-3.5-TURBO and GPT-4 (OpenAI, 2023).


4.3 DETAILS OF SFT TRAINING

We train our SFT models for one to three epochs. We use a cosine learning rate scheduler with a peak learning rate of 2e-5 and 10% warmup steps. We train all models with a global batch size of 512 and use packing with a sequence length of 2048 tokens.
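
A minimal sketch of the SFT optimizer and schedule settings described above, using standard PyTorch and Hugging Face utilities (packing and the global batch size of 512 are assumed to be handled by the data pipeline; the helper name is hypothetical):

```python
import torch
from transformers import get_cosine_schedule_with_warmup


def build_sft_optimizer(model, num_training_steps):
    # AdamW with no weight decay, peak LR 2e-5, cosine decay, 10% warmup.
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.0)
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.1 * num_training_steps),
        num_training_steps=num_training_steps,
    )
    return optimizer, scheduler
```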

4.4 DETAILS OF DPO TRAINING

Similar to SFT, we train our DPO models for one to three epochs. We use a linear learning rate scheduler with a peak learning rate of 5e-7 and 10% warmup steps. We train all models with a global batch size of 32 and use β = 0.1 from Eq. (1) to control the deviation from the reference model. The final Zephyr-7B model was initialized from the SFT model that was trained for one epoch and further optimized for three DPO epochs (see Figure 3 for an epoch ablation on MT-Bench).
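
As a rough sketch of how such a run might be configured with the TRL library (the paper uses TRL, but this is not the authors' script, and argument names have shifted across TRL versions, e.g. newer releases move beta into a DPOConfig):

```python
from transformers import TrainingArguments
from trl import DPOTrainer  # TRL's DPO trainer; exact keyword arguments vary by version


def train_dpo(model, ref_model, preference_dataset, tokenizer):
    """Illustrative DPO run mirroring the reported hyperparameters:
    linear LR schedule, peak 5e-7, 10% warmup, global batch of 32, beta = 0.1."""
    args = TrainingArguments(
        output_dir="zephyr-7b-dpo",
        learning_rate=5e-7,
        lr_scheduler_type="linear",
        warmup_ratio=0.1,
        per_device_train_batch_size=2,  # 2 per GPU x 16 GPUs = global batch of 32
        num_train_epochs=3,
        bf16=True,
    )
    trainer = DPOTrainer(
        model=model,                        # dSFT checkpoint being optimized
        ref_model=ref_model,                # frozen dSFT reference model
        args=args,
        beta=0.1,                           # controls deviation from the reference model
        train_dataset=preference_dataset,   # (prompt, chosen, rejected) triples
        tokenizer=tokenizer,
    )
    trainer.train()
    return trainer
```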

5 RESULTS AND ABLATIONS

In this section we collect our main results; see Appendix A for sample model completions.

Table 1: Chat benchmark results for open-access and proprietary models on MT-Bench and AlpacaEval. A dash (−) indicates model or alignment information that is not publicly available, or an evaluation that is absent on the public leaderboards. Scores marked with an asterisk (∗) denote evaluations done by ourselves.

dDPO Improves Chat Capabilities. In Table 1 we compare the performance of Zephyr-7B on the MT-Bench and AlpacaEval benchmarks. Compared to other open 7B models, Zephyr-7B sets a new state-of-the-art and performs significantly better than dSFT models across both benchmarks. In particular, Zephyr-7B outperforms XWIN-LM-7B, which is one of the few open models to be trained with distilled PPO (dPPO). When compared to larger open models, Zephyr-7B achieves competitive performance with LLAMA2-CHAT 70B, scoring better on MT-Bench and within two standard deviations on AlpacaEval. However, Zephyr-7B performs worse than WIZARDLM-70B and XWIN-LM-70B, which suggests that applying dDPO to larger model sizes may be needed to match performance at these scales. When compared to proprietary models, Zephyr-7B is competitive with GPT-3.5-TURBO and CLAUDE 2 on AlpacaEval; however, these results should be interpreted with care since the prompts in AlpacaEval may not be representative of real usage and advanced applications. This is partly visible in Figure 1, which shows the breakdown of model performance on MT-Bench across each domain. We can see that although Zephyr-7B is competitive with proprietary models on several categories, it is much worse in math and coding.

dDPO Improves Academic Task Performance Table 2 shows the academic benchmark results comparing the performance of the proposed model with a variety of other closed-source and open-source LLMs. Results show that the dDPO model performs the best among all 7B models, with a large gap over the best dSFT models as well as the Xwin-LM dPPO model. Model scale does matter more for these results, and the larger models perform better than Zephyr on some of the knowledge-intensive tasks. However, Zephyr does reach the performance of the 40B-scale models.

Table 2: Academic benchmark results for open-access models on the Open LLM Leaderboard.

Is Preference Optimization Necessary? In Table 3 we examine the impact of different steps of the alignment process by fine-tuning Mistral 7B in four different ways:

  • dDPO − dSFT fine-tunes the base model directly with DPO for one epoch on UltraFeedback.
  • dSFT-1 fine-tunes the base model with SFT for one epoch on UltraChat.
  • dSFT-2 applies dSFT-1 first, followed by one more epoch of SFT on the top-ranked completions of UltraFeedback.
  • dDPO + dSFT applies dSFT-1 first, followed by one epoch of DPO on UltraFeedback.

First, we replicate past results (Ouyang et al., 2022) and show that without an initial SFT step (dDPO − dSFT), models are not able to learn at all from feedback and perform terribly. Using dSFT improves the model score significantly on both chat benchmarks. We also consider running dSFT directly on the feedback data by training on the most preferred output (dSFT-2); however, we find that this does not have a meaningful impact on performance. Finally, we see that the full Zephyr model (dDPO + dSFT) gives a large increase on both benchmarks.

Does Overfitting Harm Downstream Performance? In the process of training Zephyr-7B we observed that after one epoch of DPO training, the model would strongly overfit, as indicated by perfect training set accuracies in Figure 3. Surprisingly, this did not harm downstream performance on MT-Bench and AlpacaEval; as shown in Figure 3, the strongest model was obtained with one epoch of SFT followed by three epochs of DPO. However, we do observe that if the SFT model is trained for more than one epoch, the DPO step actually induces a performance regression with longer training.

| Alignment | MT-Bench (score) | AlpacaEval (win %) |
|---|---|---|
| dDPO − dSFT | 4.76 | 30.76 |
| dSFT-1 | 6.64 | 61.63 |
| dSFT-2 | 6.19 | 85.65 |
| dDPO + dSFT | 7.00 | 78.54 |

Table 3: Ablation of different alignment methods on the base Mistral 7B model.

Figure 3: Train and test set accuracy during DPO (left) and MT-Bench scores for MISTRAL-7B models fine-tuned first with dSFT and then dDPO for a varying number of epochs on the UltraChat and UltraFeedback datasets (right).

6 CONCLUSIONS AND LIMITATIONS

We consider the problem of alignment distillation from an LLM onto a smaller pretrained model. The method avoids the use of sampling-based approaches like rejection sampling or PPO, and distills conversational capabilities with direct preference optimization (DPO) from a dataset of AI feedback. The resulting model, Zephyr-7B, based on MISTRAL-7B, sets a new state-of-the-art for 7B parameter chat models, and even outperforms LLAMA2-CHAT-70B on MT-Bench. We hope this approach motivates further exploration of the capacity of smaller open models by demonstrating their ability to align to the intent of user interactions.

There are several limitations associated with our study. The main one is the use of GPT-4 as an evaluator for the AlpacaEval and MT-Bench benchmarks, which is known to be biased towards models distilled from it, or those that produce verbose but potentially incorrect responses. Another limitation is that we have not examined whether our method scales to much larger models like LLAMA2-70B, where the performance gains are potentially larger.

