Magpie Scratch Data Synthesis

  • Related Project: Private
  • Category: Paper Review
  • Date: 2024-06-18

Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

  • url: https://arxiv.org/abs/2406.08464
  • pdf: https://arxiv.org/pdf/2406.08464
  • html: https://arxiv.org/html/2406.08464v1
  • abstract: High-quality instruction data is critical for aligning large language models (LLMs). Although some models, such as Llama-3-Instruct, have open weights, their alignment data remain private, which hinders the democratization of AI. High human labor costs and a limited, predefined scope for prompting prevent existing open-source data creation methods from scaling effectively, potentially limiting the diversity and quality of public alignment datasets. Is it possible to synthesize high-quality instruction data at scale by extracting it directly from an aligned LLM? We present a self-synthesis method for generating large-scale alignment data named Magpie. Our key observation is that aligned LLMs like Llama-3-Instruct can generate a user query when we input only the left-side templates up to the position reserved for user messages, thanks to their auto-regressive nature. We use this method to prompt Llama-3-Instruct and generate 4 million instructions along with their corresponding responses. We perform a comprehensive analysis of the extracted data and select 300K high-quality instances. To compare Magpie data with other public instruction datasets, we fine-tune Llama-3-8B-Base with each dataset and evaluate the performance of the fine-tuned models. Our results indicate that in some tasks, models fine-tuned with Magpie perform comparably to the official Llama-3-8B-Instruct, despite the latter being enhanced with 10 million data points through supervised fine-tuning (SFT) and subsequent feedback learning. We also show that using Magpie solely for SFT can surpass the performance of previous public datasets utilized for both SFT and preference optimization, such as direct preference optimization with UltraFeedback. This advantage is evident on alignment benchmarks such as AlpacaEval, ArenaHard, and WildBench.

TL;DR


  • MAGPIE automatically generates a high-quality instruction dataset using an aligned LLM.
  • On benchmarks, the quality and diversity of the resulting dataset exceed those of existing datasets.
  • The method is a cost-effective and scalable way to build alignment datasets.

1 Introduction

Large language models (LLMs) such as GPT-4 and Llama-3 have become central to a wide range of AI applications. Their success depends heavily on high-quality instruction datasets, which enable models to handle even tasks they never saw during training. Most existing instruction datasets, however, are private, which restricts research and the democratization of AI. To address this, researchers have developed methods that rely on human effort as well as methods that synthesize instructions with LLMs. Each approach has its own trade-offs, and both struggle with declining dataset diversity as they scale.


2 MAGPIE: A Scalable Method for Generating Instruction Data

2.1 Instruction Generation Step

MAGPIE uses an aligned LLM, such as Llama-3-70B-Instruct, to generate instructions automatically. The crafted input query follows the LLM's predefined instruction template and defines only the role of the instruction provider. Exploiting its auto-regressive nature, the LLM then generates an instruction on its own and stops once the instruction is complete. Because this process requires no seed questions, it preserves the diversity of the generated instructions. A minimal sketch of such a crafted query is shown below.
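
As a concrete illustration (my own sketch, not code from the paper), the crafted query for a Llama-3-Instruct model is simply the chat-template prefix up to and including the user-turn header:

```python
# Illustrative only: the Llama-3 chat-template prefix ("pre-query template").
# There is no user message after the header, so the aligned model's
# auto-regressive decoder completes the turn with a synthetic user query.
PRE_QUERY_TEMPLATE = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
)
```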

2.2 Response Generation Step

The generated instructions are then fed back to the LLM to produce the corresponding responses. Combining the roles of instruction provider and instruction follower completes the instruction dataset. This step is fully automated and requires no further human intervention.

2.3 Extensions

MAGPIE extends naturally to generating multi-turn instruction datasets and preference datasets. Practitioners can also specify the task that the generated instructions should request.


3 Dataset Analysis

3.1 Dataset Coverage

The coverage of the MAGPIE-Pro dataset is assessed with a t-SNE analysis. The results show that MAGPIE-Pro spans a considerably broader range of topics than other datasets.

3.2 Dataset Attributes

  • [Instruction quality] Most instructions in MAGPIE-Air and MAGPIE-Pro are rated ‘average’ or better.
  • [Instruction difficulty] Instructions in MAGPIE-Pro are more difficult on average, a consequence of using a stronger generator model.
  • [Instruction similarity] Similarity is assessed by computing the minimum neighbor distance of deduplicated instructions in embedding space.
  • [Response quality] Responses show a clearly positive reward difference relative to the base model.

3.3 Safety Analysis

MAGPIE-Air and MAGPIE-Pro consist predominantly of safe data, with only a small fraction of harmful instructions or responses.

3.4 Cost Analysis

Running MAGPIE is inexpensive: data generation costs roughly $0.12 (MAGPIE-Air) to $1.1 (MAGPIE-Pro) per 1,000 instances, making the method more cost-effective than existing approaches.


4 Performance Analysis

4.1 Experimental Setup

Datasets, benchmarks, and baselines: The MAGPIE datasets are compared against a variety of instruction-tuning datasets using the Llama-3 and Qwen1.5 model families. The comparison covers human-written datasets such as ShareGPT and WildChat as well as synthetic datasets such as Evol Instruct and UltraChat, all of them publicly available alignment datasets.

Training details: Llama-3 and Qwen1.5 models are fine-tuned with an initial learning rate of \(2 \times 10^{-5}\) and a cosine learning-rate schedule. The maximum sequence length is 8192, and training runs on NVIDIA A100 GPUs.

Benchmark evaluation: The fine-tuned models are evaluated on AlpacaEval 2 and Arena-Hard. AlpacaEval 2 contains representative instructions drawn from real user interactions, while Arena-Hard contains challenging user queries.

4.2 Experimental Results

The MAGPIE datasets outperform existing datasets. For example, a Llama-3 model fine-tuned on MAGPIE data reaches a win rate of \(\text{WR} = 29.47\%\) on the AlpacaEval 2 benchmark. On the length-controlled win rate (LC), a metric that accounts for response length, MAGPIE likewise beats existing datasets.

MAGPIE consistently outperforms existing datasets across benchmarks. On WildBench, MAGPIE beats existing datasets in every category, demonstrating that MAGPIE data is effective at improving instruction-following ability.

Both data quantity and quality matter for instruction-following ability. Filtered datasets such as MAGPIE-Pro-300K-Filtered outperform the same amount of raw data, underscoring the effectiveness of the filtering technique.

MAGPIE can also improve other backbone models. Applying MAGPIE data to Qwen1.5-4B and Qwen1.5-7B yields models that beat the officially instruction-tuned versions, highlighting both the effectiveness of MAGPIE and the quality of its generated instructions.

Additional experimental results are covered in Appendices E.1 and E.3, including the performance of MAGPIE-Air-MT and MAGPIE-Pro-MT and MAGPIE's results on further benchmarks.


1 Introduction

Large language models (LLMs) such as GPT-4 [1] and Llama-3 [40] have become integral to AI applications due to their exceptional performance on a wide array of tasks by following instructions. The success of LLMs is heavily reliant on the data used for instruction fine-tuning, which equips them to handle a diverse range of tasks, including those not encountered during training. The effectiveness of this instruction tuning depends crucially on access to high-quality instruction datasets. However, the alignment datasets used for fine-tuning models like Llama-3-Instruct are typically private, even when the model weights are open, which impedes the democratization of AI and limits scientific research for understanding and enhancing LLM alignment.

To address the challenges in constructing such datasets, researchers have developed two main approaches. The first type of method involves human effort to generate and curate instruction data [14, 26, 64, 65, 66], which is both time-consuming and labor-intensive [37]. In contrast, the second type of method uses LLMs to produce synthetic instructions [16, 31, 46, 47, 53, 55, 58, 59]. Although these methods reduce human effort, their success heavily depends on prompt engineering and the careful selection of initial seed questions. The diversity of synthetic data tends to decrease as the dataset size grows. Despite ongoing efforts, the scalable creation of high-quality and diverse instruction datasets continues to be a challenging problem.

Figure 1: This figure illustrates the process of self-synthesizing instruction data from aligned LLMs (e.g., Llama-3-8B-Instruct) to create a high-quality instruction dataset. In Step 1, we input only the pre-query template into the aligned LLM and generate an instruction along with its response using auto-regressive generation. In Step 2, we use a combination of a post-query template and another pre-query template to wrap the instruction from Step 1, prompting the LLM to generate the query for the second turn. This completes the construction of the instruction dataset. MAGPIE efficiently generates diverse and high-quality instruction data. Our experimental results show that MAGPIE outperforms other public datasets for aligning Llama-3-8B-base.

Is it possible to synthesize high-quality instructions at scale by directly extracting data from advanced aligned LLMs themselves? A typical input to an aligned LLM contains three key components: the pre-query template, the query, and the post-query template. For instance, an input to Llama-2-chat could be “[INST] Hi! [/INST]”, where [INST] is the pre-query template and [/INST] is the post-query template. These templates are predefined by the creators of the aligned LLMs to ensure the correct prompting of the models. We observe that when we only input the pre-query template to aligned LLMs such as Llama-3-Instruct, they self-synthesize a user query due to their auto-regressive nature. Our preliminary experiments indicate that these random user queries are of high quality and great diversity, suggesting that the abilities learned during the alignment process are effectively utilized.
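
A hedged sketch of this observation with Hugging Face transformers is shown below; the model id, decoding settings, and stopping logic are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # gated repo; access assumed
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Pre-query template only: the user header with no message behind it.
prefix = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
inputs = tok(prefix, add_special_tokens=False, return_tensors="pt").to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,   # sampling (rather than greedy decoding) yields diverse queries
    temperature=1.0,
    eos_token_id=tok.convert_tokens_to_ids("<|eot_id|>"),  # stop at end of user turn
)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

Each sampled continuation is a self-synthesized user query; repeating the call yields a set of candidate instructions.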

Based on these findings, we developed a self-synthesis method to construct high-quality instruction datasets at scale, named MAGPIE (as illustrated in Figure 1). Unlike existing methods, our approach does not rely on prompt engineering or seed questions. Instead, it directly constructs instruction data by prompting aligned LLMs with a pre-query template for sampling instructions. We applied this method to the Llama-3-8B-Instruct and Llama-3-70B-Instruct models, creating two instruction datasets: MAGPIE-Air and MAGPIE-Pro, respectively.

Our MAGPIE-Air and MAGPIE-Pro datasets were created using 206 and 614 GPU hours, respectively, without requiring any human intervention or API access to production LLMs like GPT-4. Additionally, we generated two multi-turn instruction datasets, MAGPIE-Air-MT and MAGPIE-Pro-MT, which contain sequences of multi-turn instructions and responses. The statistics and advantages of our instruction datasets compared to existing ones are summarized in Table 1. We perform a comprehensive analysis of the generated data, allowing practitioners to filter and select data instances from these datasets for fine-tuning according to their particular needs.

To compare MAGPIE data with other public instruction datasets (e.g., ShareGPT [10], WildChat [64], Evol Instruct [58], UltraChat [16], OpenHermes [49], Tulu V2 Mix [24]) and various preference tuning strategies with UltraFeedback [13], we fine-tune the Llama-3-8B-Base model with each dataset and assess the performance of the resultant models on LLM alignment benchmarks such as AlpacaEval 2 [33], Arena-Hard [32], and WildBench [34]. Our results show that models fine-tuned with MAGPIE achieve superior performance, even surpassing the official Llama-3-8B-Instruct model on AlpacaEval, which was fine-tuned with over 10 million data points for supervised fine-tuning (SFT) and follow-up feedback learning. Not only does MAGPIE excel in SFT alone compared to prior public datasets that incorporate both SFT and preference optimization (e.g., direct preference optimization with UltraFeedback [13]), but it also delivers the best results when evaluated against six baseline instruction datasets and four preference tuning methods (DPO [44], IPO [2], KTO [19], and ORPO [23] with the UltraFeedback dataset). These findings show the exceptional quality of instruction data generated by MAGPIE, enabling it to outperform even the official, extensively optimized LLMs.

Table 1: Statistics of instruction datasets generated by MAGPIE compared to other instruction datasets. Tokens are counted using the tiktoken library [42].

2 MAGPIE: A Scalable Method to Synthesize Instruction Data

Overview of MAGPIE. In what follows, we describe our method, MAGPIE, to synthesize instruction data for fine-tuning LLMs. An instance of instruction data consists of one or more instruction-response pairs. Each pair specifies the roles of instruction provider and follower, along with their instruction and response. As shown in Figure 1, MAGPIE consists of two steps: (1) instruction generation, and (2) response generation. The pipeline of MAGPIE can be fully automated without any human intervention. Given the data generated by MAGPIE, practitioners may customize and build their own personalized instruction dataset accordingly (see Section 3 and Appendix B for more details). We detail each step in the following.

  • Step 1: Instruction Generation. The goal of this step is to generate an instruction for each instance of instruction data. Given an open-weight aligned LLM (e.g., Llama-3-70B-Instruct), MAGPIE crafts an input query in the format of the predefined instruction template of the LLM. This query defines only the role of instruction provider (e.g., user), and does not provide any instruction. Note that the auto-regressive LLM has been fine-tuned using instruction data in the format of the predefined instruction template. Thus, the LLM autonomously generates an instruction when the query crafted by MAGPIE is given as input. MAGPIE stops generating the instruction once the LLM produces an end-of-sequence token. Sending the crafted query to the LLM multiple times leads to a set of instructions. Compared with existing synthetic approaches [16, 31, 47, 53, 55, 58, 59], MAGPIE does not require specific prompt engineering techniques since the crafted query follows the format of the predefined instruction template. In addition, MAGPIE autonomously generates instructions without using any seed question, ensuring the diversity of generated instructions. A runnable sketch of Steps 1 and 2 follows this list.

  • Step 2: Response Generation. The goal of this step is to generate responses to the instructions obtained from Step 1. MAGPIE sends these instructions to the LLM to generate the corresponding responses. Combining the roles of instruction provider and follower, the instructions from Step 1, and the responses generated in Step 2 yields the instruction dataset. Detailed discussion on the generation configuration can be found in Appendix D.

  • Extensions of MAGPIE. MAGPIE can be readily extended to generate multi-turn instruction datasets and preference datasets. In addition, practitioners can specify the task requested by the instructions. We defer the detailed discussion on these extensions to Appendix A.
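
Putting Steps 1 and 2 together, a minimal end-to-end sketch with vLLM (the inference engine reported in Section 3.4) might look as follows; the template strings, sampling parameters, and batch size are assumptions, not the paper's exact settings.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", dtype="bfloat16")

# Llama-3 chat-template fragments (check your tokenizer config; some
# tokenizers prepend <|begin_of_text|> automatically).
PRE = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
POST = "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

# Step 1: sample candidate instructions from the bare pre-query template.
step1 = SamplingParams(temperature=1.0, top_p=1.0, max_tokens=256, stop=["<|eot_id|>"])
instructions = [o.outputs[0].text.strip() for o in llm.generate([PRE] * 8, step1)]

# Step 2: wrap each instruction in the full template and generate its response.
step2 = SamplingParams(temperature=0.0, max_tokens=1024, stop=["<|eot_id|>"])
prompts = [PRE + ins + POST for ins in instructions]
responses = [o.outputs[0].text.strip() for o in llm.generate(prompts, step2)]

dataset = [{"instruction": i, "response": r} for i, r in zip(instructions, responses)]
```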

Figure 3: This figure compares the t-SNE plot of MAGPIE-Pro with those of Alpaca, Evol Instruct, and UltraChat, each of which is sampled with 10,000 instructions. The t-SNE plot of MAGPIE-Pro encompasses the area covered by the other plots, demonstrating the comprehensive coverage of MAGPIE-Pro.

Figure 2: Lengths of instructions and responses in MAGPIE-Air/Pro.

3 Dataset Analysis

We apply MAGPIE to the Llama-3-8B-Instruct and Llama-3-70B-Instruct models to construct two instruction datasets: MAGPIE-Air and MAGPIE-Pro, respectively. Examples of instances in both datasets can be found in Appendix G. In this section, we present a comprehensive statistical analysis of the MAGPIE-Air and MAGPIE-Pro datasets. An overview of the lengths of instructions and responses of the data in MAGPIE-Air and MAGPIE-Pro is presented in Figure 2. In what follows, we first assess the breadth of MAGPIE-Pro by analyzing its coverage. We then discuss the attributes of MAGPIE-Pro, including topic coverage, difficulty, quality, and similarity of instructions, as well as quality of response. Finally, we provide the safety analysis and cost analysis. Using our dataset analysis, practitioners can customize and configure their own datasets for fine-tuning LLMs. In Appendix B, we showcase the process of customizing and filtering an instruction dataset based on our analysis. Specifically, we select 300K instances each from MAGPIE-Air and MAGPIE-Pro, yielding the datasets MAGPIE-Air-300K-Filtered and MAGPIE-Pro-300K-Filtered, respectively.

3.1 Dataset Coverage

We follow the approach in [64] and analyze the coverage of MAGPIE-Pro in the embedding space. Specifically, we use the all-mpnet-base-v2 embedding model1 to calculate the input embeddings, and employ t-SNE [51] to project these embeddings into a two-dimensional space. We adopt three synthetic datasets as baselines, including Alpaca [47], Evol Instruct [58], and UltraChat [16], to demonstrate the coverage of MAGPIE-Pro.

Figure 3 presents the t-SNE plots of MAGPIE-Pro, Alpaca, Evol Instruct, and UltraChat. Each t-SNE plot is generated by randomly sampling 10,000 instructions from the associated dataset. We observe that the t-SNE plot of MAGPIE-Pro encompasses the area covered by the plots of Alpaca, Evol Instruct, and UltraChat. This suggests that MAGPIE-Pro provides a broader or more diverse range of topics, highlighting its extensive coverage across varied themes and subjects. We also follow the practice in [53] and present the most common verbs and their top direct noun objects in instructions in Appendix C, indicating the diverse topic coverage of the MAGPIE dataset. Coverage analysis of MAGPIE-Air can also be found in Appendix C.
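
A sketch of this coverage analysis is shown below; the placeholder instruction lists, the sampling seed, and the plotting choices are mine, not the paper's.

```python
import numpy as np
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE

encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
rng = np.random.default_rng(0)

# Placeholder instruction lists; replace with MAGPIE-Pro, Alpaca, etc.
corpora = {
    "MAGPIE-Pro": ["Explain how transformers use attention.", "Plan a 3-day trip."],
    "Alpaca": ["List five uses for a paperclip.", "Summarize this paragraph."],
}

points, labels = [], []
for name, instructions in corpora.items():
    sample = rng.choice(instructions, size=min(10_000, len(instructions)), replace=False)
    points.append(encoder.encode(list(sample)))       # sentence embeddings
    labels += [name] * len(sample)

labels = np.array(labels)
xy = TSNE(n_components=2, random_state=0,
          perplexity=min(30.0, len(labels) - 1)).fit_transform(np.vstack(points))
for name in corpora:
    m = labels == name
    plt.scatter(xy[m, 0], xy[m, 1], s=2, alpha=0.3, label=name)
plt.legend()
plt.show()
```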

1 https://huggingface.co/sentence-transformers/all-mpnet-base-v2

3.2 Dataset Attributes

Attribute: Task Categories of Instructions.

We use Llama-3-8B-Instruct to categorize the instances in MAGPIE-Pro (see Figure 7 in Appendix C.1 for details). The prompts used to query Llama-3-8B-Instruct can be found in Appendix F. Our observations indicate that over half of the tasks in MAGPIE-Pro pertain to information seeking, making it the predominant category. This is followed by tasks involving creative writing, advice seeking, planning, and math. This distribution over task categories aligns with practical requests from human users [33].

Attribute: Quality of Instructions. We use the Llama-3-8B-Instruct model to assess the quality of each instruction in MAGPIE-Air and MAGPIE-Pro, categorizing them as ‘very poor’, ‘poor’, ‘average’, ‘good’, and ‘excellent’. We present the histograms of qualities for both datasets in Figure 4-(a). We have the following two observations. First, both datasets are of high quality, with the majority of instances rated ‘average’ or higher. Second, the overall quality of MAGPIE-Pro surpasses that of MAGPIE-Air. We hypothesize that this is due to the enhanced capabilities of Llama-3-70B compared with Llama-3-8B.

Attribute: Difficulty of Instructions. We use the Llama-3-8B-Instruct model to rate the difficulty of each instruction in MAGPIE-Air and MAGPIE-Pro. Each instruction can be labeled as ‘very easy’, ‘easy’, ‘medium’, ‘hard’, or ‘very hard’. Figure 4-(b) presents the histograms of the levels of difficulty for MAGPIE-Air and MAGPIE-Pro. We observe that the distributions across difficulty levels are similar for MAGPIE-Air and MAGPIE-Pro. Some instructions in MAGPIE-Pro are more challenging than those in MAGPIE-Air because MAGPIE-Pro is generated by a more capable model (Llama-3-70B-Instruct).
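
The exact rating prompts are given in the paper's Appendix F; the sketch below uses an illustrative stand-in prompt and the transformers chat pipeline (API per recent transformers versions) to tag each instruction with a quality or difficulty label.

```python
import torch
from transformers import pipeline

judge = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

LEVELS = {
    "quality": ["very poor", "poor", "average", "good", "excellent"],
    "difficulty": ["very easy", "easy", "medium", "hard", "very hard"],
}

def rate(instruction: str, attribute: str) -> str:
    """Ask the judge model for a single label; the prompt wording is illustrative."""
    options = ", ".join(LEVELS[attribute])
    messages = [{"role": "user", "content":
        f"Rate the {attribute} of the following user instruction as one of: "
        f"{options}. Reply with the label only.\n\nInstruction: {instruction}"}]
    out = judge(messages, max_new_tokens=8, do_sample=False)
    return out[0]["generated_text"][-1]["content"].strip().lower()

print(rate("Write a haiku about autumn.", "difficulty"))
```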

Figure 4: Statistics on input quality (a) and input difficulty (b) for MAGPIE-Air and MAGPIE-Pro.

Attribute: Instruction Similarity. We quantify the similarity among instructions generated by MAGPIE to remove repetitive instructions. We measure the similarity using minimum neighbor distance in the embedding space. Specifically, we first represent all instructions in the embedding space using the all-mpnet-base-v2 embedding model. For any given instruction, we then calculate the minimum distance from the instruction to its nearest neighbors in the embedding space using Facebook AI Similarity Search (FAISS) [17]. The minimum neighbor distances of instructions in MAGPIE-Air after removing repetitions are summarized in Figure 5-(a).
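
A minimal sketch of the distance computation with FAISS follows; the repetition threshold is an assumption for illustration, not a value from the paper.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

instructions = ["Write a haiku about autumn.", "Compose a haiku on fall.",
                "Explain TCP slow start."]  # placeholder list

encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
emb = encoder.encode(instructions).astype("float32")

index = faiss.IndexFlatL2(emb.shape[1])   # exact L2 search
index.add(emb)
dist, _ = index.search(emb, k=2)          # nearest neighbor besides the point itself
min_neighbor_dist = np.sqrt(dist[:, 1])   # column 0 is the point itself (distance 0)

keep = min_neighbor_dist > 0.5            # illustrative repetition threshold
deduped = [ins for ins, k in zip(instructions, keep) if k]
```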

Attribute: Quality of Responses. We assess the quality of responses using a metric named reward difference. For each instance in our dataset, the reward difference is calculated as r∗ − rbase, where r∗ is the reward assigned by a reward model to the response in our dataset, and rbase is the reward assigned by the same model to the response generated by the Llama-3 base model for the same instruction. We use URIAL [35] to elicit responses from the base model. A positive reward difference indicates that the response from our dataset is of higher quality, and could potentially benefit instruction tuning. In our experiments, we follow [29] and use FsfairX-LLaMA3-RM-v0.1 [57] as our reward model. Our results on the reward difference are presented in Figure 5-(b).
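
A hedged sketch of the reward-difference computation is below. The reward model is loaded as a sequence classifier returning a scalar reward, following the common usage of FsfairX-LLaMA3-RM-v0.1; consult the model card, as the exact repo id and interface may differ.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

rm_id = "sfairXC/FsfairX-LLaMA3-RM-v0.1"  # repo id assumed; verify on the Hub
rm_tok = AutoTokenizer.from_pretrained(rm_id)
rm = AutoModelForSequenceClassification.from_pretrained(
    rm_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def reward(instruction: str, response: str) -> float:
    chat = [{"role": "user", "content": instruction},
            {"role": "assistant", "content": response}]
    ids = rm_tok.apply_chat_template(chat, return_tensors="pt").to(rm.device)
    with torch.no_grad():
        return rm(ids).logits[0, 0].item()  # scalar reward

# Reward difference r* - r_base for one instance (responses are placeholders).
delta = reward("Explain TCP slow start.", "Slow start doubles cwnd each RTT...") \
      - reward("Explain TCP slow start.", "TCP is a protocol.")
```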

Figure 5: Minimum neighbor distances of MAGPIE-Air (a) and reward differences between the base and instruct models (b).

3.3 Safety Analysis

We use Llama-Guard-2 [48] to analyze the safety of MAGPIE-Air and MAGPIE-Pro. Our results indicate that both datasets are predominantly safe, with less than 1% of the data potentially containing harmful instructions or responses. Please refer to Appendix C.2 for detailed safety analysis.
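
A sketch of such a safety pass with Llama-Guard-2 is shown below, following the chat-template pattern from the model card; the guard model emits a verdict beginning with "safe" or "unsafe".

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

guard_id = "meta-llama/Meta-Llama-Guard-2-8B"  # gated repo; access assumed
gtok = AutoTokenizer.from_pretrained(guard_id)
guard = AutoModelForCausalLM.from_pretrained(
    guard_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def is_safe(instruction: str, response: str) -> bool:
    chat = [{"role": "user", "content": instruction},
            {"role": "assistant", "content": response}]
    ids = gtok.apply_chat_template(chat, return_tensors="pt").to(guard.device)
    out = guard.generate(ids, max_new_tokens=16, pad_token_id=gtok.eos_token_id)
    verdict = gtok.decode(out[0][ids.shape[-1]:], skip_special_tokens=True)
    return verdict.strip().startswith("safe")
```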


3.4 Cost Analysis

We perform experiments on a server with four NVIDIA A100-SXM4-80GB GPUs, an AMD EPYC 7763 64-Core Processor, and 512 GB of RAM, using the VLLM inference framework [28]. The models are loaded in the bfloat16 format.

When creating the 3M MAGPIE-Air dataset, MAGPIE spent 1.55 and 50 hours to generate the instructions (Step 1) and responses (Step 2), respectively. For the 1M MAGPIE-Pro dataset, MAGPIE used 3.5 and 150 hours to generate the instructions (Step 1) and responses (Step 2), respectively. Compared to existing approaches to create instruction datasets, the pipeline of MAGPIE is fully automated without any human intervention or API access to advanced commercial models such as GPT-4 [1]. Consequently, MAGPIE is cost-effective and scalable. On average, implementing MAGPIE on a cloud server2 would incur costs of $0.12 and $1.1 per 1,000 data instances for MAGPIE-Air and MAGPIE-Pro, respectively.
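
As a back-of-the-envelope check (the per-GPU hourly price below is my assumption; check current cloud rates), the reported figures are self-consistent:

```python
gpu_price_per_hour = 1.75  # USD per A100-hour, assumed
num_gpus = 4

for name, wall_hours, n_instances in [
    ("MAGPIE-Air", 1.55 + 50, 3_000_000),
    ("MAGPIE-Pro", 3.5 + 150, 1_000_000),
]:
    total = wall_hours * num_gpus * gpu_price_per_hour
    per_1k = 1000 * total / n_instances
    print(f"{name}: ${total:,.0f} total, ${per_1k:.2f} per 1K instances")

# -> about $0.12/1K (Air) and $1.07/1K (Pro), matching the reported $0.12 and $1.1.
```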

3.5 Additional Analysis

Additional dataset analysis, including the impact of generation configurations on the quality and difficulty of the generated instructions, is detailed in Appendix C.3.

4 Performance Analysis

In this section, we evaluate the quality of datasets generated by MAGPIE by utilizing them to fine-tune model families including Llama-3 [40] and Qwen1.5 [3].

4.1 Experimental Setup

Baselines for Instruction Tuning. We compare the family of datasets generated by MAGPIE with six state-of-the-art open-source instruction tuning datasets: ShareGPT [10], WildChat [64], Evol Instruct [58], UltraChat [16], OpenHermes [49], and Tulu V2 Mix [24]. ShareGPT and WildChat are representative human-written datasets containing 112K and 652K high-quality multi-round conversations between humans and GPT, respectively. Evol Instruct and UltraChat are representative open-source synthetic datasets. Following [39], we use the 208K sanitized version of UltraChat provided by HuggingFace3. OpenHermes and Tulu V2 Mix are crowd-sourced datasets consisting of a mix of diverse open-source alignment datasets, with 243K and 326K conversations, respectively. We note that to ensure fair comparison involving datasets of different sizes, we provide the results of MAGPIE-Pro-200K-Filtered and MAGPIE-Pro-100K-Filtered, which contain the first 200K and 100K conversations from MAGPIE-Pro-300K-Filtered. A detailed discussion on how these datasets are generated can be found in Appendix B.

Baselines for Instruction and Preference Tuning. We compare the models fine-tuned using data generated by MAGPIE with preference optimization baselines, including DPO [44], IPO [2], KTO [19], and ORPO [23]. Specifically, we follow [39] and use the models fine-tuned with the UltraChat dataset (for instruction tuning) and the UltraFeedback dataset (for preference optimization) [13].

Fine-Tuning Details. We follow [50] and use a cosine learning rate schedule with an initial learning rate of 2 × 10−5 when fine-tuning Llama-3 and Qwen1.5 models. The maximum sequence length is 8192. The fine-tuning process is conducted using four NVIDIA A100 GPUs with 80G memory, and the effective batch size is 32. The models are fine-tuned for 2 epochs. We follow the official instruction templates of each model.
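
One way to reproduce this recipe is with TRL's SFTTrainer, as sketched below; the dataset repo name, per-device batch split, and output path are assumptions, and the API reflects recent TRL versions.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Dataset repo name assumed; any chat-formatted MAGPIE split would do.
train_dataset = load_dataset("Magpie-Align/Magpie-Pro-300K-Filtered", split="train")

config = SFTConfig(
    output_dir="llama3-8b-magpie-sft",
    learning_rate=2e-5,                # initial LR from the paper
    lr_scheduler_type="cosine",        # cosine schedule from the paper
    num_train_epochs=2,
    max_seq_length=8192,
    per_device_train_batch_size=2,     # x 4 GPUs x 4 accumulation = effective 32
    gradient_accumulation_steps=4,
    bf16=True,
)

trainer = SFTTrainer(
    model="meta-llama/Meta-Llama-3-8B",
    train_dataset=train_dataset,
    args=config,
)
trainer.train()
```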

Evaluation Benchmarks. We evaluate the performance of the fine-tuned models using two widely-adopted instruction-following benchmarks: AlpacaEval 2 [33] and Arena-Hard [32]. AlpacaEval 2 consists of 805 representative instructions chosen from real user interactions. Arena-Hard is an enhanced version of MT-Bench [66], containing 500 challenging user queries. Both benchmarks employ a GPT evaluator to assess responses generated by the model of interest and a baseline model. Specifically, we use GPT-4-Turbo (1106) and Llama-3-8B-Instruct as baselines for AlpacaEval 2. By default, Arena-Hard uses GPT-4 (0314) as its baseline model.

2 https://lambdalabs.com/service/gpu-cloud

3 https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k

Table 2: This table compares the performance of models instruction-tuned from the Llama-3-8B base model using our datasets and baseline datasets. We observe that models fine-tuned with our datasets significantly outperform those fine-tuned with baseline datasets of the same order of magnitude in terms of data size. In addition, our fine-tuned models achieve comparable performance to the official aligned model, despite only undergoing SFT with a much smaller dataset. Numbers in bold indicate that MAGPIE outperforms the official Llama-3-8B-Instruct model.

Metrics. We adopt two metrics to measure the instruction-following capabilities of fine-tuned models. The first metric is the win rate (WR), which calculates the fraction of responses favored by the GPT evaluator. This metric is applied to both benchmarks, AlpacaEval 2 and Arena-Hard. The second metric is the length-controlled win rate (LC) [18], a debiased version of WR. The GPT evaluator considers the lengths of responses generated by the baseline model and the model under evaluation when computing LC. By accounting for response length, LC reduces its impact on the win rate. This metric is specific to the AlpacaEval 2 benchmark [33].
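
For plain WR, the computation is a simple fraction, as in the sketch below (counting ties as half a win is one common convention, not necessarily the benchmarks' exact rule); LC additionally fits the length-debiasing regression described in [18].

```python
def win_rate(judgments: list[str]) -> float:
    """judgments: per-prompt verdicts, each 'model', 'baseline', or 'tie'."""
    wins = sum(v == "model" for v in judgments)
    ties = sum(v == "tie" for v in judgments)
    return 100.0 * (wins + 0.5 * ties) / len(judgments)

print(win_rate(["model", "baseline", "tie", "model"]))  # -> 62.5
```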

Detailed Experimental Setups. We provide more detailed descriptions of our experimental setups, including more fine-tuning details and decoding hyper-parameters in Appendix D.

4.2 Experimental Results

MAGPIE datasets outperform others.

In Table 2, we first compare the performance of Llama-3 models fine-tuned with datasets generated by MAGPIE against those fine-tuned with baseline datasets. Using the AlpacaEval 2 evaluation benchmark, we observe that both LC and WR of our fine-tuned models surpass those fine-tuned with baseline instruction datasets, regardless of the choice of the baseline model. This indicates that the datasets generated by MAGPIE are of higher quality, leading to significantly enhanced instruction-following capabilities. A similar observation is made when using the Arena-Hard evaluation benchmark. We highlight that the Llama-3 models fine-tuned with datasets generated by MAGPIE outperform even those models that have undergone preference optimization (e.g., instruction tuning combined with DPO), which emphasizes the high quality of data generated by MAGPIE.

Figure 6: This figure shows the performance breakdown by category of MAGPIE-Pro and baselines on WildBench.

4 https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct

Table 3: This table compares the performance of models instruction-tuned on the Qwen base models using the MAGPIE-Pro-300K-Filtered dataset and the official instruction-tuned models. The Qwen base model enhanced with MAGPIE consistently outperforms the official instruction-tuned model.

To investigate the advantages of MAGPIE across different task categories, we also compare the performance of models fine-tuned with MAGPIE-Pro against baseline datasets using the WildBench benchmark [34]. This benchmark consists of 1024 tasks carefully selected from real-world human-LLM conversation logs. The results are shown in Figure 6. We observe that MAGPIE consistently outperforms baseline datasets across categories.

Models fine-tuned with data generated by MAGPIE achieve comparable performance to the official aligned model, but with fewer data. In Table 2, we compare the performance of models fine-tuned with data generated by MAGPIE against the official aligned model (Llama-3-8B-Instruct). We observe that the Llama-3-8B base model fine-tuned with data from MAGPIE outperforms Llama-3-8B-Instruct on the AlpacaEval 2 benchmark. For example, using the MAGPIE-Pro-300K-Filtered dataset to fine-tune the Llama-3-8B base model results in an LC of 29.47% against GPT-4-Turbo (1106). Furthermore, when Llama-3-8B-Instruct is chosen as the baseline model of AlpacaEval 2, we observe that the LC of Llama-3-8B base models fine-tuned with data from MAGPIE exceeds 50%, indicating a preference for our fine-tuned models over the official aligned model. Finally, we highlight that our fine-tuning process uses no more than 300K data instances, whereas the official aligned models are fine-tuned with more than 10M data samples. This demonstrates the high quality of the data generated by MAGPIE. Using the Arena-Hard benchmark, we observe a 1.7% difference between the WR achieved by our fine-tuned model and the official aligned model. We attribute this discrepancy to the fraction of coding-related instructions in our dataset, and we believe this gap could be bridged by increasing the dataset size.

Both data quantity and quality matter to capabilities of instruction-following. In what follows, we compare within the family of datasets generated by MAGPIE in Table 2. These datasets differ in size, use of filtering, and the models used to generate the data. We observe that as the dataset size increases, the performance of the fine-tuned model improves, indicating that data quantity plays a critical role in enhancing instruction-following capabilities. Furthermore, the model fine-tuned with MAGPIE-Pro-300K-Filtered outperforms those fine-tuned with the same amount of raw data. This demonstrates the effectiveness of our filtering technique and underscores the importance of data quality. Finally, we observe that models fine-tuned with MAGPIE-Pro consistently outperform those fine-tuned with MAGPIE-Air, because MAGPIE-Pro is generated by the more capable model, i.e., Llama-3-70B-Instruct.

MAGPIE can enhance the performance of other backbone models. Table 3 illustrates the efficacy of MAGPIE when applied to generate instruction datasets and fine-tune other backbone models, i.e., Qwen1.5-4B and Qwen1.5-7B. The results demonstrate that our fine-tuned models achieve better performance than the official aligned models, which have undergone both instruction and preference tuning. These results underscore the effectiveness of MAGPIE and the quality of its generated instructions.

Additional Experimental Results. We defer additional experimental results and analysis of MAGPIE-Air-MT and MAGPIE-Pro-MT to Appendix E.1. Additionally, the performance of MAGPIE across various other benchmarks is reported in Appendix E.3.

5 Related Work

LLM Alignment. Instruction tuning [56] and preference tuning [5] are widely used to align the responses of LLMs with human values. Instruction tuning utilizes an instruction dataset to fine-tune LLMs, where each instruction data instance consists of one or multiple turns of instructions and desired responses. The performance of instruction tuning heavily relies on the quality of instruction data [47, 53, 67]. Preference tuning further improves the responses of LLMs using reinforcement learning from human feedback (RLHF) [5] or preference optimization [2, 19, 23, 44] based on a preference dataset.

Alignment Dataset Construction. We classify the existing methods of creating datasets for model alignment into two main categories: human interactions with LLMs and synthetic instruction generation. To create datasets for alignment, previous studies have collected human interactions with LLMs [14, 64, 65, 66, 26]. However, manually crafting instructions is not only time-consuming and labor-intensive, but may also incorporate toxic content [64]. Another category of approaches [53, 47, 58, 59, 55, 46] focuses on prompting LLMs to generate synthetic instruction datasets, beginning with a small set of human-annotated seed instructions and expanding them through few-shot prompting. However, these methods face a diversity challenge, as few-shot prompting often yields new instructions that are too similar to the original seed questions [31]. To enhance coverage, some research [16, 31] summarizes world knowledge and employs it to generate synthetic datasets. Our MAGPIE dataset also belongs to the category of synthetic datasets; however, it leverages the predefined prompt template and requires no seed questions or prompt engineering.

Compared to the above two main categories, alignment data can also be generated by transforming existing data [54, 45, 20]. However, the constrained variety of NLP tasks in these datasets may impede the ability of tuned LLMs to generalize to real-world scenarios [31]. Mixture datasets (e.g., [24, 49, 38, 67]) combine or select high-quality instruction data from various existing open-source instruction datasets to enhance coverage [24, 49] and/or improve overall performance [38, 67]. Other data construction methods focus on improving reasoning and math abilities [61, 62]; these can be merged with MAGPIE to create a better data mixture for instruction tuning.

Training Data Extraction. Language models have the capability to memorize examples from their training datasets, potentially enabling malicious users to extract private information [8, 7, 9]. Pioneering work [27, 9, 41] has demonstrated that it is possible to extract private pre-training data from BERT [15], GPT-2 [43], and ChatGPT [1], respectively. Yu et al. [60] propose several tricks, including adjusting sampling strategies, to better extract training datasets from language models. Recently, Kassem et al. [25] propose a black-box prompt optimization method that uses an attacker LLM to extract high levels of memorization from a victim LLM. Wang et al. [52] leverage membership inference attacks (MIA) to extract fine-tuning datasets from fine-tuned language models. Bai et al. [4] extract the training data of production language models via special characters (e.g., structural symbols of JSON files, and @ and # in emails and online posts). Different from the prior work, we aim to create publicly available alignment datasets with minimal human effort by leveraging the remarkable generation capabilities of LLMs, rather than extracting private training data from them.

6 Limitations and Ethical Considerations

Limitations. In certain scenarios, users may aim to fine-tune LLMs using domain-specific instruction data. Investigating how to configure MAGPIE to efficiently generate the desired domain-specific instructions (e.g., math problems) is left to future work. Also, there is still a gap between Magpie-tuned LLMs and the official Llama-3-Instruct on benchmarks such as WildBench and MMLU, which suggests that we should focus on producing harder reasoning tasks and feedback-learning data.

License and Legality. The instruction datasets generated by MAGPIE in this paper are subject to CC BY-NC license and Meta Llama 3 Community license. While users are permitted to distribute, adapt, and further develop our method MAGPIE, it is the responsibility of the users to apply MAGPIE to LLMs in compliance with the associated license agreement. We hereby disclaim any liability for misuse of data generated by users of MAGPIE.

Societal Impact and Potential Harmful Consequences. The primary objective of this paper is to develop a scalable method to synthesize instruction data to enhance the instruction-following capabilities of LLMs, and thus align them with human values. However, the data generated by MAGPIE may contain harmful instructions and/or responses, which may lead to unsafe behaviors if used raw in instruction tuning. Our empirical evaluations indicate that such harmful data instances constitute less than 1% of the dataset. To mitigate this risk, we develop a filtering technique in Appendix B to identify and remove these instances.

7 Conclusion

In this paper, we developed a scalable method, MAGPIE, to synthesize instruction data for fine-tuning large language models. MAGPIE leveraged the predefined instruction templates of open-weight LLMs and crafted a prompt specifying only the role of the instruction provider. Given the crafted prompt, the LLM then generated detailed instructions due to its auto-regressive nature. MAGPIE then sent the generated instructions to the LLM to generate corresponding responses. These pairs of instructions and responses constituted the instruction dataset. We used Llama-3-8B-Instruct to label the instruction dataset and developed a filtering technique to select effective data instances for instruction tuning. We fine-tuned the Llama-3-8B base model using the selected data and demonstrated that the resulting model outperformed models fine-tuned with all baseline datasets. Moreover, our fine-tuned models outperformed the official aligned model, Llama-3-8B-Instruct, which was instruction-tuned and preference-optimized using more than 10M data instances. This highlights the quality of the instruction data synthesized by MAGPIE.
