MinWoo(Daniel) Park | Tech Blog


Model | Vicuna - FastChat

  • Related Project: private
  • Category: Paper Review
  • Date: 2023-08-13



Vicuna

[1] Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality

  • url: https://lmsys.org/blog/2023-03-30-vicuna/
  • abstract: We introduce Vicuna-13B, an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT. Preliminary evaluation using GPT-4 as a judge shows Vicuna-13B achieves more than 90%* quality of OpenAI ChatGPT and Google Bard while outperforming other models like LLaMA and Stanford Alpaca in more than 90%* of cases. The cost of training Vicuna-13B is around $300. The code and weights, along with an online demo, are publicly available for non-commercial use.

TL;DR


  • Development of a Self-Instruct tuning method that uses GPT-4
  • Performance evaluation across multiple languages and benchmarks
  • Implementation and validation of a high-performance Self-Instruct language model (LLaMA)

1. Introduction

Recent large language models have shown the ability to follow natural-language instructions and carry out a wide range of tasks. To extend this ability, researchers have explored various ways of tuning models to follow instructions. This paper proposes a new Self-Instruct tuning method that uses GPT-4 and evaluates its effect by applying it to LLaMA models.


2. Prior Work and Existing Problems

Existing LLM tuning methods rely mainly on human-written prompts and feedback, which are costly and cover a limited range of languages. To address this, the paper argues for a more effective tuning method built on the latest model, GPT-4.


3. Data Collection and Method

The paper uses GPT-4 to generate 52,000 instruction-following examples in English and Chinese and fine-tunes LLaMA on this data. The data generation can be summarized as:

\[\text{Data}_{\text{new}} = \text{GPT-4}(\text{Instruction})\]

Tuning the model on this data makes it possible to evaluate how well it generalizes across languages.
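
As a rough illustration of this generation step (not the authors' released pipeline), one could loop over the instructions and request a GPT-4 answer for each; the file names, prompt format, and helper names below are assumptions.

```python
# Hypothetical sketch of GPT-4-based instruction-data generation (not the paper's code).
# Assumes the OpenAI Python SDK (>=1.0) and an instructions.json file with
# {"instruction": ..., "input": ...} records in the Alpaca format.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_answer(instruction: str, context: str = "") -> str:
    prompt = instruction if not context else f"{instruction}\n\nInput:\n{context}"
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model id
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
        max_tokens=512,
    )
    return response.choices[0].message.content

with open("instructions.json") as f:
    instructions = json.load(f)

dataset = [
    {**item, "output": generate_answer(item["instruction"], item.get("input", ""))}
    for item in instructions
]

with open("gpt4_instruction_data.json", "w") as f:
    json.dump(dataset, f, ensure_ascii=False, indent=2)
```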


4. Mathematical Formulation and Approach

A key element of the tuning process is aligning the model's behavior with human preferences. To this end, a reward-modeling approach is used.

\[\mathcal{L}_{\text{RM}}(\theta) = -\log\big(\sigma(r_{\theta}(x, y_h) - r_{\theta}(x, y_l))\big)\]

Here $x$ is the prompt, $y_h$ and $y_l$ are the responses that received the higher and lower scores, respectively, and $\sigma$ is the sigmoid function. Minimizing this objective trains the tuned model to reflect the preference feedback more accurately.
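
The pairwise objective above can be written in a few lines of PyTorch. This is a minimal sketch, assuming scalar rewards per (prompt, response) pair; `reward_ranking_loss` and the dummy tensors are illustrative, not code from the paper or FastChat.

```python
# Minimal sketch of the pairwise reward-modeling loss (illustrative).
import torch
import torch.nn.functional as F

def reward_ranking_loss(r_high: torch.Tensor, r_low: torch.Tensor) -> torch.Tensor:
    """r_high / r_low: rewards r_theta(x, y_h) and r_theta(x, y_l) for a batch of pairs."""
    # Minimizing -log(sigmoid(r_h - r_l)) pushes the preferred response's reward above the other.
    return -F.logsigmoid(r_high - r_low).mean()

# Example with dummy reward values:
r_h = torch.tensor([1.2, 0.3, 2.0])
r_l = torch.tensor([0.5, -0.1, 1.7])
loss = reward_ranking_loss(r_h, r_l)
```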


5. Experimental Results

The experiments confirm that the tuned LLaMA models follow both English and Chinese instructions effectively. They score well in both automatic and human evaluation and show performance comparable to GPT-4 on a range of benchmarks.


6. Conclusion and Future Work

The Self-Instruct tuning method proposed in the paper is shown to improve the instruction-following ability of LLMs. Future work needs to verify generalization to more languages and to more complex instructions, and the authors note the importance of developing new mathematical methods for adjusting model behavior more precisely.

  • Vicuna-13B is an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT.
  • In a preliminary evaluation with GPT-4 as the judge, Vicuna-13B reaches more than 90% of the quality of OpenAI ChatGPT and Google Bard and outperforms other models such as LLaMA and Stanford Alpaca in more than 90% of cases.
  • Training Vicuna-13B costs roughly $300, and the code and weights are released for non-commercial use.

[2] INSTRUCTION TUNING WITH GPT-4

  • url: https://arxiv.org/abs/2304.03277
  • pdf: https://arxiv.org/pdf/2304.03277
  • abstract: Prior work has shown that fine-tuning large language models (LLMs) using machine-generated instruction-following data enables such models to achieve remarkable zero-shot capabilities on new tasks, and no human-written instructions are needed. In this paper, we present the first attempt to use GPT-4 to generate instruction-following data for LLM fine-tuning. Our early experiments on instruction-tuned LLaMA models show that the 52K English and Chinese instruction-following data generated by GPT-4 leads to superior zero-shot performance on new tasks to the instruction-following data generated by previous state-of-the-art models. We also collect feedback and comparison data from GPT-4 to enable a comprehensive evaluation and reward model training. We make our data generated using GPT-4 as well as our codebase publicly available.
  • huggingface: https://huggingface.co/lmsys/vicuna-13b-v1.3
  • github: https://github.com/lm-sys/FastChat
  • LLM judge github: https://github.com/lm-sys/FastChat/tree/main/fastchat/TextGenerationLLM_judge

TL;DR


  • A Self-Instruct tuning method that uses GPT-4
  • Evaluation of the instruction-following ability of LLaMA models
  • Study of cross-language generalization using multiple languages and datasets

1. Introduction

Large language models (LLMs) are remarkably good at understanding natural-language instructions and carrying out a variety of real-world tasks. To improve this instruction-following ability, researchers have been exploring instruction-based tuning methods. This paper proposes, for the first time, tuning LLaMA models with GPT-4 as the teacher model, and evaluates models trained on the resulting data.

2. Prior Work and Motivation

Earlier work fine-tuned models on human-annotated prompts or on public benchmarks augmented with automatically generated instructions. These approaches have seen some success, but Self-Instruct-style tuning has been found to further improve generalization, and tuning with a state-of-the-art model such as GPT-4 appears especially promising.

3. Data Collection and Method

  • 3.1 Data Collection

    The study uses GPT-4 to generate 52,000 English and Chinese instruction-following examples. Each example is produced by passing an instruction to GPT-4:

    \[\text{Output} = \text{GPT-4}(\text{Input})\]

    This data is used to tune the models and to evaluate cross-language generalization.

  • 3.2 Self-Instruct Tuning

    The LLaMA models are fine-tuned on the GPT-4-generated instruction data so that they follow instructions in multiple languages. The tuning uses the standard supervised next-token objective on the target outputs (a short sketch follows this list):

    \[\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{t} \log p_{\theta}(y_t \mid x, y_{<t})\]
  • 3.3 Reward Model

    The reward model is trained on response-quality ratings assigned by GPT-4. Given the score \(s\) for each response, the model is trained to minimize the pairwise ranking objective:

    \[\min_{\theta}\; -\log\big(\sigma(r_{\theta}(x, y_{\text{high}}) - r_{\theta}(x, y_{\text{low}}))\big)\]
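
As a rough illustration of the supervised fine-tuning objective in 3.2, the sketch below uses the Hugging Face `transformers` causal-LM loss; the checkpoint id, prompt formatting, and lack of prompt-token masking are assumptions, not the paper's training code.

```python
# Illustrative supervised fine-tuning step for instruction data (not the paper's code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-7b"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def sft_loss(instruction: str, answer: str) -> torch.Tensor:
    # Concatenate prompt and target; the causal-LM loss is token-level cross-entropy.
    # (Prompt tokens are not masked here, for brevity.)
    text = f"{instruction}\n\n### Response:\n{answer}"
    batch = tokenizer(text, return_tensors="pt")
    outputs = model(**batch, labels=batch["input_ids"])  # labels shifted internally
    return outputs.loss

loss = sft_loss("Give three tips for staying healthy.", "1. Eat a balanced diet ...")
loss.backward()
```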

4. Experimental Results and Evaluation

The tuned LLaMA models perform well on both the English and Chinese instruction datasets, with strong results in both human and automatic evaluation. The models show particularly strong generalization on unconventional (unnatural) instructions.

5. Conclusion and Future Work

The Self-Instruct tuning method presented in the paper is shown to effectively improve the instruction-following ability of LLMs. The authors plan to further evaluate generalization on more languages and more complex instructions, and to develop finer-grained reward models to further improve response quality.

  • Prior work has shown that fine-tuning large language models (LLMs) on machine-generated instruction-following data yields strong zero-shot performance on new tasks without any human-written instructions.
  • This paper uses GPT-4 to generate instruction-following data for LLM fine-tuning.
  • Early experiments on instruction-tuned LLaMA models show that the 52K English and Chinese instruction-following examples generated by GPT-4 lead to better zero-shot performance on new tasks than data generated by previous state-of-the-art models.
  • The authors also collect feedback and comparison data from GPT-4 for comprehensive evaluation and reward-model training, and release the GPT-4-generated data and their codebase.

1 INTRODUCTION

Large Language Models (LLMs) have shown impressive generalization capabilities such as in-context learning (Brown et al., 2020) and chain-of-thought reasoning (Wei et al., 2022). To enable LLMs to follow natural language instructions and complete real-world tasks, researchers have been exploring methods of instruction-tuning of LLMs. This is implemented by either finetuning the model on a wide range of tasks using human-annotated prompts and feedback (Ouyang et al., 2022), or supervised finetuning using public benchmarks and datasets augmented with manually or automatically generated instructions (Wang et al., 2022b). Among these methods, Self-Instruct tuning (Wang et al., 2022a) is a simple and effective method of aligning LLMs to human intent, by learning from instruction-following data generated by state-of-the-art instruction-tuned teacher LLMs. This line of instruction-tuning research has produced effective means to improve the zero- and few-shot generalization abilities of LLMs. The recent success of ChatGPT (OpenAI, 2023a) and GPT-4 (OpenAI, 2023b) offers tremendous opportunities to improve open-source LLMs using instruction-tuning. LLaMA (Touvron et al., 2023) is a series of open-sourced LLMs which match the performance of proprietary LLMs such as GPT-3. To teach LLaMA to follow instructions, Self-Instruct tuning has been quickly adopted given its superior performance and low cost. For example, Stanford Alpaca (Taori et al., 2023) uses 52K instruction-following samples generated by GPT-3.5, while Vicuna (Vicuna, 2023) uses around 700K instruction-following samples (70K conversations) shared between users and ChatGPT (ShareGPT, 2023).

To advance the state of the art of instruction-tuning for LLMs, we propose for the first time to use GPT-4 as a teacher for self-instruct tuning. Our paper makes the following contributions:

  • GPT-4 data. We release data generated by GPT-4, including the 52K instruction-following dataset in both English and Chinese, and the GPT-4-generated feedback data that rate the outputs of three instruction-tuned models.

  • Models & Evaluation. Based on the GPT-4-generated data, we have developed instruction-tuned LLaMA models and reward models. To evaluate the quality of instruction-tuned LLMs, we use three metrics evaluated on test samples (i.e., unseen instructions): human evaluation on three alignment criteria, automatic evaluation using GPT-4 feedback, and ROUGE-L on unnatural instructions (Honovich et al., 2022). Our empirical study validates the effectiveness of using GPT-4-generated data for LLM instruction-tuning, and suggests practical tips for building a general-purpose instruction-following agent powered by LLMs.

Note: This is a preliminary release, and we will continue to expand the dataset and will finetune larger models.

Algorithm 1: Pseudo code for prompt engineering, GPT-4 call and hyper-parameters in data generation. Each instruction instance is used as variables in the prompt template, the data flow is highlighted in blue.

2 DATASET

Data Collection. We reuse the 52K unique instructions in the instruction-following data collected in the Alpaca dataset (Taori et al., 2023). Each instruction describes the task the model should perform. We follow the same prompting strategy to consider cases with and without input, which is the optional context or input for the task (an approximate version of these prompt templates is sketched after the list below). The output is the answer to the instruction instance, generated using LLMs. In the Alpaca dataset, the output is generated using GPT-3.5 (text-davinci-003), but we instead use GPT-4 (gpt-4) for data generation. Specifically, we generate the following four datasets with GPT-4:

  • (1) English Instruction-Following Data: For the 52K instructions collected in Alpaca (Taori et al., 2023), one English GPT-4 answer is provided for each. The details are described in Algorithm 1. We leave it as future work to follow an iterative process to construct our own instruction set using GPT-4 and self-instruct (Wang et al., 2022a).
  • (2) Chinese Instruction-Following Data: We use ChatGPT to translate the 52K instructions into Chinese and ask GPT-4 to answer them in Chinese. This allows us to build a Chinese instruction-following model based on LLaMA, and study cross-language generalization ability of instruction-tuning.
  • (3) Comparison Data: We ask GPT-4 to rate its own response from 1 to 10. Furthermore, we ask GPT-4 to compare and rate the responses from the three models, including GPT-4, GPT-3.5 and OPT-IML (Iyer et al., 2022). This is used to train reward models.
  • (4) Answers on Unnatural Instructions: The GPT-4 answers are decoded on the core dataset of 68K instruction-input-output triplets (Honovich et al., 2022). The subset is used to quantify the gap between GPT-4 and our instruction-tuned models at scale.
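
The Alpaca-style prompting referred to above looks approximately like the following; the wording is paraphrased from the public Alpaca recipe and should be treated as an approximation, not the paper's exact template.

```python
# Approximate Alpaca-style prompt templates (paraphrased; treat exact wording as an assumption).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that provides "
    "further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def build_prompt(instruction: str, input_text: str = "") -> str:
    # Choose the with-input or no-input template, mirroring the two Alpaca cases.
    if input_text:
        return PROMPT_WITH_INPUT.format(instruction=instruction, input=input_text)
    return PROMPT_NO_INPUT.format(instruction=instruction)
```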

Data Statistics. We compare the English output response sets of GPT-4 and GPT-3.5 in Figure 1. For each output, the root verb and the direct-object noun are extracted, and the frequency over the unique verb-noun pairs is computed over each output set. The verb-noun pairs whose frequency is higher than 10 are displayed in Figure 1(a) and (b), and the most frequent 25 pairs of the two sets are compared in Figure 1(c). The frequency distributions of the sequence length are compared in Figure 1(d). GPT-4 tends to generate longer sequences than GPT-3.5. The GPT-3.5 data in Alpaca exhibits an output distribution with a longer tail than our GPT-4-generated output distribution, probably because the Alpaca dataset involves an iterative data-collection process to remove similar instruction instances at each iteration, which is absent in our current one-time data generation. Despite this simple process, the GPT-4-generated instruction-following data demonstrates more favorable alignment performance, as shown in the experiments later.

Figure 1: Comparison of generated responses using GPT-4 and GPT-3: (a, b) the root verb-noun pairs of GPT-4 and GPT-3, where the inner circle of the plot represents the root verb of the output response and the outer circle represents the direct nouns; (c) the top 25 verb-noun pairs and their frequencies; (d) comparison of output sequence length.

3 INSTRUCTION-TUNING LANGUAGE MODELS

3.1 SELF-INSTRUCT TUNING

We train two models using supervised finetuning from the LLaMA 7B checkpoint: (i) LLaMA-GPT4 is trained on the 52K English instruction-following data generated by GPT-4, whose distribution is displayed in Figure 1. (ii) LLaMA-GPT4-CN is trained on the 52K Chinese instruction-following data from GPT-4. We follow the training schedule in (Taori et al., 2023) for fair comparison. These models are used to study the data quality of GPT-4 and the cross-language generalization properties when instruction-tuning LLMs in one language.

3.2 REWARD MODELS

Reinforcement Learning from Human Feedback (RLHF) aims to align LLM behavior with human preferences in order to make the model more useful. One key component of RLHF is reward modeling, where the problem is formulated as a regression task to predict a scalar reward given a prompt and a response (Askell et al., 2021; Ouyang et al., 2022). This approach typically requires large-scale comparison data, where two model responses to the same prompt are compared (Ouyang et al., 2022). Existing open-source works such as Alpaca, Vicuna, and Dolly (Databricks, 2023) do not involve RLHF due to the high cost of labeling comparison data. Meanwhile, recent studies show that GPT-4 is capable of identifying and fixing its own mistakes and of accurately judging the quality of responses (Peng et al., 2023; Bai et al., 2022; Madaan et al., 2023; Kim et al., 2023). Therefore, to facilitate research on RLHF, we have created comparison data using GPT-4, as described in Section 2.

To evaluate data quality, we train a reward model based on OPT 1.3B (Iyer et al., 2022) to rate different responses. For each instance of the comparison data involving one prompt $x$ and $K$ responses, GPT-4 assigns a score $s \in [1, 10]$ to each response. There are $\binom{K}{2}$ unique pairs constructed from this instance; each pair is $(y_l, y_h)$, whose corresponding scores satisfy $s_l < s_h$. A reward model $r_{\theta}$ parameterized by $\theta$ is trained with the objective $\min_{\theta} -\log\big(\sigma(r_{\theta}(x, y_h) - r_{\theta}(x, y_l))\big)$, where $\sigma$ is the sigmoid function. The distribution of the comparison data is shown in Figure 2.
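
A sketch of how the $\binom{K}{2}$ comparison pairs and the ranking objective could be assembled from GPT-4 scores; `reward_fn` and the data layout are illustrative stand-ins for the OPT-1.3B-based reward model, not the paper's implementation.

```python
# Illustrative construction of comparison pairs and the reward-model objective.
from itertools import combinations
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_fn, prompt: str, responses: list[str], scores: list[float]) -> torch.Tensor:
    """reward_fn(prompt, response) -> scalar tensor r_theta(x, y); scores are GPT-4 ratings in [1, 10]."""
    losses = []
    for i, j in combinations(range(len(responses)), 2):  # all C(K, 2) unique pairs
        if scores[i] == scores[j]:
            continue                                      # skip tied ratings
        hi, lo = (i, j) if scores[i] > scores[j] else (j, i)
        r_h = reward_fn(prompt, responses[hi])
        r_l = reward_fn(prompt, responses[lo])
        losses.append(-F.logsigmoid(r_h - r_l))           # -log(sigmoid(r_h - r_l))
    return torch.stack(losses).mean()
```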

4 EXPERIMENTAL RESULTS

4.1 BENCHMARKS

Figure 2: The distribution of comparison data.

It is known that LLM evaluation remains a significant challenge. Our goal is to evaluate self-instruct tuned models on GPT-4 data on unseen instructions, to study their ability to follow instructions for arbitrary tasks. Specifically, we use three established datasets in our study:

  • User-Oriented-Instructions-252 (Wang et al., 2022a; https://github.com/yizhongw/self-instruct/blob/main/human_eval/user_oriented_instructions.jsonl) is a manually curated set of 252 instructions, motivated by 71 user-oriented applications such as Grammarly, StackOverflow, and Overleaf rather than well-studied NLP tasks.

  • Vicuna-Instructions-80 (Vicuna, 2023; https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/question.jsonl) is a dataset synthesized by GPT-4 with 80 questions that baseline models find challenging. Besides generic instructions, there are 8 categories: knowledge, math, Fermi, counterfactual, roleplay, generic, coding, writing, and common-sense.

  • Unnatural Instructions (Honovich et al., 2022; https://github.com/orhonovich/unnatural-instructions) is a dataset of 68,478 samples synthesized by text-davinci-002 using 3-shot in-context learning from 15 manually constructed examples.

Figure 3: Human evaluation. (a) LLaMA-GPT4 vs. Alpaca (i.e., LLaMA-GPT3); (b) LLaMA-GPT4 vs. GPT-4.

4.2 HUMAN EVALUATION WITH ALIGNMENT CRITERIA

To evaluate the alignment quality of our instruction-tuned LLMs, we follow the alignment criteria from Anthropic (Askell et al., 2021): an assistant is aligned if it is helpful, honest, and harmless (HHH). These criteria are used to evaluate how well an AI system is aligned with human values.

  • Helpfulness: whether it helps humans achieve their goals. A model that can answer questions accurately is helpful.

  • Honesty: whether it provides true information, and expresses its uncertainty to avoid misleading human users when necessary. A model that provides false information is not honest.

  • Harmlessness: whether it does not cause harm to humans. A model that generates hate speech or promotes violence is not harmless.

Based on HHH alignment criteria, we used Amazon Mechanical Turk to perform human evaluation on the model generation results. Please find the interface in Appendix Section A.1. Following (Wang et al., 2022a; Taori et al., 2023), we consider 252 user-oriented instructions for evaluation. We display the human evaluation results in pie charts in Figure 3.
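
For concreteness, the win/tie/lose shares shown in the pie charts can be tallied from the raw annotator votes along these lines; the vote labels and data format are assumptions, not the paper's evaluation code.

```python
# Illustrative tallying of pairwise human votes into win/tie/lose percentages.
from collections import Counter

def tally(votes: list[str]) -> dict[str, float]:
    """votes: one of 'model_a', 'model_b', or 'tie' per annotated comparison."""
    counts = Counter(votes)
    total = sum(counts.values())
    return {label: 100.0 * counts[label] / total for label in ("model_a", "model_b", "tie")}

print(tally(["model_a", "tie", "model_a", "model_b", "tie"]))  # {'model_a': 40.0, 'model_b': 20.0, 'tie': 40.0}
```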

First, we compare the quality of generated responses from the two instruction-tuned LLaMA models, which are fine-tuned on data generated by GPT-4 and GPT-3, respectively. Note that aligning LLaMA to GPT-3 corresponds to the Stanford Alpaca model. From Figure 3(a), we observe that (i) for the “Helpfulness” criterion, GPT-4 is the clear winner with 54.12% of the votes, while GPT-3 wins only 19.74% of the time; (ii) for the “Honesty” and “Harmlessness” criteria, the largest portion of votes goes to the tie category, which is substantially higher than either winning category, though GPT-3 (Alpaca) is slightly ahead.

Second, we compare GPT-4-instruction-tuned LLaMA models against the teacher model GPT-4 in Figure 3(b). The observations are quite consistent over the three criteria: GPT-4-instruction-tuned LLaMA performs similarly to the original GPT-4. We conclude that learning from GPT-4 generated data can lead to very comparable performance with the original GPT-4 on the unseen instructional tasks, which suggests a promising direction to developing state-of-the-art instruction-following LLMs.

Figure 4: Performance comparisons evaluated by GPT-4. Each bar represents an evaluation result between two models; the sum of scores is computed and reported (the full score is 800). The relative score is reported as a percentage, computed as the ratio against a strong opponent model. (a, b) Comparisons of responses from LLaMA-GPT4, ranked by our reward model, against ChatGPT and GPT-4, respectively; ‘B’ indicates the baseline where the model decodes one response per question. (c, d) All chatbots compared against ChatGPT and GPT-4, respectively.

4.3 COMPARISONS WITH SOTA USING AUTOMATIC EVALUATION

Automatic Evaluation with GPT-4. Following (Vicuna, 2023), we employ GPT-4 to automatically evaluate the generated responses of different models on the 80 unseen questions in (Vicuna, 2023). We first collect answers from two chatbots, LLaMA-GPT-4 (7B) and GPT-4, and use the released answers of the other chatbots from (Vicuna, 2023), including LLaMA (13B), Alpaca (13B), Vicuna (13B), Bard (Google, 2023), and ChatGPT. For each evaluation, we ask GPT-4 to rate the response quality between two models with scores from 1 to 10. We compare all models against a strong competing model, ChatGPT and GPT-4 respectively. The results are shown in Figure 4.

For LLaMA instruction-tuned with GPT-4, we provide two sets of decoding results: (i) one response per question, which is considered the baseline decoding result; (ii) five responses per question. For the latter, the reward model is used to rank the responses, which are then grouped into five subsets ranked from top 1 to top 5. We compare the five ranked groups against the baseline and show the relative scores in Figure 4(a, b). The ChatGPT and GPT-4 evaluations are consistent with the ordering suggested by our reward model, which demonstrates the value of the feedback data and the effectiveness of the reward model.
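
The relative score reported in Figure 4 is, conceptually, the ratio of summed judge scores against the opponent; a small sketch under that reading (the per-question score pairs here are hypothetical).

```python
# Conceptual computation of the relative score used in the GPT-4 automatic evaluation (illustrative).
def relative_score(pairs: list[tuple[float, float]]) -> float:
    """pairs: (candidate_score, opponent_score) on a 1-10 scale for each question."""
    candidate_total = sum(c for c, _ in pairs)  # at most 800 for 80 questions
    opponent_total = sum(o for _, o in pairs)
    return 100.0 * candidate_total / opponent_total

# Example: a candidate judged slightly worse than the opponent on three questions.
print(relative_score([(8, 9), (7, 7), (9, 10)]))  # ~92.3
```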

Figure 5: Performance comparisons of Chinese instruction-following evaluated by GPT-4. (a) All chatbots against GPT-4, whose Chinese responses are translated from English; (b) all chatbots against GPT-4, whose Chinese responses are generated by asking Chinese questions; (c) all chatbots, answering Chinese questions in Chinese, against GPT-4. In (a, b), all models are asked to respond in English and the responses are translated into Chinese; the scores are computed against translated Chinese in (a) and model-generated Chinese in (b). In (c), all models are asked to respond in Chinese.

We compare all the chatbots in Figure 4(c,d). Instruction tuning of LLaMA with GPT-4 often achieves higher performance than tuning with text-davinci-003 (i.e., Alpaca) and no tuning (i.e., LLaMA): The 7B LLaMA GPT4 outperforms the 13B Alpaca and LLaMA. However, there is still a gap compared with large commercial chatbots such as GPT-4.

We further study the performance of all the chatbots in Chinese in Figure 5. We first translate the English responses of the chatbots into Chinese using GPT-4. We also translate the English questions into Chinese to obtain answers directly from GPT-4. The comparisons against translated and generated Chinese responses from GPT-4 are shown in Figure 5(a) and (b), respectively. There are two interesting observations: (i) the relative score metric of the GPT-4 evaluation (Vicuna, 2023) is quite consistent, both across opponent models (i.e., ChatGPT or GPT-4) and across languages (i.e., English or Chinese); (ii) for the GPT-4 results alone, the translated responses show superior performance over the responses generated in Chinese, probably because GPT-4 is trained on a richer English corpus than Chinese, which leads to stronger English instruction-following ability. In Figure 5(c), we show results for all models that are asked to answer in Chinese.

We compare LLaMA-GPT4 with GPT-4 and Alpaca on unnatural instructions in Figure 6. In terms of the average ROUGE-L scores, Alpaca outperforms the other two models. We note that LLaMA-GPT4 and GPT-4 gradually perform better as the ground-truth response length increases, eventually showing higher performance when the length is longer than 4; this suggests that they follow instructions better in more creative scenarios. Across the different subsets, LLaMA-GPT4 closely follows the behavior of GPT-4. When the sequence length is short, both LLaMA-GPT4 and GPT-4 generate responses that contain the simple ground-truth answers but add extra words to make the response more chat-like, which probably leads to lower ROUGE-L scores.

Figure 6: ROUGE-L on unnatural instructions evaluated with 9K samples. The instructions are grouped into four subsets based on the ground-truth response length. The mean values are reported in the legend. The difference with GPT-4 is reported on the bar per group. LLaMA-GPT4 is a closer proxy to GPT-4 than Alpaca.
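
A rough sketch of computing ROUGE-L bucketed by ground-truth response length, using the `rouge_score` package; the bucket boundaries below are assumptions for illustration (the paper groups the instructions into four subsets by length).

```python
# Illustrative ROUGE-L evaluation bucketed by ground-truth length (not the paper's exact setup).
from collections import defaultdict
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def bucketed_rouge_l(examples):
    """examples: iterable of (ground_truth, model_response) string pairs."""
    buckets = defaultdict(list)
    for reference, prediction in examples:
        length = len(reference.split())
        # Assumed length buckets for the four subsets.
        key = "1-4" if length <= 4 else "5-9" if length <= 9 else "10-19" if length <= 19 else "20+"
        score = scorer.score(reference, prediction)["rougeL"].fmeasure
        buckets[key].append(score)
    return {k: sum(v) / len(v) for k, v in buckets.items()}
```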

5 RELATED WORK

Instruction tuning. Instruction tuning of LLMs is an increasingly popular research direction in NLP (Zhong et al., 2021; Ouyang et al., 2022; Wei et al., 2021). Existing works aim to improve the quality and scale of three factors in the development pipeline: instruction-following data, foundation language models, and evaluation benchmarks. Each group typically maintains its own pipeline. For example, scaling instruction-finetuned language models (Chung et al., 2022) is built on top of FLAN (Wei et al., 2021). PromptSource contains a growing collection of prompts (also called P3: Public Pool of Prompts) (Bach et al., 2022). T0 is a series of models trained on P3 via multitask prompted training (Sanh et al., 2021). Instruction-tuning of OPT models is considered in (Iyer et al., 2022), where a larger and more comprehensive benchmark, OPT-IML Bench, is employed, covering FLAN (Wei et al., 2021), Super-NaturalInstructions (Wang et al., 2022b), and UnifiedSKG (Xie et al., 2022).

Open-Source Efforts. Given the broad capabilities of LLMs exhibited by ChatGPT, open-source models have drawn significant interest and promoted work towards open, general-purpose, text-based assistants that are aligned with human values. Early attempts at foundation LLMs include BLOOM (Scao et al., 2022), GPT-J (Wang & Komatsuzaki, 2021), GPT-NEO (Black et al., 2021), OPT (Zhang et al., 2022), and LLaMA (Touvron et al., 2023). To align LLMs with chat-based assistance, Open-Assistant (LAION-AI, 2023) is built on GPT-J, and Alpaca/Vicuna are built on LLaMA. Furthermore, OpenFlamingo (Awadalla et al., 2023) and LLaMA-Adapter (Zhang et al., 2023) connect LLaMA with image inputs, paving a way to build open-source multi-modal LLMs.

6 CONCLUSIONS

This paper demonstrates the effectiveness of instruction tuning using GPT-4. We release 52K English and Chinese instruction-following instances generated using GPT-4, as well as model checkpoints finetuned from LLaMA. We hope our empirical observations and resources will benefit the development of open-source and general-purpose LLMs that can better align with human values to complete tasks.

This represents work in progress, and several directions can be explored: (i) Data and model scale. The GPT-4 data size is 52K and the base LLaMA model size is 7B. Vicuna collects around 700K conversation turns (approximated from the multi-turn ShareGPT data) and uses the 13B LLaMA model. It would therefore be promising to continue collecting more GPT-4 instruction-following data, combine it with ShareGPT data, and train larger LLaMA models for higher performance. (ii) RLHF. The reward model is only used in the decoding stage, which suggests that comparison data is promising for providing useful feedback for LLM training. It is natural to continue training LLMs with reward models, for example via reinforcement learning from machine-generated feedback.
