MinWoo(Daniel) Park | Tech Blog


False Promise

  • Related Project: Private
  • Category: Paper Review
  • Date: 2024-01-13

The False Promise of Imitating Proprietary LLMs

  • url: https://arxiv.org/abs/2305.15717
  • pdf: https://arxiv.org/pdf/2305.15717
  • abstract: An emerging method to cheaply improve a weaker language model is to finetune it on outputs from a stronger model, such as a proprietary system like ChatGPT (e.g., Alpaca, Self-Instruct, and others). This approach looks to cheaply imitate the proprietary model’s capabilities using a weaker open-source model. In this work, we critically analyze this approach. We first finetune a series of LMs that imitate ChatGPT using varying base model sizes (1.5B–13B), data sources, and imitation data amounts (0.3M–150M tokens). We then evaluate the models using crowd raters and canonical NLP benchmarks. Initially, we were surprised by the output quality of our imitation models - they appear far better at following instructions, and crowd workers rate their outputs as competitive with ChatGPT. However, when conducting more targeted automatic evaluations, we find that imitation models close little to none of the gap from the base LM to ChatGPT on tasks that are not heavily supported in the imitation data. We show that these performance discrepancies may slip past human raters because imitation models are adept at mimicking ChatGPT’s style but not its factuality. Overall, we conclude that model imitation is a false promise: there exists a substantial capabilities gap between open and closed LMs that, with current methods, can only be bridged using an unwieldy amount of imitation data or by using more capable base LMs. In turn, we argue that the highest leverage action for improving open-source models is to tackle the difficult challenge of developing better base LMs, rather than taking the shortcut of imitating proprietary systems.


TL;DR


  • Model imitation study: fine-tuning open-source LMs on outputs from a proprietary model (ChatGPT) and comparing them against it
  • Analysis of dataset construction (task-specific vs. broad-coverage imitation data) and of strategies for strengthening base models
  • Assessment, via human ratings and automatic benchmarks, of how much of the capability gap imitation can actually close

1. Introduction

The progress of language models (LMs) raises an important question about closed-source models offered through commercial APIs versus freely distributed open-source models. The paper approaches this question through model imitation: the process of collecting a proprietary model's outputs and fine-tuning an open-source LM on them. The procedure is closely related to the notion of "distillation" proposed by Hinton et al. (2014).


2. What Is Model Imitation?

  • Definition: model imitation is the process of collecting data through a proprietary LM's API and training an open-source LM on that data.
  • Goal: to build an open-source model that rivals the proprietary model by learning from its API outputs.
  • Method: data are collected through the function $\text{API}_{\text{out}}(x)$, where $x$ is the input and the function denotes the proprietary LM's output.

The imitation objective can be written as

\[\hat{y} = f(x; \theta)\]

where $\hat{y}$ is the imitation model's output, $x$ is the input, and $\theta$ denotes the model parameters, which are optimized during imitation training so that $\hat{y}$ matches the proprietary model's output $\text{API}_{\text{out}}(x)$.
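As a concrete illustration of this objective, the sketch below computes a standard next-token cross-entropy over the API outputs, masking the loss so that only response tokens contribute. It assumes a Hugging Face-style causal LM interface (`model(input_ids).logits`) and is a minimal sketch, not the authors' training code.

```python
import torch
import torch.nn.functional as F

def imitation_loss(model, input_ids, response_mask):
    """Cross-entropy on API outputs only.

    input_ids:     (B, T) tokens of the prompt x followed by API_out(x)
    response_mask: (B, T) 1 where the token belongs to API_out(x), else 0
    """
    logits = model(input_ids).logits                  # (B, T, V); HF-style causal LM assumed
    shift_logits = logits[:, :-1, :]                  # predict token t from tokens < t
    shift_labels = input_ids[:, 1:].clone()
    shift_labels[response_mask[:, 1:] == 0] = -100    # ignore prompt positions
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```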


3. Building the Datasets

Dataset construction for model imitation follows two main strategies.

  1. Task-specific imitation: imitating answers for a specific task, e.g., questions like those in Natural Questions.
    • Dataset: NQ-synthetic, built by generating 6,000 additional examples from a small seed set.
    • Goal: imitate ChatGPT's behavior on that task to raise task-specific accuracy.
  2. Broad-coverage imitation: imitating ChatGPT's overall behavior across a wide range of inputs.
    • Dataset: ShareGPT-Mix, conversational data collected from several sources.
    • Goal: reproduce the full breadth of ChatGPT-like responses.


4. Main Results and Evaluation

  • Effect of data size and base model: scaling up the base model steadily improves the imitation models, whereas adding more imitation data yields little improvement on benchmarks and can even cause regressions.
  • Empirical evaluation: automatic evaluations on benchmarks such as HumanEval, MMLU, and Natural Questions show that broad-coverage imitation models close little to none of the gap between the base model and ChatGPT, even though crowdworkers rate their outputs as competitive with ChatGPT.


5. Conclusion

Model imitation reliably transfers ChatGPT's style and instruction-following behavior to open-source LMs, but its effect on actual capabilities depends heavily on the strength of the base model; the paper concludes that the capability gap cannot be closed with modest amounts of imitation data, so developing better base LMs is the higher-leverage direction.


1 Introduction

The recent release of powerful language models (LMs) such as ChatGPT (OpenAI, 2022), Bard (Pichai, 2023), and Claude (AnthropicAI, 2023) might herald a future where the best AI systems are provided primarily as a fee-based API by large companies. At the same time, open-source LMs are becoming increasingly accurate, with models like LLaMA and FLAN-T5 providing many of the same basic capabilities as their commercial counterparts, albeit at a lower level of performance (Touvron et al., 2023; Chung et al., 2022). This presents an important question, whose answer will have profound future implications: will the most powerful LMs be closed-source or will they be freely distributed for anyone to use, modify, and extend? Both possibilities have important pros and cons, and implications on policy, corporate strategy, and the future of scientific inquiry.

Figure 1: Crowdworkers initially rate the quality of our imitation models highly, as ∼70% of their outputs are rated as equal or better than those of ChatGPT (left). However, as we train on more imitation data, our models fail to further close the gap, and even begin to regress along other axes, e.g. factual knowledge according to Natural Questions (center). Our main conclusion is that the biggest limitation of current open-source LMs is their weaker base capabilities. In turn, the best way for the open-source community to improve models is by increasing these capabilities (e.g., via scaling, better pretraining data, etc.) rather than fine-tuning on more and more imitation data (right).

In this work, we study one possible resolution to this question: model imitation (Wallace et al., 2020; Orekondy et al., 2019). The premise of model imitation is that once a proprietary LM is made available via API, one can collect a dataset of API outputs and use it to fine-tune an open-source LM. In theory, this imitation process may provide an easy method to distill (Hinton et al., 2014) the capabilities of any proprietary model, thus implying that open-source LMs will always be competitive with their commercial counterparts. To date, recent works have looked to imitate OpenAI’s best systems, e.g., Self-Instruct (Wang et al., 2022a) and Alpaca (Taori et al., 2023), and initial results suggest that these models have achieved near parity with proprietary models. Consequently, there has been a growing sentiment among many members of the broader tech community that closed-source models will soon have no advantage (Patel and Ahmad, 2023).

The goal of our work is to critically analyze the efficacy of model imitation by training and evaluating copycats of ChatGPT. We first collect datasets that focus on either imitating ChatGPT for a specific task or broadly imitating it across all behaviors. We then fine-tune LMs on these datasets using a range of model sizes (1.5B–13B), base models (GPT-2 and LLaMA), and data amounts (0.3M–150M tokens). We evaluate using human and GPT-4 evaluations (blind pairwise comparisons with ChatGPT) as well as accuracy on canonical NLP benchmarks (MMLU, NQ, HumanEval).

We were initially surprised by how much imitation models improve over their base models: they are far better at following instructions, and their outputs appear similar to ChatGPT’s. This was further supported by both human and GPT-4 evaluations, where the outputs of our best imitation model were rated as competitive with ChatGPT (e.g., Figure 1, left).

However, when conducting more targeted automatic evaluations, we found that the imitation models close little to none of the large gap between LLaMA and ChatGPT. In particular, we demonstrate that imitation models improve on evaluation tasks that are heavily supported in the imitation training data. On the other hand, the models do not improve (or even decline in accuracy) on evaluation datasets for which there is little support. For example, training on 100k ChatGPT outputs from broad-coverage user inputs provides no benefits to Natural Questions accuracy (e.g., Figure 1, center), but training exclusively on ChatGPT responses for Natural-Questions-like queries drastically improves task accuracy. Consequently, we conclude that broadly matching ChatGPT using purely imitation would require (1) a concerted effort to collect enormous imitation datasets and (2) far more diverse and higher quality imitation data than is currently available.

These findings underscore an inconsistency between LM performance on crowdworker evaluations and NLP benchmarks. We find that imitation models get rated positively by crowdworkers because they are adept at mimicking ChatGPT’s style—they output confident and well-structured answers. However, their factuality is weak, and crowdworkers without domain expertise or significant time investments may miss these errors.

Overall, our key takeaway is that model imitation is not a free lunch: there exists a capabilities gap between today’s open-source LMs and their closed-source counterparts that cannot be closed by cheaply fine-tuning on imitation data. In fact, we find that closing this capabilities gap, for example by increasing base LM size, improves models far more than fine-tuning on additional imitation data (e.g., Figure 1, right). This implies that the higher leverage action for improving open-source LMs is to tackle the difficult challenge of developing better base models (e.g. by scaling up models, improving pre-training data quality, improving pre-training, etc.), rather than taking the shortcut of imitating proprietary systems. Nevertheless, we believe that model imitation has utility in subverting the need to annotate high-quality fine-tuning data if one has a sufficiently strong base LM.

Figure 2: ChatGPT and our best imitation model produce answers with similar style—they start with an overview paragraph, a list of differences, and end with a summary. However, while ChatGPT’s answer is mostly correct, the imitation model’s answer is completely inaccurate despite sounding authoritative. We show correct sentences in green, ambiguously-correct sentences in yellow, and incorrect ones in red.

2 What is Model Imitation?

Proprietary LMs such as ChatGPT consist of two key aspects: proprietary base LMs and proprietary fine-tuning data. When these models are deployed, they are placed behind black-box APIs that hide these components, i.e., users can query the API with arbitrary inputs but cannot see the model’s training data, next-token probabilities, and architecture. In model imitation, the goal is to collect data using the API to train an LM that achieves comparable performance to it, i.e., essentially distilling the target LM using an imitation training set (Wallace et al., 2020; Orekondy et al., 2019; Tramèr et al., 2016). Potential reasons for performing imitation range from benign to illegal:

  • Academics can use powerful imitation LMs to drive new research projects.
  • Companies can use imitation LMs to launch services that compete with the proprietary system.
  • Malicious users could use imitation models to accelerate progress on nefarious use cases.

Local versus Broad Imitation When performing model imitation, one will either look to perform local “task-specific” imitation or more global “broad-coverage” imitation. The former imitates the target model on just a specific task or domain, e.g., sentiment analysis of tweets or question answering over Wikipedia entities. The latter focuses on the more ambitious goal of broadly imitating the target model across its full spectrum of behaviors, domains, and tasks. Broad-coverage imitation is challenging because (1) one must collect an extremely diverse imitation dataset and (2) imitation models must capture this wide data distribution and generalize similarly to the target model on a myriad of held-out examples.

Query:
How does actor critic improve over REINFORCE?

ChatGPT Response: Actor-critic algorithms are an extension of the REINFORCE algorithm that combine both policy-based and value-based methods. Some of the changes that actor-critic algorithms make over standard REINFORCE:

  1. Function Approximation:
    Actor-critic algorithms typically use function approximation methods to represent the value and policy functions. This allows them to learn in high-dimensional state spaces.
  2. Critic Network:
    In actor-critic algorithms, there is an additional critic network that estimates the value function. The critic provides an estimate of the value of a state, which is then used to update the policy. This reduces the variance of the update and stabilizes learning.
  3. Advantage Estimation:
    Actor-critic algorithms use the advantage function to estimate the quality of an action taken by the policy. The advantage function is the difference between the return of an action and the estimated value of a state. This reduces the variance of the update and allows for more efficient learning.

Overall, actor-critic algorithms provide a more efficient and stable way of learning in reinforcement learning tasks, particularly in high-dimensional state spaces.

Imitation Model Response: Actor-critic algorithms are a type of reinforcement learning algorithm that improves the REINFORCE algorithm by combining the actor (policy) and critic (value) components. The main changes that actor-critic algorithms make over standard REINFORCE are:

  1. Single Critic:
    Actor-critic algorithms use a single critic to evaluate the value function of the policy, whereas standard REINFORCE uses a separate critic for each action.
  2. Single Actor:
    Actor-critic algorithms use a single actor to determine the actions to take, whereas standard REINFORCE uses a separate actor for each action.
  3. Simultaneous Learning:
    Actor-critic algorithms learn the policy and actor simultaneously, whereas standard REINFORCE learns the policy and actor separately.
  4. Optimistic Estimate:
    Actor-critic algorithms use an optimistic estimate of the value function to determine the actor’s actions, whereas standard REINFORCE uses the critic’s estimate.

Overall, actor-critic algorithms combine the critic’s and actor’s estimates in order to learn both the policy and actor simultaneously, while standard REINFORCE learns them separately. This can lead to faster convergence and potentially better generalization, but the choice between the two algorithms depends on the specific problem and the desired behavior of the agent.

Recent Work on Model Imitation A surge of recent publications has attempted to both locally imitate proprietary models for specific tasks (Sun et al., 2023; Hsieh et al., 2023; Honovich et al., 2022) and broadly imitate models, e.g., Alpaca (Taori et al., 2023), Vicuna (Chiang et al., 2023), Koala (Geng et al., 2023), GPT4ALL (Anand et al., 2023), and more (Wang et al., 2022a; Peng et al., 2023). Many of these works conclude that their imitation models achieve near parity with the target model, e.g., Vicuna claims to achieve 90% of the quality of ChatGPT and Google Bard. These claims have since been propagated out into the broader tech community, leading many to believe that open-source LMs are rapidly closing the gap to their closed-source counterparts and that top AI companies will soon have no competitive advantage (Patel and Ahmad, 2023).

Our goal. The goal of our paper is to critically evaluate this line of reasoning. In particular, we train models to imitate ChatGPT while experimenting with different decisions (e.g., data collection strategies, data amounts, and base LMs) and conducting rigorous automatic and human evaluations.

3 Building Imitation Datasets

We consider both task-specific and broad-coverage imitation. For either form of model imitation, one must curate a set of inputs to query to the target model. In practice, one may have a set of inputs in mind (e.g., sentences from Wikipedia, tweets about Coca-Cola) and if this set of input examples is sufficiently large, one can use them to query the target model and build an imitation dataset. In cases when it is impractical or labor intensive to create a large and diverse pool of inputs, one can also create synthetic examples by prompting LMs to iteratively generate examples that are from the same distribution as an initial smaller seed set of inputs (Wang et al., 2022a; Honovich et al., 2022).

Task-specific imitation For task-specific imitation, we created an imitation dataset tailored to Natural Questions (Kwiatkowski et al., 2019a), i.e., factual knowledge about Wikipedia entities. In particular, we first curated a seed set of ten QA pairs from the validation dataset. We then iteratively generated 6,000 additional examples by prompting ChatGPT with five random QA pairs and asking it to generate similar but distinct examples. All of these examples are single turn, without any dialogue history. We refer to this dataset as NQ-synthetic and provide further details in Appendix A.
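A hedged sketch of this generation loop is shown below. `query_chatgpt` is a placeholder for an API client, and the prompt wording, seed pairs, and parsing are illustrative rather than the paper's exact prompt (see Appendix A of the paper for the real details).

```python
import random

# Illustrative seed pairs only; the paper seeds with ten QA pairs from the NQ validation split.
SEED_QA = [
    ("who wrote the declaration of independence", "Thomas Jefferson"),
    ("what is the capital of australia", "Canberra"),
]

def query_chatgpt(prompt: str) -> str:
    """Placeholder: call your ChatGPT API client here."""
    raise NotImplementedError

def generate_nq_synthetic(n_examples: int = 6000):
    dataset, question = [], None
    while len(dataset) < n_examples:
        shots = random.sample(SEED_QA, k=min(5, len(SEED_QA)))
        prompt = "Generate new question-answer pairs similar but not identical to these:\n"
        prompt += "\n".join(f"Q: {q}\nA: {a}" for q, a in shots)
        for line in query_chatgpt(prompt).splitlines():   # naive "Q:/A:" parsing
            if line.startswith("Q:"):
                question = line[2:].strip()
            elif line.startswith("A:") and question:
                dataset.append({"question": question, "answer": line[2:].strip()})
                question = None
    return dataset[:n_examples]
```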

Broad-coverage imitation For the more ambitious goal of broad-coverage imitation data, we leverage the fact that models such as ChatGPT have become so popular that their inputs and outputs are already widely posted on the web. Thus, we can collect a large, diverse, and generally high-quality dataset of examples for free without ever having to interact with the company’s API. In particular, we collect examples from three sources:

  • ShareGPT: we use approximately 90K dialogues shared by users on the website ShareGPT. To maintain data quality, we deduplicated on the query level and removed any non-English conversations using a language detector. This leaves approximately 50K examples, each of which consists of multiple turns of dialogue.
  • HC3 (Guo et al., 2023): we use the ChatGPT responses from the English Human-ChatGPT Comparison Corpus. This contains ∼27K ChatGPT responses for ∼24K questions.
  • Discord ChatGPT Bots: we use 10k input-output examples collected from the r/ChatGPT and Turing AI Discord servers, two public channels that allow users to interact with ChatGPT bots.

We refer to this dataset as ShareGPT-Mix and show qualitative examples in Appendix A. We find that ShareGPT-Mix is generally of high quality. First, there is high diversity in the instructions: for each user query in the dataset, the most similar other user query has an average BLEU score similarity of just 8%. This is considerably lower than that of other datasets such as SuperNaturalInstructions (Wang et al., 2022b), which is at 61% BLEU similarity for a similarly sized set of examples. We also manually reviewed different examples and logged their semantic category (see Table 5 in Appendix A). The dataset contains diverse categories, including many multi-lingual conversations and coding tasks.
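The diversity measurement above can be approximated as follows. This is our own sketch using NLTK's sentence-level BLEU (the paper does not state its exact BLEU implementation), and it is O(n²), so in practice it would be run on a sample of the queries.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def avg_max_bleu_similarity(queries):
    """For each query, find its most similar other query by BLEU and average those maxima."""
    tokenized = [q.lower().split() for q in queries]
    smooth = SmoothingFunction().method1     # avoid zero scores on short queries
    maxima = []
    for i, hyp in enumerate(tokenized):
        best = max(
            sentence_bleu([ref], hyp, smoothing_function=smooth)
            for j, ref in enumerate(tokenized) if j != i
        )
        maxima.append(best)
    return sum(maxima) / len(maxima)

# A value near 0.08 would correspond to the ~8% average similarity reported for ShareGPT-Mix.
```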

Figure 3: We find that GPT-4 and crowdworker evaluations show the same trends. As we scale up the amount of imitation data, GPT-4’s ratings of our imitation models are relatively flat (left). However, as we scale up the base model size, GPT-4 rates the quality of our imitation models increasingly highly (right).

4 Main Results

We train imitation LMs using our ShareGPT-Mix and NQ-synthetic datasets, and we conduct both human and automatic evaluations. We focus our initial results on the ShareGPT-Mix models.

4.1 Training and Evaluation Setup

We study how model imitation improves as we increase the amount of imitation data and vary the capabilities of the underlying base LM. We consider decoder-only models ranging in size from 1.5B to 13B parameters: GPT-2 1.5B (Radford et al., 2019), LLaMA 7B (Touvron et al., 2023), and LLaMA 13B.2 We also study the effect of data scale by fine-tuning on different-sized data subsets.

During training, we chunk the conversations into 2048-token blocks. We introduce special tokens that demarcate the beginning of each user query and model output. We fine-tune using standard LM losses on only the model outputs. Following Chung et al. (2022); Chowdhery et al. (2022), we train for one epoch using the AdamW optimizer with gradients re-scaled by the magnitude of each weight. We use a learning rate of 2e-3 with 1000 steps of linear warm-up from 0, and we train with batch size 32. All models are trained in JAX using a combination of fully sharded data parallelism and tensor parallelism on TPUs hosted by Google Cloud or on a single Nvidia DGX server with 8 A100 GPUs.
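A hedged sketch of this data packing is shown below: conversations are serialized with special delimiter tokens, labels on user-query spans are masked out (the conventional -100 ignore index), and the token stream is cut into 2048-token blocks. The delimiter token names and the tokenizer interface are assumptions for illustration, not identifiers from the paper.

```python
BLOCK_SIZE = 2048
USER_TOKEN, MODEL_TOKEN = "<|user|>", "<|assistant|>"   # illustrative special tokens

def pack_conversations(conversations, tokenizer):
    """conversations: list of dialogues, each a list of (role, text) turns."""
    ids, labels = [], []
    for turns in conversations:
        for role, text in turns:
            marker = USER_TOKEN if role == "user" else MODEL_TOKEN
            toks = tokenizer.encode(marker + text)      # assumes a HF-style tokenizer
            ids.extend(toks)
            # standard LM loss on model outputs only: mask user-query tokens with -100
            labels.extend([-100] * len(toks) if role == "user" else toks)
    blocks = []
    for start in range(0, len(ids) - BLOCK_SIZE + 1, BLOCK_SIZE):
        blocks.append(
            {"input_ids": ids[start:start + BLOCK_SIZE],
             "labels": labels[start:start + BLOCK_SIZE]}
        )
    return blocks
```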

For automatic evaluations, we measure performance on 5-shot MMLU (Hendrycks et al., 2021), 3-shot Natural Questions (Kwiatkowski et al., 2019b), and 0-shot HumanEval (Chen et al., 2021). We report the original scoring metrics associated with each dataset (e.g., exact match for NQ). For human evaluation, we conduct blind pairwise output comparisons using Mechanical Turk. In our UI, we present each rater with a task instruction and the output of two unknown models, one of which is ChatGPT and the other is one of our imitation models (see Figure 7 in Appendix B). The raters select which output they prefer or if the two outputs are equal in quality. We use approximately 70 crowd workers and evaluate on 255 held-out prompts.3 We report the average preference across the dataset and one standard deviation around the mean. Additionally, we conduct evaluations using GPT-4 and present additional details of the prompts used in Appendix C. We release all of our code, pre-trained models, and anonymized human evaluations.4
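For reference, the headline preference numbers can be aggregated from such ratings as in the small sketch below (our own bookkeeping, not the authors' released evaluation code): each blind rating is recorded as a win for the imitation model, a win for ChatGPT, or a tie.

```python
import statistics

def preference_summary(ratings):
    """ratings: list of 'imitation', 'chatgpt', or 'tie' from blind pairwise comparisons."""
    scores = [1.0 if r in ("imitation", "tie") else 0.0 for r in ratings]
    mean = sum(scores) / len(scores)   # fraction rated equal or better than ChatGPT
    std = statistics.pstdev(scores)    # one simple way to report the spread around the mean
    return mean, std

print(preference_summary(["imitation", "tie", "chatgpt", "tie", "chatgpt"]))  # (0.6, ~0.49)
```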

2 We use model scale as a proxy for base-model quality; however, model quality could also be improved by other factors such as the quality of pre-training data, architectural improvements, novel pre-training methods, etc.

3 To mitigate any test-set leakage, we filtered out queries with a BLEU score greater than 20% with any example from our training set. We also removed non-English and coding-related prompts, as these cannot be reliably reviewed by crowd workers. We pay the evaluators roughly $15/hour based on the average time it takes to complete a task. We select workers who have a ≥ 95% approval rating, are located in an English-speaking country, and have completed at least 100 HITs.

Figure 4: Automatic evaluations. As we increase the amount of imitation data, there is little improvement on various benchmarks, or even performance regressions (top). On the other hand, scaling up the base LM steadily improves results (bottom), suggesting that the key difference between open-source and closed-source LMs is a raw capabilities gap, rather than the fine-tuning data used.

4.2 Qualitative Analysis and Crowdworker Evaluation Show Promise

Imitation models are rated highly by crowdworkers. We were initially surprised at the quality of our ShareGPT-mix models: while the base GPT-2 or LLaMA models often fail to follow instructions, the imitation models produce outputs that stay on task. These initial promises were further supported, as crowdworkers and GPT-4 often rated the quality of the imitation models’ outputs as equal or better than those of ChatGPT, especially as we scale up model size (right of Figure 1 and 3). However, we also find that human ratings quickly saturate as we scale up the amount of imitation data (left of Figure 1 and 3), alluding to possible shortcomings of this approach.

4.3 Targeted Automatic Evaluations Expose Failure Modes

Broad-coverage imitation models fail to close the gap across most tasks. We next ran targeted automatic evaluations to isolate whether specific model capabilities improved after imitation. We found that across every benchmark that we measured, ShareGPT-mix imitation models do not improve (or even decline) in accuracy as compared to the base model, even when adding additional imitation data (Figure 4, top). This shows that imitating ChatGPT on our broad-coverage imitation data does not improve the model across most axes, e.g., factual knowledge, coding, and problem solving.

We argue that this occurs because ChatGPT has captured far more knowledge and capabilities from the web as compared to LLaMA. In turn, it is unreasonable to expect that a small amount of imitation data (e.g., 1000x less data than pre-training) would enable one to bridge this gap. Instead, we argue that broadly matching ChatGPT using weaker base LMs such as LLaMA-13B would require a concerted effort to collect an extremely large and diverse imitation dataset that is far closer to the scale of pretraining. It is currently unclear whether such an effort is worth undertaking or feasible.


4 Codebase available at https://github.com/young-geng/EasyLM, data available at https://huggingface.co/young-geng/koala-eval, and pre-trained models available at https://huggingface.co/young-geng/koala.

Training local imitation models is far more successful. On the other hand, our model trained to locally imitate ChatGPT using the NQ-synthetic data is far more successful. In particular, the imitation models’ performance improves significantly as compared to the LLaMA base model (see Table 1) and quickly approaches the accuracy of ChatGPT. This demonstrates that it is far more feasible to distill a specific behavior from ChatGPT as opposed to broadly matching its capabilities.

An empirical trade-off exists between different evaluation datasets. A curious phenomenon is that training on more ShareGPT-Mix data hurts performance as compared to the base model on some of our evaluations (compare the black versus blue lines in Figure 4). We believe that these performance regressions arise from a distribution shift and tension between the conversational-style fine-tuning data and the downstream benchmarks. An open problem is whether these performance regressions can be mitigated using regularization or by mixing in pre-training data during fine-tuning.

Improving base LMs is the highest leverage action. Rather than increasing imitation data size, we find that using better base LMs (by increasing base model size) does lead to substantial accuracy improvements (Figure 4, bottom). This aligns with our previous claim: there exists a capabilities gap between today’s open-source LMs and their closed-source counterparts that cannot be closed by cheaply fine-tuning on imitation data. Instead, the best way to improve open-source LMs is to tackle the difficult challenge of developing better base LMs, whether it be via model scaling or other means.

4.4 Imitation Models Learn Style, Not Content

Finally, we investigate why there is a strong discrepancy between crowdworker evaluations, where imitation models appear quite strong, and results on NLP benchmarks, where imitation models appear no better than base LMs. We find that imitation models perform well according to human evaluations because they are adept at mimicking ChatGPT’s style—they output fluent, confident, and well-structured answers. In particular, we show in Table 2 that as we add more imitation data, ChatGPT and our imitation models produce outputs with a similar length, similar word choice, similar use of an authoritative tone, and similar low-level structure (e.g., use of lists).

However, as shown in our previous automatic evaluations, the imitation models have weak factuality. In other words, imitation models actually embody some of the worst aspects of AI assistants: their answers sound confident but are less factual than ChatGPT. This is perhaps best elucidated in Figure 2, where the imitation model outputs an answer that is similar in style to ChatGPT’s answer but is completely incorrect.

Human evaluation is increasingly hard. Unfortunately, crowd workers without domain expertise or significant time investments can easily be deceived by stylistic components—answers that sound confident and correct are spuriously chosen more often. To improve human evaluation, it is thus increasingly necessary both to engage domain experts and to curate a set of highly difficult prompts that can rigorously test different models’ capabilities. Surprisingly, our GPT-4 evaluations also showed the same trends as our crowdworker evaluations (albeit with a slightly larger absolute preference for ChatGPT’s outputs). While this suggests that GPT-4 may be a viable candidate to cheaply emulate human evaluations on some tasks, it also implies that LLMs may replicate some human-like cognitive biases. We look forward to future work that further investigates this possibility.

Imitation models inherit the safety and toxicity style of the teacher model. Finally, despite imitation only providing benefits in mimicking the “style” or “persona” of the target model, there is still value in doing so. For example, OpenAI has carefully and deliberately trained ChatGPT to be “harmless” to end users, often avoiding toxic outputs and refusing to respond to questionable user requests. We find that our imitation models also inherit these components. In particular, we show in Figure 5 that as we finetune on more imitation data, the imitation model’s outputs become less toxic on RealToxicityPrompts (Gehman et al., 2020), as the model learns to abstain in a similar fashion to ChatGPT. Consequently, we conclude that model imitation is highly effective in cases when one has a powerful base LM and is looking to subvert the need to annotate expensive fine-tuning data.
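A hedged sketch of that evaluation loop: generate a continuation for each RealToxicityPrompts prompt, score it, and average the non-toxicity. `perspective_toxicity` is a placeholder for a real Perspective API client; the scoring details here are assumptions for illustration, not the authors' exact setup.

```python
def perspective_toxicity(text: str) -> float:
    """Placeholder: return a toxicity score in [0, 1] from the Perspective API."""
    raise NotImplementedError

def avg_non_toxicity(generate_fn, prompts):
    """generate_fn: callable mapping a prompt string to the model's continuation."""
    scores = [1.0 - perspective_toxicity(generate_fn(p)) for p in prompts]
    return sum(scores) / len(scores)
```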

Table 1: We train imitation models on broad-coverage data from ShareGPT-Mix or targeted Natural-Questions-like data (NQ-synthetic). The broad-coverage models do not improve on zero-shot NQ (or even degrade in performance), demonstrating the ineffectiveness of imitating the capabilities of ChatGPT holistically. However, the NQ-synthetic models substantially close the gap to ChatGPT on NQ, showing that local imitation of a model is far more feasible in practice.

Figure 5: We evaluate imitation models on RealToxicityPrompts and report the average non-toxicity score according to the Perspective API. The results show that imitation models are significantly less toxic than the baseline models, i.e., they learn to inherit the safety and toxicity guidelines of the target models.

Table 2: As we add more imitation data, the style of our models’ outputs are increasingly similar to those of ChatGPT. In particular, we generate outputs from our imitation models and compare them to a random ChatGPT response across different metrics. We also report a rough “upper bound” by comparing a second random ChatGPT output to the original ChatGPT response (ChatGPT #2).

5 Discussion

Finetuning as a simple knowledge extractor. Our results show that a modest amount of fine-tuning provides little to no improvements on an LM’s knowledge or capabilities. We thus agree with the view that pre-training is the main source of an LM’s capabilities, and that fine-tuning acts as a lightweight method to train the model to extract its own knowledge (Schulman, 2023). This is the reason why improving models by imitating ChatGPT on a small set of data is insufficient, as the base knowledge is largely unaffected. Furthermore, this view suggests that during fine-tuning, you may even want to avoid introducing new knowledge (i.e., do not imitate better models), as you will otherwise be training the model to guess or hallucinate its answers, rather than actually doing the task as intended (Schulman, 2023; Gao, 2021; Goldberg, 2023).

Should you be worried about imitation? Imitating proprietary LMs comes with many potential implications for small and large companies alike. Our results suggest that the efficacy of model imitation is limited when there is a large gap between the base and target LM. Thus, we believe that companies who can establish a capabilities gap using large amounts of data, compute, or algorithmic advances are the ones who are best positioned to build and maintain competitive advantages. On the other hand, companies that look to build moats by using off-the-shelf LMs with proprietary fine-tuning datasets may be comparatively more vulnerable to imitation.

Potential confounders to our findings. While we believe our findings are well supported, there are a few potential hidden confounders that could change our conclusions. First, as we are unaware of the pre-training data used by ChatGPT, it is possible that some of the tasks that we evaluate on could have been contaminated into ChatGPT’s training data, thus inflating its accuracy numbers. Moreover, to conduct imitation, we perform supervised learning on the outputs from the target model. However, it may also be possible to use the target model to perform RLHF or constitutional AI (OpenAI, 2022; Christiano et al., 2017; Bai et al., 2022) to further improve results. Lastly, we only considered relatively simple methods for collecting imitation data; however, there may be more advanced methods (e.g., active learning) that could improve the effectiveness or efficiency of model imitation.

Implications for other forms of model imitation There has been a flurry of recent work that performs model imitation in more indirect ways than we study here. For example, the training process of many recent vision-language models (Li et al., 2022; Liu et al., 2023; Ye et al., 2023; Zhu et al., 2023) includes ChatGPT or GPT-4 outputs at some stages. Furthermore, it has become common to use large LMs in various ways during the data annotation and creation process, e.g., to aid crowd workers, to perform data augmentation, to identify mislabeled data, and more. Our findings may have implications for these approaches, e.g., it is likely that vision-language models that include OpenAI data may have similar failure modes to the ones described in our work.

Technical limitations of model imitation Imitating proprietary models also has various technical limitations: the models inherit the weaknesses and biases of proprietary models, imitation does not allow one to directly improve on the design decisions of closed AI companies (e.g., data annotation strategies), and these systems are roughly upper-bounded by the capabilities of the target proprietary model. Moreover, it is difficult to answer certain scientific questions using imitation models because they include proprietary black-box models in their training pipeline.

6 Related Work

Model distillation Model imitation is similar to model distillation (Hinton et al., 2014), where one trains a student model to imitate a teacher. While conceptually similar, there are several major practical differences. For distillation, the training data, model architecture, and hyperparameters are known for the teacher. In model imitation, one tries to imitate the teacher without this knowledge. Moreover, for distillation it is common to use training objectives that utilize the probability distribution of the teacher, whereas in model imitation such a distribution is typically unavailable.

Past work on model imitation Prior work has shown that model imitation is possible for various domains (Orekondy et al., 2019; Tramèr et al., 2016; Lowd and Meek, 2005), including language classifiers (Krishna et al., 2020; Pal et al., 2019) and machine translation systems (Wallace et al., 2020). Nevertheless, past work considers a setting where models are trained from scratch, and thus the main proprietary nature of a model is the company’s internal training data. In our setting, systems like ChatGPT are proprietary because they also leverage OpenAI’s internal pre-trained LMs that are stronger than any available open-source LM.

Defending against model imitation Our results show that imitation is a moderate concern for companies. In turn, there is a need to develop methods to mitigate or detect imitation. There is an existing body of work in this direction, e.g., one can detect whether a particular model is trained via imitation (Krishna et al., 2020; Juuti et al., 2019; Szyller et al., 2019; Maini et al., 2021) or slow model stealing by sacrificing some performance (Wallace et al., 2020; Orekondy et al., 2020; Dziedzic et al., 2022a,b). Unfortunately, existing methods often exhibit too severe of a tradeoff to be deployable in practice.

7 Conclusion and Future Work

In this work, we critically analyzed the efficacy of model imitation. We showed that imitation can indeed improve the style, persona, and instruction adherence of open-source LMs. However, imitation falls short in improving LMs across more challenging axes such as factuality, coding, and problem solving. On one hand, these results indicate that businesses can successfully establish and safeguard a competitive advantage by pre-training powerful base models. Conversely, it also implies that if two groups possess equally competent base LMs, one can easily mimic the persona and behavior of the other model, without needing to annotate expensive fine-tuning data.

Moving forward, our findings raise a range of technical and societal questions. First, we show that existing crowd worker evaluations have trouble elucidating the differences between imitation models and proprietary ones, despite clear differences existing between them. In turn, the future of human evaluation remains unclear: how can we cheaply and quickly probe the utility of a powerful LLM?

Second, given the large gap between LLaMA and ChatGPT (the latter model is faster, cheaper, and more accurate), and the insufficiencies of model imitation, there are obvious open questions on how to best improve open-source LMs (e.g., increasing model scale, improving pre-training data quality, developing new pretraining methods, etc). Finally, our work raises ethical and legal questions, including whether the open-source community should continue to advance progress by “stealing” what OpenAI and other companies have done, as well as what legal countermeasures companies can take to protect and license intellectual property. In future work, we hope to delve deeper into these issues and devise better methods for the ethical and responsible deployment of LMs.

