
NEFTune

  • Related Project: Private
  • Category: Paper Review
  • Date: 2023-11-09

NEFTune: Noisy Embeddings Improve Instruction Finetuning

  • url: https://arxiv.org/abs/2310.05914
  • pdf: https://arxiv.org/pdf/2310.05914
  • abstract: We show that language model fine-tuning can be improved, sometimes dramatically, with a simple augmentation. NEFTune adds noise to the embedding vectors during training. Standard fine-tuning of LLaMA-2-7B using Alpaca achieves 29.79% on AlpacaEval, which rises to 64.69% using noisy embeddings. NEFTune also improves over strong baselines on modern instruction datasets. Models trained with Evol-Instruct see a 10% improvement, with ShareGPT an 8% improvement, and with OpenPlatypus an 8% improvement. Even powerful models further refined with RLHF such as LLaMA-2-Chat benefit from additional training with NEFTune.


TL;DR


  1. Proposes Noisy Embedding Instruction Fine-Tuning (NEFTune): adding random noise to the embedding vectors of the training data improves the outcome of instruction fine-tuning.
  2. Better conversational quality and benchmark performance: NEFTune substantially improves conversational quality and benchmark scores, with consistent gains across a variety of datasets.
  3. Mathematical grounding and experimental validation: confirms that the mathematically motivated noise injection improves the model's generalization ability and reduces overfitting.

1 Introduction

The ability to follow instructions has a large impact on a language model's usefulness. Language models are typically pre-trained on web data and then fine-tuned on a comparatively small but carefully curated instruction dataset, so instruction fine-tuning must squeeze as much as possible out of this limited data. This paper proposes a simple technique that adds random noise to the embedding vectors of the training data and shows that it can substantially improve the results of instruction fine-tuning. In particular, with NEFTune the LLaMA-2-7B model improves from 29.8% to 64.7% on AlpacaEval.

1.1 Prior Work

Early forms of instruction fine-tuning can be seen in models such as FLAN and T0, which focused on improving generalization across a wide range of NLP tasks. Later, models such as InstructGPT were trained with reinforcement learning from human feedback (RLHF), which proved effective for generating long, high-quality text. These models were improved by fine-tuning on a variety of datasets, including datasets such as AlpaGasus that were built by filtering for high-quality data. The first case of using noise to improve language models is the FreeLB method, which used adversarial perturbations to boost model performance.


2 NEFTune: Noisy Embedding Instruction Fine-Tuning

Instruction models are trained on datasets of instruction–response pairs. NEFTune departs from standard training by adding a random noise vector to the embeddings. The noise is generated by sampling independent and identically distributed (iid) uniform entries, and the full noise vector is scaled by $\alpha / \sqrt{Ld}$, where $L$ is the sequence length, $d$ is the embedding dimension, and $\alpha$ is a tunable parameter. This scaling rule is borrowed from the adversarial machine learning literature and gives the noise an expected Euclidean magnitude of roughly $\alpha / \sqrt{3}$.

\[\text{Expected Noise Magnitude} \approx \alpha / \sqrt{3}\]

The paper argues that this rule keeps the model from overfitting to the distribution of the training dataset and encourages more generalized responses.
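The approximation above is not derived explicitly in the text, but it follows from the second moment of the uniform distribution; a short sanity-check derivation, using the notation defined above:

\[\mathbb{E}\big[\lVert \epsilon \rVert_2^2\big] = \left(\frac{\alpha}{\sqrt{Ld}}\right)^{2} \cdot Ld \cdot \mathbb{E}\big[u^2\big] = \frac{\alpha^2}{Ld} \cdot Ld \cdot \frac{1}{3} = \frac{\alpha^2}{3}, \qquad u \sim \text{Uniform}(-1, 1)\]

so the root-mean-square norm of the noise is exactly $\alpha/\sqrt{3}$, and for large $Ld$ the realized norm concentrates near this value.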


3 Experimental Setup

3.1 Models

Experiments were run mainly with 7B-parameter LLMs: LLaMA-1, LLaMA-2, and OPT-6.7B, whose benchmark performance differs according to the number of tokens seen during pre-training. In addition, the LLaMA-2-Chat (7B) model, already refined with RLHF, was fine-tuned further.

3.2 Instruction Fine-Tuning Datasets

A variety of fine-tuning datasets were used. Alpaca, Evol-Instruct, Open-Platypus, and ShareGPT were each built to improve LLM performance in particular domains and are constructed to cover more complex and diverse questions.

3.3 Evaluation

Conversational ability was evaluated with AlpacaEval, and the model's knowledge and inference ability with the benchmarks of the OpenLLM Leaderboard. The paper reports that models trained with NEFTune show consistent gains across all datasets.


4 Results

NEFTune improves performance by an average of 15.1% across all datasets, suggesting that it raises both the quality and the accuracy of conversations. It also shows that NEFTune is an effective way to improve conversational ability without harming the model's classical capabilities.


5 Analysis

Using NEFTune keeps the model from overfitting to specific elements of the instruction dataset and leads it to produce longer answers, which yields better generalization and higher performance. These results are backed by extensive experiments and show that NEFTune is an effective improvement for instruction-following tasks.


[Reference 1] Noisy Embedding Tuning

The basic idea of adding noise to embedding vectors is to make the model respond robustly to small perturbations of the data. This is known to help the model learn the essential characteristics of the data and to prevent overfitting.

Noise Generation

  1. Introduce randomness: first, for each component of an embedding vector, a random number uniformly distributed over $[-1, 1]$ is drawn independently. The noise is independent across dimensions and follows the same distribution in each.
  2. Scale the noise: the generated noise has the same dimensions as the embeddings, and each noise vector is multiplied by a scaling factor to control its overall size (see the sketch after this list). The factor has the form $\alpha/\sqrt{Ld}$, where $\alpha$ controls the overall magnitude of the noise, $L$ is the sequence length, and $d$ is the embedding dimension. This scaling governs how strongly the noise influences training.
\[\text{Scaled Noise} = \frac{\alpha}{\sqrt{Ld}} \times \text{Uniform}([-1, 1]^d)\]
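The two steps above can be written in a few lines of PyTorch. This is a minimal illustrative sketch, not the authors' released code; the tensor layout and the function name are assumptions:

```python
import torch

def neftune_noise(embeddings: torch.Tensor, alpha: float) -> torch.Tensor:
    """Add NEFTune-style uniform noise to a batch of token embeddings.

    embeddings: tensor of shape (batch, seq_len, embed_dim)
    alpha: noise scale (the paper sweeps values such as 5, 10, 15)
    """
    seq_len, embed_dim = embeddings.shape[1], embeddings.shape[2]
    # Step 1: iid entries drawn uniformly from [-1, 1], same shape as the embeddings.
    noise = torch.empty_like(embeddings).uniform_(-1.0, 1.0)
    # Step 2: scale by alpha / sqrt(L * d); the expected L2 norm is then ~ alpha / sqrt(3).
    scale = alpha / (seq_len * embed_dim) ** 0.5
    return embeddings + scale * noise
```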

Expected Effects of the Noise

  • Promoting generalization: with noise added, the model learns to focus on the underlying characteristics of the data rather than on specific patterns or noise in the training dataset, which helps prevent overfitting and lets the model generalize better to new data.
  • Increasing robustness: training the model to respond reliably to small perturbations or distortions makes it more robust.
  • Performance improvement: prior work in a range of domains reports that training with noise can lead to better model performance.

1 Introduction

The ability of LLMs to follow detailed instructions is vital to their usefulness. Generative language models are typically trained on raw web data, and then subsequently fine-tuned on a comparatively small but carefully curated set of instruction data. Instruction fine-tuning is crucial to taming the power of LLMs, and the usefulness of a model is largely determined by our ability to get the most out of small instruction datasets.

In this paper, we propose to add random noise to the embedding vectors of the training data during the forward pass of fine-tuning. We show that this simple trick can improve the outcome of instruction fine-tuning, often by a large margin, with no additional compute or data overhead. Noisy Embedding Instruction Fine Tuning (NEFTune), while simple, has a strong impact on downstream conversational quality. When a raw LLM like LLaMA-2-7B is finetuned with noisy embeddings, its performance on AlpacaEval improves from 29.8% to 64.7% (Figure 1) – an impressive boost of around 35 percentage points (Touvron et al., 2023b; Dubois et al., 2023). NEFTune leads to this surprising and large jump in performance on conversational tasks, maintaining performance on factual question answering baselines. This technique seems to be a free lunch for LLM fine-tuning.

Figure 1: AlpacaEval Win Rate percentage for LLaMA-2-7B models finetuned on various datasets with and without NEFTune. NEFTune leads to massive performance boosts across all of these datasets, showcasing the increased conversational quality of the generated answers.

The earliest forms of instruction fine-tuning such as FLAN and T0 (Sanh et al., 2021; Wei et al., 2021) focused on cross-task generalization in language models. Encoder-decoder language models were finetuned on a broad range of NLP tasks (about 100) and then evaluated on a set of different tasks. This was later scaled up to include thousands of tasks, seeing further improvement over the original FLAN (Chung et al., 2022; Xu et al., 2022). Although these works showed that LLMs could be easily adapted to solve simple and classical NLP tasks, real-world scenarios require LLMs to provide free-form answers to open-ended queries.

InstructGPT (Ouyang et al., 2022) was the first model to tackle open-ended queries with impressive performance. OpenAI further trained GPT-3 (Brown et al., 2020) using reinforcement learning from human feedback (RLHF) to align the model. This procedure gave rise to highly popular models like ChatGPT (OpenAI, 2022) that captured the imagination of the general public and generated longer coherent text than its InstructGPT predecessor.

This led to the work of Wang et al. (2022) (Self-Instruct), which used InstructGPT (Text-Davinci-003) to produce instruction-output pairs which could be used to finetune the foundation models like LLaMA into instruction following variants like Alpaca (Taori et al., 2023). Through the rise in popularity of distilled models Taori et al. (2023), the community has constructed other datasets distilling in particular ways from other models like ChatGPT, including Xu et al. (2023). In another approach, ShareGPT (Chiang et al., 2023) was constructed by crowd sourcing real user conversations from ChatGPT. Other datasets like Lee et al. (2023) construct a dataset to improve specific aspects of the model like STEM question answering and logical reasoning. AlpaGasus (Chen et al., 2023) filters data by quality (according to GPT-4) to improve performance.

It should be noted that noisy inputs have been used to improve models in various ways. The first instance of noise being used to improve language models was the FreeLB method by Zhu et al. (2019), who observed that adversarial perturbations boosted the performance of MLM models. The noise in this case is not random, but is rather computed by first adding a small Gaussian perturbation to the embeddings and then using a gradient step to find the perturbation that maximally alters model performance. This adversarial augmentation approach also improves model performance on graphs Kong et al. (2022). While our proposed scheme is non-adversarial, we adopt the noise scaling rules from these works. Training on noisy inputs has also been done for other applications, such as to improve image captioning systems (Nukrai et al., 2022), and as a common component of early differential privacy mechanisms (Dwork et al., 2014).

2 NEFTUNE: Noisy Embedding Instruction Finetuning

Instruction models are trained on datasets comprising pairs of instructions and responses. Each step of NEFTune begins by sampling an instruction from the dataset, and converting its tokens to embedding vectors. NEFTune then departs from standard training by adding a random noise vector to the embeddings. The noise is generated by sampling iid uniform entries, each in the range $[-1, 1]$, and then scaling the entire noise vector by a factor of $\alpha/\sqrt{Ld}$, where $L$ is the sequence length, $d$ is the embedding dimension, and $\alpha$ is a tunable parameter.

This scaling rule was borrowed from the adversarial ML literature (Zhu et al., 2019; Kong et al., 2022), and results in a random vector with an expected Euclidean magnitude of approximately $\alpha/\sqrt{3}$. Algorithm 1 describes our method in detail.
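One convenient way to realize this in a fine-tuning loop is to hook the model's input embedding layer so that noise is added only in training mode. The sketch below is an assumption-laden illustration rather than the paper's Algorithm 1 or its released code: it presumes a PyTorch model exposing `get_input_embeddings()` (as Hugging Face causal LMs do) and uses the padded batch sequence length for $L$; handling each sequence's true length separately would match the per-sequence definition of $L$ more closely.

```python
import torch

def make_neftune_hook(alpha: float):
    """Build a forward hook that adds uniform noise to embedding outputs."""
    def hook(module, inputs, output):
        if not module.training:          # leave evaluation/generation noise-free
            return output
        batch, seq_len, embed_dim = output.shape
        scale = alpha / (seq_len * embed_dim) ** 0.5
        return output + torch.empty_like(output).uniform_(-scale, scale)
    return hook

# Hypothetical usage with a Hugging Face-style causal LM:
# embed_layer = model.get_input_embeddings()
# handle = embed_layer.register_forward_hook(make_neftune_hook(alpha=5.0))
# ... run the usual instruction fine-tuning loop ...
# handle.remove()   # detach the hook before inference
```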

3 Experiments set-up

3.1 Models

We conduct the majority of our experiments using 7B parameter LLMs. Particularly, we use LLaMA-1, LLaMA-2, and OPT-6.7B (Touvron et al., 2023a;b; Zhang et al., 2022). These similarly shaped transformers mostly differ in the number of tokens seen during training. OPT, LLaMA-1, and LLaMA-2 were trained using 180B, 1T, and 2T tokens respectively. This difference is reflected in model performance on standard benchmarks like MMLU, with LLaMA-2 performing the best and OPT performing the worst. For the 13B and 70B parameter models, we train LLaMA-2. Additionally, we improve RLHF models by fine-tuning the highly refined LLaMA-2-Chat (7B) model.

3.2 Instruction Finetuning datasets

We focus on the following fine-tuning datasets either because of their wide popularity, or because they have yielded state-of-the-art results in the recent past. Note that we use only single-turn datasets because of the memory constraints of our hardware setup.

  • Alpaca (Taori et al., 2023) was constructed using the Self-Instruct method of Wang et al. (2022), and the Text-Davinci-003 model (Ouyang et al., 2022). Self-Instruct uses a small seed set of tasks to construct new instruction tuning tasks and filter out bad ones.
  • Evol-Instruct (Xu et al., 2023) contains 70k single-turn instructions that are considered more complex than Alpaca. This dataset was derived from the Alpaca dataset by using ChatGPT to evolve the initial instructions.
  • Open-Platypus (Lee et al., 2023) is a curated dataset amalgamated from 11 open-source datasets, curated specifically towards improving LLM performance in STEM and logical domains. This set contains 25k questions, where roughly 10% are LLM-generated and the remainder human-written.
  • ShareGPT (Chiang et al., 2023) is a dataset of 70K voluntarily-shared ChatGPT conversations (ShareGPT, 2023). Although ShareGPT is multiturn, we use the dataset version from Vicuna-v1.1 and split the multi-turn conversations closer to a single-turn format.

Additionally, we finetune all models with the Alpaca system prompt, except for ShareGPT, where we use the Vicuna system prompt. The hyperparameters can be found in Appendix A.1. We set our hyperparameters through a coarse sweep on LLaMA-1 (7B) trained on the Alpaca dataset, where we see 6% improvement over the standard Alpaca model. We use these as the defaults on all models.

3.3 Evaluation

Since we train using largely single-turn data, we evaluate the model’s conversational abilities using AlpacaEval. We also evaluate the tasks from the OpenLLM Leaderboard to determine if the NEFTune augmentation causes any loss in performance on standard multiple choice tasks.

AlpacaEval. The AlpacaEval dataset released by Dubois et al. (2023) is used to evaluate the overall quality of generations. AlpacaEval is an automatic model-based evaluation that compares Text-Davinci-003 generations to the model generations over 805 instructions with the Win Rate reported. The Win Rate is the rate at which the model in question is preferred to Text-Davinci-003 as determined by model evaluator (GPT-4). The 805 test prompts are scraped from Vicuna, koala, Anthropic’s hh-rlhf, and other sources, making it a fairly comprehensive and diverse test. Additionally, AlpacaEval has high agreement with humans (Dubois et al., 2023) (validated on 20K annotations). We believe at the 7B and 13B scale this evaluation is still quite reasonable. We use both GPT-4 and ChatGPT as evaluators. We use ChatGPT as a precursor test to determine which models to evaluate on GPT-4. This is due to the cost and API restrictions of GPT-4.

Dataset LLaMA-2 7B +NEFT
Alpaca 29.79 64.69
Evol-Instruct 70.34 79.60
ShareGPT 68.74 76.28
OpenPlatypus 62.00 70.61
Average 57.71 72.80

Table 1: AlpacaEval Win Rate versus Text-Davinci-003 for LLaMA-2 trained on different datasets, using GPT-4 as the evaluator, showing an average improvement of 15% across all datasets.

Figure 2: AlpacaEval Win Rate with and without NEFTune on LLaMA-2, LLaMA-1, and OPT across Alpaca, Evol-Instruct, ShareGPT and OpenPlatypus datasets. Performance improves across different datasets and models with ChatGPT as the evaluator.

Hugging Face OpenLLM Leaderboard. The evaluation datasets used for leaderboard ranking are the verbalized multiclass classification datasets ARC (Clark et al., 2018), HellaSwag (Zellers et al., 2019), MMLU (Hendrycks et al., 2020), and TruthfulQA (Lin et al., 2022). This combination of datasets broadly evaluates the ability of a LLM to respond to factual questions and reasoning challenges, and we evaluate these datasets to confirm that model capabilities are not negatively impacted by NEFTune.

4 Results

NEFTune Improves Text Quality. From Table 1, we can see an increase across all datasets for the 7B scale with an average increase of 15.1%, showing that training with NEFT significantly improves conversational ability and answer quality, as measured via AlpacaEval. Additionally, we can see from Figure 2 that we also see improvements on older models, such as LLaMA-1 and OPT. Interestingly, we see less improvement on ShareGPT than on other datasets according to ChatGPT. However, this is not reflected in the GPT-4 evaluation. From Table 2, we see the Win Rate climbs from 75.03% to 88.81% (+13.78%) after adding NEFTune to the 70B parameter model trained on Evol-Instruct (hyperparameters in Appendix A.1).

NEFTune Can Improve Chat Models. From Table 2, we see that further instruction fine-tuning LLaMA-2 Chat (7B) on Evol-Instruct can boost the performance of LLaMA-2-Chat by 3%. This model was already extensively tuned, using multiple rounds of RLHF. Yet, with NEFTune, we see a sizable, additional performance increase of 10%, although we note that some capabilities of this checkpoint model may be affected like its ability to refrain from outputting toxic behavior.

Figure 3: OpenLLM Leaderboard tasks with and without NEFTune on LLaMA-2 across Alpaca, Evol-Instruct, and OpenPlatypus datasets and LLaMA-1 trained on Evol-Instruct. We observe that performance does not change across datasets and models.

Nevertheless, it is surprising that the conversation quality of such a refined chat model can be so dramatically improved.

Effect on Capabilities. A potential concern is that NEFTune improves conversational ability only at the cost of other classical skills. We evaluate on the OpenLLM Leaderboard tasks, using the LM-Eval Harness (Gao et al., 2021) implementation of MMLU, ARC, HellaSwag, and TruthfulQA. These benchmarks give us a glimpse into model knowledge, reasoning, and truthfulness. Figure 3 shows that scores remain stable and that NEFTune preserves model capabilities.

Model Base Evol-Instruct Evol-Instruct+NEFT
LLaMA-2 (7B) - 70.34 79.60
LLaMA-2-Chat (7B) 71.37* 74.44 81.74
LLaMA-2 (13B) - 72.61 82.04
LLaMA-2 (70B) - 75.03 88.81

Table 2: LLaMA-2-Chat (7B), LLaMA-2 (13B), and LLaMA-2 (70B) can be finetuned further to improve performance.

NEFTune Works with QLORA. We show that NEFTune also improves performance in constrained resource environments by training with Quantized Low Rank Adapters (QLORA) (Dettmers et al., 2023). We use the implementation from Dettmers et al. (2023) and the default training hyperparameters for all model weights, training for only one epoch. For 30B, we double the effective batch size and halve the learning rate, following Dettmers et al. (2023).

Table 3 shows that when training with QLORA, AlpacaEval performance increases across all model sizes and datasets studied. However, performance gains are less stark than those seen in full-scale fine-tuning. This may be because different hyperparameters (i.e., the number of fine-tuning epochs) are needed, or because we are heavily quantizing to 4 bits.

Model Dataset Condition Score
LLaMA2 (7B) Alpaca Base 26.71
LLaMA2 (7B) Alpaca +NEFT (α = 5) 31.43
LLaMA2 (7B) Alpaca +NEFT (α = 10) 28.57
LLaMA2 (7B) Alpaca +NEFT (α = 15) 29.01
LLaMA2 (7B) Evolve70k Base 52.80
LLaMA2 (7B) Evolve70k +NEFT (α = 5) 56.96
LLaMA2 (7B) Evolve70k +NEFT (α = 10) 54.84
LLaMA2 (7B) Evolve70k +NEFT (α = 15) 57.27
LLaMA2 (13B) Alpaca Base 31.99
LLaMA2 (13B) Alpaca +NEFT (α = 5) 38.20
LLaMA2 (13B) Alpaca +NEFT (α = 10) 41.55
LLaMA2 (13B) Alpaca +NEFT (α = 15) 42.17
LLaMA2 (13B) Evolve70k Base 56.96
LLaMA2 (13B) Evolve70k +NEFT (α = 5) 60.99
LLaMA2 (13B) Evolve70k +NEFT (α = 10) 61.30
LLaMA2 (13B) Evolve70k +NEFT (α = 15) 57.39
LLaMA1 (30B) Alpaca Base 41.06
LLaMA1 (30B) Alpaca +NEFT (α = 5) 41.12
LLaMA1 (30B) Alpaca +NEFT (α = 10) 43.11
LLaMA1 (30B) Alpaca +NEFT (α = 15) 41.99

Table 3: AlpacaEval Win Rate (ChatGPT Eval.) reported across different datasets and model sizes. Even training with QLORA, we can see performance increases across the board, although they are milder than regular fine-tuning.

A Qualitative Example.

Here we show a qualitative example from LLaMA-2 trained on Alpaca with and without NEFT. We choose the quantum computing prompt from the Vicuna Eval subset of the AlpacaEval instructions.

We see from this example that Alpaca produces a response that is shorter, giving only a very basic definition of quantum computing, mentioning qubits, superposition, entanglement, and increased complex computation.

In the Alpaca-NEFT response, the model provides a more fluid answer, expounding on these topics with clearer explanations of superposition and quantum entanglement, and including a mention of potential applications.

We feel this example is representative of the kinds of changes induced by NEFT. See the appendix for more examples using different fine-tuning datasets.

  • AlpacaEval (Vicuna Subset)

    Prompt: Can you explain the basics of quantum computing?

    Alpaca-2-7B Response:
    Quantum computing is based on the principles of quantum mechanics, which allow for phenomena such as superposition and entanglement between particles. This means that information can be processed in parallel instead of sequentially like traditional computers. Quantum bits (qubits) are used to store data instead of classical bits, allowing more complex computations to take place with fewer resources than would otherwise be necessary.

    Alpaca-2-7B-NEFT Response:
    Quantum computing is based on the principles of quantum mechanics, which describe how particles such as electrons and photons behave at the subatomic level. Unlike traditional computers, which process information in binary digits (bits) that can be either 0 or 1, quantum bits (qubits) can exist in multiple states between 0 and 1 simultaneously. This phenomenon is known as superposition and allows for much more complex calculations to be performed than would otherwise be possible with classical computers. Quantum algorithms are designed to take advantage of this property by encoding data into qubit states so that it can be processed exponentially faster than conventional methods. Additionally, quantum entanglement–the ability for two particles to remain connected even over great distances–allows for large amounts of data to be sent across networks quickly and securely. The potential applications of quantum computing range from cryptography and artificial intelligence to drug discovery and material science. With continued research and development, quantum technology has the potential to revolutionize many industries.

5 Analysis

We hypothesize that by adding noise to the embeddings at train time, the model overfits less to the specifics of the instruction-tuning dataset, such as formatting details, exact wording, and text length. Instead of collapsing to the exact instruction distribution, the model is more capable of providing answers that incorporate knowledge and behaviors of the pretrained base model.

A very noticeable side-effect of this, that we observe immediately, is that the model is forming more coherent, longer completions. Longer, more verbose, completions are preferred by both human and machine evaluators on most datasets (Dubois et al., 2023), but we find that the increased verbosity is only the most visible side-effect from the reduced overfitting to the instruction distribution; increased verbosity alone cannot explain the measured gains in performance.

Figure 4: Left: training loss on the Alpaca dataset for models with and without NEFT, computed with no added noise. Training with NEFT yields a higher training loss. Right: loss of the same model, but evaluated on the “test” Evol-Instruct dataset. NEFT yields slightly lower loss.

5.1 Overfitting

In this analysis, we focus on LLaMA-2-7B models trained on the Alpaca dataset both with and without NEFTune. We examine the training loss of both models on the Alpaca dataset (both are evaluated without noise) and the “testing” loss on the Evol-Instruct dataset. See Figure 4, which shows that the NEFTune model has significantly higher training loss but slightly lower testing loss compared to the base model trained without NEFTune. This indicates less overfitting and better generalization when NEFTune is used.

To test our overfitting hypothesis further, we also generate responses to training prompts with these models using greedy decoding. We compare the generated responses with the ground truth responses provided in the dataset and report the results in Figure 5. We use ROUGE-L (Lin, 2004) and BLEU (up to n-gram order 4) (Papineni et al., 2002) to measure the similarity between responses. Figure 5 shows that responses generated by the model trained with NEFTune have significantly lower ROUGE-L and BLEU scores. As ROUGE-L is based on the longest common subsequence of words and BLEU is based on common n-grams between responses, the higher scores of the model trained without NEFT indicate that its responses reproduce a significantly larger portion of the same words in the same order from the ground truth response, compared to the outputs of the model trained with NEFTune.
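For readers unfamiliar with the metric, ROUGE-L can be computed from the longest common subsequence (LCS) of the two token sequences. The snippet below is a minimal self-contained sketch for illustration, not the evaluation code used in the paper:

```python
def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 between two whitespace-tokenized strings, via LCS length."""
    c, r = candidate.split(), reference.split()
    # Dynamic-programming table for the length of the longest common subsequence.
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, ct in enumerate(c, 1):
        for j, rt in enumerate(r, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ct == rt else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[len(c)][len(r)]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

# A higher score means the generation copies more of the ground-truth wording,
# which is how the paper quantifies overfitting to the instruction data.
```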

Taken together, these observations imply that standard fine-tuning recipes, while tuned for maximal performance, significantly overfit to the instruction dataset, inducing exact reproduction of some responses. In contrast, NEFTune models overfit less without reduction in performance on the test set, and do not “lock-in” to the exact wording of the instruction data, as seen in the ROUGE-L metric.

5.2 Length versus Token diversity

Due to the strong correlation between increased length and performance on the AlpacaEval task (in our experiments and for submissions to the public leaderboard), we were curious whether the increase in length observed with NEFTune might come at a cost to the diversity of the text. To investigate this, we compute the n-gram repetition rates for LLaMA-2 trained on different fine-tuning datasets with and without NEFT. N-grams reoccur more frequently in longer passages, and so we must control for passage length. We compute repetition and diversity scores on a fixed-length chunk at the beginning of each sample. The fixed-length cutoffs were 50 for models trained on Alpaca, 100 for Evol-Instruct, 150 for ShareGPT, and 150 for OpenPlatypus. We choose the chunk lengths so that at least half of the generations were longer than the cutoff, and sequences of insufficient length were dropped. The diversity scores we compute are a summary measure of 2-, 3-, and 4-gram repetition rates called log-diversity, as described in Kirchenbauer et al. (2023); Li et al. (2022).
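As a concrete reference for the repetition statistic, the n-gram repetition rate of a passage can be computed as the fraction of n-grams that duplicate an n-gram seen earlier in the same passage. The sketch below is illustrative only; the exact log-diversity summary follows the cited works and is not reproduced here:

```python
def ngram_repetition_rate(tokens: list[str], n: int = 2) -> float:
    """Fraction of n-grams in the passage that duplicate an earlier n-gram."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

# Per the paper, this is evaluated on a fixed-length prefix of each generation
# (e.g., the first 50 whitespace tokens for Alpaca-trained models) so that longer
# outputs are not penalized simply for being longer.
```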

Note that for all models we performed generation with a repetition penalty of 1.2, held constant across all experiments.

Figure 5: Left shows the ROUGE-L of training with and without NEFT. Right shows BLEU score.

In Table 4 and Table 6, we see that NEFT models generate longer outputs than their counterparts. However, we also see that the 2-gram repetition rates as well as overall token log-diversity for models trained with and without NEFT are nearly identical, providing evidence that the longer responses do not come at the expense of repetition, and instead provide additional details.

5.3 Length Is (Not) All You Need

To scrutinize the length–leaderboard correlation even further, we tested whether simply prompting a model to generate longer outputs was sufficient to recover the performance gains of models trained with NEFT. See Table 5. First, we try explicitly prompting the model to give longer answers. Interestingly, this boosts AlpacaEval scores by 16%. We can also coerce long completions by blocking the [EOS] token until we hit 250 tokens in length, thus forcing a standard model to produce answers as long as NEFT. This results in marginal improvements over standard fine-tuning.

Finally, we ablate the use of uniform versus Gaussian noise in the NEFT algorithm and find that Gaussian noise induces even longer outputs, but does not come with improved performance. See Table 6. While longer generations do score better, we see that no generation-time strategy came close to the performance of NEFTune models.

Table 4: (Row 1) Avg. Character lengths of AlpacaEval responses from LLaMA-2 models finetuned on different datasets. We also report average output length for each dataset (though we trained with max sequence length of 512). NEFT increases average length. (Row 2) Whitespace-tokenized lengths of generations. (Row 3) 2-Gram repetition rates. (Row 4) Log-Diversity measures.

5.4 Human Study

Since our primary results are based on the AlpacaEval benchmark, which is scored by a large language model, we also run a small scale human study amongst the authors of this work. For a subsample of 140 instructions from AlpacaEval, we present annotators with one response generated by a LLaMA-2 model finetuned on Alpaca data with NEFT and another response from a model trained without NEFT, in random order.

Metric Dataset LLaMA-2 7B +NEFT
Character Lengths Alpaca (α = 5) 375.22 1061.89
Whitespace Lengths Alpaca (α = 5) 60.5 169.36
2-Gram Repetition % Alpaca (α = 5) 1.49 1.72
Log-Diversity Alpaca (α = 5) 15.97 16.41
Character Lengths Evol-Instruct (α = 5) 864.06 1403.59
Whitespace Lengths Evol-Instruct (α = 5) 138.99 225.56
2-Gram Repetition % Evol-Instruct (α = 5) 3.87 3.79
Log-Diversity Evol-Instruct (α = 5) 10.65 10.77
Character Lengths ShareGPT (α = 10) 1011.28 1496.86
Whitespace Lengths ShareGPT (α = 10) 161.04 234.99
2-Gram Repetition % ShareGPT (α = 10) 4.82 4.58
Log-Diversity ShareGPT (α = 10) 8.40 8.60
Character Lengths OpenPlatypus (α = 15) 1100.98 1694.26
Whitespace Lengths OpenPlatypus (α = 15) 170.41 264.12
2-Gram Repetition % OpenPlatypus (α = 15) 2.73 3.21
Log-Diversity OpenPlatypus (α = 15) 9.96 9.64
Training Data Character Lengths: Alpaca 270.31, Evol-Instruct 1356.43, ShareGPT 1276.76, OpenPlatypus 649.39

Table 5: We use the following meta-prompts to get longer responses: “Generate a long response”, “Generate a comprehensive response”, and “Generate a long and comprehensive response.” Longer responses score better, but do not close the gap with NEFT.

Setting GPT-4 Win Rate (%) Avg. Character Length
Alpaca-7B-NEFT 61.99 1058.46
Alpaca-7B (Long + Comp) 48.01 620.74
Alpaca-7B (Long) 44.84 614.21
Alpaca-7B (Comprehensive) 42.14 494.85
Alpaca-7B (Min New Tokens) 38.58 1110.97
Alpaca-7B 32.36 375.22

Table 6: Win Rate (and Avg. Character Length) on AlpacaEval as evaluated by ChatGPT for different levels and types of training noise. While length does increase with noise, it is not always indicative of AlpacaEval Win Rate.

Setting Alpaca Evol-Instruct OpenPlatypus
LLaMA-2-7b 48.26 (375.22) 62.55 (864.06) 57.20 (1100.98)
+Uniform Noise 5 62.55 (1061.89) 67.58 (1403.59) 60.99 (1428.31)
+Uniform Noise 10 61.18 (1009.94) 65.59 (1696.88) 60.62 (1833.85)
+Uniform Noise 15 61.86 (819.61) 66.58 (1650.65) 61.74 (1694.26)
+Gaussian Noise 5 60.93 (1371.32) 65.09 (2065.75) 59.13 (2060.92)

Human annotators preferred NEFT in 88 instances, and 22 instances were a draw. This corresponds to a 74.6% win score for NEFT using the AlpacaEval formula (88/(140 − 22)). Next, we performed a modified run of AlpacaEval where, instead of asking the evaluator (GPT-4) to choose between the outputs of our model or Text-Davinci-003, we presented the same pairs of responses from the standard finetuned model and a NEFT version of the same model. There, we observe a win score of 92.80%.
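Spelled out, this win score discards the drawn comparisons and takes the share of the remaining head-to-head wins:

\[\text{win score} = \frac{88}{140 - 22} = \frac{88}{118} \approx 74.6\%\]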

6 Conclusions and Limitations

The success of NEFTune points to the often ignored importance of algorithms and regularizers for LLM training. Unlike the computer vision community, which has studied regularization and overfitting for years, the LLM community tends to use standardized training loops that are designed for optimizer stability and not generalization. In this environment, LLM researchers have become fixated on datasets and model scaling as the primary path forward. Given the consistent gains of NEFTune, and the tendency to overfit on small instruction datasets, it seems that regularization deserves to be revisited in the LLM setting.

Our study has several limitations. We adopt AlpacaEval as our central measure of instruction-following ability for LLMs, which is subject to the biases of a single judge (GPT-4). Additionally, due to limited compute resources, we were not able to validate the success of NEFTune on larger 70B variants across multiple datasets, and we had to rely on fixed hyper-parameters for most NEFTune runs rather than sweeping. Finally, despite our empirical studies, we do not have a conclusive understanding of why NEFTune works.

7 Ethics Statement

In this work, we proposed an augmentation for instruction fine-tuning. Although we evaluate these models on standard benchmarks, we do not rigorously evaluate the impact of NEFTune on model safety and reliability characteristics like toxicity or refusal to provide potentially harmful responses.

8 Reproducibility Statement

We describe the models (in Section 3.1) and datasets (in Section 3.2) used in our experiments, including all hyperparameters (in Section A.1). The compute infrastructure used was based on commodity-level CPUs and GPUs running open source software (except 70B parameter fine-tuning).
