Contents
1 Introduction
The ability to follow instructions has a major impact on a language model's usefulness. Language models are typically pretrained on web data and then fine-tuned on a comparatively small but carefully curated instruction dataset, so instruction fine-tuning must extract as much value as possible from limited data. This paper proposes a simple technique that adds random noise to the embedding vectors of the training data and shows that it can substantially improve the outcome of instruction fine-tuning. In particular, fine-tuning LLaMA-2-7B with NEFTune raises its AlpacaEval performance from 29.8% to 64.7%.
1.1 Related Work
Early forms of instruction fine-tuning, seen in models such as FLAN and T0, focused on improving generalization across a wide range of NLP tasks. Later models such as InstructGPT were trained with reinforcement learning from human feedback (RLHF), which proved effective for generating long, high-quality text. These models were further improved by fine-tuning on a variety of datasets; in particular, datasets such as AlpaGasus were built by filtering for high-quality data. The first instance of using noise to improve language models was the FreeLB method, which applied adversarial perturbations to boost performance.
2 NEFTune: Noisy Embedding Instruction Fine-Tuning
Instruction models are trained on datasets of instruction-response pairs. NEFTune departs from standard training by adding a random noise vector to the embeddings. The noise is generated by sampling independent and identically distributed (iid) uniform entries, and the whole noise vector is scaled by a factor of $\alpha / \sqrt{Ld}$, where $L$ is the sequence length, $d$ is the embedding dimension, and $\alpha$ is a tunable parameter. This scaling rule is borrowed from the adversarial machine learning literature and yields a noise vector whose expected Euclidean magnitude is
\[\text{Noise Magnitude} \approx \alpha / \sqrt{3}.\]
The paper argues that this rule keeps the model from overfitting to the distribution of the training dataset and encourages it to generate more generalized responses.
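This magnitude follows directly from the uniform sampling and the $\alpha/\sqrt{Ld}$ scaling (a short derivation for the reader; the paper states only the result): each entry of the unscaled noise $u \in \mathbb{R}^{L \times d}$ is Uniform$[-1,1]$ with second moment $1/3$, so
\[\mathbb{E}\big[\|\epsilon\|_2^2\big] = \frac{\alpha^2}{Ld}\,\mathbb{E}\big[\|u\|_2^2\big] = \frac{\alpha^2}{Ld}\cdot\frac{Ld}{3} = \frac{\alpha^2}{3}, \qquad \text{hence} \qquad \|\epsilon\|_2 \approx \frac{\alpha}{\sqrt{3}}.\]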
3 Experimental Setup
3.1 Models
Experiments are conducted mainly with 7B-parameter LLMs: LLaMA-1, LLaMA-2, and OPT-6.7B, whose performance differs according to the number of tokens seen during training. In addition, the LLaMA-2-Chat (7B) model, which was further refined with RLHF, is also fine-tuned.
3.2 Instruction Fine-Tuning Datasets
A variety of fine-tuning datasets are used. Alpaca, Evol-Instruct, Open-Platypus, and ShareGPT were each built to improve LLM performance in particular domains and to cover more complex and diverse questions.
3.3 Evaluation
Conversational ability is evaluated with AlpacaEval, and knowledge and reasoning are evaluated with the benchmarks of the OpenLLM Leaderboard. The paper reports that models trained with NEFTune show consistent performance gains across all datasets.
4 Results
NEFTune improves performance by an average of 15.1% across all datasets, suggesting that it raises both the quality and the accuracy of conversational responses. Moreover, NEFTune improves conversational ability without degrading the model's classical capabilities.
5 Analysis
Using NEFTune keeps the model from overfitting to specific elements of the instruction dataset and leads it to produce longer answers, which results in better generalization and higher performance. These findings are supported by extensive experiments and show that NEFTune is an effective way to improve instruction following.
[Reference 1] Noisy Embedding Tuning
The basic idea of adding noise to embedding vectors is to make the model respond robustly to small perturbations of the data. This is known to help the model learn the principal characteristics of the data and to guard against overfitting.
Noise Generation
Expected Effects of the Noise
The ability of LLMs to follow detailed instructions is vital to their usefulness. Generative language models are typically trained on raw web data, and then subsequently fine-tuned on a comparatively small but carefully curated set of instruction data. Instruction fine-tuning is crucial to taming the power of LLMs, and the usefulness of a model is largely determined by our ability to get the most out of small instruction datasets.
In this paper, we propose to add random noise to the embedding vectors of the training data during the forward pass of fine-tuning. We show that this simple trick can improve the outcome of instruction fine-tuning, often by a large margin, with no additional compute or data overhead. Noisy Embedding Instruction Fine Tuning (NEFTune), while simple, has a strong impact on downstream conversational quality. When a raw LLM like LLaMA-2-7B is finetuned with noisy embeddings, its performance on AlpacaEval improves from 29.8% to 64.7% (Figure 1) – an impressive boost of around 35 percentage points (Touvron et al., 2023b; Dubois et al., 2023). NEFTune leads to this surprising and large jump in performance on conversational tasks, maintaining performance on factual question answering baselines. This technique seems to be a free lunch for LLM fine-tuning.
Figure 1: AlpacaEval Win Rate percentage for LLaMA-2-7B models finetuned on various datasets with and without NEFTune. NEFTune leads to massive performance boosts across all of these datasets, showcasing the increased conversational quality of the generated answers.
The earliest forms of instruction fine-tuning such as FLAN and T0 (Sanh et al., 2021; Wei et al., 2021) focused on cross-task generalization in language models. Encoder-decoder language models were finetuned on a broad range of NLP tasks (about 100) and then evaluated on a set of different tasks. This was later scaled up to include thousands of tasks, seeing further improvement over the original FLAN (Chung et al., 2022; Xu et al., 2022). Although these works showed that LLMs could be easily adapted to solve simple and classical NLP tasks, real-world scenarios require LLMs to provide free-form answers to open-ended queries.
InstructGPT (Ouyang et al., 2022) was the first model to tackle open-ended queries with impressive performance. OpenAI further trained GPT-3 (Brown et al., 2020) using reinforcement learning from human feedback (RLHF) to align the model. This procedure gave rise to highly popular models like ChatGPT (OpenAI, 2022) that captured the imagination of the general public and generated longer coherent text than its InstructGPT predecessor.
This led to the work of Wang et al. (2022) (Self-Instruct), which used InstructGPT (Text-Davinci-003) to produce instruction-output pairs which could be used to finetune the foundation models like LLaMA into instruction following variants like Alpaca (Taori et al., 2023). Through the rise in popularity of distilled models Taori et al. (2023), the community has constructed other datasets distilling in particular ways from other models like ChatGPT, including Xu et al. (2023). In another approach, ShareGPT (Chiang et al., 2023) was constructed by crowd sourcing real user conversations from ChatGPT. Other datasets like Lee et al. (2023) construct a dataset to improve specific aspects of the model like STEM question answering and logical reasoning. AlpaGasus (Chen et al., 2023) filters data by quality (according to GPT-4) to improve performance.
It should be noted that noisy inputs have been used to improve models in various ways. The first instance of noise being used to improve language models was the FreeLB method by Zhu et al. (2019), who observed that adversarial perturbations boosted the performance of MLM models. The noise in this case is not random, but is rather computed by first adding a small Gaussian perturbation to the embeddings and then using a gradient step to find the perturbation that maximally alters model performance. This adversarial augmentation approach also improves model performance on graphs Kong et al. (2022). While our proposed scheme is non-adversarial, we adopt the noise scaling rules from these works. Training on noisy inputs has also been done for other applications, such as to improve image captioning systems (Nukrai et al., 2022), and as a common component of early differential privacy mechanisms (Dwork et al., 2014).
Instruction models are trained on datasets comprising pairs of instructions and responses. Each step of NEFTune begins by sampling an instruction from the dataset, and converting its tokens to embedding vectors. NEFTune then departs from standard training by adding a random noise vector to the embeddings. The noise is generated by sampling iid uniform entries, each in the range $[-1, 1]$, and then scaling the entire noise vector by a factor of $\alpha/\sqrt{Ld}$, where $L$ is the sequence length, $d$ is the embedding dimension, and $\alpha$ is a tunable parameter.
This scaling rule was borrowed from the adversarial ML literature (Zhu et al., 2019; Kong et al., 2022), and results in a random vector with an expected Euclidean magnitude
of approximately $\alpha/\sqrt{3}$. Algorithm 1 describes our method in detail.
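As a concrete illustration, the following is a minimal PyTorch sketch of the noising step under the assumptions above (a HuggingFace-style causal LM that accepts `inputs_embeds`; the function name and interface are ours, not the paper's reference implementation):

```python
import torch

def neftune_training_step(model, input_ids, attention_mask, labels, alpha=5.0):
    """Sketch of one NEFTune-style forward pass: embed the tokens, add scaled
    uniform noise, then compute the usual language-modeling loss."""
    embeddings = model.get_input_embeddings()(input_ids)   # (batch, L, d)
    if model.training:
        L, d = embeddings.shape[1], embeddings.shape[2]
        # iid Uniform(-1, 1) entries, scaled by alpha / sqrt(L * d).
        # (The paper scales per sequence using its actual length; using the
        #  padded length here is a simplification.)
        noise = torch.empty_like(embeddings).uniform_(-1, 1)
        embeddings = embeddings + noise * (alpha / (L * d) ** 0.5)
    outputs = model(inputs_embeds=embeddings,
                    attention_mask=attention_mask,
                    labels=labels)
    return outputs.loss
```

At evaluation time the noise is simply omitted, so inference is identical to a standard fine-tuned model.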
We conduct the majority of our experiments using 7B parameter LLMs. Particularly, we use LLaMA-1, LLaMA-2, and OPT-6.7B (Touvron et al., 2023a;b; Zhang et al., 2022). These similarly shaped transformers mostly differ in tokens seen during training. OPT, LLaMA-1, and LLaMA-2 were trained using 180B, 1T, and 2T tokens respectively. This difference is reflected in model performance on standard benchmarks like MMLU, with LLaMA-2 performing the best and OPT performing the worst. For the 13B and 70B parameter models, we train LLaMA-2. Additionally, we improve RLHF models by fine-tuning the highly refined LLaMA-2-Chat (7B) model.
We focus on the following fine-tuning datasets either because of their wide popularity, or because they have yielded state-of-the-art results in the recent past. Note that we use only single-turn datasets because of the memory constraints of our hardware setup.
Additionally, we finetune all models with the Alpaca system prompt, except for ShareGPT, where we use the Vicuna system prompt. The hyperparameters can be found in Appendix A.1. We set our hyperparameters through a coarse sweep on LLaMA-1 (7B) trained on the Alpaca dataset, where we see 6% improvement over the standard Alpaca model. We use these as the defaults on all models.
Since we train using largely single-turn data, we evaluate the model’s conversational abilities using AlpacaEval. We also evaluate the tasks from the OpenLLM Leaderboard to determine if the NEFTune augmentation causes any loss in performance on standard multiple choice tasks.
AlpacaEval. The AlpacaEval dataset released by Dubois et al. (2023) is used to evaluate the overall quality of generations. AlpacaEval is an automatic model-based evaluation that compares Text-Davinci-003 generations to the model generations over 805 instructions with the Win Rate reported. The Win Rate is the rate at which the model in question is preferred to Text-Davinci-003 as determined by model evaluator (GPT-4). The 805 test prompts are scraped from Vicuna, koala, Anthropic’s hh-rlhf, and other sources, making it a fairly comprehensive and diverse test. Additionally, AlpacaEval has high agreement with humans (Dubois et al., 2023) (validated on 20K annotations). We believe at the 7B and 13B scale this evaluation is still quite reasonable. We use both GPT-4 and ChatGPT as evaluators. We use ChatGPT as a precursor test to determine which models to evaluate on GPT-4. This is due to the cost and API restrictions of GPT-4.
Dataset | LLaMA-2 7B | +NEFT |
---|---|---|
Alpaca | 29.79 | 64.69 |
Evol-Instruct | 70.34 | 79.60 |
ShareGPT | 68.74 | 76.28 |
OpenPlatypus | 62.00 | 70.61 |
Average | 57.71 | 72.80 |
Table 1: AlpacaEval Win Rate versus Text-Davinci-003 for LLaMA-2 trained on different datasets, using GPT-4 as the evaluator, showing an average improvement of 15% across all datasets.
Figure 2: AlpacaEval Win Rate with and without NEFTune on LLaMA-2, LLaMA-1, and OPT across Alpaca, Evol-Instruct, ShareGPT and OpenPlatypus datasets. Performance improves across different datasets and models with ChatGPT as the evaluator.
Hugging Face OpenLLM Leaderboard. The evaluation datasets used for leaderboard ranking are the verbalized multiclass classification datasets ARC (Clark et al., 2018), HellaSwag (Zellers et al., 2019), MMLU (Hendrycks et al., 2020), and TruthfulQA (Lin et al., 2022). This combination of datasets broadly evaluates the ability of a LLM to respond to factual questions and reasoning challenges, and we evaluate these datasets to confirm that model capabilities are not negatively impacted by NEFTune.
NEFTune Improves Text Quality. From Table 1, we can see an increase across all datasets for the 7B scale with an average increase of 15.1%, showing that training with NEFT significantly improves conversational ability and answer quality, as measured via AlpacaEval. Additionally, we can see from Figure 2 that we also see improvements on older models, such as LLaMA-1 and OPT. Interestingly, we see less improvement on ShareGPT than on other datasets according to ChatGPT. However, this is not reflected in the GPT-4 evaluation. From Table 2, we see the Win Rate climbs from 75.03% to 88.81% (+13.78%) after adding NEFTune to the 70B parameter model trained on Evol-Instruct (hyperparameters in Appendix A.1).
NEFTune Can Improve Chat Models. From Table 2, we see that further instruction fine-tuning LLaMA-2 Chat (7B) on Evol-Instruct can boost the performance of LLaMA-2-Chat by 3%. This model was already extensively tuned, using multiple rounds of RLHF. Yet, with NEFTune, we see a sizable, additional performance increase of 10%, although we note that some capabilities of this checkpoint model may be affected like its ability to refrain from outputting toxic behavior.
Figure 3: OpenLLM Leaderboard tasks with and without NEFTune on LLaMA-2 across Alpaca, Evol-Instruct, and OpenPlatypus datasets and LLaMA-1 trained on Evol-Instruct. We observe that performance does not change across datasets and models.
Nevertheless, it is surprising that the conversation quality of such a refined chat model can be so dramatically improved.
Effect on Capabilities. A potential concern is that NEFTune improves conversational ability only at the cost of other classical skills. We evaluate on the OpenLLM Leaderboard tasks, using the LM-Eval Harness (Gao et al., 2021) implementation of MMLU, ARC, HellaSwag, and TruthfulQA. These benchmarks give us a glimpse into model knowledge, reasoning, and truthfulness. Figure 3 shows that scores remain stable and that NEFTune preserves model capabilities.
Model | Base | Evol-Instruct | Evol-Instruct +NEFT |
---|---|---|---|
LLaMA-2 (7B) | - | 70.34 | 79.60 |
LLaMA-2-Chat (7B) | 71.37* | 74.44 | 81.74 |
LLaMA-2 (13B) | - | 72.61 | 82.04 |
LLaMA-2 (70B) | - | 75.03 | 88.81 |
Table 2: LLaMA-2-Chat (7B), LLaMA-2 (13B), and LLaMA-2 (70B) can be finetuned further to improve performance.
NEFTune Works with QLORA. We show that NEFTune also improves performance in resource-constrained environments by training with Quantized Low Rank Adapters (QLORA) (Dettmers et al., 2023). We use the implementation from Dettmers et al. (2023), and the default training hyperparameters for all model weights, training for only one epoch. For 30B, we double the effective batch size and halve the learning rate, following Dettmers et al. (2023).
Table 3 shows that when training with QLORA, AlpacaEval performance increases across all model sizes and datasets studied. However, performance gains are less stark than those seen in full-scale fine-tuning. This may be because different hyperparameters (i.e., the number of fine-tuning epochs) are needed, or because we are heavily quantizing to 4 bits.
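One way to combine the two in practice is to inject the noise through a forward hook on the embedding layer, which leaves the QLoRA training loop untouched. This is our own sketch, not the implementation used in the paper:

```python
import torch

def attach_neftune_noise(model, alpha=5.0):
    """Add NEFTune-style uniform noise to the input embeddings via a PyTorch
    forward hook, so an existing (e.g. QLoRA/PEFT) trainer can be reused
    unchanged."""
    embed_layer = model.get_input_embeddings()

    def noise_hook(module, inputs, output):
        if not module.training:
            return output
        L, d = output.shape[-2], output.shape[-1]
        noise = torch.empty_like(output).uniform_(-1, 1)
        return output + noise * (alpha / (L * d) ** 0.5)

    handle = embed_layer.register_forward_hook(noise_hook)
    return handle  # call handle.remove() to disable the noise for evaluation
```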
Model | Dataset | Condition | Win Rate (%) |
---|---|---|---|
LLaMA2 (7B) | Alpaca | Base | 26.71 |
LLaMA2 (7B) | Alpaca | +NEFT (α = 5) | 31.43 |
LLaMA2 (7B) | Alpaca | +NEFT (α = 10) | 28.57 |
LLaMA2 (7B) | Alpaca | +NEFT (α = 15) | 29.01 |
LLaMA2 (7B) | Evolve70k | Base | 52.80 |
LLaMA2 (7B) | Evolve70k | +NEFT (α = 5) | 56.96 |
LLaMA2 (7B) | Evolve70k | +NEFT (α = 10) | 54.84 |
LLaMA2 (7B) | Evolve70k | +NEFT (α = 15) | 57.27 |
LLaMA2 (13B) | Alpaca | Base | 31.99 |
LLaMA2 (13B) | Alpaca | +NEFT (α = 5) | 38.20 |
LLaMA2 (13B) | Alpaca | +NEFT (α = 10) | 41.55 |
LLaMA2 (13B) | Alpaca | +NEFT (α = 15) | 42.17 |
LLaMA2 (13B) | Evolve70k | Base | 56.96 |
LLaMA2 (13B) | Evolve70k | +NEFT (α = 5) | 60.99 |
LLaMA2 (13B) | Evolve70k | +NEFT (α = 10) | 61.30 |
LLaMA2 (13B) | Evolve70k | +NEFT (α = 15) | 57.39 |
LLaMA1 (30B) | Alpaca | Base | 41.06 |
LLaMA1 (30B) | Alpaca | +NEFT (α = 5) | 41.12 |
LLaMA1 (30B) | Alpaca | +NEFT (α = 10) | 43.11 |
LLaMA1 (30B) | Alpaca | +NEFT (α = 15) | 41.99 |
Table 3: AlpacaEval Win Rate (ChatGPT Eval.) reported across different datasets and model sizes. Even training with QLORA, we can see performance increases across the board, although they are milder than regular fine-tuning.
A Qualitative Example.
Here we show a qualitative example from LLaMA-2 trained on Alpaca with and without NEFT. We choose the quantum computing prompt from the Vicuna Eval subset of the AlpacaEval instructions.
We see from this example that Alpaca produces a response that is shorter, giving only a very basic definition of quantum computing, mentioning qubits, superposition, entanglement, and increased complex computation.
In the Alpaca-NEFT response, the model provides a more fluid answer, expounding on these topics with clearer explanations of superposition and quantum entanglement, and including a mention of potential applications.
We feel this example is representative of the kinds of changes induced by NEFT. See the appendix for more examples using different fine-tuning datasets.
AlpacaEval (Vicuna Subset)
Prompt: Can you explain the basics of quantum computing?
Alpaca-2-7B Response:
Quantum computing is based on the principles of quantum mechanics, which allow for phenomena such as superposition and entanglement between particles. This means that information can be processed in parallel instead of sequentially like traditional computers. Quantum bits (qubits) are used to store data instead of classical bits, allowing more complex computations to take place with fewer resources than would otherwise be necessary.
Alpaca-2-7B-NEFT Response:
Quantum computing is based on the principles of quantum mechanics, which describe how particles such as electrons and photons behave at the subatomic level. Unlike traditional computers, which process information in binary digits (bits) that can be either 0 or 1, quantum bits (qubits) can exist in multiple states between 0 and 1 simultaneously. This phenomenon is known as superposition and allows for much more complex calculations to be performed than would otherwise be possible with classical computers. Quantum algorithms are designed to take advantage of this property by encoding data into qubit states so that it can be processed exponentially faster than conventional methods. Additionally, quantum entanglement–the ability for two particles to remain connected even over great distances–allows for large amounts of data to be sent across networks quickly and securely. The potential applications of quantum computing range from cryptography and artificial intelligence to drug discovery and material science. With continued research and development, quantum technology has the potential to revolutionize many industries.
We hypothesize that by adding noise to the embeddings at train time, the model overfits less to the specifics of the instruction-tuning dataset, such as formatting details, exact wording, and text length. Instead of collapsing to the exact instruction distribution, the model is more capable of providing answers that incorporate knowledge and behaviors of the pretrained base model.
A very noticeable side effect, which we observe immediately, is that the model forms longer, more coherent completions. Longer, more verbose completions are preferred by both human and machine evaluators on most datasets (Dubois et al., 2023), but we find that the increased verbosity is only the most visible side effect of the reduced overfitting to the instruction distribution; increased verbosity alone cannot explain the measured gains in performance.
Figure 4: Left: training loss on the Alpaca dataset for models with and without NEFT, computed with no added noise. Training with NEFT yields a higher training loss. Right: loss of the same model, but evaluated on the “test” Evol-Instruct dataset. NEFT yields slightly lower loss.
In this analysis, we focus on LLaMA-2-7B models trained on the Alpaca dataset both with and without NEFTune. We examine the training loss of both models on the Alpaca dataset (both are evaluated without noise) and the “testing” loss on the Evol-Instruct dataset. See Figure 4, which shows that the NEFTune model has significantly higher training loss but slightly lower testing loss compared to the base model trained without NEFTune. This indicates less overfitting and better generalization when NEFTune is used.
To test our overfitting hypothesis further, we also generate responses to training prompts with these models using greedy decoding. We compare the generated responses with the ground truth responses provided in the dataset and report the results in Figure 5. We use ROUGE-L (Lin, 2004) and BLEU (up to n-gram order 4) (Papineni et al., 2002) to measure the similarity between responses. Figure 5 shows that responses generated by the model trained with NEFTune have significantly lower ROUGE-L and BLEU scores. As ROUGE-L is based on the longest common subsequence of words and BLEU is based on common n-grams between responses, the higher scores of the model trained without NEFT indicate that its responses contain a significantly larger portion of the same words in the same order as the ground truth response, compared to the outputs of the model trained with NEFTune.
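A small sketch of this measurement, assuming the HuggingFace `evaluate` package (the paper does not specify its tooling):

```python
import evaluate

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

def overlap_with_ground_truth(generated, references):
    """generated, references: parallel lists of strings (greedy-decoded
    responses to training prompts vs. the dataset's ground-truth responses)."""
    rouge_l = rouge.compute(predictions=generated, references=references)["rougeL"]
    # BLEU up to 4-grams, one reference per prediction.
    bleu_4 = bleu.compute(predictions=generated,
                          references=[[r] for r in references],
                          max_order=4)["bleu"]
    return {"rougeL": rouge_l, "bleu": bleu_4}
```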
Taken together, these observations imply that standard fine-tuning recipes, while tuned for maximal performance, significantly overfit to the instruction dataset, inducing exact reproduction of some responses. In contrast, NEFTune models overfit less without reduction in performance on the test set, and do not “lock-in” to the exact wording of the instruction data, as seen in the ROUGE-L metric.
Due to the strong correlation between increased length and performance on the AlpacaEval task (in our experiments and for submissions to the public leaderboard), we were curious whether the increase in length observed with NEFTune might come at a cost to the diversity of the text. To investigate this, we compute the n-gram repetition rates for LLaMA-2 trained on different fine-tuning datasets with and without NEFT. N-grams reoccur more frequently in longer passages, and so we must control for passage length. We compute repetition and diversity scores on a fixed-length chunk at the beginning of each sample. The fixed length cutoffs were 50 for models trained on Alpaca, 100 for Evol-Instruct, 150 for ShareGPT, and 150 for OpenPlatypus. We choose the chunk lengths so that at least half of the generations were longer than the cutoff, and sequences of insufficient length were dropped. The diversity scores we compute are a summary measure of 2-, 3-, and 4-gram repetition rates called log-diversity, as described in Kirchenbauer et al. (2023); Li et al. (2022).
Note that for all models we performed generation with a repetition penalty of 1.2, held constant across all experiments.
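To make the measurement concrete, the sketch below computes n-gram repetition rates on a fixed-length token chunk. The aggregation into a single log-diversity number is our assumption about the cited definition, so treat it as illustrative rather than the exact formula from Kirchenbauer et al. (2023) or Li et al. (2022):

```python
import math

def ngram_repetition_rate(tokens, n):
    """Fraction of n-grams in `tokens` that duplicate an earlier n-gram."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

def log_diversity(tokens, chunk_len=50, orders=(2, 3, 4)):
    """Illustrative log-diversity: negative log of the product of the 2-, 3-,
    and 4-gram repetition rates on a fixed-length chunk (assumed aggregation)."""
    chunk = tokens[:chunk_len]
    score = 0.0
    for n in orders:
        rate = ngram_repetition_rate(chunk, n)
        score += -math.log(max(rate, 1e-12))  # guard against zero repetition
    return score
```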
Figure 5: Left shows the ROUGE-L of training with and without NEFT. Right shows BLEU score.
In Table 4 and Table 6, we see that NEFT models generate longer outputs than their counterparts. However, we also see that the 2-gram repetition rates as well as overall token log-diversity for models trained with and without NEFT are nearly identical, providing evidence that the longer responses do not come at the expense of repetition, and instead provide additional details.
To scrutinize the length–leaderboard correlation even further, we tested whether simply promoting a model to generate longer outputs was sufficient to recover the performance gains of models trained with NEFT. See Table 5. First, we try explicitly prompting the model to give longer answers. Interestingly, this boosts AlpacaEval scores by 16%. We can also coerce long completions by blocking the [EOS] token until we hit 250 tokens in length, thus forcing a standard model to produce answers as long as NEFT. This results in marginal improvements over standard fine-tuning.
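For reference, both length-forcing strategies can be approximated with standard HuggingFace generation arguments. This is a hedged sketch: the checkpoint path is a placeholder, and the exact settings used in the paper are not specified here.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/alpaca-finetuned-llama-2-7b"   # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Strategy 1: meta-prompting for longer answers.
prompt = ("Generate a long and comprehensive response. "
          "Can you explain the basics of quantum computing?")
inputs = tok(prompt, return_tensors="pt")

# Strategy 2: block [EOS] until at least 250 new tokens have been produced.
output = model.generate(**inputs,
                        max_new_tokens=512,
                        min_new_tokens=250,      # forces NEFT-like lengths
                        repetition_penalty=1.2)  # matches the setting noted above
print(tok.decode(output[0], skip_special_tokens=True))
```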
Finally, we ablate the use of uniform versus Gaussian noise in the NEFT algorithm and find that Gaussian noise induces even longer outputs, but does not come with improved performance. See Table 6. While longer generations do score better, we see that no generation-time strategy came close to the performance of NEFTune models.
Table 4: (Row 1) Avg. Character lengths of AlpacaEval responses from LLaMA-2 models finetuned on different datasets. We also report average output length for each dataset (though we trained with max sequence length of 512). NEFT increases average length. (Row 2) Whitespace-tokenized lengths of generations. (Row 3) 2-Gram repetition rates. (Row 4) Log-Diversity measures.
Since our primary results are based on the AlpacaEval benchmark, which is scored by a large language model, we also run a small scale human study amongst the authors of this work. For a subsample of 140 instructions from AlpacaEval, we present annotators with one response generated by a LLaMA-2 model finetuned on Alpaca data with NEFT and another response from a model trained without NEFT, in random order.
Metric | Dataset | LLaMA-2 7B | +NEFT |
---|---|---|---|
Character Lengths | Alpaca (α = 5) | 375.22 | 1061.89 |
Whitespace Lengths | Alpaca (α = 5) | 60.5 | 169.36 |
2-Gram Repetition % | Alpaca (α = 5) | 1.49 | 1.72 |
Log-Diversity | Alpaca (α = 5) | 15.97 | 16.41 |
Character Lengths | Evol-Instruct (α = 5) | 864.06 | 1403.59 |
Whitespace Lengths | Evol-Instruct (α = 5) | 138.99 | 225.56 |
2-Gram Repetition % | Evol-Instruct (α = 5) | 3.87 | 3.79 |
Log-Diversity | Evol-Instruct (α = 5) | 10.65 | 10.77 |
Character Lengths | ShareGPT (α = 10) | 1011.28 | 1496.86 |
Whitespace Lengths | ShareGPT (α = 10) | 161.04 | 234.99 |
2-Gram Repetition % | ShareGPT (α = 10) | 4.82 | 4.58 |
Log-Diversity | ShareGPT (α = 10) | 8.40 | 8.60 |
Character Lengths | OpenPlatypus (α = 15) | 1100.98 | 1694.26 |
Whitespace Lengths | OpenPlatypus (α = 15) | 170.41 | 264.12 |
2-Gram Repetition % | OpenPlatypus (α = 15) | 2.73 | 3.21 |
Log-Diversity | OpenPlatypus (α = 15) | 9.96 | 9.64 |
Character Lengths (Training Data) | Alpaca | 270.31 | - |
Character Lengths (Training Data) | Evol-Instruct | 1356.43 | - |
Character Lengths (Training Data) | ShareGPT | 1276.76 | - |
Character Lengths (Training Data) | OpenPlatypus | 649.39 | - |
Table 5: We use the following meta-prompts to get longer responses: “Generate a long response”, “Generate a comprehensive response”, and “Generate a long and comprehensive response.” Longer responses score better, but do not close the gap with NEFT.
Setting | GPT-4 Win Rate (%) | Avg. Character Length |
---|---|---|
Alpaca-7B-NEFT | 61.99 | 1058.46 |
Alpaca-7B (Long + Comp) | 48.01 | 620.74 |
Alpaca-7B (Long) | 44.84 | 614.21 |
Alpaca-7B (Comprehensive) | 42.14 | 494.85 |
Alpaca-7B (Min New Tokens) | 38.58 | 1110.97 |
Alpaca-7B | 32.36 | 375.22 |
Table 6: Win Rate (and Avg. Character Length) on AlpacaEval as evaluated by ChatGPT for different levels and types of training noise. While length does increase with noise, it is not always indicative of AlpacaEval Win Rate.
Setting | Alpaca | Evol-Instruct | OpenPlatypus |
---|---|---|---|
LLaMA-2-7b | 48.26 (375.22) | 62.55 (864.06) | 57.20 (1100.98) |
+Uniform Noise 5 | 62.55 (1061.89) | 67.58 (1403.59) | 60.99 (1428.31) |
+Uniform Noise 10 | 61.18 (1009.94) | 65.59 (1696.88) | 60.62 (1833.85) |
+Uniform Noise 15 | 61.86 (819.61) | 66.58 (1650.65) | 61.74 (1694.26) |
+Gaussian Noise 5 | 60.93 (1371.32) | 65.09 (2065.75) | 59.13 (2060.92) |
Human annotators preferred NEFT in 88 instances, and 22 instances were a draw. This corresponds to a 74.6% win score for NEFT using the AlpacaEval formula (88/(140 − 22)). Next, we performed a modified run of AlpacaEval where, instead of asking the evaluator (GPT-4) to choose between the outputs of our model or Text-Davinci-003, we present the same pairs of responses from the standard finetuned model and a NEFT version of the same model. There, we observe a win score of 92.80%.
The success of NEFTune points to the often ignored importance of algorithms and regularizers for LLM training. Unlike the computer vision community, which has studied regularization and overfitting for years, the LLM community tends to use standardized training loops that are designed for optimizer stability and not generalization. In this environment, LLM researchers have become fixated on datasets and model scaling as the primary path forward. Given the consistent gains of NEFTune, and the tendency to overfit on small instruction datasets, it seems that regularization deserves to be revisited in the LLM setting.
Our study has several limitations. We adopt AlpacaEval as our central measure of instruction-following ability for LLMs, which is subject to the biases of a single judge (GPT-4). Additionally, due to limited compute resources, we were not able to validate the success of NEFTune on larger 70B variants across multiple datasets, and we had to rely on fixed hyperparameters for most NEFTune runs rather than sweeping. Finally, despite our empirical studies, we do not have a conclusive understanding of why NEFTune works.
In this work, we proposed an augmentation for instruction fine-tuning. Although we evaluate these models on standard benchmarks, we do not rigorously evaluate the impact of NEFTune on model safety and reliability characteristics like toxicity or refusal to provide potentially harmful responses.
We describe the models (in Section 3.1) and datasets (in Section 3.2) used in our experiments, including all hyperparameters (in Section A.1). The compute infrastructure used was based on commodity-level CPUs and GPUs running open source software (except for the 70B parameter fine-tuning).