
Unnatural Instructions

  • Related Project: Private
  • Category: Paper Review
  • Date: 2023-08-28

Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor

  • url: https://arxiv.org/abs/2212.09689
  • pdf: https://arxiv.org/pdf/2212.09689
  • abstract: Instruction tuning enables pretrained language models to perform new tasks from inference-time natural language descriptions. These approaches rely on vast amounts of human supervision in the form of crowdsourced datasets or user interactions. In this work, we introduce Unnatural Instructions: a large dataset of creative and diverse instructions, collected with virtually no human labor. We collect 64,000 examples by prompting a language model with three seed examples of instructions and eliciting a fourth. This set is then expanded by prompting the model to rephrase each instruction, creating a total of approximately 240,000 examples of instructions, inputs, and outputs. Experiments show that despite containing a fair amount of noise, training on Unnatural Instructions rivals the effectiveness of training on open-source manually-curated datasets, surpassing the performance of models such as T0++ and Tk-Instruct across various benchmarks. These results demonstrate the potential of model-generated data as a cost-effective alternative to crowdsourcing for dataset expansion and diversification.


TL;DR


  • 'Unnatural Instructions' is a dataset of natural language instructions generated automatically with an NLP model.
  • Starting from only 15 manually written seed examples, the automatic generation process produces roughly 240,670 instruction examples that improve performance on NLP benchmarks.
  • The gains come in particular from diversifying the instruction formats and scaling up the dataset, and the approach is also shown to be highly cost-effective.

1 Introduction

Instruction tuning is a way to improve a language model's ability to generalize to unseen tasks in a zero-shot setting. The basic idea behind this approach is to recast existing NLP datasets into an explicit instruction-input-output format. However, this restricts the data to existing academic benchmarks, so an alternative is to collect a new data distribution from user-generated prompts with manual annotation. This work proposes a new method that uses an AI model to generate instruction data automatically, and names the result 'Unnatural Instructions'.


2 Data Collection

2.1 Core Dataset Generation

The core dataset is generated by prompting a language model in a structured format to produce new examples.

Inputs are generated with stochastic decoding (to encourage diversity) and outputs with deterministic decoding (to improve accuracy); each generated example consists of the following four fields, and a minimal sketch of one such record follows the list.

  1. Instruction: describes the task.
  2. Input argument: instantiates the instruction into a specific task example.
  3. Output space constraints: specify the restrictions on the task's output space.
  4. Textual output: reflects a correct execution of the instruction given the input argument and output space constraints.
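As a concrete illustration, here is a minimal sketch of what a single generated record might look like; the field names and sample values are illustrative only, not the schema of the released dataset.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UnnaturalExample:
    """One core-dataset record; the field names here are illustrative, not the authors' schema."""
    instruction: str            # describes the task
    input: str                  # instantiates the instruction into a concrete example
    constraints: Optional[str]  # output-space restrictions; None for open-ended tasks
    output: str                 # reference execution of the instruction

example = UnnaturalExample(
    instruction="Write whether the following review is positive or negative.",
    input="The service was slow and the food arrived cold.",
    constraints="The output should be one of: 'positive', 'negative'.",
    output="negative",
)
```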

[Template figure placeholder]

2.2 Template Expansion

The structured instructions are reformulated as free-form natural language to further increase diversity. A language model is prompted to propose alternative formulations that can replace the original instruction. Each generated instruction receives two alternative formulations, expanding the dataset to a total of roughly 240,670 examples.


3 Data Analysis

The creativity and diversity of the dataset are analyzed, and the correctness of the generated instructions is evaluated to measure data quality. The analysis shows that most generated instructions are logical and executable, and that the input arguments match the task described in the instruction. Although some of the data is noisy, many examples still provide a useful training signal.

[Dataset classification and analysis figure placeholder]


4 Experimental Setup

The language model T5-LM is fine-tuned on the Unnatural Instructions dataset, and its performance is measured on several benchmarks and compared with other datasets. The experiments confirm that a model fine-tuned only on automatically generated data is competitive with existing models trained on manually annotated data.


5 Results

Fine-tuning on the data obtained through template expansion yields substantial gains on standard benchmarks. Additional experiments that vary the dataset size show that performance keeps improving as more instructions are generated. The automatic generation method is also shown to be a cost-effective alternative that can outperform crowdsourcing.


1 Introduction

Instruction tuning enables pretrained language models to generalize to unseen tasks in a zero-shot setting (Sanh et al., 2021; Wei et al., 2021). One way to collect examples of instructions and their execution is to reformulate existing NLP datasets in an explicit instruction-input-output format via prompt engineering (Mishra et al., 2022; Wang et al., 2022). However, the resulting data is limited to existing academic benchmarks, even though the instruction paradigm can describe any text-based task (Efrat and Levy, 2020). Alternatively, Ouyang et al. (2022) collect user-generated prompts and manually annotate their expected outputs, reflecting a different (and arguably more desirable) distribution of the instruction space, but requiring a live application with existing users and major investments in human annotation. Can we create a large dataset of instructions that is diverse in tasks, content, and phrasing, without human labor?

1 We make our data publicly available: https://github.com/orhonovich/unnatural-instructions

Figure 1: An illustration of our data generation prompt. Black: The prompt provided to the model. Pink: One of the model’s generations for the given prompt. The full prompt is presented in Figure 2.

We introduce Unnatural Instructions, a dataset of natural language instructions and their corresponding inputs and outputs. Inspired by recent work on utilizing language models for data generation (Schick and Schütze, 2021b; Lee et al., 2021; Liu et al., 2022a), we collect data in a fully automatic manner by prompting a pretrained language model with three examples from the Super-Natural Instructions2 dataset (Mishra et al., 2022; Wang et al., 2022) and asking the model to generate a fourth (Figure 1).

2 Also known as Natural Instructions v2.

Examples of task instructions and inputs from the data generation prompt (cf. Figure 1):

  • Instruction: You are given a science question and four answer options. Your task is to find the correct answer. Input: "Which part of a bicycle BEST moves in a circle?"
  • Instruction: Classify whether two given sentences from a conversation are sequential. Input: "Noah: When and where are we meeting? :)"
  • Instruction: In this task, you will be given a profile of someone and your job is to generate a set of interesting questions that can lead to a conversation with the person. Input: "Yvonne has been playing the violin since she was four years old. She loves all kinds of music, but her favorite composer is Bach."
  • Instruction: Convert a negative review into a positive review by making minimal changes. Input: "We stood there in shock, because we…"

We repeat this process with 5 different seeds – i.e. the entire process requires only 15 instruction examples – to automatically produce 64,000 diverse triplets of instructions, inputs, and outputs.3 We further diversify the dataset's format by generating additional natural language paraphrases of each instruction, while preserving the contents of any input arguments and outputs, expanding the dataset to approximately 240,000 examples. Although the dataset contains noise, our analysis reveals that more than 50% of generated examples are indeed correct, and that even incorrect examples typically contain valuable information for instruction tuning. At the same time, we find that Unnatural Instructions contains highly creative tasks – some of which are very different from "classic" NLP tasks – and has a more diverse set of instructions than Super-Natural Instructions. Experiments show that fine-tuning an 11B-parameter T5 model (Raffel et al., 2020) on Unnatural Instructions can outperform both T0++ (Sanh et al., 2021) and Tk-Instruct (Wang et al., 2022) across several benchmarks, including Super-Natural Instructions (Wang et al., 2022), BIG-bench Hard (Suzgun et al., 2022), and LMentry (Efrat et al., 2022). When controlling for all variables besides the data, we find that a model trained on Unnatural Instructions performs competitively with a baseline model trained on Super-Natural Instructions. In particular, we observe an 18-point gain on BIG-bench Hard (original task formulation) and a 16-point gain on LMentry, suggesting that Unnatural Instructions is particularly useful for generalizing to instructions that deviate from the distribution of classic NLP tasks. These improvements become even more pronounced when the cost of generating examples is amortized; in this case, training on Unnatural Instructions substantially outperforms our baseline on all benchmarks. We observe a log-linear relationship between the number of generated examples and downstream task performance, suggesting that performance of models trained on Unnatural Instructions can further be improved simply by increasing its size.

Beyond the immediate implications on instruction tuning, this work demonstrates the viability of automatic dataset expansion using language models as an alternative to crowdsourcing. Unnatural Instructions highlights the ability of language models to produce creative and diverse data, a trait that is difficult to obtain with crowd workers, who lack the intrinsic motivation to create novel examples and typically collapse into predictable heuristics to form annotation artifacts (Gururangan et al., 2018). At the same time, language models are faster and cheaper than human labor, opening up new possibilities for scaling up data annotation.

2 Data Collection

We introduce Unnatural Instructions, a dataset of 240,670 natural language instructions for a wide variety of natural language tasks. Each example contains a natural language instruction as input and its expected execution as output. Table 2 displays examples from the dataset.

Unnatural Instructions is collected in a completely automatic process, requiring a seed of only 15 manually-constructed examples, which can be produced in about one hour of human labor. We first collect a core set of 68,478 examples (§2.1) by prompting a pretrained language model M 4 with a seed of 3 manually-annotated examples to produce a new (fourth) example. This phase uses a structured instruction format and filtering heuristics to ensure data quality. We then expand the core dataset by rephrasing the structured instructions in free-form natural language (§2.2). This expansion is performed automatically by prompting a language model with manually-constructed examples, scaling up the dataset more than 3-fold.

2.1 Core Dataset Generation

The core dataset consists of examples in a structured format, making it easier for the generating model M to predict and for us to filter automatically. We use stochastic decoding to generate example inputs (to promote creativity), followed by deterministic decoding to generate their outputs (for accuracy). Figure 3 illustrates the process.

Format Each example in the core dataset contains four fields:

  • An instruction describing the task. The instruction can be a generic template (e.g. “Write whether the following review is positive or negative”) that can be instantiated by a particular input argument (e.g. the review itself).
  • The input argument that instantiates the instruction, creating a specific example of the task.
  • Output space constraints, which detail the restrictions on the task’s output space. Constraints are mainly relevant for classification tasks; for tasks with no specific output space constraints, this field is “None.”
  • A textual output reflecting a correct execution of the instruction given the input arguments and output space constraints.

3 In practice, we collected 68,478 examples, but only used subsets of 64,000 examples for training.

4 Throughout this section, we use OpenAI's text-davinci-002.

Figure 2: Our data generation prompt. Blue: The meta-prompt, which contains the number of the in-context example, as well as the constant fields of each example: instruction, input, and constraints. Black: The in-context examples. We show here one of our 5 in-context seeds. Pink: One of the model’s generations for the given prompt. The generated example includes an instruction, input, and constraints.

Figure 3: The core Unnatural Instructions generation pipeline. We use a seed of three in-context demonstrations x1, x2, x3 to create a large dataset of NLP tasks with instructions, inputs and outputs. As a first step, we sample instructions, inputs, and constraints from a language model M . In the next step, we use M to deterministically generate the corresponding outputs. Finally, the data can be used for instruction tuning.

The first three fields (instruction, input argument, constraints) are the model’s input, and the output field acts as the reference for training and/or evaluation. The constraints field is meant to guide M during output generation and is discarded after generating the outputs (see next). In §6 we provide data-driven evidence for selecting this particular format.

Input Generation The first step in the data generation pipeline is to generate examples of instruction-input-constraints. We do so by prompting a model with three task demonstrations x1, x2, x3, each presented in the structured instruction-input-constraint format (without outputs). These demonstrations are wrapped by a simple meta-prompt that incentivizes the model to create a fourth example x4, which we collect. This process is illustrated in Figure 2.
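A rough sketch of how such a prompt could be assembled is shown below. The meta-prompt wording here only approximates the enumeration style; the paper's exact prompt appears in Figure 2, and the language-model call itself is left to the caller.

```python
def build_generation_prompt(demonstrations):
    """Assemble an enumeration-style generation prompt from three structured demonstrations.

    `demonstrations` is a list of dicts with 'instruction', 'input', and 'constraints'
    keys. The wording is an approximation of the paper's enumeration meta-prompt
    (the exact prompt is shown in its Figure 2).
    """
    parts = []
    for i, demo in enumerate(demonstrations, start=1):
        parts.append(
            f"Example {i}\n"
            f"Instruction: {demo['instruction']}\n"
            f"Input: {demo['input']}\n"
            f"Constraints: {demo['constraints']}\n"
        )
    # Leave the fourth slot open for the model to complete.
    parts.append("Example 4\nInstruction:")
    return "\n".join(parts)
```

The returned prompt would then be completed with nucleus sampling (p = 0.99) to elicit a new example, as described next.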

We use 5 different seeds of 3 demonstrations each to generate the entire core dataset. In other words, the whole process requires only 15 manually-constructed examples. All demonstrations are taken from the Super-Natural Instructions (Wang et al., 2022) train set. To obtain various examples using the same prompt, decoding is done by nucleus sampling (top p) with p = 0.99 (Holtzman et al., 2020).

Filtering We apply three automatic filters to the generated examples to remove: (1) model generations that do not include the three input fields (instruction, input argument, and constraints), (2) instructions and inputs that are identical to those demonstrated in the prompt, (3) duplicate examples, i.e. two different examples that have the same instruction and input argument.
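Under the assumption that each generation has already been parsed into its three structured fields, the three filters can be expressed as a single predicate, sketched here:

```python
def passes_filters(example, seed_demos, seen):
    """Return True if a parsed generation survives the three filters described above.

    `example` is a dict with 'instruction', 'input', and 'constraints' values
    (None when a field could not be parsed); `seed_demos` are the in-context
    demonstrations; `seen` is a set of (instruction, input) pairs already kept.
    """
    # (1) Drop generations missing any of the three structured fields.
    if not all(example.get(k) for k in ("instruction", "input", "constraints")):
        return False
    # (2) Drop instructions or inputs copied verbatim from the prompt demonstrations.
    if any(example["instruction"] == d["instruction"] or example["input"] == d["input"]
           for d in seed_demos):
        return False
    # (3) Drop duplicates: same instruction and same input argument.
    key = (example["instruction"], example["input"])
    if key in seen:
        return False
    seen.add(key)
    return True
```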

Output Generation Given a generated example x, we generate the corresponding output y by conditioning a pretrained language model with the instruction, input argument, and constraints (if not none), followed by an “Output:” prompt. Here we apply greedy decoding to prioritize correctness over creativity. We ignore examples for which the generated output is an empty string.
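A sketch of this second phase follows, with the language model abstracted as a caller-supplied greedy-decoding function (`lm_greedy` is a hypothetical callable; the paper uses OpenAI's text-davinci-002):

```python
def generate_output(example, lm_greedy):
    """Produce the reference output for a generated example (second pipeline phase).

    `lm_greedy` is any prompt -> completion callable that decodes greedily;
    it stands in for the API call used in the paper.
    """
    prompt = f"Instruction: {example['instruction']}\nInput: {example['input']}\n"
    if example["constraints"] and example["constraints"].strip().lower() != "none":
        prompt += f"Constraints: {example['constraints']}\n"
    prompt += "Output:"
    output = lm_greedy(prompt).strip()
    return output or None  # examples with an empty output are discarded
```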

2.2 Template Expansion

Examples in the Unnatural Instructions core dataset have a strict instruction-input-output format. To increase the format diversity and obtain tasks phrased in free-form natural language (Schick and Schütze, 2021a; Sanh et al., 2021), we collect alternative formulations that preserve the content of the original instructions. Specifically, we prompt a language model to reformulate the tasks in the core dataset and collect two alternative formulations for each generated task.5 The alternative formulations are often shorter and less formal than the original instructions. The rephrasing prompt contains two examples of instructions and their alternative formulation. We do not include inputs, constraints, and outputs in the rephrasing prompt; instead, we utilize the already-generated inputs and outputs to complement the rephrased instruction. Unlike the examples in the core dataset, in some alternative formulations, the input is embedded into the task description rather than following it. We achieve that by adding an “{INPUT}” placeholder, which marks the position for input insertion (Figure 4).

In some cases, the model generates two identical alternative formulations, while in others, it partially copies the original instruction. Some alternative formulations may also have an invalid format, e.g., not containing the "{INPUT}" placeholder. When such failures occur, we continue to sample alternative formulations, stopping after five unsuccessful attempts. For this reason, some instructions have only one alternative formulation, while others have none. Overall, more than 97.5% of the instructions have two valid and distinct alternative formulations. In fact, some instructions end up with more than two paraphrases because we generate two paraphrases per example (i.e. instruction-input-output pair) and the core dataset contains examples that share the exact same instruction but not the same input argument. Therefore, by cross-referencing each instruction's alternative phrasings with all of its input arguments, we can extend the data even further and arrive at a total of 240,670 examples without additional cost.
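The retry-and-validate logic described above might look roughly as follows. The rephrasing prompt is a simplified stand-in for the paper's prompt (Figure 4), and `lm_sample` is a hypothetical stochastic-decoding callable.

```python
def collect_paraphrases(instruction, lm_sample, wanted=2, max_failures=5):
    """Collect up to `wanted` valid, distinct reformulations of an instruction.

    A candidate is kept only if it contains the "{INPUT}" placeholder, differs
    from the original instruction, and is not a duplicate of one already kept.
    The prompt below is a simplified stand-in for the paper's rephrasing prompt.
    """
    prompt = (
        "Rewrite the following task description, marking where the input goes "
        f"with {{INPUT}}:\n{instruction}\n"
    )
    kept, failures = [], 0
    while len(kept) < wanted and failures < max_failures:
        candidate = lm_sample(prompt).strip()
        if "{INPUT}" in candidate and candidate != instruction and candidate not in kept:
            kept.append(candidate)
        else:
            failures += 1
    return kept

def instantiate(paraphrase, input_argument):
    """Insert the input argument at the position marked by the placeholder."""
    return paraphrase.replace("{INPUT}", input_argument)
```

Cross-referencing each kept paraphrase with every input argument that shares the same instruction is what expands the data to 240,670 examples.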

5 The seed reformulations in each prompt are inspired by and partially taken from PromptSource (Bach et al., 2022).

Figure 4: Our template expansion prompt. Black: Few-shot demonstrations of instructions and a possible alternative formulation. Blue: A model-generated instruction for which the model should suggest an alternative formulation. Pink: An example of a model-generated task reformulation.

3 Data Analysis

We first demonstrate the creativity of instructions in Unnatural Instructions and then manually analyze 200 examples, randomly sampled from our core dataset, focusing on correctness and diversity. We also compare the distribution of Unnatural Instructions to that of Super-Natural Instructions, and find that the inputs of Unnatural Instructions tend to be more diverse.

Table 1: Examples of eight interesting generated instructions and their corresponding category. The first four examples are taken from the core dataset, while the last four were generated during the template expansion phase.

Creativity A major challenge when creating a general-purpose instructions dataset is task creativity. Crowd workers may struggle to come up with creative tasks, and typically collapse into predictable heuristics to form annotation artifacts (Gururangan et al., 2018). While the high performance of models trained on Unnatural Instructions across several benchmarks (see §5) suggests that it is indeed diverse and creative, we additionally present in Table 1 some cherry-picked examples of the generated instructions, providing a glimpse at their creativity.

Correctness When evaluating correctness, we test whether (1) the generated instructions are logical and executable, (2) the input arguments correspond to the task described in the instruction, and (3) the outputs are correct, given the instruction and input.

Although our data filtering process is minimal, 113 of the 200 analyzed examples (56.5%) are correct. Of the 87 incorrect examples, 9 (4.5%) had incomprehensible instructions, 35 (17.5%) had an input that did not match the task description, and 43 (21.5%) had incorrect outputs. Table 2 shows some correct and incorrect examples from our analysis. While the amount of noise in the data may raise concerns regarding its usability, we note that many of the examples that were marked as incorrect can still be considered informative. For example, one erroneous example had the instruction “In this task, you will be provided with a list of countries and their corresponding capital cities. You are also given a list of clues to help you solve the puzzle. For each clue, determine which country it is referring to and write down that country’s name in the space next to the clue…” The input argument was “Clue 1: This capital city is on two different continents.” This example was marked as incorrect since the input did not conform with the format described by the instruction – a list of countries and their capitals was not provided, only a clue. However, the output was Istanbul, Turkey, which indeed lies in both Europe and Asia and therefore corresponds with the clue provided as input. In §5 we provide quantitative evidence that, despite being noisy, Unnatural Instructions provides a highly informative training signal.
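The reported breakdown is internally consistent, as a quick check of the percentages shows:

```python
total, correct = 200, 113
errors = {
    "incomprehensible instruction": 9,
    "input does not match the instruction": 35,
    "incorrect output": 43,
}

assert correct + sum(errors.values()) == total
print(f"correct: {correct / total:.1%}")       # 56.5%
for kind, n in errors.items():
    print(f"{kind}: {n / total:.1%}")           # 4.5%, 17.5%, 21.5%
```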

Table 2: Examples of generated instructions, inputs and outputs in our core dataset. For the first two examples, the entire pair of instruction, input and output is valid. The third example has an incorrect output; in the fourth example, the experiment is not described in the input.

4 Experimental Setup

We describe how we use Unnatural Instructions to fine-tune models and elaborate on our evaluation protocol.

4.1 Fine-Tuning on Unnatural Instructions

We fine-tune T5-LM, the language-model-adapted variant of T5-11B (Raffel et al., 2020; Lester et al., 2021). We follow standard practice for fine-tuning, using a batch size of 16 examples over 3 epochs. For training on our core dataset, we use the same template as Wang et al. (2022) for formatting instructions and inputs. Our full set of training hyperparameters is available in Appendix A. We create a small validation set of 1,000 examples for model selection following the methodology proposed by Wang et al. (2022): we randomly select 10 examples from 100 random tasks of the Super-Natural Instructions training set.
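To make the setup concrete, the sketch below shows how a single example could be flattened into a (source, target) pair for seq2seq fine-tuning. The Definition/Input layout is only an assumption standing in for the Wang et al. (2022) template, which is not reproduced here.

```python
def to_seq2seq_pair(example):
    """Format one example into a (source, target) pair for T5-style fine-tuning.

    The Definition/Input layout below is an assumed stand-in for the
    Wang et al. (2022) template used in the paper.
    """
    source = (
        f"Definition: {example['instruction']}\n\n"
        f"Input: {example['input']}\n"
        f"Output:"
    )
    return source, example["output"]

sample = {
    "instruction": "Write whether the following review is positive or negative.",
    "input": "The service was slow and the food arrived cold.",
    "output": "negative",
}
print(to_seq2seq_pair(sample))
```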

4.2 Baselines

We measure the relative utility of Unnatural Instructions by comparing it to a variety of models, all based on T5-11B, which were fine-tuned with different types and quantities of manually-annotated instruction data.

  • T0++ (Sanh et al., 2021) is an instruction-tuned variant of T5-LM, trained on tasks in the PromptSource (Bach et al., 2022) prompt formats.
  • Tk-Instruct Wang et al. (2022) fine-tune T5 v1.1 on Super-Natural Instructions, using a subsample of 757 tasks with 100 examples each. Tk-Instruct is trained with a batch size of 1,024 examples for 1,000 steps. Since our evaluation focuses on zero-shot instruction understanding, we use the definition-only version of Tk-Instruct.
  • FLAN-T5 Chung et al. (2022) fine-tune T5 on a collection of tasks phrased as instructions in multiple prompting setups (zero-shot, few-shot, Chain-of-Thought (Wei et al., 2022)), achieving impressive zero-shot generalization capabilities.
  • T5-LM on Natural Instructions Our main point of comparison is the utility of the original manually-curated instructions in Super-Natural Instructions. We therefore train a model which is identical to ours in all aspects but data. Specifically, we fine-tune the LM-adapted variant of T5-11B on a subsample of 64,000 examples from Super-Natural Instructions training set, excluding examples from any task that participates in the validation set. This model differs from Tk-Instruct along three aspects: the dataset subsample, the base model (T5-LM), and some training hyperparameters (batch size 16 for 3 epochs).

4.3 Evaluation

We evaluate models on four different benchmarks, measuring a range of capabilities. All evaluations are carried out in a zero-shot setting, without few-shot demonstrations, unless explicitly provided in the instructions.

Natural Instructions We evaluate models on the test set of Super-Natural Instructions (Mishra et al., 2022; Wang et al., 2022). As in the original papers, outputs are generated using greedy decoding, and performance is measured using Rouge-L.
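For readers unfamiliar with the metric, below is a minimal token-level Rouge-L F1 based on the longest common subsequence. The official evaluation uses the Super-Natural Instructions scripts, so this is only an illustration of the idea.

```python
def rouge_l_f1(prediction: str, reference: str) -> float:
    """Simplified token-level Rouge-L F1 via longest common subsequence."""
    p, r = prediction.split(), reference.split()
    if not p or not r:
        return 0.0
    # Dynamic-programming table for the LCS length.
    dp = [[0] * (len(r) + 1) for _ in range(len(p) + 1)]
    for i in range(1, len(p) + 1):
        for j in range(1, len(r) + 1):
            if p[i - 1] == r[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    precision, recall = lcs / len(p), lcs / len(r)
    return 0.0 if lcs == 0 else 2 * precision * recall / (precision + recall)

print(rouge_l_f1("the answer is istanbul", "istanbul"))  # 0.4
```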

  • T0: Zero-Shot We evaluate models on the heldout set of T0 (Sanh et al., 2021), using rank classification for decoding and accuracy as a metric. For fair comparison, we remove tasks supersets of which are present in the Tk-Instruct training set. The final set contains six tasks: ANLI R1-R3, CB, COPA and RTE. We refer to this evaluation set as T0: Zero-Shot. Unlike Super-Natural Instructions, T0: Zero-Shot tasks do not have a strict format and are phrased in a rather free-form manner, including inputs that can be embedded into the task description. We therefore expect models trained on our core dataset (without instruction paraphrases) to perform poorly under these conditions, while adding the task reformulation data should boost performance on T0: Zero-Shot.
  • BIG-bench: Hard The “hard” subset of BIG-bench (Suzgun et al., 2022) contains 23 challenging tasks from BIG-Bench (Srivastava et al., 2022). We investigate two different formats for all tasks: their original format in BIG-bench, and the format of Suzgun et al. (2022), who reformulate each task as question answering with manually added instructions; for the latter, we remove all few-shot demonstrations. For both formats, we use greedy decoding and exact match with the reference for evaluation.
  • LMentry LMentry (Efrat et al., 2022) is a benchmark that tests basic language abilities, designed to complement common approaches for evaluating large language models. Outputs are generated by applying greedy decoding and evaluated using high-accuracy regular expressions. The benchmark's metric is the LMentry score, which combines accuracy with multiple aspects of robustness.

Table 4: Performance of several models on the four benchmarks considered. Best results in our direct comparison setup are bold, best results overall are underlined. NHO indicates that a benchmark’s data is not held out because it was used for training. T5-LM on Unnatural Instructions performs better than several strong baselines and is competitive to our direct comparison baseline, outperforming it in three setups despite being finetuned on automatically generated data only. Template expansion substantially increases performance in most cases but gives worse results on Super-Natural Instructions.

5 Results

Our main results are shown in Table 4, which reports the performance of each model on each benchmark considered. Remarkably, T5-LM finetuned on Unnatural Instructions clearly outperforms several strong instruction-tuned baselines such as T0++ and Tk-Instruct; the only exception to this is BIG-bench: Hard (Orig), where T0++ performs better. Retraining a model on Super-Natural Instructions using our exact setup reveals that a much stronger performance than that of Tk-Instruct can be achieved using this dataset. However, even in this direct comparison setup, Unnatural Instructions leads to stronger or equal performance for every dataset except Super-Natural Instructions itself. While T5-LM finetuned on Unnatural Instructions is outperformed by FLAN-T5, the amount of training data for this model is larger by several orders of magnitude. These results demonstrate that fully automated data generation with pretrained LMs is indeed a viable and cost-effective alternative to human-curated data.

5.1 Performance with Template Expansion

We evaluate the contribution of template expansion (§2.2) to the performance of models trained on Unnatural Instructions. To this end, we finetune a single model on our full dataset with paraphrases; results are shown in the bottom row of Table 4.

Adding instruction paraphrases boosts performance on T0: Zero-Shot (+3.3), Big-bench: Hard in its original format (+12.1) and LMentry (+8.7). We surmise that this improvement is largely because examples in our core dataset were generated based on demonstrations from Super-Natural Instructions only and therefore have their exact format and style. Accordingly, models trained on our core dataset rely too much on this specific format and cannot generalize well to different formats found in other benchmarks. Obtaining more format diversity through template expansion successfully addresses this issue. On the other hand, over-reliance on the format of Super-Natural Instructions is probably preferable when testing on this dataset itself, which explains the performance drop when adding paraphrases compared to the boost in performance on other benchmarks.

While some of the performance gains observed may also be attributed to the fact that adding paraphrases simply increases the data, in §5.2 we show that template expansion is helpful even when controlling for dataset size.

5.2 Performance Scaling by Dataset Size

As all of our data is generated from the same model using the same set of prompts, scaling up the amount of generated examples might lead to numerous repetitions and, as a consequence, diminishing returns in terms of downstream task performance. To investigate whether this is an issue, we analyze how the amount of training examples affects the performance of our finetuned models. To this end, we train models on subsets of both Super-Natural Instructions and Unnatural Instructions, ranging from 250 to 64,000 examples. As shown in Figure 6 (top row), our core and full data as well as Super-Natural Instructions all exhibit log-linear scaling laws, suggesting that even for subsets of Unnatural Instructions containing thousands of examples, simply generating more examples still adds a valuable signal to our training data.
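The reported log-linear relationship amounts to fitting score ≈ a + b·log(n). A small sketch of that fit with numpy, using hypothetical (n, score) points (the actual measurements are the ones plotted in Figure 6):

```python
import numpy as np

# Hypothetical (num_examples, benchmark_score) points, used only to illustrate the fit.
n = np.array([250, 1000, 4000, 16000, 64000], dtype=float)
score = np.array([30.0, 35.0, 40.0, 45.0, 50.0])

# Fit score ≈ a + b * ln(n): a straight line in log-space.
b, a = np.polyfit(np.log(n), score, deg=1)
print(f"score ≈ {a:.1f} + {b:.2f} * ln(n)")
```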

Figure 6: Scaling experiments comparing Unnatural Instructions with Super-Natural Instructions. Top row: Model performance when controlling for dataset size, tested on Super-Natural Instructions (left) and LMentry (right). Bottom row: Model performance when controlling for the cost of obtaining data, tested on Super-Natural Instructions (left) and LMentry (right).

Results for LMentry (Figure 6, top right) show that our template expansion process is still beneficial when controlling for dataset size. The added value of the paraphrases is therefore likely to be in terms of format diversity rather than solely as a method for increasing the amount of data.

5.3 Performance Scaling by Cost

In practical scenarios with fixed annotation budgets, the actual cost associated with a certain level of performance is even more relevant than the number of required examples. We therefore measure model performance as a function of the cost for obtaining the training data. Based on OpenAI's pricing as of December 2022, the cost for generating an example is estimated at $0.02 for our core dataset, and $0.01 for the expanded dataset. Kiela et al. (2021) estimate human annotation cost at $0.50–$1.00 per example, excluding indirect costs such as task design and UX development; for comparison with our automatic data collection method, we assume the lower-bound human annotation cost of $0.50. As shown in Figure 6 (bottom row), Unnatural Instructions is clearly more cost-efficient than manually curated data. This is true even for the Super-Natural Instructions test set, where a model trained on Unnatural Instructions is weaker than a model trained on Super-Natural Instructions for a fixed number of examples, but better when controlling for cost, showing that our automatic data generation approach outperforms crowdsourcing for a fixed annotation budget.
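Under the per-example costs quoted above, the budget comparison reduces to simple arithmetic; the $1,000 budget below is hypothetical.

```python
budget = 1000.0  # hypothetical annotation budget in dollars

cost_per_example = {
    "Unnatural Instructions (core)": 0.02,
    "Unnatural Instructions (expanded)": 0.01,
    "Human annotation (lower bound)": 0.50,
}

for source, cost in cost_per_example.items():
    print(f"{source}: {budget / cost:,.0f} examples for ${budget:,.0f}")
# 50,000 and 100,000 generated examples vs. 2,000 human-annotated ones for the same budget.
```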

6 Data Collection Ablations

We explore the effect of the different components of our data collection pipeline by conducting structural prompt ablations. Throughout this section, we train models for 1,500 steps using 2,000 examples and evaluate performance on the Super-Natural Instructions validation set, averaged across three different random seeds.

Table 5: Performance of 11B T5-LM models trained on 2,000 examples, generated with different models, on the Super-Natural Instructions validation set.

6.1 Generative Model

As a data generation model, we used text-davinci-002, an instruction-tuned variant of GPT-3 (Brown et al., 2020). However, our approach is not limited to this specific model. We experiment with generating examples using the original (untuned) GPT-3 model by using it as the model M in both the input generation and output generation phases (see §2). Table 5 shows how replacing an instruction-tuned model with a vanilla model affects the quality of the data, using performance on the Super-Natural Instructions validation set as a proxy. We observe that while the quality of generated inputs does drop by 4.5 points, it is well within the range of other prompt ablations (see the remainder of this section). In other words, informative and diverse instructions can be generated by untuned language models. However, generating outputs does seem to require some level of instruction tuning. A manual analysis reveals that outputs generated by GPT-3 mainly suffer from the model's inability to stop (i.e. predict EOS), often starting with the correct answer, but then degenerating into repetitions or tangents. While this may be remedied through various postprocessing heuristics, we leave exploration of such methods to future work.

6.2 Meta-Prompts

Language models are known to be sensitive to the meta-prompt – i.e., the text wrapping the in-context demonstrations, which can include a task description or additional guidance regarding the desired output. We therefore experiment with three different meta-prompt styles: minimal, enumeration, and verbose (Figure 7). Table 6 presents the results obtained from fine-tuning on datasets generated with different meta-prompts. We observe that the simple enumeration approach elicits more informative examples than either the minimalistic or verbose approaches. Perhaps surprisingly, the verbose meta-prompt performs worse than the minimalistic one, possibly because the last line (the command) interrupts the pattern, and does not align well with patterns in the pretraining corpus.7

Figure 7: The meta-prompts used in our ablations.

Table 6: Performance of 11B T5-LM models trained on 2,000 examples, generated with each meta-prompt, on the Super-Natural Instructions validation set.

6.3 In-Context Examples

Models such as GPT-3 are known to be sensitive to slight variations in prompt content, resulting in performance differences when provided with different demonstrations sampled from the same dataset (Liu et al., 2022b) and when permuting the in-context demonstrations (Kumar and Talukdar, 2021; Lu et al., 2022). To account for the effect of the provided demonstrations on the quality of the generated data, we experiment with each of our five demonstration sets separately.8 Table 7 shows that the data generation pipeline is largely robust to variations in the in-context demonstrations, with one outlier (seed 4). Inspecting the differences between these groups, we find that seed 4 led to less constrained instructions: 1,376 out of 2,000 examples do not have constraints, whereas that number is between 28 and 880 for all other sets. Indeed, in seed 4, only one out of three prompt demonstrations had constraints, while in other sets, at least two demonstrations had constraints.

7 While our core dataset was created using the enumeration meta-prompt, the remaining ablation experiments in this section were run using the verbose meta-prompt.

8 See Appendix C for all demonstration sets.

Table 7: Performance of 11B T5-LM models trained on 2,000 examples, generated with various sets of three in-context demonstrations (seeds), on the Super-Natural Instructions validation set. Mix samples 400 examples from each of the five single-seed datasets.

6.4 Constraints

As mentioned in §2, each instruction-input demonstration is accompanied by an additional constraints field, which details the task’s output space restrictions (e.g., “entailment”, “contradiction” or “neutral” for NLI). We note that, in all demonstrations, the instruction itself lists the output space constraints. We hypothesize that adding the constraints field may emphasize these restrictions, ultimately steering the output generation model to produce outputs in the correct format. We verify our hypothesis by conducting two ablation experiments. First, we keep the constraints field when generating the instructions and inputs, but only use instructions and input arguments for the output generation step (i.e., without concatenating generated constraints). Second, we completely remove the constraints field from the data generation pipeline, leaving the instruction field as the only source of information for output space constraints. Table 8 shows that the constraints field has a positive effect both on the quality of the generated outputs and inputs. Removing constraints from the output generation step reduces performance by 3 points, and removing the field from the instructions-inputs generation phase decreases performance by an additional 2.2 points.

Table 9: Performance of 11B T5-LM models trained on 2,000 examples, generated either using separate input and output steps or a single unified step, on the Super-Natural Instructions validation set.

6.5 Two-Step Process

An alternative to our two-step pipeline is to generate instruction-input-output triplets in one pass. To test this approach, we provide the model with the same prompt used for the instruction-input-constraints generation, only with an additional output field, added after the constraints field. As Table 9 shows, one-step generation obtains a score that is lower by 1.7 than the default two-step process. We suspect that this gap is a result of using stochastic decoding in the unified input-output generation phase, which is critical for obtaining diverse inputs. In contrast, when generating outputs in a separate phase, we can use deterministic decoding algorithms to maximize accuracy.

7 Related Work

Instruction Tuning Efrat and Levy (2020) propose the Instruction Paradigm, where models learn new tasks from natural language instructions alone. Mishra et al. (2022); Wang et al. (2022) construct the first large-scale instruction benchmarks by collecting crowdsourcing instructions used to create NLP datasets and converting them into a uniform format. Sanh et al. (2021); Wei et al. (2021) further extend the usability of instructions by suggesting instruction tuning, where a language model is trained on many natural language instructions in the hope that it will generalize to new, unseen instruction tasks. Chung et al. (2022) advance instruction tuning by scaling the number of tasks, scaling the model size, and adding chain-of-thought (Wei et al., 2022), while Ouyang et al. (2022) propose a reinforcement learning approach for instruction tuning from comparative human judgements.

Automatic Data Generation Obtaining large-scale supervised data can be expensive and time-consuming. To mitigate this, several studies have explored automatic data generation. A common approach is to automatically augment existing datasets (Anaby-Tavor et al., 2020; Andreas, 2020; Yang et al., 2020; Kaushik et al., 2020; Lee et al., 2021, inter alia). Kiela et al. (2021) suggest human-and-model-in-the-loop dataset creation, where a model is trained on initial data, then annotators are asked to seek examples that are misclassified by the model, in an iterative process. In the same manner, Nie et al. (2020) apply a similar process to create training data for the task of NLI (Dagan et al., 2006; Bowman et al., 2015), obtaining state-of-the-art performance on a variety of NLI benchmarks. Liu et al. (2022a) combine human annotators and GPT-3 to create challenging examples for NLI.

While all the above techniques require an existing labeled dataset, other work suggested creating datasets entirely automatically, without the need for labeled data. Schick and Schütze (2021b) propose to leverage pretrained language models to generate entire datasets of labeled text pairs from scratch. Agrawal et al. (2022) use pretrained language models to automatically construct multilingual QA data using only five examples per language. To the best of our knowledge, Unnatural Instructions is the first work to go beyond a particular task and automatically generate a large-scale general-purpose dataset, which emphasizes task diversity.

8 Conclusion

We introduce Unnatural Instructions, an automatically generated dataset of natural language instructions and their corresponding inputs and outputs. To the best of our knowledge, this is the first general-purpose NLP dataset that was automatically generated. Our experiments show that models trained on Unnatural Instructions can outperform models trained on manually annotated datasets across several benchmarks. Unnatural Instructions is not only very cost-effective, we also provide evidence of enhanced diversity in the instructions produced and a high level of creativity in the tasks devised, a trait difficult to obtain with crowd workers. Ablations show that even weaker models without instruction tuning can generate useful instructions, though they may struggle with producing the corresponding outputs. However, coming up with interesting tasks and writing diverse instructions for them is arguably the main challenge of the data collection process, whereas given instructions and inputs, outputs are often far easier to annotate through crowdsourcing. Our findings incentivize utilizing models for general-purpose data generation, which we view as an intriguing direction for future research.
