Our method relies on a set of seed tasks to bootstrap the generation. These seed tasks are important both for encouraging task diversity and for demonstrating correct ways of solving diverse tasks. For example, if coding tasks are included among the prompts, the model is more likely to generate coding-related tasks; it also helps to include code outputs that guide the model in writing code for new tasks. So, the more diverse the seed tasks are, the more diverse and higher-quality the generated tasks will be.
Our seed tasks were written when we initiated this project, targeting diverse and interesting uses of LLMs. The tasks were written by the authors and our labmates at UWNLP, without explicit reference to existing datasets or specific test tasks. We further categorized the tasks into classification and non-classification tasks, based on whether the task has a limited output label space. In total, there are 25 classification tasks and 150 non-classification tasks. We release this data in our GitHub repository.
To provide a sense of how much the model generalizes beyond these seed tasks, we further quantify the overlap between the instructions of these seed tasks and the instructions of our test sets, including both the SUPERNI task instructions (§4.3) and the user-oriented instructions in our human evaluation (§4.4). We compute the ROUGE-L similarity between each seed instruction and its most similar instruction in the test set. The distribution of these ROUGE-L scores is plotted in Figure 8; the average ROUGE-L similarity between the seed instructions and SUPERNI is 0.21, and the average between the seed instructions and the user-oriented instructions is 0.34. We see a decent difference between the seed tasks and both test sets. Exactly one seed instruction occurs verbatim in the user-oriented instruction test set, namely “answer the following question,” and the questions that follow are actually very different.
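A minimal sketch of this overlap computation is given below, assuming the rouge_score package; for each seed instruction we keep only its highest ROUGE-L F1 score against any instruction in the test set, and the variable names are illustrative.

```python
from rouge_score import rouge_scorer

# ROUGE-L scorer over instruction strings.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

def max_rouge_l(seed_instruction, test_instructions):
    # Highest ROUGE-L F1 between one seed instruction and any test instruction.
    return max(
        scorer.score(seed_instruction, test_instruction)["rougeL"].fmeasure
        for test_instruction in test_instructions
    )

# seed_instructions and test_instructions are lists of instruction strings.
overlaps = [max_rouge_l(s, test_instructions) for s in seed_instructions]
avg_overlap = sum(overlaps) / len(overlaps)
```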
We use different sets of hyperparameters when querying the GPT-3 API for different purposes. These hyperparameters were found to work well with the GPT-3 model (the “davinci” engine) and the other instruction-tuned GPT-3 variants. We list them in Table 4. As of December 2022, OpenAI charges $0.02 per 1,000 tokens for completion requests to the “davinci” engine. The generation of our entire dataset cost around $600.
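As a rough illustration (at $0.02 per 1,000 tokens, $600 corresponds to roughly 30 million tokens), a single completion request might look like the following sketch; the hyperparameter values shown are placeholders rather than the actual settings, which are listed in Table 4.

```python
import openai

# Illustrative sketch of a completion request to the "davinci" engine via the
# (legacy) Completions API; hyperparameter values below are placeholders,
# not the actual settings from Table 4.
response = openai.Completion.create(
    engine="davinci",
    prompt=prompt,    # one of the prompting templates (Tables 5-8)
    temperature=0.7,  # placeholder
    top_p=0.5,        # placeholder
    max_tokens=1024,  # placeholder
    stop=["\n\n"],    # placeholder stop sequence
)
generated_text = response["choices"][0]["text"]
```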
GPT3SELF-INST and some of our baselines are finetuned from the GPT-3 model (the “davinci” engine with 175B parameters). We conduct this finetuning via OpenAI’s finetuning API. While the details of how the model is finetuned with this API are not currently available (e.g., which parameters are updated, or what the optimizer is), we tune all our models with the default hyperparameters of this API so that the results are comparable. We only set “prompt_loss_weight” to 0, since we find this works better in our case, and every finetuning experiment is trained for two epochs to avoid overfitting the training tasks. Finetuning is charged based on the number of tokens in the training file. In our case, finetuning GPT3SELF-INST from the GPT-3 model on the entire generated data cost $338.
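A minimal sketch of the corresponding (legacy) finetuning API call is shown below, assuming the training data has already been uploaded as a JSONL file of prompt-completion pairs; the file ID is hypothetical, and all other hyperparameters are left at the API defaults.

```python
import openai

# Sketch of the legacy FineTune API call with the settings described above.
openai.FineTune.create(
    training_file="file-XXXXXXXX",  # hypothetical ID of the uploaded JSONL file
    model="davinci",
    n_epochs=2,                     # two epochs to avoid overfitting the training tasks
    prompt_loss_weight=0,           # do not weight the loss on prompt tokens
)
```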
SELF-INSTRUCT relies on a number of prompting templates in order to elicit the generation from language models. Here we provide our four templates for generating the instruction (Table 5), classifying whether an instruction represents a classification task or not (Table 6), generating non-classification instances with the input-first approach (Table 7), and generating classification instances with the output-first approach (Table 8).
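As an illustration of how such a template is filled in at query time, the sketch below assembles an instruction-generation prompt from a set of sampled task instructions; the exact template wording is given in Table 5, and the header string and function name here are only illustrative.

```python
# Sketch of assembling the instruction-generation prompt from sampled seed
# and model-generated instructions; see Table 5 for the exact template.
def build_instruction_generation_prompt(sampled_instructions):
    lines = ["Come up with a series of tasks:", ""]  # illustrative header
    for i, instruction in enumerate(sampled_instructions, start=1):
        lines.append(f"Task {i}: {instruction}")
    # Leave the next task slot empty so the model completes it.
    lines.append(f"Task {len(sampled_instructions) + 1}:")
    return "\n".join(lines)
```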
Here we provide more details for the human evaluation described in §4.4 for rating the models’ responses to the 252 user-oriented instructions. To ensure faithful and reliable evaluation, we asked two authors of these instructions (and of this paper) to judge model predictions. These two evaluators coordinated the standards for the 4-level rating system before starting annotation, and then each of them rated all the instances independently. They were presented with the instruction, instance input, target output (as a reference), and model responses. Model responses are listed in random order, with all the model information anonymized. Figure 9 provides a screenshot of the annotation interface. The reported performance in this paper is based on the results from one of the evaluators, and the trends from the other evaluator’s results are the same.
To measure how reliable our human evaluation is, we calculate the inter-rater agreement between our two evaluators. We first report Cohen’s 𝜅, which is commonly used to measure inter-rater agreement for categorical items. When calculating this, we treat the 4-level rating (A-D) as a categorical variable, leading to a 𝜅 of 0.58, which is moderate agreement according to common practice. Furthermore, we also calculate the agreement of our evaluators on classifying acceptable responses ((A or B) vs. (C or D)), with a final 𝜅 of 0.75, indicating substantial agreement. We also compute the Spearman correlation coefficient 𝜌 between the ratings of our two evaluators by treating the rating as an ordinal variable (A > B > C > D). The final coefficient is 𝜌 = 0.81, indicating a high correlation between the two evaluators.
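These agreement statistics can be computed with standard library routines, as in the following sketch; the ratings shown are hypothetical placeholders.

```python
from sklearn.metrics import cohen_kappa_score
from scipy.stats import spearmanr

# Hypothetical ratings from the two evaluators over the same instances.
rater1 = ["A", "B", "B", "D", "C"]
rater2 = ["A", "B", "C", "D", "C"]

# Cohen's kappa on the 4-level ratings treated as categorical labels.
kappa_4level = cohen_kappa_score(rater1, rater2)

# Kappa on the binarized split: acceptable (A or B) vs. unacceptable (C or D).
acceptable1 = [r in ("A", "B") for r in rater1]
acceptable2 = [r in ("A", "B") for r in rater2]
kappa_binary = cohen_kappa_score(acceptable1, acceptable2)

# Spearman correlation with ratings mapped to an ordinal scale (A > B > C > D).
order = {"A": 4, "B": 3, "C": 2, "D": 1}
rho, _ = spearmanr([order[r] for r in rater1], [order[r] for r in rater2])
```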
We present a selection of user-oriented tasks, the corresponding GPT3SELF-INST-produced responses, and annotator ratings in Table 9. We see that even for responses rated as level C, the model demonstrates extensive steps in solving the task, even though its final output is incorrect.