[Instruction Tuning Survey: Key Index Notes]
Contents
[Some parts abridged]
1. Introduction
In recent years, the field of large language models (LLMs) has made remarkable progress. Models such as GPT-3, PaLM, and LLaMA have shown strong performance on natural language processing tasks. However, there is a mismatch between these models' training objective and users' goals: the models are trained to minimize contextual word-prediction error over large corpora, whereas users want them to follow instructions helpfully and safely.
To address this problem, instruction tuning (IT) has been proposed. Instruction tuning is an effective technique for improving the capabilities and controllability of LLMs; it involves further training the model on (INSTRUCTION, OUTPUT) pairs. This approach offers several benefits.
However, producing high-quality instructions is not easy, and there is a concern that IT mainly improves performance on tasks that are covered by the IT training dataset. There is also criticism that IT only captures surface-level patterns and styles rather than genuinely understanding and learning the task.
2. Methods
2.1 Instruction Dataset Construction
Each instance in an instruction dataset consists of three elements.
Instruction datasets are generally constructed in one of two ways.
2.2 Instruction Tuning
Based on the collected IT dataset, a pretrained model can be fine-tuned directly: given the instruction and the input, the model is trained to predict each token of the output sequentially.
3. Datasets
This section details the instruction tuning datasets widely used in the community. For each dataset, we describe what it contains, how it was created, and how it is used in the instruction tuning process.
| Dataset | Description | Example Components |
|---|---|---|
| Natural Instructions | English instruction dataset consisting of 193K instances | "instruction", "input", "output" |
| P3 | Dataset built from 170 English NLP datasets and 2,052 English prompts | "inputs", "answer choices", "targets" |
| xP3 | Multilingual instruction dataset covering 16 diverse natural language tasks in 46 languages | "inputs", "targets" |
| Flan 2021 | Dataset built by converting 62 widely used NLP benchmarks into language input-output pairs | "input", "target" |
| Unnatural Instructions | Dataset of roughly 240,000 instances constructed with InstructGPT | "instruction", "input", "constraints", "output" |
| Self-Instruct | Dataset of 52K training instructions and 252 evaluation instructions constructed with InstructGPT | "instruction", "input", "output" |
| Evol-Instruct | Dataset generated with ChatGPT using evolving strategies | "instruction" |
| LIMA | Dataset derived from community Q&A websites, manual writing, and Super-Natural Instructions | "instruction", "input", "output" |
| Super-Natural Instructions | Multilingual dataset covering 1,616 NLP tasks and 5M task instances | "definition", "positive examples", "negative examples" |
| Dolly | Dataset for LLMs designed to support user-like interaction | "Open Q&A", "Closed Q&A", "Information Extraction", etc. |
| OpenAssistant Conversations | Dataset of human-crafted, multilingual assistant-style conversations | "message", "user prompt", "assistant reply" |
| Baize | Dataset built with a self-chat mechanism in which ChatGPT plays both the user and the assistant roles | "instance", "turn" |
Each of these datasets was constructed in a different way for instruction tuning and is used to validate specific NLP tasks or the effectiveness of language-model tuning.
4. Instruction Fine-tuning of LLMs
4.1 InstructGPT
InstructGPT is a model based on GPT-3, fine-tuned to follow human instructions. The fine-tuning process is as follows.
Here $L(\theta)$ is the loss function, $R(t)$ is the reward, $\pi_\theta$ is the policy, $\beta$ is the regularization coefficient, and $KL$ is the Kullback-Leibler divergence.
This formulation states that the PPO objective is to maximize the expected reward under the current policy while minimizing its divergence from the previous policy.
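The note above defines the symbols but omits the formula itself; a commonly cited form of the RLHF objective optimized with PPO, written with these symbols, is the following sketch (the exact formulation in Ouyang et al. (2022) additionally mixes in a pretraining term):

```latex
% Sketch of the RLHF objective used in PPO-based fine-tuning (not the paper's exact form).
% \pi_{\mathrm{ref}} denotes the policy before RL fine-tuning (the supervised model).
\[
L(\theta) \;=\; -\,\mathbb{E}_{(x,\,y)\sim\pi_\theta}\big[\,R(x,y)\,\big]
\;+\; \beta\,\mathrm{KL}\!\big(\pi_\theta(\cdot\mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)
\]
```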
Evaluation Results
InstructGPT performs 10% better on TruthfulQA and 7% better on RealToxicityPrompts, and in human evaluations it shows clear improvements in following instructions, satisfying constraints, and generating appropriate responses.
4.2 BLOOMZ
BLOOMZ starts from the BLOOM base model and is fine-tuned on the xP3 dataset, which covers 46 languages.
The model achieves better results in automatic evaluations on NLP tasks such as coreference resolution, sentence completion, and natural language inference. This cross-lingual, cross-task transfer is a key factor in strengthening the model's ability to generalize.
4.3 Flan-T5
Flan-T5 is based on the T5 model and is fine-tuned on the FLAN dataset, using the JAX-based T5X framework.
The training objective is to minimize the negative log-likelihood, $L(\theta) = -\sum_{i=1}^{N} \log P(y_i \mid x_i, \theta)$, where $L(\theta)$ is the loss function and $P(y_i \mid x_i, \theta)$ is the conditional probability of the correct output $y_i$ given the input $x_i$; training thus maximizes the probability the model assigns to the correct outputs.
4.4 Alpaca
Alpaca is a 7B model fine-tuned from LLaMA on an instruction dataset generated by InstructGPT.
The model is optimized via supervised fine-tuning to generate responses that follow a given instruction, so that it understands and follows instructions more reliably.
This section has provided details of each model's development and evaluation, together with the associated mathematical and technical specifics.
5. Multi-modal Instruction Fine-tuning
5.1 Multi-modal Datasets
5.1.1 Multi-Instruct (Xu et al., 2022)
5.1.2 PMC-VQA (Zhang et al., 2023c)
5.1.3 LAMM (Yin et al., 2023)
This section covers datasets for a variety of multi-modal tasks, their characteristics, and the associated formulations. Each dataset focuses on improving a model's ability to solve complex problems in a specific domain.
5.2 Multi-modal Instruction Fine-tuned Models
5.2.1 InstructPix2Pix (983M)
5.2.2 LLaVA (13B)
6. Domain-specific Instruction Fine-tuning
6.1 Dialogue
6.1.1 InstructDial
6.2 Intent Classification and Slot Tagging
6.2.1 LINGUIST
6.3 Information Extraction
6.3.1 InstructUIE
6.4 Aspect-based Sentiment Analysis
6.4.1 Varia et al.
6.5 Writing Assistance
6.5.1 Writing-Alpaca-7B
7. Efficient Tuning Techniques
Efficient tuning aims to adapt large language models (LLMs) to downstream tasks by optimizing only a small fraction of parameters, via one of several approaches: addition-based, specification-based, and reparameterization-based.
[Method details]
7.1 LoRA (Low-Rank Adaptation, Hu et al., 2021)
7.2 HINT (Ivison et al., 2022)
7.3 QLORA (Dettmers et al., 2023)
7.4 LOMO (LOw-Memory Optimization, Lv et al., 2023)
7.5 Delta-tuning (Ding et al., 2023b)
8. Evaluation, Analysis, and Criticism
8.1 HELM Evaluation
HELM (Liang et al., 2022) is a holistic evaluation of language models (LMs) aimed at improving their transparency. It provides a comprehensive understanding of the capabilities, risks, and limitations of LMs. The evaluation focuses on three main factors.
Evaluation Factors
8.2 Low-resource Instruction Tuning
Gupta et al. (2023) investigate the minimum amount of downstream training data IT models require. Their findings are as follows.
8.3 Small Instruction Datasets
Zhou et al. (2023) propose LIMA, which fine-tunes an LLM on only 1,000 carefully selected training examples.
8.4 Evaluating Instruction Tuning Datasets
Wang et al. (2023c) evaluate a range of IT datasets through both automatic and human evaluations.
8.5 Does IT Learn Only Pattern Copying?
Kung and Peng (2023) question what models actually learn during instruction tuning.
8.6 Imitating Proprietary LLMs
Gudibande et al. (2023) investigate the effectiveness of model imitation.
The field of large language models (LLMs) has witnessed remarkable progress in recent years. LLMs such as GPT-3 (Brown et al., 2020b), PaLM (Chowdhery et al., 2022), and LLaMA (Touvron et al., 2023a) have demonstrated impressive capabilities across a wide range of natural language tasks (Zhao et al., 2021; Wang et al., 2022b, 2023a; Wan et al., 2023; Sun et al., 2023c; Wei et al., 2023; Li et al., 2023a; Gao et al., 2023a; Yao et al., 2023; Yang et al., 2022a; Qian et al., 2022; Lee et al., 2022; Yang et al., 2022b; Gao et al., 2023b; Ning et al., 2023; Liu et al., 2021b; Wiegreffe et al., 2021; Sun et al., 2023b,a; Adlakha et al., 2023; Chen et al., 2023).
Major Issues with LLMs
One of the major issues with LLMs is the mismatch between the training objective and users’ objective: LLMs are typically trained on minimizing the contextual word prediction error on large corpora; while users want the model to “follow their instructions helpfully and safely” (Radford et al., 2019; Brown et al., 2020a; Fedus et al., 2021; Rae et al., 2021; Thoppilan et al., 2022).
Instruction Tuning (IT)
To address this mismatch, instruction tuning (IT) is proposed, serving as an effective technique to enhance the capabilities and controllability of large language models. It involves further training LLMs using (INSTRUCTION, OUTPUT) pairs, where INSTRUCTION denotes the human instruction for the model, and OUTPUT denotes the desired output that follows the INSTRUCTION.
Benefits of IT
Challenges of IT
Research Directions
Improving instruction adherence and handling unanticipated model responses remain open research problems. These challenges highlight the importance of further investigations, analysis, and summarization in this field, to optimize the fine-tuning process and better understand the behavior of instruction fine-tuned LLMs.
In the Literature
In the literature, there has been an increasing research interest in analysis and discussions on LLMs, including pre-training methods (Zhao et al., 2023), reasoning abilities (Huang and Chang, 2022), downstream applications (Yang et al., 2023; Sun et al., 2023b), but rarely on the topic of LLM instruction fine-tuning. This survey attempts to fill this blank, organizing the most up-to-date state of knowledge on this quickly advancing field.
Survey Sections
In this section, we describe the general pipeline employed in instruction tuning.
Each instance in an instruction dataset consists of three elements:
There are generally two methods for constructing instruction datasets:
Data integration from annotated natural language datasets: In this approach, (instruction, output) pairs are collected from existing annotated natural language datasets by using templates to transform text-label pairs to (instruction, output) pairs. Datasets such as Flan (Longpre et al., 2023) and P3 (Sanh et al., 2021) are constructed based on the data integration strategy.
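As a toy illustration of this strategy (the template text, label map, and field names below are invented for illustration and are not drawn from Flan or P3), a labeled classification instance can be mapped to an (instruction, output) pair roughly as follows:

```python
# Minimal sketch: turning a (text, label) pair into an (instruction, output) pair
# via a hand-written template. Template wording and field names are illustrative only.

TEMPLATE = (
    "Classify the sentiment of the following movie review as positive or negative.\n"
    "Review: {text}\n"
    "Answer:"
)
LABEL_MAP = {0: "negative", 1: "positive"}

def to_instruction_pair(example: dict) -> dict:
    """Map a conventional text-classification instance to an instruction-tuning instance."""
    return {
        "instruction": TEMPLATE.format(text=example["text"]),
        "output": LABEL_MAP[example["label"]],
    }

print(to_instruction_pair({"text": "A moving, beautifully shot film.", "label": 1}))
```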
Generating outputs using LLMs: An alternate way to quickly gather the desired outputs to given instructions is to employ LLMs such as GPT-3.5-Turbo or GPT4 instead of manually collecting the outputs. Instructions can come from two sources:
Next, the collected instructions are fed to LLMs to obtain outputs. Datasets such as InstructWild (Xue et al., 2023) and Self-Instruct (Wang et al., 2022c) are generated following this approach.
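A rough sketch of this generation loop is shown below; `query_llm` is a hypothetical helper standing in for whichever chat-completion API is used (e.g., GPT-3.5-Turbo), and the prompt format is illustrative rather than the one used by any particular dataset.

```python
# Sketch of generating outputs for collected instructions with a teacher LLM.
# `query_llm` is a hypothetical wrapper around an API call; replace with a real client.

def query_llm(prompt: str) -> str:
    raise NotImplementedError("wrap your preferred chat-completion API here")

def build_pairs(instructions: list[str]) -> list[dict]:
    pairs = []
    for instruction in instructions:
        output = query_llm(f"Instruction: {instruction}\nResponse:")
        pairs.append({"instruction": instruction, "output": output})
    return pairs
```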
For multi-turn conversational IT datasets, we can have large language models self-play different roles (user and AI assistant) to generate messages in a conversational format (Xu et al., 2023b).
Based on the collected IT dataset, a pretrained model can be directly fine-tuned in a fully-supervised manner, where given the instruction and the input, the model is trained by predicting each token in the output sequentially.
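In practice this amounts to ordinary causal-language-modeling cross-entropy over the concatenated (instruction, input, output) sequence, typically with the prompt tokens masked out of the loss. A minimal PyTorch/Hugging Face sketch; the model choice, prompt format, and masking decision are illustrative assumptions, not the setup of any specific paper:

```python
# Minimal sketch of supervised instruction tuning: next-token prediction on the output,
# with instruction/input tokens masked out of the loss (label -100 is ignored by PyTorch CE).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; any causal LM would do
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Instruction: Translate to French.\nInput: Good morning.\nOutput:"
target = " Bonjour."

prompt_ids = tok(prompt, return_tensors="pt").input_ids
target_ids = tok(target, return_tensors="pt").input_ids

input_ids = torch.cat([prompt_ids, target_ids], dim=1)
labels = input_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100  # compute loss only on the output tokens

loss = model(input_ids=input_ids, labels=labels).loss
loss.backward()  # one step of fully-supervised fine-tuning
```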
In this section, we detail widely-used instruction tuning datasets in the community. Table 1 gives an overview of the datasets.
Natural Instructions (Mishra et al., 2021) is a human-crafted English instruction dataset consisting of 193K instances, coming from 61 distinct NLP tasks. The dataset is comprised of “instructions” and “instances”.
Each instance in the “instructions” is a task description consisting of 7 components: title, definition, things to avoid, emphasis/caution, prompt, positive example, and negative example. Subfigure (a) in Figure 2 gives an example of the “instructions”.
“Instances” consists of (“input”, “output”) pairs, which are the input data and textual result that follows the given instruction correctly. Subfigure (b) in Figure 2 gives an example of the instances.
The data comes from existing NLP datasets of 61 tasks. The authors collected the “instructions” by referring to the dataset annotating instruction file. Next, the authors constructed the “instances” by unifying data instances across all NLP datasets to (“input”, “output”) pairs.
P3 (Public Pool of Prompts) (Sanh et al., 2021) is an instruction fine-tuning dataset constructed by integrating 170 English NLP datasets and 2,052 English prompts. Prompts, sometimes named as task templates, function as mappings of data instances in conventional NLP tasks (e.g., question answering, text classification) to natural language input-output pairs.
Each instance in P3 has three components:
The authors built PromptSource, a tool for creating high-quality prompts collaboratively and an archive for open-sourcing high-quality prompts. The P3 dataset was built by randomly sampling a prompt from multiple prompts in the PromptSource and mapping each instance into an (“inputs”, “answer choices”, “targets”) triplet.
xP3 (Crosslingual Public Pool of Prompts) (Muennighoff et al., 2022) is a multilingual instruction dataset consisting of 16 diverse natural language tasks in 46 languages.
Each instance in the dataset has two components:
The original data in xP3 comes from three sources:
The authors built the xP3 dataset by sampling human-written task templates from PromptSource and then filling templates to transform diverse NLP tasks into a unified formalization. For example, a task template for the natural language inference task is as follows: “If Premise is true, is it also true that Hypothesis?”; “yes”, “maybe”, “no” with respect to the original task labels “entailment (0)”, “neutral (1)” and “contradiction (2)”.
Flan 2021 (Longpre et al., 2023) is an English instruction dataset constructed by transforming 62 widely-used NLP benchmarks (e.g., SST-2, SNLI, AG News, MultiRC) into language input-output pairs.
Each instance in the Flan 2021 dataset has two components:
The authors transformed conventional NLP datasets into input-target pairs by:
Unnatural Instructions (Honovich et al., 2022) is an instruction dataset with approximately 240,000 instances, constructed using InstructGPT (text-davinci-002) (Ouyang et al., 2022). Each instance in the dataset has four components:
The authors first sampled seed instructions from the Super-Natural Instructions dataset (Wang et al., 2022e), which is manually constructed. They prompted InstructGPT to elicit a new (instructions, inputs, constraints) pair with three seed instructions as demonstrations. Then, the dataset was expanded by randomly rephrasing the instruction or the input. The concatenation of instruction, input, and constraint is fed to InstructGPT to obtain the output.
Self-Instruct (Wang et al., 2022c) is an English instruction dataset with 52K training instructions and 252 evaluation instructions, constructed using InstructGPT (Ouyang et al., 2022). Each data instance consists of:
The full dataset is generated based on the following steps:
Step 1: The authors randomly sampled 8 natural language instructions from the 175 seed tasks as examples and prompted InstructGPT to generate more task instructions.
Step 2: The authors determined whether each instruction generated in Step 1 describes a classification task. If yes, they asked InstructGPT to generate all possible options for the output based on the given instruction and randomly selected a particular output category to prompt InstructGPT to generate the corresponding “input” content. For instructions that do not belong to a classification task, there are countless possible “output” options, so the authors used the input-first strategy, where InstructGPT is prompted to generate the “input” based on the given “instruction” first and then to generate the “output” according to the “instruction” and the generated “input”.
Step 3: Based on the results of Step 2, the authors used InstructGPT to generate the “input” and “output” for corresponding instruction tasks using the output-first or input-first strategy.
Step 4: The authors post-processed the generated instruction tasks (e.g., filtering out similar instructions and removing duplicate data for input and output) and got a final number of 52K English instructions.
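The four steps above can be condensed into the following loop; the prompt strings and the `query_llm` / `is_classification` helpers are hypothetical stand-ins for the actual Self-Instruct prompts and filters.

```python
# Condensed sketch of the Self-Instruct generation loop (helpers are hypothetical).
import random

def self_instruct(seed_tasks, query_llm, is_classification, rounds=1000):
    pool = list(seed_tasks)          # 175 human-written seed instructions in the original setup
    generated = []
    for _ in range(rounds):
        # Step 1: few-shot prompt with sampled instructions to elicit a new instruction
        demos = random.sample(pool, k=min(8, len(pool)))
        instruction = query_llm("Write a new task instruction:\n" + "\n".join(demos))

        # Steps 2-3: output-first for classification tasks, input-first otherwise
        if is_classification(instruction):
            output = query_llm(f"List a possible label for: {instruction}")
            inp = query_llm(f"Instruction: {instruction}\nLabel: {output}\nWrite a matching input:")
        else:
            inp = query_llm(f"Instruction: {instruction}\nWrite an input:")
            output = query_llm(f"Instruction: {instruction}\nInput: {inp}\nWrite the output:")

        # Step 4: de-duplication / filtering would go here before adding to the pool
        generated.append({"instruction": instruction, "input": inp, "output": output})
        pool.append(instruction)
    return generated
```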
Evol-Instruct (Xu et al., 2023a) is an English instruction dataset that includes a training set with 52K instructions and an evaluation set with 218 instructions. The dataset was created using evolving strategies prompted by ChatGPT (OpenAI, 2022). These strategies include:
The dataset underwent four iterations of these evolving strategies to arrive at a final count of 250K instruction pairs. In addition, the authors compiled a test set of 218 human-generated instructions from real-world sources like open-source projects and forums.
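The iterative structure can be sketched as follows; since the concrete evolving strategies are abridged here, the `EVOLUTION_PROMPTS` and `query_llm` below are placeholders rather than the prompts actually used for Evol-Instruct.

```python
# Sketch of the iterative Evol-Instruct process (prompts and helpers are placeholders).
import random

EVOLUTION_PROMPTS = [
    "Rewrite the instruction to add one more constraint:",      # illustrative only
    "Rewrite the instruction so it requires deeper reasoning:",  # illustrative only
]

def evolve(instructions, query_llm, iterations=4):
    pool = list(instructions)
    for _ in range(iterations):
        evolved = []
        for instr in pool:
            prompt = random.choice(EVOLUTION_PROMPTS)
            evolved.append(query_llm(f"{prompt}\n{instr}"))
        pool.extend(evolved)   # filtering of failed evolutions omitted
    return pool
```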
LIMA (Zhou et al., 2023) is another English instruction dataset containing 1K training instances and 300 test instances. The training set is sourced from:
The test set comprises 300 instances, with 76.7% written by a different group of authors (Group B) and 23.3% sampled from the Pushshift Reddit Dataset (Baumgartner et al., 2020).
Super-Natural Instructions (Wang et al., 2022f) is a multilingual dataset featuring 1,616 NLP tasks and 5M task instances across 76 task types and 55 languages. The dataset includes:
The data is sourced from existing public NLP datasets, crowdsourced annotations, and synthetic tasks.
Dolly (Conover et al., 2023a) is designed to help large language models (LLMs) interact with users similarly to ChatGPT. The dataset contains 15,000 human-generated data instances and covers seven specific types of tasks, including:
Examples of each task type are detailed in Table 2.
OpenAssistant Conversations (Köpf et al., 2023) is a dataset that features human-crafted, multilingual assistant-style conversations. The dataset includes:
Each conversation is represented as a conversation tree (CT), where nodes signify either a prompt or a reply from the assistant. A path from the root node to any other node in the CT is considered a valid conversation thread.
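A minimal data structure capturing such a conversation tree, together with root-to-node thread extraction, might look as follows (field names are illustrative, not the released schema):

```python
# Sketch of a conversation tree (CT): each node is a prompt or an assistant reply,
# and every root-to-node path is a valid conversation thread. Field names are illustrative.
from dataclasses import dataclass, field

@dataclass
class Node:
    role: str                        # "prompter" or "assistant"
    text: str
    children: list["Node"] = field(default_factory=list)

def threads(node: Node, prefix: tuple = ()) -> list[list[Node]]:
    """Enumerate all root-to-node conversation threads."""
    path = prefix + (node,)
    result = [list(path)]
    for child in node.children:
        result.extend(threads(child, path))
    return result

root = Node("prompter", "How do I boil an egg?",
            [Node("assistant", "Place the egg in boiling water for 7-9 minutes.")])
print(len(threads(root)))  # 2 threads: [root] and [root, reply]
```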
The dataset was built using a five-step pipeline:
Inappropriate and offensive conversation trees were filtered out.
Baize (Xu et al., 2023b) is an English multi-turn chat corpus containing:
The dataset is built using ChatGPT and employs a self-chat mechanism where ChatGPT plays both user and assistant roles. To create the dataset, the authors:
Conversations continue until a natural stopping point is reached.
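A rough sketch of such a self-chat loop is given below; the `query_llm` helper and the stopping heuristic are simplifications, not the exact Baize setup.

```python
# Sketch of self-chat data generation: one LLM alternates between user and assistant roles,
# starting from a seed question. `query_llm` is a hypothetical chat-completion wrapper.

def self_chat(seed_question: str, query_llm, max_turns: int = 8) -> list[dict]:
    dialog = [{"role": "user", "content": seed_question}]
    for _ in range(max_turns):
        role = "assistant" if dialog[-1]["role"] == "user" else "user"
        transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in dialog)
        reply = query_llm(f"Continue this conversation as the {role}:\n{transcript}")
        if not reply.strip():          # crude stand-in for a "natural stopping point"
            break
        dialog.append({"role": role, "content": reply})
    return dialog
```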
This section provides an overview of large language models (LLMs) that have been fine-tuned through specific instruction-based methodologies.
InstructGPT is a model based on GPT-3, fine-tuned on human instructions. The fine-tuning process consists of:
In evaluations, InstructGPT performs 10% better on TruthfulQA and 7% better on RealToxicityPrompts compared to GPT-3. In human evaluations, it shows significant improvements in following instructions, constraints, and generating appropriate responses.
BLOOMZ starts from the BLOOM base model and is fine-tuned on xP3, a dataset covering 46 languages. It performs better in automatic evaluations in coreference resolution, sentence completion, and natural language inference tasks, among others.
Flan-T5 starts from T5 and is fine-tuned on the FLAN dataset. During fine-tuning, it utilizes the JAX-based T5X framework and achieves better or comparable performance to much larger models, including PaLM, in a variety of NLP tasks.
Alpaca is a 7B model fine-tuned from LLaMA on an instruction dataset generated by InstructGPT. It achieves comparable performance to InstructGPT in human evaluations and excels in the self-instruct dataset.
Vicuna (13B) (Chiang et al., 2023) is a language model trained by fine-tuning LLaMA (13B) (Touvron et al., 2023a) on the conversational dataset generated by ChatGPT. The authors gathered user-shared ChatGPT conversations from ShareGPT.com, and got 70K conversation records after filtering out low-quality samples. LLaMA (13B) was fine-tuned on the constructed conversation dataset using a modified loss function tailored to multi-turn conversations. The authors expanded the max context length from 512 to 2048 for better understanding long context across multiple-turn dialog. Training involved gradient checkpointing and flash attention techniques to reduce GPU memory cost. Fine-tuning took 24 hours on an 8 × 80GB A100 device.
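The modified multi-turn loss is not spelled out here; one common way to tailor the supervised objective to multi-turn data is to compute cross-entropy only on assistant turns, roughly as sketched below (the helper and masking convention are assumptions, not Vicuna's released training code).

```python
# Sketch of a multi-turn fine-tuning loss: the conversation is concatenated into one
# sequence, and only tokens belonging to assistant turns receive a training signal.
import torch

def build_labels(turn_ids: list[torch.Tensor], roles: list[str]) -> torch.Tensor:
    """turn_ids: tokenized turns in order; roles: 'user' or 'assistant' per turn."""
    labels = []
    for ids, role in zip(turn_ids, roles):
        if role == "assistant":
            labels.append(ids)                         # learn to produce assistant replies
        else:
            labels.append(torch.full_like(ids, -100))  # ignore user turns in the loss
    return torch.cat(labels)
```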
Evaluation: Vicuna outperforms Alpaca (13B) and LLaMA (13B) in 90% of the test questions and generates equal or better rating responses compared to ChatGPT in 45% of the questions.
GPT-4-LLM (7B) (Peng et al., 2023) is a language model fine-tuned from LLaMA (7B) on the GPT-4 generated instruction dataset. The fine-tuning process involves supervised fine-tuning followed by optimizing using proximal policy optimization (PPO).
Evaluation: GPT-4-LLM outperforms not only the baseline Alpaca (7B), but also larger models including Alpaca (13B) and LLaMA (13B).
Claude is a language model fine-tuned on an instruction dataset with the aim to generate helpful and harmless responses. The fine-tuning process involves two steps: supervised fine-tuning followed by optimizing using proximal policy optimization (PPO).
Evaluation: Claude generates more helpful and harmless responses compared to the backbone model. Claude outperforms GPT-3 by 7% on the RealToxicityPrompts in terms of toxicity.
WizardLM (7B) (Xu et al., 2023a) is a language model fine-tuned on LLaMA (7B) using an instruction dataset called Evol-Instruct generated by ChatGPT. The fine-tuning takes about 70 hours on 3 epochs with 8 V100 GPUs, utilizing the Deepspeed Zero-3 technique.
Evaluation: WizardLM outperforms Alpaca (7B) and Vicuna (7B) significantly and offers comparable or better responses than ChatGPT in 67% of test cases. It gains a performance boost compared to Alpaca by +6.2% and +5.3% on various test sets, and outperforms Vicuna by +5.8% and +1.7% on specific test sets.
ChatGLM2 (6B) (Du et al., 2022) is a language model fine-tuned on GLM (6B). It is trained on a bilingual dataset containing both English and Chinese instructions. To model long context, the maximum context length is increased to 32K.
Evaluation: ChatGLM2 outperforms GLM (6B) and the baseline model on all benchmarks. Specifically, ChatGLM2 outperforms GLM by +3.1 on MMLU, +5.0 on C-Eval, +8.6 on GSM8K, and +2.2 on BBH.
LIMA (65B) (Zhou et al., 2023) is a large language model fine-tuned on LLaMA (65B). It is developed based on the “superficial alignment hypothesis,” which suggests that language models acquire most of their capabilities during pre-training and only need a small set of instruction data for fine-tuning to align with user preferences.
Evaluation: For human evaluations, LIMA outperforms InstructGPT and Alpaca by 17% and 19%, respectively. In automatic evaluations conducted by GPT-4, LIMA outperforms InstructGPT and Alpaca by 20% and 36%, respectively.
WizardLM (7B): Fine-tuned on the Evol-Instruct dataset generated by ChatGPT. Excellent at following complex human-generated instructions.
ChatGLM2 (6B): Fine-tuned on a bilingual dataset containing both English and Chinese instructions. Handles a wide range of benchmarks.
LIMA (65B): Focuses on the superficial alignment hypothesis. Performs well on instruction tasks and generates user-satisfying responses.
OPT-IML (175B): Trained on the Instruction Meta-Learning (IML) dataset, excels at various NLP benchmarks.
Dolly 2.0 (12B): Fine-tuned on an instruction dataset for various NLP tasks such as text classification and information extraction.
Falcon-Instruct (40B): Fine-tuned on English dialogue dataset and employs techniques to reduce memory usage.
Guanaco (7B): A multi-turn dialog model trained on a multilingual dataset.
Minotaur (15B): Supports a maximum context length of 18K tokens and is fine-tuned on open-source instruction datasets.
Nous-Hermes (13B): Fine-tuned on a dataset containing over 300k instructions and performs well on multiple tasks.
TÜLU (6.7B): Fine-tuned on a mixed instruction dataset and performs relatively well compared to other larger models.
YuLan-Chat (13B): A bilingual model with comparable performance to state-of-the-art models.
MOSS (16B): Focused on multi-turn conversations and aligns well with human preferences.
Airoboros (13B): Fine-tuned on the Self-Instruct dataset and outperforms LLaMA on all benchmarks.
UltraLM (13B): Surpasses several previous best models in evaluations, including Vicuna and WizardLM.
The rapid proliferation of these fine-tuned variants, each targeting different benchmarks, languages, and instruction types, reflects how quickly the field is moving toward models that are more effective, efficient, and better aligned with human needs and preferences.
Others
InstructUIE (Wang et al., 2023b)
Varia et al. (2022)
Writing-Alpaca-7B (Zhang et al., 2023d)
CoEdIT (Raheja et al., 2023)
Radiology-GPT (Liu et al., 2023c)
ChatDoctor (Li et al., 2023g)
Goat (Liu and Low, 2023)
WizardCoder (Luo et al., 2023)
Efficient fine-tuning techniques aim to adapt Large Language Models (LLMs) to downstream tasks by optimizing a small fraction of parameters in multiple ways: addition-based, specification-based, and reparameterization-based.
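As an example from the reparameterization-based family, a bare-bones LoRA-style layer in the spirit of Hu et al. (2021), which freezes the pretrained weight and trains only two low-rank matrices, can be sketched as:

```python
# Minimal LoRA-style sketch: the pretrained weight W is frozen and only the low-rank
# update B @ A (rank r << min(d_in, d_out)) is trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)           # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))      # zero-init so training starts at W
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(768, 768)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only A and B train
```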
Methods
HELM (Liang et al., 2022) is a holistic evaluation of Language Models (LMs) aimed at improving transparency. It provides a comprehensive understanding of the capabilities, risks, and limitations of LMs. The evaluation focuses on three main factors:
Factors for Evaluation
Gupta et al. (2023) investigate the minimal downstream training data required for IT models. Findings include:
Zhou et al. (2023) proposed LIMA, fine-tuning LLMs on only 1,000 carefully selected training examples.
Performance
Wang et al. (2023c) evaluate various IT datasets through both automatic and human evaluations.
Findings
Kung and Peng (2023) question what models actually learn during instruction tuning.
Results
Gudibande et al. (2023) investigate the efficacy of model imitation.
Observations