
Survey | Instruction Tuning Survey

MinWoo(Daniel) Park | Tech Blog



  • Related Project: Private
  • Category: Paper Review
  • Date: 2023-08-31

Instruction Tuning for Large Language Models: A Survey

  • url: https://arxiv.org/abs/2308.10792
  • pdf: https://arxiv.org/pdf/2308.10792
  • abstract: This paper surveys research works in the quickly advancing field of instruction tuning (IT), a crucial technique to enhance the capabilities and controllability of large language models (LLMs). Instruction tuning refers to the process of further training LLMs on a dataset consisting of (instruction, output) pairs in a supervised fashion, which bridges the gap between the next-word prediction objective of LLMs and the users’ objective of having LLMs adhere to human instructions. In this work, we make a systematic review of the literature, including the general methodology of IT, the construction of IT datasets, the training of IT models, and applications to different modalities, domains and applications, along with an analysis on aspects that influence the outcome of IT (e.g., generation of instruction outputs, size of the instruction dataset, etc). We also review the potential pitfalls of IT along with criticism against it, along with efforts pointing out current deficiencies of existing strategies and suggest some avenues for fruitful research.

[Instruction Tuning Survey key index markings]


Contents

TL;DR


  • Introduces instruction tuning as a technique for improving the performance of large language models
  • Describes the construction of instruction datasets and the development of instruction tuning methods
  • Reviews a variety of instruction tuning datasets and experimental approaches

[Some sections abridged]

1. Introduction

The field of large language models (LLMs) has made remarkable progress in recent years. Models such as GPT-3, PaLM, and LLaMA have shown improved performance on natural language processing tasks. However, there is a mismatch between these models' training objective and users' objective: the models are trained to minimize the contextual word-prediction error on large corpora, while users want them to follow instructions helpfully and safely.

To address this mismatch, instruction tuning (IT) was proposed. Instruction tuning is an effective technique for improving the capabilities and controllability of LLMs, and involves further training the model on (INSTRUCTION, OUTPUT) pairs. The method offers the following benefits.

  1. Fine-tuning on an instruction dataset bridges the gap between the LLM's next-word prediction objective and the user's instruction-following objective.
  2. IT enables more controllable and predictable model behavior than standard LLMs; instructions can constrain the model's outputs to the desired response characteristics or domain knowledge.
  3. IT is computationally efficient and helps LLMs adapt quickly to a specific domain.

However, crafting high-quality instructions is non-trivial, and there is concern that IT improves only on tasks that are well supported in the IT training dataset. There is also criticism that IT captures only surface-level patterns and styles without comprehending and learning the actual task.


2. Methodology

2.1 Instruction Dataset Construction

Each instance in an instruction dataset consists of three elements:

  • Instruction: a natural language text sequence that specifies the task
  • Optional input: supplementary information that provides context
  • Anticipated output: the expected result given the instruction and the input

There are two main ways to construct an instruction dataset:

  • Data integration: (instruction, output) pairs are collected from existing annotated natural language datasets by converting text-label pairs into (instruction, output) pairs.
  • Generating outputs with LLMs: instead of collecting outputs manually, LLMs such as GPT-3.5-Turbo or GPT-4 are used to quickly gather the desired outputs for given instructions.

2.2 Instruction Tuning

Based on the collected IT dataset, a pretrained model can be fine-tuned directly: given the instruction and the input, the model is trained to predict each token of the output sequentially.
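This sequential training objective, where the loss is computed only on the output tokens while the instruction and input are merely conditioned on, can be sketched with toy numbers (the per-token probabilities below are invented for illustration, not produced by a real model):

```python
import math

def instruction_tuning_loss(token_probs, loss_mask):
    """Average negative log-likelihood over output tokens only.

    token_probs: model probability assigned to each target token
    loss_mask:   1 for output tokens, 0 for instruction/input tokens
    """
    nll = [-math.log(p) for p, m in zip(token_probs, loss_mask) if m]
    return sum(nll) / len(nll)

# Toy sequence: 3 instruction/input tokens (masked out) + 2 output tokens.
probs = [0.9, 0.8, 0.95, 0.5, 0.25]
mask  = [0,   0,   0,    1,   1]
loss = instruction_tuning_loss(probs, mask)  # ≈ 1.0397
```

Masking the instruction tokens is a common convention so the model is not penalized for re-predicting the prompt; some implementations compute the loss over the full sequence instead.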


3. Datasets

This section details instruction tuning datasets widely used in the community. For each dataset, the description covers how it was constructed and how it is used in the instruction tuning process.

  • Natural Instructions: an English instruction dataset of 193K instances; components: “instruction”, “input”, “output”
  • P3: built from 170 English NLP datasets and 2,052 prompts; components: “inputs”, “answer choices”, “targets”
  • xP3: a multilingual instruction dataset covering 16 diverse natural language tasks in 46 languages; components: “inputs”, “targets”
  • Flan 2021: built by converting 62 widely used NLP benchmarks into language input-output pairs; components: “input”, “target”
  • Unnatural Instructions: about 240,000 instances built using InstructGPT; components: “instruction”, “input”, “constraints”, “output”
  • Self-Instruct: 52K training instructions and 252 evaluation instructions built using InstructGPT; components: “instruction”, “input”, “output”
  • Evol-Instruct: built using evolving strategies prompted through ChatGPT; component: “instruction”
  • LIMA: derived from community Q&A websites, manual writing, and Super-Natural Instructions; components: “instruction”, “input”, “output”
  • Super-Natural Instructions: a multilingual dataset of 1,616 NLP tasks and 5M task instances; components: “definition”, “positive examples”, “negative examples”
  • Dolly: a dataset designed to let LLMs interact with users in an assistant-like way; task types include “Open Q&A”, “Closed Q&A”, “information extraction”, etc.
  • OpenAssistant Conversations: human-crafted multilingual assistant-style conversations; components: “messages”, “user prompts”, “assistant replies”
  • Baize: built with a self-chat mechanism in which ChatGPT plays both the user and assistant roles; components: “instances”, “turns”


Each dataset was constructed in its own way for instruction tuning and is used to validate the effect of tuning on particular NLP tasks or language models.


4. Instruction Fine-tuning of LLMs

  1. Studies methods for fine-tuning large language models (LLMs) on instructions
  2. Explains the methods together with their mathematical formulations
  3. Provides performance evaluation results on a variety of benchmarks and datasets

4.1 InstructGPT

InstructGPT is a model based on GPT-3, fine-tuned to follow human instructions. The fine-tuning process is as follows.

  • Supervised Fine-Tuning (SFT): performed on a human-filtered instruction dataset.
  • Reward Model Training: trains a model that predicts human preferences.
  • Optimization: uses Proximal Policy Optimization (PPO).
\[J(\theta) = \mathbb{E}_{\pi_\theta} \left[ R(t) \right] - \beta\, \mathrm{KL}(\pi_\theta \| \pi_{old})\]

$J(\theta)$ is the objective, $R(t)$ the reward, $\pi_\theta$ the current policy, $\beta$ a regularization coefficient, and $\mathrm{KL}$ the Kullback-Leibler divergence.

The formula shows that PPO aims to maximize the expected reward under the current policy while keeping its divergence from the previous policy small.
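This objective can be evaluated directly for a toy policy over a handful of discrete actions; the distributions and rewards below are made up purely to illustrate the reward-versus-KL trade-off:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def rlhf_objective(pi_theta, pi_old, rewards, beta):
    """Expected reward under pi_theta minus a KL penalty toward pi_old."""
    expected_reward = sum(p * r for p, r in zip(pi_theta, rewards))
    return expected_reward - beta * kl(pi_theta, pi_old)

pi_old   = [0.5, 0.5]   # reference policy over two actions
pi_theta = [0.8, 0.2]   # tuned policy, shifted toward action 0
rewards  = [1.0, 0.0]   # action 0 is preferred by the reward model
j = rlhf_objective(pi_theta, pi_old, rewards, beta=0.1)
```

Raising beta makes drifting away from the old policy more costly, which is how the KL term keeps the tuned model close to its SFT initialization.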

Evaluation results

InstructGPT performs 10% better on TruthfulQA and 7% better on RealToxicityPrompts, and human evaluations show clear improvements in following instructions, satisfying constraints, and generating appropriate responses.

[Check] Kullback-Leibler divergence


4.2 BLOOMZ

BLOOMZ starts from the BLOOM base model and is fine-tuned on the xP3 dataset, which covers 46 languages.

The resulting model performs better in automatic evaluations on NLP tasks such as coreference resolution, sentence completion, and natural language inference, indicating that this form of transfer learning strengthens generalization.


4.3 Flan-T5

Flan-T5 is based on the T5 model and fine-tuned on the FLAN dataset, using the JAX-based T5X framework.

\(L(\theta) = -\sum_{i=1}^{N} \log P(y_i \mid x_i, \theta)\), where $L(\theta)$ is the loss function and $P(y_i \mid x_i, \theta)$ the conditional probability of the correct output; minimizing this negative log-likelihood trains the model to maximize the probability of predicting the correct output $y_i$ for a given input $x_i$.


4.4 Alpaca

Alpaca is a 7B model fine-tuned from LLaMA on an instruction dataset generated by InstructGPT.

The model aims to generate appropriate responses to specific instructions and is trained with plain supervised fine-tuning on the generated (instruction, output) pairs.

This section has covered the development and evaluation of each model along with the accompanying mathematical and technical details.


5. Multi-modal Instruction Fine-tuning

5.1 Multi-modal Datasets

5.1.1 Multi-Instruct (Xu et al., 2022)

  • Features
    • Covers 62 diverse multi-modal tasks.
    • Spans 10 broad categories.
    • Derived from 21 existing open-source datasets.
    • Each task includes 5 expert-written instructions.
  • Method
    • Strengthens transfer learning: training across diverse task categories helps the model improve its generalization ability.
    • \[L(\theta) = -\sum_{i=1}^{N} \log P(y_i \mid x_i, \theta)\]
    • \(L\) is the loss function, \(N\) the number of data points, \(y_i\) the target label, \(x_i\) the input data, and \(\theta\) the model parameters.

5.1.2 PMC-VQA (Zhang et al., 2023c)

  • Features
    • A large-scale medical visual question answering dataset.
    • 227,000 image-question pairs derived from 149,000 images.
    • Usable for both open-ended and multiple-choice tasks.
  • Performance and evaluation
    • Outperforms existing models on a range of benchmarks.
    • Models are trained to support accurate diagnosis through detailed analysis of medical images.
    • \[\text{Accuracy} = \frac{\text{number of correct answers}}{\text{total number of questions}}\]

5.1.3 LAMM (Yin et al., 2023)

  • Features
    • A comprehensive multi-modal instruction tuning dataset.
    • Focuses on 2D image and 3D point cloud understanding.
    • Includes data pairs for commonsense knowledge question answering.
  • Method
    • Aims to improve the model's visual understanding by combining 3D structure awareness with 2D image analysis.
    • Extracting information from composite inputs requires learning algorithms that combine spatial and visual context.
    • \[I(X;Y) = H(X) - H(X \mid Y)\]
    • \(I(X;Y)\) is the mutual information, \(H(X)\) the entropy, and \(X\) and \(Y\) the input data and model output, respectively.
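The identity \(I(X;Y) = H(X) - H(X \mid Y)\) can be checked numerically from a small joint distribution (the joint table below is a toy example, not data from the paper):

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def mutual_information(joint):
    """I(X;Y) = H(X) - H(X|Y), computed from a joint table p(x, y)."""
    px = [sum(row) for row in joint]          # marginal of X
    py = [sum(col) for col in zip(*joint)]    # marginal of Y
    h_x_given_y = 0.0                         # H(X|Y) = sum_y p(y) H(X|Y=y)
    for j, py_j in enumerate(py):
        cond = [joint[i][j] / py_j for i in range(len(joint))]
        h_x_given_y += py_j * entropy(cond)
    return entropy(px) - h_x_given_y

# Perfectly correlated X and Y carry exactly one bit of mutual information.
joint = [[0.5, 0.0],
         [0.0, 0.5]]
mi = mutual_information(joint)  # = 1.0
```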

This section presented datasets for a range of multi-modal tasks along with their characteristics and mathematical formulations. Each dataset focuses on improving a model's ability to solve complex, domain-specific problems.


5.2 Multi-modal Instruction Fine-tuned Models

5.2.1 InstructPix2Pix (983M)

  • Model: a conditional diffusion model
  • Includes more than 450,000 text editing instructions with corresponding images
  • The model uses a stochastic process to generate images from text instructions, maximizing the conditional probability \(P(y \mid x)\) of producing image \(y\) given condition \(x\).

5.2.2 LLaVA (13B)

  • Model: a large multi-modal model
  • Fine-tuned on 158,000 unique language-image instruction-following samples
  • Models the interaction between language and images to strengthen multi-modal associations, aiming to maximize the mutual information \(I(X;Y)\) between the language input \(X\) and the image output \(Y\).


6. Domain-specific Instruction Fine-tuning

6.1 Dialogue

6.1.1 InstructDial

  • Framework: an instruction tuning framework designed for dialogue
  • Collects 48 dialogue tasks and includes two meta-tasks: instruction selection and instruction binary prediction
  • The framework selects the instruction best suited to each dialogue situation; the selected instruction guides the flow of the conversation and helps pinpoint the user's intent.


6.2 Intent Classification and Slot Tagging

6.2.1 LINGUIST

  • Model: AlexaTM 5B fine-tuned on an intent classification and slot tagging instruction dataset
  • Performance: substantially outperforms state-of-the-art methods
  • The model optimizes the conditional probability \(P(\text{intent}, \text{slots} \mid \text{input})\) to infer the correct intent and slots for a given input.


6.3 Information Extraction

6.3.1 InstructUIE

  • Framework: a unified information extraction framework based on instruction tuning
  • Performance: comparable to supervised BERT baselines and surpasses GPT-3.5 in the zero-shot setting
  • Information extraction aims to pull structured information out of input text, using probabilistic methods to classify and tag spans according to specific patterns or keywords.


6.4 Aspect-based Sentiment Analysis

6.4.1 Varia et al.

  • Framework: a unified instruction tuning framework based on the T5 (220M) model
  • Performance: substantial improvements in few-shot learning while remaining competitive under full fine-tuning
  • The model optimizes conditional probabilities to classify each sentiment-related aspect correctly, using pattern recognition over sentiment, opinion, and aspect terms.


6.5 Writing Assistance

6.5.1 Writing-Alpaca-7B

  • Framework: LLaMA-7B fine-tuned on a writing instruction dataset
  • Performance: improves on all writing tasks and outperforms other LLMs
  • The gains come from optimizing the model's response to the input, using a probabilistic approach to understanding context and generating appropriate text.


7. Efficient Tuning Techniques

Efficient tuning aims to adapt large language models (LLMs) to downstream tasks by optimizing only a small number of parameters, through several families of methods: addition-based, specification-based, and reparameterization-based.

  1. Research on efficient fine-tuning techniques for LLMs
  2. Parameter optimization through a variety of approaches
  3. Performance evaluation through experiments and dataset analysis


[Method details]

  • (1) Addition-based methods introduce extra trainable parameters or modules that do not exist in the original model.
    • Representative methods: adapter tuning (Houlsby et al., 2019), prompt-based tuning (Schick and Schütze, 2021)
  • (2) Specification-based methods designate certain inherent model parameters to be tuned while freezing the rest.
    • Example: BitFit (Zaken et al., 2022) tunes the bias terms of the pretrained model.
  • (3) Reparameterization-based methods transform model weights into a more parameter-efficient form for tuning.
    • Key hypothesis: model adaptation is low-rank, e.g., LoRA (Hu et al., 2021).
  • (4) Intrinsic prompt tuning finds a low-dimensional subspace in which tuning prompts can be shared across diverse tasks.

7.1 LoRA (Low-Rank Adaptation, Hu et al., 2021)

  • Uses DeepSpeed (Rasley et al., 2020) as the training backbone.
  • Enables efficient adaptation through low-rank updates.
  • Reduces the number of trainable parameters by 10,000x and memory usage by 3x compared to full fine-tuning.
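The parameter saving from a low-rank update can be made concrete: instead of training a full d_out × d_in matrix, LoRA trains two thin factors B (d_out × r) and A (r × d_in) added to the frozen weight. A minimal pure-Python sketch with toy sizes, not the actual implementation:

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, scale=1.0):
    """y = (W + scale * B A) x: frozen base weight plus a low-rank update."""
    low_rank = matvec(B, matvec(A, x))   # A: r x d_in, B: d_out x r
    return [b + scale * l for b, l in zip(matvec(W, x), low_rank)]

d_in, d_out, r = 1024, 1024, 8
full_params = d_in * d_out              # trained by full fine-tuning
lora_params = r * d_in + d_out * r      # trained by LoRA
reduction = full_params / lora_params   # 64x fewer for this toy shape
```

The 10,000x figure above comes from applying such updates with a small rank to only a few weight matrices of a very large model; the toy shape here shows the mechanism rather than the exact ratio.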

7.2 HINT (Ivison et al., 2022)

  • Combines instruction tuning with efficient on-demand fine-tuning.
  • Uses a hypernetwork to generate parameter-efficient modules for LLMs.
  • Enables longer instructions and additional few-shot examples without increasing compute.

7.3 QLORA (Dettmers et al., 2023)

  • Incorporates optimal quantization and memory optimization.
  • Enables fine-tuning a 65B-parameter LLM on a single 48GB GPU with no performance degradation.
  • Uses 4-bit NormalFloat (NF4) quantization together with double quantization (a secondary 8-bit quantization of the quantization constants).
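NF4 places its 16 quantization levels at quantiles of a normal distribution; the sketch below uses uniformly spaced levels instead, which is enough to show the round-to-nearest-level mechanics of 4-bit absmax quantization (the uniform level placement is the simplification, and the weights are toy values):

```python
def quantize_absmax(weights, bits=4):
    """Map each weight to the index of the nearest of 2**bits levels
    spread evenly over [-max|w|, +max|w|]."""
    n = 2 ** bits
    scale = max(abs(w) for w in weights)
    levels = [-scale + 2 * scale * i / (n - 1) for i in range(n)]
    codes = [min(range(n), key=lambda i: abs(levels[i] - w)) for w in weights]
    return codes, scale

def dequantize(codes, scale, bits=4):
    """Recover approximate weights from 4-bit codes and the stored scale."""
    n = 2 ** bits
    return [-scale + 2 * scale * c / (n - 1) for c in codes]

codes, scale = quantize_absmax([0.5, -1.0, 0.24])
recovered = dequantize(codes, scale)   # close to, but not equal to, the inputs
```

In QLoRA the per-block scale constants themselves are quantized again to 8 bits (double quantization), shaving off further memory.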

7.4 LOMO (LOw-Memory Optimization, Lv et al., 2023)

  • Enables full-parameter fine-tuning with limited resources.
  • Fuses gradient computation and the parameter update into a single step.
  • Reduces gradient memory to O(1).
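The fused update can be contrasted with the usual two-phase loop (compute all gradients, then step): in the sketch below only one parameter's gradient exists at any moment, so gradient storage is constant in the number of parameters. The separable quadratic objective is a toy stand-in for the real loss:

```python
def lomo_style_sgd(params, grad_fn, lr):
    """Fused gradient/update: compute the gradient of one parameter,
    apply it immediately, and discard it (O(1) gradient memory)."""
    for i in range(len(params)):
        g = grad_fn(params, i)   # gradient for parameter i only
        params[i] -= lr * g      # in-place update; g is freed right after

# Toy objective L(p) = sum(p_i^2), so dL/dp_i = 2 * p_i.
params = [1.0, -2.0, 3.0]
lomo_style_sgd(params, grad_fn=lambda p, i: 2 * p[i], lr=0.1)
# params is now approximately [0.8, -1.6, 2.4]
```

The real LOMO fuses the update into the backward pass itself rather than recomputing per-parameter gradients, but the memory argument is the same.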

7.5 Delta-tuning (Ding et al., 2023b)

  • Provides an optimization and optimal-control perspective on parameter-efficient tuning.
  • Performs subspace optimization.
  • The tuned parameters act as optimal controllers for downstream tasks.


8. Evaluation, Analysis, and Criticism

8.1 HELM Evaluation

HELM (Liang et al., 2022) is a holistic evaluation designed to improve the transparency of language models, offering a comprehensive view of their capabilities, risks, and limitations. The evaluation focuses on three main factors.

Evaluation factors

  1. Broad coverage: HELM proposes a taxonomy that raises scenario coverage from 17.9% to 96.0%.
  2. Multi-metric measurement: HELM measures 98 of the 112 possible (core scenario, metric) pairs (87.5%), spanning 16 core scenarios and 7 metrics.
  3. Standardization: benchmarks 30 well-known language models from Google, OpenAI, EleutherAI, and others.

8.2 Low-resource Instruction Tuning

Gupta et al. (2023) investigate the minimum amount of downstream training data an IT model needs. Their findings:

  • For single-task learning, 25% of the downstream data is sufficient.
  • For multi-task learning, only 6% of the downstream data is needed.

8.3 Smaller Instruction Datasets

Zhou et al. (2023) propose LIMA, which fine-tunes an LLM on just 1,000 carefully selected training examples.

  • Outperforms GPT-davinci003, which was fine-tuned on 52,000 examples.
  • Achieves results on par with GPT-4, Claude, and Bard.

8.4 Evaluating Instruction Tuning Datasets

Wang et al. (2023c) evaluate a range of IT datasets using both automatic and human evaluation.

  • No single IT dataset is best across all tasks.
  • Smaller models and models built on high-quality bases gain the most from IT.

8.5 Does IT Only Learn Pattern Copying?

Kung and Peng (2023) question what models actually learn during instruction tuning.

  • Models capture surface-level patterns rather than learning the underlying task.

8.6 Imitating Proprietary LLMs

Gudibande et al. (2023) investigate the effectiveness of model imitation.

  • Imitation models show improved performance on tasks backed by imitation datasets.
  • On tasks without imitation data, their performance is poor.

1 Introduction

The field of large language models (LLMs) has witnessed remarkable progress in recent years. LLMs such as GPT-3 (Brown et al., 2020b), PaLM (Chowdhery et al., 2022), and LLaMA (Touvron et al., 2023a) have demonstrated impressive capabilities across a wide range of natural language tasks (Zhao et al., 2021; Wang et al., 2022b, 2023a; Wan et al., 2023; Sun et al., 2023c; Wei et al., 2023; Li et al., 2023a; Gao et al., 2023a; Yao et al., 2023; Yang et al., 2022a; Qian et al., 2022; Lee et al., 2022; Yang et al., 2022b; Gao et al., 2023b; Ning et al., 2023; Liu et al., 2021b; Wiegreffe et al., 2021; Sun et al., 2023b,a; Adlakha et al., 2023; Chen et al., 2023).

Major Issues with LLMs

One of the major issues with LLMs is the mismatch between the training objective and users’ objective: LLMs are typically trained on minimizing the contextual word prediction error on large corpora; while users want the model to “follow their instructions helpfully and safely” (Radford et al., 2019; Brown et al., 2020a; Fedus et al., 2021; Rae et al., 2021; Thoppilan et al., 2022).

Instruction Tuning (IT)

To address this mismatch, instruction tuning (IT) is proposed, serving as an effective technique to enhance the capabilities and controllability of large language models. It involves further training LLMs using (INSTRUCTION, OUTPUT) pairs, where INSTRUCTION denotes the human instruction for the model, and OUTPUT denotes the desired output that follows the INSTRUCTION.

Benefits of IT

  1. Finetuning an LLM on the instruction dataset bridges the gap between the next-word prediction objective of LLMs and the users’ objective of instruction following.
  2. IT allows for a more controllable and predictable model behavior compared to standard LLMs. The instructions serve to constrain the model’s outputs to align with the desired response characteristics or domain knowledge, providing a channel for humans to intervene with the model’s behaviors.
  3. IT is computationally efficient and can help LLMs rapidly adapt to a specific domain without extensive retraining or architectural changes.

Challenges of IT

  1. Crafting high-quality instructions that properly cover the desired target behaviors is non-trivial: existing instruction datasets are usually limited in quantity, diversity, and creativity.
  2. There has been an increasing concern that IT only improves on tasks that are heavily supported in the IT training dataset (Gudibande et al., 2023).
  3. There has been an intense criticism that IT only captures surface-level patterns and styles (e.g., the output format) rather than comprehending and learning the task (Kung and Peng, 2023).

Research Directions

Improving instruction adherence and handling unanticipated model responses remain open research problems. These challenges highlight the importance of further investigations, analysis, and summarization in this field, to optimize the fine-tuning process and better understand the behavior of instruction fine-tuned LLMs.

In the Literature

In the literature, there has been an increasing research interest in analysis and discussions on LLMs, including pre-training methods (Zhao et al., 2023), reasoning abilities (Huang and Chang, 2022), downstream applications (Yang et al., 2023; Sun et al., 2023b), but rarely on the topic of LLM instruction fine-tuning. This survey attempts to fill this blank, organizing the most up-to-date state of knowledge on this quickly advancing field.

Survey Sections

  • Section 2 presents the general methodology employed in instruction fine-tuning.
  • Section 3 outlines the construction process of commonly-used IT representative datasets.
  • Section 4 presents representative instruction-finetuned models.
  • Section 5 reviews multi-modality techniques and datasets for instruction tuning, including images, speech, and video.
  • Section 6 reviews efforts to adapt LLMs to different domains and applications using the IT strategy.
  • Section 7 reviews explorations to make instruction fine-tuning more efficient, reducing the computational and time costs associated with adapting large models.
  • Section 8 presents the evaluation of IT models, analysis on them, along with criticism against them.

2 Methodology

In this section, we describe the general pipeline employed in instruction tuning.

2.1 Instruction Dataset Construction

Each instance in an instruction dataset consists of three elements:

  • an instruction, which is a natural language text sequence to specify the task (e.g., write a thank-you letter to XX for XX, write a blog on the topic of XX, etc)
  • an optional input which provides supplementary information for context
  • an anticipated output based on the instruction and the input.

There are generally two methods for constructing instruction datasets:

  • Data integration from annotated natural language datasets: In this approach, (instruction, output) pairs are collected from existing annotated natural language datasets by using templates to transform text-label pairs to (instruction, output) pairs. Datasets such as Flan (Longpre et al., 2023) and P3 (Sanh et al., 2021) are constructed based on the data integration strategy.

  • Generating outputs using LLMs: An alternate way to quickly gather the desired outputs to given instructions is to employ LLMs such as GPT-3.5-Turbo or GPT4 instead of manually collecting the outputs. Instructions can come from two sources:

    1. manually collected
    2. expanded based on a small handwritten seed instructions using LLMs.

    Next, the collected instructions are fed to LLMs to obtain outputs. Datasets such as InstructWild (Xue et al., 2023) and Self-Instruct (Wang et al., 2022c) are generated following this approach.

For multi-turn conversational IT datasets, we can have large language models self-play different roles (user and AI assistant) to generate messages in a conversational format (Xu et al., 2023b).

2.2 Instruction Tuning

Based on the collected IT dataset, a pretrained model can be directly fine-tuned in a fully-supervised manner, where given the instruction and the input, the model is trained by predicting each token in the output sequentially.

3 Datasets

In this section, we detail widely-used instruction tuning datasets in the community. Table 1 gives an overview of the datasets.

3.1 Natural Instructions

Natural Instructions (Mishra et al., 2021) is a human-crafted English instruction dataset consisting of 193K instances, coming from 61 distinct NLP tasks. The dataset is comprised of “instructions” and “instances”.

  • Each instance in the “instructions” is a task description consisting of 7 components: title, definition, things to avoid, emphasis/caution, prompt, positive example, and negative example. Subfigure (a) in Figure 2 gives an example of the “instructions”.

  • “Instances” consists of (“input”, “output”) pairs, which are the input data and textual result that follows the given instruction correctly. Subfigure (b) in Figure 2 gives an example of the instances.

The data comes from existing NLP datasets of 61 tasks. The authors collected the “instructions” by referring to the dataset annotating instruction file. Next, the authors constructed the “instances” by unifying data instances across all NLP datasets to (“input”, “output”) pairs.

3.2 P3

P3 (Public Pool of Prompts) (Sanh et al., 2021) is an instruction fine-tuning dataset constructed by integrating 170 English NLP datasets and 2,052 English prompts. Prompts, sometimes named as task templates, function as mappings of data instances in conventional NLP tasks (e.g., question answering, text classification) to natural language input-output pairs.

Each instance in P3 has three components:

  • “Inputs”: A sequence of text that describes the task in natural language (e.g., “If he like Mary is true, is it also true that he like Mary’s cat?”).
  • “Answer Choices”: A list of text strings that are applicable responses to the given task (e.g., [“yes”, “no”, “undetermined”]).
  • “Targets”: A text string that is the correct response to the given “inputs” (e.g., “yes”).

The authors built PromptSource, a tool for creating high-quality prompts collaboratively and an archive for open-sourcing high-quality prompts. The P3 dataset was built by randomly sampling a prompt from multiple prompts in the PromptSource and mapping each instance into an (“inputs”, “answer choices”, “targets”) triplet.

3.3 xP3

xP3 (Crosslingual Public Pool of Prompts) (Muennighoff et al., 2022) is a multilingual instruction dataset consisting of 16 diverse natural language tasks in 46 languages.

Each instance in the dataset has two components:

  • “Inputs”: A task description in natural language.
  • “Targets”: The textual result that follows the “inputs” instruction correctly.

The original data in xP3 comes from three sources:

  1. The English instruction dataset P3
  2. 4 English tasks not covered in P3 (e.g., translation, program synthesis)
  3. 30 multilingual NLP datasets.

The authors built the xP3 dataset by sampling human-written task templates from PromptSource and then filling templates to transform diverse NLP tasks into a unified formalization. For example, a task template for the natural language inference task is as follows: “If Premise is true, is it also true that Hypothesis?”; “yes”, “maybe”, “no” with respect to the original task labels “entailment (0)”, “neutral (1)” and “contradiction (2)”.
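The template-filling step can be sketched directly from the example above; the helper below is illustrative (the function name and dict keys are my own, and only the NLI template from the quoted example is implemented):

```python
def fill_nli_template(premise, hypothesis, label):
    """Turn an NLI (premise, hypothesis, label) triple into an
    ("inputs", "targets") pair using the template quoted above."""
    label_words = {0: "yes", 1: "maybe", 2: "no"}  # entailment/neutral/contradiction
    inputs = f'If "{premise}" is true, is it also true that "{hypothesis}"?'
    return {"inputs": inputs, "targets": label_words[label]}

pair = fill_nli_template("he likes the cat", "he likes animals", 0)
# pair["targets"] == "yes"
```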

3.4 Flan 2021

Flan 2021 (Longpre et al., 2023) is an English instruction dataset constructed by transforming 62 widely-used NLP benchmarks (e.g., SST-2, SNLI, AG News, MultiRC) into language input-output pairs.

Each instance in the Flan 2021 dataset has two components:

  • “Input”: A sequence of text that describes a task via a natural language instruction (e.g., “determine the sentiment of the sentence ‘He likes the cat.’ is positive or negative?”).
  • “Target”: A textual result that executes the “input” instruction correctly (e.g., “positive”).

The authors transformed conventional NLP datasets into input-target pairs by:

  1. Manually composing instruction and target templates
  2. Filling templates with data instances from the dataset.

3.5 Unnatural Instructions

Unnatural Instructions (Honovich et al., 2022) is an instruction dataset with approximately 240,000 instances, constructed using InstructGPT (text-davinci-002) (Ouyang et al., 2022). Each instance in the dataset has four components:

  • “INSTRUCTION”: A description of the instructing task in natural language.
  • “INPUT”: An argument in natural language that instantiates the instruction task.
  • “CONSTRAINTS”: Restrictions of the output space of the task.
  • “OUTPUT”: A sequence of text that correctly executes the instruction given the input argument and the constraints.

The authors first sampled seed instructions from the Super-Natural Instructions dataset (Wang et al., 2022e), which is manually constructed. They prompted InstructGPT to elicit a new (instructions, inputs, constraints) pair with three seed instructions as demonstrations. Then, the dataset was expanded by randomly rephrasing the instruction or the input. The concatenation of instruction, input, and constraint is fed to InstructGPT to obtain the output.

3.6 Self-Instruct

Self-Instruct (Wang et al., 2022c) is an English instruction dataset with 52K training instructions and 252 evaluation instructions, constructed using InstructGPT (Ouyang et al., 2022). Each data instance consists of:

  • “Instruction”: A task definition in natural language (e.g., “Please answer the following question.”).
  • “Input”: Optional supplementary content for the instruction (e.g., “Which country’s capital is Beijing?”).
  • “Output”: The textual result that follows the instruction correctly (e.g., “Beijing”).

The full dataset is generated based on the following steps:

  1. Step 1: The authors randomly sampled 8 natural language instructions from the 175 seed tasks as examples and prompted InstructGPT to generate more task instructions.

  2. Step 2: The authors determined whether each instruction generated in Step 1 describes a classification task. If so, they asked InstructGPT to generate all possible output options for the instruction, randomly selected one output category, and prompted InstructGPT to generate the corresponding “input” content. For instructions that do not describe a classification task, there could be countless “output” options, so the authors proposed the input-first strategy, in which InstructGPT is prompted to generate the “input” based on the given “instruction” first and then generate the “output” according to the “instruction” and the generated “input”.

  3. Step 3: Based on the results of Step 2, the authors used InstructGPT to generate the “input” and “output” for corresponding instruction tasks using the output-first or input-first strategy.

  4. Step 4: The authors post-processed the generated instruction tasks (e.g., filtering out similar instructions and removing duplicate data for input and output) and got a final number of 52K English instructions.
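Step 4's filtering can be sketched as a similarity screen against the growing instruction pool. Self-Instruct uses ROUGE-L for this; the sketch below substitutes a plain word-overlap score to stay dependency-free, so the threshold is illustrative:

```python
def word_overlap(a, b):
    """Jaccard overlap between the word sets of two instructions."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def filter_new_instructions(pool, candidates, threshold=0.7):
    """Keep a candidate only if it is not too similar to anything in the
    pool (or to an earlier kept candidate)."""
    kept = []
    for cand in candidates:
        if all(word_overlap(cand, p) < threshold for p in pool + kept):
            kept.append(cand)
    return kept

pool = ["Write a short poem about the sea."]
candidates = [
    "Write a short poem about the sea.",   # near-duplicate: dropped
    "Summarize the following article.",    # novel: kept
]
kept = filter_new_instructions(pool, candidates)  # keeps only the summary task
```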

3.7 Evol-Instruct

Evol-Instruct (Xu et al., 2023a) is an English instruction dataset that includes a training set with 52K instructions and an evaluation set with 218 instructions. The dataset was created using evolving strategies prompted by ChatGPT (OpenAI, 2022). These strategies include:

  • In-depth Evolving Strategy: Includes operations like adding constraints, increasing reasoning steps, and complicating input.
  • In-breadth Evolving Strategy: Enhances simple instructions or directly generates new ones for increased diversity.

The dataset underwent four iterations of these evolving strategies to arrive at a final count of 250K instruction pairs. In addition, the authors compiled a test set of 218 human-generated instructions from real-world sources like open-source projects and forums.

3.8 LIMA

LIMA (Zhou et al., 2023) is another English instruction dataset containing 1K training instances and 300 test instances. The training set is sourced from:

  • 75% from community Q&A websites like Stack Exchange and wikiHow.
  • 20% manually written by authors (Group A).
  • 5% from the Super-Natural Instructions dataset (Wang et al., 2022d).

The test set comprises 300 instances, with 76.7% written by a different group of authors (Group B) and 23.3% sampled from the Pushshift Reddit Dataset (Baumgartner et al., 2020).

3.9 Super-Natural Instructions

Super-Natural Instructions (Wang et al., 2022f) is a multilingual dataset featuring 1,616 NLP tasks and 5M task instances across 76 task types and 55 languages. The dataset includes:

  • Definition: Natural language descriptions of tasks.
  • Positive Examples: Sample inputs and correct outputs with short explanations.
  • Negative Examples: Sample inputs and incorrect outputs with explanations.

The data is sourced from existing public NLP datasets, crowdsourced annotations, and synthetic tasks.

3.10 Dolly

Dolly (Conover et al., 2023a) is designed to help large language models (LLMs) interact with users similarly to ChatGPT. The dataset contains 15,000 human-generated data instances and covers seven specific types of tasks, including:

  • Open Q&A
  • Closed Q&A
  • Information extraction from Wikipedia
  • Summarizing information from Wikipedia
  • Brainstorming
  • Classification
  • Creative Writing

Examples of each task type are detailed in Table 2.

3.11 OpenAssistant Conversations

OpenAssistant Conversations (Köpf et al., 2023) is a dataset that features human-crafted, multilingual assistant-style conversations. The dataset includes:

  • 161,443 messages
  • 91,829 user prompts
  • 69,614 assistant replies
  • 66,497 conversation trees
  • 35 languages
  • 461,292 human-annotated quality ratings

Each conversation is represented as a conversation tree (CT), where nodes signify either a prompt or a reply from the assistant. A path from the root node to any other node in the CT is considered a valid conversation thread.
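The thread-extraction rule (every root-to-node path is a valid conversation) is easy to express over a tree stored as an adjacency map; the structure below is a made-up miniature, not data from the dataset:

```python
def valid_threads(tree, node="root", path=None):
    """Enumerate every root-to-node path in a conversation tree; each
    path of length >= 2 (a prompt plus at least one reply) is a thread."""
    path = (path or []) + [node]
    threads = [path] if len(path) > 1 else []
    for child in tree.get(node, []):
        threads.extend(valid_threads(tree, child, path))
    return threads

# A prompt with two assistant replies; the first reply got a follow-up.
tree = {
    "root": ["reply_a", "reply_b"],
    "reply_a": ["followup"],
}
threads = valid_threads(tree)  # 3 valid threads
```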

The dataset was built using a five-step pipeline:

  1. Prompting: Initial prompts were crafted by contributors.
  2. Labeling Prompts: Prompts were rated and high-quality ones were selected.
  3. Expanding Tree Nodes: Contributors added additional replies.
  4. Labeling Replies: Replies were rated for quality.
  5. Ranking: Replies were ranked according to guidelines.

Inappropriate and offensive conversation trees were filtered out.

3.12 Baize

Baize (Xu et al., 2023b) is an English multi-turn chat corpus containing:

  • 111.5K instances
  • Each instance contains an average of 3.4 turns

The dataset is built using ChatGPT and employs a self-chat mechanism where ChatGPT plays both user and assistant roles. To create the dataset, the authors:

  1. Crafted a task template defining roles and tasks.
  2. Sampled questions from Quora and Stack Overflow as conversation seeds.
  3. Utilized ChatGPT to generate conversations based on the template and seeds.

Conversations continue until a natural stopping point is reached.
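The self-chat loop can be sketched with a stand-in for the model call; `generate` below is a hypothetical callable (the real pipeline prompts ChatGPT with the task template and seeds), and the canned replies are invented:

```python
def self_chat(seed_question, generate, max_turns=3):
    """Alternate user/assistant roles from one seed question, with a
    single model (the `generate` callable) playing both sides."""
    transcript = [("user", seed_question)]
    for _ in range(max_turns * 2 - 1):
        role = "assistant" if transcript[-1][0] == "user" else "user"
        message = generate(role, transcript)
        if message == "[END]":          # natural stopping point
            break
        transcript.append((role, message))
    return transcript

# Stub generator returning a canned exchange instead of real model calls.
canned = iter(["It's a chat corpus.", "Thanks!", "[END]"])
transcript = self_chat("What is Baize?", lambda role, t: next(canned))
```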

4 Instruction Fine-tuned LLMs

This section provides an overview of large language models (LLMs) that have been fine-tuned through specific instruction-based methodologies.

4.1 InstructGPT

InstructGPT is a model based on GPT-3, fine-tuned on human instructions. The fine-tuning process consists of:

  • Supervised Fine-Tuning (SFT) on a human-filtered instruction dataset.
  • Reward Model Training to predict human preferences.
  • Optimization using Proximal Policy Optimization (PPO).

In evaluations, InstructGPT performs 10% better on TruthfulQA and 7% better on RealToxicityPrompts compared to GPT-3. In human evaluations, it shows significant improvements in following instructions, constraints, and generating appropriate responses.

4.2 BLOOMZ

BLOOMZ starts from the BLOOM base model and is fine-tuned on xP3, a dataset covering 46 languages. It performs better in automatic evaluations in coreference resolution, sentence completion, and natural language inference tasks, among others.

4.3 Flan-T5

Flan-T5 starts from T5 and is fine-tuned on the FLAN dataset. During fine-tuning, it utilizes the JAX-based T5X framework and achieves better or comparable performance to much larger models, including PaLM, in a variety of NLP tasks.

4.4 Alpaca

Alpaca is a 7B model fine-tuned from LLaMA on an instruction dataset generated by InstructGPT. It achieves comparable performance to InstructGPT in human evaluations and excels in the self-instruct dataset.

4.5 Vicuna

Vicuna (13B) (Chiang et al., 2023) is a language model trained by fine-tuning LLaMA (13B) (Touvron et al., 2023a) on the conversational dataset generated by ChatGPT. The authors gathered user-shared ChatGPT conversations from ShareGPT.com, and got 70K conversation records after filtering out low-quality samples. LLaMA (13B) was fine-tuned on the constructed conversation dataset using a modified loss function tailored to multi-turn conversations. The authors expanded the max context length from 512 to 2048 for better understanding long context across multiple-turn dialog. Training involved gradient checkpointing and flash attention techniques to reduce GPU memory cost. Fine-tuning took 24 hours on an 8 × 80GB A100 device.

Evaluation: Vicuna outperforms Alpaca (13B) and LLaMA (13B) in 90% of the test questions and generates equal or better rating responses compared to ChatGPT in 45% of the questions.

4.6 GPT-4-LLM

GPT-4-LLM (7B) (Peng et al., 2023) is a language model fine-tuned from LLaMA (7B) on the GPT-4 generated instruction dataset. The fine-tuning process involves supervised fine-tuning followed by optimizing using proximal policy optimization (PPO).

Evaluation: GPT-4-LLM outperforms not only the baseline Alpaca (7B), but also larger models including Alpaca (13B) and LLaMA (13B).

4.7 Claude

Claude is a language model fine-tuned on an instruction dataset with the aim to generate helpful and harmless responses. The fine-tuning process involves two steps: supervised fine-tuning followed by optimizing using proximal policy optimization (PPO).

Evaluation: Claude generates more helpful and harmless responses than its backbone model, and scores 7% better than GPT-3 on RealToxicityPrompts in terms of (lower) toxicity.

4.8 WizardLM

WizardLM (7B) (Xu et al., 2023a) is a language model fine-tuned from LLaMA (7B) on Evol-Instruct, an instruction dataset generated by ChatGPT. Fine-tuning takes about 70 hours for 3 epochs on 8 V100 GPUs, using the DeepSpeed ZeRO-3 technique.

Evaluation: WizardLM significantly outperforms Alpaca (7B) and Vicuna (7B), and offers responses comparable to or better than ChatGPT's in 67% of test cases. It beats Alpaca by +6.2% and +5.3% on two test sets, and Vicuna by +5.8% and +1.7%.
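The Evol-Instruct idea can be sketched as a driver loop that repeatedly rewrites instructions using either in-depth templates (add constraints, require reasoning, increase specificity) or in-breadth templates (create a new instruction in the same domain). In WizardLM the rewriting is performed by ChatGPT; here `llm` is a stub and the template wording is illustrative, not the paper's exact prompts.

```python
import random

# In-depth evolution: make an existing instruction harder.
IN_DEPTH_TEMPLATES = [
    "Add one more constraint or requirement to: {inst}",
    "Rewrite the following to require multi-step reasoning: {inst}",
    "Replace general concepts with more specific ones in: {inst}",
]
# In-breadth evolution: broaden topic coverage.
IN_BREADTH_TEMPLATE = "Create a brand new instruction in the same domain as: {inst}"

def evolve(seed_instructions, llm, rounds=2, p_breadth=0.25, rng=random):
    """Grow an instruction pool by repeatedly evolving its members."""
    pool = list(seed_instructions)
    for _ in range(rounds):
        new = []
        for inst in pool:
            if rng.random() < p_breadth:
                prompt = IN_BREADTH_TEMPLATE.format(inst=inst)
            else:
                prompt = rng.choice(IN_DEPTH_TEMPLATES).format(inst=inst)
            new.append(llm(prompt))   # in WizardLM, an API call to ChatGPT
        pool.extend(new)              # evolved instructions join the pool
    return pool
```

In the real pipeline, a filtering step also discards failed evolutions before the pool is used for fine-tuning.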

4.9 ChatGLM2

ChatGLM2 (6B) (Du et al., 2022) is a language model fine-tuned on GLM (6B). It is trained on a bilingual dataset containing both English and Chinese instructions. To model long context, the maximum context length is increased to 32K.

Evaluation: ChatGLM2 outperforms GLM (6B) and the baseline model on all benchmarks. Specifically, ChatGLM2 outperforms GLM by +3.1 on MMLU, +5.0 on C-Eval, +8.6 on GSM8K, and +2.2 on BBH.

4.10 LIMA

LIMA (65B) (Zhou et al., 2023) is a large language model fine-tuned on LLaMA (65B). It is developed based on the “superficial alignment hypothesis,” which suggests that language models acquire most of their capabilities during pre-training and only need a small set of instruction data for fine-tuning to align with user preferences.

Evaluation: For human evaluations, LIMA outperforms InstructGPT and Alpaca by 17% and 19%, respectively. In automatic evaluations conducted by GPT-4, LIMA outperforms InstructGPT and Alpaca by 20% and 36%, respectively.

4.11 Others

  1. OPT-IML (175B): Trained on the Instruction Meta-Learning (IML) dataset; excels at various NLP benchmarks.

  2. Dolly 2.0 (12B): Fine-tuned on an instruction dataset covering NLP tasks such as text classification and information extraction.

  3. Falcon-Instruct (40B): Fine-tuned on an English dialogue dataset, employing techniques to reduce memory usage.

  4. Guanaco (7B): A multi-turn dialog model trained on a multilingual dataset.

  5. Minotaur (15B): Supports a maximum context length of 18K tokens; fine-tuned on open-source instruction datasets.

  6. Nous-Hermes (13B): Fine-tuned on a dataset of over 300k instructions; performs well on multiple tasks.

  7. TÜLU (6.7B): Fine-tuned on a mixed instruction dataset; performs relatively well compared to larger models.

  8. YuLan-Chat (13B): A bilingual model with performance comparable to state-of-the-art models.

  9. MOSS (16B): Focused on multi-turn conversations; aligns well with human preferences.

  10. Airoboros (13B): Fine-tuned on the Self-Instruct dataset; outperforms LLaMA on all benchmarks.

  11. UltraLM (13B): Surpasses several previous best models, including Vicuna and WizardLM, in evaluations.

Taken together, these models show how instruction tuning has produced a wave of specialized LLMs, fine-tuned for different benchmarks, languages, and instruction types, each aiming to be more effective, more efficient, and better aligned with human needs and preferences.

5. Multi-modality Instruction Fine-tuning

5.1 Multi-modality Datasets

  • MULTIINSTRUCT (Xu et al., 2022)
    • 62 diverse multimodal tasks
    • 10 broad categories
    • Derived from 21 existing open-sourced datasets
    • Each task has 5 expert-written instructions
    • Enhances transfer learning techniques
  • PMC-VQA (Zhang et al., 2023c)
    • Large-scale medical visual question-answering dataset
    • 227k image-question pairs from 149k images
    • Can be used for both open-ended and multiple-choice tasks
    • Outperforms existing models on various benchmarks
  • LAMM (Yin et al., 2023)
    • Comprehensive multi-modal instruction tuning dataset
    • Focus on 2D image and 3D point cloud understanding
    • Includes data pairs for commonsense knowledge question answering

5.2 Multi-modality Instruction Fine-tuning Models

  • InstructPix2Pix (983M) (Brooks et al., 2022)
    • Conditional diffusion model
    • More than 450K text editing instructions and corresponding images
  • LLaVA (13B) (Liu et al., 2023b)
    • Large multimodal model
    • Fine-tuned using 158K unique language-image instruction-following samples

6. Domain-specific Instruction Finetuning

6.1 Dialogue

  • InstructDial (Gupta et al., 2022)
    • Instruction tuning framework designed for dialogue
    • Collection of 48 dialogue tasks
    • Two metatasks: instruction selection task and instruction binary task

6.2 Intent Classification and Slot Tagging

  • LINGUIST (Rosenbaum et al., 2022)
    • Fine-tunes AlexaTM 5B on the instruction dataset for intent classification and slot tagging
    • Shows significant improvements over state-of-the-art approaches

Others (Multi-modality Models)

  • Video-LLaMA (Zhang et al., 2023b)
    • Multimodal framework for understanding both visual and auditory content in videos
    • Two branch encoders: a Vision-Language (VL) Branch and an Audio-Language (AL) Branch
  • MultiModal-GPT (Gong et al., 2023)
    • Multimodal instruction tuning model
    • Capable of following diverse instructions, generating detailed captions, and maintaining continuous dialogues

6.3 Information Extraction

InstructUIE (Wang et al., 2023b)

  • Framework: Unified information extraction framework based on instruction tuning.
  • Architecture: Finetunes 11B FlanT5 (Chung et al., 2022) on a constructed IT dataset.
  • Benchmark: Introduces IE INSTRUCTIONS, a benchmark of 32 diverse information extraction datasets.
  • Task Properties: Task instruction, options, text, and output.
  • Performance: Comparable to BERT in supervised settings and outperforms GPT-3.5 in zero-shot settings.

6.4 Aspect-based Sentiment Analysis

Varia et al. (2022)

  • Framework: Unified instruction tuning framework based on a fine-tuned T5 (220M) model.
  • Elements of ABSA: Aspect Term, Aspect Category, Opinion Term, and Sentiment.
  • Performance: Shows substantial improvement in few-shot learning and remains comparable in full fine-tuning.

6.5 Writing Assistance

Writing-Alpaca-7B (Zhang et al., 2023d)

  • Framework: Fine-tunes LLaMA-7B on the writing instruction dataset.
  • Instruction Scheme: Universal preface, instruction field, input field, and response field.
  • Performance: Improves on all writing tasks and outperforms other LLMs.
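The four-field scheme above (universal preface, instruction field, input field, response field) can be sketched as a simple prompt formatter. The preface wording and `###` markers below are assumptions for illustration, not Writing-Alpaca's verbatim template.

```python
# Hypothetical preface text; the real template's wording may differ.
PREFACE = ("Below is an instruction that describes a writing task, "
           "paired with an input that provides further context.")

def format_example(instruction, input_text="", response=""):
    """Assemble one training/inference example from the four fields.

    The input field is optional; at inference time `response` is left
    empty so the model completes it.
    """
    parts = [PREFACE, f"### Instruction:\n{instruction}"]
    if input_text:
        parts.append(f"### Input:\n{input_text}")
    parts.append(f"### Response:\n{response}")
    return "\n\n".join(parts)
```

Keeping one fixed preface across all tasks is what makes the scheme "universal": only the instruction and input vary per example.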

CoEdIT (Raheja et al., 2023)

  • Framework: Fine-tunes FlanT5 on an instruction dataset for text editing.
  • Task Characteristics: Text simplification, grammatical error correction, stylistic editing.
  • Performance: State-of-the-art performance in several text editing tasks.

6.6 Medical Applications

Radiology-GPT (Liu et al., 2023c)

  • Framework: Fine-tuned Alpaca-7B model for radiology.
  • Sections in Reports: “Findings” and “Impression”.
  • Performance: Demonstrates significant versatility in radiological diagnosis, research, and communication.

ChatDoctor (Li et al., 2023g)

  • Framework: Fine-tuned LLaMA-7B model utilizing the alpaca instruction dataset.
  • Functionality: Designed for retrieving external knowledge databases.
  • Performance: Significantly improves comprehension and advice accuracy.

6.7 Arithmetic

Goat (Liu and Low, 2023)

  • Framework: Fine-tuned LLaMA-7B model that aims to solve arithmetic problems.
  • Expression Types: Transforms arithmetic problems into natural language questions.
  • Performance: State-of-the-art performance on BIG-bench arithmetic subtask.
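Goat's key idea is to decompose hard arithmetic into chains of learnable sub-steps. A sketch of one such decomposition, splitting multi-digit multiplication into place-value partial products and running sums, is shown below; the exact chain-of-thought format is an assumption, not Goat's actual dataset format.

```python
def decompose_multiplication(a, b):
    """Decompose a * b into natural-language-style steps.

    Splits b by place value, emits each partial product, then sums
    them with an explicit running total. Assumes a, b >= 0.
    """
    steps, partials = [], []
    for i, digit in enumerate(reversed(str(b))):
        d = int(digit) * (10 ** i)
        if d == 0:
            continue
        partials.append(a * d)
        steps.append(f"{a} * {d} = {a * d}")
    if not partials:                      # b == 0 edge case
        return [f"So {a} * {b} = 0"], 0
    running = partials[0]
    for p in partials[1:]:
        steps.append(f"{running} + {p} = {running + p}")
        running += p
    steps.append(f"So {a} * {b} = {running}")
    return steps, running
```

Each individual step involves only multiplication by a single digit times a power of ten, or addition, which are the "learnable" primitives the approach relies on.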

6.8 Code Writing

WizardCoder (Luo et al., 2023)

  • Framework: Utilizes StarCoder 15B as the foundation.
  • Method: EvolInstruct technique adapted to the domain of code.
  • Performance: Outperforms all other open-source Code LLMs, and even the largest closed-source LLMs, on HumanEval and HumanEval+.

7 Efficient Tuning Techniques

Efficient fine-tuning techniques aim to adapt Large Language Models (LLMs) to downstream tasks by optimizing a small fraction of parameters in multiple ways: addition-based, specification-based, and reparameterization-based.

Methods

  • Addition-based methods: Introduce extra trainable parameters or modules not present in the original model.
    • Representative Methods: Adapter tuning (Houlsby et al., 2019), Prompt-based tuning (Schick and Schütze, 2021)
  • Specification-based methods: Specify certain inherent model parameters to be tuned while freezing others.
    • Example: BitFit (Zaken et al., 2022) tunes the bias terms of the pre-trained model.
  • Reparameterization methods: Transform model weights into more parameter-efficient forms for tuning.
    • Key Hypothesis: Model adaptation is low-rank. E.g., LoRA (Hu et al., 2021)
  • Intrinsic prompt tuning: Finds a low-dimensional subspace shared by tuning prompts across diverse tasks.
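As a concrete instance of specification-based tuning, BitFit's idea of training only bias terms can be sketched as a parameter filter. The `.bias` naming convention below is an assumption for illustration; real frameworks instead toggle `requires_grad` flags on the actual parameter tensors.

```python
def select_trainable(named_params):
    """Split a model's named parameters into BitFit-style groups.

    Parameters whose names end in ".bias" are trainable; everything
    else stays frozen at its pre-trained value.
    """
    trainable, frozen = {}, {}
    for name, p in named_params.items():
        (trainable if name.endswith(".bias") else frozen)[name] = p
    return trainable, frozen
```

Since bias vectors are a tiny fraction of total parameters, the optimizer state and gradient memory shrink accordingly.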

7.1 LoRA (Low-Rank Adaptation)

  • Authors: Hu et al., 2021
  • Features:
    • Uses DeepSpeed (Rasley et al., 2020) as the training backbone.
    • Enables efficient adaptation using low-rank updates.
    • Reduces the number of trainable parameters by 10,000x and memory usage by 3x compared to full fine-tuning.
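The low-rank update can be sketched as follows: a frozen weight `W` is augmented with a trainable product `(alpha/r) * B @ A`, where `B` is zero-initialized so training starts exactly from the base model. This is a minimal NumPy sketch with shapes following Hu et al., 2021; the training loop and weight merging are omitted.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer with a trainable low-rank residual."""

    def __init__(self, W, r=4, alpha=8, rng=np.random.default_rng(0)):
        d_out, d_in = W.shape
        self.W = W                                   # frozen pre-trained weight
        self.A = rng.normal(0, 0.01, (r, d_in))      # trainable, small random init
        self.B = np.zeros((d_out, r))                # trainable, zero init
        self.scale = alpha / r

    def forward(self, x):
        # Effective weight is W + scale * B @ A; at init this equals W.
        return x @ (self.W + self.scale * (self.B @ self.A)).T

    def trainable_params(self):
        return self.A.size + self.B.size
```

For a 64×128 layer with rank 4, only 768 parameters are trained instead of 8192, and after training `B @ A` can be merged into `W` so inference adds no latency.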

7.2 HINT

  • Authors: Ivison et al., 2022
  • Features:
    • Combines instruction tuning with efficient on-demand fine-tuning.
    • Uses hypernetworks to generate parameter-efficient modules for LLMs.
    • Benefits include longer instructions and additional few-shots without increasing compute.

7.3 QLORA

  • Authors: Dettmers et al., 2023
  • Features:
    • Includes optimal quantization and memory optimization.
    • Enables fine-tuning a 65B-parameter LLM on a single 48GB GPU without performance degradation.
    • Utilizes 4-bit NormalFloat (NF4) Quantization and second-level 8-bit quantization.
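The blockwise 4-bit quantization can be sketched as follows: each block of weights is scaled by its absolute maximum, then mapped to the nearest of 16 codebook values. The uniform codebook below is illustrative only; real NF4 places its 16 levels at quantiles of a normal distribution, and QLoRA further quantizes the per-block scales to 8 bits ("double quantization"), which this sketch omits.

```python
import numpy as np

# 16 levels => 4 bits per weight. Uniform spacing is an assumption;
# NF4's actual levels are normal-distribution quantiles.
CODEBOOK = np.linspace(-1.0, 1.0, 16)

def quantize(w, block=64):
    """Blockwise absmax quantization of a flat weight vector."""
    w = w.reshape(-1, block)
    scales = np.abs(w).max(axis=1, keepdims=True)     # one fp scale per block
    idx = np.abs((w / scales)[..., None] - CODEBOOK).argmin(-1)
    return idx.astype(np.uint8), scales

def dequantize(idx, scales):
    """Recover approximate weights: codebook lookup times block scale."""
    return CODEBOOK[idx] * scales
```

During QLoRA training the base weights stay in this 4-bit form and are dequantized on the fly, while only the LoRA adapters are updated in higher precision.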

7.4 LOMO (LOw-Memory Optimization)

  • Authors: Lv et al., 2023
  • Features:
    • Enables full parameter fine-tuning using limited resources.
    • Fuses gradient computation and update into one step.
    • Reduces gradient memory to O(1).
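The fused update can be illustrated on a toy quadratic loss: each parameter's gradient is applied the instant it is computed and then discarded, so at most one gradient is alive at any time. Real LOMO hooks into backpropagation inside a deep-learning framework, which this pure-Python sketch does not model.

```python
def lomo_step(params, targets, lr=0.1):
    """One fused gradient-compute-and-update pass.

    Toy loss: sum((p - target)^2). Instead of materializing a full
    gradient list and then calling an optimizer, each gradient is used
    immediately, keeping gradient memory at O(1).
    """
    peak_live_grads = 0
    for i in range(len(params)):
        grad = 2.0 * (params[i] - targets[i])        # gradient for one parameter
        peak_live_grads = max(peak_live_grads, 1)    # only this grad is alive
        params[i] -= lr * grad                       # fused SGD update
        # `grad` goes out of scope here; nothing is stored per parameter.
    return peak_live_grads
```

This is why LOMO pairs naturally with plain SGD: optimizers with per-parameter state (momentum, Adam moments) would reintroduce the memory the fusion saves.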

7.5 Delta-tuning

  • Authors: Ding et al., 2023b
  • Features:
    • Provides optimization and optimal control perspectives.
    • Performs subspace optimization.
    • The tuned parameters act as optimal controllers for downstream tasks.

8 Evaluation, Analysis and Criticism

8.1 HELM Evaluation

HELM (Liang et al., 2022) is a holistic evaluation of Language Models (LMs) aimed at improving transparency. It provides a comprehensive understanding of the capabilities, risks, and limitations of LMs. The evaluation focuses on three main factors:

Factors for Evaluation

  1. Broad Coverage: HELM proposes a top-down taxonomy to ensure wide scenario coverage, improving it from 17.9% to 96.0%.
  2. Multi-Metric Measurement: HELM covered 16 different scenarios and 7 metrics, measuring 98 of 112 possible core scenarios (87.5%).
  3. Standardization: Benchmarks 30 well-known language models including those from Google, OpenAI, and EleutherAI.

8.2 Low-resource Instruction Tuning

Gupta et al. (2023) investigate the minimal downstream training data required for IT models. Findings include:

  • In single-task learning, 25% of downstream data suffices.
  • In multi-task learning, only 6% of downstream data is needed.

8.3 Smaller Instruction Dataset

Zhou et al. (2023) proposed LIMA, fine-tuning LLMs on only 1,000 carefully selected training examples.

Performance

  • Outperforms GPT-davinci003, which was fine-tuned on 5,200 examples.
  • Achieves equivalent results to GPT-4, Claude, and Bard.

8.4 Evaluating Instruction-tuning Datasets

Wang et al. (2023c) evaluate various IT datasets through both automatic and human evaluations.

Findings

  • No single best IT dataset across all tasks.
  • Smaller models and high-base quality models benefit most from IT.

8.5 Does IT Just Learn Pattern Copying?

Kung and Peng (2023) question what models actually learn during instruction tuning.

Results

  • Models capture surface-level patterns instead of learning the specific task.

8.6 Proprietary LLMs Imitation

Gudibande et al. (2023) investigate the efficacy of model imitation.

Observations

  • Imitation models excel on tasks with supported datasets.
  • Imitation models perform poorly on tasks without imitation datasets.
