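[Key index marker: datasets]
Contents
1. Introduction
According to recent studies, the performance of large language models (LLMs) depends heavily on self-supervised learning over the vast amounts of text data processed during pretraining. Supervised fine-tuning (SFT) on deliberately curated instruction datasets can then further improve performance on downstream tasks. Such data management is important for ensuring the training efficiency and diversity of LLMs, and a systematic analysis of the rationale and effects of the various data management strategies is needed.
2. Pretraining of LLMs
2.1 Data Quantity
2.1.1 Scaling Laws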
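The performance of an LLM can be expressed as a function of training dataset size and model size, a relationship described by scaling laws. According to Kaplan et al. (2020), the test loss \(L\) follows a power law in the number of training tokens \(D\) and in the number of model parameters \(N\) when neither is bottlenecked by the other:
\[L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}\]where \(\alpha_D\) and \(\alpha_N\) are the scaling exponents and \(D_c\) and \(N_c\) are fitted constants. The loss therefore decreases predictably as data and model size grow together.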
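2.1.2 Data Repetition
Reusing data is one way to cope with the exhaustion of training data. Repeated data can cause overfitting, but well-controlled repetition allows efficient use of data without degrading performance.
2.2 Data Quality
2.2.1 Deduplication
Deduplication is important for improving training efficiency and model generalization. Efficient deduplication techniques improve the quality of the training dataset.
2.2.2 Quality Filtering
Selecting high-quality data has a decisive influence on LLM training. Various filtering techniques remove low-quality data and thereby improve model performance.
2.2.3 Toxicity Filtering
Removing toxic text keeps the model from generating harmful content, which supports model safety and reliability.
2.3 Domain Composition
Effective LLM training requires data from diverse domains. This lets the model handle a variety of downstream tasks and yields generalized performance that is not biased toward a particular domain.
2.4 Data Management Systems
Integrated data management systems offer LLM developers customized data processing, low-code operation, and a range of data management tools.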
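3. Supervised Fine-Tuning (SFT) of Large Language Models (LLMs)
3.1 Data Quantity
Research on how scaling instruction data up or down affects LLM performance runs in two directions. For example, LIMA (Zhou et al., 2023a) curated 1,000 high-quality samples and experimentally supported the hypothesis that a limited amount of data is enough for an LLM to expose the knowledge and capabilities it acquired during pretraining. In contrast, Wei et al. (2021) and Sanh et al. (2022) argue that scaling up the data quantity is a key factor for success. Addressing these opposing views, Ji et al. (2023) study 12 real-world use cases and report that more data brings continuous improvement on tasks such as extraction, classification, closed QA, and summarization, but little improvement on math, code, and chain-of-thought.
3.2 Data Quality
Data quality is an important factor in the SFT of LLMs. Indicators used to assess it include the number of tokens, reward score, and the Measure of Textual Lexical Diversity (Cao et al., 2023). Using an LLM itself to evaluate data quality has also been studied. For example, Li et al. (2023a) use a language model to assign quality scores to instructions and iteratively improve the model's predictions.
3.3 Task Composition
Multi-task supervised fine-tuning has been proposed as a promising approach to improving the generalization of LLMs across NLP tasks. Wang et al. (2022) and Sanh et al. (2022) show experimentally, for models of various sizes, that SFT over many tasks improves generalization. However, Dong et al. (2023) study how the task composition ratio affects specific abilities and point out that conflicts between abilities can arise as the amount of mixed data grows.
3.4 Data-Efficient Learning
Building on the effects of data quantity, quality, and task composition discussed above, methods have been proposed to fine-tune LLMs more efficiently. For example, AlShikh et al. (2023) use a binary classifier that predicts whether a response is "answer-like" to define the Instruction Following Score (IFS), and propose using it as an early-stopping criterion that avoids fine-tuning on the full dataset. In addition, Zhou et al. (2023b) propose a new method that selects SFT data using learnability as the main criterion.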
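[Index marker: SFT data composition]
4. Challenges and Future Directions
[Index marker: data composition considerations]
5. Related Surveys
Growing attention to LLMs: As interest in LLMs grows, several surveys covering various aspects of the full LLM life cycle have been published. For example, Zhao et al. (2023a) broadly discuss the development and recent advances of LLMs.
How this survey differs: This survey provides a systematic and detailed overview of how data management is organized in the pretraining and SFT stages of LLMs and of the effects of various data management strategies. It offers a useful guiding resource for practitioners who want to build powerful LLMs through effective and efficient data management.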
Large Language Models (LLMs) have shocked the natural language processing (NLP) community with their strong performance and emergent abilities (OpenAI, 2023; Touvron et al., 2023a; Wei et al., 2022). According to previous studies (Kaplan et al., 2020; Hoffmann et al., 2022), LLMs’ achievements depend heavily on self-supervised pretraining over vast volumes of processed text data. Recent research (Zhou et al., 2023a; Ouyang et al., 2022) further enhances LLMs’ instruction-following ability and performance on downstream tasks through Supervised Fine-Tuning (SFT) on deliberately curated instruction datasets.
Organizing a well-suited training dataset from collected data, which we define as data management, is vitally important and challenging in both the pretraining and SFT stages of LLMs. In the pretraining stage, constructing datasets with high-quality data is essential for efficient training (Jain et al., 2020; Gupta et al., 2021). To equip LLMs with diverse and comprehensive abilities, heterogeneous dataset composition with mixtures of domains is also required (Gao et al., 2020; Longpre et al., 2023b; Shen et al., 2023). However, many prominent LLMs either do not disclose (Anil et al., 2023; OpenAI, 2023) or only document the techniques used in the construction of their pretraining datasets (Brown et al., 2020; Workshop et al., 2022; Touvron et al., 2023a), leaving the rationale and effect of particular data management strategies unexplained. In the SFT stage, the performance and instruction-following abilities of LLMs are largely evoked by carefully constructed instruction datasets (Sanh et al., 2022; Ouyang et al., 2022). Although a handful of instruction datasets and benchmarks have been built through human crowd-sourcing (Wang et al., 2022; Köpf et al., 2023), self-instruct (Wang et al., 2023c; Taori et al., 2023), or collection of existing datasets (Si et al., 2023; Anand et al., 2023), practitioners are still unsure how instruction datasets affect the performance of fine-tuned LLMs, which makes it difficult to choose proper data management strategies in LLM SFT practice.
To address these challenges, it is necessary to conduct a systematic analysis on LLM data management, including the rationale behind management strategy selection and its consequential effect, the evaluation of curated training datasets, and the pursuit of improved strategies. Therefore, this survey aims to provide a comprehensive overview of current research in LLM data management, as shown in Figure 1. In Section 2, we focus on LLM pretraining data management, including the research on data quantity, data quality, domain composition, and data management systems. In Section 3, we discuss the data quantity, data quality, task composition, and data-efficient learning in the SFT stage of LLMs. In Section 4, looking into the future, we present the existing challenges and promising future directions in training data management for LLMs.
Figure 1: Taxonomy of research in data management for pretraining and supervised fine-tuning of Large Language Models (LLM).
Through this survey, we are devoted to offering a guiding resource to practitioners attempting to build powerful LLMs with effective and efficient data management practices.
Data management is found to be important in the pretraining stage of many prominent LLMs (OpenAI, 2023; Touvron et al., 2023a; Wei et al., 2022). Understanding the effects of these data management strategies is also crucial for building strong LLMs. However, most of the prominent LLMs do not report their data management procedures or only report the strategies they adopted without discussing the reason for choosing the specific strategies. Thus, some researchers try to disclose the working scheme of data management in the pretraining stage of LLMs. In this section, we first review the research studying pretraining dataset scaling law with/without data repetition. Then, data quality regarding deduplication, quality filtering, toxicity filtering, social bias, and data diversity and age are explored. After that, domain composition and domain re-weighting methods are discussed. Finally, two recently proposed data management systems are introduced to implement the pretraining data management pipelines.
The amount of data required for efficient pretraining of LLMs is an ongoing research topic in NLP communities. Scaling laws are proposed to depict the relationship between model size and training dataset size. According to the proposed scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022), with model size continuously increasing, the demand for more training data will also increase consistently. Thus, the exhaustion of text data draws researchers’ attention to data repetition in the LLMs’ pretraining.
Before the popularization of LLMs, the relationship between training dataset size and the performance of Transformer-based language models (Vaswani et al., 2017) had already attracted researchers’ attention. Kaplan et al. (2020) use Transformers and cross-entropy loss to study the empirical scaling laws for language model performance and find that model performance has a power-law relationship with training dataset size or model size, respectively, when not bottlenecked by each other or by the training compute budget:
\[L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D} \tag{1}\]
\[L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N} \tag{2}\]
\[L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D}\right]^{\alpha_D} \tag{3}\]
where \(L\) is the test loss, \(D\) is the number of training tokens, \(N\) is the number of model parameters, \(\alpha_D\) and \(\alpha_N\) are the power-law exponents for the scaling of \(D\) and \(N\) respectively, and \(D_c\) and \(N_c\) are constants measured in the number of training tokens and non-embedding parameters respectively, whose precise numerical values depend on vocabulary size and tokenization and do not have fundamental meaning.
Combining Equations 1 and 2, they derive Equation 3 to depict the dependence between model size \(N\) and dataset size \(D\). Fitting Equation 3, they obtain \(\alpha_D = 0.103\), \(\alpha_N = 0.076\), \(N_c = 6.4 \times 10^{13}\), and \(D_c = 1.8 \times 10^{13}\), and conclude that model performance improves predictably as long as the model size and training dataset size are scaled up simultaneously, but encounters overfitting if either of them is fixed while the other increases. Given a fixed compute budget \(C\), they analyze the optimal allocation \(D_{opt} \sim C^{0.27}\) and \(N_{opt} \sim C^{0.73}\), showing that the model size should increase faster than the training dataset size.
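To make the fitted constants concrete, the following is a minimal sketch that evaluates the joint loss, assuming Equation 3 takes the form reported by Kaplan et al. (2020), \(L(N, D) = [(N_c/N)^{\alpha_N/\alpha_D} + D_c/D]^{\alpha_D}\). The function name and the example model and dataset sizes are illustrative and not taken from the source.

```python
# Fitted constants quoted in the text for Kaplan et al. (2020).
ALPHA_N = 0.076
ALPHA_D = 0.103
N_C = 6.4e13  # measured in non-embedding parameters
D_C = 1.8e13  # measured in training tokens


def kaplan_loss(n_params: float, n_tokens: float) -> float:
    """Joint loss L(N, D) = [(N_c / N)^(alpha_N / alpha_D) + D_c / D]^alpha_D."""
    model_term = (N_C / n_params) ** (ALPHA_N / ALPHA_D)
    data_term = D_C / n_tokens
    return (model_term + data_term) ** ALPHA_D


if __name__ == "__main__":
    # Scaling N and D together lowers the predicted loss.
    print(kaplan_loss(1e8, 1e9), kaplan_loss(1e9, 1e10))
    # Growing N alone saturates just above the data-limited value (D_c / D)^alpha_D.
    print(kaplan_loss(1e12, 1e9), (D_C / 1e9) ** ALPHA_D)
```

Under this form, enlarging the model with the dataset held fixed cannot push the predicted loss below the data-limited term, which is one way to read the observation that both quantities need to grow together.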
Following the power-law relationship proposed by Kaplan et al. (2020), Hoffmann et al. (2022) conduct experiments on much larger language models and arrive at a new scaling law:
\[L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} \tag{4}\]
where \(E = 1.69\), \(A = 406.4\), \(B = 410.7\), \(\alpha = 0.34\), and \(\beta = 0.28\). They also analyze the optimal allocation \(N_{opt} \sim C^{0.46}\) and \(D_{opt} \sim C^{0.54}\). Hence, they draw a different conclusion: model size and pretraining dataset size should scale at roughly the same rate as the compute budget grows.
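The allocation exponents can be related to the fitted constants with a short calculation. The sketch below assumes the parametric form \(L(N, D) = E + A/N^{\alpha} + B/D^{\beta}\) and the common approximation \(C \approx 6ND\) for training compute, which is an assumption not stated in this section. Minimizing the loss under that constraint gives \(N_{opt} \propto C^{\beta/(\alpha+\beta)}\) and \(D_{opt} \propto C^{\alpha/(\alpha+\beta)}\). With the rounded constants quoted above, this yields exponents of roughly 0.45 and 0.55, close to the reported 0.46 and 0.54. Function names are illustrative.

```python
from typing import Tuple

# Fitted constants quoted in the text for Hoffmann et al. (2022).
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Parametric loss L(N, D) = E + A / N^alpha + B / D^beta."""
    return E + A / n_params ** ALPHA + B / n_tokens ** BETA


def allocation_exponents(alpha: float = ALPHA, beta: float = BETA) -> Tuple[float, float]:
    """Exponents (a, b) such that N_opt ~ C^a and D_opt ~ C^b, assuming C = 6 * N * D."""
    return beta / (alpha + beta), alpha / (alpha + beta)


def optimal_allocation(compute: float) -> Tuple[float, float]:
    """Closed-form minimizer of the parametric loss under the assumed constraint C = 6 * N * D."""
    a, b = allocation_exponents()
    g = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    n_opt = g * (compute / 6.0) ** a
    d_opt = (compute / 6.0) ** b / g
    return n_opt, d_opt


if __name__ == "__main__":
    print(allocation_exponents())
    print(optimal_allocation(1e21))
```

Because the two exponents are close to each other, the optimal model size and token count grow at similar rates with compute, in contrast to the 0.73 and 0.27 split attributed to Kaplan et al. (2020).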
While Kaplan et al. (2020) and Hoffmann et al. (2022) focus on scaling law with unique data trained only for one epoch, Hernandez et al. (2022) address the issue about text overlap in the training dataset and study the scaling law with a small fraction of repeated data. They observe a strong double descent phenomenon (Nakkiran et al., 2021) caused by repeated data, where a peak of test loss appears in the middle range of repetition frequency, i.e., the number of epochs trained on repeated data. They also show that repeated data can cause a divergence from power-scaling law (Kaplan et al., 2020) on model sizes larger than 100M parameters.
According to the scaling law, more training data is required as the model size grows, raising concerns about the exhaustion of high-quality training data (Villalobos et al., 2022; Hoffmann et al., 2022). One straightforward way to overcome this issue is to train on data repeatedly. However, repeated data subsets may lead to performance degradation as claimed by Hernandez et al. (2022). Motivated by this contradiction, several works study the consequence of pretraining on the whole datasets repeatedly for multiple epochs. Muennighoff et al. (2023) find that with constrained data and fixed compute budgets, repeatedly training on the whole dataset up to 4 epochs only causes trivial harm to test loss compared to training on unique new data. They also propose a scaling law on repeated training depicting the diminishing of returns with more repetition and larger model sizes. Xue et al. (2023) also observe a multi-epoch degradation in model performance and find that dataset size, model parameters, and training objectives are the key factors to this phenomenon. They further find that commonly used regularization techniques are not helpful in alleviating multi-epoch degradation, except for dropout. Instead of simply repeating over the whole dataset, Tirumala et al. (2023) show that repeatedly training on carefully selected data can outperform that on randomly selected new data, whilst repeatedly training on randomly selected data cannot, suggesting a feasible way of repeating on intelligently selected data.
High-quality data is crucial in the training of machine learning tasks according to previous studies (Jain et al., 2020; Gupta et al., 2021). In the pretraining of LLMs, quality assurance techniques are also adopted and usually form a data management pipeline (Rae et al., 2021; Nguyen et al., 2023; Tirumala et al., 2023), including deduplication, quality filtering, and toxicity filtering. Other aspects like social bias, data diversity, and data age are also influential topics in the research community.
Deduplication is widely used in data management procedures of many prominent LLMs and preprocessing of publicly available datasets (Brown et al., 2020; Workshop et al., 2022; Touvron et al., 2023a; Raffel et al., 2020). Lee et al. (2021) use N-gram similarity with MinHash (Broder, 1997) to detect duplications in training datasets and find that deduplication is beneficial in memorization mitigation, train-test overlap avoidance, and training efficiency improvement while keeping model perplexity. Kandpal et al. (2022) also show that deduplication can considerably lower the success rate of privacy attacks aiming at model memorization.
Among practices of deduplication, N-gram-and-hashing is the most commonly adopted technique (Lee et al., 2021; Borgeaud et al., 2022; Rae et al., 2021). Silcock et al. (2022) compare it with two neural approaches (a contrastively trained bi-encoder and a “re-rank” style approach combining a bi- and cross-encoder) and conclude that neural approaches can significantly outperform traditional N-gram-and-hashing methods. Abbas et al. (2023) propose SemDeDup to remove semantic duplicates that lie close together in the pre-trained model embedding space and apply clustering to reduce the search computation.
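The N-gram-and-hashing idea can be illustrated with a small MinHash sketch. This is a simplified illustration, not the pipeline of Lee et al. (2021) or any other cited work: the word-level 3-gram shingles, the 128 hash functions, the 0.8 threshold, and all function names are assumptions chosen for readability, and the pairwise comparison would be replaced by locality-sensitive hashing at real corpus scale.

```python
import hashlib
from typing import List, Set


def word_ngrams(text: str, n: int = 3) -> Set[str]:
    """Return the set of word-level n-grams (shingles) of a document."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def _hash(shingle: str, seed: int) -> int:
    """Seeded 64-bit hash built from BLAKE2b (one hash function per seed)."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingles: Set[str], num_perm: int = 128) -> List[int]:
    """MinHash signature: the minimum hash value under each seeded hash function."""
    return [min(_hash(s, seed) for s in shingles) for seed in range(num_perm)]


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs: List[str], threshold: float = 0.8) -> List[str]:
    """Keep a document only if it is not a near-duplicate of one already kept.

    Empty documents are skipped because they have no shingles to hash.
    """
    kept, kept_sigs = [], []
    for doc in docs:
        shingles = word_ngrams(doc)
        if not shingles:
            continue
        sig = minhash_signature(shingles)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept
```

The signature length trades accuracy of the similarity estimate against memory, and the threshold controls how aggressive the deduplication is.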
Quality filtering is another key step in constructing a high-quality pretraining dataset, because public datasets like Common Crawl (see footnote 1) and multilingual datasets (Kreutzer et al., 2022) usually contain low-quality data that hampers the training of LLMs. Existing works usually perform quality filtering using a classifier (Brown et al., 2020; Gao et al., 2020; Du et al., 2022; Touvron et al., 2023a), handcrafted heuristics (Yang et al., 2019; Raffel et al., 2020; Nijkamp et al., 2022), or threshold filtering on criteria such as perplexity (Wenzek et al., 2020; Muennighoff et al., 2023). Kaddour (2023) constructs a subset of the Pile (Gao et al., 2020) called MiniPile by filtering out low-quality embedding clusters.
Quality filtering has generally been shown to improve model performance (Longpre et al., 2023b), despite reducing the quantity and variety of training data. The lightweight language models phi-1 (Gunasekar et al., 2023) and phi-1.5 (Li et al., 2023b), with 1.3B parameters trained on carefully selected high-quality data and synthetically generated data, show outstanding performance on coding and commonsense reasoning tasks. Recently, Microsoft published a new 2.7B language model, phi-2 (Javaheripi and Bubeck, 2023), with better language understanding and logical reasoning abilities. Penedo et al. (2023) construct the RefinedWeb dataset, consisting of properly filtered and deduplicated high-quality web data, and models trained on it outperform models trained on the Pile (Gao et al., 2020). However, Gao (2021) finds that aggressive filtering can lead to performance degradation on a wide range of tasks for GPT-like LLMs because the filtering proxy objectives are not representative enough of the true objective. To address this issue, Marion et al. (2023) comprehensively examine combinations of three data quality estimators (perplexity, Error L2-Norm (EL2N), and memorization factor), three regions of retained data (the top, middle, and bottom portions when ordered by the corresponding estimator), and different percentages of retained data after pruning. Surprisingly, they find that pruning datasets based on perplexity and retaining the middle portion of the data performs far better than more complicated techniques like memorization. However, no combination of pruning strategies achieves consistently high performance.
Footnote 1: Common Crawl (https://commoncrawl.org/), a large text corpus.
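The score-based pruning setup described above (rank samples by a quality estimator, then keep the bottom, middle, or top portion) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure of Marion et al. (2023): the scores are taken as given (for example, perplexities from a reference model), regions are defined on the ascending score order, and the function name and rounding rule are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def prune_by_score(samples: Sequence[T], scores: Sequence[float],
                   keep_fraction: float, region: str = "middle") -> List[T]:
    """Keep `keep_fraction` of the samples from the bottom, middle, or top of the score ranking."""
    if len(samples) != len(scores):
        raise ValueError("samples and scores must have the same length")
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Indices sorted by ascending score, e.g. lowest perplexity first.
    order = sorted(range(len(samples)), key=lambda i: scores[i])
    n_keep = max(1, round(keep_fraction * len(samples)))
    if region == "bottom":
        start = 0
    elif region == "top":
        start = len(samples) - n_keep
    elif region == "middle":
        start = (len(samples) - n_keep) // 2
    else:
        raise ValueError("region must be 'bottom', 'middle', or 'top'")
    # Return the retained samples in their original dataset order.
    kept = sorted(order[start:start + n_keep])
    return [samples[i] for i in kept]
```

For instance, `prune_by_score(docs, perplexities, 0.5, "middle")` would discard both the lowest-scoring and highest-scoring quarters of a corpus, where `docs` and `perplexities` are placeholders for a real dataset and its scores.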
Toxicity refers to text content that is “rude, disrespectful, or unreasonable language that is likely to make someone leave a discussion” (Gehman et al., 2020; Welbl et al., 2021). As raw text corpora usually contain toxic text (Luccioni and Viviano, 2021; Longpre et al., 2023b), toxicity filtering aims to remove undesirable toxic text from pretraining datasets, thereby preventing LLMs from generating toxic utterances. Similar to quality filtering, heuristic and rule-based filtering (Lees et al., 2022; Gargee et al., 2022; Friedl, 2023) and N-gram classifiers (Raffel et al., 2020) are usually adopted as toxicity filters. Although toxicity filtering is effective in detoxifying models, Longpre et al. (2023b) discover that it reduces the risk of toxic generation at the cost of model generalization and toxicity identification ability. Moreover, Xu et al. (2021) and Welbl et al. (2021) both find that training dataset detoxification leads to the marginalization of minority voices, such as dialects and minority identity mentions.
Besides the marginalization of minority groups caused by data detoxifying, several works (Kurita et al., 2019; Nangia et al., 2020; Meade et al., 2022; Feng et al., 2023) find that pre-trained LLMs can capture social biases contained in the large amounts of training text. Evaluating on the C4.EN (Raffel et al., 2020) dataset, Dodge et al. (2021) recommend documenting the social biases and representational harms as well as excluded voices and identities in large web text corpora. Using a dataset of U.S. high school newspaper articles, Gururangan et al. (2022) also argue that the quality filters used for GPT-3 (Brown et al., 2020) prefer newspapers published by larger schools located in wealthier, educated, and urban ZIP codes, leading to a language ideology. Feng et al. (2023) conduct a comprehensive case study focusing on the effects of media political biases in the pretraining corpus on the fairness of hate speech detection and misinformation detection w.r.t. partisan leanings and how it is propagated to language models even further to downstream tasks.
There are also works focusing on other aspects of data management in the pretraining stage of LLMs. For example, Lee et al. (2023a) show that the format diversity of publicly available pretraining datasets is high when measured by the recently proposed Task2Vec diversity coefficient (Miranda et al., 2022). They also demonstrate the effectiveness of the diversity coefficient by showing its alignment with the number of latent concepts in the text and suggest using it to build more diverse datasets. Maharana et al. (2023) propose a novel pruning method, D2 Pruning, to balance data diversity and difficulty in data selection by representing a dataset as an undirected graph with samples as nodes and difficulty scores as node properties. Each sample is connected to its k nearest neighbors, with edges weighted by distances in the embedding space. Then, a forward and reverse message passing strategy is adopted to select a subgraph covering both diverse and difficult data samples.
Longpre et al. (2023b) explore the age of the evaluation dataset and draw conclusions that the temporal shift between evaluation and pretraining data will lead to inaccurate performance estimation and the temporal misalignment might not be overcome by fine-tuning, especially for larger models.
Publicly available pretraining datasets usually contain mixtures of data collected from multiple sources and domains. For example, the Pile (Gao et al., 2020) contains web documents from Common Crawl, Wikipedia, Books, and collections from medical, academic, coding and math, legal, and social resources. Many prominent models are also trained on a mixture of data from different domains. For example, LaMDA (Thoppilan et al., 2022) is trained on dialog data from public forums, C4 data, Wikipedia (English), web documents, and code documents from programming-related Q&A sites, tutorials, etc.
Efforts have been made to explore the impact of domain mixtures on pre-trained model performance. Longpre et al. (2023b) group the Pile (Gao et al., 2020) data into nine domains and conduct ablate-one-at-a-time experiments to show the impact of different domains. They conclude that the domains with high quality (Books) and high diversity (Web) are broadly helpful, and that it is beneficial to include as many data sources as possible even when they are less relevant to the downstream tasks. SlimPajama-DC (Shen et al., 2023) arrives at the same conclusion that including all domains typically achieves better results than deliberately selected domain combinations, provided that global deduplication is conducted to remove overlaps among different domain datasets. Both Longpre et al. (2023b) and Shen et al. (2023) agree that specific mixtures may excel in evaluation benchmarks for targeted tasks, but the former claims that the inclusion of diverse web domains may perform better than specific mixtures in certain tasks. CodeGen2 (Nijkamp et al., 2023) studies the impact of mixtures of programming and natural languages on model performance and finds that models trained with mixtures do not outperform, but perform close to, domain-matched models given the same compute budget.
Several methods have also been proposed to find proper domain composition weights. DSIR (Xie et al., 2023b) formulates the problem as a distribution matching problem given a large raw unlabeled dataset and some unlabeled target samples. Specifically, it leverages the classic importance resampling approach (Rubin, 1988) and estimates the importance weights using n-gram features and KL reduction. Without knowledge of downstream tasks or target distributions, DoReMi (Xie et al., 2023a) trains a small proxy model using Group Distributionally Robust Optimization (Group DRO) (Oren et al., 2019; Sagawa et al., 2020) to generate domain weights. It improves model performance on all domains by up-weighting domains with the largest loss gap between the evaluated model and a pre-trained reference model. Building on DoReMi (Xie et al., 2023a), Fan et al. (2023) propose DoGE, which reweights training domains to minimize the average validation loss across all training domains or on a specific unseen domain. The key component of their generalization objective is a gradient-based generalization estimation function measuring the contribution of each domain to other domains. Domains that contribute more to learning other domains then receive larger weights, as do those that are difficult to learn.
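The importance resampling idea behind data selection of this kind can be sketched in a few lines. This is a heavily simplified illustration in the spirit of the description above, not the DSIR implementation: the CRC32 hashing, the bucket count, the add-one smoothing, the Gumbel top-k sampling step, and all function names are assumptions made for the example.

```python
import math
import random
import zlib
from collections import Counter
from typing import List

NUM_BUCKETS = 2 ** 16  # illustrative size of the hashed feature space


def hashed_ngrams(text: str) -> Counter:
    """Count unigrams and bigrams after hashing them into a fixed number of buckets."""
    tokens = text.lower().split()
    grams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(zlib.crc32(g.encode("utf-8")) % NUM_BUCKETS for g in grams)


def fit_bucket_log_probs(docs: List[str]) -> List[float]:
    """Fit a bag-of-hashed-n-grams model with add-one smoothing; return log-probabilities."""
    counts: Counter = Counter()
    for doc in docs:
        counts.update(hashed_ngrams(doc))
    total = sum(counts.values()) + NUM_BUCKETS
    return [math.log((counts[b] + 1) / total) for b in range(NUM_BUCKETS)]


def log_importance_weight(text: str, log_p_target: List[float],
                          log_p_raw: List[float]) -> float:
    """log(p_target(x) / p_raw(x)) under the two bag-of-n-grams models."""
    return sum(c * (log_p_target[b] - log_p_raw[b])
               for b, c in hashed_ngrams(text).items())


def importance_resample(raw_docs: List[str], target_docs: List[str],
                        k: int, seed: int = 0) -> List[str]:
    """Sample k raw documents without replacement, favoring high importance weights."""
    rng = random.Random(seed)
    log_p_target = fit_bucket_log_probs(target_docs)
    log_p_raw = fit_bucket_log_probs(raw_docs)
    keyed = []
    for doc in raw_docs:
        # Gumbel noise turns a top-k selection into sampling without replacement.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        keyed.append((log_importance_weight(doc, log_p_target, log_p_raw) + gumbel, doc))
    keyed.sort(key=lambda item: item[0], reverse=True)
    return [doc for _, doc in keyed[:k]]
```

Documents whose n-gram statistics resemble the target samples receive higher weights and are therefore more likely to be selected from the raw pool.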
To address the difficulty of pretraining data management, integrated data management systems are beneficial for LLM practitioners with different demands. Chen et al. (2023a) provide a data processing system, Data-Juicer, featuring the generation of diverse data recipes. They provide over 50 versatile data management operators and dedicated tools targeting zero-code data processing, low-code customization, and off-the-shelf data processing components. A timely feedback loop at multiple development stages of data recipes and LLMs is also supported. Zhou et al. (2023c) also propose a pretraining data curation and assessment system, Oasis, which contains an interactive modular rule filter module, a debiased neural quality filter module, an adaptive document deduplication module, and a holistic data assessment module.
Based on the general knowledge and capabilities learned in the pretraining stage, supervised fine-tuning (SFT) is proposed to further improve LLMs with instruction-following ability and alignment with human expectations (Wei et al., 2021; Sanh et al., 2022; Ouyang et al., 2022). Many efforts have been made to construct instruction data using human crowd-sourcing (Wang et al., 2022; Köpf et al., 2023), self-instruct (Wang et al., 2023c; Taori et al., 2023), or collection of existing datasets (Si et al., 2023; Anand et al., 2023). Although LLMs fine-tuned with existing instruction datasets have achieved remarkable performance in various NLP tasks, the impact of instruction data management on fine-tuned model performance is still under debate. Consistent with the previous discussion regarding LLM pretraining, in this section we organize research on LLM SFT into data quantity, data quality (instruction quality, diversity, complexity, prompt design), and task composition. Data-efficient SFT is also included to discuss current efforts on efficient SFT from the data perspective.
Explorations of the relationship between instruction data quantity and fine-tuned model performance diverge in two directions. One branch of research focuses on scaling down the instruction data quantity to improve training efficiency. For example, LIMA (Zhou et al., 2023a) carefully curated 1,000 high-quality samples and experimentally justified the hypothesis that only limited instruction tuning data is needed to expose the knowledge and capabilities that the LLM has already acquired during pretraining. Chen et al. (2023b) observe that a single instruction may be sufficient for fine-tuning an LLM on a single task, and that 16K samples with 1.9M tokens may be sufficient to train a model specialized in the natural language inference (NLI) task. Another branch of research argues that scaling up the instruction data quantity is crucial for success (Wei et al., 2021; Sanh et al., 2022).
Addressing this conflict, several works attempt to analyze the scaling pattern for different tasks or model abilities. Ji et al. (2023) conduct an empirical study on 12 major real-world online use cases and show that scaling up the instruction data leads to continuous improvement in tasks such as extraction, classification, closed QA, and summarization, while leading to little improvement in tasks such as math, code, and chain-of-thought. Disagreeing with Ji et al. (2023), Dong et al. (2023) find that general ability can be enhanced with about 1,000 samples and improves slowly thereafter, while mathematical reasoning and code generation improve consistently as the amount of data increases. Similarly, Yuan et al. (2023) observe a log-linear relation between instruction data amount and models’ mathematical reasoning performance, but stronger pretrained models improve less with larger fine-tuning datasets. Song et al. (2023) conduct experiments covering ten distinct in-domain abilities and three out-of-domain abilities and show that most abilities improve consistently with data scaling. Still, each ability develops at a different pace during instruction tuning, and some abilities show completely different patterns.
Data quality is always a focal point in the SFT of LLMs, addressing instruction quality, diversity, complexity, and prompt design. Here, we focus more on the management and analysis of existing instruction data instead of instruction generation methods discussed in previous surveys (Zhang et al., 2023b; Wang et al., 2023e).
Many researchers have found that the quality of instruction data is one of the most important factors in improving model performance (Chia et al., 2023; Zhou et al., 2023a; Ding et al., 2023). During the construction of instruction data, there is usually a filtering step to select high-quality instructions generated by models. Wang et al. (2023d) use perplexity as the criterion to select the most appropriate instructions from the pool of candidate instructions generated by open-source models. Cao et al. (2023) propose an automatic data selector InstructionMining to evaluate instruction data quality without human experts’ interventions. They first hypothesize that the inference loss of a fine-tuned model on an evaluation set can serve as a proxy for data quality. Then, they use a set of natural language indicators to predict the inference loss to save the efforts in actually fine-tuning an LLM, including the number of tokens in tokenized inputs and outputs, reward score (Köpf et al., 2023), perplexity, Measure of Textual Lexical Diversity (McCarthy and Jarvis, 2010), distance to approximate i-th nearest neighbors (Dong et al., 2011) in SentenceBERT (Reimers and Gurevych, 2019) embedding space, and the naturalness, coherence, and understandability scores provided by UniEval dialogue model (Zhong et al., 2022).
Instead of using indicators to filter low-quality instructions, several works (Li et al., 2023a; Lu et al., 2023a; Ye et al., 2023; Madaan et al., 2023) leverage the power of the fine-tuned LLM itself to evaluate the quality of instructions. Li et al. (2023a) assign quality scores to augmented instructions using the language model and iteratively improve model prediction. Similarly, SELF (Lu et al., 2023a) and Self-Refine (Madaan et al., 2023) prompt LLMs to provide self-feedback on their own responses. Strong LLMs like ChatGPT are also adopted as quality judges during the instruction collection process (Ye et al., 2023).
The intention and semantic diversity of instructions is another important factor that has shown a positive effect on model performance (Zhou et al., 2023a; Ding et al., 2023; Taori et al., 2023). Self-Instruct (Wang et al., 2023c) adopts ROUGE-L similarity to filter out newly generated instructions that are too similar to existing ones. To better evaluate the instruction diversity of SFT datasets, #InsTag (Lu et al., 2023b) is proposed as an open-set fine-grained tagger using ChatGPT (see footnote 2). Specifically, it first prompts ChatGPT to provide tags for given queries in an open setting, then performs a normalization procedure to deal with the noise in the raw tagging, including low-frequency filtering, rule aggregation, semantic aggregation with embedding clustering, and association aggregation that merges associated tags together. With the generated tags, they quantify instruction diversity as the unique tag coverage rate for the overall tag set. They also analyze popular open-set SFT datasets and show that larger datasets tend to be more diverse and induce higher performance.
Footnote 2: https://chatgpt.openai.com/
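The ROUGE-L similarity filter mentioned above for Self-Instruct can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization, the F1 variant of ROUGE-L, and a similarity threshold of 0.7 exposed as a parameter. The function names are hypothetical and the code is not taken from the Self-Instruct codebase.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if tok_a == tok_b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 score between two strings, using lowercase whitespace tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def filter_novel(pool: List[str], candidates: List[str],
                 threshold: float = 0.7) -> List[str]:
    """Add a candidate to the pool only if it is below the threshold against every pool entry."""
    pool = list(pool)
    for cand in candidates:
        if all(rouge_l_f1(cand, existing) < threshold for existing in pool):
            pool.append(cand)
    return pool
```

Each accepted candidate joins the pool immediately, so later candidates are also compared against instructions generated earlier in the same round.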
Though important, diversity can be challenging in domain-specific tasks due to data constraints. Wan et al. (2023) propose an approach called Explore-Instruct to enlarge the data coverage through active exploration via LLMs. Explore-Instruct starts from representative domain use cases and searches the variations and possibilities by looking ahead into potential fine-grained subtasks and backtracking to alternative branches in the search space.
The complexity of instructions also attracts researchers’ attention, especially in developing LLMs with complex instruction-following and reasoning abilities (Xu et al., 2023a; Luo et al., 2023; Mukherjee et al., 2023). Several works endeavor to quantify and evaluate instruction complexity. Using aforementioned tags, #InsTag (Lu et al., 2023b) quantifies complexity as the average tag number assigned to queries in a dataset. He et al. (2023) evaluate complex instruction with eight features, i.e., multi-turn, length, noise, and heterogeneous information for input text, multi-tasking, semantics, formats, and quantity constraints for task description.
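The two tag-based measures attributed to #InsTag, diversity as the coverage of the overall tag set and complexity as the average number of tags per query, reduce to simple set arithmetic. The sketch below assumes tags have already been produced and normalized. The function names and the example tags are hypothetical.

```python
from typing import List, Set


def tag_diversity(dataset_tags: List[Set[str]], overall_tags: Set[str]) -> float:
    """Fraction of the overall tag set covered by at least one query in the dataset."""
    if not overall_tags:
        return 0.0
    covered = set().union(*dataset_tags) & overall_tags
    return len(covered) / len(overall_tags)


def tag_complexity(dataset_tags: List[Set[str]]) -> float:
    """Average number of tags assigned to a query in the dataset."""
    if not dataset_tags:
        return 0.0
    return sum(len(tags) for tags in dataset_tags) / len(dataset_tags)


if __name__ == "__main__":
    # Hypothetical tags for three queries and a hypothetical overall tag set.
    tags = [{"math", "code"}, {"math"}, {"translation", "style", "code"}]
    overall = {"math", "code", "translation", "style", "summarization", "qa"}
    print(tag_diversity(tags, overall), tag_complexity(tags))
```

Both measures depend on the quality of the tagging and normalization steps, so they are best compared across datasets tagged by the same procedure.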
To delve into the exploration of instruction complexity, Zhao et al. (2023b) propose Tree-Instruct to enhance the complexity of instruction data controllably. It treats the instruction as a semantic tree and constructs new complex instructions by adding nodes to the tree. Thus, the complexity of instructions can be controlled by adjusting the number of added nodes. Through experiments, they find that increased complexity can lead to continuing performance improvement. Moreover, the improvement does not come from the increased number of tokens, as a few complex instructions still outperform diverse but simple instructions under the same token budget. Curriculum instruction tuning ranging from easy to difficult might not be as helpful as expected, indicating the necessity of enhancing complexity.
Another method of increasing complexity is Evol-Instruct (Xu et al., 2023a; Luo et al., 2023), which rewrites instructions step by step with operations such as increasing reasoning, adding constraints, in-breadth evolving, deepening, and complicating input with code and tables. Similarly, Jiang et al. (2023) incrementally augment instructions with constraints on content, situation, style, format, and example, and propose FollowBench to evaluate LLMs’ constraint-following ability.
Current instructions are either heuristically designed by humans (Wang et al., 2022; Köpf et al., 2023) or synthetically generated by prominent models (Peng et al., 2023; Ding et al., 2023). However, the same intention and semantic meaning can be phrased as various prompts, and the choice of prompt can cause significant variation in model performance (Gonen et al., 2022; Weber et al., 2023). Thus, which kinds of instruction prompts are better for LLM training may be vital, yet this question has not been fully explored.
Early attempts include manually reformulating prompts into ones that are easier for language models to follow (Mishra et al., 2022) and choosing the prompts with the lowest perplexity to obtain the largest gains in model performance (Gonen et al., 2022). Khashabi et al. (2022) surprisingly find that the discretized interpretation of continuous prompts is not always consistent with the discrete prompts describing the same task, contrary to heuristic expectation. To identify which parts of the instructions matter most in LLM fine-tuning, Yin et al. (2023b) and Kung and Peng (2023) both conduct ablation analyses that remove content from task definitions. Yin et al. (2023b) find that removing the descriptions of task output, especially the label information, might be the only reason for performance degradation. They also propose an automatic task definition compression algorithm that removes almost half or more of the tokens while improving model performance. Kung and Peng (2023) remove all semantic components in task definitions except the output space information. They achieve comparable model performance using the modified task definitions and delusive examples containing incorrect input-output mappings. Based on this finding, they cast doubt on the performance gain of fine-tuned models and state that the model may only learn superficial patterns during instruction tuning. Addressing the issue of instruction format inconsistency, Liang et al. (2023) develop a format transfer framework, UIT, to automatically transfer instructions from different datasets into unified formats.
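The idea of choosing the lowest-perplexity prompt can be made concrete with a small sketch. It assumes that per-token natural-log probabilities for each prompt have already been obtained from some language model; the prompt strings, the numbers, and the function names are all hypothetical, and this is not the exact procedure of Gonen et al. (2022).

```python
import math


def perplexity(token_logprobs):
    """Perplexity as exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pick_lowest_perplexity(prompt_logprobs):
    """Return the prompt whose token log-probs give the lowest perplexity."""
    return min(prompt_logprobs, key=lambda p: perplexity(prompt_logprobs[p]))


# Hypothetical per-token log-probabilities for two phrasings of one task.
prompt_logprobs = {
    "Is this review positive or negative?": [-0.5, -0.5, -0.5],
    "Sentiment polarity of the text:": [-1.0, -1.0],
}
best = pick_lowest_perplexity(prompt_logprobs)
```

Because perplexity normalizes by the number of tokens, prompts of different lengths can be compared directly.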
Besides the choice of phrasing, the generation source of prompts is another factor in prompt design. Gudibande et al. (2023) question the practice of fine-tuning a weaker language model on the outputs of a stronger model and find that the imitation model might adapt to mimic the stronger model’s style but not its functionality, indicating the potential challenge of closing the performance gap between open- and closed-source models through imitation instruction tuning. Similarly, Song et al. (2023) observe that human-designed data can outperform data synthetically generated by GPT-4 (OpenAI, 2023) to a relatively large extent.
Since LLMs have shown surprisingly emergent abilities in handling various NLP tasks, multitask fine-tuning appears to be a promising approach to improve LLMs’ generalization performance on unseen tasks. The benefits of increasing the number of tasks in SFT have been experimentally proven on models with different sizes ranging from 3B to 540B parameters (Wang et al., 2022; Sanh et al., 2022; Wei et al., 2021; Chung et al., 2022).
Besides scaling the number of tasks, the mixture ratio of different instruction benchmarks and task balancing are also found to be critical for effective instruction fine-tuning (Iyer et al., 2022; Longpre et al., 2023a). Dong et al. (2023) focus on task composition among mathematical reasoning, code generation, and general human-aligning abilities. Compared with training on individual source data, they find that model abilities improve when the amount of mixed data is small but decrease when it is large. The results indicate that conflicts exist among abilities with larger amounts of mixed data. To further explain the conflicts, they vary the ratio of general and specialized data and conclude that the impact of data ratio might lie in the degree of similarity in data format and data distribution among different SFT tasks.
In contrast to combining multiple tasks, some works claim that an LLM tuned on single-task data can outperform an LLM tuned on multiple tasks (Jang et al., 2023; Chen et al., 2023b). Jang et al. (2023) state that training expert LLMs to form an expert library can outperform multitask language models by avoiding negative task transfer, continually learning new tasks without catastrophic forgetting, and improving compositional abilities. Wang et al. (2023b) analyze the factual knowledge, reasoning, multilinguality, coding, and open-ended instruction-following abilities of models trained with 12 instruction datasets and experimentally show that different instruction datasets may correspond to specific abilities. It also appears challenging for any single dataset or combination to win across all evaluations.
Building on the explorations of how data quantity, data quality, and task composition affect model performance discussed previously, some works propose to fine-tune LLMs more efficiently with subset selection or specially designed fine-tuning strategies that address different aspects of instruction data.
Data Quantity AlShikh et al. (2023) introduce a metric to measure LLMs’ instruction-following ability. A binary classifier is used to predict whether the response is “answer-like” or not. Then, Instruction Following Score (IFS) is defined as the percentage of “answer-like” responses in the responses acquired by the whole instruction dataset. They suggest using this metric as an early stopping criterion without tuning on the full-sized dataset. Based on observations of different scaling patterns for different abilities, Dong et al. (2023) propose Dual-stage Mixed Fine-tuning (DMT) strategy to learn specialized abilities and general abilities sequentially while keeping a small proportion of specialized data to prevent forgetting.
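The Instruction Following Score described above can be sketched as follows. The classifier here is a toy heuristic standing in for the trained binary classifier, and the early-stopping rule (`patience`, `min_delta`) is an assumption added for illustration; neither is taken from AlShikh et al. (2023). The score is returned as a fraction in [0, 1] rather than a percentage.

```python
def instruction_following_score(responses, is_answer_like):
    """Fraction of responses that the classifier labels as answer-like."""
    if not responses:
        return 0.0
    return sum(1 for r in responses if is_answer_like(r)) / len(responses)


def toy_is_answer_like(response):
    """Hypothetical stand-in for the trained binary classifier:
    a non-empty response that does not end in a question mark."""
    text = response.strip()
    return bool(text) and not text.endswith("?")


def should_stop(ifs_history, patience=2, min_delta=0.01):
    """Assumed rule: stop when the last `patience` checkpoints have not
    improved on the earlier best IFS by at least `min_delta`."""
    if len(ifs_history) <= patience:
        return False
    best_before = max(ifs_history[:-patience])
    return max(ifs_history[-patience:]) < best_before + min_delta


responses = ["Paris is the capital of France.", "What is the capital of Germany?", "", "42"]
ifs = instruction_following_score(responses, toy_is_answer_like)
```

Tracking `ifs` across checkpoints and calling `should_stop` on the history is one way to use the metric as a stopping signal without training on the full dataset.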
Data Quality Several works focus on selecting a subset of instruction data with the highest quality. Cao et al. (2023) adopt BlendSearch (Wang et al., 2020) to automatically select the best subset. AlpaGasus (Chen et al., 2023c) uses strong LLMs as auto-graders and selects data with scores above a threshold in the Alpaca dataset (Taori et al., 2023). Motivated by the computational overhead caused by data pruning prior to fine-tuning, Attendu and Corbeil (2023) propose a dynamic data pruning method that periodically filters out unimportant examples during SFT using extended versions of EL2N metric (Paul et al., 2021; Fayyaz et al., 2022). Without discarding data samples, OpenChat (Wang et al., 2023a) considers the general SFT data as a mixture of a small amount of expert data and a large amount of sub-optimal data without any preference labels. Then, Conditioned-RLFT strategy is proposed, which treats different data sources as coarse-grained reward labels and optimizes the LLM as a class-conditioned policy.
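For orientation, the base EL2N score of Paul et al. (2021) is the L2 norm of the difference between the predicted class probabilities and the one-hot label. The sketch below shows that base score with a simple pruning step; it is not the extended metric of Attendu and Corbeil (2023), and the `keep_fraction` parameter, the example records, and the assumption that low-scoring examples are the unimportant ones are illustrative choices.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def el2n(logits, label):
    """L2 norm of (softmax(logits) - one_hot(label))."""
    probs = softmax(logits)
    return math.sqrt(sum((p - (1.0 if i == label else 0.0)) ** 2
                         for i, p in enumerate(probs)))


def prune_low_scores(examples, keep_fraction=0.5):
    """Keep the highest-scoring fraction of examples (assumed pruning rule)."""
    ranked = sorted(examples, key=lambda ex: el2n(ex["logits"], ex["label"]), reverse=True)
    n_keep = max(1, math.ceil(len(ranked) * keep_fraction))
    return ranked[:n_keep]


# Hypothetical two-class examples with model logits at some checkpoint.
examples = [
    {"id": "easy", "logits": [10.0, -10.0], "label": 0},
    {"id": "unsure", "logits": [0.0, 0.0], "label": 0},
    {"id": "wrong", "logits": [-3.0, 3.0], "label": 0},
    {"id": "fairly_easy", "logits": [3.0, -3.0], "label": 0},
]
kept_ids = [ex["id"] for ex in prune_low_scores(examples)]
```

In a dynamic setting, the scores would be recomputed periodically during fine-tuning so that the retained subset follows the current state of the model.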
To enhance instruction diversity in the chosen subsets, DiverseEvol (Wu et al., 2023) uses an iterative data sampling technique that selects new data points with maximized distances from any existing ones in model embedding space.
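Selecting points that are maximally distant from those already chosen can be sketched with greedy farthest-point (k-center) selection. This shows a single selection round over fixed, made-up 2-D embeddings with Euclidean distance; in DiverseEvol the embeddings come from the model being trained and the procedure is iterated, which this sketch does not reproduce. The function name is hypothetical.

```python
import math


def farthest_point_selection(embeddings, selected, k):
    """Greedily pick up to k new indices, each maximizing its distance to the
    nearest already selected point. `selected` must be non-empty."""
    selected = list(selected)
    # Distance from every point to its nearest selected point.
    min_dist = [min(math.dist(e, embeddings[s]) for s in selected) for e in embeddings]
    chosen = []
    for _ in range(k):
        idx = max(range(len(embeddings)), key=lambda i: min_dist[i])
        if min_dist[idx] == 0.0:
            break  # every remaining point coincides with a selected one
        chosen.append(idx)
        # The new point may now be the nearest selected point for others.
        for i, e in enumerate(embeddings):
            min_dist[i] = min(min_dist[i], math.dist(e, embeddings[idx]))
    return chosen


# Hypothetical 2-D embeddings; index 0 is the current training subset.
embeddings = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.0, 5.0), (0.0, 4.0)]
new_points = farthest_point_selection(embeddings, selected=[0], k=2)
```

The near-duplicate at index 1 is never chosen, which is the behavior that makes this style of sampling favor diverse subsets.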
Task Composition Given a small amount of target task data, Ivison et al. (2023) find the relevant multitask subsets for fine-tuning according to the similarity between the pre-trained model’s representations. Similarly, Dynosaur (Yin et al., 2023a) treats task selection based on instruction representations as a replay strategy in continual learning scenarios to mitigate catastrophic forgetting and improve generalization to unseen tasks. Yue et al. (2023) build the math generalist models MAmmoTH through instruction tuning on a unique hybrid of chain-of-thought and program-of-thought rationales in math.
Others Zhou et al. (2023b) introduce learnability as a new dimension of SFT data selection: data that can be learned more effectively by the model are preferable, while data lacking informative content or excessively demanding for the model should be avoided. They also propose LoBaSS to select SFT data using learnability as the principal criterion, measured by the loss difference between fine-tuned and pre-trained models. Xu et al. (2023b) propose a contrastive post-training technique via contrastive pairs from LLMs with different levels of abilities. They also use a data curriculum scheme in which the model learns progressively from the “easier part” to the “harder part”. Data-Juicer (Chen et al., 2023a) implements pipelines for LLM fine-tuning and operators for users to compose different data management recipes, as well as the evaluation of the resulting model performance.
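The learnability criterion can be illustrated with a deliberately simplified sketch that scores each sample by the raw drop in loss from the pre-trained to the fine-tuned model and keeps the top-scoring samples. This is an assumption made for illustration: LoBaSS may normalize or otherwise adjust the loss difference, and the loss values and function names below are hypothetical.

```python
def learnability_scores(pretrained_loss, finetuned_loss):
    """Raw per-sample loss drop from pre-trained to fine-tuned model."""
    return [p - f for p, f in zip(pretrained_loss, finetuned_loss)]


def select_most_learnable(samples, pretrained_loss, finetuned_loss, k):
    """Return the k samples with the largest loss drop."""
    scores = learnability_scores(pretrained_loss, finetuned_loss)
    ranked = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    return [samples[i] for i in ranked[:k]]


# Hypothetical per-sample losses under the two models.
samples = ["already_known", "learnable", "too_hard"]
pretrained_loss = [0.5, 3.0, 4.0]
finetuned_loss = [0.4, 1.0, 3.9]
selected = select_most_learnable(samples, pretrained_loss, finetuned_loss, k=1)
```

Under this scoring, a sample with low loss in both models (little left to learn) and a sample with high loss in both (too demanding) both receive small scores, matching the two cases the criterion is meant to avoid.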
The exploration of data management and its impact on LLM pretraining and SFT is still an ongoing task. In this section, we point out several challenges and corresponding future directions in training data management studies for LLMs.
Efforts have been made to understand the impacts of data management on different training stages, addressing different aspects. While current studies contribute valuable pieces to the puzzle, a comprehensive understanding of the entire picture is still lacking. Moreover, explorations using different datasets and models on different tasks may lead to contradictory conclusions, e.g., the trade-off between quality and toxicity filtering (Longpre et al., 2023b), detoxification vs. debiasing (Xu et al., 2021; Welbl et al., 2021), fine-tuning with a small amount of high-quality data (Zhou et al., 2023a; Chen et al., 2023b) vs. data scaling (Wei et al., 2021; Sanh et al., 2022), and task composition (Wang et al., 2022; Chung et al., 2022) vs. expert models (Jang et al., 2023; Wang et al., 2023b). Hence, a more fine-grained understanding is required to resolve these conflicts.
General Data Management Framework Although Data-Juicer (Chen et al., 2023a) and Oasis (Zhou et al., 2023c) propose data management systems to compose various data recipes in either the pretraining or SFT stage of LLMs, practitioners still need to spend effort on finding suitable datasets. Constructing a general data management framework suitable for a broad range of applications to reduce data management costs is an urgent and worthy future direction in developing and promoting LLMs.
Data Efficiency Through Data Management Training more powerful LLMs with less and more effective training data is always an ongoing pursuit in the training of LLMs. As discussed in Section 3.4, works have been proposed to fine-tune LLMs to achieve various abilities in a data-efficient way. Further exploration in data management is expected to offer good opportunities for achieving better data efficiency.
Data Curriculum Besides choosing better training data, data curriculum addressing the arrangement of data learning orders is also an important part of data management, e.g., learning from general abilities to target abilities or from easier tasks to harder tasks. There are a few works focusing on data curriculum in the training of LLMs (Xu et al., 2023b; Dong et al., 2023; Yin et al., 2023a). Although effective in practice, there is still a lack of analysis of data curriculum strategies.
Conflict Data Separation In the collection of training data, conflicts among the responses to the same query may exist. For example, given the same query, LLMs playing different roles may generate different responses. Mixing these samples together could lead to negative impacts on model performance because of the response conflicts. Thus, how to separate and effectively learn from these data samples is an interesting topic in the future.
Multimodal Data Management Current research in data management mostly focuses on natural language processing. With the application of LLMs extending to other modalities such as vision and audio, the construction of multimodal datasets becomes more and more important. The proposed multimodal LLMs usually construct their own instruction-tuning datasets collected from benchmark adaptation (Zhang et al., 2023a; Gao et al., 2023) or self-instruction (Pi et al., 2023; Yang et al., 2023b). A hybrid composition of language-only and multimodal data is also adopted in some works (Dai et al., 2023; Zhao et al., 2023c). It would be interesting to study the impacts of multimodal data management on the performance of fine-tuned multimodal LLMs, such as the data scaling law in multimodal instruction fine-tuning, quality-control techniques in multimodal dataset construction, and task balancing in multitask multimodal training.
Hallucinations Despite their strong power, LLMs are notorious for their hallucinations, i.e., the generation of input-, context-, or fact-conflicting content (Zhang et al., 2023c). Several works on hallucination trace its occurrence to the lack of pertinent knowledge and the internalization of false knowledge from the pretraining corpora (Li et al., 2022; McKenna et al., 2023; Dziri et al., 2022). To mitigate hallucination, many LLMs curate their pretraining corpora, mainly focusing on extracting high-quality data, e.g., GPT-3 (Brown et al., 2020), Llama 2 (Touvron et al., 2023b), and Falcon (Penedo et al., 2023). Manually curated (Zhou et al., 2023a) and automatically selected (Chen et al., 2023c; Cao et al., 2023; Lee et al., 2023b) high-quality instruction data are also experimentally shown to be effective in reducing hallucination during the SFT stage. Previous research thus suggests that data management in both the pretraining and SFT stages can be a promising solution to hallucination.
Social Biases and Fairness The problem of social biases and unfairness existing in current pre-training datasets and their impacts on pre-trained LLMs have been addressed in several works as discussed in Section 2.2.4.
However, there is still a large gap between current prominent LLMs and ideal LLMs without social biases. Many questions are worth exploring, such as how to mitigate the potential biases in pretraining datasets, the existence of bias in the SFT datasets, and whether it is feasible to reduce social bias through SFT.
As LLMs draw more and more attention, a handful of surveys have been published or preprinted addressing different aspects of the life cycle of LLMs. Related to our work, several of them also cover parts of the data preparation process in the pretraining or SFT of LLMs. Zhao et al. (2023a) review the development of LLMs and the latest advancements covering a wide range of topics. Yang et al. (2023a) also provide an overview of the LLM evolution and discuss the related techniques from the perspectives of model, data, and downstream tasks. Also concentrating on data, Zha et al. (2023) introduce data-centric AI and its related tasks and methods for general machine learning models rather than LLMs. Zhang et al. (2023b) survey the instruction tuning of LLMs and its related methodologies, data construction, applications, and so on. Wang et al. (2023e) review the technologies for aligning LLMs with human expectations, including data collection, training methodologies, and model evaluation.
Different from previous surveys, this survey provides a systematic and detailed overview of data management at both the pretraining and SFT stages of LLMs. We focus on the proper organization of training datasets and the effects of different data management strategies, providing a guiding resource for practitioners aiming to build powerful LLMs through effective and efficient data management.
This paper overviews data management in the training of LLMs. We discuss the pretraining and supervised fine-tuning stages of LLMs in turn and summarize the up-to-date research efforts on data quantity, data quality, and domain/task composition for each stage. Data management systems in the pretraining stage and data-efficient learning in the supervised fine-tuning stage are also discussed. Finally, we highlight several challenges and promising future directions for LLM training data management. We hope this survey can provide insightful guidance for practitioners and inspire further research in effective and efficient data management for the development of LLMs.