Contents
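1. Introduction
1.1 Background and Problem Statement
Alongside advances in artificial intelligence, the emergence of large language models (LLMs) has brought a fundamental paradigm shift to natural language processing (NLP). However, obtaining the high-quality data essential for training and evaluating models remains difficult because of high costs, data scarcity, and privacy concerns. As a result, synthetic data generation with LLMs is emerging as an alternative. Synthetic data should be diverse and carry rich supervision signals that are closely aligned with human intent.
1.2 Research Objectives
This study presents methods for improving the quality and diversity of synthetic data and systematically reviews approaches to generating, curating, and evaluating data with LLMs. The main goal is to meet the quality requirements for the data while securing enough diversity to prevent overfitting and bias in models.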
2. Preliminaries and Problem Definition
2.1 Problem Definition
The task of generating synthetic data with LLMs can be defined as the process of producing synthetic data ($\mathcal{D}_{\text{gen}}$) according to a given goal ($\mathcal{T}$). It is common to perform data augmentation using the given conditions ($\mathcal{D}_{\text{sup}}$).
\[\mathcal{D}_{\text{gen}} \leftarrow \mathcal{M}_p(\mathcal{T}, \mathcal{D}_{\text{sup}})\]
2.2 Requirements for $\mathcal{D}_{\text{gen}}$
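3. Generic Workflow
3.1 Data Generation
3.1.1 Prompt Engineering
Prompt design that exploits the instruction-following ability of LLMs to improve the faithfulness and diversity of the data. The following formula gives the basic structure of a prompt design.
\[p(\mathcal{T}, \mathcal{D}) \leftarrow \mathcal{E}(e_{\text{task}}, e_{\text{condition}}, e_{\text{demo}})\]
Here $\mathcal{E}$ is the prompt template, and $e_{\text{task}}$, $e_{\text{condition}}$, and $e_{\text{demo}}$ denote the task specification, the conditions, and the demonstrations, respectively.
3.1.2 Multi-Step Generation
When the target data has complex structure or semantics, this strategy decomposes the overall generation process into simpler sub-tasks and generates the data step by step.
\[\mathcal{D}_i \leftarrow \mathcal{M}_{p_i}(\mathcal{T}_i, \mathcal{D}_{0:i-1}), \quad i=1, 2, \ldots, k\]
3.2 Data Curation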
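3.2.1 High-Quality Sample Filtering
The process of selecting only high-quality data from the synthesized dataset ($\mathcal{D}_{\text{gen}}$). It extracts a subset of useful samples ($\mathcal{D}_{\text{curated}}$).
3.3 Data Evaluation
3.3.1 Direct Evaluation
Methods for assessing the faithfulness and diversity of the generated data. These serve as an important criterion for deciding whether the data is suitable for model training and evaluation.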
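4. Future Directions
Research is needed on methods that activate the reasoning and planning capabilities of LLMs to enable autonomous data generation (4.1 Complex Task Decomposition) and on ways to integrate domain-specific knowledge to improve the efficiency and accuracy of the data generation process (4.2 Knowledge Enhancement).
5. Conclusion
This paper systematically reviews the current state of synthetic data generation with LLMs and proposes directions for future research. To advance data-centric AI, it is important to generate diverse, high-quality data efficiently, which calls for making full use of the potential of LLMs.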
The game-changing emergence of Large Language Models (LLMs) instigated a significant paradigm shift in the field of deep learning Zhang et al. (2023a); Guo et al. (2023); Bang et al. (2023). Despite these advancements, a large amount of high-quality data remains the foundation for building robust NLP models Gandhi et al. (2024). To be more specific, here high-quality data typically refers to diverse data that carries rich supervision signals (generally in the form of labels) closely aligned with human intent. However, fulfilling such data reliance with human data can be challenging or even unrealistic sometimes, due to high costs, data scarcity, privacy concerns, etc. Kurakin et al. (2023). Moreover, several studies Hosking et al. (2023); Singh et al. (2023); Gilardi et al. (2023) have highlighted that human-generated data, being inherently susceptible to biases and errors, may not even be optimal for model training or evaluation. These considerations necessitate a more serious inquiry into the question: are there other more effective and scalable methods of data collection that can overcome the current limitations?
Given the recent advancements in LLMs, which demonstrate the capability to generate fluent text on par with human output Hartvigsen et al. (2022); Sahu et al. (2022); Ye et al. (2022a); Tang et al. (2023); Gao et al. (2023a), synthetic data produced by LLMs emerges as a viable alternative or supplement to human-generated data. Specifically, synthetic data is designed to mimic the characteristics and patterns of real-world data Liu et al. (2024). On the one hand, LLMs, through extensive pretraining, have acquired a vast repository of knowledge and demonstrate exceptional linguistic comprehension Kim et al. (2022); Ding et al. (2023a), which forms a foundation for generating faithful data. On the other hand, the profound instruction-following capabilities of LLMs allow better controllability and adaptability over the generation process, facilitating the creation of tailored datasets for specific applications with more flexible process designs Eldan and Li (2023). These two advantages make LLMs highly promising synthetic data generators.
Though seemingly straightforward, generating synthetic datasets that simultaneously have high correctness and sufficient diversity requires careful process designs and involves a lot of tricks Gandhi et al. (2024), making LLMs-driven synthetic data generation a non-trivial problem. While most existing works generally target data generation for various tasks (e.g., pre-training Gunasekar et al. (2023); Li et al. (2023b); Eldan and Li (2023), fine-tuning Mukherjee et al. (2023); Mitra et al. (2023); Xu et al. (2023a), evaluation Feng et al. (2023); Wei et al. (2024)) across different domains (e.g., math Yu et al. (2023a); Luo et al. (2023a), code Luo et al. (2023b); Wei et al. (2023b), instruction Honovich et al. (2023a); Wang et al. (2023d)), they share many common ideas. To address the lack of a unified framework in the emerging field of LLM-driven synthetic data generation and develop a general workflow, this survey investigates recent studies and organizes them according to the topics of generation, curation, and evaluation, which are closely related, as shown in Figure 2. Our primary aim is to provide a comprehensive overview of the current state of the field, identify key areas of focus, and highlight the gaps that remain to be addressed. We hope to bring insights to both the academic and industrial communities and drive further development in LLM-driven synthetic data generation.
In this paper, we investigate the challenge of generating high-quality synthetic data using pre-trained LLMs, denoted as $\mathcal{M}$. Rather than creating new datasets from scratch, in more cases, we perform data augmentation with a small number of seed samples or unlabeled inputs, which we denote uniformly as $\mathcal{D}_{\text{sup}}$. Although optional for LLMs-driven synthetic data generation, $\mathcal{D}_{\text{sup}}$ can typically provide valuable supporting information when available. Consequently, the overall generation task can be formulated as:
\[\mathcal{D}_{\text{gen}} \leftarrow \mathcal{M}_p(\mathcal{T}, \mathcal{D}_{\text{sup}}),\]
where $\mathcal{D}_{\text{gen}}$ represents the final generated dataset, and $p$ refers to the prompt used for model inference. $\mathcal{T}$ specifies the generation task, such as rewriting, question answering, annotation, etc. Notably, data annotation, as a specialized paradigm of synthetic data generation, has particularly extensive applicability, including RLAIF Bai et al. (2022) and LLMs-based evaluation Chen et al. (2023b); Zheng et al. (2023); Kim et al. (2023), which may involve specific challenges and corresponding solution techniques. Due to page limitations, further details about data annotation can be found in Appendix A.
Figure 2: A taxonomy of LLMs-driven synthetic data generation, curation, and evaluation.
Briefly speaking, our goal is to generate data that closely aligns with evaluation metrics. While the standard of high-quality data may vary across different downstream tasks, there are two general requirements that are considered challenging in most existing literature:
Faithfulness. To provide valid supervision, the generated data must first be logically and grammatically coherent. However, the inherent problems of hallucination and the fat-tailed knowledge distribution of LLMs can introduce significant noise into the generated results, manifesting as factual errors, incorrect labels, or irrelevant content. These issues become more pronounced when generating long, complex, or domain-specific data.
Diversity. Diversity captures the variation among the generated data, reflecting differences in text length, topic, or even writing style. It is crucial for generating synthetic samples that mimic the diversified nature of real-world data, thereby preventing overfitting and bias during model training or evaluation. Nevertheless, due to the inherent biases of LLMs, uncontrolled generated content often tends to be monotonous, limiting its applicability in downstream tasks.
These two requirements are the focal points of most current research efforts. In the subsequent workflow, we will introduce how different methods address these issues.
Figure 3: A toy example of effective synthetic data generation. The corresponding fields for task specification, conditions, and in-context demonstrations are highlighted, while < > marks the switchable contents.
Existing studies on LLMs-driven synthetic data generation generally incorporate three main topics: generation, curation, and evaluation. Various approaches are employed within these aspects to collaboratively achieve optimal data generation.
In this section, we systematically summarize some common practices for synthetic data generation with LLMs, which can be roughly divided into prompt engineering and multi-step generation. An overall illustration is provided in Figure 3.
Task Specification.
In traditional crowdsourced annotation scenarios, the recruited workers are commonly offered a codebook that specifies the necessary contexts, such as task purpose, data explanation, and other background knowledge, so that they can better understand their jobs Gilardi et al. (2023). Similarly, such task specification is crucial for setting the right context for LLMs-driven data generation, which can also include role-play Li et al. (2023c), format clarification, knowledge augmentation Xu et al. (2023b); Sudalairaj et al. (2024), etc. Evidence shows that a simple prologue such as “suppose you are a {xxx}” can significantly improve the LLMs’ performance by setting up a proper scenario for data generation and allowing the LLMs to better take on the roles Li et al. (2023c). More formally, Yoo et al. (2021) defines the task specification with a triplet of text type, label type, and label-token verbalizer. Such a description header is particularly important when extra domain expertise is demanded to address issues like terminology complexities in both context understanding and data generation. Consequently, Xu et al. (2023b) leverages external knowledge graphs and LLMs to obtain domain topics for context-informed prompting, which effectively enhances the faithfulness and complexity of generated data.
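The task-specification header described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `TaskSpec` container, its field names, and the rendered wording are hypothetical, loosely following the role-play prologue and the text type, label type, and verbalizer triplet mentioned in the paragraph; they are not taken from any of the cited works.

```python
from dataclasses import dataclass


@dataclass
class TaskSpec:
    """Hypothetical container for a task specification header."""
    role: str          # role-play persona, e.g. "movie critic"
    text_type: str     # kind of text to generate
    label_type: str    # what the label describes
    verbalizer: dict   # maps label ids to label words


def render_task_spec(spec: TaskSpec) -> str:
    """Render the specification as a plain-text prompt header."""
    labels = ", ".join(f"{k} = {v}" for k, v in sorted(spec.verbalizer.items()))
    return (
        f"Suppose you are a {spec.role}.\n"
        f"Write one {spec.text_type}.\n"
        f"Each output must carry a {spec.label_type} label ({labels})."
    )


# Illustrative values only.
spec = TaskSpec(
    role="movie critic",
    text_type="short movie review",
    label_type="sentiment",
    verbalizer={0: "negative", 1: "positive"},
)
header = render_task_spec(spec)
```

Keeping the header in a structured object rather than a hand-written string makes it easier to swap the role or label set without touching the rest of the prompt.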
Conditional Prompting.
As mentioned in Section 2.2, a pivotal challenge in using LLMs for synthetic data generation is ensuring sufficient diversity, as directly prompting the LLMs to produce data for certain tasks often results in highly repetitive outputs, even with a high decoding temperature Gandhi et al. (2024); Liu et al. (2024). Addressing this problem, a widely adopted strategy is conditional prompting, which explicitly and concretely communicates to the LLMs the specific type of data desired. The core of conditional prompting involves delineating the targeted data through the formulation of a series of condition-value pairs:
\[\mathbf{e}_{\text{condition}} = \{ (c_1, v_1), (c_2, v_2), \dots, (c_n, v_n) \}\]
which effectively characterizes the desired attributes and characteristics of the synthetic data. With different combinations of such attributes, we can automatically achieve a degree of “artificially defined” diversity in the generated samples Gunasekar et al. (2023); Li et al. (2023b); Eldan and Li (2023). Conditional prompting not only allows better control over the diversity and coverage of the generated dataset but also refines the content to a narrower, more focused scope that is more likely to align with our specific expectations and requirements Li et al. (2023c). Current research on conditional prompting primarily centers on the following two subjects:
Conditioning Scope. As the backbone of $\mathbf{e}_{\text{condition}}$, the conditioning scope defined by $\{c_1, \dots, c_n\}$ delineates the dimensions that we use to characterize our target data. Early studies Gao et al. (2023a); Ye et al. (2022a, b) employed a basic output-conditional prompting strategy, using the specific label associated with the classification task as the conditioning variable. The rationale behind this was primarily to maintain class balance and coverage. However, such a strategy is unsuitable for data lacking explicit category labels. Subsequent work by Yu et al. (2023b) argues that conditional prompting with finer-grained attributes (e.g., topics, length, and style Xu et al. (2023b)) can lead to more diversified generation due to the vast number of possible attribute combinations, and is also applicable to open-ended data. Additionally, Eldan and Li (2023) condition each generation on the task of incorporating three randomly chosen words into the generated story. This approach was also shown to significantly enhance the diversity of the generated data, shifting the focus from the heuristic features of the output to a more structured and targeted conditioning mechanism by adding “creative randomness” to the prompt Eldan and Li (2023).
Conditioning Values. After defining the conditioning scope, we then need to assign concrete values to each condition. Despite the seemingly straightforward strategy of sampling from the known classes or labels Ye et al. (2022a), there are cases where such an instance pool is unavailable. Addressing this problem, Josifoski et al. (2023) actively retrieves the conditioning instances from external knowledge graphs, while Xu et al. (2023b); Ding et al. (2023b) leverage the LLMs to generate diversified instances for conditional prompting. Specifically, Ding et al. (2023b) construct a concept tree to delve into different subtopics, ensuring the coverage of sampled conditioning values, which then contributes to more diverse generated data. Moreover, the prompt template $\mathcal{E}$ can also be considered a special type of condition. It has been demonstrated that incorporating templates with a certain level of randomness throughout the generation process can enhance the diversity of the generated contents Meng et al. (2022).
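To make the condition-value formulation concrete, the following minimal sketch enumerates attribute combinations and renders a few of them as prompts. The attribute names, the candidate values, and the prompt layout are assumptions chosen for illustration, not a scheme prescribed by the cited works.

```python
import itertools
import random

# Hypothetical conditioning scope {c_1, ..., c_n} with candidate values.
CONDITIONS = {
    "topic": ["sports", "finance", "travel"],
    "length": ["one sentence", "one paragraph"],
    "style": ["formal", "casual"],
}


def all_condition_sets(conditions):
    """Enumerate every combination of condition-value pairs."""
    names = sorted(conditions)
    for values in itertools.product(*(conditions[n] for n in names)):
        yield list(zip(names, values))


def render_prompt(task, pairs):
    """Turn a task description and condition-value pairs into a prompt."""
    lines = [task, "Requirements:"]
    lines += [f"- {c}: {v}" for c, v in pairs]
    return "\n".join(lines)


def sample_prompts(task, conditions, n, seed=0):
    """Sample n distinct condition sets so the prompts differ from one another."""
    rng = random.Random(seed)
    combos = list(all_condition_sets(conditions))
    return [render_prompt(task, pairs) for pairs in rng.sample(combos, n)]


prompts = sample_prompts("Write a news headline.", CONDITIONS, n=4)
```

With three attributes of 3, 2, and 2 values there are 12 possible condition sets; sampling without replacement guarantees that no two prompts in a batch request the same combination.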
In-Context Learning.
Due to the inherent bias of LLMs, it remains challenging to elicit favorable responses from the LLMs with merely task specification and conditional prompting. In this case, a straightforward yet effective strategy is to provide several demonstrations, which can serve as a form of implicit human guidance. Research has shown that, owing to LLMs’ remarkable in-context learning (ICL) capabilities, a few exemplars can provide them with insights into the patterns exhibited in real-world data, thereby significantly improving the faithfulness of generated data Li et al. (2023c). In the few-shot setting, where labeled samples are available in the support set $\mathcal{D}_{\text{sup}}$, these samples can be directly utilized as demonstrations for ICL. However, in scenarios where no ground truth data is available, approaches like Self-Instruct Wang et al. (2023e) and Self-Prompting Li et al. (2022) instead leverage ICL with synthetic demonstrations generated by LLMs. This allows the models to learn from their own predictions or other teacher models, even in the absence of labeled data.
However, given the constraint of prompt length and data inconsistency, the quality of in-context samples significantly affects the effectiveness of in-context learning. Sudalairaj et al. (2024) argue that randomly selecting in-context examples from the pool of seed samples, as done in Self-Instruct Wang et al. (2023e), results in a lack of diversity and quality in the generated data. To address this issue, Sudalairaj et al. (2024) opt for selecting examples that concentrate on specific aspects to better stimulate the long tail of knowledge inherent in LLMs. Liu et al. (2022b) and Su et al. (2023) prioritize consistent samples as demonstrative examples based on their cosine similarity in the embedding space. Alternatively, Ye et al. (2022b) selects the most informative samples using quantified influence scores to steer the generation process. To enhance the informativeness of in-context examples, He et al. (2023) prompts LLMs to provide an explanation for each sample before integrating it into the prompt. This approach not only offers valuable additional information but also aligns well with the subsequent Chain-of-Thought generation.
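Similarity-based demonstration selection can be illustrated with a small sketch. It ranks a pool of seed samples by cosine similarity to a query and keeps the top $k$. The bag-of-words `embed` function is a deliberately crude stand-in (an assumption of this sketch) for the learned sentence embeddings a real system would use, and the exact selection criteria of the cited methods are not reproduced here.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding; a real system would use a sentence encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_demonstrations(query, pool, k):
    """Return the k pool samples most similar to the query."""
    q = embed(query)
    ranked = sorted(pool, key=lambda s: cosine(q, embed(s)), reverse=True)
    return ranked[:k]


pool = [
    "the team won the football match",
    "stock prices fell sharply today",
    "the football coach praised the team",
    "a new recipe for apple pie",
]
demos = select_demonstrations("which team won the football cup", pool, k=2)
```

Here the two football-related samples are selected, while the unrelated finance and cooking samples score zero.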
In the previous paragraphs, we have introduced some common prompting strategies, which are typically designed for a specific generation task $\mathcal{T}$. However, in most cases, because LLMs lack sufficient reasoning ability, it is unrealistic to expect them to generate the entire desired dataset within a single inference pass, especially when targeting data with complex structures or semantics Cui and Wang (2023). To address this problem, a common strategy is multi-step generation, through which the overall generation process is manually decomposed into a chain of simpler sub-tasks $\mathcal{T}_{1:k}$, forcing the LLMs to produce data in a step-by-step manner as scheduled:
\[\mathcal{D}_i \leftarrow \mathcal{M}_{i,\,p_i}(\mathcal{T}_i, \mathcal{D}_{0:i-1}), \quad i = 1, 2, \dots, k,\]
where $\mathcal{D}_0 = \mathcal{D}_{\text{sup}}$. Each intermediate output $\mathcal{D}_i$ is generated using model $\mathcal{M}_i$, prompted by $p_i$, for a sub-task $\mathcal{T}_i$. These outputs can then potentially be used in subsequent generations. By manually scheduling the generation procedure, we implicitly align the reasoning paths of LLMs with human prior knowledge. Specifically, there are two common strategies for task decomposition: sample-wise and dataset-wise decomposition, which mainly aim at enhancing the quality of synthetic data at different scales.
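The chained formulation above can be sketched as a simple loop in which step $i$ receives everything produced so far. The sketch makes several simplifying assumptions: a single callable stands in for all per-step models, each prompt is a template with a `{context}` slot, and `echo_model` is a placeholder rather than a real LLM call.

```python
from typing import Callable, List

# A "model" here is any callable mapping a prompt string to generated text.
Model = Callable[[str], str]


def multi_step_generate(steps: List[str], d_sup: str, model: Model) -> List[str]:
    """Run sub-tasks T_1..T_k in order; step i sees D_0..D_{i-1}.

    `steps` is a list of prompt templates p_i with a `{context}` slot.
    """
    outputs = [d_sup]                      # D_0 = D_sup
    for template in steps:
        context = "\n".join(outputs)       # D_{0:i-1}
        outputs.append(model(template.format(context=context)))
    return outputs[1:]                     # D_1..D_k


def echo_model(prompt: str) -> str:
    """Stand-in for an LLM call: reports how many lines of prompt it saw."""
    return f"output after {len(prompt.splitlines())} prompt lines"


steps = [
    "Write a question about:\n{context}",
    "Answer the last question given:\n{context}",
]
d = multi_step_generate(steps, "seed passage", echo_model)
```

The growing prompt length reported by the placeholder model shows that the second step is conditioned on both the seed data and the first step's output.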
Sample-Wise Decomposition.
A typical use case of multi-step generation is addressing the challenges of long-text processing and logical reasoning when dealing with multi-text data such as dialogues and entity-relation triplets. In such cases, a straightforward approach is to divide the sample into smaller chunks and generate only a portion of each sample at a time Li et al. (2022); Ye et al. (2023); Wang et al. (2023e). In this way, $\mathcal{D}_{1:k}$ can be considered as different parts of $\mathcal{D}_{\text{gen}}$:
\[\mathcal{D}_{\text{gen}} = (\mathcal{D}_1, \mathcal{D}_2, \dots, \mathcal{D}_k)\]
Notably, as shown in Eq. 4, each iteration of the generation process can be conditioned on the previously generated contents. For example, Ding et al. (2023b) prompts the LLMs to alternate between acting as the assistant and the user, replying to each other based on the context, ultimately producing a complete conversation transcript. In this way, the coherence among the internal components $\mathcal{D}_i$ can be specifically reinforced with separate instructions, making it easier for the model to follow the requirements and generate more faithful data. It should be noted that $\mathcal{D}_{1:k}$ may not necessarily form part of the final $\mathcal{D}_{\text{gen}}$; instead, explicitly outputting some intermediate reasoning steps can also improve the generation of complex data Bai et al. (2022); He et al. (2023). Chain-of-Thought (CoT) prompting stands out as one of the most popular strategies for improving the faithfulness of LLM-generated content Wei et al. (2022). Nevertheless, current research on the exploration of such latent metadata is still insufficient, leaving sample-wise task decomposition from a reasoning perspective an open problem for future studies.
Dataset-Wise Decomposition.
In Section 3.1.1, we introduced how to generate data with specified properties. However, generating a series of such data that can eventually form a dataset with good diversity and domain coverage requires long-term scheduling. To this end, dataset-wise task decomposition dynamically adjusts the conditions used at each stage of multi-step generation to ensure the overall dataset grows in the right direction:
\[\mathcal{D}_{\text{gen}} = \bigcup_{i=1}^{k} \mathcal{D}_i\]
Specifically, S3 Wang et al. (2023b) targets the most frequently mislabeled categories at each iteration, according to the performance of the downstream model trained on previously generated data. Similarly, Honovich et al. (2023b); Shao et al. (2023) utilize a generate-then-expand paradigm to enhance the diversity of the overall dataset accordingly. Some other methods also leverage specific data structures to model the pathways of data generation. For example, Explore-Instruct Wan et al. (2023) models the domain space as a tree structure and continually refines the generated data along with tree traversal to promote both the specialization and domain coverage of the generated data.
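The idea of steering each round toward the categories a downstream model gets wrong can be sketched as follows. This is a loose illustration of error-driven scheduling and not the S3 algorithm itself; the `generate` and `evaluate` callables, the toy labels, and the fixed evaluation result are all hypothetical.

```python
from collections import Counter


def most_mislabeled(y_true, y_pred, top_m):
    """Return the top_m gold labels the downstream model gets wrong most often."""
    errors = Counter(t for t, p in zip(y_true, y_pred) if t != p)
    return [label for label, _ in errors.most_common(top_m)]


def grow_dataset(rounds, generate, evaluate, top_m=1):
    """Accumulate the union of per-round outputs D_1, ..., D_k.

    `generate(targets)` returns new samples for the target labels (None = all);
    `evaluate(dataset)` returns (y_true, y_pred) of a downstream model.
    """
    dataset, targets = [], None
    for _ in range(rounds):
        dataset += generate(targets)
        y_true, y_pred = evaluate(dataset)
        targets = most_mislabeled(y_true, y_pred, top_m)
    return dataset


# Toy stand-ins: a generator that emits one sample per requested label and an
# evaluator that always reports the same validation predictions.
Y_TRUE = ["a", "a", "b", "b", "c", "c", "c"]
Y_PRED = ["a", "b", "b", "c", "a", "a", "c"]
toy_generate = lambda targets: [("text", t) for t in (targets or ["a", "b", "c"])]
toy_evaluate = lambda dataset: (Y_TRUE, Y_PRED)

d_gen = grow_dataset(2, toy_generate, toy_evaluate)
```

In this toy run the first round covers all three labels, and the second round generates only for label `c`, which accounts for the most errors.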
After the preceding steps, one can generate a theoretically unlimited amount of data $\mathcal{D}_{\text{gen}}$. However, these datasets often contain a considerable portion of noisy, worthless, or even toxic samples, which stems primarily from two causes. First, LLMs inevitably produce some corrupted samples with incorrect labels due to the hallucination problem. Second, ineffective prompts containing ambiguous descriptions can trick the model into generating irrelevant or redundant samples. Consequently, directly using such low-quality data without proper processing may have a significant negative impact.
To address this, many data curation approaches have been studied, which mainly fall into two dominant groups: high-quality sample filtering and label enhancement, as elaborated below.
Figure 4: Two dominant approaches of data curation.
Sample filtering aims to weed out undesired low-quality samples and obtain a more helpful subset $\mathcal{D}_{\text{curated}} \subset \mathcal{D}_{\text{gen}}$. These methods typically design heuristic criteria or re-weighting functions to rerank samples for filtering, as shown in Figure 4.
Heuristic Metrics.
For methods based on heuristic metrics, the key step is to design appropriate criteria based on the learning dynamics, such as confidence score Seedat et al. (2023), influence function Ye et al. (2022b), and generation ability Meng et al. (2022). SuperGen Meng et al. (2022) employs the estimated generation probability to identify samples most related to the desired label. Seedat et al. (2023) discard samples with both low confidence and low uncertainty. Some other methods assume that clean samples are prone to hold similar predictions under different conditions and employ cross-condition consistency for filtering. Specifically, such consistency can be between LLM and downstream classifier Yu et al. (2023c), between multiple executions Ye et al. (2023), or between neighboring data points Seedat et al. (2023). Chen et al. (2023b) leverage the powerful text understanding capabilities of LLMs to assess the quality of different samples and filter out those with low scores. Results show that Alpagasus Chen et al. (2023b), trained on a much smaller but curated dataset, surpasses the original Alpaca Taori et al. (2023) across several benchmarks, underscoring the importance of data curation.
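Cross-condition consistency filtering can be illustrated with a minimal agreement check. In the sketch below, each labeler is a callable standing in for one LLM execution or one downstream classifier; the keyword-based toy labelers and the `min_agreement` threshold are assumptions made for the example, not settings from the cited works.

```python
from collections import Counter


def consistency_filter(samples, labelers, min_agreement=1.0):
    """Keep (text, label) pairs whose majority label wins at least
    `min_agreement` of the votes cast by the labelers."""
    kept = []
    for text in samples:
        votes = Counter(fn(text) for fn in labelers)
        label, count = votes.most_common(1)[0]
        if count / len(labelers) >= min_agreement:
            kept.append((text, label))
    return kept


# Toy labelers that disagree on negated or neutral inputs.
labelers = [
    lambda t: "pos" if "good" in t else "neg",
    lambda t: "pos" if "great" in t or "good" in t else "neg",
    lambda t: "neg" if "not" in t else "pos",
]
samples = ["good movie", "not good", "boring plot"]
curated = consistency_filter(samples, labelers)
```

With the default threshold, only the sample on which all three labelers agree survives; lowering `min_agreement` trades label reliability for a larger curated set.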
Sample Re-Weighting.
On the other hand, re-weighting methods believe all data are valuable but with varying importance. Thus, they assign larger weights to correctly annotated or influential samples during downstream utilization Zhang et al. (2023b); Gao et al. (2023a); Meng et al. (2023). For instance, SunGen Gao et al. (2023a) proposes an adaptive bi-level re-weighting algorithm without human annotations. FewGen Meng et al. (2023) designs a discriminative meta-learning objective to adjust sample weights and demarcate the nuanced differences between different labels.
Label enhancement methods strive to rectify the potentially erroneous annotations in generated samples. Due to confirmation bias, it is unrealistic for LLMs to identify their own mistakes. To address this, recent works either rely on human intervention or incorporate a student model for human-free knowledge distillation.
Human Intervention.
A straightforward strategy for label refinement is to involve human effort in re-annotating the corrupted samples Chung et al. (2023a); Wang et al. (2021); Pangakis et al. (2023). Wang et al. (2021) proposed to actively select samples with the lowest confidence for human re-labeling. Pangakis et al. (2023) and Liu et al. (2022a) further emphasize the importance of human review and suggest comparing annotations from humans and LLMs guided by the same codebook. Despite their simplicity, these methods can lead to considerable labeling costs and can be unrealistic in practical deployment.
Auxiliary Model.
To reduce the labeling cost, a more pragmatic human-free paradigm has been developed, which involves auxiliary student models for knowledge distillation and label refinement Xiao et al. (2023); Zhao et al. (2023a); Saad-Falcon et al. (2023). These methods rely on the weakly supervised ability of student models and hypothesize that a student distilled from the LLM teacher can produce superior labels. The seminal work FreeAL Xiao et al. (2023) proposes a collaborative framework, where a student model is leveraged to distill the high-quality task-related knowledge from the weak annotations and in return provides feedback to the LLMs for label refinement. MCKD Zhao et al. (2023a) designs a multistage distillation pipeline with data-split training and cross-partition labeling to avoid overfitting on noisy labels. With the expanding abilities and availability of LLMs, the incorporation of auxiliary student models will play a more crucial role as a cost-effective alternative to human intervention.
Data Faithfulness.
Ideally, automatic evaluation of the LLMs’ generation results can be easily realized with ground truths from existing datasets, if available Zhu et al. (2023). However, for open-ended data, human evaluation is required. A straightforward idea is to provide some generated samples to human experts, who then determine whether they are correct; from their judgments we can estimate the overall generation quality Wang et al. (2023e). Theoretically, the larger the sample size, the more accurate the estimate will be, but the labor cost grows correspondingly. To this end, a reliable auxiliary model can be leveraged for a more comprehensive yet cost-effective evaluation of the generated data in place of human experts Chung et al. (2023b). Considering that most models can only process contents of limited length, appropriate information extraction can reduce the burden of the auxiliary model and contribute to a more precise prediction of whether a sample contains factual errors Lee et al. (2022).
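The trade-off between annotation effort and estimation accuracy can be quantified with a standard normal-approximation confidence interval for a proportion. The sketch below is an addition for illustration and is not taken from the cited works; the counts are made-up, and `z=1.96` corresponds to the usual 95% level.

```python
import math


def accuracy_interval(n_correct, n_checked, z=1.96):
    """Normal-approximation interval for the true fraction of correct samples."""
    p = n_correct / n_checked
    half_width = z * math.sqrt(p * (1 - p) / n_checked)
    return max(0.0, p - half_width), min(1.0, p + half_width)


small = accuracy_interval(45, 50)      # 90% judged correct among 50 checked samples
large = accuracy_interval(450, 500)    # same rate, ten times the annotation effort
```

Because the half-width shrinks with the square root of the number of checked samples, a tenfold increase in human effort narrows the interval only by a factor of about 3.2, which is one reason auxiliary-model evaluation is attractive.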
Data Diversity.
The quantification of data diversity primarily employs vocabulary statistics and sample relevance calculations. Vocabulary statistics, such as vocabulary size and N-gram frequency, provide a straightforward and intuitive approach. However, they struggle to capture the semantic information of a dataset. The calculation of sample relevance compensates for this limitation effectively. The most common measures of sample relevance are based on cosine similarity and sample distance, which can better capture the contextual and semantic diversity of the dataset. Furthermore, these metrics can also be leveraged to select in-context demonstrations $e_{\text{demo}}$ that are more dissimilar to the previously generated samples, thereby leading to more diversified generation results.
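Both families of diversity measures can be sketched briefly. The first function computes a distinct-n ratio as a vocabulary statistic; the second averages pairwise cosine similarity as a sample relevance measure. Whitespace tokenization and bag-of-words vectors are simplifying assumptions of this sketch; a practical implementation would typically use a proper tokenizer and learned embeddings.

```python
import itertools
import math
from collections import Counter


def distinct_n(texts, n):
    """Fraction of n-grams across all texts that are unique."""
    ngrams = []
    for text in texts:
        tokens = text.lower().split()
        ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def mean_pairwise_cosine(texts):
    """Average bag-of-words cosine similarity over all pairs; lower is more diverse."""
    vecs = [Counter(t.lower().split()) for t in texts]
    sims = []
    for a, b in itertools.combinations(vecs, 2):
        dot = sum(a[t] * b[t] for t in a if t in b)
        norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
        sims.append(dot / norm if norm else 0.0)
    return sum(sims) / len(sims) if sims else 0.0


repetitive = ["the cat sat", "the cat sat", "the cat sat"]
varied = ["the cat sat", "stocks fell today", "rain is coming"]
```

On these toy inputs the repetitive set has a low distinct-1 ratio and a mean pairwise similarity of one, while the varied set has a distinct-1 ratio of one and a mean pairwise similarity of zero.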
Benchmark Evaluation.
The performance of downstream models trained on the generated data can also reflect the generation quality to some extent Yu et al. (2023b); Chung et al. (2023b). Specifically, the impact of synthetic data can be evaluated along multiple dimensions beyond the specialized capabilities of the downstream models. For example, TruthfulQA enables the assessment of a model’s ability to identify true claims Sun et al. (2023); NIV2 is employed to evaluate a model’s language comprehension and reasoning abilities across multiple tasks Wang et al. (2023e).
Open Evaluation.
For open-ended benchmarks, evaluation by humans or auxiliary models is required due to the absence of standardized answers. To fully leverage the preference outputs of the auxiliary models, multiple evaluation strategies have been designed, such as response ranking Xu et al. (2023a), a four-level rating system Wang et al. (2023e), and Elo scores Bai et al. (2022). To further reduce evaluation costs, Sun et al. (2023); Xu et al. (2023a) utilize the automatic evaluation framework based on GPT-4 proposed by Vicuna. However, general LLMs may lack enough knowledge for domain-specific tasks, which hinders them from providing effective evaluation Bran et al. (2023). Therefore, collecting human assessment data to fine-tune open-source models for evaluation purposes is an important practice in real-world scenarios He et al. (2023). Other techniques like Peng et al. (2024, 2023) remain to be further explored.
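As an illustration of how pairwise preference outputs can be turned into a ranking, the sketch below applies the standard Elo update. The constants `k=32`, the 400-point scale, and the starting rating of 1000 are conventional choices assumed here, and the model names and judgments are hypothetical; the cited works may use different settings.

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def elo_update(r_a, r_b, score_a, k=32):
    """Return updated ratings after one comparison; score_a is 1, 0.5, or 0."""
    delta = k * (score_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


def rate(models, judgments, start=1000.0):
    """Replay (model_a, model_b, score_a) judgments and return final ratings."""
    ratings = {m: start for m in models}
    for a, b, score_a in judgments:
        ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], score_a)
    return ratings


# Hypothetical judgments from a human or auxiliary-model evaluator.
ratings = rate(
    ["model_x", "model_y"],
    [("model_x", "model_y", 1), ("model_x", "model_y", 1)],
)
```

Each update moves the two ratings by equal and opposite amounts, so the total rating mass is conserved and the final ordering reflects the accumulated preferences.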
Current multi-step generation algorithms depend on the model’s understanding of task requirements, requiring it to perform complex logical reasoning with limited information. However, in real-world complex scenarios, this limited information may not adequately support effective decision-making. For instance, the generation of mathematical problem-solution pairs entails multiple reasoning steps and may necessitate the utilization of calculator tools for validation. To date, there remains a lack of systematic investigation on how to activate the reasoning and planning capabilities of LLMs for autonomous synthetic data generation. Inspired by prevalent LLMs-based agents like HuggingGPT Shen et al. (2023) and MetaGPT Hong et al. (2023), we believe it would also be quite valuable to develop a data generation agent for industrial applications.
Recent research has found that LLMs’ knowledge is long-tailed and biased Navigli et al. (2023); Fei et al. (2023). Lacking specific domain knowledge, LLMs tend to generate biased, monotonous, and even unfaithful data. Though we have introduced how to mildly guide the data generation with task specification and conditional prompting in the previous sections, such methods still have strong limitations and are not conducive to scalable implementation. Instead, we believe that developing automated condition controls directly on mature domain knowledge bases will significantly improve the efficiency of knowledge enhancement. For example, we can establish certain links between the LLMs and external knowledge graphs Ji et al. (2022) or apply retrieval augmentation from the web Gao et al. (2023b), which is helpful for the definition, decomposition, and reasoning of data features throughout the entire generation process. Additionally, with enhanced domain knowledge, we may also better assess the quality of generated data or even develop automatic evaluation systems. Overall, we believe that knowledge-driven data generation will be a key focus for future studies.
In Section 3.2, we introduced the use of small domain-specific models for data curation. In particular, FreeAL Xiao et al. (2023) has shown the feasibility of low-cost data curation with integrated collaboration between large and small models. The idea of leveraging real-time feedback provided by automated performance evaluation during the data generation process to guide the corresponding adjustments in the following generation hints at an important research direction. However, the exploitation of small LMs at the current stage is simply based on prediction confidence. In the future, we are looking forward to seeing more diversified collaboration modes between large and small models to improve the quality of generated data, e.g., usage of various output information, new design of collaborative architectures, and so on.
Data, as the source of model intelligence, theoretically cannot be generated completely without human intervention. Otherwise, wild synthetic data that carries noisy, toxic information can easily “poison” a model, even resulting in mode collapse. Due to the inherent bias of LLMs, they can hardly be self-aware of the bias in their generated data and may ultimately deviate from our intentions. Thus, designing a human-friendly interactive system that incorporates the necessary human knowledge for annotation and verification is vital and irreplaceable. To date, there is still a lack of a generic framework to standardize and systematize the human-machine collaboration involved in the data production process.
We believe that an appropriate design of such a system must be based on a thorough understanding of the strengths and limitations of human intervention, and should follow the human-centered principle. To achieve sustainable and efficient human involvement, we need comprehensive consideration of various factors such as feasibility, cost, and even labor psychology. For specific examples:
In this paper, we present a systematic review of advancements in synthetic data generation propelled by Large Language Models (LLMs). We aim to offer guidance to enterprises and organizations on effectively building their domain-specific datasets using LLMs. In the meantime, we endeavor to provide insights into the challenges and opportunities within this field, while also proposing potential directions for future research. We hope that our work can promote the rapid production of large amounts of data in various fields and push the limits of data-centric AI. We also envision a fantastic future, where an LLMs community, endowed with human-like abilities such as bionics and communication, may be constructed to generate data for its own self-improvement.