
Survey | Datasets for LLMs

  • Related Project: Private
  • Category: Paper Review
  • Date: 2024-03-01

Datasets for Large Language Models: A Comprehensive Survey

  • url: https://arxiv.org/abs/2402.18041
  • pdf: https://arxiv.org/pdf/2402.18041
  • abstract: This paper embarks on an exploration into the Large Language Model (LLM) datasets, which play a crucial role in the remarkable advancements of LLMs. The datasets serve as the foundational infrastructure analogous to a root system that sustains and nurtures the development of LLMs. Consequently, examination of these datasets emerges as a critical topic in research. In order to address the current lack of a comprehensive overview and thorough analysis of LLM datasets, and to gain insights into their current status and future trends, this survey consolidates and categorizes the fundamental aspects of LLM datasets from five perspectives: (1) Pre-training Corpora; (2) Instruction Fine-tuning Datasets; (3) Preference Datasets; (4) Evaluation Datasets; (5) Traditional Natural Language Processing (NLP) Datasets. The survey sheds light on the prevailing challenges and points out potential avenues for future investigation. Additionally, a comprehensive review of the existing available dataset resources is also provided, including statistics from 444 datasets, covering 8 language categories and spanning 32 domains. Information from 20 dimensions is incorporated into the dataset statistics. The total data size surveyed surpasses 774.5 TB for pre-training corpora and 700M instances for other datasets. We aim to present the entire landscape of LLM text datasets, serving as a comprehensive reference for researchers in this field and contributing to future studies. Related resources are available at: this https URL.

[Index marker for dataset organization]


1 Introduction

With the release of ChatGPT (OpenAI, 2022), in just a few months, Large Language Models (LLMs) have attracted increasing research attention and become a hot research field. Various LLMs have been successively open-sourced, with parameter sizes ranging from several billion to over a hundred billion. Examples include the LLaMA (Touvron et al, 2023a,b), Phi (Gunasekar et al, 2023; Li et al, 2023k; Javaheripi et al, 2023), ChatGLM (Du et al, 2022; Zeng et al, 2023a), QWen (Bai et al, 2023a), Baichuan (Yang et al, 2023a), and so on. A considerable amount of work involves fine-tuning on base models, resulting in well-performing general conversational models or domain-specific models. The widespread adoption of Reinforcement Learning from Human Feedback (RLHF) and the refinement of LLM evaluations further optimize the performance of LLMs. The immense potential demonstrated by LLMs can be attributed, in part, to the datasets used for training and testing. As the saying goes, “You can’t make a silk purse out of a sow’s ear.” Without high-quality datasets as the foundation, it is challenging to grow the tree of LLMs with flourishing branches and leaves. Therefore, the construction and analysis of LLM datasets is an area worthy of attention.

The development of text datasets has undergone several stages, from earlier Natural Language Processing (NLP) task datasets to the current era of LLM datasets. In the 1960s to 1980s, the early stages of NLP primarily focused on fundamental tasks such as semantic analysis and machine translation. The dataset scale was relatively small and typically manually annotated. Later, the Message Understanding Conference (MUC) (Grishman and Sundheim, 1996) began in 1987, focusing on datasets for tasks such as information extraction and Relation Extraction (RE). After 2000, the NLP field continued to emphasize research on traditional tasks and linguistic structures, while also turning attention to emerging areas such as dialogue systems (Paek, 2006; Yan et al, 2017; Devlin et al, 2019; Zhang et al, 2020b). With the rise of deep learning, NLP datasets evolved towards larger scales, greater complexity, more diversity, and increased challenges. Simultaneously, comprehensive performance evaluations (Srivastava et al, 2023; Liang et al, 2023; Li et al, 2023n), dialogue datasets (Zeng et al,

The current explosion in LLM datasets poses challenges for research. On the one hand, it often leads to situations where it is difficult to know where to start when trying to understand and learn about the datasets. On the other hand, there is a lack of systematic organization regarding the differences in types, domain orientations, real-world scenarios, etc., among various datasets. In order to reduce the learning curve, promote dataset research and technological innovation, and broaden public awareness, we conduct a survey of LLM datasets. The objective is to provide researchers with a comprehensive and insightful perspective, facilitating a better understanding of the distribution and role of LLM datasets, thereby advancing the collective knowledge and application of LLMs.

This paper summarizes existing representative datasets across five dimensions: pre-training corpora, instruction fine-tuning datasets, preference datasets, evaluation datasets, and traditional NLP datasets. Moreover, it presents new insights and ideas, discusses current bottlenecks, and explores future development trends. We also provide a comprehensive review of publicly available dataset-related resources. It includes statistics from 444 datasets across 8 language categories spanning 32 different domains, covering information from 20 dimensions. The total data size surveyed exceeds 774.5 TB for pre-training corpora and over 700M instances for other datasets. Due to space constraints, this survey only discusses pure text LLM datasets and does not cover multimodal datasets.

To the best of our knowledge, this is the first survey focused on LLM datasets, presenting the entire landscape. The timeline of LLM datasets is shown in Figure 2. Prior to this, several LLM-related surveys, such as Zhao et al (2023) and Minaee et al (2024), analyze the latest developments in LLMs but lack detailed descriptions and summaries of datasets. Zhang et al (2023g) summarizes the instruction fine-tuning stage of LLMs. Chang et al (2023) and Guo et al (2023c) summarize the evaluation stage. However, these surveys only concentrate on a part of the LLM datasets, and dataset-related information is not the central focus. In contrast to the aforementioned surveys, our paper places emphasis on LLM datasets, aiming to provide a more detailed and exhaustive survey in this specific domain.

The overall organizational structure is illustrated in Figure 1. The remainder of this paper is organized as follows. Section 2 summarizes general pre-training corpora categorized by data types and domain-specific pre-training corpora categorized by domains. It also outlines the preprocessing steps and methods for pre-training data. Section 3 summarizes general instruction fine-tuning datasets categorized by construction methods and domain-specific instruction fine-tuning datasets categorized by domains. 15 instruction categories are provided. Section 4 summarizes preference datasets categorized by preference evaluation methods. Section 5 summarizes evaluation datasets categorized by evaluation domains and synthesizes different evaluation methods. Section 6 summarizes traditional NLP datasets categorized by tasks. Section 7 briefly identifies challenges encountered within the datasets and anticipates future research directions. Section 8 concludes this paper. Detailed descriptions of the datasets can be found in Appendices A through E.

2 Pre-training Corpora

The pre-training corpora are large collections of text data used during the pre-training process of LLMs. Among all types of datasets, the scale of pre-training corpora is typically the largest. In the pre-training phase, LLMs learn extensive knowledge from massive amounts of unlabeled text data, which is then stored in their model parameters. It enables LLMs to possess a certain level of language understanding and generation capabilities. The pre-training corpora can encompass various types of text data, such as webpages, academic materials, and books, while also accommodating relevant texts from diverse domains, such as legal documents, annual financial reports, medical textbooks, and other domain-specific data.

Based on the domains involved in the pre-training corpora, they can be divided into two types. The first type is the general pre-training corpora, which comprise large-scale text data mixtures from different domains and topics. The data commonly includes text content from the Internet, such as news, social media, encyclopedias, and more. The objective is to provide universal language knowledge and data resources for NLP tasks. The second type is the domain-specific pre-training corpora, which exclusively contain relevant data for specific domains or topics. The purpose is to furnish LLMs with specialized knowledge.

As the cornerstones of LLMs, the pre-training corpora influence the direction of pre-training and the potential of models in the future. They play several pivotal roles as follows:

  • Providing Generality. Substantial amounts of text data help models better learn the grammar, semantics, and contextual information of language, enabling them to attain a universal comprehension of natural language.
  • Enhancing Generalization Ability. Data from diverse domains and topics allow models to acquire a broader range of knowledge during training, thereby enhancing their generalization ability.
  • Elevating Performance Levels. Knowledge injection from domain-specific pre-training corpora enables models to achieve superior performance on down- stream tasks.
  • Supporting Multilingual Processing. The inclusion of multiple languages in pre-training corpora empowers models to grasp expressions across diverse linguistic contexts, fostering the development of competencies for cross-lingual tasks.

2.1 General Pre-training Corpora

The general pre-training corpora are large-scale datasets composed of extensive text from diverse domains and sources. Their primary characteristic is that the text content is not confined to a single domain, making them more suitable for training general foundational models. As illustrated in Figure 3, the data types can be categorized into eight major classes: Webpages, Language Texts, Books, Academic Materials, Code, Parallel Corpus, Social Media, and Encyclopedia. The collected and organized information about general pre-training corpora is presented in Table 1 and Table 2.

2.1.1 Webpages

Webpages represent the most prevalent and widespread type of data in pre-training corpora, consisting of text content obtained by crawling a large number of webpages on the Internet. This type of data has several key characteristics.

  • Massive Scale. There is a vast number of websites, and new webpages emerge continuously.
  • Dynamism. Content undergoes continuous updates and changes over time.
  • Multilingualism. It may include content in multiple languages.
  • Rich in Themes. It encompasses content from different domains and subjects.
  • Semi-structured. The data is typically in hypertext markup language (HTML) format, exhibiting certain structural characteristics. However, it may include various modalities such as text, images, videos, and more.
  • Requires Cleaning. It often contains a significant amount of noise, irrelevant information, and sensitive content, making it unsuitable for direct use.

The construction of webpage corpora is commonly pursued through two primary approaches. The first method involves building upon Common Crawl1. Common Crawl is a massive, unstructured, multilingual web corpus that provides public access to web archives by regularly crawling and storing webpage data from the Internet. However, the data in Common Crawl are not clean, containing a lot of irrelevant information, such as advertisements, navigation bars, etc. Additionally, there is a presence of pornographic content, violence, machine-generated spam, and sensitive information involving personal privacy. Consequently, many subsequent pre-training corpora are derived by reselecting and cleaning data from Common Crawl. For instance, RefinedWeb (Penedo et al, 2023), used for pre-training the Falcon model2, undergoes rigorous filtering and deduplication processes on Common Crawl. It ultimately retains high-quality English text totaling 5T tokens. C4 (Raffel et al, 2020), derived from Common Crawl crawler data from April 2019, undergoes processing with multiple filters, removing useless, harmful, and non-English text. In contrast to C4, mC4 (Xue et al, 2021), CC100 (Conneau et al, 2020), OSCAR 22.01 (Abadji et al, 2022), and RedPajama-V2 (Together, 2023) retain multilingual data during the cleaning process, utilizing different cleaning pipelines. CC-Stories (Trinh and Le, 2018) and RealNews (Zellers et al, 2019b) are selected subsets of text content from Common Crawl based on specific themes. CC-Stories filters out text with a story-like style following the Winograd Schema (Levesque et al, 2012) for common-sense reasoning and language modeling. RealNews (Zellers et al, 2019b) extracts a substantial amount of webpages dedicated to news to obtain news data. The above corpora either exclusively contain English or belong to multilingual mixes. CLUECorpus2020 (Xu et al, 2020c) conducts data cleaning on the Chinese portion of Common Crawl, resulting in a high-quality Chinese pre-training corpus of 100GB. However, there still exists a small amount of noise in these corpora. Therefore, some corpora continue with subsequent cleaning efforts. For instance, CulturaX (Nguyen et al, 2023) performs a multi-stage cleaning process after combining the mC4 and OSCAR corpora, resulting in a higher-quality multilingual corpus. The second method involves independently crawling various raw webpages and then employing a series of cleaning processes to obtain the final corpus. WuDaoCorpora-Text (Yuan et al, 2021) is cleaned using over 20 rules from 100TB of raw webpages, covering many domains such as education and technology. Furthermore, webpage data in some multi-category corpora is also constructed using this method, including MNBVC (MOP-LIWU Community and MNBVC Team, 2023), WanJuanText-1.0 (He et al, 2023a), TigerBot pretrain zh corpus (Chen et al, 2023c), and others.

1 https://commoncrawl.org/

2 https://falconllm.tii.ae/

2.1.2 Language Texts

The language text data mainly consists of two parts. The first part is electronic text data constructed based on widely sourced written and spoken language, typically in the form of large corpora for a specific language. The full name of ANC3 is the American National Corpus. The content primarily includes various written and spoken materials in American English. The second edition of the corpus has a scale of 22M words, making it highly suitable for models to learn language. Similarly, BNC4, short for the British National Corpus, encompasses 100M words of electronic text resources, covering spoken and written materials in British English.

The second part is electronic text data constructed based on relevant written materials in various fields or topics. For example, FinGLM (MetaGLM, 2023) covers annual reports of some listed companies between 2019 and 2021. The data type belongs to language text materials in the financial domain. TigerBot-law (Chen et al, 2023c) includes legal regulations from 11 categories such as the Chinese Constitution and the Chinese Criminal Law, falling within the language text materials in the legal domain. News-crawl5 extracts monolingual texts from online newspapers and other news sources, encompassing news text in 59 languages.

2.1.3 Books

Book data is also one of the common types of data in pre-training corpora. Compared to webpages, books have longer textual content and superior data quality, both of which contribute to enhancing the performance of LLMs. This helps improve their ability to capture human language features while learning more profound language knowledge and contextual information. The book data primarily possesses the following characteristics.

3 https://anc.org/

4 http://www.natcorp.ox.ac.uk/

5 https://data.statmt.org/news-crawl/

  • Breadth. It typically covers a wide range of subjects and topics, including novels, biographies, textbooks, and more.
  • High Quality. Books are usually authored by professionals and undergo editing and proofreading, resulting in more accurate grammar and spelling with less noise.
  • Lengthy Text. Longer texts and complex sentence structures provide additional contextual information.
  • Language and Culture. Books often contain rich language features such as professional terminology, colloquialisms, and idioms, reflecting diverse cultural backgrounds.

Book data can be found on e-book websites, with commonly used resources being Smashwords6 and Project Gutenberg7. Smashwords is a large repository of free e-books, containing over 500K electronic books. Project Gutenberg, as the earliest digital library, is dedicated to digitizing and archiving cultural works, and it also boasts a wealth of book resources.

Subsequently, many book corpora are constructed by scraping and cleaning e-book resources. In 2015, Toronto Book Corpus (Zhu et al, 2015) crawled 11,038 e-books from Smashwords, forming a large-scale corpus of books. This corpus was once publicly available but is no longer accessible. In 2019, PG-19 (Rae et al, 2020) collected books published before 1919 from Project Gutenberg and removed short-text books, resulting in a final count of 28,752 books. In 2021, BookCorpusOpen (Bandy and Vincent, 2021) built upon Toronto Book Corpus, Smashwords, and others, creating 17,868 book entries. In 2023, Anna’s Archive8 became the world’s largest open-source and open-data library. The creator scraped books from libraries such as Libgen, Sci-Hub, and made them publicly available. As of February 2024, its size has reached 641.2TB and it is continuously growing.

It is worth mentioning that the fields covered by books are extremely diverse. Thus, fine-grained categorization of books by domain is feasible. It not only facilitates more convenient gap analysis and supplementation but also enables the easy selection of relevant data when focusing on specific domains. Referring to the Chinese Library Classification System9, books can be straightforwardly categorized into 30 classes, as illustrated in Figure 4 for reference.

2.1.4 Academic Materials

Academic material data refers to text data related to the academic field, including but not limited to academic papers, journal articles, conference papers, research reports, patents, and more. These data are authored and published by experts and scholars in the academic community, possessing a high level of professionalism and academic rigor. The academic materials themselves exhibit exceptional quality. Incorporating them into pre-training corpora can provide more accurate and professional information, helping the model understand the terminology and knowledge within the academic domain.

6 https://www.smashwords.com/

7 https://www.gutenberg.org/

8 https://annas-archive.org/datasets

9 http://www.ztflh.com/

The most commonly used corpus currently is arXiv10, which gathers preprints of papers in physics, mathematics, computer science, biology, and quantitative economics. It not only furnishes high-quality academic knowledge but also enables models to grasp the LaTeX format of papers. In addition to arXiv, S2ORC (Lo et al, 2020) encompasses English academic papers from various disciplines. It features extensive metadata, abstracts, reference lists, and structured full-text content. In the medical field, PubMed Central11 has played a role in the open access of nearly 5M biomedical publications.

Pre-training corpora exclusively consisting of academic material data are rare, as most multi-category corpora choose to include academic materials. In The Pile (Gao et al, 2020), academic material data accounts for 38.1%, surpassing the 18.1% proportion of webpage data. In RedPajama-V112, the proportion of academic materials is 2.31%, totaling 28 billion tokens.

2.1.5 Code

The category of code data refers to textual information containing programming languages, such as Python, Java, C++, and other code snippets. Its purpose is to assist models in better understanding programming languages and code structures, enabling them to perform well in downstream tasks like code comprehension, code recommendation, and code generation. Nowadays, LLMs are often leveraged to generate code, facilitating various tasks. The quality of the code data used during model training directly impacts the effectiveness of the generated code, underscoring the significance of code data in model performance.

The main corpora for code data include The Stack (Kocetkov et al, 2023), BIGQUERY (Nijkamp et al, 2023), and Github13. The Stack comprises a diverse collection of 385 programming languages and hosts over 6TB of source code files with open-source licenses. It is specifically tailored for the development of expansive LLMs in the programming domain. BIGQUERY, a subset of the publicly released Google BigQuery corpus14, focuses on six selected programming languages. Github serves as a hosting platform for both open-source and private software projects, supplying a rich array of varied code information. Notably, training data for significant code models like StarCoder (Li et al, 2023j) is sourced from this repository. However, it is crucial to exercise caution during web scraping to adhere to the code usage protocols set by project authors. StackOverflow15 is also a common source of code data. As a Question-and-Answer (Q&A) community dedicated to programming and development, it features questions and answers spanning topics such as programming languages, development tools, and algorithms. StackOverflow is part of StackExchange16, which houses different Q&A sections. Therefore, it is categorized as social media data, as explained in Section 2.1.7. More recently, phi-1 (Gunasekar et al, 2023) is created specifically for training code models. It not only includes a subset of code selected from The Stack and StackOverflow but also utilizes GPT-3.5 (OpenAI, 2023) to generate textbooks and exercise questions related to Python.

10 https://arxiv.org/

11 https://www.ncbi.nlm.nih.gov/pmc/

12 https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T

13 https://github.com/

2.1.6 Parallel Corpus

Parallel corpus data refers to a collection of text or sentence pairs from different languages. These pairs of texts are translations of each other, where one text is in the source language (e.g., English), and the corresponding text is in the target language (e.g., Chinese). The incorporation of parallel corpus data is crucial for enhancing the machine translation capability and cross-lingual task performance of LLMs.

The collection of parallel corpora typically occurs through two main avenues. The first involves extracting text from Internet resources such as webpages. ParaCrawl (Bañón et al, 2020), for instance, utilizes open-source software to crawl webpages, constructing a publicly available parallel corpus. It encompasses 223M filtered sentence pairs. Similarly, MTP17 collects and organizes existing Chinese-English web text data, amassing a total of 300M text pairs. This stands as the currently largest open-source Chinese-English aligned text pair dataset.

The second approach involves the collection of parallel corpora from United Nations multilingual documents. MultiUN (Eisele and Chen, 2010) gathers parallel text pairs through the United Nations Official Document System18. These documents cover the six official languages of the United Nations (Arabic, Chinese, English, French, Russian, and Spanish), as well as a limited amount of German. UNCorpus v1.0 (Ziemski et al, 2016) consists of public domain United Nations official records and other conference documents, aligned at the sentence level.

14 https://cloud.google.com/bigquery?hl=en

15 https://stackoverflow.com/

16 https://stackexchange.com/

17 https://data.baai.ac.cn/details/BAAI-MTP

18 https://documents.un.org/

2.1.7 Social Media

Social media data refers to textual content collected from various media platforms, primarily encompassing user-generated posts, comments, and dialogue data between users. The data reflects real-time dynamics and interactivity among individuals on social media. Despite the potential presence of harmful information such as biases, discrimination, and violence in social media data, it remains essential for the pre-training of LLMs. This is because social media data is advantageous for models to learn expressive capabilities in conversational communication and to capture social trends, user behavior patterns, and more.

The crawling of data on English social media platforms is commonly conducted on platforms such as StackExchange19 and Reddit20. StackExchange is a collection of Q&A pairs covering various topics and stands as one of the largest publicly available repositories of such pairs. Spanning topics from programming to culinary arts, it incorporates a wide range of subjects. Reddit includes a substantial number of user-generated posts along with the corresponding upvote and downvote counts for each post. In addition to serving as social media data, Reddit can also be used to construct a human preference dataset based on the vote counts. WebText (Radford et al, 2019) crawls social media text from 45M webpages on Reddit, ensuring that each link has at least 3 upvotes to guarantee data quality. However, only a tiny fraction of WebText is publicly available. Therefore, OpenWebText (Gokaslan and Cohen, 2019) replicates the construction method of WebText and open-sources the collected social media data. Pushshift Reddit (Baumgartner et al, 2020) has been collecting Reddit data since 2015, providing real-time monthly updates to reduce the time costs for researchers.

Chinese social media data is typically collected from platforms such as Zhihu21 and so on. Zhihu contains high-quality Chinese Q&A pairs and user-created content, making it highly favored for training Chinese LLMs.

2.1.8 Encyclopedia

Encyclopedia data refers to textual information extracted from encyclopedias, online encyclopedia websites, or other knowledge databases. The data from online encyclopedia websites is written and edited by experts, volunteers, or community contributors, providing a certain level of authority and reliability. Due to its ease of accessibility, it is included at a higher frequency in pre-training corpora, serving as a cornerstone in enhancing the knowledge base of LLMs.

The most common encyclopedia corpus is Wikipedia22. It possesses characteristics such as being free, open-source, multilingual, and having high textual value. Frequently, specific language data from Wikipedia is selected, crawled, and filtered to serve as part of the pre-training corpus. In relation to Chinese-language encyclopedia corpora, in addition to the Chinese version of Wikipedia, there is also the Baidu baike corpus23. It covers almost all knowledge domains. TigerBot-wiki (Chen et al, 2023c) is filtered from the data of Baidu baike.

19 https://stackexchange.com/

20 https://www.reddit.com

21 https://www.zhihu.com/

22 https://www.wikipedia.org/

23 https://baike.baidu.com/

2.1.9 Multi-category Corpora

Multi-category corpora contain two or more types of data, which is beneficial for enhancing the generalization capabilities of LLMs. During model pre-training, one can either choose existing open-source multi-category corpora directly for pre-training or select multiple single-category corpora for a certain proportion of mixing. To gain a clear understanding of the distribution of various data types within certain multi-category corpora, pie charts are presented in Figure 5.

In English, there are several multi-category corpora, including RedPajama-V1, The Pile (Gao et al, 2020), TigerBot pretrain en (Chen et al, 2023c) and Dolma (Soldaini et al, 2024). RedPajama-V1 is a partial replication of the pre-training corpora used in the LLaMA model, based on the reports (Touvron et al, 2023a). It encompasses six data types, with webpage data constituting the majority at 87.0%. The overall presentation exhibits a skewed data distribution. In contrast, The Pile has a richer variety of data types, with a more evenly distributed proportion. It is a combination of various subsets, aiming to capture text in as many forms as possible. Similarly, TigerBot pretrain en selects five types of data from open-source corpora, striving for a balanced distribution. To advance open research in the field of pretraining models, the Dolma English corpus, comprising 3T tokens, has been publicly released. This corpus amalgamates content sourced from six distinct domains, namely webpages, academic materials, code, books, social media, and encyclopedia. Furthermore, Dolma provides specific processing guidelines for each data type alongside a comprehensive data curation toolkit.

Chinese multi-category corpora include MNBVC (MOP-LIWU Community and MNBVC Team, 2023) and TigerBot pretrain zh (Chen et al, 2023c). MNBVC does not provide the distribution of data types but encompasses pure-text Chinese data in various forms like news, novels, magazines, classical poetry, chat records, and more. Its goal is to reach 40TB of data, aiming to match ChatGPT. The data collection is still ongoing. TigerBot pretrain zh focuses on web content, encyclopedias, books, and language texts.

Apart from the common Chinese and English corpora, the Beijing Academy of Artificial Intelligence collaborates with other institutions to build the largest open-source Arabic pre-training corpus globally, known as ArabicText 202224. It can be used for training Arabic LLMs.

There are two multilingual and multi-category corpora, namely WanJuanText-1.0 (He et al, 2023a) and ROOTS (Laurençon et al, 2022). WanJuanText-1.0 consists of bilingual Chinese-English data collected from various sources such as webpages, patents, and exam questions. The data is uniformly processed and formatted into jsonl. ROOTS includes 46 natural languages and 13 programming languages, with a total size of 1.6TB.

2.2 Domain-specific Pre-training Corpora

Domain-specific pre-training corpora are tailored for specific fields or topics. This type of corpus is typically employed in the incremental pre-training phase of LLMs. After training a base model on a general pre-training corpus, if the model needs to be applied to downstream tasks in a particular domain, domain-specific pre-training corpora can be further utilized to incrementally pre-train the model. This process enhances the models’ capabilities in a specific domain while building upon a foundation of general proficiency gained from the initial general pre-training. The collected and organized information from the domain-specific pre-training corpora is presented in Table 3 and Table 4. The categorization of the corpus is shown in Figure 6.

24 https://data.baai.ac.cn/details/ArabicText-2022

Table 3 Summary of Domain-specific Pre-training Corpora Information Part I. Public or Not: “All” indicates full open source; “Partial” indicates partially open source. “License” indicates the corpus follows a certain protocol. If the corpus is built upon other corpora, the licenses of the source corpora must also be adhered to

Publisher | Corpus
Fudan University et al. | BBT-FinCorpus
Du Xiaoman | FinCorpus
Knowledge Atlas et al. | FinGLM
Ming Xu | Medical-pt
Princeton University et al. | Proof-Pile-2
NCBI | PubMed Central
TigerBot | TigerBot-earning
TigerBot | TigerBot-law
TigerBot | TigerBot-research
Duomo | TransGPT-pt

2.2.1 Financial Domain

The pre-training corpora in the financial domain contribute to the learning of topics related to the financial market, economics, investment, and finance for LLMs. Text data is normally sourced from financial news, financial statements, company annual reports, financial research reports, financial literature, market data, etc. BBT-FinCorpus (Lu et al, 2023a) is a large-scale Chinese financial domain corpus, comprising four sections: company announcements, research reports, financial news, and social media. It is utilized for pre-training the BBT-FinT5 base model (Lu et al, 2023a). Analogously, the pre-training corpus FinCorpus (Zhang and Yang, 2023) used by XuanYuan (Zhang and Yang, 2023) consists of company announcements, financial information and news, and financial exam questions. FinGLM (MetaGLM, 2023) covers annual reports of listed companies from 2019 to 2021. TigerBot-research (Chen et al, 2023c) and TigerBot-earning (Chen et al, 2023c) focus on research reports and financial reports, respectively. It can be observed that the data types in the financial domain are generally similar, with differences in data timeframes, source websites, and other factors.

2.2.2 Medical Domain

Pre-training corpora in the medical field can provide learning materials for LLMs on topics such as diseases, medical technologies, drugs, and medical research. Data is usually sourced from medical literature, healthcare diagnostic records, case reports, medical news, medical textbooks, and other related sources. Medical-pt (Xu, 2023) has been enhanced using open-access medical encyclopedias and medical textbook datasets, while PubMed Central has opened access to publications related to biomedical research.

2.2.3 Other Domains

  • Legal Domain. Legal text data typically originates from legal documents, law books, legal clauses, court judgments and cases, legal news, and other legal sources. For instance, TigerBot-law (Chen et al, 2023c) has compiled 11 categories of Chinese law and regulations for model learning. Some multi-category corpora have also incorporated data scraped from legal-related websites, such as The Pile (Gao et al, 2020).
  • Transportation Domain. TransGPT (Duomo, 2023), as the first open-source large-scale transportation model in China, has provided the academic community with the TransGPT-pt corpus (Duomo, 2023). The corpus includes rich data related to transportation, such as literature on transportation, transportation technology projects, traffic statistics, engineering construction information, management decision information, transportation terminology, etc.
  • Mathematics Domain. Proof-Pile-2 (Azerbayev et al, 2023) gathers mathematics-related code (in 17 programming languages), mathematical web data, and mathematical papers. It has been utilized to train the mathematical LLM Llemma (Azerbayev et al, 2023). The knowledge in this corpus is up-to-date as of April 2023.

2.3 Distribution Statistics of Pre-training Corpora

Figure 7 provides statistics on 59 pre-training corpora across six aspects: release time, license, data category, construction method, language, and domain. Some observations and conclusions are drawn as follows:

(1) The growth of pre-training corpora was relatively slow before 2018, gradually accelerating until the release of BERT (Devlin et al, 2019), which marked the emergence of pre-trained models and a subsequent increase in pre-training corpora. The subsequent introduction of models such as GPT-2 (Radford et al, 2019), GPT-3 (Brown et al, 2020), T5 (Raffel et al, 2020), and others continued to drive development. However, there were not many open-source pre-training corpora. It wasn’t until the end of 2022, when OpenAI released ChatGPT, that LLMs attracted unprecedented attention. The construction and open-sourcing of pre-training corpora experienced explosive growth in 2023.

(2) The Apache-2.0, ODC-BY, CC0 and Common Crawl Terms of Use licenses are commonly employed in pre-training corpora, offering relatively permissive restrictions for commercial use. Before utilizing any pre-training corpus, it is suggested to review the specific terms and conditions of the applicable license to ensure compliance with relevant regulations.

(3) The diversity of data types in pre-training corpora can impact the overall quality of LLMs. Models experience greater improvements when trained on corpora with a more diverse range of types. Hence, multi-category corpora are preferred, and they are the most numerous. Looking at singular data types, webpage data stands out as the most common in corpora due to its ease of access, large scale, and extensive content (as indicated in Figure 7 (c)).

(4) Corpora necessitate the collection of extensive data and undergo rigorous cleaning processes. Most often, approaches involve either direct manual construction or improvement upon existing open-source data. Occasionally, a combination of both methods is employed. Instances of utilizing data generated by models as pre-training corpora are rare, such as Phi-1 (Gunasekar et al, 2023), which incorporates model-generated Python-related data.

(5) Statistics indicate that corpora in English, Chinese, and multilingual languages receive widespread research and attention. Corpora related to programming languages are also gradually being utilized for the study of code performance in LLMs. However, resources for corpora in other languages are much more limited.

(6) General pre-training corpora take the lead, being applicable to various NLP tasks. The number of open-source domain-specific pre-training corpora is limited, catering to specialized needs for specific fields and offering selectivity for different application scenarios.

2.4 Preprocessing of Pre-training Data

The collected data needs to undergo a preprocessing pipeline to enhance data quality and standardization while reducing harmful and sensitive content. Through a survey of the existing pre-training corpus construction process, a basic data preprocessing workflow has been summarized, as illustrated in Figure 9. Data preprocessing generally consists of five steps: (1) Data Collection. (2) Data Filtering. (3) Data Deduplication. (4) Data Standardization. (5) Data Review.

2.4.1 Data Collection

The preprocessing of data is crucial right from the data collection stage. The quality and distribution of data in the collection phase directly impact the subsequent performance of the model. A comprehensive data collection phase generally involves ten steps.

Step 1: Define Data Requirements. The application scenario of the final model determines the selection of data for the pre-training corpus. Clearly defining specific data requirements, including data types, language, domain, sources, quality standards, etc., helps determine the scope and objectives of data collection.

Step 2: Select Data Source. Selecting appropriate data sources can include various websites, as well as books, academic papers, and other resources. Data sources should align with the requirements, and efforts should be made to ensure that selected sources are reliable. The CulturaX corpus (Nguyen et al, 2023), during construction, employed a blacklist to filter out pages from harmful sources, reducing potential risks in the data. Specialized filters can also be used to exclude low-quality websites in advance.

Step 3: Develop Collection Strategy. The collection strategy encompasses the time span, scale, frequency, and methods of data collection, facilitating the acquisition of diverse and real-time data.

Step 4: Data Crawling and Collection. Utilize web crawlers, APIs, or other data retrieval tools to collect text data from the selected data sources according to the predefined collection strategy. Ensure compliance with legal regulations and the relevant agreements and policies of the websites during the crawling process.

Step 5: Data Extraction and Parsing. Extract textual components from raw data, enabling accurate parsing and separation of text. This may involve HTML parsing (Penedo et al, 2023; Bañón et al, 2020), PDF text extraction (Lo et al, 2020), and similar methods. For example, data crawled from the Internet is often stored in formats such as WARC, WAT and WET. Text from HTML pages can be obtained as plain text from WET files or through alternative methods.
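As an illustration of this step, the sketch below reads the plain-text "conversion" records out of a Common Crawl WET file, assuming the warcio package; the file name in the usage comment is a placeholder.

```python
# Minimal sketch: reading extracted plain text from a Common Crawl WET file.
# Assumes the `warcio` package; the file path is a placeholder.
from warcio.archiveiterator import ArchiveIterator

def iter_wet_texts(wet_path):
    with open(wet_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            # WET files store the plain-text extraction as "conversion" records.
            if record.rec_type == "conversion":
                url = record.rec_headers.get_header("WARC-Target-URI")
                text = record.content_stream().read().decode("utf-8", errors="replace")
                yield url, text

# Usage (hypothetical file name):
# for url, text in iter_wet_texts("CC-MAIN-example.warc.wet.gz"):
#     print(url, len(text))
```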

Step 6: Encoding Detection. Employ encoding detection tools to identify the text encoding, ensuring that text is stored in the correct encoding format. Incorrect encoding may lead to garbled characters or data corruption. In the creation of MNBVC (MOP-LIWU Community and MNBVC Team, 2023), a Chinese encoding detection tool is currently used to rapidly identify encoding across numerous files, aiding in the cleaning process.

Step 7: Language Detection. Utilize language detection tools to identify the language of the text, enabling the segmentation of data into subsets based on different languages and selecting only the required language texts. WanJuanText-1.0 (He et al, 2023a) implements language classification using pycld2.
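A minimal sketch combining Steps 6 and 7 is shown below, assuming the chardet and pycld2 packages; the set of retained languages is an illustrative choice.

```python
# Minimal sketch of Steps 6-7: encoding detection followed by language detection.
# Assumes the `chardet` and `pycld2` packages; the language whitelist is illustrative.
import chardet
import pycld2 as cld2

def detect_encoding_and_language(raw_bytes):
    # Guess the byte encoding before decoding, to avoid garbled characters.
    guess = chardet.detect(raw_bytes)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
    text = raw_bytes.decode(guess["encoding"] or "utf-8", errors="replace")

    # cld2 returns (is_reliable, bytes_found, details); details lists up to 3 languages.
    is_reliable, _, details = cld2.detect(text)
    lang_code = details[0][1] if is_reliable else "unknown"
    return text, lang_code

def keep_document(raw_bytes, wanted=("en", "zh")):
    _, lang = detect_encoding_and_language(raw_bytes)
    return lang in wanted
```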

Step 8: Data Backup. It is advisable to periodically back up the collected data to prevent data loss and damage.

Step 9: Privacy and Legal Compliance. Ensure that the entire process com- plies with data privacy laws and regulations, obtain necessary permissions, and protect personal and sensitive information in the data.

Step 10: Maintenance and Updates. Regularly maintain the data collection system to ensure the continuous updating of data. Consider replacing with new data sources and collection strategies as needed.

2.4.2 Data Filtering

Data filtering is the process of screening and cleaning the data obtained during the data collection stage, with the primary goal of improving data quality. It can be accomplished through model-based methods or heuristic-based methods.

Model-based methods. These methods filter low-quality data by training screening models. High-quality pre-training corpora can be used as positive samples, with the contaminated text to be filtered as negative samples, to train classifiers for filtering. For instance, the creators of WanJuanText-1.0 (He et al, 2023a) take two measures. On one hand, they train content safety models for both Chinese and English content to filter potential harmful data related to topics like obscenity, violence, and gambling. On the other hand, they train data quality models for both Chinese and English to address low-quality content such as advertising and random data in webpages, thereby reducing its prevalence.
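A minimal sketch of such a model-based quality classifier is given below, assuming the fasttext package; the training file, label names, and confidence threshold are hypothetical rather than taken from any of the corpora above.

```python
# Minimal sketch of a model-based quality filter, assuming the `fasttext` package.
# The training file and label names are hypothetical: each line is
# "__label__keep <text>" for high-quality samples or "__label__drop <text>" for
# contaminated / low-quality samples.
import fasttext

# Train a binary quality classifier on labeled examples (hypothetical path).
model = fasttext.train_supervised(input="quality_train.txt", lr=0.5, epoch=5, wordNgrams=2)

def is_high_quality(text, threshold=0.9):
    # fastText expects single-line input; predict returns (labels, probabilities).
    labels, probs = model.predict(text.replace("\n", " "))
    return labels[0] == "__label__keep" and probs[0] >= threshold

# filtered = [doc for doc in documents if is_high_quality(doc)]
```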

Heuristic-based methods. Filtering can be conducted at both the document level and the sentence level.

25 https://pypi.org/project/pycld2/

At the document level, most corpora undergo language filtering to exclude unwanted documents. This step can also be completed during the language detection phase of data collection. Corpora such as RefinedWeb (Penedo et al, 2023) and The Pile (Gao et al, 2020) retain only English text, while WuDaoCorpora-Text (Yuan et al, 2021) and CLUECorpus2020 (Xu et al, 2020c) retain only Chinese text. Subsequently, by setting quality metrics and thresholds, quality filtering heuristic algorithms are applied for filtering (Penedo et al, 2023). Quality metrics may include quality filtering scores (Chen et al, 2023c), text density (Yuan et al, 2021; Laurençon et al, 2022; He et al, 2023a; Raffel et al, 2020; Xue et al, 2021), Chinese character or word counts (Yuan et al, 2021; Laurençon et al, 2022; Nguyen et al, 2023), document length (Zhu et al, 2015; He et al, 2023a), proportion of special characters (Laurençon et al, 2022; Nguyen et al, 2023; He et al, 2023a), number of short lines (Nguyen et al, 2023), perplexity scores (Nguyen et al, 2023), etc. Specific rules can also be set for particular data types. For example, S2ORC (Lo et al, 2020) specifically excludes papers without titles and authors, those that are too short, and those not in English.

At the sentence level, corresponding heuristic rules are set to selectively remove sentences that are not necessary to retain in the corpus. The following rules are primarily applied:

• Assessing the completeness of sentences by filtering out incomplete ones based on semantics and punctuation (Yuan et al, 2021; Xu et al, 2020c; Raffel et al, 2020).

• Removing content involving personal privacy or replacing privacy information with other texts (Yuan et al, 2021).

• Deleting harmful content related to violence, pornography, and more (Yuan et al, 2021; Xu et al, 2020c; Raffel et al, 2020; Xue et al, 2021).

• Removing abnormal symbols (Yuan et al, 2021; Abadji et al, 2022).

• Deleting identifiers such as HTML, CSS, JavaScript, etc. (Yuan et al, 2021; Xu et al, 2020c; Raffel et al, 2020; Nguyen et al, 2023; He et al, 2023a).

• Deleting sentences containing curly braces (Xu et al, 2020c; Raffel et al, 2020).

• Deleting overly short sentences (Xu et al, 2020c; Abadji et al, 2022; Nguyen et al, 2023).

• Removing redundant content, such as like buttons, navigation bars, and other irrelevant elements (Penedo et al, 2023).

• Deleting text containing specific words (Raffel et al, 2020).

Different corpora should have corresponding rules set for cleaning purposes; a minimal sketch of such rule-based filtering follows.
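The sketch below illustrates a few of these sentence-level heuristics (terminal punctuation, HTML/CSS/JavaScript identifiers, curly braces, overly short sentences); the patterns and length threshold are illustrative assumptions rather than rules taken from any particular corpus.

```python
# Minimal sketch of sentence-level heuristic filtering; thresholds and patterns
# are illustrative assumptions rather than rules from a named corpus.
import re

HTML_LIKE = re.compile(r"</?\w+[^>]*>|\{[^}]*\}|\bcss\b|\bjavascript\b", re.IGNORECASE)
TERMINAL_PUNCT = (".", "!", "?", "。", "！", "？")

def keep_sentence(sentence, min_chars=10):
    s = sentence.strip()
    if len(s) < min_chars:              # drop overly short sentences
        return False
    if not s.endswith(TERMINAL_PUNCT):  # drop incomplete sentences (no terminal punctuation)
        return False
    if HTML_LIKE.search(s):             # drop HTML/CSS/JavaScript identifiers and curly braces
        return False
    return True

def clean_document(sentences):
    return [s for s in sentences if keep_sentence(s)]
```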

2.4.3 Data Deduplication

Data deduplication involves removing duplicate or highly similar texts in a corpus. Several typical deduplication methods are listed below:

TF-IDF (Term Frequency-Inverse Document Frequency) Soft Deduping (Chen et al, 2023c). This method involves calculating the TF-IDF weight of each word in the text to compare the similarity between texts. Texts with similarity above a threshold are deleted. TF-IDF weight is the frequency of a word in the text (TF) multiplied by the inverse document frequency (IDF) across the entire corpus. Higher weights indicate that a word frequently appears in a particular text but is uncommon across the entire corpus, making it a key feature of the text.
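A compact sketch of this TF-IDF soft deduplication idea is shown below, assuming scikit-learn as the dependency; the 0.8 threshold is illustrative, and the pairwise comparison is quadratic, so a production pipeline would shard or approximate it.

```python
# Illustrative TF-IDF soft deduplication; scikit-learn is an assumed dependency
# and the similarity threshold is a placeholder, not a value from the survey.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_soft_dedup(texts, threshold=0.8):
    # Weight each word by TF-IDF, then compare documents by cosine similarity.
    tfidf = TfidfVectorizer().fit_transform(texts)
    sim = cosine_similarity(tfidf)

    kept, dropped = [], set()
    for i in range(len(texts)):
        if i in dropped:
            continue
        kept.append(texts[i])
        # Drop later documents that are too similar to the one we keep.
        for j in range(i + 1, len(texts)):
            if sim[i, j] >= threshold:
                dropped.add(j)
    return kept
```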

MinHash (Penedo et al, 2023; Nguyen et al, 2023). This method estimates the similarity between two sets. Texts are processed with random hashing to obtain a set of minimum hash values. Similarity is then estimated by comparing these minimum hash values. This method is computationally and spatially efficient.

SimHash (Yuan et al, 2021; Abadji et al, 2022). This algorithm is used for calculating text similarity. Text feature vectors are hashed to generate a fixed-length hash code. Similarity is estimated by comparing the Hamming distance between text hash codes, with a smaller distance indicating greater similarity.

Other methods. CLUECorpus2020 (Xu et al, 2020c) adopts a duplicate removal operation, retaining only one occurrence when four consecutive sentences appear multiple times. C4 (Raffel et al, 2020) and RefinedWeb (Penedo et al, 2023) also use similar methods. CulturaX (Nguyen et al, 2023) employs URL-based deduplication, removing duplicate documents that share the same URL in the corpus. WanJuanText-1.0 (He et al, 2023a) uses MinHashLSH and n-grams to assess similarity, deleting content with a similarity greater than 0.8.
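For the MinHash and MinHashLSH approaches above, the following sketch uses the datasketch library (an assumed dependency) with word 5-gram shingles and a 0.8 LSH threshold; all parameters are illustrative.

```python
# Illustrative MinHash + LSH deduplication with word n-gram shingles.
# `datasketch` is an assumed dependency; threshold and shingle size are placeholders.
from datasketch import MinHash, MinHashLSH

def minhash_dedup(texts, threshold=0.8, num_perm=128, ngram=5):
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for idx, text in enumerate(texts):
        m = MinHash(num_perm=num_perm)
        tokens = text.split()
        # Hash word n-grams (shingles) of the document into the MinHash signature.
        for i in range(max(1, len(tokens) - ngram + 1)):
            m.update(" ".join(tokens[i:i + ngram]).encode("utf-8"))
        if lsh.query(m):          # a near-duplicate is already indexed
            continue
        lsh.insert(str(idx), m)
        kept.append(text)
    return kept
```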

2.4.4 Data Standardization

Data standardization involves the normalization and transformation of text data to make it more manageable and comprehensible during the model training process. It mainly consists of four steps.

Sentence Splitting. MultiUN (Eisele and Chen, 2010) performs sentence segmentation on extracted text. Chinese text is segmented using a simple regular expression, while other texts use the sentence tokenization module from the NLTK toolkit26. CLUECorpus2020 (Xu et al, 2020c) utilizes PyLTP (Python Language Technology Platform) to separate text into complete sentences, with one sentence per line.
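As a small illustration of this step, the sketch below splits English text with NLTK's sentence tokenizer and Chinese text with a simple regular expression, in the spirit of the MultiUN description; the Chinese pattern is an assumption, not the exact expression used there.

```python
# Minimal sketch of sentence splitting: NLTK for English-like text and a
# simple regular expression for Chinese (illustrative pattern).
import re
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)  # sentence tokenizer models

def split_sentences(text, lang="en"):
    if lang == "zh":
        # Split after Chinese terminal punctuation, keeping the punctuation.
        parts = re.split(r"(?<=[。！？；])", text)
        return [p.strip() for p in parts if p.strip()]
    return sent_tokenize(text)

# split_sentences("今天天气很好。我们去公园吧！", lang="zh")
# -> ["今天天气很好。", "我们去公园吧！"]
```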

2.4.5 Data Review

The data review stage begins by meticulously documenting the previous preprocessing steps and methods for future reference and review. Subsequently, a manual review is conducted to spot-check whether the data processing meets the expected standards. Any issues identified during this review are then provided as feedback to the preceding steps.

26 https://www.nltk.org/

3 Instruction Fine-tuning Datasets

The instruction fine-tuning datasets consist of a series of text pairs comprising “instruction inputs” and “answer outputs.” “Instruction inputs” represent requests made by humans to the model, encompassing various types such as classification, summarization, paraphrasing, and more. “Answer outputs” are the responses generated by the model following the instruction, aligning with human expectations.
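For concreteness, a single fine-tuning sample is commonly stored as a record like the one below; the field names ("instruction", "input", "output") follow a popular Alpaca-style convention and are an assumption rather than a format mandated by the survey.

```python
# Illustrative single-turn instruction sample (Alpaca-style field names are an
# assumption, not a format prescribed by the survey).
sample = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Large language models are pre-trained on massive text corpora ...",
    "output": "LLMs acquire broad language ability by pre-training on large text corpora.",
}

# Datasets are commonly stored as JSON lines, one sample per line:
# {"instruction": "...", "input": "...", "output": "..."}
```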

There are four ways to construct the instruction fine-tuning datasets: (1) manual creation, (2) model generation, for example, using the Self-Instruct method (Wang et al, 2023f), (3) collection and improvement of existing open-source datasets, and (4) a combination of the three aforementioned methods.

The instruction fine-tuning datasets are used to further fine-tune pre-trained LLMs, enabling the models to better comprehend and adhere to human instructions. This process helps bridge the gap between the next-word prediction targets of LLMs and the goal of having LLMs follow human instructions, thereby enhancing the capabilities and controllability of LLMs (Zhang et al, 2023g).

The instruction fine-tuning datasets can be divided into two main categories: general instruction fine-tuning datasets and domain-specific instruction fine-tuning datasets. General instruction fine-tuning datasets encompass various types of instructions across many domains, aiming to enhance the models’ performance across a wide range of tasks. Through fine-tuning, LLMs can better adhere to general instructions. In domain-specific instruction fine-tuning datasets, the instructions are specifically designed for particular domains. For instance, medical instructions enable models to learn and perform tasks like medical diagnostics and healthcare assistance.

3.1 Instruction Category

InstructGPT-sft (Ouyang et al, 2022) categorizes instructions into 10 classes during construction, namely Generation, Open QA, Brainstorming, Chat, Rewrite, Summarization, Classification, Other, Closed QA and Extraction. BELLE train 3.5M CN (BELLEGroup, 2023) expands on this by adding Role-playing, Math, Translation, Code and Harmless categories while removing the Chat and Other categories. Firefly (Yang, 2023) further refines instruction categories, covering 23 classes. Categories such as story generation and lyric generation are subcategories of the original category “Generation.” Considering the current classification status and focusing only on single-turn dialogue instructions, instructions are broadly grouped into 15 classes: Reasoning, Math, Brainstorming, Closed QA, Open QA, Code, Extraction, Generation, Rewrite, Summarization, Classification, Translation, Role-playing, Social Norms, and Others. Concrete examples can be found in Figure 10.

• Reasoning. Deriving new judgments from known premises involves logical reasoning and making inferred assumptions, including processes like Chain-of-thought (CoT), analogical reasoning, inductive reasoning, and more.

• Math. The instructions incorporate mathematical calculations or mathematical reasoning. It can be categorized based on difficulty levels.

• Brainstorming. Generating new ideas around a specific theme, proposing innovative methods. Answers are typically in a bullet-point format. Providing suggestions, giving recommendations and similar demands all fall under brainstorming.

• Closed QA. Select the correct option based on the provided prompts and questions or obtain the answer directly or indirectly from the provided textual information.

• Open QA. For Open QA instructions, questions do not come with options, and answers cannot be directly found within the question. One must rely on their own knowledge base to formulate a response. These questions can include common knowledge queries with standard answers or open-ended inquiries without predefined solutions.

• Code. Questions involving code, including but not limited to code generation, code correction, and code comprehension.

• Extraction. Extract key information from the given content, including named entity recognition (NER), relation extraction (RE), event extraction, and more.

• Generation. Generate original content such as ad copy or articles based on the requirements of the question. Obtaining the answer involves a process of creating something from scratch.

• Rewrite. Process the text according to requirements, including word transformation, style transformation, text ordering, text simplification and expansion, context rewriting, sentence rewriting, text correction, etc.

• Summarization. Summarize and condense the text content, or distill the content into a headline. Specific constraints can be applied when summarizing.

• Classification. Categorize or rate information according to specified requirements, such as topic classification, quality scoring, and so on.

• Translation. Translation between different languages, including translations among various national languages, as well as translation between simplified and traditional Chinese, dialect translations, classical Chinese translations, etc.

• Role-playing. Have the model play a certain role to accomplish a task. It can take on conventional roles such as an expert or a celebrity, or unconventional roles like a madman, an animal, a compiler, and so on.

• Social Norms. Social Norms instructions refer to ethical and moral issues, personal privacy, bias, discrimination, etc. The requirement is to provide answers that adhere to safety norms and align with human values.

• Others. This category can involve instructing the model to use a search engine for real-time information retrieval or providing illogical instructions such as “turn right” or “repeat what I say.”

3.2 General Instruction Fine-tuning Datasets

3.2.1 Human Generated Datasets

Human generated datasets involve manual creation and organization of all instructions by human annotators, following specified requirements and rules, without the assistance of existing LLMs. This type of dataset has evident advantages and disadvantages. Its advantages include:

Table 5 Summary of General Instruction Fine-tuning Datasets Information Part I. Public or Not: “All” indicates full open source; “Partial” indicates partially open source; “Not” indicates not open source. “License” indicates the dataset follows a certain protocol. If the dataset is built upon other datasets, the licenses of the source datasets must also be adhered to

• High Quality. The datasets undergo processing and review by professional annotators, resulting in higher quality and cleanliness.

• Interpretability. After manual processing, the datasets are more easily interpretable and align well with human understanding.

3.2.2 Model Constructed Datasets

Model constructed datasets are built by leveraging an LLM, using various approaches to guide its generation of the instruction data needed by humans. This approach has several advantages compared to human construction:

• Abundant Data. LLMs can generate a vast amount of instructions, especially for content that occurs infrequently in real-world scenarios.

• Cost-Effective and Efficient. It reduces labor costs and time, enabling the acquisition of a large amount of data in a short period.

However, there are potential pitfalls in the content generated by the models, including:

• Variable Quality. The quality of the generated content may not always be high. The model might produce hallucinations, leading to inaccurate or inappropriate instructions. At the same time, the model itself may have inherent biases, and its output may not necessarily align with human values.

• Post-Processing Required. Generated samples need additional post-processing to ensure their quality and applicability before they can be used.

There are generally three methods for constructing datasets with models. The first method involves guiding an LLM to output instructions that meet expectations. Typically, the LLM is given a certain identity (e.g., an expert question setter), along with requirements and examples for instruction generation. This allows the model to follow rules in answering questions or generating new instruction samples. Self-Instruct (Wang et al, 2023f) is a framework that sets initial instructions, automatically generates instruction samples, and iteratively filters them. The Self-Instruct dataset (Wang et al, 2023f) uses 175 manually written instructions as the initial seed set.
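The sketch below shows the general shape of this seed-and-generate loop; call_llm is a placeholder for whatever chat-completion API is used, and the prompt wording, batch sizes, and word-overlap filter are simplified stand-ins for the actual Self-Instruct pipeline (which filters with ROUGE-L similarity).

```python
# Illustrative Self-Instruct-style loop. `call_llm` stands in for any chat-completion
# API; the prompt, sampling sizes, and the crude overlap filter are simplified
# assumptions, not the exact Self-Instruct implementation.
import random

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a real LLM API here")

def generate_instructions(seed_instructions, rounds=3, per_round=8):
    pool = list(seed_instructions)  # e.g. the 175 manually written seed tasks
    for _ in range(rounds):
        examples = "\n".join(f"- {s}" for s in random.sample(pool, min(6, len(pool))))
        prompt = (
            "You are an expert question setter. Here are some example task instructions:\n"
            f"{examples}\n"
            f"Write {per_round} new, diverse task instructions, one per line."
        )
        candidates = [line.strip("- ").strip()
                      for line in call_llm(prompt).splitlines() if line.strip()]
        for cand in candidates:
            cand_words = set(cand.lower().split())
            # Crude near-duplicate filter: skip candidates that share too many words
            # with an existing instruction (Self-Instruct uses ROUGE-L instead).
            if all(len(cand_words & set(p.lower().split())) / max(1, len(cand_words)) < 0.7
                   for p in pool):
                pool.append(cand)
    return pool
```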

27 https://data.baai.ac.cn/details/OL-CC

28 https://github.com/wangrui6/Zhihu-KOL

Other datasets, such as BELLE train 0.5M CN (BELLEGroup, 2023), BELLE train 1M CN (BELLEGroup, 2023), InstructionWild v1 (Ni et al, 2023), and MOSS 002 sft data (Sun et al, 2023b), also adopt this method for construction. Additionally, one can choose other well-performing models to build instruction datasets, like BELLE Generated Chat (BELLEGroup, 2023), BELLE Multiturn Chat (BELLEGroup, 2023), BELLE train 2M CN (BELLEGroup, 2023), BELLE train 3.5M CN (BELLEGroup, 2023), ChatGPT corpus29, Unnatural Instructions (Honovich et al, 2023), MOSS 003 sft plugin data (Sun et al, 2023b), and others.

To obtain higher-quality instructions, RedGPT-Dataset-V1-CN (Yang et al, 2023b) uses pre-existing LLMs to generate multi-turn dialogues. The pre-trained base model is fine-tuned, and the resulting RedGPT model (Yang et al, 2023b) is further used for instruction generation in an iterative manner to obtain a massive amount of high-quality data. WebGLM-QA (Liu et al, 2023e) generates data in three stages: Prompt Formulation, Instruction Inducting, and Few-shot In-context Learning. Wizard evol instruct 196K (Xu et al, 2023b) and Wizard evol instruct 70K (Xu et al, 2023b) use the Evol-Instruct method, subjecting 175 seed instructions to four evolution stages to enhance the complexity of generated instructions.

The second method involves using real interactive conversations between humans and LLMs as instructional datasets. ShareGPT30 allows users to share their dialogue outcomes with ChatGPT. ShareGPT90K31 and OpenChat (Wang et al, 2023a) have compiled tens of thousands of real conversations from ShareGPT. ShareGPT-Chinese-English-90k32 provides human-machine Q&A data in parallel Chinese-English corpora. ShareChat33 translates all acquired ShareGPT data into Chinese. LMSYS-Chat-1M (Zheng et al, 2023a) has gathered real conversation data from 25 LLMs between April and August 2023.

When constructing datasets, a combination of the first two methods can be employed. For instance, MOSS 003 sft data (Sun et al, 2023b) incorporates user data from the MOSS-002 model (Sun et al, 2023b) and generated data from GPT-3.5-Turbo. The third method involves engaging in conversations using multiple LLM agents to obtain dialogue data. CAMEL (Li et al, 2023b) introduces a “role-playing” framework where LLMs generate metadata, creating 50 assistant roles and user roles for the “AI society.” UltraChat (Ding et al, 2023) involves the interaction of multiple ChatGPT APIs in a dialogue. It employs an LSTM (Hochreiter and Schmidhuber, 1997) to process input and output for each round, simultaneously utilizing attention mechanisms to model contextual information.
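The agent-to-agent setup behind this third method can be pictured with the short sketch below. It is a schematic illustration rather than the CAMEL or UltraChat pipeline: `chat` is a hypothetical helper that returns the next reply from any chat LLM, and the role prompts are placeholders.

```python
def chat(system_prompt: str, history: list[dict]) -> str:
    """Hypothetical call to a chat LLM: returns the next reply given a role prompt and history."""
    raise NotImplementedError("plug in your own model/API client here")

def simulate_dialogue(task: str, n_turns: int = 4) -> list[dict]:
    """Role-play between a 'user' agent and an 'assistant' agent to synthesize multi-turn data."""
    user_sys = f"You are a curious user who wants to accomplish: {task}. Ask one question at a time."
    assistant_sys = "You are a helpful expert assistant. Answer the user's last message."
    history: list[dict] = []
    for _ in range(n_turns):
        user_msg = chat(user_sys, history)
        history.append({"role": "user", "content": user_msg})
        assistant_msg = chat(assistant_sys, history)
        history.append({"role": "assistant", "content": assistant_msg})
    return history

# Each simulated conversation becomes one multi-turn training sample, e.g.
# sample = {"task": "plan a trip to Kyoto", "conversations": simulate_dialogue("plan a trip to Kyoto")}
```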

29 https://github.com/PlexPt/chatgpt-corpus

30 https://sharegpt.com/

31 https://huggingface.co/datasets/RyokoAI/ShareGPT52K

32 https://huggingface.co/datasets/shareAI/ShareGPT-Chinese-English-90k

33 https://paratranz.cn/projects/6725

3.2.3 Collection and Improvement of Existing Datasets

The Collection and Improvement of Existing Datasets is also a method for constructing instruction fine-tuning datasets. This method involves integrating and modifying several open-source datasets, ultimately consolidating them into a new dataset for LLM instruction fine-tuning. Such datasets can also be referred to as “Data Repositories.” It offers several advantages:

• Diversity and Comprehensiveness. The resulting datasets possess characteristics of rich data sources, diverse task types, and broad domain coverage.

• Large Scale. The more datasets selected, the larger the scale.

• Time-saving. It reduces the time required for dataset construction.

However, it has its drawbacks:

• Quality and Format Standardization. It is necessary to comprehensively consider the quality of the source datasets and standardize the format of the data.

• Dataset Licenses. It is crucial to pay attention to the licenses of different source datasets to avoid privacy and regulatory issues.

A total of 16 datasets are compiled for this analysis. The source datasets for these “Data Repositories” primarily come from open-source traditional NLP datasets and other instruction fine-tuning datasets.
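Operationally, assembling such a data repository mostly means mapping heterogeneous source schemas into one instruction format while tracking provenance and licenses. The sketch below is a generic illustration with hypothetical adapter names and field layouts, not the pipeline of any dataset listed in this subsection.

```python
import json

# Hypothetical per-source adapters: each maps a raw record into a unified instruction schema.
ADAPTERS = {
    "squad_like":  lambda r: {"instruction": "Answer the question based on the context.",
                              "input": f"{r['context']}\nQuestion: {r['question']}",
                              "output": r["answer"]},
    "alpaca_like": lambda r: {"instruction": r["instruction"],
                              "input": r.get("input", ""),
                              "output": r["output"]},
}

def consolidate(sources: dict[str, tuple[str, list[dict], str]]) -> list[dict]:
    """sources: name -> (adapter_key, records, license). Returns unified, license-tagged samples."""
    unified = []
    for name, (adapter_key, records, license_id) in sources.items():
        adapt = ADAPTERS[adapter_key]
        for record in records:
            sample = adapt(record)
            # Keep provenance so source-dataset licenses can still be honored downstream.
            sample.update({"source": name, "license": license_id})
            unified.append(sample)
    return unified

if __name__ == "__main__":
    demo = {"toy_qa": ("squad_like",
                       [{"context": "Paris is the capital of France.",
                         "question": "What is the capital of France?",
                         "answer": "Paris"}],
                       "CC-BY-SA-4.0")}
    print(json.dumps(consolidate(demo), indent=2, ensure_ascii=False))
```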

CrossFit (Ye et al, 2021). To investigate models’ few-shot learning capabilities across tasks, a collection of 269 NLP task datasets, known as CrossFit, has been assembled, covering 13 task types (Wang et al, 2022). In addition to being used for instruction fine-tuning, this dataset is employed for studying models’ cross-task generalization and transfer learning abilities.

DialogStudio (Zhang et al, 2023c). The DialogStudio dataset has gathered 87 open-source datasets, spanning six major task categories. The dataset integrates each sub-dataset while preserving the original information and is specifically designed for research on LLM instruction fine-tuning.

Dynosaur (Yin et al, 2023a). The Dynosaur dataset is designed to study the dynamic expansion of instruction fine-tuning data. With a focus on minimizing maintenance costs, it incorporates approximately 802K data instances. During construction, metadata from existing NLP datasets is used to generate instructions for various NLP tasks, and the necessary data fields for building the dataset are identified. Furthermore, the dataset achieves dynamic growth by integrating new datasets from the Hugging Face34 data platform.

Flan-mini (Ghosal et al, 2023). The Flan-mini dataset is a subset selected from Flan 2022 (Longpre et al, 2023a). It maintains a high level of task diversity while reducing the overall dataset size. The dataset includes specific tasks from Flan 2022 and additional code-related datasets. Each instruction has been processed, with the random addition of various prompt templates.

Flan 2021 (Wei et al, 2022). The Flan 2021 dataset aggregates 62 existing NLP datasets, covering 12 tasks such as language understanding, generation, translation, and more. The collected datasets are predominantly in English.

34 https://huggingface.co/

Flan 2022 (Longpre et al, 2023a). The Flan 2022 dataset consists of five parts, namely Flan 2021, T0 (Victor et al, 2022), SUPER-NATURAL INSTRUCTIONS (Wang et al, 2022), CoT datasets, and Dialog datasets. It encompasses as many as 1836 tasks. Each instruction provides four distinct instruction input templates, along with the incorporation of zero-shot, few-shot, CoT templates, as well as techniques like task mixing and input reversal.

InstructDial (Gupta et al, 2022). The InstructDial dataset integrates 59 open-source dialogue datasets, covering 48 task types. Its goal is to enhance the models’ performance on dialogue-related tasks through instruction fine-tuning. Models fine-tuned on this dataset exhibit good performance in Out-of-Distribution (OOD) scenarios and few-shot learning.

NATURAL INSTRUCTIONS (Mishra et al, 2022b). The NATURAL INSTRUCTIONS dataset comprises 61 task datasets spanning 6 task types, totaling 193K instances. The dataset maps sub-datasets into a unified task pattern, exploring the cross-task generalization performance of models.

OIG35. The OIG dataset, which stands for Open Instruction Generation, aims to create a collection that includes a large volume of medium-quality instructions and a smaller volume of high-quality instructions. The dataset continues to incorporate new sub-datasets. As of February 2024, it contains 3.88M instructions, predominantly in English.

Open-Platypus (Lee et al, 2023b). The Open-Platypus dataset aims to enhance the logical reasoning capabilities of models and is used to train the Platypus2 (Lee et al, 2023b). By conducting keyword searches on other open-source datasets and using Sentence Transformers (Wolf et al, 2020), questions with a similarity exceeding 80% are filtered out. This process results in approximately 25K English instructions.
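The similarity-based filtering step described for Open-Platypus can be approximated with the sentence-transformers library as sketched below; the embedding model name and the greedy keep-or-drop policy are assumptions for illustration, with only the 80% similarity threshold taken from the description above.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

def dedup_by_similarity(questions: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a question only if its cosine similarity to every already-kept question is below the threshold."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model, not necessarily the one used
    embeddings = model.encode(questions, convert_to_tensor=True, normalize_embeddings=True)
    kept_idx: list[int] = []
    for i in range(len(questions)):
        if all(util.cos_sim(embeddings[i], embeddings[j]).item() < threshold for j in kept_idx):
            kept_idx.append(i)
    return [questions[i] for i in kept_idx]

# Usage:
# filtered = dedup_by_similarity(raw_questions)  # drops near-duplicates above 80% similarity
```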

OPT-IML Bench (Iyer et al, 2022). The OPT-IML Bench dataset comprises 2K NLP task datasets spanning 93 task types. The creators integrate and filter eight large data repositories, including the CrossFit, UnifiedSKG (Xie et al, 2022), PromptSource (Bach et al, 2022), and others. OPT-IML Bench is utilized to investigate the impact of a series of decisions in instruction fine-tuning on the downstream task performance.

PromptSource (Bach et al, 2022). The PromptSource dataset encompasses 176 NLP task datasets across 13 task types. Its strength lies in constructing a diverse set of prompts, offering ample resources for research areas such as instruction fine-tuning.

SUPER-NATURAL INSTRUCTIONS (Wang et al, 2022). The SUPER-NATURAL INSTRUCTIONS dataset comprises 1616 task datasets spanning 76 task types. It holds a linguistic advantage compared to other datasets, covering 55 languages. It is also suitable for studying the OOD capabilities of LLMs.

T0 (Victor et al, 2022). The T0 dataset comprises 62 task datasets spanning 12 task types. Constructed by collecting NLP datasets and modifying prompts, it aims to test the zero-shot generalization capabilities of LLMs across many tasks.

UnifiedSKG (Xie et al, 2022). The UNIFIEDSKG framework proposed by Xie et al (2022) integrates 21 structured knowledge grounding datasets into a text-to-text format, facilitating systematic SKG research. This dataset encompasses six task types, including semantic parsing and knowledge base Q&A.

35https://huggingface.co/datasets/laion/OIG

xP3 (Muennighoff et al, 2023b). The xP3 dataset is a multilingual multitask dataset comprising 82 source datasets spanning 13 task types and 46 languages. Multilingual pretrained models are fine-tuned on this dataset, resulting in model variants such as BLOOMZ and mT0 (Muennighoff et al, 2023b). This exploration investigates performance on cross-lingual tasks.

3.2.4 Datasets Created with Multiple Methods

During the construction of certain general instruction fine-tuning datasets, multiple methods are concurrently employed to leverage the strengths of each, thereby enhancing the datasets’ quality. The three methods are mentioned in previous sections, and through various combinations, four scenarios can be generated: HG & CI, HG & MC, CI & MC, and HG & CI & MC. Here, “HG” stands for Human-Generated Datasets, “MC” for Model-Constructed Datasets, and “CI” for Collection and Improvement of Existing Datasets.

HG & CI. (1) While collecting data from other datasets, manual creation of data is concurrently undertaken to supplement missing task types. Firefly (Yang, 2023) gathers 23 common Chinese NLP tasks and constructs numerous tasks related to Chinese culture, such as couplets and poetry creation. Each task is accompanied by manually written instruction templates to ensure the high quality and richness of the data. (2) The collected data undergoes manual selection. LIMA-sft (Zhou et al, 2023a) includes 1330 instructions carefully chosen and prepared by human experts to validate the importance of high-quality instruction data.

HG & MC. Combine manually authored data with user-model dialogue data. The InstructGPT-sft dataset (Ouyang et al, 2022), used in training the InstructGPT model (Ouyang et al, 2022) by OpenAI, has two sources: one authored by annotators and the other consisting of instructions submitted via API to early models.

CI & MC. (1) Using other datasets as instruction inputs and selecting different models to generate responses. Alpaca GPT4 data (Peng et al, 2023) employs instructions from the Alpaca data (Taori et al, 2023) as input, generating responses using GPT-4 (Achiam et al, 2023). Alpaca GPT4 data zh (Peng et al, 2023) and the Wizard evol instruct zh dataset (Ziang Leng and Li, 2023) translate English instructions into Chinese before invoking models to generate Chinese responses. Bactrian-X (Li et al, 2023c) utilizes a translation API to translate instruction inputs from the Alpaca data and databricks-dolly-15K into 51 languages, then inputs them into ChatGPT to obtain responses. GPT4All (Anand et al, 2023) uses instructions from five public datasets as input and generates responses using GPT-3.5-Turbo. LogiCoT (Liu et al, 2023c) and OpenOrca (Mukherjee et al, 2023) follow similar methods. GuanacoDataset36 expands the language of instruction data from English to Chinese and Japanese. LaMini-LM (Wu et al, 2023) uses the model to simultaneously generate synthetic instructions and responses corresponding to real instructions. These datasets reference existing instructions and are secondarily constructed with the assistance of models. (2) Using open-source datasets as seed instructions to guide the generation of dialogues between models. Baize (Xu et al, 2023a) samples “seeds” from specific datasets, allowing ChatGPT to engage in self-dialogue and batch generate high-quality multi-turn dialogue data. The dialogues cover both general and some vertical domains. (3) Directly constructing input-output text pairs using existing data. LongForm (Köksal et al, 2023) generates complete instructions for existing pre-trained corpus documents using LLMs, then expands them using structured corpus examples and task instances. Luotuo-QA-B (Liao et al, 2023) instructs the model to generate five input-output text pairs for summaries or news content from three datasets.
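The response-regeneration pattern in item (1) above reduces to a simple loop: reuse the instructions of an existing dataset and replace the outputs with a stronger model's answers. The sketch below assumes a hypothetical `generate_response` teacher-model call and Alpaca-style field names.

```python
import json

def generate_response(instruction: str, input_text: str = "") -> str:
    """Hypothetical call to a strong teacher model (e.g., an API-served chat LLM)."""
    raise NotImplementedError("plug in your own model/API client here")

def regenerate_outputs(source_samples: list[dict]) -> list[dict]:
    """CI & MC in miniature: keep the source instructions, replace the outputs with
    teacher-model responses, and record which model produced them."""
    new_samples = []
    for sample in source_samples:
        response = generate_response(sample["instruction"], sample.get("input", ""))
        new_samples.append({
            "instruction": sample["instruction"],
            "input": sample.get("input", ""),
            "output": response,
            "generator": "teacher-model",  # provenance tag; the name is a placeholder
        })
    return new_samples

# Usage (file name is hypothetical):
# with open("alpaca_data.json") as f:
#     regenerated = regenerate_outputs(json.load(f))
```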

36 https://guanaco-model.github.io/

HG & CI & MC. The six datasets below combine the three construction methods mentioned in previous sections. The relevant information is as follows.

(1) COIG (Zhang et al, 2023a). The COIG dataset consists of 191K Chinese instructions categorized into five types. Translated instructions are derived from open-source datasets, and the translation process involves three stages: automatic translation, manual verification, and manual correction. Exam instructions are primarily sourced from Chinese college entrance exams, high school entrance exams, and civil service exams. Human value alignment instructions consist of two series: one focusing on general human value alignment in Chinese regions, and the other on human value alignment specific to certain countries or regional cultures. Counterfactual correction multi-round chat data are built based on the CN-DBpedia knowledge graph dataset (Xu et al, 2017), addressing hallucination issues in LLMs. Leetcode instructions gather programming-related prompts.

(2) HC3 (Guo et al, 2023a). The HC3 dataset has both Chinese and English versions, totaling 37K Q&A pairs. The dataset is designed to compare responses between human experts and ChatGPT across various domains. It can be used for research in areas such as instruction fine-tuning, human value alignment, model response characteristics, and more.

(3) Phoenix-sft-data-v1 (Chen et al, 2023d). The 464K multilingual dialogue samples in the Phoenix-sft-data-v1 dataset are primarily divided into two parts: single-turn instructions and multi-turn conversations. Single-turn instructions include Chinese and English instructions from Alpaca, translated multilingual instructions, and user-generated multilingual instructions. Multi-turn conversations mainly originate from ShareGPT and Discord37.

(4) TigerBot sft en & TigerBot sft zh (Chen et al, 2023c). These two datasets are fine-tuning data for TigerBot (Chen et al, 2023c), containing a large amount of collected open-source data and self-developed data. The construction of the datasets mainly follows five principles: annotating and summarizing 10 instruction categories and 120 sub-task types based on the distribution of instructions; generating instructions using the Self-Instruct method; organizing question and answer data based on manual question generation, web search, and other methods; converting and cleaning the format based on public datasets; and ensuring the overall data distribution conforms to the natural distribution of instructions.

(5) Aya Collection (Singh et al, 2024). The Aya Collection is a comprehensive and large corpus of datasets designed for training multilingual models, aimed at researchers worldwide. It comprises three primary sources of data: templated data, translated data, and the Aya Dataset (Singh et al, 2024). Templated data involves collaboration with fluent speakers to create templates for automatic dataset expansion into various languages. Translated data involves translating a subset of 19 datasets into 101 languages using the NLLB 3.3B machine translation model (Costa-jussà et al, 2022). The Aya Dataset is a human-annotated subset of the overall collection.

37 https://discord.com/

3.3 Domain-specific Instruction Fine-tuning Datasets

The domain-specific instruction fine-tuning datasets are constructed for a particular domain by formulating instructions that encapsulate knowledge and task types closely related to that domain. After fine-tuning the pre-trained base model on the domain-specific instruction fine-tuning datasets, it can be applied to various scenario tasks within that domain, exhibiting outstanding performance. As shown in Figure 13, the domain-specific instruction fine-tuning datasets are categorized into six major classes: medical, code, legal, mathematical, educational, and other domains. The collected and organized information from the domain-specific instruction fine-tuning datasets is presented in Table 7 and Table 8.

Fig. 13 Domain categories of the domain-specific instruction fine-tuning datasets

3.3.1 Medical Domain

Currently, there are numerous open-source large-scale models for medical tasks in both Chinese and English. All of them have constructed instruction fine-tuning datasets in the medical domain for supervised fine-tuning, demonstrating excellent generalization capabilities. In some cases, the performance is even close to that of professional doctors in specific scenarios. CMtMedQA (Yang et al, 2023d) and MedDialog (Zeng et al, 2020) exclusively utilize authentic doctor-patient multi-turn dialogues, where all instructions belong to real-world scenario data. In contrast, ChatMed Consult Dataset (Zhu and Wang, 2023) and ShenNong TCM Dataset (Wei Zhu and Wang, 2023) adopt the Self-Instruct method, utilizing the model to generate medical Q&A data. The former focuses on medical consultations, while the latter concentrates on traditional Chinese medicine knowledge Q&A.

Some datasets are collected and curated from open-source data such as knowledge bases and forums. For instance, Huatuo-26M (Li et al, 2023h) has multiple sources, including medical encyclopedia Q&A, medical knowledge graphs, and doctor-patient Q&A. QiZhenGPT-sft-20k38 formulates instructions based on the content collected from the Qizhen medical knowledge base. Medical-sft39 merges several Chinese and English medical datasets, including the ChatDoctor (Li et al, 2023l) and QiZhenGPT-sft-20k, among others.

In addition to the aforementioned, some datasets may comprise a combination of real and synthetic data or involve manual curation based on existing datasets. ChatDoctor and HuatuoGPT-sft-data-v1 (Zhang et al, 2023b), while collecting authentic doctor-patient dialogues, incorporate conversation data generated by ChatGPT and information from a disease database. DISC-Med-SFT (Bao et al, 2023) and Medical Meadow (Han et al, 2023) meticulously select several data sources, undergoing a certain degree of reconstruction to enhance the overall quality of the datasets.

38 https://github.com/CMKRG/QiZhenGPT

39 https://github.com/shibing624/MedicalGPT

3.3.2 Code Domain

The purpose of the code instruction fine-tuning datasets is to enhance the capabilities of LLMs in tasks such as code generation and tool invocation. Some datasets focus on instructions tailored for code generation tasks. CommitPackFT (Muennighoff et al, 2023a) extracts code files covering 350 programming languages, rigorously filtering and retaining code instruction data for 277 programming languages. Code Alpaca 20K (Chaudhary, 2023) follows the construction method of the Alpaca data (Taori et al, 2023), generating 20K instructions for fine-tuning the Code Alpaca model (Chaudhary, 2023). CodeContest (Li et al, 2022a) merges data collected from Codeforces40, Description2Code (Caballero et al, 2016), and CodeNet (Puri et al, 2021). In addition, some datasets emphasize instructions for tool invocation tasks. ToolAlpaca (Tang et al, 2023) creates a highly diverse tool usage dataset through the construction of a multi-agent simulation environment, fine-tuning the model with 3,928 instances of tool usage. The construction of ToolBench (Anonymous, 2024) involves three stages: API collection, instruction generation, and solution path annotation, aiming to fine-tune the model for tool usage instructions.

40 https://codeforces.com/blog/entry/89502

3.3.3 Legal Domain

Various LLMs in the legal domain have been introduced, but there is a relatively limited availability of open-source legal instruction datasets. Here, we compile information on four partially or fully open-source legal instruction datasets that can be utilized to enhance model capabilities in tasks such as legal Q&A, judgment prediction, and case classification. DISC-Law-SFT (Yue et al, 2023) is divided into two sub-datasets, each introducing legal reasoning abilities and the utilization of external knowledge to the model. Han Fei 1.0 (He et al, 2023c) merges general instructions with legal instructions, aiming to equip the model with legal knowledge while retaining its general capabilities. LawGPT zh (Liu et al, 2023b) includes scenario-based Q&A with legal basis and single-turn legal Q&A obtained through model cleaning. Lawyer LLaMA sft (Huang et al, 2023b) involves model-generated Chinese judicial exam Q&A, legal consultation responses, and multi-turn dialogue data.

3.3.4 Mathematics Domain

The performance and future potential of LLMs in the field of mathematics have always been a focal point of attention. Mathematical problems assess various skills such as computation, reasoning, and spatial thinking, making them inherently challenging. This often results in model performance on mathematical problems falling below expectations. Consequently, one common approach to improving models’ mathematical abilities is to perform supervised fine-tuning using effective mathematical instruction datasets.

BELLE School Math (BELLEGroup, 2023) generates Chinese mathematical problems, including the solution process, through the model. However, the overall difficulty is low, and the answers have not undergone rigorous verification, potentially containing errors. Goat (Liu and Low, 2023) consists entirely of artificially synthesized data for arithmetic tasks, covering addition, subtraction, multiplication, and division operations, with difficulty levels not posing significant challenges for humans. MWP (Lan et al, 2022) unifies eight mathematics-related NLP datasets into instruction format, offering both single-equation and multiple-equation forms. OpenMathInstruct-1 (Toshniwal et al, 2024) leverages the Mixtral-8x7B model (Jiang et al, 2024) to reason over questions from the GSM8K (Cobbe et al, 2021) and MATH (Hendrycks et al, 2021d) datasets, generating a plethora of question-solution text pairs. It significantly enhances the models’ mathematical capabilities.

Currently, there is a scarcity of high-difficulty mathematical instruction datasets, limited by factors such as high entry barriers, complex symbols, high costs, and a lack of open-sourcing.

3.3.5 Education Domain

LLMs in the education domain focus on course guidance, emotional support, child companionship, knowledge learning, and other aspects, serving teachers, students, and parents. Their goal is to become new tools applied in the education industry. LLMs in the education domain undergo fine-tuning using specifically collected education-related instructions. Child chat data41 primarily revolves around the theme of emotional companionship for children, containing both real and synthetic Chinese dialogue data related to emotional companionship for children. Educhat-sft-002-data-osm (Dan et al, 2023) is used for the development of the EduChat project and combines multiple Chinese and English educational instructions and dialogue data. It is used to train models that can provide open-ended questioning, emotional support, essay correction, and other functions in an educational setting. TaoLi data (Yu et al, 2023b) is constructed based on internationally circulated Chinese teaching materials, Hanyu Shuiping Kaoshi (HSK) exams42, Chinese dictionaries, and other resources. It includes various forms of instructions to enable the model to acquire knowledge related to international Chinese education.

3.3.6 Other Domains

Currently, other domain-specific fine-tuning datasets are gradually being open-sourced. The seven domains mentioned below, although having fewer open resources for fine-tuning instructions, still hold significant meaning and value.

Financial Domain. DISC-Fin-SFT (Chen et al, 2023a) is a high-quality Chinese financial dataset. It is utilized for LoRA (Hu et al, 2022a) instruction fine-tuning on the Baichuan-13B-Chat model, ultimately resulting in the financial LLM DISC-FinLLM (Chen et al, 2023a). The dataset comprises 246K instructions categorized into four subtypes: financial consultation, financial tasks, financial calculations, and retrieval enhancement. Sourced diversely from financial NLP datasets, manually curated Q&A pairs, and model-generated dialogues, a portion of this dataset is currently open-sourced.

Geoscience Domain. GeoSignal (Deng et al, 2023) is used for the instruction fine-tuning of K2 (Deng et al, 2023), the first LLM in the field of geoscience. The creators have collected extensive data from various databases and websites in the earth science domain. They have restructured this data into a unified sequence format suitable for tasks such as interpretation, named entity recognition, reasoning, text classification, and Q&A. The original dataset size is 22.6M instances, but after cleaning, 40K data instances have been retained. A complete version is planned for future release.

Mental Health Domain. MeChat (Qiu et al, 2023) is a Chinese psychological health dialogue dataset. Its builders transform real psychological mutual assistance Q&A into multi-turn dialogues using models. The dataset comprises 56K instructions, catering to extended conversational scenarios.

Biology Domain. Mol-Instructions (Fang et al, 2023) consists of three main components: Molecule-oriented instructions, Protein-oriented instructions, and Biomolecular text instructions. The three parts focus on chemical reactions and molecular design, protein prediction, and bioinformatics in biochemistry, respectively. The dataset’s construction involves a combination of human-machine collaboration, database resource processing, and the transformation of biological data.

41 https://github.com/HIT-SCIR-SC/QiaoBan

42 https://www.chinesetest.cn/

IT Domain. Owl-Instruction (Guo et al, 2023b) is utilized for the instruction fine-tuning of the Owl model (Guo et al, 2023b). The instructions are specifically designed for handling IT-related tasks such as troubleshooting, log analysis, etc. The dataset construction involves four stages: data generation, GPT-4 filtering, manual verification, and supervised fine-tuning. It comprises 18K single-turn and multi-turn instructions.

Social Norms Domain. PROSOCIALDIALOG (Kim et al, 2022) is a multi-turn English conversation dataset that instructs models to respond to problematic inputs according to human social norms. The dataset covers various unethical, problematic, biased, and harmful scenarios, created using a human-machine collaboration framework.

Transportation Domain. TransGPT-sft (Duomo, 2023) serves as the fine-tuning component for China’s pioneering open-source TransGPT traffic model (Duomo, 2023). Adopting a dialogue-centric methodology, the dataset involves extracting content from documents in formats like PDFs and Doc files. LLMs are then employed to generate dialogues related to traffic based on the document content.

3.4 Distribution Statistics of Instruction Fine-tuning Datasets

Figure 14 provides statistics on 103 instruction fine-tuning datasets from six aspects: release time, license, data category, construction method, language, and domain. The following conclusions can be drawn:

(1) The number of instruction fine-tuning datasets is showing a growing trend. The widespread attention to LLMs and the application of the instruction fine-tuning paradigm have greatly facilitated the construction and open-sourcing of instruction fine-tuning datasets. The demand for model fine-tuning and research interest in this area are rapidly expanding.

(2) Data licenses to some extent reflect the openness and accessibility of datasets. For instruction fine-tuning datasets, the Apache-2.0 license is the most commonly used, covering 43 datasets, followed by the GPL-3.0 license and the MIT license. This reflects the developers’ inclination towards open and shared data.

(3) The majority of instruction fine-tuning datasets are concentrated in the range of 10K to 1M, totaling 63 datasets. This indicates that, in practical applications, datasets of this scale are sufficient to meet the demand. However, there are relatively fewer small-scale and large-scale datasets, reflecting the challenges and scarcity at both extremes. Small-scale datasets emphasize quality but may lack category richness, while large-scale datasets offer diversity but may be constrained by computational resources and affected by data redundancy.

(4) The “utilizing model-constructed instructions” method is the most prevalent in constructing datasets, highlighting its potential in dataset creation. The quality of such datasets relies primarily on the models’ performance and the guidance provided during construction. The second most common method is “curating existing datasets and improving them,” indicating the active utilization of open-source data. The number of manually generated datasets is comparatively lower due to efficiency and cost considerations. There are 22 datasets that employ combinations of different methods to further enhance dataset quality, suggesting that this approach may become more mainstream in the future.

(5) Chinese and English instruction datasets hold a crucial position in research, garnering greater attention. Mixed Chinese and English, as well as multilingual datasets, show a considerable quantity, indicating that cross-language research is becoming a focus. There is a scarcity of open-source instruction datasets related to programming languages, primarily tailored for specific application scenarios.

(6) The number of general-domain datasets is 67, aligning with the widespread demand for instruction fine-tuning techniques in various application scenarios. Research and construction of instruction datasets for relevant LLMs have also been conducted in common fields such as healthcare, programming, law, etc. There are datasets available in other domains as well, indicating the potential applications of LLMs in diverse disciplines and industries. However, there are still instruction datasets for niche fields awaiting further research and exploration.

4 Preference Datasets

Preference datasets are collections of instructions that provide preference evaluations for multiple responses to the same instruction input. Typically, they consist of pairs of instructions with different responses, along with feedback from humans or other models. This setup reflects the relative preferences of humans or models for different responses within a given task or context. The feedback information in preference datasets is often manifested through voting, sorting, scoring, or other forms of comparison. Figure 15 categorizes various preference datasets based on the methods used for preference evaluation. The collected and organized information on preference datasets is presented in Table 9 and Table 10.

Preference datasets are primarily utilized during the alignment phase of large models, aiming to align the models’ outputs more closely with human preferences and expectations. The alignment with human preferences is manifested in three main aspects: utility, possessing the ability to follow instructions; honesty, avoiding fabrications; and safety, refraining from generating illegal or harmful information (Zhao et al, 2023). Both RLHF (Christiano et al, 2017; Ziegler et al, 2019) and RLAIF (Reinforcement Learning from AI Feedback) (Lee et al, 2023c) employ reinforcement learning methods to optimize models using feedback signals. In addition to fine-tuning with instruction datasets, it is also possible to train reward models with preference datasets. Subsequently, the Proximal Policy Optimization (PPO) algorithm can be applied for further fine-tuning based on the feedback from the reward models (Schulman et al, 2017).
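The reward-modeling step can be summarized with the standard pairwise (Bradley-Terry style) objective sketched below in PyTorch; `reward_model`, the batch layout, and the optimizer are assumed to exist and are not tied to any particular dataset or framework discussed here.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style objective commonly used for reward modeling:
    maximize the log-probability that the preferred response scores higher."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Training-step sketch (reward_model, batch, and optimizer are assumed to exist):
# r_chosen   = reward_model(batch["prompt"], batch["chosen"])    # shape: (batch,)
# r_rejected = reward_model(batch["prompt"], batch["rejected"])  # shape: (batch,)
# loss = pairwise_reward_loss(r_chosen, r_rejected)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```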

4.1 Preference Evaluation Methods

The preference evaluation methods for preference datasets can be categorized into voting, sorting, scoring, and other methods. Each method can be conducted by humans or aligned high-quality LLMs. Human feedback provides preferences that are more aligned with real-world scenarios, capturing intuitive human cognition and language understanding. However, it may exhibit subjectivity and inconsistencies due to individual differences, requiring more time and cost for annotation. Model feedback can leverage learned human preference information and extensive knowledge from high-quality models, saving annotation time and cost. However, it may be influenced by inherent biases in the model, and the feedback information may be less authentic compared to human feedback. In general, a comprehensive approach that combines various forms and sources of preference data may be more advantageous. Figure 16 visually presents various preference evaluation methods.

4.1.1 Vote

The voting method typically involves selecting the better option from two answers or choosing several preferred options from multiple answers. The advantage is its simplicity and intuitiveness, making it easy to collect and reflecting the opinions of the group. However, the drawback is the lack of granularity in information.

Datasets using the “human vote” method are as follows. Chatbot arena conversations (Zheng et al, 2023b) includes examples with answers from two models to the same question and the selection made by a human judge. It comprises outputs from a total of 20 models in 96 languages. The dataset also annotates unsafe conversations for related research. The hh-rlhf dataset (Bai et al, 2022; Perez et al, 2022) includes instances with accepted and rejected answers, where crowdworkers instruct the model to perform a task and choose the more useful and honest answer from two options. MT-Bench human judgments (Zheng et al, 2023b) involves graduate students comparing pairwise preferences for 80 instructions generated by six models. PKU-SafeRLHF (Ji et al, 2023a) focuses on comparing performance and safety preferences. After evaluating the harmlessness of instructions, choices are made based on usefulness and harmlessness in the Q&A format. Each entry in the final dataset includes two answers and feedback information. SHP (Ethayarajh et al, 2022) is crawled from Reddit. Each post contains a question and a pair of answers, with one answer being more favored by Reddit users, constructing a preference dataset reflecting human preferences. Similarly, Zhihu rlhf 3k43 is built in the same way using Zhihu. Summarize from Feedback (Stiennon et al, 2020) is primarily constructed to optimize summarization models. The dataset consists of two parts: one where annotators choose the better of two summaries, and the other where summaries are rated using a Likert scale. The dataset uses both human voting and human scoring.
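Vote-based sources such as SHP are typically converted into (chosen, rejected) pairs by comparing community scores. The sketch below illustrates that conversion with hypothetical field names; the pairing and tie-skipping rules are simplifications, not the exact SHP construction procedure.

```python
def build_preference_pairs(posts: list[dict]) -> list[dict]:
    """For each question, every pair of answers with a clear vote margin becomes one
    (chosen, rejected) example. Field names are illustrative."""
    pairs = []
    for post in posts:
        answers = sorted(post["answers"], key=lambda a: a["score"], reverse=True)
        for i in range(len(answers)):
            for j in range(i + 1, len(answers)):
                if answers[i]["score"] > answers[j]["score"]:  # skip ties
                    pairs.append({
                        "prompt": post["question"],
                        "chosen": answers[i]["text"],
                        "rejected": answers[j]["text"],
                        "margin": answers[i]["score"] - answers[j]["score"],
                    })
    return pairs

# Usage:
# pairs = build_preference_pairs([{"question": "How do I boil an egg?",
#                                  "answers": [{"text": "Boil it for 7 minutes.", "score": 120},
#                                              {"text": "Microwave it whole.", "score": 3}]}])
```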

A representative dataset for the “model vote” method is CValues (Xu et al, 2023d). The CValues dataset encompasses three types of responses: safe and responsible, safe, and unsafe, focusing on the domain of social norms. During construction, models assign different types to various responses, enabling a safety comparison between pairs of responses.

4.1.2 Sort

The sorting method involves arranging multiple responses to the same question in descending order according to predefined criteria. The criteria for sorting are determined by specific requirements. This method provides more detailed information, reflecting the relative preference order, but it is cumbersome to collect and process the sorting information, and the sorting criteria need to be standardized.

43 https://huggingface.co/datasets/liyucheng/zhihu

OASST1 pairwise rlhf reward44 is a representative dataset in this category. It is obtained by post-processing the OASST1 dataset (Wang et al, 2023a), generating data that can be used directly for RLHF. The dialogues in OASST1, constructed by humans and accompanied by quality ratings, allow for direct sorting of different responses based on annotations, reflecting human preferences.

4.1.3 Score

The scoring method involves assigning scores within a certain range to several responses to the same question. This method provides a continuous evaluation, offering a more flexible representation of preference intensity, allowing the model to understand human preferences in a more nuanced manner. However, it is important to note issues related to the uniformity of scoring criteria and subjective awareness in the scoring process.

Some datasets use human scoring to reflect preferences. Stack-Exchange-Preferences (Askell et al, 2021) is derived from StackOverflow, where each answer is assigned a score defined by Askell et al (2021). This score is based on the number of likes the answer receives and whether it is accepted by the question asker. In Summarize from Feedback (Stiennon et al, 2020), a portion of the data involves scoring the quality of different answers using the Likert scale. WebGPT (Nakano et al, 2021) includes examples with two model answers to a question along with relevant metadata. Each answer has a preference score assigned by humans to indicate which answer is better. In addition to human scoring, models can also be used to replace humans in this process. Alpaca comparison data (Peng et al, 2023) involves three models generating different responses, with GPT-4 scoring the quality of the responses. Each example contains one high-quality answer and one low-quality answer. Stable Alignment (Liu et al, 2023d) includes three types of alignment data from simulated social interactions, with multiple different model-generated responses and corresponding scores for each data point. UltraFeedback (Cui et al, 2023) employs models to score four answers along four dimensions, providing detailed textual explanations for improving the answers, thereby enriching the dimensions of instructions, models, and preferences.

4.1.4 Other

In addition to the three methods mentioned earlier, a small portion of preference datasets employs alternative preference evaluation methods.

Medical-rlhf (Xu, 2023). The Medical-rlhf dataset is a Chinese dataset designed for aligning medical models. The dataset consists of 4K examples sampled from a Chinese medical dialogue dataset. Each example includes two responses, with the higher-quality response being an authentic professional reply from a real doctor and the lower-quality response being model-generated. Nevertheless, the dataset has a relatively small scale, and the categorization of high and low quality is too direct and absolute for the given questions.

PRM800K (Lightman et al, 2023). The PRM800K dataset is used for supervised learning of the steps in the CoT process for mathematics. It contains 102K samples of mathematical solutions and 1M step-level labels, covering 12K mathematical problems. Human annotators have labeled each step of the model-generated solutions, providing an assessment of correctness. This supervision method can also be viewed as providing an alignment signal to the model.

4.2 Distribution Statistics of Preference Datasets

Figure 17 provides statistics on 16 preference datasets from six aspects: release time, license, preference evaluation method, construction method, language, and domain. The following conclusions can be drawn:

(1) The introduction of reinforcement learning and the in-depth research on LLM alignment (Christiano et al, 2017; Ziegler et al, 2019; Lee et al, 2023c) have spurred the development of preference datasets, showing a rapid growth trend in 2023. The alignment between models and humans has become an increasingly important aspect.

(2) The majority of preference datasets are available for commercial purposes, with the Apache-2.0 license being predominant among them.

(3) Among all preference evaluation methods, human voting is the most commonly used. This method has a more convenient annotation process and reflects genuine human preferences. The next in popularity are human scoring and model scoring, which present preferences in a more intuitively distinguishable manner through scores. The sorting method and the combination of multiple evaluation methods are rarely used, constrained by the cumbersome process and inconsistencies in standards.

(4) From the perspective of dataset construction, the most common approach for preference datasets is human preference annotation and model-assisted generation of responses of varying quality, as these datasets require annotating feedback information based on different responses. The second approach involves scraping Q&A from social platforms and using metrics like upvotes as a preference indicator.

(5) Preference datasets are predominantly in English, with a small portion in Chinese or a mixture of multiple languages. Overall, preference datasets in languages other than English are relatively scarce.

(6) Preference dataset examples mainly focus on general domains and social norm domains, especially in the realm of social norms. The primary goal is to ensure that LLMs align with human expectations across various general tasks and comprehensive safety considerations. Preference datasets specifically tailored for other vertical domains have not received significant attention at the moment.

5 Evaluation Datasets

Evaluation datasets are a carefully curated and annotated set of data samples used to assess the performance of LLMs across various tasks. Different evaluation datasets focus on different evaluation aspects, providing an objective measure of different models. By solely adjusting the conditions of training, including the pre-training corpora, instruction fine-tuning datasets, and preference datasets, the performance of LLMs on corresponding evaluation datasets can indirectly reflect the quality and effectiveness of the datasets. This, in turn, aids in the ongoing optimization of training data. The collected and organized information on representative existing evaluation datasets is presented in Table 11, Table 12, and Table 13.

Fig. 18 Evaluation categories of the evaluation datasets

5.1 Evaluation Domains

Guo et al (2023c) categorizes LLM evaluations into five evaluation categories based on different dimensions: knowledge and capability evaluation, alignment evaluation, safety evaluation, specialized LLMs evaluation, and evaluation organization. As shown in Figure 18, this paper focuses on the key evaluation domains of each evaluation dataset, finely categorizing 112 datasets into 20 evaluation domains, namely: General, Exam, Subject, Natural Language Understanding (NLU), Reasoning, Knowledge, Long Text, Tool, Agent, Code, OOD, Law, Medical, Financial, Social Norms, Factuality, Evaluation, Multitask, Multilingual, and Other. The “Other” category includes seven sub-domains, such as E-commerce and Few-shot learning.

5.1.1 General

45 https://github.com/lm-sys/vicuna-blog-eval

The former evaluates general performance in Chinese scenarios across nine instruction types, while the latter focuses on evaluating general performance in English scenarios across eight instruction types. The number of instructions in each of these datasets is within 1K, with some limitations in comprehensiveness. SuperCLUE (Xu et al, 2023e) expands the scale of evaluation content. It serves as a comprehensive evaluation benchmark for Chinese general LLMs, designed to assess the effectiveness of current Chinese LLMs. The tasks include multi-turn open-ended Q&A and objective multiple-choice Q&A, with monthly updates and significant reference value.

The second aspect involves assessing the ability of LLMs to follow instructions, especially when faced with complex directives. Datasets like Vicuna Evaluation, AlpacaEval, and BayLing-80 incorporate various types of instructions for this purpose.

5.1.2 Exam

Evaluation datasets within the examination domain are crafted with the specific purpose of formulating instructions derived from significant exam questions across diverse nations. In this scenario, LLMs take on the role of candidates, responding to queries in accordance with specified guidelines. The primary objective is to assess the proficiency of LLMs in comprehending the nuances of question intent and their reservoir of knowledge pertaining to examinations. GAOKAO-Bench (Zhang et al, 2023k) employs Gaokao (China’s National College Entrance Examination) questions as the basis for evaluation, seeking to appraise the proficiency of LLMs across various subjects, encompassing a spectrum of 10 disciplines. AGIEval (Zhong et al, 2023) expands the ambit of inquiries by devising benchmarks centered on human-centric tests, featuring a selection of 20 official, public, and stringent entrance and qualification examinations, including Gaokao, the U.S. SAT, the bar exam, and the national civil service exam. M3Exam (Zhang et al, 2023i) assembles an array of multi-modal, multi-lingual, and multi-tiered sets of multiple-choice questions, sourcing exam questions from primary, secondary, and high school exams in nine countries distinguished by diverse languages.

5.1.3 Subject

Evaluation datasets in academic domains thoroughly gauge the mastery of LLMs in diverse fields, including disciplines like mathematics, law, psychology, and more. C-CLUE46 stands as a benchmark for assessing classical Chinese language comprehension. It centers on tasks like NER and RE, all grounded in a historical knowledge graph. This dataset primarily scrutinizes proficiency within individual disciplines, yet it exhibits limited diversity. MMCU (Zeng, 2023) broadens the horizons by incorporating disciplines such as medicine, law, psychology, and education to measure Chinese semantic comprehension. In the realm of university-level science and engineering, SCIBENCH (Wang et al, 2023d) is tailor-made to evaluate LLMs’ capabilities, demanding the resolution of challenging subjective questions related to mathematics, physics, and chemistry. TheoremQA (Chen et al, 2023b) narrows its focus to 350 theorems from mathematics, physics, finance, and CS & EE (Computer Science & Electrical Engineering). Lastly, ARB (Sawada et al, 2023) introduces a more demanding examination, appraising LLMs’ prowess in text comprehension and domain-specific reasoning. The questions delve into profound knowledge across disciplines such as mathematics, physics, biology, chemistry, and law.

The aforementioned datasets focus on evaluating specific disciplines on a smaller scale. In contrast, some datasets aim to comprehensively assess disciplinary capabilities, encompassing a wide range of subjects. ScienceQA (Lu et al, 2022) gathers multiple-choice questions from 26 subcourses in natural sciences, social sciences, and linguistics. C-Eval (Huang et al, 2023c) compiles 52 diverse subject questions, categorized into four difficulty levels, providing a holistic evaluation of models’ comprehensive subject proficiency in Chinese. Similarly, CG-Eval (Zeng et al, 2023b) requires LLMs to accurately answer 55 sub-subject questions across six major categories for automatic scoring. LLMEVAL-347 concentrates on evaluating proficiency in specialized knowledge, featuring generated questions from 13 academic categories outlined by China’s Ministry of Education and over 50 subcategories. It introduces a “question bank exam” mode. MMLU (Hendrycks et al, 2021b) assesses subjects ranging from traditional fields like mathematics and history to professional areas such as law and ethics, covering 57 subjects with difficulty levels from elementary to professional. As the content of MMLU is in English, CMMLU (Li et al, 2023d) is created as its Chinese counterpart for evaluating subject knowledge proficiency in a Chinese context, covering 67 subjects. M3KE (Liu et al, 2023a), originating from the Chinese education system, collects multiple-choice questions from 71 subjects spanning from primary school to university. XiezhiBenchmark (Gu et al, 2023), covering a record 516 different subjects, attains a scale of approximately 250K questions. Overall, these subject evaluation datasets share a high degree of similarity in data sources, primarily sourced from online materials related to their respective subjects. Additionally, multiple-choice question formats, conducive to automated evaluation, are particularly favored.

46 https://github.com/jizijing/C-CLUE

47 https://github.com/llmeval/llmeval-3

5.1.4 Natural Language Understanding

This class of evaluation datasets aims to comprehensively evaluate the multifaceted abilities of LLMs in natural language understanding (NLU) tasks, covering everything from fundamental comprehension of grammatical structures to advanced semantic reasoning and context handling. MCTS (Chong et al, 2023) and RAFT (Alex et al, 2021) serve as benchmarks for individual NLU tasks. The former stands as the most extensive evaluation dataset for Chinese text simplification, while the latter functions as a benchmark for text classification. Multiple NLU tasks are encompassed by most datasets. GLUE (Wang et al, 2018) incorporates nine English NLU tasks, assessing LLMs in tasks such as sentiment analysis, semantic matching, and textual entailment. Building upon GLUE, SuperGLUE (Wang et al, 2019) raises task difficulty, reflecting LLMs’ performance in a broader scope of language understanding. To evaluate the NLU capabilities of models in the Chinese context, CLUE (Xu et al, 2020b) is constructed with reference to GLUE. Comprising nine Chinese NLU tasks, the CLUE dataset evaluates LLMs in tasks like semantic matching, text classification, and reading comprehension. CUGE (Yao et al, 2021) is organized hierarchically by a language-task-dataset structure, using 21 sub-datasets to evaluate LLMs in language understanding, information retrieval, Q&A, and language generation. SentEval (Conneau and Kiela, 2018) aggregates NLU datasets for 21 sub-tasks.

5.1.5 Reasoning

Reasoning evaluation datasets are designed to gauge the proficiency of LLMs in tasks related to logical reasoning and inference. Chain-of-Thought Hub (Fu et al, 2023) selects eight open-source datasets and evaluates LLMs’ multi-step reasoning performance by utilizing few-shot CoT prompting across domains like mathematics, science, and symbols. Choice-75 (Hou et al, 2023) tasks LLMs with selecting an appropriate decision solution in various given scenarios, assessing their competence in decision reasoning. NeuLR (Xu et al, 2023c) assesses deductive reasoning, inductive reasoning, and abductive reasoning, emphasizing LLMs’ capabilities in these distinct reasoning directions. TabMWP (Lu et al, 2023b), LILA (Mishra et al, 2022a), and miniF2F v1 (Zheng et al, 2022) all scrutinize LLMs’ reasoning prowess in mathematics. The TabMWP dataset requires LLMs to engage in table-based Q&A and mathematical reasoning based on provided text and table data. The LILA dataset serves as a comprehensive mathematical reasoning benchmark, evaluating various mathematical skills, including basic proficiency, algebra, calculus, and more. The miniF2F v1 dataset is a compilation of Olympiad-level mathematical problems, posing a substantial challenge to the mathematical acumen of LLMs. In summary, reasoning evaluation datasets encompass diverse assessment directions, categorized into multi-step reasoning, decision reasoning, deductive reasoning, mathematical reasoning, and other forms of reasoning.

5.1.6 Knowledge

Datasets for evaluating knowledge not only gauge the knowledge retention capabilities of LLMs but also assess additional skills such as knowledge analysis, learning novel information, and knowledge induction. LLMEVAL-2 (Zhang et al, 2023e), derived from external databases, constructs a repository of knowledge questions across 12 domains. Curated by GPT-4, LMExamQA (Bai et al, 2023c) categorizes questions based on the requisite knowledge level, spanning memorization, comprehension, and analysis. KoLA (Yu et al, 2023a) predominantly examines LLMs’ proficiency in grasping and applying world knowledge, categorized into memory, comprehension, application, and creation according to the cognitive hierarchy of knowledge. Serving as an assessment benchmark for LLMs’ command of social knowledge, SocKET (Choi et al, 2023) classifies knowledge into humor and satire, aggressiveness, emotion, credibility, and social facts. While previous datasets evaluate models from the perspective of existing knowledge, the challenge lies in appraising the models’ learning abilities with entirely unfamiliar new knowledge. Hence, Yin et al (2023b) employ the KnowGen method to generate new knowledge, resulting in the inaugural benchmark dataset, ALCUNA (Yin et al, 2023b), for evaluating and scrutinizing the models’ understanding, differentiation, and association capabilities regarding new knowledge.

5.1.7 Long Text

In recent times, numerous LLMs, including ChatGLM248 and Gemini 1.549, have sought to expand the context length of models to the scale of millions of tokens while maintaining performance (Bai et al, 2023b). This has given rise to the development of long text evaluation datasets to better assess the capabilities of LLMs in processing and understanding extensive textual inputs. Notable datasets in this domain include ZeroSCROLLS (Shaham et al, 2023), L-Eval (An et al, 2023), LongEval (Li et al, 2023a), and LooGLE (Li et al, 2023g), all focusing on the evaluation of lengthy English texts. ZeroSCROLLS standardizes datasets from diverse sources into a consistent input format with an average length of 10K words for assessment across 10 natural language tasks. L-Eval serves as a comprehensive evaluation suite for long-context language models, covering input lengths ranging from 4K to 60K words. It encompasses 18 multi-domain tasks involving inference, Q&A, summarization, and more on long documents. LongEval introduces two tasks of varying difficulty, gauging LLM performance in fine-grained topic retrieval and line retrieval with input lengths between 5K and 16K tokens. LooGLE focuses on more challenging tasks with long dependencies, evaluating performance on tasks such as multiple information retrieval and timeline reorder with an average length of 20K words. In contrast, LongBench (Bai et al, 2023b) comprises a diverse set of 14 English tasks, 5 Chinese tasks, and 2 code tasks, with most tasks exhibiting an average length between 5K and 15K tokens. Despite claims of some models supporting 100K+ contexts, the previously mentioned datasets reveal limitations in evaluating such lengths. To address this, InfiniteBench (Zhang et al, 2023j) increases the average length of evaluations in both Chinese and English to 200K tokens, introducing 10 new tasks among the set of 12 evaluation tasks to fill the void in assessing long texts exceeding 100K tokens.
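A synthetic line-retrieval sample in the spirit of LongEval's fine-grained retrieval task can be generated as sketched below; the line template, value range, and exact-match scoring are illustrative assumptions rather than the benchmark's actual format.

```python
import random

def make_line_retrieval_sample(n_lines: int = 2000, seed: int = 0) -> dict:
    """Synthesize one retrieval example: many numbered lines, each carrying a random value,
    plus a question about a single line. Context length grows with n_lines."""
    rng = random.Random(seed)
    values = [rng.randint(10_000, 99_999) for _ in range(n_lines)]
    context = "\n".join(f"line {i}: the REGISTER_CONTENT is {v}" for i, v in enumerate(values))
    target = rng.randrange(n_lines)
    return {
        "context": context,
        "question": f"What is the REGISTER_CONTENT in line {target}?",
        "answer": str(values[target]),
    }

# Scoring is simple exact match:
# sample = make_line_retrieval_sample()
# correct = model_answer.strip() == sample["answer"]
```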

48 https://github.com/THUDM/ChatGLM2-6B

49 https://deepmind.google/technologies/gemini/#introduction

5.1.8 Tool

The datasets for evaluating tools gauge the adeptness of LLMs in utilizing tools and invoking APIs. API-Bank (Li et al, 2023i) replicates real-world scenarios, establishing an API library with 53 commonly used tools for LLMs to call upon. Tasks involving API invocation are designed to assess the models’ abilities to effectively use APIs in fulfilling user requirements within a given conversational context. APIBench (Patil et al, 2023), crafted for evaluation purposes, generates 16,450 instructions derived from 1,645 API documents. These instructions are formatted to suit LLM-friendly chat interactions and are accompanied by evaluation scripts. ToolBench (Xu et al, 2023f), functioning as a benchmark for tool operations, encompasses a variety of software tools employed in real-world tasks. Tool invocations span single-step and multi-step action generation, covering eight subtasks, including open weather and webshop.

5.1.9 Agent

The research and application of LLMs as AI Agents, exemplified by entities like AutoGPT[50] and AgentGPT[51], are continuously advancing. Agent evaluation datasets specifically concentrate on the capabilities of LLMs functioning as Agents. AgentBench (Liu et al, 2023f) undergoes assessment within English scenarios. It stands out as the inaugural benchmark designed to evaluate the performance of LLM-as-Agent across various environments, encompassing eight distinct settings and providing a thorough examination of LLMs’ competence as independent agents. SuperCLUE-Agent[52] is subjected to evaluation within the Chinese context. This dataset gauges the Agent capabilities of LLMs in a Chinese context through three core abilities and ten foundational tasks, covering aspects such as tool usage, task planning, and both short-term and long-term memory.

5.1.10 Code

The coding evaluation datasets aim to assess the capabilities of LLMs in handling programming-related tasks, including but not limited to code interpretation, code generation, code correction, and code optimization. These datasets are primarily categorized into two types. The first type is single-task evaluation. APPS (Hendrycks et al, 2021a) serves as a benchmark for code generation, specifically evaluating the ability to generate Python code. Other datasets such as DS-1000 (Lai et al, 2023), HumanEval (Chen et al, 2021), MTPB (Nijkamp et al, 2023), and ODEX (Wang et al, 2023h) investigate code generation abilities in different forms. DS-1000 introduces data science problems related to seven Python libraries. HumanEval assesses LLMs using manually written programming problems, mitigating data leakage concerns to some extent. MTPB tasks LLMs with synthesizing a subroutine at each step, requiring consideration of both the current task description and previous steps. ODEX extends the variety of natural languages, using English, Spanish, Japanese, and Russian to describe code intent, evaluating LLMs’ abilities to generate code under multilingual descriptions. Additionally, BIRD (Li et al, 2023f) is a large-scale database benchmark for text-to-SQL (Structured Query Language) tasks that, compared to previous popular datasets like Spider (Yu et al, 2018), reduces the gap between academic research and practical applications, enhancing the level of difficulty. The second type is multitask evaluation. CodeXGLUE (Lu et al, 2021) categorizes code abilities into four types based on input-output pairs: code-code, text-code, code-text, and text-text. HumanEvalPack (Muennighoff et al, 2023a) is an extension of HumanEval, covering six programming languages and three code tasks, including code fixing, code comment generation, and code generation.
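Execution-based code benchmarks of this kind are commonly reported with the pass@k metric, where each problem receives n sampled solutions and c of them pass the unit tests. The sketch below implements the widely used unbiased pass@k estimator; the sample counts are illustrative only and not taken from any of the datasets above.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total generated samples, c: samples passing all unit tests,
    k: evaluation budget. Returns the probability that at least one
    of k randomly drawn samples is correct.
    """
    if n - c < k:  # every draw of k samples must contain a correct one
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: three problems, 20 samples each, varying pass counts.
results = [(20, 4), (20, 0), (20, 12)]  # (n, c) per problem
score = sum(pass_at_k(n, c, k=5) for n, c in results) / len(results)
print(f"pass@5 = {score:.3f}")
```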

50 https://github.com/Significant-Gravitas/AutoGPT

51 https://github.com/reworkd/AgentGPT

52 https://github.com/CLUEbenchmark/SuperCLUE-Agent

5.1.11 Out-of-Distribution

Out-of-distribution (OOD) evaluation datasets gauge how well pre-trained base models, after instruction fine-tuning on a subset of tasks, perform on previously unseen tasks. The emphasis is on scrutinizing the robustness of LLMs. Yuan et al (2023) conducted experiments on the BOSS dataset (Yuan et al, 2023), encompassing 5 tasks and 20 sub-datasets, to examine the OOD performance of LLMs. Yang et al (2023c) employed GLUE-X (Yang et al, 2023c) to assess models’ OOD performance and offered insights into measuring and enhancing it.

5.1.12 Law

Legal evaluation datasets play a crucial role in the application of LLMs in the legal domain by providing standardized performance assessments and driving research and development in legal LLMs. The datasets can be categorized based on the linguistic environment they target. LAiW (Dai et al, 2023) and LawBench (Fei et al, 2023) are designed for the Chinese language environment. LAiW serves as a Chinese legal LLMs evaluation benchmark, focusing on 13 foundational tasks across three legal competencies. It compares LLMs in terms of NLP basic capabilities, fundamental application abilities, and complex application capabilities. LawBench, benchmarked on the Chinese legal system, evaluates LLMs’ legal abilities across 20 tasks simulating knowledge retention, understanding, and application, closely related to real-world applications. In the English language environment, LegalBench (Guha et al, 2023) and LexGLUE (Chalkidis et al, 2022) are relevant. LegalBench, constructed with the assistance of cross-disciplinary professionals, is a legal reasoning benchmark comprising six types of legal reasoning and 162 tasks. LexGLUE integrates open-source English legal datasets as an evaluation benchmark, examining legal Q&A and classification tasks.

For a multilingual environment, LEXTREME (Niklaus et al, 2023) and SCALE (Rasiah et al, 2023) are applicable. LEXTREME draws 18 legal-related tasks from 11 open-source datasets, covering 24 languages. SCALE challenges current LLMs in four dimensions: handling long documents, applying legal knowledge, multilingual comprehension, and multitask processing. The benchmark is derived from the Swiss legal system, involving five languages.

5.1.13 Medical

The medical evaluation datasets focus on examining the comprehensive capabilities of LLMs in medical tasks such as term explanation, disease diagnosis, and treatment recommendations. This enables a comparison of the proficiency gap between various medical models and professional doctors. MultiMedQA (Singhal et al, 2023) serves as an evaluation benchmark for medical Q&A, blending multiple open-source datasets and proprietary datasets to assess LLMs’ abilities to address medical queries. QiZhenGPT-eval[53] focuses on drug indication evaluation, tasking LLMs with identifying diseases for which a given drug is suitable. However, single-task datasets are overly restrictive in evaluation dimensions and may not reflect other medical competencies. Consequently, various integrated datasets have been gradually proposed.

CBLUE (Zhang et al, 2022) is an evaluation dataset for Chinese medical language understanding, presenting five medical tasks using authentic medical data. It assesses LLMs in medical text information extraction and medical Q&A. The design of CMB (Wang et al, 2023c) is based on the Chinese language and cultural framework, evaluating LLMs from the perspective of Chinese-style medical exams and complex clinical diagnoses. HuaTuo26M-test (Li et al, 2023h) is randomly sampled from various sources, including medical encyclopedias and knowledge graphs, offering diverse task types. PromptCBLUE[54] transforms 16 different NLP tasks in medical scenarios into an evaluation format, forming the first systematic Chinese benchmark for medical scenarios.

5.1.14 Financial

The financial evaluation dataset, akin to the legal and medical evaluation datasets mentioned in previous sections, focuses on knowledge related to the financial domain, assessing the performance of LLMs in handling financial texts and executing financial tasks. BBF-CFLEB (Lu et al, 2023a) encompasses six sub-datasets for financial tasks, strategically evaluating the language understanding and language generation capabilities of financial models from multiple perspectives. Both FinanceIQ[55] and FinEval (Zhang et al, 2023d) emphasize knowledge and reasoning abilities in financial scenarios, incorporating multiple-choice questions on different financial topics to assess LLMs’ financial knowledge. While the preceding datasets target the Chinese environment, FLUE (Shah et al, 2022) serves as an English-oriented testing benchmark, amalgamating six financial NLP datasets with a focus on NLU in the financial domain. FinBen (Xie et al, 2024) is also an English benchmark dataset for evaluating the capabilities of LLMs in the financial domain. It gathers 35 existing datasets covering 23 financial tasks, categorized into three difficulty levels: fundamental tasks, advanced cognitive engagement, and general intelligence.

53 https://github.com/CMKRG/QiZhenGPT/tree/main/data/eval

54 https://github.com/michael-wzhu/PromptCBLUE

55 https://github.com/Duxiaoman-DI/XuanYuan/tree/main/FinanceIQ

5.1.15 Social Norms

The assessment dataset for societal norms evaluates LLMs across dimensions such as ethics, morality, prejudice, toxicity, and safety. It primarily investigates whether the models generate outputs that violate ethical and legal standards, display biased discrimination, or produce toxic and harmful content in response to unsafe instructions. Datasets of this nature hold significant importance and societal value in the safety scrutiny of LLMs. CrowS-Pairs (Nangia et al, 2020) assesses LLMs for biases and discrimination within the context of American culture, encompassing nine stereotypes related to prejudice, including race, religion, age, and more. SafetyBench (Zhang et al, 2023n) stands as the inaugural benchmark for evaluating LLM safety through multiple-choice questions in both Chinese and English, covering seven distinct safety dimensions. Safety-Prompts (Sun et al, 2023a), featuring 13 safety scenarios and prompt attack evaluation data generated by ChatGPT, enables a comprehensive evaluation of the models’ safety. However, constrained by ChatGPT’s performance, occasional errors may be present in questions or answers. TRUSTGPT (Huang et al, 2023d) evaluates LLMs in three crucial domains: toxicity, bias, and value consistency. Compared to previous mainstream safety benchmarks, SuperCLUE-Safety[56] introduces heightened challenges by incorporating adversarial techniques and multi-turn interactions, thereby enhancing the identification of LLM safety protection capabilities under various adverse inputs.

5.1.16 Factuality

The outputs produced by LLMs may exhibit deviations from the specified input criteria, preceding contextual information, or established facts and knowledge—a phenomenon commonly known as the hallucination of LLMs (Zhang et al, 2023m). Addressing this issue necessitates the use of datasets designed for factual evaluation to gauge the extent of hallucination in LLMs. There are three distinct forms of evaluating the factual accuracy of LLMs.

The first method entails the presentation of various options, prompting LLMs to discern the factually correct choice among alternatives or to assess the factual alignment of the provided content. In the FACTOR dataset (Muhl- gay et al, 2023), each instance comprises a prefix and four completions, with only one completion being factually accurate. LLMs are required to identify the accurate choice based on the given prefix and pertinent knowledge. HaluEval (Li et al, 2023e) furnishes inputs and outputs for tasks like Q&A, dialogue, and text summarization, challenging LLMs to recognize the existence of hallucination.
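One common way to run this kind of completion-selection evaluation is to score each candidate completion by its log-likelihood under the model, conditioned on the prefix, and pick the highest-scoring option. The sketch below assumes a Hugging Face causal LM; the model name, prefix, and candidates are placeholders for illustration and are not drawn from the FACTOR dataset itself.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; in practice any causal LM checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def completion_logprob(prefix: str, completion: str) -> float:
    """Sum of log-probabilities of the completion tokens given the prefix.

    Assumes the prefix tokenization is a prefix of the full tokenization,
    which holds for typical GPT-style tokenizers when the completion
    starts with a space.
    """
    prefix_len = tok(prefix, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prefix + completion, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, :prefix_len] = -100  # ignore prefix tokens in the loss
    with torch.no_grad():
        out = model(full_ids, labels=labels)
    n_completion_tokens = (labels != -100).sum().item()
    return -out.loss.item() * n_completion_tokens  # loss is mean NLL

prefix = "The capital of France is"
candidates = [" Paris.", " Lyon.", " Berlin.", " Madrid."]  # one factual option
scores = {c: completion_logprob(prefix, c) for c in candidates}
print(max(scores, key=scores.get))  # expected to favor the factual completion
```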

The second method entails assessing the factual accuracy of open-ended content generated by LLMs. FActScore (Min et al, 2023) employs information from biographies to create a factual evaluation dataset, incorporating novel evaluation techniques for appraising the factual precision of LLMs in producing extensive content. FactualityPrompt (Lee et al, 2022) similarly evaluates the factual aspects of LLMs in open-text generation, demanding the generation of accurate content under genuine and non-genuine prompts.

56 https://github.com/CLUEbenchmark/SuperCLUE-safety

The third method involves interrogating LLMs to assess the prevalence of hallucinatory phenomena. TruthfulQA (Lin et al, 2022) meticulously devises English questions prone to generating erroneous answers due to potential misunderstandings, evaluating the veracity of LLMs’ responses. Taking cues from this, HalluQA (Cheng et al, 2023) formulates Chinese questions designed to mislead Chinese LLMs, evaluating the hallucinatory tendencies in Chinese LLMs. FreshQA (Vu et al, 2023) acts as a dynamic benchmark for factual Q&A, necessitating not only a mastery of rapidly evolving world knowledge but also the ability to refute incorrect factual premises.

5.1.17 Evaluation

The rise of LLMs has ushered in a fresh paradigm for model evaluation, allowing proficient LLMs to act as evaluators in scoring the outputs of other models. However, the reliability of involving LLMs in assessments, and the variability among different LLMs in appraising the quality of model responses, raise questions. Consequently, datasets falling under the evaluation category are specifically tailored to probe the potential and competence of LLMs as evaluators. FairEval (Wang et al, 2023b) critically examines the model evaluation paradigm to explore the dependability of LLMs as assessors. It utilizes the Vicuna Evaluation dataset[57] as instructions, generating responses from various models, and subsequently engages models such as ChatGPT, GPT-4, and others to evaluate the diverse responses. PandaLM testset (Wang et al, 2023g), enriched with human annotations, serves to validate the assessment capabilities of the trained PandaLM (Wang et al, 2023g) when evaluating other LLMs. LLMEval2 (Zhang et al, 2023l), currently the largest and most diversified English benchmark for evaluating LLMs, spans 15 tasks and 8 abilities, employing innovative methods to gauge the quality of LLMs’ evaluation responses.

5.1.18 Multitask

Multitask evaluation datasets present a thorough examination of LLMs’ comprehensive capabilities, characterized by a substantial task volume, extensive scale, broad domains, and diverse task types. In the realm of English, DecaNLP (McCann et al, 2018) transforms 10 distinct task datasets into a Q&A format, introducing the “Decathlon” multitask challenge within the natural language domain. LMentry (Efrat et al, 2023) provides a swift, automated “unit test,” assessing LLMs’ performance across 25 task types that are relatively simple for human understanding. However, these datasets still lack task type richness. BIG-Bench (Srivastava et al, 2023) impressively includes 95 task types, totaling 204 tasks, covering a wide array of topics such as linguistics, common-sense reasoning, social biases, software development, and more. BBH (Suzgun et al, 2023) carefully selects 23 challenging tasks from BIG-Bench, where previous language models have not surpassed average human performance, presenting a considerable challenge. HELM (Liang et al, 2023) contemplates holistic model evaluation, establishing a comprehensive evaluation system for LLMs with 73 evaluation scenarios and 65 evaluation metrics, ensuring a thorough and rigorous assessment.

57 https://github.com/lm-sys/vicuna-blog-eval

In the Chinese domain, CLEVA (Li et al, 2023n) stands as a comprehensive Chinese evaluation benchmark, featuring 11 application assessment tasks and 20 capability assessment tasks, with a scale reaching 370K. CLiB[58] serves as a Chinese proficiency test list for LLMs, covering LLMs such as GPT-4, ERNIE Bot (Sun et al, 2021b), QWen (Bai et al, 2023a), and supporting multidimensional capability evaluations like classification and information extraction. LLMEVAL-1 (Zhang et al, 2023f), comprising 17 task categories, 5 scoring items, and various evaluation methods, systematically evaluates LLMs. Furthermore, FlagEval[59] scrutinizes the models’ comprehensive performance in both Chinese and English environments, serving as an evaluation toolkit for AI base models capable of assessing over 600 sub-dimensions of base models.

5.1.19 Multilingual

Multilingual evaluation datasets assess the performance of LLMs in cross-lingual tasks using data encompassing multiple languages, contributing to the exploration of LLMs’ capabilities across diverse linguistic challenges. XNLI (Conneau et al, 2018) is specialized for evaluating low-resource language transfer and cross-lingual sentence classification, incorporating 15 languages, including English, French, Spanish, Chinese, and German. Conversely, XTREME (Siddhant et al, 2020) expands language coverage by translating content for four NLP tasks into 40 languages, crossing 12 language families. In essence, multilingual evaluation datasets typically build on traditional NLP tasks, extend language diversity, maintain a moderate task difficulty, and necessitate a wealth of language knowledge.

5.1.20 Other

Apart from the aforementioned assessment datasets, there exist several datasets specifically dedicated to diverse domains, addressing deficiencies in the evaluation landscape. The subsequent section provides an overview of pivotal datasets within seven subdomains for reference.

E-commerce Domain. The EcomGPT eval dataset (Li et al, 2023m) is designed to evaluate the efficacy of LLMs in tasks within the realm of e-commerce. It consists of 6K instances, with 500 instances sampled from each of the 12 held-out datasets tailored for e-commerce evaluation. Tasks in the e-commerce domain are classified into four categories: classification, generation, extraction, and miscellaneous. These tasks span coarse and fine-grained product classification, product title generation, attribute value detection, and e-commerce NER, among others.

Few-shot Learning Domain. The FewCLUE dataset (Xu et al, 2021) has been created with a specific focus on assessing few-shot learning in the Chinese language. Its purpose is to leverage the generalization capabilities of pre-trained models and investigate the practicality of few-shot learning models applied to Chinese. The dataset is composed of nine sub-datasets, with some containing slightly over a hundred annotated samples, providing a means to evaluate model generalization under conditions of extremely limited labeled data.

58 https://github.com/jeinlee1991/chinese-llm-benchmark

59 https://github.com/FlagOpen/FlagEval

Geoscience Domain. The GeoBench dataset (Deng et al, 2023) serves as a means to evaluate the proficiency of language models in tackling questions related to geoscience, assessing their capacity to comprehend and apply knowledge in this domain. The dataset is bifurcated into two sections. The initial segment comprises questions from the Chinese graduate entrance examination in geology and geography, encompassing 182 multiple-choice questions, 150 fill-in-the-blank questions, 454 vocabulary explanation questions, and 335 essay questions. The subsequent segment includes 1,395 multiple-choice questions from advanced research examinations in the United States.

IT Domain. The Owl-Bench dataset (Guo et al, 2023b) serves as a bilingual evaluation benchmark tailored for IT operations and maintenance contexts. It encompasses 317 questions and answers, in addition to 1K multiple-choice questions. The tasks address numerous real-world industrial scenarios, spanning nine distinct subdomains: information security, applications, system architecture, software architecture, middleware, networks, operating systems, infrastructure, and databases.

Multi-turn Interaction Domain. LLMs frequently interact with users across multiple turns, yet assessments typically focus on individual turns, overlooking their interactive capabilities. Thus, the MINT dataset (Wang et al, 2023e) is designed to evaluate LLMs in tasks involving multi-turn interactions, employing tools or utilizing natural language feedback. In this evaluation framework, the model being tested can access tools through the execution of Python code, receiving feedback simulated by GPT-4 to facilitate multi-turn interactive assessments.

Robustness Domain. The PromptBench dataset (Zhu et al, 2023) extensively explores the robustness of LLMs when confronted with seven distinct types of adversarial prompts. Simultaneously, it performs an analysis of the transferability of adversarial prompts generated by various models. The examination of robustness encompasses eight diverse NLP tasks across thirteen open-source datasets, encompassing domains like sentiment analysis, multi-task knowledge, reading comprehension, mathematics, and beyond.

Sentiment Domain. The EmotionBench dataset (Huang et al, 2023a) presents a pioneering benchmark for assessing the empathetic abilities of LLMs, examining how LLMs undergo emotional changes in response to particular situations. Spanning more than 400 scenarios, the dataset covers eight distinct emotional categories: anger, anxiety, depression, frustration, jealousy, guilt, fear, and embarrassment.

5.2 Evaluation Methods

In this section, evaluation methods are classified into three types: code evaluation, human evaluation, and model evaluation. Figure 19 illustrates these three evaluation methods. Notably, code evaluation and model evaluation operate with minimal human intervention, with evaluation results being automatically computed and generated through the pipeline. These two methods are categorized as automated evaluation. In contrast, human evaluation is characterized as a non-automated approach.

The approach of code evaluation entails the extraction of responses from LLMs, referencing authentic annotations, and utilizing code to statistically compute predefined evaluation metrics. The efficacy of LLMs is consequently gauged through the numerical values of these metrics. Prominent evaluation metrics include accuracy, F1 score, BLEU (Papineni et al, 2002), ROUGE (Lin, 2004), Exact Match[60], and the Pearson correlation coefficient[61], among others. For instance, accuracy can be employed in classification tasks to appraise the precision of LLMs’ classifications. In translation tasks, BLEU serves to assess the resemblance between LLMs’ translations and authentic annotations. Certain evaluation datasets not only provide custom calculation methods but also furnish pertinent code, facilitating direct application for the evaluation and analysis of LLMs’ performance. This evaluation methodology is commonly used for objective questions and straightforward subjective questions with predefined answers, such as basic knowledge queries and translation exercises. While its simplicity is beneficial, it may not be as effective for assessing open-ended subjective questions such as those involving generation and brainstorming.
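As a concrete illustration of code evaluation, the snippet below computes two of the metrics mentioned above, Exact Match and token-level F1, in the style commonly used for extractive QA scoring. The normalization rules here are a simplified sketch and not the exact scripts shipped with any particular benchmark.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> int:
    return int(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))   # 1
print(round(token_f1("in Paris, France", "Paris"), 2))   # 0.5
```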

The human evaluation approach, on the other hand, often involves the evaluation of LLM outputs by crowdsourced individuals, trained volunteers, students with relevant expertise, or expert panels. Evaluation methods include quality scoring (as seen in the QizhenGPT eval dataset[62] and the CLiB dataset[63]), quality comparison assessment (Xu et al, 2023e), and similar techniques. This manual evaluation method is versatile, suitable for various question types, especially open-ended subjective inquiries and complex problems lacking standard answers. Nevertheless, its limitation lies in the substantial costs, the need for extensive human resources, and a potential for subjective bias.

The method of evaluating models represents a novel paradigm in which questions, reference answers, evaluation criteria and standards, along with the responses of the tested models, are integrated into an optimal prompt. This combined information is then input to the model for evaluation (Ji et al, 2023b; Zheng et al, 2023b; Zhang et al, 2023h; Dubois et al, 2023; Cheng et al, 2023; Bai et al, 2023c; Guo et al, 2023b). This evaluation approach emphasizes the selection of LLMs with currently high performance and provides suitable evaluation instructions. Its advantage lies in its capacity to substitute for a considerable amount of manual effort, resulting in a quicker evaluation process. Nevertheless, its limitation lies in the dependency on the evaluator LLM’s performance, and its judgments may not always correspond with human values.
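A minimal sketch of this model-evaluation paradigm is shown below: the question, a reference answer, the scoring criteria, and the tested model's response are assembled into a single judging prompt and sent to a strong evaluator model. The `call_evaluator` function is a placeholder for whatever API or local model is used; the prompt wording and scoring scale are illustrative, not a prescribed format.

```python
JUDGE_TEMPLATE = """You are an impartial evaluator.

Question:
{question}

Reference answer:
{reference}

Evaluation criteria:
- Factual correctness with respect to the reference answer
- Completeness and relevance to the question
- Clarity of the response

Response to evaluate:
{response}

Rate the response on a scale of 1 to 10 and briefly justify the score.
Return the result as: SCORE: <number>"""

def build_judge_prompt(question: str, reference: str, response: str) -> str:
    # Combine question, reference answer, criteria, and tested response into one prompt.
    return JUDGE_TEMPLATE.format(question=question, reference=reference, response=response)

def call_evaluator(prompt: str) -> str:
    """Placeholder: send the prompt to a high-performing LLM (API or local) and return its reply."""
    raise NotImplementedError

if __name__ == "__main__":
    prompt = build_judge_prompt(
        question="Explain why the sky appears blue.",
        reference="Rayleigh scattering: shorter (blue) wavelengths scatter more strongly in the atmosphere.",
        response="Because blue light is scattered more by air molecules than red light.",
    )
    print(prompt)  # in a real setup, pass this to call_evaluator(prompt) and parse the SCORE line
```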

60 https://huggingface.co/spaces/evaluate-metric/exactmatch

61 https://libguides.library.kent.edu/SPSS/PearsonCorr

62 https://github.com/CMKRG/QiZhenGPT/tree/main/data/eval

63 https://github.com/jeinlee1991/chinese-llm-benchmark

It is increasingly common to employ a mix of multiple assessment methods (An et al, 2023; Zhang et al, 2023k; Sun et al, 2023a; Sawada et al, 2023; Li et al, 2023i; Wang et al, 2023c; Min et al, 2023; Deng et al, 2023; Liang et al, 2023; Guha et al, 2023; Zhang et al, 2023f,e; Singhal et al, 2023; Xu et al, 2023e; Lin et al, 2022), leveraging the strengths and mitigating the weaknesses of each method. This approach aims to achieve a comprehensive, rigorous, and standardized evaluation.

5.3 Distribution Statistics of Evaluation Datasets

Figure 20 provides statistics on 112 evaluation datasets from eight aspects: release time, license, dataset size, construction method, language, domain, question type, and evaluation method. Based on these statistics, the following conclusions can be drawn:

(1) There is a noticeable upward trend in the evaluation datasets. The ongoing maturation of technologies related to LLMs is driving the expansion of datasets tailored for LLM evaluation. Specifically, in the year 2023, there has been a significant surge in the number of evaluation datasets, reflecting the need for diverse datasets to keep pace with the rapid iteration of LLMs and to improve model performance.

(2) The distribution of evaluation dataset licenses shows a preference for widely recognized licenses such as Apache-2.0 and MIT. The overall pattern of distribution in these protocols underscores the delicate equilibrium sought within the LLM data evaluation domain, balancing knowledge sharing and intellectual property protection. The flexibility provided by open licenses such as Apache-2.0 and MIT contributes to the widespread use and sharing of evaluation datasets, which is essential for advancing related research.

(3) The majority of evaluation datasets fall within the 0-100K size range, with datasets containing fewer than 10K samples constituting 56.4% of the total. This indicates that many tasks can be effectively assessed with relatively small datasets, which may also be due to cost considerations during dataset construction and evaluation. Nevertheless, a few datasets still surpass the 1M mark, mainly derived from web scraping or the consolidation of open-source datasets.

(4) Manual construction and the compilation of open-source datasets are the dominant methods for creating evaluation datasets. Manual construction is often preferred for its precision and relevance to specific domains, whereas the combination of open-source datasets creates common benchmarks for evaluation. The use of model-generated data for evaluation is less common due to concerns about question authenticity and answer accuracy, and it is generally used as a supplemental method.

(5) English language datasets are the most prevalent, with Chinese language datasets also being significant, reflecting the focus on evaluating LLM performance for tasks in these two languages. A limited number of datasets cover evaluations in other languages, and resources for low-resource minority languages are notably scarce.

(6) Evaluation datasets including multiple disciplines and task types are prevalent, underscoring the increased focus on evaluating the holistic capabilities of LLMs. The research community is particularly concerned with the model’s general applicability and extensive knowledge. Various evaluation datasets cover conventional instructions, knowledge domains, social norms, and several prevalent vertical fields. Nevertheless, the distribution of domains within evaluation datasets continues to exhibit a long-tail pattern, with niche areas like e-commerce and earth sciences having limited evaluation resources. Notably, domains like ancient texts and cultures currently lack evaluation benchmarks.

(7) Subjective questions, especially those related to Natural Language Understanding (NLU), dominate the evaluation datasets. A minority of datasets encompasses objective questions, including multiple-choice and fill-in-the-blank formats. Regarding the methodologies employed for evaluation, the widespread use of code-based assessment is attributable to its applicability for objective questions and straightforward subjective tasks, manifesting advantages in efficiency and consistency. Conversely, manual evaluation is unsuitable for extensive tasks and objective questions due to cost considerations and is consequently infrequently utilized. It is crucial to highlight that model evaluation, to some degree, amalgamates the strengths of code-based and manual evaluations, potentially steering towards becoming the predominant evaluation methodology in the future. Naturally, the strategic combination of evaluation methods should consider practical aspects, including the scale and diversity of questions.

6 Traditional NLP Datasets

Diverging from instruction fine-tuning datasets, we categorize text datasets dedicated to natural language tasks before the widespread adoption of LLMs as traditional NLP datasets. These datasets, devoid of instructional formats, are specifically crafted for training, optimizing, and testing traditional NLP models. The resultant NLP models find application in diverse text processing tasks, including text classification, information extraction, text summarization, etc.

In contemporary LLM projects, a plethora of traditional NLP datasets finds application. These datasets serve dual roles: firstly, their format and content are transformed into instructional formats for the instruction-guided fine-tuning phase of LLMs, augmenting the models’ capacities to adhere to instructions and excel in such tasks; secondly, they serve as evaluation datasets for LLMs, enabling the comparison of diverse LLMs in natural language tasks. Notably, several LLM instruction datasets and evaluation datasets emerge from the conversion of traditional NLP datasets. Consequently, this section succinctly summarizes classical traditional NLP datasets commonly integrated into existing LLMs and various LLM evaluation platforms. The objective is to streamline and offer references for traditional NLP datasets, facilitating the dataset selection process for LLM projects.

In this context, the compiled traditional NLP datasets are systematically classified into 15 distinct categories, aligning with various tasks. Figure 21 visually represents these categories, encompassing question answering, recognizing textual entailment, math, coreference resolution, sentiment analysis, semantic matching, text generation, text translation, text summarization, text classification, text quality evaluation, text-to-code, named entity recognition, relation extraction, and multitask. We will summarize various categories of NLP datasets in a straightforward manner using text and tables (Table 14 to Table 30). Detailed information about the datasets is presented in Appendix E.

6.1 Question Answering

The task of question-answering requires the model to utilize its knowledge and reasoning capabilities to respond to queries based on provided text (which may be optional) and questions. This task often includes subcategories like reading comprehension, knowledge QA, and reasoning QA.

6.1.1 Reading Comprehension

The task of reading comprehension entails presenting a model with a designated text passage and associated questions, prompting the model to understand the text for the purpose of answering the questions. Based on the answering approach of the task, it can be roughly classified into four categories: selection & judgment, cloze test, answer extraction, and unrestricted QA.

There are two modes for selection & judgment tasks. Mode one requires the model to select the most appropriate option from several answer options. RACE (Lai et al, 2017) and DREAM (Sun et al, 2019) are specifically selected from English exams designed by human experts, requiring the model to answer multiple-choice questions about the content of given English articles. Similarly, C3 (Sun et al, 2020) and ReClor (Yu et al, 2020b) are extracted from corresponding Chinese exams and graduate entrance exams, respectively, each containing relevant multiple-choice questions. Mode two involves judging the correctness of a question using either “Yes” or “No.” BoolQ (Clark et al, 2019) requires the model to respond with “Yes” or “No” to complex inquiries and non-factual information. CondaQA (Ravichander et al, 2022), as the first English dataset to assess negation statements, tests the model’s understanding of negative assertions, with answers in the form of “Yes,” “No,” or “Don’t Know.” PubMedQA (Jin et al, 2019), focusing deeply on the biomedical field, presents higher professional knowledge requirements, necessitating judgment on the correctness of questions based on the abstracts of medical articles.

The cloze task requires the model to select a word or sentence to fill in the missing part of the text, making the text coherent and logical. Tasks are typically set at both the word and sentence levels. LAMBADA (Paperno et al, 2016) and CLOTH (Xie et al, 2018) are English word-level cloze datasets. By perceiving the context, the model predicts the positions of missing words in the sentences. ChID (Zheng et al, 2019) requires the model to choose the correct idiom to fill in the blank, focusing on testing the model’s understanding of Chinese idioms. CMRC2019 (Cui et al, 2020) is a sentence-level cloze-style dataset that requires the model to fill in several blank spaces in the article with candidate sentences.

The answer extraction task involves the model pinpointing a continuous excerpt within the text as the answer to a given question. Fundamentally, the answers to the questions can be extracted or composed directly from the textual content, eliminating the necessity of generating supplementary open-ended content. SQuAD (Rajpurkar et al, 2016) extracts text passages and answers to questions from Wikipedia articles for answer extraction tasks. SQuAD 2.0 (Rajpurkar et al, 2018) extends the SQuAD dataset by adding unanswerable questions, testing the models’ ability to judge ambiguous questions. Adversarial QA (Bartolo et al, 2020) expands upon the SQuAD dataset by creating more challenging questions using adversarial human annotations. Additionally, other datasets such as TriviaQA (Joshi et al, 2017), Natural Questions (Kwiatkowski et al, 2019), and CMRC2018 (Cui et al, 2019) feature more complex, challenging, and realistic reading comprehension questions.
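Answer extraction of this kind is typically modeled as span prediction. As a rough sketch, the Hugging Face question-answering pipeline returns the extracted span together with its character offsets; the checkpoint named below is only an example of a SQuAD-style extractive QA model, and the context text is made up for illustration.

```python
from transformers import pipeline

# Model choice is illustrative; any extractive QA checkpoint fine-tuned on SQuAD-style data works.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

context = (
    "SQuAD pairs Wikipedia passages with crowd-sourced questions whose answers "
    "are spans of text taken directly from the passage."
)
result = qa(question="Where do SQuAD answers come from?", context=context)
# The pipeline returns the span text, its start/end offsets, and a confidence score.
print(result["answer"], result["start"], result["end"], round(result["score"], 3))
```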

The unrestricted QA task exhibits greater openness when contrasted with answer extraction tasks. The task entails producing a fitting response by leveraging both textual content and a posed question. The answer, rather than being an exact extraction from the text, is openly generated by the models. Presently, this task category stands as a predominant focus in the evaluation of LLMs. DROP (Dua et al, 2019) and QASPER (Dasigi et al, 2021) assess models’ reasoning ability to generate open-ended answers. Answers cannot be directly extracted from the text but require models to search for clues from multiple sources and then perform certain operations. CoQA (Reddy et al, 2019) measures models’ ability to answer related questions, with answers being in free-form text. Compared to the previous datasets, DuReader 2.0 (He et al, 2018) expands the scale of text and questions, conducting open-domain Q&A at the document level.

6.1.2 Knowledge QA

In the knowledge QA task, models respond to questions by leveraging world knowledge, common sense, scientific insights, domain-specific information, and more. Unlike reading comprehension tasks, each instance does not come with a reference text. This task assesses the model’s depth of knowledge and its capacity to comprehend questions. ARC (Clark et al, 2018), CommonsenseQA (Talmor et al, 2019), and OpenBookQA (Mihaylov et al, 2018) evaluate models’ knowledge mastery and comprehension abilities based on scientific facts and human common sense. These datasets emphasize general knowledge known to the general public. However, some datasets place more emphasis on testing vertical domain knowledge. PIQA (Bisk et al, 2020) and SciQ (Welbl et al, 2017) examine knowledge of science, JEC-QA (Zhong et al, 2020) examines legal analysis, WebMedQA (He et al, 2019) examines medical diagnosis, and PsyQA (Sun et al, 2021a) examines psychological counseling.

6.1.3 Reasoning QA

The focal point of reasoning QA tasks is the requirement for models to apply abilities such as logical reasoning, multi-step inference, and causal reasoning in answering questions. These types of questions typically necessitate models to grasp the logical connections within the text, deduce concealed information, and arrive at sensible conclusions.

HellaSwag (Zellers et al, 2019a), Social IQa (Sap et al, 2019), ROPES (Lin et al, 2019), and WIQA (Tandon et al, 2019) are grounded in contextual reasoning, aiming to enable models to infer the subsequent development direction based on given contexts. COPA (Roemmele et al, 2011) specifically tests causal reasoning ability, selecting appropriate causal relationships based on premises. LogiQA (Liu et al, 2021) extensively investigates logical reasoning, covering various deductive patterns. Thus, it is evident that datasets for reasoning question answering tasks involve different dimensions of reasoning.

6.2 Recognizing Textual Entailment

The primary objective of tasks related to Recognizing Textual Entailment (RTE) is to assess whether information in one textual segment can be logically inferred from another. This is formally structured with a “premise” denoted as P and a “hypothesis” denoted as H, aimed at determining the relationship between P and H. If P logically entails H, it is categorized as “Entailment”; if P and H are logically contradictory, it is categorized as “Contradiction”; if there is no discernible logical connection or contradiction between P and H, it is categorized as “Neutral.” In some instances, the latter two scenarios are combined into “Non-Entailment.”

For example, RTE (Dagan et al, 2006; Bar-Haim et al, 2006; Giampiccolo et al, 2007; Bentivogli et al, 2009) integrates a portion of the Recognizing Textual Entailment challenge datasets, comprising two types of relationships: “Entailment” and “Non-Entailment.” CommitmentBank (De Marneffe et al, 2019), OCNLI (Hu et al, 2020), and CINLID[64] expand the judgment of relationships to three types. ANLI (Nie et al, 2020) introduces adversarial samples, increasing the difficulty of textual relationship judgment and making it more challenging.
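As a minimal illustration of this premise-hypothesis label scheme, an off-the-shelf NLI classifier can be queried with a sentence pair; the checkpoint below (trained on MNLI) is only an example, and its three-way label set mirrors the Entailment/Contradiction/Neutral categories described above.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"  # example NLI checkpoint; any three-way NLI model works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

premise = "A man is playing a guitar on stage."
hypothesis = "Someone is performing music."

# The tokenizer encodes the pair with the model's sentence separators.
inputs = tok(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
pred = logits.argmax(dim=-1).item()
print(model.config.id2label[pred])  # CONTRADICTION / NEUTRAL / ENTAILMENT
```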

6.3 Math

Mathematical assignments commonly involve standard mathematical calculations, theorem validations, and mathematical reasoning tasks, among others. These tasks aim to investigate the latent capabilities of models within the field of mathematics.

Datasets related to mathematical tasks vary in difficulty. GSM8K (Cobbe et al, 2021), ASDiv (Miao et al, 2021), Math23K (Wang et al, 2017), and Ape210K (Zhao et al, 2020) only contain primary school mathematical calculations, which are relatively simple for humans. MATH (Hendrycks et al, 2021d) targets mathematical competition problems, which are more challenging and also examine the models’ ability to follow thinking chains when solving problems. NaturalProofs (Welleck et al, 2021) involves mathematical proposition proofs, axiom inferences, and so on.

6.4 Coreference Resolution

The core objective of tasks related to coreference resolution is the identification of referential relationships within texts. Pronouns, noun phrases, or alternative expressions are occasionally employed in textual passages to refer to entities introduced earlier. This task entails the recognition of entities referred to by different segments of the text and is a fundamental research area in the field of NLP.

WiC (Pilehvar and Camacho-Collados, 2019) and CLUEWSC2020 (Xu et al, 2020b) are coreference resolution datasets in the English and Chinese domains, respectively, used to determine whether words in different sentences have the same referential meaning. WSC (Levesque et al, 2012) does not involve such comparisons but instead requires identifying the specific content to which a word refers. WinoGrande (Sakaguchi et al, 2021) adjusts the WSC dataset by redesigning the task in a fill-in-the-blank format. WinoWhy (Zhang et al, 2020a) extends the WSC dataset by introducing a new task of explaining referential relationships.

6.5 Sentiment Analysis

The sentiment analysis task, commonly known as emotion classification, seeks to analyze and deduce the emotional inclination of provided texts, commonly categorized as positive, negative, or neutral sentiments. This task finds practical utility in diverse domains, including social media monitoring, product review analysis, and market research.

Classic sentiment analysis datasets include IMDB (Maas et al, 2011), Sentiment140 (Go et al, 2009), SST-2 (Socher et al, 2013), and EPRSTMT (Xu et al, 2021). The textual content of these datasets originates from real-life scenarios such as movie reviews, product reviews, and tweet content, hence possessing diversity and authenticity. Each sample is manually labeled as expressing either positive or negative sentiment based on the emotions conveyed in the text.

6.6 Semantic Matching

The task of semantic matching entails evaluating the semantic similarity or degree of correspondence between two sequences of text. Models must grasp the semantic information within the text to perform tasks such as assessing text similarity, matching sentences, and determining semantic relationships. This task is widely applied in domains such as information retrieval and dialogue systems.

MRPC (Dolan and Brockett, 2005), QQP (Wang et al, 2018), and PAWS (Zhang et al, 2019) are commonly used English semantic matching datasets, used for determining semantic similarity at the sentence level. AFQMC (Xu et al, 2020b) and LCQMC (Liu et al, 2018) are commonly used large-scale Chinese datasets. Specifically, the LCQMC dataset is more inclined towards matching the intent of questions rather than semantic matching. To address the lack of other languages, PAWS-X (Yang et al, 2019) translates the PAWS dataset into 6 other languages. The most notable is the STSB dataset (Cer et al, 2017), which not only includes 10 languages but also employs continuous similarity scores as labels rather than simple binary labels.
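For graded-similarity datasets such as STSB, a common baseline is to embed both sentences and report their cosine similarity, which can then be correlated against the gold scores with a Pearson or Spearman coefficient. A rough sketch using sentence-transformers follows; the embedding checkpoint and example sentence pairs are illustrative assumptions, not taken from the dataset.

```python
from sentence_transformers import SentenceTransformer, util

# Example embedding model; any sentence-embedding checkpoint can be substituted.
model = SentenceTransformer("all-MiniLM-L6-v2")

pairs = [
    ("A man is playing a guitar.", "Someone plays an instrument."),
    ("A man is playing a guitar.", "The stock market fell sharply today."),
]
for s1, s2 in pairs:
    emb1, emb2 = model.encode([s1, s2], convert_to_tensor=True)
    score = util.cos_sim(emb1, emb2).item()  # cosine similarity in [-1, 1]
    print(f"{score:.2f}  {s1!r} vs {s2!r}")
```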

6.7 Text Generation

The scope of text generation tasks is broad, encompassing the generation of content summaries or dialogues. In a specific context, we narrow down the definition of text generation tasks to differentiate them from tasks like text summarization and translation. The narrow definition of text generation tasks is bound by provided content and specific requirements. It involves utilizing benchmark data, such as descriptive terms and triplets, to generate corresponding textual descriptions.

The first form involves generating sentences in a colloquial manner using specific words. CommonGen (Lin et al, 2020) and E2E (Novikova et al, 2017) task models with generating coherent sentences related to given vocabulary terms. The second form involves mapping structured data to text. DART (Nan et al, 2021) and WebNLG (Gardent et al, 2017) input structured data as triples to the model to obtain relevant descriptive sentences.


6.8 Text Translation

Text translation involves transforming text from one language to another. Models must adeptly grasp the meaning of the source language text and produce equivalent text that conforms to the grammar and context of the target language.

WMT[65] is one of the most commonly used text translation datasets. It aggregates data from the Workshop on Statistical Machine Translation competition, with a large-scale dataset covering a wide range of languages. NLLB (Costa-jussà et al, 2022) provides open access to three text translation evaluation benchmarks, offering high-quality translations in over 200 languages, including many low-resource languages. IWSLT 2017 (Cettolo et al, 2017) is also representative and commonly used for training and evaluation in translation tasks.

6.9 Text Summarization

The task of text summarization pertains to the extraction or generation of a brief summary or headline from an extended text to encapsulate its primary content. Summaries are expected to retain the pivotal information from the original text, effectively conveying its fundamental ideas, while headlines demand brevity and inclusiveness.

News is the most common source for text summarization datasets. CNN-DM (See et al, 2017) utilizes a large number of news articles to create tens of thousands of article-summary pairs. Compared to the CNN-DM dataset, XSum (Narayan et al, 2018) has shorter text content and richer vocabulary. In addition to obtaining data samples from various news sources, SAMSum (Gliwa et al, 2019), Opinion Abstracts (Wang and Ling, 2016), LCSTS (Hu et al, 2015), MediaSum (Zhu et al, 2021), and AESLC (Zhang and Tetreault, 2019) respectively focus on real dialogues, movie reviews, social media texts, interview transcripts, and emails. This ensures that different text summarization datasets have diverse styles of content and do not become overly homogeneous.

6.10 Text Classification

Text classification tasks aim to assign various text instances to predefined categories, comprising text data and category labels as pivotal components. Sentiment analysis and semantic matching, previously mentioned, are encompassed within the domain of text classification. Due to the unique nature of these tasks and their frequent exploration as standalone subtasks by researchers, this paper provides separate summaries for sentiment analysis, semantic matching, and text classification.

AGNEWS (Zhang et al, 2015) and TNEWS (Xu et al, 2020b) evaluate models’ classification performance on English and Chinese news topics, respectively. They involve a relatively small number of categories, not exceeding 15. CSLDCP (Xu et al, 2021) requires models to classify Chinese literature disciplines, expanding the categories to 67. IFLYTEK (Xu et al, 2020b) categorizes descriptive text based on app functionality for model classification, with an astonishing 119 categories.

65 https://www.statmt.org/wmt22/index.html

6.11 Text Quality Evaluation

The task of text quality evaluation, also referred to as text correction, involves the identification and correction of grammatical, spelling, or language usage errors in text. This task is akin to a teacher correcting writing errors made by students.

CoLA (Warstadt et al, 2019) is used to evaluate models’ ability to judge the grammatical correctness of English sentences, which can be seen as a binary classification task. In contrast, SIGHAN (Wu et al, 2013; Yu et al, 2014; Tseng et al, 2015) and YACLC (Wang et al, 2021b) require models to proofread and correct Chinese spelling and grammar, presenting greater difficulty. Different from these two datasets, CSCD-IME (Hu et al, 2022b) is the first Chinese spelling correction dataset focused on errors introduced by the Pinyin input method, with different sources and distributions of errors.

6.12 Text-to-Code

The Text-to-Code task involves models converting user-provided natural language descriptions into computer-executable code, thereby achieving the desired functionality or operation. Common subtasks include the generation of SQL query statements and generating code for different programming languages.

For example, MBPP (Austin et al, 2021) serves as a benchmark comprising Python programming problems, assessing models’ proficiency in Python programming. On the other hand, DuSQL (Wang et al, 2020a), CSpider (Min et al, 2019), and Spider (Yu et al, 2018) are applied in the Text-to-SQL task. They require models to generate corresponding SQL query statements from given databases based on questions.
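Text-to-SQL predictions are often scored by execution accuracy: run the predicted and gold queries against the same database and compare result sets. A minimal sketch using an in-memory SQLite database is shown below; the schema, table contents, and queries are made up purely for illustration and are not drawn from Spider or its variants.

```python
import sqlite3

def execution_match(db: sqlite3.Connection, pred_sql: str, gold_sql: str) -> bool:
    """True if both queries execute and return the same multiset of rows."""
    try:
        pred_rows = sorted(db.execute(pred_sql).fetchall())
    except sqlite3.Error:
        return False  # an unexecutable prediction counts as wrong
    gold_rows = sorted(db.execute(gold_sql).fetchall())
    return pred_rows == gold_rows

# Hypothetical toy database for demonstration.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE singer(id INTEGER, name TEXT, country TEXT);
    INSERT INTO singer VALUES (1, 'A', 'France'), (2, 'B', 'Japan'), (3, 'C', 'France');
""")
pred = "SELECT count(*) FROM singer WHERE country = 'France'"
gold = "SELECT COUNT(*) FROM singer WHERE country = 'France'"
print(execution_match(db, pred, gold))  # True: both queries return the same rows
```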

6.13 Named Entity Recognition

The Named Entity Recognition (NER) task aims to discern and categorize named entities within a given text. Models are tasked with pinpointing entities, assigning them to predefined categories, and indicating their respective positions. These entities may include personal names, organizational names, geographic locations, dates, and other categories.

CoNLL2003 (Tjong Kim Sang and De Meulder, 2003) is a classic benchmark dataset in the field of NER. It categorizes entity types into 4 classes. OntoNotes 5.0 (Weischedel et al, 2012) builds an NER task dataset on top of the corpus and provides 18 entity types. Subsequently, WNUT2017 (Derczynski et al, 2017) focuses on models’ ability to recognize emerging named entities in new contexts within the NER task. Youku NER (Jie et al, 2019), Taobao NER (Jie et al, 2019), and Weibo NER (Peng and Dredze, 2015) are constructed for the entertainment, e-commerce, and social media domains, respectively, providing corresponding text-entity pairs.
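A typical way to apply such NER resources downstream is with a token-classification model that emits (entity, type, span) triples. The sketch below uses the Hugging Face token-classification pipeline; the checkpoint named here, fine-tuned on CoNLL-2003-style labels, and the input sentence are only examples.

```python
from transformers import pipeline

# Example checkpoint trained on CoNLL-2003 entity types (PER, ORG, LOC, MISC).
ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple",  # merge word pieces into whole entity spans
)

text = "Barack Obama visited Microsoft headquarters in Redmond last week."
for ent in ner(text):
    print(ent["entity_group"], ent["word"], round(ent["score"], 3))
```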

6.14 Relation Extraction

The endeavor of Relation Extraction (RE) necessitates the identification of connections between entities within textual content. This process typically includes recognizing and labeling pertinent entities, followed by the determination of the specific types of relationships that exist among them. As an illustration, the Forbidden City (geographic location) is positioned in (type of relationship) Beijing (geographic location).

Dialogue RE (Yu et al, 2020a) is the first entirely human-annotated dataset for dialogue RE, comprising 36 types of relationships found in real dialogues. In contrast to sentence-level datasets, DocRED (Yao et al, 2019) is constructed for RE tasks at the document level. Models are required to aggregate document information to infer relationships between entities. FewRel (Han et al, 2018) is the first to combine few-shot learning with relation extraction, and in its 2.0 version, it additionally evaluates models’ OOD capability.

6.15 Multitask

Multitask datasets hold significance as they can be concurrently utilized for different categories of NLP tasks. Creators commonly manipulate the same batch of textual data through various configurations, transformations, and annotations to produce training or evaluation data for diverse NLP tasks, exemplifying the concept of “one dataset, multiple applications.”

For example, CSL (Li et al, 2022b) contains a vast amount of information such as paper titles, abstracts, keywords, etc., which can be simultaneously applied to various NLP tasks such as title prediction, keyword generation, paper classification, and so on. QED (Lamm et al, 2021) extends the Natural Questions dataset (Kwiatkowski et al, 2019) by adding explanatory annotations and extends to different tasks such as sentence selection, equivalence recognition, etc. METS-CoV (Zhou et al, 2022) collects social media texts related to COVID-19, which are annotated by creators and used in NER and sentiment analysis tasks.

7 Challenges and Future Directions

This section primarily elaborates on the existing challenges and future directions from four aspects: pre-training corpora, fine-tuning instruction datasets, preference datasets, and evaluation datasets.

7.1 Pre-training Corpora

The construction and open sourcing of pre-training corpora have experienced significant growth recently, with increasing emphasis on their quality by researchers. However, pre-training corpora still face challenges and shortcomings that not only impact the performance of models but also involve ethical and societal issues. Below, we briefly explore the challenges existing in current pre-training corpora and discuss future development directions.

Data Selection. Research indicates that the diversity of data is crucial, and a richer variety of domains is preferable (Longpre et al, 2023c). It is worth investigating how to make the content of pre-training corpora as diverse as possible. Currently, the majority of pre-training corpora are composed of web-scraped data, and the data types are not entirely comprehensive. There is a risk of excessive focus on popular content, resulting in category imbalance. This can lead to a severe lack of knowledge in certain domains, necessitating the subsequent collection of data for incremental pre-training. Moreover, the scale of English data is much larger than that of other languages, which can result in insufficient knowledge of other languages and poor performance of models in cross-language tasks. Data selection is therefore a nuanced art. First, larger-scale, more diverse, and more broadly sourced pre-training corpora covering multiple languages and domains with better proportional representation will be a future trend, so choices and configurations regarding data scale, data sources, domain coverage, data proportions, and language distribution need to be carefully considered. Second, data will be subdivided into finer categories, similar to the further categorization of books in Figure 4, to better measure the breadth of the corpora, facilitating improved data selection. Third, there will be a gradual exploration of whether the addition of synthetic data is effective for the pre-training of models. Fourth, many vertical domains lack open-source relevant data, such as the fields of ancient texts or ethnic cultures.

Timeliness. Currently, the coverage time of most pre-training corpora is relatively outdated, lacking recent knowledge and making it challenging to achieve periodic updates. This results in inaccurate or outdated generations and an inability to respond to recent content. Common Crawl, for instance, continually crawls the latest webpage data, but the majority is in English. Other types of data require reacquisition and preprocessing when updates are needed. In the future, dynamic and automatic updates of pre-training corpora, as well as self-learning capabilities of LLMs regarding new knowledge, will be crucial research directions.

Quality Assessment. Longpre et al (2023c) conducts evaluations on The Pile (Gao et al, 2020) and C4 (Raffel et al, 2020), exploring potential features of the data using different data integration methods. Lee et al (2023a) designs the Task2Vec metric to measure the diversity of data. However, a systematic methodology for quality assessment has not yet been established. Most studies only assess specific aspects of the corpora. Questions about what makes a pre-training corpus of higher quality, how the quality of pre-training corpora should be compared, and what constitutes a more comprehensive quality evaluation remain largely unresolved.

Data Preprocessing. Each pre-training corpus has a unique preprocessing pipeline and methods, with some specific details yet to be disclosed. This gives rise to two issues. First, there is a lack of a unified framework and standardized processes for data preprocessing. The effectiveness of existing methods is sometimes challenging to assess. Second, Longpre et al (2023c), through experiments, demonstrated that the more harmful content is filtered out from pre-training data, the less harmful information the model generates, but its discrimination ability also weakens. Filtering out low-quality data too extensively reduces the diversity of the data. While enhancing discrimination ability, it may lead to the generation of more harmful information by the model. Whether a cleaner corpus is necessarily better and whether a small amount of harmful information and low-quality data can bring benefits are questions that need to be explored in the future. Determining the optimal extent of data cleaning is also a topic for future research.

Building the Ecosystem of Pre-training Corpora. Due to the rapid development of LLMs, a comprehensive ecosystem for pre-training corpora has not yet been established within the community. There is a lack of standards for data preprocessing, no systematic evaluation schemes for data, no established standards for the release of relevant data, and currently, there is no unified management and maintenance of data. Given these circumstances, there is still a long way to go in building the ecosystem for pre-training corpora.

7.2 Instruction Fine-tuning Datasets

During the instruction fine-tuning phase, creating high-quality datasets is crucial for improving model performance and expanding application domains. Several challenges confront the future development of instruction fine-tuning datasets. Below, we briefly examine these challenges and look ahead to future directions.

Subdivision of Instruction Categories. In the majority of instruction fine-tuning datasets, instructions of various categories are mixed together without specifying the task type and associated domain of each instruction. For instance, in the classic Alpaca dataset (Taori et al, 2023), each instruction consists of “instruction,” “input,” and “output” parts without category annotations. This makes it difficult to adjust the category distribution of an instruction fine-tuning dataset to enhance performance on specific tasks, or to add and simplify instructions. Additionally, while datasets such as Firefly (Yang, 2023) and BELLE train 3.5M CN (BELLEGroup, 2023) include a field for instruction categories, they suffer from incomplete or overly broad categories. Taking the “code” category as an example, instructions could be further subdivided into more granular categories such as “code correction,” “code generation,” and “code improvement.” Therefore, in the future, finer-grained category subdivision should become a standard for datasets, allowing users to better understand their overall composition and facilitating dataset optimization. Of course, this may introduce challenges such as the difficulty of standardizing category subdivisions and increased annotation cost and time.
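A hypothetical illustration of such an annotated record is shown below: the `instruction`/`input`/`output` fields mirror the Alpaca format cited above, while the `category` and `subcategory` fields are an assumed extension rather than part of any released dataset.

```python
import json
from collections import Counter

# The "instruction"/"input"/"output" fields mirror the Alpaca format; the
# "category"/"subcategory" fields are an illustrative extension, not part of
# the released dataset.
record = {
    "instruction": "Fix the bug in the following function.",
    "input": "def add(a, b):\n    return a - b",
    "output": "def add(a, b):\n    return a + b",
    "category": "code",
    "subcategory": "code correction",
}

print(json.dumps(record, indent=2))

# With per-record categories, inspecting or rebalancing the category
# distribution becomes a simple counting step.
dataset = [record]
print(Counter((r["category"], r["subcategory"]) for r in dataset))
```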

Domain Scarcity. The majority of datasets focus on general domains, and datasets in vertical domains are mostly concentrated in common areas such as healthcare, finance, and law. This results in a scarcity of instruction datasets for low-resource and niche domains, potentially limiting model improvement in certain specialized fields, for instance traditional Chinese classics and antiques, or niche areas such as paleobiology, funeral studies, and minority languages. Constructing datasets for these domains not only systematically integrates their knowledge but also allows trained LLMs to be applied in specific fields as auxiliary tools with societal significance and value.

Quality Evaluation. The quality evaluation of instruction fine-tuning datasets is a complex and subjective issue, and there are currently no clear, universal standards or methods. In practice, quality evaluation may involve multiple aspects, including but not limited to: (1) Model Performance Evaluation. Assessing the performance of the fine-tuned model on evaluation datasets; the selected evaluation datasets should be diverse and reasonable to avoid evaluation contamination (Zhou et al, 2023b). (2) Annotation Consistency and Rationality. Evaluating the consistency among different annotators regarding instructions, as well as the rationality and correctness of instruction inputs and answer outputs. (3) Bias Analysis. Assessing biases and harmful content in the dataset to ensure the model is not adversely affected. (4) Timeliness Detection. Regularly checking whether the content of instructions in the dataset has become outdated or inaccurate. (5) Subjective Evaluation. Manually conducting subjective scoring and inspection. In conclusion, future efforts may involve establishing more explicit evaluation standards and metrics and creating a unified evaluation framework to make assessment more scientific and objective.
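For aspect (2) above, one concrete starting point would be a standard inter-annotator agreement statistic. The sketch below computes Cohen's kappa over two annotators' acceptability judgments; the labels and items are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (counts_a[c] / n) * (counts_b[c] / n)
        for c in set(labels_a) | set(labels_b)
    )
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

# Hypothetical "is this instruction/answer pair acceptable?" judgments.
ann_1 = ["ok", "ok", "bad", "ok", "bad", "ok"]
ann_2 = ["ok", "bad", "bad", "ok", "bad", "ok"]
print(round(cohens_kappa(ann_1, ann_2), 3))
```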

Legal and Ethical Risks. Research by Longpre et al (2023b) on instruction fine-tuning datasets has revealed that an increasing number of datasets are treated as wholes rather than as collections of sources, undergoing multiple repackagings and reauthorizations without sufficient labeling of data sources and copyright information. This leads to issues such as data leakage and biased behavior, posing legal and ethical risks. Therefore, there is a current need to enhance the transparency of datasets, improve quality and ethical compliance, and reduce potential problems. Longpre et al (2023b) provide a dataset audit and data provenance explorer tool to address this. In the future, establishing standards for dataset usage will be a focal point of concern.

7.3 Preference Datasets

The significance of preference datasets lies in providing the crucial training signals that guide models' output decisions. Below, we briefly discuss the challenges currently faced by preference datasets and look forward to future directions.

Limited Availability of Resources. RLHF has been widely researched and applied by leading industry companies such as OpenAI, Anthropic, and Google. However, due to the lack of high-quality, publicly available preference datasets, the open-source community still lags in the research and practice of RLHF (Cui et al, 2023). Currently, there are not many open-source preference datasets, and the majority are in English; non-English and domain-specific preference datasets are extremely scarce. One reason for this scarcity is the relatively cumbersome annotation process and its high cost. Therefore, weakly supervised approaches could be attempted, using simple signals such as user clicks or like counts instead of manual annotation, or leveraging high-quality models such as GPT-4 to assist in voting and scoring. On the other hand, preference datasets in other languages and vertical domains receive less attention, leading to fewer related efforts.
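A minimal sketch of the weak-supervision idea follows: implicit signals (here, assumed `likes` counts) are converted into (chosen, rejected) preference pairs when the margin between two candidate answers is large enough. Field names and the margin threshold are assumptions.

```python
# Derive preference pairs from implicit signals instead of manual annotation.
def to_preference_pairs(prompt, candidates, min_margin=5):
    """Emit (chosen, rejected) pairs when one answer clearly out-scores another."""
    pairs = []
    for i, a in enumerate(candidates):
        for b in candidates[i + 1:]:
            if abs(a["likes"] - b["likes"]) >= min_margin:
                chosen, rejected = (a, b) if a["likes"] > b["likes"] else (b, a)
                pairs.append({
                    "prompt": prompt,
                    "chosen": chosen["text"],
                    "rejected": rejected["text"],
                })
    return pairs

candidates = [
    {"text": "concise, correct answer", "likes": 42},
    {"text": "rambling, partially wrong answer", "likes": 3},
]
print(to_preference_pairs("How do I reverse a list in Python?", candidates))
```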

Preference Evaluation Method Settings. The most commonly used preference evaluation method is still voting, but many preference datasets lack strict and uniform evaluation standards and provide feedback from only a single dimension. Human preferences in the real world are diverse; to reflect them more comprehensively and with higher quality, corresponding standards need to be established to reduce subjective differences and to conduct fine-grained evaluations along multiple dimensions (Cui et al, 2023). Employing a variety of evaluation methods for comprehensive assessment is recommended, although defining such standards is a complex issue. Additionally, preference datasets often do not provide explicit reasons why some answers are favored by humans, introducing uncertainty into the model learning process. It is therefore advisable to include textual explanations in preference evaluations, stating the reasons for the assessment and providing suggestions for improving the responses. The construction of UltraFeedback (Cui et al, 2023) is comparatively rigorous and standardized in this respect, and it plays a positive role in fostering future developments.
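The sketch below shows what a multi-dimensional preference record with a textual rationale might look like. It is in the spirit of UltraFeedback's fine-grained feedback but is not its actual schema; the dimensions, scale, and field names are assumptions.

```python
import json

# Illustrative multi-dimensional preference record with a textual rationale,
# in the spirit of (but not identical to) UltraFeedback's fine-grained feedback.
feedback = {
    "prompt": "Summarize the main findings of the attached report.",
    "response": "...",
    "scores": {  # per-dimension ratings on an assumed 1-5 scale
        "instruction_following": 3,
        "truthfulness": 5,
        "honesty": 5,
        "helpfulness": 4,
    },
    "rationale": "Accurate and well grounded, but omits the second finding "
                 "requested in the prompt.",
    "suggested_improvement": "Add one sentence covering the second finding.",
}

print(json.dumps(feedback, indent=2))
```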

7.4 Evaluation Datasets

Evaluation datasets play a crucial role in ensuring the reliability, practicality, and safety of LLMs. They provide researchers and practitioners with insights into the strengths and weaknesses of LLMs, facilitating continuous improvement and optimization. The following discussion highlights the challenges within current evaluation datasets and suggests potential directions for future development.

Establishment of Evaluation Datasets. When creating an evaluation dataset for a particular domain, several essential factors must be considered. (1) Data sources. There is a growing emphasis on evaluating the fairness and reliability of datasets (Aiyappa et al, 2023), with particular attention to the risk of data pollution or leakage during assessment (Zhou et al, 2023b). Zhou et al (2023b) identified instances where LLMs unintentionally learned from evaluation data during pre-training or prompt fine-tuning, resulting in inflated evaluation scores and diminished generalization ability. To mitigate this, model developers should publicly disclose the composition of their training data so that inappropriate evaluation datasets can be avoided, and providers of evaluation datasets should furnish detailed data source information and assess the risk of data contamination. Whenever possible, data sources should consist of artificially generated or non-public data to ensure fair evaluations; minimizing data pollution or leakage remains an open problem. (2) Question design. Various factors, including scale, question types, and topic distribution, should be considered when developing evaluation datasets, and achieving overall improvement requires extensive research and practical application. Initially, the scale of the evaluation dataset should be determined based on the specific evaluation content, emphasizing high-quality questions, diverse question types, and an evenly distributed array of topics before gradually expanding and regularly updating the dataset. This approach resembles the Chinese Gaokao, where carefully refined questions assess the mastery of comprehensive knowledge. Additionally, setting a reasonable difficulty level is crucial: evaluation tasks should exceed the current capabilities of LLMs, establishing appropriate upper and lower bounds. Without well-designed benchmarks, evaluations on which many models already score above 95% contribute little to advancing LLMs (Sawada et al, 2023).
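As a concrete example of a contamination check on the data-source side, the sketch below flags an evaluation item if any of its word n-grams appears verbatim in the training corpus. The 13-gram unit and the whitespace tokenization are simplifying assumptions; production decontamination pipelines are considerably more involved.

```python
def ngrams(text: str, n: int = 13):
    """Set of word n-grams; 13-grams are a commonly used unit for overlap checks."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(eval_item: str, training_docs, n: int = 13) -> bool:
    """Flag an evaluation item if any of its n-grams appears verbatim in training data."""
    item_ngrams = ngrams(eval_item, n)
    return any(item_ngrams & ngrams(doc, n) for doc in training_docs)

# Toy usage: the two texts share a verbatim 13-gram, so the probe is flagged.
train = ["the quick brown fox jumps over the lazy dog near the river bank today"]
probe = "the quick brown fox jumps over the lazy dog near the river bank today again"
print(is_contaminated(probe, train))  # True
```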

Addressing Evaluation Gaps. Persistent gaps in the evaluation landscape require researchers' attention to refine the evaluation framework. (1) Evaluating low-resource domains. Evaluation datasets in some domains, such as e-commerce (Li et al, 2023m) and geoscience (Deng et al, 2023), are still at a nascent stage of development, while other domains, including ancient literature, cultural artifacts, and tea culture, currently lack pertinent evaluation benchmarks altogether. (2) Evaluating other languages. Beyond the predominantly featured English and Chinese datasets, resources for evaluation in other languages are limited. (3) Multi-turn evaluations. The focus on single-turn assessment overlooks LLMs' capabilities in multi-turn interaction and contextual understanding. (4) Dynamic evaluations. Many evaluation datasets employ static evaluation methods, which introduces two drawbacks: on one hand, the evaluation data may be used for training to boost leaderboard rankings; on the other hand, the original evaluation content may gradually fail to match the growing capabilities of LLMs, and the knowledge being evaluated may become obsolete or erroneous (Guo et al, 2023c).

Choosing and Improving Evaluation Approaches. The limitations of automated, code-based evaluation, especially for open-ended questions, need to be addressed. Manual evaluation, while in-depth, can be costly and subject to human bias. Model-based scoring is therefore emerging as a promising alternative, striving for scientific reliability and the goal of a fully automated evaluation process.
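A minimal sketch of model-based scoring is given below. The judge prompt, the 1-10 scale, and the `call_judge_model` stub are all assumptions; in practice the stub would be replaced by a call to whichever strong LLM serves as the judge.

```python
JUDGE_PROMPT = """You are grading an answer to an open-ended question.
Question: {question}
Answer: {answer}
Rate the answer from 1 (poor) to 10 (excellent) for correctness and completeness.
Reply with a single integer."""

def call_judge_model(prompt: str) -> str:
    """Hypothetical stub: replace with a call to whichever strong LLM acts as judge."""
    return "8"

def model_based_score(question: str, answer: str) -> int:
    """Format the judge prompt, query the judge, and parse the integer rating."""
    reply = call_judge_model(JUDGE_PROMPT.format(question=question, answer=answer))
    return int(reply.strip())

print(model_based_score("Why is the sky blue?", "Because of Rayleigh scattering."))
```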

Comprehensive Evaluation Framework. The complexity of selecting from numerous datasets, the lack of standardized data formats, and the diversity of evaluation methodologies pose significant challenges. A comprehensive evaluation framework could simplify the process by providing a central repository and an efficient, standardized API for model invocation. Such a framework should fulfill three criteria: simplicity, centralization, and efficiency. First, the evaluation steps should be straightforward, requiring only the provision of an API for model invocation. Second, a unified repository should be available for selecting datasets spanning diverse domains and tasks. Lastly, the evaluation process should be efficient, covering a broad range of dimensions while yielding rapid results. Achieving this goal poses various challenges, with familiar frameworks such as the HELM evaluation framework (Liang et al, 2023) and the OpenCompass evaluation platform (Contributors, 2023) evolving in this direction.
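To illustrate the "API for model invocation" idea, the sketch below defines a minimal model interface and a generic evaluation loop against it. This is not the API of HELM or OpenCompass; the interface, scorer, and toy model are assumptions.

```python
from typing import Protocol

class GenerativeModel(Protocol):
    """Minimal interface a model must expose to plug into the harness.
    Illustrative only; real frameworks (HELM, OpenCompass) define richer APIs."""
    def generate(self, prompt: str, max_tokens: int = 256) -> str: ...

def evaluate(model: GenerativeModel, dataset, scorer) -> float:
    """Run every item through the model and average the per-item scores."""
    scores = [scorer(item, model.generate(item["prompt"])) for item in dataset]
    return sum(scores) / len(scores) if scores else 0.0

# Toy usage: a dummy model and an exact-match scorer.
class EchoModel:
    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        return "42"

dataset = [{"prompt": "What is 6 * 7?", "answer": "42"}]
print(evaluate(EchoModel(), dataset, lambda item, out: float(out == item["answer"])))
```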

8 Conclusion

In the vast landscape of AI, Large Language Models (LLMs) stand out as rapidly growing, prominent features, akin to towering trees in a dense forest. The datasets that feed their growth and development can be compared to the vital root system of these trees, providing the sustenance essential for their performance. Regrettably, the current landscape of LLM-related datasets is extensive but lacks a cohesive synthesis across the various types of datasets, so understanding the current state and future trends of LLM datasets presents a formidable challenge. Therefore, this survey offers a comprehensive analysis of LLM datasets, categorizing and summarizing datasets associated with LLMs across five dimensions: pre-training corpora, instruction fine-tuning datasets, preference datasets, evaluation datasets, and traditional NLP datasets. Alongside this categorization, it identifies the current challenges and outlines potential directions for future dataset development in four key areas: pre-training, instruction fine-tuning, reinforcement learning, and model evaluation. It is our hope that this survey will serve as a valuable point of reference for researchers in both academia and industry, as well as for newcomers and proficient practitioners engaged with LLMs. Our ultimate objective is to continually refine LLM datasets, to foster a robust and standardized dataset ecosystem, and to support the progressive advancement of LLMs.
