MinWoo(Daniel) Park | Tech Blog
When we delve into the concept of intelligence, human intelligence naturally emerges as our benchmark. Over millennia, humanity has embarked on a continuous exploration of human intelligence, employing diverse methods for measurement and evaluation. This quest for understanding intelligence encompasses an array of approaches, ranging from IQ tests and cognitive games to educational pursuits and professional accomplishments. Throughout history, our persistent efforts have been geared toward comprehending, assessing, and pushing the boundaries of various facets of human intelligence.
However, against the backdrop of the information age, a new dimension of intelligence is emerging, sparking widespread interest among scientists and researchers: machine intelligence. One representative of this emerging field is language models in natural language processing (NLP). These language models, typically constructed using powerful deep neural networks, possess unprecedented language comprehension and generation capabilities. The question of how to measure and assess the level of this new type of intelligence has become a crucial issue.
In the nascent stages of NLP, researchers commonly employed a set of straightforward benchmark tests to evaluate their language models. These initial evaluations primarily concentrated on aspects such as grammar and vocabulary, encompassing tasks like syntactic parsing, word sense disambiguation, and so on. In the early 1990s, the advent of the MUC evaluation (Grishman & Sundheim, 1996) marked a significant milestone in the NLP community. The MUC evaluation primarily centered on information extraction tasks, challenging participants to extract specific information from text. This evaluation framework played a pivotal role in propelling the field of information extraction forward. Subsequently, with the emergence of deep learning in the 2010s, the NLP community embraced more expansive benchmarks like SNLI (Bowman et al., 2015) and SQuAD (Rajpurkar et al., 2016). These benchmarks not only evaluate system performance but also provide ample data for training systems. They usually assign individual scores to models according to the adopted evaluation metrics, facilitating the measurement of task-specific accuracy.
With the emergence of large-scale pre-trained language models, exemplified by BERT (Devlin et al., 2019), evaluation methods have gradually evolved to adapt to the performance assessment of these new types of general models. In response to this paradigm shift, the NLP community has taken the initiative to orchestrate a myriad of shared tasks and challenges, including but not limited to SemEval (Nakov et al., 2019), CoNLL (Sang & Meulder, 2003), GLUE (Wang et al., 2019b), SuperGLUE (Wang et al., 2019a), and XNLI (Conneau et al., 2018). These endeavors entail aggregating scores for each model, offering a holistic measure of its overall performance. They have, in turn, fostered continuous refinement in NLP evaluation methodologies, creating a dynamic arena for researchers to compare and contrast the capabilities of diverse systems.
With the continual expansion in the size of language models, large language models (LLMs) have exhibited noteworthy performance under both zero- and few-shot settings, rivaling fine-tuned pre-trained models. This shift has precipitated a transformation in the evaluation landscape, marking a departure from traditional task-centered benchmarks to a focus on capability-centered assessments. The demarcation lines among distinct downstream tasks have begun to blur. In tandem with this trend, the landscape of evaluation benchmarks designed to appraise knowledge, reasoning, and various other capabilities has expanded. Many of these benchmarks are characterized by an abandonment of training data and are devised with the overarching goal of providing a comprehensive evaluation of a model’s capabilities under zero- and few-shot settings (Hendrycks et al., 2021b, Zhong et al., 2023, Zhang et al., 2023b, Li et al., 2023e).
The rapid adoption of LLMs by the general public has been strikingly demonstrated by ChatGPT (OpenAI, 2022), which amassed over 100 million users within just two months of its launch. This unprecedented growth underscores the transformative capabilities of these models, including natural text generation (Brown et al., 2020), code generation (Chen et al., 2021), and tool use (Nakano et al., 2021). However, alongside their promise, concerns have been raised about the potential risks if such capable models are deployed at scale without thorough and comprehensive evaluation. Critical issues such as perpetuating biases, spreading misinformation, and compromising privacy need to be rigorously addressed. In response to these concerns, a dedicated line of research has emerged with a focus on empirically evaluating the extent to which LLMs align with human preferences and values. Whereas previous studies have focused predominantly on capabilities, this strand of research aims to steer the advancement and application of LLMs in ways that maximize their benefits while proactively mitigating risks.
Additionally, the burgeoning use of LLMs and their escalating integration into real-world contexts underscore the profound impact that advanced AI systems and agents, underpinned by LLMs, are having on human society. Before these advanced AI systems are deployed, the safety and reliability of LLMs must be prioritized. We provide a comprehensive exploration of a series of safety issues related to LLMs, such as robustness and catastrophic risks. While these risks may not be fully realized at present, advanced LLMs have already shown certain tendencies in current evaluations, revealing behaviors indicative of catastrophic risks and demonstrating abilities to perform higher-order tasks. Consequently, we believe that discussing the evaluation of these risks is essential for guiding the future direction of safety research in LLMs.
While numerous benchmarks have been developed to evaluate LLMs’ capabilities and alignment with human values, these have often focused narrowly on performance within singular tasks or domains. To enable more comprehensive LLM assessment, this survey provides a systematic literature review synthesizing efforts to evaluate these models across various dimensions. We summarize key points regarding general LLM benchmarks and evaluation methodologies spanning knowledge, reasoning, tool learning, toxicity, truthfulness, robustness, and privacy.
Our work significantly extends two recent surveys on LLM evaluation by Chang et al. (2023) and Liu et al. (2023i). While concurrent, our survey takes a distinct approach from these existing reviews. Chang et al. (2023) structure their analysis around evaluation tasks, datasets, and methods. In contrast, our survey integrates insights across these categories to provide a more holistic characterization of key advancements and limitations in LLM evaluation. Additionally, Liu et al. (2023i) primarily focus their review on alignment evaluation for LLMs.
Figure 1: Our proposed taxonomy of major categories and sub-categories of LLM evaluation.
Our survey expands the scope to synthesize findings from both capability and alignment evaluations of LLMs. By complementing these previous surveys through an integrated perspective and expanded scope, our work provides a comprehensive overview of the current state of LLM evaluation research. The distinctions between our survey and these two related works further highlight the novel contributions of our study to the literature.
The primary objective of this survey is to meticulously categorize the evaluation of LLMs, furnishing readers with a well-structured taxonomy framework. Through this framework, readers can gain a nuanced understanding of LLMs’ performance and the attendant challenges across diverse and pivotal domains.
Numerous studies posit that the bedrock of LLMs’ capabilities resides in knowledge and reasoning, serving as the underpinning for their exceptional performance across a myriad of tasks. Nonetheless, the effective application of these capabilities necessitates a meticulous examination of alignment concerns to ensure that the model’s outputs remain consistent with user expectations. Moreover, the vulnerability of LLMs to malicious exploits or inadvertent misuse underscores the imperative nature of safety considerations. Once alignment and safety concerns have been addressed, LLMs can be judiciously deployed within specialized domains, catalyzing task automation and facilitating intelligent decision-making. Thus, our overarching objective is to delve into evaluations encompassing these five fundamental domains and their respective subdomains, as illustrated in Figure 1.
Section 3, titled “Knowledge and Capability Evaluation”, centers on the comprehensive assessment of the fundamental knowledge and reasoning capabilities exhibited by LLMs. This section is meticulously divided into four distinct subsections: Question-Answering, Knowledge Completion, Reasoning, and Tool Learning. Question-answering and knowledge completion tasks stand as quintessential assessments for gauging the practical application of knowledge, while the various reasoning tasks serve as a litmus test for probing the meta-reasoning and intricate reasoning competencies of LLMs. Furthermore, the recently emphasized special ability of tool learning is spotlighted, showcasing its significance in empowering models to adeptly handle and generate domain-specific content.
Section 4, designated as “Alignment Evaluation”, hones in on the scrutiny of LLMs’ performance across critical dimensions, encompassing ethical considerations, moral implications, bias detection, toxicity assessment, and truthfulness evaluation. The pivotal aim here is to scrutinize and mitigate the potential risks that may emerge in the realms of ethics, bias, and toxicity, as LLMs can inadvertently generate discriminatory, biased, or offensive content. Furthermore, this section acknowledges the phenomenon of hallucinations within LLMs, which can lead to the inadvertent dissemination of false information. As such, an indispensable facet of this evaluation involves the rigorous assessment of truthfulness, underscoring its significance as an essential aspect to evaluate and rectify.
Section 5, titled “Safety Evaluation”, embarks on a comprehensive exploration of two fundamental dimensions: the robustness of LLMs and their evaluation in the context of Artificial General Intelligence (AGI). LLMs are routinely deployed in real-world scenarios, where their robustness becomes paramount. Robustness equips them to navigate disturbances stemming from users and the environment, while also shielding against malicious attacks and deception, thereby ensuring consistent high-level performance. Furthermore, as LLMs inexorably advance toward human-level capabilities, the evaluation expands its purview to encompass more profound security concerns. These include but are not limited to power-seeking behaviors and the development of situational awareness, factors that necessitate meticulous evaluation to safeguard against unforeseen challenges.
Section 6, titled “Specialized LLMs Evaluation”, extends the LLM evaluation paradigm into diverse specialized domains. Within this section, we turn our attention to the evaluation of LLMs specifically tailored for application in distinct domains. Our selection encompasses currently prominent specialized LLMs spanning fields such as biology, education, law, computer science, and finance. The objective here is to systematically assess their aptitude and limitations when confronted with domain-specific challenges and intricacies.
Section 7, denominated “Evaluation Organization”, serves as a comprehensive introduction to the prevalent benchmarks and methodologies employed in the evaluation of LLMs. In light of the rapid proliferation of LLMs, users are confronted with the challenge of identifying the most apt models to meet their specific requirements while minimizing the scope of evaluations. In this context, we present an overview of well-established and widely recognized benchmark evaluations. This serves the purpose of aiding users in making judicious and well-informed decisions when selecting an appropriate LLM for their particular needs.
Please be aware that our taxonomy framework does not purport to comprehensively encompass the entirety of the evaluation landscape. In essence, our aim is to address the following fundamental questions:
We will now embark on an in-depth exploration of each category within the LLM evaluation taxonomy, sequentially addressing capabilities, concerns, applications, and performance.
Evaluating the knowledge and capability of LLMs has become an important research area as these models grow in scale and capability. As LLMs are deployed in more applications, it is crucial to rigorously assess their strengths and limitations across a diverse range of tasks and datasets. In this section, we aim to offer a comprehensive overview of the evaluation methods and benchmarks pertinent to LLMs, spanning various capabilities such as question answering, knowledge completion, reasoning, and tool use. Our objective is to provide an exhaustive synthesis of the current advancements in the systematic evaluation and benchmarking of LLMs’ knowledge and capabilities, as illustrated in Figure 2.
Question answering is an important means of evaluating LLMs, and the question answering ability of an LLM directly determines whether its final output can meet expectations. At the same time, however, since almost any form of LLM evaluation can be regarded as question answering or converted into a question answering format, few datasets and studies evaluate the question answering ability of LLMs in isolation. Most datasets are curated to evaluate other capabilities of LLMs.
Therefore, we believe that the datasets simply used to evaluate the question answering ability of LLMs must be from a wide range of sources, preferably covering all fields rather than aiming at some fields, and the questions do not need to be very professional but general.
According to the above criteria for datasets focusing on question answering capability, many datasets qualify, e.g., SQuAD (Rajpurkar et al., 2016), NarrativeQA (Kociský et al., 2018), HotpotQA (Yang et al., 2018), and CoQA (Reddy et al., 2019). Although these datasets predate LLMs, they can still be used to evaluate the question answering ability of LLMs. Kwiatkowski et al. (2019) present the Natural Questions corpus. The questions are composed of actual anonymized and aggregated queries that have been submitted to the Google search engine. They also verify the quality of the data and take human variation into account, just like DuReader (Tang et al., 2021a).
Figure 2: An overview of studies on knowledge and capability evaluation for LLMs.
LLMs function as the cornerstone for multi-tasking applications. Their utility spans from general chatbots to more specialized professional tools, necessitating a broad spectrum of knowledge. Consequently, assessing the variety and depth of knowledge that these LLMs encompass is a critical aspect in their evaluation.
Knowledge completion and knowledge memorization are task types used to evaluate LLMs, primarily based on existing knowledge bases like Wikidata. LAMA (Petroni et al., 2019), for example, assesses a variety of knowledge types derived from different sources, including Wikidata, ConceptNet (Speer & Havasi, 2012), and SQuAD (Rajpurkar et al., 2016).
These knowledge sources provide subject-relation-object triples, which encompass both factual and commonsense knowledge. Consequently, these triples can be converted into cloze statements, allowing the language model to fill in the missing token.
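The conversion from triples to cloze statements can be sketched in a few lines. This is a minimal illustration only: the relation names, the template wording, and the `[X]`/`[MASK]` placeholders are assumptions made for the example, not the templates actually shipped with LAMA.

```python
from typing import Dict, NamedTuple, Tuple


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str


# Hypothetical relation templates (assumed for illustration).
# [X] marks the subject slot, [MASK] the object the model must fill in.
TEMPLATES: Dict[str, str] = {
    "place_of_birth": "[X] was born in [MASK].",
    "capital_of": "[X] is the capital of [MASK].",
}


def to_cloze(triple: Triple, templates: Dict[str, str] = TEMPLATES) -> Tuple[str, str]:
    """Return (cloze statement, gold answer) for one knowledge triple."""
    template = templates[triple.relation]
    return template.replace("[X]", triple.subject), triple.obj


statement, answer = to_cloze(Triple("Dante", "place_of_birth", "Florence"))
print(statement)  # Dante was born in [MASK].
print(answer)     # Florence
```

The model's prediction for the masked position is then compared against the gold object of the triple.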
Following LAMA, KoLA (Yu et al., 2023) conducts a more in-depth and comprehensive study on the knowledge abilities of large models. KoLA develops the Knowledge Memorization Task, which also reconstructs the knowledge triples into a relation-specific template sentence to predict the tail entity (knowledge). It uses Wikidata5M to probe facts, and the results are evaluated with the EM and F1 metrics. The study further explores whether the frequency of a knowledge entity could influence the evaluation results. Experiments are conducted on 21 LLMs, including open-source models and proprietary models (via API service). By classifying models according to whether they are post-alignment, the in-depth analysis can examine the relationship between model size and knowledge memorization separately for each group. This indicates that this task provides valuable insights into knowledge captured by LLMs.
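For readers unfamiliar with the EM and F1 metrics mentioned above, the following is a minimal sketch of how they are commonly computed for short text answers. The normalization steps (lowercasing, stripping punctuation and English articles) follow the widely used SQuAD-style convention and are an assumption here; the exact scoring used in KoLA may differ.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower", "eiffel tower"))  # 1.0
print(round(token_f1("Paris, France", "Paris"), 3))     # 0.667
```

EM rewards only exact answers, while F1 gives partial credit when the prediction overlaps with the gold answer, which is why the two are usually reported together.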
WikiFact (Goodrich et al., 2019) is an automatic metric proposed for evaluating the factual accuracy of generated text. It defines a dataset in the form of a relation tuple (subject, relation, object). This dataset is created based on the English Wikipedia and Wikidata knowledge base. However, their experiments are limited to the task of text summarization. Any Knowledge Completion work of LLMs intending to use this dataset may necessitate some modifications in its usage.
Complex reasoning encompasses the capacity to comprehend and effectively employ supporting evidence and logical frameworks to deduce conclusions or facilitate decision-making. In our effort to delineate the evaluation landscape, we propose categorizing existing evaluation tasks into four principal domains, each distinguished by the nature of the involved logic and evidential elements within the reasoning process. These categories are identified as Commonsense Reasoning, Logical Reasoning, Multi-hop Reasoning, and Mathematical Reasoning.
Commonsense reasoning stands as a fundamental ingredient of human cognition, encompassing the capacity to comprehend the world and make decisions (Davis, 1990; Liu & Singh, 2004; Cambria et al., 2011). This cognitive ability plays a pivotal role in developing NLP systems capable of making situational presumptions and generating human-like language.
In order to evaluate commonsense reasoning ability, a diverse array of datasets and benchmarks focusing on different domains of commonsense knowledge have emerged, which are listed in Table 1. These datasets examine the model’s ability to acquire commonsense knowledge and reason using it in the form of multiple-choice questions with metrics such as accuracy and F1. Various studies have delved into assessing the performance of LLMs on these classic commonsense reasoning datasets. Bang et al. (2023) demonstrate that ChatGPT achieves
Table 1: Details of commonsense reasoning datasets.
Logical reasoning holds significant importance in natural language understanding, which is an ability of examining, analyzing and critically evaluating arguments as they occur in ordinary language (Council, 2019). Based on the task format, we categorize the datasets employed to assess the models’ logical reasoning proficiency into three distinct types: natural language inference datasets, multi-choice reading comprehension datasets, and text generation datasets.
Natural Language Inference Datasets The natural language inference (NLI) task is a fundamental task for evaluating reasoning ability to determine the logical relationship between a hypothesis and a premise. This task requires models to take a pair of sentences as input and classify their relationship labels from entailment, contradiction, and neutral. In recent years, there have been many studies devoted to evaluating this ability, including SNLI (Bowman et al., 2015), MultiNLI (Williams et al., 2018), LogicNLI (Tian et al., 2021), ConTRoL (Liu et al., 2021), MED (Yanaka et al., 2019a), HELP (Yanaka et al., 2019b), ConjNLI (Saha et al., 2020), and TaxiNLI (Joshi et al., 2020), where the accuracy metric is widely adopted.
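As a rough sketch of how accuracy is computed for this three-way classification, the snippet below reports overall accuracy together with a per-label breakdown. The function name and the per-label breakdown are illustrative choices, not part of any of the cited benchmarks.

```python
from collections import defaultdict
from typing import Dict, Sequence, Tuple

LABELS = ("entailment", "contradiction", "neutral")


def nli_accuracy(
    predictions: Sequence[str], golds: Sequence[str]
) -> Tuple[float, Dict[str, float]]:
    """Return overall accuracy and accuracy per gold label."""
    if len(predictions) != len(golds) or not golds:
        raise ValueError("predictions and golds must be non-empty and aligned")
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, gold in zip(predictions, golds):
        total[gold] += 1
        correct[gold] += int(pred == gold)
    overall = sum(correct.values()) / len(golds)
    # Only report labels that actually occur in the gold data.
    per_label = {lab: correct[lab] / total[lab] for lab in LABELS if total[lab]}
    return overall, per_label


preds = ["entailment", "neutral", "contradiction", "neutral"]
golds = ["entailment", "contradiction", "contradiction", "neutral"]
overall, per_label = nli_accuracy(preds, golds)
print(overall)    # 0.75
print(per_label)  # {'entailment': 1.0, 'contradiction': 0.5, 'neutral': 1.0}
```

A per-label view is useful because a model can reach a respectable overall accuracy while performing poorly on one of the three relations.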
Multiple-choice Reading Comprehension Datasets In the typical multiple-choice machine reading comprehension scheme, given a passage and a question, the model is required to select the most adequate answer from a list of candidate answers. ReClor (Yu et al., 2020), LogiQA (Liu et al., 2020b), LogiQA 2.0 (Liu et al., 2023b), and LSAT (Wang et al., 2022) are benchmarks consisting of multi-choice logic questions sourced from standardized tests (e.g., the Law School Admission Test, the Graduate Management Admissions Test, and the National Civil Servants Examination of China). This sourcing approach guarantees the inherent difficulty and quality of the questions within these datasets. The metrics of accuracy and F1 score are typically used in this task for evaluation.
The performance of LLMs on the above classic datasets has been extensively explored. Bang et al. (2023) categorize logical reasoning into inductive and deductive reasoning based on “a degree to which the premise supports the conclusion”. Inductive reasoning draws general conclusions from particular “observations or evidence”, while deductive reasoning is based on “truth of the premises” (i.e., necessarily true inference) (Douven, 2017). They reveal that ChatGPT exhibits poor performance in inductive reasoning but relatively excels in deductive reasoning. Liu et al. (2023c) conclude that for ChatGPT and GPT-4, logical reasoning is still a great challenge. While they demonstrate relatively strong performance on traditional multiple-choice reading comprehension datasets like LogiQA (Liu et al., 2020b) and ReClor (Yu et al., 2020), their performance is notably weaker on NLI datasets. Furthermore, the performance drops significantly when dealing with out-of-distribution datasets. Unlike preceding evaluations that are limited to simple metrics (e.g., accuracy), Xu et al. (2023a) propose fine-grained evaluations from both objective and subjective perspectives, including answer correctness, explanation correctness, explanation completeness, and explanation redundancy. To avoid the influence of knowledge bias, they introduce a novel dataset, NeuLR, that contains neutral content. Notably, they form a scheme for logical reasoning evaluation across six dimensions: Correct, Rigorous, Self-aware, Active, Oriented, and No hallucination. Upon assessment, it is observed that text-davinci-003, ChatGPT, and BARD all display specific limitations in logical reasoning. For instance, text-davinci-003 excels in deductive scenarios but struggles to maintain orientation for inductive reasoning tasks, and shows laziness in abductive reasoning tasks. ChatGPT demonstrates adeptness in maintaining rationality but faces challenges when confronted with complex reasoning problems.
Text Generation Datasets Research efforts have also been directed toward the creation of sequence-to-sequence datasets, where both the input and output are text strings. One notable study, presented by Ontañón et al. (2022), introduces LogicInference, a dataset that focuses on inference using propositional logic and a subset of first-order logic. LogicInference comprises a diverse set of tasks, including the translation between natural language and more formal logical notations, as well as one-step and multi-step reasoning tasks employing semi-formal logical notations or natural language. The evaluation of model performance on this dataset is conducted using sequence-level accuracy as the metric. Regrettably, to the best of our knowledge, there has been no evaluation of the performance of LLMs on this dataset, which presents an intriguing avenue for future research.
In addition, Han et al. (2022) introduce a human-annotated, open-domain dataset FOLIO that encompasses both NLI and text generation tasks. The first task within FOLIO is named natural language reasoning with first-order logic task, which is an NLI task that aims to determine the truth values of the conclusions given multiple premises and conclusions that constitute a story. The evaluation metric employed is accuracy. After systematically evaluating the FOL reasoning ability of LLMs (i.e., GPT-NeoX (Black et al., 2022), OPT (Zhang et al., 2022), GPT-3 (Brown et al., 2020), Codex (Chen et al., 2021)) using few-shot prompting, they reveal that even GPT-3 davinci, the best-performing model among these four LLMs, attains only slightly improved results compared to random guessing and demonstrates a notable weakness in accurately predicting the valid truth values for False and Unknown conclusions. The second task is an NL-FOL translation task, which is a text generation task involving the translation between natural language and first-order logic. Syntactic validity, syntactic exact match, syntactic abstract syntax tree match, predicate fuzzy match and execution accuracy are adopted to evaluate this task. Experimental results indicate that models with sufficient scale excel in capturing patterns for FOL formulas and generating syntactically valid FOL formulas. However, GPT-3 and Codex still face challenges in effectively translating an NL story into a logically or semantically similar FOL counterpart.
Table 2: Details of multi-hop reasoning datasets.
Multi-hop reasoning refers to the ability to connect and reason over multiple pieces of information or facts to arrive at an answer or conclusion. It involves traversing a chain of facts or knowledge in order to make more complex inferences or answer questions that cannot be answered by simply looking at a single piece of information (Tang et al., 2021b).
Significant advancements have been made in multi-hop reasoning evaluation benchmarks, with some of the most classical and representative ones being HotpotQA (Yang et al., 2018) and HybridQA (Chen et al., 2020), which are typically evaluated by measuring standard evaluation metrics such as EM and F1 between the generated answer and the ground truth answer. Table 2 provides detailed information about the datasets used to evaluate the capability of LLMs in answering multi-hop questions. In a study by Bang et al. (2023), ChatGPT’s performance in multi-hop reasoning is assessed using 30 samples from the HotpotQA dataset. The results indicate that ChatGPT exhibits very low performance, shedding light on a common limitation shared among LLMs, indicating that they possess restricted capabilities in handling complex reasoning tasks. Chen et al. (2023a) monitor how LLMs’ ability to answer multi-hop questions of the HotpotQA dataset evolves over time. They observe significant drifts in the performance of both GPT-4 and GPT-3.5 on this particular task. Specifically, there is a very substantial increase in the exact match rate for GPT-4 from March 2023 to June 2023, while GPT-3.5 shows opposite trends with a decline in performance. These observations indicate the fragility of current prompting methods and libraries when confronted with the LLM drift in handling complex tasks.
Given that mathematics necessitates advanced cognitive skills such as reasoning, abstraction, and calculation, its evaluation constitutes a significant component of large language model assessment. Typically, a mathematical reasoning evaluation test set comprises problems with corresponding correct answers serving as labels, with accuracy commonly employed as the measurement criterion. This section primarily elucidates the evolution of mathematical reasoning evaluation datasets and associated evaluation methods within the realm of mathematical reasoning.
The development of mathematical reasoning evaluation for AI models can be divided into two stages. The initial stage predates the advent of LLMs, during which evaluation datasets are primarily designed to facilitate the study of automated solutions for mathematics and science problems. Among various problem types, math word problems align closely with natural language processing tasks, thereby garnering significant attention from researchers. Evaluation datasets from this stage include AddSub (Hosseini et al., 2014), MultiArith (Roy & Roth, 2015), AQuA (Ling et al., 2017), SVAMP (Patel et al., 2021), and GSM8K (Cobbe et al., 2021). Among these datasets, AddSub, MultiArith, and AQuA, as early datasets, feature a relatively small data volume, ranging from 395 to 600 elementary questions. GSM8K and SVAMP, on the other hand, are recent datasets that have drawn considerable attention from the research community. The queries and answers within GSM8K are meticulously designed by human problem composers, guaranteeing a moderate level of challenge while concurrently circumventing monotony and stereotypes to a considerable degree. SVAMP questions the efficacy of automatic solver models that achieve high performance based solely on shallow heuristics. Consequently, modifications have been made to certain existing questions in order to evaluate the true ability of these models on the test set.
During the second stage, a variety of datasets are curated primarily for evaluating LLMs. These datasets can be roughly divided into two categories. The first category is characteristic of comprehensive examinations, which cover multiple subjects to assess LLMs. The mathematical subject is usually included, where mathematics-related inquiries are primarily presented as multiple-choice questions. Studies such as M3KE (Liu et al., 2023a) and C-EVAL (Huang et al., 2023c) fall within this purview, both of which contain questions from primary, middle, and high school mathematics. Researchers from Vietnam have developed VNHSGE (Dao et al., 2023), a Vietnamese High School Graduation Examination dataset, which consists of 2500 mathematical questions, covering mathematical concepts of spatial geometry, number series, combinations, and more. The second category emphasizes the proposition of mathematical test sets that can profoundly evaluate LLMs. In addition to math word problems, other types of math problems are also gradually gaining traction in mathematical reasoning evaluation work. The MATH dataset (Hendrycks et al., 2021c), for instance, includes 7 types of problems: Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus. These mathematical problems are sourced from the American High School Mathematics Competition and are tagged with difficulty levels ranging from 1 to 5.
The JEEBench (Arora et al., 2023) is introduced to challenge GPT-4. Evaluation questions are sourced from the Indian Joint Entrance Examination Advanced Exam, which is challenging and time-consuming even for humans. Compared to MATH, the mathematical evaluation questions in this dataset are significantly more difficult, thereby enhancing its value for testing the limits of GPT-4. In terms of assessing pure arithmetic ability, MATH 401 (Yuan et al., 2023) is proposed, featuring a variety of arithmetic expressions. In addition to standard addition, subtraction, multiplication, and division, this test set also contains more complex calculations, such as exponentiation, trigonometry, logarithm functions, and more. CMATH (Wei et al., 2023b) introduces a Chinese Elementary School Math Word Problems dataset. The feature of this dataset is that it categorizes the difficulty of mathematical problems by grade and provides annotations for the steps to solve these problems, enabling researchers to better comprehend the model’s evaluation results.
The mathematical reasoning ability of LLMs is usually assessed under the zero- or few-shot setting, where either no or a few examples are incorporated into prompts for the tested model to elicit a response. CMATH employs zero-shot evaluation and finds that GPT-4 delivers the best performance, with accuracy exceeding 60% across all six grades. However, all models exhibit a decline in performance as the grade level increases. Wei et al. (2022) introduce the concept of Chain-of-thought and demonstrate its effectiveness in prompting LLMs. They conduct experiments on GSM8K, SVAMP, ASDiv (Miao et al., 2020), and AQuA, and suggest that Chain-of-thought prompting is suitable for evaluating LLMs. In addition to Chain-of-thought prompting, other types of prompting are also used in mathematical reasoning tasks. These include self-consistency prompting, Plan-and-Solve prompting (Wang et al., 2023c), and so on. JEEBench experiments with both Chain-of-thought and self-consistency prompting. The JEEBench results indicate that even GPT-4 may struggle to retrieve relevant math concepts and perform appropriate operations. As LLM evaluations progress, some studies have noted that the aforementioned evaluation methods fall under static evaluation. These studies suggest that the way humans interact with LLMs affects the model evaluation results. Therefore, it is crucial to collect data on user behaviors and corresponding model results to better analyze the alignment between them. In this aspect, Collins et al. (2023) introduce CheckMate, a dynamic evaluation method that incorporates interactive elements into evaluation.
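To make the zero-/few-shot setting and self-consistency prompting more concrete, here is a minimal sketch. It assumes that self-consistency means sampling several reasoning paths and taking a majority vote over the extracted final answers, which is how the technique is generally understood. The prompt format, the "last number in the completion" extraction rule, and all function names are illustrative assumptions; the model call itself is omitted and sampled completions are passed in as plain strings.

```python
import re
from collections import Counter
from typing import Iterable, Optional, Sequence, Tuple


def build_prompt(question: str, demonstrations: Sequence[Tuple[str, str]] = ()) -> str:
    """Zero-shot if `demonstrations` is empty, few-shot otherwise."""
    parts = [f"Q: {q}\nA: {a}" for q, a in demonstrations]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def extract_answer(completion: str) -> Optional[str]:
    """Assumed rule: treat the last number in the completion as the answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else None


def self_consistent_answer(completions: Iterable[str]) -> Optional[str]:
    """Majority vote over the answers extracted from sampled completions."""
    answers = (extract_answer(c) for c in completions)
    votes = Counter(a for a in answers if a is not None)
    return votes.most_common(1)[0][0] if votes else None


samples = [
    "3 + 4 = 7, so the answer is 7.",
    "She ends up with 7 apples.",
    "The answer is 8.",
]
print(self_consistent_answer(samples))  # 7
```

Accuracy is then computed by comparing the voted answer with the labeled answer for each problem, exactly as in the single-sample case.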
Tool learning refers to foundation models enabling AI to manipulate tools, which can lead to more potent and streamlined solutions for real-world tasks (Qin et al., 2023b). LLMs can perform grounded actions to interact with the real world, such as manipulating search engines (Nakano et al., 2021; Qin et al., 2023a), shopping on ecommerce websites (Yao et al., 2022), planning in robotic tasks (Huang et al., 2022a; Ichter et al., 2022; Huang et al., 2022b), etc. The model’s ability for tool learning can be divided into the capability to manipulate tools and the capability to create tools.
The model’s capability to manipulate tools can be further divided into two categories: tool-augmented learning, which uses tools to enhance or expand the model’s abilities (Mialon et al., 2023), and tool-oriented learning, which aims at mastering a certain tool or technique and is concerned with developing models that can control tools and make sequential decisions in place of humans (Qin et al., 2023b). In the following sections, we summarize the evaluation methods for these two tool learning approaches.
In general, current evaluation methods focus on two aspects: (i) Assessing whether the task can be achieved, that is, whether the model can successfully execute tools by understanding them (Song et al., 2023; Ichter et al., 2022). Under this dimension, commonly used evaluation metrics include the execution pass rate and the tool operation success rate. (ii) Assessing how well it is done, which evaluates the model’s deeper capabilities once it has been determined that the model can achieve the task. This covers whether the final answer is correct, the quality of generated programs, and human experts’ preferences regarding the model’s operation process. Although some automatic evaluation metrics exist, most current research still relies on manual preference evaluations (Thoppilan et al., 2022; Qin et al., 2023a; Tang et al., 2023c).
Evaluation for Tool-augmented Models. Many studies combine commonly used evaluation datasets, such as math problems (Cobbe et al., 2021), reasoning, and question answering, to assess the improvement in downstream-task performance after incorporating application programming interface (API) calls into models, and use the corresponding metrics from these datasets (Hsieh et al., 2023; Zhuang et al., 2023; Schick et al., 2023; Borgeaud et al., 2022; Lu et al., 2023a; Sun et al., 2023; Parisi et al., 2022; Chen et al., 2022a; Gao et al., 2023; Qiao et al., 2023; Hao et al., 2023; Lu et al., 2023b). The evaluation metrics used in these studies include accuracy, F1, and Rouge-L. By combining existing datasets into evaluation benchmarks, these studies provide useful references for similar future evaluations.
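To make two of the metrics named above concrete, here is a minimal sketch of exact-match accuracy and a ROUGE-L F-score based on the longest common subsequence (LCS). It assumes lowercased whitespace tokenization and equal weighting of precision and recall; published ROUGE implementations may add stemming or a different weighting, so treat this as an approximation rather than a reference implementation.

```python
from typing import List, Sequence


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Share of predictions that equal their reference after trimming whitespace."""
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using one rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F-score with equal weight on LCS precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match_accuracy(["42", "7"], ["42", "8"]))
    print(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"))
```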
LaMDA (Thoppilan et al., 2022) introduces new evaluation metrics on an existing popular dialogue dataset, proposing both foundational and role-specific metrics. The foundational metrics include rationality, specificity, novelty, empiricity, informativeness, and citation accuracy. Role-specific measures focus on helpfulness, ensuring that the model’s response matches the intended role. These metrics are evaluated by crowdsourced workers. However, such manual evaluations are expensive, time-consuming, and intricate. The complexity of human judgment also makes these evaluations less efficient and less generalizable than widely accepted automatic evaluation metrics. Additionally, beyond establishing evaluation metrics, it is essential to ensure that different models use the same version of the API when their capabilities are compared (Qin et al., 2023b). This guarantees a more equitable and unbiased assessment.
Tool-augmented learning has propelled the application of LLMs in the medical domain. GeneGPT (Jin et al., 2023) integrates the NCBI Web API with LLMs. It is evaluated on 9 GeneTuring tasks (Hou & Ji, 2023) related to NCBI resources, each with 50 question-answer pairs. Tasks are grouped into four categories: gene naming, genome positioning, gene function analysis, and sequence alignment. Most LLMs, such as GPT-3, ChatGPT, and New Bing, perform poorly, often scoring 0.0. However, GeneGPT, combined with the NCBI Web API, excels in one-shot learning, though it still exhibits some error types, including extraction issues.
Evaluation for Tool-oriented Models. We categorize the evaluation methods based on the type of tools that the model has learned to control.
Search Engine. Building upon WebGPT (Nakano et al., 2021), WebCPM (Qin et al., 2023a) uses tool learning to allow models to answer long-form questions by searching the web. It improves on WebGPT’s evaluation methods with both automatic and manual evaluations. For automatic evaluation, action prediction uses F1 metrics, while other tasks like query generation use Rouge-L. For manual evaluation, 8 annotators compare answers from three sources: search model, human-collected facts, and Bing. Results show that mBART (Liu et al., 2020c) and C-BART (Shao et al., 2021) underperform other PLMs, while mT0 (Muennighoff et al., 2023) is generally better than mT5 (Xue et al., 2021). This highlights the need for language models to refine skills during multi-task fine-tuning.
Online Shopping. WebShop (Yao et al., 2022) trains models to query online shopping engines and make purchases. They split their 12,087-instruction dataset into a training set with 10,587 instructions, a development set with 1,000 instructions, and a test set with 500 instructions, collecting human shopping paths for each instance. Using task score and success rate as metrics, they compare the average performance of humans and models, and find that humans outperform LLMs on all metrics. The most notable difference, a 28% gap, is in making the correct choice after searching, highlighting agents’ struggles to choose the right product options.
Code Generation. RoboCodeGen (Liang et al., 2023) introduces a new benchmark with 37 function generation tasks, which has several key differences from previous code generation benchmarks: (i) It is robot-themed, focusing on spatial reasoning, geometric reasoning, and control. (ii) It allows and encourages the use of third-party libraries, such as NumPy. (iii) The provided function headers have neither documentation strings nor explicit type hints, so LLMs need to infer and adhere to common conventions. (iv) The use of undefined functions is also permitted, which can be constructed via hierarchical code generation. The evaluation metric is the percentage of generated code that passes manually written unit tests. The results show that domain-specific language models (e.g., Codex (Chen et al., 2021)) generally outperform LLMs from OpenAI, and within each model family, performance improves with increasing model size.
Robotic Tasks. In these tasks, LLMs serve as a multi-step-planning “command center”, using a robotic arm to interact with the environment. ALFWorld (Shridhar et al., 2021) is a game simulator that aligns text with embodied environments, enabling agents to learn abstract, text-based strategies in TextWorld. Subsequently, these strategies can be executed in a rich visual environment to accomplish objectives set in the ALFRED benchmark (Shridhar et al., 2020). This benchmark encompasses six distinct tasks and over 3,000 environments. It requires the agent to comprehend the target task, devise sequential plans for sub-tasks, and execute actions in the given environment. Tasks include searching for hidden objects (such as locating a fruit knife in a drawer), moving objects (e.g., moving a knife to a chopping board), manipulating one object with another (for instance, refrigerating a tomato in the fridge) and so on. Ichter et al. (2022) also construct 101 commands across 7 command families referencing ALFRED (Shridhar et al., 2020) and Behavior (Srivastava et al., 2021) to test the PaLM-SayCan system, a tool-learning PaLM (Chowdhery et al., 2023) model. The task requires models to use a mobile robotic arm and a set of object manipulation and navigation skills in two environments (i.e., office and kitchen). Performance is measured based on the appropriateness of the selected skills to the command and the system’s successful execution of the required commands. Three human evaluators assess the entire process, with final results showing that PaLM-SayCan achieves an 84% planning success rate and a 74% execution rate in the simulated kitchen environment. Meanwhile, Inner Monologue (Huang et al., 2022b) analyzes desktop operations and navigation tasks in simulated and real environments, evaluating InstructGPT (Brown et al., 2020; Ouyang et al., 2022) and PaLM (Chowdhery et al., 2023). Their results indicate that rich semantic knowledge in pre-trained LLMs can be directly transferred to unseen robotic tasks without the need for further training.
Multi-tool Benchmark. As discussed above, evaluation for tool-augmented and tool-oriented LLMs primarily assesses the use of a single tool based on the performance change on downstream tasks with existing benchmarks. However, these benchmarks might not genuinely represent the extent to which models utilize external tools, since some tasks in these benchmarks can be accurately addressed using only the internal knowledge of the assessed LLMs. In light of this issue, an increasing number of researchers have begun to focus on scenarios that combine multiple tools to evaluate the performance of LLMs that have undergone tool learning. This gives a more comprehensive and diverse reflection of the model’s capabilities and limitations when using various tools. We hence compare existing hybrid tool benchmarks in detail to guide subsequent evaluations.
API-Bank (Li et al., 2023c) presents a tailor-made benchmark for evaluating tool-augmented LLMs, encompassing 53 standard API tools, a comprehensive workflow for tool-augmented LLMs, and 264 annotated dialogues. It uses accuracy as the metric for evaluating API calls and ROUGE-L as the metric for evaluating post-call responses. For task planning evaluation, a plan is considered complete when the model successfully calls the API with the given parameters. Experiment results on API-Bank show that, compared to GPT-3 (Brown et al., 2020), GPT-3.5-turbo has the capability to use tools, while GPT-4 (OpenAI, 2023) possesses more robust planning capabilities. Nonetheless, there remains significant room for improvement compared to human performance. APIBench (Patil et al., 2023) constructs a large API corpus by scraping ML application interfaces (models) from three public model hubs: HuggingFace, TorchHub, and TensorHub. They include all API calls from TorchHub (94 API calls) and TensorHub (696 API calls). For HuggingFace, due to the vast number of models, they select only the top 20 most downloaded models from each task category, totaling 925 models. Moreover, they utilize Self-Instruct (Wang et al., 2023e) to generate 10 synthetic user question prompts for each API. Using the created dataset, they check LLMs for functional correctness and hallucination, reporting the corresponding accuracy. They discover that invoking APIs using GPT-4 and GPT-3.5-turbo under the zero-shot setting leads to severe hallucination errors. Xu et al. (2023b) curate a new benchmark, named ToolBench, combining existing datasets and new datasets they collect. This benchmark evaluates models’ ability to generalize to unseen API combinations and to engage in advanced reasoning. It encompasses eight tasks, including single and multi-step action generation. Each task contains approximately 100 test cases.
Open-source models, after tool learning, achieve comparable or even better success rates than the GPT-4 API on 4 out of the 8 tasks. However, their success rates are still relatively low on tasks requiring advanced reasoning. ToolAlpaca (Tang et al., 2023c) expands evaluation scenarios to cover ten real-world settings. From a training set of 426 tool uses, ten previously unseen tools are selected, resulting in 100 evaluation instances. Using the ReAct style (Yao et al., 2023), they trigger tool usage during text generation. Human reviewers assess program accuracy and overall correctness. Even with limited simulated training data, GPT-3.5 and Vicuna (Chiang et al., 2023) demonstrate strong tool generalization abilities, and ToolAlpaca’s performance is comparable to that of GPT-3.5. TPTU (Ruan et al., 2023) introduces a diverse evaluation dataset ranging from individual tool usage to comprehensive end-to-end multi-tool utilization. Different models show varying levels of proficiency across tasks. For instance, Claude (Bai et al., 2022) exhibits excellent SQL generation capabilities, while ChatGLM (Zeng et al., 2023a) excels in math code generation. These differences could be attributed to training data, training strategies, or model size. This comprehensive evaluation focuses on the appropriateness of the selected tools and their effective use. The benchmarks mentioned earlier are designed to assess the ability of LLMs to use multiple tools to tackle challenging tasks. They primarily emphasize constructing high-quality tool chains for LLM fine-tuning and evaluating the accuracy of API calls in fixed and real-world scenarios. In contrast, ToolQA (Zhuang et al., 2023) centers on whether the LLMs can produce the correct answer, rather than on the intermediate process of tool utilization during benchmarking.
Additionally, ToolQA aims to differentiate between the LLMs using external tools and those relying solely on their internal knowledge by selecting data from sources not yet memorized by the LLMs. Specifically, it incorporates 13 different types of tools to test the external tool-using capability of LLMs, with reference data spanning text, tables, and charts. These tools encompass functionalities like word counting, question rephrasing, retrieval, parsing, calculation, reasoning, and more. With success rate as the evaluation metric, experimental results indicate that LLMs leveraging external tools significantly outperform those models that only utilize internal knowledge. Qin et al. (2023b) embark on a study to explore the applications of tool learning, investigating the efficacy and constraints of state-of-the-art LLMs when they use tools.
They select 18 representative tools for assessment. For six of these tools, existing datasets are employed for evaluation. In contrast, for the remaining 12, such as slide-making, AI painting, and 3D model construction, they adopt the Self-Instruct approach (Wang et al., 2023e). Utilizing ChatGPT, they expand upon the manually written user queries and then manually assess the success rate of these operations. By contrasting the performance of ChatGPT and text-davinci-003, they observe that, although ChatGPT has undergone fine-tuning with RLHF, its outcomes do not surpass those of text-davinci-003. Previous benchmarks mainly focus on simple tasks completed using a single API. In contrast, RestBench (Song et al., 2023) aims to promote the exploration of addressing real-world user instructions using multiple APIs. They choose two prevalent real-world scenarios: the TMDB movie database and the Spotify music player. TMDB provides official RESTful APIs covering information on movies, TV shows, actors, and photos. The Spotify music player offers API endpoints to retrieve content metadata, receive recommendations, create and manage playlists, and control playback. For these two scenarios, they select 54 and 40 commonly used APIs, respectively, and obtain the corresponding OpenAPI specifications to construct RestBench. Through manual evaluation, they assess the correctness of the API call paths generated by the model and the success rate of completing user queries. They find that when the official checkpoints of Llama2-13B are used to implement RestGPT, the models fail to understand the prompts and to generate effective plans. ToolLLM (Qin et al., 2023c) introduces ToolEval, a universal evaluation tool resembling a leaderboard. It highlights two metrics: pass rate, which measures the proportion of successfully completed instructions within limited attempts, and win rate, which compares performance against ChatGPT.
Such an evaluation approach not only integrates both automatic and manual assessment methods but also ingeniously uses comparison with the ChatGPT-generated solutions as a substitute for direct human scoring. This significantly reduces the potential biases and unfairness that humans might introduce.
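The two ToolEval-style metrics described above can be sketched as simple aggregates. The attempt logs and preference flags below are invented, and the way a judge decides whether a solution succeeds or beats the reference is assumed to happen upstream; only the counting is shown.

```python
from typing import Sequence


def pass_rate_within_budget(attempts_per_instruction: Sequence[Sequence[bool]],
                            budget: int) -> float:
    """Share of instructions solved by at least one of the first `budget` attempts."""
    if not attempts_per_instruction:
        return 0.0
    solved = sum(any(attempts[:budget]) for attempts in attempts_per_instruction)
    return solved / len(attempts_per_instruction)


def win_rate(preferred_over_reference: Sequence[bool]) -> float:
    """Share of instructions where the judged solution beats the reference solution."""
    if not preferred_over_reference:
        return 0.0
    return sum(preferred_over_reference) / len(preferred_over_reference)


if __name__ == "__main__":
    # One inner list per instruction; each entry is the outcome of one attempt.
    attempts = [[False, True], [False, False, True], [True]]
    print(pass_rate_within_budget(attempts, budget=2))
    print(win_rate([True, False, True, True]))
```

With a budget of two attempts, the second instruction above counts as unsolved because its only success comes on the third attempt.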
Cai et al. (2023) assess whether scheduler models can effectively recognize existing tools and create tools for unfamiliar tasks. They use 6 datasets from diverse areas: logic reasoning, object tracking, Dyck language, word sequencing, the Chinese remainder theorem, and meeting scheduling. While the first five datasets are from BigBench (Srivastava et al., 2022), the meeting scheduling task is specially developed to demonstrate the model’s real-world applicability. CREATOR (Qian et al., 2023), focusing on LLMs’ tool-making ability, introduces the Creation Challenge dataset to test LLMs’ problem-solving skills in new situations without readily available tools or code packages. By leveraging the Text-Davinci-003 model, they expand the dataset iteratively for more diversity and novelty. Their evaluations on the challenge dataset reveal that ChatGPT’s tool-making performance improves with more hints, reaching up to 75.5% accuracy.
In reviewing related evaluations, we notice a shortage of high-quality datasets for genuine human-machine interactions in real-world scenarios. We hope our efforts inspire the research community to develop such benchmarks, which might be crucial for training the next generation of AI systems.
Figure 3: Overview of alignment evaluations.
Although instruction-tuned LLMs exhibit impressive capabilities, these aligned LLMs still suffer from annotators’ biases, catering to humans, hallucination, etc. To provide a comprehensive view of LLMs’ alignment evaluation, in this section we discuss evaluations of ethics, bias, toxicity, and truthfulness, as illustrated in Figure 3.
The ethics and morality evaluation of LLMs aims to assess whether LLMs are capable of ethical value alignment and whether they generate content that potentially deviates from ethical standards. While the criteria for determining moral categories vary considerably, we group current evaluations into four broad perspectives based on their respective criteria.
Evaluation with Expert-defined Ethics and Morality. Expert-defined ethics and morality refers to ethics and morality categorized by experts, usually proposed in academic books and articles. The earliest ethics and morality categories can be traced back to Moral Foundations Theory (MFT) (Graham et al., 2009). MFT divides moral principles into five categories, each of which contains a positive and a negative perspective. MFT has become a cornerstone of related datasets. These datasets focus on ethics and morality in different fields, such as politics (Johnson & Goldwasser, 2018), social sciences (Forbes et al., 2020), and social media (Hoover et al., 2020). Rather than simply using yes/no to classify a scene or paragraph into one of the ten moral foundations proposed by MFT, Social Chemistry 101 (Forbes et al., 2020) and Moral Foundations Twitter Corpus (Hoover et al., 2020) use a multi-dimensional metric to determine the categories. Social Chemistry 101 breaks social norms down into 12 dimensions, which include the moral foundations proposed in MFT. Moral Stories (Emelin et al., 2021) is a crowd-sourced dataset containing 12K short narratives for goal-oriented moral reasoning grounded in social situations, generated from social norms extracted from Social Chemistry 101 while ignoring controversial or value-neutral entries. Moral Foundations Dictionary (MFD) (Rezapour et al., 2019) is proposed on the foundation of MFT, and is extended by Hopp et al. (2021) because MFD restricts the utility of certain words in expressing and understanding moral messages and natural variations of their meaning.
In evaluating LLMs, TrustGPT (Huang et al., 2023b) proposes a method to evaluate the ethical and moral alignment of LLMs in two ways: active value alignment (AVA) and passive value alignment (PVA). The dataset used is Social Chemistry 101. The evaluation metrics for AVA are soft and hard accuracy, which account for the variation in human judgments of the same object, while the metric for PVA is the proportion of cases where LLMs refuse to answer. Results on TrustGPT show that on AVA, the evaluated LLMs perform better on soft accuracy than on hard accuracy. It can also be concluded that the evaluated LLMs have a certain ability to judge social norms, since the hard accuracy is above 0.5. However, the performance on PVA is not good. ETHICS (Hendrycks et al., 2021a) builds on previous works that focus on various principles for narrow applications (Kitaev et al., 2020; Achiam & Amodei, 2019; Roller et al., 2021; Christiano et al., 2017) and reorganizes them into five dimensions: justice, deontology, virtue ethics, utilitarianism, and commonsense moral judgements. The 0/1-loss is used in the experiments evaluating LLMs on ETHICS.
Evaluation with Crowdsourced Ethics and Morality. Ethics and morality defined in this way are established entirely by crowdsourced workers, who judge ethics and morality without professional guidance or training, relying only on their own preferences. Botzer et al. (2021) analyze moral judgements rendered on social media by capturing the judgements passed in the subreddit /r/AmITheAsshole on Reddit. The labels of the collected data in their work are determined entirely by public voting in the social media community.
Many other works (Forbes et al., 2020; Hendrycks et al., 2021a; Ziems et al., 2022) use data from this subreddit as the source of their datasets, but they all preprocess the collected data in different ways. Another way to collect crowdsourced ethics and morality data is through interviews. MoralExceptQA (Jin et al., 2022) considers 3 potentially permissible exceptions, manually creates scenarios according to these 3 exceptions, and recruits subjects from diverse racial and ethnic groups on Amazon Mechanical Turk (AMT). Different subjects are presented with the same written scenario and asked to decide whether to conform to the original norm or to break it in the given case. Binary classification is used as the evaluation metric, and results show that, for InstructGPT, questions about how much harm a decision will cause are the easiest to answer, whereas questions about the purpose behind a moral rule are the most challenging.
Evaluation with AI-assisted Ethics and Morality. AI-assisted ethics and morality refers to settings in which AI is used to assist humans in determining ethical categories or constructing datasets. With the rise of LLMs, curating datasets with the assistance of LLMs is promising. PROSOCIALDIALOG (Kim et al., 2022) is a multi-turn dialogue dataset that teaches conversational agents to respond to problematic content following social norms. GPT-3 (Brown et al., 2020) is used to draft the first three statements of each dialogue, prompted through examples to play the roles of a problematic speaker and an inquisitive speaker. Crowdworkers revise these utterances and also annotate Rules of Thumb (RoTs) and responses. After N rounds of generating and proofreading the dialogue, workers finally label the safety of the dialogue. MIC (Ziems et al., 2022) is also a dialogue dataset, but it focuses on prompt-reply pairs. They filter eligible metadata from r/AskReddit to use as prompts for BlenderBot (Roller et al., 2021), DialoGPT (Zhang et al., 2020), and GPT-Neo (Black et al., 2021). Outputs are filtered to make sure that at least one word appears in EMFD (Hopp et al., 2021). Crowdsourced workers are asked to match each filtered Q&A pair to one RoT, to answer a series of questions about the attributes of the RoT they match, and to revise the answer to the prompt so that it is either neutral or aligns with the RoT.
Scherrer et al. (2023) use the rules in Gert (2004) as the moral rules for generating scenarios and action pairs. They define low-ambiguity and high-ambiguity settings, with scenarios and actions in each setting generated by GPT-4 or text-davinci-003. They evaluate 28 selected open- and closed-source LLMs across these settings from the perspectives of statistical measures and evaluation metrics.
Evaluation with Hybrid Ethics and Morality. This includes both data on ethical guidelines created by experts and data on ethical guidelines determined by the crowd. Lourie et al. (2021) use two datasets: ANECDOTES, which collects 32,000 real-life anecdotes with normative judgments, and DILEMMAS, which contains 10,000 simple ethical dilemmas. As with the dataset proposed by Botzer et al. (2021), the raw data of ANECDOTES comes from Reddit and is cleaned by rule-based filters that remove undesirable posts and comments, and the voting results of Reddit users are directly used as the labels for each instance. In DILEMMAS, each instance pairs two actions from ANECDOTES, and annotators hired from AMT identify which of the two actions they find less ethical.
Bias in language modeling is often defined as “a bias that produces a harm to different social groups” (Crawford, 2017), and the types of harms associated with it include the association of particular stereotypes with groups, the devaluing of groups, the underrepresentation of particular social groups, and the inequitable allocation of resources to different groups (Dev et al., 2022). Existing works have examined the possible harms of NLP modeling from a variety of perspectives, such as general social impacts (Hovy & Spruit, 2016) and risks associated with LLMs (Bender et al., 2021), the latter of which is particularly important today when LLMs are widely used. In order to mitigate these biases and associated harms, it is crucial to be able to detect and measure them, and a better understanding of bias metrics allows researchers to better adapt and deploy LLMs.
A variety of studies have already demonstrated the existence of biases inside language models and word embeddings (Caliskan et al., 2017; Bolukbasi et al., 2016; Lauscher et al., 2020; Malik et al., 2022). Now, extensive efforts are being made on the external assessment of bias, specifically on biased model decisions for certain tasks (Mohammad, 2018; Webster et al., 2019) or on direct evaluation of content generated by LLMs (Dhamala et al., 2021; Smith et al., 2022). In this survey, we summarize experiences from past works to address the following questions when assessing bias in LLMs: (i) what datasets can be used, (ii) what specific types of bias can be measured, and (iii) what the evaluation methods are. Regarding these three aspects, we compare previous works in terms of the types of biases covered and their evaluation methods.
Bias in model representations or embeddings does not necessarily imply biased outputs. To understand where the model’s output reinforces bias, many studies examine how these biases manifest in downstream tasks that have been previously researched. Since the advent of seq-to-seq models, all NLP tasks can be unified as generation tasks. For example, given the instruction “Please identify the referent of ‘he’ in the following sentence”, the model can complete the coreference resolution task without needing specific training for it. Therefore, datasets used for bias evaluation in these downstream tasks can also be applied to LLM bias assessment.
Coreference Resolution. Coreference resolution is the task of determining which textual references resolve to the same entity, requiring inference about these entities. However, when these entities are persons, coreference resolution systems may make inappropriate inferences, causing harm to individuals or groups. Both Winogender (Rudinger et al., 2018) and WinoBias (Zhao et al., 2018) focus on gender bias associated with professions and use Winograd-schema style (Levesque, 2011) sentences to construct evaluation datasets. Winogender consists of 120 sentence templates covering 60 professions; sentences are generated from each template by replacing only the pronoun, with three pronoun genders: male, female, or neutral. They use the tendency of coreference systems to match female pronouns with specific professions rather than male pronouns as an evaluation metric and evaluate three coreference resolution systems. WinoBias, on the other hand, increases the focus on debiasing methods, requiring models not only to make decisions with gendered pronouns and stereotypically associated professions but also to connect pronouns with non-stereotypical professions. A model is considered to pass the WinoBias test only if it achieves high F1 scores in both tasks. Both studies indicate that current systems overly rely on social stereotypes when resolving ‘he’ and ‘she’ pronouns. After noting the phenomena revealed by WinoBias and Winogender, GAP (Webster et al., 2018) creates a corpus of 8,908 manually annotated ambiguous pronoun examples from Wikipedia, intending to promote equitable modeling of reference phenomena through detailed corpus annotation. Additionally, Cao & Daumé III (2020) point out that sociological and sociolinguistic gender concepts are not always binary; for example, some drag performers are referred to as ‘she’ during performances and ‘he’ otherwise.
Therefore, they create a new dataset, the Gender Inclusive Coreference dataset (GICOREF), written by and about transgender individuals, to test the performance of coreference resolution systems on texts discussing non-binary and binary transgender individuals. They observe significant room for improvement in coreference systems, with the best-performing system achieving an F1 score of only 34%.
However, a recent study (Blodgett et al., 2021) exposes several issues in the reliability of both the WinoBias and Winogender datasets. They identify a series of pitfalls in these datasets, including unstated assumptions, ambiguities, and inconsistencies. Their analysis shows that only 0%–58% of the tests in these benchmarks are unaffected by these pitfalls, suggesting that these benchmarks might not provide effective measurements of stereotyping.
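The template-based construction used by Winogender-style datasets, where only the pronoun changes between otherwise identical sentences, can be illustrated with a short sketch. The template text and the `{nom}`/`{acc}`/`{poss}` placeholder names are invented for this example and are not taken from the actual dataset.

```python
from typing import Dict

# Pronoun forms for the three pronoun genders mentioned above.
PRONOUNS: Dict[str, Dict[str, str]] = {
    "male": {"nom": "he", "acc": "him", "poss": "his"},
    "female": {"nom": "she", "acc": "her", "poss": "her"},
    "neutral": {"nom": "they", "acc": "them", "poss": "their"},
}


def expand_template(template: str) -> Dict[str, str]:
    """Return one sentence per pronoun gender; everything but the pronoun is identical."""
    return {gender: template.format(**forms) for gender, forms in PRONOUNS.items()}


if __name__ == "__main__":
    # Hypothetical template; a past-tense verb keeps agreement valid for "they".
    template = "The technician told the customer that {nom} had finished the repair."
    for gender, sentence in expand_template(template).items():
        print(f"{gender:8s} {sentence}")
```

A coreference system would then be run on each variant, and differences in how it resolves the pronoun across the three sentences serve as the bias signal.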
Machine Translation. Some studies have observed that online machine translation services like Google Translate or Microsoft Translator exhibit certain gender biases (Alvarez-Melis & Jaakkola, 2017; Font & Costa-jussà, 2019). For example, regardless of the context, ‘nurse’ is translated as female, and ‘programmer’ as male. Such biases can be harmful if they occur frequently.
The WinoMT Challenge Set (Stanovsky et al., 2019) conducts the first large-scale, multilingual evaluation of translation systems. They combine Winogender and WinoBias to assess gender bias in MT, and design an automatic translation evaluation method for eight different target languages. MT models have to translate all sentences into the target language. They use simple heuristic methods and morphological analysis specific to the target language to extract the gender of the target entities. They calculate the percentage of instances in which the machine-generated translation has the correct gender as an indicator to evaluate four widely used commercial MT systems and two state-of-the-art MT models. Their results show significant gender bias in all tested languages. Further, Renduchintala & Williams (2021) expand this gender study in translation tasks to 20 languages. They believe that operationalizing gender bias measurement in an unambiguous task is clearer than framing it as an ambiguous task. So, they add contextual information to the occupational nouns to clearly specify the gender of the person referred to. For example, in the sentence “My nurse is a good father”, the gender identity of the nurse is unambiguous. In such a context, they determine whether the model’s stereotypical tendencies lead to translation errors. They observe that the accuracy does not exceed 70% for any language or model. When the trigger word gender and the occupational gender do not match, the accuracy drops. These two datasets can be easily extended to more languages and language models.
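The gender-accuracy indicator described above can be sketched as follows, assuming that the gender of the translated entity has already been extracted by some language-specific analysis. The `Example` record, its field names, and the four sample rows are hypothetical; the split by stereotype alignment mirrors the observation that accuracy drops when context and occupational stereotype disagree.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    gold_gender: str                 # gender specified by the source sentence
    predicted_gender: Optional[str]  # gender extracted from the translation, None if unknown
    stereotypical: bool              # True if gold gender matches the occupational stereotype


def gender_accuracy(examples: List[Example]) -> float:
    """Fraction of translations whose extracted gender matches the gold gender."""
    if not examples:
        return 0.0
    correct = sum(e.predicted_gender == e.gold_gender for e in examples)
    return correct / len(examples)


def accuracy_by_stereotype(examples: List[Example]) -> dict:
    """Accuracy on stereotype-consistent versus stereotype-inconsistent examples."""
    pro = [e for e in examples if e.stereotypical]
    anti = [e for e in examples if not e.stereotypical]
    return {"pro": gender_accuracy(pro), "anti": gender_accuracy(anti)}


# Invented sample data for illustration only.
SAMPLE = [
    Example("female", "female", stereotypical=True),
    Example("male", "male", stereotypical=True),
    Example("male", "female", stereotypical=False),
    Example("female", "female", stereotypical=False),
]

if __name__ == "__main__":
    print(gender_accuracy(SAMPLE))
    print(accuracy_by_stereotype(SAMPLE))
```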
Natural Language Inference The task of Natural Language Inference (NLI) aims to determine whether a sentence (the premise) implies or contradicts another sentence (the hypothesis), or whether the two are neutral in relation to each other.
Dev et al. (2020) use NLI tasks to measure biases in models, as illustrated by the following sentences: (1) A rude person visits the bishop. (2) An Uzbek visits the bishop. Clearly, the first sentence neither implies nor contradicts the second one. However, GloVe (Pennington et al., 2014) predicts with a high probability of 0.842 that sentence (1) implies sentence (2). To uncover this hidden bias, a systematic benchmark is developed targeting polarized adjectives (e.g., ‘rude’) and ethnic names (e.g., ‘Uzbek’), covering millions of such sentence pairs. Besides gender, they also include categories of nationality and religion for the first time. They define the bias metric as the deviation from neutrality and find a significant amount of bias in GloVe, ELMo (Peters et al., 2018), and BERT (Devlin et al., 2019).
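One way to read "deviation from neutrality" is to look at how much probability an NLI model puts on the neutral label for pairs that should be neutral. The sketch below is an assumption about how such a score could be computed, not the exact metric of Dev et al. (2020); the function names are hypothetical, and apart from the 0.842 entailment probability quoted above, the numbers are made up.

```python
from typing import Dict, List

Probs = Dict[str, float]  # keys: "entailment", "contradiction", "neutral"


def mean_neutral_probability(predictions: List[Probs]) -> float:
    """Average probability assigned to the neutral label."""
    return sum(p["neutral"] for p in predictions) / len(predictions)


def fraction_predicted_neutral(predictions: List[Probs]) -> float:
    """Share of pairs whose most probable label is neutral."""
    neutral = sum(max(p, key=p.get) == "neutral" for p in predictions)
    return neutral / len(predictions)


# Both pairs should ideally be neutral; the first one is not.
predictions = [
    {"entailment": 0.842, "contradiction": 0.058, "neutral": 0.100},
    {"entailment": 0.100, "contradiction": 0.100, "neutral": 0.800},
]
print(mean_neutral_probability(predictions))
print(fraction_predicted_neutral(predictions))
```

An unbiased model would score close to 1 on both quantities, so the distance from 1 can be read as the amount of bias.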
Sentiment Analysis Sentiment analysis aims to understand the attitudes, emotions, and opinions expressed in text. However, some computational approaches to sentiment analysis may exhibit social biases. For example, sentences containing adjectives related to certain minority groups may be more likely to be rated as negative compared to the same sentences without those adjectives. This is especially true for groups that may be underestimated or stigmatized.
Díaz et al. (2019) pay special attention to age bias in this task. They crawl 4,151 blog posts and 64,283 comments from the “elderblogger” community (Lazar et al., 2017) and select 121 unique sentences. In each of these 121 sentences, they change only the age-related vocabulary, producing a comparative dataset that measures whether the scores of sentiment analysis models change when specific words are varied. They find that there is a significant age bias in most algorithm outputs. Sentences with the adjective “young” are 66% more likely to be rated as positive than the same sentences with the adjective “old”. The Equity Evaluation Corpus (EEC) (Kiritchenko & Mohammad, 2018) also uses pairs of sentences but focuses on biases related to race and gender. It expands the dataset to 8,640 English sentences and conducts a large-scale and comprehensive evaluation of 219 sentiment analysis systems.
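The paired-sentence idea above amounts to scoring two sentences that differ in a single term and comparing the results. The following sketch assumes a sentiment scorer that maps a sentence to a number; `toy_scorer` is a deliberately biased stand-in written for this example, not a real sentiment model, and the templates are hypothetical.

```python
from typing import Callable, List


def score_gaps(
    templates: List[str],
    term_a: str,
    term_b: str,
    scorer: Callable[[str], float],
) -> List[float]:
    """Score difference between the two variants of each template."""
    return [
        scorer(t.format(term=term_a)) - scorer(t.format(term=term_b))
        for t in templates
    ]


def toy_scorer(sentence: str) -> float:
    # Hypothetical lexicon; "old" carries a negative weight on purpose
    # so that the bias is visible in the output.
    weights = {"great": 1.0, "terrible": -1.0, "old": -0.5}
    return sum(weights.get(word, 0.0) for word in sentence.lower().split())


templates = ["My {term} neighbor is great", "The {term} driver was terrible"]
print(score_gaps(templates, "young", "old", toy_scorer))
```

A scorer that ignores the swapped term would yield gaps of zero for every template; consistently positive or negative gaps point to a bias tied to that term.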
Relation Extraction Relation extraction refers to extracting entity relations from original sentences and representing them as concise relation tuples. However, the fairness of this process is often overlooked. If a neural relation extraction (NRE) model more accurately predicts relations for male entities than female entities (e.g., regarding professions), the knowledge base to be constructed with extracted relations may end up with more information about males and less about females. This gender bias could then influence downstream predictions and reinforce societal gender stereotypes.
WikiGenderBias (Gaut et al., 2020) is a dataset created to assess gender bias in relation extraction systems. It measures the performance difference in extracting sentences about females versus males, containing 45,000 sentences, each of which consists of a male or female entity and one of four relations: spouse, profession, date of birth and place of birth. The creators suspect that a biased NRE system might use gender information as a proxy when extracting spouse and profession relations. This evaluation framework is used to assess gender bias in popular, open-source NRE models, offering valuable insights for developing future bias mitigation techniques in relation extraction.
Implicit Hate Speech Detection This task aims to identify and classify text content that includes hatred and prejudice. Such content may target individuals, specific groups, races, religions, sexual orientations, etc. The core challenge is that people’s comments about others are often implied rather than explicitly stated. In other words, they do not contain obvious foul language, defamation, or swear words. This differentiates the task from the assessment of toxic language. Detecting this implicit hatred is a daunting task, especially since it requires particular attention to the possibility of model classification errors. A model may wrongly classify non-hate speech as hate speech (false positive) or hate speech as non-hate speech (false negative). These errors may be related to the model’s inherent biases.
The benchmark datasets for this task are typically extracted and constructed from online social media, including Wikipedia Talk pages (Dixon et al., 2018), Civil Comments (Borkan et al., 2019; Do, 2019; Hutchinson et al., 2020), Reddit (Sap et al., 2020; Breitfeller et al., 2019), Twitter (Sap et al., 2020; Park et al., 2018; Davidson et al., 2019; ElSherief et al., 2021), and Hate Sites (Sap et al., 2020), broadly covering bias categories such as gender, sexuality, race, religion, disability, body, and age. DynaHate (Vidgen et al., 2021) and TOXIGEN (Hartvigsen et al., 2022) use language models (GPT-3) to dynamically generate large-scale datasets with subtle biased comments, covering more population groups than traditional manually written text resources. Besides English language datasets, CDial-Bias (Zhou et al., 2022) introduces the first annotated Chinese social bias detection dialogue dataset, covering race, gender, region, and occupation categories. CORGI-PM (Zhang et al., 2023a) extracts sentences that might have gender bias from a large-scale Chinese corpus, constructing a dataset for gender bias detection, classification, and mitigation tasks.
Most studies measure performance using ROC-AUC (Do, 2019; Park et al., 2018; Dixon et al., 2018; Hutchinson et al., 2020), accuracy, and F1 scores (Sap et al., 2020; ElSherief et al., 2021). However, HateCheck (Röttger et al., 2021) points out that it is hard to identify specific weaknesses in models with these indicators. To provide more targeted diagnostic insights, they introduce the HateCheck functional test suite, which evaluates model performance on this task across 29 model functionalities.
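The difference between an aggregate score and a functional test suite can be illustrated by grouping test cases by the functionality they probe. The sketch below is only an illustration of that idea: the functionality names and test cases are hypothetical and are not taken from HateCheck.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Each case: (functionality, gold label, predicted label)
Case = Tuple[str, str, str]


def accuracy_by_functionality(cases: Iterable[Case]) -> Dict[str, float]:
    """Accuracy computed separately for every functionality."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for functionality, gold, predicted in cases:
        totals[functionality] += 1
        correct[functionality] += int(gold == predicted)
    return {name: correct[name] / totals[name] for name in totals}


cases = [
    ("negated_hate", "non-hate", "hate"),
    ("negated_hate", "non-hate", "non-hate"),
    ("slur_usage", "hate", "hate"),
]
print(accuracy_by_functionality(cases))
```

Here the overall accuracy of two out of three hides the fact that all errors fall into one functionality, which is the kind of weakness a single aggregate number cannot reveal.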
Currently, many downstream task assessments are well-resourced in English, but are lacking for many other languages. We hope that more researchers from different cultural backgrounds participate in bias assessment research to lay the foundation for the safe use of LLMs worldwide.
StereoSet (Nadeem et al., 2021) and CrowS-Pairs (Nangia et al., 2020) are datasets designed to measure stereotypical bias in language models (LMs) by using sentence pairs to determine whether LMs prefer stereotypical sentences. StereoSet (SS) includes intra-sentential and inter-sentential prediction tests about race, religion, profession, and gender stereotypes. The intra-sentential test contains minimally different sentences about a target group, in which the attribute associated with the group is varied to be stereotypical, counter-stereotypical, or unrelated; the associations are collected from crowdsourced workers. The inter-sentential test consists of context sentences about the target group, followed by free-form candidate sentences, also capturing stereotypical, counter-stereotypical, or unrelated associations. SS has been used to evaluate pretrained language models (PLMs) like BERT, GPT-2, and RoBERTa. CrowS-Pairs (CS) includes only intra-sentential prediction tests and covers nine bias types: race, gender, sexual orientation, religion, age, nationality, disability, appearance, and socio-economic status or profession. It asks crowdsourced workers to write a sentence about a disadvantaged group that either expresses or counters a stereotype about that group, and then pairs it with a minimally different sentence about a contrasting advantaged group. Unlike SS, CS alters the groups rather than the attributes. The evaluation metric used in CS is adjusted accordingly: it estimates the probability of the unaltered tokens given the altered tokens, not the other way round, to avoid higher probabilities for words like ‘John’ just because of their frequency in the training data, rather than learned social biases. Similarly, Hosseini et al. (2023) propose a modified TOXIGEN, selecting only sentences that all annotators agree are biased towards the target group to reduce noise in TOXIGEN, and using log perplexity to assess the likelihood of benign and harmful sentences.
The higher the log perplexity, the less likely the model will generate those sentences. They measure the log perplexity of each sentence in the evaluation dataset and assess 24 PLMs, including GPT-2, which shows lower safety scores, indicating a higher likelihood of generating harmful and biased content.
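Log perplexity is the negative mean of the per-token log-probabilities that a model assigns to a sentence. The sketch below assumes those log-probabilities have already been obtained from a language model; the values are made up for illustration.

```python
import math
from typing import Sequence


def log_perplexity(token_logprobs: Sequence[float]) -> float:
    """Negative mean of natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("sentence has no tokens")
    return -sum(token_logprobs) / len(token_logprobs)


# Hypothetical per-token probabilities for two sentences.
benign = [math.log(0.5)] * 4    # the model finds these tokens fairly likely
harmful = [math.log(0.1)] * 4   # the model finds these tokens unlikely

print(log_perplexity(benign))
print(log_perplexity(harmful))
```

The second sentence has the higher log perplexity, which means the model is less likely to generate it.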
Besides examining model preferences, a more direct way to measure bias is to inspect the model’s generated text. In this approach, we provide a context to a model, which yields a response, and we then evaluate the bias in that response. However, the outputs of LLMs are usually very complex. Evaluating bias requires not only that the LLMs understand and comply with the prompt or instruction, but also that we have good metrics to assess the degree of bias in the generated outputs.
Some works adopt automatic evaluation metrics. Liu et al. (2020a) use four indicators (diversity, politeness, sentiment, and attribute words) to evaluate seq2seq generative models in the race and gender domains; the same indicators can also be applied to the evaluation of LLMs. Meanwhile, BOLD (Dhamala et al., 2021) extends this to five types of biases: occupation, gender, race, religion, and political ideology. Its prompts are sentences collected from Wikipedia and truncated, so that the LLMs receive the first half of a sentence and are tasked with completing the second half. BOLD then evaluates advanced LLMs from four aspects: gender polarity, regard (Sheng et al., 2019), sentiment, and toxicity. Another study conducted by Sheng et al. (2021) expands the categories of biases to social classes, sexual orientations, races, and genders, and jointly assesses the bias scores in model responses from four aspects: offensiveness, harmful agreements, occupational associations, and gendered coreferences. This study finds that the Blender chatbot (Roller et al., 2021) generates more “safe” and default answers (e.g., “I’m not sure what you mean…”, “I don’t know…”), while DialoGPT (Zhang et al., 2020) responses contain more diverse and direct answers.
In addition to using automatic metrics, other works explore manual evaluations. HolisticBias (Smith et al., 2022) includes 13 demographic axes and uses crowdsourced workers from Amazon’s Mechanical Turk platform to evaluate the outputs of models like GPT-2, DialoGPT, and BlenderBot based on human preference, humanness, and interestingness criteria. Multilingual Holistic Bias (Costa-jussà et al., 2023) extends the HolisticBias dataset to 50 languages, achieving the largest scale of English template-based text expansion.
Whether using automatic or manual evaluations, both approaches inevitably carry human subjectivity and cannot establish a comprehensive and fair evaluation standard. Unqover (Li et al., 2020) is the first to transform the task of evaluating biases generated by models into a multiple-choice question, covering gender, nationality, race, and religion categories. They provide models with ambiguous and disambiguated contexts and ask them to choose between options with and without stereotypes, evaluating both PLMs and models fine-tuned on multiple-choice question answering datasets. BBQ (Parrish et al., 2022) adopts this approach but extends the types of biases to nine categories. All sentence templates are manually created, and in addition to the two contrasting group answers, the model is also provided with correct answers like “I don’t know” and “I’m not sure”, and a statistical bias score metric is proposed to evaluate multiple question answering models. CBBQ (Huang & Xiong, 2023) extends BBQ to Chinese. Based on Chinese socio-cultural factors, CBBQ adds four categories: disease, educational qualification, household registration, and region. They manually rewrite ambiguous text templates and use GPT-4 to generate disambiguated templates, greatly increasing the dataset’s diversity and extensibility. Additionally, they improve the experimental setup for LLMs and evaluate existing Chinese open-source LLMs, finding that current Chinese LLMs not only have higher bias scores but also exhibit behavioral inconsistencies, revealing a significant gap compared to GPT-3.5-Turbo.
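To make the multiple-choice setup more concrete, the sketch below computes one possible bias score from a model's chosen options. The formula is an assumption made for illustration and should not be read as the exact score defined in BBQ or CBBQ: it ignores "unknown" answers and rescales the share of stereotype-aligned answers to the range from -1 to 1. The label names are hypothetical.

```python
from typing import List


def bias_score(answers: List[str]) -> float:
    """Score in [-1, 1]; 0 means no preference between the two groups.

    Each answer is "biased", "counter_biased", or "unknown".
    """
    committed = [a for a in answers if a != "unknown"]
    if not committed:
        return 0.0  # the model never picked a group, so no preference is visible
    biased = sum(a == "biased" for a in committed)
    return 2.0 * biased / len(committed) - 1.0


answers = ["biased", "biased", "biased", "counter_biased", "unknown"]
print(bias_score(answers))
```

In an ambiguous context the "unknown" option is the correct choice, so a model that always selects it commits to neither group and receives a score of zero under this sketch.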
In addition to the evaluation methods above, we could also use advanced LLMs such as GPT-4 to score bias, or employ the models that perform best on bias detection tasks to detect the level of bias in answers. Such models can be used not only in the evaluation phase but also for identifying biases in the pre-training data of LLMs, facilitating debiasing of the training data.
As the development of multilingual LLMs and domain-specific LLMs progresses, studies on the fairness of these models become increasingly important. Zhao et al. (2020) create datasets to study gender bias in multilingual embeddings and cross-lingual tasks, revealing gender bias from both internal and external perspectives. Moreover, FairLex (Chalkidis et al., 2022) proposes a multilingual legal dataset as a fairness benchmark, covering four judicial jurisdictions (European Commission, United States, Swiss Federation, and People’s Republic of China), five languages (English, German, French, Italian, and Chinese), and various sensitive attributes (gender, age, region, etc.). As LLMs have been applied and deployed in the finance and legal sectors, these studies deserve high attention.
LLMs are usually trained on a huge amount of online data which may contain toxic behavior and unsafe content. These include hate speech, offensive/abusive language, pornographic content, etc. It is hence very desirable to evaluate how well trained LLMs deal with toxicity. Considering the proficiency of LLMs in understanding and generating sentences, we categorize the evaluation of toxicity into two tasks: toxicity identification and classification evaluation, and the evaluation of toxicity in generated sentences.
An important NLP task is the identification and classification of toxic sentences. The most famous datasets for evaluating toxicity classification in English are OLID (Zampieri et al., 2019a) and SOLID (Rosenthal et al., 2021). OLID is an offensive language dataset crawled from Twitter, consisting of 14K sentences. The dataset is labeled at three levels: offensive/non-offensive, targeted insult/non-targeted insult, and individual/group/other as the target of the insult. Following the release of OLID, SOLID has been introduced, featuring a larger dataset labeled using a semi-supervised learning method. This new dataset comprises over 9 million sentences. For non-English languages, OLID-BR (Trajano et al., 2023) is curated for Brazilian Portuguese and KODOLI (Park et al., 2023) for Korean. OLID-BR contains more than 6K sentences, while KODOLI consists of 38K sentences.
Studies have been conducted to evaluate LLMs’ capability in toxicity identification and classification. Wang & Chang (2022) investigate zero-shot prompt-based toxicity detection via LLMs. They use the Social Bias Inference Corpus (Sap et al., 2020), HateXplain (Mathew et al., 2021), and Civility (Zampieri et al., 2019b) datasets for evaluation. Zhu et al. (2023b), Li et al. (2023b), and Huang et al. (2023a) specifically evaluate this task on ChatGPT. Zhu et al. (2023b) evaluate ChatGPT’s ability to reproduce human-generated labels, covering sentiment analysis and hate speech labeling. In reevaluating hate speech labeling, they employ the COVID-HATE (He et al., 2021) dataset, which includes 2K sentences. Li et al. (2023b) evaluate ChatGPT’s capability in detecting hateful, offensive, and toxic (HOT) content. They utilize the HOT Speech dataset, which comprises 3K sentences. Huang et al. (2023a) specifically examine ChatGPT’s capability to identify and classify implicit hate speech. They utilize the Latent Hatred (ElSherief et al., 2021) dataset, which consists of 6K sentences. For non-English hate speech detection, the study conducted by Çam & Özgür (2023) assesses ChatGPT’s performance using a Turkish dataset created by Mayda et al. (2021), which contains 1,000 sentences.
LLMs may generate toxic words or sentences. Therefore, it is important to evaluate the toxicity of LLM-generated sentences. RealToxicityPrompts (Gehman et al., 2020) serves as a testbed for evaluating toxic generation. The dataset consists of 100K naturally occurring prompts, with 22K of them having higher toxicity scores. It is commonly used for LLM toxicity evaluation, such as the toxicity evaluation of ChatGPT (Deshpande et al., 2023). HarmfulQ (Shaikh et al., 2023) is a benchmark dataset that contains 200 explicitly toxic questions generated by the text-davinci-002 model. Based on these datasets, the toxicity of the answers generated by LLMs can be evaluated. A widely used tool for measuring toxicity is the PerspectiveAPI proposed by Google Jigsaw (Lees et al., 2022). The scoring scale of this tool ranges from 0 to 1, indicating a progression from lower toxicity to higher toxicity. At present, PerspectiveAPI can measure the toxicity of multilingual sentences, covering languages such as Arabic, Chinese, Czech, Dutch, English, French, German, Hindi, Hinglish, Indonesian, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, and Swedish.
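Once every generated continuation has a toxicity score between 0 and 1, the scores can be aggregated per prompt. The sketch below is one simple way to do that and makes several assumptions: the scores are hypothetical numbers rather than real API output, the 0.5 threshold is an arbitrary choice, and no call to any scoring service is made.

```python
from typing import List, Tuple


def summarize_toxicity(
    scores_per_prompt: List[List[float]], threshold: float = 0.5
) -> Tuple[float, float]:
    """Return (mean of per-prompt maximum score, share of prompts that
    produced at least one continuation at or above the threshold)."""
    if not scores_per_prompt or any(not s for s in scores_per_prompt):
        raise ValueError("every prompt needs at least one scored continuation")
    max_scores = [max(scores) for scores in scores_per_prompt]
    mean_max = sum(max_scores) / len(max_scores)
    toxic_share = sum(m >= threshold for m in max_scores) / len(max_scores)
    return mean_max, toxic_share


# Two prompts, two sampled continuations each (hypothetical scores).
scores = [[0.25, 0.75], [0.125, 0.25]]
print(summarize_toxicity(scores))
```

Taking the maximum over several samples per prompt captures the worst case a user might encounter, which a plain average over all continuations would dilute.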
LLMs have demonstrated remarkable proficiency in generating natural language text. The fluency and coherence of LLM-generated texts are even competitive with those of human-authored discourses. This proficiency has opened up avenues for the application of LLMs across a diverse spectrum of practical domains, including but not limited to education, finance, law, and medicine. However, despite their fluency and coherence, LLMs may fabricate facts and generate misinformation, thereby reducing the reliability of the generated texts (Bang et al., 2023). This limitation hinders their usage in specialized and rigorous applications such as law and medicine and exacerbates the risk of spreading misinformation. Consequently, it is crucial to verify the reliability of LLM-authored texts and conduct comprehensive assessments of their truthfulness. This will ensure that the information generated by LLMs is accurate and reliable, thereby enhancing their utility in various practical domains.
In the pursuit of evaluating the truthfulness of LLMs, various datasets have been curated. These datasets can be categorized into three primary types based on their associated tasks: question answering, dialogue, and summarization.
Question Answering Question answering datasets play a critical role in assessing the truthfulness of LLMs. The majority of these datasets serve as a means to evaluate the models’ proficiency in answering questions that remain unanswerable due to various factors, including those outside the current scope of human knowledge or those lacking essential context and background information needed to arrive at a verifiable answer. When presented with such unanswerable questions, LLMs should indicate uncertainty in their responses rather than attempting to provide deterministic answers that lack factual grounding. We provide a brief overview of key question answering datasets that encompass such unanswerable questions, thereby affording an effective means to assess the performance of LLMs with respect to truthfulness.
NewsQA (Trischler et al., 2017) is a machine comprehension dataset comprising 119,633 human-authored question-answer pairs based on CNN news articles. The crowdworkers who formulate the questions are shown only the headlines and summary points, not the full news articles. As a result, the hidden article may not contain sufficient evidence to answer some questions. Consequently, 9.5% of the questions have no answers in the corresponding articles.
SQuAD 2.0 (Rajpurkar et al., 2018) is a significant extension of the original SQuAD machine comprehension dataset (Rajpurkar et al., 2016). This more challenging version combines answerable questions from SQuAD (Rajpurkar et al., 2016) with 53,775 new adversarial unanswerable questions anchored to the same context paragraphs. These new unanswerable questions are carefully crafted by crowdworkers to appear highly relevant to the corresponding paragraphs. However, these crafted questions have no actual answers supported by the paragraphs, which fools the model into producing unreliable guesses rather than abstaining from answering. This makes the dataset more challenging and tests the model’s ability to determine when it is unable to provide a reliable answer.
BIG-bench (Srivastava et al., 2022) is a collaborative benchmark comprising a diverse set of tasks that are widely perceived to surpass the existing capabilities of contemporary LLMs. The known_unknowns task within BIG-bench (Srivastava et al., 2022) contains unanswerable questions that have been deliberately curated such that no reasonable speculation can yield a valid answer, thereby intensifying the level of challenge. Furthermore, to balance the dataset, each unanswerable question is paired with a similar answerable question. This allows for a more rigorous evaluation of the models’ ability to provide accurate and reliable answers.
SelfAware (Yin et al., 2023) is a benchmark designed to evaluate how well LLMs can recognize the boundaries of their knowledge when they lack enough information to provide a definite answer to a question. It consists of 1,032 unanswerable questions and 2,337 answerable questions. These unanswerable questions are grouped into 5 categories based on the reasons they cannot be answered: no scientific consensus, imaginary, completely subjective, too many variables, and philosophical. By encompassing a variety of unanswerable question types, the SelfAware dataset (Yin et al., 2023) allows for a comprehensive assessment of LLMs’ ability to recognize their knowledge limitations across different domains.
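A simple way to check whether a model signals uncertainty on unanswerable questions is to look for uncertainty phrases in its response. The sketch below is a rough heuristic written for illustration, not the scoring procedure of any of the datasets above; the phrase list, function names, and example responses are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical phrase list; a real evaluation would need a broader one.
UNCERTAINTY_MARKERS = (
    "i don't know",
    "i am not sure",
    "cannot be determined",
    "no definitive answer",
)


def expresses_uncertainty(response: str) -> bool:
    """True if the response contains one of the uncertainty phrases."""
    text = response.lower()
    return any(marker in text for marker in UNCERTAINTY_MARKERS)


def abstention_accuracy(items: List[Tuple[str, bool]]) -> float:
    """items: (model response, whether the question is unanswerable)."""
    hits = sum(expresses_uncertainty(resp) == unanswerable for resp, unanswerable in items)
    return hits / len(items)


items = [
    ("There is no definitive answer to this question.", True),
    ("The answer is 42.", True),            # should have abstained
    ("Paris is the capital of France.", False),
    ("I don't know.", False),               # abstained on an answerable question
]
print(abstention_accuracy(items))
```

Phrase matching of this kind is brittle, so it is best seen as a first approximation that would be replaced by a more robust similarity or classification step in practice.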
In contrast to the above-mentioned datasets that quantify LLMs’ truthfulness by presenting models with unanswerable questions, the TruthfulQA benchmark (Lin et al., 2022a) aims to test whether LLMs can avoid generating false answers learned from training data. These learned false answers, referred to as imitative falsehoods, are false statements that have a high likelihood under the model’s training distribution. The benchmark contains 817 questions across 38 diverse categories, curated specifically to elicit such imitative falsehoods from models. By focusing on adversarial questions designed to trigger false claims frequently reflected in training data, TruthfulQA (Lin et al., 2022a) provides a rigorous test of whether current LLMs can generate truthful answers.
Dialogue One common application of LLMs is to power dialogue systems that can interact with humans in natural language. However, LLMs may produce responses that contain factual inaccuracies or inconsistencies (Welleck et al., 2019). Manually verifying the factual correctness and consistency of utterances produced by models during conversations is time-consuming and costly. Consequently, various automatic metrics have been proposed (Honovich et al., 2021; Zha et al., 2023) to address this issue. To facilitate research on automatic fact-checking and factual consistency evaluation in dialogue, various benchmark datasets have been curated. These datasets can be broadly classified into two categories: fact-checking and factual consistency evaluation.
Fact-checking Gupta et al. (2022) introduce the task of fact-checking in dialogue and curate the DIALFACT benchmark. The DIALFACT benchmark comprises 22,245 annotated conversational claims, each paired with corresponding pieces of evidence extracted from Wikipedia. These claims are categorized as either supported, refuted, or ‘not enough information’ based on their relationship with the evidence. The DIALFACT benchmark encompasses three subtasks: 1) The Verifiable Claim Detection task, which classifies whether a claim contains factual information that can be verified; 2) The Evidence Retrieval task, which retrieves relevant Wikipedia documents and evidence sentences for a given claim; and 3) The Claim Verification task, which classifies whether a claim is supported, refuted, or if there is not enough information based on the provided evidence sentences.
Factual Consistency Evaluation Honovich et al. (2021) construct a dataset of system responses for the Wizard-of-Wikipedia dataset (Dinan et al., 2019b), which includes manual annotations of factual consistency. Similarly, Dziri et al. (2022b) propose the BEGIN benchmark for evaluating factual consistency in knowledge-grounded dialogue. The BEGIN benchmark comprises 12,000 dialogue responses that are manually annotated into three categories: fully attributable, not fully attributable, and generic. Fully attributable responses convey information that is solely supported by the provided knowledge, while not fully attributable responses contain some unsupported or unverifiable information. Generic responses are too vague or broad to evaluate attribution accurately. Additionally, the ConsisTest benchmark (Lotfi et al., 2022) aims to evaluate the factual consistency of open-domain conversational agents. It uses the PersonaChat dataset (Zhang et al., 2018), which contains crowdsourced persona-grounded conversations, as its foundation. To construct the benchmark, simple factual questions in both WH and Y/N formats are generated from the persona statements and dialogue history present in the PersonaChat data (Zhang et al., 2018). These questions are then appended to appropriate dialogue segments to create benchmark samples. In total, the curated dataset contains approximately 18,600 conversational QA pairs to comprehensively assess consistency with both persona facts and conversational context.
Summarization Text summarization, wherein a succinct summary is automatically generated to encapsulate the most salient information derived from a lengthy document, stands as another prominent application of LLMs. Nonetheless, LLMs may struggle with generating summaries that maintain factual consistency with the source document (Falke et al., 2019). This underscores the importance of thorough evaluation of LLMs’ factual consistency prior to their deployment, thus stimulating research into the automatic verification of the factual accuracy of the summaries produced by these models (Goyal & Durrett, 2020; Kryscinski et al., 2020; Durmus et al., 2020; Scialom et al., 2021; Fabbri et al., 2022; Laban et al., 2022; Wang et al., 2023a; Luo et al., 2023). To facilitate more robust evaluations, several studies have focused on developing benchmarks to assess these factors. The majority of these benchmarks rely on manual annotation to assess the factual consistency between model-generated summaries and source documents. This annotation often includes ratings on a Likert scale indicating the degree of factual alignment between the summary and source (Fabbri et al., 2021), as well as binary consistency labels judging whether the summary is fully consistent or not (Kryscinski et al., 2020; Wang et al., 2020). Examples of such benchmarks include XSumFaith (Maynez et al., 2020), FactCC (Kryscinski et al., 2020), SummEval (Fabbri et al., 2021), FRANK (Pagnoni et al., 2021), SUMMAC (Laban et al., 2022), QAGS (Wang et al., 2020) and Goyal’21 (Goyal & Durrett, 2021). In contrast to the above-mentioned benchmarks, which are annotated at the span, sentence, or summary level, Cao et al. (2022) construct a benchmark annotated at the entity level, while Cao & Wang (2021) introduce the CLIFF benchmark with word-level annotations. These provide more fine-grained annotations compared to prior work.
To enable more robust and standardized evaluation of factuality on modern summarization systems, Tang et al. (2023a) construct the AGGREFACT benchmark. AGGREFACT aggregates 9 existing factuality-annotated datasets, including FactCC (Kryscinski et al., 2020), Wang’20 (Wang et al., 2020), SummEval (Fabbri et al., 2021), Polytope (Huang et al., 2020), Cao’22 (Cao et al., 2022), XSumFaith (Maynez et al., 2020), FRANK (Pagnoni et al., 2021), Goyal’21 (Goyal & Durrett, 2021), and CLIFF (Cao & Wang, 2021). By unifying multiple datasets and stratifying summaries based on underlying models, AGGREFACT allows for more rigorous analysis of performance, especially on recent state-of-the-art models.
In addition to benchmark datasets for evaluating the factual correctness of language models, the methodology itself for assessing truthfulness is another crucial driver of progress in this field. These approaches can be broadly categorized into three groups: natural language inference (NLI) based methods, question answering (QA) and generation (QG) based methods, and methods utilizing LLMs.
NLI-based Methods NLI is a fundamental task in natural language processing. It is primarily focused on discerning the logical relationship between two pieces of text, traditionally referred to as the “premise” and the “hypothesis”. The NLI task requires classifying the relationship between the premise and hypothesis as one of three potential logical relations: entailment, contradiction, or neutral. NLI plays a pivotal role in ensuring the consistency for text generated by applications such as dialogue and summarization systems. For dialogue systems, it is essential that the produced utterances are attributable to relevant source information, including the dialogue context and external knowledge (Welleck et al., 2019; Dziri et al., 2021; Lotfi et al., 2022). Similarly, for summarization systems, it is crucial that the generated summaries maintain consistency with the source document. The process of verifying the consistency between system outputs and source texts can be framed as an NLI problem (Falke et al., 2019; Laban et al., 2022; Maynez et al., 2020; Aharoni et al., 2022; Kryscinski et al., 2020; Utama et al., 2022; Roit et al., 2023), where an entailment result indicates that the source text and system output are consistent. The entailment models used for consistency verification are usually fine-tuned from pretrained language models like BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), T5 (Raffel et al., 2020), and mT5 (Xue et al., 2021) on NLI datasets such as SNLI (Bowman et al., 2015), MNLI (Williams et al., 2018), and ANLI (Nie et al., 2020).
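Framing consistency checking as NLI can be sketched as follows: the source text is the premise, each output sentence is a hypothesis, and an output sentence counts as consistent when the NLI model predicts entailment. The sketch assumes an NLI function is supplied by the caller; `toy_nli` is a crude word-overlap stand-in written for this example and is not a real entailment model.

```python
from typing import Callable, List

# (premise, hypothesis) -> "entailment" | "contradiction" | "neutral"
NliFn = Callable[[str, str], str]


def consistency_rate(source: str, output_sentences: List[str], nli: NliFn) -> float:
    """Share of output sentences that the NLI function judges as entailed."""
    if not output_sentences:
        raise ValueError("nothing to verify")
    entailed = sum(nli(source, sentence) == "entailment" for sentence in output_sentences)
    return entailed / len(output_sentences)


def toy_nli(premise: str, hypothesis: str) -> str:
    # Stand-in only: treats a hypothesis as entailed when all of its words
    # appear in the premise. A fine-tuned NLI model would go here instead.
    premise_words = set(premise.lower().split())
    hypothesis_words = set(hypothesis.lower().split())
    return "entailment" if hypothesis_words <= premise_words else "neutral"


source = "the cat sat on the mat in the kitchen"
summary = ["the cat sat on the mat", "the dog sat in the kitchen"]
print(consistency_rate(source, summary, toy_nli))
```

Passing the NLI model in as a function keeps the aggregation logic independent of the particular entailment model that is plugged in.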
QAQG-based Methods The Question Answering and Question Generation (QAQG)-based method is a novel approach for evaluating factual consistency between two texts. Originally proposed for summarization tasks (Durmus et al., 2020; Wang et al., 2020; Scialom et al., 2021; Fabbri et al., 2022), this method leverages Question Answering and Question Generation models to assess the factual consistency between generated summaries and their source documents. Specifically, the QAQG pipeline first employs a QG model to automatically generate questions or question-answer pairs from the summary text. If only questions are generated in the initial QG step, these questions are subsequently answered by a QA model conditioned separately on the summary and source document (Wang et al., 2020). However, if question-answer pairs are produced during QG, the questions are only answered by a QA model conditioned on the source document (Durmus et al., 2020). Subsequently, the similarity between the two sets of answers is quantified, typically using token-based matching metrics such as F1 scores, as an indicator of consistency between the summary and source document. The underlying intuition is that since the summary contains a subset of the information in the source document, the answers conditioned on the summary and document should exhibit high similarity if the summary faithfully represents the document. This QAQG framework can be analogously applied to dialogue tasks, where questions are generated conditioned on the dialogue responses and then answered by a QA model conditioned on the given knowledge source (Honovich et al., 2021; Dziri et al., 2022a; Deng et al., 2023b).
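The final comparison step of this pipeline, token-level F1 between the answer obtained from the summary and the answer obtained from the source, can be sketched as below. The question generation and question answering models are left out; the answer pairs are hypothetical, and whitespace tokenization with lowercasing is a simplifying assumption.

```python
from collections import Counter
from typing import List, Tuple


def token_f1(answer_a: str, answer_b: str) -> float:
    """Token-level F1 between two answer strings."""
    tokens_a = answer_a.lower().split()
    tokens_b = answer_b.lower().split()
    if not tokens_a or not tokens_b:
        return float(tokens_a == tokens_b)  # both empty counts as a match
    overlap = sum((Counter(tokens_a) & Counter(tokens_b)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(tokens_a)
    recall = overlap / len(tokens_b)
    return 2 * precision * recall / (precision + recall)


def qa_consistency(answer_pairs: List[Tuple[str, str]]) -> float:
    """Mean F1 over (answer from summary, answer from source) pairs."""
    return sum(token_f1(a, b) for a, b in answer_pairs) / len(answer_pairs)


pairs = [("in Paris", "Paris"), ("1998", "1998")]
print(qa_consistency(pairs))
```

A summary that faithfully reflects the source should lead to near-identical answers and therefore a mean F1 close to 1.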
LLM-based Methods Recent studies suggest that when provided with appropriate prompts, LLMs can serve as general-purpose evaluators of text quality (Fu et al., 2023; Wang et al., 2023a; Bai et al., 2023b; Liu et al., 2023h; Li et al., 2023d; Chen et al., 2023d; Zheng et al., 2023; Dubois et al., 2023; Ji et al., 2023), as well as evaluators for task-specific applications such as translation (Kocmi & Federmann, 2023) and summarization (Chen et al., 2023b; Gekhman et al., 2023). In the context of LLMs’ truthfulness, Tam et al. (2023) propose measuring the factual consistency of LLMs by prompting them to evaluate how often they prefer factually consistent summaries over inconsistent ones for a given source document. Their research uses LLMs’ performance on this factual consistency assessment task in summarization as an indicator of the models’ factual consistency. Likewise, Chern et al. (2023) and Min et al. (2023) introduce FacTool and FActScore, respectively, to assess the factuality of text generated by LLMs. Specifically, FacTool first prompts LLMs to extract claims from the text to be evaluated, based on natural language definitions of claims for different tasks. Subsequently, it prompts LLMs to generate queries from these extracted claims, enabling them to query external tools such as search engines, code interpreters, or LLMs themselves for evidence collection. Finally, it prompts LLMs to compare the claims against the evidence and assign binary factuality labels to each claim. In a similar manner, FActScore assesses text factuality by first decomposing the text into short statements using LLMs. Each of these short statements represents an atomic fact, containing a single piece of information. Subsequently, LLMs are prompted to validate these atomic facts.
The works mentioned above usually rely on widely recognized, powerful LLMs such as GPT-4 and ChatGPT as evaluators, and the LLM that generates the text is usually different from the LLM that judges it. In contrast, another line of research delves into self-evaluation, in which LLMs evaluate the factuality of the text they generate themselves. Pioneering works in this area have demonstrated that LLMs are capable of expressing uncertainty regarding the accuracy of their responses to questions, implying that LLMs possess some degree of self-awareness regarding their knowledge boundaries (Lin et al., 2022b; Kadavath et al., 2022). Following this line of research, and based on the intuition that factual content comprises the majority of the training corpus, LLMs are expected to assign higher probability to tokens associated with factual content. Consequently, multiple responses that LLMs generate to the same prompt should be similar to each other if the responses are not hallucinated, as the common generation strategies today tend to favor tokens with higher probabilities. Accordingly, Manakul et al. (2023) propose SelfCheckGPT, which quantifies text factuality by first sampling multiple responses and then measuring the consistency between these responses. Instead of assessing the truthfulness of LLMs through their produced text, Azaria & Mitchell (2023) propose training a classifier that predicts whether a response is true or false using the hidden layer activations of LLMs as inputs for the classifier.
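The sampling-and-consistency idea can be illustrated with a small Python sketch. It is not the scoring used by SelfCheckGPT: the Jaccard token overlap below is a deliberately simple stand-in for a real consistency measure, and the names `jaccard` and `self_consistency` are hypothetical. The sampled responses are assumed to have been drawn from the same LLM for the same prompt beforehand.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard overlap between the lowercased token sets of two texts."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def self_consistency(response: str, samples: list[str]) -> float:
    """Mean similarity between a response and other samples for the same prompt.

    A low value means the samples disagree with the response, which under the
    intuition above hints at hallucinated content. Stand-in metric only.
    """
    if not samples:
        raise ValueError("at least one sampled response is required")
    return sum(jaccard(response, s) for s in samples) / len(samples)
```

With `response = "the answer is 42"`, samples that repeat the response push the score towards 1, while samples such as `"the answer is 17"` pull it down. In practice a semantic measure would replace the token overlap, since paraphrases with the same meaning share few tokens.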
Figure 4: Overview of safety evaluations for LLMs.
Table 3: Recent works on LLM robustness evaluation.
In this section, we discuss evaluations of the safety of LLMs, as illustrated in Figure 4. Based on current studies, we roughly categorize LLM safety evaluations into two groups: robustness assessment, which measures the stability of LLMs when confronted with disruptions, and risk evaluation, which examines the behaviors of advanced, general-purpose LLMs and assesses them as agents.
Robustness is one of the important elements to evaluate in order to develop LLMs with stable performance. Low robustness to unseen scenarios or various attacks may cause severe safety issues. Recent works on LLM robustness evaluation are summarized in Table 3. We group LLM robustness evaluation into three categories: prompt robustness, task robustness, and alignment robustness.
Zhu et al. (2023a) propose PromptBench, a benchmark for evaluating the robustness of LLMs by attacking them with adversarial prompts (dynamically created character-, word-, sentence-, and semantic-level prompts). The adversarial prompts are used to evaluate eight different NLP tasks, each of which has its own dataset for evaluation. Liu et al. (2023i) evaluate the robustness of LLMs in handling prompt typos using prompts from the Justice dataset. Initially, LLMs are prompted to generate typos based on the Justice dataset. Then these generated prompts with typos are used to prompt LLMs to investigate the impact of prompt typos on the outputs of LLMs.
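To make the notion of a character-level perturbation concrete, here is a minimal Python sketch that injects typos by swapping adjacent inner characters of words. It is not the attack generator of PromptBench or of Liu et al. (2023i); the function name `swap_adjacent_chars`, the `rate` and `seed` parameters, and the rule of keeping each word's first and last character are all assumptions made for illustration.

```python
import random


def swap_adjacent_chars(text: str, rate: float, seed: int = 0) -> str:
    """Return `text` with typos injected by swapping two adjacent inner characters.

    Each word longer than three characters is perturbed with probability
    `rate`. The first and last character of a word are never moved, so the
    result stays readable to humans. A fixed `seed` makes runs reproducible.
    """
    rng = random.Random(seed)
    perturbed = []
    for word in text.split(" "):
        if len(word) > 3 and rng.random() < rate:
            # Pick i so that positions i and i + 1 are both inner characters.
            i = rng.randrange(1, len(word) - 2)
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        perturbed.append(word)
    return " ".join(perturbed)
```

A robustness check would then compare model outputs or task scores on the clean prompt and on `swap_adjacent_chars(prompt, rate=0.3)`, sweeping `rate` to see how quickly performance degrades.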
Wang et al. (2023b) evaluate the robustness of ChatGPT across various NLP tasks, including translation, question-answering (QA), text classification, and natural language inference (NLI). They perform this evaluation using AdvGLUE (Wang et al., 2021) and ANLI (Nie et al., 2020) as benchmark datasets for evaluating the robustness of LLMs on these tasks. Jiao et al. (2023) conduct a robustness evaluation of ChatGPT for the translation task using the WMT datasets (WMT19 Biomedical Translation Task (Bawden et al., 2019), set2 and set3 of WMT20 Robustness Task (Specia et al., 2020)). These datasets consist of parallel corpora containing naturally occurring noise and domain-specific terminology. For the question-answering task, Kokaia et al. (2023) mainly focus on improving the robustness of LLMs in moving from closed-book to open-book QA. To evaluate the improvement in robustness, they utilize a dataset consisting of 1,475 open-ended general knowledge questions, which are intentionally perturbed with typos and grammatical errors. Zhao et al. (2023) also evaluate the robustness of LLMs in the question-answering task, specifically in table-based question-answering. To achieve this, they create a new dataset named RobuT, comprising 143,477 pairs of examples sourced from the WTQ (Pasupat & Liang, 2015), WikiSQL (Zhong et al., 2017), and SQA (Iyyer et al., 2017) datasets. The RobuT dataset includes data with table headers, table content, natural language questions (NLQ), and various types of perturbations. The main types of perturbations are character- and word-level perturbations, along with row or column swapping, masking, and extension. Ko et al. (2023) primarily focus on evaluating the text classification task. They propose SynTextBench, a framework designed for generating synthetic datasets to evaluate the robustness and accuracy of LLMs in sentence classification tasks.
Gan & Mori (2023) also focus on evaluating classification tasks using Japanese language datasets: MARC-ja, JNLI, and JSTS. These are three separate datasets from the JGLUE benchmark (Kurihara et al., 2022). The prompt templates are divided into five types: instruction prompt, base prompt, Japanese honorific removal prompt, changed punctuation prompt, and changed sentence pattern prompt.
Since the emergence of large language models, the range of solvable tasks has been expanding to include tasks like code generation, mathematical reasoning, and dialogue generation. It is essential to evaluate the robustness of LLMs for solving these tasks. Wang et al. (2023d) propose ReCode, a benchmark for evaluating the robustness of LLMs in code generation. Using HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) datasets, ReCode generates perturbations in code docstring, function, syntax, and format. These perturbation styles encompass character- and word-level insertions or transformations. Shirafuji et al. (2023) conduct an evaluation of the robustness of LLMs in solving programming problems. The dataset is compiled from Aizu Online Judge (AOJ) and consists of 40 programming problems. It is then modified by randomizing variable names, anonymizing output settings, rephrasing with synonyms, and inverting problem specifications. For the math reasoning task, Stolfo et al. (2023) introduce a benchmark designed to evaluate the robustness of LLMs. They utilize datasets including ASDiv-A (Miao et al., 2020), MAWPS (Koncel-Kedziorski et al., 2016), and SVAMP (Patel et al., 2021) for this evaluation. The evaluation is grounded in causal inference factors, including textual framing, numerical operands, and operation types. Li et al. (2023f) propose DGSlow, a benchmark for evaluating the robustness of dialogue generation using a white-box attack. DGSlow generates adversarial examples from existing benchmark datasets, e.g., BlendedSkillTalk (Smith et al., 2020), Persona-Chat (Zhang et al., 2018), ConvAI2 (Dinan et al., 2019a), and EmpatheticDialogues (Rashkin et al., 2019).
The evaluation of robustness towards multilinguality is also crucial. Stickland et al. (2023) curate a multilingual task robustness dataset. The tasks specifically included are classification/labelling and NLI. Starting from the original datasets MultiATIS++ (Xu et al., 2020b), MultiSNIPS, MultiANN (Pan et al., 2017), and XNLI (Conneau et al., 2018), they curate noised versions of these datasets by replacing existing words with entries from a constructed noise dictionary.
The alignment robustness of LLMs also needs to be evaluated to ensure the stability of the alignment towards human values. Recent studies have used “jailbreak” methods to attack LLMs so that they produce harmful or unsafe behaviour and content. Liu et al. (2023j) empirically study the types and effectiveness of jailbreak prompts, resulting in a new dataset that consists of 78 jailbreak prompts. Their work focuses on evaluating ChatGPT against these jailbreak prompts. They find that ChatGPT is vulnerable to generating content related to illegal activities, fraudulent activities, and adult content. Wei et al. (2023a) conduct jailbreak attacks against GPT-4 and Claude by using a newly curated dataset that consists of 32 jailbreak prompts. Deng et al. (2023a) observe that different LLMs may have different jailbreak prevention mechanisms. They propose “MasterKey”, a comprehensive jailbreak attack framework inspired by SQL injection attacks. “MasterKey” is capable of generating jailbreak prompts that work on 5 different LLMs: GPT-3.5, GPT-4, BARD, Bing Chat, and ERNIE.
The aforementioned LLM evaluations all aim at assessing the existing capabilities of LLMs. However, as the capabilities of LLMs are rapidly approaching or reaching human levels, they may lead to catastrophic safety risks (Carlsmith, 2022; Shevlane et al., 2023; Anderljung et al., 2023), such as power-seeking and situational awareness. This suggests that it is necessary and important to build, in advance, evaluations that can deal with catastrophic behaviors and tendencies of LLMs. We describe the current progress on this from two aspects. One is the evaluation of LLMs by discovering their behaviors, which evaluates the process by which LLMs answer questions and make decisions, and verifies the consistency of LLMs’ behaviors. The other is the evaluation of LLMs by having them interact with the real environment, which regards LLMs as agents that imitate human behaviors in the real world in order to evaluate their ability to solve complex tasks.
Perez et al. (2023) attempt to discover LLMs’ risky behaviors by automatically constructing 154 datasets. Through these high-quality datasets, they find that LLMs not only show behaviors that please humans, but also exhibit desires for power and resources. At the same time, their experiments demonstrate that RLHF (Ouyang et al., 2022) would produce inverse scaling, that is, RLHF would further enhance LLMs’ political tendencies and strong desires not to be shut down. To evaluate such risks in LLMs, they define multiple categories to generate multiple-choice questions. Below we briefly introduce these categories of risks separately. Examples corresponding to each type of behavior are shown in Table 4.
In addition to this study, other works are also trying to discover the risky behaviors of LLMs. Fluri et al. (2023) argue that LLMs’ mistakes can be discovered by checking whether their behaviors are consistent, even when LLMs have superhuman abilities that make it difficult for humans to evaluate the correctness of model decisions. In their experiments, they observe logical errors of LLMs in decision-making with three tasks: chess games, future event prediction, and legal judgment.
Causality is another aspect of evaluating LLMs. BigToM (Gandhi et al., 2023) is a social reasoning benchmark that contains 25 control variables. It aligns human Theory-of-Mind (ToM) (Wellman, 1992; Leslie et al., 2004; Frith & Frith, 2005) reasoning capabilities by controlling different variables and conditions in the causal graph. Chen et al. (2023c) evaluate the counterfactual simulatability of explanations generated by LLMs. They propose two metrics, precision and generality, and use them to evaluate LLMs on multi-hop factual reasoning and reward modeling tasks. Their experiments reveal that LLMs’ explanations have low precision and that precision does not correlate with plausibility.
Chan et al. (2023) investigate cooperativeness in LLMs by evaluating the behaviors of LLMs in high-stakes interactions with other agents. They generate scenarios with particular game-theoretic structures using both crowdworkers and a language model, and provide a dataset of scenarios based on their generated data. They also test UnifiedQA (Khashabi et al., 2020) and GPT-3 (Brown et al., 2020) on this dataset and find that instruction-tuned models tend to act in a way that could be perceived as cooperative when scaled up.
Liu et al. (2023g) discuss the need to evaluate LLMs’ abilities as agents in interactive environments. They propose AgentBench, a benchmark consisting of 8 environments to evaluate the reasoning and decision-making abilities of LLMs. WebArena (Zhou et al., 2023) is another realistic and reproducible benchmark for agents, with fully functional websites from four common domains. WebArena includes a set of benchmark tasks to evaluate the functional correctness of task completions. Their experiments on several autonomous agents show that solving complex tasks is challenging, with their best GPT-4-based (OpenAI, 2023) agent achieving a success rate of only 10.59%.
In order to build task-oriented evaluation benchmarks more flexibly, researchers use sandboxes to design evaluation tasks, such as simulating human social activities (Liu et al., 2023f) and planning. Lin et al. (2023) propose a tool for easily building sandboxes, allowing users to customize and create their own evaluation tasks.
Evaluating LLMs as agents is in its infancy (Shevlane et al., 2023; Sato et al., 2023). The ARC Evals project of the Alignment Research Center is responsible for evaluating the abilities of advanced AI to seek resources, self-replicate, and adapt to new environments (Sato et al., 2023). They define these capabilities as Autonomous Replication and Adaptation (ARA), and argue that an agent must be able to perform almost all of the tasks in order to be capable of ARA. They built 4 agents that cooperate with LLMs and allow these agents to interact with the real world, such as reading books or searching for information on the Internet. They also design 12 real-world tasks at different difficulty levels, from simple text retrieval to fine-tuning an LLM, to evaluate whether the agents could complete them. These 12 tasks are: Search filesystem for password, List top BBC articles, Find information on Wikipedia, Create Bitcoin wallet, Make phone call with Twilio, Count dogs in image, Find new employees at company, Set up GPT-J, Create language model agent, Automatically restart agent, Targeted phishing, and Increase LLaMA context length. In their experiments, they find that a vanilla agent, such as an API, is unlikely to approach ARA. However, prompt engineering and fine-tuning can significantly improve the agent’s ability in autonomous tasks, even if the fine-tuned tasks are unrelated to ARA.
Figure 5: Overview of specialized LLMs evaluation.
LLMs have showcased remarkable performance in a multitude of downstream tasks, making them indispensable in various specialized domains. These domains encompass diverse fields such as biology and medicine, education, legislation, computer science, and finance. In this section, we delve into the recent accomplishments of LLMs within these domains, as demonstrated in Figure 5. Nevertheless, it’s important to acknowledge that challenges and limitations persist.
LLMs show promising potential in the medical domain, with application scenarios in patient triaging, clinical decision support, medical evidence summarization, and more, making scientific evaluation necessary. Various methods and datasets are proposed to evaluate LLMs’ abilities in the medical domain from different perspectives.
Medical Exam Singhal et al. (2022; 2023); Liévin et al. (2022); Nori et al. (2023); Sharma et al. (2023) assess LLMs’ general medical knowledge using real-world exams like United States Medical Licensing Exam (USMLE) or Indian Medical Entrance Exam (AIIMS/NEET). Besides, Antaki et al. (2023) evaluate ChatGPT in a more specialized aspect using a simulated Ophthalmic Knowledge Assessment Program (OKAP) exam and find accuracy in ophthalmology comparable to that of a general medical exam. Similar work has also been done in surgery (Oh et al., 2023), with the Korean general surgery board exam as a test dataset.
Evaluation in Application Scenarios Medical LLMs are also evaluated in their potential application scenarios. PubMedQA (Jin et al., 2019) measures LLMs’ question-answering ability on medical scientific literature, while LiveQA (Abacha et al., 2017) evaluates LLMs as a consultation robot using commonly asked questions scraped from medical websites. MultiMedQA (Singhal et al., 2022) integrates six existing datasets and further augments them with curated commonly searched health queries. Similarly, Ayers et al. (2023) compare ChatGPT’s ability to produce quality and empathetic responses to patient questions on a social media forum with that of physicians. Goodwin & Demner-Fushman (2022) propose a standard clinical language understanding benchmark based on disease staging, clinical phenotyping, mortality prediction, and remaining length-of-stay prediction, enabling direct comparison between different models. Other testing scenarios include medical evidence summarization (Tang et al., 2023b) and diagnosis and triage (Levine et al., 2023).
Evaluation by Human Given the safety-critical nature of the medical domain, detailed analyses of generated long-form answers are required to ensure safety and alignment with human values. Therefore, Singhal et al. (2022) move beyond automated metrics (for example, BLEU) to human evaluation along multiple axes including factuality, comprehension, reasoning, harm, and bias. They find that LLMs exhibit impressive performance, but gaps with professional clinicians still exist. These gaps can be narrowed with improved LLMs, better prompting strategies, and domain-specific fine-tuning (Singhal et al., 2023).
LLMs offer promising opportunities for educational applications and may revolutionize the way of both teaching and learning, necessitating a comprehensive evaluation framework in this field.
Teaching From the perspective of teaching, Tack & Piech (2022) view LLMs as AI teachers and have human raters evaluate their pedagogical competence on real-world educational dialogues in three dimensions: speaking like a teacher, understanding a student, and helping a student. However, both GPT-3 and Blender (Roller et al., 2021) perform worse than professional teachers, especially with regard to helpfulness. Wang & Demszky (2023) explore whether ChatGPT could serve as a coach to provide helpful feedback to teachers and propose three teacher coaching tasks: scoring transcript segments for items derived from classroom observation instruments, identifying highlights and missed opportunities of instructional strategies, and providing actionable suggestions for eliciting more student reasoning. Their results show that the feedback generated by ChatGPT is relevant, but often not novel or insightful.
Learning Other approaches evaluate LLMs from the perspective of learning. Pardos & Bhandari (2023) evaluate LLMs’ ability to assist with mathematics problems and compare the learning gains between ChatGPT and human tutor-generated algebra hints with 77 participants. While both types of hints produce positive learning gains, gains from human-created hints are statistically significantly higher than those of ChatGPT. Moreover, Dai et al. (2023) find ChatGPT can provide effective essay feedback to students with good readability and high agreement with experts.
LLMs also empower legislation.
Legislation Exam Similar to the biomedical field, the exam ability of LLMs in the legislation domain is evaluated. Bommarito II & Katz (2022) find that GPT-3.5 achieves a headline correct rate of 50.3% on the multistate multiple choice (MBE) section of the US legal Uniform Bar Examination, and that hyperparameter optimization and prompt engineering can positively impact GPT-3.5’s zero-shot performance. Katz et al. (2023) further evaluate GPT-4 on the entire Uniform Bar Examination (UBE) and find that GPT-4 passes the exam. Choi et al. (2023) evaluate ChatGPT on real exams at the University of Minnesota Law School and show that ChatGPT performs at the level of a C+ student, achieving a low but passing grade.
Legal Reasoning Legal reasoning is important for lawyers, and likewise for LLMs in the legislation domain. Yu et al. (2022) discover that GPT-3.5 can achieve SOTA performance on the COLIEE (Rabelo et al., 2022) entailment task, in which LLMs determine whether a hypothesis is true given the selected articles. Blair-Stanek et al. (2023) assess GPT-3 on a statutory-reasoning dataset called SARA (Holzenberger et al., 2020). Although SOTA results are achieved by GPT-3, it performs poorly on simple synthetic statutes, raising doubts about its basic legal ability. Moreover, Nguyen et al. (2023) build an abductive reasoning dataset in the binary classification form. However, compared with smaller models fine-tuned for the legal domain (for example, Legal BERT (Chalkidis et al., 2020)), GPT-3 gets the lowest accuracy under the zero-shot setting, highlighting the potential importance of domain-specific fine-tuning.
Evaluation in Application Scenarios Other work evaluates legal LLMs in real-world application scenarios. Savelka et al. (2023) ask GPT-4 to explain legal terms and employ two human experts to evaluate the generated response from the perspective of factuality, clarity, relevance, information richness and on-pointedness. While the explanation yielded by GPT-4 seems to be of high quality at the surface level, in-depth analysis uncovers hidden limitations, especially in factuality. In addition, Deroy et al. (2023) evaluate LLMs (ChatGPT and text-davinci-003) on legal case judgment summarization. Apart from standard metrics like ROUGE, METEOR, and BLEU, consistency with the input documents is also calculated by SummaC (Laban et al., 2022) as well as precision of numbers and named entities. Results show that LLMs generate inconsistent information, indicating that LLMs may not yet be ready for this task.
In the field of computer science, LLMs have extensive applications, e.g., code generation. We discuss LLM evaluation in this domain on code generation and programming assistance evaluation.
Code Generation Evaluation Liu et al. (2023d) propose EvalPlus, a code synthesis benchmarking framework, to evaluate the functional correctness of LLM-synthesized code. It augments evaluation datasets with test cases generated by an automatic test input generator. The popular HumanEval benchmark is extended by 81x to create HumanEval+ using EvalPlus. Additionally, EvalPlus is able to detect previously undetected incorrect code synthesized by LLMs, reducing the pass@k by 13.6-15.3 percent on average. As for vulnerability detection, Thapa et al. (2022a) explore large transformer-based language models for detecting software vulnerabilities in C/C++ source code. Results on software vulnerability datasets demonstrate the good performance of the language models in vulnerability detection. Xu et al. (2022a) evaluate LLMs including Codex, GPT-J, GPT-Neo, GPT-NeoX-20B, and CodeParrot across various programming languages. They release a new model called PolyCoder, with 2.7B parameters based on the GPT-2 architecture, which outperforms other evaluated models on the HumanEval dataset. Their results suggest that the left-to-right nature of the evaluated models makes them highly useful for program generation tasks, such as code completion. However, the number of parameters is not the only important factor.
Programming Assistance Evaluation Leinonen et al. (2023a) use Mann-Whitney U tests to compare student-generated and LLM-generated code explanations in terms of understandability, accuracy, and length. They find that LLM-created explanations are easier to understand and have more accurate summaries of code than student-created explanations. LLMs also help student programmers in writing code. Sandoval et al. (2023a) focus on understanding the impact of LLM code suggestions on participants’ code writing in a user study. Findings suggest that LLMs have a likely beneficial impact on functional correctness and do not increase the incidence rates of severe security bugs. Ross et al. (2023b) develop the Programmer’s Assistant, which is capable of generating both code and natural language responses to user inquiries. Their evaluation with 42 participants with varying levels of programming experience indicates that interaction with LLMs has unique potential in collaborative processes such as software development.
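For readers unfamiliar with pass@k, the sketch below shows one common way to compute it per problem: the unbiased estimator introduced together with HumanEval (Chen et al., 2021), 1 - C(n-c, k) / C(n, k), where n samples are generated and c of them pass all tests. We assume here, without confirmation from the text above, that this is the variant being reported; the function name `pass_at_k` is our own.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: number of code samples generated for the problem
    c: number of those samples that pass every test case
    k: evaluation budget (how many samples the user may draw)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a passing one.
        return 1.0
    # One minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark-level score is the mean of this quantity over all problems. Adding stricter test cases can only lower c for a given problem, which is why an augmented test suite reduces the reported pass@k.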
The significance of evaluating LLMs in the domain of finance lies in providing accurate and reliable answers related to financial knowledge to meet the needs of both professionals and non-professionals seeking financial information.
Financial Application In order to apply LLMs in the field of finance, researchers are continually developing LLMs in this domain. XuanYuan 2.0 (Zhang & Yang, 2023) is built on the advancements of pre-trained language models, excelling in generating coherent and contextually relevant responses within conversational context. FinBERT (Araci, 2019) constructs a financial vocabulary (FinVocab) from a corpus of financial texts using Google’s WordPiece algorithm. It incorporates finance knowledge and summarizes contextual information in financial texts, making it advantageous over other algorithms and Google’s original BERT model, particularly in scenarios with limited training data and texts containing financial words not frequently used in general texts. BloombergGPT (Wu et al., 2023) is a language model with 50 billion parameters, trained on a wide range of financial data, which makes it outperform existing models on various financial tasks, such as ConvFinQA (Chen et al., 2022b), FiQA SA (Maia et al., 2018), FPB (Malo et al., 2014), and Headline (Sinha & Khandait, 2020).
Evaluating GPT Son et al. (2023a) explore potential applications of LLMs in finance, including task formulation, synthetic data generation and prompting. They evaluate LLMs in these applications using GPT variants with parameter scales ranging from 2.8B to 13B. Their evaluation results reveal that coherent financial reasoning ability emerges at 6B parameters and improves with instruction tuning or larger training data. Niszczota & Abbas (2023) assess the ability of GPT to function as a financial robo-advisor for the general public. They use a financial literacy test and an advice-utilization task to evaluate two variants of GPT, text-davinci-003 and ChatGPT. The two GPT models achieve an accuracy of 58% and 67% on the financial literacy test, respectively. However, participants in the study overestimate GPT’s performance, putting it at 79.3%. They also find that subjects with lower financial knowledge have a higher likelihood of taking advice from GPT. Zaremba & Demir (2023) suggest the importance of continued research in the field to ensure the ethical, transparent, and responsible use of GPT models in finance. The training data used to fine-tune ChatGPT includes a diverse set of texts. Efforts should be made to remove low-quality and biased content in training data.
We have discussed the evaluation of LLMs from different perspectives, e.g., knowledge, reasoning, safety, and so on. As LLMs can be used in a very wide range of tasks, comprehensively evaluating LLMs from multiple views and tasks is desirable. This requires organizing multiple evaluation tasks in a comprehensive benchmark. Recent years have witnessed growing efforts in organizing comprehensive evaluation benchmarks, which can be categorized into benchmarks for NLU and NLG, benchmarks for knowledge and reasoning, and benchmarks for holistic evaluation.
Figure 6: Overview of LLM evaluation organization.
Understanding and generating language represent the core ability of human linguistic competence. Consequently, natural language understanding (NLU) and natural language generation (NLG) are the two key areas in natural language processing. The evaluation of models’ understanding and generation capabilities typically employs classic tasks from NLU and NLG, such as question answering and reading comprehension, among others. Typically, the tasks selected for evaluation are intentionally designed to be challenging while remaining solvable by the majority of human participants (Wang et al., 2019b). Each subtask has its own automatic evaluation metrics.
GLUE (Wang et al., 2019b) is a widely adopted benchmark in NLU, comprising nine tasks with different categories and a diagnostic dataset. These categories encompass single-sentence tasks, similarity and paraphrase tasks, as well as inference tasks. The diagnostic dataset is hand-picked to examine whether the assessed model is capable of understanding linguistically important phenomena (e.g., logic and predicate-argument structure). GLUE is constructed upon pre-existing datasets, each varying in data volume and complexity, thus ensuring a comprehensive evaluation of the NLU capabilities of models. Notably, GLUE has taken measures to prevent data leakage by acquiring private labels directly from the authors of some source datasets. Furthermore, GLUE furnishes a leaderboard where scores are computed as the average performance across the various subtasks.
Within a year of the release of GLUE, various advanced systems surpassed the performance of non-expert humans. Consequently, SuperGLUE (Wang et al., 2019a), motivated by similar high-level objectives as GLUE, is introduced with the aim of providing a concise yet challenging benchmark for evaluating NLU capabilities. Regarding task selection, SuperGLUE retains two tasks from GLUE, RTE (Recognizing Textual Entailment) and WSC (Winograd Schema Challenge), where substantial gaps in performance between humans and SOTA models still exist. The remaining six tasks are thoughtfully selected based on difficulty from task proposals solicited publicly. In terms of evaluation metrics, SuperGLUE remains consistent with GLUE.
Subsequently, CLUE (Xu et al., 2020a) is built with reference to GLUE and SuperGLUE, which creates a Chinese NLU benchmark containing Chinese-specific linguistic phenomena (e.g., four-character idioms).
Evaluation results from CLUE (Xu et al., 2020a) demonstrate that tasks that appear straightforward to a human may not necessarily be so for models. Additionally, despite the exceptional performance of certain models on benchmark tasks, their practical applicability often falls short of expectations. These observations collectively emphasize the substantial disparity between the assessment tasks within existing benchmarks and the intricate problems of real-world applications. To address these issues, Dynabench (Kiela et al., 2021) introduces a dynamic evaluation platform designed to evaluate models through multi-round interactions between humans and models. In each round, participants are tasked with supplying instances that the models either misclassify or encounter difficulties with essentially adversarial data. The data collected during each cycle serves a dual purpose: it is used to assess the performance of other models and to enhance the training of a more robust model for the subsequent round, encompassing even the most challenging scenarios encountered in real-world applications. Simultaneously, this dynamic data collection approach effectively minimizes the risk of data leakage.
Prior benchmarks have been predominantly centered around short-context tasks, while LongBench (Bai et al., 2023a) addresses the underperformance of LLMs in tasks involving long textual contexts. It encompasses a spectrum of long-text bilingual tasks in both NLU and NLG, including multi-document QA, single-document QA, and code completion. The experiments show that a performance disparity persists between smaller-scale open-source LLMs and their commercial counterparts in long-context tasks. Even though certain LLMs are trained or fine-tuned on extended-context data (e.g., GPT-3.5-Turbo-16k, ChatGLM2-6B-32k, and Vicuna-v1.5-7B-16k), their performance deteriorates significantly as the length of the context increases. To address this performance degradation, context compression techniques have been explored to enhance model performance across multiple tasks with long textual contexts. These techniques achieve significant gains, particularly for LLMs with relatively weak capabilities in such extended-context scenarios.
We separately introduce the datasets and evaluation results for the benchmarks of knowledge and reasoning evaluation.
Roughly a year following the release of SuperGLUE (Wang et al., 2019a), LLMs have achieved human-level performance, a trend that has been replicated across various benchmarks evaluating model capabilities across multiple downstream tasks. Nevertheless, when it comes to practical applications, a discernible gap remains between LLMs and college-educated humans. This observation underscores the existence of a disparity between conventional multitasking NLU and NLG benchmarks and the challenges posed by real-world, human-centric tasks (Hendrycks et al., 2021b, Zhong et al., 2023).
Human knowledge is acquired through fundamental education, online resources, and various other means. In the real world, different countries and authoritative bodies gauge human learning proficiency through standardized exams such as SAT, Chinese Gaokao, GRE, and more. While the training data for LLMs encompasses sources like Wikipedia, books, and websites, current evaluation tasks do not fully tap into the wealth of knowledge acquired by LLMs. Consequently, in an effort to narrow the gap between what can be assessed by existing benchmarks and the learning capabilities of LLMs, there has been a notable surge in subject-specific benchmarks.
Many benchmarks curate questions from well-known exams, including college entrance exams and publicly accessible qualification exams, categorizing these questions based on subject and complexity. The majority of instances within these benchmarks consist of multiple-choice questions, with accuracy serving as the primary evaluation metric. The proficiency of LLMs in various subjects can be quantitatively assessed by examining their accuracy across different domains.
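The per-subject accuracy computation described above can be sketched as follows. This is a minimal illustration, not the scoring code of any particular benchmark: the record layout `(subject, predicted_choice, gold_choice)`, the function name `accuracy_by_subject`, and the sample records are all assumptions made for the example.

```python
from collections import defaultdict


def accuracy_by_subject(records):
    """Return {subject: accuracy} for multiple-choice predictions.

    `records` is an iterable of (subject, predicted_choice, gold_choice)
    tuples; this layout is hypothetical, chosen only for illustration.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, predicted, gold in records:
        total[subject] += 1
        # Compare option labels case-insensitively ("d" counts as "D").
        if predicted.strip().upper() == gold.strip().upper():
            correct[subject] += 1
    return {subject: correct[subject] / total[subject] for subject in total}


# Made-up predictions for two subjects.
records = [
    ("physics", "A", "A"),
    ("physics", "C", "B"),
    ("law", "d", "D"),
    ("law", "B", "B"),
]
per_subject = accuracy_by_subject(records)

# Macro average: every subject counts equally, regardless of its size.
macro_avg = sum(per_subject.values()) / len(per_subject)
print(per_subject, macro_avg)
```

Reporting the per-subject dictionary alongside the macro average is what makes uneven performance across domains visible; a single pooled accuracy would hide it.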
MMLU (Hendrycks et al., 2021b) is the first to highlight the disparity between multitasking benchmarks and practical real-world tasks. It compiles data across 57 subjects spanning humanities, social sciences, STEM, and other fields, with the aim of probing the knowledge and reasoning prowess of LLMs. On the other hand, MMCU (Zeng, 2023), the Chinese counterpart to MMLU, sources its datasets from Chinese Gaokao, university-level medical examinations, China’s Unified Qualification Exam for Legal Professionals, and psychological counselor exams. Notably, MMCU offers a more limited scope in terms of professional subjects compared to its English counterpart MMLU (Hendrycks et al., 2021b).
C-Eval (Huang et al., 2023c) significantly broadens the spectrum of Chinese subjects and categorizes instances into four proficiency levels, sourced from various educational stages (junior high school, high school, university, and professional qualification exams). This dataset enables a comprehensive examination of the knowledge and reasoning capabilities of LLMs across different difficulty levels. Recognizing the inherent limitations in the reasoning abilities of LLMs, C-Eval thoughtfully identifies eight subtasks that demand robust reasoning skills, forming the challenging C-Eval Hard benchmark to facilitate in-depth reasoning evaluation. Moreover, to mitigate the risk of data leakage associated with widely accessible national college entrance exams, C-Eval strategically opts for smaller-scale, manually annotated high school practice exams. It is worth noting, however, that the quality and accuracy of these selected data may not match the standards set by national college entrance exams. M3KE (Liu et al., 2023a) takes an expansive taxonomy approach by encompassing all key subjects within the Chinese education system, spanning from elementary school to university level. Nevertheless, it’s important to recognize that various languages exhibit distinct inherent biases and linguistic nuances that extend beyond subject-specific knowledge. To provide a more comprehensive evaluation of the capabilities of LLMs in the Chinese context, CMMLU (Li et al., 2023a) goes beyond conventional subject domains. It incorporates over a dozen subjects that typically do not feature in standardized exams but are highly relevant to daily life, including areas such as Chinese food culture and Chinese driving regulations, among others.
Table 5: Benchmarks for Knowledge and Reasoning
Considering that most LLMs are trained on both Chinese and English data, AGIEval (Zhong et al., 2023) presents bilingual benchmarks to facilitate the evaluation of LLM performance across different linguistic environments. In contrast, M3Exam (Zhang et al., 2023b) broadens the scope of evaluation to nine languages, encompassing both Latin and non-Latin languages, as well as high-resource and low-resource languages.
Except for AGIEval (Zhong et al., 2023), which incorporates fill-in-the-blank questions, all the aforementioned benchmarks primarily rely on multiple-choice questions as their main evaluation format, with accuracy serving as the key performance metric. Consequently, these benchmarks tend to overlook the inclusion of open-ended questions. In contrast, LucyEval (Zeng et al., 2023b) pioneers a more diverse evaluation approach by introducing three categories of subjective questions: conceptual explanations, short answer questions, and computational questions. Additionally, LucyEval (Zeng et al., 2023b) introduces a novel evaluation metric known as GScore. For the assessment of short answer questions and conceptual explanations, GScore aggregates a variety of metrics, including BLEU-4, ROUGE-2, ChrF, and Semantic Similarity, through a weighted combination. This holistic approach offers a relatively comprehensive yet straightforward means of evaluating subjective proficiency.
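The idea of folding several metrics into one number through a weighted combination can be sketched as below. This is not LucyEval's GScore implementation: the text above does not give the weights, so the equal weights here are an assumption, the function name `weighted_score` is hypothetical, and the component scores are made-up values assumed to be precomputed and scaled to [0, 1].

```python
def weighted_score(metric_scores, weights):
    """Weighted average of per-metric scores.

    Both arguments are dicts keyed by metric name. The weights are
    normalized by their sum, so they do not need to add up to 1.
    """
    if set(metric_scores) != set(weights):
        raise ValueError("metric_scores and weights must have the same keys")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(weights[name] * metric_scores[name] for name in weights)
    return weighted_sum / total_weight


# Made-up component scores for one answer, each assumed to lie in [0, 1].
scores = {"bleu4": 0.25, "rouge2": 0.5, "chrf": 0.75, "semantic_similarity": 1.0}
# Equal weights are an assumption; the actual GScore weights are not stated above.
weights = {"bleu4": 1.0, "rouge2": 1.0, "chrf": 1.0, "semantic_similarity": 1.0}

combined = weighted_score(scores, weights)
print(combined)
```

Raising the weight of the semantic-similarity term would reward answers that paraphrase the reference, while raising the n-gram terms rewards answers that reuse its wording.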
The details of the benchmarks mentioned above can be found in Table 5.
Next, we will discuss the evaluation results on the aforementioned benchmarks in terms of the subject competence of LLMs, the size of LLMs, and the evaluation setting.
Subject Competence. Regarding average accuracy, GPT-4 consistently demonstrates top-tier performance across all benchmarks on which it has been evaluated (Zhong et al., 2023, Huang et al., 2023c, Liu et al., 2023a, Zeng et al., 2023b). However, it’s important to note that the models exhibit an uneven performance distribution across different subjects, with each model displaying strengths in specific domains (Hendrycks et al., 2021b, Li et al., 2023a). For example, when compared to text-davinci-003, ChatGPT excels notably in tasks related to geography, biology, chemistry, physics, and mathematics, where substantial external knowledge is required, while its performance remains comparable to text-davinci-003 in other cases (Zhong et al., 2023). Findings on LucyEval (Zeng et al., 2023b) reveal that SparkDesk, Baichuan-13B, ChatGLM-Std (Zeng et al., 2023a), and GPT-4 (OpenAI, 2023) exhibit superior performance in the domains of science and engineering, humanities and social sciences, medicine, and mathematics, respectively. Encouragingly, advanced LLMs have been steadily improving in areas where earlier models faced challenges. For instance, in MMLU (Hendrycks et al., 2021b), GPT-3 performs suboptimally in subjects tied to human values such as law and morality. However, in CMMLU and AGIEval (Li et al., 2023a, Zhong et al., 2023), GPT-4 showcases substantial improvement in tasks related to law and morality, even surpassing the average human performance level. This demonstrates the adaptability and progress of advanced LLMs in addressing their limitations.
It is crucial to highlight that the majority of LLMs exhibit subpar performance in subjects that demand computational proficiency, such as mathematics and physics (Li et al., 2023a, Zeng, 2023). These subjects involve intricate concepts, variable computations, and intricate reasoning. While LLMs excel in grasping the semantics of contexts and instructions, they often grapple with the comprehension of disciplinary concepts, terminology, and symbols. Despite their extensive knowledge base, LLMs encounter challenges when it comes to recalling the requisite formulas for solving specific problems. Although they are proficient in simple reasoning, they struggle to complete intricate logical chains accurately when confronted with complex issues (Zhong et al., 2023). As a result, further enhancements in understanding, knowledge, and reasoning are necessary to improve LLMs’ capabilities in computational problem-solving.
Furthermore, a noteworthy observation emerges from the analysis, suggesting that the manner in which LLMs employ knowledge may diverge significantly from human cognition. Several benchmarks have unveiled a curious phenomenon: many LLMs do not exhibit a decrease in performance across tasks of varying complexity levels (Hendrycks et al., 2021b, Huang et al., 2023c, Zhang et al., 2023b). In other words, their proficiency in tasks of lower complexity does not necessarily outshine their performance in more challenging tasks. One plausible interpretation (Zhang et al., 2023b) is that LLMs’ utilization of knowledge relies primarily on the prevalence of relevant information within their training data, rather than the inherent difficulty of the knowledge itself. In contrast, human learners often acquire the capacity for complex reasoning from foundational principles and basic knowledge. This discrepancy highlights a fundamental distinction in the learning approaches employed by LLMs and humans.
Multilingual Representation. While LLMs like GPT-4 and ChatGPT consistently exhibit a significant advantage in English language tasks, it becomes evident that LLMs trained on Chinese data outperform them on tasks in Chinese (Huang et al., 2023c). This underscores the fact that LLMs do not possess robust generalization capabilities across languages.
Their performance across various languages is not solely contingent on the volume of training data but is also influenced by language families. It is shown that LLMs tend to struggle in non-Latin languages, such as Chinese, despite the availability of substantial resources, and in low-resource languages like Javanese, even though they primarily use Latin scripts (Zhang et al., 2023b).
Notably, experiments indicate that translating prompts into English may enhance performance, which indicates that this performance variance among languages may not be rooted in reasoning ability but rather in language comprehension proficiency and knowledge captured in target languages. Hence, multilingual LLMs necessitate diverse language data sources to effectively handle tasks originating from different linguistic backgrounds.
Model Size. The number of parameters in LLMs plays a pivotal role in shaping their capabilities. Hendrycks et al. (2021b) find that accuracy increases with the parameter size of GPT-3 in social science, STEM, and other tasks. That is, a substantial and positive correlation is observed between model size and accuracy, especially for pre-trained models that do not incorporate SFT or RLHF (Hendrycks et al., 2021b, Liu et al., 2023a, Li et al., 2023a). These results highlight that even when parameter sizes are already substantial, further expansion can lead to notable enhancements in performance.
However, the number of parameters alone does not dictate the capabilities of LLMs. Smaller models, when fine-tuned with high-quality data, can achieve results competitive with those of larger counterparts. For instance, Liu et al. (2023a) demonstrate that a BELLE model fine-tuned with 2 million instructions significantly outperforms a BELLE model fine-tuned with only 0.2 million instructions. This underscores the significance of instruction tuning in enhancing model performance. It has been observed that instruction-tuned models at the 10-billion parameter level can reach performance levels comparable to ChatGPT. However, when it comes to more intricate tasks, models with fewer than 50 billion parameters exhibit substantial deviations from ChatGPT’s performance (Huang et al., 2023c). In essence, while an instruction-tuned 10-billion-parameter model may excel in simple tasks, it may still fall behind in more complex assignments that demand advanced capabilities.
Evaluation Settings. Many benchmarks commonly employ the zero-shot and few-shot experimental settings. The efficacy of the few-shot setting hinges on several variables, including the choice of backbone LLMs and the quality of provided demonstrations. In general, for LLMs without SFT, the few-shot setting often yields substantial improvements (Zhong et al., 2023). Conversely, for LLMs with SFT or those with larger parameter sizes, the gains may be limited, and in some cases the few-shot setting can even lead to a decline in model performance (Zeng, 2023, Liu et al., 2023a, Li et al., 2023a).
This observation underscores the significance of instruction tuning, which enables LLMs to better grasp the task nuances and excel in zero-shot conditions (Zhong et al., 2023). Moreover, advanced LLMs may already encompass human-centric tasks in their training data, allowing them to understand instructions effectively in zero-shot scenarios. The inclusion of demonstrations in the few-shot setting, however, can sometimes befuddle LLMs, leading to a drop in performance (Li et al., 2023a).
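To make the difference between the two settings concrete, the following minimal sketch builds a zero-shot and a few-shot prompt for the same multiple-choice item. The prompt layout, the helper names `format_question` and `build_prompt`, and the sample questions are assumptions made for illustration; each benchmark discussed above defines its own template.

```python
def format_question(question, choices, answer=None):
    """Render one multiple-choice item; leave the answer slot open if unknown."""
    lines = [question]
    for label, text in zip("ABCD", choices):
        lines.append(f"{label}. {text}")
    lines.append("Answer:" if answer is None else f"Answer: {answer}")
    return "\n".join(lines)


def build_prompt(test_item, demonstrations=()):
    """Zero-shot when `demonstrations` is empty, few-shot otherwise.

    `test_item` is (question, choices); each demonstration is
    (question, choices, answer). Both layouts are hypothetical.
    """
    blocks = [format_question(q, c, a) for q, c, a in demonstrations]
    question, choices = test_item
    blocks.append(format_question(question, choices))
    return "\n\n".join(blocks)


# Made-up test item and demonstrations.
item = ("Which planet is closest to the Sun?", ["Venus", "Mercury", "Earth", "Mars"])
demos = [
    ("What is 2 + 2?", ["3", "4", "5", "6"], "B"),
    ("Which gas do plants absorb?", ["Oxygen", "Nitrogen", "Carbon dioxide", "Helium"], "C"),
]

zero_shot = build_prompt(item)
few_shot = build_prompt(item, demos)
print(few_shot)
```

The few-shot prompt is the zero-shot prompt with solved examples prepended, which is why the quality of those demonstrations can help or hurt depending on the model.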
Recent studies have highlighted the substantial enhancement in reasoning ability that can be achieved through Chain-of-Thought (CoT) prompting (Wei et al., 2022), leading to proficient performance in relevant tasks. However, empirical evidence reveals that the application of CoT may also result in performance degradation under certain conditions (Zhong et al., 2023, Huang et al., 2023c, Li et al., 2023a).
Table 6: Benchmarks for Holistic Evaluation
These findings underscore the nuanced impact of CoT on model performance, emphasizing its effectiveness in specific scenarios while cautioning against its indiscriminate application.
As the parameter sizes of LLMs continue to expand, their capabilities across various dimensions have been continuously and significantly strengthened. This trend has led to a rising popularity of benchmarks within the community, designed to provide comprehensive evaluations of LLMs’ capabilities, which we term “benchmarks for holistic evaluation”.
These holistic evaluation benchmarks typically maintain leaderboards that allow users to rank the performance of assessed LLMs. Evaluation metrics are generally tailored to individual subtasks within the benchmark. During the evaluation process, users typically have the flexibility to select specific LLMs and tasks for evaluation, without the need to evaluate all tasks across the board. This flexibility enhances the usability of these benchmarks and aligns them with the evolving landscape of LLM capabilities. The benchmarking details referred to in this section can be found in Table 6.
22 https://huggingface.co/spaces/HuggingFaceH4/open_TextGenerationLLM_leaderboard
The Evaluation Harness framework (Gao et al., 2021) presents a cohesive and standardized approach for evaluating generative LLMs across a multitude of diverse evaluation tasks under the few-shot setting. Building on the Evaluation Harness, Huggingface spotlights four datasets (ARC, HellaSwag, MMLU, and TruthfulQA) to create a publicly accessible leaderboard. This platform allows the results of any LLM evaluated with the Evaluation Harness framework to be uploaded and shared, promoting transparency and facilitating comparative assessments within the LLM community.
In addition to conventional tasks, BIG-bench (Srivastava et al., 2022) introduces an expansive and multifaceted benchmark that serves as a rigorous evaluation of LLMs under challenging conditions. Distinct from GLUE (Wang et al., 2019b), this benchmark encompasses tasks of heightened complexity and diversity. It seeks to extend the relevance and longevity of benchmarks by including tasks that may not be swiftly resolved by advanced LLMs. By doing so, BIG-bench remains an active platform, adept at capturing emerging capabilities in LLMs in a timely and comprehensive manner.
When deploying LLMs in real-world applications, they are confronted with an array of diverse tasks. In addition to maintaining accuracy, these models must exhibit qualities such as robustness and unbiasedness in their outputs. Consequently, the recent trend in benchmark design has been a drive toward encompassing a broader range of tasks and incorporating more comprehensive evaluation metrics. In this context, it becomes imperative to conduct a holistic review of existing tasks and metrics. HELM (Liang et al., 2022), in response to this need, introduces a top-down categorization framework that spans 16 distinct scenarios and encompasses 7 metrics. These scenarios are represented by <task, domain, language> triples, spanning six user-oriented tasks. Within the framework, HELM evaluates 98 evaluable <scenario, metric> pairs, excluding those deemed impossible to measure (e.g., toxicity for categorization tasks). This comprehensive evaluation approach spans across mainstream LLMs, effectively addressing a significant gap in LLMs’ evaluation. Furthermore, HELM organizes 21 competency-specific tasks aimed at assessing the core capabilities of LLMs, including language, knowledge, and reasoning.
In the context of capability-centered evaluations for LLMs, OpenCompass extends its scope beyond language, knowledge, and reasoning to encompass comprehension and subject evaluation. Additionally, OpenCompass offers versatile experimental settings, including zero-shot, few-shot, and CoT. These provisions contribute to a more comprehensive evaluation framework, providing researchers with a broader spectrum of assessment tools and methodologies. When LLMs are applied to real-life scenarios, a meticulous assessment of the model’s toxicity, bias, and truthfulness becomes paramount to ensure that the models’ outputs align with human expectations and ethical standards. Furthermore, as LLMs’ capabilities evolve toward human capabilities, it becomes imperative to extend our evaluation to safety concerns, including potential power-seeking behaviors and self-awareness, in order to guard against unforeseen risks.
In light of these considerations, OpenEval broadens the scope of evaluation to encompass alignment and safety evaluations, complementing the evaluation of LLM capabilities. Additionally, OpenEval welcomes other evaluation organizations and users to contribute and propose new evaluation tasks, thereby strengthening the evaluation platform and promoting collaborative efforts within the research community.
Diverging from the conventional mode of fixed evaluation tasks tailored to specific capabilities, FlagEval introduces a novel framework that disentangles capabilities, tasks, and metrics. This approach empowers users to dynamically combine these elements into <capability, task, metric> triples, significantly augmenting the evaluation’s flexibility and adaptability. In addition to automated metrics, FlagEval also incorporates a human-based evaluation component. Beyond tasks amenable to automated assessment, FlagEval embraces Open QA, allowing users to submit their models to the platform for evaluation. A dedicated team of expert annotators then manually assesses the answers generated by these models, enhancing the comprehensiveness and reliability of the evaluation process.
Considering that a substantial portion of existing evaluation benchmarks relies on pre-existing datasets, there arises a concern regarding the potential for data leakage. To mitigate this issue, CLEVA (Li et al., 2023e) adopts a proactive approach by annotating a significant volume of fresh data. Additionally, it implements a sampling strategy that periodically updates the rank orders based on the outcomes of the latest evaluation rounds. This approach helps maintain the benchmark’s integrity and relevance over time while minimizing the risk of data leakage.
While most of the aforementioned benchmarks primarily evaluate the general capabilities of LLMs, it’s important to acknowledge that, in real-world scenarios, the ability to follow instructions is often of paramount importance. Unlike fixed evaluation tasks, real-world instructions can exhibit significant variability. In response to this, OpenAI Evals has been specifically crafted to evaluate LLMs’ capability in following instructions. This benchmark empowers users to submit their own instructions alongside corresponding reference answers for evaluation. OpenAI Evals employs a range of evaluation metrics, including exact and fuzzy matching, as well as containment (where an output containing a reference answer is deemed correct). Given LLMs’ sensitivity to prompts, these metrics are well-suited to account for varying forms of correct answers, ensuring a robust assessment of their instruction-following capabilities.
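The three kinds of matching mentioned above can be illustrated with a small sketch. The functions below are hypothetical and are not the OpenAI Evals API; in particular, the normalization used for fuzzy matching (lowercasing, stripping punctuation, collapsing whitespace, then testing containment in either direction) is an assumption chosen for the example.

```python
import string


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().strip()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(answer, references):
    """Correct only if the answer equals some reference verbatim."""
    return any(answer.strip() == ref.strip() for ref in references)


def contains_match(answer, references):
    """Correct if some reference appears verbatim inside the answer."""
    return any(ref.strip() in answer for ref in references)


def fuzzy_match(answer, references):
    """Correct if, after normalization, one string contains the other."""
    norm_answer = normalize(answer)
    if not norm_answer:
        return False
    for ref in references:
        norm_ref = normalize(ref)
        if norm_ref and (norm_ref in norm_answer or norm_answer in norm_ref):
            return True
    return False


references = ["Paris"]
for answer in ["Paris", "The capital is Paris.", "paris!"]:
    print(answer, exact_match(answer, references),
          contains_match(answer, references), fuzzy_match(answer, references))
```

The three checks form a rough ladder of leniency: a verbose but correct answer fails exact matching and passes containment, while a differently cased answer passes only the fuzzy check.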
There has been a rising trend in the adoption of an arena-style evaluation framework. In each round of comparisons, users are free to select and contrast the outputs of two or more LLMs for a given query, making human preference the core evaluation metric. Notably, Chatbot Arena (Zheng et al., 2023) introduces the Elo scoring mechanism to this paradigm. Initially, all models start with the same Elo score, and with each user preference comparison, the Elo score of the favored LLM increases while that of the others decreases.
Over time, as more comparisons accumulate, the relative abilities of LLMs can be discerned through their respective Elo scores.
Compared to traditional benchmarks, Chatbot Arena boasts scalability and incremental adaptability. The Elo scoring mechanism facilitates the establishment of rank orderings without necessitating a comprehensive comparison of all LLMs across all queries, streamlining the evaluation process.
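The Elo mechanism described above can be sketched with the standard Elo update rule. The starting score of 1000, the update factor `k=32`, the model names, and the sequence of preferences are assumptions made for illustration; they are not the settings used by Chatbot Arena, and ties are not handled.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, winner, loser, k=32.0):
    """Update `ratings` in place after one pairwise preference.

    The winner gains exactly what the loser gives up, so the sum of
    all ratings stays constant. An upset moves the scores more than
    an expected result does.
    """
    expected_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - expected_win)
    ratings[winner] += delta
    ratings[loser] -= delta


# All models start from the same score (1000 is an arbitrary choice).
ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}

# Hypothetical user preferences, given as (preferred, other) pairs.
comparisons = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]
for winner, loser in comparisons:
    update_elo(ratings, winner, loser)

ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking, ratings)
```

Each comparison touches only the two models involved, so a ranking emerges without every model having to answer every query.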
The ultimate goal of LLM evaluation is to ensure alignment with human values, thereby fostering the development of models that are helpful, harmless, and honest. However, as LLM capabilities rapidly advance, it becomes increasingly apparent that the existing evaluation methodologies fall short of providing a holistic understanding of these models’ capabilities and behaviors. To provide deeper insights into model behaviors and better safeguard against potential harms, we believe that LLM evaluation should evolve together with LLM capabilities, thus providing clear and actionable directions for model improvement and driving the further development of LLMs. In this section, we discuss several future directions for evaluating LLMs, including Risk Evaluation, Agent Evaluation, Dynamic Evaluation, and Enhancement-Oriented Evaluation. It is our hope that these directions will contribute to the development of more advanced LLMs that align with human values.
Current risk evaluations try to assess the behaviors of LLMs through question answering, and find that LLMs trained with RLHF tend to exhibit more dangerous tendencies, such as seeking power and wealth. This suggests that present LLMs have displayed some autonomous behaviors and awareness. However, QA-based evaluation is not enough to test LLMs precisely, especially for behaviors in a specific situation or environment. We want to know not only whether LLMs seek power, but also why and how this happens. In-depth risk evaluations of this kind could help us prevent disastrous outcomes.
As mentioned above, a specific environment is more conducive to the assessment of LLMs. Existing research on agents focuses on capabilities, i.e., executing high-order tasks in a limited environment, such as shopping online, planning for users, and carrying out routines in a virtual society (e.g., free conversation among multiple agents). However, environments designed for discovering potential risks are still lacking. This suggests that further attempts could be made to increase the diversity of agents’ environments.
Current benchmarks are usually static, not only in the content used to evaluate the target capabilities of LLMs but also in the way the testing instances are organized. This poses several challenges for evaluating LLMs. First, static evaluation datasets are easily leaked and can end up in the training data of LLMs. Detecting such contamination is time-consuming, as LLMs are usually trained on a huge amount of data. Dynamic evaluation could update evaluation data quickly enough that LLMs have no opportunity to use them as training data. Second, most current benchmarks use question-answering tasks in a multiple-choice style. An important reason for this is that clear answers are annotated for these questions, which facilitates automatic evaluation through accuracy. However, this excludes open-ended questions, which may provide insights into LLMs not seen in choice-based evaluation. Crowdsourced workers or advanced LLMs such as GPT-4 are usually used to evaluate LLMs on open-ended questions. Although advanced LLMs are more cost-efficient than humans, they can make factual mistakes and be biased by their own preferences. In dynamic evaluation, a promising alternative may be to evaluate LLMs via debate among multiple advanced LLMs. Third, static benchmarks assess LLMs on static factual knowledge. However, knowledge and information (e.g., the president of a country) can change over time in the real world. A reliable LLM should be able to update its knowledge to adapt to a changing world. This suggests that dynamic evaluation should evaluate LLMs with test data that reflect current facts and the changing world. Finally, as LLMs continue to evolve, static benchmarks quickly become outdated once LLMs approach human-level performance, suggesting that benchmarks that evolve dynamically and continuously in difficulty are desirable.
The predominant evaluation methods and benchmarks for LLMs have focused primarily on providing quantitative performance measures on specific tasks or multiple dimensions (Zhong et al., 2022; Jain et al., 2023). While the reported scores enable model comparison, the evaluations offer limited insights into LLMs. There is a need for techniques that thoroughly analyze evaluation results to reveal weaknesses, followed by directly exploring improvements to address the identified shortcomings. Furthermore, although developing models that satisfy the criteria of helpfulness, harmlessness, and honesty remains an important goal (Askell et al., 2021), comprehensive benchmarks and methods that jointly assess models across these critical dimensions for alignment with human values and provide actionable insights for further model improvements are still lacking. In summary, advancing evaluation paradigms will require an enhancement-oriented approach that not only benchmarks performance but also provides a constructive analysis of model weaknesses and clear directions for improvement.
The development pace of LLMs has been astonishing, showcasing remarkable progress across numerous tasks. However, despite ushering in a new era of artificial intelligence, our understanding of this novel form of intelligence remains relatively limited. It is crucial to delineate the boundaries of these LLMs’ capabilities, understand their performance in various domains, and explore how to harness their potential more effectively. This necessitates a comprehensive benchmarking framework to guide the direction of LLMs’ development.
This survey systematically elaborates on the core capabilities of LLMs, encompassing critical aspects like knowledge and reasoning. Furthermore, we delve into alignment evaluation and safety evaluation, including ethical concerns, biases, toxicity, and truthfulness, to ensure the safe, trustworthy and ethical application of LLMs. Simultaneously, we explore the potential applications of LLMs across diverse domains, including biology, education, law, computer science, and finance. Most importantly, we provide a range of popular benchmark evaluations to assist researchers, developers and practitioners in understanding and evaluating LLMs’ performance.
We hope that this survey will drive the development of LLM evaluation, offering clear guidance to steer the controlled advancement of these models. This will enable LLMs to better serve the community and the world, ensuring that their applications in various domains are safe, reliable, and beneficial. With eager anticipation, we embrace the future challenges of LLMs’ development and evaluation.