
Survey, Hallucination | Survey of Hallucination

  • Related Project: Private
  • Category: Paper Review
  • Date: 2023-10-31

Survey of Hallucination in Natural Language Generation

  • url: https://arxiv.org/abs/2202.03629
  • pdf: https://arxiv.org/pdf/2202.03629
  • abstract: Natural Language Generation (NLG) has improved exponentially in recent years thanks to the development of sequence-to-sequence deep learning technologies such as Transformer-based language models. This advancement has led to more fluent and coherent NLG, leading to improved development in downstream tasks such as abstractive summarization, dialogue generation and data-to-text generation. However, it is also apparent that deep learning based generation is prone to hallucinate unintended text, which degrades the system performance and fails to meet user expectations in many real-world scenarios. To address this issue, many studies have been presented in measuring and mitigating hallucinated texts, but these have never been reviewed in a comprehensive manner before. In this survey, we thus provide a broad overview of the research progress and challenges in the hallucination problem in NLG. The survey is organized into two parts: (1) a general overview of metrics, mitigation methods, and future directions; and (2) an overview of task-specific research progress on hallucinations in the following downstream tasks, namely abstractive summarization, dialogue generation, generative question answering, data-to-text generation, machine translation, and visual-language generation. This survey serves to facilitate collaborative efforts among researchers in tackling the challenge of hallucinated texts in NLG.

1 INTRODUCTION

Natural Language Generation (NLG) is one of the crucial yet challenging sub-fields of Natural Language Processing (NLP). NLG techniques are used in many downstream tasks such as summarization, dialogue generation, generative question answering (GQA), data-to-text generation, and machine translation. Recently, the rapid development of NLG has captured the imagination of many thanks to the advances in deep learning technologies, especially Transformer [189]-based models like BERT [29], BART [100], GPT-2 [149], and GPT-3 [16]. The conspicuous development of NLG tasks attracted the attention of many researchers, leading to an increased effort in the field. Alongside the advancement of NLG models, attention towards their limitations and potential risks has also increased. Some early works [71, 201] focus on the potential pitfalls of utilizing the standard likelihood maximization-based objective in training and decoding of NLG models. They discovered that such likelihood maximization approaches could result in degeneration, which refers to generated output that is bland, incoherent, or stuck in repetitive loops. Concurrently, it was discovered that NLG models often generate text that is nonsensical or unfaithful to the provided source input [85, 153, 159, 190]. Researchers started referring to such undesirable generation as hallucination [125]¹.

Hallucination in NLG is concerning because it hinders performance and raises safety concerns for real-world applications. For instance, in medical applications, a hallucinatory summary generated from a patient information form could pose a risk to the patient. Likewise, hallucinatory machine-translated instructions for a medicine may provoke a life-threatening incident. Hallucination can also lead to potential privacy violations. Carlini et al. [20] demonstrate that language models can be prompted to recover and generate sensitive personal information from the training corpus (e.g., email address, phone/fax number, and physical address). Such memorization and recovery of the training corpus is considered a form of hallucination because the model is generating text that is not “faithful” to the source input content (i.e., such private information does not exist in the source input).

Currently there are many active efforts to address hallucination for various NLG tasks. Analyzing hallucinatory content in different NLG tasks and investigating their relationship would strengthen our understanding of this phenomenon and encourage the unification of efforts from different NLG fields. However, to date, little has been done to understand hallucinations from a broader perspective that encompasses all major NLG tasks. To the best of our knowledge, existing surveys have only focused on specific tasks like abstractive summarization [76, 125] and translation [95]. Thus, in this paper, we present a survey of the research progress and challenges of the hallucination problem in NLG, and offer a comprehensive analysis of existing research on the phenomenon of hallucination in different NLG tasks, namely abstractive summarization, dialogue generation, generative question answering, data-to-text generation, and machine translation. We mainly discuss hallucination in unimodal NLG tasks that have textual input sources against which the generated text can be assessed. We also briefly summarize hallucinations in multi-modal settings such as visual-language tasks [1, 13]. This survey can provide researchers with high-level insights derived from the similarities and differences of different approaches. Furthermore, given the various stages of development in studying hallucination across tasks, the survey can assist researchers in drawing inspiration on concepts, metrics, and mitigation methods.

¹The term “hallucination” first appeared in Computer Vision (CV) in Baker and Kanade [5] and carried more positive meanings, such as super-resolution [5, 112], image inpainting [48], and image synthesizing [226]. Such hallucination is something we take advantage of rather than avoid in CV. Nevertheless, recent works have started to refer to a specific type of error as “hallucination” in image captioning [13, 159] and object detection [4, 83], which denotes non-existing objects detected or localized incorrectly at their expected position. The latter conception is similar to “hallucination” in NLG.

Organization of this Survey. The remainder of this survey is organized as follows. Section 2 ∼ Section 6 provide an overview of the hallucination problem in NLG by discussing the definition and categorization, contributors, metrics, and mitigation methods of hallucinations, respectively. The second part of our survey discusses the hallucination problem associated with specific NLG tasks: abstractive summarization in Section 7, dialogue generation in Section 8, GQA in Section 9, data-to-text generation in Section 10, machine translation in Section 11, and VL generation in Section 12. Finally, we conclude the whole survey in Section 13.

2 DEFINITIONS

In the general context outside of NLP, hallucination is a psychological term referring to a particular type of perception [51, 118]. Blom [14] defines hallucination as “a percept, experienced by a waking individual, in the absence of an appropriate stimulus from the extracorporeal world”. Simply put, a hallucination is an unreal perception that feels real. The undesired phenomenon of “NLG models generating unfaithful or nonsensical text” shares similar characteristics with such psychological hallucinations, explaining the choice of terminology. Hallucinated text gives the impression of being fluent and natural despite being unfaithful and nonsensical. It appears to be grounded in the real context provided, although it is actually hard to specify or verify the existence of such contexts. Similar to psychological hallucination, which is hard to tell apart from other “real” perceptions, hallucinated text is also hard to capture at first glance.

Within the context of NLP, the above definition of hallucination, the generated content that is nonsensical or unfaithful to the provided source content [50, 125, 140, 237], is the most inclusive and standard. However, there do exist variations in definition across NLG tasks, which will be further described in the later task-specific sections.

2.1 Categorization

Following the categorization from previous works [41, 76, 125], there are two main types of hallucinations, namely intrinsic hallucination and extrinsic hallucination. To explain the definition and categorization more intuitively, we give examples of each category of hallucinations for each NLG downstream task in Table 1.

(1) Intrinsic Hallucinations: The generated output that contradicts the source content. For instance, in the abstractive summarization task from Table 1, the generated summary “The first Ebola vaccine was approved in 2021” contradicts the source content “The first vaccine for Ebola was approved by the FDA in 2019.”.

(2) Extrinsic Hallucinations: The generated output that cannot be verified from the source content (i.e., output that can neither be supported nor contradicted by the source). For example, in the abstractive summarization task from Table 1, the information “China has already started clinical trials of the COVID-19 vaccine.” is not mentioned in the source. We can neither find evidence for the generated output from the source nor assert that it is wrong. Notably, extrinsic hallucination is not always erroneous because it could come from factually correct external information [125, 181]. Such factual hallucination can be helpful because it recalls additional background knowledge that improves the informativeness of the generated text. However, in most of the literature, extrinsic hallucination is still treated with caution because the unverifiable nature of this additional information increases the risk from a factual safety perspective.

2.2 Task Comparison

The previous subsection is about the definition and categorization of hallucination commonly shared by many NLG tasks. Yet, there are some task-specific differences.

For the abstractive summarization, data-to-text, and dialogue tasks, the main difference is in what serves as the “source” and the level of tolerance towards hallucinations. The source in abstractive summarization is the input source text that is being summarized [165], while the source in data-to-text is non-linguistic data [56, 157], and the source(s) in a dialogue system is the dialogue history and/or the external knowledge sentences. Tolerance towards hallucinations is very low in both the summarization [139] and data-to-text tasks [140, 195, 199] because it is essential to provide faithful generation. In contrast, the tolerance is relatively higher in dialogue systems because the desired characteristics are not only faithfulness but also user engagement, especially in open-domain dialogue systems [75, 78].

For the generative question answering (GQA) task, the exploration of hallucination is at its early stage, so there is no standard definition or categorization of hallucination yet. However, we can see that the GQA literature mainly focuses on “intrinsic hallucination” where the source is the world knowledge [102]. Lastly, unlike the aforementioned tasks, the categorizations of hallucinations in machine translation vary within the task. Most relevant literature agrees that translated text is considered a hallucination when the source text is completely disconnected from the translated target [95, 132, 153]. For further details, please refer to Section 11.

2.3 Terminology Clarification

Multiple terminologies are associated with the concept of hallucination. We provide clarification of the commonly used terminologies hallucination, faithfulness, and factuality to resolve any confusion. Faithfulness is defined as staying consistent and truthful to the provided source – an antonym to “hallucination.” Any work that tries to maximize faithfulness thus focuses on minimizing hallucination. For this reason, our survey includes all those works that address the faithfulness of machine-generated outputs. Factuality refers to the quality of being actual or based on fact. Depending on what serves as the “fact”, “factuality” and “faithfulness” may or may not be the same. Maynez et al. [125] differentiate “factuality” from “faithfulness” by defining the “fact” to be the world knowledge. In contrast, Dong et al. [34] use the source input as the “fact” to determine factual correctness, making “factuality” indistinguishable from “faithfulness”. In this paper, we adopt the definition from Maynez et al. [125] because we believe the distinction between source knowledge and world knowledge provides a clearer understanding.

Note that the judging criteria for what is considered faithful or hallucinated (i.e., the definition of hallucination) can differ across tasks. More details of these varying definitions can be found in the later task-specific sections.

3 CONTRIBUTORS TO HALLUCINATION IN NLG

3.1 Hallucination from Data

The main cause of hallucination from data is source-reference divergence.

This divergence happens 1) as an artifact of heuristic data collection or 2) due to the nature of some NLG tasks that inevitably contain such divergence. When a model is trained on data with source-reference(target) divergence, the model can be encouraged to generate text that is not necessarily grounded and not faithful to the provided source.

Heuristic data collection. When collecting large-scale datasets, some works heuristically select and pair real sentences or tables as the source and target [94, 207]. As a result, the target reference may contain information that cannot be supported by the source [140, 194]. For instance, when constructing WIKIBIO [94], a dataset for generating biographical notes based on the infoboxes of Wikipedia, the authors took the Wikipedia infobox as the source and the first sentence of the Wikipedia page as the target ground-truth reference. However, the first sentence of the Wikipedia article is not necessarily equivalent to the infobox in terms of the information it contains. Indeed, Dhingra et al. [30] point out that 62% of the first sentences in WIKIBIO have additional information not stated in the corresponding infobox. Such a mismatch between source and target in datasets can lead to hallucination.

Table 2. Evaluation metrics and mitigation methods for each task. *The hallucination metrics are not specifically proposed for generative question answering (GQA), but they can be adapted for that task.

Another problematic scenario is when duplicates from the dataset are not properly filtered out. It is almost impossible to check hundreds of gigabytes of text corpora manually. Lee et al. [96] show that duplicated examples from the pretraining corpus bias the model to favor generating repeats of the memorized phrases from the duplicated examples.

Innate divergence. Some NLG tasks by nature do not always have factual knowledge alignment between the source input text and the target reference, especially those that value diversity in generated output. For instance, it is acceptable for open-domain dialogue systems to respond in chit-chat style, subjective style [152], or with a relevant fact that is not necessarily present in the user input, history, or provided knowledge source – this improves the engagingness and diversity of the dialogue generation. However, researchers have discovered that this dataset characteristic leads to inevitable extrinsic hallucination.

3.2 Hallucination from Training and Inference

As discussed in the previous subsection, source-reference divergence in the dataset is one contributor to hallucination. However, Parikh et al. [140] show that the hallucination problem still occurs even when there is very little divergence in the dataset. This is because there is another contributor to hallucination – the training and modeling choices of neural models [85, 153, 159, 190].

Imperfect representation learning. The encoder has the role of comprehending and encoding input text into meaningful representations. An encoder with a defective comprehension ability could influence the degree of hallucination [140]. When encoders learn wrong correlations between different parts of the training data, it could result in erroneous generation that diverges from the input [3, 49, 103, 184].

Erroneous decoding. The decoder takes the encoded input from the encoder and generates the final target sequence. Two aspects of decoding contribute to hallucinations. First, decoders can attend to the wrong part of the encoded input source, leading to erroneous generation [184]. Such wrong association results in generation with facts mixed up between two similar entities [41, 168]. Second, the design of the decoding strategy itself can contribute to hallucinations. Dziri et al. [41] illustrate that a decoding strategy that improves generation diversity, such as top-k sampling, is positively correlated with increased hallucination. We conjecture that the “randomness” deliberately added by sampling from the top-k tokens instead of choosing the most probable token increases the unexpected nature of the generation, leading to a higher chance of containing hallucinated content.

Exposure Bias. Regardless of decoding strategy choices, the exposure bias problem [9, 151], defined as the discrepancy in decoding between training and inference time, can be another contributor to hallucination. It is common practice to train the decoder with teacher-forced maximum likelihood estimation (MLE) training, where the decoder is encouraged to predict the next token conditioned on the ground-truth prefix sequences. However, during the inference generation, the model generates the next token conditioned on the historical sequences previously generated by itself [69]. Such a discrepancy can lead to increasingly erroneous generation, especially when the target sequence gets longer.
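To make this discrepancy concrete, here is a schematic sketch (not from the survey) contrasting teacher-forced prediction with free-running generation; `model_next_token` is a hypothetical stand-in for a trained decoder, used only for illustration.

```python
# Schematic illustration of exposure bias: during training the decoder sees gold
# prefixes, while at inference it conditions on its own previous outputs.

def model_next_token(prefix):
    """Hypothetical next-token predictor; stands in for a trained decoder."""
    # Toy rule: echo the last token, or emit "<eos>" once the prefix reaches 5 tokens.
    return "<eos>" if len(prefix) >= 5 else prefix[-1]

gold = ["the", "vaccine", "was", "approved", "<eos>"]

# Teacher forcing (training): each step is conditioned on the GOLD prefix.
teacher_forced_preds = [model_next_token(gold[: t + 1]) for t in range(len(gold) - 1)]

# Free running (inference): each step is conditioned on PREVIOUS MODEL OUTPUTS,
# so an early mistake is fed back in and compounds as the sequence gets longer.
generated = ["the"]
while generated[-1] != "<eos>" and len(generated) < 10:
    generated.append(model_next_token(generated))

print("teacher-forced predictions:", teacher_forced_preds)
print("free-running generation:  ", generated)   # degenerates into repetition
```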

Parametric knowledge bias. Pre-training of models on a large corpus is known to result in the model memorizing knowledge in its parameters [121, 142, 158]. This so-called parametric knowledge helps improve the performance of downstream tasks, but also serves as another contributor to hallucinatory generation. Large pre-trained models used for downstream NLG tasks are powerful in providing generalizability and coverage, but Longpre et al. [115] have discovered that such models prioritize parametric knowledge over the provided input. In other words, models that favor generating output with their parametric knowledge instead of the information from the input source can result in the hallucination of excess information in the output.

4 METRICS MEASURING HALLUCINATION

Recently, various studies have illustrated that most conventional metrics used to measure the quality of writing are not adequate for quantifying the level of hallucination [156]. It has been shown that state-of-the-art abstractive summarization systems, evaluated with metrics such as ROUGE, BLEU, and METEOR, have hallucinated content in 25% of their generated summaries [45]. A similar phenomenon has been shown in other NLG tasks, where it has been discovered that traditional metrics have a poor correlation with human judgment in terms of the hallucination problem [30, 36, 73, 88]. Therefore, there are active research efforts to define effective metrics for quantifying hallucination. FRANK [139] surveys the faithfulness metrics for summarization and compares these metrics’ correlations with human judgments. To assess the example-level accuracy of metrics in diverse tasks, TRUE [72] reports their Area Under the ROC Curve (ROC AUC) in regard to hallucinated example detection.
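As a minimal illustration of the example-level meta-evaluation that TRUE reports, the sketch below computes ROC AUC for a hallucination metric against binary human judgments using scikit-learn; the scores and labels are made up.

```python
# Minimal sketch: meta-evaluating a hallucination metric with ROC AUC at the
# example level (the labels and metric scores below are invented for illustration).
from sklearn.metrics import roc_auc_score

human_labels = [1, 0, 1, 1, 0, 0, 1, 0]                    # 1 = judged hallucinated
metric_scores = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3]   # higher = more hallucinated

print("ROC AUC:", roc_auc_score(human_labels, metric_scores))
```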

4.1 Statistical Metric

One of the simplest approaches is to leverage lexical features (n-grams) to calculate the information overlap and contradictions between the generated and the reference texts – the higher the mismatch counts, the lower the faithfulness and thus the higher the hallucination score.
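A minimal sketch of this idea, under the simplifying assumption that unigram overlap with the source is enough: the fraction of generated tokens unsupported by the source serves as a crude hallucination score. This is illustrative only and not the exact formulation of any published metric.

```python
# Crude unigram-overlap hallucination score: the share of generated tokens that
# never appear in the source (illustrative only; not PARENT, BLEU, or ROUGE).
from collections import Counter

def unigram_hallucination_score(source: str, generated: str) -> float:
    src_counts = Counter(source.lower().split())
    gen_tokens = generated.lower().split()
    unsupported = sum(1 for tok in gen_tokens if src_counts[tok] == 0)
    return unsupported / max(len(gen_tokens), 1)

source = "The first vaccine for Ebola was approved by the FDA in 2019."
generated = "The first Ebola vaccine was approved in 2021."
print(unigram_hallucination_score(source, generated))  # "2021." is unsupported
```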

Given that many traditional metrics leverage the target text as the ground-truth reference (e.g., ROUGE, BLEU, etc.), Dhingra et al. [30] build upon this idea and propose PARENT (Precision And Recall of Entailed n-grams from the Table)², a metric which can also measure hallucinations using both the source and target text as references. Specifically, PARENT uses n-gram lexical entailment to match the generated text against both the source table and the target text. The F1-score combining the precision and recall of the entailment reflects the accuracy in the table-to-text task. The source text is additionally used because it is not guaranteed that the output target text contains the complete set of information available in the input source text.

It is common for NLG tasks to have multiple plausible outputs from the same input, which is known as one-to-many mapping [64, 173]. In practice, however, covering all the possible outputs is too expensive and almost impossible. Thus, many works simplify the hallucination evaluation setup by relying on the source text as the sole reference. Their metrics focus only on the information referred to by the input source to measure hallucinations, especially intrinsic hallucinations. For instance, Wang et al. [199] propose PARENT-T, which simplifies PARENT by only using the table content as the reference. Similarly, Knowledge F1 [168] – a variant of unigram F1 – has been proposed for knowledge-grounded dialogue tasks to measure the overlap between the model’s generation and the knowledge used to ground the dialogue during dataset collection.

Furthermore, Martindale et al. [124] propose a bag-of-vectors sentence similarity (BVSS) metric for measuring sentence adequacy in machine translation that refers only to the target text. This statistical metric helps determine whether the MT output contains a different amount of information than the translation reference.

Although simple and effective, one potential limitation of lexical matching is that it can only handle lexical information. Thus, it fails to deal with syntactic or semantic variations [166].

²Note that PARENT is a general metric like ROUGE and BLEU, not only constrained to hallucination evaluation.

4.2 Model-based Metric

Model-based metrics leverage neural models to measure the hallucination degree in the generated text. They are proposed to handle more complex syntactic and even semantic variations. The model-based metrics comprehend the source and generated texts and detect the knowledge/content mismatches. However, the neural models can be subject to errors that can propagate and adversely affect the accurate quantification of hallucination.

4.2.1 Information Extraction (IE)-based. It is not always easy to determine which part of the generated text contains the knowledge that requires verification. IE-based metrics use IE models to represent the knowledge in a simpler relational tuple format (e.g., subject, relation, object), then verify against relation tuples extracted from the source/reference. Here, the IE model is identifying and extracting the “facts” that require verification. In this way, words containing no verifiable information (e.g., stopwords, conjunctions, etc) are not included in the verification step.

For example, ground-truth reference text “Brad Pitt was born in 1963” and generated text “Brad Pitt was born in 1961” will be mapped to the relation triples (Brad Pitt, born-in, 1963) and (Brad Pitt, born-in, 1961) respectively³. The mismatch between the dates (1963≠1961) indicates that there is hallucination. One limitation associated with this approach is the potential error propagation from the IE model.
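A toy sketch of the verification step, assuming the relation triples have already been extracted by an IE model; the triples below are hand-written for illustration.

```python
# Toy verification step of an IE-based metric: compare (subject, relation, object)
# triples from the generated text against triples from the reference/source.
# In practice both sets would come from an IE model; here they are hand-written.

reference_triples = {("Brad Pitt", "born-in", "1963")}
generated_triples = {("Brad Pitt", "born-in", "1961")}

ref_facts = {(s, r): o for s, r, o in reference_triples}
mismatches = [
    (s, r, o, ref_facts[(s, r)])
    for s, r, o in generated_triples
    if (s, r) in ref_facts and ref_facts[(s, r)] != o
]
print(mismatches)  # [('Brad Pitt', 'born-in', '1961', '1963')] -> hallucination detected
```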

4.2.2 QA-based.

This approach implicitly measures the knowledge overlap or consistency between the generation and the source reference. It is based on the intuition that similar answers will be generated from the same question if the generation is factually consistent with the source reference. It has already been used to evaluate hallucinations in many tasks, such as summarization [36, 164, 191], dialogue [73], and data-to-text generation [155].

A QA-based metric that measures the faithfulness of the generated text consists of three parts: First, given a generated text, a question generation (QG) model generates a set of question-answer pairs. Second, a question answering (QA) model answers the generated questions given a ground-truth source text as the reference (containing the knowledge). Lastly, the hallucination score is computed based on the similarity of the corresponding answers.
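A hedged sketch of that three-step pipeline; `generate_qa_pairs`, `answer_question`, and `answer_similarity` are placeholder functions standing in for a QG model, a QA model, and an answer-overlap measure (e.g., token-level F1), respectively.

```python
# Sketch of a QA-based faithfulness score. The three helpers are placeholders:
# in a real system they would wrap a QG model, a QA model, and an answer-overlap measure.

def generate_qa_pairs(generated_text):
    """Placeholder QG step: return (question, answer) pairs mined from the generation."""
    return [("When was the first Ebola vaccine approved?", "2021")]

def answer_question(question, source_text):
    """Placeholder QA step: answer the question using only the source as context."""
    return "2019"

def answer_similarity(a, b):
    """Placeholder comparison: exact match here; token-level F1 is common in practice."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0

def qa_faithfulness(source_text, generated_text):
    pairs = generate_qa_pairs(generated_text)
    scores = [answer_similarity(ans, answer_question(q, source_text)) for q, ans in pairs]
    return sum(scores) / max(len(scores), 1)   # low score -> likely hallucination

print(qa_faithfulness("The first vaccine for Ebola was approved by the FDA in 2019.",
                      "The first Ebola vaccine was approved in 2021."))  # 0.0
```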

Similar to the IE-based metrics, the limitation of this approach is the potential error that might arise and propagate from either the QG model or the QA model.

4.2.3 Natural Language Inference (NLI) Metrics.

There are not many labelled datasets for hallucination detection tasks, especially in the early stages when the hallucination problem started to gain attention. As an alternative, many works leverage NLI datasets to tackle hallucinations. Note that NLI is a task that determines whether a “hypothesis” is true (entailment), false (contradiction), or undetermined (neutral) given a “premise”. These metrics are based on the idea that the source knowledge reference alone should entail the entirety of the information in a faithful, hallucination-free generation [39, 42, 45, 73, 76, 89, 93, 131, 206]. More specifically, NLI-based metrics define the hallucination/faithfulness score as the entailment probability between the source and the generated text, or as the percentage of generated outputs that are entailed by, neutral to, or contradicted by the source.
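A minimal sketch of such a score using an off-the-shelf MNLI classifier from Hugging Face Transformers; the checkpoint name `roberta-large-mnli` is an assumption, and the entailment probability is taken directly as the faithfulness score.

```python
# Sketch of an NLI-based faithfulness score:
# P(entailment | premise = source, hypothesis = generated text).
# The checkpoint is an assumed off-the-shelf MNLI model; any NLI classifier works.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

premise = "The first vaccine for Ebola was approved by the FDA in 2019."
hypothesis = "The first Ebola vaccine was approved in 2021."

inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]

# Look up the entailment class from the model config (falls back to index 2).
label_id = model.config.label2id.get("ENTAILMENT", 2)
print("faithfulness (entailment prob):", probs[label_id].item())
```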

According to Honovich et al. [73], NLI-based approaches are more robust to lexical variability than token matching approaches such as IE-based and QA-based metrics. Nevertheless, as illustrated by Falke et al. [45], off-the-shelf NLI models tend to transfer poorly to the abstractive summarization task. Thus, there is a line of research in improving and extending the NLI paradigm specifically for hallucination evaluation purposes [42, 45]. Apart from generalizability, Goyal and Durrett [63] point out the potential limitation of using sentence-level entailment models, namely their incapability to pinpoint and locate which parts of the generation are erroneous. In response, the authors propose a new dependency-level entailment and attempt to identify factual inconsistencies in a more fine-grained manner.

³This is an example from [61].

4.2.4 Faithfulness Classification Metrics. To improve upon NLI-based metrics, task-specific datasets have been constructed. Liu et al. [113] and Zhou et al. [237] construct synthetic data by automatically inserting hallucinations into training instances. Santhanam et al. [163] and Honovich et al. [73] construct new corpora for faithfulness classification in dialogue responses. They manually annotate the Wizard-of-Wikipedia dataset [32], a knowledge-grounded dialogue dataset, by judging whether each response is hallucinated.

Faithfulness-specific datasets can be better than NLI datasets because the entailment or neutral labels of NLI datasets and faithfulness are not equivalent. For example, the hypothesis “Putin is U.S. president” can be considered either neutral to or entailed by the premise “Putin is president”. However, from the faithfulness perspective, the hypothesis contains the unsupported information “U.S.”, which is deemed a hallucination.

4.2.5 LM-based Metrics.

These metrics leverage two language models (LMs) to determine whether each token is supported: an unconditional LM is trained only on the targets (ground-truth references) in the dataset, while a conditional language model LM_x is trained on both source and target data. The next token is assumed to be inconsistent with the input if the unconditional LM obtains a smaller loss than the conditional LM_x during forced-path decoding [50, 184]; in that case, the generated token is classified as hallucinatory. The ratio of hallucinated tokens to the total number of target tokens reflects the degree of hallucination.
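A schematic sketch of this token-level decision rule, assuming per-token forced-decoding losses from the two language models are already available; the loss values below are made up.

```python
# Schematic LM-based hallucination ratio. Assumes per-token forced-decoding losses
# from an unconditional LM (targets only) and a conditional LM_x (source + target)
# have already been computed; the numbers here are invented for illustration.

uncond_losses = [2.1, 3.0, 1.2, 0.8, 2.5]   # loss of the unconditional LM per target token
cond_losses   = [1.5, 2.2, 1.9, 1.6, 1.0]   # loss of the conditional LM_x per target token

# A token is flagged as hallucinatory when the unconditional LM fits it better,
# i.e. the token is predictable without looking at the source input.
hallucinated = [u < c for u, c in zip(uncond_losses, cond_losses)]
print("hallucination ratio:", sum(hallucinated) / len(hallucinated))
```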

4.3 Human Evaluation

Due to the challenging and imperfect nature of the current automatic evaluation of hallucinations in NLG, human evaluation [163, 168] is still one of the most commonly used approaches. There are two main forms of human evaluation: (1) scoring, where human annotators rate the hallucination level in a range; and (2) comparing, where human annotators compare the output texts with baselines or ground-truth references [176].

Multiple terminologies, such as faithfulness [19, 21, 50, 125, 140, 152, 174, 184, 211, 237], factual consistency [17, 18, 23, 163, 167, 210], fidelity [22], factualness⁴ [154], factuality⁴ [34], or, on the other hand, hallucination [41, 74, 114, 163, 168] and fact contradicting [136], are used in the human evaluation of hallucination to rate whether the generated text is in accord with the source input. Chen et al. [21] and Nie et al. [137] use finer-grained metrics for intrinsic and extrinsic hallucination separately. Moreover, there are some broad metrics, such as Correctness [6, 11, 103, 195], Accuracy [102, 220], and Informativeness [108], considering both missing and additional contents (extrinsic hallucinations) compared to the input source.

⁴These terminologies use the source input as the “fact”.

5 HALLUCINATION MITIGATION METHODS

Common mitigation methods can be divided into two categories, in accordance with the two main contributors to hallucination: Data-Related Methods, and Modeling and Inference Methods.

5.1 Data-Related Methods

5.1.1 Building a Faithful Dataset.

Considering that noisy data encourage hallucinations, constructing faithful datasets manually is an intuitive method, and there are various ways to build such datasets. One way is employing annotators to write clean and faithful targets from scratch given the source [54, 204], which may lack diversity [67, 140, 143]. Another way is employing annotators to rewrite real sentences on the web [140], or targets in an existing dataset [194]. Basically, the revision strategy consists of three stages: (1) phrase trimming: removing phrases unsupported by the source in the exemplar sentence; (2) decontextualization: resolving co-references and deleting phrases dependent on context; (3) syntax modification: making the purified sentences flow smoothly. Meanwhile, other works [52, 73] leverage the model to generate data and instruct annotators to label whether these outputs contain hallucinations. While this approach is typically used to build diagnostic evaluation datasets, it also has the potential to build faithful datasets.

5.1.2 Cleaning Data Automatically.

In order to alleviate semantic noise issues, another approach is to find information that is irrelevant or contradictory to the input from the existing parallel corpus and then filter or correct the data. This approach is suitable for the case where there is a low or moderate level of noise in the original data [50, 137].

Some works [114, 153, 167] have dealt with the hallucination issue at the instance level by using a score for each source-reference pair and filtering out hallucinated ones. This corpus filtering method consists of several steps: (1) measuring the quality of the training samples in terms of hallucination utilizing the metrics described above; (2) ranking these hallucination scores in descending order; (3) selecting and filtering out the untrustworthy samples at the bottom. Instance-level scores can lead to a signal loss because divergences occur at the word level; i.e., parts of the target sentence are loyal to the source input, while others diverge [154].
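A minimal sketch of this instance-level filtering recipe, assuming a hallucination score per source-reference pair has already been computed with one of the metrics above; the pairs and scores are made up.

```python
# Instance-level corpus filtering sketch: score each source-reference pair for
# hallucination, rank the pairs, and drop the most hallucinated fraction.
# Scores are assumed to come from a metric such as those in Section 4.

pairs = [
    {"source": "src1", "target": "tgt1", "hallucination_score": 0.05},
    {"source": "src2", "target": "tgt2", "hallucination_score": 0.80},
    {"source": "src3", "target": "tgt3", "hallucination_score": 0.30},
    {"source": "src4", "target": "tgt4", "hallucination_score": 0.95},
]

ranked = sorted(pairs, key=lambda p: p["hallucination_score"])   # cleanest pairs first
n_drop = int(len(ranked) * 0.25)                                 # drop the worst 25%
filtered_corpus = ranked[: len(ranked) - n_drop]
print([p["source"] for p in filtered_corpus])                    # src4 is filtered out
```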

Considering this issue, other works [37, 137] correct paired training samples, specifically the input data, according to the references. This method is mainly applied in the data-to-text task because structured data are easier to correct than utterances. It consists of two steps: (1) utilizing a model to parse the meaning representation (MR), such as attribute-value pairs, from the original human textual references; (2) using the MR extracted from the reference to correct the input MR through slot matching. This method enhances the semantic consistency between input and output without abandoning a part of the dataset.

5.1.3 Information Augmentation. It is intuitive that augmenting the inputs with external information will yield a better representation of the source, because external knowledge, explicit alignment, extra training data, etc., can improve the correlation between the source and target and help the model learn better task-related features. Consequently, a better semantic understanding helps alleviate divergence from the source. Examples of the augmented information include entity information [114], relation triples extracted from the source document [19, 74] via fact description extraction, pre-executed operation results [136], synthetic data generated through replacement or perturbation [21, 95], retrieved external knowledge [11, 46, 65, 168, 240], and retrieved similar training samples [12].

These methods enforce a stronger alignment between inputs and outputs. However, they will bring challenges due to the gap between the original source and augmented information, such as the semantic gap between an ambiguous utterance and a distinct MR of structured data, and the format discrepancy between the structured knowledge graph and natural language.

5.2 Modeling and Inference Methods

5.2.1 Architecture.

Encoder. The encoder learns to encode a variable-length sequence from input text into a fixed-length vector representation. As we mentioned above in Section 5.1.3, hallucination appears when the models lack semantic interpretation over the input. Some works have modified the encoder architecture in order to make it more compatible with input and learn a better representation. For example, Huang et al. [74] and Cao et al. [19] propose a dual encoder, consisting of a sequential document encoder and a structured graph encoder to deal with the additional knowledge.

Fig. 1. The frameworks of training methods.

Attention. The attention mechanism is an integral component of neural networks that selectively concentrates on some parts of a sequence while ignoring others based on dependencies [189]. In order to encourage the generator to pay more attention to the source, Aralikatte et al. [3] introduce a short circuit from the input document to the vocabulary distribution via a source-conditioned bias. Krishna et al. [88] employ sparse attention to improve the model's handling of long-range dependencies, in the hope of modeling more retrieved documents and thus mitigating hallucination in the answer. Wu et al. [210] adopt inductive attention, which removes potentially uninformative attention links by injecting pre-established structural information to avoid hallucinations.

Decoder. The decoder is responsible for generating the final output in natural language given the input representations [189]. Several works have modified the decoder structure to mitigate hallucination, such as the multi-branch decoder [154], the uncertainty-aware decoder [211], a dual decoder consisting of a sequential decoder and a tree-based decoder [170], and constrained decoders with lexical or structural limitations [6]. Based on the observation that the “randomness” of sampling-based decoding, especially near the end of sentences, can lead to hallucination, [98] propose to iteratively reduce this “randomness” over time. These decoders improve the probability of faithful tokens while reducing the probability of hallucinatory ones during inference, either by modeling the implicit discrepancies and dependencies between tokens or by imposing explicit constraints. Since such decoders may have more difficulty generating fluent or diverse text, a balance has to be struck between faithfulness and fluency/diversity.

5.2.2 Training.

Planning/Sketching. Planning is a common method to control and restrict what the model generates by informing the content and its order [147]. Planning can be a separate step in a two-step generator [21, 114, 148, 174, 195], which is prone to progressive amplification of the hallucination problem, or it can be injected into an end-to-end model during generation [216]. Sketching serves a similar function to planning and can also be adopted for handling hallucinations [195]. The difference is that the skeleton is treated as a part of the final generated text. While providing more controllability, such methods also need to strike a balance between faithfulness and diversity.

Reinforcement Learning (RL). As pointed out by Ranzato et al. [151], word-level maximum likelihood training leads to the problem of exposure bias. Some works [74, 87, 108, 128, 174] adopt RL to solve the hallucination problem, utilizing different rewards to optimize the model. The purpose of RL is for the agent to learn an optimal policy that maximizes the reward accumulated from the environment [188]. The reward function is critical to RL and, if properly designed, it can provide training signals that help the model accomplish its goal of hallucination reduction. For example, Li et al. [108] propose a slot consistency reward, which is the cardinality of the difference between the generated template and the slot-value pairs extracted from the input dialogue act (a toy sketch of such a reward is given below). Improving slot consistency can help reduce the hallucination phenomenon of missing or misplaced slot values in generated templates. Mesgar et al. [128] attain a persona-consistency sub-reward via an NLI model to reduce hallucinations in personal facts. Huang et al. [74] use a combination of ROUGE and the multiple-choice cloze score as the reward function to improve the faithfulness of summarization outputs. The cloze score is similar to the QA-based metric, measuring how well a QA model can answer questions by reading the generated summary (as context), where the questions are automatically constructed from the reference summary. As the above examples show, some RL reward functions for mitigating hallucination are inspired by existing automatic evaluation metrics. Although RL is challenging to learn and converge due to the extremely large search space, this method has the potential to obtain the best policy for the task without an oracle.

Multi-task Learning. Multi-task learning is also utilized for handling hallucinations in different NLG tasks. In this training paradigm, a shared model is trained on multiple tasks simultaneously to learn the commonalities of the tasks. The hallucination problem may be derived from the reliance of the training process on a single dataset, leading the model to fail to learn the actual task features. By adding proper additional tasks along with the target task during training, the model can suffer less from the hallucination problem. For example, Weng et al. [205] and Garg et al. [55] incorporate a word alignment task into the translation model to improve the alignment accuracy between the input and output, and thus faithfulness. Li et al. [103] combine an entailment task with abstractive summarization to encourage models to generate summaries entailed by and faithful to the source. Li et al. [102] incorporate rationale extraction with answer generation, which allows more confident and correct answers and reduces the hallucination problem. The multi-task approach has several advantages, such as improved data efficiency, reduced overfitting, and faster learning. However, it is crucial to choose which tasks should be learned jointly, and learning multiple tasks simultaneously presents new challenges of design and optimization [25].
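A toy sketch of a slot-consistency style reward as described above; the slot extraction is a hand-written stand-in, and the reward is simply the negated count of mismatched slot-value pairs.

```python
# Toy slot-consistency reward in the spirit of the description above: penalize the
# number of slot-value pairs that differ between the input dialogue act and the
# generated template. The slot extractor is a hand-written stand-in.

def extract_slots(dialogue_act):
    """Stand-in extractor: assumes 'slot=value' pairs separated by semicolons."""
    return {tuple(kv.split("=")) for kv in dialogue_act.split(";") if "=" in kv}

input_act = "inform;name=Hotel Alpha;area=centre;price=cheap"
generated = "inform;name=Hotel Alpha;area=north;price=cheap"   # 'area' is hallucinated

mismatched = extract_slots(input_act) ^ extract_slots(generated)  # symmetric difference
reward = -len(mismatched)        # fewer mismatches -> higher reward during RL training
print(reward)                    # -2: ('area','centre') missing, ('area','north') added
```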

Controllable Generation. Current works treat the hallucination level as a controllable attribute in order to keep the hallucination in outputs at a low level. Controllable generation techniques such as controlled re-sampling [152] and control codes that can be provided manually [50, 152, 210] or predicted automatically [210] are leveraged to improve faithfulness. This method may require annotated datasets for training. Considering that hallucination is not necessarily harmful and may bring some benefits, controllable methods can be further adapted to change the degree of hallucination to meet the demands of different real-world applications.

Other general training methods such as regularization [82, 95, 132] and loss reconstruction [107, 193, 199] have also been proposed to tackle the hallucination problem.

5.2.3 Post-Processing.

Post-processing methods can correct hallucinations in the output, and this standalone task requires less training data. Especially for noisy datasets where a large proportion of the ground truth references suffer from hallucinations, modeling correction is a competitive choice to handle the hallucination problem [21]. Cao et al. [17], Chen et al. [21], Dong et al. [34], and Dziri et al. [41] follow a generate-then-refine strategy. While the post-processing correction step tends to result in ungrammatical texts, this method allows researchers to utilise SOTA models which perform best in respect of other attributes, such as fluency, and then correct the results specifically for faithfulness by using small amounts of automatically generated training data.

6 FUTURE DIRECTIONS

Many studies have been conducted to tackle the hallucination problem in NLG and its downstream tasks. As mentioned above, we have discussed common metrics and mitigation methods to advance research in these fields. From a broader perspective, we wish to point out open challenges and potential directions in regard to metrics and mitigation methods.

6.1 Future Directions in Metrics Design

Fine-grained Metrics. Most of the existing hallucination metrics measure intrinsic and extrinsic hallucinations together as a unified metric. However, it is common for a single generation to have both types and a number of hallucinatory sub-strings. Fine-grained metrics that can distinguish between the two types of hallucinations will provide richer insight to researchers.

In order to implement a fine-grained metric, the first step would be to correctly identify the exact location of the hallucinatory sub-strings. However, some metrics, such as those that are QA-based, cannot identify the individual hallucinatory sub-strings. Improvements in this aspect would help improve the quality and explainability of the metrics. The next step would be to categorize the detected hallucinatory sub-strings. A hallucinatory sub-string is intrinsic if it is wrong or nonsensical, and extrinsic if it does not exist in the source context. Future work that explores an automatic method of categorization would be beneficial.

Fact-Checking. The factual verification of extrinsic hallucinations requires fact-checking against world knowledge, which can be time consuming and laborious. Leveraging an automatic fact-checking system for extrinsic hallucination verification is, thus, other future work that requires attention. Fact-checking consists of the knowledge evidence selection and claim verification sub-tasks, and the following are the remaining challenges associated with each sub-task.

The main research problem associated with the evidence selection sub-task is how to retrieve evidence from world knowledge. Most of the literature leverages Wikipedia as the knowledge source [97, 182, 221], which is only a small part of world knowledge. Other literature attempts to use the whole web as the knowledge source [44, 123]. However, this method leads to another research problem – how to ensure the trustworthiness of the information we use from the web [58]. Source-level methods that leverage the meta-information of the web source (e.g., web traffic, PageRank, or URL structure) have been proposed to deal with this trustworthiness issue [7, 144, 145]. Addressing the aforementioned issues to allow evidence selection against world knowledge will be an important future research direction.

For the verification subtask, verification models perform relatively well if given correct evidence [99]. However, it has been shown that verification models are prone to adversarial attacks and are not robust to negation, numerical or comparative words [183]. Improving this weakness of verification models would also be crucial because the factuality of a sentence can easily be changed by small word changes (i.e., changes in negations, numbers, and entities).

Generalization. Although we can see that the source and output text of different tasks are in various forms, investigating their relationship and common ground and proposing general metrics to evaluate hallucinations are worth exploring. Task-agnostic metrics with cross-domain robustness could help the research community to build a unified benchmark. It is also important and meaningful to build open-source platforms to collaborate and standardize the evaluation metrics for NLG tasks.

Incorporation of Human Cognitive Perspective. A good automatic metric should correlate with human evaluation. Humans are sensitive to different types of information. For instance, proper nouns are usually more important than pronouns in the generated text. Mistakes concerning named entities are striking to human users, but automatic metrics treat them equally if not properly designed. In order to address this issue, new metrics should be designed from the human cognitive perspective. The human ability to recognize salient information and filter the rest is evident in scenarios where the most important facts need to be determined and assessed. For instance, when signing an agreement, a prospective employee naturally skims the document to look at the entries with numbers first. In this way, humans classify what they believe is crucial.

Automatic check-worthy detection has the potential to be applied to improve the correlation with human judgement. Implementing the automatic human-like judgment mentioned above can further mitigate hallucination and improve NLG systems.

6.2 Future Directions in Mitigation Methods

General and robust data pre-processing approaches. Since the data format varies between downstream tasks, there is still a gap for data processing methods between tasks, and currently, no universal method is effective for all NLG tasks [101]. Data pre-processing might result in grammatical errors or semantic transformation between the original and processed data, which can negatively affect the performance of generation. Therefore, we believe that general and robust data pre-processing methods can help mitigate the hallucinations in NLG.

Hallucinations in numerals. Most existing mitigation methods do not focus on the hallucination of numerals. However, the correctness of numerals in generated text, such as dates, quantities, and scalars, is important for readers [180, 230, 234]. For example, given the source document “The optimal oxygen saturation (SpO2) in adults with COVID-19 who are receiving supplemental oxygen is unknown. However, a target SpO2 of 92% to 96% seems logical, considering that indirect evidence from patients without COVID-19 suggests that an SpO2 of <92% or >96% may be harmful.⁵”, the summary “The target oxygen saturation range for patients with COVID-19 is 82–86%.” includes wrong numbers, which could be fatal. Currently, some works [137, 180, 230] point out that using commonsense knowledge can help obtain better numeral representations, and Zhao et al. [234] alleviate numeral hallucinations by re-ranking candidate summaries based on the verification score of quantity entities. Therefore, we believe that explicitly modeling numerals to mitigate hallucinations is a potential direction.
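A minimal sketch of a quantity-verification signal that could be used to re-rank candidate summaries: extract numbers with a regular expression and count those absent from the source. The regex and scoring are illustrative assumptions, not the method of Zhao et al. [234].

```python
# Illustrative quantity check for re-ranking candidate summaries: numbers that do
# not appear in the source are treated as suspect (a crude regex-based stand-in).
import re

def numbers(text):
    return set(re.findall(r"\d+(?:\.\d+)?", text))

source = ("The optimal oxygen saturation in adults with COVID-19 is unknown; "
          "a target SpO2 of 92% to 96% seems logical, since <92% or >96% may be harmful.")
candidates = [
    "The target oxygen saturation range for patients with COVID-19 is 82-86%.",
    "The target oxygen saturation range for patients with COVID-19 is 92-96%.",
]

src_nums = numbers(source)
for cand in candidates:
    unsupported = numbers(cand) - src_nums
    print(cand, "-> unsupported numbers:", unsupported)
# The first candidate introduces 82 and 86, which are absent from the source.
```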

Extrinsic Hallucination Mitigation. Though many works on mitigating hallucinations have been published, most do not distinguish between intrinsic and extrinsic hallucination. Moreover, the main research focus has been on dealing with intrinsic hallucination, while extrinsic hallucination has been somewhat overlooked as it is more challenging to reduce [76]. Therefore, we believe it is worth exploring different mitigation methods for intrinsic and extrinsic hallucinations, and relevant methods in fact-checking can be potentially used for this purpose.

Hallucination in long text. Many tasks in NLG require the model to process long input texts, such as multi-document summarization and generative question answering. We think adapting existing approaches to a Longformer [8]-based model could help encode long inputs. Meanwhile, some dialogue systems need to generate long output text, in which later parts may contradict earlier generations. Therefore, reducing self-contradiction is also an important future direction.

Reasoning. Misunderstanding facts in the source context leads to intrinsic hallucination and errors. Helping models understand the facts correctly requires reasoning over the input table or text. Moreover, if the generated text can be reasoned backwards to the source, we can assume it is faithful. There are some reasoning works in the area of dialogue [26, 57, 198], but few in reducing hallucinations. Moreover, tasks with quantities, such as logical table-to-text generation, require numerical reasoning. Therefore, adding reasoning ability to hallucination mitigation methods is also an interesting future direction.

⁵https://www.covid19treatmentguidelines.nih.gov/management/critical-care/oxygenation-and-ventilation/

Controllability. Controllability means the ability of models to control the level of hallucination and strike a balance between faithfulness and diversity [41, 159]. As mentioned in Section 3, it is acceptable for chit-chat models to generate a certain level of hallucinatory content as long as it is factual. Meanwhile, for the abstractive summarization task, there is no agreement in the research community about whether factual hallucinations are desirable or not [125]. Therefore, we believe controllability merits attention when exploring hallucination mitigation methods.

7 HALLUCINATION IN ABSTRACTIVE SUMMARIZATION

Abstractive summarization aims to extract essential information from source documents and to generate short, concise, and readable summaries [222]. Neural networks have achieved remarkable results on abstractive summarization. However, Maynez et al. [125] observe that neural abstractive summarization models are likely to generate hallucinatory content that is unfaithful to the source document. Falke et al. [45] analyze three recent abstractive summarization systems and show that 25% of the summaries generated from state-of-the-art models have hallucinated content. In addition, Zhou et al. [237] mention that even if a summary contains a large amount of hallucinatory content, it can achieve a high ROUGE [109] score. This has encouraged researchers to actively devise ways to improve the evaluation of abstractive summarization, especially from the hallucination perspective. In this section, we review the current progress in automatic evaluation and the mitigation of hallucination, and list the remaining challenges for future work. In addition, it is worth mentioning that researchers have used various terms to describe the hallucination phenomenon, such as faithfulness, factual errors, and factual consistency, and we will use the original terms from their papers in the remainder of this section.

7.1 Hallucination Definition in Abstractive Summarization

The definition of hallucination in abstractive summarization follows that in Section 2. Specifically, we adopt the definition from [125]: given a document and its abstractive summary, a summary is hallucinated if it has any spans not supported by the input document. Once again, intrinsic hallucination refers to output content that contradicts the source, while extrinsic hallucination refers to output content that the source cannot verify. For instance, in Table 3, given the input article shown in the caption, an example of intrinsic hallucination is “The Ebola vaccine was rejected by the FDA in 2019,” because this statement contradicts the given content “The first vaccine for Ebola was approved by the FDA in 2019 in the US”. An example of extrinsic hallucination is “China has already started clinical trials of the COVID-19 vaccine,” because this statement is not mentioned in the given content; we can neither find evidence for it in the input article nor assert that it is wrong.

Pagnoni et al. [139] define fine-grained types of factual errors in summaries. As mentioned in Section 2.3, since the “fact” here refers to source knowledge, “factual error” can be treated as hallucination, and we can adopt this classification as a sub-type of hallucination. They establish three categories: semantic frame errors, discourse errors, and content verifiability errors.

7.2 Hallucination Metrics in Abstractive Summarization

Existing metrics for hallucination in abstractive summarization are mainly model-based. Following [76], we divide the hallucination metrics into two categories: (1) unsupervised metrics and (2) semi-supervised metrics. Note that existing hallucination metrics evaluate both intrinsic and extrinsic hallucinations together in one metric because it is difficult to automatically distinguish between them.

7.2.1 Unsupervised Metrics.

Given that hallucination is a newly emerging problem, there are only a few hallucination-related datasets. Therefore, researchers have proposed to adopt other datasets to build unsupervised hallucination metrics. There are three types of such unsupervised metrics: (1) information extraction (IE)-based metrics, (2) natural language inferencing (NLI)-based metrics, (3) question answering (QA)-based metrics.

IE-based Metrics. As mentioned in Section 4, IE-based metrics leverage IE models to extract knowledge as relation tuples (subject, relation, object) from both the generation and knowledge source to analyze the factual accuracy of the generation [61]. However, IE models are not 100% reliable yet (making errors in the identification of the relation tuples). Therefore, Nan et al. [134] propose an entity-based metric relying on the Named-Entity Recognition model, which is relatively more robust. Their metric builds on the assumption that there will be a different set of named entities in the gold and generated summary if there exists hallucination.
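A minimal sketch of this entity-overlap idea using spaCy; the model name `en_core_web_sm` and the precision-style score are assumptions made for illustration.

```python
# Sketch of an entity-based hallucination check: flag named entities in the summary
# that never appear in the source. spaCy's small English model is assumed installed.
import spacy

nlp = spacy.load("en_core_web_sm")

def entity_precision(source: str, summary: str) -> float:
    src_ents = {ent.text.lower() for ent in nlp(source).ents}
    sum_ents = {ent.text.lower() for ent in nlp(summary).ents}
    if not sum_ents:
        return 1.0                                   # nothing to verify
    supported = sum(1 for e in sum_ents if e in src_ents)
    return supported / len(sum_ents)                 # low value -> likely hallucination

source = "The first vaccine for Ebola was approved by the FDA in 2019."
summary = "The first Ebola vaccine was approved by the FDA in 2021."
print(entity_precision(source, summary))
```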

NLI-based Metrics. As mentioned in Section 4, an NLI model (textual entailment model) can be utilized to measure hallucination based on the assumption that a faithful summary will be entailed by the gold source. However, Falke et al. [45] discover that models trained on NLI datasets cannot transfer well to abstractive summarization tasks, degrading the reliability of NLI-based hallucination metrics. To improve NLI models for hallucination evaluation, they release their collected annotations as additional test data. Other efforts have also been made to further improve NLI models. Mishra et al. [131] find that the low performance of NLI-based metrics is mainly caused by the premises in NLI datasets being shorter than the source documents in abstractive summarization. Thus, the authors propose to automatically convert multiple-choice reading comprehension datasets into long-premise NLI datasets. The results indicate that long-premise NLI datasets help the model achieve higher performance than the original NLI datasets. In addition, Laban et al. [93] introduce a simple but efficient method called SUMMAC-Conv by applying NLI models to sentence units segmented from documents. The performance of their model is better than applying NLI models to the whole document.

QA-based Metrics. QA-based metrics measure the knowledge overlap or consistency between summaries and the source documents based on the intuition that QA models will produce similar answers if the summaries are factually consistent with the source documents. QA-based metrics such as FEQA [36], QAGS [191], and QuestEval [164] follow three steps to obtain a final score: (1) a QG model generates questions from the summaries, (2) a QA model obtains answers from the source documents, and (3) a score is calculated by comparing the set of answers from the source documents with the set of answers from the summaries. The results show that these reference-free metrics have substantially higher correlations with human judgments of faithfulness than the baseline metrics. Gabriel et al. [52] further analyze FEQA and find that the effectiveness of QA-based metrics depends on the question. They also provide a meta-evaluation framework that includes QA metrics.

7.2.2 Semi-Supervised Metrics.

Semi-supervised metrics are trained on synthetic data generated from summarization datasets. Trained on these task-specific corpora, models can judge whether the generated summaries are hallucinatory. Kryscinski et al. [89] propose a weakly supervised model named FactCC for evaluating factual consistency. The model is trained jointly for three tasks: (1) checking whether the synthetic sentences remain factually consistent, (2) extracting supporting spans in the source documents, and (3) extracting inconsistent spans in the summaries, if any exist. They transfer this model to check whether the summaries generated from summarization models are factually consistent. Results show that the performance of their FactCC model surpasses the classifiers trained on the MNLI or FEVER datasets. Zhou et al. [237] introduce a method to fine-tune a pre-trained language model on synthetic data with automatically inserted hallucinations in order to detect the hallucinatory content in summaries. The model can classify whether spans in the machine-generated summaries are faithful to the article. This method shows higher correlations with human factual consistency evaluation than the baselines.

7.3 Hallucination Mitigation in Abstractive Summarization

Recently, many approaches have been proposed to reduce the hallucination phenomenon in abstractive summarization.

7.3.1 Architecture Method.

Seq-to-seq [178] models are widely used and achieve state-of-the-art performance in abstractive summarization. Researchers have made modifications to the architecture design of the seq-to-seq models to reduce hallucinated content in the summaries. We describe various efforts made to improve the encoder, decoder, or both the encoder and decoder of the seq-to-seq models.

Encoder. Zhu et al. [240] propose to use an explicit graph neural network (GNN) to encode the fact tuples extracted from source documents. In addition to an explicit graph encoder, Huang et al. [74] further design a multiple-choice cloze-test reward to encourage the model to better understand entity interactions. Moreover, Gunel et al. [65] use external knowledge from Wikipedia to build knowledge embeddings, which their results show improves factual consistency.

Decoder. Song et al. [170] present the incorporation of a sequential decoder with a tree-based decoder to generate a summary sentence and its syntactic parse. This joint generation is performed to improve faithfulness. Aralikatte et al. [3] introduce the Focus Attention Mechanism, which encourages decoders to generate tokens that are similar or topical to the source documents. The results on the BBC extreme summarization task show that models augmented with the Focus Attention Mechanism generate more faithful summaries.

Encoder-decoder. Cao et al. [19] extract fact descriptions from the source text and apply a dual-attention seq-to-seq framework to force the summaries to be conditioned on both source documents and the extracted fact descriptions. Li et al. [103] propose an entailment-aware encoder and decoder with multi-task learning which incorporates the entailment knowledge into abstractive summarization models.

7.3.2 Training Method.

Aside from architecture modification, some works improve the training approach to reduce hallucination. Cao and Wang [18] introduce a contrastive learning method to train summarization models. The positive training data are reference summaries, while the negative training data are automatically generated hallucinatory summaries, and the contrastive learning system is trained to distinguish between them. In the dialogue summarization field, Tang et al. [179] propose another contrastive fine-tuning strategy, named CONFIT, that improves the factual consistency and overall quality of summaries.
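
The training signal behind such contrastive approaches can be sketched as a generic margin objective. This is only an illustrative stand-in, assuming some scoring model assigns a faithfulness score to each summary; it is not the exact loss used by Cao and Wang [18] or CONFIT.

```python
import torch

def faithfulness_contrastive_loss(
    pos_scores: torch.Tensor,  # scores for reference (faithful) summaries
    neg_scores: torch.Tensor,  # scores for automatically corrupted, hallucinatory summaries
    margin: float = 1.0,
) -> torch.Tensor:
    """Generic margin-based contrastive objective: push faithful summaries to
    score higher than their hallucinated counterparts by at least `margin`."""
    return torch.clamp(margin - (pos_scores - neg_scores), min=0.0).mean()
```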

7.3.3 Post-Processing Method.

Some works carry out post-editing to reduce hallucination in the model-generated summaries, which are viewed as draft summaries. Dong et al. [34] propose SpanFact, a pair of factual correction models that use knowledge learned from QA models to correct the spans in the generated summaries. Similar to SpanFact, Cao et al. [17] introduce a post-editing corrector module to identify and correct hallucinatory content in generated summaries. The corrector module is trained on synthetic data created by applying a series of heuristic transformations to reference summaries. Zhao et al. [234] present HERMAN, a system that learns to recognize quantities (dates, amounts of money, etc.) in the generated summary and verify their factual consistency with the source text. According to the quantity hallucination score, the system chooses, from the candidate summaries, the one whose quantity terms are best supported by the source text. Chen et al. [21] introduce a contrast candidate generation and selection system for post-processing. The contrast candidate generation model replaces the named entities in the generated summaries with ones present in the source documents, and the contrast candidate selection model selects the best candidate as the final output summary.

7.4 Future Directions in Abstractive Summarization

Factual Hallucination Evaluation. Factual hallucinations contain information that is not found in the source content but is nonetheless factually correct. In the summarization task, this kind of hallucination could lead to better summaries. However, there is little work focused on evaluating factual hallucination explicitly. Fact-checking approaches could potentially be used in this regard.

Extrinsic Hallucination Mitigation. There has been little research on extrinsic hallucinations as it is more challenging to detect and mitigate content based on world knowledge. We believe it is worth exploring extrinsic hallucination in terms of evaluation metrics and mitigation methods.

Hallucination in Dialogue Summarization. In conversational data, the discourse relations between utterances and the co-references between speakers are more complicated than in, say, news articles. For example, Zhong et al. [235] show that 74% of samples in the QMSum dataset contain inconsistent facts. We believe exploring the hallucination issue in dialogue summarization is an important and special component of research into hallucination in abstractive summarization.

8 HALLUCINATION IN DIALOGUE GENERATION

Dialogue generation is an NLG task that automatically generates responses according to user utterances. The generated responses are required to be fluent, coherent, and consistent with the dialogue history. The dialogue generation task can be divided into two sub-tasks: (1) task-oriented dialogue generation; (2) open-domain dialogue generation. A task-oriented dialogue system aims to complete a certain task according to a user query in a specific domain, such as restaurant booking, hotel recommendation, and calendar checking. Meanwhile, an open-domain dialogue system aims to establish a multi-turn, long-term conversation with users while providing the users with an engaging experience.

8.1 Hallucination Definition in Dialogue Generation

The hallucination problem also exists in the dialogue generation task. It is important to note that a dialogue system is expected either to provide the user with the required information or to provide an engaging response without repeating utterances from the dialogue history. Thus, the tolerance for producing proper “hallucination” from the dialogue history is relatively higher.

The definition of hallucination in this task can be adopted from the general definition as follows: (1) Intrinsic hallucination: the generated response is contradictory to the dialogue history or the external knowledge sentences. In the examples of intrinsic hallucination shown in Table 1, we can verify that the output contradicts the inputs: In one example, the input is a “moderate” price range, but the model mistakenly generates a sentence with a “high” price range. In another case, the confusion of the names “Roger Federer” and “Rafael Nadal” causes the output generation of “Roger Nadal”. (2) Extrinsic hallucination: the generated response is hard to verify with the dialogue history or the external knowledge sentences. Responses with extrinsic hallucination are impossible to verify with the given inputs. “Pickwick hotel” might be “in san diego”, and Djokovic may have been “in the top ten singles players of the world”. However, we do not have enough information to check the truth of these statements.

In the following sections, the hallucination problem in open-domain and task-oriented dialogue generation tasks will be discussed separately according to their natures.

8.2 Open-domain Dialogue Generation

While the term “hallucination” seems to have newly emerged in the NLP field, a related behavior, “inconsistency”, of neural models has been widely discussed. This behavior has been pointed out as a shortcoming of generation-based approaches for open-domain chatbots [75, 117, 160]. Two possible types of inconsistency occur in open-domain dialogue generation: (1) inconsistency among the system utterances, such as when the system contradicts its previous utterance; (2) inconsistency with an external source, such as factually incorrect utterances. Whereas the first type is described using the term “inconsistency” [106, 202, 225] or “incoherence” [10, 40], some have recently started to call the second type “hallucination” [130, 161]. Self-inconsistency can be considered as an intrinsic hallucination problem, while the external inconsistency involves both intrinsic and extrinsic hallucinations, depending on the reference source.

As mentioned earlier, a certain level of hallucination may be acceptable in open-domain chit-chat as long as it does not involve severe factual issues. Moreover, it is almost impossible to verify factual correctness since the system usually lacks a connection to external resources. With the introduction of knowledge-grounded dialogue tasks [32, 238], which provide an external reference, however, there has been more active discussion of hallucination in open-domain dialogue generation.

8.2.1 Self-Consistency.

In end-to-end generative open-domain dialogue systems, the inconsistency among system utterances has been pointed out as the bottleneck to human-level performance [190]. We often observe an inconsistency in the answers to semantically similar yet not identical questions. For example, a system may answer the questions “What is your name?” and “May I ask your name?” with different responses. Persona consistency has been the center of attention [104, 228], and it is one of the most obvious cases of self-contradiction regarding the character of the dialogue system. “Persona” is defined as the character that a dialogue system plays during a conversation, and can be composed of an identity, language behavior, and an interaction style [104]. While some works have set their objective as teaching models to utilize speaker-level embeddings [104, 120], others condition generation on a set of descriptions about a persona, which we will discuss in detail in the next section.

8.2.2 External Consistency.

Besides self-consistency, an open-domain dialogue system should also generate persona-consistent and informative responses that correspond to user utterances, so as to further engage the user during conversation. In this process, an external resource containing explicit persona information or world knowledge is introduced into the system to assist the model generation process.

The PersonaChat datasets [31, 228] have accelerated research into persona consistency [68, 91, 126, 208, 219, 224, 232]. In the PersonaChat datasets, each conversation has persona descriptions such as “I like to ski” or “I am a high school teacher” attached. By conditioning the response generation on the persona description, a chit-chat model is expected to acquire the ability to generate more persona-consistent responses. Lately, the application of NLI methods [106, 169] and reinforcement learning frameworks [128] has been investigated. Although these methods conditioned on the PersonaChat datasets have been successful, further investigation of approaches that do not rely on a given set of persona descriptions is necessary because such descriptions are not always available, and covering every aspect of a persona with them is impossible.

In addition to PersonaChat-related research, the knowledge-grounded dialogue (KGD) task in the open-domain requires the model to generate informative responses with the help of an external knowledge graph (KG) or knowledge corpus [32, 238]. Hallucination in conversations, which is also considered as a factual consistency problem, has raised much research interest recently [41, 152, 163, 168]. Here, we continue to split the hallucination problem in the KGD task into intrinsic hallucination and extrinsic hallucination. Most of the KGD works tackle the hallucination problem when responses contain information that contradicts (intrinsic) or cannot be found in the provided knowledge input (extrinsic). Since world knowledge is enormous and ever-changing, the extrinsic hallucination may be factual but hard to verify. Dziri et al. [41] further adopt the same definition of hallucination as mentioned above to the knowledge graph-grounded dialogue task, where intrinsic hallucination indicates the case of misusing either the subject or object of the knowledge triple; and extrinsic hallucination indicates that there is no corresponding valid knowledge triple in the gold reference knowledge. Recently, there have been some attempts to generate informative responses without explicit knowledge inputs, but with the help of the implicit knowledge inside large pre-trained language models instead [217, 239] during the inference time. Under this setting, the study of extrinsic hallucination is of great value but still poorly investigated.

8.2.3 Hallucination Metrics.

For generation-based dialogue systems, especially open-domain chatbots, the hallucination evaluation method remains an open problem [160]. As of now, there is no standard metric. Therefore, chatbots are usually evaluated by humans on factual consistency or factual correctness [163, 210]. We also introduce some automatic statistical and model-based metrics as a reference, which will be described in more detail below.

Variants of F1 Metrics. Knowledge F1 (KF1) measures the overlap between the generated responses and the gold knowledge sentences to which the human referred for conversation during dataset collection [168]. KF1 attempts to capture whether a model can generate knowledgeable responses by correctly utilizing the relevant knowledge. KF1 is only available for datasets with labeled ground-truth knowledge. Shuster et al. [168] further propose Rare F1 (RF1), which only considers the infrequent words in the dataset when calculating F1 in order to avoid the influence of common uni-grams. The authors define an infrequent word as being in the lower half of the cumulative frequency distribution of the reference corpus.
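
A rough sketch of these F1 variants is given below, assuming tokenized inputs and a precomputed set of "rare" words; the exact tokenization and frequency cut-off of Shuster et al. [168] may differ.

```python
from collections import Counter
from typing import List, Set

def unigram_f1(prediction: List[str], reference: List[str]) -> float:
    """Standard unigram F1 between two token lists."""
    overlap = sum((Counter(prediction) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

def knowledge_f1(response: List[str], gold_knowledge: List[str]) -> float:
    """Knowledge F1: overlap between the response and the gold knowledge
    sentence the human referred to during data collection."""
    return unigram_f1(response, gold_knowledge)

def rare_f1(response: List[str], reference: List[str], rare_vocab: Set[str]) -> float:
    """Rare F1: the same F1 restricted to infrequent words (assumed precomputed,
    e.g. the lower half of the corpus frequency distribution), so common
    uni-grams do not inflate the score."""
    return unigram_f1(
        [t for t in response if t in rare_vocab],
        [t for t in reference if t in rare_vocab],
    )
```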

Model-based Metric. Natural language is inherently flexible: the same semantics can be expressed with different surface forms, so overlap-based metrics cannot provide a comprehensive evaluation. Recently, several works have proposed evaluation metrics for measuring consistency, such as using natural language inference (NLI) [40, 202], training learnable evaluation metrics [225], or releasing an additional test set for coherence [10]. These methods are more flexible and support generated responses with different surface forms. For the KGD task, Dziri et al. [42] propose the BEGIN benchmark, which consists of samples taken from Dinan et al. [32] with additional human annotation and a new classification task extending the NLI paradigm. Honovich et al. [73] present a trainable metric for the KGD task, which also applies NLI. It is also noteworthy that Gupta et al. [66] propose datasets that can benefit fact-checking systems specialized for dialogue systems. The Conv-FEVER corpus [163] is a factual consistency detection dataset created by adapting the Wizard-of-Wikipedia dataset [32]. It consists of both factually consistent and inconsistent responses and can be used to train a classifier to detect factually inconsistent responses with respect to the provided knowledge.

8.2.4 Mitigation Methods.

The hallucination issue can be mitigated by data pre-processing, which includes introducing extra information into the data. Shen et al. [167] propose a measurement based on seven attributes of the dialogue quality, including self-consistency. Based on this measurement, the untrustworthy samples which get lower scores are filtered out from the training set to improve the model performance in terms of self-consistency (i.e., intrinsic hallucination). Shuster et al. [168] conduct a comprehensive investigation on a retrieval-augmented KGD task where a retriever is introduced to the system for knowledge selection. The authors study several key problems, such as whether retrieval helps reduce hallucinations and how the generation should be augmented with the retrieved knowledge. The experimental results show that retrieval helps substantially in improving performance on KGD tasks and in reducing the hallucination in conversations without sacrificing conversational ability.

Rashkin et al. [152] introduce a set of control codes and concatenate them with dialogue inputs to reduce hallucination by forcing the model to be more aware of how the response relies on the knowledge evidence during response generation. Some researchers have also tried to reduce hallucinated responses during generation by improving dialogue modeling. Wu et al. [210] apply inductive attention to transformer-based dialogue models, where potentially uninformative attention links are removed with respect to a piece of pre-established structural information between the dialogue context and the provided knowledge. Instead of improving the dialogue response generation model itself, Dziri et al. [41] present a response refinement strategy with a token-level hallucination critic and an entity-mention retriever, so that the original dialogue model is left without retraining. The former module is designed to label hallucinated entity mentions in the generated responses, while the retriever is trained to retrieve more faithful entities from the provided knowledge graph.

8.3 Task-oriented Dialogue Generation

A task-oriented dialogue system is often composed of several modules: a natural language understanding (NLU) module, a dialogue manager (DM), and a natural language generation (NLG) module [53, 80]. Intrinsic hallucination can occur between the DM and NLG, where a dialogue act such as recommend(NAME=peninsula hotel, AREA=tsim sha tsui) is transformed into a natural language representation “the hotel named peninsula hotel is located in tsim sha tsui area.” [6, 108].

8.3.1 Hallucination Metrics.

To evaluate hallucination, Li et al. [108] and Balakrishnan et al. [6] combine traditional metrics such as the BLEU score and human evaluation with hallucination-specific automatic metrics. Following previous works such as [38, 203], and [185], Li et al. [108] use the slot error rate, which is computed as (p + q)/N, where N is the total number of slots in the dialogue act as extracted by another model, p is the number of missing slots in the generated template, and q is the number of redundant slots. On the other hand, Balakrishnan et al. [6] introduce a novel metric called tree accuracy, which determines whether the prediction's tree structure is identical to that of the input meaning representation.
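
The slot error rate can be sketched as follows, assuming that a separate extractor has already recovered the slots mentioned in the generated text (the extraction model itself is not shown).

```python
from typing import Dict

def slot_error_rate(
    act_slots: Dict[str, str],       # slots in the dialogue act, e.g. {"NAME": "peninsula hotel", "AREA": "tsim sha tsui"}
    realised_slots: Dict[str, str],  # slots extracted back from the generated utterance by some model
) -> float:
    """Slot error rate (p + q) / N: p missing or wrong slots, q redundant
    slots added by the generation, N total slots in the dialogue act."""
    n = max(len(act_slots), 1)
    p = sum(1 for key, value in act_slots.items() if realised_slots.get(key) != value)
    q = sum(1 for key in realised_slots if key not in act_slots)
    return (p + q) / n
```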

8.3.2 Mitigation Methods.

While Balakrishnan et al. [6] propose to adopt tree-structured semantic representations and add constraints on decoding, Li et al. [108] frame a reinforcement learning problem to which they apply a bootstrapping algorithm to sample training instances and then leverage a reward related to slot consistency. Recently, another line of research has emerged in task-oriented dialogue, which is to build a single end-to-end system rather than connecting several modules (e.g., Eric and Manning [43], Madotto et al. [119, 122], Wu et al. [209]). As discussed in previous sections of this paper, there is potential for such end-to-end systems to produce extrinsic hallucinations, yet this remains less explored. For example, a model might generate a response with an entity that appears out of nowhere. In the example of hotel recommendation in Hong Kong given above, a model could generate a response such as “the hotel named raffles hotel is located in central area” (Raffles Hotel is in fact located in Downtown Core, Singapore), which cannot be verified from the knowledge base of the system.

8.4 Future Directions in Dialogue Generation

Self-Contradiction in Dialogue Systems. One of the possible reasons for self-contradiction is that current dialogue systems tend to have a short memory of dialogue history [160]. Firstly, common dialogue datasets provide several turns of conversation, yet these are not long enough to assess a model’s ability to deal with a long context. To overcome this, Xu et al. [212] introduce a new dataset that consists of, on average, over 40 utterances per episode. Secondly, we often truncate dialogue history into fewer turns to fit into models such as Transformer-based architectures, which makes it difficult for a model to memorize the past. In addition to the works on dialogue summarization, e.g., Gliwa et al. [59], it would be beneficial to apply other works which are aiming to grasp the longer context but do not focus on dialogue generation [8, 223, 233].

Fact-checking in dialogue systems. In addition to factual consistency in responses from knowledge-grounded dialogue systems, fact-checking is a future direction in dealing with the hallucination problem in dialogue systems [66]. Dialogue fact-checking involves verifiable claim detection, an important step in distinguishing hallucination-prone dialogue, and evidence retrieval from an external source. Fact-checking in dialogue systems could be utilized not only as an evaluation metric that facilitates factual consistency but also as a component when modeling such a system.

9 HALLUCINATION IN GENERATIVE QUESTION ANSWERING

Generative question answering (GQA) aims to generate an abstractive answer rather than extract an answer to a given question from provided passages [47, 102]. It is an important task since many of the everyday questions that humans deal with and pose to search engines require in-depth explanations [84] (e.g., why/how..?), and the answers are normally long and cannot be directly extracted from existing phrase spans. A GQA system can be integrated with a search engine [129] to empower more intelligent search or combined with a virtual conversation agent to enhance user experience.

Normally, a GQA system involves searching an external knowledge source for information relevant to the question. Then it generates the answer based on the retrieved information [88]. In most cases, no single source (document) contains the answer, and multiple retrieved documents will be considered for answer generation. Those documents may contain redundant, complementary, or contradictory information. Thus, hallucination is common in the generated answers.

The hallucination problem is one of the most important challenges in GQA. Since an essential goal of a GQA system is to provide factually correct answers to the given question, hallucination in the answer will mislead the user and dramatically damage the system performance.

9.1 Hallucination Definition in GQA

As a challenging yet under-explored task, there is no standard definition of hallucination in GQA. However, almost all the works on GQA [47, 88, 133, 172] involve a human evaluation process, in which the factual correctness measuring the faithfulness of the generated answer can be seen as a measurement of the hallucination; i.e., the more faithful the answer is, the less hallucinated content it contains. The most recent such work [102] uses the term semantic drift, which indicates how the answer drifts away from a correct one during generation, and this can also be seen as a specific definition of hallucination in GQA.

In line with the general categorization of hallucination in Section 2.1, we give two concrete hallucination examples in GQA in Table 1. The sources of both questions are Wikipedia web pages. For the first question, “dow jones industrial average please?”, the generated answer “index of 30 major U.S. stock indexes” conflicts with the statement “of 30 prominent companies listed on stock exchanges in the United States” from Wikipedia, so we categorize it as an intrinsic hallucination. For the second example, the sentences “The definition of a Sadducee is a person who acts in a deceitful or duplicitous manner. An example of a Sadducee is a politician who acts deceitfully in order to gain political power” in the generated answer cannot be verified from the source documents; thus, we categorize it as an extrinsic hallucination.

9.2 Hallucination Metrics in GQA

Currently, there is no automatic metric to evaluate hallucination in GQA specifically. While most works on GQA use automatic evaluation metrics such as the ROUGE score and F1 to measure the quality of the answer, these n-gram overlap-based metrics are not a meaningful way to evaluate hallucination due to their poor correlation with human judgments, as indicated by Krishna et al. [88]. On the other hand, almost all GQA-related work involves a human evaluation process as a complement to the automatic evaluation. Normally, human annotators are asked to assign a score indicating the faithfulness of the answer, which can also be viewed as a measurement of answer hallucination. However, the metrics obtained via human evaluation normally come from a small sample of the data.

Metrics such as semantic overlap [166], a learned evaluation metric based on BERT that models human judgments, could be considered a better measurement of hallucination for GQA. Other metrics such as factual correctness can also be considered as a way to measure hallucination in GQA. Zhang et al. [231] propose to explicitly measure the factual correctness of a generated text against the reference by first extracting facts via an information extraction (IE) module. They then define the factual accuracy score as the ratio of facts in the generated text that match the corresponding facts in the reference.

Factual consistency, which measures the faithfulness of the generated answer given its source documents, can be employed as another way to measure hallucination in GQA. Durmus et al. [36] and Wang et al. [191] propose automatic QA-based metrics to measure faithfulness in summarization, leveraging recent advances in machine reading comprehension. They first use a question generation model to construct question-answer pairs from the summary, and then a QA model is applied to extract short answer spans from the given source document for each question. Extracted answers that do not match the provided answers indicate unfaithful information in the summary. While these metrics were first proposed in summarization work, they can easily be adopted in generative QA to measure hallucinations in the generated long-form answer.

The most recent work on GQA by Su et al. [172] proposes to estimate the faithfulness of the generated long-form answer via zero-shot short-answer recall on extractive QA datasets. They first generate long-form answers for questions from two extractive QA datasets, Natural Questions (NQ) [92] and HotpotQA [218], both of which contain large-scale question-answer pairs, and then measure the ratio of gold short-answer spans contained in the generated long answers as an estimate of the faithfulness of the generated long answer. While the idea is similar to the factual consistency metric in summarization work [36], and also matches our intuition to some extent, its correlation with human evaluation of faithfulness has not been verified.
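
Once the long-form answers have been generated, this recall-style estimate reduces to a simple containment check, as in the sketch below; lower-cased string matching is an assumed simplification of the span matching in the original work.

```python
from typing import List

def short_answer_recall(long_answers: List[str], gold_short_answers: List[str]) -> float:
    """Fraction of gold short-answer spans (from an extractive QA dataset)
    that appear verbatim in the corresponding generated long-form answers."""
    assert len(long_answers) == len(gold_short_answers)
    hits = sum(
        1 for long_ans, short_ans in zip(long_answers, gold_short_answers)
        if short_ans.lower() in long_ans.lower()
    )
    return hits / max(len(long_answers), 1)
```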

9.3 Hallucination Mitigation in GQA

Unlike conditional text generation tasks such as summarization or data-to-text generation, in which the source documents are provided and normally related to the target generation, the hallucination problem in GQA is more complicated. Generally speaking, it might come from two sources: (1) the incompetency of the retriever, which retrieves documents irrelevant to the answer, and (2) the intrinsic and extrinsic hallucination in the conditional generation model itself. Normally these two parts are interconnected and cause hallucinations in the answer.

Early works on GQA mostly tried to improve the faithfulness of the answer by investigating reliable external knowledge sources or incorporating multiple information sources. Yin et al. [220] propose Neural Generative Question Answering (GENQA), an end-to-end model that generates answers to simple factoid questions based on the knowledge base, while Bi et al. [11] propose the Knowledge-Enriched Answer Generator (KEAG) to generate a natural answer by integrating facts from four different information sources, namely, questions, passages, vocabulary, and knowledge. Nevertheless, these methods rely on the existence of high-quality, relevant resources which are not easily available.

Recent works focus more on the conditional generation model. Fan et al. [46] construct a local knowledge graph for each question to compress the information and reduce redundancy from the retrieved documents, which can be viewed as an early attempt to mitigate hallucination. Li et al. [102] propose the Rationale-Enriched Answer Generator (REAG), in which they add an extraction task to obtain the rationale for an answer at the encoding stage, and the decoder is expected to generate the answer based on both the extracted rationale and the original input. The recent work [88] employs a Routing Transformer (RT), a sparse-attention Transformer model that uses local attention and mini-batch k-means clustering for long-range dependencies, as the answer generator in the hope of modeling more retrieved documents to mitigate hallucination in the answer. Su et al. [172] propose a framework named RBG (read before generate) to jointly model answer generation with machine reading. They augment the generation model with fine-grained, answer-related salient information predicted by the MRC module to enhance answer faithfulness. Such methods can better exploit the information in the original input, but they require the extra effort of building models to extract that information.

Most recently, Lin et al. [110] propose a benchmark, which comprises 817 questions that span 38 categories, to measure the truthfulness of a language model in the QA task. This work investigates the performances of GPT-3 [16], GPT-Neo/J [192], GPT-2 [149] and a T5-based model [150]. The results suggest that simply scaling up the model is less promising than fine-tuning it in terms of improving truthfulness since larger models are better at learning the training distribution from web data and thus tend to produce more imitative falsehoods. In another recent work, Nakano et al. [133] fine-tune GPT-3 to answer long-form questions with a web-browsing environment, which allows the model to navigate the web as well as use human feedback to optimize answer quality directly using imitation learning [77]. While this method seems promising, it also hinges on how that feedback is processed.

9.4 Future Directions in GQA

While GQA is challenging yet under-explored, many possible directions could be explored to improve the answer quality and mitigate hallucination. First, better automatic evaluation metrics are needed to measure hallucination. The previously mentioned metrics, such as the semantic overlap between the generated answer and the ground-truth answer, the faithfulness of the generated answer, and factual consistency between the answer and the source documents, only consider one aspect of hallucination. Metrics that can consider all the factors related to hallucination (such as semantic overlap, faithfulness, or factual consistency) could be designed. Second, datasets with hallucination annotations should be proposed since none of the current GQA datasets have that information. Another possible direction to mitigate hallucination in the answer is improving the performance of the models. We need better retrieval models that retrieve relevant information according to queries and generation models that can synthesize more accurate answers from multi-source documents.

10 HALLUCINATION IN DATA-TO-TEXT GENERATION

Data-to-Text Generation is the task of generating natural language descriptions conditioned on structured data [90, 127], such as tables [140, 207], database records [24], and knowledge graphs [54]. Although this field has recently been boosted by neural text generation models, it is well known that these models are prone to hallucinations [207] because of the gap between structured data and text, which may cause semantic misunderstanding and erroneous correlation. Moreover, the tolerance for hallucination is very low when this task is applied in the real world, such as describing patient information tables [181] or the experimental results tables of a scientific report. Recent years have seen a growth of interest in hallucinations in Data-to-Text Generation, and researchers have proposed works on both evaluation and mitigation.

10.1 Hallucination Definition in Data-to-Text Generation

The definition and categories of hallucination in Data-to-Text Generation follow the descriptions in Section 2. We follow the general hallucination definition in this task: (1) Intrinsic Hallucinations: the generated text contains information that is contradicted by the input data [137]. For example, in Table 1, “The Houston Rockets (18-4)” uses the information “[TEAM: Rockets, CITY:Houston, WIN:18, LOSS: 5]” in the source table. However, “(18-4)” is contradicted by “[LOSS: 5]” and it should be “(18-5)”. (2) Extrinsic Hallucinations: the generated text contains extra information irrelevant to the input [30, 137]. For example, in Table 1, “Houston has won two straight games and six of their last seven.” is not mentioned in the source table [194].

10.2 Hallucination Metrics in Data-to-Text Generation

Statistical. PARENT [30] measures the accuracy of table-to-text generation by aligning n-grams from the reference description R and the generated text G to the table T. The final score is an average F-score combining entailment precision and recall. Wang et al. [199] modify PARENT and denote this table-focused version PARENT-T. Different from PARENT, which evaluates the i-th instance as (T_i, R_i, G_i), PARENT-T ignores the reference description R and evaluates each instance as (T_i, G_i).

IE-based. Liu et al. [114] estimate hallucination with two entity-centric metrics: table record coverage (the ratio of covered records in a table) and hallucinated ratio (the ratio of hallucinated entities in text). This metric firstly uses entity recognition to extract the entities of input and generated output, then aligns these entities by heuristic matching strategies, and finally calculates the ratios of faithful and hallucinated entities separately. Moreover, there are some general post-hoc IE-based metrics that could be applied to hallucination evaluation, such as Slot Error Rate (SER) [216], Content Selection (CS), Relation Generation (RG), and Content Ordering (CO) [194, 207].
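
Both entity-centric ratios can be sketched with simple set operations, assuming entity recognition and alignment have already been applied and have reduced each side to a bag of entity strings (the heuristic matching of Liu et al. [114] is replaced here by exact matching).

```python
from typing import Dict, Iterable, Set

def entity_hallucination_stats(
    table_entities: Iterable[str],
    text_entities: Iterable[str],
) -> Dict[str, float]:
    """Coverage: fraction of table entities mentioned in the generated text.
    Hallucinated ratio: fraction of text entities not grounded in the table."""
    table: Set[str] = {e.lower() for e in table_entities}
    text: Set[str] = {e.lower() for e in text_entities}
    coverage = len(table & text) / max(len(table), 1)
    hallucinated_ratio = len(text - table) / max(len(text), 1)
    return {"coverage": coverage, "hallucinated_ratio": hallucinated_ratio}
```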

QA-based. Data-QuestEval [155] adapts QuestEval [164] from summarization to data-to-text generation. First, a textual QG model is trained on a textual QA dataset. For each sample (structured data, textual description), the textual QG model generates synthetic questions based on the description. The structured data, textual descriptions (answers), and synthetic questions make up a synthetic QG/QA dataset used to train synthetic QA/QG models. Then, the synthetic QG model generates questions based on the textual description to be evaluated. The synthetic QA model then generates answers based on a synthetic question and the structured input data. Finally, BERTScore [229] measures the similarity between the generated answer and the description, indicating faithfulness.

NLI-based. Dušek and Kasner [39] recognize the textual entailment between the input data and the output text for both omissions and hallucinations with an NLI model. This work measures the semantic accuracy in two directions: check omissions by inferring whether the input fact is entailed by the generated text and check hallucinations by inferring the generated text from the input.

LM-based. Filippova [50] and Tian et al. [184] build on the intuition that when an unconditional LM, trained only on the targets, attains a smaller loss on a token than a conditional LM trained on both sources and targets, that token is predicted unfaithfully. Thus, they calculate the ratio of hallucinated tokens to the total target length to measure the hallucination level.
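
Given per-token losses from the two language models, the hallucination ratio reduces to a simple comparison, as in the sketch below; obtaining the per-token losses from the trained conditional and unconditional LMs is assumed and not shown.

```python
from typing import List

def lm_hallucination_ratio(
    conditional_losses: List[float],    # per-token loss under the conditional LM (sources + targets)
    unconditional_losses: List[float],  # per-token loss under the target-only LM
) -> float:
    """A target token is flagged as hallucinated when the unconditional LM
    assigns it a lower loss than the conditional LM, i.e. the source did not
    help predict it; the ratio of flagged tokens is returned."""
    assert len(conditional_losses) == len(unconditional_losses)
    flagged = sum(
        1 for cond, uncond in zip(conditional_losses, unconditional_losses)
        if uncond < cond
    )
    return flagged / max(len(conditional_losses), 1)
```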

10.3 Hallucination Mitigation in Data-to-Text Generation

Data-Related Methods. Several clean and faithful corpora have been collected to tackle the challenges arising from data infidelity. TOTTO [140] is an open-domain faithful table-to-text dataset in which each sample includes a Wikipedia table with several highlighted cells and a description. To ensure that the targets exclude hallucinations, the annotators revise existing Wikipedia candidate sentences and remove the parts unsupported by the table. Moreover, RotoWire-FG (Fact-Grounding) [194] is a purified, enlarged, and enriched version of RotoWire [207], a dataset for generating NBA game summaries from score tables. Annotators trim the hallucinated parts of the target texts and extract the mapped table records as content plans to better align the input tables and output summaries.

For data processing, OpAtt [136] designs a gating mechanism and a quantization module for the symbolic operation to augment the record table with pre-calculated results. Nie et al. [137] utilize a language understanding module to improve the equivalence between the input MR and the reference utterance in the dataset. They train an NLU model with an iterative relabeling procedure: First, they train the model on original data; parse the MR by model inference; train the model on new paired data with high confidence; and then repeat the above processes. Liu et al. [114] select training instances based on faithfulness ranking. Finer-grained than the above instance-level method, Rebuffel et al. [154] label tokens according to co-occurrence analysis and sentence structure through dependency parsing in the pre-processing step to explicate the correspondence between the input table and the text. Generally, the data-related methods are appropriate when the training dataset is noisy.

Modeling and Inference Methods. Planning and skeleton generation are common methods to improve the faithfulness to the input in data-to-text tasks. Liu et al. [114] propose a two-step generator with a separate text planner augmented by auxiliary entity information. The planner predicts the plausible content plan based on the input data. Then, given the above input data and the content plan, the sequence generator generates the text. Similarly, Plan-then-Generate [174] also consists of a content planner and a sequence generator. In addition, this work adopts a structure-aware RL training to generate output text following the generated content plan faithfully. Puduppully and Lapata [148] first induce a macro plan consisting of multiple sequences of entities and events from the input table and its corresponding multi-paragraph long document. The predicted macro plan then serves as the input to an encoder-decoder model for surface realization. SANA [195] is a skeleton-based two-stage model that includes skeleton generation to select key tokens from the source table and edit-based generation to produce texts via iterative insertion and deletion operations. In contrast to the above two-step model using planning or skeleton, AGGGEN [216] is an end-to-end model that jointly learns to plan and generate at the same time. This architecture with a Hidden Markov Model and Transformer encoder-decoder reintroduces explicit sentence planning stages into neural systems by aligning facts in the target text to input representations.

Other modeling methods have also been proposed to mitigate the hallucination problem. Conjecturing that hallucinations can be caused by inattention to the source, Tian et al. [184] propose a confidence score and a variational Bayes training framework to learn the score from data. Wang et al. [199] introduce a table-text optimal-transport matching loss and an embedding similarity loss to encourage faithfulness. The hallucination degree can also be treated as a controllable factor in generating texts. In Filippova [50], the hallucination degree of each training sample is estimated and converted into a categorical value which is a part of the inputs as a controlled setting. This approach does not require the dismissal of any input or modification of the model structure.

To mitigate hallucinations at the inference step, Rebuffel et al. [154] propose a Multi-Branch Decoder that leverages word-level alignment labels between the input table and paired text to learn the relevant parts of the training instance. These word-level labels are gained through dependency parsing during the pre-processing step. The branches separately integrate three co-dependent control factors: content, hallucination, and fluency. Uncertainty-aware beam search (UABS) [211] is an extension to beam search to reduce hallucination. Considering that the hallucination probability is positively correlated with predictive uncertainty, this work adds a weighted penalty term in the beam search which is able to balance the predictive probability and uncertainty. This approach is task-agnostic and can also be applied to other tasks, such as image captioning.
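
The re-scoring idea behind uncertainty-aware beam search can be summarized in one line, as sketched below; how the predictive uncertainty of a candidate is estimated (e.g., via an ensemble or MC dropout) is left open, and the weighting form here is a simplification of the penalty in [211].

```python
def uncertainty_penalized_score(log_prob: float, uncertainty: float, weight: float) -> float:
    """Beam-search candidate score: the usual log probability, penalized by
    the candidate's predictive uncertainty scaled by `weight`, so that beams
    whose predictions the model is uncertain about are ranked lower."""
    return log_prob - weight * uncertainty
```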

These various types of methods do not necessarily conflict and can collaborate to solve the hallucination problem in data-to-text generation.

10.4 Future Directions in Data-to-Text Generation

Given the challenges brought by the discrepancy between structured data and natural text, and the low fault tolerance of the Data-to-Text Generation task, there are several potential directions worth exploring in terms of hallucination.

Firstly, numbers carry information about scales and are common and crucial in the Data-to-Text task [175, 230]. Errors in numbers are frequent and result in hallucinations and infidelity. This is a serious problem for Data-to-Text generation, yet models rarely give special consideration to the numbers found in the table or text [180]. Current automatic metrics of hallucination also do not treat numbers specifically. This indiscriminate treatment contradicts findings in cognitive neuroscience, where numbers are known to be represented differently from lexical words, in a different part of the brain [60]. Thus, considering or highlighting numbers when mitigating and assessing hallucinations is worth exploring. This requires the generative model to learn a better numerical representation and capture scales, which would reduce the hallucinations caused by the misunderstanding of numbers.

Moreover, the logical data-to-text generation task requires logical inference, calculation, and comparison rather than surface-level generation, which is challenging and causes hallucinations more easily. Thus, reasoning (including numerical reasoning), which is usually combined with graph structures [23], is another direction to improve the accuracy of entity relationships and alleviate hallucinations.

11 HALLUCINATIONS IN NEURAL MACHINE TRANSLATION

Neural Machine Translation (NMT) is the task of generating a translation of source-language text into the target language, given parallel data samples for training. Compared to statistical machine translation (SMT), the output of NMT is usually quite fluent and of human-level quality, which creates the danger of misinforming users when there are hallucinations [124].

11.1 Hallucinations Definition and Categories in NMT

The problem of hallucination was identified with the deployment of the first NMT models. Early work comparing SMT and NMT systems [86], without explicitly using the term “hallucination”, mentioned that NMT models tend to “sacrifice adequacy for the sake of fluency”, especially when evaluated with out-of-domain test sets. Following the further development of NMT, most of the relevant research papers agree that translated text is considered a hallucination when it is completely disconnected from the source [95, 132]. The categorization of hallucination in NMT is unlike that in other NLG tasks, and uses various terms that often overlap. In order to maintain consistency with the other NLG tasks, in this section we use the intrinsic and extrinsic hallucination categories applied to the NMT task by [237]. After a formal definition, we describe other identified types and categories of hallucinations mentioned in the relevant literature.

Table 3. Categories and examples of hallucinations in MT by Zhou et al. [237] and Raunak et al. [153]

Intrinsic and Extrinsic Hallucinations. Following the idea that hallucinations are outputs that are disconnected from the source, [237] suggest categorizing the hallucinatory content based on the way the output is disconnected:

  • Intrinsic Hallucination: translations contain incorrect information compared to information present in the source. In Table 3, the example of such hallucination is “Jerry doesn’t go”, since the original name in the source is “Mike” and the verb “to go” is not negated.

  • Extrinsic Hallucination: translations produce additional content without any regard to the source. In Table 3, “happily” and “with his friend” are the two examples of the hallucinatory content since they are added without any apparent connection to the input.

Other Categories and Types of Hallucinations. Raunak et al. [153] propose an alternative categorization of hallucinations. They divide hallucinations into hallucinations under perturbations and natural hallucinations. Hallucinations under perturbation are those that can be observed if a model tested on the perturbed and unperturbed test set returns drastically different content. Their work on hallucinations under perturbation strictly follows the algorithm proposed by Lee et al. [95]; see Section 11.2.2 on the entropy measure. The second category, natural hallucinations, are created with a connection to the noise in the dataset and can be further divided into detached and oscillatory, where detached hallucinations mean that a target translation is semantically disconnected from a source input, and oscillatory hallucinations mean those that are decoupled from the source by manifesting a repeating n-gram. Tu et al. [187] and Kong et al. [87] analyze this phenomenon under the name “over-translation”, that is, a repetitive appearance of words that were not in the source text. Conversely, under-translation is skipping the words that need to be translated [187]. Finally, abrupt jumps to the end of the sequence and outputs that remain mostly in the source language are also examples of hallucinatory content [95].

11.2 Hallucination Metrics in NMT

The definition of hallucinations in machine translation (MT) tends to be qualitative and subjective, and thus researchers often identify hallucinated content manually. Most detrimentally, the appearance of hallucinations is found not to affect the BLEU score of the translated text [184, 237]. There are, nevertheless, several notable efforts to automatize and quantify the search for hallucinations using statistical methods.

11.2.1 Statistical Metrics.

Martindale et al. [124] propose identifying sentence adequacy using the bag-of-vectors sentence similarity (BVSS) metric. This metric indicates whether information has been lost because the reference contains more information than the MT output, or because the MT output contains more information than the reference.

11.2.2 Model-Based Metrics.

Auxiliary Decoder. “Faithfulness” refers to the amount of source meaning that is faithfully expressed in the translation, and it is used interchangeably with the term “adequacy” [49, 186]. Feng et al. [49] propose adding another “evaluation decoder” apart from the standard translation decoder. In their work, faithfulness is based on word-by-word translation probabilities, and is calculated in the evaluation module along with translation fluency. The loss returned by the evaluation module helps to adjust the probability returned by the translation module.

Entropy Measure. In scenarios where the ground truth of a translation is not available, an entropy measure of the average attention distribution can be used to detect hallucinations. Tu et al. [187] and Garg et al. [55] show that hallucinations are visible in attention matrices. When the model outputs correct translation, the attention mechanism attends to the entire input sequence throughout decoding. However, it tends to concentrate on one point when the model outputs hallucinatory content. The entropy is calculated on the average attention weights when the model does or does not produce hallucinations during testing. For comparison, a clean test set is used along with the purposefully perturbed one, which is created to incite hallucinations (test sets featuring multiple repetitions). The mean entropy returned by hallucinatory models diverges from the mean of the models that do not produce hallucinations spontaneously [95].
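
The entropy check itself is straightforward once the average attention distribution over source tokens has been extracted from the model, as the sketch below illustrates; only the entropy computation is shown, not the perturbation protocol of [95].

```python
import math
from typing import List

def attention_entropy(avg_attention: List[float]) -> float:
    """Entropy of the average source-attention distribution for one output.
    Attention that collapses onto a few source positions yields low entropy,
    which has been associated with hallucinatory outputs."""
    total = sum(avg_attention) or 1.0  # guard against an all-zero vector
    probs = [w / total for w in avg_attention if w > 0]
    return -sum(p * math.log(p) for p in probs)
```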

Token Level Hallucination Detection. Zhou et al. [237] propose a method for detecting hallucinated tokens within a sentence, making the search more fine-grained. They use a synthetic dataset that is created by adding noise to the source data, more specifically it is generated by a language model with certain tokens of correct translations masked. Tokens in synthetic data are labeled as hallucinated (1) or not (0). Then the authors compute the hallucination prediction loss between binary labels and the tokens from the hallucinated sentence. This work further employs a word alignment-based method and overlap-based method as baselines for hallucination.

Similarity-based Methods. Zhou et al. [237] use an unsupervised model that extracts alignments from similarity matrices of word embeddings [162] and then predicts a target token as hallucinated if it is not aligned to the source. Parthasarathi et al. [141] propose calculating faithfulness by computing similarity scores between a perturbed source sentence and the target sentence after applying the same perturbation.

Overlap-based Methods. Zhou et al. [237] predict that a target token is hallucinated if it does not appear in the source. Since the target and source are in two different languages, the authors use the density matching method for bilingual synonyms from Zhou et al. [236]. Kong et al. [87] suggest the Coverage Difference Ratio (CDR) as a metric to evaluate adequacy, which is especially successful at finding cases of under-translation. It is estimated by comparing the source words covered by the generated translation with those covered by the human translation.

The overlap-based methods for detecting hallucinations are heuristics based on the assumption that all translated words should appear in the source. However, this is not always the case, e.g., when paraphrasing or using synonyms. Using word embeddings as similarity-based methods helps avoid such simplifications and allows more diverse, synonymous translations.

Approximate Natural Hallucination Detection. Raunak et al. [153] propose Approximate Natural Hallucination (ANH) detection based on the fact that hallucinations often occur as oscillations (repeating n-grams), so a lower unique bigram count indicates a higher incidence of oscillatory hallucinations. The ANH detection method flags translations above a certain repeated n-gram threshold and searches for repeated targets in the output translations, following the assumption that if hallucinations are often incited by aligning unique sources to the same target, then repeated targets will also appear during inference [187].

11.3 Hallucination Mitigation Methods in NMT

Hallucinations in MT are hard to discover for a person who is not fluent in the target language, and thus they can lead to many possible errors, or even dangers. Of all the natural language generation tasks, NMT engines such as Google in the English-speaking internet and Baidu in the Sinosphere are probably the most widely accessible to netizens. Consequently, there is great interest in improving NMT's performance, including by mitigating hallucinations. This subsection compiles methods for mitigating hallucinations in NMT.

11.3.1 Data-Related Methods.

Data augmentation appears to be one of the most common methods for removing hallucination. Lee et al. [95] and Raunak et al. [153] suggest adding perturbed sentences. Furthermore, perturbation in which the most common tokens are inserted at the beginning of the sentence seems to be the most successful for hallucination mitigation. A disadvantage of this method is the need to understand the different types of hallucinations produced by the model in order to apply the correct augmentation method. Corpus filtering is a method of mitigating hallucinations caused by noise in the dataset by removing repetitive and mismatched source and target sequences [153]. Junczys-Dowmunt [79] implements a cross-entropy data filtering method for bilingual data, which uses cross-entropy scores calculated for noisy pairs according to two translation models trained on clean data. Sentence pairs whose scores suggest disagreement between the two models are subsequently penalized.

While [95, 153] and [79] define noise as mismatched source and target sentences, [15] analyzes the influence of fine-grained semantic divergences on NMT outputs. The authors consequently propose a mitigation method for fine-grained divergences based on semantic factors: tags are applied to each source and target sentence to indicate the position of divergent tokens. Factorizing divergence not only helps to mitigate hallucinations but also improves the overall performance of the NMT system. This shows that tagging small semantic divergences can provide useful information for the network during training.

11.3.2 Modeling and Inference.

Exposure bias is a common problem in NMT, amplified by the teacher-forcing technique used in sequence-to-sequence models. The models are trained on the ground truth, but during inference they attend to their own past predictions, which can be incorrect [87, 151]. To mitigate this problem, Wang and Sennrich [193] propose substituting the MLE training objective with minimum risk training (MRT) [138]. Scheduled sampling is a classic method of mitigating exposure bias, first proposed by [9]. Based on that method, [62] create a differentiable approximation to greedy decoding that shows good performance in the NMT task. [215] propose a further improvement of the scheduled sampling algorithm for NMT by optimizing the probability of source and target word alignments. This improvement helps to address the issue of flexibility in word order between the source and target languages when performing scheduled sampling.

Zhou et al. [237] propose a method of improving self-training of NMT based on hallucination detection. They create hallucination labels (see Section 11.2.2), and then discard losses of tokens predicted as hallucinations, which is known as token loss truncation. This is similar to the method proposed by Kang and Hashimoto [81], the latter for full sentences in the summarization task. Furthermore, instead of adjusting losses, Zhou et al. [237] mask the hidden states of the discarded losses in the decoder in a procedure called decoder HS masking. Experimental results show both a translation quality improvement in terms of BLEU and also a large reduction in hallucination. The token loss truncation method shows good results in the low-resource languages scenario.
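
The token loss truncation step can be sketched as masking the per-token training losses at positions flagged as hallucinated. This is a simplified illustration assuming the labels from the detection step of Zhou et al. [237] are already available, and it omits the additional decoder hidden-state masking.

```python
import torch

def truncated_token_loss(
    token_losses: torch.Tensor,          # per-token training losses, shape (seq_len,)
    hallucination_labels: torch.Tensor,  # 1 = flagged as hallucinated, 0 = kept
) -> torch.Tensor:
    """Average the per-token losses over non-hallucinated positions only, so
    the model is not rewarded for reproducing hallucinated tokens."""
    keep = (hallucination_labels == 0).float()
    return (token_losses * keep).sum() / keep.sum().clamp(min=1.0)
```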

Another method to mitigate the impact of noisy datasets is tilted empirical risk minimization (TERM), a training objective proposed by Li et al. [107]. [95] mentions that techniques such as dropout, L2 regularization, and clipping tend to decrease the number of hallucinations. Lastly, several authors propose methods for improving phrase alignment that are helpful both for increasing translation accuracy and for identifying content that did not appear in the source [55, 205, 227].

11.4 Future Directions in NMT

An important direction for future work on hallucinations in NMT is to define hallucination in a quantifiable manner, i.e., to specify a cut-off between ordinary translation errors and hallucinated content using a particular metric. Martindale et al. [124] propose a threshold between fluency and adequacy that comes closest to this ideal; however, they do not focus on hallucinated content as such, and fluent but inadequate sentences do not always indicate hallucination but may reflect other types of translation errors. Balakrishnan et al. [6] mention constrained decoding as a method to mitigate hallucinations in dialogue systems, but it could also be applied to NMT. [33, 70, 146, 171, 177, 214] and [213] use constrained decoding to incorporate specific terminology into MT, and these methods could be repurposed to mitigate hallucinations.

Another direction for future work is improving existing methods for detecting hallucinatory content, such as the algorithms proposed by Feng et al. [49], Lee et al. [95], and Raunak et al. [153], which are computationally expensive [153] or require the creation of an additional perturbed test set [95]. Similarly, for mitigating the lack of faithfulness and fluency, the method proposed by Feng et al. [49] requires a one-to-many architecture (one encoder and two decoders), which is also computationally expensive. Future directions therefore include simplifying existing hallucination evaluation methods, applying them to different architectures such as CNNs and Transformers, and developing simpler hallucination search methods.

12 HALLUCINATION IN VISION-LANGUAGE GENERATION

With the rapid advancement of the Transformer architecture [35, 189] in both CV and NLP, there is a trend towards pre-training large-scale unified vision-language (VL) models [1, 27, 105, 196, 197, 200] to perform vision-grounded text generation tasks, such as image captioning and visual question answering (VQA). Generally, there are two common schemas for vision-language pre-training: 1) pre-train from scratch with a massive amount of image-text pairs and, optionally, a large text-only corpus; or 2) initialize model parameters from a large pre-trained LM and then adapt it to the VL domain with adequate image-text pairs. Either way, the learned vision and language representations are aligned in the same multimodal space, and the resulting model can be seen as an LM that understands visual information. The hallucination problem is therefore also observed in VL models, for reasons similar to those found in NLG.

In the VL domain, research on hallucination is still at a very early stage, and how to measure and mitigate hallucination remains an open question. In this section, we first review hallucination in image captioning, as it is the only VL task with prior work on this topic. We then introduce hallucination phenomena found in other VL tasks. Finally, we discuss potential future research directions on this problem.

12.1 Object Hallucination in Image Captioning

Definition. Object hallucination occurs when a model generates captions containing objects that are inconsistent with or absent from the input image. Following the taxonomy used for NLG tasks, we categorize object hallucination into intrinsic and extrinsic types:

  • Intrinsic Object Hallucination: the caption contains objects that are incorrect or definitely non-existent in the input image. For example, in Figure 2, there is no “mirror” or “football” on top of the chest in the given image.
  • Extrinsic Object Hallucination: the caption contains objects whose existence cannot be verified from the input image. For example, in Figure 2, we cannot verify whether there are “letters” in the drawer or a “fan” on the roof.

Fig. 2. Examples of intrinsic and extrinsic object hallucination in image captioning. Intrinsic: “A chest of drawers with a mirror on top of it.”, “A chest of drawers and a football on top of it.” Extrinsic: “A chest of drawers and letters inside drawers.”, “A chest of drawers and a fan on the roof.”

Metrics. To automatically measure object hallucination, Rohrbach et al. [159] propose the CHAIR (Caption Hallucination Assessment with Image Relevance) metric, which calculates what proportion of the object words in generated captions actually appear in the image according to the ground-truth annotations. It has two variants, CHAIR_i and CHAIR_s, defined as follows:

CHAIR_i = |{hallucinated objects}| / |{all objects mentioned}|
CHAIR_s = |{captions with at least one hallucinated object}| / |{all captions}|

CHAIR_i measures per-instance object hallucination, i.e., what fraction of the object instances mentioned in generated captions are hallucinated, while CHAIR_s measures per-sentence object hallucination, i.e., what fraction of generated captions include at least one hallucinated object. For example, to calculate CHAIR scores on the MSCOCO dataset [111], Rohrbach et al. [159] apply the 80-object list used in the MSCOCO segmentation challenge (following Lu et al. [116]) and find exact matches of object words or phrases in the captions.
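The following simplified sketch shows how CHAIR_i and CHAIR_s can be computed from generated captions, per-image ground-truth object sets, and a fixed category list. Unlike the official implementation, it ignores synonyms, plural forms, and multi-word category names, and the data structures are illustrative.

```python
# Simplified sketch of the CHAIR metrics [159]: object mentions are found by
# exact word match against a fixed category list; a mention is hallucinated if
# the object is not in the image's ground-truth object set.
from typing import Dict, List, Set

def chair_scores(
    captions: Dict[str, str],          # image_id -> generated caption
    gt_objects: Dict[str, Set[str]],   # image_id -> objects present in the image
    categories: List[str],             # object vocabulary to match against
) -> Dict[str, float]:
    total_mentions, halluc_mentions = 0, 0
    total_captions, halluc_captions = 0, 0
    for image_id, caption in captions.items():
        words = set(caption.lower().split())
        mentioned = [c for c in categories if c in words]
        hallucinated = [c for c in mentioned if c not in gt_objects.get(image_id, set())]
        total_mentions += len(mentioned)
        halluc_mentions += len(hallucinated)
        total_captions += 1
        halluc_captions += int(len(hallucinated) > 0)
    return {
        "CHAIR_i": halluc_mentions / max(total_mentions, 1),  # per-instance
        "CHAIR_s": halluc_captions / max(total_captions, 1),  # per-sentence
    }
```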

Mitigation. As object hallucination is a research problem in its early stage, only a limited number of approaches have been proposed to mitigate it in image captioning. Biten et al. [13] hypothesize that the main cause of object hallucination is the systematic co-occurrence of particular object categories in input images. They propose three simple yet effective data augmentation methods that make the co-occurrence statistics matrix more uniform, and show that this reduces object hallucination without changing the model architecture. From another perspective, Xiao and Wang [211] propose an uncertainty-aware beam search method for decoding and show that reducing uncertainty leads to less hallucination: a weighted penalty term is added to the beam search objective to balance the log probability and the predictive uncertainty of the selected word candidates. More recently, Dai et al. [28] analyze object hallucination in VL pre-training and propose a novel pre-training objective named object masked language modeling to alleviate this problem.
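The sketch below illustrates the general idea of penalizing predictive uncertainty during beam search by subtracting a weighted uncertainty term from each candidate's log-probability. The entropy-based uncertainty estimate and the weighting scheme here are simplifying assumptions and do not reproduce the exact formulation of [211].

```python
# Sketch of uncertainty-penalized candidate scoring for beam search: a
# candidate's score is its log-probability minus a weighted uncertainty term,
# so the search prefers tokens the model is both likely and confident about.
import math
from typing import List, Tuple

def entropy(probs: List[float]) -> float:
    """Entropy of the predictive distribution, used here as a proxy for uncertainty."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def penalized_candidate_scores(
    probs: List[float],   # predictive distribution over the vocabulary at this step
    weight: float = 0.5,  # trade-off between likelihood and uncertainty
) -> List[Tuple[int, float]]:
    """Return (token_id, score) pairs; higher scores are better beam candidates."""
    step_uncertainty = entropy(probs)
    return [
        (tok, math.log(p) - weight * step_uncertainty)
        for tok, p in enumerate(probs)
        if p > 0
    ]
```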

12.2 Hallucination in Other VL Tasks

In addition to image captioning, hallucination has also been observed in other VL tasks and raised as an open research question. For example, in open-ended visual question answering, Figure 3 (left and right) shows that the model can generate answers that seem plausible when only the text is considered, but that are wrong given the image. Moreover, Figure 3 (middle) shows that hallucination can also be triggered by adversarially prompting an unanswerable question: the model imagines an unsupported answer that would commonly fit the given visual scene.

Fig. 3. Examples of hallucination in visual question answering (taken from [1]). The bold text is the output generated by the model and the part before it is the input prompt. Example question-answer pairs: “What is on the phone screen?” – “A text message from a friend.”; “What can you see out the window?” – “A parking lot.”; “Whom is the person texting?” – “The driver.”

12.3 Future Directions in VL

For future research on the hallucination problem in VL, we summarize three promising directions. First, research on hallucination in VL is still at an early stage, and there is a lack of empirical and theoretical analysis for many tasks, such as visual storytelling, visual commonsense reasoning, and video captioning. Second, more effective evaluation metrics are needed: although CHAIR can automatically evaluate the degree of object hallucination in image captioning, it requires a pre-defined list of object categories and therefore does not generalize well, and for the hallucination types discussed in Section 12.2 there is currently no automatic metric, so quantitative evaluation is not possible. Third, controlled generation [28, 154] with visual grounding is a promising direction for mitigating hallucination in VL.

13 CONCLUSION

In this survey, we provide the first comprehensive overview of the hallucination problem in NLG, summarizing existing evaluation metrics, mitigation methods, and the remaining challenges for future research. Hallucination is an artifact of neural NLG and is of concern because hallucinated text appears fluent and can therefore mislead users; in some scenarios and tasks it can cause harm. We survey the various contributors to hallucination, ranging from noisy data, erroneous parametric knowledge, incorrect attention mechanisms, and inappropriate training strategies to exposure bias at inference time. We show that there are two categories of hallucination, namely intrinsic and extrinsic hallucination, and that they need to be treated with different mitigation strategies. Hallucination is relatively easy to detect in abstractive summarization and in NMT, where it can be checked against the evidence in the source. For dialogue systems, it is important to balance diversity and consistency in dialogue responses. Hallucination in GQA and VL tasks is detrimental to performance, but research on mitigation methods in these areas is still very preliminary. For data-to-text generation, hallucination arises from the discrepancy between the input and output formats. Most methods to mitigate hallucination in NMT aim either to reduce dataset noise or to alleviate exposure bias. In the VL domain, models also generate output that is unfaithful to the visual scene, and recent work has mainly focused on the object hallucination problem. There remain many challenges ahead in identifying and mitigating hallucinations in NLG, and we hope research in this area can benefit from this survey.
