
Chain of Verification (CoVe)

  • Related Project: Private
  • Category: Paper Review
  • Date: 2023-10-03

Chain-of-Verification Reduces Hallucination in Large Language Models

  • url: https://arxiv.org/abs/2309.11495
  • pdf: https://arxiv.org/pdf/2309.11495
  • abstract: Generation of plausible yet incorrect factual information, termed hallucination, is an unsolved issue in large language models. We study the ability of language models to deliberate on the responses they give in order to correct their mistakes. We develop the Chain-of-Verification (CoVe) method whereby the model first (i) drafts an initial response; then (ii) plans verification questions to fact-check its draft; (iii) answers those questions independently so the answers are not biased by other responses; and (iv) generates its final verified response. In experiments, we show CoVe decreases hallucinations across a variety of tasks, from list-based questions from Wikidata, closed book MultiSpanQA and longform text generation.


TL;DR


  • Develops a chain of verification to improve the factual accuracy of language models
  • Confirms the effective application of CoVe and performance gains across a variety of benchmarks
  • Minimizes potential errors through a systematic verification process

1. Introduction

Language models are trained on vast amounts of text data, and their performance improves as the number of model parameters grows. In particular, larger models can generate more accurate factual statements, yet they still make mistakes on lesser-known facts. Such errors are called “hallucinations,” and this paper studies the use of language-model-based inference to reduce them. The authors develop the Chain-of-Verification (CoVe) method, which plans verification questions after an initial response and answers them systematically to improve the accuracy of the final response.


2. Related Work

Approaches to the hallucination problem fall broadly into training-time correction, generation-time correction, and correction via tool use. At training time, the model is adjusted through reinforcement learning or contrastive learning; at generation time, the probabilities of the generated tokens are taken into account to make more reliable decisions. Approaches that use external tools include grounding generation on factual documents to reduce hallucination. Extended reasoning steps at inference time have also been shown to improve performance.


3. Chain of Verification

3.1 Generate Baseline Response

A baseline response is generated from the language model given the query. This step serves as the reference point in the CoVe pipeline.

3.2 Plan Verifications

Conditioned on the query and the baseline response, a list of verification questions is generated. These questions test the factual claims contained in the baseline response.

3.3 Execute Verifications

The planned verification questions are answered and checked for consistency against the baseline response. A factored (“split and revise”) approach is used here, in which each question is answered through an independent prompt to obtain better results.

3.4 Generate Final Verified Response

The verification results are integrated to generate an improved final response. Corrections are made by taking the verified answers and question-answer pairs into account.


4. Experiments

4.1 Benchmarks

  • Wikidata: Measures the accuracy of answers to automatically generated questions.
  • Wiki-Category List: Performs set-generation tasks over a variety of categories.
  • MultiSpanQA: Answers reading-comprehension questions that require multiple answers.
  • Long-form Biography Generation: Generates biographies of people and fact-checks them with the FACTSCORE metric.

4.2 Baseline Model

Llama65B is used as the baseline model, and performance gains are verified across several CoVe variants.

4.3 Results

CoVe substantially improves precision on list-based answer tasks, and performance also improves on closed-book QA and long-form generation. The extended inference steps help reduce errors, and generating verification questions with the language model outperforms heuristic approaches.

This paper proposes a systematic approach for reducing hallucination in language models and demonstrates its effectiveness across a variety of benchmarks.


1 INTRODUCTION

Large Language Models (LLMs) are trained on huge corpora of text documents with billions of tokens of text. It has been shown that as the number of model parameters is increased, performance at tasks such as closed-book QA improves in accuracy, and larger models can generate more correct factual statements (Radford et al., 2019; Petroni et al., 2019). However, even the largest models can still fail, particularly on lesser-known torso and tail distribution facts (Sun et al., 2023a), i.e., those that occur relatively rarely in the training corpora. In those cases where the model is incorrect, it instead generates an alternative response which is typically plausible-looking (e.g., a similar entity, but an incorrect one). These factually incorrect generations are referred to as hallucinations (Maynez et al., 2020). Further, in long-form tasks consisting of generating multiple sentences or paragraphs, the hallucination problem can be exacerbated due to the issue of exposure bias (Wang & Sennrich, 2020).

The current wave of language modeling research goes beyond next-word prediction and has focused on models' ability to reason. Improved performance in reasoning tasks can be gained by encouraging language models to first generate internal thoughts or reasoning chains before responding (Wei et al., 2022; Adolphs et al., 2021; Wang et al., 2022; Lanchantin et al., 2023), as well as updating their initial response through self-critique (Press et al., 2022; Madaan et al., 2023).

In this work, we follow this line of research to study how and when language-model-based reasoning can be used to reduce hallucinations. We develop an approach, called Chain-of-Verification (CoVe), which, given an initial draft response, first plans verification questions to check its work and then systematically answers those questions in order to finally produce an improved revised response. We find that independent verification questions tend to provide more accurate facts than those in the original long-form answer, and hence improve the correctness of the overall response. We study variations on this recipe across a range of tasks: list-based questions, closed-book QA, and long-form text generation. We first propose a joint approach for generating the entire verification chain left-to-right, which improves performance and decreases hallucinations compared to the baseline language model. However, models that attend to existing hallucinations in the context from their own generations tend to repeat the hallucinations. Hence we also introduce further improvements with factored variants which separate out the verification chain steps, in terms of which context is attended to. We show how these factored variants give further performance gains across all three tasks considered.

2 RELATED WORK

Hallucination is a general problem in language model generations that appears across many tasks, from summarization (Maynez et al., 2020) to open-domain dialogue (Roller et al., 2020), and has not been resolved by simply scaling up training data or model size (Zhang et al., 2023). For a survey of the hallucination issue, see Ji et al. (2023). A majority of the methods for reducing hallucination can be divided into roughly three categories: training-time correction, generation-time correction, and via augmentation (tool-use).

In training-time correction methods, an attempt is made to improve the raw left-to-right generations of an encoder-decoder or decoder-only language model by either training or otherwise adjusting the model weights to decrease the probability of hallucinated generations. This includes using reinforcement learning (Roit et al., 2023; Wu et al., 2023), contrastive learning (Chern et al., 2023b; Sun et al., 2023b), and other methods (Li et al., 2023).

In generation-time correction, a common theme is to make reasoning decisions “on top of” the base LLM in order to make them more reliable, for example by considering the probabilities of the generated tokens (Mielke et al., 2022; Kadavath et al., 2022). In Manakul et al. (2023), multiple samples are drawn from the model to detect hallucinations. In Varshney et al. (2023), hallucinations are identified using low confidence scores, their correctness is checked through a validation procedure and mitigated, and then the generation is continued. An alternative to using the confidence scores is to leverage inconsistencies in the LLM’s output to detect hallucination. Agrawal et al. (2023) use both multiple samples and consistency detection by asking direct and indirect queries to check for hallucinated references. Cohen et al. (2023) introduce a method called LM vs. LM which simulates an interactive setup between two LLMs where one LLM acts as an examiner and tests if the output is consistent via repeated cross-examination. Cohen et al. (2023) show that using inconsistencies for QA tasks can outperform using confidence scores for hallucination detection. CoVe also uses a related self-consistency approach, but without the multi-agent (multi-LLM) debate concept.

A third approach is to use external tools to help mitigate hallucinations, rather than relying solely on the abilities of the language model itself. For example, retrieval-augmented generation can decrease hallucinations by using factual documents for grounding (Shuster et al., 2021; Jiang et al., 2023b; Yu et al., 2023) or chain-of-thought verification (Zhao et al., 2023). Other approaches include using tools for fact-checking (Chern et al., 2023a; Galitsky, 2023; Peng et al., 2023), or linking to external documents with attribution (Menick et al., 2022; Rashkin et al., 2023; Gao et al., 2023).

There are also a number of related works on improving reasoning for logical and mathematical tasks, even if they do not address reducing hallucination explicitly. Several approaches have been shown to improve results with extended reasoning steps by the system, such as chain-of-thought (Wei et al., 2022), deductive verification (Ling et al., 2023), and self-verification (Miao et al., 2023; Jiang et al., 2023a; Wen et al., 2022). The latter tries to predict the (masked) question given the answer for math problems and uses that as evidence that it is the correct solution.

3 CHAIN-OF-VERIFICATION

Our approach assumes access to a base LLM that – despite potentially being prone to hallucination – is capable of being prompted with general instructions in either a few-shot or zero-shot fashion. A key assumption of our method is that this language model, when suitably prompted, can both generate and execute a plan of how to verify itself in order to check its own work, and finally incorporate this analysis into an improved response.

Our overall process, which we call Chain-of-Verification (CoVe), thus performs four core steps:

  1. Generate Baseline Response: Given a query, generate the response using the LLM.
  2. Plan Verifications: Given both the query and the baseline response, generate a list of verification questions that could help to self-analyze if there are any mistakes in the original response.
  3. Execute Verifications: Answer each verification question in turn and hence check the answer against the original response to check for inconsistencies or mistakes.
  4. Generate Final Verified Response: Given the discovered inconsistencies (if any), generate a revised response incorporating the verification results.

Each of these steps is performed by prompting the same LLM in different ways to obtain the desired response. While steps (1), (2), and (4) all can be invoked with a single prompt, we investigate variations of step (3), including joint, 2-step, and factored versions. These variants either involve a single prompt, two prompts, or else independent prompts per question, where more sophisticated decomposition can yield improved results.

We describe these steps in more detail below. An overview of the approach is illustrated in Figure 1, and in the Appendix in Figure 3.
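
To make the four steps concrete, here is a minimal sketch of the pipeline in Python. It assumes a hypothetical `llm(prompt)` helper that wraps the base model (e.g., Llama65B behind some API); the prompt wording is illustrative rather than the paper's actual few-shot prompts, and step (3) is shown in its factored form.

```python
# Minimal sketch of the four CoVe steps (factored flavor of step 3).
# `llm` is a hypothetical helper wrapping the base model; all prompt
# wording is illustrative, not the paper's actual few-shot prompts.

from typing import Callable, List, Tuple

def cove(query: str, llm: Callable[[str], str]) -> str:
    # 1. Generate baseline response.
    baseline = llm(f"Answer the question.\nQ: {query}\nA:")

    # 2. Plan verification questions, conditioned on query + baseline.
    plan = llm(
        "List verification questions (comma-separated) that fact-check "
        f"this answer.\nQ: {query}\nA: {baseline}\nVerification questions:"
    )
    questions: List[str] = [q.strip() for q in plan.split(",") if q.strip()]

    # 3. Execute verifications. Factored: each question is answered in its
    #    own prompt, without the baseline response in context.
    qa_pairs: List[Tuple[str, str]] = [(q, llm(f"Q: {q}\nA:")) for q in questions]

    # 4. Generate the final verified response, conditioned on everything.
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in qa_pairs)
    return llm(
        f"Original question: {query}\n"
        f"Draft answer: {baseline}\n"
        f"Verification Q&A:\n{evidence}\n"
        "Rewrite the draft so it is consistent with the verification answers:"
    )
```

In the joint variant, steps (2) and (3) would instead share a single prompt; the factored form shown here keeps the baseline response out of the verification-answer context.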

3.1 BASELINE RESPONSE

Given a query, we generate left-to-right as usual using the LLM, with no special tricks. While this is the first step in the CoVe pipeline, it also serves as the baseline we wish to improve in our experiments (i.e., we will directly compare this baseline response with the final verified response from our overall method). Given such baseline generations are typically prone to hallucination, CoVe attempts to identify these hallucinations and correct them in the following steps.

3.2 PLAN VERIFICATIONS

Conditioned on the original query and the baseline response, the model is prompted to generate a series of verification questions that test the factual claims in the original baseline response. For example, if part of a long-form model response contains the statement “The Mexican–American War was an armed conflict between the United States and Mexico from 1846 to 1848”, then one possible verification question to check those dates could be “When did the Mexican-American war start and end?”. We note that verification questions are not templated, and the language model is free to phrase these in any form it wants, and they also do not have to closely match the phrasing of the original text. In our experiments, we perform such verification planning by providing a few-shot prompt of (response, verification) demonstrations to our LLM. See section 8 for the few-shot prompts we will use in our experiments. We note it is also possible with a sufficiently performant instruction-following LLM that this could be performed zero-shot.
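
As a rough illustration of how such a few-shot planning prompt might be assembled, the sketch below uses an invented demonstration and the comma-separated output format mentioned in subsection 3.3; the paper's actual prompts are given in its section 8.

```python
# Illustrative few-shot planning prompt. The demonstration below is
# invented for this sketch and is not taken from the paper's prompts.

PLAN_PROMPT = """\
Q: Name some politicians who were born in Boston.
A: John F. Kennedy, Benjamin Franklin, Mitt Romney.
Verification questions: Where was John F. Kennedy born?, Where was Benjamin Franklin born?, Where was Mitt Romney born?

Q: {query}
A: {baseline}
Verification questions:"""

def plan_verifications(query: str, baseline: str, llm) -> list[str]:
    raw = llm(PLAN_PROMPT.format(query=query, baseline=baseline))
    # The demonstration asks for a comma-separated list, so parsing is simple.
    return [q.strip() for q in raw.split(",") if q.strip()]
```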

3.3 EXECUTE VERIFICATIONS

Given the planned verification questions, the next step is to answer them in order to assess if any hallucinations exist. While techniques such as retrieval-augmentation could be used in this process, such as verification via search engine, in this work we do not explore tool-use. Instead, we consider only using the LLM itself in all steps of CoVe, hence the model is used to check its own work. We investigate several variants of verification execution, called joint, 2-Step, factored, and factor+revise.

  • Joint In the joint method, the planning and execution (steps 2 and 3) are accomplished by using a single LLM prompt, whereby the few-shot demonstrations include both verification questions and their answers immediately after the questions. In this approach, separate prompts are not needed.
  • 2-Step A potential disadvantage of the joint method is that because the verification questions must condition on the baseline response in the LLM context, and the method is joint, the verification answers have to condition on the initial response as well. This may increase the likelihood of repetition, another known issue of modern LLMs (Holtzman et al., 2019). This means the verification answers might hallucinate similarly to the original baseline response, which defeats the purpose. We hence instead separate the planning and execution into separate steps, both with their own LLM prompt. The planning prompt conditions on the baseline response in the first step. The verification questions generated from planning are answered in the second step, where crucially the context given to the LLM prompt only contains the questions, and not the original baseline response, and hence cannot repeat those answers directly.
  • Factored Another, more sophisticated approach is to answer all questions independently as separate prompts. Again, crucially, those prompts do not contain the original baseline response and are hence not prone to simply copying or repeating it. The factored approach has the further advantage of removing any potential interference not only from the baseline response but also between answer contexts, and is somewhat related to the recent (concurrent) work of Radhakrishnan et al. (2023) on subquestion answering by factored decomposition; hence we adopt their naming. It can also potentially handle more verification questions by virtue of them not all having to fit within the same single context. While this is potentially more computationally expensive, requiring the execution of many more LLM prompts, they can be run in parallel and hence be batched. In order to do this, we first have to take the set of generated questions from subsection 3.2 and parse them into separate questions, which is a relatively easy task as the few-shot demonstrations we provide indicate they should be generated as a comma-separated list. We can then split them out into separate LLM prompts (a sketch of this factored execution, together with the cross-check step of the next variant, follows this list).
  • Factor + Revise After answering the verification questions, the overall CoVe pipeline then has to either implicitly or explicitly cross-check whether those answers indicate an inconsistency with the original responses. In the Factor+Revise approach, we execute this as a deliberate step via an extra LLM prompt, which may make it easier for the final system to reason about this step explicitly. Differently to answering the verification questions, the cross-checking phase needs to condition on both the baseline response and the verification question and answer. We thus execute this as separate LLM prompts, one “cross-check” prompt for each question, with again a set of few-shot demonstrations showing the desired output. For example, if the original baseline response contained the phrase “It followed in the wake of the 1845 U.S. annexation of Texas…” and CoVe generated a verification question “When did Texas secede from Mexico?” which was answered with 1836, then an inconsistency should be detected by this step.
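
The sketch below illustrates the factored execution and the Factor+Revise cross-check described above, again using the hypothetical `llm` helper; running the independent prompts through a thread pool is just one possible way to batch them and is an assumption, not something the paper specifies.

```python
# Sketch of factored execution plus the Factor+Revise cross-check.
# Each verification question gets its own prompt (no baseline in context),
# so answers cannot simply copy the draft; cross-check prompts then compare
# each Q&A pair against the draft. Prompt wording is illustrative.

from concurrent.futures import ThreadPoolExecutor

def execute_factored(questions, llm, max_workers=8):
    # Independent prompts can be run in parallel / batched.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        answers = list(pool.map(lambda q: llm(f"Q: {q}\nA:"), questions))
    return list(zip(questions, answers))

def cross_check(baseline, qa_pairs, llm):
    # One "cross-check" prompt per question, conditioned on draft + Q&A.
    verdicts = []
    for q, a in qa_pairs:
        verdict = llm(
            f"Draft: {baseline}\n"
            f"Verification question: {q}\nVerified answer: {a}\n"
            "Does the verified answer contradict the draft? Answer yes or no, "
            "and name the inconsistent fact if any:"
        )
        verdicts.append((q, a, verdict))
    return verdicts
```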

3.4 FINAL VERIFIED RESPONSE

Finally, the improved response that takes verification into account is generated. This is executed by a final few-shot prompt where the context takes into account all of the previous reasoning steps, the baseline response, and verification question-answer pairs so that the corrections can take place. If the Factor+Revise approach is used from subsection 3.3, then the output of the cross-check inconsistency detection is provided as well.

4 EXPERIMENTS

We use various experimental benchmarks to measure the efficacy of CoVe in reducing hallucination, comparing against a number of baselines.

4.1 TASKS

The benchmarks we use range from list-based questions, where the required answer is a set of entities, to long-form generation of multiple freeform sentences.

4.1.1 WIKIDATA

We start by testing CoVe on a set of automatically generated questions using the Wikidata API. We create list questions of the form: “Who are some [Profession]s who were born in [City]?”. For example, “Who are some politicians who were born in Boston?”. The answer to these questions is a set of entities, where the gold list is obtained from the Wikidata knowledge base. This results in a dataset of 56 test questions, each typically containing ∼600 known gold entities, but typically an LLM will produce a much shorter list. We then use the precision metric (micro-averaged) to measure performance, in addition to reporting the averaged number of positive and negative entities produced.
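
One way to compute this micro-averaged precision, together with the positive/negative entity counts, is sketched below; exact case-insensitive matching of entity strings is an assumption, and the paper's matching details may differ.

```python
# Sketch of micro-averaged precision for the list-based tasks: pool every
# produced entity over all questions, then count how many appear in the
# gold set for their question. Matching/normalization here is an assumption.

def micro_precision(predictions, gold):
    """predictions: {question: [entities]}, gold: {question: set(entities)}."""
    positives = negatives = 0
    for question, entities in predictions.items():
        gold_set = {e.lower() for e in gold.get(question, set())}
        for entity in entities:
            if entity.lower() in gold_set:
                positives += 1
            else:
                negatives += 1  # hallucinated entity
    total = positives + negatives
    return positives / total if total else 0.0
```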

4.1.2 WIKI-CATEGORY LIST

We then proceed to a harder set-generation task. We use the QUEST dataset that was created using Wikipedia Category lists. We convert these category names to questions by simply prepending a “Name some”. Owing to the varied questions such as “Name some Mexican animated horror films” or “Name some Endemic orchids of Vietnam,” we believe this task can pose a greater challenge. We collate all examples in the dataset that do not require logical operations to create a set of 55 test questions, each having ∼8 answers. Similar to the Wikidata task, we use micro-averaged precision to measure performance, in addition to reporting the averaged number of positive and negative entities produced.

4.1.3 MULTISPANQA

We next test our approach on a reading comprehension benchmark, MultiSpanQA. MultiSpanQA comprises questions that have multiple independent answers (derived from a series of multiple discontiguous spans in the text, with questions originally from the Natural Questions dataset). We consider a closed-book setting, where we do not provide supporting documents, and hence consider a subset of questions which are factoid-based, so that our base LLM is more likely to be able to answer them. We thus use a test set of 418 questions with shorter answers per span (up to 3 tokens per item). For example, Q: Who invented the first printing press and in what year?, A: Johannes Gutenberg, 1450.
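
A simple span-level F1 consistent with this closed-book setup might look like the following sketch; exact-match normalization is an assumption and MultiSpanQA's official scorer may differ in details.

```python
# Sketch of span-level precision/recall/F1 for a multi-answer question.
# Exact-match over lowercased, stripped spans is an assumed normalization.

def span_f1(predicted: list[str], gold: list[str]) -> float:
    pred = {p.strip().lower() for p in predicted}
    ref = {g.strip().lower() for g in gold}
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)
    precision = tp / len(pred)
    recall = tp / len(ref)
    return 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)

print(span_f1(["Johannes Gutenberg", "1450"], ["Johannes Gutenberg", "1450"]))  # 1.0
```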

4.1.4 LONG-FORM GENERATION OF BIOGRAPHIES

We next validate the performance of CoVe on long-form text generation. In this setting, we evaluate our method on generating biographies, adopting the benchmark proposed by Min et al. (2023). Here the model is simply prompted to generate a biography of a selected entity using the prompt: “Tell me a bio of <entity>”. We evaluate the efficacy of our approach using the FACTSCORE metric (Min et al., 2023) developed in that work, which uses a retrieval-augmented language model to fact-check the response (Instruct-Llama, “Llama+Retrieval+NP”), which they showed correlates well with human judgments.

4.2 BASELINES

We use Llama65B, a strong open model, as our base LLM, and use greedy decoding for all models. As Llama65B is not instruction fine-tuned, we employ few-shot examples particular to each task for measuring performance on each of our benchmarks. This serves as our main baseline which CoVe tries to improve upon. CoVe uses the same Llama65B base but includes, for the same few-shot examples, demonstrations of verification questions and final verified responses, following Figure 1 and section 3. Thus, we measure the ability to improve over the original baseline response for the same LLM. For CoVe, we compare different variants, particularly the joint and factored versions, on all tasks.

We also compare to Llama instruction fine-tuned models, for which we use Llama2 (Touvron et al., 2023b). We measure both zero-shot performance on the task and zero-shot with chain-of-thought by adding “Let’s think step by step” to the zero-shot prompt. We find that the instruction fine-tuned models tend to generate extraneous content when queried. This can especially be a problem for the list-based tasks. To deal with this, we add an extra line to our prompt: “List only the answers separated by a comma.” We also add another layer of post-processing to extract the answers using an off-the-shelf NER model, which further helped avoid this issue. However, we still expect few-shot to improve over this, especially for tasks like Multi-Span-QA where the answers are not all named entities, and the few-shot examples effectively show the domain of the task. For the long-form generation of biographies, we also compare to several existing model results reported in Min et al. (2023), in particular InstructGPT (Ouyang et al., 2022), ChatGPT, and PerplexityAI.
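
As one concrete, hypothetical realization of that NER post-processing step, an off-the-shelf spaCy model could be used to pull entity answers out of a chatty instruction-tuned response; the paper does not name the specific NER model it used.

```python
# Sketch of NER-based post-processing using an off-the-shelf spaCy model
# (one possible choice; the paper does not specify which NER model it used).
# Requires: python -m spacy download en_core_web_sm

import spacy

nlp = spacy.load("en_core_web_sm")

def extract_entities(generation: str) -> list[str]:
    """Pull named entities out of a chatty instruction-tuned response."""
    doc = nlp(generation)
    # Keeping PERSON entities fits list questions like "politicians born in Boston";
    # other tasks would need different labels (assumption for illustration).
    return [ent.text for ent in doc.ents if ent.label_ == "PERSON"]

print(extract_entities(
    "Sure! Some politicians born in Boston include John F. Kennedy and "
    "Benjamin Franklin, among others."
))
```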

4.3 RESULTS

We are interested in empirically answering the following research questions: RQ1: Can CoVe effectively reduce the rate of hallucinatory content produced by the LLM? RQ2: Can CoVe be used to fix or remove incorrect generations without decreasing the amount of correct content? Our main results across the four benchmark tasks are given in Table 1, Table 2, and Table 3, and our main findings are as follows.

CoVe improves precision on list-based answer tasks

We find that CoVe provides large gains in precision on the list-based tasks, e.g., more than double the precision from the Llama65B few-shot baseline for the Wikidata task (from 0.17 to 0.36). We find from the positive and negative breakdown that there is a large reduction in the number of hallucinated answers (negatives: 2.95 → 0.68) while only a relatively small reduction in the number of non-hallucinations (positives: 0.59 → 0.38).

CoVe improves performance on closed-book QA

We also find that CoVe brings improvements in general QA problems, as measured on MultiSpanQA. We observe a 23% improvement in F1 over the few-shot baseline (0.39 → 0.48), where the improvements come from gains in both precision and recall.

CoVe improves precision on longform generation

These results also extend to longform generation, where we actually see larger gains than in the QA setting. FACTSCORE increases 28% (55.9 → 71.4) from the few-shot baseline, with again only a relatively small reduction in the average number of facts provided (16.6 → 12.3). We also show the breakdown of improvements across facts in Figure 2, where one can see CoVe improves results for both rare and more frequent facts.

Instruction-tuning and CoT do not reduce hallucinations

We find that the few-shot baseline that employs a pre-trained Llama model outperforms Llama2-Chat, an instruction-tuned model, across all the tasks. The few-shot examples lead the model to give outputs in line with those expected for the task, whereas general instruction tuning produces more hallucinations or incorrect outputs. Standard chain-of-thought (CoT) prompting also fails to improve the results for these tasks.

Factored and 2-step CoVe improve performance

We observe a consistent performance improvement across all tasks from applying the factored CoVe approach compared to joint CoVe, for example an improvement from 60.8 → 63.7 in FACTSCORE in longform generation. Similarly, the 2-step approach also outperforms the joint approach, as tested on the Wikidata and Wiki-Category list tasks, with 2-step giving the best results for Wikidata and factored the best for Wiki-Category. All these results support our hypothesis that answers to the verification questions should not attend to the original baseline response, as they may be prone to repeating it (as the joint method can do).

Further explicit reasoning helps remove hallucinations

In the longform generation task, we also explore more sophisticated reasoning steps in the CoVe “factor+revise” method, which explicitly cross-checks whether verification answers indicate an inconsistency. We see large gains in the FACTSCORE metric from this further explicit reasoning from 63.7 (factored) → 71.4 (factor+revise). This gives further indication that appropriate and explicit reasoning in LLMs can bring improvements in mitigating hallucinations.

CoVe-based Llama outperforms InstructGPT, ChatGPT, and PerplexityAI

On the longform generation task, our baseline few-shot Llama65B is outperformed by the ChatGPT and PerplexityAI models in terms of the FACTSCORE metric. However, applying CoVe to the baseline Llama65B lifts its performance above both ChatGPT and PerplexityAI, as well as outperforming InstructGPT. This is particularly impressive compared to PerplexityAI, considering that it is a model that can support its facts with retrieval augmentation, whereas CoVe uses only the base language model itself with improved reasoning via deliberation (verification).

However, we can see in Figure 2 that PerplexityAI still outperforms CoVe for very rare facts where retrieval is essential, but CoVe outperforms PerplexityAI for more frequent facts. We note that some models produce fewer overall facts than others; however, the FACTSCORE metric is normalized and hence comparable across models. We verified this experimentally by clipping Llama 2 70B Chat’s output to present fewer facts (as it contains the largest number in its output out of all models), but this did not change its FACTSCORE substantially; e.g., clipping to 10 sentences increased its score from 41.3 → 42.7. We note the length of the generations of the few-shot-based models is essentially governed by the few-shot examples, which in turn are constrained by the context length.

Shortform verification questions are more accurately answered than longform queries

In a longform response, LLMs are prone to generate a number of hallucinations. However, it can often be the case that the LLM itself would know these hallucinations are wrong if queried specifically for that individual fact, independent of the rest of the longform generation; see Figure 1, Figure 3, and section 9. This can be seen quantitatively on the Wikidata task, where only ∼17% of the Llama few-shot baseline answer entities are correct in list-based questions. However, when querying each individual entity via a verification question, we find ∼70% are correctly answered.

LLM-based verification questions outperform heuristics

In our method, CoVe, the verification questions are generated by the LLM and depend on the task. To measure their quality, we compare them to heuristically constructed questions, replacing the LLM-generated questions with templated yes/no questions of the form “Does X answer the question?” for list-based questions with elements X in the answer. Results on the Wiki-Category task, given in Table 4, show a reduced precision with rule-based verification questions. We believe this difference would be larger for longform generation, where the types of required verification questions can be more diverse and LLM-based verification becomes even more necessary.
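
A sketch contrasting the heuristic baseline with CoVe's LLM-generated questions is given below; the template wording follows the rule described above, while the LLM prompt is illustrative.

```python
# Sketch contrasting heuristic (templated) verification questions with
# LLM-generated ones. Template wording follows the rule described above;
# the LLM prompt and `llm` helper are illustrative assumptions.

def templated_questions(query: str, answer_entities: list[str]) -> list[str]:
    # Rule-based yes/no questions: "Does X answer the question <query>?"
    return [f"Does {x} answer the question '{query}'?" for x in answer_entities]

def llm_questions(query: str, baseline: str, llm) -> list[str]:
    # CoVe instead lets the model phrase open verification questions freely.
    raw = llm(
        f"Q: {query}\nA: {baseline}\n"
        "Write verification questions (comma-separated) that fact-check "
        "each claim in the answer:"
    )
    return [q.strip() for q in raw.split(",") if q.strip()]
```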

Open verification questions outperform yes/no-based questions

In our main experiments, we use verification questions where the expected answers are true facts. An alternative setup is to include the fact as part of the verification question and ask it in a yes/no answer format. We evaluate this difference in Table 4 and find that yes/no type questions perform worse for the factored version of CoVe. Some anecdotal examples are included in Appendix section 9 for ChatGPT where we find the model tends to agree with facts in a yes/no question format whether they are right or wrong.

5 CONCLUSION

We introduced Chain-of-Verification (CoVe), an approach to reduce hallucinations in a large language model by deliberating on its own responses and self-correcting them. In particular, we showed that models are able to answer verification questions with higher accuracy than when answering the original query by breaking down the verification into a set of simpler questions. Secondly, when answering these verification questions, we showed that controlling the attention of the model so that it cannot attend to its previous answers (factored CoVe) helps mitigate copying the same hallucinations.
