
Chain of Note

  • Related Project: Private
  • Category: Paper Review
  • Date: 2023-11-15

Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models

  • url: https://arxiv.org/abs/2311.09210
  • pdf: https://arxiv.org/pdf/2311.09210
  • abstract: Retrieval-augmented language models (RALMs) represent a substantial advancement in the capabilities of large language models, notably in reducing factual hallucination by leveraging external knowledge sources. However, the reliability of the retrieved information is not always guaranteed. The retrieval of irrelevant data can lead to misguided responses, potentially causing the model to overlook its inherent knowledge, even when it possesses adequate information to address the query. Moreover, standard RALMs often struggle to assess whether they possess adequate knowledge, both intrinsic and retrieved, to provide an accurate answer. In situations where knowledge is lacking, these systems should ideally respond with “unknown” when the answer is unattainable. In response to these challenges, we introduce Chain-of-Noting (CoN), a novel approach aimed at improving the robustness of RALMs in facing noisy, irrelevant documents and in handling unknown scenarios. The core idea of CoN is to generate sequential reading notes for retrieved documents, enabling a thorough evaluation of their relevance to the given question and integrating this information to formulate the final answer. We employed ChatGPT to create training data for CoN, which was then used to train a LLaMA-2 7B model. Our experiments across four open-domain QA benchmarks show that RALMs equipped with CoN significantly outperform standard RALMs. Notably, CoN achieves an average improvement of +7.9 in EM score given entirely noisy retrieved documents and +10.5 in rejection rates for real-time questions that fall outside the pre-training knowledge scope.

Contents


  • An improved retrieval-augmented language model: Chain of Note
  • Improved robustness of the system to noise and unknown scenarios
  • Comprehensive evaluation across datasets and benchmarks

1 Introduction

Retrieval-augmented language models (RALMs) address the limitations of large language models by retrieving and incorporating information from external knowledge sources, which is especially useful when the model has no direct knowledge of a topic. However, the current RALM framework has several problems: the information retrieval system may not always return relevant, trustworthy information, and retrieving irrelevant information can lead to incorrect responses.

To address this, the paper introduces the Chain of Note (CoN) framework to improve the robustness of RALMs. CoN generates sequential reading notes for the retrieved documents, systematically assessing each document's relevance. This approach filters out irrelevant or unreliable content and leads to more accurate, contextually appropriate responses.


2 Related Work

2.1 Retrieval-Augmented Language Models (RALMs)

RALMs combine large language models (LLMs) with the specificity and detail of external knowledge sources. These models first use a retriever to identify a set of documents relevant to the user's query, then use a reader component to analyze those documents in depth and generate a response. Recent work has focused on improving the retriever or the reader, or on training the system end to end.

2.2 Chain-of-X Approaches

Recent research shows that LLMs can decompose complex problems into a series of intermediate steps. This approach mirrors how humans solve problems and lets LLMs focus on each part of a problem, reducing the likelihood of errors.


3 Proposed Method

3.1 Overview

The CoN framework generates sequential reading notes for the retrieved documents. It evaluates each document's relevance and identifies the most important and reliable information. This process filters out irrelevant or less trustworthy content, leading to more accurate and contextually appropriate responses.

3.2 Background on Existing RALMs

Existing RALMs are designed to produce more accurate responses by taking external documents into account. They can be written as \(p(y \mid x) = \sum_i p(y \mid d_i, x)\, p(d_i \mid x)\), where \(x\) is the input query and \(y\) is the model's generated response. However, these models have several limitations:

  • Risk of surface-level processing: when generating an answer directly, the language model may rely on surface-level information.
  • Difficulty handling contradictory information: when documents contain conflicting information, generating an answer is difficult.
  • Reduced transparency and interpretability: direct answer generation offers little insight into how the model reached its conclusion.

3.3 Chain of Note Framework

CoN addresses the challenges faced by retrieval-augmented language models. The framework systematically evaluates the retrieved documents and generates a concise, contextually relevant summary or note for each one. This makes it possible to assess the relevance and accuracy of each document and to resolve conflicting information.

3.4 Data Collection and Model Training

ChatGPT was used to generate reading-note data for 10k questions randomly sampled from the NQ dataset, and this data was used to train a LLaMA-2 7B model. A weighted loss strategy was used to balance the reading notes against the final answer.


4 Experiments

4.1 Experimental Setup and Evaluation

The models were evaluated on the NQ, TriviaQA, WebQ, and RealTimeQA datasets, measuring overall QA performance, robustness to noise, and robustness in unknown scenarios.

4.2 Overall QA Performance

CoN outperformed the standard RALM system and remained highly robust even with noisy documents.

4.3 Noise Robustness

CoN consistently outperformed the standard RALM even on entirely noisy documents, showing that it strengthens the model's ability to ignore irrelevant information.

4.4 Robustness in Unknown Scenarios

CoN showed improved robustness in handling unknown scenarios on the RealTimeQA benchmark, which suggests a strengthened ability to identify and disregard information the model did not learn during its initial training.

4.5 Case Studies

The case studies show how CoN extracts information from documents and handles question-relevant information more accurately, demonstrating that CoN provides a deeper understanding than a standard RALM and can generate more accurate responses.


1 INTRODUCTION

Retrieval-augmented language models (RALMs) represent a novel framework that significantly advances large language models (Touvron et al., 2023; OpenAI, 2023) by addressing key limitations such as reducing factual hallucinations (Ji et al., 2023; Zhang et al., 2023a), injecting up-to-date knowledge in a plug-and-play manner (Dhingra et al., 2022; Vu et al., 2023), and enhancing domain-specific expertise (Li et al., 2023; Qin et al., 2023). These enhancements primarily stem from integrating large language models (LLMs) with external knowledge sources (Guu et al., 2020; Lewis et al., 2020; Borgeaud et al., 2022; Shi et al., 2023c). In a typical RALM setup, a query is first processed by a retriever that searches a vast evidence corpus for pertinent documents. A reader then examines these documents, extracting useful information and formulating the final output answer. The potential benefit of the RALM framework is its ability to integrate relevant external knowledge, thereby enriching the LLMs’ understanding of input text and generating answers based on this information. This is particularly beneficial when LLMs lack direct knowledge of a subject, allowing them to acquire and utilize relevant information in a plug-and-play manner (Yu et al., 2022b).

However, there exist several issues with the current RALM framework. First, there is no guarantee that the information retrieval (IR) system will always yield the most pertinent or trustworthy information. The retrieval of irrelevant data can lead to misguided responses (Shi et al., 2023a; Yoran et al., 2023), potentially causing the model to overlook its inherent knowledge, even when it possesses adequate information to address the query (Mallen et al., 2023). Secondly, state-of-the-art LLMs often hallucinate when addressing fact-oriented questions, a deficiency that can be risky and may discourage users (Ji et al., 2023; Zhang et al., 2023a). Ideally, an intelligent system should be capable of determining whether it has enough knowledge, both intrinsic and retrieved, to provide an accurate answer.

Figure 1: Compared with the current RALMs, the core idea behind Chain of Note (CON) is to generate sequential reading notes for the retrieved documents, ensuring a systematic assessment of their relevance to the input question before formulating a final response.

In cases where knowledge is insufficient, the system should respond with “unknown” when the answer cannot be determined. Based on the shortcomings of the standard RALM system, in this paper, we aim to improve the robustness of RALMs, mainly focusing on two pivotal aspects:

  • (1) Noise Robustness: The ability of a RALM to discern and disregard noisy information present in irrelevant retrieved documents, while appropriately leveraging its intrinsic knowledge.
  • (2) Unknown Robustness: The capacity of a RALM to acknowledge its limitations by responding with “unknown” when given a query it does not have the corresponding knowledge to answer, and the relevant information is not found within the retrieved documents.

In this work, we introduce a novel framework named CHAIN-OF-NOTING (CON), designed to enhance the robustness of RALMs. The cornerstone of CON is to generate a series of reading notes for retrieved documents, enabling a comprehensive assessment of their relevance to the input query. This approach not only evaluates each document’s pertinence but also pinpoints the most critical and reliable information therein. This process effectively filters out irrelevant or less credible content, leading to responses that are more precise and contextually relevant, as exemplified in Figure 1. Besides, CON enhances the capability of RALMs to handle queries that fall outside the scope of their training data. In cases where the retrieved documents do not provide any relevant information, CON can guide the model to acknowledge its limitations and respond with an “unknown” or provide the best possible explanation based on available data, enhancing the model’s reliability.

To validate the effectiveness of the CON idea, we first prompt ChatGPT (OpenAI, 2023) to generate 10K training examples based on questions collected from Natural Questions (NQ) (Kwiatkowski et al., 2019). Subsequently, we trained a LLaMA-2 7B model to incorporate the note-taking ability integral to CON. Our evaluation of the RALM, integrated with CON and compared to the standard RALM system, focused on three major aspects: (1) overall QA performance using DPR-retrieved documents, (2) noise robustness, assessed by introducing noisy information to the system, and (3) unknown robustness, evaluated through queries not covered in the LLaMA-2 pre-training data, i.e., real-time questions. The evaluations were conducted on NQ and three additional out-of-domain open-domain QA datasets, namely TriviaQA (Joshi et al., 2017), WebQ (Berant et al., 2013), and RealTimeQA (Kasai et al., 2023). Our experiments demonstrate that Chain of Note (CON) not only improves overall QA performance when employed with DPR-retrieved documents but also significantly enhances robustness in both noise and unknown aspects. This includes a +7.9 increase in accuracy (measured by the exact match score) with noisy retrieved documents, and a +10.5 increase in the rejection rate for real-time questions that are beyond the pre-training knowledge scope.

2 RELATED WORK

2.1 RETRIEVAL-AUGMENTED LANGUAGE MODELS

Retrieval-Augmented Language Models (RALMs) represent a significant advancement in natural language processing, combining the power of large language models with the specificity and detail provided by external knowledge sources (Guu et al., 2020; Lewis et al., 2020; Izacard et al., 2022). These models first leverage a retriever to scan a vast evidence corpus, such as Wikipedia, to identify a set of documents pertinent to the user’s query.

Example retrieved passages and a generated reading note:

  • Wikipedia: “It Must Have Been Love” is a song written by Per Gessle and performed by the Swedish pop duo Roxette. The power ballad became the duo’s third number one hit in the United States.
  • Wikipedia: “It Must Be Love” is a song written and originally recorded in 1971 by Labi Siffre. It was also recorded by ska/pop band Madness in 1981.
  • Reading note: The answer is Labi Siffre. The first passage confirms that Roxette is the performer of “It Must Have Been Love.” However, the second passage mistakenly references “It Must Be Love,” which is a completely different song and not relevant to the query.

Following this, a reader component is employed to meticulously analyze these documents and formulate a response. This two-pronged approach ensures both relevance and depth in the generated answers. Recent follow-up work has mainly focused on improving the retriever (Karpukhin et al., 2020; Qu et al., 2021; Sachan et al., 2022; Ma et al., 2023) or the reader (Izacard & Grave, 2021; Cheng et al., 2021; Yu et al., 2022a), training the system end-to-end (Lewis et al., 2020; Singh et al., 2021), and integrating the retrieval systems with large-scale black-box language models (Yu et al., 2023a; Shi et al., 2023c; Yu et al., 2023b; Trivedi et al., 2023). Another line of RALMs such as kNN-LM (Khandelwal et al., 2020; Zhong et al., 2022) retrieves a set of tokens and interpolates between the next-token distribution and kNN distributions computed from the retrieved tokens at inference. The evolution has also led to the emergence and popularity of retrieval-augmented products, such as ChatGPT plugins, LangChain, and New Bing.

Robustness of RALMs. Recent studies highlight the impact of context relevance on language model performance (Creswell et al., 2022; Shi et al., 2023a; Yoran et al., 2023). Notably, Creswell et al. (2022) demonstrated that incorporating random or irrelevant contexts could adversely affect QA performance. In contrast, Shi et al. (2023a) discovered that adding irrelevant context to exemplars or task-specific instructions can sometimes enhance model performance, implying that models might intrinsically possess capabilities, developed during pre-training, to manage such scenarios. Most pertinent to our research is the study by Yoran et al. (2023), which focused on training RALMs to disregard irrelevant contexts. This approach, while distinct from our proposed solution, underscores the importance of context relevance in enhancing the effectiveness of RALMs.

2.2 CHAIN-OF-X APPROACHES IN LARGE LANGUAGE MODELS

Recent research shows that large language models (LLMs) are capable of decomposing complex problems into a series of intermediate steps, pioneered by the concept of Chain-of-Thought (CoT) prompting (Wei et al., 2022; Kojima et al., 2022). The CoT approach mirrors human problem-solving methods, where complex issues are broken down into smaller components. By doing so, LLMs can tackle each segment of a problem with focused attention, reducing the likelihood of overlooking critical details or making erroneous assumptions. This sequential breakdown makes the reasoning process more transparent, allowing for easier identification and correction of any logical missteps.

The CoT methodology has been effectively applied in various contexts, including multi-modal reasoning (Zhang et al., 2023b), multi-lingual scenarios (Shi et al., 2023b), and knowledge-driven applications (Wang et al., 2023b). Additionally, there has been a surge in the development of other chain-of-X methods, addressing diverse challenges in LLM applications. These include chain-of-explanation (Huang et al., 2023), chain-of-knowledge (Wang et al., 2023a), chain-of-verification (Dhuliawala et al., 2023) and IR chain-of-thought (Trivedi et al., 2023). For instance, Chain-of-Verification (Dhuliawala et al., 2023) generates an initial response, formulates verification questions, and revises the response based on these questions, reducing factual errors and hallucinations in the response. Closely related to our work is IR chain-of-thought (Trivedi et al., 2023), which employs CoT to infer and supplement unretrieved information, thereby improving the accuracy of complex reasoning tasks. While chain-of-X approaches have shown promise in enhancing LLMs’ performance across various domains, their application in RALMs, particularly for improving robustness in noisy and unknown scenarios, is relatively unexplored. This gap indicates the potential for further research in applying these strategies to augment RALMs, thereby enhancing their robustness and reliability.

3 PROPOSED METHOD

3.1 OVERVIEW

In this section, we introduce Chain of Note, an innovative advancement for retrieval-augmented language models (RALMs). Specifically, the CON framework generates sequential reading notes for the retrieved documents, which enables a systematic evaluation of the relevance and accuracy of information retrieved from external documents. By creating sequential reading notes, the model not only assesses the pertinence of each document to the query but also identifies the most critical and reliable pieces of information within these documents. This process helps in filtering out irrelevant or less trustworthy content, leading to more accurate and contextually relevant responses.

Figure 2: Illustration of the Chain of Note (CON) framework with three distinct types of reading notes. Type (a) depicts the scenario where the language model identifies a document that directly answers the query, leading to a final answer formulated from the retrieved information. Type (b) represents situations where the retrieved document, while not directly answering the query, provides contextual insights, enabling the language model to integrate this context with its inherent knowledge to deduce an answer. Type (c) illustrates instances where the language model encounters irrelevant documents and lacks the necessary knowledge to respond, resulting in an “unknown” answer. This figure exemplifies the CoN framework’s capability to adaptively process information, balancing direct information retrieval, contextual inference, and the recognition of its knowledge boundaries.

3.2 BACKGROUND OF EXISTING RALMS

RALMs signify a transformative development in language models, enhancing their output by incorporating external knowledge. These models operate by introducing an auxiliary variable, denoted as \(d\), which represents retrieved documents. This inclusion allows them to consider a range of possible documents, thereby producing responses that are more informed and precise (Lazaridou et al., 2022; Shi et al., 2023c). The RALM models can be represented as \(p(y \mid x) = \sum_i p(y \mid d_i, x)\, p(d_i \mid x)\). Here, \(x\) represents the input query, and \(y\) signifies the model’s generated response. In practice, it is infeasible to compute the sum over all possible documents due to the vast number of potential sources. Consequently, the most common approach involves approximating the sum over \(d\) using the \(k\) highest-ranked documents and providing all these documents as part of the input (a minimal code sketch of this standard top-k pipeline follows the list below). We assume, w.l.o.g., that these documents are \([d_1, \dots, d_k]\), yielding \(p(y \mid x) = \sum_{i=1}^k p(y \mid d_i, x)\, p(d_i \mid x)\). However, the existing RALMs suffer from several limitations:

  • Risk of Surface-Level Processing: When directly generating an answer, language models might rely on surface-level information without deep comprehension. Thus, language models could easily overlook the nuances of the question or the documents, particularly for complex or indirect questions.
  • Difficulty in Handling Contradictory Information: When faced with documents containing contradictory information, directly generating an answer becomes challenging. The model may struggle to resolve these contradictions or to determine which piece of information is more credible or relevant.
  • Reduced Transparency and Interpretability: Direct answer generation offers limited insight into how the model arrived at its conclusion. This lack of transparency makes it challenging for users to understand the basis of the model’s conclusions.
  • Overdependence on Retrieved Documents: Direct generation can lead to an overreliance on the content of the retrieved documents (i.e., a tendency to extract information from retrieved documents (Shi et al., 2023a)), ignoring the model’s inherent knowledge base. This can be particularly limiting when the retrieved documents are noisy or out-of-date.
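
As a concrete illustration of the standard setup described above, here is a minimal sketch of a top-k RALM pipeline; `retrieve_topk` and `generate` are hypothetical stand-ins for a DPR-style retriever and a LLaMA-2 reader, and the prompt template is an assumption rather than the paper's exact format.

```python
# Minimal sketch of a standard top-k RALM pipeline approximating
# p(y|x) = sum_i p(y|d_i, x) p(d_i|x) by feeding the k highest-ranked
# documents jointly to the reader. `retrieve_topk` and `generate` are
# hypothetical stand-ins for a DPR retriever and a LLaMA-2 reader.
from typing import Callable, List, Tuple

def ralm_answer(
    question: str,
    retrieve_topk: Callable[[str, int], List[Tuple[str, float]]],  # -> [(doc, score)]
    generate: Callable[[str], str],                                 # LM decoding
    k: int = 5,
) -> str:
    docs = retrieve_topk(question, k)            # approximate the sum with top-k documents
    context = "\n\n".join(f"Passage {i + 1}: {d}" for i, (d, _) in enumerate(docs))
    prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
    return generate(prompt)                      # reader produces the final answer y
```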

3.3 THE CHAIN OF NOTE FRAMEWORK

The Chain of Note (CON) framework presents a solution to the challenges faced by retrieval-augmented language models (RALMs). This framework significantly enhances the ability of RALMs to critically assess retrieved documents through a structured note-taking process. Specifically, it involves generating concise and contextually relevant summaries or notes for each document. This method allows the model to systematically evaluate the relevance and accuracy of information drawn from external documents. By creating sequential reading notes, CON not only assesses the pertinence of each document to the query but also pinpoints the most reliable information and resolves conflicting information. This approach effectively filters out irrelevant or less trustworthy content, leading to responses that are both more accurate and contextually relevant.

Given an input question $x$ and $k$ retrieved documents $[d_1, \cdots, d_k]$, the model aims to generate textual outputs comprising multiple segments $[y_{d_1}, \cdots, y_{d_k}, y]$. Here, $y_{d_i}$ signifies the tokens for the $i$-th segment, representing the reading note for the corresponding document $d_i$, as shown in Figure 2. After generating individual reading notes, the model synthesizes the information to create a consolidated final response $y$. The implementation of the Chain of Note (CON) involves three key steps: (1) designing the notes $y_{d_i}$, (2) collecting the data, and (3) training the model.
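
To make this output structure concrete, the following is a small sketch of how a CoN-style output $[y_{d_1}, \cdots, y_{d_k}, y]$ could be parsed from generated text; the "Passage i note:" and "Final answer:" markers are illustrative assumptions, not the exact format used in the paper.

```python
# Sketch of parsing a CoN-style generation into k reading notes plus a final
# answer. The segment markers below are assumptions for illustration only.
import re
from typing import Dict, List

def parse_con_output(text: str, k: int) -> Dict[str, object]:
    notes: List[str] = []
    for i in range(1, k + 1):
        m = re.search(
            rf"Passage {i} note:(.*?)(?=Passage {i + 1} note:|Final answer:|$)",
            text,
            flags=re.S,
        )
        notes.append(m.group(1).strip() if m else "")
    ans = re.search(r"Final answer:(.*)$", text, flags=re.S)
    return {"notes": notes, "answer": ans.group(1).strip() if ans else ""}
```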

3.3.1 NOTES DESIGN

The framework primarily constructs three types of reading notes, as shown in Figure 2, based on the relevance of the retrieved documents to the input question: First, when a document directly answers the query, the model formulates the final response based on this information, as shown in Figure 2(a). Second, if the retrieved document does not directly answer the query but provides useful context, the model leverages this information along with its inherent knowledge to deduce an answer, as shown in Figure 2(b). Third, in cases where the retrieved documents are irrelevant, and the model lacks sufficient knowledge to answer, it defaults to responding with “unknown”, as shown in Figure 2(c). This nuanced approach mirrors human information processing, striking a balance between direct retrieval, inferential reasoning, and the acknowledgment of knowledge gaps.

3.3.2 DATA COLLECTION

To equip the model with the ability to generate such reading notes, it’s essential to gather appropriate training data. Manual annotation for each reading note is resource-intensive, so we employ a state-of-the-art language model – ChatGPT – to generate the notes data. This method is both cost-effective and enhances reproducibility. We initiate this process by randomly sampling 10k questions from the NQ (Kwiatkowski et al., 2019) training dataset. ChatGPT is then prompted with specific instructions and in-context examples for the three distinct types of note generation (detailed in Appendix A.3). The quality of ChatGPT’s predictions is subsequently assessed through human evaluation on a small subset of the data before proceeding to the entire set. The NQ dataset is chosen as our primary dataset due to its diverse range of real user queries from search engines. However, to ensure the model’s adaptability, we also test its performance on three additional open-domain datasets, including TriviaQA, WebQ, and RealTimeQA, showing its generalization capabilities to out-of-domain data.
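
A hedged sketch of what this collection step might look like with the official `openai` Python client is shown below; the instruction text, field names, and model identifier are assumptions, since the paper's exact prompts and in-context examples are given in its Appendix A.3.

```python
# Rough sketch of the note-collection step: sample NQ questions and ask a
# ChatGPT model to write per-document reading notes plus a final answer.
# Prompt wording, field names, and the model id are illustrative assumptions.
import random
from openai import OpenAI  # official openai-python client (>= 1.0)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

INSTRUCTION = (
    "Read the passages, write one short note per passage assessing its "
    "relevance to the question, then give a final answer (or 'unknown')."
)

def collect_notes(nq_examples, n_samples=10_000, model="gpt-3.5-turbo"):
    records = []
    for ex in random.sample(nq_examples, n_samples):
        passages = "\n\n".join(
            f"Passage {i + 1}: {d}" for i, d in enumerate(ex["documents"])
        )
        prompt = f"{INSTRUCTION}\n\n{passages}\n\nQuestion: {ex['question']}"
        resp = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
        records.append({
            "question": ex["question"],
            "documents": ex["documents"],
            "target": resp.choices[0].message.content,  # notes + final answer
        })
    return records
```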

3.3.3 MODEL TRAINING

After collecting 10K training examples from ChatGPT, the next step is to use them to train our Chain of Note model, which is based on an open-source LLaMA-2 7B (Touvron et al., 2023) model. To do this, we concatenate the instruction, question, and documents as a prompt and train the model to generate the notes and the answer in a standard supervised way. Our in-house LLaMA-2 7B model learns to sequentially generate reading notes for each document to assess their relevance to the input query. Responses are generated based on the documents’ relevance, enhancing accuracy and reducing misinformation. For irrelevant documents, the model either relies on inherent knowledge for an answer or responds with “unknown” if the answer cannot be determined.

Weighted Loss on Notes and Answers. A unique aspect of our training approach is the implementation of a weighted loss strategy. This involves varying the loss weights assigned to reading notes and answers. In our preliminary studies, we observed that assigning equal loss to both components can reduce the quality of the final answer and prolong the training time to convergence. This issue arises mainly because notes, being lengthier, contribute disproportionately to the loss. To overcome this drawback, we alternate the focus of the loss function: 50% of the time, the next-token prediction loss is computed on the entire notes-and-answer sequence $[y_{d_1}, \cdots, y_{d_k}, y]$, and the remaining 50% of the time, the next-token prediction loss is calculated solely on the answer $y$. This strategy is designed to ensure that while the model learns to generate contextually rich reading notes, the primary focus remains on the accuracy and reliability of the final answer.
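
A minimal sketch of this alternating loss is shown below, assuming a standard causal-LM fine-tuning setup in PyTorch where label positions set to -100 are ignored by the cross-entropy loss; the tensor layout and the `answer_start` index are illustrative assumptions.

```python
# Alternating loss masking: 50% of the time train on the full notes+answer
# span, 50% of the time only on the answer tokens. Positions labeled -100 are
# ignored by PyTorch/HF cross-entropy. Assumed layout:
#   input_ids = [prompt tokens | note tokens | answer tokens]
import random
import torch

def build_labels(input_ids: torch.Tensor, prompt_len: int, answer_start: int) -> torch.Tensor:
    labels = input_ids.clone()
    labels[:prompt_len] = -100                 # never compute loss on the prompt
    if random.random() < 0.5:
        return labels                          # loss over notes + answer
    labels[prompt_len:answer_start] = -100     # loss over the answer only
    return labels
```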

4 EXPERIMENTS

4.1 EXPERIMENTAL SETTINGS AND EVALUATIONS

4.1.1 DATASETS AND SPLITS

We conducted comprehensive experiments using three benchmark datasets in open-domain question answering (QA): NQ (Kwiatkowski et al., 2019), TriviaQA (Joshi et al., 2017), and WebQ (Berant et al., 2013), with further details provided in Appendix A.1. Additionally, we employed RealTimeQA (Kasai et al., 2023) as a special case to evaluate “unknown” robustness.

Datasets | Full Size | IR Recall | Subset Size
NQ | 3,610 | 73.82% | 2,086
TriviaQA | 7,993 | 89.95% | 7,074
WebQ | 2,032 | 64.22% | 1,231
Table 1: Dataset statistics. The recall evaluation is based on DPR retrieval on the full test set.

The evaluation was conducted on two evaluation sets: a full-set evaluation and a subset evaluation. First, akin to traditional open-domain QA evaluation, we assessed the models using all questions from the test set to evaluate overall QA performance. The documents were retrieved using DPR, and the top-$k$ documents were fed into the generator. We adhered to the same test splits for the open-domain QA setting as used by Izacard & Grave (2021); Karpukhin et al. (2020). For TriviaQA, evaluations from LLaMA-2 (Touvron et al., 2023) were conducted on the Wikipedia dev set comprising 7,993 examples, so we follow the same evaluation on this dev set to facilitate comparison with their performance.

Second, to assess the model’s noise robustness and unknown robustness, we extracted subsets from the above test sets that contained relevant documents in the retrieved list. We then enumerated each retrieved document to determine whether it was a golden document for the given question. Based on the noise ratio $r$, if the top-$k$ documents are needed for the generator, then $k \cdot r$ is the number of noisy documents and $k \cdot (1 - r)$ is the number of relevant documents. For example, with a noise ratio of 20% and top-5 documents, 4 documents are relevant and 1 is irrelevant. During the enumeration of the retrieved documents, we populated two lists; when one list reached its limit, we stopped adding documents to that list until both lists were complete. In instances where DPR retrieves no relevant documents for a question, we exclude that question from the robustness evaluation. Therefore, the subset is smaller than the original test set, as shown in Table 1.
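
The subset construction can be summarized with the sketch below; `is_gold` is an assumed predicate deciding whether a retrieved document answers the question, and the bucket sizes follow the $k \cdot r$ / $k \cdot (1 - r)$ split described above.

```python
# Build a k-document context with a fixed noise ratio r: walk the retrieved
# list in rank order, filling a relevant bucket of size k*(1-r) and a noisy
# bucket of size k*r, and stop adding to a bucket once it is full. Questions
# with no relevant retrieved documents are excluded (return None).
from typing import Callable, List, Optional

def build_noisy_context(
    retrieved: List[str],
    is_gold: Callable[[str], bool],   # assumed relevance check
    k: int = 5,
    noise_ratio: float = 0.2,
) -> Optional[List[str]]:
    n_noisy = round(k * noise_ratio)
    n_rel = k - n_noisy
    relevant: List[str] = []
    noisy: List[str] = []
    for doc in retrieved:
        gold = is_gold(doc)
        bucket, limit = (relevant, n_rel) if gold else (noisy, n_noisy)
        if len(bucket) < limit:
            bucket.append(doc)
        if len(relevant) == n_rel and len(noisy) == n_noisy:
            break
    if len(relevant) < n_rel:         # no (or too few) gold documents retrieved
        return None
    return relevant + noisy
```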

4.1.2 BASELINE METHODS

For a fair comparison, we trained all models using the same training set, with the main difference being the input and output formats. As outlined in the methods section, we denote an input question as \(x\) and its corresponding answer as \(y\); \(d_i\) represents the \(i\)-th retrieved document, and \(y_{d_i}\) is the associated reading note for that document. The compared methods differ as follows:

  • LLaMA-2 w/o IR: This model is trained to directly generate an answer from the input question, without relying on any external retrieved information. Essentially, it learns the function \(f: x \rightarrow y\), transforming a given question \(x\) directly into an answer \(y\).
  • DPR + LLaMA-2: This approach trains the model to generate an answer not only from the question but also by incorporating retrieved documents. It learns the function \(f: \{x, d_1, \cdots, d_k\} \rightarrow y\), transforming the question \(x\) and a set of retrieved documents \(\{d_1, \cdots, d_k\}\) into an answer \(y\).
  • DPR + LLaMA-2 with Chain of Note: In this model, the training process involves generating reading notes for each retrieved document before formulating the final answer. It learns the function \(f: \{x, d_1, \cdots, d_k\} \rightarrow \{y_{d_1}, \cdots, y_{d_k}, y\}\), enabling the model to process the question \(x\) and retrieved documents \(\{d_1, \cdots, d_k\}\) to produce reading notes \(\{y_{d_1}, \cdots, y_{d_k}\}\) and the final answer \(y\).
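
For illustration, the three settings roughly correspond to prompt layouts like the following sketch; the templates are assumptions, not the paper's exact wording.

```python
# Illustrative prompt builders for the three baselines; templates are assumed.
from typing import List

def fmt_no_ir(x: str) -> str:
    # LLaMA-2 w/o IR: f(x) -> y
    return f"Question: {x}\nAnswer:"

def fmt_ralm(x: str, docs: List[str]) -> str:
    # DPR + LLaMA-2: f(x, d_1..d_k) -> y
    ctx = "\n\n".join(f"Passage {i + 1}: {d}" for i, d in enumerate(docs))
    return f"{ctx}\n\nQuestion: {x}\nAnswer:"

def fmt_con(x: str, docs: List[str]) -> str:
    # DPR + LLaMA-2 with Chain of Note: f(x, d_1..d_k) -> y_{d_1}..y_{d_k}, y
    ctx = "\n\n".join(f"Passage {i + 1}: {d}" for i, d in enumerate(docs))
    return (f"{ctx}\n\nQuestion: {x}\n"
            "Write a reading note for each passage, then give the final answer:")
```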

Model | Dataset | EM | EM (+ Chain of Note) | F1 | F1 (+ Chain of Note)
LLaMA-2 w/o IR | NQ | 28.80 | - | 37.53 | -
LLaMA-2 w/o IR | TriviaQA | 63.19 | - | 68.61 | -
LLaMA-2 w/o IR | WebQ | 28.30 | - | 42.77 | -
LLaMA-2 w/o IR | Average | 35.98 | - | 44.27 | -
DPR + LLaMA-2 | NQ | 47.39 | 48.92 (+1.53) | 55.81 | 57.53 (+1.72)
DPR + LLaMA-2 | TriviaQA | 74.92 | 76.27 (+1.35) | 81.53 | 82.25 (+0.72)
DPR + LLaMA-2 | WebQ | 29.58 | 32.33 (+2.75) | 43.51 | 46.68 (+3.17)
DPR + LLaMA-2 | Average | 48.49 | 50.46 (+1.97) | 56.97 | 58.78 (+1.81)

Table 2: Overall QA Performance on the entire test sets. Equipped with the same retrieved documents, our Chain of Note outperforms the standard RALM system on three open-domain QA datasets.

Model | Noise Ratio | Dataset | EM | EM (+ Chain of Note) | F1 | F1 (+ Chain of Note)
LLaMA-2 w/o IR | - | NQ | 42.89 | - | 49.44 | -
LLaMA-2 w/o IR | - | TriviaQA | 67.76 | - | 72.80 | -
LLaMA-2 w/o IR | - | WebQ | 40.29 | - | 56.44 | -
LLaMA-2 w/o IR | - | Average | 50.31 | - | 59.56 | -
DPR + LLaMA-2 | 100% | NQ | 34.28 | 41.83 (+7.55) | 54.28 | 56.63 (+2.35)
DPR + LLaMA-2 | 100% | TriviaQA | 61.44 | 63.43 (+1.99) | 64.62 | 65.91 (+1.29)
DPR + LLaMA-2 | 100% | WebQ | 55.30 | 64.30 (+9.00) | 73.83 | 75.89 (+2.06)
DPR + LLaMA-2 | 100% | Average | 61.67 | 70.00 (+8.33) | 80.02 | 81.24 (+1.22)
DPR + LLaMA-2 | 0% | NQ | 29.58 | 36.85 (+7.27) | 35.46 | 40.60 (+5.14)
DPR + LLaMA-2 | 0% | TriviaQA | 39.72 | 47.66 (+7.94) | 54.52 | 57.70 (+3.18)
DPR + LLaMA-2 | 0% | WebQ | 49.92 | 57.55 (+7.63) | 64.58 | 67.00 (+2.42)
DPR + LLaMA-2 | 0% | Average | 39.72 | 47.66 (+7.94) | 54.52 | 57.70 (+3.18)

Table 3: Evaluation on Noise Robustness. The Chain of Note framework shows superior performance compared to the standard RALM system, particularly notable at higher noise ratios.

4.1.3 EVALUATION METRICS

For the evaluation of open-domain QA performance, we employ two widely recognized metrics: Exact Match (EM) and F1 score, as suggested by prior work (Chen et al., 2017; Karpukhin et al., 2020; Zhu et al., 2021). For the EM score, an answer is deemed correct if its normalized form – obtained through the normalization procedure delineated by Karpukhin et al. (2020) – corresponds to any acceptable answer in the provided list. Similar to the EM score, the F1 score treats the prediction and ground truth as bags of tokens and computes the average overlap between the prediction and the ground-truth answer (Chen et al., 2017). In addition, we use the reject rate (RR) to evaluate unknown robustness when the given questions are beyond a language model’s knowledge scope.
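
For reference, a minimal sketch of these metrics following the usual normalization recipe (lowercasing, stripping punctuation and articles) is given below; it is not the exact evaluation script used in the paper.

```python
# SQuAD-style EM and token-level F1 over a list of acceptable gold answers.
import re
import string
from collections import Counter
from typing import List

def normalize(s: str) -> str:
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(pred: str, golds: List[str]) -> float:
    return float(any(normalize(pred) == normalize(g) for g in golds))

def f1(pred: str, golds: List[str]) -> float:
    def score(p: str, g: str) -> float:
        p_toks, g_toks = normalize(p).split(), normalize(g).split()
        common = Counter(p_toks) & Counter(g_toks)
        overlap = sum(common.values())
        if overlap == 0:
            return 0.0
        prec, rec = overlap / len(p_toks), overlap / len(g_toks)
        return 2 * prec * rec / (prec + rec)
    return max(score(pred, g) for g in golds)
```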

4.2 EVALUATION ON OVERALL QA PERFORMANCE

In our evaluation, we compared our method and various baselines across three open-domain QA benchmarks, as detailed in Table 2. We noted that RALM (DPR + LLaMA-2) with retrieval functionality consistently outperformed LLaMA-2 without retrieval. This improvement is closely tied to the effectiveness of the retrieval process. As indicated in Table 1, DPR demonstrates markedly superior retrieval performance on the NQ and TriviaQA datasets compared to WebQ. Consequently, the benefits of retrieval are more pronounced on NQ and TriviaQA.

Furthermore, when comparing our enhanced RALM, which integrates CON, with the standard RALM, our method consistently shows better performance. There is an average improvement of +1.97 in EM scores across all three datasets. Delving deeper, we find that this improvement varies depending on whether DPR successfully retrieves relevant documents. Specifically, on the NQ dataset, the average improvement is +1.2 when DPR retrieves relevant documents and +2.3 when it does not. This disparity suggests that CON improves the RALM particularly in scenarios where noisier documents are fetched in the first retrieval stage. This observation aligns with our findings on noise robustness, which are elaborated upon in the subsequent sections detailing our experimental results.

Figure 3: Evaluation on Noise Robustness with two different scenarios: noisy documents obtained through retrieval and completely random documents sampled from the entire Wikipedia.

4.3 EVALUATION ON NOISE ROBUSTNESS

Our evaluation of noise robustness was carried out under two scenarios: using noisy documents obtained through retrieval (by removing relevant documents from the retrieved sets and retaining the top-ranked irrelevant ones) and using completely random documents sampled from the entire Wikipedia. Noisy retrieved documents often contain misleading information due to their semantic similarity to the input question, contrasting with random documents which represent total noise.

Table 3 shows that the RALM enhanced with CON consistently outperforms the standard RALM, especially in scenarios with exclusively noisy documents. An average improvement of +7.9 in EM score on fully noisy documents is observed across the three open-domain QA datasets. Experiments with lower noise ratios also consistently demonstrate the improvements brought by CON, aligning the overall performance with that presented in Table 2. We observed that when presented with entirely noisy documents, both the standard RALM and our CON performed worse than the original LLaMA-2 without IR. This suggests that RALMs can be misled by noisy information, leading to more hallucinations. However, our model performs almost as well as the original LLaMA-2 without IR, indicating its noise robustness and its capability to ignore irrelevant information.

Furthermore, our comparison with random noise revealed several important observations. Figure 3 illustrates that both standard RALM and RALM with CON perform better with random documents than with noisy retrieved ones. This indicates that semantically relevant noisy documents are more likely to mislead the language model into producing incorrect information. Moreover, in both noisy scenarios, our method shows enhanced robustness compared to the standard RALM.

4.4 EVALUATION ON UNKNOWN ROBUSTNESS


Table 4 illustrates that our RALM equipped with CON exhibits superior robustness in handling unknown scenarios, particularly evident on the RealTimeQA benchmark. This benchmark falls completely outside the model’s domain and contains real-time information that was not part of the LLaMA-2 pre-training data. Despite this, models are still capable of providing correct answers in some cases, as the answers remain consistent over time. In comparison to the standard RALM system, our method shows a significant improvement, exceeding +10.5 in its ability to reject answering questions in unknown scenarios. The evaluation is based on the reject rate (RR), i.e., the number of rejected questions divided by the total number of questions. This highlights our model’s enhanced capability to discern and disregard information that is unfamiliar or not learned during its initial training phase.
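
The reject rate itself is simple to compute; below is a tiny sketch with an assumed rule for detecting an “unknown”-style response.

```python
# Reject rate (RR) = rejected questions / total questions. Treating any
# prediction containing "unknown" as a rejection is an assumed heuristic.
from typing import List

def reject_rate(predictions: List[str]) -> float:
    if not predictions:
        return 0.0
    rejected = sum("unknown" in p.lower() for p in predictions)
    return rejected / len(predictions)
```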

Table 4: Evaluation on unknown robustness on the RealTimeQA benchmark, measured by the reject rate (RR).

Table 5: Case Study. Compared to Standard RALM, our RALM with Chain of Note exhibits a deeper understanding of how documents reveal information relevant to the question. It goes beyond merely capturing surface-level terms, leading to more accurate responses.

4.5 CASE STUDIES

In our case studies, as illustrated in Table 5, we compare the responses generated by the standard RALM and our enhanced RALM with CON. These examples highlight the differences in how each model processes and interprets information from retrieved documents.

In the first case study, the question pertains to the most recent Summer Olympics held in the USA. The standard RALM is misled by the mention of “Chicago’s bid for the 2016 Summer Olympics.” Lacking a deep comprehension of the content, it incorrectly focuses on the more recent year (2016), resulting in an inaccurate answer. In contrast, the RALM with CON carefully analyzes the information. It notes that while Chicago bid for the 2016 Olympics, there is no confirmation that the bid was successful. This leads to the correct conclusion that the most recent Summer Olympics in the USA were held in 1996.

The second case study involves identifying the language of the first Jnanpith Award recipient. Here, the standard RALM fails to synthesize information across documents. It identifies G. Sankara Kurup as the award recipient but does not connect this information to the language of his work. Conversely, the RALM with CON effectively combines details from both documents. It recognizes that while the first document mentions Kurup’s award, the second document provides the missing language detail, leading to the correct answer of Malayalam. Both cases demonstrate the superior capability of our CON in understanding and integrating information from multiple sources. Unlike the standard RALM, which often grasps only surface-level details, our model delves deeper, discerning more nuanced and contextually relevant information to arrive at accurate conclusions.

5 CONCLUSION

In this paper, we introduce the CHAIN-OF-NOTING (CON) framework, a novel methodology designed to enhance the robustness of RALMs. The central concept of CON revolves around the generation of sequential reading notes for each retrieved document. This process allows for an in-depth assessment of document relevance to the posed question and aids in synthesizing this information to craft the final answer. We utilized ChatGPT to generate the training data for CON, which was then used to train a LLaMA-2 7B model. Our tests across various open-domain QA benchmarks reveal that RALMs integrated with CON considerably surpass traditional RALMs in performance.
