
RAG vs Long Context

  • Related Project: Private
  • Category: Paper Review
  • Date: 2023-10-05

Retrieval meets Long Context Large Language Models

  • url: https://arxiv.org/abs/2310.03025
  • pdf: https://arxiv.org/pdf/2310.03025
  • abstract: Extending the context window of large language models (LLMs) is getting popular recently, while the solution of augmenting LLMs with retrieval has existed for years. The natural questions are: i) Retrieval-augmentation versus long context window, which one is better for downstream tasks? ii) Can both methods be combined to get the best of both worlds? In this work, we answer these questions by studying both solutions using two state-of-the-art pretrained LLMs, i.e., a proprietary 43B GPT and LLaMA2-70B. Perhaps surprisingly, we find that LLM with 4K context window using simple retrieval-augmentation at generation can achieve comparable performance to finetuned LLM with 16K context window via positional interpolation on long context tasks, while taking much less computation. More importantly, we demonstrate that retrieval can significantly improve the performance of LLMs regardless of their extended context window sizes. Our best model, retrieval-augmented LLaMA2-70B with 32K context window, outperforms GPT-3.5-turbo-16k and Davinci003 in terms of average score on seven long context tasks including question answering and query-based summarization. It also outperforms its non-retrieval LLaMA2-70B-32k baseline by a margin, while being much faster at generation. Our study provides general insights on the choice of retrieval-augmentation versus long context extension of LLM for practitioners.


TL;DR


  1. A study on improving question answering and summarization with large language models.
  2. A comparison of retrieval augmentation and context-window extension via positional interpolation.
  3. Evaluation and analysis of model performance across a variety of datasets and benchmarks.

1 Introduction

Large language models (LLMs) have recently drawn attention in production, research, and the open-source community. Retrieval augmentation has long been studied as an alternative way to cope with the computational complexity of the self-attention mechanism. This paper investigates whether retrieval augmentation can improve LLM performance and how the two approaches can be combined for even higher accuracy. The authors conduct a comprehensive study on seven downstream long-context tasks using two state-of-the-art LLMs, a proprietary 43B GPT and LLaMA2-70B.


2 Related Work

2.1 Long Context Large Language Models

Thanks to advances in GPUs and memory-efficient exact attention, the context window size of LLMs has steadily grown. However, extending the context window further is difficult because of computational complexity. Recent work extends the context window of LLMs through continued training or fine-tuning.

2.2 Efficient Attention Methods

Prior work introduced a variety of approximate attention methods to deal with the computational complexity of self-attention. More recently, FlashAttention has accelerated exact attention computation by optimizing reads and writes between levels of GPU memory.

2.3 Retrieval-Augmented Language Models

Retrieval augmentation has been studied for a long time as a way to improve language models. By having a standalone retriever select only the relevant context for the LLM to read, processing becomes faster, and the approach can be viewed as applying a sparse attention pattern over the long context.


3 Experimental Setup

3.1 Large Language Models

The study uses two pretrained GPT models to compare their ability to integrate long-context information via retrieval augmentation versus their own self-attention mechanism. Both models are larger than 40B parameters, a scale at which instruction tuning becomes effective.

3.2 Datasets and Metrics

The models are evaluated on seven datasets covering document QA and query-based summarization, which allows assessing the effect of retrieval augmentation on long-context performance.

3.3 Context Window Extension

Using the positional interpolation method, the 4K context windows of GPT-43B and LLaMA2-70B are extended to 16K and 32K, respectively. This is a simple yet effective approach for models with RoPE embeddings.

3.4 Retrieval

Three retrievers are tested: Dragon, Contriever, and OpenAI embeddings. These retrievers encode the question and the context chunks independently, and the most relevant chunks are selected and concatenated as the context of the prompt.


4 Results

4.1 Main Results

Retrieval augmentation greatly improves the performance of 4K LLMs, and retrieval-augmented long-context LLMs (e.g., 16K and 32K) achieve even better results than the retrieval-augmented 4K models. This suggests that retrieval can strengthen the long-context handling ability of LLMs.

4.2 Comparison with OpenAI Models

The retrieval-augmented LLaMA2-70B-32k outperforms several OpenAI models, showing that retrieval augmentation can help build strong models for long-context tasks.

4.3 Impact of Retrievers

Comparing LLaMA2-70B with different retrievers shows that retrieval augmentation consistently improves performance across all context lengths. Publicly available retrievers perform better than the commercial OpenAI embeddings.

4.4 Effect of Increasing the Number of Retrieved Chunks

Retrieving more chunks tends to improve performance, but adding too many chunks can hurt it, because the model may miss information in the middle of the context.

This study systematically explores how retrieval augmentation affects LLMs on long-context tasks and evaluates model performance across multiple datasets and benchmarks. Retrieval augmentation substantially improves the models' long-context ability, and the gains are especially pronounced for larger models.


1 INTRODUCTION

Long context large language models (LLMs) have recently received a lot of attention in production (e.g., Anthropic, 2023; OpenAI, 2023b), the research community (e.g., Chen et al., 2023; Liu et al., 2023; Tworkowski et al., 2023), and the open source community (e.g., Kaiokendev, 2023). Although approximate attention methods have been studied for years (e.g., Tay et al., 2022) due to the quadratic time and memory complexity of the self-attention mechanism in sequence length, the recent advance of long context LLMs with exact attention is mainly driven by the development of faster GPUs with more memory and memory-efficient exact attention (Dao et al., 2022; Dao, 2023).

An alternative and long-standing solution for handling long context is retrieval. Specifically, the LLM only reads relevant context retrieved from a standalone retriever (e.g., Karpukhin et al., 2020; Wang et al., 2022; Lin et al., 2023), which is much easier to scale and runs orders of magnitude faster than the LLM at selecting relevant context. Conceptually, the retrieval-augmented decoder-only LLM can be viewed as applying sparse attention over its long context window, where the sparsity pattern is not predefined as in Child et al. (2019) but determined by the standalone retriever. In other words, unretrieved context is treated as irrelevant and has zero-valued attention weights.

Given the surge of interest in long context LLM research and the much greater computation required at inference, it is still unclear to practitioners whether extending the context window of an LLM provides higher accuracy than retrieval augmentation for downstream tasks with informative queries. Moreover, it would be compelling if we could combine the strengths of both methods and achieve even higher accuracy. In this work, we attempt to answer these questions through a comprehensive study. Specifically, we make the following contributions:

  1. We perform a comprehensive study using two state-of-the-art LLMs, a proprietary 43B pretrained GPT and LLaMA2-70B (Touvron et al., 2023b), on 7 downstream long context tasks, including single and multi document question answering (QA) as well as query-based summarization.
  2. We demonstrate that retrieval augmentation significantly improves the performance of 4K context LLMs. Perhaps surprisingly, we find this simple retrieval-augmented baseline can perform comparably to 16K long context LLMs, i.e., average score 29.32 vs. 29.45 using GPT-43B and 36.02 vs. 36.78 using LLaMA2-70B, while using much less computation.
  3. Furthermore, we demonstrate that the performance of long context LLMs (i.e., 16K or 32K) can still be improved by retrieval, especially for the larger LLaMA2-70B. As a result, our best model, retrieval-augmented LLaMA2-70B-32k-ret with a 32K context window (avg. score 43.6), outperforms GPT-3.5-turbo-16k (avg. score 42.8) and Davinci-003 in terms of average score. It also largely outperforms its non-retrieval LLaMA2-70B-32k baseline (avg. score 40.9), while being much faster at generation (e.g., 4× faster on NarrativeQA).

We organize the rest of the paper as follows. We discuss related work in Section 2, and present the experimental setup in Section 3. We report results in Section 4 and conclude the paper in Section 5.

2 RELATED WORK

In this section, we discuss related work on long context LLMs, efficient attention methods, and retrieval-augmented language models.

2.1 LONG CONTEXT LARGE LANGUAGE MODELS

  • Over the past few years, pretraining large language models (LLMs) with a long context window has become a viable solution thanks to faster GPUs with more memory and memory-efficient exact attention (e.g., Dao et al., 2022). For example, the context window of pretrained LLMs has increased from 1024 in GPT-2 (Radford et al., 2019) and 2048 in GPT-3 (Brown et al., 2020), to 4096 in Llama 2 (Touvron et al., 2023b) and 8192 in GPT-4 (OpenAI, 2023a). However, further extending the context window in pretraining is challenging because:
    1. pretraining an LLM from scratch with long context (e.g., >16K tokens) is very expensive due to the quadratic time and memory complexity of exact attention, and
    2. most documents in pretraining corpora (e.g., Common Crawl) are relatively short.
  • Most recently, researchers have started to extend the context window of LLMs with continued training or fine-tuning (e.g., Kaiokendev, 2023; Nijkamp et al., 2023; Chen et al., 2023; Tworkowski et al., 2023; Mohtashami & Jaggi, 2023).
  • Tworkowski et al. (2023) introduced LongLLaMA by fine-tuning the 3B and 7B OpenLLaMA checkpoints with contrastive training on 8K context length. Landmark attention (Mohtashami & Jaggi, 2023) extends the context length of LLaMA 7B from 4K to 32K by introducing “landmark tokens” to represent blocks of the context and fine-tuning the attention to use landmark tokens for selecting relevant blocks.
  • Chen et al. (2023) and Kaiokendev (2023) introduced positional interpolation to extend the context window sizes of RoPE-based (Su et al., 2021) pretrained LLMs. In particular, Chen et al. (2023) demonstrates promising results on LLaMA 7B to 65B (Touvron et al., 2023a) with minimal fine-tuning effort (within 1000 steps).
  • ALiBi (Press et al., 2021) extrapolates context window length by removing the positional embeddings while simply biasing the key-query attention scores with a linear penalty that is proportional to their distance, so one does not need fine-tuning for context window extrapolation.
  • Ratner et al. (2023) chunk long context into multiple sub-windows and re-use the positional embeddings across these windows, and can thus handle longer context without any further fine-tuning. In this work, we apply the positional interpolation method to extend the 4K context window of a proprietary 43B pretrained LLM and LLaMA2-70B (Touvron et al., 2023b) to 16K and 32K, as they both use rotary position embeddings.

2.2 EFFICIENT ATTENTION METHODS

  • In previous studies, many approximate attention methods (Tay et al., 2022) have been introduced to deal with the quadratic complexity of self-attention, which becomes a computational bottleneck for long context. They can be grouped into the following categories:
    1. Sparse attention mechanisms with predefined sparsity patterns (e.g., Child et al., 2019; Parmar et al., 2018; Ho et al., 2019; Beltagy et al., 2020; Zaheer et al., 2020; Zhu et al., 2021),
    2. recurrence-based method (Dai et al., 2019; Bulatov et al., 2022),
    3. low-rank projection attention (e.g., Wang et al., 2020; Xiong et al., 2021; Tay et al., 2021; Zhu et al., 2021),
    4. memory-based mechanisms (e.g., Rae et al., 2020; Liu et al., 2018),
    5. similarity and clustering based methods (e.g., Kitaev et al., 2020; Tay et al., 2020; Roy et al., 2021).
  • These approximate methods introduce inductive bias (e.g., predefined sparsity) that can fit well for a specific domain but may reduce model quality in general LLM training.
  • Most recently, FlashAttention (Dao et al., 2022; Dao, 2023) was introduced to speed up the exact attention computation by accounting for reads and writes between levels of GPU memory. FlashAttention is particularly useful for handling longer sequences (see the sketch after this list).
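Outside the paper itself, here is a minimal sketch of how memory-efficient exact attention shows up in practice: PyTorch's scaled_dot_product_attention can dispatch to fused, FlashAttention-style kernels on supported GPUs. The shapes below are arbitrary and a CUDA device is assumed; this is an illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

# Exact causal attention over a 16K-token sequence. On recent GPUs, PyTorch's
# scaled_dot_product_attention can dispatch to fused FlashAttention-style
# kernels that avoid materializing the full seq_len x seq_len score matrix.
batch, heads, seq_len, head_dim = 1, 8, 16384, 128  # arbitrary example shapes
q = torch.randn(batch, heads, seq_len, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # (1, 8, 16384, 128)
```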

2.3 RETRIEVAL-AUGMENTED LANGUAGE MODELS

  • Retrieval has been integrated into language models for years to improve perplexity (Borgeaud et al., 2022; Wang et al., 2023), factual accuracy (Nakano et al., 2021), downstream task accuracy (Guu et al., 2020; Izacard & Grave, 2021; Izacard et al., 2022; Lewis et al., 2020), and in-context learning capability (Huang et al., 2023). Combined with a standalone retriever (Karpukhin et al., 2020; Wang et al., 2022; Lin et al., 2023), retrieval-augmented LLM is well established for handling question answering with long document and in open-domain.
  • In previous studies, language models have been augmented with retrieval at inference (Khandelwal et al., 2019; Yogatama et al., 2021), fine-tuning (Izacard et al., 2022; Lewis et al., 2020; Guu et al., 2020), and pretraining (Borgeaud et al., 2022; Izacard et al., 2022; Wang et al., 2023). There are also methods that try to integrate the LLM and retriever into a single model and build an end-to-end solution (e.g., Jiang et al., 2022; Shi et al., 2023). However, most previous works mainly study retrieval augmentation for LLMs with around 10 billion parameters, except for a few recent ones (e.g., Shi et al., 2023).
  • In this work, we focus on decoder-only LLMs with 43B and 70B parameters trained on trillions of tokens, because the LLMs at such scale exhibit strong zero-shot capability to incorporate context after instruction tuning (Wei et al., 2021; 2022).

2.4 CONCURRENT WORK

  • While preparing this manuscript, we noticed a concurrent work (Bai et al., 2023) (arXived on 28 Aug 2023) that also studies the impact of retrieval on long context LLMs, including black-box model G…

3 EXPERIMENTAL SETUP

In this section, we present the details of our experimental setup.

3.1 LARGE LANGUAGE MODELS

We focus on comparing the zero-shot capability of integrating long context information for generative QA or summarization tasks via retrieval versus the LLM's own self-attention mechanism. In contrast to most existing works that focus on relatively small models (e.g., 3B or 7B) (Kaiokendev, 2023; Nijkamp et al., 2023; Tworkowski et al., 2023; Mohtashami & Jaggi, 2023), we gather insights by exploring model sizes larger than 40B after instruction tuning, as previous studies suggest that instruction tuning becomes effective when the decoder-only LLM has around 50B parameters (Wei et al., 2021; 2022).

Specifically, we experimented with two pretrained GPT models, a proprietary Nemo GPT-43B and LLaMA2-70B. GPT-43B is a 43-billion-parameter model trained on 1.1T tokens, 70% of which is English corpus and the remaining 30% multilingual and code data. For the English pretraining corpus, GPT-43B used the Common Crawl web archive (WARC), Wikipedia, Reddit, Books, Gutenberg, ArXiv, StackExchange, PubMed, etc. It contains 48 layers with a hidden dimension of 8,192 and is trained with a sequence length of 4,096 and RoPE embeddings (Su et al., 2021). LLaMA2-70B is a publicly available 70B GPT model trained on 2T tokens, around 90% of which is English data. It contains 80 layers with a hidden dimension of 8,192, also has a context window size of 4,096, and is trained with RoPE embeddings.

3.2 DATASETS AND METRICS

In this study, we include seven datasets ranging from single-document QA and multi-document QA to query-based summarization for our zero-shot evaluations. Specifically, we include four datasets from the validation set of the SCROLLS benchmark (Shaham et al., 2022).

  • QMSum (QM) (Zhong et al., 2021) is a query-based summarization dataset, consisting of 232 meeting transcripts and their corresponding summaries from multiple domains such as academic and industrial product meetings. Annotators were tasked with writing queries based on the contexts and ensuring that the relevant text span for answering each query contains at least 200 words or 10 turns.
  • Qasper (QASP) (Dasigi et al., 2021) is a question answering dataset over NLP papers filtered from the Semantic Scholar Open Research Corpus (S2ORC) (Lo et al., 2020). Qasper contains abstractive, extractive, and yes/no questions, as well as unanswerable ones.
  • NarrativeQA (NQA) (Kočiský et al., 2018) is an established question answering dataset over entire books from Project Gutenberg and movie scripts from a list of websites. Summaries of the books and scripts obtained from Wikipedia were given to the annotators to produce question-answer pairs, resulting in approximately 30 questions and answers for each of the 1,567 books and scripts. Each question was answered by providing two reference answers.
  • QuALITY (QLTY) (Pang et al., 2022) is a multiple-choice question answering dataset over stories and articles sourced from several resources, such as Project Gutenberg and the Open American National Corpus. 50% of the questions in QuALITY are labeled as hard, ensuring that the whole document must be read carefully to arrive at the correct answer, i.e., skimming the document yields wrong answers.

We take another three datasets from LongBench (Bai et al., 2023).

  • MuSiQue (MSQ) (Trivedi et al., 2022) stands for Multihop Questions via Single-hop Question Composition, aiming at multihop reasoning question answering. A bottom-up process of constructing multihop questions from single-hop questions allows systematic exploration of a large space of multihop candidates and greater control over the composed questions. To generate the answers correctly, LLMs need connected reasoning: the dataset reduces potential reasoning shortcuts, minimizes train-test leakage, and includes harder distractor contexts. Thus, MuSiQue is significantly less cheatable via disconnected reasoning than previous datasets.
  • HotpotQA (HQA) (Yang et al., 2018) is a Wikipedia-based question-answer dataset with several key features. First, multiple supporting documents must be read for answering and reasoning. Second, the questions are diverse and not constrained to any pre-existing knowledge bases. Third, sentence-level supporting facts are provided as strong supervision for the reasoning required of the LLM. Finally, new types of factoid comparison questions test the LLM's ability to extract and compare various entity properties in text.
  • MultiFieldQA-en (MFQA) (Bai et al., 2023) was manually curated to better test the model's long context understanding ability across diverse fields.

The full details of the datasets can be found in Table 1. Our evaluation datasets have a wide range of average document lengths, from 4.9k (QASP) to 84k (NQA). Therefore, for the baseline model without retrieval, we truncate the documents accordingly to fit into the input sequence length.

Following the official metrics, we report the geometric mean of ROUGE scores (i.e., ROUGE1/2/L) (Lin, 2004) for QM, the exact matching (EM) score for QLTY, and F1 scores for the remaining five datasets QASP, NQA, MSQ, HQA, and MFQA.
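To make the QM metric concrete, below is a minimal sketch of the geometric mean of ROUGE-1/2/L F1 using the rouge-score package; the official SCROLLS/LongBench evaluation scripts may differ in preprocessing details such as stemming, so treat this only as an illustration.

```python
from rouge_score import rouge_scorer  # pip install rouge-score


def qmsum_score(prediction: str, reference: str) -> float:
    """Geometric mean of ROUGE-1/2/L F1, as reported for QMSum (QM)."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    scores = scorer.score(reference, prediction)
    r1, r2, rl = (scores[k].fmeasure for k in ("rouge1", "rouge2", "rougeL"))
    return (r1 * r2 * rl) ** (1.0 / 3.0)


print(qmsum_score("the meeting discussed budget cuts",
                  "the meeting was about budget cuts"))
```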

3.3 CONTEXT WINDOW EXTENSION

We extend the context window length with the positional interpolation method (Chen et al., 2023), as it is simple and effective for RoPE embeddings. We extend the 4K context window to 16K for GPT-43B. For LLaMA2-70B, we extend its 4K context window to 16K and 32K. We follow Chen et al. (2023) and finetune both LLMs on the Pile dataset (Gao et al., 2021) with a batch size of 128 and a constant learning rate of 5e-6 to adapt the position embeddings.
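For intuition, here is a minimal sketch of positional interpolation for RoPE (not the authors' code): position indices are scaled down by the ratio of the original to the extended context length, so an extended sequence maps back into the position range seen during pretraining.

```python
import torch


def rope_angles(seq_len: int, head_dim: int, base: float = 10000.0, scale: float = 1.0):
    """RoPE rotation angles; a `scale` < 1 implements positional interpolation."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float() * scale  # scaled-down position indices
    return torch.outer(positions, inv_freq)            # (seq_len, head_dim // 2)


# Extending a model pretrained with a 4K window to a 32K window:
angles_32k = rope_angles(seq_len=32768, head_dim=128, scale=4096 / 32768)
```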

3.4 RETRIEVAL

We experimented with three retrievers: 1) Dragon (Lin et al., 2023), which achieves state-of-the-art results on both supervised and zero-shot information retrieval benchmarks (Thakur et al., 2021); Dragon is a dual encoder model that consists of a query encoder and a context encoder. 2) The widely used Contriever model (Izacard et al., 2021); following the MoCo technique (He et al., 2020), Contriever uses a simple contrastive learning framework to pre-train models for information retrieval; it was trained without supervision and achieved results competitive with BM25 for R@100 on the BEIR benchmark (Thakur et al., 2021). 3) The OpenAI embedding model, for which we use the latest "text-embedding-ada-002" as recommended by OpenAI; it accepts a maximum of 8,191 input tokens per sequence and outputs a vector of 1,536 dimensions. Cosine similarities are then computed between the questions and the list of contexts for retrieval ranking.

To use these retrievers, we first chunk each context document into 300-word chunks, and then encode both the questions and all chunks independently with the corresponding encoders. The N most relevant chunks, ranked by the dot product of the question embedding and chunk embedding, are then concatenated together (ordered left to right from most relevant to least relevant) as the context of the prompt for generation. Table 1 shows the statistics of the top-N retrieved chunks, while Figure 1 gives more details of the token length distribution of all seven datasets. Note that some datasets, like Qasper (QASP), are relatively short and do not have up to 20 chunks, so the average lengths of the top-10 and top-20 chunks are close. We can see that the top-5 chunks all fit into a 4k sequence length (except for a few outliers), while the top-10 and top-20 chunks fit into a 16k sequence length.
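A minimal sketch of this chunk-and-rank pipeline is shown below; encode_query and encode_context are hypothetical stand-ins for a dual-encoder retriever such as Dragon or Contriever, assumed to return fixed-size numpy embeddings.

```python
from typing import Callable, List

import numpy as np


def chunk_document(text: str, chunk_words: int = 300) -> List[str]:
    """Split a document into consecutive ~300-word chunks."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]


def top_n_context(question: str,
                  document: str,
                  encode_query: Callable[[str], np.ndarray],    # hypothetical encoder
                  encode_context: Callable[[str], np.ndarray],  # hypothetical encoder
                  n: int = 5) -> str:
    """Rank chunks by query-chunk dot product and concatenate the top N."""
    chunks = chunk_document(document)
    q_emb = encode_query(question)                          # (d,)
    c_embs = np.stack([encode_context(c) for c in chunks])  # (num_chunks, d)
    scores = c_embs @ q_emb                                 # dot-product ranking
    top_idx = np.argsort(-scores)[:n]                       # most to least relevant
    return "\n\n".join(chunks[i] for i in top_idx)
```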

3.5 INSTRUCTION TUNING

To train the pretrained LLMs to follow instructions for question answering or text summarization, we also performed instruction tuning. We first construct a blend of instruction tuning datasets consisting of 102K training samples from the Soda dataset (Kim et al., 2022), the ELI5 dataset (Fan et al., 2019), the FLAN dataset (Wei et al., 2021), the Open Assistant dataset (Köpf et al., 2023), Dolly (Conover et al., 2023), and a proprietary conversational dataset, to adapt both GPT-43B and LLaMA2-70B to follow instructions. For the template, we use "System: {System}\n\nUser: {Question}\n\nAssistant: {Answer}" as the format to support multi-turn dialogue training. As all of the tasks contain context information to reason over at inference time, we add the context before the dialogue, i.e., "System: {System}\n\n{Context}\n\nUser: {Question}\n\nAssistant: {Answer}". We finetune the LLM by taking the loss only on the {Answer} part, with a batch size of 128 and a learning rate of 5e-6 for 1000 steps. For the rest of the paper, results are all reported using the instruction-tuned chat model on top of the foundational GPT-43B and LLaMA2-70B.
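A minimal sketch of the prompt construction described above; the build_prompt helper and its example arguments are illustrative, and only the template string itself comes from the paper.

```python
def build_prompt(system: str, context: str, question: str, answer: str = "") -> str:
    """Single-turn prompt with the context placed before the dialogue.

    During instruction tuning, the loss is taken only on the {Answer} tokens.
    """
    return (f"System: {system}\n\n{context}\n\n"
            f"User: {question}\n\nAssistant: {answer}")


prompt = build_prompt(
    system="Answer the question based on the given context.",
    context="<retrieved or truncated document text>",
    question="What did the committee decide about the budget?",
)
```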

4 RESULTS

In this section, we report the results and provide detailed analysis.

4.1 MAIN RESULTS

In Table 2, we compare different model variants with context lengths ranging from 4K to as long as 32K using GPT-43B and LLaMA2-70B. First, we find that the baseline models without retrieval and with a 4K sequence length achieve the worst results for both GPT-43B and LLaMA2-70B. This is because the minimum average sequence length of all seven tasks exceeds 4,096, the context window of the foundation models, and therefore valuable text gets truncated. As a result, retrieval is especially helpful for 4K LLMs, e.g., LLaMA2-70B-4K improves from 31.61 to 35.73 while GPT-43B-4K improves from 26.44 to 29.32. Second, we observe that HotpotQA (HQA) especially favors long sequence models, as the score improves from 34.64 to 43.97 for LLaMA2-70B and from 28.91 to 37.48 for GPT-43B when the sequence length increases from 4k to 16k. This is because HotpotQA is a multi-hop dataset where the questions are not hard to answer but all intermediate hops are necessary to get the correct answer. Therefore, a long context is beneficial for increasing the recall of all intermediate hops.

It is quite interesting that the retrieval-augmented long context LLMs (e.g., 16K and 32K) can obtain better results than the retrieval-augmented 4K context LLM, even though they are fed the same top-5 chunks of evidence. We hypothesize that this observation is related to the "lost in the middle" phenomenon (Liu et al., 2023), where LLMs have a "U-shaped" performance curve: they are better at utilizing relevant information that occurs at the beginning or end of the input context window. For this reason, the 4K context LLM tends to ignore information in the middle of a 4K input, while the 32K context LLM tends to ignore information in the middle of a 32K input. From Figure 1, the length of the top-5 chunks is about 2K tokens, which can fall in the middle and be ignored by the 4K context LLM, but sits at the beginning of a 16K or 32K context and may not be ignored by the 16K or 32K context LLM.

Note that our observation differs substantially from the conclusion drawn in the LongBench work (Bai et al., 2023): "Retrieval brings improvement for model with weak ability on long contexts, but the performance still lags behind models that have strong long context understanding capability". Here, we demonstrate that retrieval can significantly improve the performance of both GPT-43B and LLaMA2-70B regardless of their context window size. For example, our best retrieval-augmented LLaMA2-70B-32k-ret outperforms its baseline without retrieval by a margin, i.e., 39.60 vs. 37.36. We think the major reason for the different conclusion is that Bai et al. (2023) use much smaller LLMs with 6B and 7B parameters, which usually have relatively weaker zero-shot capability to incorporate the retrieved chunked context. In contrast, larger instruction-tuned LLMs like LLaMA2-70B have much stronger zero-shot capability to incorporate retrieved evidence. This observation becomes clearer when one compares the gain from retrieval augmentation between GPT-43B and LLaMA2-70B, where LLaMA2-70B enjoys a larger benefit from incorporating context through retrieval.

4.2 COMPARING TO OPENAI MODELS

To further understand how good our best model is, i.e., LLaMA2-70B-32k augmented with retrieval, we also compare it to GPT-3.5-turbo (4k), GPT-3.5-turbo-16k, and Davinci-003 on these seven datasets. We found that LLaMA2-70B-32k-ret achieves better results than GPT-3.5-turbo-16k in terms of average accuracy over the seven datasets, and better results than Davinci-003 (with 175B parameters) on the average over 4 tasks. This indicates that LLaMA2-70B-32k with retrieval is a strong model for these long context tasks, and our conclusion is built on state-of-the-art results.

4.3 ABLATION ON DIFFERENT RETRIEVERS

To investigate the impact of different retrievers on top of LLaMA2-70B, we compare Dragon, Contriever, and OpenAI embeddings on top of LLaMA2-70B-4k and LLaMA2-70B-32k. The results in Table 4 confirm that our finding, i.e., that retrieval can boost the performance of both short context and long context LLMs, is consistent across different retrievers. We also observe that publicly available retrievers can perform better than the commercial OpenAI embeddings.

4.4 INCREASING THE NUMBER OF RETRIEVED CHUNKS

To study the impact of adding more retrieved chunks to the context, we increase the number of retrieved chunks from 5 to 20 using the Dragon retriever; the results can be found in Table 5. We observe that for different sequence lengths, the best average results are obtained with either the top 5 or top 10 chunks. Even though 20 chunks can still fit into the 16K and 32K context windows (as shown in Figure 1), adding more chunks, up to 20, is not helpful and can sometimes hurt performance. We believe this is related to the "lost in the middle" phenomenon (Liu et al., 2023) or to the model getting distracted by irrelevant information, and it needs further research.

5 CONCLUSION

In this work, we systematically study retrieval augmentation versus long context extension using state-of-the-art LLMs after instruction tuning on various long context QA and query-based summarization tasks. Our study yields the following findings: i) Retrieval largely boosts the performance of both the 4K short context LLM and the 16K/32K long context LLMs. ii) The 4K context LLMs with simple retrieval augmentation can perform comparably to the 16K long context LLMs, while being more efficient at inference. iii) After context window extension and retrieval augmentation, the best model, LLaMA2-70B-32k-ret, can outperform GPT-3.5-turbo-16k and Davinci003 in terms of average score on a suite of downstream tasks with informative queries. Our study sheds light on the promising direction of combining retrieval and long context techniques to build better LLMs.
