1 Introduction
Large language models (LLMs) have recently received a lot of attention in production, research, and the open source community. Retrieval augmentation has long been studied as an alternative for dealing with the computational complexity of the self-attention mechanism. This paper investigates whether retrieval augmentation can improve LLM performance, and how the two approaches can be combined to achieve even higher accuracy. We conduct a comprehensive study on seven downstream long-context tasks using two state-of-the-art LLMs, a 43B GPT model and LLaMA2-70B.
2 Related Work
2.1 Long Context Large Language Models
Thanks to faster GPUs and memory-efficient exact attention, the context window size of LLMs has gradually increased. However, extending the context window further is difficult because of the computational cost. Recent work extends the context window of LLMs through continued training or fine-tuning.
2.2 Efficient Attention Methods
Prior work introduced various approximate attention methods to deal with the computational complexity of self-attention. More recently, FlashAttention has accelerated exact attention by optimizing reads and writes between levels of GPU memory.
2.3 Retrieval-Augmented Language Models
Retrieval augmentation has long been studied as a way to improve language model performance. In this setting, the LLM reads only the relevant context returned by a standalone retriever, which speeds up processing and can be viewed as applying a sparse attention pattern determined by the retriever.
3 Experimental Setup
3.1 Large Language Models
In this study, we use two pretrained GPT models to compare their ability to integrate long-context information through retrieval augmentation versus their own self-attention mechanism. We focus on models larger than 40B parameters after instruction tuning, since instruction tuning is more effective at that scale.
3.2 Datasets and Metrics
We evaluate the models on seven datasets covering document QA and summarization tasks, which allows us to assess the impact of retrieval augmentation on model performance over long contexts.
3.3 Context Window Extension
We use the position interpolation method to extend the 4K context window of GPT-43B to 16K and that of LLaMA2-70B to 16K and 32K. This is a simple yet effective method for models with RoPE embeddings.
3.4 Retrieval
We experiment with three retrievers: Dragon, Contriever, and the OpenAI embeddings. These retrievers encode the question and the context chunks independently, and the most relevant chunks are selected and concatenated as the context of the prompt.
4 Results
4.1 Main Results
Retrieval augmentation substantially improves the performance of 4K LLMs, and the retrieval-augmented long-context LLMs (e.g., 16K and 32K) obtain even better results than the retrieval-augmented 4K models. This suggests that retrieval augmentation can further improve the long-context capability of LLMs.
4.2 Comparison with OpenAI Models
The retrieval-augmented LLaMA2-70B-32k outperforms several OpenAI models, showing that retrieval augmentation helps build strong models for long-context tasks.
4.3 Impact of Retrievers
Comparing LLaMA2-70B with different retrievers shows that retrieval augmentation consistently improves performance across all context lengths, and the publicly available retrievers perform better than the commercial OpenAI embeddings.
4.4 Impact of the Number of Retrieved Chunks
Retrieving more chunks tends to improve performance, but using too many chunks can hurt performance, likely because the model misses information in the middle of the context.
This work systematically explores the impact of retrieval augmentation on the long-context performance of large language models, evaluating it across multiple datasets and benchmarks. Retrieval augmentation substantially improves long-context capability, and the gains are especially pronounced for larger models.
The long context large language models (LLMs) have recently received a lot of attention in production (e.g., Anthropic, 2023; OpenAI, 2023b), the research community (e.g., Chen et al., 2023; Liu et al., 2023; Tworkowski et al., 2023), and the open source community (e.g., Kaiokendev, 2023). Although approximate attention methods have been studied for years (e.g., Tay et al., 2022) due to the quadratic time and memory complexity of the self-attention mechanism in sequence length, the recent advances in long context LLMs with exact attention are mainly driven by the development of faster GPUs with more memory and memory-efficient exact attention (Dao et al., 2022; Dao, 2023). An alternative and long-standing solution for handling long context is retrieval. Specifically, the LLM only reads the relevant context retrieved from a standalone retriever (e.g., Karpukhin et al., 2020; Wang et al., 2022; Lin et al., 2023), which is much easier to scale and runs orders of magnitude faster than LLMs for selecting relevant context. Conceptually, the retrieval-augmented decoder-only LLM can be viewed as applying sparse attention over its long context window, where the sparsity pattern is not predefined as in Child et al. (2019) but determined by the standalone retriever. In other words, unretrieved context is treated as irrelevant and has zero-valued attention weights. Given the surge of interest in long context LLM research and the much larger computation required at inference, it is still unclear to practitioners whether extending the context window of an LLM provides higher accuracy than retrieval augmentation for downstream tasks with informative queries. Moreover, it would be compelling if we could combine the strengths of both methods and achieve even higher accuracy. In this work, we attempt to answer the above questions through a comprehensive study. Specifically, we make the following contributions:
We organize the rest of the paper as follows. We discuss related work in Section 2, and present the experimental setup in Section 3. We report results in Section 4 and conclude the paper in Section 5.
In this section, we discuss the related work in long context LLM, efficient attention methods, and retrieval-augmented language models.
In this section, we present the details of our experimental setup.
We focus on comparing the zero-shot capability of integrating long context information for generative QA or summarization tasks via retrieval or the LLM's own self-attention mechanism. In contrast to most existing works that focus on relatively small models (e.g., 3B or 7B) (Kaiokendev, 2023; Nijkamp et al., 2023; Tworkowski et al., 2023; Mohtashami & Jaggi, 2023), we gather insights by exploring model sizes larger than 40B after instruction tuning, as previous studies suggest that instruction tuning becomes effective when the decoder-only LLM has around 50B parameters (Wei et al., 2021; 2022).
Specifically, we experimented with two pretrained GPT models, a proprietary Nemo GPT-43B and LLaMA2-70B. GPT-43B is a 43-billion-parameter model trained on 1.1T tokens, of which 70% is an English corpus and the remaining 30% is multilingual and code data. For the English pretraining corpus, GPT-43B used the Common Crawl web archive (WARC), Wikipedia, Reddit, Books, Gutenberg, ArXiv, StackExchange, PubMed, etc. It contains 48 layers with a hidden dimension of 8,192, and is trained with a sequence length of 4,096 and RoPE embeddings (Su et al., 2021). LLaMA2-70B is a publicly available 70B GPT model trained on 2T tokens using around 90% English data. It contains 80 layers with a hidden dimension of 8,192, also has a context window size of 4,096, and is trained with RoPE embeddings.
In this study, we include seven datasets, ranging from single-document QA and multi-document QA to query-based summarization, for our zero-shot evaluations. Specifically, we include four datasets from the validation set of the Scroll benchmark (Shaham et al., 2022).
We take another three datasets from LongBench (Bai et al., 2023).
The full details of the datasets can be found in Table 1. Our evaluation datasets cover a wide range of average document lengths, from 4.9k (QASP) to 84k (NQA). Therefore, for the baseline models without retrieval, we truncate each document accordingly to fit into the input sequence length.
Following the official metrics, we report the geometric mean of ROUGE scores (i.e., ROUGE-1/2/L) (Lin, 2004) for QM, the exact match (EM) score for QLTY, and F1 scores for the remaining five datasets: QASP, NQA, MSQ, HQA, and MFQA.
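As a hedged illustration of the QM metric above, the sketch below computes the geometric mean of ROUGE-1/2/L F-measures between a reference and a prediction. The rouge_score package and the helper name geo_mean_rouge are our own choices for illustration, not the authors' evaluation code.

```python
# A sketch of the QM metric: geometric mean of ROUGE-1/2/L F-measures.
# Requires: pip install rouge-score
from math import prod

from rouge_score import rouge_scorer

_scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)


def geo_mean_rouge(reference: str, prediction: str) -> float:
    """Geometric mean of ROUGE-1/2/L F1 between a reference and a prediction."""
    scores = _scorer.score(reference, prediction)
    f_values = [scores[k].fmeasure for k in ("rouge1", "rouge2", "rougeL")]
    return prod(f_values) ** (1.0 / len(f_values))
```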
We extend the context window length with the position interpolation method (Chen et al., 2023), as it is simple and effective for RoPE embeddings. We extend the 4K context window to 16K for GPT-43B. For LLaMA2-70B, we extend its 4K context window to 16K and 32K. Following Chen et al. (2023), we finetune both LLMs on the Pile dataset (Gao et al., 2021) with a batch size of 128 and a constant learning rate of 5e-6 to adapt the position embeddings.
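The sketch below illustrates position interpolation for RoPE as we understand it from Chen et al. (2023): positions are rescaled by the ratio of the original to the extended context length before the rotary angles are computed. The function name, toy dimensions, and PyTorch phrasing are illustrative assumptions, not the authors' training code.

```python
# Position interpolation for RoPE (sketch): rescale positions so that an
# extended window (e.g., 32K) maps back into the range seen during training (4K).
import torch


def rope_angles(seq_len: int, head_dim: int, base: float = 10000.0,
                scale: float = 1.0) -> torch.Tensor:
    """Rotary angles for positions 0..seq_len-1; scale < 1 interpolates positions."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float() * scale  # position interpolation step
    return torch.outer(positions, inv_freq)            # shape: (seq_len, head_dim // 2)


train_ctx, target_ctx, head_dim = 4096, 32768, 128      # toy values for illustration
angles = rope_angles(target_ctx, head_dim, scale=train_ctx / target_ctx)
cos, sin = angles.cos(), angles.sin()                   # used to rotate queries/keys as usual
```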
For the retriever, we experimented with three retrievers: 1) Dragon (Lin et al., 2023), as it achieves state-of-the-art results on both supervised and zero-shot information retrieval benchmarks (Thakur et al., 2021). Dragon is a dual encoder model that consists of a query encoder and a context encoder. 2) The widely used Contriever model (Izacard et al., 2021). Following the MoCo technique (He et al., 2020), Contriever uses a simple contrastive learning framework to pretrain models for information retrieval. It was trained without supervision and achieved results competitive with BM25 for R@100 on the BEIR benchmark (Thakur et al., 2021). 3) OpenAI embeddings. For the OpenAI embedding model, we use the latest “text-embedding-ada-002” as recommended by OpenAI. It accepts a maximum of 8,191 input tokens per sequence and outputs a vector of 1,536 dimensions. Cosine similarities are then computed between the question and the list of contexts for retrieval ranking.
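For the OpenAI retriever specifically, a minimal sketch of the ranking step might look like the following, assuming the current openai-python client; the exact call signatures may differ by library version, and this is not the authors' code.

```python
# Rank context chunks for a question with "text-embedding-ada-002" and cosine similarity.
# Assumes openai>=1.0 and OPENAI_API_KEY set in the environment.
import numpy as np
from openai import OpenAI

client = OpenAI()


def embed(texts):
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data])  # (len(texts), 1536)


def rank_chunks(question, chunks):
    q = embed([question])[0]
    c = embed(chunks)
    sims = c @ q / (np.linalg.norm(c, axis=1) * np.linalg.norm(q))  # cosine similarity
    return list(np.argsort(-sims))  # chunk indices, most relevant first
```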
To use these retrievers, we first split each context document into chunks of 300 words, and then encode both the questions and all chunks independently with the corresponding encoders. The most relevant N chunks, ranked by the dot product of the question embedding and chunk embedding, are then concatenated together (in left-to-right order from most relevant to least relevant) as the context of the prompt for generation. Table 1 shows the statistics of the top-N retrieved chunks, while Figure 1 gives more details of the token length distribution for all seven datasets. Note that some datasets such as Qasper (QASP) are relatively short and do not have up to 20 chunks, so the average lengths of the top-10 and top-20 chunks are close. We can see that the top-5 chunks all fit into a 4K sequence length (except for a few outliers), while the top-10 and top-20 chunks fit into a 16K sequence length.
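A minimal sketch of this chunk-and-rank pipeline is shown below, assuming an abstract dual encoder exposing encode_query and encode_context functions (e.g., the Dragon query/context encoders); the interface and helper names are ours, not the paper's code.

```python
# Chunk a document into 300-word pieces, rank chunks by dot product with the
# question embedding, and concatenate the top-N chunks as the prompt context.
from typing import Callable, List

import numpy as np


def chunk_document(document: str, words_per_chunk: int = 300) -> List[str]:
    words = document.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]


def build_context(question: str,
                  document: str,
                  encode_query: Callable[[str], np.ndarray],
                  encode_context: Callable[[List[str]], np.ndarray],
                  top_n: int = 5) -> str:
    chunks = chunk_document(document)
    q_emb = encode_query(question)                # shape: (d,)
    c_emb = encode_context(chunks)                # shape: (num_chunks, d)
    scores = c_emb @ q_emb                        # dot-product relevance scores
    order = np.argsort(-scores)[:top_n]           # most to least relevant
    return "\n\n".join(chunks[i] for i in order)  # left-to-right by relevance
```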
To train the pretrained LLMs to follow instructions for question answering or text summarization, we also performed instruction tuning. We first construct a blend of instruction tuning datasets consisting of 102K training samples from the Soda dataset (Kim et al., 2022), the ELI5 dataset (Fan et al., 2019), the FLAN dataset (Wei et al., 2021), the Open Assistant dataset (Köpf et al., 2023), Dolly (Conover et al., 2023), and a proprietary sourced conversational dataset, to adapt both GPT-43B and LLaMA2-70B to follow instructions. For the template, we use “System: {System}\n\nUser: {Question}\n\nAssistant: {Answer}” as the format to support multi-turn dialogue training. As all of the tasks contain context information for reasoning over at inference time, we add the context before the dialogue, i.e., “System: {System}\n\n{Context}\n\nUser: {Question}\n\nAssistant: {Answer}”. We finetune the LLM by taking the loss only on the {Answer} part, with a batch size of 128 and a learning rate of 5e-6 for 1,000 steps. For the rest of the paper, all results are reported using the instruction-tuned chat models on top of the foundational GPT-43B and LLaMA2-70B.
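The following sketch shows one plausible way to assemble the template above and restrict the loss to the {Answer} tokens, assuming a Hugging Face-style tokenizer and the common convention of labeling ignored positions with -100; it is an illustration, not the authors' training code.

```python
# Build the "System/{Context}/User/Assistant" prompt and mask the loss so that
# only the {Answer} tokens contribute; -100 is the label commonly ignored by
# cross-entropy implementations (e.g., PyTorch / Hugging Face Trainer).
def build_example(tokenizer, system: str, context: str, question: str, answer: str):
    prompt = (f"System: {system}\n\n{context}\n\n"
              f"User: {question}\n\nAssistant: ")
    prompt_ids = tokenizer.encode(prompt)
    answer_ids = tokenizer.encode(answer)
    input_ids = prompt_ids + answer_ids
    labels = [-100] * len(prompt_ids) + answer_ids  # loss only on the answer
    return {"input_ids": input_ids, "labels": labels}
```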
In this section, we report the results and provide detailed analysis.
In Table 2, we compare different model variants with context lengths ranging from 4K to as long as 32K using GPT-43B and LLaMA2-70B. First, we find that the baseline models without retrieval at 4K sequence length achieve the worst results for both GPT-43B and LLaMA2-70B. This is because the minimum average sequence length across all seven tasks exceeds 4,096, the context window of the foundation models, and therefore valuable text gets truncated. As a result, retrieval is especially helpful for 4K LLMs: e.g., LLaMA2-70B-4K improves from 31.61 to 35.73, while GPT-43B-4K improves from 26.44 to 29.32. Second, we observe that HotpotQA (HQA) especially favors long-sequence models, as the score improves from 34.64 to 43.97 for LLaMA2-70B and from 28.91 to 37.48 for GPT-43B when the sequence length increases from 4K to 16K. This is because HotpotQA is a multi-hop dataset where the questions are not hard to answer but all intermediate hops are necessary to get the correct answer. Therefore, long contexts are beneficial for increasing the recall of incorporating all intermediate hops.
It is quite interesting that the retrieval-augmented long context LLMs (e.g., 16K and 32K) can obtain better results than the retrieval-augmented 4K context LLM, even though they are fed the same top-5 chunks of evidence. We hypothesize that this observation is related to the “lost in the middle” phenomenon (Liu et al., 2023), where LLMs exhibit a “U-shaped” performance curve: they are better at utilizing relevant information that occurs at the beginning or end of the input context window. For this reason, the 4K context LLM tends to ignore information in the middle of its 4K input, while the 32K context LLM tends to ignore information in the middle of its 32K input. From Figure 1, the top-5 chunks amount to about 2K tokens, which can fall in the middle and be ignored by the 4K context LLM, but sit at the beginning of a 16K or 32K context and thus may not be ignored by the 16K or 32K context LLM.
Note that our observation differs from the conclusion drawn in the LongBench work (Bai et al., 2023): “Retrieval brings improvement for model with weak ability on long contexts, but the performance still lags behind models that have strong long context understanding capability”. Here, we demonstrate that retrieval can significantly improve the performance of both GPT-43B and LLaMA2-70B regardless of their context window size. For example, our best retrieval-augmented LLaMA2-70B-32k-ret outperforms its baseline without retrieval by a clear margin, i.e., 39.60 vs. 37.36. We think the major reason for this different conclusion is that Bai et al. (2023) use much smaller LLMs with 6B and 7B parameters, which usually have relatively weak zero-shot capability to incorporate the retrieved chunked context. In contrast, larger instruction-tuned LLMs like LLaMA2-70B have much stronger zero-shot capability to incorporate retrieved evidence. This becomes clearer when one compares the gain from retrieval augmentation between GPT-43B and LLaMA2-70B, where LLaMA2-70B enjoys a larger benefit from incorporating context through retrieval.
To further understand how good our best model is, i.e., LLaMA2-70B-32k augmented with retrieval, we also compare it to GPT-3.5-turbo (4K), GPT-3.5-turbo-16k, and Davinci-003 on these seven datasets. We find that LLaMA2-70B-32k-ret achieves better results than GPT-3.5-turbo-16k in terms of average accuracy over the seven datasets, and better results than Davinci-003 (w/ 175B parameters) on the average over 4 tasks. This indicates that LLaMA2-70B-32k with retrieval is a strong model for these long context tasks, and that our conclusions are built on state-of-the-art results.
To investigate the impact of different retrievers on top of LLaMA2-70B, we compare Dragon, Contriever, and OpenAI embeddings on top of LLaMA2-70B-4k and LLaMA2-70B-32k. The results in Table 4 confirm that our finding, i.e., that retrieval can boost the performance of both short context and long context LLMs, is consistent across different retrievers. We also observe that publicly available retrievers can do better than the commercial, closed OpenAI embeddings.
To study the impact of adding more retrieved chunks to the context, we increase the number of retrieved chunks from 5 to 20 using the Dragon retriever; the results can be found in Table 5. We observe that, for different sequence lengths, the best averaged results are obtained with either top-5 or top-10. Even though 20 chunks can still fit into the 16K and 32K context windows (as shown in Figure 1), adding more chunks, up to 20, is not helpful and can sometimes hurt performance. We believe this is related to the “lost in the middle” phenomenon (Liu et al., 2023), or to the model being distracted by irrelevant information, and therefore needs further research.
In this work, we systematically study retrieval augmentation versus long context extension using state-of-the-art LLMs after instruction tuning for various long context QA and query-based summarization tasks. Our study yields the following findings: i) Retrieval largely boosts the performance of both the 4K short context LLM and the 16K/32K long context LLMs. ii) The 4K context LLMs with simple retrieval augmentation can perform comparably to the 16K long context LLMs, while being more efficient at inference. iii) After context window extension and retrieval augmentation, the best model, LLaMA2-70B-32k-ret, can outperform GPT-3.5-turbo-16k and Davinci-003 in terms of average score on a suite of downstream tasks with informative queries. Our study sheds light on the promising direction of combining retrieval and long context techniques to build better LLMs.