RAG, Survey | RAG Survey

MinWoo(Daniel) Park | Tech Blog

  • Related Project: Private
  • Category: Paper Review
  • Date: 2024-01-12

[Index of RAG variants and diagrams]

Retrieval-Augmented Generation for Large Language Models: A Survey

  • url: https://arxiv.org/abs/2312.10997
  • pdf: https://arxiv.org/pdf/2312.10997
  • github: https://github.com/Tongji-KGLLM/RAG-Survey
  • abstract: Large Language Models (LLMs) demonstrate significant capabilities but face challenges such as hallucination, outdated knowledge, and non-transparent, untraceable reasoning processes. Retrieval-Augmented Generation (RAG) has emerged as a promising solution by incorporating knowledge from external databases. This enhances the accuracy and credibility of the models, particularly for knowledge-intensive tasks, and allows for continuous knowledge updates and integration of domain-specific information. RAG synergistically merges LLMs’ intrinsic knowledge with the vast, dynamic repositories of external databases. This comprehensive review paper offers a detailed examination of the progression of RAG paradigms, encompassing the Naive RAG, the Advanced RAG, and the Modular RAG. It meticulously scrutinizes the tripartite foundation of RAG frameworks, which includes the retrieval, the generation, and the augmentation techniques. The paper highlights the state-of-the-art technologies embedded in each of these critical components, providing a profound understanding of the advancements in RAG systems. Furthermore, this paper introduces the metrics and benchmarks for assessing RAG models, along with the most up-to-date evaluation framework. In conclusion, the paper delineates prospective avenues for research, including the identification of challenges, the expansion of multi-modalities, and the progression of the RAG infrastructure and its ecosystem.

Contents

TL;DR


Retrieval-Augmented Generation for Large Language Models: A Survey

Overview

Large language models (LLMs) show impressive capabilities but face problems such as hallucination, outdated knowledge, and opaque, untraceable inference processes. Retrieval-Augmented Generation (RAG) has emerged as a promising solution that addresses these by incorporating knowledge from external databases. This improves accuracy and reliability on knowledge-intensive tasks and enables continuous knowledge updates and integration of domain-specific information. RAG synergistically combines the LLM's intrinsic knowledge with the vast, dynamic repositories of external databases. This comprehensive review analyzes in detail the evolution of RAG paradigms spanning Naive RAG, Advanced RAG, and Modular RAG. It also closely examines the tripartite foundation of RAG frameworks: retrieval, generation, and augmentation techniques. The paper highlights the state-of-the-art technologies embedded in each of these core components, providing a deep understanding of advances in RAG systems. It further introduces the latest evaluation frameworks and benchmarks, and lays out current challenges and potential directions for research and development.

I. Introduction

Large language models (LLMs) have achieved remarkable success but face significant limitations, especially on domain-specific or knowledge-intensive tasks [1]. In particular, they tend to produce "hallucinations" when handling queries beyond their training data or requiring up-to-date information [2]. Retrieval-Augmented Generation (RAG) addresses these problems by augmenting LLMs with relevant document chunks retrieved from external knowledge bases. By grounding answers in external knowledge, RAG effectively reduces the generation of factually incorrect content. Its integration with LLMs has established RAG as a key technology for making them suitable for practical applications.

RAG technology has developed rapidly in recent years, and a technology tree summarizing related research appears in Figure 1. The development trajectory of RAG in the era of large models exhibits several stage-wise characteristics. Initially, the birth of RAG coincided with the rise of the Transformer architecture, focusing on strengthening language models by incorporating additional knowledge through pre-trained models (PTMs). This early phase was characterized by foundational work aimed at improving pre-training techniques [3]–[5]. The arrival of ChatGPT [6] marked a pivotal moment, with LLMs demonstrating strong in-context learning (ICL) abilities. RAG research then shifted toward providing better information at inference time so that LLMs could handle more complex, knowledge-intensive tasks. As research progressed, the enhancement of RAG was no longer limited to the inference stage and began to integrate more closely with LLM fine-tuning techniques.

The rapidly growing field of RAG has not been accompanied by a systematic synthesis clarifying its broad trajectory. This survey aims to fill that gap by mapping the RAG process and charting its evolution and anticipated future paths. Focusing on the integration of RAG into LLMs and considering technical paradigms and research methods, it summarizes three main research paradigms from more than 100 RAG studies and analyzes the key techniques in the core stages of "retrieval," "generation," and "augmentation." Meanwhile, current research focuses mostly on methods and falls short in analyzing and summarizing how to evaluate RAG. This paper comprehensively reviews the downstream tasks, datasets, benchmarks, and evaluation methods applicable to RAG. Overall, it thoroughly collects and organizes foundational technical concepts, historical progress, and the spectrum of RAG methods and applications that emerged after LLMs, to give readers and practitioners a detailed, structured understanding of large models and RAG. It illuminates the evolution of retrieval-augmentation techniques, assesses the strengths and weaknesses of each approach in context, and speculates on upcoming trends and innovations.

Contributions

  • This survey thoroughly and systematically reviews SOTA RAG methods, describing their evolution through the Naive RAG, Advanced RAG, and Modular RAG paradigms.

  • It identifies and discusses the central technologies of the RAG process, focusing in particular on the "retrieval," "generation," and "augmentation" aspects, exploring their synergies in depth and explaining how these components combine into a cohesive and effective RAG framework.

  • It summarizes current RAG evaluation methods, covering 26 tasks and roughly 50 datasets, describing evaluation goals and metrics along with current benchmarks and tools. It also highlights future directions that could address RAG's open problems.

The paper unfolds as follows: Section II introduces the main concepts and current paradigms of RAG. The next three sections explore the core components of "retrieval," "generation," and "augmentation," respectively. Section III focuses on retrieval optimization methods, including indexing, query, and embedding optimization. Section IV focuses on post-retrieval processing and LLM fine-tuning in generation. Section V analyzes the three augmentation processes. Section VI focuses on RAG's downstream tasks and evaluation systems. Section VII mainly discusses the challenges RAG currently faces and future development directions. Finally, Section VIII concludes the paper.

II. RAG Overview

A typical RAG application is illustrated in Figure 2. A user asks ChatGPT about recent, widely discussed news. Because ChatGPT relies on its pre-training data, it cannot provide updates on recent developments. RAG bridges this information gap by sourcing and integrating knowledge from external databases: in this case, it gathers news articles relevant to the user's query. These articles are combined with the original question into a comprehensive prompt that enables the LLM to generate a well-informed answer.

The RAG research paradigm is continuously evolving and can be divided into three stages, as shown in Figure 3: Naive RAG, Advanced RAG, and Modular RAG. RAG methods are cost-effective and surpass the performance of the base LLM, but they also exhibit several limitations. Advanced RAG and Modular RAG were developed to address specific shortcomings of Naive RAG.

A. Naive RAG

The Naive RAG research paradigm represents the earliest methods, which emerged shortly after ChatGPT's widespread adoption. Naive RAG follows a traditional process of indexing, retrieval, and generation, also characterized as a "Retrieve-Read" framework [7].

  • Indexing: starts with cleansing and extracting raw data in various formats such as PDF, HTML, Word, and Markdown, converting it into a uniform plain-text format. To accommodate the context limits of language models, the text is split into smaller, digestible chunks. Chunks are encoded into vector representations with an embedding model and stored in a vector database. This step is crucial for enabling efficient similarity search in the subsequent retrieval stage.
  • Retrieval: upon receiving a user query, the RAG system converts it into a vector representation using the same encoding model used during indexing. It then computes similarity scores between the query vector and the chunk vectors in the indexed corpus. The system retrieves the top-K chunks most similar to the query, which serve as expanded context in the prompt.
  • Generation: the posed query and the selected documents are synthesized into a coherent prompt from which the large language model formulates a response. The model's approach to answering may vary with task-specific criteria, either drawing on its parametric knowledge or restricting the answer to the information in the provided documents. For ongoing dialogues, the existing conversation history can be incorporated into the prompt, enabling effective multi-turn interaction.
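
The indexing, retrieval, and generation steps above can be sketched end to end; this is a toy illustration in which bag-of-words counts stand in for a real embedding model and the final LLM call is reduced to prompt construction:

```python
import math
from collections import Counter

def embed(text):
    # Toy "embedding": bag-of-words counts. A real system would use a
    # trained encoder; this only illustrates the pipeline shape.
    return Counter(text.lower().replace(".", " ").replace("?", " ").split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    # Rank indexed chunks by similarity to the query and keep the top-K.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, contexts):
    # The retrieved chunks plus the original question form the final prompt.
    ctx = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only the context below.\nContext:\n{ctx}\nQuestion: {query}"

chunks = [
    "RAG retrieves document chunks from an external knowledge base.",
    "Transformers use self-attention over token sequences.",
    "The retrieved chunks are concatenated with the query into a prompt.",
]
top = retrieve("How does RAG use retrieved chunks?", chunks)
prompt = build_prompt("How does RAG use retrieved chunks?", top)
```

In a production system the prompt would then be sent to the LLM; here the two RAG-related chunks outrank the unrelated one.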

However, Naive RAG has the following drawbacks:

  • Retrieval challenges: the retrieval stage often struggles with precision and recall, selecting misaligned or irrelevant chunks and missing crucial information.
  • Generation difficulties: when generating responses, the model may hallucinate content not supported by the retrieved context. This stage can also suffer from irrelevance, toxicity, or bias in its outputs, degrading quality and reliability.
  • Augmentation hurdles: integrating retrieved information with different tasks can be difficult, sometimes producing disjointed or inconsistent outputs. When similar information is retrieved from multiple sources, redundant responses can result. Determining the importance and relevance of different passages and ensuring consistency of style and tone add further complexity.

Facing complex problems, a single retrieval based on the original query may not obtain sufficient contextual information. Moreover, care is needed so that the generation model does not over-rely on the augmented information and merely repeat retrieved content without adding insight or synthesis.

B. Advanced RAG

Advanced RAG introduces specific improvements to overcome the limitations of Naive RAG. It focuses on improving retrieval quality using pre-retrieval and post-retrieval strategies. To address indexing issues, Advanced RAG refines indexing techniques through sliding-window approaches, fine-grained segmentation, and metadata incorporation. It also incorporates several optimization methods to streamline the retrieval process [8].


  1. RAG augments the knowledge of large language models by retrieving information from external databases, improving accuracy.
  2. Naive RAG follows a basic "Retrieve-Read" process but suffers from problems in retrieval precision and generation reliability.
  3. Advanced RAG introduces strategies to improve retrieval quality, aiming to overcome Naive RAG's limitations.

This document systematically describes the development and application of RAG technology, emphasizing in particular its integration with large language models. Through various examples and mathematical explanations, it presents RAG's potential, limitations, and future research directions.

4. Comparison of the Three RAG Paradigms


Fig. 3. Comparison of the three RAG paradigms.
Left: Naive RAG mainly consists of three parts: indexing, retrieval, and generation.
Middle: Advanced RAG proposes pre-retrieval and post-retrieval optimization strategies while still following a chain-like structure.
Right: Modular RAG evolves from the previous paradigms and shows greater overall flexibility. This is evident in the introduction of multiple specific functional modules and the replacement of existing ones. The overall process is not limited to sequential retrieval and generation, and includes methods such as iterative and adaptive retrieval.

Pre-retrieval Process

  • Goal: optimize the index structure and the original query
    • Index optimization: improve index quality
      • strategies such as data granularity, index-structure optimization, metadata addition, alignment optimization, and mixed retrieval
    • Query optimization: make the user's query clear and well suited to the retrieval task
      • methods such as query rewriting, query transformation, and query expansion [7], [9]–[11]

Post-Retrieval Process

  • Importance: effective integration of the relevant context with the query
    • Main methods: chunk reranking and context compression
      • relocate the most relevant content toward the edges of the prompt
      • implemented in frameworks such as LlamaIndex, LangChain, and Haystack [12]

C. Modular RAG

  • Aim: go beyond previous RAG paradigms to provide greater adaptability and versatility
    • Component-improvement strategies: adding a search module for similarity retrieval, fine-tuning the retriever
    • Innovative modules: restructured RAG modules [13], rearranged RAG pipelines [14]
  • Examples of new modules:
    • Search module: performs direct search over diverse data sources using LLM-generated code and query languages [15]
    • RAG-Fusion: expands the user query into multiple perspectives with a multi-query strategy, using parallel vector search and intelligent reranking [16]
    • Memory module: uses the LLM's memory to guide retrieval

Summary

  1. The three RAG paradigms (Naive, Advanced, Modular) differ in their indexing and retrieval optimization methods and in flexibility.
  2. Pre-retrieval and post-retrieval processes focus on improving retrieval quality and integrating results.
  3. Modular RAG introduces new modules to strengthen retrieval and processing, emphasizing adaptability and versatility.

5. New Patterns in Modular RAG

  • Adaptability: solve specific problems by swapping or recomposing modules
  • Rewrite-Retrieve-Read model [7]: leverages the LLM's abilities to rewrite the retrieval query
  • Approaches such as Generate-Read and Recite-Read: replace traditional retrieval with LLM-generated content, strengthening the model's ability to handle knowledge-intensive tasks
  • Hybrid retrieval strategies: integrate keyword, semantic, and vector search to handle diverse queries
  • Coordinating module composition and interaction: using frameworks such as Demonstrate-Search-Predict (DSP) [23]

RAG vs Fine-tuning

  • Comparison: RAG, fine-tuning (FT), and prompt engineering as model-optimization methods
    • Prompt engineering: leverages the model's inherent abilities
    • RAG: akin to giving the model a tailored textbook; ideal for precise information retrieval
    • FT: deeply customizes the model's behavior and style; suited to replicating specific structures and formats

III. RETRIEVAL

  • Efficient retrieval: retrieving relevant documents from data sources is crucial
  • Retrieval sources and units:
    • Data structures: retrieval from text, semi-structured data (PDF), and structured data (KG)
    • Using LLM-generated content: research on overcoming the limits of external auxiliary information


7. Comparing RAG with Other Model Optimization Methods


Fig. 4. Comparison of RAG with other model optimization methods along the axes of "external knowledge required" and "model adaptation required"

  • Prompt engineering: leverages LLM abilities with minimal modification
  • Fine-tuning: involves further training of the model
  • Modular RAG: evolves by integrating with fine-tuning techniques

Fashion Product Search Example

  • Search based on images and product metadata:
    • Input: product images and metadata
    • Model: a RAG model built on an LLM
    • Process: convert the image to text, then retrieve together with the relevant metadata

In this way, RAG and related techniques can optimize retrieval across diverse data sources and formats, delivering highly accurate results even in concrete application areas such as fashion products.

8. The Importance of Retrieval Granularity

Choosing the right retrieval granularity at inference time is a simple yet effective strategy for improving both the retrieval performance of dense retrievers and downstream task performance. In text, retrieval granularity ranges from fine to coarse units, e.g., token, phrase, sentence, proposition, chunk, and document. Among these, DenseX [30] proposed using propositions as retrieval units. Propositions are atomic expressions that encapsulate distinct factual segments of text, presented in a concise, self-contained natural-language form. This approach aims to improve retrieval accuracy and relevance. In knowledge graphs (KGs), retrieval granularity comprises entities, triplets, and subgraphs. Retrieval granularity can also be adapted to downstream tasks, such as item-ID retrieval in recommendation [40] and sentence-pair retrieval [38]. Details are described in Table I.

B. Indexing Optimization

In the indexing stage, documents are processed, segmented, transformed into embeddings, and stored in a vector database. The quality of index construction determines whether the correct context can be obtained during retrieval.

1) Chunking Strategy

The most common method is to split documents into chunks of a fixed number of tokens (e.g., 100, 256, 512) [88]. Larger chunks capture more context but also introduce more noise and require longer processing time and higher cost; smaller chunks carry less noise but may fail to convey enough of the needed context. Chunking can also truncate text mid-sentence, motivating optimizations such as recursive splitting and sliding windows, which enable hierarchical retrieval by merging globally related information across multiple retrieval passes [89]. Even so, these approaches still cannot fully balance semantic completeness against context length. Methods such as Small2Big have therefore been proposed, in which a sentence (small) is used as the retrieval unit and the preceding and following sentences are provided to the LLM as the (big) context [90].
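
The fixed-size and Small2Big strategies above can be sketched minimally; the sizes here are illustrative, not recommendations:

```python
def chunk_tokens(tokens, size=256, overlap=32):
    # Fixed-size chunks with a sliding-window overlap, so text truncated at
    # one chunk boundary still appears whole in the neighboring chunk.
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

def small2big(sentences, hit_index, window=1):
    # Small2Big: retrieve by a single sentence, but hand the LLM the hit
    # plus its neighboring sentences as the enlarged context.
    lo = max(0, hit_index - window)
    return " ".join(sentences[lo:hit_index + window + 1])
```

`window` controls how much surrounding context the generator sees without widening the retrieval unit itself.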

2) Metadata Attachment

Chunks can be enriched with metadata such as page number, file name, author, category, and timestamp. Retrieval can then be filtered on this metadata to restrict its scope. Assigning different weights to document timestamps during retrieval enables time-aware RAG, which keeps knowledge fresh and avoids outdated information.

Beyond extracting metadata from the source documents, metadata can also be constructed artificially, for example by adding paragraph summaries or introducing hypothetical questions. The latter is also known as Reverse HyDE: concretely, an LLM generates questions that each document can answer, and at retrieval time the similarity between the original question and these hypothetical questions is computed, reducing the semantic gap between questions and answers.
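
The hypothetical-question idea can be sketched as follows; the `doc_to_questions` map stands in for questions an LLM would generate per chunk, and word overlap stands in for embedding similarity:

```python
def overlap(a, b):
    # Toy question-to-question similarity: number of shared lowercase words.
    return len(set(a.lower().split()) & set(b.lower().split()))

# In practice an LLM is prompted per chunk ("what questions does this
# passage answer?"); the questions below are invented for illustration.
doc_to_questions = {
    "Chunks can carry page, author, and timestamp metadata.": [
        "what metadata can be attached to a chunk"],
    "Time-aware RAG weights documents by timestamp.": [
        "how does rag avoid returning stale documents"],
}

def reverse_hyde_retrieve(user_question):
    # Match the user's question against the stored hypothetical questions,
    # then return the chunk behind the best-matching question.
    return max(doc_to_questions,
               key=lambda d: max(overlap(user_question, q)
                                 for q in doc_to_questions[d]))
```

Matching question to question keeps both sides of the comparison in the same linguistic form, which is the point of the technique.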

3) Structural Index

One effective way to strengthen information retrieval is to build a hierarchical structure over documents. With such a structure, a RAG system can quickly retrieve and process relevant data.

Hierarchical index structure

Files are arranged in parent-child relationships, with chunks linked to them. A data summary stored at each node helps traverse the data quickly and helps the RAG system decide which chunks to extract. This approach can also mitigate hallucinations caused by block-extraction issues.

Knowledge graph index

Using a knowledge graph to build a document's hierarchical structure helps maintain consistency. It describes the connections between different concepts and entities, reducing the chance of hallucination. A further advantage is that it converts the information-retrieval process into instructions an LLM can understand, improving the accuracy of knowledge retrieval and enabling the LLM to generate contextually coherent responses, thereby improving the overall efficiency of the RAG system. To capture the logical relationships between document content and structure, KGP [91] proposed building an index across multiple documents. This KG comprises nodes (representing a document's paragraphs or structures, e.g., pages and tables) and edges (semantic/lexical similarity between paragraphs, or relationships within the document structure), effectively addressing knowledge-retrieval and inference problems in multi-document settings.

C. Query Optimization

One of the main challenges of Naive RAG is that retrieval depends directly on the user's original query. Formulating a precise, unambiguous question is hard, and a poor query leads to poor retrieval. Sometimes the question itself is complex and its language poorly organized. Another difficulty lies in the complexity and ambiguity of language: models often struggle with specialized vocabulary or ambiguous abbreviations carrying multiple meanings. For instance, they may fail to discern whether "LLM" refers to a large language model or a Master of Laws.

1) Query Expansion

Expanding a single query into multiple queries enriches its content and provides additional context that compensates for missing nuances, helping ensure the optimal relevance of the generated response.

  • Multi-query: using prompt engineering, an LLM expands the query into several queries that can then be executed in parallel. The expansions are not random but carefully designed.
  • Sub-query: sub-question planning generates the sub-questions needed to fully answer the original question. Adding relevant context this way is, in principle, similar to query expansion. Concretely, a complex question can be decomposed into a series of simpler sub-questions using the least-to-most prompting method [92].
  • Chain-of-Verification (CoVe): expanded queries are validated by the LLM to reduce hallucination. Validated expanded queries typically show higher reliability [93].
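
A minimal sketch of multi-query expansion and result merging; `keyword_retrieve` is a stand-in scorer (a real system would use embeddings), and the expansions themselves would come from an LLM:

```python
def keyword_retrieve(query, corpus, k=2):
    # Toy scorer: rank documents by number of shared lowercase words.
    q = set(query.lower().split())
    scored = sorted(corpus,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def multi_query_retrieve(sub_queries, corpus, k=2):
    # Run each expansion/sub-question (in parallel in practice; sequentially
    # here) and merge results, de-duplicating in first-seen order.
    seen, merged = set(), []
    for sq in sub_queries:
        for doc in keyword_retrieve(sq, corpus, k):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

corpus = ["paris is the capital of france",
          "the eiffel tower is in paris",
          "berlin is in germany"]
merged = multi_query_retrieve(["capital of france", "eiffel tower location"],
                              corpus, k=1)
```

Each sub-question pulls in a document the other would have missed, which is the intended effect of the expansion.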


2) Query Transformation

Retrieve chunks based on a transformed query rather than the user's original query.

  • Query rewriting: the original query is not always optimal for LLM retrieval, especially in real-world scenarios, so an LLM can be prompted to rewrite it. Besides rewriting with an LLM, specialized small language models can be used, as in RRR (Rewrite-Retrieve-Read) [7]. The query-rewriting implementation at Taobao known as BEQUE [9] significantly improved recall on long-tail queries, lifting GMV.
  • Prompt engineering: another transformation approach has the LLM generate a query from the original one for subsequent retrieval. HyDE [11] constructs hypothetical documents (presumed answers to the original query). It focuses on answer-to-answer embedding similarity rather than similarity to the problem or query. The Step-back Prompting method [10] abstracts the original query to generate a high-level concept question (step-back question). In a RAG system, both the step-back question and the original query are used for retrieval, and both results serve as the basis for the language model's answer generation.

3) Query Routing

Routing different queries to different RAG pipelines suits a versatile RAG system designed to accommodate diverse scenarios.

  • Metadata router/filter: first extract keywords (entities) from the query, then filter on the keywords and chunk metadata to narrow the search scope.
  • Semantic router: another routing method uses the query's semantic information; see Semantic Router (footnote 6) for a concrete approach. Hybrid routing, combining semantic and metadata-based methods, can of course also be applied to improve query routing.
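
A metadata/keyword router can be sketched with a plain lookup table; the route names and keyword sets below are invented for illustration:

```python
# Each route maps to a hypothetical downstream RAG pipeline; the keyword
# sets stand in for the entity-extraction step of a metadata router.
ROUTES = {
    "code_rag": {"function", "bug", "stacktrace", "api"},
    "news_rag": {"today", "latest", "breaking", "news"},
    "default_rag": set(),
}

def route(query):
    # Pick the pipeline whose keyword set overlaps the query words most;
    # fall back to the default pipeline when nothing matches.
    words = set(query.lower().split())
    best, hits = "default_rag", 0
    for name, keywords in ROUTES.items():
        n = len(words & keywords)
        if n > hits:
            best, hits = name, n
    return best
```

A semantic router would replace the word-overlap score with embedding similarity between the query and example utterances per route.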

D. Embedding

In RAG, retrieval is performed by computing the similarity (e.g., cosine similarity) between the embeddings of the question and of document chunks, so the semantic representation power of the embedding model plays a key role. This mainly involves sparse encoders (BM25) and dense retrievers (BERT-architecture pre-trained language models). Recent work introduces notable embedding models such as AngIE, Voyage, and BGE [94]–[96], which benefit from multi-task instruction tuning. Hugging Face's MTEB leaderboard (footnote 7) evaluates embedding models on 8 tasks covering 58 datasets; C-MTEB additionally focuses on Chinese ability across 6 tasks and 35 datasets. There is no one-size-fits-all answer to "which embedding model to use," but certain models are better suited to particular use cases.

1) Mix/Hybrid Retrieval

Sparse and dense embedding approaches capture different relevance features and can benefit each other by exploiting complementary relevance information. For example, a sparse retrieval model can provide initial search results for training a dense retrieval model.

6: https://github.com/aurelio-labs/semantic-router
7: https://huggingface.co/spaces/mteb/leaderboard

  • Sparse retrieval models: can enhance the zero-shot retrieval capability of dense models and improve the robustness of dense retrievers by helping handle queries containing rare entities.
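
Sparse and dense rankings are commonly merged with Reciprocal Rank Fusion (RRF); a minimal sketch, with k=60 as the constant customarily used in the RRF literature:

```python
def rrf(rankings, k=60):
    # Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank),
    # with rank starting at 1; documents absent from a ranking contribute 0.
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse = ["d1", "d2", "d3"]   # e.g., a BM25 ordering
dense = ["d3", "d1", "d4"]    # e.g., an embedding-similarity ordering
fused = rrf([sparse, dense])
```

Documents ranked well by both retrievers float to the top without requiring the two score scales to be comparable.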

2) Fine-tuning Embedding Models

When the context diverges significantly from the pre-training corpus, especially in specialized, terminology-heavy fields such as medicine and law, fine-tuning the embedding model on one's own domain dataset is essential. Besides supplementing domain knowledge, another purpose of fine-tuning is to align the retriever and the generator; for example, LSR (LM-supervised Retriever) uses the LLM's results as the supervision signal for fine-tuning. PROMPTAGATOR [21] uses the LLM as a few-shot query generator to build task-specific retrievers, addressing supervised-learning challenges particularly in data-scarce domains. Another approach, LLM-Embedder [97], uses the LLM to generate reward signals across multiple downstream tasks. The retriever is fine-tuned with two types of supervision: hard labels from the dataset and soft rewards from the LLM. This dual-signal approach makes fine-tuning more effective, tailoring the embedding model to various downstream applications. REPLUG [72] uses the retriever and the LLM to compute the probability distribution of retrieved documents and then performs supervised training by computing the KL divergence. This intuitive and effective training method improves the retrieval model using the LM as a supervision signal, without requiring any special cross-attention mechanisms. In addition, inspired by RLHF (Reinforcement Learning from Human Feedback), LM-based feedback is used to reinforce the retriever via reinforcement learning.

E. Adapter

Fine-tuning a model can be difficult when functionality is integrated via APIs or local compute resources are limited, so some approaches instead attach external adapters to aid alignment.

  • UPRISE [20]: trains a lightweight prompt retriever that automatically retrieves a suitable prompt for a given zero-shot task input, optimizing multi-task capability.
  • AAR (Augmentation-Adapted Retriever) [47]: introduces a universal adapter that can serve multiple downstream tasks.
  • PRCA [69]: adds a pluggable reward-driven context adapter to boost performance on specific tasks.
  • BGM [26]: freezes the retriever and the LLM and trains a bridging Seq2Seq model between them. The bridge model aims to transform the retrieved information into a format the LLM can process effectively, enabling passage reranking and dynamic selection for each query.
  • PKG: introduces an innovative method of integrating knowledge into a white-box model. This approach directly replaces the retrieval module, generating relevant documents according to the query; it helps address the difficulties of the fine-tuning process and improves model performance.


IV. Generation

After retrieval, feeding everything retrieved directly into the LLM to answer the question is not good practice. Adjustments can be introduced from two perspectives: adjusting the retrieved content and adjusting the LLM.

A. Context Selection

Irrelevant information can interfere with the LLM's final generation, and overly long contexts can cause the LLM to get "lost in the middle" [98]. Like humans, LLMs tend to focus on the beginning and end of long texts while forgetting the middle. Retrieved content therefore needs further processing in a RAG system.

1) Reranking

Reranking reorders document chunks to surface the most relevant results first, shrinking the overall document pool. It serves a dual purpose in information retrieval, acting as both an enhancer and a filter, delivering more refined input that raises the precision of language-model processing [70]. Reranking can be rule-based, relying on predefined metrics such as diversity, relevance, and MRR, or model-based, using encoder-decoder models from the BERT family (e.g., SpanBERT), specialized reranking models such as Cohere rerank or bge-reranker-large, or general large language models such as GPT [12], [99].
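
Because models attend best to the edges of a long prompt, a common post-reranking step alternates the top-scored chunks between the front and the back so that the weakest material lands in the middle; a minimal sketch:

```python
def edge_reorder(chunks_with_scores):
    # Given (chunk, score) pairs, sort by score, then alternate chunks
    # between the front and the back of the prompt.
    ranked = sorted(chunks_with_scores, key=lambda cs: cs[1], reverse=True)
    front, back = [], []
    for i, (chunk, _) in enumerate(ranked):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

ordered = edge_reorder([("a", 0.9), ("b", 0.5), ("c", 0.8), ("d", 0.1)])
```

The two strongest chunks ("a" and "c") end up first and last, mirroring the relocation strategy implemented in frameworks such as LlamaIndex and LangChain.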

2) Context Selection/Compression

A common misconception in the RAG process is the belief that retrieving as many relevant documents as possible and concatenating them into a long retrieval prompt is beneficial. Excessive context, however, introduces more noise and weakens the LLM's perception of key information.
(Long)LLMLingua [100], [101] uses small language models (SLMs) such as GPT-2 Small or LLaMA-7B to detect and remove unimportant tokens, transforming the prompt into a form that is hard for humans to read but well understood by LLMs. This offers a direct, practical approach to prompt compression without requiring additional LLM training, balancing linguistic integrity against compression ratio. PRCA addresses this problem by training an information extractor [69].

Evaluation and Application of RAG Models: Recent Research and Future Prospects


1. Self-RAG's Innovative Approach

Self-RAG [25] introduces a new concept called "reflection tokens," which let the model reflect on its own outputs. These tokens come in two kinds, "retrieve" and "critic." The model autonomously decides when to activate retrieval, or a predefined threshold can trigger the process. During retrieval, the generator performs fragment-level beam search across multiple passages to derive the most coherent sequence. Critic scores are used to update the segment scores, and these weights can be adjusted during inference to tailor the model's behavior. Self-RAG's design removes the dependence on additional classifiers or natural language inference (NLI) models, simplifying the decision process for when to invoke retrieval and improving the model's autonomous judgment in generating accurate responses.

Image: the Self-RAG architecture, visualizing how the reflection tokens act and detailing the retrieval mechanism.

2. Diverse Downstream Tasks for RAG Models

2.1 Core Task Objectives

RAG's core task is question answering (QA), covering traditional single-hop/multi-hop QA, multiple-choice questions, domain-specific QA, and long-form scenarios. Beyond QA, RAG is expanding into several downstream tasks such as information extraction (IE), dialogue generation, and code search.

2.2 Evaluation Objectives

The main purpose of evaluating RAG models is to assess retrieval quality and generation quality. Retrieval quality is measured with standard metrics from search engines, recommender systems, and information-retrieval systems to gauge the retrieval module's performance. Generation quality assesses the generator's ability to synthesize coherent, relevant answers from the retrieved context.

Key Takeaways

  1. Self-RAG strengthens the model's autonomous judgment by optimizing the retrieval mechanism through reflection tokens.
  2. RAG models are applied to diverse downstream tasks including QA, with evaluation centered on retrieval and generation quality.
  3. Evaluation metrics measure model effectiveness by analyzing retrieval and generation quality from multiple perspectives.

3. Evaluation Frameworks and Tools

3.1 Evaluation Metrics

Evaluation metrics use various criteria to measure retrieval and generation quality. Retrieval quality, for example, is assessed with metrics such as Hit Rate, MRR, and NDCG. Generation quality is assessed with BLEU and ROUGE, based on the coherence and relevance of the context.
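
Two of the retrieval metrics above, Hit Rate and MRR, can be computed directly from per-query ranked result lists; a minimal sketch over toy data:

```python
def hit_rate(results, relevant, k=3):
    # Fraction of queries whose top-k results contain a relevant document.
    hits = sum(any(doc in relevant[q] for doc in ranked[:k])
               for q, ranked in results.items())
    return hits / len(results)

def mrr(results, relevant):
    # Mean Reciprocal Rank of the first relevant document (0 if none found).
    total = 0.0
    for q, ranked in results.items():
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant[q]:
                total += 1.0 / rank
                break
    return total / len(results)

results = {"q1": ["d2", "d1", "d3"], "q2": ["d5", "d6", "d7"]}
relevant = {"q1": {"d1"}, "q2": {"d9"}}
```

For q1 the first relevant document sits at rank 2 (reciprocal rank 0.5); q2 finds nothing, so Hit Rate is 0.5 and MRR is 0.25.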

3.2 Evaluation Benchmarks and Tools

Various benchmarks and tools have been proposed to support RAG evaluation. Benchmarks such as RGB, RECALL, and CRUD focus on assessing the essential abilities of RAG models. Automated tools such as RAGAS, ARES, and TruLens use LLMs to judge quality scores.

4. Future Directions and Research Challenges

4.1 Long Contexts and the Role of RAG

RAG remains important for long-document question answering. Even when LLMs can process long contexts, RAG improves efficiency through chunk retrieval and supplies input only as needed.

4.2 Robustness of RAG

Noise or contradictory information encountered during retrieval can hurt the quality of RAG's output. Research on improving RAG's resilience to such challenges is actively under way.

4.3 Hybrid Approaches

Strategies that combine RAG with fine-tuning are gaining attention, as is research on introducing SLMs and fine-tuning them with the outputs of RAG systems.

Key Takeaways

  1. RAG remains an essential element for efficiency in long-context document processing.
  2. RAG's robustness is strengthened by improving its tolerance to noise and contradictory information.
  3. Hybrid approaches pursue an optimal integration of RAG and fine-tuning, and introducing SLMs can improve performance.

5. Conclusion

RAG models are driving transformative change in NLP, with their performance being optimized across diverse downstream tasks and evaluation criteria. The introduction of Self-RAG has made the retrieval mechanism more sophisticated, and future work is expected to improve performance through RAG robustness and hybrid approaches. These advances increase RAG's practicality and open the possibility of extending its use across many application areas.

Recent Technical Trends and Developments Related to RAG

Fig. 6: Summary of the RAG ecosystem


1. Initial Learning Curve

  • RAG (Retrieval-Augmented Generation) extends LLM capabilities by integrating the language model with non-parametric data from external databases.
  • This helps the model exploit large amounts of data effectively in the early stages of learning.

2. Specialization

  • Optimizing RAG strengthens its efficiency in production environments.
  • Advances in the technology stack further improve RAG's performance, which in turn drives further stack development, forming a mutually reinforcing relationship.

3. Advances in Multimodal RAG

  • Expanding beyond early text-based Q&A to data in diverse modalities.
  • Integrating images, audio, video, code, and other data forms into multimodal models.
  • Examples: RA-CM3 is a multimodal model that retrieves and generates text and images jointly, and BLIP-2 couples an image encoder with an LLM to support image-to-text conversion.

4. Conclusion

  • RAG technology extends its application across tasks by integrating language models with external data, raising the practical applicability of AI deployment.
  • RAG's progress is drawing attention from AI research and industry, and its range of applications continues to expand.

Formulas and Algorithms

A. Formulas

  • The basic idea of RAG is the combination of retrieval and generation.
  • Formally, for a given input \(x\), the retrieval step returns related information \(R(x)\), and the generation step uses it to produce the final output \(G(R(x), x)\).

\(R(x) = \text{Retrieve related information}\)
\(G(R(x), x) = \text{Generate final output using retrieved information}\)

B. Algorithm

  1. Input step: receive the text input \(x\).
  2. Retrieval step: search for information related to \(x\) to obtain \(R(x)\).
  3. Generation step: combine the retrieved information \(R(x)\) with \(x\) to produce the final output \(G(R(x), x)\).
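
The steps above can be restated directly as code; `retrieve` and `generate` are placeholders (a real system would query a vector store and prompt an LLM, respectively):

```python
def retrieve(x):
    # Placeholder R(x): a real system would run a vector search here;
    # the knowledge map below is invented for illustration.
    knowledge = {"capital of france": "Paris is the capital of France."}
    return [v for k, v in knowledge.items() if k in x.lower()]

def generate(contexts, x):
    # Placeholder G(R(x), x): a real system would prompt an LLM with both.
    return f"Q: {x} | Context: {' '.join(contexts)}"

def rag(x):
    # The composition G(R(x), x) from the formulas above.
    return generate(retrieve(x), x)

answer = rag("What is the capital of France?")
```

The point is only the shape of the composition: retrieval output feeds generation alongside the original input.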

Key Takeaways

  • RAG handles diverse data forms by integrating language models with external knowledge, raising the practical potential of AI applications.
  • Advances in multimodal RAG strengthen model performance by integrating images, audio, video, and other data.
  • RAG's technical progress expands its applications in AI research and industry, with continued interest and development expected.

Fashion Product Search Example

  • When searching fashion products from an image and product metadata, RAG acquires related product information via image search and uses the metadata to generate the final recommendation list.
  • For example, if the user's input image is a bag from a particular brand, the system retrieves information related to that brand and produces a list of recommended products.

Related Documents and Links

RAG technology integrates diverse multimodal data to enable a wide range of practical applications, drawing sustained development and interest from AI research and industry.

References and Technical Overview

In modern AI research, large language models (LLMs) play an important role across many fields, and research on building personalized, knowledge-grounded dialogue systems is particularly active. These studies focus on understanding users' personal data and maintaining conversational context based on it. Representative work includes Xu et al. (2022) on open-domain dialogue systems with long-term persona memory and Wen et al. (2016) on conditional generation and snapshot learning in neural dialogue systems. In addition, He and McAuley (2016) modeled the visual evolution of fashion trends with one-class collaborative filtering, an important foundation for building retrieval systems in the fashion industry based on images and product metadata.

Key Takeaways

  1. Large language models play a key role in building personalized, knowledge-grounded dialogue systems.
  2. Long-term persona memory and conditional generation techniques contribute to maintaining dialogue context.
  3. Research modeling the visual evolution of fashion trends can be applied to image-based retrieval systems.

Formulas and Algorithms

Before the mathematical exposition, consider how these studies are realized in dialogue systems. For example, the probabilistic method in a conditional generation model can be described by the following formula.

\[P(y \mid x, z) = \frac{P(x, y, z)}{P(x, z)}\]
  • $P(y \mid x, z)$ is the probability of $y$ given the conditions $x$ and $z$.
  • $P(x, y, z)$ is the probability that $x$, $y$, and $z$ occur together.
  • $P(x, z)$ is the probability that $x$ and $z$ occur together, used to normalize the probability distribution.

Formulas like this represent the basic form of the conditional probability models used in dialogue systems, helping the system generate responses appropriate to the user's input.
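
As a toy numerical check of this identity, the conditional probability can be estimated from joint event counts; the event triples below are invented for illustration:

```python
from collections import Counter

# Joint observations of (x, y, z): intent, response, register.
# Counts stand in for a learned joint model.
triples = ([("greet", "hello", "casual")] * 3
           + [("greet", "hey", "casual")] * 1)

joint = Counter(triples)
xz = Counter((x, z) for x, _, z in triples)

def p_y_given_xz(y, x, z):
    # P(y | x, z) = P(x, y, z) / P(x, z), with counts as estimates.
    return joint[(x, y, z)] / xz[(x, z)] if xz[(x, z)] else 0.0

p = p_y_given_xz("hello", "greet", "casual")
```

With 3 of 4 casual greetings answered "hello", the conditional probability comes out to 0.75, matching the ratio of joint to marginal counts.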

Image-based Fashion Search System

An image- and metadata-based fashion product search system can be implemented in the following steps:

  1. Image recognition: extract features of the input image using convolutional neural networks (CNNs).
  2. Metadata use: analyze the product's textual metadata to obtain additional contextual information.
  3. Result integration: combine the information from the image and the metadata to return the most appropriate search results to the user.

For example, when a user uploads an image of a jacket in a certain style, the system extracts its visual features with a CNN and, using the textual metadata, retrieves and recommends other jackets of a similar style.

Related Documents and Links

Such recent research and techniques contribute to the advancement of retrieval and dialogue systems that exploit image and text data, in fashion and many other fields.



Large-scale Video Captioning Models

Yang et al. (2023) highlight the importance of large-scale video captioning models. Their work pre-trains a large vision-language model to generate video captions, improving the performance of dialogue models grounded in video data. Such models can understand complex scenes in video and describe them in natural language, and the technique can be used for automated description of diverse media content.

Key Takeaways

  1. Large vision-language models play an important role in video caption generation.
  2. They can understand complex video scenes and describe them in natural language.
  3. They can be applied to automated description generation for media content.

Such recent research points to efficiency gains across industries through automated processing and description of video data.

Source: https://github.com/Tongji-KGLLM/RAG-Survey


1 Introduction

Large language models (LLMs) such as the GPT series [Brown et al., 2020, OpenAI, 2023] and the LLaMA series [Touvron et al., 2023], along with other models like Gemini [Google, 2023], have achieved remarkable success in natural language processing, demonstrating superior performance on various benchmarks including SuperGLUE [Wang et al., 2019], MMLU [Hendrycks et al., 2020], and BIG-bench [Srivastava et al., 2022]. Despite these advancements, LLMs exhibit notable limitations, particularly in handling domain-specific or highly specialized queries [Kandpal et al., 2023]. A common issue is the generation of incorrect information, or "hallucinations" [Zhang et al., 2023b], especially when queries extend beyond the model's training data or necessitate up-to-date information. These shortcomings underscore the impracticality of deploying LLMs as black-box solutions in real-world production environments without additional safeguards. One promising approach to mitigate these limitations is Retrieval-Augmented Generation (RAG), which integrates external data retrieval into the generative process, thereby enhancing the model's ability to provide accurate and relevant responses. RAG, introduced by Lewis et al. [Lewis et al., 2020] in mid-2020, stands as a paradigm within the realm of LLMs, enhancing generative tasks. Specifically, RAG involves an initial retrieval step where the LLMs query an external data source to obtain relevant information before proceeding to answer questions or generate text. This process not only informs the subsequent generation phase but also ensures that the responses are grounded in retrieved evidence, thereby significantly enhancing the accuracy and relevance of the output.
The dynamic retrieval of information from knowledge bases during the inference phase allows RAG to address issues such as the generation of factually incorrect content, commonly referred to as “hallucinations.” The integration of RAG into LLMs has seen rapid adoption and has become a pivotal technology in refining the capabilities of chatbots and rendering LLMs more viable for practical applications.

The evolutionary trajectory of RAG unfolds across four distinctive phases, as illustrated in Figure 1. In its inception in 2017, aligned with the emergence of the Transformer architecture, the primary thrust was on assimilating additional knowledge through Pre-Training Models (PTM) to augment language models. This epoch witnessed RAG’s foundational efforts predominantly directed at optimizing pre-training methodologies.

Following this initial phase, a period of relative dormancy ensued before the advent of ChatGPT, during which there was minimal advancement in related research for RAG. The subsequent arrival of ChatGPT marked a pivotal moment in the trajectory, propelling LLMs into the forefront. The community's focal point shifted towards harnessing the capabilities of LLMs to attain heightened controllability and address evolving requirements. Consequently, the lion's share of RAG endeavors concentrated on inference, with a minority dedicated to fine-tuning processes. As LLM capabilities continued to advance, especially with the introduction of GPT-4, the landscape of RAG technology underwent a significant transformation. The emphasis evolved into a hybrid approach, combining the strengths of RAG and fine-tuning, alongside a dedicated minority continuing the focus on optimizing pre-training methodologies.

Figure 1: Technology tree of RAG research development featuring representative works

Despite the rapid growth of RAG research, there has been a lack of systematic consolidation and abstraction in the field, which poses challenges in understanding the comprehensive landscape of RAG advancements. This survey aims to outline the entire RAG process and encompass the current and future directions of RAG research, by providing a thorough examination of retrieval augmentation in LLMs.

Therefore, this paper aims to comprehensively summarize and organize the technical principles, developmental history, content, and, in particular, the relevant methods and applications after the emergence of LLMs, as well as the evaluation methods and application scenarios of RAG. It seeks to provide a comprehensive overview and analysis of existing RAG technologies and offer conclusions and prospects for future development methods. This survey intends to furnish readers and practitioners with a thorough and systematic comprehension of large models and RAG, elucidate the progression and key technologies of retrieval augmentation, clarify the merits and limitations of various technologies along with their suitable contexts, and forecast potential future developments.

Our contributions are as follows:

  • We present a thorough and systematic review of the state-of-the-art RAG, delineating its evolution through paradigms including naive RAG, advanced RAG, and modular RAG. This review contextualizes the broader scope of RAG research within the landscape of LLMs.
  • We identify and discuss the central technologies integral to the RAG process, specifically focusing on the aspects of "Retrieval", "Generator" and "Augmentation", and delve into their synergies, elucidating how these components intricately collaborate to form a cohesive and effective RAG framework.
  • We construct a thorough evaluation framework for RAG, outlining the evaluation objectives and metrics. Our comparative analysis clarifies the strengths and weaknesses of RAG compared to fine-tuning from various perspectives. Additionally, we anticipate future directions for RAG, emphasizing potential enhancements to tackle current challenges, expansions into multi-modal settings, and the development of its ecosystem.

The paper unfolds as follows: Sections 2 and 3 define RAG and detail its developmental process. Sections 4 through 6 explore the core components—"Retrieval", "Generation" and "Augmentation"—highlighting diverse embedded technologies. Section 7 focuses on RAG's evaluation system. Section 8 compares RAG with other LLM optimization methods and suggests potential directions for its evolution. The paper concludes in Section 9.

2 Definition

The definition of RAG can be summarized from its workflow. Figure 2 depicts a typical RAG application workflow. In this scenario, a user asks ChatGPT about a recent high-profile event (i.e., the abrupt dismissal and reinstatement of OpenAI's CEO) which generated considerable public discourse. ChatGPT, as the most renowned and widely utilized LLM, is constrained by its pretraining data and lacks knowledge of recent events. RAG addresses this gap by retrieving up-to-date document excerpts from external knowledge bases. In this instance, it procures a selection of news articles pertinent to the inquiry. These articles, alongside the initial question, are then amalgamated into an enriched prompt that enables ChatGPT to synthesize an informed response. This example illustrates the RAG process, demonstrating its capability to enhance the model's responses with real-time information retrieval.

Technologically, RAG has been enriched through various innovative approaches addressing pivotal questions such as "what to retrieve", "when to retrieve" and "how to use the retrieved information". For "what to retrieve", research has progressed from simple token [Khandelwal et al., 2019] and entity retrieval [Nishikawa et al., 2022] to more complex structures like chunks [Ram et al., 2023] and knowledge graphs [Kang et al., 2023], with studies focusing on the granularity of retrieval and the level of data structuring. Coarse granularity brings more information but with lower precision. Retrieving structured text provides more information while sacrificing efficiency. The question of "when to retrieve" has led to strategies ranging from single [Wang et al., 2023e, Shi et al., 2023] to adaptive [Jiang et al., 2023b, Huang et al., 2023] and multiple retrieval [Izacard et al., 2022] methods. High frequency of retrieval brings more information and lower efficiency. As for "how to use" the retrieved data, integration techniques have been developed at various levels of the model architecture, such as the input layer [Khattab et al., 2022] and intermediate layers [Liang et al., 2023]. Although the "intermediate" and "output layers" are more effective, there are problems with the need for training and low efficiency.

RAG is a paradigm that enhances LLMs by integrating external knowledge bases. It employs a synergistic approach, combining information retrieval mechanisms and In-Context Learning (ICL) to bolster the LLM's performance. In this framework, a query initiated by a user prompts the retrieval of pertinent information via search algorithms. This information is then woven into the LLM's prompts, providing additional context for the generation process. RAG's key advantage lies in its obviation of the need for retraining of LLMs for task-specific applications. Developers can instead append an external knowledge repository, enriching the input and thereby refining the model's output precision. RAG has become one of the most popular architectures in LLMs' systems, due to its high practicality and low barrier to entry, with many conversational products being built almost entirely on RAG.

The RAG workflow comprises three key steps. First, the corpus is partitioned into discrete chunks, upon which vector indices are constructed utilizing an encoder model. Second, RAG identifies and retrieves chunks based on the vector similarity between the query and the indexed chunks. Finally, the model synthesizes a response conditioned on the contextual information gleaned from the retrieved chunks. These steps form the fundamental framework of the RAG process, underpinning its information retrieval and context-aware generation capabilities. Next, we will provide an introduction to the RAG research framework.

3 RAG Framework

The RAG research paradigm is continuously evolving, and this section primarily delineates its progression. We categorize it into three types: Naive RAG, Advanced RAG, and Modular RAG. While Naive RAG was cost-effective and surpassed the performance of the native LLM, it also exhibited several limitations. The development of Advanced RAG and Modular RAG was a response to these specific shortcomings in Naive RAG.

3.1 Naive RAG

The Naive RAG research paradigm represents the earliest methodology, which gained prominence shortly after the widespread adoption of ChatGPT. The Naive RAG follows a traditional process that includes indexing, retrieval, and generation. It is also characterized as a “Retrieve-Read” framework [Ma et al., 2023a].

Indexing The indexing process is a crucial initial step in data preparation that occurs offline and involves several stages. It begins with data cleansing and extraction, where various file formats such as PDF, HTML, Word, and Markdown are converted into standardized plain text. In order to fit within the context limitations of language models, this text is then segmented into smaller, more manageable chunks in a process known as chunking. These chunks are subsequently transformed into vector representations through an embedding model, chosen for its balance between inference efficiency and model size. This facilitates similarity comparisons during the retrieval phase. Finally, an index is created to store these text chunks and their vector embeddings as key-value pairs, which allows for efficient and scalable search capabilities.
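
The chunking step above can be sketched as a fixed-size splitter with overlap; a minimal sketch, where whitespace-separated words stand in for the tokenizer tokens a production pipeline would count, and the size/overlap defaults are illustrative rather than taken from the survey:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping word-based chunks.

    Words stand in for tokens here; real pipelines count tokenizer
    tokens so each chunk fits the embedding model's context limit.
    """
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

The overlap carries context across chunk boundaries, so a sentence severed at one boundary still appears whole in at least one chunk.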

Retrieval Upon receipt of a user query, the system employs the same encoding model utilized during the indexing phase to transcode the input into a vector representation. It then proceeds to compute the similarity scores between the query vector and the vectorized chunks within the indexed corpus. The system prioritizes and retrieves the top K chunks that demonstrate the greatest similarity to the query. These chunks are subsequently used as the expanded contextual basis for addressing the user’s request.
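
The retrieval step then reduces to scoring the encoded query against every indexed chunk and keeping the top K. A minimal sketch, with a bag-of-words counter standing in for the neural encoder used in practice:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; real systems use a neural encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query, indexed_chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(indexed_chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]
```

In a real index the chunk embeddings are precomputed at indexing time; here they are recomputed per query only for brevity.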

Figure 2: A representative instance of the RAG process applied to question answering

Generation The posed query and selected documents are synthesized into a coherent prompt to which a large language model is tasked with formulating a response. The model’s approach to answering may vary depending on task-specific criteria, allowing it to either draw upon its inherent parametric knowledge or restrict its responses to the information contained within the provided documents. In cases of ongoing dialogues, any existing conversational history can be integrated into the prompt, enabling the model to engage in multi-turn dialogue interactions effectively.
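
At its core, this step is prompt assembly; a minimal template (the instruction wording is an illustrative assumption, not taken from the survey):

```python
def build_prompt(query, documents, history=None):
    """Assemble the retrieved chunks (and optional dialogue history)
    into a single prompt for the LLM."""
    context = "\n\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(documents))
    parts = []
    if history:
        parts.append("Conversation so far:\n" + "\n".join(history))
    parts.append(
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
    return "\n\n".join(parts)
```

Dropping the "using only the context" constraint is what lets the model fall back on its parametric knowledge, per the task-specific criteria mentioned above.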

Drawbacks in Naive RAG Naive RAG faces significant challenges in three key areas: “Retrieval,” “Generation,” and “Augmentation”.

Retrieval quality poses diverse challenges, including low precision, leading to misaligned retrieved chunks and potential issues like hallucination or mid-air drop. Low recall also occurs, resulting in the failure to retrieve all relevant chunks, thereby hindering the LLMs’ ability to craft comprehensive responses. Outdated information further compounds the problem, potentially yielding inaccurate retrieval results. Response generation quality presents the hallucination challenge, where the model generates answers not grounded in the provided context, as well as issues of irrelevant context and potential toxicity or bias in the model’s output.

The augmentation process presents its own challenges in effectively integrating context from retrieved passages with the current generation task, potentially leading to disjointed or incoherent output. Redundancy and repetition are also concerns, especially when multiple retrieved passages contain similar information, resulting in repetitive content in the generated response.

Discerning the importance and relevance of multiple retrieved passages to the generation task is another challenge, requiring the proper balance of each passage’s value. Additionally, reconciling differences in writing styles and tones to ensure consistency in the output is crucial.

Lastly, there’s a risk of generation models overly depending on augmented information, potentially resulting in outputs that merely reiterate the retrieved content without providing new value or synthesized information.

3.2 Advanced RAG

Advanced RAG has been developed with targeted enhancements to address the shortcomings of Naive RAG. In terms of retrieval quality, Advanced RAG implements pre-retrieval and post-retrieval strategies. To address the indexing challenges experienced by Naive RAG, Advanced RAG has refined its indexing approach using techniques such as sliding window, fine-grained segmentation, and metadata. It has also introduced various methods to optimize the retrieval process [ILIN, 2023].

Pre-Retrieval Process Optimizing Data Indexing. The goal of optimizing data indexing is to enhance the quality of the content being indexed. This involves five primary strategies: enhancing data granularity, optimizing index structures, adding metadata, alignment optimization, and mixed retrieval.

Enhancing data granularity aims to elevate text standardization, consistency, factual accuracy, and rich context to improve the RAG system’s performance. This includes removing irrelevant information, dispelling ambiguity in entities and terms, confirming factual accuracy, maintaining context, and updating outdated documents.

Optimizing index structures involves adjusting the size of chunks to capture relevant context, querying across multiple index paths, and incorporating information from the graph structure to capture relevant context by leveraging relationships between nodes in a graph data index.

Adding metadata information involves integrating referenced metadata, such as dates and purposes, into chunks for filtering purposes, and incorporating metadata like chapters and subsections of references to improve retrieval efficiency. Alignment optimization addresses alignment issues and disparities between documents by introducing “hypothetical questions” [Li et al., 2023d] into documents to rectify alignment issues and differences.

Retrieval During the retrieval stage, the primary focus is on identifying the appropriate context by calculating the similarity between the query and chunks. The embedding model is central to this process. In Advanced RAG, there is potential for optimization of the embedding models.

Fine-tuning Embedding. Fine-tuning embedding models significantly impacts the relevance of retrieved content in RAG systems. This process involves customizing embedding models to enhance retrieval relevance in domain-specific contexts, especially for professional domains dealing with evolving or rare terms. The BGE embedding model [BAAI, 2023], such as BGE-large-EN developed by BAAI2, is an example of a high-performance embedding model that can be fine-tuned to optimize retrieval relevance. Training data for fine-tuning can be generated using language models like GPT-3.5-turbo to formulate questions grounded on document chunks, which are then used as fine-tuning pairs.
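
That pair-generation recipe can be sketched as follows, with `ask_llm` standing in for a call to a model such as GPT-3.5-turbo and the prompt wording being an assumption of this sketch:

```python
def make_finetune_pairs(chunks, ask_llm):
    """For each chunk, have an LLM write a question grounded in it;
    the (question, chunk) pairs become positives for embedding
    fine-tuning. `ask_llm` is a placeholder for a real LLM call."""
    prompt = "Write one question answerable only from this passage:\n\n{chunk}"
    return [(ask_llm(prompt.format(chunk=c)), c) for c in chunks]
```

The resulting pairs are fed to a contrastive fine-tuning loop, with other chunks in the batch serving as in-batch negatives.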

Dynamic Embedding adapts to the context in which words are used, unlike static embedding, which uses a single vector for each word [Karpukhin et al., 2020]. For example, in transformer models like BERT, the same word can have varied embeddings depending on the surrounding words. OpenAI’s text-embedding-ada-002 model3, built upon the principles of LLMs like GPT, is a sophisticated dynamic embedding model that captures contextual understanding. However, it may not exhibit the same sensitivity to context as the latest full-size language models like GPT-4.

2 https://huggingface.co/BAAI/bge-large-en

3 https://platform.openai.com/docs/guides/embeddings

Post-Retrieval Process After retrieving valuable context from the database, it is essential to merge it with the query as an input into LLMs while addressing challenges posed by context window limits. Simply presenting all relevant documents to the LLM at once may exceed the context window limit, introduce noise, and hinder the focus on crucial information. Additional processing of the retrieved content is necessary to address these issues.

Re-Ranking. Re-ranking the retrieved information to relocate the most relevant content to the edges of the prompt is a key strategy. This concept has been implemented in frameworks such as LlamaIndex4, LangChain5, and HayStack [Blagojevi, 2023]. For example, Diversity Ranker6 prioritizes reordering based on document diversity, while LostInTheMiddleRanker alternates placing the best document at the beginning and end of the context window. Additionally, approaches like cohereAI rerank [Cohere, 2023], bge-rerank7, and LongLLMLingua [Jiang et al., 2023a] recalculate the semantic similarity between relevant text and the query, addressing the challenge of interpreting vector-based simulated searches for semantic similarity.
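
The LostInTheMiddleRanker pattern (strongest documents at both edges of the context, weakest in the middle) can be reconstructed in a few lines; this is a sketch of the idea, not HayStack's actual implementation:

```python
def lost_in_the_middle_order(docs_by_relevance):
    """Reorder documents (given best-first) so the strongest ones sit
    at the beginning and end of the context window and the weakest
    land in the middle, where LLMs attend to them least."""
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]
```

For five documents ranked best-to-worst, the best ends up first, the second-best last, and the worst dead center.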

Prompt Compression. Research indicates that noise in retrieved documents adversely affects RAG performance. In post-processing, the emphasis lies in compressing irrelevant context, highlighting pivotal paragraphs, and reducing the overall context length. Approaches such as Selective Context and LLMLingua [Litman et al., 2020, Anderson et al., 2022] utilize small language models to calculate prompt mutual information or perplexity, estimating element importance. Recomp [Xu et al., 2023a] addresses this by training compressors at different granularities, while Long Context [Xu et al., 2023b] and “Walking in the Memory Maze” [Chen et al., 2023a] design summarization techniques to enhance the LLM’s key information perception, particularly when dealing with extensive contexts.

3.3 Modular RAG

The modular RAG structure diverges from the traditional Naive RAG framework, providing greater versatility and flexibility. It integrates various methods to enhance functional modules, such as incorporating a search module for similarity retrieval and applying a fine-tuning approach in the retriever [Lin et al., 2023]. Restructured RAG modules [Yu et al., 2022] and iterative methodologies like [Shao et al., 2023] have been developed to address specific issues. The modular RAG paradigm is increasingly becoming the norm in the RAG domain, allowing for either a serialized pipeline or an end-to-end training approach across multiple modules. The comparison of three RAG paradigms is depicted in Figure 3. However, Modular RAG is not standalone. Advanced RAG is a specialized form of modular RAG, and further, Naive RAG itself is a special case of Advanced RAG. The relationship among the three paradigms is one of inheritance and development.

4 https://www.llamaindex.ai

5 https://www.langchain.com/

6 https://haystack.deepset.ai/blog/enhancing-rag-pipelines-in-haystack

7 https://huggingface.co/BAAI/bge-reranker-large

Figure 3: Comparison between the three paradigms of RAG

New Modules

Search Module. In contrast to the similarity retrieval in Naive/Advanced RAG, the Search Module is tailored to specific scenarios and incorporates direct searches on additional corpora. This integration is achieved using code generated by the LLM, query languages such as SQL or Cypher, and other custom tools. The data sources for these searches can include search engines, text data, tabular data, and knowledge graphs [Wang et al., 2023d].

Memory Module. This module harnesses the memory capabilities of the LLM to guide retrieval. The approach involves identifying memories most similar to the current input. Selfmem [Cheng et al., 2023b] utilizes a retrieval-enhanced generator to create an unbounded memory pool iteratively, combining the “original question” and “dual question”. By employing a retrieval-enhanced generative model that uses its own outputs to improve itself, the text becomes more aligned with the data distribution during the reasoning process. Consequently, the model’s own outputs are utilized instead of the training data [Wang et al., 2022a].

Fusion. RAG-Fusion [Raudaschl, 2023] enhances traditional search systems by addressing their limitations through a multi-query approach that expands user queries into multiple, diverse perspectives using an LLM. This approach not only captures the explicit information users seek but also uncovers deeper, transformative knowledge. The fusion process involves parallel vector searches of both original and expanded queries, intelligent re-ranking to optimize results, and pairing the best outcomes with new queries. This sophisticated method ensures search results that align closely with both the explicit and implicit intentions of the user, leading to more insightful and relevant information discovery.
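
The merging step of such multi-query pipelines is commonly implemented with reciprocal rank fusion; a sketch (RAG-Fusion's exact re-ranking formula is not given here, so RRF is an assumption):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one.

    Each document scores 1/(k + rank) in every list it appears in;
    k=60 is the conventional smoothing constant.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that place consistently well across the expanded queries rise to the top even if no single query ranked them first.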

Routing. The RAG system’s retrieval process utilizes diverse sources, differing in domain, language, and format, which can be either alternated or merged based on the situation [Li et al., 2023b]. Query routing decides the subsequent action to a user’s query, with options ranging from summarization, searching specific databases, or merging different pathways into a single response. The query router also chooses the appropriate data store for the query, which may include various sources like vector stores, graph databases, or relational databases, or a hierarchy of indices—for instance, a summary index and a document block vector index for multidocument storage. The query router’s decision-making is predefined and executed via LLMs calls, which direct the query to the chosen index.
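
In practice the routing decision is delegated to an LLM call over route descriptions; a keyword-based stand-in illustrates the interface (the route names and trigger phrases below are invented for illustration):

```python
# Hypothetical routing table: store name -> trigger phrases.
ROUTES = {
    "graph_db": ["connected to", "relationship", "path between"],
    "sql_db": ["count", "average", "total"],
    "vector_store": [],  # default: semantic similarity search
}

def route_query(query, default="vector_store"):
    """Pick a data store for the query by matching trigger phrases;
    a real router would replace this matching with an LLM call."""
    q = query.lower()
    for store, triggers in ROUTES.items():
        if any(t in q for t in triggers):
            return store
    return default
```
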

Predict. This module addresses the common issues of redundancy and noise in retrieved content. Instead of directly retrieving from a data source, it utilizes the LLM to generate the necessary context [Yu et al., 2022]. The content produced by the LLM is more likely to contain pertinent information compared to that obtained through direct retrieval.

Task Adapter. This module focuses on adapting RAG to a variety of downstream tasks. UPRISE automates the retrieval of prompts for zero-shot task inputs from a pre-constructed data pool, thereby enhancing universality across tasks and models [Cheng et al., 2023a]. Meanwhile, PROMPTAGATOR [Dai et al., 2022] utilizes LLM as a few-shot query generator and, based on the generated data, creates task-specific retrievers. By leveraging the generalization capability of LLMs, it enables the development of task-specific end-to-end retrievers with minimal examples.

New Patterns. The organizational structure of Modular RAG is highly adaptable, allowing for the substitution or rearrangement of modules within the RAG process to suit specific problem contexts. Naive RAG and Advanced RAG can both be considered as being composed of some fixed modules. As illustrated in Figure 3, Naive RAG primarily consists of the “Retrieve” and “Read” modules. A typical pattern of Advanced RAG builds upon the foundation of Naive RAG by adding “Rewrite” and “Rerank” modules. However, on the whole, Modular RAG enjoys greater diversity and flexibility.

Current research primarily explores two organizational paradigms. The first involves adding or replacing modules, while the second focuses on adjusting the organizational flow between modules. This flexibility enables tailoring the RAG process to effectively address a wide array of tasks.

Adding or Replacing Modules. The strategy of introducing or substituting modules involves maintaining the core structure of the Retrieve-Read process while integrating additional modules to enhance specific functionalities. The RRR model [Ma et al., 2023a] introduces the Rewrite-Retrieve-Read process, utilizing the LLM’s performance as a reinforcement learning incentive for a rewriting module. This enables the rewriter to fine-tune retrieval queries, thereby improving the downstream task performance of the reader.

Similarly, modules can be selectively swapped in methodologies like Generate-Read [Yu et al., 2022], where the LLM’s generation module takes the place of the retrieval module. The Recite-Read approach [Sun et al., 2022] transforms external retrieval into retrieval from model weights, requiring the LLM to initially memorize task-specific information and subsequently produce output capable of handling knowledge-intensive natural language processing tasks.

Adjusting the Flow between Modules. In the realm of module flow adjustment, there is a focus on enhancing the interaction between language models and retrieval models. DSP [Khattab et al., 2022] introduces the Demonstrate-Search-Predict framework, treating the in-context learning system as an explicit program rather than a final task prompt, leading to more effective handling of knowledge-intensive tasks. The ITER-RETGEN [Shao et al., 2023] approach utilizes generated content to guide retrieval, iteratively implementing “retrieval-enhanced generation” and “generation-enhanced retrieval” within a Retrieve-Read-Retrieve-Read flow. This method demonstrates an innovative way of using one module’s output to improve the functionality of another.

Optimizing the RAG Pipeline The optimization of the retrieval process aims to enhance the efficiency and quality of information in RAG systems. Current research focuses on integrating diverse search technologies, refining retrieval steps, incorporating cognitive backtracking, implementing versatile query strategies, and leveraging embedding similarity. These efforts collectively strive to achieve a balance between retrieval efficiency and the depth of contextual information in RAG systems.

Hybrid Search Exploration. The RAG system optimizes its performance by intelligently integrating various techniques, including keyword-based search, semantic search, and vector search. This approach leverages the unique strengths of each method to accommodate diverse query types and information needs, ensuring consistent retrieval of highly relevant and context-rich information. The use of hybrid search serves as a robust supplement to retrieval strategies, thereby enhancing the overall efficacy of the RAG pipeline.
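
A common realization of hybrid search min-max normalizes the lexical (e.g., BM25) and vector scores onto a shared scale and ranks by a weighted combination; the weighting scheme below is an assumption for illustration, not one the survey prescribes:

```python
def hybrid_search(docs, lexical_scores, vector_scores, alpha=0.5):
    """Rank docs by a convex combination of two score sets.

    alpha weights the lexical (keyword) side; (1 - alpha) weights
    the vector-similarity side. Both score dicts map doc -> score.
    """
    def normalize(scores):
        # Min-max normalize so the two scales are comparable.
        lo, hi = min(scores.values()), max(scores.values())
        span = hi - lo
        return {d: (s - lo) / span if span else 0.0 for d, s in scores.items()}

    lex, vec = normalize(lexical_scores), normalize(vector_scores)
    return sorted(docs,
                  key=lambda d: alpha * lex[d] + (1 - alpha) * vec[d],
                  reverse=True)
```

Tuning alpha per corpus trades exact keyword matching against semantic recall.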

Recursive Retrieval and Query Engine. Recursive retrieval involves acquiring smaller chunks during the initial retrieval phase to capture key semantic meanings. Subsequently, larger chunks containing more contextual information are provided to the LLM in later stages of the process. This two-step retrieval method helps to strike a balance between efficiency and the delivery of contextually rich responses.
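
This two-step pattern can be sketched as a mapping from each small chunk to its enclosing parent block; `retrieve_small` stands for any retriever over the small chunks (both names are placeholders):

```python
def small_to_big(query, retrieve_small, small_to_parent):
    """Match on small chunks, but return their larger parent blocks,
    de-duplicated and in retrieval order, for the LLM to read."""
    parents = []
    for chunk_id in retrieve_small(query):
        parent = small_to_parent[chunk_id]
        if parent not in parents:
            parents.append(parent)
    return parents
```

Matching on small chunks keeps the similarity signal sharp, while reading the parents restores the surrounding context.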

The StepBack-prompt approach encourages the LLM to move away from specific instances and engage in reasoning around broader concepts and principles [Zheng et al., 2023]. Experimental results demonstrate a significant performance increase in various challenging, inference-based tasks when backward prompts are used, highlighting their natural adaptability to the RAG process. These retrieval-enhancing steps can be applied both in generating responses to backward prompts and in the final question-answering process.

Sub-Queries. Depending on the scenario, various query strategies can be employed, such as using query engines provided by frameworks like LlamaIndex, leveraging tree queries, utilizing vector queries, or executing simple sequential querying of chunks.

Hypothetical Document Embeddings. HyDE operates on the premise that a generated answer may be closer in the embedding space to the relevant documents than the query itself. Using the LLM, HyDE creates a hypothetical document (answer) in response to a query, embeds this document, and uses the resulting embedding to retrieve real documents similar to the hypothetical one. Instead of seeking embedding similarity based on the query, this approach focuses on the embedding similarity from one answer to another [Gao et al., 2022]. However, it might not consistently produce desirable outcomes, especially when the language model is unfamiliar with the subject matter, in which case it can generate more error-prone results.
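
A sketch of the HyDE loop, with `generate` standing in for the LLM call and toy two-dimensional vectors standing in for a learned dense encoder:

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def hyde_retrieve(query, generate, embed, doc_embeddings, k=1):
    """HyDE: embed an LLM-generated hypothetical answer instead of
    the raw query, then return the ids of the real documents whose
    embeddings are nearest to it."""
    hypothetical = generate(query)   # LLM call in a real system
    h_vec = embed(hypothetical)
    ranked = sorted(doc_embeddings.items(),
                    key=lambda item: dot(h_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
```

The hypothetical document may be factually wrong; what matters is that its embedding lands in the right neighborhood.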

4 Retrieval

In the context of RAG, it is crucial to efficiently retrieve relevant documents from the data source. However, creating a proficient retriever presents significant challenges. This section delves into three fundamental questions: 1) How can we achieve accurate semantic representations? 2) What methods can align the semantic spaces of queries and documents? 3) How can the retriever’s output be aligned with the preferences of the Large Language Model?

The combination of these diverse methods has led to notable advancements, resulting in enhanced retrieval outcomes and improved performance for RAG.

4.1 Enhancing Semantic Representations

In RAG, the semantic space is essential as it involves the multidimensional mapping of queries and documents. Retrieval accuracy in this semantic space significantly impacts RAG outcomes. This section will present two methods for building accurate semantic spaces.

Chunk optimization When managing external documents, the initial step involves breaking them down into smaller chunks to extract fine-grained features, which are then embedded to represent their semantics. However, embedding overly large or excessively small text chunks may lead to sub-optimal outcomes. Therefore, identifying the optimal chunk size for documents within the corpus is crucial to ensuring the accuracy and relevance of the retrieved results.

Choosing an appropriate chunking strategy requires careful consideration of several vital factors, such as the nature of the indexed content, the embedding model and its optimal block size, the expected length and complexity of user queries, and the specific application’s utilization of the retrieved results. For instance, the selection of a chunking model should be based on whether the content is longer or shorter. Additionally, different embedding models demonstrate distinct performance characteristics at varying block sizes. For example, sentence-transformer performs better with single sentences, while text-embedding-ada-002 excels with blocks containing 256 or 512 tokens.

Additionally, factors like the length and complexity of user input questions, and the specific needs of the application (e.g., semantic search or question answering), affect the choice of a chunking strategy. This choice can be directly influenced by the token limits of the selected LLMs, requiring adjustments to the block size. In reality, obtaining precise query results involves flexibly applying different chunking strategies. There is no one-size-fits-all “best” strategy, only the most appropriate one for a particular context.

Current research in RAG explores various block optimization techniques aimed at improving both retrieval efficiency and accuracy. One such approach involves the use of sliding window technology, enabling layered retrieval by merging globally related information across multiple retrieval processes. Another strategy, known as the “small2big” method, utilizes small text blocks during the initial search phase and subsequently provides larger related text blocks to the language model for processing.

The abstract embedding technique prioritizes top K retrieval based on document abstracts (or summaries), offering a comprehensive understanding of the entire document context. Additionally, the metadata filtering technique leverages document metadata to enhance the filtering process. An innovative approach, the graph indexing technique, transforms entities and relationships into nodes and connections, significantly improving relevance, particularly in the context of multi-hop problems.

Fine-tuning Embedding Models Once the appropriate size of chunks is determined, the next crucial step involves embedding these chunks and the query into the semantic space using an embedding model. The effectiveness of the embedding is critical as it impacts the model’s ability to represent the corpus. Recent research has introduced prominent embedding models such as AngIE, Voyage, and BGE [Li and Li, 2023, VoyageAI, 2023, BAAI, 2023]. These models have undergone pre-training on extensive corpora. However, their capability to accurately capture domain-specific information may be limited when applied to specialized domains.

Moreover, task-specific fine-tuning of embedding models is essential to ensure that the model comprehends the user query in terms of content relevance. A model without fine-tuning may not adequately address the requirements of a specific task. Consequently, fine-tuning an embedding model becomes crucial for downstream applications. There are two primary paradigms in embedding fine-tuning methods.

Domain Knowledge Fine-tuning. To ensure that an embedding model accurately captures domain-specific information, it is imperative to utilize domain-specific datasets for fine-tuning. This process diverges from standard language model fine-tuning, chiefly in the nature of the datasets involved. Typically, the dataset for embedding model fine-tuning encompasses three principal elements: queries, a corpus, and relevant documents. The model employs these queries to identify pertinent documents within the corpus. The efficacy of the model is then gauged based on its ability to retrieve these relevant documents in response to the queries. The dataset construction, model fine-tuning, and evaluation phases each present distinct challenges. The LlamaIndex [Liu, 2023] introduces a suite of pivotal classes and functions designed to enhance the embedding model fine-tuning workflow, thereby simplifying these intricate processes. By curating a corpus infused with domain knowledge and leveraging the methodologies offered, one can adeptly fine-tune an embedding model to align closely with the specific requirements of the target domain.
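
The evaluation phase described above usually measures whether the known-relevant document surfaces in the top-k results; a hand-rolled hit-rate metric as a minimal stand-in (LlamaIndex ships richer evaluation tooling):

```python
def hit_rate(retriever, eval_set, k=5):
    """Fraction of queries whose relevant document id appears in the
    top-k ids returned by retriever(query, k).

    `eval_set` maps each query to its single known-relevant doc id.
    """
    hits = sum(1 for query, relevant in eval_set.items()
               if relevant in retriever(query, k))
    return hits / len(eval_set)
```

Comparing hit rate before and after fine-tuning on a held-out query set shows whether the adapted embeddings actually improved domain retrieval.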

Fine-tuning for Downstream Tasks. Fine-tuning embedding models for downstream tasks is a critical step in enhancing model performance. In the realm of utilizing RAG for these tasks, innovative methods have emerged to fine-tune embedding models by harnessing the capabilities of LLMs. For example, PROMPTAGATOR [Dai et al., 2022] utilizes the LLM as a few-shot query generator to create task-specific retrievers, addressing challenges in supervised fine-tuning, particularly in data-scarce domains. Another approach, LLM-Embedder [Zhang et al., 2023a], exploits LLMs to generate reward signals for data across multiple downstream tasks. The retriever is fine-tuned with two types of supervised signals: hard labels for the dataset and soft rewards from the LLMs. This dual-signal approach fosters a more effective fine-tuning process, tailoring the embedding model to diverse downstream applications.

While these methods improve semantic representation by incorporating domain knowledge and task-specific fine-tuning, retrievers may not always exhibit optimal compatibility with certain LLMs. To address this, some researchers have explored direct supervision of the fine-tuning process using feedback from LLMs. This direct supervision seeks to align the retriever more closely with the LLM, thereby improving performance on downstream tasks. A more comprehensive discussion on this topic is presented in Section 4.3.

4.2 Aligning Queries and Documents

In the context of RAG applications, retrievers may utilize a single embedding model for encoding both the query and the documents, or employ separate models for each. Additionally, the user’s original query may suffer from imprecise phrasing and lack of semantic information. Therefore, it is crucial to align the semantic space of the user’s query with those of the documents. This section introduces two fundamental techniques aimed at achieving this alignment.

Query Rewriting Query rewriting is a fundamental approach for aligning the semantics of a query and a document. Methods such as Query2Doc and ITER-RETGEN leverage LLMs to create a pseudo-document by combining the original query with additional guidance [Wang et al., 2023c, Shao et al., 2023]. HyDE constructs query vectors using textual cues to generate a “hypothetical” document capturing essential patterns [Gao et al., 2022]. RRR introduces a framework that reverses the traditional retrieval and reading order, focusing on query rewriting [Ma et al., 2023a]. STEP-BACK PROMPTING enables LLMs to perform abstract reasoning and retrieval based on high-level concepts [Zheng et al., 2023]. Additionally, the multi-query retrieval method utilizes LLMs to generate and execute multiple search queries simultaneously, which is advantageous for addressing complex problems with multiple sub-problems.

Embedding Transformation Beyond broad strategies such as query rewriting, there exist more granular techniques specifically designed for embedding transformations. LlamaIndex [Liu, 2023] exemplifies this by introducing an adapter module that can be integrated following the query encoder. This adapter facilitates fine-tuning, thereby optimizing the representation of query embeddings to map them into a latent space that is more closely aligned with the intended tasks.

The challenge of aligning queries with structured external documents, particularly when addressing the incongruity between structured and unstructured data, is addressed by SANTA [Li et al., 2023d]. It enhances the retriever’s sensitivity to structured information through two pre-training strategies: first, by leveraging the intrinsic alignment between structured and unstructured data to inform contrastive learning in a structured-aware pre-training scheme; and second, by implementing Masked Entity Prediction. The latter utilizes an entity-centric masking strategy that encourages language models to predict and fill in the masked entities, thereby fostering a deeper understanding of structured data.

4.3 Aligning Retriever and LLM

In the RAG pipeline, enhancing retrieval hit rate through various techniques may not necessarily improve the final outcome, as the retrieved documents may not align with the specific requirements of the LLMs. Therefore, this section introduces two methods aimed at aligning the retriever outputs with the preferences of the LLMs.

Fine-tuning Retrievers Several studies utilize feedback signals from LLMs to refine retrieval models. For instance, AAR [Yu et al., 2023b] introduces supervisory signals for a pre-trained retriever using an encoder-decoder architecture. This is achieved by identifying the LM’s preferred documents through FiD cross-attention scores. Subsequently, the retriever undergoes fine-tuning with hard negative sampling and standard cross-entropy loss. Ultimately, the refined retriever can be directly applied to enhance unseen target LMs, resulting in improved performance in the target task. Additionally, it is suggested that LLMs may have a preference for focusing on readable rather than information-rich documents.

REPLUG [Shi et al., 2023] utilizes a retriever and an LLM to calculate the probability distributions of the retrieved documents and then performs supervised training by computing the KL divergence. This straightforward and effective training method enhances the performance of the retrieval model by using an LM as the supervisory signal, eliminating the need for specific cross-attention mechanisms.
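The core of this training signal can be sketched in a few lines. The following is a minimal illustration with toy score lists rather than real model outputs; the function names are ours, not from the paper:

```python
import math

def softmax(scores):
    """Normalize raw scores into a probability distribution."""
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def replug_style_kl(retriever_scores, lm_log_likelihoods):
    """KL divergence between the retriever's distribution over retrieved
    documents and the LM's preference distribution (the likelihood the LM
    assigns to the gold answer given each document). Minimizing this pushes
    the retriever toward documents the LM actually finds useful."""
    p = softmax(retriever_scores)      # retriever's document distribution
    q = softmax(lm_log_likelihoods)    # LM's preference distribution
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

When the two distributions agree, the divergence is zero; any mismatch produces a positive loss that the retriever can be trained to reduce.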

UPRISE [Cheng et al., 2023a] also employs frozen LLMs to fine-tune the prompt retriever. Both the LLM and the retriever take prompt-input pairs as inputs and utilize the scores provided by the LLM to supervise the retriever’s training, effectively treating the LLM as a dataset labeler. In addition, Atlas [Izacard et al., 2022] proposes four methods of supervised fine-tuning embedding models:

  • Attention Distillation. This approach employs cross-attention scores generated by the LLM during output to distill the model’s knowledge.
  • EMDR2. By using the Expectation-Maximization algorithm, this method trains the model with retrieved documents as latent variables.
  • Perplexity Distillation. This method directly trains the model using the perplexity of generated tokens as an indicator.
  • LOOP. This method presents a novel loss function based on the impact of document deletion on LLM prediction, offering an efficient training strategy to better adapt the model to specific tasks.

These approaches aim to improve the synergy between the retriever and the LLM, leading to enhanced retrieval performance and more accurate responses to user inquiries.

Adapters Fine-tuning models may present challenges, such as integrating functionality through an API or addressing constraints arising from limited local computational resources. Consequently, some approaches opt to incorporate an external adapter to aid in alignment.

PRCA trains the adapter through a context extraction phase and a reward-driven phase; the retriever’s output is then optimized using a token-based autoregressive strategy [Yang et al., 2023b]. The token filtering approach employs cross-attention scores to efficiently filter tokens, selecting only the highest-scoring input tokens [Berchansky et al., 2023]. RECOMP introduces both extractive and generative compressors for summary generation. These compressors either select relevant sentences or synthesize document information, creating summaries tailored to multi-document queries [Xu et al., 2023a].
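The token-filtering idea is simple enough to sketch directly. Below is a toy version (our own construction, not the authors’ code) that keeps only the highest-scoring input tokens while preserving their original order:

```python
def filter_tokens_by_attention(tokens, attention_scores, keep_ratio=0.5):
    """Keep only the highest-scoring input tokens (e.g., by cross-attention
    score), preserving their original order in the sequence."""
    k = max(1, int(len(tokens) * keep_ratio))
    # indices of the top-k tokens by score
    top = sorted(range(len(tokens)),
                 key=lambda i: attention_scores[i], reverse=True)[:k]
    # re-sort indices so surviving tokens keep their original order
    return [tokens[i] for i in sorted(top)]
```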

Furthermore, PKG introduces an innovative method for integrating knowledge into white-box models via directive fine-tuning [Luo et al., 2023]. In this approach, the retriever module is directly substituted to generate relevant documents according to a query. This method assists in addressing the difficulties encountered during the fine-tuning process and enhances model performance.

5 Generation

A crucial component of RAG is its generator, which is responsible for converting retrieved information into coherent and fluent text. Unlike traditional language models, RAG’s generator sets itself apart by improving accuracy and relevance via the incorporation of retrieved data. In RAG, the generator’s input encompasses not only typical contextual information but also relevant text segments obtained through the retriever. This comprehensive input enables the generator to gain a deep understanding of the question’s context, resulting in more informative and contextually relevant responses. Furthermore, the generator is guided by the retrieved text to ensure coherence between the generated content and the obtained information. The diverse input data has led to targeted efforts during the generation phase, all aimed at refining the adaptation of the large model to the input data derived from queries and documents. In the following subsections, we will explore the generator by delving into aspects of post-retrieval processing and fine-tuning.

5.1 Post-retrieval with Frozen LLM

In the realm of untunable LLMs, many studies rely on well-established models like GPT-4 [OpenAI, 2023] to harness their comprehensive internal knowledge for systematically synthesizing retrieved information from various documents.

However, challenges persist with these large models, including limitations on context length and susceptibility to redundant information. To tackle these issues, certain research endeavors have turned their focus to post-retrieval processing.

Post-retrieval processing involves treating, filtering, or optimizing the relevant information retrieved by the retriever from a large document database. Its main goal is to enhance the quality of retrieval results, aligning them more closely with user needs or subsequent tasks. It can be viewed as a reprocessing of the documents obtained during the retrieval phase. Common operations in post-retrieval processing typically include information compression and result reranking.

Information Compression The retriever excels at retrieving relevant information from a vast knowledge base, but managing the substantial amount of information within retrieval documents is a challenge. Ongoing research aims to extend the context length of large language models to tackle this issue. However, current large models still struggle with context limitations. Therefore, there are scenarios where condensing information becomes necessary. Information condensation is significant for reducing noise, addressing context length restrictions, and enhancing generation effects.
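A rudimentary form of information compression is extractive sentence selection. The sketch below (our own toy example, not drawn from any cited system) keeps only the sentences with the highest query-term overlap:

```python
def compress_by_sentence_selection(query_terms, document, max_sentences=2):
    """Extractive compression: keep only the sentences with the highest
    query-term overlap, reducing noise and context length."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]

    def overlap(s):
        words = set(s.lower().split())
        return sum(1 for t in query_terms if t.lower() in words)

    # pick the top-scoring sentences, then restore document order
    keep = sorted(range(len(sentences)),
                  key=lambda i: overlap(sentences[i]),
                  reverse=True)[:max_sentences]
    return ". ".join(sentences[i] for i in sorted(keep)) + "."
```

Real compressors (e.g., the trained extractors discussed next) replace the lexical-overlap heuristic with a learned scoring model, but the pipeline shape is the same.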

PRCA tackled this issue by training an information extractor [Yang et al., 2023b]. In the context extraction phase, when provided with an input text S_input, it is capable of producing an output sequence C_extracted that represents the condensed context from the input document. The training process is designed to minimize the difference between C_extracted and the actual context C_truth.

Similarly, RECOMP adopts a comparable approach by training an information condenser using contrastive learning [Xu et al., 2023a]. Each training data point consists of one positive sample and five negative samples, and the encoder undergoes training using contrastive loss throughout this process [Karpukhin et al., 2020] .
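The contrastive objective with one positive and several negatives is the standard InfoNCE-style loss. As a rough sketch (operating on precomputed similarity scores rather than an actual encoder):

```python
import math

def contrastive_loss(sim_pos, sim_negs, temperature=1.0):
    """InfoNCE-style contrastive loss over one positive and several
    negative similarity scores: the negative log-probability that the
    positive is ranked first among all candidates."""
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    m = max(logits)  # log-sum-exp with max subtraction for stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_z)
```

With one positive and five negatives all scored equally, the loss is ln 6; as the positive’s similarity grows relative to the negatives, the loss approaches zero.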

Another study has taken a different approach by aiming to reduce the number of documents in order to improve the accuracy of the model’s answers. In the study by [Ma et al., 2023b], they propose the “Filter-Reranker” paradigm, which combines the strengths of LLMs and Small Language Models (SLMs). In this paradigm, SLMs serve as filters, while LLMs function as reordering agents. The research shows that instructing LLMs to rearrange challenging samples identified by SLMs leads to significant improvements in various Information Extraction (IE) tasks.

Reranking The re-ranking model is pivotal in optimizing the document set retrieved from the retriever. Language models often face performance declines when additional context is introduced, and re-ranking effectively addresses this issue. The core concept involves rearranging document records to prioritize the most relevant items at the top, thereby limiting the total number of documents. This not only resolves the challenge of context window expansion during retrieval but also enhances retrieval efficiency and responsiveness.

The re-ranking model assumes a dual role throughout the information retrieval process, functioning as both an optimizer and a refiner. It provides more effective and accurate input for subsequent language model processing [Zhuang et al., 2023].

Contextual compression is incorporated into the reordering process to offer more precise retrieval information. This method entails reducing the content of individual documents and filtering the entire document set, with the ultimate goal of presenting the most relevant information in the search results for a more focused and accurate display of pertinent content.
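The re-ranking step described above can be sketched with a toy lexical scorer (our own stand-in; production systems use trained cross-encoders instead):

```python
def rerank(query, documents, top_k=3):
    """Toy lexical re-ranker: score each retrieved document by query-term
    overlap, move the most relevant to the top, and truncate the list to
    keep the generator's context window small."""
    terms = set(query.lower().split())

    def score(doc):
        return len(terms & set(doc.lower().split()))

    return sorted(documents, key=score, reverse=True)[:top_k]
```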

5.2 Fine-tuning LLM for RAG

Optimizing the generator within the RAG model is a critical aspect of its architecture. The generator’s role is to take the retrieved information and produce relevant text, forming the final output of the model. The optimization of the generator aims to ensure that the generated text is both natural and effectively leverages the retrieved documents to better meet the user’s query needs.

In standard LLM generation tasks, the input typically consists of a query. RAG stands out by incorporating not only a query but also various documents (structured/unstructured) retrieved by the retriever into the input. This additional information can significantly influence the model’s understanding, particularly for smaller models. In such cases, fine-tuning the model to adapt to the input of both query and retrieved documents becomes crucial. Before presenting the input to the fine-tuned model, post-retrieval processing usually occurs for the documents retrieved by the retriever. It is essential to note that the fine-tuning method for the generator in RAG aligns with the general fine-tuning approach for LLMs. In the following, we will briefly describe some representative works involving data (formatted/unformatted) and optimization functions.

General Optimization Process As part of the general optimization process, the training data typically consists of input-output pairs, aiming to train the model to produce the output y given the input x. In the work of Self-Mem [Cheng et al., 2023b], a traditional training process is employed, where given the input x, relevant documents z are retrieved (selecting Top-1 in the paper), and after integrating (x, z), the model generates the output y. The paper utilizes two common paradigms for fine-tuning, namely Joint-Encoder and Dual-Encoder [Arora et al., 2023, Wang et al., 2022b, Lewis et al., 2020, Xia et al., 2019, Cai et al., 2021, Cheng et al., 2022].

In the Joint-Encoder paradigm, a standard model based on an encoder-decoder is used. Here, the encoder initially encodes the input, and the decoder, through attention mechanisms, combines the encoded results to generate tokens in an autoregressive manner. On the other hand, in the DualEncoder paradigm, the system sets up two independent encoders, with each encoder encoding the input (query, context) and the document, respectively. The resulting outputs undergo bidirectional cross-attention processing by the decoder in sequence. Both architectures utilize the Transformer [Vaswani et al., 2017] as the foundational block and optimize with Negative Log-Likelihood loss.

Utilizing Contrastive Learning In the phase of preparing training data for language models, interaction pairs of input and output are usually created. This traditional method can lead to “exposure bias,” where the model is trained only on individual, correct output examples, restricting its exposure to the range of possible outputs. This limitation can hinder the model’s real-world performance by causing it to overfit to the particular examples in the training set, thereby reducing its ability to generalize across various contexts.

To mitigate exposure bias, SURGE [Kang et al., 2023] proposes the use of graph-text contrastive learning. This method includes a contrastive learning objective that prompts the model to produce a range of plausible and coherent responses, expanding beyond the instances encountered in the training data. This approach is crucial in reducing overfitting and strengthening the model’s ability to generalize.

For retrieval tasks that engage with structured data, the SANTA framework [Li et al., 2023d] implements a tripartite training regimen to effectively encapsulate both structural and semantic nuances. The initial phase focuses on the retriever, where contrastive learning is harnessed to refine the query and document embeddings.

Subsequently, the generator’s preliminary training stage employs contrastive learning to align the structured data with its unstructured document descriptions. In a further stage of generator training, the model acknowledges the critical role of entity semantics in the representation learning of textual data for retrieval, as highlighted by [Sciavolino et al., 2021, Zhang et al., 2019]. This process commences with the identification of entities within the structured data, followed by the application of masks over these entities within the generator’s input data, thus setting the stage for the model to anticipate and predict these masked elements.

The training regimen progresses with the model learning to reconstruct the masked entities by leveraging contextual information. This exercise cultivates the model’s comprehension of the textual data’s structural semantics and facilitates the alignment of pertinent entities within the structured data. The overarching optimization goal is to train the language model to accurately restore the obscured spans, thereby enriching its understanding of entity semantics [Ye et al., 2020].

6 Augmentation in RAG

This section is structured around three key aspects: the augmentation stage, sources of augmentation data, and the augmentation process. These facets elucidate the critical technologies pivotal to RAG’s development. A taxonomy of RAG’s core components is presented in Figure 4.

6.1 RAG in Augmentation Stages

RAG, a knowledge-intensive endeavor, incorporates a variety of technical methodologies across the pre-training, fine-tuning, and inference stages of language model training.

Pre-training Stage During the pre-training stage, researchers have investigated methods to bolster PTMs for open-domain QA through retrieval-based strategies. The REALM model adopts a structured, interpretable method for knowledge embedding, framing pre-training and fine-tuning as a retrieve-then-predict workflow within the masked language model (MLM) framework [Arora et al., 2023].

Figure 4: Taxonomy of RAG’s core components

RETRO [Borgeaud et al., 2022] leverages retrieval augmentation for large-scale pre-training from scratch, achieving a reduction in model parameters while surpassing standard GPT models in terms of perplexity. RETRO distinguishes itself with an additional encoder designed to process features of entities retrieved from an external knowledge base, building on the foundational structure of GPT models.

Atlas [Izacard et al., 2022] also incorporates a retrieval mechanism into the T5 architecture [Raffel et al., 2020] in both the pre-training and fine-tuning stages. It uses a pre-trained T5 to initialize the encoder-decoder language model and a pre-trained Contriever for the dense retriever, improving its efficiency for complex language modeling tasks.

Furthermore, COG [Lan et al., 2022] introduces a novel text generation methodology that emulates copying text fragments from pre-existing collections. Utilizing efficient vector search tools, COG computes and indexes contextually meaningful representations of text fragments, demonstrating superior performance in domains such as question-answering and domain adaptation when compared to RETRO.

The advent of scaling laws has catalyzed the growth of model parameters, propelling autoregressive models into the mainstream. Researchers are expanding the RAG approach to larger pre-trained models, with RETRO++ exemplifying this trend by scaling up the model parameters while preserving or enhancing performance [Wang et al., 2023b].

Empirical evidence underscores marked improvements in text generation quality, factual accuracy, reduced toxicity, and downstream task proficiency, especially in knowledge-intensive applications like open-domain QA. These results imply that integrating retrieval mechanisms into the pre-training of autoregressive language models constitutes a promising avenue, marrying sophisticated retrieval techniques with expansive language models to yield more precise and efficient language generation.

The benefits of augmented pre-training include a robust foundational model that outperforms standard GPT models in perplexity, text generation quality, and task-specific performance, all while utilizing fewer parameters. This method is particularly adept at handling knowledge-intensive tasks and facilitates the development of domain-specific models through training on specialized corpora.

Nonetheless, this approach faces challenges such as the necessity for extensive pre-training datasets and resources, as well as diminished update frequencies with increasing model sizes. Despite these hurdles, the approach offers significant advantages in model resilience. Once trained, retrieval-enhanced models can operate independently of external libraries, enhancing generation speed and operational efficiency. The potential gains identified render this methodology a compelling subject for ongoing investigation and innovation in artificial intelligence and machine learning.

Fine-tuning Stage RAG and Fine-tuning are powerful tools for enhancing LLMs, and combining the two can meet the needs of more specific scenarios. On one hand, fine-tuning allows for the retrieval of documents with a unique style, achieving better semantic expression and aligning the differences between queries and documents. This ensures that the output of the retriever is more aptly suited to the scenario at hand. On the other hand, fine-tuning can fulfill the generation needs of making stylized and targeted adjustments. Furthermore, fine-tuning can also be used to align the retriever and generator for improved model synergy.

The main goal of fine-tuning the retriever is to improve the quality of semantic representations, achieved by directly fine-tuning the Embedding model using a corpus [Liu, 2023]. By aligning the retriever’s capabilities with the preferences of the LLMs through feedback signals, both can be better coordinated [Yu et al., 2023b, Izacard et al., 2022, Yang et al., 2023b, Shi et al., 2023]. Fine-tuning the retriever for specific downstream tasks can lead to improved adaptability. The introduction of task-agnostic fine-tuning aims to enhance the retriever’s versatility in multi-task scenarios [Cheng et al., 2023a].

Fine-tuning the generator can result in outputs that are more stylized and customized. On one hand, it allows for specialized adaptation to different input data formats. For example, fine-tuning LLMs to fit the structure of knowledge graphs [Kang et al., 2023], the structure of text pairs [Kang et al., 2023, Cheng et al., 2023b], and other specific structures [Li et al., 2023d]. On the other hand, by constructing instruction datasets, one can direct LLMs to generate content in specific formats. For instance, in adaptive or iterative retrieval scenarios, LLMs are fine-tuned to generate content that will help determine the timing for the next step of action [Jiang et al., 2023b, Asai et al., 2023].

By synergistically fine-tuning both the retriever and the generator, we can enhance the model’s generalization capabilities and avoid overfitting that may arise from training them separately. However, joint fine-tuning also leads to increased resource consumption. RA-DIT [Lin et al., 2023] presents a lightweight, dual-instruction tuning framework that can effectively add retrieval capabilities to any LLMs. The retrieval-enhanced directive fine-tuning updates the LLM, guiding it to make more efficient use of the information retrieved and to disregard distracting content.

Despite its advantages, fine-tuning has limitations, including the need for specialized datasets for RAG fine-tuning and the requirement for significant computational resources. However, this stage allows for customizing models to specific needs and data formats, potentially reducing resource usage compared to the pre-training phase while still being able to fine-tune the model’s output style.

In summary, the fine-tuning stage is essential for the adaptation of RAG models to specific tasks, enabling the refinement of both retrievers and generators. This stage enhances the model’s versatility and adaptability to various tasks, despite the challenges presented by resource and dataset requirements. The strategic fine-tuning of RAG models is therefore a critical component in the development of efficient and effective retrieval-augmented systems.

Inference Stage The inference stage in RAG models is crucial, as it involves extensive integration with LLMs. Traditional RAG approaches, also known as Naive RAG, involve incorporating retrieval content at this stage to guide the generation process. To overcome the limitations of Naive RAG, advanced techniques introduce more contextually rich information during inference. The DSP framework [Khattab et al., 2022] utilizes a sophisticated exchange of natural language text between frozen LMs and retrieval models (RMs), enriching the context and thereby improving generation outcomes. The PKG [Luo et al., 2023] method equips LLMs with a knowledge-guided module that allows for the retrieval of pertinent information without modifying the LMs’ parameters, enabling more complex task execution. CREA-ICL [Li et al., 2023b] employs a synchronous retrieval of cross-lingual knowledge to enhance context, while RECITE [Sun et al., 2022] generates context by sampling paragraphs directly from LLMs.

Further refinement of the RAG process during inference is seen in approaches that cater to tasks necessitating multi-step reasoning. ITRG [Feng et al., 2023] iteratively retrieves information to identify the correct reasoning paths, thereby improving task adaptability. ITER-RETGEN [Shao et al., 2023] follows an iterative strategy, merging retrieval and generation in a cyclical process that alternates between “retrieval-enhanced generation” and “generation-enhanced retrieval”. For non-knowledge-intensive (NKI) tasks, PGRA [Guo et al., 2023] proposes a two-stage framework, starting with a task-agnostic retriever followed by a prompt-guided reranker to select and prioritize evidence. In contrast, IRCOT [Trivedi et al., 2022] combines RAG with Chain of Thought (CoT) methodologies, alternating CoT-guided retrievals with retrieval-informed CoT processes, significantly boosting GPT-3’s performance across various question-answering tasks.

In essence, these inference-stage enhancements provide lightweight, cost-effective alternatives that leverage the capabilities of pre-trained models without necessitating further training. The principal advantage is maintaining static LLM parameters while supplying contextually relevant information to meet specific task demands. Nevertheless, this approach is not without limitations, as it requires meticulous data processing and optimization, and is bound by the foundational model’s intrinsic capabilities. To address diverse task requirements effectively, this method is often paired with procedural optimization techniques such as step-wise reasoning, iterative retrieval, and adaptive retrieval strategies.

6.2 Augmentation Source

The effectiveness of RAG models is heavily impacted by the selection of data sources for augmentation. Different levels of knowledge and dimensions require distinct processing techniques. They are categorized as unstructured data, structured data, and content generated by LLMs. The technology tree of representative RAG research with different augmentation aspects is depicted in Figure 5. The leaves, colored in three different shades, represent enhancements using various types of data: unstructured data, structured data, and content generated by LLMs. The diagram clearly shows that initially, augmentation was mainly achieved through unstructured data, such as pure text. This approach later expanded to include the use of structured data (e.g. knowledge graph) for further improvement. More recently, there has been a growing trend in research that utilizes content generated by the LLMs themselves for retrieval and augmentation purposes.

Augmented with Unstructured Data Unstructured text is gathered from corpora, such as prompt data for fine-tuning large models [Cheng et al., 2023a] and cross-lingual data [Li et al., 2023b]. Retrieval units vary from tokens (e.g., kNN-LM [Khandelwal et al., 2019]) to phrases (e.g., NPM, COG [Lee et al., 2020, Lan et al., 2022]) and document paragraphs, with finer granularities offering precision at the cost of increased retrieval complexity.
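At the token level, kNN-LM interpolates the base LM’s next-token distribution with a distribution induced by nearest-neighbor lookup over a datastore of (context, token) pairs. A minimal sketch of the interpolation step (the dictionaries stand in for real distributions):

```python
def knn_lm_interpolate(p_lm, p_knn, lam=0.25):
    """kNN-LM-style interpolation: mix the base LM's next-token
    distribution with a distribution built from nearest-neighbor
    retrieval over a datastore of (context, token) pairs."""
    vocab = set(p_lm) | set(p_knn)
    return {t: (1 - lam) * p_lm.get(t, 0.0) + lam * p_knn.get(t, 0.0)
            for t in vocab}
```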

FLARE [Jiang et al., 2023b] introduces an active retrieval approach, triggered by the LM’s generation of low-probability words. It creates a temporary sentence for document retrieval, then regenerates the sentence with the retrieved context to predict subsequent sentences. RETRO uses the previous chunk to retrieve the nearest neighbor at the chunk level; combined with the previous chunk’s context, this guides the generation of the next chunk. To preserve causality, the generation of the next chunk C_i only utilizes the nearest neighbors of the previous chunk, N(C_{i-1}), and not N(C_i).
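FLARE’s trigger-and-regenerate loop can be sketched as follows. The `retrieve` and `regenerate` callables are hypothetical stand-ins for the retriever and the LM:

```python
def needs_retrieval(token_probs, threshold=0.4):
    """FLARE-style trigger: retrieval fires when the LM emits any
    low-probability (low-confidence) token in the tentative sentence."""
    return any(p < threshold for p in token_probs)

def flare_step(tentative_sentence, token_probs, retrieve, regenerate):
    """One FLARE iteration (sketch): if the tentative sentence contains a
    low-confidence token, use it as a query, retrieve documents, and
    regenerate the sentence with the retrieved context."""
    if needs_retrieval(token_probs):
        docs = retrieve(tentative_sentence)
        return regenerate(tentative_sentence, docs)
    return tentative_sentence  # confident enough: keep as-is
```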

Augmented with Structured Data Structured data, such as knowledge graphs (KGs), provide high-quality context and mitigate model hallucinations. RET-LLMs [Modarressi et al., 2023] constructs a knowledge graph memory from past dialogues for future reference. SUGRE [Kang et al., 2023] employs Graph Neural Networks (GNNs) to encode relevant KG subgraphs, ensuring consistency between retrieved facts and generated text through multi-modal contrastive learning. KnowledGPT [Wang et al., 2023d] generates KB search queries and stores knowledge in a personalized base, enhancing the RAG model’s knowledge richness and contextuality.

LLMs-Generated Content in RAG Addressing the limitations of external auxiliary information in RAG, some research has focused on exploiting LLMs’ internal knowledge. SKR [Wang et al., 2023e] classifies questions as known or unknown, applying retrieval enhancement selectively. GenRead [Yu et al., 2022] replaces the retriever with an LLM generator, finding that LLM-generated contexts often contain more accurate answers due to better alignment with the pre-training objectives of causal language modeling. Selfmem [Cheng et al., 2023b] iteratively creates an unbounded memory pool with a retrieval-enhanced generator, using a memory selector to choose outputs that serve as dual problems to the original question, thus self-enhancing the generative model.

These methodologies underscore the breadth of innovative data source utilization in RAG, striving to improve model performance and task effectiveness.

6.3 Augmentation Process

In the domain of RAG, the standard practice often involves a singular retrieval step followed by generation, which can lead to inefficiencies. A notable issue, termed the “lost in the middle” phenomenon, arises when a single retrieval yields redundant content that may dilute or contradict essential information, thereby degrading the generation quality [Liu et al., 2023a]. Furthermore, such singular retrieval is typically insufficient for complex problems demanding multi-step reasoning, as it provides a limited scope of information [Yoran et al., 2023].

As illustrated in Figure 5, to circumvent these challenges, contemporary research has proposed methods for refining the retrieval process: iterative retrieval, recursive retrieval, and adaptive retrieval. Iterative retrieval allows the model to engage in multiple retrieval cycles, enhancing the depth and relevance of the information obtained. Recursive retrieval is a process where the results of one retrieval operation are used as the input for the subsequent retrieval. It helps to delve deeper into relevant information, particularly when dealing with complex or multi-step queries. Recursive retrieval is often used in scenarios where a gradual approach is needed to converge on a final answer, such as in academic research, legal case analysis, or certain types of data mining tasks. Adaptive retrieval, on the other hand, offers a dynamic adjustment mechanism, tailoring the retrieval process to the specific demands of varying tasks and contexts.

Iterative Retrieval Iterative retrieval in RAG models is a process where documents are repeatedly collected based on the initial query and the text generated thus far, providing a more comprehensive knowledge base for LLMs [Borgeaud et al., 2022, Arora et al., 2023]. This approach has been shown to enhance the robustness of subsequent answer generation by offering additional contextual references through multiple retrieval iterations. However, it may suffer from semantic discontinuity and the accumulation of irrelevant information, as it typically relies on a sequence of n tokens to demarcate the boundaries between generated text and retrieved documents. To address specific data scenarios, recursive retrieval and multi-hop retrieval techniques are utilized. Recursive retrieval involves a structured index to process and retrieve data in a hierarchical manner, which may include summarizing sections of a document or lengthy PDF before performing a retrieval based on this summary. Subsequently, a secondary retrieval within the document refines the search, embodying the recursive nature of the process. In contrast, multi-hop retrieval is designed to delve deeper into graph-structured data sources, extracting interconnected information [Li et al., 2023c].
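The retrieve-then-generate cycle described above can be sketched in a few lines. The `retrieve` and `generate` callables are hypothetical stand-ins, not any particular system’s API:

```python
def iterative_retrieval(query, retrieve, generate, max_iters=3):
    """Iterative retrieval sketch: each cycle retrieves with the original
    query plus the text generated so far, accumulates new documents into
    the context, and regenerates the answer."""
    context, generated = [], ""
    for _ in range(max_iters):
        for doc in retrieve(query + " " + generated):
            if doc not in context:  # avoid accumulating duplicates
                context.append(doc)
        generated = generate(query, context)
    return generated, context
```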

Figure 5: Technology tree of representative RAG research with different augmentation aspects

Additionally, some methodologies integrate the steps of retrieval and generation. ITER-RETGEN [Shao et al., 2023] employs a synergistic approach that leverages “retrieval-enhanced generation” alongside “generation-enhanced retrieval” for tasks that necessitate the reproduction of specific information. The model harnesses the content required to address the input task as a contextual basis for retrieving pertinent knowledge, which in turn facilitates the generation of improved responses in subsequent iterations.

Recursive Retrieval Recursive Retrieval is often used in information retrieval and NLP to improve the depth and relevance of search results.

The process involves iteratively refining search queries based on the results obtained from previous searches. Recursive Retrieval aims to enhance the search experience by gradually converging on the most pertinent information through a feedback loop. IRCoT [Trivedi et al., 2022] uses chain-of-thought to guide the retrieval process and refines the CoT with the obtained retrieval results. ToC [Kim et al., 2023] creates a clarification tree that systematically optimizes the ambiguous parts of the query. It can be particularly useful in complex search scenarios where the user’s needs are not entirely clear from the outset or where the information sought is highly specialized or nuanced. The recursive nature of the process allows for continuous learning and adaptation to the user’s requirements, often resulting in improved satisfaction with the search outcomes.
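The feedback loop of query refinement can be sketched as follows, with `retrieve` and `refine` as hypothetical callables standing in for the retriever and the refinement model:

```python
def recursive_retrieval(initial_query, retrieve, refine, max_depth=3):
    """Recursive retrieval sketch: each round's results feed a feedback
    loop that refines the query for the next round, converging on the
    most pertinent information."""
    query, collected = initial_query, []
    for _ in range(max_depth):
        results = retrieve(query)
        collected.extend(results)
        refined = refine(query, results)
        if refined == query:  # query has stabilized: converged
            break
        query = refined
    return collected
```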

Adaptive Retrieval Adaptive retrieval methods, exemplified by FLARE and Self-RAG [Jiang et al., 2023b, Asai et al., 2023], refine the RAG framework by enabling LLMs to actively determine the optimal moments and content for retrieval, thus enhancing the efficiency and relevance of the information sourced.

These methods are part of a broader trend wherein LLMs employ active judgment in their operations, as seen in model agents like AutoGPT, Toolformer, and Graph-Toolformer [Yang et al., 2023c, Schick et al., 2023, Zhang, 2023]. Graph-Toolformer, for instance, divides its retrieval process into distinct steps where LLMs proactively use retrievers, apply Self-Ask techniques, and employ few-shot prompts to initiate search queries. This proactive stance allows LLMs to decide when to search for necessary information, akin to how an agent utilizes tools.

WebGPT [Nakano et al., 2021] integrates a reinforcement learning framework to train the GPT-3 model in autonomously using a search engine during text generation. It navigates this process using special tokens that facilitate actions such as search engine queries, browsing results, and citing references, thereby expanding GPT-3’s capabilities through the use of external search engines.

Flare automates the timing of retrieval by monitoring the confidence of the generation process, as indicated by the probability of generated terms [Jiang et al., 2023b]. When the probability falls below a certain threshold, the retrieval system is activated to collect relevant information, thus optimizing the retrieval cycle.
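A minimal sketch of this confidence-triggered loop follows. The `toy_next_token` and `retrieve` functions are hypothetical stand-ins for a real LM's token probabilities and a real retriever; only the thresholding logic reflects the FLARE idea described above.

```python
THRESHOLD = 0.5  # assumed confidence cutoff, not a value from the paper

def toy_next_token(prefix):
    # Stand-in for an LM: returns (token, probability).
    # Confidence drops on the factual slot ("Paris").
    stream = [("The", 0.95), ("capital", 0.9), ("is", 0.9), ("Paris", 0.3)]
    return stream[len(prefix)]

def retrieve(query):
    # Stand-in retriever returning supporting evidence.
    return "France's capital is Paris."

def generate_with_flare(n_tokens=4):
    prefix, retrievals = [], []
    for _ in range(n_tokens):
        token, prob = toy_next_token(prefix)
        if prob < THRESHOLD:
            # Low confidence: fetch evidence before committing to the token
            # (a real system would regenerate conditioned on it).
            retrievals.append(retrieve(" ".join(prefix)))
        prefix.append(token)
    return " ".join(prefix), retrievals
```

Only the low-probability step triggers retrieval, so the retrieval cycle runs exactly when the generator is unsure.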

Self-RAG [Asai et al., 2023] introduces “reflection tokens” that allow the model to introspect its outputs. These tokens come in two varieties: “retrieve” and “critic”. The model autonomously decides when to activate retrieval, or alternatively, a predefined threshold may trigger the process. During retrieval, the generator conducts a fragment-level beam search across multiple paragraphs to derive the most coherent sequence. Critic scores are used to update the subdivision scores, with the flexibility to adjust these weights during inference, tailoring the model’s behavior. Self-RAG’s design obviates the need for additional classifiers or reliance on Natural Language Inference (NLI) models, thus streamlining the decision-making process for when to engage retrieval mechanisms and improving the model’s autonomous judgment capabilities in generating accurate responses.
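The retrieve-then-critique control flow can be illustrated with a toy sketch. Everything here is a hypothetical stand-in (a generator that emits a literal `[Retrieve]` token, a critic that scores by lexical overlap); the actual Self-RAG model learns these behaviors end-to-end.

```python
def generator(query, passage=None):
    if passage is None:
        return "[Retrieve]"  # reflection token: model asks for evidence
    return f"Answer drawn from: {passage}"

def critic(query, passage, answer):
    # Stand-in critic: score candidates by lexical overlap with the query.
    q = set(query.lower().split())
    return len(q & set(passage.lower().split())) / max(len(q), 1)

def self_rag(query, passages):
    if generator(query) == "[Retrieve]":
        # Generate one candidate per passage, keep the best-scored segment.
        candidates = [
            (critic(query, p, generator(query, p)), generator(query, p))
            for p in passages
        ]
        return max(candidates)[1]
    return generator(query)

passages = ["the moon orbits the earth", "paris is in france"]
best = self_rag("what orbits the earth", passages)
```

The critic score here plays the role of the segment-level scores that Self-RAG uses to rank candidate continuations during its beam search.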

6.4 RAG vs Fine-Tuning

LLM optimization has received significant attention due to its increasing prevalence. Techniques such as prompt engineering, Fine-Tuning (FT), and RAG each have distinct characteristics, visually represented in Figure 6. While prompt engineering leverages a model’s inherent capabilities, optimizing LLMs often requires the application of both RAG and FT methods. The choice between RAG and FT should be based on the specific requirements of the scenario and the inherent properties of each approach. A detailed comparison of RAG and FT is presented in Table 1.

RAG is like giving a model a textbook for tailored information retrieval, perfect for specific queries. FT, on the other hand, is like a student internalizing knowledge over time, better suited to replicating specific structures, styles, or formats. FT can improve model performance and efficiency by reinforcing base-model knowledge, adjusting outputs, and teaching complex instructions. However, it is less suited to integrating new knowledge or rapidly iterating on new use cases.

The two methods, RAG and FT, are not mutually exclusive and can be complementary, augmenting a model’s capabilities at different levels. In some cases, their combined use may yield optimal performance. The optimization process involving RAG and FT can necessitate multiple iterations to achieve satisfactory results.

7 RAG Evaluation

The rapid advancement and growing adoption of RAG in the field of Natural Language Processing (NLP) have propelled the evaluation of RAG models to the forefront of research in the LLMs community. The primary objective of this evaluation is to comprehend and optimize the performance of RAG models across diverse application scenarios.

Historically, assessments of RAG models have centered on their execution in specific downstream tasks. These evaluations employ established metrics suited to the tasks at hand. For instance, question answering evaluations rely on EM and F1 scores [Wang et al., 2023a, Shi et al., 2023, Feng et al., 2023, Ma et al., 2023a], whereas fact-checking tasks often hinge on accuracy as the primary metric [Izacard et al., 2022, Shao et al., 2023]. Tools like RALLE, designed for the automatic evaluation of RAG applications, similarly base their assessments on these task-specific metrics [Hoshi et al., 2023]. Despite this, there is a notable paucity of research dedicated to evaluating the distinct characteristics of RAG models, with only a handful of related studies.

The following section shifts the focus from task-specific evaluation methods and metrics to provide a synthesis of the existing literature based on their unique attributes. This exploration covers the objectives of RAG evaluation, the aspects along which these models are assessed, and the benchmarks and tools available for such evaluations. The aim is to offer a comprehensive overview of RAG model evaluation, outlining the methodologies that specifically address the unique aspects of these advanced generative systems.

7.1 Evaluation Targets

The assessment of RAG models mainly revolves around two key components: the retrieval and generation modules. This division ensures a thorough evaluation of both the quality of context provided and the quality of content produced.

Retrieval Quality Evaluating the retrieval quality is crucial for determining the effectiveness of the context sourced by the retriever component. Standard metrics from the domains of search engines, recommendation systems, and information retrieval systems are employed to measure the performance of the RAG retrieval module. Metrics such as Hit Rate, MRR, and NDCG are commonly utilized for this purpose [Liu, 2023, Nguyen, 2023].
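The three metrics named above are straightforward to compute for a single query over a ranked list of document IDs; the following is a minimal sketch using binary relevance.

```python
import math

def hit_rate(ranked, relevant, k):
    # 1 if any relevant document appears in the top-k, else 0.
    return 1.0 if any(d in relevant for d in ranked[:k]) else 0.0

def mrr(ranked, relevant):
    # Reciprocal rank of the first relevant document.
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / i
    return 0.0

def ndcg(ranked, relevant, k):
    # Binary-relevance NDCG@k: DCG normalized by the ideal DCG.
    dcg = sum(1.0 / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

ranked = ["d3", "d1", "d7"]   # retriever output, best first
relevant = {"d1"}             # ground-truth relevant documents
```

With the relevant document at rank 2, MRR is 0.5 and NDCG@3 is 1/log2(3) ≈ 0.63; averaging these per-query values over a query set gives the benchmark numbers reported in the literature.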

Generation Quality The assessment of generation quality centers on the generator’s capacity to synthesize coherent and relevant answers from the retrieved context. This evaluation can be categorized based on the content’s objectives: unlabeled and labeled content. For unlabeled content, the evaluation encompasses the faithfulness, relevance, and non-harmfulness of the generated answers. In contrast, for labeled content, the focus is on the accuracy of the information produced by the model.

Table 1: Comparison between RAG and Fine-Tuning

7.2 Evaluation Aspects

Contemporary evaluation practices of RAG models emphasize three primary quality scores and four essential abilities, which collectively inform the evaluation of the two principal targets of the RAG model: retrieval and generation.

Quality Scores Quality scores include context relevance, answer faithfulness, and answer relevance. These quality scores evaluate the RAG model from different perspectives across the retrieval and generation process.

Context Relevance evaluates the precision and specificity of the retrieved context, ensuring relevance and minimizing processing costs associated with extraneous content.

Answer Faithfulness ensures that the generated answers remain true to the retrieved context, maintaining consistency and avoiding contradictions.

Figure 6: RAG compared with other model optimization methods

Answer Relevance requires that the generated answers are directly pertinent to the posed questions, effectively addressing the core inquiry.
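The three quality scores can be illustrated with a deliberately crude proxy: lexical overlap between the relevant pairs of texts. This is an assumption for illustration only; production evaluators such as RAGAS or ARES use LLM judges or embedding similarity instead of token overlap.

```python
def overlap(a: str, b: str) -> float:
    # Fraction of a's tokens that also appear in b (toy similarity proxy).
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta), 1)

def context_relevance(question, context):
    return overlap(question, context)   # is the retrieved context on-topic?

def answer_faithfulness(answer, context):
    return overlap(answer, context)     # is the answer grounded in the context?

def answer_relevance(answer, question):
    return overlap(question, answer)    # does the answer address the question?

q = "where is the eiffel tower"
ctx = "the eiffel tower is in paris"
ans = "the eiffel tower is in paris"
```

The key structural point survives the crude scoring: each score pairs a different two of the three texts (question, context, answer), which is exactly the "RAG triad" that tools like TruLens evaluate.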

Required Abilities RAG evaluation also encompasses four abilities indicative of its adaptability and efficiency: noise robustness, negative rejection, information integration, and counterfactual robustness [Chen et al., 2023b, Liu et al., 2023b]. These abilities are critical for the model’s performance under various challenges and complex scenarios, impacting the quality scores.

Noise Robustness appraises the model’s capability to manage noise documents that are question-related but lack substantive information.

Negative Rejection assesses the model’s discernment in refraining from responding when the retrieved documents do not contain the necessary knowledge to answer a question.

Information Integration evaluates the model’s proficiency in synthesizing information from multiple documents to address complex questions.

Counterfactual Robustness tests the model’s ability to recognize and disregard known inaccuracies within documents, even when instructed about potential misinformation.

Context relevance and noise robustness are important for evaluating the quality of retrieval, while answer faithfulness, answer relevance, negative rejection, information integration, and counterfactual robustness are important for evaluating the quality of generation.

The specific metrics for each evaluation aspect are summarized in Table 2. It is essential to recognize that these metrics, derived from related work, are traditional measures and do not yet represent a mature or standardized approach for quantifying RAG evaluation aspects. Custom metrics tailored to the nuances of RAG models, though not included here, have also been developed in some evaluation studies.

7.3 Evaluation Benchmarks and Tools

This section delineates the evaluation framework for RAG models, comprising benchmark tests and automated evaluation tools. These instruments furnish quantitative metrics that not only gauge RAG model performance but also enhance comprehension of the model’s capabilities across various evaluation aspects. Prominent benchmarks such as RGB and RECALL [Chen et al., 2023b, Liu et al., 2023b] focus on appraising the essential abilities of RAG models. Concurrently, state-of-the-art automated tools like RAGAS [Es et al., 2023], ARES [Saad-Falcon et al., 2023], and TruLens8 employ LLMs to adjudicate the quality scores. These tools and benchmarks collectively form a robust framework for the systematic evaluation of RAG models, as summarized in Table 3.

8 https://www.trulens.org/trulenseval/coreconceptsragtriad/

Table 2: Summary of metrics applicable for evaluation aspects of RAG

8 Future Prospects

This section explores three future prospects for RAG: future challenges, modality expansion, and the RAG ecosystem.

8.1 Future Challenges of RAG

Despite the considerable progress in RAG technology, several challenges persist that warrant in-depth research:

Context Length. RAG’s efficacy is limited by the context window size of Large Language Models (LLMs). Balancing the trade-off between a window that is too short, risking insufficient information, and one that is too long, risking information dilution, is crucial. With ongoing efforts to expand LLM context windows to virtually unlimited sizes, the adaptation of RAG to these changes presents a significant research question [Xu et al., 2023c, Packer et al., 2023, Xiao et al., 2023].

Robustness. The presence of noise or contradictory information during retrieval can detrimentally affect RAG’s output quality. This situation is figuratively referred to as “Misinformation can be worse than no information at all”. Improving RAG’s resistance to such adversarial or counterfactual inputs is gaining research momentum and has become a key performance metric [Yu et al., 2023a, Glass et al., 2021, Baek et al., 2023].

Hybrid Approaches (RAG+FT). Combining RAG with fine-tuning is emerging as a leading strategy. Determining the optimal integration of RAG and fine-tuning (whether sequential, alternating, or through end-to-end joint training) and how to harness both parameterized and non-parameterized advantages are areas ripe for exploration [Lin et al., 2023].

Expanding LLM Roles. Beyond generating final answers, LLMs are leveraged for retrieval and evaluation within RAG frameworks. Identifying ways to further unlock LLMs’ potential in RAG systems is a growing research direction.

Scaling Laws. While scaling laws [Kaplan et al., 2020] are established for LLMs, their applicability to RAG remains uncertain. Initial studies [Wang et al., 2023b] have begun to address this, yet the parameter count in RAG models still lags behind that of LLMs. The possibility of an Inverse Scaling Law9, where smaller models outperform larger ones, is particularly intriguing and merits further investigation.

Production-Ready RAG. RAG’s practicality and alignment with engineering requirements have facilitated its adoption. However, enhancing retrieval efficiency, improving document recall in large knowledge bases, and ensuring data security—such as preventing inadvertent disclosure of document sources or metadata by LLMs—are critical engineering challenges that remain to be addressed [Alon et al., 2022].

Modality Extension of RAG RAG has transcended its text-based question-answering confines, embracing a diverse array of modal data. This expansion has spawned innovative VLMs that integrate RAG concepts across various domains:

Image. RA-CM3 [Yasunaga et al., 2022] stands as a pioneering VLM capable of both retrieving and generating text and images. BLIP-2 [Li et al., 2023a] leverages frozen image encoders alongside LLMs for efficient visual-language pre-training, enabling zero-shot image-to-text conversion. The “Visualize Before You Write” method [Zhu et al., 2022] employs image generation to steer the LM’s text generation, showing promise in open-ended text generation tasks.

Audio and Video. The GSS method retrieves and stitches together audio clips to convert machine-translated data into speech-translated data [Zhao et al., 2022]. UEOP marks a significant advancement in end-to-end automatic speech recognition by incorporating external, offline strategies for voice-to-text conversion [Chan et al., 2023]. Additionally, KNN-based attention fusion leverages audio embeddings and semantically related text embeddings to refine ASR, thereby accelerating domain adaptation. Vid2Seq augments language models with specialized temporal markers, facilitating the prediction of event boundaries and textual descriptions within a unified output sequence [Yang et al., 2023a].

Code. RBPS [Nashid et al., 2023] excels in small-scale learning tasks by retrieving code examples that align with developers’ objectives through encoding and frequency analysis. This approach has demonstrated efficacy in tasks such as test assertion generation and program repair. For structured knowledge, the CoK method [Li et al., 2023c] first extracts facts pertinent to the input query from a knowledge graph, then integrates these facts as hints within the input, enhancing performance in knowledge graph question-answering tasks.

8.2 Ecosystem of RAG

Downstream Tasks and Evaluation RAG has shown considerable promise in enriching language models with the capacity to handle intricate queries and produce detailed responses by leveraging extensive knowledge bases. Empirical evidence suggests that RAG excels in a variety of downstream tasks, including open-ended question answering and fact verification. The integration of RAG not only bolsters the precision and relevance of responses but also their diversity and depth.

9 https://github.com/inverse-scaling/prize

The scalability and versatility of RAG across multiple domains warrant further investigation, particularly in specialized fields such as medicine, law, and education. In these areas, RAG could potentially reduce training costs and enhance performance compared to traditional fine-tuning approaches in professional domain knowledge question answering.

Concurrently, refining the evaluation framework for RAG is essential to maximize its efficacy and utility across different tasks. This entails the development of nuanced metrics and assessment tools that can gauge aspects such as contextual relevance, creativity of content, and non-maleficence.

Furthermore, improving the interpretability of RAG-driven models continues to be a key goal. Doing so would allow users to understand the reasoning behind the responses generated by the model, thereby promoting trust and transparency in the use of RAG applications.

Technical Stack The development of the RAG ecosystem is greatly impacted by the progression of its technical stack. Key tools like LangChain and LLamaIndex have quickly gained popularity with the emergence of ChatGPT, providing extensive RAG-related APIs and becoming essential in the realm of LLMs.

Emerging technical stacks, while not as feature-rich as LangChain and LLamaIndex, distinguish themselves with specialized offerings. For instance, Flowise AI10 prioritizes a low-code approach, enabling users to deploy AI applications, including RAG, through a user-friendly drag-and-drop interface. Other technologies like HayStack, Meltano11, and Cohere Coral12 are also gaining attention for their unique contributions to the field.

In addition to AI-focused providers, traditional software and cloud service providers are expanding their offerings to include RAG-centric services. Verba13 from Weaviate is designed for personal assistant applications, while Amazon’s Kendra14 provides an intelligent enterprise search service, allowing users to navigate through various content repositories using built-in connectors. During the evolution of the RAG technology landscape, there has been a clear divergence towards different specializations, such as: 1) Customization. Tailoring RAG to meet specific requirements. 2) Simplification. Making RAG easier to use, thereby reducing the initial learning curve. 3) Specialization. Refining RAG to serve production environments more effectively.

The mutual growth of RAG models and their technical stack is evident; technological advancements consistently establish new standards for the existing infrastructure. In turn, enhancements to the technical stack drive the evolution of RAG capabilities. The RAG toolkit is converging into a foundational technical stack, laying the groundwork for advanced enterprise applications. However, the concept of a fully integrated, comprehensive platform remains on the horizon, pending further innovation and development.

10 https://flowiseai.com

11 https://meltano.com

12 https://cohere.com/coral

13 https://github.com/weaviate/Verba

14 https://aws.amazon.com/cn/kendra/

Figure 7: Summary of RAG ecosystem

9 Conclusion

The summary of this paper, as depicted in Figure 7, highlights RAG’s significant advancement in enhancing the capabilities of LLMs through the integration of parameterized knowledge from language models with extensive non-parameterized data from external knowledge bases. Our survey illustrates the evolution of RAG technologies and their impact on knowledge-intensive tasks. Our analysis delineates three developmental paradigms within the RAG framework: Naive, Advanced, and Modular RAG, each marking a progressive enhancement over its predecessors. The Advanced RAG paradigm extends beyond the Naive approach by incorporating sophisticated architectural elements, including query rewriting, chunk reranking, and prompt summarization. These innovations have led to a more nuanced and modular architecture that enhances both the performance and the interpretability of LLMs. RAG’s technical integration with other AI methodologies, such as fine-tuning and reinforcement learning, has further expanded its capabilities. In content retrieval, a hybrid methodology that leverages both structured and unstructured data sources is emerging as a trend, providing a more enriched retrieval process. Cutting-edge research within the RAG framework is exploring novel concepts such as self-retrieval from LLMs and the dynamic timing of information retrieval.

Despite the strides made in RAG technology, research opportunities abound in improving its robustness and its ability to manage extended contexts. RAG’s application scope is also widening into VLM domains, adapting its principles to interpret and process diverse data forms such as images, videos, and code. This expansion underscores RAG’s significant practical implications for AI deployment, attracting interest from both academic and industrial sectors. The growing ecosystem of RAG is underscored by an increase in RAG-centric AI applications and the ongoing development of supportive tools. However, as RAG’s application landscape expands, there is an imperative need to refine evaluation methodologies to keep pace with its evolution. Ensuring that performance assessments remain accurate and representative is crucial for capturing the full extent of RAG’s contributions to the AI research and development community.
