abstract: Knowledge Graphs (KGs) represent human-crafted factual knowledge in the form of triplets (head, relation, tail), which collectively form a graph. Question Answering over KGs (KGQA) is the task of answering natural questions grounding the reasoning to the information provided by the KG. Large Language Models (LLMs) are the state-of-the-art models for QA tasks due to their remarkable ability to understand natural language. On the other hand, Graph Neural Networks (GNNs) have been widely used for KGQA as they can handle the complex graph information stored in the KG. In this work, we introduce GNN-RAG, a novel method for combining language understanding abilities of LLMs with the reasoning abilities of GNNs in a retrieval-augmented generation (RAG) style. First, a GNN reasons over a dense KG subgraph to retrieve answer candidates for a given question. Second, the shortest paths in the KG that connect question entities and answer candidates are extracted to represent KG reasoning paths. The extracted paths are verbalized and given as input for LLM reasoning with RAG. In our GNN-RAG framework, the GNN acts as a dense subgraph reasoner to extract useful graph information, while the LLM leverages its natural language processing ability for ultimate KGQA. Furthermore, we develop a retrieval augmentation (RA) technique to further boost KGQA performance with GNN-RAG. Experimental results show that GNN-RAG achieves state-of-the-art performance in two widely used KGQA benchmarks (WebQSP and CWQ), outperforming or matching GPT-4 performance with a 7B tuned LLM. In addition, GNN-RAG excels on multi-hop and multi-entity questions outperforming competing approaches by 8.9–15.5% points at answer F1.
GNN-RAG: A Technique Leveraging Knowledge Graphs
1 Introduction
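Large Language Models (LLMs) are the SOTA models for many natural language processing (NLP) tasks thanks to their language understanding ability. Their power comes from acquiring general human knowledge through pretraining on large corpora of text. However, pretraining is costly and time-consuming, so LLMs cannot easily adapt to new or domain-specific knowledge and are prone to hallucinations.
Knowledge Graphs (KGs) are databases that store information in structured form and can be easily updated. KGs represent human-crafted factual knowledge as triplets (head, relation, tail); because they capture complex interactions, they are widely used for knowledge-intensive tasks such as Question Answering (QA).
Retrieval-augmented generation (RAG) reduces LLM hallucinations by enriching the input context with up-to-date information obtained from the KG. In the KGQA task, the goal is to answer natural language questions based on the information the KG provides. However, KGs store complex graph information, and retrieving the relevant information effectively is a challenge. Existing LLM-based retrieval methods are limited in how well they handle complex graph information.
This work introduces GNN-RAG, a new method that uses Graph Neural Networks (GNNs) to process the complex graph information stored in the KG and to support LLM reasoning. In experiments, GNN-RAG outperforms competing systems by up to 15.5% points on complex KGQA.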
2 Related Work
[KGQA Methods]
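KGQA methods fall into two categories: (A) Semantic Parsing (SP) methods and (B) Information Retrieval (IR) methods. SP methods transform the given question into a logical-form query, which is executed over the KG to obtain the answers. However, SP methods require ground-truth logical queries for training, which are time-consuming to annotate, and they may produce non-executable queries. IR methods, in contrast, work in the weakly supervised setting: they retrieve KG information and use it for KGQA reasoning.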
[Graph-augmented LMs]
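Combining LMs with graphs that store information is an emerging research area. One direction enhances LMs with latent graph information obtained from GNNs; a second direction inserts graph information at the input. The second direction may fetch noisy information when the graph is large, whereas GNN-RAG uses GNNs for information retrieval and RAG for KGQA reasoning, achieving superior performance.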
3 Problem Statement & Background
[KGQA]
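A KG \(G\) contains facts represented as \((v, r, v')\). Given a natural language question \(q\), the goal is to extract the correct answers from the KG. Question-answer pairs are given for training, but the ground-truth paths that lead to the answers are not.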
[Retrieval & Reasoning]
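Because a KG contains millions of facts and nodes, a smaller question-specific subgraph \(G_q\) is retrieved and used instead. The retrieved subgraph and the question are the input to the reasoning model.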
[GNNs]
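KGQA can be viewed as a node classification problem, and GNNs are powerful graph representation learners. A GNN updates the representation of a node by aggregating messages from each of its neighbors:
\[h^{(l+1)}_v = \sigma \left( W^{(l)} \sum_{u \in \mathcal{N}(v)} \frac{1}{c_{vu}} h^{(l)}_u \right)\]
where \(h^{(l)}_v\) is the representation of node \(v\) at layer \(l\), \(\mathcal{N}(v)\) is the set of neighbors of \(v\), and \(c_{vu}\) is a normalization constant.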
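To make the update above concrete, the following is a minimal pure-Python sketch of one layer. It is an illustration under stated assumptions, not the GNN used in the experiments: the normalization constant is taken to be \(c_{vu} = |\mathcal{N}(v)|\) (mean aggregation), \(\sigma\) is taken to be ReLU, and the names `gnn_layer` and `matvec` are hypothetical.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, x: Vector) -> Vector:
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def gnn_layer(h: Dict[str, Vector],
              neighbors: Dict[str, List[str]],
              W: Matrix) -> Dict[str, Vector]:
    """One update h_v <- sigma(W * sum_{u in N(v)} h_u / c_vu).

    Assumptions (not from the paper): c_vu = |N(v)| and sigma = ReLU.
    A node without neighbors aggregates a zero vector.
    """
    h_next: Dict[str, Vector] = {}
    for v, h_v in h.items():
        nbrs = neighbors.get(v, [])
        agg = [0.0] * len(h_v)
        for u in nbrs:
            for i, x in enumerate(h[u]):
                agg[i] += x / len(nbrs)  # 1 / c_vu with c_vu = |N(v)|
        h_next[v] = [max(0.0, z) for z in matvec(W, agg)]  # ReLU
    return h_next


# Toy graph with three nodes and 2-dimensional representations.
h0 = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
nbrs = {"a": ["b", "c"], "b": ["a"], "c": []}
W0 = [[1.0, 0.0], [0.0, -1.0]]
h1 = gnn_layer(h0, nbrs, W0)
```

Stacking \(L\) such layers lets information travel \(L\) hops, which is why a deeper GNN can reach answers that are several relations away from the question entities.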
[LLMs]
LLMs process KG information after it has been converted into natural language. For example, the input takes the form “Knowledge: Jamaica → language_spoken → English \n Question: Jamaican people speak?”.
4 GNN-RAG
4.1 GNN
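GNN-RAG uses SOTA GNNs to retrieve answer candidates for a question. GNNs can handle complex graph interactions and answer multi-hop questions. After GNN reasoning completes, nodes are classified as answers vs. non-answers based on the final GNN representations \(h^{(L)}\). The resulting reasoning paths are used as input for LLM-based RAG.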
4.2 LLM
The reasoning paths obtained by GNN-RAG are given as input to the LLM. RAG prompt tuning is applied so that the LLM generates appropriate answers from a prompt of the form:
\[\{ \text{Reasoning Paths} \} \\ \text{Question:} \{ \text{Question} \}\]
4.3 Retrieval Analysis: Why GNNs & Their Limitations
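GNNs are well suited to retrieving the relevant parts of the KG that contain multi-hop information. Experiments show that a deep GNN (L=3) handles complex graph structure and retrieves useful multi-hop information effectively. For simple questions, however, accurate question-relation matching matters more, and the LLM-based retriever performs better.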
4.4 Retrieval Augmentation (RA)
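RA combines KG information retrieved by different approaches to increase diversity and answer recall. GNN-RAG+RA combines the GNN-based retriever with an LLM-based retriever to improve performance on both single-hop and multi-hop questions.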
5 Experimental Setup
[KGQA Datasets]
We experiment on the WebQuestionsSP (WebQSP) and Complex WebQuestions 1.1 (CWQ) benchmarks. WebQSP contains questions over the Freebase KG that require up to 2-hop reasoning, and CWQ contains complex questions that require up to 4-hop reasoning.
[Implementation & Evaluation]
We use the PageRank algorithm for subgraph retrieval and the ReaRev GNN for reasoning. The evaluation metrics are Hit, Hits@1 (H@1), and F1.
[Competing Methods]
We compare with SOTA GNN and LLM methods, as well as with earlier embedding-based methods.
6 Results
GNN-RAG performs well on both KGQA benchmarks, with gains of up to 15.5% points on complex questions. On multi-hop and multi-entity questions, GNN-RAG outperforms the LLM-based retriever by up to 17.2% points at F1.
7 Conclusion
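GNN-RAG is a new method that substantially improves KGQA performance by combining LLMs with GNNs. It achieves SOTA performance on two major benchmarks, and it strengthens LLM reasoning on complex questions by effectively retrieving multi-hop information.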
Large Language Models (LLMs) [Brown et al., 2020, Bommasani et al., 2021, Chowdhery et al., 2023] are the state-of-the-art models in many NLP tasks due to their remarkable ability to understand natural language. LLM’s power stems from pretraining on large corpora of textual data to obtain general human knowledge [Kaplan et al., 2020, Hoffmann et al., 2022]. However, because pretraining is costly and time-consuming [Gururangan et al., 2020], LLMs cannot easily adapt to new or in-domain knowledge and are prone to hallucinations [Zhang et al., 2023].
Knowledge Graphs (KGs) [Vrandečić and Krötzsch, 2014] are databases that store information in structured form that can be easily updated. KGs represent human-crafted factual knowledge in the form of triplets (head, relation, tail), e.g., <Jamaica → language_spoken → English>, which collectively form a graph. In the case of KGs, the stored knowledge is updated by fact addition or removal. As KGs capture complex interactions between the stored entities, e.g., multi-hop relations, they are widely used for knowledge-intensive tasks, such as Question Answering (QA) [Pan et al., 2024].
Retrieval-augmented generation (RAG) is a framework that alleviates LLM hallucinations by enriching the input context with up-to-date and accurate information [Lewis et al., 2020], e.g., obtained from the KG. In the KGQA task, the goal is to answer natural questions grounding the reasoning to the information provided by the KG. For instance, the input for RAG becomes “Knowledge: Jamaica → language_spoken → English \n Question: Jamaican people speak?”, where the LLM has access to KG information for answering the question.
RAG’s performance highly depends on the KG facts that are retrieved [Wu et al., 2023]. The challenge is that KGs store complex graph information (they usually consist of millions of facts) and retrieving the right information requires effective graph processing, while retrieving irrelevant information may confuse the LLM during its KGQA reasoning [He et al., 2024]. Existing retrieval methods that rely on LLMs to retrieve relevant KG information (LLM-based retrieval) underperform on multi-hop KGQA as they cannot handle complex graph information [Baek et al., 2023, Luo et al., 2024] or they need the internal knowledge of very large LMs, e.g., GPT-4, to compensate for missing information during KG retrieval [Sun et al., 2024].
In this work, we introduce GNN-RAG, a novel method for improving RAG for KGQA. GNN-RAG relies on Graph Neural Networks (GNNs) [Mavromatis and Karypis, 2022], which are powerful graph representation learners, to handle the complex graph information stored in the KG. Although GNNs cannot understand natural language the same way LLMs do, GNN-RAG repurposes their graph processing power for retrieval. First, a GNN reasons over a dense KG subgraph to retrieve answer candidates for a given question. Second, the shortest paths in the KG that connect question entities and GNN-based answers are extracted to represent useful KG reasoning paths. The extracted paths are verbalized and given as input for LLM reasoning with RAG. Furthermore, we show that GNN-RAG can be augmented with LLM-based retrievers to further boost KGQA performance. Experimental results show GNN-RAG’s superiority over competing RAG-based systems for KGQA by outperforming them by up to 15.5% points at complex KGQA performance (Figure 1). Our contributions are summarized below:
KGQA Methods.
KGQA methods fall into two categories [Lan et al., 2022]: (A) Semantic Parsing (SP) methods and (B) Information Retrieval (IR) methods. SP methods [Sun et al., 2020, Lan and Jiang, 2020, Ye et al., 2022] learn to transform the given question into a query of logical form, e.g., a SPARQL query. The transformed query is then executed over the KG to obtain the answers. However, SP methods require ground-truth logical queries for training, which are time-consuming to annotate in practice, and may lead to non-executable queries due to syntactical or semantic errors [Das et al., 2021, Yu et al., 2022]. IR methods [Sun et al., 2018, 2019] focus on the weakly-supervised KGQA setting, where only question-answer pairs are given for training. IR methods retrieve KG information, e.g., a KG subgraph [Zhang et al., 2022a], which is used as input during KGQA reasoning. In Appendix A, we analyze the reasoning abilities of the prevailing models (GNNs & LLMs) for KGQA, and in Section 4, we propose GNN-RAG which leverages the strengths of both of these models.
Figure 2: The landscape of existing KGQA methods. GNN-based methods reason on dense subgraphs as they can handle complex and multi-hop graph information. LLM-based methods employ the same LLM for both retrieval and reasoning due to its ability to understand natural language.
Graph-augmented LMs.
Combining LMs with graphs that store information in natural language is an emerging research area [Jin et al., 2023]. There are two main directions, (i) methods that enhance LMs with latent graph information [Zhang et al., 2022b, Tian et al., 2024, Huang et al., 2024], e.g., obtained by GNNs, and (ii) methods that insert verbalized graph information at the input [Xie et al., 2022, Jiang et al., 2023a, Jin et al., 2024], similar to RAG. The methods of the first direction are limited because of the modality mismatch between language and graph, which can lead to inferior performance for knowledge-intensive tasks [Mavromatis et al., 2024]. On the other hand, methods of the second direction may fetch noisy information when the underlying graph is large and such information can decrease the LM’s reasoning ability [Wu et al., 2023, He et al., 2024]. GNN-RAG employs GNNs for information retrieval and RAG for KGQA reasoning, achieving superior performance over existing approaches.
KGQA. We are given a KG G that contains facts represented as (v, r, v′), where v denotes the head entity, v′ denotes the tail entity, and r is the corresponding relation between the two entities. Given G and a natural language question q, the task of KGQA is to extract a set of entities {a} ∈ G that correctly answer q. Following previous works [Lan et al., 2022], question-answer pairs are given for training, but not the ground-truth paths that lead to the answers.
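As an illustration of the setting above, a KG can be held as a list of \((v, r, v')\) triples and indexed by head entity for traversal. This is a minimal sketch: the helper name `build_adjacency` is hypothetical, and apart from the Jamaica fact quoted in the introduction the toy facts are invented for illustration only.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head v, relation r, tail v')


def build_adjacency(triples: List[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Index triples by head entity: head -> [(relation, tail), ...]."""
    adj: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for head, relation, tail in triples:
        adj[head].append((relation, tail))
    return dict(adj)


TOY_KG: List[Triple] = [
    ("Jamaica", "language_spoken", "English"),    # fact quoted in the paper
    ("Jamaica", "official_language", "English"),  # illustrative toy fact
    ("Jamaica", "contained_by", "Caribbean"),     # illustrative toy fact
]

adjacency = build_adjacency(TOY_KG)
# A training example pairs a question with its answer set {a}; no path is given.
example = {"question": "Which language do Jamaican people speak?",
           "answers": {"English"}}
```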
Retrieval & Reasoning. As KGs usually contain millions of facts and nodes, a smaller question-specific subgraph \(G_q\) is retrieved for a question \(q\), e.g., via entity linking and neighbor extraction [Yih et al., 2015]. Ideally, all correct answers for the question are contained in the retrieved subgraph, \(\{a\} \in G_q\). The retrieved subgraph \(G_q\) along with the question \(q\) are used as input to a reasoning model, which outputs the correct answer(s). The prevailing reasoning models for the KGQA setting studied are GNNs and LLMs.
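The neighbor-extraction step can be sketched as a breadth-first expansion from the linked question entities. This is a simplified stand-in under stated assumptions: the function name `k_hop_subgraph` and the hop limit `k` are hypothetical, edge direction is ignored during expansion, and the PageRank-based extraction reported in the experimental setup is not reproduced here.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]


def k_hop_subgraph(triples: List[Triple], seeds: List[str], k: int) -> List[Triple]:
    """Keep every triple touched while expanding at most k hops from the seeds."""
    # Index triples by both endpoints so expansion ignores edge direction.
    incident: Dict[str, List[Triple]] = {}
    for t in triples:
        incident.setdefault(t[0], []).append(t)
        incident.setdefault(t[2], []).append(t)

    depth = {s: 0 for s in seeds}
    queue = deque(seeds)
    kept: Set[Triple] = set()
    while queue:
        node = queue.popleft()
        if depth[node] == k:
            continue  # do not expand beyond k hops
        for t in incident.get(node, []):
            kept.add(t)
            other = t[2] if t[0] == node else t[0]
            if other not in depth:
                depth[other] = depth[node] + 1
                queue.append(other)
    return [t for t in triples if t in kept]  # preserve input order


chain = [("A", "r1", "B"), ("B", "r2", "C"), ("C", "r3", "D")]
sub_1 = k_hop_subgraph(chain, ["A"], 1)
sub_2 = k_hop_subgraph(chain, ["A"], 2)
```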
GNNs. KGQA can be regarded as a node classification problem, where KG entities are classified as answers vs. non-answers for a given question. GNNs [Kipf and Welling, 2016, Veličković et al., 2017, Schlichtkrull et al., 2018] are powerful graph representation learners suited for tasks such as node classification. GNNs update the representation \(h^{(l)}_v\) of node \(v\) at layer \(l\) by aggregating messages \(m^{(l)}_{vv'}\) from each neighbor \(v'\). During KGQA, the message passing is also conditioned on the given question \(q\) [He et al., 2021]. For readability purposes, we present the following GNN update for KGQA,
LLMs. LLMs for KGQA use KG information to perform retrieval-augmented generation (RAG) as follows. The retrieved subgraph is first converted into natural language so that it can be processed by the LLM. The input given to the LLM contains the KG factual information along with the question and a prompt. For instance, the input becomes “Knowledge: Jamaica → language_spoken → English \n Question: Which language do Jamaican people speak?”, where the LLM has access to KG information for answering the question.
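The verbalization step can be sketched as a small string-building routine. The function names `verbalize` and `build_rag_input` are hypothetical, and two details are assumptions rather than something the text specifies: multiple facts are separated by newlines, and no padding spaces surround the line break before "Question:".

```python
from typing import Iterable, Sequence


def verbalize(fact_or_path: Sequence[str]) -> str:
    """Render (head, relation, tail) or a longer path as 'e0 → r1 → e1 → ...'."""
    return " → ".join(fact_or_path)


def build_rag_input(facts: Iterable[Sequence[str]], question: str) -> str:
    """Assemble the 'Knowledge: ... / Question: ...' input shown in the example.

    Assumption: one verbalized fact per line when several facts are retrieved.
    """
    knowledge = "\n".join(verbalize(f) for f in facts)
    return f"Knowledge: {knowledge}\nQuestion: {question}"


rag_input = build_rag_input(
    [("Jamaica", "language_spoken", "English")],
    "Which language do Jamaican people speak?",
)
```

Because a reasoning path is an alternating sequence of entities and relations, the same `verbalize` helper covers multi-hop paths as well as single facts.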
Landscape of KGQA Methods
Figure 2 presents the landscape of existing KGQA methods with respect to KG retrieval and reasoning. GNN-based methods, such as GraftNet [Sun et al., 2018], NSM [He et al., 2021], and ReaRev [Mavromatis and Karypis, 2022], reason over a dense KG subgraph leveraging the GNN’s ability to handle complex graph information. Recent LLM-based methods leverage the LLM’s power for both retrieval and reasoning. ToG [Sun et al., 2024] uses the LLM to retrieve relevant facts hop-by-hop. RoG [Luo et al., 2024] uses the LLM to generate plausible relation paths which are then mapped on the KG to retrieve the relevant information.
LLM-based Retriever
We present an example of an LLM-based retriever (RoG; [Luo et al., 2024]). Given training question-answer pairs, RoG extracts the shortest paths to the answers starting from question entities for fine-tuning the retriever. Based on the extracted paths, an LLM (LLaMA2-Chat-7B [Touvron et al., 2023]) is fine-tuned to generate reasoning paths given a question \(q\) as
\[\text{LLM}(\text{prompt}, q) \Rightarrow \{r_1 \rightarrow \cdots \rightarrow r_t\}_k, \tag{2}\]
where the prompt is “Please generate a valid relation path that can be helpful for answering the following question: {Question}”. Beam-search decoding is used to generate \(k\) diverse sets of relations, e.g., \(\{\langle \text{official\_language} \rangle, \langle \text{language\_spoken} \rangle\}\) for the question “Which language do Jamaican people speak?”. The generated paths are mapped on the KG, starting from the question entities, in order to retrieve the intermediate entities for RAG, e.g., \(\langle \text{Jamaica} \rightarrow \text{language\_spoken} \rightarrow \text{English} \rangle\).
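The mapping of a generated relation path onto the KG can be sketched as follows. This is an illustration, not RoG's implementation: the function name `follow_relation_path` and the head-indexed adjacency format are assumptions, and the relation paths are supplied by hand instead of being decoded from an LLM.

```python
from typing import Dict, List, Tuple

Adjacency = Dict[str, List[Tuple[str, str]]]  # head -> [(relation, tail), ...]


def follow_relation_path(adj: Adjacency, start: str,
                         relations: List[str]) -> List[List[str]]:
    """Return every KG path from `start` whose relations match `relations` in order.

    Each path is an alternating list [entity, relation, entity, ...].
    """
    paths: List[List[str]] = [[start]]
    for rel in relations:
        extended: List[List[str]] = []
        for path in paths:
            for r, tail in adj.get(path[-1], []):
                if r == rel:
                    extended.append(path + [r, tail])
        paths = extended  # paths that cannot be extended are dropped
    return paths


toy_adj: Adjacency = {
    "Jamaica": [("language_spoken", "English"),
                ("official_language", "English")],
}
# Two single-relation paths, standing in for k = 2 beams.
generated = [["official_language"], ["language_spoken"]]
retrieved = [p for rels in generated
             for p in follow_relation_path(toy_adj, "Jamaica", rels)]
```

A relation path that does not exist in the KG simply yields no retrieved facts, which is one reason several beams are decoded.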
We introduce GNN-RAG, a novel method for combining language understanding abilities of LLMs with the reasoning abilities of GNNs in a retrieval-augmented generation (RAG) style. We provide the overall framework in Figure 3. First, a GNN reasons over a dense KG subgraph to retrieve answer candidates for a given question. Second, the shortest paths in the KG that connect question entities and GNN-based answers are extracted to represent useful KG reasoning paths. The extracted paths are verbalized and given as input for LLM reasoning with RAG. In our GNN-RAG framework, the GNN acts as a dense subgraph reasoner to extract useful graph information, while the LLM leverages its natural language processing ability for ultimate KGQA.
In order to retrieve high-quality reasoning paths via GNN-RAG, we leverage state-of-the-art GNNs for KGQA. We prefer GNNs over other KGQA methods, e.g., embedding-based methods [Saxena et al., 2020], due to their ability to handle complex graph interactions and answer multi-hop questions. GNNs mark themselves as good candidates for retrieval due to their architectural benefit of exploring diverse reasoning paths [Mavromatis and Karypis, 2022, Choi et al., 2024] that result in high answer recall.
When GNN reasoning is completed (\(L\) GNN updates via Equation 1), all nodes in the subgraph are scored as answers vs. non-answers based on their final GNN representations \(h^{(L)}\), followed by the \(\text{softmax}(\cdot)\) operation. The GNN parameters are optimized via node classification (answers vs. non-answers) using the training question-answer pairs. During inference, the nodes with the highest probability scores, e.g., above a probability threshold, are returned as candidate answers, along with the shortest paths connecting the question entities with the candidate answers (reasoning paths). The retrieved reasoning paths are used as input for LLM-based RAG.
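The inference-time retrieval step described above can be sketched as: softmax over node scores, thresholding, then shortest-path extraction. This is a minimal sketch under stated assumptions: the raw scores stand in for the output of a trained GNN, the threshold of 0.5 is an arbitrary illustrative value, and breadth-first search over directed edges returns only one shortest path per entity pair. All function names are hypothetical; the two toy facts are taken from the case study reported later in the paper.

```python
import math
from collections import deque
from typing import Dict, List, Optional, Tuple

Adjacency = Dict[str, List[Tuple[str, str]]]  # head -> [(relation, tail), ...]


def softmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Numerically stable softmax over all node scores in the subgraph."""
    m = max(scores.values())
    exps = {v: math.exp(s - m) for v, s in scores.items()}
    z = sum(exps.values())
    return {v: e / z for v, e in exps.items()}


def shortest_path(adj: Adjacency, source: str, target: str) -> Optional[List[str]]:
    """BFS; returns [entity, relation, entity, ...] or None if unreachable."""
    parent: Dict[str, Optional[Tuple[str, str]]] = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for rel, nxt in adj.get(node, []):
            if nxt not in parent:
                parent[nxt] = (node, rel)
                queue.append(nxt)
    if target not in parent:
        return None
    path = [target]
    while parent[path[0]] is not None:
        prev, rel = parent[path[0]]
        path[:0] = [prev, rel]  # prepend the previous entity and its relation
    return path


def retrieve_paths(adj: Adjacency, question_entities: List[str],
                   node_scores: Dict[str, float], threshold: float = 0.5):
    """Return (candidate answers, reasoning paths); threshold is an assumption."""
    probs = softmax(node_scores)
    candidates = [v for v, p in probs.items() if p >= threshold]
    paths = []
    for q in question_entities:
        for a in candidates:
            p = shortest_path(adj, q, a)
            if p is not None and len(p) > 1:  # skip the trivial q == a case
                paths.append(p)
    return candidates, paths


kg: Adjacency = {
    "Gilfoyle": [("characters_that_have_lived_here", "Toronto")],
    "Toronto": [("province.capital", "Ontario")],
}
scores = {"Gilfoyle": 0.0, "Toronto": 1.0, "Ontario": 5.0}  # stand-in GNN scores
candidates, paths = retrieve_paths(kg, ["Gilfoyle"], scores)
```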
After obtaining the reasoning paths by GNN-RAG, we verbalize them and give them as input to a downstream LLM, such as ChatGPT or LLaMA. However, LLMs are sensitive to the input prompt template and the way that the graph information is verbalized.
To alleviate this issue, we opt to follow RAG prompt tuning [Lin et al., 2023, Zhang et al., 2024] for LLMs that have open weights and are feasible to train. A LLaMA2-Chat-7B model is fine-tuned based on the training question-answer pairs to generate a list of correct answers, given the prompt: “Based on the reasoning paths, please answer the given question. Reasoning Paths: {Reasoning Paths} \n Question: {Question}”. The reasoning paths are verbalized as “{question entity} → {relation} → {entity} → · · · → {relation} → {answer entity}”. During training, the reasoning paths are the shortest paths from question entities to answer entities. During inference, the reasoning paths are obtained by GNN-RAG.
GNNs leverage the graph structure to retrieve relevant parts of the KG that contain multi-hop information. We provide experimental evidence on why GNNs are good retrievers for multi-hop KGQA. We train two different GNNs, a deep one (L = 3) and a shallow one (L = 1), and measure their retrieval capabilities. We report the ‘Answer Coverage’ metric, which evaluates whether the retriever is able to fetch at least one correct answer for RAG. Note that ‘Answer Coverage’ does not measure downstream KGQA performance but whether the retriever fetches relevant KG information. ‘#Input Tokens’ denotes the median number of the input tokens of the retrieved KG paths. Table 1 shows GNN retrieval results for single-hop and multi-hop questions of the WebQSP dataset compared to an LLM-based retriever (RoG; Equation 2). The results indicate that deep GNNs (L = 3) can handle the complex graph structure and retrieve useful multi-hop information more effectively (%Ans. Cov.) and efficiently (#Input Tok.) than the LLM and the shallow GNN.
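The ‘Answer Coverage’ metric described above can be sketched as the percentage of questions for which the retrieved entities contain at least one gold answer. The function name `answer_coverage` and the input layout (one collection of retrieved entities and one collection of gold answers per question) are assumptions made for illustration; the toy data below is invented.

```python
from typing import Iterable, List


def answer_coverage(retrieved_per_question: List[Iterable[str]],
                    gold_per_question: List[Iterable[str]]) -> float:
    """Percentage of questions whose retrieved entities include >= 1 gold answer."""
    if not gold_per_question:
        return 0.0
    covered = sum(
        1
        for retrieved, gold in zip(retrieved_per_question, gold_per_question)
        if set(retrieved) & set(gold)
    )
    return 100.0 * covered / len(gold_per_question)


# Toy data: the first question is covered, the second is not.
retrieved = [{"English", "Caribbean"}, {"Paris"}]
gold = [{"English", "Patois"}, {"London"}]
coverage = answer_coverage(retrieved, gold)
```

Since a single matching entity is enough to count a question as covered, this measures retrieval quality only and says nothing about whether the downstream LLM answers correctly.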
Table 1: Retrieval results for WebQSP.
On the other hand, the limitation of GNNs is for simple (1-hop) questions, where accurate question-relation matching is more important than deep graph search (see our Theorem in Appendix A that states this GNN limitation). In such cases, the LLM retriever is better at selecting the right KG information due to its natural language understanding abilities (we provide an example later in Figure 5).
Retrieval augmentation (RA) combines the retrieved KG information from different approaches to increase diversity and answer recall. Motivated by the results in Section 4.3, we present a RA technique (GNN-RAG+RA), which complements the GNN retriever with an LLM-based retriever to combine their strengths on multi-hop and single-hop questions, respectively. Specifically, we experiment with the RoG retrieval, which is described in Equation 2. During inference, we take the union of the reasoning paths retrieved by the two retrievers.
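The union taken at inference time can be sketched as an order-preserving merge that drops duplicate paths. The function name `union_paths` and the list-of-strings path format are assumptions; the two input lists stand in for the outputs of the GNN retriever and the LLM-based retriever.

```python
from typing import List, Sequence


def union_paths(*path_lists: Sequence[Sequence[str]]) -> List[List[str]]:
    """Merge reasoning paths from several retrievers, keeping first occurrences."""
    seen = set()
    merged: List[List[str]] = []
    for paths in path_lists:
        for path in paths:
            key = tuple(path)  # lists are unhashable, tuples are not
            if key not in seen:
                seen.add(key)
                merged.append(list(path))
    return merged


gnn_paths = [["Jamaica", "language_spoken", "English"]]
llm_paths = [["Jamaica", "official_language", "English"],
             ["Jamaica", "language_spoken", "English"]]  # overlaps with the GNN
combined = union_paths(gnn_paths, llm_paths)
```

The same merge applies when the two inputs come from two different GNNs instead of a GNN and an LLM.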
A downside of LLM-based retrieval is that it requires multiple generations (beam-search decoding) to retrieve diverse paths, which trades efficiency for effectiveness (we provide a performance analysis in Appendix A). A cheaper alternative is to perform RA by combining the outputs of different GNNs, which are equipped with different LMs in Equation 3. Our GNN-RAG+Ensemble takes the union of the retrieved paths of the two different GNNs (GNN+SBERT & GNN+LMSR) as input for RAG.
KGQA Datasets. We experiment with two widely used KGQA benchmarks: WebQuestionsSP (WebQSP) [Yih et al., 2015] and Complex WebQuestions 1.1 (CWQ) [Talmor and Berant, 2018]. WebQSP contains 4,737 natural language questions that are answerable using a subset of the Freebase KG [Bollacker et al., 2008]. The questions require up to 2-hop reasoning within this KG. CWQ contains 34,699 total complex questions that require up to 4 hops of reasoning over the KG. We provide the detailed dataset statistics in Appendix C.
Implementation & Evaluation. For subgraph retrieval, we use the linked entities and the PageRank algorithm to extract dense graph information [He et al., 2021]. We employ ReaRev [Mavromatis and Karypis, 2022], which is a GNN targeting deep KG reasoning (Section 4.3), for GNN-RAG. The default implementation is to combine ReaRev with SBERT as the LM in Equation 3. In addition, we combine ReaRev with LMSR, which is obtained by following the implementation of SR [Zhang et al., 2022a]. We employ RoG [Luo et al., 2024] for RAG-based prompt tuning (Section 4.2). For evaluation, we adopt Hit, Hits@1 (H@1), and F1 metrics. Hit measures if any of the true answers is found in the generated response, which is typically employed when evaluating LLMs. H@1 is the accuracy of the top/first predicted answer. F1 takes into account the recall (number of true answers found) and the precision (number of false answers found) of the generated answers. Further experimental setup details are provided in Appendix C.
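The three metrics can be sketched per question as follows. This is an illustration under stated assumptions, not the evaluation script used for the reported numbers: Hit is implemented as case-insensitive substring matching on the generated response, predictions are assumed to be an ordered list of answer strings, and the function names are hypothetical.

```python
from typing import Iterable, List


def hit(response: str, gold: Iterable[str]) -> float:
    """1.0 if any true answer appears in the generated response (substring match)."""
    text = response.lower()
    return float(any(answer.lower() in text for answer in gold))


def hits_at_1(predictions: List[str], gold: Iterable[str]) -> float:
    """1.0 if the top/first predicted answer is a true answer."""
    return float(bool(predictions) and predictions[0] in set(gold))


def f1(predictions: Iterable[str], gold: Iterable[str]) -> float:
    """Harmonic mean of precision and recall over the predicted answer set."""
    pred, true = set(predictions), set(gold)
    tp = len(pred & true)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(true)
    return 2 * precision * recall / (precision + recall)


gold_answers = ["English"]
score_hit = hit("Jamaican people speak English.", gold_answers)
score_h1 = hits_at_1(["English", "Spanish"], gold_answers)
score_f1 = f1(["English", "Spanish"], gold_answers)  # precision 0.5, recall 1.0
```

Benchmark-level scores would be the average of these per-question values over the test set.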
Competing Methods. We compare with SOTA GNN and LLM methods for KGQA [Mavromatis and Karypis, 2022, Li et al., 2023]. We also include earlier embedding-based methods [Saxena et al., 2020] as well as zero-shot/few-shot LLMs [Taori et al., 2023]. We do not compare with semantic parsing methods [Yu et al., 2022] as they use additional training data (SPARQL annotations), which are difficult to obtain in practice. Furthermore, we compare GNN-RAG with LLM-based retrieval approaches [Luo et al., 2024, Sun et al., 2024] in terms of efficiency and effectiveness.
Table 2: Performance comparison of different methods on the two KGQA benchmarks. We denote the best and second-best method.
Hit is used for LLM evaluation. We use the default GNN-RAG (+RA) implementation. GNN-RAG, RoG, KD-CoT, and G-Retriever use 7B fine-tuned LLaMA2 models. KD-CoT employs ChatGPT as well.
Table 3: Performance analysis (F1) on multi-hop (hops≥ 2) and multi-entity (entities≥ 2) questions.
Main Results. Table 2 presents performance results of different KGQA methods. GNN-RAG performs best overall, achieving state-of-the-art results on the two KGQA benchmarks in almost all metrics. The results show that equipping LLMs with GNN-based retrieval boosts their reasoning ability significantly (GNN+LLM vs. KG+LLM). Specifically, GNN-RAG+RA outperforms RoG by 5.0–6.1% points at Hit, while it outperforms or matches ToG+GPT-4 performance, using an LLM with only 7B parameters and far fewer LLM calls; we estimate ToG+GPT-4 has an overall cost above $800, while GNN-RAG can be deployed on a single 24GB GPU. GNN-RAG+RA outperforms ToG+ChatGPT by up to 14.5% points at Hit and the best performing GNN by 5.3–9.5% points at Hits@1 and by 0.7–10.7% points at F1.
Multi-Hop & Multi-Entity KGQA. Table 3 compares performance results on multi-hop questions, where answers are more than one hop away from the question entities, and multi-entity questions, which have more than one question entity. GNN-RAG leverages GNNs to handle complex graph information and outperforms RoG (LLM-based retrieval) by 6.5–17.2% points at F1 on WebQSP and by 8.5–8.9% points at F1 on CWQ. In addition, GNN-RAG+RA offers an additional improvement by up to 6.5% points at F1. The results show that GNN-RAG is an effective retrieval method when deep graph search is important for successful KGQA.
Table 4: Performance comparison (F1 at KGQA) of different retrieval augmentations (Section 4.4). ‘#LLM Calls’ are controlled by the hyperparameter k (number of beams) during beam-search decoding for LLM-based retrievers, ‘#Input Tokens’ denotes the median number of tokens.
Retrieval Augmentation. Table 4 compares different retrieval augmentations for GNN-RAG. The primary metric is F1, while the other metrics assess how well the methods retrieve relevant information from the KG. Based on the results, we make the following conclusions:
GNN-based retrieval is more efficient (#LLM Calls, #Input Tokens) and effective (F1) than LLM-based retrieval, especially for complex questions (CWQ); see rows (e-f) vs. row (d).
Retrieval augmentation works the best (F1) when combining GNN-induced reasoning paths with LLM-induced reasoning paths as they fetch non-overlapping KG information (increased #Input Tokens) that improves retrieval for KGQA; see rows (h) & (i).
Augmenting all retrieval approaches does not necessarily cause improved performance (F1) as the long input (#Input Tokens) may confuse the LLM; see rows (g/j) vs. rows (e/h).
Although the two GNNs perform differently at KGQA (F1), they both improve RAG with LLMs; see rows (a-b) vs. rows (e-f). We note though that weak GNNs are not effective retrievers (see Appendix D.2).
In addition, GNN-RAG improves the vanilla LLM by up to 176% at F1 without incurring additional LLM calls; see row (c) vs. row (e). Overall, retrieval augmentation of GNN-induced and LLM-induced paths combines their strengths and achieves the best KGQA performance.
Table 5: Retrieval effect on performance (% Hit) using various LLMs.
Retrieval Effect on LLMs. Table 5 presents performance results of various LLMs using GNN-RAG or LLM-based retrievers (RoG and ToG). We report the Hit metric as it is difficult to extract the number of answers from LLM’s output. GNN-RAG (+RA) is the retrieval approach that achieves the largest improvements for RAG. For instance, GNN-RAG+RA improves ChatGPT by up to 6.5% points at Hit over RoG and ToG. Moreover, GNN-RAG substantially improves the KGQA performance of weaker LLMs, such as Alpaca-7B and Flan-T5-xl. The improvement over RoG is up to 13.2% points at Hit, while GNN-RAG outperforms LLaMA2-Chat-70B+ToG using a lightweight 7B LLaMA2 model. The results demonstrate that GNN-RAG can be integrated with other LLMs to improve their KGQA reasoning without retraining.
Case Studies on Faithfulness. Figure 4 illustrates two case studies from the CWQ dataset, showing how GNN-RAG improves LLM’s faithfulness, i.e., how well the LLM follows the question’s instructions and uses the right information from the KG. In both cases, GNN-RAG retrieves multi-hop information, which is necessary for answering the questions correctly. In the first case, GNN-RAG retrieves both crucial facts <Gilfoyle → characters_that_have_lived_here → Toronto> and <Toronto → province.capital → Ontario> that are required to answer the question, unlike the KG-RAG baseline (RoG) that fetches only the first fact. In the second case, the KG-RAG baseline incorrectly retrieves information about <Erin Brockovich → person> and not <Erin Brockovich → film_character> that the question refers to. GNN-RAG uses GNNs to explore how
Figure 4: Two case studies that illustrate how GNN-RAG improves the LLM’s faithfulness. In both cases, GNN-RAG retrieves multi-hop information that is necessary for answering the complex questions.
Figure 5: One case study that illustrates the benefit of retrieval augmentation (RA). RA uses LLMs to fetch semantically relevant KG information, which may have been missed by the GNN.
Figure 5 illustrates one case study from the WebQSP dataset, showing how RA (Section 4.4) improves GNN-RAG. Initially, the GNN does not retrieve helpful information due to its limitation to understand natural language, i.e., that
Further ablation studies are provided in Appendix D. Limitations are discussed in Appendix E.
We introduce GNN-RAG, a novel method for combining the reasoning abilities of LLMs and GNNs for RAG-based KGQA. Our contributions are the following. (1) Framework: GNN-RAG repurposes GNNs for KGQA retrieval to enhance the reasoning abilities of LLMs. Moreover, our retrieval analysis guides the design of a retrieval augmentation technique to boost GNN-RAG performance. (2) Effectiveness & Faithfulness: GNN-RAG achieves state-of-the-art performance in two widely used KGQA benchmarks (WebQSP and CWQ). Furthermore, GNN-RAG is shown to retrieve multi-hop information that is necessary for faithful LLM reasoning on complex questions. (3) Efficiency: GNN-RAG improves vanilla LLMs on KGQA performance without incurring additional LLM calls as existing RAG systems for KGQA require. In addition, GNN-RAG outperforms or matches GPT-4 performance with a 7B tuned LLM.