

Graph Neural Prompting

  • Related Project: Private
  • Category: Paper Review
  • Date: 2023-09-28

Graph Neural Prompting with Large Language Models

  • url: https://arxiv.org/abs/2309.15427
  • pdf: https://arxiv.org/pdf/2309.15427
  • abstract: Large Language Models (LLMs) have shown remarkable generalization capability with exceptional performance in various language modeling tasks. However, they still exhibit inherent limitations in precisely capturing and returning grounded knowledge. While existing work has explored utilizing knowledge graphs to enhance language modeling via joint training and customized model architectures, applying this to LLMs is problematic owing to their large number of parameters and high computational cost. In addition, how to leverage the pre-trained LLMs and avoid training a customized model from scratch remains an open question. In this work, we propose Graph Neural Prompting (GNP), a novel plug-and-play method to assist pre-trained LLMs in learning beneficial knowledge from KGs. GNP encompasses various designs, including a standard graph neural network encoder, a cross-modality pooling module, a domain projector, and a self-supervised link prediction objective. Extensive experiments on multiple datasets demonstrate the superiority of GNP on both commonsense and biomedical reasoning tasks across different LLM sizes and settings.


TL;DR


  1. Combining knowledge graphs with language models: the paper proposes Graph Neural Prompting (GNP), a new method that improves the performance of existing Large Language Models (LLMs) by incorporating Knowledge Graphs (KGs).
  2. Experiments on multiple public benchmark datasets covering commonsense and biomedical reasoning validate the effectiveness of GNP.
  3. GNP is designed to encode the knowledge graph and connect that knowledge to the LLM's input text; its components include a graph neural network (GNN), cross-modality pooling, and a domain projector.

  1. GNP (Graph Neural Prompting) is a novel plug-and-play method that lets pre-trained large language models (LLMs) absorb beneficial knowledge from knowledge graphs (KGs).
  2. The paper reports a +13.5% improvement over the baseline when the LLM is frozen and a +1.8% improvement when the LLM is tuned.
  3. Extensive experiments on commonsense and biomedical reasoning tasks demonstrate that GNP performs strongly across different datasets and settings.
  4. Combined with parameter-efficient tuning (LoRA), GNP matches or surpasses full model fine-tuning in 10 out of 12 evaluations, at a much lower training cost.

Method

  1. Extract relevant facts from the knowledge graph (KG) to obtain structured knowledge.
  2. Use a GNN to encode the complex relations in the knowledge graph and produce embeddings for the individual entities (nodes).
  3. Dynamically identify the node embeddings most relevant to the input text and consolidate them into a graph-level embedding.
  4. Use a domain projector to bridge the gap between the graph-level embedding and the text domain, converting the information into a form the language model can consume.
  5. To strengthen awareness of structural information and learn graph knowledge in a self-supervised way, set up a link prediction task: mask some edges of the graph and train the model to predict them.
  6. Graph Neural Prompting (GNP): finally, combine all of the steps above to produce the Graph Neural Prompt, feed it to the language model as input, and perform downstream tasks such as text generation or question answering (see the sketch below).
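
The following is a minimal, illustrative PyTorch sketch of how steps 2–5 could be composed (GNN encoding, cross-modality pooling, domain projection). The module choices, dimensions, and attention settings are assumptions for illustration rather than the authors' implementation; subgraph retrieval (step 1) and the LLM call (step 6) are only indicated in comments, and the self-supervised link prediction objective (step 5) is a training-time loss omitted here.

```python
import torch
import torch.nn as nn

class GraphNeuralPromptingSketch(nn.Module):
    """Illustrative composition of the GNP stages (not the paper's code)."""

    def __init__(self, d_graph=1024, d_text=2048, heads=8):
        super().__init__()
        self.gnn = nn.Linear(d_graph, d_graph)                      # placeholder for the GNN encoder
        self.self_attn = nn.MultiheadAttention(d_graph, heads, batch_first=True)
        self.text_proj = nn.Sequential(nn.Linear(d_text, d_graph), nn.GELU())
        self.cross_attn = nn.MultiheadAttention(d_graph, heads, batch_first=True)
        self.projector = nn.Sequential(nn.Linear(d_graph, d_graph), nn.GELU(),
                                       nn.Linear(d_graph, d_text))  # graph space -> LLM space

    def forward(self, node_emb, text_emb):
        # node_emb: (1, num_nodes, d_graph) entity embeddings of the retrieved subgraph (step 1)
        # text_emb: (1, num_tokens, d_text) LLM token embeddings of the question and options
        h1 = torch.relu(self.gnn(node_emb))            # step 2: encode subgraph nodes
        h2, _ = self.self_attn(h1, h1, h1)             # step 3: node significance via self-attention
        t = self.text_proj(text_emb)                   #         align text embeddings to graph dim
        h3, _ = self.cross_attn(h2, t, t)              #         text-conditioned node embeddings
        h4 = h3.mean(dim=1)                            #         average pooling -> graph-level embedding
        z = self.projector(h4)                         # step 4: project into the LLM embedding space
        return z.unsqueeze(1)                          # Graph Neural Prompt: one soft-prompt vector

gnp = GraphNeuralPromptingSketch()
prompt = gnp(torch.randn(1, 12, 1024), torch.randn(1, 40, 2048))
print(prompt.shape)  # torch.Size([1, 1, 2048]); step 6 prepends this to the LLM input embeddings
```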

1 Introduction

Recently, LLMs have shown strong performance across a wide range of natural language processing (NLP) tasks. Unlike the conventional approach of adapting a model to a specific task by updating a large number of parameters, prompt-based adaptation offers a way to steer a model's behavior for a given task without modifying those parameters.

1.1 Problem Definition

Existing LLMs are strong at exploiting text-based knowledge but limited in their ability to use structured knowledge effectively. As a remedy, the paper explores how knowledge graphs can help the model learn richer and more accurate knowledge.

1.2 Proposed Approach

The idea is to supply LLMs with structured knowledge from knowledge graphs so that the model grounds itself in real-world facts and generates more accurate answers. To this end, the paper proposes Graph Neural Prompting (GNP).


2 Related Work

2.1 LLMs and Question Answering

LLMs play a central role in question answering, and a variety of models and techniques have been proposed for this purpose. However, these models often fail to capture accurate factual knowledge.

2.2 Leveraging Knowledge Graphs

Knowledge graphs store a wide range of facts in a structured form and thus offer an opportunity to improve the performance of LLMs. In particular, pre-training with knowledge graphs and integrating knowledge for question answering have become important research directions.


3 Preliminaries

3.1 Knowledge Graph Definition

A knowledge graph is defined by a set of entities \(E\), a set of relations \(R\), and a collection of fact triples \(T = \{(e_h, r, e_t)\} \in E \times R \times E\).

3.2 Multiple-Choice Question Answering

Given a question \(Q\), answer options \(A = \{a_k\}_{k=1}^K\), and optional context \(C\), the task is to design a model \(F_\Theta\) that selects the most appropriate answer.


4 Method

4.1 Question Answering via Prompting

The typical prompting approach for LLMs tokenizes the question, context, and answer options into input text tokens \(X\) and prepends prompt tokens \(P\). Here the prompt is a soft prompt that carries structured knowledge, reflecting information encoded from the knowledge graph.

4.2 Subgraph Retrieval

A subgraph \(G'\) of the knowledge graph is retrieved for the input text tokens \(X\), containing the entities related to the text and the relations among them. By linking the entities mentioned in the question and options and retrieving the subgraph around them, the model is given the context needed to produce a more accurate answer.

4.3 Graph Neural Prompting

GNP comprises a GNN encoder, a cross-modality pooling module, a domain projector, and a self-supervised link prediction objective. Each component is designed to convey the complex relations in the knowledge graph to the model effectively.

4.4 Domain Projector and Link Prediction

The domain projector aligns the graph-level embedding with the text domain, and link prediction helps the model recognize relations between entities and learn graph knowledge. Together, they reduce noise and help the model focus on the important information.


5 Experiments

5.1 Experimental Setup

Experiments are conducted in the general domain and the biomedical domain, using the ConceptNet and UMLS knowledge graphs respectively. The datasets were chosen to evaluate a range of reasoning abilities.

5.2 Results

With the LLM frozen, GNP delivers substantial gains over prompt-design baselines across a variety of datasets and settings, suggesting that GNP can effectively deliver structured knowledge to LLMs.


6 Performance Comparison

The performance of various models, including GNP, was compared across different LLM settings. GNP achieved improved performance while adjusting only a minimal number of LLM parameters.


7 Ablation Study

An ablation study was conducted to analyze the contribution of each component of GNP. The domain projector, cross-modality pooling, and self-supervised link prediction were all found to have a substantial effect on model performance.

In summary, the paper proposes a new method that uses knowledge graphs to improve the performance of LLMs and validates it through experiments on a variety of datasets.


1 Introduction

Large Language Models (LLMs) have demonstrated exceptional performance capability in various NLP tasks and use cases such as question answering (Robinson, Rytting, and Wingate 2023) and text summarization (Zhang et al. 2023). Moreover, the significant growth in model size has further endowed LLMs with emergent capabilities (Wei et al. 2022b), laying the groundwork for exploring artificial general intelligence (Bubeck et al. 2023). Accordingly, LLMs have attracted tremendous interest from academia (Wei et al. 2022a; Zhao et al. 2023) and industry (Anil et al. 2023; OpenAI 2023).

Given the broad success of LLMs, many techniques have emerged to adapt these general-purpose models to downstream tasks. Beyond the conventional approach of model fine-tuning where all model parameters are adjusted (Howard and Ruder 2018), prompt-based adaptation methods are proposed to modulate a frozen LLM's behavior through prompts (Brown et al. 2020; Lester, Al-Rfou, and Constant 2021; Li and Liang 2021). Rather than adapting the parameters in LLMs, these methods freeze the LLMs and typically introduce additional trainable parameters. The idea of freezing LLMs is appealing, especially as the model size grows and the training resource dependency intensifies.

Figure 1: Result comparison across LLM Frozen (parameters unchanged) and LLM Tuned (parameters updated) settings. The proposed Graph Neural Prompting significantly improves the performance. Reported results are averaged across six datasets on two tasks for an 11B FLAN-T5 model.

On the other hand, despite the success of LLMs in handling different real-world applications and the feasibility of adapting to specific downstream tasks, they still exhibit the inherent limitations of language modeling in accurately capturing and returning grounded knowledge (Lewis et al. 2020; Pan et al. 2023). Knowledge graphs (KGs), storing enormous facts, serve as a systematic way of representing knowledge (Ji et al. 2021). Consequently, existing methods have incorporated KGs to assist language modeling, often by designing customized model architectures to accommodate both KGs and textual data, followed by joint training sessions (Yasunaga et al. 2022; Zhang et al. 2022). Nonetheless, jointly training on KGs and text for LLMs is challenging due to the extensive parameters LLMs contain and the substantial computation resources they require. In addition, numerous pre-trained LLMs with exceptional capabilities have been released. It becomes advantageous to employ these pre-existing LLMs, particularly beneficial if we can sidestep the need to craft a specialized model and train it from scratch. A direct approach to employing KGs for retrieval-augmented generation (Lewis et al. 2020) is to feed the KG triples into LLMs directly (Baek, Aji, and Saffari 2023). However, this method can introduce substantial noise, given that KGs might contain various extraneous contexts. Therefore, we ask:

Can we learn beneficial knowledge from KGs and integrate it into pre-trained LLMs? To answer the question, we propose Graph Neural Prompting (GNP), a novel plug-and-play method to assist pre-trained LLMs in learning beneficial knowledge from KGs. GNP retrieves and encodes the pertinent grounded knowledge to derive Graph Neural Prompt, an embedding vector that can be sent into LLMs to provide guidance and instructions. In particular, GNP first utilizes a graph neural network (GNN) to capture and encode the intricate graph knowledge into entity/node embeddings. Then, a cross-modality pooling module is presented to determine the most relevant node embeddings in relation to the text input, and consolidate these node embeddings into a holistic graph-level embedding. After that, GNP encompasses a domain projector to bridge the inherent disparities between the graph and text domains. Finally, a self-supervised link prediction objective is introduced to enhance the model comprehension of relationships between entities and capture graph knowledge in a self-supervised manner.

To fully evaluate our model, we conduct extensive experiments on multiple public benchmark datasets in the tasks of commonsense reasoning and biomedical reasoning. We further report the results across different LLM sizes and settings. We conclude that GNP can effectively encode intricate knowledge in KGs and significantly improve performance. Figure 1 shows the averaged performance improvement using our method across six datasets. Specifically, GNP improves the baseline by +13.5% when the LLM is frozen, validating the superiority of our method in learning effective prompts. In addition, by using our method, fine-tuning LLMs with the parameter-efficient approach LoRA (Hu et al. 2022) shows an improvement of +1.8%. More promisingly, compared to model full fine-tuning without leveraging any efficient tuning approaches, our method can achieve competitive or superior performance in 10 out of 12 evaluations, as shown in the experiment section. To summarize, our main contributions are:

  • To the best of our knowledge, this is the first attempt to study the learning of beneficial knowledge from KGs for pre-trained LLMs.
  • We propose GNP, a novel plug-and-play method for pre-trained LLMs to extract valuable knowledge from KGs. The proposed method contains various tailored designs, including a standard GNN, a cross-modality pooling module, a domain projector, and a self-supervised graph learning objective.
  • Extensive experiments demonstrate the superiority of GNP on multiple datasets across different settings. We also present the ablation study, model design comparison, parameter sensitivity analysis, case study and visualization to validate the effectiveness of GNP.

2 Related Work

Large Language Models and Question Answering. Recently, various LLMs have been proposed (Chung et al. 2022; Touvron et al. 2023; Brown et al. 2020) and have demonstrated remarkable performance across different tasks (Shi et al. 2023; Chen et al. 2023b; Wei et al. 2024; Hong et al. 2023). Question answering, as a fundamental task, demands intricate reasoning and comprehension skills to interpret the text and provide appropriate responses to the posed questions (Lu et al. 2022; Zhu et al. 2021; Wang et al. 2023; Chen et al. 2023a). Although LLMs have strong learning capabilities, they are limited in precisely capturing accurate factual knowledge and are susceptible to generating unfounded responses (Zhao et al. 2023; Ji et al. 2023; Bang et al. 2023). In addition, the enormous number of parameters in LLMs poses difficulties in adapting LLMs for downstream tasks (Scao et al. 2022; Smith et al. 2022). Correspondingly, various approaches are presented to alleviate the intensive training dependency and reduce the computational expenses (Lester, Al-Rfou, and Constant 2021; Li and Liang 2021; Hu et al. 2022). For instance, Prompt Tuning (Lester, Al-Rfou, and Constant 2021) introduces soft prompts to condition the pre-trained LLMs for downstream tasks. In our work, we propose to retrieve factual knowledge from KGs to enhance LLMs, while still circumventing the burdensome training expenses by using pre-trained LLMs.

Knowledge Graphs for Language Modeling. Many graph learning methods are proposed to encode graphs and KGs (Ji et al. 2021; Tian et al. 2023a,b; Tang et al. 2022; Wang, Jin, and Derr 2022; Xu et al. 2023; Kou et al. 2022). Recent studies indicate that KGs can enhance language modeling by providing background knowledge (Ren et al. 2021; Wang et al. 2019). One approach to achieve this is integrating KGs into the pre-training stage of language modeling. For instance, ERNIE (Sun et al. 2021), JAKET (Yu et al. 2022), and JointGT (Ke et al. 2021) develop pre-training objectives tailored for KG triples and the paired sentences. DRAGON (Yasunaga et al. 2022) introduces a customized fusion framework to jointly pre-train the model for KGs and text. Moreover, KGs are leveraged to assist language modeling for question answering (Lin et al. 2019; Lv et al. 2020; Feng et al. 2020; Mihaylov and Frank 2018). Specifically, GreaseLM (Zhang et al. 2022) and QAGNN (Yasunaga et al. 2021) suggest that KGs can scaffold reasoning about entities with the graph structure, such as negation and multi-hop reasoning, to facilitate complex question answering. To encode KGs, many works study methods to learn KG entity and relation embeddings, such as TransE (Bordes et al. 2013) and DistMult (Yang et al. 2015). Recently, with the aim of integrating KGs into the emerging domain of LLMs, where existing studies are difficult to apply directly, KAPING (Baek, Aji, and Saffari 2023) employs knowledge graphs to extract relevant triples. These triples correspond to the input question, with the expectation that directly feeding them into LLMs is beneficial, despite the presence of noise. In our work, we present a learning method for identifying beneficial knowledge from KGs, offering substantial benefits to LLMs.

Figure 2: The overall framework. Given a multiple choice question, we first retrieve subgraphs from the knowledge graph based on the entities in the question and options. We then develop Graph Neural Prompting (GNP) to encode the pertinent factual knowledge and structural information to obtain the Graph Neural Prompt. GNP contains various designs including a GNN, a cross-modality pooling module, a domain projector, and a self-supervised link prediction objective. Later, the obtained Graph Neural Prompt is sent into LLM for inference along with the input text embedding. We utilize the standard maximum likelihood objective for downstream task adaptation, while LLM is kept frozen or tuned depending on different experimental settings.

3 Preliminary

In this section, we describe the knowledge graph and formally define the problem of multiple choice question answering.

Definition 1. Knowledge Graph. A knowledge graph is defined as \(G = (E, R, T)\), where:

  • \(E\) is the set of entities,
  • \(R\) is the set of relations,
  • \(T\) is the collection of fact triples \(\{(e_h, r, e_t)\} \in E \times R \times E\), where \(e_h\) denotes the head entity, \(r\) is the relation, and \(e_t\) indicates the tail entity.

Problem 1. Multiple Choice Question Answering. Given a question \(Q\), a set of answer options \(A = \{a_k\}_{k=1}^K\), and an optional context \(C\) depending on open-book or close-book, the task is to design a machine learning model \(F_\Theta\) with parameters \(\Theta\) that selects the best option to answer the question. Here \(K\) denotes the total number of answer options and \(a_k\) indicates the \(k\)-th answer option. The ground truth label \(y \in A\) is the correct answer for \(Q\). In addition, we use the knowledge graph \(G\) to provide rich knowledge and assist the model to answer the question.
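
To make the definitions above concrete, here is a small, hypothetical data-structure sketch of a knowledge graph and a multiple-choice QA instance; the field names and the sample option texts are illustrative only (the question itself appears in the case study later in the paper).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (head entity e_h, relation r, tail entity e_t)

@dataclass
class KnowledgeGraph:
    entities: List[str]    # E
    relations: List[str]   # R
    triples: List[Triple]  # T, fact triples over E x R x E

@dataclass
class MCQAInstance:
    question: str                  # Q
    options: List[str]             # A = {a_k}, k = 1..K
    answer: str                    # ground-truth label y in A
    context: Optional[str] = None  # optional context C (open-book setting)

sample = MCQAInstance(
    question="Were there fossil fuels in the ground when humans evolved?",
    options=["(a) no", "(b) unknown", "(c) yes"],   # illustrative option texts
    answer="(c) yes",
)
print(sample.question, "->", sample.answer)
```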

4 Method

Methodology: In this section, we introduce the techniques of prompting LLMs for question answering as well as subgraph retrieval. Additionally, we present Graph Neural Prompting and elaborate on its components and designs. Figure 2 illustrates the framework of our method.

Prompting LLMs for Question Answering: Prompting is the de facto approach to elicit responses from LLMs (Liu et al. 2023). The typical approach of prompting LLMs for multiple-choice question answering is simple. Given a question $Q$, the optional context $C$, and the answer options $A$, we first tokenize the concatenation of $C$, $Q$, $A$ into a sequence of input text tokens $X$. We then design a series of prompt tokens, $P$, and prepend it to the input text tokens $X$, which is later considered as input for the LLM to generate the prediction $y' = f([P, X])$. The LLM can be trained for downstream task adaptation using a standard maximum likelihood objective with teacher forcing (Williams and Zipser 1989) and a cross-entropy loss:

\[\mathcal{L}_{\text{llm}} = -\log p(y \mid X, \Theta) \tag{1}\]

where

  • \(y\) represents the target sequence,
  • \(X\) represents the sequence of input text tokens concatenated with the prompt tokens,
  • \(\Theta\) represents the model parameters,
  • \(p(y \mid X, \Theta)\) is the probability of generating the target sequence \(y\) given the input \(X\) and the model parameters \(\Theta\).

In other words, the text generation objective minimizes the negative log-likelihood of the target sequence given the input sequence and the model parameters. The prompt \(P\) can be either a hard prompt in the form of textual input, or a soft prompt in the form of learnable embedding vectors.

Unlike existing methods that solely use a text string as the hard prompt, our Graph Neural Prompting approach encodes structural and factual information contained in the knowledge graph $G$ into a soft prompt $P$, which is a sequence of trainable vectors that can be concatenated with the token embedding of $X$. The learning of $P$ is encouraged to provide rich structural information and knowledge from $G$ as well as task instruction for each data instance.
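
As a rough sketch of how a soft prompt can be prepended to a frozen LLM's input embeddings and trained with the maximum-likelihood objective of Eq. 1, the snippet below uses a small Hugging Face FLAN-T5 checkpoint as a stand-in for the 3B/11B models used in the paper; the random vector `z` stands in for the Graph Neural Prompt, and the question text is made up.

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-small")
tok = AutoTokenizer.from_pretrained("google/flan-t5-small")
for p in model.parameters():           # LLM Frozen setting: keep the LLM parameters unchanged
    p.requires_grad_(False)

question = "Question: What do plants need to grow? Options: (a) sand (b) sunlight Answer:"
enc = tok(question, return_tensors="pt")
labels = tok("sunlight", return_tensors="pt").input_ids

# z stands in for the Graph Neural Prompt; in GNP it is produced from the retrieved subgraph.
z = torch.randn(1, 1, model.config.d_model, requires_grad=True)

token_embeds = model.get_input_embeddings()(enc.input_ids)   # embed the input text tokens X
inputs_embeds = torch.cat([z, token_embeds], dim=1)          # [P, X]: prepend the soft prompt
attention_mask = torch.cat(
    [torch.ones(1, 1, dtype=enc.attention_mask.dtype), enc.attention_mask], dim=1
)

# Standard maximum-likelihood (cross-entropy) objective with teacher forcing (Eq. 1).
loss = model(inputs_embeds=inputs_embeds, attention_mask=attention_mask, labels=labels).loss
loss.backward()   # gradients reach only z (and, in GNP, the prompt-producing modules)
```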

Subgraph Retrieval: To semantically align the input text tokens $X$ with the massive knowledge graph $G$ with millions of nodes, we retrieve subgraphs of $G$ that contain the relevant entities to the tokens in $X$. In particular, for each answer option $a_k$ and its corresponding context $C$ and question $Q$, we first obtain a set of matched entities $E_{\text{match}}$ via entity linking to match the tokens in $X$ to the entities in $G$. We then retrieve a subgraph $G’$ based on the entities in $E_{\text{match}}$ by including their two-hop neighbors and the relations that connect them (Yasunaga et al. 2022). The retrieved subgraph contains the necessary content and knowledge to assist the model in answering $Q$.
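
A toy sketch of the two-hop retrieval step, assuming entity linking has already produced the matched entities $E_{\text{match}}$; the miniature triple list and entity names are invented for illustration (ConceptNet or UMLS would play this role in practice).

```python
from collections import defaultdict

# A miniature KG as (head, relation, tail) triples.
triples = [
    ("baby", "related_to", "family"),
    ("family", "related_to", "genealogy"),
    ("baby", "at_location", "nursery"),
    ("record", "related_to", "genealogy"),
]

neighbors = defaultdict(set)            # undirected adjacency over entities
for h, _, t in triples:
    neighbors[h].add(t)
    neighbors[t].add(h)

def retrieve_subgraph(matched_entities, hops=2):
    """Keep entities within `hops` of E_match, then the triples among those entities (G')."""
    nodes, frontier = set(matched_entities), set(matched_entities)
    for _ in range(hops):
        frontier = {n for e in frontier for n in neighbors[e]} - nodes
        nodes |= frontier
    return [(h, r, t) for h, r, t in triples if h in nodes and t in nodes]

print(retrieve_subgraph({"baby", "record"}))   # E_match from entity linking on C, Q, and a_k
```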


Graph Neural Prompting: Graph Neural Prompting contains various designs, including a GNN encoder that embeds the knowledge graph, a cross-modality pooling module that determines the pertinent node embeddings, a domain projector that bridges the discrepancies between graph and text, and a self-supervised link prediction objective that encourages the model to recognize structural information.

GNN Encoder: Although the retrieved subgraph $G'$ contains rich contextual information regarding the question and answer choices, some entities and relations are not relevant to the actual question. Directly feeding every fact triple in $G'$ can introduce noise and prevent the LLM from concentrating on the critical information. Therefore, we introduce a GNN to encode the most relevant knowledge and further integrate the complex relationships among the entities. In particular, we first initialize the node embeddings using pre-trained entity embeddings (Feng et al. 2020; Yasunaga, Leskovec, and Liang 2022). Next, we employ a standard graph attention network (Veličković et al. 2018) as our GNN encoder for the retrieved subgraph $G'$. The encoding process is formulated as follows:

\[H_1 = f_{\text{GNN}}(G')\]

where \(H_1 \in \mathbb{R}^{d_g}\) represents the node embeddings learned by the GNN for every node in \(G'\).
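
Since the paper adopts a standard graph attention network as the encoder, one possible sketch uses PyTorch Geometric's GATConv; the layer count, head count, and activation below are assumptions for illustration.

```python
import torch
from torch_geometric.nn import GATConv

class SubgraphEncoder(torch.nn.Module):
    """f_GNN: maps the retrieved subgraph G' to node embeddings H1 (illustrative settings)."""

    def __init__(self, d_in=1024, d_g=1024, heads=4, num_layers=3):
        super().__init__()
        self.layers = torch.nn.ModuleList(
            [GATConv(d_in if i == 0 else d_g, d_g // heads, heads=heads) for i in range(num_layers)]
        )

    def forward(self, x, edge_index):
        # x: (num_nodes, d_in) pre-trained entity embeddings; edge_index: (2, num_edges) of G'
        for conv in self.layers:
            x = torch.relu(conv(x, edge_index))
        return x  # H1

x = torch.randn(12, 1024)                      # 12 entities in the retrieved subgraph
edge_index = torch.randint(0, 12, (2, 30))     # 30 directed edges
print(SubgraphEncoder()(x, edge_index).shape)  # torch.Size([12, 1024])
```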

Cross-modality Pooling: With the aim of identifying the most pertinent nodes in relation to the question, and consolidating the node embeddings into a holistic graph-level representation for subsequent use, we design the cross-modality pooling module. In particular, we first introduce a self-attention layer to dynamically identify node significance using the internal graph characteristics and the implicit interactions among nodes:

\[H_2 = \text{Self-Attn}(H_1)\]

where \(H_2\) is the node embeddings obtained after calculating self-attention and Self-Attn indicates the self-attention component.

Then, we leverage the textual prompt to calculate the importance of nodes within the graph. To ensure uniformity, we utilize the dictionary in the LLM to obtain the text embeddings $T \in \mathbb{R}^{d_t}$ for every token in the input text, where $d_t$ denotes the dimension of the LLM dictionary. Concretely, we start by applying a transformation to the text embeddings $T$ and obtain the transformed text embedding $T'$, ensuring that the dimension of $T'$ matches the dimension $d_g$ of the node embeddings $H_2$. After that, we calculate the cross-modality attention using $H_2$ and $T'$, with $H_2$ as the query and $T'$ as the key and the value. The procedure is as follows:

\[H_3 = \text{Cross-Modal-Attn}(H_2, T')\]

where $\sigma$ is the GELU activation function, $\text{FFN}_1$ and $\text{FFN}_2$ are feed-forward neural networks, and $H_3$ is the final node embeddings obtained with cross-modality attention considered. Next, we generate the graph-level embedding by average pooling the node embeddings $H_3$ in $G'$:

\[H_4 = \text{AvgPool}(H_3)\]

where $H_4$ represents the graph-level embedding that takes into account the node significance in $G'$.
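
A compact sketch of the cross-modality pooling computation, assuming the text transformation is a single feed-forward layer with GELU and that standard multi-head attention realizes Self-Attn and Cross-Modal-Attn; the dimensions are illustrative.

```python
import torch
import torch.nn as nn

d_g, d_t = 1024, 2048
self_attn  = nn.MultiheadAttention(d_g, num_heads=8, batch_first=True)
text_ffn   = nn.Sequential(nn.Linear(d_t, d_g), nn.GELU())   # T -> T' (match the node dimension d_g)
cross_attn = nn.MultiheadAttention(d_g, num_heads=8, batch_first=True)

h1 = torch.randn(1, 12, d_g)   # node embeddings H1 from the GNN encoder
t  = torch.randn(1, 40, d_t)   # LLM dictionary embeddings T of the input tokens

h2, _ = self_attn(h1, h1, h1)             # H2: node significance from graph-internal interactions
t_prime = text_ffn(t)                     # T': transformed text embeddings
h3, _ = cross_attn(h2, t_prime, t_prime)  # H3: H2 as query, T' as key and value
h4 = h3.mean(dim=1)                       # H4: average pooling -> graph-level embedding
print(h4.shape)  # torch.Size([1, 1024])
```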

Domain Projector: In order to create a mapping between the graph-level embeddings and the text domain to facilitate comprehension by the LLM, we design a domain projector to align them. This projector aims to bridge the inherent disparities between the graph and text, allowing for more seamless integration. In addition, the projector maps the graph-level embeddings to the same dimension $d_t$ of the LLM, which ensures compatibility and consistency when interfacing with the LLM's inherent structures. We design the projector as follows:

\[Z = \text{Domain-Projector}(H_4)\]

where \(Z\) denotes the Graph Neural Prompt, the final output of GNP, and \(\text{FFN}_3\), \(\text{FFN}_4\) are the feed-forward neural networks that make up the projector.
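
A minimal projector sketch, assuming the two feed-forward networks FFN3 and FFN4 are stacked with a GELU in between, which is one plausible reading of the text; the dimensions are illustrative.

```python
import torch
import torch.nn as nn

d_g, d_t = 1024, 2048
projector = nn.Sequential(           # FFN3 -> GELU -> FFN4: graph space mapped into the LLM space
    nn.Linear(d_g, d_g), nn.GELU(), nn.Linear(d_g, d_t)
)
h4 = torch.randn(1, d_g)             # graph-level embedding from cross-modality pooling
z = projector(h4)                    # Z: Graph Neural Prompt, prepended to the LLM token embeddings
print(z.shape)  # torch.Size([1, 2048])
```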

Self-supervised Link Prediction: While the downstream cross-entropy objective enables the model to learn and adapt to the target dataset, we design a link prediction task to further refine its understanding of relationships between entities and capture graph knowledge in a self-supervised manner. Specifically, we mask out some edges in $G'$ and enforce the model to predict them. This encourages the model to learn to use the partial graph content and structure to reason about the missing links. Concretely, we denote the set of masked-out edges as $E_{\text{mask}}$. Given the learned node embeddings of the head entity and tail entity in a triple, $\{h_3, t_3\} \subseteq H_3$, we adopt the widely-used knowledge graph embedding method DistMult (Yang et al. 2015) to map the entity embeddings and relation in the KG to vectors $h, r, t$. We then define the scoring function $\phi(e_h, e_t) = \langle h, r, t \rangle$ to generate a score for each triple, where $\langle \cdot, \cdot, \cdot \rangle$ denotes the trilinear dot product and $r$ represents the relation in the KG. A higher $\phi$ indicates a higher chance of $(e_h, r, e_t)$ being a correct positive triple rather than an incorrect negative one. We enforce the model to predict the masked edges in $E_{\text{mask}}$ as positive and other random edges as negative. The link prediction loss $\mathcal{L}_{\text{lp}}$ is defined from the following scores:

\[S_{\text{pos}} = -\log \sigma_s\bigl(\phi(e_h, e_t) + \gamma\bigr), \qquad S_{\text{neg}} = \log\bigl(1 - \sigma_s\bigl(\phi(e'_h, e'_t) + \gamma\bigr)\bigr)\]

where \(\gamma\) is the margin, \(\sigma_s\) is the sigmoid function, and \(\{(e'_h, r, e'_t)\}\) are the \(n\) negative triples corresponding to the positive triple \((e_h, r, e_t)\); \(S_{\text{pos}}\) scores the masked positive triple and \(S_{\text{neg}}\) scores the \(n\) incorrect negative triples.

The final objective function $\mathcal{L}$ is defined as the weighted combination of the text generation loss $\mathcal{L}_{\text{llm}}$ (Eq. 1) and the link prediction loss $\mathcal{L}_{\text{lp}}$:

\[\mathcal{L} = \lambda \mathcal{L}_{\text{llm}} + (1-\lambda)\, \mathcal{L}_{\text{lp}}\]

where \(\lambda\) is a trade-off weight for balancing the two losses.
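
An illustrative sketch of DistMult scoring for a masked (positive) triple and sampled negatives, followed by the λ-weighted combination with the text-generation loss; how $S_{\text{pos}}$ and $S_{\text{neg}}$ are aggregated into the link-prediction loss here is an assumption in the spirit of the equations above, not the paper's exact formulation.

```python
import torch

def distmult_score(h, r, t):
    """Trilinear dot product <h, r, t> used by DistMult."""
    return (h * r * t).sum(dim=-1)

d, gamma, n = 64, 1.0, 5
h, r, t = torch.randn(3, d)                           # embeddings of a masked positive triple
h_neg, t_neg = torch.randn(n, d), torch.randn(n, d)   # n corrupted (negative) triples

s_pos = -torch.log(torch.sigmoid(distmult_score(h, r, t) + gamma))
s_neg = torch.log(1.0 - torch.sigmoid(distmult_score(h_neg, r, t_neg) + gamma)).mean()
loss_lp = s_pos - s_neg       # assumed aggregation: reward high positive / low negative scores

# Final objective: weighted combination with the text-generation loss (Eq. 1).
lam = 0.5
loss_llm = torch.tensor(2.3)  # placeholder value for the LLM cross-entropy loss
loss = lam * loss_llm + (1.0 - lam) * loss_lp
print(float(loss))
```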

Table 1: Overall experimental results on commonsense reasoning and biomedical reasoning tasks. The best results across different LLM sizes and settings are highlighted in bold. $\Delta P_T$ and $\Delta LoRA$ represent the relative performance improvement of our method to Prompt Tuning and LoRA, respectively. We also include the full fine-tuning result in gray color for further reference. * means multiple prompt design methods are evaluated while only the best result is reported. Accuracy is used as the evaluation metric.

5 Experiments

In this section, we conduct extensive experiments to compare the performances of different models. We also show an ablation study, model design comparison, and parameter sensitivity analysis to demonstrate the effectiveness of GNP. Moreover, we present a case study and visualization to provide an intuitive understanding and illustrate how KGs benefit the model.

Experiment Setup

Knowledge Graphs and Datasets. We conduct experiments on both the general domain (commonsense reasoning) and the biomedical domain (biomedical reasoning). For the knowledge graphs, we consider ConceptNet (Speer, Chin, and Havasi 2017), which contains rich commonsense knowledge about everyday concepts, and the Unified Medical Language System (UMLS) (Bodenreider 2004), which provides well-structured health and biomedical information. For datasets, we use four commonsense reasoning datasets, including OpenBookQA (OBQA) (Mihaylov et al. 2018), AI2 Reasoning Challenge (ARC) (Clark et al. 2018), Physical Interaction Question Answering (PIQA) (Bisk et al. 2020), and RiddleSense (Riddle) (Lin et al. 2021). In addition, we consider PubMedQA (Jin et al. 2019) and BioASQ (Tsatsaronis et al. 2015) for biomedical reasoning.

Two Settings: LLM Frozen vs. LLM Tuned. To fully evaluate the model, we employ two settings: LLM Frozen and LLM Tuned. For LLM Frozen, we keep the parameters in the LLM unchanged and only adapt the prompt. For LLM Tuned, the original LLM parameters are updated for downstream tasks by utilizing LoRA or full fine-tuning.

Baselines. In the setting of LLM Frozen, we compare with nine baselines, including LLM-only that uses no prompt, three prompt design methods that use different instructions as hard prompts, KG Flattening that flattens the nodes in the graph into a sequence via relevance score (REL) ranking (Yasunaga et al. 2022) or breadth-first search (BFS), KAPING (Baek, Aji, and Saffari 2023) that injects the important KG triples within one-hop (OH) and two-hop (TH) neighborhoods, and Prompt Tuning (Lester, Al-Rfou, and Constant 2021) that introduces soft prompts. In the setting of LLM Tuned, we compare with LoRA that updates partial LLM parameters. In addition, we include full model fine-tuning results as the referencing benchmark.

Implementation Details. For the proposed model, we set the learning rate to 1e-4, batch size to 8, hidden dimension of GNN to 1024, and training epochs to 50. In order to adapt the model effectively to each dataset, we search the GNN layers from 2 to 5, cross-modality pooling layers from 1 to 3, trade-off weight λ from {0.1, 0.5}, and link drop rate from {0.1, 0.3, 0.7}. We choose FLAN-T5 xlarge (3B parameters) and xxlarge (11B parameters) as the LLMs used in this paper. We adjust the maximum sequence length of LLMs to best fit the question length for each dataset. We run all experiments on four NVIDIA Tesla V100 GPUs with 24GB RAM.
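
For reference, the reported hyperparameters and search ranges might be organized as a configuration like the one below; the FLAN-T5 identifiers are assumed Hugging Face model names, not taken from the paper.

```python
from itertools import product

config = {
    "learning_rate": 1e-4,
    "batch_size": 8,
    "gnn_hidden_dim": 1024,
    "epochs": 50,
    "gnn_layers": [2, 3, 4, 5],                  # searched per dataset
    "cross_modality_pooling_layers": [1, 2, 3],  # searched per dataset
    "trade_off_lambda": [0.1, 0.5],
    "link_drop_rate": [0.1, 0.3, 0.7],
    "llm": ["google/flan-t5-xl", "google/flan-t5-xxl"],  # 3B and 11B FLAN-T5
}

# Enumerating part of the per-dataset search space:
grid = list(product(config["gnn_layers"], config["cross_modality_pooling_layers"],
                    config["trade_off_lambda"], config["link_drop_rate"]))
print(len(grid), "candidate settings per dataset")  # 4 * 3 * 2 * 3 = 72
```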

6 Performance Comparison

To comprehensively evaluate our model, we conduct rigorous experiments using various LLMs across two reasoning tasks under different settings. The results are reported in Table 1. According to the table, in the setting of LLM Frozen, we observe that the utilization of the prompt design instructions often yields performance improvement, compared to LLM-only that uses no instructions, though the enhancement is mostly marginal. Interestingly, the baseline methods that inject KG information directly (KG Flattening and KAPING) can significantly hurt the model performance. This aligns with our motivation that KGs contain irrelevant contexts for the downstream tasks that could introduce noise or even alter the semantics if not handled carefully. While Prompt Tuning shows improved outcomes using the trainable soft prompts, the enhancement is trivial. In contrast, our GNP exhibits significant and notable performance improvements across various datasets, settings, and LLMs. For example, for the commonsense reasoning task, GNP provides a +25.37% improvement on Riddle for the 3B LLM, and a +15.66% improvement for the 11B LLM. In addition, for the biomedical reasoning task, GNP improves the performance by +34.14% on BioASQ for the 3B LLM and +38.75% for the 11B LLM. In general, GNP achieves an improvement of +12.76% and +13.54% for the 3B and 11B LLM, respectively.

In the setting of LLM Tuned, we first study the performance in comparison with LoRA and then report model full fine-tuning for additional reference. As shown in the table, LoRA is a significantly more powerful approach than Prompt Tuning due to the direct update of the LLM internal parameters. Combined with the proposed GNP, the performance can be further improved. For example, GNP achieves a 3.73% improvement on OBQA for the 3B LLM, and a 3.57% improvement on BioASQ for the 11B LLM. Moreover, model full fine-tuning is an important reference to study the performance gap since LoRA only updates a small fraction of the model parameters. Surprisingly, we find that the incorporation of GNP can surpass the results of full fine-tuning. In contrast, relying solely on LoRA shows difficulties in achieving a performance comparable to full fine-tuning. In total, our final performance matches or surpasses model full fine-tuning in 10 out of 12 evaluations across different LLM sizes and datasets, as shown in Table 1.

7 Ablation Study

Since GNP contains various model components (i.e., cross-modality pooling (CMP), self-supervised link prediction (SLP), and domain projector (DP)), we conduct ablation studies to analyze the contributions of different components by removing each of them independently (see Table 2). Specifically, removing DP significantly affects the performance, showing that DP has a large contribution to the proposed method. In addition, the decreasing performances of removing CMP and SLP demonstrate the effectiveness of CMP and SLP in enhancing the model. In most cases, SLP yields greater significance compared to CMP, while in BioASQ, CMP plays a more important role. Finally, the proposed GNP achieves the best results in all cases, indicating the strong capability of different components in our model.

Table 2: Results of ablation study.

Model Design Comparison. A salient property of GNP is the learning of Graph Neural Prompt for each data instance, i.e., various questions yield different retrieved subgraphs, resulting in unique prompts. Given its distinction from the dataset-level prompt (DLP) of Prompt Tuning, which learns a prompt for each dataset, we present the outcomes of integrating DLP for further investigation. As shown in Table 3, incorporating DLP cannot further boost the performance and might even diminish it in certain cases. This indicates that our instance-level prompt provides adequate guidance for the LLM to perform well. In addition, we validate the importance of explicitly modeling relations using a widely-used Relational GNN (RGNN) (Zhang et al. 2022). The observed decline in performance suggests that a standard GNN is sufficient to capture the graph information, and explicitly modeling the relations might increase the difficulty of generating suitable guidance for the task.

Parameter Sensitivity. Next, we perform sensitivity analysis focusing on the following parameters: the number of GNN layers and the number of layers in the cross-modality pooling component.

Impact of GNN layers. We evaluate the influence of GNN layers for both 3B and 11B models in Figure 3. According to the figure, we have the following observations. First, various datasets have different optimal numbers of GNN layers. To illustrate, for ARC, 3 layers achieve the optimal performance while 4 layers perform best for PubMedQA. Second, the optimal number of GNN layers for 3B and 11B LLMs differs. For example, for OBQA, 3 layers work best for the 3B LLM, while the 11B LLM reaches its top performance when using 5 layers. Third, choosing different GNN layers can have a weak impact on some datasets while drastically affecting the performance on others. To demonstrate, increasing from 3 layers to 5 layers for the 11B LLM decreases the performance on ARC by a large margin (from 78.1 to 74.3), while adjusting the layers for BioASQ may not lead to a big change in performance.

Figure 3: Performance w.r.t. different number of GNN layers.

Impact of cross-modality pooling layers. We report the performance of different cross-modality pooling layers in Figure 4. As shown in the figure, we observe that the commonsense reasoning dataset OBQA and the biomedical reasoning dataset BioASQ react differently to the number of layers. Specifically, for OBQA, the performance of the larger 11B LLM increases with more layers, while the performance of the smaller 3B LLM decreases. On the other hand, for BioASQ, the larger 11B LLM tends to show degraded performance when adding more layers, while the smaller 3B model presents improved performance. This indicates that a suitable number of cross-modality pooling layers leads to the best model performance.

Figure 4: Performance w.r.t. different number of cross-modality pooling layers.

8 Case Study and Visualization

For a more intuitive understanding and comparison, we randomly select two examples from the OBQA dataset and visualize the retrieved subgraphs in Figure 5. For visualization clarity, we only show question entities and a limited number of their neighbors. We remarkably notice that the retrieved subgraphs encompass certain entities for the correct answer, and there exist edges connecting the question and answer entities, which makes the task of question answering easier by leveraging this information.

Figure 5: Case study on two QA examples from OBQA dataset. Question entities are marked in green and their subsampled neighbors in the KG are marked in blue. The entities appearing in the correct answer are marked in orange.

To answer the question “What is the best way to guess a baby’s eye color?”, Prompt Tuning makes the wrong generation “Just take a random guess”. On the other hand, our retrieved subgraph offers the links that directly relate the entity “babies” to “family”, “record”, and further to “genealogy”, which all appear in the correct option (d). This important context provides valuable insights for the model. Note that the subgraph also contains irrelevant entities such as “round” and “nursery”. This explains why directly using the knowledge graph can introduce noise. However, our GNP method possesses the capability to collect the most critical information in the graph to determine the correct answer.

The second question “Were there fossil fuels in the ground when humans evolved?” requires correctly identifying the historical sequencing order between the entity “humans” and “fossil fuels”. The retrieved subgraph contains the critical relation, i.e., “humans”, “evolve”, “prior”, “fossil fuel”. Nevertheless, the subgraph also contains the entity “created” that could confuse the model into selecting option (a). GNP is able to capture the structural proximity among the key entities and select the correct answer (c).

9 Conclusion

In this paper, we address the limitations of LLMs in precisely capturing and returning grounded knowledge. In particular, we propose Graph Neural Prompting (GNP), a novel plug-and-play method to assist pre-trained LLMs in learning beneficial knowledge from KGs. Extensive experiments on commonsense and biomedical reasoning tasks demonstrate that GNP can improve the performance by +13.5% when the LLM is frozen, and +1.8% when the LLM is tuned. In addition, we present ablation studies, model design comparison, parameter sensitivity, case study and visualization to validate the effectiveness of the proposed method.
