
Survey | Information Extraction Survey

  • Related Project: Private
  • Category: Paper Review
  • Date: 2024-01-01

Large Language Models for Generative Information Extraction: A Survey

  • url: https://arxiv.org/abs/2312.17617
  • pdf: https://arxiv.org/pdf/2312.17617
  • abstract: Information extraction (IE) aims to extract structural knowledge (such as entities, relations, and events) from plain natural language texts. Recently, generative Large Language Models (LLMs) have demonstrated remarkable capabilities in text understanding and generation, allowing for generalization across various domains and tasks. As a result, numerous works have been proposed to harness abilities of LLMs and offer viable solutions for IE tasks based on a generative paradigm. To conduct a comprehensive systematic review and exploration of LLM efforts for IE tasks, in this study, we survey the most recent advancements in this field. We first present an extensive overview by categorizing these works in terms of various IE subtasks and learning paradigms, then we empirically analyze the most advanced methods and discover the emerging trend of IE tasks with LLMs. Based on thorough review conducted, we identify several insights in technique and promising research directions that deserve further exploration in future studies. We maintain a public repository and consistently update related resources at: \url{this https URL}.

[Figure 2 of the paper presents a well-organized tree-style taxonomy diagram.]



TL;DR


  • Explores generative approaches to information extraction in natural language processing
  • Efficient modeling of NER, RE, and EE tasks using large language models
  • Strategy development through analysis of learning paradigms and domain-specific performance

1 Introduction

  • Information Extraction (IE): a core domain of natural language processing that converts plain text into structured knowledge. It is essential for downstream tasks such as knowledge graph construction, knowledge reasoning, and question answering.
  • Large Language Models (LLMs): the emergence of GPT-4, Llama, and other models has improved natural language understanding, generation, and generalization, making generative information extraction methods more applicable in real-world settings.
  • Research trend: LLMs show strong modeling ability across diverse IE tasks such as NER, RE, and EE, achieved by capturing inter-task dependencies and attaining consistent performance through instructive prompts.


2 Preliminaries of Generative IE

  • Objective: maximize the conditional probability of the target extraction sequence given the input text.

    \[\max_{\theta} \prod_{t=1}^m p(y_t \mid y_{1:t-1}, X, P; \theta)\]

    Here, \(\theta\) denotes the parameters of the LLM, which can be trainable or frozen.
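To make the objective concrete, here is a minimal sketch of how a linearized extraction target can be scored under this formula with an off-the-shelf causal LM. The Hugging Face model name ("gpt2"), the prompt wording, and the "(span, type)" target format are illustrative assumptions, not the survey's setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in causal LM
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def target_log_prob(x_text: str, prompt: str, y_text: str) -> float:
    """Sum of log p(y_t | y_{1:t-1}, X, P; theta) over the target tokens."""
    ctx_ids = tokenizer(prompt + x_text, return_tensors="pt").input_ids
    y_ids = tokenizer(y_text, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, y_ids], dim=1)
    with torch.no_grad():
        log_probs = torch.log_softmax(model(input_ids).logits, dim=-1)
    offset = ctx_ids.size(1)
    total = 0.0
    for t in range(y_ids.size(1)):
        # Logits at position offset + t - 1 predict the token at offset + t.
        total += log_probs[0, offset + t - 1, y_ids[0, t]].item()
    return total

# A hypothetical NER target linearized as "(span, type)":
print(target_log_prob(
    x_text="Steve founded a company.",
    prompt="Extract entities as (span, type): ",
    y_text=" (Steve, PERSON)",
))
```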

Types of IE Tasks

  • Named Entity Recognition (NER): comprises two subtasks, identifying entity spans (Entity Identification) and assigning types to the identified entities (Entity Typing)
  • Relation Extraction (RE): divided into three settings: 'Relation Classification', which classifies the relation type between two given entities; 'Relation Triplet', which identifies the relation type and the corresponding head and tail entity spans; and 'Relation Strict', which additionally requires the correct relation type, spans, and types of the head and tail entities
  • Event Extraction (EE): divided into two subtasks: 'Event Detection', which identifies and classifies the trigger word and type indicating the occurrence of an event, and 'Event Arguments Extraction', which identifies and classifies arguments playing specific roles in the event from the sentences.


3 LLMs for Different Information Extraction Tasks

3.1 Named Entity Recognition (NER)

Methods

  • GPT-NER: recasts NER as a text generation problem. The model identifies entities in the input sentence and generates their types, aided by instructive prompts over the text input that help the model make more accurate predictions.
  • Self-verification strategy: to correct errors, outputs for mislabeled inputs are filtered out and reclassified. The model automatically verifies its own predictions and revises them when necessary, raising the accuracy of the final output.

Process

  1. Input preparation: receive a sentence as input and run initial predictions for each word
  2. Entity and type prediction: predict whether each word is part of a particular entity and, if so, what the entity's type is
  3. Self-verification: review the initial predictions and re-analyze suspected errors to revise the predictions
  4. Output: produce the final, corrected entity and type information (a prompt sketch follows below)
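Below is a hedged sketch of this workflow in the spirit of GPT-NER, which marks entity spans with @@ and ## in the generated text and then self-verifies each span. The exact prompt wording and the generic `llm` callable are assumptions for illustration; only the marker idea and the yes/no self-verification step come from the paper.

```python
def build_ner_prompt(sentence: str, entity_type: str) -> str:
    return (
        f"Mark every {entity_type} entity in the sentence by surrounding "
        f"it with @@ and ##.\n"
        f"Sentence: {sentence}\nOutput:"
    )

def build_verification_prompt(sentence: str, span: str, entity_type: str) -> str:
    # Self-verification: ask the model to confirm its own prediction,
    # filtering out spans that were mislabeled as entities.
    return (
        f"Sentence: {sentence}\n"
        f'Is "{span}" in the sentence a {entity_type} entity? '
        f"Answer yes or no.\nAnswer:"
    )

def extract_entities(llm, sentence: str, entity_type: str) -> list[str]:
    marked = llm(build_ner_prompt(sentence, entity_type))
    # Parse spans between the @@ and ## markers.
    spans, rest = [], marked
    while "@@" in rest and "##" in rest:
        _, _, rest = rest.partition("@@")
        span, _, rest = rest.partition("##")
        spans.append(span.strip())
    # Keep only spans the model itself confirms.
    return [
        s for s in spans
        if llm(build_verification_prompt(sentence, s, entity_type))
        .strip().lower().startswith("yes")
    ]
```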


3.2 Relation Extraction (RE)

Methods

  • QA4RE: solves relation extraction through a question answering (QA) framework. The relation between two entities in the given text is reformulated as a question posed to the LLM, and the relation is extracted from the model's answer.
  • GPT-RE: uses task-aware representations to make relations between entities more explicit, and incorporates logical reasoning that can explain the mapping between inputs and labels.

Process

  1. Entity identification: identify the two entities in the given text that form the relation
  2. Question generation: convert the relation between the identified entities into question form
  3. Answer generation via LLM: present the generated question to the LLM, which provides the relation type through its answer
  4. Result evaluation: analyze the generated answer to assess whether the relation was extracted correctly (see the sketch below)
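A small sketch of the QA4RE-style reformulation follows: candidate relations are verbalized as multiple-choice options so the task aligns with instruction-tuned LLMs. The relation templates and label set below are hypothetical, not the paper's.

```python
# Hypothetical verbalized templates, one per candidate relation type.
RELATION_TEMPLATES = {
    "founded_by": "{head} was founded by {tail}.",
    "employee_of": "{tail} is an employee of {head}.",
    "no_relation": "{head} has no known relation to {tail}.",
}

def build_qa4re_prompt(text: str, head: str, tail: str) -> tuple[str, list[str]]:
    options, labels = [], []
    for i, (label, tmpl) in enumerate(RELATION_TEMPLATES.items()):
        options.append(f"{chr(65 + i)}. {tmpl.format(head=head, tail=tail)}")
        labels.append(label)
    prompt = (
        f"Text: {text}\n"
        "Which of the following is most consistent with the text?\n"
        + "\n".join(options) + "\nAnswer:"
    )
    return prompt, labels

prompt, labels = build_qa4re_prompt(
    "Apple was founded by Steve Jobs in 1976.", head="Apple", tail="Steve Jobs"
)
# The LLM's answer letter maps back to a label, e.g., "A" -> labels[0].
print(prompt)
```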


3.3 Event Extraction (EE)

Methods

  • Code4Struct: performs structured prediction by translating text into code. Structural elements of the text are expressed as programming-language classes, helping the LLM better grasp the correspondence between text and structure.
  • PGAD: generates diverse prompt representations for event argument extraction, taking the extended context into account. Questions for specific argument roles are varied and coordinated with the context so that arguments are identified accurately.

Process

  1. Event trigger identification: identify the trigger word indicating an event in the given sentence
  2. Argument extraction: identify the arguments involved in the event and classify each argument's role
  3. Prompt optimization: generate specialized prompts for the event's various arguments to improve the LLM's understanding
  4. Structured output: output the extracted event and argument information in a structured form (a schema sketch follows below)
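The following sketch illustrates the Code4Struct idea of expressing an event schema as Python classes so the LLM completes a constructor call rather than free-form text. The TransportEvent schema and its fields are illustrative assumptions loosely modeled on ACE-style events, not the paper's exact schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Entity:
    span: str

@dataclass
class TransportEvent:
    """A TRANSPORT event: an artifact or person is moved between places."""
    trigger: str
    artifact: Optional[Entity] = None
    origin: Optional[Entity] = None
    destination: Optional[Entity] = None

# The prompt shows these class definitions plus the input text, and asks the
# LLM to complete a constructor call such as:
#   event = TransportEvent(trigger="arrived",
#                          artifact=Entity("Kelly"),
#                          destination=Entity("Beijing"))
# Parsing the generated call yields the trigger and arguments in structured form.
```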


3.4 Universal Information Extraction

A unified seq2seq framework is proposed for modeling multiple IE tasks. NL-LLMs unify all IE tasks under a universal natural-language schema, while Code-LLMs extract structural knowledge through code generation. A sketch contrasting the two prompt styles follows.
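As a rough illustration of the two styles, the paired prompts below paraphrase the NL (InstructUIE-like) and code (Code4UIE-like) formats; neither is the exact template from those papers.

```python
TEXT = "Steve Jobs founded Apple in California."

# NL-LLMs: one natural-language instruction schema shared across subtasks.
nl_prompt = (
    "Task: named entity recognition\n"
    "Options: person, organization, location\n"
    f"Text: {TEXT}\n"
    "Answer: list each entity as (span, type)."
)

# Code-LLMs: the same schema expressed as a class for the model to instantiate.
code_prompt = f'''
class Entity:
    """An entity with a text span and an entity type."""
    def __init__(self, span: str, entity_type: str): ...

# Extract all entities from the text below as a list of Entity objects.
# Text: {TEXT}
entities = ['''
```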



4 Learning Paradigms

  • Supervised fine-tuning: the most common approach, further training LLMs on labeled data for IE tasks
  • Few-shot: generalizing from a small number of labeled examples, via training or in-context learning (see the prompt sketch below)
  • Zero-shot: generating answers without any training examples for the specific IE task
  • Data augmentation: generating diverse data to effectively enhance training examples or information
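As a quick illustration of the few-shot paradigm via in-context learning, the sketch below prepends labeled demonstrations to a query with no gradient updates. The demonstrations and output format are made up for illustration.

```python
# Hypothetical NER demonstrations for in-context learning.
demos = [
    ("Paris is the capital of France.", "(Paris, LOC), (France, LOC)"),
    ("Tim Cook leads Apple.", "(Tim Cook, PER), (Apple, ORG)"),
]

def few_shot_prompt(query: str) -> str:
    """Build a few-shot prompt: instruction, demonstrations, then the query."""
    lines = ["Extract entities as (span, type)."]
    for text, answer in demos:
        lines.append(f"Text: {text}\nEntities: {answer}")
    lines.append(f"Text: {query}\nEntities:")
    return "\n\n".join(lines)

print(few_shot_prompt("Sundar Pichai joined Google in 2004."))
```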


5 Specific Domains

  • Multimodal: combining text-image pairs with chain-of-thought knowledge from LLMs improves performance on multimodal NER and RE.
  • Medical: explores the potential of LLMs for clinical text mining and proposes a new training approach that leverages synthetic data.


6 Evaluation & Analysis

  • The IE performance of LLMs remains questionable; recent studies address this by probing LLM capabilities on the major subtasks of NER, RE, and EE.
  • ChatGPT's overall capability is assessed by evaluating multiple IE subtasks simultaneously.

[Reference 1] Discriminative vs. Generative

This section reviews the two main approaches to information extraction (IE): discriminative and generative.

Both approaches apply to the major IE tasks: named entity recognition (NER), relation extraction (RE), and event extraction (EE).


1. Discriminative Model

A discriminative model aims to maximize the likelihood of the given data. In this approach, the conditional probability of a particular output (e.g., a tag or class) given the data is modeled directly.

Equation (1) shows an example of computing class probabilities over a set of potentially overlapping triplets for an annotated sentence \(x\).

\[p_{\text{cls}}(t_j \mid x) = p\big((s, r, o) \mid x\big)\]

In this equation, \(t_j = (s, r, o)\) is a triplet consisting of the subject entity \(s\), the relation type \(r\), and the object entity \(o\) within the given sentence \(x\).

Equation (2) describes another discriminative method that generates a tag for each position \(i\) in the sentence \(x\) via sequential tagging, annotating each word's tag sequence with the 'BIESO' scheme. The objective is to maximize the log-likelihood of the target tag sequence.

\[p_{\text{tag}}(y_i \mid x) = \frac{\exp(h_{i, y_i})}{\sum_{y'_i} \exp(h_{i, y'_i})}\]

Here, \(h_i\) is the hidden vector at position \(i\), and \(y_i\) is the tag at that position. A small numeric sketch follows.
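Below is a tiny numeric sketch of Equation (2): per-position tag probabilities computed as a softmax over tag scores \(h_{i,y}\). The scores are made-up values for illustration.

```python
import math

TAGS = ["B", "I", "E", "S", "O"]  # the BIESO tagging scheme

def tag_distribution(h_i: list[float]) -> dict[str, float]:
    """Softmax over the five tag scores at one position i."""
    z = sum(math.exp(s) for s in h_i)
    return {tag: math.exp(s) / z for tag, s in zip(TAGS, h_i)}

h_i = [2.1, 0.3, -0.5, 0.0, 1.2]  # hypothetical hidden scores at position i
print(tag_distribution(h_i))  # p_tag(y_i | x); the probabilities sum to 1
# Training maximizes sum_i log p_tag(y_i = gold tag | x).
```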


2. Generative Model

The generative approach builds a model that generates an output sequence from an input sequence. The model learns a structural interpretation of the input data and can generate outputs for new examples.

Equation (3) describes the objective of maximizing, in an auto-regressive fashion, the conditional probability of the target extraction sequence \(Y\) given the input text \(X\) and a prompt \(P\).

\[p_{\theta}(Y|X,P) = \prod_{i=1}^m p_{\theta}(y_i|X, P, y_{<i})\]

Here, \(\theta\) denotes the parameters of the LLM, which may be frozen or trainable. At each step, the model predicts the next output \(y_i\) conditioned on all previous outputs \(y_{<i}\).


1 Introduction

Information Extraction (IE) is a crucial domain in natural language processing that converts plain text into structured knowledge. IE serves as a foundational requirement for a wide range of downstream tasks, such as knowledge graph construction (Zhong et al., 2023), knowledge reasoning (Fu et al., 2019) and question answering (Srihari et al., 1999). Typical IE tasks consist of Named Entity Recognition (NER), Relation Extraction (RE) and Event Extraction (EE) (Wang et al., 2023c). Meanwhile, the emergence of large language models (LLMs) (e.g., GPT-4 (OpenAI, 2023a), Llama (Hugo et al., 2023)) has greatly promoted the development of natural language processing, due to their extraordinary capabilities in text understanding, generation, and generalization. Therefore, there has been a recent surge of interest in generative IE methods (Qi et al., 2023; Guo et al., 2023; Sainz et al., 2023) that adopt LLMs to generate structural information rather than extracting it from plain text. These methods prove to be more practical in real-world scenarios than discriminative methods (Chen et al., 2023a; Lou et al., 2023), as they efficiently handle schemas containing millions of entities without significant performance degradation (Josifoski et al., 2022).

Figure 1: LLMs have been extensively explored for generative IE. These studies encompass various learning paradigms, diverse LLM architectures, and specialized frameworks designed for a single subtask, as well as universal frameworks capable of addressing multiple subtasks simultaneously.

On the one hand, LLMs have attracted significant attention from researchers exploring their potential for various IE scenarios. In addition to excelling in individual IE tasks such as NER (Yuan et al., 2022), RE (Wan et al., 2023), and EE (Wang et al., 2023d), LLMs possess a remarkable ability to model various IE tasks in a universal format. This is done by capturing inter-task dependencies with instructive prompts, achieving consistent performance (Lu et al., 2022; Sainz et al., 2023). On the other hand, recent works have shown the outstanding generalization of LLMs: they not only learn from IE training data through fine-tuning (Paolini et al., 2021), but also extract information in few-shot and even zero-shot scenarios relying solely on in-context examples or instructions (Wei et al., 2023; Wang et al., 2023d). Existing surveys (Nasar et al., 2021; Zhou et al., 2022a; Ye et al., 2022) do not fully explore these two lines of research: 1) universal frameworks that encompass multiple tasks (Zhao et al., 2023); and 2) scenarios with deficient training data.

In this survey, we provide a comprehensive exploration of LLMs for generative IE. To achieve this, we categorize existing representative methods mainly using two taxonomies:

  • (1) a taxonomy of numerous IE subtasks, which aims to classify the different types of information that can be extracted individually or uniformly using LLMs, and
  • (2) a taxonomy of learning paradigms, which categorizes various novel approaches that utilize LLMs for generative IE.

Furthermore, we also demonstrate studies that focus on specific domains and evaluate/analyze the performance of LLMs for IE. Additionally, we compare the performance of several representative methods across various settings to gain a deeper understanding of their potential and limitations, and provide insightful analysis on the challenges and future directions of employing LLMs for generative IE. To the best of our knowledge, this is the first survey on generative IE with LLMs. The remaining part of this survey is organized as follows: We first introduce the definition of generative IE and the targets of all subtasks (Section 2). Then, in Section 3, we introduce representative models for each task and universal IE, and compare their performance. In Section 4, we summarize different learning paradigms of LLMs for IE. Additionally, we introduce works proposed for special domains in Section 5, and present recent studies that explore the ability of LLMs on IE tasks in Section 6. Finally, we propose potential research directions for future studies in Section 7. In Appendix A and B, we provide a comprehensive summary of the most commonly used LLMs and dataset statistics, as a reference for researchers.

2 Preliminaries of Generative IE

This generative IE survey primarily covers the tasks of NER, RE, and EE (Wang et al., 2023c; Sainz et al., 2023). The three types of IE tasks are formulated in a generative manner. Given an input text (e.g., sentence or document) with a sequence of \(n\) tokens \(X = [x_1, \ldots, x_n]\), a prompt \(P\), and the target extraction sequence \(Y = [y_1, \ldots, y_m]\), the objective is to maximize the conditional probability in an auto-regressive formulation:

\[\max_{\theta} \prod_{t=1}^m p(y_t \mid y_{1:t-1}, X, P; \theta)\]

where \(\theta\) denotes the parameters of LLMs, which can be frozen or trainable. In the era of LLMs, several works have proposed appending extra prompts or instructions \(P\) to \(X\) in order to enhance the comprehensibility of the task for LLMs (Wang et al., 2023c). Even though the input text \(X\) remains the same, the target sequence varies for each task:

  • Named Entity Recognition (NER) includes two tasks: Entity Identification and Entity Typing. The former task is concerned with identifying spans of entities (e.g., ‘Steve’), and the latter task focuses on assigning types to these identified entities (e.g., ‘PERSON’).
  • Relation Extraction (RE) may have different settings in different works. We categorize it using three terms following other works (Lu et al., 2022; Wang et al., 2023c): (1) Relation Classification refers to classifying the relation type between two given entities; (2) Relation Triplet refers to identifying the relation type and the corresponding head and tail entity spans; (3) Relation Strict refers to giving the correct relation type, the span, and the type of head and tail entity.
  • Event Extraction (EE) can be divided into two subtasks (Wang et al., 2022a): (1) Event Detection (also known as Event Trigger Extraction in some works) aims to identify and classify the trigger word and type that most clearly represents the occurrence of an event. (2) Event Arguments Extraction aims to identify and classify arguments from the sentences that are specific roles in the events.

3 LLMs for Different Information Extraction Tasks

We also conduct experimental analysis to evaluate the performance of various methods on representative datasets for three subtasks. Furthermore, we categorize universal frameworks into two formats: natural language (NL-LLMs based) and code language (Code-LLMs based), to discuss how they model the three distinct tasks using a unified paradigm (§3.4).

Figure 2: Taxonomy of research in generative IE using LLMs, which consists of tasks, learning paradigms, specific domain, and evaluation & analysis. The models within the sub-node of ‘Specific Domain’ node are divided by each domain. The display order of works in other leaf nodes is primarily organized chronologically.

3.1 Named Entity Recognition

Named Entity Recognition (NER) is a crucial component of IE and can be seen as a predecessor or subtask of RE and EE. It is also a fundamental task in other Natural Language Processing (NLP) tasks, thus attracting significant attention from researchers to explore new possibilities in the era of LLMs. Xia et al. (2023b) observes that standard Seq2Seq training introduces bias by allocating all probability mass to the observed sequence, and proposes a reranking-based approach within the Seq2Seq formulation that redistributes likelihood among candidate sequences using a contrastive loss, rather than augmenting data. Due to the gap between the sequence labeling nature of NER and text generation models like LLMs, GPT-NER (Wang et al., 2023b) introduces a transformation of NER into a generation task and proposes a self-verification strategy to rectify the mislabeling of NULL inputs as entities. Xie et al. (2023b) proposes a training-free self-improving framework that uses an LLM to predict on an unlabeled corpus to obtain pseudo demonstrations, thereby enhancing the performance of the LLM on zero-shot NER.

Figure 3: The comparison of prompts with NL-LLMs and Code-LLMs for Universal IE. This figure refers to InstructUIE (Wang et al., 2023c) and Code4UIE (Guo et al., 2023). Both NL and code-based methods attempt to construct a universal schema for various subtasks. However, they differ in terms of prompt format and the way they utilize the generation capabilities of LLMs. The Python subclass usually has docstrings for better explanation of the class to LLMs.

Tab. 1 shows a comparison of NER methods on five main datasets, with results obtained from the original papers. We can observe that:

  • 1) models in few-shot and zero-shot settings still show a huge performance gap compared to models in the SFT and DA settings.
  • 2) Even though there is little difference between backbones, performance varies greatly between methods under the ICL paradigm. For example, GPT-NER opens up an F1 gap of at least 6% over other methods on each dataset, and up to about 19%.
  • 3) Compared to ICL, there are only minor performance differences between models under the SFT paradigm, even though the parameter counts of their backbones can differ by up to a few hundred times.

3.2 Relation Extraction

RE also plays an important role in IE and usually has different setups in different studies, as mentioned in Section 2. To address the poor performance of LLMs on RE tasks due to the low incidence of RE in instruction-tuning datasets, as indicated in Gutiérrez et al. (2022), QA4RE (Zhang et al., 2023b) introduces a framework that enhances LLMs' performance by aligning RE tasks with QA tasks. GPT-RE (Wan et al., 2023) incorporates task-aware representations and enriches demonstrations with reasoning logic, addressing the low relevance between entities and relations and the inability to explain input-label mappings. Due to the large number of predefined relation types and uncontrolled LLMs, Li et al. (2023e) proposes to integrate an LLM with a natural language inference module to generate relation triples, enhancing document-level relation datasets.

As shown in Tab. 2 and 3, we find that uni-IE models are generally biased towards solving the harder Relation Strict problem, owing to learning the dependencies between multiple tasks (Paolini et al., 2021; Lu et al., 2022), while task-specific methods solve simpler RE subtasks (e.g., Relation Classification). In addition, compared with NER, the performance differences between models on RE are more pronounced, indicating that the potential of LLMs on the RE task still leaves considerable room for exploration.

3.3 Event Extraction

Events can be defined as specific occurrences or incidents that happen in a given context. Recently, many studies (Lu et al., 2023) aim to understand events and capture their correlations by extracting event triggers and arguments using LLMs, which is essential for various reasoning tasks (Bhagavatula et al., 2020). ClarET (Zhou et al., 2022b) undergoes three pre-training tasks to capture the correlation between events more efficiently and achieves SOTA on multiple downstream tasks. Code4Struct (Wang et al., 2023d) leverages LLMs' ability to translate text into code to tackle structured prediction tasks, using programming language features to introduce external knowledge and constraints through alignment between structure and code. Considering the interrelation between different arguments in the extended context, PGAD (Luo and Xu, 2023) employs a text diffusion model to create a variety of context-aware prompt representations, enhancing both sentence-level and document-level event argument extraction by identifying multiple role-specific argument span queries and coordinating them with the context.

We collect the experimental results from recent studies on the common EE dataset (i.e., ACE05 (Walker et al., 2006)), which is shown in Tab. 4. As can be seen from the results, the vast majority of current methods are based on the SFT paradigm, and the number of methods that use LLMs for either zero-shot or few-shot learning is small. In addition, generative methods outperform discriminative ones by a wide margin, especially in the metric of Argument Classification, indicating the great potential of generative LLMs for EE.

3.4 Universal Information Extraction

Table 2: Comparison of Micro-F1 Values for Relation Strict Extraction. † indicates that the model is discriminative.

Different IE tasks are highly diversified, with different optimization objectives and task-specific schema, resulting in the need for isolated models to handle the complexity of a large amount of IE tasks, settings, and scenarios (Lu et al., 2022). As shown in Fig. 2, many works solely focus on a subtask of IE. However, recent advancements in LLMs have led to the proposal of a unified seq2seq framework in several studies (Wang et al., 2023c; Sainz et al., 2023). This framework aims to model all IE tasks, capturing the common abilities of IE and learning the dependencies across multiple tasks. The prompt format for Uni-IE can typically be divided into natural language-based LLMs (NL-LLMs) and code-based LLMs (code-LLMs), as illustrated in Fig. 3.

NL-LLMs. NL-based methods unify all IE tasks in a universal natural language schema. For instance, UIE (Lu et al., 2022) proposes a unified text-to-structure generation framework that encodes extraction structures and captures common IE abilities through a structured extraction language. InstructUIE (Wang et al., 2023c) enhances UIE by constructing expert-written instructions for fine-tuning LLMs to consistently model different IE tasks and capture the inter-task dependency. Additionally, ChatIE (Wei et al., 2023) explores the use of LLMs like GPT-3 (Brown et al., 2020) and ChatGPT (OpenAI, 2023b) in zero-shot prompting, transforming the task into a multi-turn question-answering problem.

Code-LLMs. On the other hand, code-based methods unify IE tasks by generating code with a universal programming schema (Wang et al., 2023d). Code4UIE (Guo et al., 2023) proposes a universal retrieval-augmented code generation framework, which leverages Python classes to define schemas and uses in-context learning to generate code that extracts structural knowledge from texts. Besides, CodeKGC (Bi et al., 2023) leverages the structural knowledge inherent in code and employs schema-aware prompts and rationale-enhanced generation to improve performance. To enable LLMs to adhere to guidelines out-of-the-box, GoLLIE (Sainz et al., 2023) is proposed to enhance zero-shot performance on unseen IE tasks by fine-tuning LLMs to align with annotation guidelines.

In general, NL-LLMs are trained on a wide range of text and can understand and generate human language, which allows the prompts and instructions to be more concise and easier to design. However, NL-LLMs may struggle to produce the unnatural outputs required by IE tasks, whose distinct syntax and structure differ from the training data (Bi et al., 2023). Code, being a formalized language, possesses the inherent capability to accurately represent knowledge across diverse schemas, which makes it more suitable for structural prediction (Guo et al., 2023). But code-based methods often require a substantial amount of text to define a Python class (see Fig. 3), which in turn limits the sample size of the context. Through experimental comparison in Tab. 1, 2, and 4, we can observe that uni-IE models in the SFT setting outperform task-specific models on the NER, RE, and EE tasks for most datasets.

4 Learning Paradigms

In this section, we categorize methods based on their learning paradigms, including Supervised Fine-tuning (§4.1, refers to further training LLMs on IE tasks using labeled data), Few-shot (§4.2, refers to generalization from a small number of labeled examples via training or in-context learning), Zero-shot (§4.3, refers to generating answers without any training examples for the specific IE tasks), and Data Augmentation (§4.4, refers to enhancing information by applying various transformations to the existing data using LLMs), to highlight the commonly used approaches for adapting LLMs to IE.

4.1 Supervised Fine-tuning

Using all of the training data to fine-tune LLMs is the most common and promising approach, allowing the model to capture the underlying structural patterns in the data and generalize well to unseen IE tasks. For example, Deepstruct (Wang et al., 2022a) introduces structure pre-training on a collection of task-agnostic corpora to enhance the structural understanding of language models. UniNER (Zhou et al., 2023) explores targeted distillation and mission-focused instruction tuning to train student models for broad applications, such as NER. GIELLM (Gan et al., 2023) fine-tunes LLMs using mixed datasets, which are collected to exploit the mutual reinforcement effect and enhance performance on multiple tasks.

Table 4: Comparison of Micro-F1 Values for Event Extraction on ACE05. Evaluation tasks include: Trigger Identification (Trg-I), Trigger Classification (Trg-C), Argument Identification (Arg-I), and Argument Classification (Arg-C). † indicates that the model is discriminative.


4.2 Few-shot

Few-shot learning has access to only a limited number of labeled examples, leading to challenges like overfitting and difficulty in capturing complex relationships (Huang et al., 2020). Fortunately, scaling up the parameters of LLMs gives them remarkable generalization capabilities compared to small pre-trained models, allowing them to achieve excellent performance in few-shot settings (Li and Zhang, 2023; Ashok and Lipton, 2023). TANL, UIE, and cp-NER propose innovative approaches (e.g., the Translation between Augmented Natural Languages framework (Paolini et al., 2021), a text-to-structure generation framework (Lu et al., 2022), and Collaborative Domain-Prefix Tuning (Chen et al., 2023b)), which achieve state-of-the-art performance and demonstrate effectiveness in few-shot fine-tuning. Despite the success of LLMs, they face challenges in training-free IE because of the difference between sequence labeling and text-generation models (Gutiérrez et al., 2022). To overcome these limitations, GPT-NER (Wang et al., 2023b) introduces a self-verification strategy, while GPT-RE (Wan et al., 2023) enhances task-aware representations and incorporates reasoning logic into enriched demonstrations. These approaches demonstrate how to effectively leverage the capabilities of GPT for in-context learning. CODEIE (Li et al., 2023f) and CodeKGC (Bi et al., 2023) show that converting IE tasks into code generation tasks with code-style prompts and in-context examples leads to superior performance compared to NL-LLMs, because code-style prompts provide a more effective representation of structured output, enabling the models to handle the complex dependencies in natural language.

4.3 Zero-shot

The main challenges in zero-shot learning lie in enabling the model to effectively generalize to tasks and domains it has not been trained on, as well as aligning with the pre-training paradigm of LLMs. Due to the large amount of knowledge embedded within them, LLMs show impressive abilities in zero-shot scenarios on unseen tasks (Kojima et al., 2022; Wei et al., 2023). To achieve zero-shot cross-domain generalization of LLMs in IE tasks, several works have been proposed (Wang et al., 2022a; Sainz et al., 2023; Zhou et al., 2023; Wang et al., 2023c). These works offer a universal framework for modeling various IE tasks and domains, and introduce innovative training prompts (e.g., instructions (Wang et al., 2023c) and guidelines (Sainz et al., 2023)) for learning and capturing the inter-task dependencies of known tasks and generalizing them to unseen tasks and domains. In terms of cross-type generalization, BART-Gen (Li et al., 2021a) proposes a document-level neural model that formulates the EE task as conditional generation, resulting in better performance and excellent portability on unseen event types.

On the other hand, to improve the ability of LLMs under zero-shot prompts (without further fine-tuning on IE tasks), QA4RE (Zhang et al., 2023b) and ChatIE (Wei et al., 2023) propose to improve the performance of LLMs (like FLAN-T5 (Chung et al., 2022) and GPT (OpenAI, 2023a)) on zero-shot IE tasks by transforming IE into a multi-turn question-answering problem, aligning IE tasks with QA tasks. Li et al. (2023b) integrates the chain-of-thought approach and proposes summarize-and-ask prompting to address the challenge of ensuring the reliability of outputs from black-box LLMs (Ma et al., 2023c; Wang et al., 2023c).

Figure 4: Comparison of different data augmentation methods.

4.4 Data Augmentation

Data augmentation involves generating meaningful and diverse data to effectively enhance the training examples or information, while avoiding the introduction of unrealistic, misleading, and offset patterns. Recent powerful LLMs also demonstrate remarkable performance in data generation tasks (Whitehouse et al., 2023), which has attracted the attention of many researchers using LLMs to generate synthetic data for IE. This can be roughly divided into three strategies, as shown in Fig. 4.

Data Annotation. This strategy directly generates labeled data using LLMs. For instance, Zhang et al. (2023c) proposes LLMaAA to improve accuracy and data efficiency by employing LLMs as annotators within an active learning loop, thereby optimizing both the annotation and training processes. AugURE (Wang et al., 2023a) employs within-sentence pairs augmentation and cross-sentence pairs extraction to enhance the diversity of positive pairs for unsupervised RE, and introduces margin loss for sentence pairs.

Knowledge Retrieval. This strategy retrieves relevant knowledge from LLMs for IE. PGIM (Li et al., 2023d) presents a two-stage framework for Multimodal NER, which leverages ChatGPT as an implicit knowledge base to heuristically retrieve auxiliary knowledge for more efficient entity prediction. Amalvy et al. (2023) proposes to improve NER on long documents by generating a synthetic context retrieval training dataset and training a neural context retriever.

Inverse Generation. This strategy prompts LLMs to produce natural text or questions from the structural data provided as input, aligning with the training paradigm of LLMs (a sketch follows below). For example, SynthIE (Josifoski et al., 2023) shows that LLMs can create high-quality synthetic data for complex tasks by reversing the task direction. The authors used this approach to create a large dataset for closed information extraction and trained new models that outperformed previous benchmarks, demonstrating the potential of LLMs for generating synthetic data for complex tasks. Rather than relying on ground-truth targets, which limits generalizability and scalability, STAR (Ma et al., 2023b) generates structures from valid triggers and arguments, then generates passages with LLMs.
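To give a concrete feel for the inverse-generation strategy, here is a minimal sketch that prompts an LLM to write a passage from given triples, yielding synthetic (text, structure) pairs. The prompt wording and the generic `llm` callable are assumptions, not SynthIE's actual pipeline.

```python
def inverse_generation_prompt(triples: list[tuple[str, str, str]]) -> str:
    """Verbalize structured triples into an instruction for the LLM."""
    facts = "; ".join(f"({s}, {r}, {o})" for s, r, o in triples)
    return (
        "Write a short, natural paragraph that expresses exactly these "
        f"facts and nothing else: {facts}"
    )

def synthesize_pair(llm, triples):
    """Return a synthetic IE training pair: generated text plus its gold structure."""
    text = llm(inverse_generation_prompt(triples))
    return {"text": text, "triples": triples}

# Example: synthesize_pair(llm, [("Marie Curie", "born_in", "Warsaw")])
```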

Overall, these strategies have their own advantages and disadvantages. While data annotation can directly meet task requirements, the ability of LLMs for structured generation still needs improvement. Knowledge retrieval can provide additional information about entities and relations, but it suffers from the hallucination problem and introduces noise. Inverse generation is aligned with the QA paradigm of LLMs. However, it requires structural data and there exists a gap between the generated pairs and the domain that needs to be addressed.

5 Specific Domain

LLMs have tremendous, non-negligible potential for extracting information in specific domains, such as multimodal (Chen and Feng, 2023; Li et al., 2023d), medical (Tang et al., 2023; Ma et al., 2023a), and scientific (Dunn et al., 2022; Cheung et al., 2023) information. For example:

Multimodal. Chen and Feng (2023) introduces a conditional prompt distillation method that enhances a model's reasoning ability by combining text-image pairs with chain-of-thought knowledge from LLMs, significantly improving performance in multimodal NER and multimodal RE.

Medical. Tang et al. (2023) explores the potential of LLMs in the field of clinical text mining and proposes a novel training approach, which leverages synthetic data, to enhance performance and address privacy issues.

Scientific. Dunn et al. (2022) presents a sequence-to-sequence approach using GPT-3 for joint NER and RE from complex scientific text, demonstrating its effectiveness in extracting complex scientific knowledge in materials chemistry.

6 Evaluation & Analysis

Despite the great success of LLMs in various natural language processing tasks, their performance on information extraction is still questionable (Han et al., 2023). To alleviate this problem, recent research has explored the capabilities of LLMs with respect to the major subtasks of IE (i.e., NER (Xie et al., 2023a; Li and Zhang, 2023), RE (Wadhwa et al., 2023; Yuan et al., 2023), and EE (Gao et al., 2023)). Considering the superior reasoning capabilities of LLMs, Xie et al. (2023a) proposes four reasoning strategies for NER, designed to stimulate ChatGPT's potential on zero-shot NER. Wadhwa et al. (2023) explores the use of LLMs for RE and finds that few-shot prompting with GPT-3 achieves near-SOTA performance, while Flan-T5 can be improved with chain-of-thought style explanations generated via GPT-3. For EE tasks, Gao et al. (2023) shows that ChatGPT still struggles, due to the need for complex instructions and a lack of robustness.

Along this line, some researchers perform a more comprehensive analysis of LLMs by evaluating multiple IE subtasks simultaneously. Li et al. (2023a) evaluates ChatGPT’s overall ability on IE, including performance, explainability, calibration, and faithfulness. They find that ChatGPT mostly performs worse than BERT-based models in the standard IE setting, but excellently in the OpenIE setting. Furthermore, Han et al. (2023) introduces a soft-matching strategy for a more precise evaluation and identifies “unannotated spans” as the predominant error type, highlighting potential issues with data annotation quality.

7 Future Directions

The development of LLM-based generative IE systems is still in its early stages, and there are numerous opportunities for improvement.

Universal IE. Previous generative IE methods and benchmarks are often tailored to specific domains or tasks, limiting their generalizability (Yuan et al., 2022). Although some unified methods (Lu et al., 2022) using LLMs have been proposed recently, they still suffer from certain limitations (e.g., long context input and misalignment of structured output). Therefore, further development of universal IE frameworks that can adapt flexibly to different domains and tasks is a promising research direction (such as integrating the insights of task-specific models to assist in constructing universal models).

Low-Resource IE. Generative IE systems with LLMs still encounter challenges in resource-limited scenarios (Li et al., 2023a). Based on our summary, there is a need for further exploration of in-context learning with LLMs, particularly regarding improved selection of examples. Future research should prioritize the development of robust cross-domain learning techniques (Wang et al., 2023c), such as domain adaptation or multi-task learning, to leverage knowledge from resource-rich domains. Additionally, efficient data annotation strategies with LLMs should also be explored.

Prompt Design for IE. Designing effective instructions is considered to have a significant impact on the performance of LLMs (Qiao et al., 2022; Yin et al., 2023). One aspect of prompt design is to build input and output pairs that can better align with the pre-training stage of LLMs (e.g., code generation) (Guo et al., 2023). Another aspect is optimizing the prompt for better model understanding and reasoning (e.g., Chain-of-Thought) (Li et al., 2023b), by encouraging LLMs to make logical inferences or explainable generation. Additionally, researchers can explore interactive prompt design (such as multi-turn QA) (Zhang et al., 2023b), where LLMs can iteratively refine or provide feedback on the generated extractions automatically.

Open IE. The Open IE setting presents greater challenges for IE models, as they do not provide any candidate label set and rely solely on the models’ ability to comprehend the task. LLMs, with their knowledge and understanding abilities, have significant advantages in some Open IE tasks (Zhou et al., 2023). However, there are still instances of poor performance in more challenging tasks (Qi et al., 2023; Li et al., 2023a), which require further exploration by researchers.

8 Conclusion

In this survey, we focus on reviewing existing studies that utilize LLMs for various generative IE tasks. We first introduce the subtasks of IE and discuss some universal frameworks aiming to unify all IE tasks. Additional theoretical and experimental analysis provides insightful exploration for these methods. Then we delve into different learning paradigms that apply LLMs for IE and demonstrate their potential for extracting information in specific domains. We also introduce some studies for evaluation purposes. Finally, we analyze the current challenges and present potential future directions. We hope this survey can provide a valuable resource for researchers to explore more efficient utilization of LLMs for IE.
