[Figure 2 contains a well-organized tree-style diagram]
Contents
1 Introduction
2 Fundamentals of Generative IE
Objective: maximize the conditional probability of the target extraction sequence given the input text.
\[\max_{\theta} \prod_{t=1}^m p(y_t \mid y_{1:t-1}, X, P; \theta)\]where \(\theta\) denotes the parameters of the LLM, which can be trainable or frozen.
Types of IE tasks
3 LLMs for Different Information Extraction Tasks
3.1 Named Entity Recognition (NER)
Methods
Process
3.2 Relation Extraction (RE)
Methods
Process
3.3 Event Extraction (EE)
Methods
Process
3.4 Universal Information Extraction
Proposes a unified seq2seq framework for modeling multiple IE tasks. NL-LLMs unify all IE tasks under a universal natural-language schema, while Code-LLMs extract structural knowledge through code generation.
4 Learning Paradigms
5 Specific Domains
6 Evaluation and Analysis
[Reference 1] Discriminative vs. Generative
We review the two main approaches to information extraction (IE): discriminative and generative.
Both approaches are applied to the core IE tasks: Named Entity Recognition (NER), Relation Extraction (RE), and Event Extraction (EE).
1. Discriminative Model
The goal of a discriminative model is to maximize the likelihood of the given data. It directly models the conditional probability of a particular output (e.g., a tag or class) given the input.
Equation (1) shows an example of computing the class probability of a set of potentially overlapping triples for an annotated sentence \(x\).
\[p_{\text{cls}}(t_j \mid x) = p((s, r, o) \mid x)\]where \(t_j = (s, r, o)\) is a triple in the given sentence \(x\), consisting of the subject entity \(s\), the relation type \(r\), and the object entity \(o\).
Equation (2) describes another discriminative method that performs sequential tagging, generating a tag for each position \(i\) in the sentence \(x\). The tag sequence of each word is annotated with the 'BIESO' scheme, and the objective is to maximize the log-likelihood of the target tag sequence.
\[p_{\text{tag}}(y_i \mid x) = \frac{\exp(h_{i, y_i})}{\sum_{y'_i} \exp(h_{i, y'_i})}\]where \(h_i\) is the hidden vector at position \(i\) and \(y_i\) is the tag at that position.
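To make the tagging formulation concrete, here is a minimal PyTorch sketch (the tag set, the `TaggingHead` name, and the encoder choice are illustrative assumptions, not something the survey prescribes) of a discriminative head that applies the per-position softmax of Equation (2) and is trained to maximize the log-likelihood of the gold tag sequence:

```python
import torch
import torch.nn as nn

# Illustrative tag set following the BIESO scheme described above.
TAGS = ["B-PER", "I-PER", "E-PER", "S-PER", "O"]

class TaggingHead(nn.Module):
    """Discriminative head: p_tag(y_i | x) = softmax over tags of h_i at each position i."""
    def __init__(self, hidden_size: int, num_tags: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, num_tags)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) from any encoder (e.g., BERT).
        logits = self.proj(hidden_states)         # (batch, seq_len, num_tags)
        return torch.log_softmax(logits, dim=-1)  # log p_tag(y_i | x)

def tagging_loss(log_probs: torch.Tensor, gold_tags: torch.Tensor) -> torch.Tensor:
    # Negative log-likelihood of the gold tag sequence; minimizing it maximizes Eq. (2).
    # log_probs: (batch, seq_len, num_tags); gold_tags: (batch, seq_len)
    return nn.functional.nll_loss(log_probs.transpose(1, 2), gold_tags)
```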
2. Generative Model
The generative approach builds a model that generates an output sequence from an input sequence. The model learns a structured interpretation of the input and can generate outputs for new examples.
Equation (3) describes the objective of maximizing, in an auto-regressive formulation, the conditional probability of the target extraction sequence \(Y\) given the input text \(X\) and a prompt \(P\).
\[p_{\theta}(Y \mid X, P) = \prod_{i=1}^m p_{\theta}(y_i \mid X, P, y_{<i})\]where \(\theta\) denotes the parameters of the LLM, which can be frozen or trainable. At each step the model predicts the next output \(y_i\) conditioned on all previous outputs \(y_{<i}\).
Information Extraction (IE) is a crucial domain in natural language processing that converts plain text into structured knowledge. IE serves as a foundational requirement for a wide range of downstream tasks, such as knowledge graph construction (Zhong et al., 2023), knowledge reasoning (Fu et al., 2019) and question answering (Srihari et al., 1999). Typical IE tasks consist of Named Entity Recognition (NER), Relation Extraction (RE) and Event Extraction (EE) (Wang et al., 2023c). Meanwhile, the emergence of large language models (LLMs) (e.g., GPT-4 (OpenAI, 2023a), Llama (Hugo et al., 2023)) has greatly promoted the development of natural language processing, due to their extraordinary capabilities in text understanding, generation, and generalization. Therefore, there has been a recent surge of interest in generative IE methods (Qi et al., 2023; Guo et al., 2023; Sainz et al., 2023) that adopt LLMs to generate structural information rather than extracting structural information from plain text. These methods prove to be more practical in real-world scenarios compared to discriminative methods (Chen et al., 2023a; Lou et al., 2023), as they efficiently handle schemas containing millions of entities without significant performance degradation (Josifoski et al., 2022).
Figure 1: LLMs have been extensively explored for generative IE. These studies encompass various learning paradigms, diverse LLM architectures, and specialized frameworks designed for a single subtask, as well as universal frameworks capable of addressing multiple subtasks simultaneously.
On the one hand, LLMs have attracted significant attention from researchers exploring their potential across various IE task scenarios. In addition to excelling in individual IE tasks such as NER (Yuan et al., 2022), RE (Wan et al., 2023), and EE (Wang et al., 2023d), LLMs possess a remarkable ability to model various IE tasks effectively in a universal format. This is achieved by capturing inter-task dependencies with instructive prompts, yielding consistent performance (Lu et al., 2022; Sainz et al., 2023). On the other hand, recent works have shown the outstanding generalization of LLMs: they can not only learn from IE training data through fine-tuning (Paolini et al., 2021), but also extract information in few-shot and even zero-shot scenarios relying solely on in-context examples or instructions (Wei et al., 2023; Wang et al., 2023d). Existing surveys (Nasar et al., 2021; Zhou et al., 2022a; Ye et al., 2022) do not fully explore these two lines of research: 1) universal frameworks that encompass multiple tasks (Zhao et al., 2023), and 2) scenarios with deficient training data.
In this survey, we provide a comprehensive exploration of LLMs for generative IE. To achieve this, we categorize existing representative methods mainly using two taxonomies: a taxonomy of IE subtasks (NER, RE, EE, and universal IE; Section 3) and a taxonomy of learning paradigms (supervised fine-tuning, few-shot, zero-shot, and data augmentation; Section 4).
Furthermore, we also present studies that focus on specific domains and that evaluate and analyze the performance of LLMs for IE. Additionally, we compare the performance of several representative methods across various settings to gain a deeper understanding of their potential and limitations, and provide insightful analysis on the challenges and future directions of employing LLMs for generative IE. To the best of our knowledge, this is the first survey on generative IE with LLMs. The remaining part of this survey is organized as follows: We first introduce the definition of generative IE and the targets of all subtasks (Section 2). Then, in Section 3, we introduce representative models for each task and for universal IE, and compare their performance. In Section 4, we summarize the different learning paradigms of LLMs for IE. Additionally, we introduce works proposed for special domains in Section 5, and present recent studies that explore the ability of LLMs on IE tasks in Section 6. Finally, we propose potential research directions for future studies in Section 7. In Appendix A and B, we provide a comprehensive summary of the most commonly used LLMs and dataset statistics, as a reference for researchers.
This generative IE survey primarily covers the tasks of NER, RE, and EE (Wang et al., 2023c; Sainz et al., 2023). The three types of IE tasks are formulated in a generative manner. Given an input text (e.g., sentence or document) with a sequence of \(n\) tokens \(X = [x_1, \ldots, x_n]\), a prompt \(P\), and the target extraction sequence \(Y = [y_1, \ldots, y_m]\), the objective is to maximize the conditional probability in an auto-regressive formulation:
\[\max_{\theta} \prod_{t=1}^m p(y_t \mid y_{1:t-1}, X, P; \theta)\]where \(\theta\) denotes the parameters of LLMs, which can be frozen or trainable. In the era of LLMs, several works have proposed appending extra prompts or instructions \(P\) to \(X\) in order to enhance the comprehensibility of the task for LLMs (Wang et al., 2023c). Even though the input text \(X\) remains the same, the target sequence \(Y\) varies for each task: entity spans with their types for NER, relation triples (subject, relation, object) for RE, and event triggers with their arguments for EE.
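As a concrete illustration of this objective, the sketch below scores a target extraction sequence \(Y\) given \(X\) and \(P\) with a causal LM, masking the prompt tokens out of the loss; the model name, prompt wording, and triple format are placeholders rather than the setup of any particular surveyed method:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; in practice any causal LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

X = "Steve Jobs founded Apple in 1976."                                 # input text
P = "Extract all (subject, relation, object) triples from the text:\n"  # prompt
Y = " (Steve Jobs, founded, Apple)"                                     # target sequence

prompt_ids = tok(P + X, return_tensors="pt").input_ids
target_ids = tok(Y, return_tensors="pt").input_ids
input_ids = torch.cat([prompt_ids, target_ids], dim=1)

# Mask the prompt positions with -100 so the loss covers only the target tokens,
# i.e., the mean of -log p(y_t | y_{1:t-1}, X, P) over the target sequence.
labels = input_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

with torch.no_grad():
    out = model(input_ids, labels=labels)
print("avg. negative log-likelihood of Y:", out.loss.item())
```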
We also conduct experimental analysis to evaluate the performance of various methods on representative datasets for three subtasks. Furthermore, we categorize universal frameworks into two formats: natural language (NL-LLMs based) and code language (Code-LLMs based), to discuss how they model the three distinct tasks using a unified paradigm (§3.4).
Figure 2: Taxonomy of research in generative IE using LLMs, which consists of tasks, learning paradigms, specific domains, and evaluation & analysis. The models within the sub-nodes of the ‘Specific Domain’ node are grouped by domain. The display order of works in the other leaf nodes is primarily chronological.
Named Entity Recognition (NER) is a crucial component of IE and can be seen as a predecessor or subtask of RE and EE. It is also a fundamental task in other Natural Language Processing (NLP) tasks, thus attracting significant attention from researchers to explore new possibilities in the era of LLMs. Noting that standard Seq2Seq training introduces bias by allocating all probability mass to the observed sequence, Xia et al. (2023b) proposes a reranking-based approach within the Seq2Seq formulation that redistributes likelihood among candidate sequences using a contrastive loss, rather than augmenting data. Due to the gap between the sequence labeling nature of NER and text generation models like LLMs, GPT-NER (Wang et al., 2023b) transforms NER into a generation task and proposes a self-verification strategy to rectify the mislabeling of NULL inputs as entities. Xie et al. (2023b) proposes a training-free self-improving framework that uses an LLM to predict on an unlabeled corpus to obtain pseudo demonstrations, thereby enhancing the performance of the LLM on zero-shot NER.
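To illustrate the NER-as-generation idea, here is a hedged sketch of prompt construction loosely following the GPT-NER strategy described above; the marker symbols, wording, and helper names are illustrative, not the exact prompts from the paper:

```python
def build_ner_prompt(text: str, entity_type: str, demos: list[tuple[str, str]]) -> str:
    """Cast NER as generation: the model rewrites the sentence with entity spans
    wrapped in marker symbols (here @@ ... ##)."""
    lines = [f"Task: mark every {entity_type} entity in the input with @@ and ##."]
    for src, marked in demos:
        lines += [f"Input: {src}", f"Output: {marked}"]
    lines += [f"Input: {text}", "Output:"]
    return "\n".join(lines)

def build_verification_prompt(text: str, span: str, entity_type: str) -> str:
    """Self-verification: ask whether an extracted span is really an entity,
    to filter out NULL inputs mislabeled as entities."""
    return (f"Sentence: {text}\n"
            f'Is "{span}" a {entity_type} entity in this sentence? Answer yes or no.')

demos = [("Barack Obama visited Paris.", "@@Barack Obama## visited Paris.")]
print(build_ner_prompt("Tim Cook leads Apple.", "person", demos))
```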
Figure 3: The comparison of prompts with NL-LLMs and Code-LLMs for Universal IE. This figure refers to InstructUIE (Wang et al., 2023c) and Code4UIE (Guo et al., 2023). Both NL and code-based methods attempt to construct a universal schema for various subtasks. However, they differ in terms of prompt format and the way they utilize the generation capabilities of LLMs. The Python subclass usually has docstrings for better explanation of the class to LLMs.
Tab. 1 shows the comparison of NER methods on five main datasets; the results are obtained from the original papers. We can observe that:
RE also plays an important role in IE and usually has different setups in different studies, as mentioned in Section 2. To address the poor performance of LLMs on RE tasks due to the low incidence of RE in instruction-tuning datasets, as indicated in Gutiérrez et al. (2022), QA4RE (Zhang et al., 2023b) introduces a framework that enhances LLMs’ performance by aligning RE tasks with QA tasks. GPT-RE (Wan et al., 2023) incorporates task-aware representations and enriches demonstrations with reasoning logic to address the low relevance between entities and relations and the inability to explain input-label mappings. Due to the large number of predefined relation types and the difficulty of controlling LLMs, Li et al. (2023e) proposes to integrate an LLM with a natural language inference module to generate relation triples, enhancing document-level relation datasets.
As shown in Tab. 2 and 3, we observe that Uni-IE models are generally biased towards solving the harder Relation Strict problems because they learn the dependencies between multiple tasks (Paolini et al., 2021; Lu et al., 2022), while task-specific methods address simpler RE subtasks (e.g., Relation Classification). In addition, compared with NER, the performance differences between models are more pronounced for RE, indicating that the potential of LLMs on the RE task still leaves considerable room for exploration.
Events can be defined as specific occurrences or incidents that happen in a given context. Recently, many studies (Lu et al., 2023) aim to understand events and capture their correlations by extracting event triggers and arguments using LLMs, which is essential for various reasoning tasks (Bhagavatula et al., 2020). ClarET (Zhou et al., 2022b) undergoes three pre-training tasks to capture the correlation between events more efficiently and achieves SOTA on multiple downstream tasks. Code4Struct (Wang et al., 2023d) leverages LLMs’ ability to translate text into code to tackle structured prediction tasks, using programming language features to introduce external knowledge and constraints through alignment between structure and code. Considering the interrelation between different arguments in the extended context, PGAD (Luo and Xu, 2023) employs a text diffusion model to create a variety of context-aware prompt representations, enhancing both sentence-level and document-level event argument extraction by identifying multiple role-specific argument span queries and coordinating them with the context.
We collect the experimental results from recent studies on the common EE dataset (i.e., ACE05 (Walker et al., 2006)), which is shown in Tab. 4. As can be seen from the results, the vast majority of current methods are based on the SFT paradigm, and the number of methods that use LLMs for either zero-shot or few-shot learning is small. In addition, generative methods outperform discriminative ones by a wide margin, especially in the metric of Argument Classification, indicating the great potential of generative LLMs for EE.
Table 2: Comparison of Micro-F1 Values for Relation Strict Extraction. † indicates that the model is discriminative.
Different IE tasks are highly diversified, with different optimization objectives and task-specific schemas, resulting in the need for isolated models to handle the complexity of a large number of IE tasks, settings, and scenarios (Lu et al., 2022). As shown in Fig. 2, many works solely focus on a single subtask of IE. However, recent advancements in LLMs have led to the proposal of a unified seq2seq framework in several studies (Wang et al., 2023c; Sainz et al., 2023). This framework aims to model all IE tasks, capturing the common abilities of IE and learning the dependencies across multiple tasks. The prompt format for Uni-IE can typically be divided into natural language-based LLMs (NL-LLMs) and code-based LLMs (code-LLMs), as illustrated in Fig. 3.
NL-LLMs. NL-based methods unify all IE tasks in a universal natural language schema. For instance, UIE (Lu et al., 2022) proposes a unified text-to-structure generation framework that encodes extraction structures and captures common IE abilities through a structured extraction language. InstructUIE (Wang et al., 2023c) enhances UIE by constructing expert-written instructions for fine-tuning LLMs to consistently model different IE tasks and capture the inter-task dependency. Additionally, ChatIE (Wei et al., 2023) explores the use of LLMs like GPT-3 (Brown et al., 2020) and ChatGPT (OpenAI, 2023b) in zero-shot prompting, transforming the task into a multi-turn question-answering problem.
Code-LLMs. On the other hand, code-based methods unify IE tasks by generating code with a universal programming schema (Wang et al., 2023d). Code4UIE (Guo et al., 2023) proposes a universal retrieval-augmented code generation framework, which leverages Python classes to define schemas and uses in-context learning to generate code that extracts structural knowledge from texts. Besides, CodeKGC (Bi et al., 2023) leverages the structural knowledge inherent in code and employs schema-aware prompts and rationale-enhanced generation to improve performance. To enable LLMs to adhere to guidelines out-of-the-box, GoLLIE (Sainz et al., 2023) enhances zero-shot performance on unseen IE tasks by fine-tuning LLMs to align with annotation guidelines.
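The two prompt styles can be illustrated with a minimal sketch; the class name, docstring, relation labels, and instruction wording below are invented for illustration and are not taken from InstructUIE or Code4UIE:

```python
# Natural-language-style prompt (NL-LLMs): the schema is described in plain text.
nl_prompt = (
    "Instruction: extract all relation triples of the form (head, relation, tail).\n"
    "Options: founded_by, located_in, works_for\n"
    "Text: Steve Jobs founded Apple in Cupertino.\n"
    "Answer:"
)

# Code-style prompt (Code-LLMs): the schema is a Python class with a docstring,
# and the LLM is asked to complete a list of class instantiations.
code_prompt = '''
class Relation:
    """A relation triple extracted from the text."""
    def __init__(self, head: str, relation: str, tail: str):
        self.head = head
        self.relation = relation
        self.tail = tail

# Text: "Steve Jobs founded Apple in Cupertino."
relations = [
    Relation(head="Apple", relation="founded_by", tail="Steve Jobs"),
'''
```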
In general, NL-LLMs are trained on a wide range of text and can understand and generate human language, which allows the prompts and instructions to be more concise and easier to design. However, IE outputs have a distinct syntax and structure that differs from the natural-language training data, so NL-LLMs may struggle to produce such unnatural outputs (Bi et al., 2023). Code, being a formalized language, possesses the inherent capability to accurately represent knowledge across diverse schemas, which makes it more suitable for structural prediction (Guo et al., 2023). But code-based methods often require a substantial amount of text to define a Python class (see Fig. 3), which in turn limits the sample size of the context. Through experimental comparison in Tab. 1, 2, and 4, we can observe that Uni-IE models under the SFT setting outperform task-specific models on the NER, RE, and EE tasks for most datasets.
In this section, we categorize methods based on their learning paradigms, including Supervised Fine-tuning (§4.1, which refers to further training LLMs on IE tasks using labeled data), Few-shot (§4.2, which refers to generalizing from a small number of labeled examples via training or in-context learning), Zero-shot (§4.3, which refers to generating answers without any training examples for the specific IE task), and Data Augmentation (§4.4, which refers to enhancing information by applying various transformations to the existing data using LLMs), to highlight the commonly used approaches for adapting LLMs to IE.
Fine-tuning LLMs on all of the training data is the most common and promising approach, as it allows the model to capture the underlying structural patterns in the data and generalize well to unseen IE tasks. For example, Deepstruct (Wang et al., 2022a) introduces structure pre-training on a collection of task-agnostic corpora to enhance the structural understanding of language models. UniNER (Zhou et al., 2023) explores targeted distillation and mission-focused instruction tuning to train student models for broad applications, such as NER. GIELLM (Gan et al., 2023) fine-tunes LLMs using mixed datasets, which are collected to exploit the mutual reinforcement effect and enhance performance on multiple tasks.
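As a rough illustration of how labeled IE data is cast into instruction-style records for supervised fine-tuning (the field names and output format are assumptions; each of the above works uses its own schema):

```python
import json

def ner_example_to_sft_record(sentence: str, entities: list[dict]) -> dict:
    """Turn one annotated NER sentence into an (instruction, input, output) record."""
    output = "; ".join(f"{e['type']}: {e['span']}" for e in entities)
    return {
        "instruction": "List every named entity in the text as 'type: span'.",
        "input": sentence,
        "output": output if output else "None",
    }

record = ner_example_to_sft_record(
    "Steve Jobs founded Apple in Cupertino.",
    [{"type": "person", "span": "Steve Jobs"},
     {"type": "organization", "span": "Apple"},
     {"type": "location", "span": "Cupertino"}],
)
print(json.dumps(record, indent=2))
```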
Table 4: Comparison of Micro-F1 Values for Event Extraction on ACE05. Evaluation tasks include: Trigger Identification (Trg-I), Trigger Classification (Trg-C), Argument Identification (Arg-I), and Argument Classification (Arg-C). † indicates that the model is discriminative.
Few-shot learning has access to only a limited number of labeled examples, leading to challenges like overfitting and difficulty in capturing complex relationships (Huang et al., 2020). Fortunately, scaling up the parameters of LLMs gives them remarkable generalization capabilities compared to small pre-trained models, allowing them to achieve excellent performance in few-shot settings (Li and Zhang, 2023; Ashok and Lipton, 2023). TANL, UIE, and cp-NER propose innovative approaches (e.g., the Translation between Augmented Natural Languages framework (Paolini et al., 2021), the text-to-structure generation framework (Lu et al., 2022), and Collaborative Domain-Prefix Tuning (Chen et al., 2023b)), which achieve state-of-the-art performance and demonstrate effectiveness in few-shot fine-tuning. Despite the success of LLMs, they face challenges in training-free IE because of the difference between sequence labeling and text-generation models (Gutiérrez et al., 2022). To overcome these limitations, GPT-NER (Wang et al., 2023b) introduces a self-verification strategy, while GPT-RE (Wan et al., 2023) enhances task-aware representations and incorporates reasoning logic into enriched demonstrations. These approaches demonstrate how to effectively leverage the capabilities of GPT for in-context learning. CODEIE (Li et al., 2023f) and CodeKGC (Bi et al., 2023) show that converting IE tasks into code generation tasks with code-style prompts and in-context examples leads to superior performance compared to NL-LLMs, because code-style prompts provide a more effective representation of structured output, enabling the models to handle the complex dependencies in natural language.
The main challenges in zero-shot learning lie in enabling the model to effectively generalize to tasks and domains it has not been trained on, as well as aligning with the pre-training paradigm of LLMs. Due to the large amount of knowledge embedded within them, LLMs show impressive abilities in zero-shot scenarios on unseen tasks (Kojima et al., 2022; Wei et al., 2023). To achieve zero-shot cross-domain generalization of LLMs on IE tasks, several works have been proposed (Wang et al., 2022a; Sainz et al., 2023; Zhou et al., 2023; Wang et al., 2023c). These works offer a universal framework for modeling various IE tasks and domains, and introduce innovative training prompts (e.g., instructions (Wang et al., 2023c) and guidelines (Sainz et al., 2023)) for learning and capturing the inter-task dependencies of known tasks and generalizing them to unseen tasks and domains. In terms of cross-type generalization, BART-Gen (Li et al., 2021a) proposes a document-level neural model, formulating the EE task as conditional generation, resulting in better performance and excellent portability on unseen event types. On the other hand, to improve the ability of LLMs under zero-shot prompts (without further fine-tuning on IE tasks), QA4RE (Zhang et al., 2023b) and ChatIE (Wei et al., 2023) propose to improve the performance of LLMs (such as FLAN-T5 (Chung et al., 2022) and GPT (OpenAI, 2023a)) on zero-shot IE tasks by transforming IE into a multi-turn question-answering problem, aligning IE tasks with QA tasks. Li et al. (2023b) integrates the chain-of-thought approach and proposes summarize-and-ask prompting to address the challenge of ensuring the reliability of outputs from black-box LLMs (Ma et al., 2023c; Wang et al., 2023c).
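A hedged sketch of the multi-turn QA idea behind ChatIE and QA4RE; the two-stage split and the turn wording are illustrative and do not reproduce the exact prompts of those papers:

```python
def stage1_prompt(text: str, relation_types: list[str]) -> str:
    # Turn 1: ask which relation types are present in the text at all.
    return (f"Text: {text}\n"
            f"Which of the following relation types appear in the text? {relation_types}")

def stage2_prompt(text: str, relation_type: str) -> str:
    # Turn 2: for each detected type, ask for the concrete argument pairs.
    return (f"Text: {text}\n"
            f"For the relation '{relation_type}', list all (head entity, tail entity) pairs.")

text = "Steve Jobs founded Apple in Cupertino."
print(stage1_prompt(text, ["founded_by", "located_in", "works_for"]))
# ...feed the model's answer back, then issue stage2_prompt for each detected type.
```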
Figure 4: Comparison of different data augmentation methods.
Data augmentation involves generating meaningful and diverse data to effectively enhance the training examples or information, while avoiding the introduction of unrealistic, misleading, and offset patterns. Recent powerful LLMs also demonstrate remarkable performance in data generation tasks (Whitehouse et al., 2023), which has attracted many researchers to use LLMs to generate synthetic data for IE. The approaches can be roughly divided into three strategies, as shown in Fig. 4.

Data Annotation. This strategy directly generates labeled data using LLMs. For instance, Zhang et al. (2023c) proposes LLMaAA to improve accuracy and data efficiency by employing LLMs as annotators within an active learning loop, thereby optimizing both the annotation and training processes. AugURE (Wang et al., 2023a) employs within-sentence pair augmentation and cross-sentence pair extraction to enhance the diversity of positive pairs for unsupervised RE, and introduces a margin loss for sentence pairs.

Knowledge Retrieval. This strategy retrieves relevant knowledge from LLMs for IE. PGIM (Li et al., 2023d) presents a two-stage framework for multimodal NER, which leverages ChatGPT as an implicit knowledge base to heuristically retrieve auxiliary knowledge for more efficient entity prediction. Amalvy et al. (2023) proposes to improve NER on long documents by generating a synthetic context-retrieval training dataset and training a neural context retriever.

Inverse Generation. This strategy prompts LLMs to produce natural text or questions from the structural data provided as input, aligning with the training paradigm of LLMs. For example, SynthIE (Josifoski et al., 2023) shows that LLMs can create high-quality synthetic data for complex tasks by reversing the task direction; the authors use this approach to create a large dataset for closed information extraction and train new models that outperform previous benchmarks, demonstrating the potential of LLMs for generating synthetic data for various complex tasks. Rather than relying on ground-truth targets, which limits generalizability and scalability, STAR (Ma et al., 2023b) generates structures from valid triggers and arguments, and then generates passages with LLMs.
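As a minimal sketch of the data annotation strategy, with an LLM acting as annotator inside a simple loop; `annotate_with_llm` is a placeholder for any completion API, and LLMaAA's actual active-learning loop is considerably more involved:

```python
def annotation_prompt(sentence: str, entity_types: list[str]) -> str:
    return (f"Annotate the sentence with entities of types {entity_types}.\n"
            f"Return one 'type: span' pair per line, or 'None'.\n"
            f"Sentence: {sentence}")

def annotate_with_llm(prompt: str) -> str:
    """Placeholder for a call to any LLM completion API."""
    raise NotImplementedError

def build_synthetic_dataset(unlabeled: list[str], entity_types: list[str]) -> list[dict]:
    # Collect LLM-generated labels to use as synthetic training data.
    dataset = []
    for sentence in unlabeled:
        raw = annotate_with_llm(annotation_prompt(sentence, entity_types))
        labels = [line.split(": ", 1) for line in raw.splitlines() if ": " in line]
        dataset.append({"text": sentence, "entities": labels})
    return dataset
```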
Overall, these strategies have their own advantages and disadvantages. While data annotation can directly meet task requirements, the ability of LLMs for structured generation still needs improvement. Knowledge retrieval can provide additional information about entities and relations, but it suffers from the hallucination problem and introduces noise. Inverse generation is aligned with the QA paradigm of LLMs. However, it requires structural data and there exists a gap between the generated pairs and the domain that needs to be addressed.
It is non-negligible that LLMs have tremendous potential for extracting information in specific domains, such as multimodal (Chen and Feng, 2023; Li et al., 2023d), medical (Tang et al., 2023; Ma et al., 2023a), and scientific (Dunn et al., 2022; Cheung et al., 2023) information. For example:

Multimodal. Chen and Feng (2023) introduces a conditional prompt distillation method that enhances a model's reasoning ability by combining text-image pairs with chain-of-thought knowledge from LLMs, significantly improving performance in multimodal NER and multimodal RE.

Medical. Tang et al. (2023) explores the potential of LLMs in clinical text mining and proposes a novel training approach that leverages synthetic data to enhance performance and address privacy issues.

Scientific. Dunn et al. (2022) presents a sequence-to-sequence approach using GPT-3 for joint NER and RE from complex scientific text, demonstrating its effectiveness in extracting complex scientific knowledge in materials chemistry.
Despite the great success of LLMs in various natural language processing tasks, their performance in the field of information extraction is still questionable (Han et al., 2023). To address this concern, recent research has explored the capabilities of LLMs with respect to the major subtasks of IE (i.e., NER (Xie et al., 2023a; Li and Zhang, 2023), RE (Wadhwa et al., 2023; Yuan et al., 2023), and EE (Gao et al., 2023)). Considering the superior reasoning capabilities of LLMs, Xie et al. (2023a) proposes four reasoning strategies for NER, which are designed to stimulate ChatGPT's potential on zero-shot NER. Wadhwa et al. (2023) explores the use of LLMs for RE and finds that few-shot prompting with GPT-3 achieves near-SOTA performance, while Flan-T5 can be improved with chain-of-thought style explanations generated via GPT-3. For EE tasks, Gao et al. (2023) shows that ChatGPT still struggles, due to the need for complex instructions and a lack of robustness.
Along this line, some researchers perform a more comprehensive analysis of LLMs by evaluating multiple IE subtasks simultaneously. Li et al. (2023a) evaluates ChatGPT’s overall ability on IE, including performance, explainability, calibration, and faithfulness. They find that ChatGPT mostly performs worse than BERT-based models in the standard IE setting, but excellently in the OpenIE setting. Furthermore, Han et al. (2023) introduces a soft-matching strategy for a more precise evaluation and identifies “unannotated spans” as the predominant error type, highlighting potential issues with data annotation quality.
The development of LLM-based generative IE systems is still in its early stages, and there are numerous opportunities for improvement.

Universal IE. Previous generative IE methods and benchmarks are often tailored to specific domains or tasks, limiting their generalizability (Yuan et al., 2022). Although some unified methods (Lu et al., 2022) using LLMs have been proposed recently, they still suffer from certain limitations (e.g., long context input and misaligned structured output). Therefore, further development of universal IE frameworks that can adapt flexibly to different domains and tasks is a promising research direction (such as integrating the insights of task-specific models to assist in constructing universal models).

Low-Resource IE. Generative IE systems with LLMs still encounter challenges in resource-limited scenarios (Li et al., 2023a). Based on our summary, there is a need for further exploration of in-context learning of LLMs, particularly in terms of improving the selection of examples. Future research should prioritize the development of robust cross-domain learning techniques (Wang et al., 2023c), such as domain adaptation or multi-task learning, to leverage knowledge from resource-rich domains. Additionally, efficient data annotation strategies with LLMs should also be explored.
Prompt Design for IE. Designing effective instructions is considered to have a significant impact on the performance of LLMs (Qiao et al., 2022; Yin et al., 2023). One aspect of prompt design is to build input and output pairs that can better align with the pre-training stage of LLMs (e.g., code generation) (Guo et al., 2023). Another aspect is optimizing the prompt for better model understanding and reasoning (e.g., Chain-of-Thought) (Li et al., 2023b), by encouraging LLMs to make logical inferences or explainable generation. Additionally, researchers can explore interactive prompt design (such as multi-turn QA) (Zhang et al., 2023b), where LLMs can iteratively refine or provide feedback on the generated extractions automatically.
Open IE. The Open IE setting presents greater challenges for IE models, as they do not provide any candidate label set and rely solely on the models’ ability to comprehend the task. LLMs, with their knowledge and understanding abilities, have significant advantages in some Open IE tasks (Zhou et al., 2023). However, there are still instances of poor performance in more challenging tasks (Qi et al., 2023; Li et al., 2023a), which require further exploration by researchers.
In this survey, we focus on reviewing existing studies that utilize LLMs for various generative IE tasks. We first introduce the subtasks of IE and discuss some universal frameworks aiming to unify all IE tasks. Additional theoretical and experimental analysis provides insightful exploration for these methods. Then we delve into different learning paradigms that apply LLMs for IE and demonstrate their potential for extracting information in specific domains. We also introduce some studies for evaluation purposes. Finally, we analyze the current challenges and present potential future directions. We hope this survey can provide a valuable resource for researchers to explore more efficient utilization of LLMs for IE.