[Index marks for iteration and voting]
[Index marks for formulas on self-evaluation, self-augmentation, etc.]
Contents
Recently, many studies have explored new possibilities for the named entity recognition (NER) task in the era of large language models (LLMs) (OpenAI, 2022; Touvron et al., 2023; Chowdhery et al., 2022). These studies include prompt design for zero-shot prediction and few-shot in-context learning (ICL), training task-specific LLMs, and using LLMs to generate data.
This paper explores the possibility of pushing the performance boundary of zero-shot NER with LLMs via self-improving. It focuses on a strict zero-shot scenario in which only unannotated data is accessible and no additional model training or auxiliary models are used.
Self-Improving for Zero-Shot NER
To push the performance boundary of zero-shot NER, a self-improving framework is proposed for low-resource settings. The framework is orthogonal to existing prompt designs, so any prompting method can be plugged into it.
Task Formulation
Given an input sentence \(x\), the NER task is to recognize a structured output \(y\) from \(x\). \(y\) consists of a set of \((e, t)\) pairs, where \(e\) is an entity span, a sequence of tokens from \(x\), and \(t\) is the corresponding entity type from a predefined entity type set.
Detailed Process of Zero-Shot NER Self-Improving
Step 1: Zero-Shot Self-Annotation
Given an unlabeled corpus \(U = \{x_i\}_{i=1}^n\), predictions for each unlabeled sample \(x_i\) are generated with the LLM via zero-shot prompting.
\[y_i = \arg\max_y P(y|T, x_i)\]
where \(T\) is the NER task instruction and \(y_i = \{(e^j_i, t^j_i)\}_{j=1}^m\). Self-consistency (SC) (Wang et al., 2022) is applied to obtain an entity-level SC score \(c^j_i\) for each predicted entity, and the sample-level SC score \(c_i\) of each input sentence \(x_i\) is computed as
\[c_i = \frac{1}{m} \sum_j c^j_i\]
Step 2: Reliable Annotation Selection
Three strategies for reliable annotation selection are investigated.
Step 3: Inference with Self-Annotated Demonstrations
For a test input \(x_q\), \(k\) demonstrations are retrieved from the reliable self-annotated dataset to assist inference. Four demonstration retrieval methods are investigated.
Finally, ICL is performed to obtain the prediction:
\[y_q = \arg\max_y P(y|T, S, x_q)\]
Experiments
Experiments are conducted on the CoNLL03, ACE05, WikiGold, and GENIA datasets. GPT-3.5 is used as the LLM backbone and text-embedding-ada-002 as the text embedding model. F1 scores are reported for each dataset.
Results
Analysis
Increasing unlabeled data: Increasing the size of the unlabeled corpus was observed not to guarantee performance gains in the self-improving scenario.
Iterative self-improving: Iterative self-improving was attempted to boost performance, but it did not guarantee improvements on most datasets.
Upper bound of reliable annotation selection: Evaluating the upper bound of reliable annotation selection showed performance on par with the gold-label setting.
SC score analysis: SC scores were confirmed to effectively reflect the reliability of annotations.
Evaluation on weaker LLMs: The proposed framework was confirmed to be better suited to models with strong zero-shot capability.
Related Work
Research on information extraction with LLMs includes prompt design, task-specific LLM instruction tuning, and data augmentation; this work differs from prior studies by proposing a training-free self-improving framework. It is also related to studies of demonstrations in ICL, and proposes using LLMs to make predictions on an unlabeled corpus and then selecting reliable self-annotated data as demonstrations.
[Details]
1. Introduction
Recently, many studies have explored new possibilities for the named entity recognition (NER) task in the era of large language models (LLMs) (OpenAI, 2022; Touvron et al., 2023; Chowdhery et al., 2022). These include prompt design for zero-shot prediction or few-shot in-context learning (ICL) (Wei et al., 2023b; Wang et al., 2023; Xie et al., 2023; Li et al., 2023b), training task-specific LLMs for NER (Zhou et al., 2023; Sainz et al., 2023), and generating data with LLMs to train small task-specific models (Zhang et al., 2023; Ma et al., 2023; Josifoski et al., 2023).
This paper explores the possibility of pushing the performance boundary of zero-shot NER with LLMs via self-improving. The study focuses on a strict zero-shot scenario in which only unannotated data is accessible and no additional model training or auxiliary models are used. A fully training-free self-improving framework for NER is proposed, which leverages an unlabeled corpus to stimulate the self-learning ability of the LLM.
2. Self-Improving for Zero-Shot NER
2.1 Motivation
To push the performance boundary of zero-shot NER, a self-improving framework is proposed for a low-resource setting in which only unannotated data is available and no auxiliary models or training are used. The framework is orthogonal to previous prompt-design research, so any prompting method can be applied within it. Figure 1 shows an overview of the framework.
2.2 Task Formulation
Given an input sentence \(x\), the NER task is to recognize a structured output \(y\) from \(x\). \(y\) consists of a set of \((e, t)\) pairs, where \(e\) is an entity span, a sequence of tokens from \(x\), and \(t\) is the corresponding entity type from a predefined entity type set.
2.2.1 Step 1: Zero-Shot Self-Annotation
Assume an unlabeled corpus \(U = \{x_i\}_{i=1}^n\) is available. In this work, the training set without labels is used as the unlabeled dataset. For each unlabeled sample \(x_i\), predictions are generated with the LLM via zero-shot prompting, as shown in the upper part of Figure 1. This process is formulated as \(y_i = \arg\max_y P(y|T, x_i)\), where \(T\) is the NER task instruction and \(y_i = \{(e^j_i, t^j_i)\}_{j=1}^m\). Self-consistency (SC) (Wang et al., 2022) is applied to obtain an SC score for each prediction, which is used in Step 2 for reliable annotation selection. Multiple answers are sampled from the model, and the vote for each predicted entity \((e^j_i, t^j_i)\) is the number of times it appears across all sampled answers, denoted as the entity-level SC score \(c^j_i\). The sample-level SC score \(c_i\) of each input sentence \(x_i\) is then obtained by averaging the SC scores over all predicted entities, \(c_i = \frac{1}{m} \sum_j c^j_i\). Each self-annotated sample with its SC scores can thus be denoted as \((x_i, \{(e^j_i, t^j_i, c^j_i)\}_{j=1}^m, c_i)\).
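To make the scoring concrete, below is a minimal Python sketch of Step 1. It assumes a hypothetical `llm_annotate` function that returns one set of (span, type) pairs per sampled answer, and it uses the raw vote count as the entity-level SC score; names and shapes are illustrative, not taken from the paper.

```python
from collections import Counter

def self_annotate(sentence, llm_annotate, num_samples=5):
    """Zero-shot self-annotation with self-consistency (SC) scoring.

    llm_annotate(sentence) is assumed to return a set of (span, type)
    pairs for one sampled answer (e.g., one LLM call at temperature 0.7).
    """
    # Sample multiple answers from the LLM.
    answers = [llm_annotate(sentence) for _ in range(num_samples)]

    # Entity-level SC score: how many sampled answers contain each entity.
    votes = Counter(entity for answer in answers for entity in set(answer))
    entities = [(span, etype, count) for (span, etype), count in votes.items()]

    # Sample-level SC score: average of the entity-level scores.
    sample_score = (sum(c for _, _, c in entities) / len(entities)
                    if entities else 0.0)
    return sentence, entities, sample_score

# Toy usage with a stub "LLM" that always returns the same prediction.
if __name__ == "__main__":
    stub = lambda s: {("Japan", "LOC"), ("Tokyo", "LOC")}
    print(self_annotate("Tokyo is the capital of Japan.", stub))
```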
2.2.2 Step 2: Reliable Annotation Selection
A higher SC score is assumed to indicate higher reliability. Accordingly, three strategies for reliable annotation selection are investigated.
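The individual strategies are not reproduced in this excerpt, but one of them, SC-threshold filtering, can be sketched as below. The threshold names mirror the \(Th_{entity}\) and \(Th_{sample}\) thresholds referenced in the results section, and the data layout follows the triple produced in Step 1; treat this as an illustrative sketch, not the paper's exact procedure, and the default threshold values are placeholders.

```python
def select_reliable(annotated, th_entity=2, th_sample=2.0):
    """SC-threshold filtering (one possible selection strategy).

    `annotated` is a list of (sentence, entities, sample_score) triples,
    where each entity is a (span, type, entity_sc) tuple as in Step 1.
    Threshold values here are placeholders, not the paper's settings.
    """
    reliable = []
    for sentence, entities, sample_score in annotated:
        # Drop whole samples whose average SC score is too low.
        if sample_score < th_sample:
            continue
        # Within a kept sample, keep only entities with enough votes.
        kept = [(span, etype) for span, etype, sc in entities if sc >= th_entity]
        if kept:
            reliable.append((sentence, kept))
    return reliable
```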
2.2.3 Step 3: Inference with Self-Annotated Demonstrations
When a test input \(x_q\) arrives, \(k\) demonstrations are retrieved from the reliable self-annotated dataset to assist inference. Four demonstration retrieval methods are investigated.
The self-annotated demonstrations retrieved for test input \(x_q\) are denoted as \(S = \{(x_i, y_i)\}_{i=1}^k\). Finally, the framework performs ICL by concatenating these \(k\) samples with the test input sentence \(x_q\), as shown in the lower part of Figure 1. The prediction is obtained as \(y_q = \arg\max_y P(y|T, S, x_q)\).
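A minimal sketch of nearest-neighbor retrieval plus ICL prompting follows, assuming precomputed sentence embeddings (e.g., from text-embedding-ada-002) and an LLM completion call that is not shown; the prompt wording and helper names are placeholders rather than the paper's exact prompt.

```python
import numpy as np

def retrieve_nearest(query_emb, demo_embs, k=16):
    """Return indices of the k demonstrations closest to the query (cosine)."""
    demo_embs = np.asarray(demo_embs, dtype=float)
    query_emb = np.asarray(query_emb, dtype=float)
    sims = demo_embs @ query_emb / (
        np.linalg.norm(demo_embs, axis=1) * np.linalg.norm(query_emb) + 1e-12)
    return np.argsort(-sims)[:k]

def build_icl_prompt(instruction, demos, query_sentence):
    """Concatenate the task instruction, k self-annotated demos, and the query."""
    parts = [instruction]
    for sentence, entities in demos:
        labeled = "; ".join(f"{span} ({etype})" for span, etype in entities)
        parts.append(f"Sentence: {sentence}\nEntities: {labeled}")
    parts.append(f"Sentence: {query_sentence}\nEntities:")
    return "\n\n".join(parts)

# Usage sketch: y_q is parsed from the LLM's completion of build_icl_prompt(...),
# where the completion call wraps the LLM API (not shown here).
```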
3. Experiments
3.1 Setup
Experiments are conducted on four widely used NER datasets: CoNLL03 (Sang and De Meulder, 2003), ACE05 (Walker et al., 2006), WikiGold (Balasuriya et al., 2009), and GENIA (Ohta et al., 2002). GPT-3.5 (gpt-3.5-turbo) is used as the LLM backbone, and text-embedding-ada-002 is used as the text embedding model to obtain sentence representations. \(k = 16\) and \(K = 50\) are used. For SC, the temperature is set to 0.7 and 5 answers are sampled. For cost saving, 300 test samples are randomly sampled twice and the means and standard deviations are reported. In addition, 500 training samples without labels are randomly sampled to form the unlabeled corpus \(U\). Naive zero-shot prompting serves as the baseline, denoted No-demos. F1 scores are reported throughout this paper.
3.2 Results
The main results are shown in Table 1. Results for other values of the thresholds \(Th_{entity}\) and \(Th_{sample}\) can be found in Appendix E.
3.3 Analysis
Increasing unlabeled data: The size of \(U\) was expanded tenfold by randomly sampling 5,000 samples from the original training set. Results are shown in Figure 2. Increasing the size of the unlabeled corpus does not guarantee performance gains under the self-improving scenario. Meanwhile, even in the gold-label scenario, increasing the size of the demonstration pool brings only marginal improvement, likely because the small dataset already approximately captures the data distribution.
Iterative self-improving: The self-annotated data is used as demonstrations to guide the next iteration of self-annotation, forming a bootstrapping process. An illustration of the iterative self-improving process can be found in Appendix G. Up to 8 iterations are run, where the 0-th iteration corresponds to the No-demos setting. Results are shown in Figure 3. Increasing the number of self-improving iterations cannot guarantee improvements on most datasets, likely because error accumulation in self-annotation is difficult to eliminate in this training-free process.
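The iterative variant can be pictured as the loop below, which reuses the hypothetical helpers sketched earlier (`self_annotate`, `select_reliable`); it is a schematic of the bootstrapping idea under those assumptions, not the paper's implementation.

```python
def iterative_self_improve(unlabeled, llm_annotate, iterations=8):
    """Bootstrapping: each round's reliable self-annotations guide the next.

    Iteration 0 corresponds to the No-demos setting (plain zero-shot
    annotation); later rounds would annotate with the previous round's
    demonstrations included in the prompt. Helpers are the sketches above.
    """
    demos = []  # empty demo pool = No-demos
    for it in range(iterations + 1):
        annotated = []
        for sentence in unlabeled:
            # In rounds > 0, llm_annotate would be called with `demos`
            # prepended to the prompt (prompt construction omitted here).
            annotated.append(self_annotate(sentence, llm_annotate))
        demos = select_reliable(annotated)
    return demos
```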
Upper bound of reliable annotation selection: The upper bound of reliable annotation selection is evaluated by keeping only the true predictions among all sampled answers and discarding the false ones. Results are shown in Table 2; more detailed results are in Appendix F. The upper-bound setting performs on par with the gold-label setting, indicating that there is still room to improve reliable annotation selection.
SC score analysis: Kernel density estimates of the entity-level SC scores are shown in Figure 4. Most true predictions cluster in the high-SC-score range, while most false predictions have low SC scores. This suggests that SC scores effectively reflect the reliability of annotations.
Self-verification: Besides SC, self-verification (SV) is explored, which measures the confidence of self-annotations by asking the LLM to score its own answers. After the LLM outputs the recognized entities, the SV score is obtained by asking: “How confident are you in providing the above answers? Please give each named entity in your answer a confidence score of 0-5.” The comparison between SC and SV is shown in Table 3. As the table shows, SV also achieves some improvement over the No-demos baseline, but it lags behind SC. This is presumably because the LLM tends to be over-confident in its own answers: on the CoNLL03 benchmark, no sample received a confidence score below 3 under SV. The over-confidence problem is also noted in Li et al. (2023a).
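As a sketch of how the SV query might be issued as a follow-up turn: the confidence question below is quoted from the text, while the chat-message plumbing and the `chat` function are assumptions for illustration.

```python
SV_QUESTION = (
    "How confident are you in providing the above answers? "
    "Please give each named entity in your answer a confidence score of 0-5."
)

def self_verify(chat, instruction, sentence, answer):
    """Ask the LLM to rate its own answer (self-verification).

    `chat(messages)` is assumed to return the assistant's reply for a
    list of {"role", "content"} messages, as in typical chat APIs.
    """
    messages = [
        {"role": "user", "content": f"{instruction}\nSentence: {sentence}"},
        {"role": "assistant", "content": answer},   # the model's own NER output
        {"role": "user", "content": SV_QUESTION},   # follow-up SV query
    ]
    return chat(messages)  # reply contains per-entity 0-5 confidence scores
```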
Evaluation on weaker LLMs: To explore the performance of the proposed self-improving framework on weaker LLMs, experiments are conducted with the Llama2 chat 13B model (Touvron et al., 2023); results are shown in Table 4. The two-stage majority voting selection strategy and the nearest-neighbor retrieval method are used in this experiment. With much weaker zero-shot ability, the Llama2 13B model shows negative results under the self-improving framework, indicating that the proposed framework is better suited to models with strong zero-shot capability. For models with relatively weak zero-shot ability, improving the prompt design may be a more effective strategy for boosting performance.
4. Related Work
Information extraction with LLMs: Research on information extraction (IE) with LLMs includes prompt design (Wei et al., 2023b; Wang et al., 2023; Xie et al., 2023; Li et al., 2023b), task-specific LLM instruction tuning (Zhou et al., 2023; Sainz et al., 2023), and data augmentation (Zhang et al., 2023; Ma et al., 2023; Josifoski et al., 2023). Zhang et al. (2023) use an LLM to annotate data, fine-tune a specific IE model on it, and then use the fine-tuned model to help select the data to annotate in the next iteration. Unlike prior work, this study proposes a training-free self-improving framework to push the zero-shot boundary of LLMs. Unlike Zhang et al. (2023), the framework requires no seed labeled data, expert small model, or training resources. Moreover, the work is orthogonal to previous prompt-design studies: they explored various prompt formats to boost performance and did not use an unlabeled corpus, whereas this study improves zero-shot NER using an unlabeled corpus without designing any complex prompt format.
Demonstrations in ICL: Some studies have explored the factors that affect ICL (Lyu et al., 2023; Min et al., 2022; Wei et al., 2023a). Lyu et al. (2023) investigated the effect of randomly assigning labels to demonstrations in ICL. However, this random-labeling approach is not suitable for NER, which requires token-level rather than sentence-level label information. Unlike them, this work first uses the LLM to make predictions on an unlabeled corpus and then selects reliable self-annotated data as demonstrations.
[Revisited]
1 Introduction
There have been many works exploring new possibilities of the named entity recognition (NER) task in the era of large language models (LLMs) (OpenAI, 2022; Touvron et al., 2023; Chowdhery et al., 2022) recently. These studies include designing advanced prompting methods for zero-shot prediction or few-shot in-context learning (ICL) (Wei et al., 2023b; Wang et al., 2023; Xie et al., 2023; Li et al., 2023b), training task-specific LLMs for NER (Zhou et al., 2023; Sainz et al., 2023), and generating data with LLMs to train small specific models (Zhang et al., 2023; Ma et al., 2023; Josifoski et al., 2023).
In this work, we explore the possibility of pushing the performance boundary of zero-shot NER with LLMs via self-improving. We focus on the strict zero-shot scenarios where no annotated data is available but only an unlabeled corpus is accessible, and no training resource or auxiliary models are available. We propose a totally training-free self-improving framework for NER, which utilizes an unlabeled corpus to stimulate the self-learning ability of LLMs. The framework consists of the following three steps:
Our contributions include:
Zero-Shot NER with Self-Improving
Motivation
To push the performance boundary of zero-shot NER with LLMs, we propose a self-improving framework under a strict zero-shot and low-resource setting: No annotated data but only an unlabeled corpus is available; No auxiliary model or training step is required. This study is orthogonal to previous prompt designing works, as any advanced prompting method can be applied to this framework. Figure 1 shows the framework overview.
Task Formulation
Given an input sentence \(x\), the NER task is to recognize the structure output \(y\) from \(x\), which consists of a set of \((e,t)\) pairs. \(e\) is an entity span, which is a sequence of tokens from \(x\); \(t\) is the corresponding entity type, which belongs to a predefined entity type set.
Step 1: Zero-Shot Self-Annotating
We assume an unlabeled corpus \(U = \{x_i\}_{i=1}^n\) is available. We use the training set without labels as the unlabeled dataset in this work. For unlabeled sample \(x_i\), we generate predictions with LLMs via zero-shot prompting, as shown in the upper part of Figure 1. This process is formulated as \[y_i = \arg\max_y P(y|T, x_i)\] where \(T\) is the task instruction of NER, and \(y_i = \{(e^j_i, t^j_i)\}_{j=1}^m\). We apply self-consistency (SC) (Wang et al., 2022) to obtain an SC score for each prediction, which will be used in step 2 for reliable annotation selection. We sample multiple answers from the model, and the vote for each predicted entity \((e^j_i, t^j_i)\) is the number of times it appears in all the sampled answers, which we denote as the entity-level SC score \(c^j_i\). Then we get the sample-level SC score \(c_i\) for each input sentence \(x_i\) by taking the average SC score over all predicted entities in the sentence, i.e., \(c_i = \frac{1}{m} \sum_j c^j_i\). Each self-annotated sample with SC scores can be denoted as \((x_i, \{(e^j_i, t^j_i, c^j_i)\}_{j=1}^m, c_i)\).
Step 2: Reliable Annotation Selection
We assume that a higher SC score indicates a higher reliability. Thus, we investigate the three following strategies for reliable annotation selection:
Step 3: Inference with Self-Annotated Demonstration
When a test input \(x_q\) arrives, we retrieve \(k\) demonstrations from the reliable self-annotated dataset to help the inference. We investigate the following four methods for demonstration retrieval:
Let \(S = \{(x_i, y_i)\}_{i=1}^k\) denote the self-annotated demonstrations retrieved for the test input \(x_q\). Finally, our framework conducts ICL by concatenating these \(k\) samples as well as the test input sentence \(x_q\), as shown in the lower part of Figure 1. The prediction is obtained via \[y_q = \arg\max_y P(y|T, S, x_q)\]
Experiment
Setup
We experiment on four widely-used NER datasets, CoNLL03 (Sang and De Meulder, 2003), ACE05 (Walker et al., 2006), WikiGold (Balasuriya et al., 2009), and GENIA (Ohta et al., 2002). We use GPT-3.5 (gpt-3.5-turbo) as the LLM backbone and text-embedding-ada-002 model to get sentence representations. We set \(k = 16\) and \(K = 50\). For SC, we set temperature to 0.7 and sample 5 answers. For cost saving, we randomly sample 300 test samples twice then report the means and standard deviations, and we randomly sample 500 training samples without labels to form the unlabeled corpus \(U\). The naive zero-shot prompting is our baseline, which we denote as No-demos. We report F1 scores throughout this paper.
Results
The main results are shown in Table 1. Results of other values for thresholds \(\text{Th}_{\text{entity}}\) and \(\text{Th}_{\text{sample}}\) can be found in Appendix E.
Analysis
Increasing unlabeled data. We expanded the size of \(U\) by 10 times and randomly sampled 5000 samples from the original training set. Results are shown in Figure 2. Increasing the size of the unlabeled corpus does not guarantee performance improvements under the self-improving scenario. Meanwhile, increasing the size of the demonstration pool only brings marginal improvement, even under the gold label scenario. The reason may be that the small dataset already approximately captures the data distribution.
Iterative self-improving. We use the self-annotated data as demonstrations to guide the next iteration of self-annotating, forming a bootstrapping process. The illustration of the iterative self-improving process can be found in Appendix G. We experiment up to 8 iterations. The 0-th iteration indicates the No-demos setting. Results are shown in Figure 3. Increasing iterations of self-improving cannot guarantee improvements on most datasets. This may be due to the fact that error accumulation in self-annotating is difficult to eliminate in this training-free process.
Upper bound of reliable annotation selection. We keep only the true predictions and discard the false predictions in all the sampled answers to evaluate the upper bound of reliable annotation selection. Results are shown in Table 2. More detailed results can be found in Appendix F. The upper bound setting performs on par with the gold label setting, indicating that there is still room for improvement in reliable annotation selection.
SC score analysis. We plot the kernel density estimation for entity-level SC scores in Figure 4. Most true predictions gather in the interval of high SC scores, while most false predictions have low SC scores. This shows that SC scores effectively reflect the reliability of annotations.
Self-verification. Besides SC, we also explore self-verification (SV) to measure the confidence of self-annotation by asking the LLM to score its own answer about its own confidence. After the LLM outputs the recognized entities, we obtain the SV score by asking the LLM: “How confident are you in providing the above answers? Please give each named entity in your answer a confidence score of 0-5.” The comparison results between SC and SV are in Table 3. As shown in the table, SV also achieves some improvements compared with the No-demos baseline. However, it lags behind the SC measurement. This is presumably because the LLM tends to be over-confident about its own answer, since we found that no sample gets a confidence score lower than 3 under the SV measurement in the CoNLL03 benchmark. The overconfidence problem is also mentioned in Li et al. (2023a).
Evaluation on weaker LLMs. To explore the performance of the proposed self-improving framework on weaker LLMs, we conduct experiments on the Llama2 chat 13B model (Touvron et al., 2023); the results are shown in Table 4. The two-stage majority voting selection strategy and the nearest neighbor retrieval method are used in this experiment. With a much weaker ability in zero-shot scenarios, the Llama2 13B model shows negative results under the self-improving framework. This indicates that the proposed framework is more suitable for models with a strong zero-shot capability. For models with a relatively weaker zero-shot ability, improving the prompt design might be a more effective strategy to boost performance.
Related Work
Information extraction with LLM. The research of information extraction (IE) with LLMs includes prompt designing (Wei et al., 2023b; Wang et al., 2023; Xie et al., 2023; Li et al., 2023b), task-specific LLM instruction-tuning (Zhou et al., 2023; Sainz et al., 2023) and data augmentation (Zhang et al., 2023; Ma et al., 2023; Josifoski et al., 2023). Zhang et al. (2023) use an LLM to annotate data, which is used to fine-tune a specific IE model; the fine-tuned model then helps select the data to be annotated in the next iteration. Unlike previous works, this work proposes a training-free self-improving framework to push the zero-shot boundary of LLMs on NER. Different from Zhang et al. (2023), no seed labeled data, expert small models, or training resources are required in our framework. In addition, our work is orthogonal to previous prompt designing works. They explored various advanced prompt formats to boost performance and did not utilize an unlabeled corpus. Unlike them, this work improves zero-shot NER by using an unlabeled corpus without designing any complex prompt format.
Demonstrations in ICL. Some works explored factors that have impacts on ICL (Lyu et al., 2023; Min et al., 2022; Wei et al., 2023a). Lyu et al. (2023) investigate the impact of randomly assigning labels to demonstrations in ICL. However, this random labeling method is not suitable for tasks like NER, which requires label information on the token-level instead of sentence-level. Different from them, we first use LLM to make predictions on the unlabeled corpus, then select reliable self-annotated data as demonstrations.
The remarkable zero-shot and few-shot generalization of large language models (LLMs) (OpenAI, 2022; Touvron et al., 2023; Chowdhery et al., 2022) makes the vision of one model for all possible. Named entity recognition (NER) is a fundamental task in information extraction (IE). There have been many works exploring new possibilities of the NER task in the era of LLMs. Some works study advanced prompting methods to improve zero-shot prediction or few-shot in-context learning (ICL) for NER (Wang et al., 2023; Xie et al., 2023; Wei et al., 2023b). Some works train task-specific LLMs for NER (Sainz et al., 2023; Zhou et al., 2023). Meanwhile, some studies use LLMs as data annotators or data generators to conduct data augmentation for small language models (Ma et al., 2023; Josifoski et al., 2023).
In this work, we explore the possibility of pushing the boundary of zero-shot NER with LLMs via self-improving. We focus on the strict zero-shot scenarios where no annotated data is available but only an unlabeled corpus is accessible, and no training resource or auxiliary models are available. We propose a self-improving framework which utilizes an unlabeled corpus to stimulate the self-learning ability of LLMs via a training-free paradigm. The pipeline of our framework is as follows. (1) First, we use LLMs to predict named entities for the unlabeled corpus and obtain the self-annotated data. In this process, we apply self-consistency (SC) to obtain an SC score for each prediction as the measure of the quality of self-annotation. (2) Second, we explore diverse strategies to select reliable samples from the self-annotated data and obtain a reliable sample set. Then we investigate various demonstration retrieval methods to select demonstrations for the arrived test query. To better trade off the similarity, diversity, and reliability of demonstrations simultaneously, we propose the retrieval strategy of diverse nearest with SC ranking. (3) Finally, we use the selected reliable self-annotated demonstrations to assist the inference of the test query via ICL.
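The exact procedure of diverse nearest with SC ranking is not spelled out in this excerpt; the sketch below shows one plausible reading, in which the K nearest candidates are retrieved first and k demonstrations are then chosen greedily, preferring high SC scores while skipping candidates too similar to those already chosen. Treat it as an assumption-laden illustration, not the paper's algorithm; all parameter values and names are placeholders.

```python
import numpy as np

def diverse_nearest_sc(query_emb, cand_embs, cand_sc, K=50, k=16, max_sim=0.95):
    """One possible reading of 'diverse nearest with SC ranking'.

    cand_embs: (N, d) embeddings of reliable self-annotated samples.
    cand_sc:   (N,) sample-level SC scores.
    Returns indices of up to k selected demonstrations.
    """
    cand_embs = np.asarray(cand_embs, dtype=float)
    cand_sc = np.asarray(cand_sc, dtype=float)
    q = np.asarray(query_emb, dtype=float)

    norm = np.linalg.norm(cand_embs, axis=1) * np.linalg.norm(q) + 1e-12
    sims = cand_embs @ q / norm
    nearest = np.argsort(-sims)[:K]                  # similarity: K-nearest pool
    ranked = nearest[np.argsort(-cand_sc[nearest])]  # reliability: SC ranking

    unit = cand_embs / (np.linalg.norm(cand_embs, axis=1, keepdims=True) + 1e-12)
    chosen = []
    for idx in ranked:                               # diversity: skip near-duplicates
        if all(unit[idx] @ unit[j] < max_sim for j in chosen):
            chosen.append(idx)
        if len(chosen) == k:
            break
    return chosen
```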
In summary, the main contributions are:
Figure 1: The overview of the proposed self-improving framework for zero-shot NER with LLM. The pipeline mainly consists of three steps: (1) First, we let the LLM to make predictions on the unlabeled corpus via zero-shot prompting and obtain the self-annotated data. We obtain a consistency score for each prediction via SC. (2) Second, we perform reliable sample selection on the self-annotated dataset via various strategies. (3) Finally, for the arrived test query, we use the selected reliable self-annotated data as the pseudo demonstrations in ICL and conduct inference with pseudo demonstrations
IE with LLM. Some works designed different prompting methods to improve the performance of IE with LLMs. Wei et al. (2023b) propose a two-stage chatting paradigm for IE, which first recognizes the types of elements and then extracts the mentions corresponding to each type. Wang et al. (2023) apply in-context learning (ICL) to NER by inserting special tokens into the demonstrations retrieved from the full gold training set. Xie et al. (2023) propose decomposed-QA and syntactic augmentation to boost zero-shot NER. Li et al. (2023b) cast IE as a code generation task. Meanwhile, a few studies trained task-specific LLMs for IE with instruction-tuning (Zhou et al., 2023; Sainz et al., 2023). In addition, several works used LLMs to annotate data or generate synthetic data for small expert IE models (Zhang et al., 2023; Ma et al., 2023; Josifoski et al., 2023). Zhang et al. (2023) use an LLM to annotate data, then fine-tune the small IE model on the data annotated by the LLM, and use the fine-tuned model to help select the data to be annotated; this pipeline is conducted iteratively to form a bootstrapping process. Ma et al. (2023); Josifoski et al. (2023) generate synthetic data for IE via a structure-to-text paradigm. Different from previous work, we attempt to push the zero-shot boundary of LLMs on NER through training-free self-improving. We use an unlabeled corpus to stimulate the self-learning ability of the LLM without any auxiliary models. Note that our work is orthogonal to the previous prompt designing works, and any advanced prompting method can be applied in the proposed framework to boost performance.
Some works explored factors that have effects on ICL (Lyu et al., 2023; Min et al., 2022; Wei et al., 2023a). Lyu et al. (2023) use randomly assigned labels for demonstrations in ICL. However, this random assigning is not suitable for tasks like NER, which requires label information on the fine-grained token-level instead of sentence-level. Different from them, we first use the LLM to make predictions on the unlabeled corpus, then select reliable self-annotated data as pseudo demonstrations.
We propose a self-improving framework for zero-shot NER with LLMs under the strict zero-shot setting: No annotated data but only an unlabeled corpus is available; no auxiliary models are required; the whole process is training-free, which explores the self-improving ability of the model without costly training. This study is orthogonal to the previous prompt designing works, and any advanced prompting method can be applied in the proposed framework to boost performance.
Given an input sentence $x$, the NER task is to predict the structure output $y$ from $x$, which consists of a set of $(e, t)$ pairs. $e$ is an entity span, which is a sequence of tokens from $x$, and $t$ is the corresponding entity type, which belongs to a predefined entity type set.
We assume an unlabeled corpus $U$ is available. In this work, we use the training set without annotation $\{x_i\}_{i=1}^n$ as the unlabeled corpus. In this step, we use zero-shot prompting to annotate the unlabeled corpus, as shown in the upper right of Fig. 1.
The predicted answer is obtained via
\[y_i = \arg\max_y P(y|T, x_i)\]
where $T$ is the task instruction of NER.
Also, we apply self-consistency (SC) (Wang et al., 2022) to obtain the annotation confidences. For each predicted entity $(e, t)$, its vote is taken to measure its confidence, which we call the entity-level SC score, $c_{e_j}$. For each input sentence, we take the average SC score over all predicted entities as the SC score of this sentence, which we call the sample-level SC score, $c_i$. The self-annotated data with entity-level and sample-level SC scores are shown in the right part of Fig. 1. For each annotated sample, we denote it as a quadruple $(x_i, y_i, \{c_{e_j}\}, c_i)$, where $\{c_{e_j}\}$ is the set of entity-level SC scores corresponding to the predicted entity set.
We intend to use the self-annotated data as pseudo demonstrations to assist the inference, and provide useful guidance for test queries. This requires the pseudo demonstrations to have high quality. We use SC score to measure the quality of pseudo demonstrations and assume that a higher SC score indicates higher quality.
This step consists of two parts: we first select high-quality samples from the self-annotated dataset, then retrieve pseudo demonstrations from the high-quality sample set for inference.
Sample selection strategies:
Sample retrieval strategies: When a test query $x_q$ arrives, we retrieve $k$ samples from the reliable sample set to assist the inference of the query. We explore the following strategies for retrieval:
Let $S = \{(x_i, y_i)\}_{i=1}^k$ denote the retrieved reliable pseudo demonstrations for the test query $x_q$. Finally, our framework conducts ICL by concatenating these $k$ input-pseudo label pairs as well as the input query sentence $x_q$. Then the prediction is obtained via
\[y_q = \arg\max_y P(y|T, S, x_q)\]
Datasets. We evaluate the proposed framework on four commonly-used NER datasets, CoNLL03 (Sang and De Meulder, 2003), ACE05 (Walker et al., 2006), WikiGold (Balasuriya et al., 2009) and GENIA (Ohta et al., 2002).
Models. We use GPT-3.5 (gpt-3.5-turbo) as the LLM backbone for all experiments. We use the text-embedding-ada-002 model, a text embedding model from OpenAI, to get sentence representations. We access OpenAI models through the official API.
Prompts. We focus on investigating the effectiveness of this training-free self-improving paradigm on NER. So we eliminate the impact of different prompting methods by using the most regular prompt format, as shown in Fig. 1. The complete prompts are shown in Appendix E.
Note that prompt designing is orthogonal to this work, and any advanced prompting method can be applied in our framework to boost performance.
Implementation details. For inference via ICL with pseudo demonstrations, we use k = 16 in all of our experiments. For self-consistency, we set the temperature to 0.7 and 0 for settings with and without SC, respectively. We conduct majority voting over 5 responses in our main experiments. For cost saving, we randomly sample 300 samples from the test set twice and report the means and standard deviations, except for WikiGold, whose test set contains only 247 samples. We randomly sample 500 samples from the original training set as the unlabeled corpus U. We explore the effect of increasing the size of U in Section 4.4, and also report the results with the full gold training set for analysis.
The results of GPT-3.5 are obtained during October and November 2023 with official API.
The rest of this section tries to answer the following questions:
The main results of the proposed self-improving framework are shown in Table 1, from which we obtain the following observations. (1) In the part of No sample selection, we remove SC and use all self-annotated answers for self-improving. The results show improvements over No-demos, revealing that our framework is helpful even without any sample selection step. (2) In the next three parts, with reliable sample selection, the performance is further improved. There are no significant performance differences between different sample selection strategies. (3) Overall, the proposed retrieval strategy of diverse nearest with SC ranking shows consistent improvements under various settings and achieves the best results when combined with two-stage majority voting. This confirms that diverse nearest with SC ranking achieves a better trade-off between similarity, diversity and reliability of demonstrations. (4) Random retrieval lags behind nearest retrieval but not as much as in the gold label setting, likely because the diversity of demonstrations is more important in the pseudo label setting than in the gold label setting. The pseudo labels inevitably contain noise. Although nearest retrieval obtains the most similar demonstrations, if they contain much noise they could harm the performance due to the copy mechanism of ICL mentioned in Lyu et al. (2023). However, less similar but more diverse demonstrations can help the model learn the fine-grained task requirements while mitigating the copy mechanism.
Figure 2: Increasing the size of unlabeled dataset does not bring obvious performance gain in self-improving paradigm.
Figure 3: The pipeline of iterative self-improving.
We explore the effect of increasing the size of the unlabeled corpus U. We expand the size of U by 10× and randomly sample 5000 samples from the original training set. We refer to U of size 500 and 5000 as the small and large unlabeled set, respectively. The results on CoNLL03 and WikiGold are shown in Fig. 2. Results on the rest of the datasets are in Appendix C. Increasing the size of the unlabeled corpus does not guarantee performance improvements under the self-improving paradigm. Although increasing the size of the demonstration pool does improve the results under the gold label setting, the performance gain is marginal, likely because the small set already approximately covers the distribution of the dataset. This is consistent with the results in the low-resource scenario in Wang et al. (2023).
We explore the effect of iterative self-improving: we use the self-annotated data as the sample pool to guide the next iteration of self-annotating, forming a bootstrapping process. The iterative process is shown in Fig. 3. We use two-stage majority voting and diverse nearest with SC ranking in this section for demonstration. The results on CoNLL03 and WikiGold are shown in Fig. 4. Results on the rest of the datasets are in Appendix C. We experiment up to 8 iterations. The 0-th iteration indicates the No-demos setting. The impact of iterative self-improving varies on different datasets. The performance generally keeps climbing with the increase of iterations on CoNLL03. However, the performance keeps dropping or wandering around the original level on other datasets. This may be due to the fact that error accumulation in the self-annotating step is difficult to eliminate in this training-free paradigm.
Figure 4: Effect of increasing the iterations of self-improving varies on different datasets.
Figure 5: Kernel density estimation plots for SC scores. The SC scores reflect the reliability of samples to a large extent.
Table 2: The potential upper bound of SC (True-only) keeps only the true predictions in SC. TSMV refers to two-stage majority voting. We display the best results for each strategy. The setting of True-only performs on par with the Gold label setting, showing that there is still space to be improved for the reliable sample selection.
Potential upper bound of SC. We explore the potential upper bound of SC by keeping the true predictions only (referred to as true-only). The comparison results are shown in Table 2. We display the results of nearest retrieval for the true-only setting, which shows the best result among all retrieval strategies. The complete results of this setting are shown in Appendix B. The true-only setting performs on par with the gold label setting, showing that there is still space to be improved for the reliable sample selection. Strategies other than SC threshold filtering can be explored for selecting reliable entities.
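A small sketch of how the true-only pool could be constructed, assuming gold annotations are available purely for this upper-bound analysis; the entity tuples follow the (span, type) convention used in the earlier sketches, and the helper name is illustrative.

```python
def true_only_pool(self_annotated, gold):
    """Keep only predictions that match the gold annotations.

    self_annotated: list of (sentence, [(span, type, sc), ...]) pairs.
    gold: dict mapping sentence -> set of gold (span, type) pairs.
    Used only to estimate the upper bound of reliable sample selection.
    """
    pool = []
    for sentence, entities in self_annotated:
        gold_set = gold.get(sentence, set())
        kept = [(span, etype) for span, etype, _ in entities
                if (span, etype) in gold_set]
        if kept:
            pool.append((sentence, kept))
    return pool
```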
This work explored the possibilities of pushing the boundary of zero-shot NER with LLMs via training-free self-improving. We proposed a self-improving framework which utilizes an unlabeled corpus to stimulate the self-learning ability of LLMs. This framework mainly consists of three steps. At step one, we use the LLM to make predictions on the unlabeled corpus and obtain the self-annotated data. At step two, we select reliable demonstrations from the self-annotated set via various selection strategies and retrieval approaches. Finally, at step three, we conduct inference on the test query via ICL with the selected pseudo demonstrations. This framework achieves significant improvement on zero-shot NER with LLMs. In addition, through comprehensive experimental analysis, we find that the proposed framework might be difficult to further improve via iterative self-annotation or by simply increasing the size of the unlabeled corpus, but it might still be boosted via more advanced strategies for reliable entity selection.
We acknowledge the following limitations of this study: (1) This work focuses on exploring the zero-shot self-improving framework on the NER task; the investigation of this paradigm on other IE tasks has not been studied yet. (2) We applied the commonly-used self-consistency method to obtain the confidence score for measuring the quality of self-annotated data; there might be other diverse approaches to measure the quality of self-annotation. (3) The zero-shot performance still lags behind previous state-of-the-art fully-supervised methods. We leave these to future work.