
Eliciting Human Preferences

  • Related Project: Private
  • Category: Paper Review
  • Date: 2023-10-22

Eliciting Human Preferences with Language Models

  • url: https://arxiv.org/abs/2310.11589
  • pdf: https://arxiv.org/pdf/2310.11589
  • abstract: Language models (LMs) can be directed to perform target tasks by using labeled examples or natural language prompts. But selecting examples or writing prompts can be challenging, especially in tasks that involve unusual edge cases, demand precise articulation of nebulous preferences, or require an accurate mental model of LM behavior. We propose to use LMs themselves to guide the task specification process. In this paper, we introduce Generative Active Task Elicitation (GATE): a learning framework in which models elicit and infer intended behavior through free-form, language-based interaction with users. We study GATE in three domains: email validation, content recommendation, and moral reasoning. In preregistered experiments, we show that LMs prompted to perform GATE (e.g., by generating open-ended questions or synthesizing informative edge cases) elicit responses that are often more informative than user-written prompts or labels. Users report that interactive task elicitation requires less effort than prompting or example labeling and surfaces novel considerations not initially anticipated by users. Our findings suggest that LM-driven elicitation can be a powerful tool for aligning models to complex human preferences and values.

Contents

TL;DR


  • A framework for optimizing specification cost and human–predictor alignment
  • An interactive, flexible learning method within the task elicitation framework
  • Experimental evaluation via domain-specific measures of specification cost and alignment

1. Introduction

Problem Definition and Goal

The paper proposes a framework for gathering and learning from task specifications that aims to optimize specification cost and human–predictor alignment. To this end, the following objective is optimized:

\[\alpha \cdot \text{specification cost} + \beta \cdot \text{human–predictor alignment} \tag{1}\]

Specification cost measures the human's time and mental effort, and human–predictor alignment measures how closely the model's choices agree with the choices the human would have made. $\alpha$ and $\beta$ trade off between the two.

Formalizing the Objective

To formalize this, define a human user $H_f$ whose preferences are represented by a function $f$. We want to design an elicitation policy $E$ that interacts with $H_f$ to produce a task specification $s$. This specification is fed to a learning algorithm to produce a model $\hat{f}(s)$. Then, defining a scalar measure of specification cost $C(\cdot)$ and a measure of alignment between two predictors $A(\cdot, \cdot)$, we want to minimize, in expectation over the population of human users:

\[\mathbb{E}_{H_f} \mathbb{E}_{s \sim E(H_f)} \left[ \alpha \cdot C(s) + \beta \cdot A(f, \hat{f}(s)) \right]\]

$C$ might measure the number of words used to write the specification $s$, and $A$ might measure model–predictor agreement at the level of individual predictions over some population:

\[A(f, \hat{f}) = \mathbb{E}_x \| f(x) - \hat{f}(x) \|\]
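
As a concrete reading of the objective above, here is a minimal Python sketch of a Monte-Carlo estimate over users; the helper names (`elicit`, `learn`, `users`, `eval_inputs`) are illustrative placeholders, not part of the paper.

```python
import numpy as np

def specification_cost(s: str) -> float:
    """C(s): e.g., the number of words the user typed to produce the specification."""
    return float(len(s.split()))

def alignment(f, f_hat, eval_inputs) -> float:
    """A(f, f_hat) = E_x ||f(x) - f_hat(x)||, estimated over a held-out sample of inputs."""
    return float(np.mean([abs(f(x) - f_hat(x)) for x in eval_inputs]))

def elicitation_objective(users, elicit, learn, eval_inputs, alpha=1.0, beta=1.0):
    """Monte-Carlo estimate of E_{H_f} E_{s ~ E(H_f)} [alpha * C(s) + beta * A(f, f_hat(s))]."""
    totals = []
    for f, H_f in users:          # each user: (true preference function f, interactive proxy H_f)
        s = elicit(H_f)           # elicitation policy E interacts with H_f to produce s
        f_hat = learn(s)          # learning algorithm turns the specification into a predictor
        totals.append(alpha * specification_cost(s) + beta * alignment(f, f_hat, eval_inputs))
    return float(np.mean(totals))
```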


2. Learning as Task Elicitation

2.1 The Task Elicitation Framework

The paper studies how to efficiently train a machine learning model to perform a task of interest. A task is defined generically as a function $f : x \rightarrow y$ mapping inputs $x$ to outputs $y$. For example, when building a personalized website recommendation system, $x$ is a website and $y$ is the user's preference score for that website.

2.2 Existing Learning Paradigms

Existing learning paradigms differ in their interactivity and flexibility. Supervised learning is passive and example-based, active learning is interactive and example-based, and prompting is passive but free-form.

Supervised learning

In supervised learning, $\hat{f}(s)$ is trained on a set of labeled (input, output) pairs produced by the user $H_f$. This is an example-based, passive process: the model does not interactively query the user to label additional data.

Active learning

In active learning, the model interactively queries the user $H_f$ by selecting the examples whose labels would be most informative. This approach is well suited to resolving uncertainty.

Prompting

In prompting, the specification is given as a natural language description. This is a more flexible approach: the user is asked to write a natural language description of the task, and the final predictor is obtained by conditioning a model on that description.


3. Generative Active Task Elicitation

3.1 The GATE Method

The paper proposes Generative Active Task Elicitation (GATE), which combines the flexibility of free-form specifications with the interactive advantages of active learning, and tests whether a language model (LM) can elicit and understand user preferences. The LM generates questions conditioned on the history of interaction with the user, and the predictor makes predictions from an input $x$ and the complete elicitation transcript.

Information-Gathering Policies

  1. Generative active learning: the LM generates example inputs for the user to label.
  2. Generating yes-or-no questions: the LM generates binary questions to elicit more abstract preferences.
  3. Generating open-ended questions: the LM generates free-form questions to elicit the broadest knowledge.


4. Experiment Setup

4.1 Domains and Datasets

  • Content recommendation: predict whether a user would want to read a given online article.
  • Moral reasoning: assess moral preferences by predicting when it is appropriate to steal a loaf of bread.
  • Email validation: a software engineering task of judging whether email addresses are valid.

4.2 Human Interaction

Interactive experiments were run with English-speaking participants, with 20–30 participants recruited for each domain-method pair.

4.3 Modeling Details

GPT-4 is used both to elicit user preferences and to make predictions. Conditioned on a domain description and the interaction history, GPT-4 generates questions; it then generates predictions for test samples.

4.4 Baseline Methods

  • Supervised learning: questions drawn at random from a pool of examples are presented to the user for labeling.
  • Pool-based active learning: diverse examples are selected by clustering the pool and presented for labeling.
  • User-written prompts: the user writes a description of their preferences for the task.

4.5 Evaluation and Metrics

We measure how accurately the model predicts the probability that the user will answer a question a certain way; specifically, the model's predicted probabilities are evaluated as a function of elicitation time.


5. Results

5.1 Sample Transcripts

Sample transcripts of user interactions with the various generative active task elicitation methods are presented.

5.2 Analysis

Additional analyses cover the variation across users' preferences, the effect of language model interaction on user preferences, and the kinds of questions the language model asks.


6 OTHER RELATED WORK

6.1 Eliciting Descriptions of Preferences

Obtaining information about people's nebulous thoughts, preferences, and goals is an important challenge across many fields. This section discusses methods for eliciting descriptions of preferences in several of them.

6.1.1 Psychology and Cognitive Science

In psychology and cognitive science, protocol analysis describes methods for obtaining this kind of information. Think-aloud protocols, in which participants verbalize their thoughts while solving a problem or performing a task, are used to understand cognitive processes (Ericsson & Simon, 1980; Ericsson, 2017).

6.1.2 Software Usability Analysis

Similar techniques are used in software usability analysis. They assess the usability and limitations of existing software and study how to gather information when users cannot fully understand or clearly articulate their own needs (Henderson et al., 1995; Christel & Kang, 1992; Goguen & Linde, 1993; Coughlan & Macredie, 2002; Zowghi & Coulin, 2005; Pacheco et al., 2018).

6.1.3 Survey and Questionnaire Design

Methods for eliciting people's preferences are also used in the design of surveys, questionnaires, and focus groups. These tools are used to obtain high-quality verbal reports, which play an important role in clearly understanding people's thoughts and preferences (Malhotra, 2006; Lietz, 2010; Krosnick, 2018; Krueger & Casey, 2002).


6.2 Computational Modeling and Querying of Preferences

This section covers various approaches to computationally describing or querying preferences.

6.2.1 Preference Modeling

Preference modeling techniques study people's revealed preferences and stated preferences. Preferences can be revealed through the choices people actually make or stated in survey responses (Samuelson, 1948; Kroes & Sheldon, 1988). Preferences refined through deliberation are also considered (Gutmann & Thompson, 2004).

6.2.2 Methods for Eliciting Preferences

There are many methods for eliciting preferences, including conjoint analysis, multiple-criteria decision making, multi-armed bandits, dueling bandits, Bayesian methods, recommender systems, robust optimization, optimal experiment design, (cooperative) inverse reinforcement learning, question generation, and generative modeling (Green & Srinivasan, 1978; Greco et al., 2016; Robbins, 1952; Yue et al., 2012; Chajewska et al., 2000; Aggarwal et al., 2016; McAuley, 2022; Vayanos et al., 2020; Emery & Nenarokomov, 1998; Ng et al., 2000; Hadfield-Menell et al., 2016; Mulla & Gharpure, 2023; Zhu & Bento, 2017).

6.2.3 Active Learning

Active learning centers on how models select useful data points to learn from. It has traditionally focused on pool-based methods, which choose points to label from a fixed pool (Lewis & Catlett, 1994; Settles & Craven, 2008; Settles, 2009; Houlsby et al., 2011). Recent work found that the uncertainty scores of pretrained models are useful for clarifying the user's task preferences (Tamkin et al., 2022b). This paper extends that approach to the generative setting, querying users with generated examples and questions to clarify their intent.


6.3 Task Ambiguity and Underspecification

A growing body of work explores how tasks can be underspecified or ambiguous. This section covers research on task ambiguity and underspecification.

6.3.1 Task Ambiguity

Task ambiguity arises when multiple tasks are consistent with the inputs to the model (e.g., the natural language prompt or the provided examples) (Finn et al., 2018; Tamkin et al., 2022b). Such ambiguity can stem from spurious correlations, where the network learns unwanted associations between features of the input data and the task label (Geirhos et al., 2020; Nagarajan et al., 2021; Sagawa et al., 2019; Srivastava et al., 2020; Sagawa et al., 2020).

6.3.2 Underspecified Training Pipelines

Underspecified training pipelines can lead to unpredictable and undesired behavior at deployment, with potentially dangerous real-world consequences (D'Amour et al., 2022). As recent models accept richer specifications such as natural language prompts, task ambiguity can also arise from other sources, such as incomplete or suboptimal natural language descriptions of the task (Tamkin et al., 2022b).

6.3.3 Resolving Ambiguity with Language Models

This paper finds that language models can often resolve their own task ambiguity in such cases by asking the user informative questions, for example by posing specific questions that clarify the user's intent.


1 INTRODUCTION

Abstractly, we seek a framework for gathering and learning from specifications that optimizes an objective:

\[\alpha \cdot \text{specification cost} + \beta \cdot \text{human–predictor alignment} \tag{1}\]

where specification cost measures human time and mental effort, human–predictor alignment measures the extent to which model choices agree with choices the human would have made, and $\alpha$ and $\beta$ trade off between the two. To formalize this, let $H_f$ denote a human user whose preferences are represented by a function $f$. We wish to design an elicitation policy $E$ that interacts with $H_f$ to produce a task specification $s$. This specification may then be input to a learning algorithm to produce a model $\hat{f}(s)$. Then, letting $C(\cdot)$ denote a scalar measure of specification cost, and $A(\cdot, \cdot)$ denote a measure of alignment between two predictors, we wish to minimize (in expectation over the population of human users):

\[\mathbb{E}_{H_f} \mathbb{E}_{s \sim E(H_f)} \left[ \alpha \cdot C(s) + \beta \cdot A(f, \hat{f}(s)) \right]\]

Here, \(C\) might measure the number of words the user typed to produce the specification \(s\), while \(A\) might measure model–predictor agreement at the level of individual predictions from some population: \(A(f, \hat{f}) = \mathbb{E}_x[\|f(x) - \hat{f}(x)\|].\)

In general, appropriate definitions of $C$ and $A$ are domain-dependent; in this paper, our experiments compare the alignment of different predictors at a fixed cost. Evaluation of cost, alignment, and tradeoffs between them are discussed more in Section 5.

  1. While this paper focuses on language-based elicitation procedures, we note that generative active task elicitation is modality-agnostic and could be applied to other settings (e.g., speech-based or multimodal models).

2 LEARNING AS TASK ELICITATION

2.1 THE TASK ELICITATION FRAMEWORK

We study the problem of efficiently training a machine learning model to perform a task of interest. Throughout this paper, we use task to refer generically to any function \(f : x \rightarrow y\) that maps inputs \(x\) to outputs \(y\). For example, when building a personalized website recommendation system, \(x\) are websites and \(y\) are user preference scores for those websites. Different users may prefer different content, meaning each user’s individual preferences specify a distinct task: content recommendation for Pat and content recommendation for Avery are different tasks within the domain of content recommendation (Ziegler et al., 2020).

To build such a model, we must collect some task specification from a human user (e.g., revealing what websites they are interested in). As noted above, current learning approaches admit a wide variety of specification types, including collections of labeled examples, natural language instructions, or combinations of the two. What makes one type of specification preferable to another? Ideally, we would like specifications that are both:

  1. Easy for humans to create
  2. Informative to learners, enabling them to model human preferences accurately

Abstractly, we seek a framework for gathering and learning from specifications that optimizes an objective:

\[\alpha \cdot \text{specification cost} + \beta \cdot \text{human–predictor alignment}\]

where specification cost measures human time and mental effort, human–predictor alignment measures the extent to which model choices agree with choices the human would have made, and \(\alpha\) and \(\beta\) trade off between the two.

To formalize this, let \(H_f\) denote a human user whose preferences are represented by a function \(f\). We wish to design an elicitation policy \(E\) that interacts with \(H_f\) to produce a task specification \(s\). This specification may then be input to a learning algorithm to produce a model \(\hat{f}(s)\). Then, letting \(C(\cdot)\) denote a scalar measure of specification cost, and \(A(\cdot, \cdot)\) denote a measure of alignment between two predictors, we wish to minimize (in expectation over the population of human users):

\[\mathbb{E}_{H_f} \mathbb{E}_{s \sim E(H_f)} \left[ \alpha \cdot C(s) + \beta \cdot A(f, \hat{f}(s)) \right]\]

Here, \(C\) might measure the number of words the user typed to produce the specification \(s\), while \(A\) might measure model–predictor agreement at the level of individual predictions from some population:

\[A(f, \hat{f}) = \mathbb{E}_x \| f(x) - \hat{f}(x) \|\]

In general, appropriate definitions of \(C\) and \(A\) are domain-dependent; in this paper, our experiments compare the alignment of different predictors at a fixed cost. Evaluation of cost, alignment, and trade-offs between them are discussed more in Section 5.

2.2 EXISTING LEARNING PARADIGMS IN THE TASK ELICITATION FRAMEWORK

Several existing frameworks for learning and task specification can be described within the framework given above. Understood as task elicitation procedures, existing frameworks differ along two key axes (visualized in Figure 2): their level of interactivity and their level of flexibility. In interactive elicitation methods, queries can change depending on user responses (e.g., querying for the most useful information based on what is known thus far) while passive elicitation methods expect the user to provide specifications in a single shot. Example-based specification methods ask users to label a set of examples, while free-form elicitation approaches are less restrictive, allowing the user to provide a much wider range of inputs, including natural language instructions and explanations.

Figure 2: Axes of variation in task elicitation.

Supervised learning: passive, example-based In the most common supervised learning setup, the elicitation policy $E$ simply instructs the human user $H_f$ to generate a collection of labeled (input, output) pairs, after which $\hat{f}(s)$ is produced by fitting or fine-tuning a learned model using standard algorithms. This is an example-based process because the specification is provided via labeled examples and is passive, as the model does not interactively query the user to label additional data.

Active learning: interactive, example-based In active learning, the elicitation policy is interactive. Users first assemble a fixed pool of unlabeled inputs $x$. Next, $E$ selects from this pool an example whose label would be most informative. The user $H_f$ provides a label for this example, then $E$ selects the next-most-informative example, and so on (Cohn et al., 1994; Dagan & Engelson, 1995; Lewis & Gale, 1994; Settles, 2009). Finally, $\hat{f}(s)$ is trained as in supervised methods. Optimal experiment design methods (Emery & Nenarokomov, 1998) may be viewed as generalizations of this paradigm in which inputs $x$ are generated rather than selected. Interactive processes enable the model to query for examples that may resolve uncertainty or ambiguity in the task specification (Tamkin et al., 2022b).

Prompting: passive, free-form Modern pre-trained models allow for specifying tasks in more flexible ways than simply labeling examples. For example, models can be conditioned with a prompt describing the user's intended task in natural language (Brown et al., 2020b), or even a mix of language and image inputs (Alayrac et al., 2022). As with supervised learning, the elicitation policy $E$ here is simply an instruction to write a natural language task description $s$, but the final predictor $\hat{f}(s)$ is produced by passing $s$ to a pre-trained language model.

3 GENERATIVE ACTIVE TASK ELICITATION

All of the methods above have important drawbacks: the burden typically falls upon the user to ensure that prompts or example sets are truly comprehensive specifications of the task, as any lack of clarity in the prompt could lead to task ambiguity (Tamkin et al., 2022a), resulting in undesired behavior during deployment. Resolving task ambiguity by crafting better prompts is challenging and time-consuming due to the difficulties of articulating nebulous personal preferences and anticipating edge cases that will emerge during deployment time.

However, one quadrant of Fig. 2 is not occupied by any of the aforementioned approaches: there is currently no method that leverages both the flexibility of a free-form specification, while using interaction to resolve uncertainty. We explore whether it is possible to combine the flexibility and richness of prompting-based specifications with the advantages of interactive methods such as active learning, by having a model interactively query users for these rich specifications. We term this family of methods generative active task elicitation (GATE).

3.1 METHODS FOR GATE

The effectiveness of language models (LMs) for understanding and producing free-form text suggests that they may be capable of eliciting and understanding user preferences. In this paper, we thus experiment with a family of GATE methods in which LMs serve as the backbone for both the elicitation policy $E$ and the predictor $\hat{f}(s)$. See Figure 1 for examples. In particular, we implement the elicitation policy $E$ by prompting an LM to ask the user questions while conditioning on the history of previous questions and answers. To make predictions $\hat{f}(s)$, an LM is prompted to predict a label conditioned on an input $x$ and a complete elicitation transcript $s$ provided as input. We experiment with several different information gathering policies, realized by simply prompting an LM to ask different kinds of questions:

Generative active learning The LM generates example inputs for the user to label. This approach has the advantage of providing concrete scenarios to the user, including some they may not have considered otherwise. For example, for content recommendation, the LM might generate an article such as: Are you interested in the following article? The Art of Fusion Cuisine: Mixing Cultures and Flavors […] .

Generating yes-or-no questions We restrict the LM to generating binary yes-or-no questions. This approach enables the model to elicit more abstract preferences while still being easy for the user to answer. For example, the model might probe a user’s preferences by asking: Do you enjoy reading articles about health and wellness?

Generating open-ended questions The LM generates arbitrary questions requiring free-form natural language responses. This enables the LM to elicit the broadest and most abstract pieces of knowledge at the potential cost of being overly broad or challenging for the user to answer. For example, the LM might generate the question: What hobbies or activities do you enjoy in your free time […], and why do these hobbies or activities captivate you?

The user is not constrained in their response in any of the above settings; they are free to provide as much detail as they want. We present example elicitation transcripts for each policy in Figure 5.
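
To make the above concrete, here is a minimal sketch of such an elicitation loop and predictor. It assumes a generic `lm_complete(prompt) -> str` wrapper around whatever LM API is in use and a `get_user_reply` callback; the function names and prompt wording are illustrative, not the paper's actual prompts.

```python
def elicit(lm_complete, get_user_reply, domain_description: str,
           mode: str = "open-ended", num_turns: int = 5) -> str:
    """Elicitation policy E: the LM asks questions conditioned on the transcript so far."""
    transcript = ""
    for _ in range(num_turns):
        question = lm_complete(
            f"{domain_description}\n"
            f"Interaction so far:\n{transcript}\n"
            f"Ask one informative but easy-to-answer {mode} question about the user's preferences."
        )
        answer = get_user_reply(question)          # free-form user response, unconstrained
        transcript += f"System: {question}\nUser: {answer}\n"
    return transcript                              # the task specification s

def predict(lm_complete, transcript: str, x: str) -> float:
    """Predictor f_hat(s): probability of a 'yes' answer for test input x, given the transcript."""
    reply = lm_complete(
        f"Interaction with the user:\n{transcript}\n"
        f"Test case: {x}\n"
        f"Output only the probability (between 0.0 and 1.0) that this user would answer 'yes'."
    )
    return float(reply.strip())                    # a real system would parse more defensively
```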

4 EXPERIMENT SETUP

We consider tasks in three different domains to evaluate our generative active task elicitation methods. A common feature of these domains is that they do not feature a single correct behavior that could be learned during LM pre-training; instead, models must elicit an individual human's preferences in order to make accurate predictions. We allow each human user to interact open-endedly with an elicitation policy $E$ for five minutes. Next, humans and learned models $\hat{f}(s)$ independently label a set of held-out examples. Finally, we measure agreement between humans and learned predictors. See Figure 5 for examples of environments and dialogues.

4.1 DOMAINS AND DATASETS

Content Recommendation We consider the domain of online article recommendations, where user preferences vary widely. Models are evaluated on their ability to predict whether a user would like to read a given held-out article. These test cases are taken from popular online newspaper and magazine articles collected by the authors. We provide a website name, article title, and a short description for each test case.

Moral Reasoning Moral preferences can be deeply personal and vary significantly across people and cultures. As a test-bed for eliciting moral values, we consider the question of when (if ever) it is ethical to steal a loaf of bread. During evaluation, models are presented with textual descriptions of scenarios and asked to predict whether users will judge it appropriate to steal a loaf of bread. These test cases are constructed manually by the authors.

  1. However, we emphasize that our method is not specific to language models or natural language and could potentially be applied to other settings such as images, speech, or multimodal models.
  1. The preregistration for our experiments and analyses can be found at: https://osf.io/5v6nd/.

Email Verification Last, we consider the problem of eliciting requirements for a software engineering task. Specification is especially challenging in software engineering due to the many edge cases developers need to anticipate and account for. In particular, we focus on specifying requirements for email address validation, where people have varied preferences over how long emails can be, how many subdomains they may possess, and which special characters are allowed, among other factors. Models are evaluated on their agreement with users about the validity of a set of held-out emails; this test set is again manually constructed by the authors.
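
As a toy illustration of how such validity preferences can differ between users (not code from the paper), each user's $f$ can be thought of as a different configuration of the same parameterized validator; the thresholds and character sets below are made up.

```python
import re

def make_validator(max_length: int = 64, max_subdomains: int = 1, allowed_specials: str = ".-_"):
    """Return a validity predicate reflecting one (hypothetical) user's email preferences."""
    def is_valid(address: str) -> bool:
        if len(address) > max_length or address.count("@") != 1:
            return False
        local, domain = address.split("@")
        specials = re.escape(allowed_specials)
        if not re.fullmatch(rf"[A-Za-z0-9{specials}]+", local):
            return False
        parts = domain.split(".")
        # "domain.co.uk" has 3 labels, i.e. 1 subdomain beyond "name.tld"
        return len(parts) >= 2 and len(parts) - 2 <= max_subdomains and all(parts)
    return is_valid

strict = make_validator(max_subdomains=0, allowed_specials=".")
lenient = make_validator(max_subdomains=2, allowed_specials=".-_+")
print(strict("user@domain.co.uk"), lenient("user@domain.co.uk"))   # False True
```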

4.2 HUMAN INTERACTION

Human participants in these experiments were recruited from English-speaking users of Prolific. For the email validation task, we additionally recruited participants from several computer science programs at US universities. We recruited 20–30 participants for each domain-method pair (6 elicitation methods across 3 domains), for a total of 388 participants. Participants were paid an average of $12/hr. Our experiments received IRB approval. The breakdown of the number of participants allocated to each scenario and method can be found in Appendix B.1. Details of the user interface used in experiments may be found in Appendix B.2.

4.3 MODELING DETAILS

We use the GPT-4 model (gpt-4-0613 snapshot) (OpenAI, 2023) to both elicit user preferences (the elicitation policy $E$) and make predictions based on the elicited preferences (the predictor $\hat{f}(s)$). To elicit user preferences, we prompt GPT-4 with a domain description and the current interaction history, and ask it to generate an informative but easy-to-answer edge case (for generative active learning) or question (for generative yes-or-no questions and generative open-ended questions). To make predictions, we prompt GPT-4 with the task specification $s$ and a test sample $x$ and ask it to generate a prediction for the test sample. The full text of the prompts can be found in Appendix A.

4.4 BASELINE METHODS

We compare GATE with several baseline approaches for specifying tasks. Here, the elicitation policy E is not parameterized by an LM, but constructed by the user and/or a pool of examples.

Supervised learning We consider supervised learning as a baseline, as described in Section 2.2. We randomly present participants with questions from a large pool of examples and ask them to annotate up to the time limit. We study this approach exclusively in the content recommendation domain because pools of examples are not readily available in the other two domains. We use the Microsoft News Dataset (Wu et al., 2020), a dataset of 160k news articles with descriptions, as our pool for this domain.

Pool-based active learning As a baseline active learning approach, we consider pool-based active learning, as described in Section 2.2. For the elicitation policy, we use the diversity-based sampling approach of Margatina et al. (2023); we first cluster the examples into 15 different clusters using a Sentence-BERT embedding model (Reimers & Gurevych, 2019), then iteratively ask questions from each cluster in a round-robin fashion, up until the time limit. This baseline is intended to capture the difficulty of selecting informative examples from a pool of unlabeled examples relative to generating informative examples from scratch. As with supervised learning, we study this approach exclusively in the content recommendation domain.

  1. Margatina et al. (2023) explored several different popular active learning sampling approaches for incontext learning (including random, uncertainty, and diversity sampling) and found little difference in empirical performance between them. We also ran exploratory model-model experiments in our domains and found no significant difference between these three sampling strategies. See details in Appendix D.
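
A rough sketch of this diversity-based sampling baseline is shown below, assuming the sentence-transformers and scikit-learn packages; the particular embedding model name is an assumption (the text only specifies a Sentence-BERT model), while the 15 clusters and round-robin ordering come from the description above.

```python
from collections import defaultdict, deque

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def diversity_order(pool_texts, n_clusters=15, seed=0):
    """Yield pool examples cluster-by-cluster (round-robin) for the user to label."""
    embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(pool_texts)   # model name assumed
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(embeddings)

    buckets = defaultdict(list)
    for text, label in zip(pool_texts, labels):
        buckets[label].append(text)

    queues = [deque(buckets[c]) for c in sorted(buckets)]
    while any(queues):
        for q in queues:            # one example from each cluster in turn
            if q:
                yield q.popleft()   # present this example to the user, up to the time limit
```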

User-written prompts As a baseline that does not use interactive elicitation, we ask participants to write a short paragraph describing their preferences for the task. We then use the text of this paragraph to prompt a model to make decisions. This baseline is intended to capture the difficulty of specifying preferences in writing, both in terms of the effort it takes to write the paragraph and the difficulty of writing a paragraph that fully specifies one’s preferences.

4.5 EVALUATION AND METRICS

We measure how well models can predict the probability that users will answer questions a certain way. Specifically, we prompt the model to output a real-valued probability of answering yes to the question, as opposed to a binary yes/no decision. To do so, we prompt the model with the interaction history $s$ as well as a single test case, then ask the model to predict the probability that a user would answer “yes” to the test case. This probability is output in token space as a number between 0.0 and 1.0, similar to past work (Branwen, 2020; Lin et al., 2022). We also discuss and report a classification-based metric in Appendix C.1. Given these predicted probabilities, we compute:

Area under the p(correct)-time curve We define p(correct) as the probability the model assigns to the user-preferred answer (see Section 4.5). For example, if the model outputted 0.8 for a given question, then p(correct) would be 0.8 if the user’s answer were “yes” to the same question, and 0.2 if the user’s answer was “no”. We select this metric instead of accuracy because guessing the user’s preferences may not always be possible, and modeling this uncertainty is useful.

However, we do not just care about the total information elicited, but about how quickly good information is elicited. To do this, we compute the average change in p(correct) after every minute of human elicitation time (conditioning on the state of the transcript at that time). This produces a curve where the x-axis is time, and the y-axis is the average change in p(correct). The area beneath this curve is a second metric we consider. Note that the final data point of each p(correct) curve may not reach 5 minutes because we subtract out the latency from the language modeling API; to account for this, we extend the final accuracy to the 5-minute mark before computing the area.
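
A minimal sketch of these two quantities (the p(correct) definition and the area under the Δp(correct)-vs-time curve), with the trapezoid integration and the 5-minute extension written out; the function names are mine, not the paper's.

```python
def p_correct(p_yes: float, user_says_yes: bool) -> float:
    """Probability mass the model assigns to the user-preferred answer.
    e.g. p_correct(0.8, True) == 0.8 and p_correct(0.8, False) == 0.2."""
    return p_yes if user_says_yes else 1.0 - p_yes

def auc_delta_p_correct(minute_marks, delta_p_correct, horizon=5.0):
    """Area under the 'delta p(correct) vs. elicitation time' curve.
    minute_marks: times (minutes) at which predictions were re-evaluated.
    delta_p_correct: average change in p(correct), relative to no elicitation, at each mark."""
    t, y = list(minute_marks), list(delta_p_correct)
    if t[-1] < horizon:            # extend the final value to the 5-minute mark
        t.append(horizon)
        y.append(y[-1])
    area = 0.0
    for i in range(1, len(t)):     # trapezoid rule
        area += 0.5 * (y[i] + y[i - 1]) * (t[i] - t[i - 1])
    return area
```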

Rating of perceived effort across elicitation policies In addition to these performance-based metrics, we also ask users to rate how difficult they found the elicitation process to be.

Specifically, we asked users “How mentally demanding was writing your answer?” in the non-interactive-elicitation setting, and “How mentally demanding was interacting with the chatbot?” in all elicitation settings (which include all other settings from Section 2.2). The “mentally demanding” wording was taken from the NASA TLX (Hart & Staveland, 1988). The question was assessed via a Likert scale from 1 (Very Little) to 7 (Very High). We also consider several additional questions to assess other usability tradeoffs. See Appendix E for the full list.

5 RESULTS

Evaluation results are shown in Figures 3 and 4. Additional results can be found in Appendix C. These results show that GATE methods…

...are successfully able to elicit human preferences. Overall, GATE improves over no elicitation, where the model is prompted to make decisions before any user interaction. This is the case across all domains studied (a positive score in Figure 3), with significance at the 0.05 level for all but the email domain, where only generative active learning was significant.
...are comparable to or better than other elicitation methods. In the majority of settings (6/10 for absolute, 7/10 for AUC), GATE elicitation methods improve over user-written prompts. In particular, generative yes/no questions improve over user-written prompts in every setting studied (although we lack enough power to assess significance in the moral reasoning domain). Furthermore, in the content recommendation setting, GATE elicitation methods (particularly generative open-ended questions) significantly improve over supervised learning and pool-based active learning.
...are equally or less mentally demanding than user-written prompts. As shown in Figure 4 (left), users generally find interactive elicitation methods to be less mentally demanding, especially ones that involve labeling samples or answering yes/no questions, than non-interactive prompting.
  1. While there may be other ways one might make predictions with these models, we found them lacking for a variety of reasons. First, we conducted pilot experiments prompting the LM to predict binary yes/no decisions; however, we found this resulted in skewed predictions where the LM would predict one of ‘yes’ or ‘no’ for the entire test set, perhaps due to miscalibration of the model’s implicit decision threshold. Second, we found that LMs are generally less reliable when generating confidence values in log space. Finally, we cannot directly take the token probabilities from GPT-4 using the OpenAI API.

Figure 3: Across three domains, our LM-prompting implementations of GATE are generally able to elicit human preferences beyond baseline supervised learning, active learning, or human-written prompts. We measure the Area Under the “∆p(correct) vs. Interaction time” Curve, which gives us a time-normalized metric for how well (and how quickly) each elicitation method is at aligning with human preferences. While GATE methods generally outperform the baseline methods as well as no interaction (represented by a ∆p(correct) of 0), we are only able to establish statistical significance between GATE and baselines in the content recommendation and email verification domains.

5.1 SAMPLE TRANSCRIPTS

Sample transcripts of users interacting with the various generative active task elicitation methods can be found in Figure 5.

5.2 ANALYSIS

Here, we present some additional analyses to better characterize the experiments.

How much variation is there in people's preferences? Elicitation is only helpful if there is variation in people's preferences; otherwise, a model could simply attain maximum performance by relying on its prior and ignoring the elicited information. To quantify how much variation there is in people's preferences, we compute the entropy in p(yes) for each question across participants. We find that many questions have high entropy while many others have little entropy, for an average entropy of 0.77 bits. Broadly, the results validate that our settings have significant variation in human preferences, enabling models to personalize themselves based on human preferences.
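
A possible computation of that per-question entropy statistic is sketched below; the paper does not spell out its exact estimator, so treat this as one plausible reading rather than the authors' code.

```python
import math

def binary_entropy(p_yes: float) -> float:
    """Entropy (in bits) of a yes/no question, given the fraction of participants answering yes."""
    if p_yes in (0.0, 1.0):
        return 0.0
    return -(p_yes * math.log2(p_yes) + (1.0 - p_yes) * math.log2(1.0 - p_yes))

def mean_preference_entropy(answers_per_question):
    """answers_per_question: {question_id: list of boolean answers across participants}.
    Returns the mean per-question entropy (the statistic reported as 0.77 bits above)."""
    entropies = [binary_entropy(sum(ans) / len(ans)) for ans in answers_per_question.values()]
    return sum(entropies) / len(entropies)
```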

Does language model elicitation influence user preferences? Human preferences may shift when interacting with language models for a variety of reasons. For example, past work has studied auto-induced distributional shift, where machine learning models shift human behavior to be easier to predict (Krueger et al., 2020). To investigate whether this occurs in our experiments (or indeed if different elicitation methods induce different human preferences for any other reason), we compare the distribution of human labels on test samples from the three GATE methods with those from the user-written prompt experiments to see whether interacting with language models influences users’ subsequent judgments. As seen in Figure 4 (right), we see no such effect.

What kinds of questions did the language models ask? We show a few examples of the language model questions in Figure 5. As the figure shows, these questions are complex and subtle, often building on the previous questions, representing a broad-based knowledge of the domain as well as possible nuances therein.

Figure 4: Left: GATE methods are equally or less mentally demanding than other methods. We plot the perceived mental demand across methods and domains (higher = greater mental demand). Right: Language model elicitation does not shift human preferences. We plot the proportion of participants who answered “yes” to each test question, comparing no LM interaction (user-written prompts) to LM interaction (GATE) elicitation. The red line is the y = x curve, which serves as a guideline to see how well humans’ no-LM interaction preferences align with their preferences post-LM interaction (if they align perfectly, the points should fall along this curve). We see that the points generally hover around this curve.

Why does prompting make things worse in the emails domain? In the emails domain in Figure 3, we observe that user-written preferences slightly decrease performance relative to a no-elicitation baseline. While it is possible this is an effect of noise, we also observe that some participants articulated preferences that were actually different from those they expressed when viewing email addresses. For example, one user wrote “an email address should finish with .com or co.uk” yet later decided that “user@domain.edu” was an acceptable email address. This indicates that users may not have a clear and comprehensive understanding of their own preferences, especially in more technical domains.

Can we automate evaluation? To probe whether evaluation could be automated, we conducted experiments where we simulated different human preferences using language models prompted with a diverse set of (automatically-generated) personas. These personas varied by domain, but generally contained information about a hypothetical person's preferences within that domain. For example, in the content recommendation domain, we generated brief biographical sketches of hypothetical people, including their hobbies, interests, and careers, and conditioned GPT-4 on these biographical sketches to generate answers to queries. We found that the model could simulate humans well in the content recommendation and email validation domains, but not in the moral reasoning domain. This suggests that while such personas may be a useful guide in some cases, they are not yet sophisticated enough to stand in for real human participants. See Appendix D for more details.

6.1 ELICITING DESCRIPTIONS OF PREFERENCES

A fundamental challenge across many fields is how to obtain information about people's nebulous thoughts, preferences, and goals. In psychology and cognitive science, protocol analysis describes methods for obtaining and analyzing verbal reports from subjects about cognitive processes, including via think-aloud protocols (Ericsson & Simon, 1980; Ericsson, 2017). In software usability analysis, similar techniques are used to assess the usability and limitations of existing software (Henderson et al., 1995), and for broader applications in the areas of survey, questionnaire, and focus group design (Malhotra, 2006; Lietz, 2010; Krosnick, 2018; Krueger & Casey, 2002). High-quality verbal reports can be challenging to obtain, however, and requirements elicitation studies methods for gathering information even when it is challenging for users to fully understand or anticipate their own needs or describe their preferences in clear, unambiguous language (Christel & Kang, 1992; Goguen & Linde, 1993; Coughlan & Macredie, 2002; Zowghi & Coulin, 2005; Pacheco et al., 2018). In our work, we explore whether language models could take the place of human researchers in surfacing these insights from people or even other language models.

Figure 5: Excerpts of real transcripts across the different domains and elicitation methods we investigate. The System messages are generated by the language model, while the User messages are produced by human participants. Overall, the model is able to generate diverse and contextually-appropriate questions in each setting. See Sections 3.1 and 4.1 for more details on the domains and methods respectively.

6.2 COMPUTATIONAL MODELING AND QUERYING OF PREFERENCES

Many works attempt to computationally describe or query human preferences. Preference modeling techniques study people’s revealed preferences (Samuelson, 1948), as well as their stated preferences (Kroes & Sheldon, 1988), including preferences refined through deliberation (Gutmann & Thompson, 2004). Methods for eliciting preferences span a wide variety of research areas including conjoint analysis (Green & Srinivasan, 1978), multiple-criteria decision making (Greco et al., 2016), multi-armed bandits (Robbins, 1952) and dueling bandits (Yue et al., 2012), Bayesian methods (Chajewska et al., 2000), recommender systems (Aggarwal et al., 2016; McAuley, 2022), robust optimization (Vayanos et al., 2020), optimal experiment design (Emery & Nenarokomov, 1998), (cooperative) inverse reinforcement learning (Ng et al., 2000; Hadfield-Menell et al., 2016), question generation (Mulla & Gharpure, 2023), and generative modeling (Zhu & Bento, 2017).

Perhaps most relevant to our work is active learning, a major subfield of machine learning that centers on how models can choose useful data points to learn from. Active learning has traditionally focused on pool-based methods, which choose points to label from a fixed reservoir (Lewis & Catlett, 1994; Settles & Craven, 2008; Settles, 2009; Houlsby et al., 2011). Recently, Tamkin et al. (2022b) found that the well-calibrated uncertainty scores of pretrained models can be used during active learning to clarify the user’s task preferences—for instance, by choosing examples that distinguish which of two correlated features are important for the task. We extend this line of investigation to the generative setting, clarifying user intent by querying a user with generated examples and questions.

6.3 TASK AMBIGUITY AND UNDERSPECIFICATION

A growing body of work explores how tasks in machine learning can be underspecified or ambiguous. In particular, task ambiguity (Finn et al., 2018; Tamkin et al., 2022b) arises when more than one task is consistent with the inputs to the model (e.g. the natural language prompt or provided examples). One stream of work here investigates spurious correlations (Geirhos et al., 2020), a form of task ambiguity where the network learns unwanted associations between features in the input data and the task label (Nagarajan et al., 2021; Sagawa et al., 2019; Srivastava et al., 2020; Sagawa et al., 2020). Such underspecified training pipelines can lead to unpredictable and undesired behavior during deployment and potentially dangerous real-world consequences (D’Amour et al., 2022). As recent models can accept richer specifications, such as natural language prompts, task ambiguity can arise from other sources, such as incomplete or suboptimal natural language descriptions of the task (Tamkin et al., 2022b). In this work, we find that language models can often resolve their own task ambiguity in these instances by asking informative questions of the user.

7 DISCUSSION AND CONCLUSION

We introduced the GATE framework to interactively elicit preferences from human users with freeform queries and answers. We presented initial evidence that language models can successfully implement GATE to elicit human preferences (sometimes) more accurately and with less effort than supervised learning, active learning, or prompting-based approaches.

There are many ways to expand on our implementation of GATE: Future work may explore more principled methods for elicitation besides simple prompting; for example, explicit notions of uncertainty or disagreement sampling could be used in conjunction with the free-form generation enabled by generative language models, taking inspiration from the active learning literature. Second, larger models may be more capable elicitors: future work can explore scaling laws for elicitation. Finally, many real-world tasks are more complex than those we study here; applications such as software design and legal and medical decision-making present a richer set of constraints and edge cases. These applications thus offer a rich space of possible extensions of GATE.

ETHICAL CONSIDERATIONS

Our work presents several potential ethical benefits and risks.

There are many potential benefits of machines that can better elicit and understand human preferences. For example, by making it easier for software designers to incorporate nuanced user preferences, GATE may empower people with rare preferences or preferences that have historically not been considered when building software systems. In addition, improving the effort-performance ratio, especially by requiring less user typing, may help make language models more accessible to users with less time, familiarity with language models, or physical ability to use such systems.

However, this direction carries risks as well. In particular, work on thin slicing (Ambady & Rosenthal, 1992) has demonstrated that small amounts of information about a user can sometimes be used to predict a broader range of personal characteristics, raising potential privacy considerations. The interactive nature of GATE also risks increasing automation bias (Goddard et al., 2012), where users place undue weight on a model’s predictions. However, further work is necessary to establish if or when these risks are more significant for GATE than for prompting-based approaches to steering language models.

REPRODUCIBILITY

We will open-source all code used in creating the GATE methods, constructing the user interface, and producing the results and analysis. We will also release the pre-registration for our experiments. All prompts we used for querying GPT-4 in the decision-making and elicitation phases, and all instructions we presented to the user, can be found in the Appendix. In all cases, we queried GPT-4 with temperature 0 for replicability of experiments.

We also note that the model we use is a closed-source model whose versions are periodically deprecated. This may hinder reproducibility, and we may explore open-source models in the future.
