
RLHF Paper

  • Related Project: Private
  • Category: Paper Review
  • Date: 2023-05-26

Contents


Reinforcement Learning with Human Feedback: Learning Dynamic Choices via Pessimism

  • arxiv: https://arxiv.org/abs/2305.18438
  • pdf: https://arxiv.org/pdf/2305.18438
  • abstract: In this paper, we study offline Reinforcement Learning with Human Feedback (RLHF) where we aim to learn the human’s underlying reward and the MDP’s optimal policy from a set of trajectories induced by human choices. RLHF is challenging for multiple reasons: large state space but limited human feedback, the bounded rationality of human decisions, and the off-policy distribution shift. In this paper, we focus on the Dynamic Discrete Choice (DDC) model for modeling and understanding human choices. DDC, rooted in econometrics and decision theory, is widely used to model a human decision-making process with forward-looking and bounded rationality. We propose a $\underline{D}$ynamic-$\underline{C}$hoice-$\underline{P}$essimistic-$\underline{P}$olicy-$\underline{O}$ptimization (DCPPO) method. The method involves a three-stage process:
    1. The first step is to estimate the human behavior policy and the state-action value function via maximum likelihood estimation (MLE);
    2. The second step recovers the human reward function by minimizing the Bellman mean squared error using the learned value functions;
    3. The third step is to plug in the learned reward and invoke pessimistic value iteration for finding a near-optimal policy.

    With only single-policy coverage (i.e., optimal policy) of the dataset, we prove that the suboptimality of DCPPO almost matches the classical pessimistic offline RL algorithm in terms of suboptimality’s dependency on distribution shift and dimension. To the best of our knowledge, this paper presents the first theoretical guarantees for off-policy offline RLHF with dynamic discrete choice model.
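
The three stages map naturally onto a small tabular experiment. The sketch below is only an illustration under assumed simplifications (a synthetic random dataset, a softmax/DDC choice model with rationality parameter beta, and a count-based pessimism penalty); the role of each stage follows the abstract, but the function names, estimators, and hyperparameters are my own placeholders, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, beta = 5, 3, 0.95, 1.0  # tiny MDP; beta is the DDC rationality parameter

# Synthetic offline dataset of (s, a, s') transitions standing in for
# trajectories induced by human choices (hypothetical, illustration only).
dataset = [(int(rng.integers(S)), int(rng.integers(A)), int(rng.integers(S)))
           for _ in range(2000)]

# Stage 1: MLE of the behavior policy / state-action value under the DDC
# softmax choice model, pi(a|s) proportional to exp(beta * Q(s, a)).
Q = np.zeros((S, A))
for _ in range(200):
    grad = np.zeros_like(Q)
    for s, a, _ in dataset:
        p = np.exp(beta * Q[s]) / np.exp(beta * Q[s]).sum()
        grad[s] -= beta * p    # gradient of the -log-sum-exp term
        grad[s, a] += beta     # gradient of the chosen-action term
    Q += 0.01 * grad / len(dataset)
V = np.log(np.exp(beta * Q).sum(axis=1)) / beta  # log-sum-exp state value

# Stage 2: recover the reward by minimizing the Bellman mean squared error,
# which in this tabular setting reduces to r(s,a) ~= Q(s,a) - gamma * E[V(s')].
num, cnt = np.zeros((S, A)), np.zeros((S, A))
for s, a, s2 in dataset:
    num[s, a] += Q[s, a] - gamma * V[s2]
    cnt[s, a] += 1
r_hat = np.where(cnt > 0, num / np.maximum(cnt, 1), 0.0)

# Stage 3: pessimistic value iteration -- subtract a count-based uncertainty
# penalty so that poorly covered state-action pairs are not over-valued.
P_hat = np.zeros((S, A, S))
for s, a, s2 in dataset:
    P_hat[s, a, s2] += 1
P_hat /= np.maximum(P_hat.sum(axis=2, keepdims=True), 1)
penalty = 1.0 / np.sqrt(np.maximum(cnt, 1))
V_pess = np.zeros(S)
for _ in range(100):
    Q_pess = r_hat - penalty + gamma * (P_hat @ V_pess)
    V_pess = Q_pess.max(axis=1)
print("pessimistic greedy policy:", Q_pess.argmax(axis=1))
```

The single-policy-coverage guarantee quoted above is exactly why the penalty term matters: without it, value iteration would happily exploit state-action pairs the human never visited.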


Training language models to follow instructions with human feedback

  • arxiv: https://arxiv.org/abs/2203.02155
  • pdf: https://arxiv.org/pdf/2203.02155
  • abstract: Making language models bigger does not inherently make them better at following a user’s intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.
  • related url: https://github.com/opendilab/awesome-RLHF
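
A minimal sketch of the three InstructGPT-style stages (supervised fine-tuning, reward modeling from ranked comparisons, RL fine-tuning) is below. The pairwise Bradley-Terry loss in stage 2 is the standard formulation for rankings-based reward models; the linear scoring head, feature shapes, and kl_coef value are placeholders I chose for illustration, not OpenAI's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stage 1 (SFT): supervised fine-tuning of the base LM on labeler-written
# demonstrations with the usual next-token cross-entropy loss (omitted here).

# Stage 2: reward model trained on ranked comparisons with the pairwise
# Bradley-Terry loss, -log sigmoid(r(chosen) - r(rejected)).
class RewardModel(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        # Placeholder: a linear head over pooled features stands in for the LM backbone.
        self.score = nn.Linear(dim, 1)

    def forward(self, features):
        return self.score(features).squeeze(-1)

rm = RewardModel()
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)
chosen = torch.randn(32, 16)    # placeholder features of preferred responses
rejected = torch.randn(32, 16)  # placeholder features of dispreferred responses
opt.zero_grad()
loss = -F.logsigmoid(rm(chosen) - rm(rejected)).mean()
loss.backward()
opt.step()

# Stage 3: RL fine-tuning (PPO in the paper) maximizes the learned reward with a
# KL penalty that keeps the policy close to the SFT model:
#   objective = E[ r_phi(x, y) ] - kl_coef * KL( pi_RL(.|x) || pi_SFT(.|x) )
kl_coef = 0.02  # illustrative coefficient, not the paper's tuned value
```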