Reinforcement Learning with Human Feedback

RLHF achieved enormous success with ChatGPT. Combining human feedback with data in this way is bound to be the trend for the next generation of AI, so I am studying it and keeping notes here.

====== Important Links ======

  • First, the write-up everyone reads: [[https://huggingface.co/blog/rlhf]]
  • Then a Chinese-language introduction: [[https://zhuanlan.zhihu.com/p/592671478]]

====== Personal Notes ======

== The Three Steps ==

  • Pretraining a language model (LM),
  • gathering data and training a reward model (a minimal loss sketch follows after this list), and
  • fine-tuning the LM with reinforcement learning.
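
As a quick illustration of the second step: reward models in this line of work are typically trained with a pairwise ranking loss on human comparisons, roughly -log sigmoid(r_chosen - r_rejected). Below is a minimal, self-contained PyTorch sketch of that loss on made-up scores; the function name and numbers are illustrative assumptions, not taken from any specific paper.

<code python>
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss: push the score of the human-preferred answer
    above the score of the rejected answer for the same prompt."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy batch of 3 comparisons: scalar scores the reward model gave each answer.
r_chosen = torch.tensor([1.2, 0.3, 2.0])
r_rejected = torch.tensor([0.7, 0.9, -0.5])
print(reward_model_loss(r_chosen, r_rejected))  # small when chosen > rejected
</code>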

== Models Involved ==

  • LM: a language model; an off-the-shelf pretrained one should do?
  • RM: a reward model.
  • PMP: another language model for preference prediction; presumably the one used to produce the reward.
  • Elo rating system: a rating scheme from game theory for estimating relative strength. Given a set of pairwise comparisons between samples, it produces a ranking of the samples (a small sketch follows after this list).
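
Here is a minimal, self-contained sketch of the Elo update just mentioned, turning a few pairwise judgements between samples into a ranking. The initial rating of 1000 and K = 32 are the customary chess defaults, not values prescribed by the RLHF papers.

<code python>
def elo_expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return updated ratings after one comparison between samples A and B."""
    e_a = elo_expected(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Toy usage: three samples, a few pairwise judgements, then rank by rating.
ratings = {"sample_a": 1000.0, "sample_b": 1000.0, "sample_c": 1000.0}
comparisons = [("sample_a", "sample_b", True),   # labeler preferred a over b
               ("sample_b", "sample_c", True),
               ("sample_a", "sample_c", True)]
for a, b, a_won in comparisons:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], a_won)
print(sorted(ratings, key=ratings.get, reverse=True))  # ranking, best first
</code>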

== Basic Design of the RL Fine-Tuning ==

  • policy: a new LM used to improve the outputs of the base LM above; it takes prompts as input and outputs a passage of text.
  • action space: very large, growing exponentially with the length of the output sequence.
  • observation space: very large, growing exponentially with the length of the input sequence.
  • reward function: composed from the preference (PMP) model's output together with other constraints.
  • update rule: PPO or A2C; not yet decided.
  • reward constraint: there are many tricks here and several competing ideas. The classic one is to add a KL-divergence penalty between the policy's output distribution and that of the initial LM, which keeps the policy from generating nonsense just to game the RL reward (a small sketch follows after this list).
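
Here is a minimal sketch of that KL-shaped reward, r = r_RM - beta * KL(policy || initial LM), using the common per-token approximation log p_policy(token) - log p_init(token). The coefficient beta, the function name, and the numbers are all illustrative assumptions, not any particular implementation.

<code python>
import torch

def shaped_rewards(policy_logprobs: torch.Tensor,
                   ref_logprobs: torch.Tensor,
                   rm_score: float,
                   beta: float = 0.1) -> torch.Tensor:
    """policy_logprobs / ref_logprobs: log-probs of the generated tokens under
    the RL policy and the frozen initial LM, both of shape (seq_len,)."""
    kl_per_token = policy_logprobs - ref_logprobs   # simple per-token KL estimate
    rewards = -beta * kl_per_token                  # KL penalty at every token
    rewards[-1] += rm_score                         # preference score on the final token
    return rewards

# Toy usage with made-up numbers for a 4-token response.
policy_lp = torch.tensor([-1.0, -0.8, -1.2, -0.5])
ref_lp    = torch.tensor([-1.1, -1.0, -1.0, -0.9])
print(shaped_rewards(policy_lp, ref_lp, rm_score=2.3))
</code>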

== Points Still to Explore ==

  • How to initialize the LM
  • Why the RM model is able to
  • How to design the reward

====== The Twelve Must-Read Papers Everyone Recommends ======

  • Deep Reinforcement Learning from Human Preferences (Christiano et al. 2017): RLHF applied on preferences between Atari trajectories.
  • Fine-Tuning Language Models from Human Preferences (Ziegler et al. 2019): An early paper that studies the impact of reward learning on four specific tasks.
  • Learning to summarize with human feedback (Stiennon et al. 2020): RLHF applied to the task of summarizing text. Also, Recursively Summarizing Books with Human Feedback (OpenAI Alignment Team 2021), follow-on work on summarizing books.
  • WebGPT: Browser-assisted question-answering with human feedback (OpenAI, 2021): Using RLHF to train an agent to navigate the web.
  • InstructGPT: Training language models to follow instructions with human feedback (OpenAI Alignment Team 2022): RLHF applied to a general language model [Blog post on InstructGPT].
  • GopherCite: Teaching language models to support answers with verified quotes (Menick et al. 2022): Train a LM with RLHF to return answers with specific citations.
  • Sparrow: Improving alignment of dialogue agents via targeted human judgements (Glaese et al. 2022): Fine-tuning a dialogue agent with RLHF.
  • ChatGPT: Optimizing Language Models for Dialogue (OpenAI 2022): Training a LM with RLHF for suitable use as an all-purpose chat bot.
  • Scaling Laws for Reward Model Overoptimization (Gao et al. 2022): studies the scaling properties of the learned preference model in RLHF.
  • Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (Anthropic, 2022): A detailed documentation of training a LM assistant with RLHF.
  • Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned (Ganguli et al. 2022): A detailed documentation of efforts to “discover, measure, and attempt to reduce [language models] potentially harmful outputs.”
  • Dynamic Planning in Open-Ended Dialogue using Reinforcement Learning (Cohen et al. 2022): Using RL to enhance the conversational skill of an open-ended dialogue agent.