====== Reinforcement Learning with Human Feedback ======
RLHF was a huge success in ChatGPT. Combining human feedback with data in this way is bound to be the direction of the next generation of AI, so I am studying it and keeping notes here.
====== Important Links ======
- The overview everyone reads first: [[https://huggingface.co/blog/rlhf]]
- A Chinese-language introduction: [[https://zhuanlan.zhihu.com/p/592671478]]
====== Personal Notes ======
== The Three Steps ==
- Pretraining a language model (LM),
- gathering data and training a reward model (a minimal loss sketch follows this list), and
- fine-tuning the LM with reinforcement learning.
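In step 2 the reward model is typically trained on pairwise comparisons: human labelers pick the better of two candidate answers, and the RM learns to give the chosen answer a higher scalar score. A minimal PyTorch sketch of that ranking loss is below; the toy ''TinyRewardModel'' and the random feature vectors are made-up stand-ins for illustration (a real RM scores the full prompt-plus-response token sequence).
<code python>
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for a reward model: maps a feature vector to a scalar score.
# (In practice this would be a language model with a scalar head over the
# prompt + response tokens.)
class TinyRewardModel(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):
        return self.score(x).squeeze(-1)

rm = TinyRewardModel()
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)

# Fake batch of (chosen, rejected) answer features from human comparisons.
chosen, rejected = torch.randn(8, 16), torch.randn(8, 16)

# Pairwise ranking loss: maximize log sigmoid(r_chosen - r_rejected),
# i.e. push the chosen answer's score above the rejected one's.
loss = -F.logsigmoid(rm(chosen) - rm(rejected)).mean()
opt.zero_grad()
loss.backward()
opt.step()
</code>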
== Models Involved ==
- LM: a language model; an off-the-shelf pretrained one should be enough?
- RM: the reward model, trained on human preference data to score the LM's outputs.
- PMP: another language model used for preference prediction (preference model pretraining); presumably this is where the reward comes from.
- Elo rating system: a system from game theory for estimating relative strength. Given a set of pairwise comparisons between samples, it produces a ranking of the samples (see the sketch after this list).
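As a quick illustration of the Elo idea above: each pairwise comparison nudges the winner's rating up and the loser's down, and sorting by rating gives the overall ranking. This is only a minimal sketch; the step size ''k=32'' and the toy comparison list are arbitrary assumptions.
<code python>
# Minimal Elo update over pairwise comparisons between generated samples.
# k (update step size) and the example comparisons are arbitrary choices.
def elo_rank(samples, comparisons, k=32, init=1000.0):
    rating = {s: init for s in samples}
    for winner, loser in comparisons:
        # Expected score of the winner under the current ratings.
        expected = 1.0 / (1.0 + 10 ** ((rating[loser] - rating[winner]) / 400))
        rating[winner] += k * (1.0 - expected)
        rating[loser]  -= k * (1.0 - expected)
    # Higher rating = preferred more often; sort to get the ranking.
    return sorted(samples, key=lambda s: rating[s], reverse=True)

# Example: labelers preferred A over B, A over C, and B over C.
print(elo_rank(["A", "B", "C"], [("A", "B"), ("A", "C"), ("B", "C")]))
</code>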
== Basic Design of the RL Fine-Tuning ==
- policy: a new LM (a copy of the one above) that is fine-tuned to improve the base LM's outputs; it takes prompts as input and outputs a passage of text.
- action space: large; at each generation step it is the LM's entire token vocabulary, so the space of complete outputs grows exponentially with output length.
- observation space: also large; the possible input token sequences grow exponentially with prompt length.
- reward function: composed from the preference (PMP) model's output together with other constraints.
- update rule: PPO or A2C; not yet settled.
- reward constraint: there are many tricks here and several competing ideas. The classic one is to add a KL-divergence penalty between the fine-tuned policy and the initial LM, which keeps the policy from generating nonsense just to fool the reward model (sketched below).
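To make that classic KL-penalty constraint concrete: the reward actually fed to the RL update is the preference/reward model's score minus beta times a sample-based KL term between the fine-tuned policy and the frozen initial LM, so the policy is penalized for drifting into text that merely fools the reward model. A minimal sketch under made-up inputs (the log-probabilities and ''beta'' below are arbitrary assumptions):
<code python>
import numpy as np

def rl_reward(rm_score, logprobs_policy, logprobs_init, beta=0.02):
    """Combine the reward model score with a per-token KL penalty.

    rm_score:        scalar reward/preference model score for the response
    logprobs_policy: per-token log-probs of the response under the RL policy
    logprobs_init:   per-token log-probs under the frozen initial LM
    beta:            KL penalty coefficient (an assumed value here)
    """
    # Single-sample estimate of KL(policy || init) over the generated tokens.
    kl = np.sum(np.asarray(logprobs_policy) - np.asarray(logprobs_init))
    return rm_score - beta * kl

# Toy numbers: the policy has drifted slightly from the initial LM.
print(rl_reward(rm_score=1.3,
                logprobs_policy=[-2.1, -1.8, -3.0],
                logprobs_init=[-2.5, -2.0, -3.2]))
</code>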
== Points Still to Explore ==
- How to initialize the LM
- Why the RM is able to ...
- How to design the reward
====== Twelve Must-Read Papers Everyone Recommends ======
- Deep Reinforcement Learning from Human Preferences (Christiano et al. 2017): RLHF applied on preferences between Atari trajectories.
- Fine-Tuning Language Models from Human Preferences (Ziegler et al. 2019): An early paper that studies the impact of reward learning on four specific tasks.
- Learning to summarize with human feedback (Stiennon et al., 2020): RLHF applied to the task of summarizing text. Also, Recursively Summarizing Books with Human Feedback (OpenAI Alignment Team 2021), follow on work summarizing books.
- WebGPT: Browser-assisted question-answering with human feedback (OpenAI, 2021): Using RLHF to train an agent to navigate the web.
- InstructGPT: Training language models to follow instructions with human feedback (OpenAI Alignment Team 2022): RLHF applied to a general language model [Blog post on InstructGPT].
- GopherCite: Teaching language models to support answers with verified quotes (Menick et al. 2022): Train a LM with RLHF to return answers with specific citations.
- Sparrow: Improving alignment of dialogue agents via targeted human judgements (Glaese et al. 2022): Fine-tuning a dialogue agent with RLHF.
- ChatGPT: Optimizing Language Models for Dialogue (OpenAI 2022): Training a LM with RLHF for suitable use as an all-purpose chat bot.
- Scaling Laws for Reward Model Overoptimization (Gao et al. 2022): Studies the scaling properties of the learned preference model in RLHF.
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (Anthropic, 2022): A detailed documentation of training a LM assistant with RLHF.
- Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned (Ganguli et al. 2022): A detailed documentation of efforts to “discover, measure, and attempt to reduce [language models] potentially harmful outputs.”
- Dynamic Planning in Open-Ended Dialogue using Reinforcement Learning (Cohen et al. 2022): Using RL to enhance the conversational skill of an open-ended dialogue agent.