Reinforcement Learning with Human Feedback

RLHF achieved enormous success with ChatGPT. Combining human feedback with data in this way is bound to be the trend for the next generation of AI, so I am studying it and keeping notes here.

====== Important Links ======

  • First, the write-up everyone reads: [[https://huggingface.co/blog/rlhf]]
  • Then a Chinese-language introduction: [[https://zhuanlan.zhihu.com/p/592671478]]

====== Personal Notes ======

== The Three Steps ==

  • Pretraining a language model (LM),
  • gathering data and training a reward model (a minimal loss sketch follows after this list), and
  • fine-tuning the LM with reinforcement learning.
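
As a quick illustration of the second step: reward models in this line of work are typically trained with a pairwise ranking loss on human comparisons, roughly -log sigmoid(r_chosen - r_rejected). Below is a minimal, self-contained PyTorch sketch of that loss on made-up scores; the function name and numbers are illustrative assumptions, not taken from any specific paper.

<code python>
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss: push the score of the human-preferred answer
    above the score of the rejected answer for the same prompt."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy batch of 3 comparisons: scalar scores the reward model gave each answer.
r_chosen = torch.tensor([1.2, 0.3, 2.0])
r_rejected = torch.tensor([0.7, 0.9, -0.5])
print(reward_model_loss(r_chosen, r_rejected))  # small when chosen > rejected
</code>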

== Models Involved ==

  • LM: a language model; an off-the-shelf pretrained one should do?
  • RM: a reward model.
  • PMP: another language model for preference prediction; presumably the one used to produce the reward.
  • Elo rating system: a rating scheme from game theory for estimating relative strength. Given a set of pairwise comparisons between samples, it produces a ranking of the samples (a small sketch follows after this list).
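
Here is a minimal, self-contained sketch of the Elo update just mentioned, turning a few pairwise judgements between samples into a ranking. The initial rating of 1000 and K = 32 are the customary chess defaults, not values prescribed by the RLHF papers.

<code python>
def elo_expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return updated ratings after one comparison between samples A and B."""
    e_a = elo_expected(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Toy usage: three samples, a few pairwise judgements, then rank by rating.
ratings = {"sample_a": 1000.0, "sample_b": 1000.0, "sample_c": 1000.0}
comparisons = [("sample_a", "sample_b", True),   # labeler preferred a over b
               ("sample_b", "sample_c", True),
               ("sample_a", "sample_c", True)]
for a, b, a_won in comparisons:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], a_won)
print(sorted(ratings, key=ratings.get, reverse=True))  # ranking, best first
</code>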

== Basic Design of the RL Fine-Tuning ==

  • policy: a new LM used to improve the outputs of the base LM above; it takes prompts as input and outputs a passage of text.
  • action space: very large, growing exponentially with the length of the output sequence.
  • observation space: very large, growing exponentially with the length of the input sequence.
  • reward function: composed from the preference (PMP) model's output together with other constraints.
  • update rule: PPO or A2C; not yet decided.
  • reward constraint: there are many tricks here and several competing ideas. The classic one is to add a KL-divergence penalty between the policy's output distribution and that of the initial LM, which keeps the policy from generating nonsense just to game the RL reward (a small sketch follows after this list).
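
Here is a minimal sketch of that KL-shaped reward, r = r_RM - beta * KL(policy || initial LM), using the common per-token approximation log p_policy(token) - log p_init(token). The coefficient beta, the function name, and the numbers are all illustrative assumptions, not any particular implementation.

<code python>
import torch

def shaped_rewards(policy_logprobs: torch.Tensor,
                   ref_logprobs: torch.Tensor,
                   rm_score: float,
                   beta: float = 0.1) -> torch.Tensor:
    """policy_logprobs / ref_logprobs: log-probs of the generated tokens under
    the RL policy and the frozen initial LM, both of shape (seq_len,)."""
    kl_per_token = policy_logprobs - ref_logprobs   # simple per-token KL estimate
    rewards = -beta * kl_per_token                  # KL penalty at every token
    rewards[-1] += rm_score                         # preference score on the final token
    return rewards

# Toy usage with made-up numbers for a 4-token response.
policy_lp = torch.tensor([-1.0, -0.8, -1.2, -0.5])
ref_lp    = torch.tensor([-1.1, -1.0, -1.0, -0.9])
print(shaped_rewards(policy_lp, ref_lp, rm_score=2.3))
</code>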

== Points Still to Explore ==

  • How to initialize the LM
  • Why the RM model is able to
  • How to design the reward

====== The Twelve Must-Read Papers Everyone Recommends ======

  • Deep Reinforcement Learning from Human Preferences (Christiano et al. 2017): RLHF applied on preferences between Atari trajectories.
  • Fine-Tuning Language Models from Human Preferences (Ziegler et al. 2019): An early paper that studies the impact of reward learning on four specific tasks.
  • Learning to summarize with human feedback (Stiennon et al. 2020): RLHF applied to the task of summarizing text. Also, Recursively Summarizing Books with Human Feedback (OpenAI Alignment Team 2021), follow-on work on summarizing books.
  • WebGPT: Browser-assisted question-answering with human feedback (OpenAI, 2021): Using RLHF to train an agent to navigate the web.
  • InstructGPT: Training language models to follow instructions with human feedback (OpenAI Alignment Team 2022): RLHF applied to a general language model [Blog post on InstructGPT].
  • GopherCite: Teaching language models to support answers with verified quotes (Menick et al. 2022): Train a LM with RLHF to return answers with specific citations.
  • Sparrow: Improving alignment of dialogue agents via targeted human judgements (Glaese et al. 2022): Fine-tuning a dialogue agent with RLHF.
  • ChatGPT: Optimizing Language Models for Dialogue (OpenAI 2022): Training a LM with RLHF for suitable use as an all-purpose chat bot.
  • Scaling Laws for Reward Model Overoptimization (Gao et al. 2022): studies the scaling properties of the learned preference model in RLHF.
  • Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (Anthropic, 2022): A detailed documentation of training a LM assistant with RLHF.
  • Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned (Ganguli et al. 2022): A detailed documentation of efforts to “discover, measure, and attempt to reduce [language models] potentially harmful outputs.”
  • Dynamic Planning in Open-Ended Dialogue using Reinforcement Learning (Cohen et al. 2022): Using RL to enhance the conversational skill of an open-ended dialogue agent.