Paper Deep Dive - (ICLR 2025 Poster) On a Connection Between Imitation Learning and RLHF

On a Connection Between Imitation Learning and RLHF

Motivation: Large language models can exhibit alignment faking.

[Image: OpenAI-news.jpg - "Detecting and reducing scheming in AI models"]

Alignment faking in large language models is a phenomenon that deserves close attention: a model may feign compliance during training so that, once training ends, it can keep acting on its "true" behavioral preferences. In experiments with Claude 3 Opus, when the model was told it was being trained to answer all queries (including harmful ones) and that only free-tier user data would be used for training, its compliance rate on harmful queries from free-tier users (14%) was far higher than from paid-tier users (almost never).

Fake Alignment: Are LLMs Really Aligned Well? (NAACL 2024, from AI Lab)

Imitation Learning