Reinforcement Learning from Human Feedback

scroll ↓ to Resources

Note

  • RLHF is not the name of a particular algorithm. Rather, it’s a particular task formulation where the reward used in training (the reward model) is an approximation of the “true” reward, which lives in human minds (that is, real human preferences).
  • core idea is to generate several outputs (for each query) using a pre-trained model and ask human annotators to rank them
    • then, depending on the approach
      • in PPO - train a separate reward model that learns to predict the human ranking of unseen generated outputs, then train the LLM further by asking it to maximize that reward
      • in DPO - directly train the model on the collected preference data, without an explicit reward modeling step (see the loss sketch after this list)
  • Reinforcement Learning is a capricious tool, so there have been several attempts at getting rid of it for alignment training. The most popular one right now is Direct Preference Optimization.
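To make the DPO route concrete, below is a minimal PyTorch sketch of the DPO loss, assuming per-sequence log-probabilities of the chosen and rejected completions have already been computed under both the trained policy and the frozen reference (SFT) model; the function name, the value of `beta`, and the toy numbers are illustrative only.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # how much more likely the policy makes each completion, relative to the reference
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # -log sigmoid(beta * margin): minimized when the policy prefers the chosen
    # completion (relative to the reference) more strongly than the rejected one
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# toy usage with made-up per-sequence log-probabilities for 2 preference pairs
loss = dpo_loss(
    policy_chosen_logp=torch.tensor([-12.0, -9.5]),
    policy_rejected_logp=torch.tensor([-14.0, -11.0]),
    ref_chosen_logp=torch.tensor([-12.5, -10.0]),
    ref_rejected_logp=torch.tensor([-13.0, -10.5]),
)
```

The key property: no reward model appears anywhere; the preference labels enter the loss directly through the chosen/rejected log-probability margin.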

the RLHF pipeline

  • usually consists of 3 phases:

    • supervised finetuning (SFT): fine-tuning a pre-trained LM with supervised learning on high-quality data for the downstream task(s) of interest (dialogue, summarization, etc.), so that it generates reasonable-quality output to start RLHF from
    • preference sampling and reward modeling: train a model to predict the score that a human WOULD give to a generated output (see the sketch after this list)
    • reinforcement learning optimization: the learned reward function is used to provide feedback to the language model
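As an illustration of the reward-modeling phase, here is a minimal PyTorch sketch of the pairwise (Bradley-Terry style) ranking loss commonly used for it; the tiny stand-in model, feature shapes, and names are made up for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    """Stand-in for an LM backbone with a scalar value head."""
    def __init__(self, hidden_size=16):
        super().__init__()
        self.backbone = nn.Linear(8, hidden_size)   # placeholder for a transformer
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, features):
        return self.value_head(torch.relu(self.backbone(features))).squeeze(-1)

rm = TinyRewardModel()
chosen_feats = torch.randn(4, 8)    # encodings of (prompt, chosen completion)
rejected_feats = torch.randn(4, 8)  # encodings of (prompt, rejected completion)

# pairwise loss: -log sigmoid(r_chosen - r_rejected), averaged over the batch;
# it pushes the chosen completion to score higher than the rejected one
loss = -F.logsigmoid(rm(chosen_feats) - rm(rejected_feats)).mean()
loss.backward()
```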

reinforcement learning finetuning phase

  • the optimization is formulated as

$$\max_{\pi_\theta}\;\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r(x, y) \,\big] \;-\; \beta \cdot \mathrm{dist}\big(\pi_\theta(\cdot \mid x),\ \pi_{\mathrm{SFT}}(\cdot \mid x)\big)$$

where r is the reward model’s score, π_θ is the model being trained, π_SFT is the supervised-finetuned reference model, β controls the regularization strength, and dist is some kind of distance; the Kullback-Leibler divergence between the predicted probability distributions is popular in this role. This way, we maximize the reward while keeping the distance low.
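A minimal PyTorch sketch of this objective, assuming we already have the reward model’s scalar scores for sampled completions plus next-token logits from both the trained policy and the frozen SFT reference; `beta`, the tensor names, and the shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def regularized_reward(reward, policy_logits, ref_logits, beta=0.02):
    """reward: (batch,) scores from the reward model for sampled completions.
    policy_logits / ref_logits: (batch, seq_len, vocab) next-token logits for the
    same completions under the trained policy and the frozen SFT reference."""
    policy_logp = F.log_softmax(policy_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    # KL(policy || reference), summed over the vocabulary and the sequence
    kl = (policy_logp.exp() * (policy_logp - ref_logp)).sum(dim=-1).sum(dim=-1)
    return reward - beta * kl  # the quantity the RL step tries to maximize

# toy usage with random logits for 2 completions of length 5 over a 32-token vocab
objective = regularized_reward(
    reward=torch.tensor([1.3, 0.4]),
    policy_logits=torch.randn(2, 5, 32),
    ref_logits=torch.randn(2, 5, 32),
)
```

In practice the KL penalty is often applied per token and folded into the reward signal that PPO then maximizes; the sketch only shows the sequence-level quantity being optimized.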

  • RLHF fine-tuning doesn’t require (prompt, completion) pairs for training. The dataset for RLHF consists of prompts only, and the completions are generated by the model itself as part of the trial-and-error (see the data sketch after this list).
  • While RLHF improves alignment with human preferences, it doesn’t directly optimize output correctness and plausibility. This means that alignment training can harm LLM quality: a model that refuses to answer any question will never tell a user how to make a chemical weapon, so it is perfectly harmless, but it is also utterly useless for helpful purposes. To address this issue, we try to ensure that the RLHF-trained model doesn’t diverge much from its SFT predecessor. This is often enforced by adding a regularization term to the optimized function.
    • This way the trained model keeps operating within the (x, y) space of prompts and completions the reward model was trained on, where its scores are reasonably accurate.
    • It also prevents the model from collapsing onto a few high-reward answers.
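To make the data requirements of the different phases explicit, here is a small illustrative sketch (all contents are made up): SFT consumes (prompt, completion) pairs, reward modeling consumes a prompt with ranked completions, and the RL phase consumes prompts only.

```python
# SFT: supervised pairs of prompt and target completion
sft_example = {
    "prompt": "Summarize: The study found ...",
    "completion": "The study reports ...",
}

# reward modeling: a prompt with a preferred and a rejected completion
preference_example = {
    "prompt": "Summarize: The study found ...",
    "chosen": "A concise, faithful summary ...",
    "rejected": "An off-topic or misleading summary ...",
}

# RL phase: prompt only; the policy generates the completion and the reward model scores it
rl_example = {
    "prompt": "Summarize: The study found ...",
}
```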

Resources


table file.inlinks, file.outlinks from [[]] and !outgoing([[]])  AND -"Changelog"