Reinforcement Learning from Human Feedback

scroll ↓ to Resources

Note

  • RLHF is not the name of a particular algorithm. Rather, it’s a particular task formulation where the reward used in training (the reward model) is an approximation of the “true” reward, which lives in human minds (that is, real human preferences).
  • core idea is to generate several outputs (for each query) using a pre-trained model and ask human annotators to rank them
    • then, depending on the approach
      • in PPO - train a separate reward model that learns to predict the human ranking of unseen generated outputs, then train the LLM further by asking it to maximize that reward
      • in DPO - directly train the model on the collected preference data, without an explicit reward modeling step (see the loss sketch after this list)
  • Reinforcement Learning is a capricious tool, so there have been several attempts at getting rid of it for alignment training. The most popular one right now is Direct Preference Optimization.
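To make the DPO route concrete, below is a minimal PyTorch sketch of the DPO loss, assuming per-sequence log-probabilities of the chosen and rejected completions have already been computed under both the trained policy and the frozen reference (SFT) model; the function name, the value of `beta`, and the toy numbers are illustrative only.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # how much more likely the policy makes each completion, relative to the reference
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # -log sigmoid(beta * margin): minimized when the policy prefers the chosen
    # completion (relative to the reference) more strongly than the rejected one
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# toy usage with made-up per-sequence log-probabilities for 2 preference pairs
loss = dpo_loss(
    policy_chosen_logp=torch.tensor([-12.0, -9.5]),
    policy_rejected_logp=torch.tensor([-14.0, -11.0]),
    ref_chosen_logp=torch.tensor([-12.5, -10.0]),
    ref_rejected_logp=torch.tensor([-13.0, -10.5]),
)
```

The key property: no reward model appears anywhere; the preference labels enter the loss directly through the chosen/rejected log-probability margin.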

the RLHF pipeline

  • usually consists of 3 phases:

    • supervised finetuning (SFT): fine-tuning a pre-trained LM with supervised learning on high-quality data for the downstream task(s) of interest (dialogue, summarization, etc.), so that it generates reasonable-quality output to start RLHF from
    • preference sampling and reward modeling: train a model to predict the score that a human WOULD give to a generated output (see the sketch after this list)
    • reinforcement learning optimization: the learned reward function is used to provide feedback to the language model
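As an illustration of the reward-modeling phase, here is a minimal PyTorch sketch of the pairwise (Bradley-Terry style) ranking loss commonly used for it; the tiny stand-in model, feature shapes, and names are made up for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    """Stand-in for an LM backbone with a scalar value head."""
    def __init__(self, hidden_size=16):
        super().__init__()
        self.backbone = nn.Linear(8, hidden_size)   # placeholder for a transformer
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, features):
        return self.value_head(torch.relu(self.backbone(features))).squeeze(-1)

rm = TinyRewardModel()
chosen_feats = torch.randn(4, 8)    # encodings of (prompt, chosen completion)
rejected_feats = torch.randn(4, 8)  # encodings of (prompt, rejected completion)

# pairwise loss: -log sigmoid(r_chosen - r_rejected), averaged over the batch;
# it pushes the chosen completion to score higher than the rejected one
loss = -F.logsigmoid(rm(chosen_feats) - rm(rejected_feats)).mean()
loss.backward()
```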

reinforcement learning finetuning phase

  • the optimization is formulated as

$$\max_{\pi_\theta}\;\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r(x, y) \,\big] \;-\; \beta \cdot \mathrm{dist}\big(\pi_\theta(\cdot \mid x),\ \pi_{\mathrm{SFT}}(\cdot \mid x)\big)$$

where r is the reward model’s score, π_θ is the model being trained, π_SFT is the supervised-finetuned reference model, β controls the regularization strength, and dist is some kind of distance; the Kullback-Leibler divergence between the predicted probability distributions is popular in this role. This way, we maximize the reward while keeping the distance low.
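A minimal PyTorch sketch of this objective, assuming we already have the reward model’s scalar scores for sampled completions plus next-token logits from both the trained policy and the frozen SFT reference; `beta`, the tensor names, and the shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def regularized_reward(reward, policy_logits, ref_logits, beta=0.02):
    """reward: (batch,) scores from the reward model for sampled completions.
    policy_logits / ref_logits: (batch, seq_len, vocab) next-token logits for the
    same completions under the trained policy and the frozen SFT reference."""
    policy_logp = F.log_softmax(policy_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    # KL(policy || reference), summed over the vocabulary and the sequence
    kl = (policy_logp.exp() * (policy_logp - ref_logp)).sum(dim=-1).sum(dim=-1)
    return reward - beta * kl  # the quantity the RL step tries to maximize

# toy usage with random logits for 2 completions of length 5 over a 32-token vocab
objective = regularized_reward(
    reward=torch.tensor([1.3, 0.4]),
    policy_logits=torch.randn(2, 5, 32),
    ref_logits=torch.randn(2, 5, 32),
)
```

In practice the KL penalty is often applied per token and folded into the reward signal that PPO then maximizes; the sketch only shows the sequence-level quantity being optimized.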

  • RLHF fine-tuning doesn’t require (prompt, completion) pairs for training. The dataset for RLHF consists of prompts only, and the completions are generated by the model itself as part of the trial-and-error (see the data sketch after this list).
  • While RLHF improves alignment with human preferences, it doesn’t directly optimize output correctness and plausibility. This means that alignment training can harm LLM quality: a model that refuses to answer any question will never tell a user how to make a chemical weapon, so it is perfectly harmless, but it is also utterly useless for helpful purposes. To address this issue, we try to ensure that the RLHF-trained model doesn’t diverge much from its SFT predecessor. This is often enforced by adding a regularization term to the optimized function.
    • This way the trained model keeps operating within the (x, y) space of prompts and completions the reward model was trained on, where its scores are reasonably accurate.
    • It also prevents the model from collapsing onto a few high-reward answers.
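To make the data requirements of the different phases explicit, here is a small illustrative sketch (all contents are made up): SFT consumes (prompt, completion) pairs, reward modeling consumes a prompt with ranked completions, and the RL phase consumes prompts only.

```python
# SFT: supervised pairs of prompt and target completion
sft_example = {
    "prompt": "Summarize: The study found ...",
    "completion": "The study reports ...",
}

# reward modeling: a prompt with a preferred and a rejected completion
preference_example = {
    "prompt": "Summarize: The study found ...",
    "chosen": "A concise, faithful summary ...",
    "rejected": "An off-topic or misleading summary ...",
}

# RL phase: prompt only; the policy generates the completion and the reward model scores it
rl_example = {
    "prompt": "Summarize: The study found ...",
}
```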

Resources


table file.inlinks, file.outlinks from [[]] and !outgoing([[]])  AND -"Changelog"