direct preference optimization
scroll ↓ to Resources
Note
- part of the post-training approach as in ^8e9af4
Comparison to RLHF
- alternative to RLHF for model alignment
- motivation: RLHF can be an unstable and complex procedure; it requires fitting a reward model to collected relative human preference data and then fine-tuning the large self-supervised LM with reinforcement learning to maximize the estimated reward without drifting too far from the original model
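For reference, that RL step is usually written as a KL-constrained objective (standard notation, not tied to any particular implementation): $\pi_\theta$ is the model being fine-tuned, $\pi_{\mathrm{ref}}$ the original model, and $r_\phi$ the fitted reward model.

$$
\max_{\pi_\theta}\;\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_\theta(\cdot\mid x)}\big[r_\phi(x,y)\big]\;-\;\beta\,\mathbb{D}_{\mathrm{KL}}\!\big[\pi_\theta(y\mid x)\,\|\,\pi_{\mathrm{ref}}(y\mid x)\big]
$$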
the idea behind DPO
- instead of creating a proxy reward model, generating new outputs, scoring them and, finally, maximizing the probability of generating high-scored outputs, ==DPO allows fine-tuning an LM directly on human preference data using backpropagation==
- Uses a reparameterization to express an implicit reward model through the LM itself, so no separate reward model has to be fit to human preferences. It turns out to be very simple and roughly says: "a continuation y is better for a prompt x if y is more likely to be generated from x by the LLM (relative to the reference model)".
- It then takes a `(prompt, preferred_continuation, rejected_continuation)` dataset and trains the LLM on it in a simple supervised way, ensuring that, roughly speaking, the preferred continuation is more likely to be generated from the prompt than the rejected one; see the loss sketch below.
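A minimal sketch of that supervised step, assuming the standard DPO loss: the negative log-sigmoid of the difference between the scaled log-probability ratios of the preferred and rejected continuations,

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]
$$

In code (PyTorch; the function and tensor names are illustrative, and the per-sequence log-probabilities are assumed to be computed elsewhere):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss: -log sigma(beta * (chosen log-ratio - rejected log-ratio))."""
    # Implicit rewards: beta * log(pi_theta(y|x) / pi_ref(y|x))
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Average over the batch of preference pairs
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage: a batch of 2 preference pairs with made-up summed log-probabilities
policy_chosen = torch.tensor([-12.3, -8.1], requires_grad=True)
policy_rejected = torch.tensor([-11.0, -9.5], requires_grad=True)
ref_chosen = torch.tensor([-12.0, -8.4])
ref_rejected = torch.tensor([-11.2, -9.1])

loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
loss.backward()  # plain backpropagation; no reward model, sampling, or RL loop
print(loss.item())
```

Minimizing this loss pushes the preferred continuation's log-ratio above the rejected one's, which is exactly the "more likely than the rejected one" condition above.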
Resources
Links to this File
table file.inlinks, file.outlinks from [[]] and !outgoing([[]]) AND -"Changelog"