reward model

scroll ↓ to Resources

Note

  • a reward model is the result of training on top of a pre-trained LLM checkpoint
    • for any pair (instruction, result) it predicts a number (Reward Score) denoting how good the result is for this instruction - it assigns a numerical score to model outputs based on their alignment with human preferences and values. The higher the score, the higher the estimated degree of alignment.
  • it is trained on human-annotated data with several ranked responses for each given instruction, for instance:
    • in practice, there can be up to 9 ranked responses, some of which can be excluded if they are too hard to rank.
    • a response can optionally be added by human annotators to improve one of the existing responses
  • having such a model, we can select the best result from several options (sketched below) or use it for RLHF
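A minimal best-of-n selection sketch, assuming a hypothetical `reward_model` callable that returns a scalar Reward Score for an (instruction, response) pair (the helper names are illustrative, not from any specific library):

```python
# Best-of-n selection sketch: score every candidate with the reward model
# and keep the highest-scoring one. `reward_model` is a hypothetical callable.

def select_best(instruction: str, candidates: list[str], reward_model) -> str:
    """Return the candidate response with the highest Reward Score."""
    scores = [reward_model(instruction, c) for c in candidates]
    best_index = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_index]
```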

How to obtain a reward model

Architecture

  • alter the transformer architecture by substituting the final fully-connected layer with a single neuron, whose output is a number from $-\infty$ to $+\infty$ - our Reward Score (RS)
  • this is done only for the last output token in the sequence, i.e. at the position where all output tokens are already known.
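A minimal PyTorch sketch of such a head, assuming a Hugging Face `AutoModel` backbone; the model name and the pooling of the last non-padding token are illustrative assumptions, not details from the note:

```python
# Reward model head sketch: a transformer backbone whose LM head is replaced
# by a single linear neuron, read out at the last token position.
import torch
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    def __init__(self, backbone_name: str = "gpt2"):  # backbone choice is illustrative
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        hidden_size = self.backbone.config.hidden_size
        # Single neuron: maps the last token's hidden state to one scalar (the Reward Score).
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Index of the last non-padding token for each sequence in the batch.
        last_index = attention_mask.sum(dim=1) - 1
        last_hidden = hidden[torch.arange(hidden.size(0)), last_index]
        return self.value_head(last_hidden).squeeze(-1)  # shape: (batch,)
```

An (instruction, result) pair would be concatenated, tokenized, and passed through this model to obtain a single score per sequence.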

Preference Data

  • Human preference data: several result candidates for each input instruction, ranked relative to each other; the ranking is only ordinal, not an exact numerical score.
  • Additionally, labelers were allowed to edit one of the candidate results to make it the best among the three
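A hypothetical example of what one preference record could look like (field names and content are made up for illustration, not a specific dataset schema):

```python
# Illustrative preference record: one instruction with ranked candidate responses.
preference_record = {
    "instruction": "Explain what a reward model is in one sentence.",
    "responses": [  # ordered from best (rank 1) to worst
        "A reward model scores an LLM output by how well it matches human preferences.",
        "It is a model trained on preferences.",
        "A reward model generates new responses for the user.",
    ],
    # the top response was edited by the labeler to become the best of the three
    "edited_best": True,
}
```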

Preference model

  • several preference models exist; the one explained here is the Bradley-Terry model
  • it claims: the probability that the user prefers result $y_2$ to result $y_1$ is $P(y_2 \succ y_1 \mid x) = \dfrac{e^{r(x, y_2)}}{e^{r(x, y_1)} + e^{r(x, y_2)}}$, where $r(x, y)$ is the reward, i.e. the output of that single neuron
  • this definition ensures it takes a value from 0 to 1
  • after math transformations we get that $P(y_2 \succ y_1 \mid x) = \dfrac{1}{1 + e^{-(r(x, y_2) - r(x, y_1))}} = \sigma\big(r(x, y_2) - r(x, y_1)\big)$, which is from 0 to 1
  • given a preference pair with ranked results (that is, we know that $y_2 \succ y_1$) and knowing from the above that this probability is $\sigma\big(r(x, y_2) - r(x, y_1)\big)$, we can maximize this value while training the model with the single-neuron head
  • in practice we minimize the negative log of that: $\mathcal{L} = -\log \sigma\big(r(x, y_2) - r(x, y_1)\big)$. The optimum of this function is 0 and it is approached only as $r(x, y_2) - r(x, y_1) \to +\infty$. So we train the model to give the good response a much higher score than the bad response, irrespective of their absolute values (a training-step sketch follows below).
  • Train the reward model from a checkpoint of a pre-trained model using the preference data
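A minimal sketch of one training step with this pairwise loss, reusing the hypothetical `RewardModel` from the Architecture section; batching, tokenization, and optimizer settings are illustrative assumptions:

```python
# One training step with the Bradley-Terry pairwise loss:
# maximize sigma(r(x, y_2) - r(x, y_1)) by minimizing its negative log.
# `model` is the RewardModel sketched above; `chosen` / `rejected` are dicts
# holding already-tokenized 'input_ids' and 'attention_mask' tensors.
import torch
import torch.nn.functional as F

def pairwise_loss_step(model, optimizer, chosen, rejected):
    reward_chosen = model(chosen["input_ids"], chosen["attention_mask"])        # r(x, y_2)
    reward_rejected = model(rejected["input_ids"], rejected["attention_mask"])  # r(x, y_1)
    # -log sigmoid(r_chosen - r_rejected); logsigmoid is numerically stable.
    loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Only the score difference enters the loss, which matches the point above: the absolute values of the two scores are not constrained, only their gap.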

Resources


table file.inlinks, file.outlinks from [[]] and !outgoing([[]])  AND -"Changelog"