LLM-as-a-judge

scroll ↓ to Resources

Note

  • An automated, often reference-free metric that takes the human out of the loop and speeds up model development iterations
  • Examples
    • Pairwise comparison
      • Given the question and the two different responses, decide which response is better based on relevance, helpfulness, and level of detail. If both responses are equally good, return tie; otherwise return response 1 or response 2.
    • Evaluation by criteria (reference-free); a code sketch of this judge appears after this list
      • Evaluate the following response for conciseness. A concise response is clear and direct without unnecessary words. Return either concise or verbose.
    • Evaluation by criteria (reference-based)
      • Given the conversation below, assess whether the user's request was resolved. If the issue was addressed and the user confirmed it or expressed satisfaction, return the label resolved. Otherwise, return not resolved.
  • To create a judge, start with a small, human-labeled (golden) dataset, design a clear evaluation prompt, and iteratively refine the prompt until the judge’s outputs align with the golden labels; a sketch of measuring this agreement follows the list.
  • Build the judge using the same model you are using in your application. After you have the judge working, replace it with a smaller/cheaper model and iterate.
  • After building the judge, integrate it into your application and use it to evaluate a sample of production outputs, so you can detect drift and track trends over time; see the monitoring sketch below.
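
A minimal sketch of the reference-free conciseness judge above, assuming the OpenAI Python client and an illustrative model name (any chat-completion API and any of the judge prompts above can be substituted):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """Evaluate the following response for conciseness.
A concise response is clear and direct without unnecessary words.
Return a single word: either "concise" or "verbose".

Response:
{response}"""

def judge_conciseness(response_text: str, model: str = "gpt-4o-mini") -> str:
    """Ask the judge model for a single-label verdict. The model name is illustrative."""
    completion = client.chat.completions.create(
        model=model,
        temperature=0,  # keep the judge as deterministic as possible
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(response=response_text)}],
    )
    return completion.choices[0].message.content.strip().lower()

# Example usage:
# judge_conciseness("The meeting is at 3 pm.")  # expected: "concise"
```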
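
To align the judge with the golden dataset, one option is to run it over the human-labeled examples, track the agreement rate, and refine the prompt until agreement is acceptable. A rough sketch (the dataset entries and the 90% target are illustrative; `judge_conciseness` is the function sketched above):

```python
golden_dataset = [
    # (response text, human label): illustrative entries only
    ("The meeting is at 3 pm.", "concise"),
    ("Well, so, basically, to answer your question in a roundabout way, yes.", "verbose"),
]

def agreement_rate(dataset) -> float:
    """Fraction of examples where the judge's label matches the human label."""
    hits = sum(1 for text, label in dataset if judge_conciseness(text) == label)
    return hits / len(dataset)

# Refine the evaluation prompt and re-run until agreement is acceptable, e.g. >= 0.9.
print(f"judge/human agreement: {agreement_rate(golden_dataset):.0%}")
```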
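
For the in-production step, a sketch of judging a random sample of outputs and logging the verdicts so trends can be tracked over time (the 5% rate and logger name are arbitrary choices):

```python
import logging
import random

logger = logging.getLogger("llm_judge_monitor")
SAMPLE_RATE = 0.05  # judge ~5% of traffic; pick the rate as a cost/coverage trade-off

def maybe_judge(response_text: str) -> None:
    """Run the judge on a random sample of production responses and record the label."""
    if random.random() < SAMPLE_RATE:
        verdict = judge_conciseness(response_text)
        logger.info("judge_verdict=%s", verdict)
        # Aggregate these logs (e.g. daily share of "verbose") to spot drift over time.
```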

Advantages

  • High-quality evaluations that closely match human judgment.
  • Simple to set up because they don’t need reference answers.
  • Flexible: you can evaluate almost anything, as long as you define exactly what to assess.
  • Scalable: can handle many evaluations quickly.
  • Easy to adjust as criteria change.
  • Domain experts can participate in prompt creation.

Disadvantages

  • Probabilistic: slightly different prompts can lead to different outputs.
  • May suffer from self-bias, position bias (e.g. favoring the first response), or verbosity bias.
  • Require detailed definitions of what constitutes “good” or “bad” performance.
  • May not detect novel or emerging problem patterns outside the defined evaluation criteria.
  • Privacy risks when using a third-party LLM.
  • Cost: more expensive than rule-based evaluations, and can become very expensive at scale for complex tasks.

Services

Resources


table file.inlinks, file.outlinks from [[]] and !outgoing([[]])  AND -"Changelog"