LLM-as-a-Judge: When to Trust It
LLM-as-a-judge is reliable for cheap, scaled, relative grading on a well-specified rubric. It is unreliable for absolute quality calls, novel cases, and anything where its own biases (position, verbosity, self-preference) contaminate the score. Use it to triage, not to gate. The judge sorts the pile; a human decides the cases that matter.
I reach for an LLM judge constantly, and I trust it about as far as I have measured it. That is the whole posture of this piece. An LLM judge is a fast, cheap instrument with a known error profile. Treat it like a smoke detector, not a fire marshal. It tells you where to look. It does not sign off on the building.
Key takeaways
- LLM-as-a-judge is reliable for relative grading on narrow, checkable rubrics, and unreliable for absolute quality, novel cases, and anything its own biases touch.
- Four named biases break it: position, verbosity, self-preference, and calibration drift. All four produce confident, well-formatted scores that hide on a dashboard.
- Never judge with the same model family you are grading; self-preference adds a uniform tilt nothing else surfaces. For high-stakes launches, run a three-judge ensemble.
- Validate against human labels with Cohen's kappa before you trust a verdict. On well-scoped tasks judges reach κ of 0.75 to 0.83; below ~0.6 the judge is too noisy to gate anything, and the rubric is usually the first thing to fix.
- The pattern that works in production is hybrid: the judge triages every output, humans gate the high-stakes tail, and corrections feed the rubric back.
How an LLM judge actually works
LLM-as-a-judge means using one language model to grade the output of another against a rubric you write. You hand the judge a prompt, the candidate response, and a scoring instruction. It returns a verdict: a score, a pass/fail, or a preference between two answers. That is the entire mechanism. The power is that it scales human-style judgment to thousands of cases per minute at a fraction of a human reviewer's cost.
There are two common protocols, and the choice matters more than people expect. Pointwise scoring asks the judge to rate one response on an absolute scale. Pairwise comparison asks it to pick the better of two. Pairwise tends to track human preference more faithfully because a relative call is easier than calibrating to an abstract scale. But pairwise has its own hazard: a 2025 study on feedback protocols found pairwise preferences flip in roughly 35 percent of cases versus 9 percent for absolute scores, and pairwise judges are more easily fooled by distractor features a generator learns to exploit. Neither protocol is free. You pick the failure mode you can tolerate.
Here is the shape of a judge prompt I would actually ship. Note that it is narrow on purpose.
The instruction does one thing. It scores grounding, not vibes. It names what to ignore. It forces a structured output you can parse and audit. And the harness randomizes order on every call, because the judge has a thumb on the scale you have not seen yet.
Notice what the prompt does not do. It does not ask "is this a good answer." Good is a word that means everything and measures nothing. The moment your rubric contains a judgment a smart human would hesitate on, the judge stops being an instrument and starts being an oracle, and oracles are exactly what you cannot audit. Every line in a judge prompt should map to something you could check by hand on a sample. If you cannot check it by hand, the judge cannot either; it will just hide the guess inside confident JSON.
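A minimal sketch of a prompt and harness with those properties, in Python. The rubric wording, the JSON field names, and the helper names `build_judge_prompt` and `parse_verdict` are all illustrative assumptions, and the call to the judge model itself is left out because it depends on your provider.

```python
import json
import random

# Hypothetical rubric text: narrow, checkable, explicit about what to ignore.
JUDGE_TEMPLATE = """You are grading two answers for grounding only.
An answer is grounded if every factual claim is supported by the passage.
Ignore length, tone, and formatting.

Passage:
{passage}

Answer A:
{a}

Answer B:
{b}

Reply with JSON only: {{"winner": "A" or "B" or "tie", "unsupported_claims": [...]}}"""


def build_judge_prompt(passage, first, second, rng):
    """Return (prompt, slot_map) with the two candidates placed in random order."""
    candidates = [("first", first), ("second", second)]
    rng.shuffle(candidates)
    # slot_map records which candidate landed in which slot for this call.
    slot_map = {"A": candidates[0][0], "B": candidates[1][0]}
    prompt = JUDGE_TEMPLATE.format(
        passage=passage, a=candidates[0][1], b=candidates[1][1]
    )
    return prompt, slot_map


def parse_verdict(raw, slot_map):
    """Map the judge's slot-level verdict back to the original candidate name."""
    verdict = json.loads(raw)  # raises on malformed JSON, which is what you want
    winner = verdict["winner"]
    if winner == "tie":
        return "tie"
    if winner not in slot_map:
        raise ValueError(f"unexpected winner: {winner!r}")
    return slot_map[winner]
```

Storing `slot_map` alongside each verdict is what makes the run auditable: you can later check whether wins cluster in one slot rather than following one candidate.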
Where an LLM judge is trustworthy
An LLM judge earns its keep in four conditions, and they share a theme: the rubric is mechanical and the answer is checkable.
- Relative, not absolute. "Is A more grounded than B?" beats "Rate this 1 to 10." Comparison anchors the judge; absolute scores drift across runs.
- Narrow, checkable rubrics. Format compliance, schema validity, "does the answer cite the retrieved passage," refusal detection. These have near-binary ground truth.
- High volume, low stakes per call. Regression-testing 5,000 responses after a prompt change. No single verdict gates a deploy; the aggregate trend does.
- Triage and routing. Flag the bottom decile for human review. Even a noisy judge that surfaces 80 percent of the bad cases saves the reviewer most of the pile.
That last use is where the economics land. A human reviewing every output is a cost that grows linearly with traffic and never stops. A judge that filters the queue down to the cases worth a human's time turns a linear cost into a roughly fixed one. This is the same argument I make about why "a human reviews it" is not a plan: review has to scale with autonomy, and an unaided human does not.
The inverse of this list is just as useful. Do not trust a judge to certify medical, legal, or financial correctness, to rank creative quality, to make the final call on a safety refusal, or to score anything where the right answer is genuinely contested. Those are absolute-quality calls on high-stakes, often novel cases. They are exactly the conditions where the biases below do the most damage. In those domains a judge can still pre-sort the queue. It just cannot be the last signature.
The biases that break the judge
LLM judges fail in named, measurable ways. In 2025 and 2026, researchers documented these well enough that you have no excuse for being surprised. Reporting on bias benchmarks found frontier models exceeding 50 percent error rates on challenging bias tests. Here are the four that have cost me time.
- Position bias. The judge favors the answer in slot A (or slot B) regardless of quality. A study across 15 judges and ~150,000 instances found this is systematic, not random, and varies by judge and task. Mitigation: randomize order on every call, or better, run both orders and count a win only if it survives the swap.
- Verbosity bias. Longer answers score higher even when no more correct. The judge mistakes elaboration for quality. Mitigation: tell it to ignore length, and spot-check whether your scores correlate with token count.
- Self-preference bias. A judge rates outputs from its own model family higher. Research on self-preference bias traces this to perplexity: the judge prefers low-perplexity text, meaning text that is familiar to it. Applied 2026 reporting puts the tilt at a uniform 10 to 25 percent, and nothing else you do will surface it. The cardinal rule: never use the same family as judge and candidate. For high-stakes launches, run a three-judge ensemble across families and aggregate by majority vote.
- Calibration drift and novelty. On out-of-distribution cases the judge invents a standard. It has no anchor for an answer it has never seen scored, so it guesses with confidence.
The dangerous property is that all four produce confident, well-formatted verdicts. The judge does not flag its own uncertainty. A wrong score and a right score look identical on the dashboard. That is why an unvalidated judge is worse than no judge: it launders noise into a number people trust.
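Three of the mitigations above are mechanical enough to sketch in Python. The function names are hypothetical, and the code assumes verdicts have already been mapped from slots back to candidate names ("x", "y", or "tie"). The length check uses a plain Pearson correlation as one simple way to spot-check verbosity bias; a rank correlation would be an equally reasonable choice.

```python
from collections import Counter
from math import sqrt


def swap_consistent_winner(verdict_ab, verdict_ba):
    """Count a win only if the same candidate wins in both presentation orders."""
    if verdict_ab == verdict_ba and verdict_ab != "tie":
        return verdict_ab
    return "tie"


def majority_vote(verdicts):
    """Aggregate verdicts from judges of different families.

    Returns 'tie' when no verdict holds a strict majority.
    """
    winner, count = Counter(verdicts).most_common(1)[0]
    return winner if count > len(verdicts) / 2 else "tie"


def pearson(xs, ys):
    """Pearson correlation, e.g. between judge scores and answer token counts."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

A strong positive correlation between score and token count does not prove verbosity bias, since longer answers can be genuinely better, but it tells you which cases to read by hand.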
How to validate the judge against human labels
You do not get to trust the judge until you have measured it against humans on your task. A judge is one instrument inside a larger harness, and it earns trust the same way every other eval does: against ground truth. If you are building that harness from scratch, start with my complete guide to LLM evaluation and the broader question of which metrics actually matter and which ones lie. The validation protocol below is not complicated, and skipping it is the single most common mistake I see.
Build a calibration set of a few hundred cases sampled from real traffic. Have trusted humans label them with the same rubric the judge uses. Then run the judge on the same set and compute agreement, using Cohen's kappa rather than raw accuracy, because kappa corrects for the agreement you would get by chance. Raw "85 percent agreement" can be near-random if one label dominates. Kappa tells you the truth.
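To make the chance correction concrete: kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected if the two raters labeled independently at their own base rates. The Python sketch below computes it for two label lists. The labels and counts are invented for illustration, chosen so that one label dominates.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Observed agreement: fraction of cases where the labels match.
    p_o = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement: product of each rater's marginal rate, summed over labels.
    human_counts, judge_counts = Counter(human), Counter(judge)
    p_e = sum(human_counts[label] * judge_counts[label] for label in human_counts) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use a single label")
    return (p_o - p_e) / (1 - p_e)


# Invented calibration set: 90 human "pass" labels, 10 human "fail" labels.
human = ["pass"] * 90 + ["fail"] * 10
# The judge agrees on 84 of the passes but catches only 1 of the 10 fails.
judge = ["pass"] * 84 + ["fail"] * 6 + ["pass"] * 9 + ["fail"] * 1

raw_agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
kappa = cohens_kappa(human, judge)
print(f"raw agreement = {raw_agreement:.2f}, kappa = {kappa:.3f}")
```

Here raw agreement is 0.85 while p_e is 0.844, so kappa comes out near 0.04: the judge is barely better than guessing at the base rate, even though the headline number looks healthy.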
What counts as good? Recent work gives useful anchors. The Judge's Verdict benchmark measures judge capability through human-agreement kappa. In practice, applied 2025 and 2026 studies report substantial agreement in the 0.75 to 0.83 kappa range on well-scoped tasks: smart-home agent grading at κ = 0.83, patch evaluation at κ = 0.75. Those are tasks with tight rubrics and checkable answers. The rough 2026 consensus is that κ above 0.6 is acceptable for production and κ above 0.8 is strong. Below ~0.6, the judge is too noisy to gate anything; use it only to triage. Treat these as illustrative targets, not promises: your number depends on your rubric and your task.
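Those anchors can be encoded as an explicit policy so that nobody gates a deploy on an under-validated judge. This is a small Python sketch; the function name and the role labels are hypothetical, and the cutoffs are the illustrative 0.6 and 0.8 from above, passed as parameters because your own thresholds may differ.

```python
def judge_role(kappa, acceptable=0.6, strong=0.8):
    """Map a measured human-agreement kappa to how far the judge may be trusted."""
    if kappa > strong:
        return "strong"
    if kappa > acceptable:
        return "acceptable"
    # At or below the acceptable cutoff the judge only sorts the queue.
    return "triage-only"
```

Under these defaults the patch-evaluation figure of 0.75 lands in "acceptable" and the smart-home figure of 0.83 lands in "strong".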
Two disciplines make this hold. First, re-validate whenever you change the judge prompt, the judge model, or the task. A rubric edit can quietly tank your kappa. Second, when the judge and humans disagree, read the cases. Disagreement is a rubric signal, the same way I treat inter-rater disagreement in a production eval harness: it usually means the rubric is ambiguous, not that the humans are wrong.
The hybrid pattern: judge triages, humans gate the tail
The pattern I trust in production is not "judge or human." It is the judge handling volume and humans owning the cases that matter. Concretely:
- The judge scores everything. Cheap, fast, on every output. It produces a score and a confidence proxy (margin between A and B, or self-reported certainty).
- Clear passes ship. High-confidence, high-score, on-distribution cases clear automatically. This is most of your traffic.
- Humans gate the tail. The bottom decile, the low-confidence cases, anything novel or high-stakes, and a random audit slice route to a human. The human's verdict is authoritative.
- The tail feeds the rubric. Human corrections on the routed cases become new calibration labels. The judge gets re-validated against them. The loop tightens.
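The routing rules in that list can be sketched as a single function. Everything named here is an assumption for illustration: the `Judged` fields, the threshold defaults, and the 2 percent audit rate. A fixed `score_floor` stands in for the bottom decile; in practice you would derive that cutoff from your own score distribution.

```python
import random
from dataclasses import dataclass


@dataclass
class Judged:
    score: float              # judge score, assumed normalized to [0, 1]
    confidence: float         # confidence proxy, assumed normalized to [0, 1]
    novel: bool = False       # flagged as off-distribution upstream
    high_stakes: bool = False


def route(item, score_floor=0.8, confidence_floor=0.7, audit_rate=0.02, rng=random):
    """Return 'human', 'human_audit', or 'ship' for one judged output."""
    # Novel or high-stakes cases always go to a human, whatever the score.
    if item.high_stakes or item.novel:
        return "human"
    # Low score or low confidence: the tail the human gates.
    if item.score < score_floor or item.confidence < confidence_floor:
        return "human"
    # A random slice of clear passes is audited to keep the judge honest.
    if rng.random() < audit_rate:
        return "human_audit"
    return "ship"
```

Keeping the audit slice as its own label matters: those human verdicts are an unbiased sample of what the judge is passing, which is the data you need when you re-validate.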
This is where engineering meets the P&L. The judge converts review cost from linear-in-traffic to roughly fixed, and it does so without pretending the model is trustworthy on its own. You ship faster because most of the queue clears automatically, and you sleep at night because the expensive mistakes still hit a human before they hit a customer. Evaluation is the scarce skill in the judgment economy, and the judge is leverage on it, not a replacement for it. The full protocol, from sampling and blinding to kappa thresholds and the routing rules, is what I lay out in A Field Guide to Evals.
The honest trade-off: the hybrid only works if the routing is right. Set the threshold too loose and bad outputs ship under a green score. Set it too tight and every case routes to a human, which is the bottleneck you built the judge to avoid. The routing threshold is itself a thing you have to tune and monitor, and it drifts as traffic changes. There is no set-and-forget version of this.
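One way to make that trade-off visible is to sweep candidate thresholds over a human-labeled calibration set and look at both costs side by side. This is a Python sketch under simple assumptions: each case is a hypothetical `(judge_score, human_says_bad)` pair, and anything scoring below the threshold routes to a human.

```python
def sweep_thresholds(cases, thresholds):
    """For each threshold, return (threshold, routed_fraction, bad_shipped_fraction).

    cases: list of (judge_score, human_says_bad) pairs from a labeled set.
    """
    if not cases:
        raise ValueError("need at least one labeled case")
    n = len(cases)
    rows = []
    for t in thresholds:
        # Cases below the threshold go to a human reviewer.
        routed = sum(1 for score, _ in cases if score < t)
        # Cases at or above the threshold ship; count the ones humans call bad.
        bad_shipped = sum(1 for score, bad in cases if score >= t and bad)
        rows.append((t, routed / n, bad_shipped / n))
    return rows


# Invented calibration cases for illustration.
cases = [(0.95, False), (0.9, False), (0.85, True),
         (0.6, True), (0.4, True), (0.7, False)]
for t, routed, bad_shipped in sweep_thresholds(cases, [0.5, 0.8, 0.9]):
    print(f"threshold={t}: routed={routed:.2f}, bad shipped={bad_shipped:.2f}")
```

On this toy data, raising the threshold from 0.5 to 0.9 cuts bad outputs shipped from two in six to zero, while the share of cases routed to humans climbs from one in six to four in six. Rerunning the sweep on fresh labels is a cheap way to notice when the threshold has drifted.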
Frequently asked questions
Is LLM-as-a-judge reliable?
It is reliable for relative grading on narrow, checkable rubrics, and unreliable for absolute quality calls or novel cases. Validate it against human labels using Cohen's kappa before you trust any verdict. On well-scoped tasks, judges reach κ in the 0.75 to 0.83 range; below ~0.6 they are too noisy to gate anything.
Can an LLM evaluate another LLM fairly?
Only with guardrails. Using an LLM to evaluate an LLM introduces self-preference bias: a judge rates its own model family higher because that text is familiar to it. Judge with a different model family than the one you are grading, randomize answer order to kill position bias, and tell the judge to ignore length.
What biases affect LLM grading?
Four named ones: position bias (favoring an answer by its slot), verbosity bias (rewarding length over correctness), self-preference bias (preferring its own family's text), and calibration drift on out-of-distribution cases. All four produce confident, well-formatted scores, so they hide on a dashboard unless you measure for them.
Should I use pairwise or pointwise scoring?
Pairwise comparison tracks human preference more faithfully and avoids scale drift, but flips more often (~35 percent of cases) and is more exploitable by distractor features. Pointwise absolute scoring is more robust to manipulation but drifts across runs. Pick the failure mode you can tolerate for your task, and validate either way.
If you are wiring an LLM judge into a real pipeline, the part that pays off is the instrumentation around it: the kappa checks, the routing thresholds, the audit slice, the drift alarms. That is the AI observability and monitoring work a Devlyn pod builds in from day one, so the judge stays honest as your traffic moves. Start by measuring your judge against humans. The number you get back will tell you whether you are triaging or gating.
