Evals that predict production, not vanity

Most eval suites measure the wrong thing and pass right up until launch. Here is the harness I actually trust before I ship.

Every quarter I talk to engineering teams that are genuinely proud of their eval suite. Accuracy above ninety percent, a green CI badge, clean dashboards. Then they ship and something breaks in a way no metric predicted. The model hallucinates a product SKU. It trips over an edge-case utterance that a real customer submitted in week two of the pilot. The team goes back, adds a test, and calls it a lesson learned.

That is not an eval problem. That is a sampling problem dressed up as a measurement problem, and the distinction matters enormously at the speed Devlyn operates. We are AI-Native from the ground up, and the engineering team does not have the luxury of a separate QA org to catch what the model misses. Our senior engineers own production readiness end-to-end. That means the eval harness has to do work that would otherwise fall to a QA team that does not exist.

This essay is about the harness I actually trust: not the one that looks good in a demo, but the one that has earned the right to gate a production deploy.

The sampling problem no one talks about

Most eval suites are built bottom-up: a developer writes cases while they are building the feature, a PM adds a few edge cases during review, and the set accumulates. The result is a distribution that reflects what the team imagined users would do, not what users actually do. Those two distributions can diverge badly.

The fix is mechanical but requires discipline: sample your eval set from real production traffic, freeze it, version it, and never let it grow organically again. At Devlyn we run a weekly job that samples a slice of production requests, stratified by intent cluster and by confidence score, and deliberately over-weighted toward sessions where the model's confidence was low or where a human reviewer flagged a correction. That slice gets frozen as a named artifact: eval-set-2026-w23-v1.jsonl. It does not change. If we want to add new cases, we cut a new version.
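The sample-then-freeze step can be sketched roughly as follows. This is a minimal illustration, not our production job: the record fields (intent_cluster, confidence, human_corrected), the 3x boost, and the 0.5 confidence cutoff are all assumptions made for the example.

```python
import hashlib
import json
import random
from collections import defaultdict


def sample_eval_set(requests, per_cluster, low_conf_cutoff=0.5, boost=3.0, seed=23):
    """Stratify by intent cluster; over-weight low-confidence or corrected sessions.

    Field names, cutoff, and boost are hypothetical placeholders.
    """
    rng = random.Random(seed)  # fixed seed: the same traffic yields the same set
    by_cluster = defaultdict(list)
    for req in requests:
        by_cluster[req["intent_cluster"]].append(req)

    sampled = []
    for cluster in sorted(by_cluster):
        pool = list(by_cluster[cluster])  # copy so the input is not mutated
        weights = [
            boost if (r["confidence"] < low_conf_cutoff or r["human_corrected"]) else 1.0
            for r in pool
        ]
        # Weighted sampling without replacement: draw one index, remove it, repeat.
        for _ in range(min(per_cluster, len(pool))):
            idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
            sampled.append(pool.pop(idx))
            weights.pop(idx)
    return sampled


def freeze(cases):
    """Serialize cases to JSONL and return (payload, sha256 hex digest)."""
    payload = "\n".join(json.dumps(c, sort_keys=True) for c in cases) + "\n"
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return payload, digest
```

Writing the payload to a versioned file name and recording the digest next to it is one way to make "it does not change" checkable: if the digest ever differs, someone edited the ruler.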

Freezing the set sounds obvious until you realize what it implies: your score on a frozen set moves only when the model or the rubric moves. There is no sneaking in new easy cases to bring the number back up. That is the point. You want a fixed ruler, not an elastic one.

A moving eval set is not a ruler. It is a rubber band. The number it reports is a fact about the test, not the model.

The versioned artifact also gives you something you rarely get in AI projects: a straight historical comparison. You can ask whether the model you are about to deploy is better or worse than the one you deployed six months ago, on exactly the same questions, scored by exactly the same rubric. That is a sentence most teams cannot say with confidence.

Blind your labels before you score anything

When human raters score model outputs, they need to be blind to which model version produced each response. This sounds like an academic concern until you watch an engineer give a subtle benefit-of-the-doubt to outputs from a model they helped tune. It happens. It is not dishonesty; it is just human cognition.

Our rating pipeline strips the model version, shuffles the output order, and inserts a consistent proportion of gold-standard human responses into the batch without labeling them as such. Raters do not know whether they are scoring a model output or a reference human response. This matters for two reasons.

First, it catches rater drift. If your raters start scoring the planted human responses below the threshold that would pass a model, your rubric has broken down: either the raters have gotten sloppy or the rubric is no longer calibrated to what good actually looks like. That is a signal to stop and recalibrate before you score anything else.

Second, it gives you a concrete ceiling. Human-to-human agreement on the same task, scored by your rubric, is the practical ceiling on what your model can credibly score. If inter-rater agreement among humans is eighty-two percent on your hardest intent cluster, a model that hits eighty-five percent on that cluster is probably gaming the rubric, not genuinely exceeding human ability. That is worth investigating rather than celebrating.
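The blinding steps described above (strip the version, plant gold responses, shuffle) can be sketched like this. It is an illustration under assumptions: the item fields (prompt, text, model_version), the 20 percent gold ratio, and both function names are hypothetical.

```python
import random


def build_blind_batch(model_outputs, gold_responses, gold_ratio=0.2, seed=0):
    """Return (rater_items, answer_key).

    rater_items carry only an opaque id, the prompt, and the response text.
    answer_key maps id -> provenance and must be kept away from raters.
    """
    if not 0 <= gold_ratio < 1:
        raise ValueError("gold_ratio must be in [0, 1)")
    rng = random.Random(seed)
    # Number of gold items so that they make up roughly gold_ratio of the batch.
    n_gold = round(len(model_outputs) * gold_ratio / (1 - gold_ratio))
    planted = rng.sample(gold_responses, min(n_gold, len(gold_responses)))

    rater_items, answer_key = [], {}
    for source, items in (("model", model_outputs), ("gold", planted)):
        for item in items:
            item_id = "%032x" % rng.getrandbits(128)  # opaque, carries no provenance
            rater_items.append(
                {"id": item_id, "prompt": item["prompt"], "text": item["text"]}
            )
            answer_key[item_id] = {
                "source": source,
                "model_version": item.get("model_version"),
            }
    rng.shuffle(rater_items)  # remove any ordering cue
    return rater_items, answer_key


def drifting_gold_items(scores, answer_key, pass_score):
    """IDs of planted gold responses that raters scored below the passing bar."""
    return sorted(
        i for i, s in scores.items()
        if answer_key[i]["source"] == "gold" and s < pass_score
    )
```

A non-empty result from drifting_gold_items is the stop-and-recalibrate signal: the raters just failed a response that is good by construction.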

The book A Field Guide to Evals covers label-blinding protocols in detail, including how to handle cases where the model output and the human reference are both correct but stylistically different, which is where most rubrics quietly collapse.

Inter-rater disagreement is a rubric signal, not noise

When two experienced human raters disagree on a case, the instinct is to average their scores or escalate to a tiebreaker. Both responses treat the disagreement as an inconvenience to resolve. I treat it as data.

A cluster of disagreement on a particular intent type means one of three things: the rubric is ambiguous for that intent, the correct answer is genuinely context-dependent in ways the rubric does not capture, or the task is hard enough that reasonable people disagree. All three are useful to know before you deploy a model on that task. None of them should be smoothed over.

We track inter-rater agreement by intent cluster and over time. When agreement on a cluster drops below a threshold (we use seventy-five percent as our floor), we pause scoring on that cluster and run a rubric review. Sometimes this takes an afternoon. Once it took two weeks because the disagreement exposed a genuine product ambiguity about what the correct behavior should be in a specific scenario. Finding that ambiguity before the model did was unambiguously worth it.
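A per-cluster agreement check along these lines might look like the sketch below. It uses raw pairwise agreement (the fraction of rater pairs on the same case who gave the same label), which is my simplification for illustration; a chance-corrected statistic such as Cohen's kappa would be a reasonable substitute. The record fields are assumed.

```python
from collections import defaultdict
from itertools import combinations


def agreement_by_cluster(ratings):
    """ratings: dicts with case_id, intent_cluster, rater, label (assumed fields).

    Returns {cluster: fraction of same-case rater pairs that gave the same label}.
    """
    labels_by_case = defaultdict(list)
    cluster_of = {}
    for r in ratings:
        labels_by_case[r["case_id"]].append(r["label"])
        cluster_of[r["case_id"]] = r["intent_cluster"]

    agree, total = defaultdict(int), defaultdict(int)
    for case_id, labels in labels_by_case.items():
        cluster = cluster_of[case_id]
        for a, b in combinations(labels, 2):
            total[cluster] += 1
            agree[cluster] += int(a == b)
    return {c: agree[c] / total[c] for c in total}


def clusters_to_pause(agreement, floor=0.75):
    """Clusters whose agreement fell below the floor and need a rubric review."""
    return sorted(c for c, rate in agreement.items() if rate < floor)
```

Logging the full agreement dict on every scoring run, not just the paused clusters, is what makes the trend over time visible.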

The Eval-Driven Development framework treats inter-rater disagreement as a first-class artifact, something to log, trend, and review at the same cadence as accuracy metrics. That posture has influenced how we structure our rubric review cycles.

Over-sample the adversarial tail relentlessly

The hardest cases in production are not randomly distributed. They cluster. Users who are frustrated tend to phrase requests in unusual ways. Edge cases in your ontology attract certain user populations. Holiday traffic patterns expose latency cliffs that normal sampling never sees. A uniform random sample will under-represent every one of these.

Our eval set construction deliberately over-samples from four buckets: cases where the model's confidence score was in the bottom quartile; cases where a human reviewer submitted a correction; cases that are syntactically adversarial (unusual punctuation, code-switching between languages, truncated input); and cases that previously caused a production incident, even if we fixed the root cause. The last bucket is the one teams most often skip because the incident feels resolved. It is not resolved until a future model version passes those exact cases on a held-out eval.
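The four buckets can be sketched as tags plus per-bucket quotas. Everything concrete here is an assumption for illustration: the boolean flags are presumed to be computed upstream (detecting syntactically adversarial input is its own problem), and the quota numbers are whatever your team picks.

```python
import random
import statistics


def tail_buckets(case, confidence_q1):
    """Return the set of adversarial-tail buckets a case belongs to."""
    buckets = set()
    if case["confidence"] <= confidence_q1:
        buckets.add("low_confidence")
    if case.get("human_corrected"):
        buckets.add("human_correction")
    if case.get("syntactically_adversarial"):
        buckets.add("syntactically_adversarial")
    if case.get("past_incident"):
        buckets.add("past_incident")
    return buckets


def draw_tail(cases, quotas, seed=0):
    """Draw up to quotas[bucket] cases per bucket, without duplicates."""
    rng = random.Random(seed)
    # First quartile cut point of the observed confidence scores.
    cutoff = statistics.quantiles([c["confidence"] for c in cases], n=4)[0]
    chosen = {}
    for bucket, quota in quotas.items():
        pool = [
            c for c in cases
            if bucket in tail_buckets(c, cutoff) and c["case_id"] not in chosen
        ]
        for c in rng.sample(pool, min(quota, len(pool))):
            chosen[c["case_id"]] = c
    return list(chosen.values())
```

Setting the past_incident quota at or above the number of known incident cases keeps every one of them in the set, which matches the point above: the incident is not resolved until a later model passes those exact cases.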

Over-sampling the tail is a tradeoff: your aggregate accuracy metric will look worse than if you used a uniform sample. That is a feature, not a bug. A metric that reflects your hardest real-world traffic is more honest than one that reflects your average traffic. Ship the model that passes the hard set, not the model that has the prettiest aggregate number.

The metric worth reporting to the business is not aggregate accuracy. It is model-versus-trusted-human disagreement on a frozen production-sampled set, tracked over time.

The one metric worth reporting to the business

Leadership wants a number. That is legitimate. The question is which number tells the truth.

Aggregate accuracy on your eval set is affected by your sampling strategy, your rubric, your rater pool, and the model, four variables at once. When the number moves, you often cannot say which variable moved it. That makes it a poor basis for a go/no-go decision.

The metric I report instead: model-versus-trusted-human disagreement on the frozen production-sampled set, tracked over time. Specifically: for each case in the frozen set, a panel of calibrated senior human raters produces a reference answer under the blinded protocol. The model's output is scored against that reference by a second panel of raters. The disagreement rate (the fraction of cases where the model and the human panel diverged beyond a tolerance threshold) is the number that goes in the weekly report.

This metric has properties that make it worth tracking. It is anchored to a fixed distribution (the frozen set), so changes in the number reflect changes in the model, not changes in the test. It is anchored to human performance, so it has a meaningful zero (full agreement with the panel) and a meaningful reference point (the rate at which trusted humans disagree with each other). And it is directional: if it goes up, something got worse; if it goes down, something got better, and you can investigate exactly which cases changed.
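Computing the rate and the case-level diff between two runs is simple enough to sketch. I am assuming here that the second panel's judgment arrives as one numeric divergence score per case; that representation, the tolerance value, and the function names are illustrative.

```python
def disagreement_rate(divergence_by_case, tolerance):
    """Fraction of frozen-set cases where the model diverged beyond tolerance.

    divergence_by_case: {case_id: panel-assigned divergence score} (assumed shape).
    Returns (rate, set of diverged case ids).
    """
    if not divergence_by_case:
        raise ValueError("frozen set is empty")
    diverged = {cid for cid, d in divergence_by_case.items() if d > tolerance}
    return len(diverged) / len(divergence_by_case), diverged


def diff_runs(prev_diverged, curr_diverged):
    """Which cases regressed and which were fixed between two model versions."""
    return {
        "regressed": sorted(curr_diverged - prev_diverged),
        "fixed": sorted(prev_diverged - curr_diverged),
    }
```

The diff matters as much as the rate. Two candidates can post the same disagreement rate while failing different cases, and only the case-level view shows that.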

When a team asks me how they know whether their model is ready to ship, the answer is this metric, below a threshold that the product team and engineering agree on before the eval runs, not after. Setting the threshold after you see the result is not evaluation; it is rationalization.

Running the harness: what it actually looks like

Here is an abbreviated output from our eval runner against a frozen set. The numbers are realistic but not from a specific live system.

# eval run against frozen set eval-set-2026-w23-v1.jsonl
python -m devlyn_eval.runner \
    --suite eval-set-2026-w23-v1.jsonl \
    --model prod-candidate-2026-06-14 \
    --rater-pool senior-3

# results summary
cases evaluated     847
recall@1            0.871       # up 0.009 vs prior candidate
recall@3            0.934
p95 latency         1,840 ms    # +120 ms vs baseline, flag for review
human disagree      6.2%        # threshold 8.0%, PASS
adversarial tail    14.1%       # threshold 18.0%, PASS
inter-rater agree   81.3%       # floor 75.0%, PASS
verdict             GATE CLEAR  # deploy gated on p95 review

A few things worth noting in that output. The p95 latency flag does not block the deploy gate, but it surfaces for a mandatory human review before the deploy proceeds. We have shipped models with higher latency when the product team accepted the tradeoff explicitly; we have also pulled deploys at this stage when the latency increase turned out to trace back to an infrastructure change that nobody had caught. The flag earns its place by making the tradeoff visible rather than implicit.
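The split between blocking checks and review-only flags can be sketched as below. This is not the runner's actual code. The thresholds for the three blocking checks come from the output above; the 100 ms latency budget, the check names, and the "GATE BLOCKED" string are assumptions for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    name: str
    value: float
    limit: float
    higher_is_better: bool = False
    blocking: bool = True  # non-blocking checks only raise a review flag

    def ok(self):
        if self.higher_is_better:
            return self.value >= self.limit
        return self.value <= self.limit


def evaluate_gate(checks):
    """Blocking failures stop the deploy; non-blocking ones force a human review."""
    failed = [c.name for c in checks if c.blocking and not c.ok()]
    needs_review = [c.name for c in checks if not c.blocking and not c.ok()]
    verdict = "GATE BLOCKED" if failed else "GATE CLEAR"
    return {"verdict": verdict, "failed": failed, "needs_review": needs_review}


# Limits are fixed here, before any eval result exists.
checks = [
    Check("human_disagree_pct", 6.2, 8.0),
    Check("adversarial_tail_pct", 14.1, 18.0),
    Check("inter_rater_agree_pct", 81.3, 75.0, higher_is_better=True),
    Check("p95_latency_delta_ms", 120, 100, blocking=False),  # hypothetical budget
]
result = evaluate_gate(checks)
```

Keeping the limits in version control next to the frozen set is one way to enforce that thresholds are agreed before the run rather than after it.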

The adversarial tail number is always higher than the aggregate disagreement rate. That is expected: those cases are harder. The question is whether it is improving over model iterations, and by how much. A model that improves aggregate accuracy while holding adversarial tail disagreement flat has not actually gotten better where it matters.

Senior engineers own this, not a tooling team

The failure mode I see most often in mid-sized teams is treating the eval harness as infrastructure, something a platform team owns, something that runs in CI and produces a number that engineers passively consume. That posture produces eval suites that measure the wrong things with great precision.

At Devlyn, the engineers who own a model's behavior in production own the eval suite for that behavior. They write the rubric. They sit in on rater calibration sessions. They review the disagreement reports. They decide when a rubric needs revision and they do the revision. This is not optional work that happens when there is time. It is the work. Shipping a model without understanding the eval suite that gated it is the same as shipping code without understanding the tests.

That stance does not scale if your engineers are using eval infrastructure that requires a PhD to modify. It scales when the infrastructure is legible enough that a senior engineer can trace any metric back to the rubric choices and sampling decisions that produced it. Legibility is an engineering requirement, not a nice-to-have.

The broader argument, that Human in the Loop Is Not a Plan, applies here directly: you cannot outsource production judgment to a human review queue and call it a quality system. The model either meets the bar on your frozen, adversarially sampled, human-calibrated eval set, or it does not ship. Full stop.

What this does not solve

No eval harness predicts every production failure. Distribution shift will always eventually break a frozen set: the world changes, user behavior changes, and a set sampled in one quarter may not represent the traffic you see two quarters later. The answer is a regular cadence of set refreshes, versioned the same way code releases are versioned, with a deliberate overlap period where you run both the old set and the new set and compare results. Pairing that cadence with live production observability is how you notice the drift before the next refresh, rather than after.
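During the overlap period, the comparison between the old and new set can be as plain as a per-cluster report. A rough sketch, assuming each result is an (intent_cluster, disagreed) pair and that a five-point gap counts as divergence; both choices are illustrative.

```python
from collections import Counter


def cluster_rates(results):
    """results: iterable of (intent_cluster, disagreed) pairs (assumed shape)."""
    totals, misses = Counter(), Counter()
    for cluster, disagreed in results:
        totals[cluster] += 1
        misses[cluster] += int(disagreed)
    return {c: misses[c] / totals[c] for c in totals}


def overlap_report(old_results, new_results, max_gap=0.05):
    """Compare one candidate's disagreement on the old and the refreshed set."""
    old, new = cluster_rates(old_results), cluster_rates(new_results)
    report = {}
    for cluster in sorted(set(old) | set(new)):
        if cluster not in old:
            report[cluster] = "new in refreshed set"
        elif cluster not in new:
            report[cluster] = "missing from refreshed set"
        elif abs(new[cluster] - old[cluster]) > max_gap:
            report[cluster] = "diverged"
        else:
            report[cluster] = "consistent"
    return report
```

Clusters marked as diverged or new are where the old set has stopped describing current traffic, so they are the first places to look before retiring it.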

Latent failures (cases where the model produces a confident, plausible, incorrect answer) are also hard to catch. A strong recall metric will not surface them if your reference answers are wrong. This is why calibrated human raters matter more than automated scoring at the margin: a rater who knows the domain will catch the plausible-but-wrong case that an LLM-as-judge might wave through.

And evals do not tell you whether you are building the right thing. A model that perfectly answers the questions users are asking can still be failing at the product goal if those questions are the wrong questions. That is a product problem, not an eval problem, but it is worth naming so that a green eval result does not produce false confidence about product-market fit.

What the harness described here does provide: a reliable, operator-grade gate between a model candidate and a production deploy. It will not catch everything. It will catch most of the things that kill launches, and it will catch them before your users do. At Devlyn, that is the bar we hold, and we do not lower it to ship faster. The cost of a bad launch in an AI-Native product is not one incident; it is the erosion of the trust that makes the entire product possible.

Build the suite that predicts production. Freeze it. Trust the number it gives you, not the one you wished it gave you.