A Field Guide to Evals

Why demos, benchmarks, and small handpicked test sets fail to predict production behavior.

The Failure Is Usually Not the Model

The first evaluation suite most teams build is a comfort object. It contains the examples that were already discussed in meetings, the prompts that made the demo look good, and the answers everyone agrees are obviously right. It is useful in the way a smoke test is useful. It tells the team the system can still breathe. It does not tell the team whether the product is ready for production.

Production is not a larger demo. Production adds user intent the team did not predict, source material that is stale or contradictory, policy questions that were not settled, angry customers, partial context, ambiguous tasks, retries, and operational pressure. A model that behaves well on a tidy evaluation file can still fail when the product has to decide whether to answer, ask for more context, escalate, refuse, cite a source, call a tool, or stop.

This is why evals fail in production: they are built around examples instead of decisions. A serious eval does not only ask, "did the answer look good?" It asks, "does this measurement tell us whether to ship, hold, roll back, narrow the rollout, change policy, or repair a system boundary?" If the eval cannot change a decision, it is not a production eval yet.

OpenAI's evals guidance describes evals as repeatable tests made from data and graders. That framing is necessary. It gives teams the mechanics: examples, expected behavior, grading, and comparison. The production question is what those mechanics are for. In a real product, the eval must become decision infrastructure. It must say what changed, where it changed, how much it matters, and who owns the next action.

The Demo Distribution Is Not the Production Distribution

The demo distribution is shaped by people who already know the product. They choose inputs that show the intended path. They usually avoid requests that are unclear, adversarial, incomplete, outside policy, or tied to a messy customer history. Even when they try to be honest, they still tend to select memorable examples rather than representative ones.

Production traffic is shaped by users who do not share the team's mental model. They ask questions in their own language. They mix intents. They include irrelevant context. They assume the system remembers something it does not remember. They paste documents that contain instructions, contradictions, and obsolete facts. They use the product while tired, frustrated, rushed, or under business pressure.

Figure: Hand-drawn workflow showing demo examples moving through traffic, edge cases, incidents, and a release gate. Caption: Production evals start where demo checks stop: real traffic, edge cases, and incidents have to flow back into the gate that decides whether a release is safe.

The gap between those distributions is where many eval suites become misleading. A team can improve its demo score while the real product becomes riskier. It can optimize for fluent answers while factual support declines. It can reduce refusal rate while increasing unsupported claims. It can improve average quality while damaging the small set of workflows that drive support tickets or revenue risk.
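A small worked example makes the last point concrete. The segment names and counts here are invented for illustration; the sketch only shows the arithmetic by which an overall pass rate can rise while a small, high-risk segment regresses.

```python
from collections import defaultdict

def make_results(counts):
    """Expand {segment: (passed, total)} into a flat list of (segment, passed) rows."""
    rows = []
    for segment, (passed, total) in counts.items():
        rows += [(segment, True)] * passed + [(segment, False)] * (total - passed)
    return rows

def pass_rates(rows):
    """Return (overall pass rate, {segment: pass rate})."""
    totals, passes = defaultdict(int), defaultdict(int)
    for segment, passed in rows:
        totals[segment] += 1
        passes[segment] += passed  # True counts as 1, False as 0
    overall = sum(passes.values()) / sum(totals.values())
    return overall, {s: passes[s] / totals[s] for s in totals}

# Hypothetical numbers: a large low-risk segment and a small high-risk one.
baseline = make_results({"general_faq": (160, 180), "billing_dispute": (18, 20)})
candidate = make_results({"general_faq": (175, 180), "billing_dispute": (13, 20)})

base_overall, base_seg = pass_rates(baseline)
cand_overall, cand_seg = pass_rates(candidate)

print(f"overall: {base_overall:.2f} -> {cand_overall:.2f}")
for seg in base_seg:
    print(f"{seg}: {base_seg[seg]:.2f} -> {cand_seg[seg]:.2f}")
```

With these numbers the aggregate improves while the billing segment drops sharply, which is the pattern an average-only report would hide.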

The lesson from CheckList is useful here. Behavioral testing should inspect capabilities, invariances, and directional expectations, not only aggregate performance. In product terms, this means the eval should test what the system must preserve when the wording changes, what should change when the facts change, and what must never happen even if the model sounds confident.
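The two test types named above can be sketched in a few lines. This is not the CheckList library's API; `toy_system` is a hypothetical stand-in for the product, and the 30-day refund rule is an assumption made only so the tests have something to check.

```python
def toy_system(question: str, days_since_purchase: int) -> str:
    """Hypothetical stand-in for the product: refunds allowed within 30 days."""
    if "refund" not in question.lower():
        return "escalate"
    return "eligible" if days_since_purchase <= 30 else "not_eligible"

def invariance_test(system, paraphrases, days):
    """Wording changes, facts do not: every paraphrase must get the same answer."""
    answers = {system(q, days) for q in paraphrases}
    return len(answers) == 1

def directional_test(system, question, days_before, days_after,
                     expected_before, expected_after):
    """Facts change: the answer must change in the expected direction."""
    return (system(question, days_before) == expected_before
            and system(question, days_after) == expected_after)

paraphrases = [
    "Can I get a refund?",
    "I'd like a REFUND please.",
    "Is a refund possible for my order?",
]

invariance_ok = invariance_test(toy_system, paraphrases, days=10)
directional_ok = directional_test(
    toy_system, "Can I get a refund?", 10, 45, "eligible", "not_eligible"
)
print(invariance_ok, directional_ok)
```

In a real suite the paraphrases would come from sampled traffic rather than from the builder's imagination, for the reasons given in this section.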

Benchmarks Are Not Release Gates

Benchmarks can be useful. They help compare general capabilities. They can show whether a model is broadly better at math, coding, retrieval, reasoning, safety, or instruction following. They can help teams choose a starting model or reject a model that is clearly unfit.

But a benchmark is not a release gate for your product. It does not know your users, your policies, your source material, your cost structure, your latency promise, your support burden, or your sales commitments. It cannot tell you whether the new version is safe for a regulated workflow, a high-value customer segment, or a support path where a wrong answer creates a refund, escalation, or legal exposure.

The HELM project is valuable because it evaluates across scenarios and metrics rather than pretending one number captures model quality. That is the right instinct for production systems. Quality is multidimensional. A product eval may need to measure factuality, citation quality, refusal behavior, policy compliance, tone, latency, cost, reviewer disagreement, and downstream task completion. A single score can be useful only after the team knows what it hides.

Release gates should be built from the product's own risk. If the system answers policy questions, the gate should include policy-sensitive examples. If the system uses retrieval, the gate should include source quality and answerability. If the system can call tools, the gate should include permission boundaries and auditability. If the system handles customer support, the gate should include durable resolution, not only first-response fluency.

The Product Promise Comes First

Before building an eval, write the product promise in plain language. The promise should say what work the system does, for whom, with what evidence, under which constraints, and what happens when the answer is not ready. Without that promise, the eval will drift toward whatever is easiest to score.

For example, "answer customer questions" is not a product promise. It is a category. A better promise is: "For billing support questions, answer only from the current billing policy and the customer's account state; cite the policy section used; do not invent account data; escalate when the policy and account state conflict." That promise gives the eval something to test.
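One way to see why that promise is testable is to turn each clause into a check. The sketch below is a minimal illustration, not a complete grader; the `Case` and `Response` records, their field names, and the violation labels are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

@dataclass
class Case:
    policy_sections: Set[str]            # sections of the current billing policy
    account_fields: Set[str]             # fields present in the account state
    policy_conflicts_with_account: bool  # labeled by a reviewer

@dataclass
class Response:
    action: str                          # "answer" or "escalate"
    cited_section: Optional[str] = None
    account_fields_used: Set[str] = field(default_factory=set)

def check_promise(case: Case, resp: Response) -> List[str]:
    """Return the promise clauses this response violates (empty list means none)."""
    violations = []
    if case.policy_conflicts_with_account and resp.action != "escalate":
        violations.append("should_escalate_on_conflict")
    if resp.action == "answer":
        if resp.cited_section not in case.policy_sections:
            violations.append("missing_or_unknown_citation")
        if not resp.account_fields_used <= case.account_fields:
            violations.append("invented_account_data")
    return violations

case = Case({"4.2", "4.3"}, {"plan", "last_invoice"}, policy_conflicts_with_account=False)
good = Response("answer", "4.2", {"plan"})
bad = Response("answer", None, {"plan", "loyalty_tier"})

print(check_promise(case, good))
print(check_promise(case, bad))
```

Each violation label corresponds to one clause of the promise, so a failure already says which part of the promise was broken.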

The promise also reveals the cost of failure. A wrong answer in a brainstorming assistant may be a low-cost nuisance. A wrong answer in billing, healthcare, financial advice, compliance, security operations, or legal workflow can create expensive downstream work. The eval should not weigh those failures the same way.

The NIST AI Risk Management Framework is useful because it separates governance, mapping, measurement, and management. The production promise belongs in the mapping work. It identifies the context and consequence. The eval belongs in the measurement work. The release gate belongs in the management work. If the promise, the eval, and the gate are disconnected, the team may measure behavior without managing risk.

A Good Eval Predicts a Decision

The strongest test of an eval is not whether it has a high score. The strongest test is whether the team knows what action follows each result. If a score improves by two points, what changes? If a high-risk segment regresses by five points, who blocks the release? If reviewer disagreement rises, does the team repair the rubric, quarantine the examples, or change product policy? If retrieval evidence is missing, does the team fix the corpus or change the answer behavior?

An eval that predicts a decision has five properties:

It is tied to a product promise.
It contains examples that represent real work and real risk.
It separates criteria that fail for different reasons.
It reports results by segment, not only by average.
It names the release action attached to each threshold.
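The last two properties can be sketched as a small gate table: each segment has a threshold, an action that follows when the threshold is missed, and an owner. The segment names, thresholds, owners, and the ordering of actions by severity are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Gate:
    segment: str
    min_pass_rate: float
    action_if_below: str  # one of SEVERITY
    owner: str

# Least to most severe; the most severe triggered action wins.
SEVERITY = ["ship", "narrow_rollout", "hold"]

def decide(gates, segment_pass_rates):
    """Return (release action, list of triggered gates).

    A segment with no reported result triggers its gate, so the
    decision fails closed when a measurement is missing.
    """
    triggered = []
    for gate in gates:
        rate = segment_pass_rates.get(gate.segment)
        if rate is None or rate < gate.min_pass_rate:
            triggered.append(gate)
    action = max((g.action_if_below for g in triggered),
                 key=SEVERITY.index, default="ship")
    return action, triggered

gates = [
    Gate("billing_dispute", 0.85, "hold", "billing-pm"),
    Gate("general_faq", 0.90, "narrow_rollout", "support-lead"),
]

action, triggered = decide(gates, {"billing_dispute": 0.65, "general_faq": 0.97})
print(action, [(g.segment, g.owner) for g in triggered])
```

The point of the structure is that every threshold carries its action and owner with it, so a regression never arrives without a next step.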

This is where many teams lose discipline. They build a scorecard but never define thresholds. They review failures but never assign owners. They compare models but never inspect which examples changed. They add more examples but never retire stale ones. The eval grows in size while shrinking in authority.

The goal is not to make the eval large. The goal is to make it consequential. A 60-case eval that blocks the right release is more valuable than a 5,000-case suite no one trusts when the decision is hard.

The Real Unit Is the Failure Class

Production failures arrive as individual examples, but they should be managed as classes. A single bad answer may be caused by missing retrieval evidence. Another may be caused by ambiguous policy. Another may be caused by a prompt that over-instructs certainty. Another may be caused by a grader that rewards confident language. Treating all of these as "wrong answer" hides the repair path.

Failure classes should be part of the eval schema. A useful review process might classify failures as:

Unsupported claim.
Wrong source.
Missing source.
Source conflict.
Policy ambiguity.
Unsafe tool action.
Refusal failure.
Over-refusal.
Format violation.
Latency or timeout failure.
Reviewer disagreement.
Product scope mismatch.

Each class points to a different owner. Unsupported claims may require grounding rules or retrieval changes. Policy ambiguity requires product or legal ownership. Unsafe tool action requires permission design. Format violations may require schema validation. Reviewer disagreement may require rubric repair. Scope mismatch may require the product to stop accepting certain tasks.
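Putting the failure class into the eval schema can be as simple as an enumeration plus a lookup from class to repair path. The sketch below uses the twelve classes listed above and only the repair paths this section names; classes the text does not assign are deliberately left as "unassigned" rather than guessed.

```python
from collections import Counter
from enum import Enum

class FailureClass(Enum):
    UNSUPPORTED_CLAIM = "unsupported_claim"
    WRONG_SOURCE = "wrong_source"
    MISSING_SOURCE = "missing_source"
    SOURCE_CONFLICT = "source_conflict"
    POLICY_AMBIGUITY = "policy_ambiguity"
    UNSAFE_TOOL_ACTION = "unsafe_tool_action"
    REFUSAL_FAILURE = "refusal_failure"
    OVER_REFUSAL = "over_refusal"
    FORMAT_VIOLATION = "format_violation"
    LATENCY_OR_TIMEOUT = "latency_or_timeout"
    REVIEWER_DISAGREEMENT = "reviewer_disagreement"
    SCOPE_MISMATCH = "scope_mismatch"

# Repair paths taken from the text; everything else stays unassigned.
REPAIR_PATH = {
    FailureClass.UNSUPPORTED_CLAIM: "grounding rules or retrieval changes",
    FailureClass.POLICY_AMBIGUITY: "product or legal ownership",
    FailureClass.UNSAFE_TOOL_ACTION: "permission design",
    FailureClass.FORMAT_VIOLATION: "schema validation",
    FailureClass.REVIEWER_DISAGREEMENT: "rubric repair",
    FailureClass.SCOPE_MISMATCH: "stop accepting certain tasks",
}

def triage(failures):
    """Count failures per class, most frequent first, with the repair path attached."""
    counts = Counter(failures)
    return [(fc.value, n, REPAIR_PATH.get(fc, "unassigned"))
            for fc, n in counts.most_common()]

# Hypothetical review output for one eval run.
labeled = [
    FailureClass.UNSUPPORTED_CLAIM,
    FailureClass.UNSUPPORTED_CLAIM,
    FailureClass.POLICY_AMBIGUITY,
    FailureClass.MISSING_SOURCE,
]
report = triage(labeled)
for row in report:
    print(row)
```

An "unassigned" row in the report is itself a finding: it marks a failure class that nobody owns yet.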

This is why the book Retrieval That Survives Contact matters beside eval work. If the failure class is evidence availability, the eval is only showing the symptom. The retrieval system needs repair. Likewise, Agents That Actually Work becomes relevant when the failure is not the answer but the action path.

Demo Eval Versus Production Eval

Area	Demo eval	Production eval
Example source	Handpicked prompts and known happy paths.	Sampled traffic, incidents, edge cases, policy cases, and synthetic pressure tests.
Success definition	Looks correct to the builder.	Meets a written product promise under defined constraints.
Scoring	Single aggregate pass rate.	Segmented by task, risk, source quality, user intent, and failure class.
Reviewer role	Confirms preferred examples.	Applies a rubric, records disagreement, and improves the label guide.
Decision	Supports a demo or model preference.	Supports ship, hold, rollback, narrow rollout, or repair.

Common Mistakes

The first mistake is measuring the easiest behavior instead of the most consequential behavior. Fluency, format, and tone are easier to score than correctness under messy source conditions. They still matter, but they should not crowd out evidence quality, refusal behavior, or policy compliance.

The second mistake is treating synthetic examples as a substitute for production sampling. Synthetic examples are useful when they target a known gap. They are dangerous when they replace the real distribution. A synthetic adversarial case can test a boundary. It cannot tell you how often real users approach that boundary.

The third mistake is accepting reviewer agreement as proof of correctness. Reviewers can agree because the rubric is clear. They can also agree because they share the same blind spot. Calibration should include disputed cases, hidden gold examples, and occasional audit by someone outside the immediate build team.
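A toy calculation shows why agreement and correctness need separate measurements. The labels and the hidden gold set below are invented; the two helper functions are a minimal sketch, not a full calibration procedure.

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of examples on which two reviewers gave the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("reviewers must label the same examples")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

def gold_accuracy(labels, gold):
    """Accuracy on hidden gold examples only; gold maps example index -> label."""
    return sum(labels[i] == g for i, g in gold.items()) / len(gold)

# Hypothetical: both reviewers share the same blind spot on examples 1 and 3.
reviewer_a = ["pass", "pass", "fail", "pass", "pass"]
reviewer_b = ["pass", "pass", "fail", "pass", "pass"]
hidden_gold = {1: "fail", 3: "fail"}

agreement = agreement_rate(reviewer_a, reviewer_b)
accuracy_a = gold_accuracy(reviewer_a, hidden_gold)
print(f"agreement={agreement:.2f} gold_accuracy={accuracy_a:.2f}")
```

Here the reviewers agree on every example and are still wrong on every gold case, which is the situation that agreement alone cannot detect.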

The fourth mistake is measuring only the candidate model. A release eval should compare candidate behavior against the current production version and, when relevant, the fallback path. The question is not "is this model good?" The question is "is this change sufficiently better, in the right places, without creating unacceptable regressions?"
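Comparing against production is most useful at the level of individual examples, because two runs with the same pass rate can fail on different cases. A minimal sketch, assuming each run is a hypothetical mapping from example ID to a pass/fail result:

```python
def diff_runs(baseline, candidate):
    """Compare two runs keyed by example ID ({example_id: passed}).

    Returns the examples that regressed, the examples that were fixed,
    and any IDs present in only one run (which cannot be compared).
    """
    shared = baseline.keys() & candidate.keys()
    return {
        "regressions": sorted(i for i in shared if baseline[i] and not candidate[i]),
        "fixes": sorted(i for i in shared if not baseline[i] and candidate[i]),
        "unmatched": sorted(baseline.keys() ^ candidate.keys()),
    }

# Hypothetical results: both runs pass 3 of 4 shared examples.
production = {"ex-01": True, "ex-02": True, "ex-03": False, "ex-04": True}
candidate = {"ex-01": True, "ex-02": False, "ex-03": True, "ex-04": True, "ex-05": True}

changes = diff_runs(production, candidate)
print(changes)
```

In this example the pass rate on shared cases is unchanged, yet one example regressed and one was fixed. Whether that trade is acceptable depends on which segment each example belongs to.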

The fifth mistake is failing to retire examples. An eval suite accumulates obsolete policy, old product behavior, outdated source material, and examples that no longer change a decision. A stale eval can block useful work or approve dangerous work. Both are expensive.
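Retirement is easier when each example carries enough metadata to be flagged automatically. The sketch below assumes two hypothetical fields, `policy_version` and `last_reviewed`, and an arbitrary 180-day review window; it only flags candidates for human review and does not delete anything.

```python
from datetime import date, timedelta

def split_stale(examples, current_policy_version, today, max_age_days=180):
    """Split examples into (keep, needs_review).

    An example needs review if it was written against an older policy
    version or has not been reviewed within max_age_days.
    """
    keep, needs_review = [], []
    for ex in examples:
        outdated_policy = ex["policy_version"] != current_policy_version
        too_old = (today - ex["last_reviewed"]) > timedelta(days=max_age_days)
        (needs_review if outdated_policy or too_old else keep).append(ex)
    return keep, needs_review

# Hypothetical reference-set entries.
examples = [
    {"id": "ex-01", "policy_version": "v3", "last_reviewed": date(2024, 11, 1)},
    {"id": "ex-02", "policy_version": "v2", "last_reviewed": date(2024, 12, 1)},
    {"id": "ex-03", "policy_version": "v3", "last_reviewed": date(2024, 1, 1)},
]

keep, needs_review = split_stale(examples, "v3", today=date(2025, 1, 1))
print([e["id"] for e in keep], [e["id"] for e in needs_review])
```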

Practical Exercise

Choose one model-backed workflow in your product or a product you know well. Write one page with the following fields:

Product promise: what the system is allowed to do and under which constraints.
User tasks: the five most common task types the system handles.
High-cost failures: the five failure modes that would create customer, support, compliance, or revenue risk.
Existing evidence: where real examples can be sampled from.
Release decision: the next decision the eval must support.
Segments: the minimum segments the score must report separately.
Gate: the result that would block release.

If any field is unclear, do not start by writing more prompts. Start by resolving the product promise. Most bad evals are downstream of an unclear promise.
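The one-page exercise can also be kept as a structured record so that empty fields are visible before any prompts are written. The class and field names below simply mirror the list above and are otherwise hypothetical.

```python
from dataclasses import dataclass, field, fields
from typing import List

@dataclass
class EvalBrief:
    product_promise: str = ""
    user_tasks: List[str] = field(default_factory=list)
    high_cost_failures: List[str] = field(default_factory=list)
    existing_evidence: str = ""
    release_decision: str = ""
    segments: List[str] = field(default_factory=list)
    gate: str = ""

def unclear_fields(brief: EvalBrief) -> List[str]:
    """Return the names of fields that are still empty."""
    return [f.name for f in fields(brief) if not getattr(brief, f.name)]

# Hypothetical draft: tasks are known, but the promise has not been written.
draft = EvalBrief(
    user_tasks=["refund status", "invoice explanation"],
    release_decision="ship or hold the new billing prompt",
)

missing = unclear_fields(draft)
print(missing)
if "product_promise" in missing:
    print("Resolve the product promise before writing more prompts.")
```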

Summary

Evals fail in production when they are designed as proof that a system works instead of a way to decide whether a system is ready. Demo examples, benchmark scores, and aggregate pass rates can all be useful, but none of them replace a product-specific evaluation system tied to real tasks, real risk, and real release decisions.

The production eval starts with the promise. It asks what the system is supposed to do, what evidence it must use, what behavior is unacceptable, and what decision the team must make. It then builds examples, rubrics, graders, segments, and gates around that decision.

The rest of this book builds that system. The next chapter starts with the most important material asset: the reference set. If the examples are weak, stale, or politically selected, every later score becomes suspect. If the reference set is strong, the team has a foundation for grading, release gates, operations, and business trust.

Key Takeaways

A production eval is decision infrastructure, not a demo checklist.
The product promise should be written before the metric is chosen.
Benchmarks help with model selection, but they do not replace product-specific release gates.
Average scores hide expensive failures unless results are segmented by task, risk, source quality, and failure class.
Failure classes matter because each class points to a different repair path.
The strongest eval is not the largest eval. It is the eval the team trusts when the release decision is hard.