Name: A Field Guide to Evals
Availability: InStock

How to construct a living evaluation set from real work, incidents, synthetic pressure tests, and policy-sensitive cases.

The Reference Set Is the Product Boundary

The reference set is where the evaluation system becomes honest or dishonest. A team can write a careful rubric, automate grading, and build a polished dashboard, but if the examples are selected badly the result will still be misleading. The score will look precise while measuring the wrong reality.

A reference set is not a random spreadsheet of prompts. It is the governed collection of cases the team trusts when it has to decide whether a release is ready. It carries the product promise, the user tasks, the costly failures, the source conditions, the policy boundaries, and the examples that expose whether the system is improving or only changing.

This makes the reference set a product boundary. It decides which parts of reality are allowed to influence release decisions. If the set contains only easy examples, easy examples will run the product. If it excludes incident cases, the same incident can return. If it ignores high-value customers, the average score can improve while the business risk gets worse. If it omits refusal, escalation, and uncertainty cases, the system can become more fluent and less safe at the same time.

OpenAI's evals guidance frames evals as repeatable tests built from data and graders. Chapter 1 argued that a production eval must predict a decision. This chapter starts with the data: what belongs in the set, how cases are shaped, how they are segmented, and how the set stays alive after launch.

Infographic map for Building the Reference Set — The figure turns Building the Reference Set into a working map: how to construct a living evaluation set from real work, incidents, synthetic pressure tests, and policy-sensitive cases.

Start With the Promise, Not the Examples

Most weak reference sets begin with available examples. Someone exports recent prompts, removes the awkward cases, edits a few expected answers, and calls the result a test set. Availability is not the same as relevance. The reference set should start with the product promise.

Write the promise in operational language:

What work is the system allowed to do?
Which evidence is it required to use?
Which actions are outside scope?
When should it ask for more context?
When should it escalate?
What failure would create customer, compliance, support, or revenue risk?

Only then choose examples. The promise tells the team what the set must prove. For a billing support workflow, the set should include current policy, account state, policy conflict, refund boundaries, and escalation cases. For a retrieval-heavy research assistant, the set should include answerable questions, unanswerable questions, stale documents, contradictory evidence, and citation quality. For an agentic workflow, the set should include tool permissions, partial completion, retries, user interruption, and audit records.

The NIST AI Risk Management Framework is useful because it separates mapping from measurement. The reference set is where mapping becomes measurable. Context, intended use, affected users, foreseeable misuse, and consequence have to become concrete cases before a score can mean anything.

The Four Sources of a Serious Set

A production reference set should be assembled from four sources: real traffic, incidents, policy-sensitive edge cases, and synthetic pressure tests. Each source contributes something different. None is enough by itself.

Real traffic shows what users actually ask for. It captures language drift, partial context, repeated confusion, and the everyday distribution. It is the only reliable source for frequency. Without real traffic, the team is usually testing its own imagination.

Incidents show what has already hurt the product. They are not representative by frequency, but they are representative by consequence. If an answer caused a support escalation, a refund, a legal review, a customer apology, a security review, or an executive call, it deserves a place in the regression set until the failure class is no longer plausible.

Policy-sensitive edge cases show where correctness depends on a rule, boundary, or exception. These cases are often underrepresented in traffic because users do not know which boundary matters. The team has to create them deliberately from policy, product constraints, contracts, and known limitations.

Synthetic pressure tests are generated or constructed examples designed to probe a known weakness. They can test prompt injection, ambiguous instructions, conflicting documents, malformed input, missing evidence, tool misuse, or high-stakes refusal. They are useful when they are targeted. They become dangerous when the team treats them as a substitute for production sampling.

The lesson from CheckList applies directly. A useful behavioral suite tests capabilities, invariances, and directional expectations. In product terms, this means the reference set should not only ask whether a specific answer is correct. It should test what must stay stable when wording changes, what must change when facts change, and what behavior is unacceptable regardless of wording.

Normalize Cases Before Scoring Them

Raw examples are not evaluation cases yet. A user message, document, transcript, ticket, or tool trace has to be normalized before it can support a release decision. Normalization is the work of turning messy evidence into a case that can be reviewed consistently.

A useful case record should contain at least these fields:

Case id.
Source: traffic, incident, policy edge case, or synthetic pressure test.
Task type.
User intent.
Input context.
Required evidence.
Expected behavior.
Allowed uncertainty.
Disallowed behavior.
Risk tier.
Failure class.
Segment labels.
Owner.
Last reviewed date.
Status: active, watchlist, retired, or quarantined.

The fields matter because the same output can be right or wrong for different reasons. A response that lacks a citation may be acceptable in a brainstorming workflow and unacceptable in a policy workflow. A refusal may be correct for a tool action but wrong for a simple explanation. A concise answer may be preferred in chat but incomplete in a regulated review.

Normalization also protects the team from grading drift. If reviewers are shown only an input and an output, they will import their own assumptions. If they are shown the task type, evidence requirement, risk tier, and disallowed behavior, they can apply a shared standard.

Datasheets for Datasets argues that datasets need documentation about motivation, collection, composition, preprocessing, intended use, distribution, maintenance, and limitations. A reference set needs the same discipline. It should be possible to ask where a case came from, why it is active, what decision it supports, and when it should be reviewed again.

Keep Three Sets, Not One

Teams often talk about "the eval set" as if one set can do every job. It usually cannot. A production evaluation program needs at least three related sets: a stable release set, an exploratory review queue, and incident regression cases.

The stable release set is the gate. It should change slowly. It contains the examples the team uses to compare the current production version, a candidate release, and any fallback path. If this set changes every week, release comparisons lose meaning. If it never changes, it becomes stale. The right balance is deliberate change with versioned notes.

The exploratory review queue is the learning surface. It contains recent traffic, new support cases, uncertain examples, reviewer disagreements, and suspected drift. This queue is where the team discovers new failure classes and decides what belongs in the release set later.

Incident regression cases are high-consequence examples tied to failures the team cannot afford to repeat. They should be marked with the incident, date, affected workflow, failure class, fix owner, and retirement condition. Some incident cases will become permanent. Others should retire after the product boundary changes.

These sets should not be mixed casually. If a team tunes prompts against the same cases used for the release gate, it can memorize the gate. If it treats a noisy exploratory queue as the release set, it can block releases for unclear reasons. If it hides incident cases inside the average score, it can pass a release that reopens a known wound.

The Google ML Test Score is a useful reminder that production readiness includes data tests, model tests, infrastructure tests, monitoring, and ownership. For eval work, the equivalent is not a single score. It is a disciplined test system with separate roles for release confidence, discovery, and regression protection.

Segment Before You Average

An average score is useful only after the team knows what it hides. The reference set should be segmented before the first score is reported.

Useful segments often include:

Task type.
User intent.
Customer tier.
Risk tier.
Source type.
Source quality.
Retrieval availability.
Tool action required.
Language or locale.
Policy area.
Failure class.

Segments are not vanity dimensions. They are decision dimensions. If the enterprise customer segment regresses, the decision may be different than if a low-risk exploratory task regresses. If answerable retrieval cases improve but unanswerable cases worsen, the system may be becoming overconfident. If policy cases improve but tool-action cases regress, the release decision depends on where the feature is used.

Segmentation also helps the team avoid false comfort. A 92 percent pass rate can hide a 60 percent pass rate on the highest-risk cases. A stable average can hide a major improvement in easy cases and a regression in hard cases. A strong grader score can hide reviewer disagreement in exactly the cases that need human judgment.

The reference set should make those differences visible from the start. Do not wait until the first incident to ask whether the failing cases were in a segment.

Use Synthetic Cases Carefully

Synthetic cases are valuable when they are attached to a hypothesis. They are not valuable because they increase the row count.

Use synthetic cases when the team needs to test a boundary that traffic has not sampled enough. For example:

A document contains instructions that conflict with the user's task.
A source says the policy changed, but the current policy page says otherwise.
A user asks for a tool action without required permission.
A question is unanswerable from the available corpus.
A prompt includes misleading context that should not be trusted.
A request is safe in one customer tier but unsafe in another.

Each synthetic case should say why it exists. It should be labeled as synthetic, tied to a failure class, and reviewed separately from production traffic. The team should not use synthetic pass rate as a proxy for production readiness unless it has evidence that the synthetic cases map to real risk.

Synthetic cases are especially useful for rare but expensive failures. They help the team test boundaries before a real user finds them. But they should remain subordinate to observed work. The reference set should not become a fantasy world where the team only tests the attacks, exceptions, and clever traps it already knows how to name.

Holdout Discipline Matters

The reference set will influence development. That is the point. But the release gate loses value if every developer sees every expected answer, tunes against every case, and optimizes until the system passes the known rows.

Use holdout discipline. Keep some cases hidden from prompt tuning and model selection. Keep reviewer calibration examples separate from release examples. Keep newly discovered incident cases out of tuning until the team has decided whether they represent a class or a one-off mistake.

This does not require theatrical secrecy. It requires clear purpose. Development examples are for improving behavior. Calibration examples are for aligning reviewers and graders. Release examples are for deciding whether the candidate is safe to ship. Incident examples are for preventing repeated harm. When those purposes blur, scores become easier to improve and harder to trust.

This is one reason Hidden Technical Debt in Machine Learning Systems remains relevant to model-backed products. Data dependencies, evaluation dependencies, configuration, and monitoring all create maintenance debt. A reference set is not outside that debt. It is part of the system and has to be managed accordingly.

Refresh Is a Governance Job

Reference sets decay. Products change. Policies change. Corpora change. Customer language changes. Support teams learn new patterns. Sales teams make new promises. Incidents reveal new risk. A set that was honest at launch can become misleading three months later.

Refresh should be explicit. Every active case should have a review date or a refresh rule. Some cases should be stable for long comparison. Some should be reviewed whenever policy changes. Some should be reviewed when the retrieval corpus changes. Some should be reviewed after an incident. Some should retire because the product no longer accepts that task.

Do not let old expected answers become hidden policy. If the policy changed, the expected answer should change or the case should retire. If the product boundary changed, the case should reflect the new boundary. If the retrieval corpus changed, the required evidence should be checked again. If a customer workflow is no longer supported, the case should not keep blocking releases unless the product still receives that traffic.

Refresh is not a data cleanup chore. It is governance. It decides what reality the eval system is allowed to represent.

Example Source Decision Table

Decision table

What each source proves

Source	Include when	What it proves	What it cannot prove	Refresh rule
Sampled traffic	The team needs production language, task frequency, and real user confusion.	Whether the system handles ordinary work and current demand.	Rare high-cost boundaries that users have not hit yet.	Weekly or biweekly sampling, with stratification by task and risk.
Incidents	A failure caused support, revenue, compliance, security, or trust impact.	Whether a known failure class has been repaired.	The frequency of the failure in normal traffic.	Review after fix, after release, and after the incident class is no longer plausible.
Policy edge cases	Correct behavior depends on a boundary, exception, contract, or regulated rule.	Whether the product promise is enforced under constraint.	How often users naturally approach the boundary.	Review when policy, contract, or product scope changes.
Synthetic pressure tests	A known weakness needs targeted stress before real users expose it.	Whether a specific boundary or failure class is controlled.	Representative production quality or traffic frequency.	Review after the weakness is fixed, then retain only if the risk remains material.

Common Mistakes

The first mistake is choosing examples that make the team feel competent. A reference set should make the product safer, not the team more comfortable. If the set never creates a hard decision, it is probably too easy.

The second mistake is mixing tuning and release examples. If the same rows are used to improve the system and approve the system, the gate becomes easier to pass over time. Keep development, calibration, release, and incident cases separate enough that the release score still has authority.

The third mistake is treating synthetic cases as representative. Synthetic cases can be excellent boundary tests, but they do not tell the team what users are actually doing. They should be labeled, segmented, and interpreted separately.

The fourth mistake is ignoring abstention and escalation. Many products fail not because the answer is bad, but because the system answered when it should have asked, refused, escalated, or stopped. The reference set should include cases where the right behavior is not an answer.

The fifth mistake is letting stale cases become hidden product policy. Old examples can encode old rules, old documents, old customer tiers, and old support behavior. A reference set needs maintenance because the product keeps moving.

Practical Exercise

Choose one production workflow and build a 40-case first reference set:

Write the product promise in five sentences.
Select 16 sampled traffic cases across the most common task types.
Add 8 incident or support escalation cases.
Add 8 policy-sensitive edge cases.
Add 8 synthetic pressure tests tied to named failure classes.
For every case, record source, task type, risk tier, expected behavior, required evidence, failure class, owner, and review date.
Split the 40 cases into release, exploratory, and incident-regression groups.
Define the minimum segments you will report before any average score.

After the exercise, ask one question: would this set change a release decision? If the answer is no, the set is not ready. Add the missing risk, remove the vanity cases, and write the decision threshold before scoring.

Summary

The reference set is the foundation of the eval system. It carries the product promise into measurable cases. It decides which parts of production reality influence release decisions. It protects the team from demo comfort, benchmark theater, stale policy, and average scores that hide expensive failures.

A serious set is built from real traffic, incidents, policy-sensitive edge cases, and targeted synthetic pressure tests. Raw examples become useful only after they are normalized, segmented, documented, and assigned an owner. The team should keep separate surfaces for release gating, exploration, and incident regression. It should preserve holdout discipline and refresh the set as the product, corpus, policy, and customer base change.

The next chapter turns the reference set into judgment. Once the examples are strong, the team still has to decide how they are graded, when reviewers disagree, and which parts of judgment can be automated without pretending the hard cases are simple.

Key Takeaways

The reference set is the governed collection of cases the team trusts for release decisions.
Start with the product promise before selecting examples.
Use four sources: sampled traffic, incidents, policy edge cases, and synthetic pressure tests.
Normalize each case with task, evidence, expected behavior, risk tier, failure class, owner, and review date.
Keep separate release, exploratory, and incident-regression sets.
Segment before averaging so high-risk regressions do not disappear inside a strong overall score.
Refresh is governance, not cleanup. A stale set can approve the wrong behavior or block the right change.