Name: A Field Guide to Evals

How to connect eval results to ship, hold, rollback, and narrow-rollout decisions.

The Meeting Where the Score Loses

Picture the release meeting. The candidate version is ready. Sales has a customer expecting the new capability this week. The eval ran overnight, and the overall pass rate moved from 91.2 to 92.4. Someone says "it is up, ship it." Someone else says "it felt worse on the demo I tried." The room now has a number nobody trusts and an anecdote nobody can refute, and the decision gets made by whoever talks last or outranks the others.

That meeting is the enemy of this entire book. The first chapter argued that an eval must predict a decision. This chapter is where that promise gets paid or broken. A regression gate is the mechanism that turns a noisy, multidimensional eval result into a defensible ship, hold, rollback, or narrow-rollout decision, agreed before the run so that urgency cannot rewrite the rules during the meeting.

The discipline here is borrowed openly from software testing. A unit test suite does not "trend upward." It is green or red, and red blocks the merge until someone decides otherwise on the record. Probabilistic systems cannot be that binary, because a single point estimate on a finite reference set is itself uncertain. But the cultural transfer is exactly right: the gate is a contract written before the result, not a negotiation held after it. The mistake teams make is treating a 1.2 point move as if it were a deterministic test going from red to green, when it might be statistical noise.

The motif of the book applies directly. Measure the failure you can afford to prevent, not the benchmark you can afford to brag about. A gate that only watches the headline average brags. A gate that watches the high-cost segments and the critical failure classes prevents.

[Figure: infographic map for Regression Gates and Release Decisions, showing how to connect eval results to ship, hold, rollback, and narrow-rollout decisions.]

A Score Move Is Not Automatically Real

The 1.2 point improvement in the opening scene might be a genuine gain, or it might be the kind of fluctuation you would see by rerunning the same model on the same set twice. Before a gate treats a movement as signal, the team has to ask whether the movement is larger than the noise.

On a reference set, each case is a sample. A pass rate of 92 percent on 200 cases is an estimate with a confidence interval, not a fact. The standard error of a proportion p on n cases is roughly the square root of p(1 - p)/n. For 92 percent on 200 cases, that is about 1.9 percentage points, so the 95 percent interval spans roughly 88 to 96 percent. A move from 91.2 to 92.4 sits comfortably inside that band. It is noise until proven otherwise.
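The arithmetic above can be checked in a few lines. This is a minimal sketch using the normal approximation to the binomial; the function name is ours, and for small segments or pass rates near 0 or 100 percent a Wilson interval would likely be a better choice.

```python
import math

def proportion_interval(p: float, n: int, z: float = 1.96) -> tuple[float, float, float]:
    """Return (standard_error, lower, upper) for a pass rate p measured on n cases.

    Uses the normal (Wald) approximation; z = 1.96 gives a 95 percent interval.
    """
    se = math.sqrt(p * (1.0 - p) / n)
    return se, max(0.0, p - z * se), min(1.0, p + z * se)

se, lower, upper = proportion_interval(0.92, 200)
print(f"standard error: {se:.4f}")            # about 0.019
print(f"95% interval: {lower:.3f} to {upper:.3f}")  # roughly 0.88 to 0.96
```

Rerunning it with n set to 40 and then 400 shows how quickly the band widens on a small segment.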

Two practical consequences follow. First, report intervals, not just point estimates, on every segment in the release memo. A segment with 40 cases has a much wider interval than a segment with 400, and a "regression" in the small segment may be nothing. Second, prefer paired comparisons. Because the candidate and the production version are run on the same cases, you can compare them case by case rather than comparing two independent averages. Count how many cases the candidate fixed, how many it broke, and how many were unchanged. A McNemar-style view of fixed-versus-broken is far more informative than two pass rates, because it shows the trade the change actually made.
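One way to sketch the paired view, assuming each run is stored as a list of pass/fail booleans in the same case order. The function names and the ten-case data are invented for illustration. The p-value is the exact two-sided McNemar test, which looks only at the discordant pairs (fixed and broken).

```python
from math import comb

def paired_counts(baseline: list[bool], candidate: list[bool]) -> dict[str, int]:
    """Tally case-by-case outcomes for two runs over the same cases."""
    if len(baseline) != len(candidate):
        raise ValueError("both runs must cover the same cases in the same order")
    counts = {"fixed": 0, "broken": 0, "both_pass": 0, "both_fail": 0}
    for base_ok, cand_ok in zip(baseline, candidate):
        if cand_ok and not base_ok:
            counts["fixed"] += 1
        elif base_ok and not cand_ok:
            counts["broken"] += 1
        elif base_ok:
            counts["both_pass"] += 1
        else:
            counts["both_fail"] += 1
    return counts

def mcnemar_exact_p(fixed: int, broken: int) -> float:
    """Two-sided exact McNemar p-value computed from the discordant pairs."""
    n = fixed + broken
    if n == 0:
        return 1.0  # no case changed, so there is no evidence either way
    k = min(fixed, broken)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical ten-case run: the candidate fixes three cases and breaks one.
baseline = [True, True, True, False, False, True, True, False, True, True]
candidate = [True, True, False, True, True, True, True, True, True, True]

counts = paired_counts(baseline, candidate)
p_value = mcnemar_exact_p(counts["fixed"], counts["broken"])
print(counts, round(p_value, 3))
```

Here the pass rate rises from 70 to 90 percent, yet with only four changed cases the test cannot distinguish the move from chance.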

This is where the failure-first method from earlier chapters pays off. You do not care equally about all the cases. You care about the broken ones, especially the broken ones in high-cost segments. A change that fixes 15 low-risk cases and breaks 3 enterprise-policy cases has a higher average and a worse risk profile. The gate must be able to see that trade, which means it must read the paired, segmented result, not the global mean.
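The 15-fixed, 3-broken trade can be made visible with a per-segment tally. This is a sketch under assumed names: the segment labels, the tuple layout, and the unchanged-case counts are all invented for illustration.

```python
from collections import defaultdict

def trade_by_segment(results):
    """results: iterable of (segment, baseline_passed, candidate_passed) tuples."""
    trade = defaultdict(lambda: {"fixed": 0, "broken": 0})
    for segment, base_ok, cand_ok in results:
        if cand_ok and not base_ok:
            trade[segment]["fixed"] += 1
        elif base_ok and not cand_ok:
            trade[segment]["broken"] += 1
    return dict(trade)

# The trade from the paragraph above, padded with invented unchanged cases.
results = (
    [("low_risk", False, True)] * 15
    + [("enterprise_policy", True, False)] * 3
    + [("low_risk", True, True)] * 100
    + [("enterprise_policy", True, True)] * 40
)

trade = trade_by_segment(results)
net_change = sum(t["fixed"] - t["broken"] for t in trade.values())
print(trade)
print("net cases gained:", net_change)
```

The net is positive, which is what the global mean reports. The per-segment rows are what the gate needs to read.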

For probabilistic outputs and calibrated confidence scores, the right measurement tool is a proper scoring rule. The Brier score, characterized in Gneiting and Raftery's Strictly Proper Scoring Rules, Prediction, and Estimation, rewards a model for being both accurate and honestly calibrated, and it cannot be gamed by always predicting the majority class. When your product exposes a confidence value, gate on calibration with a proper scoring rule, not on raw accuracy alone. A model that becomes more accurate but more overconfident has gotten worse for any workflow that routes on confidence.
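A small illustration of that last point, with invented numbers. The baseline is right on 40 of 50 cases and reports 0.80 confidence. The candidate is right on 41 of 50 but reports 0.99 confidence everywhere. Here the Brier score is the mean squared gap between the stated confidence that an answer is correct and the 0/1 outcome.

```python
def brier_score(confidences: list[float], outcomes: list[int]) -> float:
    """Mean squared gap between stated confidence and the 0/1 outcome. Lower is better."""
    if not confidences or len(confidences) != len(outcomes):
        raise ValueError("need one confidence per outcome, and at least one case")
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(outcomes)

# Hypothetical baseline: 80 percent accurate and honestly calibrated at 0.80.
baseline_outcomes = [1] * 40 + [0] * 10
baseline_brier = brier_score([0.80] * 50, baseline_outcomes)

# Hypothetical candidate: 82 percent accurate but claiming 0.99 on every case.
candidate_outcomes = [1] * 41 + [0] * 9
candidate_brier = brier_score([0.99] * 50, candidate_outcomes)

print(f"baseline:  accuracy 0.80, Brier {baseline_brier:.4f}")
print(f"candidate: accuracy 0.82, Brier {candidate_brier:.4f}")
```

Accuracy went up and the Brier score got worse. A gate that watched only accuracy would have passed the candidate.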

The Gate Is a Set of Conditions, Not a Number

A single threshold on a single number is the most common and most dangerous gate design. It invites the exact averaging failure the grader chapter warned about: a candidate that improves the easy majority and quietly regresses the expensive minority sails through a global threshold.

A real gate is a set of conditions, each tied to a consequence the business actually cares about. Here is a gate specification expressed as the conditions, stated before the run, that decide the outcome.

# release-gate.yaml (agreed before the run, versioned with the rubric)
gate:
  baseline: prod-v4.2             # candidate is compared against this
  reference_set: release-set-v9
  min_cases_per_segment: 30       # below this, segment reports advisory only

  block_if:
    - critical_failure_rate > 0.0                  # any fabricated data, unsafe action
    - segment.enterprise.pass_rate.delta < -0.03   # CI-significant
    - failure_class.unsupported_claim.delta > 0.01
    - incident_regression.any_reopened == true
    - latency_p95 > product_promise.latency_p95

  warn_if:
    - overall.pass_rate.delta < 0.0
    - any_segment.pass_rate.delta < -0.02
    - reviewer_disagreement.alpha < 0.60           # on gating dimensions
    - cost_per_request.delta > 0.10

  ship_if:
    - no block_if conditions true
    - all warn_if conditions reviewed and signed off by named owner

Three design principles are doing the work here. First, critical failure classes are absolute blocks, not weighted inputs. Any reopened incident or any fabricated-data case blocks the release regardless of how good the average looks. These are the failures the company already paid for once. Second, the gate reads segments and failure classes, not just the global mean. An enterprise regression beyond the noise band blocks even if the overall number is up. Third, warnings are not silent. A warning does not block, but it cannot be ignored either. It must be reviewed and signed off by a named owner, which puts the soft judgment on the record instead of in the hallway.
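To show how those principles behave when evaluated, here is a minimal sketch of a gate function. It hard-codes the thresholds from the specification above rather than parsing the YAML, covers only a subset of the warn conditions for brevity, and uses field and function names that are our own assumptions, not a real tool's API.

```python
from dataclasses import dataclass, field, replace

@dataclass(frozen=True)
class GateResult:
    critical_failure_rate: float
    enterprise_delta: float          # change in enterprise pass rate
    unsupported_claim_delta: float   # change in unsupported-claim failure rate
    incident_reopened: bool
    latency_p95_ms: float
    latency_promise_ms: float
    overall_delta: float
    worst_segment_delta: float
    signed_off: frozenset = field(default_factory=frozenset)  # warnings with a named owner

def evaluate_gate(r: GateResult) -> tuple[str, list[str]]:
    """Return (outcome, conditions). Blocks are absolute; warnings need sign-off."""
    block_checks = {
        "critical_failure": r.critical_failure_rate > 0.0,
        "enterprise_regression": r.enterprise_delta < -0.03,
        "unsupported_claim": r.unsupported_claim_delta > 0.01,
        "incident_reopened": r.incident_reopened,
        "latency_p95": r.latency_p95_ms > r.latency_promise_ms,
    }
    warn_checks = {
        "overall_down": r.overall_delta < 0.0,
        "segment_down": r.worst_segment_delta < -0.02,
    }
    blocks = [name for name, fired in block_checks.items() if fired]
    warnings = [name for name, fired in warn_checks.items() if fired]
    if blocks:
        return "block", blocks
    unsigned = [w for w in warnings if w not in r.signed_off]
    if unsigned:
        return "needs_sign_off", unsigned
    return "ship", warnings

# Hypothetical run: no blocks, one warning that nobody has signed yet.
run = GateResult(
    critical_failure_rate=0.0, enterprise_delta=-0.01, unsupported_claim_delta=0.002,
    incident_reopened=False, latency_p95_ms=1800.0, latency_promise_ms=2000.0,
    overall_delta=-0.004, worst_segment_delta=-0.01,
)
print(evaluate_gate(run))
print(evaluate_gate(replace(run, signed_off=frozenset({"overall_down"}))))
print(evaluate_gate(replace(run, incident_reopened=True)))
```

The three calls walk through the three outcomes: an unsigned warning stalls the release, a signed warning lets it ship, and a reopened incident blocks regardless of everything else.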

The thresholds themselves are not universal constants. The first chapter insisted the gate be built from the product's own risk. A brainstorming assistant can tolerate a wider warn band than a billing system. A regulated workflow may set the unsupported-claim block at zero. Copy the structure, never the numbers.

Four Decisions, Not Two

The release meeting usually behaves as if there are two options: ship or do not ship. That binary is what makes the meeting tense, because "do not ship" reads as failure and the room is biased against it. The fix is to expand the option set. There are at least four decisions a gate can produce, and naming them defuses the false choice.

Decision	When it applies	What it requires	Who owns it
Ship	No block conditions, warnings signed off	Full release to all segments	Release owner
Narrow rollout	Gains are real but concentrated, or a segment is flat	Ship only to segments where evidence supports it, hold the rest	Product owner
Hold for repair	A block condition fired and the fix is in reach	Repair the failing class, rerun the gate, do not widen scope	Engineering owner
Rollback	A regression escaped to production and is causing harm	Revert to baseline, run incident eval, then diagnose	On-call plus release owner

Narrow rollout is the most underused and most valuable of these. Most eval regressions are not uniform. A change improves retrieval on long documents and slightly hurts short ones, or it helps consumer queries and is flat on enterprise. A binary gate forces you to either ship the regression or block the improvement. A narrow rollout lets the evidence shape the release: ship to the segments where the gate is clean, hold the segments where it is not, and let the held segments be the explicit backlog for the next cycle. This is the same "make the release shape match the evidence" principle the operating cadence chapter raised, made concrete.
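A sketch of how per-segment gate results might shape the release. It assumes each segment has already been marked clean or not clean by the gate, and that absolute blocks are passed in separately. The function and key names are hypothetical.

```python
def shape_release(segment_clean: dict[str, bool], global_block: bool = False) -> dict:
    """Turn per-segment gate results into a decision plus an explicit backlog."""
    if global_block:
        # An absolute block (critical failure, reopened incident) holds every segment.
        return {"decision": "hold", "ship_to": [], "backlog": sorted(segment_clean)}
    ship_to = sorted(s for s, clean in segment_clean.items() if clean)
    backlog = sorted(s for s, clean in segment_clean.items() if not clean)
    if not backlog:
        decision = "ship"
    elif not ship_to:
        decision = "hold"
    else:
        decision = "narrow_rollout"
    return {"decision": decision, "ship_to": ship_to, "backlog": backlog}

# Hypothetical result: the gate is clean on consumer, not on enterprise.
plan = shape_release({"consumer": True, "enterprise": False})
print(plan)
```

The backlog list is the point of the exercise: the held segments become named work for the next cycle instead of a vague sense that something was left behind.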

Rollback deserves its own discipline because it is an emergency, not a planned decision, and emergencies are where teams improvise badly. The rollback path should be tested before you need it. You should know the revert command, the data-migration implications, the customer-communication trigger, and the time-to-restore before an incident, not during one. A gate that can block but cannot quickly roll back is only half a safety system.

Shadow and Canary: Gates That Run in Production

Offline gates run against the reference set before release. They are necessary and insufficient, because the reference set, however well maintained, always lags the live distribution. Two production-time gates close the gap.

A shadow eval runs the candidate against live traffic without showing its output to users. The candidate processes real requests in parallel with production, its outputs are logged and graded, but the user sees only the production answer. Shadow evals catch distribution shift the offline set missed: new query types, new document formats, new failure clusters. Because no user is exposed, the cost of a shadow failure is a log entry, not an incident. The catch is that you cannot observe a user's reaction to an answer the user never received, so shadow is strongest for measurable properties like latency, refusal rate, retrieval coverage, and format validity, and weaker for "was the answer actually helpful."

A canary ships the candidate to a small, monitored slice of real traffic, often one to five percent, with the production version serving everyone else and a fast automatic rollback if guardrail metrics breach. Now you are measuring real user outcomes, but on a bounded blast radius. The canary gate watches online metrics the offline eval cannot see: escalation rate, thumbs-down rate, task abandonment, repeat-contact rate, and any business metric tied to the workflow. The principle of staged, observable rollout with automatic rollback is standard production-engineering practice, and the NIST AI Risk Management Framework reinforces it from the governance side: the Manage function is explicit that risk treatments must be monitored and that there must be a mechanism to respond when a deployed system behaves outside expected bounds. A canary with automatic rollback is that mechanism made operational.
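The guardrail check at the heart of a canary can be sketched as follows. The metric names come from the paragraph above, while the rates and the 20 percent relative limits are invented. A real canary would also want a minimum sample size and a noise check before acting, which this sketch leaves out.

```python
def canary_decision(
    production: dict[str, float],
    canary: dict[str, float],
    max_relative_increase: dict[str, float],
) -> tuple[str, list[str]]:
    """Compare higher-is-worse guardrail metrics and name any that breached."""
    breached = []
    for metric, limit in max_relative_increase.items():
        base, observed = production[metric], canary[metric]
        if base == 0.0:
            # No baseline rate to scale against, so any occurrence counts as a breach.
            if observed > 0.0:
                breached.append(metric)
        elif (observed - base) / base > limit:
            breached.append(metric)
    return ("rollback" if breached else "continue"), breached

# Hypothetical rates observed on the production slice and the canary slice.
production = {"escalation_rate": 0.040, "thumbs_down_rate": 0.100, "task_abandonment": 0.080}
canary = {"escalation_rate": 0.052, "thumbs_down_rate": 0.105, "task_abandonment": 0.090}
limits = {"escalation_rate": 0.20, "thumbs_down_rate": 0.20, "task_abandonment": 0.20}

decision, breached = canary_decision(production, canary, limits)
print(decision, breached)
```

Escalations rose by about 30 percent relative to production, which exceeds the assumed limit, so the sketch returns a rollback and names the metric that triggered it.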

The full release pipeline therefore has three gates in series, each cheaper to fail than the next:

Offline gate on the reference set. Cheapest, catches known failure classes and incident regressions before any user is touched.
Shadow gate on live traffic without user exposure. Catches distribution shift in measurable properties.
Canary gate on a small live slice with auto-rollback. Catches real-outcome regressions with a bounded blast radius.

A change that clears all three has earned a full rollout. A change that clears the offline gate but degrades in shadow has told you your reference set is stale, which is itself valuable signal that feeds the maintenance work in the operating cadence.
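The series ordering can be sketched as a loop that stops at the first gate that fails, so the more expensive gates never run on a candidate that a cheaper one already rejected. The gate checks here are stand-in lambdas with invented results.

```python
from typing import Callable

def run_pipeline(gates: list[tuple[str, Callable[[], bool]]]) -> tuple[str, list[str]]:
    """Run gates in order, cheapest first, and stop at the first failure."""
    cleared = []
    for name, check in gates:
        if not check():
            return f"stopped_at_{name}", cleared
        cleared.append(name)
    return "full_rollout", cleared

calls = []  # records which gates actually ran

def make_gate(name: str, passes: bool) -> tuple[str, Callable[[], bool]]:
    """Build a stand-in gate that logs its own execution and returns a fixed result."""
    def check() -> bool:
        calls.append(name)
        return passes
    return name, check

# Hypothetical candidate: clean offline, degraded in shadow.
outcome, cleared = run_pipeline([
    make_gate("offline", True),
    make_gate("shadow", False),
    make_gate("canary", True),
])
print(outcome, cleared, calls)
```

The canary never runs, no user is exposed, and the offline-pass, shadow-fail pattern is itself the signal that the reference set has gone stale.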

The Release Memo and the Decision Ledger

A gate produces a decision, and a decision needs a record. The operating cadence chapter introduced the decision ledger. The release memo is the document that feeds it, and it should be short enough that a busy leader reads all of it.

A release memo answers five questions and nothing else:

What changed? Model, prompt, retrieval config, tool permissions, source snapshot, judge and rubric versions.
What improved? Which segments and failure classes got better, with intervals, and whether the gain cleared the noise band.
What regressed? Same, for losses. Name the cases that broke, especially in high-cost segments.
What decision? Ship, narrow rollout, hold, or rollback, mapped explicitly to the gate conditions that fired.
What is unresolved? The warnings that were signed off, the segments held back, the quarantined cases, and the next review date.

The reason the memo and ledger matter is not bureaucracy. It is that release decisions get questioned weeks later, by a customer, by support, by a regulator, or by the team trying to understand why a regression slipped through. A decision recorded as "held because enterprise-policy pass rate dropped 4 points, confidence interval excluded zero, fix owned by retrieval team, recheck on the 20th" is defensible and reusable. A decision recorded as "looked good, shipped" is neither. As the operating cadence chapter argued, the negative and partial decisions are the most valuable records, because they show which risks repeat and which promises should stop appearing in sales language until the system can support them.
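If the ledger lives in a structured store rather than a document, the five questions can be sketched as a record. The field names are our own, the storage format is an assumption, and the example content paraphrases the defensible entry quoted above.

```python
from dataclasses import dataclass, asdict

DECISIONS = {"ship", "narrow_rollout", "hold", "rollback"}

@dataclass(frozen=True)
class ReleaseMemo:
    what_changed: str
    what_improved: str
    what_regressed: str
    decision: str
    gate_conditions_fired: tuple[str, ...]
    unresolved: str

    def __post_init__(self) -> None:
        # A memo must map to one of the four named decisions, not free text.
        if self.decision not in DECISIONS:
            raise ValueError(f"decision must be one of {sorted(DECISIONS)}")

def ledger_entry(memo: ReleaseMemo, recorded_on: str) -> dict:
    """Flatten a memo into a dict suitable for appending to a decision ledger."""
    entry = asdict(memo)
    entry["recorded_on"] = recorded_on
    return entry

memo = ReleaseMemo(
    what_changed="retrieval config (hypothetical)",
    what_improved="nothing that cleared the noise band",
    what_regressed="enterprise-policy pass rate dropped 4 points, interval excluded zero",
    decision="hold",
    gate_conditions_fired=("segment.enterprise.pass_rate.delta",),
    unresolved="fix owned by retrieval team, recheck on the 20th",
)
entry = ledger_entry(memo, recorded_on="placeholder-date")
print(entry["decision"], entry["gate_conditions_fired"])
```

Rejecting free-text decisions at construction time is one way to keep "looked good, shipped" out of the ledger.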

A Worked Gate Decision

Make this concrete with a single worked example, labeled as a hypothetical. A team is releasing a retrieval change for a support copilot. The offline gate reports:

Overall pass rate: 90.1 to 91.6, interval on the delta includes zero. Not significant on its own.
Consumer segment (320 cases): 89 to 93, delta interval excludes zero. Real gain.
Enterprise segment (45 cases): 94 to 90, delta interval excludes zero. Real regression.
Unsupported-claim failure class: rate rose from 0.4 percent to 0.9 percent. A delta of 0.005, under the block_if 0.01 threshold but moving toward it.
One incident-regression case (a refund-policy fabrication from a prior incident) reopened.

The headline number is up. A naive gate ships. The real gate blocks: the reopened incident case is an absolute block, the enterprise regression is real and beyond noise, and the unsupported-claim rate is near the line. The decision is hold for repair on enterprise and consider narrow rollout to consumer once the unsupported-claim rate and the reopened incident are resolved. The memo records the consumer gain as real and bankable, so the work is not wasted, it is sequenced. That is the difference between a gate that prevents the failure you can afford to prevent and a scoreboard that lets you brag your way into an incident.

Common Mistakes

The first mistake is gating on a single global number. It averages your expensive regressions into invisibility. Gate on segments and failure classes.

The second mistake is treating a small score move as real without checking the noise band. Report intervals and use paired, case-by-case comparison so you can see the actual trade the change made.

The third mistake is the false binary of ship-or-not. Name four decisions. Narrow rollout and hold are not failures, they are the evidence-shaped middle that keeps the company moving without shipping a regression.

The fourth mistake is having no tested rollback path. A gate that can block but cannot quickly revert is half a safety system. Test the revert before you need it.

The fifth mistake is relying only on offline gates. The reference set lags production. Use shadow and canary gates to catch the distribution shift and real-outcome regressions the offline set cannot see.

Practical Exercise

For one workflow you ship, write the gate before the next release.

Write the gate as block_if, warn_if, and ship_if conditions. Name at least two absolute blocks tied to critical failure classes or reopened incidents.
For your last release, recompute the headline pass rate with a confidence interval, then redo it as a paired fixed-versus-broken count by segment. Decide whether you would have made the same call.
Define your three-gate pipeline: what the offline gate checks, what the shadow gate can measure, and what the canary gate watches with auto-rollback.
Write your rollback runbook in five lines: revert command, data implications, customer-communication trigger, time-to-restore, and owner.
Draft the five-question release memo template and decide where the decision ledger lives.

If you cannot state the block conditions before the run, you do not have a gate. You have a meeting, and the meeting will be won by urgency.

Summary

A regression gate turns a noisy, multidimensional eval result into a defensible release decision agreed before the run. It treats critical failure classes and reopened incidents as absolute blocks, reads segments and failure classes rather than the global mean, and refuses to mistake a movement smaller than the noise band for a real gain. It expands the decision from a tense binary into four named options, ship, narrow rollout, hold, and rollback, so that evidence shapes the release. Offline, shadow, and canary gates run in series, each catching what the cheaper one cannot, with automatic rollback as the production safety mechanism. Every decision feeds a release memo and a decision ledger that make the call defensible long after the meeting ends.

The next chapter steps back to the rhythm that keeps all of this alive: the operating cadence that samples production, recalibrates reviewers, runs the gate, and feeds incidents back into the system week after week.

Key Takeaways

A gate is a contract written before the run, not a negotiation held after it.
A score move smaller than its confidence interval is noise. Report intervals and compare candidate to baseline case by case.
Gate on segments and failure classes, not the global average. Make critical failures and reopened incidents absolute blocks.
Name four decisions, not two. Narrow rollout and hold let evidence shape the release without shipping a regression.
Run offline, shadow, and canary gates in series, with automatic rollback as the production safety mechanism.
Gate calibrated confidence with a proper scoring rule like Brier, not raw accuracy, so overconfidence cannot pass as improvement.
Every release decision feeds a memo and a ledger so the call stays defensible weeks later.