
2026 / Free online book · Field Manuals
A Field Guide to Evals
Measuring what will actually break in production
Access: Free
Chapters: 6
Read time: 75 min
An eval that only passes is not telling you anything. This manual builds an evaluation harness from production traffic up: what to sample, how to label without lying to yourself, and how to read a number you can defend in a review. Code, not philosophy.
Most eval suites measure the wrong thing and pass right up until launch. This is the harness I trust before I ship.
This edition is free to read on this site. Each chapter has its own URL, so readers can bookmark, share, and return to the exact section they need.
Table of contents
01 Why Evals Fail in Production · 10 min
Why demos, benchmarks, and small handpicked test sets fail to predict production behavior.

02 Building the Reference Set · 14 min
How to construct a living evaluation set from real work, incidents, synthetic pressure tests, and policy-sensitive cases.

03 Graders, Rubrics, and Human Review · 14 min
How to score model behavior without pretending judgment is simpler than it is.

04 Regression Gates and Release Decisions · 13 min
How to connect eval results to ship, hold, rollback, and narrow-rollout decisions.

05 Operating Cadence · 11 min
A practical operating rhythm for keeping evaluation suites aligned with production traffic, release risk, and business decisions.

06 The First Ninety Days of Eval-Led Delivery · 13 min
A staged plan for moving from scattered examples to a maintained evaluation system.
