Eval-Driven Development: The Test Suite Leads

Eval-driven development is TDD for probabilistic systems: write the eval first, gate every deploy on a frozen eval set, and treat the suite as the spec.

Eval-driven development is TDD for probabilistic systems. You write the eval before the prompt or model change, let a frozen eval set gate every deploy, and treat the eval suite as the source of truth the model is optimized against. It is the discipline that replaces test-driven development when the unit under test is a distribution, not a deterministic function.

I started taking this seriously after watching a prompt change ship on a teammate's eyeball check. The output looked better on the three examples he tried. In production it was worse on a class of inputs none of us had thought to type. There was no test to catch it, because the old habit said tests are for code with one right answer. An LLM call does not have one right answer. It has a distribution of answers, and you can only reason about it by sampling and measuring. That is what eval-driven development is for.

TDD asserts one correct output. Eval-driven development measures a distribution of outputs against a threshold you set first.

Key takeaways

Eval-driven development is TDD for probabilistic systems: write the eval before the prompt, gate every deploy on a frozen eval set, and treat that set as the spec the model is optimized against.
The unit of testing changes from a binary assert on one output to an aggregate score on a sample of cases versus a threshold you set before the run.
One run is not a measurement. At 100 cases and 80% accuracy a 95% confidence interval is roughly plus or minus 8 points, so green today can be red tomorrow with no code change.
Keep the gate cheap by running it as a pyramid: deterministic checks on every commit, a classifier-backed sweep on the regression set, and an expensive LLM judge on a sampled subset.
The eval set is a revenue asset, not hygiene. It is the only thing that lets you ship a model upgrade and prove, with evidence, that this week's model is not worse than last quarter's.

How eval-driven development differs from classic TDD

Classic TDD works because a deterministic function has a single correct output for a given input. You assert add(2, 2) == 4, the test is binary, and a passing test today passes forever unless the code changes. That contract is the whole reason TDD is trustworthy. None of it survives contact with a language model.

An LLM call is stochastic. The same input can yield different outputs across runs, models, temperatures, and prompt edits. There is no single string to assert against; there are thousands of acceptable ones and a long tail of bad ones. A binary assertion on one example proves nothing about the system, because one sample tells you almost nothing about a distribution. This is the gap eval-driven development exists to close (Braintrust, eval-driven development).

So the unit of testing changes. Instead of one input and one expected output, an eval scores many inputs against a rubric and reports an aggregate: accuracy, faithfulness, a calibrated judge score, whatever maps to real failure. Instead of pass/fail on a string, you set a threshold before the run and the gate compares the aggregate to it. A single passing case is an anecdote. A score on a frozen set of 300 cases is a measurement.

The table below is the translation I keep in my head when moving an engineer from TDD to evals.

Classic TDD	Eval-driven development
One input, one correct output	Many inputs, a distribution of acceptable outputs
Binary assert (==)	Aggregate score vs. a threshold set first
Deterministic; flake is a bug	Stochastic; variance is expected and measured
One run is enough	Sample size and confidence interval matter
Green forever unless code changes	Decays as the world and the model drift
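
To make the translation concrete, here is a minimal sketch of the two styles side by side. The names fake_model, run_eval, and THRESHOLD are hypothetical, and the model is a random stub standing in for a real LLM call, so treat this as an illustration of the shape and not a real harness.

```python
import random


def add(a, b):
    return a + b


# Classic TDD: one input, one correct output, binary assert.
assert add(2, 2) == 4


def fake_model(prompt, rng):
    """Hypothetical stand-in for an LLM call: acceptable about 85% of the time."""
    return "acceptable" if rng.random() < 0.85 else "bad"


def run_eval(cases, model, scorer, rng):
    """Score every case and return the aggregate pass rate."""
    passed = sum(1 for case in cases if scorer(model(case, rng)))
    return passed / len(cases)


# Eval-driven: many inputs, an aggregate score, a threshold set before the run.
THRESHOLD = 0.80  # illustrative bar, chosen before looking at any score
rng = random.Random(0)  # seeded only so the sketch is repeatable
cases = [f"case-{i}" for i in range(300)]

score = run_eval(cases, fake_model, lambda out: out == "acceptable", rng)
gate_passed = score >= THRESHOLD
print(f"score={score:.3f} threshold={THRESHOLD} passed={gate_passed}")
```

The assert is true or false. The eval produces a number, and the decision lives in the comparison against a bar that was written down first.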

The workflow: eval first, then the prompt

The order is the whole point. In eval-driven development the eval comes first, before the prompt, before the pipeline, before you pick a model. You write down what good looks like as a scored test, then you build the thing that passes it. Same loop as TDD, different unit. The manifesto version of this is blunt: if your evals do not run on every change, they do not exist (evaldriven.org).

Here is the loop I run.

Write the eval first. Define the failure mode in a rubric and a metric before you touch the prompt. "This summary must not invent a number that is not in the source" is a testable claim. "Make it better" is not.
Freeze a golden set. Pull real inputs from production, version the set as an artifact, and never grow it casually. The set is your ruler; a ruler you keep editing measures nothing.
Set the threshold before the run. Every threshold needs a justification you wrote down first. Picking the bar after you see the score is rationalization, not evaluation.
Make the change, then run the suite. Edit the prompt or swap the model, run the frozen set, and read the aggregate against the gate.
Gate the deploy in CI. The eval runs next to lint and type-check. A change that drops the score below the threshold is blocked from merge, automatically.

That last step is where eval-driven development stops being a notebook habit and becomes engineering. OpenAI's own regression workflow stores completions per prompt version and compares runs so a prompt edit that degrades quality is caught as a regression before it ships (OpenAI Cookbook, detecting prompt regressions). The eval suite becomes a merge-blocking gate, the same role your unit tests play for deterministic code.
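
The five-step loop above can be sketched as a small gate script. This is an assumption-heavy illustration: GOLDEN_SET, THRESHOLD, candidate_summarize, and the regex-based rubric are all hypothetical, and a real set would be pulled from production and stored as a versioned artifact, not inlined.

```python
import hashlib
import json
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def invents_number(source, summary):
    """Rubric: True if the summary contains a number that is not in the source."""
    source_numbers = set(NUMBER.findall(source))
    return any(n not in source_numbers for n in NUMBER.findall(summary))


def fingerprint(cases):
    """Short stable hash of the golden set, so any edit to the ruler is visible."""
    blob = json.dumps(cases, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]


def gate(cases, summarize, threshold):
    """Run the frozen set and compare the aggregate to the pre-set threshold."""
    clean = sum(
        1 for c in cases if not invents_number(c["source"], summarize(c["source"]))
    )
    score = clean / len(cases)
    return score, score >= threshold


# Hypothetical frozen set, inlined only to keep the sketch self-contained.
GOLDEN_SET = [
    {"id": "a", "source": "Revenue grew 12% to 3.4 million."},
    {"id": "b", "source": "The team shipped 3 releases in 2024."},
]
THRESHOLD = 1.0  # illustrative: zero tolerance for invented numbers


def candidate_summarize(source):
    """Hypothetical stand-in for the prompt plus model under test."""
    return " ".join(source.split()[:4])


def main():
    score, ok = gate(GOLDEN_SET, candidate_summarize, THRESHOLD)
    print(f"golden_set={fingerprint(GOLDEN_SET)} score={score:.2f} ok={ok}")
    return 0 if ok else 1  # a non-zero exit code is what blocks the merge
```

In CI the entry point would be something like raise SystemExit(main()), run next to lint and type-check. Printing the fingerprint with the score is one way to make a casually edited golden set show up in the logs.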

If your evals do not run on every change, they do not exist. Evaluation belongs in CI, not in a notebook someone opens quarterly.

Make the gate cheap enough to run on every change

"Run it on every change" sounds expensive, and naively it is. An LLM-as-a-judge call runs roughly 5 to 50 cents per case, so a 300-case suite of judged evals on every commit gets costly fast and slow enough that people start skipping it. A gate everyone bypasses is not a gate. The fix is to stop treating every eval as equally expensive.

I run the suite as a pyramid, the same shape as the test pyramid you already trust.

Every commit: deterministic checks. Schema validation, regex and string assertions, "did it cite a real source ID," refusal detection. These cost nothing, run in milliseconds, and catch the dumb regressions immediately.
Every PR: a classifier-backed sweep. Cheap learned scorers on the full regression set run at a fraction of a cent each, so you get an aggregate quality delta on every merge without paying judge prices.
Nightly or pre-release: the LLM judge on a sample. Reserve the expensive judged eval for the high-stakes rubrics on a sampled subset, and run it on a schedule, not on every keystroke.

This is the difference between a gate that ships and a gate that becomes theater. Cheap, fast, and statistically significant is a pick-two on any single tier, so you spread the three goals across the tiers instead of demanding all three on every commit. The deterministic tier buys speed, the classifier tier buys breadth, the judge tier buys depth, and the deploy is gated on the combination.
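
As a rough sketch of the bottom tier and the routing between tiers: the checks below are plain string and schema assertions with no model call. The output schema, the citation format, KNOWN_SOURCE_IDS, and the TIERS mapping are all assumptions made for illustration; in particular, running the cheaper tiers again at the more expensive triggers is my choice, not something the pyramid requires.

```python
import json
import re

# Hypothetical: which scorers run at which CI trigger.
TIERS = {
    "commit": ["deterministic"],
    "pull_request": ["deterministic", "classifier"],
    "nightly": ["deterministic", "classifier", "llm_judge_sample"],
}

KNOWN_SOURCE_IDS = {"doc-1", "doc-2"}  # hypothetical retrieval corpus
REFUSAL = re.compile(r"\b(i can't|i cannot|i'm unable)\b", re.IGNORECASE)
CITATION = re.compile(r"\[(doc-\d+)\]")


def deterministic_checks(raw_output):
    """Return a list of failure reasons; an empty list means the output passed."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(payload, dict) or not isinstance(payload.get("answer"), str):
        return ["missing string field 'answer'"]

    answer = payload["answer"]
    failures = []
    if REFUSAL.search(answer):
        failures.append("looks like a refusal")
    cited = CITATION.findall(answer)
    if not cited:
        failures.append("no citation")
    # "Did it cite a real source ID": every cited ID must exist in the corpus.
    failures.extend(
        f"unknown source id {c}" for c in cited if c not in KNOWN_SOURCE_IDS
    )
    return failures


print(deterministic_checks('{"answer": "Paris is the capital [doc-1]"}'))
print(deterministic_checks('{"answer": "See [doc-9]"}'))
```

Returning reasons instead of a bare boolean keeps the per-commit tier useful as a debugging signal, not only as a gate.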

The eval set is the source of truth, so write it like a spec

Once evals gate every deploy, the eval set quietly becomes the real specification. The prompt is just the current implementation that happens to pass it. Swap the model, rewrite the prompt, change vendors: the eval set is what carries the definition of correct across all of it. This is the same move I argue for in treating the spec as the source, applied to probabilistic systems. The evals are the executable spec.

That reframing has teeth. It means the eval set deserves more review than the prompt, because the prompt is disposable and the eval set is the asset. It means a vague rubric is a vague spec, and the model will optimize toward whatever the rubric actually rewards, including the parts you wrote sloppily. Evals are engineered, not generated. Every metric maps to a failure mode, every threshold has a reason.

It also changes what "the model is optimized against" means. When you tune a prompt to lift the eval score, you are fitting to the eval set. If that set is a faithful sample of production, you are improving the product. If it is a handful of cases someone imagined, you are overfitting to fiction and the production gap will find you. The quality of your evals is the ceiling on the quality of your system. I build the full version of this argument in Eval-Driven Development, and the harness it sits on in how to build an LLM evaluation framework.

Where eval-driven development is genuinely hard

I will not pretend this is free. Eval-driven development is harder than TDD in ways that are structural, not just unfamiliar, and anyone selling it as painless has not run it under deadline.

The first cost is statistics. One run is not enough, and neither is a small set. At 100 cases and 80% accuracy, a 95% confidence interval is roughly plus or minus 8 points, so two prompts that score 79 and 83 are statistically indistinguishable. Detecting a real 2-point improvement with confidence can take thousands of cases per arm, not a handful. Outputs also vary run to run, so a green eval today can be red tomorrow with no code change, which feels exactly like a flaky test and is not a bug you can fix. You handle it by sampling: run enough cases, aggregate, and report a confidence interval, not a single number. Pinning temperature to zero helps reproducibility but narrows what you measure. Doing this right means treating eval results as estimates with error bars, which most engineers were never trained to do (Cameron Wolfe, applying statistics to LLM evals).
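
The arithmetic behind those numbers is short enough to check by hand. The sketch below uses the normal (Wald) approximation for the interval and a standard two-proportion sample-size approximation at a two-sided 5% significance level and 80% power. Those method choices and the function names are mine, picked for simplicity; other intervals (Wilson, bootstrap) give slightly different figures.

```python
import math

Z_95 = 1.96  # two-sided 95% critical value of the standard normal


def margin_of_error(accuracy, n, z=Z_95):
    """Half-width of the normal-approximation interval for a pass rate."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n)


def cases_per_arm(p_base, p_new, z_alpha=1.96, z_power=0.84):
    """Approximate cases per arm needed to detect p_base -> p_new.

    z_alpha=1.96 is a two-sided 5% test; z_power=0.84 is roughly 80% power.
    """
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p_new - p_base) ** 2)


# 100 cases at 80% accuracy: roughly plus or minus 8 points.
print(round(100 * margin_of_error(0.80, 100), 1))  # 7.8

# A real 2-point lift from 80% to 82%: about six thousand cases per arm.
print(cases_per_arm(0.80, 0.82))
```

The margin shrinks with the square root of the sample size, which is why going from 100 to 400 cases only halves the error bar.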

The second cost is the judge. To score open-ended output at volume you usually need an LLM grading the LLM, and that grader has its own measurable failure modes. Position bias hands the first option in a pairwise comparison a 10 to 15 point edge. Verbosity bias rewards the longer answer at matched quality. Self-preference inflates a model's score on its own family's output. The judge is itself a probabilistic system that needs evaluating, so eval-driven development can recurse on you. The defensible mitigation is to calibrate the judge against blinded human labels and, for launch decisions, run an ensemble of judges from different model families so family-specific biases partly cancel. I cover when to trust it in my full guide to LLM evaluation; the short version: trust the judge only where it agrees with people.
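
One way to put a number on "agrees with people" is to compare judge verdicts to blinded human labels on the same cases. The sketch below reports raw agreement and Cohen's kappa, which corrects for agreement expected by chance. Kappa is my choice of statistic for illustration, and the labels are made-up sample data, not results.

```python
from collections import Counter


def agreement_and_kappa(human, judge):
    """Return (raw agreement, Cohen's kappa) for two equal-length label lists."""
    if not human or len(human) != len(judge):
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(1 for h, j in zip(human, judge) if h == j) / n

    # Chance agreement: probability both raters pick the same label independently.
    human_freq, judge_freq = Counter(human), Counter(judge)
    expected = sum(human_freq[label] * judge_freq[label] for label in human_freq) / (n * n)

    if expected == 1.0:  # both raters used one identical label throughout
        return observed, 1.0
    return observed, (observed - expected) / (1 - expected)


# Made-up labels for eight cases, for illustration only.
human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]

observed, kappa = agreement_and_kappa(human, judge)
print(f"agreement={observed:.2f} kappa={kappa:.2f}")  # agreement=0.75 kappa=0.47
```

Raw agreement looks flattering when one label dominates, which is the reason to report a chance-corrected figure next to it. Where the bar for "trustworthy" sits is a decision to write down before the calibration run, like any other threshold.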

The third cost is upkeep. The eval set decays. The world shifts, the model provider updates the underlying weights, and your frozen ruler slowly stops measuring the present. A frozen set is honest but goes stale; refreshing it costs real labeling time and risks resetting your baseline. There is no version of this where you write the evals once and walk away. That is the trade-off, stated plainly: eval-driven development trades a one-time test-writing cost for an ongoing measurement discipline you fund forever.

Why this is a revenue decision, not a hygiene one

The gate is cheaper than the churn. In a deterministic product a bad deploy is one bug and one rollback. In an LLM product a bad deploy is a quiet quality regression that erodes the trust the product is sold on, and you often cannot see it until renewal conversations get strange. The frozen eval set is the only thing that lets you say, with evidence, that this week's model is not worse than last quarter's on the exact inputs your customers send.

That sentence is worth money. It is the difference between shipping model upgrades with confidence and freezing on an old version because nobody can prove the new one is safe. Eval-driven development is what makes a probabilistic feature governable enough to keep selling. When generation is cheap, the durable advantage is being able to tell good output from bad at scale, and the eval suite is how you operationalize that judgment. The same harness powers A Field Guide to Evals and the production gating in my essay on evals that predict production.

FAQ

What is eval-driven development?

Eval-driven development is a methodology where you write an evaluation before changing the prompt or model, gate every deploy on a frozen eval set, and treat that set as the source of truth the system is optimized against. It is TDD adapted for probabilistic systems: instead of asserting one correct output, you score many outputs against a threshold you set before the run.

How is eval-driven development different from test-driven development?

TDD asserts a single deterministic output and one passing run is enough. Eval-driven development measures a distribution, because an LLM gives different outputs across runs and prompts. You score a sample of cases against a rubric, compare the aggregate to a pre-set threshold, and account for variance with sample sizes and confidence intervals rather than a binary pass or fail.

When should I use evals as tests instead of unit tests?

Use unit tests for the deterministic code around the model: parsing, routing, formatting, tool plumbing. Use evals for anything where the model's output is the thing under test and there is no single correct string. Most real LLM features need both, with the evals running in CI as a merge-blocking gate alongside the unit tests.

Does eval-driven development slow teams down?

It adds an upfront cost: writing the eval, freezing a golden set, and funding ongoing measurement as the set drifts. It removes a larger cost: shipping a regression you cannot see until a customer does. For a probabilistic feature you intend to keep changing, the gate is cheaper than the silent quality decay it prevents.

Isn't running evals on every commit too expensive?

Only if you run the expensive evals on every commit. Run the suite as a pyramid: free deterministic checks on every commit, cheap classifier-backed scorers on the regression set per PR, and the costly LLM judge on a sampled subset nightly or before release. That keeps the per-commit cost near zero while still gating the deploy on a real quality signal.

If you are adopting eval-driven development, start small: write one eval for your highest-risk feature, freeze a golden set from real traffic, and wire it into CI as a gate before you expand. When you want that discipline built into a team that ships probabilistic features for real, that is the work my Devlyn pods do, with AI engineers who treat the eval suite as the spec. Write the eval first, let the test suite lead, and trust the number over the demo.