AN Alpesh Nakrani
BlogBooksPraiseAbout Work with me →
Book overview
Chapter 10 / Points of View

Proving Good Enough With Evals

You cannot defend choosing a small model without measuring it, and measuring it well is a discipline most teams skip.

Research spine: this chapter stays grounded in DistilBERT result and FrugalGPT work, then applies that evidence to the operating judgment in the book. The argument that ended every small-model proposal at one company I worked with was three words long: "but is it good enough?" And the reason the argument always won was that nobody could answer it. They had a frontier model in production, it seemed fine, and any proposal to replace it with something smaller ran into a wall of "we do not know if the small one is as good, so let us not risk it." That wall is built entirely out of the absence of measurement. A team that cannot measure quality will always default to the most capable model out of fear, because fear is the only information they have - and that fear is also what blinds them to the silent failure modes catalogued in When Small Models Are Wrong. Evaluation is what replaces the fear with a number, and a number is something you can actually make a decision with.

This chapter is about that discipline. It is the keystone of the whole book, because every framework so far, the Suitability Test, the Capability Waste Audit, the cascade, the quantization quality gate, depends on being able to answer "is it good enough" with evidence. Without evals, this book is a set of nice ideas you are too scared to act on. With evals, it is a decision procedure.

What "good enough" actually means

The first move is to reject the question as usually asked. "Is the small model as good as the frontier model" is the wrong question, because it sets the frontier model as the bar, and the frontier model is almost certainly better than you need. The right question is "is the small model good enough for this task at our standard," and that requires you to define the standard, which most teams never do.

Defining the standard means answering three things, in order:

What does correct mean for this task? For a classifier, it might be accuracy or F1 against labeled ground truth. For extraction, field-level exact match or a tolerance for numeric fields. For generation, it gets harder, and we will get to that. The point is that "correct" has to be operationalized into something you can compute, or you cannot measure anything.

What is the threshold? Not "as good as possible," but the actual bar below which the feature is unacceptable. This is a product decision, and it should account for the consequence of being wrong. A spam filter that is 95 percent accurate might be fine; a medical triage classifier at 95 percent might be a disaster. The threshold encodes how much error the use case tolerates, and it is the number against which "good enough" is judged.

What is the cost of the errors you will make? Not all errors are equal. A false positive and a false negative often have very different consequences, and the threshold should reflect the asymmetry. A model that is good enough on aggregate accuracy can be unacceptable if it makes the expensive kind of error too often. Measure the errors that matter, weighted by what they cost, not just the count.

Once you have answered these three, "is it good enough" becomes a computable question, and the small-model decision stops being a matter of nerve.

A neat demo set versus a varied eval set, both measured by one ruler against candidate models
The demo set flatters every model; the eval set is the long-tail ruler that reveals where models diverge.

Build the eval set first

The most important artifact in this chapter is the evaluation set, and the most common mistake is building it last or not at all. Build it first, before you choose a model, because it is the ruler against which every candidate is measured and the demo set is a biased ruler that flatters whatever you point it at.

A good eval set has properties the demo set lacks:

  • It is drawn from real production inputs, not hand-picked examples. The demo set contains the cases that work; the eval set must contain the cases that break, because those are where models differ.
  • It covers the long tail, deliberately including the rare, weird, adversarial, and ambiguous inputs, because that is where a small model and a frontier model diverge and where your threshold gets tested.
  • It has ground-truth labels you trust, which is real work: human labeling, expert review, or careful construction. The eval set is only as good as its labels, and a sloppily labeled eval set will lie to you confidently.
  • It is held out from any training or prompt-tuning, so you are measuring generalization, not memorization. A model tuned on its own eval set will look better than it is.
  • It is versioned and stable, so a number measured today is comparable to a number measured next month. An eval set that drifts silently makes your quality trend meaningless.

This is unglamorous work and it is the work that makes everything else possible. A team with a good eval set can evaluate any candidate model in an afternoon and make the small-model decision on evidence. A team without one will argue about it forever and default to the frontier every time. The eval set is the cheapest insurance you can buy against the frontier-model reflex.

Evaluating against the frontier as a baseline

For tasks where defining ground truth is hard, especially open-ended generation, there is a pragmatic shortcut that fits the small-model project precisely: use the frontier model's output as the reference and measure how often the small model agrees with it. This is the agreement rate from the Capability Waste Audit, used now as the primary eval metric.

The logic is direct. If you are currently happy with the frontier model's outputs in production, then "matches the frontier model" is a reasonable proxy for "good enough," and you can measure it without building expensive ground truth. Run both models on the eval set, compare outputs, and the agreement rate tells you what fraction of traffic the small model would handle identically to the model you already trust. The disagreements are then worth human inspection, because they are either cases where the small model is genuinely worse (escalate them) or cases where the small model is actually fine and the frontier was over-delivering (free wins).

This connects directly to the cascade. The agreement rate on the eval set predicts the tier-hit distribution in production: if the small model agrees with the frontier on 92 percent of eval cases, roughly 8 percent will need escalation, and you can size the frontier tier's cost accordingly before you ship. The eval set is not just a quality gate, it is the design input for the small-first cascade.

A few cautions with agreement-based evaluation. The frontier model is not infallible, so on cases where it is wrong, agreement with it is not the goal. For high-stakes tasks, supplement agreement with real ground truth on a subset. And LLM-as-judge approaches, where you use a model to score outputs, are useful but must be validated against human judgment, because a judge model has its own biases and a judge that agrees with itself proves nothing. Treat the judge as a candidate that itself needs evaluating, not as an oracle.

An eval harness that does real work

The eval harness is the code that runs candidates against the eval set and reports the numbers that decide the model. Here is the shape of one comparing a small model against a frontier baseline:

def evaluate(model, eval_set, metric, frontier_baseline=None):
 results = {"correct": 0, "errors_by_type": {}, "latency_ms": [], "cost": 0.0}
 for case in eval_set:
 t0 = now()
 output = model.predict(case.input)
 results["latency_ms"].append(now() - t0)
 results["cost"] += estimate_cost(model, case.input, output)

 if metric.is_correct(output, case.ground_truth):
 results["correct"] += 1
 else:
 err_type = metric.classify_error(output, case.ground_truth)
 results["errors_by_type"][err_type] = \
 results["errors_by_type"].get(err_type, 0) + 1

 if frontier_baseline:
 results.setdefault("agreement", 0)
 if metric.agrees(output, frontier_baseline[case.id]):
 results["agreement"] += 1

 results["accuracy"] = results["correct"] / len(eval_set)
 results["p99_latency"] = percentile(results["latency_ms"], 99)
 return results

This harness does real work because it reports more than accuracy. It reports the error types, so you can see whether the small model fails in cheap ways or expensive ways. It reports p99 latency, not the average, tying back to the latency chapter. It reports cost, so the quality number sits next to the price. And it reports agreement with the frontier baseline. A single accuracy number hides all of this. The model decision needs all of it, because a small model that is slightly less accurate but dramatically cheaper and faster, failing only in cheap ways, is often the right choice, and you can only see that when the harness reports the full picture rather than one number.

The decision the evals enable

With the eval set built and the harness reporting the full picture, the small-model decision becomes a table you can put in front of anyone, including the person who keeps asking "but is it good enough."

CandidateAccuracyAgreement w/ frontierp99 latencyCost/1k reqExpensive errors
Frontier (baseline)96%100%1400 ms$X0.4%
Fine-tuned small94%93%45 ms$X/200.5%
Small + frontier cascade96%99%90 ms (p99)$X/80.4%

Now the conversation is evidence. The cascade matches the frontier's accuracy and expensive-error rate at a fraction of the cost and latency. The standalone small model gives up two points of accuracy for a 20x cost reduction, which is the right call for some features and not others depending on the threshold and error costs you defined. Either way, "is it good enough" has an answer, and the answer is a number measured on your data at your standard. The fear is gone, replaced by a decision you can defend in any room.

What to do tomorrow

Pick the feature where someone keeps blocking a smaller-model proposal with "but is it good enough," and build the eval set that answers the question. A few hundred real, labeled, held-out cases including the hard tail. Run your current frontier model and a small candidate through a harness that reports accuracy, error types, p99 latency, cost, and frontier agreement. Put the table in front of the person who has been blocking. You will either prove the small model is good enough, in which case you ship it, or prove it is not, in which case you have a principled reason to keep the frontier and you have stopped guessing. Both outcomes are wins. Guessing was the only loss.

Key Takeaways

  • A team that cannot measure quality defaults to the most capable model out of fear. Evals replace fear with a number, which is the only thing you can actually make a decision with.
  • Reject "is the small model as good as the frontier." Ask "is it good enough at our standard," which requires defining what correct means, the threshold, and the cost of each error type.
  • Build the eval set first, from real production inputs including the long tail, with trusted ground-truth labels, held out and versioned. The demo set flatters every model; the eval set is where models diverge.
  • For hard-to-label tasks, use frontier agreement as a proxy metric: it measures what fraction of traffic the small model handles identically to the model you already trust, and it predicts the cascade's tier-hit distribution.
  • A good eval harness reports more than accuracy: error types, p99 latency, cost, and frontier agreement. The full picture often shows a small model or cascade that is the right choice in ways a single accuracy number hides.
Share