How to Build an LLM Evaluation Framework

A good LLM evaluation framework tests what will break in production: a golden set from real traffic, task metrics, blinded rubrics, and a drift cadence.

A good LLM evaluation framework tests what will actually break in production. That means a golden set sampled from real traffic, task-specific metrics instead of generic ones, blinded human rubrics, and a measurement cadence that catches drift before a customer does, all working together. Everything else is decoration.

I have watched teams ship LLM features behind an eval suite that reported ninety-percent accuracy and a green CI badge, then field a Sev-1 the same week. The suite was not wrong. It was measuring the wrong distribution. An LLM evaluation framework is not a dashboard you bolt on at the end. It is the harness that decides whether a model candidate earns a production deploy, and it has to be built with the same care as the feature it gates.

This is the build process I trust. It sits under my complete guide to LLM evaluation and extends the argument in my essay on evals that predict production, which covers the sampling problem in depth. Here I focus on assembly: what the framework must contain, and how to stand each piece up.

Key takeaways

An LLM evaluation framework is the gate between a model candidate and a deploy, not a dashboard you add at the end.
It has five parts: a frozen golden set from real traffic, task-specific metrics, an offline CI layer, an online layer, and a calibrated scoring engine behind a fixed gate.
Sample the golden set from production and oversample the hard tail; a set built from imagined cases passes right up until launch.
An automated judge handles volume only after it hits 85 to 90 percent agreement with a blinded human rubric; below that it is guessing.
Score components, not just the end-to-end answer. A strong model papers over retrieval collapse until the day it can't, and the gate is cheaper than the churn.


What an LLM evaluation framework must contain

Most eval content lists metrics. That is the part that matters least. A framework is the system around the metrics, and a usable one has five components. Skip any of them and the number you report stops meaning anything.

A golden set sampled from real production traffic, frozen and versioned, never grown organically.
Task-specific metrics chosen for the failure modes that hurt this feature, not a generic accuracy score.
An offline layer that runs in CI and gates the deploy, plus an online layer that scores live traffic with the same metrics.
A scoring engine: an automated judge for volume, calibrated against a blinded human rubric for ground truth.
A cadence and a gate: a fixed threshold agreed before the run, and a refresh schedule that catches drift.

The pattern is now standard across teams that ship LLM features for real. The recommended production setup moves a single golden dataset through local development, a pre-merge gate, a deploy gate, and live monitoring, with the same evaluator at every stage so pre-launch and post-launch scores compare directly (Datadog, LLM evaluation framework best practices). One ruler, four checkpoints.

Building the golden set from real traffic

The golden set is the foundation, and it is where most LLM evaluation frameworks go wrong on day one. Teams build the set bottom-up: a developer writes cases while building the feature, a PM adds a few during review, and the set accumulates. The result reflects what the team imagined users would do, not what users do. Those two distributions diverge fast.

Sample the set from production instead. Pull a stratified slice of real requests, freeze it as a named artifact, and version it like code. A practical size is 200 to 500 cases that cover the feature's full operational envelope, each pairing an exact input with a reference output (Maxim, golden dataset guide). Build it from real failures, not synthetic happy-path examples.

Over-weight the hard tail on purpose. The cases that break production are not randomly distributed; they cluster. I deliberately oversample four buckets: bottom-quartile model confidence, cases a human reviewer corrected, syntactically adversarial input, and anything that previously caused an incident. That last bucket is the one teams skip because the incident feels resolved. It is not resolved until a future model passes those exact cases on a held-out set.
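A minimal sketch of that oversampling, assuming each logged request already carries a bucket label. The bucket names, the target shares, and the `sample_golden_set` helper are illustrative assumptions, not values from a real system; tune the shares to your own incident history.

```python
import random
from collections import defaultdict

# Hypothetical target shares. The four hard-tail buckets mirror the ones
# described above; "typical" is everything else.
TARGET_SHARE = {
    "low_confidence": 0.20,
    "human_corrected": 0.15,
    "adversarial": 0.10,
    "past_incident": 0.10,
    "typical": 0.45,
}

def sample_golden_set(requests, size, seed=0):
    """Draw a stratified sample that over-weights the hard-tail buckets.

    `requests` is an iterable of dicts with at least "id" and "bucket" keys.
    If a bucket holds fewer cases than its quota, all of its cases are taken,
    so the result can be smaller than `size`.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    by_bucket = defaultdict(list)
    for req in requests:
        by_bucket[req["bucket"]].append(req)

    sample = []
    for bucket, share in TARGET_SHARE.items():
        pool = by_bucket.get(bucket, [])
        quota = round(size * share)
        sample.extend(rng.sample(pool, min(quota, len(pool))))
    return sample
```

With these shares, the hard tail makes up more than half of the set even if it is a small fraction of traffic. The fixed seed matters: anyone should be able to regenerate the same draw from the same traffic log.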

Freezing the set has a consequence worth stating plainly: the only way to raise your score on a frozen set is to ship a better model. There is no sneaking in easy cases to lift the number. That is the point. You want a fixed ruler, not a rubber band. The trade-off is honest: oversampling the tail makes your aggregate accuracy look worse than a uniform sample would. Ship the model that passes the hard set, not the one with the prettiest average.
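One way to make the freeze enforceable is to record a content hash when the set is cut and check it before every run. This is a sketch under assumptions: the `fingerprint` and `verify_frozen` helpers are hypothetical, and it assumes each case is a JSON-serializable dict with a unique "id".

```python
import hashlib
import json

def fingerprint(cases):
    """Return a SHA-256 hex digest of the cases in a canonical form.

    Cases are sorted by "id" and serialized with sorted keys, so the digest
    depends only on content, not on file order or key order.
    """
    canonical = json.dumps(
        sorted(cases, key=lambda c: c["id"]),
        sort_keys=True,
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def verify_frozen(cases, expected_digest):
    """Raise if the set no longer matches the digest recorded at freeze time."""
    actual = fingerprint(cases)
    if actual != expected_digest:
        raise ValueError(
            f"golden set changed: expected {expected_digest[:12]}, got {actual[:12]}"
        )
```

Store the digest next to the versioned file name and call `verify_frozen` at the top of the eval run. Any added, removed, or edited case then fails loudly instead of quietly moving the ruler.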

Choosing task-specific metrics

The metrics are the part of the framework you tune to the feature, and a generic accuracy score is almost always the wrong choice. The question is not "is the output good." It is "what does broken look like for this task, and which number drops when it happens." Answer that first, then pick the metric.

Different tasks fail differently, so they need different instruments. A few that map cleanly to common LLM features:

Retrieval (RAG): recall and precision on the retrieved context, scored separately from answer quality. Faithfulness, whether the answer is grounded in what was retrieved, catches confident hallucination.
Classification or extraction: per-class precision and recall, because aggregate accuracy hides a class that fails completely while the average stays high.
Agents: per-step success and tool-call correctness, not just whether the final answer landed. A lucky end result over a broken trajectory is not success.
Open-ended generation: a rubric scored by a calibrated judge, plus a hard safety and refusal check that blocks regardless of quality.
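For the retrieval case, the component metrics are small enough to write by hand. The sketch below assumes each golden-set case lists the IDs of the documents a correct answer needs; the function names and the choice to return `None` for cases with no relevant documents are assumptions.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return None  # undefined when nothing is relevant; skip the case
    hits = relevant.intersection(retrieved_ids[:k])
    return len(hits) / len(relevant)

def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top k results that are relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)

# Illustrative case: two of the four needed documents were retrieved.
retrieved = ["d1", "d7", "d3", "d9", "d4"]
relevant = ["d1", "d3", "d5", "d8"]
print(recall_at_k(retrieved, relevant, 5))     # 0.5
print(precision_at_k(retrieved, relevant, 5))  # 0.4
```

Both numbers are computed before the model writes a word, which is what lets them move independently of answer quality.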

Pick the two or three metrics that map to real failure, set a floor for each, and resist the urge to track twenty. A framework with too many metrics produces a dashboard nobody reads and no clear gate. The discipline is choosing the few numbers you will actually block a deploy on. I cover the full menu in the metrics that matter and the ones that lie. Teams running this at scale, like Booking.com's engineering org, report the same lesson: a few production-anchored metrics beat a long dashboard nobody gates on.

The offline and online layers

An LLM evaluation framework needs two layers, because a model can pass every offline check and still rot in production. Offline catches regressions before launch. Online catches drift after it.

The offline layer runs against the frozen golden set, in CI, on every model candidate. It produces a go/no-go number for the deploy gate. This is the layer that answers "is this candidate better or worse than what we shipped six months ago, on exactly the same questions." Most teams cannot say that sentence with confidence. The frozen set is what lets you.

The online layer scores live traffic with the same metrics, then watches for drift on top of those scores. Sample 5 to 10 percent of real requests, score them with your automated evaluator, and alert when the distribution shifts (OpenObserve, LLM monitoring best practices). Use statistical monitoring for input distribution and semantic drift detection for output meaning. You usually want both.
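A rough sketch of the sampling and the statistical half of that monitoring. Two assumptions to flag: selecting requests by hashing a request ID is one common way to get a stable sample, and the Population Stability Index (PSI) is a technique I am swapping in as the shift measure, not something the cited guidance prescribes. The function names and bin edges are hypothetical.

```python
import hashlib
import math

def in_sample(request_id, rate=0.05):
    """Deterministically select about `rate` of requests by hashing the ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1]
    return bucket < rate

def psi(baseline, current, edges):
    """Population Stability Index between two non-empty lists of scores.

    `edges` are the interior bin boundaries, e.g. [0.25, 0.5, 0.75] gives
    four bins. A small epsilon keeps empty bins from producing log(0).
    """
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v >= e for e in edges)] += 1
        return [max(c / len(values), 1e-6) for c in counts]

    base, cur = shares(baseline), shares(current)
    return sum((c - b) * math.log(c / b) for b, c in zip(base, cur))
```

Compare this week's sampled evaluator scores against the scores from launch week and alert when the PSI crosses a threshold you pick from your own history. This covers score and input distributions only; semantic drift in output meaning needs a separate check, for example on embeddings.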

Here is why two layers and not one. End-to-end answer quality hides component failure. A RAG answer can read well while retrieval has quietly collapsed, because a capable model papers over thin context. If you only score the final answer, recall can fall for weeks before the answer quality metric notices. I have watched retrieval decay in month three while the surface metric stayed flat. Score the components, not just the output.


Automated judge versus human rubric

You cannot put a human on every output; that does not scale and turns the reviewer into a bottleneck, then a rubber stamp. You also cannot trust an automated judge you have not validated. The framework needs both, in the right roles: the automated judge handles volume, the human rubric defines ground truth and keeps the judge honest.

LLM-as-a-judge is the standard way to score at volume, but it is not free of failure modes. Research has documented position bias, where the judge favors a response based on where it sits in the prompt, and self-inconsistency across repeated runs (Judging the Judges, arXiv 2406.07791). Treat the judge as an instrument that needs calibration, not as a source of truth.

Calibration is the step that makes a judge usable. Validate it against a human-labeled slice of the golden set, and only trust it for the metrics where it hits 85 to 90 percent agreement with your raters. Below that, the judge is guessing and you do not know it. Re-run the calibration whenever you change the judge model or the rubric.
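The calibration check itself can be sketched in a few lines, assuming the human-labeled slice is available as (metric, human label, judge label) triples. The helper names are hypothetical, and the floor uses the lower end of the range above.

```python
from collections import defaultdict

AGREEMENT_FLOOR = 0.85  # lower end of the 85 to 90 percent range above

def agreement_by_metric(labels):
    """Compute judge-human agreement per metric.

    `labels` is an iterable of (metric, human_label, judge_label) tuples
    from the human-labeled calibration slice.
    """
    matches = defaultdict(int)
    totals = defaultdict(int)
    for metric, human, judge in labels:
        totals[metric] += 1
        matches[metric] += int(human == judge)
    return {m: matches[m] / totals[m] for m in totals}

def trusted_metrics(labels, floor=AGREEMENT_FLOOR):
    """Return the set of metrics where the judge meets the agreement floor."""
    return {m for m, rate in agreement_by_metric(labels).items() if rate >= floor}
```

Raw agreement can look high when one label dominates the slice, so it is worth reading it next to a chance-corrected statistic such as Cohen's kappa before trusting the judge on a skewed metric.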

For the human rubric, blind it. Strip the model version, shuffle output order, and plant a known proportion of reference human responses into the batch unlabeled. If raters start scoring the planted human answers below the bar they apply to model output, the rubric has drifted and you stop before scoring anything else. I treat rater disagreement on a cluster as a rubric signal, not noise. The full blinding protocol and the cases where rubrics quietly collapse are in A Field Guide to Evals.
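The mechanics of blinding can be sketched as follows. This is an illustration under assumptions, not the full protocol: the field names, the opaque ID format, and both helper functions are hypothetical.

```python
import random

def build_blinded_batch(model_outputs, human_references, seed=0):
    """Return (batch, answer_key) for a blinded rating session.

    Raters see only `batch`: items with an opaque ID and the text to score.
    `answer_key` maps each opaque ID back to its source and stays hidden
    until scoring is finished.
    """
    rng = random.Random(seed)
    pool = [("model", o["model_version"], o["text"]) for o in model_outputs]
    pool += [("human", None, text) for text in human_references]
    rng.shuffle(pool)  # remove any ordering cue between sources

    batch, answer_key = [], {}
    for i, (source, version, text) in enumerate(pool):
        item_id = f"item-{i:04d}"
        batch.append({"item_id": item_id, "text": text})
        answer_key[item_id] = {"source": source, "model_version": version}
    return batch, answer_key

def mean_score_by_source(scores, answer_key):
    """Unblind after scoring: average rater score for each source."""
    grouped = {}
    for item_id, score in scores.items():
        grouped.setdefault(answer_key[item_id]["source"], []).append(score)
    return {source: sum(vals) / len(vals) for source, vals in grouped.items()}
```

After the session, `mean_score_by_source` gives the comparison described above: if the planted human references score below the model outputs, treat that as a rubric problem before reading any model result.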

The honest trade-off: blinded human scoring is slow and expensive, and it does not scale to every release. That is exactly why "a human reviews it" is not a quality system on its own. The human calibrates the rubric and the judge; the judge does the volume. I make that argument in full in Eval-Driven Development, and the staffing version of it in why a human in the loop is not a plan.

The cadence and the gate

A framework without a gate is a report nobody acts on. The gate is a fixed threshold, agreed by product and engineering before the run, not after. Setting the threshold once you see the result is not evaluation; it is rationalization.

Here is an abbreviated output from an eval runner against a frozen set. The numbers are realistic, not from a specific live system.

# offline eval against frozen golden set
python -m eval.runner \
  --suite golden-set-2026-w24-v2.jsonl \
  --model prod-candidate-2026-06-15 \
  --judge judge-calibrated-v4 \
  --rater-pool senior-3

# results summary
cases evaluated        412
retrieval recall@5     0.78          # floor 0.80, FAIL
answer faithfulness    0.94          # threshold 0.90, pass
judge vs human agree   0.88          # floor 0.85, judge trusted
human disagree         5.9%          # threshold 8.0%, pass
adversarial tail       16.7%         # threshold 18.0%, pass
p95 latency            1,910 ms      # +130 ms vs baseline, flag
verdict                GATE BLOCKED  # recall@5 below floor

Read that output and notice what the framework caught. The end-to-end answer faithfulness passed at 0.94. If that were the only metric, this candidate ships. But retrieval recall@5 fell below its floor, so the gate blocks the deploy. The strong model was masking a retrieval regression that a single answer-quality score would have hidden until production. That is the component-scoring discipline earning its keep.
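The gate logic behind a verdict like that can be very small. The sketch below is an assumption about how such a runner might decide, not the runner's actual code: the metric keys and the `evaluate_gate` helper are hypothetical, and I am reading the faithfulness threshold as a floor and the human-disagreement threshold as a ceiling. The adversarial-tail and latency rows are left out because the output does not say how they gate.

```python
# Thresholds copied from the runner output above. Each check is
# (threshold, direction): "min" is a floor, "max" is a ceiling.
GATE = {
    "retrieval_recall_at_5": (0.80, "min"),
    "answer_faithfulness": (0.90, "min"),
    "judge_human_agreement": (0.85, "min"),
    "human_disagreement": (0.08, "max"),
}

def evaluate_gate(results, gate=GATE):
    """Return (passed, failures) for a dict of metric name -> value."""
    failures = []
    for name, (threshold, direction) in gate.items():
        value = results.get(name)
        if value is None:
            failures.append(f"{name}: missing")  # a missing metric blocks
        elif direction == "min" and value < threshold:
            failures.append(f"{name}: {value} below floor {threshold}")
        elif direction == "max" and value > threshold:
            failures.append(f"{name}: {value} above ceiling {threshold}")
    return (not failures, failures)

candidate = {
    "retrieval_recall_at_5": 0.78,
    "answer_faithfulness": 0.94,
    "judge_human_agreement": 0.88,
    "human_disagreement": 0.059,
}
passed, failures = evaluate_gate(candidate)
print("GATE PASSED" if passed else "GATE BLOCKED", failures)
```

The thresholds live in one constant that is committed before the run, so changing a floor after seeing a result shows up as a diff someone has to approve.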

Cadence is the second half. A frozen set decays as the world changes, so refresh it on a schedule, version each refresh, and run the old and new set in parallel for an overlap period to compare. The online layer's drift alerts tell you when to refresh early. This is where the framework connects to revenue: in an LLM product, a bad launch is not one incident, it is the erosion of the trust that makes the product sellable at all. The gate is cheaper than the churn. If you want that gate run as managed infrastructure, that is the core of Devlyn's AI observability and monitoring work.

FAQ

What should an LLM evaluation framework actually measure?

It should measure the failure modes that will break this specific feature in production, not a generic accuracy score. For RAG, score retrieval recall separately from answer quality. For agents, score each step and tool call, not just the final result. The anchor metric I report to leadership is model-versus-trusted-human disagreement on a frozen, production-sampled set, tracked over time, because it is anchored to a fixed distribution and to human performance.

How do I build a golden set for LLM evaluation?

Sample it from real production traffic rather than writing synthetic cases. Take a stratified slice of 200 to 500 real requests, oversample the hard tail (low confidence, human corrections, adversarial input, past incidents), pair each input with a reference output, then freeze and version the set. Never grow it organically. Cut a new version when you need new cases.

Is LLM-as-a-judge reliable enough to gate a deploy?

Only after you calibrate it. Validate the judge against a human-labeled slice and trust it only for metrics where it hits 85 to 90 percent agreement with your raters. Documented biases like position bias and self-inconsistency mean an uncalibrated judge can be confidently wrong. Use the judge for volume and a blinded human rubric for ground truth.

What is the difference between offline and online LLM evaluation?

Offline evaluation runs against a frozen golden set in CI and gates the deploy before launch. Online evaluation scores a sample of live traffic with the same metrics and watches for drift after launch. You need both: offline catches regressions, online catches the distribution shift that no frozen set predicts forever.

If you are standing up your first LLM evaluation framework, start with the golden set and the offline gate; those two pieces catch most of what kills launches. When you are ready to make evaluation the way your team works rather than a step at the end, Eval-Driven Development is the long-form version of this harness. Build the framework that predicts production, freeze it, and trust the number it gives you over the one you wished for.