How to Build a Golden Eval Set From Production

A golden dataset for LLM evaluation is a frozen, versioned sample of real production traffic, paired with trusted reference answers, stratified by intent, and deliberately over-weighted toward the adversarial tail. You sample it from production, freeze it, version it like code, and never let it grow organically. That last rule is the one teams break, and it is why their eval numbers stop meaning anything.

I have watched a team build an eval set the wrong way and not notice for a quarter. A developer wrote a few cases while building the feature. A PM added some during review. The set grew with every sprint. By the time it had 600 cases, it measured what the team imagined users would do, not what users actually did. The two distributions had drifted far apart, and the green eval badge was lying.

This is the build process I trust for a golden eval set, step by step. It is the companion to my essay on evals that predict production, which makes the case for why sampling is the whole game, and it sits under my full guide to LLM evaluation. Here I focus on the mechanics: how to build an eval dataset that holds up as a fixed ruler.

Key takeaways

A golden dataset for LLM evaluation is a frozen, versioned slice of real production traffic with trusted reference answers, over-weighted toward the adversarial tail.
Sample from production, not from your imagination. Synthetic cases encode the team's assumptions, which is exactly what production breaks.
Stratify by intent, then size the set to 200 to 500 cases, and deliberately over-weight the hard buckets to roughly 20 to 25 percent.
Freeze the set and version it like code. Once frozen, the set cannot be padded with easy cases to lift the score, so a real regression shows up instead of hiding.
Refresh on a fixed cadence with a parallel overlap window, so you can tell genuine drift from a model regression.

A golden eval set is a fixed ruler, not a rubber band. Once you let it grow to chase a number, it measures nothing.

Sample your golden eval set from production, not your imagination

The first decision determines everything downstream: where the cases come from. A golden dataset for LLM evaluation has to be sampled from real production traffic. Synthetic cases written by the team encode the team's assumptions, and those assumptions are exactly what production breaks.

You cannot sample production before you have production, so the first version is honest synthetic. Treat it as a v0, not as ground truth. The moment you have request logs, harvest a stratified sample from them and replace the synthetic set (TrueFoundry, enterprise LLM benchmarking). The synthetic v0 gets you to launch. Real traffic keeps you there.

Sample across a window wide enough to catch real variation. Two weeks is a defensible minimum for most applications, because it captures the weekly cycle and a full deployment cycle of whatever calls your model (TrueFoundry). Monday traffic does not look like Saturday traffic. Sample one day and your set inherits one day's bias.
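As a quick check on the sampling window, the sketch below counts logged requests per weekday inside a two-week window and reports any weekday with no traffic. The function names, and the assumption that each log record exposes a `datetime` timestamp, are mine for illustration; adapt them to however your logs are stored.

```python
from collections import Counter
from datetime import datetime, timedelta

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def weekday_counts(timestamps, start, days=14):
    """Count requests per weekday within [start, start + days)."""
    end = start + timedelta(days=days)
    counts = Counter()
    for ts in timestamps:
        if start <= ts < end:
            counts[WEEKDAYS[ts.weekday()]] += 1
    return counts


def missing_weekdays(counts):
    """Weekdays with no sampled traffic in the window."""
    return [day for day in WEEKDAYS if counts[day] == 0]
```

A non-empty result from `missing_weekdays` means the window is too narrow or the sample too sparse to carry the weekly cycle.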

Stratify by intent before you size the set

A random sample of production traffic over-represents your easy, high-volume path and under-represents the cases that actually break. To build an eval dataset that predicts failure, stratify first, then sample within each stratum.

Start by bucketing real requests by intent: the distinct jobs users hire your feature to do. For a support assistant that might be returns, billing, account access, and product questions. Then size each bucket so the rare-but-costly intents are not drowned out. A practical target is 200 to 500 cases that cover the feature's full operational envelope (Maxim, golden dataset guide). Smaller than 200 and the per-stratum signal is noise.
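One way to keep rare intents from being drowned out is proportional allocation with a per-intent floor. The sketch below is a minimal version of that idea, not a prescribed method: the `intent` field, the `floor` parameter, and the function name are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(requests, total, floor, seed=0):
    """Sample requests per intent: proportional share, but never below `floor`.

    A bucket with fewer rows than its quota contributes everything it has.
    Because of the floor, the result can be somewhat larger than `total`.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    by_intent = defaultdict(list)
    for request in requests:
        by_intent[request["intent"]].append(request)

    sample = []
    for intent in sorted(by_intent):
        rows = by_intent[intent]
        proportional = round(total * len(rows) / len(requests))
        quota = min(len(rows), max(floor, proportional))
        sample.extend(rng.sample(rows, quota))
    return sample
```

With 900 returns requests, 80 billing requests, and 20 account-access requests, a target of 200 and a floor of 30 yields 180, 30, and 20 cases respectively: the rare intents are lifted to the floor, or to everything available.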

Now over-weight the adversarial tail on purpose. The cases that break production are not spread evenly; they cluster. I deliberately oversample four buckets in the golden eval set:

Bottom-quartile model confidence on the original production response.
Cases a human reviewer corrected after the model answered.
Syntactically adversarial input: injection attempts, malformed requests, off-topic prompts.
Anything that previously caused an incident, no matter how resolved it feels.

That fourth bucket is the one teams skip. An incident feels closed once the hotfix ships. It is not closed until a future model candidate passes those exact cases on a held-out set. Otherwise you are one regression away from re-living it.

The split between the easy path and the tail is a design choice, so make it explicit. I aim for roughly 20 to 25 percent of the set in the adversarial buckets, with the remainder distributed across the real intent mix. That ratio is high enough to stress the model and low enough that the set still resembles production. There is no universal number here; the right ratio is the one that surfaces the failures your customers care about. Write the ratio down so the next person who refreshes the set does not quietly dilute it.
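The four tail buckets and the written-down ratio can be sketched as follows. The record fields (`caused_incident`, `human_corrected`, `adversarial_input`, `confidence`) and every bucket name other than `incident-replay` are hypothetical; your logs will name these differently.

```python
import statistics


def low_confidence_cutoff(confidences):
    """First quartile of model confidence; needs at least two values."""
    return statistics.quantiles(confidences, n=4)[0]


def tail_bucket(record, cutoff):
    """Return the adversarial bucket for a record, or None for the ordinary mix.

    Checks run in a fixed order so each record lands in exactly one bucket.
    """
    if record.get("caused_incident"):
        return "incident-replay"
    if record.get("human_corrected"):
        return "human-corrected"
    if record.get("adversarial_input"):
        return "adversarial-input"
    if record["confidence"] <= cutoff:
        return "low-confidence"
    return None


def split_quota(size, tail_fraction):
    """Split a set size into (tail cases, intent-mix cases)."""
    tail = round(size * tail_fraction)
    return tail, size - tail
```

For a 412-case set with a 22 percent tail, `split_quota(412, 0.22)` gives 91 tail cases and 321 intent-mix cases. Keeping the fraction as an explicit parameter is the code equivalent of writing the ratio down.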

Random sampling and a curated golden set are not rivals; they answer different questions. Random sampling tells you how the model does on the average request. A curated, tail-weighted golden set tells you whether the model is safe to ship (random sampling vs golden dataset for regression tests). For a deploy gate, you want the curated set. For online monitoring, you sample live traffic. Both, in their place.

Attach trusted reference answers

An input without a trusted reference answer is not an eval case; it is a prompt. The reference, sometimes called ground truth or the target response, is the answer you have decided is correct for that input. Building these is the slow, expensive, unavoidable part of a golden eval set.

Have domain experts label the references, not whoever is free. The labeling step forces you to articulate precise quality criteria, which is most of the value (EvidentlyAI, LLM evaluation guide). If two experts disagree on the reference for a case, you have found an ambiguous spec, not a hard case. Resolve the spec before you label.

You can speed this up with LLM-assisted drafting, then human review, but the human owns the final reference. Measure inter-annotator agreement and treat low agreement as a signal that your rubric is underspecified (EvidentlyAI). This is the thesis in miniature: the machine drafts the work, the human evaluates and decides. When an exact reference is impossible, score against a rubric instead, which I cover in my full guide to LLM evaluation.
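Inter-annotator agreement can be measured in several ways. A common choice for two annotators assigning categorical labels is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is one minimal implementation; the choice of kappa and the function name are mine, not something the cited guide prescribes.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same cases."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of cases where the annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each annotator labeled at random
    # with their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

A kappa near 1 means the experts agree well beyond chance. A low value on a slice of cases points at the rubric for that slice, which is the signal to resolve the spec before labeling further.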

Store the reference as data, not as a brittle exact-match string. Many production answers have several correct phrasings, so a strict string match will fail a good response and pass a lucky one. Capture the reference plus the grading rule alongside it: exact match for structured outputs, semantic similarity or a judge rubric for open-ended ones. The reference is the answer; the grading rule is how you decide the candidate met it. Conflating the two is how an eval set starts punishing correct answers and nobody trusts the number anymore.
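A minimal sketch of keeping the reference and the grading rule as separate fields on each case. The class, the field names, and the `judge` callable are hypothetical; the judge stands in for whatever semantic-similarity check or rubric grader you use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    input: str
    reference: str  # the answer you decided is correct
    grading: str    # how a candidate is judged against it: "exact" or "rubric"


def grade(case, candidate, judge=None):
    """Apply the case's own grading rule to a candidate answer."""
    if case.grading == "exact":
        # Structured outputs: compare after trimming surrounding whitespace.
        return candidate.strip() == case.reference.strip()
    if case.grading == "rubric":
        if judge is None:
            raise ValueError("rubric grading needs a judge callable")
        # Open-ended outputs: delegate to the supplied judge.
        return bool(judge(case, candidate))
    raise ValueError(f"unknown grading rule: {case.grading!r}")
```

Because the rule travels with the case, a structured case and an open-ended case can sit in the same frozen file without one grading policy punishing the other.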

Freeze the set and version it like code

Once the cases and references exist, freeze the whole thing as a named, immutable artifact. A frozen golden dataset is the only thing that lets you answer one question: is this model candidate better or worse than what we shipped six months ago, on exactly the same questions? Most teams cannot answer that with confidence. The frozen set is what earns the answer.

Version the set the way you version code. Each freeze gets a name, a date, and a commit. Here is the discipline I use: a freeze command, the schema of a single frozen case, and the version record for the frozen set.

# freeze the golden eval set as a versioned artifact
eval freeze \
  --source prod-logs-2026-w23-to-w24 \
  --strata intent,confidence,incident \
  --size 412 \
  --out golden-set-2026-w24-v2.jsonl

# a single frozen case, schema sketch
{
  "id": "case-0417",
  "intent": "billing",
  "stratum": "incident-replay",
  "input": "why was i charged twice in may",
  "reference": "expert-labeled target answer",
  "frozen_at": "2026-06-15"
}

# freezing means the set cannot be edited to lift the score
set version       golden-set-2026-w24-v2
cases             412   # immutable until next freeze
adversarial tail  22%   # oversampled on purpose

Freezing has a consequence worth stating plainly: the score on a frozen set moves only when the model moves. There is no quietly adding easy cases to lift the number. That is the point. When a model regresses, the frozen set tells you the truth instead of absorbing the regression into a moving baseline.
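One way to enforce immutability, sketched here as an assumption rather than a description of any particular tool, is to record a content hash at freeze time and verify it before every eval run. The function name is hypothetical, and cases are assumed to be dicts with an `id` field, as in the schema sketch.

```python
import hashlib
import json


def fingerprint(cases):
    """SHA-256 over a canonical serialization of the frozen cases.

    Cases are sorted by id and keys are sorted, so the hash depends on
    content only, not on file order or key order.
    """
    canonical = "\n".join(
        json.dumps(case, sort_keys=True, ensure_ascii=False)
        for case in sorted(cases, key=lambda case: case["id"])
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Store the hash next to the version name. If the hash computed at eval time differs from the stored one, someone edited the set without cutting a new version, and the run should fail loudly.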

Set a refresh cadence so the ruler stays honest

A frozen set decays as the world changes. New intents appear, language shifts, the product adds features. So you refresh on a schedule, not on a whim, and you version each refresh. The set is frozen; the cadence is what keeps it current.

Cut a new version on a fixed interval, monthly or quarterly depending on traffic volatility, and feed real production failures back into each refresh. When a new failure mode shows up in production, it earns a place in the next freeze. Run the old version and the new version in parallel for an overlap period so you can compare scores across the boundary. Without the overlap, a score drop after a refresh is ambiguous: you cannot tell whether the set got harder or the model got worse.
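The overlap comparison can be sketched as follows. The idea is to score the shipped model and the candidate on both set versions, so that the effect of changing the set is separated from the effect of changing the model. The function and key names are illustrative assumptions, and each argument is a list of pass/fail results.

```python
def accuracy(results):
    """Fraction of passing cases in a list of booleans."""
    return sum(bool(r) for r in results) / len(results)


def overlap_report(shipped_old, shipped_new, candidate_old, candidate_new):
    """Separate set drift from model change during the overlap window."""
    return {
        # Same model, different set version: the change is due to the set.
        "set_shift": accuracy(shipped_new) - accuracy(shipped_old),
        # Same set version, different model: the change is due to the model.
        "model_delta_on_old": accuracy(candidate_old) - accuracy(shipped_old),
        "model_delta_on_new": accuracy(candidate_new) - accuracy(shipped_new),
    }
```

A negative `set_shift` with flat model deltas reads as drift: the new version is harder, and the model has not regressed. A negative model delta on both versions is a regression regardless of which ruler you use.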

This is where the eval set connects to revenue. In an LLM product, a bad launch is rarely one incident. It is the slow erosion of the trust that makes the product sellable. A golden eval set that refreshes on cadence is the cheapest insurance against that erosion. The gate it enables costs less than the churn it prevents. If you want that gate run as managed infrastructure, that is the core of Devlyn's AI observability and monitoring work.

The honest trade-off

Over-weighting the adversarial tail makes your aggregate accuracy look worse than it would on a uniform sample. A golden eval set stacked with hard cases will report a lower number than a friendlier set, and someone will ask why your eval score dropped after you did it right.

That is the cost, and I pay it on purpose. A uniform sample flatters the model and hides the cases that generate support tickets and churn. An adversarial-weighted set tells you what breaks before a customer does. Ship the model that passes the hard set, not the one with the prettiest average. The aggregate number is for you; the tail is for the customer.
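Reporting the score per stratum next to the aggregate makes this trade-off visible rather than arguable. A minimal sketch, assuming each result is a `(stratum, passed)` pair; the function names are mine.

```python
from collections import defaultdict


def aggregate_accuracy(results):
    """Overall pass rate across all (stratum, passed) pairs."""
    results = list(results)
    return sum(bool(passed) for _, passed in results) / len(results)


def accuracy_by_stratum(results):
    """Pass rate for each stratum, so the tail is not averaged away."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for stratum, passed in results:
        totals[stratum] += 1
        passes[stratum] += int(bool(passed))
    return {stratum: passes[stratum] / totals[stratum] for stratum in totals}
```

Eight passing billing cases and two failing incident-replay cases give an aggregate of 0.8, which looks tolerable, while the per-stratum view shows incident-replay at 0.0.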

A uniform sample flatters the model. An adversarial set tells you what breaks before a customer finds it for you.

FAQ

What is a golden dataset for LLM evaluation?

A golden dataset for LLM evaluation is a frozen, versioned sample of real production traffic paired with trusted, expert-labeled reference answers. It is stratified by intent and over-weighted toward the adversarial tail so it measures what will break, not the easy path. You sample it from production, freeze it as an immutable artifact, and refresh it on a schedule rather than growing it case by case.

How big should an LLM test set be?

For most features, 200 to 500 cases that cover the full operational envelope is the practical range. Below 200 the per-stratum signal becomes noise, especially for rare intents. The number that matters is coverage of distinct intents and failure modes, not raw size. A focused 300-case set beats a sprawling 2,000-case set built from synthetic happy-path examples.
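To see why small strata are noisy, consider 200 cases split evenly across four intents: each stratum holds 50 cases, so a single failure moves that stratum's score by 2 points. The sketch below computes a Wilson score interval for a pass rate, one standard way to put an uncertainty band on a proportion. Using this interval is my illustration, not something the cited guides prescribe.

```python
import math


def wilson_interval(passes, n, z=1.96):
    """Wilson score interval for a pass rate; z=1.96 is roughly 95 percent."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)
```

Comparing `wilson_interval(45, 50)` with `wilson_interval(180, 200)` shows the same 90 percent pass rate with a visibly wider band at 50 cases. That width is the noise a rare-intent stratum carries.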

Why should I freeze the eval set instead of letting it grow?

Because a set that grows organically becomes a moving baseline, and a moving baseline cannot detect regression. When you freeze the set, nobody can lift the score by adding easy cases, which means a real regression shows up as a real drop instead of being absorbed. Freezing turns the set into a fixed ruler. You add new cases through a versioned refresh, not by appending whenever you feel like it.

How often should I refresh a golden eval set?

Refresh on a fixed cadence, monthly or quarterly depending on how fast your traffic shifts, and cut a new version each time. Feed production failures and new intents into each refresh, and run the old and new versions in parallel for an overlap window so you can compare across the boundary. Refresh early when your online drift alerts fire.

If you are building your first golden eval set, start with two weeks of production logs, stratify by intent, and label 200 hard cases by hand. That alone catches most of what kills launches. When you are ready to make this the way your team works rather than a one-time chore, A Field Guide to Evals is the long-form version of this harness, and the sibling guides on building an evaluation framework and evaluating RAG show where the set plugs in. Sample from production, freeze it, version it, and trust the number it gives you over the one you wished for.