RAG Evaluation: Measuring Retrieval Before It Collapses
RAG evaluation works only when you score retrieval and generation separately on a frozen golden set. Here is how to catch recall decay before it ships.
RAG evaluation works when you measure retrieval and generation separately. Recall@k and context precision on a frozen golden set catch the silent recall decay that end-to-end answer-quality metrics hide. A single "is the answer good?" score blends two systems into one number, so when retrieval rots you see a small wobble in answer quality and miss the cause entirely. Build the golden set before you ship. After the corpus drifts, you no longer have a clean reference to measure against.
I have watched this exact failure at more than one company. The demo retrieves perfectly, answer quality scores 0.9, everyone ships. Three months later the support tickets climb and nobody can say why, because the only metric on the dashboard moved two points. This piece covers the retrieval discipline I trust, and where it connects to the broader practice of evals that predict production.
Key takeaways
If you read nothing else, these are the load-bearing claims:
- Evaluate retrieval and generation as two separate systems. Recall@k and context precision measure the retriever; faithfulness and answer relevance measure the generator.
- Recall@k is the early-warning metric. It decays first and silently, weeks before answer quality visibly drops.
- Build a frozen golden set before launch. Sample real queries, label the relevant chunks, version it, and never edit it to chase a number.
- Faithfulness catches the confident wrong answer. It is the fraction of answer claims supported by the retrieved context.
- Recall decay is a revenue leak. Every query that should resolve and does not becomes a ticket, a refund, or a churned account.
Why end-to-end RAG evaluation hides the failure
Most teams evaluate a RAG pipeline the way they evaluate a chatbot: ask a question, grade the final answer. That is end-to-end evaluation, and it is necessary but not sufficient. A RAG system has two stages that fail for different reasons. The retriever fetches chunks. The generator writes an answer from those chunks. Grade only the output and you cannot tell which stage broke.
The reason this matters is that the two stages decay on different clocks. Generation quality is mostly stable, because the model does not change unless you change it. Retrieval quality erodes constantly, because the corpus grows, the query distribution drifts, and the embedding space gets crowded. I covered the full arc of this in why most RAG pipelines fail in month three. The short version: recall is the metric that rots, and end-to-end scores are too far downstream to see it early.
Here is the mechanism. When recall drops from 0.9 to 0.75, a good generator papers over the gap. It still writes a fluent answer from the weaker context, and for a while that answer is right often enough that your answer-quality metric reads 0.88 instead of 0.90. Two points. Inside a noisy metric, two points is invisible. Meanwhile a quarter of the chunks your answers depend on never reach the context window, so a growing share of queries are answered from incomplete context, and the worst of them are confidently wrong.
The retrieval metrics that matter: recall@k and context precision
Retrieval evaluation needs two metrics working together. Recall@k asks: of the chunks that should have been retrieved for this query, how many showed up in the top k? Context precision asks: of the chunks that were retrieved, how many are actually relevant? Recall is about misses. Precision is about noise. You need both, because a retriever can be tuned to win one at the cost of the other.
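As a minimal sketch of those two questions for a single query, assuming chunks are identified by string IDs and the golden set supplies the labeled-relevant IDs: the function names and the sample IDs below are illustrative, not taken from any particular library. The precision here is the plain unweighted fraction; some tools define context precision with rank weighting, so treat this as the simplest version of the idea.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the labeled-relevant chunks that appear in the top k results."""
    if not relevant:
        raise ValueError("golden-set entry has no relevant chunks labeled")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k retrieved chunks that are labeled relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for chunk_id in top_k if chunk_id in relevant) / len(top_k)


# Hypothetical chunk IDs for one golden-set query.
retrieved_ids = ["c7", "c2", "c9", "c4", "c1"]  # ranked retriever output
relevant_ids = {"c2", "c4", "c8"}               # human-labeled ground truth

query_recall = recall_at_k(retrieved_ids, relevant_ids, k=5)        # 2 of 3 relevant found
query_precision = precision_at_k(retrieved_ids, relevant_ids, k=5)  # 2 of 5 retrieved are relevant
print(f"recall@5={query_recall:.2f} precision@5={query_precision:.2f}")
```

Averaging each score over every query in the golden set gives the two retrieval numbers for a run. In this example the miss (`c8`) lowers recall while the three irrelevant chunks lower precision, which is the misses-versus-noise split in miniature.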
Recall@k is the one I watch first. Recall is what collapses silently, and it collapses before anything downstream reacts. If recall@5 was 0.91 at launch and reads 0.72 this week, retrieval is failing and the only question left is how many users have already felt it. Context precision matters because a retriever that floods the context window with marginally relevant chunks degrades generation and inflates cost, even when recall looks fine.
These map to the standard open-source definitions. Ragas scores context precision and context recall as core retrieval metrics, and is explicit that retrieval and generation should be measured on separate axes (Ragas metrics docs). The point is not the tool. The point is the separation. Whatever you use, keep the retriever score and the generator score in different columns.
RAG evaluation metrics at a glance
Here is the panel I keep, split by the axis it measures. Retrieval metrics tell you whether the right context arrived. Generation metrics tell you what the model did with it. Read them in that order, because a generation score is only meaningful once you know retrieval was sound.
| Metric | Axis | What it catches | What it hides |
|---|---|---|---|
| recall@k | Retrieval | Missed chunks the answer needed | Nothing about answer wording; this is the early-warning metric |
| context precision | Retrieval | Noise and irrelevant chunks crowding the window | Misses, if recall is not watched alongside it |
| MRR | Retrieval | The right chunk ranked too low to use | Whether other relevant chunks were found at all |
| faithfulness / groundedness | Generation | Claims unsupported by the retrieved context | A grounded answer built on the wrong context |
| answer relevance | Generation | Answers that drift off the question | Whether the answer is actually correct |
The "what it hides" column is the one most write-ups skip. Every metric here can read green while another reads red, which is the whole reason you score them separately and gate on the weakest, not the average.
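Gating on the weakest metric instead of the average can be sketched in a few lines. The scores for context precision and MRR and the single 0.85 floor below are made-up values for illustration; in practice each metric would likely carry its own threshold.

```python
from statistics import mean
from typing import Dict, Tuple


def gate_on_average(scores: Dict[str, float], floor: float) -> bool:
    """Blended gate: passes whenever the mean of all metrics clears the floor."""
    return mean(scores.values()) >= floor


def gate_on_weakest(scores: Dict[str, float], floor: float) -> Tuple[bool, str]:
    """Separated gate: passes only if the lowest-scoring metric clears the floor."""
    weakest = min(scores, key=lambda name: scores[name])
    return scores[weakest] >= floor, weakest


# Illustrative panel; context_precision and mrr are hypothetical values.
panel = {
    "recall@5": 0.74,
    "context_precision": 0.90,
    "mrr": 0.85,
    "faithfulness": 0.88,
    "answer_relevance": 0.89,
}
FLOOR = 0.85  # hypothetical shared threshold

average_passes = gate_on_average(panel, FLOOR)
weakest_passes, weakest_metric = gate_on_weakest(panel, FLOOR)
print(f"average gate: {average_passes}, weakest gate: {weakest_passes} ({weakest_metric})")
```

With these numbers the average clears the floor while recall@5 does not, so only the weakest-metric gate blocks the deploy.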
Build the golden set before you ship
A golden set is a frozen, labeled collection of queries paired with the chunks that should be retrieved to answer them. It is the reference that makes recall@k computable. Without it you have no ground truth, and "did we retrieve the right thing?" becomes an opinion instead of a number.
Build it before launch, for one blunt reason. After the corpus drifts you cannot reconstruct what "correct retrieval" looked like at launch. The labels you make in month three are contaminated by the system's current behavior, so you end up grading the retriever against itself. A golden set built on day one is a fixed ruler. Here is the build, four steps:
- Sample real queries. Pull 150 to 300 from production logs or beta traffic, weighted toward the queries that matter to revenue, not the easy ones.
- Label the relevant chunks. For each query, a human marks which corpus chunks genuinely answer it. This is the expensive step and the one you cannot skip.
- Freeze and version it. Save it as a named artifact, golden-set-2026-w24-v1.jsonl, and treat edits as a new version, never an in-place fix.
- Schedule a refresh. The set drifts from reality over time, so re-sample on a cadence and version each refresh like a code release.
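One way the frozen artifact could look, as a sketch: one JSON object per line, parsed into immutable records, with a content hash recorded next to the version name so an in-place edit is detectable. The field names (`query_id`, `query`, `relevant_chunk_ids`) and the sample rows are assumptions for illustration, not a required schema.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class GoldenExample:
    query_id: str
    query: str
    relevant_chunk_ids: Tuple[str, ...]  # human-labeled chunks that answer the query


def parse_golden_set(jsonl_text: str) -> List[GoldenExample]:
    """Parse JSONL text into immutable golden-set records, skipping blank lines."""
    examples = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        examples.append(
            GoldenExample(row["query_id"], row["query"], tuple(row["relevant_chunk_ids"]))
        )
    return examples


def fingerprint(jsonl_text: str) -> str:
    """SHA-256 of the raw file contents; any in-place edit changes this value."""
    return hashlib.sha256(jsonl_text.encode("utf-8")).hexdigest()


# Hypothetical contents of golden-set-2026-w24-v1.jsonl.
sample_jsonl = (
    '{"query_id": "q001", "query": "How do I rotate an API key?", "relevant_chunk_ids": ["c12", "c40"]}\n'
    '{"query_id": "q002", "query": "What is the refund window?", "relevant_chunk_ids": ["c07"]}\n'
)

golden_set = parse_golden_set(sample_jsonl)
frozen_hash = fingerprint(sample_jsonl)
print(len(golden_set), frozen_hash[:12])
```

Storing `frozen_hash` alongside the version name lets the eval run refuse to start if the file no longer matches, which turns "never edit in place" from a convention into a check.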
The full discipline for sampling and label-blinding lives in my book A Field Guide to Evals, and the RAG-specific version of it in RAG That Survives Contact. The honest trade-off: labeling a golden set costs real human hours, and people resist spending them before a launch. They spend far more hours later, in support and incident review, when retrieval fails and there is no reference to debug against.
A recall@k eval log you can actually read
Here is what a retrieval evaluation run looks like against a frozen golden set. The retriever score and generation score sit in separate blocks, on purpose. The numbers are realistic, not from a specific live system.
Read that log the way the gate reads it. Answer relevance is 0.89, which on an end-to-end dashboard would clear the bar and ship. But recall@5 has fallen to 0.74 against a launch baseline of 0.91. The retriever is missing a quarter of the chunks it should find, and the generator is covering for it well enough that the output still scores fine. The separated panel catches what the blended score would have waved through.
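The gate's reading can be sketched as two independent checks, using the figures quoted above (recall@5 at 0.74 against a 0.91 launch baseline, answer relevance at 0.89). The 0.85 answer-relevance bar and the 0.05 allowed recall drop are hypothetical thresholds, and the structure of `run` is an assumption rather than any tool's real output format.

```python
def regressed(current: float, baseline: float, max_drop: float) -> bool:
    """True when a metric has fallen more than max_drop below its launch baseline."""
    return (baseline - current) > max_drop


ANSWER_RELEVANCE_BAR = 0.85  # hypothetical end-to-end pass bar
MAX_RECALL_DROP = 0.05       # hypothetical tolerated drop from launch

# Retrieval and generation scores kept in separate blocks, as in the text.
run = {
    "retrieval": {"recall@5": 0.74},
    "generation": {"answer_relevance": 0.89},
}
launch_baseline = {"recall@5": 0.91}

# What an end-to-end dashboard would conclude on its own.
blended_view_passes = run["generation"]["answer_relevance"] >= ANSWER_RELEVANCE_BAR

# What the separated panel adds: a regression check on the retriever alone.
recall_regressed = regressed(
    run["retrieval"]["recall@5"], launch_baseline["recall@5"], MAX_RECALL_DROP
)

deploy_allowed = blended_view_passes and not recall_regressed
print(f"blended passes: {blended_view_passes}, recall regressed: {recall_regressed}, "
      f"deploy allowed: {deploy_allowed}")
```

The generation check alone would ship this run; the retrieval check is what blocks it.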
Faithfulness and groundedness: grading the generator
Once retrieval is measured, grade generation on its own axis. Faithfulness, sometimes called groundedness, measures whether every claim in the answer can be inferred from the retrieved context. Ragas defines it as the number of answer claims supported by the context divided by the total claims in the answer (Ragas faithfulness docs). A faithfulness of 0.88 means roughly one claim in eight is unsupported by what was retrieved. That is your hallucination rate, expressed as a number you can gate on.
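The ratio itself is simple once each claim has a supported or unsupported verdict. The sketch below assumes those verdicts already exist, from a judge model or a human reviewer, and hard-codes a hypothetical batch of 25 claims; extracting claims and judging them is the hard part and is not shown.

```python
from typing import Sequence


def faithfulness(claim_supported: Sequence[bool]) -> float:
    """Supported claims divided by total claims in the answer."""
    if not claim_supported:
        raise ValueError("answer contains no claims to score")
    return sum(claim_supported) / len(claim_supported)


# Hypothetical verdicts: 22 of 25 claims are supported by the retrieved context.
verdicts = [True] * 22 + [False] * 3

score = faithfulness(verdicts)
unsupported_rate = 1 - score  # share of claims with no support in the context
print(f"faithfulness={score:.2f} unsupported={unsupported_rate:.2f}")
```

Here 3 unsupported claims out of 25 gives an unsupported rate of 0.12, which is the "roughly one claim in eight" reading of a 0.88 score.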
Faithfulness and recall interact in a way worth naming. When recall drops, faithfulness often holds steady on its own terms, because the generator is being faithful to weak context. The answer is grounded in what it retrieved; it just retrieved the wrong things. This is exactly why you cannot collapse the two. A faithful answer built on a recall failure is a confident, well-sourced, wrong answer. Those are the ones that cost you a customer.
Pair faithfulness with answer relevance, which checks whether the answer addresses the question rather than drifting. Neither is reliable when scored by a weak judge model; faithfulness scoring in 2026 needs a strong reference model behind it to detect contradiction. I work through judge reliability in detail in when to trust LLM-as-a-judge, and the broader metric panel in the LLM evaluation metrics that matter.
Recall decay is a revenue leak
Here is the part most retrieval evaluation write-ups skip. Recall is not an engineering vanity number. Every query that should resolve and does not is a business event. In a support deployment it becomes a ticket a human now handles, which is cost. In self-serve it becomes a user who did not find the answer and churned, which is lost revenue. In sales enablement it becomes a rep quoting something the system never surfaced, which is risk.
So the metric to put in front of the business is not recall@5 by itself. It is the unresolved-query rate that recall decay drives, priced. When recall@5 falls from 0.91 to 0.74, model that against the fraction of queries now answered from incomplete context, then against what each unanswered query costs you downstream. That sentence, "retrieval decay is costing us X per month in support load," is what gets a retrieval fix prioritized. The recall number alone gets a shrug.
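A back-of-the-envelope version of that pricing, as a sketch: every input below is a placeholder to be replaced with your own numbers, and the link between a recall drop and the unresolved-query rate has to be measured on your traffic rather than assumed.

```python
def monthly_cost_of_decay(
    monthly_queries: int,
    unresolved_rate_at_launch: float,
    unresolved_rate_now: float,
    cost_per_unresolved_query: float,
) -> float:
    """Extra monthly cost from queries that resolved at launch but no longer do."""
    extra_unresolved_rate = max(0.0, unresolved_rate_now - unresolved_rate_at_launch)
    return monthly_queries * extra_unresolved_rate * cost_per_unresolved_query


# All four inputs are hypothetical placeholders, not measurements.
extra_cost = monthly_cost_of_decay(
    monthly_queries=40_000,
    unresolved_rate_at_launch=0.06,
    unresolved_rate_now=0.15,
    cost_per_unresolved_query=8.0,  # e.g. average handling cost of one support ticket
)
print(f"retrieval decay is costing roughly {extra_cost:,.0f} per month")
```

With these made-up inputs the figure comes out near 28,800 per month. The number matters less than the shape of the sentence it produces, which is the one the business can prioritize against.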
This is also where evaluation stops being a one-off script and becomes production instrumentation. Recall decays continuously, so you measure it continuously, the same way you measure latency or error rate. The loop that keeps the gate honest is simple: every low-scoring production query gets fed back into the golden set, so the set you measure against grows toward the queries that actually break. Standing up that kind of monitored, gated retrieval pipeline is squarely what Devlyn's RAG and knowledge integration builds, with the golden set and the eval gate wired in from day one rather than bolted on after the first incident.
Where RAG evaluation still falls short
Even a clean retrieval-plus-generation panel has a ceiling, and I would rather name it than oversell the method. A golden set drifts from reality as the corpus and the queries change, so a set you never refresh slowly stops measuring your live system. Labeling relevance is partly subjective, especially for queries with several defensible answers, so context precision carries some irreducible noise. And faithfulness scoring inherits every blind spot of the judge model grading it.
None of this argues for the fallback everyone reaches for, which is having a human read every RAG answer before it ships. That does not scale, and I make the full case in why a human in the loop is not a plan. The answer is a separated metric panel that earns the right to gate a deploy, plus a human who designs the golden set and audits the gate. The machine does the retrieval and the generation. The human evaluates both, separately, and the metrics are how that judgment scales past the demo.
Frequently asked questions
What is RAG evaluation? RAG evaluation is the practice of measuring a retrieval-augmented generation system on two separate axes: retrieval quality (recall@k, context precision) and generation quality (faithfulness, answer relevance). Scoring them separately is the whole point, because the retriever and the generator fail for different reasons and on different timelines.
What metrics should I use to evaluate a RAG pipeline? For retrieval, use recall@k and context precision, plus MRR if rank order matters. For generation, use faithfulness (groundedness) and answer relevance. Watch recall@k first, because it decays earliest and most silently, weeks before answer quality visibly drops.
How do I build a golden set for RAG? Sample 150 to 300 real queries, have a human label which corpus chunks should answer each one, freeze it as a versioned artifact, and refresh it on a schedule. Build it before you ship, because after the corpus drifts you can no longer reconstruct what correct retrieval looked like at launch.
Why does recall matter more than answer quality early on? Because a good generator masks weak retrieval. When recall falls, the model still writes a fluent answer from incomplete context, so answer-quality scores barely move while a growing share of answers are confidently wrong. Recall@k is the early-warning metric; end-to-end scores lag it by weeks.
If you want the full harness this plugs into, including label-blinding and gate design, my book A Field Guide to Evals walks through it end to end, and RAG That Survives Contact covers the retrieval-specific version. If you would rather have a team stand up a gated RAG pipeline with the golden set and eval harness built in from day one, that is exactly what Devlyn's RAG and knowledge integration is for. Measure retrieval separately. Catch the decay before a customer does.
