Alpesh Nakrani
Chapter 10 / Field Manuals

Evaluating Retrieval Apart from Answers

If you only measure the final answer, you cannot tell whether retrieval or generation broke, and you will keep fixing the wrong half.

A team I worked with had a dashboard they were proud of: a single number, "answer quality," scored by an LLM judge, tracked daily. When it dipped, they would convene, stare at the number, and argue about whether the model had regressed or the prompt needed work. The argument never resolved, because the number could not resolve it. A drop in answer quality can come from retrieval bringing the wrong context, from the model mishandling good context, or from the corpus itself being wrong. One number conflates all three, so every meeting was speculation. The dashboard measured the symptom and hid the cause.

The fix is the throughline of this entire book applied to measurement: retrieval and generation are separate problems, so you must evaluate them separately. You need to know, independently, whether the right material was retrieved and whether the model used it well. Only then can you tell which half to fix. This chapter builds that two-layer evaluation, the metrics that go with each layer, the observability to catch failures in production, and the incident discipline to learn from them. It is the eighth stage of the Living Index Lifecycle, observe, made concrete.

Key Takeaways

  • If you only measure the final answer, you cannot tell whether retrieval or generation broke, and you will keep fixing the wrong half.
  • Evaluate retrieval apart from answers using concrete evidence, clear ownership, and named failure modes before production behavior changes.
  • Read it with the adjacent RAG That Survives chapters to move from diagnosis to an implementation or release decision.

The two-layer evaluation

Split every evaluation into two questions, measured separately:

Layer one, retrieval quality: given a query, did the retrieval stack return the correct material in the candidate set and rank it well? This is measured against the corpus, independent of any generation. It answers "did we bring the right world into the prompt."

Layer two, answer quality: given the retrieved context, did the model produce a faithful, complete, well-attributed answer? This is measured given the context, independent of whether retrieval was good. It answers "did the model use what we gave it."

The combination is diagnostic. High retrieval, low answer: generation or prompt problem. Low retrieval, any answer: retrieval problem (a good answer here was luck or parametric memory, which is its own risk). High retrieval, high answer, still wrong: corpus problem, the source itself is wrong. This is the same three-outcome table from chapter two, now turned into a measurement strategy rather than a one-off diagnosis.
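The grid above can be written down as a small function. This is a minimal sketch, not a prescribed implementation: the three boolean inputs and the returned label strings are assumptions chosen for illustration, and in practice each boolean would come from thresholding a metric from the matching layer.

```python
def diagnose(retrieval_ok: bool, answer_ok: bool, correct: bool) -> str:
    """Map the two layer results plus end-to-end correctness to a likely cause.

    retrieval_ok: the labeled chunk reached the packed context (layer one).
    answer_ok:    the answer was faithful to that context (layer two).
    correct:      the answer was actually right for the user.
    All three inputs and the label strings are illustrative assumptions.
    """
    if not retrieval_ok:
        # Any good answer here came from luck or parametric memory.
        return "retrieval"
    if not answer_ok:
        return "generation_or_prompt"
    if not correct:
        # Both layers did their job, so the source itself is wrong.
        return "corpus"
    return "healthy"


print(diagnose(retrieval_ok=True, answer_ok=False, correct=False))
```

Checking retrieval first is deliberate: when retrieval failed, the answer score says nothing reliable about generation.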

Figure: two joined evaluation loops with a diagnosis grid in the middle. Measuring retrieval and answer quality separately turns a fuzzy quality drop into a clear diagnosis.

Retrieval metrics that mean something

Retrieval evaluation is a mature field with well-understood metrics, and you do not need to invent any. You need to use the right ones for the right question. All of them require the same thing: a labeled set of queries with their correct (relevant) chunks, the gold set. Building that set is most of the work, and I will get to it.

Recall@k: of the queries, what fraction have at least one correct chunk in the top k retrieved? (Strictly, this is the hit-rate form of the metric; the textbook form is the fraction of a query's relevant chunks that appear in the top k. The two coincide when each query has a single labeled chunk.) This is the recall-stage metric from the hybrid chapter, and it is the first thing to protect, because if the correct chunk is not in the candidate set, nothing downstream can recover it. Track recall at the k you actually retrieve (say recall@100 for the candidate set) and at the k you pack (say recall@5 for the context).

MRR (Mean Reciprocal Rank): the average of 1/rank of the first correct chunk. It rewards getting the correct chunk high, which is the precision-stage concern that reranking addresses. A reranker that works should move MRR up even if recall@100 is unchanged, because it reorders the same candidates.

nDCG (normalized Discounted Cumulative Gain): the standard graded-relevance metric, which accounts for multiple relevant chunks at different relevance levels and discounts relevance found lower in the ranking. Use it when relevance is graded (some chunks are perfect, some partial) rather than binary. It is the most informative single ranking metric when you have graded labels.

Here is a compact implementation of the three, operating on a ranked list of retrieved chunk IDs against a set of relevant IDs:

import math

def recall_at_k(retrieved, relevant, k):
    # Per-query hit: 1.0 if any relevant chunk is in the top k; average over the set.
    top = retrieved[:k]
    return 1.0 if any(c in relevant for c in top) else 0.0

def reciprocal_rank(retrieved, relevant):
    # 1/rank of the first relevant chunk; 0.0 if none was retrieved.
    for i, c in enumerate(retrieved):
        if c in relevant:
            return 1.0 / (i + 1)
    return 0.0

def ndcg_at_k(retrieved, relevance_grades, k):
    # relevance_grades: dict chunk_id -> graded relevance (e.g. 0,1,2,3)
    def dcg(items):
        # Exponential gain, discounted by log2 of (1-based rank + 1).
        return sum((2 ** relevance_grades.get(c, 0) - 1) / math.log2(i + 2)
                   for i, c in enumerate(items))
    actual = dcg(retrieved[:k])
    # Best achievable ordering: labeled chunks sorted by grade, highest first.
    ideal_order = sorted(relevance_grades, key=relevance_grades.get, reverse=True)
    ideal = dcg(ideal_order[:k])
    return actual / ideal if ideal > 0 else 0.0

These run over your whole gold set, and you average per-query results. The point of computing all three is that they answer different questions: recall@k says "is it findable," MRR says "is it near the top," nDCG says "is the whole ranking good." A change that improves one and not the others tells you exactly what you changed.
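Averaging over the gold set might look like the sketch below. It assumes a hypothetical `search` callable that maps a query string to a ranked list of chunk IDs, and it assumes the packed context is simply the top `packed_k` of that same ranking; neither is prescribed by the text. The toy index and its chunk IDs are made up. The two metric functions are repeated so the block runs on its own.

```python
from statistics import mean


def recall_at_k(retrieved, relevant, k):
    return 1.0 if any(c in relevant for c in retrieved[:k]) else 0.0


def reciprocal_rank(retrieved, relevant):
    for i, c in enumerate(retrieved):
        if c in relevant:
            return 1.0 / (i + 1)
    return 0.0


def evaluate(gold, search, candidate_k=100, packed_k=5):
    """Average per-query metrics over a gold set.

    gold:   list of {"query": str, "relevant": set of chunk IDs}.
    search: hypothetical callable, query -> ranked list of chunk IDs.
    """
    rows = []
    for item in gold:
        ranked = search(item["query"])
        rows.append({
            "recall@candidate_k": recall_at_k(ranked, item["relevant"], candidate_k),
            "recall@packed_k": recall_at_k(ranked, item["relevant"], packed_k),
            "mrr": reciprocal_rank(ranked, item["relevant"]),
        })
    if not rows:
        return {}
    return {name: mean(row[name] for row in rows) for name in rows[0]}


# Toy stand-in for a retrieval stack; IDs and rankings are invented.
toy_index = {
    "export on free plan": ["c7", "c2", "c9"],
    "reset password": ["c4", "c1"],
}
gold = [
    {"query": "export on free plan", "relevant": {"c2"}},
    {"query": "reset password", "relevant": {"c8"}},
]
report = evaluate(gold, lambda q: toy_index[q], candidate_k=3, packed_k=1)
print(report)
```

In the toy run the first query is findable but not at rank one, and the second is not findable at all, so the candidate-stage recall and the packed-stage recall diverge, which is the gap a reranker is meant to close.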

The evaluation dataset is the asset

The metrics are trivial; the gold dataset is the hard, valuable part, and it is what most teams lack. A retrieval eval set is a collection of queries, each labeled with the chunk(s) that correctly answer it. Here is a schema that has held up:

{
  "query_id": "q_0142",
  "query": "can I export my data on the free plan",
  "intent": "knowledge_lookup",
  "relevant_chunks": [
    { "chunk_id": "doc_export#c2", "grade": 3, "note": "states free plan export allowance" }
  ],
  "must_not_retrieve": ["doc_promo_2019#c3"],
  "source_of_query": "production_log",
  "created": "2026-05-02",
  "corpus_version": "2026-05-01"
}

Three fields earn their place beyond the obvious. must_not_retrieve lets you test negative cases: stale documents, wrong-tenant content, deprecated versions that should never surface, which is how you turn the permissions and freshness work into measurable assertions. source_of_query records where the query came from, because the best eval queries come from production logs (real user phrasing) rather than from an engineer imagining queries (which never look like real ones). corpus_version records what the corpus looked like when this was labeled, because relevance labels go stale as the corpus changes, which is the unique evaluation challenge of a moving corpus.
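A negative-case assertion on must_not_retrieve can be very small. The sketch below assumes the entry follows the schema above and that the retrieval stack returns a ranked list of chunk IDs; the function name and the `doc_pricing#c1` chunk ID are hypothetical.

```python
def forbidden_hits(entry, retrieved, k):
    """Return the must_not_retrieve chunk IDs that appear in the top k."""
    banned = set(entry.get("must_not_retrieve", []))
    return [c for c in retrieved[:k] if c in banned]


entry = {
    "query_id": "q_0142",
    "must_not_retrieve": ["doc_promo_2019#c3"],
}
# Hypothetical ranking returned for the query in this entry.
ranked = ["doc_export#c2", "doc_promo_2019#c3", "doc_pricing#c1"]

violations = forbidden_hits(entry, ranked, k=5)
print(violations)
```

Any non-empty result is a failed assertion, regardless of how good the rest of the ranking is.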

That last point deserves emphasis, since it is the part generic RAG advice misses. On a static corpus, you label once and reuse forever. On a corpus that won't sit still, your gold labels decay: the chunk that was correct for a query last month may have been deprecated, re-chunked, or superseded this month. So the eval set itself needs maintenance, tied to the lifecycle: when a document is re-chunked or retired, the labels pointing at its chunks need review. Treat the eval set as a living artifact with the same discovery-and-refresh discipline as the index, or your "evaluation" slowly measures against a corpus that no longer exists.
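One way to surface decayed labels is to compare the eval set against the chunk IDs currently in the index. This is a sketch under assumptions: it presumes you can export the set of live chunk IDs, and the rule "flag anything labeled against an older corpus_version" is one possible review policy, not the only one. The second entry and its chunk ID are invented for the example.

```python
def labels_needing_review(gold, live_chunk_ids, current_version):
    """Flag eval entries whose labels may no longer match the corpus.

    gold:            entries following the schema above.
    live_chunk_ids:  set of chunk IDs currently in the index (assumed exportable).
    current_version: the corpus version the index is built from.
    """
    flagged = []
    for entry in gold:
        labeled = [c["chunk_id"] for c in entry["relevant_chunks"]]
        missing = [cid for cid in labeled if cid not in live_chunk_ids]
        if missing:
            # The labeled chunk was re-chunked, retired, or renamed.
            flagged.append((entry["query_id"], "missing_chunks", missing))
        elif entry["corpus_version"] != current_version:
            # Chunk still exists, but its content may have changed since labeling.
            flagged.append((entry["query_id"], "older_corpus_version", []))
    return flagged


gold = [
    {"query_id": "q_0142", "corpus_version": "2026-05-01",
     "relevant_chunks": [{"chunk_id": "doc_export#c2", "grade": 3}]},
    {"query_id": "q_0143", "corpus_version": "2026-05-01",
     "relevant_chunks": [{"chunk_id": "doc_billing#c5", "grade": 2}]},
]
live = {"doc_export#c2"}
print(labels_needing_review(gold, live, "2026-05-01"))
```

Running a check like this whenever a document is re-chunked or retired turns label maintenance into a review queue rather than a periodic audit.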

Build the set incrementally. Start with twenty to fifty real production queries, label them by hand, and grow from there, prioritizing queries that represent real failures and important intents. A few hundred well-chosen, well-maintained queries beat thousands of synthetic ones. Every production incident should add a query to the set, so the system is permanently protected against repeating that failure, which is the regression-test discipline applied to retrieval.

Answer-quality metrics, kept honest

For layer two, you measure whether the model used the retrieved context well, and the dimension that matters most for grounding is faithfulness: does the answer follow from the retrieved context, or did the model add claims from its own parametric memory? An answer that is fluent and correct-sounding but not grounded in the retrieved context is a hallucination risk even if it happens to be right, because next time the parametric guess will be wrong and you have no citation to catch it.

The RAGAS framework formalizes reference-free metrics for exactly this split: faithfulness (are the answer's claims supported by the retrieved context), answer relevance (does the answer address the question), and context relevance/precision (was the retrieved context actually on-topic). The useful insight from RAGAS is that you can assess these without gold answers by checking the answer against the retrieved context, which scales better than full human labeling. Use LLM-as-judge metrics like these as a fast signal, but anchor them periodically against human judgment on a sample, because LLM judges have their own biases. A judge metric that drifts from human assessment is worse than no metric, since it looks rigorous while misleading you.
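Anchoring a judge against human labels needs some agreement statistic. The sketch below uses Cohen's kappa, which is my choice for illustration rather than something the RAGAS framework prescribes; the label lists are invented, with 1 meaning "faithful" and 0 meaning "not faithful" on the same ten sampled answers.

```python
def cohens_kappa(judge, human):
    """Chance-corrected agreement between two equal-length label lists."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    labels = set(judge) | set(human)
    # Agreement expected if both raters labeled at random with their own base rates.
    expected = sum((judge.count(l) / n) * (human.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented sample: LLM judge vs. human reviewer on the same ten answers.
judge_labels = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
human_labels = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
kappa = cohens_kappa(judge_labels, human_labels)
print(round(kappa, 3))
```

Raw agreement here is 80 percent, but kappa is lower because some of that agreement would occur by chance. Tracking the statistic over time, rather than reading a single value, is what reveals a judge drifting away from human assessment.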

The key discipline: measure context relevance separately from faithfulness. Context relevance is a retrieval metric in disguise (was the right context retrieved). Faithfulness is a generation metric (did the model stay grounded). Reporting them together recreates the single-number trap. Keep them apart and the diagnosis stays clean.

Observability: catching failures you did not anticipate

Offline evaluation on a gold set catches known failure patterns. Production observability catches the ones you did not anticipate, and a moving corpus generates new failure patterns constantly. The foundation is the retrieval trace from chapter two: for every production query, log the rewritten query, the candidates with scores and metadata, what was reranked, what was packed, and what was cited. That trace is what makes production debuggable.
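A trace record with those fields could be shaped like the sketch below. The class names, field names, and the `status` attribute are assumptions for illustration, not the schema from chapter two, and the scores and rewritten query are made up.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Candidate:
    chunk_id: str
    score: float
    status: str = "active"  # hypothetical metadata, e.g. "active" or "deprecated"


@dataclass
class RetrievalTrace:
    query: str
    rewritten_query: str
    candidates: list = field(default_factory=list)    # list of Candidate
    reranked_ids: list = field(default_factory=list)  # order after reranking
    packed_ids: list = field(default_factory=list)    # what went into the prompt
    cited_ids: list = field(default_factory=list)     # what the answer cited

    def to_json(self) -> str:
        # One JSON object per query, suitable for a line-oriented log.
        return json.dumps(asdict(self), sort_keys=True)


trace = RetrievalTrace(
    query="can I export my data on the free plan",
    rewritten_query="free plan data export",
    candidates=[Candidate("doc_export#c2", 0.82),
                Candidate("doc_promo_2019#c3", 0.41, status="deprecated")],
    reranked_ids=["doc_export#c2", "doc_promo_2019#c3"],
    packed_ids=["doc_export#c2"],
    cited_ids=["doc_export#c2"],
)
print(trace.to_json())
```

Keeping candidates, packed, and cited as separate fields is what lets a later reader see at which stage a chunk dropped out.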

On top of the traces, watch a small set of signals that correlate with retrieval failure:

| Signal | What it catches |
|---|---|
| Low top retrieval score / wide score gap to threshold | Query had no good match in the corpus (coverage gap or out-of-scope) |
| "No relevant results" / fallback rate rising | Coverage gap or a discovery/refresh failure removed content |
| High proportion of deprecated chunks retrieved | Freshness or status-filtering problem |
| Citations pointing at old versions | Versioning or freshness drift |
| Retrieval latency spike | Index health, often correlated with a refresh or compaction |
| Same query, different results over time | Corpus changed (expected) or index instability (not) |
| User negative feedback clustered by source | A specific source degraded (parse, chunk, or staleness) |
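Two of these signals fall straight out of logged traces. The sketch below assumes each trace is a dict with a `fallback` flag and a list of candidates carrying a `status` field; both field names are hypothetical, and the sample window is invented.

```python
def window_signals(traces):
    """Compute fallback rate and deprecated-chunk rate over a window of traces.

    traces: list of dicts with 'fallback' (bool) and 'candidates'
            (list of dicts with a 'status' key). Field names are assumptions.
    """
    if not traces:
        return {"fallback_rate": 0.0, "deprecated_rate": 0.0}
    fallbacks = sum(1 for t in traces if t.get("fallback"))
    candidates = [c for t in traces for c in t.get("candidates", [])]
    deprecated = sum(1 for c in candidates if c.get("status") == "deprecated")
    return {
        "fallback_rate": fallbacks / len(traces),
        "deprecated_rate": deprecated / len(candidates) if candidates else 0.0,
    }


# Invented window of four traces.
window = [
    {"fallback": False, "candidates": [{"status": "active"}, {"status": "deprecated"}]},
    {"fallback": True, "candidates": []},
    {"fallback": False, "candidates": [{"status": "active"}]},
    {"fallback": False, "candidates": [{"status": "active"}]},
]
print(window_signals(window))
```

The absolute values matter less than the trend: compare each window against the previous one and alert on a sustained rise.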

The most actionable of these is low-confidence retrieval. When the best candidate's score is far below your usual threshold, the corpus probably does not contain a good answer, and the right behavior is to say so rather than to return the closest weak match and let the model dress it up. A retrieval system that survives contact knows when it does not know, and that knowledge comes from watching the retrieval score, not from the model's confidence in its own prose. Surfacing "I do not have a confident source for this" beats fabricating a fluent answer from a weak match, every time.
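A gate on the retrieval score can be sketched as follows. It assumes scores are comparable across queries (which tends to hold better for reranker scores than for raw similarity scores), that `min_score` has been tuned on your own traces, and that `generate` is a hypothetical callable wrapping the model call. The candidate scores are made up.

```python
ABSTAIN_TEXT = "I do not have a confident source for this."


def answer_or_abstain(candidates, min_score, generate):
    """Answer only when the best retrieved chunk clears the threshold.

    candidates: list of (chunk_id, score), sorted best first.
    min_score:  threshold tuned on your own data (an assumption here).
    generate:   hypothetical callable, list of chunk IDs -> answer text.
    """
    if not candidates or candidates[0][1] < min_score:
        return {"abstained": True, "text": ABSTAIN_TEXT, "sources": []}
    # Pack only chunks that clear the threshold, not every candidate.
    context = [cid for cid, score in candidates if score >= min_score]
    return {"abstained": False, "text": generate(context), "sources": context}


def fake_generate(context):
    # Stand-in for the model call so the example runs on its own.
    return "answer grounded in " + ", ".join(context)


strong = [("doc_export#c2", 0.82), ("doc_pricing#c1", 0.31)]
weak = [("doc_promo_2019#c3", 0.20)]
print(answer_or_abstain(strong, 0.5, fake_generate))
print(answer_or_abstain(weak, 0.5, fake_generate))
```

The decision is made before the model is called, so the abstention depends on the retrieval score alone and not on how confident the generated prose sounds.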

The retrieval incident postmortem

When a retrieval failure reaches production, treat it like any other incident: a blameless postmortem that produces a fix and a regression test. Here is the postmortem template, structured around the Retrieval Failure Chain so the root cause lands on a specific link.

RETRIEVAL INCIDENT POSTMORTEM
- Summary: one sentence (what wrong answer, to whom, impact).
- Trace: paste the retrieval trace for the failing query.
- Failure chain link: which link broke?
    [ ] query interpretation   [ ] permissions       [ ] candidate retrieval
    [ ] reranking              [ ] context packing   [ ] generation   [ ] citation
    [ ] corpus (source itself wrong/stale)
- Root cause: why that link broke (e.g. correct chunk at candidate
  rank 11, no reranker; or doc deprecated in source but not in index).
- Was it detectable? Which signal should have caught it earlier?
- Fix: the specific change (e.g. add reranker; fix discovery for source X).
- Regression test: the query added to the gold eval set, with relevant
  and must_not_retrieve chunks, so this exact failure is now measured.
- Lifecycle gap: which lifecycle stage (discover/validate/refresh/retire)
  let this through, and how is that stage hardened?

The two fields that turn an incident into durable improvement are the regression test and the lifecycle gap. The regression test ensures the specific failure can never silently return. The lifecycle gap forces the question of which stage of the operating loop let it through, so you harden the process and not just the single instance. Over time, the gold eval set becomes a museum of every way your retrieval has failed, and the lifecycle gets tighter at exactly the stages that have hurt you. That accumulation is what "survives contact" actually means in practice: not a system that never fails, but one that turns each failure into a permanent guardrail.

Practical exercise

Build the smallest useful eval set this week: thirty real queries pulled from production logs, each labeled with the correct chunk by hand, including at least five must_not_retrieve negative cases drawn from known stale or wrong-tenant documents. Compute recall@k at your candidate k and your packed k, and MRR, against your current stack. You now have a baseline. Then make one change you have been debating (add a reranker, add the sparse leg, fix a chunking boundary) and re-run the same thirty queries. If the metric you expected to move moved, you have evidence instead of opinion, and you have replaced the unresolvable dashboard meeting with a number that points at a cause.
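Comparing the baseline with the re-run can be a few lines. In this sketch the metric values are invented purely to show the shape of the output, and `min_delta` is an arbitrary noise floor; with thirty queries a single query moves a recall figure by roughly 0.033, so small deltas deserve suspicion.

```python
def compare_runs(baseline, candidate, min_delta=0.02):
    """Report the per-metric change between two runs on the same query set.

    baseline, candidate: dict of metric name -> averaged score.
    min_delta:           changes at or below this size are reported as flat
                         (an arbitrary assumption, tune to your set size).
    """
    report = {}
    for name in sorted(baseline.keys() & candidate.keys()):
        delta = round(candidate[name] - baseline[name], 4)
        if delta > min_delta:
            direction = "up"
        elif delta < -min_delta:
            direction = "down"
        else:
            direction = "flat"
        report[name] = (delta, direction)
    return report


# Invented numbers: before and after adding a reranker.
before = {"recall@100": 0.80, "recall@5": 0.50, "mrr": 0.40}
after = {"recall@100": 0.80, "recall@5": 0.70, "mrr": 0.55}
for metric, (delta, direction) in compare_runs(before, after).items():
    print(f"{metric}: {delta:+.2f} ({direction})")
```

In this made-up run candidate recall is flat while packed recall and MRR rise, which is the pattern you would expect from a change that reorders candidates without finding new ones.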

Summary

A single answer-quality number hides whether retrieval, generation, or the corpus is at fault, so evaluate the two layers separately: retrieval quality against the corpus with recall@k, MRR, and nDCG, and answer quality given the context with faithfulness and context relevance kept apart. The hard, valuable asset is a gold eval set of real production queries labeled with correct (and must-not-retrieve) chunks, and on a moving corpus that set is itself a living artifact whose labels decay and must be refreshed. Production observability on the retrieval trace catches the failures you did not anticipate, and low retrieval confidence should trigger honesty rather than a dressed-up weak match. Every retrieval incident gets a blameless postmortem structured on the failure chain, producing a specific fix, a regression test added to the gold set, and a hardened lifecycle stage. Surviving contact is the accumulation of those guardrails over time.

Key Takeaways

  • Evaluate retrieval and generation separately, or you cannot tell which half broke and you will fix the wrong one.
  • Use recall@k (findable), MRR (near the top), and nDCG (whole ranking good); each answers a different question.
  • The gold eval set of real, labeled production queries is the valuable asset; the metrics are trivial by comparison.
  • On a moving corpus, eval labels decay; maintain the eval set as a living artifact tied to the lifecycle.
  • Keep context relevance (retrieval) and faithfulness (generation) as separate answer-quality metrics; anchor LLM judges against human judgment.
  • Log a retrieval trace for every production query and watch signals like low confidence, deprecated-chunk rate, and rising fallback rate.
  • Low retrieval confidence should trigger "I do not have a confident source" rather than a fluent answer from a weak match.
  • Every incident produces a blameless postmortem, a regression test added to the gold set, and a hardened lifecycle stage.