Why most RAG pipelines fail in month three

The demo retrieves perfectly. Then the corpus grows, the queries drift, and recall quietly collapses. Here is the gap, and how I close it.

Let me tell you what month three looks like. You shipped a RAG pipeline in month one. The demo was clean: you typed in a question, the right chunk surfaced, the LLM answered coherently, and your stakeholders nodded. In month two you connected it to the real data store and it still mostly worked. In month three something quietly broke. You can feel it in the support tickets. Users ask questions that should be answerable. The system returns something plausible but wrong, or returns nothing useful at all. No alert fired. No error log. Recall just eroded, and nobody noticed until a customer noticed for you.

I have watched this happen at three different companies now. Different stacks, different domains, same arc. The demo retrieves perfectly. Then the corpus grows, the queries drift, and recall collapses silently. The pattern is consistent enough that I have stopped treating it as bad luck and started treating it as the default outcome of pipelines built without a retrieval discipline. This essay is about what creates that failure mode and what closes it.

RAG demos pass because they run against a frozen corpus and queries you already know. Production has neither.
Corpus drift is three problems, not one: coverage drift, staleness drift, and query drift. Each needs a different fix.
The failure stays invisible because teams monitor answer quality, not retrieval. A capable model papers over bad retrieval until recall has already collapsed.
Long context does not replace retrieval. It dilutes signal and shifts the burden to attention in a regime where models perform worse.
The closes are operational, not algorithmic: build a retrieval eval harness before you ship, sample production queries into it, audit chunking on new content types, run hybrid retrieval with a reranker, and schedule re-embedding.

The toy corpus problem

Every RAG pipeline I have seen starts its life against a frozen toy corpus. Maybe it is a hundred hand-picked documents. Maybe it is a year of historical support tickets that someone cleaned up and deduplicated. Either way, it has a property that production corpora almost never have: it is stable. Nothing is being added. Nothing is being modified. The distribution of topics does not shift week to week. You chunk it once, embed it once, index it once, and then you write your retrieval code against something that will never change underneath you.

The chunking strategy you picked, maybe 512-token windows with 10% overlap, performs fine on that corpus. The embedding model you chose generalizes well enough across the topics represented. You build your recall intuition on that foundation, which means you build it on something that will stop being true the moment you go to production.

At Devlyn, we work with optical retail networks. The product catalog is not static. Frames come in and out of stock. Vision plan coverage rules update when carriers renegotiate. New lens technology gets added to the lineup. Clinical protocols evolve. When we stood up our first RAG prototypes, we built them on catalog snapshots that were already thirty days stale. They worked beautifully on those snapshots. Then we pointed them at the live data pipeline and the problems started within weeks.

Corpus drift is not one problem, it is three

When I talk about corpus drift I mean three distinct phenomena that are easy to collapse into one word but require different responses.

The first is coverage drift: new content arrives that covers topics your index has never seen. Your embedding model may not have good representations for those topics. Your chunking may not handle the new document structure well. The new content exists in the index, but retrieval on queries about it underperforms because the model was never calibrated against it.

The second is staleness drift: old content becomes wrong. The document that used to correctly describe your return policy now describes a policy you changed six months ago. If you are not surfacing document age as a retrieval signal, you will happily return stale chunks with high cosine similarity to the query. The embedding does not know that the content is wrong; it only knows that it is semantically similar.
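One way to surface document age as a retrieval signal is to decay the similarity score by how old the chunk is. The sketch below is a minimal illustration, not a description of any particular production system: the function names, the `(chunk_id, similarity, last_updated)` tuple shape, and the 180-day half-life are all assumptions chosen for the example.

```python
from datetime import datetime

def age_weighted_score(similarity, last_updated, now, half_life_days=180.0):
    """Down-weight a similarity score with exponential decay on document age.

    half_life_days is a hypothetical tuning knob: a chunk that old keeps
    half of its original score.
    """
    age_days = max((now - last_updated).total_seconds() / 86400.0, 0.0)
    return similarity * 0.5 ** (age_days / half_life_days)

def rerank_by_freshness(hits, now, half_life_days=180.0):
    """Sort (chunk_id, similarity, last_updated) tuples by age-weighted score."""
    return sorted(
        hits,
        key=lambda h: age_weighted_score(h[1], h[2], now, half_life_days),
        reverse=True,
    )

# Illustrative data only: a stale chunk with higher raw similarity
# and a fresher chunk with slightly lower raw similarity.
now = datetime(2025, 1, 1)
hits = [
    ("old_policy", 0.90, datetime(2024, 1, 1)),
    ("new_policy", 0.80, datetime(2024, 12, 1)),
]
ranked = rerank_by_freshness(hits, now)
```

In this toy case the fresher chunk ranks first even though its raw similarity is lower. Whether a decay like this is appropriate depends on the content: evergreen reference material probably should not be penalized for age the way a policy document should.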

The third is query drift: your users stop asking what they were asking in month one. This one is subtle because nothing changes in your pipeline, but the match between your retrieval behavior and your actual workload degrades. If you tuned your chunk size and overlap against questions about product features, and your users start asking questions about installation, troubleshooting, and compatibility, your retrieval parameters may be wrong for the new query distribution even if the corpus itself has not changed.

Most teams discover all three of these together, which makes diagnosis confusing. The fix for coverage drift is different from the fix for query drift. You need to be able to separate them.
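A simple way to start separating them is to slice the same eval results along more than one axis. The sketch below assumes each eval record carries a hit flag plus two labels, `corpus_slice` and `query_cohort`; those field names and label values are hypothetical, not something the pipeline gives you for free.

```python
from collections import defaultdict

def recall_by_slice(records, key):
    """Return the hit rate of eval records grouped by records[i][key]."""
    totals = defaultdict(lambda: [0, 0])  # slice label -> [hits, count]
    for record in records:
        bucket = totals[record[key]]
        bucket[0] += 1 if record["hit"] else 0
        bucket[1] += 1
    return {label: hits / count for label, (hits, count) in totals.items()}

# Illustrative records only. "hit" means a relevant chunk was in the top k.
records = [
    {"corpus_slice": "original", "query_cohort": "month1", "hit": True},
    {"corpus_slice": "original", "query_cohort": "month1", "hit": True},
    {"corpus_slice": "new", "query_cohort": "month1", "hit": False},
    {"corpus_slice": "new", "query_cohort": "recent", "hit": True},
]
by_corpus = recall_by_slice(records, "corpus_slice")
by_cohort = recall_by_slice(records, "query_cohort")
```

As a rough reading: misses concentrated in newly added content point toward coverage drift, misses concentrated in recent query cohorts against unchanged content point toward query drift, and hits on superseded documents point toward staleness. Real results are rarely that clean, but the slices at least tell you where to look first.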

The demo worked because you optimized retrieval for a corpus that was never going to change, against queries you already knew. Production is neither of those things.

The silent collapse: no retrieval evals

Here is the mechanism by which all of this stays invisible for two months: there are no retrieval evals. I do not mean that teams are lazy. I mean that the evaluation infrastructure that would catch recall degradation is genuinely hard to build if you have not done it before, and most teams prioritize end-to-end answer quality instead because that is what users care about and what demos show.

The problem with evaluating only end-to-end answer quality is that a capable LLM can often paper over retrieval failures in the short term. If you retrieve three mediocre chunks, a good model will sometimes synthesize a plausible answer anyway. This creates a false signal: your answer quality metrics look acceptable, so you conclude retrieval is fine, so you do not build the retrieval-specific evals that would show you the truth. Meanwhile recall@5 is drifting from 0.82 in month one to 0.61 in month three and you have no chart that shows it.

# retrieval eval log, sampled weekly from production query logs
# metric: recall@5 on labeled golden set (n=200 query/chunk pairs)

week  recall@5  p50_latency_ms  corpus_size
01    0.84      142             4_211
05    0.81      155             6_034
09    0.76      178             8_917
13    0.61      203             12_440   # ← stakeholder escalation

That table is roughly what we saw at Devlyn when we finally built the retrieval eval harness. Recall was already at 0.61 before the first user escalation reached leadership. We had been looking at generation quality and latency the whole time. Both looked acceptable. Recall was bleeding out quietly in the background.

The reason retrieval evals get skipped is that they require a golden set: a collection of query-and-relevant-chunk pairs that you can measure recall against. Building that golden set takes effort: human annotation, a carefully reviewed LLM-assisted annotation pass, or both. Most teams defer it to "after we ship" and then never find the time. I used to defer it too. I stopped after the third time a pipeline degraded silently on me.
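The metric itself is small once the golden set exists. The sketch below assumes the golden set is a list of `(query, relevant_chunk_ids)` pairs and that `retrieve` is whatever callable wraps your retriever and returns ranked chunk ids; both shapes are assumptions for illustration.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the relevant chunk ids that appear in the top k results."""
    if not relevant_ids:
        raise ValueError("each golden-set query needs at least one relevant chunk")
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def mean_recall_at_k(golden_set, retrieve, k=5):
    """Average recall@k over (query, relevant_ids) pairs.

    retrieve(query, k) is a hypothetical hook that returns ranked chunk ids.
    """
    scores = [recall_at_k(retrieve(query, k), relevant, k)
              for query, relevant in golden_set]
    return sum(scores) / len(scores)

# Toy stand-ins for a golden set and a retriever, for illustration only.
golden_set = [("q1", {"a"}), ("q2", {"b", "c"})]
fake_index = {"q1": ["a", "x"], "q2": ["b", "y", "z"]}
score = mean_recall_at_k(golden_set, lambda q, k: fake_index[q][:k], k=5)
```

Here the first query finds its one relevant chunk and the second finds one of two, so the mean is 0.75. The hard part is the annotation, not the arithmetic.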

Why long context is not the answer

When teams discover retrieval degradation, the first instinct is usually to retrieve more chunks and pass more context. If recall@5 is bad, why not retrieve twenty chunks? If twenty chunks does not fit in the context window, why not use a model with a longer context window? The problem looks like a window size problem, so the solution looks like a bigger window.

I want to be careful here because long context has real uses. But it is not a substitute for retrieval quality, and treating it as one creates its own failure modes.

The core issue is that long context is not memory. A model with a 128k context window does not attend uniformly to all 128k tokens. Empirically, models perform worse on tasks that require locating relevant information in the middle of a long context than at the beginning or end. Retrieval solves the problem of getting the right information in front of the model. Padding the context with more chunks does not solve that problem; it lowers the signal-to-noise ratio and shifts the burden to the model's attention mechanism in a regime where it performs less reliably.

There is also a cost and latency argument. Retrieval that works means you can use a tight context window with high-precision chunks. Retrieval that does not work means you are sending large contexts to a large model on every query, paying for tokens that are mostly noise. The economics of a production RAG system depend more heavily on retrieval precision than most teams realize when they are in demo mode.

The right mental model is that retrieval and context are not substitutes; they are complements. You want retrieval to do its job precisely so that context can be used efficiently. If retrieval is failing, reaching for a bigger context window treats a symptom, not the disease. I cover this in more depth in "Long Context Is Not Memory" if you want the full breakdown of where long context helps versus where it is covering for something else.

Chunking that stops working

One of the more counterintuitive failure modes is that your chunking strategy can be correct for your initial corpus and wrong for the corpus you end up with six months later. Chunking is not a one-time architectural decision. It is a choice that is implicitly coupled to your document structure, your embedding model's context window, and the granularity at which your queries operate.

Consider a knowledge base that starts as a set of product FAQs: short, self-contained documents where a single question and answer fit cleanly into a 512-token chunk. That structure works well. Then the company adds technical documentation, multi-page installation guides, API references, troubleshooting trees. Those documents have a different structure. The meaningful unit of information is not a 512-token window; it might be a section, or a procedure, or a set of related steps. Fixed-size chunking fragments them at arbitrary boundaries, and the resulting chunks lose the context that makes them useful.

The honest version of this is that chunking is a domain-specific problem. There is no universal chunk size. The right answer depends on what you are chunking, how your users query it, and what your embedding model does with chunks of that size. The "Embeddings, Honestly" chapter on chunk size has the detail on how embedding quality degrades at the extremes, but the operator takeaway is simpler: you need to audit your chunking strategy whenever your corpus structure changes significantly, and you need retrieval evals to tell you when it has stopped working.

At Devlyn we ended up with a hybrid chunking approach: fixed-size for structured product data where sections are short and uniform, document-structure-aware chunking for clinical and regulatory content where sections carry their own semantic unit. We did not plan for that in month one. We discovered it in month three when retrieval on clinical queries degraded while retrieval on product queries stayed stable. The signal came from the evals.
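The shape of that kind of routing can be sketched in a few lines. This is an illustration under assumptions, not the Devlyn implementation: the content-type labels are hypothetical, whitespace splitting stands in for a real tokenizer, and "document-structure-aware" is reduced to splitting on blank lines.

```python
import re

def fixed_size_chunks(tokens, size=512, overlap=51):
    """Sliding windows of `size` tokens; consecutive windows share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

def section_chunks(text):
    """One chunk per blank-line-separated section (a stand-in for real structure parsing)."""
    return [s.strip() for s in re.split(r"\n\s*\n", text) if s.strip()]

def chunk_document(text, content_type):
    """Route by a hypothetical content_type label."""
    if content_type == "product":
        tokens = text.split()  # whitespace tokens, not model tokens
        return [" ".join(chunk) for chunk in fixed_size_chunks(tokens)]
    return section_chunks(text)
```

The default overlap of 51 is roughly 10% of 512. The point of the router is less the specific splitters than the fact that the choice is made per content type, so a new document category forces an explicit decision instead of inheriting whatever the FAQ corpus needed.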

Embeddings go stale and nobody schedules the update

Embedding models are not static in their relationship to your corpus. Two things change over time that affect this relationship: the embedding model itself may be updated or replaced, and your corpus vocabulary may drift away from the distribution on which the model was trained.

The second one is more common and more insidious. If you are operating in a domain that has specialized terminology (medical, legal, financial, technical), and new terminology enters your corpus that was not well-represented in the model's training data, retrieval on queries using that terminology will underperform. The model has never seen the term in a way that would give it a useful embedding. Queries using the term match poorly even to documents that are directly about it.

Most teams address this by fine-tuning their embedding model on domain-specific data at the outset, which is good practice. What fewer teams do is schedule any kind of re-embedding cadence. The embedding index is treated as a build artifact: you generate it once, you ship it, and you maintain it only when something obviously breaks. That cadence is too slow for a corpus that is growing and changing continuously.

What I have landed on: quarterly re-embedding of the full corpus when it is under fifty thousand chunks, and more frequent re-embedding for subsets that are changing rapidly. This is operationally annoying, but the cost is predictable, and the alternative (recall degradation that compounds over time) is much more expensive to deal with reactively. The "Retrieval That Survives Contact" chapter on embedding lifecycle has a more detailed treatment of how to decide on re-embedding cadence based on your corpus change velocity.

You will not catch recall degradation by monitoring answer quality. You will catch it by running retrieval evals on production-sampled queries, on a cadence, against a golden set you built when the system was working.

How I actually close the gap

None of what I am describing requires exotic technology. The closes are operational, not algorithmic. They require discipline and scheduling, not new models.

First, build the retrieval eval harness before you ship. Not after. A golden set of two hundred query-chunk pairs, annotated by domain experts or carefully reviewed after LLM-assisted generation, is enough to give you a meaningful recall@5 signal. Run it weekly. Put it in your dashboards next to latency and error rate. The moment recall starts drifting, you want to know it.
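Knowing the moment it drifts means putting a threshold on the weekly number, not just charting it. A minimal sketch, assuming the first logged week is the baseline and that a drop of more than 0.05 in recall@5 is worth an alert (that threshold is an illustrative assumption, not a recommendation):

```python
def first_drift_alert(weekly_recall, max_drop=0.05):
    """Return the first week whose recall is more than max_drop below baseline.

    weekly_recall is a list of (week, recall) pairs in time order; the first
    entry is treated as the baseline. Returns None if no week breaches.
    """
    if not weekly_recall:
        return None
    baseline = weekly_recall[0][1]
    for week, recall in weekly_recall[1:]:
        if baseline - recall > max_drop:
            return week
    return None

# recall@5 values from the eval log earlier in this essay.
eval_log = [(1, 0.84), (5, 0.81), (9, 0.76), (13, 0.61)]
alert_week = first_drift_alert(eval_log)
```

With these settings the check flags week 9, four weeks before the escalation at week 13. A real harness would likely smooth over a few runs to avoid alerting on noise, but even this crude version would have surfaced the problem earlier than the support queue did.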

Second, sample your production query log continuously and add new queries to your eval set. The queries you had in month one are not the queries you will have in month six. Your eval set needs to stay representative of the actual query distribution. Set a cadence: say, once a quarter, add fifty production queries to the golden set with fresh annotations. This keeps the eval honest as query drift happens.
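The sampling step can be sketched as follows. It assumes the production log is available as a list of raw query strings and that case-insensitive exact match is a good enough duplicate check; both are simplifications, and the function name is hypothetical.

```python
import random

def sample_for_annotation(production_queries, golden_queries, n=50, seed=0):
    """Pick up to n production queries that are not already in the golden set.

    Queries are compared after stripping whitespace and lowercasing, so
    near-duplicates with different wording will still get through.
    """
    seen = {q.strip().lower() for q in golden_queries}
    candidates = []
    for query in production_queries:
        normalized = query.strip().lower()
        if normalized and normalized not in seen:
            seen.add(normalized)
            candidates.append(query)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(candidates, min(n, len(candidates)))
```

Uniform random sampling reflects the query distribution as it is, which is the point here. If rare but important query types matter to you, stratifying by topic before sampling may be a better fit. The sampled queries still need fresh relevance annotations before they count as golden-set entries.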

Third, audit your chunking whenever you add a new content type. Before new document categories go into your index, spend time understanding their structure. Do a manual retrieval evaluation: take twenty representative queries for the new content type, run them against your current chunking, look at what comes back. If the chunks look fragmented or decontextualized, your chunking strategy needs to change for that content type.

Fourth, implement hybrid retrieval with a reranker. Pure dense retrieval (cosine similarity against embeddings) is a good baseline but degrades more sharply as your corpus grows and as vocabulary shifts. Hybrid retrieval combines sparse keyword matching such as BM25 with dense retrieval and uses a reranker to reconcile the two result sets, which makes it more robust to corpus growth and vocabulary drift. This is also where agentic retrieval starts to earn its keep, letting the system decide how to query rather than firing a single fixed lookup. The reranker can be a cross-encoder or a lighter learned model; either way it provides a second pass that is less sensitive to the specific embedding model's blind spots.
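Before a learned reranker enters the picture, the two ranked lists have to be merged somehow. One common, model-free option is reciprocal rank fusion (RRF), sketched below. To be clear about the substitution: this is a fusion step, not the cross-encoder reranker described above, which would typically run afterwards on the fused top candidates. The constant `k=60` is the value usually quoted for RRF; the chunk ids are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of chunk ids using reciprocal rank fusion.

    Each list contributes 1 / (k + rank) per chunk, with ranks starting at 1.
    Ties are broken by chunk id so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))

# Illustrative ranked lists from a sparse and a dense retriever.
bm25_ranking = ["c1", "c2", "c3"]
dense_ranking = ["c2", "c4", "c1"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Chunks that both retrievers rank well rise to the top (`c2`, then `c1`), while chunks found by only one retriever still survive into the candidate pool. Because RRF uses ranks rather than raw scores, it avoids having to calibrate BM25 scores against cosine similarities.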

Fifth, schedule re-embedding and make it a first-class operational task. Put it on a calendar. Treat it like a database migration: planned, reviewed, tested against your eval set before it goes to production. An embedding update that drops recall is a regression. Treat it like one.
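Treating it like a regression implies a gate: the re-embedded index is only promoted if it holds recall on the golden set. A minimal sketch, assuming you already have recall@5 for the current index and for the candidate index; the `max_drop` tolerance and the names here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class GateResult:
    promote: bool  # True if the candidate index may replace the current one
    delta: float   # candidate recall minus baseline recall

def reembedding_gate(baseline_recall, candidate_recall, max_drop=0.01):
    """Block promotion when the candidate index loses more than max_drop recall."""
    delta = candidate_recall - baseline_recall
    return GateResult(promote=delta >= -max_drop, delta=delta)

# Illustrative numbers only.
blocked = reembedding_gate(baseline_recall=0.80, candidate_recall=0.70)
allowed = reembedding_gate(baseline_recall=0.80, candidate_recall=0.82)
```

A small tolerance keeps run-to-run noise from blocking every update. In practice you would probably want the same check per content type as well, since an aggregate number can hide a regression in one slice of the corpus.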

The through-line is that production RAG requires ongoing retrieval operations, not just retrieval architecture. The architecture decisions matter (chunking strategy, embedding model choice, hybrid versus dense retrieval), but they degrade over time without the operational layer to maintain them. At Devlyn, owning production readiness means owning the full lifecycle: the initial architecture, the maintenance cadence, and the eval infrastructure that tells us when something is slipping before a customer does.

The demo retrieves perfectly because it was designed against a corpus that was never going to change, with queries you already knew. That is not a problem with the demo; demos are supposed to show the happy path. The problem is treating the demo as evidence that the system is production-ready. It is evidence that the architecture can work. Whether it actually works in six months depends on what you build around it.

Most teams build nothing around it. That is why month three looks the way it does. The retrieval eval is not glamorous work. It does not ship features. It does not impress stakeholders. It is the discipline that keeps everything else from quietly falling apart.

Frequently asked questions

Why do RAG pipelines fail in production? They are usually built and tuned against a frozen toy corpus with a known set of queries. In production the corpus grows and changes, the query distribution drifts, and retrieval quality erodes. Because most teams monitor end-to-end answer quality rather than retrieval, the degradation stays invisible until a customer hits it.

How do you detect retrieval degradation before users do? Build a retrieval eval harness with a golden set of query-and-relevant-chunk pairs and track a metric like recall@5 on a cadence, next to latency and error rate. Answer-quality metrics will not catch it, because a capable model can paper over bad retrieval for a while.

Is a longer context window a substitute for good retrieval? No. Models attend less reliably to information in the middle of a long context, so padding the window with more chunks dilutes the signal and raises cost and latency. Retrieval and context are complements: precise retrieval lets you use a tight, high-precision context efficiently.