Chapter 9 / Field Manuals

Freshness, Versioning, Deletion, and Reindexing

A corpus is a living system, and an index that does not discover, refresh, and retire documents on a schedule is drifting toward wrong by default.

Research spine: this chapter stays grounded in BM25 and Dense Passage Retrieval (DPR), then applies that evidence to the operating judgment in the book. Read this alongside the RAG That Survives book, the AI-Native thesis, and the full book library when you want the surrounding argument.

Three weeks. That is roughly how long the assistant from this book's introduction took to go from magical to quietly wrong, and the cause was not a model regression or a code change. The cause was time. The product was renamed, a policy changed, permissions shifted, and the index kept serving the world as it had been on launch day. Nobody broke anything. The system did exactly what it did at launch, and that was the problem, because the corpus had moved and the index had not.

This chapter is about the operations that keep an index aligned with a corpus that won't sit still: discovering changes, refreshing content, handling versions, deleting documents (which is harder than it sounds), and re-embedding when the model falls behind. These are not features you add once. They are the metabolism of the system, and without them an index decays toward wrong at a rate set by how fast your corpus changes. A fast-moving corpus with a static index has a short shelf life measured in weeks, exactly as the introduction showed.

Key Takeaways

  • A corpus is a living system, and an index that does not discover, refresh, and retire documents on a schedule is drifting toward wrong by default.
  • Evaluate freshness, versioning, deletion, and reindexing against concrete evidence, clear ownership, and known failure modes before you change production behavior.
  • Read it with the adjacent RAG That Survives chapters to move from diagnosis to an implementation or release decision.

The Living Index Lifecycle

I organize all of this around one loop, the Living Index Lifecycle, which is the operating model for the whole book pulled into a single cycle:

  1. Discover changed, new, and deleted documents in the sources.
  2. Parse them into structured blocks (parsing chapter).
  3. Chunk them along structural seams (chunking chapter).
  4. Enrich with metadata from the source system and derivation (metadata chapter).
  5. Embed/index into dense and sparse indexes (retrieval chapter).
  6. Validate that the new content retrieves correctly before serving it.
  7. Serve the validated content to queries.
  8. Observe retrieval behavior in production (observability, next chapter).
  9. Refresh on a cadence matched to each source's change frequency.
  10. Retire documents that are deprecated, expired, or deleted.

A static-folder RAG system implements steps 2 through 5 once and stops. A retrieval system that survives contact runs all ten continuously. The chapters so far built the individual stages; this chapter builds the loop that connects them and keeps it turning.

Discovery: how the index finds out the world changed

The first stage is the one tutorials never mention, and it is the one that prevents the three-week failure. How does your index find out that a document changed? There are three mechanisms, listed from freshest to least fresh, which is also roughly from most to least integration effort.

Event-driven is best where available: the source system emits an event when a document is created, updated, or deleted, and your pipeline reacts. A CMS webhook, a database change-data-capture stream, a message on a queue. This gives near-real-time freshness and is the right choice for fast-moving, high-stakes sources (pricing, policy, status). The cost is integration work and the need to handle the event reliably (at-least-once delivery, ordering, replay).
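The reliability concerns in that last sentence can be sketched in a few lines. This is a minimal illustration, not a production consumer: it assumes each event carries a unique event_id and a per-document version that only increases, and the names ChangeEvent and EventConsumer are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ChangeEvent:
    event_id: str   # assumed unique per event; used to drop redeliveries
    doc_id: str
    version: int    # assumed to increase with every change to the document
    kind: str       # "upsert" or "delete"

@dataclass
class EventConsumer:
    seen_event_ids: set = field(default_factory=set)
    latest_version: dict = field(default_factory=dict)  # doc_id -> version applied
    applied: list = field(default_factory=list)          # stand-in for pipeline calls

    def handle(self, event: ChangeEvent) -> bool:
        """Apply an event at most once and never let an older version win."""
        if event.event_id in self.seen_event_ids:
            return False  # duplicate from at-least-once delivery
        self.seen_event_ids.add(event.event_id)
        if event.version <= self.latest_version.get(event.doc_id, -1):
            return False  # out-of-order event for a version already superseded
        self.latest_version[event.doc_id] = event.version
        self.applied.append((event.kind, event.doc_id, event.version))
        return True

consumer = EventConsumer()
events = [
    ChangeEvent("e1", "pricing", 1, "upsert"),
    ChangeEvent("e3", "pricing", 3, "upsert"),
    ChangeEvent("e2", "pricing", 2, "upsert"),  # arrives late
    ChangeEvent("e3", "pricing", 3, "upsert"),  # redelivered
    ChangeEvent("e4", "pricing", 4, "delete"),
]
results = [consumer.handle(e) for e in events]
```

Because handling is idempotent and version-guarded, replaying the event log after an outage is safe: events already applied are skipped rather than reprocessed.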

Polling with change detection is the pragmatic middle: on a schedule, list the source's documents and compare a change signal (a last_modified timestamp, a content hash, an ETag) against what you have indexed. Anything changed gets re-ingested, anything new gets added, anything missing gets retired. This is more robust than it sounds and works for sources without events. The cadence is the lever: poll a source about as often as it changes, which is exactly the change_frequency you tagged at inventory time.
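A minimal sketch of that comparison, using a content hash as the change signal. It assumes the source can be listed as a mapping of document ID to text and that the index keeps the hash it last ingested; diff_source and both dictionaries are illustrative names, not part of any particular tool.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def diff_source(source_docs, indexed_hashes):
    """Classify documents as new, changed, or missing from the source.

    source_docs:    doc_id -> current text in the source system
    indexed_hashes: doc_id -> hash recorded when the doc was last indexed
    """
    new, changed = [], []
    for doc_id, text in source_docs.items():
        if doc_id not in indexed_hashes:
            new.append(doc_id)
        elif indexed_hashes[doc_id] != content_hash(text):
            changed.append(doc_id)
    # Indexed but no longer listed by the source: candidates for retirement.
    missing = [doc_id for doc_id in indexed_hashes if doc_id not in source_docs]
    return sorted(new), sorted(changed), sorted(missing)

indexed = {
    "a": content_hash("old pricing"),
    "b": content_hash("policy"),
    "c": content_hash("retired page"),
}
source = {"a": "new pricing", "b": "policy", "d": "launch notes"}
new, changed, missing = diff_source(source, indexed)
```

Here "d" is new, "a" is changed, "c" is missing, and "b" is left alone. A content hash costs a read of every document, so where the source exposes a trustworthy last_modified or ETag, comparing that first is cheaper.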

Full re-crawl is the fallback: periodically re-ingest the entire source. Simple, expensive, and slow, so reserve it for small or rarely-changing sources, or as a periodic reconciliation pass to catch anything the incremental mechanisms missed. A monthly full reconcile alongside daily incrementals is a reasonable belt-and-suspenders pattern, because incremental discovery quietly misses things (a deletion that emitted no event, a timestamp that did not update) and the reconcile catches the drift.

The mistake is one global refresh schedule for everything. Your corpus is a mix of cadences: the regulatory filing from 2021 never changes, the status page changes by the minute. A single nightly job either wastes enormous compute re-embedding immutable documents or serves day-old pricing. Match discovery cadence to source cadence, per source.
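One way to express per-source cadence is a small scheduler that asks which sources are due. This is a sketch under assumptions: the change_frequency labels ("minutely", "daily", "never") and the interval table are made up for illustration, and immutable sources are left to the periodic reconcile.

```python
from datetime import datetime, timedelta

# Hypothetical mapping from a source's change_frequency tag to a poll interval.
POLL_INTERVALS = {
    "minutely": timedelta(minutes=1),
    "daily": timedelta(days=1),
    "never": None,  # immutable: ingest once, rely on the periodic reconcile
}

def sources_due(sources, now):
    """Return the names of sources whose poll interval has elapsed.

    sources: name -> (change_frequency, last_polled datetime)
    """
    due = []
    for name, (change_frequency, last_polled) in sources.items():
        interval = POLL_INTERVALS[change_frequency]
        if interval is not None and now - last_polled >= interval:
            due.append(name)
    return sorted(due)

now = datetime(2024, 1, 2, 12, 0)
sources = {
    "status_page": ("minutely", now - timedelta(minutes=5)),
    "pricing": ("daily", now - timedelta(hours=2)),
    "filing_2021": ("never", now - timedelta(days=900)),
}
due = sources_due(sources, now)
```

Only the status page is due in this run: pricing was polled two hours ago on a daily cadence, and the 2021 filing is never re-polled.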

Validation before serving: do not promote broken content

Stage six is the one teams skip and regret. When new or changed content flows through the pipeline, do not serve it immediately. Validate first, because the pipeline can fail silently: a parsing change garbled a source, a chunking edge case produced empty chunks, an embedding job partially failed. If you promote unvalidated content straight to serving, you discover the problem from user complaints.

A lightweight validation gate before promotion: run a held-out set of known queries against the new content in a staging index and confirm the expected chunks are retrievable with reasonable scores. Confirm chunk counts are sane (a source that had 3,500 chunks should not suddenly have 12 or 40,000). Confirm no chunks are empty or below a minimum length. Confirm metadata is populated. Only after the gate passes does the new content get promoted to the serving index. This is the same discipline as testing before deploying code, applied to data, and it catches the silent pipeline failures that are otherwise invisible until they reach users.
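The gate above might look like the following sketch. The thresholds (min_chars, max_ratio), the required metadata keys, and the toy keyword retriever are all assumptions for illustration; a real gate would query the actual staging index and would also check scores, which this example omits.

```python
def make_retriever(chunks, k=3):
    """Toy keyword-overlap retriever standing in for the staging index."""
    def retrieve(query):
        terms = set(query.lower().split())
        scored = [(len(terms & set(c["text"].lower().split())), c["id"]) for c in chunks]
        hits = sorted((p for p in scored if p[0] > 0), key=lambda p: -p[0])
        return [chunk_id for _, chunk_id in hits[:k]]
    return retrieve

def validate_staging(chunks, prior_count, retrieve, golden,
                     min_chars=20, max_ratio=3.0, required_meta=("source", "status")):
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    count = len(chunks)
    if prior_count and not (prior_count / max_ratio <= count <= prior_count * max_ratio):
        failures.append(f"count: {count} chunks vs {prior_count} previously")
    short = [c["id"] for c in chunks if len(c["text"].strip()) < min_chars]
    if short:
        failures.append(f"too_short: {short}")
    bare = [c["id"] for c in chunks if any(not c["meta"].get(key) for key in required_meta)]
    if bare:
        failures.append(f"metadata: {bare}")
    for query, expected_id in golden.items():
        if expected_id not in retrieve(query):
            failures.append(f"golden: {query!r} did not retrieve {expected_id}")
    return failures

good_chunks = [
    {"id": "c1", "text": "Refunds are issued within 14 days of purchase.",
     "meta": {"source": "policy", "status": "active"}},
    {"id": "c2", "text": "The Pro plan costs 30 dollars per month.",
     "meta": {"source": "pricing", "status": "active"}},
]
golden = {"refunds within days": "c1"}
good = validate_staging(good_chunks, 2, make_retriever(good_chunks), golden)

broken = [{"id": "c1", "text": "", "meta": {}}]
bad = validate_staging(broken, 2, make_retriever(broken), golden)
```

Returning a list of failures rather than a boolean means the alert can say what went wrong, which matters when the person paged is not the person who changed the parser.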

Versioning: serve the current world, keep the old one addressable

Versioning is what lets you change content without a gap where the document has no valid chunks and without losing the ability to answer about prior versions. The pattern that survives contact:

  • When a document changes, create the new version's chunks and index them alongside the old version's chunks.
  • Mark the old version's chunks with status: deprecated (or superseded) in their metadata.
  • Once the new version is validated and serving, the deprecated chunks are filtered out of normal retrieval by the status filter (permissions chapter), so current wins.
  • Retain deprecated chunks only as long as you need them: for "how did this work in v3" queries, for audit, or for rollback if the new version turns out to be wrong.

This avoids the gap. There is never a moment where the document exists but has no retrievable chunks, because the old chunks keep serving until the new ones are validated. It also makes rollback trivial: if the new version is bad, flip the status back. And it supports the multi-version reality of real products, where v3 and v4 both have live users and both sets of docs must coexist without poisoning each other, enforced by the product_version filter.
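The pattern can be sketched with an in-memory index. The class and method names here (VersionedIndex, stage, promote) and the extra "staged" status are illustrative assumptions; in a real store, the status flip in promote would need to be a single atomic update.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    version: int
    text: str
    status: str = "staged"  # staged -> active -> deprecated

class VersionedIndex:
    def __init__(self):
        self.chunks = []

    def stage(self, doc_id, version, texts):
        """Index a new version alongside existing ones without serving it."""
        self.chunks += [Chunk(doc_id, version, t) for t in texts]

    def promote(self, doc_id, version):
        """Make one version active and deprecate whichever was active before."""
        for c in self.chunks:
            if c.doc_id != doc_id:
                continue
            if c.version == version:
                c.status = "active"
            elif c.status == "active":
                c.status = "deprecated"

    def search(self, doc_id, statuses=("active",)):
        """Normal retrieval filters to active chunks; history queries widen it."""
        return [c.text for c in self.chunks if c.doc_id == doc_id and c.status in statuses]

index = VersionedIndex()
index.stage("refund-policy", 3, ["Refunds within 14 days."])
index.promote("refund-policy", 3)

index.stage("refund-policy", 4, ["Refunds within 30 days."])
during_validation = index.search("refund-policy")  # old version still serves

index.promote("refund-policy", 4)
after_promote = index.search("refund-policy")
history = index.search("refund-policy", statuses=("deprecated",))

index.promote("refund-policy", 3)  # rollback is a promote of the old version
after_rollback = index.search("refund-policy")
```

Rollback needs no special code path in this sketch: promoting version 3 again reactivates it and deprecates version 4.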

Deletion is harder than addition, and it is where leaks hide

Adding documents is easy. Deleting them correctly is where systems fail, and the failures are serious: a deleted document that still gets retrieved is a stale answer at best and, if it was deleted for legal or privacy reasons, a compliance violation at worst.

Deletion has to propagate through every place the document's content lives, and in a RAG system that is more places than you think:

  • the dense vector index (every chunk's vector)
  • the sparse index (every chunk's terms)
  • any metadata store
  • any cache of retrieval results or generated answers
  • any derived artifact (summaries, extracted entities, a knowledge graph)

Miss any one and the document is partially deleted, which can be worse than not deleted, because it produces inconsistent behavior that is hard to debug. The discipline is a deletion that is transactional or at least reconciled: delete from all stores, then verify the document is unretrievable by querying for its distinctive content and asserting zero results.
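A sketch of delete-then-verify across stores. The Store class is a toy stand-in for each real backend (dense index, sparse index, cache), and ForgottenCache deliberately simulates a store the deletion path never reached so the verification step has something to catch.

```python
class Store:
    """Toy store: chunk_id -> (doc_id, text). Stands in for one real backend."""
    def __init__(self, name, rows):
        self.name = name
        self.rows = dict(rows)

    def delete_doc(self, doc_id):
        self.rows = {k: v for k, v in self.rows.items() if v[0] != doc_id}

    def search(self, phrase):
        return sorted(k for k, (_, text) in self.rows.items() if phrase in text)

class ForgottenCache(Store):
    def delete_doc(self, doc_id):
        pass  # simulates a store the deletion path was never wired to

def delete_everywhere(doc_id, stores, probe_phrase):
    """Delete from every store, then probe each one for distinctive content.

    Returns store name -> surviving chunk IDs; an empty dict means verified.
    """
    for store in stores:
        store.delete_doc(doc_id)
    leaks = {}
    for store in stores:
        hits = store.search(probe_phrase)
        if hits:
            leaks[store.name] = hits
    return leaks

rows = {
    "k1": ("contract-17", "Acme indemnity clause 9.4 applies."),
    "k2": ("faq", "How refunds work."),
}
stores = [Store("dense", rows), Store("sparse", rows), ForgottenCache("answer_cache", rows)]
leaks = delete_everywhere("contract-17", stores, "indemnity clause 9.4")
```

The probe uses content rather than IDs on purpose: a cache of generated answers may hold the text under a key that has no relation to the original document ID.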

There are two flavors of deletion, and they need different handling. Soft deletion marks chunks with status: retired and filters them out; it is reversible and fast, but the content still physically exists in the index, which is unacceptable for true privacy deletions (a GDPR erasure request, for instance, requires the data to be actually gone, not merely hidden). Hard deletion physically removes the vectors and terms. For privacy and legal deletions, you need hard deletion and you need to verify it, including in backups and caches. Treat a "right to be forgotten" request as a hard-deletion runbook with verification, not a status flip.
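The difference between the two flavors is easy to see in a toy store. ChunkStore and its methods are hypothetical names; the point is only that a soft-deleted chunk disappears from search while its text remains in storage.

```python
class ChunkStore:
    def __init__(self):
        self._rows = {}  # chunk_id -> {"text": ..., "status": ...}

    def add(self, chunk_id, text):
        self._rows[chunk_id] = {"text": text, "status": "active"}

    def soft_delete(self, chunk_id):
        self._rows[chunk_id]["status"] = "retired"  # reversible status flip

    def restore(self, chunk_id):
        self._rows[chunk_id]["status"] = "active"

    def hard_delete(self, chunk_id):
        self._rows.pop(chunk_id, None)  # the data itself is removed

    def search(self, phrase):
        """Serving path: only active chunks are visible."""
        return sorted(k for k, row in self._rows.items()
                      if row["status"] == "active" and phrase in row["text"])

    def physically_contains(self, phrase):
        """Audit path: is the text still stored anywhere, visible or not?"""
        return any(phrase in row["text"] for row in self._rows.values())

store = ChunkStore()
store.add("c1", "Jane Doe, account 4471, requested a refund.")

store.soft_delete("c1")
hidden = store.search("Jane Doe")                 # filtered from retrieval
still_stored = store.physically_contains("Jane Doe")

store.hard_delete("c1")
gone = store.physically_contains("Jane Doe")
```

For an erasure request, the check that matters is the audit one. The serving path returns nothing after either kind of deletion, so it cannot tell them apart.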

There is also a subtle index-mechanics issue: some vector indexes do not immediately reclaim space or fully purge on delete; they tombstone and clean up later during compaction. If your index tombstones, a "deleted" vector may still be matchable until compaction runs. Know your index's deletion semantics, and for sensitive deletions, force or verify the compaction. This is exactly the kind of detail you confirm in the vector store's own documentation, since it varies by system.

Re-embedding: when the model, not the content, is stale

The freshness problem usually means content changed. There is a second, slower drift: the embedding model becomes stale relative to your corpus, even when no document changed. This happens for two reasons. First, vocabulary drift: your corpus accumulates new terms (product names, jargon, acronyms) that the frozen embedding model places poorly, steadily degrading dense recall, exactly as the hybrid chapter warned. Second, model upgrades: you adopt a better embedding model, and to use it you must re-embed the entire corpus, because vectors from different models are not comparable.

Re-embedding the whole corpus is a heavy operation and it needs the same care as a deletion: it is a full reprocessing that can fail partway and leave the index in a mixed state with some chunks in the old model's space and some in the new, which silently breaks retrieval because the two spaces are incompatible. The runbook is to embed into a new index, validate it against your eval set in parallel with the live index, compare retrieval quality, and cut over atomically once the new index is proven, keeping the old one for rollback. Never do a partial in-place re-embed across two model versions in a single live index.
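The cutover step of that runbook can be sketched as a single reference swap guarded by the eval comparison. Index, Serving, and the model names are illustrative, and the recall figures are assumed to come from running your eval set against both indexes in parallel.

```python
from dataclasses import dataclass

@dataclass
class Index:
    name: str
    embedding_model: str  # every vector in one index comes from this one model

class Serving:
    def __init__(self, live):
        self.live = live
        self.previous = None

    def cutover(self, candidate, recall_live, recall_candidate):
        """Swap to the candidate only if it is at least as good on the eval set."""
        if recall_candidate < recall_live:
            return False  # keep serving the proven index
        # One assignment swaps the serving reference; queries never see a mix.
        self.previous, self.live = self.live, candidate
        return True

    def rollback(self):
        if self.previous is not None:
            self.live, self.previous = self.previous, self.live

old = Index("docs-v1", "embed-model-a")
new = Index("docs-v2", "embed-model-b")
serving = Serving(old)

rejected = serving.cutover(new, recall_live=0.82, recall_candidate=0.79)
live_after_reject = serving.live.name

accepted = serving.cutover(new, recall_live=0.82, recall_candidate=0.88)
live_after_accept = serving.live.name

serving.rollback()
live_after_rollback = serving.live.name
```

Many vector stores and search engines offer an alias or collection-pointer mechanism that plays the role of this reference swap; check your store's documentation for whether its swap is atomic.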

The signal for when to re-embed is your retrieval evaluation trend (next chapter): if dense recall is slowly declining while content quality is steady, vocabulary drift against a frozen model is a likely cause, and re-embedding (or leaning harder on the sparse leg of your hybrid stack in the meantime) is the fix.
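A slow decline is easier to act on when it is computed rather than eyeballed. This sketch fits a least-squares slope to weekly dense-recall measurements; the threshold of half a point per week and the four-week minimum are arbitrary assumptions you would tune to your own eval noise.

```python
def recall_slope(recalls):
    """Least-squares slope of recall over equally spaced measurements."""
    n = len(recalls)
    mean_x = (n - 1) / 2
    mean_y = sum(recalls) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(recalls))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

def should_investigate_reembed(weekly_recall, max_drop_per_week=0.005, min_weeks=4):
    """Flag a sustained decline; a single bad week does not trigger it."""
    if len(weekly_recall) < min_weeks:
        return False
    return recall_slope(weekly_recall) <= -max_drop_per_week

declining = should_investigate_reembed([0.90, 0.89, 0.88, 0.87, 0.86])
noisy_but_flat = should_investigate_reembed([0.90, 0.91, 0.89, 0.90, 0.90])
```

A flag here is a prompt to investigate, not proof of vocabulary drift: rule out content-quality changes first, as the paragraph above says.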

The index refresh runbook

Here is the artifact: a refresh runbook that operationalizes the lifecycle, parameterized per source.

| Step | Action | Failure handling |
|---|---|---|
| 1. Discover | Pull changed/new/deleted docs (event or poll per source cadence) | Log discovery counts; alert if zero changes from a known-active source (discovery may be broken) |
| 2. Process | Parse, chunk, enrich the changed docs | On parse/chunk error, quarantine the doc, do not promote, alert owner |
| 3. Stage | Index into staging, do not serve yet | Track staging chunk counts vs prior |
| 4. Validate | Run held-out query set against staging; check counts, scores, empty chunks, metadata | If validation fails, abort promotion, keep serving prior version, alert |
| 5. Promote | Atomically swap validated content into serving; mark superseded chunks deprecated | Keep prior version for rollback window |
| 6. Delete | Hard-delete retired/erased content from all stores; verify unretrievable | If verification finds content, escalate as a leak |
| 7. Reconcile | Periodic full compare of source vs index; catch missed deletes/updates | Re-sync drift; log discrepancies |
| 8. Observe | Feed retrieval metrics back to decide next refresh and re-embed need | Trend recall; trigger re-embed on decline |

The two steps most often skipped are 4 (validate) and 7 (reconcile), and they are the two that catch silent drift. Validation catches bad content before users do; reconciliation catches the deletions and updates that incremental discovery quietly missed. Skip them and your index slowly diverges from reality in ways no single query reveals.
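The failure handling for the first four runbook steps can be sketched as one function per refresh run. Everything here is illustrative: refresh_source, the paragraph-splitting process function, and the length-based validate function are stand-ins for your own pipeline stages.

```python
def refresh_source(name, changed_docs, process, validate, known_active=True):
    """Run discover -> process -> stage -> validate for one source.

    Returns (promotable_doc_ids, quarantined_doc_ids, alerts).
    """
    alerts, quarantined, staged = [], [], {}
    if not changed_docs and known_active:
        alerts.append(f"{name}: zero changes discovered; discovery may be broken")
    for doc_id, raw in changed_docs.items():
        try:
            staged[doc_id] = process(raw)
        except ValueError as exc:
            quarantined.append(doc_id)  # never promote a doc that failed parsing
            alerts.append(f"{name}: quarantined {doc_id}: {exc}")
    failures = validate(staged)
    if failures:
        # Abort promotion entirely; the prior version keeps serving.
        alerts.append(f"{name}: validation failed for {failures}; keeping prior version")
        return [], quarantined, alerts
    return sorted(staged), quarantined, alerts

def process(raw):
    """Toy parse-and-chunk step: split on blank lines."""
    chunks = [p.strip() for p in raw.split("\n\n") if p.strip()]
    if not chunks:
        raise ValueError("no parseable content")
    return chunks

def validate(staged):
    """Toy gate: fail any doc with a chunk shorter than 10 characters."""
    return sorted(d for d, chunks in staged.items() if any(len(c) < 10 for c in chunks))

changed = {"faq": "How do refunds work?\n\nRefunds take 14 days.", "empty": "   "}
promoted, quarantined, alerts = refresh_source("help_center", changed, process, validate)
silent = refresh_source("status_page", {}, process, validate)
blocked = refresh_source("help_center", {"x": "short"}, process, validate)
```

In the first run the FAQ is promotable and the empty document is quarantined with an alert. The second run raises the zero-changes alert, and the third promotes nothing because validation failed.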

Practical exercise

Pick your fastest-changing source and trace one real change end to end. Make a small edit to a document in the source system and time how long until that change is reflected in a retrieval result: discovery latency, processing, validation, promotion. Then delete a test document and verify it is truly unretrievable by querying its distinctive content across the dense index, the sparse index, and any cache. Most teams discover one of two things: the change takes far longer to propagate than they assumed, or the deletion did not fully propagate to one of the stores. Either finding is a freshness bug you can now fix before it becomes an incident.

Summary

An index decays toward wrong at the rate its corpus changes, and the introduction's three-week failure was pure freshness drift. The Living Index Lifecycle (discover, parse, chunk, enrich, embed/index, validate, serve, observe, refresh, retire) is the continuous loop that keeps the index aligned with the corpus. Discovery should be event-driven or polled per source cadence, never one global schedule. Validate new content before serving it, version documents so changes have no gap and the old world stays addressable, and treat deletion as a transactional, verified operation that propagates to every store, with hard deletion and verification for privacy and legal erasure. Re-embed when the frozen model drifts behind your vocabulary or when you upgrade models, always into a new validated index with atomic cutover. The two most-skipped steps, validation and reconciliation, are precisely the ones that catch silent drift.

Key Takeaways

  • An index with no refresh decays toward wrong at the speed of corpus change; freshness is metabolism, not a feature.
  • Run the full Living Index Lifecycle continuously, not just parse-chunk-embed once at launch.
  • Match discovery cadence to each source's change frequency; avoid one global refresh schedule.
  • Validate new content in staging against a known query set before promoting it to serving.
  • Version documents so changes have no gap and prior versions stay addressable for history, audit, and rollback.
  • Deletion must propagate to every store (dense, sparse, metadata, caches, derived artifacts) and be verified; use hard deletion for privacy/legal erasure and know your index's tombstone semantics.
  • Re-embed when the frozen model drifts behind your corpus vocabulary or you upgrade models; always into a new validated index with atomic cutover, never a partial in-place mix.