Metadata as Operational Control
Metadata is the control plane of a living index, the difference between retrieving a vector and retrieving the right, current, authorized vector.
A retrieval system without metadata is a system that can only ask one question: what is closest in vector space? That question is necessary and wildly insufficient. Closest in vector space tells you nothing about whether the document is current, whether this user is allowed to see it, which tenant it belongs to, which product version it describes, or whether it was retired last week. A pure-similarity system retrieves the most semantically similar wrong answer with exactly the same enthusiasm as the right one.
Metadata is how you constrain "closest" into "closest among the documents that are current, authorized, in this tenant, for this product version, and not retired." It is the control plane of the entire system. Embeddings answer "what is this about." Metadata answers "should this be retrieved at all, for this query, for this user, right now." Most of the operational power of a retrieval system that survives contact lives in the metadata layer, and most of the catastrophic failures (leaks, staleness, cross-tenant bleed) are metadata failures, not embedding failures.
The fields that earn their place
Metadata grows by accretion if you let it, so I keep a disciplined list where every field does operational work, meaning some part of the system filters, ranks, cites, or expires on it. If a field is never used in a filter, a ranking signal, a citation, or a lifecycle decision, it is documentation, not metadata, and it belongs elsewhere.
Here is the schema I start from. Each row notes what the field is for, because the purpose is the justification.
| Field | Type | Operational use |
|---|---|---|
| source_id | string | Group documents; route ownership and refresh; filter by source |
| owner | string | Route corrections; maintenance accountability (from inventory) |
| visibility | enum / list | Permission filter before retrieval (public, tenant-internal, role-restricted) |
| tenant_id | string | Hard isolation in multi-tenant systems |
| roles_allowed | list | Fine-grained access control |
| version | string | Distinguish current from superseded documents |
| status | enum | current / deprecated / draft / retired; drives filtering and ranking |
| effective_date | date | When the content became valid |
| expires_at | date / null | When it should stop being retrieved |
| last_updated | date | Freshness signal; drives refresh and recency ranking |
| product | string / list | Filter to the product the query is about |
| product_version | string | Distinguish v3 docs from v4 docs |
| locale | string | Filter to the user's language/region |
| doc_type | enum | policy / how-to / reference / changelog; routes ranking and prompt handling |
| entities | list | Named entities for filtering and enrichment |
| parse_confidence | float | From parsing; down-weight low-confidence sources |
| heading_path | list | Location in document (from chunking); used in citation |
This is a starting schema, not a mandate; a product catalog needs sku and category, a legal corpus needs jurisdiction and clause_id. The discipline is constant: every field must be used by something.
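A minimal sketch of how part of this schema might look as a typed record, assuming Python dataclasses. The names ChunkMetadata, Status, FIELD_CONSUMERS, and fields_without_consumers are illustrative, not from any library, and the record covers only a subset of the rows above. The helper makes the discipline checkable: any field that no component claims shows up in the result.

```python
from dataclasses import dataclass, field, fields
from datetime import date
from enum import Enum
from typing import Dict, List, Optional


class Status(str, Enum):
    CURRENT = "current"
    DRAFT = "draft"
    DEPRECATED = "deprecated"
    RETIRED = "retired"


@dataclass
class ChunkMetadata:
    """Illustrative subset of the starting schema (names are assumptions)."""
    source_id: str
    owner: str
    tenant_id: str
    visibility: str
    status: Status
    version: str
    last_updated: date
    product_version: Optional[str] = None
    expires_at: Optional[date] = None
    roles_allowed: List[str] = field(default_factory=list)
    parse_confidence: float = 1.0


# Hypothetical registry: which part of the system consumes each field.
FIELD_CONSUMERS: Dict[str, List[str]] = {
    "source_id": ["citation", "refresh routing"],
    "owner": ["correction routing"],
    "tenant_id": ["permission filter"],
    "visibility": ["permission filter"],
    "status": ["candidate filter", "ranking"],
    "version": ["citation", "candidate filter"],
    "last_updated": ["recency ranking", "refresh"],
    "product_version": ["candidate filter"],
    "expires_at": ["expiry filter"],
    "roles_allowed": ["permission filter"],
    "parse_confidence": ["ranking"],
}


def fields_without_consumers(schema, consumers):
    """Return schema field names that nothing filters, ranks, cites, or expires on."""
    return [f.name for f in fields(schema) if not consumers.get(f.name)]


# Any name listed here is documentation, not metadata, by the rule above.
print(fields_without_consumers(ChunkMetadata, FIELD_CONSUMERS))
```

Running a check like this in CI is one way to stop the schema from growing by accretion: adding a field without registering a consumer fails the build.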
Status is the field that prevents the launch-day-to-week-three failure
Recall the opening of this book: the assistant kept describing a renamed feature because the old article was still in the index. That is a status failure. The old article was never marked deprecated, so retrieval treated it as equal to the current one. The single most valuable metadata field for surviving contact is status, because it encodes the document's position in its own lifecycle and lets retrieval act on it.
A minimal, useful status enum:
- current: the live, authoritative version. Retrievable and rankable normally.
- draft: not yet published. Excluded from retrieval entirely (a frequent contamination source).
- deprecated: superseded but kept for reference. Retrievable only when explicitly asked for historical context, otherwise filtered out or heavily down-ranked, and clearly labeled in citation.
- retired: removed from service. Never retrieved. Kept only for audit if rights require it.
With this field, the launch-day failure becomes a routine operation: when the product is renamed and the new article ships, the old one moves to deprecated, and the retriever stops surfacing it as current. No model change, no re-architecture. A metadata transition. The whole point of status is to make "the corpus moved" an event the index can respond to rather than a silent drift it cannot see.
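The rename scenario can be sketched as a small state machine. This is an illustration under assumptions: the transition table is my guess at a reasonable lifecycle (the text defines the four states, not the allowed moves), and the function names transition and is_retrievable are hypothetical. The retrievability rule here takes the "filtered out" option for deprecated documents unless historical context is requested.

```python
# Assumed lifecycle: which status changes are allowed. Adjust to your process.
ALLOWED_TRANSITIONS = {
    "draft": {"current", "retired"},
    "current": {"deprecated", "retired"},
    "deprecated": {"retired"},
    "retired": set(),
}


def transition(doc, new_status):
    """Return a copy of doc with the new status, or raise if the move is not allowed."""
    old_status = doc["status"]
    if new_status not in ALLOWED_TRANSITIONS[old_status]:
        raise ValueError(f"cannot move {old_status!r} to {new_status!r}")
    return {**doc, "status": new_status}


def is_retrievable(doc, include_historical=False):
    """Apply the status policy: current always, deprecated only on request."""
    status = doc["status"]
    if status == "current":
        return True
    if status == "deprecated":
        return include_historical
    return False  # draft and retired are never retrieved


# The rename: the old article is deprecated when the new one ships.
old_article = {"id": "kb-old-name", "status": "current"}
old_article = transition(old_article, "deprecated")
print(is_retrievable(old_article))
print(is_retrievable(old_article, include_historical=True))
```

The useful property is that the rename is handled by one metadata write, and an illegal move (such as reviving a retired document) fails loudly instead of silently re-entering retrieval.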
Version and product_version prevent answering about the wrong world
The second classic failure is answering correctly about the wrong version. A user on product v3 asks how a feature works, and the retriever returns the v4 documentation because it is semantically identical and more recent. The answer is fluent, well-cited, and wrong for this user. Pure similarity cannot distinguish v3 from v4 docs because they are about the same thing; only metadata can.
product_version as a filter, driven by the user's known context (which version they are on), turns this into a constraint: retrieve only docs for the version this user runs. When the user's version is unknown, the system should surface the version explicitly in the answer ("for v4...") rather than silently picking one. Versioning metadata is also what lets you keep multiple live versions in one index without them poisoning each other, which is the normal situation for any product that supports more than one release at a time.
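A minimal sketch of that constraint, assuming candidates are dicts with a metadata key and that the user's version arrives as a plain string or None. The function names select_for_version and version_label are hypothetical. When the version is unknown, nothing is filtered, but the caller is told to name the version in the answer.

```python
from typing import Optional


def select_for_version(candidates, user_version: Optional[str]):
    """Return (chunks, must_label_version) for the user's known context."""
    if user_version is None:
        # Unknown version: keep all candidates, but signal that the answer
        # must state which version each passage describes.
        return list(candidates), True
    kept = [
        c for c in candidates
        if c["metadata"].get("product_version") == user_version
    ]
    return kept, False


def version_label(chunk):
    """Prefix used when the answer has to surface the version explicitly."""
    return f"for {chunk['metadata']['product_version']}"


candidates = [
    {"id": "export-v3", "metadata": {"product_version": "v3"}},
    {"id": "export-v4", "metadata": {"product_version": "v4"}},
]
kept, must_label = select_for_version(candidates, "v3")
print([c["id"] for c in kept], must_label)
```

Note that a chunk with no product_version is dropped for a user with a known version. That is a deliberate fail-closed choice in this sketch; a corpus with many version-agnostic documents may want an explicit "all versions" value instead.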
Metadata as a ranking signal, not just a filter
Filters are binary: a document is in or out. But metadata also carries graded signals that should influence ranking, and this is where a lot of subtle quality lives. A current document should generally rank above a deprecated one even when both are retrievable. In fast-moving domains, a document with a more recent last_updated should get a recency boost. A document with high parse_confidence should rank above a low-confidence OCR'd one when both match. A document whose doc_type is policy should rank above a changelog for a policy question.
The mechanism is a post-retrieval scoring adjustment: retrieve candidates by similarity (and sparse match), then adjust scores using metadata before reranking. Here is the shape of a metadata-aware scoring step that sits between candidate retrieval and the reranker.
```python
def apply_metadata_signals(candidates, query_ctx, now):
    """Adjust raw retrieval scores with metadata before reranking."""
    for c in candidates:
        m = c["metadata"]
        boost = 1.0
        if m["status"] == "deprecated":
            boost *= 0.3  # down-rank, do not exclude
        if m["status"] in ("draft", "retired"):
            c["excluded"] = True  # hard filter, should already be gone
        # Recency boost in fast-moving sources only.
        if query_ctx.get("recency_sensitive"):
            age_days = (now - m["last_updated"]).days
            boost *= max(0.5, 1.0 - age_days / 365.0)
        # Trust low-confidence OCR less.
        boost *= 0.5 + 0.5 * m.get("parse_confidence", 1.0)
        # Prefer the doc_type that matches intent.
        if m["doc_type"] == query_ctx.get("preferred_doc_type"):
            boost *= 1.2
        c["adjusted_score"] = c["raw_score"] * boost
    return [c for c in candidates if not c.get("excluded")]
```
This is deliberately simple and tunable. The point is not the exact multipliers; it is that metadata participates in ranking, not only filtering. The deprecated document does not vanish (sometimes the historical answer is what the user wants), but it has to be much better on similarity to outrank a current one. That is the right default: current wins ties, and stale has to earn its place.
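To make "much better on similarity" concrete, here is a small worked calculation using the 0.3 multiplier from the scoring step. It assumes all other signals are equal between the two documents; the helper name raw_score_to_beat is hypothetical.

```python
def raw_score_to_beat(current_raw, current_boost=1.0, deprecated_boost=0.3):
    """Raw score a deprecated chunk must exceed to outrank a current one."""
    return current_raw * current_boost / deprecated_boost


for current_raw in (0.1, 0.2, 0.25):
    needed = raw_score_to_beat(current_raw)
    print(f"current raw={current_raw:.2f} -> deprecated needs raw > {needed:.2f}")
```

One consequence worth checking against your own score scale: if raw scores are bounded at 1.0, as cosine similarity is, a deprecated chunk with a 0.3 multiplier can only outrank current chunks whose raw score is below 0.3. If that is stricter than you intend, the multiplier is the knob to tune.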
Where metadata comes from, and why it decays
Metadata is only useful if it is correct, and metadata decays faster than content. A document marked current stays marked current long after it became stale, unless something updates it. There are three sources of metadata, in descending order of reliability.
First, the source system. The CMS, the document management system, the CRM, the code repository, all carry authoritative metadata: who owns it, when it was last updated, what its visibility is, what version it belongs to. Carry this across at ingestion. This is the most reliable source and the one teams most often drop on the floor by exporting only the document body. If you export the text and discard the system's permission and timestamp fields, you have manufactured a metadata decay problem.
Second, derived at ingestion. Some fields you compute: parse_confidence, heading_path, extracted entities, detected locale, inferred doc_type from structure. These are as good as your extraction and should be treated as such (an inferred doc_type is a guess, not a fact).
Third, manual. Some fields require human judgment: sensitivity classification, ownership assignment for orphaned sources, the decision to deprecate. Manual metadata is the least scalable and the most decay-prone, so minimize what depends on it and, where it is unavoidable, route it to the named owner from the inventory.
The operational rule that follows: metadata must be refreshed alongside content, not set once at indexing. When you re-ingest a changed document, you re-read its last_updated, version, and visibility from the source system. A retrieval system that survives contact treats metadata as a live projection of the source system's state, not a snapshot frozen at first index. We will wire this into the refresh runbook in the freshness chapter.
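The rule can be sketched as a small projection step that runs on every re-ingest. This assumes the source system's record is available as a dict; how you fetch it (CMS API, export, webhook) is outside the sketch, and the names LIVE_FIELDS and refresh_metadata are hypothetical.

```python
from datetime import date

# Fields re-read from the source system on every re-ingest, per the rule above.
LIVE_FIELDS = ("last_updated", "version", "visibility")


def refresh_metadata(indexed_meta, source_record):
    """Return (updated_meta, changed_fields) without mutating the input."""
    updated = dict(indexed_meta)
    changed = []
    for name in LIVE_FIELDS:
        if name in source_record and source_record[name] != indexed_meta.get(name):
            updated[name] = source_record[name]
            changed.append(name)
    return updated, changed


indexed = {
    "last_updated": date(2024, 1, 10),
    "version": "3",
    "visibility": "public",
    "status": "current",
}
# Hypothetical record as the source system reports it today.
source = {
    "last_updated": date(2024, 6, 2),
    "version": "3",
    "visibility": "tenant-internal",
}
updated, changed = refresh_metadata(indexed, source)
print(changed)
```

Returning the list of changed fields is a design choice: a visibility change is worth logging or alerting on, because it means the index was serving a document under permissions the source no longer grants.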
Metadata and the failure chain
Metadata touches almost every link of the Retrieval Failure Chain, which is why it is the control plane and not a side feature. At the permissions link, visibility, tenant_id, and roles_allowed are the filter that runs before retrieval. At candidate retrieval, product, product_version, locale, and status constrain the candidate set to the right world. At reranking, status, last_updated, parse_confidence, and doc_type adjust ordering. At citation, source_id, version, and heading_path produce a verifiable attribution. A query that fails at any of these links usually traces to a missing or stale metadata field, not to the embedding.
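The permissions link is the one where a sketch is most useful, because it must fail closed. The version below filters an in-memory list for illustration; in practice the same predicate would usually be expressed as a filter in the vector store query so unauthorized chunks are never scored. The function names and the user dict shape are assumptions, and so is the reading of public as visible across tenants.

```python
def can_see(meta, user):
    """Fail closed: anything not explicitly permitted is hidden."""
    visibility = meta.get("visibility")
    if visibility == "public":
        return True  # assumption: public content crosses tenant boundaries
    tenant = meta.get("tenant_id")
    if tenant is None or tenant != user["tenant_id"]:
        return False  # hard tenant isolation
    if visibility == "tenant-internal":
        return True
    if visibility == "role-restricted":
        allowed = set(meta.get("roles_allowed") or [])
        return bool(allowed & set(user["roles"]))
    return False  # missing or unknown visibility is never retrievable


def permitted_chunks(chunks, user):
    """Apply the permission filter before any similarity ranking."""
    return [c for c in chunks if can_see(c["metadata"], user)]


user = {"tenant_id": "acme", "roles": ["support"]}
chunks = [
    {"id": "pub", "metadata": {"visibility": "public", "tenant_id": "other"}},
    {"id": "own", "metadata": {"visibility": "tenant-internal", "tenant_id": "acme"}},
    {"id": "leak", "metadata": {"visibility": "tenant-internal", "tenant_id": "other"}},
    {"id": "hr", "metadata": {"visibility": "role-restricted", "tenant_id": "acme",
                              "roles_allowed": ["hr"]}},
    {"id": "bare", "metadata": {}},
]
print([c["id"] for c in permitted_chunks(chunks, user)])
```

The last branch is the important one: a chunk with no visibility field is treated as invisible, which turns a missing metadata field into a recall problem you will notice instead of a leak you will not.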
This is the reframe to carry forward: when retrieval brings the wrong world into the prompt, the wrong world was usually selectable on metadata you either did not capture or did not keep current. The embedding found a semantically reasonable document. Metadata was supposed to be the gate that said "but not this one, not now, not for them," and the gate was missing.
Practical exercise
Take ten chunks from your index at random and print their metadata. For each, ask: which field would let me filter this out if it were stale, wrong-tenant, wrong-version, or unauthorized? If a chunk has no status, no visibility, and no version, you have found a chunk you cannot control: it will be retrieved on similarity alone, forever, regardless of whether it should be. Then pick your single worst recent failure and identify which metadata field, had it existed and been current, would have prevented it. That field is your next schema addition.
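A sketch of the first half of the exercise, assuming chunks can be loaded as a list of dicts with id and metadata keys. The three control fields come from the exercise text; the function name audit_sample and the report shape are hypothetical.

```python
import random

# A chunk missing all of these cannot be filtered out, per the exercise above.
CONTROL_FIELDS = ("status", "visibility", "version")


def audit_sample(chunks, sample_size=10, seed=None):
    """Sample chunks at random and report which control fields each one lacks."""
    rng = random.Random(seed)
    sample = rng.sample(chunks, min(sample_size, len(chunks)))
    report = []
    for chunk in sample:
        meta = chunk.get("metadata", {})
        missing = [name for name in CONTROL_FIELDS if meta.get(name) in (None, "")]
        report.append({
            "id": chunk.get("id"),
            "missing": missing,
            "uncontrollable": len(missing) == len(CONTROL_FIELDS),
        })
    return report


chunks = [
    {"id": "a", "metadata": {"status": "current", "visibility": "public", "version": "1"}},
    {"id": "b", "metadata": {}},
    {"id": "c", "metadata": {"status": "current"}},
]
for row in audit_sample(chunks):
    print(row)
```

The second half of the exercise, tracing your worst recent failure to the field that would have prevented it, is a judgment call the script cannot make for you.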
Summary
Metadata is the control plane that turns "closest in vector space" into "closest among the current, authorized, correctly-versioned documents for this user." Every field must do operational work: filter, rank, cite, or expire. Status is the field that converts "the corpus moved" from silent drift into a routine deprecation event, and version metadata prevents confidently answering about the wrong release. Metadata serves as both a hard filter and a graded ranking signal, with current beating stale on ties. It comes from the source system (most reliable), derivation, and manual judgment (most decay-prone), and it must be refreshed alongside content rather than frozen at first index. Most catastrophic retrieval failures are metadata failures wearing an embedding's clothes.
Key Takeaways
- Embeddings answer "what is this about"; metadata answers "should this be retrieved, for this user, right now." It is the control plane.
- Every metadata field must do operational work (filter, rank, cite, or expire) or it is documentation, not metadata.
- status (current/draft/deprecated/retired) turns "the corpus moved" into a routine operation instead of silent drift.
- Version and product_version metadata stop the system from confidently answering about the wrong release.
- Use metadata as both a hard filter and a graded ranking signal; current should win ties, stale must earn its place.
- Carry metadata from the source system at ingestion; do not export only the body and discard permissions and timestamps.
- Refresh metadata alongside content; it is a live projection of source state, not a one-time snapshot.
