Alpesh Nakrani
Chapter 1 / Field Manuals

The CORPUS Inventory You Skipped

Before you embed a single document, you have to know what you actually have, who owns it, and whether you are even allowed to surface it.

Research spine: this chapter stays grounded in BM25 and Dense Passage Retrieval (DPR), then applies that evidence to the operating judgment in the book. Read this alongside the RAG That Survives book, the AI-Native thesis, and the full book library when you want the surrounding argument.

Every team I have watched stand up RAG in a hurry does the same thing. Someone exports a knowledge base, drops it in a bucket, runs an embedding job overnight, and by morning there is a working prototype. The prototype is genuinely impressive. It is also built on a corpus nobody has actually looked at, and that gap is where the next six months of pain lives.

I want to slow down the part everyone skips. Before you chunk, before you embed, before you pick a vector store, you do an inventory. Not a vibes-based scan. A real accounting of what is in the corpus, who owns each part, whether you have the rights to retrieve it, whether you can even parse it, how often it changes, and whether it is findable. I call this the CORPUS Readiness Model, and the letters are not decoration: Coverage, Ownership, Rights, Parsing, Updates, Searchability. Each one is a question that, left unanswered, becomes an incident.

Key Takeaways

  • Before you embed a single document, you have to know what you actually have, who owns it, and whether you are even allowed to surface it.
  • Evaluate the corpus inventory through concrete evidence, ownership, and failure modes before production behavior changes.
  • Read it with the adjacent RAG That Survives chapters to move from diagnosis to an implementation or release decision.

Why the inventory is not optional

A retrieval system is a function of its corpus far more than its model. Two teams using the identical model and identical vector database will get wildly different quality if one has a clean, owned, current, parseable corpus and the other has a junk drawer. The model cannot tell the difference between a current policy and a deprecated one. It cannot tell that two documents contradict each other. It cannot tell that one PDF is a scanned fax that parsed into mojibake. It will retrieve all of it with equal confidence.

The inventory is the moment you confront the actual material instead of the imagined material. In one engagement, a "knowledge base of about 8,000 articles" turned out, on inventory, to be 8,000 entries of which roughly 2,300 were exact or near duplicates from a migration, 900 were drafts that had never been published, 600 were in a language the embedding model handled poorly, and about 400 were internal-only documents that had been accidentally bulk-imported into the public space. The "8,000" was a fiction. The usable, current, public, parseable, owned corpus was closer to 3,500. Indexing the full 8,000 would not have been ambitious. It would have been negligent.

Figure: CORPUS Readiness dashboard with six gauges. Six readiness dimensions scored before any embedding runs: Coverage, Ownership, Rights, Parsing, Updates, Searchability.

C: Coverage

Coverage asks two questions that pull in opposite directions. First, does the corpus actually contain answers to the questions users will ask? Second, does it contain material it should not?

The first is a gap analysis. Pull a sample of real user questions, support tickets, search logs, sales objections, whatever your domain produces, and check by hand whether the corpus contains a correct answer to each. If users routinely ask about billing and the corpus has no billing documentation, no retrieval system can help them. This sounds obvious and is constantly missed, because teams index what they have rather than what users need. A retriever that returns the closest available document to a question it cannot actually answer is worse than one that says "I do not have that." It manufactures plausible wrongness.
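The gap analysis above is hand work, but the tally is easy to script. Here is a minimal sketch, assuming you have already labeled each sampled question by reading the corpus yourself. The questions, topics, and the `coverage_report` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical hand-labeled sample: (question, topic, answerable_from_corpus).
# The labels come from a person reading the corpus, not from a retriever.
labeled_sample = [
    ("How do I reset my password?", "account", True),
    ("Why was I charged twice?", "billing", False),
    ("Can I get a refund?", "billing", False),
    ("How do I export my data?", "data", True),
]

def coverage_report(sample):
    """Return the answerable rate and a count of unanswered questions per topic."""
    answered = sum(1 for _, _, ok in sample if ok)
    gaps = Counter(topic for _, topic, ok in sample if not ok)
    rate = answered / len(sample) if sample else 0.0
    return rate, dict(gaps)

rate, gaps = coverage_report(labeled_sample)
print(f"answerable: {rate:.0%}, gaps by topic: {gaps}")
```

Grouping the misses by topic is the useful part: a cluster of unanswered billing questions tells you which documentation is missing, not just that something is.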

The second is a contamination check. Test corpora, employee personal notes, half-finished migrations, and other teams' documents have a way of ending up in the same bucket. Coverage is not only "do we have enough," it is "do we have only what we intend."

O: Ownership

Every document in a retrieval corpus needs a human owner, and most documents do not have one. Ownership is the difference between a corpus that decays and one that stays current. When the product gets renamed, who updates the affected docs? When a policy changes, who retires the old version? If the answer is "nobody," that document is on a one-way trip to staleness, and your retriever will keep surfacing it long after it became wrong.

Ownership is also how you triage. When retrieval surfaces a stale answer, you need to route the correction to a person, not file it into a void. A corpus without owners cannot be maintained, only periodically re-dumped, and re-dumping carries forward every problem the inventory was supposed to catch.

Map ownership at the source level, not the document level, or you will never finish. Group documents into sources, each source has an owner, each owner has a maintenance expectation. We will use this same source-level grouping when we get to metadata and to permissions, because the owner of a source usually also defines who is allowed to read it.
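Source-level ownership can be sketched as a small lookup. This assumes, purely for illustration, that documents map to sources by path prefix; the `Source` class, the prefixes, and the paths are hypothetical, and your grouping rule may look nothing like this.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Source:
    source_id: str
    path_prefix: str
    owner: Optional[str]  # named human or team; None means nobody owns it

# Hypothetical sources and document paths.
sources = [
    Source("kb-support", "help/", "Support Content team"),
    Source("legacy-wiki", "wiki/", None),
]
doc_paths = ["help/reset-password.html", "wiki/old-pricing.html", "tmp/notes.txt"]

def assign_source(path, sources):
    """Return the first source whose prefix matches the path, or None."""
    for source in sources:
        if path.startswith(source.path_prefix):
            return source
    return None

# Sources with no owner, and documents that belong to no source at all.
unowned = [s.source_id for s in sources if s.owner is None]
unmapped = [p for p in doc_paths if assign_source(p, sources) is None]
print("unowned sources:", unowned)
print("unmapped documents:", unmapped)
```

Both lists are worth reading. An unowned source needs a name attached to it; an unmapped document is often the contamination that the Coverage check was looking for.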

R: Rights

Rights is the question that gets teams into legal trouble: are you actually allowed to retrieve and surface this content, to these users, in this way?

Three sub-questions matter. First, is the content yours to use at all, or is it licensed, third-party, or under a contract that restricts redistribution? Some "knowledge bases" contain vendor documentation or partner materials that you are not permitted to expose through an AI surface. Second, are there regulatory constraints on the content? Health records, financial advice, and personal data carry rules about who can see what and where it can be processed. Third, and this is the one that becomes a security incident, does surfacing this content respect the permission model of the original system?

This last point deserves emphasis. When you extract documents from a system that had its own access controls, the embedding index does not inherit those controls unless you explicitly carry the permission metadata across. A document that was visible only to the finance team in the source system becomes visible to anyone who can query the index, unless you preserve and enforce the original permission. This is not a hypothetical risk; permission leakage through retrieval is a well-documented class of failure, and it is why permissions get their own chapter and why permission metadata gets captured at inventory time, not bolted on later.
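To make "carry the permission metadata across" concrete, here is a minimal sketch of a post-retrieval filter. It assumes each chunk was stamped at ingestion with the groups allowed to read it in the source system. The `Chunk` class, the `allowed_groups` field, and the group names are hypothetical, and a production system would more likely push this filter into the vector store query than apply it afterward.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source_id: str
    allowed_groups: frozenset  # copied from the source system at ingestion

def filter_by_permission(chunks, user_groups):
    """Keep only chunks that share at least one group with the user.

    A chunk with no allowed groups matches nobody, so missing
    permission metadata results in denial rather than exposure.
    """
    user_groups = set(user_groups)
    return [c for c in chunks if c.allowed_groups & user_groups]

retrieved = [
    Chunk("Q3 forecast detail", "finance-drive", frozenset({"finance"})),
    Chunk("How to reset a password", "kb-support", frozenset({"everyone"})),
    Chunk("Imported without metadata", "unknown", frozenset()),
]
visible = filter_by_permission(retrieved, {"everyone", "sales"})
print([c.source_id for c in visible])
```

The design choice that matters is deny by default: a chunk whose permissions were never captured is treated as unreadable, which is exactly why the metadata has to be collected at inventory time.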

P: Parsing

Parsing is the gap between the documents you wish you had and the documents you actually have. You imagine clean markdown. You have a mix of PDFs (some text, some scanned images), HTML with navigation chrome and cookie banners baked in, Word documents with tracked changes, spreadsheets where the real information lives in cell relationships, slide decks where the meaning is in the layout, and code files where structure is everything.

At inventory time you do not need to solve parsing. You need to honestly classify it. For each source, what format is it, and how cleanly does it parse? A scanned PDF that requires OCR is a different cost and risk than a native markdown file. A spreadsheet where a number means nothing without its row and column headers will produce useless chunks if you naively flatten it to text. The next chapter is entirely about parsing, because it is so consistently underestimated, but the inventory is where you flag which sources are going to be expensive or lossy to ingest.

A blunt rule from the field: if a document does not parse into clean, structured text, it does not belong in the index yet. A garbled chunk does not just fail to help; it actively competes for retrieval slots with chunks that would have helped.
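One rough way to apply that rule at inventory time is a character-level heuristic over sampled parser output. This is a sketch, not a parser quality metric: the marker set assumes a mostly English corpus (other languages use these characters legitimately), and the 5% threshold is an arbitrary placeholder to tune against documents you have inspected by hand.

```python
# Characters that often signal a bad decode in English text:
# U+FFFD (replacement character), U+00C3 and U+00C2 (common mojibake lead bytes).
MOJIBAKE_MARKERS = {"\ufffd", "\u00c3", "\u00c2"}

def garble_ratio(text):
    """Share of characters that look like decode or OCR damage."""
    if not text:
        return 1.0  # an empty parse is treated as a failed parse
    suspicious = sum(
        1
        for ch in text
        if ch in MOJIBAKE_MARKERS
        or (not ch.isprintable() and ch not in "\n\r\t")
    )
    return suspicious / len(text)

def parse_status(text, threshold=0.05):
    """Classify parser output; the threshold is a hypothetical starting point."""
    return "garbled" if garble_ratio(text) > threshold else "clean-enough"

print(parse_status("Refunds are issued within 5 business days."))
print(parse_status("Ref\u00c3\u00bcnd p\u00c3\u00b6licy \ufffd\ufffd\ufffd"))
```

A check like this only catches the loud failures. A spreadsheet flattened into well-formed but meaningless text will pass it, so it supplements reading the twenty sampled documents rather than replacing it.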

U: Updates

Updates is the metabolism question. How often does each source change, and how will the index find out?

Sources fall on a spectrum. A regulatory filing from 2021 is effectively immutable; once indexed, it never changes. A pricing page might change weekly. A product changelog changes daily. An incident status page changes by the minute. A corpus is almost never uniform; it is a mix of fast-moving and slow-moving sources, and treating them all with one refresh schedule is how you end up either re-embedding immutable documents wastefully every night or serving week-old pricing because your refresh runs monthly.

At inventory time, tag each source with an expected change frequency. This single attribute drives your entire refresh strategy later, and we will build a full lifecycle around it in the chapter on freshness. The cost of getting it wrong is asymmetric: under-refreshing a fast source means serving stale answers; over-refreshing a slow source wastes compute. You want to refresh each source about as often as it changes, no more, no less.
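Here is a minimal sketch of how a change-frequency tag could drive a refresh decision. The tag names and the intervals attached to them are assumptions for illustration; pick values that match how your sources actually behave.

```python
from datetime import datetime, timedelta

# Hypothetical mapping from the inventory's change_frequency tag
# to how long an indexed copy may age before it is refreshed.
REFRESH_INTERVAL = {
    "immutable": None,              # index once, never re-embed
    "monthly": timedelta(days=30),
    "weekly": timedelta(days=7),
    "daily": timedelta(days=1),
}

def is_due(change_frequency, last_indexed, now):
    """Return True if the source should be refreshed at time `now`.

    An unknown tag raises KeyError on purpose: a source with no
    known cadence should be fixed in the inventory, not guessed at.
    """
    interval = REFRESH_INTERVAL[change_frequency]
    if interval is None:
        return False
    return now - last_indexed >= interval

now = datetime(2024, 1, 10)
print(is_due("weekly", datetime(2024, 1, 1), now))     # 9 days old
print(is_due("monthly", datetime(2024, 1, 1), now))    # 9 days old
print(is_due("immutable", datetime(2021, 6, 1), now))  # never refreshed
```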

S: Searchability

Searchability asks whether the content, once parsed and chunked, can actually be found by the queries users will write. This is where vocabulary mismatch lives. Your legal team writes "termination and pro-rata credit"; your users type "refund." Your engineers write "401 Unauthorized"; your users type "it says I'm not logged in." If the corpus only ever uses internal vocabulary and users only ever use natural language, dense retrieval will struggle and sparse retrieval will fail outright, because the literal tokens never match.

At inventory time, check whether each source uses language that resembles how users ask. Where it does not, you have options we will develop later: query rewriting to bridge user language to corpus language, metadata enrichment to attach synonyms and entities, or hybrid retrieval to catch both literal and semantic matches. But you cannot choose the fix until you have noticed the gap, and you notice it by sampling real queries against real documents.
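A crude first pass at that sampling is literal term overlap between real queries and a source's text, which approximates what a sparse retriever has to work with. This is a sketch under stated assumptions: the `query_term_coverage` helper is hypothetical, it does no stemming or stopword removal, and it says nothing about semantic similarity.

```python
import re

def tokens(text):
    """Lowercase alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def query_term_coverage(queries, documents):
    """Fraction of distinct query terms that appear anywhere in the documents."""
    corpus_vocab = set()
    for doc in documents:
        corpus_vocab |= tokens(doc)
    query_vocab = set()
    for query in queries:
        query_vocab |= tokens(query)
    if not query_vocab:
        return 0.0
    return len(query_vocab & corpus_vocab) / len(query_vocab)

queries = ["how do I get a refund"]
legal_docs = ["Termination and pro-rata credit are governed by section 4."]
help_docs = ["To get a refund, contact support."]

print(query_term_coverage(queries, legal_docs))  # no shared terms
print(query_term_coverage(queries, help_docs))   # "get", "a", "refund" shared
```

Because common words are not filtered out, the number flatters most sources. Treat a low score as a strong signal of mismatch and a high score as no signal at all.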

The inventory template

Here is the artifact. One row per source, filled in before any embedding happens. I keep this in a spreadsheet during discovery and graduate the durable parts into the metadata schema later.

| Field | What it captures | Example |
|---|---|---|
| source_id | Stable identifier for the source group | kb-support |
| description | Human description | "Public help-center articles" |
| doc_count | Raw count, then de-duplicated count | 8,000 raw / 3,500 unique |
| owner | Named human or team responsible | "Support Content team" |
| coverage_notes | Which user questions it answers / gaps | "No billing coverage" |
| rights | License, regulatory, redistribution status | "Owned, public-OK" |
| visibility | Who may retrieve it | public |
| formats | Source formats and parse difficulty | "HTML clean; 12% PDF scanned" |
| change_frequency | How often it changes | "weekly" |
| vocabulary_gap | User-language vs corpus-language mismatch | "high: legal phrasing" |
| readiness | Ready / needs-work / exclude | "needs-work" |

The readiness column is the decision. A source marked exclude does not get indexed, no matter how much someone wants the document count to look big. A source marked needs-work gets a remediation owner and a date. Only ready sources flow into ingestion. This is unglamorous and it is the single highest-return hour you will spend on the project.
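If the inventory lives in a spreadsheet, the readiness gate can be a few lines over a CSV export. This is a sketch, assuming a hypothetical export with only four of the columns and the three readiness values written in lowercase; the rows are invented.

```python
import csv
import io

# Hypothetical export of the inventory spreadsheet; only a few columns shown.
INVENTORY_CSV = """source_id,owner,visibility,readiness
kb-support,Support Content team,public,needs-work
api-reference,Docs team,public,ready
crm-notes,,internal,exclude
"""

VALID_READINESS = {"ready", "needs-work", "exclude"}

def sources_to_ingest(csv_text):
    """Return the source_ids marked ready, rejecting unknown readiness values."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        if row["readiness"] not in VALID_READINESS:
            raise ValueError(
                f"{row['source_id']}: unknown readiness {row['readiness']!r}"
            )
    return [row["source_id"] for row in rows if row["readiness"] == "ready"]

print(sources_to_ingest(INVENTORY_CSV))
```

Failing loudly on an unrecognized value is deliberate. A typo such as "redy" should stop the run, not silently drop or admit a source.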

Scoring readiness without overengineering it

You do not need a numeric maturity model. A simple three-state score per dimension is enough to drive decisions, and it forces honesty.

| Dimension | Red | Amber | Green |
|---|---|---|---|
| Coverage | Major question gaps or contamination | Some gaps, known | Answers real questions, clean |
| Ownership | No owner | Owner exists, no maintenance SLA | Owner with refresh commitment |
| Rights | Unclear or restricted | Owned but sensitivity unreviewed | Cleared for these users |
| Parsing | Lossy / garbled | Parses with known artifacts | Clean structured text |
| Updates | Unknown cadence | Known but unmonitored | Known and monitored |
| Searchability | Severe vocabulary mismatch | Some mismatch | User and corpus language align |

Any red on Rights is a hard stop until resolved, because the downside is a leak, not a quality dip. Red on Parsing means the source waits. Red on the others means remediation with an owner and a date. The point is not the rubric; the point is that someone looked.
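Those rules are simple enough to write down as a function. This is a sketch of the precedence described above, with hypothetical action labels; what to do with ambers is not specified here, so the function only reports that no red was found.

```python
DIMENSIONS = (
    "coverage", "ownership", "rights", "parsing", "updates", "searchability",
)

def decide(scores):
    """Map per-dimension red/amber/green scores to an action for one source."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        # An unscored dimension means nobody looked, which is the real failure.
        raise ValueError(f"unscored dimensions: {missing}")
    if scores["rights"] == "red":
        return "hard-stop"   # leak risk: blocked until resolved
    if scores["parsing"] == "red":
        return "wait"        # source stays out of the index for now
    if any(scores[d] == "red" for d in DIMENSIONS):
        return "remediate"   # needs an owner and a date
    return "no-red"          # ambers are left to team judgment

all_green = {d: "green" for d in DIMENSIONS}
print(decide(all_green))
print(decide({**all_green, "rights": "red", "parsing": "red"}))
print(decide({**all_green, "parsing": "red"}))
print(decide({**all_green, "ownership": "red"}))
```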

A short field story on doing this backwards

A company I advised had already launched before doing any inventory. Their RAG assistant for internal sales reps was surfacing deal terms from closed-lost opportunities as if they were standard pricing. On inventory, we found the CRM export had pulled in negotiation notes that reps had written for their own eyes, full of one-off discounts and "we caved on this to close it" commentary. Coverage was contaminated, Rights and visibility were never reviewed, and Ownership was nonexistent because the notes belonged to individual reps who never expected them in a shared system. The fix was not a model change. It was an inventory, a visibility tag, and the exclusion of an entire source. The assistant got dramatically more trustworthy by retrieving less.

That is the theme. A retrieval system that survives contact is built on a corpus someone has actually accounted for. The inventory is where you decide what is worthy of being retrieved, and that decision shapes every answer the system will ever give.

Practical exercise

Run a one-page inventory on your single largest source this week. Get the raw document count and the de-duplicated count; the gap alone is usually instructive. Name an owner. Mark the visibility. Sample twenty real user questions and check by hand how many the source can actually answer. Sample twenty documents and check how cleanly they parse. Assign red/amber/green on each CORPUS dimension. You will almost certainly find at least one source you should not be indexing as-is, and finding it now is cheaper than finding it in an incident.
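For the raw versus de-duplicated count, a normalized hash gets you a first number quickly. This sketch only catches documents that are identical after collapsing whitespace and case. Near duplicates from a migration, such as the same article with a different footer, need a fuzzier technique and will not be counted here.

```python
import hashlib
import re

def normalized_hash(text):
    """Hash the text after collapsing whitespace and lowercasing."""
    collapsed = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(collapsed.encode("utf-8")).hexdigest()

def count_raw_and_unique(documents):
    """Return (raw count, count of distinct normalized documents)."""
    unique = {normalized_hash(doc) for doc in documents}
    return len(documents), len(unique)

# Hypothetical sample: the first two differ only in case and whitespace.
docs = [
    "Reset your password.",
    "reset  your password.\n",
    "Update billing details.",
]
raw, unique = count_raw_and_unique(docs)
print(f"{raw} raw / {unique} unique")
```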

Summary

Before embeddings, take inventory. The CORPUS Readiness Model gives you six dimensions that each become an incident if skipped: Coverage (does it answer real questions and contain only intended content), Ownership (a named human per source), Rights (legal, regulatory, and permission constraints), Parsing (clean structured text or it waits), Updates (change cadence per source drives refresh), and Searchability (does corpus language match user language). Score each source red/amber/green, exclude what is not ready, and remediate the rest with owners and dates. The model cannot tell current from stale or owned from leaked; the inventory is the only place that judgment gets made.

Key Takeaways

  • A retrieval system is a function of its corpus far more than its model; inventory before you embed.
  • Coverage cuts both ways: does the corpus answer real questions, and does it contain only what you intend?
  • Every source needs a named owner, or it decays and surfaces stale answers forever.
  • Rights includes carrying the source permission model across; the index does not inherit access controls automatically.
  • If a document does not parse into clean structured text, it does not belong in the index yet.
  • Tag each source with a change frequency; it drives the entire refresh strategy later.
  • Score each source red/amber/green per dimension; exclude the unready and remediate the rest with an owner and a date.