Appendix A: Back Matter
Glossary, implementation checklist, and source register for the book.
Key Takeaways
- Back Matter consolidates the checklist, glossary, and source map readers need after the main argument.
- The useful artifact is not more prose; it is a way to turn back matter into an implementation review.
- Treat the appendix as the operating memory for the book: terms, gates, and references in one place.
Glossary
**Anchor (real-data anchor): ** A protected set of real examples held aside, never generated over and never leaked into evaluation, included in training as the real tail. The reality that keeps a synthetic dataset honest; CAREFUL's A. Re-injecting fresh real data is what arrests model collapse.
**CAREFUL: ** The book's framework: Clear purpose, Anchored reality, Recorded provenance, Evaluation isolation, Filtering gates, Use limits, Lifecycle monitoring. The question set a mature synthetic-data pipeline can answer for any dataset it ships.
**Coverage of real: ** The fraction of real examples that have a synthetic neighbor within a similarity radius. The decisive semantic-diversity metric: it answers whether synthetic data reached the real distribution, distinct from internal spread.
**Crossing: ** When data generated for a lenient use (e.g. demo) silently migrates into a stricter one (e.g. eval or training), usually via an untagged file and an over-broad glob. The most common structural error in synthetic-data work.
**Data card: ** Human-readable companion to a manifest, whose most valuable section is "NOT good for," naming the tempting misuses (eval, label-gold) that facts alone do not prevent.
**Fabrication: ** Generating examples plausible in isolation but reflecting the generator's defaults rather than a measured distribution. Satisfies a prompt. Appropriate for edge cases and documentation; disqualifying when used as training/eval that faces reality. Contrast simulation.
**Fingerprints: ** The generator's systematic marks on its output: preferred style, length, register, flattened tails, confident errors, blind spots. Present on every generated example until measured and removed.
**Generator overlap (contamination): ** An evaluation set generated by a model in the system's training lineage is contaminated by construction, sharing the generator's assumptions, even with zero string overlap. Detectable only via recorded generator provenance.
**Label accuracy (on reviewed sample): ** The rate at which a generated example's prompted/assumed label matches what a human reviewer judges from the text. Catches the assumed-label failure that a balanced count hides.
**Manifest: ** Machine-readable dataset-level record of composition, generation recipe, quality gates, and use policy. Earns its keep only when enforced at load time.
**Memorization (leak): ** A generator trained on sensitive data emitting it unprompted because the model memorized it. A privacy leak originating in the model's weights, not the prompt.
**Model collapse: ** Degradation from recursive training on a model's own unfiltered, unanchored output: tails vanish first, variance shrinks, the distribution converges toward the mean. Requires recursion, no anchor, and no filtering; avoided by removing any one.
**Near-duplicate: ** A generated example semantically equivalent but not byte-identical to another. Inflates apparent dataset size and causes train/eval leakage that exact-match checks miss.
**Prompt echo (leak): ** Real PII fed into a generation prompt as seed or context and reproduced (verbatim or paraphrased) in the output. The most common and most preventable synthetic-data privacy leak.
**Provenance: ** The recorded origin of a generated example: generator and version, prompt hash, seed, sampling params, post-processing, reviewer. Without it, a synthetic row cannot be excluded from eval, reproduced, deleted, or attributed.
**Realism AUC: ** The cross-validated AUC of a classifier trained to distinguish real from synthetic examples. ~0.5 means indistinguishable (good for training realism); ~1.0 means trivially fake. The right target depends on intended use: red-team data, for instance, wants high separability.
**Reject set: ** Everything the gates threw out, retained with its reason. The most undervalued artifact in a pipeline: it diagnoses what the generator over-produces and surfaces alarms (e.g. leaked PII).
**Simulation: ** Generating examples plausibly drawn from a measured real distribution, including its mess and tail. Matches a distribution you have characterized. The mode required for training/eval data that faces reality. Contrast fabrication.
**Synthetic ratio (max): ** The recorded, enforced cap on how much of a training mix may be synthetic, set per use case. Diversity-preserving downsampling enforces it; the actual ratio is logged with the dataset id.
**Verifier: ** The mechanism that checks a generated example's correctness (execution for code, grounding for extraction, native speaker for low-resource, domain expert for cautious domains). Its cost and reliability set how heavily a task can safely lean on synthetic data.
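The "Coverage of real" entry above defines a metric concrete enough to sketch. The following is a minimal illustration, assuming examples have already been embedded as numeric vectors and that "similarity radius" is read as a cosine-similarity threshold; the function names, the threshold of 0.8, and the toy vectors are all assumptions made for the example, not values from the book.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def coverage_of_real(
    real: Sequence[Vector],
    synthetic: Sequence[Vector],
    min_similarity: float = 0.8,  # assumed threshold, tune per task
) -> float:
    """Fraction of real vectors with at least one synthetic neighbor
    whose cosine similarity is at or above min_similarity."""
    if not real:
        raise ValueError("coverage is undefined without real examples")
    covered = sum(
        1
        for r in real
        if any(cosine_similarity(r, s) >= min_similarity for s in synthetic)
    )
    return covered / len(real)


# Toy 2-D "embeddings", purely illustrative.
real_embeddings = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (-1.0, 0.0)]
synthetic_embeddings = [(0.9, 0.1), (0.7, 0.7)]

coverage = coverage_of_real(real_embeddings, synthetic_embeddings)
print(f"coverage of real: {coverage:.2f}")
```

Note the direction of the loop: it iterates over real examples and asks whether synthetic data reached each one. Iterating over synthetic examples instead would measure internal spread, which the glossary treats as a different question.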
Implementation Checklist
A team's synthetic-data practice is approaching production-ready when it can answer yes, with evidence, to each item. Items are grouped by CAREFUL letter, which maps to the lifecycle.
C: Clear purpose (Movements I, VII)
- Every generation effort has a written purpose worksheet: the named gap, why real data won't suffice, the success criterion measured against real data.
- Each artifact is tagged at birth with its type and intended use; the type×use risk cell sets gate strength.
- Legal/policy review happens at the purpose stage, not as a final sign-off, including seed-data consent and vendor terms.
A: Anchored reality (Movements I, III, VI, VIII)
- A real anchor is held aside, flagged `never_train_on` and `eval_exclusion`, and included in training as the real tail.
- The real distribution is characterized (length, tone, channel, balance, difficulty) and generation simulates toward it rather than fabricating toward defaults.
- Diversity is measured against the real anchor on all five axes (lexical, semantic/coverage-of-real, label, source, difficulty), never asserted.
- For each task, the verifier is named and its cost/reliability assessed before deciding how heavily to lean synthetic.
R: Recorded provenance (Movement II)
- Producing an example and recording its lineage are the same operation; a synthetic row cannot be inserted without naming its generator and prompt (DB constraint).
- Prompt templates that produce kept data live in a versioned store, not a notebook; their hashes are in the manifest.
- Every dataset ships with a manifest and a data card whose "NOT good for" section names the tempting misuses.
- A dataset's generation recipe is versioned source: changes produce a reviewable diff and a re-versioned, reproducible dataset.
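The first item in this group asks for a database constraint that makes lineage a condition of insertion. One way this might look, as a sketch using SQLite from the Python standard library: the table name, column names, and the choice of SHA-256 for the prompt hash are assumptions for illustration, and a production schema would carry more of the provenance fields listed in the glossary.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE synthetic_rows (
        id                INTEGER PRIMARY KEY,
        text              TEXT NOT NULL,
        -- Lineage columns: NOT NULL plus CHECK means a row without
        -- a named generator and prompt hash cannot be stored at all.
        generator         TEXT NOT NULL CHECK (length(trim(generator)) > 0),
        generator_version TEXT NOT NULL,
        prompt_hash       TEXT NOT NULL CHECK (length(prompt_hash) = 64)
    )
    """
)


def prompt_hash(template: str) -> str:
    """SHA-256 hex digest of the prompt template (64 characters)."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()


def insert_synthetic_row(text, generator, generator_version, template):
    digest = prompt_hash(template) if template is not None else None
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO synthetic_rows "
            "(text, generator, generator_version, prompt_hash) "
            "VALUES (?, ?, ?, ?)",
            (text, generator, generator_version, digest),
        )


# A row with lineage is accepted. The generator name is a placeholder.
insert_synthetic_row(
    "My card was charged twice.",
    "example-generator",
    "2024-01",
    "Write a billing complaint in a {tone} tone.",
)

# A row without lineage is refused by the database itself.
try:
    insert_synthetic_row("Orphan row with no lineage.", None, None, None)
    rejected = False
except sqlite3.IntegrityError as err:
    rejected = True
    print("insert refused:", err)

row_count = conn.execute("SELECT COUNT(*) FROM synthetic_rows").fetchone()[0]
print("rows stored:", row_count)
```

Putting the rule in the schema rather than in application code means every writer is subject to it, including ad hoc scripts and notebooks.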
E: Evaluation isolation (Movement V)
- Eval data carries `never_train_on`; the training loader aborts (not warns) if any eval/excluded row appears.
- Train/eval leakage is scanned both exact and semantic (near-duplicate) before every training run.
- A generator-provenance check confirms no eval item came from the system's training lineage.
- Eval answers are anchored in humans, extractive spans, or executable verifiers, never generated gold; questions with no trustworthy gold are discarded.
- Synthetic evals run alongside a real/human-labeled eval and divergence between them is treated as an early warning.
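The loader and leakage items in this group can be sketched as a single pre-training check. This is a simplified illustration under stated assumptions: rows are dicts with `id` and `text` keys, and token-set Jaccard overlap stands in for a real semantic near-duplicate check, which would more plausibly use embeddings. The function names and the 0.8 threshold are hypothetical.

```python
import hashlib
import re


class LeakageError(RuntimeError):
    """Raised to abort a training run; a warning is not enough here."""


def tokens(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())


def fingerprint(text: str) -> str:
    # Exact match after normalizing case, punctuation, and whitespace.
    return hashlib.sha256(" ".join(tokens(text)).encode("utf-8")).hexdigest()


def token_jaccard(a: str, b: str) -> float:
    # Lexical stand-in for semantic similarity (an assumption of this sketch).
    ta, tb = set(tokens(a)), set(tokens(b))
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def assert_no_leakage(train_rows, eval_rows, near_dup_threshold=0.8):
    """Abort if any training row is excluded, or matches an eval row
    exactly or approximately."""
    eval_prints = {fingerprint(r["text"]) for r in eval_rows}
    for row in train_rows:
        if row.get("never_train_on"):
            raise LeakageError(f"row {row['id']} is flagged never_train_on")
        if fingerprint(row["text"]) in eval_prints:
            raise LeakageError(f"row {row['id']} exactly matches an eval item")
        for ev in eval_rows:
            if token_jaccard(row["text"], ev["text"]) >= near_dup_threshold:
                raise LeakageError(
                    f"row {row['id']} is a near-duplicate of eval item {ev['id']}"
                )


eval_set = [
    {"id": "e1", "text": "How do I reset my password?", "never_train_on": True}
]
clean_train = [{"id": "t1", "text": "Where is my invoice?"}]
leaky_train = [{"id": "t2", "text": "How do I reset my password now?"}]

assert_no_leakage(clean_train, eval_set)  # passes silently

try:
    assert_no_leakage(leaky_train, eval_set)
except LeakageError as err:
    print("aborting training run:", err)
```

The pairwise comparison is quadratic and only suitable for small sets; at scale the same check would need an index or hashing scheme, which this sketch does not attempt.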
F: Filtering gates (Movement IV)
- An ordered gate cascade runs: format, exact dedup, early PII/toxicity, near-duplicate, realism, label consistency, human sample.
- A real-vs-synthetic classifier reports AUC (interpreted by intended use) and its feature importances guide generation fixes.
- Multiple judges score survivors; unanimous → provisional keep, any disagreement → human; a calibration sample of auto-decided cases catches shared blind spots.
- Everything rejected is retained in an audit reject set with its reason; the reject taxonomy is reviewed.
- Rejecting confidently and frequently is normal; a low reject rate triggers suspicion, not celebration.
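A compact sketch of the cascade-plus-reject-set idea follows. It implements only three of the seven gates named above (format, exact dedup, and a crude PII check), and the gate functions, the email regex, and the record layout are assumptions for illustration. The point it shows is structural: gates run in order, the first failure stops processing, and every rejection is retained with its gate and reason.

```python
import re
from typing import Callable, Optional

# A gate returns a reject reason, or None if the example passes.
Gate = Callable[[dict], Optional[str]]


def format_gate(ex: dict) -> Optional[str]:
    if not ex.get("text", "").strip() or "label" not in ex:
        return "missing text or label"
    return None


def make_exact_dedup_gate() -> Gate:
    seen = set()

    def gate(ex: dict) -> Optional[str]:
        key = ex["text"].strip().lower()
        if key in seen:
            return "duplicate text"
        seen.add(key)
        return None

    return gate


# Deliberately crude; a real PII gate would cover far more than emails.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def pii_gate(ex: dict) -> Optional[str]:
    return "email-like string" if EMAIL.search(ex["text"]) else None


def run_cascade(examples, gates):
    """Run gates in order; keep survivors and retain rejects with reasons."""
    kept, rejects = [], []
    for ex in examples:
        for name, gate in gates:
            reason = gate(ex)
            if reason is not None:
                rejects.append({"example": ex, "gate": name, "reason": reason})
                break
        else:  # no gate rejected the example
            kept.append(ex)
    return kept, rejects


gates = [
    ("format", format_gate),
    ("exact_dedup", make_exact_dedup_gate()),
    ("pii", pii_gate),
]
candidates = [
    {"text": "My order arrived damaged.", "label": "complaint"},
    {"text": "my order arrived damaged.", "label": "complaint"},
    {"text": "Contact me at jane@example.com", "label": "contact"},
    {"text": "No label here"},
]

kept, rejects = run_cascade(candidates, gates)
reject_rate = len(rejects) / len(candidates)
print(f"kept {len(kept)}, rejected {len(rejects)} ({reject_rate:.0%})")
for r in rejects:
    print(f"  {r['gate']}: {r['reason']}")
```

Because each reject records which gate fired, grouping the reject set by gate gives a first cut of the reject taxonomy the checklist asks teams to review.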
U: Use limits (Movements I, II, VII)
- `prohibited_use` and `eval_exclusion` are enforced at load time; a job whose purpose is prohibited fails loudly.
- Datasets seeded by one tenant's data carry a tenant scope forbidding use for models serving other tenants.
- Promotion from candidate pool to training runs through an approval workflow; quarantine is the default state and high-risk datasets require a second approver.
- A tamper-evident audit log can answer, for any model, which generated data it trained on, who approved it, and why.
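Load-time enforcement of use limits might be sketched as below. The field names `prohibited_use` and `eval_exclusion` come from the checklist; everything else is an assumption of this sketch, including the `Manifest` class, the `tenant_scope` field name, the purpose strings, and the reading of `eval_exclusion` as "this dataset must not be loaded for evaluation".

```python
from dataclasses import dataclass
from typing import Optional


class UsePolicyError(RuntimeError):
    """Raised at load time so a prohibited job fails loudly."""


@dataclass(frozen=True)
class Manifest:
    dataset_id: str
    prohibited_use: frozenset = frozenset()
    eval_exclusion: bool = False
    tenant_scope: Optional[str] = None  # hypothetical field name


def check_use(
    manifest: Manifest, purpose: str, serving_tenant: Optional[str] = None
) -> None:
    """Raise UsePolicyError unless the job's purpose and tenant are allowed."""
    if purpose in manifest.prohibited_use:
        raise UsePolicyError(
            f"{manifest.dataset_id}: use '{purpose}' is prohibited"
        )
    if manifest.eval_exclusion and purpose == "eval":
        raise UsePolicyError(
            f"{manifest.dataset_id}: excluded from evaluation"
        )
    if manifest.tenant_scope is not None and serving_tenant != manifest.tenant_scope:
        raise UsePolicyError(
            f"{manifest.dataset_id}: scoped to tenant '{manifest.tenant_scope}'"
        )


manifest = Manifest(
    dataset_id="tickets-synth-v3",  # illustrative id
    prohibited_use=frozenset({"eval", "label_gold"}),
    eval_exclusion=True,
    tenant_scope="tenant-a",
)

check_use(manifest, "training", serving_tenant="tenant-a")  # allowed

for purpose, tenant in [("eval", "tenant-a"), ("training", "tenant-b")]:
    try:
        check_use(manifest, purpose, serving_tenant=tenant)
    except UsePolicyError as err:
        print("job refused:", err)
```

Raising an exception rather than logging a warning is the design choice that matters here: the job cannot proceed past the check, so the policy does not depend on anyone reading logs.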
L: Lifecycle monitoring (Movements VI, VIII)
- A synthetic-ratio cap is recorded per dataset and enforced at build time with diversity-preserving downsampling and a logged actual ratio.
- Production slices (rare classes, hard cases, under-covered channels, long inputs) are monitored against a protected real baseline; a slice degrading while aggregate holds triggers investigation.
- A production-vs-training coverage drift monitor watches for live traffic in under-covered regions.
- Datasets have an expiry; when the real distribution shifts, they are regenerated (with a diff) or retired, not silently reused.
- Recursive self-improvement loops re-inject fresh real data and filter every round, staying out of the collapse quadrant.
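The ratio-cap item reduces to a small piece of arithmetic. If the cap is p percent synthetic and the mix contains n_real real rows, the number of synthetic rows s must satisfy s / (s + n_real) ≤ p / 100, which rearranges to s ≤ p · n_real / (100 − p). The sketch below applies that bound and uses round-robin sampling across labels as a simple stand-in for diversity-preserving downsampling; the function names, the `label` key, and the round-robin strategy are assumptions, not the book's method.

```python
from collections import defaultdict
from itertools import zip_longest


def max_synthetic_rows(n_real: int, max_synthetic_pct: int) -> int:
    """Largest synthetic count keeping the mix at or below the cap.
    Integer arithmetic avoids floating-point rounding at the boundary."""
    if not 0 <= max_synthetic_pct < 100:
        raise ValueError("cap must be in [0, 100)")
    return (max_synthetic_pct * n_real) // (100 - max_synthetic_pct)


def round_robin_downsample(rows, limit, key="label"):
    """Take rows one bucket at a time so small buckets are not crowded out."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[key]].append(row)
    picked = []
    for layer in zip_longest(*(buckets[k] for k in sorted(buckets))):
        for row in layer:
            if row is not None and len(picked) < limit:
                picked.append(row)
    return picked


def build_mix(real, synthetic, max_synthetic_pct, dataset_id):
    limit = max_synthetic_rows(len(real), max_synthetic_pct)
    kept = round_robin_downsample(synthetic, limit)
    actual_ratio = len(kept) / (len(kept) + len(real)) if (kept or real) else 0.0
    # Log the actual ratio together with the dataset id.
    print(f"{dataset_id}: synthetic ratio {actual_ratio:.3f} "
          f"(cap {max_synthetic_pct / 100:.3f})")
    return real + kept, kept, actual_ratio


real_rows = [{"text": f"real {i}", "label": "billing"} for i in range(6)]
synthetic_rows = (
    [{"text": f"synth billing {i}", "label": "billing"} for i in range(8)]
    + [{"text": f"synth outage {i}", "label": "outage"} for i in range(2)]
)

mix, kept_synthetic, ratio = build_mix(
    real_rows, synthetic_rows, max_synthetic_pct=40, dataset_id="mix-demo"
)
```

With 6 real rows and a 40% cap, at most 4 synthetic rows fit. Naive truncation would take four "billing" rows; the round-robin pass keeps both rare "outage" rows instead.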
Research and Source Register
Sources grouped by chapter. A source appears under a chapter only if that chapter actually uses it to support a claim.
**Front matter / Introduction: ** synthetic; draws on the book's own argument and previews the shared spine. No external citations in the introduction prose.
Ch. 1, The Polished-Ticket Problem
- Self-Instruct
- Stanford Alpaca
- Textbooks Are All You Need
- Phi-3 Technical Report
- Model collapse (Nature)
Ch. 2, A Taxonomy of Generated Data
- Self-Instruct
- Textbooks Are All You Need
- Data contamination survey
- NIST AI RMF
- Datasheets for Datasets
Ch. 3, Generated Data Without Lineage Is Operational Debt
- NIST AI RMF
- OWASP LLM Top 10
- Data contamination survey
- Datasheets for Datasets
- Trustworthy evaluation / contamination
Ch. 4, Manifests, Data Cards, and the Diff
- Datasheets for Datasets
- Model Cards for Model Reporting
- NIST AI RMF
- Data contamination survey
- Self-Instruct
Ch. 5, Patterns That Earn Their Keep
Ch. 6, Diversity Is a Measurement, Not a Vibe
Ch. 7, The Gates After Generation
Ch. 8, Judges Disagree, and That Is the Useful Part
Ch. 9, Evaluation Is Where Synthetic Data Does the Most Damage
- Trustworthy evaluation / contamination
- Data contamination survey
- TruthfulQA
- RAGAS
- OpenAI: Evals guide
Ch. 10, Model Collapse, Honestly
Ch. 11, Mixture Ratios, Anchors, and Post-Training Monitoring
- Model collapse (Nature)
- The Curse of Recursion
- Textbooks Are All You Need
- NIST AI RMF
- Data contamination survey
Ch. 12, The Privacy Myth and What Generated Data Actually Leaks
- NIST AI RMF
- OWASP LLM Top 10: Sensitive Information Disclosure
- Extracting Training Data from Large Language Models
- The Secret Sharer
- Data contamination survey
Ch. 13, Poisoning, Approvals, and the Governance of Generation
- OWASP LLM Top 10: Data and Model Poisoning
- NIST AI RMF
- Poisoning Web-Scale Training Datasets Is Practical
- OpenAI: Safety best practices
- Data contamination survey
Ch. 14, Playbooks I: Classification, Extraction, Safety Red-Teaming
Ch. 15, Playbooks II: RAG Eval, Tone, Low-Resource, Code, Cautious Domains
Ch. 16, Operating Synthetic Data with CAREFUL
