Name: Synthetic Data, Carefully
Availability: InStock

Key Takeaways

Back Matter consolidates the checklist, glossary, and source map readers need after the main argument.

The useful artifact is not more prose; it is a way to turn back matter into an implementation review.

Treat the appendix as the operating memory for the book: terms, gates, and references in one place.

Glossary

**Anchor (real-data anchor): ** A protected set of real examples held aside, never generated over and never leaked into evaluation, included in training as the real tail. The reality that keeps a synthetic dataset honest; CAREFUL's A. Re-injecting fresh real data is what arrests model collapse.

**CAREFUL: ** The book's framework: Clear purpose, Anchored reality, Recorded provenance, Evaluation isolation, Filtering gates, Use limits, Lifecycle monitoring. The question set a mature synthetic-data pipeline can answer for any dataset it ships.

**Coverage of real: ** The fraction of real examples that have a synthetic neighbor within a similarity radius. The decisive semantic-diversity metric: it answers whether synthetic data reached the real distribution, distinct from internal spread.

**Crossing: ** When data generated for a lenient use (e.g. demo) silently migrates into a stricter one (e.g. eval or training), usually via an untagged file and an over-broad glob. The most common structural error in synthetic-data work.

**Data card: ** Human-readable companion to a manifest, whose most valuable section is "NOT good for, " naming the tempting misuses (eval, label-gold) that facts alone do not prevent.

**Fabrication: ** Generating examples plausible in isolation but reflecting the generator's defaults rather than a measured distribution. Satisfies a prompt. Appropriate for edge cases and documentation; disqualifying when used as training/eval that faces reality. Contrast simulation.

**Fingerprints: ** The generator's systematic marks on its output: preferred style, length, register, flattened tails, confident errors, blind spots. Present on every generated example until measured and removed.

Generator overlap (contamination), An evaluation set generated by a model in the system's training lineage is contaminated by construction, sharing the generator's assumptions, even with zero string overlap. Detectable only via recorded generator provenance.

**Label accuracy (on reviewed sample): ** The rate at which a generated example's prompted/assumed label matches what a human reviewer judges from the text. Catches the assumed-label failure that a balanced count hides.

**Manifest: ** Machine-readable dataset-level record of composition, generation recipe, quality gates, and use policy. Earns its keep only when enforced at load time.

**Memorization (leak): ** A generator trained on sensitive data emitting it unprompted because the model memorized it. A privacy leak originating in the model's weights, not the prompt.

**Model collapse: ** Degradation from recursive training on a model's own unfiltered, unanchored output: tails vanish first, variance shrinks, the distribution converges toward the mean. Requires recursion, no anchor, and no filtering; avoided by removing any one.

**Near-duplicate: ** A generated example semantically equivalent but not byte-identical to another. Inflates apparent dataset size and causes train/eval leakage that exact-match checks miss.

**Prompt echo (leak): ** Real PII fed into a generation prompt as seed or context and reproduced (verbatim or paraphrased) in the output. The most common and most preventable synthetic-data privacy leak.

**Provenance: ** The recorded origin of a generated example: generator and version, prompt hash, seed, sampling params, post-processing, reviewer. Without it, a synthetic row cannot be excluded from eval, reproduced, deleted, or attributed.

**Realism AUC: ** The cross-validated AUC of a classifier trained to distinguish real from synthetic examples. ~0.5 means indistinguishable (good for training realism); ~1.0 means trivially fake. The right target depends on intended use, red-team data wants high separability.

**Reject set: ** Everything the gates threw out, retained with its reason. The most undervalued artifact in a pipeline: it diagnoses what the generator over-produces and surfaces alarms (e.g. leaked PII).

**Simulation: ** Generating examples plausibly drawn from a measured real distribution, including its mess and tail. Matches a distribution you have characterized. The mode required for training/eval data that faces reality. Contrast fabrication.

**Synthetic ratio (max): ** The recorded, enforced cap on how much of a training mix may be synthetic, set per use case. Diversity-preserving downsampling enforces it; the actual ratio is logged with the dataset id.

**Verifier: ** The mechanism that checks a generated example's correctness (execution for code, grounding for extraction, native speaker for low-resource, domain expert for cautious domains). Its cost and reliability set how heavily a task can safely lean on synthetic data.

Implementation Checklist

A team's synthetic-data practice is approaching production-ready when it can answer yes, with evidence to each item. Grouped by CAREFUL, which maps to the lifecycle.

C: Clear purpose (Movements I, VII)

Every generation effort has a written purpose worksheet: the named gap, why real data won't suffice, the success criterion measured against real data.
Each artifact is tagged at birth with its type and intended use; the type×use risk cell sets gate strength.
Legal/policy review happens at the purpose stage, not as a final sign-off, including seed-data consent and vendor terms.

A: Anchored reality (Movements I, III, VI, VIII)

A real anchor is held aside, flagged never_train_on and eval_exclusion, and included in training as the real tail.
The real distribution is characterized (length, tone, channel, balance, difficulty) and generation simulates toward it rather than fabricating toward defaults.
Diversity is measured against the real anchor on all five axes (lexical, semantic/coverage-of-real, label, source, difficulty), never asserted.
For each task, the verifier is named and its cost/reliability assessed before deciding how heavily to lean synthetic.

R: Recorded provenance (Movement II)

Producing an example and recording its lineage are the same operation; a synthetic row cannot be inserted without naming its generator and prompt (DB constraint).
Prompt templates that produce kept data live in a versioned store, not a notebook; their hashes are in the manifest.
Every dataset ships with a manifest and a data card whose "NOT good for" section names the tempting misuses.
A dataset's generation recipe is versioned source: changes produce a reviewable diff and a re-versioned, reproducible dataset.

E: Evaluation isolation (Movement V)

Eval data carries never_train_on; the training loader aborts (not warns) if any eval/excluded row appears.
Train/eval leakage is scanned both exact and semantic (near-duplicate) before every training run.
A generator-provenance check confirms no eval item came from the system's training lineage.
Eval answers are anchored in humans, extractive spans, or executable verifiers, never generated gold; questions with no trustworthy gold are discarded.
Synthetic evals run alongside a real/human-labeled eval and divergence between them is treated as an early warning.

F: Filtering gates (Movement IV)

An ordered gate cascade runs: format, exact dedup, early PII/toxicity, near-duplicate, realism, label consistency, human sample.
A real-vs-synthetic classifier reports AUC (interpreted by intended use) and its feature importances guide generation fixes.
Multiple judges score survivors; unanimous → provisional keep, any disagreement → human; a calibration sample of auto-decided cases catches shared blind spots.
Everything rejected is retained in an audit reject set with its reason; the reject taxonomy is reviewed.
Rejecting confidently and frequently is normal; a low reject rate triggers suspicion, not celebration.

U: Use limits (Movements I, II, VII)

prohibited_use and eval_exclusion are enforced at load time; a job whose purpose is prohibited fails loudly.
Datasets seeded by one tenant's data carry a tenant scope forbidding use for models serving other tenants.
Promotion from candidate pool to training runs through an approval workflow; quarantine is the default state and high-risk datasets require a second approver.
A tamper-evident audit log can answer, for any model, which generated data it trained on, who approved it, and why.

L: Lifecycle monitoring (Movement VI, VIII)

A synthetic-ratio cap is recorded per dataset and enforced at build time with diversity-preserving downsampling and a logged actual ratio.
Production slices (rare classes, hard cases, under-covered channels, long inputs) are monitored against a protected real baseline; a slice degrading while aggregate holds triggers investigation.
A production-vs-training coverage drift monitor watches for live traffic in under-covered regions.
Datasets have an expiry; when the real distribution shifts, they are regenerated (with a diff) or retired, not silently reused.
Recursive self-improvement loops re-inject fresh real data and filter every round, staying out of the collapse quadrant.

Research and Source Register

Sources grouped by chapter. A source appears under a chapter only if that chapter actually uses it to support a claim.

**Front matter / Introduction: ** synthetic; draws on the book's own argument and previews the shared spine. No external citations in the introduction prose.

Ch. 1, The Polished-Ticket Problem

Ch. 2, A Taxonomy of Generated Data

Ch. 3, Generated Data Without Lineage Is Operational Debt

Ch. 4, Manifests, Data Cards, and the Diff

Ch. 5, Patterns That Earn Their Keep

Ch. 6, Diversity Is a Measurement, Not a Vibe

Ch. 7, The Gates After Generation

Ch. 8, Judges Disagree, and That Is the Useful Part

Ch. 9, Evaluation Is Where Synthetic Data Does the Most Damage

Ch. 10, Model Collapse, Honestly

Ch. 11, Mixture Ratios, Anchors, and Post-Training Monitoring

Ch. 12, The Privacy Myth and What Generated Data Actually Leaks

Ch. 13, Poisoning, Approvals, and the Governance of Generation

Ch. 14: Playbooks I: Classification, Extraction, Safety Red-Teaming

Ch. 15: Playbooks II: RAG Eval, Tone, Low-Resource, Code, Cautious Domains

Ch. 16, Operating Synthetic Data with CAREFUL

Appendix A: Back Matter