Alpesh Nakrani
2026 / Free online book · Technical Deep Dives

Synthetic Data, Carefully

When Generated Data Helps, and When It Teaches Models Their Own Mistakes

Access: Free
Chapters: 16
Read time: 156 min

Synthetic data scales until the model starts learning its own mistakes. Where the technique helps and where it quietly rots.

This edition is free to read onsite. Each chapter has its own URL, so readers can bookmark, share, and return to the exact section they need.

Table of contents
FM Front Matter: Synthetic Data, Carefully (5 min)
When Generated Data Helps, and When It Teaches Models Their Own Mistakes

INT Introduction: The Sentence That Should Make You Nervous (7 min)
There is a sentence that gets said in planning meetings, slack threads, and grant applications, and it should make you nervous every time you hear it:

01 The Polished-Ticket Problem (9 min)
**Working claim:** Generated data can fix a small, well-defined coverage gap, but it cannot stand in for the real distribution.

02 A Taxonomy of Generated Data: and Why We Reach for It (9 min)
**Working claim:** "Synthetic data" is not one thing.

03 Generated Data Without Lineage Is Operational Debt (8 min)
**Working claim:** A generated example with no recorded provenance is a liability disguised as an asset.

04 Manifests, Data Cards, and the Diff That Regenerates a Dataset (8 min)
This chapter turns manifests, data cards, and the diff that regenerates a dataset into a concrete operating problem for the synthetic data book.

05 Patterns That Earn Their Keep (9 min)
**Working claim:** Generation helps when it fills a *named, bounded* gap with examples a model can produce reliably and a human can verify cheaply.

06 Diversity Is a Measurement, Not a Vibe (8 min)
**Working claim:** "We generated diverse examples" is a claim, and like any claim it is either measured or imaginary. Volume is not coverage; ten thousand examples can occupy the same small region you already had.

07 The Gates After Generation (8 min)
**Working claim:** Synthetic data is only as good as the gates that run after generation. Generation is the cheap, optimistic part; filtering is the expensive, skeptical part, and it is where quality is actually decided.

08 Judges Disagree, and That Is the Useful Part (8 min)
**Working claim:** Using a model to judge generated data is convenient and necessary at scale, but a single LLM judge is a single fingerprinted opinion wearing the costume of measurement.

09 Evaluation Is Where Synthetic Data Does the Most Damage (9 min)
**Working claim:** Synthetic data is most dangerous when it becomes your ruler. A generated training example that is wrong costs you some signal; a generated evaluation example that is wrong destroys your ability to *detect* that anything is wrong.

10 Model Collapse, Honestly (8 min)
**Working claim:** Model collapse is real, mechanistically specific, and frequently misquoted.

11 Mixture Ratios, Anchors, and Post-Training Monitoring (8 min)
**Working claim:** The anchor that arrests collapse is not a one-time decision; it is a standing ratio you set per use case and a drift you watch for after the model ships.

12 The Privacy Myth and What Generated Data Actually Leaks (8 min)
**Working claim:** "Synthetic data is private" is the most over-trusted sentence in this field.

13 Poisoning, Approvals, and the Governance of Generation (7 min)
**Working claim:** A synthetic-data pipeline is an attack surface and a control surface at once.

14 Playbooks I: Classification, Extraction, and Safety Red-Teaming (8 min)
**Working claim:** The general principles become useful only when they collapse into specific recipes.

15 Playbooks II: RAG Evaluation, Fine-Tuning Tone, Low-Resource, Code, and Cautious Domains (11 min)
**Working claim:** The careful-use principles scale across very different tasks, but the safe synthetic ratio, the right verifier, and the human-review level swing enormously between them.

16 Operating Synthetic Data with CAREFUL (9 min)
**Working claim:** Everything in this book reduces to a single operating discipline: a synthetic dataset is a data *product* with a lifecycle, and a mature team can answer the CAREFUL questions for any dataset it ships, from purpose through retirement.

A Appendix A: Back Matter (9 min)
Glossary, implementation checklist, and source register for the book.