Name: Synthetic Data, Carefully

> **Working claim:** Generation helps when it fills a *named, bounded* gap with examples a model can produce reliably and a human can verify cheaply.

Key Takeaways

Patterns That Earn Their Keep treats synthetic data as a governed artifact, not a free replacement for reality.

Safe use depends on the cell: training, evaluation, red-team, demo, privacy, and lineage each need different gates.

Generated data earns its keep only when a real anchor, verifier, and owner survive after generation.

**Working claim:** Generation helps when it fills a named, bounded gap with examples a model can produce reliably and a human can verify cheaply. It hurts when it is asked to manufacture the thing that needed verifying in the first place. The most dangerous case is synthetic rationales, which read as proof but are merely fluent.

The constructive turn

The first four chapters were mostly about what goes wrong. This chapter is the opposite: a catalog of generation patterns that genuinely work, with the conditions that make them work. The book is not anti-synthetic-data, and a manual that only catalogs failures teaches fear, not skill. The skill is knowing which generation tasks are safe and why, so you can reach for the safe ones confidently and treat the dangerous ones with the suspicion they earn.

A single principle sorts the patterns. **Generation is safe to the degree that verification is cheap and the gap is bounded.** If you can cheaply check whether a generated example is correct, and the thing you are generating is a known, bounded variation rather than novel ground truth, generation is a force multiplier. If checking correctness is as hard as producing the example, or the gap is open-ended, generation is laundering uncertainty. Hold that principle and the catalog organizes itself.

Figure: The verification axis separates safer synthetic variants from dangerous generated rationales, where fluent prose can hide an unchecked answer.

The patterns, by how much verification they need

| Pattern | What it generates | Verification cost | Safe because / dangerous when |
| --- | --- | --- | --- |
| Paraphrase expansion | Reworded versions of real examples | Cheap (meaning-preservation check) | Safe: anchored to a real example, label preserved. Dangerous when paraphrases drift in meaning and the label silently no longer applies. |
| Format variation | Same content, new surface form (JSON↔text, date formats, casing) | Cheap (deterministic) | Safe: makes parsers/classifiers robust. Rarely dangerous. |
| Minority-class augmentation | More examples of a rare class | Medium (a human can judge class membership) | Safe when shape is preserved; dangerous when it flattens the rare class into a central caricature (Ch. 1). |
| Edge-case generation | Deliberately unusual inputs | Medium | Safe: fabrication is appropriate here, because you want the weird case. Dangerous when edges are weird in ways the system already handles. |
| Intent / coverage filling | Examples for under-covered intents or slots | Medium | Safe with a coverage map (Ch. 6); dangerous as undirected "more data." |
| Synthetic conversations | Multi-turn dialogues for tool-use/agent flows | Medium-High | Safe for structure and tool-call practice; dangerous because synthetic users are too cooperative and tool errors are under-represented. |
| Synthetic documents | Long-form docs for extraction/RAG tests | High | Safe for format coverage; dangerous when the answer is too findable and real document chaos is absent. |
| Teacher distillation | A stronger model labels/answers for a smaller one | High | Powerful (Alpaca) but inherits the teacher's errors wholesale; verify against a real anchor. |
| Synthetic rationales | Step-by-step reasoning/explanations | Very high | Most dangerous. Fluent reasoning reads as verified. Train on it only after the answer (not the prose) is independently checked. |

Read top to bottom and the risk rises with verification cost, exactly as the principle predicts. The safe end is mechanical and anchored; the dangerous end asks the generator to supply the thing you cannot cheaply check.
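The ordering in the table can be written down as data, so that a pipeline refuses to treat a high-cost pattern like a cheap one. The sketch below is illustrative only: the verification costs are copied from the table, but the `Cost` scale, the `CATALOG` keys, and the gate names returned by `required_gate` are assumptions made for this example, not terms defined elsewhere in the book.

```python
from enum import IntEnum


class Cost(IntEnum):
    """Verification cost tiers, ordered as in the table (scale is an assumption)."""
    CHEAP = 1
    MEDIUM = 2
    MEDIUM_HIGH = 3
    HIGH = 4
    VERY_HIGH = 5


# Verification cost per pattern, copied from the table above.
CATALOG = {
    "paraphrase_expansion": Cost.CHEAP,
    "format_variation": Cost.CHEAP,
    "minority_class_augmentation": Cost.MEDIUM,
    "edge_case_generation": Cost.MEDIUM,
    "intent_coverage_filling": Cost.MEDIUM,
    "synthetic_conversations": Cost.MEDIUM_HIGH,
    "synthetic_documents": Cost.HIGH,
    "teacher_distillation": Cost.HIGH,
    "synthetic_rationales": Cost.VERY_HIGH,
}


def required_gate(pattern: str) -> str:
    """Return the minimum gate for a pattern. Gate names are hypothetical."""
    cost = CATALOG[pattern]  # KeyError for an unnamed pattern is deliberate
    if cost <= Cost.CHEAP:
        return "automated_check"
    if cost <= Cost.MEDIUM_HIGH:
        return "human_sample_review"
    if cost <= Cost.HIGH:
        return "real_anchor_comparison"
    return "independent_answer_verification"


def by_rising_cost() -> list[str]:
    """Patterns sorted from the safe end to the dangerous end."""
    return sorted(CATALOG, key=lambda name: CATALOG[name])
```

Raising a `KeyError` for a pattern that is not in the catalog is a design choice in this sketch: a generation task that cannot be named is, by the principle above, not a bounded gap.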

A controlled-variant pipeline

The safe patterns share a shape: take a real seed, produce bounded variations, filter to a human-review queue, and keep what survives. Here is that pipeline as a generation→filter→dedupe→review flow. The point is not the API call; it is the gates between the steps.

```python
from dataclasses import dataclass
from hashlib import sha256


@dataclass
class Candidate:
    text: str
    label: str
    seed_id: str
    template_id: str
    prompt_hash: str
    status: str = "candidate"  # candidate -> kept | dropped | needs_review
    drop_reason: str | None = None


def generate_variants(seed, template, n, sampler) -> list[Candidate]:
    rendered = template.render(seed=seed)
    prompt_hash = sha256(rendered.encode()).hexdigest()  # same prompt for all n
    out = []
    for _ in range(n):
        text = sampler.generate(rendered)  # the only model call
        out.append(Candidate(
            text=text, label=seed.label, seed_id=seed.id,
            template_id=template.id, prompt_hash=prompt_hash,
        ))
    return out


def filter_pipeline(cands, *, real_corpus, near_dup, label_check) -> list[Candidate]:
    # real_corpus is not read directly here; near_dup is assumed to have been
    # built over it (and over the already-kept set).
    seen = set()
    for c in cands:
        # 1. exact dedup
        h = sha256(c.text.strip().lower().encode()).hexdigest()
        if h in seen:
            c.status, c.drop_reason = "dropped", "exact_duplicate"
            continue
        seen.add(h)
        # 2. near-duplicate vs already-kept and vs real corpus (Ch.7)
        if near_dup.too_similar(c.text):
            c.status, c.drop_reason = "dropped", "near_duplicate"
            continue
        # 3. label sanity: does the text still match the claimed label?
        if not label_check.consistent(c.text, c.label):
            c.status, c.drop_reason = "needs_review", "label_uncertain"
            continue
        # 4. survivors still go to a human SAMPLE, not straight to train
        c.status = "needs_review" if near_dup.is_novel_region(c.text) else "kept"
    return cands

# Everything dropped is KEPT in an audit table, not deleted (Ch.7).
```

Three design choices matter more than the code. First, the label travels from the seed, but a label_check still verifies it, because a paraphrase can drift until the label no longer fits, and an assumed label is the Chapter 1 failure. Second, survivors in novel regions of the space go to a human, not straight to training; the generator is most likely to fabricate exactly where it has the least real grounding. Third, dropped candidates are retained for audit (the reject set of Chapter 7), so you can later ask what the generator over-produced and why.
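The pipeline leaves `near_dup` abstract. As one possible way to fill it in, here is a minimal sketch that uses character-shingle Jaccard similarity. The class name `ShingleNearDup`, the `remember` method, the shingle size, and both thresholds are assumptions for illustration; only the `too_similar` and `is_novel_region` method names come from the pipeline above. A production system would more likely use MinHash or embeddings, and would tune the thresholds on real data.

```python
class ShingleNearDup:
    """Near-duplicate and novelty checks via character-shingle Jaccard overlap.

    Illustrative only: k and both thresholds are untuned assumptions.
    """

    def __init__(self, real_corpus, k=5, dup_threshold=0.8, novel_threshold=0.2):
        self.k = k
        self.dup_threshold = dup_threshold
        self.novel_threshold = novel_threshold
        self.real = [self._shingles(t) for t in real_corpus]
        self.kept = []  # shingle sets of synthetic candidates already kept

    def _shingles(self, text):
        norm = " ".join(text.lower().split())  # lowercase, collapse whitespace
        if len(norm) <= self.k:
            return {norm}
        return {norm[i:i + self.k] for i in range(len(norm) - self.k + 1)}

    @staticmethod
    def _jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    def _max_sim(self, text, pools):
        s = self._shingles(text)
        sims = (self._jaccard(s, other) for pool in pools for other in pool)
        return max(sims, default=0.0)

    def too_similar(self, text):
        """True if text nearly repeats a real example or a kept candidate."""
        return self._max_sim(text, [self.real, self.kept]) >= self.dup_threshold

    def is_novel_region(self, text):
        """True if text is far from every real example, so a human should look."""
        return self._max_sim(text, [self.real]) < self.novel_threshold

    def remember(self, text):
        """Hypothetical hook: register a kept candidate for later comparisons."""
        self.kept.append(self._shingles(text))
```

Note that novelty is measured against the real corpus only. Measuring it against kept synthetic text as well would let the generator vouch for its own output, which is the failure the human gate exists to prevent.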

The pattern that built famous models, and its actual recipe

The strongest evidence that generation works is the instruction-tuning and small-model-quality lineage. It is worth stating the recipe precisely, because the lesson is the opposite of "just generate."

Self-Instruct started from 175 human-written seed tasks and used a model to bootstrap new instructions and instances. The headline is "the model generated its own training data." The fine print is the filtering: new instructions were discarded if they were too similar to existing ones (a diversity gate), if the model could not produce a valid instance for them, or if they were duplicates or invalid. The generation was cheap; the selection was the contribution. Stanford Alpaca applied this to produce 52,000 demonstrations from a stronger teacher at low cost, explicitly a teacher-distillation pattern, and the authors were candid that the result inherited the teacher's limitations and was a research artifact, not a production-grade dataset.
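The diversity gate described above can be sketched in a few lines. This is not the Self-Instruct implementation: it is a simplified, whitespace-tokenized ROUGE-L-style score built on longest common subsequence, and the function names and the 0.7 default threshold are assumptions chosen for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f(candidate, reference):
    """Simplified ROUGE-L F-score over lowercase whitespace tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def diversity_gate(new_instructions, pool, threshold=0.7):
    """Split new instructions into (accepted, rejected).

    An instruction is rejected when it overlaps too much with anything already
    in the pool, including instructions accepted earlier in the same batch.
    """
    pool = list(pool)  # copy so the caller's pool is not mutated
    accepted, rejected = [], []
    for instr in new_instructions:
        if any(rouge_l_f(instr, seen) >= threshold for seen in pool):
            rejected.append(instr)
        else:
            accepted.append(instr)
            pool.append(instr)
    return accepted, rejected
```

The rejected list is returned rather than thrown away for the same reason the earlier pipeline keeps its drops: what the generator over-produces is itself diagnostic.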

Textbooks Are All You Need and the Phi-3 report pushed the quality-over-quantity thesis hardest: small models trained on filtered web data plus carefully generated "textbook-quality" synthetic data matched or beat much larger models. This is the most pro-synthetic result in the field, and its actual recipe is curation. The synthetic data was generated to a high pedagogical bar and combined with aggressive filtering of what to exclude. The win was not raw generation volume; it was treating data quality as the central engineering problem.

The contrast that completes the picture is InstructGPT, which anchored alignment in human demonstrations and human preference judgments. The human signal is the anchor (CAREFUL's A) that the synthetic methods lean on indirectly: Alpaca distilled from a model that was itself aligned with human feedback. None of these methods generated in a vacuum. Each one had reality somewhere in the loop: human seeds, human preferences, or a teacher that human feedback had shaped.

The dangerous pattern, in detail: synthetic rationales

The single most seductive and most dangerous pattern deserves its own treatment, because it is where careful teams still get burned. A synthetic rationale is a generated chain of reasoning ("let's solve this step by step") attached to an answer. Teams generate rationales to teach models to reason, and the practice has real value. The hazard is specific: **a rationale's fluency is uncorrelated with its correctness, but humans and downstream training treat fluency as evidence of correctness.**

A generator can produce a beautifully structured, confidently worded derivation that arrives at a wrong answer through a plausible-looking but invalid step. If you train on it, you are not teaching reasoning; you are teaching the appearance of reasoning, decoupled from getting the answer right. The model learns to emit confident derivations, including for cases where it is wrong, because that is the pattern it was rewarded on.

The defense is to never let the rationale be its own warrant. Concretely:

```python
def accept_rationale(item) -> bool:
    # verify_numeric and run_unit_tests are project-specific helpers,
    # assumed to be defined elsewhere.
    # The ANSWER must be checkable independently of the rationale's prose.
    if item.task_type == "math":
        return verify_numeric(item.answer, item.gold)  # exact check
    if item.task_type == "code":
        return run_unit_tests(item.code).all_pass  # execution check
    if item.task_type == "extraction":
        return item.answer in item.source_document  # grounding check
    # If the answer can't be checked independently, the rationale is unverifiable.
    # Do NOT train on it as if correct; route to human or discard.
    return False
```

The rule is simple to state and constantly violated: **train on a rationale only when the final answer is independently verifiable, and verify the answer, not the prose.** For math, check the number. For code, run the tests. For extraction, confirm the span is in the source. When the answer cannot be checked independently of the reasoning that produced it (open-ended explanation, judgment, "why" questions), a synthetic rationale is exactly the artifact this book warns about: a confident-looking sample masquerading as truth. That is the type-7 "Forbidden as gold" cell of Chapter 2, made concrete.
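To make "verify the answer, not the prose" concrete, here is a self-contained sketch of two of the checks plus a three-way router. Everything named here is an assumption for illustration: the `RationaleItem` fields, the helper names, and the `train` / `discard` / `human_review` outcomes. The code case is left out on purpose, since running generated code safely needs a sandbox that is beyond a short example.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Optional


@dataclass
class RationaleItem:
    """Hypothetical record for one generated rationale and its answer."""
    task_type: str
    rationale: str  # carried along, but never consulted by any check below
    answer: str
    gold: Optional[str] = None
    source_document: Optional[str] = None


def verify_numeric(answer: str, gold: str) -> bool:
    """Exact numeric equality, so '0.50' and '1/2' agree without float error."""
    try:
        return Fraction(answer.strip()) == Fraction(gold.strip())
    except (ValueError, ZeroDivisionError):
        return False  # an unparseable answer is treated as unverified


def span_in_source(answer: str, source: str) -> bool:
    """Grounding check: the answer must appear in the source text."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    needle = norm(answer)
    return bool(needle) and needle in norm(source)


def route_rationale(item: RationaleItem) -> str:
    """Return 'train', 'discard', or 'human_review'."""
    if item.task_type == "math" and item.gold is not None:
        return "train" if verify_numeric(item.answer, item.gold) else "discard"
    if item.task_type == "extraction" and item.source_document is not None:
        ok = span_in_source(item.answer, item.source_document)
        return "train" if ok else "discard"
    # No independent check is available: the prose cannot vouch for itself.
    return "human_review"
```

The `rationale` field is deliberately unused by every check. A fluent derivation with a wrong answer is discarded, and a terse one with a right answer is kept.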

A worked example: edge cases for a returns-policy classifier

To ground the safe end, take a real, small task. A retailer's classifier decides whether a return request is within_policy, outside_policy, or needs_human. The real data is dominated by clean cases; the misclassifications cluster on edge cases the logs barely contain. This is fabrication's home turf, because you want the unusual case and the label is checkable by a human against a written policy.

Useful generated edges, each anchored to a real policy clause:

"I bought this 31 days ago, your policy says 30, but it was a gift and I didn't open it until yesterday." → needs_human (policy is silent on gifts; this is the genuine ambiguity).
"Returning a final-sale clearance item because it's defective." → needs_human (final-sale vs. defective conflict, a real policy collision).
"I want to return it but I never received it." → outside_policy for returns, but actually a different intent (this tests whether the classifier confuses non-delivery with returns).

Each edge is generated to probe a named boundary in the policy, a human can verify the label against the written policy in seconds, and the cases are exactly the ones reality under-supplies. That is the recipe: bounded gap, cheap verification, anchored to a real artifact (the policy). The contrast with Chapter 1 is sharp: there, the generator was asked to manufacture the whole distribution; here, it is asked to manufacture named hard cases a human can check.
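One way to keep these edges honest is to store each one with the policy clause it probes, and to refuse any case that lacks an anchor. The sketch below uses the three examples above. The clause identifiers, the field names, and the `validate` and `misses` helpers are hypothetical, introduced only for this illustration.

```python
from dataclasses import dataclass

LABELS = {"within_policy", "outside_policy", "needs_human"}


@dataclass(frozen=True)
class EdgeCase:
    text: str
    label: str
    policy_clause: str  # hypothetical identifier for the written clause probed
    probes: str         # the named boundary this case is meant to test


EDGE_CASES = [
    EdgeCase(
        "I bought this 31 days ago, your policy says 30, but it was a gift "
        "and I didn't open it until yesterday.",
        "needs_human", "return_window_30_days", "policy is silent on gifts",
    ),
    EdgeCase(
        "Returning a final-sale clearance item because it's defective.",
        "needs_human", "final_sale_exclusion", "final-sale vs. defective conflict",
    ),
    EdgeCase(
        "I want to return it but I never received it.",
        "outside_policy", "returns_scope", "non-delivery confused with returns",
    ),
]


def validate(cases):
    """Return (index, problem) pairs; an empty list means every case is anchored."""
    problems = []
    for i, case in enumerate(cases):
        if case.label not in LABELS:
            problems.append((i, "unknown_label"))
        if not case.policy_clause.strip():
            problems.append((i, "no_policy_anchor"))
    return problems


def misses(cases, classify):
    """Edge cases the classifier gets wrong; classify maps text to a label."""
    return [case for case in cases if classify(case.text) != case.label]
```

Calling `misses(EDGE_CASES, classify)` with the production classifier gives the list a human reviews against the written policy. The `probes` field says which boundary each miss exposes, so the fix can target the clause and not the individual sentence.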

Chapter summary

Generation helps when verification is cheap and the gap is bounded, and hurts when it is asked to manufacture the very thing that needed verifying. Ordered by rising verification cost, the safe-to-dangerous spectrum runs from paraphrase and format variation (mechanical, anchored), through minority-class augmentation, edge cases, intent filling, synthetic conversations, and synthetic documents, to teacher distillation and, most dangerous, synthetic rationales. The safe patterns share a pipeline: real seed → bounded variants → exact and near-duplicate dedup → label verification → human review of novel-region survivors → keep, with everything dropped retained for audit. The famous successes (Self-Instruct, Alpaca, the Textbooks/Phi line) won not through raw generation but through aggressive filtering, curation, and a human or human-shaped anchor somewhere in the loop; InstructGPT, by contrast, anchored directly in human demonstrations and preferences. Synthetic rationales are the seductive trap: fluency is uncorrelated with correctness but is treated as evidence of it. Train on a rationale only when the final answer is independently verifiable (check the number, run the tests, confirm the span), and verify the answer, not the prose. A returns-policy edge-case example shows the safe recipe concretely: generate named hard cases against a written policy a human can check in seconds.