AN Alpesh Nakrani
BlogBooksPraiseAbout Work with me →
Book overview
Chapter 2 / Technical Deep Dives

A Taxonomy of Generated Data: and Why We Reach for It

> **Working claim: ** "Synthetic data" is not one thing.

Key Takeaways

  • A Taxonomy of Generated Data: and Why We Reach for It treats synthetic data as a governed artifact, not a free replacement for reality.
  • The safe use depends on the cell: training, evaluation, red-team, demo, privacy, or lineage all need different gates.
  • Generated data earns its keep only when a real anchor, verifier, and owner survive after generation.

**Working claim: ** "Synthetic data" is not one thing. The risk of a generated example depends almost entirely on what it will be used for, and the same generated row can be harmless as a demo, useful as training augmentation, and disqualifying as an evaluation case. You cannot reason about synthetic data safely until you classify both what you generated and what you will do with it.

Why a taxonomy is not pedantry

The previous chapter's team made a mistake that a taxonomy would have prevented: they let one act, "generate tickets", produce data that they then used for two incompatible purposes, training and evaluation, without noticing the purposes had opposite requirements. This is the most common structural error in synthetic-data work, and it is not a knowledge gap. The engineers knew the difference between a training set and a test set. They simply did not have a habit of tagging each generated artifact with what it was and what it was for, so the categories blurred in a shared folder of .jsonl files.

A taxonomy gives you those tags. It is the difference between a warehouse with labeled shelves and a pile in the corner. Two questions organize everything in this book: what did you generate (the artifact type) and what will you do with it (the use). The artifact type tells you the likely fingerprints. The use tells you which fingerprints are disqualifying. Cross the two and you get a risk profile, and the risk profile tells you how hard your gates need to be.

Whiteboard-style technical sketch infographic for A Taxonomy of Generated Data, and Why We Reach for It.
The matrix shows how generated artifact type and intended use combine into risk, with warehouse shelves making silent crossings visible.

Axis one: what did you generate?

Generated data is not monolithic, and the type strongly predicts where it goes wrong.

Artifact typeWhat it isWhere the fingerprints hide
Synthetic examplesNew inputs in your domain (tickets, queries, images, rows)Style, length, vocabulary, class balance, flattened toward the generator's defaults
Synthetic labelsA model annotates real or synthetic inputsThe generator's systematic errors become "ground truth"; rare-but-correct labels get smoothed away
Synthetic edge casesDeliberately unusual inputs to stress a systemOften the only good use of fabrication, but easy to make unrealistic edge cases that test nothing real
Synthetic conversationsMulti-turn dialogues, often for tool-use or agent flowsTurn structure too clean; users too cooperative; tool errors under-represented
Synthetic documentsLong-form content for extraction or RAG testsFormat too regular; the answer is too findable; real document chaos is absent
Synthetic rationalesStep-by-step reasoning or explanationsThe reasoning can be fluent and wrong; plausibility is not verification, the most dangerous type
Synthetic evalsGenerated questions and/or expected answersEncodes the generator's assumptions about what is askable and answerable; contamination risk is highest

Two rows deserve a flag now and a chapter later. Synthetic rationales are the most dangerous artifact type because a generated chain of reasoning reads as if it were verified when it is merely fluent, a model can produce a confident, well-structured derivation of a wrong answer, and if you train on it, you are teaching the model to imitate confident wrongness. Self-Instruct and the instruction-tuning literature lean heavily on generated rationales precisely because they are useful, and the Self-Instruct filtering is in large part a defense against this. We return to rationales in Chapter 5. Synthetic evals carry the highest contamination risk, surveyed in depth in the data-contamination literature; Chapter 9 is devoted to them.

Axis two: what will you do with it?

The same generated example has a wildly different risk depending on its use. This is the axis teams skip, because the use often gets decided after generation, when the data is already sitting in a folder and the path of least resistance is "just use it everywhere."

UseTolerance for generator fingerprintsThe disqualifying defect
Training dataModerate, diversity matters; some style bleed is survivable if anchoredStyle monoculture; flattened tails; assumed labels
Evaluation dataNear zero, the eval must be a clean rulerAny leakage from the generator that the system was trained on; shared style
Safety / red-team dataGeneration is often the point, you want adversarial inputsCoverage gaps the generator can't imagine; sanitized attacks that don't probe real failures
Demo dataHigh, nobody trains or evaluates on itAccidentally leaking it into training or eval; using fake names that look real
Synthetic documentationHigh, it illustrates, it doesn't measureBeing mistaken for ground truth later; carrying into training

Read the table as a compatibility matrix. The most expensive errors are crossings: data generated for one use, repurposed for another with a stricter tolerance. Demo data that drifts into the training set. Training augmentation that gets sampled into an eval. Red-team prompts that leak into a benchmark and inflate safety scores. Each crossing is a quiet act, usually a copy-paste or a glob pattern that matched too much, and each one converts a harmless artifact into a contaminating one.

The risk matrix: type × use

Cross the two axes and you get an operational risk grid. The cell value is how hard your gates must be before the data is allowed into that use.

Type \ UseTrainingEvaluationRed-teamDemo
ExamplesMedium (dedup, diversity, label check)High (isolation proof required)Low-MedLow
LabelsHigh (verify against sample of truth)Forbidden as gold (use human labels)n/aLow
Edge casesMediumHigh (don't let edges leak to train)Low (this is their home)Low
ConversationsMedium-HighHighMediumLow
DocumentsMediumHighMediumLow
RationalesHigh (verify or don't train)Forbidden as goldMediumLow
Evalsn/aHighest (human acceptance gate)MediumLow

The two "Forbidden" cells are the load-bearing ones. A model's labels must never be your evaluation gold, because then you are scoring a model against another model's opinion and calling it accuracy. A model's rationales must never be your evaluation gold for the same reason, with the added hazard that fluent reasoning hides its wrongness. You can use generated labels and rationales in training if you verify them against a real anchor, which is the subject of Chapters 5 and 8. You cannot use them as the ruler you measure by.

This grid is also the answer to the question "how careful do I have to be?" The careful answer is: it depends on the cell. A synthetic demo conversation needs almost no gates. A synthetic evaluation set needs the strictest gates in the entire book. Treating all synthetic data with one policy, either "always fine" or "always dangerous", guarantees you over-spend on the harmless and under-protect against the catastrophic.

Why teams actually reach for synthetic data

The taxonomy explains what you have; this section explains why you wanted it, because the motivation predicts the trap. Teams reach for generation for a handful of recurring reasons, and each has a characteristic failure.

**Data scarcity. ** "We don't have enough examples." This is the support team's motivation and the most common one. The trap is mistaking volume for coverage, generating ten thousand examples that all sit in the same dense region you already had, while the tail you actually needed stays empty. Volume is cheap; coverage is the hard part, and Chapter 6 is about measuring it.

**Class imbalance. ** "We have plenty of data but the rare class is rare." Generation to balance minority classes is legitimate and often genuinely helps. The trap is that the rare class is rare for a reason, it lives in an unusual part of the distribution, and a generator asked to produce more of it tends to produce a tidy, central version of an inherently weird thing. You balance the count and flatten the shape.

**Privacy. ** "We can't use the real data, so we'll generate a synthetic stand-in." This is the most over-trusted motivation. Synthetic data is not automatically private, a generator trained or prompted on sensitive data can reproduce it, and "synthetic" is not a magic word that strips liability. Chapter 12 is the whole argument; for now, treat privacy as a reason to be more careful, not a reason to relax.

**Edge cases and testing. ** "We need inputs our logs never captured." This is generation's best use, because fabrication is appropriate here, you want the unusual case, and you are testing behavior, not measuring a distribution. The trap is fabricating edges that are weird in a way the system already handles, while missing the edges that actually break it.

**Cost and speed of labeling. ** "Human labeling is slow and expensive; let a model label." Legitimate as assistance, a model proposing labels for humans to confirm, and dangerous as replacement. The data-contamination survey and the broader evaluation literature keep finding the same thing: model-generated labels carry the model's systematic biases, and if those become your ground truth, you have laundered the model's errors into your metrics.

The honest version of all five motivations is the same sentence: *generation can fill a specific, named gap, and it cannot replace the reality it is standing in for. * The successes, Self-Instruct, the Textbooks/Phi line, all generated toward a named gap with heavy filtering. The failures generated toward "more data" and trained on the pile.

A first-contact decision tree

Before generation, run the artifact through this tree. It is the operational compression of this chapter.

1. What am I generating? (example / label / edge / convo / doc / rationale / eval)
2. What will it be USED for? (train / eval / red-team / demo)
3. Look up the risk cell in the type×use grid.
  - "Forbidden as gold"? -> STOP. Use humans/real data for this slot.
  - "Highest" / "High"? -> Proceed only with isolation proof + human gate.
  - "Medium"? -> Proceed with dedup, diversity, and label verification.
  - "Low"? -> Proceed; tag it so it never silently crosses into a stricter use.
4. Will it ever cross into a stricter use later? (almost always "yes" unless you prevent it)
  - Tag every example with source_type and intended_use NOW.
  - Set eval_exclusion=true on anything generated by a model the system trains on.

Step 4 is the one that prevents the crossings. The reason demo data ends up in training and red-team prompts end up in benchmarks is that nobody tagged them at birth, so months later a glob pattern swept them up. The NIST AI RMF frames this as a mapping and governance function, knowing what your data assets are and what they are approved for, and the Datasheets for Datasets work makes the same point from the documentation side: a dataset without a record of what it is and what it's for will be misused, not maliciously, just by default. The next two chapters turn that tag into a real provenance system.

Chapter summary

"Synthetic data" is not one substance. Two axes organize it: what you generated (examples, labels, edge cases, conversations, documents, rationales, evals) and what you will use it for (training, evaluation, red-team, demo). The artifact type predicts the fingerprints; the use determines which fingerprints are disqualifying. Crossing the two yields a risk matrix whose two non-negotiable cells are that model-generated labels and model-generated rationales must never serve as evaluation gold, using them as the ruler launders the model's errors into your metrics. The most expensive real-world errors are crossings: data generated for a lenient use (demo) that silently migrates into a strict one (training, eval), usually because nobody tagged it at birth. Teams reach for generation to fix scarcity, imbalance, privacy, testing, and labeling cost; each motivation has a signature trap, and the honest framing of all five is identical, generation fills a named gap and cannot replace the reality it stands in for. The defense is to tag every artifact with its type and intended use at creation, set eval-exclusion flags immediately, and let the risk cell dictate how hard the downstream gates must be.

Share