Name: Fine-Tune, or Don't
Availability: InStock

Key Takeaways

The glossary fixes the vocabulary for behavior, knowledge, adapters, splits, and release gates.

The checklist turns the book into an implementation sequence from diagnosis through operation.

The source register keeps external citations explicit and avoids invented authority.

Glossary

Adapter: A small set of trainable parameters (as in LoRA) added on top of a frozen base model. Detachable, composable, and the basis for instant rollback.

Baseline: The measured performance of the cheaper alternatives (zero-shot, prompted, few-shot, retrieval) on your task, established before training so the fine-tune has a meaningful number to beat.

Behavior: How a model responds: tone, format, task steps, tool discipline, domain phrasing. Lives in the weights; the right thing to fine-tune. Contrast knowledge.

Catastrophic forgetting: General capability lost as a side effect of fine-tuning, because the optimizer moves weights that encoded unrelated skills. Caught by regression evals.

Contamination: Overlap between training and test data (exact, near-duplicate, entity, temporal, or pretraining) that inflates the eval and makes a model look good that isn't.

Correction: A captured production failure plus its fix; the highest-value demonstration, feeding both the training set and the regression suite (the flywheel).

Data card / Datasheet: A document recording a dataset's sources, owners, licenses, PII/consent status, labeler-instruction version, measured quality, and splits. Part of lineage.

Demonstration: An input paired with its correct, complete output. The signal for supervised fine-tuning.

Distillation: Training a small student model to reproduce a large teacher model's behavior. Inherits the teacher's mistakes; evaluate against ground truth, not teacher agreement.

DPO (Direct Preference Optimization): Preference tuning without a separate reward model or RL loop; an SFT-like loss that raises the likelihood of preferred over dispreferred responses.

Drift: Quality degradation over time with no change to the weights, because the world the training data described moved on. Detected by monitoring; triggers retraining.

Generalist tax: The capability a frontier model offers but a narrow task never uses, billed on every call. The thing small-model specialization attacks.

Golden test set: A frozen, versioned, meticulously-correct set of test cases used to compare model versions on a stable ruler. Never trained on.

Knowledge: What a model treats as true: current facts, business state, policy. Lives in sources, not weights; retrieve or fetch it, never freeze it. Contrast behavior.

Lineage: The traceable chain from a deployed model back to the exact data, config, and run that produced it (dataset card + model card + run manifest). Required for reproduction, rollback, deletion, and audit.

LoRA (Low-Rank Adaptation): Fine-tuning that freezes the base and trains small low-rank adapters. Cheap, forgetting-bounded, perfectly reversible by detaching the adapter.

Model card: A document recording what a model is: base, method, eval results across all axes, intended use, limitations, and rollback target.

Preference: A judgment that one response is better than another for the same input. The signal for DPO/preference tuning, for properties with no single right answer.

QLoRA: LoRA on a 4-bit-quantized base, cutting memory enough to fine-tune large models on a single GPU with adapters kept at full precision.

Regression wall: The release gate that requires the new model to improve the target task while not regressing on protected axes (safety, format, key slices). A regression on any protected axis blocks the release.

Release gate: An automated set of preconditions (clean splits, beat baseline by a margin, regression wall, validated judge) a fine-tune must clear before shadow and canary rollout.

SFT (Supervised Fine-Tuning): The training objective of learning from labeled demonstrations. Orthogonal to the parameterization (full / LoRA / QLoRA).

Slot pattern: Writing training examples with placeholders for current facts so the model learns the phrasing (behavior) and the fact is filled by retrieval at inference time.

Stable repeated task: A high-volume task whose desired output is a structural shape (not a fact) that rarely changes. The canonical fine-tuning case.

Synthetic data: Examples generated by a model. Powerful for coverage and labeling cost; dangerous via inherited error, distribution narrowing, and recursive collapse. A candidate until verified.

TRAIN: The book's decision framework, and the fine-tuning companion to Human in the Loop Is Not a Plan and A Field Guide to Evals:Task stability, Required knowledge, Available examples, Impact of mistakes, Necessary evaluation. All five must pass before training.

Implementation Checklist

A team's fine-tuning decision and execution is sound when it can answer yes, with evidence to each of these. Grouped by movement.

Diagnosis (Movement I)

The symptom was named and sorted into the taxonomy (behavior vs. knowledge vs. state vs. permission) before "fine-tune" was written on a ticket.
No current fact, per-request state, permission, or fast-changing policy is a fine-tune target; each routes to retrieval, tools, or policy-in-code.
The five false diagnoses were ruled out: the distinguishing test was run for "doesn't know," "needs our data," "too verbose," "fails sometimes," and "must be cheaper."
"It fails sometimes" was converted into clustered, characterized failures before any training decision.
The lightest adequate tool on the ladder was tried and shown to fall short before climbing to fine-tuning.

The case for training (Movement II)

The target is a stable, repeated, pattern-shaped task, green on the task-stability scorecard, not red on stability or task-ness.
Style/phrasing/tool-discipline fine-tunes slot out facts and raise, not lower, the grounding bar for the knowledge twin.
Small-model specialization is justified by a break-even calculation including data curation and ongoing ops, not just GPU cost.

Data (Movement III)

The signal kind (demonstration / correction / preference) matches the problem (right answer vs. better answer).
Inter-annotator agreement was measured; contested items were reviewed; labeler instructions are versioned.
Coverage includes edge cases, negatives, and refusals, not just the easy middle.
The validator passes: no contradictions (release blocker), no invalid labels, no severe imbalance, negatives present.
Splits are by entity and temporal where relevant; the contamination report is clean across every split boundary; a frozen golden set exists.
Synthetic data is verified by a separate checker, diversity-capped, blended with real data, and provenance-tagged.

Methods (Movement IV)

The objective (SFT/DPO/distillation) and the parameterization (LoRA/QLoRA/full) were chosen as separate, justified decisions.
LoRA/QLoRA is the default; full fine-tuning is used only where an eval proves the adapter can't reach the bar.
The run pins the base, dataset hash, and seed; training stops on the validation metric, not the training loss.
The right component was chosen, retriever or router fine-tuned where the failure actually lives, not the generator by reflex.

Evaluation and operations (Movements V-VI)

A baseline was measured before training; the fine-tune must beat the best cheaper combination by a margin beyond eval noise.
The eval suite covers task, regression, format, safety/truthfulness, and slices; the regression wall blocks on any protected-axis regression.
Any LLM judge is validated against humans, uses a different family, and randomizes order.
The release gate is automated; passing it earns shadow + canary with auto-rollback, not a full launch.
Lineage exists (dataset card + model card + run manifest); the rollback target is a real, exact previous state.
Drift monitoring watches input, output, quality, and protected-slice drift; crossing a threshold triggers retraining via the correction flywheel.
Deletable PII and fast-changing policy were kept out of the weights; the dataset (not the trained model) is treated as the portable asset.
An incident runbook exists with an instant adapter-detach rollback as step two.

Research and Source Register

Sources grouped by chapter. A source appears under a chapter only if that chapter actually uses it to support a claim.

Front matter / Introduction: synthetic; draws on the book's own argument. No external citations.

Ch. 1, The Support Bot That Knew the Old Product

Ch. 2, What Fine-Tuning Actually Changes

Ch. 3, Five False Diagnoses

Ch. 4, The Customization Menu and a Decision Tree

Ch. 5, Format, Behavior, and the Shape of a Repeated Task

Ch. 6, House Style, Domain Phrasing, and Tool Discipline

Ch. 7, Specializing Small Models and Distilling Down

Ch. 8, Demonstrations, Corrections, and Preferences

Ch. 9, Labels, Disagreement, and Coverage

Ch. 10, Contamination, Leakage, and the Splits That Save You

Ch. 11: Synthetic Data: When It Helps, When It Poisons

Ch. 12, SFT, LoRA, and QLoRA in Practical Terms

Ch. 13, Preference Tuning, DPO, and Distillation

Ch. 14, What to Fine-Tune: Generators, Retrievers, and Routers

Ch. 15, Baselines, Regression Walls, and the Release Gate

Ch. 16, Versioning, Lineage, Drift, and Retirement

Ch. 17, Ten Playbooks for the Decision Meeting

Appendix A: Back Matter