Appendix A: Back Matter
Glossary, implementation checklist, and source register for the book.
Key Takeaways
- The glossary fixes the vocabulary for behavior, knowledge, adapters, splits, and release gates.
- The checklist turns the book into an implementation sequence from diagnosis through operation.
- The source register keeps external citations explicit and avoids invented authority.
Glossary
Adapter: A small set of trainable parameters (as in LoRA) added on top of a frozen base model. Detachable, composable, and the basis for instant rollback.
Baseline: The measured performance of the cheaper alternatives (zero-shot, prompted, few-shot, retrieval) on your task, established before training so the fine-tune has a meaningful number to beat.
Behavior: How a model responds: tone, format, task steps, tool discipline, domain phrasing. Lives in the weights; the right thing to fine-tune. Contrast knowledge.
Catastrophic forgetting: General capability lost as a side effect of fine-tuning, because the optimizer moves weights that encoded unrelated skills. Caught by regression evals.
Contamination: Overlap between training and test data (exact, near-duplicate, entity, temporal, or pretraining) that inflates the eval and makes a model look better than it is.
Correction: A captured production failure plus its fix; the highest-value demonstration, feeding both the training set and the regression suite (the flywheel).
Data card / Datasheet: A document recording a dataset's sources, owners, licenses, PII/consent status, labeler-instruction version, measured quality, and splits. Part of lineage.
Demonstration: An input paired with its correct, complete output. The signal for supervised fine-tuning.
Distillation: Training a small student model to reproduce a large teacher model's behavior. Inherits the teacher's mistakes; evaluate against ground truth, not teacher agreement.
DPO (Direct Preference Optimization): Preference tuning without a separate reward model or RL loop; an SFT-like loss that raises the likelihood of preferred over dispreferred responses.
Drift: Quality degradation over time with no change to the weights, because the world the training data described moved on. Detected by monitoring; triggers retraining.
Generalist tax: The capability a frontier model offers but a narrow task never uses, billed on every call. The thing small-model specialization attacks.
Golden test set: A frozen, versioned, meticulously correct set of test cases used to compare model versions on a stable ruler. Never trained on.
Knowledge: What a model treats as true: current facts, business state, policy. Lives in sources, not weights; retrieve or fetch it, never freeze it. Contrast behavior.
Lineage: The traceable chain from a deployed model back to the exact data, config, and run that produced it (dataset card + model card + run manifest). Required for reproduction, rollback, deletion, and audit.
LoRA (Low-Rank Adaptation): Fine-tuning that freezes the base and trains small low-rank adapters. Cheap, forgetting-bounded, perfectly reversible by detaching the adapter.
Model card: A document recording what a model is: base, method, eval results across all axes, intended use, limitations, and rollback target.
Preference: A judgment that one response is better than another for the same input. The signal for DPO/preference tuning, for properties with no single right answer.
QLoRA: LoRA on a 4-bit-quantized base, cutting memory enough to fine-tune large models on a single GPU, with the adapters kept in higher precision than the quantized base.
Regression wall: The release gate that requires the new model to improve the target task while not regressing on protected axes (safety, format, key slices). A regression on any protected axis blocks the release.
Release gate: An automated set of preconditions (clean splits, beat baseline by a margin, regression wall, validated judge) a fine-tune must clear before shadow and canary rollout.
SFT (Supervised Fine-Tuning): The training objective of learning from labeled demonstrations. Orthogonal to the parameterization (full / LoRA / QLoRA).
Slot pattern: Writing training examples with placeholders for current facts so the model learns the phrasing (behavior) and the fact is filled by retrieval at inference time.
Stable repeated task: A high-volume task whose desired output is a structural shape (not a fact) that rarely changes. The canonical fine-tuning case.
Synthetic data: Examples generated by a model. Powerful for coverage and labeling cost; dangerous via inherited error, distribution narrowing, and recursive collapse. A candidate until verified.
TRAIN: The book's decision framework (and the fine-tuning companion to Human in the Loop Is Not a Plan and A Field Guide to Evals): Task stability, Required knowledge, Available examples, Impact of mistakes, Necessary evaluation. All five must pass before training.
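The Slot pattern entry above can be made concrete with a small sketch. This is one possible reading, not the book's implementation: the training target keeps placeholders where current facts would go, and a separate step fills them at inference time. The template text, the slot names (`renewal_date`, `price`), and the `fill_slots` helper are all hypothetical.

```python
from string import Template

# Hypothetical training target: placeholders stand in for current facts,
# so the model learns only the phrasing (behavior), not the values.
TRAINING_TARGET = Template(
    "Thanks for reaching out. Your plan renews on $renewal_date "
    "at $price per month."
)


def fill_slots(template: Template, facts: dict[str, str]) -> str:
    """Fill placeholders with facts fetched at inference time.

    Template.substitute raises KeyError when a slot has no fact, so a
    failed lookup surfaces as an error instead of a guessed value.
    """
    return template.substitute(facts)


# In practice these facts would come from retrieval or a tool call;
# they are hard-coded here purely for illustration.
facts = {"renewal_date": "2031-01-15", "price": "12 USD"}
reply = fill_slots(TRAINING_TARGET, facts)
print(reply)
```

The point of the design is that a price change requires updating the source of the fact, not retraining the model.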
Implementation Checklist
A team's fine-tuning decision and execution are sound when the team can answer yes, with evidence, to each of these. The items are grouped by movement.
Diagnosis (Movement I)
- The symptom was named and sorted into the taxonomy (behavior vs. knowledge vs. state vs. permission) before "fine-tune" was written on a ticket.
- No current fact, per-request state, permission, or fast-changing policy is a fine-tune target; each routes to retrieval, tools, or policy-in-code.
- The five false diagnoses were ruled out: the distinguishing test was run for "doesn't know," "needs our data," "too verbose," "fails sometimes," and "must be cheaper."
- "It fails sometimes" was converted into clustered, characterized failures before any training decision.
- The lightest adequate tool on the ladder was tried and shown to fall short before climbing to fine-tuning.
The case for training (Movement II)
- The target is a stable, repeated, pattern-shaped task, green on the task-stability scorecard, not red on stability or task-ness.
- Style/phrasing/tool-discipline fine-tunes slot out facts and raise, not lower, the grounding bar for the knowledge twin.
- Small-model specialization is justified by a break-even calculation including data curation and ongoing ops, not just GPU cost.
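The break-even item above can be sketched as simple arithmetic. The function below is a minimal illustration under stated assumptions: costs are in dollars, per-call prices are quoted per 1,000 calls, and all figures in the example are invented for demonstration, not taken from the book.

```python
import math
from typing import Optional


def break_even_calls(
    fixed_cost: float,          # data curation + training, one-time
    monthly_ops_cost: float,    # hosting, monitoring, retraining
    months: int,                # planning horizon
    large_cost_per_1k: float,   # current model, dollars per 1,000 calls
    small_cost_per_1k: float,   # specialized model, dollars per 1,000 calls
) -> Optional[int]:
    """Return the number of calls needed over the horizon to recover
    the full cost of specialization, or None if there is no saving."""
    saving_per_1k = large_cost_per_1k - small_cost_per_1k
    if saving_per_1k <= 0:
        return None  # the small model is not cheaper per call
    total_cost = fixed_cost + monthly_ops_cost * months
    return math.ceil(total_cost / saving_per_1k * 1000)


# Illustrative numbers only: 30,000 one-time, 2,000/month ops for a year,
# 10.00 vs 1.00 dollars per 1,000 calls.
calls = break_even_calls(30_000, 2_000, 12, 10.0, 1.0)
print(calls)  # 6000000
```

If expected volume over the horizon falls below the returned figure, the specialization does not pay for itself on cost alone.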
Data (Movement III)
- The signal kind (demonstration / correction / preference) matches the problem (right answer vs. better answer).
- Inter-annotator agreement was measured; contested items were reviewed; labeler instructions are versioned.
- Coverage includes edge cases, negatives, and refusals, not just the easy middle.
- The validator passes: no contradictions (release blocker), no invalid labels, no severe imbalance, negatives present.
- Splits are by entity and temporal where relevant; the contamination report is clean across every split boundary; a frozen golden set exists.
- Synthetic data is verified by a separate checker, diversity-capped, blended with real data, and provenance-tagged.
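The split and contamination items above can be sketched as follows. This is a minimal illustration, not a full contamination checker: it covers entity-level splitting and exact matches after light normalization only, and the record fields `entity_id` and `text` are assumed names. Near-duplicate, temporal, and pretraining contamination would need additional checks.

```python
import hashlib


def assign_split(entity_id: str, test_percent: int = 20) -> str:
    """Deterministically map an entity to 'train' or 'test', so every
    record for the same entity lands on the same side."""
    digest = hashlib.sha256(entity_id.encode("utf-8")).hexdigest()
    return "test" if int(digest, 16) % 100 < test_percent else "train"


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants match."""
    return " ".join(text.lower().split())


def contamination_report(train: list[dict], test: list[dict]) -> dict:
    """Report entities and normalized texts that appear on both sides."""
    shared_entities = {r["entity_id"] for r in train} & {
        r["entity_id"] for r in test
    }
    shared_texts = {normalize(r["text"]) for r in train} & {
        normalize(r["text"]) for r in test
    }
    return {
        "shared_entities": sorted(shared_entities),
        "shared_texts": sorted(shared_texts),
        "clean": not shared_entities and not shared_texts,
    }
```

Splitting by a hash of the entity identifier keeps the assignment stable across reruns, which also makes it easier to keep a frozen golden set out of training.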
Methods (Movement IV)
- The objective (SFT/DPO/distillation) and the parameterization (LoRA/QLoRA/full) were chosen as separate, justified decisions.
- LoRA/QLoRA is the default; full fine-tuning is used only where an eval proves the adapter can't reach the bar.
- The run pins the base, dataset hash, and seed; training stops on the validation metric, not the training loss.
- The right component was chosen: the retriever or router is fine-tuned where the failure actually lives, rather than the generator by reflex.
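The pinning and stopping items above can be sketched in a few lines. This is an illustrative sketch only: the manifest fields, the base-model identifier, and the patience rule are assumptions, not the book's format. It assumes a validation metric where higher is better.

```python
import hashlib
import json


def dataset_hash(records: list[dict]) -> str:
    """Hash the dataset content in a canonical form (sorted keys), so
    the same records always produce the same fingerprint."""
    canonical = "\n".join(json.dumps(r, sort_keys=True) for r in records)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_manifest(base_model: str, records: list[dict], seed: int) -> dict:
    """Record the three pins the run needs for reproduction."""
    return {
        "base_model": base_model,
        "dataset_sha256": dataset_hash(records),
        "seed": seed,
    }


def best_checkpoint(val_scores: list[float], patience: int = 2) -> int:
    """Return the index of the checkpoint to keep. Stops once the
    validation metric has not improved for `patience` evaluations;
    the training loss is deliberately not consulted."""
    best_idx = 0
    for i, score in enumerate(val_scores):
        if score > val_scores[best_idx]:
            best_idx = i
        elif i - best_idx >= patience:
            break
    return best_idx


# Hypothetical base-model identifier and toy records.
manifest = build_manifest(
    "example-base-model@rev-1", [{"input": "hi", "output": "hello"}], seed=7
)
keep = best_checkpoint([0.70, 0.74, 0.73, 0.72])
print(keep)  # 1
```

The hash depends on record order, so a reshuffled dataset counts as a different dataset; whether that is desirable is a design choice.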
Evaluation and operations (Movements V-VI)
- A baseline was measured before training; the fine-tune must beat the best cheaper combination by a margin beyond eval noise.
- The eval suite covers task, regression, format, safety/truthfulness, and slices; the regression wall blocks on any protected-axis regression.
- Any LLM judge is validated against humans, uses a different family, and randomizes order.
- The release gate is automated; passing it earns shadow + canary with auto-rollback, not a full launch.
- Lineage exists (dataset card + model card + run manifest); the rollback target is a real, exact previous state.
- Drift monitoring watches input, output, quality, and protected-slice drift; crossing a threshold triggers retraining via the correction flywheel.
- Deletable PII and fast-changing policy were kept out of the weights; the dataset (not the trained model) is treated as the portable asset.
- An incident runbook exists with an instant adapter-detach rollback as step two.
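The release-gate and regression-wall items above can be sketched as a single check. This is a minimal sketch under assumptions: scores are higher-is-better floats keyed by axis name, the margin value is arbitrary, and the function and field names are hypothetical. A real gate would also account for eval noise when comparing scores.

```python
from dataclasses import dataclass, field


@dataclass
class GateResult:
    passed: bool
    reasons: list[str] = field(default_factory=list)


def release_gate(
    candidate: dict[str, float],
    baseline: dict[str, float],   # best cheaper combination
    previous: dict[str, float],   # currently deployed model
    protected_axes: list[str],
    *,
    target_axis: str = "task",
    margin: float = 0.02,
    splits_clean: bool = True,
    judge_validated: bool = True,
) -> GateResult:
    """Collect every failed precondition; pass only if none failed."""
    reasons = []
    if not splits_clean:
        reasons.append("contamination report is not clean")
    if not judge_validated:
        reasons.append("LLM judge is not validated against humans")
    if candidate[target_axis] < baseline[target_axis] + margin:
        reasons.append(f"does not beat baseline by margin on {target_axis}")
    # Regression wall: any drop on a protected axis blocks the release.
    for axis in protected_axes:
        if candidate[axis] < previous[axis]:
            reasons.append(f"regression on protected axis: {axis}")
    return GateResult(passed=not reasons, reasons=reasons)


result = release_gate(
    candidate={"task": 0.90, "safety": 0.99, "format": 0.97},
    baseline={"task": 0.85},
    previous={"task": 0.86, "safety": 0.99, "format": 0.98},
    protected_axes=["safety", "format"],
)
print(result.passed, result.reasons)
```

In this example the candidate improves the target task but drops on format, so the gate blocks it. Passing the gate would earn shadow and canary rollout, not a full launch.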
Research and Source Register
Sources grouped by chapter. A source appears under a chapter only if that chapter actually uses it to support a claim.
Front matter / Introduction: synthetic; draws on the book's own argument. No external citations.
Ch. 1, The Support Bot That Knew the Old Product
- OpenAI: Model optimization guide
- OpenAI: Supervised fine-tuning
- OpenAI: Fine-tuning best practices
- Retrieval-Augmented Generation (Lewis et al.)
- Training language models to follow instructions (InstructGPT)
Ch. 2, What Fine-Tuning Actually Changes
- OpenAI: Supervised fine-tuning
- Training language models to follow instructions (InstructGPT)
- LoRA: Low-Rank Adaptation of Large Language Models
- An Empirical Study of Catastrophic Forgetting in LLMs during Continual Fine-tuning
- Physics of Language Models: Knowledge Capacity Scaling Laws
Ch. 3, Five False Diagnoses
- Retrieval-Augmented Generation (Lewis et al.)
- OpenAI: Model optimization guide
- OpenAI: Fine-tuning best practices
- Lost in the Middle: How Language Models Use Long Contexts
- Chain-of-Thought Prompting Elicits Reasoning in LLMs
Ch. 4, The Customization Menu and a Decision Tree
- OpenAI: Model optimization guide
- Retrieval-Augmented Generation (Lewis et al.)
- Chain-of-Thought Prompting Elicits Reasoning in LLMs
- Toolformer: Language Models Can Teach Themselves to Use Tools
- LoRA: Low-Rank Adaptation of Large Language Models
Ch. 5, Format, Behavior, and the Shape of a Repeated Task
- OpenAI: Supervised fine-tuning
- OpenAI: Fine-tuning best practices
- Stanford Alpaca
- Self-Instruct
- Training language models to follow instructions (InstructGPT)
Ch. 6, House Style, Domain Phrasing, and Tool Discipline
- Training language models to follow instructions (InstructGPT)
- Toolformer: Language Models Can Teach Themselves to Use Tools
- OpenAI: Supervised fine-tuning
- Self-Instruct
- OpenAI: Function calling guide
Ch. 7, Specializing Small Models and Distilling Down
- Distilling the Knowledge in a Neural Network (Hinton et al.)
- Textbooks Are All You Need (phi-1)
- Phi-3 Technical Report
- QLoRA: Efficient Finetuning of Quantized LLMs
- OpenAI: Model optimization guide
Ch. 8, Demonstrations, Corrections, and Preferences
- Training language models to follow instructions (InstructGPT)
- Direct Preference Optimization
- OpenAI: Fine-tuning best practices
- Constitutional AI: Harmlessness from AI Feedback
- Self-Instruct
Ch. 9, Labels, Disagreement, and Coverage
- OpenAI: Fine-tuning best practices
- Training language models to follow instructions (InstructGPT)
- Data Quality for Machine Learning Tasks (KDD)
- Datasheets for Datasets
- Pervasive Label Errors in Test Sets (Northcutt et al.)
Ch. 10, Contamination, Leakage, and the Splits That Save You
- Generalization or Memorization: Data Contamination and Trustworthy Evaluation for LLMs
- OpenAI: Fine-tuning best practices
- QLoRA: Efficient Finetuning of Quantized LLMs
- Pervasive Label Errors in Test Sets (Northcutt et al.)
- Documenting Large Webtext Corpora: A Case Study on C4
Ch. 11, Synthetic Data: When It Helps, When It Poisons
- Self-Instruct
- Stanford Alpaca
- Textbooks Are All You Need (phi-1)
- AI models collapse when trained on recursively generated data (Nature)
- Constitutional AI: Harmlessness from AI Feedback
Ch. 12, SFT, LoRA, and QLoRA in Practical Terms
- LoRA: Low-Rank Adaptation of Large Language Models
- QLoRA: Efficient Finetuning of Quantized LLMs
- Hugging Face PEFT documentation
- OpenAI: Supervised fine-tuning
- An Empirical Study of Catastrophic Forgetting in LLMs during Continual Fine-tuning
Ch. 13, Preference Tuning, DPO, and Distillation
- Direct Preference Optimization
- Training language models to follow instructions (InstructGPT)
- Constitutional AI: Harmlessness from AI Feedback
- Distilling the Knowledge in a Neural Network (Hinton et al.)
- QLoRA: Efficient Finetuning of Quantized LLMs
Ch. 14, What to Fine-Tune: Generators, Retrievers, and Routers
- Dense Passage Retrieval for Open-Domain Question Answering
- Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
- Retrieval-Augmented Generation (Lewis et al.)
- LoRA: Low-Rank Adaptation of Large Language Models
- QLoRA: Efficient Finetuning of Quantized LLMs
Ch. 15, Baselines, Regression Walls, and the Release Gate
- OpenAI: Evals guide
- OpenAI Evals repository
- RAGAS: Automated Evaluation of Retrieval Augmented Generation
- TruthfulQA: Measuring How Models Mimic Human Falsehoods
- QLoRA: Efficient Finetuning of Quantized LLMs
Ch. 16, Versioning, Lineage, Drift, and Retirement
- NIST AI Risk Management Framework
- GDPR Article 17: Right to erasure
- Model Cards for Model Reporting
- Datasheets for Datasets
- OpenAI: Supervised fine-tuning
Ch. 17, Ten Playbooks for the Decision Meeting
- OpenAI: Fine-tuning best practices
- Retrieval-Augmented Generation (Lewis et al.)
- Direct Preference Optimization
- Toolformer: Language Models Can Teach Themselves to Use Tools
- NIST AI Risk Management Framework
