Fine-Tune, or Don't

Key Takeaways

Fine-tuning is a behavior intervention, not a way to store current facts.

The TRAIN framework gates every run on five checks: task stability, required knowledge, available examples, impact of mistakes, and necessary evaluation.

The book treats fine-tuning as surgery: diagnose, prepare, monitor, and be ready to roll back.

Book promise

Fine-tuning changes model behavior. It does not replace retrieval, tools, policy, evaluation, or product judgment.

This is a decision-first field guide. It teaches builders when fine-tuning is the right intervention, when prompting, retrieval, structured outputs, tools, routing, or caching are better, and how to fine-tune without laundering bad data into model behavior. It is written for engineers who have already shipped something with a foundation model and have watched a team reach for fine-tuning as a reflex: the support bot that "needs to know the product," the extraction pipeline that "fails sometimes," the assistant that is "too verbose," the workload that "must be cheaper." Each of those sentences names a symptom. None of them names a diagnosis. This book is about the difference. In the larger site argument, it sits beside "AI-Native means the machine does the job" and "Evals that predict production, not vanity."

This manuscript is not a short brief, not a topic outline, and not a marketing summary. It is designed for AI engineers, product engineers, MLOps engineers, ML engineers, backend engineers, and technical founders who need a production decision process for customizing models: one they can carry into an architecture review and defend with evidence.

The recurring motif

Fine-tuning is surgery, not seasoning.

Seasoning is what you reach for casually: you add a little, taste, and adjust. Surgery is what you reach for after a diagnosis, with consent, preparation, a plan for what could go wrong, and monitoring during recovery. Fine-tuning is the second kind of intervention. It changes the body of the system, the weights themselves, and the change is not trivially reversible at inference time the way a prompt edit is. Treating training like seasoning is the single most expensive mistake in this field, and most of this book is an elaboration of why.

The enemy

The belief this book exists to correct:

"We have a model problem, so we should fine-tune."

Fine-tuning is genuinely powerful. It can stabilize output format, encode repeated task behavior, capture house style and domain phrasing, sharpen classification and extraction patterns, instill tool-use conventions, shorten prompts for high-volume workflows, and specialize a small model so it is cheaper and faster than a large one. It is usually the wrong way to add current facts, enforce permissions, track business state, follow rapidly changing policy, or hold private knowledge that must be deletable on request. Most teams that "need to fine-tune" need one of those other things instead, and the ones that genuinely need a fine-tune still need everything else around it.

Core thesis, stated plainly

Behavior lives in weights. Facts live in sources. State lives in databases. Permissions live in policy. Fine-tuning edits the first box. It does nothing for the other three, and it can quietly make them harder to operate. Every chapter returns to this split.
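
The four-way split above can be written down as a small lookup. This is only an illustrative sketch: the names `Home`, `NEED_TO_HOME`, and `fine_tuning_addresses` are hypothetical and do not come from the book. It shows one way to make "which box does this need live in?" an explicit question in a design review.

```python
from enum import Enum


class Home(Enum):
    """Where each kind of need lives, following the split in the text."""
    WEIGHTS = "weights"
    SOURCES = "sources"
    DATABASES = "databases"
    POLICY = "policy"


# Hypothetical mapping of need -> home, mirroring the four sentences above.
NEED_TO_HOME = {
    "behavior": Home.WEIGHTS,
    "facts": Home.SOURCES,
    "state": Home.DATABASES,
    "permissions": Home.POLICY,
}


def fine_tuning_addresses(need: str) -> bool:
    """Return True only for needs that live in the weights."""
    return NEED_TO_HOME[need] is Home.WEIGHTS


# The needs that fine-tuning does nothing for, in the order listed above.
not_addressed = [need for need in NEED_TO_HOME if not fine_tuning_addresses(need)]
```

Under these assumptions, `not_addressed` holds the three needs that belong in sources, databases, and policy rather than in a training run.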

The TRAIN Decision Framework

One framework recurs through the book. Before any fine-tuning run, answer five questions. If any one fails, you are probably not ready to train, and often you have discovered that you should not train at all.

T: Task stability. Is the behavior stable enough to encode into weights? A target that changes weekly does not belong in a model that takes days to retrain and validate.
R: Required knowledge. Does the task need current facts (externalize them) or durable behavior (a candidate for training)? Confusing these is the most common false diagnosis in the book.
A: Available examples. Do you have enough clean, consistent demonstrations or preference pairs? The bar is not enough rows; it is enough trustworthy ones.
I: Impact of mistakes. What breaks if the model internalizes the wrong behavior? A bad prompt is fixed in a deploy; a bad fine-tune ships a confident error into every relevant request until you catch it.
N: Necessary evaluation. Can you prove the fine-tune beats the cheaper alternatives on the metrics that matter, before and after, including regressions?

TRAIN is used as a lens, not a template. It will not appear as a forced subsection in every chapter. It is the question set a mature team can answer before spending money and risk on a training run.
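
As a rough illustration, the five questions can be recorded as a pre-training gate. This is a minimal sketch, not the book's implementation: the class `TrainAnswers`, its field names, and both functions are hypothetical. Reducing each question to a yes/no answer is itself an assumption, since the "impact of mistakes" question in particular calls for judgment rather than a boolean.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TrainAnswers:
    """Hypothetical yes/no answers to the five TRAIN questions."""
    task_stable: bool             # T: behavior is stable enough to encode into weights
    needs_durable_behavior: bool  # R: task needs durable behavior, not current facts
    examples_trustworthy: bool    # A: enough clean, consistent examples exist
    impact_contained: bool        # I: a wrong internalized behavior can be caught and rolled back
    evaluation_ready: bool        # N: the fine-tune can be proven against cheaper alternatives


def failed_checks(answers: TrainAnswers) -> list[str]:
    """Return the names of the questions answered 'no', in T-R-A-I-N order."""
    return [f.name for f in fields(answers) if not getattr(answers, f.name)]


def ready_to_train(answers: TrainAnswers) -> bool:
    """A single failed question is enough to block the run."""
    return not failed_checks(answers)


# Example: a support bot that "needs to know the product" fails R,
# and the team has no baseline to compare against, so it fails N too.
support_bot = TrainAnswers(
    task_stable=True,
    needs_durable_behavior=False,
    examples_trustworthy=True,
    impact_contained=True,
    evaluation_ready=False,
)
blocked_on = failed_checks(support_bot)
```

Returning the list of failed questions, rather than a bare boolean, keeps the result useful as a lens: it tells the team which diagnosis to revisit instead of only saying no.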

Movement I: The Diagnosis Before the Training Run

The Support Bot That Knew the Old Product
What Fine-Tuning Actually Changes
Five False Diagnoses
The Customization Menu and a Decision Tree

Movement II: The Problems Fine-Tuning Is Actually Good At

Format, Behavior, and the Shape of a Repeated Task
House Style, Domain Phrasing, and Tool Discipline
Specializing Small Models and Distilling Down

Movement III: The Data Is the Model Update

Demonstrations, Corrections, and Preferences
Labels, Disagreement, and Coverage
Contamination, Leakage, and the Splits That Save You
Synthetic Data: When It Helps, When It Poisons

Movement IV: Methods Without Mystery

SFT, LoRA, and QLoRA in Practical Terms
Preference Tuning, DPO, and Distillation
What to Fine-Tune: Generators, Retrievers, and Routers

Movement V: Evaluation Before, During, and After

Baselines, Regression Walls, and the Release Gate

Movement VI: Operating Fine-Tuned Models

Versioning, Lineage, Drift, and Retirement

Movement VII: Use Case Playbooks

Ten Playbooks for the Decision Meeting

Back matter

Glossary
Implementation Checklist
Research and Source Register

Front Matter: Fine-Tune, or Don't