Name: In Defense of Small Models
Availability: InStock

Why the largest model that wins the demo so often loses the production it was supposed to win, and what this book asks you to measure instead.

A team I worked with shipped a content moderation feature for a marketplace. The task was unglamorous: read a listing title and description, decide whether it violated one of nine policy categories, and return a label with a short reason. They prototyped it on a Friday with the largest frontier model their vendor offered, fed it twenty examples, and it was right on all twenty. The demo on Monday was a triumph. Someone said the words "this is basically solved" out loud in a room with the head of product in it, which is the moment you should start worrying.

Three weeks later the feature was in production handling about 1.4 million listings a day. The bill for that one feature was tracking toward a number that made the CFO ask, in a tone I have heard before, whether anyone had checked the unit economics before turning it on. Each call was costing real money, the p95 latency was making the listing-create flow feel sticky, and because the model was hosted by an external vendor, the legal team had questions about listing content leaving the building that nobody had answered. The feature worked. It was also, by any honest accounting, a bad engineering decision dressed up as a good one.

What went wrong is not exotic. They reached for the largest model that could solve the task because it solved the task in the demo, and the demo does not have a bill, a latency budget, a data residency requirement, or a 3 a.m. page. The demo is a controlled environment that rewards capability and punishes nothing. Production is the opposite. This book is about the gap between those two environments, and about the discipline of choosing models for the second one.

The enemy: the frontier-model reflex

I am going to name the thing this book is against, because naming it makes it easier to catch yourself doing it. The enemy is the frontier-model reflex: the habit of reaching for the largest available model by default, because it wins the demo, while ignoring cost, latency, privacy, reliability, and margin. The reflex is not stupid. Frontier models are genuinely astonishing, and reaching for the most capable tool is a reasonable instinct in most of engineering. The problem is that the instinct was trained in an environment where the most capable tool was also roughly free at the point of use. A frontier model call is not free. It costs money per token, it costs milliseconds you may not have, and it costs control you may legally need.

The reflex shows up in specific, recognizable ways. It shows up when a team uses a reasoning model to decide whether a string is a valid email. It shows up when a 70-billion-parameter model is asked to pick one of four routing categories a million times a day. It shows up when every feature in the product calls the same maximal model because standardizing on one API felt clean in the architecture review. In each case the team bought a capability they are barely using and is paying for in full, every single request.

A giant frontier-model hammer poised over a tiny production nail — The frontier-model reflex: the biggest tool aimed at the smallest job because it won the demo.

The thesis

Here is the claim the rest of the book defends.

The best model is not the largest model that can solve the task. It is the smallest system that solves the task reliably inside the product's constraints.

Read that twice, because the load-bearing words are easy to skim past. "Smallest system," not smallest model: sometimes the right answer is not a small model at all but a regular expression, a SQL query, a classical classifier, or a lookup table, and a model of any size is overkill. "Reliably," not "in the demo": the thing has to keep working across the long tail of real inputs, not just the curated twenty. "Inside the product's constraints": the constraints are the point. Capability that violates your latency budget, your cost model, your privacy posture, or your reliability target is not capability you can ship. It is a liability that happens to demo well.

The recurring motif of this book follows directly:

Capability is only valuable after it survives latency, cost, privacy, and reliability.

A model can be smarter than you need and still be the wrong choice, the same way a Formula 1 car is a worse choice than a Honda Civic for a school run. The Civic is not a compromise. It is the correct tool, chosen by someone who understood the actual job.

What this book is and is not

I want to be precise about the position I am taking, because "small models" attracts ideologues from two directions and I am not one of them.

This book is not a claim that small models beat frontier models everywhere. They do not. There are tasks where you need the frontier and trying to be clever will just cost you weeks and embarrassment. Hard open-ended reasoning, novel synthesis across long context, genuinely ambiguous instructions: reach for the big model and do not feel bad about it. The argument is about defaults and about the large class of high-volume production tasks where the frontier is overkill, not about a universal ranking.

This book is not a hardware optimization manual. We will talk about quantization, batching, and inference engines because they change the economics, but the goal is production decisions, not squeezing the last percent of GPU utilization.

This book is not a leaderboard book. I will cite benchmarks, and I will treat them with suspicion, because a benchmark number is a claim about a test set, not a promise about your inputs. Models like Microsoft's Phi-4, a 14-billion-parameter model that the Phi-4 technical report shows matching or beating much larger models on math and reasoning, are useful evidence that small can be strong. They are not a license to skip evaluating on your own data.

This book is not a fine-tuning tutorial and not an open-source manifesto. Fine-tuning is one tool among several. Open weights are sometimes the answer and sometimes a way to acquire an operational burden you did not budget for. The decision is economic and architectural, not tribal. There is a sibling chapter in another volume of this series that argues the pure cost-model case, the break-even arithmetic of small models as a financial instrument. I will not re-litigate that arithmetic here. This book is about the engineering: how to choose, adapt, evaluate, deploy, route to, and maintain smaller systems in production.

How to read this book

The book moves from diagnosis to construction to operation.

The first stretch is diagnostic. We look hard at the frontier-model reflex and at what "small" even means once you stop treating it as a synonym for "worse." We measure capability waste directly, because you cannot manage what you have not put a number on. And we learn to make the task itself smaller, which is often the highest-use move available and the one teams skip because it is unglamorous.

The middle stretch is constructive. We go through the actual toolkit: the right-sized intelligence ladder from rules up to human review, fine-tuning small models for domain behavior, distillation from a strong teacher, and the compression techniques that let a capable model run cheaply. These chapters are where the research lives, and I have tried to make the research shape the argument rather than decorate it.

The last stretch is operational. Latency as a product feature you design rather than tolerate. Privacy and local inference as constraints that sometimes decide the whole architecture. Routing, so the small system handles what it can and the frontier handles what it should. Evaluation, which is the discipline that lets you say "good enough" with a straight face. And the failure modes, because a book that only tells you when small models win is a sales brochure, not an engineering text.

Throughout, I will hand you artifacts you can use on Monday: the Small Model Suitability Test, the Capability Waste Audit, the Right-Sized Intelligence Ladder, the Margin-Quality Frontier, and the SMALL framework. These are not acronyms invented to look tidy. Each one is a question or a procedure that changes a decision. If a framework in this book does not change a decision, I have failed and you should ignore it.

The promise

By the end, you should be able to walk into the room where someone says "this is basically solved" after a clean demo, and ask the four questions that turn a demo into an engineering decision: what does this cost per request at our volume, what is the latency at p95, where does the data go, and what happens on the inputs we did not test. You should be able to look at an existing feature that calls a frontier model for everything and estimate how much capability you are paying for and not using. And you should be able to build a system that uses a small model for the common case and escalates to the frontier only when the request earns it.

The marketplace team I started with eventually rebuilt that moderation feature. A fine-tuned small classifier handled the clear cases, which were the overwhelming majority, in single-digit milliseconds inside their own infrastructure. A frontier model was reserved for the genuinely ambiguous listings that the classifier flagged as uncertain, which turned out to be a small minority of traffic. The bill dropped by more than an order of magnitude, the latency complaint disappeared, and the legal team got their answer because most content never left the building. Nobody calls that a downgrade. They call it the version that should have shipped in the first place.

That rebuild is the whole book in miniature. The frontier model was not wrong. Using it for everything was. Let me show you how to tell the difference before the CFO does.

Key Takeaways

The demo rewards capability and punishes nothing; production has a bill, a latency budget, a privacy posture, and a pager. Decisions made in the first environment routinely fail in the second.
The enemy is the frontier-model reflex: defaulting to the largest model because it wins the demo, while ignoring cost, latency, privacy, reliability, and margin.
The thesis: the best model is the smallest system that solves the task reliably inside the product's constraints. "System," not "model," and "reliably," not "in the demo."
Capability is only valuable after it survives latency, cost, privacy, and reliability. A model smarter than you need can still be the wrong choice.
This book is not anti-frontier, not a leaderboard book, not a fine-tuning tutorial, and not an open-weights manifesto. It is about choosing, adapting, evaluating, deploying, routing to, and maintaining smaller systems without ideology.