The CRO's case for shipping smaller models

Revenue rarely rewards the biggest model. It rewards the one you can afford to run, ship, and explain to a customer.

I have been in enough board rooms and enough inference billing conversations to tell you the uncomfortable truth about the AI hype cycle: the companies winning margin on AI-native products are not running frontier models on every request. They are running the smallest model that gets the job done acceptably, on a pipeline designed to escalate only when the small model cannot handle it. The frontier model gets the press release. The small model pays the rent.

This is not a contrarian take for its own sake. I have lived it at Devlyn, where we have built customer-facing AI into a retail experience that touches real people in stores, trying on eyewear, making a considered purchase. Every latency millisecond matters. Every token costs money. Every time a model says something wrong, a human employee has to fix it in front of a customer who just wanted help picking frames. The pressure to ship something correct, fast, and cheap is not theoretical. It is daily.

So when I see teams doing their AI strategy by ranking models on a benchmark leaderboard and picking the highest number, I know exactly where that road leads. It leads to a product that is technically impressive, operationally unsustainable, and commercially marginal. The leap from a research demo to a gross-margin-positive product almost always runs through a smaller model than you started with.

Key takeaway: The smallest model that clears the bar for your task usually wins on margin, latency, and control, not the highest score on a general benchmark.
Smallness is fit, not weakness. The right size covers the task distribution you actually ship against, fits your latency budget, runs in memory you control, and stays explainable.
Narrow the task before you grow the model. Decomposing a vague task into specific subtasks beats model-shopping and routinely cuts cost with no drop in user-facing quality.
Route small first, escalate rarely. A cascade runs every request through the small model and pays frontier prices only for the genuinely hard tail.
"Good enough" is an eval result, not a vibe. Defensible model selection rests on an eval suite that reports errors by failure mode and severity, plus the cost delta.

The frontier gets headlines; the small model gets the margin

Let me start with the arithmetic, because strategy without numbers is just opinion. As of mid-2025, a top-tier frontier model through a major inference API typically costs roughly $10-$30 per million output tokens. A capable mid-size model (something in the 7B-20B parameter range, hosted or self-deployed) costs roughly an order of magnitude less. A fine-tuned, task-specific small model running on your own hardware or a dedicated endpoint can come in another 5-20x cheaper than that.

// Rough unit economics sketch: illustrative, not exact
frontier_cost_per_call    = $0.04    // ~2k tokens in+out at ~$20/M
mid_model_cost_per_call   = $0.004   // same volume, ~$2/M hosted
small_model_cost_per_call = $0.0005  // fine-tuned, self-hosted endpoint

// At 500k calls/month:
frontier_monthly    = $20,000
mid_model_monthly   = $2,000
small_model_monthly = $250

// Gross margin at $0.10 revenue per call ($50k/month):
frontier_gm    = ($50k - $20k) / $50k = 60%
small_model_gm = ($50k - $250) / $50k = 99.5%

That arithmetic is not subtle. You can quibble with specific numbers, but the order-of-magnitude differences are real and durable. The gap between a frontier model and a well-deployed small model on a narrow task is not the gap between good and mediocre. It is frequently the gap between a company that can raise a Series B and a company that cannot explain its unit economics to an investor.

This is what I mean when I say the frontier gets headlines and the small model gets the margin. GPT-4-class performance on a general benchmark does not tell you anything useful about whether a model can classify your customer's lens prescription query correctly 97% of the time at $0.0003 per call. A well-trained specialist beats a brilliant generalist on a narrow task, every time, once you factor in the full operational picture. The short version is: generality is expensive, and most production workloads do not need generality.

Smallness is not about parameter count, it is about fit

I want to be precise about what "small model" means, because teams often conflate it with "cheap model" or "dumb model" and then use that conflation to justify staying on the frontier. Smallness is a fitness concept, not a quality ceiling.

A model is the right size when:

It covers the task distribution you actually ship against, not some imagined worst case that occurs 0.1% of the time. If your use case is extracting structured data from customer intake forms, a fine-tuned 7B model will often outperform a general 70B model, because the fine-tuned model has seen thousands of your specific form variations and learned the extraction schema cold. The general model is trying to solve a harder problem than the one you have.

It fits within your latency budget. Users in a retail environment will tolerate roughly 1.5 to 2 seconds for an AI response before it starts feeling broken. A frontier model, even well-hosted, frequently cannot stay under that ceiling on complex prompts, especially with long context. A small model running on an endpoint with sub-200ms time-to-first-token gives you room to build a real UX. Latency is a product quality metric, not just an engineering metric, and it has direct revenue consequences in conversion-rate-sensitive environments.

It fits in memory you can control. Self-hosted small models, quantized to 4-bit or 8-bit, can run on commodity GPU hardware you own or rent dedicated. That means no rate limits from a shared API, no surprise pricing changes, no service terms that change overnight. For a company building a product on top of AI inference, control over the stack is a competitive moat. Dependency on a single frontier API is a vendor risk you are carrying at high cost.

It is explainable. This one gets underweighted constantly. When a customer in a Devlyn store gets a recommendation they do not trust, the employee needs to be able to explain why the system said what it said. "Our AI analyzed your face shape, lighting conditions, and stated preferences and ranked these frames" is an explanation. "A 200-billion-parameter model predicted this token sequence" is not. Smaller, task-specific models with tighter prompt scaffolding tend to produce outputs that are easier to trace back to inputs. That traceability matters to customers who have been burned before, and in 2025, more of them have been burned than have not.

Task-narrowing beats model-shopping

The most common mistake I see teams make is treating model selection as the primary lever for quality improvement. They are unhappy with an output, so they swap to a bigger model. The bigger model costs more and is sometimes slower, but the output is marginally better. They declare victory. Then the next edge case surfaces and they shop for a bigger model again. Within six months they are on the largest available model, costs have doubled, and the quality problems are still there, because the problems were never about model capacity. They were about task definition.

Task-narrowing is the discipline of making the problem smaller and more specific before you make the model bigger. It is harder than model-shopping. It requires you to actually understand your task distribution: what inputs you will receive, what outputs you need, and what failure modes you cannot tolerate. It requires labeling data and building evals and being honest with yourself about where the system is failing and why. But it consistently produces better results at lower cost than the next size up on the leaderboard.

The mechanics of task-narrowing look like this: you take a vague task, "help customers find the right eyewear", and decompose it into specific subtasks: classify intent, extract stated preferences, map to product attributes, rank candidates, generate explanation. Each subtask has a narrower input distribution and a clearer success criterion than the original. Each subtask can be modeled separately. And once you have separated the tasks, you discover that most of them are solved competently by models that are not frontier-class.
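As a rough illustration of that decomposition, the five subtasks can be written as separate steps in a pipeline. Everything in this sketch is hypothetical: the `CATALOG`, the `Request` fields, and the keyword rules are toy stand-ins for what would be small, separately evaluated models in a real system.

```python
from dataclasses import dataclass, field

# Hypothetical catalog; a real system would query a product database.
CATALOG = [
    {"sku": "A1", "shape": "round", "material": "metal", "price": 120},
    {"sku": "B2", "shape": "square", "material": "acetate", "price": 90},
    {"sku": "C3", "shape": "round", "material": "acetate", "price": 60},
]
KNOWN = {"shape": ["round", "square"], "material": ["metal", "acetate"]}

@dataclass
class Request:
    text: str
    intent: str = ""
    preferences: dict = field(default_factory=dict)
    candidates: list = field(default_factory=list)
    explanation: str = ""

def classify_intent(req):
    # Stand-in for a small intent classifier.
    req.intent = "find_frames" if "frame" in req.text.lower() else "other"
    return req

def extract_preferences(req):
    # Stand-in for a small extraction model: plain keyword matching.
    words = req.text.lower().split()
    for key, options in KNOWN.items():
        for option in options:
            if option in words:
                req.preferences[key] = option
    return req

def map_to_attributes(req):
    # Keep only catalog items that match every stated preference.
    req.candidates = [p for p in CATALOG
                      if all(p.get(k) == v for k, v in req.preferences.items())]
    return req

def rank_candidates(req):
    # Toy ranking rule: cheapest first.
    req.candidates.sort(key=lambda p: p["price"])
    return req

def generate_explanation(req):
    # Stand-in for the one step that might warrant a larger model.
    if req.candidates:
        top = req.candidates[0]
        req.explanation = f"Recommended {top['sku']}: matches {req.preferences}."
    else:
        req.explanation = "No frames match the stated preferences."
    return req

PIPELINE = [classify_intent, extract_preferences, map_to_attributes,
            rank_candidates, generate_explanation]

def run_pipeline(text):
    req = Request(text)
    for step in PIPELINE:
        req = step(req)
    return req
```

Because each step has its own narrow input and output, each can be evaluated, swapped, or resized on its own without touching the others.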

I have seen teams cut inference costs by 70% through task decomposition alone, with no change in user-facing quality. The frontier model gets retained for one or two subtasks where it genuinely earns its cost. Everything else runs on smaller, faster, cheaper models. In Defense of Small Models walks through this decomposition methodology in detail; it is one of the frameworks I return to whenever a team tells me they "need GPT-4" for something.

Cascade and routing: small first, escalate only when needed

Task decomposition leads naturally to the question of routing. Once you accept that different subtasks warrant different models, you need an architecture for deciding which model handles which request. This is model routing, and it is one of the highest-leverage infrastructure decisions a team can make.

The simplest version of routing is a cascade: run every request through the small model first, check the output against a quality or confidence criterion, and escalate to a larger model only when the small model fails the check. Most requests never escalate. You pay frontier prices only for the tail of genuinely hard cases.
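To make the economics of a cascade concrete, here is a minimal sketch of the blended cost per call. It assumes every request pays for the small model and only escalated requests additionally pay for the large one. The function names and the 10% escalation rate in the usage note are assumptions for illustration; the per-call costs and revenue reuse the rough figures from the unit economics sketch earlier.

```python
def blended_cost_per_call(small_cost: float, large_cost: float,
                          escalation_rate: float) -> float:
    """Expected cost per call in a two-tier cascade.

    Every call runs the small model; a fraction `escalation_rate`
    of calls also runs the large model.
    """
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return small_cost + escalation_rate * large_cost


def gross_margin(revenue_per_call: float, cost_per_call: float) -> float:
    """Gross margin as a fraction of revenue."""
    return (revenue_per_call - cost_per_call) / revenue_per_call


# Illustrative inputs: $0.0005 small, $0.04 frontier, $0.10 revenue per call,
# and an assumed 10% escalation rate.
cost = blended_cost_per_call(0.0005, 0.04, 0.10)
margin = gross_margin(0.10, cost)
```

With those inputs the blended cost works out to roughly $0.0045 per call and the margin to about 95.5%, which is why the escalation rate, not the frontier price, becomes the number worth watching.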

The criterion for escalation can be as simple as a confidence score from the small model, a length or complexity check on the input, or a lightweight classifier trained to predict when the small model will struggle. The more sophisticated version involves multiple tiers (a small local model, a mid-size hosted model, a frontier model), with escalation logic tuned to your task distribution. I have covered the design space for this in Model Routing; the key insight is that the routing logic itself is usually cheap and fast, and even crude routing rules capture most of the savings.
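A minimal sketch of the simplest criterion, a confidence threshold, might look like the following. It assumes each model is a callable that returns an answer plus a confidence in [0, 1]; that interface, the `0.8` default threshold, and the toy models are all hypothetical, chosen only so the example runs.

```python
from typing import Callable, Tuple

# Assumed interface: a model is any callable returning (answer, confidence).
ModelFn = Callable[[str], Tuple[str, float]]


def cascade(prompt: str, small: ModelFn, large: ModelFn,
            min_confidence: float = 0.8) -> Tuple[str, str]:
    """Return (answer, tier), escalating only when the small model is unsure."""
    answer, confidence = small(prompt)
    if confidence >= min_confidence:
        return answer, "small"
    # Escalation path: the large model is paid for only on low-confidence cases.
    answer, _ = large(prompt)
    return answer, "large"


# Toy stand-ins so the sketch runs; replace with real model calls.
def toy_small(prompt: str) -> Tuple[str, float]:
    hard = "prescription" in prompt.lower()
    return ("small-model answer", 0.4 if hard else 0.95)


def toy_large(prompt: str) -> Tuple[str, float]:
    return ("large-model answer", 0.99)
```

In practice the confidence signal has to be calibrated against your evals before the threshold means anything; a raw model score is only a starting point.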

At Devlyn, we do not route on a single binary. We route on a combination of factors: the estimated complexity of the customer query, the confidence score from the initial classification pass, and the business context of the interaction (a customer who has been in the store for forty minutes and is close to a decision gets a different resource allocation than a first-touch browser). The result is that the overwhelming majority of interactions never touch our most expensive models. The customers who get the frontier model experience are the ones where it genuinely changes the outcome, and they are a small fraction of total volume.
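To show the general shape of multi-factor routing, here is a hedged sketch that combines the three kinds of signal described above into a single score. This is not Devlyn's production logic: the `Signals` fields, the weights, and the thresholds are invented for illustration and would need tuning against real traffic and evals.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    complexity: float       # estimated query complexity, 0..1 (hypothetical scale)
    confidence: float       # confidence of the initial classification pass, 0..1
    minutes_in_store: int   # business context
    near_decision: bool     # business context


def choose_tier(s: Signals) -> str:
    """Pick a model tier from a combined score; thresholds are illustrative."""
    score = s.complexity + (1.0 - s.confidence)
    # A long, high-intent visit justifies spending more on the response.
    if s.near_decision and s.minutes_in_store >= 30:
        score += 0.5
    if score >= 1.5:
        return "frontier"
    if score >= 0.8:
        return "mid"
    return "small"
```

A first-touch browser with a simple, confidently classified question stays on the small tier, while a complex, uncertain query from a customer close to a decision is routed to the most capable one.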

This is what "outcomes over velocity" means in practice at Devlyn. We are not racing to deploy the biggest model first. We are building infrastructure that allocates the right resource to the right moment. That takes longer to design than plugging in an API key, but it produces a product that can scale without the economics getting worse as volume grows.

Proving "good enough" is an engineering discipline

The phrase "good enough" sounds like a concession. In an engineering context, it is a specification. You cannot decide whether a smaller model is good enough unless you have defined what "good" means and built the machinery to measure it.

Evals are the mechanism. A real eval suite for a production AI feature looks like: a held-out dataset of representative inputs, labeled with the outputs you want, with failure modes categorized by type and severity. You run every candidate model against the suite. You look at the distribution of errors, not just the headline accuracy number, because a model that fails 5% of the time on your most critical edge cases is categorically worse than a model that fails 8% of the time on low-stakes requests, even if the latter has a lower overall score.
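A minimal sketch of such a report, assuming each eval case is a dict tagged with the failure mode it probes and a severity label, could look like this. The field names (`input`, `expected`, `failure_mode`, `severity`) and the exact-match comparison are assumptions; real suites often need fuzzier scoring.

```python
from collections import Counter
from typing import Callable, Dict, List


def eval_report(model: Callable[[str], str], cases: List[Dict]) -> Dict:
    """Run a model over labeled cases and break errors down by mode and severity."""
    if not cases:
        raise ValueError("eval suite is empty")
    errors = Counter()
    wrong = 0
    for case in cases:
        # Exact match is the simplest possible scorer; swap in your own.
        if model(case["input"]) != case["expected"]:
            wrong += 1
            errors[(case["failure_mode"], case["severity"])] += 1
    return {
        "accuracy": 1 - wrong / len(cases),
        "errors": dict(errors),  # {(failure_mode, severity): count}
    }
```

Running every candidate model through the same function gives you comparable error breakdowns, so a model with a slightly lower headline accuracy but no high-severity failures can be chosen on evidence.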

This is not glamorous work. It requires domain expertise to label data correctly, it requires consensus on what failure modes matter most, and it requires the discipline to maintain and expand the suite as the task distribution shifts. Most teams skip it, then are surprised when the model they chose in a two-hour vibe-check session behaves badly in production.

The business reason to do this work is that it is the only defensible basis for a model selection decision. When your CEO asks why you are running a small model on the customer recommendation flow and not GPT-5, the answer cannot be "it seemed fine in testing." It has to be "we ran our eval suite on both models, here are the results by failure mode, here is the cost delta, and here is the confidence interval on quality." That is a conversation that moves at board level. Gut feeling does not.
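One common way to put a confidence interval on a pass rate is the Wilson score interval. The sketch below is an assumption about how you might compute it, not a prescribed method, and it treats eval cases as independent trials, which is itself an approximation.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% coverage."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half
```

For a model that passes 90 of 100 cases, the interval runs from roughly 0.83 to 0.94. If two candidates' intervals overlap heavily, the eval suite is probably too small to justify paying more for the bigger one.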

Quantization deserves a section in that analysis. A model quantized from 16-bit precision to 4-bit needs roughly one-fourth the memory for its weights and often runs meaningfully faster. On most tasks, the quality degradation is small enough to fall within the noise of your eval suite. But "most tasks" is not "all tasks": quantization tends to hurt most on tasks requiring careful numerical reasoning, long-context faithfulness, and fine-grained instruction following. Know your task before you quantize. Run the evals both ways. The savings are often real; the quality drop is sometimes real too. The only way to know is to measure.
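The memory claim is simple arithmetic: weight memory is roughly parameter count times bits per weight, divided by eight to get bytes. The sketch below covers weights only and deliberately ignores the KV cache, activations, and the small overhead that quantization formats add for scales; the function name and the 7B example are illustrative.

```python
def weight_memory_gb(num_params: int, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Illustrative 7B-parameter model at three precisions.
fp16_gb = weight_memory_gb(7_000_000_000, 16)  # 14.0
int8_gb = weight_memory_gb(7_000_000_000, 8)   # 7.0
int4_gb = weight_memory_gb(7_000_000_000, 4)   # 3.5
```

At 4-bit, the weights of a 7B model fit comfortably on a single commodity GPU, which is what makes the self-hosting argument practical.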

Privacy, local inference, and the customer who was burned

There is a dimension of small models that does not show up in benchmark comparisons but shows up everywhere in enterprise sales: data sovereignty. When you run a small model on your own infrastructure (on-premises, in your VPC, or on a device), customer data never leaves your perimeter. It does not transit a third-party API. It does not appear in training pipelines. It does not create contractual ambiguity about data residency.

In healthcare, in financial services, in retail environments with loyalty programs and rich customer profiles, this matters enormously. I have been in sales conversations where a procurement team killed a deal not because the product was wrong but because the inference architecture required sending customer data to a hosted API that the legal team would not approve. A local small model closed the deal.

There is also a quieter version of this issue: customers who have been burned by AI before are increasingly skeptical of what happens to their data. A customer who asks "does this thing remember what I told it last time I was in the store?" deserves an honest, clear answer. That answer is easier to give if the inference stack is under your control and not a black box sitting in someone else's cloud.

Local inference also enables offline-capable products: a retail kiosk that works when the store's WiFi is flaky, or a field service tool that runs in a warehouse without reliable connectivity. These are not edge cases. They are real operational constraints in a large fraction of physical-world AI deployments. A small model you can ship to a device solves them. A frontier API does not.

The CRO's summary: build for the margin, not the demo

I want to close with the revenue frame, because that is the frame that ultimately determines whether any of this gets funded and sustained.

Every AI feature has a cost structure and a value structure. The cost structure includes inference costs, maintenance overhead, human-in-the-loop costs for errors, and latency penalties on conversion. The value structure includes incremental revenue per interaction, churn reduction, customer satisfaction lift, and employee productivity. Your job as an operator is to maximize the spread between those two curves, not to maximize the capability of the model you are using.

The frontier model maximizes capability but adds heavily to cost. In a narrow task, that added cost rarely pays off. The value does not scale with model size once you are above the capability threshold for the task. Capability above threshold is waste. And waste compounds at scale in ways that can turn a good unit-economics story into a bad one as volume grows.

The smaller model, properly evaluated, properly fine-tuned, properly routed, hits the capability threshold at a fraction of the cost. The margin it generates is real and defensible. The latency it delivers is a product quality improvement. The explainability it allows is a sales and trust asset. And the control it provides over the inference stack is a competitive moat that gets harder to replicate as you accumulate operational experience running it.

None of this means you never use a frontier model. It means you treat the frontier model like the expensive specialist it is: you engage it for the cases where nothing else will do, and you build a system that contains those cases rather than defaulting to them. That system (the routing, the evals, the fine-tuning, the task decomposition) is the real AI product. The model is an input to it.

The companies that internalize this early will have gross margins in five years that companies still defaulting to frontier-everything will envy. The frontier will keep getting better. It will also stay expensive relative to what you actually need. The discipline to know the difference, to measure it, route around it, and build margin from it, is the CRO's case for smaller models. And it is the case I will keep making until the field catches up.

Frequently asked questions

Are smaller language models just worse than large ones?

No. Smallness is a question of fit, not a quality ceiling. On a narrow task with a well-understood input distribution, a fine-tuned small model frequently outperforms a much larger general model, because it has been trained on the exact problem you have rather than a harder, broader one. The capability you need is whatever clears the bar for the task; everything above that threshold is cost without value.

When should I still reach for a frontier model?

Treat the frontier model like the expensive specialist it is: engage it for the genuinely hard cases where nothing smaller will do, and build a system that contains those cases rather than defaulting to them. In a routed pipeline, those cases are typically a small fraction of total volume.

What is model routing, or a cascade?

A cascade runs every request through the small model first, checks the output against a confidence or quality criterion, and escalates to a larger model only when the small model fails the check. Most requests never escalate, so you pay frontier prices only for the tail of hard cases. The routing logic itself is usually cheap and fast, and even crude rules capture most of the savings.

If you are working through model selection, routing, and evals for an AI-native product and want help building the system rather than just picking a model, the Devlyn team works on exactly this.