LLM Model Routing: Cheapest Model That Can Do the Job
LLM model routing sends each request to the cheapest model that can handle it, escalating only when needed. Here is how it cuts cost without cutting quality.
LLM model routing is the practice of putting a decision layer in front of a pool of models so each request goes to the cheapest one that can answer it acceptably, and only the genuinely hard requests pay frontier prices. It cuts cost because most production traffic is easy, the price gap between a cheap model and a frontier model is roughly 100x per token in 2026, and you stop paying the top rate on the 80% of requests that never needed it.
I run revenue at Devlyn, and I have signed off on enough inference invoices to tell you where the money actually goes. It does not go to the hard problems. It goes to running a model that costs $25 per million output tokens on requests a model that costs $5 could have answered the same way. Routing is how you stop doing that. This piece is the routing chapter of the larger cost story; for the full picture, start with my guide to LLM inference cost.
Key takeaways
If you read nothing else, these are the load-bearing claims:
- Model routing pays frontier prices only for the hard tail. A decision layer sends easy requests to cheap models, so you stop overpaying on the bulk of traffic that never needed the expensive model.
- The savings are large and well documented. The RouteLLM work reported roughly 85% cost reduction on one benchmark while keeping 95% of GPT-4 quality, using the strong model on only 14% of queries.
- A cascade and an upfront router are different bets. A cascade tries the cheap model first and escalates on failure; an upfront router classifies the request and picks the model before generating. Each fails differently.
- Every router is an eval problem in disguise. You route on a quality or confidence signal, and if that signal is wrong, the product degrades quietly while the dashboard looks fine.
- The cheapest rule that captures most of the savings beats the elegant router you cannot debug. Crude routing logic is usually fast, cheap, and good enough.
What LLM model routing actually is
Picture your application as something that talks to a menu of models rather than one model. The menu spans a frontier model, a mid-tier model, a cheap fast model, and maybe a small self-hosted model you control. Routing is the policy that picks one of them per request. That is the whole idea, and it is worth understanding before you reach for any product that sells it.
The reason routing works is that production traffic is lopsided. A support assistant gets a hundred "where is my order" questions for every one that needs real reasoning. A coding tool gets a flood of small completions and a handful of genuine architecture questions. If you send all of it to a frontier model, you pay the top rate on every request, including the ones a far cheaper model would have handled identically.
The price gap is what makes this matter. As of 2026, the spread between the cheapest capable models and the most expensive frontier models runs to roughly 100x on input tokens. Even inside a single vendor's lineup the gap is material: Anthropic's published rates put Claude Haiku 4.5 at $1 per million input tokens and Opus 4.8 at $5, with output tokens at $5 and $25 respectively, a 5x spread on both sides (Anthropic pricing). Once gaps like these exist, sending an easy request to the expensive model is not a rounding error. It is the bill.
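To make the arithmetic concrete, here is a minimal sketch of the blended price under routing. It uses the $5 and $25 output-token prices quoted above; the 20% frontier share is an assumed traffic mix for illustration, and the function names are illustrative rather than taken from any library.

```python
def blended_price(cheap: float, frontier: float, frontier_share: float) -> float:
    """Average price per million tokens when `frontier_share` of tokens hit the frontier model."""
    if not 0.0 <= frontier_share <= 1.0:
        raise ValueError("frontier_share must be between 0 and 1")
    return (1.0 - frontier_share) * cheap + frontier_share * frontier


def savings_vs_frontier_only(cheap: float, frontier: float, frontier_share: float) -> float:
    """Fractional saving relative to sending every token to the frontier model."""
    return 1.0 - blended_price(cheap, frontier, frontier_share) / frontier


# Output-token prices from the text: $5 (cheap) and $25 (frontier) per million.
# The 20% frontier share is an assumption, not a measured number.
price = blended_price(5.0, 25.0, 0.20)
saving = savings_vs_frontier_only(5.0, 25.0, 0.20)
print(f"blended ${price:.2f} per million output tokens, saving {saving:.0%}")
```

Under that assumed mix the blended rate is $9 per million output tokens against $25 frontier-only, a 64% saving. Swap in your own share; the saving shrinks quickly as more traffic needs the expensive model.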
Routing is the cheap leg of the same strategy I make the case for in shipping smaller models. The small model handles the bulk; the router decides when the bulk is not enough. If you are early in the cost work, routing pairs naturally with prompt caching, which attacks the same bill from the input side. Want a team to build the routing layer rather than bolt one on later? That is what the Devlyn engineering team does.
The routing strategies that matter, and how each decides
There are four routing strategies you will actually see in production, and they differ mainly in how the decision gets made and when. Knowing which is which separates a router you can reason about from one that surprises you in a board meeting.
Rules. The simplest router is a set of hand-written conditions: requests over a token threshold go to the big model, requests matching a known cheap pattern go to the small one. Rules cost essentially nothing to run, under a millisecond, and you can read them. They are crude, they miss nuance, and they are the right place to start because they capture a surprising amount of the savings before you have built anything fancy.
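A rules router of this kind can be sketched in a few lines. Everything specific here is an assumption for illustration: the model names are placeholders, and the token threshold and pattern list would come from looking at your own traffic.

```python
from dataclasses import dataclass

CHEAP_MODEL = "cheap-model"        # placeholder name
FRONTIER_MODEL = "frontier-model"  # placeholder name

# Assumed values; tune them against real traffic.
MAX_CHEAP_TOKENS = 2000
CHEAP_PATTERNS = ("where is my order", "reset my password", "opening hours")


@dataclass
class Request:
    text: str
    token_count: int


def route_by_rules(req: Request) -> str:
    """Return the model name for a request using readable, hand-written rules."""
    # Rule 1: requests over the token threshold go to the big model.
    if req.token_count > MAX_CHEAP_TOKENS:
        return FRONTIER_MODEL
    # Rule 2: requests matching a known cheap pattern go to the small model.
    lowered = req.text.lower()
    if any(pattern in lowered for pattern in CHEAP_PATTERNS):
        return CHEAP_MODEL
    # No rule matched: fail safe to the frontier model.
    return FRONTIER_MODEL
```

The default is the design choice that matters. Falling back to the frontier model protects quality and gives up savings; falling back to the cheap model does the opposite. Starting with the safe default and widening the cheap patterns as the eval allows is one reasonable way to begin.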
Cascade. A cascade runs the cheap model first, checks the output against a confidence or verification signal, and escalates to a bigger model only when the cheap one fails the check. Most requests never escalate, so you pay frontier prices only on the tail. The academic survey on dynamic routing and cascading frames this as the sequential-escalation pattern, distinct from upfront routing (arXiv survey). The catch: a cascade that escalates often pays for two calls on the same request.
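The cascade pattern can be sketched as follows. The two models and the check are stubs so the example runs offline; in a real system they would be API calls and a verification step of your choosing, and the names here are illustrative.

```python
from dataclasses import dataclass
from typing import Callable

Model = Callable[[str], str]
Check = Callable[[str, str], bool]


@dataclass
class CascadeResult:
    answer: str
    model_used: str
    calls_made: int  # 2 means this request was billed twice


def cascade(prompt: str, cheap: Model, frontier: Model, passes_check: Check) -> CascadeResult:
    """Try the cheap model first; escalate only if its answer fails the check."""
    draft = cheap(prompt)
    if passes_check(prompt, draft):
        return CascadeResult(draft, "cheap", 1)
    return CascadeResult(frontier(prompt), "frontier", 2)


# Stub models and a stub check, standing in for real calls.
def cheap_stub(prompt: str) -> str:
    return "Your order ships tomorrow." if "order" in prompt else "I am not sure."


def frontier_stub(prompt: str) -> str:
    return f"Detailed answer to: {prompt}"


def not_a_refusal(prompt: str, answer: str) -> bool:
    return "not sure" not in answer.lower()


easy = cascade("Where is my order?", cheap_stub, frontier_stub, not_a_refusal)
hard = cascade("Why does this lock deadlock?", cheap_stub, frontier_stub, not_a_refusal)
print(easy.model_used, easy.calls_made)
print(hard.model_used, hard.calls_made)
```

Tracking `calls_made` per request is the cheap way to see double-billing: the share of results with two calls is your escalation rate.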
Classifier or predictive router. Here a lightweight model looks at the incoming request and predicts which model should handle it, before any generation happens. The RouteLLM work trained such routers on preference data and reported that a matrix-factorization router hit 95% of GPT-4 quality while sending only 14% of queries to the strong model (LMSYS). This is the most powerful approach and the most work to maintain.
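This sketch does not reproduce RouteLLM's trained routers. It only shows the generic shape a predictive router tends to take: some scorer assigns each request a difficulty score, and a threshold calibrated on sampled traffic decides what share reaches the strong model. The scores below are made up, and the function names are hypothetical.

```python
def calibrate_threshold(sample_scores: list[float], strong_share: float) -> float:
    """Pick the cutoff so roughly `strong_share` of the sample routes to the strong model."""
    if not sample_scores:
        raise ValueError("need at least one score to calibrate")
    if not 0.0 < strong_share <= 1.0:
        raise ValueError("strong_share must be in (0, 1]")
    ranked = sorted(sample_scores, reverse=True)
    k = max(1, round(strong_share * len(ranked)))
    return ranked[k - 1]


def route(score: float, threshold: float) -> str:
    """Send requests scoring at or above the threshold to the strong model."""
    return "strong" if score >= threshold else "weak"


# Hypothetical router scores for ten sampled requests (higher means harder).
sample = [0.05, 0.10, 0.12, 0.20, 0.25, 0.31, 0.40, 0.55, 0.80, 0.93]
threshold = calibrate_threshold(sample, strong_share=0.20)
decisions = [route(s, threshold) for s in sample]
print(threshold, decisions.count("strong"))
```

The threshold is the knob that trades cost against quality, which is why it has to be recalibrated whenever the traffic mix shifts.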
Semantic router. A semantic router embeds the request and matches it against clusters of known query types, routing by meaning rather than rules. It sits between hand rules and a trained classifier on both cost and capability. Embedding-based routing adds roughly 5ms; a heavier ML classifier adds 50 to 100ms. Against typical LLM response times of 500 to 2000ms, the embedding step is about 1% of the call or less, while the heavier classifier runs from about 2.5% to 20% depending on which ends of those ranges you land on (DigitalApplied).
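Here is a minimal sketch of semantic routing by nearest example. The `embed` function is a toy bag-of-words stand-in for a real embedding model, and the example queries and the 0.3 similarity floor are assumptions chosen only to make the sketch run.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Known query types per target model (illustrative examples).
ROUTES = {
    "cheap": ["where is my order", "track my package", "change my delivery address"],
    "frontier": [
        "design a database schema for multi tenant billing",
        "explain why this distributed lock can deadlock",
    ],
}
MIN_SIMILARITY = 0.3  # assumed floor; below it the request counts as novel


def semantic_route(text: str) -> str:
    """Route to the model whose examples are closest; novel requests go to frontier."""
    query = embed(text)
    best_model, best_score = "frontier", 0.0
    for model, examples in ROUTES.items():
        for example in examples:
            score = cosine(query, embed(example))
            if score > best_score:
                best_model, best_score = model, score
    return best_model if best_score >= MIN_SIMILARITY else "frontier"
```

The similarity floor is what handles the main risk of this strategy: a request that resembles nothing known is treated as hard rather than forced into the nearest cluster.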
A routing-strategy table you can paste into a deck
Here is the same set of strategies side by side: how each one decides, the kind of savings it tends to deliver, and what it puts at risk. Sourced figures are marked; the rest are illustrative ranges from typical deployments.
| Strategy | How it decides | Typical savings | Main risk |
|---|---|---|---|
| Rules | Hand-written conditions on length, pattern, or metadata | 20-40% (illustrative) | Misses nuance; brittle as traffic shifts |
| Cascade | Cheap model first; escalate on confidence or verification failure | 40-85% (illustrative) | Double-billing when escalation is frequent |
| Classifier / predictive | Trained model picks the target before generating | 45-85% (RouteLLM: ~85% on MT-bench, ~45% on MMLU) | Drift; needs labeled data and retraining |
| Semantic | Embed request, match to known query clusters | 30-60% (illustrative) | Mis-clusters novel or ambiguous requests |
The RouteLLM numbers are real and worth internalizing: roughly 85% cost reduction on MT-bench, 45% on MMLU, and 35% on GSM8K versus a frontier-only baseline, all while retaining about 95% of GPT-4 quality (LMSYS). The variation across benchmarks is the real lesson: routing savings depend entirely on how easy your actual traffic is, so your number is your own, not the paper's.
Build or buy your router
The market is full of managed LLM router products and gateways that promise to do this for you, and they are not wrong to. The question is whether the convenience is worth giving up the thing you most need, which is the ability to understand and debug why a request went where it went.
Build when your routing logic is simple enough to own. A rules layer or a cheap-first cascade with a confidence threshold is a few hundred lines of code and a config file. You can read it, test it, and change it without a vendor relationship. For most teams starting out, this is the right call, because the crude version captures most of the savings and teaches you what your traffic actually looks like.
Buy when routing has become its own discipline. If you are running a trained classifier that needs labeled data, retraining, and monitoring, a managed router that maintains the model and the eval loop can be cheaper than a half-time engineer doing it badly. The honest version of build-vs-buy is a cost-of-ownership question, not a feature comparison. Count the engineer hours, not the API line item.
The pattern I have watched succeed: start with a rule, graduate to a cascade, and only reach for a trained or managed router once you have proof the simpler version is leaving real money on the table. I walk through the full decision space in my book on model routing, including why the model's own confidence is a worse escalation signal than people expect.
The eval problem hiding inside every router
Here is the part most routing guides skip, and it is the part that decides whether routing helps or quietly hurts. Every router decides based on a signal: a confidence score, a classifier's prediction, a verification check. That signal is a claim about quality. If the claim is wrong, routing sends hard requests to the cheap model and the product gets worse while every cost dashboard turns green.
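One way to test whether a confidence signal deserves trust is to bucket past requests by confidence and measure accuracy inside each bucket. The sketch below assumes you already have graded records of (confidence, was_correct); the six records are invented to show the failure shape, not taken from any real system.

```python
from collections import defaultdict


def accuracy_by_confidence(records, n_buckets: int = 5) -> dict[int, float]:
    """Group (confidence, was_correct) records into equal-width buckets and score each.

    Bucket 0 is the lowest confidence band, bucket n_buckets - 1 the highest.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for confidence, was_correct in records:
        bucket = min(int(confidence * n_buckets), n_buckets - 1)
        totals[bucket] += 1
        hits[bucket] += int(was_correct)
    return {bucket: hits[bucket] / totals[bucket] for bucket in sorted(totals)}


# Invented records: the most confident answers are mostly wrong.
records = [
    (0.95, False), (0.92, True), (0.90, False),
    (0.55, True), (0.50, True),
    (0.15, False),
]
table = accuracy_by_confidence(records)
print(table)
```

If accuracy does not rise with the bucket index, the signal is not measuring quality, and a threshold on it will route hard requests to the cheap model with a straight face.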
You cannot route on quality you cannot measure. Before a router is trustworthy, you need a frozen, production-sampled eval set that tells you, per model, how often each one is actually right on your traffic. That is the same machinery I describe in my guide to LLM evaluation, and it is not optional for routing. It is the foundation the router stands on.
The specific trap is the cheap path going untested. Teams evaluate the frontier model carefully, ship the router, and never run the same eval on the cheap model the router now sends most traffic to. The result is a system optimized for a cost number nobody connected back to a quality number. Pick the right harness with help from my rundown of LLM evaluation tools, then run every model in the menu against the same frozen set.
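Running every model in the menu against the same frozen set can be as small as this. The eval set, the exact-match grader, and the lookup-table models are all stand-ins so the sketch runs offline; a real harness would call real models and would usually need a more forgiving grader.

```python
from typing import Callable

Model = Callable[[str], str]

# Illustrative frozen set of (prompt, expected answer) pairs.
FROZEN_SET = [
    ("capital of France", "paris"),
    ("2 + 2", "4"),
    ("largest planet", "jupiter"),
    ("chemical symbol for gold", "au"),
]


def exact_match(output: str, expected: str) -> bool:
    return output.strip().lower() == expected


def evaluate_all(models: dict[str, Model], eval_set: list[tuple[str, str]]) -> dict[str, float]:
    """Run every model on the same frozen set and return accuracy per model."""
    scores = {}
    for name, model in models.items():
        correct = sum(exact_match(model(prompt), expected) for prompt, expected in eval_set)
        scores[name] = correct / len(eval_set)
    return scores


# Stub models backed by lookup tables.
frontier_answers = {"capital of France": "Paris", "2 + 2": "4",
                    "largest planet": "Jupiter", "chemical symbol for gold": "Au"}
cheap_answers = {"capital of France": "Paris", "2 + 2": "4",
                 "largest planet": "Saturn", "chemical symbol for gold": "Au"}

scores = evaluate_all(
    {"frontier": lambda p: frontier_answers[p], "cheap": lambda p: cheap_answers[p]},
    FROZEN_SET,
)
print(scores)
```

The point of the shared set is the comparison: a cost number per model is only meaningful next to an accuracy number measured on identical prompts.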
A useful reframe: routing is not a cost feature with an eval attached. It is an eval system that happens to save money. Get the eval right and the savings are safe. Get it wrong and you are gambling with the product's quality to shave a bill.
Where routing quietly goes wrong
Routing failures are rarely loud. The system keeps responding, the cost line keeps dropping, and the damage shows up in churn and support volume weeks later. These are the failure modes I watch for.
- The confidence signal lies. A small model is often most confident exactly when it is wrong, so escalating only on low confidence can leave the worst answers unescalated. Test the signal; do not trust it.
- Double-billing on escalation. A cascade that escalates 40% of the time pays for two calls on those requests. If escalation is common, an upfront router that picks once can be cheaper than a cascade that picks twice.
- Drift. A classifier trained on last quarter's traffic routes this quarter's traffic worse every week. Without monitoring, you find out from the quality numbers, late.
- Routing overhead ignored or overstated. A trained classifier adds 50 to 100ms; on a p95-sensitive flow that can matter, and on a batch job it is noise. Know which one you are running before you optimize the wrong thing.
None of these are reasons not to route. They are reasons to instrument the routed pipeline so the failures are visible the day they start, not the month the revenue dips. Watching a routed system in production is squarely an AI observability and monitoring problem, not a one-time setup.
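The double-billing point can be put in numbers with a small cost model. This is a sketch under loud assumptions: costs are in relative units (cheap call = 1, frontier call = 5, matching the 5x Anthropic spread quoted earlier), and the upfront router is assumed to send exactly the same share of requests to the frontier model as the cascade would escalate, which is optimistic because real routers misclassify.

```python
def cascade_cost(cheap: float, frontier: float, escalation_rate: float) -> float:
    """Expected cost per request: the cheap call always runs, frontier runs on escalation."""
    return cheap + escalation_rate * frontier


def upfront_cost(cheap: float, frontier: float, frontier_share: float,
                 router_overhead: float = 0.0) -> float:
    """Expected cost per request when a router picks exactly one model up front."""
    return router_overhead + (1.0 - frontier_share) * cheap + frontier_share * frontier


def cascade_break_even(cheap: float, frontier: float) -> float:
    """Escalation rate above which a cascade costs more than frontier-only."""
    return 1.0 - cheap / frontier


CHEAP, FRONTIER = 1.0, 5.0  # relative units, assumed
at_40 = cascade_cost(CHEAP, FRONTIER, 0.40)
routed = upfront_cost(CHEAP, FRONTIER, 0.40)
limit = cascade_break_even(CHEAP, FRONTIER)
print(f"cascade {at_40:.2f}, upfront {routed:.2f}, break-even escalation {limit:.0%}")
```

Under these assumptions, a cascade escalating 40% of the time costs 3.0 units per request against 2.6 for an equally accurate upfront router, and the cascade stops beating frontier-only once escalation passes 80%. The gap between the two is the wasted cheap call on escalated requests, so it only justifies a router whose own overhead and error rate are smaller than that.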
Two short stories, with numbers
The numbers below are illustrative, drawn from the shape of real deployments rather than any specific client system, and they are NDA-safe.
The cascade that paid for itself in a week. A team running a support assistant entirely on a frontier model was spending roughly $18,000 a month at about 600,000 calls. They added a cheap-first cascade: the small model answered, a confidence check decided whether to escalate. About 78% of requests resolved on the small model. The bill fell to near $5,000, and the held-out eval showed no measurable quality drop on the routed traffic. The router was a config file and a threshold.
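The arithmetic behind that story can be checked directly. The inputs are the illustrative figures above; the one added assumption is that an escalated call costs the same as a call did on the frontier-only setup, which lets you back out what the cheap model must have cost per call.

```python
monthly_bill = 18_000.0    # frontier-only spend, from the story
calls = 600_000            # monthly calls, from the story
resolved_on_cheap = 0.78   # share that never escalated, from the story
new_bill = 5_000.0         # approximate bill after the cascade, from the story

frontier_cost_per_call = monthly_bill / calls
escalated_calls = round(calls * (1 - resolved_on_cheap))

# Assumption: escalated calls cost the same as the original frontier calls.
frontier_spend = escalated_calls * frontier_cost_per_call

# Whatever is left of the new bill is what the cheap model cost across all calls.
implied_cheap_spend = new_bill - frontier_spend
implied_cheap_cost_per_call = implied_cheap_spend / calls
saving = 1 - new_bill / monthly_bill

print(f"frontier: ${frontier_cost_per_call:.3f} per call, {escalated_calls} escalated")
print(f"frontier spend ${frontier_spend:,.0f}, implied cheap spend ${implied_cheap_spend:,.0f}")
print(f"implied cheap cost ${implied_cheap_cost_per_call:.4f} per call, saving {saving:.0%}")
```

The frontier model works out to $0.03 per call, the 132,000 escalated calls account for about $3,960 of the new bill, and the remaining $1,040 or so implies a cheap model at roughly $0.0017 per call. If your cheap model costs more than that, the same escalation rate produces a smaller saving.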
The classifier that quietly broke. A different team shipped a trained classifier router, saw a 60% cost drop, and celebrated. Three weeks later, support tickets climbed. The classifier had been trained before a product launch changed the traffic mix, and it was now sending a new category of hard questions to the cheap model with high confidence. Nobody had re-run the eval after launch. The fix was not a better router. It was a frozen eval set that ran on every model weekly, which would have caught the drift in days instead of weeks.
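A weekly check of the kind that would have caught this can be sketched as a comparison of per-category accuracy against a baseline. The category names, accuracy figures, and the 5-point tolerance are all assumptions for illustration; the accuracies would come from running the frozen eval set on the routed pipeline.

```python
def find_drift(baseline: dict[str, float], current: dict[str, float],
               max_drop: float = 0.05) -> list[str]:
    """Flag categories that are new or whose accuracy fell by more than `max_drop`."""
    alerts = []
    for category, accuracy in sorted(current.items()):
        if category not in baseline:
            alerts.append(f"{category}: new category, no baseline (accuracy {accuracy:.2f})")
        elif baseline[category] - accuracy > max_drop:
            alerts.append(
                f"{category}: accuracy fell from {baseline[category]:.2f} to {accuracy:.2f}"
            )
    return alerts


# Invented numbers: a launch adds a category and degrades an existing one.
baseline = {"billing": 0.95, "shipping": 0.97}
this_week = {"billing": 0.94, "shipping": 0.85, "new_feature": 0.60}

alerts = find_drift(baseline, this_week)
for alert in alerts:
    print(alert)
```

Flagging unseen categories separately matters as much as flagging drops: a new kind of request has no baseline to fall from, so a pure regression check would stay silent on exactly the traffic that broke the classifier in the story.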
The operator's frame
I will close where I started, on the revenue. Every AI feature has a cost curve and a value curve, and your job is to widen the gap between them. Routing widens it on the cost side without touching the value side, as long as the eval holds. That is rare and worth doing well.
The mistake is treating routing as a feature you turn on. It is infrastructure you operate. The router, the eval set behind it, and the monitoring on top of it are a system, and the system is what produces durable margin. Plugging in a managed router and walking away gets you the demo. Owning the eval loop gets you the margin that survives a traffic shift.
The teams that internalize this build a real cost advantage that compounds. Routing is one lever; it sits alongside smaller models, caching, and the rest of the toolkit in my guide to LLM inference cost. And routing inside an agent loop, where a single task fans out into many model calls, is where the savings get largest and the eval problem gets hardest, which I get into in my piece on the best AI agents.
Frequently asked questions
What is LLM model routing?
LLM model routing is a decision layer in front of a pool of models that sends each request to the cheapest model capable of answering it acceptably, escalating to a more expensive model only when needed. It cuts cost because most production traffic is easy and the price gap between cheap and frontier models is roughly 100x per token, so you stop overpaying on the bulk of requests.
How much does model routing actually save?
It depends on how easy your traffic is, but the documented range is large. The RouteLLM work reported about 85% cost reduction on one benchmark while keeping 95% of GPT-4 quality, using the strong model on only 14% of queries. Production deployments commonly land in the 40 to 85% range. Your number is your own, because it is set by your traffic mix, not the paper's.
What is the difference between a cascade and a router?
A cascade runs the cheap model first and escalates to a bigger model only when the cheap one fails a confidence or verification check, so it can pay for two calls on hard requests. An upfront router picks one model before generating, paying for a single call but depending on a routing signal that has to be right before any output exists; for a trained classifier, that means labeled data and retraining. Cascades are simpler to start with; upfront routers can be cheaper when escalation would otherwise be frequent.
What is the biggest risk with model routing?
Routing on a quality signal you have not measured. If the confidence score or classifier is wrong, the router sends hard requests to the cheap model and the product degrades while the cost dashboard looks healthy. The defense is a frozen, production-sampled eval set run against every model in the menu, plus monitoring to catch drift before the revenue numbers do.
If you want a team to build the routing layer, the eval set behind it, and the monitoring on top, that is exactly what Devlyn's engineering team works on. The cheapest model that can do the job is a great strategy. The eval that proves it can is what makes the strategy safe.
