
2025 / Free online book · Technical Deep Dives
Model Routing
Sending Each Request to the Cheapest Model That Can Still Answer Correctly
Access: Free
Chapters: 19
Read time: 190 min
One model for everything is the expensive default. This book is about routing each request to the cheapest model that can actually answer it.
This edition is free to read onsite. Each chapter has its own URL, so readers can bookmark, share, and return to the exact section they need.
Table of contents
FM Front Matter · 5 min
Model Routing: Sending Each Request to the Cheapest Model That Can Still Answer Correctly

INT Introduction: The Tower · 10 min
A company I will call the desk (the details are composited from several real support organizations, but the shape is exact) built a customer-support assistant in the first flush of capable chat models. The engineering was clean.

01 The Bill That Broke the One-Model Religion · 10 min
> **Working claim:** A single-model architecture is not a neutral default. It is a bet that your workload is uniform, that every request deserves the same model.

02 The Frontier, Not the Flagship · 10 min
> **Working claim:** "Which model is best?" is the wrong question because it presumes a single ranking on a single axis.

03 Static, Dynamic, Cascade, Fallback, Ensemble: Five Words That Are Not Synonyms · 11 min
> **Working claim:** "Routing" is used as one word for at least five different architectures, and they have different costs, different failure modes, and different reasons to exist.

04 Prompt Length Is a Liar · 8 min
This chapter turns "prompt length is a liar" into a concrete operating problem for routing.

05 Reading Risk Before Reading Difficulty · 8 min
> **Working claim:** Difficulty and risk are different axes, and the router must read *risk first*.

06 Confidence, Self-Assessment, and Why Models Lie About Both · 9 min
> **Working claim:** The most tempting escalation signal is the model's own confidence: "let the cheap model tell us when it's unsure." It is tempting because it is cheap and it is exactly what you wish existed.

07 Difficulty From History: Slices, Embeddings, and Learned Routers · 9 min
> **Working claim:** The strongest difficulty signal is not in the request; it is in your logs. "Requests like this one" (same task type, same domain, same shape) have a *measured* track record of how often each model got them right.

08 Rules, Intent, and Tiers · 8 min
> **Working claim:** Before you build a learned router, build a *legible* one. Static rules, intent classification, and explicit tiers get you most of the routing value with code a human can read, audit, and roll back.

09 The Cascade Ladder · 9 min
> **Working claim:** A cascade is the most powerful routing pattern and the easiest to build wrong. Its power is that it escalates on the strongest possible signal: the actual answer, judged by a verifier.

10 Failover, Local-vs-Cloud, and the Provider Mesh · 8 min
> **Working claim:** Everything so far assumed the model you chose is *available*. In production it sometimes is not: providers rate-limit, time out, return errors, and have outages.

11 Ensembles, Voting, and Rerank-and-Fuse · 9 min
> **Working claim:** Every pattern so far tried to spend *less*. An ensemble spends *more* (several models on the same request, every time) to buy an answer better than any single model can give.

12 The Confusion Matrix Has Four Boxes, Not Two · 8 min
> **Working claim:** A router can make every individual model look fine while the system fails, because the unit of evaluation is not the model output; it is the *routing decision*.

13 Regret, Oracles, and Cost-Weighted Quality · 8 min
> **Working claim:** The right way to grade a router is to compare it to the best decision it *could* have made. That comparison is *regret*: the quality (or cost) the router gave up by its choice versus an oracle's choice.

14 Shadow Routing and Online Evaluation · 7 min
> **Working claim:** Offline evaluation tells you how the router *did* on yesterday's data.

15 The Cost Waterfall · 8 min
> **Working claim:** "Cost per request" is not one number; it is a waterfall of components, and a router that models only the first model call is optimizing the smallest part of the bill.

16 The Latency Budget · 8 min
> **Working claim:** Latency is the currency a cost-obsessed router forgets, and it is the one users feel.

17 Locked Doors: Residency, Permissions, and Abuse · 8 min
> **Working claim:** A router is a thing that *sends data somewhere and triggers actions*, which makes it a security and compliance surface, not just a cost optimization.

18 The Control Room and the Fleet · 8 min
> **Working claim:** A routing policy is not shipped once; it is *operated*. Models join and leave the fleet, providers degrade, traffic drifts, and a config change can re-route a third of your traffic in one deploy.

19 Ten Playbooks · 12 min
> **Working claim:** Everything in this book converges on a single recurring decision shape, and the fastest way to internalize it is to apply it ten times to ten different systems.

A Appendix A: Back Matter · 9 min
Glossary, implementation checklist, and source register for the book.
