Alpesh Nakrani

Front Matter: Model Routing

Sending Each Request to the Cheapest Model That Can Still Answer Correctly

Key Takeaways

  • This front matter introduces a book about model routing and inference control planes, not a generic AI adoption note.
  • The operating rule is to send each request to the cheapest path that still meets quality, latency, residency, and risk requirements.
  • The failure mode to watch is polished output without evidence, an owner, a cost line, or a rollback path.
  • The useful next step is an artifact a future teammate can replay without folklore.

Model routing works when each request goes to the cheapest path that still meets quality, latency, residency, and risk requirements.

Book promise

Not every request deserves your strongest model. Not every request can survive your cheapest one. A single-model architecture hides that tradeoff until the cloud bill, the user complaints, or the safety review exposes it.

This is a practical, systems-minded guide to designing, evaluating, and operating model routers, cascades, fallbacks, and ensembles, so that an AI product balances quality, latency, risk, and cost request by request instead of by edict. It is written for builders who have already shipped something with one model and then watched what the spreadsheet revealed: the flagship-only system whose gross margin evaporates, the cheap-only system whose escalations and wrong answers climb, or the system that quietly sent a regulated document to a provider that should never have seen it.

This manuscript is not a price comparison, not an inference-optimization tutorial, and not a cost-cutting handbook. It is a guide to model-selection systems: how to classify request difficulty and risk, how to choose a routing pattern, how to evaluate the router as a system rather than the models in isolation, how to control cost and latency, how to protect sensitive data, and how to operate a fleet of models safely as providers deprecate, upgrade, and fail.

The recurring motif

The router is the air-traffic controller of an AI system.

A controller does not fly the planes. The controller decides which plane should fly, when to escalate, when to hold, when to divert, and how to keep the runway safe under load, weather, and emergency. A model router is the same kind of component: it does not generate the answer, it decides which generator should answer this request, under what budget, with what fallback, and with what evidence it should escalate. Confusing "which model should we use?" (a procurement question, asked once) with "which model should this request use?" (a runtime control question, asked millions of times) is the mistake this book exists to correct.

The enemy

The beliefs this book argues against:

"Pick the best model and send everything to it." (The one-model religion, flagship edition.)

"Pick the cheapest model and send everything to it." (The one-model religion, finance edition.)

Both are lazy designs that pretend a heterogeneous workload is homogeneous. Real systems have heterogeneous tasks, heterogeneous risks, heterogeneous latency constraints, and heterogeneous model capabilities. A safe autocomplete, a routine summarization, a math-heavy analysis, a high-risk legal answer, and a tool-using agent step do not all want the same model. Model routing turns that heterogeneity from a hidden liability into an explicit, measured, governed decision.

Core thesis

Model routing is not cost optimization alone. It is runtime decision-making under uncertainty.

The router never knows for certain whether a cheap model will get a request right; it must estimate difficulty and risk, act under a latency budget, and learn from what happened. Treating routing as a static cost-cutting rule ("if tokens > N, use the big model") ignores the uncertainty that makes it interesting and dangerous.
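To make the contrast with a static rule concrete, here is a minimal sketch of a decision made under uncertainty. It is an illustration, not the book's method: the `Path` type, the `p_correct` estimates, and every number are hypothetical, and a real router would have to learn those probabilities from evaluation data rather than hard-code them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    name: str
    call_cost: float   # cost of one call, in arbitrary units (hypothetical)
    p_correct: float   # estimated probability the answer is acceptable (hypothetical)


def expected_loss(path: Path, wrong_answer_cost: float) -> float:
    """Call cost plus the expected cost of shipping a wrong answer."""
    return path.call_cost + (1.0 - path.p_correct) * wrong_answer_cost


def choose_path(paths: list[Path], wrong_answer_cost: float) -> Path:
    """Pick the path with the lowest expected loss for this request."""
    return min(paths, key=lambda p: expected_loss(p, wrong_answer_cost))


cheap = Path("cheap", call_cost=1.0, p_correct=0.90)
strong = Path("strong", call_cost=10.0, p_correct=0.99)

# Same two models, different stakes: the decision flips with the cost of being wrong.
low_stakes = choose_path([cheap, strong], wrong_answer_cost=5.0)
high_stakes = choose_path([cheap, strong], wrong_answer_cost=500.0)
print(low_stakes.name, high_stakes.name)
```

Nothing in this sketch looks at token count. The choice depends on an estimate of correctness and on what a wrong answer costs, which is exactly the information a length-based rule throws away.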

Primary research references

These anchor the book. Individual chapters use their own chapter-specific sources; this is the shared spine.

The ROUTE Framework

One framework recurs through the book. Whenever a request arrives and you must decide where it goes, ask five questions:

  • R: Risk. What happens if the answer is wrong? A wrong autocomplete is an annoyance; a wrong dosage, contract clause, or financial figure is a liability. Risk sets the floor on which models are even eligible.
  • O: Outcome metric. What quality signal actually matters for this request type? Exact match for arithmetic, faithfulness for summarization, pass-rate for code, human preference for chat. There is no single "quality."
  • U: Unit cost. What does each candidate path cost, not just the first call, but retries, second-model escalations, reranking, and the validators in between?
  • T: Time budget. How much latency can this request tolerate? An interactive autocomplete and an overnight batch enrichment live in different universes.
  • E: Escalation evidence. What concrete signal would justify moving to a stronger model or a human, and what signal would justify not escalating, so the router does not panic-escalate everything?
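The unit-cost question is easier to reason about with a small worked calculation. The sketch below assumes a two-step cascade (cheap model, then a validator, then an optional escalation to a strong model); the function names and all figures are hypothetical and only illustrate the arithmetic.

```python
def cascade_expected_cost(
    cheap_cost: float,
    validator_cost: float,
    strong_cost: float,
    escalation_rate: float,
) -> float:
    """Expected cost per request for: cheap call, validator, then escalate some fraction."""
    # Every request pays for the cheap call and the validator;
    # only the escalated fraction also pays for the strong call.
    return cheap_cost + validator_cost + escalation_rate * strong_cost


def break_even_escalation_rate(
    cheap_cost: float, validator_cost: float, strong_cost: float
) -> float:
    """Escalation rate above which the cascade costs more than calling the strong model directly."""
    return (strong_cost - cheap_cost - validator_cost) / strong_cost


# Hypothetical numbers in arbitrary cost units.
cost = cascade_expected_cost(
    cheap_cost=1.0, validator_cost=0.2, strong_cost=10.0, escalation_rate=0.25
)
limit = break_even_escalation_rate(cheap_cost=1.0, validator_cost=0.2, strong_cost=10.0)
print(f"expected cascade cost: {cost:.2f}, break-even escalation rate: {limit:.2f}")
```

With these assumed figures the cascade costs roughly 3.7 units per request against 10 for the strong model alone, and it stops paying for itself once about 88% of requests escalate. Retries and reranking would add further terms of the same shape.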

ROUTE is used as a design lens, not a forced subsection in every chapter. It is the question set a mature router can answer for any single request.
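One way to picture that question set is as a per-request record plus an eligibility filter. The following is a sketch under stated assumptions: `RouteProfile`, `Candidate`, the three-level risk scale, and the example numbers are all invented for illustration and are not taken from the book.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RouteProfile:
    risk_tier: int           # R: 0 = annoyance, 2 = liability (hypothetical scale)
    outcome_metric: str      # O: e.g. "exact_match", "faithfulness", "pass_rate"
    max_unit_cost: float     # U: budget for the whole path, retries and escalations included
    time_budget_ms: int      # T: latency the request can tolerate
    escalation_signal: str   # E: the evidence that would justify escalating


@dataclass(frozen=True)
class Candidate:
    name: str
    max_risk_tier: int       # highest risk tier this path is approved for
    path_cost: float
    p95_latency_ms: int


def cheapest_eligible(
    profile: RouteProfile, candidates: list[Candidate]
) -> Optional[Candidate]:
    """Risk sets the floor; cost and time trim the rest; then take the cheapest survivor."""
    eligible = [
        c for c in candidates
        if c.max_risk_tier >= profile.risk_tier
        and c.path_cost <= profile.max_unit_cost
        and c.p95_latency_ms <= profile.time_budget_ms
    ]
    # None means no path satisfies the constraints; the caller decides what happens next.
    return min(eligible, key=lambda c: c.path_cost, default=None)


fleet = [
    Candidate("small", max_risk_tier=0, path_cost=1.0, p95_latency_ms=200),
    Candidate("mid", max_risk_tier=1, path_cost=4.0, p95_latency_ms=600),
    Candidate("large", max_risk_tier=2, path_cost=12.0, p95_latency_ms=1500),
]

autocomplete = RouteProfile(0, "human_preference", 2.0, 300, "user rejects suggestion")
contract = RouteProfile(2, "faithfulness", 20.0, 5000, "validator flags a clause")

print(cheapest_eligible(autocomplete, fleet).name)  # small
print(cheapest_eligible(contract, fleet).name)      # large
```

The outcome metric and escalation signal are carried on the profile but not used by the filter; in this sketch they are there to show that a request's routing record has a slot for all five answers, even when only three of them gate eligibility.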

Table of contents

Movement I: One Model Is a Product Decision, Not a Law

  1. The Bill That Broke the One-Model Religion
  2. The Frontier, Not the Flagship
  3. Static, Dynamic, Cascade, Fallback, Ensemble: Five Words That Are Not Synonyms

Movement II: What Makes a Request Hard?

  1. Prompt Length Is a Liar
  2. Reading Risk Before Reading Difficulty
  3. Confidence, Self-Assessment, and Why Models Lie About Both
  4. Difficulty From History: Slices, Embeddings, and Learned Routers

Movement III: Routing Patterns

  1. Rules, Intent, and Tiers
  2. The Cascade Ladder
  3. Failover, Local-vs-Cloud, and the Provider Mesh
  4. Ensembles, Voting, and Rerank-and-Fuse

Movement IV: Evaluation: Judge the Router as a System

  1. The Confusion Matrix Has Four Boxes, Not Two
  2. Regret, Oracles, and Cost-Weighted Quality
  3. Shadow Routing and Online Evaluation

Movement V: Cost and Latency Engineering

  1. The Cost Waterfall
  2. The Latency Budget

Movement VI: Safety, Security, and Governance

  1. Locked Doors: Residency, Permissions, and Abuse

Movement VII: Operating the Router

  1. The Control Room and the Fleet

Movement VIII: Playbooks

  1. Ten Playbooks

Back matter

  • Glossary
  • Implementation Checklist
  • Research and Source Register

Internal map

For the larger argument, keep this chapter connected to Model Routing, The Economics of Inference, the smaller-model margin argument, and A Field Guide to Evals.
