Name: Model Routing
Availability: InStock

Key Takeaways

Appendix A: Back Matter is a chapter about model routing and inference control planes, not a generic AI adoption note.

The operating rule is to send each request to the cheapest path that still meets quality, latency, residency, and risk requirements.

The failure mode to watch is polished output without evidence, owner, cost line, or rollback path.

The useful next step is an artifact a future teammate can replay without folklore.

Model routing works when each request goes to the cheapest path that still meets quality, latency, residency, and risk requirements.

Glossary

Abstain class: A mandatory "unknown / out-of-scope" output for an intent classifier, routed conservatively (strong model, clarify, or human). An intent router without one mis-handles the requests it does not understand.

Cascade: A cost-ordered ladder of models where a request runs the cheapest, a verifier judges the answer, and the request climbs to a stronger model only on doubt. The escalation decision is made after seeing the actual answer. Contrast dynamic routing.

Cache-aware routing: Estimating a candidate model's cost and latency adjusted for whether that model's relevant prompt prefix is currently cached, since prompt caching is per-model and per-prefix. Routing to a nominally cheaper model with a cold cache can be dearer and slower.

Capability manifest: A per-model declaration of permitted tools, approved domains, side-effect rights, and required human approvals, enforced as part of routing eligibility so the cost optimizer cannot select an unapproved model.

Confusion matrix (routing): The four-box instrument grading routing decisions: cheap-and-right, cheap-and-wrong (false cheap), escalated-and-needed, escalated-and-wasted (false expensive). Its two errors are asymmetric and must never be summed.

Cost regret: Extra cost spent versus an oracle at the same quality level; the toll of false-expensive (over-escalation) errors.

Cost-weighted quality: A single scalar combining quality and cost via an explicit dollars-per-quality-point exchange rate, set per slice by risk. The optimization objective when one number is needed.

Denial-of-wallet: A cost-amplification attack where an adversary crafts traffic that forces every request down the most expensive lane (via escalation triggers or forced failover), an OWASP unbounded-consumption route. Defended by budget guards.

Difficulty: How likely a cheap model is to answer a request correctly. Independent of risk. Poorly predicted by prompt length; best predicted by historical per-slice performance.

Drift (route distribution): A shift in the fraction of traffic going to each lane without a policy change, signaling traffic, model, or price drift (or abuse). The router's heartbeat.

Dynamic routing: Picking one model up front from a learned prediction of difficulty or the likely winner (e. g., RouteLLM), committing without seeing any answer. One call; the strongest signal (the answer) is unavailable. Contrast cascade.

Effective context: The length over which a model reliably performs a given task above a quality threshold (RULER); far shorter than the advertised window and shorter still as task difficulty rises. Gates length-based eligibility.

Ensemble: Running multiple models on the same request every time and combining their outputs (vote, rerank, or fuse) to exceed the best single model (LLM-Blender). The lone pay-more pattern; justified only on high-stakes or offline-cacheable work.

Failover: Retrying on a different available path when a call fails for operational reasons (timeout, 429, 5xx, outage). Triggers on availability, fixes with an equivalent model, not a stronger one. Contrast cascade.

False cheap: Routing a request to a model that gets it wrong. A quality and, on high-risk slices, safety failure; invisible at routing time. The dangerous routing error.

False expensive: Escalating a request the cheap model would have handled. A cost failure; visible on the bill, not dangerous.

Fleet: The set of eligible models and providers assembled by procurement. Distinct from the routing of each request among them.

Four currencies: Cost, quality, latency, and risk. Cost and latency are resources spent; quality is what is bought; risk is a veto gating parts of the frontier.

Frontier (Pareto): The cost-vs-quality edge where no model is both cheaper and better than another. A router rides the frontier as a path of per-request choices; the system's aggregate point can sit above the line between any two single models.

Oracle: The unbuildable-but-measurable perfect router that always picks the cheapest model that will succeed on each request. The benchmark; regret is the gap to it.

Provider mesh: The fleet viewed through availability: each tier served by more than one provider so no provider is a single point of failure. Failover stays within governance constraints.

Quality bar (per-slice), The "correct enough" threshold a request type must clear, set by risk, strict graders and high thresholds where wrong answers cost more.

Quality regret: Quality lost versus an oracle at the same cost budget; the toll of false-cheap errors.

Risk: What a wrong answer costs. Read before difficulty, gates the eligible model set as a floor, and sets the per-slice quality bar and exchange rate. Independent of difficulty.

ROUTE: The book's framework:Risk, Outcome metric, Unit cost, Time budget, Escalation evidence. A design lens applied per request, not a chapter template.

Shadow routing: Running un-served models in the background on a sample of traffic, graded but not served, to generate the counterfactual labels (what the model you didn't serve would have done) that regret and the four-box matrix require. Off the critical path.

Slice: An equivalence class of "requests like this" (task type + domain + structure) whose model performance is similar enough to share routing. The unit of measured difficulty and per-slice tuning.

Static routing: Picking one model from fixed properties (task, tenant, length) before any call. Deterministic, auditable, free. The legible skeleton under a learned router.

Tier: An ordered capability level (small/mid/large, plus specialist lanes) used in policy so the policy is model-independent and survives provider churn via a tier→model mapping.

Verifier: An external check on a model's answer (run the tests, validate the schema, recompute, check grounding) used as a cascade's escalation trigger. Far more reliable than the model's self-reported confidence; is the cascade.

Implementation Checklist

A team's routing system is approaching production-ready when it can answer yes, with evidence to each item (see also Ten Playbooks for the concrete systems that exercise these checklist items). Grouped by movement.

Workload and frontier (Movements I-II)

The request workload has been measured as a distribution across difficulty and risk on real traffic, not assumed uniform.
The three-way baseline (all-small, all-flagship, routed) has been run on labeled traffic and the routed system sits above the line between the two single models.
The model frontier is plotted per slice from your own prices and quality; dominated models are flagged for retirement; specialists are on their own slice.
Risk is assessed first, as a floor on eligible tiers, with a classifier that rounds up on uncertainty; the trap quadrant (easy + high-risk) cannot route cheap.
Prompt length drives cost and eligibility only, never difficulty; an adversarial length-trap fixture (long-easy, short-hard) passes.
Escalation triggers on verifiers (and behavioral consistency), never bare self-reported confidence; the cheap model never grades itself.
A decision log records signals, policy version, decision, and outcome for every routed request; slice-performance is materialized from it.

Patterns (Movement III)

The five architectures are distinguished in code: static, dynamic, cascade, fallback, ensemble, with a selection policy and a separate resilience policy.
Fallback and cascade are not the same code path: failure → equivalent available model; insufficiency → stronger model.
The cascade's PASS threshold is risk-tuned per slice; ladder depth is justified by measured middle-slice data.
The provider mesh fails over within residency/governance constraints; circuit breakers and backoff-with-jitter prevent retry storms.
Local-vs-cloud routing reads live GPU utilization (local-first with cloud overflow); residency-pinned requests queue/shed rather than overflow.
Ensembles are used only on high-stakes or offline-cacheable slices; the combiner is evaluated against the best single member before trust.

Evaluation (Movement IV)

The unit of evaluation is the routing decision, graded with the four-box matrix; false-cheap and false-expensive are reported separately, never summed.
Escalation precision and recall are tuned per slice by risk; false-cheap on high-risk slices is the headline governance metric.
Regret (quality and cost) is computed against an oracle from a (request, model) eval dataset; cost-weighted quality uses an explicit per-slice exchange rate.
Shadow routing runs off the critical path on a sampled, stratified stream, populating the counterfactual labels.
Online evaluation recomputes the metrics per slice on a rolling window; a drift canary set distinguishes model drift from traffic drift.

Cost, latency, governance, operations (Movements V-VII)

Research and Source Register

Sources grouped by chapter. A source appears under a chapter only if that chapter actually uses it to support a claim.

Front matter & Introduction: synthetic; draws on the book's own argument and the shared spine. No standalone external citations beyond the spine list in the front matter.

Ch. 1, The Bill That Broke the One-Model Religion

Ch. 2, The Frontier, Not the Flagship

Ch. 3, Five Words That Are Not Synonyms

Ch. 4, Prompt Length Is a Liar

Ch. 5, Reading Risk Before Difficulty

Ch. 6, Confidence, Self-Assessment, and Why Models Lie About Both

Ch. 7, Difficulty From History

Ch. 8, Rules, Intent, and Tiers

Ch. 9, The Cascade Ladder

Ch. 10, Failover, Local-vs-Cloud, and the Provider Mesh

Ch. 11, Ensembles, Voting, and Rerank-and-Fuse

Ch. 12, The Confusion Matrix Has Four Boxes

Ch. 13, Regret, Oracles, and Cost-Weighted Quality

Ch. 14, Shadow Routing and Online Evaluation

Ch. 15, The Cost Waterfall

Ch. 16, The Latency Budget

Ch. 17, Locked Doors

Ch. 18, The Control Room and the Fleet

Ch. 19, Ten Playbooks

Internal map

For the larger argument, keep this chapter connected to Model Routing, The Economics of Inference, the smaller-model margin argument, and A Field Guide to Evals.