Name: Model Routing

> **Working claim:** Before you build a learned router, build a *legible* one. Static rules, intent classification, and explicit tiers get you most of the routing value with code a human can read, audit, and roll back.

Key Takeaways

Rules, Intent, and Tiers is a chapter about model routing and inference control planes, not a generic AI adoption note.

The operating rule is to send each request to the cheapest path that still meets quality, latency, residency, and risk requirements.

The failure mode to watch is polished output without evidence, owner, cost line, or rollback path.

The useful next step is an artifact a future teammate can replay without folklore.

Model routing works when each request goes to the cheapest path that still meets quality, latency, residency, and risk requirements.

Static rules, intent classification, and explicit tiers are not the primitive version you graduate from; they are the durable skeleton that the learned signals later refine. A routing policy you cannot read is a routing policy you cannot trust in an incident.

The case for legible routing

A learned router is impressive in a slide and frightening in an incident. When traffic shifts at 3 a.m. and the cost graph spikes, the on-call engineer needs to answer "why is everything routing to the flagship?", and a model that emits a number is a poor witness. A legible router (static rules, named intents, explicit tiers) answers that question by being read. The policy is a document; the decision is traceable to a clause; the rollback is a config revert. This is not a step you take because you cannot afford the learned router yet. It is the layer you keep underneath the learned router permanently, because legibility is an operational asset, not a developmental phase.

The three legible patterns build on each other. Static rules route on properties you can compute without any model. Intent routing adds a cheap classification step so the rules can key on what the request is asking for. Tier routing organizes the fleet into ordered capability levels so the rules and intents map to a tier rather than a specific model, which makes the fleet swappable underneath the policy.

Figure: Whiteboard-style sketch for Rules, Intent, and Tiers. The policy pipeline keeps hard guarantees, risk floors, intent, tier mapping, and learned difficulty in separate gates.

Static rules: the deterministic skeleton

Static rules are conditions over request properties known before any generation: tenant, task type (if declared), declared language, content flags, length, time-of-day, feature flag. They are deterministic, instant, and free, and they should handle every decision that can be made deterministically, because a deterministic decision is auditable, testable, and never surprises you.

The rule worth internalizing: rules express policy that must be guaranteed, not policy that should be optimized. "EU-tenant requests never leave EU-region providers" is a guarantee. It belongs in a hard rule, not a learned model, because you cannot afford a probabilistic component getting it wrong (this is the bridge to Chapter 17's governance). "Code requests go to the code specialist" is a guarantee about a known-good split. "Requests over the model's effective context go to a long-context model" is a guarantee from Chapter 4's eligibility logic. Save the optimization (squeezing cost out of the uncertain middle) for the learned signals; keep the guarantees in rules.
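The three guarantees quoted above can be written as plain predicates over request properties. The following is a minimal sketch, not the chapter's implementation: the `Request` fields, the `Rule` shape, the constraint keys, and the `EFFECTIVE_CONTEXT_TOKENS` value are all hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Request:
    tenant_region: str
    task_type: Optional[str]  # None when the caller did not declare one
    token_count: int


@dataclass(frozen=True)
class Rule:
    name: str
    applies: Callable[[Request], bool]
    constraint: dict


# Hypothetical limit; a real value would come from the eligibility logic.
EFFECTIVE_CONTEXT_TOKENS = 32_000

HARD_RULES = [
    Rule("eu_residency", lambda r: r.tenant_region == "EU",
         {"provider_region": "EU"}),
    Rule("code_to_specialist", lambda r: r.task_type == "code",
         {"tier": "specialist_code"}),
    Rule("long_context", lambda r: r.token_count > EFFECTIVE_CONTEXT_TOKENS,
         {"lane": "long_context"}),
]


def apply_hard_rules(request):
    """Return the merged constraints and the names of the rules that fired."""
    constraints = {}
    fired = []
    for rule in HARD_RULES:
        if rule.applies(request):
            constraints.update(rule.constraint)
            fired.append(rule.name)
    return constraints, fired
```

Because every rule is a named predicate, the list of fired rule names doubles as the audit trail: a reviewer can see which clause produced which constraint without running any model.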

Intent routing: route on what is being asked

Most requests do not arrive tagged with a task type; you have to infer it. Intent routing runs a cheap, fast classifier at the front of the pipeline to label the request ("billing question," "code generation," "summarization," "out-of-scope") and routes on the label. The classifier is itself a small model or a fine-tuned encoder, and it is one of the few places a single cheap model call earns its keep before the main generation, because the intent label feeds risk assessment (Chapter 5), slice lookup (Chapter 7), and the rules.

The discipline that keeps intent routing safe: the classifier has an explicit "unknown / out-of-scope" class, and that class routes conservatively (to a stronger model, a clarification step, or a human), not to whatever the closest known intent happens to be. The dangerous failure of intent routing is forcing every request into a known intent and confidently mis-handling the ones that do not fit: the routing analog of a classifier with no abstain option. An intent router that cannot say "I don't know what this is" will route the requests it does not understand exactly as confidently as the ones it does.

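```python
# Intent routing with a mandatory abstain class.
# intent_model, route_conservative, and INTENT_ROUTES are assumed to be
# defined elsewhere in the router.
ABSTAIN_THRESHOLD = 0.55  # below this, no intent is trusted


def classify_intent(request):
    pred = intent_model.predict_proba(request.text)  # {intent: prob}
    top, p = max(pred.items(), key=lambda kv: kv[1])
    if p < ABSTAIN_THRESHOLD:  # not confident in ANY intent
        return "unknown"  # abstain -> conservative route
    return top


def route_by_intent(request):
    intent = classify_intent(request)
    handler = INTENT_ROUTES.get(intent)
    if intent == "unknown" or handler is None:
        # No confident label, or a label with no route: strong model or clarify.
        return route_conservative(request)
    return handler(request)  # intent-specific routing
```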

Tier routing: order the fleet

Tier routing is the organizing idea that makes the policy survive model changes. Instead of naming specific models in the policy, organize the fleet into ordered tiers (small, mid, large, plus orthogonal specialist lanes) and write the policy in terms of tiers. The mapping from tier to concrete model lives in one place (the fleet config), so when a provider deprecates a model or you add a new one, you update the mapping and the policy is untouched. This is the same indirection that lets infrastructure survive vendor churn: the policy depends on a role (mid tier), not an identity (a specific model name).

Tiers also make risk floors (Chapter 5) and escalation targets (Chapter 9) expressible cleanly: a risk floor is "minimum tier large," an escalation is "next tier up," a specialist route is "tier specialist: code." The whole routing vocabulary becomes tier-relative, which is exactly what you want when the concrete fleet is changing underneath you.
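A small sketch of that tier-relative vocabulary follows. It rests on assumptions: the `FLEET` dict stands in for the fleet config, the model names are placeholders rather than real models, and leaving specialist lanes untouched by floors is a choice made here for illustration, since the chapter does not specify how floors interact with specialist lanes.

```python
TIER_ORDER = ["small", "mid", "large"]  # ordered capability levels

# Stand-in for the fleet config; every model name is a placeholder.
FLEET = {
    "small": "model-small-placeholder",
    "mid": "model-mid-placeholder",
    "large": "model-large-placeholder",
    "specialist_code": "model-code-placeholder",
}


def apply_floor(tier, min_tier):
    """Raise an ordered tier to the floor if it sits below it.

    Specialist lanes are outside the order, so this sketch returns them as-is.
    """
    if tier not in TIER_ORDER:
        return tier
    return max(tier, min_tier, key=TIER_ORDER.index)


def next_tier_up(tier):
    """Escalation target for an ordered tier; the top tier stays put."""
    i = TIER_ORDER.index(tier)
    return TIER_ORDER[min(i + 1, len(TIER_ORDER) - 1)]


def resolve_model(tier):
    """The only place a tier name turns into a concrete model."""
    return FLEET[tier]
```

Swapping a deprecated model means editing one entry in `FLEET`; `apply_floor` and `next_tier_up` never see a model name, so the policy logic does not change.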

The routing policy as a versioned artifact

Put it together and the routing policy becomes a single, readable, versioned YAML document, the control-plane artifact from Chapter 1. This is the durable form of a legible router: a human can read it, diff it, review it in a pull request, canary it (Chapter 18), and roll it back. Here is a realistic shape.

```yaml
# route-policy.yaml - versioned, reviewable, rollback-able routing policy.
version: "2025-06-12"
description: "Support assistant routing. Tiers map to models in fleet.yaml."

# Tier -> concrete model lives in fleet.yaml, NOT here. Policy is tier-relative.
defaults:
  tier: mid
  cascade: false

# Hard guarantees: deterministic, audited, never overridden by optimization.
guarantees:
  - name: eu_residency
    when: { tenant_region: "EU" }
    require: { provider_region: "EU" }  # Chapter 17 governance
  - name: pii_floor
    when: { contains_pii: true }
    require: { min_tier: large, providers_allow: ["self-hosted", "enterprise-baa"] }

# Risk floors (Chapter 5): risk gates eligibility before difficulty.
risk_floors:
  low: { min_tier: small }
  medium: { min_tier: mid }
  high: { min_tier: large }
  critical: { route: human, model_may_draft: true }

# Intent-based routes. 'unknown' is conservative on purpose.
intents:
  code_generation: { tier: specialist_code, cascade: true, verify: run_tests }
  summarization: { tier: small, cascade: true, verify: faithfulness_check }
  billing_dispute: { tier: mid, cascade: true, verify: consistency_3x }
  status_lookup: { tier: small, cascade: false }
  unknown: { tier: large, clarify_first: true }  # abstain -> route up

# The uncertain middle is where learned difficulty (Ch. 7) refines the choice.
difficulty_model:
  enabled: true
  applies_to_intents: [billing_dispute, summarization]
  commit_cheap_below: 0.30
  commit_strong_above: 0.80
  # between thresholds: cascade (Ch. 3 combined architecture)
```

Read that policy top to bottom and you can predict where any request goes: guarantees first (residency, PII), then risk floors, then intent routes, with the learned difficulty model refining only the uncertain middle of the intents that opted into it. Nothing is hidden. An auditor can confirm the residency guarantee; an on-call engineer can see why billing disputes cascade; a reviewer can diff a proposed change. This legibility is what the NIST AI Risk Management Framework calls for under transparency and accountability (the ability to explain and govern an automated decision), and it is purchased almost entirely by writing the policy as data rather than as control flow.
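To make "policy as data" concrete, here is a hedged sketch of an evaluator that walks the gates in the order just described and records which clause fired. Everything in it is an assumption for illustration: `POLICY` is a hand-written dict mirroring a subset of the YAML (a real router would load the file), the `decide` function and its trace format are hypothetical, floors from different gates are combined by taking the highest, and the difficulty-model refinement is left out.

```python
TIER_ORDER = ["small", "mid", "large"]

POLICY = {
    "defaults": {"tier": "mid", "cascade": False},
    "guarantees": [
        {"name": "eu_residency", "when": {"tenant_region": "EU"},
         "require": {"provider_region": "EU"}},
        {"name": "pii_floor", "when": {"contains_pii": True},
         "require": {"min_tier": "large"}},
    ],
    "risk_floors": {"low": "small", "medium": "mid", "high": "large"},
    "intents": {
        "summarization": {"tier": "small", "cascade": True},
        "status_lookup": {"tier": "small", "cascade": False},
        "unknown": {"tier": "large", "clarify_first": True},
    },
}


def raise_to(tier, floor):
    """Return the higher of two ordered tiers; specialist lanes pass through."""
    if tier not in TIER_ORDER:
        return tier
    return max(tier, floor, key=TIER_ORDER.index)


def decide(request, policy=POLICY):
    trace, required, floor = [], {}, "small"

    # Gate 1: hard guarantees.
    for g in policy["guarantees"]:
        if all(request.get(k) == v for k, v in g["when"].items()):
            trace.append("guarantee:" + g["name"])
            req = dict(g["require"])
            if "min_tier" in req:
                floor = raise_to(floor, req.pop("min_tier"))
            required.update(req)

    # Gate 2: risk floor; critical risk goes to a human.
    risk = request.get("risk", "low")
    trace.append("risk:" + risk)
    if risk == "critical":
        return {"route": "human", "model_may_draft": True, **required, "trace": trace}
    floor = raise_to(floor, policy["risk_floors"][risk])

    # Gate 3: intent route; anything unrecognized is treated as unknown.
    intent = request.get("intent", "unknown")
    if intent not in policy["intents"]:
        intent = "unknown"
    trace.append("intent:" + intent)

    decision = {**policy["defaults"], **policy["intents"][intent]}
    decision["tier"] = raise_to(decision["tier"], floor)
    decision.update(required)
    decision["trace"] = trace
    return decision
```

The `trace` list is the part that matters operationally: it answers "why did this route here?" with clause names from the policy file, which is the property the chapter argues a learned score cannot provide on its own.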

The decision matrix for choosing a pattern

When does a slice need a static rule, an intent route, a tier, a cascade, or a learned model? Here is the selection matrix.

| Situation | Pattern | Why |
| --- | --- | --- |
| Decision must be guaranteed (residency, PII, safety floor) | Hard static rule | Cannot risk a probabilistic component; must be auditable |
| Clear, known task split (code vs. chat) | Static rule on task type / intent | Deterministic, free, obvious |
| Task type not declared by caller | Intent classifier with abstain | Need to infer what's being asked |
| Fleet changes often (models deprecated/added) | Tier indirection | Policy survives model churn |
| Easy majority, occasional hard ones | Cheap default + cascade on the slice | Pay for strong only when cheap fails (Ch. 9) |
| Uncertain middle, lots of labeled data | Learned difficulty/winner-predictor | Squeeze cost where rules can't tell (Ch. 7) |
| High-stakes, willing to pay | Ensemble | Buy quality above any single model (Ch. 11) |

The matrix says something the hype misses: the learned router occupies one row. The legible patterns cover most of the table, and the RouteLLM and FrugalGPT machinery earns its place specifically in "the uncertain middle, lots of labeled data", not everywhere. A system that uses a learned router where a static rule would do has traded auditability for sophistication it did not need.
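Even in its one row, the learned signal can stay legible if it is consumed through explicit thresholds. The sketch below uses the `commit_cheap_below` and `commit_strong_above` values from the policy; it assumes the difficulty model emits a score in [0, 1] where higher means harder, and the function name and the handling of scores exactly on a threshold are choices made here, not something the policy specifies.

```python
COMMIT_CHEAP_BELOW = 0.30    # from difficulty_model.commit_cheap_below
COMMIT_STRONG_ABOVE = 0.80   # from difficulty_model.commit_strong_above


def refine_by_difficulty(score):
    """Map a predicted difficulty score to one of three routing actions."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("difficulty score must be in [0, 1]")
    if score < COMMIT_CHEAP_BELOW:
        return "commit_cheap"    # confident the cheap tier is enough
    if score > COMMIT_STRONG_ABOVE:
        return "commit_strong"   # confident the strong tier is needed
    return "cascade"             # uncertain middle: try cheap, verify, escalate
```

The model supplies only a number; what happens at each number is still a readable, diffable pair of constants.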

Where intent routing meets the data plane

A final operational note that connects to OpenAI's production best practices. The intent classifier and the difficulty model are additional model calls on the critical path, and they have their own failure modes: they can be slow, they can error, they can be rate-limited. Treat them with the same resilience discipline as the main generation: a timeout, a fallback (if the intent classifier fails, route to the conservative default rather than blocking the request), and monitoring of their own latency and error rate. A router that becomes unavailable because its classifier is down has converted a routing optimization into an outage. The legible patterns must themselves degrade gracefully: if the intelligence layer fails, fall back to the deterministic skeleton, route to the safe default tier, and keep answering.
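One way to apply that discipline is to wrap the classifier call with a timeout and a conservative fallback. This is a sketch under stated assumptions: `classify_with_fallback`, the 200 ms default, and the status strings are hypothetical, and the classifier is passed in as a plain callable so the wrapper does not depend on any particular model client.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_POOL = ThreadPoolExecutor(max_workers=4)
CONSERVATIVE_INTENT = "unknown"  # routes to the safe default downstream


def classify_with_fallback(classify, text, timeout_s=0.2):
    """Return (intent, status, elapsed_seconds); never raises, never blocks long."""
    start = time.monotonic()
    future = _POOL.submit(classify, text)
    try:
        intent, status = future.result(timeout=timeout_s), "ok"
    except FutureTimeout:
        # cancel() only stops a call that has not started; a running call
        # finishes in the background and its result is discarded.
        future.cancel()
        intent, status = CONSERVATIVE_INTENT, "timeout"
    except Exception:
        intent, status = CONSERVATIVE_INTENT, "error"
    return intent, status, time.monotonic() - start
```

The returned status and elapsed time are what you would feed into the classifier's own latency and error-rate monitoring; the request itself proceeds either way.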

Chapter summary

Before a learned router, build a legible one, and keep it underneath the learned router permanently, because legibility is an operational asset: in an incident the on-call engineer must answer "why did this route here?" by reading a policy, not interrogating a number. Three legible patterns build on each other: static rules on pre-generation properties, intent routing via a cheap classifier, and tier routing that organizes the fleet into ordered capability levels. The governing distinction is that rules express policy that must be guaranteed (residency, PII floors, and known-good splits, all deterministic and auditable) while learned signals handle policy that should be optimized (squeezing cost from the uncertain middle); never put a guarantee in a probabilistic component. Intent routing needs a mandatory abstain class that routes conservatively, because an intent router that cannot say "unknown" mis-handles the requests it does not understand. Tier indirection (policy in terms of small/mid/large/specialist, with tier→model mapping isolated in fleet config) lets the policy survive provider churn and lets risk floors and escalations be expressed cleanly. The whole thing becomes a single versioned, diffable, rollback-able YAML artifact, the kind of control-plane document the NIST AI RMF's transparency goals call for. The pattern-selection matrix shows the learned router occupies just one row (uncertain middle, lots of labeled data); that is where RouteLLM and FrugalGPT belong, not everywhere. Finally, the classifier and difficulty model are extra calls on the critical path: give them timeouts and fallbacks so that when the intelligence layer fails the router degrades to its deterministic skeleton and keeps answering rather than turning a routing optimization into an outage.

Internal map

For the larger argument, keep this chapter connected to Model Routing, The Economics of Inference, the smaller-model margin argument, and A Field Guide to Evals.