
This chapter turns "Prompt Length Is a Liar" into a concrete operating problem for the routing book.

Key Takeaways

Prompt Length Is a Liar is a chapter about model routing and inference control planes, not a generic AI adoption note.

The operating rule is to send each request to the cheapest path that still meets quality, latency, residency, and risk requirements.

The failure mode to watch is polished output without evidence, owner, cost line, or rollback path.

The useful next step is an artifact a future teammate can replay without folklore.

Model routing works when each request goes to the cheapest path that still meets quality, latency, residency, and risk requirements.

Working claim: The first routing signal every team reaches for is prompt length: "long prompts are hard, send them to the big model." It is the wrong signal for difficulty in both directions: it routes long-but-easy requests to expensive models for nothing, and it routes short-but-hard requests to cheap models that fail them. Length correlates with cost, which is real and useful, but it barely correlates with difficulty, which is what the router actually needs.

The seductive heuristic

Almost every routing system starts the same way. Someone writes if token_count(prompt) > 8000: model = "flagship" and ships it, because it is one line, it is intuitive, and it captures a real correlation: long prompts cost more, and "more expensive request" feels like it should mean "harder request." The heuristic survives because it is sometimes right and always cheap to compute: you can count tokens without calling any model. But "sometimes right and cheap" is exactly the profile of a signal that lulls a team into not measuring whether it is right on their traffic, and on most traffic it is wrong often enough to matter.

The error has a clean structure. Length is a good proxy for cost: more tokens, more money, by definition. It is a poor proxy for difficulty: whether a cheap model will get the answer right. Difficulty and length are different things that happen to be loosely correlated, and a router that uses length as a difficulty signal is using a cost variable to make a quality decision. The two failures fall out immediately.

Figure: Whiteboard-style sketch for Prompt Length Is a Liar. Prompt length can estimate cost and eligibility, but real difficulty comes from task, ambiguity, domain, tools, and history.

Failure one: long but easy

Plenty of long prompts are trivial. Consider these, all of which a small model handles perfectly despite being thousands of tokens long:

"Here is a 6,000-token customer email thread. Summarize the customer's main request in one sentence." A small model reads the thread and extracts the ask; length does not make extraction hard.
"Below is a 10,000-token log file. List every line containing the word ERROR." This is mechanical pattern-matching; the cheapest model on earth does it.
"Here are 4,000 tokens of product reviews. What is the overall sentiment?" Aggregating sentiment over many short, redundant signals is easier with more data, not harder.

A length-threshold router sends all three to the flagship and pays the flagship tax for nothing. The requests are long because the input is long, and the task is easy. Routing on length conflates input size with task complexity, and over a workload with many long-but-easy requests, which is most document-processing workloads, it systematically over-spends. This is the false-expensive error from the confusion matrix in Chapter 12: routing to a big model when a cheap one would have done.

Failure two: short but hard

The mirror failure is worse because it is silent. Plenty of short prompts are brutal:

"Is 1,000,033 prime?" Eleven tokens. A small model will answer confidently and often wrongly; the arithmetic is hard regardless of length.
"A train leaves Chicago at 2:15pm traveling west at 60mph; another leaves Denver at 3:40pm…" A few dozen tokens of multi-step reasoning that small models routinely botch.
"Given the clause 'notwithstanding the foregoing,' does Section 4 override Section 2?" Short, but it requires careful legal reasoning a cheap model fakes fluently.
"Refactor this 15-line function to be thread-safe." Short input, genuinely hard correctness requirement.

A length-threshold router sends all of these to the cheap model because they are short, and the cheap model produces a confident wrong answer. This is the false-cheap error, and it is the dangerous one, because the failure is invisible at routing time: there is no exception, no timeout, just a wrong answer delivered smoothly. The router congratulates itself on the cheap path while quietly poisoning the hard slice.
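The two failures can be made concrete with a toy version of the one-line heuristic. This is a minimal sketch under stated assumptions: the token counts and the truly_hard labels are hand-assigned for illustration, the 8,000-token threshold is the one from the example earlier in the chapter, and Case, length_router, and classify_route are hypothetical names rather than part of any library.

```python
from dataclasses import dataclass


@dataclass
class Case:
    name: str
    n_tokens: int     # illustrative token count, not measured
    truly_hard: bool  # hand-assigned label for this example


def length_router(n_tokens: int, threshold: int = 8000) -> str:
    """The one-line heuristic: long prompts go to the flagship."""
    return "flagship" if n_tokens > threshold else "cheap"


def classify_route(case: Case) -> str:
    """Label the routing decision against the hand-assigned difficulty."""
    route = length_router(case.n_tokens)
    if route == "flagship" and not case.truly_hard:
        return "false-expensive"  # paid the flagship tax for an easy task
    if route == "cheap" and case.truly_hard:
        return "false-cheap"      # silent failure on a hard task
    return "correct"


CASES = [
    Case("log grep", 10_000, truly_hard=False),
    Case("prime check", 11, truly_hard=True),
    Case("legal clause", 40, truly_hard=True),
    Case("thread-safe refactor", 200, truly_hard=True),
    Case("short greeting", 12, truly_hard=False),
]

outcomes = {c.name: classify_route(c) for c in CASES}
for name, outcome in outcomes.items():
    print(f"{name:22s} {outcome}")
```

Only the short greeting is routed correctly here, and only because it happens to be both short and easy. Note that nothing in classify_route is available at routing time; the labels exist only because the example supplies them, which is why the false-cheap rows are invisible in production.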

RULER is the cleanest research demonstration that length and difficulty are orthogonal in exactly this way. It shows that a model's effective capability collapses as the task gets harder at the same length: a model that handles simple retrieval over a long context can fail aggregation or multi-hop reasoning over that same context. Length held constant, difficulty varied, performance swung wildly. The corollary for routing is direct: you cannot read difficulty off length, because the same length spans the full range of difficulty.

Length is still a real signal: for cost and capacity

The fix is not to throw length away. Length is genuinely informative about two things the router must know, just not the thing it is usually used for.

First, cost. Length is cost (Chapter 15), and the router's cost estimate must use it. A long request is expensive on every model, which changes the cost-benefit of escalation: escalating a 50-token request to the flagship is cheap; escalating a 50,000-token request is not, and the router should weigh that.
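The escalation arithmetic is worth seeing once. The sketch below assumes two made-up input prices (they are placeholders, not any vendor's rate card) and ignores output tokens to keep the comparison about prompt length alone; escalation_premium is a hypothetical helper name.

```python
# Hypothetical input prices in dollars per token; substitute your own rate card.
CHEAP_PRICE_IN = 0.10 / 1_000_000
FLAGSHIP_PRICE_IN = 3.00 / 1_000_000


def escalation_premium(n_tokens: int) -> float:
    """Extra input cost of sending n_tokens to the flagship instead of the cheap model."""
    return n_tokens * (FLAGSHIP_PRICE_IN - CHEAP_PRICE_IN)


short_premium = escalation_premium(50)
long_premium = escalation_premium(50_000)

print(f"50 tokens:     ${short_premium:.6f}")
print(f"50,000 tokens: ${long_premium:.6f}")
```

The premium is linear in length, so the 50,000-token request costs a thousand times more to escalate than the 50-token one. That is the sense in which length belongs in the cost term: it scales the price of being cautious, without saying anything about whether caution is needed.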

Second, capacity and capability-at-length. Some models cannot accept the request at all (it exceeds their window), and some models degrade at length even on easy tasks, which is the RULER effective-context result. So length gates eligibility: a request near a model's effective limit for its task type should not route to that model regardless of how easy the task is in the abstract, because the model's usable capability there is lower than its sticker window suggests.
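One way to turn "effective limit for its task type" into a number is to measure accuracy at several lengths per task and take the longest length before accuracy first drops below a floor. The sketch below assumes that definition; the accuracy figures are invented for illustration and are not RULER results, and effective_context and MEASURED are hypothetical names.

```python
def effective_context(accuracy_by_length: dict[int, float], min_accuracy: float) -> int:
    """Largest tested length reached before accuracy first drops below min_accuracy."""
    effective = 0
    for length in sorted(accuracy_by_length):
        if accuracy_by_length[length] < min_accuracy:
            break  # stop at the first failing length; later recoveries are not trusted
        effective = length
    return effective


# Invented measurements for one hypothetical model: tokens -> accuracy, per task type.
MEASURED = {
    "retrieval": {4_000: 0.99, 16_000: 0.97, 64_000: 0.93, 128_000: 0.81},
    "aggregation": {4_000: 0.95, 16_000: 0.84, 64_000: 0.62, 128_000: 0.40},
}

limits = {task: effective_context(curve, min_accuracy=0.90) for task, curve in MEASURED.items()}
print(limits)
```

With these invented curves the same model has a usable limit of 64,000 tokens for retrieval and only 4,000 for aggregation, which is why the eligibility gate needs a per-task limit instead of the sticker window.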

So length informs the cost term and the eligibility gate, and it is nearly useless as the difficulty term. The mistake is using one number for all three.

# Length used correctly: for cost and eligibility, NOT for difficulty.
# count_tokens, expected_output_tokens, difficulty_model, and
# features_without_length are assumed to be defined elsewhere.

def eligible_models(request, fleet):
    n_tokens = count_tokens(request.prompt)  # measured, not estimated
    task = request.task_type
    out = []
    for m in fleet:
        # Eligibility: can this model actually handle this length on this task?
        if n_tokens > m.effective_context_for(task):  # from your RULER-style measurement
            continue  # length gates eligibility
        out.append(m)
    return out

def estimate_cost(request, model):
    n_in = count_tokens(request.prompt)
    n_out = expected_output_tokens(request.task_type)
    return model.price_in * n_in + model.price_out * n_out  # length drives cost

def estimate_difficulty(request):
    # Length is NOT here. Difficulty comes from task type, ambiguity,
    # reasoning depth, history - Chapters 5-7. Length is at most a weak feature.
    return difficulty_model.predict(features_without_length(request))

The separation is the whole lesson: eligible_models and estimate_cost use length; estimate_difficulty does not. A router that respects this separation stops over-spending on long-but-easy requests and stops under-serving short-but-hard ones, because the difficulty decision is no longer hostage to the input size.
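To see the three terms work together, here is a self-contained sketch of a routing decision that picks the cheapest model passing both gates. It is one possible wiring, not the book's reference router: the Model and Request classes, the scalar capability score, the prices, and the fixed expected_out are all assumptions made for the example, and the difficulty value stands in for whatever estimate_difficulty would return.

```python
from dataclasses import dataclass, field


@dataclass
class Model:
    name: str
    price_in: float    # dollars per input token (hypothetical)
    price_out: float   # dollars per output token (hypothetical)
    capability: float  # 0..1, hardest difficulty this model is trusted with (assumed)
    effective_context: dict = field(default_factory=dict)  # task type -> max tokens


@dataclass
class Request:
    n_tokens: int
    task_type: str
    difficulty: float  # stand-in for estimate_difficulty(); length is not an input


def route(request: Request, fleet: list, expected_out: int = 200):
    """Return the name of the cheapest model that passes both gates, or None."""
    candidates = []
    for m in fleet:
        if request.n_tokens > m.effective_context.get(request.task_type, 0):
            continue  # eligibility gate: uses length
        if m.capability < request.difficulty:
            continue  # quality gate: uses difficulty, never length
        cost = m.price_in * request.n_tokens + m.price_out * expected_out
        candidates.append((cost, m.name))  # cost term: uses length
    return min(candidates)[1] if candidates else None


FLEET = [
    Model("small", 0.1e-6, 0.4e-6, capability=0.5,
          effective_context={"extract": 32_000, "math": 8_000}),
    Model("flagship", 3e-6, 15e-6, capability=0.95,
          effective_context={"extract": 128_000, "math": 64_000}),
]

long_easy = Request(n_tokens=10_000, task_type="extract", difficulty=0.2)
short_hard = Request(n_tokens=11, task_type="math", difficulty=0.8)

print(route(long_easy, FLEET))   # long but easy
print(route(short_hard, FLEET))  # short but hard
```

The long-but-easy request lands on the small model and the short-but-hard one on the flagship, the opposite of what a length threshold would do, because length only appears in the eligibility check and the cost formula.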

What actually predicts difficulty

If not length, then what? The honest answer is that difficulty is multi-signal and the next three chapters are about it, but here is the preview, ranked roughly by how much signal each carries in practice.

Signal	What it captures	Strength	Cost to compute
Task type	Arithmetic, code, legal, summarization have different base difficulty	Strong	Free (classify intent)
Historical slice performance	What cheap models actually did on requests like this	Strongest	Needs logged outcomes (Ch. 7)
Reasoning depth required	Multi-step / multi-hop vs. single lookup	Strong	Hard to detect pre-generation
Ambiguity	Underspecified requests that need clarification	Medium	Detectable, noisy
Domain sensitivity	Specialized vocabulary, niche knowledge	Medium	Classifiable
Need for tools / retrieval	External computation or lookup required	Medium	Often detectable from task
Cheap-model self-assessment	The model's own confidence	Weak / unreliable	Costs a cheap call (Ch. 6)
Prompt length	Input size	Weak for difficulty	Free

Notice prompt length is at the bottom of the table despite being at the top of most teams' first router. Notice also that the strongest signal, historical slice performance, is not free; it requires that you logged what happened (Chapter 1's decision log, Chapter 7's slices). This is the recurring tension of routing signals: the cheap signals are weak, and the strong signals require infrastructure. The length heuristic is popular precisely because it is the only free and immediate signal, and teams reach for it before they have built the infrastructure for better ones.
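As a rough illustration of what "historical slice performance" means in code, the sketch below computes the cheap model's pass rate per slice from a logged list of outcomes. The log schema (slice, model, passed), the min_samples cutoff, and the rows themselves are assumptions invented for the example; a real decision log will have its own shape.

```python
from collections import defaultdict


def cheap_success_by_slice(log, min_samples=3):
    """Pass rate of the cheap model per slice, skipping slices with too little data."""
    counts = defaultdict(lambda: [0, 0])  # slice -> [passes, total]
    for row in log:
        if row["model"] != "cheap":
            continue  # only cheap-model outcomes tell us whether cheap is safe
        counts[row["slice"]][0] += int(row["passed"])
        counts[row["slice"]][1] += 1
    return {s: ok / n for s, (ok, n) in counts.items() if n >= min_samples}


# Invented log rows for illustration.
LOG = (
    [{"slice": "log_grep", "model": "cheap", "passed": True}] * 4
    + [{"slice": "arithmetic", "model": "cheap", "passed": False}] * 3
    + [{"slice": "arithmetic", "model": "cheap", "passed": True}]
    + [{"slice": "legal", "model": "cheap", "passed": False}]       # too few samples
    + [{"slice": "arithmetic", "model": "flagship", "passed": True}]  # ignored
)

rates = cheap_success_by_slice(LOG)
print(rates)
```

A slice with a low cheap-model pass rate is evidence of difficulty no matter how short its prompts are, and a slice with too few samples returns nothing, which is the infrastructure cost the table refers to: the signal exists only where outcomes were logged.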

A test that catches the length trap

Because the length trap is silent, you need a test that surfaces it. Build a small adversarial difficulty set of requests deliberately chosen to break the length-difficulty correlation (long-but-trivial and short-but-hard), then check that your difficulty estimator, and the routes it produces, handles them correctly. This is a direct application of the discipline in the OpenAI evals guide of building task-specific evals that probe the failure mode you care about, not just average performance.

# A fixture that specifically attacks the length-as-difficulty assumption.
length_trap_cases = [
    # (prompt, true_difficulty, note)
    (long_log_grep, "easy", "10k tokens, mechanical pattern match"),
    (long_thread_summary, "easy", "6k tokens, single-sentence extraction"),
    (prime_check_short, "hard", "11 tokens, real arithmetic"),
    (legal_clause_short, "hard", "short, needs careful reasoning"),
    (threadsafe_refactor, "hard", "short input, hard correctness"),
]

def test_difficulty_estimator_ignores_length(estimator):
    errors = []
    for prompt, truth, note in length_trap_cases:
        pred = "hard" if estimator.predict(prompt) > 0.5 else "easy"
        if pred != truth:
            errors.append((note, pred, truth))
    # A length-based estimator will get ALL of these backwards.
    assert not errors, f"length trap not handled: {errors}"

If your difficulty estimator is secretly just a length threshold, this test fails on every row, in both directions, which is exactly the diagnostic you want: it tells you your "difficulty" signal is a cost signal wearing a costume.
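The diagnostic can be reproduced end to end with two toy estimators. This is a sketch under assumptions: cases are dictionaries with illustrative token counts and hand-assigned labels, the length threshold is set to 5,000 so that every long case sits above it, and the HARD_TASKS set is a made-up task-type prior, not a recommended classifier.

```python
def run_length_trap(predict, cases):
    """Return the notes of cases where the estimator disagrees with the label."""
    errors = []
    for case in cases:
        pred = "hard" if predict(case) > 0.5 else "easy"
        if pred != case["truth"]:
            errors.append(case["note"])
    return errors


TRAP_CASES = [
    {"n_tokens": 10_000, "task": "grep", "truth": "easy", "note": "long log grep"},
    {"n_tokens": 6_000, "task": "summarize", "truth": "easy", "note": "long thread summary"},
    {"n_tokens": 11, "task": "arithmetic", "truth": "hard", "note": "short prime check"},
    {"n_tokens": 30, "task": "legal", "truth": "hard", "note": "short legal clause"},
]


def length_only(case, threshold=5_000):
    """A 'difficulty' estimator that is secretly a length threshold."""
    return 1.0 if case["n_tokens"] > threshold else 0.0


HARD_TASKS = {"arithmetic", "legal", "code"}  # hypothetical task-type prior


def task_type_only(case):
    """A crude estimator that looks at task type and ignores length entirely."""
    return 0.9 if case["task"] in HARD_TASKS else 0.1


length_errors = run_length_trap(length_only, TRAP_CASES)
task_errors = run_length_trap(task_type_only, TRAP_CASES)
print("length-only errors:", length_errors)
print("task-type errors:  ", task_errors)
```

The length-only estimator is wrong on every row and the task-type estimator on none. The task-type prior is far too crude for production, but the contrast shows what the fixture measures: whether the estimator's output moves with the task or with the input size. A threshold that falls between the long cases would fail fewer rows, so choose trap cases that straddle whatever threshold your router actually uses.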

Chapter summary

Prompt length is the first routing signal every team reaches for and the wrong one for difficulty, because it conflates input size with task complexity: two loosely correlated but distinct things. The error is two-sided: long-but-easy requests (log greps, thread summaries, sentiment over many reviews) get routed to expensive models for nothing, the false-expensive error, and short-but-hard requests (prime checks, multi-step word problems, legal-clause reasoning, thread-safety refactors) get routed to cheap models that fail them silently, the dangerous false-cheap error. RULER is the clean demonstration that length and difficulty are orthogonal: at a fixed length, performance swings wildly as task difficulty changes, so you cannot read difficulty off length. The fix is not to discard length but to use it for what it actually predicts, cost (length is cost) and eligibility/capability-at-length (models degrade or refuse near their effective limit), while sourcing difficulty from the signals that carry it: task type, historical per-slice performance (strongest, but needs logging), reasoning depth, ambiguity, domain, and tool/retrieval needs, with length at the bottom of that list. Because the false-cheap error is invisible at routing time, build an adversarial length-trap fixture of long-but-easy and short-but-hard cases; a difficulty estimator that is secretly a length threshold fails every row, which is the diagnostic that tells you your difficulty signal is a cost signal in costume.

Internal map

For the larger argument, keep this chapter connected to Model Routing, The Economics of Inference, the smaller-model margin argument, and A Field Guide to Evals.