Alpesh Nakrani · May 15, 2026 · 12 min

How to Reduce LLM Inference Cost Without Wrecking Quality

Reduce LLM inference cost by right-sizing the model, caching what repeats, quantizing, trimming tokens, and batching. Here is the order to pull those levers, and what each one actually saves.

You reduce LLM inference cost by pulling five levers in roughly this order: route most traffic to the smallest model that clears the bar, cache the parts of your prompt that repeat, quantize the models you self-host, cut the tokens you send and generate, and batch the work that can wait. Underneath all of it sits one decision that sets the ceiling on the rest: whether you adapt the model through RAG or fine-tuning. Get the order right and most teams find their bill was 2 to 5 times bigger than the workload actually required.

I have sat in the billing review where the inference line went up and to the right faster than revenue, and I have watched a team panic-shop for a cheaper model when the real problem was that every request carried a 6,000-token system prompt nobody had read in months. The cost was never the model. It was the design around the model. This piece is the hub for that design. Each lever below gets a short version here and a full deep dive of its own, because the levers compound and the order you pull them in matters more than any single one.

The cost is rarely the model. It is the design around the model, and most of that design was set on a Tuesday by someone who has since moved teams.

Key takeaways

If you read nothing else, these are the load-bearing claims:

  • Output tokens are the expensive ones. On most frontier APIs in 2026, output is priced 5x to 6x higher than input, so the cheapest token is the one you never generate.
  • Routing is the lever with the most upside for most teams. Sending the easy 80% of requests to a small model and escalating only the hard tail routinely cuts spend more than any model swap.
  • Prompt caching is nearly free money on stable prefixes. Anthropic prices a cache read at 0.1x the base input rate, a 90% discount on the cached portion (Claude pricing).
  • Quantization buys roughly 2x to 4x memory savings on self-hosted models, but the quality drop is task-specific, so you measure before you ship it.
  • The metric that should govern all of this is cost per resolved task, not cost per token. A cheaper call that fails more often is not cheaper once a human cleans up the mess.

What actually drives your LLM inference cost

Before you optimize anything, you have to know what you are paying for. Four drivers explain almost every inference bill, and three of the four are things you control with code, not contracts.

Model tier. The headline number. A frontier model costs several times what a capable smaller one does. As of mid-2026, Anthropic prices Claude Opus 4.5 at $5 per million input tokens and $25 per million output, Sonnet at $3 and $15, and Haiku 4.5 at $1 and $5 (Claude pricing). OpenAI lists GPT-5.5 at $5 input and $30 output per million (OpenAI pricing). The spread between the cheapest and most expensive model in a single vendor's lineup is 5x or more, before you do anything clever.

Output tokens. Notice that in every pricing line above, output costs 5x to 6x what input does. That asymmetry should change how you write prompts. A verbose model that "thinks out loud" for 800 tokens to answer a yes-or-no question is burning the most expensive resource you buy. Constraining output length is one of the few changes that cuts cost and latency at the same time.
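To make the asymmetry concrete, here is a minimal sketch that prices one call from its token counts. The prices are the list figures quoted above; the model keys, the 500-token input, and the 20-token "terse" answer are assumptions chosen for illustration.

```python
# USD per million tokens (input, output), copied from the figures quoted above.
# The dictionary keys are illustrative labels, not official API model IDs.
PRICES = {
    "opus-4.5": (5.00, 25.00),
    "sonnet": (3.00, 15.00),
    "haiku-4.5": (1.00, 5.00),
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD list-price cost of a single call."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Same question, same model: an 800-token ramble versus a 20-token answer.
verbose = call_cost("sonnet", input_tokens=500, output_tokens=800)
terse = call_cost("sonnet", input_tokens=500, output_tokens=20)
print(f"verbose: ${verbose:.4f}  terse: ${terse:.4f}")
```

With these assumed counts the verbose call costs several times the terse one, and almost all of the difference is output.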

Context length. Every token you stuff into the prompt is a token you pay for on every single call. That 6,000-token system prompt I mentioned is not a one-time cost; it is a tax on every request for the life of the feature. Long context is sometimes worth it. It is rarely worth it by accident.

Multi-step agents. This is the driver that surprises people. A single agent task can fan out into ten, twenty, fifty model calls, each carrying the growing conversation as context. The per-call price looks cheap. The per-task price is what lands on your invoice, and for agentic workloads it can be 50x the cost of a single completion. If you are building agents, read how to cut the tokens an agent burns before you scale the loop.
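The fan-out arithmetic is easy to underestimate, so here is a rough sketch of it. It assumes every step replays the whole transcript and that each step adds a fixed number of tokens to it; the 2,000-token base prompt, 300 tokens per step, and 20 steps are made-up inputs.

```python
def agent_input_tokens(steps: int, base_prompt: int, added_per_step: int) -> int:
    """Total input tokens billed when every step replays the full transcript."""
    total = 0
    transcript = base_prompt
    for _ in range(steps):
        total += transcript           # this call pays for everything so far
        transcript += added_per_step  # the step's output and tool results join the context
    return total


single_call = agent_input_tokens(steps=1, base_prompt=2_000, added_per_step=300)
agent_task = agent_input_tokens(steps=20, base_prompt=2_000, added_per_step=300)
print(single_call, agent_task, agent_task / single_call)
```

Under those assumptions a 20-step task bills 97,000 input tokens against 2,000 for a single call, roughly 48x, and the total grows quadratically with the number of steps.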

If your inference bill is already a board-level line item and you would rather have engineers who have shipped this before than learn it on production traffic, that is the work the Devlyn team does. Now to the levers, in the order I would pull them.

Lever 1: Right-size the model, then route

The single biggest lever, and the one most teams reach for last, is using a smaller model. Not because small models are trendy, but because most production requests are easy, and you are paying frontier prices to answer easy questions. The discipline is to find the smallest model that clears the bar for your task, make it the default, and escalate to a bigger model only when the small one genuinely cannot cope.

That escalation pattern is called a cascade, or model routing. You run every request through a cheap model first, check the output against a confidence or quality signal, and pass it up the chain only on failure. Most requests never escalate, so you pay top-tier prices only for the hard tail. The routing logic itself is cheap, and even crude rules capture most of the savings. I covered the full design space in the guide to LLM model routing, and the case for defaulting small in the CRO's case for shipping smaller models.
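A cascade can be sketched in a few lines. Everything here is hypothetical scaffolding: the `Answer` type, the idea that each model returns a confidence score in [0, 1], the 0.8 threshold, and the two stub models standing in for real API calls. How you get a usable confidence signal is the hard part and depends on your workload.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class Answer:
    text: str
    confidence: float  # assumed to be in [0, 1]; the source of this signal is up to you


# A "model" is any callable from prompt to Answer (hypothetical interface).
Model = Callable[[str], Answer]


def cascade(prompt: str, tiers: Sequence[Tuple[str, Model]],
            threshold: float = 0.8) -> Tuple[str, Answer]:
    """Try tiers cheapest-first; return the first answer that clears the threshold."""
    if not tiers:
        raise ValueError("need at least one tier")
    for name, model in tiers:
        answer = model(prompt)
        if answer.confidence >= threshold:
            return name, answer
    # Nothing cleared the bar: fall back to the last (largest) tier's answer.
    return name, answer


def small_model(prompt: str) -> Answer:
    # Toy stand-in: pretends to be confident only on short prompts.
    return Answer("small-model answer", 0.9 if len(prompt) < 40 else 0.3)


def large_model(prompt: str) -> Answer:
    return Answer("large-model answer", 0.95)


tiers = [("small", small_model), ("large", large_model)]
easy_tier, _ = cascade("Reset my password", tiers)
hard_tier, _ = cascade("Compare clause 4 of these two contracts and explain the conflict", tiers)
print(easy_tier, hard_tier)
```

Note that an escalated request pays for both calls, which is why the escalation rate is the number to watch when you tune the threshold.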

The two trade-offs to name honestly: routing adds a moving part that can break, and a badly tuned router that escalates too often leaves you paying for the big model anyway, plus the cost and latency of the small-model attempt that ran first. You tune it against real traffic, not a hunch. If you want the longer treatment, In Defense of Small Models and Model Routing are the two book chapters I point teams to most often.

Lever 2: Cache what repeats, by prefix and by meaning

A surprising fraction of what you send a model is identical from call to call: the system prompt, the tool definitions, the few-shot examples, the policy document. You are paying to reprocess all of it every single time, and you do not have to.

Prompt caching stores the stable prefix of your request so later calls read it from cache instead of recomputing it. The pricing is hard to argue with: Anthropic charges a cache read at 0.1x the base input rate, which is a 90% discount on the cached tokens, with a small write premium the first time (Claude pricing). For any feature with a large fixed preamble, this is close to free money, and I walk through the breakeven math in the prompt caching guide.
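A back-of-envelope version of the breakeven looks like the sketch below. The 0.1x read multiplier is the figure quoted above. The 1.25x write multiplier is an assumption standing in for the "small write premium", so check your vendor's current pricing. The sketch also assumes every call after the first hits the cache, which ignores cache expiry.

```python
def prefix_cost(calls: int, prefix_tokens: int, input_price_per_mtok: float,
                cached: bool, read_mult: float = 0.1, write_mult: float = 1.25) -> float:
    """USD spent on the fixed prefix across `calls` requests.

    write_mult is an ASSUMED write premium; read_mult is the quoted cache-read rate.
    Assumes one cache write followed by cache hits on every later call.
    """
    per_call = prefix_tokens * input_price_per_mtok / 1_000_000
    if not cached or calls == 0:
        return calls * per_call
    return per_call * (write_mult + (calls - 1) * read_mult)


# A 5,000-token preamble at $3 per million input tokens (illustrative numbers).
for n in (1, 2, 100_000):
    plain = prefix_cost(n, 5_000, 3.0, cached=False)
    hit = prefix_cost(n, 5_000, 3.0, cached=True)
    print(f"{n:>7} calls: uncached ${plain:,.2f}  cached ${hit:,.2f}")
```

Under these assumptions caching costs slightly more on a single call and is already ahead by the second, so the real question is whether your calls arrive often enough to keep the cache warm.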

Semantic caching goes one step further. Instead of matching identical prefixes, it matches requests that mean the same thing, so a question phrased two different ways can return one cached answer without a model call at all. The savings can be large on high-repetition workloads like support and search, but the risk is real too: a too-loose similarity threshold serves a stale or wrong answer to a question that only looked similar. The discipline lives in the semantic caching guide, and the broader question of what belongs in context at all is the subject of the Context Windows chapter.
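The threshold risk is easier to see in code. This is a toy sketch: the bag-of-words `embed` function is a stand-in for a real embedding model, and the class name, the linear scan, and the 0.8 and 0.6 thresholds are all assumptions for illustration.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self.entries: List[Tuple[Counter, str]] = []

    def put(self, query: str, answer: str) -> None:
        self.entries.append((embed(query), answer))

    def get(self, query: str) -> Optional[str]:
        """Return the best cached answer if it clears the threshold, else None."""
        q = embed(query)
        best_score, best_answer = 0.0, None
        for vec, answer in self.entries:
            score = cosine(q, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


strict, loose = SemanticCache(0.8), SemanticCache(0.6)
for cache in (strict, loose):
    cache.put("how do i reset my password", "Use the reset link.")

print(strict.get("how do i reset my password please"))  # a rephrasing: cache hit
print(strict.get("how do i delete my account"))         # different intent: miss
print(loose.get("how do i delete my account"))          # loose threshold: wrong answer served
```

The last lookup is the failure mode described above: at the looser threshold, a question about deleting an account gets the password-reset answer because the wording overlaps.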

Lever 3: Compress the model with quantization

If you self-host, quantization is the lever that lowers the floor under everything else. It stores the model's weights at lower numerical precision, dropping from 16-bit to 8-bit or 4-bit, which shrinks memory footprint by roughly 2x to 4x and lets the same model run on cheaper hardware, often faster.
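The 2x to 4x figure falls straight out of the arithmetic on weight storage. A rough sketch, using a hypothetical 70-billion-parameter model and counting weights only (no KV cache, activations, or quantization metadata):

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GB (10^9 bytes); weights only."""
    return params_billions * bits_per_weight / 8


for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(70, bits):.0f} GB")
```

Real deployments land somewhat above these numbers because of the overheads the sketch ignores, but the ratio between precisions is what decides which GPU the model fits on.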

The honest caveat is the one teams skip: quality degradation from quantization is task-specific. On many tasks the drop falls inside the noise of your eval suite and you would never notice. On tasks that lean on careful numerical reasoning, long-context faithfulness, or fine-grained instruction following, the drop can be real and material. The rule is simple and non-negotiable: run your evals at full precision and at the quantized precision, compare by failure mode, and only then decide. The savings are frequently real. So is the quality drop. The only way to know which dominates for your task is to measure, and I cover how in the LLM quantization guide.
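"Compare by failure mode" can be as simple as splitting pass rates by eval category instead of looking at one aggregate score. The sketch below assumes your eval harness emits (category, passed) pairs; the category names, the 2-point tolerance, and the sample results are invented for illustration.

```python
from collections import defaultdict


def pass_rates(results):
    """results: iterable of (category, passed) pairs -> {category: pass rate}."""
    totals, passes = defaultdict(int), defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {c: passes[c] / totals[c] for c in totals}


def regressions(full, quantized, tolerance=0.02):
    """Categories where the quantized pass rate drops by more than `tolerance`."""
    base, quant = pass_rates(full), pass_rates(quantized)
    return {
        c: round(base[c] - quant.get(c, 0.0), 4)
        for c in base
        if base[c] - quant.get(c, 0.0) > tolerance
    }


# Invented results: chat holds up, numerical reasoning does not.
full_precision = [("math", True)] * 10 + [("chat", True)] * 10
quantized_run = [("math", True)] * 7 + [("math", False)] * 3 + [("chat", True)] * 10
flagged = regressions(full_precision, quantized_run)
print(flagged)
```

An aggregate score over this invented data would show a modest dip; the per-category view shows that all of it sits in one place.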

Lever 4: Cut the tokens you send and generate

This is the unglamorous lever that pays back fastest, because it costs you nothing in infrastructure. Most prompts carry tokens that earn nothing. Industry write-ups in 2026 estimate that the average API call wastes a large share of its input tokens on context the model does not need, and you pay for every wasted token twice, once in the bill and once in the latency (Morph, LLM inference optimization).

The work here is mundane and effective: prune the system prompt to what the model actually uses, drop few-shot examples once a fine-tune or a better instruction makes them redundant, summarize conversation history instead of replaying it verbatim, and cap output length to what the task needs. Each change is small. Together they routinely take a meaningful bite out of the bill with zero quality cost, because you are removing tokens that were never doing any work. The full method, including how to find the dead weight in an agent loop, is in the token optimization guide.
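As one example of the history step, here is a minimal sketch that keeps only the most recent messages that fit a token budget. The whitespace token count is a crude stand-in for a real tokenizer, and the bracketed summary line is a placeholder where a model-written summary would go; both are assumptions.

```python
def trim_history(messages, max_tokens, count_tokens=lambda text: len(text.split())):
    """Keep the newest messages that fit in max_tokens, newest-first greedy.

    count_tokens defaults to a crude whitespace count; swap in your tokenizer.
    The placeholder summary line is not counted against the budget.
    """
    kept, used = [], 0
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()
    dropped = len(messages) - len(kept)
    if dropped:
        # Placeholder: a real system would insert a model-written summary here.
        kept.insert(0, f"[summary of {dropped} earlier messages]")
    return kept


history = ["a b c", "d e", "f g h i", "j"]
trimmed = trim_history(history, max_tokens=5)
print(trimmed)
```

The same budget idea applies to system prompts and few-shot examples: decide the number first, then make the content earn its place inside it.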

You pay for every wasted token twice: once in the bill, and once in the latency your user sits through.

Lever 5: Batch and schedule the work that can wait

Not every request needs an answer in 800 milliseconds. Overnight enrichment, bulk classification, evaluation runs, content generation pipelines: these are throughput problems, not latency problems, and they should be priced like it.

On hosted APIs, the Batch endpoint is the lever. Anthropic and OpenAI both offer roughly a 50% discount on input and output for asynchronous batch processing, the trade being that you accept a longer turnaround in exchange for half-price tokens (Claude pricing). If a workload does not need to be synchronous, leaving it on the real-time endpoint is leaving half the bill on the table.
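The saving applies only to the share of traffic that can actually wait, which is worth working out before you promise anyone half off. A small sketch, using the roughly 50% discount quoted above and an invented $10,000 monthly spend with 40% of it deferrable:

```python
def blended_spend(realtime_spend: float, deferrable_share: float,
                  batch_discount: float = 0.5) -> float:
    """Spend after moving the deferrable share of traffic to a batch endpoint."""
    if not 0.0 <= deferrable_share <= 1.0:
        raise ValueError("deferrable_share must be between 0 and 1")
    return realtime_spend * (1.0 - deferrable_share * batch_discount)


after = blended_spend(10_000, deferrable_share=0.4)
print(f"${after:,.2f}")
```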

If you self-host, the equivalent lever is continuous batching at the serving layer. Instead of waiting for a whole batch to finish before starting the next, the server injects new requests into the compute stream as slots free up. Anyscale's benchmarks reported up to 23x throughput over naive batching with vLLM and PagedAttention, and consistently better latency across percentiles (Anyscale). The headline 23x is a best case; the everyday gain on mixed traffic is more like 2x to 4x, which is still the difference between one GPU and three.
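A toy simulation shows why refilling slots helps. It counts decode steps only and ignores prefill, memory limits, and everything PagedAttention does, so treat it as an illustration of the scheduling idea rather than a model of vLLM. The request lengths and the batch size of 4 are invented.

```python
import heapq


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes, then the next batch starts."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))


def continuous_batching_steps(lengths, batch_size):
    """A freed slot is refilled immediately by the next queued request."""
    slots = [0] * batch_size  # step at which each slot next becomes free
    heapq.heapify(slots)
    for length in lengths:
        free_at = heapq.heappop(slots)  # earliest slot to free up
        heapq.heappush(slots, free_at + length)
    return max(slots)


# Two long generations mixed in with short ones (output lengths in decode steps).
requests = [100, 10, 10, 10, 100, 10, 10, 10]
static_steps = static_batching_steps(requests, batch_size=4)
continuous_steps = continuous_batching_steps(requests, batch_size=4)
print(static_steps, continuous_steps)
```

On this invented mix the static scheduler needs 200 steps and the continuous one 110, because short requests no longer sit idle waiting for the long one in their batch.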

RAG vs fine-tuning: the cost decision behind adaptation

Underneath the five levers is a more fundamental choice that sets the cost ceiling for the whole system: how you adapt a general model to your specific knowledge and task. The two paths, retrieval-augmented generation and fine-tuning, have very different cost shapes, and picking the wrong one taxes every request forever.

RAG keeps the model general and feeds it relevant context at query time. Cheap to set up and easy to update, but it inflates every prompt with retrieved chunks, so it raises your per-call input cost permanently. Fine-tuning bakes the knowledge or behavior into the weights. It costs more upfront and is slower to update, but it can shrink your prompts dramatically and let a smaller model do a bigger model's job, which compounds with Lever 1. The right answer depends on how often your knowledge changes and how repetitive your task is, and I work through the decision in RAG vs fine-tuning, with the longer versions in RAG That Survives Contact and To Fine-Tune or Not. If you want a team to build the retrieval layer rather than learn it the hard way, Devlyn does RAG and knowledge integration as a core practice.
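One way to frame the cost side of the decision is a breakeven count: how many calls before fine-tuning's upfront cost is repaid by cheaper requests. The sketch below is cost-only and every number in it is invented. It also ignores retraining when your knowledge changes, which is often what tips the answer back toward RAG.

```python
import math


def breakeven_calls(upfront_cost: float, rag_cost_per_call: float,
                    finetuned_cost_per_call: float):
    """Calls needed before the upfront fine-tuning cost is repaid; None if never."""
    saving = rag_cost_per_call - finetuned_cost_per_call
    if saving <= 0:
        return None  # fine-tuning never pays back on per-call cost alone
    return math.ceil(upfront_cost / saving)


# Invented figures: $5,000 to fine-tune, $0.012 per RAG call, $0.002 per fine-tuned call.
calls = breakeven_calls(5_000, rag_cost_per_call=0.012, finetuned_cost_per_call=0.002)
print(calls)
```

If your monthly volume clears that count quickly and the knowledge is stable, the upfront spend is easy to justify; if the knowledge changes weekly, multiply the upfront cost by the number of retrains before you compare.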

A decision table you can pull from

Here is the same set of levers in one place: when to reach for each, roughly what it saves, and how much work it is to ship. Treat the savings figures as directional, not contractual, because the real number depends entirely on your traffic shape.

| Lever | When to use it | Typical savings | Effort |
| --- | --- | --- | --- |
| Route to smaller model (cascade) | Most requests are easy; a hard tail needs the big model | Large; often the biggest single win | Medium |
| Prompt caching | Large fixed prefix reused across calls (system prompt, docs, tools) | Up to 90% on the cached portion (vendor-priced) | Low |
| Semantic caching | High-repetition queries (support, search, FAQ) | Large on repetitive traffic; watch the staleness risk | Medium |
| Quantization (self-host) | You run your own models and have GPU headroom to reclaim | ~2x to 4x memory; meaningful cost cut if quality holds | Medium |
| Token / prompt trimming | Always; especially agent loops with growing context | Small per change, compounds; near-zero quality cost | Low |
| Batch API (async) | Work that tolerates delay (enrichment, evals, bulk jobs) | ~50% on input and output (vendor-priced) | Low |
| Continuous batching (self-host) | You serve your own models at meaningful concurrency | 2x to 4x throughput typical, up to ~23x best case | Medium |
| RAG vs fine-tuning choice | Adapting a general model to your knowledge or task | Sets the ceiling for every other lever | High |

Three illustrative cost stories

These are composite scenarios, not specific clients, built from patterns I have seen repeat. The numbers are illustrative and rounded to make the mechanics legible, not pulled from a live system.

The support bot that forgot to cache. A team runs a support assistant on a frontier model, every request carrying a 5,000-token policy preamble. At a few hundred thousand calls a month, the preamble alone is most of the bill. They turn on prompt caching, the preamble becomes a cache read at 0.1x, and the input portion of the bill drops by most of its value overnight. No model change, no quality change, one config field. This is the lever I check first because it is the cheapest to try and the most often skipped.

The classifier on the wrong tier. An intake flow uses a top-tier model to classify incoming messages into one of eight categories, because that is the model someone wired in during the prototype and nobody revisited it. The task is easy; a small model handles it at near-identical accuracy after a light fine-tune. Moving classification to the small model and reserving the frontier model for the genuinely ambiguous cases cuts the per-task cost by an order of magnitude, and the latency improves as a bonus. The prototype default was quietly the most expensive decision in the system.

The agent that fanned out. An autonomous research agent makes, on average, thirty model calls per task, each replaying the full growing transcript as context. The per-call price looks trivial. The per-task price is anything but, and it scales with usage in the worst possible way. Summarizing the transcript between steps instead of replaying it verbatim, plus capping each step's output, takes a large bite out of per-task cost with no loss in the final answer. The fix was in the token budget, not the model.

The metric that should govern LLM inference cost: cost per resolved task

Every lever above optimizes cost. The thing you should actually be optimizing is cost per resolved task: total inference spend divided by the number of tasks the system fully handled without a human finishing the job. Cost per token is what the model charges you. Cost per resolved task is what the model costs you, and only one of those is on the P&L.

The trap is optimizing the wrong number. A cheaper model that resolves 70% of tickets can easily cost more than a pricier one that resolves 90%, once you price in the human cleaning up the other 30%. The token cost per call dropped; the all-in cost per resolved task went up. I make the full case in the complete guide to LLM evaluation, and the tooling to measure it in the evaluation tools guide. Tracking this in production, rather than in a one-off spreadsheet, is squarely an AI observability and monitoring problem.
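Here is the 70% versus 90% comparison as arithmetic. The function adds a human cleanup cost per failed task on top of inference spend; setting it to zero gives the inference-only definition above. The per-call prices, the $4 cleanup cost, and the 1,000-task volume are invented.

```python
def cost_per_resolved_task(tasks: int, cost_per_call: float, resolution_rate: float,
                           human_cost_per_failure: float = 0.0) -> float:
    """(Inference spend + human cleanup) divided by tasks resolved without a human."""
    resolved = tasks * resolution_rate
    if resolved <= 0:
        raise ValueError("no resolved tasks to divide by")
    failures = tasks - resolved
    total = tasks * cost_per_call + failures * human_cost_per_failure
    return total / resolved


# Invented figures: the cheap model is 5x cheaper per call but resolves fewer tickets.
cheap = cost_per_resolved_task(1_000, cost_per_call=0.01, resolution_rate=0.70,
                               human_cost_per_failure=4.0)
pricey = cost_per_resolved_task(1_000, cost_per_call=0.05, resolution_rate=0.90,
                                human_cost_per_failure=4.0)
print(f"cheap model: ${cheap:.2f}  pricier model: ${pricey:.2f}")
```

With these assumed numbers the cleanup cost dwarfs the token bill, so the model that is cheaper per call comes out several times more expensive per resolved task. The gap depends entirely on what a failure costs you, which is why that number belongs in the calculation.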

Cost per token is what the model charges you. Cost per resolved task is what the model costs you. Optimize the second one.

This is also where cost optimization stops being an engineering hobby and becomes a margin argument. When you can say "we resolve the same share of tasks for 40% less spend," you have turned an infrastructure project into a sentence a CFO will fund. That is the sentence most teams cannot say, because they measured tokens and never measured resolution.

Frequently asked questions

How do I reduce LLM inference cost the fastest? Start with the two cheapest levers to ship: turn on prompt caching for any large fixed prefix, and trim the dead tokens out of your prompts. Both are low-effort and low-risk. Then route the easy majority of requests to a smaller model, which is usually the single biggest win but takes more tuning. Batch anything that can tolerate a delay for the vendor's roughly 50% async discount.

Why are output tokens more expensive than input tokens? Generating tokens is sequential and compute-heavy: the model produces them one at a time, each conditioned on all the ones before. Processing input can be parallelized far more. That is why vendors price output 5x to 6x higher than input, and why capping output length is one of the few changes that cuts both cost and latency at once.

Does a smaller model always cost less to run? Per token, yes. Per resolved task, not always. If a small model fails often enough that a human has to finish the job, the cleanup cost can erase the token savings. That is why you measure cost per resolved task, not cost per token, and why a routed cascade beats a flat downgrade: it keeps the big model available for the hard cases that would otherwise fail.

Should I optimize cost before I have an eval suite? No. Every cost lever is a trade against quality, and without evals you cannot see the quality side of the trade. Quantize without evals and you ship a silent regression. Route without evals and you escalate the wrong cases. Build the eval harness first, then optimize against it so each change shows its true cost in both dollars and quality.

If you want the full operator's playbook on adapting and running models economically, the book chapters on small models, routing, and pricing intelligence go deeper than any single article can. And if you would rather have a team build the routing, caching, and observability into your stack from day one instead of retrofitting it after the bill scares someone, that is exactly what Devlyn's engineers do. Measure cost per resolved task. Pull the levers in order. Ignore the leaderboard.
