LLM Token Optimization: Cut Token Cost, Keep Quality

LLM token optimization means cutting the tokens you send and generate, in that order of payoff. Start with output, because output is priced 5x to 6x higher than input.

LLM token optimization is the work of reducing token usage on every call so the bill tracks the workload instead of your prompt habits. You cut tokens in two places: the input you send and the output you generate. Start with output, because on most frontier APIs output is priced 5x to 6x higher than input, so the cheapest token you will ever buy is the one you never make the model write (Claude pricing). After that, trim the input nobody reads, compress what you must send, and stop replaying context the model already saw.

I have sat in the billing review where the inference line outran revenue, and the reflex in the room is always to shop for a cheaper model. The model is rarely the problem. The problem is a 6,000-token system prompt nobody has read in months, an agent replaying its full transcript on every step, and a model narrating its reasoning for 800 tokens to answer a yes-or-no question. This piece is the deep dive on cutting that waste, one lever in the larger guide to reducing LLM inference cost, and the lever that pays back fastest because it costs you nothing in infrastructure.

You pay for every wasted token twice: once on the invoice, and once in the latency your user sits through.

Key takeaways

If you read nothing else, these are the load-bearing claims:

Output is the expensive token. Vendors price output 5x to 6x higher than input because each output token is a fresh sequential forward pass, so capping output is the single biggest-payoff cut.
Input trimming is free money. Pruning dead system-prompt text, redundant few-shot examples, and over-retrieved context cuts the bill with near-zero quality cost, because those tokens were never doing work.
Prompt compression pays off at scale, not by default. Tools like LLMLingua hit 2x to 20x compression, but they add a step and a quality risk, so they earn their place only on large, repeated prompts.
Agents leak tokens per task, not per call. A loop that replays its growing transcript turns a cheap per-call price into an expensive per-task one; summarize between steps instead of replaying.
Measure cost per resolved task, not cost per token. A compression that quietly drops quality is not a saving once a human has to finish the job.

Where your tokens actually leak

Before you cut anything, find the leak. Token cost has two faucets, and most teams only watch one. The first is input: every token in the system prompt, the few-shot examples, the retrieved context, and the replayed conversation history, all of which you pay for on every single call. The second is output: every token the model generates, priced several times higher than input.

The asymmetry is the whole game. As of mid-2026, Anthropic prices Claude Opus 4.5 at $5 per million input tokens and $25 per million output, Sonnet at $3 and $15, and Haiku 4.5 at $1 and $5 (Claude pricing). That is a 5x ratio in every line. The reason is mechanical: input tokens get processed in one parallel pass, while each output token requires its own full forward pass through the model, generated one at a time (CodeAnt, token economics).

So the order of attack writes itself. Cut output first, because each token there is worth five of the input kind, then trim input, because there is usually a lot of it and most of it is dead weight. If you are building an AI feature and want a team to instrument this properly rather than guess, that is the kind of work the Devlyn engineering team does day to day. The rest of this piece walks each faucet in turn.

Trim the input: system prompts, few-shot, and retrieval

Input trimming is the unglamorous lever that pays back fastest, because you are removing tokens that earned nothing. Industry write-ups in 2026 estimate that naive full-context and naive RAG pipelines run 3x to 5x higher token cost than the task actually requires (Redis, token optimization). That gap is almost entirely dead weight you can delete without touching quality.

There are three usual suspects. The system prompt grows by accretion: every incident adds a rule, nobody ever removes one, and a year later you are shipping 6,000 tokens of instructions the model half-ignores. Read it line by line and cut anything the model already does without being told. The second suspect is few-shot examples that a better instruction or a fine-tune has made redundant.

The third is over-retrieval. RAG systems default to a top-k that is too generous, stuffing eight chunks into context when two would answer the question. Retrieve less, but retrieve better: a tighter top-k with a reranker beats a wide top-k every time, on both cost and accuracy (Redis). If your retrieval quality is the bottleneck, that is a RAG and knowledge integration problem worth fixing at the source.

One honest trade-off lives here. Few-shot examples and a fat system prompt are often load-bearing on day one, and cutting them blind will tank your accuracy. The discipline is to cut against an eval set, not against a hunch, so you can see the quality move as the tokens drop. Whether to remove few-shot for good usually comes down to the RAG versus fine-tuning decision, since a fine-tune bakes the examples into the weights and lets you delete them from the prompt entirely.

Control the output: the most expensive token is one you never generate

This is where the money is, and it is the lever most teams skip. Because output costs 5x to 6x input, shaving 200 tokens off a response is worth more than shaving 1,000 off the prompt. Yet teams pour effort into the prompt and let the model ramble.

The first move is the bluntest: set max_tokens. It is a hard ceiling, and it stops a model that wants to write an essay from doing it on your invoice. Pair it with a length instruction in the prompt, something as plain as "answer in two sentences," so the model targets brevity instead of getting truncated mid-thought.

The second move is structured output. When you ask for JSON against a schema, the model stops writing prose preambles and apologies and just fills the fields. That alone cuts output, and it makes the response parseable instead of something you have to scrape. There is a small token overhead to the schema itself, but on any volume it pays for itself many times over.

The third move is format choice, and it is underrated. JSON is verbose: all those quotes, braces, and repeated keys are tokens you pay for on every row. For tabular data, a compact format like TSV or TOON can cut the serialized token count by a meaningful margin against the same JSON payload, with no loss of information. The model reads it fine, and your wallet reads it better.

Teams pour effort into the prompt and let the model ramble. Output is where the 5x lives. Cap it first.

Prompt compression: LLMLingua, and when it earns its keep

Prompt compression is the technique people reach for first and need least. The idea is clever: a small model rewrites or prunes your prompt to keep the information the big model needs while dropping the filler. Microsoft's LLMLingua is the reference implementation, a coarse-to-fine method with a budget controller that reaches up to 20x compression with minimal performance loss on its benchmarks (Microsoft LLMLingua).

The follow-up, LLMLingua-2, is task-agnostic and tuned for speed, running 3x to 6x faster than the original at the 2x to 5x compression ratios you would use in production (Microsoft LLMLingua). For long-context RAG specifically, the LongLLMLingua variant reports improved retrieval quality using roughly a quarter of the tokens. The savings are real, and on a large fixed prompt repeated millions of times, they are significant.

Here is the trade-off nobody puts in the headline. Compression adds a step to your pipeline, which means latency and a second model to run and pay for. It also adds a quality risk, because the compressor can drop a token that turns out to matter. The math only works when the prompt is large, repeated, and stable enough that the per-call compression cost is dwarfed by the per-call savings.

For most teams, the cheaper move is to delete the dead tokens by hand first, then cache the stable prefix. Prompt caching reads a cached prefix at 0.1x the input rate, a 90% discount on the cached portion, with no quality risk at all (Claude pricing). Reach for LLMLingua-style compression after you have trimmed and cached and the prompt is still genuinely large.

Manage context over a session: summarize, do not replay

The worst token leaks I have seen do not live in a single call. They live in the loop. A multi-turn chat or an autonomous agent accumulates context, and the lazy implementation replays the entire growing transcript on every step. By turn twenty the model is rereading thousands of tokens it has already seen, and you are paying for every one of them, every step.

The fix is to summarize instead of replay. After each step, compress the transcript into a compact running summary plus the last turn or two verbatim, and pass that forward instead of the raw history. The model keeps the thread; you stop paying input tax on ancient context. This is the core idea behind context management, and it is the difference between an agent loop that scales and one whose per-task cost climbs with every turn.

This is also why the per-call price lies to you on agentic workloads. A single autonomous task can fan out into thirty or fifty model calls, each carrying the conversation, so the per-call number looks trivial. The per-task number is what lands on the invoice, and it can be fifty times a single completion if the context is not managed. Summarizing between steps and capping each step's output is usually where the agent budget gets saved.

If most of your traffic is easy and only the tail is hard, the biggest-payoff move sits alongside this: send the easy majority to a smaller model so you are not paying frontier output rates to generate boilerplate. I make that case in the argument for shipping smaller models, and the routing mechanics live in the guide to LLM model routing. Token cutting and right-sizing the model are the same campaign from two angles. For the deeper theory of what fits in a window and what to evict, my book on context windows walks through the eviction strategies in detail.

A comparison you can paste into a planning doc

Here is the same set of techniques in one table: what each one saves, what it risks, and how much work it is to ship. Token-saving ranges are directional and depend on how bloated your current setup is.

Technique	Token saving	Quality risk	Effort
Cap output (max_tokens + length instruction)	High, on the expensive 5x-6x output side	Low if the cap fits the task; truncation if too tight	Low
Structured output (schema / JSON)	Medium; kills prose preambles	Low; small schema overhead	Low
Compact format (TSV / TOON over JSON)	Medium on tabular payloads	Very low; same information	Low
Trim system prompt and few-shot	Medium; recurs on every call	Medium if cut blind; low if cut against evals	Low
Tighten retrieval (lower top-k + rerank)	Medium to high on RAG	Low; often improves accuracy	Medium
Summarize context (don't replay history)	High in long sessions and agent loops	Medium; summary can drop a detail	Medium
Prompt compression (LLMLingua)	High on large fixed prompts (2x-20x)	Medium; compressor can drop a key token	High

Read the table top to bottom as a rough order of operations. The low-effort, low-risk rows at the top are where you start. Compression sits at the bottom not because it saves the least, but because it costs the most to do right and you should exhaust the free wins first.

Two mini-cases, with the numbers

The numbers below are illustrative, drawn from the shape of real situations rather than any specific live system, so treat them as the order of magnitude, not a quote.

The chatbot that narrated. A support assistant ran on a frontier model with no output cap, and the model habitually opened every answer with a paragraph of throat-clearing before the actual reply. Average output ran near 600 tokens where 150 would do. Adding max_tokens, a "be direct, no preamble" instruction, and a JSON envelope cut average output by roughly two-thirds, with the input untouched. Because output is the 5x token, the bill fell far more than the token count alone suggested.

The agent that replayed itself. An internal research agent averaged around thirty model calls per task, each replaying the full growing transcript as input. The per-call cost looked like rounding error; the per-task cost was the largest single line in the AI budget. Swapping verbatim replay for a running summary plus the last two turns, and capping each step's output, took a large bite out of per-task cost with no change in the final answer quality on the eval set. The fix was in the token budget, not the model.

Measure token cost per task, not per call

Every technique above optimizes tokens. The number you should actually be optimizing is cost per resolved task: total inference spend divided by the tasks the system fully handled without a human stepping in. Cost per token is what the model charges you. Cost per resolved task is what the model costs you, and only one of those shows up on the P&L.

This matters most for compression and aggressive trimming, because both can quietly trade quality for tokens. A cheaper call that fails more often is not cheaper once a human cleans up the mess; the token cost per call dropped while the cost per resolved task went up. The only way to catch that is to track quality and resolution alongside spend, which I cover in the guide to LLM evaluation.

Cost per token is what the model charges you. Cost per resolved task is what the model costs you. Optimize the second one.

Doing this in production, not in a one-off spreadsheet, is an AI observability and monitoring problem. You want token usage, output length, and resolution rate on the same dashboard, so a token cut that hurts quality shows up as a falling resolution rate before a customer finds it for you. The unit economics of all this, why a token saved is a margin point earned, is the subject of my book on pricing intelligence.

Frequently asked questions

What is LLM token optimization? It is the practice of reducing the number of tokens you send to and generate from a language model on each call, so your inference bill tracks the actual work instead of accumulated prompt bloat. The main levers are capping and structuring output, trimming dead input, compressing large fixed prompts, and summarizing context in long sessions instead of replaying it.

How do I reduce token usage the fastest? Cap output with max_tokens and a length instruction, then ask for structured output, because output tokens are priced 5x to 6x higher than input and that is where the payoff is. Next, prune the dead text out of your system prompt and lower your retrieval top-k. Both are low-effort, low-risk, and recur on every call.

Does prompt compression like LLMLingua actually help? Yes, but selectively. LLMLingua reaches high compression ratios with small quality loss on benchmarks, which pays off on large, stable prompts repeated at scale. For most teams the cheaper first move is to delete dead tokens by hand and cache the stable prefix, then reach for compression only if the prompt is still large.

Why are output tokens more expensive than input tokens? Generating output is sequential: the model produces one token at a time, each conditioned on all the ones before, so each output token needs its own full forward pass. Input can be processed in a single parallel batch. That is why vendors price output 5x to 6x higher, and why capping output cuts both cost and latency at once.

If you want the full picture of where inference spend goes and the order to pull every lever, start with the guide to reducing LLM inference cost. And if you would rather have a team trim the tokens, route the traffic, and instrument cost per resolved task in your stack from day one, that is exactly what the Devlyn team builds. Cut the token you never needed. Measure the ones that are left.