Prompt Caching: What It Is and When It Saves Money

Prompt caching reuses the already-computed prefix of a prompt so repeated tokens get billed at a deep discount. Here is when it saves money, and when it does not.

Prompt caching is a way to reuse the part of a prompt the model has already processed, so the repeated tokens get billed at a steep discount instead of full price on every call. It saves money when a large, stable chunk of your prompt repeats across many requests inside the cache window: a long system prompt, a set of documents, your few-shot examples, or the running history of a conversation. If your prompts share almost nothing from one call to the next, it saves you nothing and adds a write fee. The whole game is prefix stability.

I have sat in the inference billing review where the line item kept climbing and nobody could explain why, until we looked at the actual requests and found a 7,000-token system prompt riding along on every single call. We were paying full input price to re-send the same instructions thousands of times an hour. Prompt caching fixed most of that bill in an afternoon. This piece is the deep dive on that lever from my guide to reducing LLM inference cost, where caching is one of five levers worth pulling.

Prompt caching does not make tokens cheaper. It makes you stop paying to re-read the same tokens you already paid to read a second ago.

Key takeaways

If you read nothing else, these are the load-bearing claims:

Prompt caching reuses an exact, contiguous prefix. It is not meaning-based; change one token early in the prompt and the cache for everything after it is gone.
Put the stable content first and the variable content last. Caching only works on the prefix, so a single moving token near the top wipes the whole cache for that request.
On hosted APIs you pay a write fee, then read at roughly 10% of input price. Anthropic charges 1.25x input to write a 5-minute cache and 0.1x to read it, so it pays off after one or two hits.
It saves nothing on cold, one-off prompts. The payoff is hit rate, and hit rate comes from request patterns that repeat a long prefix inside the TTL.
Prompt caching reuses exact text; semantic caching reuses meaning. They solve different problems and you often want both.

What prompt caching is, and what it is not

Prompt caching stores the model's intermediate computation for a chunk of your prompt and reuses it on the next request that starts with the exact same tokens. On a hosted API like Anthropic or OpenAI, you do not see the internals; you mark a prefix as cacheable, and a later request that matches that prefix reads it back at a deep discount instead of being reprocessed. The match has to be exact and contiguous from the start of the prompt. This is why people also call it prefix caching.

The word "prefix" is the whole story. Caching works left to right from the top of your prompt and stops at the first token that differs. If your system prompt and documents are identical but you slipped a timestamp or a request ID into the second line, the cache breaks right there and everything after it is recomputed at full price. The cache is not fuzzy and it is not smart. It is a literal string match on the front of your input.

That exactness is what separates prompt caching from semantic caching, which I cover in its own article. Semantic caching asks "have I seen a question that means roughly the same thing?" and returns a stored answer. Prompt caching asks "have I processed this exact text already?" and skips the recompute. One reuses meaning, the other reuses tokens. They are not competitors; in a real system you often run both, and confusing them leads to caching the wrong layer.

One more boundary worth drawing early: prompt caching is about the input you send, not the output the model generates. It lowers the cost and latency of feeding context in. It does not store or reuse completions. If you want to avoid regenerating an answer you have produced before, that is a response cache or semantic cache, a different tool in the same drawer.

How prefix caching works under the hood

When a transformer reads your prompt, it computes a set of internal tensors called the KV cache, one entry per token, that the model needs in order to attend to everything that came before. Normally that work is thrown away after the response. Prefix caching keeps it. The next request that shares the same opening tokens reuses those tensors instead of recomputing them, which is why the savings show up as both lower cost and faster time-to-first-token.

On self-hosted stacks you can see exactly how this is done. vLLM implements automatic prefix caching with a hash-based block scheme: it splits the KV cache into fixed blocks and hashes each block from its own tokens plus the hash of the prefix before it, so a block is only reused when the entire chain of tokens leading up to it matches. It caches full blocks only, and evicts on a least-recently-used policy when it needs space (vLLM docs). SGLang's RadixAttention does the same job with a radix tree of cached prefixes plus LRU eviction and cache-aware scheduling (SGLang paper).

The mechanics matter because they explain the failure modes. Block-level hashing is why a one-token change near the top of a long prompt is so expensive: it changes the hash of the first block, which changes every dependent block after it, so nothing downstream can be reused. LRU eviction is why a cache entry you expected to be warm can be cold under load: a burst of other traffic pushed your blocks out of memory. None of this is mysterious once you know the cache is a chain of hashed blocks, not a single blob.

# the cache breaks at the first token that differs

# BAD: volatile token sits in the prefix

system: "You are a support agent. Request id: req-8a3f. ...4000 more tokens..."

# every request has a new id, so the prefix never matches -> 0% hit rate

# GOOD: stable content first, volatile content last

system: "You are a support agent. ...4000 stable tokens..."

user: "Request id: req-8a3f. {the actual question}"

# the 4000-token prefix is identical every call -> high hit rate

If you are doing this work and want a team that has built caching into production pipelines rather than bolted it on after the bill arrived, the Devlyn engineering team works on exactly this.

Vendor support and pricing in 2026

Every major provider now supports some form of prompt caching, but the pricing models and TTLs differ enough that you cannot reason about one from another. Here is the verified state as of mid-2026, sourced to each vendor's own documentation. Treat the dollar figures as snapshots; the multipliers are the durable part.

Provider	Cache read discount	TTL	Notes
Anthropic (Claude API)	Read at 0.1x input (90% off); write costs 1.25x (5-min) or 2x (1-hour)	5 minutes (default) or 1 hour	Explicit cache breakpoints; min 1,024 tokens (Sonnet/Opus 4.5+), 4,096 for Haiku 4.5 (Anthropic pricing)
OpenAI (API)	50% off cached input on GPT-4o-era; up to ~90% cost reduction on cache hits for GPT-5 series	~5-10 min idle, up to 1h; Extended caching up to 24h (default 24h on gpt-5.5)	Automatic, no code change; activates at 1,024 tokens, grows in 128-token steps (OpenAI)
Google (Gemini / Vertex)	~75% reduced input price on cached content	Default 1 hour, configurable	Implicit caching automatic; explicit caching min ~32,768 tokens, model-dependent
Self-hosted (vLLM, SGLang)	Cache hit skips recompute entirely; you pay only the GPU you own	Until LRU eviction under memory pressure	Automatic prefix caching on by default in recent vLLM; full-block, hash-based reuse

Two things in that table trip people up. First, Anthropic charges you to write the cache, while OpenAI's automatic caching does not bill a separate write fee. That changes the break-even math: with Anthropic you are betting the prefix gets reused enough to earn back the 1.25x or 2x write premium, whereas with OpenAI the cache is free upside when it hits. Second, the OpenAI discount is not one number. The original GPT-4o announcement was a flat 50% on cached input; the current GPT-5 series quotes up to roughly 90% cost reduction on cache hits. Quote the model you are actually running, not the headline.

When prompt caching actually pays off

The payoff is entirely about hit rate, and hit rate is a property of your traffic, not of the feature being on. Caching a prefix that never repeats costs you the write fee and returns nothing. Caching a 5,000-token system prompt that rides on ten thousand requests an hour is close to free money. Before you turn anything on, ask one question: what large chunk of my prompt is identical across many calls inside the TTL window?

On Anthropic the break-even is concrete and small. A 5-minute cache write costs 1.25x the input price; a read costs 0.1x. So the cache pays for itself after a single hit on the 5-minute TTL, or after two hits on the 1-hour TTL whose write costs 2x (Anthropic pricing). For Claude Sonnet 4.6, that is a 5-minute write at $3.75 per million tokens versus reads at $0.30 per million against the $3.00 base. If your prefix gets read even twice before it expires, you are ahead.

Here is the shape of it in practice, with illustrative numbers rather than any specific client's bill. Take a support assistant with a 4,000-token stable system prompt, running on Sonnet 4.6 at 200,000 calls a month.

# illustrative, not a specific system

prefix_tokens 4000

calls_per_month 200000

# no cache: pay full input on the prefix every call

no_cache 4000 * 200000 * $3.00/M = $2,400/mo

# with cache: ~1 write per 5-min window, rest are reads

writes ~negligible share at $3.75/M

reads 4000 * ~200000 * $0.30/M = ~$240/mo

prefix_savings ~$2,160/mo # ~90% on the cached portion

That is the cached portion only; you still pay full price on the variable tail and the output. But the prefix is usually where the bloat lives, which is why this lever moves the bill. Agent loops are the strongest case I see in production: a multi-step agent re-sends its entire tool definitions, instructions, and growing scratchpad on every step, and almost all of it is an identical prefix step to step. Long-document Q&A, multi-tenant SaaS with a shared system prompt, and repo-wide code assistants all share that shape. Reported hit rates on workloads like these commonly land in the 60 to 85 percent range, which is the difference between a feature you can afford and one you cannot.

The pitfalls: cache misses, TTL, and a prefix that will not sit still

The number one reason teams turn on prompt caching and see no savings is a prefix that will not sit still. A timestamp in the system prompt, a per-request ID injected near the top, a user name interpolated before the instructions, a tools list that reorders itself: any of these changes the prefix and zeroes your hit rate while you keep paying write fees. The fix is mechanical. Move every stable token to the front and every variable token to the back, and verify with the usage fields that reads are actually happening.

TTL is the second trap. A 5-minute window is generous for a busy endpoint and useless for a sleepy one. If your traffic arrives in bursts with long gaps, your cache is cold by the time the next request shows up, and you paid the write for nothing. This is where the longer TTLs earn their premium: Anthropic's 1-hour cache costs 2x to write but survives the quiet stretches, and OpenAI's extended retention now defaults to 24 hours on gpt-5.5. Match the TTL to your real inter-request gap, not to a default.

The third trap is silent eviction. On self-hosted vLLM or SGLang, your warm prefix can be pushed out of GPU memory by other traffic under the LRU policy, so a cache you measured as warm at noon is cold during the afternoon peak. On hosted APIs the equivalent is that nothing guarantees your entry is still there inside the TTL; it is best effort. Instrument the cache-hit fields in the response and watch hit rate as a live metric, because a cache you assume is working and is not is worse than no cache: you are paying write fees for misses.

A cache you assume is working and is not is worse than no cache. You are paying the write fee on every miss and seeing none of the read discount.

There is also an isolation dimension that matters in multi-tenant products. You do not want tenant A's cached prefix served to tenant B, and providers handle this with workspace-level or organization-level isolation that you should confirm rather than assume. On self-hosted stacks, the hash includes salts precisely so that two tenants with identical text do not collide. Watching hit rate per tenant is the kind of thing that belongs in AI observability and monitoring, not a one-off spreadsheet, because it drifts as your prompts and traffic change.

Prompt caching vs semantic caching

This is the distinction I see conflated most, and getting it wrong wastes both effort and money. Prompt caching is exact: it reuses the model's computation for an identical prefix and is invisible to your answer quality, because the model still runs, it just skips re-reading what it already read. Semantic caching is approximate: it matches a new query to a stored one by meaning and returns the stored answer, skipping the model entirely.

The trade-offs are opposite. Prompt caching has essentially no quality risk and a modest, reliable saving on repeated context. Semantic caching can save far more, because it skips inference altogether on a hit, but it carries a real risk of returning a stale or subtly wrong answer when two questions mean almost but not quite the same thing. One is a cost optimization with no downside to reason about; the other is a cost optimization you have to evaluate like a feature.

In a mature pipeline you use both, at different layers. Semantic caching sits in front and answers the genuinely repeated questions without touching the model. Prompt caching sits behind it and cuts the cost of the requests that do reach the model by reusing their stable context. Pair both with the other levers in the pillar, token optimization to shrink what you cache and model routing to send each request to the cheapest model that clears the bar, and the cost curve bends well before you ever shop for a cheaper model.

Frequently asked questions

What is prompt caching in simple terms? It is reusing the part of a prompt the model already processed so you do not pay full price to feed the same tokens again. You mark a stable prefix as cacheable, and later requests that begin with the exact same tokens read it back at a deep discount instead of being reprocessed from scratch.

When does prompt caching save money? When a large, stable chunk of your prompt repeats across many requests inside the cache window: a long system prompt, shared documents, few-shot examples, or conversation history. It saves nothing on cold, one-off prompts, and on hosted APIs that charge a write fee it can cost slightly more if the prefix is never reused.

What is the difference between prompt caching and prefix caching? They are the same idea. "Prefix caching" is the name used on self-hosted stacks like vLLM and SGLang, where the cache works on the literal prefix of your tokens. "Prompt caching" is the product name the hosted vendors use for the same exact-prefix reuse.

Why is my prompt cache not getting any hits? Almost always because a variable token sits inside the prefix and breaks the exact match: a timestamp, a request ID, a user name, or a reordered tools list near the top of the prompt. Move all stable content to the front and all variable content to the end, then confirm reads are happening in the response usage fields.

If you want the full map of where caching fits among the other cost levers, that is my guide to reducing LLM inference cost, and the economics of caching context against long windows is the subject of my book The Context Window Problem. If you would rather have a team instrument hit rate and build caching into your stack from day one instead of after the bill, that is what Devlyn's observability work is for. Cache the tokens that repeat. Pay full price only for the ones that change.