Semantic Caching for LLMs: When It Saves Money

Semantic caching reuses a past LLM answer for a question that means the same thing, even when the words differ. Here is when it saves money, and how it differs from exact prompt caching.

Semantic caching reuses a stored LLM answer when a new question means roughly the same thing as one you have already answered, even if the wording is completely different. It saves money when your traffic is full of paraphrases of a small set of questions, because a cache hit returns in milliseconds at near-zero cost instead of paying for a fresh model call. The part most people get wrong is that it is not the same as exact prompt caching, which only fires when the prefix of the prompt is byte-for-byte identical: exact caching matches strings, while semantic caching matches meaning, and matching meaning is exactly where the money and the danger both live.

I have sat in the billing review where a support assistant was answering the same forty questions ten thousand times a day, each phrased a little differently, and we were paying full price for every single one because nothing matched exactly. Semantic caching is the lever for that shape of traffic. It is also the lever I have watched quietly serve a customer the wrong answer because someone set the similarity threshold by feel. This piece is the deep dive on that lever from my guide to reducing LLM inference cost, and it sits right next to the exact-match version in my piece on prompt caching. The two are constantly confused even though they solve different problems.

Exact caching matches strings. Semantic caching matches meaning. The first is safe and narrow. The second is powerful and will lie to you if you let it.

Key takeaways

If you read nothing else, these are the load-bearing claims:

Semantic caching reuses answers for similar questions, not identical ones. It catches the paraphrases that exact prompt caching misses, which is where the volume hides in support and FAQ traffic.
The similarity threshold is a revenue dial, not a config value. Too loose and you serve confidently wrong answers; too tight and the hit rate collapses and you save nothing.
A false cache hit is worse than a cache miss. A miss costs you one model call. A false hit costs you a wrong answer to a real customer, and you will not see it in the cost dashboard.
It only pays off on repetitive, paraphrase-heavy traffic. On a workload where every prompt is genuinely novel, semantic caching adds latency and embedding cost and returns nothing.

What semantic caching actually is

Semantic caching is a layer that sits in front of your model and asks one question before every call: have I already answered something that means this? If the answer is yes with enough confidence, it returns the stored response and never touches the model. If the answer is no, it calls the model as usual and saves the new answer for next time.

The word doing all the work is "means." A traditional cache keyed on the raw text would treat "how do I reset my password" and "I forgot my password, how do I change it" as two completely different requests, because the strings differ. A semantic cache treats them as the same request, because they carry the same intent. That is the whole pitch: you stop paying to re-answer questions you have already answered in slightly different words.
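The check-then-call flow above can be sketched as a small read-through wrapper. This is a minimal illustration, not any particular library's API: `cache_get`, `cache_put`, and `call_model` are hypothetical callables you would supply, and the wrapper does not care whether the lookup behind `cache_get` is exact or semantic.

```python
from typing import Callable, Optional


def answer(
    query: str,
    cache_get: Callable[[str], Optional[str]],  # hypothetical: returns None on a miss
    cache_put: Callable[[str, str], None],      # hypothetical: stores query -> answer
    call_model: Callable[[str], str],           # hypothetical: the paid model call
) -> str:
    """Return a cached answer when one exists, otherwise call the model and store it."""
    cached = cache_get(query)
    if cached is not None:
        return cached              # hit: the model is never touched
    fresh = call_model(query)      # miss: pay for one model call
    cache_put(query, fresh)        # save the new answer for next time
    return fresh
```

Everything interesting lives inside `cache_get`, which is where the decision about what counts as "the same question" gets made.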

If your business runs on retrieval, this should sound familiar, because it is the same embedding machinery you already use for search. The cache is just pointing that machinery at your own past answers instead of your document corpus. I walk through the similarity foundations underneath all of this in my book on embeddings in production, and the retrieval discipline it depends on in the RAG book.

How embedding-similarity caching works under the hood

The mechanics are simpler than the name suggests. When a query arrives, the cache turns it into a vector embedding, a list of numbers that captures its meaning. It then searches the vectors of past queries for the closest match, usually with an approximate nearest-neighbor index like HNSW so the lookup stays fast even with millions of entries.

The closeness of the match is scored with cosine similarity, which in general ranges from -1 to 1 and for text embeddings typically lands between 0 and 1, where a score near 1 means near-identical meaning. The cache compares that score against a threshold you set: if the best match clears the threshold, the cache returns the stored answer, and if nothing clears it, you get a miss and the request goes to the model. Each stored answer also carries a TTL so stale facts expire instead of haunting you forever.
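Here is a minimal in-memory sketch of that loop: embed, find the closest stored query, compare against the threshold, and respect the TTL. It is an illustration under stated assumptions rather than how GPTCache or Redis implement it. The `embed` callable stands in for whatever embedding model you use, the default TTL is a made-up value, and the linear scan stands in for the approximate nearest-neighbor index you would want at scale.

```python
import math
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class Entry:
    vector: Vector
    answer: str
    expires_at: float  # clock reading after which the entry is stale


class SemanticCache:
    def __init__(
        self,
        embed: Callable[[str], Vector],  # assumed: wraps your embedding model
        threshold: float = 0.9,          # conservative starting point
        ttl_seconds: float = 3600.0,     # hypothetical default, tune per use case
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._embed = embed
        self._threshold = threshold
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: List[Entry] = []

    def put(self, query: str, answer: str) -> None:
        """Store the answer under the query's embedding with a fresh TTL."""
        self._entries.append(
            Entry(self._embed(query), answer, self._clock() + self._ttl)
        )

    def get(self, query: str) -> Optional[str]:
        """Return the closest stored answer if it clears the threshold, else None."""
        now = self._clock()
        # Drop expired entries first so a stale answer can never be served.
        self._entries = [e for e in self._entries if e.expires_at > now]
        if not self._entries:
            return None
        query_vector = self._embed(query)
        # Linear scan for clarity; an ANN index would replace this at scale.
        best = max(
            self._entries,
            key=lambda e: cosine_similarity(query_vector, e.vector),
        )
        score = cosine_similarity(query_vector, best.vector)
        return best.answer if score >= self._threshold else None
```

Passing the clock in as a parameter is a design choice for testability: you can advance time in a test and confirm that entries really expire.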

That single threshold is the entire risk surface of the system, and it is worth understanding before you ship anything. Redis, whose semantic cache is one of the better-documented production implementations, recommends thresholds somewhere between 0.7 and 0.95, and tells you to start conservative at 0.9 or higher (Redis). The reason for that caution is the next section, and it is the part of semantic caching that has cost teams more than it saved them.

If you are early in building an AI product and want this kind of infrastructure designed correctly the first time rather than discovered the hard way, the Devlyn team builds exactly this.

The tools: GPTCache, Redis LangCache, and gateway caches

You rarely build a semantic cache from scratch, and you should not. The open-source default is GPTCache, a Python library from Zilliz that handles the embed-search-store loop for you, with pluggable embedding models and a choice of vector backends like Milvus, Faiss, Redis, and Qdrant. It is the thing most people mean when they say "semantic cache" in a codebase.

On the managed side, Redis ships a semantic cache (marketed as LangCache) that gives you the same loop with native vector search, TTL expiration, and HNSW indexing built in. Beyond those, a growing number of LLM gateways now bake semantic caching in as a config flag, so you get it without writing cache code at all. The tradeoff there is the usual one: less control over the threshold and the eviction policy in exchange for not maintaining it yourself.

The tool matters less than the discipline around it. A gateway flag with a default threshold and no measurement is how teams end up with the failure mode I am about to describe. Pick whichever fits your stack, then spend your real effort on the threshold and the guardrails, not the library.

Tuning the similarity threshold and the false-hit danger

The threshold is a precision-recall dial, and you cannot win both ends. Lower it and you catch more paraphrases, so your hit rate climbs and your bill drops, but you also start matching questions that are merely adjacent rather than equivalent. Raise it and every hit is trustworthy, but you miss the loosely-worded paraphrases and the savings evaporate. There is no setting that is loose and safe at the same time.
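To make the dial concrete, here is a sketch that scores a labeled set at several candidate thresholds. It assumes you already have pairs of (similarity score, whether the two queries should share an answer); the names `sweep_thresholds` and `lowest_threshold_meeting` are mine, and treating "no hits at all" as precision 1.0 is a convention I chose, not a standard.

```python
from typing import List, NamedTuple, Optional, Sequence, Tuple


class SweepRow(NamedTuple):
    threshold: float
    precision: float  # of the hits served, the share that were correct
    recall: float     # of the pairs that should hit, the share that did


def sweep_thresholds(
    scored_pairs: Sequence[Tuple[float, bool]],
    thresholds: Sequence[float],
) -> List[SweepRow]:
    """Compute precision and recall of cache hits at each candidate threshold."""
    positives = sum(1 for _, same in scored_pairs if same)
    rows = []
    for t in thresholds:
        hits = [same for score, same in scored_pairs if score >= t]
        true_hits = sum(1 for same in hits if same)
        # Convention: with no hits there are no wrong answers, so precision is 1.0.
        precision = true_hits / len(hits) if hits else 1.0
        recall = true_hits / positives if positives else 0.0
        rows.append(SweepRow(t, precision, recall))
    return rows


def lowest_threshold_meeting(
    rows: Sequence[SweepRow], min_precision: float
) -> Optional[SweepRow]:
    """Loosest threshold in the sweep whose precision still meets the target."""
    ok = [r for r in rows if r.precision >= min_precision]
    return min(ok, key=lambda r: r.threshold) if ok else None
```

Loosening the threshold in a sweep like this raises recall and lowers precision, which is the tradeoff in one table. Precision is not guaranteed to rise smoothly with the threshold on a small labeled set, so read the whole sweep rather than trusting the single row the helper returns.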

Here is the failure that should scare you. A team I worked with set a semantic cache at 0.83 because that was the number in a tutorial, and the hit rate looked great. Then a customer asked whether a product was covered under warranty after twelve months, and the cache served them the confident answer to a question about coverage after twelve days, because the two queries scored 0.84 in cosine similarity, just over the threshold. The numbers are illustrative, but the shape is real: a wrong answer, delivered fast, with full confidence, invisible in every cost metric because it counted as a successful cache hit.

A cache miss costs you one model call. A false cache hit costs you a wrong answer to a real customer, and it will never show up in the cost dashboard. That asymmetry is the whole game.

The discipline that prevents this is boring and it works. Collect a few hundred real queries, label which ones should and should not share an answer, and measure precision at each candidate threshold until you clear about 95% precision before you ship (Redis). Add a confidence buffer so you only serve a hit when it beats the threshold by a comfortable margin, and fence the cache with hard metadata boundaries so a hit can never cross tenant, locale, or product line no matter how similar the wording looks. Soft similarity inside hard walls is the pattern that survives contact with production.
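The "soft similarity inside hard walls" pattern can be sketched as a filter that runs before any score is trusted. This is an assumption-laden illustration: the `Scope` tuple of tenant, locale, and product line, the `Candidate` type, and the default margin of 0.02 are all hypothetical choices, and the candidates are assumed to arrive already scored by your similarity search.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Scope = Tuple[str, str, str]  # hypothetical: (tenant, locale, product_line)


@dataclass(frozen=True)
class Candidate:
    score: float  # similarity between the new query and the stored query
    scope: Scope  # metadata the stored answer was written under
    answer: str


def pick_hit(
    candidates: Sequence[Candidate],
    query_scope: Scope,
    threshold: float,
    margin: float = 0.02,  # hypothetical confidence buffer
) -> Optional[str]:
    """Serve a cached answer only from the same scope and only above threshold + margin."""
    # Hard wall: a candidate from another tenant, locale, or product never qualifies,
    # no matter how high its similarity score is.
    in_scope = [c for c in candidates if c.scope == query_scope]
    if not in_scope:
        return None
    best = max(in_scope, key=lambda c: c.score)
    # Soft check: the best in-scope match must beat the threshold by the buffer.
    return best.answer if best.score >= threshold + margin else None
```

In a real deployment you would push the scope filter down into the vector search itself so out-of-scope entries are never retrieved, but the ordering is the point: boundaries first, similarity second.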

A comparison of caching approaches

Here is the same decision laid out as a table: how each approach matches, what it risks, and the traffic it is built for.

Approach	How it matches	Main risk	Best for
Exact response cache	Byte-for-byte identical query string	Misses every paraphrase, so hit rate is low on free text	Fixed, repeated queries (status checks, canned commands)
Exact prompt caching	Identical prompt prefix within a cache window	Saves nothing if the prefix is not stable across calls	Long stable system prompts, shared documents, few-shot blocks
Semantic caching	Embedding similarity above a threshold	False hits: serving a wrong answer to a near-miss query	Paraphrase-heavy traffic (support, FAQ, knowledge bases)
Two-layer (exact then semantic)	Exact first, fall back to similarity	More moving parts to monitor and tune	High-volume assistants where both shapes of traffic appear

The two-layer pattern is what most serious deployments converge on. You check for an exact match first, which is free and carries zero false-positive risk, and only run the more expensive and riskier semantic match when the exact layer misses. It gives you the safety of strings on the common path and the reach of meaning on the tail.
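A rough sketch of the two-layer lookup follows. The class name `TwoLayerCache` and the `semantic_get` and `semantic_put` callables are hypothetical stand-ins for whichever semantic cache you run behind it. The exact layer here is a plain dictionary keyed on the raw query string, so it matches byte-for-byte.

```python
from typing import Callable, Dict, Optional, Tuple


class TwoLayerCache:
    """Exact-match lookup first, semantic lookup only when the exact layer misses."""

    def __init__(
        self,
        semantic_get: Callable[[str], Optional[str]],  # hypothetical semantic lookup
        semantic_put: Callable[[str, str], None],      # hypothetical semantic store
    ) -> None:
        self._exact: Dict[str, str] = {}
        self._semantic_get = semantic_get
        self._semantic_put = semantic_put

    def get(self, query: str) -> Tuple[Optional[str], str]:
        """Return (answer, layer) where layer is 'exact', 'semantic', or 'miss'."""
        if query in self._exact:
            return self._exact[query], "exact"  # no embedding call, no false hits
        answer = self._semantic_get(query)
        if answer is not None:
            return answer, "semantic"           # riskier path, worth logging
        return None, "miss"

    def put(self, query: str, answer: str) -> None:
        """Store a fresh model answer in both layers."""
        self._exact[query] = answer
        self._semantic_put(query, answer)
```

Returning the layer name alongside the answer is deliberate. It lets you track semantic hits separately from exact hits, which is where any false-hit monitoring has to look.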

When semantic caching pays off, and when it does not

The economics are entirely a function of hit rate, and hit rate is entirely a function of how repetitive your traffic is. The published benchmark for GPT Semantic Cache reports hit rates of roughly 62% to 69% on support and FAQ-style query sets, with positive-hit accuracy above 97% once the threshold is tuned (arXiv). Those are the conditions where the lever is worth pulling: a narrow set of questions asked many different ways.

Translate that into money and it is real. Redis sketches a workload spending $80,000 per quarter on the model, where a 30% to 40% semantic cache hit rate saves $24,000 to $32,000 per quarter (Redis). The latency win is just as concrete: a cached answer can come back in around 0.3 seconds against the 2.7 seconds of a live call, because you skipped the model entirely. Faster and cheaper at the same time is rare, and on the right traffic this delivers it.

Now the honest other side. If every request to your system is genuinely novel, as with a code-generation tool taking unique repos or an analysis agent reasoning over fresh data each time, your semantic cache hit rate will sit near zero. You will pay for an embedding call and a vector lookup on every request and get almost nothing back. On novel traffic, semantic caching is pure overhead, and the right answer is to spend your cost effort on routing to a smaller model or on how you adapt the model instead.
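The arithmetic on both sides of that argument fits in two small functions. This sketch assumes the spend avoided is proportional to hit rate, meaning cached requests would have cost about the same as the average request, and it treats the cache's own cost (embedding calls, vector lookups, hosting) as a single `cache_overhead` figure. That figure is hypothetical and does not come from the Redis example.

```python
def net_savings(
    model_spend: float, hit_rate: float, cache_overhead: float = 0.0
) -> float:
    """Model spend avoided by cache hits, minus what the cache itself costs.

    All amounts cover the same period (for example, one quarter).
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return model_spend * hit_rate - cache_overhead


def break_even_hit_rate(model_spend: float, cache_overhead: float) -> float:
    """Hit rate at which the cache pays for itself under the same assumptions."""
    if model_spend <= 0:
        raise ValueError("model_spend must be positive")
    return cache_overhead / model_spend
```

With `model_spend=80_000` and no overhead, hit rates of 0.30 and 0.40 reproduce the roughly $24,000 and $32,000 figures above. Plug in a 2% hit rate and an assumed $2,000 of overhead and the result goes negative, which is the novel-traffic case in numbers.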

The deciding question is not "would caching help" in the abstract. It is "what fraction of my real traffic is a paraphrase of something I already answered." Pull a week of production queries, cluster them, and look: if a small number of clusters cover most of your volume, semantic caching is a strong lever, and if your queries are a long flat tail of unique requests, skip it and pull a different one.
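One rough way to run that check is a greedy "leader" clustering over the embeddings of a week of queries, followed by a coverage number for the largest clusters. This is a sketch under assumptions: the vectors are assumed to come from your own embedding model, `min_similarity` is a value you would have to pick, and a greedy pass like this is order-dependent and cruder than a proper clustering algorithm. It is usually enough to tell a few fat clusters from a flat tail.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def greedy_cluster_sizes(vectors: Sequence[Vector], min_similarity: float) -> List[int]:
    """Assign each vector to the first cluster leader it is close enough to.

    A vector that matches no existing leader starts a new cluster and becomes its leader.
    Returns the size of each cluster in creation order.
    """
    leaders: List[Vector] = []
    sizes: List[int] = []
    for v in vectors:
        for i, leader in enumerate(leaders):
            if cosine_similarity(v, leader) >= min_similarity:
                sizes[i] += 1
                break
        else:
            leaders.append(v)
            sizes.append(1)
    return sizes


def top_k_coverage(sizes: Sequence[int], k: int) -> float:
    """Fraction of all queries that fall in the k largest clusters."""
    total = sum(sizes)
    if total == 0:
        return 0.0
    return sum(sorted(sizes, reverse=True)[:k]) / total
```

If `top_k_coverage` for a small `k` comes back high, your traffic has the shape semantic caching is built for. If it stays low even as `k` grows, you are looking at the long flat tail.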

Semantic caching vs prompt caching

This is the distinction that derails the most conversations, so let me make it sharp. Exact prompt caching reuses the already-computed prefix of a single prompt, so when a long system prompt or a block of documents repeats across calls, you stop paying full input price to reprocess those identical tokens. Because it matches only on an identical prefix and the model still generates every response fresh, it never serves a wrong answer from the cache. I cover it fully in the prompt caching deep dive.

Semantic caching reuses an entire answer for a different but similar question, and it matches on meaning rather than bytes. That reach is why it can save more on the right traffic, and it is also why it can be wrong in a way prompt caching never can. They are not competitors. A mature inference stack often runs both: prompt caching to cut the cost of the repeated prefix on every call, and semantic caching to skip the call entirely when the question has already been answered.

The clean mental model is a layered defense against spend. An exact response cache catches the literal repeats for free, prompt caching discounts the stable prefix on the calls you do make, and semantic caching skips the call entirely when meaning repeats. Routing then sends whatever is left to the cheapest model that can handle it. None of these replaces the others, and the order you reach for them in is most of the design.

Frequently asked questions

What is semantic caching for LLMs? Semantic caching stores past LLM answers alongside vector embeddings of the questions that produced them. When a new query arrives, it embeds the query, finds the most similar past question by cosine similarity, and returns the stored answer if the match clears a confidence threshold, skipping the model call entirely. It reuses answers for questions that mean the same thing, not just questions worded the same way.

How is semantic caching different from prompt caching? Prompt caching reuses the identical computed prefix of a prompt and matches byte-for-byte, so it never serves a wrong answer and saves money when a large stable chunk repeats across calls. Semantic caching reuses a whole answer for a similar question and matches on meaning, so it reaches further but can serve a wrong answer if the threshold is too loose. Most strong stacks run both.

What similarity threshold should I use for a semantic cache? Start conservative, around 0.9 or higher, then tune against a labeled set of a few hundred real queries until you clear roughly 95% precision before shipping. Add a confidence buffer so you only serve hits that beat the threshold by a margin, and fence the cache with hard metadata boundaries like tenant and locale so a hit can never cross them.

When does semantic caching not save money? When your traffic is genuinely novel and rarely repeats in meaning, the hit rate sits near zero and you pay for an embedding call and a vector lookup on every request for almost no return. On that shape of workload, semantic caching is overhead, and routing or model adaptation is the better cost lever.

If you want this built into your stack with the threshold tuning, false-hit guardrails, and the monitoring to catch a bad hit before a customer does, that is squarely an AI observability and monitoring problem, and it is the work the Devlyn team does on retrieval and knowledge systems. Cache the meaning that repeats. Measure the precision before you trust it. And remember that the cheapest wrong answer still costs more than the model call you were trying to skip.