Why LLM Costs Are a First-Order Problem
As organizations scale their LLM applications from proof-of-concept to production, API costs become a first-order constraint. A customer support system handling 10,000 queries per day at an average of 1,500 tokens per query costs roughly $15–45 per day with current frontier model pricing — $5,000–$16,000 per year for a single application. At enterprise scale, across dozens of AI applications, these costs compound rapidly.
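The arithmetic above is worth making explicit. The sketch below assumes a blended price of $1–$3 per million tokens, which is an illustrative stand-in for whatever your provider actually charges:

```python
def llm_cost_per_day(queries_per_day: int, tokens_per_query: int,
                     price_per_million_tokens: float) -> float:
    """Rough daily API spend: total tokens processed times the unit price."""
    total_tokens = queries_per_day * tokens_per_query
    return total_tokens / 1_000_000 * price_per_million_tokens

# 10,000 queries/day at 1,500 tokens each is 15M tokens/day.
low = llm_cost_per_day(10_000, 1_500, 1.0)   # $15/day at $1 per 1M tokens
high = llm_cost_per_day(10_000, 1_500, 3.0)  # $45/day at $3 per 1M tokens
print(low * 365, high * 365)                 # annualized: ~$5,475 to ~$16,425
```

Every percentage point of cache hit rate comes straight off this number, which is what makes caching attractive before any model-downgrading is considered.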
The standard response is to use smaller, cheaper models for simpler queries. This is a valid optimization, but it requires careful query routing logic and introduces quality trade-offs. Semantic caching offers a complementary approach: instead of reducing the quality of the model, it reduces the number of times the model needs to be called at all.
The insight behind semantic caching is simple but powerful: in most LLM applications, a significant fraction of queries are semantically equivalent even if they are not lexically identical. "What is your return policy?" and "How do I return a product?" and "Can I get a refund?" are three different strings but one semantic intent. A traditional exact-match cache treats them as three unrelated keys and serves a hit only on a verbatim repeat. A semantic cache, using vector embeddings to measure semantic similarity, can recognize them as equivalent and serve the same cached response.
How Semantic Caching Works
A semantic cache sits between the application and the LLM API. When a new query arrives, the cache system embeds the query into a vector using an embedding model, then searches the cache index for any stored queries whose embeddings are within a similarity threshold of the new query. If a match is found above the threshold, the cached response is returned immediately — no LLM call required. If no match is found, the query is forwarded to the LLM, and both the query embedding and the response are stored in the cache for future use.
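That lookup loop can be sketched in a few dozen lines. `embed` and `call_llm` below are placeholder callables you would wire to your actual embedding and LLM providers, and the linear scan stands in for a real vector index:

```python
import math
from typing import Callable, List, Tuple

def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, embed: Callable[[str], List[float]],
                 call_llm: Callable[[str], str], threshold: float = 0.95):
        self.embed = embed
        self.call_llm = call_llm
        self.threshold = threshold
        self.entries: List[Tuple[List[float], str]] = []  # (embedding, response)

    def query(self, text: str) -> Tuple[str, bool]:
        """Return (response, cache_hit)."""
        vec = self.embed(text)
        best_sim, best_resp = -1.0, None
        for emb, resp in self.entries:
            sim = cosine_similarity(vec, emb)
            if sim > best_sim:
                best_sim, best_resp = sim, resp
        if best_resp is not None and best_sim >= self.threshold:
            return best_resp, True          # hit: no LLM call made
        resp = self.call_llm(text)          # miss: forward to the model
        self.entries.append((vec, resp))    # store for future queries
        return resp, False
```

In production the scan is replaced by a vector-index lookup, but the hit/miss logic stays exactly this shape.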
The similarity threshold is the critical tuning parameter. A threshold that is too high (requiring very close semantic matches) will have a low cache hit rate and provide little cost savings. A threshold that is too low will return cached responses for queries that are semantically different enough to warrant different answers, degrading response quality. In practice, thresholds between 0.92 and 0.97 cosine similarity work well for most applications, but the optimal value depends heavily on the domain and the tolerance for response variation.
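One practical way to tune the threshold is to collect a small set of query pairs, label each pair as equivalent or not, and sweep candidate thresholds against it. The similarity numbers below are illustrative, not from a real embedding model:

```python
# Each pair: (cosine similarity between two queries, human-judged equivalence)
labeled_pairs = [
    (0.98, True), (0.96, True), (0.94, True),
    (0.93, False), (0.91, False), (0.88, False),
]

def score_threshold(threshold, pairs):
    served = [(s, eq) for s, eq in pairs if s >= threshold]    # would be cache hits
    false_hits = sum(1 for _, eq in served if not eq)          # wrong answer served
    missed = sum(1 for s, eq in pairs if eq and s < threshold) # savings left on the table
    return len(served), false_hits, missed

for t in (0.90, 0.93, 0.95, 0.97):
    print(t, score_threshold(t, labeled_pairs))
```

The threshold you pick is the one where false hits drop to whatever your application can tolerate, at the cost of some missed savings.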
The cache index is typically implemented using a vector database — Pinecone, Redis with vector search, or pgvector for teams already on PostgreSQL. The embedding model used for the cache should be the same model used for the application's retrieval pipeline to ensure consistent similarity scores. You can test semantic similarity scores directly using the Semantic Similarity Tool on this site.
Cache Hit Rates in Production
Real-world semantic cache hit rates vary significantly by application type. Customer support applications, where users ask a bounded set of common questions, typically achieve 40–65% cache hit rates after the cache warms up over the first few weeks of operation. Internal knowledge base assistants, where employees ask similar questions about company policies and procedures, achieve 30–50%. Open-ended creative or analytical applications, where every query is genuinely novel, achieve 5–15%.
The warmup period is important to understand. A semantic cache starts empty and builds its index as queries are processed. For the first few days or weeks of operation, hit rates will be low. As the cache accumulates a representative sample of the query distribution, hit rates stabilize. Organizations that pre-populate the cache with a library of anticipated queries and their responses can accelerate this warmup period significantly.
GPTCache, an open-source semantic caching library from Zilliz, reports average cache hit rates of 35–50% across its user base, with cost savings of 40–70% for applications with repetitive query patterns. LangChain's semantic cache integration provides similar capabilities with tighter integration into the LangChain ecosystem.
Staleness and Cache Invalidation
The hardest problem in semantic caching is cache invalidation — specifically, ensuring that cached responses remain accurate as the underlying data changes. For a customer support application, if the return policy changes, every cached response about returns must be invalidated. For a RAG application over enterprise documents, if a document is updated, any cached responses derived from that document must be refreshed.
The standard approach is TTL-based invalidation: cached entries expire after a fixed period (24 hours, 7 days, or 30 days, depending on how frequently the underlying data changes). This is simple to implement but imprecise in both directions: it serves stale responses until the TTL expires, and it invalidates still-accurate responses unnecessarily.
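A TTL policy is only a few lines on top of any cache layer. In this sketch the clock is injectable purely so the expiry behavior is easy to test:

```python
import time
from typing import Callable, Dict, Optional, Tuple

class TTLCache:
    def __init__(self, ttl_seconds: float, now: Callable[[], float] = time.time):
        self.ttl = ttl_seconds
        self.now = now
        self.entries: Dict[str, Tuple[str, float]] = {}  # key -> (response, stored_at)

    def put(self, key: str, response: str) -> None:
        self.entries[key] = (response, self.now())

    def get(self, key: str) -> Optional[str]:
        item = self.entries.get(key)
        if item is None:
            return None
        response, stored_at = item
        if self.now() - stored_at > self.ttl:
            del self.entries[key]  # lazily evict expired entries on read
            return None
        return response
```

The same expiry check applies whether the key is an exact string or the nearest-neighbor match from a vector index.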
More sophisticated approaches use source-aware invalidation: each cached entry is tagged with the source documents or data that contributed to the response. When a source document is updated, all cache entries tagged with that document are invalidated. This requires tighter integration between the cache and the document management system, but it provides much more precise invalidation. For applications where data freshness is critical — financial data, medical information, regulatory guidance — source-aware invalidation is the only acceptable approach. The semantic drift problem is related: as the meaning of terms in a domain evolves, cached responses that were accurate when stored may become misleading over time.
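The tagging side of source-aware invalidation amounts to a reverse index from document ID to the cache entries that document fed. A minimal sketch, with hypothetical document IDs:

```python
from typing import Dict, Iterable, Optional, Set

class SourceTaggedCache:
    def __init__(self):
        self.responses: Dict[str, str] = {}       # cache key -> cached response
        self.by_source: Dict[str, Set[str]] = {}  # doc id -> keys derived from it

    def put(self, key: str, response: str, sources: Iterable[str]) -> None:
        self.responses[key] = response
        for doc in sources:
            self.by_source.setdefault(doc, set()).add(key)

    def get(self, key: str) -> Optional[str]:
        return self.responses.get(key)

    def invalidate_source(self, doc: str) -> int:
        """Drop every cache entry derived from `doc`; returns the eviction count."""
        keys = self.by_source.pop(doc, set())
        for key in keys:
            self.responses.pop(key, None)
        return len(keys)
```

The integration cost is in the `sources` argument: the RAG pipeline has to report which documents contributed to each response at store time.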
When Semantic Caching Is Inappropriate
Semantic caching is not appropriate for all LLM applications. Applications that require personalized responses — where the correct answer depends on the specific user's context, history, or permissions — cannot safely use a shared semantic cache. A query like "What are my pending tasks?" has a different correct answer for every user. Caching the response to this query and serving it to other users who ask semantically similar questions would be both incorrect and a potential privacy violation.
Applications that require real-time data — stock prices, live inventory levels, current weather — cannot cache responses that will be stale within minutes. The cache hit rate for these applications would be near zero even with a short TTL, making the overhead of the caching layer unjustified.
Applications where response variation is a feature, not a bug, should also avoid semantic caching. Creative writing assistants, brainstorming tools, and applications where users expect different responses to similar prompts will produce a degraded user experience if the cache returns identical responses to semantically similar queries. In these cases, the value of the LLM is precisely its ability to generate novel responses, and caching undermines that value.
Implementation Guide: Building a Semantic Cache
A production-ready semantic cache requires five components: an embedding model for query vectorization, a vector store for the cache index, a similarity threshold configuration, a TTL or invalidation policy, and a cache key strategy.
For the embedding model, use the same model as your application's retrieval pipeline. OpenAI's text-embedding-3-small is cost-effective for high-volume caching. For latency-sensitive applications, a locally-hosted model like BGE-small eliminates the round-trip to the embedding API.
For the vector store, pgvector is the lowest-friction option for teams already on PostgreSQL — it adds vector similarity search to an existing database with a single extension. Redis with the RediSearch module provides in-memory speed for latency-critical applications. Dedicated vector databases like Pinecone or Qdrant are appropriate for very high query volumes where the vector store itself becomes a bottleneck.
The cache key strategy determines what gets cached together. The simplest approach is to cache by query embedding alone. More sophisticated strategies include the user's role or permission level (to avoid serving responses across permission boundaries), the current date bucket (to avoid serving time-sensitive responses across date boundaries), and the application context (to avoid serving responses from one application context in another). Start simple and add complexity only when the simple approach produces incorrect behavior in production.
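The partitioning strategies above reduce to building a namespace string into the cache key, so that queries only match cached entries inside the same partition. The field names here are illustrative:

```python
from typing import Optional

def cache_namespace(app: str, user_role: Optional[str] = None,
                    date_bucket: Optional[str] = None) -> str:
    """Build a partition key; lookups only match entries in the same namespace."""
    parts = [f"app={app}"]
    if user_role is not None:
        parts.append(f"role={user_role}")   # keeps responses inside permission boundaries
    if date_bucket is not None:
        parts.append(f"day={date_bucket}")  # keeps time-sensitive answers within one day
    return "|".join(parts)

print(cache_namespace("support"))
print(cache_namespace("support", user_role="agent", date_bucket="2025-01-15"))
```

Each field you add fragments the cache and lowers the hit rate, which is the concrete reason to start simple and add fields only when production behavior demands it.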
Further Reading
The infrastructure foundation that makes semantic caching possible.
Full technical definition of semantic caching and how it differs from traditional caching.
Test the semantic similarity between any two texts — the core operation behind semantic caching.
About the Author

Nick Eubanks
Entrepreneur, SEO Strategist & AI Infrastructure Builder
Nick Eubanks is a serial entrepreneur and digital strategist with nearly two decades of experience at the intersection of search, data, and emerging technology. He is the Global CMO of Digistore24, founder of IFTF Agency (acquired), and co-founder of the TTT SEO Community (acquired). A former Semrush team member and recognized authority in organic growth strategy, Nick has advised and built companies across SEO, content intelligence, and AI-driven marketing infrastructure. He is the founder of semantic.io — the definitive reference for the semantic AI era — and the Enterprise Risk Association at riskgovernance.com, where he publishes research on agentic AI governance for enterprise executives. Based in Miami, Nick writes at the frontier of semantic technology, AI architecture, and the infrastructure required to make enterprise AI actually work.