A Practical Semantic Cache in Python
Every team building search, assistants, recommendation layers, or document workflows eventually runs into the same quiet problem: the embedding bill keeps growing while latency gets worse.
The issue is rarely the vector model itself.
It is the repeated work around it.
The same customer question gets embedded thousands of times. The same product description is reprocessed after tiny formatting changes. The same support messages trigger identical retrieval paths over and over again.
A semantic system that recomputes everything on every request is not intelligent. It is expensive.
A better pattern is a semantic cache.
This article builds a simple version in Python.
The goal is straightforward:
- avoid recomputing embeddings for near-duplicate text
- reduce latency for repeated semantic requests
- keep retrieval quality high
- make the system easy to inspect and extend
This is not a toy optimization. It is one of the most practical ways to cut cost and latency in real-world AI pipelines.
⸻
Why semantic caching matters
Traditional caching works when inputs match exactly.
If a user asks:
- "How do I reset my password?"
- "How can I change my password?"
- "I forgot my password, what should I do?"
a normal cache sees three different strings.
A semantic cache sees one intent.
That difference matters in systems that rely on embeddings, vector search, reranking, or LLM context assembly.
A semantic cache can sit in front of:
- embedding generation
- retrieval pipelines
- FAQ systems
- agent tools
- support automation
- recommendation engines
The key idea is simple:
Instead of caching by exact text, cache by meaning.
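To make the limitation concrete, here is a small sketch of an exact-match cache facing the three password questions above. The dictionary and variable names are illustrative only.

```python
# Exact-match cache: the key is the literal string, so paraphrases never collide.
exact_cache = {
    "How do I reset my password?": "Go to Settings → Security → Reset Password.",
}

paraphrases = [
    "How do I reset my password?",
    "How can I change my password?",
    "I forgot my password, what should I do?",
]

# Only the character-for-character identical question is found.
exact_hits = [q in exact_cache for q in paraphrases]
print(exact_hits)  # [True, False, False]
```

Two of the three questions miss even though a human would give the same answer to all of them.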
⸻
What we are building
We will create a Python semantic cache that:
- converts text into embeddings
- stores prior queries and their vectors
- compares new queries to cached vectors
- returns a cached result when similarity is high enough
- computes a fresh result when similarity is too low
For the embedding model, we will use sentence-transformers because it is practical and widely used.
Install dependencies:
```
pip install sentence-transformers scikit-learn numpy
```
⸻
The core design
Our cache needs four parts:
- a way to embed text
- a place to store vectors and results
- a similarity function
- a threshold for deciding cache hit vs miss
We will use cosine similarity.
If a new query is close enough to a stored query, we reuse the old result.
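As a rough sketch of that decision rule, the snippet below computes cosine similarity in plain Python and compares it to a threshold. The three-dimensional vectors are made up for illustration (real embeddings have hundreds of dimensions), and `cosine` and `is_hit` are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_hit(query_vec, cached_vec, threshold=0.90):
    """A cache hit is simply 'similarity at or above the threshold'."""
    return cosine(query_vec, cached_vec) >= threshold


# Toy vectors, invented for this example.
cached = [1.0, 0.0, 0.0]
near = [0.9, 0.1, 0.0]   # points in almost the same direction
far = [0.0, 1.0, 0.0]    # orthogonal, so similarity is 0

print(is_hit(near, cached))  # True
print(is_hit(far, cached))   # False
```

When both vectors are already normalized to unit length, the denominator is 1 and cosine similarity reduces to a dot product.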
⸻
Step 1 — A minimal semantic cache
```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity


class SemanticCache:
    def __init__(self, model_name="all-MiniLM-L6-v2", similarity_threshold=0.90):
        self.model = SentenceTransformer(model_name)
        self.similarity_threshold = similarity_threshold
        # Each entry is a dict with "query", "embedding", and "result".
        self.entries = []

    def embed(self, text):
        # encode() takes a list and returns one vector per input; keep the first.
        return self.model.encode([text], normalize_embeddings=True)[0]

    def lookup(self, query):
        # An empty cache has nothing to compare against.
        if not self.entries:
            return None

        query_vec = self.embed(query)
        cached_vectors = np.array([entry["embedding"] for entry in self.entries])
        similarities = cosine_similarity([query_vec], cached_vectors)[0]

        # Pick the closest cached query and check it against the threshold.
        best_idx = np.argmax(similarities)
        best_score = similarities[best_idx]

        if best_score >= self.similarity_threshold:
            return {
                "hit": True,
                "score": float(best_score),
                "matched_query": self.entries[best_idx]["query"],
                "result": self.entries[best_idx]["result"],
            }

        return {
            "hit": False,
            "score": float(best_score),
        }

    def add(self, query, result):
        embedding = self.embed(query)
        self.entries.append({
            "query": query,
            "embedding": embedding,
            "result": result,
        })
```
This version keeps everything in memory.
That is enough to understand the mechanism before adding persistence or scaling.
⸻
Step 2 — Simulate a real use case
Let's imagine a support assistant that maps user questions to prepared answers.
```python
cache = SemanticCache(similarity_threshold=0.88)

cache.add(
    "How do I reset my password?",
    "Go to Settings → Security → Reset Password.",
)
cache.add(
    "How can I update my billing information?",
    "Open Account → Billing → Payment Methods.",
)

queries = [
    "I forgot my password, what should I do?",
    "Where do I change my card details?",
    "How do I delete my account?",
]

for q in queries:
    result = cache.lookup(q)
    print(f"\nQuery: {q}")
    print(result)
```
An illustrative output might look like this. The scores are rounded for readability, and actual values depend on the embedding model, so a given paraphrase may land above or below the threshold on your setup:
```
Query: I forgot my password, what should I do?
{'hit': True, 'score': 0.92, 'matched_query': 'How do I reset my password?', 'result': 'Go to Settings → Security → Reset Password.'}

Query: Where do I change my card details?
{'hit': True, 'score': 0.89, 'matched_query': 'How can I update my billing information?', 'result': 'Open Account → Billing → Payment Methods.'}

Query: How do I delete my account?
{'hit': False, 'score': 0.54}
```
This is the essential behavior.
Two semantically similar questions become cache hits.
A different question becomes a miss.
⸻
Step 3 — Add a fallback function
A cache is only useful if misses can be handled automatically.
In practice, a miss might trigger:
- an embedding + vector search pipeline
- an LLM call
- a database query
- a rules engine
- a recommendation model
Here is a simple wrapper that uses the cache first and computes only when needed.
```python
def expensive_pipeline(query):
    if "password" in query.lower():
        return "Go to Settings → Security → Reset Password."
    elif "billing" in query.lower() or "card" in query.lower():
        return "Open Account → Billing → Payment Methods."
    else:
        return "Please contact support for this request."


def get_response(query, cache):
    lookup_result = cache.lookup(query)

    if lookup_result and lookup_result["hit"]:
        return {
            "source": "cache",
            "score": lookup_result["score"],
            "response": lookup_result["result"],
        }

    fresh_result = expensive_pipeline(query)
    cache.add(query, fresh_result)

    return {
        "source": "fresh",
        "score": None,
        "response": fresh_result,
    }
```
Now test it:
```python
queries = [
    "I forgot my password",
    "How do I reset my password?",
    "Need to update my card",
    "Need to update my payment method",
    "Delete my account",
]

for q in queries:
    print(q, "->", get_response(q, cache))
```
This pattern is where semantic caching becomes operationally useful.
The first request pays the full cost.
Related requests become cheaper.
⸻
Step 4 — Avoid bad cache hits with text normalization
Semantic similarity is powerful, but raw text still contains noise.
Before embedding, normalize the text.
This helps with consistency and reduces accidental misses.
```python
import re


def normalize_text(text):
    text = text.lower().strip()
    text = re.sub(r"\s+", " ", text)
    return text
```
Update the cache class:
```python
class SemanticCache:
    def __init__(self, model_name="all-MiniLM-L6-v2", similarity_threshold=0.90):
        self.model = SentenceTransformer(model_name)
        self.similarity_threshold = similarity_threshold
        self.entries = []

    def embed(self, text):
        text = normalize_text(text)
        return self.model.encode([text], normalize_embeddings=True)[0]

    def lookup(self, query):
        if not self.entries:
            return None

        query_vec = self.embed(query)
        cached_vectors = np.array([entry["embedding"] for entry in self.entries])
        similarities = cosine_similarity([query_vec], cached_vectors)[0]

        best_idx = np.argmax(similarities)
        best_score = similarities[best_idx]

        if best_score >= self.similarity_threshold:
            return {
                "hit": True,
                "score": float(best_score),
                "matched_query": self.entries[best_idx]["query"],
                "result": self.entries[best_idx]["result"],
            }

        return {
            "hit": False,
            "score": float(best_score),
        }

    def add(self, query, result):
        embedding = self.embed(query)
        self.entries.append({
            "query": query,
            "embedding": embedding,
            "result": result,
        })
```
Normalization will not solve every issue, but it improves stability.
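If you want to go a step further, a slightly stricter normalizer might also fold Unicode variants and drop trailing sentence punctuation. This is a sketch only: `normalize_text_strict` is a hypothetical name, and stripping punctuation is an assumption that may not suit every domain, so check it against your own queries before adopting it.

```python
import re
import unicodedata


def normalize_text_strict(text):
    # Fold compatibility characters (full-width letters, non-breaking spaces).
    text = unicodedata.normalize("NFKC", text)
    # casefold() is a more aggressive lower() intended for caseless matching.
    text = text.casefold().strip()
    # Collapse any run of whitespace into a single space.
    text = re.sub(r"\s+", " ", text)
    # Drop trailing ?, ! or . so "reset my password" and "reset my password?" align.
    text = re.sub(r"[?!.]+$", "", text).strip()
    return text


print(normalize_text_strict("  How  do\tI RESET my password ??  "))
# how do i reset my password
```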
⸻
Step 5 — Add metadata and expiration
A cache without lifecycle control eventually becomes a junk drawer.
Some entries get stale.
Some answers should expire.
Some domains need different thresholds.
We can add metadata like timestamp, usage count, and TTL.
```python
import time


class SemanticCache:
    def __init__(self, model_name="all-MiniLM-L6-v2", similarity_threshold=0.90, ttl_seconds=3600):
        self.model = SentenceTransformer(model_name)
        self.similarity_threshold = similarity_threshold
        self.ttl_seconds = ttl_seconds
        self.entries = []

    def embed(self, text):
        text = normalize_text(text)
        return self.model.encode([text], normalize_embeddings=True)[0]

    def _is_expired(self, entry):
        # An entry is stale once it is older than the configured TTL.
        return (time.time() - entry["timestamp"]) > self.ttl_seconds

    def _prune_expired(self):
        self.entries = [entry for entry in self.entries if not self._is_expired(entry)]

    def lookup(self, query):
        # Drop stale entries first so they can never be returned as hits.
        self._prune_expired()

        if not self.entries:
            return None

        query_vec = self.embed(query)
        cached_vectors = np.array([entry["embedding"] for entry in self.entries])
        similarities = cosine_similarity([query_vec], cached_vectors)[0]

        best_idx = np.argmax(similarities)
        best_score = similarities[best_idx]

        if best_score >= self.similarity_threshold:
            # Usage count: how often this entry has been served from the cache.
            self.entries[best_idx]["hits"] += 1
            return {
                "hit": True,
                "score": float(best_score),
                "matched_query": self.entries[best_idx]["query"],
                "result": self.entries[best_idx]["result"],
            }

        return {
            "hit": False,
            "score": float(best_score),
        }

    def add(self, query, result):
        embedding = self.embed(query)
        self.entries.append({
            "query": query,
            "embedding": embedding,
            "result": result,
            "timestamp": time.time(),
            "hits": 0,
        })
```
This matters when your system answers questions about:
- prices
- inventory
- policy documents
- changing account states
- time-sensitive recommendations
Not every semantic hit should be reused forever.
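One way to act on that is to give each entry its own lifetime instead of a single global TTL. The sketch below is an assumption-laden illustration, not part of the class above: `TTLStore`, its per-entry `ttl_seconds` argument, and the injectable `clock` are hypothetical, and embeddings are left out to keep the expiry logic visible.

```python
import time


class TTLStore:
    """Hypothetical helper: every entry carries its own expiry time."""

    def __init__(self, default_ttl=3600, clock=time.time):
        self.default_ttl = default_ttl
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self.entries = []

    def add(self, query, result, ttl_seconds=None):
        ttl = self.default_ttl if ttl_seconds is None else ttl_seconds
        self.entries.append({
            "query": query,
            "result": result,
            "expires_at": self.clock() + ttl,
        })

    def live_entries(self):
        # Keep only entries whose expiry time is still in the future.
        now = self.clock()
        self.entries = [e for e in self.entries if e["expires_at"] > now]
        return self.entries


# A fake clock lets us jump forward in time deterministically.
fake_now = [1000.0]
store = TTLStore(default_ttl=3600, clock=lambda: fake_now[0])

store.add("What is the price of plan A?", "$10 per month", ttl_seconds=60)
store.add("How do I reset my password?", "Go to Settings → Security → Reset Password.")

fake_now[0] += 120  # two minutes later
remaining = [e["query"] for e in store.live_entries()]
print(remaining)  # ['How do I reset my password?']
```

The price answer expires after a minute while the password instructions live for the default hour, which matches how quickly each kind of answer tends to go stale.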
⸻
Step 6 — Measure whether the cache is helping
A semantic cache should be judged with numbers, not optimism.
Track at least these metrics:
- hit rate
- average similarity score on hits
- latency saved
- incorrect hit rate
- miss distribution by query type
Here is a lightweight instrumented wrapper:
```python
class CachedService:
    def __init__(self, cache):
        self.cache = cache
        self.total_requests = 0
        self.cache_hits = 0
        self.cache_misses = 0

    def expensive_pipeline(self, query):
        if "password" in query.lower():
            return "Go to Settings → Security → Reset Password."
        elif "billing" in query.lower() or "card" in query.lower():
            return "Open Account → Billing → Payment Methods."
        else:
            return "Please contact support for this request."

    def handle(self, query):
        self.total_requests += 1
        result = self.cache.lookup(query)

        if result and result["hit"]:
            self.cache_hits += 1
            return {
                "source": "cache",
                "response": result["result"],
                "score": result["score"],
            }

        self.cache_misses += 1
        response = self.expensive_pipeline(query)
        self.cache.add(query, response)

        return {
            "source": "fresh",
            "response": response,
            "score": None,
        }

    def stats(self):
        hit_rate = self.cache_hits / self.total_requests if self.total_requests else 0
        return {
            "total_requests": self.total_requests,
            "cache_hits": self.cache_hits,
            "cache_misses": self.cache_misses,
            "hit_rate": round(hit_rate, 3),
        }
```
Usage:
```python
cache = SemanticCache(similarity_threshold=0.88, ttl_seconds=7200)
service = CachedService(cache)

test_queries = [
    "How do I reset my password?",
    "I forgot my password",
    "Need help changing my password",
    "Update billing info",
    "Change my card details",
    "Close my account",
]

for q in test_queries:
    print(service.handle(q))

print(service.stats())
```
Once you can observe hit rate, and review a sample of hits for false matches, you can tune the threshold with confidence.
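The wrapper above counts hits and misses only. A possible extension, sketched here under assumptions, also records similarity scores and latency so that "average similarity on hits" and "latency saved" become measurable. `CacheMetrics` and `timed` are hypothetical names, and the recorded numbers are invented.

```python
import time
from statistics import mean


class CacheMetrics:
    """Hypothetical collector for hit scores and per-request latency."""

    def __init__(self):
        self.hit_scores = []
        self.hit_latencies = []
        self.miss_latencies = []

    def record(self, hit, latency_seconds, score=None):
        if hit:
            self.hit_scores.append(score)
            self.hit_latencies.append(latency_seconds)
        else:
            self.miss_latencies.append(latency_seconds)

    def summary(self):
        total = len(self.hit_latencies) + len(self.miss_latencies)
        return {
            "hit_rate": len(self.hit_latencies) / total if total else 0.0,
            "avg_hit_score": mean(self.hit_scores) if self.hit_scores else None,
            "avg_hit_latency": mean(self.hit_latencies) if self.hit_latencies else None,
            "avg_miss_latency": mean(self.miss_latencies) if self.miss_latencies else None,
        }


def timed(fn, *args):
    """Run fn(*args) and return (value, elapsed_seconds)."""
    start = time.perf_counter()
    value = fn(*args)
    return value, time.perf_counter() - start


# Invented measurements, just to show the shape of the summary.
metrics = CacheMetrics()
metrics.record(hit=True, latency_seconds=0.002, score=0.93)
metrics.record(hit=False, latency_seconds=0.250)
metrics.record(hit=True, latency_seconds=0.004, score=0.91)
summary = metrics.summary()
print(summary)
```

In practice you would call something like `timed(service.handle, q)` and pass `response["source"] == "cache"` as the `hit` flag. Incorrect hit rate cannot be computed automatically; it needs labeled samples or human review.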
⸻
Choosing the right similarity threshold
This is where many implementations go wrong.
A threshold that is too high gives you very few cache hits.
A threshold that is too low gives you wrong answers quickly.
Neither is good.
As a starting point (score distributions differ between embedding models, so calibrate these bands on your own data):
- 0.95+ → very strict, low risk, fewer hits
- 0.88 to 0.94 → practical for many support and FAQ cases
- below 0.85 → risky unless your domain is tightly constrained
The right threshold depends on the cost of a wrong reuse.
For example:
- password reset instructions can tolerate moderate similarity
- medical guidance should be much stricter
- legal or financial outputs may need semantic caching disabled entirely unless validated
A useful approach is to label a few hundred query pairs and inspect where correct and incorrect matches cluster.
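A minimal sketch of that labeling exercise follows. It assumes you already have similarity scores for query pairs plus a human label saying whether the two queries share an intent. The `labeled_pairs` data is invented for illustration and `sweep_thresholds` is a hypothetical helper.

```python
def sweep_thresholds(labeled_pairs, thresholds):
    """labeled_pairs: list of (similarity_score, same_intent) tuples."""
    total_same = sum(1 for _, same in labeled_pairs if same)
    rows = []
    for t in thresholds:
        # Pairs the cache would treat as hits at this threshold.
        hits = [(s, same) for s, same in labeled_pairs if s >= t]
        correct = sum(1 for _, same in hits if same)
        rows.append({
            "threshold": t,
            "hits": len(hits),
            # Precision: share of hits that were truly the same intent.
            "precision": correct / len(hits) if hits else None,
            # Recall: share of same-intent pairs the cache actually reused.
            "recall": correct / total_same if total_same else None,
        })
    return rows


# Made-up scores and labels.
labeled_pairs = [
    (0.97, True), (0.93, True), (0.91, True), (0.89, False),
    (0.88, True), (0.84, False), (0.80, True), (0.62, False),
]

rows = sweep_thresholds(labeled_pairs, [0.85, 0.90, 0.95])
for row in rows:
    print(row)
```

On this toy data, raising the threshold from 0.85 to 0.90 removes the one wrong hit but also gives up a correct one, which is the trade-off you want to see on real data before picking a value.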
⸻
Common failure modes
A semantic cache is simple in concept, but there are several ways to misuse it.
1. Caching final answers when context changes
If the answer depends on live account state, inventory, or user permissions, a cached answer may become incorrect even if the query meaning matches.
In that case, cache intermediate retrieval results or tool plans instead of final user-facing text.
2. Ignoring tenant boundaries
In multi-customer systems, semantic similarity must never cross tenant boundaries.
A perfect semantic match from the wrong tenant is still a data leak.
3. Using one threshold for every intent
Password help, refund requests, and compliance questions may require different thresholds.
A single global threshold is convenient, but often too blunt.
4. Never reviewing false positives
The dangerous errors are not misses.
They are confident wrong hits.
Those need explicit monitoring.
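Failure modes 2 and 3 can both be guarded against structurally. The sketch below shows one possible shape: a wrapper that keeps a separate cache per tenant, and a per-intent threshold table. All names (`TenantPartitionedCache`, `DictCache`, `INTENT_THRESHOLDS`, `is_acceptable_hit`) and the threshold values are assumptions, and `DictCache` is an exact-match stand-in used only so the example runs without an embedding model.

```python
class TenantPartitionedCache:
    """Hypothetical wrapper: one independent cache per tenant."""

    def __init__(self, cache_factory):
        self.cache_factory = cache_factory
        self.caches = {}

    def _cache_for(self, tenant_id):
        if tenant_id not in self.caches:
            self.caches[tenant_id] = self.cache_factory()
        return self.caches[tenant_id]

    def lookup(self, tenant_id, query):
        # Only this tenant's entries are ever searched.
        return self._cache_for(tenant_id).lookup(query)

    def add(self, tenant_id, query, result):
        self._cache_for(tenant_id).add(query, result)


class DictCache:
    """Stand-in with the same lookup/add shape as the cache, exact match only."""

    def __init__(self):
        self.data = {}

    def lookup(self, query):
        if query in self.data:
            return {"hit": True, "score": 1.0, "result": self.data[query]}
        return {"hit": False, "score": 0.0}

    def add(self, query, result):
        self.data[query] = result


# Illustrative per-intent thresholds; unknown intents get the strictest default.
INTENT_THRESHOLDS = {"password_help": 0.88, "refund": 0.93, "compliance": 0.97}
DEFAULT_THRESHOLD = 0.97


def is_acceptable_hit(score, intent):
    return score >= INTENT_THRESHOLDS.get(intent, DEFAULT_THRESHOLD)


tenant_cache = TenantPartitionedCache(DictCache)
tenant_cache.add("tenant_a", "What is our refund window?", "30 days")

same_tenant = tenant_cache.lookup("tenant_a", "What is our refund window?")
other_tenant = tenant_cache.lookup("tenant_b", "What is our refund window?")
print(same_tenant["hit"], other_tenant["hit"])  # True False
```

Passing the article's `SemanticCache` class as the factory would load the embedding model once per tenant, so a real version would more likely share one model across partitions. The intent label is assumed to come from an upstream classifier.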
⸻
A better production pattern
In real systems, the strongest design is often a layered cache:
- exact text cache first
- semantic cache second
- full retrieval/generation pipeline last
That gives you:
- fastest response for exact repeats
- strong savings for paraphrases
- full fallback when needed
The flow looks like this:
Query → exact cache → semantic cache → expensive pipeline → store result
Here is a compact implementation:
```python
class LayeredCacheService:
    def __init__(self, semantic_cache):
        self.exact_cache = {}
        self.semantic_cache = semantic_cache

    def expensive_pipeline(self, query):
        if "password" in query.lower():
            return "Go to Settings → Security → Reset Password."
        elif "billing" in query.lower() or "card" in query.lower():
            return "Open Account → Billing → Payment Methods."
        return "Please contact support for this request."

    def handle(self, query):
        normalized = normalize_text(query)

        if normalized in self.exact_cache:
            return {
                "source": "exact_cache",
                "response": self.exact_cache[normalized],
            }

        semantic_result = self.semantic_cache.lookup(query)
        if semantic_result and semantic_result["hit"]:
            self.exact_cache[normalized] = semantic_result["result"]
            return {
                "source": "semantic_cache",
                "response": semantic_result["result"],
                "score": semantic_result["score"],
            }

        fresh = self.expensive_pipeline(query)
        self.exact_cache[normalized] = fresh
        self.semantic_cache.add(query, fresh)

        return {
            "source": "fresh",
            "response": fresh,
        }
```
This layered structure is often enough to cut repeated semantic compute dramatically.
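One caveat with the exact layer is that a plain dictionary grows without bound and ignores the semantic cache's TTL. A bounded alternative could look like the sketch below, which evicts the least recently used key. `LRUExactCache` and its `max_size` parameter are hypothetical and not part of the service above.

```python
from collections import OrderedDict


class LRUExactCache:
    """Hypothetical bounded exact cache with least-recently-used eviction."""

    def __init__(self, max_size=10000):
        self.max_size = max_size
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        # Mark the key as most recently used.
        self.data.move_to_end(key)
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.max_size:
            # popitem(last=False) removes the oldest (least recently used) key.
            self.data.popitem(last=False)


lru = LRUExactCache(max_size=2)
lru.put("reset password", "Go to Settings → Security → Reset Password.")
lru.put("update card", "Open Account → Billing → Payment Methods.")
lru.get("reset password")           # touch: now the most recent
lru.put("delete account", "Please contact support for this request.")

print(list(lru.data))  # ['reset password', 'delete account']
```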
⸻
When this works best
Semantic caching delivers the most value when:
- users ask repeated questions in different wording
- embeddings are expensive or slow
- retrieval pipelines are stable
- many requests map to a small set of intents
- latency matters
Good examples include:
- customer support
- internal knowledge assistants
- FAQ bots
- product search reformulations
- document triage systems
It is less suitable when every answer depends heavily on fresh private state.
⸻
Where to take it next
The in-memory version is useful for learning, but production systems usually add:
- Redis or a vector database for persistence
- ANN search for larger cache sizes
- tenant-aware partitioning
- threshold tuning by intent type
- cache invalidation tied to source data updates
- offline evaluation against labeled query pairs
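As one illustration from this list, invalidation tied to source data can be approximated by stamping each entry with the version of the source it was derived from. This is a sketch under assumptions: `VersionedEntries`, the `source_id` values, and the integer versioning scheme are all hypothetical.

```python
class VersionedEntries:
    """Hypothetical store: entries remember which source version produced them."""

    def __init__(self):
        self.source_versions = {}  # source_id -> current version
        self.entries = []

    def set_source_version(self, source_id, version):
        self.source_versions[source_id] = version

    def add(self, query, result, source_id):
        self.entries.append({
            "query": query,
            "result": result,
            "source_id": source_id,
            "source_version": self.source_versions.get(source_id),
        })

    def valid_entries(self):
        # An entry is valid only while its source has not changed since caching.
        return [
            e for e in self.entries
            if e["source_version"] == self.source_versions.get(e["source_id"])
        ]


store = VersionedEntries()
store.set_source_version("pricing_doc", 1)
store.set_source_version("faq_doc", 1)

store.add("How much is plan A?", "$10 per month", source_id="pricing_doc")
store.add("How do I reset my password?", "Go to Settings → Security → Reset Password.", source_id="faq_doc")

store.set_source_version("pricing_doc", 2)  # the pricing page was edited
still_valid = [e["query"] for e in store.valid_entries()]
print(still_valid)  # ['How do I reset my password?']
```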
You can also cache more than answers.
Strong candidates include:
- retrieved document IDs
- tool selection decisions
- query rewrites
- classification labels
- reranking outputs
That is often safer than caching full generated text.
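A small sketch shows why. Here the cache holds retrieved document IDs rather than the rendered answer, so the response is rebuilt from current document text on every hit. The `documents` store, `cached_doc_ids` mapping, and `render_answer` function are illustrative stand-ins for a real retrieval layer.

```python
# Stand-in document store; in a real system this is your database or index.
documents = {"doc-1": "Plan A costs $10/month."}

# The cache maps a normalized query to document IDs, not to answer text.
cached_doc_ids = {"how much is plan a": ["doc-1"]}


def render_answer(doc_ids, document_store):
    """Build the response from the current content of the cached documents."""
    return " ".join(document_store[d] for d in doc_ids if d in document_store)


before = render_answer(cached_doc_ids["how much is plan a"], documents)

# The source document changes; the cached IDs stay valid.
documents["doc-1"] = "Plan A costs $12/month."
after = render_answer(cached_doc_ids["how much is plan a"], documents)

print(before)  # Plan A costs $10/month.
print(after)   # Plan A costs $12/month.
```

The expensive retrieval step is skipped on a hit, but the user still sees the current price.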
⸻
Final thought
A lot of AI infrastructure effort goes into bigger models, longer contexts, and more elaborate agent loops.
Meanwhile, many systems are still paying full price to solve the same semantic problem again and again.
A semantic cache is not flashy.
It does not need a new model release.
It does not depend on agent orchestration.
It simply removes waste from pipelines that already work.
And in production, removing waste is often the fastest path to a better system.
✓ If you build retrieval, support, or assistant workflows in Python, this is one of the highest-leverage optimizations you can add early.
