AI Infrastructure

A Practical Semantic Cache in Python

Every team building search, assistants, recommendation layers, or document workflows eventually runs into the same quiet problem: the embedding bill keeps growing while latency gets worse.

The issue is rarely the vector model itself.

It is the repeated work around it.

The same customer question gets embedded thousands of times. The same product description is reprocessed after tiny formatting changes. The same support messages trigger identical retrieval paths over and over again.

A semantic system that recomputes everything on every request is not intelligent. It is expensive.

A better pattern is a semantic cache.

This article builds a simple version in Python.

The goal is straightforward:

  • avoid recomputing embeddings for near-duplicate text
  • reduce latency for repeated semantic requests
  • keep retrieval quality high
  • make the system easy to inspect and extend

This is not a toy optimization. It is one of the most practical ways to improve real-world AI pipelines.

Why semantic caching matters

Traditional caching works when inputs match exactly.

If a user asks:

  • "How do I reset my password?"
  • "How can I change my password?"
  • "I forgot my password, what should I do?"

a normal cache sees three different strings.

A semantic cache sees one intent.

That difference matters in systems that rely on embeddings, vector search, reranking, or LLM context assembly.
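To make the contrast concrete, here is a minimal sketch of an exact-match cache built on a plain dict. The `exact_cache` name and the stored answer are illustrative only. A byte-for-byte repeat hits; both paraphrases miss.

```python
# Illustrative exact-match cache: a plain dict keyed by the raw query string.
exact_cache = {
    "How do I reset my password?": "Go to Settings → Security → Reset Password.",
}

queries = [
    "How do I reset my password?",              # exact repeat
    "How can I change my password?",            # paraphrase
    "I forgot my password, what should I do?",  # paraphrase
]

# A dict lookup only succeeds when the strings are identical.
exact_hits = [q in exact_cache for q in queries]

for q, hit in zip(queries, exact_hits):
    print(f"{'HIT ' if hit else 'MISS'} {q}")
```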

A semantic cache can sit in front of:

  • embedding generation
  • retrieval pipelines
  • FAQ systems
  • agent tools
  • support automation
  • recommendation engines

The key idea is simple:

Instead of caching by exact text, cache by meaning.

What we are building

We will create a Python semantic cache that:

  • converts text into embeddings
  • stores prior queries and their vectors
  • compares new queries to cached vectors
  • returns a cached result when similarity is high enough
  • computes a fresh result when similarity is too low

For the embedding model, we will use sentence-transformers because it is practical and widely used.

Install dependencies:

pip install sentence-transformers scikit-learn numpy

The core design

Our cache needs four parts:

  • a way to embed text
  • a place to store vectors and results
  • a similarity function
  • a threshold for deciding cache hit vs miss

We will use cosine similarity.

If a new query is close enough to a stored query, we reuse the old result.
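As a reminder of what that comparison computes: cosine similarity is the dot product of two vectors divided by the product of their lengths. The pure-Python sketch below spells it out. The names `cosine` and `is_hit` and the toy 3-dimensional vectors are made up for illustration; real embeddings have hundreds of dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_a * norm_b)

def is_hit(query_vec, cached_vec, threshold=0.90):
    """Cache-hit rule: reuse the cached result when similarity >= threshold."""
    return cosine(query_vec, cached_vec) >= threshold

# Toy vectors, not real embeddings.
cached = [1.0, 0.0, 0.0]
near = [0.9, 0.1, 0.0]   # points almost the same way as `cached`
far = [0.0, 1.0, 0.0]    # orthogonal to `cached`

print(is_hit(near, cached), is_hit(far, cached))
```

When the vectors are already unit length, the denominator is 1 and the dot product alone gives the cosine.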

Step 1 — A minimal semantic cache

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

class SemanticCache:
    def __init__(self, model_name="all-MiniLM-L6-v2", similarity_threshold=0.90):
        self.model = SentenceTransformer(model_name)
        self.similarity_threshold = similarity_threshold
        # Each entry is a dict with "query", "embedding", and "result".
        self.entries = []

    def embed(self, text):
        # normalize_embeddings=True returns unit-length vectors.
        return self.model.encode([text], normalize_embeddings=True)[0]

    def lookup(self, query):
        # Returns None for an empty cache, otherwise a dict with "hit" and "score".
        if not self.entries:
            return None
        query_vec = self.embed(query)
        cached_vectors = np.array([entry["embedding"] for entry in self.entries])
        # One row of scores: the query against every cached vector.
        similarities = cosine_similarity([query_vec], cached_vectors)[0]
        best_idx = int(np.argmax(similarities))
        best_score = similarities[best_idx]
        if best_score >= self.similarity_threshold:
            return {
                "hit": True,
                "score": float(best_score),
                "matched_query": self.entries[best_idx]["query"],
                "result": self.entries[best_idx]["result"]
            }
        # Miss: report the best score anyway so it can be inspected.
        return {
            "hit": False,
            "score": float(best_score)
        }

    def add(self, query, result):
        # Store the query, its vector, and the result to reuse later.
        embedding = self.embed(query)
        self.entries.append({
            "query": query,
            "embedding": embedding,
            "result": result
        })

This version keeps everything in memory.

That is enough to understand the mechanism before adding persistence or scaling.

Step 2 — Simulate a real use case

Let's imagine a support assistant that maps user questions to prepared answers.

cache = SemanticCache(similarity_threshold=0.88)
cache.add(
    "How do I reset my password?",
    "Go to Settings → Security → Reset Password.")
cache.add(
    "How can I update my billing information?",
    "Open Account → Billing → Payment Methods.")

queries = [
    "I forgot my password, what should I do?",
    "Where do I change my card details?",
    "How do I delete my account?"]

for q in queries:
    result = cache.lookup(q)
    print(f"\nQuery: {q}")
    print(result)

The output has the following shape. The scores shown are illustrative, not measured: real values depend on the embedding model, and loosely worded paraphrases can score below 0.88 and come back as misses, so check the actual scores before settling on a threshold.

Query: I forgot my password, what should I do?
{'hit': True, 'score': 0.92, 'matched_query': 'How do I reset my password?', 'result': 'Go to Settings → Security → Reset Password.'}

Query: Where do I change my card details?
{'hit': True, 'score': 0.89, 'matched_query': 'How can I update my billing information?', 'result': 'Open Account → Billing → Payment Methods.'}

Query: How do I delete my account?
{'hit': False, 'score': 0.54}

This is the essential behavior.

Two semantically similar questions become cache hits.

A different question becomes a miss.

Step 3 — Add a fallback function

A cache is only useful if misses can be handled automatically.

In practice, a miss might trigger:

  • an embedding + vector search pipeline
  • an LLM call
  • a database query
  • a rules engine
  • a recommendation model

Here is a simple wrapper that uses the cache first and computes only when needed.

def expensive_pipeline(query):
    if "password" in query.lower():
        return "Go to Settings → Security → Reset Password."
    elif "billing" in query.lower() or "card" in query.lower():
        return "Open Account → Billing → Payment Methods."
    else:
        return "Please contact support for this request."

def get_response(query, cache):
    lookup_result = cache.lookup(query)
    if lookup_result and lookup_result["hit"]:
        return {
            "source": "cache",
            "score": lookup_result["score"],
            "response": lookup_result["result"]
        }
    fresh_result = expensive_pipeline(query)
    cache.add(query, fresh_result)
    return {
        "source": "fresh",
        "score": None,
        "response": fresh_result
    }

Now test it:

queries = [
    "I forgot my password",
    "How do I reset my password?",
    "Need to update my card",
    "Need to update my payment method",
    "Delete my account"]

for q in queries:
    print(q, "->", get_response(q, cache))

This pattern is where semantic caching becomes operationally useful.

The first request pays the full cost.

Related requests become cheaper.
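One detail worth noticing: on a miss, `get_response` embeds the query twice, once inside `lookup` and once inside `add`. A possible refinement, sketched below, is to embed once and pass the vector to both steps. To stay runnable without downloading a model, the sketch uses a hypothetical `toy_embed` function (a bag-of-words stand-in, not a real embedding model) and hypothetical method names `lookup_vec` and `add_vec`.

```python
import math

STATS = {"embed_calls": 0}  # counts embedding calls so the saving is visible

def toy_embed(text):
    """Stand-in for a real model: unit-length bag-of-words vector as a dict."""
    STATS["embed_calls"] += 1
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {w: c / norm for w, c in counts.items()}

def dot(a, b):
    """Dot product of two sparse vectors; equals cosine for unit vectors."""
    return sum(v * b.get(w, 0.0) for w, v in a.items())

class VectorFirstCache:
    def __init__(self, threshold=0.90):
        self.threshold = threshold
        self.entries = []  # list of (vector, result) pairs

    def lookup_vec(self, vec):
        """Return the best cached result at or above the threshold, else None."""
        best = max(self.entries, key=lambda e: dot(vec, e[0]), default=None)
        if best is not None and dot(vec, best[0]) >= self.threshold:
            return best[1]
        return None

    def add_vec(self, vec, result):
        self.entries.append((vec, result))

def get_response_once(query, cache, pipeline):
    vec = toy_embed(query)          # embed exactly once per request
    cached = cache.lookup_vec(vec)
    if cached is not None:
        return {"source": "cache", "response": cached}
    fresh = pipeline(query)
    cache.add_vec(vec, fresh)       # reuse the same vector on a miss
    return {"source": "fresh", "response": fresh}

demo_cache = VectorFirstCache()
first = get_response_once("reset my password", demo_cache, lambda q: "answer")
second = get_response_once("Reset my password", demo_cache, lambda q: "answer")
print(first["source"], second["source"], STATS["embed_calls"])
```

Two requests cost two embedding calls here, where the miss-then-add path in the original wrapper would have cost three.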

Step 4 — Avoid bad cache hits with text normalization

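Semantic similarity is powerful, but raw text still contains noise: inconsistent casing, stray whitespace, trailing line breaks.

Normalizing the text before embedding keeps scores consistent for inputs that differ only in formatting, which reduces accidental misses.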

import re

def normalize_text(text):
    text = text.lower().strip()
    text = re.sub(r"\s+", " ", text)
    return text

Update the cache class:

class SemanticCache:
    def __init__(self, model_name="all-MiniLM-L6-v2", similarity_threshold=0.90):
        self.model = SentenceTransformer(model_name)
        self.similarity_threshold = similarity_threshold
        self.entries = []

    def embed(self, text):
        text = normalize_text(text)
        return self.model.encode([text], normalize_embeddings=True)[0]

    def lookup(self, query):
        if not self.entries:
            return None
        query_vec = self.embed(query)
        cached_vectors = np.array([entry["embedding"] for entry in self.entries])
        similarities = cosine_similarity([query_vec], cached_vectors)[0]
        best_idx = np.argmax(similarities)
        best_score = similarities[best_idx]
        if best_score >= self.similarity_threshold:
            return {
                "hit": True,
                "score": float(best_score),
                "matched_query": self.entries[best_idx]["query"],
                "result": self.entries[best_idx]["result"]
            }
        return {
            "hit": False,
            "score": float(best_score)
        }

    def add(self, query, result):
        embedding = self.embed(query)
        self.entries.append({
            "query": query,
            "embedding": embedding,
            "result": result
        })

Normalization will not solve every issue, but it improves stability.
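If the basic version leaves too much noise, normalization can be extended. The sketch below is one hypothetical variant, `normalize_text_strict`, that adds Unicode normalization (NFKC), case folding, and removal of trailing sentence punctuation using only the standard library. Whether stripping punctuation helps depends on the data, so treat it as an experiment rather than a default.

```python
import re
import unicodedata

def normalize_text_strict(text):
    """Hypothetical stricter variant of normalize_text."""
    text = unicodedata.normalize("NFKC", text)   # fold compatibility forms such as full-width letters
    text = text.casefold().strip()               # aggressive lowercasing
    text = re.sub(r"\s+", " ", text)             # collapse runs of whitespace
    text = re.sub(r"[?!.]+$", "", text).strip()  # drop trailing sentence punctuation
    return text

samples = [
    "How do I reset my password?",
    "  how do I   reset my password ?? ",
    "\uff28\uff2f\uff37 do I reset my password",  # "HOW" written in full-width letters
]
normalized = {normalize_text_strict(s) for s in samples}
print(normalized)
```

All three inputs collapse to the same string, which also makes them usable as a single exact-match key.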

Step 5 — Add metadata and expiration

A cache without lifecycle control eventually becomes a junk drawer.

Some entries get stale.

Some answers should expire.

Some domains need different thresholds.

We can add metadata like timestamp, usage count, and TTL.

import time

class SemanticCache:
    def __init__(self, model_name="all-MiniLM-L6-v2", similarity_threshold=0.90, ttl_seconds=3600):
        self.model = SentenceTransformer(model_name)
        self.similarity_threshold = similarity_threshold
        self.ttl_seconds = ttl_seconds
        self.entries = []

    def embed(self, text):
        text = normalize_text(text)
        return self.model.encode([text], normalize_embeddings=True)[0]

    def _is_expired(self, entry):
        return (time.time() - entry["timestamp"]) > self.ttl_seconds

    def _prune_expired(self):
        self.entries = [entry for entry in self.entries if not self._is_expired(entry)]

    def lookup(self, query):
        self._prune_expired()
        if not self.entries:
            return None
        query_vec = self.embed(query)
        cached_vectors = np.array([entry["embedding"] for entry in self.entries])
        similarities = cosine_similarity([query_vec], cached_vectors)[0]
        best_idx = np.argmax(similarities)
        best_score = similarities[best_idx]
        if best_score >= self.similarity_threshold:
            self.entries[best_idx]["hits"] += 1
            return {
                "hit": True,
                "score": float(best_score),
                "matched_query": self.entries[best_idx]["query"],
                "result": self.entries[best_idx]["result"]
            }
        return {
            "hit": False,
            "score": float(best_score)
        }

    def add(self, query, result):
        embedding = self.embed(query)
        self.entries.append({
            "query": query,
            "embedding": embedding,
            "result": result,
            "timestamp": time.time(),
            "hits": 0
        })

This matters when your system answers questions about:

  • prices
  • inventory
  • policy documents
  • changing account states
  • time-sensitive recommendations

Not every semantic hit should be reused forever.
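The `hits` and `timestamp` fields are recorded above but used only for expiry. One possible further use, sketched here under the assumption that you want a hard cap on cache size, is to evict the least-used (and, on ties, oldest) entries. The function name `evict_to_capacity` and the `max_entries` parameter are hypothetical; the function operates on a list of entry dicts shaped like `self.entries`.

```python
def evict_to_capacity(entries, max_entries):
    """Keep at most max_entries entries, preferring high hit counts, then recency.

    Each entry is a dict with at least "hits" and "timestamp" keys,
    matching the entry layout used by the cache in this step.
    """
    if len(entries) <= max_entries:
        return list(entries)
    # Sort so the most valuable entries come first.
    ranked = sorted(entries, key=lambda e: (e["hits"], e["timestamp"]), reverse=True)
    return ranked[:max_entries]

entries = [
    {"query": "reset password", "hits": 12, "timestamp": 100.0},
    {"query": "update card", "hits": 0, "timestamp": 300.0},
    {"query": "delete account", "hits": 0, "timestamp": 200.0},
    {"query": "change email", "hits": 3, "timestamp": 150.0},
]
kept = evict_to_capacity(entries, max_entries=3)
print([e["query"] for e in kept])
```

Calling something like `self.entries = evict_to_capacity(self.entries, 10_000)` at the end of `add` would be one place to apply it.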

Step 6 — Measure whether the cache is helping

A semantic cache should be judged with numbers, not optimism.

Track at least these metrics:

  • hit rate
  • average similarity score on hits
  • latency saved
  • incorrect hit rate
  • miss distribution by query type

Here is a lightweight instrumented wrapper:

class CachedService:
    def __init__(self, cache):
        self.cache = cache
        self.total_requests = 0
        self.cache_hits = 0
        self.cache_misses = 0

    def expensive_pipeline(self, query):
        if "password" in query.lower():
            return "Go to Settings → Security → Reset Password."
        elif "billing" in query.lower() or "card" in query.lower():
            return "Open Account → Billing → Payment Methods."
        else:
            return "Please contact support for this request."

    def handle(self, query):
        self.total_requests += 1
        result = self.cache.lookup(query)
        if result and result["hit"]:
            self.cache_hits += 1
            return {
                "source": "cache",
                "response": result["result"],
                "score": result["score"]
            }
        self.cache_misses += 1
        response = self.expensive_pipeline(query)
        self.cache.add(query, response)
        return {
            "source": "fresh",
            "response": response,
            "score": None
        }

    def stats(self):
        hit_rate = self.cache_hits / self.total_requests if self.total_requests else 0
        return {
            "total_requests": self.total_requests,
            "cache_hits": self.cache_hits,
            "cache_misses": self.cache_misses,
            "hit_rate": round(hit_rate, 3)
        }

Usage:

cache = SemanticCache(similarity_threshold=0.88, ttl_seconds=7200)
service = CachedService(cache)

test_queries = [
    "How do I reset my password?",
    "I forgot my password",
    "Need help changing my password",
    "Update billing info",
    "Change my card details",
    "Close my account"]

for q in test_queries:
    print(service.handle(q))

print(service.stats())

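This wrapper reports hit rate only.

False matches cannot be counted automatically. They have to come from reviewing a sample of hits against labeled or human-judged pairs.

Once both numbers are visible, you can tune the threshold on evidence.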
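Average hit score and latency, two of the metrics listed earlier, can be tracked with a small helper. `CacheMetrics` and its method names are hypothetical; timings come from `time.perf_counter`, and the "latency saved" figure is an estimate that assumes every hit would otherwise have cost the average observed miss latency.

```python
import time

class CacheMetrics:
    def __init__(self):
        self.hit_scores = []      # similarity score of each hit
        self.hit_latencies = []   # seconds spent serving each hit
        self.miss_latencies = []  # seconds spent serving each miss

    def record(self, hit, latency, score=None):
        if hit:
            self.hit_latencies.append(latency)
            if score is not None:
                self.hit_scores.append(score)
        else:
            self.miss_latencies.append(latency)

    def timed(self, handler, query):
        """Call handler(query), which returns a dict with "source" and "score"."""
        start = time.perf_counter()
        result = handler(query)
        elapsed = time.perf_counter() - start
        self.record(result["source"] == "cache", elapsed, result.get("score"))
        return result

    def summary(self):
        def mean(xs):
            return sum(xs) / len(xs) if xs else 0.0
        avg_hit, avg_miss = mean(self.hit_latencies), mean(self.miss_latencies)
        return {
            "avg_hit_score": mean(self.hit_scores),
            "avg_hit_latency": avg_hit,
            "avg_miss_latency": avg_miss,
            # Estimate: each hit avoided one average-cost miss.
            "est_latency_saved": max(avg_miss - avg_hit, 0.0) * len(self.hit_latencies),
        }

metrics = CacheMetrics()
metrics.record(hit=False, latency=0.50)
metrics.record(hit=True, latency=0.25, score=0.75)
metrics.record(hit=True, latency=0.25, score=1.0)
print(metrics.summary())
```

With the service from this step, a call such as `metrics.timed(service.handle, q)` would record each request, since `handle` already returns `source` and `score`.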

Choosing the right similarity threshold

This is where many implementations go wrong.

A threshold that is too high gives you very few cache hits.

A threshold that is too low gives you wrong answers quickly.

Neither is good.

As a starting point (cosine scores are model-specific, so recalibrate these ranges for the embedding model you use):

  • 0.95 and above → very strict, low risk, fewer hits
  • 0.88 up to 0.95 → practical for many support and FAQ cases
  • below 0.88 → risky unless your domain is tightly constrained

The right threshold depends on the cost of a wrong reuse.

For example:

  • password reset instructions can tolerate moderate similarity
  • medical guidance should be much stricter
  • legal or financial outputs may need semantic caching disabled entirely unless validated

A useful approach is to label a few hundred query pairs and inspect where correct and incorrect matches cluster.
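That inspection can be sketched as a threshold sweep over labeled pairs. In the snippet below, `labeled` is a made-up list of (similarity score, should-match) tuples standing in for your own hand-labeled data, and the function name `sweep` is hypothetical. For each candidate threshold it reports precision (the share of hits that were correct) and recall (the share of correct matches that were reused).

```python
def sweep(labeled, thresholds):
    """labeled: list of (score, should_match) pairs; returns one row per threshold."""
    rows = []
    for t in thresholds:
        tp = sum(1 for s, ok in labeled if s >= t and ok)      # correct hits
        fp = sum(1 for s, ok in labeled if s >= t and not ok)  # wrong hits
        fn = sum(1 for s, ok in labeled if s < t and ok)       # missed reuse
        precision = tp / (tp + fp) if (tp + fp) else 1.0  # no hits: nothing wrong was served
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        rows.append({"threshold": t, "precision": precision, "recall": recall})
    return rows

# Made-up scores for illustration only; replace with your labeled pairs.
labeled = [
    (0.97, True), (0.93, True), (0.91, False), (0.89, True),
    (0.86, True), (0.84, False), (0.80, False), (0.78, True),
]

rows = sweep(labeled, [0.85, 0.90, 0.95])
for row in rows:
    print(row)
```

On this toy data, raising the threshold from 0.85 to 0.95 removes the wrong hit but also gives up most of the correct reuse, which is the trade-off the threshold controls.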

Common failure modes

A semantic cache is simple in concept, but there are several ways to misuse it.

1. Caching final answers when context changes

If the answer depends on live account state, inventory, or user permissions, a cached answer may become incorrect even if the query meaning matches.

In that case, cache intermediate retrieval results or tool plans instead of final user-facing text.

2. Ignoring tenant boundaries

In multi-customer systems, semantic similarity must never cross tenant boundaries.

A perfect semantic match from the wrong tenant is still a data leak.
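One way to enforce this, sketched below, is to give each tenant its own cache so that a lookup can never see another tenant's entries. `TenantScopedCache` and `make_cache` are hypothetical names; the factory is expected to return any object with `lookup(query)` and `add(query, result)` methods, such as the `SemanticCache` from this article. The demo uses a trivial exact-match stand-in so it runs without a model.

```python
class TenantScopedCache:
    def __init__(self, make_cache):
        self._make_cache = make_cache  # factory: () -> cache with lookup/add
        self._by_tenant = {}           # tenant_id -> that tenant's private cache

    def _cache_for(self, tenant_id):
        if tenant_id not in self._by_tenant:
            self._by_tenant[tenant_id] = self._make_cache()
        return self._by_tenant[tenant_id]

    def lookup(self, tenant_id, query):
        # Only this tenant's entries are ever compared against the query.
        return self._cache_for(tenant_id).lookup(query)

    def add(self, tenant_id, query, result):
        self._cache_for(tenant_id).add(query, result)

class ExactStandInCache:
    """Demo-only stand-in for SemanticCache (exact match instead of similarity)."""
    def __init__(self):
        self._data = {}

    def lookup(self, query):
        if query in self._data:
            return {"hit": True, "score": 1.0, "result": self._data[query]}
        return None

    def add(self, query, result):
        self._data[query] = result

scoped = TenantScopedCache(ExactStandInCache)
scoped.add("tenant-a", "What is our refund window?", "30 days")
print(scoped.lookup("tenant-a", "What is our refund window?"))
print(scoped.lookup("tenant-b", "What is our refund window?"))
```

With `SemanticCache` as written, each instance would load its own model, so sharing one model object across tenants would be a sensible follow-up.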

3. Using one threshold for every intent

Password help, refund requests, and compliance questions may require different thresholds.

A single global threshold is convenient, but often too blunt.
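A sketch of per-intent thresholds is below. The intent labels, the numeric values, and the `threshold_for` and `accept_hit` helpers are all hypothetical; how an intent gets assigned (a classifier, routing rules, or metadata on the matched entry) is left open.

```python
# Hypothetical per-intent thresholds; the numbers are placeholders to tune.
INTENT_THRESHOLDS = {
    "password_help": 0.88,
    "refund": 0.93,
    "compliance": 0.97,
}
DEFAULT_THRESHOLD = 0.95  # unknown intents fall back to a strict default

def threshold_for(intent):
    return INTENT_THRESHOLDS.get(intent, DEFAULT_THRESHOLD)

def accept_hit(score, intent):
    """Decide whether a similarity score is high enough for this intent."""
    return score >= threshold_for(intent)

print(accept_hit(0.90, "password_help"))  # lenient intent
print(accept_hit(0.90, "compliance"))     # strict intent
print(accept_hit(0.90, "unknown"))        # falls back to the default
```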

4. Never reviewing false positives

The dangerous errors are not misses.

They are confident wrong hits.

Those need explicit monitoring.
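A minimal sketch of such monitoring: record every hit with enough context to review it later, and put hits that score just above the threshold at the front of the queue, on the assumption that those are the likeliest to be wrong. `HitReviewLog`, the `margin` parameter, and its default value are hypothetical.

```python
class HitReviewLog:
    def __init__(self, threshold, margin=0.03):
        self.threshold = threshold
        self.margin = margin   # width of the "borderline" band above the threshold
        self.records = []

    def record_hit(self, query, matched_query, score):
        self.records.append({
            "query": query,
            "matched_query": matched_query,
            "score": score,
            "borderline": score < self.threshold + self.margin,
        })

    def review_queue(self):
        """Borderline hits only, lowest score first, for human review."""
        borderline = [r for r in self.records if r["borderline"]]
        return sorted(borderline, key=lambda r: r["score"])

log = HitReviewLog(threshold=0.88)
log.record_hit("cancel my plan", "How do I delete my account?", 0.885)
log.record_hit("forgot password", "How do I reset my password?", 0.96)
log.record_hit("change card", "How can I update my billing information?", 0.90)

for r in log.review_queue():
    print(r["score"], r["query"], "->", r["matched_query"])
```

The `matched_query` and `score` fields are already present in the dict that `lookup` returns on a hit, so wiring this in requires no change to the cache itself.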

A better production pattern

In real systems, the strongest design is often a layered cache:

  • exact text cache first
  • semantic cache second
  • full retrieval/generation pipeline last

That gives you:

  • fastest response for exact repeats
  • strong savings for paraphrases
  • full fallback when needed

The flow looks like this:

Query → exact cache → semantic cache → expensive pipeline → store result

Here is a compact implementation:

class LayeredCacheService:
    def __init__(self, semantic_cache):
        self.exact_cache = {}
        self.semantic_cache = semantic_cache

    def expensive_pipeline(self, query):
        if "password" in query.lower():
            return "Go to Settings → Security → Reset Password."
        elif "billing" in query.lower() or "card" in query.lower():
            return "Open Account → Billing → Payment Methods."
        return "Please contact support for this request."

    def handle(self, query):
        normalized = normalize_text(query)
        if normalized in self.exact_cache:
            return {
                "source": "exact_cache",
                "response": self.exact_cache[normalized]
            }
        semantic_result = self.semantic_cache.lookup(query)
        if semantic_result and semantic_result["hit"]:
            self.exact_cache[normalized] = semantic_result["result"]
            return {
                "source": "semantic_cache",
                "response": semantic_result["result"],
                "score": semantic_result["score"]
            }
        fresh = self.expensive_pipeline(query)
        self.exact_cache[normalized] = fresh
        self.semantic_cache.add(query, fresh)
        return {
            "source": "fresh",
            "response": fresh
        }

With this structure, exact repeats skip the embedding step entirely, and paraphrases skip the expensive pipeline.
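One caveat with the layered service as written: `exact_cache` is a plain dict, so its entries never expire, even when the semantic cache behind it uses a TTL. A sketch of an exact cache with expiry follows. `TTLDict` is a hypothetical helper, and the injectable `clock` argument exists only to make the behavior easy to test.

```python
import time

class TTLDict:
    """Exact-match cache whose entries expire after ttl_seconds."""
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._data = {}  # key -> (value, stored_at)

    def set(self, key, value):
        self._data[key] = (value, self._clock())

    def get(self, key, default=None):
        item = self._data.get(key)
        if item is None:
            return default
        value, stored_at = item
        if self._clock() - stored_at > self.ttl_seconds:
            del self._data[key]  # lazily drop the stale entry
            return default
        return value

# Demo with a fake clock so no real waiting is needed.
now = [0.0]
exact = TTLDict(ttl_seconds=60, clock=lambda: now[0])
exact.set("how do i reset my password?", "Go to Settings → Security → Reset Password.")
fresh_read = exact.get("how do i reset my password?")
now[0] = 61.0  # advance the fake clock past the TTL
stale_read = exact.get("how do i reset my password?")
print(fresh_read, stale_read)
```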

When this works best

Semantic caching delivers the most value when:

  • users ask repeated questions in different wording
  • embeddings are expensive or slow
  • retrieval pipelines are stable
  • many requests map to a small set of intents
  • latency matters

Good examples include:

  • customer support
  • internal knowledge assistants
  • FAQ bots
  • product search reformulations
  • document triage systems

It is less suitable when every answer depends heavily on fresh private state.

Where to take it next

The in-memory version is useful for learning, but production systems usually add:

  • Redis or a vector database for persistence
  • ANN search for larger cache sizes
  • tenant-aware partitioning
  • threshold tuning by intent type
  • cache invalidation tied to source data updates
  • offline evaluation against labeled query pairs
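As one illustration of invalidation tied to source data, each cached entry can record which source documents its result was derived from, so that an update to a document removes every dependent entry. The sketch below is hypothetical throughout (`SourceTaggedStore`, `source_ids`, `invalidate_source`) and stores entries by key only; the similarity lookup is omitted to keep the focus on invalidation.

```python
from collections import defaultdict

class SourceTaggedStore:
    def __init__(self):
        self.entries = {}                   # entry key -> cached result
        self._by_source = defaultdict(set)  # source id -> keys that depend on it

    def add(self, key, result, source_ids):
        self.entries[key] = result
        for sid in source_ids:
            self._by_source[sid].add(key)

    def invalidate_source(self, source_id):
        """Drop every entry derived from source_id; return how many were removed."""
        keys = self._by_source.pop(source_id, set())
        removed = 0
        for key in keys:
            if self.entries.pop(key, None) is not None:
                removed += 1
        return removed

store = SourceTaggedStore()
store.add("refund policy?", "Refunds within 30 days.", ["doc-policy"])
store.add("return window?", "30 days from delivery.", ["doc-policy", "doc-shipping"])
store.add("reset password?", "Settings → Security.", ["doc-account"])

removed = store.invalidate_source("doc-policy")  # the policy document changed
print(removed, sorted(store.entries))
```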

You can also cache more than answers.

Strong candidates include:

  • retrieved document IDs
  • tool selection decisions
  • query rewrites
  • classification labels
  • reranking outputs

That is often safer than caching full generated text.
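To illustrate why: if the cache stores retrieved document IDs rather than final text, the answer is rebuilt from the current documents on every request, so an edit to a document shows up immediately. Everything in this sketch is hypothetical (the `DOCS` store, `retrieve_ids`, `answer`), and an exact-match dict stands in for the semantic lookup.

```python
# Hypothetical live document store; in practice a database or index.
DOCS = {"kb-12": "Refunds are accepted within 30 days."}

retrieval_cache = {}          # query -> list of document IDs (not final text)
COUNTS = {"retrievals": 0}    # how often the expensive step actually ran

def retrieve_ids(query):
    """Stand-in for an expensive retrieval step."""
    COUNTS["retrievals"] += 1
    return ["kb-12"]

def answer(query):
    ids = retrieval_cache.get(query)
    if ids is None:
        ids = retrieve_ids(query)  # pay for retrieval once
        retrieval_cache[query] = ids
    # Always read the current document text, never a cached copy of it.
    return " ".join(DOCS[i] for i in ids if i in DOCS)

before = answer("refund policy")
DOCS["kb-12"] = "Refunds are accepted within 14 days."  # source document edited
after = answer("refund policy")
print(before, "|", after, "|", COUNTS["retrievals"])
```

The second call reuses the cached IDs, so retrieval runs once, yet the response reflects the edited document.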

Final thought

A lot of AI infrastructure effort goes into bigger models, longer contexts, and more elaborate agent loops.

Meanwhile, many systems are still paying full price to solve the same semantic problem again and again.

A semantic cache is not flashy.

It does not need a new model release.

It does not depend on agent orchestration.

It simply removes waste from pipelines that already work.

And in production, removing waste is often the fastest path to a better system.

If you build retrieval, support, or assistant workflows in Python, this is one of the highest-leverage optimizations you can add early.