Advanced · Python · ~20 min

Semantic Caching

The Problem

LLM API calls are expensive ($0.01-$0.10+ per call) and slow (1-10 seconds). Users often ask similar questions with different wording:

"What is Valkey?" → "Can you explain Valkey?" → "Tell me about Valkey"
All three should return the same cached answer

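Exact-match caching (GET cache:{hash}) misses these because each wording hashes to a different key. Semantic caching instead compares query embeddings by vector similarity, so it matches on meaning rather than on exact text.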
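
To make the exact-match problem concrete, here is a minimal sketch that hashes the three example questions. The helper name `exact_cache_key` and the choice of SHA-256 are assumptions for illustration, not part of the cookbook's code.

```python
import hashlib


def exact_cache_key(query: str) -> str:
    """Build an exact-match cache key from the raw query text (hypothetical helper)."""
    digest = hashlib.sha256(query.encode("utf-8")).hexdigest()[:12]
    return f"cache:{digest}"


queries = ["What is Valkey?", "Can you explain Valkey?", "Tell me about Valkey"]
keys = [exact_cache_key(q) for q in queries]

# Each wording produces its own key, so GET on one never finds the others.
for query, key in zip(queries, keys):
    print(f"{query!r} -> {key}")
```

Only a byte-for-byte identical query maps to the same key, which is why paraphrases always fall through to the LLM under this scheme.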

Architecture

User: "Can you explain Valkey?"
        │
        ▼
   Embed query → [0.12, -0.45, ...]
        │
        ▼
   FT.SEARCH cache_idx KNN 1 → score 0.95
        │
        ├── score ≥ 0.90 → ✅ Cache HIT → return cached response
        └── score < 0.90 → ❌ Cache MISS → call LLM → cache result
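
The hit/miss decision in the diagram can be sketched without Valkey at all. The three-dimensional vectors below are made-up toy values standing in for real embeddings, and `decide` is a hypothetical helper; in the actual flow the search index computes the score.

```python
import math

SIMILARITY_THRESHOLD = 0.90


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def decide(query_vec, cached_vec):
    """Return ("HIT" or "MISS", score) for one cached entry."""
    score = cosine_similarity(query_vec, cached_vec)
    return ("HIT" if score >= SIMILARITY_THRESHOLD else "MISS", score)


# Toy vectors (assumption): a cached query, a paraphrase, an unrelated query.
cached = [0.12, -0.45, 0.80]
paraphrase = [0.10, -0.40, 0.85]
unrelated = [0.90, 0.30, -0.10]

print(decide(paraphrase, cached))
print(decide(unrelated, cached))
```

The paraphrase points in nearly the same direction as the cached vector and clears the threshold, while the unrelated vector does not.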

Step 1: Create the Cache Index

from glide import (
    ft, VectorField, VectorAlgorithm, VectorFieldAttributesHnsw,
    VectorType, DistanceMetricType, NumericField,
    FtCreateOptions, DataType,
)

async def create_cache_index(client):
    existing = await ft.list(client)
    if b"cache_idx" in existing:
        return

    hnsw = VectorFieldAttributesHnsw(
        dimensions=1536,
        distance_metric=DistanceMetricType.COSINE,
        type=VectorType.FLOAT32,
    )
    await ft.create(
        client, "cache_idx",
        schema=[
            NumericField("$.created_at", alias="created_at"),
            VectorField("$.embedding", VectorAlgorithm.HNSW,
                        alias="embedding", attributes=hnsw),
        ],
        options=FtCreateOptions(data_type=DataType.JSON, prefixes=["llmcache:"]),
    )

Step 2: The Cache Lookup

import hashlib
import json
import struct
import time
from glide import glide_json, FtSearchOptions

SIMILARITY_THRESHOLD = 0.90  # Tune this: higher = stricter matching

async def cache_lookup(client, query_embedding):
    """Check if a semantically similar query was already answered."""
    vec_bytes = struct.pack(f"<{len(query_embedding)}f", *query_embedding)

    result = await ft.search(
        client, "cache_idx",
        "(*)==>[KNN 1 @embedding $vec AS score]",
        options=FtSearchOptions(params={"vec": vec_bytes}),
    )

    if result and len(result) >= 2 and result[1]:
        for key, fields in result[1].items():
            score = 1.0 - float(fields[b"score"])
            if score >= SIMILARITY_THRESHOLD:
                doc = json.loads(fields[b"$"])
                return {"hit": True, "response": doc["response"], "score": score}

    return {"hit": False}
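
Two details of the lookup are easy to get wrong: the query vector is sent as raw little-endian float32 bytes, and the returned score is treated as a cosine distance that is converted to a similarity with `1.0 - distance`. The sketch below isolates both steps; the helper names are hypothetical.

```python
import struct


def pack_vector(vec):
    """Serialize floats as little-endian float32, 4 bytes per dimension."""
    return struct.pack(f"<{len(vec)}f", *vec)


def unpack_vector(buf):
    """Inverse of pack_vector, useful for debugging stored bytes."""
    return list(struct.unpack(f"<{len(buf) // 4}f", buf))


def distance_to_similarity(distance):
    """Convert a cosine distance (0 = identical) to a similarity (1 = identical)."""
    return 1.0 - float(distance)


vec = [0.5, -0.25, 1.0]  # values chosen to be exactly representable in float32
buf = pack_vector(vec)

print(len(buf))                         # 4 bytes per dimension
print(unpack_vector(buf))
print(distance_to_similarity(b"0.05"))  # scores may arrive as bytes
```

A 1536-dimension embedding therefore packs to 6,144 bytes. Values that are not exactly representable in float32 lose a little precision on the round trip, which is normally harmless for similarity search.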

Step 3: Store a Cache Entry

async def cache_store(client, query, response, embedding, ttl=3600):
    """Cache an LLM response with its query embedding."""
    cache_id = hashlib.md5(query.encode()).hexdigest()[:12]
    key = f"llmcache:{cache_id}"

    doc = {
        "query": query,
        "response": response,
        "embedding": embedding,
        "created_at": time.time(),
    }
    await glide_json.set(client, key, "$", json.dumps(doc))
    await client.expire(key, ttl)  # Auto-expire stale cache
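
Because the index in Step 1 is declared with 1536 dimensions, it can help to validate an embedding before writing it. The following is a minimal sketch, assuming a hypothetical `build_cache_entry` helper that prepares the key and JSON payload without touching Valkey; the key derivation mirrors `cache_store` above.

```python
import hashlib
import json
import time

EMBEDDING_DIMS = 1536  # must match the `dimensions` used when creating the index


def build_cache_entry(query, response, embedding, dims=EMBEDDING_DIMS):
    """Return (key, json_payload) for a cache entry, rejecting wrong-sized vectors."""
    if len(embedding) != dims:
        raise ValueError(f"expected {dims} dimensions, got {len(embedding)}")

    # MD5 is used only to derive a short key, not for security.
    cache_id = hashlib.md5(query.encode()).hexdigest()[:12]
    key = f"llmcache:{cache_id}"
    doc = {
        "query": query,
        "response": response,
        "embedding": embedding,
        "created_at": time.time(),
    }
    return key, json.dumps(doc)


key, payload = build_cache_entry("What is Valkey?", "Valkey is ...", [0.0] * 1536)
print(key, len(payload))
```

The key depends only on the query text, so storing the same query again overwrites the earlier entry, while each distinct paraphrase gets its own key.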

Step 4: The Complete Cache-Aware Chat Function

async def chat_with_cache(client, user_message):
    # 1. Embed the query
    embedding = get_embedding(user_message)  # your embedding call here; must return 1536 floats to match the index

    # 2. Check semantic cache
    cached = await cache_lookup(client, embedding)
    if cached["hit"]:
        print(f"⚡ Cache HIT (similarity: {cached['score']:.3f})")
        return cached["response"]

    # 3. Cache miss - call the LLM
    print("🔄 Cache MISS - calling LLM...")
    response = call_llm(user_message)  # your LLM call here

    # 4. Store in cache for next time
    await cache_store(client, user_message, response, embedding)

    return response


# First call: cache miss, calls LLM (~2 seconds)
await chat_with_cache(client, "What is Valkey?")
# 🔄 Cache MISS - calling LLM...

# Second call: cache hit, instant (~3ms)
await chat_with_cache(client, "Can you explain Valkey?")
# ⚡ Cache HIT (similarity: 0.953)
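
To try the control flow without a Valkey server or an API key, the same logic can be exercised with in-memory stand-ins. Everything below is an assumption for illustration: the toy embeddings are invented, `InMemorySemanticCache` replaces the index with a linear scan, and `call_llm` just echoes its input.

```python
import math

SIMILARITY_THRESHOLD = 0.90

# Hypothetical toy embeddings; a real embedding model would produce these.
TOY_EMBEDDINGS = {
    "What is Valkey?": [0.12, -0.45, 0.80],
    "Can you explain Valkey?": [0.10, -0.40, 0.85],
    "How do I bake bread?": [0.90, 0.30, -0.10],
}

llm_calls = []  # records which messages actually reached the fake LLM


def get_embedding(text):
    """Stand-in for an embedding model: looks up a toy vector."""
    return TOY_EMBEDDINGS[text]


def call_llm(text):
    """Stand-in for an LLM call: records the call and echoes the input."""
    llm_calls.append(text)
    return f"(answer to: {text})"


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class InMemorySemanticCache:
    """Stand-in for the Valkey index: a list of entries scanned linearly."""

    def __init__(self):
        self.entries = []  # (embedding, response) pairs

    def lookup(self, embedding):
        best_score, best_response = -1.0, None
        for cached_embedding, response in self.entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        if best_response is not None and best_score >= SIMILARITY_THRESHOLD:
            return {"hit": True, "response": best_response, "score": best_score}
        return {"hit": False}

    def store(self, embedding, response):
        self.entries.append((embedding, response))


def chat_with_cache(cache, user_message):
    embedding = get_embedding(user_message)
    cached = cache.lookup(embedding)
    if cached["hit"]:
        return cached["response"]
    response = call_llm(user_message)
    cache.store(embedding, response)
    return response


cache = InMemorySemanticCache()
first = chat_with_cache(cache, "What is Valkey?")
second = chat_with_cache(cache, "Can you explain Valkey?")
third = chat_with_cache(cache, "How do I bake bread?")
print(llm_calls)
```

The paraphrase reuses the first answer, so only the first and third messages reach the fake LLM.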

Tuning the Similarity Threshold

Threshold	Behavior	Use case
`0.98`	Near-exact match only	Factual queries where precision matters
`0.92`	Same intent, different wording	General chatbot (recommended start)
`0.85`	Broadly similar topics	FAQ-style bots with limited topics

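Start at 0.92 (the code above uses a slightly looser 0.90) and adjust based on your false-positive rate: raise the threshold if users receive answers to questions they did not ask, and lower it if obvious paraphrases keep missing the cache.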
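
One way to measure the false-positive rate is to label a sample of (similarity, was the cached answer acceptable?) pairs and sweep the threshold over them. The sketch below assumes such a labeled sample exists; the eight pairs shown are invented placeholders, not measurements.

```python
def hit_rate(pairs, threshold):
    """Fraction of lookups that would be served from cache at this threshold."""
    return sum(1 for sim, _ in pairs if sim >= threshold) / len(pairs)


def false_positive_rate(pairs, threshold):
    """Among cache hits, the fraction whose cached answer was not acceptable."""
    hits = [ok for sim, ok in pairs if sim >= threshold]
    if not hits:
        return 0.0
    return sum(1 for ok in hits if not ok) / len(hits)


# Hypothetical labeled sample: (similarity score, cached answer was acceptable).
labeled_pairs = [
    (0.99, True), (0.95, True), (0.93, True), (0.91, False),
    (0.88, True), (0.86, False), (0.80, False), (0.70, False),
]

for threshold in (0.98, 0.92, 0.85):
    print(
        threshold,
        f"hit rate={hit_rate(labeled_pairs, threshold):.3f}",
        f"false positives={false_positive_rate(labeled_pairs, threshold):.3f}",
    )
```

Lowering the threshold raises the hit rate but admits more wrong answers; the right trade-off depends on how costly a wrong cached answer is for your application.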

Cost Impact

Assuming a 40% cache hit rate (plausible for repetitive workloads such as customer support):

Metric	Without cache	With semantic cache
LLM calls / 1000 requests	1,000	600
Avg latency	~2,000ms	~1,200ms
Cost (at $0.01/call)	$10.00	$6.00
Cache lookup overhead	-	~3ms
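
The table follows from simple arithmetic: every request pays the cache lookup, and only misses pay for the LLM call. The sketch below reproduces the numbers under that simplification; `cache_impact` is a hypothetical helper, and the cost and latency of computing the embedding itself are left out.

```python
def cache_impact(requests, hit_rate, cost_per_call, llm_latency_ms, lookup_ms):
    """Return (llm_calls, total_cost, avg_latency_ms) for a given cache hit rate."""
    miss_rate = 1.0 - hit_rate
    llm_calls = round(requests * miss_rate)
    total_cost = llm_calls * cost_per_call
    # Every request does a lookup; only misses wait for the LLM.
    avg_latency_ms = lookup_ms + miss_rate * llm_latency_ms
    return llm_calls, total_cost, avg_latency_ms


calls, cost, latency = cache_impact(
    requests=1000, hit_rate=0.40, cost_per_call=0.01,
    llm_latency_ms=2000, lookup_ms=3,
)
print(f"{calls} LLM calls, ${cost:.2f}, ~{latency:.0f}ms average latency")
```

With a hit rate of zero the cache only adds the lookup overhead, so the break-even hit rate is very low as long as lookups stay in the millisecond range.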

Valkey Commands Reference

Operation	Command	Latency
Cache lookup	`FT.SEARCH cache_idx KNN 1`	~1-3ms
Cache store	`JSON.SET llmcache:{id} $ '{...}'`	~0.2ms
Set TTL	`EXPIRE llmcache:{id} 3600`	~0.1ms
Invalidate entry	`DEL llmcache:{id}`	~0.1ms
Drop the index (cached keys remain until they expire or are deleted)	`FT.DROPINDEX cache_idx`	~1ms

Next up: We've covered conversation history, session management, semantic memory, and caching. In the final cookbook, we'll add agent state - checkpointing multi-step reasoning and logging tool calls with Valkey Streams.
