Production Patterns
Similarity threshold tuning, cache hit rate monitoring, eviction strategies, TTL management, and cost tracking for production semantic caches.
Pattern 1: Similarity Threshold Tuning
The threshold controls the trade-off between hit rate and answer quality:
| Threshold (COSINE distance) | Hit Rate | Quality Risk | Best For |
|---|---|---|---|
| 0.05 (very strict) | Low (~20%) | Very low | Medical, legal, financial |
| 0.15 (balanced) | Medium (~50%) | Low | General chatbots |
| 0.30 (relaxed) | High (~70%) | Medium | FAQ bots, support |
| 0.50 (very relaxed) | Very high (~85%) | High - stale answers | Not recommended |
# Test different thresholds to find the right balance
def evaluate_threshold(test_pairs: list, threshold: float):
    """Evaluate cache quality at a given threshold."""
    hits = 0
    false_positives = 0
    for query, expected_similar in test_pairs:
        result = semantic_cache_lookup(query)  # lookup pattern from earlier in the guide
        # COSINE distance: lower score = more similar, so a hit is score below threshold
        if result["hit"] and result["score"] < threshold:
            hits += 1
            # A false positive is a hit on a query that should NOT have matched
            if not expected_similar:
                false_positives += 1
    hit_rate = hits / len(test_pairs)
    fp_rate = false_positives / max(1, hits)
    print(f"Threshold {threshold}: hit_rate={hit_rate:.1%}, false_positive_rate={fp_rate:.1%}")
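Re-querying the cache for every candidate threshold is wasteful. One way to tune more cheaply is to record each test query's best-match distance once, then sweep all candidate thresholds offline over those recorded pairs. A minimal sketch (function and parameter names are mine, not from a library):

```python
# Hypothetical offline sweep: collect (distance, expected_similar) pairs once,
# then evaluate any number of candidate thresholds without touching the cache.

def sweep_thresholds(scored_pairs: list, thresholds: list) -> dict:
    """For each threshold, compute hit rate and false-positive rate.

    scored_pairs: (distance, expected_similar) per test query, where
    distance is the best match's COSINE distance (lower = more similar).
    """
    results = {}
    for t in thresholds:
        hits = [(d, ok) for d, ok in scored_pairs if d <= t]
        fps = sum(1 for _, ok in hits if not ok)
        results[t] = {
            "hit_rate": len(hits) / len(scored_pairs),
            "fp_rate": fps / max(1, len(hits)),
        }
    return results
```

Plotting hit rate against false-positive rate across the sweep makes the knee of the trade-off curve easy to spot.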
Pattern 2: Cache Hit Rate Monitoring
import valkey
from datetime import datetime

client = valkey.Valkey(host="localhost", port=6379)

def record_cache_event(event_type: str):
    """Track cache hits and misses using atomic counters."""
    client.incr(f"cache:metrics:{event_type}")
    # Also track hourly for time-series analysis
    hour_key = datetime.now().strftime("%Y%m%d%H")
    counter_key = f"cache:metrics:{event_type}:{hour_key}"
    client.incr(counter_key)
    client.expire(counter_key, 86400 * 7)  # Keep 7 days
def get_cache_stats() -> dict:
    """Get current cache performance metrics."""
    hits = int(client.get("cache:metrics:hit") or 0)
    misses = int(client.get("cache:metrics:miss") or 0)
    total = hits + misses
    hit_rate = hits / total if total > 0 else 0
    # Estimate cost savings (GPT-4: ~$0.03/1K tokens, avg 500 tokens/request)
    avg_cost_per_call = 0.015
    savings = hits * avg_cost_per_call
    return {
        "total_requests": total,
        "hits": hits,
        "misses": misses,
        "hit_rate": round(hit_rate, 3),
        "estimated_savings_usd": round(savings, 2),
    }
# Usage in ask_with_cache:
# if cache_hit: record_cache_event("hit")
# else: record_cache_event("miss")
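The hourly counters above become useful once turned into a time series, e.g. to spot a hit-rate drop after a deployment. A small sketch of that aggregation, assuming the hourly counts have already been fetched (for instance via `MGET` over the `cache:metrics:{hit,miss}:{hour}` keys); the helper itself is pure:

```python
# Sketch: turn hourly hit/miss counts (keyed by "%Y%m%d%H") into hit rates.

def hourly_hit_rates(hits_by_hour: dict, misses_by_hour: dict) -> dict:
    """Map each hour key to its hit rate, skipping hours with no traffic."""
    rates = {}
    for hour in sorted(set(hits_by_hour) | set(misses_by_hour)):
        h = hits_by_hour.get(hour, 0)
        m = misses_by_hour.get(hour, 0)
        if h + m:
            rates[hour] = round(h / (h + m), 3)
    return rates
```

A sustained drop in the series usually means the corpus shifted (new topics, new phrasing) and the threshold or TTLs need revisiting.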
Pattern 3: TTL Strategies
# Strategy 1: Fixed TTL - simple, predictable
client.expire(cache_key, 3600)  # 1 hour

# Strategy 2: Category-based TTL
TTL_MAP = {
    "factual": 86400,      # 24h - facts don't change fast
    "opinion": 3600,       # 1h - opinions evolve
    "real-time": 300,      # 5 min - stock prices, weather
    "conversation": 1800,  # 30 min - chat context
}
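Wiring the category map in is a one-line lookup with a fallback for prompts the classifier cannot place; a minimal sketch (the default value is my assumption, and the map is repeated here only to keep the snippet self-contained):

```python
# Sketch: pick a TTL by category, falling back to a default for unknown ones.
TTL_MAP = {
    "factual": 86400,
    "opinion": 3600,
    "real-time": 300,
    "conversation": 1800,
}
DEFAULT_TTL = 3600  # assumption: treat uncategorized prompts like general chat

def ttl_for(category: str) -> int:
    return TTL_MAP.get(category, DEFAULT_TTL)
```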
# Strategy 3: Sliding TTL - reset on each hit
def cache_hit_with_refresh(cache_key: str, ttl: int = 3600):
    """On cache hit, refresh the TTL to keep popular entries alive."""
    response = client.hget(cache_key, "response")
    client.expire(cache_key, ttl)  # Reset TTL
    return response
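A pure sliding TTL has a failure mode: an entry that stays popular never expires, even after the underlying answer goes stale. A common refinement is to cap the total lifetime, so refreshes never extend an entry past a hard deadline. A sketch, assuming the entry's creation time is stored alongside it (e.g. as a hash field); names are illustrative:

```python
# Sketch: sliding TTL with a hard cap, so even a hot entry expires
# within max_lifetime seconds of its creation.

def capped_refresh_ttl(created_at: float, now: float,
                       ttl: int = 3600, max_lifetime: int = 86400) -> int:
    """TTL to set on a hit: the usual sliding ttl, but never extending
    past created_at + max_lifetime. Returns 0 when the cap is reached."""
    remaining = int(created_at + max_lifetime - now)
    return max(0, min(ttl, remaining))
```

On a hit you would then call `client.expire(cache_key, capped_refresh_ttl(created_at, time.time()))` instead of resetting the TTL unconditionally.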
Pattern 4: Memory Management
# Set maxmemory policy for cache eviction
# In valkey.conf or via CONFIG SET:
# maxmemory 1gb
# maxmemory-policy allkeys-lru
#
# LRU = Least Recently Used - evicts least-accessed cache entries first
# This is ideal for semantic caching where popular queries should stay
# Check memory usage
info = client.info("memory")
used_mb = info["used_memory"] / (1024 * 1024)
print(f"Memory: {used_mb:.1f} MB")
# Estimate cache capacity
# Each entry: ~6 KB for the vector alone (1536 dims * 4 bytes = 6,144 bytes),
# plus prompt and response text
# 1 GB / ~6 KB per entry ≈ 170,000 cached entries (fewer once text is included)
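That back-of-envelope math generalizes to other embedding sizes and payloads; a small estimator, where the text and overhead sizes are rough assumptions you should replace with real measurements (e.g. `MEMORY USAGE <key>` on a sample of entries):

```python
# Sketch: rough capacity estimate for a semantic cache under a memory budget.
# avg_text_bytes and overhead_bytes are assumptions - measure in practice.

def estimated_capacity(budget_gb: float, dims: int = 1536,
                       avg_text_bytes: int = 2048, overhead_bytes: int = 512) -> int:
    """Entries that fit: FLOAT32 vector (4 bytes/dim) + text + per-key overhead."""
    entry_bytes = dims * 4 + avg_text_bytes + overhead_bytes
    return int(budget_gb * 1024**3 // entry_bytes)
```

Note the index itself (HNSW graph links) also consumes memory on top of the entries, so treat the result as an upper bound.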
Pattern 5: Cache Invalidation
# Invalidate specific cached entries
def invalidate_by_topic(topic_keyword: str):
    """Remove cached entries matching a topic (e.g., after a data update)."""
    results = client.execute_command(
        "FT.SEARCH", "cache_idx",
        f"@prompt:{topic_keyword}",
        "NOCONTENT",         # Only return keys, not fields
        "LIMIT", 0, 10000,   # FT.SEARCH returns only 10 results by default
    )
    if results[0] > 0:
        keys = results[1:]
        client.delete(*keys)  # Delete all matched keys in one round trip
        print(f"Invalidated {len(keys)} cached entries for '{topic_keyword}'")
# Example: product info changed, invalidate related cache
invalidate_by_topic("pricing")
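When an invalidation matches thousands of keys, a single large `DEL` can stall the server. A gentler variant, sketched below, batches the keys and uses `UNLINK` so memory is reclaimed asynchronously; the chunking helper is pure and the batch size is an assumption to tune:

```python
# Sketch: delete matched keys in batches with UNLINK (non-blocking reclaim)
# instead of one large DEL.

def chunked(items: list, size: int = 500) -> list:
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def unlink_keys(client, keys: list, batch: int = 500) -> int:
    """Unlink keys in batches; returns the number of keys removed."""
    removed = 0
    for group in chunked(keys, batch):
        removed += client.unlink(*group)  # server reclaims memory asynchronously
    return removed
```

Inside `invalidate_by_topic`, `client.delete(*keys)` would then become `unlink_keys(client, keys)`.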
Production Checklist
| Area | Recommendation |
|---|---|
| Threshold | Start at 0.15 (COSINE), tune with A/B testing |
| TTL | 1h for general, 24h for facts, 5min for real-time |
| Monitoring | Track hit rate, latency, cost savings hourly |
| Memory | Set maxmemory-policy allkeys-lru |
| Invalidation | Use TEXT search to find and delete stale entries |
| Isolation | TAG filters for multi-tenant deployments |
| Embeddings | Use text-embedding-3-small (fast, cheap, 1536 dims) |
| Index | HNSW with COSINE for OpenAI/Bedrock embeddings |
Reference: This pattern is described in the official AWS documentation for ElastiCache semantic caching use cases, and was featured in the AWS re:Invent session on semantic caching for multi-turn agents with ElastiCache.