Intermediate · Python · ~20 min

LLM Caching

The Problem

LLM API calls are expensive ($0.003–$0.10+ per call) and slow (2–10 seconds). Users often send identical or near-identical prompts, so many of those calls repeat work the model has already done.

Exact-match caching eliminates redundant calls. (For meaning-based matching, see Guide 03.)

Upstream Contribution: The ValkeyCache integration was contributed to langchain-ai/langchain-aws in PR #717 by the Valkey team.

Step 1: Initialize ValkeyCache

from langgraph_checkpoint_aws import ValkeyCache
from valkey import Valkey

# Create Valkey client
valkey_client = Valkey.from_url(
    "valkey://localhost:6379",
    decode_responses=False,
)

# Initialize cache with 1-hour TTL
cache = ValkeyCache(
    client=valkey_client,
    prefix="llm_cache:",
    ttl=3600,
)

Step 2: Cache Key Generation

Generate deterministic keys from prompt + model + temperature so different model configs don't collide:

import hashlib

def cache_key(prompt: str, model: str = "claude-sonnet", temp: float = 0.7) -> tuple:
    content = f"{model}|temp={temp}|{prompt.strip()}"
    key = hashlib.sha256(content.encode()).hexdigest()[:16]
    return (("llm_responses",), key)
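A quick sanity check exercises the helper's determinism and collision behavior (the helper is restated here so the snippet runs standalone):

```python
import hashlib

def cache_key(prompt: str, model: str = "claude-sonnet", temp: float = 0.7) -> tuple:
    # Same helper as above: model + temperature + normalized prompt
    content = f"{model}|temp={temp}|{prompt.strip()}"
    return (("llm_responses",), hashlib.sha256(content.encode()).hexdigest()[:16])

# Identical inputs -> identical keys (sha256 is deterministic)
assert cache_key("What is Valkey?") == cache_key("What is Valkey?")

# .strip() normalizes padding, so padded prompts still hit the cache
assert cache_key("  What is Valkey?  ") == cache_key("What is Valkey?")

# A different temperature (or model) yields a different key - no collision
assert cache_key("What is Valkey?", temp=0.0) != cache_key("What is Valkey?")
```

Truncating the digest to 16 hex characters keeps keys short while leaving collisions vanishingly unlikely for cache-sized keyspaces.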

Step 3: Cached Inference Function

import time
from langchain_aws import ChatBedrockConverse
from langchain_core.messages import HumanMessage

model = ChatBedrockConverse(
    model="us.anthropic.claude-sonnet-4-20250514-v1:0",
    region_name="us-west-2",
)

async def cached_llm_call(prompt: str) -> dict:
    key = cache_key(prompt)

    # 1. Check cache
    start = time.time()
    cached = await cache.aget([key])
    cache_time = time.time() - start

    if key in cached:
        return {
            "response": cached[key]["response"],
            "cached": True,
            "latency_ms": cache_time * 1000,
        }

    # 2. Cache miss - call LLM
    start = time.time()
    response = await model.ainvoke([HumanMessage(content=prompt)])
    llm_time = time.time() - start

    # 3. Store in cache
    await cache.aset({key: ({"response": response.content}, 3600)})

    return {
        "response": response.content,
        "cached": False,
        "latency_ms": llm_time * 1000,
    }

Step 4: Benchmark - The Dramatic Difference

import asyncio

async def benchmark():
    prompt = "What is Amazon ElastiCache for Valkey?"

    # First call - cache miss
    r1 = await cached_llm_call(prompt)
    print(f"❌ Cache MISS: {r1['latency_ms']:.0f}ms")

    # Second call - cache hit
    r2 = await cached_llm_call(prompt)
    print(f"✅ Cache HIT:  {r2['latency_ms']:.1f}ms")

    speedup = r1["latency_ms"] / r2["latency_ms"]
    print(f"🚀 Speedup:   {speedup:.0f}x faster")

asyncio.run(benchmark())

# Output:
# ❌ Cache MISS: 4200ms
# ✅ Cache HIT:  1.2ms
# 🚀 Speedup:   3500x faster
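Plugging the benchmark numbers into a simple expected-latency model shows why even modest hit rates matter. The 30% hit rate and $0.01/call price below are illustrative assumptions, not measurements:

```python
# Miss/hit latencies from the benchmark above; hit rate and price are assumed
miss_ms, hit_ms = 4200.0, 1.2
hit_rate = 0.30          # assumed fraction of duplicate prompts
price_per_call = 0.01    # assumed cost of one LLM call, in dollars

avg_ms = hit_rate * hit_ms + (1 - hit_rate) * miss_ms
saved_per_1k_calls = hit_rate * 1000 * price_per_call

print(f"avg latency: {avg_ms:.0f}ms")                 # 2940ms vs 4200ms uncached
print(f"saved per 1k calls: ${saved_per_1k_calls:.2f}")  # $3.00
```

Because a hit is effectively free next to a miss, average latency scales almost linearly with the miss rate: every extra point of hit rate shaves ~42ms off the average here.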

Step 5: TTL Management

# Default TTL (set at cache creation)
cache = ValkeyCache(client=valkey_client, ttl=3600)  # 1 hour

# Custom TTL per entry
await cache.aset({key: (data, 300)})  # 5 minutes for volatile data

# Clear all cached entries
await cache.aclear()
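The per-entry TTL overrides the default set at construction. The semantics can be sketched with a minimal in-memory stand-in (a toy illustration of lazy expiry, not the real ValkeyCache):

```python
import time

class TTLCacheSketch:
    """Toy exact-match cache illustrating default vs per-entry TTLs."""

    def __init__(self, default_ttl=3600):
        self.default_ttl = default_ttl
        self._store = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl=None):
        # A per-entry ttl wins; otherwise fall back to the default
        expires_at = time.monotonic() + (ttl if ttl is not None else self.default_ttl)
        self._store[key] = (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:  # lazy expiry: evict on read
            del self._store[key]
            return None
        return value

cache = TTLCacheSketch(default_ttl=3600)
cache.set("stable", "cached answer")            # lives ~1 hour
cache.set("volatile", "stock price", ttl=0.05)  # 50 ms, for the demo
time.sleep(0.1)
print(cache.get("stable"))    # cached answer (still fresh)
print(cache.get("volatile"))  # None (expired)
```

Valkey handles expiry server-side via the `EX` option on `SET`, so unlike this sketch, expired keys are reclaimed even if never read again.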

Valkey Commands Fired:

# Cache lookup
GET llm_cache:a1b2c3d4e5f6a7b8

# Cache store
SET llm_cache:a1b2c3d4e5f6a7b8 '{"response":"..."}' EX 3600

# Cache clear
SCAN 0 MATCH llm_cache:* COUNT 100
DEL llm_cache:a1b2c3d4e5f6a7b8 ...
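The clear path uses `SCAN` with a `MATCH` pattern rather than `KEYS`, walking the keyspace in batches so the server is never blocked, and deleting only matching keys. The matching logic can be sketched in plain Python (`fnmatch` approximates Valkey's `MATCH` glob syntax; the keyspace below is a made-up example):

```python
from fnmatch import fnmatchcase

# A toy dict standing in for the Valkey keyspace
keyspace = {
    "llm_cache:3f9c0d4e5a6b7c8d": '{"response":"..."}',
    "llm_cache:0011223344556677": '{"response":"..."}',
    "session:user42": "...",  # unrelated key - must survive the clear
}

def clear_prefix(db: dict, pattern: str) -> int:
    """Delete keys matching a glob pattern, like SCAN MATCH + DEL."""
    victims = [k for k in db if fnmatchcase(k, pattern)]
    for k in victims:
        del db[k]
    return len(victims)

deleted = clear_prefix(keyspace, "llm_cache:*")
print(deleted)           # 2
print(sorted(keyspace))  # ['session:user42']
```

The `prefix` passed to ValkeyCache is what makes this scoped deletion possible: only keys under `llm_cache:` are touched, so other data sharing the same Valkey instance is safe.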

Next Steps

Exact-match caching is powerful but misses paraphrased queries. Next, we'll add semantic search to match by meaning.

Next: 03 Semantic Search with ValkeyStore →