Advanced | Python | ~25 min

Production Context Management

Short-term vs long-term memory, context pruning, multi-user isolation, monitoring, and the right Valkey data structure for each context type.

Short-term vs Long-term Memory

The memory layer is the foundation of context engineering at scale. Valkey supports both memory types through TTL management, giving short-term keys a short TTL and long-term keys a long TTL or none:

import valkey
import json
import time

client = valkey.Valkey(host="localhost", port=6379, decode_responses=True)

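# Short-term memory: active session context (short TTL)
def store_short_term(session_id: str, key: str, value: str, ttl: int = 1800):
    """Store session-scoped context; each write resets the TTL of the whole session hash."""
    session_key = f"session:{session_id}"
    # Queue HSET and EXPIRE together so the hash is never left without a TTL
    pipe = client.pipeline()
    pipe.hset(session_key, key, value)
    pipe.expire(session_key, ttl)
    pipe.execute()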

# Long-term memory: persists across sessions (no TTL or high TTL)
def store_long_term(user_id: str, key: str, value: str):
    """Store cross-session user knowledge that persists indefinitely."""
    client.hset(f"memory:{user_id}", key, value)
    # No EXPIRE - this persists

# Example
store_short_term("sess_100", "current_topic", "billing")
store_short_term("sess_100", "escalation_level", "0")
store_long_term("alice", "communication_style", "prefers concise answers")
store_long_term("alice", "timezone", "America/Los_Angeles")

Memory Type	TTL	Valkey Key Pattern	Use Case
Short-term (session)	30 min	`session:{session_id}`	Current task state, tool outputs
Short-term (chat)	30 min	`chat:{session_id}`	Conversation messages
Long-term (user)	None	`memory:{user_id}`	Preferences, past interactions
Long-term (summary)	30 days	`summary:{user_id}:{date}`	Session summaries for future reference
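
The `store_short_term` function above resets the TTL only on writes. If "inactivity" should also cover reads, one option is to slide the expiry forward whenever the session is read. A minimal sketch, assuming a valkey-py-style client is passed in; `read_short_term` is a hypothetical helper name, not part of any library:

```python
def read_short_term(client, session_id: str, ttl: int = 1800) -> dict:
    """Read session context and slide its expiry forward (hypothetical helper).

    `client` is assumed to expose hgetall() and expire() as valkey-py does.
    """
    key = f"session:{session_id}"
    data = client.hgetall(key)
    if data:
        # Skip the extra round trip when the session has already expired.
        client.expire(key, ttl)
    return data
```

With this in place, a session stays alive as long as it is either read or written within the TTL window.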

Context Pruning

Not everything belongs in the context window. Prune aggressively:

def prune_old_messages(session_id: str, max_messages: int = 20):
    """Keep only the most recent messages."""
    client.ltrim(f"chat:{session_id}", -max_messages, -1)

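def summarize_and_store(user_id: str, session_id: str, summary: str):
    """After a session ends, store a summary for long-term recall.

    The key holds one summary per user per day, so a later session on the
    same day overwrites the earlier summary.
    """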
    date = time.strftime("%Y-%m-%d")
    client.hset(f"summary:{user_id}:{date}", mapping={
        "session_id": session_id,
        "summary": summary,
        "timestamp": str(time.time()),
    })
    client.expire(f"summary:{user_id}:{date}", 86400 * 30)  # 30 days

# Prune chat to last 20 messages
prune_old_messages("sess_100")

# Store session summary for future context
summarize_and_store("alice", "sess_100", 
    "User asked about billing. Resolved a refund request for order ORD-12345.")

Key insight from Ankur Goyal (Braintrust): "Good context engineering caches well. Bad context engineering is both slow and expensive."
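
Trimming by message count ignores message length, so a few long messages can still overflow the context window. The sketch below trims by an estimated token budget instead, dropping the oldest messages first. The four-characters-per-token estimate is a rough assumption for illustration only; a real tokenizer for the target model would give accurate counts. `estimate_tokens` and `trim_to_budget` are hypothetical names.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def trim_to_budget(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep the newest messages whose estimated tokens fit within max_tokens.

    Each message is assumed to be a dict with a "content" string.
    """
    kept: list[dict] = []
    used = 0
    # Walk from newest to oldest so the oldest messages are dropped first.
    for message in reversed(messages):
        cost = estimate_tokens(message["content"])
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

The function stops at the first message that does not fit rather than skipping it, which keeps the retained history contiguous.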

Multi-User Context Isolation

In production, one user's context must never leak into another's. Scope every key by user ID or session ID:

def get_user_context(user_id: str, session_id: str) -> dict:
    """Get all context for a specific user, properly isolated."""
    return {
        "memory": client.hgetall(f"memory:{user_id}"),
        "session": client.hgetall(f"session:{session_id}"),
        "history": [
            json.loads(m) for m in 
            client.lrange(f"chat:{session_id}", -10, -1)
        ],
    }

# Alice's context is completely separate from Bob's
alice_ctx = get_user_context("alice", "sess_alice_001")
bob_ctx = get_user_context("bob", "sess_bob_001")
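
Because `session:{session_id}` and `chat:{session_id}` do not contain the user ID, isolation in `get_user_context` relies on callers passing a session that actually belongs to that user. One way to enforce this is to record the owner when the session is created and verify it before every read. A minimal sketch, assuming a valkey-py-style client created with `decode_responses=True`; the `owner` field and both function names are hypothetical:

```python
def bind_session(client, user_id: str, session_id: str, ttl: int = 1800) -> None:
    """Record which user owns a session (hypothetical `owner` field)."""
    key = f"session:{session_id}"
    client.hset(key, "owner", user_id)
    client.expire(key, ttl)


def assert_session_owner(client, user_id: str, session_id: str) -> None:
    """Raise PermissionError unless the session is bound to user_id.

    Assumes hget() returns str (decode_responses=True); with bytes
    responses the comparison would need a decode step.
    """
    owner = client.hget(f"session:{session_id}", "owner")
    if owner != user_id:
        raise PermissionError(
            f"session {session_id!r} is not bound to user {user_id!r}"
        )
```

Calling `assert_session_owner(client, user_id, session_id)` before `get_user_context` turns a mismatched user and session pair into an explicit error. Unknown or expired sessions are rejected too, since their `owner` field is missing.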

Monitoring Context Quality

Track how your context engineering system performs:

def record_context_metrics(session_id: str, metrics: dict):
    """Track context assembly metrics."""
    client.hset(f"metrics:context:{session_id}", mapping={
        "sources_used": str(metrics.get("sources_used", 0)),
        "total_tokens": str(metrics.get("total_tokens", 0)),
        "assembly_time_ms": str(metrics.get("assembly_time_ms", 0)),
        "pruned_messages": str(metrics.get("pruned_messages", 0)),
        "timestamp": str(time.time()),
    })
    client.expire(f"metrics:context:{session_id}", 86400 * 7)  # 7 days

# After assembling context
record_context_metrics("sess_100", {
    "sources_used": 4,
    "total_tokens": 2850,
    "assembly_time_ms": 3.2,
    "pruned_messages": 5,
})

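# Aggregate metrics
def get_avg_assembly_time(pattern: str = "metrics:context:*", sample_size: int = 100) -> float:
    """Calculate average context assembly time over a sample of sessions."""
    times = []
    # SCAN iterates incrementally; KEYS would block the server on a large keyspace.
    # SCAN order is arbitrary, so this is a sample, not the most recent sessions.
    for key in client.scan_iter(match=pattern, count=100):
        if len(times) >= sample_size:
            break
        t = client.hget(key, "assembly_time_ms")
        if t:
            times.append(float(t))
    return sum(times) / len(times) if times else 0.0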
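
The `assembly_time_ms` value has to be measured somewhere. One option is to wrap the assembly call in a monotonic timer and build part of the metrics dict from the result. A sketch, assuming the assembly function returns a dict keyed by context source; `timed_assembly` is a hypothetical helper, and token counting is left to the caller:

```python
import time
from typing import Callable


def timed_assembly(assemble: Callable[[], dict]) -> tuple[dict, dict]:
    """Run an assembly callable and return (context, metrics).

    `assemble` is any zero-argument function returning a dict whose keys
    are context sources (an assumption for this sketch).
    """
    start = time.perf_counter()
    context = assemble()
    elapsed_ms = (time.perf_counter() - start) * 1000
    metrics = {
        # Count only sources that actually returned data.
        "sources_used": sum(1 for value in context.values() if value),
        "assembly_time_ms": round(elapsed_ms, 2),
    }
    return context, metrics
```

For example, `ctx, metrics = timed_assembly(lambda: get_user_context("alice", "sess_100"))` produces a dict that can be passed to `record_context_metrics`.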

Data Structure Reference

Context Type	Valkey Type	Key Pattern	TTL	Why This Structure
Agent config	Hash	`agent:config:{id}`	None	Structured key-value pairs
Chat history	List	`chat:{session}`	30 min	Ordered, appendable, trimmable
Session state	Hash	`session:{session}`	30 min	Fast field-level access
Tool outputs	Hash	`tool:{session}:step_{n}`	1 hour	Per-step structured data
User memory	Hash	`memory:{user_id}`	None	Persistent preferences
Session summaries	Hash	`summary:{user}:{date}`	30 days	Compressed long-term recall
KB embeddings	Hash + Vector	`kb:doc:{id}`	None	Semantic search via FT.SEARCH (requires a search module such as valkey-search)
Context metrics	Hash	`metrics:context:{session}`	7 days	Performance monitoring
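
Key patterns and TTLs from the table can be kept in one place in code so they do not drift apart across call sites. A minimal sketch that mirrors a few rows above; `CONTEXT_TYPES`, `build_key`, and the rule rejecting `:` inside IDs are assumptions made for illustration:

```python
from typing import Optional

# (key pattern, TTL in seconds, or None for no expiry), copied from the table.
CONTEXT_TYPES = {
    "chat": ("chat:{session}", 30 * 60),
    "session": ("session:{session}", 30 * 60),
    "memory": ("memory:{user_id}", None),
    "summary": ("summary:{user}:{date}", 30 * 86400),
    "metrics": ("metrics:context:{session}", 7 * 86400),
}


def build_key(context_type: str, **ids: str) -> tuple[str, Optional[int]]:
    """Return (key, ttl_seconds) for a context type.

    Rejects empty IDs and IDs containing ':' so that one ID cannot spill
    into another key segment.
    """
    pattern, ttl = CONTEXT_TYPES[context_type]
    for name, value in ids.items():
        if not value or ":" in value:
            raise ValueError(f"invalid {name}: {value!r}")
    return pattern.format(**ids), ttl
```

A caller would then write `key, ttl = build_key("session", session="sess_100")` and apply `EXPIRE` only when `ttl` is not `None`.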

Production Checklist

Area	Recommendation
Memory types	Use short-term (TTL) + long-term (persistent) memory
Pruning	LTRIM chat to last 20-50 messages per session
Isolation	Key patterns must include user_id or session_id
Token budget	Count tokens before each LLM call, trim from oldest
Monitoring	Track sources_used, total_tokens, assembly_time_ms
Summarization	Summarize sessions on close for long-term recall
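Eviction	Set `maxmemory-policy allkeys-lru` for graceful degradation; note that it can also evict long-term keys without a TTL, whereas `volatile-lru` evicts only keys that have one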
Freshness	Use EXPIRE to auto-clean stale context

Reference: Based on best practices from the Redis context engineering blog, drawing on insights from Andrej Karpathy, Lance Martin (LangChain), and Salvatore Sanfilippo (Redis founder).
