Advanced · Python · ~25 min

Scaling for Production

Handle millions of vectors with clustering, replication, and memory optimization.

Memory Estimation

Plan your infrastructure based on vector count and dimensions:

Vectors    Dimensions    ~Memory (HNSW)
100K       1536          ~1 GB
1M         1536          ~10 GB
10M        1536          ~100 GB
100M       1536          ~1 TB (cluster)

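Formula: vectors × dimensions × 4 bytes (one FLOAT32 component) × 1.5 (approximate HNSW graph overhead). For 1M vectors: 1,000,000 × 1536 × 4 × 1.5 ≈ 9.2 GB. The estimate covers the vectors only; text and metadata stored alongside them add to the total.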
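The rule of thumb above is easy to script when sizing nodes. A minimal sketch, assuming the same 1.5 overhead multiplier and decimal gigabytes; estimate_index_bytes, to_gb, and the BYTES_PER_COMPONENT table are illustrative helpers, not part of any client library:

```python
# Illustrative lookup: bytes per vector component for each index TYPE
BYTES_PER_COMPONENT = {"FLOAT32": 4, "FLOAT16": 2}


def estimate_index_bytes(num_vectors: int, dim: int, dtype: str = "FLOAT32",
                         overhead: float = 1.5) -> float:
    """Approximate HNSW index size: vectors x dims x bytes x overhead.

    Counts vector storage only; other hash fields are not included.
    """
    return num_vectors * dim * BYTES_PER_COMPONENT[dtype] * overhead


def to_gb(num_bytes: float) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_bytes / 10**9


if __name__ == "__main__":
    # Reproduce the rows of the sizing table for 1536-dim embeddings
    for n in (100_000, 1_000_000, 10_000_000, 100_000_000):
        size = estimate_index_bytes(n, 1536)
        print(f"{n:>11,} vectors -> ~{to_gb(size):,.1f} GB")
```

For 1M vectors at 1536 dimensions it gives about 9.2 GB, in line with the ~10 GB row; passing dtype="FLOAT16" halves the figure.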

Cluster Mode

Distribute data across multiple nodes for horizontal scaling:

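# Docker Compose for a 3-node cluster (local testing)
# Cluster nodes advertise their own IP and port to each other and to clients,
# which breaks behind Docker port mapping (e.g. 7001:6379). Each node therefore
# uses host networking and listens on its own port (host networking is fully
# supported on Linux).
# "command" replaces the image's default startup command; confirm the search
# module still loads (MODULE LIST) before creating indexes.
version: '3.8'
services:
  valkey-1:
    image: valkey/valkey-bundle:latest
    network_mode: host
    command: valkey-server --port 7001 --cluster-enabled yes --cluster-node-timeout 5000

  valkey-2:
    image: valkey/valkey-bundle:latest
    network_mode: host
    command: valkey-server --port 7002 --cluster-enabled yes --cluster-node-timeout 5000

  valkey-3:
    image: valkey/valkey-bundle:latest
    network_mode: host
    command: valkey-server --port 7003 --cluster-enabled yes --cluster-node-timeout 5000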
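
# Create the cluster (run from the host once all three nodes are up).
# --cluster-replicas 0 gives three primaries and no failover; for high
# availability, run six nodes and use --cluster-replicas 1.
valkey-cli --cluster create \
  127.0.0.1:7001 127.0.0.1:7002 127.0.0.1:7003 \
  --cluster-replicas 0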
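
Cluster mode assigns every key to one of 16,384 hash slots, and each primary owns a range of slots; that is how the three nodes split the data. The sketch below reimplements the slot formula from the cluster specification (CRC16, XMODEM variant, of the key or of its {hash tag}, modulo 16384) so you can see how doc: keys spread out. crc16_xmodem and key_hash_slot are our own illustrative helpers; against a live node, CLUSTER KEYSLOT returns the same number.

```python
NUM_SLOTS = 16384


def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and initial value 0 (XMODEM variant)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            # Shift left; XOR in the polynomial when the top bit falls off
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_hash_slot(key: bytes) -> int:
    """Return the hash slot for a key, honouring {hash tags}."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        # Only a non-empty tag between the first '{' and the next '}' counts
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    return crc16_xmodem(key) % NUM_SLOTS


if __name__ == "__main__":
    for k in (b"doc:1", b"doc:2", b"{tenant42}:doc:1", b"{tenant42}:doc:2"):
        print(k.decode(), "->", key_hash_slot(k))
```

Keys that share a hash tag, like the two {tenant42} keys, always land in the same slot, which multi-key commands in cluster mode require.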

Replication for HA

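Point each replica at the primary in its valkey.conf (primary.host is a placeholder for your primary's hostname):

# valkey.conf on each replica
replicaof primary.host 6379
replica-read-only yes

Then send writes to the primary and search traffic to replicas. Replication is asynchronous, so a newly written document may not be searchable on a replica immediately:

import valkey

# Connect to the primary for writes and a replica for read scaling
primary = valkey.Valkey(host='primary', port=6379)
replica = valkey.Valkey(host='replica', port=6379)

# key, data, and query come from your ingestion and search code
primary.hset(key, mapping=data)
results = replica.ft('idx').search(query)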
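
With several replicas, a small wrapper can keep the write/read split in one place. A minimal sketch; ReadWriteRouter is a hypothetical helper, not part of valkey-py, and it simply rotates reads across whatever client objects you pass in:

```python
import itertools
from typing import Any, Optional, Sequence


class ReadWriteRouter:
    """Hypothetical helper: writes go to the primary, reads rotate over replicas.

    It only chooses which client object to return, so it works with any
    client (for example valkey.Valkey instances) and is easy to test.
    """

    def __init__(self, primary: Any, replicas: Sequence[Any] = ()) -> None:
        self.primary = primary
        self.replicas = list(replicas)
        # Endless round-robin iterator, or None when there are no replicas
        self._next_replica: Optional[Any] = (
            itertools.cycle(self.replicas) if self.replicas else None
        )

    def for_write(self) -> Any:
        """All writes go to the primary."""
        return self.primary

    def for_read(self) -> Any:
        """Next replica in turn, falling back to the primary if none exist."""
        if self._next_replica is None:
            return self.primary
        return next(self._next_replica)
```

For example, router.for_write().hset(key, mapping=data) and router.for_read().ft('idx').search(query). The sketch does not detect failed or lagging replicas; that is usually left to Sentinel or a cluster-aware client.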

Memory Optimization

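Cap memory and choose an eviction policy in valkey.conf:

# valkey.conf
maxmemory 8gb
# Evict only keys that have a TTL (e.g. cache entries). Documents stored
# without a TTL are never evicted; once nothing is evictable, writes fail
# with an out-of-memory error instead.
maxmemory-policy volatile-lru

# Compact (listpack) encoding for small hashes; the older names
# hash-max-ziplist-entries / hash-max-ziplist-value are accepted as aliases.
# This is a compact encoding, not compression. A 1536-dim FLOAT32 embedding
# is 6,144 bytes, well above the 64-byte value limit, so hashes that hold
# embeddings keep the regular encoding regardless.
hash-max-listpack-entries 512
hash-max-listpack-value 64

FLOAT16 stores each vector component in 2 bytes instead of 4, halving vector memory at a slight accuracy cost. Check that your search module version accepts TYPE FLOAT16 (some releases accept only FLOAT32), and encode the embeddings you write as 16-bit floats to match:

valkey-cli FT.CREATE idx ON HASH PREFIX 1 "doc:" \
  SCHEMA embedding VECTOR HNSW 6 \
    TYPE FLOAT16 \
    DIM 1536 \
    DISTANCE_METRIC COSINE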
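
Because the blob stored in each hash must match the index TYPE (DIM × 4 bytes for FLOAT32, DIM × 2 bytes for FLOAT16), the encoding step changes along with the schema. A sketch using only the standard library's struct module (format codes f and e); it assumes little-endian byte order, which is what numpy's tobytes() produces on common x86 and ARM machines, and pack_vector / unpack_vector are illustrative names:

```python
import struct
from typing import List, Sequence

# struct format codes: 'f' = 32-bit float, 'e' = 16-bit (half-precision) float
FORMAT_CODES = {"FLOAT32": "f", "FLOAT16": "e"}


def pack_vector(values: Sequence[float], dtype: str = "FLOAT32") -> bytes:
    """Encode a vector as little-endian bytes for a HASH vector field."""
    code = FORMAT_CODES[dtype]
    return struct.pack(f"<{len(values)}{code}", *values)


def unpack_vector(blob: bytes, dtype: str = "FLOAT32") -> List[float]:
    """Decode bytes produced by pack_vector back into Python floats."""
    code = FORMAT_CODES[dtype]
    count = len(blob) // struct.calcsize(code)
    return list(struct.unpack(f"<{count}{code}", blob))


if __name__ == "__main__":
    embedding = [0.1, -0.25, 0.333]
    half = pack_vector(embedding, "FLOAT16")
    print(len(pack_vector(embedding)), len(half))  # byte sizes: FLOAT32 vs FLOAT16
    print(unpack_vector(half, "FLOAT16"))  # values rounded to half precision
```

The query vector sent with FT.SEARCH must use the same encoding as the stored vectors.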

Connection Pooling

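import valkey
import valkey.asyncio

# Share one pool per process; decode_responses=False keeps binary
# vector blobs as raw bytes instead of decoding them as UTF-8
pool = valkey.ConnectionPool(
    host='localhost',
    port=6379,
    max_connections=50,
    decode_responses=False
)
client = valkey.Valkey(connection_pool=pool)

# Async equivalent with its own connection pool
async_pool = valkey.asyncio.ConnectionPool.from_url(
    "redis://localhost:6379",
    max_connections=50
)
async_client = valkey.asyncio.Valkey(connection_pool=async_pool)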
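
A plain (non-blocking) pool in redis-py-style clients raises a connection error once more than max_connections commands need a connection at the same time, so it helps to cap concurrency at the pool size when fanning out many async searches. A minimal sketch with asyncio.Semaphore; bounded_gather is a hypothetical helper, and its worker would typically wrap a call on your async Valkey client. redis-py, which valkey-py is forked from, also provides a BlockingConnectionPool that waits for a free connection instead; check that your valkey-py version exposes it.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def bounded_gather(
    items: Iterable[T],
    worker: Callable[[T], Awaitable[R]],
    limit: int = 50,
) -> List[R]:
    """Run worker(item) for every item, with at most `limit` in flight.

    Results come back in input order, like asyncio.gather.
    """
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item: T) -> R:
        # Wait for a free slot before issuing the request
        async with semaphore:
            return await worker(item)

    return list(await asyncio.gather(*(run_one(item) for item in items)))
```

Match limit to the pool's max_connections, or keep it a little lower if other tasks share the same pool.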

AWS ElastiCache / MemoryDB

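import valkey
from valkey.cluster import ValkeyCluster

# Connect to a cluster-mode-disabled ElastiCache endpoint
# (my-cluster.cache.amazonaws.com is a placeholder hostname).
# ssl=True requires in-transit encryption (TLS) to be enabled on the cluster.
client = valkey.Valkey(
    host='my-cluster.cache.amazonaws.com',
    port=6379,
    ssl=True,
    ssl_cert_reqs='required'
)

# For cluster mode (and MemoryDB, which always runs in cluster mode), connect
# to the configuration endpoint (ElastiCache) or cluster endpoint (MemoryDB);
# the client discovers the individual shards from there
cluster = ValkeyCluster(
    host='my-cluster.cache.amazonaws.com',
    port=6379,
    ssl=True
)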
