
Rate Limiting
for LLM APIs

Token-aware, cost-based, and hierarchical rate limiting for production AI workloads. Protect your APIs, control spend, and keep your LLM integrations reliable.

LLM APIs · Cost Control · Sliding Window · Token Bucket · Multi-tier Limits

Cookbooks

6 step-by-step guides from basic limiting to production-grade multi-tier systems

Live Demo

Test all four algorithms in real time: fixed window, sliding window, token bucket, and leaky bucket.

How Valkey Powers Rate Limiting

FIXED WINDOW
INCR rl:{user}:{window}
EXPIRE key {window_sec}

One key per window. Atomic increment + TTL. ~0.2ms.

SLIDING WINDOW
ZADD rl:{user} {ts} {id}
ZREMRANGEBYSCORE ...
ZCARD key

Sorted set per user. Prune old events. Exact count.
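The same prune-then-count flow can be modeled in Python. A plain list of timestamps stands in for the sorted set: filtering old entries plays the role of ZREMRANGEBYSCORE, and the length check plays the role of ZCARD (this is a sketch of the logic, not the Valkey client):

```python
import time

class SlidingWindowLimiter:
    """In-memory sketch of the sorted-set sliding-window pattern."""

    def __init__(self, limit: int, window_sec: int):
        self.limit = limit
        self.window_sec = window_sec
        self.events = {}  # user -> list of event timestamps (the "sorted set")

    def allow(self, user: str, now: float | None = None) -> bool:
        now = time.time() if now is None else now
        cutoff = now - self.window_sec
        # ZREMRANGEBYSCORE: drop events that fell out of the window
        ts = [t for t in self.events.get(user, []) if t > cutoff]
        if len(ts) >= self.limit:  # ZCARD: exact count inside the window
            self.events[user] = ts
            return False
        ts.append(now)             # ZADD: record this request
        self.events[user] = ts
        return True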

TOKEN BUCKET
HSET bucket tokens {n}
HSET bucket last_refill {ts}
EVALSHA lua_script

Lua script for atomic refill + consume. Burst-friendly.
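The refill-and-consume logic that the Lua script runs atomically server-side can be sketched as a single Python method (a model of the algorithm, not the actual script; in Valkey the EVALSHA call guarantees no other client interleaves between refill and consume):

```python
class TokenBucket:
    """In-memory sketch of the atomic refill + consume token bucket."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity          # HSET bucket tokens {n}
        self.last_refill = 0.0          # HSET bucket last_refill {ts}

    def consume(self, n: float = 1, now: float = 0.0) -> bool:
        # Refill based on elapsed time, capped at capacity
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        # Consume if enough tokens remain
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

Because a full bucket can be drained at once, short bursts are allowed while the long-run rate stays bounded by `refill_rate`.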

TOKEN-AWARE
INCRBY rl:{user}:tokens {n}
INCR rl:{user}:requests
EXPIRE key {ttl}

Track both requests AND tokens. Ideal for LLM APIs.
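The dual-counter idea can be sketched as one limiter that checks both budgets before admitting a request (an in-memory model; the two list slots stand in for the `rl:{user}:requests` and `rl:{user}:tokens` keys sharing a TTL window):

```python
class TokenAwareLimiter:
    """In-memory sketch of request-count + LLM-token-count limiting."""

    def __init__(self, max_requests: int, max_tokens: int, window_sec: int):
        self.max_requests = max_requests
        self.max_tokens = max_tokens
        self.window_sec = window_sec
        self.counters = {}  # (user, window id) -> [requests, llm tokens]

    def allow(self, user: str, tokens: int, now: float = 0.0) -> bool:
        window = int(now // self.window_sec)
        c = self.counters.setdefault((user, window), [0, 0])
        # Reject if either budget would be exceeded
        if c[0] + 1 > self.max_requests or c[1] + tokens > self.max_tokens:
            return False
        c[0] += 1          # INCR rl:{user}:requests
        c[1] += tokens     # INCRBY rl:{user}:tokens {n}
        return True
```

A client making few but very large completions hits the token budget first; a chatty client making tiny calls hits the request budget first, which is exactly the behavior LLM APIs need.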

Complete source code on GitHub

Full Python implementation with 5 algorithms, FastAPI integration, a Redis/Valkey client, and Docker Compose.