Token-aware, cost-based, and hierarchical rate limiting for production AI workloads. Protect your APIs, control spend, and keep your LLM integrations reliable.
6 step-by-step guides from basic limiting to production-grade multi-tier systems
Fixed-window rate limiting in under 5 minutes
Track LLM token consumption alongside request counts
Per-agent and per-conversation rate limits for AI workloads
User → Organization → Global limit tiers with atomic enforcement
Track and limit by dollar cost per model and per user
Distributed limiting, monitoring, failover, and observability
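The User → Organization → Global tiers can be illustrated with a minimal in-memory sketch (class and parameter names here are hypothetical; a production deployment would run the same check-then-increment step atomically, e.g. inside a Redis Lua script):

```python
# Hypothetical in-memory sketch of hierarchical limit tiers.
# Production code would enforce this atomically (e.g. a Redis Lua script)
# so concurrent requests cannot race past a tier's cap.

class TieredLimiter:
    def __init__(self, user_limit, org_limit, global_limit):
        self.limits = {"user": user_limit, "org": org_limit, "global": global_limit}
        self.counts = {}  # (tier, name) -> current count

    def allow(self, user, org):
        keys = [("user", user), ("org", org), ("global", "*")]
        # Check every tier first, so a denial leaves no partial increments.
        for tier, name in keys:
            if self.counts.get((tier, name), 0) >= self.limits[tier]:
                return False
        for key in keys:
            self.counts[key] = self.counts.get(key, 0) + 1
        return True
```

A request is admitted only if it fits under all three tiers; denying before any increment keeps the counters consistent.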
Test all 4 algorithms in real time: fixed window, sliding window, token bucket, and leaky bucket
INCR rl:{user}:{window}
EXPIRE rl:{user}:{window} {window_sec}
One key per window. Atomic increment + TTL. ~0.2ms.
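The same fixed-window logic can be sketched in plain Python without a Redis server (the class and clock parameter are illustrative; in Redis, INCR plus EXPIRE on the per-window key does this atomically):

```python
import time

# In-memory sketch of fixed-window limiting. The (user, window_id) key
# mirrors the Redis key rl:{user}:{window}; the TTL is implicit because
# a new window id yields a fresh counter.
class FixedWindow:
    def __init__(self, limit, window_sec, clock=time.time):
        self.limit, self.window_sec, self.clock = limit, window_sec, clock
        self.counts = {}  # (user, window_id) -> count

    def allow(self, user):
        window_id = int(self.clock() // self.window_sec)
        key = (user, window_id)
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key] <= self.limit
```

The injectable `clock` makes the window boundary testable; the trade-off of this algorithm is a possible burst of up to 2x the limit across a window boundary.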
ZADD rl:{user} {ts} {id}
ZREMRANGEBYSCORE ...
ZCARD rl:{user}
Sorted set per user. Prune old events. Exact count.
HSET bucket tokens {n}
HSET bucket last_refill {ts}
EVALSHA lua_script
Lua script for atomic refill + consume. Burst-friendly.
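The refill-then-consume step the Lua script performs against the `tokens` and `last_refill` hash fields can be sketched in Python (a single-process illustration; the Lua script exists precisely to make this read-modify-write atomic under concurrency):

```python
import time

# In-memory sketch of a token bucket: refill based on elapsed time,
# cap at capacity (burst-friendly), then try to consume.
class TokenBucket:
    def __init__(self, capacity, refill_per_sec, clock=time.time):
        self.capacity, self.rate, self.clock = capacity, refill_per_sec, clock
        self.tokens = capacity          # hash field: tokens
        self.last_refill = self.clock() # hash field: last_refill

    def allow(self, cost=1):
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A full bucket absorbs bursts up to `capacity`, while the sustained rate is bounded by `refill_per_sec`.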
INCRBY rl:{user}:tokens {n}
INCR rl:{user}:requests
EXPIRE rl:{user}:tokens {ttl}
EXPIRE rl:{user}:requests {ttl}
Track both requests AND tokens. Ideal for LLM APIs.
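The dual counters can be sketched as follows (class and parameter names are illustrative; in Redis this maps to INCR on `rl:{user}:requests` and INCRBY on `rl:{user}:tokens` within the same window):

```python
import time

# In-memory sketch of token-aware limiting: each request is checked
# against both a request-count budget and an LLM token budget per window.
class TokenAwareLimiter:
    def __init__(self, max_requests, max_tokens, window_sec, clock=time.time):
        self.max_requests, self.max_tokens = max_requests, max_tokens
        self.window_sec, self.clock = window_sec, clock
        self.counts = {}  # (user, window_id) -> (requests, tokens)

    def allow(self, user, tokens):
        window_id = int(self.clock() // self.window_sec)
        key = (user, window_id)
        req, tok = self.counts.get(key, (0, 0))
        # Deny if either budget would be exceeded; never partially increment.
        if req + 1 > self.max_requests or tok + tokens > self.max_tokens:
            return False
        self.counts[key] = (req + 1, tok + tokens)
        return True
```

This lets a single expensive completion exhaust the token budget even when the request count is far below its cap, which is the failure mode plain request limiting misses for LLM APIs.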
Full Python implementation with 5 algorithms, FastAPI integration, Redis/Valkey client, and Docker Compose.