Token-aware, cost-based, and hierarchical rate limiting for production AI workloads. Protect your APIs, control spend, and keep your LLM integrations reliable.
6 step-by-step guides from basic limiting to production-grade multi-tier systems
Fixed-window rate limiting in under 5 minutes
Track LLM token consumption alongside request counts
Per-agent and per-conversation rate limits for AI workloads
User → Organization → Global limit tiers with atomic enforcement
Track and limit by dollar cost per model and per user
Distributed limiting, monitoring, failover, and observability
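The User → Organization → Global tiers can be illustrated with a minimal in-memory sketch (class and parameter names here are hypothetical; a production deployment would run the same check-then-increment step atomically, e.g. inside a Redis Lua script):

```python
# Hypothetical in-memory sketch of hierarchical limit tiers.
# Production code would enforce this atomically (e.g. a Redis Lua script)
# so concurrent requests cannot race past a tier's cap.

class TieredLimiter:
    def __init__(self, user_limit, org_limit, global_limit):
        self.limits = {"user": user_limit, "org": org_limit, "global": global_limit}
        self.counts = {}  # (tier, name) -> current count

    def allow(self, user, org):
        keys = [("user", user), ("org", org), ("global", "*")]
        # Check every tier first, so a denial leaves no partial increments.
        for tier, name in keys:
            if self.counts.get((tier, name), 0) >= self.limits[tier]:
                return False
        for key in keys:
            self.counts[key] = self.counts.get(key, 0) + 1
        return True
```

A request is admitted only if it fits under all three tiers; denying before any increment keeps the counters consistent.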
Test all 4 algorithms in real time: fixed window, sliding window, token bucket, and leaky bucket
INCR rl:{user}:{window}
EXPIRE rl:{user}:{window} {window_sec}
One key per window. Atomic increment + TTL. ~0.2ms.
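The same fixed-window logic can be sketched in plain Python without a Redis server (the class and clock parameter are illustrative; in Redis, INCR plus EXPIRE on the per-window key does this atomically):

```python
import time

# In-memory sketch of fixed-window limiting. The (user, window_id) key
# mirrors the Redis key rl:{user}:{window}; the TTL is implicit because
# a new window id yields a fresh counter.
class FixedWindow:
    def __init__(self, limit, window_sec, clock=time.time):
        self.limit, self.window_sec, self.clock = limit, window_sec, clock
        self.counts = {}  # (user, window_id) -> count

    def allow(self, user):
        window_id = int(self.clock() // self.window_sec)
        key = (user, window_id)
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key] <= self.limit
```

The injectable `clock` makes the window boundary testable; the trade-off of this algorithm is a possible burst of up to 2x the limit across a window boundary.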
ZADD rl:{user} {ts} {id}
ZREMRANGEBYSCORE ...
ZCARD rl:{user}
Sorted set per user. Prune old events. Exact count.
HSET bucket tokens {n}
HSET bucket last_refill {ts}
EVALSHA lua_script
Lua script for atomic refill + consume. Burst-friendly.
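The refill-then-consume step the Lua script performs against the `tokens` and `last_refill` hash fields can be sketched in Python (a single-process illustration; the Lua script exists precisely to make this read-modify-write atomic under concurrency):

```python
import time

# In-memory sketch of a token bucket: refill based on elapsed time,
# cap at capacity (burst-friendly), then try to consume.
class TokenBucket:
    def __init__(self, capacity, refill_per_sec, clock=time.time):
        self.capacity, self.rate, self.clock = capacity, refill_per_sec, clock
        self.tokens = capacity          # hash field: tokens
        self.last_refill = self.clock() # hash field: last_refill

    def allow(self, cost=1):
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A full bucket absorbs bursts up to `capacity`, while the sustained rate is bounded by `refill_per_sec`.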
INCRBY rl:{user}:tokens {n}
INCR rl:{user}:requests
EXPIRE rl:{user}:tokens {ttl}
EXPIRE rl:{user}:requests {ttl}
Track both requests AND tokens. Ideal for LLM APIs.
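The dual counters can be sketched as follows (class and parameter names are illustrative; in Redis this maps to INCR on `rl:{user}:requests` and INCRBY on `rl:{user}:tokens` within the same window):

```python
import time

# In-memory sketch of token-aware limiting: each request is checked
# against both a request-count budget and an LLM token budget per window.
class TokenAwareLimiter:
    def __init__(self, max_requests, max_tokens, window_sec, clock=time.time):
        self.max_requests, self.max_tokens = max_requests, max_tokens
        self.window_sec, self.clock = window_sec, clock
        self.counts = {}  # (user, window_id) -> (requests, tokens)

    def allow(self, user, tokens):
        window_id = int(self.clock() // self.window_sec)
        key = (user, window_id)
        req, tok = self.counts.get(key, (0, 0))
        # Deny if either budget would be exceeded; never partially increment.
        if req + 1 > self.max_requests or tok + tokens > self.max_tokens:
            return False
        self.counts[key] = (req + 1, tok + tokens)
        return True
```

This lets a single expensive completion exhaust the token budget even when the request count is far below its cap, which is the failure mode plain request limiting misses for LLM APIs.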
Full Python implementation with 5 algorithms, FastAPI integration, Redis/Valkey client, and Docker Compose.