
Rate Limiting and Throttling

Control request flow to protect shared resources and prevent abuse

TL;DR

Rate limiting controls request flow proactively, preventing resource exhaustion before it happens. The token bucket algorithm refills tokens at a fixed rate; each request consumes a token, and requests are rejected when no tokens are available. Unlike load shedding (reactive), rate limiting is proactive. Throttling is client-side rate limiting (voluntary); rate limiting is server-side enforcement. Strategies include per-user quotas, per-IP limits, global thresholds, and adaptive limits based on system health. Essential for public APIs, multi-tenant systems, and preventing both intentional abuse and accidental DDoS.

Learning Objectives

  • Understand token bucket, sliding window, and leaky bucket algorithms
  • Design appropriate rate limits for different request types
  • Implement distributed rate limiting (across multiple servers)
  • Choose between per-user, per-IP, and global rate limits
  • Handle burst traffic while maintaining fairness
  • Communicate limits to clients via headers and status codes

Motivating Scenario

A public API for weather data limits each user to 100 requests/minute. User A makes 50 requests/minute consistently. User B suddenly makes 200 requests in 30 seconds (a burst). Without rate limiting, all 250 requests reach the backend within the minute and it saturates. With per-user token bucket rate limiting (100-token capacity, refilled at 100 tokens/minute), User A never hits the limit; User B's burst drains its bucket after roughly 100 requests and the rest are rejected until tokens refill. Burst traffic is absorbed without letting one user starve another. After 30 seconds, User B has a few refilled tokens and can make a few more requests, but cannot exceed the sustained limit.

Core Concepts

Rate Limiting Algorithms

Token Bucket: Simple, handles bursts well. Best for: APIs with expected burstiness (image uploads, batch processing). Clients can make rapid requests if within capacity.

Sliding Window: Precise count-based limits. Best for: strict quotas (100 API calls per hour). More memory overhead.
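A minimal sliding-window-log sketch (the class and parameter names are illustrative, not from any particular library): it keeps one timestamp per accepted request and counts only the timestamps still inside the rolling window, which gives exact counts at the cost of memory proportional to the limit.

import time
from collections import deque

class SlidingWindowLogLimiter:
    """Allow at most `limit` requests in any rolling `window_seconds` interval."""
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = deque()  # one timestamp per accepted request (the memory cost)

    def try_consume(self):
        now = time.time()
        # Evict timestamps that have slid out of the window
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False

# e.g. a strict quota of 100 API calls per hour, counted exactly
hourly = SlidingWindowLogLimiter(limit=100, window_seconds=3600)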

Leaky Bucket: Smooths traffic to a constant output rate. Best for: protecting backends from sudden spikes. Adds latency (requests wait in a queue).
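For contrast, a simplified leaky-bucket sketch (illustrative; the drain rate and queue size are assumptions): requests join a bounded queue and a background worker drains them at a fixed rate, so the backend sees a smooth stream while queued requests pay the extra latency mentioned above.

import queue
import threading
import time

class LeakyBucket:
    """Queue incoming work and drain it at a fixed rate (requests per second)."""
    def __init__(self, drain_rate, max_queue=100):
        self.interval = 1.0 / drain_rate
        self.q = queue.Queue(maxsize=max_queue)
        threading.Thread(target=self._drain, daemon=True).start()

    def submit(self, handler):
        """Enqueue a request handler; reject immediately if the bucket overflows."""
        try:
            self.q.put_nowait(handler)
            return True
        except queue.Full:
            return False  # overflow: shed the request instead of queueing forever

    def _drain(self):
        # Background worker: forward one queued request per interval
        while True:
            handler = self.q.get()      # blocks until work arrives
            handler()                   # hand the request to the backend
            time.sleep(self.interval)   # enforce the constant output rate

# e.g. the backend sees at most 10 requests/second no matter how bursty arrivals are
bucket = LeakyBucket(drain_rate=10)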

Practical Example

import time
from collections import defaultdict
from threading import Lock

class TokenBucketRateLimiter:
    def __init__(self, capacity, refill_rate):
        """
        capacity: max tokens in bucket
        refill_rate: tokens per second
        """
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity
        self.last_refill = time.time()
        self.lock = Lock()

    def try_consume(self, tokens=1):
        """Try to consume tokens; return (allowed, message)."""
        with self.lock:
            self._refill()

            if self.tokens >= tokens:
                self.tokens -= tokens
                return True, "Request allowed"
            else:
                wait_time = (tokens - self.tokens) / self.refill_rate
                return False, f"Rate limited. Retry after {wait_time:.1f}s"

    def _refill(self):
        """Refill tokens based on elapsed time"""
        now = time.time()
        elapsed = now - self.last_refill
        tokens_to_add = elapsed * self.refill_rate
        self.tokens = min(self.capacity, self.tokens + tokens_to_add)
        self.last_refill = now

    def get_status(self):
        with self.lock:
            self._refill()
            return {"tokens": self.tokens, "capacity": self.capacity}

class PerUserRateLimiter:
    def __init__(self, default_capacity=100, default_rate=10):
        self.limiters = defaultdict(
            lambda: TokenBucketRateLimiter(default_capacity, default_rate)
        )
        self.lock = Lock()

    def check_limit(self, user_id, tokens=1):
        """Check if user can make request"""
        with self.lock:
            limiter = self.limiters[user_id]
            return limiter.try_consume(tokens)

    def set_user_limit(self, user_id, capacity, rate):
        """Set custom limit for user (e.g., premium tier)"""
        with self.lock:
            self.limiters[user_id] = TokenBucketRateLimiter(capacity, rate)

class DistributedRateLimiter:
    """Combines per-IP and global limits. Note: state here is per-process;
    truly distributed limiting needs shared state (see Patterns and Pitfalls)."""
    def __init__(self):
        self.ip_limiters = defaultdict(
            lambda: TokenBucketRateLimiter(capacity=100, refill_rate=10)
        )
        self.global_limiter = TokenBucketRateLimiter(capacity=10000, refill_rate=1000)

    def check_limit(self, client_ip):
        """Check both IP and global limits"""
        # Check global limit first (cheap check)
        global_ok, global_msg = self.global_limiter.try_consume()
        if not global_ok:
            return False, "Service rate limit exceeded (429)"

        # Check per-IP limit
        ip_ok, ip_msg = self.ip_limiters[client_ip].try_consume()
        if not ip_ok:
            return False, f"IP rate limit exceeded: {ip_msg}"

        return True, "Request allowed"

# Example usage
limiter = PerUserRateLimiter(default_capacity=100, default_rate=10)

# Simulate requests: user-1 sends 50 (within its bucket),
# user-2 sends 150 (exceeding its 100-token bucket, so later requests are rejected)
for i in range(200):
    user = "user-1" if i < 50 else "user-2"
    allowed, msg = limiter.check_limit(user)
    if not allowed:
        print(f"Request {i} ({user}): {msg}")

print("User-1 status:", limiter.limiters["user-1"].get_status())
print("User-2 status:", limiter.limiters["user-2"].get_status())

When to Use vs. When NOT to Use

Use Rate Limiting
  1. Public APIs with unknown clients
  2. Multi-tenant systems with quotas
  3. Protecting critical backend resources
  4. Preventing DDoS attacks (intentional or accidental)
  5. Fair resource sharing among users/services
Avoid Rate Limiting
  1. Trusted internal traffic whose volume is already well understood and bounded by the callers
  2. Flows where rejecting or delaying requests is unacceptable; size capacity and use backpressure instead

Patterns and Pitfalls

  • Communicate limits in responses: return X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset headers so clients know exactly when they can retry. Better UX than guessing.
  • Tiered limits: free tier 100 req/day, pro tier 10K req/day, enterprise unlimited. Different limits for different SLAs keep monetization and resource management aligned.
  • Pitfall: uncoordinated limiters. Each of 5 services runs its own rate limiter; a user's traffic looks like 20 requests/minute to each, every check passes, and the system absorbs 100 req/min and overloads. Solution: centralized rate-limit state (Redis) or coordination between limiters (see the sketch after this list).
  • Pitfall: limits set too low. Legitimate users hit limits during peak usage, which is a poor experience. Monitor actual usage and set limits at roughly 2x average plus headroom, not 1x.
  • Burst allowance: a token bucket whose capacity exceeds its refill rate lets a user with a 10 req/sec sustained quota and a 100-token bucket briefly burst well past the sustained rate (up to a full bucket at once), handling spiky traffic without raising overall capacity.
  • Pitfall: client-side-only throttling. A malicious client simply ignores it. Always enforce limits server-side; client-side throttling is an optimization only.
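A minimal sketch of the shared-state approach using Redis (the key scheme, limits, and client setup are illustrative assumptions, not from the text): every server increments the same per-user counter, so a user's combined traffic across all services is what gets compared against the limit. This is a fixed-window counter rather than a token bucket, coarser at window edges, but the coordination idea is the same.

import time
import redis  # assumes a reachable Redis instance; host and port below are placeholders

r = redis.Redis(host="localhost", port=6379)

def check_user_limit(user_id, limit=100, window_seconds=60):
    """Fixed-window counter shared by every server that talks to this Redis."""
    # All servers increment the same key, so the count is global, not per-process
    window = int(time.time() // window_seconds)
    key = f"ratelimit:{user_id}:{window}"
    pipe = r.pipeline()
    pipe.incr(key)                        # atomic shared increment
    pipe.expire(key, window_seconds * 2)  # old windows clean themselves up
    count, _ = pipe.execute()
    return count <= limit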

Design Review Checklist

  • Rate limits are specified per dimension (user, IP, global)
  • Limits are based on actual capacity testing, not guesses
  • Burst capacity (token bucket size) is defined per tier
  • Refill rate allows sustainable traffic within SLA
  • HTTP 429 Too Many Requests is returned with a Retry-After header
  • X-RateLimit-* headers are included in all responses (a client-side sketch that consumes these follows this checklist)
  • Distributed rate limiting uses shared state (Redis, Memcached)
  • Rate limit thresholds have monitoring and alerting
  • Rate limits and tiers are documented for API consumers and client libraries
  • Fallback strategy exists if rate limiter fails (fail open vs. closed?)
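For the two header items above, a hedged client-side sketch (the requests library and URL handling are assumptions, and Retry-After is assumed to be given in seconds): it backs off for as long as the server asks before retrying. This is the voluntary throttling the pitfalls section warns must never replace server-side enforcement.

import time
import requests  # assumed HTTP client; any client exposing status codes and headers works

def get_with_throttle(url, max_attempts=5):
    """Call an API and honor 429 responses by waiting for Retry-After before retrying."""
    for _ in range(max_attempts):
        resp = requests.get(url)
        if resp.status_code != 429:
            return resp
        # Server-communicated backoff; assumes Retry-After is in seconds (fall back to 1s)
        wait = float(resp.headers.get("Retry-After", 1))
        remaining = resp.headers.get("X-RateLimit-Remaining")
        print(f"429 received (remaining={remaining}); retrying in {wait:.0f}s")
        time.sleep(wait)
    raise RuntimeError("still rate limited after retries")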

Self-Check

  • Can you explain token bucket vs. sliding window algorithms?
  • How does your system handle burst traffic?
  • What happens if the rate limiter itself goes down?
  • How do you coordinate rate limits across multiple servers?
  • What limits do you return to clients, and how?

Next Steps

  1. Load Shedding: Read Load Shedding and Backpressure ↗️ for reactive overload handling
  2. Bulkhead Isolation: Learn Bulkhead Isolation ↗️ for resource-level isolation
  3. Circuit Breaker: Read Circuit Breaker ↗️ to handle dependency failures

References