Rate Limiting and Throttling
Control request flow to protect shared resources and prevent abuse
TL;DR
Rate limiting controls request flow proactively, preventing resource exhaustion before it happens; load shedding, by contrast, reacts only once overload has already occurred. The token bucket algorithm refills tokens at a fixed rate and each request consumes a token; requests are rejected when no tokens are available. Throttling is client-side rate limiting (voluntary); rate limiting is server-side enforcement. Strategies include per-user quotas, per-IP limits, global thresholds, and adaptive limits based on system health. Rate limiting is essential for public APIs, multi-tenant systems, and preventing both intentional abuse and accidental DDoS.
Learning Objectives
- Understand token bucket, sliding window, and leaky bucket algorithms
- Design appropriate rate limits for different request types
- Implement distributed rate limiting (across multiple servers)
- Choose between per-user, per-IP, and global rate limits
- Handle burst traffic while maintaining fairness
- Communicate limits to clients via headers and status codes
Motivating Scenario
A public API for weather data limits each user to 100 requests/minute. User A makes a steady 50 requests/minute. User B suddenly makes 200 requests in 30 seconds (a burst). Without rate limiting, every request reaches the backend (250/minute in total) and it saturates. With per-user token buckets (capacity 100, refilling at 100 tokens/minute), User A is unaffected: its 50 requests/minute always find tokens. User B's burst drains its own bucket: roughly 150 of the 200 requests succeed (the 100 tokens in the bucket plus about 50 that refill during the 30-second burst) and the rest are rejected. The burst is absorbed without starving other users, and once the burst capacity is spent User B can only proceed at the sustained rate of 100 requests/minute.
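The arithmetic can be checked with a few lines of Python (a hypothetical back-of-the-envelope simulation, not part of the weather API): User B's 200 requests are spread evenly over 30 seconds against a bucket with the scenario's capacity and refill rate.

# Hypothetical simulation of User B's burst against a token bucket with
# capacity 100 and a refill rate of 100 tokens/minute (scenario numbers).
def simulate_burst(requests=200, duration_s=30.0, capacity=100.0, refill_per_s=100.0 / 60.0):
    tokens = capacity
    step = duration_s / requests          # requests are spread evenly over the burst
    allowed = 0
    for _ in range(requests):
        tokens = min(capacity, tokens + step * refill_per_s)  # refill since last request
        if tokens >= 1:
            tokens -= 1
            allowed += 1
    return allowed

print(simulate_burst())  # roughly 150 of the 200 burst requests get through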
Core Concepts
Token Bucket: Simple, handles bursts well. Best for: APIs with expected burstiness (image uploads, batch processing). Clients can make rapid requests if within capacity.
Sliding Window: Precise count-based limits. Best for: strict quotas (e.g., 100 API calls per hour). Higher memory overhead (the log variant keeps one timestamp per request).
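A minimal sliding-window-log sketch (illustrative, not taken from the examples below): one timestamp is stored per accepted request, which is what gives the algorithm its precision and its memory overhead.

import time
from collections import deque

class SlidingWindowLimiter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = deque()             # timestamps of accepted requests

    def try_consume(self):
        now = time.time()
        # Evict timestamps that have slid out of the window.
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False

limiter = SlidingWindowLimiter(limit=100, window_seconds=3600)  # 100 calls per hour
print(limiter.try_consume())  # True until the hourly quota is used up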
Leaky Bucket: Smooths traffic into a constant outflow rate. Best for: protecting backends from sudden spikes. Adds latency (requests wait in a queue).
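A minimal leaky-bucket sketch (again illustrative, with assumed class and parameter names): requests enter a bounded queue and leave at a fixed drain rate, so the backend sees a smooth stream regardless of how bursty arrivals are.

import time
from collections import deque

class LeakyBucketLimiter:
    def __init__(self, capacity, drain_rate_per_s):
        self.capacity = capacity              # max queued (waiting) requests
        self.drain_rate_per_s = drain_rate_per_s
        self.queue = deque()

    def try_enqueue(self, request):
        """Accept a request into the bucket, or reject it if the bucket is full."""
        if len(self.queue) >= self.capacity:
            return False                      # overflow: shed the request
        self.queue.append(request)
        return True

    def drain(self, handler, max_items=10):
        """Release up to max_items queued requests at the fixed drain rate."""
        for _ in range(max_items):
            if not self.queue:
                break
            handler(self.queue.popleft())
            time.sleep(1.0 / self.drain_rate_per_s)  # enforce the steady outflow

bucket = LeakyBucketLimiter(capacity=50, drain_rate_per_s=10)
bucket.try_enqueue("request-1")
bucket.drain(print)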
Practical Example
The same token bucket limiter is implemented below in Python, Go, and Node.js.

Python
import time
from collections import defaultdict
from threading import Lock
class TokenBucketRateLimiter:
    def __init__(self, capacity, refill_rate):
        """
        capacity: max tokens in bucket
        refill_rate: tokens per second
        """
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity
        self.last_refill = time.time()
        self.lock = Lock()

    def try_consume(self, tokens=1):
        """Try to consume tokens; return (allowed, message)"""
        with self.lock:
            self._refill()
            if self.tokens >= tokens:
                self.tokens -= tokens
                return True, "Request allowed"
            else:
                wait_time = (tokens - self.tokens) / self.refill_rate
                return False, f"Rate limited. Retry after {wait_time:.1f}s"

    def _refill(self):
        """Refill tokens based on elapsed time"""
        now = time.time()
        elapsed = now - self.last_refill
        tokens_to_add = elapsed * self.refill_rate
        self.tokens = min(self.capacity, self.tokens + tokens_to_add)
        self.last_refill = now

    def get_status(self):
        with self.lock:
            self._refill()
            return {"tokens": self.tokens, "capacity": self.capacity}

class PerUserRateLimiter:
    def __init__(self, default_capacity=100, default_rate=10):
        self.limiters = defaultdict(
            lambda: TokenBucketRateLimiter(default_capacity, default_rate)
        )
        self.lock = Lock()

    def check_limit(self, user_id, tokens=1):
        """Check if user can make request"""
        with self.lock:
            limiter = self.limiters[user_id]
            return limiter.try_consume(tokens)

    def set_user_limit(self, user_id, capacity, rate):
        """Set custom limit for user (e.g., premium tier)"""
        with self.lock:
            self.limiters[user_id] = TokenBucketRateLimiter(capacity, rate)

class DistributedRateLimiter:
    """Layers a global limit over per-client-IP limits.

    Note: this is a single-process example; truly distributed limiting
    requires shared state (e.g., Redis), as discussed in the checklist below.
    """

    def __init__(self):
        self.ip_limiters = defaultdict(
            lambda: TokenBucketRateLimiter(capacity=100, refill_rate=10)
        )
        self.global_limiter = TokenBucketRateLimiter(capacity=10000, refill_rate=1000)

    def check_limit(self, client_ip):
        """Check both IP and global limits"""
        # Check global limit first (cheap check)
        global_ok, global_msg = self.global_limiter.try_consume()
        if not global_ok:
            return False, "Service rate limit exceeded (429)"
        # Check per-IP limit
        ip_ok, ip_msg = self.ip_limiters[client_ip].try_consume()
        if not ip_ok:
            return False, f"IP rate limit exceeded: {ip_msg}"
        return True, "Request allowed"

# Example usage
limiter = PerUserRateLimiter(default_capacity=100, default_rate=10)

# Simulate requests from different users
for i in range(150):
    user = "user-1" if i < 50 else "user-2"
    allowed, msg = limiter.check_limit(user)
    if not allowed:
        print(f"Request {i} ({user}): {msg}")

print("User-1 status:", limiter.limiters["user-1"].get_status())
print("User-2 status:", limiter.limiters["user-2"].get_status())
Go

package main

import (
	"fmt"
	"math"
	"sync"
	"time"
)
type TokenBucketRateLimiter struct {
	capacity   float64
	refillRate float64
	tokens     float64
	lastRefill time.Time
	mu         sync.Mutex
}

func NewTokenBucketRateLimiter(capacity float64, refillRate float64) *TokenBucketRateLimiter {
	return &TokenBucketRateLimiter{
		capacity:   capacity,
		refillRate: refillRate,
		tokens:     capacity,
		lastRefill: time.Now(),
	}
}

func (tbl *TokenBucketRateLimiter) refill() {
	now := time.Now()
	elapsed := now.Sub(tbl.lastRefill).Seconds()
	tokensToAdd := elapsed * tbl.refillRate
	tbl.tokens = math.Min(tbl.capacity, tbl.tokens+tokensToAdd)
	tbl.lastRefill = now
}

func (tbl *TokenBucketRateLimiter) TryConsume(tokens float64) (bool, string) {
	tbl.mu.Lock()
	defer tbl.mu.Unlock()
	tbl.refill()
	if tbl.tokens >= tokens {
		tbl.tokens -= tokens
		return true, "Request allowed"
	}
	waitTime := (tokens - tbl.tokens) / tbl.refillRate
	return false, fmt.Sprintf("Rate limited. Retry after %.1fs", waitTime)
}

type PerUserRateLimiter struct {
	limiters map[string]*TokenBucketRateLimiter
	mu       sync.Mutex
}

func NewPerUserRateLimiter() *PerUserRateLimiter {
	return &PerUserRateLimiter{
		limiters: make(map[string]*TokenBucketRateLimiter),
	}
}

func (purl *PerUserRateLimiter) CheckLimit(userID string) (bool, string) {
	purl.mu.Lock()
	limiter, exists := purl.limiters[userID]
	if !exists {
		limiter = NewTokenBucketRateLimiter(100, 10)
		purl.limiters[userID] = limiter
	}
	purl.mu.Unlock()
	return limiter.TryConsume(1)
}

func main() {
	limiter := NewPerUserRateLimiter()
	for i := 0; i < 150; i++ {
		user := "user-1"
		if i >= 50 {
			user = "user-2"
		}
		allowed, msg := limiter.CheckLimit(user)
		if !allowed {
			fmt.Printf("Request %d (%s): %s\n", i, user, msg)
		}
	}
}
Node.js

class TokenBucketRateLimiter {
  constructor(capacity, refillRate) {
    this.capacity = capacity;
    this.refillRate = refillRate; // tokens per second
    this.tokens = capacity;
    this.lastRefill = Date.now();
  }

  refill() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000; // seconds
    const tokensToAdd = elapsed * this.refillRate;
    this.tokens = Math.min(this.capacity, this.tokens + tokensToAdd);
    this.lastRefill = now;
  }

  tryConsume(tokens = 1) {
    this.refill();
    if (this.tokens >= tokens) {
      this.tokens -= tokens;
      return [true, 'Request allowed'];
    } else {
      const waitTime = (tokens - this.tokens) / this.refillRate;
      return [false, `Rate limited. Retry after ${waitTime.toFixed(1)}s`];
    }
  }

  getStatus() {
    this.refill();
    return {
      tokens: Math.floor(this.tokens),
      capacity: this.capacity,
      refillRate: this.refillRate
    };
  }
}

class PerUserRateLimiter {
  constructor(defaultCapacity = 100, defaultRate = 10) {
    this.limiters = new Map();
    this.defaultCapacity = defaultCapacity;
    this.defaultRate = defaultRate;
  }

  checkLimit(userId) {
    if (!this.limiters.has(userId)) {
      this.limiters.set(
        userId,
        new TokenBucketRateLimiter(this.defaultCapacity, this.defaultRate)
      );
    }
    const limiter = this.limiters.get(userId);
    return limiter.tryConsume(1);
  }

  setUserLimit(userId, capacity, rate) {
    this.limiters.set(userId, new TokenBucketRateLimiter(capacity, rate));
  }

  getUserStatus(userId) {
    if (!this.limiters.has(userId)) {
      return null;
    }
    return this.limiters.get(userId).getStatus();
  }
}

// Example usage
const limiter = new PerUserRateLimiter(100, 10);

// Simulate requests
for (let i = 0; i < 150; i++) {
  const user = i < 50 ? 'user-1' : 'user-2';
  const [allowed, msg] = limiter.checkLimit(user);
  if (!allowed) {
    console.log(`Request ${i} (${user}): ${msg}`);
  }
}

console.log('User-1 status:', limiter.getUserStatus('user-1'));
console.log('User-2 status:', limiter.getUserStatus('user-2'));
When to Use vs. When NOT to Use
Rate limiting is a good fit for:
- Public APIs with unknown clients
- Multi-tenant systems with quotas
- Protecting critical backend resources
- Preventing DDoS attacks (intentional or accidental)
- Fair resource sharing among users/services
It is less useful when every caller is internal, trusted, and capacity-planned; there, overload is usually better handled reactively with load shedding and backpressure (see Next Steps).
Patterns and Pitfalls
Common pitfalls: enforcing limits per server so the effective global limit silently multiplies with the number of instances (use shared state for a true global limit); returning 429 without a Retry-After hint, which invites aggressive client retries; choosing thresholds by guesswork rather than capacity testing; and failing closed when the rate limiter itself is unavailable, which turns a protection mechanism into an outage.
Design Review Checklist
- Rate limits are specified per dimension (user, IP, global)
- Limits are based on actual capacity testing, not guesses
- Burst capacity (token bucket size) is defined per tier
- Refill rate allows sustainable traffic within SLA
- HTTP 429 Too Many Requests is returned with a Retry-After header (see the header sketch after this checklist)
- X-RateLimit-* headers are included in all responses
- Distributed rate limiting uses shared state (Redis, Memcached)
- Rate limit thresholds have monitoring and alerting
- Rate limits are documented for client developers and surfaced in client libraries
- Fallback strategy exists if rate limiter fails (fail open vs. closed?)
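The two header items above can be made concrete with a small, framework-agnostic sketch; the helper name and values are hypothetical, and the inputs would come from whichever limiter and web framework you use.

# Hypothetical helper: translate a limiter decision into an HTTP status and headers.
# `allowed`, `remaining`, `limit`, and `retry_after_s` would come from your limiter.
def rate_limit_headers(allowed, remaining, limit, retry_after_s):
    headers = {
        "X-RateLimit-Limit": str(limit),                     # quota for the current window
        "X-RateLimit-Remaining": str(max(0, int(remaining))),
    }
    if allowed:
        return 200, headers
    headers["Retry-After"] = str(max(1, int(retry_after_s)))  # seconds until retrying is useful
    return 429, headers                                        # 429 Too Many Requests

status, headers = rate_limit_headers(allowed=False, remaining=0, limit=100, retry_after_s=2.5)
print(status, headers)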
Self-Check
- Can you explain token bucket vs. sliding window algorithms?
- How does your system handle burst traffic?
- What happens if the rate limiter itself goes down?
- How do you coordinate rate limits across multiple servers? (see the shared-state sketch below)
- What limits do you return to clients, and how?
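The multi-server question above and the checklist's shared-state item point at the same answer: keep the counters outside individual processes. Below is a minimal fixed-window sketch on Redis, assuming a reachable Redis instance and the redis-py client; the key names, limit, and window are illustrative.

import time
import redis  # assumes the redis-py client is installed

r = redis.Redis(host="localhost", port=6379)

def allow_request(user_id, limit=100, window_s=60):
    """Fixed-window counter shared by every app server via Redis."""
    window = int(time.time() // window_s)          # current window number
    key = f"ratelimit:{user_id}:{window}"
    pipe = r.pipeline()
    pipe.incr(key)                                 # count this request atomically
    pipe.expire(key, window_s * 2)                 # let old windows expire on their own
    count, _ = pipe.execute()
    return count <= limit

# Example: the 101st call within the same minute is rejected, no matter which server handles it.
print(allow_request("user-1"))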
Next Steps
- Load Shedding: Read Load Shedding and Backpressure ↗️ for reactive overload handling
- Bulkhead Isolation: Learn Bulkhead Isolation ↗️ for resource-level isolation
- Circuit Breaker: Read Circuit Breaker ↗️ to handle dependency failures
References
- Newman, S. (2015). Building Microservices. O'Reilly Media.
- Rafaels, A. (2012). Rate Limiting Strategies ↗️. Figma Engineering Blog.
- Kong Documentation. Rate Limiting Plugin ↗️