
Load Shedding and Backpressure

Reject requests strategically when overloaded, signal upstream to prevent cascades

TL;DR

When overloaded, queuing everything guarantees slow failure for all requests. Load shedding rejects low-priority requests fast, preserving capacity for critical operations. Better to fail some quickly than fail everyone slowly. Backpressure signals "I'm full" upstream, cascading load control up the stack. Strategies include priority-based rejection, service-tier rejection, and adaptive thresholds. Shed analytics and recommendations first; preserve payment and authentication traffic.

Learning Objectives

  • Understand why unbounded queues cause cascading failures
  • Design priority-based load shedding strategies
  • Implement backpressure mechanisms across service boundaries
  • Choose between aggressive vs. conservative shedding policies
  • Monitor and tune shedding thresholds for your SLA

Motivating Scenario

An e-commerce platform experiences a flash sale. Traffic spikes 10x. Without load shedding, the API gateway queues all 100,000 requests. Processing now takes 30 minutes per request. Payment processing, the most critical service, starves because threads are locked up waiting for slow requests. With load shedding, analytics requests are rejected immediately, keeping payment processing threads available for critical transactions. Customers experience 429 (Too Many Requests) on analytics, but purchases succeed within 2 seconds.

Core Concepts

Load Shedding vs. Unbounded Queuing

Load shedding operates at the ingress layer—API gateway, load balancer, or service entrypoint. When queue depth exceeds a threshold, new low-priority requests are rejected with a 429 status code. This prevents resource exhaustion and keeps latency predictable for high-priority traffic.

Backpressure extends this concept: when a service is overloaded, it signals upstream services (via 429 or queue-full responses) to reduce traffic. Upstream services then shed load before it reaches the overloaded service. This cascades control decisions up the call stack.
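
A minimal sketch of what reacting to that signal can look like on the upstream side (the names BackpressureAwareClient and send_downstream are illustrative, not from any specific library): when a call returns 429 or 503, the caller remembers the overload for a short cooldown and sheds its own non-critical traffic locally instead of forwarding it downstream.

import time

class BackpressureAwareClient:
    """Upstream client that sheds its own low-priority work while a
    downstream service is signalling overload (429/503)."""

    def __init__(self, cooldown_seconds=5.0):
        self.cooldown_seconds = cooldown_seconds
        self.overloaded_until = 0.0  # time until which downstream is treated as overloaded

    def send(self, request, send_downstream, is_critical=False):
        # While the downstream is known to be overloaded, shed non-critical
        # requests here instead of forwarding them (backpressure moves upstream).
        if time.time() < self.overloaded_until and not is_critical:
            return None, "shed upstream (downstream overloaded)"

        status, body = send_downstream(request)  # hypothetical transport call
        if status in (429, 503):
            # Record the backpressure signal; low-priority requests will be
            # shed locally until the cooldown expires.
            self.overloaded_until = time.time() + self.cooldown_seconds
            return None, f"rejected downstream ({status})"
        return body, "ok"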

Practical Example

import time
from enum import Enum
from collections import deque

class RequestPriority(Enum):
    CRITICAL = 1
    NORMAL = 2
    BACKGROUND = 3

class LoadShedder:
    def __init__(self, max_queue_depth=1000, shed_threshold=0.8):
        self.queue = deque()
        self.max_queue_depth = max_queue_depth
        self.shed_threshold = shed_threshold
        self.dropped_requests = 0

    def should_accept(self, priority):
        queue_utilization = len(self.queue) / self.max_queue_depth

        # Critical always accepted
        if priority == RequestPriority.CRITICAL:
            return True

        # Normal accepted if queue < 70%
        if priority == RequestPriority.NORMAL:
            return queue_utilization < 0.7

        # Background only if queue < 30%
        if priority == RequestPriority.BACKGROUND:
            return queue_utilization < 0.3

        return False

    def add_request(self, request_id, priority):
        if not self.should_accept(priority):
            self.dropped_requests += 1
            return False, "Service overloaded (429)"

        self.queue.append((request_id, priority, time.time()))
        return True, "Queued"

    def process(self):
        if self.queue:
            req_id, priority, arrival = self.queue.popleft()
            latency = time.time() - arrival
            return req_id, latency
        return None, None

# Example usage
shed = LoadShedder()

# Simulate requests
for i in range(2000):
    priority = RequestPriority.CRITICAL if i % 10 == 0 else (
        RequestPriority.NORMAL if i % 3 == 0 else RequestPriority.BACKGROUND
    )
    accepted, msg = shed.add_request(f"req-{i}", priority)

    # Process some requests (arrivals outpace processing)
    if i % 5 == 0:
        shed.process()

print(f"Dropped: {shed.dropped_requests}, Queued: {len(shed.queue)}")

When to Use vs. When NOT to Use

Use Load Shedding
  1. Sustained traffic exceeds capacity
  2. You have tiered SLAs (critical vs. non-critical)
  3. Rejecting some requests fast is better than responding slowly to everyone
  4. Flash sales and other sudden traffic spikes
Avoid Load Shedding
  1. Multi-tenant systems where per-tenant quotas already bound the load

Patterns and Pitfalls

Priority-based shedding: Assign requests to tiers (critical, normal, background) and shed from the bottom tier up. Critical requests get guaranteed capacity reserves; for example, payment APIs always keep 20% of capacity reserved.
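
One way to implement the capacity reserve is a slot pool where non-critical traffic can never consume the last reserved slice. This is a sketch with illustrative numbers (100 slots, 20% reserve), not a production implementation:

class TieredCapacity:
    # Illustrative sketch: 100 concurrency slots, 20 of them reserved so
    # critical traffic always has headroom even when the pool is busy.
    def __init__(self, total_slots=100, critical_reserve=0.2):
        self.total_slots = total_slots
        self.reserved = int(total_slots * critical_reserve)
        self.in_use = 0

    def try_acquire(self, is_critical):
        # Non-critical traffic may only use the unreserved slots;
        # critical traffic can dip into the reserve as well.
        limit = self.total_slots if is_critical else self.total_slots - self.reserved
        if self.in_use >= limit:
            return False  # shed: respond 429 at the ingress layer
        self.in_use += 1
        return True

    def release(self):
        self.in_use = max(0, self.in_use - 1)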
Adaptive thresholds: Adjust shedding thresholds based on system load. At 50% CPU, shed background tasks; at 80% CPU, shed normal tasks; at 95% CPU, accept only critical requests. This prevents cascading failures as load climbs.
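
A sketch of that mapping, reusing the RequestPriority enum from the practical example above; the 50% / 80% / 95% breakpoints are the illustrative values quoted here and should come from your own capacity tests:

def lowest_accepted_priority(cpu_utilization):
    # Assumes the RequestPriority enum from the practical example above.
    # Maps system load to the least-important tier that is still accepted.
    if cpu_utilization < 0.50:
        return RequestPriority.BACKGROUND   # light load: accept everything
    if cpu_utilization < 0.80:
        return RequestPriority.NORMAL       # shed background tasks
    if cpu_utilization < 0.95:
        return RequestPriority.CRITICAL     # shed normal tasks too
    return RequestPriority.CRITICAL         # >= 95%: only critical; consider alerting

def should_accept_adaptive(priority, cpu_utilization):
    # Lower enum values are more important (CRITICAL=1 ... BACKGROUND=3).
    return priority.value <= lowest_accepted_priority(cpu_utilization).value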
Retry storms: When you shed traffic, clients often retry immediately, and those retries can all hit at once, causing a thundering herd. Add jitter to retry delays to spread the load over time.
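
For instance, full-jitter exponential backoff picks a random delay between zero and an exponentially growing cap, so retries from many clients spread out rather than arriving together (a sketch with illustrative base and cap values):

import random

def backoff_with_jitter(attempt, base=0.2, cap=10.0):
    # Full-jitter backoff: wait a random time between 0 and
    # min(cap, base * 2**attempt) seconds before retrying.
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# Example: delays a client might sleep after successive 429 responses.
for attempt in range(5):
    print(f"attempt {attempt}: sleep {backoff_with_jitter(attempt):.2f}s")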
Threshold tuning: Shedding too aggressively rejects legitimate traffic; shedding too leniently brings back the unbounded-queue problem. Monitor actual queue depths and adjust thresholds monthly.
Backpressure signaling: Return 503 Service Unavailable or 429 Too Many Requests with a Retry-After header. Upstream services read this signal and shed load before it reaches you, enabling intelligent load distribution.
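
A minimal sketch of emitting that signal with Python's standard-library http.server; the overloaded flag stands in for a real check of queue depth or CPU, and the Retry-After value of 5 seconds is illustrative:

from http.server import BaseHTTPRequestHandler, HTTPServer

class SheddingHandler(BaseHTTPRequestHandler):
    overloaded = True  # stand-in for a real queue-depth or CPU check

    def do_GET(self):
        if self.overloaded:
            # Explicit backpressure signal: 429 plus a Retry-After hint that
            # well-behaved upstream clients use to delay and shed load.
            self.send_response(429)
            self.send_header("Retry-After", "5")
            self.end_headers()
            self.wfile.write(b"overloaded, retry later\n")
            return
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok\n")

# To try it locally:
# HTTPServer(("localhost", 8080), SheddingHandler).serve_forever()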
Silent drops: Dropping requests without feedback means clients never learn to back off. Always signal rejections clearly, log shed events for monitoring, and alert when the shed rate exceeds your threshold.
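
A sketch of the accounting side, using a hypothetical ShedMetrics helper: every rejection is logged with its priority and reason, and the shed rate is compared against an alert threshold (1% here, matching the checklist below):

import logging

logger = logging.getLogger("load_shedder")

class ShedMetrics:
    # Tracks accepted vs. shed requests and flags when the shed rate
    # crosses an alert threshold.
    def __init__(self, alert_threshold=0.01):
        self.accepted = 0
        self.shed = 0
        self.alert_threshold = alert_threshold

    def record(self, accepted, priority, reason=""):
        if accepted:
            self.accepted += 1
            return
        self.shed += 1
        # Log every rejection so client-side 429s can be correlated with
        # server-side shedding decisions.
        logger.warning("shed request priority=%s reason=%s", priority, reason)

    @property
    def shed_rate(self):
        total = self.accepted + self.shed
        return self.shed / total if total else 0.0

    def should_alert(self):
        return self.shed_rate > self.alert_threshold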

Design Review Checklist

  • Request priorities are documented and enforced (critical, normal, background)
  • Queue depth thresholds are determined from capacity tests, not guessed
  • Shedding decision logic is tested under overload conditions
  • Clients handle 429 responses with exponential backoff and jitter
  • Rejected requests are logged with priority and reason
  • Monitoring alerts on shed rate (> 1% rejected is concerning)
  • Load shedding is placed at ingress (API gateway), not mid-stack
  • Backpressure signals propagate upstream (no silent drops)
  • SLA guarantees are honored for critical request tiers
  • Failover and multi-region strategies account for load shedding

Self-Check

  • Can you explain why an unbounded queue leads to cascading failure?
  • What are the three priority tiers in your system? How are they determined?
  • How does your system signal backpressure to upstream services?
  • What happens when a client receives a 429? Do they retry wisely?
  • How do you monitor whether shedding is necessary vs. optional?

Next Steps

  1. Health Probes: Read Health Probes ↗️ to pair with shedding for complete failure detection
  2. Circuit Breaker: Learn Circuit Breaker ↗️ to prevent cascades at the dependency level
  3. Rate Limiting: Read Rate Limiting and Throttling ↗️ for proactive traffic control
