Resilience & Reliability Patterns
Transform inevitable failures into isolated incidents with patterns that keep systems running
Why Resilience Matters
Failures are inevitable. Network hiccups, service crashes, cascading timeouts—they happen. The difference between resilient systems and fragile ones isn't whether failures occur, but how systems respond. Graceful degradation beats catastrophic collapse.
Seven Core Patterns
1. Timeouts, Retries, Exponential Backoff, and Jitter
The foundation. Every remote call must have a timeout. Retries enable recovery from transient failures. Exponential backoff prevents overwhelming recovering systems. Jitter prevents thundering herd.
When to Use: Every network call
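A minimal Python sketch of the combination; `call_with_retries` and `TransientError` are illustrative names rather than any particular library's API, and the timeout itself is assumed to live inside the wrapped operation (for example, `requests.get(url, timeout=1.5)` if you happen to use requests).

```python
import random
import time

class TransientError(Exception):
    """Raised for failures worth retrying (timeouts, 503s, connection resets)."""

def call_with_retries(operation, max_attempts=4, base_delay=0.1, max_delay=2.0):
    """Run `operation` with retries, capped exponential backoff, and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()                      # the operation enforces its own timeout
        except TransientError:
            if attempt == max_attempts:
                raise                               # out of attempts: surface the failure
            backoff = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, backoff))  # full jitter desynchronizes clients
```

Full jitter (sleeping a random amount up to the backoff cap) keeps many clients from retrying in lockstep and re-creating the thundering herd.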
2. Circuit Breaker
Stop sending requests to failing services. Let them recover. Resume gradually.
When to Use: Any dependency whose failures or slowness could cascade
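A compact sketch of the state machine, assuming a simple consecutive-failure threshold and a cooldown before a single half-open trial call; real implementations usually add rolling error-rate windows and metrics.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: trip after N consecutive failures, probe again after a cooldown."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None              # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: half-open, let one trial call through
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold or self.opened_at is not None:
                self.opened_at = time.monotonic()   # trip (or re-trip) the breaker
            raise
        self.failures = 0
        self.opened_at = None              # success closes the circuit again
        return result
```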
3. Bulkhead Isolation
Separate resource pools for different services. One service's failure doesn't starve others.
When to Use: Multiple service dependencies
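One way to sketch a bulkhead in Python is a bounded semaphore per dependency; the `Bulkhead` class and the pool sizes below are illustrative assumptions, not a prescribed API.

```python
import threading

class Bulkhead:
    """Cap concurrent calls to one dependency so it cannot exhaust shared resources."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, operation):
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: rejecting instead of queueing")
        try:
            return operation()
        finally:
            self._slots.release()

# One pool per dependency: a slow payments service cannot starve catalog calls.
payments_bulkhead = Bulkhead(max_concurrent=10)
catalog_bulkhead = Bulkhead(max_concurrent=50)
```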
4. Rate Limiting and Throttling
Control request flow. Protect systems from being overwhelmed.
When to Use: Shared resources, preventing abuse
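A token bucket is the most common starting point: it allows short bursts while enforcing an average rate. The sketch below is a minimal single-process version; across many instances the counters would need to live in a shared store.

```python
import time

class TokenBucket:
    """Token-bucket limiter: bursts up to `capacity`, refills at `rate` tokens per second."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self):
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1      # spend a token: request admitted
            return True
        return False              # bucket empty: throttle the request
```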
5. Load Shedding and Backpressure
When overloaded, reject low-priority requests. Better to fail some fast than fail all slowly.
When to Use: Under sustained overload
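A sketch of priority-based shedding on a bounded queue; the queue size, threshold, and priority labels are illustrative assumptions. The refusal to block or grow the queue is the backpressure signal to callers.

```python
import queue

work_queue = queue.Queue(maxsize=1000)
SHED_THRESHOLD = 0.8   # start shedding low-priority work at 80% of capacity

def submit(request, priority):
    """Admit a request, or shed it when the queue signals backpressure."""
    utilization = work_queue.qsize() / work_queue.maxsize
    if utilization >= SHED_THRESHOLD and priority == "low":
        return False                     # shed: fail fast instead of queueing
    try:
        work_queue.put_nowait(request)   # never block the caller
        return True
    except queue.Full:
        return False                     # at capacity: shed everything
```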
6. Health Probes
Detect failures quickly. Remove failed instances from service rotation.
When to Use: Distributed systems with multiple instances
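A minimal sketch of separate liveness and readiness endpoints using Python's standard library; `/healthz` and `/readyz` are conventional path names (orchestrators such as Kubernetes let you point probes at any path), not requirements.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

READY = {"value": False}   # flip to True once caches are warm and dependencies are reachable

class ProbeHandler(BaseHTTPRequestHandler):
    """Liveness: 'the process is alive'. Readiness: 'it can serve traffic right now'."""

    def do_GET(self):
        if self.path == "/healthz":        # liveness probe: restart the instance if this fails
            self.send_response(200)
        elif self.path == "/readyz":       # readiness probe: pull from rotation if this fails
            self.send_response(200 if READY["value"] else 503)
        else:
            self.send_response(404)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), ProbeHandler).serve_forever()
```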
7. Leader Election and Consensus
Coordinate across multiple nodes. Ensure only one node takes action.
When to Use: Distributed coordination needed
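Lease-based leader election is the simplest mental model: whoever holds an unexpired lease is the leader. The sketch below fakes the lease store in memory; in practice the compare-and-set would live in a consensus-backed service such as etcd or ZooKeeper.

```python
import time

class LeaseStore:
    """In-memory stand-in for a coordination service (etcd, ZooKeeper, a DB row with compare-and-set)."""

    def __init__(self):
        self.holder, self.expires = None, 0.0

    def try_acquire(self, node_id, ttl):
        now = time.monotonic()
        if self.holder in (None, node_id) or now > self.expires:   # free, already mine, or expired
            self.holder, self.expires = node_id, now + ttl
            return True
        return False

def run_if_leader(store, node_id, task, ttl=10.0):
    """Only the node holding the lease acts; everyone else stands by."""
    if store.try_acquire(node_id, ttl):
        task()   # a real leader also renews the lease well before `ttl` expires
```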
Learning Path
Total Time: ~52 minutes
- Foundation (10 min): Timeouts, Retries, Backoff, Jitter
- Prevention (8 min): Circuit Breaker
- Isolation (8 min): Bulkhead Isolation
- Control (7 min): Rate Limiting & Throttling
- Shedding (7 min): Load Shedding & Backpressure
- Detection (6 min): Health Probes
- Coordination (6 min): Leader Election & Consensus
Resilience Principles
- Fail Fast: Detect failures quickly (short timeouts). Slow failures are worse than fast ones.
- Fail Independently: One service's failure shouldn't cascade. Use bulkheads and circuit breakers.
- Degrade Gracefully: Reduce functionality, don't crash. Serve cached data, return defaults, go read-only (see the fallback sketch after this list).
- Shed Load Intelligently: When overwhelmed, reject low-priority requests. Don't queue everything.
- Recover Gradually: Don't send all traffic to a recovering service at once. Use exponential backoff and ramp traffic back up slowly.
- Monitor Everything: You can't debug what you can't see. Measure latency, errors, and timeouts.
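As a small illustration of graceful degradation, here is a hypothetical fallback wrapper; `fetch_recommendations`, the cache, and the defaults are stand-ins, not part of any specific codebase.

```python
def get_recommendations(user_id, fetch_recommendations, cache, defaults=()):
    """Prefer live data, fall back to cache, then to safe defaults; never crash the page."""
    try:
        items = fetch_recommendations(user_id)        # primary path: call the live service
        cache[user_id] = items                        # keep the cache warm for future fallbacks
        return items
    except Exception:
        return cache.get(user_id, list(defaults))     # degraded path: stale data beats an error page
```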
Key Metrics
| Metric | Why It Matters | What to Watch |
|---|---|---|
| Timeout Rate | How often requests exceed their timeout | Sustained high rate = timeout too short or a degraded dependency |
| Retry Rate | How often requests are retried | High rate = transient failures are common |
| Circuit Breaker Trips | When services are failing | Indicates a problem service |
| p99 Latency | Tail (99th-percentile) latency | Growing tail latencies reveal cascading issues |
| Error Rate | Fraction of requests that fail | Baselines differ per service; alert on deviations |
| Backpressure (Queue Size) | System approaching capacity | Growing queue = approaching failure |
Before You Move On
You should understand:
- Why every network call needs a timeout
- Why retry backoff should be exponential, not linear
- The difference between permanent and transient failures
- How circuit breakers prevent cascades
- Why bulkheads isolate failures
- When to shed load vs queue requests
- The importance of health probes
Common Mistakes
- Timeouts Too Long: Waiting requests hold resources and let slowness cascade upstream
- Timeouts Too Short: High false-positive failure detection
- Retries Without Idempotency: Creates duplicate side effects
- No Exponential Backoff: Overwhelming recovering systems
- Circuit Breaker Thresholds Too High: The breaker opens too late to stop the cascade
- Health Checks Too Loose: Dead instances stay in service
📄️ Timeouts, Retries, Exponential Backoff, and Jitter
Master the foundation of resilience: timeouts prevent resource exhaustion, retries recover from transient failures, exponential backoff prevents overwhelming recovering services, jitter prevents the thundering herd.
📄️ Circuit Breaker
Master the circuit breaker pattern: stop sending requests to failing services, prevent cascades, and recover gradually.
📄️ Bulkhead Isolation
Master bulkhead isolation: separate resource pools for different services to contain failures and prevent resource starvation.
📄️ Rate Limiting and Throttling
Control request flow to protect resources. Token bucket, sliding window, and adaptive rate limiting strategies.
📄️ Load Shedding and Backpressure
Reject low-priority requests when overloaded. Backpressure signals propagate up the stack to prevent cascading failures.
📄️ Health Probes
Detect failures quickly with liveness and readiness probes. Enable self-healing and automatic instance removal.
📄️ Leader Election and Consensus
Master leader election and consensus algorithms. Ensure only one node takes action and maintain consistency across replicas.