Resilience & Reliability Patterns

Transform inevitable failures into isolated incidents with patterns that keep systems running

Why Resilience Matters

Failures are inevitable: network hiccups, service crashes, and cascading timeouts all happen. What separates resilient systems from fragile ones is not whether failures occur but how the system responds. Graceful degradation beats catastrophic collapse.

Resilience Pattern Layers

Seven Core Patterns

1. Timeouts, Retries, Exponential Backoff, and Jitter

The foundation. Every remote call must have a timeout. Retries enable recovery from transient failures. Exponential backoff keeps retries from overwhelming a recovering system. Jitter randomizes the delays so that many clients do not retry in lockstep (the thundering herd problem).

When to Use: Every network call
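As a rough illustration of how retries, exponential backoff, and jitter fit together, here is a minimal Python sketch. The names `backoff_delays` and `call_with_retries` are hypothetical, the "full jitter" variant (a uniform random delay up to the exponential ceiling) is one common choice among several, and the wrapped function is assumed to enforce its own timeout by raising `TimeoutError`.

```python
import random
import time


def backoff_delays(base=0.1, cap=5.0, attempts=5, rng=random.random):
    """Yield one "full jitter" delay per retry: random in [0, min(cap, base * 2**n))."""
    for n in range(attempts):
        yield rng() * min(cap, base * 2 ** n)


def call_with_retries(fn, attempts=4, base=0.1, cap=5.0,
                      retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call fn(); on a transient error, wait a jittered, growing delay and retry."""
    delays = backoff_delays(base, cap, attempts - 1)
    while True:
        try:
            return fn()
        except retry_on:
            delay = next(delays, None)
            if delay is None:       # retry budget exhausted: surface the error
                raise
            sleep(delay)
```

Only errors listed in `retry_on` are retried, so permanent failures (for example a validation error) surface immediately instead of being repeated.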

2. Circuit Breaker

Stop sending requests to failing services. Let them recover. Resume gradually.

When to Use: Calls to a dependency that can fail or slow down for sustained periods
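The stop, wait, and resume cycle can be sketched as a small state machine. This is a simplified, single-threaded illustration and not any particular library's API: the `CircuitBreaker` class, its parameters, and the consecutive-failure threshold are all assumptions. Production breakers usually track failure rates over a window and limit how many trial calls run in the half-open state.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0          # consecutive failures seen
        self.opened_at = None      # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, failing fast")
            # Reset timeout elapsed: half-open, let this trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # trip, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None      # success closes the circuit
        return result
```

While the circuit is open, callers get `CircuitOpenError` immediately, which gives the failing service room to recover and keeps callers from tying up resources on calls that are likely to fail.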

3. Bulkhead Isolation

Separate resource pools for different services. One service's failure doesn't starve others.

When to Use: Multiple service dependencies
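One way to picture a bulkhead is a separate concurrency limit per dependency. The sketch below uses a semaphore per downstream service; the `Bulkhead` class and the service names in the usage note are hypothetical, and real systems often use separate thread or connection pools instead.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a dependency's pool has no free slot."""


class Bulkhead:
    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn):
        # Non-blocking acquire: reject at once rather than queue behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot")
        try:
            return fn()
        finally:
            self._slots.release()
```

With, say, `payments = Bulkhead(10)` and `search = Bulkhead(20)`, a stalled payments service can occupy at most its own 10 slots, and search calls keep their full allowance.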

4. Rate Limiting and Throttling

Control request flow. Protect systems from being overwhelmed.

When to Use: Shared resources, preventing abuse
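A token bucket is one common way to control request flow: it allows short bursts up to a capacity while holding the long-run rate to a fixed value. The sketch below is a minimal single-threaded version; the `TokenBucket` class and its parameter names are illustrative assumptions.

```python
import time


class TokenBucket:
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # Refill for the elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A caller that gets `False` can reject the request (rate limiting) or delay it (throttling), depending on what the client can tolerate.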

5. Load Shedding and Backpressure

When overloaded, reject low-priority requests. Better to fail some fast than fail all slowly.

When to Use: Under sustained overload
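A simple form of load shedding reserves part of the capacity for critical work, so low-priority requests are rejected first as load rises. The sketch below assumes a hypothetical `LoadShedder` with two priority levels and an in-flight counter; it is not thread-safe as written.

```python
class LoadShedder:
    def __init__(self, capacity, low_priority_limit):
        self.capacity = capacity                      # hard limit for all requests
        self.low_priority_limit = low_priority_limit  # lower limit for non-critical requests
        self.in_flight = 0

    def try_acquire(self, critical):
        """Admit the request if its priority class still has headroom."""
        limit = self.capacity if critical else self.low_priority_limit
        if self.in_flight >= limit:
            return False            # shed: fail fast instead of queueing
        self.in_flight += 1
        return True

    def release(self):
        """Call when an admitted request finishes."""
        self.in_flight -= 1
```

Rejected requests fail immediately, which is the "fail some fast" side of the trade-off; an unbounded queue would instead let every request slow down together.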

6. Health Probes

Detect failures quickly. Remove failed instances from service rotation.

When to Use: Distributed systems with multiple instances
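Probe results are usually smoothed with thresholds so that a single blip does not remove an instance from rotation. The sketch below tracks consecutive probe results; the `HealthTracker` class and its threshold names are assumptions for illustration, and the probe itself (for example an HTTP check) is left to the caller.

```python
class HealthTracker:
    def __init__(self, failure_threshold=3, success_threshold=2):
        self.failure_threshold = failure_threshold  # failed probes before marking unhealthy
        self.success_threshold = success_threshold  # passed probes before marking healthy
        self.healthy = True
        self._streak = 0   # consecutive results that contradict the current state

    def record(self, probe_ok):
        """Record one probe result and return the instance's current health."""
        if probe_ok == self.healthy:
            self._streak = 0
            return self.healthy
        self._streak += 1
        needed = self.success_threshold if probe_ok else self.failure_threshold
        if self._streak >= needed:
            self.healthy = probe_ok
            self._streak = 0
        return self.healthy
```

Lower thresholds detect failures sooner but flap more on noisy probes; higher thresholds are steadier but leave dead instances in rotation longer.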

7. Leader Election and Consensus

Coordinate across multiple nodes. Ensure only one node takes action.

When to Use: Distributed coordination needed
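Leader election is often built on a lease: a node becomes leader by acquiring a time-limited lock and must keep renewing it. The sketch below is an in-memory stand-in only. `LeaseStore` is hypothetical, and a real deployment would need a store with atomic compare-and-set backed by consensus. It also does not address clock skew or fencing of stale leaders.

```python
import time


class LeaseStore:
    """In-memory stand-in for a shared store with atomic compare-and-set (assumed)."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, node_id, ttl):
        """Return True if node_id holds the lease after this call."""
        now = self.clock()
        # Grant if the lease is free, expired, or already held by this node (renewal).
        if self.holder is None or now >= self.expires_at or self.holder == node_id:
            self.holder = node_id
            self.expires_at = now + ttl
            return True
        return False
```

Each node calls `try_acquire` periodically with an interval shorter than `ttl`; the node for which it returns `True` acts as leader, and a crashed leader is replaced once its lease expires.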

Learning Path

Total Time: 52 minutes

  1. Foundation (10 min): Timeouts, Retries, Backoff, Jitter
  2. Prevention (8 min): Circuit Breaker
  3. Isolation (8 min): Bulkhead Isolation
  4. Control (7 min): Rate Limiting & Throttling
  5. Shedding (7 min): Load Shedding & Backpressure
  6. Detection (6 min): Health Probes
  7. Coordination (6 min): Leader Election & Consensus

Resilience Principles

  1. Fail Fast: Detect failures quickly (short timeouts). Slow failures are worse than fast ones.

  2. Fail Independent: One service's failure shouldn't cascade. Use bulkheads and circuit breakers.

  3. Degrade Gracefully: Reduce functionality, don't crash. Serve cached data, return defaults, go read-only.

  4. Shed Load Intelligently: When overwhelmed, reject low-priority requests. Don't queue everything.

  5. Recover Gradually: Don't send all traffic to a recovering service at once. Use exponential backoff and ramp traffic up gradually.

  6. Monitor Everything: You can't debug what you can't see. Measure latency, errors, timeouts.
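The "serve cached data, return defaults" idea from the Degrade Gracefully principle can be sketched as a fallback wrapper. The function name `with_fallback`, the dict-based cache, and the `(value, degraded)` return shape are illustrative assumptions.

```python
def with_fallback(fetch, cache, key, default=None):
    """Return (value, degraded). On failure, serve the last cached value or a default."""
    try:
        value = fetch(key)
    except Exception:
        # Dependency is failing: degrade instead of propagating the error.
        return cache.get(key, default), True
    cache[key] = value              # remember the last good answer
    return value, False
```

Returning the `degraded` flag lets callers mark the response as possibly stale and lets monitoring count how often the fallback path is taken.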

Key Metrics

Metric                    | Why It Matters                      | What to Watch
Timeout Rate              | How often requests exceed timeout   | High rate = timeout too short or dependency too slow
Retry Rate                | How often requests need a retry     | High rate = transient failures common
Circuit Breaker Trips     | When services are failing           | Frequent trips point to a problem service
p99 Latency               | Worst-case latency                  | Tail latencies reveal cascading issues
Error Rate                | How many requests fail              | Compare against each service's own baseline
Backpressure (Queue Size) | System approaching capacity         | Growing queue = approaching failure
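To see why p99 latency is tracked separately from the average, the sketch below computes a nearest-rank percentile over a list of latency samples. The `percentile` helper is a hypothetical illustration; monitoring systems typically estimate percentiles from histograms instead of raw samples.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# 98 fast requests at 10 ms and 2 slow ones at 900 ms.
latencies_ms = [10] * 98 + [900] * 2
print(percentile(latencies_ms, 50))   # 10
print(percentile(latencies_ms, 99))   # 900
```

The median hides the slow requests entirely, while p99 exposes them; those slow requests are the ones that hold resources long enough to trigger cascades.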

Before You Move On

You should understand:

  • Why every network call needs a timeout
  • Why retry delays should back off exponentially, not at a fixed or linear interval
  • The difference between permanent and transient failures
  • How circuit breakers prevent cascades
  • Why bulkheads isolate failures
  • When to shed load vs queue requests
  • The importance of health probes

Common Mistakes

  1. Timeouts Too Long: Slow calls hold threads and connections, so one slow dependency cascades to its callers
  2. Timeouts Too Short: Healthy but slow calls are treated as failures (false positives)
  3. Retries Without Idempotency: Creates duplicate side effects
  4. No Exponential Backoff: Immediate retries overwhelm recovering systems
  5. Circuit Breaker Thresholds Too High: The breaker takes too long to trip open
  6. Health Checks Too Loose: Dead instances stay in service
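Mistake 3 is often addressed with idempotency keys: the client sends the same key on every retry of one logical request, and the server replays the stored result instead of repeating the side effect. The sketch below is an in-memory illustration; `PaymentService` and its fields are hypothetical, and a real service would persist the keys and handle concurrent duplicates.

```python
class PaymentService:
    def __init__(self):
        self.results = {}   # idempotency key -> stored result
        self.charges = []   # side effects actually performed

    def charge(self, idempotency_key, amount):
        if idempotency_key in self.results:
            # Retry of a request already processed: replay, no second charge.
            return self.results[idempotency_key]
        self.charges.append(amount)
        result = {"charge_id": len(self.charges), "amount": amount}
        self.results[idempotency_key] = result
        return result
```

With this in place, a retry after a lost response is safe: the second call with the same key returns the original result and `charges` still holds a single entry.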
