Availability & Reliability

Availability (is the system running?) and reliability (does it work correctly?) are core concerns. This section covers redundancy patterns, automated failover, graceful degradation under failure, and disaster recovery planning.

What you'll find here

Key concepts

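  • Availability: Fraction of time the system is operational, usually stated as an uptime percentage (99.9%, 99.95%, 99.99%). Over a 365-day year these allow roughly 8.76 hours, 4.38 hours, and 53 minutes of downtime respectively.
  • Reliability: Probability that the system performs its intended function correctly over a given period under stated operating conditions.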
  • Fault Tolerance: Continues operating despite failures.
  • RTO (Recovery Time Objective): Target time to restore service (e.g., 15 minutes).
  • RPO (Recovery Point Objective): Maximum acceptable data loss, measured as a window of time before the failure (e.g., 1 hour of data).
  • MTBF (Mean Time Between Failures): Average operating time between one failure and the next.
  • MTTR (Mean Time To Recovery): Average time to restore service after a failure.
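
The metrics above are related by simple arithmetic. The sketch below is illustrative only: it assumes the common steady-state approximation availability = MTBF / (MTBF + MTTR) and a 365-day year, and the function names (`availability`, `downtime_hours_per_year`) are hypothetical helpers, not part of any library.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: 365-day year, no leap days


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability estimated as MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(availability_fraction: float) -> float:
    """Downtime budget implied by an availability target (0.0 to 1.0)."""
    if not 0.0 <= availability_fraction <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability_fraction) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Downtime budgets for the availability targets listed above.
    for target in (0.999, 0.9995, 0.9999):
        budget = downtime_hours_per_year(target)
        print(f"{target:.2%} availability allows {budget:.2f} h of downtime per year")

    # Hypothetical service: fails about once every 30 days, recovers in 30 minutes.
    estimated = availability(mtbf_hours=720, mttr_hours=0.5)
    print(f"Estimated availability: {estimated:.4%}")
```

Under this approximation, availability improves either by failing less often (higher MTBF) or by recovering faster (lower MTTR), which is why automated failover targets MTTR directly.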