Availability & Reliability
Availability (is the system running?) and reliability (does it work correctly?) are core concerns. This section covers redundancy patterns, automated failover, graceful degradation under failure, and disaster recovery planning.
What you'll find here
- Redundancy, Replication, and Failover — Multiple instances, active-passive/active-active failover.
- Graceful Degradation and Feature Kill Switches — Degrade features rather than fail completely.
- Disaster Recovery, RTO, RPO, and Backups — Plan for worst-case scenarios.
Key concepts
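- Availability: Percentage of time the system is up and able to serve requests, usually stated as a target (99.9%, 99.95%, 99.99%).
- Reliability: The system performs its intended function correctly, without failure, over a given period under stated conditions.
- Fault Tolerance: The system continues operating, possibly at reduced capacity, despite component failures.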
- RTO (Recovery Time Objective): Target time to restore service (e.g., 15 minutes).
- RPO (Recovery Point Objective): Maximum acceptable data loss (e.g., 1 hour of data).
- MTBF (Mean Time Between Failures): Average operating time between one failure and the next in a repairable system.
- MTTR (Mean Time To Recovery): Average time to restore service after a failure.
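The availability percentages and the MTBF/MTTR definitions above are easier to compare as numbers. The sketch below is illustrative only: the function names are hypothetical, a year is assumed to be 365 days, and the MTBF/MTTR relation is the common steady-state approximation, availability = MTBF / (MTBF + MTTR).

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def downtime_budget_minutes(availability_pct: float) -> float:
    """Downtime allowed per year, in minutes, for a given availability target."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up, given MTBF and MTTR in the same unit."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    for target in (99.9, 99.95, 99.99):
        print(f"{target}% -> {downtime_budget_minutes(target):.1f} min/year")

    # Hypothetical service: fails about every 999 hours, takes 1 hour to recover.
    print(f"availability: {steady_state_availability(999, 1):.3%}")
```

Under these assumptions, 99.9% allows roughly 8.8 hours of downtime per year, while 99.99% allows under an hour. The second function shows that shortening MTTR raises availability just as lengthening MTBF does.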
📄️ Graceful Degradation and Feature Kill Switches
Degrade features rather than fail completely; use kill switches to disable features during incidents.
📄️ Redundancy, Replication, and Failover
Build resilient systems through redundancy and automatic failover.
📄️ DR, RTO/RPO, Backups and Restores
Prepare for disasters with recovery planning, backups, and restore procedures.
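To make the graceful degradation and kill switch idea listed above concrete, here is a minimal sketch. It assumes an in-memory switch registry and a hypothetical recommendations feature; `KillSwitches`, `get_recommendations`, and `FALLBACK_ITEMS` are illustrative names, not part of any particular library. A production system would typically back the switches with a shared configuration store so they can be flipped without a deploy.

```python
from typing import Callable, List, Set

# Static content served when the feature is switched off or its backend fails.
FALLBACK_ITEMS: List[str] = ["bestsellers", "new-arrivals"]


class KillSwitches:
    """In-memory registry of disabled features (hypothetical, single process)."""

    def __init__(self) -> None:
        self._disabled: Set[str] = set()

    def disable(self, feature: str) -> None:
        self._disabled.add(feature)

    def enable(self, feature: str) -> None:
        self._disabled.discard(feature)

    def is_enabled(self, feature: str) -> bool:
        return feature not in self._disabled


def get_recommendations(
    user_id: str,
    switches: KillSwitches,
    fetch: Callable[[str], List[str]],
) -> List[str]:
    """Return personalized items, degrading to a static list instead of failing."""
    if not switches.is_enabled("recommendations"):
        # Operator disabled the feature during an incident.
        return list(FALLBACK_ITEMS)
    try:
        return fetch(user_id)
    except Exception:
        # Backend failure: degrade this one feature, keep the page working.
        return list(FALLBACK_ITEMS)
```

The caller gets a usable result in all three cases (feature on, feature switched off, backend failing), so the page keeps rendering while the optional feature is degraded.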