Availability & Reliability

Availability (is the system running?) and reliability (does it work correctly?) are core concerns. This section covers redundancy patterns, automated failover, graceful degradation under failure, and disaster recovery planning.

What you'll find here

Redundancy, Replication, and Failover — Multiple instances, active-passive/active-active failover.
Graceful Degradation and Feature Kill Switches — Degrade features rather than fail completely.
Disaster Recovery, RTO, RPO, and Backups — Plan for worst-case scenarios.

Key concepts

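Availability: Share of time the system is up and serving requests, usually expressed as an uptime percentage (99.9%, 99.95%, 99.99%). For scale, 99.9% allows about 8.8 hours of downtime per year, while 99.99% allows about 53 minutes.
Reliability: The system performs its intended function correctly over a given period under stated conditions.
Fault Tolerance: The system continues operating, possibly at reduced capacity, despite component failures.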
RTO (Recovery Time Objective): Target time to restore service (e.g., 15 minutes).
RPO (Recovery Point Objective): Maximum acceptable data loss, measured as a window of time before the failure (e.g., 1 hour of data).
MTBF (Mean Time Between Failures): Average operating time between consecutive failures.
MTTR (Mean Time To Recovery): Average time to restore service after a failure.
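
These metrics are linked by simple arithmetic, which the sketch below illustrates. It uses the standard steady-state relation availability = MTBF / (MTBF + MTTR) and converts an uptime percentage into a yearly downtime budget. The function names, the 365-day year, and the sample figures are assumptions chosen for illustration, not values from this guide.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored (assumption)


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability as a fraction: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Yearly downtime budget implied by an uptime percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


def meets_rpo(minutes_since_last_backup: float, rpo_minutes: float) -> bool:
    """True if a failure right now would lose no more data than the RPO allows."""
    return minutes_since_last_backup <= rpo_minutes


if __name__ == "__main__":
    # Hypothetical service: fails every 999 hours on average, 1 hour to recover.
    print(f"availability: {availability(999, 1):.3%}")
    for target in (99.9, 99.95, 99.99):
        budget = downtime_minutes_per_year(target)
        print(f"{target}% uptime allows {budget:.1f} minutes of downtime per year")
    # Hourly backups against a 1-hour RPO, checked 45 minutes after the last backup.
    print("RPO met:", meets_rpo(45, 60))
```

The relation shows why shortening MTTR (faster failover and restores) raises availability as directly as lengthening MTBF does.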

📄️ Graceful Degradation and Feature Kill Switches

Degrade features rather than fail completely; use kill switches to disable features during incidents.

📄️ Redundancy, Replication, and Failover

Build resilient systems through redundancy and automatic failover.

📄️ DR, RTO/RPO, Backups and Restores

Prepare for disasters with recovery planning, backups, and restore procedures.
