DR, RTO/RPO, Backups and Restores
Prepare for disasters with comprehensive recovery strategies.
TL;DR
- DR (Disaster Recovery): plan for the worst case.
- RTO (Recovery Time Objective): how long recovery may take. RPO (Recovery Point Objective): how much data loss is acceptable.
- Backup: a copy of data at a point in time. Restore: recovering data from a backup.
- Backup strategies: full (everything), incremental (changes since the last backup), differential (all changes since the last full backup).
- Frequency: hourly, daily, or weekly, depending on RPO.
- Store backups offsite (not in the same datacenter). Test restores monthly: an untested backup is useless.
- 3-2-1 rule: 3 copies of the data, 2 storage types, 1 copy offsite.
Learning Objectives
- Define RTO and RPO targets
- Design backup strategies
- Implement incremental/differential backups
- Store backups securely offsite
- Test restore procedures
- Automate backup/restore
- Monitor backup health
- Calculate backup cost
Motivating Scenario
Ransomware attack: all data encrypted. You check the backups: the last one is 2 weeks old (2 weeks of data loss). The restore takes 48 hours (RTO). Cost: 2 weeks of lost data plus 2 days of downtime, easily millions. A better plan: daily backups + 4-hour restore + immutable backups = minimal loss.
Core Concepts
RTO vs. RPO Matrix
| | Low RPO (< 1 day) | High RPO (> 1 week) |
|---|---|---|
| Low RTO (< 1h) | $$$$$$$$$$ (expensive) | $$$$$$$ (moderate) |
| High RTO (> 1d) | $$ (cold standby) | $ (tape backup) |
Backup Strategy Options
| Strategy | Frequency | RPO | RTO | Cost |
|---|---|---|---|---|
| Full Daily | 1x/day | 24h | 4h | $$ |
| Incremental Hourly | 24x/day | 1h | 2h | $$$ |
| Continuous Replication | Real-time | 0 | 1min | $$$$ |
| Cold Backup (Tape) | 1x/week | 7d | 24h | $ |
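As a rough illustration, the hypothetical helper below picks the cheapest option from this table that still meets a given pair of targets; the numbers simply mirror the table rows and are not prescriptive.

# Hypothetical helper: pick the cheapest strategy from the table above
# that satisfies the given RPO/RTO targets (in hours).
STRATEGIES = [
    # (name, rpo_hours, rto_hours, relative_cost)
    ("Cold Backup (Tape)", 168, 24, 1),
    ("Full Daily", 24, 4, 2),
    ("Incremental Hourly", 1, 2, 3),
    ("Continuous Replication", 0, 1 / 60, 4),
]

def choose_backup_strategy(rpo_target_hours, rto_target_hours):
    """Return the cheapest strategy whose RPO and RTO meet the targets."""
    candidates = [
        s for s in STRATEGIES
        if s[1] <= rpo_target_hours and s[2] <= rto_target_hours
    ]
    if not candidates:
        raise ValueError("No strategy meets these targets")
    return min(candidates, key=lambda s: s[3])[0]

print(choose_backup_strategy(rpo_target_hours=24, rto_target_hours=8))  # Full Daily
print(choose_backup_strategy(rpo_target_hours=1, rto_target_hours=2))   # Incremental Hourly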
3-2-1 Backup Rule
3 Copies:
- Production database (copy 1)
- Hot backup (copy 2)
- Cold backup / archive (copy 3)
2 Storage Types:
- Primary: Database
- Secondary: S3 / Cloud Storage
- Tertiary: Tape Archive
1 Offsite:
- Keeping all backups in the same region leaves them exposed to a regional disaster
- Keep at least 1 copy in a different region
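A minimal sketch of an automated 3-2-1 check, assuming you keep a small inventory of where each copy lives (the `copies` list and its fields are made up for the example):

# Minimal 3-2-1 audit over a hypothetical inventory of backup copies.
copies = [
    {"name": "production-db", "storage": "database", "region": "us-west-2"},
    {"name": "hot-backup",    "storage": "s3",       "region": "us-west-2"},
    {"name": "cold-archive",  "storage": "tape",     "region": "us-east-1"},
]

def check_3_2_1(copies, primary_region="us-west-2"):
    checks = {
        "3 copies":        len(copies) >= 3,
        "2 storage types": len({c["storage"] for c in copies}) >= 2,
        "1 offsite copy":  any(c["region"] != primary_region for c in copies),
    }
    for rule, ok in checks.items():
        print(f"{'OK  ' if ok else 'FAIL'} {rule}")
    return all(checks.values())

check_3_2_1(copies)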
Implementation
- PostgreSQL
- MySQL
- Backup Monitoring
# Crontab - PostgreSQL backup schedule
# Full backup daily at 2 AM
0 2 * * * /usr/local/bin/backup-full.sh
# Incremental backup (WAL) every hour
0 * * * * /usr/local/bin/backup-wal.sh
# Verification: restore test monthly
0 3 1 * * /usr/local/bin/test-restore.sh
# Upload to S3 (offsite)
30 2 * * * /usr/local/bin/upload-s3.sh
# cleanup.sh - Remove backups older than 30 days
find /backups/postgresql -maxdepth 1 -name "full_backup_*" -mtime +30 -exec rm -rf {} +
# backup-full.sh - Full base backup (compressed tar format)
BACKUP_DIR="/backups/postgresql"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_PATH="$BACKUP_DIR/full_backup_$TIMESTAMP"
# pg_basebackup -D expects a target directory; -Ft -z writes compressed tar files into it
pg_basebackup -D "$BACKUP_PATH" -Ft -z
echo "Full backup completed: $BACKUP_PATH"
# backup-wal.sh - Continuous WAL archiving
ARCHIVE_DIR="/backups/postgresql/wal_archive"
mkdir -p "$ARCHIVE_DIR"
# The server performs the archiving itself, configured in postgresql.conf:
#   archive_mode = on
#   archive_command = 'cp %p /backups/postgresql/wal_archive/%f'
# test-restore.sh - Monthly restore test
TEST_DIR="/tmp/pg_restore_test"
rm -rf $TEST_DIR
mkdir -p $TEST_DIR
# Restore the latest full backup (tar format) into the test directory
LATEST_BACKUP=$(ls -dt /backups/postgresql/full_backup_* | head -1)
tar -xzf "$LATEST_BACKUP/base.tar.gz" -C "$TEST_DIR"
mkdir -p "$TEST_DIR/pg_wal"
tar -xzf "$LATEST_BACKUP/pg_wal.tar.gz" -C "$TEST_DIR/pg_wal"
chmod 700 "$TEST_DIR"
# Start a test instance on a non-default port so it does not clash with production
pg_ctl -D $TEST_DIR -o "-p 5433" start
sleep 5
# Run queries to verify
psql -h localhost -p 5433 -U postgres -c "SELECT COUNT(*) FROM orders;"
# Check result
if [ $? -eq 0 ]; then
  echo "Restore test PASSED"
else
  echo "Restore test FAILED - ALERT!"
  exit 1
fi
# Cleanup
pg_ctl -D $TEST_DIR stop
# upload-s3.sh - Upload to S3 for offsite storage
BACKUP_PATH=${1:-$(ls -dt /backups/postgresql/full_backup_* | head -1)}
BUCKET="company-backups-us-west-2"
aws s3 cp "$BACKUP_PATH" "s3://$BUCKET/postgresql/$(basename $BACKUP_PATH)/" --recursive --sse AES256
# One-time setup: enable bucket versioning so overwritten or deleted backups can be recovered
aws s3api put-bucket-versioning --bucket $BUCKET --versioning-configuration Status=Enabled
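Retention on the offsite copy can also be enforced server-side with an S3 lifecycle rule instead of a cron cleanup; a boto3 sketch (bucket name, prefix, and day counts are examples, not recommendations):

import boto3

s3 = boto3.client("s3")

# Example lifecycle rule: move backups to Glacier after 30 days,
# delete them after 365 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="company-backups-us-west-2",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "postgresql-backup-retention",
                "Filter": {"Prefix": "postgresql/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)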
# Crontab - MySQL backup schedule with xtrabackup
# Full backup daily
0 1 * * * /backup/full_backup.sh
# Incremental backup every 6 hours
0 0,6,12,18 * * * /backup/incremental_backup.sh
# Verify backup integrity daily
0 4 * * * /backup/verify_backup.sh
# full_backup.sh
BACKUP_DIR="/backups/mysql"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_PATH="$BACKUP_DIR/full_$TIMESTAMP"
mkdir -p $BACKUP_PATH
# innobackupex (xtrabackup) - full backup
innobackupex \
  --user=backup_user \
  --password=$BACKUP_PASS \
  --compress \
  --compress-chunk-size=65536 \
  --no-timestamp \
  $BACKUP_PATH
# Decompress now; defer the final --apply-log to restore time so that
# incremental backups can still be applied on top of this base
innobackupex --decompress $BACKUP_PATH
echo "Full backup: $BACKUP_PATH"
# incremental_backup.sh
BACKUP_DIR="/backups/mysql"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_PATH="$BACKUP_DIR/incremental_$TIMESTAMP"
# Get latest full backup
FULL_BACKUP=$(ls -dt $BACKUP_DIR/full_* | head -1)
mkdir -p $BACKUP_PATH
# Incremental backup based on the LSN recorded in the base backup
innobackupex \
  --user=backup_user \
  --password=$BACKUP_PASS \
  --incremental $BACKUP_PATH \
  --incremental-basedir=$FULL_BACKUP \
  --compress \
  --no-timestamp
echo "Incremental backup: $BACKUP_PATH"
# verify_backup.sh - Restore to a throwaway test instance
LATEST_FULL=$(ls -dt /backups/mysql/full_* | head -1)
RESTORE_DIR=/tmp/mysql_restore
rm -rf $RESTORE_DIR && mkdir -p $RESTORE_DIR
# Prepare the base backup with --redo-only so incrementals can still be applied
innobackupex --apply-log --redo-only $LATEST_FULL
# Apply incrementals newer than the base backup, oldest first
for INCR in $(ls -dt /backups/mysql/incremental_* | tac); do
  [ "$INCR" -nt "$LATEST_FULL" ] || continue
  innobackupex --decompress $INCR
  innobackupex --apply-log --redo-only $LATEST_FULL --incremental-dir=$INCR
done
# Final apply-log rolls back uncommitted transactions
innobackupex --apply-log $LATEST_FULL
# Copy the prepared backup into the test data directory
cp -a $LATEST_FULL/. $RESTORE_DIR/
chown -R mysql:mysql $RESTORE_DIR
# Start a test MySQL instance on its own port and socket
mysqld --user=mysql --datadir=$RESTORE_DIR --port=3307 \
  --socket=/tmp/mysql_restore.sock &
sleep 10
# Test queries against the restored instance
mysql --socket=/tmp/mysql_restore.sock -u root -e "SELECT COUNT(*) FROM orders;" 2>/dev/null
if [ $? -eq 0 ]; then
  echo "Backup verification PASSED"
  mysqladmin --socket=/tmp/mysql_restore.sock -u root shutdown
else
  echo "Backup verification FAILED"
  exit 1
fi
import boto3
import json
from datetime import datetime, timedelta
from slack_sdk import WebClient
# Monitor backup health
def check_backup_status():
s3 = boto3.client('s3')
slack = WebClient(token="xoxb-...")
bucket = 'company-backups'
prefix = 'postgresql/'
# List backups
response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
if 'Contents' not in response:
slack.chat_postMessage(
channel='#ops',
text='CRITICAL: No backups found!',
blocks=[{
'type': 'section',
'text': {'type': 'mrkdwn', 'text': '🔴 *No backups found in S3*'}
}]
)
return False
# Check backup age
latest_backup = max(response['Contents'], key=lambda x: x['LastModified'])
backup_age = datetime.now(latest_backup['LastModified'].tzinfo) - latest_backup['LastModified']
if backup_age > timedelta(hours=25):
slack.chat_postMessage(
channel='#ops',
text=f'⚠️ Latest backup is {backup_age.total_seconds()/3600:.1f} hours old',
)
return False
# Check backup size (sanity check)
backup_size_gb = latest_backup['Size'] / (1024**3)
if backup_size_gb < 1: # Should be larger
slack.chat_postMessage(
channel='#ops',
text=f'⚠️ Backup suspiciously small: {backup_size_gb:.1f}GB',
)
return False
print(f"✓ Latest backup: {latest_backup['Key']}")
print(f" Age: {backup_age.total_seconds()/3600:.1f} hours")
print(f" Size: {backup_size_gb:.1f}GB")
return True
def test_restore_monthly():
    """Monthly restore test"""
    import subprocess
    slack = WebClient(token="xoxb-...")
    # Run restore test script
    result = subprocess.run(['/backup/test_restore.sh'], capture_output=True)
if result.returncode != 0:
slack.chat_postMessage(
channel='#ops',
text=f'🔴 *Monthly restore test FAILED*\n```{result.stderr.decode()}```'
)
else:
slack.chat_postMessage(
channel='#ops',
text='✓ Monthly restore test PASSED'
)
def backup_report():
"""Monthly backup report"""
s3 = boto3.client('s3')
# List all backups
response = s3.list_objects_v2(Bucket='company-backups')
total_size = sum(obj['Size'] for obj in response.get('Contents', []))
backup_count = len(response.get('Contents', []))
report = f"""
Backup Report
=============
Total backups: {backup_count}
Total size: {total_size / (1024**3):.1f}GB
Cost estimate: ${(total_size / (1024**3)) * 0.023:.2f}/month (S3)
"""
print(report)
if __name__ == '__main__':
check_backup_status()
test_restore_monthly()
backup_report()
Real-World Scenarios
Scenario 1: Regional Disaster
Production: us-west-2 (primary)
Backup: us-east-1 (offsite)
Disaster: us-west-2 datacenter destroyed
Recovery:
1. Detect (5 min)
2. Restore from us-east-1 backup (4 hours)
3. Update DNS to us-east-1 (5 min)
RTO: ~4.5 hours
RPO: 1 hour (if hourly backups)
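Step 3 (the DNS switch) can be scripted so it is a single command during the incident; a sketch using Route 53 via boto3, assuming the database endpoint is a CNAME (zone ID and hostnames are placeholders):

import boto3

route53 = boto3.client("route53")

# Point the database CNAME at the standby region. A low TTL keeps
# failover propagation fast.
route53.change_resource_record_sets(
    HostedZoneId="Z1234567890ABC",
    ChangeBatch={
        "Comment": "Failover: us-west-2 -> us-east-1",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "db.example.com",
                    "Type": "CNAME",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "db.us-east-1.example.com"}],
                },
            }
        ],
    },
)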
Scenario 2: Ransomware Attack
Attack detected: All production files encrypted
Recovery options:
1. Incremental backups from last 24 hours (infected)
2. Full backup from 2 days ago (known good)
Restore from 2 days ago: 4 hours
Data loss: 2 days
Prevention: Immutable backups (object lock)
Automatic snapshots with encryption key rotation
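The immutability mentioned above maps to S3 Object Lock; a boto3 sketch that puts a compliance-mode retention period on an uploaded backup (the bucket must have been created with Object Lock enabled; names and the 30-day window are examples):

from datetime import datetime, timedelta, timezone
import boto3

s3 = boto3.client("s3")

# In COMPLIANCE mode nobody (including the account root) can delete or
# overwrite the object before the retention date, which is what makes
# the backup ransomware-resistant.
s3.put_object_retention(
    Bucket="company-backups-us-west-2",
    Key="postgresql/full_backup_20240101_020000/base.tar.gz",
    Retention={
        "Mode": "COMPLIANCE",
        "RetainUntilDate": datetime.now(timezone.utc) + timedelta(days=30),
    },
)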
Scenario 3: Accidental Data Deletion
Customer: "Oops, deleted all orders!"
Alert: 30 sec (monitoring detects data loss)
Restore: Select backup from 1 hour ago
Result: 1 hour data loss, 5 min restore
Without backup: Unrecoverable
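"Select backup from 1 hour ago" can be automated by choosing the newest backup taken before the incident; a sketch over the same S3 layout used earlier:

from datetime import datetime, timedelta, timezone
import boto3

def newest_backup_before(incident_time, bucket="company-backups", prefix="postgresql/"):
    """Return the key of the newest backup taken strictly before the incident."""
    s3 = boto3.client("s3")
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    candidates = [
        obj for obj in response.get("Contents", [])
        if obj["LastModified"] < incident_time
    ]
    if not candidates:
        raise RuntimeError("No backup older than the incident found")
    return max(candidates, key=lambda obj: obj["LastModified"])["Key"]

# Example: data was deleted 10 minutes ago; restore the newest backup before that.
incident = datetime.now(timezone.utc) - timedelta(minutes=10)
print(newest_backup_before(incident))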
Common Mistakes
Mistake 1: Never Testing Restore
❌ WRONG: "Backup process works, I'm sure restores work too"
Result: Can't actually restore when needed
✓ CORRECT: Monthly restore drills
- Restore to test instance
- Run queries to verify
- Document any issues
Mistake 2: Backup in Same Datacenter
❌ WRONG: Backups on same storage as primary
Regional disaster = data + backups lost
✓ CORRECT: 3-2-1 rule
- Primary: local
- Secondary: different AZ
- Tertiary: different region
Mistake 3: Unencrypted Backups
❌ WRONG: Backups with plaintext data
Anyone with S3 access can read sensitive data
✓ CORRECT: Encryption at rest + in transit
- S3 encryption (AES-256)
- TLS for upload
- Access control (IAM roles)
Design Checklist
- RTO target defined?
- RPO target defined?
- Backup frequency matches RPO?
- 3-2-1 backup rule implemented?
- Offsite backups (different region)?
- Encryption enabled?
- Monthly restore tests?
- Automated backup processes?
- Monitoring and alerts?
- Backup retention policy?
- Cost calculated and justified?
- Runbook for disaster recovery?
Next Steps
- Define RTO/RPO targets
- Design backup strategy
- Implement automated backups
- Set up offsite storage
- Test restore monthly
- Monitor backup health
- Document runbooks
Disaster Recovery Patterns
Recovery Strategies by Scenario
Data Corruption:
- Detection: 30 minutes (monitoring)
- Recovery: 30 minutes (restore from clean backup)
- RTO: 1 hour, RPO: varies
Hardware Failure:
- Detection: Immediate (health check)
- Recovery: 5 minutes (promote replica)
- RTO: 5 minutes, RPO: 0 (with sync replication)
Regional Outage:
- Detection: 1 minute (DNS health check)
- Recovery: 5 minutes (failover to backup region)
- RTO: 5 minutes, RPO: 1 hour
Human Error (DELETE FROM orders with no WHERE clause):
- Detection: Hours (if noticed at all)
- Recovery: Manual restore from backup
- RTO: 2-4 hours, RPO: 24 hours (point-in-time recovery)
Backup Versioning
Keep multiple backup versions:
Daily: 7 versions (1 week)
Weekly: 4 versions (1 month)
Monthly: 12 versions (1 year)
Yearly: Keep all
Example schedule:
Mon backup → keep daily
Tue backup → keep daily
...
Sun backup → keep as weekly
Month 1 → keep as monthly
Month 2 → keep as monthly
...
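A sketch of the rotation logic, using the cutoffs from the list above (Sunday backups promoted to weekly, first-of-month to monthly); treat the exact rules as illustrative:

from datetime import date, timedelta

def retention_tier(backup_date, today=None):
    """Classify a backup into a retention tier, or None if it can be deleted."""
    today = today or date.today()
    age_days = (today - backup_date).days
    if backup_date.month == 1 and backup_date.day == 1:
        return "yearly"                                # yearly backups kept forever
    if age_days <= 7:
        return "daily"                                 # last 7 daily versions
    if age_days <= 31 and backup_date.weekday() == 6:
        return "weekly"                                # Sunday backups kept for a month
    if age_days <= 365 and backup_date.day == 1:
        return "monthly"                               # first-of-month kept for a year
    return None

print(retention_tier(date.today() - timedelta(days=3)))  # daily
print(retention_tier(date(2024, 1, 1)))                  # yearly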
Testing Restore Procedures
Types of tests:
- Backup integrity test: Verify backup file is not corrupted
- Restore to test environment: Actually restore and verify data
- Failover drill: Switch to backup system, run production load
- Rollback test: Can we revert bad changes?
Frequency:
- Automated daily: Backup integrity
- Weekly: Restore to test environment
- Monthly: Failover drill
- Quarterly: Full DR exercise
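The automated daily integrity test can be as simple as recomputing a checksum and comparing it with one written at backup time; a sketch assuming a sidecar `.sha256` manifest per backup (a hypothetical convention):

import hashlib

def sha256sum(path, chunk_size=1024 * 1024):
    """Stream the file so large backups don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_backup(backup_path, manifest_path):
    """Compare the backup's checksum against the one recorded at backup time."""
    with open(manifest_path) as f:
        expected = f.read().split()[0]   # "checksum  filename" format
    actual = sha256sum(backup_path)
    if actual != expected:
        raise RuntimeError(f"Checksum mismatch for {backup_path}")
    print(f"OK {backup_path}")

# verify_backup("/backups/postgresql/full_backup_20240101/base.tar.gz",
#               "/backups/postgresql/full_backup_20240101/base.tar.gz.sha256")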
Cold vs. Warm vs. Hot Standby
Cold Standby:
- Backup system powered off
- Low cost, high RTO
- Requires manual or automated provisioning
- Good for: Non-critical services
Warm Standby:
- Backup system running but not serving traffic
- Medium cost, medium RTO
- Automatic failover possible
- Good for: Important services
Hot Standby (Active-Active):
- Both systems serve traffic
- High cost, low RTO
- Requires conflict resolution
- Good for: Critical services
Backup Encryption and Security
Encryption Best Practices
# Encrypt backup at rest (SSE-S3)
aws s3api put-object \
  --bucket company-backups \
  --key postgresql/backup.sql.gz \
  --server-side-encryption AES256 \
  --body backup.sql.gz
# Or use AWS KMS
aws s3api put-object \
  --bucket company-backups \
  --key postgresql/backup.sql.gz \
  --server-side-encryption aws:kms \
  --ssekms-key-id arn:aws:kms:region:account:key/id \
  --body backup.sql.gz
# TLS for upload
curl --tlsv1.2 -u user:pass -T backup.sql.gz \
https://backup-server.example.com/uploads/
Access Control
Principle of least privilege:
- Backup creator: Read/Write
- Restore operator: Read only (for backups > 30 days old)
- Auditor: List and metadata only
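In AWS, those roles map to separate IAM policies; a sketch of the backup-creator policy via boto3 (bucket and policy names are placeholders; the restore and auditor policies would be analogous with Get/List-only actions):

import json
import boto3

iam = boto3.client("iam")

# Backup creator: may write new objects and list the bucket, but not delete.
backup_creator_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::company-backups",
                "arn:aws:s3:::company-backups/*",
            ],
        }
    ],
}

iam.create_policy(
    PolicyName="backup-creator",
    PolicyDocument=json.dumps(backup_creator_policy),
)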
Off-site Storage
Protect against regional disasters:
- Primary: us-west-2
- Backup: us-east-1 (different region)
- Archive: Cloud vault or tape (immutable)
Multi-cloud backup:
- AWS S3 for primary
- Google Cloud Storage for secondary
- Azure Backup for tertiary
Recovery Metrics Dashboard
Track:
- Last successful backup timestamp
- Backup size and growth trend
- Restore test success rate
- Time to restore (measured weekly)
- Data loss estimate (RPO)
- Recovery cost estimate
Alerts:
- Backup failed (24h no successful backup)
- Backup size anomaly (2x increase)
- Restore test failure
- Backup older than RPO target
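One way to feed such a dashboard is to push gauges from the backup and restore-test jobs; a sketch using prometheus_client and a Pushgateway (gateway address, metric names, and values are examples):

import time
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()

last_success = Gauge("backup_last_success_timestamp_seconds",
                     "Unix time of the last successful backup", registry=registry)
backup_size = Gauge("backup_size_bytes",
                    "Size of the last backup in bytes", registry=registry)
restore_seconds = Gauge("restore_test_duration_seconds",
                        "Duration of the last restore test", registry=registry)

# Set these from the backup/restore-test jobs; the values here are placeholders.
last_success.set(time.time())
backup_size.set(42 * 1024**3)
restore_seconds.set(3600)

push_to_gateway("pushgateway.internal:9091", job="backup", registry=registry)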
Conclusion
Disaster Recovery is critical for production systems:
- RTO/RPO targets drive architecture
- Backups are useless if untested
- Multiple backup versions needed
- Encrypt and store offsite
Implement:
- Daily backups
- Monthly restore tests
- Offsite storage (different region)
- Automated monitoring
- Documented runbooks
Cost is small vs. downtime risk. Budget accordingly.
Incremental vs. Differential Backups
Full backup (weekly):
Monday: Full backup (all data) - 2 hours, 50GB
Incremental (daily, faster):
Tuesday: Incremental (only Tuesday changes) - 10 min, 5GB
Wednesday: Incremental (Wed changes) - 10 min, 6GB
Thursday: Incremental (Thu changes) - 10 min, 4GB
To restore Thursday: Full (Monday) + Incremental (Tue) + Incremental (Wed) + Incremental (Thu)
Restore time: 2.5 hours
Differential (daily, restore faster):
Tuesday: Differential (all changes since Mon) - 15 min, 8GB
Wednesday: Differential (all changes since Mon) - 20 min, 10GB
Thursday: Differential (all changes since Mon) - 18 min, 9GB
To restore Thursday: Full (Monday) + Differential (Thursday)
Restore time: ~2.3 hours (faster than incremental)
Choice:
- Incremental: Slower restore, faster backup, less storage
- Differential: Faster restore, slower backup, more storage
- Hybrid: Full weekly + Differential daily
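The restore-time difference comes from the length of the chain that must be replayed; a small sketch that returns the chain needed to restore a given day under each scheme (labels mirror the example week above):

# Which backups must be applied to restore as of a given day?
week = ["Mon", "Tue", "Wed", "Thu"]

def incremental_chain(restore_day):
    """Full backup plus every incremental up to the restore day."""
    idx = week.index(restore_day)
    return ["Full(Mon)"] + [f"Incr({d})" for d in week[1:idx + 1]]

def differential_chain(restore_day):
    """Full backup plus only the latest differential."""
    if restore_day == "Mon":
        return ["Full(Mon)"]
    return ["Full(Mon)", f"Diff({restore_day})"]

print(incremental_chain("Thu"))   # ['Full(Mon)', 'Incr(Tue)', 'Incr(Wed)', 'Incr(Thu)']
print(differential_chain("Thu"))  # ['Full(Mon)', 'Diff(Thu)']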
Backup Validation Process
Monthly validation checklist:
☐ Backup file exists and is not corrupted
☐ Backup file is encrypted
☐ Backup is no older than X days (detect stale backups)
☐ Restore to test environment succeeds
☐ Data integrity checks pass (checksums, record counts)
☐ Application can start with restored data
☐ Basic smoke tests pass
☐ Document any issues found