Multi-Region Active-Active vs Active-Passive

TL;DR

Active-Passive: One region serves traffic; standby regions replicate data asynchronously. On primary failure, traffic fails over to standby (RTO ~5-15 minutes). Simpler consistency model; lower cost. Active-Active: All regions serve traffic simultaneously. Lower latency globally; requires eventual consistency. More complex: split-brain handling, conflict resolution, higher replication overhead. Choose active-passive for simplicity, cost, and strong consistency. Migrate to active-active only if regional latency is a business requirement and you can manage complexity.

Learning Objectives

  • Understand geographic failover architectures and their trade-offs
  • Design eventual consistency models for active-active deployments
  • Implement conflict resolution strategies for multi-region writes
  • Manage RPO/RTO targets across regional topologies
  • Route global traffic based on geography and health
  • Plan capacity and cost for each topology

Motivating Scenario

A global SaaS company runs in AWS us-east-1 (primary) with users in Europe (high latency). Users complain: "The app feels sluggish, takes 200ms to save data." Adding a European region (eu-west-1) reduces latency to 20ms. But now the company must choose:

  • Active-Passive: EU users read from us-east-1 (still slow) or from a local replica (asynchronous, eventually consistent). Simpler and cheaper, but the latency problem remains for writes.
  • Active-Active: EU region accepts writes, replicates back to US. Writes are fast (20ms), but now data can diverge across regions. Conflict resolution complexity increases.

Core Concepts

Active-Passive (Simpler)

One active region, others on standby.

Architecture:

  • Primary region (us-east-1): serves all traffic, handles writes
  • Standby region (eu-west-1): replicates data asynchronously, read-only
  • On primary failure: traffic fails over to standby (data lag acceptable)

Characteristics:

  • RPO: 5-15 minutes (depends on replication lag)
  • RTO: 5-15 minutes (detect failure + promote standby)
  • Consistency: Strong (single writer, read-after-write consistent)
  • Cost: Lower (standby doesn't process traffic)
  • Complexity: Low (single source of truth)

Active-Active (Faster, More Complex)

All regions serve traffic, eventually consistent.

Architecture:

  • Region A (us-east-1): accepts writes, replicates to B
  • Region B (eu-west-1): accepts writes, replicates to A
  • Conflict resolution: last-write-wins, application-specific logic, or CRDTs

Characteristics:

  • RPO: Seconds (continuous replication)
  • RTO: Near-zero (traffic continues in the surviving regions)
  • Consistency: Eventual (data may diverge temporarily)
  • Cost: Higher (both regions fully operational)
  • Complexity: High (conflict resolution, split-brain scenarios)

Consistency Challenges

When two regions accept writes, conflicts occur:

Time   us-east-1                          eu-west-1
-----  ---------------------------------  ---------------------------------
T0     User A writes Name="Bob"
T1                                        User A writes Name="Charlie"
T2                                        Replicate: Name="Bob" arrives
T3     Replicate: Name="Charlie" arrives
T4     Conflict! Bob vs Charlie?          Conflict! Bob vs Charlie?

Resolution strategies (the first and third are sketched in code after this list):

  1. Last-Write-Wins: Keep whichever update arrived last (simple, loses data)
  2. Vector Clocks: Track causality, detect true conflicts (complex)
  3. CRDTs: Commutative data structures that converge (e.g., sets)
  4. Application Logic: Business rules (e.g., "newer timestamp wins")
  5. Manual Resolution: Ask user to resolve (bad UX)
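
To make strategies 1 and 3 concrete, here is a minimal, illustrative sketch (not tied to any particular database): a last-write-wins register and a grow-only set CRDT. The class and field names are made up for illustration.

# Minimal sketch: strategy 1 (last-write-wins) and strategy 3 (a grow-only set CRDT).
from dataclasses import dataclass

@dataclass
class LwwRegister:
    """Last-write-wins register: keep the value with the newest timestamp."""
    value: str
    timestamp: float  # e.g., epoch seconds recorded by the writing region

    def merge(self, other: "LwwRegister") -> "LwwRegister":
        # The older write is silently discarded -- this is the data loss
        # mentioned in strategy 1.
        return self if self.timestamp >= other.timestamp else other

@dataclass
class GSet:
    """Grow-only set CRDT: merge is a set union, so replicas always converge."""
    items: set

    def merge(self, other: "GSet") -> "GSet":
        return GSet(self.items | other.items)

# Replaying the timeline above:
us = LwwRegister(value="Bob", timestamp=0.0)      # T0 in us-east-1
eu = LwwRegister(value="Charlie", timestamp=1.0)  # T1 in eu-west-1
print(us.merge(eu).value)  # "Charlie" in both regions once replication settles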

Traffic Routing

Global Load Balancer: Routes users to the nearest/healthiest region (a minimal routing sketch follows this list)

  • GeoDNS: DNS returns different IPs based on client location
  • Anycast: Same IP from multiple locations, BGP routes to closest
  • Application-layer routing: API gateway in each region
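
As a rough illustration of the application-layer option, the sketch below picks a region by client country and falls back to any healthy region. The endpoints and country-to-region map are placeholder assumptions, not values from this article.

# Minimal sketch of application-layer geographic routing with a health fallback.
REGION_ENDPOINTS = {
    "us-east-1": "https://us.api.example.com",   # placeholder endpoints
    "eu-west-1": "https://eu.api.example.com",
}
COUNTRY_TO_REGION = {"US": "us-east-1", "DE": "eu-west-1", "FR": "eu-west-1"}

def pick_region(client_country: str, healthy_regions: set[str]) -> str:
    preferred = COUNTRY_TO_REGION.get(client_country, "us-east-1")
    if preferred in healthy_regions:
        return preferred
    # Fall back to any healthy region (in active-passive: the standby).
    return next(iter(healthy_regions))

print(pick_region("DE", {"us-east-1", "eu-west-1"}))  # eu-west-1 while healthy
print(pick_region("DE", {"us-east-1"}))               # us-east-1 during an EU outage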

Practical Example

# Active-Passive: us-east-1 (primary) + eu-west-1 (standby)

# Primary Region (us-east-1): Multi-AZ PostgreSQL cluster
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgres-us-primary
  namespace: data-layer
spec:
  instances: 3
  postgresql:
    parameters:
      wal_level: "replica"
      max_wal_senders: "10"
---
# Standby Region (eu-west-1): Read-only replica of the primary
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgres-eu-standby
  namespace: data-layer
spec:
  instances: 1
  primaryUpdateStrategy: unsupervised
  postgresql:
    parameters:
      hot_standby: "on"
      # This cluster replicates from us-east-1
      # primary_conninfo: "host=postgres-us-primary.us-east-1 port=5432"
  backup:
    retentionPolicy: "30d"
    barmanObjectStore:
      destinationPath: s3://backups/postgres-eu
      s3Credentials:
        accessKeyId:
          name: aws-credentials
          key: access_key
        secretAccessKey:
          name: aws-credentials
          key: secret_key
---
# Application in primary region (accepts writes)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: topology.kubernetes.io/region
                    operator: In
                    values:
                      - us-east-1
      containers:
        - name: app
          image: myapp:1.0.0
          env:
            - name: DATABASE_HOST
              value: "postgres-us-primary.data-layer"
            - name: DATABASE_WRITE_HOST
              value: "postgres-us-primary.data-layer"
            - name: DATABASE_READ_HOST
              value: "postgres-us-primary.data-layer"  # Primary for strong consistency
---
# Application in standby region (read-only)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-eu-read-only
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: topology.kubernetes.io/region
                    operator: In
                    values:
                      - eu-west-1
      containers:
        - name: app
          image: myapp:1.0.0
          env:
            - name: DATABASE_HOST
              value: "postgres-eu-standby.data-layer"
            - name: DATABASE_WRITE_HOST
              value: "postgres-us-primary.data-layer"  # Writes still go to the primary
            - name: DATABASE_READ_HOST
              value: "postgres-eu-standby.data-layer"  # Reads from the local replica
            - name: READ_ONLY_MODE
              value: "true"
---
# Global Load Balancer: Route to primary (or standby if the primary is down)
apiVersion: v1
kind: Service
metadata:
  name: myapp-global
spec:
  type: LoadBalancer
  selector:
    app: myapp
  ports:
    - protocol: TCP
      port: 443
      targetPort: 8080
# In reality, use AWS Route53/CloudFront + GeoDNS:
# - CloudFront: cache at the edge, route to the nearest region
# - Route53: health-check the primary, fail over to the standby on failure
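
The Route53 side of this could be wired roughly as below, using boto3 failover records. The hosted zone ID, health check ID, record name, and load balancer hostnames are placeholders; this is one plausible sketch of primary/secondary DNS failover, not a prescribed configuration.

# Sketch of Route53 DNS failover (primary -> standby) with boto3.
import boto3

route53 = boto3.client("route53")

def upsert_failover_record(role: str, target: str, health_check_id: str | None) -> None:
    record = {
        "Name": "app.example.com",              # placeholder record name
        "Type": "CNAME",
        "TTL": 60,
        "SetIdentifier": f"app-{role.lower()}",
        "Failover": role,                        # "PRIMARY" or "SECONDARY"
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id:                          # only the primary is health-checked
        record["HealthCheckId"] = health_check_id
    route53.change_resource_record_sets(
        HostedZoneId="ZEXAMPLE123",              # placeholder hosted zone
        ChangeBatch={"Changes": [{"Action": "UPSERT", "ResourceRecordSet": record}]},
    )

upsert_failover_record("PRIMARY", "lb.us-east-1.example.com", "hc-placeholder-id")
upsert_failover_record("SECONDARY", "lb.eu-west-1.example.com", None)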

Traffic Flow (Active-Passive):

  1. US user → Route53 (us-east-1 healthy) → myapp in us-east-1
  2. EU user → Route53 (us-east-1 healthy) → myapp in us-east-1 (cross-region, higher latency)
  3. US primary fails → Route53 (health check fails) → failover to eu-west-1
  4. All users → myapp in eu-west-1 (temporary high latency for US users)

RPO/RTO:

  • RPO: 5 minutes (replication lag)
  • RTO: 2-3 minutes (health check + failover)

When to Use / When NOT to Use

Active-Passive vs Active-Active: Trade-Offs

Dimension     Active-Passive                        Active-Active
-----------   -----------------------------------   -----------------------------------
RPO           5-15 minutes (replication lag)        Seconds (continuous replication)
RTO           5-15 minutes (detect + promote)       Near-zero (other regions keep serving)
Consistency   Strong (single writer)                Eventual (temporary divergence)
Latency       Local only for primary-region users   Local for all users
Cost          Lower (standby mostly idle)           Higher (all regions fully sized)
Complexity    Low (single source of truth)          High (conflicts, split-brain)
Best Practices
  1. Choose ACTIVE-PASSIVE when: Users are primarily in one region (e.g., US-based SaaS). Strong consistency is critical (financial data). Cost and operational simplicity matter. An RTO of 10-15 minutes is acceptable.
  2. Choose ACTIVE-ACTIVE when: The user base is global and latency matters (every ms counts). You can manage eventual consistency (chat, notifications, less critical data). You have the ops expertise for conflict resolution. Regional autonomy/sovereignty is required.
  3. DO: Route by Geography (GeoDNS): US users → us-east-1, EU users → eu-west-1. Each region is optimized for local latency.
  4. DO: Detect and Handle Conflicts: Last-write-wins with timestamps, application-specific conflict resolution, or CRDT data structures.
  5. DO: Monitor Replication Lag: In active-active, track lag between regions (typically 100-500ms). Alert if lag exceeds the SLA (e.g., >1s); a lag-check sketch follows the Anti-Patterns list.
  6. DO: Test Cross-Region Failover: Quarterly, simulate a region failure. Verify failover time, data integrity, and user impact.
Anti-Patterns
  1. Don't choose ACTIVE-PASSIVE when: Users are spread globally and latency is critical, you have the resources for active-active complexity, or you need zero-downtime failover.
  2. Don't choose ACTIVE-ACTIVE when: Data must be strongly consistent (banking), conflicts are unacceptable (inventory counts), cost is constrained, or the team is small.
  3. DON'T: Route randomly or always to a single region; users suffer cross-region latency.
  4. DON'T: Ignore conflicts; duplicate/corrupt data spreads silently.
  5. DON'T: Assume replication is instant; users read stale data without warning.
  6. DON'T: Assume failover will work; bugs get discovered during a real outage.
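
The lag check referenced in best practice 5 could look roughly like this: a query against the standby using PostgreSQL's pg_last_xact_replay_timestamp(). The connection string, monitoring user, and alert hook are placeholder assumptions.

# Minimal lag-check sketch for a PostgreSQL streaming replica (run against the standby).
import psycopg2

ALERT_THRESHOLD_SECONDS = 1.0  # e.g., the >1s SLA mentioned above

def check_replication_lag(standby_dsn: str) -> float:
    with psycopg2.connect(standby_dsn) as conn, conn.cursor() as cur:
        # Approximate lag: time since the last replayed transaction on the standby.
        cur.execute(
            "SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), 0)"
        )
        return float(cur.fetchone()[0])

lag = check_replication_lag("host=postgres-eu-standby.data-layer dbname=app user=monitor")
if lag > ALERT_THRESHOLD_SECONDS:
    print(f"ALERT: replication lag {lag:.1f}s exceeds SLA")  # wire to your alerting system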

Patterns & Pitfalls

  • Pitfall (single region): All infrastructure in us-east-1. A regional outage means total downtime, EU users suffer 100ms+ latency, and there is no geographic redundancy.
  • Pitfall (untested failover): Replication is set up but failover is never exercised. During a real failure, promotion fails (wrong DNS records, stale config) and downtime extends.
  • Pitfall (active-active without a conflict strategy): Two regions accept writes, but there is no plan for conflicts. Data corruption, lost updates, customer confusion.
  • Pattern (warm standby): Primary in us-east-1, warm standby in eu-west-1. Replication lag is acceptable (5-15 min RPO); on failure, promote and update DNS. Simple and proven.
  • Pattern (CRDTs): Use conflict-free replicated data types (sets, maps, counters). Writes in any region auto-converge; no manual conflict resolution needed.
  • Pattern (incremental adoption): Start with a single region plus backups (active-passive). Phase 1: add a standby region. Phase 2: local caching (CDN). Phase 3: active-active for a subset of data.
  • Pitfall (unacknowledged staleness): Active-active, EU and US write the same record. A 2-second lag means a user sees stale data, and the app doesn't warn them.
  • Pattern (write to primary, read locally): Writes go to the primary (strong consistency); reads use the local replica (lower latency, eventual consistency). Best of both worlds (sketched in code below).
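
A minimal sketch of that last pattern, assuming the DATABASE_WRITE_HOST / DATABASE_READ_HOST environment variables from the manifests above and a hypothetical profiles table:

# Write-to-primary / read-locally split, using the env vars from the Deployments above.
import os
import psycopg2

WRITE_DSN = f"host={os.environ.get('DATABASE_WRITE_HOST', 'postgres-us-primary.data-layer')} dbname=app"
READ_DSN = f"host={os.environ.get('DATABASE_READ_HOST', 'postgres-eu-standby.data-layer')} dbname=app"

def save_profile(user_id: int, name: str) -> None:
    # Writes always cross to the primary region (strong consistency).
    with psycopg2.connect(WRITE_DSN) as conn, conn.cursor() as cur:
        cur.execute("UPDATE profiles SET name = %s WHERE id = %s", (name, user_id))

def get_profile(user_id: int) -> str | None:
    # Reads stay in-region (low latency, possibly a few seconds stale).
    with psycopg2.connect(READ_DSN) as conn, conn.cursor() as cur:
        cur.execute("SELECT name FROM profiles WHERE id = %s", (user_id,))
        row = cur.fetchone()
        return row[0] if row else None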

Design Review Checklist

  • Is single-region SPOF eliminated (backup region exists)?
  • Is the choice of active-passive vs active-active justified by business needs?
  • Are RTO/RPO targets defined and achievable with chosen topology?
  • Is replication automated and monitored (lag tracking, alerts)?
  • If active-active: is conflict resolution strategy defined and tested?
  • Is geographic routing (GeoDNS) configured and latency-optimized?
  • Are region failures tested quarterly (failover drills)?
  • Is data consistency verified post-failover?
  • Are costs justified by resilience benefits (active-active is expensive)?
  • Is asynchronous replication lag transparent to users?
  • Are critical/consistent data replicated synchronously?
  • Is cross-region bandwidth cost understood and budgeted?
  • Can users continue operating in degraded mode (1 region down)?
  • Are split-brain scenarios detected and resolved automatically?
  • Is operational runbook for regional failover documented and practiced?

Self-Check

  1. Right now, what happens if your primary region fails? Do you have a runbook?
  2. What's your RTO? If "we don't know," that's a problem. Ask business.
  3. Are your users global? If latency matters to them, active-active may be justified.
  4. Can you tolerate eventual consistency? If not, active-passive only.
  5. Have you tested failover? If not, do it this month in staging.

Next Steps

  1. Define RTO/RPO — Interview business, document targets
  2. Start active-passive — Primary + warm standby. Simpler, proven
  3. Add geographic routing — GeoDNS to local regions
  4. Test failover quarterly — Automate DR drills
  5. If needed, migrate to active-active — Add conflict resolution, CRDTs
  6. Monitor replication lag — Alert if lag exceeds SLA
  7. Document runbooks — Who decides to failover? What's the procedure?

References

  1. AWS: Multi-Region Active-Active Architectures ↗️
  2. PostgreSQL: Warm Standby Servers ↗️
  3. CRDT: Conflict-free Replicated Data Types ↗️
  4. Amazon Dynamo: Eventual Consistency Paper ↗️
  5. Google Cloud: Global Load Balancing ↗️