
Single Region, Multi-Zone Deployments

Achieve high availability and resilience by distributing workloads across failure-isolated availability zones within a single geographic region.

TL;DR

Single-region multi-zone (SRMZ) deployments distribute application components across 2-3 availability zones within one AWS/Azure/GCP region. This provides fault tolerance against zone failures while maintaining low latency (typically 1-2 ms between zones), synchronous consistency, and lower operational complexity than multi-region deployments. Use SRMZ when you need 99.9%-99.95% availability without cross-region complexity.

Learning Objectives

By the end of this article, you'll understand:

  • How availability zones isolate infrastructure failures
  • Zone-aware load balancing and traffic distribution strategies
  • Data replication and consistency models across zones
  • Network latency implications and optimization
  • When SRMZ is sufficient versus requiring multi-region approaches

Motivating Scenario

Your SaaS platform runs on three servers in a single AWS availability zone. During a 45-minute zone outage, your service becomes completely unavailable, impacting 500+ customers. Post-incident analysis reveals that distributing across two additional zones would have eliminated this single point of failure with minimal latency penalty. The question: how do you design a topology that survives zone failures while keeping operations manageable?

Core Concepts

Availability Zones as Failure Domains

Availability zones are isolated infrastructure segments with independent power, cooling, and networking. A zone failure (network partition, hardware failure, natural disaster) doesn't affect peer zones. Cloud providers typically offer 3-4 zones per region; deploying across 2-3 of them usually gives a good cost/resilience tradeoff. A short sketch for enumerating a region's zones follows the list below.

Key characteristics:

  • Independent failure domains
  • Low inter-zone latency (typically 1-2 ms round trip)
  • Inter-zone data transfer often incurs per-GB charges (e.g., cross-AZ transfer fees on AWS)
  • Automatic isolation on failure
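
As a hedged illustration, the sketch below enumerates a region's available zones with boto3; it assumes boto3 is installed and AWS credentials are configured, and the region name is purely illustrative.

import boto3

# Assumption: boto3 is installed and AWS credentials are configured.
# Enumerates the zones a region exposes so you can decide how many
# failure domains to spread instances across.
ec2 = boto3.client('ec2', region_name='us-east-1')  # illustrative region
response = ec2.describe_availability_zones(
    Filters=[{'Name': 'state', 'Values': ['available']}]
)
for zone in response['AvailabilityZones']:
    print(zone['ZoneName'], zone['State'])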

Zone-Aware Load Balancing

Regional load balancers distribute traffic with zone awareness, maintaining client affinity when possible and rebalancing when a zone fails. Modern implementations use the health-check models and distribution strategies below; a minimal routing sketch follows the lists.

Health Check Models:

  • Per-zone health checks detect zone degradation
  • Graceful connection draining removes failed instances without dropping requests
  • Active health checks (every 5-30 seconds) detect failures within SLA bounds

Distribution Strategies:

  • Round-robin with zone awareness
  • Least-connections routing (optionally combined with sticky sessions)
  • Weighted routing for A/B testing across zones
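
To make the routing mechanics concrete, here is a minimal sketch of zone-aware round-robin with per-zone health state. It models routing logic only; the zone names, addresses, and health-check wiring are illustrative assumptions, not any specific load balancer's API.

import itertools

class ZoneAwareBalancer:
    """Round-robin across backends, skipping zones marked unhealthy."""

    def __init__(self, backends_by_zone):
        # backends_by_zone: {'us-east-1a': ['10.0.1.5'], ...}
        self.backends_by_zone = backends_by_zone
        self.healthy = {zone: True for zone in backends_by_zone}
        self._rebuild()

    def _rebuild(self):
        # Flatten backends from healthy zones into one rotation.
        pool = [addr
                for zone, addrs in self.backends_by_zone.items()
                if self.healthy[zone]
                for addr in addrs]
        self._cycle = itertools.cycle(pool) if pool else None

    def mark_zone(self, zone, is_healthy):
        # Called by the health checker (e.g., every 5-30 seconds).
        self.healthy[zone] = is_healthy
        self._rebuild()

    def next_backend(self):
        if self._cycle is None:
            raise RuntimeError('no healthy zones')
        return next(self._cycle)

lb = ZoneAwareBalancer({
    'us-east-1a': ['10.0.1.5'],
    'us-east-1b': ['10.0.2.5'],
    'us-east-1c': ['10.0.3.5'],
})
lb.mark_zone('us-east-1b', False)  # simulate a zone failure
print(lb.next_backend())  # rotation now covers 1a and 1c only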

Data Replication Strategies

Single-region deployments enable synchronous replication with acceptable latency. Two common patterns follow, with a toy latency model sketched after them:

Write-Through Consistency:

  • Primary zone writes synchronously to 2+ replica zones
  • Survives a single-zone loss with no acknowledged-write data loss
  • Replication latency well under 100 ms (typically a few ms between zones)
  • Higher write latency (synchronous network round-trips)
  • Typical for financial/critical data

Read Replicas with Eventual Consistency:

  • Primary zone writes, asynchronous replication
  • Lower write latency, temporary inconsistency
  • Sufficient for analytics, caching, non-critical reads
  • Replica lag: 10-100ms typical
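
The write-latency tradeoff between the two patterns can be seen with a toy model; the round-trip times below are illustrative assumptions for same-region zones, not measurements.

# Toy model: a sync write pays the slowest replica's round trip,
# while an async write acknowledges locally and replicates later.
# RTT values are illustrative assumptions.
replica_rtt_ms = {'us-east-1b': 1.2, 'us-east-1c': 1.8}
local_write_ms = 0.5

sync_write_ms = local_write_ms + max(replica_rtt_ms.values())
async_write_ms = local_write_ms  # replicas lag behind by 10-100 ms

print(f"sync write:  ~{sync_write_ms:.1f} ms, no acknowledged-write loss")
print(f"async write: ~{async_write_ms:.1f} ms, un-replicated writes at risk on zone loss")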

Practical Example

from flask import Flask, request, jsonify
import os
from datetime import datetime

app = Flask(__name__)

ZONE = os.getenv('AVAILABILITY_ZONE', 'us-east-1a')

# Database replication tracking
class ZoneAwareDB:
    def __init__(self, zone):
        self.zone = zone
        self.replicas = {
            'us-east-1a': 'db-master.us-east-1a.rds.amazonaws.com',
            'us-east-1b': 'db-replica.us-east-1b.rds.amazonaws.com',
            'us-east-1c': 'db-replica.us-east-1c.rds.amazonaws.com'
        }

    def write(self, data):
        """Write with zone awareness - sync to replicas"""
        timestamp = datetime.utcnow().isoformat()
        # Write to the master in the current zone
        master_addr = self.replicas[self.zone]

        try:
            # Simulate synchronous replication to peer zones
            other_zones = [z for z in self.replicas if z != self.zone]

            # A real implementation would block here until replicas acknowledge
            print(f"Writing to {master_addr}")
            print(f"Replicating to {other_zones}")

            return {
                'zone': self.zone,
                'timestamp': timestamp,
                'replicated': True
            }
        except Exception as e:
            return {'error': str(e), 'replicated': False}

    def read_local(self):
        """Read from the local zone replica"""
        return {
            'zone': self.zone,
            'consistency': 'eventual',
            'latency_ms': '<1'
        }

db = ZoneAwareDB(ZONE)

@app.route('/health', methods=['GET'])
def health():
    """Zone-aware health check"""
    return jsonify({
        'status': 'healthy',
        'zone': ZONE,
        'timestamp': datetime.utcnow().isoformat()
    })

@app.route('/write', methods=['POST'])
def write_data():
    """Replicate writes across zones"""
    data = request.get_json()
    result = db.write(data)
    return jsonify(result), (201 if result.get('replicated') else 500)

@app.route('/read', methods=['GET'])
def read_data():
    """Read from the local zone"""
    result = db.read_local()
    return jsonify(result)

@app.route('/failover-status', methods=['GET'])
def failover_status():
    """Report replication lag and failover readiness"""
    return jsonify({
        'zone': ZONE,
        'replicas_healthy': 2,
        'replication_lag_ms': 5,
        'failover_ready': True
    })

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
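
Assuming the service above is running locally on port 8080, a quick smoke test with the requests library (a hypothetical client, not part of the service) could look like this:

import requests

# Assumption: the Flask app above is running on localhost:8080.
base = 'http://localhost:8080'
print(requests.get(f'{base}/health').json())
print(requests.post(f'{base}/write', json={'order_id': 42}).json())
print(requests.get(f'{base}/read').json())
print(requests.get(f'{base}/failover-status').json())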

When to Use / When Not to Use

Use Single-Region Multi-Zone When:
  1. You need 99.9%-99.95% availability (roughly 8.8 down to 4.4 hours of downtime per year; see the arithmetic sketch after these lists)
  2. You run latency-sensitive applications (single-digit-millisecond inter-zone latency is acceptable)
  3. Data consistency must be strongly guaranteed
  4. Operational simplicity is important (single region management)
  5. Cost-conscious with acceptable availability targets
  6. Compliance allows single-region deployment
Use Multi-Region When:
  1. You need 99.99%+ availability (disaster recovery SLA)
  2. Geographic data residency required across regions
  3. Serving global users with local latency requirements
  4. Regulatory requirements force geographic distribution
  5. You can tolerate eventual consistency and cross-region replication lag
  6. Have budget for multi-region infrastructure
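
The downtime figures above follow directly from the availability percentage; this small sketch shows the arithmetic.

# Downtime budget per year for a given availability target.
HOURS_PER_YEAR = 365 * 24  # 8760

for availability in (0.999, 0.9995, 0.9999):
    downtime_hours = HOURS_PER_YEAR * (1 - availability)
    print(f"{availability:.2%}: {downtime_hours:.2f} hours/year of allowed downtime")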

Patterns & Pitfalls

Design Review Checklist

  • Application instances distributed across 3 zones (or 2 if cost-critical)
  • Load balancer routes traffic with zone awareness and health checks
  • Database replication strategy documented (sync vs eventual consistency)
  • Failover tested - application recovers when a single zone fails (a minimal simulation follows this checklist)
  • No single points of failure outside zones (NAT, load balancer, DNS)
  • Acceptable data-loss window during replication lag is understood and tested
  • Connection pooling handles zone-to-zone communication
  • Monitoring tracks replication lag and zone health
  • Disaster recovery plan documents manual intervention steps
  • Cost analysis confirmed multi-zone cost-benefit ratio
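
As referenced in the failover item above, here is a self-contained simulation of losing a zone; it models routing state only, with no real infrastructure, and all names are illustrative.

# Minimal failover simulation: three zones serve traffic; when one is
# marked failed, requests must still succeed via the survivors.
zones = {'us-east-1a': True, 'us-east-1b': True, 'us-east-1c': True}

def route(request_id):
    healthy = [z for z, ok in zones.items() if ok]
    if not healthy:
        raise RuntimeError('total outage')
    return healthy[request_id % len(healthy)]

# Baseline: all zones take traffic.
assert {route(i) for i in range(9)} == set(zones)

# Simulate a zone outage and verify traffic avoids it.
zones['us-east-1b'] = False
served = {route(i) for i in range(9)}
assert served == {'us-east-1a', 'us-east-1c'}
print('failover OK, traffic served by:', sorted(served))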

Self-Check

Ask yourself:

  • Can my application tolerate temporary data loss during zone failover?
  • Is my database replication strategy documented and tested?
  • Do I have visibility into which zone each customer is using?
  • What's the maximum acceptable replication lag for my use case?
  • Am I monitoring zone-specific error rates separately?

One Key Takeaway


Single-region multi-zone deployments provide high availability (99.9%-99.95%) with minimal latency penalty by distributing workloads across independent failure domains. The key is ensuring zero single points of failure outside zones and testing failure scenarios before they occur in production.

Next Steps

  1. Inventory your infrastructure - Identify components still in single zones
  2. Plan replication strategy - Decide between sync (data-critical) vs eventual (performance-critical)
  3. Test zone failures - Use chaos engineering to simulate zone outages
  4. Document runbooks - Prepare manual intervention steps for degraded scenarios
  5. Monitor by zone - Add zone-aware monitoring and alerting
  6. Evaluate multi-region - Assess whether future global expansion requires geographic distribution
