Blue-Green, Rolling, and Canary Deployments
Roll out changes safely with zero-downtime deployment strategies; detect problems before all users are affected.
TL;DR
Blue-Green: Two identical production environments. Run on blue. Deploy to green. Test green. Switch traffic instantly. Rollback: flip back. Zero downtime, instant rollback. Cost: doubles infrastructure.
Rolling: Replace instances gradually, a few at a time, until the whole fleet runs the new version. Slower deployment, no extra infrastructure, harder rollback.
Canary: Route 5% to new, monitor, increase to 100%. Catches regressions early. Production validation. Must handle version differences.
Choose by risk: Database migration → blue-green (easy rollback). API change → canary (monitor users). Feature flag → instant rollback (disable).
Learning Objectives
- Understand tradeoffs of each deployment strategy
- Implement zero-downtime deployments safely
- Use canary deployments to catch regressions early
- Design quick rollback procedures
- Monitor and automate deployment traffic shifting
- Choose strategy based on risk and constraints
Motivating Scenario
Your payment service handles 50,000 transactions/minute. Your latest deploy has a subtle bug that causes 0.1% of transactions to fail silently (no error, just lost).
With big-bang deployment: Deploy at 2 AM. The bug is immediately live on 100% of traffic, silently failing about 50 transactions per minute (0.1% of 50,000). You notice at 2:15 AM, with roughly 750 transactions already lost, then spend 45 minutes diagnosing and rolling back. Customer impact: on the order of 3,000 failed payments, angry emails, trust damage.
With canary deployment: Deploy to 5% at 2 AM. Monitor error rate for 5 minutes. Error rate spikes from 0.1% to 0.3%. Automatic rollback triggers at 2:07 AM. Only the canary's 5% slice was ever exposed: fewer than 20 failed transactions in total. No broad customer impact. Team debugs in staging.
With blue-green: Deploy to green at 2 AM. Run smoke tests. Switch traffic at 2:10 AM. Error spike noticed at 2:15 AM. Flip traffic back to blue at 2:16 AM. Zero downtime, but the bug served all traffic for about 6 minutes, costing roughly 300 transactions. Better than big-bang, worse than canary.
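The back-of-envelope numbers above all come from one formula. A quick sanity check in Python (the 50,000 txn/min traffic and 0.1% silent-failure rate come from the scenario; the exposure times are assumptions):

```python
# Exposure arithmetic: lost = traffic * share_on_buggy_version * failure_rate * minutes
TRAFFIC_PER_MIN = 50_000
FAILURE_RATE = 0.001  # the bug silently fails 0.1% of the traffic it serves

def lost_transactions(traffic_share: float, minutes_exposed: float) -> float:
    """Silently failed transactions while the buggy version serves traffic_share."""
    return TRAFFIC_PER_MIN * traffic_share * FAILURE_RATE * minutes_exposed

big_bang = lost_transactions(1.00, 60)  # all traffic, ~1 hour to notice and roll back
canary   = lost_transactions(0.05, 7)   # 5% of traffic, auto-rollback after 7 minutes
print(f"big-bang: {big_bang:.0f}, canary: {canary:.0f}")  # big-bang: 3000, canary: 18
```

The lever is the product of traffic share and exposure time: canary shrinks both, which is why the difference is two orders of magnitude.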
Core Concepts
Deployment Strategy Comparison

Strategy     Downtime   Rollback                      Extra infrastructure   Deploy speed
Blue-Green   Zero       Instant (flip traffic back)   Doubles environment    Fast, atomic switch
Rolling      Zero       Slow (re-roll instances)      None                   30-60 minutes
Canary       Zero       Automatic (metric-triggered)  Minimal                Gradual, staged
Blue-Green Deployment Flow
Phase 1: Preparation
- Blue environment: current production (running v1.0)
- Green environment: idle, with same infrastructure
- Load balancer points to blue
Phase 2: Deploy to Green
- Deploy v1.1 to green
- Run smoke tests against green
- Green is not serving traffic yet
- Blue continues serving 100% traffic
Phase 3: Switch Traffic
- Load balancer flips: blue → green
- All traffic now on v1.1
- Instant, atomic switch
- Zero downtime
Phase 4: Monitor Green
- Watch error rate, latency, alerts
- If issues detected: flip back to blue (instant rollback)
- If healthy: keep green as new production
Phase 5: Cleanup
- Old blue (v1.0) idles until the next deploy
- The next release deploys to the idle blue, then the switch flips green → blue
Rolling Deployment Flow
Phase 1: Initial State
- 10 instances running v1.0
- Load balancer spreads traffic to all 10
Phase 2: Rolling Update
- Drain traffic from instance #1
- Deploy v1.1 to #1
- Health check #1
- Drain #2, deploy v1.1, health check
- Continue: #3, #4, ... #10
- Total time: 30-60 minutes (depends on health check time)
Phase 3: During Rollout
- Old and new versions both running
- Clients may hit both v1.0 and v1.1
- Must handle backward compatibility
- If issue detected: manually stop rollout, manual rollback
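The drain/deploy/health-check loop above can be sketched directly; `drain`, `deploy`, and `healthy` are hypothetical hooks into your load balancer and fleet tooling, not a specific API:

```python
# Sketch of a rolling update: drain -> deploy -> health-check, abort on failure.
def rolling_update(instances, drain, deploy, healthy):
    """Update instances one at a time; stop the rollout on the first bad health check."""
    updated = []
    for inst in instances:
        drain(inst)                        # stop routing new traffic to this instance
        deploy(inst)                       # install the new version
        if not healthy(inst):              # health check before rejoining the pool
            return ("aborted", updated)    # instances in 'updated' must be rolled back
        updated.append(inst)               # instance rejoins the load balancer
    return ("complete", updated)
```

Note that between the first and last iteration both versions serve traffic at once, which is exactly why rolling deployments require backward compatibility.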
Canary Deployment Flow
Phase 1: Deploy to Small Subset
- Deploy v1.1 to 5% of instances (e.g., 1 out of 20)
- Load balancer: 95% → v1.0, 5% → v1.1
- Run for 5-10 minutes
Phase 2: Monitor Metrics
- Error rate: must stay under 0.1% increase
- Latency p99: must stay under +10%
- Custom metrics: any regressions?
- If metrics bad: automatic rollback
Phase 3: Increase Traffic
- 5% → 25% → 50% → 100%
- Pause 5 minutes between steps
- Monitor metrics at each step
- If rollback triggered: revert to v1.0
Phase 4: Full Rollout
- All instances running v1.1
- Canary strategy complete
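The promote-or-rollback logic of these four phases fits in a short loop. A minimal sketch, assuming hypothetical `fetch_error_rate` and `set_canary_weight` hooks into your metrics backend (e.g. Prometheus) and traffic layer (e.g. Istio):

```python
# Sketch of a canary controller loop: shift traffic in steps, watch one metric,
# roll back automatically if it degrades. Thresholds mirror the text above.
import time

BASELINE_ERROR_RATE = 0.001   # 0.1% steady-state error rate
MAX_EXCESS = 0.001            # abort if the rate rises more than 0.1 points
STEPS = [5, 25, 50, 100]      # percent of traffic on the new version

def run_canary(fetch_error_rate, set_canary_weight, pause_s=300):
    """Return 'promoted' if all steps stay healthy, else roll back and report it."""
    for weight in STEPS:
        set_canary_weight(weight)
        time.sleep(pause_s)           # let metrics accumulate at this step
        if fetch_error_rate() > BASELINE_ERROR_RATE + MAX_EXCESS:
            set_canary_weight(0)      # automatic rollback to the old version
            return "rolled-back"
    return "promoted"                 # all steps healthy: new version at 100%
```

Real controllers (Flagger, below) add multiple metrics, failure-count thresholds, and webhooks, but the control flow is this same loop.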
Practical Examples
- Blue-Green on Kubernetes
- Canary with Flagger + Istio
- Rolling Deployment with Terraform
- Traffic Shifting Strategy
# blue-green-deployment.yaml - Kubernetes Blue-Green
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service-blue
  labels:
    app: payment-service
    variant: blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payment-service
      variant: blue
  template:
    metadata:
      labels:
        app: payment-service
        variant: blue
    spec:
      containers:
        - name: payment-service
          image: payment-service:1.0.0 # Current version
          ports:
            - containerPort: 8080
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service-green
  labels:
    app: payment-service
    variant: green
spec:
  replicas: 0 # Starts at 0, scaled up during deploy
  selector:
    matchLabels:
      app: payment-service
      variant: green
  template:
    metadata:
      labels:
        app: payment-service
        variant: green
    spec:
      containers:
        - name: payment-service
          image: payment-service:1.1.0 # New version
          ports:
            - containerPort: 8080
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: payment-service
spec:
  selector:
    app: payment-service
    variant: blue # Initially points to blue
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
---
# LoadBalancer Service used only to smoke-test green before the switch
# (the deploy script queries this Service for its external IP)
apiVersion: v1
kind: Service
metadata:
  name: payment-service-green
spec:
  type: LoadBalancer
  selector:
    app: payment-service
    variant: green
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
#!/bin/bash
# deploy-blue-green.sh - performs the blue-green switch
set -euo pipefail

NEW_VERSION=$1
NAMESPACE=${2:-production}

echo "Starting blue-green deployment of $NEW_VERSION..."

# Step 0: Point the green deployment at the new image
kubectl set image deployment/payment-service-green \
  payment-service=payment-service:"$NEW_VERSION" -n "$NAMESPACE"

# Step 1: Scale up GREEN
echo "Scaling up green (new version)..."
kubectl scale deployment payment-service-green \
  --replicas=3 -n "$NAMESPACE"

# Step 2: Wait for green to be ready
echo "Waiting for green to be ready..."
kubectl wait --for=condition=available \
  --timeout=5m \
  deployment/payment-service-green \
  -n "$NAMESPACE"

# Step 3: Run smoke tests against green
# (assumes a LoadBalancer Service named payment-service-green selecting the green pods)
echo "Running smoke tests on green..."
GREEN_IP=$(kubectl get service payment-service-green \
  -n "$NAMESPACE" -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
for i in {1..10}; do
  HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
    "http://$GREEN_IP:80/health") || HTTP_CODE=000
  if [ "$HTTP_CODE" = "200" ]; then
    echo "Health check passed"
    break
  fi
  if [ "$i" -eq 10 ]; then
    echo "Health checks failed, aborting deployment"
    kubectl scale deployment payment-service-green \
      --replicas=0 -n "$NAMESPACE"
    exit 1
  fi
  sleep 10
done

# Step 4: Switch traffic to GREEN
echo "Switching traffic from BLUE to GREEN..."
kubectl patch service payment-service \
  -n "$NAMESPACE" \
  -p '{"spec":{"selector":{"variant":"green"}}}'
echo "Deployment complete! Traffic now on GREEN."
echo "Monitor metrics. To roll back: kubectl patch service payment-service -p '{\"spec\":{\"selector\":{\"variant\":\"blue\"}}}'"

# Step 5: Scale down BLUE after a verification window (5 minutes)
sleep 300
echo "Scaling down old blue deployment..."
kubectl scale deployment payment-service-blue \
  --replicas=0 -n "$NAMESPACE"
echo "Blue-green deployment finished successfully"
# canary-deployment.yaml - Flagger automated canary
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: payment-service
  namespace: production
spec:
  # Target the deployment
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  # Roll back if the canary cannot progress within 5 minutes
  progressDeadlineSeconds: 300
  # Service configuration
  service:
    port: 8080
    targetPort: 8080
  # Progressive traffic shifting and analysis
  analysis:
    # Analysis interval
    interval: 1m
    # Max number of failed checks before rollback
    threshold: 5
    # Gradual traffic shift: 5% -> 25% -> 50%, then promote to 100%
    stepWeights: [5, 25, 50]
    # Metrics for validation
    metrics:
      - name: request-success-rate # Flagger built-in
        thresholdRange:
          min: 99 # Must be >= 99% success
        interval: 1m
      - name: request-duration # Flagger built-in, milliseconds
        thresholdRange:
          max: 500 # p99 latency must be < 500ms
        interval: 1m
      - name: error-rate # Custom metric, defined by the MetricTemplate below
        templateRef:
          name: error-rate
          namespace: production
        thresholdRange:
          max: 0.1 # Error rate must stay < 0.1%
        interval: 1m
    # Webhook-driven tests against the canary
    webhooks:
      - name: acceptance-test
        url: http://flagger-loadtester:80/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -s http://payment-service:8080/health"
      - name: load-test
        url: http://flagger-loadtester:80/
        timeout: 5s
        metadata:
          type: bash
          cmd: "for i in {1..100}; do curl -s http://payment-service:8080/api/charges; done"
---
# MetricTemplate backing the custom error-rate metric above (result in percent)
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: error-rate
  namespace: production
spec:
  provider:
    type: prometheus
    address: http://prometheus:9090
  query: |
    100 * sum(rate(http_request_total{job="{{ target }}",status=~"5.."}[5m]))
    /
    sum(rate(http_request_total{job="{{ target }}"}[5m]))
# rolling-deployment.tf - Terraform rolling strategy
resource "aws_autoscaling_group" "payment_service" {
  name                      = "payment-service-asg"
  min_size                  = 3
  max_size                  = 10
  desired_capacity          = 5
  health_check_type         = "ELB"
  health_check_grace_period = 300
  vpc_zone_identifier       = var.subnet_ids # subnets assumed defined elsewhere
  target_group_arns         = [aws_lb_target_group.payment_service.arn]

  # Rolling update policy (gradual replacement)
  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 90          # Keep 90% healthy during update
      instance_warmup        = 120         # Seconds before an instance counts as healthy
      checkpoint_percentages = [50, 100]   # Pause points during the refresh
      checkpoint_delay       = 300         # 5 min pause at each checkpoint
      skip_matching          = true        # Skip instances already on the latest template
    }
  }

  launch_template {
    id      = aws_launch_template.payment_service.id
    version = "$Latest"
  }

  tag {
    key                 = "Name"
    value               = "payment-service"
    propagate_at_launch = true
  }
}
# Changing the launch template (e.g. a new AMI) triggers a rolling refresh
resource "aws_launch_template" "payment_service" {
  name_prefix   = "payment-service-"
  image_id      = var.ami_id # New AMI triggers rolling update
  instance_type = "t3.large"

  iam_instance_profile {
    name = aws_iam_instance_profile.payment_service.name
  }

  tag_specifications {
    resource_type = "instance"
    tags = {
      Name = "payment-service"
    }
  }
}
# Health check for quick detection
resource "aws_lb_target_group" "payment_service" {
  name     = "payment-service-tg"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    healthy_threshold   = 2
    unhealthy_threshold = 2
    timeout             = 5
    interval            = 10 # Quick health checks
    path                = "/health"
    matcher             = "200-299"
  }
}
# Expose the rollout configuration; live refresh progress is available via:
#   aws autoscaling describe-instance-refreshes --auto-scaling-group-name payment-service-asg
output "rolling_update_config" {
  value = {
    asg_name         = aws_autoscaling_group.payment_service.name
    desired_capacity = aws_autoscaling_group.payment_service.desired_capacity
    min_size         = aws_autoscaling_group.payment_service.min_size
    max_size         = aws_autoscaling_group.payment_service.max_size
  }
}
# istio-virtual-service.yaml - Fine-grained traffic control
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-service
  namespace: production
spec:
  hosts:
    - payment-service
  http:
    # Requests carrying the canary header always go to v1.1
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: payment-service
            port:
              number: 8080
            subset: v1-1
          weight: 100
    # Normal traffic: 90% v1.0, 10% v1.1 (canary percentage)
    - route:
        - destination:
            host: payment-service
            port:
              number: 8080
            subset: v1-0
          weight: 90
        - destination:
            host: payment-service
            port:
              number: 8080
            subset: v1-1
          weight: 10
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service
  namespace: production
spec:
  host: payment-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 50
        h2UpgradePolicy: UPGRADE
    outlierDetection:
      consecutive5xxErrors: 5 # Eject an endpoint after 5 consecutive errors
      interval: 30s
      baseEjectionTime: 30s
  subsets:
    - name: v1-0
      labels:
        version: v1.0.0
    - name: v1-1
      labels:
        version: v1.1.0
When to Use Each Strategy
Blue-Green: high-risk or hard-to-reverse changes, when doubled infrastructure is affordable.
- Database migrations (hard to roll back)
- Breaking API changes (need instant rollback)
- Large data format changes
- High-risk changes (low error budget)
- Requires double the infrastructure during the deploy
Canary: changes that benefit from validation on real production traffic.
- New features (monitor user impact)
- Changes where subtle bugs are likely (need monitoring)
- Backward-compatible API improvements
- Performance changes (validate latency)
- Normal, medium-risk changes
Rolling: Low-risk changes, resource-constrained, deployment time is acceptable.
Feature Flags: Changes that are reversible in code (disable flag = instant rollback).
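The feature-flag escape hatch can be as simple as a guarded code path; a minimal sketch (flag and function names are illustrative, and real systems load flags from a config service rather than a module-level dict):

```python
# Sketch: a feature flag gates the new code path; flipping the flag in
# configuration, with no redeploy, is the rollback.
FLAGS = {"new_charge_pipeline": True}  # loaded from a config service in practice

def is_enabled(flag: str) -> bool:
    return FLAGS.get(flag, False)      # unknown flags default to off

def charge(amount_cents: int) -> str:
    if is_enabled("new_charge_pipeline"):
        return f"v1.1 charged {amount_cents}"  # new behavior
    return f"v1.0 charged {amount_cents}"      # old, known-good behavior

print(charge(100))                     # served by the new path
FLAGS["new_charge_pipeline"] = False   # "instant rollback": flip the flag
print(charge(100))                     # back on the old path, no deploy needed
```

Because rollback is a config change rather than a deploy, it takes effect in seconds, which is why flags pair well with any of the three strategies above.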
Design Review Checklist
- You have zero-downtime deployment capability (blue-green, canary, or rolling)
- Rollback procedure is tested and takes <5 minutes
- Deployment is monitored with key metrics (error rate, latency, custom)
- Canary (if used) has automated rollback triggers
- Database migrations are backward-compatible or use blue-green
- Health checks are fast and accurate (not false positives)
- Deployment script is version-controlled and runnable by any engineer
- Team has practiced rollback procedure in staging
- Deployment strategy matches change risk level
- Post-deployment monitoring alerts cover new issues
Self-Check
- How long does your current deployment take?
- How long would a rollback take?
- Have you experienced a deployment that couldn't be rolled back quickly?
- Do you monitor error rate and latency during deployments?
- Can a junior engineer run a deployment without senior help?
Next Steps
- Week 1: Document current deployment process and rollback procedure
- Week 2: Identify which changes are risky (need blue-green) vs normal (can use canary)
- Week 3: Implement blue-green for your highest-risk service
- Week 4: Automate canary deployment with metric-based rollback
- Ongoing: Practice rollbacks quarterly; improve deployment speed
References
- Humble, J., & Farley, D. (2010). Continuous Delivery. Addison-Wesley.
- Newman, S. (2015). Building Microservices. O'Reilly Media.
- Flagger Documentation. flagger.app ↗️
- Istio Traffic Management. istio.io/traffic-management ↗️