Gamedays and Chaos Engineering
Practice failure in a controlled environment; discover and fix weaknesses before production.
TL;DR
Gameday: simulate failures in a controlled time window. Kill the database, watch how the system responds. Route traffic to the wrong datacenter. Reveal weaknesses: broken alerting, outdated runbooks, untrained team. Fix them before they cause real incidents.
Chaos engineering: continuously inject failures in staging. Randomly kill instances, inject latency, saturate resources. Measure how gracefully the system degrades. Prevents brittle systems that work until they catastrophically fail.
Learning Objectives
- Design realistic gameday scenarios matching production risks
- Run gamedays with clear scope and communication
- Implement continuous chaos testing in staging
- Measure system resilience objectively
- Convert gameday findings into concrete improvements
- Build team confidence in incident response
Motivating Scenario
Your payment platform processes 50,000 transactions per hour. Your database is a single leader with read replicas. You assume the system "handles database failover" based on a configuration review.
One Tuesday, the leader dies. Your read replicas are not promoted automatically. For 18 minutes every transaction fails, roughly 15,000 of them at your stated volume, and users are locked out. Your incident report concludes: "We thought failover was automatic. It wasn't configured."
A gameday would have revealed this in advance: run the scenario "kill the database leader," discover the misconfiguration, and fix it before it ever causes a real incident.
With chaos engineering running continuously, this configuration drift would have been caught in the next scheduled chaos test.
Core Concepts
Gameday Framework
Gameday Scenarios by Severity
Tier 1: Single Component (1 hour)
- Kill one database replica
- Stop one worker queue
- Disconnect one microservice
- Turn off one region
Tier 2: Infrastructure Failure (2-3 hours)
- Entire database unavailable for 30 minutes
- Network partition (service A cannot reach B)
- Entire Kubernetes cluster restart
- Datacenter network disconnect
Tier 3: Cascading Failures (3-4 hours)
- Database + cache both down
- Multiple service failures in sequence
- Quota exhaustion (API limits, database connections)
- Memory leak cascade (slow failure over 30 minutes)
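As a concrete example, the Tier 2 "network partition" scenario above can be injected and rolled back with a single firewall rule. A minimal sketch, assuming Linux iptables, root privileges, and a hypothetical address for service B:

```python
# partition.py - inject a Tier 2 network partition (service A cannot reach B).
# Sketch only: requires root and Linux iptables; the target IP is an assumption.
import subprocess
import time

SERVICE_B_IP = "10.0.2.15"   # hypothetical address of service B


def inject_partition() -> None:
    # Drop all outbound traffic from this host (service A) to service B
    subprocess.run(
        ["iptables", "-A", "OUTPUT", "-d", SERVICE_B_IP, "-j", "DROP"], check=True
    )


def rollback_partition() -> None:
    # Remove the rule added above
    subprocess.run(
        ["iptables", "-D", "OUTPUT", "-d", SERVICE_B_IP, "-j", "DROP"], check=True
    )


if __name__ == "__main__":
    inject_partition()
    try:
        time.sleep(300)   # hold the partition for 5 minutes
    finally:
        rollback_partition()
```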
Chaos Engineering Dimensions
Infrastructure Level:
- Kill random pods
- Reboot nodes
- Saturate disk/CPU
- Inject network packet loss
Application Level:
- Kill in-process cache
- Timeout database connections
- Slow down HTTP responses
- Corrupt message payloads
Data Level:
- Enable read-only mode
- Delay data synchronization
- Introduce stale data
- Trigger backup restoration
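To make the application-level dimension concrete, "slow down HTTP responses" can be implemented as a thin middleware that delays a fraction of requests. A minimal WSGI sketch; the 10% probability and 500 ms delay are illustrative defaults, not recommendations:

```python
# chaos_latency.py - application-level chaos: randomly slow down HTTP responses.
import random
import time


class ChaosLatencyMiddleware:
    """Wraps a WSGI app and delays a configurable fraction of requests."""

    def __init__(self, app, delay_seconds: float = 0.5, probability: float = 0.1):
        self.app = app
        self.delay_seconds = delay_seconds
        self.probability = probability

    def __call__(self, environ, start_response):
        # With the configured probability, hold the request before handling it
        if random.random() < self.probability:
            time.sleep(self.delay_seconds)
        return self.app(environ, start_response)


# Usage (illustrative): app = ChaosLatencyMiddleware(app, delay_seconds=1.0)
```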
Practical Examples
- Kubernetes Gameday Setup
- Chaos Testing Pipeline
- Gameday Planning Template
#!/bin/bash
# gameday-setup.sh - Configure controlled failure injection
set -euo pipefail

# Install the LitmusChaos operator
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-latest.yaml

# Define gameday scenario: kill payment-service pods
cat > gameday-scenario.yaml <<EOF
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: gameday-payment-kill-pod
  namespace: chaos-testing
spec:
  # Only run during scheduled gamedays
  engineState: "active"
  appinfo:
    appns: "production"
    applabel: "app=payment-service"
    appkind: "deployment"
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # Run the chaos for 5 minutes
            - name: TOTAL_CHAOS_DURATION
              value: "300"
            - name: PODS_AFFECTED_PERC
              value: "50"    # 50% of replicas
            - name: FORCE
              value: "false" # Graceful shutdown first
          nodeSelector:
            chaos-enabled: "true"  # Only on chaos-enabled nodes
EOF

# Apply chaos scenario
kubectl apply -f gameday-scenario.yaml

# Monitor system response
echo "Gameday started: Payment service degradation"
kubectl logs -f -n production -l app=payment-service --tail=50

# In a separate terminal, query metrics
watch 'curl -s http://prometheus:9090/api/v1/query?query=payment_error_rate | jq'

# After 5 minutes, chaos automatically stops
# Verify recovery
kubectl wait --for=condition=ready pod -l app=payment-service -n production --timeout=2m
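The `watch` command above gives a live view of the error rate; for an objective pass/fail signal, the same Prometheus query can be polled programmatically and compared against an agreed error budget. A minimal sketch, assuming the `payment_error_rate` metric and Prometheus endpoint from the script above; the 5% budget is illustrative:

```python
# gameday_error_budget.py - poll the error rate during the chaos window and fail
# the gameday if it exceeds the agreed budget. The 5% budget is an assumption.
import time

import requests

PROMETHEUS = "http://prometheus:9090/api/v1/query"
QUERY = "payment_error_rate"
ERROR_BUDGET = 0.05            # tolerate up to 5% errors while pods are killed
CHAOS_WINDOW_SECONDS = 300     # matches TOTAL_CHAOS_DURATION above


def current_error_rate() -> float:
    resp = requests.get(PROMETHEUS, params={"query": QUERY})
    result = resp.json()["data"]["result"]
    # Take the worst value across all returned series
    return max((float(r["value"][1]) for r in result), default=0.0)


def main() -> int:
    worst = 0.0
    deadline = time.time() + CHAOS_WINDOW_SECONDS
    while time.time() < deadline:
        worst = max(worst, current_error_rate())
        time.sleep(15)
    print(f"Worst error rate during chaos: {worst:.2%}")
    return 0 if worst <= ERROR_BUDGET else 1


if __name__ == "__main__":
    raise SystemExit(main())
```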
# chaos_engine.py - Continuous chaos testing
import logging
import os
import random
import time
from dataclasses import dataclass
from typing import Callable, List, Optional

import requests

logger = logging.getLogger(__name__)


@dataclass
class ChaosExperiment:
    name: str
    description: str
    inject: Callable[[], None]      # How to cause the failure
    verify: Callable[[], bool]      # How to verify graceful degradation
    rollback: Callable[[], None]    # How to undo the failure
    enabled_in_staging: bool = True
    enabled_in_production: bool = False
    duration_seconds: int = 60


class ChaosEngine:
    def __init__(self, environment: str):
        self.environment = environment
        self.experiments: List[ChaosExperiment] = []

    def register_experiment(self, exp: ChaosExperiment):
        self.experiments.append(exp)

    def run_scheduled(self):
        """Run chaos tests on a schedule (e.g., nightly in staging)."""
        for exp in self.experiments:
            # Skip experiments not enabled for the current environment
            if self.environment == "staging" and not exp.enabled_in_staging:
                continue
            if self.environment == "production" and not exp.enabled_in_production:
                continue
            logger.info(f"Starting chaos experiment: {exp.name}")
            try:
                # Inject failure
                exp.inject()
                logger.info(f"Injected failure for {exp.duration_seconds}s")
                # Let the failure run its course
                time.sleep(exp.duration_seconds)
                # Verify graceful degradation (not cascade failure)
                if not exp.verify():
                    logger.error(f"System did not degrade gracefully: {exp.name}")
                    self.alert_team(exp.name)
                # Clean up
                exp.rollback()
                logger.info(f"Rolled back: {exp.name}")
            except Exception as e:
                logger.exception(f"Chaos experiment failed: {e}")
                exp.rollback()
                self.alert_team(exp.name, error=str(e))
            time.sleep(30)  # Space between experiments

    def alert_team(self, experiment_name: str, error: Optional[str] = None):
        msg = f"Chaos experiment detected issue: {experiment_name}"
        if error:
            msg += f" - {error}"
        # Note: a real Slack integration also needs an Authorization bearer token.
        requests.post("https://slack.com/api/chat.postMessage", json={
            "channel": "#chaos-alerts",
            "text": msg
        })


def check_metrics_within_threshold(metric: str, threshold: float) -> bool:
    """Return True if the metric's current value stays below the threshold."""
    resp = requests.get(
        "http://prometheus:9090/api/v1/query", params={"query": metric}
    )
    results = resp.json()["data"]["result"]
    return all(float(r["value"][1]) < threshold for r in results)


# Register experiments
engine = ChaosEngine("staging")

# Experiment: Kill database processes (and with them, all connections)
engine.register_experiment(ChaosExperiment(
    name="database-connection-kill",
    description="Kill all database connections, verify reconnection",
    inject=lambda: os.system("pkill -f 'mysql|postgres'"),
    verify=lambda: requests.get("http://api:8080/health").status_code == 200,
    rollback=lambda: os.system("systemctl restart database"),
    duration_seconds=30,
    enabled_in_production=False
))

# Experiment: Inject latency
engine.register_experiment(ChaosExperiment(
    name="network-latency-injection",
    description="Add 1000ms latency to all outbound requests",
    inject=lambda: os.system(
        "tc qdisc add dev eth0 root netem delay 1000ms"
    ),
    verify=lambda: check_metrics_within_threshold("p99_latency", 5000),
    rollback=lambda: os.system("tc qdisc del dev eth0 root"),
    duration_seconds=60,
    enabled_in_production=False
))

# Run nightly
if __name__ == "__main__":
    engine.run_scheduled()
# Gameday Plan: Database Failover Simulation
**Date**: 2025-03-15 | **Time**: 14:00-16:00 UTC
**Environment**: Staging (exact production config)
**Participants**: 2 SREs, 2 Backend leads, 1 Database specialist
## Pre-Gameday (Day Before)
- [ ] Backup production data (for comparison)
- [ ] Set up monitoring dashboard with key metrics
- [ ] Notify customer success team (in case staging is visible)
- [ ] Ensure kill switches are staged and tested
## Scenario: Primary Database Becomes Unavailable
### Phase 1: Inject Failure (14:00-14:05)
**Injector**: Database specialist
- Command: `aws rds reboot-db-instance --db-instance-identifier prod-primary --force-failover`
- Expected: Primary transitions to "failing-over" state (5-10 seconds)
- Signal to team: Slack message "#gameday: Failure injected"
### Phase 2: Observe and Respond (14:05-14:30)
**Observers**: SREs monitor dashboards
**Responders**: Backend leads follow runbooks
**What we're testing:**
- Do alerts fire within 30 seconds?
- Does runbook guide team to diagnosis?
- How long until traffic reroutes to replica?
- Do read queries succeed? Write queries?
- What applications are impacted?
**Expected timeline:**
- 0s: Failure injected
- 5-10s: Connection timeout begins
- 15-30s: Alert fires
- 30-45s: Team completes diagnosis (runbook executed)
- 45-120s: Replica promoted (automatic or manual?)
- 120s: Full recovery
### Phase 3: Recovery (14:30-14:45)
- [ ] Verify all reads succeed
- [ ] Verify all writes succeed
- [ ] Check replication lag
- [ ] Confirm no data loss
- [ ] Log all events
### Phase 4: Debrief (14:45-16:00)
**Questions**:
- How long did detection take?
- What alerts worked? What didn't?
- Did runbooks have accurate steps?
- What surprised you?
- What do we need to fix?
**Output**: Tickets with priorities
- P0: Things that broke or detected too slowly
- P1: Automation opportunities
- P2: Documentation improvements
## Success Criteria
- [ ] Database failover completed <2 min
- [ ] All queries succeeded (no client errors)
- [ ] Alert fired and routed correctly
- [ ] Team could execute runbook without outside help
- [ ] Recovery time matches production expectations
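The expected timeline and success criteria are easier to judge from a timestamped record than from memory during the debrief. The sketch below polls a health endpoint (assumed here to be the `http://api:8080/health` check used earlier in this section) and reports time-to-first-failure and time-to-recovery:

```python
# gameday_timer.py - record when the failure became visible and when it recovered.
# Sketch only: the health URL and polling interval are assumptions, not part of
# the plan above.
import time

import requests

HEALTH_URL = "http://api:8080/health"   # assumed health endpoint
POLL_SECONDS = 5


def is_healthy() -> bool:
    try:
        return requests.get(HEALTH_URL, timeout=2).status_code == 200
    except requests.RequestException:
        return False


def record_timeline(max_minutes: int = 30) -> None:
    start = time.monotonic()
    first_failure = None
    deadline = start + max_minutes * 60
    while time.monotonic() < deadline:
        elapsed = time.monotonic() - start
        healthy = is_healthy()
        if not healthy and first_failure is None:
            first_failure = elapsed
            print(f"[{elapsed:6.1f}s] first failure observed")
        if healthy and first_failure is not None:
            print(f"[{elapsed:6.1f}s] recovered "
                  f"({elapsed - first_failure:.1f}s after first failure)")
            break
        time.sleep(POLL_SECONDS)


if __name__ == "__main__":
    record_timeline()
```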
When to Use Gamedays vs Continuous Chaos
Gamedays:
- High-risk, complex scenarios
- Team training and practice
- Infrastructure changes testing
- Quarterly or major releases
- Involves entire incident team
- Deep learning and new discoveries
Continuous chaos:
- Routine, repeatable failures
- Detect regressions automatically
- Staging environment validation
- Runs daily/weekly unattended
- Automated alert only if broken
- Catches configuration drift
Patterns and Pitfalls
Design Review Checklist
- Gameday schedule is published and recurring (quarterly minimum)
- Each gameday scenario tests a specific architectural weakness or recent incident
- Gameday has clear success criteria and metrics to measure
- Both SREs and engineers participate in gamedays
- Gameday findings generate tickets that are prioritized and tracked
- Chaos tests run in staging on a defined schedule (weekly minimum)
- Chaos experiment suite covers all critical services and failure modes
- Team has practiced executing critical runbooks within past 3 months
- Gameday and chaos results are documented and shared across team
- Post-gameday, at least one finding is addressed before next gameday
Self-Check
- How many gamedays has your team run in the past year?
- After the last gameday, how many P0 issues were discovered?
- Do you have continuous chaos testing in staging?
- Can a new engineer execute your critical runbook without help?
- Has a gameday ever prevented a production incident?
Next Steps
- Week 1: Schedule first gameday. Choose a single-component failure (kill one service).
- Week 2: Run gameday. Document findings. Create tickets.
- Week 3: Fix top 3 issues from gameday.
- Week 4: Set up chaos testing in staging to catch regressions.
- Ongoing: Monthly gameday practice; weekly chaos tests.