
Runbooks and On-Call

Guide incident response with runbooks; structure on-call rotations for coverage and sustainability.

TL;DR

A runbook is a decision tree and action guide: when alert X fires, follow these steps. Check these metrics, try these remediation steps, escalate if unresolved. Without runbooks, incident response is chaotic: people guess, responses are inconsistent, and MTTR climbs. Runbooks are living documents, updated after each incident. On-call rotations balance 24/7 coverage with human sustainability. Target: one primary carrying the pager at a time (not three people paged together), one-week shifts (not one month), and compensation time off (time in lieu, not just "thanks for covering"). Define clear escalation paths: if person A can't resolve the issue in 15 minutes, escalate to person B automatically. Automate all feasible remediation (restart the service, scale up, clear the cache) so on-call execution is pushing buttons, not debugging.

Learning Objectives

  • Write effective, tested runbooks for common incidents
  • Structure runbooks for rapid decision-making
  • Design on-call rotations for coverage and sustainability
  • Define clear escalation criteria and paths
  • Automate routine incident remediation steps
  • Measure on-call load and burnout
  • Ensure junior engineers can follow runbooks and resolve incidents

Motivating Scenario

Your database alert fires at 3 AM. The on-call engineer is a junior developer unfamiliar with the database. They waste 30 minutes checking random metrics and guessing at fixes. When the senior on-call finally gets paged another 40 minutes after that, they identify the issue in 5 minutes (replication lag) and fix it in 2 (a resync), roughly 75 minutes after the alert fired.

A runbook would have guided the junior: "If replication lag alert fires, check these specific metrics. If lag > 1000ms, run this resync command. If resync fails, escalate to senior DBA." The issue would be resolved in 10 minutes, not 75.

Runbooks transfer expertise into procedures. Escalation paths ensure appropriate expertise is applied quickly.

Core Concepts

Runbook Decision Tree: From Alert to Resolution

Runbook Elements

Alert context: Which alert fired? What does it mean? Is it an immediate emergency (customer impact now) or a leading indicator (a problem in progress)?

Initial diagnosis steps: Which metrics and logs should the on-call engineer check first? These should be rapid checks (30 seconds max per step).

Common causes & fixes: Most incidents of a given type fall into a few categories. List them, with diagnostic queries/commands and fixes.

Remediation procedures: Step-by-step commands to execute. Copy-paste ready, not "figure it out yourself."

Escalation criteria: If the issue isn't fixed after 15 minutes, who gets the escalation? Provide an escalation decision tree.

External communication: Which Slack channels should be notified? Who are the impacted stakeholders?

Prevention: Once the incident is resolved, what prevents recurrence? (This feeds the postmortem process.)
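
These elements are easiest to keep consistent if every new runbook starts from the same skeleton. Below is a minimal sketch of a scaffold script; the `runbooks/` directory layout and the `new-runbook.sh` name are assumptions, not an existing tool.

```bash
#!/usr/bin/env bash
# new-runbook.sh: scaffold a runbook with the standard sections.
# Sketch only; the runbooks/ directory and the alert-name argument are assumptions.
set -euo pipefail

alert_name="${1:?usage: new-runbook.sh <alert-name>}"
file="runbooks/${alert_name}.md"

if [ -e "$file" ]; then
  echo "refusing to overwrite ${file}" >&2
  exit 1
fi

mkdir -p runbooks
cat > "$file" <<EOF
# Runbook: ${alert_name}

## Alert Triggers
(alert name, condition, severity)

## Initial Diagnosis (rapid checks, < 30 seconds each)

## Common Causes & Fixes
(diagnostic commands and copy-paste fixes, most likely cause first)

## Escalation
(who, after how long, and how to reach them)

## External Communication
(channels to notify, customer-impact template)

## Prevention
(feeds the postmortem)
EOF

echo "created ${file}"
```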

On-Call Rotation Design

Primary rotation: One person carries the pager for the defined shift (typically one week) and responds to all alerts.

Secondary rotation: A second person is available for escalation or support, typically whoever was on-call the previous week or a senior engineer.

Shift length: One-week shifts are sustainable (five weekdays, seven nights, and one weekend). Anything longer causes burnout and degraded sleep.

Time off: After an on-call week, the engineer gets 1-2 days off to recover. Not "you can take time off next month," but immediate recovery.

Fair distribution: Each engineer takes on-call roughly 4-6 weeks per year (depending on team size). This prevents burnout from concentrating on a few people.

Practical Example

# Runbook: API Latency Alert (P99 > 1000ms)

## Alert Triggers
- Alert name: `api-latency-p99-high`
- Condition: P99 latency > 1000ms for > 5 minutes
- Severity: SEV-2

## Initial Diagnosis (Do This First, < 5 minutes)

### Step 1: Confirm the alert
- Check Prometheus/Datadog dashboard for `api_request_latency_p99_ms`
- Is it sustained above 1000ms? (not a spike)
- Check dashboard "API Health" for overview

### Step 2: Error rate check
- Check dashboard: API error rate (`api_errors_total`)
- Is error rate elevated? (> 0.5%)
- If yes: there's a service outage, see runbook "API Errors"

### Step 3: Identify affected endpoint
- Check dashboard: latency by endpoint
- Which endpoint has highest latency?
- Is it widespread (all endpoints slow) or specific endpoint?
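
If these checks get run on every page, they can be bundled into one read-only script. A minimal sketch, assuming the `metrics.internal` endpoints used in this runbook return a bare numeric value:

```bash
#!/usr/bin/env bash
# triage.sh: print the first-pass signals in one shot.
# Sketch only; assumes the metrics.internal endpoints in this runbook return a bare number.
set -euo pipefail

p99=$(curl -sf http://metrics.internal/api/api_latency_p99_ms)
printf 'P99 latency: %s ms (alert threshold: 1000 ms)\n' "$p99"

errors=$(curl -sf http://metrics.internal/api/api_errors_total)
printf 'API errors:  %s (if elevated, switch to the "API Errors" runbook)\n' "$errors"

db=$(curl -sf http://metrics.internal/api/db_latency_ms)
printf 'DB latency:  %s ms (healthy < 50 ms, slow > 200 ms)\n' "$db"
```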

## Common Causes & Fixes

### Cause 1: Database Slow/Overloaded (50% of latency incidents)

**Diagnosis**:
```bash
# Check database latency
curl http://metrics.internal/api/db_latency_ms
# Should be < 50ms. If > 200ms, database is slow.

# Check database connections
curl http://metrics.internal/api/db_connections_used
# Should be < 80. If > 95, connection pool exhausted.

# Check slow queries
ssh dbmaster
mysql> SELECT * FROM information_schema.PROCESSLIST WHERE TIME > 10;

```

**Fix options (try in order)**:

```bash
# Option 1: Kill the slow query (if the culprit is obvious)
mysql> KILL QUERY <query_id>;

# Option 2: Reduce the query timeout to shed load
# (edit the config; restart the service if needed)

# Option 3: Escalate (see the escalation section below)
```

**Verification**:

```bash
# After the fix, confirm latency returns to normal
curl http://metrics.internal/api/api_latency_p99_ms
# Should drop below 500ms within 2 minutes
```

### Cause 2: Service Overloaded / Memory Leak (30% of latency incidents)

**Diagnosis**:

```bash
# Check service resource utilization
curl http://metrics.internal/api/service_cpu_percent
curl http://metrics.internal/api/service_memory_mb

# Check request queue depth
curl http://metrics.internal/api/service_queue_depth
# If queue depth is high, the service can't keep up
```

**Fix**:

```bash
# Restart the service
kubectl rollout restart deployment/api-service -n prod

# Monitor recovery
kubectl logs -n prod -l app=api-service -f --tail=50
# Wait 2 minutes, then check latency
```

### Cause 3: Upstream Service Slow (15% of latency incidents)

**Diagnosis**:

```bash
# Check dependency latency
# Calls to the Auth Service
curl http://metrics.internal/api/upstream_auth_latency_ms
# Calls to the Search Service
curl http://metrics.internal/api/upstream_search_latency_ms

# If > 500ms, the upstream service is the bottleneck
```

**Fix**:

```bash
# You can't fix another team's service
# Escalate to their on-call (see the escalation section below)

# Meanwhile, apply a temporary mitigation: cache or fall back
```

### Cause 4: Traffic Spike / DDoS (5% of latency incidents)

**Diagnosis**:

```bash
curl http://metrics.internal/api/requests_per_second
# Normal: ~500 req/s. If > 2000 req/s, you have a spike.

# Check whether it's legitimate traffic or a DDoS
curl http://metrics.internal/api/top_clients
# Are requests from many unique IPs (DDoS) or from known clients?
```

**Fix**:

```bash
# For a legitimate spike: auto-scaling should handle it
# Check whether auto-scaling kicked in
kubectl get hpa api-service -n prod

# For a DDoS: block at the CDN/WAF level
# Contact security on-call: #security-oncall
```

## Escalation

If the issue isn't resolved after 15 minutes:

### Step 1: Gather evidence

- Collect logs and metrics (a 30-second snapshot)
- Save the dashboard link with the affected time range
- Note what you've already tried

### Step 2: Escalate based on diagnosis

| Diagnosis | Escalate To | Contact Method |
|---|---|---|
| Database issue | DBA on-call (Bob) | Slack: @bob-dba or 555-0001 |
| Upstream service | That team's on-call | Check their runbook for escalation |
| Infrastructure issue | Infrastructure on-call (Alice) | Slack: @alice-infra or 555-0002 |
| Unknown | Senior engineer (whoever's available) | Slack: #eng-oncall |

### Step 3: During escalation

- Stay available to answer questions
- Don't disappear once you've escalated
- Help the person you escalated to understand what you've already done

## External Communication

### Notify these channels

- #customer-support: "Investigating API latency. ETA for fix: 20 mins"
- #incidents: "SEV-2 API latency. Root cause: [diagnosis]. Status: investigating"

### Customer impact message

Use the template from /runbooks/comms_templates.md:

> "We're experiencing elevated API latency affecting searches and product details. Our team is investigating. We expect resolution within 30 minutes."

## Prevention (After Resolution)

- Add a dashboard for [root cause metric] to alert earlier
- Add test coverage for [code path that broke]
- Implement [caching/optimization] to reduce database load
- Add capacity planning for [resource]

## Runbook Improvement

- When was this runbook last tested? 2025-01-20
- Next full test: 2025-02-20
- If you had to deviate from this runbook, update it with what you learned

On-Call Rotation Schedule

```yaml
# On-call rotation configuration
# Team: API Backend (4 engineers)

rotation:
  name: "API Backend On-Call"
  team: "api-backend"
  primary_shift_length_days: 7
  secondary_shift_length_days: 7
  timezone: "America/New_York"
  handoff_time: "Monday 10:00 AM"

# Current quarter schedule
engineers:
  - name: "Alice"
    capacity_per_quarter: "2 weeks"  # Alice does 2 weeks on-call per quarter
  - name: "Bob"
    capacity_per_quarter: "2 weeks"
  - name: "Carol"
    capacity_per_quarter: "1 week"   # Carol is less available
  - name: "Dave"
    capacity_per_quarter: "2 weeks"

schedule:
  - week: "2025-02-03"
    primary: "Alice"
    secondary: "Bob"
    notes: "Standard week"
  - week: "2025-02-10"
    primary: "Bob"
    secondary: "Carol"
    notes: "Carol on secondary (lighter load)"
  - week: "2025-02-17"
    primary: "Carol"
    secondary: "Alice"
    notes: "Carol primary (short week only)"
  - week: "2025-02-24"
    primary: "Dave"
    secondary: "Bob"
    notes: "Standard week"

# Time-off policy
compensation:
  after_oncall_week: "2 days off in lieu"  # Not optional
  before_holidays: "No on-call week scheduled"
  coverage_gaps: "Senior engineer covers; time counted as overtime"

# Escalation paths
escalation:
  first_level: "Secondary on-call"
  second_level: "Engineering manager"
  emergency: "VP Engineering (available 24/7 for critical incidents)"

# Performance metrics
metrics:
  avg_pages_per_week: 3       # Target: 2-4 pages
  avg_mttr: 20                # minutes
  page_to_acknowledgement: 3  # minutes
  escalations_per_quarter: 2  # Should be rare

# Burnout detection
burnout_indicators:
  - avg_pages_per_week > 7
  - engineer reports sleep disruption
  - MTTR degrading month-over-month
  - high turnover among on-call engineers
```
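
The burnout indicators above are only useful if someone computes them regularly. Here is a sketch that tallies pages per engineer per ISO week from a pager export; the `pages.csv` file and its `timestamp,engineer,alert` column order are assumptions, and GNU `date` is required.

```bash
#!/usr/bin/env bash
# oncall-load.sh: pages per engineer per ISO week, flagged against the burnout threshold.
# Sketch only; pages.csv and its column order (timestamp,engineer,alert) are assumptions.
set -euo pipefail

threshold=7   # matches the burnout_indicators entry above

while IFS=',' read -r ts engineer _alert; do
  week=$(date -d "${ts%%T*}" +%G-W%V)   # GNU date: ISO year and week number
  echo "${engineer} ${week}"
done < pages.csv |
  sort | uniq -c |
  awk -v t="$threshold" '{
    flag = ($1 + 0 > t + 0) ? "  <-- above burnout threshold" : ""
    printf "%-10s %s  %d pages%s\n", $2, $3, $1, flag
  }'
```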

When to Use / When Not to Use

Runbooks Essential
  1. Production systems with on-call coverage
  2. Complex systems with known failure modes
  3. Teams with varied experience levels
  4. Services handling customer-facing traffic
  5. Systems with historical incident patterns

Less Critical For
  1. Development/staging environments
  2. Single-engineer teams (everyone knows everything)
  3. Simple services with few failure modes
  4. Non-critical batch workloads
  5. Systems in active development phase

Patterns and Pitfalls

  • A runbook that has never been executed will fail under pressure. Schedule monthly runbook drills: fire a fake alert and have the on-call follow the runbook. Catch missing steps and outdated commands before a real incident.
  • A 5-page runbook is useless in a 3 AM incident; the on-call can't hold it all in their head. Keep runbooks to 1-2 pages, with rapid decision trees and copy-paste commands. Complex procedures belong in detailed documentation, not the runbook.
  • When a runbook is updated (usually after an incident), version it. Old versions might still be running in people's heads. If "curl endpoint X" changed to "run script Y," make that crystal clear.
  • One engineer on-call for a month, or three engineers all paged together, leads to burnout, and burned-out engineers make mistakes. Stick to one primary per week, a clear secondary escalation path, compensation time off, and fair distribution.
  • If the runbook has a "restart the service" step, it should be a button push, not 10 kubectl commands (see the sketch below). Better yet: auto-remediate common issues before paging; 70% of incidents resolve without human intervention if you automate the fixes.
  • A runbook that ends with "escalate" but doesn't say to whom is useless. Every "escalate" step must include a name, phone number, and Slack handle. Make escalation as easy as the primary response.
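
As a concrete version of the button-push idea, the "restart the service" remediation from the example runbook can be wrapped in one guarded script. This is a sketch: the deployment name, namespace, and metrics endpoint are the example values used earlier in this guide, and the confirmation prompt is an assumption.

```bash
#!/usr/bin/env bash
# remediate-restart.sh: one-button "restart the service" remediation with a post-check.
# Sketch only; deployment, namespace, and metrics endpoint are the example values from this guide.
set -euo pipefail

deploy="deployment/api-service"
ns="prod"

read -r -p "Restart ${deploy} in namespace ${ns}? [y/N] " answer
[[ "${answer}" == "y" ]] || { echo "aborted"; exit 1; }

kubectl rollout restart "${deploy}" -n "${ns}"
kubectl rollout status "${deploy}" -n "${ns}" --timeout=120s

# Quick post-check so the on-call sees whether latency is recovering
p99=$(curl -sf http://metrics.internal/api/api_latency_p99_ms)
echo "P99 latency after restart: ${p99} ms (target: < 500 ms)"
```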

Design Review Checklist

  • Does every monitored alert have a runbook?
  • Are runbooks kept up-to-date after each incident?
  • Can a junior engineer follow a runbook and resolve common issues?
  • Are runbooks tested regularly (monthly drills)?
  • Is on-call load sustainable (1-2 weeks per quarter per engineer)?
  • Are escalation paths clear with names and contact info?
  • Is there compensation time off after on-call duty?
  • Are common remediation steps automated (not manual)?
  • Is escalation automatic (not 'oh, I forgot to page them')?
  • Are on-call burnout metrics tracked and acted upon?

Self-Check

  • How many runbooks do you have? Should equal your alert count.
  • When was the last runbook drill? (Should be monthly)
  • How long was your last on-call week?
  • Did you take time off after? (Yes/No/Maybe)
  • What's your average page response time? (Should be <5 minutes)

Next Steps

  1. Audit alerts vs runbooks: List all alerts; create runbooks for any without them (see the audit sketch after this list)
  2. Test runbooks: Monthly drill for top 3 most-fired alerts
  3. Design rotation: Define shift length, secondary role, escalation
  4. Automate remediation: Pick top 3 incident types; automate common fixes
  5. Monitor on-call health: Track pages per week, response time, burnout signals
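
For step 1, the audit is mechanical if alert names and runbook filenames follow a convention. A sketch, assuming Prometheus-style rule files under `alerts/` and one markdown runbook per alert under `runbooks/`; adjust the paths and pattern to your setup.

```bash
#!/usr/bin/env bash
# runbook-audit.sh: list alerts that have no matching runbook.
# Sketch only; assumes Prometheus-style rules in alerts/ and runbooks named runbooks/<alert>.md.
set -euo pipefail

# Pull alert names out of the rule files (lines like "  - alert: api-latency-p99-high")
grep -rhoP '^\s*-?\s*alert:\s*\K\S+' alerts/ | sort -u |
while read -r alert; do
  if [ ! -f "runbooks/${alert}.md" ]; then
    echo "MISSING runbook for alert: ${alert}"
  fi
done
```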
