Production Readiness Checklist

Ensure systems are operationally sound with health checks, SLOs, alerting, and comprehensive runbooks

TL;DR

Production readiness ensures systems can be reliably operated. This checklist covers health checks (readiness and liveness probes), SLO/SLI definition, alerting configuration, capacity planning, and runbook documentation. Systems should not move to production without completing this checklist.

Learning Objectives

After using this checklist, you will be able to:

  • Configure health checks for application and infrastructure
  • Define measurable SLOs and SLIs
  • Establish alert thresholds based on SLOs
  • Conduct capacity planning for predicted load
  • Document operational runbooks for common scenarios
  • Enable operators to reliably run systems in production

Health Checks & Probes

Readiness Probes

  • Readiness Probe Definition
      • Probe checks critical dependencies (database, cache, message queue)
      • Probe returns 200 only when ready to accept traffic (see the sketch after this list)
      • Probe timeout configured appropriately (3-5 seconds typical)
      • Initial delay allows startup time before the first check
  • Readiness Criteria
      • Cache is accessible and populated if required
      • External service dependencies are reachable
      • Configuration has been loaded
      • Warm-up tasks (caching, initialization) completed
  • Orchestrator Integration
      • Pod not included in service load balancer until ready
      • Failed probes prevent traffic routing
      • Probe behavior tested during deployment
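
A minimal readiness endpoint sketch matching the checks above, assuming a Flask service; the dependency-check helpers (check_database, check_cache, config_loaded) are hypothetical placeholders you would wire to your real database, cache, and configuration code.

```python
# Readiness endpoint sketch (Flask assumed; the check helpers are placeholders).
from flask import Flask, jsonify

app = Flask(__name__)

def check_database() -> bool:
    # Placeholder: e.g. run "SELECT 1" against the connection pool with a short timeout.
    return True

def check_cache() -> bool:
    # Placeholder: e.g. send PING to Redis/Memcached.
    return True

def config_loaded() -> bool:
    # Placeholder: confirm configuration and warm-up tasks have completed.
    return True

@app.route("/readyz")
def readyz():
    checks = {
        "database": check_database(),
        "cache": check_cache(),
        "config_and_warmup": config_loaded(),
    }
    status = 200 if all(checks.values()) else 503
    # 200 tells the orchestrator to route traffic; 503 keeps the pod out of the load balancer.
    return jsonify(checks), status
```

Point the orchestrator's readiness probe at /readyz with a 3-5 second timeout and an initial delay long enough to cover startup.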

Liveness Probes

  • Liveness Probe Definition
      • Probe detects process-level failures (deadlock, panic)
      • Probe returns 200 if the process is healthy and responsive (see the sketch after this list)
      • Probe is lightweight (minimal CPU/memory impact)
      • Probe timeout and frequency configured appropriately
  • Liveness Criteria
      • No goroutine/thread deadlocks or hangs
      • Memory usage is within expected bounds
      • Event loop is not blocked
  • Restart Behavior
      • Restart policy configured appropriately
      • Crash loops are detected and avoided
      • Startup probe prevents premature liveness failures
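
A sketch of a lightweight liveness endpoint, again assuming Flask; it only verifies that the process is responsive and that a worker-loop heartbeat (the hypothetical beat helper) has been refreshed recently, leaving dependency checks to the readiness probe.

```python
# Liveness endpoint sketch: no external dependency checks, only "is this process stuck?".
import threading
import time

from flask import Flask

app = Flask(__name__)

_last_heartbeat = time.monotonic()
_lock = threading.Lock()

def beat() -> None:
    # Call this from the worker/event loop on every iteration to prove it is not hung.
    global _last_heartbeat
    with _lock:
        _last_heartbeat = time.monotonic()

@app.route("/livez")
def livez():
    with _lock:
        stalled = time.monotonic() - _last_heartbeat > 30  # hypothetical 30-second hang threshold
    # 200 means "do not restart"; 503 lets the orchestrator restart the pod.
    return ("stalled", 503) if stalled else ("ok", 200)
```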

Startup Probes

  • Startup Probe Configuration
      • Probe checks if initialization is complete
      • Generous timeout for application startup (up to 5 minutes)
      • Liveness/readiness probes not evaluated until startup succeeds

SLO & SLI Definition

SLO (Service Level Objective) Definition

  • SLO Documentation
      • SLOs tied to business requirements and user expectations
      • Multiple SLOs per service (availability, latency, error rate)
      • SLOs are achievable and realistic (not 100%)
      • Review and update SLOs annually or when the business changes
  • SLO Examples (captured as data in the sketch after this list)
      • "95th percentile latency under 200ms" for user requests
      • "Error rate under 0.1%" (1 error per 1000 requests)
      • "Data consistency within 5 seconds" for distributed systems

SLI (Service Level Indicator) Implementation

  • SLI Metrics
      • Metrics for each SLO collected and tracked
      • Good events and bad events clearly defined
      • SLI calculation formula documented (see the sketch after this list)
      • Measurement methodology excludes external factors when possible
  • SLI Examples
      • Latency: p95(request_duration) measured from the user's perspective
      • Error rate: (error_responses / total_responses) * 100
      • Throughput: requests_per_second measured and tracked
  • SLI Tracking
      • Metrics stored in a time-series database (Prometheus, etc.)
      • SLI windows calculated (daily, weekly, monthly, quarterly)
      • SLI dashboards show current performance vs. targets
      • Alerts trigger when SLIs approach the SLO threshold
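
A small sketch of the SLI formulas above computed over one window of raw request records. In practice these are time-series queries (e.g. against Prometheus) rather than in-process math; the record fields and window length here are illustrative.

```python
# Sketch: latency, error-rate, and throughput SLIs over a single 60-second window.
from dataclasses import dataclass

@dataclass
class Request:
    duration_ms: float
    status: int              # HTTP status code

def p95(values: list[float]) -> float:
    # Nearest-rank 95th percentile; real systems usually derive this from histogram buckets.
    ordered = sorted(values)
    return ordered[max(0, round(0.95 * len(ordered)) - 1)]

def slis(window: list[Request], window_seconds: float = 60.0) -> dict[str, float]:
    errors = sum(1 for r in window if r.status >= 500)
    return {
        "latency_p95_ms": p95([r.duration_ms for r in window]),
        "error_rate_pct": 100.0 * errors / len(window),
        "throughput_rps": len(window) / window_seconds,
    }

# 1000 requests, 5 of them errors: error rate 0.5%, which would breach a 0.1% target.
window = [Request(120, 200)] * 995 + [Request(900, 500)] * 5
print(slis(window))
```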

Error Budgets

  • Error Budget Calculation
      • Example: 99.9% SLO = 43.2 minutes downtime/month (see the arithmetic after this list)
      • Error budget tracked and consumed by incidents
      • When budget exhausted, focus on stability over features
  • Error Budget Management
      • Release process slowed if error budget depleted
      • Risk reviews conducted for major changes near budget limits
      • Post-incident reviews assess impact on error budget
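
The 43.2-minute figure falls out of simple arithmetic. A sketch, assuming a 30-day month; the consumed-budget figure is illustrative.

```python
# Error budget arithmetic for an availability SLO (30-day month assumed).
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of allowed downtime in a rolling window of `days` days."""
    return (1.0 - slo) * days * 24 * 60

print(f"{error_budget_minutes(0.999):.1f}")   # 43.2 minutes/month at 99.9%
print(f"{error_budget_minutes(0.9999):.2f}")  # 4.32 minutes/month at 99.99%

consumed = 12.0                               # minutes of downtime from incidents so far (example)
remaining = error_budget_minutes(0.999) - consumed
print(f"{remaining:.1f} minutes left; slow releases as this approaches zero")
```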

Alerting & Monitoring

Alert Thresholds

  • Alert Strategy
      • Alerts fire when an SLI is approaching its SLO threshold
      • Alert levels: warning (SLI degradation) and critical (SLO breach)
      • High-confidence alerts only (minimize false positives)
      • Missing alerts identified and addressed
  • Alert Configuration
      • Error rate alerts: trigger at 2× error budget burn rate (see the sketch after this list)
      • Availability alerts: trigger when the SLI drops below the SLO
      • Alert evaluation period (e.g., 5 minutes) is reasonable
      • Alert grouping prevents alert storms
  • Alerting Platforms
      • Escalation rules defined (timing, who to notify)
      • Alert context includes runbook links and relevant metrics
      • Alert fatigue management (tuning false positives)
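
A sketch of the 2× burn-rate trigger mentioned above. Burn rate is the observed error rate divided by the error rate the SLO allows: at 1.0 the budget lasts exactly the SLO window, at 2.0 it lasts half as long. In production this is normally expressed as an alerting-rule query; the threshold and numbers below are illustrative.

```python
# Sketch: error-budget burn-rate check behind a "trigger at 2x burn rate" alert.
def burn_rate(observed_error_rate: float, slo: float) -> float:
    allowed = 1.0 - slo                      # e.g. 0.001 for a 99.9% SLO
    return observed_error_rate / allowed

def should_alert(observed_error_rate: float, slo: float, threshold: float = 2.0) -> bool:
    return burn_rate(observed_error_rate, slo) >= threshold

# 0.3% errors over the evaluation window against a 99.9% SLO: burn rate 3.0, so page.
print(f"{burn_rate(0.003, 0.999):.1f}", should_alert(0.003, 0.999))
```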

Observability

  • Metrics Collection
      • Business metrics tracked alongside technical metrics
      • Metrics retention policy defined
      • Metrics cardinality explosion prevented
  • Logging & Tracing
      • Log levels appropriate (ERROR, WARN, INFO, DEBUG)
      • Distributed tracing enabled for request flows
      • Trace sampling configured for performance
      • Sensitive data not logged (see the sketch after this list)
  • Dashboards
      • Dependency health visible (databases, external services)
      • Resource utilization (CPU, memory, disk, network)
      • Recent errors and incidents visible
      • Quick access to relevant runbooks
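
One way to keep sensitive data out of logs is to redact known fields before the event is written. A minimal sketch using the standard logging module; the field list, logger name, and event shape are illustrative.

```python
# Sketch: structured log events with sensitive fields redacted before they reach the sink.
import json
import logging

SENSITIVE_KEYS = {"password", "authorization", "ssn", "credit_card"}  # illustrative list

def log_event(logger: logging.Logger, level: int, event: str, **fields) -> None:
    safe = {k: "[REDACTED]" if k.lower() in SENSITIVE_KEYS else v for k, v in fields.items()}
    logger.log(level, json.dumps({"event": event, **safe}))

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("checkout")
log_event(log, logging.INFO, "payment_attempt",
          user_id=42, credit_card="4111111111111111", amount=19.99)
```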

Capacity Planning

Load Testing & Sizing

  • Load Testing
      • Load testing performed at expected peak load
      • Latency, error rates, and resource usage measured
      • Bottlenecks identified and addressed
      • Load testing repeated before major releases
  • Resource Sizing (see the sizing sketch after this list)
      • Memory sizing includes buffer for caching and overhead
      • Disk capacity includes growth projection
      • Network bandwidth sized for peak throughput
      • Database connections pooled appropriately
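
A back-of-the-envelope sizing sketch tying load-test results to instance counts and connection pools. Every number here (peak load, per-instance throughput, headroom factor, pool size) is an assumption to replace with measured values.

```python
# Back-of-the-envelope instance and connection sizing (all inputs illustrative).
import math

peak_rps = 4_000           # expected peak load from capacity planning
per_instance_rps = 350     # sustainable throughput per instance, measured under load test
headroom = 1.3             # 30% buffer for spikes, deploys, and node loss

instances = math.ceil(peak_rps * headroom / per_instance_rps)
print(instances)           # 15 instances needed at peak

# Pool size per instance times instance count must stay under the database's connection limit.
pool_size = 20
print(instances * pool_size, "connections needed vs. the database max_connections setting")
```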

Auto-Scaling

  • Scaling Policies
      • Scale-up triggers at 70-80% utilization
      • Scale-down triggers at 30-40% utilization
      • Cooldown periods prevent rapid scaling cycles (see the sketch after this list)
      • Minimum and maximum replica counts configured
  • Scaling Testing
      • New instances integrate correctly into load balancing
      • Connections drained gracefully during scale-down
      • Database connection pool scales with instance count
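
A toy sketch of the scale-up/scale-down thresholds and cooldown described above. Real deployments delegate this to the orchestrator's autoscaler (for example a Kubernetes HPA); this only illustrates the decision logic, with example values inside the checklist's bands.

```python
# Toy autoscaling policy: threshold-based scaling with a cooldown to prevent flapping.
import time

MIN_REPLICAS, MAX_REPLICAS = 2, 20
SCALE_UP_AT, SCALE_DOWN_AT = 0.75, 0.35   # within the 70-80% / 30-40% utilization bands
COOLDOWN_S = 300                          # 5-minute cooldown between scaling actions

_last_action = float("-inf")

def desired_replicas(current: int, cpu_utilization: float) -> int:
    global _last_action
    now = time.monotonic()
    if now - _last_action < COOLDOWN_S:
        return current                    # still cooling down: hold steady to avoid rapid cycles
    if cpu_utilization >= SCALE_UP_AT and current < MAX_REPLICAS:
        _last_action = now
        return current + 1
    if cpu_utilization <= SCALE_DOWN_AT and current > MIN_REPLICAS:
        _last_action = now
        return current - 1
    return current

print(desired_replicas(4, 0.82))          # 5: above the scale-up threshold
```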

Runbooks & Operational Procedures

Runbook Documentation

  • Runbook Content
      • Each runbook has a clear title, purpose, and severity
      • Prerequisites and safety checks listed
      • Step-by-step procedures with expected outputs
      • Rollback procedures documented if applicable
  • Runbook Examples
      • "High latency incident: diagnosis and remediation"
      • "Service not responding: troubleshooting guide"
      • "Out of memory: debugging and restart procedures"
  • Runbook Maintenance
      • Runbooks updated after incidents
      • Runbooks accessible from dashboards and alerts
      • Team training on runbook procedures conducted

On-Call Procedures

  • On-Call Setup
      • Escalation path clear (who calls whom, how)
      • On-call contact information maintained
      • On-call tools configured (PagerDuty, OpsGenie, etc.)
  • On-Call Support
      • Handoff procedures (shift change) documented
      • Post-incident review process in place
      • On-call load is reasonable (prevents burnout)

Deployment & Change Management

Deployment Readiness

  • Deployment Strategy
      • Canary or rolling deployment strategy configured
      • Blue-green deployment capability for rapid rollback
      • Deployment automation tested regularly
  • Pre-Deployment Checklist
      • Database migrations tested on staging
      • Secrets rotated and verified
      • Backup completed before deployment
      • Rollback plan is clear

Common Production Readiness Mistakes

Mistakes to Avoid

No health checks: Deployments can't reliably detect when services are ready or failing. Always implement readiness and liveness probes.

SLOs set too aggressively: 99.99% availability requires expensive infrastructure and careful operations. Set realistic targets.

Alerting on causes, not symptoms: CPU alerts don't tell you if users are affected. Alert on SLIs instead.

No runbooks: When incidents happen, operators waste time figuring out what to do. Document procedures in advance.

Skipping load testing: Discovering scaling problems after deployment to production is expensive. Test before production.

Ignoring error budgets: Teams burn through error budgets with risky deployments, then can't iterate. Use error budgets to balance velocity and stability.

One-person knowledge silos: When the person who knows everything goes on vacation, operations suffer. Document and train the team.

Self-Check

Before deploying to production, verify:

  • Readiness and liveness probes are implemented and tested
  • SLOs and SLIs are defined and documented
  • Alert thresholds are configured based on SLI targets
  • Capacity planning is complete for expected load
  • Load testing has been performed and bottlenecks addressed
  • Runbooks exist for all critical failure scenarios
  • On-call procedures and escalation paths are clear
  • Deployment and rollback procedures are documented and tested
  • Observability (metrics, logs, traces) is operational
  • Team is trained on procedures and tools

One Takeaway

Production readiness is not about perfection—it's about predictability. When systems are well-instrumented with health checks, clear SLOs, and comprehensive runbooks, operators can confidently detect and resolve issues without escalating to engineers.

Next Steps

  1. Define SLOs for your service based on business requirements
  2. Implement health checks (readiness and liveness probes)
  3. Configure alerting based on SLI targets
  4. Conduct load testing to validate sizing
  5. Document runbooks for critical scenarios
  6. Set up on-call procedures and tools

References