Production Readiness Checklist
Ensure systems are operationally sound with health checks, SLOs, alerting, and comprehensive runbooks
TL;DR
Production readiness ensures systems can be reliably operated. This checklist covers health checks (readiness and liveness probes), SLO/SLI definition, alerting configuration, capacity planning, and runbook documentation. Systems should not move to production without completing this checklist.
Learning Objectives
After using this checklist, you will be able to:
- Configure health checks for applications and infrastructure
- Define measurable SLOs and SLIs
- Establish alert thresholds based on SLOs
- Conduct capacity planning for predicted load
- Document operational runbooks for common scenarios
- Enable operators to reliably run systems in production
Health Checks & Probes
Readiness Probes
- Readiness Probe Definition
- Probe checks critical dependencies (database, cache, message queue)
- Probe returns 200 only when ready to accept traffic (see the handler sketch after this list)
- Probe timeout configured appropriately (3-5 seconds typical)
- Initial delay allows startup time before first check
- Readiness Criteria
- Cache is accessible and populated if required
- External service dependencies are reachable
- Configuration has been loaded
- Warm-up tasks (caching, initialization) completed
- Orchestrator Integration
- Pod not included in service load balancer until ready
- Failed probes prevent traffic routing
- Probe behavior tested during deployment
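As a concrete illustration of the criteria above, here is a minimal readiness handler sketch in Go. It assumes a `database/sql` pool and an illustrative `warmupDone` flag; the function names, endpoint wiring, and 3-second timeout are examples to adapt, not requirements.

```go
// Package health sketches readiness/liveness endpoints for an HTTP service.
package health

import (
	"context"
	"database/sql"
	"net/http"
	"sync/atomic"
	"time"
)

// warmupDone is flipped to true once configuration is loaded and caches are
// primed (the "warm-up tasks completed" criterion above).
var warmupDone atomic.Bool

// MarkWarmupDone is called by startup code after initialization finishes.
func MarkWarmupDone() { warmupDone.Store(true) }

// ReadyHandler returns 200 only when the service can actually take traffic:
// warm-up is complete and the database answers a ping within the probe
// timeout. Any other state returns 503 so the orchestrator keeps the pod
// out of the service load balancer.
func ReadyHandler(db *sql.DB) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 3*time.Second) // aligns with a 3-5 second probe timeout
		defer cancel()

		if !warmupDone.Load() {
			http.Error(w, "warming up", http.StatusServiceUnavailable)
			return
		}
		if err := db.PingContext(ctx); err != nil {
			http.Error(w, "database unreachable", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}
```

The handler would be registered on a path such as `/readyz` and referenced by the orchestrator's readiness probe, with an initial delay that covers normal startup time.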
Liveness Probes
- Liveness Probe Definition
- Probe detects process-level failures (deadlock, panic)
- Probe returns 200 if process is healthy and responsive
- Probe is lightweight (minimal CPU/memory impact)
- Probe timeout and frequency configured appropriately
- Liveness Criteria
- No goroutine/thread deadlocks or hangs (see the heartbeat sketch after this list)
- Memory usage is within expected bounds
- Event loop is not blocked
- Restart Behavior
- Restart policy configured appropriately
- Crash loops are detected and avoided
- Startup probe prevents premature liveness failures
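For the deadlock/hang criterion, one lightweight approach is a heartbeat that the main worker loop refreshes and the liveness endpoint checks. The 30-second staleness threshold and function names below are illustrative assumptions:

```go
package health

import (
	"net/http"
	"sync/atomic"
	"time"
)

// lastHeartbeat holds the Unix time of the most recent heartbeat. The main
// worker loop (or event loop) calls Beat() on every iteration.
var lastHeartbeat atomic.Int64

func init() { Beat() } // start healthy so the probe does not fail before the first loop iteration

// Beat records that the loop is still making progress.
func Beat() { lastHeartbeat.Store(time.Now().Unix()) }

// LiveHandler is deliberately cheap: it only compares two timestamps in
// memory. If the loop has not refreshed its heartbeat for 30 seconds, the
// process is assumed to be hung and 503 tells the orchestrator to restart it.
func LiveHandler(w http.ResponseWriter, r *http.Request) {
	if time.Since(time.Unix(lastHeartbeat.Load(), 0)) > 30*time.Second {
		http.Error(w, "worker heartbeat stale", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}
```

Keeping dependency checks out of this handler avoids restart loops caused by a briefly unavailable database; those belong in the readiness probe.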
Startup Probes
- Startup Probe Configuration
- Probe checks if initialization is complete
- Generous startup window allowed (e.g., up to 5 minutes) before the probe is considered failed
- Liveness/readiness probes not evaluated until startup succeeds
SLO & SLI Definition
SLO (Service Level Objective) Definition
- SLO Documentation
- SLOs tied to business requirements and user expectations
- Multiple SLOs per service (availability, latency, error rate)
- SLOs are achievable and realistic (not 100%)
- SLOs reviewed and updated annually or when business requirements change
- SLO Examples
- "95th percentile latency under 200ms" for user requests
- "Error rate under 0.1%" (1 error per 1000 requests)
- "Data consistency within 5 seconds" for distributed systems
SLI (Service Level Indicator) Implementation
- SLI Metrics
- Metrics for each SLO collected and tracked
- Good events and bad events clearly defined (see the instrumentation sketch after this list)
- SLI calculation formula documented
- Measurement methodology excludes external factors when possible
- SLI Examples
- Latency: p95(request_duration) measured from user perspective
- Error rate: (error_responses / total_responses) * 100
- Throughput: requests_per_second measured and tracked
- SLI Tracking
- Metrics stored in time-series database (Prometheus, etc.)
- SLI windows calculated (daily, weekly, monthly, quarterly)
- SLI dashboards show current performance vs. targets
- Alerts trigger when SLIs approach SLO threshold
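One way to collect the good/bad events and latency samples described above is HTTP middleware around the service's handlers. This sketch uses the Prometheus Go client; the metric names and the "5xx counts as bad" rule are assumptions to adjust per service.

```go
package slimetrics

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// requestsTotal counts good and bad events for the availability SLI:
	// the SLI is good / (good + bad) over the chosen window.
	requestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Requests by outcome, used as the availability SLI.",
	}, []string{"outcome"})

	// requestDuration feeds the p95 latency SLI.
	requestDuration = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "Request latency, used for the latency SLI.",
		Buckets: prometheus.DefBuckets,
	})
)

// Instrument wraps a handler and records one latency sample and one
// good/bad event per request.
func Instrument(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)

		requestDuration.Observe(time.Since(start).Seconds())
		outcome := "good"
		if rec.status >= 500 { // illustrative rule: only server errors count against the SLI
			outcome = "bad"
		}
		requestsTotal.WithLabelValues(outcome).Inc()
	})
}

type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (s *statusRecorder) WriteHeader(code int) {
	s.status = code
	s.ResponseWriter.WriteHeader(code)
}
```

The error-rate SLI then falls out of the two counters (bad over total across the SLI window), and the p95 latency SLI comes from the histogram.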
Error Budgets
- Error Budget Calculation
- Example: 99.9% SLO = 43.2 minutes of downtime per 30-day month (worked through in the sketch after this list)
- Error budget tracked and consumed by incidents
- When budget exhausted, focus on stability over features
- Error Budget Management
- Release process slowed if error budget depleted
- Risk reviews conducted for major changes near budget limits
- Post-incident reviews assess impact on error budget
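The arithmetic behind the example above is small enough to keep next to the SLO definition. A minimal sketch (function names are illustrative):

```go
package slo

import "time"

// ErrorBudget converts an availability SLO into allowed downtime for a
// window: ErrorBudget(0.999, 30*24*time.Hour) is 43.2 minutes, matching the
// 99.9% example above.
func ErrorBudget(slo float64, window time.Duration) time.Duration {
	return time.Duration((1 - slo) * float64(window))
}

// BudgetRemaining subtracts downtime already consumed by incidents in the
// current window; when it approaches zero, the checklist says to favor
// stability work over new features.
func BudgetRemaining(slo float64, window, consumed time.Duration) time.Duration {
	return ErrorBudget(slo, window) - consumed
}
```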
Alerting & Monitoring
Alert Thresholds
- Alert Strategy
- Alerts fire when an SLI approaches its SLO threshold
- Alert levels: warning (SLI degradation) and critical (SLO breach)
- High-confidence alerts only (minimize false positives)
- Missing alerts identified and addressed
- Alert Configuration
- Error rate alerts: trigger at 2× error budget burn rate (see the burn-rate sketch after this list)
- Availability alerts: trigger when SLI drops below SLO
- Alert evaluation window (e.g., 5 minutes) long enough to avoid flapping on transient spikes
- Alert grouping prevents alert storms
- Alerting Platforms
- Escalation rules defined (timing, who to notify)
- Alert context includes runbook links and relevant metrics
- Alert fatigue management (tuning false positives)
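The 2× burn-rate trigger above compares how fast errors are arriving with how fast the SLO allows them to arrive. A sketch of the calculation follows; in practice the alerting platform evaluates this as a query over the SLI metrics rather than application code.

```go
package slo

// BurnRate is the observed error ratio over a recent window divided by the
// budget fraction (1 - SLO). A burn rate of 1.0 spends the error budget
// exactly over the SLO period; 2.0 spends it twice as fast and would trip
// the warning threshold above.
func BurnRate(badEvents, totalEvents, slo float64) float64 {
	if totalEvents == 0 {
		return 0
	}
	return (badEvents / totalEvents) / (1 - slo)
}
```

With a 99.9% SLO, an observed error rate of 0.2% is a burn rate of 2; an error rate of 0.05% is a burn rate of 0.5 and leaves budget to spare.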
Observability
- Metrics Collection
- Business metrics tracked alongside technical metrics
- Metrics retention policy defined
- Metrics cardinality explosion prevented
- Logging & Tracing
- Log levels appropriate (ERROR, WARN, INFO, DEBUG)
- Distributed tracing enabled for request flows
- Trace sampling configured for performance
- Sensitive data not logged (see the logging sketch after this list)
- Dashboards
- Dependency health visible (databases, external services)
- Resource utilization (CPU, memory, disk, network)
- Recent errors and incidents visible
- Quick access to relevant runbooks
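For the logging items above, Go's standard `log/slog` package covers leveled JSON logs and attribute redaction. The `LOG_LEVEL` environment variable and the redacted key names below are assumptions for illustration:

```go
package obs

import (
	"log/slog"
	"os"
)

// NewLogger returns a JSON logger at the level named in LOG_LEVEL
// (DEBUG, INFO, WARN, ERROR), defaulting to INFO, and redacts attributes
// that must never reach the logs.
func NewLogger() *slog.Logger {
	var level slog.Level
	if err := level.UnmarshalText([]byte(os.Getenv("LOG_LEVEL"))); err != nil {
		level = slog.LevelInfo
	}
	opts := &slog.HandlerOptions{
		Level: level,
		ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
			switch a.Key {
			case "password", "authorization", "ssn": // illustrative sensitive keys
				a.Value = slog.StringValue("[REDACTED]")
			}
			return a
		},
	}
	return slog.New(slog.NewJSONHandler(os.Stderr, opts))
}
```

Calling `slog.SetDefault(NewLogger())` at startup makes every `slog.Info`/`Warn`/`Error` call emit structured, level-tagged JSON that log aggregators can filter.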
Capacity Planning
Load Testing & Sizing
- Load Testing
- Load testing performed at expected peak load
- Latency, error rates, and resource usage measured (see the load-generator sketch after this list)
- Bottlenecks identified and addressed
- Load testing repeated before major releases
- Resource Sizing
- Memory sizing includes buffer for caching and overhead
- Disk capacity includes growth projection
- Network bandwidth sized for peak throughput
- Database connections pooled appropriately
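A dedicated tool (k6, Vegeta, Locust, and similar) is the right way to run the load tests above; the sketch below only shows the shape of the measurement: hold a concurrency level, collect latencies, and report p95 latency and error rate. The URL, counts, and 5xx-as-failure rule are placeholders.

```go
package loadtest

import (
	"net/http"
	"sort"
	"sync"
	"time"
)

type sample struct {
	latency time.Duration
	failed  bool
}

// Run fires total GET requests at url from `concurrency` workers and reports
// p95 latency and error rate, the two numbers the checklist compares against
// the latency and error-rate SLOs. total must be greater than zero.
func Run(url string, concurrency, total int) (p95 time.Duration, errRate float64) {
	jobs := make(chan struct{}, total)
	for i := 0; i < total; i++ {
		jobs <- struct{}{}
	}
	close(jobs)

	var mu sync.Mutex
	samples := make([]sample, 0, total)
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			client := &http.Client{Timeout: 5 * time.Second}
			for range jobs {
				start := time.Now()
				resp, err := client.Get(url)
				failed := err != nil
				if resp != nil {
					failed = failed || resp.StatusCode >= 500
					resp.Body.Close()
				}
				mu.Lock()
				samples = append(samples, sample{time.Since(start), failed})
				mu.Unlock()
			}
		}()
	}
	wg.Wait()

	sort.Slice(samples, func(i, j int) bool { return samples[i].latency < samples[j].latency })
	failures := 0
	for _, s := range samples {
		if s.failed {
			failures++
		}
	}
	return samples[len(samples)*95/100].latency, float64(failures) / float64(len(samples))
}
```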
Auto-Scaling
- Scaling Policies
- Scale-up triggers at 70-80% utilization
- Scale-down triggers at 30-40% utilization
- Cooldown periods prevent rapid scaling cycles
- Minimum and maximum replica counts configured
- Scaling Testing
- New instances integrate correctly into load balancing
- Connections drained gracefully during scale-down
- Database connection pool scales with instance count (see the pool-sizing sketch below)
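One recurring scaling interaction is the database connection pool: if every new replica opens a full-size pool, scaling out can exhaust the database's connection limit. A sketch that budgets the global limit across the maximum replica count (the limits and idle settings are illustrative):

```go
package capacity

import (
	"database/sql"
	"time"
)

// ConfigurePool divides a global database connection limit across the
// maximum number of replicas the autoscaler may create, so that scaling to
// the maximum never exhausts the server's connection slots.
func ConfigurePool(db *sql.DB, serverConnLimit, maxReplicas int) {
	perInstance := serverConnLimit / maxReplicas
	if perInstance < 1 {
		perInstance = 1
	}
	db.SetMaxOpenConns(perInstance)
	db.SetMaxIdleConns(perInstance / 2)
	db.SetConnMaxIdleTime(5 * time.Minute) // release idle connections after traffic subsides
}
```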
Runbooks & Operational Procedures
Runbook Documentation
- Runbook Content
- Each runbook has clear title, purpose, and severity
- Prerequisites and safety checks listed
- Step-by-step procedures with expected outputs
- Rollback procedures documented if applicable
- Runbook Examples
- "High latency incident: diagnosis and remediation"
- "Service not responding: troubleshooting guide"
- "Out of memory: debugging and restart procedures"
- Runbook Maintenance
- Runbooks updated after incidents
- Runbooks accessible from dashboards and alerts
- Team training on runbook procedures conducted
On-Call Procedures
- On-Call Setup
- Escalation path clear (who calls whom, how)
- On-call contact information maintained
- On-call tools configured (PagerDuty, OpsGenie, etc.)
- On-Call Support
- Handoff procedures (shift change) documented
- Post-incident review process in place
- On-call load is reasonable (prevents burnout)
Deployment & Change Management
Deployment Readiness
- Deployment Strategy
- Canary or rolling deployment strategy configured
- Blue-green deployment capability for rapid rollback
- Deployment automation tested regularly
- Pre-Deployment Checklist
- Database migrations tested on staging
- Secrets rotated and verified
- Backup completed before deployment
- Rollback plan is clear
Common Production Readiness Mistakes
Mistakes to Avoid
No health checks: Deployments can't reliably detect when services are ready or failing. Always implement readiness and liveness probes.
SLOs set too aggressively: 99.99% availability requires expensive infrastructure and careful operations. Set realistic targets.
Alerting on causes, not symptoms: CPU alerts don't tell you if users are affected. Alert on SLIs instead.
No runbooks: When incidents happen, operators waste time figuring out what to do. Document procedures in advance.
Skipping load testing: Discovering scaling problems after deployment to production is expensive. Test before production.
Ignoring error budgets: Teams burn through error budgets with risky deployments, then can't iterate. Use error budgets to balance velocity and stability.
One-person knowledge silos: When the person who knows everything goes on vacation, operations suffers. Document and train the team.
Self-Check
Before deploying to production, verify:
- Readiness and liveness probes are implemented and tested
- SLOs and SLIs are defined and documented
- Alert thresholds are configured based on SLI targets
- Capacity planning is complete for expected load
- Load testing has been performed and bottlenecks addressed
- Runbooks exist for all critical failure scenarios
- On-call procedures and escalation paths are clear
- Deployment and rollback procedures are documented and tested
- Observability (metrics, logs, traces) is operational
- Team is trained on procedures and tools
One Takeaway
Production readiness is not about perfection—it's about predictability. When systems are well-instrumented with health checks, clear SLOs, and comprehensive runbooks, operators can confidently detect and resolve issues without escalating to engineers.
Next Steps
- Define SLOs for your service based on business requirements
- Implement health checks (readiness and liveness probes)
- Configure alerting based on SLI targets
- Conduct load testing to validate sizing
- Document runbooks for critical scenarios
- Set up on-call procedures and tools
References
- Site Reliability Engineering (SRE) - Google's guide to SLOs, error budgets, and operations
- Kubernetes Health Checks - configuring liveness, readiness, and startup probes
- Four Golden Signals - What to monitor (latency, traffic, errors, saturation)
- SLO Specification - Tools for SLO definition and tracking