Production Readiness Checklist
Ensure systems are operationally sound with health checks, SLOs, alerting, and comprehensive runbooks
TL;DR
Production readiness ensures systems can be reliably operated. This checklist covers health checks (readiness and liveness probes), SLO/SLI definition, alerting configuration, capacity planning, and runbook documentation. Systems should not move to production without completing this checklist.
Learning Objectives
After using this checklist, you will be able to:
- Configure health checks for applications and infrastructure
- Define measurable SLOs and SLIs
- Establish alert thresholds based on SLOs
- Conduct capacity planning for predicted load
- Document operational runbooks for common scenarios
- Enable operators to reliably run systems in production
Health Checks & Probes
Readiness Probes
- Readiness Probe Definition
- Probe checks critical dependencies (database, cache, message queue)
- Probe returns 200 only when ready to accept traffic (see the handler sketch after this list)
- Probe timeout configured appropriately (3-5 seconds typical)
- Initial delay allows startup time before first check
- Readiness Criteria
- Cache is accessible and populated if required
- External service dependencies are reachable
- Configuration has been loaded
- Warm-up tasks (caching, initialization) completed
- Orchestrator Integration
- Pod not included in service load balancer until ready
- Failed probes prevent traffic routing
- Probe behavior tested during deployment
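As a concrete illustration of the criteria above, here is a minimal readiness handler sketch in Go. It assumes a `database/sql` pool and an illustrative `warmupDone` flag; the function names, endpoint wiring, and 3-second timeout are examples to adapt, not requirements.

```go
// Package health sketches readiness/liveness endpoints for an HTTP service.
package health

import (
	"context"
	"database/sql"
	"net/http"
	"sync/atomic"
	"time"
)

// warmupDone is flipped to true once configuration is loaded and caches are
// primed (the "warm-up tasks completed" criterion above).
var warmupDone atomic.Bool

// MarkWarmupDone is called by startup code after initialization finishes.
func MarkWarmupDone() { warmupDone.Store(true) }

// ReadyHandler returns 200 only when the service can actually take traffic:
// warm-up is complete and the database answers a ping within the probe
// timeout. Any other state returns 503 so the orchestrator keeps the pod
// out of the service load balancer.
func ReadyHandler(db *sql.DB) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 3*time.Second) // aligns with a 3-5 second probe timeout
		defer cancel()

		if !warmupDone.Load() {
			http.Error(w, "warming up", http.StatusServiceUnavailable)
			return
		}
		if err := db.PingContext(ctx); err != nil {
			http.Error(w, "database unreachable", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}
```

The handler would be registered on a path such as `/readyz` and referenced by the orchestrator's readiness probe, with an initial delay that covers normal startup time.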
Liveness Probes
- Liveness Probe Definition
- Probe detects process-level failures (deadlock, panic)
- Probe returns 200 if process is healthy and responsive
- Probe is lightweight (minimal CPU/memory impact)
- Probe timeout and frequency configured appropriately
- Liveness Criteria
- No goroutine/thread deadlocks or hangs (see the heartbeat sketch after this list)
- Memory usage is within expected bounds
- Event loop is not blocked
- Restart Behavior
- Restart policy configured appropriately
- Crash loops are detected and avoided
- Startup probe prevents premature liveness failures
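For the deadlock/hang criterion, one lightweight approach is a heartbeat that the main worker loop refreshes and the liveness endpoint checks. The 30-second staleness threshold and function names below are illustrative assumptions:

```go
package health

import (
	"net/http"
	"sync/atomic"
	"time"
)

// lastHeartbeat holds the Unix time of the most recent heartbeat. The main
// worker loop (or event loop) calls Beat() on every iteration.
var lastHeartbeat atomic.Int64

func init() { Beat() } // start healthy so the probe does not fail before the first loop iteration

// Beat records that the loop is still making progress.
func Beat() { lastHeartbeat.Store(time.Now().Unix()) }

// LiveHandler is deliberately cheap: it only compares two timestamps in
// memory. If the loop has not refreshed its heartbeat for 30 seconds, the
// process is assumed to be hung and 503 tells the orchestrator to restart it.
func LiveHandler(w http.ResponseWriter, r *http.Request) {
	if time.Since(time.Unix(lastHeartbeat.Load(), 0)) > 30*time.Second {
		http.Error(w, "worker heartbeat stale", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}
```

Keeping dependency checks out of this handler avoids restart loops caused by a briefly unavailable database; those belong in the readiness probe.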
Startup Probes
- Startup Probe Configuration
- Probe checks if initialization is complete
- Generous startup window allowed (e.g., up to 5 minutes) before the probe is considered failed
- Liveness/readiness probes not evaluated until startup succeeds
SLO & SLI Definition
SLO (Service Level Objective) Definition
- SLO Documentation
- SLOs tied to business requirements and user expectations
- Multiple SLOs per service (availability, latency, error rate)
- SLOs are achievable and realistic (not 100%)
- SLOs reviewed and updated annually or when business requirements change
- SLO Examples
- "95th percentile latency under 200ms" for user requests
- "Error rate under 0.1%" (1 error per 1000 requests)
- "Data consistency within 5 seconds" for distributed systems
SLI (Service Level Indicator) Implementation
- SLI Metrics
- Metrics for each SLO collected and tracked
- Good events and bad events clearly defined (see the instrumentation sketch after this list)
- SLI calculation formula documented
- Measurement methodology excludes external factors when possible
- SLI Examples
- Latency: p95(request_duration) measured from user perspective
- Error rate: (error_responses / total_responses) * 100
- Throughput: requests_per_second measured and tracked
- SLI Tracking
- Metrics stored in time-series database (Prometheus, etc.)
- SLI windows calculated (daily, weekly, monthly, quarterly)
- SLI dashboards show current performance vs. targets
- Alerts trigger when SLIs approach SLO threshold
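One way to collect the good/bad events and latency samples described above is HTTP middleware around the service's handlers. This sketch uses the Prometheus Go client; the metric names and the "5xx counts as bad" rule are assumptions to adjust per service.

```go
package slimetrics

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// requestsTotal counts good and bad events for the availability SLI:
	// the SLI is good / (good + bad) over the chosen window.
	requestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Requests by outcome, used as the availability SLI.",
	}, []string{"outcome"})

	// requestDuration feeds the p95 latency SLI.
	requestDuration = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "Request latency, used for the latency SLI.",
		Buckets: prometheus.DefBuckets,
	})
)

// Instrument wraps a handler and records one latency sample and one
// good/bad event per request.
func Instrument(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)

		requestDuration.Observe(time.Since(start).Seconds())
		outcome := "good"
		if rec.status >= 500 { // illustrative rule: only server errors count against the SLI
			outcome = "bad"
		}
		requestsTotal.WithLabelValues(outcome).Inc()
	})
}

type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (s *statusRecorder) WriteHeader(code int) {
	s.status = code
	s.ResponseWriter.WriteHeader(code)
}
```

The error-rate SLI then falls out of the two counters (bad over total across the SLI window), and the p95 latency SLI comes from the histogram.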
Error Budgets
- Error Budget Calculation
- Example: 99.9% SLO = 43.2 minutes of downtime per 30-day month (worked through in the sketch after this list)
- Error budget tracked and consumed by incidents
- When budget exhausted, focus on stability over features
- Error Budget Management
- Release process slowed if error budget depleted
- Risk reviews conducted for major changes near budget limits
- Post-incident reviews assess impact on error budget
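The arithmetic behind the example above is small enough to keep next to the SLO definition. A minimal sketch (function names are illustrative):

```go
package slo

import "time"

// ErrorBudget converts an availability SLO into allowed downtime for a
// window: ErrorBudget(0.999, 30*24*time.Hour) is 43.2 minutes, matching the
// 99.9% example above.
func ErrorBudget(slo float64, window time.Duration) time.Duration {
	return time.Duration((1 - slo) * float64(window))
}

// BudgetRemaining subtracts downtime already consumed by incidents in the
// current window; when it approaches zero, the checklist says to favor
// stability work over new features.
func BudgetRemaining(slo float64, window, consumed time.Duration) time.Duration {
	return ErrorBudget(slo, window) - consumed
}
```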
Alerting & Monitoring
Alert Thresholds
- Alert Strategy
- Alerts fire when an SLI approaches its SLO threshold
- Alert levels: warning (SLI degradation) and critical (SLO breach)
- High-confidence alerts only (minimize false positives)
- Missing alerts identified and addressed
- Alert Configuration
- Error rate alerts: trigger at 2× error budget burn rate (see the burn-rate sketch after this list)
- Availability alerts: trigger when SLI drops below SLO
- Alert evaluation window (e.g., 5 minutes) long enough to avoid flapping on transient spikes
- Alert grouping prevents alert storms
- Alerting Platforms
- Escalation rules defined (timing, who to notify)
- Alert context includes runbook links and relevant metrics
- Alert fatigue management (tuning false positives)
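The 2× burn-rate trigger above compares how fast errors are arriving with how fast the SLO allows them to arrive. A sketch of the calculation follows; in practice the alerting platform evaluates this as a query over the SLI metrics rather than application code.

```go
package slo

// BurnRate is the observed error ratio over a recent window divided by the
// budget fraction (1 - SLO). A burn rate of 1.0 spends the error budget
// exactly over the SLO period; 2.0 spends it twice as fast and would trip
// the warning threshold above.
func BurnRate(badEvents, totalEvents, slo float64) float64 {
	if totalEvents == 0 {
		return 0
	}
	return (badEvents / totalEvents) / (1 - slo)
}
```

With a 99.9% SLO, an observed error rate of 0.2% is a burn rate of 2; an error rate of 0.05% is a burn rate of 0.5 and leaves budget to spare.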
Observability
- Metrics Collection
- Business metrics tracked alongside technical metrics
- Metrics retention policy defined
- Metrics cardinality explosion prevented
- Logging & Tracing
- Log levels appropriate (ERROR, WARN, INFO, DEBUG)
- Distributed tracing enabled for request flows
- Trace sampling configured for performance
- Sensitive data not logged (see the logging sketch after this list)
- Dashboards
- Dependency health visible (databases, external services)
- Resource utilization (CPU, memory, disk, network)
- Recent errors and incidents visible
- Quick access to relevant runbooks
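For the logging items above, Go's standard `log/slog` package covers leveled JSON logs and attribute redaction. The `LOG_LEVEL` environment variable and the redacted key names below are assumptions for illustration:

```go
package obs

import (
	"log/slog"
	"os"
)

// NewLogger returns a JSON logger at the level named in LOG_LEVEL
// (DEBUG, INFO, WARN, ERROR), defaulting to INFO, and redacts attributes
// that must never reach the logs.
func NewLogger() *slog.Logger {
	var level slog.Level
	if err := level.UnmarshalText([]byte(os.Getenv("LOG_LEVEL"))); err != nil {
		level = slog.LevelInfo
	}
	opts := &slog.HandlerOptions{
		Level: level,
		ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
			switch a.Key {
			case "password", "authorization", "ssn": // illustrative sensitive keys
				a.Value = slog.StringValue("[REDACTED]")
			}
			return a
		},
	}
	return slog.New(slog.NewJSONHandler(os.Stderr, opts))
}
```

Calling `slog.SetDefault(NewLogger())` at startup makes every `slog.Info`/`Warn`/`Error` call emit structured, level-tagged JSON that log aggregators can filter.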
Capacity Planning
Load Testing & Sizing
- Load Testing
- Load testing performed at expected peak load
- Latency, error rates, and resource usage measured (see the load-generator sketch after this list)
- Bottlenecks identified and addressed
- Load testing repeated before major releases
- Resource Sizing
- Memory sizing includes buffer for caching and overhead
- Disk capacity includes growth projection
- Network bandwidth sized for peak throughput
- Database connections pooled appropriately
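A dedicated tool (k6, Vegeta, Locust, and similar) is the right way to run the load tests above; the sketch below only shows the shape of the measurement: hold a concurrency level, collect latencies, and report p95 latency and error rate. The URL, counts, and 5xx-as-failure rule are placeholders.

```go
package loadtest

import (
	"net/http"
	"sort"
	"sync"
	"time"
)

type sample struct {
	latency time.Duration
	failed  bool
}

// Run fires total GET requests at url from `concurrency` workers and reports
// p95 latency and error rate, the two numbers the checklist compares against
// the latency and error-rate SLOs. total must be greater than zero.
func Run(url string, concurrency, total int) (p95 time.Duration, errRate float64) {
	jobs := make(chan struct{}, total)
	for i := 0; i < total; i++ {
		jobs <- struct{}{}
	}
	close(jobs)

	var mu sync.Mutex
	samples := make([]sample, 0, total)
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			client := &http.Client{Timeout: 5 * time.Second}
			for range jobs {
				start := time.Now()
				resp, err := client.Get(url)
				failed := err != nil
				if resp != nil {
					failed = failed || resp.StatusCode >= 500
					resp.Body.Close()
				}
				mu.Lock()
				samples = append(samples, sample{time.Since(start), failed})
				mu.Unlock()
			}
		}()
	}
	wg.Wait()

	sort.Slice(samples, func(i, j int) bool { return samples[i].latency < samples[j].latency })
	failures := 0
	for _, s := range samples {
		if s.failed {
			failures++
		}
	}
	return samples[len(samples)*95/100].latency, float64(failures) / float64(len(samples))
}
```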
Auto-Scaling
- Scaling Policies
- Scale-up triggers at 70-80% utilization
- Scale-down triggers at 30-40% utilization
- Cooldown periods prevent rapid scaling cycles
- Minimum and maximum replica counts configured
- Scaling Testing
- New instances integrate correctly into load balancing
- Connections drained gracefully during scale-down
- Database connection pool scales with instance count (see the pool-sizing sketch below)
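One recurring scaling interaction is the database connection pool: if every new replica opens a full-size pool, scaling out can exhaust the database's connection limit. A sketch that budgets the global limit across the maximum replica count (the limits and idle settings are illustrative):

```go
package capacity

import (
	"database/sql"
	"time"
)

// ConfigurePool divides a global database connection limit across the
// maximum number of replicas the autoscaler may create, so that scaling to
// the maximum never exhausts the server's connection slots.
func ConfigurePool(db *sql.DB, serverConnLimit, maxReplicas int) {
	perInstance := serverConnLimit / maxReplicas
	if perInstance < 1 {
		perInstance = 1
	}
	db.SetMaxOpenConns(perInstance)
	db.SetMaxIdleConns(perInstance / 2)
	db.SetConnMaxIdleTime(5 * time.Minute) // release idle connections after traffic subsides
}
```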
Runbooks & Operational Procedures
Runbook Documentation
- Runbook Content
- Each runbook has clear title, purpose, and severity
- Prerequisites and safety checks listed
- Step-by-step procedures with expected outputs
- Rollback procedures documented if applicable
- Runbook Examples
- "High latency incident: diagnosis and remediation"
- "Service not responding: troubleshooting guide"
- "Out of memory: debugging and restart procedures"
- Runbook Maintenance
- Runbooks updated after incidents
- Runbooks accessible from dashboards and alerts
- Team training on runbook procedures conducted
On-Call Procedures
- On-Call Setup
- Escalation path clear (who calls whom, how)
- On-call contact information maintained
- On-call tools configured (PagerDuty, OpsGenie, etc.)
- On-Call Support
- Handoff procedures (shift change) documented
- Post-incident review process in place
- On-call load is reasonable (prevents burnout)
Deployment & Change Management
Deployment Readiness
- Deployment Strategy
- Canary or rolling deployment strategy configured
- Blue-green deployment capability for rapid rollback
- Deployment automation tested regularly
- Pre-Deployment Checklist
- Database migrations tested on staging
- Secrets rotated and verified
- Backup completed before deployment
- Rollback plan is clear
Common Production Readiness Mistakes
Mistakes to Avoid
No health checks: Deployments can't reliably detect when services are ready or failing. Always implement readiness and liveness probes.
SLOs set too aggressively: 99.99% availability requires expensive infrastructure and careful operations. Set realistic targets.
Alerting on causes, not symptoms: CPU alerts don't tell you if users are affected. Alert on SLIs instead.
No runbooks: When incidents happen, operators waste time figuring out what to do. Document procedures in advance.
Skipping load testing: Discovering scaling problems after deployment to production is expensive. Test before production.
Ignoring error budgets: Teams burn through error budgets with risky deployments, then can't iterate. Use error budgets to balance velocity and stability.
One-person knowledge silos: When the person who knows everything goes on vacation, operations suffers. Document and train the team.
Self-Check
Before deploying to production, verify:
- Readiness and liveness probes are implemented and tested
- SLOs and SLIs are defined and documented
- Alert thresholds are configured based on SLI targets
- Capacity planning is complete for expected load
- Load testing has been performed and bottlenecks addressed
- Runbooks exist for all critical failure scenarios
- On-call procedures and escalation paths are clear
- Deployment and rollback procedures are documented and tested
- Observability (metrics, logs, traces) is operational
- Team is trained on procedures and tools
One Takeaway
Production readiness is not about perfection—it's about predictability. When systems are well-instrumented with health checks, clear SLOs, and comprehensive runbooks, operators can confidently detect and resolve issues without escalating to engineers.
Next Steps
- Define SLOs for your service based on business requirements
- Implement health checks (readiness and liveness probes)
- Configure alerting based on SLI targets
- Conduct load testing to validate sizing
- Document runbooks for critical scenarios
- Set up on-call procedures and tools
References
- Site Reliability Engineering (SRE) - Google's guide to SLOs, error budgets, and operations
- Kubernetes Health Checks - configuring liveness, readiness, and startup probes
- Four Golden Signals - What to monitor (latency, traffic, errors, saturation)
- SLO Specification - Tools for SLO definition and tracking