Error Budgets and Toil
Quantify acceptable unreliability; measure and eliminate manual toil.
TL;DR
Error budget: a 99.9% SLO allows 0.1% failure, about 43 minutes per 30-day month. Once the budget is consumed, freeze changes until the next month. Spend the budget intentionally: risky deployments, chaos tests, experiments.
Toil: manual operational work (manual deployments, running scripts, repetitive monitoring). Measure it ruthlessly. Target: 20-30% of engineer time on toil; 70-80% on features. Automate before you hire. If toil exceeds 50%, adding headcount only scales the manual work; invest in automation instead.
Learning Objectives
- Calculate SLO and error budgets accurately
- Use error budgets to make intentional risk decisions
- Measure toil systematically across your team
- Identify and prioritize toil-elimination projects
- Build automation roadmaps for 50% toil reduction
- Balance innovation velocity with reliability
Motivating Scenario
Your team shipped 8 features last quarter but spent 40% of its time on toil: manual deployments, monitoring, and incident response. Your deployment pipeline is manual: run scripts, wait for tests, toggle feature flags, watch logs by hand.
A competitor's team shipped 20 features. Their deployment is fully automated: push to main, CI/CD runs everything and deploys to production. They spend 10% of their time on toil and 90% on features.
You have a 99.9% SLO, but you consumed your monthly error budget by week 2 because of incidents caused by manual deployment mistakes.
The competitor has a 99.95% SLO and budget remaining at month's end, so they can run a risky experiment in week 4.
Eliminating toil unlocks velocity. Error budgets give permission to deploy faster.
Core Concepts
Error Budget Math and Usage
SLO Breakdown:
- 99.9% = "three nines" of availability
- Downtime allowed per 30-day month: (1 - 0.999) × 43,200 min = 43.2 minutes
- Downtime allowed per day: 43.2 / 30 = 1.44 minutes
- Per million requests: 1,000 errors out of 1,000,000 requests
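The arithmetic above can be sketched in a few lines of Python. The function names are illustrative, and the 30-day window is an assumption that matches the 43,200-minute month used in the breakdown.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60  # 30 days = 43,200 minutes
    return (1 - slo) * window_minutes


def error_budget_requests(slo: float, total_requests: int) -> float:
    """Allowed failed requests for a request-based SLO."""
    return (1 - slo) * total_requests


if __name__ == "__main__":
    # Print the monthly budget for a few common SLO targets
    for slo in (0.99, 0.999, 0.9995, 0.9999):
        print(f"{slo:.2%} -> {error_budget_minutes(slo):.2f} min/month")
```

Computing the budget from the formula, rather than from a lookup table, keeps the numbers consistent for any SLO target or window length.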
Budget Consumption:
- Incident down 5 min: -5 min budget
- Deployment at 99% success: 1% of requests fail during the rollout, ten times the 0.1% the SLO allows
- Cascading failure for 2 hours: 120 min, nearly three times the 43.2-minute monthly budget
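A minimal sketch of how these events might be deducted from a 43.2-minute budget. Weighting a partial outage by the fraction of failed requests is one common convention and an assumption here, not something this document prescribes. `BudgetEvent`, `partial_outage_minutes`, and `remaining_budget` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class BudgetEvent:
    name: str
    minutes: float  # full-outage-equivalent minutes consumed


def partial_outage_minutes(duration_min: float, failure_fraction: float) -> float:
    """Assumed convention: weight elapsed time by the share of failed requests."""
    return duration_min * failure_fraction


def remaining_budget(budget_min: float, events: list) -> float:
    """Budget left after all events; a negative value means the SLO is breached."""
    return budget_min - sum(e.minutes for e in events)


if __name__ == "__main__":
    events = [
        BudgetEvent("5-minute outage", 5.0),
        # Hypothetical 30-minute rollout during which 1% of requests failed
        BudgetEvent("Rollout at 99% success", partial_outage_minutes(30, 0.01)),
        BudgetEvent("2-hour cascading failure", 120.0),
    ]
    left = remaining_budget(43.2, events)
    print(f"Remaining budget: {left:.1f} min")
    if left < 0:
        print("Budget exhausted: freeze changes")
```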
Toil Definition and Measurement
Toil: Manual, repetitive operational work that could be automated and leaves no lasting improvement behind.
Examples:
- Manually deploying to each environment
- Running pre-deployment checklists
- Manually scaling services during traffic spikes
- Manually restarting failed services
- Manually running batch jobs daily
- Manually updating configuration files
- On-call escalation and alert management
Not Toil: Feature development, design discussions, code review (these are valuable).
Measurement: Tracking toil requires honest logging.
Weekly Toil Audit:
- 2 hrs: Deploy process (multiple manual steps)
- 1.5 hrs: Manual scaling (peak traffic management)
- 1 hr: Runbook updates (post-incident)
- 2 hrs: Incident response (debugging + fixing)
- 0.5 hrs: Monitoring + alerting (manual checks)
--------
7 hours toil out of 40 hour week = 17.5% (GOOD)
Team average: 25%, Target: <30%
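The audit above reduces to a simple calculation. The sketch below assumes toil is logged as hours per category for one engineer over a 40-hour week. The function names and the dictionary layout are illustrative.

```python
def toil_percentage(toil_hours: dict, week_hours: float = 40.0) -> float:
    """Share of the working week spent on toil, as a percentage."""
    return sum(toil_hours.values()) / week_hours * 100


def top_sources(toil_hours: dict, n: int = 3) -> list:
    """Largest toil categories first, as (category, hours) pairs."""
    return sorted(toil_hours.items(), key=lambda item: item[1], reverse=True)[:n]


# Hours taken from the weekly audit above
AUDIT = {
    "Deploy process": 2.0,
    "Manual scaling": 1.5,
    "Runbook updates": 1.0,
    "Incident response": 2.0,
    "Monitoring + alerting": 0.5,
}

if __name__ == "__main__":
    pct = toil_percentage(AUDIT)
    target = 30.0
    verdict = "within target" if pct < target else "over target"
    print(f"Toil: {pct:.1f}% ({verdict}, target <{target:.0f}%)")
    for category, hours in top_sources(AUDIT):
        print(f"  {category}: {hours} hrs")
```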
Toil-to-Automation Roadmap
Transform toil into automation:
- Manual deployment → CI/CD pipeline
- Manual scaling → Auto-scaling groups + metrics
- Manual monitoring → Alert rules + dashboards
- Manual runbook steps → Auto-remediation
- Manual config changes → Infrastructure-as-code
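One way to order the roadmap is by payback time: the build effort divided by the hours saved each week. The sketch below is illustrative only. The weekly savings echo the toil audit example in this document, the build-effort hours are hypothetical estimates, and `AutomationCandidate` and `rank_by_payback` are made-up names.

```python
from dataclasses import dataclass


@dataclass
class AutomationCandidate:
    toil: str
    automation: str
    weekly_hours_saved: float
    build_effort_hours: float  # hypothetical estimate, replace with your own

    @property
    def payback_weeks(self) -> float:
        """Weeks until the time saved equals the time invested."""
        return self.build_effort_hours / self.weekly_hours_saved


def rank_by_payback(candidates: list) -> list:
    """Shortest payback first."""
    return sorted(candidates, key=lambda c: c.payback_weeks)


CANDIDATES = [
    AutomationCandidate("Manual deployment", "CI/CD pipeline", 7.0, 80.0),
    AutomationCandidate("Manual scaling", "Auto-scaling groups + metrics", 2.0, 30.0),
    AutomationCandidate("Manual monitoring", "Alert rules + dashboards", 2.0, 20.0),
    AutomationCandidate("Manual runbook steps", "Auto-remediation", 3.0, 60.0),
    AutomationCandidate("Manual config changes", "Infrastructure-as-code", 2.5, 40.0),
]

if __name__ == "__main__":
    for c in rank_by_payback(CANDIDATES):
        print(f"{c.toil} -> {c.automation}: payback {c.payback_weeks:.1f} weeks")
```

Payback time is only one possible ranking. A team might instead weight candidates by incident risk removed, which this sketch does not model.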
Practical Examples
- Error Budget Calculator
- Toil Measurement System
- Error Budget Governance
# error_budget_calculator.py
import json
from datetime import datetime

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month


class ErrorBudget:
    # Monthly budgets in minutes: (1 - SLO) × 43,200
    SLO_TO_MINUTES = {
        0.999: 43.2,     # 99.9%
        0.9999: 4.32,    # 99.99%
        0.99999: 0.432,  # 99.999%
        0.95: 2160,      # 95%
    }

    def __init__(self, slo: float, service_name: str):
        self.slo = slo
        self.service_name = service_name
        # Fall back to the formula so SLOs missing from the table still work
        self.budget_total = self.SLO_TO_MINUTES.get(
            slo, (1 - slo) * MINUTES_PER_MONTH)
        self.budget_remaining = self.budget_total
        self.incidents = []

    def record_incident(self, name: str, downtime_minutes: float,
                        severity: str, cause: str):
        """Record incident and deduct from budget"""
        self.incidents.append({
            "name": name,
            "downtime_minutes": downtime_minutes,
            "severity": severity,
            "cause": cause,
            "timestamp": datetime.now().isoformat()
        })
        self.budget_remaining -= downtime_minutes
        if self.budget_remaining < 0:
            self.alert_team()

    def record_deployment_error_rate(self, error_rate: float,
                                     requests_deployed: int):
        """Deduct budget when a deploy's error rate exceeds what the SLO allows"""
        allowed_error_rate = 1 - self.slo
        if error_rate > allowed_error_rate:
            # Over budget on this deploy
            excess_errors = requests_deployed * (error_rate - allowed_error_rate)
            # Rough conversion: treats 60,000 failed requests as one minute
            # of downtime (about 1,000 requests/sec)
            minutes_lost = excess_errors / (1000 * 60)
            self.budget_remaining -= minutes_lost

    def get_status(self) -> dict:
        """Current budget status"""
        percentage = (self.budget_remaining / self.budget_total) * 100
        if percentage > 20:
            status = "🟢 OK"
        elif percentage > 5:
            status = "🟡 CAUTION"
        else:
            status = "🔴 CRITICAL"
        return {
            "service": self.service_name,
            "slo": f"{self.slo * 100:g}%",
            "budget_remaining_min": round(self.budget_remaining, 2),
            "percentage_remaining": round(percentage, 1),
            "status": status,
            "incidents_this_month": len(self.incidents),
            "recommendation": self._recommendation(percentage)
        }

    def _recommendation(self, percentage: float) -> str:
        if percentage > 50:
            return "Safe to proceed with risky deployments"
        elif percentage > 20:
            return "Proceed with caution; avoid major changes"
        elif percentage > 5:
            return "Freeze non-critical changes until next period"
        else:
            return "CRITICAL: Freeze all changes; focus on stability"

    def alert_team(self):
        msg = f"ERROR BUDGET EXCEEDED: {self.service_name}"
        print(f"\n🚨 {msg}\n")
        # Send to Slack, PagerDuty, etc.


# Example usage
if __name__ == "__main__":
    payment_budget = ErrorBudget(0.999, "payment-service")

    # Record incidents
    payment_budget.record_incident(
        "Database connection pool exhaustion",
        downtime_minutes=15,
        severity="HIGH",
        cause="Undetected leak"
    )
    payment_budget.record_incident(
        "Bad deployment (config typo)",
        downtime_minutes=5,
        severity="CRITICAL",
        cause="Manual deployment error"
    )
    payment_budget.record_deployment_error_rate(
        error_rate=0.002,  # 0.2% error rate
        requests_deployed=50000
    )

    # Check status
    print(json.dumps(payment_budget.get_status(), indent=2))
# toil-tracking.yaml - Weekly toil audit
team: Backend
week: "2025-02-10"
target_toil_percent: 30
toil_categories:
  manual_deployments:
    description: "Running manual deploy scripts, waiting for tests"
    hours: 8.5
    incidents:
      - "Deploy to staging: 1hr"
      - "Deploy to prod: 1.5hrs"
      - "Rollback deployment: 0.5hrs"
      - "Manual smoke tests: 1.5hrs"
      - "Waiting for CI/CD: 4hrs (inefficiency)"
    automation_goal: "Fully automated pipeline with auto-approval"
    estimated_savings: "7 hours/week"
  incident_response:
    description: "Debugging, investigating, fixing incidents"
    hours: 6
    incidents:
      - "Database slowdown diagnosis: 2hrs"
      - "Memory leak investigation: 2hrs"
      - "Post-incident cleanup: 1hr"
      - "Runbook updates: 1hr"
    automation_goal: "Auto-remediation for common issues"
    estimated_savings: "3 hours/week (via auto-remediation)"
  manual_monitoring:
    description: "Checking dashboards, running diagnostic queries"
    hours: 4
    incidents:
      - "Daily health checks: 1hr"
      - "Load monitoring: 1.5hrs"
      - "Manual log analysis: 1.5hrs"
    automation_goal: "Automated alerts, pre-built queries"
    estimated_savings: "2 hours/week"
  configuration_updates:
    description: "Manual config changes, syncing across systems"
    hours: 3
    incidents:
      - "Database tuning: 1.5hrs"
      - "Env variable updates: 1hr"
      - "Firewall rule changes: 0.5hrs"
    automation_goal: "Infrastructure-as-code, GitOps"
    estimated_savings: "2.5 hours/week"
  scaling_management:
    description: "Manual scaling during traffic spikes"
    hours: 2
    incidents:
      - "Black Friday manual scaling: 2hrs"
    automation_goal: "Predictive auto-scaling"
    estimated_savings: "2 hours/week"
summary:
  total_hours: 23.5
  total_engineer_hours: 160  # 4 engineers × 40 hrs
  toil_percentage: 14.7
  status: "EXCELLENT (below 30% target)"
recommendations:
  - "Continue automating deployment pipeline"
  - "Implement auto-remediation for top 5 incidents"
  - "Build automated monitoring queries"
# Error Budget Governance Policy
## Service: Payment Processing
**SLO**: 99.95% | **Budget**: 21.6 minutes/month
### Budget Spending Rules
| Scenario | Budget Cost | Approval | Notes |
|----------|------------|----------|-------|
| Risky deployment (may fail) | 15 min | PM + Tech Lead | Worth the risk |
| Chaos testing | 5 min | Team consensus | Prevents future incidents |
| Experimental feature | 10 min | Architect | Controlled rollout |
| Major refactoring deploy | 20 min | CTO | Critical path |
| Emergency fix | Actual impact | VP Eng | After-the-fact review |
### Monthly Schedule
**Week 1**: Measure incidents, publish budget remaining
**Week 2-3**: Normal operations, no risky deploys if budget < 10 min
**Week 4**: Remaining budget planning (freeze or spend)
### Decision Matrix
**If Budget > 50%** (> 10.8 min remaining):
- ✅ Ship risky features
- ✅ Run chaos experiments
- ✅ Try experimental deployments
- ✅ Aggressive rollout schedules
**If Budget 20-50%** (4.3-10.8 min remaining):
- ⚠️ Only ship critical features
- ✅ Deploy to canary/region first
- ❌ No chaos testing
- ❌ No experimental features
**If Budget < 20%** (< 4.3 min remaining):
- ❌ Feature freeze until next month
- ✅ Only critical hotfixes
- ✅ Increase monitoring
- ✅ Focus on reliability projects
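A policy like this can be encoded as a small check that runs before a planned activity. The sketch below is one possible reading of the matrix, not a definitive implementation. It assumes all four planned activities from the spending table require the top tier, and that an activity's planned cost must fit within the remaining budget. The tier names and function names are hypothetical.

```python
MONTHLY_BUDGET_MIN = 21.6  # 99.95% SLO over a 30-day month

# Planned budget costs in minutes, from the spending rules table
PLANNED_COST_MIN = {
    "risky_deployment": 15.0,
    "chaos_testing": 5.0,
    "experimental_feature": 10.0,
    "major_refactoring_deploy": 20.0,
}


def budget_tier(remaining_min: float, total_min: float = MONTHLY_BUDGET_MIN) -> str:
    """Map remaining budget to a decision-matrix tier."""
    fraction = remaining_min / total_min
    if fraction > 0.5:
        return "spend"    # risky features and experiments allowed
    if fraction >= 0.2:
        return "caution"  # critical features only, canary first
    return "freeze"       # hotfixes and reliability work only


def may_proceed(activity: str, remaining_min: float) -> bool:
    """Assumed rule: top tier required, and the planned cost must fit."""
    cost = PLANNED_COST_MIN[activity]
    return budget_tier(remaining_min) == "spend" and cost <= remaining_min


if __name__ == "__main__":
    for remaining in (21.6, 12.0, 8.0, 4.0):
        allowed = [a for a in PLANNED_COST_MIN if may_proceed(a, remaining)]
        print(f"{remaining:>5.1f} min left: tier={budget_tier(remaining)}, allowed={allowed}")
```

Emergency fixes are left out on purpose, since the policy charges them at actual impact and reviews them after the fact.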
When to Prioritize Error Budget vs Toil
Prioritize error budget work when:
- SLO is at risk (trending down)
- Recent incidents consumed budget
- Team is reactive to failures
- Need to slow deployment velocity
- Building fundamental reliability
Prioritize toil reduction when:
- Toil is >40% of engineer time
- Team is burned out from manual work
- Deployment takes >1 hour
- Need to increase feature velocity
- Team has spare capacity
Patterns and Pitfalls
Design Review Checklist
- SLO is defined for each critical service (99.9% minimum)
- Error budget is calculated and tracked monthly
- Budget status is visible to entire team (dashboard/meeting)
- Team has decision rules for spending budget (governance policy)
- Toil is measured weekly or bi-weekly across team
- Toil measurement includes categories (deploy, incident, monitor, config, scaling)
- Top 5 toil sources are identified and ranked by impact
- Automation roadmap exists for toil reduction
- Post-incident reviews include 'what toil reduction could have prevented this?'
- Toil elimination projects are prioritized equally with feature work
Self-Check
- Can you state your SLO and current error budget without checking a document?
- What percentage of your time goes to toil? Is it trending down?
- Have you spent your error budget intentionally this month, or did incidents consume it?
- What single toil item would save you the most time if automated?
- Does your team discuss error budget in planning meetings?
Next Steps
- Day 1: Calculate SLO and error budget for 3 critical services
- Week 1: Measure toil across team for one week
- Week 2: Publish budget status and toil metrics to team
- Week 3: Create automation roadmap for top 3 toil sources
- Month 1: Implement first toil automation (likely CI/CD improvement)
- Ongoing: Weekly budget/toil tracking; monthly planning