SLO-Based Alerts vs Static Thresholds

Alert on service-level objectives, not arbitrary thresholds. Align alerts with actual user impact.

TL;DR

Static threshold alerting: "Alert if latency > 500ms." Problem: if your SLO only requires 95% of requests to complete under 1000ms, a few requests over 500ms don't violate the SLO. You alert on a non-issue and on-call gets paged for nothing. SLO-based alerting: "Alert if the error budget is depleting faster than expected." An SLO of 99.9% availability grants roughly 43 minutes of downtime per month. If you consume 40 of those minutes in the first week, you're on pace to breach the SLO: alert immediately. Static thresholds measure system behavior; SLO-based alerts measure user impact. SLO-based alerts cut false positives and focus on what the business cares about: meeting commitments to customers.

Learning Objectives

  • Define SLOs (Service-Level Objectives) and error budgets
  • Calculate error budget and remaining budget
  • Design alerts based on burn rate and budget consumption
  • Distinguish between SLO violations and degradation
  • Use burn rate to automate escalation decisions
  • Reduce false positives through SLO-aligned alerting
  • Use error budget to guide feature velocity vs. reliability trade-offs

Motivating Scenario

Your search service has two different SLOs: search availability (99.9%) and search latency (p95 < 500ms). A bug causes search latency to degrade: 30% of requests now exceed 500ms. You have two options for alerting:

Option 1 (Static Threshold): Alert when p95 latency > 500ms. Alert fires immediately. On-call paged.

Option 2 (SLO-Based): Alert when the burn rate on the latency SLO exceeds 5 for 15 minutes. Calculation: your monthly budget allows 5% of requests to exceed 500ms. Currently 30% exceed it, so you're consuming budget 6x faster than the allowed monthly pace. Alert fires, on-call paged.

Both alerts page the on-call engineer, but the reasoning differs. The static threshold is context-blind: maybe 500ms isn't critical for your business. The SLO-based alert is business-aligned: you know the user impact (30% of searches exceed the latency target, against a 5% budget), and you know the trend (at this rate you'll consume the monthly budget in about five days).
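
To make the arithmetic explicit, here is a minimal sketch of the Option 2 calculation (the 30% and 5% figures come from the scenario above; everything else, including variable names, is illustrative):

# Burn-rate calculation for the latency SLO in the scenario above.
slo_budget_fraction = 0.05      # latency SLO: 5% of requests may exceed 500ms
observed_bad_fraction = 0.30    # currently 30% of requests exceed 500ms

burn_rate = observed_bad_fraction / slo_budget_fraction
print(f"burn rate = {burn_rate:.1f}x")                     # 6.0x: above the 5x page threshold
print(f"budget exhausted in ~{30 / burn_rate:.0f} days")   # ~5 days of a 30-day window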

Core Concepts

Error Budget Depletion: Burn Rate Determines Urgency

Error Budget and Burn Rate

SLO: Service-Level Objective. Example: "99.9% of requests succeed within 1000ms."

Error budget: The allowed failures. If SLO is 99.9% availability, error budget is 0.1% failures.

For a monthly window: 0.1% of all requests allowed to fail. If you handle 1B requests/month, error budget is 1M failures.
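
A minimal worked example of that arithmetic (the 30-day window and 1B requests/month are the figures used above; the variable names are mine):

# Error budget arithmetic for a 99.9% availability SLO over a 30-day window.
slo_target = 0.999
window_minutes = 30 * 24 * 60                 # 43,200 minutes in a 30-day month
monthly_requests = 1_000_000_000              # illustrative traffic volume

error_budget_fraction = 1 - slo_target                        # 0.001 (0.1%)
budget_requests = monthly_requests * error_budget_fraction    # ~1,000,000 allowed failures
budget_minutes = window_minutes * error_budget_fraction       # ~43.2 minutes of full downtime

print(f"error budget: {budget_requests:,.0f} failed requests "
      f"or ~{budget_minutes:.1f} minutes of total downtime")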

Burn rate: How fast you're consuming error budget relative to the allowed rate.

  • Burn rate = 1.0: Consuming budget at exactly the rate that exhausts it by the end of the window; you just meet the SLO
  • Burn rate = 5.0: Consuming budget 5x as fast; you will breach the SLO unless the trend changes
  • Burn rate = 10.0+: Critical; a monthly budget is exhausted within days, or within hours at higher rates
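
In code, a burn rate is just the observed failure fraction divided by the fraction the SLO allows. A small sketch (the helper name is mine, not from any library):

def burn_rate(observed_error_fraction: float, slo_target: float) -> float:
    """Ratio of the observed failure rate to the failure rate the SLO allows."""
    allowed_error_fraction = 1 - slo_target          # e.g. 0.001 for a 99.9% SLO
    return observed_error_fraction / allowed_error_fraction

# Examples for a 99.9% availability SLO:
print(round(burn_rate(0.001, 0.999), 1))   # 1.0  -> on pace to exactly spend the budget
print(round(burn_rate(0.005, 0.999), 1))   # 5.0  -> budget gone in ~6 days of a 30-day window
print(round(burn_rate(0.010, 0.999), 1))   # 10.0 -> budget gone in ~3 days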

Alerting by Burn Rate Windows

Multiple burn rate windows catch incidents at different stages:

Burn Rate | Duration | Action | Purpose
> 100 | 1 min | Page on-call immediately (total outage) | Catch catastrophic failures instantly
> 10 | 5 min | Page on-call immediately (will breach SLO soon) | Catch severe incidents early
> 5 | 15 min | Page on-call (monthly budget exhausted in ~6 days at this rate) | Catch major issues before they worsen
> 1 | Multiple hours | Warn / create ticket (SLO at risk) | Long-term trend monitoring
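
The table can be read as an escalation policy. A sketch of that mapping (the thresholds mirror the table; the data structure and function names are illustrative):

# Map a burn rate sustained over a window to an escalation action.
WINDOWS = [
    # (min burn rate, sustained minutes, action)
    (100, 1,   "page: total outage"),
    (10,  5,   "page: will breach SLO soon"),
    (5,   15,  "page: major issue, budget burning fast"),
    (1,   120, "ticket: SLO at risk, investigate during business hours"),
]

def escalation(burn: float, sustained_minutes: int) -> str:
    for min_burn, min_duration, action in WINDOWS:
        if burn > min_burn and sustained_minutes >= min_duration:
            return action
    return "no action"

print(escalation(6.0, 15))    # page: major issue, budget burning fast
print(escalation(1.5, 180))   # ticket: SLO at risk, investigate during business hours
print(escalation(3.0, 5))     # no action (not sustained long enough at a paging rate)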

Practical Example

# Prometheus alert rules for SLO-based alerting

groups:
  - name: slo_alerts
    interval: 30s
    rules:
      # SLO: 99.9% availability (0.1% error budget)
      # Monthly window: ~2.5M requests, 2500 allowed failures

      - alert: APIAvailabilitySLOBreachImmediate
        expr: |
          (
            rate(http_requests_total{status=~"5.."}[5m]) /
            rate(http_requests_total[5m])
          ) > 0.01   # burn rate > 10 (10x the 0.1% budget)
        for: 5m
        labels:
          severity: critical
          slo: api_availability
        annotations:
          summary: "API availability error budget burning at >10x; SLO breach imminent"
          description: "Error rate {{ $value | humanizePercentage }} is more than 10x the 0.1% budget"

      - alert: APIAvailabilitySLOBurnRateWarning
        expr: |
          (
            rate(http_requests_total{status=~"5.."}[15m]) /
            rate(http_requests_total[15m])
          ) > 0.005  # burn rate > 5 (5 * 0.1%)
        for: 15m
        labels:
          severity: warning
          slo: api_availability
        annotations:
          summary: "API availability error budget burning quickly"
          description: "Error rate {{ $value | humanizePercentage }} is roughly 5x the allowed rate"

      - alert: APIAvailabilitySLOTrendingBreach
        expr: |
          (
            rate(http_requests_total{status=~"5.."}[1h]) /
            rate(http_requests_total[1h])
          ) > 0.002  # burn rate > 2 (2x the allowed monthly rate)
        for: 1h
        labels:
          severity: info
          slo: api_availability
        annotations:
          summary: "API availability will breach SLO if this trend continues"
          description: "Error rate {{ $value | humanizePercentage }} is about 2x the allowed monthly rate"

      # SLO: P95 latency < 500ms (5% of requests allowed above threshold)
      - alert: APILatencySLOBreach
        expr: |
          histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
        for: 5m
        labels:
          severity: critical
          slo: api_latency
        annotations:
          summary: "API P95 latency SLO breached"
          description: "P95 latency {{ $value | humanizeDuration }} exceeds the 500ms SLO"

      # Error budget remaining (informational)
      - alert: APIAvailabilityErrorBudgetLow
        expr: |
          (
            1 - (
              rate(http_requests_total{status!~"5.."}[30d]) /
              rate(http_requests_total[30d])
            )
          ) > 0.0009   # errors have consumed > 90% of the 0.1% monthly budget
        for: 1h
        labels:
          severity: info
          slo: api_availability
        annotations:
          summary: "API availability error budget 90% consumed"
          description: "30-day error rate {{ $value | humanizePercentage }} against a 0.1% budget"

When to Use / When Not to Use

Use SLO-Based Alerting
  1. Services with defined SLOs
  2. Need to reduce alert fatigue
  3. Multiple services with different criticality
  4. Want to align operations with business goals
  5. Need to make trade-off decisions (speed vs reliability)
Static Thresholds Still Useful
  1. Resource utilization (CPU, memory, disk)
  2. Infrastructure health (node down, replica lag)
  3. External system health (payment processor down)
  4. Early warning signals (before SLO breach)
  5. Development/test environments

Patterns and Pitfalls

SLO-based alerts detect user impact. Infrastructure alerts detect problems before they impact users: a high database CPU alert (infrastructure) fires before the latency SLO breaches (user impact). Use both, infrastructure alerts for early detection and SLO alerts as the business-facing measure.

If you set a 99.9% availability SLO but your infrastructure routinely delivers 99.99%, users will come to rely on the higher level and be disappointed whenever you actually spend your budget. If you set 99.9% but then alert so aggressively that you never come close to breaching it, you're operating to a stricter SLO than you committed to and wasting on-call time. The SLO should match business needs, not operational convenience.

Halfway through the month, you have 50% of the error budget remaining. A new feature request arrives but carries risk: it is expected to increase the error rate. You can evaluate the trade-off directly: at the projected error rate, will we breach the SLO this month? If no, deploy. If yes, fix or defer. The error budget becomes a resource for velocity trade-offs (see the sketch below).
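
A minimal sketch of that deploy decision (the SLO, traffic, and failure counts are hypothetical):

# Decide whether a risky deploy fits in the remaining error budget.
def can_deploy(slo_target: float,
               monthly_requests: float,
               failures_so_far: float,
               expected_extra_failures: float) -> bool:
    budget = (1 - slo_target) * monthly_requests    # total allowed failures this month
    remaining = budget - failures_so_far
    # A fuller model would also project baseline failures for the rest of the month.
    return expected_extra_failures <= remaining

# 99.9% SLO, 1B requests/month -> ~1M failures allowed.
# Halfway through the month we've spent 500k; the risky change may add more.
print(can_deploy(0.999, 1_000_000_000, 500_000, 300_000))   # True: deploy
print(can_deploy(0.999, 1_000_000_000, 500_000, 700_000))   # False: fix or defer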

Not all SLOs are about availability. Latency SLOs (p95 < 500ms) are equally important, and batch-processing SLOs (complete within 1 hour) matter for other kinds of services. Don't track only availability; track every SLO dimension you've committed to.

Early in the month you can tolerate higher burn rates, because there's still time to recover; late in the month the tolerance shrinks. Alert thresholds should tighten as the window end approaches (one way to formalize this is sketched below). Day 1: alert if burn rate > 20. Day 25: alert if burn rate > 2.
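
One way to make the tightening concrete is to estimate how long the remaining budget lasts at the current burn rate and alert when that is shorter than the time left in the window. A sketch of that policy (one possible heuristic, not a standard formula; all numbers are illustrative):

# Alert when the remaining error budget would run out before the window ends.
def hours_to_exhaustion(budget_remaining_fraction: float,
                        burn_rate: float,
                        window_hours: float = 30 * 24) -> float:
    # At burn rate 1.0, a full budget lasts exactly one 30-day window (720h).
    return budget_remaining_fraction * window_hours / burn_rate

def should_alert(budget_remaining_fraction: float, burn_rate: float,
                 hours_left_in_window: float) -> bool:
    return hours_to_exhaustion(budget_remaining_fraction, burn_rate) < hours_left_in_window

# Day 1 (~29 days left), full budget, burn rate 20: budget gone in 36h -> alert.
print(should_alert(1.0, 20, 29 * 24))    # True
# Day 25 (5 days left), 20% budget left, burn rate 2: gone in 72h < 120h -> alert.
print(should_alert(0.2, 2, 5 * 24))      # True
# Day 25, 80% budget left, burn rate 2: gone in 288h > 120h -> no page needed.
print(should_alert(0.8, 2, 5 * 24))      # False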

Incident: a hard drive fails and database replicas are down for two hours. Because most requests still succeed, the incident consumes only about 2% of the monthly error budget and doesn't breach the 99.9% SLO. But it was painful. Maybe you need a stricter SLO, or an additional SLO dimension. Review SLOs after significant incidents.

Design Review Checklist

  • Are SLOs defined for all critical services?
  • Can you explain the business reason for each SLO?
  • Are alerts based on burn rate, not arbitrary thresholds?
  • Do you have multiple burn rate windows (immediate, short-term, long-term)?
  • Is error budget visibility shared with the team?
  • Are SLO dashboards easily accessible to on-call?
  • Do you distinguish between SLO violations and degradation alerts?
  • Are infrastructure alerts layered with SLO alerts?
  • Is alert fatigue minimal (few false positives)?
  • Are SLOs reviewed and adjusted after major incidents?

Self-Check

  • What's your API service SLO and error budget?
  • How much error budget is left this month?
  • What's your current burn rate?
  • Which SLO is closest to being breached?
  • Can you explain why you set that SLO?

Next Steps

  1. Define SLOs: For each critical service, define availability and latency SLOs
  2. Calculate error budgets: For each SLO, determine allowed failure rate
  3. Set burn rate alerts: Create alerts at 10x, 5x, and 1x burn rates
  4. Dashboard SLO metrics: Visualize error budget, burn rate, and compliance
  5. Review monthly: Track SLO breach incidents; adjust SLOs if needed
