Alerting

An alert that fires at 3 AM and turns out to be a false positive costs an engineer sleep and erodes trust in the alerting system. Alert fatigue is real: if alerts cry wolf too often, people stop responding. Effective alerting is hard because it requires three decisions: what to alert on (not everything), when to alert (on sustained problems, not temporary blips), and how to alert (with actionable information, not noise).
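As a rough illustration of the "when to alert" point, the sketch below fires only after a condition has held for several consecutive checks, so a single blip does not page anyone. The class name `SustainedAlert`, the 5% error-rate threshold, and the three-check window are hypothetical values chosen for the example, not settings from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class SustainedAlert:
    """Fires only once the value has exceeded `threshold` for `required` consecutive checks."""

    threshold: float   # hypothetical limit, e.g. an error rate of 5%
    required: int      # consecutive breaching checks needed before firing
    _streak: int = 0   # current run of breaching checks

    def observe(self, value: float) -> bool:
        # Extend the streak on a breach; any healthy sample resets it.
        if value > self.threshold:
            self._streak += 1
        else:
            self._streak = 0
        return self._streak >= self.required


alert = SustainedAlert(threshold=0.05, required=3)

# Illustrative error-rate samples: one isolated spike, then a sustained breach.
samples = [0.01, 0.09, 0.02, 0.07, 0.08, 0.09, 0.01]
fired = [alert.observe(s) for s in samples]
print(fired)
```

The isolated spike in the second sample never fires; only the third consecutive breach does. Many alerting systems expose a similar "hold for a duration" setting, which trades a short detection delay for fewer false pages.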

This section covers:

  • SLO-Based Alerting: Alert when you breach service-level objectives, not arbitrary thresholds
  • Runbooks and On-Call: Structure incident response with runbooks and clear ownership
  • Incident Response: Manage the lifecycle from detection to postmortem