Metrics
Metrics are numeric measurements of system behavior: requests processed, errors encountered, memory used, latency experienced. Unlike logs, which record individual events, metrics aggregate or sample those events over time. A system that handles 1 million requests may write millions of log entries, yet the same traffic can be summarized in a few thousand metric data points.
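To make the contrast with logs concrete, here is a minimal sketch of how per-request events might be collapsed into a handful of metric values. The event fields (`status`, `latency_ms`), the metric names, and the bucket boundaries are all illustrative assumptions, not a prescribed schema.

```python
def summarize(events, bucket_bounds_ms=(50, 100, 250, 500, 1000)):
    """Collapse per-request events into a fixed-size set of metric values.

    `events` is an iterable of dicts with hypothetical keys
    "status" (HTTP status code) and "latency_ms" (request duration).
    """
    requests_total = 0
    errors_total = 0
    # One counter per latency bucket, plus a final overflow bucket.
    buckets = [0] * (len(bucket_bounds_ms) + 1)

    for event in events:
        requests_total += 1
        if event["status"] >= 500:
            errors_total += 1
        # Place the latency in the first bucket whose upper bound covers it.
        for i, bound in enumerate(bucket_bounds_ms):
            if event["latency_ms"] <= bound:
                buckets[i] += 1
                break
        else:
            buckets[-1] += 1  # slower than the largest bound

    return {
        "requests_total": requests_total,
        "errors_total": errors_total,
        "latency_buckets": buckets,
    }


sample_events = [
    {"status": 200, "latency_ms": 30},
    {"status": 500, "latency_ms": 700},
    {"status": 200, "latency_ms": 2000},
]
summary = summarize(sample_events)
```

However many events pass through, the output stays the same size: two counters and one small histogram. That fixed size is what keeps metrics cheap to store and query compared with the logs they summarize.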
Good metrics enable:
- Capacity Planning: How much load can your system handle? When will you run out of resources?
- Performance Analysis: Is latency degrading? Which endpoints are slow?
- Cost Visibility: How does resource consumption map to business value?
- Alerting: When a metric breaches its threshold, page the on-call engineer.
- Dashboards: See system health at a glance.
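As a rough illustration of the alerting item, the sketch below checks a series of error-rate samples against a threshold. The function name, the 5% threshold, and the rule that several consecutive samples must breach before paging are assumptions chosen for the example; real alerting systems express this in their own rule syntax.

```python
def should_page(error_rates, threshold=0.05, consecutive=3):
    """Return True if the most recent `consecutive` samples all exceed `threshold`.

    `error_rates` is a list of error-rate samples (0.0 to 1.0), oldest first.
    Requiring several breaching samples in a row is an assumed policy that
    avoids paging on a single noisy data point.
    """
    if len(error_rates) < consecutive:
        return False  # not enough data to decide
    return all(rate > threshold for rate in error_rates[-consecutive:])


sustained_breach = should_page([0.01, 0.06, 0.07, 0.08])
recovered = should_page([0.06, 0.07, 0.01])
```

The first call reports a sustained breach because the last three samples are all above 5%; the second does not, because the most recent sample has dropped back below the threshold.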
This section covers metrics as a discipline: the golden signals that matter, methodologies for choosing what to measure, dashboard design, and the use of metrics to drive decisions.
📄️ Golden Signals: Latency, Traffic, Errors, Saturation
Google's four key metrics for understanding service health: measure these well, and you'll know your system.
📄️ RED and USE Methodologies
Measure system performance using RED (for services) and USE (for resources).
📄️ Dashboards and KPIs
Design dashboards and key performance indicators for operational visibility.