Metrics and Monitoring
Measure system behavior with metrics, using the RED and USE methods to identify performance issues.
TL;DR
Metrics measure system behavior: requests per second, latency, error rates, CPU usage, memory. The RED method focuses on services: Rate (requests/sec), Errors (error rate), Duration (latency). The USE method focuses on resources: Utilization (% busy), Saturation (queue depth), Errors. Combine both for complete visibility. Export metrics to time-series databases (Prometheus, InfluxDB) and visualize with Grafana. Set up alerts on key metrics: alert when error rate spikes, when latency exceeds SLO, when resource utilization nears limits. Metrics enable proactive monitoring: catch problems before users notice.
Learning Objectives
- Instrument services to emit metrics
- Understand RED method (Rate, Errors, Duration) for services
- Understand USE method (Utilization, Saturation, Errors) for resources
- Set up metric collection and storage
- Create effective dashboards and alerts
- Use metrics to identify performance issues
Motivating Scenario
A service has a performance problem. Users report slowness, but you have no metrics. Is it a code issue? A database issue? The network? The service was deployed without instrumentation, so you're blind. With proper metrics you would see: request rate constant, error rate normal, but latency increasing and database connection pool saturation rising. You've identified the problem: database contention.
Core Concepts
RED Method
Rate: How many requests per second? Errors: What percentage fail? Duration: What's the response time? Track these for each service/endpoint. Alert when error rate spikes or latency exceeds targets.
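With Prometheus-style metrics, the three RED signals map to queries like the following sketch. The metric names assume the instrumentation in the practical example below; the PromQL strings are illustrative, not a prescribed dashboard.
# PromQL sketches for the three RED signals, assuming the metric names
# from the practical example below.
RED_QUERIES = {
    # Rate: requests per second, per service and endpoint
    'rate': 'sum by (service, endpoint) (rate(service_requests_total[5m]))',
    # Errors: fraction of requests that fail
    'error_ratio': (
        'sum by (service, endpoint) (rate(service_errors_total[5m])) / '
        'sum by (service, endpoint) (rate(service_requests_total[5m]))'
    ),
    # Duration: 95th percentile latency from histogram buckets
    'duration_p95': (
        'histogram_quantile(0.95, sum by (le, service, endpoint) '
        '(rate(service_request_duration_seconds_bucket[5m])))'
    ),
}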
USE Method
Utilization: What percentage of capacity is in use? Saturation: How many requests are queued? Errors: How many operations fail? Apply to resources: CPU, memory, disk, network, database connections.
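As a concrete illustration, the sketch below exports host-level USE metrics as Prometheus gauges. It assumes the third-party psutil library; 1-minute load average per core stands in for CPU saturation, and the error dimension would come from sources such as network interface error counters.
import os
import psutil  # assumed dependency for host-level stats
from prometheus_client import Gauge

# Utilization: how busy the resource is
cpu_utilization = Gauge('host_cpu_utilization_percent', 'CPU utilization (%)')
memory_utilization = Gauge('host_memory_utilization_percent', 'Memory utilization (%)')
# Saturation: queued work; load average per core as a rough proxy
cpu_saturation = Gauge('host_cpu_load_per_core', '1m load average divided by core count')

def collect_use_metrics():
    cpu_utilization.set(psutil.cpu_percent(interval=None))
    memory_utilization.set(psutil.virtual_memory().percent)
    load_1m, _, _ = psutil.getloadavg()
    cpu_saturation.set(load_1m / (os.cpu_count() or 1))
    # Errors (the third USE dimension) would come from sources like
    # psutil.net_io_counters().errin / .errout for network interfaces.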
Golden Signals
Google's four golden signals: latency (response time), traffic (request volume), errors (failure rate), saturation (capacity used). Monitor all four for each service.
Time-Series Metrics
Store metrics over time in systems designed for time-series data. Enable trending: is latency getting worse? Is traffic increasing? Historical data enables capacity planning.
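For example, with Prometheus as the time-series store, trend and capacity questions become queries. The strings below are illustrative sketches; node_filesystem_avail_bytes is a node_exporter metric and an assumption here.
# Illustrative PromQL trend queries against a time-series store.
# Is traffic growing? Current request rate vs. the same window one week ago.
TRAFFIC_WEEK_OVER_WEEK = (
    'sum(rate(service_requests_total[1h])) / '
    'sum(rate(service_requests_total[1h] offset 1w))'
)
# Capacity planning: will the disk fill within 4 hours, extrapolating
# from the last 6 hours of free-space samples?
DISK_FULL_IN_4H = 'predict_linear(node_filesystem_avail_bytes[6h], 4 * 3600) < 0'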
Practical Example
- Python
- Go
- Node.js
from prometheus_client import Counter, Histogram, Gauge
import time

# RED metrics for the service
request_count = Counter(
    'service_requests_total',
    'Total requests',
    ['service', 'endpoint', 'status']
)
request_duration = Histogram(
    'service_request_duration_seconds',
    'Request duration',
    ['service', 'endpoint']
)
request_errors = Counter(
    'service_errors_total',
    'Total errors',
    ['service', 'endpoint', 'error_type']
)

# USE metrics for resources
db_pool_utilization = Gauge(
    'db_pool_utilization',
    'Database connection pool utilization',
    ['service']
)
db_pool_saturation = Gauge(
    'db_pool_saturation',
    'Database connection pool queue depth',
    ['service']
)

def handle_request(service, endpoint, request_func):
    """Instrument a request with RED metrics."""
    start = time.time()
    status = 'success'
    try:
        return request_func()
    except Exception as e:
        status = 'error'
        request_errors.labels(
            service=service,
            endpoint=endpoint,
            error_type=type(e).__name__
        ).inc()
        raise
    finally:
        duration = time.time() - start
        request_count.labels(
            service=service,
            endpoint=endpoint,
            status=status
        ).inc()
        request_duration.labels(
            service=service,
            endpoint=endpoint
        ).observe(duration)

def monitor_database_pool(pool):
    """Report database pool health (USE metrics)."""
    # Utilization: share of the pool's capacity in use
    connections = pool.size()
    capacity = pool.max_size()
    utilization = (connections / capacity) * 100
    db_pool_utilization.labels(service='order-service').set(utilization)
    # Saturation: callers waiting for a connection
    queue_depth = pool.waiting_count()
    db_pool_saturation.labels(service='order-service').set(queue_depth)
import "github.com/prometheus/client_golang/prometheus"
// RED metrics
var (
requestCount = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "service_requests_total",
Help: "Total requests",
},
[]string{"service", "endpoint", "status"},
)
requestDuration = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "service_request_duration_seconds",
Help: "Request duration",
},
[]string{"service", "endpoint"},
)
requestErrors = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "service_errors_total",
Help: "Total errors",
},
[]string{"service", "endpoint", "error_type"},
)
)
// USE metrics
var (
dbPoolUtilization = prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Name: "db_pool_utilization",
Help: "Database connection pool utilization",
},
[]string{"service"},
)
dbPoolSaturation = prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Name: "db_pool_saturation",
Help: "Database connection queue depth",
},
[]string{"service"},
)
)
func HandleRequest(service, endpoint string, handler func() error) error {
start := time.Now()
status := "success"
errorType := ""
err := handler()
if err != nil {
status = "error"
errorType = reflect.TypeOf(err).String()
requestErrors.WithLabelValues(service, endpoint, errorType).Inc()
}
duration := time.Since(start).Seconds()
requestCount.WithLabelValues(service, endpoint, status).Inc()
requestDuration.WithLabelValues(service, endpoint).Observe(duration)
return err
}
func MonitorDatabasePool(pool *sql.DB) {
stats := pool.Stats()
utilization := float64(stats.OpenConnections) / float64(10) * 100
dbPoolUtilization.WithLabelValues("order-service").Set(utilization)
}
const prometheus = require('prom-client');

// RED metrics
const requestCounter = new prometheus.Counter({
  name: 'service_requests_total',
  help: 'Total requests',
  labelNames: ['service', 'endpoint', 'status']
});
const requestDurationHistogram = new prometheus.Histogram({
  name: 'service_request_duration_seconds',
  help: 'Request duration',
  labelNames: ['service', 'endpoint'],
  buckets: [0.1, 0.5, 1, 2, 5]
});
const errorCounter = new prometheus.Counter({
  name: 'service_errors_total',
  help: 'Total errors',
  labelNames: ['service', 'endpoint', 'error_type']
});

// USE metrics
const dbPoolUtilization = new prometheus.Gauge({
  name: 'db_pool_utilization',
  help: 'Database connection pool utilization',
  labelNames: ['service']
});

async function handleRequest(service, endpoint, handler) {
  const start = Date.now();
  let status = 'success';
  try {
    return await handler();
  } catch (error) {
    status = 'error';
    errorCounter.labels(service, endpoint, error.constructor.name).inc();
    throw error;
  } finally {
    const duration = (Date.now() - start) / 1000;
    requestCounter.labels(service, endpoint, status).inc();
    requestDurationHistogram.labels(service, endpoint).observe(duration);
  }
}
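However a service is instrumented, the metrics still have to be exposed for scraping. A minimal Python sketch, assuming the handle_request helper from the Python tab above and a hypothetical get_user handler:
from prometheus_client import start_http_server
import time

def get_user():
    # Hypothetical business logic standing in for a real handler.
    return {'id': 1, 'name': 'Ada'}

if __name__ == '__main__':
    start_http_server(8000)  # expose /metrics on port 8000 for Prometheus to scrape
    while True:
        handle_request('order-service', '/users/:id', get_user)
        time.sleep(1)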
When to Use / When Not to Use
When to use:
- Production systems with high traffic
- Systems where performance is critical
- Distributed services with many endpoints
- Systems requiring SLO/SLA compliance
When not to use (or keep minimal):
- Development and testing environments
- Simple applications with few endpoints
- Systems with lenient SLOs
Patterns and Pitfalls
Design Review Checklist
- All services emit RED metrics (Rate, Errors, Duration)
- Resource metrics (USE) are collected for critical resources
- Metrics have reasonable label cardinality (not millions of combinations)
- Dashboards display key metrics (latency, error rate, utilization)
- Alerts are configured for SLO violations
- Metrics retention policies align with requirements
Self-Check
- What are the RED metrics and why are they important?
- How do you detect resource saturation using metrics?
- What metrics would you use to assess database health?
Metrics make systems observable: the RED method covers services, the USE method covers resources. Combine both and you understand your system's health.
Advanced Metrics Topics
Cardinality and Label Design
High cardinality metrics (many unique label combinations) can overwhelm time-series databases. Design labels carefully:
# Bad: high-cardinality label (unbounded user IDs)
user_requests = Counter(
    'requests_by_user',
    'Requests per user',
    ['user_id']  # can have millions of unique values
)

# Good: bounded label (user tier or segment)
user_requests = Counter(
    'requests_by_tier',
    'Requests per tier',
    ['user_tier']  # only: free, premium, enterprise
)

# Bad: raw request path (unbounded number of unique paths)
path_latency = Histogram(
    'request_latency_seconds',
    'Latency by path',
    ['path']  # /users/1, /users/2, /users/3, ... potentially millions
)

# Good: endpoint pattern (grouped)
path_latency = Histogram(
    'request_latency_seconds',
    'Latency by endpoint',
    ['endpoint']  # /users/:id, /posts/:id (bounded)
)
Alerting Rules
Define alerting rules based on metrics to catch problems early:
# Prometheus alerting rules (alerting.yaml)
groups:
  - name: service_health
    rules:
      # Alert if error rate exceeds 1%
      - alert: HighErrorRate
        expr: |
          (
            sum by (service, endpoint) (rate(service_errors_total[5m]))
            /
            sum by (service, endpoint) (rate(service_requests_total[5m]))
          ) > 0.01
        for: 5m
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      # Alert if p95 latency exceeds 500ms
      - alert: HighLatency
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, service, endpoint) (rate(service_request_duration_seconds_bucket[5m]))
          ) > 0.5
        for: 5m

      # Alert if the service's scrape target has been down for 2 minutes
      - alert: ServiceDown
        expr: up{job="api-service"} == 0
        for: 2m

      # Alert on saturation (queue depth above 1000 for 1 minute)
      - alert: QueueSaturation
        expr: job_queue_depth > 1000
        for: 1m
Observability Anti-Patterns
Common mistakes to avoid:
# Anti-pattern: logging instead of metrics
# BAD
log.info(f"Request processed in {latency}ms")
# GOOD
request_latency.labels(endpoint="/users").observe(latency)

# Anti-pattern: aggregating in the application instead of the time-series database
# BAD
def average_latency():
    return sum(latencies) / len(latencies)  # only sees the current instance's in-memory data
# GOOD: let the time-series database aggregate across instances
# Query: avg(rate(request_duration_seconds_sum[5m])) / avg(rate(request_duration_seconds_count[5m]))

# Anti-pattern: metrics without context (no service/endpoint labels)
# BAD
response_time = Gauge('response_time_ms', 'Response time in milliseconds')
# GOOD: always add identifying labels
response_time = Gauge(
    'response_time_ms',
    'Response time in milliseconds',
    ['service', 'endpoint', 'method']
)
Metrics in Practice: Case Studies
E-commerce Platform Monitoring
Metrics strategy for an online store (labels shown in brackets; a short instrumentation sketch for the business metrics follows the list):
# Service-level (RED)
- api_requests_total[service, endpoint, status]
- api_request_duration_seconds[service, endpoint]
- api_errors_total[service, endpoint, error_type]
# Resource-level (USE)
- database_connection_pool_active[database]
- database_query_duration_seconds_bucket[database, query_type]
- cache_hit_ratio[cache_name]
- kafka_consumer_lag[topic, partition]
# Business metrics
- orders_created_total[region, product_category]
- checkout_abandonment_rate[region]
- product_search_latency_p95[region]
- inventory_allocation_failures_total[warehouse]
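Business metrics are registered and updated with the same client library as technical ones. A brief sketch of how the order counter above might be wired in (record_order and the order object are hypothetical):
from prometheus_client import Counter

orders_created = Counter(
    'orders_created_total',
    'Orders successfully created',
    ['region', 'product_category']
)

def record_order(order):
    # Called after an order is committed; 'order' is a hypothetical domain object.
    orders_created.labels(
        region=order.region,
        product_category=order.category
    ).inc()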
Microservices Monitoring
Metrics for a distributed system:
# Every service should emit these baseline metrics
baseline_metrics = {
    # Traffic
    'requests_total': 'Counter[service, endpoint, method, status]',
    'request_duration_seconds': 'Histogram[service, endpoint]',
    # Errors
    'errors_total': 'Counter[service, error_type]',
    'exception_rate': 'Rate[service, exception_class]',
    # Saturation
    'queue_depth': 'Gauge[queue_name]',
    'threadpool_active_threads': 'Gauge[pool_name]',
    'connection_pool_active': 'Gauge[pool_name]',
    # Dependencies
    'downstream_call_duration_seconds': 'Histogram[service, downstream, endpoint]',
    'downstream_call_errors_total': 'Counter[service, downstream]',
}
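The dependency metrics in this baseline can be captured with a thin wrapper around outbound calls. A sketch using the Python Prometheus client; the wrapper and its arguments are illustrative:
import time
from prometheus_client import Counter, Histogram

downstream_call_duration = Histogram(
    'downstream_call_duration_seconds',
    'Duration of calls to downstream services',
    ['service', 'downstream', 'endpoint']
)
downstream_call_errors = Counter(
    'downstream_call_errors_total',
    'Failed calls to downstream services',
    ['service', 'downstream']
)

def call_downstream(service, downstream, endpoint, call):
    """Wrap an outbound call with dependency metrics."""
    start = time.time()
    try:
        return call()
    except Exception:
        downstream_call_errors.labels(service=service, downstream=downstream).inc()
        raise
    finally:
        downstream_call_duration.labels(
            service=service, downstream=downstream, endpoint=endpoint
        ).observe(time.time() - start)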
Metrics vs Logs vs Traces
Understanding when to use each:
Metrics: "How much?" "How fast?" "How many errors?"
- Structured, aggregated, time-series
- Good for: trends, alerting, dashboards
- Example: 50 requests/sec, p99 latency 200ms
Logs: "What happened?"
- Detailed, unstructured, event-based
- Good for: debugging, investigation, audit
- Example: "User 123 logged in from 192.168.1.1"
Traces: "Why is it slow?"
- Distributed call path, timing breakdown
- Good for: performance debugging, understanding flow
- Example: Request took 200ms: 100ms in service A, 80ms in DB, 20ms in service B
Best practice: Use all three together
- Metrics trigger alerts
- Alerts lead to logs for investigation
- Traces show the exact flow for debugging
Next Steps
- Instrument all services with RED metrics
- Set up metric collection (Prometheus)
- Create dashboards (Grafana)
- Configure alerts on SLO violations
- Implement tracing (Jaeger, Zipkin) for distributed debugging
- Design runbooks for common alerts
References
- Tom Wilkie, RED Method for Distributed Systems
- Brendan Gregg, USE Method for Performance Analysis
- Google Cloud: The Four Golden Signals
- "Observability Engineering" by Charity Majors, Liz Fong-Jones, George Miranda
- Prometheus Best Practices: https://prometheus.io/docs/practices/