Observability (Cross-Reference)
Deep visibility into system behavior through metrics, logs, and traces. Full coverage in Section 12: Observability and Operations.
TL;DR
Observability is the ability to understand system behavior from external signals without internal code knowledge. It answers: "What's happening in production right now?" Built from three pillars: Metrics (quantitative, time-series data), Logs (textual event records), and Traces (request paths across services). Together, these provide visibility for incident diagnosis, performance optimization, and capacity planning. Strong observability is non-negotiable for all quality attributes: reliability, performance, security, and maintainability.
Why Observability Matters
Traditional monitoring asks: "Is the system up?" Modern observability asks: "Why did this happen? Where exactly is the problem? How do we fix it?"
- Without observability: System is slow. You have no idea why. Blind debugging.
- With observability: System is slow. Dashboard shows: 95% of latency is in database queries. One specific query is running slowly. You can fix it in minutes.
The Three Pillars
- Metrics: Quantitative Time-Series
- Logs: Textual Event Records
- Traces: Request Paths Across Services
- Profiles: CPU, Memory, I/O Analysis (a common fourth signal alongside the three pillars)
Metrics are numerical measurements over time.
Examples:
- CPU usage: 45% (now), 42% (1 min ago), 38% (2 min ago)
- Request latency: p50=50ms, p95=200ms, p99=1000ms
- Error rate: 2 errors per minute
- Disk free: 150GB
Tools:
- Prometheus (open-source, pull-based)
- Datadog (commercial, push-based)
- CloudWatch (AWS native)
- Grafana (visualization)
Key queries:
- "What's the error rate over the last hour?"
- "Is CPU usage trending up or down?"
- "Which service has the highest latency?"
Logs are detailed text records of what happened.
Examples:
2025-02-14T10:23:45Z INFO [checkout-service] User 12345 initiated checkout
2025-02-14T10:23:46Z DEBUG [payment-api] Processing payment for $99.99
2025-02-14T10:23:47Z ERROR [payment-api] Stripe API timeout (5s). Retrying...
2025-02-14T10:23:48Z INFO [payment-api] Payment succeeded (stripe_id=ch_123)
Tools:
- Elasticsearch + Kibana (open-source)
- Splunk (commercial)
- Datadog (commercial)
- CloudWatch Logs (AWS native)
Key queries:
- "Show all ERROR logs from the last 30 minutes"
- "What requests are timing out?"
- "Debug: what happened to this user's session?"
Traces follow a single request/transaction as it flows through the system.
Example trace (checkout flow):
Trace ID: abc123
└─ API Gateway (0-100ms)
   ├─ Checkout Service (0-95ms)
   │  ├─ Inventory Service (0-20ms) ← fast check
   │  ├─ Payment Service (20-85ms)
   │  │  ├─ Stripe API (20-80ms) ← bottleneck!
   │  │  └─ Logging (80-85ms)
   │  └─ Notification Service (85-95ms)
   └─ Load Balancer egress (95-100ms)
Tools:
- Jaeger (open-source)
- Zipkin (open-source)
- Datadog APM (commercial)
- AWS X-Ray (AWS native)
Key insights:
- Stripe API is the bottleneck (80ms)
- Inventory check is fast (20ms)
- Total latency: 100ms, most in payment processing
Profiles show where CPU/memory is actually spent.
Example CPU profile:
30% in payment_processing()
25% in database_query()
20% in encryption()
15% in logging()
10% other
Tools:
- pprof (Go)
- JProfiler (Java)
- py-spy (Python)
- Datadog Profiler (commercial, language-agnostic)
Use when:
- "Why is CPU 80%?"
- "Memory leak: which function allocates most?"
- "Which part of the code is slowest?"
Observability in Practice
Motivating Scenario
Without observability:
- Alert: "API error rate is 5%"
- Engineer: "What's causing it? I don't know. Let's restart the service?"
- The service restarts and the error rate drops. Lucky this time, but the root cause is still unknown.
With observability:
- Alert: "API error rate is 5%"
- Engineer: "Let me check the dashboard."
- Logs show: "Database connection pool exhausted"
- Metrics confirm: DB CPU is at 99%; connection demand is 150 against a pool limit of 100
- Traces show: Checkout service has 50+ slow requests, each holding 1 connection for 30+ seconds
- Root cause: Stripe API is slow, checkout service waits, connections starve
- Fix: Increase connection pool to 200, add timeout to Stripe calls
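Both parts of that fix are a few lines of configuration. A hedged sketch, assuming SQLAlchemy manages the connection pool and the Stripe call goes through requests (the actual stack in the scenario is not specified):

import requests
from sqlalchemy import create_engine

# Larger pool plus a bounded wait, so slow requests cannot starve everyone else
engine = create_engine(
    "postgresql://user:pass@db/orders",  # placeholder connection string
    pool_size=200,                       # raised from the previous limit of 100
    pool_timeout=5,                      # fail fast instead of queueing forever
)

def charge_card(payload: dict) -> dict:
    # Bound the time spent waiting on the payment provider
    resp = requests.post(
        "https://api.stripe.com/v1/charges",  # illustrative endpoint; auth omitted
        data=payload,
        timeout=(3, 10),  # 3s to connect, 10s to read
    )
    resp.raise_for_status()
    return resp.json()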
Patterns & Pitfalls
Design Review Checklist
- Are key business metrics monitored (requests/sec, conversion rate, customer count)?
- Are key technical metrics monitored (latency p50/p95/p99, error rate, CPU, memory, disk)?
- Are logs centralized and searchable (not scattered across servers)?
- Do application logs include request IDs for correlation and tracing? (See the sketch after this checklist.)
- Are distributed traces instrumented (all major services)?
- Do dashboards exist for on-call engineers (quick triage)?
- Are alerts configured for critical thresholds?
- Is alert fatigue addressed (low false positive rate)?
- Can you correlate metrics, logs, and traces (e.g., error in logs → affected requests in traces)?
- Is observability cost understood and budgeted?
- Are retention policies set (e.g., metrics: 15 days, logs: 30 days, traces: 7 days)?
- Do new features include observability instrumentation?
- Is observability testing part of CI/CD (verify telemetry)?
- Can you answer "What went wrong?" within 5 minutes of an alert?
- Is observability integrated into the on-call runbook?
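To make the request-ID item concrete: a minimal Flask sketch (the framework choice is an assumption) that accepts or generates an X-Request-ID per request and echoes it back, so that logs, metrics, and traces can all be correlated on the same identifier:

import logging
from uuid import uuid4
from flask import Flask, g, request

app = Flask(__name__)
logger = logging.getLogger("api")

@app.before_request
def assign_request_id():
    # Reuse the caller's ID if present so the ID survives hops between services
    g.request_id = request.headers.get("X-Request-ID", str(uuid4()))

@app.after_request
def echo_request_id(response):
    response.headers["X-Request-ID"] = g.request_id
    return response

@app.route("/health")
def health():
    logger.info("health check", extra={"request_id": g.request_id})
    return {"status": "ok"}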
Implementing Observability in Code
Adding Metrics
from flask import Flask
from prometheus_client import Counter, Histogram

app = Flask(__name__)

request_count = Counter(
    'api_requests_total',
    'Total API requests',
    ['endpoint', 'method', 'status']
)
request_duration = Histogram(
    'api_request_duration_seconds',
    'API request duration',
    ['endpoint']
)

@app.route('/orders/<order_id>')
@request_duration.labels(endpoint='/orders/{order_id}').time()  # time every call
def get_order(order_id):
    try:
        order = db.get_order(order_id)  # app-specific data access (not shown)
        request_count.labels(
            endpoint='/orders/{order_id}',  # route template, not raw path, keeps label cardinality low
            method='GET',
            status=200
        ).inc()
        return order
    except Exception:
        request_count.labels(
            endpoint='/orders/{order_id}',
            method='GET',
            status=500
        ).inc()
        raise
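Counters and histograms only help if Prometheus can scrape them. prometheus_client can expose the registry on its own port with a single call (the port number here is an arbitrary choice):

from prometheus_client import start_http_server

# Serve the default registry at http://localhost:8000/metrics for Prometheus to scrape
start_http_server(8000)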
Adding Logs with Structure
import logging
from uuid import uuid4

logger = logging.getLogger(__name__)

def create_order(order_data):
    request_id = str(uuid4())  # correlates all log lines for this operation
    logger.info("Order creation started", extra={
        'request_id': request_id,
        'customer_id': order_data['customer_id'],
        'total_cents': order_data['total_cents']
    })
    try:
        order = order_service.create(order_data)  # app-specific service (not shown)
        logger.info("Order created successfully", extra={
            'request_id': request_id,
            'order_id': order['id'],
            'total_cents': order['total_cents']
        })
        return order
    except PaymentFailedError as e:  # app-specific exception type
        logger.error("Order creation failed: payment declined", extra={
            'request_id': request_id,
            'error': str(e),
            'error_code': e.code
        })
        raise
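The extra fields above only become searchable if the log output is actually structured. A minimal stdlib-only JSON formatter sketch; in practice a library such as python-json-logger does the same job:

import json
import logging

class JsonFormatter(logging.Formatter):
    # Attributes every LogRecord has; anything else came in via `extra=`
    RESERVED = set(vars(logging.LogRecord("", 0, "", 0, "", (), None)))

    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for key, value in vars(record).items():
            if key not in self.RESERVED:
                payload[key] = value  # request_id, order_id, etc.
        return json.dumps(payload, default=str)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger().addHandler(handler)
logging.getLogger().setLevel(logging.INFO)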
Adding Traces
from opentelemetry import trace

# Assumes a TracerProvider and exporter were configured at startup (see the setup sketch below)
tracer = trace.get_tracer(__name__)

async def process_order(order_id):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)

        with tracer.start_as_current_span("fetch_order"):
            order = await get_order(order_id)

        with tracer.start_as_current_span("validate_inventory"):
            await check_inventory(order.items)

        with tracer.start_as_current_span("process_payment"):
            await payment_service.charge(order.total)

        with tracer.start_as_current_span("send_confirmation"):
            await email_service.send(order.customer_email)

        return order
# Trace shows: fetch (10ms) + validate (50ms) + payment (200ms) + email (40ms) = 300ms
# Can see which spans are slow, what operations are nested
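get_tracer only yields useful spans once a tracer provider and an exporter are configured at process startup. A minimal sketch using the OpenTelemetry SDK with a console exporter; swap in an OTLP or Jaeger exporter to ship spans to a backend such as Jaeger or Datadog:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Run once at startup, before any tracer.start_as_current_span call
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)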
Anti-Patterns to Avoid
Anti-Pattern: No Context in Logs
# BAD: Can't correlate logs across services
logger.error("Database connection failed")

# GOOD: Include context
logger.error("Database connection failed", extra={
    'user_id': user_id,
    'operation': 'create_order',
    'request_id': request_id
})
Anti-Pattern: Metrics Without Context
# BAD: Just a number, no insight
counter.inc()

# GOOD: Labeled metrics enable drill-down
counter.labels(
    service='order-service',
    operation='create_order',
    status='success'
).inc()
Anti-Pattern: No Alerting
# BAD: Metrics are collected but no one looks at them
# System is slow; no one notices for hours

# GOOD: Alert-driven observability (illustrative pseudocode; in practice these
# rules live in your alerting system, e.g. Prometheus Alertmanager)
if metrics.error_rate > 0.05:      # 5% error rate
    send_alert("Order service error rate high")
if metrics.latency_p95 > 500:      # p95 latency over 500ms
    send_alert("Order service slow (p95 > 500ms)")
Building Observability-Driven Culture
- Expose metrics in dashboards: Team sees P95 latency, error rate, SLOs
- Alert on anomalies: Alert when behavior changes, not just thresholds
- Incident retrospectives: Every incident leads to better observability
- Observability as requirement: New features require metrics, logs, traces
- On-call playbooks: Runbooks for common alerts, linking to observability tools
Self-Check
- Right now, if a customer reports an issue, can you diagnose it within 5 minutes using your observability tools?
- Do you know your P95 latency? Your error rate? Your top 5 slowest endpoints?
- Can you trace a single user's request through all services?
- Do you have a single pane of glass dashboard for on-call engineers?
- If a deployment goes wrong, can you see it immediately (canary metrics, golden signals)?
If you answer "no" to any of these, you have observability gaps.
One Takeaway: Observability is the foundation of all quality attributes. You can't maintain reliability, performance, or security without seeing what your system is doing. Invest early: metrics, logs, traces from day one. The cost of observability infrastructure is trivial compared to the cost of debugging production issues blind.
Key Differences: Observability vs. Monitoring
| Aspect | Monitoring | Observability |
|---|---|---|
| Scope | Predefined metrics | Unknown unknowns |
| Data | Metrics only | Metrics + logs + traces |
| Question | "Is it up?" | "Why did it fail?" |
| Dashboards | Static, pre-built | Dynamic, interactive drill-down |
| Alerting | Threshold-based | Anomaly-based, intelligent |
| Debugging | Slow (hunt for info) | Fast (all signals in one place) |
| Cost | Lower | Higher (more data collected) |
| Time to diagnose | Hours | Minutes |
Real-World Example: E-commerce Checkout Observability
Metrics:
checkout_requests_total (counter) - total requests
checkout_latency_seconds (histogram) - latency distribution
checkout_errors_total (counter) - errors by type
cart_size (gauge) - items in cart
payment_processing_time (histogram) - payment duration
Logs (structured):
User initiates checkout (user_id, cart_size, total)
→ Inventory check starts (request_id, item_count)
→ Inventory check complete (duration, items_available)
→ Payment processing starts (gateway, amount)
→ Payment processing complete (status, auth_code, duration)
→ Checkout complete (order_id, total_time)
Traces (distributed):
checkout (total: 300ms)
├─ inventory_check (50ms)
├─ payment_processing (200ms)
└─ order_persistence (40ms)
   └─ email_notification (async)
Alerts:
- Checkout latency > 1000ms
- Checkout error rate > 2%
- Payment processing latency > 5s
- Inventory service unavailable
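The metrics listed above could be declared with prometheus_client roughly as follows; the names come from the example, while the label sets are assumptions:

from prometheus_client import Counter, Gauge, Histogram

checkout_requests_total = Counter(
    "checkout_requests_total", "Total checkout requests", ["status"]
)
checkout_latency_seconds = Histogram(
    "checkout_latency_seconds", "End-to-end checkout latency"
)
checkout_errors_total = Counter(
    "checkout_errors_total", "Checkout errors by type", ["error_type"]
)
cart_size = Gauge(
    "cart_size", "Number of items in the cart at checkout"
)
payment_processing_time = Histogram(
    "payment_processing_time", "Payment gateway call duration", ["gateway"]
)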
Next Steps
- Choose a platform — Prometheus/ELK (open-source) or Datadog/New Relic (commercial)
- Instrument key services — Add metrics, logs, traces
- Build dashboards — For on-call engineers, product team
- Configure alerts — Based on SLOs (latency, error rate, availability)
- Create runbooks — Alert → diagnosis steps → fix
- Train team — How to use observability tools
- Iterate — Add more signals, improve dashboards, reduce alert fatigue
References
- Observability Engineering (O'Reilly) ↗️
- Prometheus: Metrics ↗️
- Elasticsearch: Logs ↗️
- Jaeger: Distributed Tracing ↗️
- Datadog: Unified Observability ↗️
- Google Cloud: Observability Best Practices ↗️
- Honeycomb: Observability for Engineers ↗️
Full Coverage: See Observability and Operations in Section 12 for comprehensive details on building and maintaining observable systems.