
Trace Context Propagation

Propagate trace context across service calls to build complete request flow visibility.

TL;DR

Trace context (trace ID, span ID, sampling decision) must propagate through all services handling a request. Use W3C Trace Context standard headers (traceparent, tracestate) for HTTP. Include trace context in logs (as correlation IDs), message queues, and async operations. OpenTelemetry handles propagation automatically when configured correctly. Without propagation, you see isolated requests in logs—each service logs independently, and you can't connect the dots. With propagation, a single trace ID reveals the complete request path, all services touched, timing at each hop, and errors. Propagate trace context consistently, automatically, across all communication boundaries.

Learning Objectives

  • Understand W3C Trace Context standard and its structure
  • Implement automatic trace context propagation in HTTP
  • Propagate trace context across message queues and async operations
  • Use OpenTelemetry to handle propagation without code changes
  • Integrate trace IDs into logging systems
  • Balance sampling decisions with overhead
  • Debug distributed requests end-to-end

Motivating Scenario

A user reports that searching for products takes 10 seconds. You check the search service logs: search returned in 500ms. You check the API gateway: request took 10 seconds. You check the frontend: request completed in 10 seconds.

The mystery: where did the 9.5 seconds go? Without trace context propagation, each service logs independently. You can't connect logs from search service to logs from API gateway to understand the full flow. With trace context propagation, a single trace ID connects all logs:

  • API gateway: received request, trace_id=abc123
  • API gateway calls search service, passes trace_id=abc123
  • Search service logs with trace_id=abc123
  • You see: API gateway → auth check (4s) → cache lookup (2s) → search service (0.5s) → database call (3.5s) → response returned

The trace reveals the bottlenecks (the auth check and the database call, both outside the search service) that you would otherwise have missed.

Core Concepts

Trace Context Propagation Across Services

W3C Trace Context Standard

Standard HTTP headers for propagating trace context:

traceparent header format:

traceparent: version-trace_id-parent_span_id-trace_flags

Example: traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01

  • version (00): Protocol version
  • trace_id (32 hex chars): Identifies the entire request across all services
  • parent_span_id (16 hex chars): ID of the span that initiated this call
  • trace_flags (2 hex chars): Sampling flag (01 = sampled, 00 = not sampled)

tracestate header: Vendor-specific state. Allows tracing vendors to extend with proprietary information while maintaining standard compatibility.
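
To make the format concrete, here is a minimal sketch in plain Node.js that builds and parses a traceparent value by hand. In practice a tracing SDK does this for you; the helper names are made up for this illustration.

// Sketch: construct and parse a W3C traceparent header manually.
const crypto = require('crypto');

function buildTraceparent(sampled = true) {
  const traceId = crypto.randomBytes(16).toString('hex'); // 32 hex chars
  const spanId = crypto.randomBytes(8).toString('hex');   // 16 hex chars
  return `00-${traceId}-${spanId}-${sampled ? '01' : '00'}`;
}

function parseTraceparent(header) {
  const [version, traceId, parentSpanId, flags] = header.split('-');
  return { version, traceId, parentSpanId, sampled: flags === '01' };
}

console.log(parseTraceparent('00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01'));
// { version: '00', traceId: '0af7651916cd43dd8448eb211c80319c', parentSpanId: 'b7ad6b7169203331', sampled: true }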

Context Propagation Methods

HTTP headers: For synchronous service-to-service calls. Instrumentation injects the traceparent header into outgoing requests and extracts it from incoming ones; use W3C Trace Context as the standard.

Message queues: For asynchronous operations. Include trace context in message headers or the message body. When the consumer processes the message, extract that context and resume the trace, as sketched below.
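
A sketch of that pattern with the OpenTelemetry propagation API; the queue client (queue), topic name, and handler names are placeholders, and a propagator must already be registered as in the practical example below.

// Producer: inject the current trace context into the message headers.
const { context, propagation, trace } = require('@opentelemetry/api');

async function publishOrder(order) {
  const headers = {};
  propagation.inject(context.active(), headers); // writes traceparent/tracestate
  await queue.publish('orders', { body: order, headers });
}

// Consumer: extract the context from the headers and resume the trace.
async function onMessage(message) {
  const parentCtx = propagation.extract(context.active(), message.headers);
  const tracer = trace.getTracer('order-worker');
  const span = tracer.startSpan('process_order', {}, parentCtx);
  await context.with(trace.setSpan(parentCtx, span), async () => {
    // ...handle the message; downstream calls stay in the same trace
  });
  span.end();
}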

Logging: Include trace ID in every log line. Use structured logging: {"timestamp": "...", "trace_id": "...", "message": "..."}. Allows correlating logs across services.
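
For example, a small logging helper that reads the IDs from the active span (the helper itself is illustrative; it assumes the OpenTelemetry API is available):

// Sketch: attach the active trace/span IDs to every structured log line.
const { context, trace } = require('@opentelemetry/api');

function log(message, fields = {}) {
  const span = trace.getSpan(context.active());
  const ids = span ? span.spanContext() : {};
  console.log(JSON.stringify({
    timestamp: new Date().toISOString(),
    trace_id: ids.traceId,
    span_id: ids.spanId,
    message,
    ...fields,
  }));
}

log('search started', { query: 'laptops' });
// {"timestamp":"...","trace_id":"0af76519...","span_id":"b7ad6b71...","message":"search started","query":"laptops"}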

Database operations: Some tracing frameworks allow attaching trace context to database queries (via comments or tags). Helps when analyzing slow queries.
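
A low-tech version of this (sqlcommenter-style; the helper is hypothetical) appends the IDs as a trailing SQL comment so slow-query log entries can be traced back to a request:

// Sketch: tag a SQL query with the current trace/span IDs.
const { context, trace } = require('@opentelemetry/api');

function tagQuery(sql) {
  const span = trace.getSpan(context.active());
  if (!span) return sql;
  const { traceId, spanId } = span.spanContext();
  return `${sql} /* trace_id=${traceId}, span_id=${spanId} */`;
}

// db.query(tagQuery('SELECT * FROM products WHERE name LIKE ?'), [query]);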

Practical Example

// Node.js: Automatic trace context propagation with OpenTelemetry
// Package names below follow recent OpenTelemetry JS releases; they have
// changed across versions, so check the docs for your SDK version.

const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { SimpleSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { W3CTraceContextPropagator } = require('@opentelemetry/core');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { Resource } = require('@opentelemetry/resources');
const { trace, SpanStatusCode } = require('@opentelemetry/api');
const express = require('express');

// Set up tracing; the service name lives on the Resource
const tracerProvider = new NodeTracerProvider({
  resource: new Resource({ 'service.name': 'search-service' }),
});

const jaegerExporter = new JaegerExporter({
  host: 'jaeger.example.com',
  port: 6832,
});

tracerProvider.addSpanProcessor(new SimpleSpanProcessor(jaegerExporter));

// Register the provider together with the W3C Trace Context propagator so that
// traceparent/tracestate headers are extracted from incoming requests and
// injected into outgoing ones.
tracerProvider.register({
  propagator: new W3CTraceContextPropagator(),
});

// Auto-instrument common libraries (express, http, database drivers)
registerInstrumentations({
  tracerProvider,
  instrumentations: [getNodeAutoInstrumentations()],
});

const app = express();

// No custom middleware is needed for propagation: the HTTP/Express
// instrumentation extracts the traceparent header and activates the
// trace context for everything that runs downstream.

// Example endpoint
app.get('/search', async (req, res) => {
  // The instrumentation already created a server span for this request.
  // Any downstream calls (HTTP, database) automatically carry the trace context.
  const tracer = trace.getTracer('search-service');

  // Create a custom span for business logic
  const span = tracer.startSpan('process_search_query');

  try {
    const query = req.query.q;
    span.addEvent('search_started', { query });

    // Call auth service - the HTTP instrumentation injects the traceparent header
    // (assumes Node 18+ global fetch or another instrumented HTTP client)
    const authResult = await fetch('http://auth-service/verify', {
      headers: { Authorization: req.headers.authorization },
    });
    if (!authResult.ok) throw new Error('auth check failed');

    span.addEvent('auth_completed');

    // Query the database - the driver instrumentation records a child span
    // (`db` is assumed to be a client such as pg or mysql2 created elsewhere)
    const results = await db.query('SELECT * FROM products WHERE name LIKE ?', [query]);

    span.addEvent('search_completed', { result_count: results.length });
    span.setStatus({ code: SpanStatusCode.OK });

    res.json(results);
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
    res.status(500).json({ error: error.message });
  } finally {
    span.end();
  }
});

app.listen(3000);
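
To check propagation end to end, send a request that already carries a traceparent and confirm the same trace ID shows up in the search service's spans and logs. The endpoint and header value below are just the ones from this example, and the snippet assumes Node 18+ for the global fetch.

// Sketch: drive the endpoint with a caller-supplied traceparent.
const traceparent = '00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01';

fetch('http://localhost:3000/search?q=laptop', { headers: { traceparent } })
  .then((res) => res.json())
  .then(() => {
    // Every service touched by this request should now log the same trace ID.
    console.log('Check logs and the tracing UI for trace 0af7651916cd43dd8448eb211c80319c');
  });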

When to Use / When Not to Use

Propagate Trace Context
  1. Distributed systems (multiple services)
  2. Asynchronous operations and message queues
  3. Multi-language/multi-framework systems
  4. Need end-to-end request visibility
  5. Performance debugging across services
Less Critical For
  1. Monolithic applications
  2. Single-threaded, synchronous code
  3. Non-distributed systems
  4. Development/testing environments
  5. Systems with limited service count

Patterns and Pitfalls

Don't manually extract/inject trace context in every function. Use instrumentation libraries (OpenTelemetry, DataDog) that automatically handle propagation. Manual propagation is error-prone and duplicates work. Let frameworks handle it.
Mixed propagation formats break correlation: if some services use W3C Trace Context, others Jaeger's native format, and others Zipkin's B3 headers, traces fragment at every service boundary. Standardize on W3C Trace Context headers plus your language's OpenTelemetry library for consistency.
Add trace_id and span_id to every log entry via structured logging. This lets you grep logs across services: grep 'trace_id=abc123' all-service-logs.log returns every log line for a single request. This is invaluable for debugging.
Tracing every request at 100% sampling can add 5-10% overhead. Use adaptive sampling: sample all error responses, sample around 1% of successful responses, and adjust the rate based on traffic volume and acceptable overhead. Sampling decisions must be consistent: once made for a trace, every service should honor the same decision.
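
One common way to keep the decision consistent is a parent-based sampler: the service that starts the trace makes a ratio-based decision, and downstream services honor whatever the parent decided. A minimal sketch, assuming the same OpenTelemetry packages as the example above (the 1% ratio is illustrative); note that "sample all errors" requires tail-based sampling (for example in a collector), because that decision can only be made after the response is known.

// Sketch: sample ~1% of traces at the root, but always follow the parent's
// decision so a single trace is never half-sampled across services.
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { ParentBasedSampler, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base');

const tracerProvider = new NodeTracerProvider({
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.01), // applies only to traces that start here
  }),
});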
When services communicate via queues, include trace context in message headers. When consumer picks up message, extract and resume context. This connects async operations into a single trace, showing where time is spent (producer → queue → consumer).
Async frameworks (async/await, thread pools) can lose context if not configured properly. OpenTelemetry's context manager handles this automatically, but custom async code can still drop the context. Test it: verify that trace_id appears in logs emitted from thread-pool and async tasks.
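
A quick way to run that check, assuming the provider from the practical example has been registered (registration installs the async context manager):

// Sketch: verify the trace ID is still visible after an await across a timer.
const { context, trace } = require('@opentelemetry/api');

const tracer = trace.getTracer('context-check');
const span = tracer.startSpan('async_check');

context.with(trace.setSpan(context.active(), span), async () => {
  const before = trace.getSpan(context.active()).spanContext().traceId;
  await new Promise((resolve) => setTimeout(resolve, 10));
  const after = trace.getSpan(context.active()).spanContext().traceId;
  console.log(before === after ? 'context preserved across await' : 'context lost');
  span.end();
});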

Design Review Checklist

  • Is W3C Trace Context (traceparent header) used for HTTP propagation?
  • Is trace context propagated to all service-to-service calls?
  • Is trace context included in message queue headers for async operations?
  • Are trace IDs logged with every log entry (structured logging)?
  • Can you query logs by trace_id to get complete request flow?
  • Is sampling decision consistent across all services in a trace?
  • Is trace overhead monitored (<1% performance impact)?
  • Are all language runtimes using OpenTelemetry or equivalent?
  • Is trace context tested (unit tests verify headers propagated)?
  • Can you follow a single request through all services in your traces?

Self-Check

  • Can you query a single trace and see all services touched?
  • Do all your logs include trace_id?
  • Is latency broken down by service?
  • Can you identify which service caused slowness?
  • Is trace overhead acceptable for your traffic volume?

Next Steps

  1. Standardize on W3C Trace Context: Update all services to use traceparent header
  2. Install OpenTelemetry SDK: Add to all language runtimes in your stack
  3. Auto-instrument libraries: Enable auto-instrumentation for HTTP, database, message queues
  4. Add structured logging: Include trace_id in all log entries
  5. Test propagation: Verify trace context flows through all service boundaries

References

  1. W3C. Trace Context (W3C Recommendation). https://www.w3.org/TR/trace-context/
  2. OpenTelemetry. OpenTelemetry Documentation. https://opentelemetry.io/docs/
  3. Majors, C., Fong-Jones, L., & Miranda, G. (2022). Observability Engineering. O'Reilly Media.