Service Maps and Dependency Visualization

Visualize service dependencies and communication patterns to understand system topology.

TL;DR

Service maps are generated automatically from distributed traces: when service A calls service B, both appear as nodes in a directed graph, joined by an edge from A to B that carries metrics (throughput, latency, error rate). Service maps answer: what services does a user request touch? Which services are on the critical path? Are there unexpected cross-team dependencies? Which services are bottlenecks? Don't maintain service maps manually; derive them from traces so they stay accurate as the system changes. Service maps expose architectural problems: deep call chains (accumulated latency), circular dependencies (cascading failures and deadlock risk), chatty services (too many calls per request), and single points of failure. Use service maps for incident response, architecture reviews, and capacity planning.

Learning Objectives

  • Understand how service maps are derived from traces
  • Interpret service map topology and identify critical paths
  • Detect architectural anti-patterns (deep chains, circular deps)
  • Use service maps for incident investigation
  • Leverage service maps for architecture reviews
  • Monitor service dependencies for unexpected changes

Motivating Scenario

Your product has grown to 25+ microservices. When an incident occurs, you don't know which services are affected. You spend 30 minutes chasing through logs trying to understand the request flow. Meanwhile, customers are impacted.

A service map shows the complete dependency graph. When an incident alert fires for "user-api," you immediately see: user-api → auth-service → identity-db (on critical path), and user-api → recommendation-service → cache (off critical path). If the incident is latency, you compare the latency labels on the critical-path edges to find the slow hop, for example identity-db. If the incident is errors, you follow the error rates the same way, for example to auth-service.

Without a service map, you're flying blind. With one, you navigate confidently.

Core Concepts

Service Map: Dependencies, Metrics, and Critical Paths

Service Map Elements

Nodes: Individual services, external dependencies, data stores

  • Service-to-service: APIs, gRPC endpoints
  • Service-to-data: Databases, caches, data warehouses
  • Service-to-queue: Message brokers, event streams

Edges: Represent communication, labeled with:

  • Throughput: Requests per second on this path
  • Latency: p50, p95, p99 latency for calls on this edge
  • Error rate: Percentage of failed requests on this edge

Edge directions: Directed edges show which service initiates. A → B means A calls B. Bidirectional edges indicate mutual calls (problematic pattern).
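
As a rough illustration of how the three edge labels could be derived from raw call records, the sketch below uses a hypothetical `Edge` class (not part of any tracing library) that records one latency sample per call over an assumed observation window. Percentiles use the nearest-rank method, which is one of several common definitions.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Edge:
    """Hypothetical container for the metrics shown on one A -> B edge."""
    source: str
    target: str
    window_seconds: float  # length of the observation window (assumed known)
    latencies_ms: list = field(default_factory=list)
    errors: int = 0

    def record(self, latency_ms, failed=False):
        """Record one call from source to target."""
        self.latencies_ms.append(latency_ms)
        if failed:
            self.errors += 1

    def percentile(self, q):
        """Nearest-rank percentile for an integer q between 1 and 100."""
        ordered = sorted(self.latencies_ms)
        if not ordered:
            return 0.0
        rank = max(1, math.ceil(q * len(ordered) / 100))
        return ordered[rank - 1]

    def summary(self):
        """Return the labels a service map would draw on this edge."""
        calls = len(self.latencies_ms)
        return {
            'throughput_rps': calls / self.window_seconds,
            'error_rate': self.errors / calls if calls else 0.0,
            'p50_ms': self.percentile(50),
            'p95_ms': self.percentile(95),
            'p99_ms': self.percentile(99),
        }


edge = Edge('user-api', 'auth-service', window_seconds=50)
for ms in range(1, 101):
    edge.record(ms, failed=(ms > 98))  # the two slowest calls fail
print(edge.summary())
```

Keeping raw samples is the simplest approach for a sketch; a production system would more likely keep histograms so that memory does not grow with traffic.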

Architectural Patterns

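Deep call chains: User → API → Service1 → Service2 → Service3 → Database. Each hop adds latency: five sequential hops at 50ms each total 250ms. Tail latency compounds as well, because a request is slow whenever any single hop is slow, so the end-to-end p99 is worse than the p99 of any individual hop.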
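
The arithmetic behind deep chains can be sketched as follows. The function names are illustrative, and the tail calculation assumes that hop latencies are independent, which real systems only approximate.

```python
def chain_latency_ms(per_hop_ms, hops):
    """Total latency of a chain of sequential hops with equal cost."""
    return per_hop_ms * hops


def tail_hit_probability(hops, percentile=0.99):
    """Chance that at least one hop exceeds its own latency percentile.

    Assumes hops are sequential and statistically independent.
    """
    return 1 - percentile ** hops


# User -> API -> Service1 -> Service2 -> Service3 -> Database is five hops.
print(chain_latency_ms(50, 5))
print(round(tail_hit_probability(5), 4))
```

Under these assumptions, roughly one request in twenty hits the p99 tail of at least one hop in a five-hop chain, compared with one in a hundred for a single hop.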

Circular dependencies: A calls B, B calls A (or longer cycles: A→B→C→A). Cycles amplify failure propagation. If A is slow, B's calls to A queue up, B then responds slowly to A's own calls, and A backs up further, so the slowdown feeds on itself.

Chatty patterns: Service A calls Service B 100 times per request. Better to batch: one call with 100 items. Reduces network overhead, improves throughput.
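
One way to spot a chatty edge is to count calls per edge within a single request. The sketch below assumes the spans of one trace have already been reduced to (caller, callee) pairs; `chatty_edges` and its threshold are illustrative names and values, not part of any tracing API.

```python
from collections import Counter


def chatty_edges(calls_in_request, threshold=100):
    """Return edges called at least `threshold` times within one request.

    calls_in_request: iterable of (caller, callee) pairs from a single trace.
    """
    counts = Counter(calls_in_request)
    return {edge: n for edge, n in counts.items() if n >= threshold}


# One request in which a hypothetical cart service prices 100 items one by one.
calls = [('cart-service', 'pricing-service')] * 100 + [('cart-service', 'auth-service')]
print(chatty_edges(calls))
```

Counting per request rather than across all traces matters: a busy edge with one call per request is healthy, while 100 calls inside one request is the pattern that batching fixes.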

Single points of failure: Every request goes through Auth Service. If Auth Service is down or slow, everything degrades. Requires extra redundancy and monitoring.
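
A simple heuristic for finding candidate single points of failure is fan-in: the number of distinct services that call a node. The sketch below works on a list of (caller, callee) edges; the function names and the `min_callers` cutoff are assumptions for illustration, and high fan-in is only a proxy for "every request goes through it".

```python
from collections import defaultdict


def fan_in(edges):
    """Map each called service to its number of distinct callers."""
    callers = defaultdict(set)
    for src, dst in edges:
        callers[dst].add(src)
    return {service: len(srcs) for service, srcs in callers.items()}


def shared_dependencies(edges, min_callers=3):
    """Services with at least `min_callers` distinct callers, busiest first."""
    counts = fan_in(edges)
    flagged = [s for s, n in counts.items() if n >= min_callers]
    return sorted(flagged, key=lambda s: (-counts[s], s))


edges = [
    ('user-api', 'auth-service'),
    ('search-service', 'auth-service'),
    ('recommendation-service', 'auth-service'),
    ('user-api', 'recommendation-service'),
]
print(shared_dependencies(edges))
```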

Practical Example

#!/usr/bin/env python3
import json
from collections import defaultdict

import requests


class ServiceMapGenerator:
    """Generate service map from Jaeger traces"""

    def __init__(self, jaeger_url="http://localhost:16686"):
        self.jaeger_url = jaeger_url
        self.services = set()
        # A trace can be returned for several queried services; count it once
        self.seen_trace_ids = set()
        self.edges = defaultdict(lambda: {
            'calls': 0,
            'errors': 0,
            'traces': 0,  # number of traces in which this edge appears
            'latency_samples': []  # span durations in microseconds
        })

    def fetch_traces(self, service_name, limit=1000):
        """Fetch traces from a service"""
        # /api/traces is the endpoint used by the Jaeger UI; the JSON shape
        # handled below is an assumption and may differ between versions.
        response = requests.get(
            f"{self.jaeger_url}/api/traces",
            params={'service': service_name, 'limit': limit},
            timeout=10
        )
        response.raise_for_status()
        return response.json().get('data') or []

    def build_map(self, service_names):
        """Build service map from traces"""
        for service in service_names:
            for trace in self.fetch_traces(service):
                trace_id = trace.get('traceID')
                if trace_id is not None and trace_id in self.seen_trace_ids:
                    continue
                self.seen_trace_ids.add(trace_id)
                self._process_trace(trace['spans'], trace.get('processes'))

    @staticmethod
    def _service_name(span, processes):
        """Resolve a span's service from an inline 'process' or the trace-level 'processes' map"""
        process = span.get('process') or processes.get(span.get('processID'), {})
        return process.get('serviceName', 'unknown')

    @staticmethod
    def _parent_id(span):
        """Return the parent span ID from 'parentSpanID' or a CHILD_OF reference"""
        if span.get('parentSpanID'):
            return span['parentSpanID']
        for ref in span.get('references') or []:
            if ref.get('refType') == 'CHILD_OF':
                return ref.get('spanID')
        return None

    def _process_trace(self, spans, processes=None):
        """Extract dependencies from spans"""
        processes = processes or {}
        span_map = {s['spanID']: s for s in spans}
        edges_in_trace = set()

        for span in spans:
            child_service = self._service_name(span, processes)
            self.services.add(child_service)

            # A span whose parent belongs to another service is a cross-service call
            parent = span_map.get(self._parent_id(span))
            if parent is None:
                continue
            parent_service = self._service_name(parent, processes)
            if parent_service == child_service:
                continue  # internal span, not a dependency

            edge_key = (parent_service, child_service)
            edges_in_trace.add(edge_key)
            self.edges[edge_key]['calls'] += 1

            # Track latency (Jaeger reports duration in microseconds)
            self.edges[edge_key]['latency_samples'].append(span['duration'])

            # Track errors (the tag value may be a boolean or the string "true")
            if any(t.get('key') == 'error' and t.get('value') in (True, 'true')
                   for t in span.get('tags') or []):
                self.edges[edge_key]['errors'] += 1

        for edge_key in edges_in_trace:
            self.edges[edge_key]['traces'] += 1

    def analyze_map(self):
        """Analyze service map for anti-patterns"""
        issues = []

        # Check for deep call chains
        call_depth = self._find_max_chain_depth()
        if call_depth > 4:
            issues.append({
                'type': 'DEEP_CALL_CHAIN',
                'severity': 'WARNING',
                'depth': call_depth,
                'message': f"Deep call chain detected ({call_depth} hops). Consider caching or batching."
            })

        # Check for circular dependencies
        cycles = self._find_cycles()
        for cycle in cycles:
            issues.append({
                'type': 'CIRCULAR_DEPENDENCY',
                'severity': 'ERROR',
                'cycle': cycle,
                'message': f"Circular dependency detected: {' -> '.join(cycle)}"
            })

        # Check for chatty services: average calls per trace on each edge
        for (src, dst), metrics in self.edges.items():
            calls_per_trace = metrics['calls'] / max(metrics['traces'], 1)
            if calls_per_trace > 100:  # Threshold
                issues.append({
                    'type': 'CHATTY_SERVICES',
                    'severity': 'WARNING',
                    'edge': (src, dst),
                    'calls': metrics['calls'],
                    'calls_per_trace': calls_per_trace,
                    'message': f"{src} calls {dst} {calls_per_trace:.0f} times per trace. Consider batching."
                })

        return issues

    def _find_max_chain_depth(self):
        """Find longest call chain (in hops) that does not revisit a service"""
        # Build adjacency list
        graph = defaultdict(list)
        for (src, dst) in self.edges:
            graph[src].append(dst)

        def dfs(node, visited):
            if node in visited:
                return 0
            visited.add(node)
            max_depth = 0
            for neighbor in graph[node]:
                max_depth = max(max_depth, 1 + dfs(neighbor, visited))
            visited.remove(node)
            return max_depth

        max_chain = 0
        for service in self.services:
            max_chain = max(max_chain, dfs(service, set()))
        return max_chain

    def _find_cycles(self):
        """Find circular dependencies"""
        cycles = []
        visited = set()
        rec_stack = set()  # services on the current DFS path

        def dfs(node, path):
            visited.add(node)
            rec_stack.add(node)
            path.append(node)

            for _, neighbor in [e for e in self.edges if e[0] == node]:
                if neighbor not in visited:
                    dfs(neighbor, path[:])
                elif neighbor in rec_stack:
                    # Found cycle: a back edge to a service on the current path
                    cycle_start = path.index(neighbor)
                    cycles.append(path[cycle_start:] + [neighbor])

            rec_stack.remove(node)

        for service in self.services:
            if service not in visited:
                dfs(service, [])

        return cycles

    def export_json(self, filename):
        """Export service map as JSON"""
        def p99_ms(samples):
            if not samples:
                return 0
            ordered = sorted(samples)
            return ordered[int(len(ordered) * 0.99)] / 1000  # microseconds to ms

        output = {
            'services': sorted(self.services),
            'edges': [
                {
                    'source': src,
                    'target': dst,
                    # Raw call count over the sampled traces, not a per-second rate
                    'call_count': metrics['calls'],
                    'error_rate': metrics['errors'] / metrics['calls'] if metrics['calls'] > 0 else 0,
                    'latency_p99_ms': p99_ms(metrics['latency_samples'])
                }
                for (src, dst), metrics in self.edges.items()
            ]
        }

        with open(filename, 'w') as f:
            json.dump(output, f, indent=2)


if __name__ == '__main__':
    generator = ServiceMapGenerator()
    services = ['user-api', 'auth-service', 'search-service', 'recommendation-service']

    generator.build_map(services)
    issues = generator.analyze_map()

    print("Service Map Issues Found:")
    for issue in issues:
        print(f"  [{issue['severity']}] {issue['type']}: {issue['message']}")

    generator.export_json('service_map.json')

When to Use / When Not to Use

Service Maps Essential
  1. Microservices architecture (>5 services)
  2. Multiple teams owning different services
  3. Complex dependency chains
  4. Frequent incidents and debugging
  5. Architecture reviews and planning

Service Maps Less Critical
  1. Monolithic architecture
  2. Few services (<3) with simple dependencies
  3. Single team full-stack ownership
  4. Stable, mature systems with known topology
  5. Development and staging environments

Patterns and Pitfalls

Your documented architecture says User Service calls Auth Service. The actual service map (from traces) shows User Service calling Auth Service, then Auth Service calling User Service (circular). Service maps reveal what actually happens, not what architecture documents claim. This catches undocumented dependencies and drift.
A service map with 50+ services becomes unreadable. Visualizations become hairballs. Filter by traffic criticality: show only edges with >1% of total traffic. Or show views per team: platform team's internal services separate from product team's. Hierarchical visualization helps.
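
Filtering by traffic share can be sketched in a few lines. The example assumes edge call counts are already available as a dictionary keyed by (source, target); `significant_edges` and `min_share` are illustrative names.

```python
def significant_edges(edge_calls, min_share=0.01):
    """Keep only edges carrying more than `min_share` of total traffic.

    edge_calls: dict mapping (source, target) to a call count.
    """
    total = sum(edge_calls.values())
    if total == 0:
        return {}
    return {edge: calls for edge, calls in edge_calls.items()
            if calls / total > min_share}


edge_calls = {
    ('user-api', 'auth-service'): 990,
    ('user-api', 'audit-service'): 5,
    ('auth-service', 'legacy-ldap'): 5,
}
print(significant_edges(edge_calls))
```

A traffic filter hides rare edges, so it suits day-to-day views; low-traffic edges can still matter during an incident and are worth keeping available in an unfiltered view.
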
When an alert fires, pull up the service map to see the critical path immediately. If the alert is for Search Service, you know it doesn't affect User Service (different branch). If it's for Auth Service, you know everything is affected. This can cut MTTR substantially; in a case like the motivating scenario, from roughly 30 minutes to 5.
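
The "who is affected" question is a reverse reachability query on the graph: start at the failing service and walk edges backwards to every direct and indirect caller. A minimal sketch, assuming the map is available as (caller, callee) pairs and using a hypothetical `affected_services` helper:

```python
from collections import defaultdict, deque


def affected_services(edges, failing):
    """Return every service that directly or indirectly calls `failing`."""
    callers = defaultdict(set)
    for src, dst in edges:
        callers[dst].add(src)

    affected = set()
    queue = deque([failing])
    while queue:
        node = queue.popleft()
        for caller in callers[node]:
            if caller != failing and caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected


edges = [
    ('user-api', 'auth-service'),
    ('auth-service', 'identity-db'),
    ('user-api', 'recommendation-service'),
    ('recommendation-service', 'cache'),
    ('search-api', 'search-service'),
]
print(sorted(affected_services(edges, 'identity-db')))
print(sorted(affected_services(edges, 'search-service')))
```

This treats every caller as affected; in practice a caller with a fallback or cache may degrade gracefully, so the result is an upper bound on the blast radius.
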
A service map that averages all requests loses important distinctions. Checkout flow might follow a different path than search flow. Show multiple service maps for different scenarios: typical request, checkout flow, admin operations. Scenario-specific maps are more useful.
Nodes receiving many incoming calls are critical; color them red. Color nodes with high error rates or latencies yellow. At a glance, you see which services are bottlenecks or unhealthy, which focuses attention where it matters.
If service maps are generated weekly but services deploy daily, the maps are outdated. Generate service maps continuously from live traces, or update them in real time using an observability platform that ingests traces directly. A stale map can be worse than no map, because responders trust a topology that no longer exists.

Design Review Checklist

  • Do you generate service maps automatically from traces rather than maintaining them manually?
  • Are service maps available in your observability platform?
  • Can you identify the critical path for key user workflows?
  • Are deep call chains (>4 hops) visible and tracked?
  • Do you check for circular dependencies in service maps?
  • Are chatty services (high call count) identified and optimized?
  • Are single points of failure (highly depended-on services) flagged?
  • Are latency and error rate shown on edges?
  • Are service maps updated in real time (not stale)?
  • Are service maps used during architecture reviews and incidents?

Self-Check

  • What's the longest call chain in your service map?
  • Which service has the most incoming dependencies?
  • Are there any circular dependencies?
  • What services have error rates >1%?
  • Which edges have latency >200ms?

Next Steps

  1. Enable distributed tracing: Ensure all services emit traces with context propagation
  2. Generate service map: Use your observability platform (Jaeger, Datadog, New Relic) to render the map
  3. Analyze for issues: Check for deep chains, circular deps, chatty patterns, bottlenecks
  4. Document critical path: Identify the services that matter most for key workflows
  5. Monitor continuously: Set up alerts for topology changes (new dependencies, new services)
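
For step 5, topology changes can be detected by comparing the current edge set against a stored baseline. The sketch below assumes both snapshots are lists of (source, target) pairs; `diff_topology` is a hypothetical helper, and wiring its output into an alerting system is left out.

```python
def diff_topology(baseline_edges, current_edges):
    """Report edges and services that appeared or disappeared since the baseline."""
    baseline, current = set(baseline_edges), set(current_edges)

    def services(edges):
        return {service for edge in edges for service in edge}

    return {
        'new_edges': sorted(current - baseline),
        'removed_edges': sorted(baseline - current),
        'new_services': sorted(services(current) - services(baseline)),
    }


baseline = [('user-api', 'auth-service'), ('auth-service', 'identity-db')]
current = baseline + [('auth-service', 'user-api'), ('user-api', 'billing-service')]
print(diff_topology(baseline, current))
```

In this example the diff surfaces both a new service (billing-service) and a new reverse edge from auth-service back to user-api, which is the kind of undocumented circular dependency worth reviewing before it causes an incident.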
