Distributed Monolith Anti-Pattern
Services that call each other synchronously in sequence, creating tight coupling and monolithic behavior.
TL;DR
A distributed monolith looks like microservices (separate services, separate deployments) but behaves like a monolith (tightly coupled, long synchronous call chains). Service A calls B calls C calls D; if any service is slow or down, the entire chain blocks. Result: all the complexity of microservices with none of the benefits (independent scaling, loose coupling, resilience). Solution: decouple with events. Services publish state changes asynchronously; others consume independently.
Learning Objectives
- Identify distributed monolith patterns in your architecture
- Understand the costs of tight coupling in distributed systems
- Implement event-driven decoupling
- Design workflows with eventual consistency
- Use caching and CQRS to reduce cross-service calls
- Monitor and measure call chain latency
Motivating Scenario
An order creation flow requires: User Auth Service validates the user, Order Service creates the order, Payment Service charges the card, Invoice Service generates the invoice, Shipping Service reserves inventory. All synchronous, in sequence. A customer clicks "buy now"; the Order Service makes four downstream calls (4 × ~100ms each = 400ms minimum) before it can respond. If the Invoice Service is temporarily slow, customers feel that slowness on every purchase. If the Shipping Service goes down, all orders fail, not just shipping. The company thought it had microservices (independent teams, separate deployments), but operationally it's still a monolith.
Core Concepts
The Monolith Problem, Distributed
A traditional monolith forces tight coupling through shared process memory. A distributed monolith forces tight coupling through synchronous network calls. The coupling is still there; it's just slower and more fragile.
Why Call Chains Are Bad
- Latency Accumulation: 4 sequential hops × 100ms = 400ms (user perceives slowness; see the sketch after this list)
- Cascading Failures: If Shipping Service is down, orders fail entirely (all-or-nothing)
- Resource Contention: A spike in orders cascades down the chain: Order Service overloaded → Payment Service overloaded → Invoice Service overloaded → entire system overloaded
- Debugging: Distributed tracing is required just to understand why a request took 400ms
- Scalability: Can't scale Payment Service independently; it's tied to Order Service's load
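To make the latency math concrete, here is a small runnable sketch that fakes each network call with asyncio.sleep; the service names and the flat 100ms latency are illustrative assumptions, not measurements:

```python
import asyncio
import time

async def call_service(name: str, latency_s: float = 0.1) -> str:
    # Stand-in for a network call; ~100ms, per the scenario above
    await asyncio.sleep(latency_s)
    return f"{name}: ok"

async def sequential_chain() -> float:
    # Distributed monolith: each call waits for the previous one
    start = time.perf_counter()
    for name in ("auth", "payment", "invoice", "shipping"):
        await call_service(name)
    return time.perf_counter() - start

async def parallel_fanout() -> float:
    # Event-driven: consumers react independently, so latencies overlap
    start = time.perf_counter()
    await asyncio.gather(*(call_service(n) for n in ("payment", "invoice", "shipping")))
    return time.perf_counter() - start

async def main() -> None:
    seq = await sequential_chain()
    par = await parallel_fanout()
    print(f"sequential chain: {seq:.2f}s")  # ~0.40s
    print(f"parallel fan-out: {par:.2f}s")  # ~0.10s

asyncio.run(main())
```

The sequential version's latency is the sum of the hops; the fan-out's is the maximum of them. That difference is the entire argument of this section.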
Practical Example
Distributed Monolith (Synchronous):

```python
# Order Service - synchronous chain
class OrderService:
    def __init__(self, auth_service, payment_service, invoice_service, shipping_service):
        self.auth = auth_service
        self.payment = payment_service
        self.invoice = invoice_service
        self.shipping = shipping_service

    def create_order(self, user_id, items):
        # Call 1: Auth Service (100ms)
        user = self.auth.validate_user(user_id)
        if not user:
            raise Exception("Invalid user")

        # Call 2: Payment Service (100ms) - must wait for Auth
        total = sum(item.price for item in items)
        payment_result = self.payment.charge(user.id, total)
        if payment_result.status != "success":
            raise Exception("Payment failed")

        # Call 3: Invoice Service (100ms) - must wait for Payment
        invoice = self.invoice.create(user.id, items, total, payment_result.id)

        # Call 4: Shipping Service (100ms) - must wait for Invoice
        shipping = self.shipping.reserve(user.id, items)

        order = self.save_order(user_id, items, total)
        return {
            "order_id": order.id,
            "total": total,
            "shipping": shipping,
        }

# Total latency: 400ms minimum
# If ANY service is down, the entire operation fails
# If Payment Service is slow, Order Service blocks
```
Event-Driven (Asynchronous):

```python
# Order Service - publishes an event, others react
class OrderService:
    def __init__(self, auth_service, event_bus):
        self.auth = auth_service
        self.event_bus = event_bus

    def create_order(self, user_id, items):
        # Only synchronous call: Auth (must know the user exists)
        user = self.auth.validate_user(user_id)
        if not user:
            raise Exception("Invalid user")

        # Calculate locally (no service call)
        total = sum(item.price for item in items)

        # Create the order locally
        order = self.save_order(user_id, items, total, status="pending")

        # Publish an event - let others handle their parts asynchronously
        self.event_bus.publish('order.created', {
            'order_id': order.id,
            'user_id': user_id,
            'items': items,
            'total': total,
        })
        return {
            "order_id": order.id,
            "total": total,
            "status": "pending",
        }

# Total latency: 100ms (only Auth is synchronous)
# Returns immediately to the user

# Payment Service - reacts to the order.created event
class PaymentService:
    def __init__(self, event_bus):
        self.event_bus = event_bus  # keep a reference so handlers can publish
        event_bus.subscribe('order.created', self.on_order_created)

    async def on_order_created(self, event):
        # Process asynchronously; nothing blocks on this
        try:
            result = self.charge(event['user_id'], event['total'])
            self.event_bus.publish('payment.completed', {
                'order_id': event['order_id'],
                'status': 'success',
            })
        except Exception:
            self.event_bus.publish('payment.failed', {
                'order_id': event['order_id'],
                'status': 'failed',
            })

# Invoice Service - reacts to the payment.completed event
class InvoiceService:
    def __init__(self, event_bus):
        self.event_bus = event_bus
        event_bus.subscribe('payment.completed', self.on_payment_completed)

    async def on_payment_completed(self, event):
        invoice = self.create_invoice(event['order_id'])
        self.event_bus.publish('invoice.created', event)

# Shipping Service - reacts to the order.created event independently
class ShippingService:
    def __init__(self, event_bus):
        self.event_bus = event_bus
        event_bus.subscribe('order.created', self.on_order_created)

    async def on_order_created(self, event):
        self.reserve_inventory(event['items'])
        self.event_bus.publish('shipment.reserved', event)

# Benefits:
# - User sees a response in 100ms (not 400ms)
# - Services process in parallel (not sequentially)
# - If the Invoice Service is down, the order is still created and shipped
# - If the Payment Service is down, the event queues up and is retried later
```
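The examples above assume an event_bus object with publish and subscribe; that interface is an assumption of this article, not a specific library. A minimal in-memory sketch, just enough to run the handlers locally:

```python
import asyncio
from collections import defaultdict
from typing import Any, Awaitable, Callable

Handler = Callable[[dict[str, Any]], Awaitable[None]]

class InMemoryEventBus:
    """Toy bus matching the publish/subscribe interface used above.

    A real deployment would use a durable broker (Kafka, RabbitMQ,
    SNS+SQS); this exists only to make the examples runnable locally.
    """

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers[topic].append(handler)

    def publish(self, topic: str, event: dict[str, Any]) -> None:
        # Fire-and-forget: schedule each subscriber as a task so the
        # publisher never blocks on a consumer. Assumes a running loop.
        for handler in self._handlers[topic]:
            asyncio.create_task(handler(event))
```

Fire-and-forget in memory means events vanish if the process dies; durability, retries, and dead letter queues are exactly what the real brokers add.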
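Note that payment.failed is published above but nothing consumes it yet; the "Saga/compensation logic" checklist item below refers to exactly this gap. A minimal compensation sketch, where the handler class and the cancel_order method on the order store are hypothetical names:

```python
# Saga-style compensation: the Order Service listens for downstream
# failures and undoes its local work instead of blocking the caller
class OrderCompensationHandler:
    def __init__(self, order_store, event_bus):
        self.order_store = order_store  # hypothetical persistence API
        self.event_bus = event_bus
        event_bus.subscribe('payment.failed', self.on_payment_failed)

    async def on_payment_failed(self, event):
        # Move the pending order to a terminal state; other services
        # can react to order.cancelled the same way they react to
        # order.created
        self.order_store.cancel_order(event['order_id'], reason="payment_failed")
        self.event_bus.publish('order.cancelled', {
            'order_id': event['order_id'],
            'reason': 'payment_failed',
        })
```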
When to Use / When to Avoid
Signs of a distributed monolith (refactor):
- Service A calls B calls C in synchronous sequence
- 400ms+ latency for simple operations
- If any service is down, the entire operation fails
- Services tightly coupled through call chains
- Can't scale individual services independently
- Debugging requires distributed tracing

Signs of healthy event-driven decoupling (keep):
- A service publishes an event; others consume asynchronously
- User sees a response immediately (~100ms)
- Services process in parallel
- If one service is down, the others continue (the queue retries)
- Services loosely coupled through event contracts
- Easy to scale and debug independently
Patterns & Pitfalls
Design Review Checklist
- Identified call chains (Service A → B → C)?
- Latency per chain measured and acceptable (<200ms)?
- Event-driven alternative designed for long chains?
- No synchronous calls for non-critical operations?
- Caching strategy in place for frequently accessed data?
- CQRS read models defined for high-demand queries? (see the sketch after this checklist)
- Saga/compensation logic for distributed workflows?
- Event contracts documented and versioned?
- Async handling implemented (no blocking on events)?
- Monitoring in place to detect slow/failing services?
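On the caching and CQRS items above: instead of querying other services on every read, a service can keep a local, denormalized read model that the events it already receives keep up to date. A minimal sketch; the class name and summary fields are assumptions for illustration:

```python
class OrderSummaryReadModel:
    """Local read model fed by events; no cross-service calls at read time."""

    def __init__(self, event_bus):
        self._summaries: dict = {}  # stand-in for a real query store
        event_bus.subscribe('order.created', self.on_order_created)
        event_bus.subscribe('payment.completed', self.on_payment_completed)
        event_bus.subscribe('shipment.reserved', self.on_shipment_reserved)

    async def on_order_created(self, event):
        self._summaries[event['order_id']] = {
            'total': event['total'],
            'payment': 'pending',
            'shipping': 'pending',
        }

    async def on_payment_completed(self, event):
        # Events can arrive out of order; create the entry if needed
        self._summaries.setdefault(event['order_id'], {})['payment'] = event['status']

    async def on_shipment_reserved(self, event):
        self._summaries.setdefault(event['order_id'], {})['shipping'] = 'reserved'

    def get_summary(self, order_id):
        # Zero synchronous hops at read time; the data is eventually
        # consistent with the owning services
        return self._summaries.get(order_id)
```

The trade-off is staleness: the model lags the source of truth by the event-delivery delay, which is the eventual consistency called out in the Learning Objectives.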
Self-Check
- How many synchronous hops is too many? More than 3 is suspicious. Each hop adds 50-100ms.
- When is synchronous appropriate? When you absolutely need real-time consistency (auth validation, immediate payment response). Keep these minimal.
- How do I migrate from distributed monolith to event-driven? Gradually. Start with one non-critical chain; publish events instead of calling. Build async handlers. Test thoroughly.
- What if I need real-time consistency? Accept it as a constraint: call synchronously only for that one operation, and keep everything else event-driven.
- How do I handle failures in event-driven systems? Idempotent handlers (safe to process the same event twice), dead letter queues for events that keep failing, and monitoring alerts (see the sketch below).
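A sketch of that last answer; the processed-IDs set and dead-letter list are in-memory stand-ins (in practice a database table and a broker's DLQ feature), and the handler name is assumed:

```python
class IdempotentPaymentHandler:
    """Hypothetical wrapper showing idempotency plus a dead letter queue."""

    def __init__(self, payment_service, dead_letters: list) -> None:
        self.payment_service = payment_service
        self.dead_letters = dead_letters  # stand-in for a real DLQ
        self.processed: set[str] = set()  # stand-in for a DB table

    async def on_order_created(self, event: dict) -> None:
        # Idempotency: brokers deliver at-least-once, so the same event
        # may arrive twice; charging twice would be a real-money bug
        if event['order_id'] in self.processed:
            return
        try:
            self.payment_service.charge(event['user_id'], event['total'])
            self.processed.add(event['order_id'])
        except Exception:
            # Park the event for inspection and retry instead of
            # losing it or retrying in a hot loop
            self.dead_letters.append(event)
```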
Next Steps
- Audit call chains — Map critical request flows; measure latency
- Identify candidates for decoupling — Which chains don't need real-time consistency?
- Design events — Define what events are published and by whom
- Build event infrastructure — Kafka, RabbitMQ, SNS+SQS, or custom
- Migrate one chain — Build async handler for one service
- Monitor and iterate — Measure latency improvement; expand pattern