Overuse of Synchronous Calls Anti-Pattern
Every inter-service communication is synchronous, creating blocking dependencies and reducing resilience.
TL;DR
Overusing synchronous calls means treating every inter-service communication as request/response. Service A calls B and blocks until B responds; B calls C and blocks in turn. Latency accumulates along the chain (5 sequential calls × 100ms each = 500ms). If any service is slow or down, the entire chain fails, with no buffering or graceful degradation. Solution: use asynchronous messaging for non-critical operations. Services publish events; others consume them when ready. This reduces caller latency (the caller gets an immediate response), improves resilience (failures don't cascade), and enables independent scaling.
Learning Objectives
- Understand synchronous vs. asynchronous communication trade-offs
- Identify which operations should be synchronous vs. async
- Implement message queues for decoupling
- Design resilient systems with graceful degradation
- Use circuit breakers and timeouts for remaining sync calls
- Monitor and measure improvements in latency and reliability
Motivating Scenario
An e-commerce system synchronously sends an email confirmation after every order, so Order Service must wait for Email Service to respond. Email Service is slow (500ms to send), so every order creation takes at least 500ms. A traffic spike causes Email Service to back up. Orders start timing out. Customers abandon carts. Revenue drops. Meanwhile, the order was already successfully created; the email is just a nice-to-have. If the email were sent asynchronously (Order Service publishes an event and returns immediately), traffic spikes would not matter: orders would succeed, and emails would be sent eventually with no customer impact.
Core Concepts
The Cost of Synchronous Communication
Every synchronous call is a potential failure point and a latency contributor. In distributed systems, this compounds quickly.
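To make the compounding concrete, the sketch below sums per-hop latency for a sequential chain and multiplies per-hop availability. The figures (100ms per hop, 99.9% availability per service) are illustrative assumptions rather than measurements, and the availability calculation assumes failures are independent.

```python
from math import prod

def chain_latency_ms(hop_latencies_ms):
    """Sequential synchronous calls: the caller waits for the sum of all hops."""
    return sum(hop_latencies_ms)

def chain_availability(hop_availabilities):
    """A synchronous chain succeeds only if every hop succeeds."""
    return prod(hop_availabilities)

latency = chain_latency_ms([100] * 5)
availability = chain_availability([0.999] * 5)

print(f"{latency} ms end to end")                     # 500 ms end to end
print(f"{availability:.4f} combined availability")    # 0.9950 combined availability
```

Five services that are each available 99.9% of the time give a chain that is available only about 99.5% of the time, which is why each extra synchronous hop costs both latency and reliability.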
Synchronous vs. Asynchronous Characteristics
| Aspect | Synchronous | Asynchronous |
|---|---|---|
| Blocking | Yes (waits for response) | No (returns immediately) |
| Failure Propagation | Cascades (A fails if B fails) | Isolated (if B is down, messages wait in the queue and are retried) |
| Latency | Accumulates (100ms × 5 calls = 500ms) | Minimal to caller (immediate response) |
| Buffering | None (request lost if recipient down) | Queue buffers requests |
| Use Case | Critical path (auth, payment validation) | Background tasks (email, analytics) |
| Scaling | Harder (scale all together) | Easier (independent scaling) |
Practical Example
- All Synchronous (Anti-Pattern)
- Async for Non-Critical (Better)
- Smart Sync/Async Mix (Best)
# Order Service - everything synchronous
class OrderService:
    def __init__(self, payment_service, email_service, analytics_service):
        self.payment = payment_service
        self.email = email_service
        self.analytics = analytics_service

    def create_order(self, user_id, items):
        total = sum(item.price for item in items)

        # Synchronous: must wait for payment to complete
        payment = self.payment.charge(user_id, total)  # 100ms
        if not payment.success:
            raise Exception("Payment failed")

        order = self.save_order(user_id, items, total, payment.id)

        # Synchronous: must wait for email to send
        self.email.send_confirmation(user_id, order.id, items)  # 500ms
        # ^ Customer waits 500ms for email that is just a nice-to-have

        # Synchronous: must wait for analytics to log
        self.analytics.log_purchase(user_id, total, order.id)  # 50ms
        # ^ Customer waits 50ms for analytics that nobody cares about

        return order  # Total: 650ms for customer

# Problem: Email Service is slow on Black Friday
# Order creation blocks waiting for email
# Orders start timing out
# Customers see errors and leave
# Revenue lost, even though orders were already saved
# Order Service - sync for critical, async for non-critical
class OrderService:
    def __init__(self, payment_service, event_bus):
        self.payment = payment_service
        self.event_bus = event_bus

    def create_order(self, user_id, items):
        total = sum(item.price for item in items)

        # Synchronous: payment IS critical path (need real-time decision)
        payment = self.payment.charge(user_id, total)  # 100ms
        if not payment.success:
            raise Exception("Payment failed")

        order = self.save_order(user_id, items, total, payment.id)

        # Asynchronous: email is not critical
        # Publish event, return to customer immediately
        self.event_bus.publish('order.created', {
            'order_id': order.id,
            'user_id': user_id,
            'items': items,
            'total': total
        })

        # Same for analytics
        self.event_bus.publish('purchase.tracked', {
            'user_id': user_id,
            'total': total
        })

        return order  # Total: 100ms for customer

# Benefits:
# - Customer sees order confirmation in 100ms (not 650ms)
# - Email Service slow? Doesn't matter. Queue handles it.
# - Email Service down? Orders still created. Emails queued for retry.
# - Black Friday traffic spike? Payment scales; email scales independently.
# Email Service - subscribes to order.created events
class EmailService:
    def __init__(self, event_bus, smtp, user_directory):
        self.smtp = smtp
        # Assumed helper: resolves a user_id to an email address,
        # because the order.created event carries only the user_id
        self.users = user_directory
        event_bus.subscribe('order.created', self.send_confirmation)

    def send_confirmation(self, event):
        # Runs off the critical path: if this takes 500ms, no customer waits on it
        # EmailTemplate is assumed to be the project's own template helper
        body = EmailTemplate.render('order_confirmation.html', event)
        self.smtp.send(self.users.email_for(event['user_id']), body)

# Analytics Service - subscribes to purchase.tracked events
class AnalyticsService:
    def __init__(self, event_bus, datawarehouse):
        self.datawarehouse = datawarehouse
        event_bus.subscribe('purchase.tracked', self.log)

    def log(self, event):
        # Process asynchronously
        self.datawarehouse.insert(event)
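The examples above treat `event_bus` as a given. As a rough illustration of the contract they rely on, here is a minimal in-memory stand-in, assuming only `publish(topic, payload)` and `subscribe(topic, handler)`. The class name `InMemoryEventBus` and its `drain()` helper are hypothetical. A production system would use a durable broker such as Kafka or RabbitMQ; this sketch loses events if the process exits.

```python
import queue
import threading
import time
from collections import defaultdict

class InMemoryEventBus:
    """Toy event bus: publish() enqueues and returns; a worker thread delivers."""

    def __init__(self):
        self._handlers = defaultdict(list)
        self._queue = queue.Queue()
        threading.Thread(target=self._deliver, daemon=True).start()

    def subscribe(self, topic, handler):
        self._handlers[topic].append(handler)

    def publish(self, topic, payload):
        # Returns as soon as the event is enqueued; no handler runs here.
        self._queue.put((topic, payload))

    def _deliver(self):
        while True:
            topic, payload = self._queue.get()
            for handler in list(self._handlers.get(topic, ())):
                try:
                    handler(payload)
                except Exception as exc:
                    # A failing consumer must not affect the publisher.
                    print(f"handler failed for {topic}: {exc}")
            self._queue.task_done()

    def drain(self):
        """Block until every queued event has been handled (for demos and tests)."""
        self._queue.join()

# Usage: a slow consumer does not slow down the publisher.
bus = InMemoryEventBus()
sent = []

def slow_email(event):
    time.sleep(0.2)  # stands in for a slow email send
    sent.append(event['order_id'])

bus.subscribe('order.created', slow_email)

start = time.perf_counter()
bus.publish('order.created', {'order_id': 42})
publish_seconds = time.perf_counter() - start  # time the caller actually waited

bus.drain()  # only the demo waits here; a real caller would already have returned
```

The publisher's wait is just the cost of enqueueing, while the 200ms of simulated email work happens on the worker thread.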
# Order Service - sync or async depends on availability/criticality
class OrderService:
    def __init__(self, payment_service, inventory_service, event_bus):
        self.payment = payment_service
        self.inventory = inventory_service
        self.event_bus = event_bus

    def create_order(self, user_id, items):
        total = sum(item.price for item in items)

        # Synchronous: Auth (critical, must validate immediately)
        # (the auth call itself is not shown in this example)

        # Synchronous: Inventory check (critical, need real-time stock)
        if not self.inventory.has_stock(items):
            raise Exception("Out of stock")

        # Synchronous: Payment (critical, need real-time response)
        payment = self.payment.charge(user_id, total)
        if not payment.success:
            raise Exception("Payment failed")

        order = self.save_order(user_id, items, total, payment.id)

        # Asynchronous: everything else
        # Email, analytics, reporting, notifications
        self.event_bus.publish('order.created', {...})

        return order  # 150ms: auth + inventory + payment

# Only sync when absolutely necessary
# Everything else: async with queues, retries, dead letters
When to Use / When to Avoid
Use synchronous calls for:
- Critical path operations needing an immediate decision
- Auth validation (user exists, token valid)
- Payment processing (need success/failure response)
- Inventory checks (stock availability)
- Keep these to a minimum; typically 1-2 calls per operation
Use asynchronous messaging for:
- Email, SMS notifications (can be delayed)
- Analytics, logging, metrics (non-blocking)
- Reporting and data warehouse updates (eventual)
- Image processing, file conversions (background jobs)
- Publish events; handle responses asynchronously
Patterns & Pitfalls
Design Review Checklist
- Identified critical path operations (must be sync)?
- Non-critical operations converted to async events?
- Maximum 2-3 synchronous calls per operation?
- Message queue infrastructure in place (Kafka, RabbitMQ)?
- Event contracts documented and versioned?
- Circuit breakers protecting all sync calls?
- Timeouts configured for all sync calls (< 5s)?
- Dead letter queue monitoring for failed events?
- Idempotent event handlers (safe to retry)?
- Graceful degradation if async services down?
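For the circuit breaker and timeout items in the checklist, the following is a minimal sketch of the idea rather than a production implementation. The names `CircuitBreaker` and `CircuitOpenError`, the threshold values, and the injected `clock` parameter are all assumptions made for illustration. The breaker does not enforce timeouts itself: it assumes the wrapped call raises when its own client-side timeout fires, and counts that as a failure.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Cool-down elapsed: let one trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result

# Usage with a fake clock so the example runs instantly.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_after_s=10.0, clock=lambda: now[0])

def flaky_charge():
    raise TimeoutError("payment service timed out")

outcomes = []
for _ in range(3):
    try:
        breaker.call(flaky_charge)
    except CircuitOpenError:
        outcomes.append("rejected")  # failed fast, dependency not called
    except TimeoutError:
        outcomes.append("failed")    # dependency was called and timed out

now[0] = 11.0  # cool-down has elapsed
recovered = breaker.call(lambda: "charged")

print(outcomes)   # ['failed', 'failed', 'rejected']
print(recovered)  # charged
```

After two consecutive failures the third call is rejected immediately instead of waiting on a dependency that is already known to be unhealthy, which keeps a slow service from tying up its callers.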
Self-Check
- Which operations should be synchronous? Auth, payment, inventory, validation. Anything where you need immediate success/failure decision.
- Which should be async? Email, analytics, logging, reporting, notifications. Anything that can happen later without impacting user experience.
- What if Email Service is slow? With async, it doesn't matter: the queue absorbs the backlog, failed sends are retried, and the user never waits.
- How do I handle failures in async? Combine idempotent handlers (safe to retry), dead letter queues (for messages that keep failing), and monitoring (alerts on failures).
- How do I know if I'm overusing sync? If any single operation makes more than 3 synchronous calls, you are probably overusing them. Measure latencies; anything over 200ms is suspicious.
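The failure-handling answer above (idempotent handlers, retries, dead letter queues) can be sketched in a few lines. This is an illustration under stated assumptions: each event carries a unique `event_id` (a field the earlier examples do not include), processed IDs are kept in an in-memory set, and the dead letter queue is a plain list. The function name `process_with_retry` is hypothetical. A real consumer would persist both stores and wait between attempts.

```python
def process_with_retry(event, handler, processed_ids, dead_letters, max_attempts=3):
    """Run handler for one event, skipping duplicates and parking repeated failures."""
    if event["event_id"] in processed_ids:
        return "duplicate"  # redelivered event that was already handled
    last_error = None
    for _ in range(max_attempts):
        try:
            handler(event)
        except Exception as exc:
            last_error = exc  # a real consumer would back off before retrying
        else:
            processed_ids.add(event["event_id"])
            return "processed"
    # Give up and park the event for inspection instead of losing it.
    dead_letters.append({"event": event, "error": repr(last_error)})
    return "dead-lettered"

# Usage
processed_ids, dead_letters, sent = set(), [], []
calls = {"count": 0}

def flaky_then_ok(event):
    calls["count"] += 1
    if calls["count"] < 2:
        raise ConnectionError("smtp unavailable")
    sent.append(event["order_id"])

def always_fails(event):
    raise ConnectionError("smtp unavailable")

event = {"event_id": "evt-1", "order_id": 42}
first = process_with_retry(event, flaky_then_ok, processed_ids, dead_letters)
second = process_with_retry(event, flaky_then_ok, processed_ids, dead_letters)  # redelivery
third = process_with_retry({"event_id": "evt-2", "order_id": 43},
                           always_fails, processed_ids, dead_letters)

print(first, second, third)  # processed duplicate dead-lettered
print(sent)                  # [42]
```

The duplicate check is what makes retries and broker redeliveries safe: the confirmation for order 42 is sent once even though the event is delivered twice.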
Next Steps
- Audit all service calls — Map which are sync, which are async
- Identify non-critical sync calls — Candidates for async conversion
- Design events — Define what events to publish
- Implement message queue — Set up Kafka, RabbitMQ, or cloud equivalent
- Migrate one flow — Convert email/analytics from sync to async
- Add circuit breakers — Protect remaining sync calls
- Monitor improvements — Measure latency reduction and reliability gains
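For the final monitoring step, one simple approach is to compare latency percentiles before and after a migration. The sketch below uses only the Python standard library. The helper name `latency_summary` is hypothetical, and the sample data is synthetic, chosen so the percentiles are easy to verify by hand; it does not represent real measurements.

```python
import statistics

def latency_summary(samples_ms):
    """Return the p50 and p95 of a list of request latencies in milliseconds."""
    # quantiles(n=100) returns the 99 cut points between percentiles 1 and 99
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}

# Synthetic samples: 1ms, 2ms, ..., 101ms
summary = latency_summary(list(range(1, 102)))
print(summary)  # {'p50': 51.0, 'p95': 96.0}
```

Tracking p95 alongside the median matters here because slow dependencies in a synchronous chain tend to show up in the tail first.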