Transaction Boundaries and Sagas

Coordinate distributed transactions across services using choreography and orchestration patterns.

TL;DR

In microservices, ACID transactions across services are impractical: each service owns its database, and there is no global rollback. Sagas address this with a sequence of local transactions coordinated so that either every step completes or the completed steps are undone. Two patterns exist: choreography (services communicate via events) and orchestration (a coordinator directs each step). Choreography is loosely coupled but hard to debug; orchestration is easier to understand but creates a central coordinator. Use compensating transactions to handle failures: if step 3 fails, explicitly undo steps 2 and 1. This isn't ACID, but it's sufficient for most business workflows.

Learning Objectives

  • Understand why traditional ACID transactions don't work across services
  • Design distributed transactions using sagas
  • Implement choreography-based sagas with events
  • Implement orchestration-based sagas with coordinators
  • Design compensating transactions for rollback
  • Handle failures and partial failures in distributed workflows

Motivating Scenario

A user places an order. The system must reserve inventory, charge the credit card, and reserve shipping. If any step fails, all previous steps must be undone. In a monolith, this is one transaction with automatic rollback. In microservices, the inventory, payment, and shipping services are separate. If step 2 (charge card) fails, you must explicitly unreserve the inventory; if step 3 (reserve shipping) fails, you must also refund the charge. How do you coordinate this reliably?

Core Concepts

Why ACID Fails in Microservices

ACID transactions rely on a single database that controls locking and rollback. Microservices have separate databases with no shared locks. Coordination across a network is inherently unreliable: messages can be delayed, services can crash, and networks can partition. Protocols such as two-phase commit can span databases, but they require every participant to support them and can block while the coordinator is down, so most microservice systems settle for eventual consistency.
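
For contrast, here is a minimal sketch of what a single database provides for free. It assumes a hypothetical schema in which inventory and payments share one SQLite database; the table names and the `place_order` helper are illustrative only. When the payment step raises, the inventory update is rolled back automatically.

```python
import sqlite3

# Hypothetical single-database schema: both tables live in one SQLite
# database, so one transaction can cover both writes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (sku TEXT PRIMARY KEY, reserved INTEGER NOT NULL)")
conn.execute("CREATE TABLE payments (order_id TEXT PRIMARY KEY, amount INTEGER NOT NULL)")
conn.execute("INSERT INTO inventory VALUES ('widget', 0)")
conn.commit()


def place_order(conn, order_id, sku, amount, fail_payment=False):
    """Reserve stock and record a payment atomically within one database."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE inventory SET reserved = reserved + 1 WHERE sku = ?", (sku,)
            )
            if fail_payment:
                raise RuntimeError("card declined")  # simulated payment failure
            conn.execute("INSERT INTO payments VALUES (?, ?)", (order_id, amount))
        return True
    except RuntimeError:
        return False


ok = place_order(conn, "o-1", "widget", 500, fail_payment=True)
reserved = conn.execute(
    "SELECT reserved FROM inventory WHERE sku = 'widget'"
).fetchone()[0]
```

Once inventory and payments live in separate services, no equivalent of the `with conn:` block spans both writes, which is the gap sagas fill.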

Sagas: Choreography vs. Orchestration

A saga is a sequence of local transactions, each updating one service's database. If all succeed, the saga completes. If one fails, compensating transactions undo previous changes. Two coordination styles exist: choreography (services react to events) and orchestration (a central coordinator sends commands).
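
The choreography style can be sketched with a tiny in-process event bus. This is an illustration under stated assumptions, not a production design: `EventBus`, `wire_services`, and the event names other than `OrderCreated` are hypothetical, and a real system would publish through a message broker. No component knows the whole workflow; each handler reacts to one event and emits the next.

```python
from collections import defaultdict


class EventBus:
    """Hypothetical in-process bus; a real system would use a message broker."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.log = []  # every published event name, in order

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.log.append(event_type)
        for handler in list(self.handlers[event_type]):
            handler(payload)


def wire_services(bus, card_is_valid):
    """Each service only knows the events it consumes and emits."""

    def inventory_on_order_created(event):
        bus.publish("InventoryReserved", event)

    def payment_on_inventory_reserved(event):
        bus.publish("PaymentCharged" if card_is_valid else "PaymentFailed", event)

    def inventory_on_payment_failed(event):
        # Compensation: release the stock reserved earlier.
        bus.publish("InventoryReleased", event)

    bus.subscribe("OrderCreated", inventory_on_order_created)
    bus.subscribe("InventoryReserved", payment_on_inventory_reserved)
    bus.subscribe("PaymentFailed", inventory_on_payment_failed)


bus = EventBus()
wire_services(bus, card_is_valid=False)
bus.publish("OrderCreated", {"order_id": "o-1"})
# bus.log now holds the full event chain, including the compensation.
```

The event log is the only place the overall flow becomes visible, which hints at why choreography is harder to debug than a coordinator that owns the sequence.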

Compensating Transactions

Rollback in distributed systems means explicitly undoing changes. If you reserved inventory, you must unreserve it. This isn't automatic; you must design an undo operation for each step. A compensating transaction is the semantic inverse of the original operation: a refund offsets a charge rather than erasing it.
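
One way to sketch this is to pair every action with its compensation and, on failure, run the compensations of the completed steps in reverse order. `SagaStep`, `run_saga`, and the in-memory service stand-ins below are hypothetical names chosen for illustration.

```python
class SagaStep:
    """One local transaction paired with the operation that undoes it."""

    def __init__(self, name, action, compensation):
        self.name = name
        self.action = action
        self.compensation = compensation


def run_saga(steps):
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed = []
    for step in steps:
        try:
            step.action()
        except Exception:
            for done in reversed(completed):
                done.compensation()
            return False
        completed.append(step)
    return True


# Hypothetical in-memory stand-ins for the three services.
state = {"reserved": 0, "charged": 0}
trace = []


def reserve_inventory():
    state["reserved"] += 1
    trace.append("reserve")


def unreserve_inventory():
    state["reserved"] -= 1
    trace.append("unreserve")


def charge_card():
    state["charged"] += 100
    trace.append("charge")


def refund_card():
    state["charged"] -= 100
    trace.append("refund")


def reserve_shipping():
    raise RuntimeError("no carrier available")  # simulated step 3 failure


steps = [
    SagaStep("inventory", reserve_inventory, unreserve_inventory),
    SagaStep("payment", charge_card, refund_card),
    SagaStep("shipping", reserve_shipping, lambda: None),
]
succeeded = run_saga(steps)
```

The failing step itself is not compensated because it never completed. This sketch also assumes that compensations succeed, which a real implementation cannot take for granted.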

Idempotency and Retries

Networks fail, services crash, and messages are delivered more than once. Make every operation idempotent: calling it multiple times with the same input has the same effect as calling it once. This enables safe retries without double-processing.
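
A common way to get this property is an idempotency key supplied by the caller. The sketch below assumes a hypothetical `PaymentService` that keeps results in memory and a key format of `order-42:charge`; neither comes from a real API.

```python
class PaymentService:
    """Hypothetical payment service that deduplicates by idempotency key."""

    def __init__(self):
        self.results = {}  # idempotency_key -> stored result
        self.total_charged = 0

    def charge_card(self, idempotency_key, user_id, amount):
        # A retry with the same key returns the stored result and charges nothing.
        if idempotency_key in self.results:
            return self.results[idempotency_key]
        self.total_charged += amount
        result = {"user_id": user_id, "amount": amount, "status": "charged"}
        self.results[idempotency_key] = result
        return result


payments = PaymentService()
first = payments.charge_card("order-42:charge", "u-1", 500)
retry = payments.charge_card("order-42:charge", "u-1", 500)  # duplicate delivery
```

In a real service the key lookup and the charge would need to be recorded in the same local transaction; otherwise a crash between the two could still produce a double charge.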

Practical Example

# ❌ POOR - Trying to use distributed transactions naively
class OrderService:
    def create_order(self, user_id, items):
        # No compensation: if the payment service crashes after inventory
        # is reserved, the reservation is never released
        inventory = self.inventory_client.reserve_inventory(items)
        payment = self.payment_client.charge_card(user_id, calculate_total(items))
        shipping = self.shipping_client.reserve_shipping(user_id)
        return Order(user_id=user_id, items=items)

# ✅ EXCELLENT - Saga using choreography (event-driven)
class OrderService:
    def __init__(self, event_bus, db):
        self.event_bus = event_bus
        self.db = db

    def create_order(self, user_id, items):
        order = Order(user_id=user_id, items=items, status='pending')
        self.db.insert(order)
        # Emit event; other services react and emit their own events
        self.event_bus.publish('OrderCreated', {
            'order_id': order.id,
            'user_id': user_id,
            'items': items,
            'total': calculate_total(items)
        })
        return order

    def on_inventory_reserved(self, event):
        order = self.db.get_order(event['order_id'])
        order.inventory_reserved = True
        self.db.update(order)

    def on_payment_failed(self, event):
        order = self.db.get_order(event['order_id'])
        order.status = 'cancelled'
        self.db.update(order)
        # Emit compensating event to unreserve inventory
        self.event_bus.publish('OrderCancelled', {'order_id': order.id})

# ✅ EXCELLENT - Saga using orchestration (coordinator-driven)
class OrderOrchestrator:
    def __init__(self, db, services, event_bus):
        self.db = db
        self.services = services
        self.event_bus = event_bus

    def create_order(self, user_id, items):
        order = Order(user_id=user_id, items=items, status='pending')
        self.db.insert(order)

        try:
            # Step 1: Reserve inventory
            self.services.inventory.reserve_inventory(items)
            order.inventory_reserved = True
            self.db.update(order)

            # Step 2: Charge payment
            self.services.payment.charge_card(user_id, calculate_total(items))
            order.payment_charged = True
            self.db.update(order)

            # Step 3: Reserve shipping
            self.services.shipping.reserve_shipping(user_id, items)
            order.status = 'confirmed'
            self.db.update(order)

        except Exception:
            # Compensate completed steps in reverse order. Catching only
            # PaymentFailedError would leave a charged card and reserved
            # stock behind if step 3 (shipping) failed.
            if getattr(order, 'payment_charged', False):
                # refund_payment is an assumed method on the payment client
                self.services.payment.refund_payment(order.id)
            if getattr(order, 'inventory_reserved', False):
                self.services.inventory.unreserve_inventory(order.id)
            order.status = 'cancelled'
            self.db.update(order)
            raise

        return order

When to Use / When Not to Use

When to Use Sagas
  1. Workflows spanning multiple services that must all succeed or all fail
  2. Business processes that can tolerate eventual consistency
  3. Systems where you need visibility into multi-step workflows
  4. Scenarios where services can handle compensating transactions
  5. High-scale systems requiring distributed coordination

When NOT to Use Sagas
  1. Single-service transactions (use ACID database transactions)
  2. Real-time financial transactions requiring strong ACID guarantees
  3. Workflows with complex rollback logic that can't reliably undo previous changes
  4. Early-stage systems without mature event infrastructure

Design Review Checklist

  • All steps in the saga are idempotent (safe to retry)
  • Compensating transactions are designed for each step
  • Failures between steps are handled explicitly
  • Saga state is persisted to survive crashes
  • Monitoring and alerts are configured for saga failures
  • Long-running sagas have timeouts and cleanup policies
  • Compensating transactions are thoroughly tested
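
The "saga state is persisted" item can be sketched as follows. This is a minimal illustration that assumes a hypothetical `saga` table in SQLite and a fixed step list; the `crash_after` parameter exists only to simulate a process dying mid-saga. Progress is saved after each step, so a restarted process skips what already ran.

```python
import json
import sqlite3

STEPS = ["reserve_inventory", "charge_card", "reserve_shipping"]


def open_store(path=":memory:"):
    """Open the (hypothetical) saga state table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS saga (order_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def load_state(conn, order_id):
    row = conn.execute(
        "SELECT state FROM saga WHERE order_id = ?", (order_id,)
    ).fetchone()
    return json.loads(row[0]) if row else {"completed": []}


def save_state(conn, order_id, state):
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO saga (order_id, state) VALUES (?, ?)",
            (order_id, json.dumps(state)),
        )


def advance(conn, order_id, handlers, crash_after=None):
    """Run the remaining steps, persisting progress after each one."""
    state = load_state(conn, order_id)
    for name in STEPS:
        if name in state["completed"]:
            continue  # finished before the crash; do not repeat
        handlers[name]()
        state["completed"].append(name)
        save_state(conn, order_id, state)
        if name == crash_after:
            return state  # simulate the process dying here
    return state


calls = []
handlers = {name: (lambda n=name: calls.append(n)) for name in STEPS}
conn = open_store()
advance(conn, "o-1", handlers, crash_after="charge_card")
final = advance(conn, "o-1", handlers)  # a restarted process resumes
```

A crash between running a handler and saving its state would cause that step to run again on restart, which is why the first checklist item (idempotent steps) matters alongside persistence.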

Self-Check

  • What's the difference between choreography and orchestration sagas?
  • How do you ensure idempotency in distributed transactions?
  • What happens if a compensating transaction fails?

One Takeaway

Sagas are not ACID, but they're sufficient for most distributed workflows. The key insight is: explicitly undo changes instead of relying on automatic rollback. This requires discipline but gives you control over distributed consistency.

Next Steps

  • Implement the saga pattern with a framework like Temporal or Axon
  • Design compensating transactions for all critical workflows
  • Set up distributed tracing to monitor sagas across services
  • Explore dead-letter queues for handling failed compensations
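
As a starting point for the dead-letter idea, the sketch below retries a failing compensation a bounded number of times and then parks it for manual follow-up. The `compensate_with_retry` helper and the list used as a dead-letter store are hypothetical; a real system would likely add backoff between attempts and use a durable queue.

```python
def compensate_with_retry(name, compensation, dead_letters, max_attempts=3):
    """Retry a compensation; park it for manual follow-up if it keeps failing."""
    last_error = None
    for _ in range(max_attempts):
        try:
            compensation()
            return True
        except Exception as exc:
            last_error = str(exc)
    dead_letters.append(
        {"step": name, "attempts": max_attempts, "error": last_error}
    )
    return False


attempts = {"count": 0}


def flaky_refund():
    # Fails twice, then succeeds: a transient outage.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("payment service unreachable")


def broken_unreserve():
    # Never succeeds: needs human attention.
    raise ConnectionError("inventory service unreachable")


dead_letters = []
refunded = compensate_with_retry("refund_card", flaky_refund, dead_letters)
released = compensate_with_retry("unreserve_inventory", broken_unreserve, dead_letters)
```

Retrying a compensation is only safe if the compensation itself is idempotent, so the same design rule applies to undo operations as to forward steps.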
