In a monolith, placing an order is a single database transaction — deduct payment, reserve inventory, create shipment, done. If anything fails, the database rolls everything back.
In microservices, each of those steps lives in a different service with its own database. There’s no global transaction coordinator. If the inventory service succeeds but the shipping service fails, you’re stuck with reserved stock and no shipment.
The SAGA pattern solves this. It replaces a single distributed transaction with a sequence of local transactions, each paired with a compensating action that undoes its effect on failure.
The Architecture at a Glance
Here’s how a saga coordinates work across services:
A central orchestrator tells each service what to do. If any step fails, it runs compensations in reverse order. This is the orchestration approach — and it’s not the only one.
Why Not Distributed Transactions?
Before diving into sagas, let’s understand why the traditional approach — two-phase commit (2PC) — falls apart in microservices:
| Problem | 2PC | SAGA |
|---|---|---|
| Availability | Blocks all participants if coordinator fails | Each step is independent |
| Latency | Holds locks across all services during prepare phase | No cross-service locks |
| Coupling | All services must support XA protocol | Services only need local transactions |
| Scalability | Lock contention kills throughput at scale | Each service scales independently |
| Partial failure | Timeout = uncertainty (did it commit?) | Explicit compensation = deterministic |
Two-phase commit is a coordination protocol, not a scaling strategy. It works for 2-3 tightly coupled databases. It doesn’t work when you have 10 services owned by different teams, running different tech stacks, deployed independently.
Two Flavors of SAGA
There are two fundamentally different ways to implement sagas:
1. Orchestration — Central Coordinator
A dedicated saga orchestrator service acts as the brain. It knows the full workflow, sends commands to each service in order, and handles failures by running compensations.
How it works:
- Client sends “Create Order” to the orchestrator
- Orchestrator sends “Reserve Payment” to Payment Service
- Payment Service reserves funds, replies with success
- Orchestrator sends “Reserve Inventory” to Inventory Service
- Inventory Service reserves stock, replies with success
- Orchestrator sends “Create Shipment” to Shipping Service
- If all succeed → mark order as COMPLETED
- If any step fails → run compensations in reverse
Here’s a production-style orchestrator in Node.js:
class OrderSagaOrchestrator {
constructor(paymentClient, inventoryClient, shippingClient, sagaStore) {
this.paymentClient = paymentClient;
this.inventoryClient = inventoryClient;
this.shippingClient = shippingClient;
this.sagaStore = sagaStore;
}
async execute(orderId, orderData) {
const saga = await this.sagaStore.create({
orderId,
state: 'STARTED',
steps: [],
});
try {
// Step 1: Reserve payment
await this.sagaStore.updateState(saga.id, 'PAYMENT_PENDING');
const paymentId = await this.paymentClient.reserve({
orderId,
amount: orderData.total,
currency: orderData.currency,
});
saga.steps.push({ name: 'payment', status: 'done', paymentId });
// Step 2: Reserve inventory
await this.sagaStore.updateState(saga.id, 'INVENTORY_PENDING');
const reservationId = await this.inventoryClient.reserve({
orderId,
items: orderData.items,
});
saga.steps.push({ name: 'inventory', status: 'done', reservationId });
// Step 3: Create shipment
await this.sagaStore.updateState(saga.id, 'SHIPPING_PENDING');
const shipmentId = await this.shippingClient.createShipment({
orderId,
address: orderData.shippingAddress,
items: orderData.items,
});
saga.steps.push({ name: 'shipping', status: 'done', shipmentId });
// All steps succeeded
await this.sagaStore.updateState(saga.id, 'COMPLETED');
return { success: true, orderId, shipmentId };
} catch (error) {
await this.compensate(saga, error);
throw new SagaFailedError(orderId, error.message);
}
}
async compensate(saga, originalError) {
await this.sagaStore.updateState(saga.id, 'COMPENSATING');
// Run compensations in reverse order
const completedSteps = [...saga.steps].reverse();
for (const step of completedSteps) {
try {
switch (step.name) {
case 'shipping':
await this.shippingClient.cancelShipment(step.shipmentId);
break;
case 'inventory':
await this.inventoryClient.release(step.reservationId);
break;
case 'payment':
await this.paymentClient.refund(step.paymentId);
break;
}
step.status = 'compensated';
} catch (compError) {
// Compensation failed — log and flag for manual intervention
step.status = 'compensation_failed';
step.error = compError.message;
await this.sagaStore.flagForManualReview(saga.id, step);
}
}
await this.sagaStore.updateState(saga.id, 'COMPENSATED');
}
}Key observations:
- Each step is recorded in the saga store before execution — this is your audit trail
- Compensations run in reverse order — shipping first, then inventory, then payment
- If a compensation itself fails, flag it for manual review rather than retrying forever
- The orchestrator is stateless — all state lives in
sagaStore
2. Choreography — Event-Driven
No central coordinator. Each service listens for events and decides what to do next. Services communicate through an event bus (Kafka, RabbitMQ, etc.).
How it works:
- Order Service creates order, publishes
OrderCreatedevent - Payment Service listens for
OrderCreated, reserves payment, publishesPaymentReserved - Inventory Service listens for
PaymentReserved, reserves stock, publishesInventoryReserved - Shipping Service listens for
InventoryReserved, creates shipment, publishesShipmentCreated - Order Service listens for
ShipmentCreated, marks order as COMPLETED
If Inventory Service fails, it publishes InventoryReservationFailed. Payment Service listens for that and refunds the payment.
Here’s how each service handles events:
# payment_service.py
class PaymentEventHandler:
def __init__(self, payment_repo, event_bus):
self.payment_repo = payment_repo
self.event_bus = event_bus
async def handle_order_created(self, event: OrderCreated):
"""React to a new order by reserving payment."""
try:
payment = await self.payment_repo.reserve(
order_id=event.order_id,
amount=event.total,
idempotency_key=event.event_id, # prevent double-processing
)
await self.event_bus.publish(PaymentReserved(
order_id=event.order_id,
payment_id=payment.id,
amount=event.total,
))
except InsufficientFundsError:
await self.event_bus.publish(PaymentFailed(
order_id=event.order_id,
reason="insufficient_funds",
))
async def handle_inventory_reservation_failed(self, event):
"""Compensate: refund payment when downstream step fails."""
payment = await self.payment_repo.find_by_order(event.order_id)
if payment and payment.status == "reserved":
await self.payment_repo.refund(payment.id)
await self.event_bus.publish(PaymentRefunded(
order_id=event.order_id,
payment_id=payment.id,
))# inventory_service.py
class InventoryEventHandler:
def __init__(self, inventory_repo, event_bus):
self.inventory_repo = inventory_repo
self.event_bus = event_bus
async def handle_payment_reserved(self, event: PaymentReserved):
"""React to successful payment by reserving inventory."""
try:
reservation = await self.inventory_repo.reserve(
order_id=event.order_id,
items=event.items,
idempotency_key=event.event_id,
)
await self.event_bus.publish(InventoryReserved(
order_id=event.order_id,
reservation_id=reservation.id,
))
except OutOfStockError as e:
await self.event_bus.publish(InventoryReservationFailed(
order_id=event.order_id,
reason=str(e),
))Orchestration vs Choreography — When to Use Which
| Aspect | Orchestration | Choreography |
|---|---|---|
| Complexity | Centralized, easier to understand | Distributed, harder to trace |
| Coupling | Orchestrator knows all services | Services only know events |
| Single point of failure | Orchestrator is critical | No single point |
| Debugging | Follow the orchestrator log | Correlate events across services |
| Best for | Complex workflows with branching | Simple linear flows (3-4 steps) |
| Adding steps | Modify orchestrator | Add new event listener |
| Circular dependencies | Not possible (orchestrator controls flow) | Possible if not careful |
My rule of thumb: Use orchestration when you have more than 4 steps, branching logic, or need clear visibility into the workflow. Use choreography for simple, linear flows where services are truly independent.
Compensation — The Hard Part
The forward path is easy. Compensation is where sagas get tricky.
A compensation is not a simple rollback. It’s a semantic undo — a new transaction that counteracts the effect of the original.
Designing Compensations
Every forward step needs a corresponding compensation:
| Forward Action | Compensation | Notes |
|---|---|---|
| Reserve payment ($100) | Refund payment ($100) | Must handle partial refunds |
| Reserve 5 units of inventory | Release 5 units of inventory | Check if already shipped |
| Create shipment | Cancel shipment | Only if not already dispatched |
| Send confirmation email | Send cancellation email | Can’t “unsend” — send correction |
| Charge credit card | Issue refund | May take days to process |
The Golden Rules of Compensation
1. Compensations must be idempotent
A compensation might be retried if the saga crashes mid-compensation. Running it twice must produce the same result as running it once.
async function refundPayment(paymentId, idempotencyKey) {
// Check if refund already processed
const existing = await db.refunds.findOne({
paymentId,
idempotencyKey,
});
if (existing) {
return existing; // Already refunded — return existing result
}
const refund = await paymentGateway.refund(paymentId);
await db.refunds.insert({
paymentId,
idempotencyKey,
refundId: refund.id,
amount: refund.amount,
createdAt: new Date(),
});
return refund;
}2. Compensations must be commutative with retries
If a forward action and its compensation are both retried due to network issues, the end state must still be correct. This is why reservations work better than direct mutations:
// BAD: Direct mutation — compensation can overshoot
async function addStock(itemId, quantity) {
await db.query('UPDATE items SET stock = stock + $1 WHERE id = $2', [quantity, itemId]);
}
// GOOD: Reservation-based — idempotent and safe
async function releaseReservation(reservationId) {
const result = await db.query(
'UPDATE reservations SET status = $1 WHERE id = $2 AND status = $3',
['released', reservationId, 'reserved']
);
// Only releases if still in 'reserved' state — safe to retry
return result.rowCount > 0;
}3. Some actions can’t be compensated
You can’t unsend an email. You can’t un-notify a user. For these cases:
- Reorder your saga steps: Put non-compensatable actions last. Send the email only after all other steps succeed.
- Use a buffer: Queue the email but don’t send it until the saga completes.
- Send a correction: If the email was sent, send a follow-up cancellation email.
Saga State Machine
Every saga instance goes through a well-defined set of states. Modeling this as a state machine makes the behavior explicit and debuggable:
Here’s a state machine implementation:
const SagaStateMachine = {
STARTED: {
transitions: {
BEGIN_PAYMENT: 'PAYMENT_PENDING',
ABORT: 'COMPENSATING',
},
},
PAYMENT_PENDING: {
transitions: {
PAYMENT_SUCCESS: 'INVENTORY_PENDING',
PAYMENT_FAILED: 'COMPENSATING',
},
},
INVENTORY_PENDING: {
transitions: {
INVENTORY_SUCCESS: 'SHIPPING_PENDING',
INVENTORY_FAILED: 'COMPENSATING',
},
},
SHIPPING_PENDING: {
transitions: {
SHIPPING_SUCCESS: 'COMPLETED',
SHIPPING_FAILED: 'COMPENSATING',
},
},
COMPENSATING: {
transitions: {
COMPENSATION_DONE: 'COMPENSATED',
COMPENSATION_FAILED: 'REQUIRES_MANUAL',
},
},
COMPLETED: { transitions: {} },
COMPENSATED: { transitions: {} },
REQUIRES_MANUAL: { transitions: {} },
};
class SagaInstance {
constructor(id, initialState = 'STARTED') {
this.id = id;
this.state = initialState;
this.history = [{ state: initialState, timestamp: Date.now() }];
}
transition(event) {
const currentConfig = SagaStateMachine[this.state];
const nextState = currentConfig?.transitions[event];
if (!nextState) {
throw new Error(
`Invalid transition: ${this.state} + ${event}. ` +
`Allowed: ${Object.keys(currentConfig?.transitions || {}).join(', ')}`
);
}
this.state = nextState;
this.history.push({ state: nextState, event, timestamp: Date.now() });
return this;
}
isTerminal() {
const config = SagaStateMachine[this.state];
return Object.keys(config.transitions).length === 0;
}
}Usage:
const saga = new SagaInstance('order-123');
saga.transition('BEGIN_PAYMENT'); // → PAYMENT_PENDING
saga.transition('PAYMENT_SUCCESS'); // → INVENTORY_PENDING
saga.transition('INVENTORY_FAILED'); // → COMPENSATING
saga.transition('COMPENSATION_DONE'); // → COMPENSATED
console.log(saga.history);
// Full audit trail of every state change with timestampsSaga Store — Persistence Layer
The saga store is the source of truth. If the orchestrator crashes mid-saga, it must recover by reading the saga state and resuming from where it left off.
CREATE TABLE sagas (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
type VARCHAR(100) NOT NULL, -- 'order_saga', 'refund_saga'
correlation_id VARCHAR(255) NOT NULL, -- order_id, user_id, etc.
state VARCHAR(50) NOT NULL,
payload JSONB NOT NULL,
steps JSONB NOT NULL DEFAULT '[]',
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW(),
version INTEGER DEFAULT 1 -- optimistic concurrency
);
CREATE INDEX idx_sagas_correlation ON sagas(correlation_id);
CREATE INDEX idx_sagas_state ON sagas(state) WHERE state NOT IN ('COMPLETED', 'COMPENSATED');
CREATE TABLE saga_events (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
saga_id UUID REFERENCES sagas(id),
event_type VARCHAR(100) NOT NULL,
payload JSONB NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW()
);Recovery on startup:
async function recoverPendingSagas() {
const pendingSagas = await db.query(`
SELECT * FROM sagas
WHERE state NOT IN ('COMPLETED', 'COMPENSATED', 'REQUIRES_MANUAL')
AND updated_at < NOW() - INTERVAL '5 minutes'
ORDER BY created_at ASC
`);
for (const saga of pendingSagas.rows) {
console.log(`Recovering saga ${saga.id} in state ${saga.state}`);
if (saga.state === 'COMPENSATING') {
await orchestrator.compensate(saga);
} else {
// Resume from last completed step
await orchestrator.resume(saga);
}
}
}Real-World Example: E-Commerce Order
Let’s put it all together with a complete e-commerce order saga:
// saga-definitions/order-saga.js
const OrderSagaDefinition = {
name: 'create_order',
steps: [
{
name: 'validate_order',
action: async (ctx) => {
const valid = await orderValidator.validate(ctx.payload);
if (!valid) throw new ValidationError('Invalid order data');
return { validated: true };
},
// No compensation needed — validation has no side effects
compensation: null,
},
{
name: 'reserve_payment',
action: async (ctx) => {
const payment = await paymentService.reserve({
orderId: ctx.sagaId,
amount: ctx.payload.total,
method: ctx.payload.paymentMethod,
idempotencyKey: `${ctx.sagaId}-payment`,
});
return { paymentId: payment.id };
},
compensation: async (ctx, stepResult) => {
await paymentService.refund(stepResult.paymentId, {
idempotencyKey: `${ctx.sagaId}-payment-refund`,
});
},
},
{
name: 'reserve_inventory',
action: async (ctx) => {
const reservation = await inventoryService.reserve({
orderId: ctx.sagaId,
items: ctx.payload.items,
idempotencyKey: `${ctx.sagaId}-inventory`,
});
return { reservationId: reservation.id };
},
compensation: async (ctx, stepResult) => {
await inventoryService.release(stepResult.reservationId, {
idempotencyKey: `${ctx.sagaId}-inventory-release`,
});
},
},
{
name: 'create_shipment',
action: async (ctx) => {
const shipment = await shippingService.create({
orderId: ctx.sagaId,
address: ctx.payload.shippingAddress,
items: ctx.payload.items,
idempotencyKey: `${ctx.sagaId}-shipment`,
});
return { shipmentId: shipment.id };
},
compensation: async (ctx, stepResult) => {
await shippingService.cancel(stepResult.shipmentId, {
idempotencyKey: `${ctx.sagaId}-shipment-cancel`,
});
},
},
{
name: 'send_confirmation',
action: async (ctx) => {
// Last step — no compensation needed since it's the final action
await notificationService.sendOrderConfirmation({
orderId: ctx.sagaId,
email: ctx.payload.customerEmail,
});
return { notified: true };
},
compensation: async (ctx) => {
// Send cancellation email as correction
await notificationService.sendOrderCancellation({
orderId: ctx.sagaId,
email: ctx.payload.customerEmail,
});
},
},
],
};A generic saga executor that runs any saga definition:
class SagaExecutor {
constructor(sagaStore) {
this.sagaStore = sagaStore;
}
async run(definition, payload) {
const saga = await this.sagaStore.create({
type: definition.name,
state: 'STARTED',
payload,
steps: [],
});
const ctx = { sagaId: saga.id, payload };
const completedSteps = [];
try {
for (const stepDef of definition.steps) {
await this.sagaStore.updateState(saga.id, `${stepDef.name}_PENDING`);
const result = await stepDef.action(ctx);
completedSteps.push({ definition: stepDef, result });
await this.sagaStore.recordStep(saga.id, {
name: stepDef.name,
status: 'completed',
result,
});
}
await this.sagaStore.updateState(saga.id, 'COMPLETED');
return { success: true, sagaId: saga.id };
} catch (error) {
console.error(`Saga ${saga.id} failed at step: ${error.message}`);
await this.sagaStore.updateState(saga.id, 'COMPENSATING');
// Compensate in reverse order
for (const step of completedSteps.reverse()) {
if (step.definition.compensation) {
try {
await step.definition.compensation(ctx, step.result);
await this.sagaStore.recordStep(saga.id, {
name: step.definition.name,
status: 'compensated',
});
} catch (compError) {
await this.sagaStore.recordStep(saga.id, {
name: step.definition.name,
status: 'compensation_failed',
error: compError.message,
});
await this.sagaStore.updateState(saga.id, 'REQUIRES_MANUAL');
throw new SagaCompensationError(saga.id, compError);
}
}
}
await this.sagaStore.updateState(saga.id, 'COMPENSATED');
throw new SagaFailedError(saga.id, error.message);
}
}
}Handling Edge Cases
1. Saga Timeout
What if a service never responds? Set a timeout on the entire saga:
async function executeWithTimeout(saga, timeoutMs = 30000) {
const timeoutPromise = new Promise((_, reject) =>
setTimeout(() => reject(new SagaTimeoutError(saga.id)), timeoutMs)
);
try {
return await Promise.race([
sagaExecutor.run(saga.definition, saga.payload),
timeoutPromise,
]);
} catch (error) {
if (error instanceof SagaTimeoutError) {
// Force compensation for any completed steps
await sagaExecutor.forceCompensate(saga.id);
}
throw error;
}
}2. Concurrent Sagas on the Same Resource
Two sagas reserving the same inventory item can conflict. Use optimistic locking:
UPDATE inventory
SET reserved_count = reserved_count + $1,
version = version + 1
WHERE item_id = $2
AND version = $3
AND (stock - reserved_count) >= $1;If rowCount === 0, either the version changed (concurrent modification) or insufficient stock. Retry or fail the saga step.
3. Observability
Sagas are hard to debug without proper observability. Add correlation IDs and structured logging:
const sagaLogger = {
stepStarted(sagaId, stepName) {
console.log(JSON.stringify({
event: 'saga_step_started',
sagaId,
step: stepName,
timestamp: new Date().toISOString(),
}));
},
stepCompleted(sagaId, stepName, durationMs) {
console.log(JSON.stringify({
event: 'saga_step_completed',
sagaId,
step: stepName,
durationMs,
timestamp: new Date().toISOString(),
}));
},
sagaFailed(sagaId, failedStep, error) {
console.log(JSON.stringify({
event: 'saga_failed',
sagaId,
failedStep,
error: error.message,
timestamp: new Date().toISOString(),
}));
},
};Track these metrics:
- Saga completion rate — percentage that reach COMPLETED state
- Saga duration — p50/p95/p99 end-to-end time
- Compensation rate — how often compensations run
- Manual intervention rate — how often REQUIRES_MANUAL is reached
- Step failure distribution — which step fails most often
Common Pitfalls
1. Not persisting saga state before executing steps. If the orchestrator crashes after sending a command but before recording the step, you don’t know whether to retry or compensate. Always persist first.
2. Non-idempotent compensations. If a refund is issued twice, the customer gets double their money back. Always use idempotency keys.
3. Putting non-compensatable steps early. If you send a confirmation email in step 2 and payment fails in step 3, you’ve notified the customer about an order that won’t be fulfilled. Put irreversible actions last.
4. Ignoring partial failures in compensation. If compensation for step 2 fails, you must still try to compensate step 1. Don’t short-circuit — attempt all compensations and collect errors.
5. No timeout on individual steps. A hung service can block the entire saga indefinitely. Set timeouts on each service call and treat timeouts as failures.
When to Use SAGA vs Alternatives
| Scenario | Pattern |
|---|---|
| All data in one database | Use a regular ACID transaction |
| 2-3 tightly coupled services | Consider 2PC if latency is acceptable |
| Multiple independent services, eventual consistency OK | SAGA pattern |
| Read-heavy with rare cross-service writes | SAGA for writes, CQRS for reads |
| Need strong consistency across services | Reconsider your service boundaries |
Summary
The SAGA pattern is the standard approach for distributed transactions in microservices:
- Orchestration gives you a clear, centralized workflow — use it for complex flows
- Choreography decouples services through events — use it for simple, linear flows
- Compensations are semantic undos, not rollbacks — they must be idempotent
- Persist saga state before each step — this is your crash recovery mechanism
- Put irreversible actions last — you can’t unsend an email
- Monitor everything — saga completion rate and compensation rate are your key metrics
The pattern adds complexity, but it’s the price of distributed systems. The alternative — hoping that all services succeed or manually fixing data inconsistencies — is far worse.










