Explaining SAGA Patterns with Examples

In a monolith, placing an order is a single database transaction — deduct payment, reserve inventory, create shipment, done. If anything fails, the database rolls everything back.

In microservices, each of those steps lives in a different service with its own database. There’s no global transaction coordinator. If the inventory service succeeds but the shipping service fails, you’re stuck with reserved stock and no shipment.

The SAGA pattern solves this. It replaces a single distributed transaction with a sequence of local transactions, each paired with a compensating action that undoes its effect on failure.
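
To make that definition concrete before looking at the two styles, here is a minimal sketch of the core idea. The names `runSaga`, `demo`, and the step labels are illustrative and not taken from any library: each step carries an action and a compensation, and a failure undoes the completed steps in reverse.

```javascript
// Minimal saga runner: each step pairs a local action with a compensation.
async function runSaga(steps) {
  const done = [];
  try {
    for (const step of steps) {
      await step.action();
      done.push(step);
    }
    return { status: 'COMPLETED' };
  } catch (error) {
    // Undo the steps that finished, most recent first
    for (const step of done.reverse()) {
      await step.compensate();
    }
    return { status: 'COMPENSATED', reason: error.message };
  }
}

// Demo with hypothetical steps: shipping fails, so inventory and payment are undone.
async function demo() {
  const log = [];
  const step = (forward, undo, fail = false) => ({
    action: async () => {
      if (fail) throw new Error(`${forward} failed`);
      log.push(forward);
    },
    compensate: async () => { log.push(undo); },
  });
  const result = await runSaga([
    step('reserve payment', 'refund payment'),
    step('reserve inventory', 'release inventory'),
    step('create shipment', 'cancel shipment', true),
  ]);
  return { result, log };
}
```

Calling `demo()` should yield a log of reserve payment, reserve inventory, release inventory, refund payment. The failed shipment step is never compensated because it never completed.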

The Architecture at a Glance

Here’s how a saga coordinates work across services:

[Figure: SAGA Orchestration Pattern]

A central orchestrator tells each service what to do. If any step fails, it runs compensations in reverse order. This is the orchestration approach — and it’s not the only one.

Why Not Distributed Transactions?

Before diving into sagas, let’s understand why the traditional approach — two-phase commit (2PC) — falls apart in microservices:

Problem	2PC	SAGA
Availability	Blocks all participants if coordinator fails	Each step is independent
Latency	Holds locks across all services during prepare phase	No cross-service locks
Coupling	All services must support XA protocol	Services only need local transactions
Scalability	Lock contention kills throughput at scale	Each service scales independently
Partial failure	Timeout = uncertainty (did it commit?)	Explicit compensation = deterministic

Two-phase commit is a coordination protocol, not a scaling strategy. It works for 2-3 tightly coupled databases. It doesn’t work when you have 10 services owned by different teams, running different tech stacks, deployed independently.

Two Flavors of SAGA

There are two fundamentally different ways to implement sagas:

1. Orchestration — Central Coordinator

A dedicated saga orchestrator service acts as the brain. It knows the full workflow, sends commands to each service in order, and handles failures by running compensations.

[Figure: SAGA Orchestration Pattern]

How it works:

1. Client sends “Create Order” to the orchestrator
2. Orchestrator sends “Reserve Payment” to Payment Service
3. Payment Service reserves funds, replies with success
4. Orchestrator sends “Reserve Inventory” to Inventory Service
5. Inventory Service reserves stock, replies with success
6. Orchestrator sends “Create Shipment” to Shipping Service
7. If all succeed → mark order as COMPLETED
8. If any step fails → run compensations in reverse

Here’s a production-style orchestrator in Node.js:

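// Thrown after a failed saga has run its compensations
class SagaFailedError extends Error {
  constructor(id, message) {
    super(`Saga for ${id} failed: ${message}`);
    this.name = 'SagaFailedError';
    this.id = id;
  }
}

class OrderSagaOrchestrator {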
  constructor(paymentClient, inventoryClient, shippingClient, sagaStore) {
    this.paymentClient = paymentClient;
    this.inventoryClient = inventoryClient;
    this.shippingClient = shippingClient;
    this.sagaStore = sagaStore;
  }

  async execute(orderId, orderData) {
    const saga = await this.sagaStore.create({
      orderId,
      state: 'STARTED',
      steps: [],
    });

    try {
      // Step 1: Reserve payment
      await this.sagaStore.updateState(saga.id, 'PAYMENT_PENDING');
      const paymentId = await this.paymentClient.reserve({
        orderId,
        amount: orderData.total,
        currency: orderData.currency,
      });
      saga.steps.push({ name: 'payment', status: 'done', paymentId });

      // Step 2: Reserve inventory
      await this.sagaStore.updateState(saga.id, 'INVENTORY_PENDING');
      const reservationId = await this.inventoryClient.reserve({
        orderId,
        items: orderData.items,
      });
      saga.steps.push({ name: 'inventory', status: 'done', reservationId });

      // Step 3: Create shipment
      await this.sagaStore.updateState(saga.id, 'SHIPPING_PENDING');
      const shipmentId = await this.shippingClient.createShipment({
        orderId,
        address: orderData.shippingAddress,
        items: orderData.items,
      });
      saga.steps.push({ name: 'shipping', status: 'done', shipmentId });

      // All steps succeeded
      await this.sagaStore.updateState(saga.id, 'COMPLETED');
      return { success: true, orderId, shipmentId };

    } catch (error) {
      await this.compensate(saga, error);
      throw new SagaFailedError(orderId, error.message);
    }
  }

  async compensate(saga, originalError) {
    await this.sagaStore.updateState(saga.id, 'COMPENSATING');

    // Run compensations in reverse order
    const completedSteps = [...saga.steps].reverse();
    let allCompensated = true;

    for (const step of completedSteps) {
      try {
        switch (step.name) {
          case 'shipping':
            await this.shippingClient.cancelShipment(step.shipmentId);
            break;
          case 'inventory':
            await this.inventoryClient.release(step.reservationId);
            break;
          case 'payment':
            await this.paymentClient.refund(step.paymentId);
            break;
        }
        step.status = 'compensated';
      } catch (compError) {
        // Compensation failed: flag for manual intervention, then keep going
        // so the remaining steps still get their chance to compensate
        allCompensated = false;
        step.status = 'compensation_failed';
        step.error = compError.message;
        await this.sagaStore.flagForManualReview(saga.id, step);
      }
    }

    // Only report COMPENSATED when every compensation actually succeeded
    await this.sagaStore.updateState(
      saga.id,
      allCompensated ? 'COMPENSATED' : 'REQUIRES_MANUAL'
    );
  }
}

Key observations:

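The saga state is written to the saga store before each step runs (updateState), which gives you an audit trail and a recovery point
Compensations run in reverse order: shipping first, then inventory, then payment
If a compensation itself fails, flag it for manual review rather than retrying forever
The orchestrator should hold no state of its own. This listing keeps saga.steps in memory for brevity; a production version would also persist each step result to sagaStore so a restarted orchestrator can recover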

2. Choreography — Event-Driven

No central coordinator. Each service listens for events and decides what to do next. Services communicate through an event bus (Kafka, RabbitMQ, etc.).

[Figure: SAGA Choreography Pattern]

How it works:

1. Order Service creates order, publishes OrderCreated event
2. Payment Service listens for OrderCreated, reserves payment, publishes PaymentReserved
3. Inventory Service listens for PaymentReserved, reserves stock, publishes InventoryReserved
4. Shipping Service listens for InventoryReserved, creates shipment, publishes ShipmentCreated
5. Order Service listens for ShipmentCreated, marks order as COMPLETED

If Inventory Service fails, it publishes InventoryReservationFailed. Payment Service listens for that and refunds the payment.

Here’s how each service handles events:

# payment_service.py
class PaymentEventHandler:
    def __init__(self, payment_repo, event_bus):
        self.payment_repo = payment_repo
        self.event_bus = event_bus

    async def handle_order_created(self, event: OrderCreated):
        """React to a new order by reserving payment."""
        try:
            payment = await self.payment_repo.reserve(
                order_id=event.order_id,
                amount=event.total,
                idempotency_key=event.event_id,  # prevent double-processing
            )
            await self.event_bus.publish(PaymentReserved(
                order_id=event.order_id,
                payment_id=payment.id,
                amount=event.total,
                items=event.items,  # assumes OrderCreated carries items; the inventory handler reads event.items
            ))
        except InsufficientFundsError:
            await self.event_bus.publish(PaymentFailed(
                order_id=event.order_id,
                reason="insufficient_funds",
            ))

    async def handle_inventory_reservation_failed(self, event):
        """Compensate: refund payment when downstream step fails."""
        payment = await self.payment_repo.find_by_order(event.order_id)
        if payment and payment.status == "reserved":
            await self.payment_repo.refund(payment.id)
            await self.event_bus.publish(PaymentRefunded(
                order_id=event.order_id,
                payment_id=payment.id,
            ))

# inventory_service.py
class InventoryEventHandler:
    def __init__(self, inventory_repo, event_bus):
        self.inventory_repo = inventory_repo
        self.event_bus = event_bus

    async def handle_payment_reserved(self, event: PaymentReserved):
        """React to successful payment by reserving inventory."""
        try:
            reservation = await self.inventory_repo.reserve(
                order_id=event.order_id,
                items=event.items,
                idempotency_key=event.event_id,
            )
            await self.event_bus.publish(InventoryReserved(
                order_id=event.order_id,
                reservation_id=reservation.id,
            ))
        except OutOfStockError as e:
            await self.event_bus.publish(InventoryReservationFailed(
                order_id=event.order_id,
                reason=str(e),
            ))
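
To see the whole event chain in one place, including the compensation path, here is a small in-memory sketch in JavaScript. `InMemoryBus`, `wireServices`, and the `stock` object are hypothetical stand-ins for a real broker and real services; a production bus would also be asynchronous and at-least-once, which this sketch does not model.

```javascript
// Tiny in-memory event bus standing in for Kafka or RabbitMQ (illustrative only).
class InMemoryBus {
  constructor() {
    this.handlers = new Map();
    this.published = []; // event types in publish order, for inspection
  }
  subscribe(type, handler) {
    const list = this.handlers.get(type) || [];
    list.push(handler);
    this.handlers.set(type, list);
  }
  async publish(type, payload) {
    this.published.push(type);
    for (const handler of this.handlers.get(type) || []) {
      await handler(payload);
    }
  }
}

// Wire up the services. `stock` is a hypothetical in-memory inventory count.
function wireServices(bus, stock) {
  const payments = new Map(); // orderId -> 'reserved' | 'refunded'

  // Payment Service: reserve on OrderCreated
  bus.subscribe('OrderCreated', async (e) => {
    payments.set(e.orderId, 'reserved');
    await bus.publish('PaymentReserved', e);
  });

  // Inventory Service: reserve stock or report failure
  bus.subscribe('PaymentReserved', async (e) => {
    if (stock.available >= e.quantity) {
      stock.available -= e.quantity;
      await bus.publish('InventoryReserved', e);
    } else {
      await bus.publish('InventoryReservationFailed', e);
    }
  });

  // Payment Service compensation: refund when the downstream step fails
  bus.subscribe('InventoryReservationFailed', async (e) => {
    if (payments.get(e.orderId) === 'reserved') {
      payments.set(e.orderId, 'refunded');
      await bus.publish('PaymentRefunded', e);
    }
  });

  return payments;
}

// Demo: only 1 unit in stock, the order asks for 5, so the payment is refunded.
async function demo() {
  const bus = new InMemoryBus();
  const payments = wireServices(bus, { available: 1 });
  await bus.publish('OrderCreated', { orderId: 'order-1', quantity: 5 });
  return { published: bus.published, payment: payments.get('order-1') };
}
```

Notice that no component knows the full workflow. The sequence only becomes visible by reading the published events in order, which is exactly why tracing is harder with choreography.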

Orchestration vs Choreography — When to Use Which

Aspect	Orchestration	Choreography
Complexity	Centralized, easier to understand	Distributed, harder to trace
Coupling	Orchestrator knows all services	Services only know events
Single point of failure	Orchestrator is critical	No single point
Debugging	Follow the orchestrator log	Correlate events across services
Best for	Complex workflows with branching	Simple linear flows (3-4 steps)
Adding steps	Modify orchestrator	Add new event listener
Circular dependencies	Not possible (orchestrator controls flow)	Possible if not careful

My rule of thumb: Use orchestration when you have more than 4 steps, branching logic, or need clear visibility into the workflow. Use choreography for simple, linear flows where services are truly independent.

Compensation — The Hard Part

The forward path is easy. Compensation is where sagas get tricky.

[Figure: SAGA Compensation Flow]

A compensation is not a simple rollback. It’s a semantic undo — a new transaction that counteracts the effect of the original.

Designing Compensations

Every forward step needs a corresponding compensation:

Forward Action	Compensation	Notes
Reserve payment ($100)	Refund payment ($100)	Must handle partial refunds
Reserve 5 units of inventory	Release 5 units of inventory	Check if already shipped
Create shipment	Cancel shipment	Only if not already dispatched
Send confirmation email	Send cancellation email	Can’t “unsend” — send correction
Charge credit card	Issue refund	May take days to process

The Golden Rules of Compensation

1. Compensations must be idempotent

A compensation might be retried if the saga crashes mid-compensation. Running it twice must produce the same result as running it once.

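async function refundPayment(paymentId, idempotencyKey) {
  // Check if a refund was already recorded for this key
  const existing = await db.refunds.findOne({
    paymentId,
    idempotencyKey,
  });

  if (existing) {
    return existing; // Already refunded: return the recorded result
  }

  // Caveat: this check-then-act sequence is not atomic. Two concurrent calls
  // can both pass the check, and a crash between the gateway call and the
  // insert loses the record. A unique index on (paymentId, idempotencyKey)
  // plus a gateway that accepts the same key would close those gaps.
  const refund = await paymentGateway.refund(paymentId);
  await db.refunds.insert({
    paymentId,
    idempotencyKey,
    refundId: refund.id,
    amount: refund.amount,
    createdAt: new Date(),
  });

  return refund;
}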

2. Compensations must be commutative with retries

If a forward action and its compensation are both retried because of network issues, or arrive out of order, the end state must still be correct. This is why reservations with an explicit status work better than direct mutations of a counter:

// BAD: Direct mutation — compensation can overshoot
async function addStock(itemId, quantity) {
  await db.query('UPDATE items SET stock = stock + $1 WHERE id = $2', [quantity, itemId]);
}

// GOOD: Reservation-based — idempotent and safe
async function releaseReservation(reservationId) {
  const result = await db.query(
    'UPDATE reservations SET status = $1 WHERE id = $2 AND status = $3',
    ['released', reservationId, 'reserved']
  );
  // Only releases if still in 'reserved' state — safe to retry
  return result.rowCount > 0;
}
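
The difference is easiest to see by simulating a retry. The following sketch replaces the database with in-memory objects (the `item` and `reservations` data are made up for illustration) and delivers each compensation twice:

```javascript
// In-memory stand-ins for the two approaches (hypothetical data, no database).
const item = { stock: 10 };
const reservations = new Map([['res-1', { quantity: 5, status: 'reserved' }]]);

// Direct mutation: every retry adds the quantity again.
function addStock(quantity) {
  item.stock += quantity;
}

// Reservation-based: the status check turns a second call into a no-op.
function releaseReservation(reservationId) {
  const reservation = reservations.get(reservationId);
  if (!reservation || reservation.status !== 'reserved') return false;
  reservation.status = 'released';
  return true;
}

// Simulate a compensation delivered twice because of a network retry.
addStock(5);
addStock(5);
const first = releaseReservation('res-1');
const second = releaseReservation('res-1');

console.log(item.stock);    // 20: overshoots the intended 15
console.log(first, second); // true false
```

The counter has no memory of which compensations it has already applied, while the reservation row does, and that is what makes the second approach safe to retry.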

3. Some actions can’t be compensated

You can’t unsend an email. You can’t un-notify a user. For these cases:

Reorder your saga steps: Put non-compensatable actions last. Send the email only after all other steps succeed.
Use a buffer: Queue the email but don’t send it until the saga completes.
Send a correction: If the email was sent, send a follow-up cancellation email.
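
The buffering option can be sketched roughly as follows. `NotificationBuffer` and its methods are hypothetical names, and a real implementation would persist the queue (for example in an outbox table) rather than keep it in memory:

```javascript
// Hypothetical notification buffer: emails are queued per saga and only
// handed to the sender once the saga reaches COMPLETED.
class NotificationBuffer {
  constructor(send) {
    this.send = send;         // function that actually delivers an email
    this.pending = new Map(); // sagaId -> queued emails
  }

  // Called from a saga step instead of sending immediately
  queue(sagaId, email) {
    const list = this.pending.get(sagaId) || [];
    list.push(email);
    this.pending.set(sagaId, list);
  }

  // Called when the saga completes: deliver everything that was queued
  async flush(sagaId) {
    const emails = this.pending.get(sagaId) || [];
    this.pending.delete(sagaId);
    for (const email of emails) {
      await this.send(email);
    }
    return emails.length;
  }

  // Called when the saga is compensated: drop the emails unsent
  discard(sagaId) {
    return this.pending.delete(sagaId);
  }
}

// Demo: saga-1 completes, saga-2 is compensated.
async function demo() {
  const sent = [];
  const buffer = new NotificationBuffer(async (email) => { sent.push(email.to); });
  buffer.queue('saga-1', { to: 'a@example.com', subject: 'Order confirmed' });
  buffer.queue('saga-2', { to: 'b@example.com', subject: 'Order confirmed' });
  await buffer.flush('saga-1');
  buffer.discard('saga-2');
  return sent;
}
```

With this in place the confirmation email needs no compensation at all, because for a failed saga it was never sent.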

Saga State Machine

Every saga instance goes through a well-defined set of states. Modeling this as a state machine makes the behavior explicit and debuggable:

[Figure: SAGA State Machine]

Here’s a state machine implementation:

const SagaStateMachine = {
  STARTED: {
    transitions: {
      BEGIN_PAYMENT: 'PAYMENT_PENDING',
      ABORT: 'COMPENSATING',
    },
  },
  PAYMENT_PENDING: {
    transitions: {
      PAYMENT_SUCCESS: 'INVENTORY_PENDING',
      PAYMENT_FAILED: 'COMPENSATING',
    },
  },
  INVENTORY_PENDING: {
    transitions: {
      INVENTORY_SUCCESS: 'SHIPPING_PENDING',
      INVENTORY_FAILED: 'COMPENSATING',
    },
  },
  SHIPPING_PENDING: {
    transitions: {
      SHIPPING_SUCCESS: 'COMPLETED',
      SHIPPING_FAILED: 'COMPENSATING',
    },
  },
  COMPENSATING: {
    transitions: {
      COMPENSATION_DONE: 'COMPENSATED',
      COMPENSATION_FAILED: 'REQUIRES_MANUAL',
    },
  },
  COMPLETED: { transitions: {} },
  COMPENSATED: { transitions: {} },
  REQUIRES_MANUAL: { transitions: {} },
};

class SagaInstance {
  constructor(id, initialState = 'STARTED') {
    this.id = id;
    this.state = initialState;
    this.history = [{ state: initialState, timestamp: Date.now() }];
  }

  transition(event) {
    const currentConfig = SagaStateMachine[this.state];
    const nextState = currentConfig?.transitions[event];

    if (!nextState) {
      throw new Error(
        `Invalid transition: ${this.state} + ${event}. ` +
        `Allowed: ${Object.keys(currentConfig?.transitions || {}).join(', ')}`
      );
    }

    this.state = nextState;
    this.history.push({ state: nextState, event, timestamp: Date.now() });
    return this;
  }

  isTerminal() {
    const config = SagaStateMachine[this.state];
    return Object.keys(config.transitions).length === 0;
  }
}

Usage:

const saga = new SagaInstance('order-123');

saga.transition('BEGIN_PAYMENT');     // → PAYMENT_PENDING
saga.transition('PAYMENT_SUCCESS');   // → INVENTORY_PENDING
saga.transition('INVENTORY_FAILED');  // → COMPENSATING
saga.transition('COMPENSATION_DONE'); // → COMPENSATED

console.log(saga.history);
// Full audit trail of every state change with timestamps

Saga Store — Persistence Layer

The saga store is the source of truth. If the orchestrator crashes mid-saga, it must recover by reading the saga state and resuming from where it left off.

CREATE TABLE sagas (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    type            VARCHAR(100) NOT NULL,  -- 'order_saga', 'refund_saga'
    correlation_id  VARCHAR(255) NOT NULL,  -- order_id, user_id, etc.
    state           VARCHAR(50) NOT NULL,
    payload         JSONB NOT NULL,
    steps           JSONB NOT NULL DEFAULT '[]',
    created_at      TIMESTAMPTZ DEFAULT NOW(),
    updated_at      TIMESTAMPTZ DEFAULT NOW(),
    version         INTEGER DEFAULT 1       -- optimistic concurrency
);

CREATE INDEX idx_sagas_correlation ON sagas(correlation_id);
CREATE INDEX idx_sagas_state ON sagas(state) WHERE state NOT IN ('COMPLETED', 'COMPENSATED');

CREATE TABLE saga_events (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    saga_id     UUID REFERENCES sagas(id),
    event_type  VARCHAR(100) NOT NULL,
    payload     JSONB NOT NULL,
    created_at  TIMESTAMPTZ DEFAULT NOW()
);

Recovery on startup:

async function recoverPendingSagas() {
  const pendingSagas = await db.query(`
    SELECT * FROM sagas
    WHERE state NOT IN ('COMPLETED', 'COMPENSATED', 'REQUIRES_MANUAL')
    AND updated_at < NOW() - INTERVAL '5 minutes'
    ORDER BY created_at ASC
  `);

  for (const saga of pendingSagas.rows) {
    console.log(`Recovering saga ${saga.id} in state ${saga.state}`);

    if (saga.state === 'COMPENSATING') {
      await orchestrator.compensate(saga);
    } else {
      // Resume from last completed step
      await orchestrator.resume(saga);
    }
  }
}
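
The `resume` call is not defined anywhere in this article, so here is one way it might look. This is a sketch under assumptions: the saga is described by an ordered list of named steps, the persisted `steps` column holds entries with `status: 'completed'`, and `recordStep` is a persistence callback supplied by the caller.

```javascript
// Hypothetical resume(): skip steps already recorded as completed, run the rest.
async function resume(definition, saga, recordStep) {
  const completed = new Set(
    saga.steps.filter((s) => s.status === 'completed').map((s) => s.name)
  );
  const executed = [];

  for (const stepDef of definition.steps) {
    if (completed.has(stepDef.name)) continue; // finished before the crash

    // The action must be idempotent: the crash may have happened after the
    // remote call succeeded but before the step was recorded.
    const result = await stepDef.action({ sagaId: saga.id, payload: saga.payload });
    await recordStep(saga.id, { name: stepDef.name, status: 'completed', result });
    executed.push(stepDef.name);
  }

  return executed;
}
```

The design choice worth noting is that recovery re-runs the first unrecorded step rather than guessing whether it happened, which only works if every step accepts an idempotency key.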

Real-World Example: E-Commerce Order

Let’s put it all together with a complete e-commerce order saga:

// saga-definitions/order-saga.js
const OrderSagaDefinition = {
  name: 'create_order',

  steps: [
    {
      name: 'validate_order',
      action: async (ctx) => {
        const valid = await orderValidator.validate(ctx.payload);
        if (!valid) throw new ValidationError('Invalid order data');
        return { validated: true };
      },
      // No compensation needed — validation has no side effects
      compensation: null,
    },
    {
      name: 'reserve_payment',
      action: async (ctx) => {
        const payment = await paymentService.reserve({
          orderId: ctx.sagaId,
          amount: ctx.payload.total,
          method: ctx.payload.paymentMethod,
          idempotencyKey: `${ctx.sagaId}-payment`,
        });
        return { paymentId: payment.id };
      },
      compensation: async (ctx, stepResult) => {
        await paymentService.refund(stepResult.paymentId, {
          idempotencyKey: `${ctx.sagaId}-payment-refund`,
        });
      },
    },
    {
      name: 'reserve_inventory',
      action: async (ctx) => {
        const reservation = await inventoryService.reserve({
          orderId: ctx.sagaId,
          items: ctx.payload.items,
          idempotencyKey: `${ctx.sagaId}-inventory`,
        });
        return { reservationId: reservation.id };
      },
      compensation: async (ctx, stepResult) => {
        await inventoryService.release(stepResult.reservationId, {
          idempotencyKey: `${ctx.sagaId}-inventory-release`,
        });
      },
    },
    {
      name: 'create_shipment',
      action: async (ctx) => {
        const shipment = await shippingService.create({
          orderId: ctx.sagaId,
          address: ctx.payload.shippingAddress,
          items: ctx.payload.items,
          idempotencyKey: `${ctx.sagaId}-shipment`,
        });
        return { shipmentId: shipment.id };
      },
      compensation: async (ctx, stepResult) => {
        await shippingService.cancel(stepResult.shipmentId, {
          idempotencyKey: `${ctx.sagaId}-shipment-cancel`,
        });
      },
    },
    {
      name: 'send_confirmation',
      action: async (ctx) => {
        // Irreversible, so it runs last. The compensation below only fires
        // if a step is ever added after this one.
        await notificationService.sendOrderConfirmation({
          orderId: ctx.sagaId,
          email: ctx.payload.customerEmail,
        });
        return { notified: true };
      },
      compensation: async (ctx) => {
        // Send cancellation email as correction
        await notificationService.sendOrderCancellation({
          orderId: ctx.sagaId,
          email: ctx.payload.customerEmail,
        });
      },
    },
  ],
};

A generic saga executor that runs any saga definition:

class SagaExecutor {
  constructor(sagaStore) {
    this.sagaStore = sagaStore;
  }

  async run(definition, payload) {
    const saga = await this.sagaStore.create({
      type: definition.name,
      state: 'STARTED',
      payload,
      steps: [],
    });

    const ctx = { sagaId: saga.id, payload };
    const completedSteps = [];

    try {
      for (const stepDef of definition.steps) {
        await this.sagaStore.updateState(saga.id, `${stepDef.name}_PENDING`);

        const result = await stepDef.action(ctx);
        completedSteps.push({ definition: stepDef, result });

        await this.sagaStore.recordStep(saga.id, {
          name: stepDef.name,
          status: 'completed',
          result,
        });
      }

      await this.sagaStore.updateState(saga.id, 'COMPLETED');
      return { success: true, sagaId: saga.id };

    } catch (error) {
      console.error(`Saga ${saga.id} failed: ${error.message}`);
      await this.sagaStore.updateState(saga.id, 'COMPENSATING');

      // Compensate in reverse order. Attempt every step even if one fails,
      // and collect the errors instead of stopping at the first one.
      const compensationErrors = [];
      for (const step of [...completedSteps].reverse()) {
        if (!step.definition.compensation) continue;
        try {
          await step.definition.compensation(ctx, step.result);
          await this.sagaStore.recordStep(saga.id, {
            name: step.definition.name,
            status: 'compensated',
          });
        } catch (compError) {
          compensationErrors.push(compError);
          await this.sagaStore.recordStep(saga.id, {
            name: step.definition.name,
            status: 'compensation_failed',
            error: compError.message,
          });
        }
      }

      if (compensationErrors.length > 0) {
        await this.sagaStore.updateState(saga.id, 'REQUIRES_MANUAL');
        throw new SagaCompensationError(saga.id, compensationErrors[0]);
      }

      await this.sagaStore.updateState(saga.id, 'COMPENSATED');
      throw new SagaFailedError(saga.id, error.message);
    }
  }
}

Handling Edge Cases

1. Saga Timeout

What if a service never responds? Set a timeout on the entire saga:

async function executeWithTimeout(saga, timeoutMs = 30000) {
  let timer;
  const timeoutPromise = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new SagaTimeoutError(saga.id)), timeoutMs);
  });

  try {
    return await Promise.race([
      sagaExecutor.run(saga.definition, saga.payload),
      timeoutPromise,
    ]);
  } catch (error) {
    if (error instanceof SagaTimeoutError) {
      // Promise.race does not cancel the executor: it may still be running,
      // so forceCompensate must tolerate steps that finish after the timeout
      await sagaExecutor.forceCompensate(saga.id);
    }
    throw error;
  } finally {
    clearTimeout(timer); // don't leave the timer pending once the race settles
  }
}

2. Concurrent Sagas on the Same Resource

Two sagas reserving the same inventory item can conflict. Use optimistic locking:

UPDATE inventory
SET reserved_count = reserved_count + $1,
    version = version + 1
WHERE item_id = $2
  AND version = $3
  AND (stock - reserved_count) >= $1;

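If the update affects zero rows (rowCount === 0), either the version changed (a concurrent modification) or there is not enough unreserved stock. Re-read the row to tell the two cases apart: retry on a version conflict, fail the saga step on insufficient stock.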
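
A retry loop around that conditional update might look like the sketch below. The database row is replaced by an in-memory object, and `tryReserve`, `reserveWithRetry`, and the `readVersion` parameter are hypothetical names introduced for illustration.

```javascript
// In-memory stand-in for the inventory row (hypothetical values).
const row = { stock: 10, reservedCount: 0, version: 1 };

// Mirrors the conditional UPDATE: succeeds only if the version still matches
// and enough unreserved stock remains. Returns the number of rows changed.
function tryReserve(quantity, expectedVersion) {
  if (row.version !== expectedVersion) return 0;
  if (row.stock - row.reservedCount < quantity) return 0;
  row.reservedCount += quantity;
  row.version += 1;
  return 1;
}

// Retry on version conflicts, fail fast on insufficient stock.
// `readVersion` is injectable so the demo can simulate a stale read.
function reserveWithRetry(quantity, readVersion = () => row.version, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const version = readVersion();
    if (tryReserve(quantity, version) === 1) {
      return { ok: true, attempts: attempt };
    }
    if (row.stock - row.reservedCount < quantity) {
      return { ok: false, reason: 'insufficient_stock' };
    }
    // Otherwise the version moved under us: loop and re-read
  }
  return { ok: false, reason: 'too_many_conflicts' };
}

// Saga A reads version 1, then saga B commits first and bumps the version.
let reads = 0;
const staleThenFresh = () => (reads++ === 0 ? 1 : row.version);
tryReserve(4, 1);                                    // saga B wins
const sagaA = reserveWithRetry(4, staleThenFresh);   // conflicts once, then succeeds
const sagaC = reserveWithRetry(4);                   // only 2 units left

console.log(sagaA); // { ok: true, attempts: 2 }
console.log(sagaC); // { ok: false, reason: 'insufficient_stock' }
```

Capping the attempts matters: under heavy contention an unbounded retry loop turns a conflict into a hang, and the saga is better off failing the step and compensating.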

3. Observability

Sagas are hard to debug without proper observability. Add correlation IDs and structured logging:

const sagaLogger = {
  stepStarted(sagaId, stepName) {
    console.log(JSON.stringify({
      event: 'saga_step_started',
      sagaId,
      step: stepName,
      timestamp: new Date().toISOString(),
    }));
  },
  stepCompleted(sagaId, stepName, durationMs) {
    console.log(JSON.stringify({
      event: 'saga_step_completed',
      sagaId,
      step: stepName,
      durationMs,
      timestamp: new Date().toISOString(),
    }));
  },
  sagaFailed(sagaId, failedStep, error) {
    console.log(JSON.stringify({
      event: 'saga_failed',
      sagaId,
      failedStep,
      error: error.message,
      timestamp: new Date().toISOString(),
    }));
  },
};

Track these metrics:

Saga completion rate — percentage that reach COMPLETED state
Saga duration — p50/p95/p99 end-to-end time
Compensation rate — how often compensations run
Manual intervention rate — how often REQUIRES_MANUAL is reached
Step failure distribution — which step fails most often
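
As a rough illustration of how these numbers could be derived, the sketch below summarizes a list of finished saga records. The record shape (`state`, `durationMs`, optional `failedStep`) is an assumption, and in practice these would more likely be emitted as counters and histograms to a metrics system than computed in application code.

```javascript
// Nearest-rank percentile over an ascending array of numbers.
function percentile(sortedValues, p) {
  if (sortedValues.length === 0) return null;
  const rank = Math.ceil((p / 100) * sortedValues.length);
  return sortedValues[Math.max(rank, 1) - 1];
}

// Summarize finished sagas. Each record is assumed to look like
// { state, durationMs, failedStep? }.
function summarizeSagas(sagas) {
  const total = sagas.length;
  const count = (state) => sagas.filter((s) => s.state === state).length;
  const durations = sagas.map((s) => s.durationMs).sort((a, b) => a - b);

  const failuresByStep = {};
  for (const s of sagas) {
    if (s.failedStep) {
      failuresByStep[s.failedStep] = (failuresByStep[s.failedStep] || 0) + 1;
    }
  }

  return {
    completionRate: total ? count('COMPLETED') / total : 0,
    // A saga that ended in REQUIRES_MANUAL also ran compensations
    compensationRate: total ? (count('COMPENSATED') + count('REQUIRES_MANUAL')) / total : 0,
    manualInterventionRate: total ? count('REQUIRES_MANUAL') / total : 0,
    p50: percentile(durations, 50),
    p95: percentile(durations, 95),
    failuresByStep,
  };
}
```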

Common Pitfalls

1. Not persisting saga state before executing steps. If the orchestrator crashes after sending a command but before recording the step, you don’t know whether to retry or compensate. Always persist first.

2. Non-idempotent compensations. If a refund is issued twice, the customer gets double their money back. Always use idempotency keys.

3. Putting non-compensatable steps early. If you send a confirmation email in step 2 and payment fails in step 3, you’ve notified the customer about an order that won’t be fulfilled. Put irreversible actions last.

4. Ignoring partial failures in compensation. If compensation for step 2 fails, you must still try to compensate step 1. Don’t short-circuit — attempt all compensations and collect errors.

5. No timeout on individual steps. A hung service can block the entire saga indefinitely. Set timeouts on each service call and treat timeouts as failures.
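
For the last pitfall, a per-call timeout wrapper can be as small as the sketch below. `withTimeout` and `StepTimeoutError` are hypothetical names, not part of any library used in this article.

```javascript
// Error type so callers can tell a timeout apart from a service failure.
class StepTimeoutError extends Error {
  constructor(stepName, ms) {
    super(`Step ${stepName} timed out after ${ms}ms`);
    this.name = 'StepTimeoutError';
    this.stepName = stepName;
  }
}

// Reject if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms, stepName) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new StepTimeoutError(stepName, ms)), ms);
  });
  // Clear the timer whichever side wins so it cannot keep the process alive
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

In the orchestrator this would wrap each client call, for example `await withTimeout(this.paymentClient.reserve(request), 5000, 'payment')`. Keep in mind that the timeout does not cancel the underlying request: a step that timed out may still have succeeded on the remote side, which is one more reason compensations must be idempotent and safe to run against a step of unknown outcome.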

When to Use SAGA vs Alternatives

Scenario	Pattern
All data in one database	Use a regular ACID transaction
2-3 tightly coupled services	Consider 2PC if latency is acceptable
Multiple independent services, eventual consistency OK	SAGA pattern
Read-heavy with rare cross-service writes	SAGA for writes, CQRS for reads
Need strong consistency across services	Reconsider your service boundaries

Summary

The SAGA pattern is a widely used approach for handling transactions that span multiple microservices:

Orchestration gives you a clear, centralized workflow — use it for complex flows
Choreography decouples services through events — use it for simple, linear flows
Compensations are semantic undos, not rollbacks — they must be idempotent
Persist saga state before each step — this is your crash recovery mechanism
Put irreversible actions last — you can’t unsend an email
Monitor everything — saga completion rate and compensation rate are your key metrics

The pattern adds complexity, but it’s the price of distributed systems. The alternative — hoping that all services succeed or manually fixing data inconsistencies — is far worse.