
  • Andre Lucas
  • Mon Feb 23 2026

Dual Write Problem: The Consistency Trap in Microservices

You've probably shipped this code. It passed code review. It passed integration tests. It's running in production right now.

function placeOrder(order):
    db.save(order)
    broker.publish("OrderPlaced", order)
    return order.id

On the happy path, both operations succeed. The order lands in the database. The event lands in the broker. Downstream services receive it. Everything looks correct.

Now assume the process crashes between the two lines. Or the network to the broker times out. Or the database commits and the broker responds with a transient error. In each of these cases — which are not edge cases but normal failure modes in distributed systems — your database and your broker end up in different states. One believes the order exists. The other doesn't know about it.

This is the dual write problem, and it will eventually affect every system that writes to a database and a message broker in the same operation.
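The gap is easy to reproduce. Here is a minimal simulation of the code above, with an in-memory dict standing in for the database and a list standing in for the broker (both hypothetical stand-ins), and a transient broker failure injected on the second order:

```python
db, broker = {}, []  # stand-ins for the database and the message broker

def publish(event, payload, fail=False):
    if fail:
        raise TimeoutError("broker unreachable")
    broker.append((event, payload))

def place_order(order_id, broker_fails=False):
    db[order_id] = "PLACED"                          # step 1: "committed"
    publish("OrderPlaced", order_id, broker_fails)   # step 2: may fail
    return order_id

place_order(1)                         # happy path: both sides agree
try:
    place_order(2, broker_fails=True)  # broker times out AFTER the db write
except TimeoutError:
    pass

print(sorted(db))                # [1, 2] — the database has both orders
print([p for _, p in broker])    # [1]    — the broker saw only one
```

Two lines of failure handling, and the two systems already disagree about whether order 2 exists.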


What Is the Dual Write Problem?

The dual write problem is the impossibility of atomically writing to two independent systems — typically a relational database and a message broker — without a distributed coordination mechanism that spans both.

In a traditional single-database application, a transaction gives you atomicity: either all changes within the transaction commit together, or none do. That guarantee is foundational to data consistency. It's what makes database systems trustworthy.

The moment you introduce a second independent system — a Kafka cluster, a RabbitMQ broker, an external HTTP API — that atomicity guarantee evaporates. The two systems operate independently. There is no shared transaction log. There is no coordinator that can roll both back simultaneously if something goes wrong.

The result is a consistency gap: a window of time (however small) during which one system has committed a change and the other hasn't yet received it. In normal operation, this window closes quickly and is invisible. Under failure conditions, the window becomes permanent, and the inconsistency becomes a bug. This is the defining challenge of microservices data consistency — it surfaces precisely when your system is under stress.

What makes the dual write problem insidious is that it doesn't surface on the happy path. Your integration tests pass. Your load tests pass. Everything looks correct — until a partial failure exposes the inconsistency, often in production, often after months of accumulated silent drift.


Why Transactions Don't Cross System Boundaries (Kafka, RabbitMQ, or Any Broker)

The natural instinct is to ask: "Can't I just wrap both operations in a transaction?" The answer is no — not in any practical sense.

Database transactions are scoped to a single database connection. When you call db.save(), the database engine tracks the change in its write-ahead log and holds it until you commit. When you call broker.publish(), you're making a network call to a completely separate process with its own internal state and its own durability model.

BEGIN / COMMIT in SQL, @Transactional in Java, commit() on an ORM session — none of these reach across to the broker. They know nothing about it.
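You can see the boundary in a few lines. In this sketch, sqlite3 stands in for the database and a plain list stands in for the broker (an assumption — any external system behaves the same way): the transaction rolls back cleanly, but it cannot recall the call that left the process.

```python
import sqlite3

published = []  # stand-in for the broker: outside the database's reach

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY)")

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("INSERT INTO orders VALUES (1)")
        published.append("OrderPlaced")  # "network call" — not transactional
        raise RuntimeError("validation failed after publish")
except RuntimeError:
    pass

# The insert was rolled back, but the "broker" still has the event.
print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # 0
print(published)                                                  # ['OrderPlaced']
```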

But what about distributed transactions? Two-phase commit (2PC) is the classical mechanism for coordinating transactions across multiple participants. In theory, you could enlist both the database and the broker in a 2PC protocol and get atomic commits across both.

In practice, this rarely works:

  • Most message brokers don't support XA transactions — the protocol 2PC relies on
  • 2PC is fundamentally blocking: all participants hold locks while the coordinator decides, which destroys throughput under load
  • Distributed coordinators become single points of failure
  • The operational complexity of managing 2PC across polyglot infrastructure is substantial

The industry has largely abandoned 2PC for these reasons. Cloud-native architectures are built on the premise of eventual consistency. The transaction boundary stops at the database edge. The broker sits outside it. That boundary is exactly where the dual write problem lives.


The Three Failure Modes

The dual write problem isn't a single failure — it's a family of failures. Understanding each one precisely is essential before you can design against them.

Failure Mode 1: Lost Event

The database write succeeds, but the broker publish fails.

db.save(order)       // ✓ committed to database
broker.publish(...)  // ✗ network error — event never delivered

The order exists in the database. Downstream consumers never receive the OrderPlaced event. The inventory service never reserves stock. The notification service never sends a confirmation email.

The inconsistency is silent. No exception surfaces to the user — the order ID returns successfully. The divergence is invisible until a downstream team notices their records don't match the orders table.

Failure Mode 2: Phantom Event

The broker publish succeeds, but the database write fails or is rolled back.

broker.publish("OrderPlaced", order)  // ✓ event delivered to broker
db.save(order)                        // ✗ constraint violation — rolled back

Or more commonly, the publish happens inside a transaction that later rolls back:

begin transaction
    db.save(order)
    broker.publish(...)   // ⚠ DANGEROUS: event exits system before commit
    // something fails downstream — validation, HTTP call, DB constraint
rollback transaction
// order is gone, but the event is already in the broker

Downstream consumers are already acting on the event. Inventory has been reserved. An email has been sent. A charge is pending. But the order itself was never committed to the database.

This is the phantom event — reality visible to the outside world that doesn't exist in the source of truth. Undoing it requires compensating transactions across all downstream systems, which is a distributed systems problem at least as hard as the original one.

Failure Mode 3: Duplicate Event

The database write succeeds, but it's unknown whether the broker publish completed before the crash.

db.save(order)          // ✓ committed
broker.publish(...)     // call starts — process crashes mid-publish
// did the event reach the broker? unknown.

On recovery or retry, the application publishes again — because it cannot know whether the first attempt succeeded. The broker now has two copies of OrderPlaced. Downstream consumers process the event twice: inventory reserved twice, customer charged twice, two confirmation emails sent.

This failure mode is particularly dangerous because it doesn't prevent the system from appearing to work. Everything succeeds — it just succeeds too many times. The bugs are business-logic bugs: duplicate charges, over-reserved inventory, duplicate notifications. They appear in monitoring dashboards, not in error logs.

This failure also pushes complexity outward: it forces every downstream consumer to implement idempotency — the ability to detect and discard duplicate events. That's non-trivial logic distributed across every service that subscribes to your events.
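What that consumer-side idempotency looks like, in a minimal sketch (the event shape and the in-memory dedupe store are assumptions — in production the processed-ID check and the side effect belong in one database transaction, against a durable store):

```python
processed_ids = set()    # durable store in production, not an in-memory set
inventory_reserved = 0

def handle(event):
    """Process an event exactly once, discarding redelivered duplicates."""
    global inventory_reserved
    event_id, event_type, payload = event
    if event_id in processed_ids:
        return                     # duplicate — already handled, discard
    processed_ids.add(event_id)
    if event_type == "OrderPlaced":
        inventory_reserved += payload["qty"]

evt = ("evt-42", "OrderPlaced", {"qty": 3})
handle(evt)
handle(evt)                        # duplicate delivery — a no-op this time
print(inventory_reserved)          # 3, not 6
```

The logic is small, but every subscriber has to carry it, persist it, and get it right.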


Here is how the happy path and the three failure modes map to execution timing:

sequenceDiagram
    participant S as Service
    participant DB as Database
    participant B as Broker

    rect rgb(0, 0, 0)
        Note over S,B: Happy Path
        S->>DB: save(order) ✓
        S->>B: publish(OrderPlaced) ✓
    end

    rect rgb(0, 0, 0)
        Note over S,B: Lost Event
        S->>DB: save(order) ✓
        S->>B: publish(OrderPlaced) ✗ timeout
        Note over B: event never arrives
    end

    rect rgb(0, 0, 0)
        Note over S,B: Phantom Event
        S->>B: publish(OrderPlaced) ✓
        S->>DB: save(order) ✗ rollback
        Note over DB: order never committed
    end

    rect rgb(0, 0, 0)
        Note over S,B: Duplicate Event
        S->>DB: save(order) ✓
        S->>B: publish(OrderPlaced) — crash
        Note over S: retry on restart
        S->>B: publish(OrderPlaced) ✓
        Note over B: two copies delivered
    end

The Event-Driven Architecture in Practice lab reproduces each of these three scenarios against real infrastructure — you see exactly what breaks and when.


Why Application-Level Patches Fall Short

Given these failure modes, teams typically reach for application-level fixes: retry logic, state checks before publishing, compensating transactions.

These reduce the problem — they don't eliminate it.

Retry logic helps with transient broker failures but creates duplicates when you can't distinguish "the broker never received the event" from "the broker received it but the ACK was lost in transit." Without that distinction, retries trade lost events for duplicate events.
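The lost-ACK ambiguity is concrete in a few lines. In this simulation (an in-memory broker as a stand-in), the broker receives the first publish but the acknowledgment is lost, so the caller's only safe move — retrying — creates the duplicate:

```python
broker = []  # delivered events (hypothetical in-memory broker)

def publish(event, ack_lost=False):
    broker.append(event)            # the broker DID receive the event...
    if ack_lost:
        raise TimeoutError("ACK lost")  # ...but the caller can't know that

def publish_with_retry(event, attempts=3):
    for i in range(attempts):
        try:
            publish(event, ack_lost=(i == 0))  # first ACK lost, then retried
            return
        except TimeoutError:
            continue  # indistinguishable from "never received" — must retry

publish_with_retry("OrderPlaced")
print(broker)  # ['OrderPlaced', 'OrderPlaced'] — the retry made a duplicate
```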

Checking state before publishing — reading from the database to confirm the write succeeded before publishing — introduces read-after-write consistency challenges, adds latency to every write path, and still leaves a race condition between the check and the publish. And because the check never fails on the happy path, its failure branch is effectively dead code: rarely exercised and difficult to test correctly.

Compensating transactions — undoing business operations after a partial failure — work in simple two-step operations and break in complex ones. They require careful design, add code complexity to every service involved, and still have a window of inconsistency between when the failure is detected and when the compensation runs.

The fundamental problem with all of these approaches: they attempt to paper over a structural issue with application logic. The write to the database and the write to the broker are separate, uncoordinated operations. No amount of retry or checking can make two independent system writes atomic.

The reliable solutions address this at the architecture level.


The Solution Space

Three patterns reliably solve the dual write problem. Each eliminates the consistency gap at its root, with different trade-offs.

Transactional Outbox Pattern

The most widely adopted solution. The insight is simple: if you can't atomically write to a database and a broker, write to the database twice — and let a separate process relay from the database to the broker.

begin transaction
    db.save(order)                                          // business write
    db.save(outbox, { type: "OrderPlaced", payload: order }) // outbox write
commit transaction

// relay process (runs independently):
for each pending_message in outbox:
    broker.publish(pending_message)
    db.mark_as_processed(pending_message)

The two database writes are atomic — they commit together or not at all. The relay process reads from the outbox table and publishes to the broker. If the relay crashes, it restarts and continues from the last unprocessed row.

The critical guarantee: an event only enters the outbox if the corresponding business write committed. Phantom events are structurally impossible. Lost events are eliminated. The relay may deliver the event more than once under crash-recovery scenarios, so downstream idempotency remains a good practice — but you own the retry logic in one place, not across every consumer.

The trade-off: delivery is eventually consistent. The event reaches the broker shortly after the database commit, not instantaneously. For most event-driven architectures, this is perfectly acceptable. You also need a relay process and a strategy for managing outbox table growth.
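A runnable sketch of the whole pattern, with sqlite3 standing in for the business database and a list standing in for the broker (both assumptions — the table layout is illustrative, not a production schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         type TEXT, payload TEXT, processed INTEGER DEFAULT 0);
""")
broker = []  # stand-in for the message broker

def place_order(order_id):
    # One transaction: the business write and the outbox write commit together.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?)", (order_id,))
        conn.execute("INSERT INTO outbox (type, payload) VALUES (?, ?)",
                     ("OrderPlaced", str(order_id)))

def relay():
    # Runs independently; safe to crash and restart — it resumes from the
    # first unprocessed row (at-least-once delivery, hence idempotent consumers).
    rows = conn.execute(
        "SELECT id, type, payload FROM outbox WHERE processed = 0 ORDER BY id")
    for row_id, type_, payload in rows.fetchall():
        broker.append((type_, payload))
        with conn:
            conn.execute("UPDATE outbox SET processed = 1 WHERE id = ?", (row_id,))

place_order(1)
place_order(2)
relay()
print(broker)  # [('OrderPlaced', '1'), ('OrderPlaced', '2')]
```

Note that if the relay crashed between `broker.append` and the UPDATE, the next run would publish that row again — which is exactly the at-least-once guarantee the pattern promises, no more.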

To experiment with the pattern locally before implementing it, mars-enterprise-kit-lite spins up PostgreSQL, Kafka, and a Spring Boot service with Docker Compose — real infrastructure, no manual configuration. Clone, start the stack, and you have a working environment to validate your outbox table design against the full stack.

Change Data Capture (CDC)

A more infrastructure-level approach. Instead of writing to an explicit outbox table, you tap into the database's replication log — PostgreSQL's Write-Ahead Log, MySQL's binlog — and stream committed row-level changes directly to the broker.

Tools like Debezium act as CDC connectors: they read the replication log and publish change events to Kafka topics without any changes to application code. Every row that commits to the database automatically becomes an event.

The trade-off: CDC emits database-level change events, not domain events. You receive "row with these column values was inserted in this table" rather than "an order was placed with this business context." Translating between the two requires additional event-processing logic. Operationally, CDC connectors (Debezium, Kafka Connect) add infrastructure complexity. For teams already running Kafka Connect, this can be a natural fit. For teams that aren't, it's a significant operational investment — treat CDC as a future optimization, not a starting point.
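That translation layer might look like the following sketch. The input loosely follows the shape of a Debezium-style change-event envelope (op / after / source fields) — treat the exact structure, field names, and the `to_domain_event` helper as assumptions for illustration:

```python
def to_domain_event(change):
    """Translate a row-level change event into a business-level domain event."""
    if change["op"] != "c":                    # "c" = row created (insert)
        return None                            # ignore updates/deletes here
    if change["source"]["table"] != "orders":
        return None                            # only the orders table matters
    row = change["after"]
    # Row-level facts become an event with explicit business semantics.
    return {"type": "OrderPlaced", "order_id": row["id"], "total": row["total"]}

cdc_event = {
    "op": "c",
    "source": {"table": "orders"},
    "after": {"id": 42, "total": 99.5, "status": "NEW"},
}
print(to_domain_event(cdc_event))
```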

Event Sourcing

The most radical shift of the three. In event sourcing, the event log is the source of truth. Your service doesn't persist state — it persists events. The current state of an order is derived by replaying all OrderCreated, OrderUpdated, OrderShipped events from the beginning.

The dual write problem disappears because there is only one write target — the event store. Events are simultaneously the persistence mechanism and the communication mechanism. There's no separate database write to coordinate with.

The trade-off is substantial: event sourcing is a programming model, not just a pattern. It requires rethinking how you query state (event stores aren't relational), how you handle schema evolution (event formats must be versioned carefully), and how you reason about time and causality. It's the right choice for domains where auditability, temporal queries, and a complete history of every change are first-class requirements. It's heavyweight for services that just need reliable event publishing. Most teams reaching for event sourcing to solve the dual write problem don't need it — they need the Outbox Pattern.
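The core mechanic — state as a fold over events — fits in a few lines (the event names and fields below are illustrative, not a real event-store API):

```python
# The event log is the only thing ever written; state is derived from it.
events = [
    {"type": "OrderCreated", "id": 7, "items": 2},
    {"type": "OrderUpdated", "id": 7, "items": 3},
    {"type": "OrderShipped", "id": 7},
]

def replay(events):
    """Fold the event stream into the current state of the order."""
    state = {}
    for e in events:
        if e["type"] == "OrderCreated":
            state = {"id": e["id"], "items": e["items"], "status": "CREATED"}
        elif e["type"] == "OrderUpdated":
            state["items"] = e["items"]
        elif e["type"] == "OrderShipped":
            state["status"] = "SHIPPED"
    return state

print(replay(events))  # {'id': 7, 'items': 3, 'status': 'SHIPPED'}
```

Appending to `events` is both the persistence and the publication — which is precisely why there is no second write to coordinate.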


Which Pattern Should You Use?

For most teams building event-driven architecture today, the Transactional Outbox Pattern is the pragmatic choice.

It solves the core problem without requiring infrastructure changes (CDC) or a full programming model shift (Event Sourcing). It works with any relational database and any message broker, fits naturally alongside existing database-first development, and its failure modes are understandable and testable. For teams where data consistency in microservices is a priority but overhauling the stack isn't an option, the outbox pattern is the right starting point.

If you're splitting a monolith into microservices, the outbox pattern adopts incrementally — service by service, without restructuring your entire data layer.

CDC is worth evaluating if you already operate Kafka Connect infrastructure and want application-transparent event publishing. Event Sourcing is worth the investment if your domain has strong audit trail requirements and your team has the bandwidth to commit to the programming model.


The dual write problem doesn't announce itself. Systems run for months looking healthy while accumulating silent drift. Then a partial failure exposes years of inconsistency at once: a Lost Event that left downstream services out of sync, a Phantom Event that triggered business logic for an order that never existed, a Duplicate Event that resulted in a double charge traced back weeks later. The damage is business damage, not just technical. Designing against these failure modes deliberately — choosing the right pattern for your context — is what separates event-driven architectures that scale from ones that quietly fall out of sync.

What to Read Next

The Event-Driven Architecture in Practice lab reproduces all three failure modes against real infrastructure: live PostgreSQL and Kafka, with tests that force each failure condition and show exactly what breaks. It's the most direct path from theory to seeing the problem happen.

To have the local environment ready now, clone mars-enterprise-kit-lite and bring up the stack with Docker Compose — PostgreSQL, Kafka, and Spring Boot configured, no manual setup required.

Part 2 of this series covers the Transactional Outbox Pattern in depth: outbox table design, relay process with FOR UPDATE SKIP LOCKED, consumer idempotency, and Testcontainers integration tests. Coming soon.

Tags: Architecture, Event-Driven, Microservices
© 2026 Programming On Mars. All rights reserved.