Two-Phase Commit
Prepare/commit + coordinator crash blocking. Ch 13.
This interactive explanation is built for system design interview prep: step through Two-Phase Commit, watch the internal state change, and connect the concept to real distributed-system trade-offs.
Overview
Two-phase commit (2PC) is the classical protocol for atomically committing a transaction across multiple independent participants: every participant either commits or every participant aborts, even though they sit behind their own locks and logs. A coordinator drives the transaction through two rounds. In Phase 1 it asks every participant "can you commit?" and each participant either votes YES (and durably promises it will commit if asked) or votes NO. In Phase 2 the coordinator tallies the votes and broadcasts the decision: COMMIT if every vote was YES, ABORT otherwise. The protocol is the default in XA, classical distributed databases, and many microservice saga implementations, and it is the honest answer whenever you need cross-service atomicity. It is also famously the protocol that blocks forever if the coordinator dies at the worst possible moment, which is why it is the gateway drug to three-phase commit, Paxos Commit, and Raft-backed transaction managers.
How it works
A transaction arrives at the coordinator with a set of participants already enlisted (databases, queues, remote services). The coordinator assigns a transaction id, durably logs BEGIN, and enters Phase 1 by sending PREPARE to every participant. A participant that receives PREPARE executes the transaction locally up to the point of commit, acquires all necessary locks, writes an undo/redo record to its WAL, forces it to disk, and only then replies YES; this is the contract that makes its vote binding. A participant that cannot commit (constraint violation, resource exhaustion) writes ABORT locally and replies NO. The coordinator waits for all votes. If any vote is NO (or any participant times out), it logs ABORT and sends ABORT to everyone. If every vote is YES, it logs COMMIT - this log record is the point of no return - and sends COMMIT to everyone. Participants apply the durable outcome, release their locks, and ack. The coordinator clears the transaction log when all acks return. The protocol is blocking because a participant that has voted YES and then lost contact with the coordinator cannot unilaterally decide: it must keep locks held and wait for the coordinator to recover and tell it the fate. In practice, production 2PC pairs the coordinator with heartbeats, a replicated coordinator log, and a participant-to-participant "what did you hear?" recovery path to limit how long a crashed coordinator can freeze the cluster.
Implementation
public class Coordinator {
private final List<ParticipantClient> participants;
private final TransactionLog log;
private static final long VOTE_TIMEOUT_MS = 2_000;
public Coordinator(List<ParticipantClient> participants, TransactionLog log) {
this.participants = participants; this.log = log;
}
public boolean commit(String txId) {
log.append(txId, "BEGIN");
// Phase 1: PREPARE
List<CompletableFuture<Boolean>> votes = new ArrayList<>();
for (ParticipantClient p : participants) {
votes.add(CompletableFuture.supplyAsync(() -> p.prepare(txId))
.orTimeout(VOTE_TIMEOUT_MS, TimeUnit.MILLISECONDS)
.exceptionally(e -> false));
}
boolean allYes = votes.stream().map(CompletableFuture::join)
.reduce(true, Boolean::logicalAnd);
// Phase 2: decide and broadcast
if (allYes) {
log.append(txId, "COMMIT"); // point of no return
for (ParticipantClient p : participants) p.commit(txId);
log.append(txId, "DONE");
return true;
} else {
log.append(txId, "ABORT");
for (ParticipantClient p : participants) p.abort(txId);
log.append(txId, "DONE");
return false;
}
}
/** Called on coordinator restart to finish any in-flight transactions. */
public void recover() {
for (TransactionLog.Record r : log.unfinished()) {
String decision = r.lastEntry();
for (ParticipantClient p : participants) {
if ("COMMIT".equals(decision)) p.commit(r.txId());
else p.abort(r.txId());
}
log.append(r.txId(), "DONE");
}
}
}
public class Participant {
public enum State { IDLE, PREPARED, COMMITTED, ABORTED }
private final ParticipantLog log;
private final LockManager locks;
private final TxExecutor executor;
private final Map<String, State> txStates = new ConcurrentHashMap<>();
public Participant(ParticipantLog log, LockManager locks, TxExecutor executor) {
this.log = log; this.locks = locks; this.executor = executor;
}
public synchronized boolean onPrepare(String txId, List<Op> ops) {
if (txStates.containsKey(txId)) return txStates.get(txId) == State.PREPARED;
try {
locks.acquireAll(txId, ops);
executor.stage(txId, ops); // materialise undo/redo records
log.forceSync(txId, "PREPARED"); // durable before we vote YES
txStates.put(txId, State.PREPARED);
return true;
} catch (Exception e) {
locks.releaseAll(txId);
log.append(txId, "ABORTED");
txStates.put(txId, State.ABORTED);
return false;
}
}
public synchronized void onCommit(String txId) {
if (txStates.get(txId) != State.PREPARED) return; // idempotent reapply
log.forceSync(txId, "COMMIT");
executor.applyStaged(txId);
locks.releaseAll(txId);
txStates.put(txId, State.COMMITTED);
}
public synchronized void onAbort(String txId) {
log.append(txId, "ABORT");
executor.discardStaged(txId);
locks.releaseAll(txId);
txStates.put(txId, State.ABORTED);
}
public State status(String txId) {
return txStates.getOrDefault(txId, State.IDLE);
}
}
Complexity
- Messages per transaction:
O(N) PREPARE + O(N) vote + O(N) decision + O(N) ack = 4N - Log forces per participant:
2 (PREPARED, COMMIT/ABORT) - Log forces on coordinator:
2 (BEGIN, decision) - Lock hold time:
O(max participant Phase-1 latency) - Blocking window if coordinator fails:
unbounded until coordinator recovery
Key design decisions & trade-offs
- Atomicity model — Chosen: Durable vote before YES. A participant voting YES without forcing its log could crash and forget its promise, breaking atomicity. The forced write is the price of correctness.
- Blocking behaviour — Chosen: Accept coordinator single point of failure. Pure 2PC blocks participants while the coordinator is down. Variants (3PC, Paxos Commit) solve this at the cost of more rounds and more complexity.
- Cross-service scope — Chosen: Use 2PC only within a trust boundary. Running 2PC across services you do not operate couples failure domains. Sagas with compensating actions are preferred for long-running or cross-org workflows.
Common pitfalls
- Voting YES without forcing the WAL, so a crashed participant forgets its promise and a later COMMIT finds no transaction
- Holding locks across PREPARE for human-scale time, deadlocking throughput on the hottest rows
- Running a single coordinator without replication, making the whole transaction system as available as one node
- Timing out a prepared participant and aborting on the coordinator while the participant keeps locks waiting
Interview follow-ups
- Replicate the coordinator log with Raft so a crashed coordinator can be failed over without blocking
- Add presumed-abort optimisation to skip the ABORT log record on read-only or single-vote-NO paths
- Move to Paxos Commit when you need non-blocking semantics under coordinator failure
- Swap 2PC for sagas and idempotent compensations when participants span independent services or organisations
Recommended reading
- Alex Petrov, Database Internals — storage engines and distributed systems internals.
- Martin Kleppmann, Designing Data-Intensive Applications (DDIA) — data models, replication, partitioning, consistency.
- The System Design Primer — high-level design building blocks.
- Foundational networking + web-security references (TCP/IP, TLS 1.3, OWASP Top 10).