Quorum Replication
Tunable W/R; read-after-write; sloppy quorum. Ch 11.
This interactive explanation is built for system design interview prep: step through Quorum Replication, watch the internal state change, and connect the concept to real distributed-system trade-offs.
Overview
Quorum replication is the compromise between blind mirroring ("write to every replica, block until all ack") and fire-and-forget ("write to one, hope it replicates"). Given N replicas, you pick a write quorum W and a read quorum R such that W + R > N. Any write that reached W replicas is guaranteed to be visible to any read that visits R replicas, because the two sets must overlap by at least one node. This single inequality is the mathematical backbone of Dynamo, Cassandra, Riak, and every other masterless store. You get to tune where that intersection sits: W=N, R=1 optimises read latency at the cost of write availability; W=1, R=N does the opposite; W=R=((N+1)/2) is majority quorum and is what most strongly-consistent deployments pick. Because writes can race, replicas need a conflict resolution rule, typically last-write-wins with wall-clock or Lamport timestamps, or vector clocks for safe multi-writer merges.
How it works
On put(k, v), the client computes a preference list of N replicas from the consistent-hash ring and fires the write in parallel to all of them. Every replica stamps the value with a version (timestamp or vector clock), persists it, and returns an ack. The client blocks until at least W acks arrive, then returns success to the caller. The remaining N-W replicas may ack later, fail silently, or fall behind; anti-entropy (hinted handoff, read-repair, Merkle sync) catches them up asynchronously. On get(k), the client sends read requests to the same preference list and collects responses until R replicas have answered. Because replicas can disagree, the client merges the R responses by picking the highest version: for LWW, the newest timestamp wins; for vector clocks, concurrent siblings are returned to the caller to resolve. If a read discovers a stale replica, it triggers a read-repair by writing the winning version back. Tuning W and R is the whole game. W=R=majority gives linearisable behaviour if the timestamps are monotonic and the writes go through a coordinator, but in masterless setups it only gives "eventually consistent with quorum visibility." Latency is dominated by the slowest of the W (or R) responses, so sloppy quorum - accepting acks from any live node, not just the preference list - trades strict overlap for availability during partitions.
Implementation
public class QuorumClient<V> {
private final List<Replica<V>> replicas; // size N, chosen by consistent hashing
private final int W;
private final int R;
private final ExecutorService pool;
private final Clock clock;
public QuorumClient(List<Replica<V>> replicas, int W, int R,
ExecutorService pool, Clock clock) {
if (W + R <= replicas.size()) throw new IllegalArgumentException("W+R must exceed N");
this.replicas = replicas; this.W = W; this.R = R;
this.pool = pool; this.clock = clock;
}
public void put(String key, V value) throws Exception {
long ts = clock.millis();
Versioned<V> versioned = new Versioned<>(value, ts);
CountDownLatch acked = new CountDownLatch(W);
AtomicInteger failures = new AtomicInteger();
for (Replica<V> r : replicas) {
pool.submit(() -> {
try { r.write(key, versioned); acked.countDown(); }
catch (Exception e) { failures.incrementAndGet(); }
});
}
if (!acked.await(200, TimeUnit.MILLISECONDS))
throw new TimeoutException("quorum write failed: " + failures.get() + " errors");
}
public Versioned<V> get(String key) throws Exception {
BlockingQueue<Versioned<V>> responses = new LinkedBlockingQueue<>();
for (Replica<V> r : replicas) {
pool.submit(() -> {
try { Versioned<V> v = r.read(key); if (v != null) responses.offer(v); }
catch (Exception ignored) {}
});
}
List<Versioned<V>> gathered = new ArrayList<>(R);
for (int i = 0; i < R; i++) {
Versioned<V> v = responses.poll(200, TimeUnit.MILLISECONDS);
if (v == null) throw new TimeoutException("quorum read incomplete");
gathered.add(v);
}
// Last-write-wins reconciliation; swap for vector-clock merge when needed.
Versioned<V> winner = gathered.stream()
.max(Comparator.comparingLong(Versioned::timestamp))
.orElseThrow();
triggerReadRepair(key, gathered, winner);
return winner;
}
private void triggerReadRepair(String key, List<Versioned<V>> seen, Versioned<V> winner) {
for (int i = 0; i < seen.size(); i++) {
if (seen.get(i).timestamp() < winner.timestamp()) {
final int idx = i;
pool.submit(() -> replicas.get(idx).write(key, winner));
}
}
}
public record Versioned<V>(V value, long timestamp) {}
}
Complexity
- Write latency:
slowest of the first W replica acks - Read latency:
slowest of the first R replica responses - Messages per put:
O(N) sent, O(W) on critical path - Messages per get:
O(N) sent, O(R) on critical path - Storage overhead:
O(N) copies of every key
Key design decisions & trade-offs
- Quorum sizing — Chosen: W=R=majority as default. Gives read-your-write visibility with minimal write latency. Tuning W up favours durability; tuning R up favours staleness protection at the cost of read latency.
- Conflict resolution — Chosen: Last-write-wins with physical timestamps. Simple and O(1), but silently loses concurrent updates when clocks drift. Vector clocks preserve causality but push merge logic to the application.
- Sloppy quorum under partition — Chosen: Accept any W live nodes, hint replay to preference list later. Keeps the cluster writable during partial partitions at the cost of temporary W+R <= N overlap failure, which can return stale reads until hinted handoff replays.
Common pitfalls
- Using wall-clock timestamps without NTP monitoring, so clock skew silently picks older values as winners
- Running W+R = N (equal) instead of strictly greater than N, which permits reads that miss writes
- Assuming quorum reads are linearisable in a masterless store; without a coordinator or ABD-style read-back they are merely regular
- Counting acks from failed writes that later rolled back during a crash-recovery window
Interview follow-ups
- Layer Merkle anti-entropy so divergent replicas reconcile without full-table scans
- Replace physical timestamps with hybrid logical clocks (HLC) to bound skew while keeping LWW semantics
- Add read-your-writes session tracking via sticky coordinators so clients never see their own older version
- Consider CRDTs for counters and sets so conflicting concurrent writes merge without application-level resolution
Recommended reading
- Alex Petrov, Database Internals — storage engines and distributed systems internals.
- Martin Kleppmann, Designing Data-Intensive Applications (DDIA) — data models, replication, partitioning, consistency.
- The System Design Primer — high-level design building blocks.
- Foundational networking + web-security references (TCP/IP, TLS 1.3, OWASP Top 10).