Quorum Replication

By Rahul Kumar · Senior Software Engineer · Updated May 2026 · Category: Petrov · Distributed Systems (Part II)

Tunable W/R; read-after-write; sloppy quorum. Ch 11.

This interactive explanation is built for system design interview prep: step through Quorum Replication, watch the internal state change, and connect the concept to real distributed-system trade-offs.

Overview

Quorum replication is the compromise between blind mirroring ("write to every replica, block until all ack") and fire-and-forget ("write to one, hope it replicates"). Given N replicas, you pick a write quorum W and a read quorum R such that W + R > N. Any write that reached W replicas is guaranteed to be visible to any read that visits R replicas, because the two sets must overlap by at least one node. This single inequality is the mathematical backbone of Dynamo, Cassandra, Riak, and every other masterless store. You get to tune where that intersection sits: W=N, R=1 optimises read latency at the cost of write availability; W=1, R=N does the opposite; W=R=((N+1)/2) is majority quorum and is what most strongly-consistent deployments pick. Because writes can race, replicas need a conflict resolution rule, typically last-write-wins with wall-clock or Lamport timestamps, or vector clocks for safe multi-writer merges.

Quorum Replication — Interactive Simulator

Runs fully client-side in your browser; no sign-up. Or open full screen →

Launch the interactive Quorum Replication widget — step through the algorithm or protocol and observe the internal state updating in real time.

How it works

On put(k, v), the client computes a preference list of N replicas from the consistent-hash ring and fires the write in parallel to all of them. Every replica stamps the value with a version (timestamp or vector clock), persists it, and returns an ack. The client blocks until at least W acks arrive, then returns success to the caller. The remaining N-W replicas may ack later, fail silently, or fall behind; anti-entropy (hinted handoff, read-repair, Merkle sync) catches them up asynchronously. On get(k), the client sends read requests to the same preference list and collects responses until R replicas have answered. Because replicas can disagree, the client merges the R responses by picking the highest version: for LWW, the newest timestamp wins; for vector clocks, concurrent siblings are returned to the caller to resolve. If a read discovers a stale replica, it triggers a read-repair by writing the winning version back. Tuning W and R is the whole game. W=R=majority gives linearisable behaviour if the timestamps are monotonic and the writes go through a coordinator, but in masterless setups it only gives "eventually consistent with quorum visibility." Latency is dominated by the slowest of the W (or R) responses, so sloppy quorum - accepting acks from any live node, not just the preference list - trades strict overlap for availability during partitions.

Implementation

QuorumClient with parallel put/get

public class QuorumClient<V> {
    private final List<Replica<V>> replicas; // size N, chosen by consistent hashing
    private final int W;
    private final int R;
    private final ExecutorService pool;
    private final Clock clock;

    public QuorumClient(List<Replica<V>> replicas, int W, int R,
                        ExecutorService pool, Clock clock) {
        if (W + R <= replicas.size()) throw new IllegalArgumentException("W+R must exceed N");
        this.replicas = replicas; this.W = W; this.R = R;
        this.pool = pool; this.clock = clock;
    }

    public void put(String key, V value) throws Exception {
        long ts = clock.millis();
        Versioned<V> versioned = new Versioned<>(value, ts);
        CountDownLatch acked = new CountDownLatch(W);
        AtomicInteger failures = new AtomicInteger();
        for (Replica<V> r : replicas) {
            pool.submit(() -> {
                try { r.write(key, versioned); acked.countDown(); }
                catch (Exception e) { failures.incrementAndGet(); }
            });
        }
        if (!acked.await(200, TimeUnit.MILLISECONDS))
            throw new TimeoutException("quorum write failed: " + failures.get() + " errors");
    }

    public Versioned<V> get(String key) throws Exception {
        BlockingQueue<Versioned<V>> responses = new LinkedBlockingQueue<>();
        for (Replica<V> r : replicas) {
            pool.submit(() -> {
                try { Versioned<V> v = r.read(key); if (v != null) responses.offer(v); }
                catch (Exception ignored) {}
            });
        }
        List<Versioned<V>> gathered = new ArrayList<>(R);
        for (int i = 0; i < R; i++) {
            Versioned<V> v = responses.poll(200, TimeUnit.MILLISECONDS);
            if (v == null) throw new TimeoutException("quorum read incomplete");
            gathered.add(v);
        }
        // Last-write-wins reconciliation; swap for vector-clock merge when needed.
        Versioned<V> winner = gathered.stream()
                .max(Comparator.comparingLong(Versioned::timestamp))
                .orElseThrow();
        triggerReadRepair(key, gathered, winner);
        return winner;
    }

    private void triggerReadRepair(String key, List<Versioned<V>> seen, Versioned<V> winner) {
        for (int i = 0; i < seen.size(); i++) {
            if (seen.get(i).timestamp() < winner.timestamp()) {
                final int idx = i;
                pool.submit(() -> replicas.get(idx).write(key, winner));
            }
        }
    }

    public record Versioned<V>(V value, long timestamp) {}
}

Complexity

Write latency: slowest of the first W replica acks
Read latency: slowest of the first R replica responses
Messages per put: O(N) sent, O(W) on critical path
Messages per get: O(N) sent, O(R) on critical path
Storage overhead: O(N) copies of every key

Key design decisions & trade-offs

Quorum sizing — Chosen: W=R=majority as default. Gives read-your-write visibility with minimal write latency. Tuning W up favours durability; tuning R up favours staleness protection at the cost of read latency.
Conflict resolution — Chosen: Last-write-wins with physical timestamps. Simple and O(1), but silently loses concurrent updates when clocks drift. Vector clocks preserve causality but push merge logic to the application.
Sloppy quorum under partition — Chosen: Accept any W live nodes, hint replay to preference list later. Keeps the cluster writable during partial partitions at the cost of temporary W+R <= N overlap failure, which can return stale reads until hinted handoff replays.

Common pitfalls

Using wall-clock timestamps without NTP monitoring, so clock skew silently picks older values as winners
Running W+R = N (equal) instead of strictly greater than N, which permits reads that miss writes
Assuming quorum reads are linearisable in a masterless store; without a coordinator or ABD-style read-back they are merely regular
Counting acks from failed writes that later rolled back during a crash-recovery window

Interview follow-ups

Layer Merkle anti-entropy so divergent replicas reconcile without full-table scans
Replace physical timestamps with hybrid logical clocks (HLC) to bound skew while keeping LWW semantics
Add read-your-writes session tracking via sticky coordinators so clients never see their own older version
Consider CRDTs for counters and sets so conflicting concurrent writes merge without application-level resolution