← System Design Simulator

Quorum Replication

By Rahul Kumar · Senior Software Engineer · Updated · Category: Petrov · Distributed Systems (Part II)

Tunable W/R; read-after-write; sloppy quorum. Ch 11.

This interactive explanation is built for system design interview prep: step through Quorum Replication, watch the internal state change, and connect the concept to real distributed-system trade-offs.

Overview

Quorum replication is the compromise between blind mirroring ("write to every replica, block until all ack") and fire-and-forget ("write to one, hope it replicates"). Given N replicas, you pick a write quorum W and a read quorum R such that W + R > N. Any write that reached W replicas is guaranteed to be visible to any read that visits R replicas, because the two sets must overlap by at least one node. This single inequality is the mathematical backbone of Dynamo, Cassandra, Riak, and every other masterless store. You get to tune where that intersection sits: W=N, R=1 optimises read latency at the cost of write availability; W=1, R=N does the opposite; W=R=((N+1)/2) is majority quorum and is what most strongly-consistent deployments pick. Because writes can race, replicas need a conflict resolution rule, typically last-write-wins with wall-clock or Lamport timestamps, or vector clocks for safe multi-writer merges.

Quorum Replication — Interactive Simulator

Runs fully client-side in your browser; no sign-up. Or open full screen →

Launch the interactive Quorum Replication widget — step through the algorithm or protocol and observe the internal state updating in real time.

How it works

On put(k, v), the client computes a preference list of N replicas from the consistent-hash ring and fires the write in parallel to all of them. Every replica stamps the value with a version (timestamp or vector clock), persists it, and returns an ack. The client blocks until at least W acks arrive, then returns success to the caller. The remaining N-W replicas may ack later, fail silently, or fall behind; anti-entropy (hinted handoff, read-repair, Merkle sync) catches them up asynchronously. On get(k), the client sends read requests to the same preference list and collects responses until R replicas have answered. Because replicas can disagree, the client merges the R responses by picking the highest version: for LWW, the newest timestamp wins; for vector clocks, concurrent siblings are returned to the caller to resolve. If a read discovers a stale replica, it triggers a read-repair by writing the winning version back. Tuning W and R is the whole game. W=R=majority gives linearisable behaviour if the timestamps are monotonic and the writes go through a coordinator, but in masterless setups it only gives "eventually consistent with quorum visibility." Latency is dominated by the slowest of the W (or R) responses, so sloppy quorum - accepting acks from any live node, not just the preference list - trades strict overlap for availability during partitions.

Implementation

QuorumClient with parallel put/get
public class QuorumClient<V> {
    private final List<Replica<V>> replicas; // size N, chosen by consistent hashing
    private final int W;
    private final int R;
    private final ExecutorService pool;
    private final Clock clock;

    public QuorumClient(List<Replica<V>> replicas, int W, int R,
                        ExecutorService pool, Clock clock) {
        if (W + R <= replicas.size()) throw new IllegalArgumentException("W+R must exceed N");
        this.replicas = replicas; this.W = W; this.R = R;
        this.pool = pool; this.clock = clock;
    }

    public void put(String key, V value) throws Exception {
        long ts = clock.millis();
        Versioned<V> versioned = new Versioned<>(value, ts);
        CountDownLatch acked = new CountDownLatch(W);
        AtomicInteger failures = new AtomicInteger();
        for (Replica<V> r : replicas) {
            pool.submit(() -> {
                try { r.write(key, versioned); acked.countDown(); }
                catch (Exception e) { failures.incrementAndGet(); }
            });
        }
        if (!acked.await(200, TimeUnit.MILLISECONDS))
            throw new TimeoutException("quorum write failed: " + failures.get() + " errors");
    }

    public Versioned<V> get(String key) throws Exception {
        BlockingQueue<Versioned<V>> responses = new LinkedBlockingQueue<>();
        for (Replica<V> r : replicas) {
            pool.submit(() -> {
                try { Versioned<V> v = r.read(key); if (v != null) responses.offer(v); }
                catch (Exception ignored) {}
            });
        }
        List<Versioned<V>> gathered = new ArrayList<>(R);
        for (int i = 0; i < R; i++) {
            Versioned<V> v = responses.poll(200, TimeUnit.MILLISECONDS);
            if (v == null) throw new TimeoutException("quorum read incomplete");
            gathered.add(v);
        }
        // Last-write-wins reconciliation; swap for vector-clock merge when needed.
        Versioned<V> winner = gathered.stream()
                .max(Comparator.comparingLong(Versioned::timestamp))
                .orElseThrow();
        triggerReadRepair(key, gathered, winner);
        return winner;
    }

    private void triggerReadRepair(String key, List<Versioned<V>> seen, Versioned<V> winner) {
        for (int i = 0; i < seen.size(); i++) {
            if (seen.get(i).timestamp() < winner.timestamp()) {
                final int idx = i;
                pool.submit(() -> replicas.get(idx).write(key, winner));
            }
        }
    }

    public record Versioned<V>(V value, long timestamp) {}
}

Complexity

Key design decisions & trade-offs

Common pitfalls

Interview follow-ups

Recommended reading

Related