Consistent Hashing for a Distributed KV Cluster System Design Interview Question
Problem: Design a distributed KV cluster that spreads keys across N shards, routes requests to the owning shard in O(1), replicates for durability, and rebalances with minimal key movement when nodes are added or removed.
Overview
Consistent hashing is the algorithm that lets a distributed cache, KV store, or load balancer add and remove nodes without reshuffling nearly every key in the cluster. Introduced by Karger et al. in 1997 for the Akamai CDN, it powers Amazon Dynamo and DynamoDB, Apache Cassandra, Riak, Couchbase, Discord's message fan-out, Cloudflare's edge routing, Google's Maglev L4 load balancer, and Memcached client libraries like libketama. The core insight: map both nodes and keys onto the same circular hash space, then assign each key to the first node encountered walking clockwise from the key's position. When a node joins or leaves, only the arc between it and its predecessor migrates — O(K/N) keys instead of O(K) — so a cache that held 1 TB across four nodes loses only ~200 GB of freshness when a fifth joins. The algorithm shows up in interviews framed either directly ("design consistent hashing") or indirectly as a building block inside cache clusters, KV stores, and sharded databases. Understanding virtual nodes — the fix for uneven load distribution when N is small — is the detail that separates candidates who memorized the picture from candidates who actually know why it works.
Summary
A coordinator-fronted KV cluster built on a consistent-hash ring with virtual nodes (vnodes) — the technique Alex Xu's chapter converges on after surveying the rehashing problem and naive modulo-N hashing. The chapter's canonical hash function is SHA-1 over a 2^160 ring (real systems like Dynamo/Cassandra use MD5, MurmurHash, or similar truncated to 128 or 32 bits — the ring shape is identical). Each physical shard is mapped onto the ring at multiple vnode positions so ownership is uniformly distributed even with small cluster sizes; the book cites an experiment showing std-dev drops to ~10% at 100 vnodes and ~5% at 200. Keys are hashed into the ring, and each key is owned by the first shard clockwise (plus N-1 more for replication). When a shard joins or leaves, only keys on the arc between it and its predecessor move — k/n keys on average per Karger et al., not ~k as with modulo hashing. The chapter's wrap-up calls out consistent hashing as the technique used in Dynamo, Cassandra, Discord, Akamai CDN, and Google's Maglev load balancer. Dominant tradeoff: vnode count — too few and load skews; too many and ring maintenance + gossip state grows linearly.
Requirements
Functional
- addNode(nodeId) places a physical node on the ring using V virtual node positions
- removeNode(nodeId) atomically removes all V vnodes belonging to that node
- getNode(key) returns the owning physical node in O(log V*N) via binary search on a sorted ring
- getReplicas(key, n) returns the next N distinct physical nodes clockwise for replication
- Ring membership is observable (size, node list, vnode positions) for rebalance tooling
- Node joins/leaves report the exact set of keys that need to migrate to/from each peer
Non-functional
- Lookup latency sub-microsecond for rings up to ~10,000 vnodes
- Standard deviation of per-node load under 10% with >=100 vnodes per physical node
- Membership changes move only K/N keys on average, never more than 2*K/N in the worst case
- Ring state is small (~50 KB for 600 vnodes) so it propagates via gossip in seconds
- Thread-safe reads concurrent with rare writes (node add/remove) using copy-on-write or read-write locks
- Deterministic placement: same (nodeId, vnodeIndex) always hashes to the same ring position across processes
Capacity Assumptions
- 4 physical shards, replication factor N=3, W=2 acks on write (Dynamo-style quorum extension; the book itself does not discuss N/W/R but the ring is the same)
- 150 virtual nodes per physical shard (600 vnodes on the ring) — in the band the book's referenced experiment identifies as giving ~5-10% std dev of load
- Hash space: chapter's canonical example is SHA-1 (2^160); this design uses MurmurHash3 truncated to 2^32 for cheaper lookup. Shape is identical either way.
- Expected dataset ~1B keys at ~1 KB each (~1 TB total, ~750 GB per shard with RF=3)
- Read:write ratio 5:1, peak 50K ops/sec
Back-of-Envelope Estimates
- Per-shard ownership with 150 vnodes: std-dev of load ~5-10% (chapter reference [2] experiment: 100 vnodes → 10%, 200 vnodes → 5%)
- Adding a 5th shard rebalances ~1/5 of the keyspace ≈ 200 GB of data movement (vs ~100% with naive modulo hashing — the chapter's 'cache miss storm' scenario)
- Rebalance throughput budget ≈ 100 MB/s per shard → ~35 min to drain a neighbour's share
- Coordinator routing cost: 1 hash + 1 binary search on sorted ring (~600 entries) ≈ sub-microsecond
- Gossip/membership state: 600 vnodes * ~80 B/entry ≈ 50 KB per node — trivial to replicate via gossip
High-level architecture
A consistent-hash ring is conceptually a sorted map from hash positions to physical nodes. The implementation lives inside a coordinator service or an embedded client library: when a physical node joins, the ring manager computes V virtual positions by hashing strings like nodeId + '#' + i for i in [0, V) through a fast non-cryptographic hash (MurmurHash3 or xxHash truncated to 32 bits) and inserts them into a java.util.TreeMap<Integer, String> keyed by position, with the physical nodeId as value. Lookup hashes the key, calls TreeMap.ceilingEntry(hash) to find the first vnode at or clockwise from the key, falling back to firstEntry() to handle wrap-around. Because a TreeMap uses a red-black tree, lookup is O(log M) where M = V*N — for N=4 shards and V=150 that's ~10 comparisons, under a microsecond on a modern CPU. Replication simply walks clockwise past already-chosen physical nodes until N distinct ones are collected. The ring sits behind a coordinator that fronts the KV cluster (or embedded in smart clients): clients send a PUT/GET to any coordinator, the coordinator consults the ring, forwards to the owner, and waits for a W-of-N quorum. When a new shard joins, the coordinator consults both the old and new ring to identify the key arc that moved, kicks off a streaming rebalance at ~100 MB/s per shard, and publishes the updated ring via gossip — O(log N) rounds to converge. V (virtual nodes per shard) is the critical tuning knob: V=1 gives wildly uneven load, V=150 gives ~5-10% std dev, V=1000 gives ~3% but inflates gossip state to ~80 KB per node. The book's referenced experiment shows the knee of the curve at V in [100, 200], which is what production Dynamo and Cassandra deployments use.
Architecture Components (8)
- Client SDK (client) — Thin client library that sends PUT/GET to a coordinator and retries on transient failures.
- Coordinator (coordinator) — Stateless request router: hashes the key, looks up the owning shards on the ring, fans out the request to replicas.
- Consistent-Hash Ring (coordinator) — The chapter's core data structure: a circular hash space (canonically SHA-1 over 2^160, here truncated to 2^32) populated with vnode positions for each physical shard. Owns the placement function used by every read and write.
- Shard A (Storage Node) (kv) — LSM-backed storage node owning ~150 vnodes' worth of the ring (RocksDB under the hood).
- Shard B (Storage Node) (kv) — Peer of Shard A; owns a different set of 150 vnodes on the ring.
- Shard C (Storage Node) (kv) — Peer of shards A and B; owns a different set of 150 vnodes.
- Shard D (Storage Node) (kv) — Fourth peer shard. Present so the cluster can lose one shard and still have N=3 replicas of most keys.
- Background Rebalancer (worker) — Long-running job that moves keys between shards when the ring topology changes (join, leave, weight change).
Operations Walked Through (3)
- put — Client writes `user:42 → {...}`. Coordinator hashes the key, looks up the N=3 preference list on the ring, writes to all 3 in parallel, and acks the client once W=2 replicas have committed.
- get — Client reads `user:42`. Coordinator hashes, picks N shards from the ring, and requests from the nearest (lowest-latency) replica first; falls back to others for R=2 quorum.
- rebalance — A fifth shard E joins the cluster. The ring adds 150 new vnodes for E. The rebalancer computes which token ranges now belong to E and streams only those keys — roughly 1/5 of the total keyspace, not 100%.
Implementation
package com.example.ring;
import java.nio.charset.StandardCharsets;
import java.util.*;
import java.util.concurrent.locks.ReentrantReadWriteLock;
/**
* Consistent hash ring with virtual nodes. Thread-safe: many concurrent reads,
* exclusive writes on addNode / removeNode.
*/
public class ConsistentHashRing {
private final int vnodesPerNode;
private final HashFunction hash;
private final TreeMap<Integer, String> ring = new TreeMap<>();
private final Map<String, List<Integer>> nodeToPositions = new HashMap<>();
private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
public ConsistentHashRing(int vnodesPerNode, HashFunction hash) {
if (vnodesPerNode <= 0) throw new IllegalArgumentException("vnodesPerNode must be > 0");
this.vnodesPerNode = vnodesPerNode;
this.hash = Objects.requireNonNull(hash);
}
public void addNode(String nodeId) {
Objects.requireNonNull(nodeId);
lock.writeLock().lock();
try {
if (nodeToPositions.containsKey(nodeId)) return;
List<Integer> positions = new ArrayList<>(vnodesPerNode);
for (int i = 0; i < vnodesPerNode; i++) {
int pos = hash.hash32((nodeId + "#" + i).getBytes(StandardCharsets.UTF_8));
// Handle rare collisions by probing forward.
while (ring.containsKey(pos)) pos++;
ring.put(pos, nodeId);
positions.add(pos);
}
nodeToPositions.put(nodeId, positions);
} finally {
lock.writeLock().unlock();
}
}
public void removeNode(String nodeId) {
lock.writeLock().lock();
try {
List<Integer> positions = nodeToPositions.remove(nodeId);
if (positions == null) return;
for (int pos : positions) ring.remove(pos);
} finally {
lock.writeLock().unlock();
}
}
public String getNode(String key) {
lock.readLock().lock();
try {
if (ring.isEmpty()) throw new IllegalStateException("ring is empty");
int h = hash.hash32(key.getBytes(StandardCharsets.UTF_8));
Map.Entry<Integer, String> e = ring.ceilingEntry(h);
if (e == null) e = ring.firstEntry();
return e.getValue();
} finally {
lock.readLock().unlock();
}
}
public List<String> getReplicas(String key, int n) {
if (n <= 0) throw new IllegalArgumentException("n must be > 0");
lock.readLock().lock();
try {
if (ring.size() < vnodesPerNode) return Collections.emptyList();
int h = hash.hash32(key.getBytes(StandardCharsets.UTF_8));
List<String> out = new ArrayList<>(n);
Set<String> seen = new HashSet<>();
NavigableMap<Integer, String> tail = ring.tailMap(h, true);
Iterator<Map.Entry<Integer, String>> it = tail.entrySet().iterator();
while (out.size() < n) {
if (!it.hasNext()) it = ring.entrySet().iterator();
Map.Entry<Integer, String> e = it.next();
if (seen.add(e.getValue())) out.add(e.getValue());
if (seen.size() == nodeToPositions.size()) break;
}
return out;
} finally {
lock.readLock().unlock();
}
}
public int size() {
lock.readLock().lock();
try { return nodeToPositions.size(); } finally { lock.readLock().unlock(); }
}
}
package com.example.ring;
/**
* MurmurHash3 x86_32 — fast, non-cryptographic, well-distributed.
* Reference: https://github.com/aappleby/smhasher
*/
public final class MurmurHash3 implements HashFunction {
private static final int SEED = 0x9747b28c;
private static final int C1 = 0xcc9e2d51;
private static final int C2 = 0x1b873593;
@Override
public int hash32(byte[] data) {
int h1 = SEED;
int len = data.length;
int roundedEnd = len & 0xfffffffc;
for (int i = 0; i < roundedEnd; i += 4) {
int k1 = (data[i] & 0xff)
| ((data[i + 1] & 0xff) << 8)
| ((data[i + 2] & 0xff) << 16)
| (data[i + 3] << 24);
k1 *= C1;
k1 = Integer.rotateLeft(k1, 15);
k1 *= C2;
h1 ^= k1;
h1 = Integer.rotateLeft(h1, 13);
h1 = h1 * 5 + 0xe6546b64;
}
int k1 = 0;
switch (len & 0x03) {
case 3: k1 = (data[roundedEnd + 2] & 0xff) << 16;
case 2: k1 |= (data[roundedEnd + 1] & 0xff) << 8;
case 1: k1 |= (data[roundedEnd] & 0xff);
k1 *= C1;
k1 = Integer.rotateLeft(k1, 15);
k1 *= C2;
h1 ^= k1;
}
h1 ^= len;
h1 ^= h1 >>> 16;
h1 *= 0x85ebca6b;
h1 ^= h1 >>> 13;
h1 *= 0xc2b2ae35;
h1 ^= h1 >>> 16;
return h1;
}
}
interface HashFunction {
int hash32(byte[] data);
}
package com.example.ring;
import java.util.*;
/**
* Given an old ring and a new ring, compute which (sourceNode -> destNode) key ranges
* need to move. Used by the coordinator to stream data during a cluster expansion.
*/
public class RebalancePlanner {
public static final class Move {
public final String sourceNode;
public final String destNode;
public final int startHashInclusive;
public final int endHashInclusive;
public Move(String source, String dest, int start, int end) {
this.sourceNode = source;
this.destNode = dest;
this.startHashInclusive = start;
this.endHashInclusive = end;
}
@Override
public String toString() {
return String.format("Move[%s -> %s: [%d, %d]]", sourceNode, destNode, startHashInclusive, endHashInclusive);
}
}
public List<Move> plan(NavigableMap<Integer, String> oldRing, NavigableMap<Integer, String> newRing) {
if (oldRing.isEmpty()) return Collections.emptyList();
List<Move> moves = new ArrayList<>();
int prev = oldRing.lastKey();
for (Map.Entry<Integer, String> e : oldRing.entrySet()) {
int hash = e.getKey();
String oldOwner = e.getValue();
String newOwner = lookup(newRing, hash);
if (!oldOwner.equals(newOwner)) {
int rangeStart = (prev + 1);
moves.add(new Move(oldOwner, newOwner, rangeStart, hash));
}
prev = hash;
}
return moves;
}
private static String lookup(NavigableMap<Integer, String> ring, int hash) {
Map.Entry<Integer, String> e = ring.ceilingEntry(hash);
if (e == null) e = ring.firstEntry();
return e.getValue();
}
}
Key design decisions & trade-offs
- Number of virtual nodes per physical shard — Chosen: 150 vnodes. Below 100 the load std-dev exceeds 15%; above 500 the gossip/ring state grows without noticeably better balance.
- Hash function: SHA-1 vs MurmurHash3 vs xxHash — Chosen: MurmurHash3 32-bit. Cryptographic strength is irrelevant for routing, and a faster hash shrinks lookup latency to sub-microsecond.
- Modulo-N hashing vs consistent hashing — Chosen: Consistent hashing. Modulo-N moves ~(N-1)/N of all keys when N changes, causing a cache-miss storm; consistent hashing moves only 1/N.
- Ring stored centrally vs embedded in smart clients — Chosen: Embedded in clients (with coordinator fallback). Embedding skips a network hop per request at the cost of shipping ring updates via gossip — worthwhile above 10K QPS per client.
- Virtual nodes vs bounded-load consistent hashing — Chosen: Virtual nodes. Bounded-load (Google's 2017 paper) gives tighter guarantees but requires per-node counters and extra coordination — overkill for most use cases.
- Copy-on-write ring vs read-write lock on mutation — Chosen: ReadWriteLock. Node joins are rare (minutes apart); a ReadWriteLock gives concurrent lookups without paying the GC cost of copying the TreeMap on every change.
Interview follow-ups
- Walk me through what happens to in-flight reads and writes during the 35-minute window where a new shard is streaming its range.
- A node's disk fills up and it becomes unresponsive — how does the ring react, and how do you avoid a cascading hot-spot on its clockwise neighbour?
- How would you extend this ring to support weighted nodes where one shard is a beefier box with 2x the capacity of the others?
- Your ring has 4 shards with 150 vnodes each. Load is still 20% skewed in production — what would you measure first and what could cause this?
- Design an idempotent protocol for the key-range migration so that a coordinator crash mid-rebalance can resume without double-writing or losing data.
- Compare this design to rendezvous (HRW) hashing — where does HRW win, and what would make you pick it over a ring?