← System Design Simulator

Consistent Hashing for a Distributed KV Cluster System Design Interview Question

By Rahul Kumar · Senior Software Engineer · Updated · 8 components · 3 operations ·Source: Alex Xu, System Design Interview Vol 1, Chapter 5; Karger et al. 1997 'Consistent Hashing and Random Trees'; Amazon Dynamo paper (DeCandia et al. 2007)

Problem: Design a distributed KV cluster that spreads keys across N shards, routes requests to the owning shard in O(1), replicates for durability, and rebalances with minimal key movement when nodes are added or removed.

Overview

Consistent hashing is the algorithm that lets a distributed cache, KV store, or load balancer add and remove nodes without reshuffling nearly every key in the cluster. Introduced by Karger et al. in 1997 for the Akamai CDN, it powers Amazon Dynamo and DynamoDB, Apache Cassandra, Riak, Couchbase, Discord's message fan-out, Cloudflare's edge routing, Google's Maglev L4 load balancer, and Memcached client libraries like libketama. The core insight: map both nodes and keys onto the same circular hash space, then assign each key to the first node encountered walking clockwise from the key's position. When a node joins or leaves, only the arc between it and its predecessor migrates — O(K/N) keys instead of O(K) — so a cache that held 1 TB across four nodes loses only ~200 GB of freshness when a fifth joins. The algorithm shows up in interviews framed either directly ("design consistent hashing") or indirectly as a building block inside cache clusters, KV stores, and sharded databases. Understanding virtual nodes — the fix for uneven load distribution when N is small — is the detail that separates candidates who memorized the picture from candidates who actually know why it works.

Consistent Hashing for a Distributed KV Cluster — Interactive Simulator

Runs fully client-side in your browser; no sign-up. Or open full screen →

Launch the interactive walkthrough for Consistent Hashing for a Distributed KV Cluster — animated architecture diagram, step-by-step flow with real payloads, component swap, and a discrete-event stress simulator.

Summary

A coordinator-fronted KV cluster built on a consistent-hash ring with virtual nodes (vnodes) — the technique Alex Xu's chapter converges on after surveying the rehashing problem and naive modulo-N hashing. The chapter's canonical hash function is SHA-1 over a 2^160 ring (real systems like Dynamo/Cassandra use MD5, MurmurHash, or similar truncated to 128 or 32 bits — the ring shape is identical). Each physical shard is mapped onto the ring at multiple vnode positions so ownership is uniformly distributed even with small cluster sizes; the book cites an experiment showing std-dev drops to ~10% at 100 vnodes and ~5% at 200. Keys are hashed into the ring, and each key is owned by the first shard clockwise (plus N-1 more for replication). When a shard joins or leaves, only keys on the arc between it and its predecessor move — k/n keys on average per Karger et al., not ~k as with modulo hashing. The chapter's wrap-up calls out consistent hashing as the technique used in Dynamo, Cassandra, Discord, Akamai CDN, and Google's Maglev load balancer. Dominant tradeoff: vnode count — too few and load skews; too many and ring maintenance + gossip state grows linearly.

Requirements

Functional

Non-functional

Capacity Assumptions

Back-of-Envelope Estimates

High-level architecture

A consistent-hash ring is conceptually a sorted map from hash positions to physical nodes. The implementation lives inside a coordinator service or an embedded client library: when a physical node joins, the ring manager computes V virtual positions by hashing strings like nodeId + '#' + i for i in [0, V) through a fast non-cryptographic hash (MurmurHash3 or xxHash truncated to 32 bits) and inserts them into a java.util.TreeMap<Integer, String> keyed by position, with the physical nodeId as value. Lookup hashes the key, calls TreeMap.ceilingEntry(hash) to find the first vnode at or clockwise from the key, falling back to firstEntry() to handle wrap-around. Because a TreeMap uses a red-black tree, lookup is O(log M) where M = V*N — for N=4 shards and V=150 that's ~10 comparisons, under a microsecond on a modern CPU. Replication simply walks clockwise past already-chosen physical nodes until N distinct ones are collected. The ring sits behind a coordinator that fronts the KV cluster (or embedded in smart clients): clients send a PUT/GET to any coordinator, the coordinator consults the ring, forwards to the owner, and waits for a W-of-N quorum. When a new shard joins, the coordinator consults both the old and new ring to identify the key arc that moved, kicks off a streaming rebalance at ~100 MB/s per shard, and publishes the updated ring via gossip — O(log N) rounds to converge. V (virtual nodes per shard) is the critical tuning knob: V=1 gives wildly uneven load, V=150 gives ~5-10% std dev, V=1000 gives ~3% but inflates gossip state to ~80 KB per node. The book's referenced experiment shows the knee of the curve at V in [100, 200], which is what production Dynamo and Cassandra deployments use.

Architecture Components (8)

Operations Walked Through (3)

Implementation

ConsistentHashRing.java
package com.example.ring;

import java.nio.charset.StandardCharsets;
import java.util.*;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
 * Consistent hash ring with virtual nodes. Thread-safe: many concurrent reads,
 * exclusive writes on addNode / removeNode.
 */
public class ConsistentHashRing {

    private final int vnodesPerNode;
    private final HashFunction hash;
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private final Map<String, List<Integer>> nodeToPositions = new HashMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    public ConsistentHashRing(int vnodesPerNode, HashFunction hash) {
        if (vnodesPerNode <= 0) throw new IllegalArgumentException("vnodesPerNode must be > 0");
        this.vnodesPerNode = vnodesPerNode;
        this.hash = Objects.requireNonNull(hash);
    }

    public void addNode(String nodeId) {
        Objects.requireNonNull(nodeId);
        lock.writeLock().lock();
        try {
            if (nodeToPositions.containsKey(nodeId)) return;
            List<Integer> positions = new ArrayList<>(vnodesPerNode);
            for (int i = 0; i < vnodesPerNode; i++) {
                int pos = hash.hash32((nodeId + "#" + i).getBytes(StandardCharsets.UTF_8));
                // Handle rare collisions by probing forward.
                while (ring.containsKey(pos)) pos++;
                ring.put(pos, nodeId);
                positions.add(pos);
            }
            nodeToPositions.put(nodeId, positions);
        } finally {
            lock.writeLock().unlock();
        }
    }

    public void removeNode(String nodeId) {
        lock.writeLock().lock();
        try {
            List<Integer> positions = nodeToPositions.remove(nodeId);
            if (positions == null) return;
            for (int pos : positions) ring.remove(pos);
        } finally {
            lock.writeLock().unlock();
        }
    }

    public String getNode(String key) {
        lock.readLock().lock();
        try {
            if (ring.isEmpty()) throw new IllegalStateException("ring is empty");
            int h = hash.hash32(key.getBytes(StandardCharsets.UTF_8));
            Map.Entry<Integer, String> e = ring.ceilingEntry(h);
            if (e == null) e = ring.firstEntry();
            return e.getValue();
        } finally {
            lock.readLock().unlock();
        }
    }

    public List<String> getReplicas(String key, int n) {
        if (n <= 0) throw new IllegalArgumentException("n must be > 0");
        lock.readLock().lock();
        try {
            if (ring.size() < vnodesPerNode) return Collections.emptyList();
            int h = hash.hash32(key.getBytes(StandardCharsets.UTF_8));
            List<String> out = new ArrayList<>(n);
            Set<String> seen = new HashSet<>();
            NavigableMap<Integer, String> tail = ring.tailMap(h, true);
            Iterator<Map.Entry<Integer, String>> it = tail.entrySet().iterator();
            while (out.size() < n) {
                if (!it.hasNext()) it = ring.entrySet().iterator();
                Map.Entry<Integer, String> e = it.next();
                if (seen.add(e.getValue())) out.add(e.getValue());
                if (seen.size() == nodeToPositions.size()) break;
            }
            return out;
        } finally {
            lock.readLock().unlock();
        }
    }

    public int size() {
        lock.readLock().lock();
        try { return nodeToPositions.size(); } finally { lock.readLock().unlock(); }
    }
}
MurmurHash3.java
package com.example.ring;

/**
 * MurmurHash3 x86_32 — fast, non-cryptographic, well-distributed.
 * Reference: https://github.com/aappleby/smhasher
 */
public final class MurmurHash3 implements HashFunction {

    private static final int SEED = 0x9747b28c;
    private static final int C1 = 0xcc9e2d51;
    private static final int C2 = 0x1b873593;

    @Override
    public int hash32(byte[] data) {
        int h1 = SEED;
        int len = data.length;
        int roundedEnd = len & 0xfffffffc;

        for (int i = 0; i < roundedEnd; i += 4) {
            int k1 = (data[i] & 0xff)
                   | ((data[i + 1] & 0xff) << 8)
                   | ((data[i + 2] & 0xff) << 16)
                   | (data[i + 3] << 24);
            k1 *= C1;
            k1 = Integer.rotateLeft(k1, 15);
            k1 *= C2;
            h1 ^= k1;
            h1 = Integer.rotateLeft(h1, 13);
            h1 = h1 * 5 + 0xe6546b64;
        }

        int k1 = 0;
        switch (len & 0x03) {
            case 3: k1 = (data[roundedEnd + 2] & 0xff) << 16;
            case 2: k1 |= (data[roundedEnd + 1] & 0xff) << 8;
            case 1: k1 |= (data[roundedEnd] & 0xff);
                    k1 *= C1;
                    k1 = Integer.rotateLeft(k1, 15);
                    k1 *= C2;
                    h1 ^= k1;
        }

        h1 ^= len;
        h1 ^= h1 >>> 16;
        h1 *= 0x85ebca6b;
        h1 ^= h1 >>> 13;
        h1 *= 0xc2b2ae35;
        h1 ^= h1 >>> 16;
        return h1;
    }
}

interface HashFunction {
    int hash32(byte[] data);
}
RebalancePlanner.java
package com.example.ring;

import java.util.*;

/**
 * Given an old ring and a new ring, compute which (sourceNode -> destNode) key ranges
 * need to move. Used by the coordinator to stream data during a cluster expansion.
 */
public class RebalancePlanner {

    public static final class Move {
        public final String sourceNode;
        public final String destNode;
        public final int startHashInclusive;
        public final int endHashInclusive;

        public Move(String source, String dest, int start, int end) {
            this.sourceNode = source;
            this.destNode = dest;
            this.startHashInclusive = start;
            this.endHashInclusive = end;
        }

        @Override
        public String toString() {
            return String.format("Move[%s -> %s: [%d, %d]]", sourceNode, destNode, startHashInclusive, endHashInclusive);
        }
    }

    public List<Move> plan(NavigableMap<Integer, String> oldRing, NavigableMap<Integer, String> newRing) {
        if (oldRing.isEmpty()) return Collections.emptyList();
        List<Move> moves = new ArrayList<>();
        int prev = oldRing.lastKey();
        for (Map.Entry<Integer, String> e : oldRing.entrySet()) {
            int hash = e.getKey();
            String oldOwner = e.getValue();
            String newOwner = lookup(newRing, hash);
            if (!oldOwner.equals(newOwner)) {
                int rangeStart = (prev + 1);
                moves.add(new Move(oldOwner, newOwner, rangeStart, hash));
            }
            prev = hash;
        }
        return moves;
    }

    private static String lookup(NavigableMap<Integer, String> ring, int hash) {
        Map.Entry<Integer, String> e = ring.ceilingEntry(hash);
        if (e == null) e = ring.firstEntry();
        return e.getValue();
    }
}

Key design decisions & trade-offs

Interview follow-ups

Related