← System Design Simulator

Merkle Anti-Entropy

By Rahul Kumar · Senior Software Engineer · Updated · Category: Petrov · Distributed Systems (Part II)

Compare two replicas' trees, sync divergent keys. Ch 12.

This interactive explanation is built for system design interview prep: step through Merkle Anti-Entropy, watch the internal state change, and connect the concept to real distributed-system trade-offs.

Overview

A Merkle tree is the data structure that lets two replicas agree on which keys differ without shipping the whole dataset over the wire. Each leaf holds the hash of one key's value (or of a small range of keys); each internal node holds the hash of its children concatenated. Two replicas compare trees top-down: if the root hashes match, the datasets are identical and we can stop immediately; if they differ, the mismatch must be in the subtree whose hash disagrees, so we recurse only into that side. The cost of finding K differing keys in an N-key dataset drops from O(N) to roughly O(K log N), which is why every serious eventually-consistent store uses it: Dynamo, Cassandra, Riak, and Bitcoin's block verification all run this same pattern. The primitive is also what makes anti-entropy cheap enough to run continuously in the background instead of as a scheduled full repair.

Merkle Anti-Entropy — Interactive Simulator

Runs fully client-side in your browser; no sign-up. Or open full screen →

Launch the interactive Merkle Anti-Entropy widget — step through the algorithm or protocol and observe the internal state updating in real time.

How it works

Build starts with a sorted or hash-bucketed list of (key, value) pairs. Each leaf stores H(value) or H(key || value) to protect against deliberate collisions; each internal node stores H(left.hash || right.hash). The tree is binary, full, and padded so every level is a power of two, which keeps indexing arithmetic simple. For range-repair protocols, leaves usually cover a fixed key-range rather than a single key, so a tree of depth 20 can cover a million ranges instead of a million individual keys; this caps network traffic during sync. Diffing two trees is a parallel DFS. Start at the roots; if they agree, return the empty set. Otherwise recurse into the pair of children and union their diff sets. At the leaf level, a mismatch is a candidate differing range, and the callers exchange the actual key-value pairs in that range to reconcile. The key correctness invariant is that any change to a leaf propagates all the way up: a single byte flip in one value changes exactly log2(N) hashes on the path to the root, so divergence is always detectable at the root. Updates are incremental: modifying one leaf requires O(log N) rehashes, not a full rebuild. The primitive fails gracefully under adversarial input only if the hash function is collision-resistant, which is why production systems use SHA-256 or BLAKE3 rather than murmur.

Implementation

MerkleTree build and diff
public class MerkleTree {
    private final byte[][] nodes; // level-order flat array
    private final int leafCount;
    private static final MessageDigest DIGEST = newDigest();

    private MerkleTree(byte[][] nodes, int leafCount) {
        this.nodes = nodes; this.leafCount = leafCount;
    }

    public static MerkleTree build(byte[][] leaves) {
        int n = nextPowerOfTwo(Math.max(leaves.length, 1));
        byte[][] padded = new byte[n][];
        for (int i = 0; i < n; i++) {
            padded[i] = i < leaves.length ? leaves[i] : new byte[0];
        }
        // Total nodes in a full binary tree with n leaves = 2n - 1; use 1-indexed layout.
        byte[][] nodes = new byte[2 * n][];
        for (int i = 0; i < n; i++) nodes[n + i] = hash(padded[i]);
        for (int i = n - 1; i >= 1; i--) {
            nodes[i] = hashPair(nodes[2 * i], nodes[2 * i + 1]);
        }
        return new MerkleTree(nodes, n);
    }

    public byte[] rootHash() { return nodes[1]; }

    /** Returns the indices of leaves whose hashes differ from `other`. */
    public Set<Integer> diff(MerkleTree other) {
        if (this.leafCount != other.leafCount)
            throw new IllegalArgumentException("trees must have the same leaf count");
        Set<Integer> out = new HashSet<>();
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(1);
        while (!stack.isEmpty()) {
            int idx = stack.pop();
            if (Arrays.equals(nodes[idx], other.nodes[idx])) continue;
            if (idx >= leafCount) { // leaf level starts at leafCount
                out.add(idx - leafCount);
            } else {
                stack.push(2 * idx);
                stack.push(2 * idx + 1);
            }
        }
        return out;
    }

    private static byte[] hash(byte[] in) {
        synchronized (DIGEST) { DIGEST.reset(); return DIGEST.digest(in); }
    }

    private static byte[] hashPair(byte[] l, byte[] r) {
        synchronized (DIGEST) {
            DIGEST.reset(); DIGEST.update(l); return DIGEST.digest(r);
        }
    }

    private static int nextPowerOfTwo(int x) {
        int p = 1; while (p < x) p <<= 1; return p;
    }

    private static MessageDigest newDigest() {
        try { return MessageDigest.getInstance("SHA-256"); }
        catch (NoSuchAlgorithmException e) { throw new IllegalStateException(e); }
    }
}

Complexity

Key design decisions & trade-offs

Common pitfalls

Interview follow-ups

Recommended reading

Related