← System Design Simulator

Copy-on-Write B-Tree

By Rahul Kumar · Senior Software Engineer · Updated · Category: Petrov · Storage Engines (Part I)

LMDB-style path cloning, snapshots, GC. Ch 6.

This interactive explanation is built for system design interview prep: step through Copy-on-Write B-Tree, watch the internal state change, and connect the concept to real distributed-system trade-offs.

Overview

A Copy-on-Write B-tree (also called an append-only or shadowing B-tree) is a B-tree variant that never modifies an existing page. Instead, every write copies the path from root to leaf, mutates the copies, and atomically swings a single root pointer to publish the new version. LMDB, Btrfs, CouchDB, OpenZFS metadata, and Datomic all use this trick, and it is the backbone of every MVCC storage engine that wants snapshot isolation without a separate undo log. The price is write amplification — a single key update rewrites O(height) pages — but the reward is enormous: readers never block writers and never need locks because they read a consistent snapshot rooted at whatever tree-root pointer they captured when the transaction began. There is no write-ahead log in the classical sense either: the atomic root swap IS the commit, so crash recovery reduces to 'find the last valid root'.

Copy-on-Write B-Tree — Interactive Simulator

Runs fully client-side in your browser; no sign-up. Or open full screen →

Launch the interactive Copy-on-Write B-Tree widget — step through the algorithm or protocol and observe the internal state updating in real time.

How it works

The invariant is that tree nodes are immutable once written. A read transaction captures the current root pointer at its start; any subsequent write is invisible to it because it reads through that stable root. WRITE(key, val) walks from the current root to the target leaf and records the path. It then clones the leaf, applies the mutation (enters state LEAF_DIRTY), allocates a new page id for the clone, and walks back up the recorded path: at each level it clones the parent, rewrites the child pointer that used to reference the old child so it now references the clone, and repeats. This is the SHADOW_PATH phase — O(height) new pages, all of which are written to freshly allocated locations on disk. COMMIT atomically writes a new superblock whose root pointer points to the newly created root; until that write lands on disk, the tree effectively does not exist yet. After commit the old path's pages are unreachable from the new root but may still be reachable from older reader snapshots, so reclamation is driven by refcounts: each snapshot holds an implicit reference to every page reachable from its root, and the GC frees pages once no live snapshot references them. Crash recovery is trivial: on open, pick whichever superblock (there are usually two, written alternately) has the higher generation number and passes a checksum, discard any half-written pages (they are unreachable by construction), and resume.

Implementation

CowBTree.java
package com.example.cowbtree;

import java.util.concurrent.atomic.AtomicReference;
import java.util.*;

public class CowBTree<K extends Comparable<K>, V> {
    static final int M = 32;

    static final class Node<K, V> {
        final boolean leaf;
        final Object[] keys;              // K[] in practice
        final Object[] vals;              // V[] for leaf, Node<K,V>[] for internal
        final int n;
        Node(boolean leaf, Object[] keys, Object[] vals, int n) {
            this.leaf = leaf; this.keys = keys; this.vals = vals; this.n = n;
        }
    }

    // The atomic root swap IS the commit.
    private final AtomicReference<Node<K, V>> root =
        new AtomicReference<>(new Node<>(true, new Object[M], new Object[M], 0));

    /** A read-only view pinned to the current root; never observes later writes. */
    public Snapshot<K, V> snapshot() { return new Snapshot<>(root.get()); }

    /** Single-writer insert. Clones the path root-to-leaf, then CAS-swings root. */
    public void put(K key, V val) {
        Node<K, V> old = root.get();
        Node<K, V> fresh = copyOnWritePut(old, key, val);
        root.set(fresh);                   // in a real impl: write superblock + fsync
    }

    @SuppressWarnings("unchecked")
    private Node<K, V> copyOnWritePut(Node<K, V> n, K key, V val) {
        if (n.leaf) {
            Object[] k = Arrays.copyOf(n.keys, M);
            Object[] v = Arrays.copyOf(n.vals, M);
            int i = 0;
            while (i < n.n && ((K) k[i]).compareTo(key) < 0) i++;
            if (i < n.n && key.equals(k[i])) {
                v[i] = val;
                return new Node<>(true, k, v, n.n);
            }
            System.arraycopy(k, i, k, i + 1, n.n - i);
            System.arraycopy(v, i, v, i + 1, n.n - i);
            k[i] = key; v[i] = val;
            return new Node<>(true, k, v, n.n + 1);
        }
        int i = 0;
        while (i < n.n && ((K) n.keys[i]).compareTo(key) <= 0) i++;
        Node<K, V> childOld = (Node<K, V>) n.vals[i];
        Node<K, V> childNew = copyOnWritePut(childOld, key, val);
        Object[] vals = Arrays.copyOf(n.vals, M);
        vals[i] = childNew;                 // only the pointer changes
        return new Node<>(false, Arrays.copyOf(n.keys, M), vals, n.n);
    }
}
Snapshot.java
package com.example.cowbtree;

import java.util.*;

/**
 * An immutable view rooted at the tree root captured at snapshot() time.
 * Because no page reachable from this root is ever mutated, readers need
 * no locks and cannot observe torn state.
 */
public final class Snapshot<K extends Comparable<K>, V> implements AutoCloseable {
    private final CowBTree.Node<K, V> root;

    Snapshot(CowBTree.Node<K, V> root) { this.root = root; }

    @SuppressWarnings("unchecked")
    public V get(K key) {
        CowBTree.Node<K, V> n = root;
        while (!n.leaf) {
            int i = 0;
            while (i < n.n && ((K) n.keys[i]).compareTo(key) <= 0) i++;
            n = (CowBTree.Node<K, V>) n.vals[i];
        }
        for (int i = 0; i < n.n; i++) {
            if (key.equals(n.keys[i])) return (V) n.vals[i];
        }
        return null;
    }

    /** In a real impl, close() decrements refcounts so stale pages can be freed. */
    @Override public void close() { /* release hold on root */ }
}
PageRefcountGC.java
package com.example.cowbtree;

import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Sketch of a page-level refcount GC used by real CoW engines (LMDB style).
 * Each live snapshot implicitly pins every page reachable from its root;
 * a page is freed exactly when its refcount hits zero.
 */
public class PageRefcountGC {
    private final Map<Long, Integer> refs = new ConcurrentHashMap<>();
    private final Deque<Long> freeList = new ArrayDeque<>();

    public void retainReachable(long pageId) {
        refs.merge(pageId, 1, Integer::sum);
    }

    /** Called when a snapshot is closed; walks the root and decrements refs. */
    public synchronized void releaseReachable(long pageId) {
        Integer v = refs.get(pageId);
        if (v == null) return;
        if (v == 1) { refs.remove(pageId); freeList.push(pageId); }
        else refs.put(pageId, v - 1);
    }

    /** Allocate prefers freed pages so the file does not grow unboundedly. */
    public synchronized long allocate(long nextFreshId) {
        return freeList.isEmpty() ? nextFreshId : freeList.pop();
    }
}

Complexity

Key design decisions & trade-offs

Common pitfalls

Interview follow-ups

Recommended reading

Related