Copy-on-Write B-Tree
LMDB-style path cloning, snapshots, GC. Ch 6.
This interactive explanation is built for system design interview prep: step through Copy-on-Write B-Tree, watch the internal state change, and connect the concept to real distributed-system trade-offs.
Overview
A Copy-on-Write B-tree (also called an append-only or shadowing B-tree) is a B-tree variant that never modifies an existing page. Instead, every write copies the path from root to leaf, mutates the copies, and atomically swings a single root pointer to publish the new version. LMDB, Btrfs, CouchDB, OpenZFS metadata, and Datomic all use this trick, and it is the backbone of every MVCC storage engine that wants snapshot isolation without a separate undo log. The price is write amplification — a single key update rewrites O(height) pages — but the reward is enormous: readers never block writers and never need locks because they read a consistent snapshot rooted at whatever tree-root pointer they captured when the transaction began. There is no write-ahead log in the classical sense either: the atomic root swap IS the commit, so crash recovery reduces to 'find the last valid root'.
How it works
The invariant is that tree nodes are immutable once written. A read transaction captures the current root pointer at its start; any subsequent write is invisible to it because it reads through that stable root. WRITE(key, val) walks from the current root to the target leaf and records the path. It then clones the leaf, applies the mutation (enters state LEAF_DIRTY), allocates a new page id for the clone, and walks back up the recorded path: at each level it clones the parent, rewrites the child pointer that used to reference the old child so it now references the clone, and repeats. This is the SHADOW_PATH phase — O(height) new pages, all of which are written to freshly allocated locations on disk. COMMIT atomically writes a new superblock whose root pointer points to the newly created root; until that write lands on disk, the tree effectively does not exist yet. After commit the old path's pages are unreachable from the new root but may still be reachable from older reader snapshots, so reclamation is driven by refcounts: each snapshot holds an implicit reference to every page reachable from its root, and the GC frees pages once no live snapshot references them. Crash recovery is trivial: on open, pick whichever superblock (there are usually two, written alternately) has the higher generation number and passes a checksum, discard any half-written pages (they are unreachable by construction), and resume.
Implementation
package com.example.cowbtree;
import java.util.concurrent.atomic.AtomicReference;
import java.util.*;
public class CowBTree<K extends Comparable<K>, V> {
static final int M = 32;
static final class Node<K, V> {
final boolean leaf;
final Object[] keys; // K[] in practice
final Object[] vals; // V[] for leaf, Node<K,V>[] for internal
final int n;
Node(boolean leaf, Object[] keys, Object[] vals, int n) {
this.leaf = leaf; this.keys = keys; this.vals = vals; this.n = n;
}
}
// The atomic root swap IS the commit.
private final AtomicReference<Node<K, V>> root =
new AtomicReference<>(new Node<>(true, new Object[M], new Object[M], 0));
/** A read-only view pinned to the current root; never observes later writes. */
public Snapshot<K, V> snapshot() { return new Snapshot<>(root.get()); }
/** Single-writer insert. Clones the path root-to-leaf, then CAS-swings root. */
public void put(K key, V val) {
Node<K, V> old = root.get();
Node<K, V> fresh = copyOnWritePut(old, key, val);
root.set(fresh); // in a real impl: write superblock + fsync
}
@SuppressWarnings("unchecked")
private Node<K, V> copyOnWritePut(Node<K, V> n, K key, V val) {
if (n.leaf) {
Object[] k = Arrays.copyOf(n.keys, M);
Object[] v = Arrays.copyOf(n.vals, M);
int i = 0;
while (i < n.n && ((K) k[i]).compareTo(key) < 0) i++;
if (i < n.n && key.equals(k[i])) {
v[i] = val;
return new Node<>(true, k, v, n.n);
}
System.arraycopy(k, i, k, i + 1, n.n - i);
System.arraycopy(v, i, v, i + 1, n.n - i);
k[i] = key; v[i] = val;
return new Node<>(true, k, v, n.n + 1);
}
int i = 0;
while (i < n.n && ((K) n.keys[i]).compareTo(key) <= 0) i++;
Node<K, V> childOld = (Node<K, V>) n.vals[i];
Node<K, V> childNew = copyOnWritePut(childOld, key, val);
Object[] vals = Arrays.copyOf(n.vals, M);
vals[i] = childNew; // only the pointer changes
return new Node<>(false, Arrays.copyOf(n.keys, M), vals, n.n);
}
}
package com.example.cowbtree;
import java.util.*;
/**
* An immutable view rooted at the tree root captured at snapshot() time.
* Because no page reachable from this root is ever mutated, readers need
* no locks and cannot observe torn state.
*/
public final class Snapshot<K extends Comparable<K>, V> implements AutoCloseable {
private final CowBTree.Node<K, V> root;
Snapshot(CowBTree.Node<K, V> root) { this.root = root; }
@SuppressWarnings("unchecked")
public V get(K key) {
CowBTree.Node<K, V> n = root;
while (!n.leaf) {
int i = 0;
while (i < n.n && ((K) n.keys[i]).compareTo(key) <= 0) i++;
n = (CowBTree.Node<K, V>) n.vals[i];
}
for (int i = 0; i < n.n; i++) {
if (key.equals(n.keys[i])) return (V) n.vals[i];
}
return null;
}
/** In a real impl, close() decrements refcounts so stale pages can be freed. */
@Override public void close() { /* release hold on root */ }
}
package com.example.cowbtree;
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
/**
* Sketch of a page-level refcount GC used by real CoW engines (LMDB style).
* Each live snapshot implicitly pins every page reachable from its root;
* a page is freed exactly when its refcount hits zero.
*/
public class PageRefcountGC {
private final Map<Long, Integer> refs = new ConcurrentHashMap<>();
private final Deque<Long> freeList = new ArrayDeque<>();
public void retainReachable(long pageId) {
refs.merge(pageId, 1, Integer::sum);
}
/** Called when a snapshot is closed; walks the root and decrements refs. */
public synchronized void releaseReachable(long pageId) {
Integer v = refs.get(pageId);
if (v == null) return;
if (v == 1) { refs.remove(pageId); freeList.push(pageId); }
else refs.put(pageId, v - 1);
}
/** Allocate prefers freed pages so the file does not grow unboundedly. */
public synchronized long allocate(long nextFreshId) {
return freeList.isEmpty() ? nextFreshId : freeList.pop();
}
}
Complexity
- Read:
O(log n), lock-free, snapshot-consistent - Write:
O(log n) logical, but rewrites O(log n) pages (write amplification = height) - Commit:
O(1) — one atomic superblock write - Space:
O(n) steady state plus one extra path per live snapshot - Recovery:
O(1) — pick the newer valid superblock
Key design decisions & trade-offs
- Write in place vs copy on write — Chosen: CoW: never mutate a page once written. Trades write amplification (entire root-to-leaf path per write) for lock-free readers, free snapshots, and trivial crash recovery — usually a great deal for read-heavy workloads.
- Concurrency model — Chosen: Single writer, many readers. Coordinating multiple concurrent writers against a single evolving root either requires lock-free tree operations or optimistic retries. Most CoW engines (LMDB, Btrfs subvols) serialize writers and rely on the fact that writes are fast.
- Free page reclamation — Chosen: Refcount or epoch-based GC keyed on live snapshots. Naive free-on-commit would rip pages out from under live readers. Refcounts cost bookkeeping but guarantee a snapshot stays consistent for its entire lifetime.
- Commit mechanism — Chosen: Alternating double-buffered superblock with checksum + generation. One superblock can tear during a crash; having two written alternately means at least one is always a valid older version — pick the newer valid one on recovery.
Common pitfalls
- Forgetting to persist the new root atomically — if the superblock write tears, the tree boots into a hybrid that mixes old and new pointers.
- Holding a long-lived snapshot in a write-heavy workload; the free list cannot advance and the data file balloons.
- Treating CoW as free for writes — sustained small-key workloads see 5-10x write amplification and the SSD wears out accordingly.
- Allowing multiple concurrent writers without serialization; the CAS-the-root pattern looks simple but nested splits break without a careful lock-free design.
- Using fsync on the data file instead of a write-barrier ordering (log → data → superblock); the root can become durable before the pages it points to.
Interview follow-ups
- Why does a copy-on-write B-tree give snapshot isolation essentially for free?
- What is write amplification in this scheme and how does LMDB mitigate it for small updates?
- Walk through crash recovery when a superblock write tears mid-way.
- How would you support multiple concurrent writers — optimistic retry, fine-grained locks, or a lock-free tree?
- Compare CoW B-tree write amplification with LSM write amplification; which workload favors which?
Recommended reading
- Alex Petrov, Database Internals — storage engines and distributed systems internals.
- Martin Kleppmann, Designing Data-Intensive Applications (DDIA) — data models, replication, partitioning, consistency.
- The System Design Primer — high-level design building blocks.
- Foundational networking + web-security references (TCP/IP, TLS 1.3, OWASP Top 10).