B-Tree

By Rahul Kumar · Senior Software Engineer · Updated May 2026 · Category: Petrov · Storage Engines (Part I)

Insert, search, delete with animated splits. Ch 2 + 4.

This interactive explanation is built for system design interview prep: step through B-Tree, watch the internal state change, and connect the concept to real distributed-system trade-offs.

Overview

The B-tree is the workhorse index of every major relational database — Postgres, MySQL's InnoDB, Oracle, SQL Server, SQLite — and understanding it is non-negotiable if you want to reason about query plans, locking, or why a particular DDL change locks up production. Unlike the binary search trees taught in algorithms class, a B-tree is a wide, shallow, balanced tree where each node holds hundreds of keys so that the height stays under five even for billion-row tables. Every node corresponds to a fixed-size disk page (usually 8 KiB or 16 KiB), so a point lookup is four or five page I/Os worst-case, and a range scan is a linked-list walk across leaves. This page covers the classical B+ tree: insert with split-on-overflow cascading up the root, search via per-node binary search, and the structural invariants that guarantee logarithmic height no matter the insertion order.

B-Tree — Interactive Simulator

Runs fully client-side in your browser; no sign-up. Or open full screen →

Launch the interactive B-Tree widget — step through the algorithm or protocol and observe the internal state updating in real time.

How it works

A B+ tree of order m keeps between ceil(m/2) and m keys per node and routes based on separator keys. SEARCH(key) enters the root, binary-searches its keys to pick a child pointer, and repeats until it hits a leaf; there the same binary search yields either the tuple id or a not-found. INSERT(key, tid) first walks the same path to the correct leaf; if the leaf has room the key is simply inserted in sorted order (state LEAF_INSERT). If the leaf is full the tree enters SPLIT: the leaf's keys are partitioned into two new leaves, a separator (the smallest key of the right half) is promoted to the parent, and sibling leaf pointers are rewired so range scans still work. That promotion can itself overflow the parent, triggering the same split one level up — the CASCADE state — and in the worst case it propagates all the way to the root, at which point a new root is allocated and the tree grows one level taller (the only way a B-tree's height ever changes). Deletes mirror this with merge/redistribute, though production engines often skip merging and just tombstone + rebuild. Because every path from root to leaf has the same length, lookups and inserts are both O(log_m n), and with m around 200 even a trillion-row index is at most five levels deep.

Implementation

BTreeNode.java

package com.example.btree;

import java.util.Arrays;

public class BTreeNode {
    static final int ORDER = 4;              // max children; keys <= ORDER-1
    int numKeys;
    final int[] keys = new int[ORDER];       // one slot of slack for easier split math
    final BTreeNode[] children = new BTreeNode[ORDER + 1];
    boolean leaf;

    BTreeNode(boolean leaf) { this.leaf = leaf; }

    /** Binary search for key; returns the child index to descend into. */
    int childIndex(int key) {
        int lo = 0, hi = numKeys;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (keys[mid] <= key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    /** True if key exists in the subtree rooted here. */
    boolean search(int key) {
        int i = 0;
        while (i < numKeys && keys[i] < key) i++;
        if (i < numKeys && keys[i] == key) return true;
        return !leaf && children[i].search(key);
    }

    boolean isFull() { return numKeys == ORDER - 1; }
}

BTree.java

package com.example.btree;

public class BTree {
    private BTreeNode root = new BTreeNode(true);

    public boolean contains(int key) { return root.search(key); }

    public void insert(int key) {
        if (root.isFull()) {
            // root split -> tree grows one level
            BTreeNode newRoot = new BTreeNode(false);
            newRoot.children[0] = root;
            splitChild(newRoot, 0);
            root = newRoot;
        }
        insertNonFull(root, key);
    }

    private void insertNonFull(BTreeNode n, int key) {
        int i = n.numKeys - 1;
        if (n.leaf) {
            while (i >= 0 && n.keys[i] > key) { n.keys[i + 1] = n.keys[i]; i--; }
            n.keys[i + 1] = key;
            n.numKeys++;
            return;
        }
        while (i >= 0 && n.keys[i] > key) i--;
        i++;
        if (n.children[i].isFull()) {
            splitChild(n, i);
            if (n.keys[i] < key) i++;
        }
        insertNonFull(n.children[i], key);
    }

    private void splitChild(BTreeNode parent, int idx) {
        BTreeNode full = parent.children[idx];
        BTreeNode right = new BTreeNode(full.leaf);
        int mid = (BTreeNode.ORDER - 1) / 2;
        right.numKeys = full.numKeys - mid - 1;
        System.arraycopy(full.keys, mid + 1, right.keys, 0, right.numKeys);
        if (!full.leaf) {
            System.arraycopy(full.children, mid + 1, right.children, 0, right.numKeys + 1);
        }
        // shift parent to make room for the promoted separator
        for (int j = parent.numKeys; j > idx; j--) {
            parent.keys[j] = parent.keys[j - 1];
            parent.children[j + 1] = parent.children[j];
        }
        parent.keys[idx] = full.keys[mid];
        parent.children[idx + 1] = right;
        parent.numKeys++;
        full.numKeys = mid;
    }
}

Complexity

Search: O(log_m n) page I/Os
Insert: O(log_m n) with amortized O(1) splits
RangeScan: O(log_m n + k) via linked leaves
Space: O(n), typically 50-100% full per page
Height: ceil(log_m(n)) — ~4-5 levels for billion-key index with m=200

Key design decisions & trade-offs

Fanout (order m) — Chosen: Fit one node per disk/SSD page (8-16 KiB) for ~100-500 keys per node. Bigger m means shallower tree and fewer I/Os per lookup, but also more CPU per node for binary search and more wasted space on underfilled nodes. Page-sized nodes are the sweet spot because a page is the unit of I/O anyway.
B-tree vs B+ tree — Chosen: B+ tree: values only in leaves, internal nodes hold separators only. Separator-only internal nodes have much higher fanout, making the tree shallower, and leaves linked as a doubly linked list give O(k) range scans which are critical for SQL.
Split policy — Chosen: 50/50 split by default, right-biased for monotonic keys. 50/50 keeps fill factor healthy under random inserts, but a right-biased split for append-heavy workloads (like timestamps) avoids leaving every page half-empty after an insert burst.
Concurrency control — Chosen: Latch-coupling (crabbing) or Blink-trees with right-links. Holding latches down the path serializes writers but keeps the tree consistent under concurrent splits; Blink-trees add right-links so readers can follow in-progress splits without blocking writers.

Common pitfalls

Forgetting to rewire sibling leaf pointers during a split — range scans silently skip keys.
Using a narrow node (m=4 like the example) in production; real engines run m in the hundreds because disk seek, not comparison, dominates.
Naively deleting without merge/redistribute; underfilled pages compound until every page is half empty, doubling the tree size.
Choosing a random UUID as the primary key on a clustered B-tree index — every insert hits a different leaf, destroying the buffer cache hit rate.
Assuming concurrent readers and writers just work; without latch-coupling or Blink-style right-links a reader can follow a stale pointer through an in-progress split.

Interview follow-ups

Walk me through what happens when a split cascades all the way to the root.
Why do real B+ trees prefer to split 50/50 for random inserts but right-bias for monotonic keys?
What is a Blink-tree and what problem does the right-link solve under concurrency?
How does a clustered index differ from a secondary index, and why does the choice of primary key matter so much?
How would you change the algorithm for an SSD-only engine where random reads are cheap but writes are expensive?