← System Design Simulator

LSM-Tree

By Rahul Kumar · Senior Software Engineer · Updated · Category: Petrov · Storage Engines (Part I)

Memtable flushes, compaction, Bloom-filter lookups. Ch 7.

This interactive explanation is built for system design interview prep: step through LSM-Tree, watch the internal state change, and connect the concept to real distributed-system trade-offs.

Overview

The Log-Structured Merge tree is the storage engine behind every write-optimized database of the last fifteen years — LevelDB, RocksDB, Cassandra, HBase, ScyllaDB, Bigtable, DynamoDB's internal store — and its dominance in hyperscale systems comes from a single structural bet: turn every random write into a sequential write. A B-tree has to update the exact page that owns a key and pays a random I/O for every key that falls on a cold page; an LSM tree instead buffers writes in an in-memory structure, then flushes them out as sorted, immutable files and merges those files in the background. The consequence is roughly 10-100x higher sustained write throughput at the cost of read amplification (a key might live in any of several files) and background compaction work. Bloom filters, block caches, and tiered or leveled compaction strategies all exist to claw back the read cost while keeping the write advantage.

LSM-Tree — Interactive Simulator

Runs fully client-side in your browser; no sign-up. Or open full screen →

Launch the interactive LSM-Tree widget — step through the algorithm or protocol and observe the internal state updating in real time.

How it works

Writes go to two places at once: a write-ahead log for durability and an in-memory sorted map (the memtable). WRITE(k, v) appends to the WAL, then inserts into the memtable; a commit is one sequential fsync plus an in-memory map update, which is why LSMs eat writes for breakfast. When the memtable crosses a size threshold it transitions to IMMUTABLE (no more inserts), a new memtable takes over live traffic, and a background thread begins FLUSH: walk the frozen memtable in order and write it out as an SSTable — a sorted, immutable file indexed by a sparse block index and filtered by a Bloom filter sized to a target false-positive rate. Now the tree has one memtable and N SSTables at L0. GET(k) proceeds in priority order: (1) active memtable, (2) immutable memtable being flushed, (3) SSTables newest to oldest, skipping any whose Bloom filter says 'definitely not here'. Deletes are tombstones — same flush path, just a marker — and they eventually overwrite older live values during COMPACTION. Compaction picks overlapping SSTables (leveled: merge L_i into L_i+1 until each level is 10x the previous; tiered: merge several same-sized files into one larger file), performs a k-way sorted merge, drops tombstones once nothing beneath references the key, and writes fresh SSTables. This is where read amplification gets paid down and disk space is reclaimed.

Implementation

Memtable.java
package com.example.lsm;

import java.util.concurrent.ConcurrentSkipListMap;

/** In-memory sorted buffer backed by a skip list for ordered iteration. */
public class Memtable {
    public static final byte[] TOMBSTONE = new byte[0];
    private final ConcurrentSkipListMap<byte[], byte[]> map =
        new ConcurrentSkipListMap<>(java.util.Arrays::compare);
    private volatile long bytes = 0;

    public void put(byte[] k, byte[] v) {
        byte[] prev = map.put(k, v);
        bytes += k.length + v.length - (prev == null ? 0 : prev.length);
    }

    public void delete(byte[] k) { put(k, TOMBSTONE); }

    /** null = not present here; TOMBSTONE = deleted (shadows older SSTables). */
    public byte[] get(byte[] k) { return map.get(k); }

    public long sizeBytes() { return bytes; }

    public Iterable<java.util.Map.Entry<byte[], byte[]>> sortedEntries() {
        return map.entrySet();
    }
}
SSTable.java
package com.example.lsm;

import java.io.*;
import java.nio.file.*;
import java.util.*;

public class SSTable {
    private final Path file;
    private final NavigableMap<byte[], Long> index;   // sparse: first key of each block -> offset
    private final BloomFilter bloom;

    private SSTable(Path f, NavigableMap<byte[], Long> idx, BloomFilter b) {
        this.file = f; this.index = idx; this.bloom = b;
    }

    /** Flush a memtable to disk as an SSTable. Returns the opened handle. */
    public static SSTable flush(Path file, Memtable m, int expectedKeys) throws IOException {
        NavigableMap<byte[], Long> idx = new TreeMap<>(Arrays::compare);
        BloomFilter bloom = new BloomFilter(expectedKeys, 0.01);
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            long off = 0;
            int blockCount = 0;
            for (Map.Entry<byte[], byte[]> e : m.sortedEntries()) {
                if (blockCount == 0) idx.put(e.getKey(), off);   // sparse index entry
                out.writeInt(e.getKey().length); out.write(e.getKey());
                out.writeInt(e.getValue().length); out.write(e.getValue());
                bloom.add(e.getKey());
                off += 8 + e.getKey().length + e.getValue().length;
                blockCount = (blockCount + 1) % 64;               // ~block every 64 keys
            }
        }
        return new SSTable(file, idx, bloom);
    }

    public byte[] get(byte[] k) throws IOException {
        if (!bloom.mightContain(k)) return null;                   // definitive miss
        Map.Entry<byte[], Long> floor = index.floorEntry(k);
        if (floor == null) return null;
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            in.skipBytes(floor.getValue().intValue());
            while (in.available() > 0) {
                int kl = in.readInt(); byte[] kb = in.readNBytes(kl);
                int vl = in.readInt(); byte[] vb = in.readNBytes(vl);
                int cmp = Arrays.compare(kb, k);
                if (cmp == 0) return vb;
                if (cmp > 0) return null;                           // past the key
            }
        }
        return null;
    }

    public Path file() { return file; }
}
LsmTree.java
package com.example.lsm;

import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;

public class LsmTree {
    private volatile Memtable active = new Memtable();
    private final Deque<SSTable> sstables = new ConcurrentLinkedDeque<>();   // newest first
    private final Path dir;
    private final long flushThresholdBytes;
    private final ExecutorService bg = Executors.newSingleThreadExecutor();

    public LsmTree(Path dir, long flushThresholdBytes) { this.dir = dir; this.flushThresholdBytes = flushThresholdBytes; }

    public void put(byte[] k, byte[] v) {
        active.put(k, v);
        if (active.sizeBytes() > flushThresholdBytes) rotateAndFlush();
    }

    public byte[] get(byte[] k) throws Exception {
        byte[] v = active.get(k);
        if (v != null) return v == Memtable.TOMBSTONE ? null : v;
        for (SSTable s : sstables) {                          // newest -> oldest
            byte[] r = s.get(k);
            if (r != null) return r.length == 0 ? null : r;   // tombstone shadows
        }
        return null;
    }

    private synchronized void rotateAndFlush() {
        Memtable frozen = active;
        active = new Memtable();
        bg.submit(() -> {
            Path f = dir.resolve("sst-" + System.nanoTime() + ".db");
            SSTable s = SSTable.flush(f, frozen, 100_000);
            sstables.addFirst(s);
        });
    }

    /** Very simplified leveled compaction: k-way merge of the oldest two SSTables. */
    public synchronized void compact() throws Exception {
        if (sstables.size() < 2) return;
        SSTable a = sstables.pollLast();
        SSTable b = sstables.pollLast();
        // walk both in sorted order, drop tombstones that nothing shadows, write new SSTable
        // (omitted: k-way merge implementation)
        Path out = dir.resolve("sst-" + System.nanoTime() + "-merged.db");
        // SSTable merged = SSTable.merge(out, List.of(a, b));
        // sstables.addLast(merged);
        Files.deleteIfExists(a.file()); Files.deleteIfExists(b.file());
    }
}

Complexity

Key design decisions & trade-offs

Common pitfalls

Interview follow-ups

Recommended reading

Related