LSM-Tree
Memtable flushes, compaction, Bloom-filter lookups. Ch 7.
This interactive explanation is built for system design interview prep: step through LSM-Tree, watch the internal state change, and connect the concept to real distributed-system trade-offs.
Overview
The Log-Structured Merge tree is the storage engine behind every write-optimized database of the last fifteen years — LevelDB, RocksDB, Cassandra, HBase, ScyllaDB, Bigtable, DynamoDB's internal store — and its dominance in hyperscale systems comes from a single structural bet: turn every random write into a sequential write. A B-tree has to update the exact page that owns a key and pays a random I/O for every key that falls on a cold page; an LSM tree instead buffers writes in an in-memory structure, then flushes them out as sorted, immutable files and merges those files in the background. The consequence is roughly 10-100x higher sustained write throughput at the cost of read amplification (a key might live in any of several files) and background compaction work. Bloom filters, block caches, and tiered or leveled compaction strategies all exist to claw back the read cost while keeping the write advantage.
How it works
Writes go to two places at once: a write-ahead log for durability and an in-memory sorted map (the memtable). WRITE(k, v) appends to the WAL, then inserts into the memtable; a commit is one sequential fsync plus an in-memory map update, which is why LSMs eat writes for breakfast. When the memtable crosses a size threshold it transitions to IMMUTABLE (no more inserts), a new memtable takes over live traffic, and a background thread begins FLUSH: walk the frozen memtable in order and write it out as an SSTable — a sorted, immutable file indexed by a sparse block index and filtered by a Bloom filter sized to a target false-positive rate. Now the tree has one memtable and N SSTables at L0. GET(k) proceeds in priority order: (1) active memtable, (2) immutable memtable being flushed, (3) SSTables newest to oldest, skipping any whose Bloom filter says 'definitely not here'. Deletes are tombstones — same flush path, just a marker — and they eventually overwrite older live values during COMPACTION. Compaction picks overlapping SSTables (leveled: merge L_i into L_i+1 until each level is 10x the previous; tiered: merge several same-sized files into one larger file), performs a k-way sorted merge, drops tombstones once nothing beneath references the key, and writes fresh SSTables. This is where read amplification gets paid down and disk space is reclaimed.
Implementation
package com.example.lsm;
import java.util.concurrent.ConcurrentSkipListMap;
/** In-memory sorted buffer backed by a skip list for ordered iteration. */
public class Memtable {
public static final byte[] TOMBSTONE = new byte[0];
private final ConcurrentSkipListMap<byte[], byte[]> map =
new ConcurrentSkipListMap<>(java.util.Arrays::compare);
private volatile long bytes = 0;
public void put(byte[] k, byte[] v) {
byte[] prev = map.put(k, v);
bytes += k.length + v.length - (prev == null ? 0 : prev.length);
}
public void delete(byte[] k) { put(k, TOMBSTONE); }
/** null = not present here; TOMBSTONE = deleted (shadows older SSTables). */
public byte[] get(byte[] k) { return map.get(k); }
public long sizeBytes() { return bytes; }
public Iterable<java.util.Map.Entry<byte[], byte[]>> sortedEntries() {
return map.entrySet();
}
}
package com.example.lsm;
import java.io.*;
import java.nio.file.*;
import java.util.*;
public class SSTable {
private final Path file;
private final NavigableMap<byte[], Long> index; // sparse: first key of each block -> offset
private final BloomFilter bloom;
private SSTable(Path f, NavigableMap<byte[], Long> idx, BloomFilter b) {
this.file = f; this.index = idx; this.bloom = b;
}
/** Flush a memtable to disk as an SSTable. Returns the opened handle. */
public static SSTable flush(Path file, Memtable m, int expectedKeys) throws IOException {
NavigableMap<byte[], Long> idx = new TreeMap<>(Arrays::compare);
BloomFilter bloom = new BloomFilter(expectedKeys, 0.01);
try (DataOutputStream out = new DataOutputStream(
new BufferedOutputStream(Files.newOutputStream(file)))) {
long off = 0;
int blockCount = 0;
for (Map.Entry<byte[], byte[]> e : m.sortedEntries()) {
if (blockCount == 0) idx.put(e.getKey(), off); // sparse index entry
out.writeInt(e.getKey().length); out.write(e.getKey());
out.writeInt(e.getValue().length); out.write(e.getValue());
bloom.add(e.getKey());
off += 8 + e.getKey().length + e.getValue().length;
blockCount = (blockCount + 1) % 64; // ~block every 64 keys
}
}
return new SSTable(file, idx, bloom);
}
public byte[] get(byte[] k) throws IOException {
if (!bloom.mightContain(k)) return null; // definitive miss
Map.Entry<byte[], Long> floor = index.floorEntry(k);
if (floor == null) return null;
try (DataInputStream in = new DataInputStream(
new BufferedInputStream(Files.newInputStream(file)))) {
in.skipBytes(floor.getValue().intValue());
while (in.available() > 0) {
int kl = in.readInt(); byte[] kb = in.readNBytes(kl);
int vl = in.readInt(); byte[] vb = in.readNBytes(vl);
int cmp = Arrays.compare(kb, k);
if (cmp == 0) return vb;
if (cmp > 0) return null; // past the key
}
}
return null;
}
public Path file() { return file; }
}
package com.example.lsm;
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;
public class LsmTree {
private volatile Memtable active = new Memtable();
private final Deque<SSTable> sstables = new ConcurrentLinkedDeque<>(); // newest first
private final Path dir;
private final long flushThresholdBytes;
private final ExecutorService bg = Executors.newSingleThreadExecutor();
public LsmTree(Path dir, long flushThresholdBytes) { this.dir = dir; this.flushThresholdBytes = flushThresholdBytes; }
public void put(byte[] k, byte[] v) {
active.put(k, v);
if (active.sizeBytes() > flushThresholdBytes) rotateAndFlush();
}
public byte[] get(byte[] k) throws Exception {
byte[] v = active.get(k);
if (v != null) return v == Memtable.TOMBSTONE ? null : v;
for (SSTable s : sstables) { // newest -> oldest
byte[] r = s.get(k);
if (r != null) return r.length == 0 ? null : r; // tombstone shadows
}
return null;
}
private synchronized void rotateAndFlush() {
Memtable frozen = active;
active = new Memtable();
bg.submit(() -> {
Path f = dir.resolve("sst-" + System.nanoTime() + ".db");
SSTable s = SSTable.flush(f, frozen, 100_000);
sstables.addFirst(s);
});
}
/** Very simplified leveled compaction: k-way merge of the oldest two SSTables. */
public synchronized void compact() throws Exception {
if (sstables.size() < 2) return;
SSTable a = sstables.pollLast();
SSTable b = sstables.pollLast();
// walk both in sorted order, drop tombstones that nothing shadows, write new SSTable
// (omitted: k-way merge implementation)
Path out = dir.resolve("sst-" + System.nanoTime() + "-merged.db");
// SSTable merged = SSTable.merge(out, List.of(a, b));
// sstables.addLast(merged);
Files.deleteIfExists(a.file()); Files.deleteIfExists(b.file());
}
}
Complexity
- Write:
O(log M) into memtable (skip list); amortized O(1) sequential disk write - PointRead:
O(log N) per SSTable level, Bloom filter trims negatives to near-zero I/O - RangeScan:
O(k + levels) with a heap-merge across memtable + SSTables - SpaceAmplification:
~1.1x for leveled, up to ~2x for tiered compaction - WriteAmplification:
~10x for leveled, ~2-3x for tiered
Key design decisions & trade-offs
- Leveled vs tiered compaction — Chosen: Leveled for read-heavy / space-constrained, tiered for write-heavy. Leveled keeps one file per range per level, minimizing read amplification and space waste; tiered batches many same-sized files before merging, minimizing write amplification at the cost of reads and disk.
- Bloom filter per SSTable — Chosen: One Bloom filter per file sized for ~1% false-positive rate. A point GET that misses in memtable would otherwise hit every SSTable. A small in-memory Bloom filter per file cuts negative lookups to one check each and is the single biggest read-amplification mitigation.
- Delete as tombstone — Chosen: Write a tombstone record, rely on compaction to collapse. LSMs are append-only; there is no page to mutate. A tombstone shadows older values on reads and is dropped once compaction reaches the bottom of the tree.
- Memtable data structure — Chosen: Skip list (concurrent) or red-black tree. Needs ordered iteration for flush + concurrent inserts + concurrent reads. Skip lists give lock-free inserts and are easier to make concurrent than RB-trees, which is why RocksDB defaults to them.
- WAL vs purely in-memory memtable — Chosen: Mandatory WAL for durability before ack. Memtable lives in RAM, so a crash would lose it. The WAL makes each write durable as a single append and is truncated once the memtable has flushed.
Common pitfalls
- Undersizing the Bloom filter so point misses still hit disk on every level — cripples read latency.
- Letting compaction fall behind under a write burst; read amplification grows unboundedly as L0 SSTables pile up.
- Forgetting to include a tombstone in the merge output before the key's deletion has propagated to the bottom level, silently resurrecting a deleted value.
- Using one giant memtable; flush pauses become visible latency spikes instead of a smooth background stream.
- Treating LSM as a drop-in B-tree replacement for range-scan-heavy OLTP — the k-way merge tax is real, and leveled compaction interacts badly with very small key ranges.
Interview follow-ups
- Walk through what happens to a read when the key exists only at the oldest level.
- How does size-tiered compaction differ from leveled compaction and when would you pick each?
- Why does an LSM tree almost always outperform a B-tree on writes, and what is the cost in reads and space?
- How would you tune Bloom filter bits per key, and what is the read-latency tail implication?
- Explain how RocksDB's universal vs level compaction pickers make different trade-offs on write amp, space amp, and read amp.
Recommended reading
- Alex Petrov, Database Internals — storage engines and distributed systems internals.
- Martin Kleppmann, Designing Data-Intensive Applications (DDIA) — data models, replication, partitioning, consistency.
- The System Design Primer — high-level design building blocks.
- Foundational networking + web-security references (TCP/IP, TLS 1.3, OWASP Top 10).