Slotted Page
Variable-size cells, fragmentation, compaction. Ch 3.
This interactive explanation is built for system design interview prep: step through Slotted Page, watch the internal state change, and connect the concept to real distributed-system trade-offs.
Overview
The slotted page is the unit of storage inside nearly every row-oriented database ever written — Postgres, MySQL's InnoDB, Oracle, SQL Server, and SQLite all use a variant of the same layout on every 8 KiB or 16 KiB page. The idea is surgically simple: put fixed-size bookkeeping at the start of the page, let variable-length tuples grow from the end of the page toward the middle, and stitch the two together with a slot array (also called a line pointer array) that maps each row's logical slot number to its physical byte offset on the page. This indirection is the secret sauce — a row id (page_id, slot_id) survives even when the tuple is shuffled within the page during compaction, which means indexes pointing at those row ids never have to be rewritten when a page reorganizes. Understanding the slotted page is the prerequisite for understanding fragmentation, VACUUM, HOT updates, and half the locking story in MVCC databases.
How it works
A fresh page is split into three regions: a fixed header (page lsn, checksum, free-space hi/lo pointers), a slot array that grows from low addresses upward, and tuple data that grows from high addresses downward. INSERT(bytes): first check that slot_end + sizeof(slot) + len(bytes) <= data_start; if not the page is full. If it fits, decrement data_start by len(bytes), memcpy the tuple there, append a slot record (offset = data_start, length = len) at slot_end, increment slot_end, and return the slot index as the in-page row id. DELETE(slot): mark the slot's length as -1 (tombstone); the bytes stay on the page but are no longer reachable. UPDATE(slot, bytes): if new length <= old length, overwrite in place and shrink the slot length; otherwise insert the new version elsewhere on the page and have the old slot forward to the new slot id (the 'forward pointer' state, which is how Oracle's row chaining and Postgres' HOT updates work). Over time tombstones and forwarding accumulate, so the engine runs COMPACT(): walk the slot array in order, copy live tuples into a contiguous block at the end of the page, rewrite each slot's offset, and reset data_start. Because the slot index never changes during compaction, all external pointers (index entries, TIDs) stay valid without any rewriting.
Implementation
package com.example.page;
import java.nio.ByteBuffer;
/**
* 8 KiB page:
* [header][slot[0]..slot[n-1]] -> grows right
* <- [tuple[n-1]..tuple[0]] grows left
* A slot is (short offset, short length). length = -1 marks a tombstone.
*/
public class SlottedPage {
private static final int PAGE_SIZE = 8192;
private static final int HEADER_BYTES = 8; // numSlots(2) + freeStart(2) + freeEnd(2) + flags(2)
private static final int SLOT_BYTES = 4;
private final ByteBuffer buf = ByteBuffer.allocate(PAGE_SIZE);
private short numSlots = 0;
private short freeStart = HEADER_BYTES; // where the next slot will be written
private short freeEnd = (short) PAGE_SIZE; // where the last tuple was written (exclusive top)
/** @return slot id, or -1 if the page is full */
public int insert(byte[] tuple) {
int need = SLOT_BYTES + tuple.length;
if (freeEnd - freeStart < need) return -1;
freeEnd -= tuple.length;
System.arraycopy(tuple, 0, buf.array(), freeEnd, tuple.length);
buf.putShort(freeStart, (short) freeEnd);
buf.putShort(freeStart + 2, (short) tuple.length);
freeStart += SLOT_BYTES;
return numSlots++;
}
public byte[] read(int slot) {
int base = HEADER_BYTES + slot * SLOT_BYTES;
short off = buf.getShort(base);
short len = buf.getShort(base + 2);
if (len < 0) return null; // tombstone
byte[] out = new byte[len];
System.arraycopy(buf.array(), off, out, 0, len);
return out;
}
public void delete(int slot) {
int base = HEADER_BYTES + slot * SLOT_BYTES;
buf.putShort(base + 2, (short) -1); // tombstone; compact reclaims bytes
}
/** Reclaim space from tombstones and gaps by rewriting live tuples contiguously. */
public void compact() {
byte[] scratch = new byte[PAGE_SIZE];
short cursor = (short) PAGE_SIZE;
for (int s = 0; s < numSlots; s++) {
int base = HEADER_BYTES + s * SLOT_BYTES;
short len = buf.getShort(base + 2);
if (len < 0) continue; // skip tombstones, slot id preserved
short off = buf.getShort(base);
cursor -= len;
System.arraycopy(buf.array(), off, scratch, cursor, len);
buf.putShort(base, cursor); // rewrite offset; length unchanged
}
System.arraycopy(scratch, cursor, buf.array(), cursor, PAGE_SIZE - cursor);
freeEnd = cursor;
// freeStart and numSlots are intentionally unchanged — external TIDs remain valid.
}
public int freeSpace() { return freeEnd - freeStart; }
public int slots() { return numSlots; }
}
Complexity
- Insert:
O(1) if space available, else caller allocates new page - Read:
O(1) via slot array - Delete:
O(1) (tombstone); reclamation deferred to compact - Compact:
O(n) in page size - Space:
PAGE_SIZE bytes per page, typically 70-90% utilized
Key design decisions & trade-offs
- Stable slot ids across compaction — Chosen: Indirection array + tuple area; compact rewrites offsets but not slot numbers. External row ids (TIDs) are embedded in every secondary index. If compaction changed slot ids, every secondary index would have to be rewritten on every page reorg — prohibitive.
- Delete via tombstone — Chosen: Mark length = -1; defer reclamation. A physical shift on every delete is expensive and requires latching out readers. Tombstones keep deletes O(1) and batch reclamation work into a single compact pass.
- Update-in-place vs move-and-forward — Chosen: In-place if it fits, otherwise insert elsewhere and leave a forwarding pointer. Most updates shrink or keep size; in-place keeps TIDs valid for free. When a tuple grows past its old footprint a forwarding pointer avoids touching every secondary index at the cost of one extra hop on read.
- Page size — Chosen: Match OS/SSD I/O unit (8-16 KiB typical). Smaller pages waste header overhead and increase tree depth for indexes; larger pages amplify write-back cost and hurt small-tuple workloads. Aligning with the device's native I/O size is the dominant concern.
Common pitfalls
- Using the physical byte offset as the row id instead of the slot index — every compaction invalidates every index.
- Forgetting to leave room for the slot when checking free space (need SLOT_BYTES + tuple.length, not just tuple.length).
- Never running compact on pages that accumulate tombstones; the page reports plenty of 'free space' but each insert still fails.
- Allowing variable-length tuples to grow in place past their slot length without a forwarding pointer — readers see corrupted data.
- Ignoring page LSN / checksum fields in the header; WAL-based recovery depends on them to know whether a page's changes are already durable.
Interview follow-ups
- Why do slot ids have to remain stable across compaction — what breaks if they change?
- How would you handle a tuple larger than a page (TOAST in Postgres, overflow pages in InnoDB)?
- Describe HOT updates in Postgres and how they exploit the forwarding pointer on a slotted page.
- How does the page's LSN interact with the WAL protocol during recovery?
- How would you lay out a slotted page for a columnar store, or is the concept even applicable?
Recommended reading
- Alex Petrov, Database Internals — storage engines and distributed systems internals.
- Martin Kleppmann, Designing Data-Intensive Applications (DDIA) — data models, replication, partitioning, consistency.
- The System Design Primer — high-level design building blocks.
- Foundational networking + web-security references (TCP/IP, TLS 1.3, OWASP Top 10).