WAL + Recovery
Buffer pool, WAL, ARIES redo/undo after crash. Ch 5.
This interactive explanation is built for system design interview prep: step through WAL + Recovery, watch the internal state change, and connect the concept to real distributed-system trade-offs.
Overview
Write-ahead logging is the single most important idea in database durability — every modern system from Postgres to MySQL's InnoDB to Kafka's segment log to SQLite uses some variant of it, and the ARIES paper published by IBM in 1992 is still the reference design. The invariant is brutally simple: before any modified data page is written to disk, the log record describing the change must already be on disk. That one-line rule lets the database lie about when writes hit the data files — they can trickle out lazily, be reordered, be batched — and still guarantee that after a crash the on-disk state is recoverable to a transactionally consistent point. ARIES adds three refinements on top: LSN-tagged pages so recovery knows what is already applied, physiological logging so a single log record describes the logical change, and a three-phase recovery (Analysis, Redo, Undo) that handles uncommitted transactions, half-flushed pages, and in-flight compensating updates uniformly.
How it works
Every modification goes through this sequence. 1) The transaction acquires a new log sequence number (LSN), writes a log record into the in-memory log buffer, and stamps the affected data page's header with that LSN. 2) The data-page change happens in the buffer pool but is NOT flushed. 3) On COMMIT, the log buffer is force-flushed to disk up to and including the commit record (state FORCE_LOG_ON_COMMIT); only then does the client see 'OK'. Data pages flow out lazily during normal checkpointing. After a crash recovery enters three phases. ANALYSIS walks forward from the last checkpoint, reconstructing the transaction table (who was live at crash time) and the dirty page table (which pages might be stale on disk). REDO then replays every log record from the earliest recLSN in the dirty-page table forward to end-of-log, but skips any record whose pageLSN on disk is already greater-or-equal — because that change is already durable. Finally UNDO walks backward through the log for each transaction that was live but not committed, applying inverse operations and writing Compensation Log Records (CLRs) so that UNDO itself is idempotent if recovery is interrupted. When UNDO finishes, the on-disk state equals the state of a clean shutdown after all committed transactions.
Implementation
package com.example.wal;
import java.nio.ByteBuffer;
import java.util.zip.CRC32;
public final class WalRecord {
public enum Type { BEGIN, UPDATE, COMMIT, ABORT, CLR, CHECKPOINT }
public final long lsn;
public final long txnId;
public final Type type;
public final long pageId;
public final byte[] before;
public final byte[] after;
public WalRecord(long lsn, long txnId, Type t, long pageId, byte[] before, byte[] after) {
this.lsn = lsn; this.txnId = txnId; this.type = t;
this.pageId = pageId; this.before = before; this.after = after;
}
public ByteBuffer serialize() {
int bLen = before == null ? 0 : before.length;
int aLen = after == null ? 0 : after.length;
ByteBuffer buf = ByteBuffer.allocate(4 + 8 + 8 + 4 + 8 + 4 + bLen + 4 + aLen + 4);
int sizePos = buf.position(); buf.putInt(0); // placeholder for length
buf.putLong(lsn).putLong(txnId).putInt(type.ordinal()).putLong(pageId);
buf.putInt(bLen); if (bLen > 0) buf.put(before);
buf.putInt(aLen); if (aLen > 0) buf.put(after);
CRC32 crc = new CRC32();
crc.update(buf.array(), sizePos + 4, buf.position() - sizePos - 4);
buf.putInt((int) crc.getValue());
buf.putInt(sizePos, buf.position() - sizePos); // total record length
buf.flip();
return buf;
}
}
package com.example.wal;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.*;
import java.util.concurrent.atomic.AtomicLong;
public class WalWriter implements AutoCloseable {
private final FileChannel ch;
private final AtomicLong nextLsn = new AtomicLong(1);
private volatile long flushedLsn = 0;
public WalWriter(Path file) throws IOException {
this.ch = FileChannel.open(file,
StandardOpenOption.CREATE, StandardOpenOption.APPEND, StandardOpenOption.DSYNC);
}
/** Append a record. Returns its LSN. Durability is guaranteed only after forceTo. */
public synchronized long append(WalRecord.Type t, long txn, long pageId,
byte[] before, byte[] after) throws IOException {
long lsn = nextLsn.getAndIncrement();
WalRecord r = new WalRecord(lsn, txn, t, pageId, before, after);
ch.write(r.serialize());
return lsn;
}
/** Block until every record with lsn <= target is on stable storage. */
public synchronized void forceTo(long targetLsn) throws IOException {
if (flushedLsn >= targetLsn) return;
ch.force(false); // fdatasync
flushedLsn = nextLsn.get() - 1;
}
/** Call exactly on COMMIT: append the commit record, then force. */
public long commit(long txnId) throws IOException {
long lsn = append(WalRecord.Type.COMMIT, txnId, 0L, null, null);
forceTo(lsn); // FORCE_LOG_ON_COMMIT
return lsn;
}
@Override public void close() throws IOException { ch.close(); }
}
package com.example.wal;
import java.util.*;
/** Sketch of the ARIES 3-phase recovery: Analysis -> Redo -> Undo. */
public class AriesRecovery {
private final List<WalRecord> log;
private final Map<Long, Long> pageLsnOnDisk; // page -> last applied LSN
public AriesRecovery(List<WalRecord> log, Map<Long, Long> pageLsnOnDisk) {
this.log = log; this.pageLsnOnDisk = pageLsnOnDisk;
}
public void recover(long lastCheckpointIdx) {
// 1) ANALYSIS: find loser transactions + earliest recLSN in dirty pages
Set<Long> live = new HashSet<>();
long redoFrom = Long.MAX_VALUE;
for (int i = lastCheckpointIdx; i < log.size(); i++) {
WalRecord r = log.get(i);
switch (r.type) {
case BEGIN -> live.add(r.txnId);
case COMMIT, ABORT -> live.remove(r.txnId);
case UPDATE -> redoFrom = Math.min(redoFrom, r.lsn);
default -> {}
}
}
// 2) REDO: replay every update whose pageLSN on disk is behind
for (WalRecord r : log) {
if (r.type != WalRecord.Type.UPDATE) continue;
if (r.lsn < redoFrom) continue;
long pLsn = pageLsnOnDisk.getOrDefault(r.pageId, 0L);
if (pLsn >= r.lsn) continue; // already durable
applyAfterImage(r); // write r.after into page
pageLsnOnDisk.put(r.pageId, r.lsn);
}
// 3) UNDO: walk backward, emit CLR for each losing-txn update (idempotent)
for (int i = log.size() - 1; i >= 0; i--) {
WalRecord r = log.get(i);
if (r.type == WalRecord.Type.UPDATE && live.contains(r.txnId)) {
applyBeforeImage(r);
// append CLR so re-recovery skips this record
}
}
}
private void applyAfterImage(WalRecord r) { /* write r.after to page */ }
private void applyBeforeImage(WalRecord r) { /* write r.before to page */ }
}
Complexity
- Append (per op):
O(1) amortized; one sequential write to the log - Commit:
O(1) plus one fsync on the log - Recovery:
O(log-since-last-checkpoint) - Checkpoint frequency:
Bounds recovery time: longer interval = faster normal path, slower recovery
Key design decisions & trade-offs
- Force-log-on-commit (WAL invariant) — Chosen: fsync the log before acknowledging the client, let data pages drift. One sequential fsync per commit is far cheaper than forcing every touched data page. The invariant — log before data — preserves durability while letting the buffer pool batch random writes.
- Physiological vs logical logging — Chosen: Physiological: identify page by id, describe the change logically within the page. Pure physical (byte-level) logs are huge and break under structural change; pure logical logs cannot handle torn pages. Physiological is the sweet spot ARIES chose.
- Group commit — Chosen: Batch multiple concurrent commits into one fsync. fsync is the bottleneck. Letting a small time window of commits share a single fsync multiplies commit throughput by orders of magnitude with only a few ms of added latency.
- Steal / no-force buffer policy — Chosen: STEAL (dirty pages may be evicted before commit) + NO-FORCE (no flush on commit). This is exactly the policy that makes WAL necessary and valuable; both ends give throughput wins, and UNDO + REDO in recovery pay for the flexibility.
- Checkpoint style — Chosen: Fuzzy checkpoint (non-blocking) over sharp checkpoint. Stopping the world to flush every dirty page is unacceptable in production; fuzzy checkpoints let the system keep serving traffic while bounding recovery time.
Common pitfalls
- Forgetting to stamp the page LSN in the page header; REDO then has no way to skip already-durable changes and either repeats work or corrupts state.
- Acknowledging a commit before the log reaches stable storage (D of ACID is gone).
- Using fsync when fdatasync suffices, doubling commit latency by also flushing inode metadata on every commit.
- Skipping checkpoints; recovery time grows unbounded with uptime, and one day a crash takes hours to recover.
- Omitting CLRs during UNDO; if recovery itself crashes the second pass undoes things twice and corrupts data.
Interview follow-ups
- Walk through what ARIES recovery does after a crash where some pages were flushed mid-transaction and others were not.
- Why is UNDO required even in a no-steal buffer policy? (Hint: it is not.)
- How does group commit work and what is the latency/throughput trade-off?
- What is the difference between a sharp checkpoint and a fuzzy checkpoint, and why does every production database use the fuzzy variant?
- How would you extend this WAL to feed a physical-replication standby without breaking the local durability guarantees?
Recommended reading
- Alex Petrov, Database Internals — storage engines and distributed systems internals.
- Martin Kleppmann, Designing Data-Intensive Applications (DDIA) — data models, replication, partitioning, consistency.
- The System Design Primer — high-level design building blocks.
- Foundational networking + web-security references (TCP/IP, TLS 1.3, OWASP Top 10).