← System Design Simulator

WAL + Recovery

By Rahul Kumar · Senior Software Engineer · Updated · Category: Petrov · Storage Engines (Part I)

Buffer pool, WAL, ARIES redo/undo after crash. Ch 5.

This interactive explanation is built for system design interview prep: step through WAL + Recovery, watch the internal state change, and connect the concept to real distributed-system trade-offs.

Overview

Write-ahead logging is the single most important idea in database durability — every modern system from Postgres to MySQL's InnoDB to Kafka's segment log to SQLite uses some variant of it, and the ARIES paper published by IBM in 1992 is still the reference design. The invariant is brutally simple: before any modified data page is written to disk, the log record describing the change must already be on disk. That one-line rule lets the database lie about when writes hit the data files — they can trickle out lazily, be reordered, be batched — and still guarantee that after a crash the on-disk state is recoverable to a transactionally consistent point. ARIES adds three refinements on top: LSN-tagged pages so recovery knows what is already applied, physiological logging so a single log record describes the logical change, and a three-phase recovery (Analysis, Redo, Undo) that handles uncommitted transactions, half-flushed pages, and in-flight compensating updates uniformly.

WAL + Recovery — Interactive Simulator

Runs fully client-side in your browser; no sign-up. Or open full screen →

Launch the interactive WAL + Recovery widget — step through the algorithm or protocol and observe the internal state updating in real time.

How it works

Every modification goes through this sequence. 1) The transaction acquires a new log sequence number (LSN), writes a log record into the in-memory log buffer, and stamps the affected data page's header with that LSN. 2) The data-page change happens in the buffer pool but is NOT flushed. 3) On COMMIT, the log buffer is force-flushed to disk up to and including the commit record (state FORCE_LOG_ON_COMMIT); only then does the client see 'OK'. Data pages flow out lazily during normal checkpointing. After a crash recovery enters three phases. ANALYSIS walks forward from the last checkpoint, reconstructing the transaction table (who was live at crash time) and the dirty page table (which pages might be stale on disk). REDO then replays every log record from the earliest recLSN in the dirty-page table forward to end-of-log, but skips any record whose pageLSN on disk is already greater-or-equal — because that change is already durable. Finally UNDO walks backward through the log for each transaction that was live but not committed, applying inverse operations and writing Compensation Log Records (CLRs) so that UNDO itself is idempotent if recovery is interrupted. When UNDO finishes, the on-disk state equals the state of a clean shutdown after all committed transactions.

Implementation

WalRecord.java
package com.example.wal;

import java.nio.ByteBuffer;
import java.util.zip.CRC32;

public final class WalRecord {
    public enum Type { BEGIN, UPDATE, COMMIT, ABORT, CLR, CHECKPOINT }
    public final long lsn;
    public final long txnId;
    public final Type type;
    public final long pageId;
    public final byte[] before;
    public final byte[] after;

    public WalRecord(long lsn, long txnId, Type t, long pageId, byte[] before, byte[] after) {
        this.lsn = lsn; this.txnId = txnId; this.type = t;
        this.pageId = pageId; this.before = before; this.after = after;
    }

    public ByteBuffer serialize() {
        int bLen = before == null ? 0 : before.length;
        int aLen = after  == null ? 0 : after.length;
        ByteBuffer buf = ByteBuffer.allocate(4 + 8 + 8 + 4 + 8 + 4 + bLen + 4 + aLen + 4);
        int sizePos = buf.position(); buf.putInt(0);     // placeholder for length
        buf.putLong(lsn).putLong(txnId).putInt(type.ordinal()).putLong(pageId);
        buf.putInt(bLen); if (bLen > 0) buf.put(before);
        buf.putInt(aLen); if (aLen > 0) buf.put(after);
        CRC32 crc = new CRC32();
        crc.update(buf.array(), sizePos + 4, buf.position() - sizePos - 4);
        buf.putInt((int) crc.getValue());
        buf.putInt(sizePos, buf.position() - sizePos);   // total record length
        buf.flip();
        return buf;
    }
}
WalWriter.java
package com.example.wal;

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.*;
import java.util.concurrent.atomic.AtomicLong;

public class WalWriter implements AutoCloseable {
    private final FileChannel ch;
    private final AtomicLong nextLsn = new AtomicLong(1);
    private volatile long flushedLsn = 0;

    public WalWriter(Path file) throws IOException {
        this.ch = FileChannel.open(file,
            StandardOpenOption.CREATE, StandardOpenOption.APPEND, StandardOpenOption.DSYNC);
    }

    /** Append a record. Returns its LSN. Durability is guaranteed only after forceTo. */
    public synchronized long append(WalRecord.Type t, long txn, long pageId,
                                    byte[] before, byte[] after) throws IOException {
        long lsn = nextLsn.getAndIncrement();
        WalRecord r = new WalRecord(lsn, txn, t, pageId, before, after);
        ch.write(r.serialize());
        return lsn;
    }

    /** Block until every record with lsn <= target is on stable storage. */
    public synchronized void forceTo(long targetLsn) throws IOException {
        if (flushedLsn >= targetLsn) return;
        ch.force(false);                       // fdatasync
        flushedLsn = nextLsn.get() - 1;
    }

    /** Call exactly on COMMIT: append the commit record, then force. */
    public long commit(long txnId) throws IOException {
        long lsn = append(WalRecord.Type.COMMIT, txnId, 0L, null, null);
        forceTo(lsn);                          // FORCE_LOG_ON_COMMIT
        return lsn;
    }

    @Override public void close() throws IOException { ch.close(); }
}
AriesRecovery.java
package com.example.wal;

import java.util.*;

/** Sketch of the ARIES 3-phase recovery: Analysis -> Redo -> Undo. */
public class AriesRecovery {
    private final List<WalRecord> log;
    private final Map<Long, Long> pageLsnOnDisk;      // page -> last applied LSN

    public AriesRecovery(List<WalRecord> log, Map<Long, Long> pageLsnOnDisk) {
        this.log = log; this.pageLsnOnDisk = pageLsnOnDisk;
    }

    public void recover(long lastCheckpointIdx) {
        // 1) ANALYSIS: find loser transactions + earliest recLSN in dirty pages
        Set<Long> live = new HashSet<>();
        long redoFrom = Long.MAX_VALUE;
        for (int i = lastCheckpointIdx; i < log.size(); i++) {
            WalRecord r = log.get(i);
            switch (r.type) {
                case BEGIN  -> live.add(r.txnId);
                case COMMIT, ABORT -> live.remove(r.txnId);
                case UPDATE -> redoFrom = Math.min(redoFrom, r.lsn);
                default -> {}
            }
        }
        // 2) REDO: replay every update whose pageLSN on disk is behind
        for (WalRecord r : log) {
            if (r.type != WalRecord.Type.UPDATE) continue;
            if (r.lsn < redoFrom) continue;
            long pLsn = pageLsnOnDisk.getOrDefault(r.pageId, 0L);
            if (pLsn >= r.lsn) continue;              // already durable
            applyAfterImage(r);                       // write r.after into page
            pageLsnOnDisk.put(r.pageId, r.lsn);
        }
        // 3) UNDO: walk backward, emit CLR for each losing-txn update (idempotent)
        for (int i = log.size() - 1; i >= 0; i--) {
            WalRecord r = log.get(i);
            if (r.type == WalRecord.Type.UPDATE && live.contains(r.txnId)) {
                applyBeforeImage(r);
                // append CLR so re-recovery skips this record
            }
        }
    }

    private void applyAfterImage(WalRecord r) { /* write r.after to page */ }
    private void applyBeforeImage(WalRecord r) { /* write r.before to page */ }
}

Complexity

Key design decisions & trade-offs

Common pitfalls

Interview follow-ups

Recommended reading

Related