← System Design Simulator

Raft Consensus

By Rahul Kumar · Senior Software Engineer · Updated · Category: Petrov · Distributed Systems (Part II)

Leader election + log replication across 5 nodes. Ch 14.

This interactive explanation is built for system design interview prep: step through Raft Consensus, watch the internal state change, and connect the concept to real distributed-system trade-offs.

Overview

Raft is the consensus algorithm that turned "please make this replicated log agree on the order of entries" from a PhD thesis into a weekend project. Designed by Ongaro and Ousterhout as an intentionally understandable alternative to Paxos, it powers etcd, Consul, CockroachDB, TiKV, MongoDB's replica set election, and countless homegrown systems. Raft splits the problem into three subproblems - leader election, log replication, safety - and enforces a strong-leader model: at any time only one node in the cluster can accept writes, and every write flows through it. Each node has a durable triple (currentTerm, votedFor, log[]) that survives crashes. Terms are monotonic logical clocks that fence out stale leaders. The leader replicates AppendEntries to followers and commits an entry once a quorum has persisted it; the committed entry is then applied to the local state machine and its effects become visible to clients. Safety arguments reduce to two properties: the leader contains all committed entries, and applied entries are never overwritten.

Raft Consensus — Interactive Simulator

Runs fully client-side in your browser; no sign-up. Or open full screen →

Launch the interactive Raft Consensus widget — step through the algorithm or protocol and observe the internal state updating in real time.

How it works

A Raft node is always in one of three roles: follower, candidate, or leader. Followers passively accept AppendEntries from a leader and reset a randomised election timeout on each heartbeat. If that timeout fires without contact, the follower bumps currentTerm, votes for itself, transitions to candidate, and sends RequestVote to every peer. A peer grants its vote only if the candidate's log is at least as up-to-date as its own (compared by lastLogTerm then lastLogIndex) and it has not already voted in this term. A candidate that wins a majority becomes leader and immediately sends empty AppendEntries as heartbeats. The leader accepts client commands, appends them to its log, and issues AppendEntries with prevLogIndex / prevLogTerm consistency checks. Followers reject entries whose predecessor does not match; the leader decrements nextIndex for that follower and retries, walking the follower's log back to a point of agreement. Once a log entry is stored on a majority, the leader advances commitIndex, includes it in the next heartbeat, and each node applies all entries up to commitIndex to its state machine in order. Safety is protected by term checks everywhere: any message carrying a higher term forces the recipient into follower mode with that term, so a partitioned old leader discovers it has been superseded the moment the partition heals and its AppendEntries is rejected. Crash recovery is automatic because (currentTerm, votedFor, log[]) is persisted before any RPC reply.

Implementation

RaftNode: persistent state and election RPC handlers
public class RaftNode {
    public enum Role { FOLLOWER, CANDIDATE, LEADER }

    // Persistent state (fsync before responding to RPCs).
    private long currentTerm = 0;
    private Integer votedFor = null;
    private final List<LogEntry> log = new ArrayList<>(); // 1-indexed

    // Volatile state.
    private long commitIndex = 0;
    private long lastApplied = 0;
    private Role role = Role.FOLLOWER;
    private final int id;
    private final List<Integer> peers;
    private final StateMachine sm;
    private final PersistentStore store;

    public RaftNode(int id, List<Integer> peers, StateMachine sm, PersistentStore store) {
        this.id = id; this.peers = peers; this.sm = sm; this.store = store;
        this.currentTerm = store.loadTerm();
        this.votedFor = store.loadVotedFor();
        this.log.addAll(store.loadLog());
    }

    public synchronized RequestVoteReply onRequestVote(RequestVoteArgs a) {
        if (a.term() > currentTerm) stepDown(a.term());
        boolean logOk = a.lastLogTerm() > lastLogTerm()
                || (a.lastLogTerm() == lastLogTerm() && a.lastLogIndex() >= log.size());
        boolean canVote = (votedFor == null || votedFor == a.candidateId());
        boolean grant = a.term() == currentTerm && canVote && logOk;
        if (grant) { votedFor = a.candidateId(); persistMeta(); }
        return new RequestVoteReply(currentTerm, grant);
    }

    private void stepDown(long newTerm) {
        currentTerm = newTerm; votedFor = null; role = Role.FOLLOWER;
        persistMeta();
    }

    private long lastLogTerm() { return log.isEmpty() ? 0 : log.get(log.size() - 1).term(); }

    private void persistMeta() { store.saveMeta(currentTerm, votedFor); }

    public record LogEntry(long term, byte[] command) {}
    public record RequestVoteArgs(long term, int candidateId, long lastLogIndex, long lastLogTerm) {}
    public record RequestVoteReply(long term, boolean voteGranted) {}
}
AppendEntries, commit advancement, apply
public class RaftReplication {
    private final RaftNode n;
    private final Map<Integer, Long> nextIndex = new ConcurrentHashMap<>();
    private final Map<Integer, Long> matchIndex = new ConcurrentHashMap<>();

    public RaftReplication(RaftNode n) { this.n = n; }

    public synchronized AppendEntriesReply onAppendEntries(AppendEntriesArgs a) {
        if (a.term() < n.currentTerm()) return new AppendEntriesReply(n.currentTerm(), false);
        if (a.term() > n.currentTerm()) n.stepDownExternal(a.term());

        // Consistency check: our entry at prevLogIndex must match prevLogTerm.
        if (a.prevLogIndex() > 0) {
            if (n.log().size() < a.prevLogIndex()) return new AppendEntriesReply(n.currentTerm(), false);
            if (n.log().get((int) a.prevLogIndex() - 1).term() != a.prevLogTerm())
                return new AppendEntriesReply(n.currentTerm(), false);
        }

        // Truncate any divergent suffix, then append new entries.
        int writeAt = (int) a.prevLogIndex();
        for (RaftNode.LogEntry e : a.entries()) {
            if (n.log().size() > writeAt && n.log().get(writeAt).term() != e.term()) {
                n.log().subList(writeAt, n.log().size()).clear();
            }
            if (n.log().size() == writeAt) n.log().add(e);
            writeAt++;
        }
        n.persistLog();

        if (a.leaderCommit() > n.commitIndex()) {
            n.setCommitIndex(Math.min(a.leaderCommit(), n.log().size()));
            applyCommitted();
        }
        return new AppendEntriesReply(n.currentTerm(), true);
    }

    /** Called by the leader after a successful AppendEntries response. */
    public synchronized void onFollowerMatch(int follower, long matched) {
        matchIndex.put(follower, matched);
        nextIndex.put(follower, matched + 1);
        // Advance commitIndex to the highest index replicated on a majority
        // AND whose entry term == currentTerm (Figure 8 safety rule).
        for (long N = n.log().size(); N > n.commitIndex(); N--) {
            if (n.log().get((int) N - 1).term() != n.currentTerm()) continue;
            long count = 1 + matchIndex.values().stream().filter(m -> m >= N).count();
            if (count > (n.peers().size() + 1) / 2) {
                n.setCommitIndex(N); applyCommitted(); break;
            }
        }
    }

    private void applyCommitted() {
        while (n.lastApplied() < n.commitIndex()) {
            long idx = n.lastApplied() + 1;
            n.stateMachine().apply(n.log().get((int) idx - 1).command());
            n.setLastApplied(idx);
        }
    }

    public record AppendEntriesArgs(long term, int leaderId, long prevLogIndex,
                                    long prevLogTerm, List<RaftNode.LogEntry> entries,
                                    long leaderCommit) {}
    public record AppendEntriesReply(long term, boolean success) {}
}

Complexity

Key design decisions & trade-offs

Common pitfalls

Interview follow-ups

Recommended reading

Related