Raft Consensus
Leader election + log replication across 5 nodes. Ch 14.
This interactive explanation is built for system design interview prep: step through Raft Consensus, watch the internal state change, and connect the concept to real distributed-system trade-offs.
Overview
Raft is the consensus algorithm that turned "please make this replicated log agree on the order of entries" from a PhD thesis into a weekend project. Designed by Ongaro and Ousterhout as an intentionally understandable alternative to Paxos, it powers etcd, Consul, CockroachDB, TiKV, MongoDB's replica set election, and countless homegrown systems. Raft splits the problem into three subproblems - leader election, log replication, safety - and enforces a strong-leader model: at any time only one node in the cluster can accept writes, and every write flows through it. Each node has a durable triple (currentTerm, votedFor, log[]) that survives crashes. Terms are monotonic logical clocks that fence out stale leaders. The leader replicates AppendEntries to followers and commits an entry once a quorum has persisted it; the committed entry is then applied to the local state machine and its effects become visible to clients. Safety arguments reduce to two properties: the leader contains all committed entries, and applied entries are never overwritten.
How it works
A Raft node is always in one of three roles: follower, candidate, or leader. Followers passively accept AppendEntries from a leader and reset a randomised election timeout on each heartbeat. If that timeout fires without contact, the follower bumps currentTerm, votes for itself, transitions to candidate, and sends RequestVote to every peer. A peer grants its vote only if the candidate's log is at least as up-to-date as its own (compared by lastLogTerm then lastLogIndex) and it has not already voted in this term. A candidate that wins a majority becomes leader and immediately sends empty AppendEntries as heartbeats. The leader accepts client commands, appends them to its log, and issues AppendEntries with prevLogIndex / prevLogTerm consistency checks. Followers reject entries whose predecessor does not match; the leader decrements nextIndex for that follower and retries, walking the follower's log back to a point of agreement. Once a log entry is stored on a majority, the leader advances commitIndex, includes it in the next heartbeat, and each node applies all entries up to commitIndex to its state machine in order. Safety is protected by term checks everywhere: any message carrying a higher term forces the recipient into follower mode with that term, so a partitioned old leader discovers it has been superseded the moment the partition heals and its AppendEntries is rejected. Crash recovery is automatic because (currentTerm, votedFor, log[]) is persisted before any RPC reply.
Implementation
public class RaftNode {
public enum Role { FOLLOWER, CANDIDATE, LEADER }
// Persistent state (fsync before responding to RPCs).
private long currentTerm = 0;
private Integer votedFor = null;
private final List<LogEntry> log = new ArrayList<>(); // 1-indexed
// Volatile state.
private long commitIndex = 0;
private long lastApplied = 0;
private Role role = Role.FOLLOWER;
private final int id;
private final List<Integer> peers;
private final StateMachine sm;
private final PersistentStore store;
public RaftNode(int id, List<Integer> peers, StateMachine sm, PersistentStore store) {
this.id = id; this.peers = peers; this.sm = sm; this.store = store;
this.currentTerm = store.loadTerm();
this.votedFor = store.loadVotedFor();
this.log.addAll(store.loadLog());
}
public synchronized RequestVoteReply onRequestVote(RequestVoteArgs a) {
if (a.term() > currentTerm) stepDown(a.term());
boolean logOk = a.lastLogTerm() > lastLogTerm()
|| (a.lastLogTerm() == lastLogTerm() && a.lastLogIndex() >= log.size());
boolean canVote = (votedFor == null || votedFor == a.candidateId());
boolean grant = a.term() == currentTerm && canVote && logOk;
if (grant) { votedFor = a.candidateId(); persistMeta(); }
return new RequestVoteReply(currentTerm, grant);
}
private void stepDown(long newTerm) {
currentTerm = newTerm; votedFor = null; role = Role.FOLLOWER;
persistMeta();
}
private long lastLogTerm() { return log.isEmpty() ? 0 : log.get(log.size() - 1).term(); }
private void persistMeta() { store.saveMeta(currentTerm, votedFor); }
public record LogEntry(long term, byte[] command) {}
public record RequestVoteArgs(long term, int candidateId, long lastLogIndex, long lastLogTerm) {}
public record RequestVoteReply(long term, boolean voteGranted) {}
}
public class RaftReplication {
private final RaftNode n;
private final Map<Integer, Long> nextIndex = new ConcurrentHashMap<>();
private final Map<Integer, Long> matchIndex = new ConcurrentHashMap<>();
public RaftReplication(RaftNode n) { this.n = n; }
public synchronized AppendEntriesReply onAppendEntries(AppendEntriesArgs a) {
if (a.term() < n.currentTerm()) return new AppendEntriesReply(n.currentTerm(), false);
if (a.term() > n.currentTerm()) n.stepDownExternal(a.term());
// Consistency check: our entry at prevLogIndex must match prevLogTerm.
if (a.prevLogIndex() > 0) {
if (n.log().size() < a.prevLogIndex()) return new AppendEntriesReply(n.currentTerm(), false);
if (n.log().get((int) a.prevLogIndex() - 1).term() != a.prevLogTerm())
return new AppendEntriesReply(n.currentTerm(), false);
}
// Truncate any divergent suffix, then append new entries.
int writeAt = (int) a.prevLogIndex();
for (RaftNode.LogEntry e : a.entries()) {
if (n.log().size() > writeAt && n.log().get(writeAt).term() != e.term()) {
n.log().subList(writeAt, n.log().size()).clear();
}
if (n.log().size() == writeAt) n.log().add(e);
writeAt++;
}
n.persistLog();
if (a.leaderCommit() > n.commitIndex()) {
n.setCommitIndex(Math.min(a.leaderCommit(), n.log().size()));
applyCommitted();
}
return new AppendEntriesReply(n.currentTerm(), true);
}
/** Called by the leader after a successful AppendEntries response. */
public synchronized void onFollowerMatch(int follower, long matched) {
matchIndex.put(follower, matched);
nextIndex.put(follower, matched + 1);
// Advance commitIndex to the highest index replicated on a majority
// AND whose entry term == currentTerm (Figure 8 safety rule).
for (long N = n.log().size(); N > n.commitIndex(); N--) {
if (n.log().get((int) N - 1).term() != n.currentTerm()) continue;
long count = 1 + matchIndex.values().stream().filter(m -> m >= N).count();
if (count > (n.peers().size() + 1) / 2) {
n.setCommitIndex(N); applyCommitted(); break;
}
}
}
private void applyCommitted() {
while (n.lastApplied() < n.commitIndex()) {
long idx = n.lastApplied() + 1;
n.stateMachine().apply(n.log().get((int) idx - 1).command());
n.setLastApplied(idx);
}
}
public record AppendEntriesArgs(long term, int leaderId, long prevLogIndex,
long prevLogTerm, List<RaftNode.LogEntry> entries,
long leaderCommit) {}
public record AppendEntriesReply(long term, boolean success) {}
}
Complexity
- Leader election:
O(N) messages per successful round - Commit one entry:
O(N) AppendEntries, 1 round-trip to a quorum - Follower log convergence:
O(log divergence length) retries - Throughput ceiling:
bounded by leader disk fsync rate - Memory per node:
O(log size) plus O(N) per-follower indices
Key design decisions & trade-offs
- Strong leader — Chosen: All writes route through the leader. Dramatically simplifies reasoning (one writer, one log order) at the cost of leader becoming a throughput bottleneck. Multi-Raft shards the problem by range for horizontal scale.
- Election timing — Chosen: Randomised election timeout. Prevents split votes where every candidate starts simultaneously. The randomisation range must exceed worst-case RTT, or repeated split votes stall progress.
- Figure 8 commit rule — Chosen: Leader only commits entries from its own term via majority. Prevents a subtle safety bug where a replicated-but-uncommitted old-term entry could be overwritten. The cost is adding a no-op at leader startup to expedite commit of prior-term entries.
Common pitfalls
- Forgetting to fsync (currentTerm, votedFor, log) before replying to RequestVote or AppendEntries, which breaks the at-most-one-vote-per-term safety invariant
- Reading from the leader without a lease or read-index check and serving a stale value from a partitioned ex-leader
- Setting election timeout too close to RTT, producing chronic split votes
- Letting the log grow without snapshotting; new followers spend hours catching up on replayable state
Interview follow-ups
- Add log compaction via snapshots so follower catch-up is bounded
- Implement read-index / leader-lease reads for linearisable reads without hitting the log
- Shard into Multi-Raft groups (one Raft group per key-range) for horizontal write scale
- Add learner nodes for safe membership changes without losing quorum during config rollover
Recommended reading
- Alex Petrov, Database Internals — storage engines and distributed systems internals.
- Martin Kleppmann, Designing Data-Intensive Applications (DDIA) — data models, replication, partitioning, consistency.
- The System Design Primer — high-level design building blocks.
- Foundational networking + web-security references (TCP/IP, TLS 1.3, OWASP Top 10).