Bully Leader Election
Highest-id wins; ELECTION/OK/COORDINATOR. Ch 10.
This interactive explanation is built for system design interview prep: step through Bully Leader Election, watch the internal state change, and connect the concept to real distributed-system trade-offs.
Overview
The Bully algorithm is the simplest leader election protocol that actually works in an asynchronous, crash-prone cluster, and it is still the one most engineers reach for when they do not need the full machinery of Raft or Paxos. The idea is embarrassingly direct: every node has a fixed, globally-unique priority (usually its node id), and the highest-surviving priority wins. When a node notices the current coordinator is gone, it challenges every higher-id node with an ELECTION message. If any of them respond, it loses the round and waits. If none respond, it declares itself COORDINATOR and tells everyone. The protocol was introduced by Garcia-Molina in 1982 and still underlies many production systems: classic MySQL replica promotion, ZooKeeper-less setups, and countless ad-hoc service coordinators. It converges in at most O(N^2) messages and, crucially, always elects the live node with the highest id, which makes reasoning about split-brain resolution easy.
How it works
Every node is configured with the full membership list and each peer's priority. A failure detector (timeout-based or phi-accrual) marks the current coordinator as suspect. When node P decides the coordinator is down, P enters an election by sending an ELECTION message to every node with a higher id than itself. Three outcomes are possible. First, no higher node responds within the election timeout; P concludes it is the highest surviving node and broadcasts COORDINATOR to everyone, which becomes the new leader announcement. Second, at least one higher node responds with an OK message, which acknowledges receipt and implicitly says "stand down, I will take over." P then waits for a new COORDINATOR message within a secondary timeout. Third, that secondary timeout expires without a COORDINATOR announcement (the higher node crashed mid-election), in which case P restarts the election from scratch. When a previously-dead node with a high id rejoins, it must immediately start its own election; the protocol guarantees it will bully its way back into the leader slot, which is the behaviour that gives it the name. Message types are kept explicit and stateless: every handler inspects sender id against local id and either acknowledges, takes over, or ignores. Because the protocol relies entirely on timeouts, correct tuning of election timeout and heartbeat interval is load-bearing. Under network partition it can elect two leaders (one per side), so applications pair it with a quorum check or a fencing token before allowing the new leader to make externally-visible decisions.
Implementation
public class Node {
public enum MsgType { ELECTION, OK, COORDINATOR }
public record Message(MsgType type, int fromId) {}
private final int id;
private final List<Integer> peerIds; // all other node ids, sorted
private final Transport transport;
private final ScheduledExecutorService timers;
private volatile int currentLeader = -1;
private volatile boolean waitingForOk = false;
private volatile boolean waitingForCoordinator = false;
private static final long ELECTION_TIMEOUT_MS = 500;
public Node(int id, List<Integer> peerIds, Transport transport,
ScheduledExecutorService timers) {
this.id = id; this.peerIds = peerIds;
this.transport = transport; this.timers = timers;
}
/** Triggered by failure detector when the current leader is suspect. */
public synchronized void startElection() {
List<Integer> higher = peerIds.stream()
.filter(p -> p > id).toList();
if (higher.isEmpty()) { becomeCoordinator(); return; }
waitingForOk = true;
higher.forEach(p -> transport.send(p, new Message(MsgType.ELECTION, id)));
timers.schedule(this::onOkTimeout, ELECTION_TIMEOUT_MS, TimeUnit.MILLISECONDS);
}
public synchronized void onElection(Message m) {
// Acknowledge and take over if we outrank the sender.
if (m.fromId() < id) {
transport.send(m.fromId(), new Message(MsgType.OK, id));
if (!waitingForOk && !waitingForCoordinator) startElection();
}
}
public synchronized void onOk(Message m) {
waitingForOk = false;
waitingForCoordinator = true;
timers.schedule(this::onCoordinatorTimeout,
2 * ELECTION_TIMEOUT_MS, TimeUnit.MILLISECONDS);
}
public synchronized void onCoordinator(Message m) {
currentLeader = m.fromId();
waitingForOk = false;
waitingForCoordinator = false;
}
private synchronized void onOkTimeout() {
if (waitingForOk) becomeCoordinator();
}
private synchronized void onCoordinatorTimeout() {
if (waitingForCoordinator) { waitingForCoordinator = false; startElection(); }
}
private void becomeCoordinator() {
currentLeader = id;
waitingForOk = false;
peerIds.forEach(p -> transport.send(p, new Message(MsgType.COORDINATOR, id)));
}
}
Complexity
- Election initiated by lowest live node:
O(N^2) messages worst case - Election initiated by highest live node:
O(N) messages (single COORDINATOR broadcast) - Time to converge:
O(election timeout) per round, up to log2(N) rounds - Memory per node:
O(N) peer state
Key design decisions & trade-offs
- Tiebreaker — Chosen: Static node id, highest wins. Zero-cost and deterministic. The cost is biased load: the top-priority node is always leader when alive, which may conflict with even load distribution.
- Safety under partition — Chosen: Algorithm allows dual leaders; caller must fence. Bully by itself has no quorum, so a network split elects two leaders. Production uses add a fencing token or a quorum check before a leader can commit externally-visible state.
- Reliance on timeouts — Chosen: Fixed election and OK timeouts. Easy to reason about; wrong tuning causes either slow failover (too long) or flapping (too short). Adaptive timeouts driven by a phi detector are a common upgrade.
Common pitfalls
- Running Bully without a fencing token, allowing a resurrected old leader to overwrite state written by the new one
- Reusing node ids after a crash; the protocol assumes ids are stable and globally unique
- Forgetting the OK-but-no-COORDINATOR restart path, which deadlocks the cluster when a higher node crashes mid-election
- Tuning election timeout shorter than worst-case GC pause, producing phantom elections
Interview follow-ups
- Layer a fencing token (monotonic epoch) so external systems can reject writes from a stale leader
- Swap fixed timeouts for a phi-accrual detector to adapt to WAN jitter
- Use Ring election instead when the full membership list is expensive to maintain
- Replace Bully with Raft when the application also needs log replication, not just leader identity
Recommended reading
- Alex Petrov, Database Internals — storage engines and distributed systems internals.
- Martin Kleppmann, Designing Data-Intensive Applications (DDIA) — data models, replication, partitioning, consistency.
- The System Design Primer — high-level design building blocks.
- Foundational networking + web-security references (TCP/IP, TLS 1.3, OWASP Top 10).