Phi-Accrual Detector
Heartbeat stream → suspicion level over time. Ch 9.
This interactive explanation is built for system design interview prep: step through Phi-Accrual Detector, watch the internal state change, and connect the concept to real distributed-system trade-offs.
Overview
A phi-accrual failure detector is the adult answer to the question "is that node down?" It replaces the usual binary timeout ("if we have not heard from you in 5 seconds, you are dead") with a continuous suspicion level phi, computed from the statistical distribution of recent heartbeat inter-arrival times. If a node has been heartbeating every 1 second plus or minus 50ms for an hour, a 2-second gap is already very suspicious. If it has been jittering between 0.5s and 4s, a 2-second gap means nothing. The detector captures exactly that intuition. It was introduced by Hayashibara et al. and is the same one Cassandra and Akka ship in production. The output phi grows without bound as a node goes silent, so services can pick their own threshold (commonly 8 or 12) and trade false positives against detection latency without rewriting the underlying math.
How it works
Each node maintains a bounded rolling window (typically 1000 samples) of inter-arrival times between heartbeats from a peer. On every heartbeat, the detector records the delta since the previous heartbeat, appends it to the window, and updates the running mean and variance. Modern implementations use Welford's online algorithm, or simply recompute from the window when a new sample pushes an old one out. To compute phi at time T, we first measure the time since the last received heartbeat, then ask: given the historical distribution of inter-arrival times, how surprising is it that we have not received one yet? The paper assumes a normal distribution, so P_later(t) is 1 minus the CDF at t, and phi = -log10(P_later). A phi of 1 means roughly one in ten chance of being a transient delay; phi of 8 is one in one hundred million. The detector never declares a node dead on its own; it just publishes phi continuously, and the failover layer applies a threshold. When a heartbeat finally arrives, the window absorbs the long gap as just another sample, the variance inflates, and phi naturally recalibrates for that link's real behaviour. This adaptivity is what makes it survive GC pauses, WAN jitter, and cross-region links without human tuning per environment.
Implementation
public class PhiDetector {
private final int maxSamples;
private final ArrayDeque<Long> window = new ArrayDeque<>();
private double mean = 0.0;
private double m2 = 0.0; // sum of squared deltas from mean
private long lastHeartbeatNanos = -1L;
private final long firstHeartbeatEstimateMillis;
public PhiDetector(int maxSamples, long firstHeartbeatEstimateMillis) {
this.maxSamples = maxSamples;
this.firstHeartbeatEstimateMillis = firstHeartbeatEstimateMillis;
}
/** Called every time a heartbeat arrives from the monitored peer. */
public synchronized void onHeartbeat(long nowNanos) {
if (lastHeartbeatNanos > 0) {
long deltaMs = (nowNanos - lastHeartbeatNanos) / 1_000_000L;
addSample(deltaMs);
}
lastHeartbeatNanos = nowNanos;
}
/** Current suspicion level. Grows without bound as silence stretches. */
public synchronized double phi(long nowNanos) {
if (lastHeartbeatNanos < 0 || window.size() < 2) return 0.0;
long sinceLastMs = (nowNanos - lastHeartbeatNanos) / 1_000_000L;
double variance = m2 / (window.size() - 1);
double stdDev = Math.max(Math.sqrt(variance), 1.0); // floor to avoid div-by-zero
double y = (sinceLastMs - mean) / stdDev;
// Upper-tail of normal via Chernoff approximation from the paper.
double e = Math.exp(-y * (1.5976 + 0.070566 * y * y));
double pLater = sinceLastMs > mean ? e / (1.0 + e) : 1.0 - 1.0 / (1.0 + e);
return -Math.log10(Math.max(pLater, 1e-300));
}
private void addSample(long deltaMs) {
if (window.size() == maxSamples) {
long evicted = window.pollFirst();
// Remove contribution of evicted sample via reverse Welford.
int n = window.size();
double oldMean = (mean * (n + 1) - evicted) / Math.max(n, 1);
m2 -= (evicted - oldMean) * (evicted - mean);
mean = oldMean;
}
window.addLast(deltaMs);
int n = window.size();
double delta = deltaMs - mean;
mean += delta / n;
m2 += delta * (deltaMs - mean);
}
}
Complexity
- onHeartbeat update:
O(1) amortized - phi(now) computation:
O(1) - Window storage per peer:
O(W) where W is sample count - Cluster-wide monitoring of N peers:
O(N) memory, O(N) CPU per heartbeat tick
Key design decisions & trade-offs
- Distribution assumption — Chosen: Normal distribution over inter-arrival times. Closed-form CDF and cheap to update online. Real heartbeat distributions are fat-tailed, so the detector over-suspects during long GC pauses, which is usually the safer direction anyway.
- Window size — Chosen: Fixed ring of ~1000 samples. Large enough to smooth jitter, small enough to adapt when a link's baseline latency shifts. Unbounded history would make phi insensitive to recent regime changes.
- Threshold location — Chosen: Detector publishes phi; caller picks the cutoff. Different layers want different sensitivities. Leader election may trip at phi=8; storage routing might wait for phi=12. Separating the math from the policy is what makes the primitive reusable.
Common pitfalls
- Forgetting a floor on stdDev: a perfectly regular stream collapses variance to zero and phi explodes on the first jitter
- Feeding the detector wall-clock time instead of monotonic time, making NTP adjustments look like gigantic pauses
- Sharing a single window across multiple peers; every link has its own latency distribution and must have its own detector
- Treating phi as a boolean by thresholding inside the detector rather than exposing the continuous value
Interview follow-ups
- Bootstrap the window with a configured expected interval so cold starts do not trigger instant suspicion
- Emit phi as a metric so operators can see flapping links before they cause failovers
- Couple phi thresholds with hysteresis so a node does not oscillate between UP and DOWN
- Replace the normal-distribution assumption with a log-normal or empirical CDF for fat-tailed WAN links
Recommended reading
- Alex Petrov, Database Internals — storage engines and distributed systems internals.
- Martin Kleppmann, Designing Data-Intensive Applications (DDIA) — data models, replication, partitioning, consistency.
- The System Design Primer — high-level design building blocks.
- Foundational networking + web-security references (TCP/IP, TLS 1.3, OWASP Top 10).