← System Design Simulator

Phi-Accrual Detector

By Rahul Kumar · Senior Software Engineer · Updated · Category: Petrov · Distributed Systems (Part II)

Heartbeat stream → suspicion level over time. Ch 9.

This interactive explanation is built for system design interview prep: step through Phi-Accrual Detector, watch the internal state change, and connect the concept to real distributed-system trade-offs.

Overview

A phi-accrual failure detector is the adult answer to the question "is that node down?" It replaces the usual binary timeout ("if we have not heard from you in 5 seconds, you are dead") with a continuous suspicion level phi, computed from the statistical distribution of recent heartbeat inter-arrival times. If a node has been heartbeating every 1 second plus or minus 50ms for an hour, a 2-second gap is already very suspicious. If it has been jittering between 0.5s and 4s, a 2-second gap means nothing. The detector captures exactly that intuition. It was introduced by Hayashibara et al. and is the same one Cassandra and Akka ship in production. The output phi grows without bound as a node goes silent, so services can pick their own threshold (commonly 8 or 12) and trade false positives against detection latency without rewriting the underlying math.

Phi-Accrual Detector — Interactive Simulator

Runs fully client-side in your browser; no sign-up. Or open full screen →

Launch the interactive Phi-Accrual Detector widget — step through the algorithm or protocol and observe the internal state updating in real time.

How it works

Each node maintains a bounded rolling window (typically 1000 samples) of inter-arrival times between heartbeats from a peer. On every heartbeat, the detector records the delta since the previous heartbeat, appends it to the window, and updates the running mean and variance. Modern implementations use Welford's online algorithm, or simply recompute from the window when a new sample pushes an old one out. To compute phi at time T, we first measure the time since the last received heartbeat, then ask: given the historical distribution of inter-arrival times, how surprising is it that we have not received one yet? The paper assumes a normal distribution, so P_later(t) is 1 minus the CDF at t, and phi = -log10(P_later). A phi of 1 means roughly one in ten chance of being a transient delay; phi of 8 is one in one hundred million. The detector never declares a node dead on its own; it just publishes phi continuously, and the failover layer applies a threshold. When a heartbeat finally arrives, the window absorbs the long gap as just another sample, the variance inflates, and phi naturally recalibrates for that link's real behaviour. This adaptivity is what makes it survive GC pauses, WAN jitter, and cross-region links without human tuning per environment.

Implementation

PhiDetector with rolling window and Welford statistics
public class PhiDetector {
    private final int maxSamples;
    private final ArrayDeque<Long> window = new ArrayDeque<>();
    private double mean = 0.0;
    private double m2 = 0.0; // sum of squared deltas from mean
    private long lastHeartbeatNanos = -1L;
    private final long firstHeartbeatEstimateMillis;

    public PhiDetector(int maxSamples, long firstHeartbeatEstimateMillis) {
        this.maxSamples = maxSamples;
        this.firstHeartbeatEstimateMillis = firstHeartbeatEstimateMillis;
    }

    /** Called every time a heartbeat arrives from the monitored peer. */
    public synchronized void onHeartbeat(long nowNanos) {
        if (lastHeartbeatNanos > 0) {
            long deltaMs = (nowNanos - lastHeartbeatNanos) / 1_000_000L;
            addSample(deltaMs);
        }
        lastHeartbeatNanos = nowNanos;
    }

    /** Current suspicion level. Grows without bound as silence stretches. */
    public synchronized double phi(long nowNanos) {
        if (lastHeartbeatNanos < 0 || window.size() < 2) return 0.0;
        long sinceLastMs = (nowNanos - lastHeartbeatNanos) / 1_000_000L;
        double variance = m2 / (window.size() - 1);
        double stdDev = Math.max(Math.sqrt(variance), 1.0); // floor to avoid div-by-zero
        double y = (sinceLastMs - mean) / stdDev;
        // Upper-tail of normal via Chernoff approximation from the paper.
        double e = Math.exp(-y * (1.5976 + 0.070566 * y * y));
        double pLater = sinceLastMs > mean ? e / (1.0 + e) : 1.0 - 1.0 / (1.0 + e);
        return -Math.log10(Math.max(pLater, 1e-300));
    }

    private void addSample(long deltaMs) {
        if (window.size() == maxSamples) {
            long evicted = window.pollFirst();
            // Remove contribution of evicted sample via reverse Welford.
            int n = window.size();
            double oldMean = (mean * (n + 1) - evicted) / Math.max(n, 1);
            m2 -= (evicted - oldMean) * (evicted - mean);
            mean = oldMean;
        }
        window.addLast(deltaMs);
        int n = window.size();
        double delta = deltaMs - mean;
        mean += delta / n;
        m2 += delta * (deltaMs - mean);
    }
}

Complexity

Key design decisions & trade-offs

Common pitfalls

Interview follow-ups

Recommended reading

Related