Phi-Accrual Detector

By Rahul Kumar · Senior Software Engineer · Updated May 2026 · Category: Petrov · Distributed Systems (Part II)

Heartbeat stream → suspicion level over time. Ch 9.

This interactive explanation is built for system design interview prep: step through Phi-Accrual Detector, watch the internal state change, and connect the concept to real distributed-system trade-offs.

Overview

A phi-accrual failure detector is the adult answer to the question "is that node down?" It replaces the usual binary timeout ("if we have not heard from you in 5 seconds, you are dead") with a continuous suspicion level phi, computed from the statistical distribution of recent heartbeat inter-arrival times. If a node has been heartbeating every 1 second plus or minus 50ms for an hour, a 2-second gap is already very suspicious. If it has been jittering between 0.5s and 4s, a 2-second gap means nothing. The detector captures exactly that intuition. It was introduced by Hayashibara et al. and is the same one Cassandra and Akka ship in production. The output phi grows without bound as a node goes silent, so services can pick their own threshold (commonly 8 or 12) and trade false positives against detection latency without rewriting the underlying math.

Phi-Accrual Detector — Interactive Simulator

Runs fully client-side in your browser; no sign-up. Or open full screen →

Launch the interactive Phi-Accrual Detector widget — step through the algorithm or protocol and observe the internal state updating in real time.

How it works

Each node maintains a bounded rolling window (typically 1000 samples) of inter-arrival times between heartbeats from a peer. On every heartbeat, the detector records the delta since the previous heartbeat, appends it to the window, and updates the running mean and variance. Modern implementations use Welford's online algorithm, or simply recompute from the window when a new sample pushes an old one out. To compute phi at time T, we first measure the time since the last received heartbeat, then ask: given the historical distribution of inter-arrival times, how surprising is it that we have not received one yet? The paper assumes a normal distribution, so P_later(t) is 1 minus the CDF at t, and phi = -log10(P_later). A phi of 1 means roughly one in ten chance of being a transient delay; phi of 8 is one in one hundred million. The detector never declares a node dead on its own; it just publishes phi continuously, and the failover layer applies a threshold. When a heartbeat finally arrives, the window absorbs the long gap as just another sample, the variance inflates, and phi naturally recalibrates for that link's real behaviour. This adaptivity is what makes it survive GC pauses, WAN jitter, and cross-region links without human tuning per environment.

Implementation

PhiDetector with rolling window and Welford statistics

public class PhiDetector {
    private final int maxSamples;
    private final ArrayDeque<Long> window = new ArrayDeque<>();
    private double mean = 0.0;
    private double m2 = 0.0; // sum of squared deltas from mean
    private long lastHeartbeatNanos = -1L;
    private final long firstHeartbeatEstimateMillis;

    public PhiDetector(int maxSamples, long firstHeartbeatEstimateMillis) {
        this.maxSamples = maxSamples;
        this.firstHeartbeatEstimateMillis = firstHeartbeatEstimateMillis;
    }

    /** Called every time a heartbeat arrives from the monitored peer. */
    public synchronized void onHeartbeat(long nowNanos) {
        if (lastHeartbeatNanos > 0) {
            long deltaMs = (nowNanos - lastHeartbeatNanos) / 1_000_000L;
            addSample(deltaMs);
        }
        lastHeartbeatNanos = nowNanos;
    }

    /** Current suspicion level. Grows without bound as silence stretches. */
    public synchronized double phi(long nowNanos) {
        if (lastHeartbeatNanos < 0 || window.size() < 2) return 0.0;
        long sinceLastMs = (nowNanos - lastHeartbeatNanos) / 1_000_000L;
        double variance = m2 / (window.size() - 1);
        double stdDev = Math.max(Math.sqrt(variance), 1.0); // floor to avoid div-by-zero
        double y = (sinceLastMs - mean) / stdDev;
        // Upper-tail of normal via Chernoff approximation from the paper.
        double e = Math.exp(-y * (1.5976 + 0.070566 * y * y));
        double pLater = sinceLastMs > mean ? e / (1.0 + e) : 1.0 - 1.0 / (1.0 + e);
        return -Math.log10(Math.max(pLater, 1e-300));
    }

    private void addSample(long deltaMs) {
        if (window.size() == maxSamples) {
            long evicted = window.pollFirst();
            // Remove contribution of evicted sample via reverse Welford.
            int n = window.size();
            double oldMean = (mean * (n + 1) - evicted) / Math.max(n, 1);
            m2 -= (evicted - oldMean) * (evicted - mean);
            mean = oldMean;
        }
        window.addLast(deltaMs);
        int n = window.size();
        double delta = deltaMs - mean;
        mean += delta / n;
        m2 += delta * (deltaMs - mean);
    }
}

Complexity

onHeartbeat update: O(1) amortized
phi(now) computation: O(1)
Window storage per peer: O(W) where W is sample count
Cluster-wide monitoring of N peers: O(N) memory, O(N) CPU per heartbeat tick

Key design decisions & trade-offs

Distribution assumption — Chosen: Normal distribution over inter-arrival times. Closed-form CDF and cheap to update online. Real heartbeat distributions are fat-tailed, so the detector over-suspects during long GC pauses, which is usually the safer direction anyway.
Window size — Chosen: Fixed ring of ~1000 samples. Large enough to smooth jitter, small enough to adapt when a link's baseline latency shifts. Unbounded history would make phi insensitive to recent regime changes.
Threshold location — Chosen: Detector publishes phi; caller picks the cutoff. Different layers want different sensitivities. Leader election may trip at phi=8; storage routing might wait for phi=12. Separating the math from the policy is what makes the primitive reusable.

Common pitfalls

Forgetting a floor on stdDev: a perfectly regular stream collapses variance to zero and phi explodes on the first jitter
Feeding the detector wall-clock time instead of monotonic time, making NTP adjustments look like gigantic pauses
Sharing a single window across multiple peers; every link has its own latency distribution and must have its own detector
Treating phi as a boolean by thresholding inside the detector rather than exposing the continuous value

Interview follow-ups

Bootstrap the window with a configured expected interval so cold starts do not trigger instant suspicion
Emit phi as a metric so operators can see flapping links before they cause failovers
Couple phi thresholds with hysteresis so a node does not oscillate between UP and DOWN
Replace the normal-distribution assumption with a log-normal or empirical CDF for fat-tailed WAN links