Replication Anomalies

By Rahul Kumar · Senior Software Engineer · Updated May 2026 · Category: Kleppmann · Designing Data-Intensive Applications

Read-your-writes, monotonic, consistent prefix under lag. Ch 5.

This interactive explanation is built for system design interview prep: step through Replication Anomalies, watch the internal state change, and connect the concept to real distributed-system trade-offs.

Overview

Replication is what gives a distributed database its durability and read throughput, but the moment you allow a client to read from any replica you inherit a catalogue of anomalies that do not exist on a single node. The four canonical ones Kleppmann enumerates are read-your-writes violations (you write, you immediately read, and see your old value), monotonic-read violations (successive reads go backwards in time as you hop replicas), consistent-prefix violations (you see effects before causes), and stale reads (you see data committed minutes ago on another replica). These anomalies are not exotic: they happen on every asynchronously replicated system, from MySQL primary-replica to geo-distributed Cassandra. The bug rarely shows up in testing because local latency hides it; it shows up in production when a user posts a comment, refreshes the page, and sees their own comment vanish. The fix is always a form of session consistency: pin reads to a replica that has seen your writes.

Replication Anomalies — Interactive Simulator

Runs fully client-side in your browser; no sign-up. Or open full screen →

Launch the interactive Replication Anomalies widget — step through the algorithm or protocol and observe the internal state updating in real time.

How it works

In a typical leader-follower setup, writes land on the leader and propagate asynchronously to N followers. Replication lag varies per follower: one might be 10 ms behind, another might be 5 seconds behind after a GC pause. A client whose requests are load-balanced round-robin across followers will hit different lags on successive reads, producing non-monotonic behaviour. Monotonic-reads says: once you have seen a version at timestamp T, you should never again see a version earlier than T. The standard fix is session stickiness — route all reads from one session to one replica, usually by hashing the user ID to a follower. Read-your-writes is stronger: after a successful write, reads in that session must observe that write. Solutions include routing reads for a short window to the leader, passing a write-token (the last observed LSN) with the read and having the follower wait until it has replicated past it, or always reading from the leader for the user's own timeline. Consistent-prefix matters when writes are causally ordered (comment A replies to comment B): if replicas apply writes in different orders, readers can see A without B. Vector clocks or logical timestamps can be attached to writes so followers apply them in a valid causal order. None of these fixes give linearizability; they give session-scoped guarantees at a fraction of the cost.

Implementation

ReplicaClient: demonstrates monotonic-reads breaking across replicas

import java.util.*;
import java.util.concurrent.ThreadLocalRandom;

/** Three replicas with different lags behind the leader. */
public class ReplicaClient {
    static final class Replica {
        final String name;
        final long lagMs;              // how far behind the leader
        final NavigableMap<Long, String> log = new TreeMap<>(); // applyTime -> value
        Replica(String name, long lagMs) { this.name = name; this.lagMs = lagMs; }

        void applyFromLeader(long leaderWriteTime, String value) {
            log.put(leaderWriteTime + lagMs, value);
        }

        /** Value visible to a reader at wallClockNow. */
        String read(long wallClockNow) {
            Map.Entry<Long, String> e = log.floorEntry(wallClockNow);
            return e == null ? null : e.getValue();
        }
    }

    private final Replica leader = new Replica("leader", 0);
    private final List<Replica> followers = List.of(
        new Replica("follower-a", 200),
        new Replica("follower-b", 1500),   // slow follower — big lag
        new Replica("follower-c", 400)
    );

    public void write(String value, long now) {
        leader.log.put(now, value);
        for (Replica f : followers) f.applyFromLeader(now, value);
    }

    /** Bug: round-robin reads across followers with different lags. */
    public String readRoundRobin(long now) {
        Replica f = followers.get(ThreadLocalRandom.current().nextInt(followers.size()));
        return f.read(now);
    }

    public static void demoAnomaly() {
        ReplicaClient c = new ReplicaClient();
        long t0 = 0;
        c.write("v1", t0);
        c.write("v2", t0 + 1000);
        // At t=t0+1800, follower-a and follower-c have v2, but follower-b still shows v1.
        // Successive round-robin reads can return v2 then v1 then v2 — non-monotonic.
        for (int i = 0; i < 5; i++) {
            System.out.println("read -> " + c.readRoundRobin(1800));
        }
    }
}

Session stickiness: pin reads to one replica per user

import java.util.*;

/** Routes each session to a deterministic replica, giving monotonic reads for free. */
public class StickyReplicaRouter {
    private final List<ReplicaClient.Replica> replicas;
    private final ReplicaClient.Replica leader;

    public StickyReplicaRouter(ReplicaClient.Replica leader,
                               List<ReplicaClient.Replica> followers) {
        this.leader = leader;
        this.replicas = List.copyOf(followers);
    }

    /** Consistent hash: same userId -> same replica every time. */
    private ReplicaClient.Replica pick(String userId) {
        int h = Math.floorMod(userId.hashCode(), replicas.size());
        return replicas.get(h);
    }

    /** Monotonic reads: one user, one replica. */
    public String read(String userId, long now) {
        return pick(userId).read(now);
    }

    /** Read-your-writes: for a short window after a write, read from leader. */
    public String readYourWrites(String userId, long lastWriteTime, long now) {
        long freshnessWindowMs = 2_000;
        if (now - lastWriteTime < freshnessWindowMs) {
            return leader.read(now);
        }
        return pick(userId).read(now);
    }

    /** Bounded-staleness variant: wait until chosen replica has caught up past LSN. */
    public String readAfterLsn(String userId, long requiredLsn, long now) {
        ReplicaClient.Replica r = pick(userId);
        // In real code we would poll/notify until r.appliedLsn >= requiredLsn;
        // if the replica is too far behind, fall back to the leader.
        long appliedLsn = r.log.isEmpty() ? -1 : r.log.lastKey();
        return appliedLsn >= requiredLsn ? r.read(now) : leader.read(now);
    }
}

Complexity

Monotonic reads with sticky routing: O(1) routing cost, zero replication change
Read-your-writes via leader reads: O(1) but concentrates load on leader
Read-your-writes via LSN token: O(W) wait time where W is replication lag
Consistent-prefix with causal tokens: O(E) per event with vector clock size E
Worst-case staleness: bounded by max follower lag

Key design decisions & trade-offs

Where to enforce session consistency — Chosen: Sticky routing at the load balancer vs LSN tokens passed by client. Sticky routing is transparent to the client but breaks when the chosen follower crashes or when the user changes IPs. LSN tokens are correct across failovers but require every client and API boundary to carry them.
Read-your-writes scope — Chosen: Apply only to the writing user's own timeline. A user must see their own post immediately, but does not notice if someone else's post takes 500 ms to appear. Scoping the guarantee narrowly lets most reads stay cheap while still fixing the visible UX bug.
Leader reads as a fallback — Chosen: Route to leader for a freshness window after writes. Simple, correct, and capacity-bounded. The cost is a short spike of leader reads after every write; acceptable when the write rate is far below the read rate.
Causal order preservation — Chosen: Use logical timestamps or vector clocks at the replication layer. Applying writes in commit order on each follower prevents consistent-prefix anomalies. Vector clocks cost a few bytes per write but are the only general solution across unrelated writes with causal dependencies.

Common pitfalls

Testing locally where replication lag is sub-millisecond and concluding the system is fine
Sticky-routing by client IP instead of user ID; a NAT rebalance kicks every user to a fresh replica and breaks monotonic reads
Treating "eventual consistency" as "a few seconds stale" when GC pauses and network partitions can push a follower minutes behind
Forgetting to carry the causality token across service boundaries; a microservice call strips the LSN and the anomaly returns
Reading from any replica during failover, when the new leader may itself be behind the old one

Interview follow-ups

Implement LSN-token-based read-after-write over an HTTP gateway with automatic fallback to leader
Measure actual replication lag distribution in production and set SLOs per-percentile
Extend the monotonic-reads demo to handle a replica crashing and a new one being selected
Combine session stickiness with bounded staleness to give explicit consistency knobs per endpoint