Replication Anomalies
Read-your-writes, monotonic, consistent prefix under lag. Ch 5.
This interactive explanation is built for system design interview prep: step through Replication Anomalies, watch the internal state change, and connect the concept to real distributed-system trade-offs.
Overview
Replication is what gives a distributed database its durability and read throughput, but the moment you allow a client to read from any replica you inherit a catalogue of anomalies that do not exist on a single node. The four canonical ones Kleppmann enumerates are read-your-writes violations (you write, you immediately read, and see your old value), monotonic-read violations (successive reads go backwards in time as you hop replicas), consistent-prefix violations (you see effects before causes), and stale reads (you see data committed minutes ago on another replica). These anomalies are not exotic: they happen on every asynchronously replicated system, from MySQL primary-replica to geo-distributed Cassandra. The bug rarely shows up in testing because local latency hides it; it shows up in production when a user posts a comment, refreshes the page, and sees their own comment vanish. The fix is always a form of session consistency: pin reads to a replica that has seen your writes.
How it works
In a typical leader-follower setup, writes land on the leader and propagate asynchronously to N followers. Replication lag varies per follower: one might be 10 ms behind, another might be 5 seconds behind after a GC pause. A client whose requests are load-balanced round-robin across followers will hit different lags on successive reads, producing non-monotonic behaviour. Monotonic-reads says: once you have seen a version at timestamp T, you should never again see a version earlier than T. The standard fix is session stickiness — route all reads from one session to one replica, usually by hashing the user ID to a follower. Read-your-writes is stronger: after a successful write, reads in that session must observe that write. Solutions include routing reads for a short window to the leader, passing a write-token (the last observed LSN) with the read and having the follower wait until it has replicated past it, or always reading from the leader for the user's own timeline. Consistent-prefix matters when writes are causally ordered (comment A replies to comment B): if replicas apply writes in different orders, readers can see A without B. Vector clocks or logical timestamps can be attached to writes so followers apply them in a valid causal order. None of these fixes give linearizability; they give session-scoped guarantees at a fraction of the cost.
Implementation
import java.util.*;
import java.util.concurrent.ThreadLocalRandom;
/** Three replicas with different lags behind the leader. */
public class ReplicaClient {
static final class Replica {
final String name;
final long lagMs; // how far behind the leader
final NavigableMap<Long, String> log = new TreeMap<>(); // applyTime -> value
Replica(String name, long lagMs) { this.name = name; this.lagMs = lagMs; }
void applyFromLeader(long leaderWriteTime, String value) {
log.put(leaderWriteTime + lagMs, value);
}
/** Value visible to a reader at wallClockNow. */
String read(long wallClockNow) {
Map.Entry<Long, String> e = log.floorEntry(wallClockNow);
return e == null ? null : e.getValue();
}
}
private final Replica leader = new Replica("leader", 0);
private final List<Replica> followers = List.of(
new Replica("follower-a", 200),
new Replica("follower-b", 1500), // slow follower — big lag
new Replica("follower-c", 400)
);
public void write(String value, long now) {
leader.log.put(now, value);
for (Replica f : followers) f.applyFromLeader(now, value);
}
/** Bug: round-robin reads across followers with different lags. */
public String readRoundRobin(long now) {
Replica f = followers.get(ThreadLocalRandom.current().nextInt(followers.size()));
return f.read(now);
}
public static void demoAnomaly() {
ReplicaClient c = new ReplicaClient();
long t0 = 0;
c.write("v1", t0);
c.write("v2", t0 + 1000);
// At t=t0+1800, follower-a and follower-c have v2, but follower-b still shows v1.
// Successive round-robin reads can return v2 then v1 then v2 — non-monotonic.
for (int i = 0; i < 5; i++) {
System.out.println("read -> " + c.readRoundRobin(1800));
}
}
}
import java.util.*;
/** Routes each session to a deterministic replica, giving monotonic reads for free. */
public class StickyReplicaRouter {
private final List<ReplicaClient.Replica> replicas;
private final ReplicaClient.Replica leader;
public StickyReplicaRouter(ReplicaClient.Replica leader,
List<ReplicaClient.Replica> followers) {
this.leader = leader;
this.replicas = List.copyOf(followers);
}
/** Consistent hash: same userId -> same replica every time. */
private ReplicaClient.Replica pick(String userId) {
int h = Math.floorMod(userId.hashCode(), replicas.size());
return replicas.get(h);
}
/** Monotonic reads: one user, one replica. */
public String read(String userId, long now) {
return pick(userId).read(now);
}
/** Read-your-writes: for a short window after a write, read from leader. */
public String readYourWrites(String userId, long lastWriteTime, long now) {
long freshnessWindowMs = 2_000;
if (now - lastWriteTime < freshnessWindowMs) {
return leader.read(now);
}
return pick(userId).read(now);
}
/** Bounded-staleness variant: wait until chosen replica has caught up past LSN. */
public String readAfterLsn(String userId, long requiredLsn, long now) {
ReplicaClient.Replica r = pick(userId);
// In real code we would poll/notify until r.appliedLsn >= requiredLsn;
// if the replica is too far behind, fall back to the leader.
long appliedLsn = r.log.isEmpty() ? -1 : r.log.lastKey();
return appliedLsn >= requiredLsn ? r.read(now) : leader.read(now);
}
}
Complexity
- Monotonic reads with sticky routing:
O(1) routing cost, zero replication change - Read-your-writes via leader reads:
O(1) but concentrates load on leader - Read-your-writes via LSN token:
O(W) wait time where W is replication lag - Consistent-prefix with causal tokens:
O(E) per event with vector clock size E - Worst-case staleness:
bounded by max follower lag
Key design decisions & trade-offs
- Where to enforce session consistency — Chosen: Sticky routing at the load balancer vs LSN tokens passed by client. Sticky routing is transparent to the client but breaks when the chosen follower crashes or when the user changes IPs. LSN tokens are correct across failovers but require every client and API boundary to carry them.
- Read-your-writes scope — Chosen: Apply only to the writing user's own timeline. A user must see their own post immediately, but does not notice if someone else's post takes 500 ms to appear. Scoping the guarantee narrowly lets most reads stay cheap while still fixing the visible UX bug.
- Leader reads as a fallback — Chosen: Route to leader for a freshness window after writes. Simple, correct, and capacity-bounded. The cost is a short spike of leader reads after every write; acceptable when the write rate is far below the read rate.
- Causal order preservation — Chosen: Use logical timestamps or vector clocks at the replication layer. Applying writes in commit order on each follower prevents consistent-prefix anomalies. Vector clocks cost a few bytes per write but are the only general solution across unrelated writes with causal dependencies.
Common pitfalls
- Testing locally where replication lag is sub-millisecond and concluding the system is fine
- Sticky-routing by client IP instead of user ID; a NAT rebalance kicks every user to a fresh replica and breaks monotonic reads
- Treating "eventual consistency" as "a few seconds stale" when GC pauses and network partitions can push a follower minutes behind
- Forgetting to carry the causality token across service boundaries; a microservice call strips the LSN and the anomaly returns
- Reading from any replica during failover, when the new leader may itself be behind the old one
Interview follow-ups
- Implement LSN-token-based read-after-write over an HTTP gateway with automatic fallback to leader
- Measure actual replication lag distribution in production and set SLOs per-percentile
- Extend the monotonic-reads demo to handle a replica crashing and a new one being selected
- Combine session stickiness with bounded staleness to give explicit consistency knobs per endpoint
Recommended reading
- Alex Petrov, Database Internals — storage engines and distributed systems internals.
- Martin Kleppmann, Designing Data-Intensive Applications (DDIA) — data models, replication, partitioning, consistency.
- The System Design Primer — high-level design building blocks.
- Foundational networking + web-security references (TCP/IP, TLS 1.3, OWASP Top 10).