CRDTs
G-Counter, PN-Counter, OR-Set merging across replicas. Ch 9.
This interactive explanation is built for system design interview prep: step through CRDTs, watch the internal state change, and connect the concept to real distributed-system trade-offs.
Overview
Conflict-free Replicated Data Types (CRDTs) are data structures designed so that replicas can accept concurrent updates without coordination and still converge to the same state once they exchange messages. They solve the nightmare of multi-writer replication: two nodes accept the same logical operation at the same time, and the system must reconcile them without losing either. Traditional last-write-wins loses data; vector clocks detect conflicts but leave resolution to the application. CRDTs sidestep the problem by designing operations that are commutative, associative, and idempotent, so the merge function is mathematically guaranteed to produce the same result regardless of order. The two main families are state-based (CvRDTs), which ship full state and merge with a lattice join, and operation-based (CmRDTs), which ship operations over a reliable broadcast. Riak, Redis, Automerge, and every collaborative editor you have used ship CRDTs under the hood.
How it works
The core invariant is that merge must be a join over a semi-lattice: commutative (A merge B = B merge A), associative ((A merge B) merge C = A merge (B merge C)), and idempotent (A merge A = A). A G-Counter (grow-only) achieves this by giving each replica its own counter slot; increments only touch the local slot, and merge takes the element-wise maximum. Since max is commutative, associative, and idempotent, the lattice works. PN-Counters combine two G-Counters, one for increments and one for decrements, and the value is their difference. Sets are harder because removal must not be undone by a concurrent add. OR-Sets (Observed-Remove) tag every add with a unique identifier; a remove only removes the specific tagged instances it has observed, so a concurrent add with a fresh tag survives. The cost is tombstones: removed tags must be remembered forever, or at least until all replicas have seen the remove, otherwise a late-arriving concurrent add would "come back from the dead." In practice CRDTs pair well with anti-entropy protocols (Merkle trees, gossip) that periodically synchronize replicas. They don't give you linearizability and they don't give you transactions; what they give you is partition-tolerance plus eventual convergence without conflict callbacks.
Implementation
import java.util.*;
/** Grow-only counter. Increments are local; merge is per-node max. */
public final class GCounter {
private final String nodeId;
private final Map<String, Long> counts = new HashMap<>();
public GCounter(String nodeId) {
this.nodeId = Objects.requireNonNull(nodeId);
counts.put(nodeId, 0L);
}
/** Increment by a positive amount. Only the local slot is touched. */
public void increment(long delta) {
if (delta < 0) throw new IllegalArgumentException("use PNCounter for decrements");
counts.merge(nodeId, delta, Long::sum);
}
/** Value is the sum across all node slots. */
public long value() {
long total = 0;
for (long v : counts.values()) total += v;
return total;
}
/** Merge: element-wise max. Commutative, associative, idempotent. */
public void merge(GCounter other) {
for (var e : other.counts.entrySet()) {
counts.merge(e.getKey(), e.getValue(), Math::max);
}
}
public Map<String, Long> snapshot() { return Map.copyOf(counts); }
}
/** Positive-Negative counter: two GCounters, one for ++ and one for --. */
public final class PNCounter {
private final GCounter positive;
private final GCounter negative;
public PNCounter(String nodeId) {
this.positive = new GCounter(nodeId);
this.negative = new GCounter(nodeId);
}
public void increment(long delta) {
if (delta < 0) { decrement(-delta); return; }
positive.increment(delta);
}
public void decrement(long delta) {
if (delta < 0) throw new IllegalArgumentException("delta must be non-negative");
negative.increment(delta);
}
public long value() {
return positive.value() - negative.value();
}
/** Merge merges both halves; still a lattice join. */
public void merge(PNCounter other) {
positive.merge(other.positive);
negative.merge(other.negative);
}
}
import java.util.*;
/** Observed-Remove Set. Adds are tagged with UUIDs; removes carry the tags
* they have observed, so a concurrent add with a fresh tag survives. */
public final class ORSet<T> {
private final Map<T, Set<UUID>> adds = new HashMap<>(); // element -> live add tags
private final Map<T, Set<UUID>> tombstones = new HashMap<>(); // element -> removed tags
/** Add returns the tag used so the caller can later target this specific add. */
public UUID add(T element) {
UUID tag = UUID.randomUUID();
adds.computeIfAbsent(element, k -> new HashSet<>()).add(tag);
return tag;
}
/** Remove all add-tags for element we have currently observed. */
public void remove(T element) {
Set<UUID> observed = adds.getOrDefault(element, Set.of());
if (observed.isEmpty()) return;
tombstones.computeIfAbsent(element, k -> new HashSet<>()).addAll(observed);
}
/** Element is present iff it has at least one add tag not covered by a tombstone. */
public boolean contains(T element) {
Set<UUID> live = new HashSet<>(adds.getOrDefault(element, Set.of()));
live.removeAll(tombstones.getOrDefault(element, Set.of()));
return !live.isEmpty();
}
/** Merge unions add-sets and tombstone-sets per element. */
public void merge(ORSet<T> other) {
for (var e : other.adds.entrySet()) {
adds.computeIfAbsent(e.getKey(), k -> new HashSet<>()).addAll(e.getValue());
}
for (var e : other.tombstones.entrySet()) {
tombstones.computeIfAbsent(e.getKey(), k -> new HashSet<>()).addAll(e.getValue());
}
}
public Set<T> elements() {
Set<T> out = new HashSet<>();
for (T e : adds.keySet()) if (contains(e)) out.add(e);
return out;
}
}
Complexity
- GCounter increment:
O(1) - GCounter merge:
O(N) where N is number of replicas - ORSet add:
O(1) - ORSet merge:
O(|adds| + |tombstones|) - ORSet contains:
O(|add_tags| + |tombstone_tags|) for the element - Tombstone garbage collection:
O(N * K) with coordination across N replicas
Key design decisions & trade-offs
- State-based vs operation-based CRDTs — Chosen: State-based when transport is best-effort, operation-based when it is reliable. CvRDTs ship whole state so a lost message is no problem — the next sync catches up. CmRDTs ship deltas which are smaller but require exactly-once, causally-ordered delivery, which is expensive to build correctly.
- Tombstone retention — Chosen: Keep tombstones until a causal stability threshold. Garbage-collecting tombstones too early can resurrect deleted items if a concurrent add shows up late. A causal stability algorithm — all replicas have observed the remove — lets you prune safely.
- CRDT vs last-write-wins — Chosen: CRDT where concurrent updates are common and semantically meaningful. LWW loses data silently: two concurrent shopping-cart adds become one. CRDTs preserve both. For fields where one writer is dominant (user profile), LWW with a session clock is cheaper and fine.
- Metadata overhead — Chosen: Accept O(replicas) metadata for counters and O(ops) for sets. The per-node slot in a GCounter grows with cluster size; the OR-Set tag list grows with add/remove history. For sets you often delta-compress and prune; for counters the slot count is usually small and bounded.
Common pitfalls
- Writing a "CRDT" whose merge is not idempotent; applying the same delta twice during gossip retransmit corrupts state
- Using wall-clock timestamps as tie-breakers and discovering that clock skew reorders updates
- Pruning tombstones on a single replica without coordination — a lagging peer's concurrent add resurrects deleted items
- Shipping an LWW-Set and calling it a CRDT; concurrent add and remove of the same element can drop the add
- Using CRDTs for operations that are inherently non-commutative (e.g. "prepend to list"); you need a sequence CRDT like RGA or LSEQ, not a plain set
Interview follow-ups
- Implement a delta-state CRDT that ships only incremental changes, not the full state
- Design a causal stability detector so tombstones can be safely garbage-collected
- Build a collaborative document using an RGA sequence CRDT and measure convergence time
- Compare CRDT merge latency with a Raft-based strongly consistent counter under partition
Recommended reading
- Alex Petrov, Database Internals — storage engines and distributed systems internals.
- Martin Kleppmann, Designing Data-Intensive Applications (DDIA) — data models, replication, partitioning, consistency.
- The System Design Primer — high-level design building blocks.
- Foundational networking + web-security references (TCP/IP, TLS 1.3, OWASP Top 10).