CRDTs

By Rahul Kumar · Senior Software Engineer · Updated May 2026 · Category: Kleppmann · Designing Data-Intensive Applications

G-Counter, PN-Counter, OR-Set merging across replicas. Ch 9.

This interactive explanation is built for system design interview prep: step through CRDTs, watch the internal state change, and connect the concept to real distributed-system trade-offs.

Overview

Conflict-free Replicated Data Types (CRDTs) are data structures designed so that replicas can accept concurrent updates without coordination and still converge to the same state once they exchange messages. They solve the nightmare of multi-writer replication: two nodes accept the same logical operation at the same time, and the system must reconcile them without losing either. Traditional last-write-wins loses data; vector clocks detect conflicts but leave resolution to the application. CRDTs sidestep the problem by designing operations that are commutative, associative, and idempotent, so the merge function is mathematically guaranteed to produce the same result regardless of order. The two main families are state-based (CvRDTs), which ship full state and merge with a lattice join, and operation-based (CmRDTs), which ship operations over a reliable broadcast. Riak, Redis, Automerge, and every collaborative editor you have used ship CRDTs under the hood.

CRDTs — Interactive Simulator

Runs fully client-side in your browser; no sign-up. Or open full screen →

Launch the interactive CRDTs widget — step through the algorithm or protocol and observe the internal state updating in real time.

How it works

The core invariant is that merge must be a join over a semi-lattice: commutative (A merge B = B merge A), associative ((A merge B) merge C = A merge (B merge C)), and idempotent (A merge A = A). A G-Counter (grow-only) achieves this by giving each replica its own counter slot; increments only touch the local slot, and merge takes the element-wise maximum. Since max is commutative, associative, and idempotent, the lattice works. PN-Counters combine two G-Counters, one for increments and one for decrements, and the value is their difference. Sets are harder because removal must not be undone by a concurrent add. OR-Sets (Observed-Remove) tag every add with a unique identifier; a remove only removes the specific tagged instances it has observed, so a concurrent add with a fresh tag survives. The cost is tombstones: removed tags must be remembered forever, or at least until all replicas have seen the remove, otherwise a late-arriving concurrent add would "come back from the dead." In practice CRDTs pair well with anti-entropy protocols (Merkle trees, gossip) that periodically synchronize replicas. They don't give you linearizability and they don't give you transactions; what they give you is partition-tolerance plus eventual convergence without conflict callbacks.

Implementation

GCounter: grow-only counter with per-node slots

import java.util.*;

/** Grow-only counter. Increments are local; merge is per-node max. */
public final class GCounter {
    private final String nodeId;
    private final Map<String, Long> counts = new HashMap<>();

    public GCounter(String nodeId) {
        this.nodeId = Objects.requireNonNull(nodeId);
        counts.put(nodeId, 0L);
    }

    /** Increment by a positive amount. Only the local slot is touched. */
    public void increment(long delta) {
        if (delta < 0) throw new IllegalArgumentException("use PNCounter for decrements");
        counts.merge(nodeId, delta, Long::sum);
    }

    /** Value is the sum across all node slots. */
    public long value() {
        long total = 0;
        for (long v : counts.values()) total += v;
        return total;
    }

    /** Merge: element-wise max. Commutative, associative, idempotent. */
    public void merge(GCounter other) {
        for (var e : other.counts.entrySet()) {
            counts.merge(e.getKey(), e.getValue(), Math::max);
        }
    }

    public Map<String, Long> snapshot() { return Map.copyOf(counts); }
}

PNCounter: increments and decrements via two GCounters

/** Positive-Negative counter: two GCounters, one for ++ and one for --. */
public final class PNCounter {
    private final GCounter positive;
    private final GCounter negative;

    public PNCounter(String nodeId) {
        this.positive = new GCounter(nodeId);
        this.negative = new GCounter(nodeId);
    }

    public void increment(long delta) {
        if (delta < 0) { decrement(-delta); return; }
        positive.increment(delta);
    }

    public void decrement(long delta) {
        if (delta < 0) throw new IllegalArgumentException("delta must be non-negative");
        negative.increment(delta);
    }

    public long value() {
        return positive.value() - negative.value();
    }

    /** Merge merges both halves; still a lattice join. */
    public void merge(PNCounter other) {
        positive.merge(other.positive);
        negative.merge(other.negative);
    }
}

ORSet: Observed-Remove Set with unique add tags

import java.util.*;

/** Observed-Remove Set. Adds are tagged with UUIDs; removes carry the tags
 *  they have observed, so a concurrent add with a fresh tag survives. */
public final class ORSet<T> {
    private final Map<T, Set<UUID>> adds = new HashMap<>();    // element -> live add tags
    private final Map<T, Set<UUID>> tombstones = new HashMap<>(); // element -> removed tags

    /** Add returns the tag used so the caller can later target this specific add. */
    public UUID add(T element) {
        UUID tag = UUID.randomUUID();
        adds.computeIfAbsent(element, k -> new HashSet<>()).add(tag);
        return tag;
    }

    /** Remove all add-tags for element we have currently observed. */
    public void remove(T element) {
        Set<UUID> observed = adds.getOrDefault(element, Set.of());
        if (observed.isEmpty()) return;
        tombstones.computeIfAbsent(element, k -> new HashSet<>()).addAll(observed);
    }

    /** Element is present iff it has at least one add tag not covered by a tombstone. */
    public boolean contains(T element) {
        Set<UUID> live = new HashSet<>(adds.getOrDefault(element, Set.of()));
        live.removeAll(tombstones.getOrDefault(element, Set.of()));
        return !live.isEmpty();
    }

    /** Merge unions add-sets and tombstone-sets per element. */
    public void merge(ORSet<T> other) {
        for (var e : other.adds.entrySet()) {
            adds.computeIfAbsent(e.getKey(), k -> new HashSet<>()).addAll(e.getValue());
        }
        for (var e : other.tombstones.entrySet()) {
            tombstones.computeIfAbsent(e.getKey(), k -> new HashSet<>()).addAll(e.getValue());
        }
    }

    public Set<T> elements() {
        Set<T> out = new HashSet<>();
        for (T e : adds.keySet()) if (contains(e)) out.add(e);
        return out;
    }
}

Complexity

GCounter increment: O(1)
GCounter merge: O(N) where N is number of replicas
ORSet add: O(1)
ORSet merge: O(|adds| + |tombstones|)
ORSet contains: O(|add_tags| + |tombstone_tags|) for the element
Tombstone garbage collection: O(N * K) with coordination across N replicas

Key design decisions & trade-offs

State-based vs operation-based CRDTs — Chosen: State-based when transport is best-effort, operation-based when it is reliable. CvRDTs ship whole state so a lost message is no problem — the next sync catches up. CmRDTs ship deltas which are smaller but require exactly-once, causally-ordered delivery, which is expensive to build correctly.
Tombstone retention — Chosen: Keep tombstones until a causal stability threshold. Garbage-collecting tombstones too early can resurrect deleted items if a concurrent add shows up late. A causal stability algorithm — all replicas have observed the remove — lets you prune safely.
CRDT vs last-write-wins — Chosen: CRDT where concurrent updates are common and semantically meaningful. LWW loses data silently: two concurrent shopping-cart adds become one. CRDTs preserve both. For fields where one writer is dominant (user profile), LWW with a session clock is cheaper and fine.
Metadata overhead — Chosen: Accept O(replicas) metadata for counters and O(ops) for sets. The per-node slot in a GCounter grows with cluster size; the OR-Set tag list grows with add/remove history. For sets you often delta-compress and prune; for counters the slot count is usually small and bounded.

Common pitfalls

Writing a "CRDT" whose merge is not idempotent; applying the same delta twice during gossip retransmit corrupts state
Using wall-clock timestamps as tie-breakers and discovering that clock skew reorders updates
Pruning tombstones on a single replica without coordination — a lagging peer's concurrent add resurrects deleted items
Shipping an LWW-Set and calling it a CRDT; concurrent add and remove of the same element can drop the add
Using CRDTs for operations that are inherently non-commutative (e.g. "prepend to list"); you need a sequence CRDT like RGA or LSEQ, not a plain set

Interview follow-ups

Implement a delta-state CRDT that ships only incremental changes, not the full state
Design a causal stability detector so tombstones can be safely garbage-collected
Build a collaborative document using an RGA sequence CRDT and measure convergence time
Compare CRDT merge latency with a Raft-based strongly consistent counter under partition