CAP under Partition
3-node cluster, partition + heal. CP rejects minority writes; AP diverges.
This interactive explanation is built for system design interview prep: step through CAP under Partition, watch the internal state change, and connect the concept to real distributed-system trade-offs.
Overview
CAP is the theorem that forces every distributed system to pick a side during a network partition. Eric Brewer's 2000 conjecture, formalized by Gilbert and Lynch in 2002, says a system cannot simultaneously guarantee Consistency (every read sees the most recent write), Availability (every request gets a non-error response), and Partition tolerance (the system keeps working when messages between nodes are dropped). Because partitions are a fact of physics, not a design choice, real systems get to pick only between CP (refuse requests on the minority side) and AP (accept writes on both sides, reconcile later). The theorem is often misunderstood as a three-way trade-off; it is really a binary choice made at partition time. When the network is healthy, both CP and AP systems can be consistent and available. CAP only matters once packets start dropping. Understanding CAP is the difference between picking a database that loses data and picking one that refuses to accept it.
How it works
Every distributed write needs to reach a quorum of replicas to be considered durable. When the network splits the cluster into two groups that cannot see each other, one of two things must happen to each partition. A CP system (ZooKeeper, etcd, HBase) detects the partition and allows only the majority side to accept writes; the minority side returns errors. This preserves linearizability — every successful write is visible to every subsequent successful read — at the cost of rejecting requests routed to the minority. The detection mechanism is usually a lease or a Raft leader whose heartbeat times out. An AP system (Cassandra, DynamoDB, Riak) allows both sides to accept writes, tagging them with timestamps or vector clocks. When the partition heals, the system reconciles divergent writes using last-write-wins, merge functions, or CRDTs. Every request gets an answer, but two clients reading the same key during the partition may see different values, and reconciliation can drop writes silently if two updates collide. The PACELC extension captures the practical reality: in the absence of a partition (Else), systems also trade Latency for Consistency. A strongly consistent read across three regions pays three cross-region RTTs; an eventually consistent read pays one. Most production systems are tunable — Cassandra lets you pick QUORUM or ONE per request — so the CAP label is really a default, not a prison.
Implementation
public class CPStrategy implements KVStore {
private final List<Replica> replicas;
private final int quorum;
private final PartitionDetector detector;
public CPStrategy(List<Replica> replicas, PartitionDetector detector) {
this.replicas = replicas;
this.quorum = replicas.size() / 2 + 1;
this.detector = detector;
}
@Override
public void put(String key, byte[] value) {
if (!detector.hasMajorityQuorum()) {
throw new UnavailableException("Refusing write on minority partition");
}
int acks = 0;
for (Replica r : replicas) {
try { r.write(key, value); acks++; }
catch (IOException ignored) {}
if (acks >= quorum) return;
}
throw new UnavailableException("Quorum not reached, acks=" + acks);
}
@Override
public byte[] get(String key) {
if (!detector.hasMajorityQuorum())
throw new UnavailableException("Refusing read on minority partition");
return replicas.get(0).read(key); // leader-only read
}
}
public class APStrategy implements KVStore {
private final List<Replica> replicas;
private final ConflictResolver resolver; // LWW or CRDT merge
public APStrategy(List<Replica> replicas, ConflictResolver resolver) {
this.replicas = replicas; this.resolver = resolver;
}
@Override
public void put(String key, byte[] value) {
long ts = System.currentTimeMillis();
for (Replica r : replicas) {
try { r.writeTimestamped(key, value, ts); return; }
catch (IOException e) { /* try next */ }
}
throw new UnavailableException("No replica reachable");
}
@Override
public byte[] get(String key) {
List<TimestampedValue> seen = new ArrayList<>();
for (Replica r : replicas) {
try { seen.add(r.readTimestamped(key)); }
catch (IOException ignored) {}
if (!seen.isEmpty()) break; // serve first reachable
}
if (seen.isEmpty()) throw new UnavailableException("All replicas down");
return resolver.merge(seen).value();
}
}
Complexity
- CP write (healthy):
1 cross-node RTT + majority acks - CP write (partitioned minority):
error, O(1) - AP write:
1 RTT to any reachable replica - AP read (conflict):
N RTTs + merge cost
Key design decisions & trade-offs
- CP vs AP default — Chosen: CP for systems of record, AP for user-facing reads. Billing and inventory must refuse uncertain writes; feeds and timelines must stay responsive even during partition.
- Quorum size — Chosen: Majority (N/2 + 1). Smallest quorum that guarantees no two disjoint majorities exist simultaneously.
- Conflict resolution — Chosen: CRDT over LWW when possible. LWW silently drops concurrent writes; CRDTs preserve both by design at the cost of larger payloads.
Common pitfalls
- Assuming a single datacenter is partition-free: rack-level splits happen routinely
- Using LWW with unsynchronized clocks — a wrong clock rewrites history
- Confusing 'eventual consistency' with 'strong eventual consistency'; only the latter guarantees convergence
- Forgetting that read-your-writes requires session stickiness or read quorums in AP systems
Interview follow-ups
- Tune read/write quorums (R + W > N) for stronger consistency on AP stores
- Add hinted handoff for transient partitions
- Model PACELC explicitly: latency-vs-consistency in the non-partition case
- Introduce read-your-writes via session tokens
Recommended reading
- Alex Petrov, Database Internals — storage engines and distributed systems internals.
- Martin Kleppmann, Designing Data-Intensive Applications (DDIA) — data models, replication, partitioning, consistency.
- The System Design Primer — high-level design building blocks.
- Foundational networking + web-security references (TCP/IP, TLS 1.3, OWASP Top 10).