Partitioning Strategies
Hash vs range, hotspots, rebalancing. Ch 6.
This interactive explanation is built for system design interview prep: step through Partitioning Strategies, watch the internal state change, and connect the concept to real distributed-system trade-offs.
Overview
Partitioning is how a distributed database turns one big dataset into many small ones that fit on individual machines, and the choice of partitioning strategy determines whether the system scales gracefully or spends its life fighting hotspots. The two fundamental approaches are hash partitioning, which destroys key locality in exchange for uniform load, and range partitioning, which preserves ordering at the cost of keys clustering onto a few partitions. A third option, consistent hashing, sits between them: it gives hash-style uniformity but minimizes reshuffling when nodes are added or removed. Real systems rarely pick one cleanly — Cassandra ranges hashed tokens; DynamoDB hashes the partition key but range-orders the sort key within it; Postgres lets you declare either. The question the designer must answer before writing any code is: what is the dominant read pattern? Point lookups want hash; range scans want range; time-series want compound partitioning. Pick wrong and you are rebalancing forever.
How it works
Hash partitioning applies a function like MurmurHash or SHA-1 to the partition key, then uses modulo or ring position to assign the key to a partition. The hash spreads keys uniformly regardless of distribution in the input, eliminating the celebrity-user hotspot problem, but it destroys order: a scan over adjacent keys now touches every partition. Range partitioning keeps keys sorted and assigns contiguous ranges to partitions, so scans are cheap and ordered, but any skew in the key distribution (timestamps, alphabetical usernames, auto-increment IDs) produces catastrophic hotspotting on the latest partition. Consistent hashing builds a ring of hash values, places each node at several "virtual" positions on the ring, and assigns each key to the first node clockwise from its hash. Adding a node only moves the keys that fall between the new node's positions and its predecessors', keeping rebalance work proportional to 1/N rather than N/N. Compound strategies layer both: hash the user ID to pick a partition, then range-sort within by timestamp to make "user's recent messages" a single-partition ordered scan. Rebalancing is where partitioning gets hardest — naive mod-N hashing means adding one node moves every key; fixed-partition schemes (10x more partitions than nodes) let you move whole partitions instead of individual keys; dynamic partitioning splits hot ranges on the fly.
Implementation
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
/** Hash a key and mod into a fixed number of partitions. */
public final class HashPartitioner<K> {
private final int partitionCount;
public HashPartitioner(int partitionCount) {
if (partitionCount <= 0) throw new IllegalArgumentException("partitionCount > 0");
this.partitionCount = partitionCount;
}
public int assignPartition(K key) {
int h = stableHash(key.toString());
return Math.floorMod(h, partitionCount);
}
/** Use a stable cross-JVM hash; Object.hashCode is not stable across versions. */
private static int stableHash(String s) {
try {
MessageDigest md = MessageDigest.getInstance("MD5");
byte[] d = md.digest(s.getBytes(StandardCharsets.UTF_8));
return ((d[0] & 0xff) << 24) | ((d[1] & 0xff) << 16)
| ((d[2] & 0xff) << 8) | (d[3] & 0xff);
} catch (NoSuchAlgorithmException e) {
throw new IllegalStateException(e);
}
}
public int partitionCount() { return partitionCount; }
}
import java.util.*;
/** Assign keys to partitions based on sorted boundary splits. */
public final class RangePartitioner<K extends Comparable<K>> {
private final List<K> splitPoints; // sorted; partition i covers [splits[i-1], splits[i])
public RangePartitioner(List<K> splitPoints) {
List<K> copy = new ArrayList<>(splitPoints);
Collections.sort(copy);
for (int i = 1; i < copy.size(); i++) {
if (copy.get(i - 1).compareTo(copy.get(i)) == 0)
throw new IllegalArgumentException("duplicate split point");
}
this.splitPoints = List.copyOf(copy);
}
/** Partition id in [0, splitPoints.size()]. */
public int assignPartition(K key) {
int idx = Collections.binarySearch(splitPoints, key);
// binarySearch returns -(insertion_point) - 1 when key is not found.
return idx >= 0 ? idx + 1 : -idx - 1;
}
/** Scan across a key range — only the covering partitions need be queried. */
public List<Integer> partitionsForRange(K fromInclusive, K toExclusive) {
int lo = assignPartition(fromInclusive);
int hi = assignPartition(toExclusive);
List<Integer> out = new ArrayList<>();
for (int i = lo; i <= hi; i++) out.add(i);
return out;
}
public int partitionCount() { return splitPoints.size() + 1; }
}
Complexity
- Hash assignPartition:
O(1) with a constant-time hash - Range assignPartition:
O(log P) binary search over split points - Rebalance on mod-N hashing:
O(N/K) keys moved when cluster grows by one - Rebalance with consistent hashing:
O(K/N) keys moved per node change - Range scan over [a,b]:
O(partitions_touched) — 1 partition in range, all partitions in hash
Key design decisions & trade-offs
- Hash vs range partitioning — Chosen: Hash when load uniformity matters more than scan locality. Hash kills hotspots but also kills ordered scans. Range preserves scans but needs careful split-point management to avoid timestamp-style hotspots. Compound partitioning gets both when queries are known ahead of time.
- Mod-N vs consistent hashing — Chosen: Consistent hashing for any system that adds or removes nodes. Mod-N rehashing moves ~N/K keys for every topology change, which is catastrophic at scale. Consistent hashing moves only the share owned by the changed node, letting clusters grow online.
- Fixed vs dynamic partitions — Chosen: Fixed-count partitions (e.g. 256) with many per node. Fixed partitions are simple: scale by moving whole partitions, not splitting ranges. Dynamic splitting handles skewed key distributions better but adds a coordinator and makes replication/rebalance logic harder.
- Secondary indexes — Chosen: Local (partition-scoped) indexes by default, global indexes only when required. A global secondary index is itself a partitioned table with independent distribution, which means extra coordination on writes. Local indexes cost a scatter-gather on lookup but keep writes cheap.
Common pitfalls
- Partitioning by timestamp or auto-increment ID; all writes go to the latest partition and obliterate one node
- Hashing without a stable cross-JVM hash function (Object.hashCode) — the same key lands on different partitions in different language versions
- Choosing a user-chosen string as partition key and not considering that one user can produce 1000x the traffic of another
- Splitting a range partition down the middle of a hot key range, only to discover the hotspot is a single key that cannot be split
- Forgetting that adding a global secondary index means every write now touches N+1 partitions
Interview follow-ups
- Implement consistent hashing with virtual nodes and measure key movement on cluster growth
- Design a compound partitioning scheme for time-series data: hash(sensor_id) then range(timestamp)
- Handle a celebrity-key hotspot by introducing a random suffix on write and scatter-gather on read
- Rebalance 256 fixed partitions across a cluster growing from 4 to 16 nodes with zero downtime
Recommended reading
- Alex Petrov, Database Internals — storage engines and distributed systems internals.
- Martin Kleppmann, Designing Data-Intensive Applications (DDIA) — data models, replication, partitioning, consistency.
- The System Design Primer — high-level design building blocks.
- Foundational networking + web-security references (TCP/IP, TLS 1.3, OWASP Top 10).