MapReduce
Word-count: map → animated shuffle → reduce. Ch 10.
This interactive explanation is built for system design interview prep: step through MapReduce, watch the internal state change, and connect the concept to real distributed-system trade-offs.
Overview
MapReduce is the programming model that made batch processing on thousand-node clusters feel boring. You write two functions — map and reduce — and the framework handles partitioning, shuffling, restarting failed tasks, and scheduling across a cluster. Google's 2004 paper codified what had been ad-hoc parallelism into a shape every large data system has copied: Hadoop, Spark, Flink batch mode, BigQuery's Dremel execution, and a hundred internal systems. The map phase parses input records and emits key-value pairs. The framework shuffles those pairs so that all values for a given key end up together. The reduce phase collapses each key's group into a final output. The beauty of the model is that both phases are deterministic and side-effect-free, so any failed task can be retried safely. The limitation is that each job reads from and writes to durable storage, so chains of jobs pay I/O for every stage — which is why Spark's in-memory DAG replaced it for iterative workloads.
How it works
A MapReduce job starts with an input split: the source data is divided into roughly 64-128MB chunks that can be processed independently. Each map task reads its split, invokes the user's map() on each record, and writes emitted key-value pairs to a local spill file partitioned by the target reducer (usually hash(key) mod R). This is where data locality matters: the scheduler tries to place map tasks on the node holding the input data so map is a local disk read. Once mappers finish, the shuffle phase begins: each reducer pulls its partition from every mapper over the network, merges the streams, and sorts by key so the reduce function sees all values for a key grouped together. The shuffle is the expensive part — it is the only all-to-all communication — and its cost is why combiner functions (a local pre-reduce on each mapper) exist: they shrink the data that has to cross the network. Reduce tasks run the user's reduce() once per key, emit output records, and write them to the distributed file system. Failure handling is trivial because of the functional shape: if a map task dies, re-run it on another node; if a reduce task dies, re-pull the shuffle data and re-run. The framework takes care of idempotency as long as the user functions are pure.
Implementation
import java.util.*;
/** Minimal MapReduce API — one job, one key-value space. */
public interface Mapper<KI, VI, KO, VO> {
void map(KI key, VI value, EmitContext<KO, VO> ctx);
}
public interface Reducer<K, V, KO, VO> {
void reduce(K key, Iterable<V> values, EmitContext<KO, VO> ctx);
}
/** Framework-provided sink that map/reduce uses to emit records. */
public final class EmitContext<K, V> {
private final List<Map.Entry<K, V>> buffer = new ArrayList<>();
public void emit(K key, V value) { buffer.add(Map.entry(key, value)); }
public List<Map.Entry<K, V>> drain() { return buffer; }
}
/** Split each line into words and emit (word, 1). */
public final class WordCountMapper
implements Mapper<Long, String, String, Integer> {
@Override
public void map(Long lineNumber, String line, EmitContext<String, Integer> ctx) {
for (String tok : line.toLowerCase().split("[^a-z0-9']+")) {
if (tok.isEmpty()) continue;
ctx.emit(tok, 1);
}
}
}
/** Sum the counts for each word. */
public final class WordCountReducer
implements Reducer<String, Integer, String, Long> {
@Override
public void reduce(String word, Iterable<Integer> values,
EmitContext<String, Long> ctx) {
long sum = 0;
for (int v : values) sum += v;
ctx.emit(word, sum);
}
}
import java.util.*;
import java.util.stream.*;
/** Tiny single-process driver: good for tests, not for a petabyte. */
public final class Driver {
public static <K, V, KO, VO> List<Map.Entry<KO, VO>> run(
List<Map.Entry<K, V>> input,
Mapper<K, V, KO, ?> mapper,
Reducer<KO, Object, KO, VO> reducer) {
// 1. MAP phase — run user map() on every input record.
EmitContext<KO, Object> mapOut = new EmitContext<>();
for (var rec : input) {
@SuppressWarnings({"rawtypes", "unchecked"})
Mapper raw = mapper;
raw.map(rec.getKey(), rec.getValue(), mapOut);
}
// 2. SHUFFLE phase — group map output by key.
Map<KO, List<Object>> grouped = mapOut.drain().stream()
.collect(Collectors.groupingBy(
Map.Entry::getKey,
Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
// 3. REDUCE phase — one call per distinct key.
EmitContext<KO, VO> redOut = new EmitContext<>();
for (var e : grouped.entrySet()) {
reducer.reduce(e.getKey(), e.getValue(), redOut);
}
return redOut.drain();
}
public static void main(String[] args) {
List<Map.Entry<Long, String>> input = List.of(
Map.entry(1L, "the quick brown fox"),
Map.entry(2L, "the lazy dog sleeps"),
Map.entry(3L, "the fox and the dog"));
var out = run(input, new WordCountMapper(),
(Reducer) new WordCountReducer());
out.forEach(e -> System.out.println(e.getKey() + "=" + e.getValue()));
}
}
Complexity
- Map phase:
O(N) where N is input record count - Shuffle phase:
O(N log N) for sort-merge by key - Reduce phase:
O(K * avg_group_size) where K is distinct keys - Network cost:
O(map_output_size) all-to-all transfer - Retry on failure:
O(task_size) per failed task — the rest of the job is preserved
Key design decisions & trade-offs
- Disk-based shuffle vs in-memory DAG — Chosen: Disk for bulletproof batch, memory (Spark) for iterative. MapReduce writes shuffle data to disk so any task can be re-run cheaply; this is why it is so failure-tolerant on cheap hardware. Spark keeps the data in memory between stages and re-computes lineage on failure, winning on latency but losing when memory is scarce.
- Combiner functions — Chosen: Run a pre-reduce on each mapper before the shuffle. A combiner shrinks the wire data by orders of magnitude for aggregations (count, sum, max). It is only safe when reduce is associative and commutative and the combiner's output type equals reduce's input type.
- Partitioner choice — Chosen: Hash partition by key by default; custom partitioner for skewed keys. Hash gives even load across reducers. Skewed keys (hot user, hot word) send one reducer into hours of work while the others finish fast — the infamous "straggler" problem. Salting the key or splitting large groups can fix it.
- Job chaining — Chosen: Build DAGs of jobs with an orchestrator. Real pipelines are 10-50 MapReduce jobs wired together. Running them manually is brittle; Oozie, Airflow, and DAG-native engines (Spark, Flink) handle dependencies, retries, and data lineage for you.
Common pitfalls
- Emitting a non-hashable key from map and watching the shuffle produce nonsense groups
- Relying on reduce input order within a key group; MapReduce does not guarantee value order unless you use secondary sort
- Running a combiner that is not associative; intermediate aggregates silently diverge from the full reduce
- Producing 100M distinct keys and sending them to 2 reducers; everything serializes through the straggler
- Writing state into map() or reduce() and expecting it to persist across retries; the framework will re-run and your counters double
Interview follow-ups
- Implement secondary sort so values arrive at reduce in a deterministic order
- Handle a hot-key straggler by salting the key and running a two-stage reduce
- Compare runtime of WordCount on 1TB as MapReduce vs Spark vs Flink batch
- Extend the driver to run map/reduce in parallel threads with a hash-partitioned shuffle
Recommended reading
- Alex Petrov, Database Internals — storage engines and distributed systems internals.
- Martin Kleppmann, Designing Data-Intensive Applications (DDIA) — data models, replication, partitioning, consistency.
- The System Design Primer — high-level design building blocks.
- Foundational networking + web-security references (TCP/IP, TLS 1.3, OWASP Top 10).