Huffman Coding Visualization for Coding Interviews
Priority queue merges two least-frequent nodes until one tree remains.
Concept
Huffman coding is the classic greedy algorithm for building an optimal prefix-free binary code given symbol frequencies. It is the compression core inside gzip, zlib, DEFLATE, JPEG's entropy stage, and countless file formats. The elegance is that a purely local greedy rule — repeatedly merge the two least-frequent symbols — yields the globally optimal code, a fact provable by an exchange argument. Output is a binary tree where each leaf is a symbol, and the code for a symbol is the sequence of 0/1 bits on the path from root to leaf. Prefix-freeness (no code is a prefix of another) comes for free because symbols only live at leaves. The resulting average code length approaches the Shannon entropy of the source, with rounding overhead because Huffman must emit integer bits per symbol.
How it works
Start with a min-heap of nodes, one per symbol, keyed by frequency. Repeatedly pop the two nodes with the smallest frequencies, create a new internal node whose frequency is the sum and whose children are those two nodes, and push the new node back into the heap. After n - 1 merges the heap contains a single node — the root of the Huffman tree. To generate the codebook, traverse the tree: when you descend to a left child append '0', to a right child append '1', and when you reach a leaf record its accumulated path as that symbol's code. Edge case: a one-symbol alphabet produces a degenerate tree with zero-length codes, which is usually patched by emitting a single '0'. Decoding reverses the process: read bits one by one, walk the tree from the root, and emit a symbol when you land on a leaf, then reset to the root. Huffman is optimal among prefix codes when we must assign an integer number of bits per symbol; arithmetic coding can beat it by fractional bits but is more complex to implement and historically had patent issues.
Implementation
import java.util.*;
public final class Huffman {
public static final class HuffmanNode {
public final char ch;
public final int freq;
public final HuffmanNode left, right;
public HuffmanNode(char ch, int freq) {
this(ch, freq, null, null);
}
public HuffmanNode(char ch, int freq, HuffmanNode left, HuffmanNode right) {
this.ch = ch; this.freq = freq; this.left = left; this.right = right;
}
public boolean isLeaf() { return left == null && right == null; }
}
private Huffman() {}
public static HuffmanNode buildTree(Map<Character, Integer> freq) {
if (freq.isEmpty()) return null;
PriorityQueue<HuffmanNode> pq = new PriorityQueue<>(Comparator.comparingInt(n -> n.freq));
for (Map.Entry<Character, Integer> e : freq.entrySet()) {
pq.offer(new HuffmanNode(e.getKey(), e.getValue()));
}
if (pq.size() == 1) {
HuffmanNode only = pq.poll();
return new HuffmanNode('\0', only.freq, only, null);
}
while (pq.size() > 1) {
HuffmanNode a = pq.poll();
HuffmanNode b = pq.poll();
pq.offer(new HuffmanNode('\0', a.freq + b.freq, a, b));
}
return pq.poll();
}
public static Map<Character, String> generateCodes(HuffmanNode root) {
Map<Character, String> codes = new HashMap<>();
if (root == null) return codes;
walk(root, new StringBuilder(), codes);
return codes;
}
private static void walk(HuffmanNode node, StringBuilder path, Map<Character, String> out) {
if (node.isLeaf()) {
out.put(node.ch, path.length() == 0 ? "0" : path.toString());
return;
}
if (node.left != null) { path.append('0'); walk(node.left, path, out); path.deleteCharAt(path.length() - 1); }
if (node.right != null) { path.append('1'); walk(node.right, path, out); path.deleteCharAt(path.length() - 1); }
}
}
Complexity
- Build Time:
O(n log n) - Encode/Decode:
O(total bits) - Space:
O(n)
Common pitfalls
- Forgetting the single-symbol edge case — path length zero produces broken codes.
- Not storing the tree or frequency table alongside the encoded stream; decoder cannot work without it.
- Using an unstable priority queue with tie-breaking that varies run-to-run, producing non-deterministic codes.
- Reusing a StringBuilder across recursive calls without pop on return — codes leak across siblings.
- Confusing Huffman (optimal prefix code) with arithmetic coding (optimal fractional code).
Key design decisions & trade-offs
- Huffman vs arithmetic coding — Chosen: Huffman for simplicity. Arithmetic is closer to entropy but harder to implement and historically patent-encumbered.
- Static vs adaptive Huffman — Chosen: Adaptive for unknown distributions. Static requires two passes (frequency scan then encode); adaptive updates the tree online.
- Canonical Huffman codes — Chosen: When the tree must be serialized compactly. Canonical form lets decoders rebuild the codebook from just code lengths — standard in DEFLATE.
Practice problems
Interview follow-ups
- Implement canonical Huffman code assignment from the code-length vector.
- Build an adaptive Huffman encoder that updates frequencies online.
- Compare compression ratio vs LZ77 on a sample text file.
- Add arithmetic coding and measure the gap vs Huffman on highly skewed distributions.
Why it matters
Huffman Coding is a core part of the Greedy & Backtracking toolkit. If you are preparing for a technical interview at a top-tier company or teaching a course, a working visualization is the fastest path from "I read about it" to "I understand it".