Rabin-Karp Visualization for Coding Interviews
Rolling hash pattern search; hash collisions trigger char-by-char verify.
Concept
Rabin-Karp is the pattern-matching algorithm that trades determinism for elegance: it hashes the pattern once, then slides a rolling hash across the text so each length-m window's hash is computed in O(1) from the previous window. Windows whose hash differs cannot match, so you skip them outright; windows whose hash matches are verified character-by-character to guard against collisions. Expected time is O(n + m), worst case is O(nm) when adversarial input forces every window to collide. The real superpower is multi-pattern search: once you have a rolling hash, testing a text window against k patterns is a hashset lookup, giving O(n + total_m) expected, which is why plagiarism detectors and near-duplicate-document systems lean on it instead of KMP.
How it works
Pick a base B (typically 31 or 131 for ASCII; a prime slightly above the alphabet size) and a modulus M (a large prime, e.g. 1_000_000_007) to keep the hash inside a 64-bit long. The pattern hash is h_p = (p[0] * B^(m-1) + p[1] * B^(m-2) + ... + p[m-1]) mod M, computed via Horner's rule. The first window's hash h_0 is computed the same way. To roll from window i to window i+1 you subtract the leading character's contribution (text[i] * B^(m-1)), multiply what remains by B, and add the new trailing character — all mod M. Because modular subtraction can go negative in Java, you add M before taking the final mod. When h_window == h_p you run a character-by-character equality check to confirm; on mismatch you keep rolling. Collision probability with one good prime is roughly m/M — small enough for most uses but not zero. For adversarial inputs (hash-flood DoS) use double hashing with two independent (B, M) pairs; the combined collision rate drops to m/(M1 * M2), which is negligible. The rolling step is the whole value proposition: without it you pay O(m) per window and you might as well run naive matching.
Implementation
package dsa.rabinkarp;
import java.util.ArrayList;
import java.util.List;
public final class RabinKarp {
private static final long BASE = 31L;
private static final long MOD = 1_000_000_007L;
// Returns all starting indices where pattern occurs in text.
public static List<Integer> rabinKarpSearch(String text, String pattern) {
List<Integer> hits = new ArrayList<>();
int n = text.length(), m = pattern.length();
if (m == 0 || m > n) return hits;
long patternHash = 0L;
long windowHash = 0L;
long highestPow = 1L; // BASE^(m-1) mod MOD
for (int i = 0; i < m - 1; i++) {
highestPow = (highestPow * BASE) % MOD;
}
for (int i = 0; i < m; i++) {
patternHash = (patternHash * BASE + pattern.charAt(i)) % MOD;
windowHash = (windowHash * BASE + text.charAt(i)) % MOD;
}
for (int i = 0; i <= n - m; i++) {
if (windowHash == patternHash && equalsAt(text, pattern, i)) {
hits.add(i);
}
if (i < n - m) {
long leading = (text.charAt(i) * highestPow) % MOD;
windowHash = (windowHash - leading + MOD) % MOD;
windowHash = (windowHash * BASE + text.charAt(i + m)) % MOD;
}
}
return hits;
}
private static boolean equalsAt(String text, String pattern, int start) {
for (int j = 0; j < pattern.length(); j++) {
if (text.charAt(start + j) != pattern.charAt(j)) return false;
}
return true;
}
}
Complexity
- Time:
O(n + m) expected, O(nm) worst case on full collision - Space:
O(1) extra beyond the hit list
Common pitfalls
- Forgetting to add MOD before the subtraction when rolling the hash, which lets the value go negative in Java
- Using BASE smaller than the alphabet size — two distinct chars can produce the same partial hash
- Picking a small MOD (e.g. Integer.MAX_VALUE) and then multiplying without casting to long — silent overflow
- Skipping the equality verification on hash match — collisions do happen and will corrupt results
- Recomputing BASE^(m-1) inside the loop instead of caching it; turns O(n) into O(n log m)
Key design decisions & trade-offs
- Modulus choice — Chosen: Single 1e9+7 vs double hashing with two primes. Single prime is fast and fine for trusted input; double hashing eliminates collision worries against adversarial or very long texts at roughly 2x cost.
- Base choice — Chosen: 31 for ASCII vs 131 or 257 for larger alphabets. The base should exceed the alphabet to avoid structural collisions. 31 is idiomatic for lowercase English (it's what Java String.hashCode uses).
- When to use Rabin-Karp — Chosen: Multi-pattern search vs single-pattern. For a single pattern, KMP or Z-algorithm is strictly better worst-case. Rabin-Karp dominates when you're searching for many patterns in one text via a hashset of pattern hashes.
Practice problems
Interview follow-ups
- Implement double hashing and show that the expected number of false positives drops from m/M to m/(M1*M2)
- Extend to 2D Rabin-Karp to find a rectangular pattern inside a larger grid
- Use Rabin-Karp to detect near-duplicate documents via shingled hashes — how is this related to MinHash?
- Construct an adversarial input that forces worst-case O(nm) against a known (base, mod) pair
- Compare Rabin-Karp's constant factor against KMP on modern CPUs — when does the branch-free rolling hash actually win?
Why it matters
Rabin-Karp is a core part of the Arrays & Strings toolkit. If you are preparing for a technical interview at a top-tier company or teaching a course, a working visualization is the fastest path from "I read about it" to "I understand it".