Bloom Filter Visualization for Coding Interviews
k hashes → m bits; zero false negatives, tunable false-positive rate.
Concept
A Bloom filter is a probabilistic set that trades perfect membership for dramatic memory savings. It can tell you with certainty that a key is not in the set, and it can tell you with controlled probability that a key might be in the set, but it never produces false negatives. That one-sided error profile is exactly what you want for cache miss short-circuits, LSM-tree SSTable index reads, malicious URL filters, and distributed join pre-filters: a negative response is cheap and correct, a positive response requires the slower confirmation lookup. A typical Bloom filter uses a bit array of size m and k independent hash functions, storing no actual keys. For the right tuning of m and k relative to n (the expected number of inserted elements), you can hit a 1% false positive rate at around 10 bits per key, which is one to two orders of magnitude less memory than storing the keys themselves.
How it works
Allocate a bit array of m bits, all zero. Pick k hash functions that map any key uniformly over [0, m). On add(key), hash the key k times and set all k resulting bit positions to 1. On mightContain(key), hash the key k times and return true only if all k bits are set; if any bit is zero, the key was definitely never added. False negatives are impossible because add can only set bits, never clear them. False positives occur when enough other keys collectively happen to set the k bits a given key hashes to, and the probability of that is approximately (1 - e^(-kn/m))^k for n inserted items. This has a minimum at k = (m/n) ln 2, which tells you the ideal number of hash functions for a given memory budget. In practice you rarely run k independent hash functions; the Kirsch-Mitzenmacher double-hashing trick derives k hashes from two 64-bit hashes as h1 + i*h2 for i = 0..k-1, which is indistinguishable from independent hashes at this granularity and far faster. Bloom filters do not support delete directly because clearing bits could create false negatives for other keys; counting Bloom filters or cuckoo filters are the usual answer when deletes are required.
Implementation
import java.util.BitSet;
public class BloomFilter<T> {
private final BitSet bits;
private final int m; // number of bits
private final int k; // number of hash functions
private int insertedCount; // for fpr estimation
/** Construct a filter sized for expectedInsertions at target falsePositiveRate. */
public BloomFilter(int expectedInsertions, double fpr) {
if (expectedInsertions <= 0 || fpr <= 0 || fpr >= 1)
throw new IllegalArgumentException("bad params");
this.m = Math.max(64, (int) Math.ceil(-expectedInsertions * Math.log(fpr) / (Math.log(2) * Math.log(2))));
this.k = Math.max(1, (int) Math.round((double) m / expectedInsertions * Math.log(2)));
this.bits = new BitSet(m);
}
public void add(T key) {
long[] h = hash128(key);
for (int i = 0; i < k; i++) {
int idx = Math.floorMod(h[0] + (long) i * h[1], m);
bits.set(idx);
}
insertedCount++;
}
public boolean mightContain(T key) {
long[] h = hash128(key);
for (int i = 0; i < k; i++) {
int idx = Math.floorMod(h[0] + (long) i * h[1], m);
if (!bits.get(idx)) return false;
}
return true;
}
/** Analytic false positive rate for n inserted items. */
public double falsePositiveRate(int n) {
return Math.pow(1 - Math.exp(-((double) k * n) / m), k);
}
public int bitSize() { return m; }
public int hashCount() { return k; }
public int insertedCount() { return insertedCount; }
/**
* Produce two independent 64-bit hashes from the object's hashCode and its
* bit-reversed mix. For production use swap in Murmur3 or xxHash.
*/
private static long[] hash128(Object key) {
long h = (key == null) ? 0L : key.hashCode();
long h1 = mix(h);
long h2 = mix(h ^ 0x9E3779B97F4A7C15L);
return new long[]{h1, h2 == 0 ? 1 : h2};
}
private static long mix(long x) {
x ^= x >>> 33; x *= 0xff51afd7ed558ccdL;
x ^= x >>> 33; x *= 0xc4ceb9fe1a85ec53L;
x ^= x >>> 33; return x;
}
}
Complexity
- Time:
O(k) per add and mightContain, independent of n - Space:
O(m) bits, typically about 10 bits per key for 1% fpr
Common pitfalls
- Using only one hash function: false positive rate explodes as n grows
- Undersizing m for expected n: even the best k cannot hit the target fpr
- Attempting delete by clearing bits: introduces false negatives for other keys
- Using weak hash functions that correlate across k applications, creating clustering
- Assuming serialized Bloom filters are portable across hash-function changes
- Ignoring saturation: once most bits are set, mightContain approaches always-true
Key design decisions & trade-offs
- Error profile — Chosen: False positives allowed, false negatives never. Fits cache/pre-filter use cases where a miss is authoritative and a hit is verified downstream
- Hash derivation — Chosen: Double hashing via h1 + i*h2. Kirsch-Mitzenmacher shows this matches independent hashes statistically at a fraction of the CPU cost
- Delete support — Chosen: Not supported. Use a counting Bloom filter or cuckoo filter if deletes are required; plain Bloom keeps the memory minimal
Practice problems
Interview follow-ups
- Counting Bloom filter with small counters instead of bits to allow deletion
- Scalable Bloom filter that grows dynamically as inserts accumulate
- Cuckoo filter with better space efficiency and support for deletes
- HyperLogLog for cardinality estimation (companion probabilistic structure)
- Bloom filter cascade used in TLS CRLite for revocation checks
Why it matters
Bloom Filter is a core part of the Hashing toolkit. If you are preparing for a technical interview at a top-tier company or teaching a course, a working visualization is the fastest path from "I read about it" to "I understand it".