Web Crawler System Design Interview Question
Problem: Design a polite, scalable web crawler that ingests billions of URLs, respects robots.txt, and avoids both URL-level and content-level duplicates.
Overview
A web crawler is a deceptively simple system: fetch a URL, parse its links, enqueue the new ones, repeat. At web scale the hard part is not the fetch loop; it is being polite to the hosts you hit, deduping URLs across tens of billions of entries, and keeping the frontier balanced so BFS actually reaches the long tail of the internet. This design sketches a Mercator-style crawler sized for roughly one billion pages per month. The ingredients are a prioritized URL frontier with per-host back queues, a robots.txt parser that caches policies per domain, a Bloom filter in front of the URL-seen database so the hot path is a RAM lookup, and a fetcher worker pool that drives a per-domain token bucket. The tradeoffs worth arguing in an interview are Bloom false positives, content-hash versus shingle dedup, and whether to trust the robots.txt Crawl-Delay or derive it from measured latency.
Summary
A breadth-first crawler (BFS, not DFS — DFS depth is unbounded on the web) fronted by a prioritized URL Frontier that enforces politeness, priority, and freshness. HTML Downloader fetches through a DNS Resolver; the Content Parser validates + normalizes HTML; a 'Content Seen?' check dedupes near-identical bytes via a hash DB; the URL Extractor pulls outbound links; a URL Filter drops blocklisted/disallowed schemes and robots-disallowed paths; a 'URL Seen?' check (Bloom filter + URL Storage) prevents adding duplicates back to the frontier. The dominant design choices are (1) a politeness-aware priority queue per host with front-queue priority bands and back-queue per-host serialization, and (2) a Bloom filter in front of the URL-seen DB so the fast path is a ~200ns RAM check. The main tradeoffs are Bloom false positives (rarely re-crawl), content-hash dedupe missing near-duplicates (A/B-tested pages), and BFS needing politeness discipline to avoid DoS-ing popular hosts. Sized for ~1B pages crawled/month at roughly 400 pages/sec sustained.
Requirements
Functional
- Crawl seed URLs and follow outbound links in BFS order
- Respect robots.txt directives and Crawl-Delay per host
- Enforce politeness: at most one in-flight connection per host and a minimum gap between fetches
- Deduplicate URLs globally so the same page is not re-enqueued
- Detect near-duplicate content via hash or shingle fingerprint
- Persist fetched HTML to durable storage and emit extracted links for indexing
Non-functional
- Sustained throughput of 400 pages per second, peaking at 1K per second
- URL-seen lookup under 1 ms for the 99th percentile request
- No single host receives more than two requests per second from the crawler
- Frontier survives worker restarts without losing pending URLs
- Horizontal scalability: add fetcher workers linearly without rebalancing the frontier shards
Capacity Assumptions
- 1B pages crawled per month (~400 pages/sec average, 1K peak)
- Average HTML payload 100-500 KB (median ~120 KB); stored compressed ~40 KB
- Politeness: ≤ 1 concurrent connection per host, ≥ 500ms between fetches to same host
- URL-seen set grows unbounded; at 1B/mo, 10^10 entries after a year
- Content-seen set grows with unique pages ≈ 60% of fetched (40% are duplicates)
Back-of-Envelope Estimates
- Frontier size at steady state: ~10-100M pending URLs
- URL-seen Bloom filter: 10^10 entries * 10 bits/entry ≈ 12 GB RAM, 1% false-positive rate
- Bandwidth: 400 pages/s * 200KB ≈ 80 MB/s sustained, 200 MB/s peak
- Storage: 1B pages/mo * 40KB compressed ≈ 40 TB/mo → ~500 TB/year
- Fetcher pool: 1000 QPS / (1 fetch/host/500ms) → target ~500 distinct hosts in flight → ~500 fetcher goroutines per worker, ~4 worker hosts
High-level architecture
The pipeline starts at the URL Frontier, a two-stage queue inspired by the Mercator paper. Front queues hold URLs in priority bands assigned by PageRank, host traffic, or freshness. A router pops from them with weighted probability and pushes into back queues, one per politeness host, chosen by hash(host). Each back queue remembers a next-allowed-at timestamp so a worker only pulls the head if the token bucket for that host has refilled. This keeps BFS global while per-host traffic stays within the Crawl-Delay budget. Fetcher workers resolve DNS through a dedicated resolver tier with a local cache, stream the response body, and hand it to the Content Parser. The parser validates encoding, normalizes URLs, and hashes the body; the Content Seen check drops near-duplicates into a hash database before we waste extractor cycles on them. Surviving pages go to Content Storage (S3 or HDFS, compressed) and their extracted links go through the URL Filter (scheme, robots, blocklist). The URL Seen check queries an in-memory Bloom filter first; a miss is authoritative, a hit falls back to a sharded URL-storage database to confirm. New URLs get enqueued back into the front queue with a priority score derived from the parent page. Robots.txt lives in its own cache, refreshed per host every 24 hours, and is consulted both at URL Filter time and at fetch time for a defense-in-depth check.
Architecture Components (11)
- Seed URLs (kv) — Static or periodically refreshed list of high-quality starting URLs — the first input to the crawler. The book emphasizes that choosing good seeds is the difference between quick full-web coverage and a biased, narrow crawl.
- URL Frontier (priority queue) (queue) — Politeness-aware, multi-priority queue of URLs waiting to be fetched. Enforces the three properties the book emphasizes: politeness, priority, and freshness.
- HTML Downloader (worker) — Bank of worker goroutines/threads that issue HTTP GETs and stream responses. The book calls this the HTML Downloader.
- DNS Resolver + Cache (cache) — Local DNS cache (unbound) sitting in front of public resolvers to keep lookups off the hot path.
- Content Parser (worker) — Parses HTML, strips boilerplate, extracts text, and computes a SHA-256 of normalized content. Also validates that the page is well-formed (the book calls out invalid/malformed HTML as a real failure mode that wastes storage).
- Content Seen? (hash store) (kv) — Decision block + hash store. The book cites this as one of the key filters — ~29% of web pages are duplicates (it reports this statistic) and content-level dedupe reclaims that budget.
- URL Extractor (worker) — Pulls outbound <a href>, <link>, <img src> URLs from the parsed DOM.
- URL Filter (robots + blocklist) (worker) — Applies robots.txt rules, host blocklist, scheme whitelist, and quota caps.
- URL Seen? (Bloom filter + URL Storage) (kv) — Two-tier dedupe check: in-RAM Bloom filter fast path, plus persistent URL Storage as truth. Keeps the crawler from adding already-visited or already-enqueued URLs back to the frontier.
- Content Storage (blob) — Compressed HTML archive + extracted text, for downstream indexing / ML. The book recommends a hybrid: popular/hot content in memory, the bulk on disk.
- URL Storage (kv) — Persistent record of every URL the crawler has ever decided to process — whether crawled successfully, blocked by robots, or dropped by filter. Source of truth for the 'URL Seen?' check when the Bloom filter reports a maybe-hit.
Operations Walked Through (3)
- crawl-happy — URL is popped from the frontier, fetched, parsed, stored, and newly discovered links are enqueued.
- duplicate-content — The URL is fetched successfully but its normalized content matches a hash we've already indexed — skip storage but still mine the page for outbound links.
- robots-blocked — Worker is about to fetch a URL but the host's robots.txt disallows it. The URL is dropped politely before any HTTP request is made.
Implementation
public final class URLFrontier {
private final List<BlockingQueue<String>> frontQueues; // priority bands
private final Map<String, BackQueue> backQueues = new ConcurrentHashMap<>();
private final PolitenessPolicy politeness;
public URLFrontier(int priorityBands, PolitenessPolicy politeness) {
this.frontQueues = IntStream.range(0, priorityBands)
.mapToObj(i -> (BlockingQueue<String>) new LinkedBlockingQueue<String>())
.toList();
this.politeness = politeness;
}
public void enqueue(String url, int priorityBand) throws InterruptedException {
frontQueues.get(priorityBand).put(url);
}
// Router: pull from a front queue and append to the correct back queue.
public void route() throws InterruptedException {
int band = weightedPickBand();
String url = frontQueues.get(band).take();
String host = URI.create(url).getHost();
backQueues.computeIfAbsent(host, h -> new BackQueue(h, politeness)).offer(url);
}
// Worker: return the next URL whose host is past its next-allowed-at.
public Optional<String> nextReady() {
long now = System.currentTimeMillis();
for (BackQueue bq : backQueues.values()) {
if (bq.nextAllowedAt() <= now) {
String url = bq.poll();
if (url != null) { bq.markFetched(now); return Optional.of(url); }
}
}
return Optional.empty();
}
private int weightedPickBand() { /* P = {0.5, 0.3, 0.15, 0.05 ...} */ return 0; }
}
final class BackQueue {
private final Deque<String> q = new ArrayDeque<>();
private final PolitenessPolicy policy;
private final String host;
private volatile long nextAllowedAt = 0L;
BackQueue(String host, PolitenessPolicy policy) { this.host = host; this.policy = policy; }
synchronized void offer(String url) { q.addLast(url); }
synchronized String poll() { return q.pollFirst(); }
long nextAllowedAt() { return nextAllowedAt; }
void markFetched(long now) { nextAllowedAt = now + policy.delayMs(host); }
}
public final class RobotsTxtCache {
private final HttpClient http;
private final Map<String, Entry> cache = new ConcurrentHashMap<>();
private static final Duration TTL = Duration.ofHours(24);
public RobotsTxtCache(HttpClient http) { this.http = http; }
public boolean isAllowed(String userAgent, URI url) {
Entry e = cache.computeIfAbsent(url.getHost(), this::load);
return e.rules().isAllowed(userAgent, url.getPath());
}
public long crawlDelayMs(String userAgent, String host) {
Entry e = cache.computeIfAbsent(host, this::load);
return e.rules().crawlDelaySeconds(userAgent) * 1000L;
}
private Entry load(String host) {
try {
HttpResponse<String> res = http.send(
HttpRequest.newBuilder(URI.create("https://" + host + "/robots.txt"))
.timeout(Duration.ofSeconds(5)).build(),
HttpResponse.BodyHandlers.ofString());
RobotRules rules = res.statusCode() == 200
? RobotRules.parse(res.body())
: RobotRules.allowAll();
return new Entry(rules, Instant.now().plus(TTL));
} catch (Exception ex) {
// Be conservative: on error, deny everything for an hour.
return new Entry(RobotRules.denyAll(), Instant.now().plus(Duration.ofHours(1)));
}
}
private record Entry(RobotRules rules, Instant expiresAt) {}
}
public final class SeenUrlCache {
private final BloomFilter<CharSequence> bloom; // Guava BloomFilter
private final UrlDatabase authoritative; // sharded KV or Cassandra
public SeenUrlCache(long expectedInsertions, double fpp, UrlDatabase authoritative) {
// 10^10 entries at fpp=0.01 ~= 12 GB of RAM
this.bloom = BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), expectedInsertions, fpp);
this.authoritative = authoritative;
}
// Fast path: if bloom says no, the URL is definitely new.
public boolean seen(String canonicalUrl) {
if (!bloom.mightContain(canonicalUrl)) return false;
return authoritative.exists(canonicalUrl); // confirm on possible FP
}
public void markSeen(String canonicalUrl) {
bloom.put(canonicalUrl);
authoritative.put(canonicalUrl);
}
}
public final class FetcherWorker implements Runnable {
private final URLFrontier frontier;
private final RobotsTxtCache robots;
private final SeenUrlCache seen;
private final Map<String, TokenBucket> domainBuckets = new ConcurrentHashMap<>();
private final HttpClient http;
private volatile boolean running = true;
public FetcherWorker(URLFrontier f, RobotsTxtCache r, SeenUrlCache s, HttpClient http) {
this.frontier = f; this.robots = r; this.seen = s; this.http = http;
}
@Override public void run() {
while (running) {
Optional<String> maybe = frontier.nextReady();
if (maybe.isEmpty()) { sleepMs(10); continue; }
String url = maybe.get();
URI uri = URI.create(url);
if (!robots.isAllowed("HldSimBot", uri)) continue;
TokenBucket bucket = domainBuckets.computeIfAbsent(uri.getHost(),
h -> new TokenBucket(/*ratePerSec=*/ 2.0, /*burst=*/ 2));
if (!bucket.tryAcquire()) { frontier.requeue(url); continue; }
try {
HttpResponse<byte[]> res = http.send(
HttpRequest.newBuilder(uri).timeout(Duration.ofSeconds(10)).build(),
HttpResponse.BodyHandlers.ofByteArray());
if (res.statusCode() == 200) {
seen.markSeen(url);
Parser.parseAndEmit(url, res.body());
}
} catch (Exception e) {
// log and continue; retry logic lives elsewhere
}
}
}
public void stop() { running = false; }
private static void sleepMs(long ms) { try { Thread.sleep(ms); } catch (InterruptedException ignored) {} }
}
Key design decisions & trade-offs
- BFS traversal with per-host politeness vs DFS — Chosen: BFS with per-host back queues. DFS concentrates traffic on one host and has unbounded depth on the web; BFS with back queues spreads load and bounds the damage any single misbehaving site can do.
- URL-seen check on Bloom filter vs direct DB lookup — Chosen: Bloom filter in front of authoritative store. A 12 GB Bloom filter serves 10^10 entries at 1% FPP with ~200 ns lookups; direct DB lookups would multiply read IOPS by the extraction rate and cost far more than the occasional re-crawl a false positive would imply.
- Content dedup via exact hash vs shingle fingerprint — Chosen: Exact MD5/SHA-256 of the normalized body. Exact hashing is cheap and catches the vast majority of true duplicates (mirrors, syndication). Shingle or SimHash would catch near-duplicates too but triples the hash DB size and slows the hot path; defer to a batch job if needed.
- Trust robots.txt Crawl-Delay vs derive politeness from measured latency — Chosen: Trust Crawl-Delay with a floor of 500 ms, fall back to measured. Respecting the site operator intent is the right default and avoids legal and reputational risk. The floor keeps us polite when sites omit Crawl-Delay or set it unrealistically low.
- Frontier persistence (in-memory vs on-disk with snapshots) — Chosen: On-disk WAL with periodic snapshots. An in-memory frontier loses tens of millions of pending URLs on restart. A RocksDB-backed queue with snapshots every few minutes caps recovery to minutes at a small throughput cost.
Interview follow-ups
- How would you keep the frontier balanced across hosts if one host dominates the outbound link graph?
- How do you crawl JavaScript-rendered pages without blowing up per-page fetch cost?
- How do you detect and handle crawler traps (infinite calendars, session-id parameter explosions)?
- How would you re-crawl changed pages on a schedule without blowing the politeness budget?
- How would you shard the URL-seen database across regions and still handle cross-shard extractor writes?