← System Design Simulator

Web Crawler System Design Interview Question

By Rahul Kumar · Senior Software Engineer · Updated · 11 components · 3 operations ·Source: Alex Xu, System Design Interview Vol 1, Chapter 9; 'Mercator: A Scalable, Extensible Web Crawler' (Heydon & Najork, 1999); Google 'Crawling and Indexing' docs

Problem: Design a polite, scalable web crawler that ingests billions of URLs, respects robots.txt, and avoids both URL-level and content-level duplicates.

Overview

A web crawler is a deceptively simple system: fetch a URL, parse its links, enqueue the new ones, repeat. At web scale the hard part is not the fetch loop; it is being polite to the hosts you hit, deduping URLs across tens of billions of entries, and keeping the frontier balanced so BFS actually reaches the long tail of the internet. This design sketches a Mercator-style crawler sized for roughly one billion pages per month. The ingredients are a prioritized URL frontier with per-host back queues, a robots.txt parser that caches policies per domain, a Bloom filter in front of the URL-seen database so the hot path is a RAM lookup, and a fetcher worker pool that drives a per-domain token bucket. The tradeoffs worth arguing in an interview are Bloom false positives, content-hash versus shingle dedup, and whether to trust the robots.txt Crawl-Delay or derive it from measured latency.

Web Crawler — Interactive Simulator

Runs fully client-side in your browser; no sign-up. Or open full screen →

Launch the interactive walkthrough for Web Crawler — animated architecture diagram, step-by-step flow with real payloads, component swap, and a discrete-event stress simulator.

Summary

A breadth-first crawler (BFS, not DFS — DFS depth is unbounded on the web) fronted by a prioritized URL Frontier that enforces politeness, priority, and freshness. HTML Downloader fetches through a DNS Resolver; the Content Parser validates + normalizes HTML; a 'Content Seen?' check dedupes near-identical bytes via a hash DB; the URL Extractor pulls outbound links; a URL Filter drops blocklisted/disallowed schemes and robots-disallowed paths; a 'URL Seen?' check (Bloom filter + URL Storage) prevents adding duplicates back to the frontier. The dominant design choices are (1) a politeness-aware priority queue per host with front-queue priority bands and back-queue per-host serialization, and (2) a Bloom filter in front of the URL-seen DB so the fast path is a ~200ns RAM check. The main tradeoffs are Bloom false positives (rarely re-crawl), content-hash dedupe missing near-duplicates (A/B-tested pages), and BFS needing politeness discipline to avoid DoS-ing popular hosts. Sized for ~1B pages crawled/month at roughly 400 pages/sec sustained.

Requirements

Functional

Non-functional

Capacity Assumptions

Back-of-Envelope Estimates

High-level architecture

The pipeline starts at the URL Frontier, a two-stage queue inspired by the Mercator paper. Front queues hold URLs in priority bands assigned by PageRank, host traffic, or freshness. A router pops from them with weighted probability and pushes into back queues, one per politeness host, chosen by hash(host). Each back queue remembers a next-allowed-at timestamp so a worker only pulls the head if the token bucket for that host has refilled. This keeps BFS global while per-host traffic stays within the Crawl-Delay budget. Fetcher workers resolve DNS through a dedicated resolver tier with a local cache, stream the response body, and hand it to the Content Parser. The parser validates encoding, normalizes URLs, and hashes the body; the Content Seen check drops near-duplicates into a hash database before we waste extractor cycles on them. Surviving pages go to Content Storage (S3 or HDFS, compressed) and their extracted links go through the URL Filter (scheme, robots, blocklist). The URL Seen check queries an in-memory Bloom filter first; a miss is authoritative, a hit falls back to a sharded URL-storage database to confirm. New URLs get enqueued back into the front queue with a priority score derived from the parent page. Robots.txt lives in its own cache, refreshed per host every 24 hours, and is consulted both at URL Filter time and at fetch time for a defense-in-depth check.

Architecture Components (11)

Operations Walked Through (3)

Implementation

URLFrontier with per-domain politeness
public final class URLFrontier {
  private final List<BlockingQueue<String>> frontQueues; // priority bands
  private final Map<String, BackQueue> backQueues = new ConcurrentHashMap<>();
  private final PolitenessPolicy politeness;

  public URLFrontier(int priorityBands, PolitenessPolicy politeness) {
    this.frontQueues = IntStream.range(0, priorityBands)
        .mapToObj(i -> (BlockingQueue<String>) new LinkedBlockingQueue<String>())
        .toList();
    this.politeness = politeness;
  }

  public void enqueue(String url, int priorityBand) throws InterruptedException {
    frontQueues.get(priorityBand).put(url);
  }

  // Router: pull from a front queue and append to the correct back queue.
  public void route() throws InterruptedException {
    int band = weightedPickBand();
    String url = frontQueues.get(band).take();
    String host = URI.create(url).getHost();
    backQueues.computeIfAbsent(host, h -> new BackQueue(h, politeness)).offer(url);
  }

  // Worker: return the next URL whose host is past its next-allowed-at.
  public Optional<String> nextReady() {
    long now = System.currentTimeMillis();
    for (BackQueue bq : backQueues.values()) {
      if (bq.nextAllowedAt() <= now) {
        String url = bq.poll();
        if (url != null) { bq.markFetched(now); return Optional.of(url); }
      }
    }
    return Optional.empty();
  }

  private int weightedPickBand() { /* P = {0.5, 0.3, 0.15, 0.05 ...} */ return 0; }
}

final class BackQueue {
  private final Deque<String> q = new ArrayDeque<>();
  private final PolitenessPolicy policy;
  private final String host;
  private volatile long nextAllowedAt = 0L;

  BackQueue(String host, PolitenessPolicy policy) { this.host = host; this.policy = policy; }
  synchronized void offer(String url) { q.addLast(url); }
  synchronized String poll() { return q.pollFirst(); }
  long nextAllowedAt() { return nextAllowedAt; }
  void markFetched(long now) { nextAllowedAt = now + policy.delayMs(host); }
}
Robots.txt parser with per-host cache
public final class RobotsTxtCache {
  private final HttpClient http;
  private final Map<String, Entry> cache = new ConcurrentHashMap<>();
  private static final Duration TTL = Duration.ofHours(24);

  public RobotsTxtCache(HttpClient http) { this.http = http; }

  public boolean isAllowed(String userAgent, URI url) {
    Entry e = cache.computeIfAbsent(url.getHost(), this::load);
    return e.rules().isAllowed(userAgent, url.getPath());
  }

  public long crawlDelayMs(String userAgent, String host) {
    Entry e = cache.computeIfAbsent(host, this::load);
    return e.rules().crawlDelaySeconds(userAgent) * 1000L;
  }

  private Entry load(String host) {
    try {
      HttpResponse<String> res = http.send(
          HttpRequest.newBuilder(URI.create("https://" + host + "/robots.txt"))
              .timeout(Duration.ofSeconds(5)).build(),
          HttpResponse.BodyHandlers.ofString());
      RobotRules rules = res.statusCode() == 200
          ? RobotRules.parse(res.body())
          : RobotRules.allowAll();
      return new Entry(rules, Instant.now().plus(TTL));
    } catch (Exception ex) {
      // Be conservative: on error, deny everything for an hour.
      return new Entry(RobotRules.denyAll(), Instant.now().plus(Duration.ofHours(1)));
    }
  }

  private record Entry(RobotRules rules, Instant expiresAt) {}
}
Bloom-filter-backed seen-URL cache
public final class SeenUrlCache {
  private final BloomFilter<CharSequence> bloom;  // Guava BloomFilter
  private final UrlDatabase authoritative;         // sharded KV or Cassandra

  public SeenUrlCache(long expectedInsertions, double fpp, UrlDatabase authoritative) {
    // 10^10 entries at fpp=0.01 ~= 12 GB of RAM
    this.bloom = BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), expectedInsertions, fpp);
    this.authoritative = authoritative;
  }

  // Fast path: if bloom says no, the URL is definitely new.
  public boolean seen(String canonicalUrl) {
    if (!bloom.mightContain(canonicalUrl)) return false;
    return authoritative.exists(canonicalUrl); // confirm on possible FP
  }

  public void markSeen(String canonicalUrl) {
    bloom.put(canonicalUrl);
    authoritative.put(canonicalUrl);
  }
}
Fetcher worker with per-domain token bucket
public final class FetcherWorker implements Runnable {
  private final URLFrontier frontier;
  private final RobotsTxtCache robots;
  private final SeenUrlCache seen;
  private final Map<String, TokenBucket> domainBuckets = new ConcurrentHashMap<>();
  private final HttpClient http;
  private volatile boolean running = true;

  public FetcherWorker(URLFrontier f, RobotsTxtCache r, SeenUrlCache s, HttpClient http) {
    this.frontier = f; this.robots = r; this.seen = s; this.http = http;
  }

  @Override public void run() {
    while (running) {
      Optional<String> maybe = frontier.nextReady();
      if (maybe.isEmpty()) { sleepMs(10); continue; }
      String url = maybe.get();
      URI uri = URI.create(url);
      if (!robots.isAllowed("HldSimBot", uri)) continue;

      TokenBucket bucket = domainBuckets.computeIfAbsent(uri.getHost(),
          h -> new TokenBucket(/*ratePerSec=*/ 2.0, /*burst=*/ 2));
      if (!bucket.tryAcquire()) { frontier.requeue(url); continue; }

      try {
        HttpResponse<byte[]> res = http.send(
            HttpRequest.newBuilder(uri).timeout(Duration.ofSeconds(10)).build(),
            HttpResponse.BodyHandlers.ofByteArray());
        if (res.statusCode() == 200) {
          seen.markSeen(url);
          Parser.parseAndEmit(url, res.body());
        }
      } catch (Exception e) {
        // log and continue; retry logic lives elsewhere
      }
    }
  }

  public void stop() { running = false; }
  private static void sleepMs(long ms) { try { Thread.sleep(ms); } catch (InterruptedException ignored) {} }
}

Key design decisions & trade-offs

Interview follow-ups

Related