How do you design Search Autocomplete (Type-Ahead) for a system design interview?

Start from requirements, estimate scale, draw the high-level architecture, explain the main data flow, identify bottlenecks, and compare trade-offs. This walkthrough covers Search Autocomplete (Type-Ahead) with components, operations, capacity estimates, and an interactive simulator.

What should I discuss in a Search Autocomplete (Type-Ahead) system design answer?

Cover functional requirements, non-functional requirements, APIs, data model, key services, storage choices, caching, queues or streams where relevant, failure handling, observability, and scaling trade-offs.

Can I practice Search Autocomplete (Type-Ahead) interactively?

Yes. The page includes an interactive browser-based simulator for Search Autocomplete (Type-Ahead), including an architecture diagram, step-by-step request flow, component swap, and stress simulation.

Search Autocomplete (Type-Ahead) System Design Interview Question

By Rahul Kumar · Senior Software Engineer · Updated May 2026 · 11 components · 5 operations ·Source: Alex Xu, System Design Interview Vol 1, Chapter 13

Problem: Design a top-k search autocomplete system that returns the 5 most popular queries for a typed prefix within 100ms.

Overview

Search autocomplete (type-ahead) is the feature that turns every keystroke into a backend request and expects the top-5 completions to arrive within roughly 100 milliseconds. The workload is brutally read-heavy: at 10 million daily actives typing 20-character queries with one request per keystroke, the serving tier must sustain around 24K QPS average and roughly 48K QPS at peak, while still looking instant from the user's perspective. The design we describe here accepts a one-week-old view of the world in exchange for drastically simpler serving. A trie with the top-K completions precomputed at every node lets the query tier answer any prefix in O(P) where P is the prefix length, which easily fits a 100ms budget even on commodity hardware. A separate offline pipeline aggregates query logs, re-ranks by frequency, and rebuilds the trie on a cadence the business can tolerate. The end result is a system where writes are large and slow but infrequent, and reads are small and fast and constant.

Search Autocomplete (Type-Ahead) — Interactive Simulator

Runs fully client-side in your browser; no sign-up. Or open full screen →

Launch the interactive walkthrough for Search Autocomplete (Type-Ahead) — animated architecture diagram, step-by-step flow with real payloads, component swap, and a discrete-event stress simulator.

Summary

A read-dominated system for 10M DAU that returns top-5 completions for each keystroke under 100ms (book: Facebook typeahead SLO). The dominant design choice is a trie with top-k precomputed at every node (serving is O(1) after the prefix walk), populated by an offline batch pipeline (analytics logs → aggregator → worker builds trie → Trie Cache + Trie DB) that rebuilds weekly. Online path: Client → LB → API → Trie Cache (with fallback to Trie DB). The main tradeoff is freshness vs serving simplicity: an in-memory trie is blazing fast to read but rebuild cadence (weekly in the book's base design) means trending queries lag until the next cycle — the book's 'how to support real-time' follow-up outlines stream processing as the fix. A filter layer in front of the cache removes hateful/violent/NSFW suggestions; a shard-map manager handles prefix-shard imbalance.

Requirements

Functional

Return the top 5 autocomplete suggestions for any typed prefix
Support per-keystroke requests with sub-100ms p99 latency
Rank suggestions by historical query frequency over a trailing window
Filter hateful, violent, and NSFW phrases out of the returned list
Rebuild the suggestion index from raw query logs on a daily or weekly cadence
Support prefix-only matching in ASCII English in the base design

Non-functional

p99 response time under 100ms at the 48K peak QPS target
Availability of at least 99.9% on the read path
Horizontal read scalability by prefix-shard
Bounded memory per node: trie cache fits in RAM on each replica
Freshness SLA: new trending queries visible within one rebuild cycle

Capacity Assumptions

10M DAU (book's given scale)
10 searches/user/day, ~20 characters avg per query → 20 requests per query (one per keystroke)
Top-K = 5 suggestions (book's answer)
Query string ≈ 20 bytes (4 words × 5 chars, ASCII)
~20% of daily queries are new → ~0.4 GB/day of new data (book's calc)
Matching is prefix-only (not mid-string), English lowercase only in base design
Trie rebuild cadence: weekly (book's assumption)
Response-time target: under 100ms (book cites Facebook typeahead article [1])

Back-of-Envelope Estimates

Suggest QPS avg ≈ 10M * 10 * 20 / 86400 ≈ 24K QPS (book's math)
Peak QPS ≈ 2x avg ≈ 48K QPS (book's estimate)
Per-prefix top-K cached at every trie node (space-for-time trade)
New data storage growth: 10M * 10 * 20B * 20% ≈ 0.4 GB/day
Browser / Cache-Control caching: max-age=3600 (book references Google's 1h cache)

High-level architecture

The serving path is intentionally short. A client fires an AJAX GET on every keystroke, hits an L7 load balancer, and lands on a stateless API replica. The API reads from an in-memory Trie Cache that stores, at every node, the precomputed top-K completions for the prefix ending at that node. That caching is the whole trick: walking the prefix is O(P), and once we hit the target node the answer is already materialized, so no sorting or scanning happens on the hot path. If the cache is cold, the API falls back to a durable Trie DB that holds the same structure serialized. A Filter Layer strips disallowed phrases before the suggestions leave the building, and a Shard Map Manager keeps prefix shards balanced as popular prefixes like "how to" drift hotter over time. The build path is completely separate. Analytics logs stream into aggregators that compute per-query frequency over the rebuild window. A pool of Trie Build Workers consumes that aggregated data and produces a new trie, top-K annotated at every node, which is then atomically swapped into the Trie DB and warmed into the Trie Cache. The rebuild can run weekly in the base design; moving to minute-level freshness requires a streaming pipeline, covered as a follow-up. The clean split between read-optimized in-memory serving and write-batched offline builds is what lets a single team operate this system without 24/7 incidents.

Architecture Components (11)

Client (Browser / Mobile) (client) — Search box that fires an AJAX /search?q= on every keystroke and renders the top-5 dropdown. Browser caches responses for ~1h.
Load Balancer (lb) — L7 LB routing /search requests to API servers.
API Servers (api) — Stateless service that fetches trie data from Trie Cache, constructs the top-5 suggestions, and returns. Also samples and emits the query to analytics logs.
Filter Layer (rate-limiter) — Filters out hateful, violent, sexually explicit, or dangerous autocomplete suggestions before they reach the client. Book §Trie operations / Delete.
Trie Cache (cache) — Distributed cache holding the trie (or per-prefix top-K) in memory for fast read. Takes weekly snapshots of Trie DB.
Trie DB (persistent) (kv) — Durable storage of the serialized trie. Book considers two options: document store (MongoDB) or hash-table-form KV store.
Shard Map Manager (coordinator) — Maintains the prefix → shard lookup to balance load across storage/cache shards. Book §Scale the storage, Figure 13-15.
Analytics Logs (stream) — Append-only, unindexed log of raw search queries. Book §Data gathering service.
Aggregators (stream-processor) — Batch (or micro-batch) jobs that roll raw log rows up into (query, time-window, frequency) aggregates. Book §Aggregators.
Aggregated Data (sql) — Aggregated weekly frequency table (query, time, frequency). Consumed by the trie build worker.
Trie Build Workers (worker) — Async servers that read aggregated data and build the trie data structure (with top-K cached at each node), then publish it to Trie DB and Trie Cache.

Operations Walked Through (5)

suggest-cache-hit — Book Figure 13-11 happy path. User types 'tr'; LB → API → Trie Cache → filter → return top-5 within ~10ms.
suggest-cache-miss — Book Figure 13-11 step 4. Cache miss → read Trie DB → replenish cache → respond.
trie-rebuild — Book §Data gathering service (Figure 13-9). Analytics Logs → Aggregators → Aggregated Data → Workers build trie → publish to Trie DB + Trie Cache.
takedown — Book §Delete. Ops team flags a term; filter layer starts masking it immediately; physical removal happens offline before next rebuild cycle.
reshard — Book §Scale the storage. Shard map manager detects imbalance (e.g. 'c' shard is hot) and publishes a new mapping.

Implementation

Trie with top-K cached at every node

public class AutocompleteTrie {
    private final int topK;
    private final Node root = new Node();

    public AutocompleteTrie(int topK) { this.topK = topK; }

    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        // Top-K completions that pass through this node, sorted desc by frequency.
        List<Suggestion> topK = Collections.emptyList();
    }

    static final class Suggestion implements Comparable<Suggestion> {
        final String query; final long freq;
        Suggestion(String q, long f) { this.query = q; this.freq = f; }
        public int compareTo(Suggestion o) { return Long.compare(o.freq, this.freq); }
    }

    /** Offline: insert every known query and update top-K on every node along the path. */
    public void insert(String query, long freq) {
        Node cur = root;
        List<Node> path = new ArrayList<>(query.length());
        for (char c : query.toCharArray()) {
            cur = cur.children.computeIfAbsent(c, k -> new Node());
            path.add(cur);
        }
        for (Node n : path) n.topK = mergeTopK(n.topK, new Suggestion(query, freq));
    }

    /** Online: O(P) prefix walk, then O(1) read of the precomputed top-K. */
    public List<Suggestion> suggest(String prefix) {
        Node cur = root;
        for (char c : prefix.toCharArray()) {
            cur = cur.children.get(c);
            if (cur == null) return Collections.emptyList();
        }
        return cur.topK;
    }

    private List<Suggestion> mergeTopK(List<Suggestion> existing, Suggestion candidate) {
        List<Suggestion> merged = new ArrayList<>(existing);
        merged.removeIf(s -> s.query.equals(candidate.query));
        merged.add(candidate);
        Collections.sort(merged);
        return merged.subList(0, Math.min(topK, merged.size()));
    }
}

PrefixController with 100ms SLO

@RestController
public class PrefixController {
    private static final Duration SLO = Duration.ofMillis(100);
    private final AutocompleteTrie cache;
    private final TrieDb fallback;
    private final PhraseFilter filter;

    public PrefixController(AutocompleteTrie cache, TrieDb db, PhraseFilter filter) {
        this.cache = cache; this.fallback = db; this.filter = filter;
    }

    @GetMapping("/search")
    public ResponseEntity<List<String>> suggest(@RequestParam("q") String raw) {
        long start = System.nanoTime();
        String prefix = raw == null ? "" : raw.trim().toLowerCase(Locale.ROOT);
        if (prefix.isEmpty() || prefix.length() > 50) return ResponseEntity.ok(List.of());

        List<AutocompleteTrie.Suggestion> hit = cache.suggest(prefix);
        if (hit.isEmpty() && elapsed(start).compareTo(SLO.dividedBy(2)) < 0) {
            hit = fallback.suggest(prefix); // only try the DB if we still have budget
        }

        List<String> out = hit.stream()
                .map(s -> s.query)
                .filter(filter::allow)
                .limit(5)
                .toList();

        HttpHeaders h = new HttpHeaders();
        h.setCacheControl("private, max-age=3600");
        h.add("X-Elapsed-Ms", Long.toString(elapsed(start).toMillis()));
        return new ResponseEntity<>(out, h, HttpStatus.OK);
    }

    private Duration elapsed(long startNanos) {
        return Duration.ofNanos(System.nanoTime() - startNanos);
    }
}

Daily batch rebuild from query logs

public class TrieRebuildJob {
    private final QueryLogReader logs;
    private final TrieDb trieDb;
    private final TrieCacheWarmer warmer;

    public TrieRebuildJob(QueryLogReader logs, TrieDb trieDb, TrieCacheWarmer warmer) {
        this.logs = logs; this.trieDb = trieDb; this.warmer = warmer;
    }

    /** Nightly: aggregate -> build -> swap -> warm. */
    public void runFor(LocalDate day) {
        // 1. aggregate frequencies across the trailing window
        Map<String, Long> freq = new HashMap<>();
        for (QueryLogReader.Row r : logs.rowsFor(day.minusDays(7), day)) {
            freq.merge(r.normalizedQuery(), r.count(), Long::sum);
        }

        // 2. build the new trie with top-K precomputed at every node
        AutocompleteTrie next = new AutocompleteTrie(5);
        freq.entrySet().stream()
            .filter(e -> e.getValue() >= 3) // drop long-tail noise
            .forEach(e -> next.insert(e.getKey(), e.getValue()));

        // 3. atomic swap in the durable store and warm the read-tier cache
        String version = "v-" + day;
        trieDb.publish(version, next);
        warmer.warmAllReplicas(version);
    }
}

Key design decisions & trade-offs

Suggestion storage — Chosen: Trie with top-K cached at every node. Answers any prefix in O(P) with zero sorting on the hot path. Costs extra memory and makes inserts more expensive, but inserts happen offline and serving is what the user feels.
Index freshness — Chosen: Daily (or weekly) batch rebuild from logs. Simple, testable, and cheap. Accepts that trending queries lag by one rebuild cycle; a streaming pipeline is the follow-up when that lag becomes a product problem.
Sharding strategy — Chosen: Range-shard by first characters of the prefix. Locality-friendly and easy to reason about, but popular first letters (e.g. "a", "s") go hot. A shard-map manager reshards on observed load rather than using a fixed split.
Client caching — Chosen: Cache-Control private, max-age=3600. Most users retype the same prefixes within an hour, so a short browser cache eliminates a big fraction of requests. Downside: newly trending queries remain invisible to a given user for up to an hour.
Safety filtering — Chosen: Post-rank filter layer rather than pre-index blocklist. Keeps the trie neutral and lets policy evolve without a full rebuild, at the cost of a small per-request filtering step.

Interview follow-ups

Extend to real-time trending by feeding a Flink/Kafka pipeline into an incremental trie updater
Add Unicode and IME-aware prefix matching for non-ASCII languages
Support personalized top-K by blending user history with the global trie result
Add spell correction and fuzzy matching for single-character typos
Introduce a cold-tier sample store so rebuilds can be validated against a canary slice before full rollout