← System Design Simulator

Twitter Trending Topics (Streaming Analytics) System Design Interview Question

By Rahul Kumar · Senior Software Engineer · Updated · 7 components · 3 operations ·Source: Streaming analytics — Kafka + Flink + Count-Min Sketch (Cormode & Muthukrishnan, 2005)

Problem: Design a real-time trending topics / hashtag system over a 500M-tweets/day firehose.

Overview

Twitter-scale trending is a streaming analytics problem disguised as a list of ten hashtags. The firehose is half a billion tweets a day, peaking roughly three times higher during globally correlated events like a championship final or an election call, and the product contract is still just "show me what is popular right now". Exact counting at that volume is not free: a naive aggregator would need to track a counter for every distinct term across a rolling five-minute window, which on the long tail of a social network means hundreds of millions of keys. Worse, keeping those counts accurate under parallelism forces a shuffle on every single update. The canonical answer is to trade exact counts for bounded error. A Count-Min Sketch, combined with a MinHeap of the current heavy hitters, holds the entire window in a few megabytes per task, updates in constant time, and is correct enough for a human-visible trending panel. Freshness comes from Kafka plus a stream processor, not from a database query.

Twitter Trending Topics (Streaming Analytics) — Interactive Simulator

Runs fully client-side in your browser; no sign-up. Or open full screen →

Launch the interactive walkthrough for Twitter Trending Topics (Streaming Analytics) — animated architecture diagram, step-by-step flow with real payloads, component swap, and a discrete-event stress simulator.

Summary

A streaming analytics pipeline that ingests tweets into Kafka, runs Flink jobs that maintain Count-Min Sketch + heavy-hitter Top-K per sliding window, and surfaces results via a small hot-topic cache behind a read API. The dominant design choice is approximate sketches (CMS + Top-K heap) over exact counting — exact counts would require O(unique-terms) memory per window and cross-node shuffles on every update, while CMS gives ε-bounded counts in kilobytes with single-pass updates. The main tradeoff is accuracy at the long tail: heavy hitters are correct within a few percent, but low-frequency terms have noisy counts that aren't trustworthy.

Requirements

Functional

Non-functional

Capacity Assumptions

Back-of-Envelope Estimates

High-level architecture

Tweets land on a tweet-ingest API that publishes to a partitioned Kafka topic; this is the one boundary where we accept asynchronous durability in exchange for never blocking the tweet write path. A fleet of stream processors each own some subset of Kafka partitions and maintain two pieces of state per sliding window: a Count-Min Sketch keyed by (term, window) and a MinHeap of size K that remembers the current top-K terms and their estimated counts. For every tweet, the processor extracts hashtags and tokens, increments the CMS once per token, and probes the heap to decide whether the updated estimate displaces the current min. Every 10 seconds the processor snapshots the heap and writes it to a hot topic cache keyed by window_end_ts. The read API is almost trivial: GET /trending reads the latest snapshot from the cache and returns it with a short TTL and a generated_at timestamp so clients can reason about staleness. Sliding windows are managed by keeping several sketches in flight (one per recent window) and dropping the oldest as time advances; expiry is cheap because a sketch is small and flat. Global top-K is computed by merging per-partition heaps at the snapshot boundary. The separation between a tiny stateless read API and a stateful but bounded stream job is what keeps this design operable at Twitter scale without a database in the critical path.

Architecture Components (7)

Operations Walked Through (3)

Implementation

Count-Min Sketch (add, estimate)
public class CountMinSketch {
    private final int depth;          // number of hash functions
    private final int width;          // counters per row (power of 2)
    private final long[][] counters;  // depth x width
    private final long[] seeds;

    public CountMinSketch(double epsilon, double delta) {
        this.width = Integer.highestOneBit((int) Math.ceil(Math.E / epsilon)) << 1;
        this.depth = (int) Math.ceil(Math.log(1.0 / delta));
        this.counters = new long[depth][width];
        this.seeds = new long[depth];
        Random r = new Random(0xC0FFEEL);
        for (int i = 0; i < depth; i++) seeds[i] = r.nextLong();
    }

    public void add(String term, long count) {
        for (int i = 0; i < depth; i++) {
            int idx = bucket(term, i);
            counters[i][idx] += count;
        }
    }

    /** Estimate is the min across rows; never under-counts, may over-count. */
    public long estimate(String term) {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < depth; i++) {
            min = Math.min(min, counters[i][bucket(term, i)]);
        }
        return min;
    }

    private int bucket(String term, int row) {
        long h = term.hashCode() * 0x9E3779B97F4A7C15L ^ seeds[row];
        h ^= (h >>> 33);
        return (int) (h & (width - 1));
    }
}
Top-K tracker (MinHeap by estimated count)
public class TopKTracker {
    private final int k;
    private final PriorityQueue<Entry> heap; // min-heap by count
    private final Map<String, Entry> byTerm;  // O(1) lookup for updates

    public TopKTracker(int k) {
        this.k = k;
        this.heap = new PriorityQueue<>(Comparator.comparingLong(e -> e.count));
        this.byTerm = new HashMap<>();
    }

    /** Offer a term with its current CMS estimate; cheaper than scanning the heap. */
    public void offer(String term, long estimate) {
        Entry existing = byTerm.get(term);
        if (existing != null) {
            heap.remove(existing);
            existing.count = estimate;
            heap.offer(existing);
            return;
        }
        if (heap.size() < k) {
            Entry e = new Entry(term, estimate);
            heap.offer(e); byTerm.put(term, e);
        } else if (heap.peek().count < estimate) {
            Entry evicted = heap.poll();
            byTerm.remove(evicted.term);
            Entry e = new Entry(term, estimate);
            heap.offer(e); byTerm.put(term, e);
        }
    }

    public List<Entry> snapshot() {
        return heap.stream()
                .sorted(Comparator.comparingLong((Entry e) -> e.count).reversed())
                .toList();
    }

    public static final class Entry {
        final String term; long count;
        Entry(String t, long c) { this.term = t; this.count = c; }
    }
}
StreamProcessor: Kafka consumer loop driving CMS + Top-K
public class StreamProcessor implements Runnable {
    private final Consumer<String, Tweet> consumer; // Kafka-style consumer
    private final CountMinSketch cms;
    private final TopKTracker topK;
    private final HotTopicCache cache;
    private final Clock clock;
    private Instant nextSnapshot;

    public StreamProcessor(Consumer<String, Tweet> consumer, CountMinSketch cms,
                           TopKTracker topK, HotTopicCache cache, Clock clock) {
        this.consumer = consumer; this.cms = cms; this.topK = topK;
        this.cache = cache; this.clock = clock;
        this.nextSnapshot = clock.instant().plusSeconds(10);
    }

    @Override public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            ConsumerRecords<String, Tweet> batch = consumer.poll(Duration.ofMillis(200));
            for (ConsumerRecord<String, Tweet> r : batch) {
                Tweet t = r.value();
                for (String tag : t.hashtags()) {
                    cms.add(tag, 1);
                    topK.offer(tag, cms.estimate(tag));
                }
            }
            Instant now = clock.instant();
            if (!now.isBefore(nextSnapshot)) {
                cache.publish(now, topK.snapshot());
                consumer.commitAsync();
                nextSnapshot = now.plusSeconds(10);
            }
        }
    }
}

Key design decisions & trade-offs

Interview follow-ups

Related