Stock Exchange System Design Interview Question
Problem: Design a stock exchange with a matching engine that processes limit orders, a market-data publisher that broadcasts book updates at 100K ticks/s, and a batched clearing/settlement pipeline.
Overview
A stock exchange is the hardest latency-sensitive system in common system-design interviews. It is not a CRUD app: the matching engine is a single logical process that must accept millions of orders per second, match them against the book in strict price-time priority, emit fills atomically, and publish a fan-out stream of book deltas to every connected client within microseconds. The interesting tension is that matching is inherently sequential — you cannot parallelize 'match the next incoming order against the best bid' without introducing reordering that changes who gets filled — so the system scales by partitioning on symbol rather than by sharding the book itself. Correctness is as brutal as in a payment system, because a lost fill or a double fill creates positions that don't exist. Latency is simultaneously brutal because HFT firms colocate in the same rack and measure competitive advantage in nanoseconds. The design below sketches a single-symbol matching engine, the price-time priority book that feeds it, and the market-data fan-out that takes the output to the world.
Summary
A low-latency exchange built around a SINGLE-THREADED, in-memory matching engine per symbol. The dominant design choice is determinism over raw throughput: each symbol gets one thread owning one in-memory order book, which makes the match order provably reproducible (critical for audit and dispute resolution) and eliminates lock contention on the hot path. Sharding is by symbol — not thread-per-order — so the exchange scales by adding more symbols to more engines. The market-data publisher fans out book updates via UDP multicast (exchange-native) or Kafka (cloud-native) to thousands of subscribers. Clearing and settlement are intentionally BATCHED (T+1 or T+2) so the hot path stays fast while back-office systems handle the heavyweight legal transfer of ownership.
Requirements
Functional
- Accept limit orders (buy/sell, price, quantity, client_order_id) and match them against the resting book in price-time priority
- Accept market orders that consume liquidity at the best available prices until filled or exhausted
- Support cancel and cancel-replace on resting orders by client_order_id or exchange_order_id
- Emit an execution report (fill, partial fill, reject, cancel) per state change to the owning client
- Publish market data: top-of-book quotes, full order book snapshots, and incremental book deltas
- Halt, resume, and auction open/close per symbol under operator or circuit-breaker control
Non-functional
- Tail latency: p99 order-ack under 100 microseconds intra-rack; market data fanout under 10 microseconds beyond the matcher
- Determinism: two replicas fed the same sequenced input must produce byte-identical output
- Zero order loss: every accepted order is durably sequenced before it enters the book
- Throughput per symbol: 1M+ orders per second sustained on a single matching engine core
- Availability via hot-standby replica with sequence-number-driven failover; no split brain
- Full audit trail of every order and every state change retained per regulatory rules
Capacity Assumptions
- ~5000 listed symbols, of which ~500 are actively traded
- Peak 100K orders/s across all symbols; hot symbols handle 5K–10K orders/s each
- Peak 100K market-data ticks/s (every book change fans out)
- Thousands of subscribers to market data (brokers, HFT firms, retail data vendors)
- 7-year regulatory retention (SEC Rule 17a-4, MiFID II) on every order + execution + cancel
Back-of-Envelope Estimates
- Matching engine budget: in-memory O(log N) per match on a skip-list/priority-queue; p99 <100μs on hot path
- Order gateway: 100K orders/s * ~500B per order ≈ 50 MB/s ingress, well within one NIC
- Execution log: 100K executions/s peak * ~200B ≈ 20 MB/s sustained → ~1.7 TB/day; Kafka tiered to S3
- Market data fanout: 100K ticks/s * 100B * 1000 subscribers ≈ 10 GB/s aggregate egress — needs multicast
- Historical / TSDB: bar aggregation (1s, 1m, 1h, 1d) stored long-term; raw tape in cold storage
High-level architecture
The exchange is partitioned by symbol — AAPL and TSLA run on different matching engines, each single-threaded on a pinned CPU core. Orders arrive at gateway servers over binary FIX or a proprietary protocol; the gateway authenticates, checks pre-trade risk (exposure, credit, fat-finger limits), and forwards to a Sequencer. The Sequencer is the single serialization point: it assigns a monotonic seq_no to every order and appends it to a replicated log (Aeron, LBM, or a custom multicast log) before returning. The matching engine for a symbol consumes its partition of the sequenced log, applies each order to an in-memory OrderBook, and emits a stream of events: OrderAccepted, Trade, OrderCanceled, BookDelta. A hot-standby replica consumes the same sequenced log and maintains an identical book; if the primary fails, the standby takes over at the last acknowledged seq_no with no replay ambiguity. Market-data publishers consume the event stream and fan it out over multicast to subscribers — top-of-book goes on a Level-1 feed, full depth changes on a Level-2 feed. Persistence is not on the matching-engine critical path: the sequenced log is the durable record, and end-of-day the engine snapshots the book to disk for compliance and restart. Clock discipline matters — all gateways, sequencers, and engines run PTP-synced within microseconds so regulatory timestamps are trustworthy. Capacity is scaled by adding more symbols per sequencer shard, not by splitting a single symbol's book, because price-time priority across a split book is ill-defined.
Architecture Components (9)
- Client / Broker (client) — Broker or algorithmic trading system that submits orders on behalf of end investors.
- Load Balancer (lb) — L4 TCP load balancer (or a FIX-aware session router) fronting the order gateway fleet.
- Order Gateway (api) — FIX/binary protocol endpoint. Parses incoming orders, assigns a monotonic exchange order_id, and forwards to validation.
- Order Validation / Risk (api) — Pre-trade risk checks: symbol tradable, price sanity, firm credit/position limits, short-sale locate.
- Matching Engine (per-symbol shard) (worker) — In-memory, SINGLE-THREADED matching engine. One thread owns one symbol's order book. Match logic is price-time priority.
- Execution Log / Trade Capture (stream) — Append-only Kafka log of every execution and book event. The audit tape.
- Market Data Publisher (stream) — Fans out top-of-book, depth, and trade ticks to thousands of subscribers via UDP multicast (exchange-native) or Kafka (cloud-native).
- Clearing / Settlement (worker) — Batched end-of-day job that nets executions by firm and submits to the clearinghouse (DTCC/NSCC in US, LCH in EU). T+1 in US equities as of 2024.
- Historical / TSDB (nosql) — Time-series store for bar aggregations (1s, 1m, 1h, 1d) and the raw execution tape. Serves charts and analytics.
Operations Walked Through (4)
- place-limit-order — Broker submits a BUY limit order for AAPL at 186.25. The book has resting sells at 186.25 and lower — the order crosses, matches partially, and the remainder rests on the book.
- cancel-order — Broker sends OrderCancelRequest for an order resting on the book. Engine locates it by exchange order_id, removes it, emits a CancelAck.
- market-data-tick — Every accepted, canceled, or filled order mutates the book. The publisher fans out the resulting tick to ~1200 subscribers via UDP multicast. This is the hottest path in the entire design.
- clearing-batch — After market close, the clearing job tails the day's executions, nets them by firm, and submits to the clearinghouse for T+1 settlement.
Implementation
public final class OrderBook {
// Buy side: highest price first. Sell side: lowest price first.
private final TreeMap<Long, PriceLevel> bids = new TreeMap<>(Comparator.reverseOrder());
private final TreeMap<Long, PriceLevel> asks = new TreeMap<>();
private final Long2ObjectOpenHashMap<Order> byId = new Long2ObjectOpenHashMap<>();
public void add(Order o) {
TreeMap<Long, PriceLevel> side = o.isBuy() ? bids : asks;
PriceLevel lvl = side.computeIfAbsent(o.priceTicks(), PriceLevel::new);
lvl.append(o); // FIFO within a price level = time priority
byId.put(o.id(), o);
}
public Order cancel(long orderId) {
Order o = byId.remove(orderId);
if (o == null) return null;
TreeMap<Long, PriceLevel> side = o.isBuy() ? bids : asks;
PriceLevel lvl = side.get(o.priceTicks());
lvl.remove(o);
if (lvl.isEmpty()) side.remove(o.priceTicks());
return o;
}
public Long bestBid() { return bids.isEmpty() ? null : bids.firstKey(); }
public Long bestAsk() { return asks.isEmpty() ? null : asks.firstKey(); }
TreeMap<Long, PriceLevel> oppositeSide(boolean isBuy) { return isBuy ? asks : bids; }
}
public final class MatchingEngine {
private final OrderBook book;
private final EventSink out;
/** Applies one sequenced order. Single-threaded: caller guarantees ordering. */
public void onOrder(Order incoming) {
out.accept(new OrderAccepted(incoming.id(), incoming.seqNo()));
TreeMap<Long, PriceLevel> opp = book.oppositeSide(incoming.isBuy());
long remaining = incoming.qty();
Iterator<Map.Entry<Long, PriceLevel>> it = opp.entrySet().iterator();
while (remaining > 0 && it.hasNext()) {
Map.Entry<Long, PriceLevel> e = it.next();
long bestPrice = e.getKey();
if (!crosses(incoming, bestPrice)) break; // price no longer crosses
PriceLevel lvl = e.getValue();
while (remaining > 0 && !lvl.isEmpty()) {
Order resting = lvl.head();
long fillQty = Math.min(remaining, resting.qty());
out.accept(new Trade(incoming.id(), resting.id(), bestPrice, fillQty, System.nanoTime()));
resting.reduce(fillQty);
remaining -= fillQty;
if (resting.qty() == 0) { lvl.removeHead(); book.forget(resting.id()); }
}
if (lvl.isEmpty()) it.remove();
}
if (remaining > 0) {
incoming.reduceTo(remaining);
book.add(incoming); // rest the unmatched remainder
out.accept(new BookDelta(incoming.isBuy(), incoming.priceTicks(), +remaining));
}
}
private static boolean crosses(Order taker, long restingPrice) {
return taker.isBuy() ? taker.priceTicks() >= restingPrice
: taker.priceTicks() <= restingPrice;
}
}
public final class MarketDataPublisher implements EventSink {
private final MulticastChannel l1; // top-of-book
private final MulticastChannel l2; // full depth
private final ByteBuffer scratch = ByteBuffer.allocateDirect(256);
private long lastBestBid = Long.MIN_VALUE;
private long lastBestAsk = Long.MAX_VALUE;
@Override
public void accept(ExchangeEvent evt) {
switch (evt) {
case Trade t -> {
emit(l1, encodeTrade(t));
emit(l2, encodeTrade(t));
}
case BookDelta d -> {
emit(l2, encodeDelta(d));
maybeEmitTopOfBook(d);
}
case OrderAccepted ignored -> {} // not market data
default -> {}
}
}
private void maybeEmitTopOfBook(BookDelta d) {
long bb = d.bestBidAfter();
long ba = d.bestAskAfter();
if (bb != lastBestBid || ba != lastBestAsk) {
lastBestBid = bb;
lastBestAsk = ba;
emit(l1, encodeTopOfBook(bb, ba, d.timestampNs()));
}
}
private void emit(MulticastChannel ch, ByteBuffer frame) {
try { ch.send(frame, ch.group()); } catch (IOException e) { throw new UncheckedIOException(e); }
}
}
Key design decisions & trade-offs
- Single-threaded matching per symbol vs parallel matching — Chosen: Single-threaded, one symbol per CPU core, pinned and isolated from the OS scheduler. Price-time priority requires a single global order over events in a symbol. Any concurrent matching reintroduces reordering that violates the priority rule. We scale horizontally across symbols, not within them — a single core sustains 1M+ orders/s in practice, which is plenty per symbol.
- Durability: synchronous disk write vs replicated log — Chosen: Replicated sequenced log (Aeron/custom multicast) with quorum ack, no disk on the critical path. A fsync-to-disk on every order costs milliseconds, which is an eternity at exchange latency. Replicating to N+1 in-memory followers over low-latency transport gives the same durability guarantee (survives any single failure) in microseconds. The log is snapshotted to disk asynchronously for restart.
- Strong consistency vs partition tolerance during a network split — Chosen: CP: if the primary loses quorum with its replicas, it stops accepting orders. Two matching engines accepting orders during a partition would produce conflicting fills that can never be reconciled. Halting is expensive (market closed for that symbol for seconds) but recoverable; split brain is not.
- Multicast market data vs per-client TCP streams — Chosen: UDP multicast for all subscribers on the same trading venue, TCP retransmit gap-fill server for recovery. 1000 subscribers on TCP means 1000 copies of every delta out of the publisher NIC. Multicast emits once and the switch fabric fans out. Cost is UDP loss — handled by a sequence-numbered protocol with a gap-fill service for missed packets.
- In-memory book vs persistent book per event — Chosen: In-memory book, end-of-day snapshot plus continuous event log. The sequenced log IS the durable record. Rebuilding the book on restart means replaying the log from the last snapshot, which takes seconds even for full-day volume. Writing the book to durable storage on every event would destroy the latency budget.
Interview follow-ups
- Opening and closing auctions with an uncrossing algorithm that maximizes executed volume
- Self-trade prevention: cancel or reprice orders that would match against the same beneficiary
- Iceberg orders with a hidden reserve quantity that refills the visible displayed size
- Circuit breakers: auto-halt per symbol or market-wide on large adverse price moves
- Regulatory reporting (MiFID II, CAT) with nanosecond-precision order lifecycle capture
- Cross-datacenter disaster recovery with consistent cutover and no duplicate fills