How do you design Scale from Zero to Millions of Users for a system design interview?

Start from requirements, estimate scale, draw the high-level architecture, explain the main data flow, identify bottlenecks, and compare trade-offs. This walkthrough covers Scale from Zero to Millions of Users with components, operations, capacity estimates, and an interactive simulator.

What should I discuss in a Scale from Zero to Millions of Users system design answer?

Cover functional requirements, non-functional requirements, APIs, data model, key services, storage choices, caching, queues or streams where relevant, failure handling, observability, and scaling trade-offs.

Can I practice Scale from Zero to Millions of Users interactively?

Yes. The page includes an interactive browser-based simulator for Scale from Zero to Millions of Users, including an architecture diagram, step-by-step request flow, component swap, and stress simulation.

Scale from Zero to Millions of Users System Design Interview Question

By Rahul Kumar · Senior Software Engineer · Updated May 2026 · 11 components · 3 operations ·Source: Alex Xu, System Design Interview Vol 1, Chapter 1

Problem: Evolve a single-server web app into a multi-tier, multi-region architecture that serves millions of users with low latency and high availability.

Overview

Scaling from zero to millions of users is less a single decision than a sequence of forced moves: each bottleneck the traffic exposes pushes you one step further down a well-worn path. You start with a single box running web, app, and database on the same machine, and every subsequent iteration peels a responsibility off that box. The first split is usually web tier from DB tier; next the web tier goes stateless by moving sessions into a shared store so autoscaling can add and drop instances freely; then caching arrives to keep hot reads off the primary; then read replicas, a CDN, a message queue, and finally sharding by user id and geo-replication for disaster recovery. The useful mental model is not a blueprint but a ladder: each rung solves the problem the previous rung created. This page walks the ladder end-to-end and shows the Java glue you need at each step so the evolution feels concrete instead of hand-wavy.

Scale from Zero to Millions of Users — Interactive Simulator

Runs fully client-side in your browser; no sign-up. Or open full screen →

Launch the interactive walkthrough for Scale from Zero to Millions of Users — animated architecture diagram, step-by-step flow with real payloads, component swap, and a discrete-event stress simulator.

Summary

A canonical read-heavy web application (about 10:1 read/write). We take the mature end-state that Alex Xu's chapter converges on: geo-DNS routes users to the nearest data center; an L7 load balancer terminates TLS and spreads traffic across a stateless web tier; session state is evicted from the web tier into a shared NoSQL store so any server can handle any request (enabling autoscaling); a Redis cache absorbs the hot read set; the SQL store is split into a master for writes and slave replicas for reads; a CDN fronts static assets; a message queue decouples slow work (photo processing, emails) to asynchronous worker pools; the data tier ultimately shards by user id; logging/metrics/automation run across everything. The dominant tradeoff is consistency vs latency — reads from replicas and CDN edges can be seconds stale, so write paths must invalidate cache entries and accept replica lag on subsequent reads.

Requirements

Functional

Serve HTTP requests for a read-heavy web workload with roughly a 10:1 read/write ratio
Support user signup, login, and session-backed authenticated browsing
Return personalised content whose hot working set fits comfortably in cache
Upload and serve user-generated media through a CDN-fronted object store
Offload slow work (thumbnailing, transactional email, push) to asynchronous workers
Expose health and readiness endpoints for the orchestrator and load balancer
Provide per-request and per-service metrics plus centralized logs for operators

Non-functional

Availability target of 99.99% for the read path, 99.9% for the write path
p99 read latency under 200 ms at the edge, under 50 ms for cache hits
Horizontal scale from one instance to thousands without code changes
Survive the loss of an entire availability zone with no data loss
Cost scales sub-linearly with traffic thanks to caching and CDN offload
No single stateful component is a hard SPOF once past the MVP stage

Capacity Assumptions

5M DAU, ~20 page views/user/day → 100M page views/day
10:1 read:write ratio
Session/profile row ~2 KB; hot working set ~20% of users
p95 end-to-end latency target ≤ 200ms globally
99.95% monthly availability target (≈ 22 min/month downtime budget)

Back-of-Envelope Estimates

Reads: 100M / 86400 ≈ 1160 QPS avg, peak 3–5x ≈ 5K QPS
Writes: 116 QPS avg, peak ≈ 500 QPS
Cache working set: 5M * 20% * 2 KB ≈ 2 GB (fits comfortably in a single Redis node; use a 3-node replicated cluster for HA)
Primary DB write load ≈ 500 QPS peak — well within a single well-tuned Postgres/MySQL primary before sharding is needed
Static asset egress via CDN offloads ~80% of bytes — origin bandwidth bill drops ~5x
Async job volume (photo resize, email send, etc.) ~200/s off the synchronous path into the message queue; worker pool sized for ~2x burst absorption
Multi-DC geoDNS split: ~60/40 US-East / US-West in steady state; 100% to survivor on DC outage

High-level architecture

Traffic enters through geo-DNS, which steers each user to the nearest regional edge. A layer-7 load balancer terminates TLS and fans requests across a stateless web tier running in multiple availability zones; any instance can serve any request because sessions live in a shared NoSQL store rather than local memory. The web tier talks to a Redis cluster for hot reads using a cache-aside pattern: miss, load from the primary, populate, set a TTL. Writes go to the SQL master, which asynchronously replicates to read replicas; the data-source router in the web app directs SELECTs to replicas and INSERT/UPDATE/DELETE to the master, accepting a small window of replica lag. Static assets and user media sit behind a CDN that pulls from object storage, so the application servers never touch image bytes. Slow or bursty work — photo processing, webhooks, transactional email — is published to a message queue and drained by a separate worker fleet that can scale independently of the web tier. As data grows, the SQL store is sharded by user id; as the business goes global, the entire stack is cloned into additional regions with cross-region replication providing disaster recovery. Observability is a first-class citizen from day one: structured logs to a central pipeline, RED-style metrics scraped per instance, and a health endpoint the load balancer polls to eject bad pods within seconds.

Architecture Components (11)

Client (Browser / Mobile) (client) — Browser or native app that resolves DNS, fetches static assets from the CDN, and calls the API for dynamic content.
DNS (Route53 / Cloud DNS) (api) — Geo-aware DNS that returns the closest healthy region's load-balancer IP.
CDN (CloudFront / Cloudflare) (cdn) — Global edge cache serving static and semi-static assets close to users.
Load Balancer (lb) — L7 load balancer terminating TLS and spreading dynamic traffic across stateless web replicas.
Web Tier (Stateless App Servers) (api) — Stateless Node/Go/Java servers rendering dynamic pages and API responses.
Redis Cache (cache) — In-memory KV cache fronting the DB, holding the hot working set.
DB Primary / Master (SQL) (sql) — Single master Postgres/MySQL node owning all writes and acting as replication source for the slaves, per the chapter's master/slave replication model.
DB Read Replica (Slave) (sql) — Read-only slave streaming from the master; serves the read tier and analytics.
Session / NoSQL Store (nosql) — Shared out-of-tier session and profile store that lets the web tier be stateless — the pivot point of the chapter's stateless-web-tier design.
Message Queue (queue) — Durable buffer that decouples producers (web tier) from consumers (workers); the chapter's answer to 'scale components independently'.
Worker Pool (worker) — Consumer processes that pull jobs off the queue and run them asynchronously — photo processing, emails, webhook fanout.

Operations Walked Through (3)

read-hot — The common case: a dynamic page request whose underlying row is already cached in Redis. p95 ~10ms.
read-miss — Same read, but the key is not in Redis. Web tier reads from a replica, backfills the cache, and returns. p95 ~25ms.
write-invalidate — POST /api/profile/42 with a new bio. Web tier UPDATEs the primary, deletes the cache key, and returns. The next read takes a miss and re-warms from the primary.

Implementation

Spring Boot application skeleton

package com.example.app;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cache.annotation.EnableCaching;
import org.springframework.scheduling.annotation.EnableAsync;

@SpringBootApplication
@EnableCaching
@EnableAsync
public class WebApp {
    public static void main(String[] args) {
        // -Dspring.profiles.active=prod selects the multi-region config
        SpringApplication.run(WebApp.class, args);
    }
}

HealthController with liveness and readiness

package com.example.app.health;

import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;
import java.util.concurrent.atomic.AtomicBoolean;

@RestController
@RequestMapping("/health")
public class HealthController {
    // liveness flips to false only on an unrecoverable internal error
    private final AtomicBoolean alive = new AtomicBoolean(true);
    // readiness flips to false during warm-up or while draining on shutdown
    private final AtomicBoolean ready = new AtomicBoolean(false);

    @GetMapping("/live")
    public ResponseEntity<String> live() {
        return alive.get() ? ResponseEntity.ok("OK") : ResponseEntity.status(500).body("DEAD");
    }

    @GetMapping("/ready")
    public ResponseEntity<String> ready() {
        return ready.get() ? ResponseEntity.ok("READY") : ResponseEntity.status(503).body("WARMING");
    }

    public void markReady()   { ready.set(true); }
    public void beginDrain()  { ready.set(false); } // LB stops sending new traffic
    public void fatalError()  { alive.set(false); } // orchestrator restarts the pod
}

Cache-aside service with Redis

package com.example.app.user;

import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Service;
import java.time.Duration;

@Service
public class UserProfileService {
    private static final Duration TTL = Duration.ofMinutes(10);
    private final StringRedisTemplate redis;
    private final UserRepository repo;
    private final ObjectMapper json = new ObjectMapper();

    public UserProfileService(StringRedisTemplate redis, UserRepository repo) {
        this.redis = redis;
        this.repo = repo;
    }

    public UserProfile get(long userId) throws Exception {
        String key = "user:" + userId;
        String cached = redis.opsForValue().get(key);
        if (cached != null) return json.readValue(cached, UserProfile.class); // hit

        UserProfile fresh = repo.findById(userId);              // miss -> DB
        if (fresh != null) {
            redis.opsForValue().set(key, json.writeValueAsString(fresh), TTL);
        }
        return fresh;
    }

    public void update(UserProfile p) {
        repo.save(p);
        redis.delete("user:" + p.getId()); // invalidate; next read repopulates
    }
}

Read-replica DataSource routing

package com.example.app.db;

import org.springframework.jdbc.datasource.lookup.AbstractRoutingDataSource;
import javax.sql.DataSource;
import java.util.Map;

public class ReadWriteRoutingDataSource extends AbstractRoutingDataSource {
    public enum Route { MASTER, REPLICA }
    private static final ThreadLocal<Route> CTX = ThreadLocal.withInitial(() -> Route.MASTER);

    public static void useReplica() { CTX.set(Route.REPLICA); }
    public static void useMaster()  { CTX.set(Route.MASTER); }
    public static void clear()      { CTX.remove(); }

    @Override
    protected Object determineCurrentLookupKey() { return CTX.get(); }

    public static ReadWriteRoutingDataSource build(DataSource master, DataSource replica) {
        ReadWriteRoutingDataSource ds = new ReadWriteRoutingDataSource();
        ds.setTargetDataSources(Map.of(Route.MASTER, master, Route.REPLICA, replica));
        ds.setDefaultTargetDataSource(master);
        ds.afterPropertiesSet();
        return ds;
    }
}

// @Transactional(readOnly=true) methods call useReplica() in an aspect;
// writes always hit MASTER. Replica lag is accepted on the read path.

Key design decisions & trade-offs

Session storage — Chosen: Move sessions out of the web tier into a shared store (Redis or NoSQL). Sticky sessions work until an instance dies; shared sessions let any server handle any request, which is the prerequisite for autoscaling and rolling deploys.
Read scaling — Chosen: Master-replica SQL rather than jumping straight to NoSQL. A 10:1 read/write ratio is served beautifully by replicas plus a cache, and you keep transactional writes. NoSQL is the right move only once the master can no longer absorb writes.
Async workload isolation — Chosen: Message queue + worker fleet instead of threads inside the web process. Long-running jobs inside the request process couple user latency to worker health and make autoscaling math awkward; a queue lets web and workers scale on independent signals.
Static asset delivery — Chosen: CDN in front of object storage. Serving images from the app tier wastes egress and CPU, and never benefits from edge caching. The CDN is almost free compared to the bandwidth it saves.
Multi-region topology — Chosen: Active-passive with async replication before active-active. Active-active needs conflict resolution and careful write routing; active-passive gives you DR with far less complexity and is sufficient until a single region is provably saturated.
Sharding trigger — Chosen: Shard by user id only after vertical scaling + replicas are exhausted. Sharding pays a permanent complexity tax (cross-shard joins, rebalancing). Deferring it until the data really does not fit on one master keeps the codebase simple for longer.

Interview follow-ups

How do you warm a new cache node so it does not stampede the DB when added?
How do you keep replica lag bounded during a write spike, and what do you surface to the user when you cannot?
What is the graceful-shutdown sequence so in-flight requests drain before Kubernetes kills the pod?
How do you roll out a schema migration across sharded databases without downtime?
How do you pick a shard key that will not force resharding in 18 months?
How do you fail over traffic between regions, and how do you test that drill without breaking real users?