Scale from Zero to Millions of Users System Design Interview Question
Problem: Evolve a single-server web app into a multi-tier, multi-region architecture that serves millions of users with low latency and high availability.
Overview
Scaling from zero to millions of users is less a single decision than a sequence of forced moves: each bottleneck the traffic exposes pushes you one step further down a well-worn path. You start with a single box running web, app, and database on the same machine, and every subsequent iteration peels a responsibility off that box. The first split is usually web tier from DB tier; next the web tier goes stateless by moving sessions into a shared store so autoscaling can add and drop instances freely; then caching arrives to keep hot reads off the primary; then read replicas, a CDN, a message queue, and finally sharding by user id and geo-replication for disaster recovery. The useful mental model is not a blueprint but a ladder: each rung solves the problem the previous rung created. This page walks the ladder end-to-end and shows the Java glue you need at each step so the evolution feels concrete instead of hand-wavy.
Summary
A canonical read-heavy web application (about 10:1 read/write). We take the mature end-state that Alex Xu's chapter converges on: geo-DNS routes users to the nearest data center; an L7 load balancer terminates TLS and spreads traffic across a stateless web tier; session state is evicted from the web tier into a shared NoSQL store so any server can handle any request (enabling autoscaling); a Redis cache absorbs the hot read set; the SQL store is split into a master for writes and slave replicas for reads; a CDN fronts static assets; a message queue decouples slow work (photo processing, emails) to asynchronous worker pools; the data tier ultimately shards by user id; logging/metrics/automation run across everything. The dominant tradeoff is consistency vs latency — reads from replicas and CDN edges can be seconds stale, so write paths must invalidate cache entries and accept replica lag on subsequent reads.
Requirements
Functional
- Serve HTTP requests for a read-heavy web workload with roughly a 10:1 read/write ratio
- Support user signup, login, and session-backed authenticated browsing
- Return personalised content whose hot working set fits comfortably in cache
- Upload and serve user-generated media through a CDN-fronted object store
- Offload slow work (thumbnailing, transactional email, push) to asynchronous workers
- Expose health and readiness endpoints for the orchestrator and load balancer
- Provide per-request and per-service metrics plus centralized logs for operators
Non-functional
- Availability target of 99.99% for the read path, 99.9% for the write path
- p99 read latency under 200 ms at the edge, under 50 ms for cache hits
- Horizontal scale from one instance to thousands without code changes
- Survive the loss of an entire availability zone with no data loss
- Cost scales sub-linearly with traffic thanks to caching and CDN offload
- No single stateful component is a hard SPOF once past the MVP stage
Capacity Assumptions
- 5M DAU, ~20 page views/user/day → 100M page views/day
- 10:1 read:write ratio
- Session/profile row ~2 KB; hot working set ~20% of users
- p95 end-to-end latency target ≤ 200ms globally
- 99.95% monthly availability target (≈ 22 min/month downtime budget)
Back-of-Envelope Estimates
- Reads: 100M / 86400 ≈ 1160 QPS avg, peak 3–5x ≈ 5K QPS
- Writes: 116 QPS avg, peak ≈ 500 QPS
- Cache working set: 5M * 20% * 2 KB ≈ 2 GB (fits comfortably in a single Redis node; use a 3-node replicated cluster for HA)
- Primary DB write load ≈ 500 QPS peak — well within a single well-tuned Postgres/MySQL primary before sharding is needed
- Static asset egress via CDN offloads ~80% of bytes — origin bandwidth bill drops ~5x
- Async job volume (photo resize, email send, etc.) ~200/s off the synchronous path into the message queue; worker pool sized for ~2x burst absorption
- Multi-DC geoDNS split: ~60/40 US-East / US-West in steady state; 100% to survivor on DC outage
High-level architecture
Traffic enters through geo-DNS, which steers each user to the nearest regional edge. A layer-7 load balancer terminates TLS and fans requests across a stateless web tier running in multiple availability zones; any instance can serve any request because sessions live in a shared NoSQL store rather than local memory. The web tier talks to a Redis cluster for hot reads using a cache-aside pattern: miss, load from the primary, populate, set a TTL. Writes go to the SQL master, which asynchronously replicates to read replicas; the data-source router in the web app directs SELECTs to replicas and INSERT/UPDATE/DELETE to the master, accepting a small window of replica lag. Static assets and user media sit behind a CDN that pulls from object storage, so the application servers never touch image bytes. Slow or bursty work — photo processing, webhooks, transactional email — is published to a message queue and drained by a separate worker fleet that can scale independently of the web tier. As data grows, the SQL store is sharded by user id; as the business goes global, the entire stack is cloned into additional regions with cross-region replication providing disaster recovery. Observability is a first-class citizen from day one: structured logs to a central pipeline, RED-style metrics scraped per instance, and a health endpoint the load balancer polls to eject bad pods within seconds.
Architecture Components (11)
- Client (Browser / Mobile) (client) — Browser or native app that resolves DNS, fetches static assets from the CDN, and calls the API for dynamic content.
- DNS (Route53 / Cloud DNS) (api) — Geo-aware DNS that returns the closest healthy region's load-balancer IP.
- CDN (CloudFront / Cloudflare) (cdn) — Global edge cache serving static and semi-static assets close to users.
- Load Balancer (lb) — L7 load balancer terminating TLS and spreading dynamic traffic across stateless web replicas.
- Web Tier (Stateless App Servers) (api) — Stateless Node/Go/Java servers rendering dynamic pages and API responses.
- Redis Cache (cache) — In-memory KV cache fronting the DB, holding the hot working set.
- DB Primary / Master (SQL) (sql) — Single master Postgres/MySQL node owning all writes and acting as replication source for the slaves, per the chapter's master/slave replication model.
- DB Read Replica (Slave) (sql) — Read-only slave streaming from the master; serves the read tier and analytics.
- Session / NoSQL Store (nosql) — Shared out-of-tier session and profile store that lets the web tier be stateless — the pivot point of the chapter's stateless-web-tier design.
- Message Queue (queue) — Durable buffer that decouples producers (web tier) from consumers (workers); the chapter's answer to 'scale components independently'.
- Worker Pool (worker) — Consumer processes that pull jobs off the queue and run them asynchronously — photo processing, emails, webhook fanout.
Operations Walked Through (3)
- read-hot — The common case: a dynamic page request whose underlying row is already cached in Redis. p95 ~10ms.
- read-miss — Same read, but the key is not in Redis. Web tier reads from a replica, backfills the cache, and returns. p95 ~25ms.
- write-invalidate — POST /api/profile/42 with a new bio. Web tier UPDATEs the primary, deletes the cache key, and returns. The next read takes a miss and re-warms from the primary.
Implementation
package com.example.app;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cache.annotation.EnableCaching;
import org.springframework.scheduling.annotation.EnableAsync;
@SpringBootApplication
@EnableCaching
@EnableAsync
public class WebApp {
public static void main(String[] args) {
// -Dspring.profiles.active=prod selects the multi-region config
SpringApplication.run(WebApp.class, args);
}
}
package com.example.app.health;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;
import java.util.concurrent.atomic.AtomicBoolean;
@RestController
@RequestMapping("/health")
public class HealthController {
// liveness flips to false only on an unrecoverable internal error
private final AtomicBoolean alive = new AtomicBoolean(true);
// readiness flips to false during warm-up or while draining on shutdown
private final AtomicBoolean ready = new AtomicBoolean(false);
@GetMapping("/live")
public ResponseEntity<String> live() {
return alive.get() ? ResponseEntity.ok("OK") : ResponseEntity.status(500).body("DEAD");
}
@GetMapping("/ready")
public ResponseEntity<String> ready() {
return ready.get() ? ResponseEntity.ok("READY") : ResponseEntity.status(503).body("WARMING");
}
public void markReady() { ready.set(true); }
public void beginDrain() { ready.set(false); } // LB stops sending new traffic
public void fatalError() { alive.set(false); } // orchestrator restarts the pod
}
package com.example.app.user;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Service;
import java.time.Duration;
@Service
public class UserProfileService {
private static final Duration TTL = Duration.ofMinutes(10);
private final StringRedisTemplate redis;
private final UserRepository repo;
private final ObjectMapper json = new ObjectMapper();
public UserProfileService(StringRedisTemplate redis, UserRepository repo) {
this.redis = redis;
this.repo = repo;
}
public UserProfile get(long userId) throws Exception {
String key = "user:" + userId;
String cached = redis.opsForValue().get(key);
if (cached != null) return json.readValue(cached, UserProfile.class); // hit
UserProfile fresh = repo.findById(userId); // miss -> DB
if (fresh != null) {
redis.opsForValue().set(key, json.writeValueAsString(fresh), TTL);
}
return fresh;
}
public void update(UserProfile p) {
repo.save(p);
redis.delete("user:" + p.getId()); // invalidate; next read repopulates
}
}
package com.example.app.db;
import org.springframework.jdbc.datasource.lookup.AbstractRoutingDataSource;
import javax.sql.DataSource;
import java.util.Map;
public class ReadWriteRoutingDataSource extends AbstractRoutingDataSource {
public enum Route { MASTER, REPLICA }
private static final ThreadLocal<Route> CTX = ThreadLocal.withInitial(() -> Route.MASTER);
public static void useReplica() { CTX.set(Route.REPLICA); }
public static void useMaster() { CTX.set(Route.MASTER); }
public static void clear() { CTX.remove(); }
@Override
protected Object determineCurrentLookupKey() { return CTX.get(); }
public static ReadWriteRoutingDataSource build(DataSource master, DataSource replica) {
ReadWriteRoutingDataSource ds = new ReadWriteRoutingDataSource();
ds.setTargetDataSources(Map.of(Route.MASTER, master, Route.REPLICA, replica));
ds.setDefaultTargetDataSource(master);
ds.afterPropertiesSet();
return ds;
}
}
// @Transactional(readOnly=true) methods call useReplica() in an aspect;
// writes always hit MASTER. Replica lag is accepted on the read path.
Key design decisions & trade-offs
- Session storage — Chosen: Move sessions out of the web tier into a shared store (Redis or NoSQL). Sticky sessions work until an instance dies; shared sessions let any server handle any request, which is the prerequisite for autoscaling and rolling deploys.
- Read scaling — Chosen: Master-replica SQL rather than jumping straight to NoSQL. A 10:1 read/write ratio is served beautifully by replicas plus a cache, and you keep transactional writes. NoSQL is the right move only once the master can no longer absorb writes.
- Async workload isolation — Chosen: Message queue + worker fleet instead of threads inside the web process. Long-running jobs inside the request process couple user latency to worker health and make autoscaling math awkward; a queue lets web and workers scale on independent signals.
- Static asset delivery — Chosen: CDN in front of object storage. Serving images from the app tier wastes egress and CPU, and never benefits from edge caching. The CDN is almost free compared to the bandwidth it saves.
- Multi-region topology — Chosen: Active-passive with async replication before active-active. Active-active needs conflict resolution and careful write routing; active-passive gives you DR with far less complexity and is sufficient until a single region is provably saturated.
- Sharding trigger — Chosen: Shard by user id only after vertical scaling + replicas are exhausted. Sharding pays a permanent complexity tax (cross-shard joins, rebalancing). Deferring it until the data really does not fit on one master keeps the codebase simple for longer.
Interview follow-ups
- How do you warm a new cache node so it does not stampede the DB when added?
- How do you keep replica lag bounded during a write spike, and what do you surface to the user when you cannot?
- What is the graceful-shutdown sequence so in-flight requests drain before Kubernetes kills the pod?
- How do you roll out a schema migration across sharded databases without downtime?
- How do you pick a shard key that will not force resharding in 18 months?
- How do you fail over traffic between regions, and how do you test that drill without breaking real users?