WhatsApp-style Messaging System Design Interview Question
Problem: Design a 1-to-1 and small group chat system (up to 100 members) with online presence, multi-device sync, and push notifications.
Overview
A WhatsApp-style messaging system at this scale handles roughly 2 billion messages per day with sub-second delivery, multi-device sync, and presence, all while respecting mobile battery and flaky networks. The problem is fundamentally different from request/response web apps: the server has to push to the client, not the other way around, so a persistent bidirectional channel is effectively mandatory. The design is shaped by three constants the book fixes: 50M daily active users, groups capped at 100 members, and chat history retained forever. Under the sizing assumptions used in the estimates below (20% of users online at once, 40 messages per user per day, about 200 bytes per message), those numbers drive roughly 10M concurrent WebSocket connections, about 150 TB of message storage per year, and a fleet of about 200 chat servers each holding 50K live sockets. Getting the message path right matters because every design tradeoff (protocol choice, storage engine, fanout model, offline handling) follows from the fact that users expect delivery to feel instant even when the recipient is offline, in airplane mode, or roaming behind a carrier NAT.
Summary
A real-time messaging system for 50M DAU, built around three tiers: (1) stateless API servers for signup/login/profile, (2) stateful chat servers holding long-lived WebSocket connections for real-time send/receive, (3) third-party push-notification integration for offline delivery. Service discovery (ZooKeeper) picks the best chat server for each client at login. Chat history lives in a key-value store (HBase / Cassandra) keyed by channel_id. The dominant design choice is WebSocket for bidirectional traffic (vs HTTP polling or long-polling): of the three, it is the only one that lets the server push to the client cheaply over a persistent connection. The main tradeoff is that connection affinity makes rolling deploys and failovers harder than in a stateless tier, which is why ZooKeeper coordinates chat-server health and capacity.
Requirements
Functional
- 1-to-1 chat and small groups up to 100 members
- Real-time delivery to online recipients plus offline push via APNs/FCM
- Multi-device sync: the same account on phone, laptop, and web sees identical history
- Delivery and read receipts (sent, delivered, read)
- Online presence with last-seen timestamp
- Message history retained forever and searchable
Non-functional
- P99 send-to-receive latency under 500 ms for online recipients
- 99.99% availability for send and receive
- At-least-once delivery with client-side dedupe via message_id
- Horizontally scalable to 10M+ concurrent WebSocket connections
- Graceful reconnect within 10 s after chat-server failover
- End-to-end encryption compatibility (Signal protocol layer above transport)
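The at-least-once requirement means a device can receive the same message twice (for example after a reconnect), so the client has to drop repeats by message_id. A minimal sketch of that client-side dedupe follows. The class name `SeenMessageFilter` and the bounded capacity are assumptions made for illustration; they are not part of the book's design.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical client-side helper: remembers recently seen message_ids so a
// redelivered message (at-least-once delivery) is rendered only once.
final class SeenMessageFilter {
    private final Map<Long, Boolean> seen;

    SeenMessageFilter(final int capacity) {
        // Insertion-ordered map that evicts the oldest id once capacity is exceeded,
        // so memory stays bounded on a long-running client.
        this.seen = new LinkedHashMap<Long, Boolean>(16, 0.75f, false) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, Boolean> eldest) {
                return size() > capacity;
            }
        };
    }

    // Returns true the first time a message_id is offered, false for duplicates
    // that are still inside the remembered window.
    synchronized boolean firstTimeSeen(long messageId) {
        return seen.putIfAbsent(messageId, Boolean.TRUE) == null;
    }
}
```

A client would call `firstTimeSeen(msg.getMessageId())` before rendering and skip the message when it returns false. The window only needs to cover the redelivery horizon, since anything older is already covered by the device's cur_max_message_id.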
Capacity Assumptions
- 50M DAU (book's target scale)
- ~60B messages/day across Messenger + WhatsApp (cited Facebook/WhatsApp number; an industry reference point, not this design's target)
- Max group size: 100 members (book's explicit constraint)
- Text messages up to 100,000 chars; no attachments in base design (extension)
- Chat history retained forever (book answer to 'how long?')
- Read:write ≈ 1:1 for 1-on-1 chat
- Per-connection memory ≈ 10 KB → 1M conns ≈ 10 GB on one box (book's sizing exercise)
Back-of-Envelope Estimates
- Concurrent WebSocket connections at 50M DAU, assuming 20% of users are online concurrently ≈ 10M
- Chat servers: 10M / 50K per server ≈ 200 chat servers (pre-HA)
- Message volume: 50M * 40 msgs/day ≈ 2B msgs/day → ~23K msgs/sec avg, ~70K peak
- Message storage (forever): 2B msgs/day * 200 bytes/msg * 365 days ≈ 146 TB/year (~150 TB); requires a horizontally scalable KV store
- Presence fanout bounded by small-group size (~100 subscribers per user)
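The arithmetic behind these estimates can be checked with a few lines of Java. This is only a sketch: the constant names are illustrative, and every input is one of the assumptions listed above (20% concurrency, 50K sockets per server, 40 messages per user per day, 200 bytes per message), not a measurement.

```java
// Reproduces the back-of-envelope numbers from the stated assumptions.
final class CapacityEstimate {
    static final long DAU = 50_000_000L;
    static final double ONLINE_FRACTION = 0.20;      // assumed share of DAU online at once
    static final long CONNS_PER_SERVER = 50_000L;    // assumed sockets per chat server
    static final long MSGS_PER_USER_PER_DAY = 40L;   // assumed send rate
    static final long BYTES_PER_MSG = 200L;          // assumed average stored size

    static long concurrentConnections() {
        return Math.round(DAU * ONLINE_FRACTION);
    }

    // Ceiling division: a partially filled server still counts as a server.
    static long chatServers() {
        return (concurrentConnections() + CONNS_PER_SERVER - 1) / CONNS_PER_SERVER;
    }

    static long messagesPerDay() {
        return DAU * MSGS_PER_USER_PER_DAY;
    }

    static long avgMessagesPerSecond() {
        return messagesPerDay() / 86_400L;
    }

    // Decimal terabytes (1 TB = 10^12 bytes) written per year, before replication.
    static double storageTbPerYear() {
        return messagesPerDay() * BYTES_PER_MSG * 365 / 1e12;
    }
}
```

With these inputs the methods give 10M connections, 200 servers, 2B messages per day, about 23K messages per second on average, and 146 TB per year. The ~70K peak figure corresponds to assuming a peak-to-average ratio of about 3.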
High-level architecture
Clients open a long-lived WebSocket (WSS) to a chat server chosen by ZooKeeper-backed service discovery at login; all short, stateless calls (signup, profile, friends, group CRUD) hit a separate stateless API tier fronted by the same load balancer. The split matters because the chat tier is sticky: once a socket lives on chat-server-42, breaking affinity means redialing and re-authenticating, whereas the API tier can round-robin freely and scale purely on CPU. When user A sends a message, the chat server obtains a globally unique message_id from a Snowflake-style ID generator, appends the row to a Cassandra/HBase table keyed by channel_id (so a conversation's history is one contiguous partition), and publishes one copy to each recipient's inbox in the message sync queue. For each online device, the recipient's chat server pushes the message over its WebSocket; for each offline device, the message stays staged in that user's inbox and an APNs/FCM push is emitted through the notification service to wake the app. Presence piggybacks on the heartbeat: clients send a keepalive every 5 s, and the presence service flips a user offline after a 30 s silence window so a brief tunnel or NAT rebind does not spam friends with status churn. Consistent hashing on user_id keeps a user's connections and inbox queue co-located, which is what keeps fanout cheap even at groups of 100.
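The Snowflake-style message_id mentioned above can be sketched as follows. The bit layout (41-bit timestamp, 10-bit worker id, 12-bit sequence) is the commonly cited Snowflake layout; the custom epoch, the class name, and the injected clock are assumptions made for illustration and testability, not details taken from the book.

```java
import java.util.function.LongSupplier;

// Sketch of a Snowflake-style generator: ids are unique per worker and sort by time.
final class SnowflakeIdGenerator {
    private static final long CUSTOM_EPOCH_MS = 1_600_000_000_000L; // hypothetical epoch
    private static final int WORKER_BITS = 10;
    private static final int SEQUENCE_BITS = 12;
    private static final long MAX_WORKER_ID = (1L << WORKER_BITS) - 1;
    private static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1;

    private final long workerId;
    private final LongSupplier clockMs; // injected so tests can control time
    private long lastMs = -1L;
    private long sequence = 0L;

    SnowflakeIdGenerator(long workerId, LongSupplier clockMs) {
        if (workerId < 0 || workerId > MAX_WORKER_ID) {
            throw new IllegalArgumentException("workerId out of range: " + workerId);
        }
        this.workerId = workerId;
        this.clockMs = clockMs;
    }

    synchronized long nextId() {
        long now = clockMs.getAsLong();
        if (now < lastMs) {
            throw new IllegalStateException("clock moved backwards");
        }
        if (now == lastMs) {
            sequence = (sequence + 1) & MAX_SEQUENCE;
            if (sequence == 0) {
                // Sequence exhausted within this millisecond: wait for the next one.
                while ((now = clockMs.getAsLong()) <= lastMs) {
                    Thread.onSpinWait();
                }
            }
        } else {
            sequence = 0;
        }
        lastMs = now;
        return ((now - CUSTOM_EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS))
                | (workerId << SEQUENCE_BITS)
                | sequence;
    }
}
```

In production the clock would be `System::currentTimeMillis` and each chat server (or ID service instance) would get a distinct worker id. Ids from different workers are only roughly time-ordered, which is one reason the book also considers a local per-channel sequence.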
Architecture Components (12)
- Client (iOS / Android / Web) (client) — Opens a persistent WebSocket to a chat server chosen via service discovery; uses HTTP for signup/login/profile. Each device tracks cur_max_message_id for multi-device sync.
- Load Balancer (lb) — Fronts both stateless API servers (HTTP) and stateful chat servers (WSS upgrade). Routes HTTP by path, WSS by consistent hash.
- API Servers (stateless) (api) — Stateless HTTP services for signup, login, user profile, friends list, group membership. Book's 'Stateless Services' tier.
- Service Discovery (ZooKeeper) (coordinator) — Registers all chat servers and picks the best one for each client at login based on geo, load, and health.
- Chat Servers (stateful WebSocket) (api) — Stateful WebSocket servers holding persistent connections. Book's only 'Stateful Service' — each client stays on one chat server as long as it is available.
- Message ID Generator (id-generator) — Produces unique, time-sortable message_ids. Book considers three options: auto_increment, Snowflake (global), local per-channel sequence.
- Presence Servers (presence) — Manages online/offline status via client heartbeats; fans out status changes to friends through per-pair pub/sub channels.
- Message Sync Queue (queue) — Per-recipient inbox. Chat server publishes one copy of the message per recipient; each recipient's chat server (or notifier) consumes from it.
- Message KV Store (kv) — Key-value store for chat history. Book recommends HBase (Messenger) or Cassandra (Discord).
- User / Profile DB (relational) (sql) — Stores user profile, settings, friends list, group membership. Book: generic data → relational DB with replication + sharding.
- Notification Servers (worker) — Third-party integration: sends APNs/FCM push notifications to offline recipients. Book's only third-party integration.
- Media Blob Store (extension) (blob) — Object storage for photos / videos / voice — book flags this as an extension beyond the text-only base scope.
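The load balancer entry above routes WSS traffic by consistent hash. A minimal sketch of such a ring follows, assuming MD5-derived 64-bit positions and a configurable number of virtual nodes per server; the class name, the `server#i` virtual-node naming, and the hash choice are illustrative assumptions, not the book's specification.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Sketch of a consistent hash ring mapping a key (for example a user_id) to a chat server.
final class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    ConsistentHashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    // Each server owns several points on the ring so load spreads more evenly.
    void addServer(String server) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(server + "#" + i), server);
        }
    }

    void removeServer(String server) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(server + "#" + i));
        }
    }

    // The owner is the first virtual node at or after the key's position, wrapping around.
    String serverFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no servers registered");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    // First 8 bytes of the MD5 digest, interpreted as a signed 64-bit ring position.
    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException(ex); // MD5 is a required JDK algorithm
        }
    }
}
```

The property that matters for a sticky chat tier is that removing one server only remaps the keys that server owned; every other user keeps the same chat server and does not have to redial.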
Operations Walked Through (6)
- login-discover — Book Figure 12-11. User logs in via API servers; ZooKeeper picks the best chat server; client opens WebSocket to it.
- send-1to1 — User A sends to User B. Chat server 1 obtains a message_id, publishes the message to the sync queue, and the KV store persists it; if B is online, the message is relayed to B's chat server (chat server 2).
- send-offline — Book step 5b. B is offline; message persists in KV store and a push notification wakes B via APNs/FCM.
- send-group — User A posts in a 3-person group (A, B, C). Sender's chat server writes one copy per recipient to each member's inbox queue. Book: simplifies receive — each client only reads its own inbox.
- presence-heartbeat — Client sends 5s heartbeats; on status change, presence server fans out to per-pair channels (book Figure 12-19).
- media-upload — Book §Wrap-up lists media as an extension. Client uploads directly to blob store, then sends a reference message.
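The send-group and multi-device sync steps can be sketched together: the sender's chat server writes one copy per recipient inbox, and each device later pulls everything newer than its cur_max_message_id. The in-memory maps below stand in for the durable sync queue, and the class and method names are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;

// In-memory stand-in for the per-recipient message sync queue.
final class InboxFanout {
    // user_id -> (message_id -> body), kept sorted by message_id.
    private final Map<Long, ConcurrentSkipListMap<Long, String>> inboxes = new ConcurrentHashMap<>();

    // send-group: one copy per member. The sender is skipped here for brevity;
    // a real system would also sync the sender's other devices.
    void fanOut(long messageId, long senderId, List<Long> memberIds, String body) {
        for (long member : memberIds) {
            if (member == senderId) {
                continue;
            }
            inboxes.computeIfAbsent(member, id -> new ConcurrentSkipListMap<>())
                   .put(messageId, body);
        }
    }

    // Multi-device sync: return every message newer than the device's cur_max_message_id,
    // in message_id order.
    List<String> fetchAfter(long userId, long curMaxMessageId) {
        ConcurrentSkipListMap<Long, String> inbox = inboxes.get(userId);
        if (inbox == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(inbox.tailMap(curMaxMessageId, false).values());
    }
}
```

Because each client reads only its own inbox, the receive path stays simple; the cost is write amplification proportional to group size, which is acceptable at the 100-member cap.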
Implementation
// ChatMessage.java
package com.systemdesign.whatsapp.model;

import java.time.Instant;
import java.util.UUID;

/** Immutable chat message as persisted in the KV store and pushed to devices. */
public final class ChatMessage {
    private final long messageId;          // snowflake id
    private final String channelId;        // 1:1 or group
    private final long senderId;
    private final String body;             // up to 100_000 chars
    private final Instant createdAt;
    private final String clientDedupeKey;  // uuid from client

    public ChatMessage(long messageId, String channelId, long senderId,
                       String body, Instant createdAt, String clientDedupeKey) {
        if (body != null && body.length() > 100_000) {
            throw new IllegalArgumentException("body exceeds 100k chars");
        }
        this.messageId = messageId;
        this.channelId = channelId;
        this.senderId = senderId;
        this.body = body;
        this.createdAt = createdAt;
        // Fall back to a server-generated key when the client did not send one.
        this.clientDedupeKey = clientDedupeKey == null
                ? UUID.randomUUID().toString()
                : clientDedupeKey;
    }

    public long getMessageId() { return messageId; }
    public String getChannelId() { return channelId; }
    public long getSenderId() { return senderId; }
    public String getBody() { return body; }
    public Instant getCreatedAt() { return createdAt; }
    public String getClientDedupeKey() { return clientDedupeKey; }
}
// MessageDelivery.java
package com.systemdesign.whatsapp.delivery;

import com.systemdesign.whatsapp.model.ChatMessage;
import com.systemdesign.whatsapp.ws.ChatWebSocketHandler;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import java.util.List;

// Device, DeviceRegistry, OfflineInboxQueue and PushNotificationClient are
// collaborators in this package that are not shown here.
@Service
public class MessageDelivery {
    private static final Logger log = LoggerFactory.getLogger(MessageDelivery.class);

    private final DeviceRegistry deviceRegistry;      // user_id -> List<Device>
    private final ChatWebSocketHandler wsHandler;     // local pushes
    private final OfflineInboxQueue offlineInbox;     // durable queue per user
    private final PushNotificationClient pushClient;  // APNs/FCM

    @Autowired
    public MessageDelivery(DeviceRegistry deviceRegistry,
                           ChatWebSocketHandler wsHandler,
                           OfflineInboxQueue offlineInbox,
                           PushNotificationClient pushClient) {
        this.deviceRegistry = deviceRegistry;
        this.wsHandler = wsHandler;
        this.offlineInbox = offlineInbox;
        this.pushClient = pushClient;
    }

    public void deliver(ChatMessage msg, List<Long> recipientUserIds) {
        for (Long uid : recipientUserIds) {
            List<Device> devices = deviceRegistry.lookup(uid);
            for (Device d : devices) {
                // Try the live socket first; pushToDevice returns false if the
                // socket is missing or dropped mid-flight.
                boolean pushed = d.isOnline()
                        && wsHandler.pushToDevice(d.getDeviceId(), msg);
                if (!pushed) {
                    // Offline path: stage the message durably, then wake the app.
                    offlineInbox.enqueue(uid, d.getDeviceId(), msg);
                    pushClient.wake(d.getPushToken(), msg.getChannelId());
                }
            }
        }
        log.info("dispatched messageId={} to {} recipients",
                msg.getMessageId(), recipientUserIds.size());
    }
}
// ChatWebSocketHandler.java
package com.systemdesign.whatsapp.ws;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.datatype.jsr310.JavaTimeModule;
import com.systemdesign.whatsapp.delivery.OfflineInboxQueue;
import com.systemdesign.whatsapp.model.ChatMessage;
import org.springframework.web.socket.CloseStatus;
import org.springframework.web.socket.TextMessage;
import org.springframework.web.socket.WebSocketSession;
import org.springframework.web.socket.handler.TextWebSocketHandler;
import org.springframework.stereotype.Component;

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

@Component
public class ChatWebSocketHandler extends TextWebSocketHandler {
    private final ConcurrentMap<String, WebSocketSession> deviceIdToSession = new ConcurrentHashMap<>();
    // JavaTimeModule (jackson-datatype-jsr310) is needed to serialize ChatMessage.createdAt;
    // a plain ObjectMapper rejects java.time.Instant.
    private final ObjectMapper mapper = new ObjectMapper().registerModule(new JavaTimeModule());
    private final OfflineInboxQueue offlineInbox;

    public ChatWebSocketHandler(OfflineInboxQueue offlineInbox) {
        this.offlineInbox = offlineInbox;
    }

    @Override
    public void afterConnectionEstablished(WebSocketSession session) throws Exception {
        // deviceId and userId are expected to be set by the handshake interceptor.
        String deviceId = (String) session.getAttributes().get("deviceId");
        long userId = (Long) session.getAttributes().get("userId");
        deviceIdToSession.put(deviceId, session);
        // drain anything staged while offline
        offlineInbox.drain(userId, deviceId, msg -> pushToDevice(deviceId, msg));
    }

    @Override
    public void afterConnectionClosed(WebSocketSession session, CloseStatus status) {
        String deviceId = (String) session.getAttributes().get("deviceId");
        // Remove only if this session is still the registered one, so a late close
        // of an old socket cannot evict a device that has already reconnected.
        deviceIdToSession.remove(deviceId, session);
    }

    public boolean pushToDevice(String deviceId, ChatMessage msg) {
        WebSocketSession s = deviceIdToSession.get(deviceId);
        if (s == null || !s.isOpen()) return false;
        try {
            // WebSocketSession does not allow concurrent sends, so serialize per session.
            synchronized (s) {
                s.sendMessage(new TextMessage(mapper.writeValueAsString(msg)));
            }
            return true;
        } catch (Exception e) {
            // Treat any serialization or I/O failure as "not delivered";
            // the caller falls back to the offline path.
            return false;
        }
    }
}
Key design decisions & trade-offs
- Real-time transport — Chosen: WebSocket over HTTP long-polling. WebSocket gives bidirectional server push at ~10 KB per idle connection; polling or long-polling on a 5 s cycle across 10M connected users would mean about 2M mostly empty GETs/sec. The cost is connection affinity, which complicates rolling deploys.
- Message storage engine — Chosen: Wide-column KV (HBase/Cassandra) keyed by channel_id. A chat is a time-ordered append log per channel — exactly the workload Cassandra is best at. RDBMS would need cross-shard joins once channels spread across hosts.
- Chat-server discovery — Chosen: ZooKeeper-backed service discovery at login. Clients need the least-loaded chat server and must re-resolve after crashes. DNS TTLs are too coarse; ZooKeeper's ephemeral nodes and watches surface a dead server as soon as its session expires, typically within seconds. The cost is another stateful dependency to operate.
- Offline delivery — Chosen: Durable per-user inbox queue plus APNs/FCM wake. Storing the message in an inbox lets reconnecting devices drain at their own pace; APNs/FCM pushes wake background apps without running our own mobile daemon. The tradeoff is dual-write complexity between the message log and the inbox.
- Presence detection — Chosen: Heartbeat every 5 s with 30 s offline threshold. Short enough to feel live, long enough to tolerate brief tunnels and NAT rebinds. Shorter intervals drain mobile batteries; longer intervals make 'online' dots lie.
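The presence decision above (5 s heartbeats, 30 s offline threshold) reduces to comparing the last heartbeat time against a window. A minimal sketch follows, with time passed in explicitly so the logic is easy to test; the class and method names are assumptions, and the map stands in for whatever store the presence servers actually use.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of heartbeat-based presence: a user is online if a heartbeat arrived
// within the last 30 s. Clients are assumed to send one every 5 s.
final class PresenceTracker {
    static final long OFFLINE_AFTER_MS = 30_000L;

    private final Map<Long, Long> lastHeartbeatMs = new ConcurrentHashMap<>();

    // Called by the chat server whenever a client keepalive arrives.
    void heartbeat(long userId, long nowMs) {
        lastHeartbeatMs.put(userId, nowMs);
    }

    // With a 5 s interval, up to five consecutive heartbeats can be lost
    // (tunnel, NAT rebind) before the user flips to offline.
    boolean isOnline(long userId, long nowMs) {
        Long last = lastHeartbeatMs.get(userId);
        return last != null && nowMs - last < OFFLINE_AFTER_MS;
    }

    // Doubles as the last-seen timestamp; null if the user has never connected.
    Long lastSeenMs(long userId) {
        return lastHeartbeatMs.get(userId);
    }
}
```

A real presence service would also publish a status-change event to the user's per-pair channels when `isOnline` flips, instead of having friends poll it.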
Interview follow-ups
- Extend to groups larger than 100 — how does fanout change when a single message has 100K recipients?
- Add end-to-end encryption via the Signal double-ratchet protocol without breaking multi-device sync
- Design voice and video calling on top of this chat substrate (SDP, TURN, media servers)
- Handle per-region data residency (EU messages must stay in EU shards) without losing global chats
- Add message search across forever-retained history without scanning the whole KV store