Notification System (Push / SMS / Email) System Design Interview Question
Problem: Design a system that delivers notifications across push (APNs/FCM), SMS, and email channels at scale, with reliable retries, rate limiting, per-user preferences, and sub-minute latency for transactional messages.
Overview
A multi-channel notification system has to sit between fast internal producers (order-confirmed, 2FA code, friend-request) and slow, rate-limited, failure-prone third-party gateways (APNs, FCM, Twilio, SendGrid), while still delivering 10 billion messages a day. The challenge is almost entirely operational: every downstream vendor has a different throughput tier, different retry semantics, and different failure modes, and the business still wants a single /notify call that Just Works. The design pressures are (1) ingest spikes, since a broadcast campaign can ask for 100M deliveries in 10 minutes; (2) per-user abuse prevention, so a buggy producer does not SMS-bomb a customer; (3) idempotent retries, so a flaky network between producer and API does not double-charge an SMS; and (4) audit, because every attempted delivery must be logged for analytics, compliance, and debugging.
Summary
A producer-facing HTTP API accepts notification requests, validates, rate-limits, and writes them to a per-channel queue. Per-channel worker fleets consume and call the appropriate third-party service (APNs for iOS, FCM for Android, SMS vendors like Twilio/Nexmo, email providers like SendGrid/SES). Device tokens, templates, and user settings live in a DB with a cache tier in front. Every attempt is logged to a notifications log/analytics store. The dominant design choice is an async queue between notification servers and workers — this decouples ingest spikes (bulk broadcasts) from the rate-limited, slow, and failure-prone third-party services. Key reliability patterns from the book: (1) dedupe via an idempotency key at the server so repeats aren't double-sent, (2) retry with a retry queue for transient failures, (3) notification log for analytics + audit, (4) rate limiting so users aren't spammed. Sized for 10B notifications/day (~115K/sec sustained, 1M/sec peak for broadcasts).
Requirements
Functional
- Accept notification requests via POST /v1/notify with channel hint and template
- Deliver across push (APNs/FCM), SMS, and email channels
- Support per-user preferences and opt-outs per channel
- Dedupe retries via client-supplied idempotency key
- Retry transient failures with exponential backoff
- Emit per-attempt log rows for analytics and audit
Non-functional
- Sustain 115K notifications/sec average, 1M/sec peak for broadcasts
- P99 end-to-end latency under 30 s for transactional push
- At-least-once delivery to the vendor, with duplicate requests suppressed per (user, idempotency_key) for 24 h (effectively-once inside the dedupe window)
- Per-user rate limit of 10 notifications/min/channel
- 99.95% availability of the ingest API independent of vendor outages
Capacity Assumptions
- 10B notifications/day total across all channels (≈ 115K/sec average)
- Channel mix: 70% push, 20% email, 10% SMS
- Broadcast campaigns up to 100M recipients in <10 minutes → 100M / 600 s ≈ 167K/sec, budgeted as a 200K/sec fan-out peak per campaign (well inside the 1M/sec peak target)
- Per-user rate limit: 10 notifications per minute per user per channel (abuse guard)
- Retry policy: backoff schedule of 1m, 5m, 30m, 2h, 12h (about 14.6 h of cumulative delay); move to the dead-letter topic once all five retries are exhausted
Back-of-Envelope Estimates
- Push workers: 115K * 0.7 ≈ 80K/sec → at 100 ms p50 per APNs/FCM call with connection pooling, ~8K calls in flight (80K/sec * 0.1 s) → 16 workers * 500 concurrent calls each
- SMS workers: 115K * 0.1 = 11.5K/sec → Twilio account throughput tier typically 100-600 msg/sec, so roughly 20 to 115 sub-accounts are needed; shard across them
- Email workers: 115K * 0.2 = 23K/sec → SMTP via SendGrid, pool of ~2K connections
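- Kafka topics: one per channel, 200 partitions each, 7-day retention → assuming ~300 B per record (the same size used for log rows below), ≈ 3 TB/day and ≈ 21 TB for the full 7 days, before replication
- Notification log DB (Cassandra): 10B rows/day * 300 B ≈ 3 TB/day → ≈ 1.1 PB over a year, before replication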
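The estimates above are simple enough to check mechanically. The following is a minimal sketch of that arithmetic, not part of the original design; the class and method names are hypothetical, and the inputs are the figures stated in the lists above.

```java
/** Back-of-envelope arithmetic for the capacity figures above (illustrative only). */
final class CapacityEstimate {
    static final long NOTIFICATIONS_PER_DAY = 10_000_000_000L;
    static final long SECONDS_PER_DAY = 86_400L;

    /** Average request rate across all channels, in notifications/sec. */
    static double averagePerSecond() {
        return (double) NOTIFICATIONS_PER_DAY / SECONDS_PER_DAY;
    }

    /** Little's law: calls in flight = arrival rate * time spent per call. */
    static double concurrentCalls(double ratePerSecond, double secondsPerCall) {
        return ratePerSecond * secondsPerCall;
    }

    /** Sub-accounts needed when each one is capped at perAccountRate msg/sec. */
    static long subAccountsNeeded(double ratePerSecond, double perAccountRate) {
        return (long) Math.ceil(ratePerSecond / perAccountRate);
    }

    /** Raw log volume per day in terabytes (10^12 bytes), before replication. */
    static double logTerabytesPerDay(long bytesPerRow) {
        return NOTIFICATIONS_PER_DAY * (double) bytesPerRow / 1e12;
    }
}
```

For example, subAccountsNeeded(11_500, 600) gives 20 and subAccountsNeeded(11_500, 100) gives 115, which is the range of Twilio sub-accounts implied by the SMS estimate.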
High-level architecture
Producers POST to a stateless notification API, which authenticates the caller, validates the template_id, checks the idempotency cache (Redis keyed by (user_id, idempotency_key) with a 24 h TTL), consults the rate limiter, and writes the request to a per-channel Kafka topic (one each for push, SMS, and email) before returning 202. The async queue is the load-bearing design choice: it decouples bursty ingest from downstream vendors that have fixed throughput tiers (Twilio at a few hundred msg/sec per sub-account, APNs with HTTP/2 stream caps, SendGrid SMTP pools).
Per-channel worker fleets consume their topic, enrich the payload by joining device tokens and user settings from a cached DB, and call the appropriate third-party gateway. After every attempt, successful or not, the worker writes a row to the Cassandra notification log (keyed by notification_id, clustered by attempt timestamp) and emits a metric. On a retryable failure (5xx, 429, timeout), the worker also writes an outbox row carrying the next attempt time from the backoff schedule of 1m, 5m, 30m, 2h, 12h; a scheduler polls the outbox and re-publishes due rows for the workers via the retry queue. Once all five retries are exhausted, the notification moves to a dead-letter topic. Because the scheduler polls the outbox rather than in-memory state, the decision to retry survives worker crashes.
Campaign broadcasts go through a separate fan-out service that reads a recipient list from object storage and drips into the same per-channel queues at a controlled rate, so transactional latency is not starved by marketing blasts.
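The idempotency check in the request path can be sketched as a "claim the key if nobody holds it" operation with a TTL. The example below is an in-memory stand-in, assuming a single process; in the design described here the same check would run against Redis. The class name, the ":" key format, and the injected time source are all hypothetical choices made for illustration.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** In-memory stand-in for the Redis idempotency cache (single process only). */
final class IdempotencyCache {
    private final ConcurrentHashMap<String, Instant> expiryByKey = new ConcurrentHashMap<>();
    private final Duration ttl;
    private final Supplier<Instant> now; // injected so tests can move time forward

    IdempotencyCache(Duration ttl, Supplier<Instant> now) {
        this.ttl = ttl;
        this.now = now;
    }

    /**
     * Returns true the first time a (userId, idempotencyKey) pair is seen inside
     * the TTL window, meaning the caller should enqueue the notification.
     * Returns false for a repeat, meaning the caller should skip the enqueue.
     */
    boolean firstSeen(String userId, String idempotencyKey) {
        // Assumption: ids never contain ':' so the composite key is unambiguous.
        String key = userId + ":" + idempotencyKey;
        Instant current = now.get();
        boolean[] claimed = {false};
        // compute() is atomic per key, so two racing requests cannot both claim it.
        expiryByKey.compute(key, (k, expiry) -> {
            if (expiry == null || !expiry.isAfter(current)) {
                claimed[0] = true;
                return current.plus(ttl);
            }
            return expiry;
        });
        return claimed[0];
    }
}
```

Expired entries are only overwritten when the same key shows up again, so a long-running process would also need a periodic sweep; Redis handles that through key expiry.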
Architecture Components (12)
- Services / Clients (producers) (client) — Any internal service or campaign UI that submits notification requests. Chapter 10 of the book shows 'Service 1..Service N' as producers, meaning any upstream service can fire a notification.
- Notification Servers (api) — Stateless HTTP service (the book calls these 'notification servers') that validates, resolves user preferences/device tokens/templates, rate-limits, dedupes, and forwards to the per-channel queue.
- Cache (tokens / templates / settings) (cache) — Read-through Redis cache for the three hot lookups notification servers do on every request: user device tokens, notification templates, and user notification settings.
- Notification DB (users / devices / settings / templates) (nosql) — Source-of-truth store for user profiles, device tokens, per-user notification settings, and the notification-template catalog. Chapter 10 of the book explicitly lists these four tables behind the notification servers.
- Rate Limiter (rate-limiter) — Per-user, per-channel token-bucket limiter backed by Redis.
- Per-Channel Queues (+ Retry Queue) (queue) — Durable log that decouples notification servers from third-party services. Chapter 10 of the book diagrams one queue per channel so they scale independently, plus an explicit retry queue for transient failures.
- Push Worker Fleet (worker) — Consumes notif.push from Kafka and calls APNs (iOS) / FCM (Android).
- SMS Worker Fleet (worker) — Consumes notif.sms and submits messages through Twilio / Sinch.
- Email Worker Fleet (worker) — Consumes notif.email and submits via SendGrid / SES / Postmark.
- External Gateways (APNs / FCM / Twilio / SendGrid) (api) — Grouped abstraction for the third-party services that actually deliver to devices and networks.
- Notification Log DB (nosql) — Append-only record of every notification attempt — status, timestamp, provider response. The book lists this explicitly as a reliability + analytics sink that feeds dashboards and supports auditing 'did X actually get the message?'.
- Analytics (OLAP / Dashboards) (stream-processor) — Stream processor that aggregates notification outcomes for dashboards and alerting.
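The rate limiter component above is described as a per-user, per-channel token bucket. A minimal in-memory sketch of that algorithm follows, assuming the 10 notifications/min/channel limit from the requirements. The class name and key format are hypothetical, and a Redis-backed version would keep the same two fields (token count and last refill time) per key instead of a local map.

```java
import java.util.HashMap;
import java.util.Map;

/** Token bucket per (user, channel); in-memory illustration of the limiter. */
final class TokenBucketLimiter {
    private static final class Bucket {
        double tokens;
        long lastRefillNanos;
    }

    private final Map<String, Bucket> buckets = new HashMap<>();
    private final double capacity;
    private final double refillPerNano;

    /** capacity tokens are restored evenly over windowSeconds, e.g. (10, 60). */
    TokenBucketLimiter(int capacity, long windowSeconds) {
        this.capacity = capacity;
        this.refillPerNano = capacity / (windowSeconds * 1e9);
    }

    /** Returns true and spends one token if the send is allowed at nowNanos. */
    synchronized boolean tryAcquire(String userId, String channel, long nowNanos) {
        Bucket b = buckets.computeIfAbsent(userId + "|" + channel, k -> {
            Bucket fresh = new Bucket();
            fresh.tokens = capacity; // a new bucket starts full
            fresh.lastRefillNanos = nowNanos;
            return fresh;
        });
        if (nowNanos > b.lastRefillNanos) {
            // Refill in proportion to elapsed time, never above capacity.
            long elapsed = nowNanos - b.lastRefillNanos;
            b.tokens = Math.min(capacity, b.tokens + elapsed * refillPerNano);
            b.lastRefillNanos = nowNanos;
        }
        if (b.tokens >= 1.0) {
            b.tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

Passing the timestamp in explicitly (for example from System.nanoTime()) keeps the sketch deterministic; a burst of 10 is allowed immediately, after which sends are admitted at one every 6 seconds.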
Operations Walked Through (3)
- single-push — A transactional push (e.g. 'your order shipped') — API accepts, queues, push worker consumes and calls APNs, log is updated.
- bulk-broadcast — A marketing campaign sends one push to 100M users. The fan-out service expands the campaign into per-user records and drips them into the push queue; Kafka absorbs the burst; workers drain over ~10 minutes.
- gateway-timeout-retry — APNs is in a brown-out; the worker times out; the record is rescheduled onto notif.retry with backoff; the second attempt succeeds.
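For the bulk-broadcast flow, the "controlled rate" can be expressed as a schedule of batches with enqueue offsets. The sketch below is one possible way to compute such a schedule, not the fan-out service's actual logic; FanOutPacer, Batch, and the batch-size parameter are hypothetical names introduced here.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

/** Splits a recipient list into batches paced to a target enqueue rate. */
final class FanOutPacer {
    /** A slice of the recipient list and when to enqueue it, relative to campaign start. */
    static final class Batch {
        final long firstIndex;
        final long size;
        final Duration sendAfter;

        Batch(long firstIndex, long size, Duration sendAfter) {
            this.firstIndex = firstIndex;
            this.size = size;
            this.sendAfter = sendAfter;
        }
    }

    static List<Batch> plan(long recipients, long batchSize, long ratePerSecond) {
        if (recipients < 0 || batchSize <= 0 || ratePerSecond <= 0) {
            throw new IllegalArgumentException("recipients >= 0, batchSize > 0, rate > 0 required");
        }
        List<Batch> batches = new ArrayList<>();
        for (long start = 0; start < recipients; start += batchSize) {
            long size = Math.min(batchSize, recipients - start);
            // A batch becomes due once the rate budget covers everything before it.
            long offsetMillis = start * 1000 / ratePerSecond;
            batches.add(new Batch(start, size, Duration.ofMillis(offsetMillis)));
        }
        return batches;
    }
}
```

With 100M recipients, batches of 1M, and a 200K/sec budget, the plan holds 100 batches and the last one is enqueued 495 s after the start, which fits the ~10 minute drain in the walkthrough.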
Implementation
// File: NotificationDispatcher.java
package com.systemdesign.notifications.dispatcher;

import com.systemdesign.notifications.model.NotificationRequest;
import com.systemdesign.notifications.model.SendResult;
import com.systemdesign.notifications.sender.ChannelSender;
import com.systemdesign.notifications.sender.EmailSender;
import com.systemdesign.notifications.sender.PushSender;
import com.systemdesign.notifications.sender.SmsSender;
import com.systemdesign.notifications.sender.TransientSendException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;
import java.util.Map;

/** Routes a request to the sender for its channel and maps transient errors to a retryable result. */
@Component
public class NotificationDispatcher {
    private static final Logger log = LoggerFactory.getLogger(NotificationDispatcher.class);

    private final Map<String, ChannelSender> senders;

    public NotificationDispatcher(PushSender push, SmsSender sms, EmailSender email) {
        this.senders = Map.of(
                "push", push,
                "sms", sms,
                "email", email);
    }

    public SendResult dispatch(NotificationRequest req) {
        ChannelSender sender = senders.get(req.getChannel());
        if (sender == null) {
            throw new IllegalArgumentException("unsupported channel: " + req.getChannel());
        }
        try {
            SendResult r = sender.send(req);
            // The result may also be a permanent failure, so log it as an attempt, not a send.
            log.info("attempted id={} channel={} vendor={} status={}",
                    req.getNotificationId(), req.getChannel(), r.vendor(), r.status());
            return r;
        } catch (TransientSendException e) {
            // Transient errors are returned, not rethrown, so the caller can schedule a retry.
            log.warn("transient failure id={} channel={} msg={}",
                    req.getNotificationId(), req.getChannel(), e.getMessage());
            return SendResult.retryable(e.getMessage());
        }
    }
}
// File: ChannelSender.java
package com.systemdesign.notifications.sender;

import com.systemdesign.notifications.model.NotificationRequest;
import com.systemdesign.notifications.model.SendResult;

/** One implementation per channel; transient vendor errors surface as TransientSendException. */
public interface ChannelSender {
    SendResult send(NotificationRequest req) throws TransientSendException;
}

// File: PushSender.java (public so the dispatcher in another package can inject it)
package com.systemdesign.notifications.sender;

import com.systemdesign.notifications.model.NotificationRequest;
import com.systemdesign.notifications.model.SendResult;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.stereotype.Component;

@Component
public class PushSender implements ChannelSender {
    private final ApnsClient apns; // iOS HTTP/2
    private final FcmClient fcm; // Android
    private final DeviceTokenLookup tokens;

    public PushSender(@Qualifier("apns") ApnsClient apns,
                      @Qualifier("fcm") FcmClient fcm,
                      DeviceTokenLookup tokens) {
        this.apns = apns;
        this.fcm = fcm;
        this.tokens = tokens;
    }

    @Override
    public SendResult send(NotificationRequest req) throws TransientSendException {
        DeviceTokenLookup.Device d = tokens.lookup(req.getUserId());
        if (d == null) {
            // Assumption: lookup returns null when the user has no registered device.
            return SendResult.permanentFailure("no_device");
        }
        try {
            if (d.platform() == DeviceTokenLookup.Platform.IOS) {
                apns.push(d.token(), req.getPayload());
                return SendResult.ok("apns");
            }
            fcm.push(d.token(), req.getPayload());
            return SendResult.ok("fcm");
        } catch (VendorTimeoutException | VendorThrottleException e) {
            // transient: let the worker schedule a retry
            throw new TransientSendException(e.getMessage(), e);
        } catch (InvalidTokenException e) {
            // permanent: device uninstalled the app
            tokens.invalidate(req.getUserId(), d.token());
            return SendResult.permanentFailure("invalid_token");
        }
    }
}
// File: RetryOutbox.java
// OutboxRepository, OutboxRow, NotificationQueuePublisher and DeadLetterSink are
// assumed to live in this package. @Scheduled also needs @EnableScheduling on a
// configuration class.
package com.systemdesign.notifications.retry;

import com.systemdesign.notifications.model.NotificationRequest;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Transactional;
import java.time.Duration;
import java.time.Instant;
import java.util.List;

@Component
public class RetryOutbox {
    /** Delay before retry 1..5; BACKOFF[i] is the wait after i retries have already failed. */
    private static final Duration[] BACKOFF = {
        Duration.ofMinutes(1),
        Duration.ofMinutes(5),
        Duration.ofMinutes(30),
        Duration.ofHours(2),
        Duration.ofHours(12)
    };
    private static final int MAX_ATTEMPTS = BACKOFF.length;

    private final OutboxRepository outbox;
    private final NotificationQueuePublisher queue;
    private final DeadLetterSink dlq;

    public RetryOutbox(OutboxRepository outbox,
                       NotificationQueuePublisher queue,
                       DeadLetterSink dlq) {
        this.outbox = outbox;
        this.queue = queue;
        this.dlq = dlq;
    }

    /**
     * Called from the worker when a send returns a retryable error.
     * attemptNumber is the number of retries already made: 0 when the initial
     * send fails, 5 when the last retry fails and the request is dead-lettered.
     */
    @Transactional
    public void recordFailure(NotificationRequest req, int attemptNumber, String reason) {
        if (attemptNumber < 0) {
            throw new IllegalArgumentException("attemptNumber must be >= 0");
        }
        if (attemptNumber >= MAX_ATTEMPTS) {
            dlq.publish(req, reason);
            return;
        }
        Instant nextAttempt = Instant.now().plus(BACKOFF[attemptNumber]);
        outbox.save(new OutboxRow(
                req.getNotificationId(),
                req,
                attemptNumber + 1,
                nextAttempt,
                reason));
    }

    /**
     * Poller promotes due rows back onto the channel queue.
     * Assumption: lockDue takes row locks, so the poll runs in a transaction.
     * Publishing before deleting means a crash in between re-publishes the row
     * (at-least-once); the vendor call is never silently skipped.
     */
    @Scheduled(fixedDelay = 10_000)
    @Transactional
    public void drainDue() {
        List<OutboxRow> due = outbox.lockDue(Instant.now(), 500);
        for (OutboxRow row : due) {
            queue.publish(row.request(), row.attemptNumber());
            outbox.delete(row.notificationId());
        }
    }
}
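The retry path above presumes that something has already decided whether a vendor response is transient or permanent. A minimal sketch of that decision follows, using the rule stated in the architecture section (5xx and 429 are retryable) and treating every other non-2xx status, including APNs 410, as permanent. The class and enum names are hypothetical; timeouts carry no status code and would be mapped to RETRYABLE by the caller.

```java
/** Maps an HTTP-style vendor status code to a delivery outcome. */
final class VendorResponseClassifier {
    enum Outcome { DELIVERED, RETRYABLE, PERMANENT }

    static Outcome classify(int httpStatus) {
        if (httpStatus >= 200 && httpStatus < 300) {
            return Outcome.DELIVERED;
        }
        if (httpStatus == 429 || httpStatus >= 500) {
            // Throttling and server-side errors are worth another attempt.
            return Outcome.RETRYABLE;
        }
        // Remaining 4xx codes (bad token, gone, malformed payload) will not
        // succeed on retry, so they go straight to the log as failures.
        return Outcome.PERMANENT;
    }
}
```

A worker could call recordFailure only for RETRYABLE outcomes and invalidate the device token on a 410.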
Key design decisions & trade-offs
- Ingest coupling — Chosen: Async queue between API and workers. Decouples bursty producers from rate-limited vendors; lets the API return 202 in single-digit milliseconds. Cost: the API cannot tell the caller that APNs itself succeeded — only that the request was accepted.
- Retry state — Chosen: Outbox table polled on a schedule, not in-memory timers. Worker crashes lose in-memory state; an outbox row survives restarts and is the source of truth for 'what still needs to be retried'. Cost: a DB poller adds a small steady load even when there is nothing to retry.
- Dedupe strategy — Chosen: Idempotency key in Redis with 24 h TTL, checked at the API. Catches producer-side retries before they hit the queue or the vendor. Cost: Redis is now on the hot path. During a Redis outage the API must either fail open (keep accepting traffic and risk duplicate sends) or fail closed (no duplicates, but valid traffic is rejected); the design fails open with a short local cache as backup.
- Channel isolation — Chosen: Separate Kafka topic and worker fleet per channel. A Twilio outage cannot starve push or email workers; per-channel tuning (partitions, consumer concurrency) matches the vendor's throughput profile. Cost: more moving parts, more dashboards.
- Broadcast handling — Chosen: Dedicated fan-out service that drips into the normal queues. Keeps transactional latency stable during marketing blasts by rate-limiting the burst at the source rather than at the worker. Cost: broadcast campaigns take minutes to fully dispatch, which is acceptable for marketing.
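The fail-open choice in the dedupe trade-off can be illustrated with a small wrapper: ask the primary store first, and if it is unreachable, accept the request unless a bounded local cache has seen the key recently. This is a sketch under stated assumptions, not the design's actual code. FailOpenDeduper is a hypothetical name, the primary check is modeled as a Predicate that returns true for a first sighting and throws a RuntimeException on outage, and the local cache is a simple LRU.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

/** Dedupe check that fails open to a small local LRU when the primary store is down. */
final class FailOpenDeduper {
    private final Predicate<String> primaryFirstSeen; // e.g. the Redis check; may throw
    private final Map<String, Boolean> local;

    FailOpenDeduper(Predicate<String> primaryFirstSeen, int localCapacity) {
        this.primaryFirstSeen = primaryFirstSeen;
        // Access-ordered LinkedHashMap that evicts the least recently used key.
        this.local = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > localCapacity;
            }
        };
    }

    /** Returns true if the request should be enqueued. */
    boolean shouldAccept(String key) {
        boolean seenLocally;
        synchronized (local) {
            // Always record the key so the backup is warm when an outage starts.
            seenLocally = local.put(key, Boolean.TRUE) != null;
        }
        try {
            return primaryFirstSeen.test(key);
        } catch (RuntimeException outage) {
            // Fail open: accept unless this process already saw the key.
            return !seenLocally;
        }
    }
}
```

The local cache only knows about keys seen by one API instance, so duplicates routed to different instances still get through during an outage; that is the cost of failing open.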
Interview follow-ups
- Add per-user preferences UI and opt-out enforcement across channels
- Design a smart-retry policy that learns vendor-specific error codes (e.g. APNs 410 gone vs 429)
- Support scheduled notifications ('send at 9am local time') with timezone-aware dispatch
- Add rich notifications with localized templates and A/B test variants
- Build a delivery-receipt pipeline that reconciles vendor webhooks with the notification log
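As a starting point for the scheduled-notification follow-up, the core of timezone-aware dispatch is converting "9am in the user's zone" into an absolute instant. The sketch below uses only java.time; the class and method names are hypothetical, and where the user's ZoneId comes from (for example the settings table) is left open.

```java
import java.time.Instant;
import java.time.LocalTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;

/** Computes the next instant at which a wall-clock time occurs in a user's zone. */
final class LocalTimeScheduler {
    static Instant nextOccurrence(Instant now, ZoneId userZone, LocalTime localTime) {
        ZonedDateTime nowLocal = now.atZone(userZone);
        // Try today's occurrence in the user's zone first.
        ZonedDateTime candidate = nowLocal.toLocalDate().atTime(localTime).atZone(userZone);
        if (!candidate.toInstant().isAfter(now)) {
            // Already passed today, so schedule for tomorrow's local date.
            candidate = nowLocal.toLocalDate().plusDays(1).atTime(localTime).atZone(userZone);
        }
        return candidate.toInstant();
    }
}
```

Resolving through the local date, rather than adding 24 hours, keeps the send at the same wall-clock time across daylight-saving changes.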