
Notification System (Push / SMS / Email) System Design Interview Question

By Rahul Kumar · Senior Software Engineer · 12 components · 3 operations · Source: Alex Xu, System Design Interview Vol 1, Chapter 10; Apple Push Notification service docs; Twilio Programmable Messaging SLAs; SendGrid reliability guide

Problem: Design a system that delivers notifications across push (APNs/FCM), SMS, and email channels at scale, with reliable retries, rate limiting, per-user preferences, and minute-scale latency for transactional messages.

Overview

A multi-channel notification system sits between fast internal producers (order-confirmed, 2FA code, friend-request) and slow, rate-limited, failure-prone third-party gateways (APNs, FCM, Twilio, SendGrid), and still has to deliver 10 billion messages a day. The challenge is almost entirely operational: every downstream vendor has a different throughput tier, different retry semantics, and different failure modes, yet the business wants a single /notify call that behaves the same for every channel. Four design pressures follow: (1) ingest spikes, since a broadcast campaign can ask for 100M deliveries in 10 minutes; (2) per-user abuse prevention, so a buggy producer cannot SMS-bomb a customer; (3) idempotent retries, so a flaky network between producer and API does not send, and bill for, the same SMS twice; and (4) audit, because every attempted delivery must be logged for analytics, compliance, and debugging.

Summary

A producer-facing HTTP API accepts notification requests, validates and rate-limits them, and writes them to a per-channel queue. Per-channel worker fleets consume the queue and call the appropriate third-party service (APNs for iOS, FCM for Android, SMS vendors like Twilio/Nexmo, email providers like SendGrid/SES). Device tokens, templates, and user settings live in a DB with a cache tier in front. Every attempt is logged to a notification log/analytics store. The dominant design choice is the async queue between the notification servers and the workers: it decouples ingest spikes (bulk broadcasts) from the rate-limited, slow, and failure-prone third-party services. The key reliability patterns from the book are (1) deduplication via an idempotency key at the server so repeated requests are not sent twice, (2) a retry queue for transient failures, (3) a notification log for analytics and audit, and (4) rate limiting so users are not spammed. The system is sized for 10B notifications/day (~115K/sec sustained, 1M/sec peak for broadcasts).
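The sizing figures above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the class and method names are hypothetical, and it assumes traffic is spread evenly over the interval, which real traffic is not.

```java
import java.util.Locale;

/** Hypothetical helper for the back-of-envelope numbers quoted in the text. */
class CapacityEstimate {
    static final long SECONDS_PER_DAY = 24L * 60 * 60; // 86,400

    /** Average rate needed to deliver `total` messages spread evenly over `seconds`. */
    static long perSecond(long total, long seconds) {
        return total / seconds; // integer division: rounds down
    }

    public static void main(String[] args) {
        long sustained = perSecond(10_000_000_000L, SECONDS_PER_DAY); // 10B per day
        long broadcast = perSecond(100_000_000L, 10 * 60);            // 100M in 10 minutes
        System.out.printf(Locale.ROOT, "sustained=%d/s broadcast=%d/s%n", sustained, broadcast);
    }
}
```

Running it prints sustained=115740/s broadcast=166666/s. A single 100M-recipient campaign delivered over 10 minutes therefore adds roughly 167K/sec on top of the sustained load, which sits well inside the 1M/sec peak the design is sized for.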

Requirements

Functional

Non-functional

Capacity Assumptions

Back-of-Envelope Estimates

High-level architecture

Producers POST to a stateless notification API, which authenticates the caller, validates the template_id, checks the idempotency cache (Redis keyed by (user_id, idempotency_key) with a 24 h TTL), consults the rate limiter, and writes the request to a per-channel Kafka topic (one for push, one for SMS, one for email) before returning 202. The async queue is the load-bearing design choice: it decouples bursty ingest from downstream vendors that have fixed throughput tiers (Twilio at a few hundred msg/sec per sub-account, APNs with HTTP/2 stream caps, SendGrid SMTP pools).

Per-channel worker fleets consume their topic, enrich the payload by joining device tokens and user settings from a cached DB, and call the appropriate third-party gateway. On success, the worker writes a row to the Cassandra notification log (keyed by notification_id, clustered by attempt timestamp) and emits a metric. On a retryable failure (5xx, 429, timeout), the worker writes an outbox row and a delayed record to a retry queue with a backoff schedule of 1m, 5m, 30m, 2h, 12h; once all five retries have failed, the message moves to a dead-letter topic. The outbox pattern guarantees that the decision to retry survives worker crashes, because the retry scheduler polls the outbox rather than in-memory state.

Campaign broadcasts go through a separate fan-out service that reads a recipient list from object storage and drips into the same per-channel queues at a controlled rate, so transactional latency is not starved by marketing blasts.
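The admission steps at the start of this section (idempotency check, then rate limit, then enqueue) can be sketched in a single process as follows. This is a minimal illustration under stated assumptions, not the system's implementation: AdmissionGate, its Decision enum, the one-minute fixed window, and the per-user cap are all hypothetical, and plain maps stand in for Redis. In a Redis-backed version the dedupe step would more likely be a single atomic SET with the NX and EX options.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for the Redis-backed idempotency and rate-limit checks. */
class AdmissionGate {
    enum Decision { ACCEPTED, DUPLICATE, RATE_LIMITED }

    private static final long IDEMPOTENCY_TTL_MS = 24L * 60 * 60 * 1000; // 24 h, as in the text
    private static final long WINDOW_MS = 60_000;                        // assumed window length

    private final int maxPerWindow;                                  // assumed per-user cap
    private final Map<String, Long> seenUntil = new HashMap<>();     // dedupe key -> expiry time
    private final Map<String, long[]> windows = new HashMap<>();     // user -> {windowStart, count}

    AdmissionGate(int maxPerWindow) {
        this.maxPerWindow = maxPerWindow;
    }

    /** synchronized keeps check-then-record atomic within this one process. */
    synchronized Decision admit(String userId, String idempotencyKey, long nowMs) {
        String dedupeKey = userId + ":" + idempotencyKey;
        Long expiresAt = seenUntil.get(dedupeKey);
        if (expiresAt != null && expiresAt > nowMs) {
            return Decision.DUPLICATE; // repeats do not consume rate-limit budget
        }

        long[] window = windows.computeIfAbsent(userId, k -> new long[] {nowMs, 0});
        if (nowMs - window[0] >= WINDOW_MS) { // fixed window elapsed: start a new one
            window[0] = nowMs;
            window[1] = 0;
        }
        if (window[1] >= maxPerWindow) {
            // The key is deliberately not recorded, so the producer can retry later.
            return Decision.RATE_LIMITED;
        }
        window[1]++;

        // Expired entries are never evicted here; a Redis TTL would handle that.
        seenUntil.put(dedupeKey, nowMs + IDEMPOTENCY_TTL_MS);
        return Decision.ACCEPTED;
    }
}
```

Only an ACCEPTED decision would lead to a write on the per-channel topic and a 202 response. Checking for duplicates before touching the rate limiter means a producer that retries the same request over a flaky network is not penalized for it.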

Architecture Components (12)

Operations Walked Through (3)

Implementation

NotificationDispatcher — routes to channel-specific senders
package com.systemdesign.notifications.dispatcher;

import com.systemdesign.notifications.model.NotificationRequest;
import com.systemdesign.notifications.model.SendResult;
import com.systemdesign.notifications.sender.ChannelSender;
import com.systemdesign.notifications.sender.EmailSender;
import com.systemdesign.notifications.sender.PushSender;
import com.systemdesign.notifications.sender.SmsSender;
import com.systemdesign.notifications.sender.TransientSendException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

import java.util.Map;

@Component
public class NotificationDispatcher {
    private static final Logger log = LoggerFactory.getLogger(NotificationDispatcher.class);

    private final Map<String, ChannelSender> senders;

    public NotificationDispatcher(PushSender push, SmsSender sms, EmailSender email) {
        this.senders = Map.of(
                "push", push,
                "sms", sms,
                "email", email);
    }

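    public SendResult dispatch(NotificationRequest req) {
        String channel = req.getChannel();
        // Map.of(...) builds an immutable map whose get(null) throws NullPointerException,
        // so a missing channel is rejected here instead of surfacing as an NPE.
        ChannelSender sender = (channel == null) ? null : senders.get(channel);
        if (sender == null) {
            throw new IllegalArgumentException("unsupported channel: " + channel);
        }
        try {
            SendResult r = sender.send(req);
            // The sender may also report a permanent failure, so log the outcome rather than "sent".
            log.info("dispatched id={} channel={} vendor={} status={}",
                    req.getNotificationId(), channel, r.vendor(), r.status());
            return r;
        } catch (TransientSendException e) {
            // Transient errors become a retryable result for the worker to hand to the retry outbox.
            log.warn("transient failure id={} channel={} msg={}",
                    req.getNotificationId(), channel, e.getMessage());
            return SendResult.retryable(e.getMessage());
        }
    }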
}
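The dispatcher above relies on a SendResult type that the listing does not show. One plausible shape, inferred only from the calls made on it (ok, retryable, permanentFailure, vendor(), status()), is a small record. Everything beyond those names, including the Status enum, the reason field, and shouldRetry(), is an assumption made for illustration.

```java
/** Hypothetical shape for SendResult; the real class is not shown in the listing. */
record SendResult(Status status, String vendor, String reason) {

    enum Status { OK, RETRYABLE, PERMANENT_FAILURE }

    /** The vendor accepted the message. */
    static SendResult ok(String vendor) {
        return new SendResult(Status.OK, vendor, null);
    }

    /** Transient failure: the worker should schedule another attempt. */
    static SendResult retryable(String reason) {
        return new SendResult(Status.RETRYABLE, null, reason);
    }

    /** Permanent failure such as an invalid device token: retrying cannot help. */
    static SendResult permanentFailure(String reason) {
        return new SendResult(Status.PERMANENT_FAILURE, null, reason);
    }

    /** Convenience check a worker could use before calling the retry outbox. */
    boolean shouldRetry() {
        return status == Status.RETRYABLE;
    }
}
```

Keeping the three outcomes in one value type lets the worker decide between logging success, scheduling a retry, and giving up without catching exceptions itself.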
ChannelSender interface and PushSender implementation
package com.systemdesign.notifications.sender;

import com.systemdesign.notifications.model.NotificationRequest;
import com.systemdesign.notifications.model.SendResult;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.stereotype.Component;

public interface ChannelSender {
    SendResult send(NotificationRequest req) throws TransientSendException;
}

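// PushSender belongs in its own file (PushSender.java): a Java source file may declare only
// one public top-level type, and NotificationDispatcher imports this class from another package.
@Component
public class PushSender implements ChannelSender {

    private final ApnsClient apns;   // iOS, over HTTP/2
    private final FcmClient fcm;     // Android
    private final DeviceTokenLookup tokens;

    public PushSender(@Qualifier("apns") ApnsClient apns,
                      @Qualifier("fcm") FcmClient fcm,
                      DeviceTokenLookup tokens) {
        this.apns = apns;
        this.fcm = fcm;
        this.tokens = tokens;
    }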

    @Override
    public SendResult send(NotificationRequest req) throws TransientSendException {
        DeviceTokenLookup.Device d = tokens.lookup(req.getUserId());
        try {
            if (d.platform() == DeviceTokenLookup.Platform.IOS) {
                apns.push(d.token(), req.getPayload());
                return SendResult.ok("apns");
            }
            fcm.push(d.token(), req.getPayload());
            return SendResult.ok("fcm");
        } catch (VendorTimeoutException | VendorThrottleException e) {
            throw new TransientSendException(e.getMessage(), e);
        } catch (InvalidTokenException e) {
            // permanent: device uninstalled the app
            tokens.invalidate(req.getUserId(), d.token());
            return SendResult.permanentFailure("invalid_token");
        }
    }
}
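PushSender separates transient from permanent failures by exception type. When a vendor client exposes raw HTTP status codes instead, the same split can be made from the status alone, following the rule stated in the architecture section that 5xx and 429 are retryable. The sketch below is an assumption-laden illustration: VendorResponseClassifier and its Outcome enum are hypothetical, and timeouts are not covered because they carry no status code.

```java
/** Hypothetical mapping from a vendor HTTP status to a retry decision. */
class VendorResponseClassifier {

    enum Outcome { ACCEPTED, RETRYABLE, PERMANENT }

    static Outcome classify(int httpStatus) {
        if (httpStatus >= 200 && httpStatus < 300) {
            return Outcome.ACCEPTED;   // the vendor took the message; delivery is still up to them
        }
        if (httpStatus == 429 || (httpStatus >= 500 && httpStatus < 600)) {
            return Outcome.RETRYABLE;  // throttled or vendor-side error: back off and retry
        }
        return Outcome.PERMANENT;      // other codes: assumed not fixable by retrying
    }
}
```

Treating every remaining status as permanent is a conservative default; a real integration would consult each vendor's documented error codes before deciding, for example, to invalidate a device token.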
Outbox pattern with exponential backoff retry
package com.systemdesign.notifications.retry;

import com.systemdesign.notifications.model.NotificationRequest;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Transactional;

import java.time.Duration;
import java.time.Instant;
import java.util.List;

@Component
public class RetryOutbox {

    private static final Duration[] BACKOFF = {
            Duration.ofMinutes(1),
            Duration.ofMinutes(5),
            Duration.ofMinutes(30),
            Duration.ofHours(2),
            Duration.ofHours(12)
    };
    private static final int MAX_ATTEMPTS = BACKOFF.length;

    private final OutboxRepository outbox;
    private final NotificationQueuePublisher queue;
    private final DeadLetterSink dlq;

    public RetryOutbox(OutboxRepository outbox,
                       NotificationQueuePublisher queue,
                       DeadLetterSink dlq) {
        this.outbox = outbox;
        this.queue = queue;
        this.dlq = dlq;
    }

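    /**
     * Called from the worker when a send returns a retryable error.
     * attemptNumber counts the retries already made (0 when the initial send fails),
     * so it doubles as the index of the next backoff step.
     */
    @Transactional
    public void recordFailure(NotificationRequest req, int attemptNumber, String reason) {
        if (attemptNumber >= MAX_ATTEMPTS) {
            // Every scheduled retry has failed: hand the message to the dead-letter sink.
            dlq.publish(req, reason);
            return;
        }
        Instant nextAttempt = Instant.now().plus(BACKOFF[attemptNumber]);
        outbox.save(new OutboxRow(
                req.getNotificationId(),
                req,
                attemptNumber + 1,
                nextAttempt,
                reason));
    }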

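    /**
     * Poller promotes due rows back onto the channel queue.
     * Runs in a transaction so that row locks taken by lockDue (assuming it takes row-level
     * locks, as its name suggests) are held until the rows are deleted; without one,
     * overlapping pollers could republish the same row.
     * Publishing before deleting gives at-least-once delivery: a crash between the two
     * steps republishes the row on the next poll.
     */
    @Scheduled(fixedDelay = 10_000)
    @Transactional
    public void drainDue() {
        List<OutboxRow> due = outbox.lockDue(Instant.now(), 500);
        for (OutboxRow row : due) {
            queue.publish(row.request(), row.attemptNumber());
            outbox.delete(row.notificationId());
        }
    }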
}
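To see what the backoff schedule above means for a single message, it helps to add the steps up. The sketch below copies the five durations from RetryOutbox and computes when each retry would fire relative to the first failure. RetryTimeline is a hypothetical helper, and the figures ignore send latency and the poller's 10-second polling interval.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that turns the backoff steps into a cumulative timeline. */
class RetryTimeline {
    // Same schedule as RetryOutbox.BACKOFF: 1m, 5m, 30m, 2h, 12h.
    static final List<Duration> BACKOFF = List.of(
            Duration.ofMinutes(1),
            Duration.ofMinutes(5),
            Duration.ofMinutes(30),
            Duration.ofHours(2),
            Duration.ofHours(12));

    /** Elapsed time since the first failure at which each retry becomes due. */
    static List<Duration> cumulativeOffsets() {
        List<Duration> offsets = new ArrayList<>();
        Duration elapsed = Duration.ZERO;
        for (Duration step : BACKOFF) {
            elapsed = elapsed.plus(step);
            offsets.add(elapsed);
        }
        return offsets;
    }

    public static void main(String[] args) {
        List<Duration> offsets = cumulativeOffsets();
        for (int i = 0; i < offsets.size(); i++) {
            System.out.println("retry " + (i + 1) + " due at +" + offsets.get(i));
        }
    }
}
```

The retries fall due at 1 minute, 6 minutes, 36 minutes, 2 hours 36 minutes, and 14 hours 36 minutes after the first failure, so a message that keeps failing reaches the dead-letter sink more than half a day after it was first attempted.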

Key design decisions & trade-offs

Interview follow-ups
