← System Design Simulator

Distributed Rate Limiter System Design Interview Question

By Rahul Kumar · Senior Software Engineer · Updated · 10 components · 4 operations ·Source: Alex Xu, System Design Interview Vol 1, Chapter 4

Problem: Design a distributed rate limiter that protects APIs at 1M+ RPS with per-user / per-IP / per-API quotas.

Overview

A rate limiter is the thin layer that stands between the internet and your services and says 'no' when callers ask for too much. It is the cheapest way to stop abusive scrapers, protect fragile backends from thundering retries, enforce per-tenant quotas on a multi-tenant API, and give paying customers a predictable SLA. At 1M+ requests per second across a public API, the rate limiter becomes its own distributed systems problem: gateway replicas must agree on a shared counter without serializing every request through one node, rule updates must propagate to the edge in seconds, and the limiter itself must fail open or fail closed in a way that matches the business. Real systems such as Stripe, Cloudflare, Lyft Envoy and AWS API Gateway all converge on roughly the same architecture: token-bucket state in a fast in-memory store, atomic check-and-decrement via Lua or CAS, and a rules pipeline separate from the data path. Understanding why is a standard senior-engineer interview topic.

Distributed Rate Limiter — Interactive Simulator

Runs fully client-side in your browser; no sign-up. Or open full screen →

Launch the interactive walkthrough for Distributed Rate Limiter — animated architecture diagram, step-by-step flow with real payloads, component swap, and a discrete-event stress simulator.

Summary

A distributed token-bucket rate limiter enforced at the API gateway, backed by Redis with atomic Lua scripts to avoid race conditions across gateway replicas (the chapter's two concrete concurrency-fix options are Lua scripts and Redis sorted sets). The dominant design choice is token bucket in Redis (O(1) per check, supports bursts — used in production by Amazon and Stripe per the book) over alternatives like sliding-window log (accurate but O(N) memory), sliding-window counter (Cloudflare-style hybrid, approximate but smooth), fixed window (cheap but allows 2x burst at boundaries), or leaking bucket (fixed output rate, no burst allowance). Rules are authored as Lyft-style YAML (domain/descriptor/unit/requests_per_unit), persisted on disk, and pulled by background workers into an in-memory rules cache so gateway hot-path lookups are cache-local. When a request is denied the gateway returns HTTP 429 with `X-Ratelimit-Limit`, `X-Ratelimit-Remaining`, and `X-Ratelimit-Retry-After`; soft-limited requests can optionally be deferred to a queue for later processing instead of being dropped. Multi-DC deployments place the limiter at edge POPs and reconcile counters under an eventual-consistency model.

Requirements

Functional

Non-functional

Capacity Assumptions

Back-of-Envelope Estimates

High-level architecture

A client request arrives at a TLS-terminating L4 load balancer and is routed to one of N API gateway replicas using least-connections. Each gateway runs a rate-limit middleware that sits after authentication (so we have a reliable identity) and before the upstream service proxy. The middleware looks up applicable rules in an in-process rules cache, a simple hashmap refreshed every 5 seconds by a background rules-worker thread polling a config service. For each rule the middleware calls Redis with an atomic Lua script that implements token-bucket: it reads the bucket, refills tokens based on elapsed time, decrements if a token is available, and writes back the new state - all in a single round-trip. The Lua script is the key correctness mechanism because any two gateway replicas hitting the same key will serialize on Redis's single-threaded event loop, so the counter cannot be double-decremented. If the bucket is empty the middleware returns HTTP 429 with the three standard headers; otherwise the request proceeds to the upstream. An optional throttled queue catches requests that are soft-limited (e.g. email-sending APIs where 'delayed by 2s' is acceptable) and a worker drains them at the configured rate. For multi-region deployments each edge POP runs its own Redis cluster and counter deltas are reconciled asynchronously; brief over-admission across regions is an accepted trade.

Architecture Components (10)

Operations Walked Through (4)

Implementation

TokenBucketRateLimiter.java
package com.hld.ratelimit;

import java.time.Duration;
import java.util.Collections;
import java.util.List;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.data.redis.core.script.DefaultRedisScript;
import org.springframework.stereotype.Component;

@Component
public class TokenBucketRateLimiter {
    private static final String LUA =
            "local key = KEYS[1] \n" +
            "local rate = tonumber(ARGV[1]) \n" +
            "local capacity = tonumber(ARGV[2]) \n" +
            "local now = tonumber(ARGV[3]) \n" +
            "local requested = tonumber(ARGV[4]) \n" +
            "local bucket = redis.call('HMGET', key, 'tokens', 'ts') \n" +
            "local tokens = tonumber(bucket[1]) or capacity \n" +
            "local ts = tonumber(bucket[2]) or now \n" +
            "local delta = math.max(0, now - ts) \n" +
            "local filled = math.min(capacity, tokens + delta * rate) \n" +
            "local allowed = filled >= requested \n" +
            "if allowed then filled = filled - requested end \n" +
            "redis.call('HMSET', key, 'tokens', filled, 'ts', now) \n" +
            "redis.call('PEXPIRE', key, 60000) \n" +
            "return { allowed and 1 or 0, filled } \n";

    private final DefaultRedisScript<List> script;
    private final StringRedisTemplate redis;

    @Autowired
    public TokenBucketRateLimiter(StringRedisTemplate redis) {
        this.redis = redis;
        this.script = new DefaultRedisScript<>();
        this.script.setScriptText(LUA);
        this.script.setResultType(List.class);
    }

    public Decision tryAcquire(String identity, double refillPerSec, long capacity) {
        long nowMs = System.currentTimeMillis();
        List<?> out = redis.execute(script,
                Collections.singletonList("rl:" + identity),
                String.valueOf(refillPerSec / 1000.0),
                String.valueOf(capacity),
                String.valueOf(nowMs),
                "1");
        boolean allowed = ((Number) out.get(0)).intValue() == 1;
        long remaining = ((Number) out.get(1)).longValue();
        return new Decision(allowed, remaining, Duration.ofMillis(allowed ? 0 : (long) (1000.0 / refillPerSec)));
    }

    public record Decision(boolean allowed, long remaining, Duration retryAfter) { }
}
SlidingWindowLogLimiter.java
package com.hld.ratelimit;

import java.util.Deque;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.ConcurrentMap;

/** O(N) sliding-window-log: exact but higher memory. Used for low-QPS high-value keys. */
public final class SlidingWindowLogLimiter {
    private final long windowMs;
    private final int maxRequests;
    private final ConcurrentMap<String, Deque<Long>> timestamps = new ConcurrentHashMap<>();

    public SlidingWindowLogLimiter(long windowMs, int maxRequests) {
        this.windowMs = windowMs;
        this.maxRequests = maxRequests;
    }

    public boolean allow(String identity) {
        long now = System.currentTimeMillis();
        long cutoff = now - windowMs;
        Deque<Long> log = timestamps.computeIfAbsent(identity, k -> new ConcurrentLinkedDeque<>());
        synchronized (log) {
            while (!log.isEmpty() && log.peekFirst() <= cutoff) {
                log.pollFirst();
            }
            if (log.size() >= maxRequests) {
                return false;
            }
            log.addLast(now);
            return true;
        }
    }

    public int currentCount(String identity) {
        Deque<Long> log = timestamps.get(identity);
        return log == null ? 0 : log.size();
    }
}
RateLimitInterceptor.java
package com.hld.ratelimit;

import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.HttpStatus;
import org.springframework.stereotype.Component;
import org.springframework.web.servlet.HandlerInterceptor;

@Component
public class RateLimitInterceptor implements HandlerInterceptor {
    private final TokenBucketRateLimiter limiter;
    private final RulesCache rules;

    @Autowired
    public RateLimitInterceptor(TokenBucketRateLimiter limiter, RulesCache rules) {
        this.limiter = limiter;
        this.rules = rules;
    }

    @Override
    public boolean preHandle(HttpServletRequest req, HttpServletResponse resp, Object handler) {
        String identity = (String) req.getAttribute("apiKey");
        if (identity == null) identity = "ip:" + req.getRemoteAddr();

        Rule rule = rules.resolve(identity, req.getRequestURI(), req.getMethod());
        TokenBucketRateLimiter.Decision d =
                limiter.tryAcquire(rule.bucketKey(), rule.refillPerSec(), rule.capacity());

        resp.setHeader("X-Ratelimit-Limit", Long.toString(rule.capacity()));
        resp.setHeader("X-Ratelimit-Remaining", Long.toString(d.remaining()));

        if (!d.allowed()) {
            resp.setStatus(HttpStatus.TOO_MANY_REQUESTS.value());
            resp.setHeader("X-Ratelimit-Retry-After", Long.toString(d.retryAfter().toSeconds()));
            return false;
        }
        return true;
    }

    public record Rule(String bucketKey, double refillPerSec, long capacity) { }
}
RulesCache.java
package com.hld.ratelimit;

import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class RulesCache {
    private final ConfigServiceClient config;
    private final AtomicReference<Map<String, RateLimitInterceptor.Rule>> snapshot =
            new AtomicReference<>(Map.of());

    public RulesCache(ConfigServiceClient config) {
        this.config = config;
        refresh();
    }

    @Scheduled(fixedDelayString = "5000")
    public void refresh() {
        try {
            Map<String, RateLimitInterceptor.Rule> fresh = config.fetchAllRules();
            snapshot.set(Map.copyOf(fresh));
        } catch (RuntimeException ignored) {
            // keep stale snapshot on config-service outage
        }
    }

    public RateLimitInterceptor.Rule resolve(String identity, String path, String method) {
        Map<String, RateLimitInterceptor.Rule> current = snapshot.get();
        RateLimitInterceptor.Rule rule = current.get(identity + ":" + method + ":" + path);
        if (rule != null) return rule;
        rule = current.get(identity + ":*");
        if (rule != null) return rule;
        return current.getOrDefault("default",
                new RateLimitInterceptor.Rule("rl:" + identity, 100.0 / 60.0, 100));
    }

    public interface ConfigServiceClient {
        Map<String, RateLimitInterceptor.Rule> fetchAllRules();
    }
}

Key design decisions & trade-offs

Interview follow-ups

Related