Distributed Rate Limiter System Design Interview Question
Problem: Design a distributed rate limiter that protects APIs at 1M+ RPS with per-user / per-IP / per-API quotas.
Overview
A rate limiter is the thin layer that stands between the internet and your services and says 'no' when callers ask for too much. It is the cheapest way to stop abusive scrapers, protect fragile backends from thundering retries, enforce per-tenant quotas on a multi-tenant API, and give paying customers a predictable SLA. At 1M+ requests per second across a public API, the rate limiter becomes its own distributed systems problem: gateway replicas must agree on a shared counter without serializing every request through one node, rule updates must propagate to the edge in seconds, and the limiter itself must fail open or fail closed in a way that matches the business. Real systems such as Stripe, Cloudflare, Lyft Envoy and AWS API Gateway all converge on roughly the same architecture: token-bucket state in a fast in-memory store, atomic check-and-decrement via Lua or CAS, and a rules pipeline separate from the data path. Understanding why is a standard senior-engineer interview topic.
Summary
A distributed token-bucket rate limiter enforced at the API gateway, backed by Redis with atomic Lua scripts to avoid race conditions across gateway replicas (the chapter's two concrete concurrency-fix options are Lua scripts and Redis sorted sets). The dominant design choice is token bucket in Redis (O(1) per check, supports bursts — used in production by Amazon and Stripe per the book) over alternatives like sliding-window log (accurate but O(N) memory), sliding-window counter (Cloudflare-style hybrid, approximate but smooth), fixed window (cheap but allows 2x burst at boundaries), or leaking bucket (fixed output rate, no burst allowance). Rules are authored as Lyft-style YAML (domain/descriptor/unit/requests_per_unit), persisted on disk, and pulled by background workers into an in-memory rules cache so gateway hot-path lookups are cache-local. When a request is denied the gateway returns HTTP 429 with `X-Ratelimit-Limit`, `X-Ratelimit-Remaining`, and `X-Ratelimit-Retry-After`; soft-limited requests can optionally be deferred to a queue for later processing instead of being dropped. Multi-DC deployments place the limiter at edge POPs and reconcile counters under an eventual-consistency model.
Requirements
Functional
- Enforce quotas per API key, per user, per IP, and per endpoint with configurable limits
- Allow controlled bursts (e.g. 100 requests within 1s even on a 60/min quota) via token bucket
- Return HTTP 429 with X-Ratelimit-Limit, X-Ratelimit-Remaining, and X-Ratelimit-Retry-After headers
- Hot-reload rules from a control plane without restarting gateway replicas
- Optionally enqueue soft-limited requests to a throttled queue instead of outright rejecting
- Support composite keys (tenant + endpoint + method) and hierarchical quotas
Non-functional
- Add no more than 1-3ms of latency per request at 1M RPS fleet-wide
- Correct under concurrency: two gateway replicas must not both allow a request that would exceed the budget
- Highly available: limiter outage must not take down the API - fail open or fail to local cache
- Horizontally scalable: adding gateway replicas must not add write contention
- Low memory footprint: bucket state under 100 bytes per active identity
Capacity Assumptions
- Total traffic: 1M RPS across the fleet
- 100 API gateway replicas
- Rule set: ~50K distinct (api_key, endpoint) quotas
- Default limit: 100 req/min per user, 10K req/min per service key
- Rule updates: <100/minute globally (very low write QPS)
Back-of-Envelope Estimates
- Redis ops: 1M RPS * 1 script call each ≈ 1M RPS (single Redis cluster with 10 shards handles this)
- Memory per bucket: ~100B → 50K rules * 100B ≈ 5 MB (trivial)
- Rules cache size per limiter replica: ~5 MB in-memory, O(1) hashmap lookup per request
- Config service reads: rules-worker on each gateway polls every 5s → 100 / 5 = 20 RPS — negligible
- Gateway overhead added per request: 1–3ms (local rules lookup + Redis RTT + Lua exec)
- Multi-DC (per book): replicate counter deltas under an eventual-consistency model — brief over-admission across regions is acceptable for almost all APIs
High-level architecture
A client request arrives at a TLS-terminating L4 load balancer and is routed to one of N API gateway replicas using least-connections. Each gateway runs a rate-limit middleware that sits after authentication (so we have a reliable identity) and before the upstream service proxy. The middleware looks up applicable rules in an in-process rules cache, a simple hashmap refreshed every 5 seconds by a background rules-worker thread polling a config service. For each rule the middleware calls Redis with an atomic Lua script that implements token-bucket: it reads the bucket, refills tokens based on elapsed time, decrements if a token is available, and writes back the new state - all in a single round-trip. The Lua script is the key correctness mechanism because any two gateway replicas hitting the same key will serialize on Redis's single-threaded event loop, so the counter cannot be double-decremented. If the bucket is empty the middleware returns HTTP 429 with the three standard headers; otherwise the request proceeds to the upstream. An optional throttled queue catches requests that are soft-limited (e.g. email-sending APIs where 'delayed by 2s' is acceptable) and a worker drains them at the configured rate. For multi-region deployments each edge POP runs its own Redis cluster and counter deltas are reconciled asynchronously; brief over-admission across regions is an accepted trade.
Architecture Components (10)
- Client (3rd-party / internal) (client) — Any caller hitting the public or internal API.
- Load Balancer (lb) — Distributes traffic across gateway replicas.
- API Gateway (api) — Edge service that runs auth + rate-limit middleware before proxying — the book's preferred host for rate limiting in a microservices architecture.
- Rate Limiter Middleware (rate-limiter) — Token bucket check implemented via a single Redis Lua script per request — the chapter's preferred algorithm (used by Amazon and Stripe) for its burst tolerance and O(1) memory.
- Redis (Token Bucket Store) (cache) — Sharded Redis cluster holding per-identity bucket state.
- Config Service (Rules on Disk) (api) — Authoritative store of rate-limit rules — per the book, rules are written in config files and persisted on disk, then pulled by background workers into an in-memory cache on each limiter replica.
- Rules Cache (in-memory) (cache) — Per-limiter in-memory copy of the authoritative rule set so hot-path lookups avoid the config service.
- Rules Worker (worker) — Background goroutine/process that periodically pulls rules from the config service / disk and refreshes the in-memory rules cache.
- Throttled Request Queue (queue) — Optional queue for rate-limited requests the chapter mentions — instead of dropping, enqueue for later processing (e.g., order submission under load).
- Upstream Service (api) — The actual API the gateway protects; only reached if the limiter allows.
Operations Walked Through (4)
- allow — Normal path: limiter loads the rule from its local rules cache, runs the token-bucket Lua script against Redis, decrements a token, and forwards upstream.
- deny — Bucket empty; gateway returns 429 with X-Ratelimit-Limit / X-Ratelimit-Remaining / X-Ratelimit-Retry-After headers and does NOT touch upstream.
- rule-refresh — The chapter's background loop: a worker pulls the authoritative rule set from disk/config service every 5s and overwrites the in-memory rules cache used by every request.
- soft-limit-enqueue — The chapter's 'enqueue rate-limited requests to be processed later' option. Bucket is empty but the request is for a critical flow (e.g., order submission), so the gateway returns 202 Accepted and parks the work on the throttled queue.
Implementation
package com.hld.ratelimit;
import java.time.Duration;
import java.util.Collections;
import java.util.List;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.data.redis.core.script.DefaultRedisScript;
import org.springframework.stereotype.Component;
@Component
public class TokenBucketRateLimiter {
private static final String LUA =
"local key = KEYS[1] \n" +
"local rate = tonumber(ARGV[1]) \n" +
"local capacity = tonumber(ARGV[2]) \n" +
"local now = tonumber(ARGV[3]) \n" +
"local requested = tonumber(ARGV[4]) \n" +
"local bucket = redis.call('HMGET', key, 'tokens', 'ts') \n" +
"local tokens = tonumber(bucket[1]) or capacity \n" +
"local ts = tonumber(bucket[2]) or now \n" +
"local delta = math.max(0, now - ts) \n" +
"local filled = math.min(capacity, tokens + delta * rate) \n" +
"local allowed = filled >= requested \n" +
"if allowed then filled = filled - requested end \n" +
"redis.call('HMSET', key, 'tokens', filled, 'ts', now) \n" +
"redis.call('PEXPIRE', key, 60000) \n" +
"return { allowed and 1 or 0, filled } \n";
private final DefaultRedisScript<List> script;
private final StringRedisTemplate redis;
@Autowired
public TokenBucketRateLimiter(StringRedisTemplate redis) {
this.redis = redis;
this.script = new DefaultRedisScript<>();
this.script.setScriptText(LUA);
this.script.setResultType(List.class);
}
public Decision tryAcquire(String identity, double refillPerSec, long capacity) {
long nowMs = System.currentTimeMillis();
List<?> out = redis.execute(script,
Collections.singletonList("rl:" + identity),
String.valueOf(refillPerSec / 1000.0),
String.valueOf(capacity),
String.valueOf(nowMs),
"1");
boolean allowed = ((Number) out.get(0)).intValue() == 1;
long remaining = ((Number) out.get(1)).longValue();
return new Decision(allowed, remaining, Duration.ofMillis(allowed ? 0 : (long) (1000.0 / refillPerSec)));
}
public record Decision(boolean allowed, long remaining, Duration retryAfter) { }
}
package com.hld.ratelimit;
import java.util.Deque;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.ConcurrentMap;
/** O(N) sliding-window-log: exact but higher memory. Used for low-QPS high-value keys. */
public final class SlidingWindowLogLimiter {
private final long windowMs;
private final int maxRequests;
private final ConcurrentMap<String, Deque<Long>> timestamps = new ConcurrentHashMap<>();
public SlidingWindowLogLimiter(long windowMs, int maxRequests) {
this.windowMs = windowMs;
this.maxRequests = maxRequests;
}
public boolean allow(String identity) {
long now = System.currentTimeMillis();
long cutoff = now - windowMs;
Deque<Long> log = timestamps.computeIfAbsent(identity, k -> new ConcurrentLinkedDeque<>());
synchronized (log) {
while (!log.isEmpty() && log.peekFirst() <= cutoff) {
log.pollFirst();
}
if (log.size() >= maxRequests) {
return false;
}
log.addLast(now);
return true;
}
}
public int currentCount(String identity) {
Deque<Long> log = timestamps.get(identity);
return log == null ? 0 : log.size();
}
}
package com.hld.ratelimit;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.HttpStatus;
import org.springframework.stereotype.Component;
import org.springframework.web.servlet.HandlerInterceptor;
@Component
public class RateLimitInterceptor implements HandlerInterceptor {
private final TokenBucketRateLimiter limiter;
private final RulesCache rules;
@Autowired
public RateLimitInterceptor(TokenBucketRateLimiter limiter, RulesCache rules) {
this.limiter = limiter;
this.rules = rules;
}
@Override
public boolean preHandle(HttpServletRequest req, HttpServletResponse resp, Object handler) {
String identity = (String) req.getAttribute("apiKey");
if (identity == null) identity = "ip:" + req.getRemoteAddr();
Rule rule = rules.resolve(identity, req.getRequestURI(), req.getMethod());
TokenBucketRateLimiter.Decision d =
limiter.tryAcquire(rule.bucketKey(), rule.refillPerSec(), rule.capacity());
resp.setHeader("X-Ratelimit-Limit", Long.toString(rule.capacity()));
resp.setHeader("X-Ratelimit-Remaining", Long.toString(d.remaining()));
if (!d.allowed()) {
resp.setStatus(HttpStatus.TOO_MANY_REQUESTS.value());
resp.setHeader("X-Ratelimit-Retry-After", Long.toString(d.retryAfter().toSeconds()));
return false;
}
return true;
}
public record Rule(String bucketKey, double refillPerSec, long capacity) { }
}
package com.hld.ratelimit;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
@Component
public class RulesCache {
private final ConfigServiceClient config;
private final AtomicReference<Map<String, RateLimitInterceptor.Rule>> snapshot =
new AtomicReference<>(Map.of());
public RulesCache(ConfigServiceClient config) {
this.config = config;
refresh();
}
@Scheduled(fixedDelayString = "5000")
public void refresh() {
try {
Map<String, RateLimitInterceptor.Rule> fresh = config.fetchAllRules();
snapshot.set(Map.copyOf(fresh));
} catch (RuntimeException ignored) {
// keep stale snapshot on config-service outage
}
}
public RateLimitInterceptor.Rule resolve(String identity, String path, String method) {
Map<String, RateLimitInterceptor.Rule> current = snapshot.get();
RateLimitInterceptor.Rule rule = current.get(identity + ":" + method + ":" + path);
if (rule != null) return rule;
rule = current.get(identity + ":*");
if (rule != null) return rule;
return current.getOrDefault("default",
new RateLimitInterceptor.Rule("rl:" + identity, 100.0 / 60.0, 100));
}
public interface ConfigServiceClient {
Map<String, RateLimitInterceptor.Rule> fetchAllRules();
}
}
Key design decisions & trade-offs
- Algorithm — Chosen: Token bucket (vs fixed/sliding window). Allows configurable bursts, O(1) per check, and tiny state; fixed window allows 2x burst at boundaries and sliding-window-log is O(N) memory.
- Shared state store — Chosen: Redis with Lua scripts (vs per-gateway local counters). Gateway replicas must agree on one counter; Lua gives single-RTT atomic check-and-decrement. Local counters would over-admit by a factor of N replicas.
- Rules distribution — Chosen: Pull-based cache with 5s refresh (vs push/stream). Eventual consistency on rules is fine since rule changes are rare (<100/min); pull avoids config-service outages blocking the hot path.
- Failure mode — Chosen: Fail open on Redis unavailability. Most APIs prefer brief over-admission to outright downtime during limiter outage; payment APIs may invert this and fail closed.
- Multi-region — Chosen: Per-POP limiter with async delta reconciliation. Strong global counters would add cross-region RTT per request; per-POP plus eventual reconciliation accepts a small over-admission window.
Interview follow-ups
- Add a leaky-bucket option for smooth outbound rate to fragile downstreams
- Introduce adaptive limits that lower quotas when upstream latency SLO is breached
- Support hierarchical quotas (org > team > key) with precedence rules
- Dashboard showing per-identity 429 rate and Redis slot contention
- Tarpitting / progressive backoff for repeatedly abusive keys before outright blocking