Metrics Monitoring and Alerting (Prometheus-style) System Design Interview Question
Problem: Design a metrics monitoring and alerting system for a fleet of ~10,000 servers and thousands of services, Prometheus/Grafana-style.
Overview
A metrics monitoring system is the nervous system of a production fleet: without it you are flying blind, and with a bad one you are flying blind while being paged every ninety seconds. The Prometheus-style design that has become the de facto standard answers three questions: how metrics get collected, how they are stored cheaply enough to keep for weeks, and how an alert engine evaluates thousands of rules over them without falling behind. The big commitments are pull-based scraping (the collector discovers targets via service discovery and pulls /metrics on a fixed interval), a time-series database that leans on delta-of-delta encoding and XOR float compression to keep the average sample under two bytes on disk, and a rule engine that re-runs PromQL expressions every 15-30 seconds to fire alerts. This page walks that pipeline end-to-end and shows the Java primitives you would build if you were instrumenting a service by hand.
Summary
A pull-based metrics pipeline: instrumented apps expose /metrics, a collector scrapes them on a fixed interval, samples land in a time-series DB (TSDB) with delta-of-delta + XOR float compression, an alert engine continuously evaluates PromQL-style rules, and Grafana queries the TSDB for dashboards. The dominant design choice is pull-based scraping over push: pull makes service discovery the source of truth for the target list, and a target failing to be scraped is itself a signal (up == 0). The main tradeoff is short-lived batch jobs, which can finish before a scrape; those go through an optional push gateway. The design is sized for ~10M active series and ~1M samples/sec peak ingest, which fits Prometheus on a handful of nodes and spills to a remote long-term TSDB (Thanos / Mimir / VictoriaMetrics) for retention beyond 15 days.
Requirements
Functional
- Collect counter, gauge, histogram, and summary metrics from ~10,000 hosts and thousands of services
- Expose a /metrics endpoint per process in Prometheus text exposition format
- Scrape all targets on a configurable interval (typically 15 or 30 seconds)
- Store samples for at least 15 days locally and 12 months in a remote long-term TSDB
- Evaluate PromQL-style alerting rules continuously and dispatch notifications
- Serve ad-hoc queries from Grafana with sub-second latency on the last hour of data
Non-functional
- Ingest at least 1 million samples per second per collector
- Support 10 million active series per collector without OOM
- Scrape failures must themselves be observable (up == 0 is the signal)
- Alerting latency under 60 seconds from a rule's condition being met (including any "for" hold duration) to notification
- Graceful degradation: if the remote TSDB is down, local retention continues
- Instrumentation overhead under 1% CPU and negligible allocation on the hot path
Capacity Assumptions
- 10K hosts * ~1000 active series each = ~10M active series
- Scrape interval: 15s default; alert eval interval: 30s
- Sample size on disk: ~1.3 bytes/sample (Gorilla / Facebook compression)
- Retention: 15 days hot local TSDB, 13 months remote long-term storage
- Alerts: ~2000 rules, ~50 firing on average
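The ~1.3 bytes/sample figure depends on consecutive float values sharing most of their bits. As a rough illustration, the sketch below XORs the IEEE-754 bit patterns of two neighbouring values and counts how many bits remain once leading and trailing zeros are stripped. The class and method names (XorBits, meaningfulBits) are made up for this example, and a real Gorilla-style encoder would additionally bit-pack the result with control bits, which is omitted here.

```java
final class XorBits {
    /**
     * Number of bits that would need to be stored for `cur` given `prev`,
     * ignoring control bits: the width of the XOR after dropping leading
     * and trailing zero bits. Returns 0 when the value did not change.
     */
    static int meaningfulBits(double prev, double cur) {
        long xor = Double.doubleToLongBits(prev) ^ Double.doubleToLongBits(cur);
        if (xor == 0) {
            return 0; // identical value: an encoder can emit a single "same" bit
        }
        return 64 - Long.numberOfLeadingZeros(xor) - Long.numberOfTrailingZeros(xor);
    }

    public static void main(String[] args) {
        System.out.println("12.0 -> 12.0: " + meaningfulBits(12.0, 12.0) + " bits");
        System.out.println("12.0 -> 24.0: " + meaningfulBits(12.0, 24.0) + " bits");
        System.out.println("12.0 -> 12.1: " + meaningfulBits(12.0, 12.1) + " bits");
    }
}
```

Gauges that sit still and counters that grow by round amounts tend to land in the cheap cases, while noisy floating-point values with full mantissas compress much less well, so the average depends on the workload.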
Back-of-Envelope Estimates
- Ingest: 10M series / 15s ≈ 667K samples/sec sustained (peak ~1M/sec)
- Daily sample volume: 667K * 86400 ≈ 57.6B samples/day
- Hot storage: 57.6B samples/day * 1.3 bytes/sample * 15 days ≈ 1.1 TB across the TSDB cluster
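- Query load: Grafana ~500 concurrent dashboards * 6 panels each ≈ 3K PromQL queries per refresh cycle; that is ~3K queries/sec only in the worst case where every panel refreshes within the same second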
- Alert eval: 2000 rules / 30s ≈ 67 evals/sec, each touching a handful of series
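The arithmetic above can be reproduced with a few lines of Java, which makes it easy to re-run the estimate when an assumption changes. This is only a sketch: the class name CapacityEstimate is hypothetical, and the constants are copied from the Capacity Assumptions list.

```java
final class CapacityEstimate {
    // Inputs taken from the Capacity Assumptions list.
    static final long ACTIVE_SERIES = 10_000_000L;
    static final long SCRAPE_INTERVAL_SEC = 15;
    static final double BYTES_PER_SAMPLE = 1.3;
    static final int HOT_RETENTION_DAYS = 15;
    static final int ALERT_RULES = 2_000;
    static final int EVAL_INTERVAL_SEC = 30;

    /** Every active series yields one sample per scrape interval. */
    static double samplesPerSecond() {
        return (double) ACTIVE_SERIES / SCRAPE_INTERVAL_SEC;
    }

    static double samplesPerDay() {
        return samplesPerSecond() * 86_400;
    }

    /** Compressed bytes held in the hot (local) tier. */
    static double hotStorageBytes() {
        return samplesPerDay() * BYTES_PER_SAMPLE * HOT_RETENTION_DAYS;
    }

    static double ruleEvalsPerSecond() {
        return (double) ALERT_RULES / EVAL_INTERVAL_SEC;
    }

    public static void main(String[] args) {
        System.out.printf("ingest:      %.0f samples/sec%n", samplesPerSecond());
        System.out.printf("daily:       %.1f billion samples%n", samplesPerDay() / 1e9);
        System.out.printf("hot storage: %.2f TB%n", hotStorageBytes() / 1e12);
        System.out.printf("rule evals:  %.1f per sec%n", ruleEvalsPerSecond());
    }
}
```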
High-level architecture
Each service process embeds a lightweight client library that maintains in-memory counters, gauges, and histograms and exposes them on an HTTP /metrics endpoint in Prometheus text format. A collector fleet discovers targets through a service-discovery plug-in — Kubernetes API, Consul, or a static file — and, on a fixed scrape interval, issues a GET against every target's /metrics URL. Samples land in a local time-series database that is append-only: the head block holds the last two hours in memory with a write-ahead log, and older blocks are compacted into immutable on-disk chunks that apply delta-of-delta timestamp encoding and Gorilla XOR float compression, typically 1.3 bytes per sample. A rule manager reloads alerting rules from config and re-evaluates them every 15 or 30 seconds against the head block; firing alerts are pushed to an Alertmanager-style deduper that handles grouping, silencing, and routing to PagerDuty or Slack. For long-term retention, the collector remote-writes samples to a horizontally scalable store such as Thanos, Mimir, or VictoriaMetrics, which fans out queries across many collectors behind a single query API. Short-lived batch jobs that might finish before a scrape push into an optional push-gateway that the collector then scrapes like any other target. Grafana is a read-only client of the query API and performs no ingestion itself.
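To make the delta-of-delta idea concrete, here is a minimal sketch that turns a run of scrape timestamps into first value, first delta, and then the change in delta between consecutive samples. The class name DeltaOfDelta is illustrative only, and the variable-length bit packing that an actual TSDB applies to these small integers is deliberately left out.

```java
import java.util.Arrays;

final class DeltaOfDelta {
    /** out[0] = first timestamp, out[1] = first delta, out[i>=2] = delta(i) - delta(i-1). */
    static long[] encode(long[] ts) {
        long[] out = new long[ts.length];
        for (int i = 0; i < ts.length; i++) {
            if (i == 0) {
                out[i] = ts[0];
            } else if (i == 1) {
                out[i] = ts[1] - ts[0];
            } else {
                out[i] = (ts[i] - ts[i - 1]) - (ts[i - 1] - ts[i - 2]);
            }
        }
        return out;
    }

    /** Inverse of encode: rebuilds the running delta, then the timestamps. */
    static long[] decode(long[] enc) {
        long[] ts = new long[enc.length];
        long delta = 0;
        for (int i = 0; i < enc.length; i++) {
            if (i == 0) {
                ts[0] = enc[0];
            } else {
                delta = (i == 1) ? enc[1] : delta + enc[i];
                ts[i] = ts[i - 1] + delta;
            }
        }
        return ts;
    }

    public static void main(String[] args) {
        // A 15s scrape interval with one millisecond of jitter on the fourth sample.
        long[] ts = {1_000, 16_000, 31_000, 46_001, 61_000};
        System.out.println(Arrays.toString(encode(ts)));
    }
}
```

With a steady scrape interval almost every encoded value is zero or a tiny integer, which is what lets the timestamp column shrink to a bit or two per sample after packing.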
Architecture Components (9)
- Operator Browser (client) — SRE/developer browser rendering Grafana dashboards and receiving pager notifications.
- Instrumented App (client) — Application process exposing a /metrics endpoint with counters, gauges, histograms, and summaries.
- Push Gateway (api) — Accumulator for short-lived batch jobs that finish before the next scrape.
- Scrape Collector (worker) — Prometheus-style scraper: on every interval, pulls /metrics from every discovered target.
- Metrics Kafka (queue) — Optional buffer between collectors and the TSDB write path — absorbs bursts and decouples failure domains.
- Time Series DB (nosql) — Columnar TSDB (Prometheus local + Thanos/Mimir for long-term) with Gorilla compression.
- Alert Rule Engine (worker) — Continuously evaluates alert rules (PromQL) and hands firing alerts to the notifier.
- Grafana (api) — Visualization layer. Issues PromQL queries to the TSDB and renders time-series panels.
- Notification Sender (api) — Routes firing alerts to PagerDuty, Slack, email, and webhooks.
Operations Walked Through (3)
- scrape — Every 15s the collector pulls /metrics from a target, parses, and writes the samples through Kafka into the TSDB.
- query — An operator opens a Grafana dashboard; Grafana issues a PromQL query for the last 15 minutes and renders a chart.
- alert — The alert engine evaluates a PromQL rule every 30s; when 5xx rate exceeds 5% for 10 minutes, it fires and the notifier pages on-call.
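The "exceeds 5% for 10 minutes" behaviour in the alert walkthrough is a small state machine per rule: a breach moves the rule to pending, a breach that persists for the hold duration moves it to firing, and any evaluation below the threshold resets it. The sketch below illustrates that logic under simplifying assumptions (one series per rule, a plain greater-than threshold); AlertRuleState and its members are hypothetical names, not an actual Prometheus API.

```java
import java.time.Duration;

final class AlertRuleState {
    enum State { INACTIVE, PENDING, FIRING }

    private final double threshold;
    private final long forMillis;
    private State state = State.INACTIVE;
    private long pendingSinceMs;

    AlertRuleState(double threshold, Duration forDuration) {
        this.threshold = threshold;
        this.forMillis = forDuration.toMillis();
    }

    /** Called once per evaluation interval with the rule expression's current value. */
    State evaluate(double value, long nowMs) {
        if (!(value > threshold)) {
            state = State.INACTIVE; // condition cleared (or value is NaN): reset the hold timer
            return state;
        }
        if (state == State.INACTIVE) {
            state = State.PENDING;  // first breach: start the hold timer
            pendingSinceMs = nowMs;
        }
        if (state == State.PENDING && nowMs - pendingSinceMs >= forMillis) {
            state = State.FIRING;   // breach held for the full duration
        }
        return state;
    }
}
```

Because the state only advances when evaluate is called, a 30s evaluation interval can add up to one interval of detection delay on top of the hold duration.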
Implementation
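// Metrics.java
package com.example.metrics;

import java.util.concurrent.atomic.*;

/** Hand-rolled instrumentation primitives; every operation is lock-free. */
public final class Metrics {
    private Metrics() {}

    public static final class Counter {
        private final LongAdder value = new LongAdder(); // lock-free, sharded adder
        public void inc() { value.increment(); }
        public void add(long delta) { value.add(delta); }
        public long get() { return value.sum(); }
    }

    public static final class Gauge {
        // A gauge is last-write-wins, so its double is kept as raw bits in one atomic word.
        // reset() followed by add() on a DoubleAdder is not atomic: a concurrent reader
        // could observe 0 and a concurrent inc() could be lost.
        private final AtomicLong bits = new AtomicLong(Double.doubleToLongBits(0.0));
        public void set(double v) { bits.set(Double.doubleToLongBits(v)); }
        public void inc(double v) {
            bits.updateAndGet(b -> Double.doubleToLongBits(Double.longBitsToDouble(b) + v));
        }
        public double get() { return Double.longBitsToDouble(bits.get()); }
    }

    public static final class Histogram {
        private final double[] buckets;    // ascending upper bounds, +Inf at end
        private final LongAdder[] counts;  // per-bucket (non-cumulative) counts
        private final DoubleAdder sum = new DoubleAdder();
        private final LongAdder total = new LongAdder();

        public Histogram(double[] upperBounds) {
            if (upperBounds.length == 0
                    || upperBounds[upperBounds.length - 1] != Double.POSITIVE_INFINITY) {
                throw new IllegalArgumentException("last upper bound must be +Inf");
            }
            this.buckets = upperBounds.clone(); // defensive copy
            this.counts = new LongAdder[buckets.length];
            for (int i = 0; i < counts.length; i++) counts[i] = new LongAdder();
        }

        public void observe(double v) {
            total.increment();
            sum.add(v);
            // Increment only the first bucket that contains v; reads accumulate.
            for (int i = 0; i < buckets.length; i++) {
                if (v <= buckets[i]) { counts[i].increment(); return; }
            }
        }

        public long count() { return total.sum(); }
        public double sum() { return sum.sum(); }

        /** Cumulative count of observations <= bucket i's upper bound, as le="..." requires. */
        public long bucketCount(int i) {
            long c = 0;
            for (int j = 0; j <= i; j++) c += counts[j].sum();
            return c;
        }
    }
}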
// MetricsServer.java
package com.example.metrics;

import com.sun.net.httpserver.*;
import java.io.*;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Registry and Registry.Entry are assumed to be defined elsewhere in the package.
public class MetricsServer {
    private final Registry registry; // holds named Counters, Gauges, Histograms

    public MetricsServer(Registry r) { this.registry = r; }

    public void start(int port) throws IOException {
        HttpServer s = HttpServer.create(new InetSocketAddress(port), 0);
        s.createContext("/metrics", this::handle);
        s.start();
    }

    private void handle(HttpExchange ex) throws IOException {
        StringBuilder out = new StringBuilder(8192);
        for (Registry.Entry e : registry.snapshot()) {
            out.append("# HELP ").append(e.name).append(' ').append(e.help).append('\n');
            out.append("# TYPE ").append(e.name).append(' ').append(e.type).append('\n');
            if (e.type.equals("histogram")) {
                Metrics.Histogram h = (Metrics.Histogram) e.instrument;
                // le buckets are cumulative in the text format, so bucketCount(i) must
                // return the number of observations <= buckets[i].
                for (int i = 0; i < e.buckets.length; i++) {
                    out.append(e.name).append("_bucket{le=\"").append(fmt(e.buckets[i])).append("\"} ")
                       .append(h.bucketCount(i)).append('\n');
                }
                out.append(e.name).append("_sum ").append(fmt(h.sum())).append('\n');
                out.append(e.name).append("_count ").append(h.count()).append('\n');
            } else if (e.type.equals("counter")) {
                out.append(e.name).append(' ').append(((Metrics.Counter) e.instrument).get()).append('\n');
            } else { // gauge
                out.append(e.name).append(' ').append(fmt(((Metrics.Gauge) e.instrument).get())).append('\n');
            }
        }
        byte[] body = out.toString().getBytes(StandardCharsets.UTF_8);
        ex.getResponseHeaders().set("Content-Type", "text/plain; version=0.0.4");
        ex.sendResponseHeaders(200, body.length);
        try (OutputStream os = ex.getResponseBody()) { os.write(body); }
    }

    // Java prints infinities as "Infinity"; the exposition format spells them "+Inf" and "-Inf".
    private static String fmt(double v) {
        if (v == Double.POSITIVE_INFINITY) return "+Inf";
        if (v == Double.NEGATIVE_INFINITY) return "-Inf";
        return Double.toString(v);
    }
}
// PullScheduler.java
package com.example.metrics.scrape;

import java.net.URI;
import java.net.http.*;
import java.time.Duration;
import java.util.*;
import java.util.concurrent.*;

// Sink, Target, and ExpositionParser are assumed to be defined elsewhere in the package.
public class PullScheduler {
    // Each scrape blocks a pool thread for up to the request timeout, so the pool
    // should be sized for the expected number of concurrently in-flight scrapes.
    private final ScheduledExecutorService exec = Executors.newScheduledThreadPool(32);
    private final HttpClient http = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(2)).build();
    private final Sink sink; // writes samples into the TSDB head block
    private final Duration interval;

    public PullScheduler(Sink sink, Duration interval) {
        this.sink = sink;
        this.interval = interval;
    }

    public void register(Target t) {
        // Stagger scrapes across the interval so all 10k targets don't hit at t=0.
        long jitterMs = ThreadLocalRandom.current().nextLong(interval.toMillis());
        exec.scheduleAtFixedRate(() -> scrape(t), jitterMs, interval.toMillis(), TimeUnit.MILLISECONDS);
    }

    private void scrape(Target t) {
        long ts = System.currentTimeMillis();
        try {
            HttpRequest req = HttpRequest.newBuilder(URI.create(t.url))
                    .timeout(Duration.ofSeconds(10)).GET().build();
            HttpResponse<String> r = http.send(req, HttpResponse.BodyHandlers.ofString());
            if (r.statusCode() != 200) { sink.writeUp(t, ts, 0); return; }
            // Parse the whole body before writing anything, so a malformed payload cannot
            // leave a partial scrape behind or record both up=1 and up=0 for the same ts.
            List<Runnable> writes = new ArrayList<>();
            ExpositionParser.parse(r.body(), (name, labels, value) ->
                    writes.add(() -> sink.writeSample(t, name, labels, ts, value)));
            writes.forEach(Runnable::run);
            sink.writeUp(t, ts, 1);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the shutdown signal visible to the pool
            sink.writeUp(t, ts, 0);
        } catch (Exception e) {
            sink.writeUp(t, ts, 0); // up == 0 is itself a signal the alerting engine uses
        }
    }
}
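The scheduler above hands the response body to an ExpositionParser that is not shown. As a rough idea of what such a parser has to do, the sketch below handles the common subset of the text exposition format: comment lines, optional labels, an optional trailing timestamp, and the +Inf / -Inf spellings. The class ExpositionLineParser and its SampleHandler callback are hypothetical stand-ins, and the code assumes well-formed, space-separated input of the kind the MetricsServer above emits.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ExpositionLineParser {
    /** Hypothetical callback, shaped like the (name, labels, value) lambda in the scheduler. */
    interface SampleHandler {
        void accept(String name, Map<String, String> labels, double value);
    }

    static void parse(String body, SampleHandler handler) {
        for (String raw : body.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) continue; // skip HELP/TYPE/comments
            Map<String, String> labels = new LinkedHashMap<>();
            String name;
            String rest;
            int brace = line.indexOf('{');
            if (brace >= 0) {
                int close = line.lastIndexOf('}'); // only numbers follow the label set
                name = line.substring(0, brace);
                parseLabels(line.substring(brace + 1, close), labels);
                rest = line.substring(close + 1).trim();
            } else {
                int sp = line.indexOf(' ');
                name = line.substring(0, sp);
                rest = line.substring(sp + 1).trim();
            }
            String valueToken = rest.split("\\s+")[0]; // an optional timestamp is ignored
            handler.accept(name, labels, parseValue(valueToken));
        }
    }

    private static void parseLabels(String s, Map<String, String> out) {
        int i = 0;
        while (i < s.length()) {
            int eq = s.indexOf('=', i);
            if (eq < 0) break;
            String key = s.substring(i, eq).trim();
            i = eq + 2; // skip '=' and the opening quote
            StringBuilder val = new StringBuilder();
            while (i < s.length() && s.charAt(i) != '"') {
                char c = s.charAt(i++);
                if (c == '\\' && i < s.length()) {
                    char esc = s.charAt(i++);       // handles \\, \" and \n
                    val.append(esc == 'n' ? '\n' : esc);
                } else {
                    val.append(c);
                }
            }
            i++; // closing quote
            if (i < s.length() && s.charAt(i) == ',') i++;
            out.put(key, val.toString());
        }
    }

    private static double parseValue(String token) {
        switch (token) {
            case "+Inf": return Double.POSITIVE_INFINITY;
            case "-Inf": return Double.NEGATIVE_INFINITY;
            default:     return Double.parseDouble(token); // also accepts "NaN"
        }
    }
}
```

A production parser would also validate metric names, reject malformed lines instead of throwing index errors, and stream the body rather than splitting it into strings.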
Key design decisions & trade-offs
- Collection model — Chosen: Pull-based scraping over push. Pull makes service discovery the source of truth for 'what should exist', and a missing scrape (up == 0) is itself a first-class signal. Push hides dead targets as silent no-data.
- Storage format — Chosen: Delta-of-delta timestamps plus Gorilla XOR float compression. Metric timestamps are near-regular and float values rarely change dramatically between samples; these encodings cut disk from ~16 bytes/sample to ~1.3 bytes while preserving full resolution.
- In-process aggregation — Chosen: LongAdder / DoubleAdder over AtomicLong. Under write contention, LongAdder shards across cells to avoid the single-cache-line hot spot of AtomicLong. The extra read cost (summing cells) is paid once per scrape, not per increment.
- Short-lived jobs — Chosen: Optional push-gateway rather than extending the TSDB to accept pushes. Most targets are long-lived and pull works beautifully; a narrow push-gateway for cron-style jobs keeps the main path simple without forcing the whole system into push semantics.
- Long-term retention — Chosen: Remote-write to Thanos/Mimir/VictoriaMetrics instead of scaling local TSDB. Local TSDB is optimised for recent data and a single host; multi-month retention and global query need a horizontally scalable store. Remote-write decouples ingest from long-term storage evolution.
- Histogram implementation — Chosen: Fixed bucket boundaries rather than sparse/exponential histograms. Fixed buckets are trivial to aggregate across instances and simple to reason about in PromQL; the loss of precision at the tails is worth the operational simplicity at fleet scale.
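The claim that fixed buckets are trivial to aggregate can be illustrated in a few lines: cumulative bucket counts from different instances are summed element by element, and a quantile is then estimated by linear interpolation inside the bucket that contains the target rank. This is a sketch loosely modeled on how PromQL-style quantile estimation behaves, not a copy of any implementation. BucketQuantile is a made-up name, and the code assumes at least one finite bucket, ascending bounds ending in +Inf, and a lower edge of 0 for the first bucket.

```java
final class BucketQuantile {
    /** Aggregates two instances that share the same bucket boundaries. */
    static long[] merge(long[] a, long[] b) {
        long[] out = new long[a.length];
        for (int i = 0; i < a.length; i++) out[i] = a[i] + b[i];
        return out;
    }

    /**
     * Estimates the q-quantile (0 < q <= 1) from cumulative bucket counts.
     * upperBounds is ascending and ends with +Inf; cumulative[i] counts
     * observations <= upperBounds[i].
     */
    static double estimate(double q, double[] upperBounds, long[] cumulative) {
        if (q <= 0 || q > 1) throw new IllegalArgumentException("q must be in (0, 1]");
        long total = cumulative[cumulative.length - 1];
        if (total == 0) return Double.NaN;
        double rank = q * total;
        int b = 0;
        while (b < cumulative.length - 1 && cumulative[b] < rank) b++;
        if (b == upperBounds.length - 1) {
            // Rank falls in the +Inf bucket: the best available answer is the
            // highest finite bound, which is where tail precision is lost.
            return upperBounds[upperBounds.length - 2];
        }
        double lower = (b == 0) ? 0.0 : upperBounds[b - 1];
        long below = (b == 0) ? 0 : cumulative[b - 1];
        long inBucket = cumulative[b] - below;
        return lower + (upperBounds[b] - lower) * (rank - below) / inBucket;
    }
}
```

The estimate can only be as precise as the bucket containing the rank is narrow, which is the tail-precision cost the trade-off above accepts.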
Interview follow-ups
- How do you handle cardinality explosions when a well-meaning developer labels a metric with user_id?
- How does the alert rule engine avoid re-evaluating every rule from scratch every 15 seconds?
- How do you migrate from local retention to a remote TSDB without a gap in history?
- How do you federate metrics across regions so one Grafana dashboard can query all of them?
- How do you instrument a hot path that runs billions of times per day without measurable overhead?
- How do you detect and recover from a collector that silently falls behind on scrape ingestion?