Payment System System Design Interview Question
Problem: Design a payment system that charges customers via external PSPs (Stripe/Braintree), keeps an auditable ledger, supports refunds, and reconciles with the PSP daily.
Overview
A payment system sits on the most unforgiving slice of the stack: money. Unlike a social feed where a stale read is an annoyance, a duplicate charge becomes a chargeback, a support ticket, and sometimes a regulator complaint. The job sounds simple on paper — accept a charge, call Stripe or Braintree, record what happened — but every interesting failure mode lives in the gaps between those three steps. Networks time out after the PSP captured the card but before the client saw a response. Retries turn one intent into two captures. Background jobs crash halfway through a refund and leave the ledger out of sync with the settlement file. The design below accepts that the external world (PSPs, mobile networks, flaky customers) is hostile and pushes correctness into two primitives: client-supplied idempotency keys stored in Redis, and a double-entry ledger where every debit has a matching credit. Everything else — webhooks, reconciliation, refunds — composes on top of those two invariants.
Summary
An idempotent payment API that sits in front of an external PSP (Stripe, Braintree) and a double-entry ledger. The dominant design choices are (1) client-supplied idempotency keys claimed in Redis with an atomic set-if-absent (SET with NX and a TTL) so a retried charge never double-bills; (2) double-entry bookkeeping in a SQL ledger (DECIMAL or integer minor units, never float) so every debit has a credit and the books always balance; (3) an orchestrator pattern that calls the PSP, records the ledger entry, and updates the wallet, with a compensating refund if any step fails. A nightly reconciliation job reads the PSP's settlement report and diffs it against our ledger so any drift is caught within 24h.
Requirements
Functional
- Charge a customer's saved payment method through an external PSP (Stripe, Braintree, Adyen) and return a stable charge_id
- Honor client-supplied Idempotency-Key: a retried POST with the same key must return the original response, never re-charge
- Record every money movement as balanced double-entry ledger rows (debit + credit) in a SQL store
- Process PSP webhooks (charge.succeeded, charge.refunded, charge.dispute.created) and update internal state exactly once
- Support full and partial refunds that compensate the original ledger entries
- Run a nightly reconciliation job that diffs the PSP settlement report against our ledger and flags drift
- Expose a read API for charge status, refund status, and a merchant-facing statement
Non-functional
- Strong consistency on charge state — no user should ever see a ledger that doesn't balance
- Idempotency window of at least 24 hours; same key + same payload returns the original response
- p99 latency under 1.5s end-to-end, bounded by PSP (PSP p95 ~300ms is the long pole)
- Durability: zero ledger row loss; 7-year retention for SOX and PCI audit
- Availability 99.95% on the write path; the PSP itself is the next weakest link
- PCI-DSS scope minimized: raw PAN never touches our servers, only tokenized payment methods
Capacity Assumptions
- 1M successful charges/day, peak 3x average
- Average charge amount: $50, 99th percentile: $2000
- 5% refund rate
- PSP p95 latency: 300ms (external network + PSP processing)
- 7-year retention on ledger entries for SOX / PCI audit
Back-of-Envelope Estimates
- Charge QPS: 1M / 86400 ≈ 12 QPS avg, ~35 QPS peak — tiny in QPS, huge in correctness cost
- Ledger writes: 2 per charge (debit + credit) + 2 per refund ≈ 2.1M rows/day
- Ledger storage: 2.1M * 365 * 7 * 200B ≈ 1.1 TB over 7 years — fits in one well-tuned Postgres cluster
- Idempotency keys in Redis: 1M/day * 24h TTL ≈ 1M live keys * 200B ≈ 200 MB — trivial
- Reconciliation batch: scan 1M rows nightly in <10 minutes
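The estimates above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the class and constant names are hypothetical, and the 200-byte figure is the same assumed average size used in the list for both ledger rows and idempotency entries.

```java
// Back-of-envelope capacity math. Inputs come from the Capacity Assumptions list;
// the class and constant names are illustrative, not part of the design.
final class CapacityEstimate {
    static final long CHARGES_PER_DAY = 1_000_000L;
    static final double REFUND_RATE = 0.05;
    static final int PEAK_FACTOR = 3;
    static final int ROWS_PER_MOVEMENT = 2;   // one debit + one credit
    static final int BYTES_PER_ROW = 200;     // assumed average row / key size
    static final int RETENTION_YEARS = 7;

    static double avgQps() {
        return CHARGES_PER_DAY / 86_400.0;
    }

    static double peakQps() {
        return avgQps() * PEAK_FACTOR;
    }

    // Charges and refunds each write a balanced pair of rows.
    static long ledgerRowsPerDay() {
        long refundsPerDay = Math.round(CHARGES_PER_DAY * REFUND_RATE);
        return ROWS_PER_MOVEMENT * (CHARGES_PER_DAY + refundsPerDay);
    }

    static long ledgerBytesOverRetention() {
        return ledgerRowsPerDay() * 365L * RETENTION_YEARS * BYTES_PER_ROW;
    }

    // One live key per charge inside the 24h TTL window.
    static long idempotencyBytes() {
        return CHARGES_PER_DAY * BYTES_PER_ROW;
    }
}
```

With these inputs the average rate is about 11.6 QPS (about 34.7 at peak), the ledger grows by 2,100,000 rows per day, and seven years of rows come to 1,073,100,000,000 bytes, which is the roughly 1.1 TB quoted above.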
High-level architecture
Requests enter at an L7 load balancer that terminates public TLS and re-originates mTLS to a stateless Payment API tier. The API validates the request, extracts the Idempotency-Key header, and asks the Idempotency Store (Redis) whether this key has been seen. A single Redis SET with the NX and EX options claims the key and attaches the 24h TTL in one atomic command (plain SETNX followed by EXPIRE is two commands and can leave a key with no expiry if the process dies in between). If the key already exists and the request payload hash matches, we return the cached response and skip everything downstream.
A Payment Orchestrator then drives the three-step dance: (1) call the PSP to capture the card via tokenized payment method, (2) write the double-entry ledger rows in a single SQL transaction, (3) update the Wallet Service or merchant balance. If step 1 succeeds but step 2 fails, the orchestrator issues a compensating PSP refund, so the system never leaves money uncounted.
Webhooks from the PSP arrive at a separate endpoint with signature verification; each webhook carries an event_id used as its own idempotency key so Stripe's at-least-once delivery doesn't produce duplicate ledger rows. A nightly Reconciliation Job pulls the PSP's settlement file, joins it against the ledger on external_charge_id, and files any mismatch into a manual-review queue.
Storage is a Postgres cluster partitioned on charge_date, because the ledger is append-only and time-ordered access dominates. Notifications are fire-and-forget over Kafka so a slow SMS provider never backs up the charge path.
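The claim, replay, and conflict outcomes of the idempotency check can be sketched without Redis. The class below is an in-memory stand-in, not the production store: every name in it is hypothetical, the clock is passed in so the TTL is testable, and the IN_FLIGHT outcome (same key seen again before the first request has committed a response) is an assumption about how an unfinished request should be reported.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ConcurrentHashMap;

// In-memory stand-in for the Redis idempotency store. All names here are hypothetical.
final class InMemoryIdempotencyStore {
    enum Outcome { CLAIMED, IN_FLIGHT, REPLAY, KEY_REUSE_CONFLICT }

    record ClaimResult(Outcome outcome, String cachedResponse) {}

    // One slot per Idempotency-Key; response stays null until commit() runs.
    private static final class Slot {
        final String fingerprint;
        final Instant expiresAt;
        volatile String response;

        Slot(String fingerprint, Instant expiresAt) {
            this.fingerprint = fingerprint;
            this.expiresAt = expiresAt;
        }
    }

    private final ConcurrentHashMap<String, Slot> slots = new ConcurrentHashMap<>();

    ClaimResult claim(String key, String fingerprint, Duration ttl, Instant now) {
        Slot fresh = new Slot(fingerprint, now.plus(ttl));
        // Atomic set-if-absent-or-expired: the role a single SET ... NX EX plays in Redis.
        Slot winner = slots.compute(key,
                (k, s) -> (s == null || !s.expiresAt.isAfter(now)) ? fresh : s);
        if (winner == fresh) {
            return new ClaimResult(Outcome.CLAIMED, null);
        }
        if (!winner.fingerprint.equals(fingerprint)) {
            return new ClaimResult(Outcome.KEY_REUSE_CONFLICT, null);
        }
        String cached = winner.response;
        return cached == null
                ? new ClaimResult(Outcome.IN_FLIGHT, null)
                : new ClaimResult(Outcome.REPLAY, cached);
    }

    // Stores the response so later claims with the same key and payload replay it.
    void commit(String key, String response) {
        Slot slot = slots.get(key);
        if (slot != null) {
            slot.response = response;
        }
    }
}
```

A caller would map CLAIMED to "run the orchestrator", REPLAY to "return the cached body", KEY_REUSE_CONFLICT to a 422, and IN_FLIGHT to something like a 409 so the client backs off and retries.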
Architecture Components (10)
- Client (Merchant checkout / mobile app) (client) — Merchant checkout page or mobile app that initiates a charge with a client-generated idempotency key.
- Load Balancer (lb) — L7 HTTPS load balancer with mTLS to the API tier.
- Payment Service API (api) — Stateless REST API that validates requests, enforces idempotency, and delegates to the orchestrator.
- Idempotency Store (Redis) (cache) — Redis store of idempotency keys and their cached responses, claimed atomically with SET NX plus a 24h TTL.
- Payment Orchestrator (worker) — Coordinates the PSP call, ledger write, wallet update, and notification. Owns the saga / compensation logic.
- PSP Gateway (Stripe / Braintree) (api) — Abstraction over external payment service providers. Pool of outbound HTTPS connections to Stripe/Braintree.
- Ledger DB (Double-Entry, SQL) (sql) — Postgres/MySQL with double-entry bookkeeping. Every charge writes a matched debit/credit pair inside one transaction.
- Wallet / Balance Service (api) — Serves merchant balance views derived from the ledger. Keeps a cached aggregated balance per account.
- Reconciliation Job (worker) — Nightly batch that compares our ledger against the PSP's settlement report and raises discrepancies.
- Notification Service (queue) — Emits receipt emails, push notifications, and merchant webhooks after state changes.
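The orchestrator's compensation rule (PSP capture succeeded, ledger write failed, so refund at the PSP) can be sketched as follows. The three interfaces are hypothetical stand-ins for the PSP Gateway, Ledger DB, and Wallet Service; their method names and signatures are assumptions, and failure handling for the wallet step is deliberately left out of this sketch.

```java
// Hypothetical collaborator interfaces; the real gateway, ledger, and wallet APIs
// are not specified in the text.
interface PspGateway {
    String capture(String paymentToken, long amountMinor);   // returns the PSP charge id
    void refund(String pspChargeId, long amountMinor);
}

interface LedgerWriter {
    void postCharge(String pspChargeId, long amountMinor);
}

interface WalletUpdater {
    void credit(String merchantId, long amountMinor);
}

final class ChargeSaga {
    private final PspGateway psp;
    private final LedgerWriter ledger;
    private final WalletUpdater wallet;

    ChargeSaga(PspGateway psp, LedgerWriter ledger, WalletUpdater wallet) {
        this.psp = psp;
        this.ledger = ledger;
        this.wallet = wallet;
    }

    String charge(String merchantId, String paymentToken, long amountMinor) {
        // Step 1: capture at the PSP. A failure here needs no compensation.
        String chargeId = psp.capture(paymentToken, amountMinor);
        try {
            // Step 2: record the balanced ledger rows.
            ledger.postCharge(chargeId, amountMinor);
        } catch (RuntimeException e) {
            // Compensate: money was captured but never recorded, so give it back.
            psp.refund(chargeId, amountMinor);
            throw new IllegalStateException(
                    "ledger write failed; charge " + chargeId + " refunded", e);
        }
        // Step 3: update the merchant balance (failure handling omitted in this sketch).
        wallet.credit(merchantId, amountMinor);
        return chargeId;
    }
}
```

A production saga would also have to persist its progress, since a crash between the capture and the refund call would otherwise leave the captured charge for the reconciliation job to find.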
Operations Walked Through (4)
- charge — POST /v1/charges — first-time request. Idempotency claimed, PSP charged, ledger written, wallet credited, notification emitted.
- charge-retry — Network flake causes the client to retry with the same idempotency key. Redis returns the cached response; no PSP call, no ledger write.
- refund — POST /v1/refunds — orchestrator calls PSP.refund(), writes compensating ledger entries, debits the merchant balance.
- recon-sweep — Nightly job pulls the PSP settlement file and diffs it against the ledger. Any mismatch is filed into the manual-review queue.
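For the refund operation, "compensating ledger entries" means a second balanced pair that reverses the direction of the original charge rather than editing or deleting it. The sketch below assumes the same sign convention as the charge path (negative minor units are a debit); the type and method names are hypothetical, and the bounds check is one plausible way to support partial refunds.

```java
import java.util.List;

// Hypothetical minimal entry type; amounts are signed minor units (negative = debit).
record LedgerLine(String accountId, long amountMinor, String currency) {}

final class RefundEntries {
    // Builds the reversing pair for a refund of refundMinor against a charge of
    // chargedMinor, of which alreadyRefundedMinor has been returned so far.
    static List<LedgerLine> forRefund(String customerAccount, String merchantAccount,
                                      String currency, long chargedMinor,
                                      long alreadyRefundedMinor, long refundMinor) {
        if (refundMinor <= 0) {
            throw new IllegalArgumentException("refund must be positive");
        }
        if (alreadyRefundedMinor + refundMinor > chargedMinor) {
            throw new IllegalArgumentException("refund exceeds remaining captured amount");
        }
        return List.of(
                new LedgerLine(merchantAccount, -refundMinor, currency),   // debit merchant
                new LedgerLine(customerAccount, refundMinor, currency));   // credit customer
    }
}
```

Because the original rows are never modified, the ledger stays append-only and the sum over all rows for a charge gives its net captured amount.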
Implementation
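import com.google.common.hash.Hashing;   // assumption: Hashing is Guava's
import jakarta.validation.Valid;         // javax.validation.Valid on Spring Boot 2.x
import java.time.Duration;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;

@RestController
@RequestMapping("/v1/charges")
public class PaymentController {
    private final IdempotencyStore idempotency;
    private final PaymentOrchestrator orchestrator;

    public PaymentController(IdempotencyStore idempotency, PaymentOrchestrator orchestrator) {
        this.idempotency = idempotency;
        this.orchestrator = orchestrator;
    }

    @PostMapping
    public ResponseEntity<ChargeResponse> charge(
            @RequestHeader("Idempotency-Key") String key,
            @Valid @RequestBody ChargeRequest req) {
        // Fingerprint the canonical payload so key reuse with a different body is detectable.
        String fingerprint = Hashing.sha256().hashBytes(req.canonicalBytes()).toString();
        IdempotencyStore.Slot slot = idempotency.claim(key, fingerprint, Duration.ofHours(24));
        if (slot.isReplay()) {
            if (!slot.fingerprint().equals(fingerprint)) {
                // Same key, different payload: reject instead of guessing the client's intent.
                return ResponseEntity.status(422).body(ChargeResponse.keyReuseConflict());
            }
            // Caveats: this replays with 200 even if the original attempt failed with 402, and
            // it assumes the first request has already committed a response.
            return ResponseEntity.ok(slot.cachedResponse(ChargeResponse.class));
        }
        try {
            ChargeResponse resp = orchestrator.charge(req);
            idempotency.commit(key, resp);
            return ResponseEntity.ok(resp);
        } catch (PspException e) {
            // Cache the decline too, so a retry with the same key does not hit the PSP again.
            // Any other exception leaves the key claimed but uncommitted until the TTL expires.
            ChargeResponse failed = ChargeResponse.failed(e.code(), e.getMessage());
            idempotency.commit(key, failed);
            return ResponseEntity.status(402).body(failed);
        }
    }
}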
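import java.util.List;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class Ledger {
    private final JdbcTemplate jdbc;

    public Ledger(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    // Writes all entries of one transaction atomically. Amounts are signed minor units
    // (negative = debit), so a balanced transaction sums to zero.
    @Transactional
    public void post(LedgerTxn txn) {
        // Note: this check does not catch entries that mix currencies.
        if (txn.entries().stream().mapToLong(LedgerEntry::amountMinor).sum() != 0) {
            throw new IllegalStateException("unbalanced txn: debits != credits");
        }
        for (LedgerEntry e : txn.entries()) {
            // Stored as an absolute amount plus an explicit direction column.
            jdbc.update(
                "INSERT INTO ledger_entries (txn_id, account_id, amount_minor, currency, direction, external_ref, posted_at) " +
                "VALUES (?, ?, ?, ?, ?::entry_dir, ?, now())",
                txn.id(), e.accountId(), Math.abs(e.amountMinor()), e.currency(),
                e.amountMinor() < 0 ? "DEBIT" : "CREDIT", txn.externalRef());
        }
    }

    // A captured charge debits the customer account and credits the merchant account.
    public static LedgerTxn captureCharge(String txnId, long amountMinor, String currency,
            String customerAccount, String merchantAccount, String chargeId) {
        return new LedgerTxn(txnId, chargeId, List.of(
            new LedgerEntry(customerAccount, -amountMinor, currency),
            new LedgerEntry(merchantAccount, amountMinor, currency)));
    }
}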
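// Handler method on a @RestController; webhookSecret, webhookLog, chargeService,
// refundService, and disputeService are injected collaborators (not shown).
// @Transactional assumes webhookLog and the services share one database, so a failed
// handler rolls back the dedupe row and the PSP's retry is processed instead of skipped.
@Transactional
@PostMapping("/v1/webhooks/stripe")
public ResponseEntity<Void> stripe(@RequestHeader("Stripe-Signature") String sig,
                                   @RequestBody byte[] raw) {
    // Verify against the raw bytes; re-serialized JSON would not match the signature.
    if (!StripeSigs.verify(raw, sig, webhookSecret, Duration.ofMinutes(5))) {
        return ResponseEntity.status(400).build();
    }
    StripeEvent evt = StripeEvent.parse(raw);
    // event_id is our idempotency key for webhook delivery
    if (!webhookLog.recordIfNew(evt.id(), evt.type())) {
        return ResponseEntity.ok().build(); // already processed
    }
    switch (evt.type()) {
        case "charge.succeeded" -> chargeService.markCaptured(evt.chargeId(), evt.amountMinor());
        case "charge.refunded" -> refundService.applyRefund(evt.chargeId(), evt.refundId(), evt.amountMinor());
        case "charge.dispute.created" -> disputeService.open(evt.chargeId(), evt.disputeId());
        default -> {} // acknowledge event types we do not handle so the PSP stops retrying
    }
    return ResponseEntity.ok().build();
}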
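The StripeSigs.verify helper called in the webhook handler is not defined in this write-up. The sketch below shows one way such a check could work, following Stripe's documented scheme: the Stripe-Signature header carries a timestamp (t=) and one or more v1= signatures, each an HMAC-SHA256 over the timestamp, a dot, and the raw body. The class name, the sign helper, and the extra now parameter (added so the tolerance window is testable) are assumptions, not the author's implementation.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper; a sketch of timestamped HMAC verification, not a library API.
final class WebhookSignature {
    // HMAC-SHA256 over "<timestamp>.<raw body>", hex-encoded.
    static String sign(byte[] payload, long timestampSeconds, String secret) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            mac.update((timestampSeconds + ".").getBytes(StandardCharsets.UTF_8));
            byte[] digest = mac.doFinal(payload);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 unavailable", e);
        }
    }

    static boolean verify(byte[] payload, String header, String secret,
                          Duration tolerance, Instant now) {
        Long timestamp = null;
        List<String> candidates = new ArrayList<>();
        for (String part : header.split(",")) {
            String[] kv = part.trim().split("=", 2);
            if (kv.length != 2) continue;
            if (kv[0].equals("t")) {
                try {
                    timestamp = Long.parseLong(kv[1]);
                } catch (NumberFormatException e) {
                    return false;
                }
            } else if (kv[0].equals("v1")) {
                candidates.add(kv[1]);
            }
        }
        if (timestamp == null || candidates.isEmpty()) return false;
        // Reject stale or future-dated events to limit replay of captured requests.
        Duration age = Duration.between(Instant.ofEpochSecond(timestamp), now).abs();
        if (age.compareTo(tolerance) > 0) return false;
        byte[] expected = sign(payload, timestamp, secret).getBytes(StandardCharsets.UTF_8);
        for (String candidate : candidates) {
            // Constant-time comparison avoids leaking how many leading bytes matched.
            if (MessageDigest.isEqual(expected, candidate.getBytes(StandardCharsets.UTF_8))) {
                return true;
            }
        }
        return false;
    }
}
```

In practice the PSP's official SDK usually ships a verified implementation of this check, which is preferable to a hand-rolled one.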
// Runs daily at 02:15 UTC and reconciles the previous UTC day.
@Scheduled(cron = "0 15 2 * * *", zone = "UTC")
public void reconcile() {
    LocalDate day = LocalDate.now(ZoneOffset.UTC).minusDays(1);
    try (var psp = pspClient.settlementReport(day);
         var ours = ledgerRepo.streamCapturedOn(day)) {
        // Index the PSP report by charge id. toMap throws on a duplicate id, so a report
        // with repeated rows fails loudly instead of being silently merged.
        Map<String, Money> pspByRef = psp.stream()
            .collect(Collectors.toMap(r -> r.externalChargeId, r -> r.amount));
        List<ReconDiff> diffs = new ArrayList<>();
        ours.forEach(row -> {
            // remove() so that whatever is left afterwards exists only at the PSP
            Money pspAmt = pspByRef.remove(row.externalChargeId());
            if (pspAmt == null) diffs.add(ReconDiff.missingAtPsp(row));
            else if (!pspAmt.equals(row.amount())) diffs.add(ReconDiff.amountMismatch(row, pspAmt));
        });
        pspByRef.forEach((ref, amt) -> diffs.add(ReconDiff.missingInLedger(ref, amt)));
        if (!diffs.isEmpty()) reconQueue.fileForReview(day, diffs);
        metrics.gauge("recon.diffs", diffs.size());
    }
}
Key design decisions & trade-offs
- Client-supplied idempotency keys vs server-generated dedupe — Chosen: Client generates a UUIDv4 per checkout attempt; server claims it in Redis with an atomic set-if-absent (SET NX with a TTL). Only the client knows which retries belong to the same logical intent. Server-side fuzzy dedupe on (amount, card, time window) produces false positives for legitimate repeat purchases (e.g., two coffees in a row). The tradeoff is that buggy clients can defeat dedupe by rotating keys; we mitigate with a server-side alert on duplicate (merchant, card, amount) within 60s.
- Strong consistency vs availability on the write path — Chosen: Strong consistency, with a single Postgres primary per region and synchronous replicas. Money requires a linearizable ledger; CP beats AP here. During a primary failover the charge API returns 503 until a replica is promoted (up to roughly 30s) rather than accept writes that could diverge. We accept the availability hit because a 30s outage is recoverable, whereas a split-brain ledger is not.
- Synchronous PSP call vs async queue-based capture — Chosen: Synchronous with tight timeouts (30s hard cap), retries only on idempotent operations. Customers expect an immediate pass/fail at checkout, so async-only breaks UX. Async does work for later operations (refunds, payouts) where users don't block on a response. The cost of sync is that PSP latency is on our critical path; we mitigate with per-PSP circuit breakers and a fallback PSP for merchants that need the extra resilience.
- Single ledger table vs per-tenant sharding — Chosen: One logical ledger, time-partitioned by charge_date; shard only if a single tenant dominates volume. At 1M charges/day the entire 7-year ledger fits in ~1.1TB — one well-tuned Postgres cluster handles it. Sharding adds cross-shard txn complexity (money movements between tenants) we don't need yet. The tradeoff is that hot merchants on Black Friday can create contention; partition pruning plus read replicas cover it until a tenant needs its own shard.
- Webhook processing: inline handler vs Kafka-buffered consumer — Chosen: Inline handler writes to a webhook_events table keyed by event_id, then processes. PSPs deliver webhooks at-least-once, so we need an event_id dedupe store anyway. Putting Kafka in front of the webhook handler is pure complexity unless we need consumer-side fanout (notification, analytics, fraud). We punt the Kafka-buffered webhook consumer to phase 2 when more consumers show up.
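The server-side alert mentioned in the first decision (duplicate merchant, card, and amount within 60s) can be sketched as a small in-memory detector. This is an illustration under stated assumptions: the class and parameter names are hypothetical, the card is identified by an opaque fingerprint rather than a PAN, entries are never evicted, and a hit only raises an alert because two identical purchases in a row can be legitimate.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// Hypothetical detector for clients that rotate idempotency keys on retry.
final class DuplicateChargeDetector {
    private record Seen(String idempotencyKey, Instant at) {}

    private final Duration window;
    private final Map<String, Seen> lastSeen = new HashMap<>();

    DuplicateChargeDetector(Duration window) {
        this.window = window;
    }

    // Returns true when the same (merchant, card, amount) tuple was last seen inside
    // the window under a DIFFERENT idempotency key. Same-key retries are expected
    // and handled by the idempotency store, so they are not flagged.
    synchronized boolean isSuspicious(String merchantId, String cardFingerprint,
                                      long amountMinor, String idempotencyKey, Instant now) {
        String tuple = merchantId + "|" + cardFingerprint + "|" + amountMinor;
        Seen previous = lastSeen.put(tuple, new Seen(idempotencyKey, now));
        return previous != null
                && !previous.idempotencyKey().equals(idempotencyKey)
                && Duration.between(previous.at(), now).compareTo(window) <= 0;
    }
}
```

A production version would need eviction (or a Redis key with a 60s TTL per tuple) so the map does not grow without bound.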
Interview follow-ups
- Multi-PSP routing: fall over to a secondary PSP when the primary breaches a latency or error budget
- 3DS / SCA flow for European PSD2 compliance — interactive authentication before capture
- Push ledger events to a data warehouse (Snowflake, BigQuery) via CDC for finance reporting
- Chargeback and dispute workflow with evidence upload to the PSP
- Currency conversion and multi-currency ledger accounts with FX-rate lock at charge time
- Fraud scoring pre-authorization (Sift, Forter) with a decline-or-review decision on high-risk charges
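As a starting point for the multi-PSP routing follow-up, the sketch below routes to a secondary PSP once the primary's error rate over a rolling window of recent calls exceeds a budget. Everything here is hypothetical: the class name, the count-based window, and the string labels. It also covers only the error budget, not the latency budget, and it has no probing logic, so something else must keep recording primary outcomes for traffic to move back.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical router: rolling error rate over the last windowSize primary calls.
final class PspRouter {
    private final int windowSize;
    private final double maxErrorRate;
    private final Deque<Boolean> recentPrimaryOutcomes = new ArrayDeque<>();
    private int failures = 0;

    PspRouter(int windowSize, double maxErrorRate) {
        this.windowSize = windowSize;
        this.maxErrorRate = maxErrorRate;
    }

    synchronized void recordPrimaryOutcome(boolean success) {
        recentPrimaryOutcomes.addLast(success);
        if (!success) failures++;
        // Evict the oldest outcome once the window is over capacity.
        if (recentPrimaryOutcomes.size() > windowSize && !recentPrimaryOutcomes.removeFirst()) {
            failures--;
        }
    }

    // Stays on the primary until a full window of evidence shows the budget is breached.
    synchronized String choose() {
        boolean windowFull = recentPrimaryOutcomes.size() >= windowSize;
        double errorRate = recentPrimaryOutcomes.isEmpty()
                ? 0.0
                : (double) failures / recentPrimaryOutcomes.size();
        return (windowFull && errorRate > maxErrorRate) ? "secondary" : "primary";
    }
}
```

Failing over a charge is only safe when the primary call is known not to have captured; an ambiguous timeout should be resolved first, otherwise the customer may be charged at both PSPs.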