S3-like Object Storage System Design Interview Question
Problem: Design a distributed, highly durable object storage service like Amazon S3 supporting PUT / GET / DELETE of arbitrary-size objects with versioning.
Overview
S3-like object storage is the bedrock that most other internet-scale systems sit on, which is exactly why the interview asks you to design it. The surface area is small (PUT, GET, DELETE, LIST) but the internals are unforgiving: trillions of objects, exabytes on disk, 11 nines of durability, and a QPS profile that dwarfs almost any other system. The winning split separates metadata from data. A stateless REST front door speaks the S3 protocol and authenticates requests. A strongly-consistent metadata service maps bucket + key to the placement of data shards. A horizontally-scaled fleet of dumb data nodes stores the actual bytes, protected by erasure coding rather than replication to hit 11 nines at 1.5x overhead. Large objects fan out through multipart upload, deletes leave tombstones, and versioning is a first-class concept baked into the metadata schema. This piece walks each plane, the multipart flow, and the tradeoffs that pay for that durability SLA.
Summary
A write-once, read-many blob system split into three planes: (1) a stateless REST API that speaks the S3 protocol, (2) a metadata service that maps object keys to the placement of their data chunks, and (3) a fleet of data nodes that store the actual bytes. The dominant design choice is separating metadata from data: the metadata service is a small, strongly-consistent index ("which chunks live on which data nodes"), while data nodes are dumb, horizontally-scaled storage units protected by erasure coding (Reed-Solomon 6+3) for 11 nines of durability at ~1.5x overhead rather than 3x full replication. The main tradeoff is EC's higher reconstruction cost on read tail and on single-node failure vs. 3x replication's simpler recovery. Versioning and deletes are tombstoned; a background garbage collector compacts freed space. Sized for Amazon-scale: trillions of objects, exabytes total, ~1M GETs/sec and ~100K PUTs/sec.
Requirements
Functional
- PUT, GET, DELETE, HEAD, and LIST objects within a bucket
- Multipart upload for objects larger than 100 MB
- Object versioning with the ability to GET a prior version
- Bucket- and object-level ACLs, signed URLs, and SigV4 auth
- Consistent listing with pagination and prefix filters
- Server-side encryption at rest and optional client-side keys
- Lifecycle rules: TTL expiry, tiering to cold storage
Non-functional
- 11 nines durability per object
- 99.99% availability for GET, 99.9% for PUT
- Scale to 10 trillion objects, 3 EB stored, ~1M GET QPS
- P99 GET latency under 100 ms for small objects
- Erasure-coded storage overhead under 1.6x
- Graceful degradation: single-AZ loss does not break reads
Capacity Assumptions
- 10 trillion objects stored lifetime, average object 200 KB (bi-modal: many small + few huge)
- Durability target: 11 nines (99.999999999%) — one object lost per 10B per year
- Availability target: 99.99% for GET, 99.9% for PUT (PUT tolerates slightly more retry)
- Read:write ratio ~10:1
- Erasure coding scheme: Reed-Solomon 6+3 (6 data + 3 parity shards, tolerates 3 simultaneous shard losses)
- Multipart upload threshold: objects > 100 MB split into 5–100 MB parts
Back-of-Envelope Estimates
- PUT QPS: ~100K sustained, peak ~300K during backup windows
- GET QPS: ~1M sustained, peak ~3M (analytics / ML training reads)
- Bandwidth: 1M * 200KB ≈ 200 GB/s read egress from data-node fleet
- Storage: 10T * 200KB * 1.5 (EC overhead) ≈ 3 EB raw, distributed across ~300K data nodes at 10 TB each
- Metadata rows: 10T objects * ~1 KB each = 10 PB of metadata — sharded across ~10K metadata nodes
- Multipart completes: 500K/day (assumes 5% of PUTs are large multipart)
High-level architecture
The REST tier is stateless: nodes validate SigV4, authorize against IAM, and route to the right backend. A PUT flows as follows. The API asks the metadata service to allocate an object_id and pick a placement group (a set of data nodes across failure domains). The bytes stream through the API directly to the data nodes; the API itself never spools large bodies to disk. The data path splits the object into fixed chunks (4 MB is common) and, on each chunk, runs a Reed-Solomon 6+3 encoder to produce nine shards that scatter across nine data nodes in at least three failure domains. Once all chunks are durable, the API calls the metadata service to commit: object_id, version_id, list of chunks, shard-to-node map, content MD5, and size. Commit is the atomic point of visibility for a GET. For GET, the API resolves the key to a version, fetches the chunk manifest, then streams chunks back while assembling them client-side-style inside the API node. Any six of nine shards are enough; the tail of the slowest three is trimmed. DELETE writes a tombstone version; a background garbage collector follows lifecycle rules to free shards once no live manifest references them. Multipart upload uses an upload_id namespace inside the metadata service to hold part ETags until CompleteMultipartUpload stitches them into a final manifest. Versioning is just the natural consequence of never mutating an object row in place: every PUT under the same key creates a new version_id and pushes prior versions down the history chain.
Architecture Components (9)
- Client (SDK / CLI / Browser) (client) — AWS SDK, s3cmd, or browser-based uploader speaking the S3 REST protocol over HTTPS.
- Load Balancer (lb) — L7 HTTPS load balancer fronting the S3 API fleet, with anycast DNS for regional affinity.
- S3 REST API (api) — Stateless service that speaks the S3 wire protocol: bucket ops, object PUT/GET/DELETE, multipart coordination, ACL.
- Auth Service (auth) — Validates SigV4 signatures, enforces bucket policies and IAM ACLs.
- Metadata Service (Index) (sql) — Strongly-consistent index mapping (bucket, key, version) → chunk placements + object attributes.
- Data Node (blob) — Storage server holding on-disk shards; exposes chunk-level PUT/GET via an internal RPC.
- Replication Coordinator (coordinator) — Drives the erasure-coded write: encodes chunks into 6+3 shards and fans them out to data nodes.
- Versioning Log (stream) — Append-only log of every metadata mutation (PUT version, DELETE tombstone, LifecycleExpire).
- Garbage Collector (worker) — Background job that reclaims storage from tombstoned / expired / multipart-abandoned chunks.
Operations Walked Through (3)
- put — Client uploads a 500 MB object via multipart. Each part is EC-encoded into 6+3 shards and fanned out to data nodes across 3 AZs; metadata is committed only after write quorum.
- get — Client downloads a 10 MB object. API resolves placement via metadata, fans out 6 parallel shard reads, reconstructs if any shard times out, streams bytes.
- delete — User deletes an object. API writes a tombstone row in metadata — the delete is visible immediately, but the underlying chunks are reclaimed later by the garbage collector.
Implementation
@RestController
@RequestMapping("/{bucket}/{key:.+}")
public class ObjectController {
private final ObjectService objects;
private final AuthService auth;
public ObjectController(ObjectService o, AuthService a) {
this.objects = o;
this.auth = a;
}
@PutMapping(consumes = MediaType.APPLICATION_OCTET_STREAM_VALUE)
public ResponseEntity<Void> put(
@PathVariable String bucket,
@PathVariable String key,
@RequestHeader("Content-Length") long length,
@RequestHeader(value = "Content-MD5", required = false) String md5,
@RequestHeader("Authorization") String sig,
InputStream body) throws IOException {
Principal p = auth.verify(sig, "PUT", bucket, key);
PutResult r = objects.put(p, bucket, key, body, length, md5);
return ResponseEntity.ok()
.eTag('"' + r.getEtag() + '"')
.header("x-amz-version-id", r.getVersionId())
.build();
}
@GetMapping
public ResponseEntity<StreamingResponseBody> get(
@PathVariable String bucket,
@PathVariable String key,
@RequestParam(value = "versionId", required = false) String versionId,
@RequestHeader("Authorization") String sig) {
Principal p = auth.verify(sig, "GET", bucket, key);
ObjectHandle h = objects.open(p, bucket, key, Optional.ofNullable(versionId));
StreamingResponseBody stream = out -> objects.stream(h, out);
return ResponseEntity.ok()
.contentLength(h.getLength())
.eTag('"' + h.getEtag() + '"')
.header("x-amz-version-id", h.getVersionId())
.body(stream);
}
@DeleteMapping
public ResponseEntity<Void> delete(
@PathVariable String bucket,
@PathVariable String key,
@RequestHeader("Authorization") String sig) {
Principal p = auth.verify(sig, "DELETE", bucket, key);
String tombstoneVersion = objects.delete(p, bucket, key);
return ResponseEntity.noContent()
.header("x-amz-version-id", tombstoneVersion)
.header("x-amz-delete-marker", "true")
.build();
}
}
public final class ChunkingWriter {
public static final int CHUNK_SIZE = 4 * 1024 * 1024;
private final DataNodeClient dataNodes;
private final ErasureCoder coder; // Reed-Solomon 6+3
public ChunkingWriter(DataNodeClient dn, ErasureCoder ec) {
this.dataNodes = dn;
this.coder = ec;
}
public List<ChunkManifest> write(String objectId, InputStream in, long totalBytes) throws IOException {
List<ChunkManifest> manifests = new ArrayList<>();
byte[] buf = new byte[CHUNK_SIZE];
long remaining = totalBytes;
int index = 0;
while (remaining > 0) {
int toRead = (int) Math.min(CHUNK_SIZE, remaining);
int read = readFully(in, buf, toRead);
if (read <= 0) throw new EOFException("short object body");
byte[][] shards = coder.encode(buf, read); // 9 shards (6 data + 3 parity)
List<PlacedShard> placed = dataNodes.scatter(objectId, index, shards);
String chunkHash = sha256(buf, read);
manifests.add(new ChunkManifest(index, read, chunkHash, placed));
remaining -= read;
index++;
}
return manifests;
}
private static int readFully(InputStream in, byte[] buf, int want) throws IOException {
int total = 0;
while (total < want) {
int r = in.read(buf, total, want - total);
if (r < 0) break;
total += r;
}
return total;
}
private static String sha256(byte[] buf, int len) {
try {
MessageDigest d = MessageDigest.getInstance("SHA-256");
d.update(buf, 0, len);
return HexFormat.of().formatHex(d.digest());
} catch (NoSuchAlgorithmException e) { throw new IllegalStateException(e); }
}
}
public class ObjectVersion {
public enum Kind { LIVE, DELETE_MARKER }
private final String bucket;
private final String key;
private final String versionId; // monotonic ULID
private final String objectId; // stable id of the blob
private final Kind kind;
private final long sizeBytes;
private final String etag; // content MD5
private final List<ChunkManifest> chunks;
private final String storageClass; // STANDARD, IA, COLD
private final Instant createdAt;
private final String createdBy;
private final Optional<Instant> expiresAt;
public ObjectVersion(String bucket, String key, String versionId, String objectId,
Kind kind, long size, String etag, List<ChunkManifest> chunks,
String storageClass, String createdBy, Optional<Instant> expiresAt) {
this.bucket = bucket;
this.key = key;
this.versionId = versionId;
this.objectId = objectId;
this.kind = kind;
this.sizeBytes = size;
this.etag = etag;
this.chunks = List.copyOf(chunks);
this.storageClass = storageClass;
this.createdAt = Instant.now();
this.createdBy = createdBy;
this.expiresAt = expiresAt;
}
public boolean isDeleteMarker() { return kind == Kind.DELETE_MARKER; }
public boolean isExpired(Instant now) { return expiresAt.map(e -> now.isAfter(e)).orElse(false); }
public String getBucket() { return bucket; }
public String getKey() { return key; }
public String getVersionId() { return versionId; }
public List<ChunkManifest> getChunks() { return chunks; }
public String getEtag() { return etag; }
public long getSizeBytes() { return sizeBytes; }
}
Key design decisions & trade-offs
- Durability mechanism — Chosen: Reed-Solomon 6+3 erasure coding. Gets 11 nines at ~1.5x storage vs. 3x for full replication. Costs more CPU on the read path and makes single-shard recovery slower than a simple copy.
- Metadata vs. data separation — Chosen: Strongly-consistent metadata service, dumb bulk-storage data nodes. Lets each plane scale on its own axis: metadata is CPU and index-bound, data nodes are disk-bound. Also keeps the consistency surface tiny and well-defined.
- Delete semantics — Chosen: Tombstoned versions with async GC. Returns instantly, is idempotent under retries, and plays well with versioning. Costs disk space until the GC sweeps, which is acceptable at EB scale.
- Large-object upload — Chosen: Multipart with client-chosen part size (5-100 MB). Parallelism saturates the client uplink and lets single-part failures retry cheaply. Adds an explicit Complete step, which the SDK hides.
- Listing consistency — Chosen: Strongly consistent, paginated with continuation tokens. List-after-write surprises are the usual cause of production bugs. Pagination keeps memory bounded on huge prefixes.
- Storage tiering — Chosen: Lifecycle rules to IA / cold tiers. Most objects are written once and read rarely after 30 days. Tiering cuts cost 3-10x, at the price of multi-second retrieval for cold reads.
Interview follow-ups
- How would you design cross-region replication with bucket-level RPO targets?
- How do you run the garbage collector without accidentally deleting live data under concurrent writes?
- How would you add strong-consistency read-after-write for overwritten keys?
- How would you implement S3 Select / server-side query pushdown?
- How do you shard the metadata service without hotspots on a popular bucket?