← System Design Simulator

S3-like Object Storage System Design Interview Question

By Rahul Kumar · Senior Software Engineer · Updated · 9 components · 3 operations ·Source: Alex Xu, System Design Interview Vol 2, Chapter 9

Problem: Design a distributed, highly durable object storage service like Amazon S3 supporting PUT / GET / DELETE of arbitrary-size objects with versioning.

Overview

S3-like object storage is the bedrock that most other internet-scale systems sit on, which is exactly why the interview asks you to design it. The surface area is small (PUT, GET, DELETE, LIST) but the internals are unforgiving: trillions of objects, exabytes on disk, 11 nines of durability, and a QPS profile that dwarfs almost any other system. The winning split separates metadata from data. A stateless REST front door speaks the S3 protocol and authenticates requests. A strongly-consistent metadata service maps bucket + key to the placement of data shards. A horizontally-scaled fleet of dumb data nodes stores the actual bytes, protected by erasure coding rather than replication to hit 11 nines at 1.5x overhead. Large objects fan out through multipart upload, deletes leave tombstones, and versioning is a first-class concept baked into the metadata schema. This piece walks each plane, the multipart flow, and the tradeoffs that pay for that durability SLA.

S3-like Object Storage — Interactive Simulator

Runs fully client-side in your browser; no sign-up. Or open full screen →

Launch the interactive walkthrough for S3-like Object Storage — animated architecture diagram, step-by-step flow with real payloads, component swap, and a discrete-event stress simulator.

Summary

A write-once, read-many blob system split into three planes: (1) a stateless REST API that speaks the S3 protocol, (2) a metadata service that maps object keys to the placement of their data chunks, and (3) a fleet of data nodes that store the actual bytes. The dominant design choice is separating metadata from data: the metadata service is a small, strongly-consistent index ("which chunks live on which data nodes"), while data nodes are dumb, horizontally-scaled storage units protected by erasure coding (Reed-Solomon 6+3) for 11 nines of durability at ~1.5x overhead rather than 3x full replication. The main tradeoff is EC's higher reconstruction cost on read tail and on single-node failure vs. 3x replication's simpler recovery. Versioning and deletes are tombstoned; a background garbage collector compacts freed space. Sized for Amazon-scale: trillions of objects, exabytes total, ~1M GETs/sec and ~100K PUTs/sec.

Requirements

Functional

Non-functional

Capacity Assumptions

Back-of-Envelope Estimates

High-level architecture

The REST tier is stateless: nodes validate SigV4, authorize against IAM, and route to the right backend. A PUT flows as follows. The API asks the metadata service to allocate an object_id and pick a placement group (a set of data nodes across failure domains). The bytes stream through the API directly to the data nodes; the API itself never spools large bodies to disk. The data path splits the object into fixed chunks (4 MB is common) and, on each chunk, runs a Reed-Solomon 6+3 encoder to produce nine shards that scatter across nine data nodes in at least three failure domains. Once all chunks are durable, the API calls the metadata service to commit: object_id, version_id, list of chunks, shard-to-node map, content MD5, and size. Commit is the atomic point of visibility for a GET. For GET, the API resolves the key to a version, fetches the chunk manifest, then streams chunks back while assembling them client-side-style inside the API node. Any six of nine shards are enough; the tail of the slowest three is trimmed. DELETE writes a tombstone version; a background garbage collector follows lifecycle rules to free shards once no live manifest references them. Multipart upload uses an upload_id namespace inside the metadata service to hold part ETags until CompleteMultipartUpload stitches them into a final manifest. Versioning is just the natural consequence of never mutating an object row in place: every PUT under the same key creates a new version_id and pushes prior versions down the history chain.

Architecture Components (9)

Operations Walked Through (3)

Implementation

REST endpoints: PutObject, GetObject, DeleteObject
@RestController
@RequestMapping("/{bucket}/{key:.+}")
public class ObjectController {
  private final ObjectService objects;
  private final AuthService auth;

  public ObjectController(ObjectService o, AuthService a) {
    this.objects = o;
    this.auth = a;
  }

  @PutMapping(consumes = MediaType.APPLICATION_OCTET_STREAM_VALUE)
  public ResponseEntity<Void> put(
      @PathVariable String bucket,
      @PathVariable String key,
      @RequestHeader("Content-Length") long length,
      @RequestHeader(value = "Content-MD5", required = false) String md5,
      @RequestHeader("Authorization") String sig,
      InputStream body) throws IOException {
    Principal p = auth.verify(sig, "PUT", bucket, key);
    PutResult r = objects.put(p, bucket, key, body, length, md5);
    return ResponseEntity.ok()
        .eTag('"' + r.getEtag() + '"')
        .header("x-amz-version-id", r.getVersionId())
        .build();
  }

  @GetMapping
  public ResponseEntity<StreamingResponseBody> get(
      @PathVariable String bucket,
      @PathVariable String key,
      @RequestParam(value = "versionId", required = false) String versionId,
      @RequestHeader("Authorization") String sig) {
    Principal p = auth.verify(sig, "GET", bucket, key);
    ObjectHandle h = objects.open(p, bucket, key, Optional.ofNullable(versionId));
    StreamingResponseBody stream = out -> objects.stream(h, out);
    return ResponseEntity.ok()
        .contentLength(h.getLength())
        .eTag('"' + h.getEtag() + '"')
        .header("x-amz-version-id", h.getVersionId())
        .body(stream);
  }

  @DeleteMapping
  public ResponseEntity<Void> delete(
      @PathVariable String bucket,
      @PathVariable String key,
      @RequestHeader("Authorization") String sig) {
    Principal p = auth.verify(sig, "DELETE", bucket, key);
    String tombstoneVersion = objects.delete(p, bucket, key);
    return ResponseEntity.noContent()
        .header("x-amz-version-id", tombstoneVersion)
        .header("x-amz-delete-marker", "true")
        .build();
  }
}
Chunking strategy: split objects into 4 MB blocks
public final class ChunkingWriter {
  public static final int CHUNK_SIZE = 4 * 1024 * 1024;

  private final DataNodeClient dataNodes;
  private final ErasureCoder coder; // Reed-Solomon 6+3

  public ChunkingWriter(DataNodeClient dn, ErasureCoder ec) {
    this.dataNodes = dn;
    this.coder = ec;
  }

  public List<ChunkManifest> write(String objectId, InputStream in, long totalBytes) throws IOException {
    List<ChunkManifest> manifests = new ArrayList<>();
    byte[] buf = new byte[CHUNK_SIZE];
    long remaining = totalBytes;
    int index = 0;
    while (remaining > 0) {
      int toRead = (int) Math.min(CHUNK_SIZE, remaining);
      int read = readFully(in, buf, toRead);
      if (read <= 0) throw new EOFException("short object body");
      byte[][] shards = coder.encode(buf, read); // 9 shards (6 data + 3 parity)
      List<PlacedShard> placed = dataNodes.scatter(objectId, index, shards);
      String chunkHash = sha256(buf, read);
      manifests.add(new ChunkManifest(index, read, chunkHash, placed));
      remaining -= read;
      index++;
    }
    return manifests;
  }

  private static int readFully(InputStream in, byte[] buf, int want) throws IOException {
    int total = 0;
    while (total < want) {
      int r = in.read(buf, total, want - total);
      if (r < 0) break;
      total += r;
    }
    return total;
  }

  private static String sha256(byte[] buf, int len) {
    try {
      MessageDigest d = MessageDigest.getInstance("SHA-256");
      d.update(buf, 0, len);
      return HexFormat.of().formatHex(d.digest());
    } catch (NoSuchAlgorithmException e) { throw new IllegalStateException(e); }
  }
}
Versioning model
public class ObjectVersion {
  public enum Kind { LIVE, DELETE_MARKER }

  private final String bucket;
  private final String key;
  private final String versionId;          // monotonic ULID
  private final String objectId;           // stable id of the blob
  private final Kind kind;
  private final long sizeBytes;
  private final String etag;               // content MD5
  private final List<ChunkManifest> chunks;
  private final String storageClass;       // STANDARD, IA, COLD
  private final Instant createdAt;
  private final String createdBy;
  private final Optional<Instant> expiresAt;

  public ObjectVersion(String bucket, String key, String versionId, String objectId,
                       Kind kind, long size, String etag, List<ChunkManifest> chunks,
                       String storageClass, String createdBy, Optional<Instant> expiresAt) {
    this.bucket = bucket;
    this.key = key;
    this.versionId = versionId;
    this.objectId = objectId;
    this.kind = kind;
    this.sizeBytes = size;
    this.etag = etag;
    this.chunks = List.copyOf(chunks);
    this.storageClass = storageClass;
    this.createdAt = Instant.now();
    this.createdBy = createdBy;
    this.expiresAt = expiresAt;
  }

  public boolean isDeleteMarker() { return kind == Kind.DELETE_MARKER; }
  public boolean isExpired(Instant now) { return expiresAt.map(e -> now.isAfter(e)).orElse(false); }

  public String getBucket() { return bucket; }
  public String getKey() { return key; }
  public String getVersionId() { return versionId; }
  public List<ChunkManifest> getChunks() { return chunks; }
  public String getEtag() { return etag; }
  public long getSizeBytes() { return sizeBytes; }
}

Key design decisions & trade-offs

Interview follow-ups

Related