Video Streaming (YouTube-style) System Design Interview Question
Problem: Design a video upload, transcoding, and streaming platform like YouTube.
Overview
A YouTube-style video platform is two workloads glued together: a trickle of uploads (around 1% of traffic) that must survive multi-gigabyte files on flaky home Wi-Fi, and a firehose of playback (the remaining 99%) that must hit sub-second startup latency worldwide. The interview answer is to decouple them aggressively. Uploads land in raw object storage through resumable, chunked PUTs close to the user, then a DAG of transcoding workers fans the source into an adaptive-bitrate ladder (240p through 4K), HLS segments, and thumbnails. Playback never touches the application tier on the hot path; master playlists and six-second segments are served from a CDN that absorbs 95%+ of global QPS. The architecture walkthrough then shows how the completion queue, metadata service, and recommendation engine stitch the two flows together.
Summary
A massively read-skewed system (~99% reads, ~1% uploads) split into two flows: a video-uploading flow (original storage → transcoding servers → transcoded storage → CDN, with a completion queue and handler that update metadata once encoding finishes) and a streaming flow that serves adaptive-bitrate manifests and segments from the CDN with fallback to the transcoded origin. The dominant design choice is to push nearly all playback traffic to the CDN edge, so the origin sees under 5% of global playback QPS, while transcoding runs asynchronously as a DAG of tasks (inspection → video encoding → audio encoding → thumbnails → watermark → assembler) so uploads never block on the 10+ minute encode. The main tradeoffs are storage blow-up (each source becomes one adaptive-bitrate ladder of 6–8 renditions, ~3–5x raw storage) and upload latency, which the book attacks with GOP-level chunk parallelism and upload points geographically near users.
Requirements
Functional
- Upload source videos up to multi-GB with resumable, chunked PUTs
- Transcode each source into an ABR ladder (240p, 360p, 480p, 720p, 1080p, 4K) plus HLS/DASH segments and thumbnails
- Serve adaptive-bitrate playback with sub-second startup and mid-stream ladder switching
- Metadata lookup by video ID: title, owner, privacy, length, ladder manifest URL
- Trending, recommended, and search surfaces powered by view logs
- Content moderation, takedown, and geo-restriction per video
- Live comment / like / watch-count counters with eventual consistency
Non-functional
- 99.95% playback availability; 99.9% upload availability
- P99 segment fetch under 100 ms from CDN edge globally
- Durability 11 nines for source and transcoded assets
- Scale to 2B DAU, ~115K segment QPS sustained, ~350K peak
- Transcoding pipeline elastic to thousands of concurrent encodes (500K uploads/day at ~10 min of compute each averages about 3,500 in flight)
- Cost efficiency: origin egress under 5% of total playback bandwidth
Capacity Assumptions
- 2B DAU, 5 videos watched per user per day → 10B views/day
- 500K uploads/day, average video 100 MB source
- Transcoded to one adaptive-bitrate ladder of 6 renditions (240p, 360p, 480p, 720p, 1080p, 4K) + HLS 6s segments
- Video retained forever (no expiry); metadata updates on transcode completion via completion queue
- CDN hit ratio target: 95%+, carried by the popular head of the catalog; origin absorbs <5% of playback QPS, mostly long-tail and newly published videos
Back-of-Envelope Estimates
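- Playback QPS: 10B views / 86400 ≈ 115K requests/sec when each view is counted as one request (peak ~350K); true segment QPS is higher by the number of segments fetched per view
- Origin load: 5% of the ~350K peak ≈ 17.5K requests/sec; the CDN absorbs the rest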
- Upload bandwidth: 500K * 100MB / 86400 ≈ 580 MB/s average (peak is higher)
- Storage: 500K * 100MB * 4x (ladders + HLS overhead) / day ≈ 200 TB/day, 73 PB/year
- Transcoding: 500K uploads * avg 10 min (600 s) compute / 86400 s ≈ 3,500 concurrent encodes baseline; size the fleet for a multiple of that at peak
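The arithmetic behind these estimates can be checked with a few lines of Java. This is a rough sketch rather than part of the design: the class and method names are made up for illustration, and the inputs are the capacity assumptions listed earlier (2B DAU, 5 views per user, 500K uploads of 100 MB, a 4x storage multiplier, 10 minutes of compute per upload). Concurrency is derived with Little's law (arrival rate times time in system).

```java
/** Back-of-envelope arithmetic for the capacity assumptions (illustrative names only). */
final class CapacityEstimate {
    static final long SECONDS_PER_DAY = 86_400L;

    /** Average events per second for a given daily total. */
    static double perSecond(double perDay) {
        return perDay / SECONDS_PER_DAY;
    }

    /** Little's law: average jobs in flight = arrival rate * seconds each job takes. */
    static double concurrentEncodes(double uploadsPerDay, double encodeSeconds) {
        return perSecond(uploadsPerDay) * encodeSeconds;
    }

    public static void main(String[] args) {
        double viewsPerDay = 2e9 * 5;                 // 2B DAU * 5 views each
        double uploadBytesPerDay = 500_000 * 100e6;   // 500K uploads * 100 MB
        double storedBytesPerDay = uploadBytesPerDay * 4; // 4x for renditions + HLS overhead

        System.out.printf("views/sec         = %.0f%n", perSecond(viewsPerDay));
        System.out.printf("upload MB/s       = %.0f%n", perSecond(uploadBytesPerDay) / 1e6);
        System.out.printf("storage TB/day    = %.0f%n", storedBytesPerDay / 1e12);
        System.out.printf("storage PB/year   = %.0f%n", storedBytesPerDay * 365 / 1e15);
        System.out.printf("encodes in flight = %.0f%n", concurrentEncodes(500_000, 600));
    }
}
```

Under these inputs the program reports about 115,741 views/sec, 579 MB/s of upload, 200 TB/day and 73 PB/year of storage, and roughly 3,472 encodes in flight (500K uploads at 600 s each is 300M compute-seconds spread over an 86,400-second day).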
High-level architecture
The upload path begins at a regional upload PoP so the first TCP hop is short. The client initiates a resumable upload, receives a signed URL, and PUTs 5 MB GOP-aligned chunks in parallel. Chunks stream into the original-source bucket; on the final chunk the upload service enqueues a TranscodingJob. Workers pull the DAG (inspect, video encode, audio encode, thumbnails, watermark, assembler), emit the rendition outputs to the transcoded bucket, and write an HLS master playlist. A completion queue notifies the metadata service and the CDN pre-warm job so the first viewer does not pay a cold-cache tax.
The playback path is CDN-first: the client resolves to an edge PoP, fetches master.m3u8, picks a rendition from the ladder based on bandwidth probes, then pulls six-second .ts or .m4s segments that are cached aggressively. Origin pull happens only on a cache miss.
Metadata (title, ACL, manifest URL) is served by a sharded SQL tier behind a Redis cache with tight TTLs. A recommendation service and a view-count aggregator consume a Kafka stream of playback events.
Two tradeoffs dominate: storage blows up 3-5x from the ABR ladder, and transcoding latency is variable, which is why uploads return immediately and the UI shows a 'processing' state. The system assumes read skew and pushes nearly all bytes to the edge, keeping the application tier small, stateless, and cheap.
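The client-side step of choosing a rendition from bandwidth probes can be sketched as a simple throughput rule: take the highest rung whose declared bitrate fits within a safety fraction of the measured throughput. This is a minimal illustration, not how any particular player implements ABR; the class name, the bitrates in exampleLadder(), and the 0.8 safety factor are all assumptions made for the example, and real players usually also weigh buffer occupancy.

```java
import java.util.Arrays;
import java.util.List;

/** Throughput-based rendition selection (illustrative sketch). */
final class RenditionPicker {
    /** One rung of the ABR ladder. */
    static final class Rendition {
        final String name;
        final long bandwidthBps;

        Rendition(String name, long bandwidthBps) {
            this.name = name;
            this.bandwidthBps = bandwidthBps;
        }
    }

    /** Hypothetical ladder sorted by ascending bitrate; the numbers are placeholders. */
    static List<Rendition> exampleLadder() {
        return Arrays.asList(
                new Rendition("240p", 400_000L),
                new Rendition("360p", 800_000L),
                new Rendition("480p", 1_400_000L),
                new Rendition("720p", 2_800_000L),
                new Rendition("1080p", 5_000_000L),
                new Rendition("2160p", 16_000_000L));
    }

    /**
     * Returns the highest-bitrate rendition that fits within safetyFactor * measuredBps,
     * or the lowest rung when nothing fits.
     */
    static Rendition pick(List<Rendition> ladderAscending, long measuredBps, double safetyFactor) {
        if (ladderAscending.isEmpty()) {
            throw new IllegalArgumentException("ladder must not be empty");
        }
        long budgetBps = (long) (measuredBps * safetyFactor);
        Rendition best = ladderAscending.get(0); // fall back to the lowest rung
        for (Rendition r : ladderAscending) {
            if (r.bandwidthBps <= budgetBps) {
                best = r;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(pick(exampleLadder(), 4_000_000L, 0.8).name); // 720p
        System.out.println(pick(exampleLadder(), 100_000L, 0.8).name);   // 240p
    }
}
```

Re-running pick after each segment download, with a fresh throughput measurement, is what produces mid-stream switching between renditions.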
Architecture Components (12)
- Client (Web / Mobile / TV) (client) — HLS/DASH player that adapts bitrate based on network and buffer; also chunks uploads into GOP-aligned pieces for parallel transfer.
- CDN Edge (cdn) — Geographically distributed edge caches serving HLS/DASH segments close to the viewer; falls back to transcoded-storage origin on miss.
- Load Balancer (lb) — L7 LB for control-plane traffic (API calls, upload initiation), NOT for segment GETs.
- Video API (api) — Metadata CRUD, upload session coordination, signed-URL issuance to original storage, and playback manifest URL lookup.
- Metadata DB (sql) — Relational store for video metadata (title, owner, status, view count, ladder list, manifest URL).
- Metadata Cache (cache) — Redis cache fronting the metadata DB for watch-page reads and manifest-URL lookups.
- Original Storage (blob) — S3-style bucket holding raw source uploads; the book's first stop for uploaded bytes before transcoding.
- Transcoding Servers (worker) — Fleet that pulls raw source from original storage and runs a DAG of transcoding tasks to produce every ladder, thumbnail, and captions set.
- Transcoded Storage (blob) — Authoritative home of all transcoded HLS/DASH segments and manifests; acts as the CDN's origin.
- Completion Queue (queue) — Durable queue of transcode-completion events produced by the transcoding DAG's assembler and drained by the completion handler.
- Completion Handler (worker) — Consumer of the completion queue that finalizes the upload: flips metadata to READY, stores the manifest/ladder list, and warms the cache.
- Recommendation Service (api) — Returns 'up next' list for watch pages.
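Because the completion handler drains a durable queue, it is safest to assume the same completion event can be delivered more than once. The sketch below shows one way to make the READY flip idempotent, using an in-memory map as a stand-in for the metadata DB; in a SQL store the equivalent would be a conditional update that only matches rows still in the processing state. The class, enum, and method names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Idempotent PROCESSING -> READY transition (in-memory stand-in for the metadata DB). */
final class CompletionHandlerSketch {
    enum Status { PROCESSING, READY }

    /** Immutable snapshot of the fields the handler touches. */
    static final class VideoMeta {
        final Status status;
        final String manifestUrl; // null until the video is READY

        VideoMeta(Status status, String manifestUrl) {
            this.status = status;
            this.manifestUrl = manifestUrl;
        }
    }

    private final Map<String, VideoMeta> metadata = new ConcurrentHashMap<>();

    /** Called when the upload finishes and transcoding is queued. */
    void registerUpload(String videoId) {
        metadata.put(videoId, new VideoMeta(Status.PROCESSING, null));
    }

    /**
     * Applies a completion event. Returns true only for the delivery that performs
     * the flip; duplicates and events for unknown videos return false and change nothing.
     */
    boolean onTranscodeCompleted(String videoId, String manifestUrl) {
        boolean[] flipped = {false};
        // computeIfPresent runs atomically per key, so concurrent duplicates cannot both win.
        metadata.computeIfPresent(videoId, (id, current) -> {
            if (current.status != Status.PROCESSING) {
                return current; // already READY: ignore the duplicate
            }
            flipped[0] = true;
            return new VideoMeta(Status.READY, manifestUrl);
        });
        return flipped[0];
    }

    Status statusOf(String videoId) {
        VideoMeta m = metadata.get(videoId);
        return m == null ? null : m.status;
    }
}
```

Only the delivery that returns true would go on to warm the metadata cache, so a redelivered event does not trigger a second warm-up.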
Operations Walked Through (5)
- play — Client pulls manifest + segments from CDN; origin never touched for popular content. This is the streaming flow's happy path.
- play-cold — First viewer in the region (long-tail or newly-promoted video): the edge misses, the origin shield (a mid-tier CDN cache in front of the origin) pulls from transcoded storage, and the segment is written into the edge cache for subsequent viewers.
- watch-page — Client fetches video metadata + recommendations. Metadata is Redis-cached after the completion handler warms it on READY transition; rec fans out in parallel.
- upload — Per the book: client POSTs metadata AND streams source chunks in parallel. API issues a signed URL, client PUTs chunks direct to original storage, origin-bucket event triggers transcoding pipeline asynchronously, completion handler eventually flips metadata to READY.
- transcode — Transcoding workers pull source from original storage, run the DAG (inspection → per-ladder video/audio encode + thumbnails → assembler → HLS manifests), write outputs to transcoded storage, publish to CDN, emit a completion event. Completion handler flips metadata to READY and warms cache.
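The transcode operation depends on running DAG tasks in dependency order while letting independent tasks overlap. A minimal sketch of that scheduling step, using Kahn's algorithm to group tasks into parallel waves, is shown here. The task names and the exact edges in defaultPipeline() are assumptions chosen to match the task list in the text; the source does not spell out which tasks depend on which.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Groups transcoding tasks into waves that can run in parallel (Kahn's algorithm). */
final class TranscodeDag {
    private final Map<String, List<String>> downstream = new LinkedHashMap<>();
    private final Map<String, Integer> inDegree = new LinkedHashMap<>();

    void addTask(String task) {
        downstream.putIfAbsent(task, new ArrayList<>());
        inDegree.putIfAbsent(task, 0);
    }

    /** Declares that 'after' cannot start until 'before' has finished. */
    void addDependency(String before, String after) {
        addTask(before);
        addTask(after);
        downstream.get(before).add(after);
        inDegree.merge(after, 1, Integer::sum);
    }

    /** Assumed edges: encodes and thumbnails wait on inspection; the assembler waits on everything. */
    static TranscodeDag defaultPipeline() {
        TranscodeDag dag = new TranscodeDag();
        dag.addDependency("inspect", "video-encode");
        dag.addDependency("inspect", "audio-encode");
        dag.addDependency("inspect", "thumbnails");
        dag.addDependency("video-encode", "watermark");
        dag.addDependency("watermark", "assemble");
        dag.addDependency("audio-encode", "assemble");
        dag.addDependency("thumbnails", "assemble");
        return dag;
    }

    /** Each inner list holds tasks whose dependencies are all satisfied by earlier waves. */
    List<List<String>> waves() {
        Map<String, Integer> remaining = new LinkedHashMap<>(inDegree);
        List<String> ready = new ArrayList<>();
        for (Map.Entry<String, Integer> e : remaining.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<List<String>> result = new ArrayList<>();
        int scheduled = 0;
        while (!ready.isEmpty()) {
            result.add(ready);
            scheduled += ready.size();
            List<String> next = new ArrayList<>();
            for (String task : ready) {
                for (String dependent : downstream.get(task)) {
                    // Decrement the unmet-dependency count; zero means the task is unblocked.
                    if (remaining.merge(dependent, -1, Integer::sum) == 0) {
                        next.add(dependent);
                    }
                }
            }
            ready = next;
        }
        if (scheduled != remaining.size()) {
            throw new IllegalStateException("dependency cycle detected");
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(defaultPipeline().waves());
    }
}
```

With the assumed edges, the waves come out as inspect, then the three parallel tasks (video-encode, audio-encode, thumbnails), then watermark, then assemble. A worker fleet would dispatch each wave's tasks concurrently and only emit the completion event after the final wave.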
Implementation
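import java.net.URI;
import java.time.Duration;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.security.core.annotation.AuthenticationPrincipal;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

// Control plane only: creates an upload session and returns a signed URL.
// UploadSessionService, SignedUrlFactory, and the request/response DTOs are application types not shown here.
@RestController
@RequestMapping("/v1/videos")
public class VideoUploadController {
    private static final long MAX_SOURCE_BYTES = 10L * 1024 * 1024 * 1024; // 10 GiB upper bound
    private static final int CHUNK_SIZE_BYTES = 5 * 1024 * 1024;           // 5 MiB chunks

    private final UploadSessionService sessions;
    private final SignedUrlFactory signer;

    public VideoUploadController(UploadSessionService s, SignedUrlFactory f) {
        this.sessions = s;
        this.signer = f;
    }

    @PostMapping("/uploads")
    public ResponseEntity<InitiateUploadResponse> initiate(
            @RequestBody InitiateUploadRequest req,
            @AuthenticationPrincipal UserPrincipal user) {
        // Reject empty or oversized sources before allocating a session.
        if (req.getSizeBytes() <= 0 || req.getSizeBytes() > MAX_SOURCE_BYTES) {
            return ResponseEntity.badRequest().build();
        }
        UploadSession session = sessions.create(
                user.getId(), req.getFilename(), req.getSizeBytes(), req.getContentType());
        // The signed URL stays valid for six hours so a slow upload can resume.
        URI resumableUrl = signer.signedPut(session.getObjectKey(), Duration.ofHours(6));
        InitiateUploadResponse body = new InitiateUploadResponse(
                session.getSessionId(),
                resumableUrl.toString(),
                CHUNK_SIZE_BYTES,
                session.getExpiresAt());
        return ResponseEntity
                .status(HttpStatus.CREATED)
                .header("Location", "/v1/videos/uploads/" + session.getSessionId())
                .body(body);
    }
}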
import java.time.Instant;
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;
import java.util.UUID;

// Mutable record of one transcode request as it moves QUEUED -> RUNNING -> COMPLETED or FAILED.
public class TranscodingJob {
    public enum State { QUEUED, RUNNING, COMPLETED, FAILED }
    public enum Ladder { P240, P360, P480, P720, P1080, P2160 } // one constant per rendition

    private final String jobId;
    private final String videoId;
    private final String sourceKey;
    private final Set<Ladder> targets;
    private final Instant enqueuedAt;
    private State state;
    private int attempt;
    private Instant startedAt;
    private Instant completedAt;
    private String failureReason;

    public TranscodingJob(String videoId, String sourceKey, Set<Ladder> targets) {
        this.jobId = UUID.randomUUID().toString();
        this.videoId = videoId;
        this.sourceKey = sourceKey;
        // EnumSet.copyOf throws on an empty non-EnumSet collection, so pass at least one target.
        this.targets = EnumSet.copyOf(targets);
        this.state = State.QUEUED;
        this.attempt = 0;
        this.enqueuedAt = Instant.now();
    }

    // Each call counts as a new attempt, so retries are visible in the attempt counter.
    public void markRunning() {
        this.state = State.RUNNING;
        this.startedAt = Instant.now();
        this.attempt++;
    }

    public void markCompleted() {
        this.state = State.COMPLETED;
        this.completedAt = Instant.now();
    }

    public void markFailed(String reason) {
        this.state = State.FAILED;
        this.failureReason = reason;
        this.completedAt = Instant.now();
    }

    public String getJobId() { return jobId; }
    public String getVideoId() { return videoId; }
    public State getState() { return state; }
    public Set<Ladder> getTargets() { return Collections.unmodifiableSet(targets); }
}
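import java.io.IOException;
import java.io.InputStream;
import java.util.EnumSet;
import java.util.List;
import org.springframework.http.MediaType;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;

// This endpoint relays part bytes through the API tier; clients that PUT to the signed URL bypass it.
@RestController
@RequestMapping("/v1/videos/uploads/{sessionId}")
public class ChunkUploadController {
    private final UploadSessionService sessions;
    private final ObjectStoreClient store;
    private final TranscodingQueue queue;

    public ChunkUploadController(UploadSessionService sessions, ObjectStoreClient store, TranscodingQueue queue) {
        this.sessions = sessions;
        this.store = store;
        this.queue = queue;
    }

    @PutMapping(value = "/parts/{partNumber}", consumes = MediaType.APPLICATION_OCTET_STREAM_VALUE)
    public ResponseEntity<PartResponse> uploadPart(
            @PathVariable String sessionId,
            @PathVariable int partNumber,
            @RequestHeader("Content-Range") String contentRange,
            @RequestHeader("X-Chunk-Sha256") String chunkHash,
            InputStream body) throws IOException {
        UploadSession session = sessions.require(sessionId);
        ByteRange range = ByteRange.parse(contentRange);
        String etag = store.putPart(session.getUploadId(), partNumber, body, range.length(), chunkHash);
        sessions.recordPart(sessionId, partNumber, etag, range);
        // Parts arrive in parallel, so the session service must make this check-and-mark atomic;
        // otherwise two requests could both see "complete" and enqueue duplicate jobs.
        if (sessions.isComplete(sessionId)) {
            List<PartRef> parts = sessions.listParts(sessionId);
            store.completeMultipart(session.getUploadId(), parts);
            sessions.markUploaded(sessionId);
            // Request every rendition in the ladder.
            queue.enqueue(new TranscodingJob(
                    session.getVideoId(), session.getObjectKey(), EnumSet.allOf(TranscodingJob.Ladder.class)));
            return ResponseEntity.ok(PartResponse.finalPart(etag));
        }
        return ResponseEntity.ok(PartResponse.intermediate(etag));
    }
}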
import java.util.List;
import java.util.Locale;

// Renders HLS playlists as text. RenditionOutput and HlsSegment are application types not shown here.
public final class HlsManifestBuilder {
    // Master playlist: one EXT-X-STREAM-INF entry per rendition, each followed by its media playlist path.
    public String buildMaster(List<RenditionOutput> renditions) {
        StringBuilder sb = new StringBuilder();
        sb.append("#EXTM3U\n");
        sb.append("#EXT-X-VERSION:7\n");
        sb.append("#EXT-X-INDEPENDENT-SEGMENTS\n");
        for (RenditionOutput r : renditions) {
            sb.append("#EXT-X-STREAM-INF:BANDWIDTH=").append(r.getBandwidthBps())
                    .append(",AVERAGE-BANDWIDTH=").append(r.getAvgBandwidthBps())
                    .append(",RESOLUTION=").append(r.getWidth()).append('x').append(r.getHeight())
                    .append(",CODECS=\"").append(r.getCodecs()).append("\"")
                    .append(",FRAME-RATE=").append(r.getFps())
                    .append('\n');
            sb.append(r.getPlaylistPath()).append('\n');
        }
        return sb.toString();
    }

    // Media playlist for a finished VOD asset: every segment listed, terminated by EXT-X-ENDLIST.
    public String buildMedia(List<HlsSegment> segments, int targetDurationSec) {
        StringBuilder sb = new StringBuilder();
        sb.append("#EXTM3U\n")
                .append("#EXT-X-VERSION:7\n")
                .append("#EXT-X-TARGETDURATION:").append(targetDurationSec).append('\n')
                .append("#EXT-X-MEDIA-SEQUENCE:0\n")
                .append("#EXT-X-PLAYLIST-TYPE:VOD\n");
        for (HlsSegment s : segments) {
            // Locale.ROOT forces a '.' decimal separator; a default locale that uses ',' would corrupt EXTINF.
            sb.append("#EXTINF:").append(String.format(Locale.ROOT, "%.3f", s.getDurationSec())).append(",\n");
            sb.append(s.getUri()).append('\n');
        }
        sb.append("#EXT-X-ENDLIST\n");
        return sb.toString();
    }
}
Key design decisions & trade-offs
- Where to do playback delivery — Chosen: CDN-first with origin pull fallback. Pushing 95%+ of segment bytes to edge PoPs is the only way to serve ~115K QPS globally under 100 ms P99 without a multi-Tbps origin. Origin cost and failure blast radius both shrink dramatically.
- Transcoding timing — Chosen: Async DAG of workers with a completion queue. A 10-minute encode cannot block the upload response. Async keeps the 'processing' UX simple and lets the encoder fleet scale independently; cost is a delay before the video is watchable.
- Storage layout for transcoded outputs — Chosen: Full ABR ladder (6-8 renditions) per source. Trades ~4x raw storage for adaptive playback across 2G phones to 1 Gbps fiber. Cheaper than re-encoding on demand and keeps the edge cache-friendly.
- Upload protocol — Chosen: Resumable multipart with 5 MB GOP-aligned chunks. Avoids restarting multi-GB uploads on a single flaky connection and allows parallel PUTs to saturate the user's uplink. Adds server bookkeeping for part manifests.
- Metadata store — Chosen: Sharded SQL behind Redis. Video metadata is relational (owner, ACL, playlists) and read-dominated. SQL gives transactional updates on privacy changes; Redis absorbs the 10-100x read amplification.
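The resumable-upload decision above comes down to two pieces of client bookkeeping: splitting the file into numbered byte ranges, and re-sending only the parts the server has not acknowledged. A small sketch of both follows. It uses fixed-size parts and ignores GOP alignment, which would require parsing the video container; the class and method names are hypothetical, while the header text follows the standard HTTP Content-Range form.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Splits an upload into fixed-size parts and computes what is left after a disconnect. */
final class UploadPlan {
    static final class Part {
        final int number;        // 1-based part number
        final long start;        // first byte offset, inclusive
        final long endInclusive; // last byte offset, inclusive

        Part(int number, long start, long endInclusive) {
            this.number = number;
            this.start = start;
            this.endInclusive = endInclusive;
        }

        /** Value for the Content-Range request header of this part. */
        String contentRange(long totalBytes) {
            return "bytes " + start + "-" + endInclusive + "/" + totalBytes;
        }
    }

    /** Cuts [0, totalBytes) into consecutive parts of partBytes; the last part may be shorter. */
    static List<Part> plan(long totalBytes, long partBytes) {
        if (totalBytes <= 0 || partBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        List<Part> parts = new ArrayList<>();
        int number = 1;
        for (long start = 0; start < totalBytes; start += partBytes) {
            long end = Math.min(start + partBytes, totalBytes) - 1;
            parts.add(new Part(number++, start, end));
        }
        return parts;
    }

    /** Parts still to send, given the part numbers the server has already acknowledged. */
    static List<Part> remaining(List<Part> plan, Set<Integer> acknowledged) {
        List<Part> todo = new ArrayList<>();
        for (Part p : plan) {
            if (!acknowledged.contains(p.number)) {
                todo.add(p);
            }
        }
        return todo;
    }
}
```

For a 12 MiB file with 5 MiB parts, plan yields three parts, and the last one carries the header value "bytes 10485760-12582911/12582912". After a dropped connection the client would ask the server which part numbers it holds and pass that set to remaining, so only the missing parts are re-sent.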
Interview follow-ups
- How would you support live streaming with LL-HLS and sub-3-second glass-to-glass latency?
- How would you A/B test a new codec (AV1) without doubling storage for all videos?
- How do you protect premium content with DRM (Widevine/FairPlay) across the ladder?
- How would you build a recommendation service on top of view-event logs?
- How do you handle copyright (Content ID) matching at upload time?