Encoding & Evolution
JSON / Protobuf / Avro size + schema compat. Ch 4.
Overview
Every message leaving your process and every row going to disk is encoded: turned from an in-memory object graph into a sequence of bytes, then decoded on the other side. JSON, Protobuf, and Avro are the three encoding families that dominate backend systems, and each makes a different bet about schemas. JSON puts field names inside every payload, so the reader needs no prior agreement but pays for that metadata in every message. Protobuf assigns a compact numeric tag per field and requires the reader to know the schema, trading self-description for density. Avro goes further: the payload contains no tags or names at all, and the writer's schema travels separately (in a file header or via a schema registry), which makes the bytes very compact but means a reader cannot decode anything without the exact schema the writer used. Kleppmann's key point is that encoding choice is really a schema-evolution choice: how you add, remove, and rename fields without breaking old clients.
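To make the size difference concrete, here is a minimal sketch that hand-encodes the same two-field record once as JSON text and once in a Protobuf-style tag-plus-value layout. The class and method names are hypothetical, the JSON writer does no escaping, and the binary encoder covers only the two wire types this record needs; it illustrates the idea and is not a replacement for a real Protobuf library.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public final class EncodingSizeSketch {
    /** Writes an unsigned base-128 varint, least significant 7 bits first. */
    public static void writeVarint(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80)); // continuation bit set
            value >>>= 7;
        }
        out.write((int) value);
    }

    /** JSON: the field names "id" and "email" travel inside every payload. */
    public static byte[] encodeJson(long id, String email) {
        // Assumption: email contains no quotes or backslashes, so no escaping is needed.
        String json = "{\"id\":" + id + ",\"email\":\"" + email + "\"}";
        return json.getBytes(StandardCharsets.UTF_8);
    }

    /** Tagged binary: each field is (fieldNumber << 3 | wireType) followed by its value. */
    public static byte[] encodeTagged(long id, String email) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarint(out, (1 << 3) | 0);          // field 1, wire type 0 (varint)
        writeVarint(out, id);
        byte[] emailBytes = email.getBytes(StandardCharsets.UTF_8);
        writeVarint(out, (2 << 3) | 2);          // field 2, wire type 2 (length-delimited)
        writeVarint(out, emailBytes.length);
        out.write(emailBytes, 0, emailBytes.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] json = encodeJson(1337L, "a@example.com");
        byte[] tagged = encodeTagged(1337L, "a@example.com");
        System.out.println("JSON bytes:   " + json.length);
        System.out.println("tagged bytes: " + tagged.length);
    }
}
```

For this record the JSON form is 35 bytes and the tagged form is 18. The gap grows with the number of fields and the length of their names, because the tagged layout spends one byte per field on identity while JSON repeats each name in full.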
How it works
At write time, JSON walks the object tree and emits key-value pairs as UTF-8 text, making the result human-readable but often 3-10x larger than binary equivalents and slower to parse. Protobuf numbers each field in a .proto file; the encoder writes a varint tag that packs the field number and wire type, then the value, skipping any field that is unset. A reader that meets a tag it does not recognize uses the wire type to skip the value, and most implementations preserve those unknown fields as raw bytes, so old readers can round-trip new messages without losing data. This is the core trick that makes forward compatibility work. Avro is stricter: the writer's schema must accompany the data, either embedded in a file header or looked up by ID from a registry. The reader's schema may differ, and Avro resolves the two schemas at read time, matching fields by name, filling in the reader's default for any field the writer did not send, and dropping fields the reader does not declare. Schema evolution rules fall out of these mechanics: in Protobuf, new fields must be optional with unique tags, and retired tag numbers must never be reused; in Avro, you may only add or remove a field that has a default value, because the reader fills in that default whenever the field is absent from the data. Getting the rules wrong (changing a Proto field's type while keeping the tag, or removing an Avro field that has no default) causes decode failures or silent data corruption that may only show up weeks later.
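The Avro resolution step can be sketched with plain Java collections. This is a simplified model and not the Avro library API: `ReaderField`, `resolve`, and the use of null to mean "no default" are assumptions made for illustration (real Avro allows null as a default and also handles aliases and type promotion).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public final class SchemaResolutionSketch {
    /** A reader-side field: a name plus a default (null here means "no default"). */
    public record ReaderField(String name, Object defaultValue) {}

    /**
     * Reshapes a record decoded with the writer's schema into what the reader expects.
     * Fields the reader does not declare are dropped; fields the writer did not send
     * take the reader's default; a missing field with no default is an error.
     */
    public static Map<String, Object> resolve(Map<String, Object> writerRecord,
                                              List<ReaderField> readerSchema) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (ReaderField field : readerSchema) {
            if (writerRecord.containsKey(field.name())) {
                result.put(field.name(), writerRecord.get(field.name()));
            } else if (field.defaultValue() != null) {
                result.put(field.name(), field.defaultValue());
            } else {
                throw new IllegalStateException("reader field '" + field.name()
                        + "' is missing from the writer's data and has no default");
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Data written by a V1 producer that still sends legacyUsername.
        Map<String, Object> writerRecord =
                Map.of("id", 1L, "email", "a@example.com", "legacyUsername", "alice");
        // V2 reader: no longer declares legacyUsername, adds phoneNumber with a default.
        List<ReaderField> readerSchema = List.of(
                new ReaderField("id", null),
                new ReaderField("email", null),
                new ReaderField("phoneNumber", ""));
        System.out.println(resolve(writerRecord, readerSchema));
    }
}
```

The reader sees `id` and `email` from the data, gets the default for `phoneNumber`, and never sees `legacyUsername`. Removing the default from `phoneNumber` turns the same call into an exception, which is the failure mode the evolution rules exist to prevent.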
Implementation
import java.util.Objects;
import java.util.Optional;

/** V1 had {id, email}. V2 adds an optional phoneNumber without breaking V1 readers. */
public final class UserProfile {
    private final long id;
    private final String email;
    private final Optional<String> phoneNumber; // added in V2; V1 clients ignore it

    public UserProfile(long id, String email, Optional<String> phoneNumber) {
        this.id = id;
        this.email = Objects.requireNonNull(email);
        this.phoneNumber = phoneNumber == null ? Optional.empty() : phoneNumber;
    }

    /** Backward-compatible factory for pre-V2 callers. */
    public static UserProfile v1(long id, String email) {
        return new UserProfile(id, email, Optional.empty());
    }

    public long id() { return id; }
    public String email() { return email; }
    public Optional<String> phoneNumber() { return phoneNumber; }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof UserProfile u)) return false;
        return id == u.id && email.equals(u.email) && phoneNumber.equals(u.phoneNumber);
    }

    @Override
    public int hashCode() {
        return Objects.hash(id, email, phoneNumber);
    }
}
syntax = "proto3";
package user.v2;

// Adding fields: always allocate a new tag number, never reuse retired ones.
// Removing fields: mark the tag as reserved so nobody re-uses it accidentally.
message UserProfile {
  int64 id = 1;
  string email = 2;
  // Added in V2. Old clients serialize messages without this field;
  // new servers parse those messages and see an empty phone_number.
  string phone_number = 3;
  // Field 4 was "legacy_username" (retired). Reserve so it cannot be re-used.
  reserved 4;
  reserved "legacy_username";
}
import com.google.protobuf.InvalidProtocolBufferException;
// Generated outer class; its actual name depends on the .proto file name
// or the java_outer_classname option.
import user.v2.UserProto;
import java.util.Optional;

public final class UserProfileSerde {
    /** Encode POJO to bytes. An empty phoneNumber is left unset, so nothing is written for field 3. */
    public byte[] encode(UserProfile profile) {
        UserProto.UserProfile.Builder b = UserProto.UserProfile.newBuilder()
            .setId(profile.id())
            .setEmail(profile.email());
        profile.phoneNumber().ifPresent(b::setPhoneNumber);
        return b.build().toByteArray();
    }

    /**
     * Decode bytes to POJO. Works on payloads from V1 (no phone field).
     * An absent phone_number reads back as "", which is mapped to Optional.empty().
     */
    public UserProfile decode(byte[] bytes) throws InvalidProtocolBufferException {
        UserProto.UserProfile msg = UserProto.UserProfile.parseFrom(bytes);
        Optional<String> phone = msg.getPhoneNumber().isEmpty()
            ? Optional.empty()
            : Optional.of(msg.getPhoneNumber());
        return new UserProfile(msg.getId(), msg.getEmail(), phone);
    }

    /**
     * Round-trip sanity check. Returns false for Optional.of(""), because proto3
     * cannot distinguish an empty string from an unset field.
     */
    public boolean roundTrip(UserProfile original) throws InvalidProtocolBufferException {
        byte[] wire = encode(original);
        UserProfile decoded = decode(wire);
        return decoded.equals(original);
    }
}
Complexity
- JSON encode/decode: O(N) with ~3-10x size vs binary
- Protobuf encode: O(N) with ~1x binary size, tag+value varint per field
- Avro encode: O(N) smallest on wire, schema fetched out-of-band
- Schema registry lookup: O(1) with client-side cache, O(RTT) on miss
- Backward/forward compat checks: O(F) per schema change where F is field count
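The registry lookup cost above (O(1) on a cache hit, one round trip on a miss) can be sketched as a thin caching wrapper. `CachingSchemaRegistryClient` and the `remoteFetch` function are hypothetical stand-ins for a real registry client, and the sketch assumes a schema registered under a given ID never changes, which is what makes caching without invalidation safe.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.IntFunction;

public final class CachingSchemaRegistryClient {
    private final Map<Integer, String> schemasById = new ConcurrentHashMap<>();
    private final IntFunction<String> remoteFetch; // hypothetical network call to the registry

    public CachingSchemaRegistryClient(IntFunction<String> remoteFetch) {
        this.remoteFetch = remoteFetch;
    }

    /** Returns the schema text for an ID: O(1) on a hit, one remote fetch on a miss. */
    public String schemaFor(int schemaId) {
        // computeIfAbsent runs the fetch at most once per key, even under concurrent callers.
        return schemasById.computeIfAbsent(schemaId, id -> remoteFetch.apply(id));
    }

    public static void main(String[] args) {
        CachingSchemaRegistryClient client = new CachingSchemaRegistryClient(id -> {
            System.out.println("fetching schema " + id + " from registry");
            return "{\"type\":\"record\",\"name\":\"UserProfile\"}";
        });
        client.schemaFor(42); // miss: triggers the fetch
        client.schemaFor(42); // hit: served from the local map
    }
}
```

In steady state a consumer sees only a handful of schema IDs, so nearly every message is decoded without touching the network; the round trip is paid once per new schema version.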
Key design decisions & trade-offs
- Self-describing vs schema-out-of-band. Chosen: JSON embeds names; Protobuf embeds tag numbers; Avro embeds nothing. Self-describing wins for ad-hoc tools and debugging but costs bytes and CPU on every call. Out-of-band schemas need a registry and discipline but deliver the smallest payloads and strongest evolution guarantees.
- Field identity. Chosen: tag numbers (Protobuf) or schema-resolved names with aliases (Avro) over bare JSON keys. Names are easy to rename in an IDE, but tag numbers are what the Protobuf wire format actually cares about. Renaming a Proto field changes the Java accessor but leaves the wire format intact; Avro matches writer and reader fields by name, so a rename needs an alias in the reader's schema; renaming in JSON silently breaks every consumer.
- Required vs optional fields. Chosen: avoid "required" in evolvable schemas. A required field can never be removed without breaking every old reader that still expects it. Proto3 removed the keyword entirely for this reason. Treat everything as optional with sensible defaults and enforce invariants in code.
- Human-readable vs binary. Chosen: JSON for external APIs, binary for internal RPC and storage. External consumers need debuggability and curl-friendliness. Internal traffic is rarely inspected by humans and benefits from the smaller payloads (the roughly 3-10x gap noted above) and faster parsing of binary formats.
Common pitfalls
- Reusing a Protobuf tag number after deleting a field; old payloads then silently deserialize into the new field as garbage
- Renaming an Avro field without providing an alias; old writers produce data that new readers cannot find
- Shipping a JSON-over-HTTP API and later discovering you cannot remove a field because unknown clients still depend on it
- Forgetting that in proto3 a scalar field set to its default value is indistinguishable from an unset one, so you cannot tell "user passed 0" from "user omitted the field" without a wrapper type or the proto3 optional keyword
- Using Java serialization across service boundaries; it is neither compact nor evolvable nor cross-language
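The first pitfall above, tag reuse, is easy to reproduce with a few hand-written bytes. The sketch below is a toy encoder and decoder for length-delimited fields only, and the field name `referral_code` is a hypothetical example of a new field that wrongly takes over retired tag 4; it is not how a real Protobuf runtime is structured.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public final class TagReuseSketch {
    /** Encodes one string field as tag, length, bytes. */
    public static byte[] encodeStringField(int fieldNumber, String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        // Single-byte tag and length: valid only for fieldNumber < 16 and length < 128.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write((fieldNumber << 3) | 2); // wire type 2 = length-delimited
        out.write(utf8.length);
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }

    /** Returns the string stored under the given field number, or null if it is absent. */
    public static String readStringField(byte[] payload, int wantedFieldNumber) {
        ByteBuffer buf = ByteBuffer.wrap(payload);
        while (buf.hasRemaining()) {
            int tag = buf.get() & 0xFF;
            if ((tag & 0x7) != 2) {
                throw new IllegalArgumentException("sketch handles only length-delimited fields");
            }
            byte[] value = new byte[buf.get() & 0xFF];
            buf.get(value);
            if ((tag >>> 3) == wantedFieldNumber) {
                return new String(value, StandardCharsets.UTF_8);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // An old producer wrote legacy_username under tag 4.
        byte[] oldPayload = encodeStringField(4, "alice");
        // A new schema reuses tag 4 for referral_code; the reader cannot tell the difference.
        String referralCode = readStringField(oldPayload, 4);
        System.out.println("referral_code = " + referralCode);
    }
}
```

The new reader happily reports the old username as a referral code, with no error anywhere. Reserving the tag number, as the .proto file above does for field 4, turns this mistake into a compile-time failure.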
Interview follow-ups
- Set up a schema registry with compatibility checks that block incompatible deploys at CI time
- Compare payload size and parse latency for the same object in JSON, Protobuf, Avro, and MessagePack
- Design a rolling deploy where V1 producers, V2 consumers, V1 consumers, and V2 producers all coexist
- Handle a field rename safely using aliases (Avro) or parallel-write-both-tags (Protobuf)
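As a starting point for the CI-time compatibility check in the first follow-up, the sketch below models a schema as a map from field name to "has a default" and reports violations in each direction. It is a deliberate simplification under stated assumptions: the class and method names are hypothetical, it follows the Avro-style rule that only fields with defaults may be added or removed, and it ignores type changes, aliases, and nested records.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public final class CompatibilityCheckSketch {
    /**
     * Backward compatibility: can a reader on newSchema decode data written with oldSchema?
     * Every field the new reader expects but old writers never sent needs a default.
     */
    public static List<String> backwardViolations(Map<String, Boolean> oldSchema,
                                                  Map<String, Boolean> newSchema) {
        return missingWithoutDefault(oldSchema, newSchema, "added");
    }

    /**
     * Forward compatibility: can a reader on oldSchema decode data written with newSchema?
     * Every field the old reader expects but new writers no longer send needs a default.
     */
    public static List<String> forwardViolations(Map<String, Boolean> oldSchema,
                                                 Map<String, Boolean> newSchema) {
        return missingWithoutDefault(newSchema, oldSchema, "removed");
    }

    /** One pass over the reader's fields, so the cost is O(F) in the field count. */
    private static List<String> missingWithoutDefault(Map<String, Boolean> writer,
                                                      Map<String, Boolean> reader,
                                                      String verb) {
        List<String> violations = new ArrayList<>();
        for (Map.Entry<String, Boolean> field : reader.entrySet()) {
            boolean hasDefault = field.getValue();
            if (!writer.containsKey(field.getKey()) && !hasDefault) {
                violations.add(verb + " field '" + field.getKey() + "' has no default");
            }
        }
        return violations;
    }

    public static void main(String[] args) {
        Map<String, Boolean> v1 = Map.of("id", false, "email", false);
        Map<String, Boolean> v2 = Map.of("id", false, "email", false, "phoneNumber", true);
        System.out.println("backward: " + backwardViolations(v1, v2));
        System.out.println("forward:  " + forwardViolations(v1, v2));
    }
}
```

A CI job could run both checks against the previously deployed schema and fail the build when either list is non-empty, which covers the mixed V1/V2 producer and consumer combinations during a rolling deploy.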
Recommended reading
- Alex Petrov, Database Internals — storage engines and distributed systems internals.
- Martin Kleppmann, Designing Data-Intensive Applications (DDIA) — data models, replication, partitioning, consistency.
- The System Design Primer — high-level design building blocks.
- Foundational networking + web-security references (TCP/IP, TLS 1.3, OWASP Top 10).