Encoding & Evolution

By Rahul Kumar · Senior Software Engineer · Updated May 2026 · Category: Kleppmann · Designing Data-Intensive Applications

JSON, Protobuf, and Avro: payload size and schema compatibility. Chapter 4.

This interactive explanation is built for system design interview prep: step through Encoding & Evolution, watch the internal state change, and connect the concept to real distributed-system trade-offs.

Overview

Every message leaving your process and every row going to disk is encoded: turned from an in-memory object graph into a sequence of bytes, then decoded on the other side. JSON, Protobuf, and Avro are the three encoding families that dominate backend systems, and each makes a different bet about schemas. JSON puts field names inside every payload, so the reader needs no prior agreement but pays for that metadata in every message. Protobuf assigns a compact numeric tag per field and requires the reader to know the schema to interpret the values, trading self-description for density. Avro goes further: the encoded record contains no tags or names at all, only values in schema order, so the reader must obtain the exact schema the writer used, either from a file header or from a schema registry. That makes the bytes very compact, but the data cannot be decoded without the schema it was written with. Kleppmann's key point is that encoding choice is really a schema-evolution choice: how you add, remove, and rename fields without breaking old clients.
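To make the cost of JSON's self-description concrete, here is a minimal sketch that counts how many bytes of a two-field record go to field names, quotes, and punctuation rather than to values. The class `JsonOverhead` and its methods are hypothetical helpers for illustration, and the hand-built JSON assumes the email needs no escaping.

```java
import java.nio.charset.StandardCharsets;

final class JsonOverhead {
    /** Hand-built JSON for a two-field record; assumes email contains nothing that needs escaping. */
    static String toJson(long id, String email) {
        return "{\"id\":" + id + ",\"email\":\"" + email + "\"}";
    }

    /** Bytes spent on the values alone: the digits of id plus the UTF-8 bytes of email. */
    static int valueBytes(long id, String email) {
        return Long.toString(id).length() + email.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Bytes spent on field names, quotes, braces, colons, and commas. */
    static int overheadBytes(long id, String email) {
        int total = toJson(id, email).getBytes(StandardCharsets.UTF_8).length;
        return total - valueBytes(id, email);
    }
}
```

For this record shape the overhead is the same 18 bytes in every message, whatever the values are. A schema-based format pays for that information once, in the schema, instead of once per payload.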

How it works

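At write time, JSON walks the object tree and emits key-value pairs as UTF-8 text. The result is human-readable but larger than the binary equivalents and slower to parse; in Kleppmann's example record, the JSON form takes 81 bytes against 33 for Protobuf and 32 for Avro. Protobuf numbers each field in a .proto file; the encoder writes a tag, a varint that packs the field number and wire type (a single byte for field numbers 1 through 15), followed by the value, and in proto3 it skips any field that is unset or still at its default. A reader that meets a tag it does not recognize uses the wire type to skip over the value and preserves it as raw bytes, so old readers can round-trip new messages without losing data. This is the core trick that makes forward compatibility work.

Avro is stricter: the writer's schema must accompany the data, either embedded in a file header or looked up by ID from a registry. The reader's schema may differ, and Avro resolves the two schemas at read time, matching fields by name, filling in the reader's default value for any field the writer did not write, and ignoring fields the reader does not declare.

Schema evolution rules fall out of these mechanics. In Protobuf, every new field needs a fresh tag number and must be optional, and the tag number of a removed field must never be reused. In Avro, you may add or remove only fields that have a default value: a reader on the newer schema needs the default to fill in a field that older data lacks, and a reader on the older schema needs one for any field the newer writer has dropped. Getting the rules wrong hurts in different ways. Reusing a retired Protobuf tag for a field with a different meaning creates silent data corruption that shows up weeks later, whereas adding or removing an Avro field that has no default makes schema resolution fail outright when a reader meets data written with the other version.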
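The Protobuf tag-then-value layout described above can be sketched by hand in a few lines. This is an illustrative encoder for the two-field record only, not the protobuf library: the class `WireSketch` and its methods are hypothetical, and it handles just the varint and length-delimited wire types.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class WireSketch {
    static final int WIRE_VARINT = 0; // integers
    static final int WIRE_LEN = 2;    // strings, bytes, nested messages

    /** Base-128 varint: 7 payload bits per byte, high bit set on every byte except the last. */
    static void writeVarint(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    /** The tag is itself a varint holding (fieldNumber << 3) | wireType. */
    static void writeTag(ByteArrayOutputStream out, int fieldNumber, int wireType) {
        writeVarint(out, ((long) fieldNumber << 3) | wireType);
    }

    /** Encodes {id = 1, email = 2}; fields at their default value are skipped entirely. */
    static byte[] encodeUser(long id, String email) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (id != 0) {
            writeTag(out, 1, WIRE_VARINT);
            writeVarint(out, id);
        }
        if (!email.isEmpty()) {
            byte[] utf8 = email.getBytes(StandardCharsets.UTF_8);
            writeTag(out, 2, WIRE_LEN);
            writeVarint(out, utf8.length);
            out.write(utf8, 0, utf8.length);
        }
        return out.toByteArray();
    }
}
```

Calling `WireSketch.encodeUser(150, "a")` yields the six bytes `08 96 01 12 01 61`: tag 1 as a varint, the value 150 in two varint bytes, then tag 2 as length-delimited, length 1, and the byte for `a`. No field name appears anywhere in the output.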

Implementation

POJO with a new optional field (schema evolution)

import java.util.Objects;
import java.util.Optional;

/** V1 had {id, email}. V2 adds an optional phoneNumber without breaking V1 readers. */
public final class UserProfile {
    private final long id;
    private final String email;
    private final Optional<String> phoneNumber; // added in V2; V1 clients ignore it

    public UserProfile(long id, String email, Optional<String> phoneNumber) {
        this.id = id;
        this.email = Objects.requireNonNull(email);
        this.phoneNumber = phoneNumber == null ? Optional.empty() : phoneNumber;
    }

    /** Backward-compatible factory for pre-V2 callers. */
    public static UserProfile v1(long id, String email) {
        return new UserProfile(id, email, Optional.empty());
    }

    public long id() { return id; }
    public String email() { return email; }
    public Optional<String> phoneNumber() { return phoneNumber; }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof UserProfile u)) return false;
        return id == u.id && email.equals(u.email) && phoneNumber.equals(u.phoneNumber);
    }

    @Override
    public int hashCode() {
        return Objects.hash(id, email, phoneNumber);
    }
}

Protobuf .proto excerpt (evolvable schema)

syntax = "proto3";

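package user.v2;

// Generate the Java classes as user.v2.UserProto.*, matching the import used by the Java SerDe.
option java_package = "user.v2";
option java_outer_classname = "UserProto";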

// Adding fields: always allocate a new tag number, never reuse retired ones.
// Removing fields: mark the tag as reserved so nobody re-uses it accidentally.
message UserProfile {
  int64 id = 1;
  string email = 2;

  // Added in V2. Old clients serialize messages without this field;
  // new servers parse those messages and see an empty phone_number.
  string phone_number = 3;

  // Field 4 was "legacy_username" — retired. Reserve so it cannot be re-used.
  reserved 4;
  reserved "legacy_username";
}

Protobuf SerDe: round-trip the POJO

import com.google.protobuf.InvalidProtocolBufferException;
import user.v2.UserProto;
import java.util.Optional;

public final class UserProfileSerde {

    /** Encode POJO to bytes. An absent phoneNumber is never set, so it is omitted from the wire. */
    public byte[] encode(UserProfile profile) {
        UserProto.UserProfile.Builder b = UserProto.UserProfile.newBuilder()
            .setId(profile.id())
            .setEmail(profile.email());
        profile.phoneNumber().ifPresent(b::setPhoneNumber);
        return b.build().toByteArray();
    }

    /** Decode bytes to POJO. Works on payloads from V1 (no phone field). */
    public UserProfile decode(byte[] bytes) throws InvalidProtocolBufferException {
        UserProto.UserProfile msg = UserProto.UserProfile.parseFrom(bytes);
        Optional<String> phone = msg.getPhoneNumber().isEmpty()
            ? Optional.empty()
            : Optional.of(msg.getPhoneNumber());
        return new UserProfile(msg.getId(), msg.getEmail(), phone);
    }

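    /**
     * Round-trip sanity check. Caveat: Optional.of("") does not survive, because proto3
     * omits an empty string from the wire and decode() maps it back to Optional.empty().
     */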
    public boolean roundTrip(UserProfile original) throws InvalidProtocolBufferException {
        byte[] wire = encode(original);
        UserProfile decoded = decode(wire);
        return decoded.equals(original);
    }
}

Complexity

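JSON encode/decode: O(N), with text payloads typically a few times larger than the binary equivalents
Protobuf encode: O(N), compact binary; each field costs a varint tag plus its value
Avro encode: O(N), smallest on the wire (values only); the writer's schema travels in a file header or is fetched out-of-band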
Schema registry lookup: O(1) with client-side cache, O(RTT) on miss
Backward/forward compat checks: O(F) per schema change where F is field count
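The registry lookup cost in the list above depends on a client-side cache. A minimal sketch of such a cache follows, assuming schemas are immutable once registered under an ID. The class `CachingSchemaLookup` is hypothetical, the schema is represented as a plain string, and the `remoteFetch` function stands in for whatever network call a real registry client would make.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntFunction;

final class CachingSchemaLookup {
    private final Map<Integer, String> cache = new ConcurrentHashMap<>();
    private final IntFunction<String> remoteFetch; // stand-in for the registry round trip
    private final AtomicInteger misses = new AtomicInteger();

    CachingSchemaLookup(IntFunction<String> remoteFetch) {
        this.remoteFetch = remoteFetch;
    }

    /** Returns the schema for an ID, paying the remote fetch only on the first request. */
    String schemaFor(int schemaId) {
        return cache.computeIfAbsent(schemaId, id -> {
            misses.incrementAndGet();
            return remoteFetch.apply(id);
        });
    }

    /** Number of lookups that had to go to the registry. */
    int misses() {
        return misses.get();
    }
}
```

Because a registered schema never changes under its ID in this model, the cache needs no invalidation; only the first message carrying a new schema ID pays the round trip.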

Key design decisions & trade-offs

Self-describing vs schema-out-of-band — Chosen: JSON embeds names; Protobuf embeds tag numbers; Avro embeds nothing. Self-describing wins for ad-hoc tools and debugging but costs bytes and CPU on every call. Out-of-band schemas need a registry and discipline but deliver the smallest payloads and strongest evolution guarantees.
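Field identity — Chosen: Tag numbers (Protobuf) or schema-resolved names (Avro) over names embedded in the payload. Names are easy to rename in an IDE, but the tag number is what the Protobuf wire format actually carries, so renaming a Proto field changes the Java accessor and leaves the binary wire format intact. Avro writes values in schema order and matches writer and reader fields by name, so a rename there needs an alias. Renaming a field in JSON silently breaks every consumer.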
Required vs optional fields — Chosen: Avoid "required" in evolvable schemas. A required field can never be removed without breaking old readers forever. Proto3 removed the keyword entirely for this reason. Treat everything as optional with sensible defaults and enforce invariants in code.
Human-readable vs binary — Chosen: JSON for external APIs, binary for internal RPC and storage. External consumers need debuggability and curl-friendliness. Internal traffic is rarely inspected by humans and benefits from the smaller payloads and cheaper parsing of binary formats; the size of the gain depends on the data, so measure it rather than assuming a fixed multiplier.
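The field-identity trade-off is easiest to see from the reader's side. The sketch below is a hand-written reader for the varint and length-delimited wire types only, with `TagReader` as a hypothetical name; it is not the protobuf library. It shows that the only identity a decoder recovers from the bytes is the field number.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class TagReader {
    /** Reads varint (wire type 0) and length-delimited (wire type 2) fields, keyed by field number. */
    static Map<Integer, Object> read(byte[] wire) {
        Map<Integer, Object> fields = new LinkedHashMap<>();
        int[] pos = {0}; // single-element array so readVarint can advance the cursor
        while (pos[0] < wire.length) {
            long tag = readVarint(wire, pos);
            int fieldNumber = (int) (tag >>> 3);
            int wireType = (int) (tag & 0x7);
            if (wireType == 0) {
                fields.put(fieldNumber, readVarint(wire, pos));
            } else if (wireType == 2) {
                int len = (int) readVarint(wire, pos);
                // Treated as UTF-8 text here; the bytes do not say whether this is a string or a nested message.
                fields.put(fieldNumber, new String(wire, pos[0], len, StandardCharsets.UTF_8));
                pos[0] += len;
            } else {
                throw new IllegalArgumentException("wire type not handled in this sketch: " + wireType);
            }
        }
        return fields;
    }

    /** Decodes one base-128 varint starting at pos[0] and advances pos[0] past it. */
    static long readVarint(byte[] wire, int[] pos) {
        long result = 0;
        int shift = 0;
        while (true) {
            byte b = wire[pos[0]++];
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
            shift += 7;
        }
    }
}
```

Reading the bytes `20 CF 0F` gives field 4 with the value 1999 and nothing else. Whether that number is a balance in cents or a timestamp is decided entirely by the reader's schema, which is why a rename is invisible on the wire and a reused tag number is undetectable.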

Common pitfalls

Reusing a Protobuf tag number after deleting a field; the new field silently deserializes old payloads as garbage
Renaming an Avro field without providing an alias; data written under the old name no longer matches the renamed field in the reader's schema
Shipping a JSON-over-HTTP API and later discovering you cannot remove a field because unknown clients still depend on it
Forgetting that a proto3 field at its default value is indistinguishable from an unset one, so you cannot tell "user passed 0" from "user omitted the field" unless you declare the field optional (supported in proto3 since release 3.15) or use a wrapper type
Using Java serialization across service boundaries; it is neither compact nor evolvable nor cross-language
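The Avro rename pitfall above comes down to how a reader matches its fields against what the writer produced. Here is a simplified model of that resolution step, not the Avro library: `SchemaResolution` and `ReaderField` are hypothetical names, the writer's record is modelled as a map already decoded with the writer's schema, and a null value is treated as missing for brevity.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class SchemaResolution {
    /** Hypothetical reader-side field description; real Avro schemas are JSON documents. */
    record ReaderField(String name, List<String> aliases, Optional<Object> defaultValue) {}

    /** Builds the record the reader sees: match by name, then by alias, then fall back to the default. */
    static Map<String, Object> resolve(Map<String, Object> writerRecord, List<ReaderField> readerSchema) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (ReaderField f : readerSchema) {
            Object value = lookup(writerRecord, f);
            if (value == null) {
                value = f.defaultValue().orElseThrow(() ->
                    new IllegalStateException("no value and no default for field " + f.name()));
            }
            out.put(f.name(), value);
        }
        return out; // writer fields the reader never declared are dropped
    }

    private static Object lookup(Map<String, Object> writerRecord, ReaderField f) {
        if (writerRecord.containsKey(f.name())) {
            return writerRecord.get(f.name());
        }
        for (String alias : f.aliases()) {
            if (writerRecord.containsKey(alias)) {
                return writerRecord.get(alias);
            }
        }
        return null;
    }
}
```

With a reader field `email` that lists `mail` as an alias, data written under the old name `mail` still resolves. Remove the alias and the same data fails with the "no value and no default" error, unless the field has a default, in which case the default silently replaces the real value.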

Interview follow-ups

Set up a schema registry with compatibility checks that block incompatible deploys at CI time
Compare payload size and parse latency for the same object in JSON, Protobuf, Avro, and MessagePack
Design a rolling deploy where V1 producers, V2 consumers, V1 consumers, and V2 producers all coexist
Handle a field rename safely using aliases (Avro); in Protobuf a rename is already safe on the binary wire, so reserve the write-both-tags approach for changes to a field's tag or type
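For the first follow-up, a CI-time compatibility gate can start from a very small rule: a reader can decode a writer's data if every field the reader declares is either written or has a default. The sketch below applies that rule in both directions. It is a simplification under stated assumptions: `CompatCheck` and `Field` are hypothetical names, and type changes and aliases are ignored.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class CompatCheck {
    /** Hypothetical minimal field model: a name and whether the schema declares a default. */
    record Field(String name, boolean hasDefault) {}

    /** True if a reader on readerSchema can decode data written with writerSchema. */
    static boolean canRead(List<Field> readerSchema, List<Field> writerSchema) {
        Set<String> written = writerSchema.stream().map(Field::name).collect(Collectors.toSet());
        return readerSchema.stream().allMatch(f -> written.contains(f.name()) || f.hasDefault());
    }

    /** Backward compatible: code on the new schema can read data written with the old one. */
    static boolean backwardCompatible(List<Field> oldSchema, List<Field> newSchema) {
        return canRead(newSchema, oldSchema);
    }

    /** Forward compatible: code on the old schema can read data written with the new one. */
    static boolean forwardCompatible(List<Field> oldSchema, List<Field> newSchema) {
        return canRead(oldSchema, newSchema);
    }
}
```

A build step could run both checks against the last released schema and fail the deploy when either returns false. Adding `phone` without a default fails the backward check, and removing `email` when it has no default fails the forward check.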