Data Models

By Rahul Kumar · Senior Software Engineer · Updated May 2026 · Category: Kleppmann · Designing Data-Intensive Applications

Relational / document / graph side-by-side with queries. Ch 2.

This interactive explanation is built for system design interview prep: step through Data Models, watch the internal state change, and connect the concept to real distributed-system trade-offs.

Overview

Relational, document, and graph data models are three different answers to the same question: how should the shape of your data match the shape of your queries? Relational tables enforce a schema up front, split data across normalized rows, and lean on joins to reassemble it. Document stores embed related data as a single self-describing blob, optimizing locality for one-shot reads at the cost of update anomalies and awkward cross-document joins. Graph databases model entities and the relationships between them as first-class citizens, so queries that would be N-way joins in SQL become constant-time traversals regardless of how many hops they span. Kleppmann frames the choice as query-shape versus data-shape impedance: pick the model whose access pattern already matches your reads, because fighting the mismatch later is what produces brittle data layers. Most real systems end up polyglot, carrying one workload per store.

Data Models — Interactive Simulator

Runs fully client-side in your browser; no sign-up. Or open full screen →

Launch the interactive Data Models widget — step through the algorithm or protocol and observe the internal state updating in real time.

How it works

Relational stores break an entity into normalized rows across many tables and use foreign keys to preserve referential integrity; the database rebuilds the object at read time by joining rows on those keys. This is great when the same underlying facts are queried from multiple angles ("show me users of this org" and "show me orgs of this user"). Document stores flip the tradeoff: the whole entity lives in a single nested record, usually JSON or BSON, so one lookup returns everything the UI needs with no join cost. The price is denormalization: a user's name embedded in every order becomes stale the moment the user renames, and cross-document transactions are historically weak. Graph databases use an adjacency-list representation where each vertex stores pointers to its edges, so an edge traversal is a direct memory dereference. A three-hop friend-of-friend query that would require two self-joins over a billion-row edge table in SQL becomes a bounded walk in a graph engine. The crucial realization is that all three models can encode any data; the difference is how expensive your actual queries become. Pick the model that makes your hottest read path a single lookup or a short walk, not a multi-way join.

Implementation

Relational: JPA @Entity

import jakarta.persistence.*;
import java.time.Instant;
import java.util.*;

@Entity
@Table(name = "users")
public class UserRow {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    @Column(nullable = false, unique = true)
    private String email;

    @Column(nullable = false)
    private String name;

    @Column(name = "created_at", nullable = false)
    private Instant createdAt;

    @OneToMany(mappedBy = "user", fetch = FetchType.LAZY, cascade = CascadeType.ALL)
    private List<OrderRow> orders = new ArrayList<>();

    @ManyToMany
    @JoinTable(name = "user_roles",
        joinColumns = @JoinColumn(name = "user_id"),
        inverseJoinColumns = @JoinColumn(name = "role_id"))
    private Set<RoleRow> roles = new HashSet<>();

    protected UserRow() {}

    public UserRow(String email, String name) {
        this.email = email;
        this.name = name;
        this.createdAt = Instant.now();
    }

    public Long id() { return id; }
    public String email() { return email; }
    public String name() { return name; }
    public List<OrderRow> orders() { return orders; }
    public Set<RoleRow> roles() { return roles; }
}

Document: MongoDB POJO

import org.bson.codecs.pojo.annotations.BsonId;
import org.bson.codecs.pojo.annotations.BsonProperty;
import java.time.Instant;
import java.util.*;

public final class UserDoc {
    @BsonId
    private String id;                       // usually ObjectId string

    @BsonProperty("email")
    private String email;

    @BsonProperty("name")
    private String name;

    @BsonProperty("created_at")
    private Instant createdAt;

    // Embedded subdocuments: one read returns the user and their recent orders.
    @BsonProperty("recent_orders")
    private List<EmbeddedOrder> recentOrders = new ArrayList<>();

    @BsonProperty("roles")
    private List<String> roles = new ArrayList<>();  // denormalized role names

    public UserDoc() {}

    public UserDoc(String email, String name) {
        this.email = email;
        this.name = name;
        this.createdAt = Instant.now();
    }

    public static final class EmbeddedOrder {
        @BsonProperty("order_id") private String orderId;
        @BsonProperty("total_cents") private long totalCents;
        @BsonProperty("placed_at") private Instant placedAt;

        public EmbeddedOrder() {}
        public EmbeddedOrder(String id, long cents, Instant at) {
            this.orderId = id; this.totalCents = cents; this.placedAt = at;
        }
    }

    public String id() { return id; }
    public String email() { return email; }
    public List<EmbeddedOrder> recentOrders() { return recentOrders; }
}

Graph: Neo4j Cypher query wrapper

import org.neo4j.driver.*;
import java.util.*;
import static org.neo4j.driver.Values.parameters;

public final class SocialGraphClient implements AutoCloseable {
    private final Driver driver;

    public SocialGraphClient(String uri, String user, String pass) {
        this.driver = GraphDatabase.driver(uri, AuthTokens.basic(user, pass));
    }

    /** Create a User vertex. */
    public void upsertUser(String userId, String name) {
        try (Session s = driver.session()) {
            s.executeWrite(tx -> tx.run(
                "MERGE (u:User {id: $id}) SET u.name = $name RETURN u",
                parameters("id", userId, "name", name)).consume());
        }
    }

    /** Create a FRIEND edge between two users. */
    public void follow(String fromId, String toId) {
        try (Session s = driver.session()) {
            s.executeWrite(tx -> tx.run(
                "MATCH (a:User {id: $a}), (b:User {id: $b}) " +
                "MERGE (a)-[:FRIEND]->(b)",
                parameters("a", fromId, "b", toId)).consume());
        }
    }

    /** Friends-of-friends up to depth 3 — trivial traversal, not a 3-way join. */
    public List<String> friendsOfFriends(String userId, int depth) {
        try (Session s = driver.session()) {
            return s.executeRead(tx -> tx.run(
                "MATCH (u:User {id: $id})-[:FRIEND*2.." + depth + "]->(f:User) " +
                "WHERE f.id <> $id RETURN DISTINCT f.id AS id",
                parameters("id", userId))
                .list(r -> r.get("id").asString()));
        }
    }

    @Override public void close() { driver.close(); }
}

Complexity

Relational 3-way join: O(N log N) with indexes, O(N^2) worst case
Document single-key lookup: O(1) with primary-key index
Graph k-hop traversal: O(V + E) bounded to visited subgraph
Schema migration (relational): O(N) rewrite or online DDL
Write amplification (embedded documents): O(copies) when denormalized field changes

Key design decisions & trade-offs

Normalization level — Chosen: Normalize in relational, denormalize in document. Relational writes stay cheap because each fact lives once; document reads stay cheap because each read returns the whole aggregate. Mixing them produces the worst of both: duplicated data and scattered joins.
Schema enforcement — Chosen: Relational: schema-on-write. Document: schema-on-read.. Schema-on-write catches errors at insert time and keeps the data clean for analytics. Schema-on-read lets teams evolve independently but pushes validation into every reader, which drifts over time.
Relationship-heavy workloads — Chosen: Use a graph database if traversal depth is unbounded or dynamic. Fixed-depth relationships (one-to-many, many-to-many) are fine in SQL. When queries say "find any path between A and B" or "shortest fraud chain," graph engines win by orders of magnitude because edges are pointers, not index lookups.
ACID across aggregates — Chosen: Relational gives multi-row transactions; most document stores only give single-document atomicity. If your consistency boundary crosses entities (order + inventory), a relational store lets you commit them together. Document stores force you to either embed everything into one document or build a two-phase commit on top.

Common pitfalls

Treating JSON columns in Postgres as a free pass to skip schema design; you end up with un-indexable mystery blobs that are worst of both worlds
Choosing a graph DB for a workload that is really a 2-table join; the constant factor on a graph engine is much higher than an indexed SQL lookup
Denormalizing a username into every embedded document and then discovering you cannot rename a user atomically
Emulating joins in application code across a document store and calling it a microservice architecture

Interview follow-ups

Model a social feed in each of the three stores and compare the p95 read latency
Design a polyglot stack: OLTP in Postgres, cached aggregates in Mongo, friend graph in Neo4j
Implement online schema migration for a 1B-row relational table with zero downtime
Handle schema evolution in a document store using versioned readers