Row vs Column
Storage layout & scan cost. Ch 1.
This interactive explanation is built for system design interview prep: step through Row vs Column, watch the internal state change, and connect the concept to real distributed-system trade-offs.
Overview
Row-oriented and column-oriented storage lay out exactly the same logical table in physically opposite ways, and that single choice drives the personality of a database more than any other internal decision. Row stores (Postgres, MySQL, Oracle) write every attribute of a tuple contiguously on a page, so a point read of a whole record is one sequential disk touch, and OLTP workloads that insert or update full rows stay fast. Column stores (Vertica, Redshift, ClickHouse, Parquet) instead write each attribute into its own contiguous vector, so analytical scans that read only a handful of columns from billions of rows skip over the rest of the table and run an order of magnitude faster. The catch: column stores hate point writes because a single row update must touch N files, and row stores hate analytical scans because they drag useless columns through the memory hierarchy. Understanding when to pick each — and hybrids like PAX and Apache Arrow — is the crux of modern storage engine design.
How it works
A row store lays out tuples in a heap file one after another, typically inside a slotted page; the scan path is pread() the page into the buffer pool, walk the slot array, and materialize a Java object per slot. A columnar store instead writes each attribute as a separate array on disk, often with a lightweight header describing dictionary encoding, run-length encoding, or bit-packing. When a query like SELECT SUM(amount) FROM orders WHERE region = 'US' arrives, the columnar executor enters a state machine: OPEN_COLUMN (mmap the region vector), FILTER (SIMD-compare each lane against the constant 'US' to produce a selection bitmap), PROJECT (apply the bitmap to the amount vector), and AGGREGATE (vectorized sum over surviving lanes). The row engine must instead OPEN_PAGE, for each slot DECODE_TUPLE, CHECK_PREDICATE on region, and then ADD amount — paying per-tuple dispatch cost on every row, including rows it ends up rejecting. Writes invert the story: a row insert is one append; a columnar insert must split the tuple across N columns, which is why real systems buffer writes in a row-oriented write-optimized store (the 'delta') and merge periodically into the columnar base (the 'main').
Implementation
package com.example.storage;
import java.util.*;
/** One column = one primitive array. Reads of a single column are pure sequential memory. */
public class ColumnStore {
private final int[] region; // dictionary-encoded: 0=US, 1=EU, 2=APAC
private final long[] amount;
private final int rows;
public ColumnStore(int[] region, long[] amount) {
if (region.length != amount.length) throw new IllegalArgumentException();
this.region = region;
this.amount = amount;
this.rows = region.length;
}
/** SUM(amount) WHERE region = target. Vectorized, branch-free inner loop. */
public long sumWhereRegion(int target) {
long sum = 0;
for (int i = 0; i < rows; i++) {
// mask = -1 when match, 0 otherwise; avoids a hard-to-predict branch
long mask = -(long) ((region[i] == target) ? 1 : 0);
sum += amount[i] & mask;
}
return sum;
}
}
package com.example.storage;
import java.util.*;
/** One row = one object. Good for point lookups and full-tuple reads. */
public class RowStore {
public static final class Order {
public final int region;
public final long amount;
public final long customerId;
public final long ts;
public Order(int region, long amount, long customerId, long ts) {
this.region = region; this.amount = amount;
this.customerId = customerId; this.ts = ts;
}
}
private final List<Order> heap = new ArrayList<>();
public void insert(Order o) { heap.add(o); }
public Order get(int rowId) { return heap.get(rowId); }
/** Same query as ColumnStore but forced to drag every field through cache. */
public long sumWhereRegion(int target) {
long sum = 0;
for (Order o : heap) {
if (o.region == target) sum += o.amount;
}
return sum;
}
}
package com.example.storage;
import java.util.concurrent.ThreadLocalRandom;
public class ScanBenchmark {
public static void main(String[] args) {
final int N = 10_000_000;
int[] region = new int[N];
long[] amount = new long[N];
RowStore rows = new RowStore();
ThreadLocalRandom r = ThreadLocalRandom.current();
for (int i = 0; i < N; i++) {
region[i] = r.nextInt(3);
amount[i] = r.nextLong(1_000);
rows.insert(new RowStore.Order(region[i], amount[i], r.nextLong(), r.nextLong()));
}
ColumnStore cols = new ColumnStore(region, amount);
long t1 = System.nanoTime(); long sC = cols.sumWhereRegion(0);
long t2 = System.nanoTime(); long sR = rows.sumWhereRegion(0);
long t3 = System.nanoTime();
System.out.printf("col: %d ms sum=%d%n", (t2 - t1) / 1_000_000, sC);
System.out.printf("row: %d ms sum=%d%n", (t3 - t2) / 1_000_000, sR);
// Typical result: column store is 3-10x faster because it scans
// 8 bytes/row instead of a ~32-byte Order object.
}
}
Complexity
- ScanOneColumn:
O(rows) but tight sequential bytes, SIMD-friendly - PointRead (full tuple):
Row: O(1); Column: O(N_columns) random I/O - PointWrite:
Row: O(1); Column: O(N_columns) - Compression:
Column: 3-10x typical (runs/dicts in one column); Row: 1.5-2x
Key design decisions & trade-offs
- Physical layout — Chosen: Row store for OLTP, column store for OLAP. Point writes and whole-row reads love rows; scans over a few columns love columns. The workload picks the layout, not the other way around.
- Handling writes in a column store — Chosen: Write-optimized row-shaped delta that is merged into the columnar main. Pure columnar inserts amplify writes by N_columns. A small in-memory row-shaped delta absorbs the mutation burst and is merged in the background, trading read-time merge cost for sane write throughput.
- Compression scheme per column — Chosen: Dictionary + RLE + bit-packing chosen per column, not globally. Low-cardinality columns compress 20x with a dictionary; monotonic timestamps want delta encoding; floats usually cannot be compressed at all. Picking globally leaves most of the win on the table.
- Hybrid formats (PAX, Arrow) — Chosen: Row groups of columnar blocks — good for both SSDs and vector execution. A pure columnar file pays a seek per column per query; PAX keeps a row group together so one I/O fetches all columns a query needs, while preserving column-wise compression and SIMD scans.
Common pitfalls
- Running OLTP point updates on a pure column store and being surprised when write latency is N_columns x worse than on Postgres.
- Storing a 64-bit integer column uncompressed when it actually holds 3 distinct values — missing a trivial dictionary gives up most of the size advantage.
- Benchmarking column stores with SELECT * which defeats the entire point; columns help exactly because the query touches a subset.
- Forgetting that columnar updates require tombstones + rewrite; retention policies and GC pressure become real operational concerns.
- Assuming columnar is always faster — for a query that touches every column on a small table, a row store's single sequential read wins.
Interview follow-ups
- How would you implement UPDATE on a column store without rewriting the whole column?
- Why does vectorized execution pair naturally with columnar storage, and what does it mean for the inner loop?
- When would PAX or Arrow's row-group layout beat both pure row and pure column layouts?
- How do late-materialization and early-materialization execution strategies differ in a columnar engine?
- Explain how Parquet encodes a column chunk — repetition/definition levels, dictionary pages, and min/max stats for predicate pushdown.
Recommended reading
- Alex Petrov, Database Internals — storage engines and distributed systems internals.
- Martin Kleppmann, Designing Data-Intensive Applications (DDIA) — data models, replication, partitioning, consistency.
- The System Design Primer — high-level design building blocks.
- Foundational networking + web-security references (TCP/IP, TLS 1.3, OWASP Top 10).