Introduction
Modern applications handle more data, serve more users, and must run across more machines than ever before. But what makes a data system truly reliable, scalable, and maintainable?
Designing Data-Intensive Applications by Martin Kleppmann offers a deep exploration of how data systems work under the hood. It is not just a book about databases; it is a blueprint for building systems that can grow, adapt, and perform under pressure.
Part I of the book, "Foundations of Data Systems," sets the stage by introducing the fundamental building blocks of backend infrastructure. This part is about understanding the core trade-offs and architectural decisions behind every storage engine, query language, and data format we choose.
This article distills the key lessons from Chapters 1 to 4:
- What reliability and scalability really mean in practice
- How different data models affect flexibility and performance
- The internal workings of storage engines like B-Trees and LSM-Trees
- How data encoding and schema evolution shape the future of your system
Whether you're an engineer designing your first database schema or an architect managing distributed pipelines, these foundations will give you the vocabulary, mental models, and examples to build better systems.
Let's dive in.
Chapter 1: Reliable, Scalable, and Maintainable Applications
Reliability
A reliable system continues to function correctly even when hardware or software fails. Reliability is about minimizing downtime and data loss, and ensuring the system can recover gracefully from unexpected events.
Key Strategies:
- Redundancy: Replicate data and services across nodes, racks, or regions to avoid single points of failure.
- Fault Isolation: Design so that a failure in one component does not cascade to others (e.g., circuit breakers, bulkheads).
- Automated Recovery: Use backups, snapshots, and automated failover to restore service quickly.
- Idempotence: Ensure operations can be safely retried without side effects.
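As a rough sketch of idempotent retries: the client attaches a unique idempotency key to each logical operation, and the server remembers which keys it has already processed. The key handling, in-memory store, and charge_once function below are illustrative, not from the book.
import uuid
processed = {}  # idempotency key -> result; a real service would use a durable store
def charge_once(idempotency_key, amount):
    # If this key was already handled, return the stored result instead of charging again
    if idempotency_key in processed:
        return processed[idempotency_key]
    result = {"charged": amount, "status": "ok"}  # stand-in for the real side effect
    processed[idempotency_key] = result
    return result
key = str(uuid.uuid4())    # generated once by the client for this logical operation
charge_once(key, 100)      # first attempt
charge_once(key, 100)      # safe retry: no double charge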
Redundancy and Failover
User
|
v
Load Balancer
| |
v v
App1 App2
\ /
v v
Database Cluster
| | |
v v v
Rep1 Rep2 Rep3
Real-World Example:
- Netflix replicates its services and data across multiple AWS regions to remain available during outages.
- Google Spanner uses multi-region replication and TrueTime for global consistency.
Scalability
Scalability is the ability of a system to handle increased load by adding resources. It's not just about handling more users, but also about maintaining performance as demand grows.
Types of Scaling:
- Vertical Scaling: Add more power (CPU, RAM) to a single machine.
- Horizontal Scaling: Add more machines to distribute the load (preferred for large systems).
Horizontal vs Vertical Scaling
Vertical Scaling:
[App Server]
|
[Database]
Horizontal Scaling:
[App1] [App2] [App3]
\ | /
[Shared Database]
Scalability Metrics:
- Throughput: Requests per second the system can handle.
- Latency: Time taken to process a request (use percentiles, e.g., 95th, 99th).
- Error Rate: Percentage of failed requests.
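As a rough illustration of percentile latency, the sketch below computes p50/p95/p99 from raw request timings using the nearest-rank method; a real system would rely on a metrics library rather than ad hoc code like this.
import math
def percentile(samples_ms, p):
    # Nearest-rank method: sort the samples and take the value at rank ceil(p/100 * n)
    ordered = sorted(samples_ms)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]
latencies = [12, 15, 11, 230, 14, 18, 13, 950, 16, 17]  # request durations in ms
print(percentile(latencies, 50), percentile(latencies, 95), percentile(latencies, 99))
Note how the p95 and p99 values are dominated by the few slow outliers, which is exactly why averages hide tail latency.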
Scaling Approaches
| Approach | Pros | Cons |
|---|---|---|
| Vertical Scaling | Simple, no code changes | Hardware limits, downtime |
| Horizontal Scaling | No single point of failure | More complex, needs sharding |
Real-World Example:
- Instagram uses Redis and Memcached to cache frequent queries and reduce pressure on its databases.
- Amazon DynamoDB automatically shards data across many servers for massive scale.
Maintainability
Maintainability is about how easily a system can be changed, debugged, and extended. Systems must evolve as requirements change, so maintainability is crucial for long-term success.
Best Practices:
- Modular Design: Break systems into independent, well-defined components.
- Observability: Use logs, metrics, and traces to understand system behavior.
- Automation: Automate testing, deployment, and recovery to reduce human error.
- Documentation: Keep code and system documentation up to date.
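As one concrete example of observability, a service can expose counters and latency histograms for Prometheus to scrape. This is a minimal sketch assuming the prometheus_client Python package is installed; the metric names and the fake request handler are illustrative.
import time
from prometheus_client import Counter, Histogram, start_http_server
REQUESTS = Counter("http_requests_total", "Total HTTP requests handled")
LATENCY = Histogram("http_request_latency_seconds", "Request latency in seconds")
def handle_request():
    REQUESTS.inc()                # count every request
    with LATENCY.time():          # record how long the handler took
        time.sleep(0.05)          # stand-in for real work
if __name__ == "__main__":
    start_http_server(8000)       # metrics served at http://localhost:8000/metrics
    while True:                   # simulate traffic so the metrics move
        handle_request()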
Observability Stack
[Application]
| | |
| | +---> [Tracing]
| +-------> [Metrics]
+-----------> [Logging]
| | |
+---+---+
|
[Monitoring Dashboard]
Maintainability Practices
| Practice | Benefit | Example Tool |
|---|---|---|
| Modular Design | Easier to change/test | Microservices, SOA |
| Observability | Faster debugging | Prometheus, Grafana |
| Automation | Fewer manual errors | CI/CD, Ansible |
| Documentation | Onboarding, knowledge sharing | Wikis, READMEs |
Real-World Example:
- Google SRE teams invest in automation and observability to reduce toil and improve system clarity.
- Facebook uses automated canary releases and rollbacks to maintain reliability during deployments.
Chapter 1 Key Takeaways
This chapter sets the stage for understanding modern data systems by introducing three essential attributes:
- Reliability: The system should continue to function correctly even when hardware fails, software has bugs, or human operators make mistakes. Techniques like replication, fault tolerance, and idempotence help mitigate risks.
- Scalability: A system should handle growth in data volume, traffic, or complexity without suffering performance degradation. This requires clear definitions of load and performance, plus design strategies like partitioning and caching.
- Maintainability: Systems evolve, and maintainable systems make this evolution easy. Characteristics of maintainability include good observability, clear abstractions, and the ability to change behavior with minimal risk or downtime.
Together, these principles create a framework for evaluating and designing robust, efficient, and adaptable systems.
Chapter 2: Data Models and Query Languages
Relational Model Versus Document Model
Relational databases store data in structured tables with fixed schemas, while document databases (like MongoDB) use nested, schema-less JSON documents.
Relational Example:
-- Orders and Customers in SQL
SELECT orders.id, customers.name
FROM orders
JOIN customers ON orders.customer_id = customers.id;
Document Example:
{
"order_id": 1001,
"customer": {
"id": "cust123",
"name": "Alice"
},
"items": [
{"product": "Widget", "qty": 2},
{"product": "Gadget", "qty": 1}
]
}
Relational vs Document Model
Relational:
+-------------+ +---------------+
| Orders | | Customers |
|-------------| |---------------|
| id |----->| id |
| customer_id | | name |
| ... | | ... |
+-------------+ +---------------+
Document:
+----------------------+
| Order Document |
|----------------------|
| order_id |
| customer: |
| id |
| name |
| items: [ |
| {product, qty}, |
| ... |
| ] |
+----------------------+
Relational vs Document
| Feature | Relational DBs | Document DBs |
|---|---|---|
| Schema | Fixed, normalized | Dynamic, flexible |
| Queries | Declarative (SQL) | Embedded (MongoQL) |
| Relationships | JOINs | Embedded or referenced |
| Scalability | Manual sharding | Built-in sharding |
| Use Case | OLTP, analytics | Content, catalogs |
Real-World Example:
- MongoDB is used for content management systems and catalogs where flexible schemas are needed.
- PostgreSQL and MySQL are used for financial and transactional systems requiring strong consistency.
The Birth of NoSQL
NoSQL databases emerged to meet the needs of large-scale applications where traditional relational databases struggled:
- Amazon Dynamo: key-value store with eventual consistency
- Google Bigtable: wide-column store
- MongoDB: document store with dynamic schemas
NoSQL systems prioritized horizontal scalability, high availability, and flexible schemas.
The Object-Relational Mismatch
Relational tables are flat, but application code often uses deeply nested objects. This mismatch leads to inefficiencies:
- ORMs try to bridge the gap but can be leaky abstractions
- Document databases naturally align with object-oriented code
Many-to-One and Many-to-Many Relationships
Relational databases use normalized tables and JOINs:
SELECT * FROM books
JOIN book_authors ON books.id = book_authors.book_id
JOIN authors ON book_authors.author_id = authors.id;
In document databases, you can embed data:
{
"title": "DDIA",
"authors": [
{"id": 1, "name": "Kleppmann"},
{"id": 2, "name": "Other"}
]
}
Use embedding for static data; reference for large or frequently updated sub-objects.
Are Document Databases Repeating History?
Hierarchical databases from the 1960s (like IMS) stored data in tree structures, which made querying rigid and brittle. Document databases risk similar pitfalls:
- Efficient for fixed access patterns
- Less composable and general-purpose than relational models
The relational model won due to its data independence and declarative power.
Relational Versus Document Databases Today
Relational and document databases each have strengths and trade-offs, and neither is a universal solution. Relational databases are built around normalization, which avoids redundancy and supports flexible, ad hoc queries using JOINs. This makes them ideal for applications where relationships between entities are complex or frequently queried in different ways (e.g., financial systems, analytics).
Document databases, on the other hand, store data as nested documents (often JSON), making them a natural fit for aggregate-oriented data, where you typically fetch or update an entire object at once (e.g., a blog post with comments, a user profile with settings). Embedding related data inside a document can improve read performance and reduce the need for joins, but can also lead to duplication and update anomalies if the same data appears in many places.
Embedding vs Referencing:
- Embed data when it is accessed together and does not change independently (e.g., order items inside an order).
- Reference data when it is large, shared, or updated independently (e.g., user profiles referenced from orders).
Limitations:
- Document databases are less suited for queries that need to join across many collections or require flexible, ad hoc reporting. They work best when access patterns are well known and data can be grouped into aggregates.
- Relational databases can be harder to scale horizontally, but their query flexibility and data integrity features remain unmatched for many use cases.
Modern Practice: Many organizations use both models: relational databases for core transactional data, and document or NoSQL stores for content, logs, or user-generated data. The choice depends on your application's data access patterns, consistency needs, and scalability requirements.
Query Languages for Data
Query languages allow users to retrieve and manipulate data. The most common is SQL, but others exist for different models.
Declarative Queries on the Web
Declarative query languages let you specify what data you want, not how to get it. The database or engine figures out the best way to execute your request. This is in contrast to imperative code, where you specify the exact steps.
SQL (Relational):
SELECT * FROM users WHERE age > 30;
MongoDB (Document):
db.users.find({ age: { $gt: 30 } })
GraphQL (APIs):
query {
user(id: "1") {
name
posts { title }
}
}
SPARQL (RDF):
SELECT ?book ?author WHERE {
?book <http://purl.org/dc/elements/1.1/creator> ?author
}
Declarative queries are powerful because they allow the underlying system to optimize execution, use indexes, and parallelize work. This abstraction is a key reason for the success of SQL and similar languages.
MapReduce Querying
MapReduce is a programming model for processing large datasets in parallel across many machines. It was popularized by Google and implemented in open source by Hadoop.
- Map step: Each input record is processed to produce key-value pairs.
- Reduce step: All values for the same key are combined.
Example: Word Count
# Map step: emit (word, 1) for each word in a document
for word in document.split():
emit(word, 1)
# Reduce step: sum all counts for each word
for word, counts in grouped_by_word:
emit(word, sum(counts))
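For comparison, here is a runnable single-process version of the same word count in plain Python; a real MapReduce framework would shuffle the intermediate key-value pairs across machines between the map and reduce steps.
from collections import defaultdict
documents = ["the quick brown fox", "the lazy dog", "the fox"]
# Map: emit (word, 1) pairs for every word in every document
pairs = [(word, 1) for doc in documents for word in doc.split()]
# Shuffle: group all values by key
grouped = defaultdict(list)
for word, count in pairs:
    grouped[word].append(count)
# Reduce: sum the counts for each word
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, ...}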
MapReduce is less expressive than SQL, but it scales to massive datasets and is fault-tolerant. Many modern data platforms (like Spark) build on these ideas, but offer more flexible APIs.
Query Languages
| Model | Language | Example Syntax | Use Case |
|---|---|---|---|
| Relational | SQL | SELECT β¦ FROM β¦ | OLTP, analytics |
| Document | MongoQL | db.coll.find({β¦}) | Content, catalogs |
| Graph | Cypher | MATCH (a)-[]->(b) | Social, recommendations |
| RDF | SPARQL | SELECT β¦ WHERE β¦ | Semantic web, research |
Graph-Like Data Models
Graph databases represent entities as nodes and relationships as edges. This model is ideal for use cases like social networks, recommendation engines, and fraud detection, where relationships are first-class citizens and traversal is frequent.
Property Graphs
A property graph is a graph model in which both nodes and edges can carry arbitrary key-value properties; it is used in Neo4j and similar systems. A property graph consists of:
- Nodes: entities (e.g., users, products)
- Edges: relationships (e.g., follows, purchased)
- Properties: key-value pairs on both nodes and edges
(Alice)---FRIEND--->(Bob)---FRIEND--->(Carol)
|
+---PURCHASED--->(Widget)
This diagram shows a property graph: Alice is friends with Bob, who is friends with Carol. Alice also purchased a Widget. Nodes represent entities, and edges represent relationships. This structure is ideal for social networks or recommendations.
Property graphs are used when you need to model and query complex, interconnected data with rich relationships.
The Cypher Query Language
Cypher is a declarative graph query language used in Neo4j. It allows expressive pattern matching and traversal.
Example: Find Alice's friends
MATCH (a:User)-[:FRIEND]->(b:User)
WHERE a.name = 'Alice'
RETURN b.name;
This Cypher query finds all users who are friends with Alice. Cypher's pattern-matching syntax makes it easy to express traversals in a graph database.
Cypher is ideal for queries that follow paths or patterns in a graph, such as recommendations or social connections.
Graph Queries in SQL
SQL is not naturally suited for recursive or deep relationship traversals, but recursive Common Table Expressions (CTEs) can simulate it.
Example: Find all friends-of-friends
WITH RECURSIVE friends(id, depth) AS (
  SELECT id, 0 FROM users WHERE id = 1
  UNION
  SELECT u.id, fr.depth + 1
  FROM friends fr
  JOIN friendships f ON f.user_id = fr.id
  JOIN users u ON u.id = f.friend_id
  WHERE fr.depth < 2   -- stop after friends-of-friends to avoid unbounded recursion
)
SELECT * FROM friends;
This SQL example uses a recursive CTE to find all friends-of-friends starting from user 1. While possible, such queries are more verbose and less efficient than in a native graph database.
This approach works for limited depth traversal but lacks graph-native optimizations.
Triple-Stores and SPARQL
Triple stores (e.g., Apache Jena, Virtuoso) are used in semantic web contexts. Data is stored as RDF triples: subject, predicate, object.
RDF (Resource Description Framework): A standard for representing data as triples (subject, predicate, object), widely used in the semantic web.
Example RDF triple:
<Bob> <knows> <Alice>
This triple expresses that Bob knows Alice, using the subject-predicate-object format common in semantic web and knowledge graph systems.
SPARQL is a query language for RDF data, designed to retrieve and manipulate data stored in Resource Description Framework format.
SPARQL query:
SELECT ?friend WHERE {
<Bob> <knows> ?friend
}
This SPARQL query retrieves all friends of Bob from an RDF triple store.
Triple-stores and SPARQL are used for knowledge graphs, linked data, and applications that require flexible, schema-less relationships.
The Foundation: Datalog
Datalog is a declarative logic programming language, similar to Prolog, that is especially good for expressing recursive relationships in data.
Example: Ancestor relationship
ancestor(X, Y) :- parent(X, Y).
ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).
This Datalog program defines the ancestor relationship recursively: X is an ancestor of Y if X is a parent of Y, or if X is a parent of Z and Z is an ancestor of Y. Datalog is valued for its clarity and ability to express recursive relationships concisely.
Modern graph engines and some databases use Datalog-like languages for advanced queries.
Chapter 2 Key Takeaways
- Relational, document, graph, and RDF models serve different needs
- Choose based on data structure, query flexibility, and performance
- No model is superior; each involves trade-offs in modeling and scaling
| Feature | Relational (SQL) | Document (NoSQL) | Graph (Property/RDF) |
|---|---|---|---|
| Schema | Fixed, normalized | Flexible, schema-less | Flexible, typed labels |
| Query Language | SQL | MongoQL, etc. | Cypher, SPARQL, Datalog |
| Relationship Modeling | JOINs | Embedding/Reference | Native edge traversal |
| Best Use Case | OLTP, finance | CMS, catalogs, APIs | Social networks, graphs |
| Scaling | Vertical/horizontal | Horizontal | Horizontal |
| Complexity | Medium | Low | High |
| Flexibility | Medium | High | Very High |
Understanding data models deeply helps you choose the right tool and scale confidently.
Chapter 3: Storage and Retrieval
How do databases store data on disk and retrieve it when queried?
This chapter dives into the internals of storage engines: the part of a database that handles storing data to disk and reading it back. Understanding these fundamentals helps you choose the right database for your workload and debug performance issues.
Data Structures That Power Your Database
At the simplest level, a database could be just two bash functions:
#!/bin/bash
db_set () {
echo "$1,$2" >> database
}
db_get () {
grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}
This append-only log has excellent write performance (O(1)) but terrible read performance (O(n)). Real databases use sophisticated data structures to balance these trade-offs.
Hash Indexes
Hash indexes use in-memory hash maps to store the byte offsets where values are stored on disk. This is the approach used by Bitcask (Riak's default storage engine).
How Hash Indexes Work:
In-Memory Hash Map:
+-----------+-----------+
| Key       | Offset    |
+-----------+-----------+
| "user1"   | 0         |
| "user2"   | 64        |
| "user3"   | 128       |
+-----------+-----------+
Disk (Append-Only Log):
+----------------------------------------+
| 0:   user1,{id:1,name:"Alice"}         |
| 64:  user2,{id:2,name:"Bob"}           |
| 128: user3,{id:3,name:"Charlie"}       |
+----------------------------------------+
Implementation Details:
Write Process:
- Append new record to end of log file
- Update hash map with new offset
- If updating existing key, just update offset (old data becomes dead)
Read Process:
- Look up key in hash map (O(1))
- Seek to offset on disk
- Read value
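A toy version of this write/read path in Python, loosely in the spirit of Bitcask: an in-memory dict maps each key to the byte offset of its latest record in an append-only file. The file name and record format are illustrative.
import os
index = {}   # key -> byte offset of the latest record for that key
def db_set(key, value, path="database.log"):
    record = f"{key},{value}\n".encode("utf-8")
    with open(path, "ab") as f:
        offset = f.seek(0, os.SEEK_END)   # where this record will start
        f.write(record)                   # append-only write
    index[key] = offset                   # point the key at its newest record
def db_get(key, path="database.log"):
    if key not in index:
        return None
    with open(path, "rb") as f:
        f.seek(index[key])                                  # one seek, straight to the record
        line = f.readline().decode("utf-8").rstrip("\n")
        return line.partition(",")[2]                       # drop the "key," prefix
db_set("user1", '{"id": 1, "name": "Alice"}')
db_set("user1", '{"id": 1, "name": "Alice Smith"}')   # old record becomes dead data
print(db_get("user1"))                                # {"id": 1, "name": "Alice Smith"}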
Compaction: Periodically merge log segments, keeping only the latest value for each key:
Before Compaction:
Segment 1: user1,v1 | user2,v1 | user1,v2
Segment 2: user3,v1 | user1,v3 | user2,v2
After Compaction:
Merged: user1,v3 | user2,v2 | user3,v1
Hash Index Trade-offs:
| Pros | Cons |
|---|---|
| ✅ Fast writes (append-only) | ❌ Hash map must fit in memory |
| ✅ Fast reads (one disk seek) | ❌ No range queries |
| ✅ Simple crash recovery | ❌ Poor memory locality |
| ✅ Space reclaimed via compaction | ❌ Hash collisions can degrade performance |
Real-world usage: Riak (Bitcask), some Redis persistence modes
SSTables and LSM-Trees
Sorted String Tables (SSTables) improve upon hash indexes by keeping keys sorted within each segment. This enables range queries and more efficient merging.
SSTable Structure:
SSTable File:
+------------------------------------------+
| Header: min_key=alice, max_key=zoe       |
+------------------------------------------+
| Index: alice→0, bob→32, charlie→64       |
+------------------------------------------+
| Data:                                    |
|   alice,{age:25,city:"NYC"}              |
|   bob,{age:30,city:"SF"}                 |
|   charlie,{age:28,city:"LA"}             |
+------------------------------------------+
Log-Structured Merge Trees (LSM-Trees):
LSM-Trees use SSTables as building blocks in a multi-level architecture:
LSM-Tree Architecture:
                  Writes
                    |
                    v
+------------------------------------------+
| MemTable (Red-Black Tree in memory)      |  <- fast writes
+------------------------------------------+
                    | flush (when full)
                    v
+------------------------------------------+
| Level 0 SSTables (on disk)               |  <- may overlap
+------------------------------------------+
                    | compact
                    v
+------------------------------------------+
| Level 1 SSTables (non-overlapping)       |  <- sorted runs
+------------------------------------------+
                    | compact
                    v
+------------------------------------------+
| Level 2+ SSTables (larger, sorted)       |  <- long-term storage
+------------------------------------------+
LSM-Tree Operations:
Write Path:
- Write to Write-Ahead Log (WAL) for durability
- Insert into MemTable (in-memory sorted structure)
- When MemTable fills, flush to Level 0 as SSTable
- Background compaction merges SSTables between levels
Read Path:
- Check MemTable first
- Check Level 0 SSTables (newest to oldest)
- Check higher levels using sparse index
- Use Bloom filters to avoid unnecessary disk reads
Compaction Strategy:
- Size-tiered: Merge SSTables of similar size
- Leveled: Maintain non-overlapping ranges per level (used by LevelDB/RocksDB)
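A heavily simplified sketch of these write and read paths: a small in-memory memtable is flushed to immutable sorted segments, and reads check the memtable first, then segments from newest to oldest. It omits the WAL, Bloom filters, and real compaction.
import bisect
MEMTABLE_LIMIT = 4
memtable = {}      # in-memory writes (a dict here; a real engine keeps this sorted)
segments = []      # flushed segments, newest last; each is a sorted list of (key, value)
def put(key, value):
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        segments.append(sorted(memtable.items()))   # flush as an immutable sorted run (SSTable)
        memtable.clear()
def get(key):
    if key in memtable:                              # 1. check the memtable first
        return memtable[key]
    for segment in reversed(segments):               # 2. then segments, newest to oldest
        keys = [k for k, _ in segment]
        i = bisect.bisect_left(keys, key)            # binary search within the sorted run
        if i < len(segment) and segment[i][0] == key:
            return segment[i][1]
    return None
for k, v in [("a", 1), ("b", 2), ("c", 3), ("d", 4), ("a", 10)]:
    put(k, v)
print(get("a"), get("c"))   # 10 3  (newest value wins, older value found in a segment)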
LSM-Tree Trade-offs:
| Advantages | Disadvantages |
|---|---|
| ✅ High write throughput | ❌ Read amplification (check multiple SSTables) |
| ✅ Good compression ratios | ❌ Write amplification (compaction overhead) |
| ✅ Range query support | ❌ Compaction can impact performance |
| ✅ Handles high write volumes | ❌ More complex than B-Trees |
Real-world usage: Cassandra, HBase, LevelDB, RocksDB, Elasticsearch
B-Trees
B-Trees are the most widely used indexing structure, found in almost all relational databases and many non-relational ones.
B-Tree Structure:
B-Trees maintain balance by ensuring all leaf nodes are at the same depth:
B-Tree Example (branching factor = 3):
              [ J ]                               <- Root (level 2)
             /     \
      [ C F ]       [ P S ]                       <- Internal nodes (level 1)
     /   |   \     /   |   \
[A,B] [D,E] [G,H,I] [K,L,M] [Q,R] [T,U,V]         <- Leaf nodes (level 0)
Key Properties:
- Each node has n keys and n+1 child pointers
- Keys within a node are sorted
- All leaves are at the same level
- Each node is typically one disk page (4KB-64KB)
B-Tree Operations:
Search Algorithm:
def search(node, key):
    # Leaf node: the key is either stored here or not in the tree at all
    if node.is_leaf():
        return node.find(key)
    # Internal node: find the first separator key greater than the search key
    i = 0
    while i < len(node.keys) and key > node.keys[i]:
        i += 1
    # Descend into the corresponding child page
    return search(node.children[i], key)
Insert with Node Splitting:
Before Insert (node full):
[A, C, E, G]
Insert D:
Split into: [A, C] and [E, G]
Promote: D to parent
Result:
[D]
/ \
[A, C] [E, G]
Range Query:
SELECT * FROM users WHERE age BETWEEN 25 AND 35;
- Navigate to leaf containing age=25
- Scan leaves left-to-right until age>35
- Follow leaf-to-leaf pointers (siblings)
B-Tree Optimizations:
Copy-on-Write (MVCC): Instead of updating pages in-place, create new page versions:
Version 1: [A, B, C]
     | update B
     v
Version 2: [A, B', C]  (new page)
Write-Ahead Logging: Record changes before applying them:
WAL: "Update page 42: B β B'"
Then: Apply change to page 42
Comparing B-Trees and LSM-Trees:
This is one of the most important trade-offs in database design:
Performance Characteristics:
| Workload Type | B-Trees | LSM-Trees | Winner |
|---|---|---|---|
| Read-heavy | Fast (1-2 seeks) | Slower (multiple SSTables) | B-Trees |
| Write-heavy | Slower (random I/O) | Fast (sequential writes) | LSM-Trees |
| Range scans | Excellent | Good | B-Trees |
| Point lookups | Very good | Good (with Bloom filters) | B-Trees |
Resource Usage:
| Resource | B-Trees | LSM-Trees |
|---|---|---|
| Write amplification | 2x (WAL + page) | 5-10x (compaction) |
| Read amplification | 1x | 5-10x (multiple levels) |
| Space amplification | Low | Medium (fragmentation during compaction) |
| CPU usage | Lower | Higher (compaction) |
Operational Considerations:
B-Trees:
- More predictable performance
- Simpler to reason about
- Better for OLTP workloads
- Used by: PostgreSQL, MySQL, SQL Server, Oracle
LSM-Trees:
- Better write throughput
- More complex tuning (compaction strategies)
- Better for analytics/logging workloads
- Used by: Cassandra, HBase, ClickHouse, Elasticsearch
Other Indexing Structures:
R-Trees (Spatial Indexes): For geographic and multi-dimensional data:
Geographic Bounding Boxes:
+-----------------------------+
| A                           |
|   +-----+      +-----+      |
|   |  B  |      |  C  |      |
|   +-----+      +-----+      |
|        +-----+              |
|        |  D  |              |
|        +-----+              |
+-----------------------------+
Use cases: PostGIS, MongoDB geospatial queries, game engines
Full-text Search Indexes: Inverted indexes map terms to documents:
Term → Document IDs
"database" → [1, 3, 7, 12]
"storage" → [1, 5, 8]
"engine" → [3, 7, 9, 15]
Use cases: Elasticsearch, Solr, database FTS features
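Building such an inverted index is straightforward in principle; here is a tiny sketch with made-up documents:
from collections import defaultdict
docs = {
    1: "database storage engine",
    3: "database engine internals",
    5: "columnar storage",
}
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():   # real systems also tokenize, stem, and rank terms
        inverted[term].add(doc_id)
print(sorted(inverted["database"]))     # [1, 3]
print(sorted(inverted["storage"]))      # [1, 5]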
Memory-only Structures: Skip Lists: Probabilistic alternative to B-Trees
Level 3: 1 ------------------------------> 9
Level 2: 1 -------------> 6 -------------> 9
Level 1: 1 ----> 4 -----> 6 ----> 8 -----> 9
Level 0: 1 -> 2 -> 4 -> 6 -> 7 -> 8 -> 9
Use cases: Redis sorted sets, LevelDB MemTable
Transaction Processing or Analytics?
Modern applications typically need both Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) capabilities.
OLTP Characteristics
-- Typical OLTP queries (simple, fast)
INSERT INTO orders (customer_id, product_id, quantity)
VALUES (12345, 'widget-A', 2);
UPDATE inventory SET quantity = quantity - 2
WHERE product_id = 'widget-A';
SELECT * FROM customers WHERE customer_id = 12345;
Properties:
- Small number of records per query
- Random access patterns
- Interactive response times (< 100ms)
- High concurrency
OLAP Characteristics
-- Typical OLAP query (complex, analytical)
SELECT
region,
product_category,
SUM(revenue) as total_revenue,
AVG(customer_satisfaction) as avg_satisfaction
FROM sales_fact sf
JOIN dim_product dp ON sf.product_id = dp.product_id
JOIN dim_customer dc ON sf.customer_id = dc.customer_id
JOIN dim_time dt ON sf.date_id = dt.date_id
WHERE dt.year = 2024 AND dt.quarter = 'Q1'
GROUP BY region, product_category
ORDER BY total_revenue DESC;
Properties:
- Scan millions of records
- Sequential access patterns
- Batch processing (seconds to hours)
- Read-heavy workload
OLTP vs OLAP Comparison
| Aspect | OLTP | OLAP |
|---|---|---|
| Main pattern | Read/write individual records | Aggregate over large datasets |
| Query complexity | Simple | Complex with JOINs |
| Dataset size | GB to TB | TB to PB |
| Bottleneck | Random I/O, lock contention | Bandwidth, CPU |
| Data freshness | Real-time | Minutes to hours delay |
| Examples | E-commerce checkout, banking | Business intelligence, reporting |
Data Warehousing
Since OLTP and OLAP have different requirements, many organizations use separate systems:
OLTP Systems  →  ETL Pipeline  →  Data Warehouse (OLAP)
     |                |                    |
PostgreSQL       Apache Airflow       Snowflake
MySQL            dbt                  BigQuery
MongoDB          Spark                Redshift
ETL vs ELT:
Extract-Transform-Load (ETL):
Source DB → Transform (clean, aggregate) → Load into warehouse
Extract-Load-Transform (ELT):
Source DB → Load raw data → Transform in warehouse
Modern cloud warehouses favor ELT because they can handle transformations at scale.
Stars and Snowflakes: Schemas for Analytics
Fact and Dimension Tables:
- Fact table: Contains measurements/metrics (sales amount, quantity)
- Dimension tables: Contain descriptive attributes (customer info, product details)
Star Schema:
                +--------------+
                | dim_product  |
                +------+-------+
                       |
+--------------+       |       +--------------+
| dim_customer |-------+-------|   dim_time   |
+--------------+       |       +--------------+
                       |
                +------+-------+
                |  sales_fact  |   <- central fact table
                +------+-------+
                       |
+--------------+       |       +---------------+
|  dim_store   |-------+-------| dim_promotion |
+--------------+               +---------------+
Snowflake Schema: Like star schema but dimension tables are normalized into sub-dimensions:
dim_product → dim_category → dim_department
Trade-offs:
- Star: Simpler queries, some redundancy
- Snowflake: Normalized, more complex JOINs
Column-Oriented Storage
Row-oriented storage stores entire records together, while column-oriented storage groups values from the same column.
Row-oriented (traditional):
Disk blocks:
Block 1: [Row1: ID=1,Name="Alice",Age=25,City="NYC"]
[Row2: ID=2,Name="Bob",Age=30,City="SF"]
Block 2: [Row3: ID=3,Name="Carol",Age=28,City="LA"]
[Row4: ID=4,Name="Dave",Age=35,City="NYC"]
Column-oriented:
Disk blocks:
ID column: [1, 2, 3, 4, 5, 6, ...]
Name column: ["Alice", "Bob", "Carol", "Dave", ...]
Age column: [25, 30, 28, 35, 42, 29, ...]
City column: ["NYC", "SF", "LA", "NYC", "SF", ...]
Column Compression
Similar values compress extremely well:
Before compression:
City: ["NYC", "SF", "LA", "NYC", "SF", "NYC", "LA", "SF"]
Run-length encoding (most effective when the column is sorted so equal values sit next to each other):
Sorted city column: ["LA", "LA", "NYC", "NYC", "NYC", "SF", "SF", "SF"] → LA×2, NYC×3, SF×3
Dictionary encoding:
Dictionary: {0: "NYC", 1: "SF", 2: "LA"}
Values: [0, 1, 2, 0, 1, 0, 2, 1]
Bitmap encoding (one bitmap per distinct value, over the original order):
"NYC": 10010100 (positions where NYC appears)
"SF":  01001001
"LA":  00100010
Sort Order in Column Storage
Unlike row stores, column stores can use different sort orders for different queries:
Sorted by (date, product_id):
Date: [2024-01-01, 2024-01-01, 2024-01-02, ...]
Product: [widget-A, widget-B, widget-A, ...]
Revenue: [100, 150, 120, ...]
Alternative sort by (product_id, date):
Product: [widget-A, widget-A, widget-B, ...]
Date: [2024-01-01, 2024-01-02, 2024-01-01, ...]
Revenue: [100, 120, 150, ...]
Benefits of sorted order:
- Enables efficient range queries
- Improves compression (similar values grouped together)
- Allows for early termination in aggregations
Challenge: Different queries benefit from different sort orders.
Solutions:
- Store multiple sorted copies
- Use techniques like late materialization
- Maintain separate indexes for different access patterns
Writing to Column-Oriented Storage
Column stores optimize for reads, making writes more challenging:
LSM-Tree approach (used by ClickHouse):
- Buffer writes in row-oriented format
- Periodically convert to columnar format
- Merge with existing column files
Batch loading:
ETL Pipeline → Staging (row format) → Batch convert → Column store
Most column stores are optimized for append-only workloads rather than transactional updates.
Aggregation: Data Cubes and Materialized Views
Materialized View: Pre-computed query results stored as tables:
-- Create materialized view
CREATE MATERIALIZED VIEW monthly_sales AS
SELECT
DATE_TRUNC('month', order_date) as month,
product_category,
SUM(revenue) as total_revenue
FROM sales_fact sf
JOIN dim_product dp ON sf.product_id = dp.product_id
GROUP BY month, product_category;
Data Cube (OLAP Cube): Pre-aggregated data across multiple dimensions:
3D Cube: Time × Product × Region
- Total sales by month
- Total sales by product
- Total sales by region
- Total sales by month + product
- Total sales by month + region
- Total sales by product + region
- Total sales by month + product + region
Trade-offs:
- ✅ Lightning-fast queries for pre-computed aggregates
- ❌ Storage overhead (exponential with dimensions)
- ❌ Complex to maintain with data updates
Chapter 3 Key Takeaways
Storage Engine Trade-offs:
- Hash indexes: Fast writes and reads, but no range queries and limited by memory
- LSM-Trees: Excellent for write-heavy workloads, good compression
- B-Trees: Superior for read-heavy workloads, excellent range queries
OLTP vs OLAP:
- OLTP: Transaction processing, normalized schemas, real-time operations
- OLAP: Analytics, star/snowflake schemas, batch processing with complex aggregations
Column Storage Advantages:
- Dramatic compression improvements
- Vectorized processing capabilities
- Excellent for analytical queries
- Challenges with writes and different sort orders
The Right Tool for the Job:
- Choose storage engines based on your workload characteristics
- Consider read/write ratios, query patterns, and consistency requirements
- Modern systems often combine multiple storage technologies
Chapter 4: Encoding and Evolution
How do you handle data format changes when applications and databases are running in production?
This chapter addresses one of the most practical challenges in system design: evolving your data formats without breaking existing services. Whether you're adding new fields to a JSON API or migrating between database schemas, understanding encoding and evolution is crucial for building resilient systems.
Formats for Encoding Data
Language-Specific Formats
Many programming languages have built-in support for encoding in-memory objects:
// Java serialization
ByteArrayOutputStream bos = new ByteArrayOutputStream();
ObjectOutputStream out = new ObjectOutputStream(bos);
out.writeObject(myObject);
byte[] serialized = bos.toByteArray();
# Python pickle
import pickle
serialized = pickle.dumps(my_object)
my_object = pickle.loads(serialized)
Problems with language-specific formats:
- ❌ Tied to a particular programming language
- ❌ Poor performance and bloated encoding
- ❌ Security vulnerabilities (arbitrary code execution)
- ❌ No forward/backward compatibility guarantees
JSON, XML, and Binary Variants
JSON Example:
{
"userName": "Martin",
"favoriteNumber": 1337,
"interests": ["daydreaming", "hacking"]
}
XML Equivalent:
<Person>
<userName>Martin</userName>
<favoriteNumber>1337</favoriteNumber>
<interests>
<li>daydreaming</li>
<li>hacking</li>
</interests>
</Person>
JSON/XML Trade-offs:
| Aspect | JSON | XML | Binary Formats |
|---|---|---|---|
| Human readable | ✅ | ✅ | ❌ |
| Compact | ⚠️ Medium | ❌ Verbose | ✅ |
| Schema support | ❌ | ❌ | ✅ |
| Numbers precision | ❌ Ambiguous | ❌ Text-based | ✅ Typed |
| Unicode support | ✅ | ✅ | ✅ |
| Parsing speed | ⚠️ Medium | ❌ Slow | ✅ Fast |
JSON/XML Problems:
- Number ambiguity: No distinction between integers and floats
- Large size: Repeated field names, verbose syntax
- No schema: Can't enforce data structure
Binary Encoding:
MessagePack example (binary JSON):
JSON (81 bytes):
{"userName": "Martin", "favoriteNumber": 1337, "interests": ["daydreaming", "hacking"]}
MessagePack (66 bytes):
83 a8 userName a6 Martin ae favoriteNumber cd 05 39 a9 interests 92 ab daydreaming a7 hacking
Saves ~18% space by eliminating repeated field names and using binary encoding for numbers.
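You can observe the size difference directly with the msgpack Python package (assuming it is installed and reasonably recent; exact byte counts vary by library version):
import json
import msgpack   # pip install msgpack
record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}
as_json = json.dumps(record, separators=(",", ":")).encode("utf-8")
as_msgpack = msgpack.packb(record)
print(len(as_json), len(as_msgpack))          # the msgpack encoding is noticeably smaller
assert msgpack.unpackb(as_msgpack) == record  # and round-trips to the same structure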
Thrift and Protocol Buffers
Schema-driven binary encoding formats that generate code for multiple languages.
Protocol Buffers (Protobuf) Example:
Schema definition (.proto file):
syntax = "proto3";
message Person {
string user_name = 1;
int64 favorite_number = 2;
repeated string interests = 3;
}
Generated code usage:
# Python
person = Person()
person.user_name = "Martin"
person.favorite_number = 1337
person.interests.extend(["daydreaming", "hacking"])
# Serialize to binary
binary_data = person.SerializeToString()
# Deserialize
person2 = Person()
person2.ParseFromString(binary_data)
Binary Encoding Details:
Protobuf binary format:
Field 1 (user_name): tag=1, type=string
Field 2 (favorite_number): tag=2, type=varint
Field 3 (interests): tag=3, type=string (repeated)
Binary: 0a 06 4d 61 72 74 69 6e 10 b9 0a 1a 0b 64 61 79 64 72 65 61 6d 69 6e 67 1a 07 68 61 63 6b 69 6e 67
Key insight: Field names are not storedβonly field tags (numbers).
Thrift vs Protobuf Comparison:
| Feature | Thrift | Protocol Buffers |
|---|---|---|
| Binary formats | BinaryProtocol, CompactProtocol | Single format |
| Schema evolution | Good | Excellent |
| Language support | Fewer languages | Many languages |
| RPC framework | Built-in | gRPC (separate) |
| Optional/required | Explicit | All optional (proto3) |
| Default values | Specified in schema | Type defaults |
Avro
Avro takes a different approach: it embeds the schema with the data, enabling better schema evolution.
Avro Schema Example:
{
"type": "record",
"name": "Person",
"fields": [
{"name": "userName", "type": "string"},
{"name": "favoriteNumber", "type": ["null", "long"], "default": null},
{"name": "interests", "type": {"type": "array", "items": "string"}}
]
}
Writer vs Reader Schema:
Key innovation: Avro supports different schemas for writing and reading:
Writer Schema (v1):
- userName (string)
- favoriteNumber (long)
Reader Schema (v2):
- userName (string)
- favoriteNumber (long)
- email (string, default: null)
The Avro library resolves differences between schemas automatically.
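A sketch of that resolution using the fastavro package (assuming it is installed; the schemas mirror the example above, and email is the field added in v2):
import io
from fastavro import schemaless_writer, schemaless_reader
writer_schema = {
    "type": "record", "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": "long"},
    ],
}
reader_schema = {
    "type": "record", "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": "long"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}
buf = io.BytesIO()
schemaless_writer(buf, writer_schema, {"userName": "Martin", "favoriteNumber": 1337})
buf.seek(0)
# The reader supplies both schemas; the missing email field is filled with its default
print(schemaless_reader(buf, writer_schema, reader_schema))
# {'userName': 'Martin', 'favoriteNumber': 1337, 'email': None}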
Schema Evolution in Avro:
Backward compatibility (new code using the new schema can read data written with the old schema):
// Old schema
{"name": "favoriteNumber", "type": "long"}
// New schema (more permissive)
{"name": "favoriteNumber", "type": ["null", "long"], "default": null}
Forward compatibility (old code using the old schema can read data written with the new schema):
// New schema
{"name": "email", "type": "string", "default": ""}
// Old schema ignores unknown fields
Avro vs Protobuf/Thrift:
| Feature | Avro | Protobuf/Thrift |
|---|---|---|
| Schema storage | With data | Separately |
| Field identification | Field names | Tag numbers |
| Code generation | Optional | Required |
| Schema evolution | Very flexible | Good |
| Compactness | High | High |
| Dynamic typing | ✅ | ❌ |
The Merits of Schemas
Advantages of Schema-Based Formats:
- Compactness: Binary encoding with no field names
- Documentation: Schema serves as always up-to-date documentation
- Validation: Catch errors at write time, not read time
- Code generation: Generate statically typed data structures
- Evolution: Maintain compatibility across versions
Schema Evolution Patterns:
Safe changes:
- Add optional fields with defaults
- Remove optional fields
- Change field order (if using field names/tags)
Unsafe changes:
- Change field types (e.g., int32 → string)
- Rename fields (without aliases)
- Remove required fields
- Add required fields without defaults
Example evolution sequence:
v1: {id, name}
v2: {id, name, email?} # Add optional email
v3: {id, name, email?, age?} # Add optional age
v4: {id, name, email, age?} # Make email required (unsafe!)
Modes of Dataflow
1. Dataflow Through Databases
The problem: Applications change faster than databases.
Timeline:
T1: App v1 writes data using schema v1
T2: App v2 deployed, can read v1 data?
T3: App v2 writes data using schema v2
T4: App v1 still running, can read v2 data?
Solution strategies:
- Migrations: Transform all existing data to new schema
- Multi-version: Support reading multiple schema versions
- Additive changes: Only add optional fields
Example database evolution:
-- v1 schema
CREATE TABLE users (id INT, name VARCHAR(50));
-- v2 migration
ALTER TABLE users ADD COLUMN email VARCHAR(100) DEFAULT NULL;
-- v3 migration
ALTER TABLE users ADD COLUMN preferences JSON DEFAULT '{}';
2. Dataflow Through Services: REST and RPC
Service communication patterns:
REST (JSON over HTTP):
GET /api/users/123
Content-Type: application/json
{
"id": 123,
"name": "Alice",
"email": "alice@example.com"
}
RPC (binary protocol):
UserService.getUser(user_id=123) → User object
Web Services Comparison:
| Aspect | REST | SOAP | gRPC |
|---|---|---|---|
| Protocol | HTTP/JSON | HTTP/XML | HTTP/2 + Protobuf |
| Schema | Informal/OpenAPI | WSDL | .proto files |
| Code generation | Optional | Required | Generated stubs |
| Performance | Medium | Low | High |
| Debugging | Easy | Hard | Medium |
| Browser support | Native | Complex | Limited |
Remote Procedure Calls (RPC):
RPC tries to make network calls look like local function calls:
# Local call
user = get_user(123)
# RPC call (looks the same!)
user = user_service.get_user(123)
Problems with RPC:
- Network failures: Local calls don't fail randomly
- Latency: Network calls are ~1M times slower
- Serialization: Objects don't always serialize cleanly
- Versioning: Harder to evolve than local APIs
RPC frameworks:
- gRPC (Google): High performance, HTTP/2, streaming
- Thrift: Cross-language, used at Facebook/Meta
- Finagle (Twitter): Scala-based, circuit breakers
- JSON-RPC: Simple, HTTP-based
Data Encoding and Evolution for RPC:
Versioning strategies:
- URL versioning:
/api/v1/users/123
/api/v2/users/123
- Header versioning:
Accept: application/vnd.api+json;version=2
- Content negotiation:
Accept: application/vnd.user.v2+json
Backward/forward compatibility:
- Old clients must work with new servers
- New clients should work with old servers
- Use optional fields and sensible defaults
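One common way to achieve this is a "tolerant reader": keep only the fields you know, apply defaults for anything missing, and silently ignore the rest. A minimal sketch with illustrative field names:
KNOWN_FIELDS = {"id": None, "name": "", "email": None}   # field -> default value
def parse_user(payload: dict) -> dict:
    # Keep known fields, fill in defaults for missing ones, ignore everything else
    return {field: payload.get(field, default) for field, default in KNOWN_FIELDS.items()}
old_response = {"id": 123, "name": "Alice"}                      # from an old server
new_response = {"id": 123, "name": "Alice", "email": "a@x.com",
                "plan": "pro"}                                   # newer server adds fields
print(parse_user(old_response))   # missing email -> default None
print(parse_user(new_response))   # unknown 'plan' field is ignored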
3. Message-Passing Dataflow
Asynchronous message passing combines benefits of RPC and databases:
Producer → Message Broker → Consumer
    |                           |
(fire and forget)        (process later)
Message Brokers:
Popular message brokers:
- Apache Kafka: High-throughput, persistent logs
- RabbitMQ: Flexible routing, AMQP protocol
- Amazon SQS: Managed queue service
- Redis Pub/Sub: In-memory, simple
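The decoupling itself can be illustrated with an in-process queue standing in for a broker; this toy sketch has none of the durability or distribution that Kafka or RabbitMQ provide.
import queue
import threading
broker = queue.Queue()   # stand-in for a broker topic/queue
def producer():
    for i in range(3):
        broker.put({"event": "user_created", "user_id": i})   # fire and forget
    broker.put(None)                                          # sentinel: no more messages
def consumer():
    while (message := broker.get()) is not None:
        print("processed", message)                           # consumed at its own pace
threading.Thread(target=producer).start()
consumer()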
Message Broker Benefits:
| Benefit | Description |
|---|---|
| Reliability | Broker acts as buffer against outages |
| Decoupling | Producers/consumers evolve independently |
| Flexibility | One message → multiple consumers |
| Scalability | Horizontal scaling of consumers |
Actor Model:
Actors are computational entities that communicate via messages:
Actor A ---message---> Actor B
        <--response---
Actor properties:
- Encapsulated state (no shared memory)
- Asynchronous message passing
- Location transparency (local or remote)
Frameworks: Akka (JVM), Orleans (.NET), Erlang/Elixir
Message Encoding in Distributed Actors:
Evolution challenges:
- Rolling deployments mean mixed actor versions
- Message formats must be backward/forward compatible
- Consider using schemas (Protobuf, Avro) for actor messages
%% Erlang actor message evolution
%% v1 message
{user_created, UserId, Name}
%% v2 message (backward compatible)
{user_created, UserId, Name, Email}
Chapter 4 Key Takeaways
Encoding Format Selection:
- Language-specific formats: Avoid for cross-language compatibility
- JSON/XML: Human-readable but verbose and ambiguous for numbers
- Binary formats (Protobuf, Thrift, Avro): Compact, fast, schema-driven
Schema Evolution Strategies:
- Backward compatibility: New code can read old data
- Forward compatibility: Old code can read new data
- Best practices: Add optional fields, use defaults, avoid breaking changes
Dataflow Patterns:
- Database dataflow: Handle schema migrations carefully, support multiple versions
- Service dataflow: Use versioning strategies (URL, headers, content negotiation)
- Message-passing: Enables loose coupling, supports schema evolution through brokers
The Evolution Challenge: Schema evolution is not just a technical problem; it's an organizational one. Plan for change from day one, use schemas that support evolution, and design systems that can gracefully handle mixed versions during deployments.
Part I Summary
Key takeaways from Part I:
- Reliability, Scalability, Maintainability are the fundamental goals of good system design
- Data models (relational, document, graph) shape how you think about problems
- Storage engines (B-Trees vs LSM-Trees) have different performance characteristics
- Encoding formats and schema evolution enable systems to change over time
- Dataflow patterns (databases, services, messages) each have different consistency and performance properties
Technology selection framework:
- Understand your workload characteristics (read/write ratio, consistency needs)
- Choose data models that match your access patterns
- Select storage engines based on performance requirements
- Plan for schema evolution from day one
- Design dataflow to match your consistency and availability needs
Looking ahead to Part II: The foundations covered here prepare you for distributed systems challenges: replication, partitioning, transactions, and the difficulties of maintaining consistency across multiple machines.
"The future is already here; it's just not evenly distributed." Understanding these foundations helps you build systems that can adapt and scale as requirements change.