Introduction
Modern applications handle more data, serve more users, and must run across more machines than ever before. But what makes a data system truly reliable, scalable, and maintainable?
Designing Data-Intensive Applications by Martin Kleppmann offers a deep exploration of how data systems work under the hood. It is not just a book about databases; it is a blueprint for building systems that can grow, adapt, and perform under pressure.
Part I of the book, "Foundations of Data Systems," sets the stage by introducing the fundamental building blocks of backend infrastructure. This part is about understanding the core trade-offs and architectural decisions behind every storage engine, query language, and data format we choose.
This article distills the key lessons from Chapters 1 to 4:
- What reliability and scalability really mean in practice
- How different data models affect flexibility and performance
- The internal workings of storage engines like B-Trees and LSM-Trees
- How data encoding and schema evolution shape the future of your system
Whether you're an engineer designing your first database schema or an architect managing distributed pipelines, these foundations will give you the vocabulary, mental models, and examples to build better systems.
Let's dive in.
Chapter 1: Reliable, Scalable, and Maintainable Applications
Reliability
A reliable system continues to function correctly even when hardware or software fails. Reliability is about minimizing downtime and data loss, and ensuring the system can recover gracefully from unexpected events.
Key Strategies:
- Redundancy: Replicate data and services across nodes, racks, or regions to avoid single points of failure.
- Fault Isolation: Design so that a failure in one component does not cascade to others (e.g., circuit breakers, bulkheads).
- Automated Recovery: Use backups, snapshots, and automated failover to restore service quickly.
- Idempotence: Ensure operations can be safely retried without side effects.
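As a rough sketch of idempotent retries: the client attaches a unique idempotency key to each logical operation, and the server remembers which keys it has already processed. The key handling, in-memory store, and charge_once function below are illustrative, not from the book.
import uuid
processed = {}  # idempotency key -> result; a real service would use a durable store
def charge_once(idempotency_key, amount):
    # If this key was already handled, return the stored result instead of charging again
    if idempotency_key in processed:
        return processed[idempotency_key]
    result = {"charged": amount, "status": "ok"}  # stand-in for the real side effect
    processed[idempotency_key] = result
    return result
key = str(uuid.uuid4())    # generated once by the client for this logical operation
charge_once(key, 100)      # first attempt
charge_once(key, 100)      # safe retry: no double charge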
Redundancy and Failover
User
|
v
Load Balancer
| |
v v
App1 App2
\ /
v v
Database Cluster
| | |
v v v
Rep1 Rep2 Rep3
Real-World Example:
- Netflix replicates its services and data across multiple AWS regions to remain available during outages.
- Google Spanner uses multi-region replication and TrueTime for global consistency.
Scalability
Scalability is the ability of a system to handle increased load by adding resources. It's not just about handling more users, but also about maintaining performance as demand grows.
Types of Scaling:
- Vertical Scaling: Add more power (CPU, RAM) to a single machine.
- Horizontal Scaling: Add more machines to distribute the load (preferred for large systems).
Horizontal vs Vertical Scaling
Vertical Scaling:
[App Server]
|
[Database]
Horizontal Scaling:
[App1] [App2] [App3]
\ | /
[Shared Database]
Scalability Metrics:
- Throughput: Requests per second the system can handle.
- Latency: Time taken to process a request (use percentiles, e.g., 95th, 99th).
- Error Rate: Percentage of failed requests.
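As a rough illustration of percentile latency, the sketch below computes p50/p95/p99 from raw request timings using the nearest-rank method; a real system would rely on a metrics library rather than ad hoc code like this.
import math
def percentile(samples_ms, p):
    # Nearest-rank method: sort the samples and take the value at rank ceil(p/100 * n)
    ordered = sorted(samples_ms)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]
latencies = [12, 15, 11, 230, 14, 18, 13, 950, 16, 17]  # request durations in ms
print(percentile(latencies, 50), percentile(latencies, 95), percentile(latencies, 99))
Note how the p95 and p99 values are dominated by the few slow outliers, which is exactly why averages hide tail latency.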
Scaling Approaches
| Approach | Pros | Cons |
|---|---|---|
| Vertical Scaling | Simple, no code changes | Hardware limits, downtime |
| Horizontal Scaling | No single point of failure | More complex, needs sharding |
Real-World Example:
- Instagram uses Redis and Memcached to cache frequent queries and reduce pressure on its databases.
- Amazon DynamoDB automatically shards data across many servers for massive scale.
Maintainability
Maintainability is about how easily a system can be changed, debugged, and extended. Systems must evolve as requirements change, so maintainability is crucial for long-term success.
Best Practices:
- Modular Design: Break systems into independent, well-defined components.
- Observability: Use logs, metrics, and traces to understand system behavior.
- Automation: Automate testing, deployment, and recovery to reduce human error.
- Documentation: Keep code and system documentation up to date.
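As one concrete example of observability, a service can expose counters and latency histograms for Prometheus to scrape. This is a minimal sketch assuming the prometheus_client Python package is installed; the metric names and the fake request handler are illustrative.
import time
from prometheus_client import Counter, Histogram, start_http_server
REQUESTS = Counter("http_requests_total", "Total HTTP requests handled")
LATENCY = Histogram("http_request_latency_seconds", "Request latency in seconds")
def handle_request():
    REQUESTS.inc()                # count every request
    with LATENCY.time():          # record how long the handler took
        time.sleep(0.05)          # stand-in for real work
if __name__ == "__main__":
    start_http_server(8000)       # metrics served at http://localhost:8000/metrics
    while True:                   # simulate traffic so the metrics move
        handle_request()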
Observability Stack
[Application]
| | |
| | +---> [Tracing]
| +-------> [Metrics]
+-----------> [Logging]
| | |
+---+---+
|
[Monitoring Dashboard]
Maintainability Practices
| Practice | Benefit | Example Tool |
|---|---|---|
| Modular Design | Easier to change/test | Microservices, SOA |
| Observability | Faster debugging | Prometheus, Grafana |
| Automation | Fewer manual errors | CI/CD, Ansible |
| Documentation | Onboarding, knowledge sharing | Wikis, READMEs |
Real-World Example:
- Google SRE teams invest in automation and observability to reduce toil and improve system clarity.
- Facebook uses automated canary releases and rollbacks to maintain reliability during deployments.
Chapter 1 Key Takeaways
This chapter sets the stage for understanding modern data systems by introducing three essential attributes:
- Reliability: The system should continue to function correctly even when hardware fails, software has bugs, or human operators make mistakes. Techniques like replication, fault tolerance, and idempotence help mitigate risks.
- Scalability: A system should handle growth in data volume, traffic, or complexity without suffering performance degradation. This requires clear definitions of load and performance, plus design strategies like partitioning and caching.
- Maintainability: Systems evolve, and maintainable systems make this evolution easy. Characteristics of maintainability include good observability, clear abstractions, and the ability to change behavior with minimal risk or downtime.
Together, these principles create a framework for evaluating and designing robust, efficient, and adaptable systems.
Chapter 2: Data Models and Query Languages
Relational Model Versus Document Model
Relational databases store data in structured tables with fixed schemas, while document databases (like MongoDB) use nested, schema-less JSON documents.
Relational Example:
-- Orders and Customers in SQL
SELECT orders.id, customers.name
FROM orders
JOIN customers ON orders.customer_id = customers.id;
Document Example:
{
"order_id": 1001,
"customer": {
"id": "cust123",
"name": "Alice"
},
"items": [
{"product": "Widget", "qty": 2},
{"product": "Gadget", "qty": 1}
]
}
Relational vs Document Model
Relational:
+-------------+ +---------------+
| Orders | | Customers |
|-------------| |---------------|
| id |----->| id |
| customer_id | | name |
| ... | | ... |
+-------------+ +---------------+
Document:
+----------------------+
| Order Document |
|----------------------|
| order_id |
| customer: |
| id |
| name |
| items: [ |
| {product, qty}, |
| ... |
| ] |
+----------------------+
Relational vs Document
| Feature | Relational DBs | Document DBs |
|---|---|---|
| Schema | Fixed, normalized | Dynamic, flexible |
| Queries | Declarative (SQL) | Embedded (MongoQL) |
| Relationships | JOINs | Embedded or referenced |
| Scalability | Manual sharding | Built-in sharding |
| Use Case | OLTP, analytics | Content, catalogs |
Real-World Example:
- MongoDB is used for content management systems and catalogs where flexible schemas are needed.
- PostgreSQL and MySQL are used for financial and transactional systems requiring strong consistency.
The Birth of NoSQL
NoSQL databases emerged to meet the needs of large-scale applications where traditional relational databases struggled:
- Amazon Dynamo: key-value store with eventual consistency
- Google Bigtable: wide-column store
- MongoDB: document store with dynamic schemas
NoSQL systems prioritized horizontal scalability, high availability, and flexible schemas.
The Object-Relational Mismatch
Relational tables are flat, but application code often uses deeply nested objects. This mismatch leads to inefficiencies:
- ORMs try to bridge the gap but can be leaky abstractions
- Document databases naturally align with object-oriented code
Many-to-One and Many-to-Many Relationships
Relational databases use normalized tables and JOINs:
SELECT * FROM books
JOIN book_authors ON books.id = book_authors.book_id
JOIN authors ON book_authors.author_id = authors.id;
In document databases, you can embed data:
{
"title": "DDIA",
"authors": [
{"id": 1, "name": "Kleppmann"},
{"id": 2, "name": "Other"}
]
}
Use embedding for static data; reference for large or frequently updated sub-objects.
Are Document Databases Repeating History?
Hierarchical databases from the 1960s (like IMS) stored data in tree structures, which made querying rigid and brittle. Document databases risk similar pitfalls:
- Efficient for fixed access patterns
- Less composable and general-purpose than relational models
The relational model won due to its data independence and declarative power.
Relational Versus Document Databases Today
Relational and document databases each have strengths and trade-offs, and neither is a universal solution. Relational databases are built around normalization, which avoids redundancy and supports flexible, ad hoc queries using JOINs. This makes them ideal for applications where relationships between entities are complex or frequently queried in different ways (e.g., financial systems, analytics).
Document databases, on the other hand, store data as nested documents (often JSON), making them a natural fit for aggregate-oriented data, where you typically fetch or update an entire object at once (e.g., a blog post with comments, a user profile with settings). Embedding related data inside a document can improve read performance and reduce the need for joins, but can also lead to duplication and update anomalies if the same data appears in many places.
Embedding vs Referencing:
- Embed data when it is accessed together and does not change independently (e.g., order items inside an order).
- Reference data when it is large, shared, or updated independently (e.g., user profiles referenced from orders).
Limitations:
- Document databases are less suited for queries that need to join across many collections or require flexible, ad hoc reporting. They work best when access patterns are well known and data can be grouped into aggregates.
- Relational databases can be harder to scale horizontally, but their query flexibility and data integrity features remain unmatched for many use cases.
Modern Practice: Many organizations use both models: relational databases for core transactional data, and document or NoSQL stores for content, logs, or user-generated data. The choice depends on your application's data access patterns, consistency needs, and scalability requirements.
Query Languages for Data
Query languages allow users to retrieve and manipulate data. The most common is SQL, but others exist for different models.
Declarative Queries on the Web
Declarative query languages let you specify what data you want, not how to get it. The database or engine figures out the best way to execute your request. This is in contrast to imperative code, where you specify the exact steps.
SQL (Relational):
SELECT * FROM users WHERE age > 30;
MongoDB (Document):
db.users.find({ age: { $gt: 30 } })
GraphQL (APIs):
query {
user(id: "1") {
name
posts { title }
}
}
SPARQL (RDF):
SELECT ?book ?author WHERE {
?book <http://purl.org/dc/elements/1.1/creator> ?author
}
Declarative queries are powerful because they allow the underlying system to optimize execution, use indexes, and parallelize work. This abstraction is a key reason for the success of SQL and similar languages.
MapReduce Querying
MapReduce is a programming model for processing large datasets in parallel across many machines. It was popularized by Google and implemented in open source by Hadoop.
- Map step: Each input record is processed to produce key-value pairs.
- Reduce step: All values for the same key are combined.
Example: Word Count
# Map step: emit (word, 1) for each word in a document
for word in document.split():
emit(word, 1)
# Reduce step: sum all counts for each word
for word, counts in grouped_by_word:
emit(word, sum(counts))
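For comparison, here is a runnable single-process version of the same word count in plain Python; a real MapReduce framework would shuffle the intermediate key-value pairs across machines between the map and reduce steps.
from collections import defaultdict
documents = ["the quick brown fox", "the lazy dog", "the fox"]
# Map: emit (word, 1) pairs for every word in every document
pairs = [(word, 1) for doc in documents for word in doc.split()]
# Shuffle: group all values by key
grouped = defaultdict(list)
for word, count in pairs:
    grouped[word].append(count)
# Reduce: sum the counts for each word
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, ...}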
MapReduce is less expressive than SQL, but it scales to massive datasets and is fault-tolerant. Many modern data platforms (like Spark) build on these ideas, but offer more flexible APIs.
Query Languages
| Model | Language | Example Syntax | Use Case |
|---|---|---|---|
| Relational | SQL | SELECT β¦ FROM β¦ | OLTP, analytics |
| Document | MongoQL | db.coll.find({β¦}) | Content, catalogs |
| Graph | Cypher | MATCH (a)-[]->(b) | Social, recommendations |
| RDF | SPARQL | SELECT β¦ WHERE β¦ | Semantic web, research |
Graph-Like Data Models
Graph databases represent entities as nodes and relationships as edges. This model is ideal for use cases like social networks, recommendation engines, and fraud detection, where relationships are first-class citizens and traversal is frequent.
Property Graphs
A property graph is a graph model in which both nodes and edges can carry arbitrary key-value properties; it is used in Neo4j and similar systems. A property graph consists of:
- Nodes: entities (e.g., users, products)
- Edges: relationships (e.g., follows, purchased)
- Properties: key-value pairs on both nodes and edges
(Alice)---FRIEND--->(Bob)---FRIEND--->(Carol)
|
+---PURCHASED--->(Widget)
This diagram shows a property graph: Alice is friends with Bob, who is friends with Carol. Alice also purchased a Widget. Nodes represent entities, and edges represent relationships. This structure is ideal for social networks or recommendations.
Property graphs are used when you need to model and query complex, interconnected data with rich relationships.
The Cypher Query Language
Cypher is a declarative graph query language used in Neo4j. It allows expressive pattern matching and traversal.
Example: Find Alice's friends
MATCH (a:User)-[:FRIEND]->(b:User)
WHERE a.name = 'Alice'
RETURN b.name;
This Cypher query finds all users who are friends with Alice. Cypher's pattern-matching syntax makes it easy to express traversals in a graph database.
Cypher is ideal for queries that follow paths or patterns in a graph, such as recommendations or social connections.
Graph Queries in SQL
SQL is not naturally suited for recursive or deep relationship traversals, but recursive Common Table Expressions (CTEs) can simulate it.
Example: Find all friends-of-friends
WITH RECURSIVE friends(id, depth) AS (
  SELECT id, 0 FROM users WHERE id = 1
  UNION
  SELECT u.id, fr.depth + 1
  FROM friends fr
  JOIN friendships f ON f.user_id = fr.id
  JOIN users u ON u.id = f.friend_id
  WHERE fr.depth < 2   -- stop after friends-of-friends to avoid unbounded recursion
)
SELECT * FROM friends;
This SQL example uses a recursive CTE to find all friends-of-friends starting from user 1. While possible, such queries are more verbose and less efficient than in a native graph database.
This approach works for limited depth traversal but lacks graph-native optimizations.
Triple-Stores and SPARQL
Triple stores (e.g., Apache Jena, Virtuoso) are used in semantic web contexts. Data is stored as RDF triples: subject, predicate, object.
RDF (Resource Description Framework): A standard for representing data as triples (subject, predicate, object), widely used in the semantic web.
Example RDF triple:
<Bob> <knows> <Alice>
This triple expresses that Bob knows Alice, using the subject-predicate-object format common in semantic web and knowledge graph systems.
SPARQL is a query language for RDF data, designed to retrieve and manipulate data stored in Resource Description Framework format.
SPARQL query:
SELECT ?friend WHERE {
<Bob> <knows> ?friend
}
This SPARQL query retrieves all friends of Bob from an RDF triple store.
Triple-stores and SPARQL are used for knowledge graphs, linked data, and applications that require flexible, schema-less relationships.
The Foundation: Datalog
Datalog is a declarative logic programming language, similar to Prolog, that is especially good for expressing recursive relationships in data.
Example: Ancestor relationship
ancestor(X, Y) :- parent(X, Y).
ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).
This Datalog program defines the ancestor relationship recursively: X is an ancestor of Y if X is a parent of Y, or if X is a parent of Z and Z is an ancestor of Y. Datalog is valued for its clarity and ability to express recursive relationships concisely.
Modern graph engines and some databases use Datalog-like languages for advanced queries.
Chapter 2 Key Takeaways
- Relational, document, graph, and RDF models serve different needs
- Choose based on data structure, query flexibility, and performance
- No model is superior; each involves trade-offs in modeling and scaling
| Feature | Relational (SQL) | Document (NoSQL) | Graph (Property/RDF) |
|---|---|---|---|
| Schema | Fixed, normalized | Flexible, schema-less | Flexible, typed labels |
| Query Language | SQL | MongoQL, etc. | Cypher, SPARQL, Datalog |
| Relationship Modeling | JOINs | Embedding/Reference | Native edge traversal |
| Best Use Case | OLTP, finance | CMS, catalogs, APIs | Social networks, graphs |
| Scaling | Vertical/horizontal | Horizontal | Horizontal |
| Complexity | Medium | Low | High |
| Flexibility | Medium | High | Very High |
Understanding data models deeply helps you choose the right tool and scale confidently.
Chapter 3: Storage and Retrieval
How do databases store data on disk and retrieve it when queried?
This chapter dives into the internals of storage engines: the part of a database that handles storing data to disk and reading it back. Understanding these fundamentals helps you choose the right database for your workload and debug performance issues.
Data Structures That Power Your Database
At the simplest level, a database could be just two bash functions:
#!/bin/bash
db_set () {
echo "$1,$2" >> database
}
db_get () {
grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}
This append-only log has excellent write performance (O(1)) but terrible read performance (O(n)). Real databases use sophisticated data structures to balance these trade-offs.
Hash Indexes
Hash indexes use in-memory hash maps to store the byte offsets where values are stored on disk. This is the approach used by Bitcask (Riak's default storage engine).
How Hash Indexes Work:
In-Memory Hash Map:
+-----------+-----------+
| Key       | Offset    |
+-----------+-----------+
| "user1"   | 0         |
| "user2"   | 64        |
| "user3"   | 128       |
+-----------+-----------+
Disk (Append-Only Log):
+----------------------------------------+
| 0:   user1,{id:1,name:"Alice"}         |
| 64:  user2,{id:2,name:"Bob"}           |
| 128: user3,{id:3,name:"Charlie"}       |
+----------------------------------------+
Implementation Details:
Write Process:
- Append new record to end of log file
- Update hash map with new offset
- If updating existing key, just update offset (old data becomes dead)
Read Process:
- Look up key in hash map (O(1))
- Seek to offset on disk
- Read value
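A toy version of this write/read path in Python, loosely in the spirit of Bitcask: an in-memory dict maps each key to the byte offset of its latest record in an append-only file. The file name and record format are illustrative.
import os
index = {}   # key -> byte offset of the latest record for that key
def db_set(key, value, path="database.log"):
    record = f"{key},{value}\n".encode("utf-8")
    with open(path, "ab") as f:
        offset = f.seek(0, os.SEEK_END)   # where this record will start
        f.write(record)                   # append-only write
    index[key] = offset                   # point the key at its newest record
def db_get(key, path="database.log"):
    if key not in index:
        return None
    with open(path, "rb") as f:
        f.seek(index[key])                                  # one seek, straight to the record
        line = f.readline().decode("utf-8").rstrip("\n")
        return line.partition(",")[2]                       # drop the "key," prefix
db_set("user1", '{"id": 1, "name": "Alice"}')
db_set("user1", '{"id": 1, "name": "Alice Smith"}')   # old record becomes dead data
print(db_get("user1"))                                # {"id": 1, "name": "Alice Smith"}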
Compaction: Periodically merge log segments, keeping only the latest value for each key:
Before Compaction:
Segment 1: user1,v1 | user2,v1 | user1,v2
Segment 2: user3,v1 | user1,v3 | user2,v2
After Compaction:
Merged: user1,v3 | user2,v2 | user3,v1
Hash Index Trade-offs:
| Pros | Cons |
|---|---|
| ✅ Fast writes (append-only) | ❌ Hash map must fit in memory |
| ✅ Fast reads (one disk seek) | ❌ No range queries |
| ✅ Simple crash recovery | ❌ Poor memory locality |
| ✅ Space reclaimed via compaction | ❌ Hash collisions can degrade performance |
Real-world usage: Riak (Bitcask), some Redis persistence modes
SSTables and LSM-Trees
Sorted String Tables (SSTables) improve upon hash indexes by keeping keys sorted within each segment. This enables range queries and more efficient merging.
SSTable Structure:
SSTable File:
+------------------------------------------+
| Header: min_key=alice, max_key=zoe       |
+------------------------------------------+
| Index: alice→0, bob→32, charlie→64       |
+------------------------------------------+
| Data:                                    |
|   alice,{age:25,city:"NYC"}              |
|   bob,{age:30,city:"SF"}                 |
|   charlie,{age:28,city:"LA"}             |
+------------------------------------------+
Log-Structured Merge Trees (LSM-Trees):
LSM-Trees use SSTables as building blocks in a multi-level architecture:
LSM-Tree Architecture:
                  Writes
                    |
                    v
+------------------------------------------+
| MemTable (Red-Black Tree in memory)      |  <- fast writes
+------------------------------------------+
                    | flush (when full)
                    v
+------------------------------------------+
| Level 0 SSTables (on disk)               |  <- may overlap
+------------------------------------------+
                    | compact
                    v
+------------------------------------------+
| Level 1 SSTables (non-overlapping)       |  <- sorted runs
+------------------------------------------+
                    | compact
                    v
+------------------------------------------+
| Level 2+ SSTables (larger, sorted)       |  <- long-term storage
+------------------------------------------+
LSM-Tree Operations:
Write Path:
- Write to Write-Ahead Log (WAL) for durability
- Insert into MemTable (in-memory sorted structure)
- When MemTable fills, flush to Level 0 as SSTable
- Background compaction merges SSTables between levels
Read Path:
- Check MemTable first
- Check Level 0 SSTables (newest to oldest)
- Check higher levels using sparse index
- Use Bloom filters to avoid unnecessary disk reads
Compaction Strategy:
- Size-tiered: Merge SSTables of similar size
- Leveled: Maintain non-overlapping ranges per level (used by LevelDB/RocksDB)
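A heavily simplified sketch of these write and read paths: a small in-memory memtable is flushed to immutable sorted segments, and reads check the memtable first, then segments from newest to oldest. It omits the WAL, Bloom filters, and real compaction.
import bisect
MEMTABLE_LIMIT = 4
memtable = {}      # in-memory writes (a dict here; a real engine keeps this sorted)
segments = []      # flushed segments, newest last; each is a sorted list of (key, value)
def put(key, value):
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        segments.append(sorted(memtable.items()))   # flush as an immutable sorted run (SSTable)
        memtable.clear()
def get(key):
    if key in memtable:                              # 1. check the memtable first
        return memtable[key]
    for segment in reversed(segments):               # 2. then segments, newest to oldest
        keys = [k for k, _ in segment]
        i = bisect.bisect_left(keys, key)            # binary search within the sorted run
        if i < len(segment) and segment[i][0] == key:
            return segment[i][1]
    return None
for k, v in [("a", 1), ("b", 2), ("c", 3), ("d", 4), ("a", 10)]:
    put(k, v)
print(get("a"), get("c"))   # 10 3  (newest value wins, older value found in a segment)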
LSM-Tree Trade-offs:
| Advantages | Disadvantages |
|---|---|
| ✅ High write throughput | ❌ Read amplification (check multiple SSTables) |
| ✅ Good compression ratios | ❌ Write amplification (compaction overhead) |
| ✅ Range query support | ❌ Compaction can impact performance |
| ✅ Handles high write volumes | ❌ More complex than B-Trees |
Real-world usage: Cassandra, HBase, LevelDB, RocksDB, Elasticsearch
B-Trees
B-Trees are the most widely used indexing structure, found in almost all relational databases and many non-relational ones.
B-Tree Structure:
B-Trees maintain balance by ensuring all leaf nodes are at the same depth:
B-Tree Example (branching factor = 3):
              [ J ]                               <- Root (level 2)
             /     \
      [ C F ]       [ P S ]                       <- Internal nodes (level 1)
     /   |   \     /   |   \
[A,B] [D,E] [G,H,I] [K,L,M] [Q,R] [T,U,V]         <- Leaf nodes (level 0)
Key Properties:
- Each node has n keys and n+1 child pointers
- Keys within a node are sorted
- All leaves are at the same level
- Each node is typically one disk page (4KB-64KB)
B-Tree Operations:
Search Algorithm:
def search(node, key):
    # Leaf node: the key is either stored here or not in the tree at all
    if node.is_leaf():
        return node.find(key)
    # Internal node: find the first separator key greater than the search key
    i = 0
    while i < len(node.keys) and key > node.keys[i]:
        i += 1
    # Descend into the corresponding child page
    return search(node.children[i], key)
Insert with Node Splitting:
Before Insert (node full):
[A, C, E, G]
Insert D:
Split into: [A, C] and [E, G]
Promote: D to parent
Result:
[D]
/ \
[A, C] [E, G]
Range Query:
SELECT * FROM users WHERE age BETWEEN 25 AND 35;
- Navigate to leaf containing age=25
- Scan leaves left-to-right until age>35
- Follow leaf-to-leaf pointers (siblings)
B-Tree Optimizations:
Copy-on-Write (MVCC): Instead of updating pages in-place, create new page versions:
Version 1: [A, B, C]
     | update B
     v
Version 2: [A, B', C]  (new page)
Write-Ahead Logging: Record changes before applying them:
WAL: "Update page 42: B β B'"
Then: Apply change to page 42
Comparing B-Trees and LSM-Trees:
This is one of the most important trade-offs in database design:
Performance Characteristics:
| Workload Type | B-Trees | LSM-Trees | Winner |
|---|---|---|---|
| Read-heavy | Fast (1-2 seeks) | Slower (multiple SSTables) | B-Trees |
| Write-heavy | Slower (random I/O) | Fast (sequential writes) | LSM-Trees |
| Range scans | Excellent | Good | B-Trees |
| Point lookups | Very good | Good (with Bloom filters) | B-Trees |
Resource Usage:
| Resource | B-Trees | LSM-Trees |
|---|---|---|
| Write amplification | 2x (WAL + page) | 5-10x (compaction) |
| Read amplification | 1x | 5-10x (multiple levels) |
| Space amplification | Low | Medium (fragmentation during compaction) |
| CPU usage | Lower | Higher (compaction) |
Operational Considerations:
B-Trees:
- More predictable performance
- Simpler to reason about
- Better for OLTP workloads
- Used by: PostgreSQL, MySQL, SQL Server, Oracle
LSM-Trees:
- Better write throughput
- More complex tuning (compaction strategies)
- Better for analytics/logging workloads
- Used by: Cassandra, HBase, ClickHouse, Elasticsearch
Other Indexing Structures:
R-Trees (Spatial Indexes): For geographic and multi-dimensional data:
Geographic Bounding Boxes:
+-----------------------------+
| A                           |
|   +-----+      +-----+      |
|   |  B  |      |  C  |      |
|   +-----+      +-----+      |
|        +-----+              |
|        |  D  |              |
|        +-----+              |
+-----------------------------+
Use cases: PostGIS, MongoDB geospatial queries, game engines
Full-text Search Indexes: Inverted indexes map terms to documents:
Term → Document IDs
"database" → [1, 3, 7, 12]
"storage" → [1, 5, 8]
"engine" → [3, 7, 9, 15]
Use cases: Elasticsearch, Solr, database FTS features
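Building such an inverted index is straightforward in principle; here is a tiny sketch with made-up documents:
from collections import defaultdict
docs = {
    1: "database storage engine",
    3: "database engine internals",
    5: "columnar storage",
}
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():   # real systems also tokenize, stem, and rank terms
        inverted[term].add(doc_id)
print(sorted(inverted["database"]))     # [1, 3]
print(sorted(inverted["storage"]))      # [1, 5]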
Memory-only Structures: Skip Lists: Probabilistic alternative to B-Trees
Level 3: 1 ------------------------------> 9
Level 2: 1 -------------> 6 -------------> 9
Level 1: 1 ----> 4 -----> 6 ----> 8 -----> 9
Level 0: 1 -> 2 -> 4 -> 6 -> 7 -> 8 -> 9
Use cases: Redis sorted sets, LevelDB MemTable
Transaction Processing or Analytics?
Modern applications typically need both Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) capabilities.
OLTP Characteristics
-- Typical OLTP queries (simple, fast)
INSERT INTO orders (customer_id, product_id, quantity)
VALUES (12345, 'widget-A', 2);
UPDATE inventory SET quantity = quantity - 2
WHERE product_id = 'widget-A';
SELECT * FROM customers WHERE customer_id = 12345;
Properties:
- Small number of records per query
- Random access patterns
- Interactive response times (< 100ms)
- High concurrency
OLAP Characteristics
-- Typical OLAP query (complex, analytical)
SELECT
region,
product_category,
SUM(revenue) as total_revenue,
AVG(customer_satisfaction) as avg_satisfaction
FROM sales_fact sf
JOIN dim_product dp ON sf.product_id = dp.product_id
JOIN dim_customer dc ON sf.customer_id = dc.customer_id
JOIN dim_time dt ON sf.date_id = dt.date_id
WHERE dt.year = 2024 AND dt.quarter = 'Q1'
GROUP BY region, product_category
ORDER BY total_revenue DESC;
Properties:
- Scan millions of records
- Sequential access patterns
- Batch processing (seconds to hours)
- Read-heavy workload
OLTP vs OLAP Comparison
| Aspect | OLTP | OLAP |
|---|---|---|
| Main pattern | Read/write individual records | Aggregate over large datasets |
| Query complexity | Simple | Complex with JOINs |
| Dataset size | GB to TB | TB to PB |
| Bottleneck | Random I/O, lock contention | Bandwidth, CPU |
| Data freshness | Real-time | Minutes to hours delay |
| Examples | E-commerce checkout, banking | Business intelligence, reporting |
Data Warehousing
Since OLTP and OLAP have different requirements, many organizations use separate systems:
OLTP Systems  →  ETL Pipeline  →  Data Warehouse (OLAP)
     |                |                    |
PostgreSQL       Apache Airflow       Snowflake
MySQL            dbt                  BigQuery
MongoDB          Spark                Redshift
ETL vs ELT:
Extract-Transform-Load (ETL):
Source DB → Transform (clean, aggregate) → Load into warehouse
Extract-Load-Transform (ELT):
Source DB → Load raw data → Transform in warehouse
Modern cloud warehouses favor ELT because they can handle transformations at scale.
Stars and Snowflakes: Schemas for Analytics
Fact and Dimension Tables:
- Fact table: Contains measurements/metrics (sales amount, quantity)
- Dimension tables: Contain descriptive attributes (customer info, product details)
Star Schema:
                +--------------+
                | dim_product  |
                +------+-------+
                       |
+--------------+       |       +--------------+
| dim_customer |-------+-------|   dim_time   |
+--------------+       |       +--------------+
                       |
                +------+-------+
                |  sales_fact  |   <- central fact table
                +------+-------+
                       |
+--------------+       |       +---------------+
|  dim_store   |-------+-------| dim_promotion |
+--------------+               +---------------+
Snowflake Schema: Like star schema but dimension tables are normalized into sub-dimensions:
dim_product → dim_category → dim_department
Trade-offs:
- Star: Simpler queries, some redundancy
- Snowflake: Normalized, more complex JOINs
Column-Oriented Storage
Row-oriented storage stores entire records together, while column-oriented storage groups values from the same column.
Row-oriented (traditional):
Disk blocks:
Block 1: [Row1: ID=1,Name="Alice",Age=25,City="NYC"]
[Row2: ID=2,Name="Bob",Age=30,City="SF"]
Block 2: [Row3: ID=3,Name="Carol",Age=28,City="LA"]
[Row4: ID=4,Name="Dave",Age=35,City="NYC"]
Column-oriented:
Disk blocks:
ID column: [1, 2, 3, 4, 5, 6, ...]
Name column: ["Alice", "Bob", "Carol", "Dave", ...]
Age column: [25, 30, 28, 35, 42, 29, ...]
City column: ["NYC", "SF", "LA", "NYC", "SF", ...]
Column Compression
Similar values compress extremely well:
Before compression:
City: ["NYC", "SF", "LA", "NYC", "SF", "NYC", "LA", "SF"]
Run-length encoding (most effective when the column is sorted so equal values sit next to each other):
Sorted city column: ["LA", "LA", "NYC", "NYC", "NYC", "SF", "SF", "SF"] → LA×2, NYC×3, SF×3
Dictionary encoding:
Dictionary: {0: "NYC", 1: "SF", 2: "LA"}
Values: [0, 1, 2, 0, 1, 0, 2, 1]
Bitmap encoding (one bitmap per distinct value, over the original order):
"NYC": 10010100 (positions where NYC appears)
"SF":  01001001
"LA":  00100010
Sort Order in Column Storage
Unlike row stores, column stores can use different sort orders for different queries:
Sorted by (date, product_id):
Date: [2024-01-01, 2024-01-01, 2024-01-02, ...]
Product: [widget-A, widget-B, widget-A, ...]
Revenue: [100, 150, 120, ...]
Alternative sort by (product_id, date):
Product: [widget-A, widget-A, widget-B, ...]
Date: [2024-01-01, 2024-01-02, 2024-01-01, ...]
Revenue: [100, 120, 150, ...]
Benefits of sorted order:
- Enables efficient range queries
- Improves compression (similar values grouped together)
- Allows for early termination in aggregations
Challenge: Different queries benefit from different sort orders.
Solutions:
- Store multiple sorted copies
- Use techniques like late materialization
- Maintain separate indexes for different access patterns
Writing to Column-Oriented Storage
Column stores optimize for reads, making writes more challenging:
LSM-Tree approach (used by ClickHouse):
- Buffer writes in row-oriented format
- Periodically convert to columnar format
- Merge with existing column files
Batch loading:
ETL Pipeline → Staging (row format) → Batch convert → Column store
Most column stores are optimized for append-only workloads rather than transactional updates.
Aggregation: Data Cubes and Materialized Views
Materialized View: Pre-computed query results stored as tables:
-- Create materialized view
CREATE MATERIALIZED VIEW monthly_sales AS
SELECT
DATE_TRUNC('month', order_date) as month,
product_category,
SUM(revenue) as total_revenue
FROM sales_fact sf
JOIN dim_product dp ON sf.product_id = dp.product_id
GROUP BY month, product_category;
Data Cube (OLAP Cube): Pre-aggregated data across multiple dimensions:
3D Cube: Time × Product × Region
- Total sales by month
- Total sales by product
- Total sales by region
- Total sales by month + product
- Total sales by month + region
- Total sales by product + region
- Total sales by month + product + region
Trade-offs:
- ✅ Lightning-fast queries for pre-computed aggregates
- ❌ Storage overhead (exponential with dimensions)
- ❌ Complex to maintain with data updates
Chapter 3 Key Takeaways
Storage Engine Trade-offs:
- Hash indexes: Fast writes and reads, but no range queries and limited by memory
- LSM-Trees: Excellent for write-heavy workloads, good compression
- B-Trees: Superior for read-heavy workloads, excellent range queries
OLTP vs OLAP:
- OLTP: Transaction processing, normalized schemas, real-time operations
- OLAP: Analytics, star/snowflake schemas, batch processing with complex aggregations
Column Storage Advantages:
- Dramatic compression improvements
- Vectorized processing capabilities
- Excellent for analytical queries
- Challenges with writes and different sort orders
The Right Tool for the Job:
- Choose storage engines based on your workload characteristics
- Consider read/write ratios, query patterns, and consistency requirements
- Modern systems often combine multiple storage technologies
Chapter 4: Encoding and Evolution
How do you handle data format changes when applications and databases are running in production?
This chapter addresses one of the most practical challenges in system design: evolving your data formats without breaking existing services. Whether you're adding new fields to a JSON API or migrating between database schemas, understanding encoding and evolution is crucial for building resilient systems.
Formats for Encoding Data
Language-Specific Formats
Many programming languages have built-in support for encoding in-memory objects:
// Java serialization
ByteArrayOutputStream bos = new ByteArrayOutputStream();
ObjectOutputStream out = new ObjectOutputStream(bos);
out.writeObject(myObject);
byte[] serialized = bos.toByteArray();
# Python pickle
import pickle
serialized = pickle.dumps(my_object)
my_object = pickle.loads(serialized)
Problems with language-specific formats:
- ❌ Tied to a particular programming language
- ❌ Poor performance and bloated encoding
- ❌ Security vulnerabilities (arbitrary code execution)
- ❌ No forward/backward compatibility guarantees
JSON, XML, and Binary Variants
JSON Example:
{
"userName": "Martin",
"favoriteNumber": 1337,
"interests": ["daydreaming", "hacking"]
}
XML Equivalent:
<Person>
<userName>Martin</userName>
<favoriteNumber>1337</favoriteNumber>
<interests>
<li>daydreaming</li>
<li>hacking</li>
</interests>
</Person>
JSON/XML Trade-offs:
| Aspect | JSON | XML | Binary Formats |
|---|---|---|---|
| Human readable | ✅ | ✅ | ❌ |
| Compact | ⚠️ Medium | ❌ Verbose | ✅ |
| Schema support | ❌ | ❌ | ✅ |
| Numbers precision | ❌ Ambiguous | ❌ Text-based | ✅ Typed |
| Unicode support | ✅ | ✅ | ✅ |
| Parsing speed | ⚠️ Medium | ❌ Slow | ✅ Fast |
JSON/XML Problems:
- Number ambiguity: No distinction between integers and floats
- Large size: Repeated field names, verbose syntax
- No schema: Can't enforce data structure
Binary Encoding:
MessagePack example (binary JSON):
JSON (81 bytes):
{"userName": "Martin", "favoriteNumber": 1337, "interests": ["daydreaming", "hacking"]}
MessagePack (66 bytes):
83 a8 userName a6 Martin ae favoriteNumber cd 05 39 a9 interests 92 ab daydreaming a7 hacking
Saves ~18% space by eliminating repeated field names and using binary encoding for numbers.
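You can observe the size difference directly with the msgpack Python package (assuming it is installed and reasonably recent; exact byte counts vary by library version):
import json
import msgpack   # pip install msgpack
record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}
as_json = json.dumps(record, separators=(",", ":")).encode("utf-8")
as_msgpack = msgpack.packb(record)
print(len(as_json), len(as_msgpack))          # the msgpack encoding is noticeably smaller
assert msgpack.unpackb(as_msgpack) == record  # and round-trips to the same structure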
Thrift and Protocol Buffers
Schema-driven binary encoding formats that generate code for multiple languages.
Protocol Buffers (Protobuf) Example:
Schema definition (.proto file):
syntax = "proto3";
message Person {
string user_name = 1;
int64 favorite_number = 2;
repeated string interests = 3;
}
Generated code usage:
# Python
person = Person()
person.user_name = "Martin"
person.favorite_number = 1337
person.interests.extend(["daydreaming", "hacking"])
# Serialize to binary
binary_data = person.SerializeToString()
# Deserialize
person2 = Person()
person2.ParseFromString(binary_data)
Binary Encoding Details:
Protobuf binary format:
Field 1 (user_name): tag=1, type=string
Field 2 (favorite_number): tag=2, type=varint
Field 3 (interests): tag=3, type=string (repeated)
Binary: 0a 06 4d 61 72 74 69 6e 10 b9 0a 1a 0b 64 61 79 64 72 65 61 6d 69 6e 67 1a 07 68 61 63 6b 69 6e 67
Key insight: Field names are not storedβonly field tags (numbers).
Thrift vs Protobuf Comparison:
| Feature | Thrift | Protocol Buffers |
|---|---|---|
| Binary formats | BinaryProtocol, CompactProtocol | Single format |
| Schema evolution | Good | Excellent |
| Language support | Fewer languages | Many languages |
| RPC framework | Built-in | gRPC (separate) |
| Optional/required | Explicit | All optional (proto3) |
| Default values | Specified in schema | Type defaults |
Avro
Avro takes a different approach: it embeds the schema with the data, enabling better schema evolution.
Avro Schema Example:
{
"type": "record",
"name": "Person",
"fields": [
{"name": "userName", "type": "string"},
{"name": "favoriteNumber", "type": ["null", "long"], "default": null},
{"name": "interests", "type": {"type": "array", "items": "string"}}
]
}
Writer vs Reader Schema:
Key innovation: Avro supports different schemas for writing and reading:
Writer Schema (v1):
- userName (string)
- favoriteNumber (long)
Reader Schema (v2):
- userName (string)
- favoriteNumber (long)
- email (string, default: null)
The Avro library resolves differences between schemas automatically.
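A sketch of that resolution using the fastavro package (assuming it is installed; the schemas mirror the example above, and email is the field added in v2):
import io
from fastavro import schemaless_writer, schemaless_reader
writer_schema = {
    "type": "record", "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": "long"},
    ],
}
reader_schema = {
    "type": "record", "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": "long"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}
buf = io.BytesIO()
schemaless_writer(buf, writer_schema, {"userName": "Martin", "favoriteNumber": 1337})
buf.seek(0)
# The reader supplies both schemas; the missing email field is filled with its default
print(schemaless_reader(buf, writer_schema, reader_schema))
# {'userName': 'Martin', 'favoriteNumber': 1337, 'email': None}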
Schema Evolution in Avro:
Backward compatibility (new code using the new schema can read data written with the old schema):
// Old schema
{"name": "favoriteNumber", "type": "long"}
// New schema (more permissive)
{"name": "favoriteNumber", "type": ["null", "long"], "default": null}
Forward compatibility (old code using the old schema can read data written with the new schema):
// New schema
{"name": "email", "type": "string", "default": ""}
// Old schema ignores unknown fields
Avro vs Protobuf/Thrift:
| Feature | Avro | Protobuf/Thrift |
|---|---|---|
| Schema storage | With data | Separately |
| Field identification | Field names | Tag numbers |
| Code generation | Optional | Required |
| Schema evolution | Very flexible | Good |
| Compactness | High | High |
| Dynamic typing | ✅ | ❌ |
The Merits of Schemas
Advantages of Schema-Based Formats:
- Compactness: Binary encoding with no field names
- Documentation: Schema serves as always up-to-date documentation
- Validation: Catch errors at write time, not read time
- Code generation: Generate statically typed data structures
- Evolution: Maintain compatibility across versions
Schema Evolution Patterns:
Safe changes:
- Add optional fields with defaults
- Remove optional fields
- Change field order (if using field names/tags)
Unsafe changes:
- Change field types (e.g., int32 → string)
- Rename fields (without aliases)
- Remove required fields
- Add required fields without defaults
Example evolution sequence:
v1: {id, name}
v2: {id, name, email?} # Add optional email
v3: {id, name, email?, age?} # Add optional age
v4: {id, name, email, age?} # Make email required (unsafe!)
Modes of Dataflow
1. Dataflow Through Databases
The problem: Applications change faster than databases.
Timeline:
T1: App v1 writes data using schema v1
T2: App v2 deployed, can read v1 data?
T3: App v2 writes data using schema v2
T4: App v1 still running, can read v2 data?
Solution strategies:
- Migrations: Transform all existing data to new schema
- Multi-version: Support reading multiple schema versions
- Additive changes: Only add optional fields
Example database evolution:
-- v1 schema
CREATE TABLE users (id INT, name VARCHAR(50));
-- v2 migration
ALTER TABLE users ADD COLUMN email VARCHAR(100) DEFAULT NULL;
-- v3 migration
ALTER TABLE users ADD COLUMN preferences JSON DEFAULT '{}';
2. Dataflow Through Services: REST and RPC
Service communication patterns:
REST (JSON over HTTP):
GET /api/users/123
Content-Type: application/json
{
"id": 123,
"name": "Alice",
"email": "alice@example.com"
}
RPC (binary protocol):
UserService.getUser(user_id=123) → User object
Web Services Comparison:
| Aspect | REST | SOAP | gRPC |
|---|---|---|---|
| Protocol | HTTP/JSON | HTTP/XML | HTTP/2 + Protobuf |
| Schema | Informal/OpenAPI | WSDL | .proto files |
| Code generation | Optional | Required | Generated stubs |
| Performance | Medium | Low | High |
| Debugging | Easy | Hard | Medium |
| Browser support | Native | Complex | Limited |
Remote Procedure Calls (RPC):
RPC tries to make network calls look like local function calls:
# Local call
user = get_user(123)
# RPC call (looks the same!)
user = user_service.get_user(123)
Problems with RPC:
- Network failures: Local calls don't fail randomly
- Latency: Network calls are ~1M times slower
- Serialization: Objects don't always serialize cleanly
- Versioning: Harder to evolve than local APIs
RPC frameworks:
- gRPC (Google): High performance, HTTP/2, streaming
- Thrift: Cross-language, used at Facebook/Meta
- Finagle (Twitter): Scala-based, circuit breakers
- JSON-RPC: Simple, HTTP-based
Data Encoding and Evolution for RPC:
Versioning strategies:
- URL versioning:
/api/v1/users/123
/api/v2/users/123
- Header versioning:
Accept: application/vnd.api+json;version=2
- Content negotiation:
Accept: application/vnd.user.v2+json
Backward/forward compatibility:
- Old clients must work with new servers
- New clients should work with old servers
- Use optional fields and sensible defaults
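One common way to achieve this is a "tolerant reader": keep only the fields you know, apply defaults for anything missing, and silently ignore the rest. A minimal sketch with illustrative field names:
KNOWN_FIELDS = {"id": None, "name": "", "email": None}   # field -> default value
def parse_user(payload: dict) -> dict:
    # Keep known fields, fill in defaults for missing ones, ignore everything else
    return {field: payload.get(field, default) for field, default in KNOWN_FIELDS.items()}
old_response = {"id": 123, "name": "Alice"}                      # from an old server
new_response = {"id": 123, "name": "Alice", "email": "a@x.com",
                "plan": "pro"}                                   # newer server adds fields
print(parse_user(old_response))   # missing email -> default None
print(parse_user(new_response))   # unknown 'plan' field is ignored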
3. Message-Passing Dataflow
Asynchronous message passing combines benefits of RPC and databases:
Producer → Message Broker → Consumer
    |                           |
(fire and forget)        (process later)
Message Brokers:
Popular message brokers:
- Apache Kafka: High-throughput, persistent logs
- RabbitMQ: Flexible routing, AMQP protocol
- Amazon SQS: Managed queue service
- Redis Pub/Sub: In-memory, simple
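The decoupling itself can be illustrated with an in-process queue standing in for a broker; this toy sketch has none of the durability or distribution that Kafka or RabbitMQ provide.
import queue
import threading
broker = queue.Queue()   # stand-in for a broker topic/queue
def producer():
    for i in range(3):
        broker.put({"event": "user_created", "user_id": i})   # fire and forget
    broker.put(None)                                          # sentinel: no more messages
def consumer():
    while (message := broker.get()) is not None:
        print("processed", message)                           # consumed at its own pace
threading.Thread(target=producer).start()
consumer()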
Message Broker Benefits:
| Benefit | Description |
|---|---|
| Reliability | Broker acts as buffer against outages |
| Decoupling | Producers/consumers evolve independently |
| Flexibility | One message → multiple consumers |
| Scalability | Horizontal scaling of consumers |
Actor Model:
Actors are computational entities that communicate via messages:
Actor A ---message---> Actor B
        <--response---
Actor properties:
- Encapsulated state (no shared memory)
- Asynchronous message passing
- Location transparency (local or remote)
Frameworks: Akka (JVM), Orleans (.NET), Erlang/Elixir
Message Encoding in Distributed Actors:
Evolution challenges:
- Rolling deployments mean mixed actor versions
- Message formats must be backward/forward compatible
- Consider using schemas (Protobuf, Avro) for actor messages
%% Erlang actor message evolution
%% v1 message
{user_created, UserId, Name}
%% v2 message (backward compatible)
{user_created, UserId, Name, Email}
Chapter 4 Key Takeaways
Encoding Format Selection:
- Language-specific formats: Avoid for cross-language compatibility
- JSON/XML: Human-readable but verbose and ambiguous for numbers
- Binary formats (Protobuf, Thrift, Avro): Compact, fast, schema-driven
Schema Evolution Strategies:
- Backward compatibility: New code can read old data
- Forward compatibility: Old code can read new data
- Best practices: Add optional fields, use defaults, avoid breaking changes
Dataflow Patterns:
- Database dataflow: Handle schema migrations carefully, support multiple versions
- Service dataflow: Use versioning strategies (URL, headers, content negotiation)
- Message-passing: Enables loose coupling, supports schema evolution through brokers
The Evolution Challenge: Schema evolution is not just a technical problem; it's an organizational one. Plan for change from day one, use schemas that support evolution, and design systems that can gracefully handle mixed versions during deployments.
Part I Summary
Key takeaways from Part I:
- Reliability, Scalability, Maintainability are the fundamental goals of good system design
- Data models (relational, document, graph) shape how you think about problems
- Storage engines (B-Trees vs LSM-Trees) have different performance characteristics
- Encoding formats and schema evolution enable systems to change over time
- Dataflow patterns (databases, services, messages) each have different consistency and performance properties
Technology selection framework:
- Understand your workload characteristics (read/write ratio, consistency needs)
- Choose data models that match your access patterns
- Select storage engines based on performance requirements
- Plan for schema evolution from day one
- Design dataflow to match your consistency and availability needs
Looking ahead to Part II: The foundations covered here prepare you for distributed systems challenges: replication, partitioning, transactions, and the difficulties of maintaining consistency across multiple machines.
"The future is already here; it's just not evenly distributed." Understanding these foundations helps you build systems that can adapt and scale as requirements change.