*Designing data-intensive applications*

5/29/25

Table of Contents

Part I. Foundations of data systems

  1. Reliable, scalable, and maintainable applications
  2. Data models and query languages
  3. Storage and retrieval
  4. Encoding and evolution

Part II. Distributed data

  1. Replication
  2. Partitioning
  3. Transactions
  4. The trouble with distributed systems
  5. Consistency and consensus

Part III. Derived data

  1. Batch processing
  2. Stream processing
  3. The future of data systems

Reliable, scalable, and maintainable applications

Reliability means that the system works even when faults occur.

Scalability means that performance stays good even when load reasonably increases.

Maintainability means that life is good for people who need to work with the system.

Data models and query languages

Relational models have a strict schema. Relationships can be many-to-many, but not deeply nested.

Document models have a loose schema. Data is self-contained, such as a resume. Relationships are 1-to-many.

Graph models have a loose schema. Anything is potentially related to anything. Relationships are many-to-many and queries can traverse over a chain of edges.

Storage and retrieval

Online transaction processing (OLTP) and online analytic processing (OLAP) have different requirements:

  • OLTP systems are user-facing and have to handle a huge volume of requests. Each query usually touches only a small number of records. Read requests are usually via individual keys. Write requests are usually random-access.
  • OLAP systems like data warehouses are used by business analysts and have to handle a smaller volume of requests. Each query tends to touch millions of records. Read requests are usually over a range of keys. Write requests can be done in batches.

Storage engines for OLTP rely on indexes and sequential writes. They can be log-structured (hash indexes, LSM trees) or update-in-place (B-trees).

Storage engines for OLAP rely less on indexing and more on column-oriented storage and compression. The star schema helps organize data from multiple OLTP databases: a central fact table stores one row per event, with foreign keys to dimension tables.
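
As a rough illustration (hypothetical product/store/sales tables, made up for this sketch), a star schema boils down to something like the following, along with a typical analytic query over the fact table:

```python
# Dimension tables: descriptive attributes, referenced by foreign key.
dim_product = {1: {"name": "apple", "category": "fruit"},
               2: {"name": "bread", "category": "bakery"}}
dim_store = {10: {"city": "Oslo"}, 11: {"city": "Bergen"}}

# Fact table: one row per event (here, a sale), with foreign keys into dimensions.
fact_sales = [
    {"product_fk": 1, "store_fk": 10, "quantity": 3, "price": 2.0},
    {"product_fk": 2, "store_fk": 11, "quantity": 1, "price": 4.5},
]

# Typical OLAP query: total revenue for the "fruit" category across all stores.
revenue = sum(row["quantity"] * row["price"]
              for row in fact_sales
              if dim_product[row["product_fk"]]["category"] == "fruit")
print(revenue)  # 6.0
```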

Hash indexes

  1. Write requests are appended to a log
  2. The in-memory hash index maps each key to the location (byte offset) of its most recent write
  3. After some threshold, segments are compacted and merged

Advantages

  • Disk writes are sequential, so write throughput is high
  • Concurrency and crash recovery are simple

Disadvantages

  • Keys need to fit in memory
  • Keys are not ordered, so range queries are not efficient
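
A minimal sketch of this design in Python, assuming a Bitcask-style layout (the file name, record format, and class name are made up here; compaction of old segments is omitted):

```python
import os

class HashIndexStore:
    """Append-only log with an in-memory hash index of key -> byte offset."""

    def __init__(self, path="data.log"):
        self.index = {}                      # key -> offset of the latest record
        self.f = open(path, "a+b")           # sequential appends only

    def put(self, key: str, value: str):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(f"{key},{value}\n".encode())
        self.f.flush()
        self.index[key] = offset             # most recent write wins

    def get(self, key: str):
        if key not in self.index:
            return None
        self.f.seek(self.index[key])         # one disk seek per lookup
        _, value = self.f.readline().decode().rstrip("\n").split(",", 1)
        return value
```

Keys or values containing commas or newlines would break this toy record format; a real engine would use length-prefixed binary records.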

LSM tree (log-structured merge tree)

Similar to hash indexes, but keys are sorted.

  1. Write requests are added to the sorted, in-memory memtable (red-black or AVL tree)
  2. After some threshold, the memtable is copied to an on-disk SSTable (sorted string table)
  3. After some threshold, SSTables are compacted and merged
  4. Read requests first check the memtable and then the on-disk segments

Advantages

  • Disk writes are sequential, so write throughput is high
  • The in-memory index no longer needs to store all keys: because segments are sorted, a sparse index is enough
  • Merging segments is efficient through mergesort
  • Range queries are efficient

Disadvantages

  • If the database crashes, the memtable is lost. Thus, write requests are also appended to an unsorted, on-disk log that can be used to restore the memtable after a crash.
  • Read requests need to check the memtable and all of the segments before determining that a key doesn't exist. Bloom filters can be used to speed this up.
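
A toy version of the write and read path (plain Python dicts stand in for the balanced tree and the on-disk SSTable files; compaction, the crash-recovery log, and Bloom filters are left out):

```python
class ToyLSM:
    def __init__(self, flush_threshold=4):
        self.memtable = {}          # in-memory, would be a red-black/AVL tree
        self.segments = []          # "SSTables", oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            # Flush the memtable as a sorted segment.
            self.segments.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:                 # 1. check the memtable
            return self.memtable[key]
        for segment in reversed(self.segments):  # 2. then segments, newest first
            if key in segment:
                return segment[key]
        return None                              # a Bloom filter could skip the scan
```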

B-Trees

The most common type of on-disk index. Similar to LSM trees in that keys are sorted. Thought to be faster for reads but slower for writes.

  1. The root page references child pages, which each represent a range of keys.
  2. Write requests overwrite the page containing the key.
  3. Read requests traverse from the root page to the appropriate child page.

Disadvantages

  • Write requests are not resilient to crashes unless there is a write-ahead log (WAL or redo log) or a copy-on-write scheme
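
A sketch of just the lookup path, assuming a B+-tree-style layout where values live in leaf pages (page splits on write and the write-ahead log are not shown):

```python
import bisect

class Page:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys            # sorted keys within this page
        self.children = children    # child pages (internal pages only)
        self.values = values        # values (leaf pages only)

def btree_get(page, key):
    while page.children is not None:             # descend to a leaf
        i = bisect.bisect_right(page.keys, key)  # pick the child covering `key`
        page = page.children[i]
    i = bisect.bisect_left(page.keys, key)
    if i < len(page.keys) and page.keys[i] == key:
        return page.values[i]
    return None

leaf1 = Page(keys=[1, 3], values=["a", "c"])
leaf2 = Page(keys=[5, 7], values=["e", "g"])
root = Page(keys=[5], children=[leaf1, leaf2])
print(btree_get(root, 7))   # "g"
print(btree_get(root, 4))   # None
```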

Encoding and evolution

Data can often have two different representations:

  1. In memory, data is kept in objects that are optimized for the CPU (typically using pointers)
  2. On disk or in network packets, data is kept in a self-contained sequence of bytes (without any pointers)

Translating from the in-memory representation to the byte sequence is called encoding, serialization, or marshalling, and the reverse is called decoding, deserialization, or unmarshalling.
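
For example, with JSON in Python (a minimal illustration of the round trip):

```python
import json

# In-memory representation: objects and references, convenient for the CPU.
user = {"name": "Ada", "interests": ["databases", "storage engines"]}

# Encoding (serialization / marshalling): a self-contained byte sequence.
encoded = json.dumps(user).encode("utf-8")

# Decoding (deserialization / unmarshalling): back to in-memory structures.
decoded = json.loads(encoded.decode("utf-8"))
assert decoded == user
```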

Types of encodings:

  1. Language-specific formats lock you into a specific language.
  2. Text-based formats like JSON, XML, and CSV are widely supported but have ambiguous support for numbers and binary strings.
  3. Binary formats like Thrift, Protocol Buffers, and Avro are more compact. They also tend to have better support for schemas, especially when it comes to maintaining compatibility.

When a data format changes, there may be both new and old data formats, and new and old code, all existing in an application. We need to support both (1) backward compatibility, or new code working on old data, and (2) forward compatibility, or old code working on new data.

  • In Thrift and Protocol Buffers, only optional fields can be added or removed
  • In Avro, differences between the writer's and reader's schemas are resolved at read time, and only fields with a default value can be added or removed
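
A rough sketch of the default-value idea with plain dicts and JSON (not the real Avro or Protocol Buffers APIs; the field names and default value are made up):

```python
import json

# The new (reader's) schema adds an `email` field with a default value.
READER_DEFAULTS = {"name": None, "email": "unknown"}

def decode(record_bytes):
    record = json.loads(record_bytes)
    # Backward compatibility: new code fills in defaults for fields old data lacks.
    return {field: record.get(field, default)
            for field, default in READER_DEFAULTS.items()}

old_record = b'{"name": "Ada"}'    # written by old code
print(decode(old_record))          # {'name': 'Ada', 'email': 'unknown'}

# Forward compatibility: old code simply ignores fields it doesn't know about,
# so a record written with the new `email` field still decodes under the old schema.
```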

Modes of dataflow:

  • With databases, a process writing to it encodes the data and a process reading from it decodes the data. Migrating data into a new schema is expensive. Adding a new column with a default value is fine.
  • With services (REST and RPC), both the client and the server encode and decode requests and responses.
  • With asynchronous message passing (using message brokers), nodes send each other messages encoded by the sender and decoded by the recipient.

Replication

Advantages of replication:

  1. Increased availability
  2. Increased read throughput
  3. Lower latency if servers are distributed around the globe

In leader-based replication, the leader is the source of truth. Clients must send write requests to the leader, which then updates the followers. Clients can send read requests to either the leader or the followers.

Leaders update their followers with replication logs:

  • Statement-based replication could break down with non-deterministic functions like NOW() or RAND()
  • Write-ahead log (WAL) shipping is very low-level, making it tightly coupled to the storage engine. This makes performing software upgrades without downtime difficult.
  • Logical log replication describes writes at the granularity of a row. It is decoupled from the storage engine's (physical) log.

Followers can either be synchronous or asynchronous. In synchronous replication, the leader waits for the follower to confirm that it received the write before reporting success to the user. The advantage is that the follower is guaranteed to have an up-to-date copy of the data. The disadvantage is that a single unresponsive follower blocks all writes, which is why fully synchronous replication is rare.

In asynchronous replication, the leader can continue taking write requests even when followers lag behind. But this can cause replication lag:

  • Submitting a write and then not seeing it on a subsequent read (violates read-after-write consistency)
  • Reading from a fresh replica and then from a stale one, so time appears to move backwards (violates monotonic reads)
  • Seeing writes in a different order than they were written (violates consistent prefix reads)

Try to avoid multi-leader replication, since conflict resolution can get hairy. Sometimes it is necessary if there are multiple data centers, offline requests, or collaborative editing.

There is also leaderless replication (Dynamo-style), where clients send write and read requests to multiple servers to get a quorum. Given n replicas, where w replicas must confirm a write request and r replicas are queried for a read request, having r + w > n means that generally every read request will get the up-to-date value.
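
The quorum condition is simple arithmetic; a quick sketch:

```python
# With n replicas, w write acknowledgements, and r replicas queried per read,
# r + w > n guarantees the read set and write set overlap in at least one node.
def quorum_overlaps(n, w, r):
    return r + w > n

print(quorum_overlaps(n=3, w=2, r=2))  # True: every read overlaps the latest write
print(quorum_overlaps(n=3, w=1, r=1))  # False: a read may miss the latest write
```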