Designing Data-Intensive Applications: Review and Study Guide for .NET Developers

A friendly review and study guide for Kleppmann's DDIA, with simple analogies, C# examples, and a clear chapter-by-chapter reading plan.

17 min read · Updated October 15, 2025

What this book is

Designing Data-Intensive Applications is a book by Martin Kleppmann. People call it "DDIA" or "the wild boar book" because of the animal on the cover. The second edition is co-authored with Chris Riccomini.

It is not a book about one database. It is a book about the ideas underneath all databases and data systems. How is data stored on disk? How do you copy it to other machines so you do not lose it? How do you split it when it gets too big for one machine? How do you keep it correct when many users change it at once?

These questions matter more every year. Apps today handle huge amounts of data, for many users, all at the same time. This book is the modern classic that explains how those systems really work. This page is a review plus a study guide. By the end you will know what the book teaches, see small C# examples, learn who it suits, and have a chapter-by-chapter plan.

A real-life way to think about it

Imagine a giant library. Not a small room of books, but a huge one with millions of books and thousands of visitors every hour.

If you keep only one copy of each book, you have problems. If that copy is lost or torn, it is gone forever. If two people want the same book, one has to wait. And if too many people show up at once, the single front desk gets a huge line.

So a smart library does a few things. It keeps copies of popular books, so losing one is not the end of the world. That is replication. It splits the collection across many rooms by topic, so no single room is too full and visitors spread out. That is partitioning. It has rules so two people cannot both check out the very last copy of a book at the same time. That is transactions and consistency. And it has a quiet back office that slowly re-shelves and re-indexes everything overnight. That is batch processing, while the busy front desk handling live visitors is stream processing.

That is the whole book in one picture. A data-intensive app is just a very large, very busy library for data. Kleppmann teaches you how to run that library well, so it stays fast, never loses a book, and never hands out the same last copy twice.

The three goals: reliable, scalable, maintainable

The book opens with three words that shape everything after. Every serious data system is judged by them.

  • Reliable. It keeps working correctly, even when things go wrong. Hard drives die. Networks drop. People make mistakes. A reliable system expects this and survives it.
  • Scalable. It keeps working well as load grows. Ten users or ten million, the system has a plan to cope, usually by adding more machines.
  • Maintainable. People can keep working on it for years without dreading it. The code and the data shape are clear enough that the next engineer is not scared to touch them.

A key idea here is the difference between a fault and a failure. A fault is one part going wrong, like a single disk dying. A failure is the whole system going down. The goal of reliability is to stop a small fault from turning into a big failure. Good systems even cause small faults on purpose, in tests, to prove they can take the hit.

// Reliability in everyday code: expect the network to fail,
// and retry with a backoff instead of crashing the whole request.
async Task<string> FetchWithRetry(HttpClient client, string url)
{
    const int maxAttempts = 3;
    for (var attempt = 1; ; attempt++)
    {
        try
        {
            return await client.GetStringAsync(url);
        }
        catch (HttpRequestException) when (attempt < maxAttempts)
        {
            // A fault, not a failure. Wait a bit longer each time, then try again.
            await Task.Delay(TimeSpan.FromMilliseconds(200 * attempt));
        }
        // On the last attempt the filter is false, so the exception reaches the caller.
    }
}
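The idea of causing small faults on purpose can be tried in a few lines. The sketch below is only an illustration, not something taken from the book: `FlakyHandler` and `RetryDemo` are hypothetical names, and the URL is a placeholder that is never contacted because the handler answers every request itself. The handler injects a fixed number of faults, which lets you check that a retry loop like the one above really survives them.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Inject two faults, then let the third call succeed.
var flaky = new FlakyHandler(faultsToInject: 2);
using var client = new HttpClient(flaky);
var body = await RetryDemo.FetchAsync(client, "http://example.invalid/health");
Console.WriteLine($"{body} after {flaky.Calls} calls"); // ok after 3 calls

// Hypothetical test double: fails the first N requests, then returns "ok".
public sealed class FlakyHandler : HttpMessageHandler
{
    private int _remainingFaults;
    public int Calls { get; private set; }

    public FlakyHandler(int faultsToInject) => _remainingFaults = faultsToInject;

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        Calls++;
        if (_remainingFaults-- > 0)
            return Task.FromException<HttpResponseMessage>(
                new HttpRequestException("Injected fault"));

        var response = new HttpResponseMessage(HttpStatusCode.OK)
        {
            Content = new StringContent("ok")
        };
        return Task.FromResult(response);
    }
}

// Hypothetical helper: the same retry shape as above, with a short delay for tests.
public static class RetryDemo
{
    public static async Task<string> FetchAsync(HttpClient client, string url)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await client.GetStringAsync(url);
            }
            catch (HttpRequestException) when (attempt < 3)
            {
                await Task.Delay(TimeSpan.FromMilliseconds(10 * attempt));
            }
        }
    }
}
```

With three injected faults instead of two, the last exception would reach the caller, which is the point where a fault is allowed to become a visible failure.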

Data models: how you shape the data

Part one of the book asks: what shape should your data take? The two big families are relational (tables with rows and columns, like SQL Server or Postgres) and document (nested objects, like a JSON blob in MongoDB). There is also the graph model, for data that is mostly about connections, like a social network.

There is no single winner. Each shape fits different work.

| Data model | Best when... | A weak spot |
| --- | --- | --- |
| Relational (tables) | Data has many relationships and you join across them | Deeply nested data can feel clumsy |
| Document (JSON) | Data is one self-contained tree, like an order with its items | Joins across documents are awkward |
| Graph (nodes + edges) | Connections matter most, like friends-of-friends | Overkill for simple, flat data |

A big lesson is that the model you pick shapes the questions you can ask easily. The book also explains declarative query languages, like SQL, where you say what you want and the database figures out how. That is why the database can get faster over the years without you rewriting your queries.

// Declarative thinking with LINQ over EF Core: you describe the result,
// the provider decides how to fetch it. This mirrors the book's point.
var topCustomers = await db.Orders
    .GroupBy(o => o.CustomerId)
    .Select(g => new { CustomerId = g.Key, Total = g.Sum(o => o.Amount) })
    .OrderByDescending(x => x.Total)
    .Take(10)
    .ToListAsync();

Storage engines: what happens on disk

This is one of the most loved parts of the book. It opens the hood and shows how a database actually saves and finds your data.

You learn about two big families of storage engine. Log-structured engines (like LSM-trees, used by Cassandra and RocksDB) write new data quickly by appending it, then tidy up in the background. Page-oriented engines (like B-trees, used by most SQL databases) keep data in a sorted tree on disk and update it in place. The trade-off is roughly: log-structured is often faster to write, B-trees are often faster and steadier to read.
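To make the "append, then tidy up" idea concrete, here is a minimal in-memory sketch. It is an assumption-laden toy, not how any real engine is implemented: `AppendOnlyStore` is a hypothetical name, the log is a list rather than a file on disk, and compaction is called by hand rather than run in the background.

```csharp
using System;
using System.Collections.Generic;

var store = new AppendOnlyStore();
store.Put("user:1", "Ada");
store.Put("user:2", "Grace");
store.Put("user:1", "Ada Lovelace");     // appended, not overwritten in place

store.TryGet("user:1", out var name);
Console.WriteLine(name);                 // Ada Lovelace
Console.WriteLine(store.LogLength);      // 3
store.Compact();
Console.WriteLine(store.LogLength);      // 2

// Hypothetical stand-in for a log-structured engine.
public sealed class AppendOnlyStore
{
    // The log: records are only ever added at the end.
    private List<KeyValuePair<string, string>> _log = new();

    // Lookup table: key -> position of the newest record for that key.
    private Dictionary<string, int> _index = new();

    public int LogLength => _log.Count;

    public void Put(string key, string value)
    {
        _log.Add(new KeyValuePair<string, string>(key, value)); // sequential append
        _index[key] = _log.Count - 1;                           // point at the newest record
    }

    public bool TryGet(string key, out string value)
    {
        if (_index.TryGetValue(key, out var position))
        {
            value = _log[position].Value;
            return true;
        }
        value = "";
        return false;
    }

    // Compaction: keep only the newest record per key, then rebuild the lookup table.
    public void Compact()
    {
        var compacted = new List<KeyValuePair<string, string>>();
        var newIndex = new Dictionary<string, int>();
        foreach (var (key, position) in _index)
        {
            newIndex[key] = compacted.Count;
            compacted.Add(_log[position]);
        }
        _log = compacted;
        _index = newIndex;
    }
}
```

Writes never search for the old value, which is why appending is cheap. The cost shows up later, as the background work of throwing away stale records.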

You also learn why indexes make reads fast but make writes a little slower. An index is an extra copy of some data, sorted so you can find things quickly. The library analogy fits: a card catalog helps you find a book fast, but every new book means updating the catalog too.

A write goes to the table and to every index, which is why more indexes slow writes down.

The practical takeaway for a .NET developer: an index is a deal. You trade slightly slower writes for much faster reads. Add them where you read a lot, not everywhere.

// Telling EF Core to create an index: the same trade-off the book describes.
modelBuilder.Entity<Customer>()
    .HasIndex(c => c.Email)   // fast lookups by email
    .IsUnique();              // also enforces no duplicate emails

Replication: keeping copies safe

Now we cross into part two: distributed data. The first big topic is replication, which means keeping copies of the same data on several machines.

Why bother? Three reasons the book stresses. To keep data close to users (lower delay), to keep serving if one machine dies (reliability), and to spread out read load (scalability). Back to the library: keeping copies of a popular book in several rooms means losing one is fine, and more people can read at once.

The hard part is keeping the copies in sync. The book walks through single-leader (one machine takes all writes and copies them out), multi-leader (several machines accept writes), and leaderless (any machine can take a write). Single-leader is the most common and the easiest to reason about.

Single-leader replication

  1. Write: all writes go to the leader.
  2. Leader: records the change in its log.
  3. Followers: copy the log and apply it.
  4. Reads: clients read from any copy.

One leader takes writes, then streams them to followers that serve reads.

This is also where you meet replication lag. A follower might be a moment behind the leader. So a user can save a comment, then refresh and not see it, because they read from a follower that has not caught up yet. The book names the guarantees that fix this, like "read your own writes". The outbox pattern tackles a neighbouring problem: making sure a saved change and the message that announces it stay in step.
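Replication lag and one way to get "read your own writes" can be sketched with two in-memory replicas. Everything here is a simplifying assumption: `Replica` is a hypothetical class, the "version" is just the number of log entries applied, and catching up is triggered by hand instead of streaming over a network.

```csharp
using System;
using System.Collections.Generic;

var leader = new Replica();
var follower = new Replica();

// The user saves a comment: the write goes to the leader only.
long writtenAt = leader.Write("comment:42", "Nice post!");

// The follower has not replicated yet, so a plain follower read misses it.
Console.WriteLine(follower.TryRead("comment:42", out _));   // False

// Read your own writes: use the leader while the follower is behind
// the version this user last wrote.
var source = follower.AppliedVersion >= writtenAt ? follower : leader;
source.TryRead("comment:42", out var seen);
Console.WriteLine(seen);                                    // Nice post!

// Once the follower catches up, it can serve the read itself.
follower.CatchUpFrom(leader);
Console.WriteLine(follower.TryRead("comment:42", out _));   // True

// Hypothetical replica: a log of changes plus the data built from it.
public sealed class Replica
{
    private readonly List<(string Key, string Value)> _log = new();
    private readonly Dictionary<string, string> _data = new();

    // Number of log entries applied so far; doubles as a version number.
    public long AppliedVersion => _log.Count;

    public long Write(string key, string value)
    {
        _log.Add((key, value));
        _data[key] = value;
        return AppliedVersion;
    }

    // Copy and apply any log entries this replica has not seen yet.
    public void CatchUpFrom(Replica leader)
    {
        for (var i = _log.Count; i < leader._log.Count; i++)
        {
            var (key, value) = leader._log[i];
            Write(key, value);
        }
    }

    public bool TryRead(string key, out string value)
    {
        if (_data.TryGetValue(key, out var found))
        {
            value = found;
            return true;
        }
        value = "";
        return false;
    }
}
```

The routing rule is the interesting part: the client remembers the version of its last write and only trusts a copy that has reached it.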

Partitioning: splitting big data

When data is too big for one machine, you partition it (also called sharding). You split it into pieces and put each piece on a different machine. The library splits books across rooms by topic; a database splits rows across machines by some key.

The trick is choosing a good partition key. A bad key creates a hot spot, where one machine gets most of the traffic while the others sit idle. Imagine putting every popular book in one room: that room gets mobbed while the rest stay empty. The book explains splitting by key range versus splitting by a hash of the key, and the trade-offs of each.

| Partition strategy | How it splits | Watch out for |
| --- | --- | --- |
| By key range | A to M on one node, N to Z on another | Hot spots if some ranges are far busier |
| By hash of key | Spread keys evenly with a hash | You lose easy range scans |

Partitioning and replication usually work together. You split the data into partitions, then keep replicated copies of each partition. Now your data is both spread out and safe.
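The two strategies can be sketched side by side. This is an illustration under stated assumptions: `Partitioner` is a hypothetical helper, the range split uses only the first letter of the key, and the hash is FNV-1a, chosen here simply because it is short and stable across runs. (`string.GetHashCode` is randomized per process on modern .NET, so it is a poor fit for deciding where data lives.)

```csharp
using System;
using System.Text;

const int partitionCount = 4;

// Range partitioning keeps neighbouring keys together.
Console.WriteLine(Partitioner.ByKeyRange("alice"));  // 0
Console.WriteLine(Partitioner.ByKeyRange("zoe"));    // 1

// Hash partitioning depends only on the hash, so similar keys may land far apart.
foreach (var key in new[] { "alice", "alicia", "zoe" })
    Console.WriteLine($"{key} -> partition {Partitioner.ByHash(key, partitionCount)}");

// Hypothetical helper showing both strategies.
public static class Partitioner
{
    // Key range: first letter A to M goes to partition 0, anything later to partition 1.
    public static int ByKeyRange(string key)
    {
        var first = char.ToUpperInvariant(key[0]);
        return first <= 'M' ? 0 : 1;
    }

    // Hash: 32-bit FNV-1a over the UTF-8 bytes, then modulo the partition count.
    public static int ByHash(string key, int partitionCount)
    {
        unchecked
        {
            uint hash = 2166136261u;
            foreach (var b in Encoding.UTF8.GetBytes(key))
            {
                hash ^= b;
                hash *= 16777619u;
            }
            return (int)(hash % (uint)partitionCount);
        }
    }
}
```

A range scan such as "all customers from A to C" touches one partition under the first scheme but has to ask every partition under the second, which is the trade-off named in the table above.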

Transactions: all or nothing

A transaction groups several steps so they all happen, or none of them do. The classic example is moving money: take from one account, add to another. If the system crashes halfway, you must not end up with money taken but never added.

The book uses the famous ACID letters to describe strong transactions:

  • Atomic. All steps happen, or none do. No half-done state.
  • Consistent. The data follows your rules before and after.
  • Isolated. Concurrent transactions do not trip over each other.
  • Durable. Once saved, it survives a crash.

The deepest part here is isolation levels. When many transactions run at once, weird bugs can appear, like dirty reads or lost updates. Isolation levels are dials that trade speed for safety. Read Committed is loose and fast. Snapshot Isolation gives each transaction a steady view. Serializable is the strictest: it behaves as if transactions ran one at a time, which is safest but can be slower.

// A transaction in EF Core: both saves commit together, or neither does.
await using var tx = await db.Database.BeginTransactionAsync();
try
{
    fromAccount.Balance -= amount;
    toAccount.Balance   += amount;
    await db.SaveChangesAsync();
    await tx.CommitAsync();   // all or nothing
}
catch
{
    await tx.RollbackAsync(); // crash-safe: undo everything
    throw;
}
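The lost update mentioned above is easy to reproduce: two requests read the same balance, and the second write silently overwrites the first. One common defence is a compare-and-set on a version number. The sketch below is a single-process illustration with a hypothetical `VersionedBalance` class; in a real database the version would be a column, and EF Core's concurrency tokens serve a similar purpose.

```csharp
using System;

var account = new VersionedBalance(100m);

// Two requests read the same starting state.
var (balanceA, versionA) = account.Read();
var (balanceB, versionB) = account.Read();

// A writes first and succeeds.
Console.WriteLine(account.TryWrite(balanceA - 30m, versionA)); // True

// B's write is based on a stale read, so it is rejected
// instead of silently overwriting A's change.
Console.WriteLine(account.TryWrite(balanceB - 50m, versionB)); // False

// B retries from a fresh read and now succeeds.
var (fresh, freshVersion) = account.Read();
account.TryWrite(fresh - 50m, freshVersion);
Console.WriteLine(account.Read().Balance);                     // 20

// Hypothetical value holder with a version counter.
public sealed class VersionedBalance
{
    private readonly object _gate = new();
    private decimal _balance;
    private int _version;

    public VersionedBalance(decimal opening) => _balance = opening;

    public (decimal Balance, int Version) Read()
    {
        lock (_gate) return (_balance, _version);
    }

    // Compare-and-set: write only if nobody else has written since our read.
    public bool TryWrite(decimal newBalance, int expectedVersion)
    {
        lock (_gate)
        {
            if (_version != expectedVersion) return false;
            _balance = newBalance;
            _version++;
            return true;
        }
    }
}
```

Without the version check, B's write would have set the balance to 50 and A's withdrawal of 30 would have vanished.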

The trouble with distributed systems

Once your data lives on many machines, a whole new world of trouble opens up, and the book is honest about it. Networks drop messages or delay them. Clocks on different machines disagree. A machine can pause for a moment (for example during garbage collection) and look dead when it is not.

The scary part is partial failure. On one machine, things either work or crash. Across many machines, some parts work while others quietly fail, and you often cannot tell which. A message you sent might have arrived, or might not. You simply do not know.

The book's calm advice: stop assuming the network is reliable. Design for messages that get lost, arrive twice, or arrive late. This is why idempotent operations matter so much. An idempotent operation can run twice and still give the same result, so a retry is safe.

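// Idempotency: a duplicate request with the same key does no extra harm.
// The check below is not enough on its own: two concurrent retries can both
// pass it. This sketch assumes a unique index on Charge.Key, so the database
// rejects the second insert rather than recording a double charge.
async Task ChargeOnce(string idempotencyKey, decimal amount)
{
    if (await db.Charges.AnyAsync(c => c.Key == idempotencyKey))
        return; // already done; a safe retry, not a double charge

    db.Charges.Add(new Charge { Key = idempotencyKey, Amount = amount });
    await db.SaveChangesAsync();
}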

Consistency and consensus

When copies of data live on many machines, what does it mean for them to "agree"? The book explores a spectrum of guarantees.

Eventual consistency is loose: all copies will agree eventually, but for a moment they may differ. Linearizability is strict: the whole system behaves as if there is one single, up-to-date copy, so once a write completes, every later read sees that value or a newer one.

Strict guarantees are easier to reason about but cost speed and availability. This is the heart of the CAP theorem: when the network splits, you must choose between staying consistent and staying available. You cannot have both during a split.

The book also covers consensus: how a group of machines agrees on a single value even when some are slow or failing. Consensus quietly powers leader election, distributed locks, and more. You do not usually write it yourself, but knowing it exists explains why tools like ZooKeeper and etcd matter.

Strong guarantees cost more; loose guarantees are cheaper but can show stale data.
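The counting rule behind this trade-off fits in a few lines. The sketch below is not a consensus algorithm, only the majority rule that such algorithms typically build on, and `Quorum` is a hypothetical helper. It shows why, when five machines split three against two, only one side can keep making decisions.

```csharp
using System;

const int nodes = 5;
Console.WriteLine(Quorum.MajorityOf(nodes));   // 3

// A network split leaves 3 nodes on one side and 2 on the other.
Console.WriteLine(Quorum.CanDecide(3, nodes)); // True: this side stays consistent and keeps working
Console.WriteLine(Quorum.CanDecide(2, nodes)); // False: this side must refuse writes or risk disagreement

// Hypothetical helper: the strict-majority rule.
public static class Quorum
{
    // Smallest group that is more than half of the cluster.
    public static int MajorityOf(int nodeCount) => nodeCount / 2 + 1;

    // Two disjoint groups can never both reach a majority,
    // so two conflicting decisions cannot both be accepted.
    public static bool CanDecide(int reachableNodes, int nodeCount) =>
        reachableNodes >= MajorityOf(nodeCount);
}
```

The minority side is the "not available" half of the choice: it is still running, but it cannot safely decide anything until the split heals.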

Batch versus stream processing

The final part is about derived data: turning your raw data into useful results, like reports, search indexes, and recommendations.

There are two big styles. Batch processing takes a large, fixed pile of data and crunches it all at once, often overnight. Think of the library's back office re-indexing every book after closing. It is great for big, slow, accurate jobs.

Stream processing handles events one at a time, as they happen. Think of the front desk reacting to each visitor the moment they arrive. It is great for live dashboards, alerts, and keeping things fresh. The book shows that the two are closer than they look: a stream is really just a never-ending batch.

| Style | When data arrives | Good for |
| --- | --- | --- |
| Batch | A big fixed set, processed later | Nightly reports, full re-indexes |
| Stream | One event at a time, right now | Live alerts, real-time dashboards |

In the .NET world this maps to message systems and event handling. A queue of events that your service reacts to, one by one, is stream processing in practice.

// A tiny stream-style handler: react to each event as it arrives.
await foreach (var orderEvent in eventStream.ReadAllAsync(cancellationToken))
{
    if (orderEvent.Type == "OrderPlaced")
        await UpdateLiveSalesDashboard(orderEvent);
}
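To see how close the two styles are, here is the same sales total computed both ways. This is a minimal sketch with made-up amounts and a hypothetical `RunningTotal` class: the batch version sees the whole set at once, while the stream version keeps a small piece of state and updates it per event.

```csharp
using System;
using System.Linq;

var orders = new[] { 20m, 35m, 45m };

// Batch: the whole set is available, so compute the answer in one pass.
Console.WriteLine(orders.Sum());            // 100

// Stream: events arrive one at a time, and the answer is kept up to date.
var running = new RunningTotal();
foreach (var amount in orders)
    Console.WriteLine(running.Add(amount)); // 20, then 55, then 100

// Hypothetical stream-side state: just the total so far.
public sealed class RunningTotal
{
    public decimal Total { get; private set; }

    // Fold one event into the state and return the updated total.
    public decimal Add(decimal amount) => Total += amount;
}
```

Both paths end on the same number. The difference is when you get it: once at the end, or continuously along the way.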

How the book is laid out

Knowing the shape of the book helps you plan your reading. The first edition has twelve chapters in three parts, and the chapter numbers on this page follow that edition.

| Part | Chapters | Theme |
| --- | --- | --- |
| I. Foundations of Data Systems | 1 to 4 | Reliability, data models, storage, encoding |
| II. Distributed Data | 5 to 9 | Replication, partitioning, transactions, distributed trouble, consistency |
| III. Derived Data | 10 to 12 | Batch processing, stream processing, the future |

Part I works even on a single machine. Part II is where things get distributed and hard. Part III ties it together into real pipelines.

Who this book suits

  • Intermediate-to-senior .NET developers who already build apps with databases and now want to understand what is really happening underneath.
  • Anyone building systems with real data or real scale. If you worry about losing data, slow queries, or copies going out of sync, this book gives you the vocabulary and the mental models.
  • People moving toward a modular monolith or microservices. The chapters on replication, partitioning, and distributed trouble explain exactly the pain you meet when one system becomes many.

It is less ideal as a first programming book. You should be comfortable writing C# and have used a database before. The lessons land hardest once you have felt a real data problem in production.

A note on modern .NET

The book is deliberately tool-neutral, which is why it has aged so well. Its ideas apply just as cleanly on .NET 10, the current LTS release, as on anything else. The transaction code, the retry logic, and the idempotency keys above are everyday .NET patterns built straight on the book's lessons.

C# 14 has shipped, and C# 15 brings union types in the .NET 11 preview, which can make modelling "either a success or one of these errors" cleaner when you handle the partial-failure cases the book warns about. None of that changes the core ideas. Storage, replication, and consistency work the same whatever language you write in.

One practical heads-up while you build the messaging and streaming pipelines the book describes: some popular .NET libraries, including MediatR, MassTransit, and AutoMapper, have moved to commercial licenses. That is a tooling and budget choice, not a design one. Every concept in DDIA can be built with the standard .NET SDK, the built-in System.Threading.Channels, and a database of your choice.
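As a rough illustration of how far the built-in pieces go, here is a tiny in-process event pipeline on `System.Threading.Channels`. The event names and the `EventPump` helper are hypothetical; a production pipeline would add error handling, back-pressure (a bounded channel), and a durable broker if events must survive a restart.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

var channel = Channel.CreateUnbounded<string>();

// Producer: publishes events, then signals that no more will come.
var producer = Task.Run(async () =>
{
    foreach (var evt in new[] { "OrderPlaced", "OrderPaid", "OrderShipped" })
        await channel.Writer.WriteAsync(evt);
    channel.Writer.Complete();
});

// Consumer: handles each event as it arrives, in order.
var handled = await EventPump.DrainAsync(channel.Reader);
await producer;
Console.WriteLine(string.Join(", ", handled)); // OrderPlaced, OrderPaid, OrderShipped

// Hypothetical consumer helper.
public static class EventPump
{
    // Reads until the writer completes and returns the events in arrival order.
    public static async Task<List<string>> DrainAsync(ChannelReader<string> reader)
    {
        var handled = new List<string>();
        await foreach (var evt in reader.ReadAllAsync())
            handled.Add(evt);
        return handled;
    }
}
```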

A chapter-by-chapter reading plan

DDIA is dense. Do not rush it. One chapter per week, with notes, is a realistic and rewarding pace.

A steady reading plan

  1. Weeks 1-2: Ch 1-2, goals and data models
  2. Weeks 3-4: Ch 3-4, storage and encoding
  3. Weeks 5-7: Ch 5-7, replication, partitioning, transactions
  4. Weeks 8-9: Ch 8-9, distributed trouble and consistency
  5. Weeks 10-12: Ch 10-12, batch, stream, the future

Read one chapter per session and write down one real example for each.

  1. Chapter 1 — Reliable, scalable, maintainable. The compass for the whole book. Read it slowly.
  2. Chapter 2 — Data models and query languages. Relational versus document versus graph.
  3. Chapter 3 — Storage and retrieval. How databases actually save and find data. A favourite.
  4. Chapter 4 — Encoding and evolution. How data formats change over time without breaking old code.
  5. Chapter 5 — Replication. Copies of data and the lag that comes with them.
  6. Chapter 6 — Partitioning. Splitting big data and avoiding hot spots.
  7. Chapter 7 — Transactions. ACID and isolation levels. Read it twice.
  8. Chapter 8 — The trouble with distributed systems. Networks, clocks, and partial failure. Humbling.
  9. Chapter 9 — Consistency and consensus. The spectrum from eventual to linearizable.
  10. Chapter 10 — Batch processing. Big jobs over fixed data.
  11. Chapter 11 — Stream processing. Reacting to events as they happen.
  12. Chapter 12 — The future of data systems. Tying it all together into real pipelines.

Tips while you read

  • Keep a notebook. After each chapter, write one real example from your own work that the chapter explains.
  • Do not memorise. Aim to recognise. The win is hearing "replication lag" later and knowing exactly what it means.
  • Read chapters 7 and 8 twice. Transactions and distributed trouble are the densest and the most useful.
  • Pair it with our outbox pattern guide, which puts the replication and consistency lessons into a concrete .NET shape.

Quick recap

  • DDIA teaches the ideas under all databases: storage, replication, partitioning, transactions, consistency, and processing.
  • A data-intensive app is like a huge, busy library: keep copies, split the shelves, and never hand out the same last book twice.
  • Every system aims to be reliable, scalable, and maintainable, and to stop a small fault from becoming a big failure.
  • Replication keeps copies safe but brings lag; partitioning splits big data but risks hot spots.
  • Transactions and isolation levels keep data correct when many users change it at once.
  • Distributed systems bring partial failure, so design for retries and make operations idempotent.
  • Batch crunches a fixed pile of data; stream reacts to events live. A stream is just a never-ending batch.
  • The ideas fit perfectly on .NET 10 and C# 14, because the book is about concepts, not one tool.

References and further reading