Modern Databases, Warehousing, and Lakehouses

Summary

Surveys the database categories the modern IS professional must select among — analytical/OLAP, key-value, column-family, document, graph — alongside data warehouses, dimensional modeling, data lakes, and lakehouses, with the workload tradeoffs that drive selection.

Role in the course: Move students past 'database = MySQL' into the polyglot reality: which store fits which workload, and why.

Concepts Covered

This chapter covers the following 30 concepts from the learning graph:

  1. Relational Database
  2. PostgreSQL
  3. MySQL
  4. SQL Server
  5. Analytical Database
  6. OLTP vs OLAP
  7. Data Warehouse
  8. Star Schema
  9. Snowflake Schema
  10. Fact Table
  11. Dimension Table
  12. Slowly Changing Dimension
  13. Columnar Storage
  14. Snowflake Platform
  15. Amazon Redshift
  16. Key-Value Store
  17. Redis
  18. DynamoDB
  19. Column-Family Store
  20. Apache Cassandra
  21. HBase
  22. Document Database
  23. MongoDB
  24. JSON Document
  25. XML Document
  26. Database Selection
  27. CAP Theorem
  28. Polyglot Persistence
  29. Data Lake
  30. Data Lakehouse

Prerequisites

This chapter builds on concepts from:


Welcome back

Welcome to Chapter 7. Last chapter you mastered the relational database — the workhorse that has quietly run civilization for fifty years. This chapter is the part where civilization gets interesting. We are going to walk through eight different kinds of databases, two ways to organize a warehouse, three generations of analytical platform, and the one theorem (CAP) that explains why you cannot have everything you want at once. By the end you will know why a sane organization runs five databases instead of one, why your shopping cart, your fraud detector, and your CFO's quarterly report all live in different stores, and why the phrase "one database to rule them all" is the relational equivalent of "I'm sure this email from a Nigerian prince is legitimate." Let's go.

Why One Database Doesn't Fit All

If the relational database is so good — and it is, genuinely, an engineering marvel — why does the modern IS professional need to learn five other kinds? The answer is that the world threw five new workloads at databases over the last two decades, and each workload had requirements the relational model was never designed to satisfy. A relational database is general-purpose: it does many things adequately. The rise of new categories happened because, for some workloads, "adequate" was several orders of magnitude away from "good enough."

The five workloads that broke the one-database-fits-all assumption were these. Web-scale traffic — millions of concurrent users hitting a shopping cart at Black Friday — needed sub-millisecond reads and horizontal scale that relational systems struggled to deliver. Analytical reporting — scanning a billion rows to compute a quarterly summary — needed a storage layout fundamentally different from the row-oriented format relational engines used for transaction processing. Flexible documents — the JSON payloads flowing between web services and mobile apps — needed a store that did not require a schema migration every time a developer added a field. Time-series telemetry — the firehose of metrics from every sensor, container, and microservice — needed write throughput in the hundreds of thousands of events per second. And graph queries — find the shortest path between two people in a social network, or trace the suppliers behind every component in a product — needed a data model where the relationships were first-class citizens, not joins computed on the fly.

The systems-thinking insight is that there is no neutral default. Picking a single database for every workload is itself a choice, with consequences. The unintended consequence of one-database-fits-all is that every workload gets a poor fit on at least one critical dimension — and in IS, the dimension that is poorly fit is usually the one that goes wrong at 3 AM. The unintended consequence of one-database-per-workload-no-matter-how-small, on the other hand, is operational nightmare: more upgrades, more backups, more skills, more on-call rotations. The right answer lies between these extremes and has its own name — polyglot persistence — which we will get to. First, the categories.

A Refresher: The Relational Database in 2026

Let's anchor on familiar ground. A Relational Database is a database organized as a collection of tables (relations) made of rows and columns, with strong schema enforcement, ACID transactions, and SQL as the query language. The underlying model has scarcely changed since Codd proposed it in 1970, and yet the relational database has remained the default choice for new applications for half a century — because the model fits an enormous range of business problems extraordinarily well.

The three relational systems an IS graduate must recognize on a resume are PostgreSQL, MySQL, and SQL Server.

PostgreSQL is the open-source relational database that has, over the last decade, quietly become the default for new applications across the industry. It is famously rigorous about correctness — it implements the SQL standard more faithfully than most competitors — and famously extensible: PostgreSQL supports JSON columns (with indexing!), full-text search, geospatial queries (via PostGIS), time-series workloads (via TimescaleDB), and even vector similarity search for AI applications (via pgvector). PostgreSQL's superpower is that it can absorb new workloads as extensions rather than forcing the team to adopt a separate database for each one. If a small team needs to commit to one database and not regret the choice in three years, PostgreSQL is almost always the right call.

MySQL is the other major open-source relational database, and the one that powered most of the early web — WordPress, Drupal, the LAMP stack, the original Facebook backend. MySQL is famous for its operational simplicity, its enormous installed base, and its focus on read-heavy workloads. The InnoDB storage engine gives it ACID guarantees and MVCC concurrency control. MySQL is owned by Oracle today; its community fork, MariaDB, is maintained independently and tracks closely. MySQL is still an excellent choice when the workload is well-understood, the team has MySQL expertise, and the schema does not need PostgreSQL's extensibility.

SQL Server is Microsoft's enterprise relational database, the dominant choice in shops that already run on the Microsoft stack — .NET applications, Windows Server, Active Directory. SQL Server has historically led on tooling (SQL Server Management Studio, integrated reporting, SSIS for ETL), has tight integration with Microsoft's broader data platform (Power BI, Azure Synapse), and offers a free Express edition for small workloads. The licensing cost on production workloads can be substantial — but for organizations already paying the Microsoft tax, SQL Server is often the path of least resistance.

| System | License | Sweet spot | Notable strengths |
|---|---|---|---|
| PostgreSQL | Open source (PostgreSQL License) | New applications, extensibility, AI workloads | JSON, vector, geospatial, time-series extensions |
| MySQL | Open source (GPL) / Oracle commercial | Web stacks, read-heavy applications | Simplicity, replication, enormous community |
| SQL Server | Commercial (Microsoft) | Microsoft-shop enterprises, BI integration | Tooling, Power BI integration, T-SQL extensions |

All three are relational, all three speak SQL, and all three handle the same broad workloads. The differences live at the margins — extensibility, licensing, ecosystem — and those margins are exactly where the IS professional earns their salary by picking correctly.

OLTP vs OLAP: Two Different Beasts

The most consequential workload distinction in the database world is the split between transaction processing and analytical processing. OLTP vs OLAP — Online Transaction Processing versus Online Analytical Processing — names the two. They sound similar. They are not.

OLTP workloads are what most relational databases were built for: short, frequent, write-heavy operations driven by users interacting with applications. A customer places an order. A bank credits an account. A nurse logs a vital sign. Each transaction touches a few rows, completes in milliseconds, and must be ACID-compliant because the consequence of getting it wrong is real money or real lives. OLTP systems prize low latency per transaction, high concurrency, and strong consistency. They are typically normalized (so each fact is stored once) and row-oriented (so reading or writing a complete row is fast).

OLAP workloads are the opposite shape: long, infrequent, read-heavy queries that scan millions or billions of rows to produce summary results. What were our sales by region by quarter for the last five years? Which customers have churn risk in the next 30 days? Show me a cohort retention analysis broken down by acquisition channel. OLAP queries touch huge swaths of the data, complete in seconds-to-minutes (not milliseconds), and tolerate slightly stale data because nobody re-runs a quarterly report a hundred times a second. OLAP systems prize high throughput per query, fast aggregation over many rows, and batch freshness. They are typically denormalized (so reports do not require expensive JOINs) and columnar (so reading just the three columns a query needs is fast).

| Dimension | OLTP | OLAP |
|---|---|---|
| Workload shape | Many short transactions | Few long analytical queries |
| Read/write mix | Heavy writes, point reads | Heavy reads, batch loads |
| Rows per query | A handful | Millions to billions |
| Latency target | Milliseconds | Seconds to minutes |
| Schema discipline | Normalized (3NF) | Denormalized (star/snowflake) |
| Storage layout | Row-oriented | Column-oriented |
| Concurrency | Hundreds to millions of clients | Tens to hundreds of analyst sessions |
| Data freshness | Real-time | Batch-loaded (minutes to hours) |
| Example queries | "Insert one order" | "Sum sales by region for last five years" |
| Example systems | PostgreSQL, MySQL, SQL Server | Snowflake, Redshift, BigQuery, Databricks |

The structural reason these two workloads cannot share a single database — well, they can, and a small organization often does — is that every design decision favors one at the expense of the other. Normalize for write integrity, and OLAP queries get slow. Index for OLAP scans, and OLTP writes get slow. Add columns for analytical attributes, and the OLTP row gets fat and slow. Run a billion-row aggregation, and the OLTP latency spikes to ten seconds for everyone using the application. The modern architecture acknowledges this by physically separating the two workloads — OLTP up front, with replication or change-data-capture feeding a downstream OLAP system. Two databases, two workloads, each one happy.
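
To make the shape difference concrete, here is a minimal sketch of the two workloads side by side, using Python with psycopg2 against a hypothetical orders table — the connection string, table, and column names are illustrative assumptions, not a prescribed schema.

```python
import psycopg2  # assumes a reachable PostgreSQL with an 'orders' table

conn = psycopg2.connect("dbname=shop")
cur = conn.cursor()

# OLTP shape: touch a few rows, finish in milliseconds, commit atomically.
cur.execute(
    "INSERT INTO orders (customer_id, order_date, total) VALUES (%s, %s, %s)",
    (42, "2026-04-28", 199.99),
)
conn.commit()

# OLAP shape: scan years of history, aggregate, return a small summary.
cur.execute(
    "SELECT region, SUM(total) FROM orders "
    "WHERE order_date >= '2021-01-01' GROUP BY region"
)
print(cur.fetchall())
```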

Analytical Databases and Columnar Storage

An Analytical Database is a database engine designed specifically for OLAP workloads — large scans, aggregations, and joins over data that is loaded in batch and rarely updated row-by-row. The defining technical choice that separates analytical databases from operational ones is how the data is laid out on disk.

Columnar Storage is the layout pattern where the values of a single column are stored contiguously, rather than the values of a single row. In a row-oriented database (the default for OLTP systems), the bytes for (customer_id=42, order_date='2026-04-28', total=$199.99) are stored next to each other on disk, because reading or writing a complete row is the dominant access pattern. In a column-oriented database, all the customer_id values are stored together, then all the order_date values, then all the total values, because analytical queries usually need just a few columns out of many — and reading only those columns means reading a fraction of the disk.

The performance difference is not subtle. An analytical query that sums total across a billion rows in a row store must read the entire billion rows from disk (every column of every row), even though it only needs one column. The same query against a column store reads just the total column — perhaps 1% of the total bytes. Combine that with column-level compression (a column of country_code values has very low cardinality and compresses 10-50x) and vectorized execution (modern CPUs can process columnar data in SIMD batches), and analytical workloads can run 100x faster on columnar storage than on row storage.
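
The arithmetic behind that claim is worth doing once by hand. A back-of-envelope sketch, with assumed sizes (a 100-byte average row, an 8-byte total column, 10x columnar compression):

```python
rows = 1_000_000_000

row_bytes_per_row = 100    # assumed average row width in a row store
col_bytes_per_val = 8      # one 8-byte 'total' value per row
compression = 10           # assumed columnar compression ratio

row_store_read = rows * row_bytes_per_row                # whole rows, every time
col_store_read = rows * col_bytes_per_val / compression  # one compressed column

print(f"row store reads : {row_store_read / 1e9:,.0f} GB")
print(f"column store    : {col_store_read / 1e9:,.1f} GB")
print(f"advantage       : {row_store_read / col_store_read:,.0f}x fewer bytes")
```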

Row Storage vs Columnar Storage — the Same Data, Two Layouts

Type: interactive-infographic
sim-id: row-vs-columnar-storage
Library: p5.js
Status: Specified

A p5.js side-by-side animation showing the same six-row, four-column table (Order ID, Customer ID, Order Date, Total) laid out two ways: row-oriented (top panel) and column-oriented (bottom panel). Each cell is rendered as a small block; the row layout shows the cells grouped by row (six clusters of four blocks); the column layout shows them grouped by column (four clusters of six blocks). When the user issues an analytical query — "SUM(total) WHERE order_date > '2026-01-01'" — the simulation animates which blocks of disk are read to satisfy the query under each layout. The row layout reads every block on disk (highlighted in coral); the column layout reads only the Order Date and Total blocks (highlighted in mascot-emerald). A counter at the bottom shows bytes-read for each layout, with a dramatic ratio.

A second toggle issues an OLTP-style query — "SELECT * WHERE order_id = 1003" — and the visualization reverses: the row layout reads one contiguous block (mascot-emerald), the column layout must seek to four locations (coral). The simulation makes the workload-fit insight visible: each layout is great for one workload and poor for the other.

Color palette: row blocks in mascot-emerald, column blocks in mascot-magenta, disk reads highlighted in pulsing coral, query-text panel in slate gray with white text. A small disk-icon counter visualizes the bytes-read difference.

Interactivity: Query-type dropdown (Analytical scan, Row lookup, Column aggregation, Multi-column filter), Play/Step buttons, "Add compression" toggle that visually shrinks columnar blocks (showing the additional 10x compression advantage), and a "Show real numbers" toggle that swaps the toy six-row table for a billion-row table with realistic byte counts.

Layout: two horizontal panels stacked vertically, query control at top, byte-counter at bottom. Full canvas width, height ~600px. Canvas resizes on window resize. setup() calls updateCanvasSize() first. Canvas parented with canvas.parent(document.querySelector('main'));.

Learning objective (Bloom: Analyzing): Students can explain why row storage is fast for OLTP and slow for OLAP, why column storage is the inverse, and how column-level compression amplifies the analytical advantage.

Implementation: p5.js, deployed at /information-systems/sims/row-vs-columnar-storage/.

The footgun in storage layout is using a row-oriented database for analytical workloads, then "tuning" it for years before realizing the layout itself is wrong. The footgun has all three of the classic properties: it is silent (the reports just run slowly, not incorrectly), easy to trigger (the application database is already there, why not query it directly?), and delayed (a workload that runs fine on a million rows runs catastrophically on a billion). The structural fix is to separate the analytical workload onto a columnar engine before the data volume forces the issue, not after.

The Data Warehouse: A Place For Reports To Live

A Data Warehouse is a centralized analytical database that consolidates data from many operational sources, transforms it into a consistent shape, and serves it to reporting and BI tools. The modern data warehouse is the single most important architectural pattern in business intelligence, and every IS graduate will work with one — directly or indirectly — within a year of starting their career.

The warehouse pattern was articulated in the 1990s by Bill Inmon and Ralph Kimball; Inmon's classic definition gives the warehouse four defining characteristics. First, the warehouse is subject-oriented — organized around business domains (Sales, Customers, Products, Finance) rather than around the source applications (CRM, ERP, web logs). Second, it is integrated — data from many sources is reconciled into a consistent schema with consistent naming and consistent units. Third, it is time-variant — the warehouse keeps historical snapshots so analysts can see how the business looked last quarter, last year, last decade. Fourth, it is non-volatile — once data lands in the warehouse it is rarely updated; new data is appended.

Data flows into the warehouse through a process called ETL (Extract, Transform, Load) or its more modern variant ELT (Extract, Load, Transform). ETL extracts data from operational systems, transforms it on a separate compute layer, and loads the transformed result into the warehouse. ELT extracts and loads first, then transforms inside the warehouse using its (now formidable) compute power. Both patterns have a place; the modern cloud trend strongly favors ELT.

Dimensional Modeling: Star and Snowflake

Once data is in the warehouse, you need a schema designed for analytical queries — not the normalized 3NF schema that worked beautifully for OLTP. The dominant analytical schema pattern is dimensional modeling, championed by Ralph Kimball, with two main flavors: star and snowflake.

A Star Schema organizes a subject area into one central Fact Table surrounded by several Dimension Tables, with foreign-key relationships radiating outward like the points of a star. The fact table holds the measurements — quantitative events the business cares about (sales transactions, web clicks, sensor readings). Each row in the fact table represents one event, with columns for the numeric measures (quantity, revenue, discount_amount) and foreign keys pointing to the dimensions. The dimension tables hold the context — the descriptive attributes that let an analyst slice and dice the facts ("by region," "by product category," "by month").

A typical retail star schema might have a Sales fact table with foreign keys to a Customer dimension, a Product dimension, a Date dimension, a Store dimension, and a Promotion dimension. A query like "total revenue by product category for Q4 2025 in the Western region" becomes a single fact-to-dimensions JOIN with three filters and a GROUP BY — fast, readable, and the basic shape of nine out of ten BI dashboards in the world.
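
In SQL, that query might look like the sketch below — the table and key names are assumptions chosen to match the schema just described, not a fixed convention.

```python
# "Total revenue by product category for Q4 2025 in the Western region"
# against the retail star schema (names are illustrative).
query = """
SELECT p.category,
       SUM(f.revenue) AS total_revenue
FROM   sales_fact f
JOIN   product_dim p ON p.product_key = f.product_key
JOIN   date_dim    d ON d.date_key    = f.date_key
JOIN   store_dim   s ON s.store_key   = f.store_key
WHERE  d.year = 2025
  AND  d.quarter = 4
  AND  s.region = 'Western'
GROUP  BY p.category;
"""
```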

A Snowflake Schema is a star schema in which the dimension tables themselves are normalized into multiple related tables, producing a structure that — when drawn — looks like a snowflake's crystalline branches. A Product dimension might be split into Product, Category, and Department tables, each with its own primary key. The snowflake schema saves storage (no repetition of category names across products) at the cost of additional JOINs at query time and additional ETL complexity to load.

| Aspect | Star Schema | Snowflake Schema |
|---|---|---|
| Dimension tables | Denormalized, one per dimension | Normalized into multiple related tables |
| Storage | Higher (repeated values) | Lower (no repetition) |
| Query JOINs | Few (fact-to-dimension only) | More (dimension-to-dimension chains) |
| Query speed | Faster | Slower |
| ETL complexity | Lower | Higher |
| Best for | BI tools, ad-hoc analyst queries | Storage-constrained systems, data integrity focus |

In modern cloud warehouses, where storage is cheap and compute is the dominant cost, the star schema is almost always the right answer. Storage savings rarely justify the extra JOINs.

A Retail Star Schema — Sales Facts and Their Dimensions

Type: interactive-infographic
sim-id: retail-star-schema
Library: vis-network
Status: Specified

A vis-network diagram of a five-dimension retail star schema. Center: a Sales fact table with the quantitative measures (quantity, unit_price, revenue, discount, cost_of_goods) and foreign keys to each dimension. Surrounding it: five dimension tables — Customer (customer_id, name, email, segment, lifetime_value, signup_date), Product (product_id, sku, name, category, subcategory, brand), Date (date_id, day, month, quarter, year, day_of_week, is_weekend, is_holiday), Store (store_id, name, region, country, square_feet), and Promotion (promotion_id, name, type, discount_pct, start_date, end_date). The fact table is rendered larger and central in mascot-magenta; dimensions in mascot-emerald with labeled foreign-key edges in dark teal radiating outward.

Interactivity: Hovering the fact table reveals a sample row and explains "this is one sale event — every transaction the business records becomes a row here." Hovering a dimension reveals (a) its grain, (b) its typical SCD type, and (c) two sample analyst questions it enables ("by quarter" for Date, "by region" for Store). A "Convert to Snowflake" toggle animates the Product dimension splitting into Product → Category → Department, the Store dimension splitting into Store → Region → Country, with new edges fading in. A side panel highlights the JOIN-count tradeoff before and after.

To work around the vis-network horizontal-edge label rendering bug, edges have a slight y-offset so labels render correctly on initial load.

Layout: central fact, dimensions arranged in a circle around it, full canvas width, height ~600px. Canvas resizes on window resize.

Learning objective (Bloom: Applying): Students can read a star schema, identify the fact table and dimension tables, write a representative analytical query against the schema, and describe the structural change that converts star to snowflake.

Implementation: vis-network, deployed at /information-systems/sims/retail-star-schema/.

Slowly Changing Dimensions: History Without the Headache

Real dimensions evolve. A customer changes their email address. A product gets reclassified into a new category. A salesperson moves from the Western region to the Eastern region. The question "do we keep the old value, the new value, or both?" has no universally correct answer — it depends on whether historical reports must reflect the world as it was at the time of the event, or as it is today. The pattern that captures these choices is the Slowly Changing Dimension (SCD), and Kimball's numbered SCD types are the standard vocabulary.

  • Type 0: No change allowed. The dimension value is fixed at insert time. Used for true constants (date attributes, original-acquisition channel).
  • Type 1: Overwrite. The new value replaces the old. History is lost. Used when only the current value matters (a customer's display name).
  • Type 2: Add a new row. The old row is closed (with an effective_to date) and a new row is inserted (with effective_from set to today). Both rows share a natural key but have different surrogate keys. Type 2 is the workhorse SCD: it preserves full history at the cost of extra rows.
  • Type 3: Add a column. A previous_value column is maintained alongside the current value. Limited history (one prior value); rarely used.
  • Type 4: History table. The current dimension stays in place; a separate history table accumulates every change. Useful when most queries want the current value but a few want full history.
  • Type 6: Hybrid. Combines Type 1, 2, and 3 attributes within the same dimension row. The most flexible and the most complex.

| SCD type | What happens on change | History preserved? | Typical use |
|---|---|---|---|
| Type 0 | No update permitted | N/A (immutable) | Constants like signup channel |
| Type 1 | Overwrite the row | None | Cosmetic attributes (display name) |
| Type 2 | Insert new row, close old row | Full | Address, region, segment — most attributes |
| Type 3 | Add previous_value column | One step back | Rare; specific attributes |
| Type 4 | Push old row to history table | Full (in history table) | Mostly-current with occasional history |
| Type 6 | Mix of Type 1, 2, and 3 | Full + current view | When you need everything |

The footgun in SCDs is the silent default: when a dimension change is made without an explicit SCD policy, the warehouse usually ends up doing a Type 1 overwrite by accident (because the load script just UPDATEs the row), and historical reports quietly lose their meaning. Quarterly revenue by region for Q1 2024 will retroactively reassign sales from the Western region to the Eastern region, because the salesperson moved last week. The structural fix is to declare an SCD type for every dimension attribute, in writing, and have the load process enforce it.
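
For concreteness, here is a minimal sketch of what an explicit Type 2 load step looks like for the salesperson-changes-region example — every table and column name here is an assumption for illustration:

```python
# Close the current row, then insert the new version (SQL carried as
# strings; a real load tool would parameterize both statements and wrap
# them in one transaction).
close_old_row = """
UPDATE salesperson_dim
SET    effective_to = CURRENT_DATE,
       is_current   = FALSE
WHERE  salesperson_nk = 'SP-117'    -- natural key, survives the change
  AND  is_current;
"""
insert_new_row = """
INSERT INTO salesperson_dim
       (salesperson_nk, name, region, effective_from, effective_to, is_current)
VALUES ('SP-117', 'A. Rivera', 'Eastern', CURRENT_DATE, NULL, TRUE);
"""
# Facts loaded while the old row was current carry its surrogate key, so
# "Q1 2024 revenue by region" keeps its Western attribution forever.
```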

Pause here

The reason SCD Type 2 is the warehouse workhorse is that it makes time travel trivial: every dimension row carries the dates during which it was the truth, and a fact-to-dimension JOIN on the date naturally reconstructs the world as it was at the moment of the fact. The reason most teams botch SCDs anyway is that the choice of SCD type is a reporting decision, not a data-engineering decision, and the people who own the reports are usually not in the room when the load script gets written. Pause here long enough to remember this: every time historical reports change unexpectedly, somebody silently chose Type 1 when the business needed Type 2.

Cloud Warehouse Platforms: Snowflake and Redshift

For most of warehouse history, the warehouse was a single appliance — a Teradata box, an Oracle Exadata, an IBM Netezza — and "scaling up" meant buying a bigger appliance. The 2010s rewrote that story. Cloud warehouse platforms separated storage (which lives in cheap object storage like Amazon S3) from compute (which spins up on demand and disappears when idle), and suddenly the warehouse could scale storage and compute independently.

The Snowflake Platform (the company, not the schema) is the cloud-native warehouse that defined this category. Snowflake separates storage and compute completely: data lives in S3 (or Azure Blob, or Google Cloud Storage), and a virtual warehouse — a transient compute cluster — is spun up to query it. A team running heavy reports at 9 AM can run on a large warehouse; the same team running ad-hoc queries at 3 PM can run on a small one; both are billed only for the seconds they ran. Snowflake popularized features that are now industry-standard: zero-copy cloning (instant logical copies of multi-terabyte datasets), time travel (query the data as of any point in the recent past), secure data sharing (share live datasets across organizations without copying), and Snowpark (run Python and Java code in the warehouse alongside SQL).

Amazon Redshift is AWS's columnar warehouse, originally launched in 2013 as the first mainstream cloud-native data warehouse. Redshift began as a PostgreSQL derivative (by way of the ParAccel analytic database) with a columnar storage engine and a massively parallel processing (MPP) execution layer. The original Redshift coupled storage and compute on the same nodes; the modern Redshift Serverless and Redshift RA3 instances have largely caught up to Snowflake's storage/compute separation. Redshift's strength is its tight integration with the broader AWS data ecosystem — S3, Glue, Athena, Lake Formation, SageMaker — and its appeal to organizations already deep in AWS.

The other major cloud warehouses worth naming on a resume — Google BigQuery, Databricks SQL, Azure Synapse — share the same architectural pattern: separated storage and compute, columnar layout, SQL on top. The category has converged.

| Platform | Vendor | Storage | Compute model | Notable feature |
|---|---|---|---|---|
| Snowflake | Snowflake | Cloud object storage | Virtual warehouses (per-second) | Zero-copy cloning, secure data sharing |
| Redshift | AWS | S3 (RA3) / local SSD | Provisioned or serverless | Tight AWS ecosystem integration |
| BigQuery | Google | Colossus | Serverless slots | No infrastructure to manage |
| Databricks SQL | Databricks | Delta Lake on S3 | SQL warehouses on Spark | Lakehouse-native, ML-native |
| Synapse | Microsoft | ADLS Gen2 | Dedicated or serverless pools | Power BI integration, T-SQL compatibility |

NoSQL: Four Shapes for Four Workloads

The umbrella term NoSQL ("Not Only SQL") covers the family of non-relational databases that emerged in the 2000s to address workloads the relational model handled poorly. NoSQL is a marketing term more than a technical category — it includes wildly different architectures — but it usefully groups four main shapes: key-value, column-family, document, and graph. Each shape optimizes for a specific access pattern and accepts tradeoffs the relational world refuses.

Key-Value Stores: The World's Simplest Database

A Key-Value Store is the simplest possible database: a giant distributed hash map. You put a value under a key. You get the value back by the key. The value is opaque to the database (a string, a JSON blob, a serialized object — the database does not care). The query language is GET key and PUT key value. There are no JOINs, no schemas, no SQL, no relations. What the key-value store does offer is sub-millisecond latency at enormous concurrency, because the operation is so simple the database can be ruthlessly optimized.

Redis is the canonical in-memory key-value store, often called "the Swiss Army knife of caches." Redis stores its working set entirely in RAM (with optional persistence to disk), which is why it can serve millions of operations per second from a single node. Redis values can be more than just strings — they can be lists, sets, sorted sets, hashes, streams, bitmaps, and HyperLogLogs — making Redis a kind of remote data-structure server. Typical Redis workloads include session storage (the cookie maps to a session blob in Redis), application-level caching (the result of an expensive database query is stored in Redis for the next minute), rate limiting (a counter per user per endpoint), and pub/sub messaging.
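
A minimal redis-py sketch of three of those workloads — the host, key names, and TTLs are illustrative assumptions:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)  # assumed local instance

# Session storage: cookie value -> session blob, expiring in 30 minutes.
r.setex("session:abc123", 1800, json.dumps({"user_id": 42, "cart": []}))

# Application-level cache: expensive query result, kept for one minute.
r.setex("cache:top_products", 60, json.dumps(["BK-001", "BK-014"]))

# Rate limiting: one counter per user per endpoint, reset every minute.
key = "rate:42:/api/orders"
if r.incr(key) == 1:
    r.expire(key, 60)  # first hit starts the window
allowed = int(r.get(key)) <= 100
```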

DynamoDB is AWS's managed key-value (and document) store, designed from the start for predictable single-digit-millisecond latency at any scale. DynamoDB shards data across many storage nodes by a partition key, replicates across availability zones, and bills per-request — meaning a small workload pays cents and a Black Friday workload pays dollars. DynamoDB also supports secondary indexes and conditional writes, which lets it serve as a primary database for many cloud-native applications, not just a cache. DynamoDB's design lineage traces directly to the original Amazon Dynamo paper from 2007, which kicked off the entire NoSQL movement.
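
The equivalent put-and-get path in boto3, assuming an existing table named orders keyed by order_id (both are assumptions for illustration):

```python
import boto3
from decimal import Decimal

table = boto3.resource("dynamodb").Table("orders")  # assumed existing table

# Writes and reads are addressed by the partition key, at predictable
# single-digit-millisecond latency.
table.put_item(Item={
    "order_id": "ORD-1003",
    "customer_id": 42,
    "total": Decimal("69.97"),  # DynamoDB numbers travel as Decimal in boto3
})
item = table.get_item(Key={"order_id": "ORD-1003"}).get("Item")
```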

The footgun in key-value stores is treating them like relational databases. A workload that requires "find all orders by customer X" cannot be served efficiently by a key-value store unless you have also maintained a secondary index by customer — and maintaining that index is on you, the application developer. Key-value stores reward access patterns that are known up front and punish access patterns that are discovered later. This makes them a poor fit for exploratory analytical work and a great fit for known operational paths.

Column-Family Stores: Wide Rows for the Web

A Column-Family Store organizes data into rows, but each row can have an arbitrary set of columns grouped into column families, with the columns themselves stored in a way that allows enormous numbers of them per row. The model traces back to Google's 2006 Bigtable paper, which described how Google indexed the web. A typical column-family workload involves tables with millions of rows, each row having hundreds or thousands of columns whose names and values are determined at write time, not by a fixed schema.

Apache Cassandra is the most widely deployed column-family store, originally developed at Facebook, then open-sourced and refined at Apple, Netflix, Instagram, and (it sometimes feels) every other large tech company. Cassandra was designed for write-heavy workloads at planet scale: every write goes to multiple replicas in parallel, the cluster has no single master node (every node is a peer), and the system gracefully tolerates node failures. Cassandra excels at time-series telemetry (every IoT device writes a row), event logs, messaging history, and any workload where the dominant access pattern is "given a partition key, give me a range of recent rows."

HBase is the open-source Bigtable clone, built on top of HDFS (the Hadoop Distributed File System), and historically the column-family store of choice for organizations already running Hadoop. HBase is operationally heavier than Cassandra (it has master nodes, region servers, and a ZooKeeper dependency), but its tight Hadoop integration makes it the natural choice when a column-family store needs to coexist with MapReduce and Spark batch jobs.

The defining tradeoff for column-family stores is the same as for key-value stores, with extra steps. Queries that match the partition key are blisteringly fast; queries that do not match the partition key are slow or impossible. Schema flexibility is high, but query flexibility is low. Choose column-family when the access patterns are write-heavy, append-mostly, and known in advance. Choose something else when the workload is exploratory.
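
Here is what the rewarded access pattern looks like with the Python cassandra-driver — the cluster address, keyspace, and table schema are assumed for illustration:

```python
from datetime import datetime
from cassandra.cluster import Cluster

# Assumes a table readings(device_id, reading_ts, temperature),
# partitioned by device_id and clustered by reading_ts.
session = Cluster(["127.0.0.1"]).connect("telemetry")

# Partition key plus clustering range: one partition, one sequential read.
rows = session.execute(
    "SELECT reading_ts, temperature FROM readings "
    "WHERE device_id = %s AND reading_ts > %s",
    ("sensor-17", datetime(2026, 4, 1)),
)
for row in rows:
    print(row.reading_ts, row.temperature)
```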

Document Databases: Schema, but Flexible

A Document Database stores semi-structured documents — typically JSON or BSON (binary JSON) — and indexes them by their content. Each document is a self-contained record with nested fields, arrays, and any structure the application chooses; the database does not require all documents in a collection to share the same shape. Document databases occupy the middle ground between key-value stores (no internal structure) and relational databases (rigid schemas).

MongoDB is the dominant document database and the system that popularized the category. MongoDB stores BSON documents grouped into collections (analogous to tables) inside databases. Queries are written in MongoDB's query language — a JSON-like syntax — that supports filters, projections, aggregation pipelines, and indexes. MongoDB's appeal to developers is the impedance match with modern application code: the same JSON document that travels over the wire from a web client can be stored, queried, and returned without an ORM translating it into rows and back. MongoDB has supported multi-document ACID transactions since version 4.0, narrowing the historical relational/document gap on consistency.
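
A short pymongo sketch of that impedance match — the connection string and collection name are assumptions:

```python
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["shop"]  # assumed instance

# The document that arrived from the web client is stored as-is.
db.orders.insert_one({
    "order_id": "ORD-1003",
    "customer": {"id": 42, "email": "iris@example.com"},
    "items": [{"sku": "BK-001", "qty": 2, "price": 19.99}],
    "total": 69.97,
})

# Queries reach into nested fields and arrays directly — no JOINs, no ORM.
for order in db.orders.find({"customer.id": 42, "total": {"$gt": 50}}):
    print(order["order_id"])
```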

The two document formats every IS professional must recognize are JSON and XML. A JSON Document (JavaScript Object Notation) is the dominant document format on the modern web — concise, human-readable, supported natively by every programming language. JSON has six data types (object, array, string, number, boolean, null) and a tree structure of nested objects and arrays. JSON is what almost every REST API returns and what almost every document database stores internally.

```json
{
  "order_id": "ORD-1003",
  "customer": { "id": 42, "email": "iris@example.com" },
  "items": [
    { "sku": "BK-001", "qty": 2, "price": 19.99 },
    { "sku": "BK-014", "qty": 1, "price": 29.99 }
  ],
  "shipping_address": { "city": "Minneapolis", "state": "MN", "zip": "55401" },
  "total": 69.97
}
```

An XML Document (Extensible Markup Language) is the older, more verbose document format that dominated the 1990s and 2000s before JSON ate its lunch. XML is still important in enterprise integration (SOAP web services, regulated industries like healthcare and finance, government data feeds), in document publishing (DocBook, DITA), and in configuration files (Maven POMs, Spring XML configs). XML supports schemas (XSD), transformations (XSLT), and queries (XPath, XQuery) — capabilities JSON has gradually re-grown. The IS professional should recognize both formats and translate between them on demand; the choice between them is rarely interesting and almost always made by historical accident.

| Format | Verbosity | Schema language | Query language | Typical 2026 usage |
|---|---|---|---|---|
| JSON | Low | JSON Schema | JSONPath, jq | REST APIs, document stores, modern config |
| XML | High | XSD | XPath, XQuery | SOAP, HL7, regulated industries, legacy |

The footgun in document databases is the missing schema. Without a schema, the application is the only enforcer of document shape — and applications change. Six months in, the team discovers that one collection contains documents written by three application versions, with three different shapes for the same logical field. The structural fix is application-level schema discipline (JSON Schema, MongoDB's $jsonSchema validator) plus a migration strategy for evolving document shapes. "Schema-less" turns out to mean "schema enforced somewhere else, possibly badly."
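
A sketch of that structural fix using MongoDB's $jsonSchema validator via pymongo — the collection name and required fields are illustrative:

```python
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["shop"]

# The database — not just the application — now enforces document shape.
db.create_collection("orders_v2", validator={
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["order_id", "customer", "total"],
        "properties": {
            "order_id": {"bsonType": "string"},
            "customer": {
                "bsonType": "object",
                "required": ["id"],
                "properties": {"id": {"bsonType": "int"}},
            },
            "total": {"bsonType": ["double", "decimal"]},
        },
    },
})
# Inserts that do not match the declared shape are rejected at write time.
```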

Graph Databases (The Brief Version)

A graph database stores nodes (entities) and edges (relationships between them) as first-class citizens, with both nodes and edges carrying properties. Graph databases excel at queries about paths — friends of friends, shortest route, fraud rings, knowledge graph traversals — that require chained JOINs in a relational database and become infeasible at scale. The major systems include Neo4j, Amazon Neptune, ArangoDB, and TigerGraph, and the dominant query languages are Cypher, Gremlin, and the recently standardized GQL.

We will return to graph databases in depth in Chapter 24: Graph Databases and Enterprise Knowledge Graphs, where they are central to the book's thesis about the Enterprise Nervous System. For now, recognize the category, recognize the names, and trust that the chapter on graphs is going to be one of the most fun in the book.

CAP Theorem: You Cannot Have Everything

The intellectual centerpiece of the distributed-database literature is the CAP Theorem, formulated by Eric Brewer in 2000 and proved by Gilbert and Lynch in 2002. The theorem says that any distributed data system can guarantee at most two of the following three properties at the same time:

  • Consistency (C) — every read receives the most recent write or an error. (Note: this is "linearizability," not the "C" in ACID, which is a different concept entirely. The terminology is unfortunate.)
  • Availability (A) — every request receives a (non-error) response, even if some nodes are down.
  • Partition tolerance (P) — the system continues to function when network partitions split the cluster into groups of nodes that cannot reach each other.

In a distributed system, network partitions will eventually happen — networks fail, links go down, packets get lost — so partition tolerance is not negotiable in practice. That means the real choice is between consistency and availability during a partition. A CP system chooses consistency over availability: when a partition occurs, the minority side stops accepting writes (or returns errors) to avoid disagreeing with the majority. An AP system chooses availability over consistency: every node keeps accepting reads and writes, accepting that different nodes may temporarily disagree, with reconciliation later (a process called eventual consistency).

| Category | C (consistency) | A (availability) | P (partition tolerance) | Examples |
|---|---|---|---|---|
| CA (theoretical only) | Yes | Yes | No (single-node, no real partitions) | Single-node PostgreSQL/MySQL |
| CP | Yes | No (during partition) | Yes | HBase, MongoDB (default), Redis Cluster, ZooKeeper |
| AP | Eventually | Yes | Yes | Cassandra, DynamoDB (default), CouchDB, Riak |

The CAP Theorem Triangle — Pick Two

Type: interactive-infographic
sim-id: cap-theorem-triangle
Library: p5.js
Status: Specified

A p5.js triangle infographic with the three CAP properties — Consistency, Availability, Partition Tolerance — at the three vertices. Each edge of the triangle is labeled with the category of system that lives on it (CA, CP, AP). The vertices and edges are visually emphasized differently: the C-A edge (the CA category) is rendered with a faded "theoretical only" warning because CA systems cannot tolerate partitions and therefore cannot exist as truly distributed systems. The C-P edge (CP) and A-P edge (AP) are rendered as the practical choices, each with three or four real-world database logos arranged along the edge.

A simulation panel below the triangle animates a network partition: a four-node cluster splits into two groups of two. The user toggles between CP-mode (the minority side refuses writes — coral on the minority, mascot-emerald on the majority, dashed line for the partition) and AP-mode (both sides accept writes — both green, with conflicting data shown as warning indicators that resolve when the partition heals).

Color palette: Consistency vertex in mascot-emerald, Availability in mascot-magenta, Partition tolerance in dark teal, edges colored by their category, partition line in dashed coral. Conflict indicators in warning amber.

Interactivity: Click a vertex to learn the precise definition (with the "ACID-C vs CAP-C" disambiguation called out for Consistency). Click an edge to see real-world systems on that edge and their default consistency level (with a note that many systems are tunable — Cassandra, DynamoDB, MongoDB all have knobs that move them along the edge). A "Simulate partition" button drops a network partition into the four-node cluster animation and lets the student watch CP vs AP behavior.

Layout: triangle in upper portion of canvas, animated cluster panel below, full canvas width, height ~620px. Canvas resizes on window resize. setup() calls updateCanvasSize() first. Canvas parented with canvas.parent(document.querySelector('main'));.

Learning objective (Bloom: Understanding): Students can name the three CAP properties, explain why partition tolerance is non-negotiable in practice, distinguish CP from AP systems with examples, and disambiguate the "C" in CAP from the "C" in ACID.

Implementation: p5.js, deployed at /information-systems/sims/cap-theorem-triangle/.

The systems-thinking insight in CAP is that the choice of CP or AP is not a marketing knob — it is a fundamental commitment that shows up in the system's behavior at exactly the worst moments. A CP database serving a shopping cart will refuse new items during a partition; the user sees an error and goes to a competitor. An AP database serving a banking ledger will happily accept inconsistent writes during a partition; the bank discovers later that two ATMs each let the same account withdraw the last $200. Neither choice is universally correct. Both are correct for specific workloads. The IS professional's job is to know which one the business actually needs.

A more nuanced refinement of CAP is PACELC (pronounced "pass-elk"), proposed by Daniel Abadi: if a Partition occurs, choose between Availability and Consistency; Else (during normal operation), choose between Latency and Consistency. PACELC captures the fact that even when the network is working perfectly, distributed systems still trade off latency against consistency — replication takes time, and waiting for confirmation from remote replicas costs milliseconds.

Heads up

The CAP theorem is one of the most-misquoted ideas in IS. Three things to keep straight: first, the "C" in CAP is linearizability, not the "C" in ACID — totally different properties with the same letter; thanks, computer science. Second, CAP is a statement about behavior during a partition, not about steady-state operation; many "AP" databases are perfectly consistent when the network is healthy. Third, most modern distributed databases are tunable along the CAP spectrum — Cassandra lets you pick consistency level per query, MongoDB lets you set write concerns, DynamoDB has both eventually-consistent and strongly-consistent reads. The footgun is treating CAP as a fixed label on a database when it is actually a dial. Learn where the dial is set by default, and learn how to move it before you need to.
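
One concrete look at the dial, using DynamoDB's per-request read setting in boto3 — the table and key are assumptions:

```python
import boto3

table = boto3.resource("dynamodb").Table("accounts")  # assumed table

# Default read: eventually consistent — cheaper, may lag a recent write.
maybe_stale = table.get_item(Key={"account_id": "A-42"})

# Same read with the dial turned: strongly consistent, at roughly twice
# the read-capacity cost.
fresh = table.get_item(Key={"account_id": "A-42"}, ConsistentRead=True)
```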

Database Selection: A Decision Worth Doing Carefully

Database Selection is the deliberate process of matching a workload to the database category and product that fits it best. In practice, most IS professionals will inherit a database choice that someone else made; in the rare cases where you get to choose, it is one of the highest-leverage decisions on the project.

A workable selection framework runs through six questions in order:

  1. What is the access pattern? Point lookups by key, range scans by key, ad-hoc filters across many columns, graph traversals, full-text search, geospatial proximity, time-series rollups? The access pattern names the category.
  2. What are the consistency requirements? Strong consistency at every read (banking, inventory)? Eventual consistency acceptable (social feeds, analytics)? Read-your-own-writes only (most user-facing apps)? The answer narrows the choice within a category.
  3. What is the data volume and growth rate? Gigabytes that will stay gigabytes? Terabytes growing by 10% a year? Petabytes growing 10x a year? Volume drives the question of whether single-node systems are viable or whether distribution is mandatory.
  4. What are the latency and throughput targets? Sub-millisecond per operation at a million ops/sec? Sub-second per query at a thousand qps? Sub-minute per analytical job? The targets eliminate categories that cannot meet them.
  5. What are the operational constraints? Self-hosted or managed? Cloud or on-prem? In-house DBA expertise? Compliance regime (HIPAA, PCI, SOX)? These constraints often eliminate technically excellent options.
  6. What does the team already know? A database the team can operate is worth more than a database the team admires from afar. The "best" database for a workload is the best one the team can run reliably at 3 AM.

The reason these six questions matter so much is that databases are extraordinarily expensive to migrate. Picking poorly means living with the choice for years, working around it constantly. Picking well means the database disappears into the background and the team can focus on actually building the product.

| Workload pattern | Likely category | Specific product candidates |
|---|---|---|
| Transactional CRUD with relations | Relational | PostgreSQL, MySQL, SQL Server |
| Analytical queries over big history | Columnar warehouse | Snowflake, Redshift, BigQuery |
| Sub-millisecond cache or session store | Key-value | Redis, DynamoDB |
| Write-heavy time-series at scale | Column-family | Cassandra, ScyllaDB |
| Flexible nested documents from web/app | Document | MongoDB, DynamoDB (document mode) |
| Path queries on connected data | Graph | Neo4j, Neptune, ArangoDB |
| Full-text search | Search engine | Elasticsearch, OpenSearch |
| Vector similarity for AI embeddings | Vector | pgvector, Pinecone, Weaviate, Milvus |

Polyglot Persistence: Many Databases, On Purpose

The architectural pattern that emerges naturally from the database-selection framework is Polyglot Persistence: an organization runs multiple, different databases, each serving the workload it fits best. The shopping cart lives in a key-value store. The orders live in PostgreSQL. The product catalog lives in a document store. The reporting warehouse is Snowflake. The recommendation engine reads from a graph database. The session cache is Redis. None of these is the primary database; they are each primary for their workload.

Polyglot persistence has been the dominant pattern in big tech for over a decade and has been steadily filtering down to mid-market enterprise. It is not, however, free. Every additional database adds operational surface area: another set of backups, another set of metrics, another set of upgrade paths, another set of security configurations, another on-call rotation. The unintended consequence of too much polyglot is operational drowning — a five-person team with seven databases will spend more time keeping them running than building features. The structural fix is intentional polyglot: every additional database must justify itself with a workload-fit argument, and the team must have (or build) the operational capacity to run it. Two well-fit databases are usually better than five mediocre ones; one well-fit database is sometimes better than two.

Polyglot Persistence in a Modern E-Commerce Architecture

Type: interactive-infographic
sim-id: polyglot-ecommerce-architecture
Library: vis-network
Status: Specified

A vis-network architecture diagram of an e-commerce stack that uses six different databases, each serving the workload it fits. At the center: the application layer (web, mobile, services) rendered as a slate-gray cluster. Around it: six database nodes, each colored by category and labeled with both the category and a representative product:

  • PostgreSQL (relational, mascot-emerald) — orders, customers, transactions
  • Redis (key-value, mascot-magenta) — session cache, rate-limit counters, hot product cache
  • MongoDB (document, coral) — product catalog with nested attributes, user profiles
  • Cassandra (column-family, dark teal) — clickstream events, time-series telemetry
  • Neo4j (graph, slate blue) — product recommendation graph, supply-chain relationships
  • Snowflake (warehouse, gold) — analytical reporting, BI dashboards, data science

Edges show data flow: change-data-capture from PostgreSQL into Snowflake (labeled "CDC"), event streams from the application into Cassandra (labeled "Kafka"), product-catalog reads from MongoDB to the application, recommendation queries from the application to Neo4j, cache reads/writes between the application and Redis.

Interactivity: Hovering each database shows (a) the workload it serves, (b) why this category fits, (c) the rough volume and access pattern, and (d) the cost of removing it. Hovering an edge shows the data-flow mechanism (CDC, Kafka, REST, RPC, batch ETL) and its latency. A "Simplify" toggle progressively removes databases (Neo4j first, then Cassandra, then MongoDB) and animates the workload moving onto the remaining stores, with annotations showing the resulting impedance mismatch ("Now product catalog is a JSON column in PostgreSQL — workable for 100K products, painful at 10M").

To work around the vis-network horizontal-edge label rendering bug, edges have a slight y-offset so labels render correctly on initial load.

Layout: hub-and-spoke around the central application, full canvas width, height ~620px.

Learning objective (Bloom: Analyzing): Students can read a polyglot architecture, justify each database choice with a workload-fit argument, and reason about the operational tradeoffs of consolidation versus specialization.

Implementation: vis-network, deployed at /information-systems/sims/polyglot-ecommerce-architecture/.

Iris's Tip

Pro move when reviewing somebody else's polyglot architecture: ask which databases they would consolidate if they had to cut one. The good architects have an answer ready — they have already thought about it. The shaky architects pause too long, because every database in their stack got there through accretion rather than design. The number of databases an organization should run is roughly equal to the number of distinctly shaped workloads that materially benefit from specialization, plus zero. A startup with one shape of workload should run one database. A unicorn with eight workloads should run six or seven. The right number is almost never "every cool system the team read about on Hacker News."

Data Lakes: The Big Cheap Bucket

Where the data warehouse stores processed, structured data optimized for known analytical queries, the Data Lake stores raw, unprocessed data of any shape — structured, semi-structured, unstructured — in cheap object storage, on the bet that some future query, by some future analyst or data scientist, will want it. The data lake's design philosophy is essentially "store everything, decide later." A modern data lake is typically built on Amazon S3, Azure Data Lake Storage Gen2, or Google Cloud Storage, with files in formats like Parquet, ORC, Avro, JSON, and CSV.

The data lake's superpower is schema-on-read: data is stored without a schema, and the schema is applied at query time by whichever tool reads the data. This means the lake can ingest new data sources at the speed of a file-copy operation — no schema design, no ETL job, no DBA approval. Need to capture every clickstream event for posterity? Drop the JSON files into a folder in the lake. Need to retain seven years of CSV exports from a legacy ERP for compliance? Same answer. The cost per terabyte stored in a data lake is roughly two orders of magnitude cheaper than the equivalent storage in a traditional warehouse.
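
Schema-on-read in one sketch, using DuckDB as the reader (any schema-on-read engine behaves similarly; the lake path and column names are assumptions):

```python
import duckdb

# No CREATE TABLE, no load job: the schema is inferred at query time from
# whatever Parquet files are sitting in the (hypothetical) lake path.
daily_clicks = duckdb.sql("""
    SELECT date_trunc('day', event_ts) AS day, count(*) AS clicks
    FROM 'lake/bronze/clickstream/*.parquet'
    GROUP BY 1
    ORDER BY 1
""").fetchall()
```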

The data lake's kryptonite is the same property that makes it attractive: schema-on-read is great when the reader knows what to do, and a swamp when nobody does. The unintended consequence of an ungoverned data lake is the data swamp — petabytes of files with no documentation, no schema, no lineage, and no idea who put them there or what they mean. The feedback loop is brutal: more raw data added → less proportional governance → less trust → fewer analysts using it → less business pressure to govern → more raw data added with even less care. The structural fix is governance from day one: a data catalog (Glue Catalog, Unity Catalog, Atlan, Collibra), enforced metadata, lineage tracking, access control, and a curated gold layer of trusted datasets sitting on top of the raw bronze and refined silver layers. (The "bronze/silver/gold" layering is the medallion architecture, popularized by Databricks; it is the conceptual antidote to the swamp.)

| Layer | What lives here | Audience | Schema discipline |
|---|---|---|---|
| Bronze (raw) | Source data exactly as ingested | Data engineers, source-system owners | Schema-on-read |
| Silver (refined) | Cleaned, deduplicated, joined | Data engineers, analysts | Schema enforced at write |
| Gold (curated) | Business-ready dimensional/fact tables | Analysts, BI tools, data scientists | Strongly typed, documented |

The Data Lakehouse: Best of Both Worlds (Mostly)

The data lake's flexibility and the data warehouse's reliability have spent two decades arguing with each other across IT departments. The Data Lakehouse is the architectural pattern that ends the argument by combining both: a single storage layer (cheap object storage, lake-style) with a transactional metadata layer on top (warehouse-style ACID guarantees, schema enforcement, time travel, and SQL).

The technologies that make the lakehouse possible are open table formats — Delta Lake (originated by Databricks), Apache Iceberg (originated at Netflix), and Apache Hudi (originated at Uber). All three sit on top of files in object storage and add a transaction log that turns the underlying files into something that behaves like a database table: ACID writes, schema evolution, time travel queries, deletions and updates of individual rows. The compute engines that read these tables (Spark, Trino, Presto, Snowflake, Databricks SQL, BigQuery, Redshift) treat them as if they were warehouse tables, while the underlying storage stays cheap and open.

The lakehouse promises three architectural simplifications. First, one copy of the data — no more separate lake-and-warehouse with ETL between them. Second, open formats — the data is in Parquet files with a Delta or Iceberg log, queryable by any engine that supports the format. Third, unified workloads — BI dashboards, data science notebooks, and ML training pipelines all read from the same tables, eliminating the consistency drift between separate systems.
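
A minimal sketch of those guarantees using the deltalake (delta-rs) Python bindings — the table path and columns are assumptions:

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

path = "./lake/gold/sales"  # hypothetical table: Parquet files plus a _delta_log

# An ACID append: the transaction log makes plain files behave like a table.
write_deltalake(path, pd.DataFrame({"region": ["West"], "revenue": [199.99]}),
                mode="append")

# Time travel: read the table exactly as it stood at an earlier version.
as_of_v0 = DeltaTable(path, version=0).to_pandas()
```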

Lake → Warehouse → Lakehouse: Three Generations of Analytical Architecture

Type: interactive-infographic
sim-id: lake-warehouse-lakehouse-evolution
Library: p5.js
Status: Specified

A p5.js timeline visualization showing three generations of analytical architecture, side-by-side, with sample data flowing through each. Generation 1 (1990s-2000s): the classic data warehouse — source systems → ETL → relational warehouse → BI tools, rendered with rigid pipes and a single integrated warehouse box. Generation 2 (2010s): the data lake era — source systems → raw files in S3 → Spark/Hadoop processing → warehouse for BI + lake for data science, rendered with two parallel branches and labels calling out the duplicated copies and consistency drift between them. Generation 3 (2020s): the lakehouse — source systems → open table format on object storage → unified compute layer (Spark, Trino, Snowflake, Databricks) serving both BI and data science from one copy.

Each generation has a small "pain points" panel showing what broke (Gen 1: hard to handle unstructured data, schema-change is painful; Gen 2: two copies, consistency drift, governance gap; Gen 3: emerging — open format wars, performance still trails dedicated warehouse on some workloads).

Color palette: Gen 1 in slate gray, Gen 2 in coral with branching warning lines, Gen 3 in mascot-emerald with a single unified data layer in mascot-magenta. Animated data droplets flow through each architecture in a continuous loop.

Interactivity: Click each generation to expand a detailed view with the specific products of the era (Gen 1: Teradata, Oracle Exadata, Netezza; Gen 2: Hadoop, Spark, S3, Snowflake/Redshift in parallel with the lake; Gen 3: Delta Lake, Iceberg, Hudi, Databricks, Unity Catalog). A "play data flow" toggle animates a single record traveling from source to BI dashboard under each architecture, highlighting the number of copies and transformations.

Layout: three vertical columns side-by-side, each column representing one generation, full canvas width, height ~640px. Canvas resizes on window resize. setup() calls updateCanvasSize() first. Canvas parented with canvas.parent(document.querySelector('main'));.

Learning objective (Bloom: Understanding): Students can describe the architectural difference between warehouse, lake, and lakehouse, explain the pain points each generation addressed, and recognize the technologies that define the current lakehouse era.

Implementation: p5.js, deployed at /information-systems/sims/lake-warehouse-lakehouse-evolution/.

The honest assessment: the lakehouse is not yet a complete replacement for a tuned analytical warehouse on every workload. Pure-play columnar warehouses still edge out lakehouses on the most demanding BI workloads (complex multi-table JOINs at sub-second latency on warm data). But the gap is closing fast, and the architectural simplification of one copy of the data is so valuable that most new analytical platforms in 2026 are being designed lakehouse-first, with a dedicated warehouse layered on top only when measured workloads demand it. For the IS graduate entering the industry now, the lakehouse is the architecture you should expect to inherit.

Putting It All Together

The modern database landscape is wider than "MySQL or Oracle?" and structured around workload fit rather than vendor preference.

  • Relational databases — PostgreSQL, MySQL, SQL Server — remain the right default for transactional applications with strong consistency requirements. They handle the OLTP side of almost every business.
  • OLTP vs OLAP is the fundamental workload split. OLTP wants short, ACID, normalized, row-oriented; OLAP wants long, batch-fresh, denormalized, columnar.
  • Analytical databases with columnar storage make OLAP queries 100x faster by reading only the columns each query needs, with column-level compression amplifying the advantage.
  • Data warehouses centralize analytical data with subject-oriented, integrated, time-variant, non-volatile design. Star schemas (with fact tables and dimension tables) and snowflake schemas are the dominant analytical schema patterns, with slowly changing dimensions handling historical truth.
  • Cloud warehouse platforms — the Snowflake Platform, Amazon Redshift, BigQuery, Databricks SQL — separate storage from compute and bill per second.
  • Key-value stores like Redis and DynamoDB trade query flexibility for sub-millisecond latency at planet scale.
  • Column-family stores like Apache Cassandra and HBase trade query flexibility for write-heavy throughput at planet scale.
  • Document databases like MongoDB store flexible nested documents — typically JSON documents or BSON, with XML documents still dominant in regulated industries — and trade rigid schemas for developer flexibility.
  • Graph databases make path queries first-class citizens; we treat them in depth in Chapter 24.
  • The CAP theorem says you cannot have consistency, availability, and partition tolerance simultaneously in a distributed system. Real systems pick CP or AP and tune the dial.
  • Database selection is a six-question discipline: access pattern, consistency, volume, latency, operations, team. Get it right, and the database disappears.
  • Polyglot persistence is the architectural answer when no single database fits all workloads. Run several databases, each fit-for-purpose, and pay the operational cost intentionally.
  • Data lakes store raw data cheaply with schema-on-read, and become data swamps without governance. Data lakehouses combine lake economics with warehouse reliability, via open table formats like Delta Lake, Iceberg, and Hudi.

A graduate of this chapter walking into a data architecture review should be able to ask, in order: What workloads is this system serving? Are those workloads OLTP or OLAP, and is the storage layout right for them? Where is the boundary between the operational store and the analytical store, and how does data cross it? Which databases are absolutely necessary, which are accidental, and which one would we consolidate first if we had to cut one? Where on the CAP dial does each distributed system sit, and is that the right setting for the workload? Is the lake governed, or is it becoming a swamp? And what is the migration path to a lakehouse if we are not already on one? Asking those questions in a kickoff meeting will earn you the kind of look senior engineers reserve for the rare junior who is going to make their lives easier rather than harder.

Nicely done

You just learned the modern database landscape — relational refresher with PostgreSQL/MySQL/SQL Server, the OLTP-vs-OLAP split that organizes the rest, columnar storage and analytical engines, the data warehouse with star and snowflake schemas and slowly changing dimensions, the cloud warehouse platforms (Snowflake, Redshift) that separated storage from compute, the four NoSQL shapes (key-value with Redis and DynamoDB, column-family with Cassandra and HBase, document with MongoDB and JSON and XML, and graph for later), the CAP theorem and the difference between CP and AP systems, the six-question database-selection discipline, polyglot persistence as the architectural pattern that emerges when you take selection seriously, and the lake-to-lakehouse evolution that is reshaping analytics in 2026. That is genuinely a lot of vocabulary, and you can now hold an informed conversation with any database engineer in the industry. Onward to Chapter 8, where we leave storage behind for a chapter and look at how data actually moves — pipelines, integration patterns, and the unsung plumbing that connects the seven databases your organization probably already has.

References

See Annotated References