The Right Database

Title: The Right Database: Architecture Tradeoff Analysis for Distributed and High-Availability Systems

Target Audience: Professional development — software architects, senior engineers, technical leads, and database administrators making real-world database selection and design decisions

Prerequisites: General programming experience in at least one language; basic familiarity with software systems and application development

Course Overview

Choosing the wrong database is one of the most expensive architectural mistakes an engineering team can make. Once data is in production, migration is painful, risky, and slow — yet most teams select a database based on familiarity, marketing, or habit rather than on rigorous analysis of their system's quality attributes. This course teaches a disciplined, structured approach to database selection using the Carnegie Mellon University Software Engineering Institute's Architecture Tradeoff Analysis Method (ATAM), the gold standard for evaluating architectural decisions against competing quality attribute requirements.

Students will build a deep understanding of six major database paradigms — relational, analytical, key-value, column-family, graph, and document — and learn to reason about each in terms of the tradeoffs they make across consistency, availability, partition tolerance, scalability, query expressiveness, and operational complexity. Rather than seeking a "best" database, this course teaches students to identify the right database for a given set of architectural drivers, constraints, and quality attribute scenarios.

The course also covers the distributed systems foundations that underpin modern database architectures: the CAP theorem, BASE vs. ACID semantics, consensus protocols, replication strategies, sharding, and the engineering challenges of achieving five-nines (99.999%) availability in production systems. Students complete the course by conducting a full ATAM-based database selection exercise for a realistic architecture scenario, producing a documented decision artifact that reflects professional-grade architectural reasoning.

Main Topics Covered

The CMU SEI Architecture Tradeoff Analysis Method (ATAM): process, artifacts, and stakeholder utility tree construction
Utility trees: structure, quality attribute hierarchy, scenario elicitation, (H,H)/(H,M)/(M,H) prioritization, and reading utility trees as visual decision artifacts
Quality Attribute Workshop (QAW) techniques for eliciting database-relevant requirements
Relational databases: ACID guarantees, normalization, SQL, indexing strategies, and scalability limits
Analytical (OLAP) databases: columnar storage, MPP architectures, data warehousing, and BI workloads
Key-value stores: extreme read/write throughput, eventual consistency, TTL patterns, and caching use cases
Column-family (wide-column) databases: sparse data models, write-optimized LSM trees, and time-series workloads
Graph databases: property graph and RDF models, traversal algorithms, and relationship-heavy query patterns; distributed graph databases (e.g., TigerGraph) that scale graph storage and traversal horizontally across a cluster
Document databases: schema flexibility, embedded documents, aggregation pipelines, and developer ergonomics
CAP theorem, PACELC model, and their implications for database selection under network partition
ACID transactions: atomicity, consistency, isolation, durability, lock-based and MVCC implementations
Distributed ACID: two-phase commit (2PC), distributed sagas, and NewSQL (Google Spanner, CockroachDB, YugabyteDB)
Scale-out distributed computing: horizontal sharding, consistent hashing, replication topologies, and read/write splitting
High availability architecture: active-active vs. active-passive clustering, failover, leader election, and quorum protocols
Five-nines (99.999%) availability: failure mode analysis, SLA decomposition, chaos engineering, and operational runbooks
Workload characterization: OLTP vs. OLAP vs. HTAP, read/write ratios, data volume, and latency/throughput budgets
Vector search as a database feature: embedding generation, approximate nearest-neighbor (ANN) indexes (HNSW, IVF), similarity metrics (cosine, dot product, Euclidean), and how vector search is added to relational, document, key-value, and graph databases to enable similarity retrieval across any item or property
LLM-generated embeddings: how large language models encode semantic meaning into dense floating-point vectors, embedding model selection (OpenAI text-embedding-3, Sentence Transformers, Cohere Embed), dimensionality tradeoffs, batching strategies, cost and latency at scale, and storing embeddings alongside structured data in a database
Multi-model and polyglot persistence architectures: when and how to combine multiple database types
Database selection decision frameworks: scoring matrices, architectural risk analysis, and sensitivity/tradeoff point identification

Topics Not Covered

Internal database engine implementation (query planners, buffer pool management, storage engine internals beyond conceptual overview)
Database administration operations (backup procedures, index tuning playbooks, query optimization deep dives)
SQL or NoSQL query language tutorials beyond what is needed to understand capability tradeoffs
Standalone vector databases as a separate paradigm (covered instead as a cross-cutting feature that any database type may expose)
ML model training pipelines and feature stores as primary content
Specific cloud provider managed database services beyond illustrative examples
Data modeling for specific business domains (e-commerce, healthcare, finance) as primary content
Full ATAM facilitator certification training (this course teaches application, not facilitation)

Learning Outcomes

After completing this course, students will be able to:

Remember

Retrieving, recognizing, and recalling relevant knowledge from long-term memory.

Recall the six major database paradigms (relational, analytical, key-value, column-family, graph, document) and their primary use cases
State the four ACID properties (Atomicity, Consistency, Isolation, Durability) and what each guarantees
Identify the three properties of the CAP theorem and state that only two can be simultaneously guaranteed under network partition
List the key steps of the CMU SEI ATAM process: present ATAM, present business drivers, present architecture, identify architectural approaches, generate utility tree, analyze architectural approaches, brainstorm and prioritize scenarios, analyze architectural approaches (continued), present results
Identify the three levels of a utility tree: quality attribute (root), attribute refinement (branches), and quality attribute scenario (leaves)
Recall the two-axis prioritization scheme for utility tree scenarios: importance to stakeholders (H/M/L) and difficulty to achieve (H/M/L)
Recall common availability SLA targets and their allowable downtime equivalents (99.9% = 8.7 hrs/yr, 99.99% = 52 min/yr, 99.999% = 5.26 min/yr)
Name consensus protocols used in distributed databases (Paxos, Raft, ZAB)
Identify BASE semantics (Basically Available, Soft state, Eventually consistent) and contrast with ACID
List common sharding strategies: range-based, hash-based, directory-based, and geographic sharding
Define vector embeddings and identify the similarity metrics used in vector search (cosine similarity, dot product, Euclidean distance)
Name canonical database products by paradigm, including distributed graph databases such as TigerGraph and Amazon Neptune alongside single-server graph databases such as Neo4j
Name the approximate nearest-neighbor index types commonly used for vector search (HNSW, IVF, flat) and identify which databases natively expose vector search as a feature (e.g., pgvector for PostgreSQL, Atlas Vector Search for MongoDB, Redis Vector, Cassandra with SAI)
Recall that LLMs produce embeddings by encoding input text (or other modalities) into dense high-dimensional vectors, and name representative embedding models (OpenAI text-embedding-3-small/large, Sentence-BERT, Cohere Embed) along with their typical output dimensionalities (256–3072 dimensions)

Understand

Constructing meaning from instructional messages, including oral, written, and graphic communication.

Explain why no single database type is universally optimal and how workload characteristics drive the tradeoff space
Describe the ATAM utility tree structure and how quality attribute scenarios capture architectural requirements
Explain how a utility tree makes implicit architectural priorities explicit and visible to all stakeholders
Describe how competing quality attributes (e.g., consistency vs. availability, performance vs. durability) appear as tension between branches of the same utility tree
Explain the difference between OLTP and OLAP workloads and why they favor fundamentally different database architectures
Summarize the tradeoffs between strong consistency (CP systems) and high availability (AP systems) in real-world operational terms
Explain how LSM-tree storage engines optimize for write throughput at the cost of read amplification, and why this matters for column-family databases
Describe how distributed ACID transactions (2PC, Paxos-based commit) achieve global consistency and what latency costs they introduce
Explain the concept of a "sensitivity point" and a "tradeoff point" in ATAM and how they appear in database architecture analysis
Summarize the failure modes that prevent five-nines availability and how redundancy, failover, and chaos engineering address them
Describe the property graph model and contrast it with the relational model for relationship-heavy queries
Explain eventual consistency, read-your-writes, monotonic reads, and causal consistency as a spectrum of consistency models
Explain vector search as a database feature rather than a separate paradigm: how embedding indexes sit alongside traditional indexes, how they enable "find items similar to this one" queries across text, images, products, or any property that can be embedded, and what latency and storage tradeoffs the ANN index introduces
Explain how LLMs generate embeddings: tokenization, transformer encoder pass, pooling strategies (CLS token vs. mean pooling), and why semantically similar inputs produce geometrically close vectors in embedding space
Describe the embedding pipeline architecture — input data → embedding model API call → vector stored in DB alongside source record — and explain how model versioning and re-embedding requirements affect database design and operational cost

Apply

Carrying out or using a procedure in a given situation.

Conduct a Quality Attribute Workshop (QAW) to elicit database-relevant quality attribute scenarios from stakeholders
Construct an ATAM utility tree with quality attributes, attribute refinements, and prioritized scenarios for a given system
Apply the (importance, difficulty) prioritization matrix to rank utility tree leaf scenarios and identify which drive the most significant architectural decisions
Use a utility tree as a communication artifact to align engineering teams and business stakeholders on database selection criteria
Select an appropriate database type for a specified workload description by mapping workload characteristics to database strengths
Apply the PACELC model to characterize a specific database system's behavior under both normal operation and partition
Calculate allowable downtime budgets for a system given an SLA target and decompose that budget across dependent services
Design a replication topology (single-leader, multi-leader, leaderless) appropriate for a given availability and consistency requirement
Apply consistent hashing to design a sharding scheme for a specified data distribution and scaling requirement
Use a scoring matrix to compare three or more database candidates against a weighted set of quality attribute requirements
Add a vector search feature to a database selection decision: identify whether the candidate database exposes native ANN indexing, estimate the throughput and memory overhead, and incorporate those tradeoffs into the ATAM utility tree under a "semantic similarity" quality attribute scenario
Select an appropriate LLM embedding model for a given use case by comparing dimensionality, token limits, multilingual support, cost per token, and latency, then design the batching and caching strategy to manage embedding generation at production scale

Analyze

Breaking material into constituent parts and determining how the parts relate to one another and to an overall structure or purpose.

Analyze a system's read/write ratio, latency requirements, data volume, and consistency needs to identify which database paradigms are viable candidates
Identify architectural risks, sensitivity points, and tradeoff points in a proposed database architecture using ATAM vocabulary
Analyze a completed utility tree to locate scenarios where two high-priority quality attributes are in direct tension, revealing the core architectural tradeoff of the system
Deconstruct a poorly formed utility tree (vague scenarios, missing prioritization, unbalanced branches) and identify what information is lost as a result
Break down a high-availability architecture into its failure domains and analyze single points of failure that threaten five-nines targets
Compare the consistency guarantees of two-phase commit vs. distributed sagas and analyze the failure scenarios each handles poorly
Distinguish between accidental and essential complexity in a polyglot persistence architecture and assess whether added databases are justified
Analyze the impedance mismatch between an application's data access patterns and a proposed database's data model
Analyze the tradeoffs of adding vector search to an existing database (e.g., pgvector on PostgreSQL) versus routing similarity queries to a specialized ANN layer, considering index build time, memory footprint, query latency, and operational complexity
Analyze the downstream database implications of LLM embedding model choice: how dimensionality affects storage cost per row, index build time, and ANN recall accuracy; and how switching embedding models invalidates stored vectors and forces a full re-embedding migration
Decompose a transaction workload to determine whether distributed ACID is necessary or whether eventual consistency with compensation suffices
Evaluate the operational implications of a database selection decision: deployment complexity, backup/restore, monitoring, and team expertise

Evaluate

Making judgments based on criteria and standards through checking and critiquing.

Evaluate competing database candidates for a given architecture scenario using a structured ATAM-based decision process
Judge whether a proposed polyglot persistence architecture introduces more complexity than the problem warrants
Evaluate a utility tree produced by a team for completeness, scenario specificity, and prioritization accuracy — including whether the tree faithfully represents stakeholder intent or reflects only engineering bias
Critique a database selection decision made without ATAM by identifying unstated assumptions, missing quality attribute scenarios, and unexamined tradeoffs
Assess the risk level of a distributed database architecture's consistency model against the business consequences of stale or lost data
Evaluate the achievability of a five-nines SLA for a proposed database architecture given its replication strategy and known failure modes
Judge the adequacy of a utility tree by testing it against adversarial scenarios that stress the proposed architecture
Assess the total cost of ownership of a database selection decision including licensing, operational overhead, and migration risk
Validate that a NewSQL database's consistency guarantees meet the isolation level requirements of a given OLTP workload
Evaluate whether a system requirement for similarity search ("find products like this one", "retrieve documents semantically related to this query") is best served by native vector search in the primary database or by a dedicated embedding index, based on scale, recall requirements, and operational constraints
Evaluate the architectural risk of coupling a system's semantic retrieval quality to a third-party LLM embedding API: assess model deprecation risk, vendor lock-in, latency sensitivity, data privacy constraints, and the cost-benefit of self-hosted open-source embedding models versus managed API services

Create

Putting elements together to form a coherent or functional whole; reorganizing elements into a new pattern or structure.

Design a complete ATAM-based database selection analysis for a realistic multi-service architecture, including utility tree, candidate evaluation, risk register, and documented decision rationale
Construct a database-specific utility tree template reusable across projects, with pre-populated quality attribute branches (performance, availability, consistency, scalability, operability, security) and annotated example scenarios for each database paradigm
Architect a polyglot persistence solution that assigns each data domain to the most appropriate database type, with documented tradeoff justification
Construct a high-availability database topology targeting 99.999% uptime, including replication, failover, geographic redundancy, and chaos engineering test plan
Develop a database migration plan that transitions a system from a mismatched database to the selected optimal type with minimal downtime
Create a database selection decision template (utility tree, scoring rubric, risk register) reusable across future projects in an organization
Design a distributed transaction strategy for a microservices system that achieves adequate consistency guarantees without requiring global 2PC
Synthesize a "database selection playbook" for a specific industry vertical (e.g., fintech, e-commerce, IoT) based on recurring quality attribute patterns
Design a hybrid retrieval architecture that combines traditional structured queries with vector similarity search, mapping each query type to the appropriate index, specifying embedding model choices, and documenting the tradeoffs in an architectural decision record
Design an end-to-end LLM embedding pipeline for a production system: choose an embedding model, define the ingestion strategy (batch vs. real-time), specify how embeddings are stored and co-located with source records in the chosen database, plan for model version upgrades, and estimate the total storage and API cost at target data volume
Capstone: Conduct a full ATAM database selection analysis for a provided real-world architecture case study, produce a written architectural decision record (ADR), and present findings and recommendations to a stakeholder panel