Skip to content

Chapter 11: Distributed Scaling and Replication

Summary

This chapter covers the mechanics of scaling databases horizontally — how data is partitioned across nodes through sharding, how copies are kept consistent through replication, and how distributed systems achieve agreement through leader election and coordination services. Students learn the tradeoffs among single-leader, multi-leader, and leaderless replication topologies, how quorum reads and writes balance consistency against availability, and how tools such as ZooKeeper and etcd provide distributed coordination. Gossip protocols are examined as the failure detection backbone of leaderless systems such as Cassandra.

Concepts Covered

This chapter covers the following 19 concepts from the learning graph:

  1. Gossip Protocol
  2. Horizontal Scaling
  3. Vertical Scaling
  4. Data Sharding
  5. Range-Based Sharding
  6. Hash-Based Sharding
  7. Directory-Based Sharding
  8. Geographic Sharding
  9. Database Replication
  10. Single-Leader Replication
  11. Multi-Leader Replication
  12. Leaderless Replication
  13. Quorum Read/Write
  14. Read Replica
  15. Replication Lag
  16. Split-Brain Problem
  17. Leader Election
  18. ZooKeeper Coordination
  19. etcd

Prerequisites

This chapter builds on concepts from:


Welcome to the World of Distributed Databases

Dex the robot waving hello Dex has a confession: single-node databases kept him up at night. What happens when the disk fills? What happens when the server melts? What happens when your startup goes viral at 3 AM and your one PostgreSQL instance collapses under the weight of its own success? This chapter answers all of those questions — and introduces a collection of elegantly devious problems that come with spreading a database across many machines.

The Scaling Imperative

Every database starts small. A single server, a single disk, a single point of everything. For many applications — and far more than the distributed-systems community likes to admit — a well-tuned single node can serve millions of users. But physical limits are real. When your dataset outgrows a single machine's RAM, when write throughput saturates a single disk, when a hardware failure would mean hours of downtime, it is time to scale.

Vertical scaling (sometimes called "scaling up") means upgrading the hardware of a single server: adding CPU cores, increasing RAM from 128 GB to 512 GB, replacing spinning disks with NVMe SSDs. This is always the first lever to pull. It requires no application changes, no sharding logic, no distributed coordination headaches. The problem is physics: there is a largest machine you can buy, and it is expensive. A 192-core AWS x2gd.metal instance costs over $13 per hour. You cannot vertical-scale your way to ten petabytes.

Horizontal scaling (scaling out) means adding more nodes — more commodity servers, each handling a fraction of the total load. The appeal is obvious: ten 32-core machines cost less than one 320-core behemoth, you can add nodes incrementally, and node failure becomes a routine operational event rather than a catastrophe. The challenge is that data must now be split across nodes (sharding) and kept consistent across copies (replication) — and both problems are significantly harder than they look.

Sharding: Splitting the Dataset

Data sharding is the practice of partitioning a dataset across multiple nodes such that each node owns a disjoint subset of the data. A query that touches only one shard can be answered by one node without involving the others — this is what makes sharding attractive. Queries that span multiple shards must be coordinated across nodes, which adds latency and complexity.

The critical design decision in sharding is the shard key: the field (or composite of fields) that determines which shard a given record belongs to. A poor shard key choice causes hot spots — situations where one shard receives the majority of traffic while others sit idle. Hot spots negate the entire purpose of sharding.

There are four common sharding strategies, each with distinct tradeoffs:

Strategy How It Works Strengths Weaknesses
Range-based Records assigned by key value range (A–M → shard 1) Efficient range scans Hot spots if data is skewed
Hash-based Hash function on key mod N shards Even distribution, no hot spots Range scans require hitting all shards
Directory-based Lookup table maps each key to its shard Flexible; can rebalance any key Lookup table is a bottleneck and SPOF
Geographic Records assigned by region or country Data locality, compliance Uneven load if regions differ in size

Range-based sharding assigns records based on where their key value falls in an ordered range. A user table sharded on last name might put A–M on shard 1 and N–Z on shard 2. Range scans are fast because they hit a contiguous shard. The danger is data skew: if 70% of your users have last names starting with A–M, shard 1 runs hot.

Hash-based sharding applies a deterministic hash function to the shard key and takes the result modulo the number of shards. This distributes records evenly regardless of key distribution, eliminating hot spots. The cost is range queries: to find all users created between two dates, you must query every shard and merge the results.

Directory-based sharding maintains an explicit lookup table mapping each shard key (or key range) to its target shard. This is the most flexible strategy — you can move any individual key to any shard at will, enabling fine-grained rebalancing. The lookup table becomes both a critical coordination point and a potential single point of failure.

Geographic sharding partitions data by physical or logical region. A multi-tenant SaaS platform might keep EU customer data on EU-region shards to satisfy GDPR data residency requirements, while US customers land on US-region shards. This is the only sharding strategy driven primarily by compliance rather than performance.

Diagram: Consistent Hash Ring

MicroSim Specification **sim-id:** consistent-hash-ring **Library:** p5.js **Status:** Specified Render a circular ring representing the 0–360 degree hash space. Place 4–8 labeled nodes around the ring. Show 8–12 labeled keys distributed around the ring, each assigned to the nearest clockwise node (colored to match). Provide: - A "Add Node" button that places a new node at a random ring position. Animate keys that must migrate — highlight them in orange, then move them to the new node. Show a counter: "Keys migrated: N / Total keys: M." - A "Remove Node" button that removes a random non-fixed node. Animate its keys migrating clockwise to the next node. - A slider controlling the number of virtual nodes per physical node (1–50). As virtual nodes increase, show key distribution becoming more uniform across physical nodes. Annotate with a small histogram below the ring showing how many keys each physical node currently owns.

The Consistent Hashing Insight

A naive hash-based approach — shard = hash(key) % N — has a catastrophic property: when N changes (a node is added or removed), almost every key gets reassigned to a different shard. That means moving nearly the entire dataset, which is operationally ruinous.

Consistent hashing solves this elegantly. Both nodes and keys are mapped onto the same circular ring (0 to 2^32 − 1). A key is assigned to the nearest node in the clockwise direction. When a node is added, only the keys that fall between the new node and its predecessor migrate — approximately 1/N of the total, where N is the number of nodes. When a node is removed, only its keys migrate to its successor.

To avoid uneven load from nodes clustering on one part of the ring, each physical node is represented by multiple virtual nodes (sometimes called "vnodes"). With 150 virtual nodes per physical node, the distribution of keys is statistically uniform even with a small number of physical nodes.

Replication: Keeping Copies in Sync

Database replication is the process of maintaining copies of data on multiple nodes. Replication serves two purposes: fault tolerance (the system survives a node failure) and read scalability (multiple nodes can serve read queries in parallel). The tradeoff is complexity — keeping copies consistent is one of the central hard problems in distributed systems.

Single-Leader Replication

In single-leader replication, one node is designated the leader (also called primary or master). All writes must go to the leader. The leader records each write in a replication log and streams that log to one or more replica nodes, which apply the changes in the same order.

This is the simplest topology and the most common in traditional relational databases (PostgreSQL streaming replication, MySQL binlog replication, MongoDB replica sets). The consistency story is clean: because all writes go through a single ordering point, replicas that have caught up to the leader see a consistent view of the world.

The operational challenge is replication lag — the delay between when a write commits on the leader and when it becomes visible on a replica. Lag can be milliseconds in a healthy, lightly loaded cluster or minutes under write-heavy workloads. A user who writes to the leader and immediately reads from a lagging replica may not see their own write — a phenomenon called read-your-writes inconsistency.

Read replicas are an important operational pattern in single-leader systems. A read replica is a replica configured to accept only read queries. A typical architecture routes all writes to the leader and distributes reads across a pool of replicas, dramatically increasing total read throughput. The catch is that reads from replicas may see slightly stale data — acceptable for analytics dashboards, unacceptable for financial balances.

When Lag Bites

Dex the robot giving a helpful tip If your application writes a record and immediately redirects the user to a page that reads it back — and that read page hits a replica — you will lose the record to lag. The standard fix is read-after-write consistency: route reads to the leader for resources the current user just modified. Most client-side ORMs have a setting for this. Use it.

Multi-Leader Replication

Multi-leader replication allows writes to be accepted by more than one node simultaneously. Each leader replicates its writes to all other leaders. This topology enables geographic distribution: a user in Tokyo writes to the Tokyo leader, which replicates asynchronously to the New York leader, while New York users write locally without crossing the Pacific.

The price is write conflicts. If two users simultaneously modify the same record on different leaders, neither leader can know about the other's change until replication catches up. The system must detect this conflict and resolve it — last-write-wins (discard one), merge (combine both), or push-conflict-to-application (let the application decide).

Leaderless Replication

Leaderless replication (the Dynamo-style approach, used by Apache Cassandra and Amazon DynamoDB) eliminates the leader entirely. Any node can accept any read or write. Consistency is achieved not by ordering through a leader, but by requiring that a minimum number of nodes agree on the result.

This is where quorum reads and writes become central.

Diagram: Replication Topology Comparison

MicroSim Specification **sim-id:** replication-topology-comparison **Library:** p5.js **Status:** Specified Show three side-by-side panels, one per topology: Single-Leader, Multi-Leader, Leaderless. Each panel shows 5 nodes rendered as labeled circles. In Single-Leader: node 1 is gold (leader), nodes 2–5 are gray (replicas). Show animated arrow pulses flowing from leader to replicas. In Multi-Leader: nodes 1 and 3 are gold; arrows pulse bidirectionally between them, and from each to non-leader replicas. In Leaderless: all nodes are the same teal color; arrows pulse from a "Client" box to 3 of the 5 nodes simultaneously, labeled "W=3". Include a selector at the top: "Simulate Write" — clicking it animates the write flow for the currently highlighted topology. Below each panel, show a scorecard: - Write latency: Low / Medium / Medium - Read scalability: Medium / High / High - Conflict handling: None needed / Required / Quorum-based - Geographic distribution: Limited / Excellent / Excellent

Quorum: Counting Votes for Consistency

In a leaderless system with N replica nodes, quorum rules define the minimum number of nodes that must respond to a read or write for it to be considered successful.

  • W = the minimum number of nodes that must acknowledge a write
  • R = the minimum number of nodes that must respond to a read
  • N = the total number of replica nodes

The rule is: if W + R > N, then at least one node that acknowledged the write will also be included in every read quorum. That node has the latest data. Consistency is guaranteed.

Common configurations for N=3:

W R W+R Guarantees Tradeoff
3 1 4 Strong consistency Write tolerates 0 node failures
2 2 4 Strong consistency Balanced; standard choice
1 3 4 Strong consistency Read tolerates 0 failures; fast writes
1 1 2 Eventual consistency Maximum availability; may read stale

Quorum rules explain why Cassandra's consistency levels (ONE, QUORUM, ALL) are so granular — you are directly controlling W and R per query, trading consistency for availability at query time.

Diagram: Quorum Calculator

MicroSim Specification **sim-id:** quorum-calculator **Library:** p5.js **Status:** Specified Display three sliders: N (total replicas, 3–7), W (write quorum, 1–N), R (read quorum, 1–N). Show a row of N node icons. When a write is simulated, W of them glow orange (write acknowledged). When a read is simulated, R of them glow blue (read consulted). Highlight in green the nodes that appear in both sets — these are the nodes that guarantee fresh data. Below the node display, show: - "W + R = X" — colored red if X ≤ N (no consistency guarantee), green if X > N - "Write fault tolerance: can lose (N-W) nodes" - "Read fault tolerance: can lose (N-R) nodes" - "Consistency guaranteed: YES / NO" Add a "Simulate Stale Read" button that only activates when W+R ≤ N, animating a scenario where the read set misses the written node and returns old data.

The Split-Brain Problem

Single-leader replication has an acute failure mode: the split-brain problem. Imagine a three-node cluster — one leader and two replicas — separated by a network partition. The leader cannot reach the replicas. The replicas cannot reach the leader. A monitoring system concludes the leader is dead and promotes one replica to leader. Now there are two nodes that both believe they are primary.

If writes flow to both simultaneously, the dataset diverges. When the network partition heals, neither node knows which writes to trust. Data is lost or corrupted.

Split-brain is prevented through leader election — a distributed algorithm that ensures only one node can believe itself to be leader at any time. Two common approaches:

Raft uses a term-based election. A candidate node increments its term number, requests votes from peers, and becomes leader only if it receives a majority of votes. Because a majority must agree, two nodes cannot simultaneously achieve majority in the same term. Raft is used by CockroachDB, etcd, and CockroachDB.

The Bully Algorithm is simpler: when a node detects the leader is down, it sends an election message to all nodes with a higher ID. If no higher-ID node responds, the current node declares itself leader. If a higher-ID node responds, it takes over. The highest-reachable node always wins, hence "bully."

Never Use Two Leaders

Dex the robot raising a warning flag The split-brain problem is not theoretical. It has caused production incidents at banks, airlines, and major cloud providers. The classic safeguard is STONITH (Shoot The Other Node In The Head) — when a suspected duplicate leader is detected, the cluster proactively shuts down the suspected rogue node before promoting the new leader. It sounds extreme. It is also correct.

Diagram: Split-Brain Scenario

MicroSim Specification **sim-id:** split-brain-scenario **Library:** p5.js **Status:** Specified Show a three-node cluster: Node 1 (Leader, gold), Node 2 (Replica), Node 3 (Replica). Draw connection lines between nodes. Provide a "Partition Network" button that animates a red dashed line cutting the connection between Node 1 and Nodes 2–3. Show a timer counting up (simulating heartbeat timeout). After 10 seconds, Node 2 flashes and a label appears: "Elected new leader." Both Node 1 and Node 2 now show a gold crown icon. Show two write streams arriving: one hitting Node 1 (Write A: user balance = $500), one hitting Node 2 (Write B: user balance = $750). Both log their writes independently. Provide a "Heal Partition" button. When clicked, show Node 1 and Node 2 exchanging a "sync" message and then a "CONFLICT DETECTED" banner appearing in red. Show the two divergent values and a prompt: "Which write wins?" with options: Last-Write-Wins, Manual Resolution.

Gossip Protocols: Failure Detection Without a Boss

In a leaderless cluster, who monitors the monitors? If there is no central coordinator, how does a node learn that another node has failed?

The answer is gossip protocols — epidemic algorithms inspired by how rumors spread through a population. In a gossip protocol, each node periodically selects a small number of random peers (typically 2–3) and exchanges state information with them: which nodes it believes are up, which it believes are down, and its own current health. Each node propagates the information it has just received in the next gossip round.

Gossip protocols have elegant convergence properties. With N nodes, each round of gossip reaches approximately twice as many nodes as the previous (exponential spread). Even with high message loss rates, information about a node failure propagates to the entire cluster within O(log N) gossip rounds. Cassandra uses a gossip protocol running every second to maintain its cluster membership table — the list of all known nodes and their perceived states.

The mechanism for failure detection within gossip is typically the Phi Accrual Failure Detector. Rather than declaring a node definitively dead after missing a fixed number of heartbeats (brittle in high-latency networks), the accrual detector computes a continuous "phi" value representing the probability that the node is down based on the statistical distribution of its previous heartbeat intervals. A configurable threshold determines when phi is high enough to trigger a failure declaration.

Diagram: Gossip Protocol Spread

MicroSim Specification **sim-id:** gossip-protocol-spread **Library:** p5.js **Status:** Specified Show 12 nodes arranged in a loose grid. One node (Node 4) is marked "DOWN" with a red X. All other nodes initially show a gray question mark (they don't yet know Node 4 is down). Provide a "Start Gossip" button. Each animation step (advancing with a "Next Round" button or automatically every 1.5 seconds): 1. Each "informed" node (knows Node 4 is down) selects 2 random neighbor nodes and exchanges state with them. 2. Newly informed nodes turn from gray to orange, then to green on the next round (indicating they've spread the news). 3. Show a counter: "Round N — Nodes informed: X / 11." Show a line chart below tracking nodes informed vs. round number, illustrating the exponential growth curve. Include a "Reset" button and a slider for "Gossip fanout (peers per round): 1–4" to demonstrate how fanout affects convergence speed.

Distributed Coordination: ZooKeeper and etcd

Leader election and distributed configuration management are common enough problems that dedicated coordination services have emerged to solve them cleanly.

ZooKeeper (Apache) is a battle-hardened distributed coordination service, originally developed at Yahoo. It provides a filesystem-like namespace of znodes — small data nodes that can store configuration values, be created and deleted atomically, and trigger watches (callbacks) when they change. ZooKeeper uses the ZAB protocol (ZooKeeper Atomic Broadcast, covered in depth in Chapter 12) to ensure that all writes are totally ordered across the cluster.

ZooKeeper is used for leader election (create an ephemeral znode; whoever creates it first is leader), distributed locks (try to create a lock znode; block until the holder deletes it), and service discovery (services register their address in a known znode path). Apache Kafka, Apache HBase, and Apache Solr all depend on ZooKeeper for their distributed coordination.

etcd is a newer distributed key-value store that solves similar problems but with a cleaner API, native gRPC support, and Raft consensus rather than ZAB. etcd stores small amounts of critical configuration data and serves watch notifications when values change. Its primary claim to fame is being the backbone of Kubernetes: every cluster state, pod assignment, service endpoint, and secret lives in etcd. If etcd is unhealthy, the Kubernetes control plane cannot function.

The key difference between ZooKeeper and etcd is operational posture. ZooKeeper requires a separate ZooKeeper ensemble (typically 3 or 5 nodes) alongside your application cluster. etcd is bundled with Kubernetes and managed by the Kubernetes operators themselves. For new projects, etcd is almost always the better choice; ZooKeeper remains pervasive in the Hadoop/Kafka ecosystem.

Feature ZooKeeper etcd
Consensus protocol ZAB (Paxos variant) Raft
API Custom (znodes, watches) gRPC / REST key-value
Data size limit 1 MB per znode 1.5 MB per key (default)
Primary use case Kafka, HBase, Solr Kubernetes, distributed locks
Operational model Separate ensemble Bundled with Kubernetes
Max recommended cluster size 5 nodes 5–7 nodes

So Which Do You Pick?

Dex the robot in a thoughtful pose If you are operating in the Kafka/Hadoop ecosystem, ZooKeeper is already there — use it. If you are building on Kubernetes or starting fresh, use etcd. If you are building a new distributed system from scratch and need coordination, consider using an etcd client library directly. The rule of thumb: don't run a ZooKeeper ensemble unless something you depend on forces you to.

Putting It Together: A Realistic Distributed Architecture

A production distributed database cluster for a high-traffic e-commerce platform might look like this:

  • 6 shards, using consistent hash-based sharding on user_id
  • Each shard has 1 leader + 2 replicas (single-leader replication), for 18 nodes total
  • Quorum: W=2, R=2 on N=3 per shard for strong consistency
  • 3 read replicas per shard serving reporting queries, accepting replication lag up to 30 seconds
  • etcd cluster (3 nodes) managing leader election for each shard
  • Geographic sharding overlay: shards 1–3 on US-East, shards 4–6 on EU-West (GDPR compliance)

This architecture can tolerate the failure of one node per shard without data loss, distributes reads across 9 replicas, and maintains GDPR compliance through geographic isolation — all without violating quorum consistency on the write path.

Key Takeaways

  • Vertical scaling is always the first lever; horizontal scaling is for when physics says no.
  • Sharding partitions data across nodes; consistent hashing minimizes key migration when the cluster grows or shrinks.
  • Single-leader replication is simple and consistent but creates a write bottleneck and replication lag on replicas.
  • Multi-leader replication enables geographic distribution but requires conflict resolution strategies.
  • Leaderless replication with quorum reads/writes (W + R > N) delivers tunable consistency without a single write bottleneck.
  • Replication lag on read replicas can violate read-your-writes consistency; route sensitive reads to the leader.
  • The split-brain problem — two nodes both believing they are leader — requires leader election algorithms (Raft, Bully) to prevent.
  • Gossip protocols provide O(log N) failure detection without a central monitor; Cassandra uses them pervasively.
  • ZooKeeper and etcd are the workhorses of distributed coordination; prefer etcd for new systems, ZooKeeper for the Kafka/Hadoop ecosystem.

You Survived Distributed Systems, Round One

Dex the robot celebrating with raised arms Distributed databases are hard. They involve network partitions, split-brain scenarios, gossip rounds, quorum math, and the eternal question of whether your replica is lagging. You now understand why. In Chapter 12, Dex will take you deeper — into distributed ACID transactions, saga patterns, and the NewSQL databases that promised to fix everything (and mostly delivered).