Observability, Reliability, and Cloud Operations

Vista Turns On the Lights

Vista waving welcome "Fellow architects, welcome to the chapter where we stop hoping our systems are working and start knowing they are. From up here I can see the full operational landscape, and the view is telling: the teams that invest in observability recover from incidents in minutes; the teams that don't recover in hours — or discover failures from angry user tweets. ATAM's superpower is making availability and reliability explicit quality attributes with measurable targets. Let's weigh the tradeoffs and light up every corner of your system!"

Summary

Observability and reliability engineering translate quality attribute requirements into operational guarantees that can be measured, budgeted, and defended. This chapter covers the three pillars of observability — logging, metrics, and distributed tracing — and builds up through service level indicators (SLIs), service level objectives (SLOs), and error budgets to the full Site Reliability Engineering (SRE) discipline. Students learn to calculate system availability, reason about MTTR and MTBF, design for disaster recovery with explicit recovery time and recovery point objectives, and apply geographic redundancy and active-passive failover. The chapter closes with cloud observability as the operational synthesis of these practices.

Why Observability Is an Architectural Quality Attribute

Observability is frequently treated as an operational concern — something the platform team or on-call engineers worry about after systems are built. ATAM evaluations consistently reveal that this framing is wrong and expensive. When a system is designed without observability as a first-class architectural concern, the cost manifests in three ways: slower incident response (engineers cannot diagnose what is happening), longer time-to-feature (debugging production problems requires code instrumentation that delays actual development), and degraded reliability SLOs (targets cannot be enforced without measurement).

The ATAM perspective frames observability as a quality attribute enabler: it is the infrastructure that makes every other quality attribute measurable. A performance scenario with a p99 latency target is operationally meaningless without a metrics collection system that records p99 latency. An availability scenario with a five-nines target is unfalsifiable without logging and health check infrastructure that can detect and record downtime. The relationship is foundational: observability is what connects the quality attribute scenarios written on paper in an ATAM workshop to the operational reality of a production system.

This chapter builds the complete conceptual framework from raw signal collection (logging, metrics, traces) through SLI/SLO/error-budget target-setting, through incident lifecycle analysis (MTTR/MTBF), through disaster recovery design (RTO/RPO), to the full operational model of cloud-based systems.

The Three Pillars of Observability

Observability tools have consolidated around three complementary signal types, each providing a different lens into system behavior: logs, metrics, and traces. Peter Bourgon's 2017 essay "Metrics, tracing, and logging" remains a widely cited reference for this "three pillars" framing, and understanding the strengths and limitations of each pillar is essential for building effective observability infrastructure.

Logging Architecture

Logs are timestamped, structured or semi-structured records of discrete events in a system. A database query executing, a user authenticating, an exception being thrown, a configuration value being read — each of these is an event that a logging system can capture, persist, and make queryable.

The evolution from unstructured to structured logging is one of the most impactful architectural improvements a team can make. Unstructured logs (plain text with embedded timestamps and messages) are human-readable but machine-intractable — searching them requires string matching, and correlating events across services requires manual inspection. Structured logs (JSON or another key-value format) allow log aggregation systems to index specific fields, enabling powerful queries: "show me all requests from user X in the last hour that took longer than 500ms and resulted in a 500 error."
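As a minimal sketch of the difference, the following uses only Python's standard `logging` and `json` modules to emit one JSON object per event and then filters those events the way an aggregator query would. The field names (`user_id`, `status`, `duration_ms`, `trace_id`) and the helper `slow_server_errors` are illustrative assumptions, not a standard schema or a real aggregator API.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    # Hypothetical schema: these field names are assumptions for illustration.
    EXTRA_FIELDS = ("user_id", "status", "duration_ms", "trace_id")

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Values passed through `extra=` become attributes on the record.
        for field in self.EXTRA_FIELDS:
            if hasattr(record, field):
                event[field] = getattr(record, field)
        return json.dumps(event)


def slow_server_errors(lines, user_id, min_duration_ms=500):
    """Select events for one user that were slow and returned HTTP 500."""
    events = (json.loads(line) for line in lines)
    return [
        e
        for e in events
        if e.get("user_id") == user_id
        and e.get("status") == 500
        and e.get("duration_ms", 0) > min_duration_ms
    ]
```

In an application, the formatter would be attached to a handler that writes to standard output, and call sites would pass the structured fields with `logger.error("request failed", extra={...})`. The same filter against unstructured text would need fragile string matching.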

A mature logging architecture has several layers. Log emitters (applications) write structured events with consistent schemas to standard output (in containerized environments) or to a log file. Log shippers (Fluentd, Filebeat, Vector) collect logs from sources, enrich them (adding host metadata, environment tags, service versions), and forward them to a central aggregation system. Log aggregators (Elasticsearch, Loki, Splunk) index, store, and expose query interfaces for logs. Dashboarding tools (Kibana, Grafana) provide visualization and alerting capabilities over the aggregated log data.

Critical design decisions in logging architecture include log retention policy (how long logs are stored, which is both a cost and a compliance concern), sampling strategy (for high-throughput services, logging every request at full fidelity is often cost-prohibitive; structured sampling retains full-fidelity logs for a configured fraction of requests while capturing summary statistics for all of them), and correlation ID propagation (ensuring that every log event across every service involved in handling a user request carries the same correlation ID, enabling full-request log correlation even without a distributed tracing system).

Metrics Collection

Metrics are time-series numeric measurements of system and business behavior. Unlike logs (which capture what happened), metrics capture how much and how fast: request rate, error rate, latency percentiles, CPU utilization, queue depth, active sessions, business event counts.

Metrics have two dominant collection models: push-based (applications push metric values to a central collector at configured intervals, as with Graphite, StatsD, and InfluxDB) and pull-based (a central collector polls applications at configured intervals by scraping a metrics endpoint, as Prometheus does and as the OpenTelemetry Collector can do through its Prometheus receiver). Pull-based collection has become dominant in cloud-native environments because the collector can find ephemeral instances through service discovery instead of relying on each instance to know where to send data, and because the collector's scrape interval provides a natural rate-limiting mechanism.

Metric types matter for correct instrumentation. A counter monotonically increases and never decreases (total requests served, total errors encountered). Dashboarding systems typically display the rate-of-change of counters rather than their raw values. A gauge represents a current value that can increase or decrease (current memory usage, active connection count, queue depth). A histogram records the distribution of observed values (request duration, response size) by tracking counts within configured buckets, enabling percentile calculations. A summary is similar to a histogram but pre-computes percentiles on the client, making it cheaper to query but less flexible to aggregate across instances.
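The distinction between these types is easier to see in code. The sketch below is a simplified, in-memory illustration and not the API of any metrics library: the `Histogram` class, its bucket bounds, and the `counter_rate` helper are hypothetical names. It shows why a histogram can only answer percentile questions to the resolution of its buckets, and why dashboards show counter rates instead of raw totals.

```python
import bisect


class Histogram:
    """Counts observations in fixed buckets defined by ascending upper bounds."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One extra slot collects values above the largest bound.
        self.counts = [0] * (len(self.upper_bounds) + 1)

    def observe(self, value):
        # bisect_left finds the first bucket whose upper bound is >= value.
        self.counts[bisect.bisect_left(self.upper_bounds, value)] += 1

    def quantile_upper_bound(self, q):
        """Return the upper bound of the bucket that contains quantile q."""
        total = sum(self.counts)
        if total == 0:
            raise ValueError("no observations recorded")
        target = q * total
        running = 0
        for bound, count in zip(self.upper_bounds, self.counts):
            running += count
            if running >= target:
                return bound
        return float("inf")  # the quantile falls in the overflow bucket


def counter_rate(earlier_value, later_value, seconds):
    """Per-second rate of change between two readings of a counter."""
    return (later_value - earlier_value) / seconds
```

With bucket bounds of 50, 100, 250, and 500 ms, a p95 that truly sits at 200 ms is reported only as "at most 250 ms", so bucket boundaries should be chosen around the latency thresholds that the quality attribute scenarios actually name.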

The RED Method Is Your ATAM Metrics Contract

Vista thinking From the high-level view, I can see that the most common metrics mistake is instrumenting interesting things rather than required things. The RED method — Rate, Errors, Duration — provides a discipline: for every service and endpoint, you must have these three metrics. That is your minimum viable observability, and it maps directly to ATAM quality attribute scenarios. If your scenario specifies p95 latency, that is your Duration histogram. If it specifies error rate < 0.1%, that is your Errors counter rate. RED metrics are ATAM scenarios made observable.

Distributed Tracing and Health Checks

Distributed tracing is the mechanism by which a single user request, as it flows through multiple services in a distributed system, generates a correlated causality chain that can be examined as a unit. Without distributed tracing, a request that touches five services produces log entries in five separate service logs with no obvious way to correlate them.

The distributed tracing model works as follows. When a request enters the system at the API gateway or first service, the service generates a unique trace ID and propagates it as an HTTP header (W3C Trace Context standard: traceparent) to every downstream service it calls. Each service creates a span — a timed record of the work it performed for this request — that includes the trace ID, a parent span ID (the span of the calling service), and timing information. A trace visualization tool (Jaeger, Zipkin, Honeycomb, Tempo) reconstructs the call tree from all spans with the same trace ID, producing a flame chart that shows the complete end-to-end timeline of the request.
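The propagation step can be sketched with the standard library alone. The example below follows the version-00 `traceparent` layout from the W3C Trace Context specification (`00-<32 hex trace ID>-<16 hex parent span ID>-<2 hex flags>`); the function names are hypothetical, and a production system would normally rely on an instrumentation library such as OpenTelemetry instead of hand-rolling this.

```python
import re
import secrets

# Version 00 layout: version, trace ID, parent span ID, trace flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span_id() -> str:
    """Generate a random 8-byte span ID as 16 hex characters."""
    return secrets.token_hex(8)


def start_trace(sampled: bool = True) -> str:
    """Create the header for a request entering the system (no parent yet)."""
    trace_id = secrets.token_hex(16)  # 16 bytes, 32 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{new_span_id()}-{flags}"


def parse_traceparent(header: str):
    """Split a traceparent header into (trace_id, span_id, flags)."""
    match = _TRACEPARENT.match(header)
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    return match.groups()


def child_traceparent(incoming: str):
    """Return (outgoing_header, parent_span_id) for a downstream call.

    The trace ID and flags are carried over unchanged; only the span ID
    is replaced, so every span in the request shares one trace ID.
    """
    trace_id, parent_span_id, flags = parse_traceparent(incoming)
    outgoing = f"00-{trace_id}-{new_span_id()}-{flags}"
    return outgoing, parent_span_id
```

Each service would call `child_traceparent` on the header it received, record the returned parent span ID on its own span, and send the outgoing header on every downstream request. That shared trace ID is what lets the visualization tool rebuild the call tree.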

The value of distributed tracing for ATAM performance analysis is direct: tracing makes request-level latency decomposition automatic. The p99 latency of a full request can be apportioned to each service in the call chain with per-service and per-database-query granularity. This transforms the abstract ATAM sensitivity point ("the database query is likely a bottleneck") into a measured fact.

Health checks are the simplest and most foundational form of observability: binary or graded signals that indicate whether a component is functioning. Three health check patterns are standard in cloud-native systems: the liveness probe (is the process alive and not in a deadlock or infinite loop? — if this fails, the container orchestrator restarts the instance), the readiness probe (is the service ready to accept traffic? — load balancers remove instances from rotation when readiness fails, enabling zero-downtime deployments), and the startup probe (has the service completed its initialization? — prevents liveness checks from killing a slow-starting container prematurely).

Diagram: Distributed Tracing Waterfall View

Distributed Tracing Waterfall Explorer

Type: Interactive diagram
sim-id: distributed-trace-explorer
Library: p5.js
Status: Specified

Purpose: Simulate a distributed trace waterfall for a 5-service e-commerce request (API Gateway → Auth Service → Product Service → Inventory DB → Recommendation Service), showing spans, timing, and latency contributions.

Controls:
- Service latency sliders (each service: 1–200ms)
- "Inject Slowness" buttons per service to simulate a bottleneck
- Toggle "Database slow query" checkbox

Display:
- Horizontal waterfall timeline showing span start/end for each service
- Cumulative latency indicator (p50/p95 estimate)
- Color coding: green < 50ms, yellow 50–150ms, red > 150ms per span
- Total trace duration prominently displayed

Behavior: When a service's latency exceeds a configurable threshold, it highlights the span in red and displays "ATAM Sensitivity Point Detected" banner.

Service Level Indicators and Service Level Objectives

With the measurement infrastructure in place, the next layer of the reliability engineering stack is converting raw metrics into contractual commitments: the SLI/SLO framework.

A Service Level Indicator (SLI) is a carefully defined quantitative measure of service behavior that maps directly to user experience. SLIs are specific, measurable, and expressed as a ratio: (number of good events) / (total events). Examples: request success rate (non-5xx responses / total responses), latency SLI (requests completed within threshold / total requests), availability SLI (successful health check polls / total polls).

The definition of "good" must be explicit and agreed upon before measurement begins. For an SLI to be meaningful, the numerator's definition must reflect the user's actual experience, not just the system's internal success signals. A request that returns HTTP 200 with a JSON body containing {"error": "timeout", "data": null} is not a good event from the user's perspective, even though it counts as a success from the server's perspective. SLI specification is a collaborative exercise between engineering, product management, and customer success teams.
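The gap between a server-side and a user-centered definition of "good" can be made concrete. In this sketch the `Response` record and its `body_error` field are hypothetical stand-ins for whatever the real telemetry captures; the point is only that the two definitions produce different ratios for the same traffic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Response:
    status: int                       # HTTP status code
    body_error: Optional[str] = None  # error reported inside the payload, if any


def is_good_for_server(response: Response) -> bool:
    """Naive definition: anything that is not a 5xx counts as good."""
    return response.status < 500


def is_good_for_user(response: Response) -> bool:
    """User-centered definition: also reject 'successful' error payloads."""
    return response.status < 500 and response.body_error is None


def sli(responses, is_good) -> float:
    """Ratio of good events to total events."""
    if not responses:
        raise ValueError("an SLI needs at least one event")
    return sum(1 for r in responses if is_good(r)) / len(responses)
```

For four responses where one is a 503 and one is a 200 carrying a timeout error in its body, the naive SLI is 0.75 while the user-centered SLI is 0.5. Agreeing on which predicate is authoritative is the specification work described above.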

A Service Level Objective (SLO) is the target for an SLI over a specified time window: "the availability SLI shall be ≥ 99.9% measured over any rolling 28-day period." SLOs are the operational translation of availability quality attribute scenarios — they are where the abstract "the system shall be highly available" business requirement becomes an enforceable operational target.

SLO selection requires careful calibration. SLOs that are too aggressive (99.999%) incentivize excessive caution, slow feature development, and often cannot be met without enormous cost. SLOs that are too lenient permit degraded user experiences without triggering engineering attention. Google SRE practice recommends setting SLOs at the point below which users are materially harmed, which is typically lower than teams initially propose.

Error Budgets and the SRE Philosophy

The error budget is the intellectual contribution that makes SLOs actionable rather than aspirational. An error budget is the allowed non-compliance with an SLO: the fraction of events that can be "bad" before the SLO is violated.

For an availability SLO of 99.9% over 28 days, the error budget is 0.1% of 28 days × 24 hours/day × 60 minutes/hour = 40.3 minutes of allowed downtime. This budget is shared between unplanned incidents (which consume it involuntarily) and intentional risks such as deployments, migrations, and experiments (which consume it deliberately). When the error budget is healthy (much of it remains), the team can deploy frequently and take risks. When the error budget is nearly exhausted, deployments should pause, focus shifts to reliability investment, and the remaining budget is preserved for unavoidable incidents.
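The arithmetic is simple enough to automate. The following sketch reproduces the 40.3-minute figure and adds a hypothetical helper, `budget_remaining`, for tracking how much of the budget is left after a given amount of downtime; both function names are illustrative.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, window_days: int, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative once violated)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    print(round(error_budget_minutes(0.999, 28), 2))    # 40.32 minutes
    print(round(budget_remaining(0.999, 28, 10.08), 2))  # 0.75 of the budget left
```

A negative result from `budget_remaining` means the SLO has already been violated for the window, which is the condition an error budget policy would key on.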

The error budget creates a powerful organizational dynamic: it aligns reliability investment with product velocity because both are now expressed in the same currency. A product manager who wants to ship a risky feature is implicitly spending error budget. A site reliability engineer who wants to invest in a monitoring improvement is implicitly refilling the budget. This shared accounting replaces the adversarial "dev wants to ship, ops wants stability" dynamic with a collaborative "how do we best spend our budget?" conversation.

Site Reliability Engineering (SRE), as defined by Google's original SRE book, is the discipline of applying software engineering to operations problems. The SRE role — staffed by engineers who split their time between operational work and software engineering — is the organizational implementation of the SLO/error-budget philosophy. SRE practices include error budget policies (defining what happens when the budget is exhausted), blameless post-mortems (treating incidents as systems failures, not human failures), toil reduction (automating repetitive operational work), and chaos engineering (deliberately injecting failures to validate reliability assumptions).

Error Budgets Make ATAM Risk Conversations Quantitative

Vista with a tip In your next ATAM evaluation, try this: after establishing availability SLOs with the stakeholders, immediately compute the error budget in minutes per month. Then ask the team "how long did your last deployment outage last?" and "how many deployments do you do per month?" The arithmetic usually reveals that either the SLO is too aggressive for the current deployment practices, or the deployment practices need to change. This is the kind of concrete, numbers-grounded conversation that ATAM is designed to produce.

MTTR and MTBF: Measuring Reliability

While SLO/error-budget analysis describes reliability in terms of user-visible outcomes, Mean Time to Recovery (MTTR) and Mean Time Between Failures (MTBF) describe reliability from the engineering and operations perspective of incident management.

Mean Time Between Failures (MTBF) is the average elapsed time between successive failures of a repairable system. A system with MTBF of 720 hours fails approximately once per month on average. MTBF is a measure of reliability: how infrequently the system fails. High MTBF is achieved through fault-tolerant design, robust testing, staged rollouts, and redundancy.

Mean Time to Recovery (MTTR) is the average elapsed time from failure detection to full service restoration. A system with MTTR of 15 minutes recovers from failures quickly once they are detected. Because users experience the outage from the moment it begins, the time taken to detect a failure counts toward downtime as well, which is why rapid detection appears among the levers. MTTR is a measure of resilience: how effectively the system and its operations team can restore service after a failure. Low MTTR is achieved through rapid detection (alerting and monitoring), automated remediation (auto-restart, circuit breakers, failover), and practiced incident response processes.

Availability is mathematically derived from these two quantities:

\[\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}\]

This relationship reveals an important architectural insight: availability can be improved either by increasing MTBF (preventing failures) or by decreasing MTTR (recovering faster). Both paths lead to the same availability number, but they have very different cost profiles and risk distributions. Fault-tolerant hardware and exhaustive pre-release testing target MTBF; runbook automation, fast rollback mechanisms, and on-call training target MTTR. Architecturally, the choice between these paths depends on the failure modes most likely to occur and the cost of each mitigation approach — another classic ATAM tradeoff analysis.
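A short numeric check makes the two paths tangible. The sketch below plugs in the chapter's example values (MTBF of 720 hours, MTTR of 15 minutes) and compares doubling MTBF with halving MTTR; the function name is illustrative.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from MTBF and MTTR in the same time unit."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    baseline = availability(720, 0.25)         # fails monthly, 15 min to recover
    fewer_failures = availability(1440, 0.25)  # double MTBF
    faster_recovery = availability(720, 0.125)  # halve MTTR
    print(f"{baseline:.6f} {fewer_failures:.6f} {faster_recovery:.6f}")
```

Doubling MTBF and halving MTTR give exactly the same availability, because the formula depends only on the ratio of the two quantities. The tradeoff analysis therefore rests on which change is cheaper and less risky to achieve, not on the resulting number.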

Availability Calculation

The availability calculation model above applies to individual components. For systems composed of multiple components with individual availability values, the system-level availability depends on how the components are combined.

Components in series (each component must be available for the system to be available) multiply their availability values. A system with three sequential components each at 99.9% availability has system availability of 0.999 × 0.999 × 0.999 ≈ 99.7%. Each additional series component degrades the overall availability target, which is why ATAM evaluations of complex systems must compute end-to-end availability budgets explicitly rather than assuming the system inherits the availability of its best component.

Components in parallel (any one component being available is sufficient for the system to be available) dramatically improve availability because the system fails only when all replicas fail simultaneously. Two independent instances each at 99.9% availability produce a combined availability of 1 - (1 - 0.999)² = 1 - 0.000001 = 99.9999%, a 1,000× reduction in unavailability (from 0.1% to 0.0001%) for a modest duplication cost.

This serial/parallel algebra explains why highly available systems are designed with redundancy at every layer, and why the weakest link in an availability chain is often the component with no redundancy, not the component with the lowest individual availability.
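The series and parallel rules compose, so a small pair of helpers is enough to evaluate a whole architecture. This is a sketch under the same assumption the text makes, namely that component failures are independent; correlated failures (a shared power feed, a bad deployment pushed to every replica) would make the parallel result optimistic.

```python
import math


def series(*availabilities: float) -> float:
    """All components must be up: multiply the availabilities."""
    return math.prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Any one component suffices: 1 minus the product of unavailabilities."""
    return 1.0 - math.prod(1.0 - a for a in availabilities)


if __name__ == "__main__":
    print(f"{series(0.999, 0.999, 0.999):.6f}")  # three components in a chain
    print(f"{parallel(0.999, 0.999):.6f}")       # one redundant pair
    # A redundant pair in front of a single non-redundant component:
    print(f"{series(parallel(0.999, 0.999), 0.999):.6f}")
```

The last line shows the weakest-link effect: the redundant pair contributes almost nothing to unavailability, so the system lands just under the 99.9% of its one non-redundant component.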

The "nines" framework converts availability percentages to allowed downtime durations, which grounds abstract availability targets in operational reality:

Availability | Downtime per Year | Downtime per Month
99% (two nines) | 87.6 hours | 7.3 hours
99.9% (three nines) | 8.76 hours | 43.8 minutes
99.99% (four nines) | 52.6 minutes | 4.4 minutes
99.999% (five nines) | 5.26 minutes | 26.3 seconds
99.9999% (six nines) | 31.5 seconds | 2.6 seconds

The table immediately reveals something important: achieving five or six nines requires not just technically reliable components but operationally disciplined processes, because at five nines a single 10-minute deployment outage uses almost twice the entire annual downtime budget of 5.26 minutes. This makes zero-downtime deployment practices (blue-green, canary, and rolling updates, covered in Chapter 13) a prerequisite, not an optional enhancement.
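The table values can be regenerated for any target. The sketch below assumes a 365-day year and treats a month as one twelfth of a year, which is the convention the table's figures are consistent with; the function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per year, in minutes, for an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def downtime_minutes_per_month(availability: float) -> float:
    """Allowed downtime per month, taking a month as one twelfth of a year."""
    return downtime_minutes_per_year(availability) / 12


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999, 0.99999, 0.999999):
        yearly = downtime_minutes_per_year(target)
        monthly = downtime_minutes_per_month(target)
        print(f"{target:.6f}  {yearly:10.2f} min/year  {monthly:8.2f} min/month")
```

Dividing a planned outage duration by the yearly figure gives the share of the annual budget it would consume, which is a quick way to test whether a deployment practice is compatible with a proposed target.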

Diagram: Availability Architecture Analyzer

Availability Architecture Analyzer

Type: Interactive simulation
sim-id: availability-architecture-explorer
Library: p5.js
Status: Specified

Purpose: Interactive system availability calculator. Users build a system architecture by placing components in series or parallel configurations and see the resulting system availability, MTBF, MTTR, and allowed downtime.

Controls:
- Add component button: specify individual availability (99%–99.999%)
- Series/Parallel toggle for each component group
- MTTR slider per component (1–240 minutes)

Display:
- Architecture diagram showing component arrangement
- Real-time system availability calculation
- "Nines" indicator (two through six nines)
- Annual/monthly downtime budget in minutes
- Weakest-link identification (component contributing most to unavailability)
- SLO compliance indicator against configurable target

Disaster Recovery: RTO and RPO

All the preceding availability engineering assumes that systems fail and recover within a single operational environment. Disaster recovery addresses a more severe scenario: what happens when an entire data center, cloud region, or operational environment becomes unavailable — due to natural disaster, infrastructure failure, ransomware, or malicious attack?

Two objectives define the disaster recovery design space:

Recovery Time Objective (RTO) is the maximum acceptable elapsed time from a disaster event to full service restoration. RTO is a business decision: how long can the organization tolerate the system being unavailable? An e-commerce platform might set RTO at 4 hours for a full data center failure, because beyond 4 hours the brand damage and revenue loss exceed the cost of more aggressive recovery infrastructure. A healthcare system managing patient care workflows might set RTO at 15 minutes, because patient safety depends on system availability.

Recovery Point Objective (RPO) is the maximum acceptable amount of data loss measured in time. RPO answers the question: if a disaster occurs right now and all data since some point in the past is lost, what is the maximum age of the lost data the organization can tolerate? An RPO of 1 hour means the organization accepts losing up to 1 hour of transactions — any disaster recovery architecture must ensure that data is replicated or backed up at intervals ≤ 1 hour.

RTO and RPO jointly define the disaster recovery tier required, which directly maps to infrastructure cost. The relationship is approximately logarithmic: each order-of-magnitude reduction in RTO or RPO roughly doubles the infrastructure cost. The following table presents the standard DR tiers:

DR Tier | RTO | RPO | Architecture Pattern | Cost Multiplier
Backup & Restore | Hours–Days | Hours–Days | Periodic backups to cold storage |
Pilot Light | Hours | Minutes | Minimal warm standby, data replicated | 2–3×
Warm Standby | Minutes | Seconds–Minutes | Scaled-down production copy | 4–6×
Multi-Site Active/Active | < 1 minute | Near-zero | Full redundant deployment in multiple regions | 8–10×

The critical ATAM analysis point is that RTO and RPO requirements must be explicitly negotiated with stakeholders before architecture decisions are made. Architects frequently design for the infrastructure cost they can afford rather than the business requirements they must meet — or conversely, gold-plate DR architecture without confirming that the organization actually needs sub-minute RTO. The ATAM quality attribute scenario process forces this negotiation to happen explicitly.

The Disaster Recovery Test You're Probably Not Running

Vista warning A disaster recovery architecture that has never been tested is not a disaster recovery architecture — it is a disaster recovery hypothesis. From up here I can see the pattern clearly: teams spend months designing multi-region failover, document the runbooks carefully, and then never test whether failover actually works as designed. The ATAM risk register should contain "untested DR architecture" as a risk with severity proportional to the gap between RTO/RPO targets and the last time failover was actually validated. Netflix's chaos engineering practice (discussed in Chapter 13) deliberately and continuously validates this.

Geographic Redundancy and Active-Passive Failover

Geographic redundancy implements the availability algebra of parallel components at the infrastructure level: deploying complete system replicas in geographically separated data centers or cloud regions so that a failure in one region does not cause total service unavailability.

Active-passive failover is the simpler and more common pattern. The active site handles all production traffic; the passive (standby) site receives replicated data and maintains a pre-configured, ready-to-activate copy of the system infrastructure. When the active site fails, a failover process promotes the passive site to active status and redirects traffic. Failover may be triggered automatically (via health checks and automation) or manually (via operator decision based on monitoring alerts).

The key design parameters of active-passive failover are the replication lag (how far behind the passive site's data is relative to the active site — this directly determines RPO), the promotion time (how long it takes to fully activate the passive site and redirect traffic — this contributes to RTO), and the data consistency guarantee (is the passive site in a fully consistent state at the time of promotion, or might there be transactions that were committed on the active site but not yet replicated?).
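These parameters can be checked against the negotiated objectives with a few lines of code. The sketch below is a deliberately rough model: it takes worst-case data loss to equal the replication lag and estimates recovery time as detection time plus promotion time. The `detection_s` term and all names here are assumptions added for illustration; a real estimate would come from measured failover drills.

```python
from dataclasses import dataclass


@dataclass
class ActivePassiveDesign:
    replication_lag_s: float  # how far the passive site trails the active one
    detection_s: float        # assumed time to notice the active site has failed
    promotion_s: float        # time to activate the passive site and redirect traffic

    def worst_case_data_loss_s(self) -> float:
        """Writes newer than the replication lag may not have reached the standby."""
        return self.replication_lag_s

    def estimated_recovery_s(self) -> float:
        """Rough recovery estimate: notice the failure, then promote."""
        return self.detection_s + self.promotion_s


def meets_objectives(design: ActivePassiveDesign, rto_s: float, rpo_s: float) -> bool:
    """True when the design's estimates fit inside both objectives."""
    return (
        design.estimated_recovery_s() <= rto_s
        and design.worst_case_data_loss_s() <= rpo_s
    )
```

For example, a design with 30 seconds of replication lag, 60 seconds to detect, and 300 seconds to promote fits a 15-minute RTO with a 1-minute RPO, but fails a 5-minute RTO even though its RPO is comfortable.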

Active-active multi-region deployment avoids leaving a standby site idle by having both (or all) regions handle live traffic simultaneously. Each region serves a subset of users, typically via geographic DNS routing. When a region fails, the other regions absorb its traffic load (requiring sufficient capacity headroom for this scenario). Active-active provides better RTO (no promotion delay) and better resource utilization (no idle standby infrastructure), but significantly complicates data consistency: writes happening in multiple regions simultaneously require either a global coordination layer (which adds cross-region latency to coordinated writes), careful read-write routing, or an eventually consistent data model.

The architectural choice between active-passive and active-active is one of the most common tradeoff points in ATAM availability evaluations, and it should always be analyzed through the lens of both the business RTO/RPO requirements and the data consistency semantics required by the application.

Diagram: Active-Passive vs. Active-Active Failover

Failover Architecture Pattern Explorer

Type: Interactive simulation
sim-id: failover-architecture-explorer
Library: vis-network
Status: Specified

Purpose: Animated comparison of active-passive and active-active multi-region architectures showing traffic routing, replication flows, and failover behavior.

Nodes:
- Region A (primary): load balancer, app tier, database primary
- Region B (standby/secondary): load balancer, app tier, database replica
- DNS layer (global)
- Client nodes (5 geographic locations)

Modes:
- Active-Passive: all traffic → Region A; replication arrows to Region B; "Simulate Failure" button triggers animated failover sequence showing DNS cutover and passive→active promotion
- Active-Active: traffic split across regions; write routing arrows to primary database; replication arrows bidirectional

Display: RTO/RPO indicators update dynamically based on replication lag slider and failover time configuration.

Cloud Observability: The Operational Synthesis

Cloud observability is not a distinct product category so much as the application of all the preceding concepts — logging, metrics, tracing, health checks, SLIs, SLOs, error budgets — within the specific operational context of cloud-native infrastructure. Cloud environments present unique observability challenges: infrastructure is ephemeral (instances are created and destroyed automatically), components are polyglot (different services may use different languages and frameworks), and the operational surface area is enormous (a Kubernetes cluster of 100 pods running 20 microservices has thousands of potential metric dimensions).

Modern cloud observability platforms (AWS CloudWatch, Google Cloud Operations Suite, Azure Monitor, Datadog, Grafana Cloud, New Relic, Honeycomb) provide integrated solutions that ingest all three observability pillars, correlate them via trace IDs and resource tags, and provide unified query and visualization interfaces. The OpenTelemetry project (CNCF) provides vendor-neutral instrumentation libraries and an agent/collector framework that enables applications to export logs, metrics, and traces to any backend without vendor lock-in.

An effective cloud observability architecture has several characteristic properties:

Correlation is the ability to navigate from a spike in an error metric to the specific log events that generated the errors, to the distributed traces that show exactly which service and database call failed. Without correlation (which requires consistent propagation of trace IDs through all signals), observability data exists as isolated islands that are difficult to reason about during an incident.

Cardinality management is the practice of controlling the number of unique dimension combinations in metric and trace data. Every label or tag added to a metric creates a new time series. A metric with four labels each having 10 possible values produces 10,000 time series. Uncontrolled high-cardinality dimensions (user IDs, session IDs, full URL paths) can cause observability backend costs to grow superlinearly with traffic.
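The multiplication is worth running before a new label ships. In this sketch the label names and value counts are hypothetical, and the result is an upper bound, since not every combination of label values necessarily occurs in practice.

```python
import math


def max_time_series(label_cardinalities: dict) -> int:
    """Upper bound on time series for one metric: product of label value counts."""
    return math.prod(label_cardinalities.values())


if __name__ == "__main__":
    # Hypothetical labels, each with 10 possible values.
    labels = {"method": 10, "status": 10, "region": 10, "endpoint": 10}
    print(max_time_series(labels))  # 10000

    # Adding one unbounded dimension multiplies everything that came before.
    labels["user_id"] = 50_000
    print(max_time_series(labels))  # 500000000
```

A check like this could sit in code review for instrumentation changes, with high-cardinality identifiers routed to logs or trace attributes instead of metric labels.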

Alerting design is the discipline of defining alert conditions that are actionable (the on-call engineer can take a meaningful action in response), calibrated (the alert fires with appropriate precision — low false positive rate — and recall — it catches real problems), and urgency-graded (paging alerts for user-impacting conditions, lower-urgency alerts for warning conditions that need attention but not immediate response).

The Observability Journey Is Never Finished — That's Fine

Vista encouraging No production system is perfectly observable. Every incident reveals a gap in instrumentation, a metric that would have been helpful, a log field that was missing. This is normal and expected — observability is an iterative practice, not a one-time configuration. The goal is a feedback loop: incident reveals gap → gap is closed → next incident is detected faster. If your team is consistently closing observability gaps after incidents, you are doing it right. The ATAM risk register can track these gaps explicitly as technical debt with measurable impact on MTTR.

Reliability Engineering in ATAM Evaluations

The complete reliability engineering framework from this chapter integrates naturally into ATAM evaluations as follows.

SLI/SLO specifications translate directly from ATAM availability quality attribute scenarios. The scenario "the order processing service shall be available 99.95% of the time during business hours" becomes the SLO target; the "measurement approach" response measure component of the scenario maps to the SLI definition.

MTTR analysis identifies architectural decisions that affect incident response time: Are alerts configured to fire within minutes of failure, or hours? Are runbooks automated or manual? Is the team practicing incident response through game days? These factors contribute to sensitivity point identification in the ATAM evaluation — the same architectural decisions that enable fast MTTR are the decisions that determine whether the availability SLO is achievable.

RTO/RPO alignment checks that the disaster recovery architecture actually achieves the business-specified objectives. A common ATAM finding is that teams have specified aggressive RTO/RPO targets (e.g., RTO < 15 minutes) but have not designed or tested a recovery architecture capable of meeting them. This gap is a risk by definition — a scenario whose response measure cannot be met by the current architecture.

Error budget policy formalization ensures that reliability degradation triggers architectural investment rather than being normalized away. When the error budget is exhausted, the policy should require feature development to pause and reliability investment to begin — this is an organizational commitment that ATAM can help teams make explicit and stick to.

You Have Become an Observability Architect

Vista celebrating From up here the view is spectacular — you have just traversed the complete reliability stack, from raw telemetry signals to SLO contracts to disaster recovery design to cloud operational synthesis. You can now walk into any ATAM evaluation and convert vague "we need high availability" requirements into concrete SLIs, explicit SLOs, calculated error budgets, negotiated RTO/RPO targets, and measurable MTTR objectives. That is not just architecture analysis — that is operational artistry. Let's weigh those tradeoffs, fellow architects! You are ready for the real world.

Concepts Covered

This chapter covers the following 16 concepts from the learning graph:

  1. Cloud Observability — integrated logging/metrics/tracing for cloud-native ephemeral infrastructure
  2. Logging Architecture — structured event collection, shipping, aggregation, and querying across services
  3. Metrics Collection — time-series numeric measurements using pull/push models with counter/gauge/histogram types
  4. Health Check Pattern — liveness, readiness, and startup probes for automated health monitoring
  5. Service Level Objective (SLO) — measurable target for an SLI over a specified time window
  6. Service Level Indicator (SLI) — ratio metric expressing the fraction of "good" user-experience events
  7. Error Budget — the allowed non-compliance with an SLO, shared between incidents and planned risks
  8. Site Reliability Engineering — applying software engineering to operations; SLO + error budget + blameless postmortems
  9. Mean Time to Recovery — average elapsed time from failure detection to full service restoration
  10. Mean Time Between Failures — average elapsed time between successive failures; a measure of reliability
  11. Availability Calculation — series and parallel component availability algebra; nines framework
  12. Disaster Recovery — planning and infrastructure for restoring service after catastrophic failures
  13. Recovery Time Objective (RTO) — maximum acceptable elapsed time from disaster to service restoration
  14. Recovery Point Objective (RPO) — maximum acceptable data loss measured as time since last backup/replication
  15. Geographic Redundancy — deploying system replicas in geographically separated regions for resilience
  16. Active-Passive Failover — standby replica promotion pattern with explicit failover trigger and data replication

Prerequisites

This chapter builds on concepts from:

Self-Check Questions

Self-Check: Observability and Reliability

Q1: A service has an availability SLO of 99.95% over a 30-day window. How many minutes of downtime are allowed before the SLO is violated? If the service has been down for 18 minutes this month, what fraction of the error budget has been consumed?

Answer: Error budget = (1 - 0.9995) × 30 days × 24 hr × 60 min = 0.0005 × 43,200 = 21.6 minutes. After 18 minutes of downtime, 18/21.6 = 83.3% of the error budget is consumed. Only 3.6 minutes remain for the rest of the month, which should prompt caution around risky deployments and a shift of focus toward reliability work.
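The same arithmetic can be checked with a few lines of Python. This is a minimal sketch; the function name is an illustrative choice, and the inputs are the figures from the question.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    return (1.0 - slo_target) * window_days * 24 * 60


budget = error_budget_minutes(slo_target=0.9995, window_days=30)
consumed = 18.0  # minutes of downtime so far this month

print(f"budget:    {budget:.1f} min")             # 21.6 min
print(f"consumed:  {consumed / budget:.1%}")      # 83.3%
print(f"remaining: {budget - consumed:.1f} min")  # 3.6 min
```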

Q2: A system has MTBF of 200 hours and MTTR of 2 hours. What is its availability? If MTTR is reduced to 30 minutes through runbook automation, how does availability change?

Answer: Original: 200/(200+2) = 200/202 ≈ 99.01%. With 30-minute MTTR: 200/(200+0.5) = 200/200.5 ≈ 99.75%. MTTR reduction from 2 hours to 30 minutes improved availability from 99.01% to 99.75% without changing the underlying failure rate at all — demonstrating that investing in recovery speed (runbooks, automation, on-call training) is often more cost-effective than investing in failure prevention.
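A short Python sketch of the same calculation, using the steady-state formula availability = MTBF / (MTBF + MTTR) from the answer above. The function name is an illustrative choice.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), with both values in hours."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


for mttr in (2.0, 0.5):
    availability = steady_state_availability(mtbf_hours=200.0, mttr_hours=mttr)
    print(f"MTTR {mttr} h -> availability {availability:.2%}")
# MTTR 2.0 h -> availability 99.01%
# MTTR 0.5 h -> availability 99.75%
```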

Q3: Your ATAM evaluation reveals a proposed architecture with an availability requirement of 99.99% and a disaster recovery strategy that consists of "weekly backups to S3." What risks would you document?

Answer: Multiple critical risks: (1) Weekly backups imply an RPO of up to 7 days, which is badly misaligned with a 99.99% availability target; a target that strict typically goes with near-zero RPO. (2) Restoring from S3 backups takes hours to days (high RTO), while 99.99% availability allows only about 52.6 minutes of downtime per year (0.0001 × 525,600 minutes), which implies an RTO of minutes. (3) The DR architecture has never been tested (assumed). Document as: Risk: DR strategy is misaligned with stated availability SLO by multiple orders of magnitude; Severity: Critical; Mitigation: Align DR tier to RTO/RPO derived from the availability SLO, implement continuous replication, conduct quarterly DR tests.

Q4: A system composed of three components in series has individual availabilities of 99.9%, 99.95%, and 99.99%. What is the system availability? Which component should be prioritized for reliability investment?

Answer: System availability = 0.999 × 0.9995 × 0.9999 ≈ 99.84%. The 99.9% component contributes the most to unavailability (0.1% downtime) and should be prioritized. Improving it from 99.9% to 99.99% would bring system availability to 0.9999 × 0.9995 × 0.9999 ≈ 99.93%, a significant improvement. This illustrates why reliability investment should target the weakest link first, not the most expensive component.
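The series calculation and the weakest-link comparison can be reproduced with the sketch below. It assumes independent component failures, as the series formula does; the function and variable names are illustrative, and `math.prod` requires Python 3.8 or later.

```python
import math
from typing import Sequence


def series_availability(components: Sequence[float]) -> float:
    """Availability of components in series: all must be up at once."""
    return math.prod(components)


components = [0.999, 0.9995, 0.9999]
baseline = series_availability(components)

# The weakest link is the component with the lowest availability.
weakest = min(range(len(components)), key=components.__getitem__)
upgraded = list(components)
upgraded[weakest] = 0.9999  # improve only the weakest component

print(f"baseline:        {baseline:.2%}")                       # 99.84%
print(f"weakest index:   {weakest}")                            # 0
print(f"after upgrading: {series_availability(upgraded):.2%}")  # 99.93%
```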