References: Observability, Reliability, and Cloud Operations

Curated sources for deeper study of logging, metrics, distributed tracing, SLIs, SLOs, error budgets, SRE practices, MTTR/MTBF, availability calculation, disaster recovery, RTO/RPO, and geographic redundancy.

Books

  • Beyer, Betsy, Chris Jones, Jennifer Petoff, and Niall Richard Murphy, eds. (2016). Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media. The foundational SRE book, defining SLIs, SLOs, error budgets, blameless postmortems, and the reliability engineering philosophy that underpins this chapter's ATAM reliability framework.

  • Beyer, Betsy, et al., eds. (2018). The Site Reliability Workbook: Practical Ways to Implement SRE. O'Reilly Media. The practical companion to the SRE book, with detailed guidance on SLO selection, error budget policy, disaster recovery planning, and on-call engineering, all directly applicable to this chapter's ATAM reliability analysis.

  • Majors, Charity, Liz Fong-Jones, and George Miranda. (2022). Observability Engineering: Achieving Production Excellence. O'Reilly Media. Covers logs, metrics, and traces (often called the three pillars of observability) and the shift from monitoring to observability, with particular depth on the high-cardinality analysis and correlation practices described in this chapter.

Articles and Papers

  • Bourgon, Peter. (2017). "Metrics, Tracing, and Logging." Peter Bourgon's Blog. The influential post behind the "three pillars of observability" framing referenced throughout this chapter, defining the complementary roles of metrics, logs, and distributed traces in production systems.

  • Treynor, Ben, Mike Wilkes, Betsy Beyer, et al. (2014). "Keys to SRE." Google Engineering Blog. https://sre.google/in-conversation/ Google's formal articulation of SRE principles including error budget management, toil reduction, and the engineering approach to reliability that this chapter applies to ATAM quality attribute scenarios.

Online Resources

  • "Site Reliability Engineering." Google. https://sre.google Google's official SRE resource hub including the complete SRE books, SLO templates, error budget policy examples, and the MTTR/MTBF framework used throughout this chapter's reliability engineering analysis.

  • "OpenTelemetry Documentation." CNCF OpenTelemetry. https://opentelemetry.io/docs/ The official OpenTelemetry documentation covering vendor-neutral instrumentation for logs, metrics, and traces — the standard observability stack referenced in this chapter's cloud observability section.

  • "Disaster Recovery of Workloads on AWS." AWS Well-Architected Framework. https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws/ AWS's official guide to DR strategies (backup and restore, pilot light, warm standby, multi-site active/active), RTO/RPO analysis, and cost multiplier estimation, directly supporting this chapter's disaster recovery architecture analysis.

  • "The Availability Table." High Scalability. https://highscalability.com Resource for availability "nines" calculations, system availability algebra (series vs. parallel components), and the relationship between MTTR, MTBF, and system availability used in this chapter's quantitative reliability analysis.

Videos

  • "How Observability-Driven Development Changes Everything." Charity Majors. GOTO Conference. YouTube. Majors' influential talk on building observable systems from the start, covering the instrumentation practices, SLO design, and incident response improvements that make this chapter's reliability engineering actionable.