References: Observability, Dashboards, and Alerting
-
Observability - Wikipedia - Coverage of the systems-theory origin of observability and how the three pillars (logs, metrics, traces) translate to software systems; foundational context for this chapter.
-
Dashboard (business) - Wikipedia - Background on visual data presentation including the design principles (drill-down, time-series aggregation, single-screen overview) used in this chapter's cost-dashboard examples.
-
Anomaly detection - Wikipedia - Statistical and ML approaches to detecting metric outliers; informs the alerting-rule design covered here for token-spike and cost-threshold alerts.
-
Observability Engineering - Charity Majors, Liz Fong-Jones, and George Miranda - O'Reilly - The definitive reference for modern observability practice including the high-cardinality dashboard patterns this chapter applies to LLM cost.
-
The Site Reliability Workbook - Beyer, Murphy, Rensin, Kawahara, Thorne (eds) - Google / O'Reilly - Companion to the SRE book with practical chapters on alerting philosophy, SLO design, and on-call rotation that frame the alerting rules in this chapter.
-
Grafana Documentation - Grafana Labs - Reference for the most widely used open-source dashboard tool; covers the panel types, query languages, and alerting features used in this chapter's examples.
-
Prometheus Alerting Documentation - Prometheus - Reference for the de-facto standard alerting-rule language including the for-clause time-window pattern used in this chapter's burn-rate alerts.
-
Charity Majors Blog: Sociotechnical Observability - Charity Majors - Working notes on the why-and-how of observability from the co-founder of Honeycomb; influences the team-process recommendations in this chapter.
-
Google SRE: Monitoring Distributed Systems - Google - The four-golden-signals framework (latency, traffic, errors, saturation) adapted in this chapter for LLM-specific signals (token volume, cache hit rate, cost burn rate).
-
Anthropic Engineering Blog - Anthropic - Periodic posts on observability practices for LLM-using systems; pairs well with this chapter for vendor-specific dashboard patterns.