References: Log File Analysis and Cost Hotspots
- Log analysis - Wikipedia - Coverage of the broader discipline, including pattern recognition, normalization, and correlation across log sources; this chapter applies those techniques to LLM call logs.
- Pareto principle - Wikipedia - The 80/20 rule that motivates the top-N-cost-drivers analysis pattern central to this chapter; it explains why most LLM cost optimizations target a small number of features.
- Percentile - Wikipedia - Mathematical foundation for the P50/P95/P99 reporting used throughout this chapter to surface long-tail cost without being misled by means.
- Designing Data-Intensive Applications - Martin Kleppmann - O'Reilly - Chapters on stream processing and aggregation underpin the log-analysis pipelines this chapter constructs.
- Site Reliability Engineering - Beyer, Jones, Petoff, Murphy (eds.) - Google / O'Reilly - Chapters on postmortem analysis and incident review provide the framework for the cost-spike investigation pattern in this chapter.
- DuckDB Documentation - DuckDB - Reference for the embedded analytical database used in this chapter's examples for ad-hoc log analysis without standing up a full data warehouse.
- pandas DataFrame Documentation - pandas - Reference for the Python data-analysis library used in this chapter's notebooks for groupby, percentile, and time-series aggregation on log files.
- Grafana Loki LogQL - Grafana Labs - Reference for the query language used to slice and aggregate structured logs at scale; directly applicable to the per-feature and per-user roll-ups in this chapter.
- Google SRE: Postmortem Culture - Google - The blameless-postmortem framework adapted in this chapter for cost-spike analysis; explains how to write up a finding that drives a structural fix.
- Splunk Search Reference - Splunk - Reference for the SPL query language used in many enterprises for log analysis; relevant for engineers whose organizations use Splunk rather than open-source tools.
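The percentile reporting and top-N-cost-driver patterns these references support can be sketched with pandas. This is a minimal illustration, not code from the chapter: the log fields (`feature`, `cost_usd`) and the cost figures are illustrative assumptions.

```python
import pandas as pd

# Hypothetical LLM call-log records; field names and values are
# illustrative assumptions, not the chapter's actual schema.
logs = pd.DataFrame({
    "feature":  ["chat", "chat", "summarize", "chat", "summarize", "search"],
    "cost_usd": [0.002, 0.150, 0.010, 0.003, 0.012, 0.001],
})

# Per-feature roll-up: total cost plus P50/P95/P99, so long-tail
# expensive calls surface instead of being hidden by a mean.
summary = logs.groupby("feature")["cost_usd"].agg(
    total="sum",
    p50=lambda s: s.quantile(0.50),
    p95=lambda s: s.quantile(0.95),
    p99=lambda s: s.quantile(0.99),
).sort_values("total", ascending=False)

print(summary)
```

Sorting by total cost and reading from the top applies the Pareto-style analysis directly: the first few rows are the candidate cost hotspots, and a large gap between a feature's P50 and P99 flags a long-tail problem rather than a uniformly expensive one.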