References: A/B Testing Methodology for LLMs

A/B testing - Wikipedia - Comprehensive coverage of controlled online experiments including hypothesis design, traffic splitting, and statistical analysis that frame this chapter's LLM-specific guidance.
Statistical hypothesis testing - Wikipedia - Foundational coverage of the null hypothesis, p-values, and significance levels used throughout this chapter's worked examples.
Statistical power - Wikipedia - Explanation of statistical power, the probability of detecting a real effect (one minus the false-negative rate), which is central to sample-size calculation; complements the chapter's emphasis on designing tests that can detect real effects.
Trustworthy Online Controlled Experiments - Ron Kohavi, Diane Tang, Ya Xu - Cambridge University Press - A widely cited practitioner reference on A/B testing, co-authored by leaders of experimentation programs at Microsoft, Google, and LinkedIn; the structure of this chapter borrows heavily from their framework.
Statistics (4th Edition) - David Freedman, Robert Pisani, Roger Purves - W. W. Norton - An accessible undergraduate statistics text that grounds formal probability theory in real examples; ideal for engineers who want to deepen the statistical intuition this chapter builds on.
Microsoft Experimentation Platform: KDD Tutorials - Ron Kohavi et al. - Decade-plus archive of A/B testing tutorials, papers, and case studies from one of the largest experimentation programs in industry; rich source of advanced material beyond this chapter.
Evan Miller's A/B Test Calculators - Evan Miller - Free web calculators for sample size, sequential testing, and Bayesian inference that this chapter references for hands-on exercises.
GrowthBook Documentation - GrowthBook - Reference for the open-source experimentation platform used in this chapter's examples for traffic splitting and effect-size reporting.
Eppo Blog on Experimentation - Eppo - Practitioner posts on CUPED variance reduction, sequential testing, and metric design that extend this chapter's foundational treatment.
Lilian Weng: A/B Testing for ML - Lilian Weng - Working notes from an OpenAI researcher on experimentation patterns specific to ML-driven products; relevant for adapting traditional A/B methodology to LLM-quality measurement.