Hypothesis Testing
Summary
This chapter introduces hypothesis testing, the other main branch of statistical inference. Students will learn to formulate null and alternative hypotheses, calculate and interpret p-values, understand Type I and Type II errors, and make conclusions based on statistical evidence. These skills enable students to test claims about population parameters using sample data.
Concepts Covered
This chapter covers the following 22 concepts from the learning graph:
- Hypothesis Test
- Null Hypothesis
- Alternative Hypothesis
- Writing Hypotheses
- One-Sided Test
- Two-Sided Test
- Test Statistic
- P-Value
- Calculating P-Values
- Interpreting P-Values
- Significance Level
- Choosing Alpha
- Statistical Significance
- Making Conclusions
- Type I Error
- Type II Error
- Error Tradeoffs
- Power of a Test
- Test for One Proportion
- Conditions for Z-Test
- Test for Two Proportions
- Practical Significance
Prerequisites
This chapter builds on concepts from:
The Art of Statistical Detective Work
Have you ever heard someone claim that a new medication works better than an old one? Or that a coin might be unfair? Or that more than half of teenagers prefer a certain brand? How do we know if these claims are actually true, or just coincidence?
Welcome to hypothesis testing—the scientific method of statistics. This is where we become data detectives, using evidence to evaluate claims about the world. Rather than just accepting assertions at face value, we'll learn to put them on trial and let the data be the judge.
"Let's crack this nut!" Sylvia adjusts her spectacles with enthusiasm. "When I first started tracking which trees produced the best acorns, I had a hunch that oaks on the south side of the forest were more productive. But hunches aren't evidence! I needed a systematic way to test my theory. That's exactly what hypothesis testing does—it turns our hunches into rigorous questions that data can answer."
By the end of this chapter, you'll be able to:
- Formulate statistical hypotheses that can be tested with data
- Calculate and interpret p-values correctly
- Make appropriate conclusions using significance levels
- Understand the consequences of Type I and Type II errors
- Conduct hypothesis tests for proportions
- Distinguish between statistical and practical significance
What Is a Hypothesis Test?
A hypothesis test is a formal procedure for using sample data to evaluate a claim about a population parameter. Think of it as a trial where the null hypothesis is "innocent until proven guilty"—we start by assuming nothing unusual is happening and then look for evidence strong enough to convince us otherwise.
The general framework involves four key steps:
- State the hypotheses (the claim we're testing and its alternative)
- Collect data and calculate a test statistic (measuring how far the data is from what we'd expect)
- Find the p-value (the probability of getting data this extreme if the null hypothesis is true)
- Make a conclusion (reject or fail to reject based on the evidence)
This structured approach ensures we're not fooled by random variation. Just because 53% of people in a sample prefer chocolate doesn't mean more than half the population does—that 53% might just be sampling variability. Hypothesis testing helps us determine when the evidence is compelling enough to draw conclusions.
| Everyday Situation | Statistical Question |
|---|---|
| "This coin seems biased" | Is P(heads) different from 0.5? |
| "The new drug is better" | Is the cure rate higher than before? |
| "More people prefer A over B" | Is the proportion choosing A greater than 0.5? |
| "The machines produce different results" | Is there a difference in proportions? |
The Null and Alternative Hypotheses
Every hypothesis test begins with two competing statements about the population parameter.
The Null Hypothesis
The null hypothesis (denoted \( H_0 \), pronounced "H-naught" or "H-zero") represents the status quo—the claim that nothing special is happening. It's always a statement of equality or "no effect."
The null hypothesis is what we assume to be true until we have enough evidence to reject it. Think of it as the default position—the boring, expected outcome.
Examples of null hypotheses:
- \( H_0: p = 0.5 \) (the coin is fair)
- \( H_0: p = 0.20 \) (the success rate is 20%, as claimed)
- \( H_0: p_1 - p_2 = 0 \) (there's no difference between groups)
The Alternative Hypothesis
The alternative hypothesis (denoted \( H_a \) or \( H_1 \)) represents what we're trying to find evidence for—the claim that something interesting IS happening. It's the research hypothesis, the thing we suspect might be true.
Examples of alternative hypotheses:
- \( H_a: p \neq 0.5 \) (the coin is NOT fair)
- \( H_a: p > 0.20 \) (the success rate is HIGHER than 20%)
- \( H_a: p_1 - p_2 \neq 0 \) (there IS a difference between groups)
"Here's something that tripped me up at first," Sylvia admits. "The null hypothesis isn't necessarily what you believe—it's what you're trying to disprove! I suspected the south-side oaks were better, so my null hypothesis was 'there's no difference.' My alternative was 'south-side oaks produce more.' We assume the boring answer is true, then look for evidence against it."
The Logic of Hypothesis Testing
Why do we structure things this way? Because we can never prove a claim is definitely true—we can only accumulate evidence against the null hypothesis. This is similar to the legal principle of "innocent until proven guilty":
| Court Trial | Hypothesis Test |
|---|---|
| Defendant is presumed innocent | Null hypothesis is assumed true |
| Prosecution presents evidence | We collect sample data |
| Jury weighs the evidence | We calculate the p-value |
| "Guilty" if evidence is overwhelming | "Reject \( H_0 \)" if p-value is small |
| "Not guilty" if evidence is insufficient | "Fail to reject \( H_0 \)" if p-value is large |
Note: "Not guilty" doesn't mean "innocent"—it just means there wasn't enough evidence to convict. Similarly, "fail to reject \( H_0 \)" doesn't mean \( H_0 \) is true—it just means we don't have enough evidence to reject it.
Writing Hypotheses
Writing hypotheses correctly is crucial—a poorly stated hypothesis leads to a confused analysis. Here are the rules:
Rule 1: Hypotheses are about population parameters, not sample statistics
- Correct: \( H_0: p = 0.5 \) (where p is the population proportion)
- Incorrect: \( H_0: \hat{p} = 0.5 \) (that's a sample statistic)
Rule 2: The null hypothesis always contains equality
- Correct: \( H_0: p = 0.30 \)
- Incorrect: \( H_0: p > 0.30 \)
Rule 3: The alternative hypothesis determines the test direction
The form of \( H_a \) tells us what kind of evidence would lead us to reject \( H_0 \):
| If \( H_a \) contains | We call it | We reject \( H_0 \) if sample proportion is |
|---|---|---|
| \( \neq \) | Two-sided test | Far from \( p_0 \) in either direction |
| \( > \) | One-sided (right) | Much larger than \( p_0 \) |
| \( < \) | One-sided (left) | Much smaller than \( p_0 \) |
Writing Hypotheses: Complete Example
Scenario: A company claims that 40% of customers prefer their product. You survey 200 customers and want to test if the true proportion differs from the claim.
Step 1: Identify the parameter: p = the true proportion of all customers who prefer the product
Step 2: Write the null hypothesis (the claim, assuming equality): [ H_0: p = 0.40 ]
Step 3: Write the alternative hypothesis (what you're looking for evidence of): [ H_a: p \neq 0.40 ]
This is a two-sided test because we're interested in whether the proportion differs in either direction—higher OR lower than 40%.
One-Sided vs. Two-Sided Tests
The choice between a one-sided test and a two-sided test depends on your research question.
Two-Sided Tests
A two-sided test (also called two-tailed) looks for evidence that the parameter differs from the null value in either direction. Use this when:
- You want to detect any difference, regardless of direction
- You don't have a strong prior reason to expect only one direction
- You're being conservative and open to any surprise
Alternative hypothesis form: \( H_a: p \neq p_0 \)
Example: Testing whether a coin is fair (it could be biased toward heads OR tails).
One-Sided Tests
A one-sided test (also called one-tailed) looks for evidence in only one direction. Use this when:
- You only care about differences in a specific direction
- Scientific theory strongly predicts the direction
- Practical considerations make only one direction meaningful
Alternative hypothesis forms: \( H_a: p > p_0 \) or \( H_a: p < p_0 \)
Example: Testing whether a new medication improves cure rates (we only care if it's BETTER, not worse).
Choose Your Test Before Seeing the Data
You must decide whether to use a one-sided or two-sided test BEFORE collecting or analyzing data. Choosing after you see the results is cheating—it's called "p-hacking" and leads to false conclusions.
| Scenario | Appropriate Test |
|---|---|
| Does the proportion differ from 50%? | Two-sided |
| Is the proportion greater than 50%? | One-sided (right) |
| Is the proportion less than 50%? | One-sided (left) |
| Has the medication changed outcomes? | Two-sided |
| Has the medication improved outcomes? | One-sided (right) |
The Test Statistic
A test statistic measures how far our sample result is from what we'd expect if the null hypothesis were true. It standardizes the difference, allowing us to determine how unusual our sample is.
For a test about one proportion, the test statistic is:

[ z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} ]

Where:

- \( \hat{p} \) = sample proportion (what we observed)
- \( p_0 \) = hypothesized population proportion (from \( H_0 \))
- \( n \) = sample size
- The denominator is the standard error under the null hypothesis
This z-score tells us how many standard errors the sample proportion is from the hypothesized value. A z-score of 0 means the sample matched expectations perfectly. A z-score of 2 or -2 means the sample was unusually far from what we expected.
Calculating a Test Statistic: Example
Scenario: A company claims 60% of customers are satisfied. In a sample of 250 customers, 135 (54%) reported satisfaction. Test whether satisfaction differs from the claim.
Given information:

- \( \hat{p} = 135/250 = 0.54 \)
- \( p_0 = 0.60 \) (from the claim)
- \( n = 250 \)
Calculate the test statistic: [ z = \frac{0.54 - 0.60}{\sqrt{\frac{0.60(1-0.60)}{250}}} = \frac{-0.06}{\sqrt{\frac{0.24}{250}}} = \frac{-0.06}{0.031} \approx -1.94 ]
The sample proportion is about 1.94 standard errors below the claimed value. Is this unusual enough to reject the claim? We need the p-value to decide.
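If you want to verify this arithmetic yourself, here is a minimal Python sketch of the same calculation (the helper name `one_prop_z` is ours, not a library function):

```python
import math

def one_prop_z(x, n, p0):
    """Z test statistic for one proportion: how many standard errors
    the sample proportion falls from the hypothesized value p0."""
    p_hat = x / n
    se = math.sqrt(p0 * (1 - p0) / n)  # standard error under H0
    return (p_hat - p0) / se

# Satisfaction example: 135 successes out of 250, claimed rate 60%
print(round(one_prop_z(135, 250, 0.60), 2))  # ≈ -1.94
```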
Diagram: Test Statistic Calculator
Interactive Test Statistic Calculator
Type: MicroSim
Bloom Taxonomy: Apply (L3)
Bloom Taxonomy Verb: Calculate
Learning objective: Students will calculate z-test statistics for one-proportion hypothesis tests by inputting sample data and hypothesized values, understanding how the test statistic measures deviation from the null hypothesis.
Visual elements:
- Input fields for sample size (n), number of successes (x), and hypothesized proportion (p₀)
- Automatic calculation of sample proportion (p-hat = x/n)
- Visual display of the z-formula with current values substituted
- Number line showing the z-score position relative to 0
- Standard normal curve with the z-score marked
- Color-coded output: green for small z (close to expected), yellow for moderate z, red for extreme z
Interactive controls:
- Number input for sample size n (range: 10 to 1000)
- Number input for number of successes x (range: 0 to n)
- Slider or input for hypothesized proportion p₀ (range: 0.01 to 0.99)
- "Calculate" button to compute test statistic
- "Clear" button to reset all fields
- Toggle to show/hide calculation steps
Behavior:
- As inputs change, live updates show intermediate calculations
- Standard error calculation displayed step-by-step
- Test statistic updates automatically
- Visual position on normal curve updates in real-time
- Warning message if conditions not met (np₀ < 10 or n(1-p₀) < 10)
- Display interpretation: "The sample proportion is ___ standard errors [above/below] the hypothesized value"
Canvas size: 800 x 500 pixels, responsive design
Implementation: p5.js with canvas-based controls
Understanding the P-Value
The p-value is the probability of obtaining sample results at least as extreme as those observed, assuming the null hypothesis is true.
This is a mouthful, so let's break it down:
- Assume \( H_0 \) is true (we're working in a hypothetical world where the null is correct)
- Consider all possible samples we could draw in this world
- Find the probability of getting a sample as unusual as ours (or more unusual)
A small p-value means: "If \( H_0 \) were true, results like ours would be very rare. Either we witnessed something unlikely, or \( H_0 \) isn't true."
A large p-value means: "If \( H_0 \) were true, results like ours would be fairly common. There's no reason to doubt \( H_0 \)."
P-Value Interpretation
| P-Value | Verbal Interpretation |
|---|---|
| 0.50 | "Results like this happen half the time if \( H_0 \) is true" |
| 0.25 | "Results like this are fairly common under \( H_0 \)" |
| 0.10 | "Results like this are somewhat unusual under \( H_0 \)" |
| 0.05 | "Results like this are unusual under \( H_0 \)" |
| 0.01 | "Results like this are quite rare under \( H_0 \)" |
| 0.001 | "Results like this are very rare under \( H_0 \)" |
"Acorn for your thoughts?" Sylvia asks. "The p-value confused me at first. It's NOT the probability that \( H_0 \) is true! It's the probability of seeing our data (or more extreme) IF \( H_0 \) were true. Big difference! Think of it as measuring how surprised we should be if the null hypothesis is actually correct."
Calculating P-Values
Calculating p-values depends on whether your test is one-sided or two-sided.
For a Two-Sided Test
When \( H_a: p \neq p_0 \), evidence against \( H_0 \) could come from either tail. The p-value is:

[ \text{p-value} = 2 \times P(Z \geq |z|) ]

We double the one-tail probability because extreme values in either direction count as evidence.
For a One-Sided Test (Right)
When \( H_a: p > p_0 \), only large positive z-values count as evidence:

[ \text{p-value} = P(Z \geq z) ]
For a One-Sided Test (Left)
When \( H_a: p < p_0 \), only large negative z-values count as evidence:

[ \text{p-value} = P(Z \leq z) ]
P-Value Calculation Example
Continuing our satisfaction example where z = -1.94:
For a two-sided test (\( H_a: p \neq 0.60 \)): [ \text{p-value} = 2 \times P(Z < -1.94) = 2 \times 0.0262 = 0.0524 ]
For a one-sided test (\( H_a: p < 0.60 \)): [ \text{p-value} = P(Z < -1.94) = 0.0262 ]
The p-value tells us that if the true satisfaction rate were really 60%, we'd see a sample proportion as extreme as 54% (or more extreme) about 5.2% of the time with a two-sided test.
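The same tail areas can be computed with software. Here is a short Python sketch using scipy's standard normal functions, continuing with z = -1.94:

```python
from scipy.stats import norm

z = -1.94

# Two-sided: extreme results in either direction count as evidence
p_two_sided = 2 * norm.cdf(-abs(z))   # ≈ 0.0524

# One-sided (left), for H_a: p < p0
p_left = norm.cdf(z)                  # ≈ 0.0262

# One-sided (right), for H_a: p > p0, would use the upper tail
p_right = norm.sf(z)                  # = 1 - norm.cdf(z)

print(round(p_two_sided, 4), round(p_left, 4))
```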
Diagram: P-Value Visualizer
Interactive P-Value Visualization
Type: MicroSim
Bloom Taxonomy: Understand (L2)
Bloom Taxonomy Verb: Interpret
Learning objective: Students will interpret p-values by visualizing the area under the normal curve corresponding to the probability of obtaining results as extreme as the test statistic, distinguishing between one-sided and two-sided tests.
Data Visibility Requirements:
- Stage 1: Show the standard normal curve with test statistic marked
- Stage 2: Highlight the tail area(s) corresponding to the p-value
- Stage 3: Display the calculated probability as a decimal and percentage
- Stage 4: Show interpretation sentence explaining what the p-value means
Visual elements:
- Standard normal distribution curve (bell curve)
- Vertical line at z = 0 (center)
- Movable marker showing test statistic position
- Shaded region(s) representing the p-value
- Numerical display of z-value and corresponding p-value
- Toggle between one-sided (left/right) and two-sided views
Interactive controls:
- Slider for z-statistic (range: -4 to +4)
- Radio buttons: "Two-sided", "One-sided (left)", "One-sided (right)"
- "Show/Hide" toggle for each tail region
- Display mode: "Show area" or "Show probability"
- Input field to enter a specific z-value
- Reset button
Behavior:
- As z-slider moves, shaded region updates in real-time
- P-value display updates continuously
- For two-sided test, both tails shade symmetrically
- For one-sided test, only relevant tail shades
- When p-value < 0.05, region turns red; otherwise blue
- Interpretation text updates: "If H₀ is true, results this extreme occur ___% of the time"
Instructional Rationale: Step-through visualization with explicit data visibility is appropriate because the Understanding/interpret objective requires learners to connect the visual area representation to the numerical probability, building intuition about what p-values mean.
Canvas size: 800 x 450 pixels, responsive design
Implementation: p5.js with canvas-based controls
Interpreting P-Values
Interpreting p-values correctly is essential—and widely misunderstood, even by scientists!
What the P-Value IS
The p-value is the probability of obtaining sample results at least as extreme as those observed, assuming the null hypothesis is true.
Correct interpretation: "If the true proportion were [null value], there is a [p-value] probability of observing a sample proportion at least as extreme as [observed value]."
Example: "If the true satisfaction rate were 60%, there is a 5.24% probability of observing a sample proportion at least as far from 60% as our observed 54%."
What the P-Value is NOT
- NOT the probability that \( H_0 \) is true
- NOT the probability that \( H_a \) is false
- NOT the probability that you made an error
- NOT the magnitude or importance of the effect
Common Misinterpretations
| Incorrect Statement | Why It's Wrong |
|---|---|
| "There's a 5% chance \( H_0 \) is true" | P-value doesn't give probability of hypotheses |
| "The results are 95% likely to be real" | P-value is about data, not reality |
| "If p = 0.03, there's a 97% chance the treatment works" | P-value isn't about effectiveness |
| "A small p-value means a large effect" | P-value says nothing about effect size |
"Don't worry—every statistician drops an acorn sometimes," Sylvia reassures. "P-value interpretation trips up even experts! The key is remembering we're calculating a probability about DATA, not about whether our hypothesis is true. We're asking 'how surprising is this data?' not 'what's the truth?'"
The Significance Level
The significance level (denoted \( \alpha \), the Greek letter alpha) is the threshold we set BEFORE the test to decide how small the p-value must be to reject \( H_0 \).
Think of \( \alpha \) as our "surprise threshold"—how rare must the data be before we're convinced something unusual is happening?
Common Significance Levels
| \( \alpha \) | Interpretation |
|---|---|
| 0.10 | Reject if p-value < 0.10 (lenient) |
| 0.05 | Reject if p-value < 0.05 (most common) |
| 0.01 | Reject if p-value < 0.01 (strict) |
| 0.001 | Reject if p-value < 0.001 (very strict) |
The most common choice is \( \alpha = 0.05 \), meaning we reject \( H_0 \) if there's less than a 5% chance of seeing data this extreme under \( H_0 \).
Choosing Alpha
Choosing alpha involves balancing competing concerns:
Lower \( \alpha \) (like 0.01):

- Harder to reject \( H_0 \)—requires stronger evidence
- Fewer false positives (wrongly rejecting a true \( H_0 \))
- More false negatives (failing to detect real effects)
- Appropriate when false positives are costly (e.g., medical treatments)

Higher \( \alpha \) (like 0.10):

- Easier to reject \( H_0 \)—requires less evidence
- More false positives
- Fewer false negatives (better at detecting real effects)
- Appropriate in exploratory research or when missing effects is costly
| Field | Typical \( \alpha \) | Reasoning |
|---|---|---|
| Clinical trials | 0.01 or lower | False positives could harm patients |
| Psychology/social science | 0.05 | Balance of concerns |
| Exploratory research | 0.10 | Don't want to miss interesting leads |
| Particle physics | 0.0000003 (5 sigma) | Extraordinary claims require extraordinary evidence |
Statistical Significance
We say a result is statistically significant when the p-value falls below our chosen significance level \( \alpha \).
Example: With \( \alpha = 0.05 \):

- p-value = 0.03 → statistically significant (reject \( H_0 \))
- p-value = 0.08 → not statistically significant (fail to reject \( H_0 \))
The Meaning of Statistical Significance
When results are statistically significant, we're saying: "The observed difference is too large to reasonably attribute to random sampling variation alone."
When results are not statistically significant, we're saying: "The observed difference could plausibly be due to random sampling variation."
What Statistical Significance Does NOT Mean
- It does NOT mean the results are important or meaningful
- It does NOT prove \( H_0 \) is false
- It does NOT mean the effect is large
- It does NOT mean the finding will replicate
Statistical significance is about the strength of evidence against \( H_0 \), not about practical importance. A tiny, meaningless difference can be statistically significant with a large enough sample.
Making Conclusions
Making conclusions in hypothesis testing requires careful language. There are only two possible outcomes:
Outcome 1: Reject \( H_0 \)
When p-value < \( \alpha \):

- We have sufficient evidence to reject the null hypothesis
- We conclude there IS evidence supporting the alternative
- The result is statistically significant
Template: "At the \( \alpha = \_\_\_ \) significance level, we reject \( H_0 \). There is statistically significant evidence that [alternative hypothesis in context]."
Example: "At the \( \alpha = 0.05 \) significance level, we reject \( H_0 \). There is statistically significant evidence that the true customer satisfaction rate differs from 60%."
Outcome 2: Fail to Reject \( H_0 \)
When p-value \( \geq \alpha \):

- We do NOT have sufficient evidence to reject the null hypothesis
- We CANNOT conclude the alternative is true
- The result is not statistically significant
Template: "At the \( \alpha = \_\_\_ \) significance level, we fail to reject \( H_0 \). There is not sufficient evidence that [alternative hypothesis in context]."
Example: "At the \( \alpha = 0.05 \) significance level, we fail to reject \( H_0 \). There is not sufficient evidence that the true customer satisfaction rate differs from 60%."
Critical Language Points
| Correct Phrasing | NEVER Say |
|---|---|
| "Fail to reject \( H_0 \)" | "Accept \( H_0 \)" |
| "Not sufficient evidence to conclude..." | "We prove \( H_0 \) is true" |
| "Evidence suggests..." | "We prove \( H_a \) is true" |
| "At the \( \alpha \) level" | (omitting the significance level) |
"Time to squirrel away this crucial point!" Sylvia emphasizes. "We NEVER 'accept' the null hypothesis—we only fail to reject it. Just like a jury doesn't declare someone 'innocent'—they find them 'not guilty.' Absence of evidence isn't evidence of absence!"
Diagram: Hypothesis Testing Decision Flowchart
Hypothesis Testing Decision Guide
Type: workflow
Bloom Taxonomy: Apply (L3)
Bloom Taxonomy Verb: Implement
Learning objective: Students will follow the complete hypothesis testing procedure, making correct conclusions based on p-values and significance levels.
Visual elements:
- Flowchart with decision diamonds and process boxes
- Color-coded paths for different outcomes
- Summary boxes for each type of conclusion
- Examples embedded at each decision point
Steps in workflow:
1. Start: "State hypotheses (H₀ and Hₐ)" Hover text: "H₀ contains =, Hₐ contains ≠, <, or >"
2. Process: "Check conditions" Hover text: "Verify np₀ ≥ 10 and n(1-p₀) ≥ 10 for z-test"
3. Decision: "Conditions met?" Hover text: "If not, cannot proceed with z-test"
4. Process: "Calculate test statistic z" Hover text: "z = (p̂ - p₀) / √(p₀(1-p₀)/n)"
5. Process: "Find p-value" Hover text: "One-sided or two-sided based on Hₐ"
6. Decision: "p-value < α?" Hover text: "Compare to chosen significance level"
7a. (If Yes) Result: "Reject H₀" Hover text: "Statistically significant evidence for Hₐ" Color: Green
7b. (If No) Result: "Fail to Reject H₀" Hover text: "Insufficient evidence for Hₐ" Color: Orange
8. End: "State conclusion in context" Hover text: "Always relate back to the original question"
Visual style: Modern flowchart with rounded rectangles
Color coding:
- Blue: Process steps
- Yellow: Decision points
- Green: Reject H₀ path
- Orange: Fail to reject path
Canvas size: 800 x 600 pixels, responsive design
Implementation: p5.js with canvas-based hover interactions
Type I and Type II Errors
In hypothesis testing, we make decisions based on incomplete information (sample data), so we can make mistakes. There are exactly two types of errors possible.
Type I Error: False Positive
A Type I error occurs when we reject \( H_0 \) when it's actually true. We concluded something is happening when it's actually not—a false alarm.
Consequences of Type I Error:

- Claiming a treatment works when it doesn't
- Concluding a coin is unfair when it's actually fair
- Believing there's a difference when there isn't one
Probability of Type I Error: \( P(\text{Type I Error}) = \alpha \)
This is why \( \alpha \) is sometimes called the "Type I error rate." When we set \( \alpha = 0.05 \), we're accepting a 5% chance of a false positive.
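You can see this relationship with a quick simulation. The sketch below (assuming a fair coin, so \( H_0 \) really is true) repeatedly runs a two-sided z-test and counts how often it falsely rejects—close to \( \alpha \) of the time:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
n, p0, alpha = 1000, 0.5, 0.05
trials = 10_000

# Simulate many samples from a world where H0 is true (a fair coin)
heads = rng.binomial(n, p0, size=trials)
p_hat = heads / n
z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)
p_values = 2 * norm.cdf(-np.abs(z))  # two-sided p-values

# The false-positive rate should land near alpha
# (discreteness of counts keeps it from matching exactly)
print((p_values < alpha).mean())  # close to 0.05
```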
Type II Error: False Negative
A Type II error occurs when we fail to reject \( H_0 \) when it's actually false. We missed something real—we failed to detect a true effect.
Consequences of Type II Error:

- Missing an effective treatment
- Declaring a biased coin "fair"
- Concluding no difference when one exists
Probability of Type II Error: \( P(\text{Type II Error}) = \beta \)
The probability \( \beta \) depends on the sample size, the significance level, and how far the true parameter value lies from the value specified in \( H_0 \).
The Four Possible Outcomes
| | \( H_0 \) is True | \( H_0 \) is False |
|---|---|---|
| Reject \( H_0 \) | Type I Error (\( \alpha \)) | Correct Decision! |
| Fail to Reject \( H_0 \) | Correct Decision! | Type II Error (\( \beta \)) |
Real-World Analogy
| Scenario | Type I Error | Type II Error |
|---|---|---|
| Medical test | Healthy patient diagnosed with disease | Sick patient given clean bill of health |
| Fire alarm | Alarm sounds with no fire | No alarm during actual fire |
| Court trial | Convicting an innocent person | Acquitting a guilty person |
| Spam filter | Legitimate email marked as spam | Spam gets through to inbox |
"Acorn for your thoughts on this?" Sylvia asks. "In my acorn quality testing, a Type I error means I reject perfectly good acorns thinking they're bad. A Type II error means I keep bad acorns thinking they're good. Both are problems, but depending on the situation, one might be worse than the other!"
Diagram: Type I and Type II Error Visualizer
Interactive Error Type Demonstration
Type: MicroSim
Bloom Taxonomy: Analyze (L4)
Bloom Taxonomy Verb: Distinguish
Learning objective: Students will distinguish between Type I and Type II errors by exploring scenarios where the null hypothesis is either true or false, observing how different sample outcomes lead to correct decisions or errors.
Data Visibility Requirements:
- Stage 1: Show the true state of reality (H₀ true or false)
- Stage 2: Show the sample data collected
- Stage 3: Show the test statistic and p-value calculated
- Stage 4: Show the decision made (reject or fail to reject)
- Stage 5: Show the outcome (correct decision, Type I error, or Type II error)
Visual elements:
- Two parallel tracks: "Reality" track and "Our Decision" track
- Reality track shows true population parameter (controlled by user)
- Decision track shows sample, test statistic, p-value, and conclusion
- Outcome box shows whether we made correct decision or error type
- Color coding: green for correct, red for Type I, orange for Type II
- Counter tracking cumulative error rates over many trials
Interactive controls:
- Toggle: "H₀ is actually true" vs "H₀ is actually false"
- Slider for true population proportion (when H₀ is false)
- Slider for sample size n
- Input for significance level α
- "Draw One Sample" button
- "Run 100 Samples" button to see error rates accumulate
- Reset button
Behavior:
- When H₀ is true and we reject → Type I Error (red highlight)
- When H₀ is true and we fail to reject → Correct (green)
- When H₀ is false and we reject → Correct (green)
- When H₀ is false and we fail to reject → Type II Error (orange)
- Running counters show: "Type I Errors: X/Y trials when H₀ true"
- Running counters show: "Type II Errors: X/Y trials when H₀ false"
- Demonstrates that Type I rate ≈ α when H₀ is true
Instructional Rationale: Interactive exploration with explicit state visibility is appropriate because the Analyze/distinguish objective requires learners to compare outcomes across different scenarios, building understanding of when each error type occurs.
Canvas size: 850 x 550 pixels, responsive design
Implementation: p5.js with canvas-based controls
Error Tradeoffs
There's an inherent tension between Type I and Type II errors—reducing one tends to increase the other.
The Tradeoff
If we lower \( \alpha \) (more strict):

- Harder to reject \( H_0 \)
- Fewer Type I errors (fewer false positives)
- MORE Type II errors (more missed true effects)

If we raise \( \alpha \) (more lenient):

- Easier to reject \( H_0 \)
- More Type I errors
- FEWER Type II errors (better detection)
Balancing the Errors
The appropriate balance depends on consequences:
| If Type I Error is Worse | If Type II Error is Worse |
|---|---|
| Use lower \( \alpha \) (0.01) | Use higher \( \alpha \) (0.10) |
| Convicting innocent people | Letting guilty people go free |
| Approving harmful drugs | Missing effective treatments |
| False alarms waste resources | Missing real threats is dangerous |
The Only Free Lunch: Increase Sample Size
Want to improve on BOTH error types at once? Increase your sample size. Larger samples provide more information, which lets you:

- Detect real effects more reliably (reducing Type II errors at a fixed \( \alpha \))
- Afford a stricter \( \alpha \)—and thus fewer Type I errors—without sacrificing power
This is why well-funded studies use large samples—they can afford to gather more data and make more accurate decisions.
Power of a Test
The power of a test is the probability of correctly rejecting \( H_0 \) when it's false—the probability of detecting a real effect when one exists. Numerically, power \( = 1 - \beta \).

All else being equal, higher power is better. A powerful test rarely misses real effects.
Factors Affecting Power
| Factor | Effect on Power |
|---|---|
| Larger sample size | Increases power |
| Larger significance level (\( \alpha \)) | Increases power |
| Larger true effect | Increases power (easier to detect big differences) |
| Less variability in data | Increases power |
| One-sided vs two-sided test | One-sided has more power (if direction is correct) |
Practical Power Guidelines
- Minimum acceptable power: 0.80 (80% chance of detecting a real effect)
- Desirable power: 0.90 or higher
- Power below 0.50: The test is essentially useless
Power analysis is often done BEFORE data collection to determine the needed sample size. If we want 80% power to detect a specific effect size at \( \alpha = 0.05 \), we can calculate how many observations we need.
"My tail's tingling—we're onto something important!" Sylvia exclaims. "Power is like having good eyesight for data. A high-powered test can see small effects; a low-powered test might miss even obvious ones. Before starting a study, I always ask: 'Do I have enough acorns—I mean, observations—to actually detect what I'm looking for?'"
Test for One Proportion
A test for one proportion determines whether a population proportion equals a specific value. This is the most common type of proportion test.
The Setup
- Parameter: p = true population proportion
- Hypotheses:
- \( H_0: p = p_0 \) (null value)
- \( H_a: p \neq p_0 \) or \( p > p_0 \) or \( p < p_0 \)
The Test Statistic

[ z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} ]

Note: We use \( p_0 \) (not \( \hat{p} \)) in the standard error because we calculate the probability assuming \( H_0 \) is true.
Complete Example: Testing a Claim
Scenario: A coin is flipped 200 times and lands on heads 114 times. Test whether the coin is fair at \( \alpha = 0.05 \).
Step 1: State Hypotheses

- \( H_0: p = 0.5 \) (coin is fair)
- \( H_a: p \neq 0.5 \) (coin is not fair)

Step 2: Check Conditions

- \( np_0 = 200(0.5) = 100 \geq 10 \) ✓
- \( n(1-p_0) = 200(0.5) = 100 \geq 10 \) ✓

Step 3: Calculate Test Statistic

[ \hat{p} = \frac{114}{200} = 0.57 ]

[ z = \frac{0.57 - 0.50}{\sqrt{\frac{0.50(0.50)}{200}}} = \frac{0.07}{0.0354} = 1.98 ]

Step 4: Find P-Value

Two-sided test: \( \text{p-value} = 2 \times P(Z > 1.98) = 2 \times 0.0239 = 0.0478 \)

Step 5: Make Conclusion

Since p-value (0.0478) < \( \alpha \) (0.05), we reject \( H_0 \).
"At the 0.05 significance level, there is statistically significant evidence that the coin is not fair. The sample data suggests the probability of heads differs from 0.5."
Conditions for Z-Test
The conditions for z-test must be verified before conducting the test. If conditions aren't met, the p-value calculations may be inaccurate.
The Three Conditions
1. Random Sample
The data must come from a random sample or randomized experiment. This ensures the sample is representative of the population.

2. Independence (10% Rule)
Sample observations must be independent. When sampling without replacement, the sample size should be less than 10% of the population.

[ n < 0.10N ]

3. Large Counts (Normality)
The sampling distribution of \( \hat{p} \) must be approximately normal. This requires:

[ np_0 \geq 10 \quad \text{and} \quad n(1-p_0) \geq 10 ]
Note: For hypothesis tests, we check using \( p_0 \) (the hypothesized proportion), not \( \hat{p} \).
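The large-counts and 10% conditions are mechanical enough to encode; randomness is not—it must be judged from the study design. A minimal sketch:

```python
def z_test_conditions_ok(n, p0, N=None):
    """Check the mechanical conditions for a one-proportion z-test.
    The random-sample condition must be judged from the study design;
    N is the population size, if known, for the 10% rule."""
    large_counts = (n * p0 >= 10) and (n * (1 - p0) >= 10)
    independence = True if N is None else (n < 0.10 * N)
    return large_counts and independence

print(z_test_conditions_ok(n=200, p0=0.5, N=1_000_000))  # True
```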
What If Conditions Aren't Met?
| Condition Violated | Problem | Solution |
|---|---|---|
| Not random | Results may be biased | Use proper random sampling |
| Not independent | Standard error is wrong | Sample less than 10% of population |
| Small counts | Normal approximation fails | Use exact binomial test or collect more data |
Always Check Conditions First
Before calculating test statistics and p-values, verify all three conditions. Many students lose points on AP Statistics by skipping this step!
Test for Two Proportions
A test for two proportions compares proportions between two independent groups. This is useful when asking: "Is the proportion different between Group A and Group B?"
The Setup
- Parameters: \( p_1 \) = proportion in population 1, \( p_2 \) = proportion in population 2
- Hypotheses:
- \( H_0: p_1 - p_2 = 0 \) (or equivalently, \( p_1 = p_2 \))
- \( H_a: p_1 - p_2 \neq 0 \) or \( > 0 \) or \( < 0 \)
Pooled Proportion
Since \( H_0 \) assumes the proportions are equal, we combine the samples to get a pooled proportion:

[ \hat{p}_{pool} = \frac{x_1 + x_2}{n_1 + n_2} ]

where \( x_1 \) and \( x_2 \) are the numbers of successes in each sample.

The Test Statistic

[ z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\sqrt{\hat{p}_{pool}(1-\hat{p}_{pool})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} ]
Conditions for Two-Proportion Z-Test
- Random: Both samples are random samples or from randomized experiments
- Independent: Samples are independent of each other, and within each sample, observations are independent
- Large Counts: For each group: \( n_i \hat{p}_{pool} \geq 10 \) and \( n_i(1-\hat{p}_{pool}) \geq 10 \)
Complete Example: Comparing Two Groups
Scenario: In a study of a new teaching method:

- Control group (traditional): 45 of 120 passed (\( \hat{p}_1 = 0.375 \))
- Treatment group (new method): 68 of 130 passed (\( \hat{p}_2 = 0.523 \))
Test whether the new method produces different pass rates at \( \alpha = 0.05 \).
Step 1: State Hypotheses

- \( H_0: p_1 - p_2 = 0 \) (no difference)
- \( H_a: p_1 - p_2 \neq 0 \) (there is a difference)

Step 2: Calculate Pooled Proportion

[ \hat{p}_{pool} = \frac{45 + 68}{120 + 130} = \frac{113}{250} = 0.452 ]

Step 3: Check Conditions

- \( 120(0.452) = 54.2 \geq 10 \) ✓
- \( 120(0.548) = 65.8 \geq 10 \) ✓
- \( 130(0.452) = 58.8 \geq 10 \) ✓
- \( 130(0.548) = 71.2 \geq 10 \) ✓
Step 4: Calculate Test Statistic

[ z = \frac{(0.375 - 0.523) - 0}{\sqrt{0.452(0.548)\left(\frac{1}{120} + \frac{1}{130}\right)}} = \frac{-0.148}{\sqrt{0.2477(0.0083 + 0.0077)}} = \frac{-0.148}{0.0630} = -2.35 ]

Step 5: Find P-Value

Two-sided: \( \text{p-value} = 2 \times P(Z < -2.35) = 2 \times 0.0094 = 0.0188 \)

Step 6: Conclusion

Since p-value (0.0188) < \( \alpha \) (0.05), we reject \( H_0 \).
"At the 0.05 significance level, there is statistically significant evidence of a difference in pass rates between the traditional and new teaching methods."
Diagram: Two-Proportion Test Comparison
Interactive Two-Proportion Test Calculator
Type: MicroSim
Bloom Taxonomy: Apply (L3)
Bloom Taxonomy Verb: Execute
Learning objective: Students will conduct a complete two-proportion z-test by entering data from two groups, calculating the pooled proportion and test statistic, and interpreting the results.
Visual elements:
- Two side-by-side input panels for each group's data
- Visual comparison bar chart showing the two sample proportions
- Pooled proportion calculation display
- Test statistic formula with substituted values
- Normal curve with z-score marked and p-value shaded
- Conclusion statement generator
Interactive controls:
- Group 1: Input fields for n₁ (sample size) and x₁ (successes)
- Group 2: Input fields for n₂ (sample size) and x₂ (successes)
- Radio buttons for test type: two-sided, p₁ > p₂, p₁ < p₂
- Slider for significance level α (0.01, 0.05, 0.10)
- "Calculate" button
- "Check Conditions" button (highlights which conditions pass/fail)
- "Show Steps" toggle for detailed calculation breakdown
Behavior:
- Automatically calculates p̂₁, p̂₂, pooled p̂
- Displays step-by-step calculation of standard error and z-statistic
- Shows p-value with visual representation on normal curve
- Compares p-value to α and states conclusion
- Generates properly worded conclusion statement
- Warning messages for condition violations
Canvas size: 900 x 600 pixels, responsive design
Implementation: p5.js with canvas-based controls
Practical Significance
Practical significance asks whether a statistically significant result actually matters in the real world. Just because we can detect a difference doesn't mean the difference is important.
The Problem with Large Samples
With a large enough sample, even tiny differences become statistically significant. This happens because:
- The standard error shrinks as n increases
- Small differences produce large z-scores
- P-values become very small
Example: If a medication improves outcomes from 80.0% to 80.5%, that's a half-percentage-point improvement. With a huge study—say, n = 200,000 in each group—this tiny difference has a p-value well below 0.001. Statistically significant? Yes. Practically important? Probably not.
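To see the numbers, here is a quick sketch (the 200,000-per-group sample size is hypothetical, chosen to make the point):

```python
import math
from scipy.stats import norm

# Hypothetical: outcomes improve from 80.0% to 80.5% -- a trivial effect
n = 200_000                              # per group (chosen to make the point)
p1, p2 = 0.800, 0.805
pool = (p1 + p2) / 2                     # pooled proportion, equal group sizes
se = math.sqrt(pool * (1 - pool) * (2 / n))
z = (p2 - p1) / se
print(round(z, 2), 2 * norm.sf(abs(z)))  # z ≈ 3.97, p-value far below 0.001
```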
Statistical vs. Practical Significance
| Aspect | Statistical Significance | Practical Significance |
|---|---|---|
| Question answered | "Is there any difference?" | "Is the difference big enough to matter?" |
| Measured by | P-value | Effect size, context |
| Affected by | Sample size | Real-world implications |
| Can exist without the other? | Yes—a trivial effect turns significant in a huge sample | Yes—a meaningful effect can miss significance in a small sample |
Assessing Practical Significance
When evaluating practical significance, consider:
- The size of the effect: How big is the difference in practical terms?
- The context: What would this difference mean in the real world?
- The cost-benefit analysis: Is the difference worth acting on?
- Comparison to meaningful benchmarks: How does the effect compare to what's considered meaningful in this field?
Example Evaluation:

- A diet produces statistically significant weight loss of 0.5 pounds
- Context: Is losing half a pound meaningful for health?
- Cost: What are the burdens of this diet?
- Benchmark: Doctors consider 5% body weight loss clinically meaningful
Conclusion: Statistically significant but not practically significant.
"Here's wisdom from my acorn research," Sylvia shares. "I once found a statistically significant difference in acorn weight between north and south sides of a tree—about 0.01 grams. My p-value was tiny! But would I change my foraging strategy over one hundredth of a gram? Of course not. The effect was real but meaningless for my purposes."
Complete Hypothesis Testing Examples
Let's work through complete examples that bring all the concepts together.
Example 1: One-Proportion Test
Scenario: A school claims that 75% of their graduates go to college. A sample of 180 graduates finds that 125 went to college. Test whether the true proportion differs from the claim at \( \alpha = 0.05 \).
Solution:
Step 1: State Hypotheses

- \( H_0: p = 0.75 \)
- \( H_a: p \neq 0.75 \)

Step 2: Check Conditions

- Random: Assuming this is a random sample of graduates ✓
- Independence: 180 is less than 10% of all graduates (assuming a large alumni base) ✓
- Large counts: \( 180(0.75) = 135 \geq 10 \) ✓ and \( 180(0.25) = 45 \geq 10 \) ✓

Step 3: Calculate Test Statistic

[ \hat{p} = \frac{125}{180} = 0.694 ]

[ z = \frac{0.694 - 0.75}{\sqrt{\frac{0.75(0.25)}{180}}} = \frac{-0.056}{0.0323} = -1.73 ]

Step 4: Find P-Value

[ \text{p-value} = 2 \times P(Z < -1.73) = 2 \times 0.0418 = 0.0836 ]

Step 5: Conclusion

Since p-value (0.0836) > \( \alpha \) (0.05), we fail to reject \( H_0 \).
At the 0.05 significance level, there is not sufficient evidence to conclude that the true proportion of graduates going to college differs from 75%.
Example 2: One-Sided Test
Scenario: A company's old website had a 12% conversion rate. After a redesign, 156 of 1000 visitors made a purchase. Test whether the new design improved conversions at \( \alpha = 0.01 \).
Solution:
Step 1: State Hypotheses

- \( H_0: p = 0.12 \)
- \( H_a: p > 0.12 \) (one-sided, testing for improvement)

Step 2: Check Conditions

- \( 1000(0.12) = 120 \geq 10 \) ✓
- \( 1000(0.88) = 880 \geq 10 \) ✓

Step 3: Calculate Test Statistic

[ \hat{p} = \frac{156}{1000} = 0.156 ]

[ z = \frac{0.156 - 0.12}{\sqrt{\frac{0.12(0.88)}{1000}}} = \frac{0.036}{0.0103} = 3.50 ]

Step 4: Find P-Value

One-sided (right): \( \text{p-value} = P(Z > 3.50) = 0.0002 \)

Step 5: Conclusion

Since p-value (0.0002) < \( \alpha \) (0.01), we reject \( H_0 \).
At the 0.01 significance level, there is statistically significant evidence that the new website design has a higher conversion rate than the old design's 12%.
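Reusing the `one_prop_ztest` helper sketched earlier in the chapter, the result can be checked in two lines:

```python
# Reuses one_prop_ztest() defined in the earlier sketch
z, p = one_prop_ztest(156, 1000, 0.12, alternative="greater")
print(round(z, 2), round(p, 4))  # z ≈ 3.50, p ≈ 0.0002
```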
Common Mistakes in Hypothesis Testing
Before we conclude, let's address errors that trip up even careful students:
Mistake 1: Wrong Hypothesis Structure

- Using \( \hat{p} \) instead of p in hypotheses (hypotheses are about parameters, not statistics)
- Putting the inequality in \( H_0 \) instead of \( H_a \)

Mistake 2: Using the Wrong Standard Error

- For hypothesis tests, use \( p_0 \) (the null value) in the standard error
- For confidence intervals, use \( \hat{p} \) (the sample value)

Mistake 3: Confusing P-Value Interpretation

- The p-value is NOT the probability that \( H_0 \) is true
- It's the probability of getting data this extreme IF \( H_0 \) were true

Mistake 4: Saying "Accept \( H_0 \)"

- We never "accept" the null hypothesis
- We either "reject" or "fail to reject"

Mistake 5: Ignoring Practical Significance

- Statistical significance doesn't imply practical importance
- Always consider context and effect size

Mistake 6: Choosing Test Direction After Seeing Data

- One-sided vs. two-sided must be decided before analysis
- Choosing after seeing results is "p-hacking"

Mistake 7: Forgetting to Check Conditions

- Always verify random sample, independence, and large counts
- Invalid conditions mean unreliable p-values
Key Takeaways
"Time to squirrel away the big ideas!"
- A hypothesis test uses sample data to evaluate a claim about a population parameter
- The null hypothesis (\( H_0 \)) represents "no effect" and contains equality; the alternative hypothesis (\( H_a \)) is what we seek evidence for
- One-sided tests look for evidence in one direction; two-sided tests look for evidence in either direction
- The test statistic measures how far the sample result is from what's expected under \( H_0 \)
- The p-value is the probability of getting results as extreme as observed, assuming \( H_0 \) is true
- The significance level (\( \alpha \)) is the threshold for rejecting \( H_0 \); the most common choice is 0.05
- Statistical significance means p-value < \( \alpha \); we reject \( H_0 \) and have evidence for \( H_a \)
- Type I error: rejecting a true \( H_0 \) (false positive); probability = \( \alpha \)
- Type II error: failing to reject a false \( H_0 \) (false negative); probability = \( \beta \)
- Power = 1 - \( \beta \) = probability of correctly rejecting a false \( H_0 \)
- For a one-proportion z-test: \( z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} \)
- For a two-proportion z-test: use the pooled proportion when calculating the standard error
- Practical significance considers whether a statistically significant difference actually matters in context
Practice Problems
Check Your Understanding
Problem 1: A company claims that 90% of orders are delivered on time. In a random sample of 400 orders, 348 were on time.
a) State the null and alternative hypotheses to test if the proportion differs from the claim.

b) Calculate the test statistic.

c) Find the p-value.

d) At α = 0.05, what is your conclusion?
Problem 2: A researcher wants to test if more than 60% of adults support a new policy. In a random sample of 500 adults, 325 support the policy.
a) State the appropriate hypotheses.

b) Is this a one-sided or two-sided test? Why?

c) Conduct the complete hypothesis test at α = 0.05.
Problem 3: Explain the difference between a Type I error and a Type II error in the context of testing whether a new drug is effective.
Problem 4: Two factories produce the same product. Factory A had 24 defects in 600 items. Factory B had 42 defects in 700 items. Test whether there's a difference in defect rates at α = 0.05.
Problem 5: A study finds that a diet pill produces statistically significant weight loss (p < 0.001) with an average loss of 0.3 pounds. Discuss statistical versus practical significance.
Solutions
Problem 1:
a) \( H_0: p = 0.90 \), \( H_a: p \neq 0.90 \)
b) \( \hat{p} = 348/400 = 0.87 \)

\( z = \frac{0.87 - 0.90}{\sqrt{0.90(0.10)/400}} = \frac{-0.03}{0.015} = -2.0 \)
c) p-value = 2 × P(Z < -2.0) = 2 × 0.0228 = 0.0456
d) Since 0.0456 < 0.05, reject \( H_0 \). At the 0.05 significance level, there is statistically significant evidence that the on-time delivery rate differs from 90%.
Problem 2:
a) \( H_0: p = 0.60 \), \( H_a: p > 0.60 \)
b) One-sided (right-tailed) because we're specifically testing if more than 60% support the policy.
c) \( \hat{p} = 325/500 = 0.65 \)

\( z = \frac{0.65 - 0.60}{\sqrt{0.60(0.40)/500}} = \frac{0.05}{0.0219} = 2.28 \)

p-value = P(Z > 2.28) = 0.0113

Since 0.0113 < 0.05, reject \( H_0 \). There is statistically significant evidence that more than 60% of adults support the policy.
Problem 3: Type I error: Concluding the drug is effective when it actually isn't (approving an ineffective drug). This could lead to patients receiving useless treatment.
Type II error: Concluding the drug is not effective when it actually is (rejecting an effective drug). This could deny patients a beneficial treatment.
In drug testing, Type I errors are often considered more dangerous because approving an ineffective drug wastes resources and may have side effects with no benefit.
Problem 4:

\( \hat{p}_A = 24/600 = 0.04 \), \( \hat{p}_B = 42/700 = 0.06 \)

\( \hat{p}_{pool} = (24+42)/(600+700) = 66/1300 = 0.0508 \)

\( z = \frac{0.04 - 0.06}{\sqrt{0.0508(0.9492)(1/600 + 1/700)}} = \frac{-0.02}{0.0123} = -1.63 \)

p-value = 2 × P(Z < -1.63) = 2 × 0.0516 = 0.1032

Since 0.1032 > 0.05, fail to reject \( H_0 \). There is not sufficient evidence of a difference in defect rates between the factories.
Problem 5: The result is statistically significant (p < 0.001), meaning if the pill had no effect, observing 0.3 pounds average weight loss would be extremely rare.
However, the practical significance is questionable. Losing 0.3 pounds is barely noticeable and unlikely to have health benefits. The very small p-value likely results from a large sample size rather than a meaningful effect. Clinically meaningful weight loss is typically considered 5% of body weight. This pill might "work" statistically but isn't practically useful.
You've now mastered the core concepts of hypothesis testing—one of the most important tools in statistical inference. This framework for making decisions based on data applies everywhere: in medicine, business, science, and everyday life.
"Now that's a data point worth collecting!" Sylvia beams. "You've learned to be a data detective, testing claims with evidence rather than just accepting them. Whether you're evaluating a new product, a medical treatment, or even whether a coin is fair, you now have the tools to let the data speak. That's a superpower worth having!"
In the upcoming chapters, we'll extend these ideas to tests involving means and apply hypothesis testing to more complex scenarios. The logic remains the same—only the formulas change.