Hypothesis Testing
Summary
This chapter introduces hypothesis testing, the other main branch of statistical inference. Students will learn to formulate null and alternative hypotheses, calculate and interpret p-values, understand Type I and Type II errors, and make conclusions based on statistical evidence. These skills enable students to test claims about population parameters using sample data.
Concepts Covered
This chapter covers the following 22 concepts from the learning graph:
- Hypothesis Test
- Null Hypothesis
- Alternative Hypothesis
- Writing Hypotheses
- One-Sided Test
- Two-Sided Test
- Test Statistic
- P-Value
- Calculating P-Values
- Interpreting P-Values
- Significance Level
- Choosing Alpha
- Statistical Significance
- Making Conclusions
- Type I Error
- Type II Error
- Error Tradeoffs
- Power of a Test
- Test for One Proportion
- Conditions for Z-Test
- Test for Two Proportions
- Practical Significance
Prerequisites
This chapter builds on concepts from:
The Art of Statistical Detective Work
Have you ever heard someone claim that a new medication works better than an old one? Or that a coin might be unfair? Or that more than half of teenagers prefer a certain brand? How do we know if these claims are actually true, or just coincidence?
Welcome to hypothesis testing—the scientific method of statistics. This is where we become data detectives, using evidence to evaluate claims about the world. Rather than just accepting assertions at face value, we'll learn to put them on trial and let the data be the judge.
"Let's crack this nut!" Sylvia adjusts her spectacles with enthusiasm. "When I first started tracking which trees produced the best acorns, I had a hunch that oaks on the south side of the forest were more productive. But hunches aren't evidence! I needed a systematic way to test my theory. That's exactly what hypothesis testing does—it turns our hunches into rigorous questions that data can answer."
By the end of this chapter, you'll be able to:
- Formulate statistical hypotheses that can be tested with data
- Calculate and interpret p-values correctly
- Make appropriate conclusions using significance levels
- Understand the consequences of Type I and Type II errors
- Conduct hypothesis tests for proportions
- Distinguish between statistical and practical significance
What Is a Hypothesis Test?
A hypothesis test is a formal procedure for using sample data to evaluate a claim about a population parameter. Think of it as a trial where the null hypothesis is "innocent until proven guilty"—we start by assuming nothing unusual is happening and then look for evidence strong enough to convince us otherwise.
The general framework involves four key steps:
- State the hypotheses (the claim we're testing and its alternative)
- Collect data and calculate a test statistic (measuring how far the data is from what we'd expect)
- Find the p-value (the probability of getting data this extreme if the null hypothesis is true)
- Make a conclusion (reject or fail to reject based on the evidence)
This structured approach ensures we're not fooled by random variation. Just because 53% of people in a sample prefer chocolate doesn't mean more than half the population does—that 53% might just be sampling variability. Hypothesis testing helps us determine when the evidence is compelling enough to draw conclusions.
| Everyday Situation | Statistical Question |
|---|---|
| "This coin seems biased" | Is P(heads) different from 0.5? |
| "The new drug is better" | Is the cure rate higher than before? |
| "More people prefer A over B" | Is the proportion choosing A greater than 0.5? |
| "The machines produce different results" | Is there a difference in proportions? |
The Null and Alternative Hypotheses
Every hypothesis test begins with two competing statements about the population parameter.
The Null Hypothesis
The null hypothesis (denoted \( H_0 \), pronounced "H-naught" or "H-zero") represents the status quo—the claim that nothing special is happening. It's always a statement of equality or "no effect."
The null hypothesis is what we assume to be true until we have enough evidence to reject it. Think of it as the default position—the boring, expected outcome.
Examples of null hypotheses:
- \( H_0: p = 0.5 \) (the coin is fair)
- \( H_0: p = 0.20 \) (the success rate is 20%, as claimed)
- \( H_0: p_1 - p_2 = 0 \) (there's no difference between groups)
The Alternative Hypothesis
The alternative hypothesis (denoted \( H_a \) or \( H_1 \)) represents what we're trying to find evidence for—the claim that something interesting IS happening. It's the research hypothesis, the thing we suspect might be true.
Examples of alternative hypotheses:
- \( H_a: p \neq 0.5 \) (the coin is NOT fair)
- \( H_a: p > 0.20 \) (the success rate is HIGHER than 20%)
- \( H_a: p_1 - p_2 \neq 0 \) (there IS a difference between groups)
"Here's something that tripped me up at first," Sylvia admits. "The null hypothesis isn't necessarily what you believe—it's what you're trying to disprove! I suspected the south-side oaks were better, so my null hypothesis was 'there's no difference.' My alternative was 'south-side oaks produce more.' We assume the boring answer is true, then look for evidence against it."
The Logic of Hypothesis Testing
Why do we structure things this way? Because we can never prove a claim is definitely true—we can only accumulate evidence against the null hypothesis. This is similar to the legal principle of "innocent until proven guilty":
| Court Trial | Hypothesis Test |
|---|---|
| Defendant is presumed innocent | Null hypothesis is assumed true |
| Prosecution presents evidence | We collect sample data |
| Jury weighs the evidence | We calculate the p-value |
| "Guilty" if evidence is overwhelming | "Reject \( H_0 \)" if p-value is small |
| "Not guilty" if evidence is insufficient | "Fail to reject \( H_0 \)" if p-value is large |
Note: "Not guilty" doesn't mean "innocent"—it just means there wasn't enough evidence to convict. Similarly, "fail to reject \( H_0 \)" doesn't mean \( H_0 \) is true—it just means we don't have enough evidence to reject it.
Writing Hypotheses
Writing hypotheses correctly is crucial—a poorly stated hypothesis leads to a confused analysis. Here are the rules:
Rule 1: Hypotheses are about population parameters, not sample statistics
- Correct: \( H_0: p = 0.5 \) (where p is the population proportion)
- Incorrect: \( H_0: \hat{p} = 0.5 \) (that's a sample statistic)
Rule 2: The null hypothesis always contains equality
- Correct: \( H_0: p = 0.30 \)
- Incorrect: \( H_0: p > 0.30 \)
Rule 3: The alternative hypothesis determines the test direction
The form of \( H_a \) tells us what kind of evidence would lead us to reject \( H_0 \):
| If \( H_a \) contains | We call it | We reject \( H_0 \) if sample proportion is |
|---|---|---|
| \( \neq \) | Two-sided test | Far from \( p_0 \) in either direction |
| \( > \) | One-sided (right) | Much larger than \( p_0 \) |
| \( < \) | One-sided (left) | Much smaller than \( p_0 \) |
Writing Hypotheses: Complete Example
Scenario: A company claims that 40% of customers prefer their product. You survey 200 customers and want to test if the true proportion differs from the claim.
Step 1: Identify the parameter: p = the true proportion of all customers who prefer the product
Step 2: Write the null hypothesis (the claim, assuming equality): [ H_0: p = 0.40 ]
Step 3: Write the alternative hypothesis (what you're looking for evidence of): [ H_a: p \neq 0.40 ]
This is a two-sided test because we're interested in whether the proportion differs in either direction—higher OR lower than 40%.
One-Sided vs. Two-Sided Tests
The choice between a one-sided test and a two-sided test depends on your research question.
Two-Sided Tests
A two-sided test (also called two-tailed) looks for evidence that the parameter differs from the null value in either direction. Use this when:
- You want to detect any difference, regardless of direction
- You don't have a strong prior reason to expect only one direction
- You're being conservative and open to any surprise
Alternative hypothesis form: \( H_a: p \neq p_0 \)
Example: Testing whether a coin is fair (it could be biased toward heads OR tails).
One-Sided Tests
A one-sided test (also called one-tailed) looks for evidence in only one direction. Use this when:
- You only care about differences in a specific direction
- Scientific theory strongly predicts the direction
- Practical considerations make only one direction meaningful
Alternative hypothesis forms: \( H_a: p > p_0 \) or \( H_a: p < p_0 \)
Example: Testing whether a new medication improves cure rates (we only care if it's BETTER, not worse).
Choose Your Test Before Seeing the Data
You must decide whether to use a one-sided or two-sided test BEFORE collecting or analyzing data. Choosing after you see the results is cheating—it's called "p-hacking" and leads to false conclusions.
| Scenario | Appropriate Test |
|---|---|
| Does the proportion differ from 50%? | Two-sided |
| Is the proportion greater than 50%? | One-sided (right) |
| Is the proportion less than 50%? | One-sided (left) |
| Has the medication changed outcomes? | Two-sided |
| Has the medication improved outcomes? | One-sided (right) |
The Test Statistic
A test statistic measures how far our sample result is from what we'd expect if the null hypothesis were true. It standardizes the difference, allowing us to determine how unusual our sample is.
For a test about one proportion, the test statistic is:

[ z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} ]

Where:

- \( \hat{p} \) = sample proportion (what we observed)
- \( p_0 \) = hypothesized population proportion (from \( H_0 \))
- \( n \) = sample size
- The denominator is the standard error under the null hypothesis
This z-score tells us how many standard errors the sample proportion is from the hypothesized value. A z-score of 0 means the sample matched expectations perfectly. A z-score of 2 or -2 means the sample was unusually far from what we expected.
Calculating a Test Statistic: Example
Scenario: A company claims 60% of customers are satisfied. In a sample of 250 customers, 135 (54%) reported satisfaction. Test whether satisfaction differs from the claim.
Given information:

- \( \hat{p} = 135/250 = 0.54 \)
- \( p_0 = 0.60 \) (from the claim)
- \( n = 250 \)
Calculate the test statistic: [ z = \frac{0.54 - 0.60}{\sqrt{\frac{0.60(1-0.60)}{250}}} = \frac{-0.06}{\sqrt{\frac{0.24}{250}}} = \frac{-0.06}{0.031} \approx -1.94 ]
The sample proportion is about 1.94 standard errors below the claimed value. Is this unusual enough to reject the claim? We need the p-value to decide.
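If you want to verify this arithmetic yourself, here is a minimal Python sketch of the same calculation (the helper name `one_prop_z` is ours, not a library function):

```python
import math

def one_prop_z(x, n, p0):
    """Z test statistic for one proportion: how many standard errors
    the sample proportion falls from the hypothesized value p0."""
    p_hat = x / n
    se = math.sqrt(p0 * (1 - p0) / n)  # standard error under H0
    return (p_hat - p0) / se

# Satisfaction example: 135 successes out of 250, claimed rate 60%
print(round(one_prop_z(135, 250, 0.60), 2))  # ≈ -1.94
```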
Diagram: Test Statistic Calculator
Interactive Test Statistic Calculator
Type: MicroSim
Bloom Taxonomy: Apply (L3)
Bloom Taxonomy Verb: Calculate
Learning objective: Students will calculate z-test statistics for one-proportion hypothesis tests by inputting sample data and hypothesized values, understanding how the test statistic measures deviation from the null hypothesis.
Visual elements:
- Input fields for sample size (n), number of successes (x), and hypothesized proportion (p₀)
- Automatic calculation of sample proportion (p-hat = x/n)
- Visual display of the z-formula with current values substituted
- Number line showing the z-score position relative to 0
- Standard normal curve with the z-score marked
- Color-coded output: green for small z (close to expected), yellow for moderate z, red for extreme z
Interactive controls:
- Number input for sample size n (range: 10 to 1000)
- Number input for number of successes x (range: 0 to n)
- Slider or input for hypothesized proportion p₀ (range: 0.01 to 0.99)
- "Calculate" button to compute test statistic
- "Clear" button to reset all fields
- Toggle to show/hide calculation steps
Behavior:
- As inputs change, live updates show intermediate calculations
- Standard error calculation displayed step-by-step
- Test statistic updates automatically
- Visual position on normal curve updates in real-time
- Warning message if conditions not met (np₀ < 10 or n(1-p₀) < 10)
- Display interpretation: "The sample proportion is ___ standard errors [above/below] the hypothesized value"
Canvas size: 800 x 500 pixels, responsive design
Implementation: p5.js with canvas-based controls
Understanding the P-Value
The p-value is the probability of obtaining sample results at least as extreme as those observed, assuming the null hypothesis is true.
This is a mouthful, so let's break it down:
- Assume \( H_0 \) is true (we're working in a hypothetical world where the null is correct)
- Consider all possible samples we could draw in this world
- Find the probability of getting a sample as unusual as ours (or more unusual)
A small p-value means: "If \( H_0 \) were true, results like ours would be very rare. Either we witnessed something unlikely, or \( H_0 \) isn't true."
A large p-value means: "If \( H_0 \) were true, results like ours would be fairly common. There's no reason to doubt \( H_0 \)."
P-Value Interpretation
| P-Value | Verbal Interpretation |
|---|---|
| 0.50 | "Results like this happen half the time if \( H_0 \) is true" |
| 0.25 | "Results like this are fairly common under \( H_0 \)" |
| 0.10 | "Results like this are somewhat unusual under \( H_0 \)" |
| 0.05 | "Results like this are unusual under \( H_0 \)" |
| 0.01 | "Results like this are quite rare under \( H_0 \)" |
| 0.001 | "Results like this are very rare under \( H_0 \)" |
"Acorn for your thoughts?" Sylvia asks. "The p-value confused me at first. It's NOT the probability that \( H_0 \) is true! It's the probability of seeing our data (or more extreme) IF \( H_0 \) were true. Big difference! Think of it as measuring how surprised we should be if the null hypothesis is actually correct."
Calculating P-Values
Calculating p-values depends on whether your test is one-sided or two-sided.
For a Two-Sided Test
When \( H_a: p \neq p_0 \), evidence against \( H_0 \) could come from either tail. The p-value is:

[ \text{p-value} = 2 \times P(Z \geq |z|) ]

We double the one-tail probability because extreme values in either direction count as evidence.
For a One-Sided Test (Right)
When \( H_a: p > p_0 \), only large positive z-values count as evidence:

[ \text{p-value} = P(Z \geq z) ]
For a One-Sided Test (Left)
When \( H_a: p < p_0 \), only large negative z-values count as evidence:

[ \text{p-value} = P(Z \leq z) ]
P-Value Calculation Example
Continuing our satisfaction example where z = -1.94:
For a two-sided test (\( H_a: p \neq 0.60 \)): [ \text{p-value} = 2 \times P(Z < -1.94) = 2 \times 0.0262 = 0.0524 ]
For a one-sided test (\( H_a: p < 0.60 \)): [ \text{p-value} = P(Z < -1.94) = 0.0262 ]
The p-value tells us that if the true satisfaction rate were really 60%, we'd see a sample proportion as extreme as 54% (or more extreme) about 5.2% of the time with a two-sided test.
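The same tail areas can be computed with software. Here is a short Python sketch using scipy's standard normal functions, continuing with z = -1.94:

```python
from scipy.stats import norm

z = -1.94

# Two-sided: extreme results in either direction count as evidence
p_two_sided = 2 * norm.cdf(-abs(z))   # ≈ 0.0524

# One-sided (left), for H_a: p < p0
p_left = norm.cdf(z)                  # ≈ 0.0262

# One-sided (right), for H_a: p > p0, would use the upper tail
p_right = norm.sf(z)                  # = 1 - norm.cdf(z)

print(round(p_two_sided, 4), round(p_left, 4))
```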
Diagram: P-Value Visualizer
Interactive P-Value Visualization
Type: MicroSim
Bloom Taxonomy: Understand (L2)
Bloom Taxonomy Verb: Interpret
Learning objective: Students will interpret p-values by visualizing the area under the normal curve corresponding to the probability of obtaining results as extreme as the test statistic, distinguishing between one-sided and two-sided tests.
Data Visibility Requirements:
- Stage 1: Show the standard normal curve with test statistic marked
- Stage 2: Highlight the tail area(s) corresponding to the p-value
- Stage 3: Display the calculated probability as a decimal and percentage
- Stage 4: Show interpretation sentence explaining what the p-value means
Visual elements:
- Standard normal distribution curve (bell curve)
- Vertical line at z = 0 (center)
- Movable marker showing test statistic position
- Shaded region(s) representing the p-value
- Numerical display of z-value and corresponding p-value
- Toggle between one-sided (left/right) and two-sided views
Interactive controls:
- Slider for z-statistic (range: -4 to +4)
- Radio buttons: "Two-sided", "One-sided (left)", "One-sided (right)"
- "Show/Hide" toggle for each tail region
- Display mode: "Show area" or "Show probability"
- Input field to enter a specific z-value
- Reset button
Behavior:
- As z-slider moves, shaded region updates in real-time
- P-value display updates continuously
- For two-sided test, both tails shade symmetrically
- For one-sided test, only relevant tail shades
- When p-value < 0.05, region turns red; otherwise blue
- Interpretation text updates: "If H₀ is true, results this extreme occur ___% of the time"
Instructional Rationale: Step-through visualization with explicit data visibility is appropriate because the Understanding/interpret objective requires learners to connect the visual area representation to the numerical probability, building intuition about what p-values mean.
Canvas size: 800 x 450 pixels, responsive design
Implementation: p5.js with canvas-based controls
Interpreting P-Values
Interpreting p-values correctly is essential—and widely misunderstood, even by scientists!
What the P-Value IS
The p-value is the probability of obtaining sample results at least as extreme as those observed, assuming the null hypothesis is true.
Correct interpretation: "If the true proportion were [null value], there is a [p-value] probability of observing a sample proportion at least as extreme as [observed value]."
Example: "If the true satisfaction rate were 60%, there is a 5.24% probability of observing a sample proportion at least as far from 60% as our observed 54%."
What the P-Value is NOT
- NOT the probability that \( H_0 \) is true
- NOT the probability that \( H_a \) is false
- NOT the probability that you made an error
- NOT the magnitude or importance of the effect
Common Misinterpretations
| Incorrect Statement | Why It's Wrong |
|---|---|
| "There's a 5% chance \( H_0 \) is true" | P-value doesn't give probability of hypotheses |
| "The results are 95% likely to be real" | P-value is about data, not reality |
| "If p = 0.03, there's a 97% chance the treatment works" | P-value isn't about effectiveness |
| "A small p-value means a large effect" | P-value says nothing about effect size |
"Don't worry—every statistician drops an acorn sometimes," Sylvia reassures. "P-value interpretation trips up even experts! The key is remembering we're calculating a probability about DATA, not about whether our hypothesis is true. We're asking 'how surprising is this data?' not 'what's the truth?'"
The Significance Level
The significance level (denoted \( \alpha \), the Greek letter alpha) is the threshold we set BEFORE the test to decide how small the p-value must be to reject \( H_0 \).
Think of \( \alpha \) as our "surprise threshold"—how rare must the data be before we're convinced something unusual is happening?
Common Significance Levels
| \( \alpha \) | Interpretation |
|---|---|
| 0.10 | Reject if p-value < 0.10 (lenient) |
| 0.05 | Reject if p-value < 0.05 (most common) |
| 0.01 | Reject if p-value < 0.01 (strict) |
| 0.001 | Reject if p-value < 0.001 (very strict) |
The most common choice is \( \alpha = 0.05 \), meaning we reject \( H_0 \) if there's less than a 5% chance of seeing data this extreme under \( H_0 \).
Choosing Alpha
Choosing alpha involves balancing competing concerns:
Lower \( \alpha \) (like 0.01):

- Harder to reject \( H_0 \)—requires stronger evidence
- Fewer false positives (wrongly rejecting a true \( H_0 \))
- More false negatives (failing to detect real effects)
- Appropriate when false positives are costly (e.g., medical treatments)

Higher \( \alpha \) (like 0.10):

- Easier to reject \( H_0 \)—requires less evidence
- More false positives
- Fewer false negatives (better at detecting real effects)
- Appropriate in exploratory research or when missing effects is costly
| Field | Typical \( \alpha \) | Reasoning |
|---|---|---|
| Clinical trials | 0.01 or lower | False positives could harm patients |
| Psychology/social science | 0.05 | Balance of concerns |
| Exploratory research | 0.10 | Don't want to miss interesting leads |
| Particle physics | 0.0000003 (5 sigma) | Extraordinary claims require extraordinary evidence |
Statistical Significance
We say a result is statistically significant when the p-value falls below our chosen significance level \( \alpha \).
Example: With \( \alpha = 0.05 \):

- p-value = 0.03 → statistically significant (reject \( H_0 \))
- p-value = 0.08 → not statistically significant (fail to reject \( H_0 \))
The Meaning of Statistical Significance
When results are statistically significant, we're saying: "The observed difference is too large to reasonably attribute to random sampling variation alone."
When results are not statistically significant, we're saying: "The observed difference could plausibly be due to random sampling variation."
What Statistical Significance Does NOT Mean
- It does NOT mean the results are important or meaningful
- It does NOT prove \( H_0 \) is false
- It does NOT mean the effect is large
- It does NOT mean the finding will replicate
Statistical significance is about the strength of evidence against \( H_0 \), not about practical importance. A tiny, meaningless difference can be statistically significant with a large enough sample.
Making Conclusions
Making conclusions in hypothesis testing requires careful language. There are only two possible outcomes:
Outcome 1: Reject \( H_0 \)
When p-value < \( \alpha \):

- We have sufficient evidence to reject the null hypothesis
- We conclude there IS evidence supporting the alternative
- The result is statistically significant
Template: "At the \( \alpha = \_\_\_ \) significance level, we reject \( H_0 \). There is statistically significant evidence that [alternative hypothesis in context]."
Example: "At the \( \alpha = 0.05 \) significance level, we reject \( H_0 \). There is statistically significant evidence that the true customer satisfaction rate differs from 60%."
Outcome 2: Fail to Reject \( H_0 \)
When p-value \( \geq \alpha \):

- We do NOT have sufficient evidence to reject the null hypothesis
- We CANNOT conclude the alternative is true
- The result is not statistically significant
Template: "At the \( \alpha = \_\_\_ \) significance level, we fail to reject \( H_0 \). There is not sufficient evidence that [alternative hypothesis in context]."
Example: "At the \( \alpha = 0.05 \) significance level, we fail to reject \( H_0 \). There is not sufficient evidence that the true customer satisfaction rate differs from 60%."
Critical Language Points
| Correct Phrasing | NEVER Say |
|---|---|
| "Fail to reject \( H_0 \)" | "Accept \( H_0 \)" |
| "Not sufficient evidence to conclude..." | "We prove \( H_0 \) is true" |
| "Evidence suggests..." | "We prove \( H_a \) is true" |
| "At the \( \alpha \) level" | (omitting the significance level) |
"Time to squirrel away this crucial point!" Sylvia emphasizes. "We NEVER 'accept' the null hypothesis—we only fail to reject it. Just like a jury doesn't declare someone 'innocent'—they find them 'not guilty.' Absence of evidence isn't evidence of absence!"
Diagram: Hypothesis Testing Decision Flowchart
Hypothesis Testing Decision Guide
Type: workflow
Bloom Taxonomy: Apply (L3)
Bloom Taxonomy Verb: Implement
Learning objective: Students will follow the complete hypothesis testing procedure, making correct conclusions based on p-values and significance levels.
Visual elements:
- Flowchart with decision diamonds and process boxes
- Color-coded paths for different outcomes
- Summary boxes for each type of conclusion
- Examples embedded at each decision point
Steps in workflow:
1. Start: "State hypotheses (H₀ and Hₐ)" Hover text: "H₀ contains =, Hₐ contains ≠, <, or >"
2. Process: "Check conditions" Hover text: "Verify np₀ ≥ 10 and n(1-p₀) ≥ 10 for z-test"
3. Decision: "Conditions met?" Hover text: "If not, cannot proceed with z-test"
4. Process: "Calculate test statistic z" Hover text: "z = (p̂ - p₀) / √(p₀(1-p₀)/n)"
5. Process: "Find p-value" Hover text: "One-sided or two-sided based on Hₐ"
6. Decision: "p-value < α?" Hover text: "Compare to chosen significance level"
7a. (If Yes) Result: "Reject H₀" Hover text: "Statistically significant evidence for Hₐ" Color: Green
7b. (If No) Result: "Fail to Reject H₀" Hover text: "Insufficient evidence for Hₐ" Color: Orange
8. End: "State conclusion in context" Hover text: "Always relate back to the original question"
Visual style: Modern flowchart with rounded rectangles
Color coding:
- Blue: Process steps
- Yellow: Decision points
- Green: Reject H₀ path
- Orange: Fail to reject path
Canvas size: 800 x 600 pixels, responsive design
Implementation: p5.js with canvas-based hover interactions
Type I and Type II Errors
In hypothesis testing, we make decisions based on incomplete information (sample data), so we can make mistakes. There are exactly two types of errors possible.
Type I Error: False Positive
A Type I error occurs when we reject \( H_0 \) when it's actually true. We concluded something is happening when it's actually not—a false alarm.
Consequences of Type I Error:

- Claiming a treatment works when it doesn't
- Concluding a coin is unfair when it's actually fair
- Believing there's a difference when there isn't one
Probability of Type I Error: \( P(\text{Type I Error}) = \alpha \)
This is why \( \alpha \) is sometimes called the "Type I error rate." When we set \( \alpha = 0.05 \), we're accepting a 5% chance of a false positive.
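You can see this relationship with a quick simulation. The sketch below (assuming a fair coin, so \( H_0 \) really is true) repeatedly runs a two-sided z-test and counts how often it falsely rejects—close to \( \alpha \) of the time:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
n, p0, alpha = 1000, 0.5, 0.05
trials = 10_000

# Simulate many samples from a world where H0 is true (a fair coin)
heads = rng.binomial(n, p0, size=trials)
p_hat = heads / n
z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)
p_values = 2 * norm.cdf(-np.abs(z))  # two-sided p-values

# The false-positive rate should land near alpha
# (discreteness of counts keeps it from matching exactly)
print((p_values < alpha).mean())  # close to 0.05
```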
Type II Error: False Negative
A Type II error occurs when we fail to reject \( H_0 \) when it's actually false. We missed something real—we failed to detect a true effect.
Consequences of Type II Error:

- Missing an effective treatment
- Declaring a biased coin "fair"
- Concluding no difference when one exists
Probability of Type II Error: \( P(\text{Type II Error}) = \beta \)
The probability \( \beta \) depends on the sample size, the significance level, and how far the true parameter value lies from the value specified in \( H_0 \).
The Four Possible Outcomes
| | \( H_0 \) is True | \( H_0 \) is False |
|---|---|---|
| Reject \( H_0 \) | Type I Error (\( \alpha \)) | Correct Decision! |
| Fail to Reject \( H_0 \) | Correct Decision! | Type II Error (\( \beta \)) |
Real-World Analogy
| Scenario | Type I Error | Type II Error |
|---|---|---|
| Medical test | Healthy patient diagnosed with disease | Sick patient given clean bill of health |
| Fire alarm | Alarm sounds with no fire | No alarm during actual fire |
| Court trial | Convicting an innocent person | Acquitting a guilty person |
| Spam filter | Legitimate email marked as spam | Spam gets through to inbox |
"Acorn for your thoughts on this?" Sylvia asks. "In my acorn quality testing, a Type I error means I reject perfectly good acorns thinking they're bad. A Type II error means I keep bad acorns thinking they're good. Both are problems, but depending on the situation, one might be worse than the other!"
Diagram: Type I and Type II Error Visualizer
Interactive Error Type Demonstration
Type: MicroSim
Bloom Taxonomy: Analyze (L4)
Bloom Taxonomy Verb: Distinguish
Learning objective: Students will distinguish between Type I and Type II errors by exploring scenarios where the null hypothesis is either true or false, observing how different sample outcomes lead to correct decisions or errors.
Data Visibility Requirements:
- Stage 1: Show the true state of reality (H₀ true or false)
- Stage 2: Show the sample data collected
- Stage 3: Show the test statistic and p-value calculated
- Stage 4: Show the decision made (reject or fail to reject)
- Stage 5: Show the outcome (correct decision, Type I error, or Type II error)
Visual elements:
- Two parallel tracks: "Reality" track and "Our Decision" track
- Reality track shows true population parameter (controlled by user)
- Decision track shows sample, test statistic, p-value, and conclusion
- Outcome box shows whether we made correct decision or error type
- Color coding: green for correct, red for Type I, orange for Type II
- Counter tracking cumulative error rates over many trials
Interactive controls:
- Toggle: "H₀ is actually true" vs "H₀ is actually false"
- Slider for true population proportion (when H₀ is false)
- Slider for sample size n
- Input for significance level α
- "Draw One Sample" button
- "Run 100 Samples" button to see error rates accumulate
- Reset button
Behavior:
- When H₀ is true and we reject → Type I Error (red highlight)
- When H₀ is true and we fail to reject → Correct (green)
- When H₀ is false and we reject → Correct (green)
- When H₀ is false and we fail to reject → Type II Error (orange)
- Running counters show: "Type I Errors: X/Y trials when H₀ true"
- Running counters show: "Type II Errors: X/Y trials when H₀ false"
- Demonstrates that Type I rate ≈ α when H₀ is true
Instructional Rationale: Interactive exploration with explicit state visibility is appropriate because the Analyze/distinguish objective requires learners to compare outcomes across different scenarios, building understanding of when each error type occurs.
Canvas size: 850 x 550 pixels, responsive design
Implementation: p5.js with canvas-based controls
Error Tradeoffs
There's an inherent tension between Type I and Type II errors—reducing one tends to increase the other.
The Tradeoff
If we lower \( \alpha \) (more strict):

- Harder to reject \( H_0 \)
- Fewer Type I errors (fewer false positives)
- MORE Type II errors (more missed true effects)

If we raise \( \alpha \) (more lenient):

- Easier to reject \( H_0 \)
- More Type I errors
- FEWER Type II errors (better detection)
Balancing the Errors
The appropriate balance depends on consequences:
| If Type I Error is Worse | If Type II Error is Worse |
|---|---|
| Use lower \( \alpha \) (0.01) | Use higher \( \alpha \) (0.10) |
| Convicting innocent people | Letting guilty people go free |
| Approving harmful drugs | Missing effective treatments |
| False alarms waste resources | Missing real threats is dangerous |
The Only Free Lunch: Increase Sample Size
Want to improve on BOTH error types at once? Increase your sample size. Larger samples provide more information, which lets you:

- Detect real effects more reliably (reducing Type II errors at a fixed \( \alpha \))
- Afford a stricter \( \alpha \)—and thus fewer Type I errors—without sacrificing power
This is why well-funded studies use large samples—they can afford to gather more data and make more accurate decisions.
Power of a Test
The power of a test is the probability of correctly rejecting \( H_0 \) when it's false—the probability of detecting a real effect when one exists. Numerically, power \( = 1 - \beta \).

All else being equal, higher power is better. A powerful test rarely misses real effects.
Factors Affecting Power
| Factor | Effect on Power |
|---|---|
| Larger sample size | Increases power |
| Larger significance level (\( \alpha \)) | Increases power |
| Larger true effect | Increases power (easier to detect big differences) |
| Less variability in data | Increases power |
| One-sided vs two-sided test | One-sided has more power (if direction is correct) |
Practical Power Guidelines
- Minimum acceptable power: 0.80 (80% chance of detecting a real effect)
- Desirable power: 0.90 or higher
- Power below 0.50: The test is essentially useless
Power analysis is often done BEFORE data collection to determine the needed sample size. If we want 80% power to detect a specific effect size at \( \alpha = 0.05 \), we can calculate how many observations we need.
"My tail's tingling—we're onto something important!" Sylvia exclaims. "Power is like having good eyesight for data. A high-powered test can see small effects; a low-powered test might miss even obvious ones. Before starting a study, I always ask: 'Do I have enough acorns—I mean, observations—to actually detect what I'm looking for?'"
Test for One Proportion
A test for one proportion determines whether a population proportion equals a specific value. This is the most common type of proportion test.
The Setup
- Parameter: p = true population proportion
- Hypotheses:
- \( H_0: p = p_0 \) (null value)
- \( H_a: p \neq p_0 \) or \( p > p_0 \) or \( p < p_0 \)
The Test Statistic

[ z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} ]

Note: We use \( p_0 \) (not \( \hat{p} \)) in the standard error because we calculate the probability assuming \( H_0 \) is true.
Complete Example: Testing a Claim
Scenario: A coin is flipped 200 times and lands on heads 114 times. Test whether the coin is fair at \( \alpha = 0.05 \).
Step 1: State Hypotheses

- \( H_0: p = 0.5 \) (coin is fair)
- \( H_a: p \neq 0.5 \) (coin is not fair)

Step 2: Check Conditions

- \( np_0 = 200(0.5) = 100 \geq 10 \) ✓
- \( n(1-p_0) = 200(0.5) = 100 \geq 10 \) ✓

Step 3: Calculate Test Statistic

[ \hat{p} = \frac{114}{200} = 0.57 ]

[ z = \frac{0.57 - 0.50}{\sqrt{\frac{0.50(0.50)}{200}}} = \frac{0.07}{0.0354} = 1.98 ]

Step 4: Find P-Value

Two-sided test: \( \text{p-value} = 2 \times P(Z > 1.98) = 2 \times 0.0239 = 0.0478 \)

Step 5: Make Conclusion

Since p-value (0.0478) < \( \alpha \) (0.05), we reject \( H_0 \).
"At the 0.05 significance level, there is statistically significant evidence that the coin is not fair. The sample data suggests the probability of heads differs from 0.5."
Conditions for Z-Test
The conditions for z-test must be verified before conducting the test. If conditions aren't met, the p-value calculations may be inaccurate.
The Three Conditions
1. Random Sample
The data must come from a random sample or randomized experiment. This ensures the sample is representative of the population.

2. Independence (10% Rule)
Sample observations must be independent. When sampling without replacement, the sample size should be less than 10% of the population.

[ n < 0.10N ]

3. Large Counts (Normality)
The sampling distribution of \( \hat{p} \) must be approximately normal. This requires:

[ np_0 \geq 10 \quad \text{and} \quad n(1-p_0) \geq 10 ]
Note: For hypothesis tests, we check using \( p_0 \) (the hypothesized proportion), not \( \hat{p} \).
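The large-counts and 10% conditions are mechanical enough to encode; randomness is not—it must be judged from the study design. A minimal sketch:

```python
def z_test_conditions_ok(n, p0, N=None):
    """Check the mechanical conditions for a one-proportion z-test.
    The random-sample condition must be judged from the study design;
    N is the population size, if known, for the 10% rule."""
    large_counts = (n * p0 >= 10) and (n * (1 - p0) >= 10)
    independence = True if N is None else (n < 0.10 * N)
    return large_counts and independence

print(z_test_conditions_ok(n=200, p0=0.5, N=1_000_000))  # True
```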
What If Conditions Aren't Met?
| Condition Violated | Problem | Solution |
|---|---|---|
| Not random | Results may be biased | Use proper random sampling |
| Not independent | Standard error is wrong | Sample less than 10% of population |
| Small counts | Normal approximation fails | Use exact binomial test or collect more data |
Always Check Conditions First
Before calculating test statistics and p-values, verify all three conditions. Many students lose points on AP Statistics by skipping this step!
Test for Two Proportions
A test for two proportions compares proportions between two independent groups. This is useful when asking: "Is the proportion different between Group A and Group B?"
The Setup
- Parameters: \( p_1 \) = proportion in population 1, \( p_2 \) = proportion in population 2
- Hypotheses:
- \( H_0: p_1 - p_2 = 0 \) (or equivalently, \( p_1 = p_2 \))
- \( H_a: p_1 - p_2 \neq 0 \) or \( > 0 \) or \( < 0 \)
Pooled Proportion
Since \( H_0 \) assumes the proportions are equal, we combine the samples to get a pooled proportion:

[ \hat{p}_{pool} = \frac{x_1 + x_2}{n_1 + n_2} ]

where \( x_1 \) and \( x_2 \) are the numbers of successes in each sample.

The Test Statistic

[ z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\sqrt{\hat{p}_{pool}(1-\hat{p}_{pool})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} ]
Conditions for Two-Proportion Z-Test
- Random: Both samples are random samples or from randomized experiments
- Independent: Samples are independent of each other, and within each sample, observations are independent
- Large Counts: For each group: \( n_i \hat{p}_{pool} \geq 10 \) and \( n_i(1-\hat{p}_{pool}) \geq 10 \)
Complete Example: Comparing Two Groups
Scenario: In a study of a new teaching method:

- Control group (traditional): 45 of 120 passed (\( \hat{p}_1 = 0.375 \))
- Treatment group (new method): 68 of 130 passed (\( \hat{p}_2 = 0.523 \))
Test whether the new method produces different pass rates at \( \alpha = 0.05 \).
Step 1: State Hypotheses

- \( H_0: p_1 - p_2 = 0 \) (no difference)
- \( H_a: p_1 - p_2 \neq 0 \) (there is a difference)

Step 2: Calculate Pooled Proportion

[ \hat{p}_{pool} = \frac{45 + 68}{120 + 130} = \frac{113}{250} = 0.452 ]

Step 3: Check Conditions

- \( 120(0.452) = 54.2 \geq 10 \) ✓
- \( 120(0.548) = 65.8 \geq 10 \) ✓
- \( 130(0.452) = 58.8 \geq 10 \) ✓
- \( 130(0.548) = 71.2 \geq 10 \) ✓
Step 4: Calculate Test Statistic

[ z = \frac{(0.375 - 0.523) - 0}{\sqrt{0.452(0.548)\left(\frac{1}{120} + \frac{1}{130}\right)}} = \frac{-0.148}{\sqrt{0.2477(0.0083 + 0.0077)}} = \frac{-0.148}{0.0630} = -2.35 ]

Step 5: Find P-Value

Two-sided: \( \text{p-value} = 2 \times P(Z < -2.35) = 2 \times 0.0094 = 0.0188 \)

Step 6: Conclusion

Since p-value (0.0188) < \( \alpha \) (0.05), we reject \( H_0 \).
"At the 0.05 significance level, there is statistically significant evidence of a difference in pass rates between the traditional and new teaching methods."
Diagram: Two-Proportion Test Comparison
Interactive Two-Proportion Test Calculator
Type: MicroSim
Bloom Taxonomy: Apply (L3)
Bloom Taxonomy Verb: Execute
Learning objective: Students will conduct a complete two-proportion z-test by entering data from two groups, calculating the pooled proportion and test statistic, and interpreting the results.
Visual elements:
- Two side-by-side input panels for each group's data
- Visual comparison bar chart showing the two sample proportions
- Pooled proportion calculation display
- Test statistic formula with substituted values
- Normal curve with z-score marked and p-value shaded
- Conclusion statement generator
Interactive controls:
- Group 1: Input fields for n₁ (sample size) and x₁ (successes)
- Group 2: Input fields for n₂ (sample size) and x₂ (successes)
- Radio buttons for test type: two-sided, p₁ > p₂, p₁ < p₂
- Slider for significance level α (0.01, 0.05, 0.10)
- "Calculate" button
- "Check Conditions" button (highlights which conditions pass/fail)
- "Show Steps" toggle for detailed calculation breakdown
Behavior:
- Automatically calculates p̂₁, p̂₂, pooled p̂
- Displays step-by-step calculation of standard error and z-statistic
- Shows p-value with visual representation on normal curve
- Compares p-value to α and states conclusion
- Generates properly worded conclusion statement
- Warning messages for condition violations
Canvas size: 900 x 600 pixels, responsive design
Implementation: p5.js with canvas-based controls
Practical Significance
Practical significance asks whether a statistically significant result actually matters in the real world. Just because we can detect a difference doesn't mean the difference is important.
The Problem with Large Samples
With a large enough sample, even tiny differences become statistically significant. This happens because:
- The standard error shrinks as n increases
- Small differences produce large z-scores
- P-values become very small
Example: If a medication improves outcomes from 80.0% to 80.5%, that's a half-percentage-point improvement. With a huge study—say, n = 200,000 in each group—this tiny difference has a p-value well below 0.001. Statistically significant? Yes. Practically important? Probably not.
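To see the numbers, here is a quick sketch (the 200,000-per-group sample size is hypothetical, chosen to make the point):

```python
import math
from scipy.stats import norm

# Hypothetical: outcomes improve from 80.0% to 80.5% -- a trivial effect
n = 200_000                              # per group (chosen to make the point)
p1, p2 = 0.800, 0.805
pool = (p1 + p2) / 2                     # pooled proportion, equal group sizes
se = math.sqrt(pool * (1 - pool) * (2 / n))
z = (p2 - p1) / se
print(round(z, 2), 2 * norm.sf(abs(z)))  # z ≈ 3.97, p-value far below 0.001
```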
Statistical vs. Practical Significance
| Aspect | Statistical Significance | Practical Significance |
|---|---|---|
| Question answered | "Is there any difference?" | "Is the difference big enough to matter?" |
| Measured by | P-value | Effect size, context |
| Affected by | Sample size | Real-world implications |
| Can exist without the other? | Yes—a trivial effect turns significant in a huge sample | Yes—a meaningful effect can miss significance in a small sample |
Assessing Practical Significance
When evaluating practical significance, consider:
- The size of the effect: How big is the difference in practical terms?
- The context: What would this difference mean in the real world?
- The cost-benefit analysis: Is the difference worth acting on?
- Comparison to meaningful benchmarks: How does the effect compare to what's considered meaningful in this field?
Example Evaluation:

- A diet produces statistically significant weight loss of 0.5 pounds
- Context: Is losing half a pound meaningful for health?
- Cost: What are the burdens of this diet?
- Benchmark: Doctors consider 5% body weight loss clinically meaningful
Conclusion: Statistically significant but not practically significant.
"Here's wisdom from my acorn research," Sylvia shares. "I once found a statistically significant difference in acorn weight between north and south sides of a tree—about 0.01 grams. My p-value was tiny! But would I change my foraging strategy over one hundredth of a gram? Of course not. The effect was real but meaningless for my purposes."
Complete Hypothesis Testing Examples
Let's work through complete examples that bring all the concepts together.
Example 1: One-Proportion Test
Scenario: A school claims that 75% of their graduates go to college. A sample of 180 graduates finds that 125 went to college. Test whether the true proportion differs from the claim at \( \alpha = 0.05 \).
Solution:
Step 1: State Hypotheses

- \( H_0: p = 0.75 \)
- \( H_a: p \neq 0.75 \)

Step 2: Check Conditions

- Random: Assuming this is a random sample of graduates ✓
- Independence: 180 is less than 10% of all graduates (assuming a large alumni base) ✓
- Large counts: \( 180(0.75) = 135 \geq 10 \) ✓ and \( 180(0.25) = 45 \geq 10 \) ✓

Step 3: Calculate Test Statistic

[ \hat{p} = \frac{125}{180} = 0.694 ]

[ z = \frac{0.694 - 0.75}{\sqrt{\frac{0.75(0.25)}{180}}} = \frac{-0.056}{0.0323} = -1.73 ]

Step 4: Find P-Value

[ \text{p-value} = 2 \times P(Z < -1.73) = 2 \times 0.0418 = 0.0836 ]

Step 5: Conclusion

Since p-value (0.0836) > \( \alpha \) (0.05), we fail to reject \( H_0 \).
At the 0.05 significance level, there is not sufficient evidence to conclude that the true proportion of graduates going to college differs from 75%.
Example 2: One-Sided Test
Scenario: A company's old website had a 12% conversion rate. After a redesign, 156 of 1000 visitors made a purchase. Test whether the new design improved conversions at \( \alpha = 0.01 \).
Solution:
Step 1: State Hypotheses

- \( H_0: p = 0.12 \)
- \( H_a: p > 0.12 \) (one-sided, testing for improvement)

Step 2: Check Conditions

- \( 1000(0.12) = 120 \geq 10 \) ✓
- \( 1000(0.88) = 880 \geq 10 \) ✓

Step 3: Calculate Test Statistic

[ \hat{p} = \frac{156}{1000} = 0.156 ]

[ z = \frac{0.156 - 0.12}{\sqrt{\frac{0.12(0.88)}{1000}}} = \frac{0.036}{0.0103} = 3.50 ]

Step 4: Find P-Value

One-sided (right): \( \text{p-value} = P(Z > 3.50) = 0.0002 \)

Step 5: Conclusion

Since p-value (0.0002) < \( \alpha \) (0.01), we reject \( H_0 \).
At the 0.01 significance level, there is statistically significant evidence that the new website design has a higher conversion rate than the old design's 12%.
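Reusing the `one_prop_ztest` helper sketched earlier in the chapter, the result can be checked in two lines:

```python
# Reuses one_prop_ztest() defined in the earlier sketch
z, p = one_prop_ztest(156, 1000, 0.12, alternative="greater")
print(round(z, 2), round(p, 4))  # z ≈ 3.50, p ≈ 0.0002
```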
Common Mistakes in Hypothesis Testing
Before we conclude, let's address errors that trip up even careful students:
Mistake 1: Wrong Hypothesis Structure

- Using \( \hat{p} \) instead of p in hypotheses (hypotheses are about parameters, not statistics)
- Putting the inequality in \( H_0 \) instead of \( H_a \)

Mistake 2: Using the Wrong Standard Error

- For hypothesis tests, use \( p_0 \) (the null value) in the standard error
- For confidence intervals, use \( \hat{p} \) (the sample value)

Mistake 3: Confusing P-Value Interpretation

- The p-value is NOT the probability that \( H_0 \) is true
- It's the probability of getting data this extreme IF \( H_0 \) were true

Mistake 4: Saying "Accept \( H_0 \)"

- We never "accept" the null hypothesis
- We either "reject" or "fail to reject"

Mistake 5: Ignoring Practical Significance

- Statistical significance doesn't imply practical importance
- Always consider context and effect size

Mistake 6: Choosing Test Direction After Seeing Data

- One-sided vs. two-sided must be decided before analysis
- Choosing after seeing results is "p-hacking"

Mistake 7: Forgetting to Check Conditions

- Always verify random sample, independence, and large counts
- Invalid conditions mean unreliable p-values
Key Takeaways
"Time to squirrel away the big ideas!"
- A hypothesis test uses sample data to evaluate a claim about a population parameter
- The null hypothesis (\( H_0 \)) represents "no effect" and contains equality; the alternative hypothesis (\( H_a \)) is what we seek evidence for
- One-sided tests look for evidence in one direction; two-sided tests look for evidence in either direction
- The test statistic measures how far the sample result is from what's expected under \( H_0 \)
- The p-value is the probability of getting results as extreme as observed, assuming \( H_0 \) is true
- The significance level (\( \alpha \)) is the threshold for rejecting \( H_0 \); the most common choice is 0.05
- Statistical significance means p-value < \( \alpha \); we reject \( H_0 \) and have evidence for \( H_a \)
- Type I error: rejecting a true \( H_0 \) (false positive); probability = \( \alpha \)
- Type II error: failing to reject a false \( H_0 \) (false negative); probability = \( \beta \)
- Power = 1 - \( \beta \) = probability of correctly rejecting a false \( H_0 \)
- For a one-proportion z-test: \( z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} \)
- For a two-proportion z-test: use the pooled proportion when calculating the standard error
- Practical significance considers whether a statistically significant difference actually matters in context
Practice Problems
Check Your Understanding
Problem 1: A company claims that 90% of orders are delivered on time. In a random sample of 400 orders, 348 were on time.
a) State the null and alternative hypotheses to test if the proportion differs from the claim.

b) Calculate the test statistic.

c) Find the p-value.

d) At α = 0.05, what is your conclusion?
Problem 2: A researcher wants to test if more than 60% of adults support a new policy. In a random sample of 500 adults, 325 support the policy.
a) State the appropriate hypotheses.

b) Is this a one-sided or two-sided test? Why?

c) Conduct the complete hypothesis test at α = 0.05.
Problem 3: Explain the difference between a Type I error and a Type II error in the context of testing whether a new drug is effective.
Problem 4: Two factories produce the same product. Factory A had 24 defects in 600 items. Factory B had 42 defects in 700 items. Test whether there's a difference in defect rates at α = 0.05.
Problem 5: A study finds that a diet pill produces statistically significant weight loss (p < 0.001) with an average loss of 0.3 pounds. Discuss statistical versus practical significance.
Solutions
Problem 1:
a) \( H_0: p = 0.90 \), \( H_a: p \neq 0.90 \)
b) \( \hat{p} = 348/400 = 0.87 \)

\( z = \frac{0.87 - 0.90}{\sqrt{0.90(0.10)/400}} = \frac{-0.03}{0.015} = -2.0 \)
c) p-value = 2 × P(Z < -2.0) = 2 × 0.0228 = 0.0456
d) Since 0.0456 < 0.05, reject \( H_0 \). At the 0.05 significance level, there is statistically significant evidence that the on-time delivery rate differs from 90%.
Problem 2:
a) \( H_0: p = 0.60 \), \( H_a: p > 0.60 \)
b) One-sided (right-tailed) because we're specifically testing if more than 60% support the policy.
c) \( \hat{p} = 325/500 = 0.65 \)

\( z = \frac{0.65 - 0.60}{\sqrt{0.60(0.40)/500}} = \frac{0.05}{0.0219} = 2.28 \)

p-value = P(Z > 2.28) = 0.0113

Since 0.0113 < 0.05, reject \( H_0 \). There is statistically significant evidence that more than 60% of adults support the policy.
Problem 3: Type I error: Concluding the drug is effective when it actually isn't (approving an ineffective drug). This could lead to patients receiving useless treatment.
Type II error: Concluding the drug is not effective when it actually is (rejecting an effective drug). This could deny patients a beneficial treatment.
In drug testing, Type I errors are often considered more dangerous because approving an ineffective drug wastes resources and may have side effects with no benefit.
Problem 4:

\( \hat{p}_A = 24/600 = 0.04 \), \( \hat{p}_B = 42/700 = 0.06 \)

\( \hat{p}_{pool} = (24+42)/(600+700) = 66/1300 = 0.0508 \)

\( z = \frac{0.04 - 0.06}{\sqrt{0.0508(0.9492)(1/600 + 1/700)}} = \frac{-0.02}{0.0123} = -1.63 \)

p-value = 2 × P(Z < -1.63) = 2 × 0.0516 = 0.1032

Since 0.1032 > 0.05, fail to reject \( H_0 \). There is not sufficient evidence of a difference in defect rates between the factories.
Problem 5: The result is statistically significant (p < 0.001), meaning if the pill had no effect, observing 0.3 pounds average weight loss would be extremely rare.
However, the practical significance is questionable. Losing 0.3 pounds is barely noticeable and unlikely to have health benefits. The very small p-value likely results from a large sample size rather than a meaningful effect. Clinically meaningful weight loss is typically considered 5% of body weight. This pill might "work" statistically but isn't practically useful.
You've now mastered the core concepts of hypothesis testing—one of the most important tools in statistical inference. This framework for making decisions based on data applies everywhere: in medicine, business, science, and everyday life.
"Now that's a data point worth collecting!" Sylvia beams. "You've learned to be a data detective, testing claims with evidence rather than just accepting them. Whether you're evaluating a new product, a medical treatment, or even whether a coin is fair, you now have the tools to let the data speak. That's a superpower worth having!"
In the upcoming chapters, we'll extend these ideas to tests involving means and apply hypothesis testing to more complex scenarios. The logic remains the same—only the formulas change.