Statistical Foundations
Summary
This chapter establishes the statistical foundation essential for data science and machine learning. Students will learn descriptive statistics (mean, median, mode, variance, standard deviation), understand distributions and probability, and explore sampling concepts. The chapter covers hypothesis testing, confidence intervals, and measures of association including correlation and covariance. By the end of this chapter, students will be able to summarize datasets statistically, understand the relationship between variables, and make probabilistic inferences.
Concepts Covered
This chapter covers the following 30 concepts from the learning graph:
- Descriptive Statistics
- Mean
- Median
- Mode
- Range
- Variance
- Standard Deviation
- Quartiles
- Percentiles
- Interquartile Range
- Skewness
- Kurtosis
- Distribution
- Normal Distribution
- Probability
- Random Variables
- Expected Value
- Sample
- Population
- Sampling
- Central Limit Theorem
- Confidence Interval
- Hypothesis Testing
- P-Value
- Statistical Significance
- Correlation
- Covariance
- Pearson Correlation
- Spearman Correlation
- Correlation Matrix
Prerequisites
This chapter builds on concepts from:
The Language of Uncertainty
Here's a secret that surprises most people: the world runs on statistics. Every medical treatment you take was proven effective through statistics. Every recommendation Netflix makes uses statistical patterns. Every weather forecast is a statistical prediction. Every poll predicting election results? Statistics.
Statistics is the mathematical language for dealing with uncertainty and variation. And in a world drowning in data, those who speak this language fluently have an incredible advantage.
This chapter gives you that fluency. You'll learn to summarize thousands of data points with a handful of numbers, understand how data is distributed, measure relationships between variables, and make confident statements about populations based on samples. These aren't just academic exercises—they're the core tools that power everything from A/B testing at tech companies to clinical trials for new medicines.
Fair warning: this chapter is packed with concepts. But don't worry—each one builds on the last, and by the end, you'll have a complete statistical toolkit. Let's start with the basics.
Descriptive Statistics: Summarizing Data
Descriptive statistics are numbers that summarize and describe a dataset. Instead of looking at thousands of individual values, descriptive statistics give you the big picture in just a few numbers.
Think of it like this: if someone asks "How tall are the students in your school?", you don't list every student's height. You say something like "The average is 5'7", ranging from 4'11" to 6'4"." That's descriptive statistics in action.
There are two main categories:
- Measures of central tendency: Where is the "center" of the data? (mean, median, mode)
- Measures of spread: How spread out is the data? (range, variance, standard deviation)
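A minimal sketch of how pandas produces this summary, using a small invented dataset of exam scores:

```python
import pandas as pd

# Hypothetical dataset: exam scores for 10 students (invented for illustration)
scores = pd.Series([72, 85, 90, 68, 77, 95, 88, 73, 81, 79])

# describe() reports count, mean, std, min, quartiles, and max in one call
print(scores.describe())
```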
In one command, you get count, mean, standard deviation, min, max, and quartiles. Let's understand each of these.
Measures of Central Tendency
Mean: The Arithmetic Average
The mean is what most people call "the average." Add up all the values and divide by how many there are:
The mean is intuitive and mathematically convenient, but it has a weakness: it's sensitive to outliers.
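A quick sketch with invented salary data (in thousands) shows both the calculation and the outlier problem:

```python
import numpy as np

salaries = np.array([45, 50, 55, 48, 52])  # hypothetical salaries, in thousands
print(np.mean(salaries))  # 50.0

# Add one executive salary and the mean jumps dramatically
with_outlier = np.append(salaries, 500)
print(np.mean(with_outlier))  # 125.0
```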
One extreme value pulled the mean way up. For this reason, we sometimes use the median instead.
Median: The Middle Value
The median is the middle value when data is sorted. Half the values are below it, half are above.
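Using the same invented salary data as above, a sketch of how the median shrugs off the outlier:

```python
import numpy as np

salaries = np.array([45, 50, 55, 48, 52])  # hypothetical salaries, in thousands
print(np.median(salaries))  # 50.0

# The extreme value barely moves the median
with_outlier = np.append(salaries, 500)
print(np.median(with_outlier))  # 51.0
```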
The median is robust to outliers—extreme values don't affect it much. Use median when your data might have outliers or is skewed.
Mode: The Most Common Value
The mode is the value that appears most frequently. It's the only measure of central tendency that works for categorical data.
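A sketch using Python's built-in `statistics` module, with invented categorical data:

```python
from statistics import mode, multimode

# Hypothetical survey of favorite colors (categorical data)
colors = ["red", "blue", "red", "green", "red", "blue"]
most_common = mode(colors)
print(most_common)  # red

# multimode returns every value tied for most frequent
print(multimode([1, 1, 2, 2, 3]))  # [1, 2] (bimodal)
```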
Data can have multiple modes (bimodal, multimodal) or no mode at all if every value appears once.
| Measure | Best For | Sensitive to Outliers? |
|---|---|---|
| Mean | Symmetric data, further calculations | Yes |
| Median | Skewed data, outliers present | No |
| Mode | Categorical data, finding most common | No |
Diagram: Central Tendency Comparison MicroSim
Mean, Median, Mode Interactive Explorer
Type: microsim
Bloom Taxonomy: Understand (L2)
Learning Objective: Help students visualize how mean, median, and mode respond differently to data changes and outliers
Canvas layout (800x500px): - Top (800x300): Interactive histogram with draggable data points - Bottom (800x200): Statistics display and controls
Visual elements: - Histogram showing data distribution - Vertical lines for mean (red), median (green), mode (blue) - Individual data points displayed as draggable circles below histogram - Statistics panel showing current values
Interactive controls: - Draggable data points: Click and drag any point to change its value - "Add Point" button: Add new data point - "Add Outlier" button: Add extreme value - "Remove Point" button: Click to remove - "Reset" button: Return to original dataset - Dropdown: Preset distributions (symmetric, left-skewed, right-skewed, bimodal)
Initial dataset: - 20 points normally distributed around 50
Behavior: - All three measures update in real-time as points are dragged - Visual indication when mean and median diverge significantly - Highlight which measure is "best" for current distribution - Animation when adding outliers to show mean shifting
Educational annotations: - "Notice how the mean moves toward the outlier" - "The median stays stable!" - "Mode shows the peak of the distribution"
Challenge tasks: - "Make the mean equal to the median" - "Create a distribution where mode ≠ median ≠ mean" - "Add an outlier that changes the mean by at least 10"
Visual style: Clean statistical visualization with color-coded measures
Implementation: p5.js with real-time statistical calculations
Measures of Spread
Knowing the center isn't enough—you also need to know how spread out the data is. Two datasets can have the same mean but very different spreads.
Range: Simplest Spread Measure
The range is simply the difference between the maximum and minimum values:
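A sketch with invented values; NumPy even has a one-liner for range, `np.ptp` ("peak to peak"):

```python
import numpy as np

data = np.array([12, 18, 25, 31, 47])  # invented values
print(np.max(data) - np.min(data))  # 35

# Equivalent shortcut: "peak to peak"
print(np.ptp(data))  # 35
```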
Range is easy to understand but has limitations: it only uses two values and is very sensitive to outliers.
Variance: Average Squared Deviation
Variance measures how far each value is from the mean, on average. It squares the deviations (so negatives don't cancel positives):
The problem with variance? The units are squared. If your data is in meters, variance is in meters². That's hard to interpret.
Standard Deviation: The Useful Spread Measure
Standard deviation is the square root of variance, bringing us back to the original units:
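A sketch with invented height data (cm) showing variance's awkward units and standard deviation's fix:

```python
import numpy as np

heights = np.array([160, 165, 170, 175, 180])  # hypothetical heights in cm

print(np.var(heights))  # 50.0 -- in cm squared, hard to interpret
print(np.std(heights))  # ~7.07 -- square root of variance, back in cm
```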
Standard deviation tells you how much values typically deviate from the mean. In a normal distribution:
- ~68% of data falls within 1 standard deviation of the mean
- ~95% falls within 2 standard deviations
- ~99.7% falls within 3 standard deviations
This is called the 68-95-99.7 rule (or empirical rule).
Population vs Sample
When calculating variance or standard deviation for a sample (a subset of data), divide by \(n-1\) instead of \(n\). This is Bessel's correction: it compensates for the bias introduced by estimating the mean from the same data. Use np.var(data, ddof=1) or np.std(data, ddof=1) for sample statistics.
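A sketch comparing the two divisors on the same invented sample:

```python
import numpy as np

sample = np.array([160, 165, 170, 175, 180])  # hypothetical sample of heights

print(np.std(sample, ddof=0))  # population formula: divide by n, ~7.07
print(np.std(sample, ddof=1))  # sample formula: divide by n-1, ~7.91
```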
Quartiles, Percentiles, and IQR
Quartiles: Dividing Data into Fourths
Quartiles divide sorted data into four equal parts:
- Q1 (25th percentile): 25% of data is below this value
- Q2 (50th percentile): The median—50% below, 50% above
- Q3 (75th percentile): 75% of data is below this value
Percentiles: Any Division You Want
Percentiles generalize quartiles—the Pth percentile is the value below which P% of the data falls.
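A sketch computing quartiles and an arbitrary percentile with `np.percentile`, using invented data:

```python
import numpy as np

data = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90])  # invented values

# Quartiles are just the 25th, 50th, and 75th percentiles
q1, q2, q3 = np.percentile(data, [25, 50, 75])
print(q1, q2, q3)  # 30.0 50.0 70.0

# Any percentile works; the 90th percentile interpolates between values
print(np.percentile(data, 90))  # roughly 82
```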
Interquartile Range (IQR)
The interquartile range is the range of the middle 50% of data:
IQR is robust to outliers and is used to detect them: any value below \(Q1 - 1.5 \times IQR\) or above \(Q3 + 1.5 \times IQR\) is considered an outlier.
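A sketch of the IQR outlier rule in code, with one deliberately extreme invented value:

```python
import numpy as np

data = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 300])  # 300 is the outlier

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# The 1.5 x IQR fences
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
print(iqr, outliers)  # 45.0 [300]
```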
Diagram: Box Plot Anatomy
Interactive Box Plot Anatomy
Type: infographic
Bloom Taxonomy: Remember (L1)
Learning Objective: Help students identify and remember the components of a box plot
Purpose: Visual breakdown of box plot structure with labeled components
Layout: Central box plot with callouts pointing to each component
Main visual: A horizontal box plot with sample data showing: - Whisker extending left to minimum (non-outlier) - Box from Q1 to Q3 - Median line inside box - Whisker extending right to maximum (non-outlier) - Two outlier points beyond whiskers
Callouts (numbered with leader lines):
1. MINIMUM (pointing to left whisker end)
    - "Smallest non-outlier value"
    - "= Q1 - 1.5×IQR or actual min, whichever is larger"
    - Color: Blue
2. Q1 / FIRST QUARTILE (pointing to left edge of box)
    - "25% of data below this"
    - "Left edge of box"
    - Color: Green
3. MEDIAN / Q2 (pointing to line inside box)
    - "50% of data below this"
    - "Center line in box"
    - Color: Red
4. Q3 / THIRD QUARTILE (pointing to right edge of box)
    - "75% of data below this"
    - "Right edge of box"
    - Color: Green
5. MAXIMUM (pointing to right whisker end)
    - "Largest non-outlier value"
    - "= Q3 + 1.5×IQR or actual max, whichever is smaller"
    - Color: Blue
6. IQR (bracket spanning the box)
    - "Interquartile Range = Q3 - Q1"
    - "Contains middle 50% of data"
    - Color: Orange
7. OUTLIERS (pointing to dots beyond whiskers)
    - "Values beyond 1.5×IQR from box"
    - "Shown as individual points"
    - Color: Purple
Bottom section: "What box plots tell you at a glance" - Center (median position) - Spread (box width) - Symmetry (median position within box) - Outliers (individual points)
Interactive elements: - Hover over each component to highlight it - Click to see formula or code to calculate - Toggle between horizontal and vertical orientation
Implementation: SVG with CSS hover effects and JavaScript interactivity
Distribution Shape: Skewness and Kurtosis
Skewness: Leaning Left or Right
Skewness measures asymmetry in a distribution:
- Negative skew (left-skewed): Tail extends to the left; mean < median
- Zero skew: Symmetric; mean ≈ median
- Positive skew (right-skewed): Tail extends to the right; mean > median
Real-world examples:
- Right-skewed: Income, house prices, social media followers
- Left-skewed: Age at death in developed countries, exam scores (if test is easy)
Kurtosis: Tails and Peaks
Kurtosis measures the "tailedness" of a distribution—how much data is in the extreme tails versus the center:
- Positive kurtosis (leptokurtic): Heavy tails, sharp peak, more outliers
- Zero kurtosis (mesokurtic): Normal distribution
- Negative kurtosis (platykurtic): Light tails, flat peak, fewer outliers
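A sketch measuring both shape statistics with SciPy, on synthetic data generated for illustration (an exponential sample stands in for right-skewed incomes):

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(42)
incomes = rng.exponential(scale=50_000, size=10_000)  # synthetic, right-skewed
normal = rng.normal(0, 1, size=10_000)                # synthetic, symmetric

s_incomes = skew(incomes)
s_normal = skew(normal)
k_normal = kurtosis(normal)  # SciPy reports excess kurtosis (normal = 0)

print(s_incomes)  # clearly positive: long right tail
print(s_normal)   # near 0: symmetric
print(k_normal)   # near 0: mesokurtic
```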
| Skewness | Distribution Shape | Mean vs Median |
|---|---|---|
| Negative | Left tail longer | Mean < Median |
| Zero | Symmetric | Mean ≈ Median |
| Positive | Right tail longer | Mean > Median |
Understanding Distributions
A distribution describes how values in a dataset are spread across different possible values. It shows the frequency or probability of each value occurring.
The Normal Distribution
The normal distribution (also called Gaussian or bell curve) is the most important distribution in statistics. It appears everywhere:
- Heights of people
- Measurement errors
- Test scores (often)
- Many natural phenomena
Key properties of normal distributions:
- Symmetric around the mean
- Mean = Median = Mode
- Defined by two parameters: mean (μ) and standard deviation (σ)
- The 68-95-99.7 rule applies
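A sketch that generates synthetic IQ-like data (μ=100, σ=15, invented for illustration) and checks the empirical rule directly:

```python
import numpy as np

rng = np.random.default_rng(0)
iq = rng.normal(loc=100, scale=15, size=100_000)  # synthetic normal data

# Fraction of values within 1 and 2 standard deviations of the mean
within_1sd = np.mean(np.abs(iq - 100) < 15)
within_2sd = np.mean(np.abs(iq - 100) < 30)

print(within_1sd)  # close to 0.68
print(within_2sd)  # close to 0.95
```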
Diagram: Normal Distribution Explorer MicroSim
Interactive Normal Distribution Explorer
Type: microsim
Bloom Taxonomy: Apply (L3)
Learning Objective: Let students manipulate mean and standard deviation to understand how they affect the normal distribution shape
Canvas layout (850x550px): - Main area (850x400): Interactive normal distribution plot - Control panel (850x150): Sliders and statistics
Visual elements: - Smooth normal distribution curve - Shaded regions showing 1σ, 2σ, 3σ areas - Vertical line at mean - Axis labels and tick marks - Current μ and σ displayed prominently
Interactive controls: - Slider: Mean (μ) range: 0 to 200, default: 100 - Slider: Standard Deviation (σ) range: 1 to 50, default: 15 - Toggle: Show 68-95-99.7 regions - Toggle: Show probability density values - Button: "Add second distribution" (for comparison) - Dropdown: Preset examples (IQ scores, heights, test scores)
Display panels: - Probability within 1σ: 68.27% - Probability within 2σ: 95.45% - Probability within 3σ: 99.73% - Current curve equation
Behavior: - Curve updates smoothly as sliders move - Shaded regions resize with σ changes - Curve shifts horizontally with μ changes - Comparison mode overlays two distributions
Educational annotations: - "Larger σ = wider, flatter curve" - "Smaller σ = narrower, taller curve" - "μ shifts the center, σ changes the spread"
Challenge tasks: - "Set parameters to match IQ distribution (μ=100, σ=15)" - "What σ makes 95% fall between 60 and 140?" - "Compare two distributions: same mean, different spread"
Visual style: Clean mathematical visualization with Plotly-like aesthetics
Implementation: p5.js or Plotly.js with real-time updates
Probability Fundamentals
Probability is the mathematical framework for quantifying uncertainty. It assigns a number between 0 and 1 to events:
- P = 0: Impossible
- P = 1: Certain
- P = 0.5: Equal chance of happening or not
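A sketch estimating a probability by simulation, here for a fair coin (the simplest case):

```python
import numpy as np

rng = np.random.default_rng(1)
flips = rng.integers(0, 2, size=100_000)  # 0 = tails, 1 = heads

# The long-run fraction of heads estimates P(heads)
p_heads = np.mean(flips)
print(p_heads)  # close to 0.5
```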
Random Variables and Expected Value
A random variable is a variable whose value depends on random outcomes. It can be:
- Discrete: Takes specific values (dice roll: 1, 2, 3, 4, 5, 6)
- Continuous: Takes any value in a range (height: 5.5, 5.51, 5.512...)
The expected value (E[X]) is the long-run average—what you'd expect on average over many repetitions:
The expected value of a fair die is 3.5—you can never actually roll 3.5, but it's the average outcome over time.
Sampling: From Population to Sample
Population vs Sample
A population is the entire group you want to study. A sample is a subset you actually measure.
- Population: All high school students in the US
- Sample: 1,000 randomly selected high school students
Sampling is the process of selecting a sample from a population. Good sampling is crucial—a biased sample leads to wrong conclusions.
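A sketch drawing a random sample from a synthetic population (heights invented for illustration) and comparing the two means:

```python
import numpy as np

rng = np.random.default_rng(3)
population = rng.normal(loc=170, scale=10, size=1_000_000)  # synthetic heights

# Simple random sample of 1,000 individuals, without replacement
sample = rng.choice(population, size=1_000, replace=False)

print(population.mean())  # true population mean, about 170
print(sample.mean())      # estimate: close, but not exactly equal
```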
The sample mean estimates the population mean, but there's always some error. That's where the Central Limit Theorem helps.
The Central Limit Theorem (CLT)
The Central Limit Theorem is one of the most important results in statistics. It says:
When you take many random samples from ANY population and calculate the mean of each sample, those sample means will be approximately normally distributed—regardless of the original population's distribution.
This is magical because:
- It works for any population shape (uniform, skewed, bimodal...)
- The distribution of sample means gets more normal as sample size increases
- It lets us make probability statements about sample means
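A sketch of the CLT in action: the population is deliberately skewed (exponential, synthetic), yet the distribution of sample means centers on the population mean with spread σ/√n:

```python
import numpy as np

rng = np.random.default_rng(4)
# Heavily right-skewed population: nothing like a bell curve
population = rng.exponential(scale=10, size=100_000)

# Take 2,000 samples of size 30 and record each sample's mean
sample_means = np.array([
    rng.choice(population, size=30).mean() for _ in range(2_000)
])

print(population.mean())    # about 10
print(sample_means.mean())  # also about 10
print(sample_means.std())   # about 10 / sqrt(30), i.e. ~1.83 (standard error)
```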
Diagram: Central Limit Theorem Simulator MicroSim
Central Limit Theorem Interactive Demonstration
Type: microsim
Bloom Taxonomy: Analyze (L4)
Learning Objective: Help students understand the CLT by visualizing how sample means become normally distributed regardless of population shape
Canvas layout (900x600px): - Left panel (450x600): Population distribution - Right panel (450x600): Distribution of sample means - Bottom strip (900x100): Controls
Visual elements: - Left: Histogram of original population - Right: Histogram of sample means (builds up over time) - Normal curve overlay on right panel - Running statistics display
Interactive controls: - Dropdown: Population distribution type - Normal - Uniform - Exponential (right-skewed) - Bimodal - Custom (draw your own!) - Slider: Sample size (5, 10, 30, 50, 100) - Button: "Take One Sample" (animated) - Button: "Take 100 Samples" (fast) - Button: "Take 1000 Samples" (bulk) - Button: "Reset" - Slider: Animation speed
Display panels: - Population mean and std - Mean of sample means - Std of sample means (should ≈ σ/√n) - Number of samples taken
Behavior: - "Take One Sample" animates: highlight sample from population, calculate mean, add to right histogram - Sample means histogram builds up gradually - Normal curve overlay adjusts to fit data - Show how larger sample sizes make sample means distribution narrower
Educational annotations: - "Notice: Even though population is [skewed/uniform], sample means are normal!" - "Larger samples → narrower distribution of means" - "Standard error = σ/√n"
Challenge tasks: - "Which sample size makes sample means most normal?" - "Predict the std of sample means for n=100" - "Try the most extreme distribution—CLT still works!"
Visual style: Side-by-side comparison with animation
Implementation: p5.js with smooth animations and Plotly for histograms
Confidence Intervals: Quantifying Uncertainty
A confidence interval gives a range that likely contains the true population parameter. Instead of saying "the average is 75," you say "I'm 95% confident the average is between 72 and 78."
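A sketch computing a 95% confidence interval with SciPy's t distribution, on a synthetic sample of scores (invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
scores = rng.normal(loc=75, scale=8, size=50)  # synthetic sample of 50 scores

mean = scores.mean()
sem = stats.sem(scores)  # standard error of the mean: s / sqrt(n)

# t-based interval, appropriate when the population std is unknown
ci = stats.t.interval(0.95, df=len(scores) - 1, loc=mean, scale=sem)
print(mean, ci)
```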
The interpretation: If we repeated this sampling process many times, 95% of the confidence intervals we calculate would contain the true population mean.
Common Misconception
A 95% confidence interval does NOT mean "there's a 95% probability the true value is in this range." The true value either is or isn't in the range—we just don't know which. The 95% refers to the long-run success rate of the method.
Hypothesis Testing: Making Decisions with Data
Hypothesis testing is a framework for making decisions based on data. You start with a hypothesis and use data to evaluate whether the evidence supports it.
The Process
- State the null hypothesis (H₀): The default assumption (usually "no effect" or "no difference")
- State the alternative hypothesis (H₁): What you're testing for
- Collect data and calculate a test statistic
- Calculate the p-value
- Make a decision based on significance level
P-Value: The Evidence Measure
The p-value is the probability of seeing results at least as extreme as yours, assuming the null hypothesis is true.
- Small p-value (< 0.05): Evidence against H₀; reject it
- Large p-value (≥ 0.05): Not enough evidence; fail to reject H₀
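A sketch of a two-sample t-test on fabricated data: two synthetic groups of exam scores, one shifted upward to mimic a treatment effect:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
# Synthetic groups: scores without and with a new study method
control = rng.normal(loc=70, scale=10, size=40)
treatment = rng.normal(loc=78, scale=10, size=40)

# H0: the two groups have the same mean
t_stat, p_value = stats.ttest_ind(treatment, control)
print(t_stat, p_value)

if p_value < 0.05:
    print("Reject H0: the difference is statistically significant")
else:
    print("Fail to reject H0: insufficient evidence")
```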
Statistical Significance
Statistical significance means the p-value is below a predetermined threshold (usually 0.05). It indicates the result is unlikely to have occurred by chance alone.
| P-value | Interpretation |
|---|---|
| < 0.001 | Very strong evidence against H₀ |
| < 0.01 | Strong evidence against H₀ |
| < 0.05 | Evidence against H₀ (significant) |
| ≥ 0.05 | Insufficient evidence against H₀ |
Statistical vs Practical Significance
A result can be statistically significant but practically meaningless. If a drug reduces blood pressure by 0.1 mmHg and it's significant with p < 0.001, so what? That's too small to matter clinically. Always consider effect size, not just p-values.
Diagram: Hypothesis Testing Workflow
Hypothesis Testing Decision Flowchart
Type: workflow
Bloom Taxonomy: Apply (L3)
Learning Objective: Guide students through the hypothesis testing process step by step
Purpose: Visual decision tree for conducting hypothesis tests
Visual style: Vertical flowchart with decision diamonds and process rectangles
Steps (top to bottom):
1. START: "Research Question"
    - Hover text: "What are you trying to determine?"
    - Color: Blue
2. PROCESS: "State Hypotheses"
    - H₀: Null hypothesis (no effect/difference)
    - H₁: Alternative hypothesis (effect exists)
    - Hover text: "H₀ is what you're trying to disprove"
    - Color: Green
3. PROCESS: "Choose Significance Level (α)"
    - Usually α = 0.05
    - Hover text: "This is your threshold for 'unlikely'"
    - Color: Green
4. PROCESS: "Collect Data & Calculate Test Statistic"
    - Hover text: "t-test, chi-square, etc. depending on your data"
    - Color: Green
5. PROCESS: "Calculate P-value"
    - Hover text: "Probability of seeing this result if H₀ is true"
    - Color: Orange
6. DECISION: "Is p-value < α?"
    - Color: Yellow
7. Two paths:
    - 7a. YES PATH: "Reject H₀"
        - "Results are statistically significant"
        - "Evidence supports H₁"
        - Hover text: "But also check effect size!"
        - Color: Red
    - 7b. NO PATH: "Fail to Reject H₀"
        - "Results are not statistically significant"
        - "Insufficient evidence for H₁"
        - Hover text: "This doesn't prove H₀ is true!"
        - Color: Gray
8. END: "Report Results"
    - Include: test statistic, p-value, effect size, confidence interval
    - Color: Blue
Side annotations: - "Type I Error (α): Rejecting H₀ when it's actually true" - "Type II Error (β): Failing to reject H₀ when it's actually false"
Interactive elements: - Hover over each step for detailed explanation - Click to see Python code for that step - Example problems that walk through the flowchart
Implementation: SVG with JavaScript interactivity
Correlation: Measuring Relationships
Correlation measures the strength and direction of a linear relationship between two variables. It ranges from -1 to +1:
- +1: Perfect positive correlation (as X increases, Y increases)
- 0: No linear correlation
- -1: Perfect negative correlation (as X increases, Y decreases)
Covariance: The Building Block
Covariance measures how two variables change together:
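A sketch using `np.cov` on invented study data (note NumPy returns a 2×2 covariance matrix; the off-diagonal entry is the covariance between the variables):

```python
import numpy as np

# Hypothetical data: hours studied vs exam score
hours_studied = np.array([1, 2, 3, 4, 5])
exam_score = np.array([55, 60, 70, 75, 85])

cov_matrix = np.cov(hours_studied, exam_score)  # uses n-1 by default
print(cov_matrix[0, 1])  # 18.75: positive, so the variables rise together
```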
The problem with covariance: it's affected by the scale of the variables. Covariance between height in inches and weight in pounds will be different from height in centimeters and weight in kilograms.
Pearson Correlation: The Standard Measure
Pearson correlation standardizes covariance to a -1 to +1 scale:
Pearson correlation assumes:
- Linear relationship
- Both variables are continuous
- Data is normally distributed (roughly)
Spearman Correlation: For Non-Linear Relationships
Spearman correlation uses ranks instead of raw values, making it robust to:
- Non-linear relationships (as long as monotonic)
- Outliers
- Non-normal distributions
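A sketch contrasting the two measures on a deliberately non-linear but perfectly monotonic relationship (y = x³, constructed for illustration):

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = x ** 3  # non-linear, but strictly increasing

pearson_r, _ = stats.pearsonr(x, y)
spearman_r, _ = stats.spearmanr(x, y)

print(pearson_r)   # high, but less than 1: the relationship is not linear
print(spearman_r)  # 1.0: the ranks match perfectly
```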
| Measure | Measures | Assumptions | Best For |
|---|---|---|---|
| Pearson | Linear relationship | Normal, continuous | Linear relationships |
| Spearman | Monotonic relationship | Ordinal or continuous | Non-linear, ordinal data |
Correlation Matrix: Many Variables at Once
A correlation matrix shows correlations between all pairs of variables:
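A sketch building a correlation matrix with pandas on a synthetic dataset (column names and the underlying relationship are invented for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
hours = rng.uniform(0, 10, size=100)

df = pd.DataFrame({
    "hours_studied": hours,
    "exam_score": 50 + 4 * hours + rng.normal(0, 5, size=100),  # related
    "hours_slept": rng.uniform(5, 9, size=100),                 # unrelated
})

# Every pairwise Pearson correlation at once; the diagonal is always 1
corr = df.corr()
print(corr.round(2))
```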
Diagram: Correlation Visualizer MicroSim
Interactive Correlation Explorer
Type: microsim
Bloom Taxonomy: Analyze (L4)
Learning Objective: Help students understand correlation through interactive visualization of scatter plots with different correlation strengths
Canvas layout (850x550px): - Main area (550x500): Interactive scatter plot - Right panel (300x500): Controls and statistics - Bottom strip (850x50): Correlation strength indicator
Visual elements: - Scatter plot with data points - Best-fit line (toggleable) - Correlation coefficient displayed prominently - Correlation strength meter (-1 to +1 scale)
Interactive controls: - Slider: Target correlation (-1.0 to +1.0) - Button: "Generate Data" with current correlation - Slider: Number of points (20-200) - Slider: Noise level - Toggle: Show regression line - Toggle: Show confidence band - Dropdown: Preset examples (perfect positive, perfect negative, no correlation, moderate) - Draggable points: Move individual points to see effect
Display panels: - Pearson r - Spearman r - P-value - R² (coefficient of determination) - Sample size
Behavior: - Adjusting correlation slider regenerates data with target correlation - Dragging individual points updates all statistics in real-time - Adding outliers shows how they affect Pearson vs Spearman - Noise slider shows how correlation degrades with noise
Educational annotations: - "r = 0.8 means strong positive relationship" - "Notice Spearman handles the outlier better!" - "R² = 0.64 means 64% of variance in Y is explained by X"
Challenge tasks: - "Create data with r ≈ 0 but clear pattern (try a curve!)" - "Add an outlier that changes r by at least 0.2" - "Find the minimum sample size for statistical significance"
Visual style: Clean Plotly-like scatter plot with interactive elements
Implementation: p5.js with statistical calculations
Correlation ≠ Causation
The most important rule in statistics: correlation does not imply causation. Ice cream sales and drowning deaths are correlated (both increase in summer), but ice cream doesn't cause drowning. Always consider confounding variables and look for experimental evidence before claiming causation.
Putting Statistics into Practice
Let's combine everything in a real analysis:
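A condensed sketch of such a pipeline on a synthetic study-habits dataset (all data, column names, and the hypothesis are invented for illustration):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic dataset: 200 students, hours studied vs exam score
rng = np.random.default_rng(8)
n = 200
hours = rng.uniform(0, 10, size=n)
data = pd.DataFrame({
    "hours_studied": hours,
    "exam_score": np.clip(50 + 4 * hours + rng.normal(0, 8, size=n), 0, 100),
})

# 1. Summarize with descriptive statistics
print(data.describe())

# 2. Visualization (histograms, box plots, scatter plots) would go here

# 3. Explore relationships with correlations
r, r_p = stats.pearsonr(data["hours_studied"], data["exam_score"])
print(f"Pearson r = {r:.2f} (p = {r_p:.4f})")

# 4. Test a hypothesis: do students who study 5+ hours score higher?
high = data.loc[data["hours_studied"] >= 5, "exam_score"]
low = data.loc[data["hours_studied"] < 5, "exam_score"]
t_stat, p_value = stats.ttest_ind(high, low)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# 5. Quantify uncertainty: 95% CI for the mean exam score
mean = data["exam_score"].mean()
sem = stats.sem(data["exam_score"])
ci = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)
print(f"Mean score = {mean:.1f}, 95% CI = ({ci[0]:.1f}, {ci[1]:.1f})")
```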
This workflow demonstrates the complete statistical analysis pipeline:
- Summarize with descriptive statistics
- Visualize distributions
- Explore relationships with correlations
- Test hypotheses about relationships
- Quantify uncertainty with confidence intervals
Chapter 6 Checkpoint: Test Your Understanding
Question 1: A dataset has mean = 50 and median = 65. What can you infer about the distribution?
Question 2: You test whether a new teaching method improves scores. The p-value is 0.03. What do you conclude at α = 0.05?
Question 3: Two variables have Pearson r = 0.85 and Spearman r = 0.60. What might explain this difference?
Click to reveal answers:
Answer 1: The distribution is left-skewed (negative skew). When mean < median, the tail extends to the left, pulling the mean down.
Answer 2: Since p = 0.03 < α = 0.05, you reject the null hypothesis. There is statistically significant evidence that the new teaching method affects scores. But check the effect size to see if the improvement is practically meaningful!
Answer 3: The relationship is likely non-linear. Pearson measures linear correlation (strong here), but Spearman measures monotonic correlation (weaker). There might be outliers affecting Spearman, or the relationship curves rather than being perfectly monotonic.
Achievement Unlocked: Statistical Thinker
You now speak the language of uncertainty. You can summarize data with the right measures, understand distributions, measure relationships, and make probabilistic inferences. These skills separate people who "look at data" from people who truly understand what data is telling them.
Key Takeaways
- Descriptive statistics summarize data: measures of central tendency (mean, median, mode) and spread (range, variance, standard deviation).
- Mean is sensitive to outliers; median is robust. Choose based on your data.
- Standard deviation measures typical distance from the mean; use the 68-95-99.7 rule for normal distributions.
- Quartiles and IQR divide data into parts and help identify outliers.
- Skewness measures asymmetry; kurtosis measures tail heaviness.
- The normal distribution is central to statistics—the bell curve appears everywhere.
- Probability quantifies uncertainty; expected value is the long-run average.
- Samples estimate population parameters; sampling must be done carefully to avoid bias.
- The Central Limit Theorem says sample means are approximately normal, regardless of population shape.
- Confidence intervals quantify uncertainty about estimates; p-values measure evidence against hypotheses.
- Correlation measures relationship strength; Pearson for linear, Spearman for monotonic. Remember: correlation ≠ causation!
- A correlation matrix shows all pairwise relationships at once.
You've built a solid statistical foundation. In the next chapter, you'll use these concepts to build your first predictive model with linear regression—where the statistical concepts you just learned become the engine for making predictions!