Simple Linear Regression
Summary
This chapter introduces regression analysis, the foundation of predictive modeling. Students will learn the mathematics behind linear regression, including the least squares method, interpreting coefficients (slope and intercept), and understanding residuals. The chapter covers regression assumptions and teaches students to implement linear regression using scikit-learn. By the end of this chapter, students will be able to build simple linear regression models, interpret their outputs, and make predictions.
Concepts Covered
This chapter covers the following 25 concepts from the learning graph:
- Regression Analysis
- Linear Regression
- Simple Linear Regression
- Regression Line
- Slope
- Intercept
- Least Squares Method
- Residuals
- Sum of Squared Errors
- Ordinary Least Squares
- Regression Coefficients
- Coefficient Interpretation
- Prediction
- Fitted Values
- Regression Equation
- Line of Best Fit
- Assumptions of Regression
- Linearity Assumption
- Homoscedasticity
- Independence Assumption
- Normality of Residuals
- Scikit-learn Library
- LinearRegression Class
- Fit Method
- Predict Method
Prerequisites
This chapter builds on concepts from:
From Description to Prediction
Everything you've learned so far has been about understanding data that already exists. Descriptive statistics summarize the past. Visualizations reveal patterns in historical data. Correlation shows relationships between variables.
But here's where data science gets really exciting: prediction.
What if, instead of just describing what happened, you could predict what will happen? What if you could look at a student's study hours and predict their exam score? Or see a house's square footage and estimate its price? Or know a car's age and forecast its fuel efficiency?
This is the superpower of regression analysis—the ability to draw a line through data that extends into the unknown future. It's the foundation of machine learning, the backbone of forecasting, and your first step into predictive modeling.
In this chapter, you'll learn to build your first predictive model. It's surprisingly simple—just a line—but don't let that fool you. This humble line is one of the most powerful tools in all of data science.
What is Regression Analysis?
Regression analysis is a statistical method for modeling the relationship between variables. It lets you:
- Understand how one variable affects another
- Quantify the strength of that relationship
- Predict values you haven't observed
The term "regression" has a historical origin. In the 1880s, Francis Galton studied the heights of parents and children. He noticed that very tall parents tended to have children shorter than themselves, and very short parents had taller children. Heights "regressed" toward the average. The name stuck, even though modern regression is used for much more than studying heights.
Linear Regression: The Straight-Line Model
Linear regression is the simplest form of regression—it assumes the relationship between variables is a straight line. Despite its simplicity, linear regression is:
- Easy to understand and interpret
- Fast to compute
- Surprisingly effective for many real problems
- The foundation for more complex models
When you have one input variable predicting one output variable, it's called simple linear regression. That's what we'll master in this chapter.
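Here's roughly what that looks like in code. A minimal sketch, assuming a small made-up dataset of study hours and exam scores (Plotly's `trendline="ols"` option needs the statsmodels package installed):

```python
import pandas as pd
import plotly.express as px

# Hypothetical data: hours studied and the resulting exam score
df = pd.DataFrame({
    "hours_studied": [1, 2, 2.5, 3, 4, 5, 6, 7, 8],
    "exam_score":    [52, 59, 60, 64, 70, 74, 81, 85, 91]
})

# Scatter plot with an ordinary least squares trendline through the points
# (trendline="ols" requires the statsmodels package)
fig = px.scatter(df, x="hours_studied", y="exam_score",
                 trendline="ols", title="Study Hours vs. Exam Score")
fig.show()
```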
See that line Plotly drew through the data? That's a regression line—your first predictive model! With it, you can predict the exam score for someone who studied 4.5 hours, even though you don't have that exact data point.
The Regression Equation
Every straight line can be described by an equation. You probably remember the form from algebra:

$$y = mx + b$$

In statistics, we write the regression equation as:

$$\hat{y} = \beta_0 + \beta_1 x$$

Where:
- \(\hat{y}\) (y-hat) = the predicted value
- \(x\) = the input variable (predictor, independent variable)
- \(\beta_0\) = the intercept (where the line crosses the y-axis)
- \(\beta_1\) = the slope (how much y changes for each unit increase in x)
The \(\beta\) values are called regression coefficients—they define your model.
Understanding Slope
The slope (\(\beta_1\)) tells you the rate of change: for every one-unit increase in x, how much does y change?
- Positive slope: As x increases, y increases (uphill line)
- Negative slope: As x increases, y decreases (downhill line)
- Zero slope: x has no effect on y (horizontal line)
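A tiny sketch of what the three cases mean numerically, using made-up slope values and a fixed hypothetical intercept:

```python
intercept = 60  # hypothetical intercept, held fixed

# Compare predictions as x goes from 1 to 2 under three hypothetical slopes
for slope in (5, -5, 0):
    y1, y2 = intercept + slope * 1, intercept + slope * 2
    print(f"slope {slope:+}: prediction moves from {y1} to {y2}")
```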
Understanding Intercept
The intercept (\(\beta_0\)) is the predicted value when x = 0. It's where the line crosses the y-axis.
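And a tiny sketch showing that plugging x = 0 into the equation leaves only the intercept (coefficients are hypothetical):

```python
intercept, slope = 47.0, 5.5   # hypothetical coefficients

# The prediction at x = 0 is just the intercept
print(intercept + slope * 0)   # 47.0
```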
Intercept Interpretation Caution
The intercept doesn't always have a meaningful interpretation. If x = 0 is outside your data range (like predicting house price for 0 square feet), don't interpret the intercept literally—it's just a mathematical necessity for the line equation.
Diagram: Regression Line Anatomy
Interactive Regression Line Components
Type: infographic
Bloom Taxonomy: Remember (L1)
Learning Objective: Help students identify and remember the components of a regression line and equation
Purpose: Visual breakdown of regression line with labeled components
Layout: Scatter plot with regression line and labeled callouts
Main visual: Scatter plot (600x400px) showing: - 10-15 data points with clear linear trend - Regression line through points - Y-axis intercept clearly marked - Rise and run triangle showing slope
Callouts (numbered with leader lines):
1. INTERCEPT (β₀) (pointing to y-axis crossing)
    - "Where line crosses y-axis"
    - "Predicted y when x = 0"
    - "In equation: the constant term"
    - Color: Blue
2. SLOPE (β₁) (pointing to rise/run triangle)
    - "Rise over run"
    - "Change in y per unit change in x"
    - "Positive = uphill, Negative = downhill"
    - Shows: Δy / Δx calculation
    - Color: Red
3. PREDICTED VALUE (ŷ) (pointing to a point on the line)
    - "Value predicted by the model"
    - "Falls exactly on the line"
    - "ŷ = β₀ + β₁x"
    - Color: Green
4. ACTUAL VALUE (y) (pointing to a data point off the line)
    - "Real observed value"
    - "Usually not exactly on line"
    - Color: Orange
5. RESIDUAL (pointing to vertical line between actual and predicted)
    - "Distance from actual to predicted"
    - "Residual = y - ŷ"
    - "What the model got wrong"
    - Color: Purple

Bottom equation display: ŷ = β₀ + β₁x, with arrows pointing to each component in the equation
Interactive elements: - Hover over each component for detailed explanation - Click to highlight related elements - Toggle to show/hide residuals for all points
Implementation: SVG with JavaScript interactivity
Finding the Best Line: Least Squares Method
There are infinitely many lines you could draw through a scatter plot. So how do we find the best one? We use the least squares method.
Residuals: Measuring Errors
A residual is the difference between an actual observed value and the value predicted by the model:

$$e_i = y_i - \hat{y}_i$$
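A minimal sketch of computing residuals by hand, assuming made-up actual scores and model predictions for the same four students:

```python
import numpy as np

# Hypothetical actual exam scores and the model's predictions for the same students
actual    = np.array([52, 65, 78, 88])
predicted = np.array([55, 63, 80, 85])

# Residual = actual - predicted, one value per observation
residuals = actual - predicted
print(residuals)  # [-3  2 -2  3]
```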
Residuals tell us how wrong our predictions are:
- Positive residual: Model under-predicted (actual > predicted)
- Negative residual: Model over-predicted (actual < predicted)
- Zero residual: Perfect prediction (actual = predicted)
Sum of Squared Errors (SSE)
To find the best line, we want to minimize total error. But we can't just add up the residuals, because positive and negative errors would cancel out! Instead, we square them first:

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

This is the sum of squared errors (also called the sum of squared residuals).
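A sketch of the calculation, reusing the hypothetical residuals from above:

```python
import numpy as np

# Residuals from the previous sketch
residuals = np.array([-3, 2, -2, 3])

# Square each residual so positive and negative errors can't cancel, then total them
sse = np.sum(residuals ** 2)
print(sse)  # 9 + 4 + 4 + 9 = 26
```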
Ordinary Least Squares (OLS)
Ordinary Least Squares (OLS) is the method that finds the line minimizing SSE. It's the standard algorithm for linear regression.
The math gives us formulas for the optimal coefficients:

$$\beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \qquad \beta_0 = \bar{y} - \beta_1 \bar{x}$$
Don't worry about memorizing these—Python will calculate them for you. The important thing is understanding the concept: OLS finds the line that makes the squared prediction errors as small as possible.
Diagram: Least Squares MicroSim
Interactive Least Squares Line Fitting
Type: microsim
Bloom Taxonomy: Understand (L2)
Learning Objective: Help students understand how the least squares method finds the best-fit line by minimizing squared errors
Canvas layout (900x600px): - Main area (650x550): Interactive scatter plot with adjustable line - Right panel (250x550): Controls and error display - Bottom strip (900x50): SSE meter
Visual elements: - Scatter plot with 8-12 data points - Adjustable regression line (can drag slope and intercept) - Vertical lines from points to line showing residuals - Squares drawn at each residual (area = squared error) - Running SSE total displayed prominently
Interactive controls: - Draggable line: Adjust slope by rotating, intercept by vertical drag - Slider: Slope (-5 to +5) - Slider: Intercept (0 to 100) - Button: "Show Optimal Line" - animates to best fit - Button: "Reset" - return to initial position - Toggle: Show/hide residual squares - Toggle: Show/hide residual values
Display panels: - Current slope and intercept - Current SSE - Optimal SSE (shown after clicking "Show Optimal") - Percentage improvement from current to optimal
SSE Meter (bottom): - Visual bar showing current SSE - Marker showing optimal SSE - Color gradient: red (high error) → green (low error)
Behavior: - As line is adjusted, SSE updates in real-time - Residual squares resize dynamically - "Show Optimal Line" smoothly animates to least squares solution - Highlight when current SSE is close to optimal
Educational annotations: - "Each square's area = squared error for that point" - "Total area of all squares = SSE" - "OLS minimizes this total area"
Challenge tasks: - "Can you get SSE below 50?" - "Find a line where all residuals are positive" - "Match the optimal line within 5% SSE"
Visual style: Clean mathematical visualization
Implementation: p5.js with real-time calculations
The Line of Best Fit
The line of best fit (also called the regression line or trend line) is the line that minimizes SSE. It's the "best" line in the sense that no other straight line would have smaller total squared errors.
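A minimal sketch of computing the line of best fit directly from the OLS formulas above, on the same made-up study-hours data (NumPy's `polyfit` would give the same answer):

```python
import numpy as np

# Hypothetical data: hours studied and exam scores
x = np.array([1, 2, 2.5, 3, 4, 5, 6, 7, 8])
y = np.array([52, 59, 60, 64, 70, 74, 81, 85, 91])

# OLS formulas: slope from how x and y vary together around their means,
# intercept chosen so the line passes through (x̄, ȳ)
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

print(f"Line of best fit: ŷ = {intercept:.1f} + {slope:.1f}x")  # roughly 47.2 + 5.5x
```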
Properties of the line of best fit:
- It always passes through the point \((\bar{x}, \bar{y})\) (the means)
- The sum of residuals equals zero (positive and negative cancel)
- It minimizes SSE among all possible straight lines
Fitted Values and Predictions
Fitted values are the predictions your model makes for the data points you used to build it. They're the y-values on the regression line at each x in your training data.
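A sketch, assuming hypothetical coefficients close to what the earlier calculation produces:

```python
import numpy as np

# Hypothetical coefficients and the training x values from the earlier sketch
intercept, slope = 47.2, 5.5
x_train = np.array([1, 2, 2.5, 3, 4, 5, 6, 7, 8])

# Fitted values: the model's predictions for the very data it was trained on
fitted_values = intercept + slope * x_train
print(fitted_values)
```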
Prediction uses the regression equation to estimate y for new x values—values you haven't observed yet.
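A matching sketch for a brand-new x value, using the same hypothetical coefficients:

```python
# Hypothetical coefficients from the earlier sketch
intercept, slope = 47.2, 5.5

# Predict the exam score for someone who studied 4.5 hours (not in the training data)
new_hours = 4.5
predicted_score = intercept + slope * new_hours
print(f"Predicted score for {new_hours} hours of study: {predicted_score:.1f}")
```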
Extrapolation Warning
Be careful predicting far outside your data range! If your data goes from 1-8 hours, predicting for 20 hours is risky. The linear relationship might not hold for extreme values. This is called extrapolation and can lead to unreliable predictions.
Interpreting Regression Coefficients
Coefficient interpretation is crucial—it's how you extract meaning from your model.
Interpreting the Slope
The slope tells you the effect size: how much y changes per unit change in x.
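A small sketch of turning a slope into a plain-English statement, using a hypothetical fitted slope of 5.5 points per hour:

```python
slope = 5.5  # hypothetical fitted slope: exam points gained per extra hour of study

print(f"Each additional hour of studying raises the predicted score by {slope} points.")
print(f"Two extra hours? The predicted score rises by {2 * slope} points.")
```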
The slope also tells you direction:
- Positive slope (5.5): More studying → higher scores (positive relationship)
- If slope were negative: More of x → less of y (inverse relationship)
Interpreting the Intercept
The intercept is the predicted value when x = 0.
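A matching sketch for the intercept, again with a hypothetical value:

```python
intercept = 47.2  # hypothetical fitted intercept

print(f"A student who studies 0 hours has a predicted score of {intercept}.")
```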
But context matters! Does x = 0 make sense?
| Scenario | x = 0 Meaningful? | Intercept Interpretation |
|---|---|---|
| Study hours → Score | Maybe | Baseline score without studying |
| House sq ft → Price | No | Price of 0 sq ft house? Nonsense! |
| Age → Height (children) | No | Height at age 0? (birth height, maybe) |
| Temperature → Ice cream sales | Maybe | Sales at 0°F (very cold!) |
Quick reference:

| Coefficient | Symbol | Interpretation |
|---|---|---|
| Slope | β₁ | Change in y per unit change in x |
| Intercept | β₀ | Predicted y when x = 0 |
Assumptions of Regression
For linear regression to give reliable results, certain assumptions should hold. Think of these as the "fine print" of your model.
1. Linearity Assumption
The linearity assumption requires that the relationship between x and y is actually linear (a straight line fits well).
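A quick way to check is simply to plot the data and look. A sketch on the made-up study-hours data:

```python
import pandas as pd
import plotly.express as px

# Hypothetical data: if the relationship is linear, the points should hug the trendline
df = pd.DataFrame({
    "hours_studied": [1, 2, 2.5, 3, 4, 5, 6, 7, 8],
    "exam_score":    [52, 59, 60, 64, 70, 74, 81, 85, 91]
})

# Eyeball test: obvious curvature in this plot means the linearity assumption fails
fig = px.scatter(df, x="hours_studied", y="exam_score", trendline="ols")
fig.show()
```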
If the relationship is curved, linear regression will give poor predictions. You'd need polynomial regression or other techniques.
2. Independence Assumption
The independence assumption requires that observations are independent of each other. One data point shouldn't affect another.
Violations occur when:
- Time series data (today's value depends on yesterday's)
- Clustered data (students in same class aren't independent)
- Repeated measurements on same subjects
3. Homoscedasticity
Homoscedasticity (homo = same, scedasticity = scatter) means the spread of residuals is constant across all x values.
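The standard check is a residuals-versus-fitted plot. A sketch with hypothetical fitted values and residuals:

```python
import numpy as np
import plotly.express as px

# Hypothetical fitted values and residuals from a trained model
fitted    = np.array([52.7, 58.2, 61.0, 63.7, 69.2, 74.7, 80.2, 85.7, 91.2])
residuals = np.array([-0.7,  0.8, -1.0,  0.3,  0.8, -0.7,  0.8, -0.7, -0.2])

# Residuals vs. fitted values: look for an even horizontal band around zero
fig = px.scatter(x=fitted, y=residuals,
                 labels={"x": "Fitted values", "y": "Residuals"})
fig.add_hline(y=0, line_dash="dash")
fig.show()
```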
- Good: Residuals form a random horizontal band around zero
- Bad: Residuals fan out (spread increases with x) = heteroscedasticity
4. Normality of Residuals
The normality of residuals assumption requires that residuals follow a normal distribution. This matters for confidence intervals and hypothesis tests.
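A sketch of checking this with a histogram of hypothetical residuals:

```python
import numpy as np
import plotly.express as px

# Hypothetical residuals from a fitted model
residuals = np.array([-3.1, -2.0, -1.2, -0.5, 0.1, 0.4, 1.0, 1.8, 2.6, 3.0])

# A roughly bell-shaped, zero-centered histogram supports the normality assumption
fig = px.histogram(x=residuals, nbins=8, labels={"x": "Residual"})
fig.show()
```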
For small samples, use a Q-Q plot:
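A sketch using SciPy's `probplot` to get the quantile pairs, with the same hypothetical residuals:

```python
import numpy as np
from scipy import stats
import plotly.express as px

# Hypothetical residuals from a fitted model
residuals = np.array([-3.1, -2.0, -1.2, -0.5, 0.1, 0.4, 1.0, 1.8, 2.6, 3.0])

# probplot pairs theoretical normal quantiles with the sorted residuals
(theoretical_q, ordered_residuals), _ = stats.probplot(residuals, dist="norm")

# If the residuals are normal, these points fall close to a straight diagonal line
fig = px.scatter(x=theoretical_q, y=ordered_residuals,
                 labels={"x": "Theoretical quantiles", "y": "Sample quantiles"})
fig.show()
```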
Diagram: Regression Assumptions Checker MicroSim
Interactive Assumption Diagnostic Tool
Type: microsim
Bloom Taxonomy: Analyze (L4)
Learning Objective: Help students diagnose regression assumption violations through interactive visualizations
Canvas layout (900x650px): - Top left (450x300): Original scatter plot with regression line - Top right (450x300): Residual vs fitted plot - Bottom left (450x300): Residual histogram - Bottom right (450x300): Q-Q plot of residuals
Visual elements: - All four diagnostic plots update together - Traffic light indicators (green/yellow/red) for each assumption - Assumption status panel
Interactive controls: - Dropdown: Dataset selector - "Good Data" (all assumptions met) - "Non-linear" (curved relationship) - "Heteroscedastic" (fan-shaped residuals) - "Non-normal residuals" (skewed errors) - "Outliers present" - "Custom" (add/drag points) - Button: "Diagnose" - highlights violations - Toggle: Show/hide assumption guidelines - Draggable points in custom mode
Assumption indicators:

1. LINEARITY
    - Green: Points follow line well
    - Yellow: Slight curvature
    - Red: Clear non-linear pattern
2. INDEPENDENCE
    - Note: "Cannot diagnose from plot alone"
    - Checkbox: "Data is from independent observations"
3. HOMOSCEDASTICITY
    - Green: Constant spread in residual plot
    - Yellow: Slight fanning
    - Red: Clear funnel shape
4. NORMALITY
    - Green: Histogram bell-shaped, Q-Q on line
    - Yellow: Slight deviation
    - Red: Clear non-normality
Behavior: - Selecting dataset updates all four plots - Traffic lights update based on diagnostic rules - Tooltips explain what each violation means - "Diagnose" button highlights specific problem areas
Educational annotations: - "Look for patterns in the residual plot" - "Points should follow the diagonal in Q-Q plot" - "Residuals should be roughly bell-shaped"
Visual style: Dashboard layout with coordinated plots
Implementation: p5.js with Plotly.js for statistical plots
| Assumption | What to Check | Good Sign | Bad Sign |
|---|---|---|---|
| Linearity | Scatter plot | Points follow line | Curved pattern |
| Independence | Study design | Random sampling | Clustered/time data |
| Homoscedasticity | Residual plot | Even spread | Fan/funnel shape |
| Normality | Histogram/Q-Q | Bell curve, diagonal line | Skewed, curved Q-Q |
When Assumptions Are Violated
Minor violations often don't matter much. Linear regression is fairly robust. But serious violations require action: transform variables, use robust regression, or try different models. Always check assumptions!
Implementing Linear Regression with Scikit-learn
Now let's build regression models the professional way using the scikit-learn library (also called sklearn). It's the most popular machine learning library in Python.
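A minimal setup sketch (the install command assumes a pip-based environment):

```python
# Install once from the command line (assumes a pip-based environment):
#   pip install scikit-learn

# The package installs as "scikit-learn" but is imported as "sklearn"
import sklearn
print(sklearn.__version__)
```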
The LinearRegression Class
The LinearRegression class is scikit-learn's implementation of ordinary least squares regression.
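Creating a model object takes one line; a minimal sketch:

```python
from sklearn.linear_model import LinearRegression

# Create an (untrained) model object; the defaults fit an intercept using OLS
model = LinearRegression()
print(model)  # LinearRegression()
```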
The Fit Method
The fit method trains the model—it calculates the optimal coefficients from your data.
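A sketch of fitting on the hypothetical study-hours data (note that X uses double brackets so it stays 2D):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical training data
df = pd.DataFrame({
    "hours_studied": [1, 2, 2.5, 3, 4, 5, 6, 7, 8],
    "exam_score":    [52, 59, 60, 64, 70, 74, 81, 85, 91]
})

X = df[["hours_studied"]]   # 2D: note the double brackets
y = df["exam_score"]        # 1D

model = LinearRegression()
model.fit(X, y)             # learns the coefficients from the data

print(model.intercept_)     # β₀, the intercept
print(model.coef_)          # β₁, the slope (one value per feature)
```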
The `.fit()` method:

- Takes X (features) and y (target)
- Calculates optimal coefficients using OLS
- Stores them in `model.intercept_` and `model.coef_`
- Returns the model object (for method chaining)
The Predict Method
The predict method uses the trained model to make predictions.
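A sketch, refitting the same hypothetical model and then predicting for study times it has never seen:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Refit the same hypothetical model as in the previous sketch
df = pd.DataFrame({
    "hours_studied": [1, 2, 2.5, 3, 4, 5, 6, 7, 8],
    "exam_score":    [52, 59, 60, 64, 70, 74, 81, 85, 91]
})
model = LinearRegression().fit(df[["hours_studied"]], df["exam_score"])

# Predict scores for study times the model has never seen
new_students = pd.DataFrame({"hours_studied": [3.5, 4.5, 7.5]})
print(model.predict(new_students))  # one predicted score per row
```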
Complete Scikit-learn Workflow
Here's the standard pattern you'll use for all sklearn models:
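A compact sketch of that pattern end to end, again on the hypothetical study-hours data:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# 1. Prepare the data (hypothetical study-hours dataset)
df = pd.DataFrame({
    "hours_studied": [1, 2, 2.5, 3, 4, 5, 6, 7, 8],
    "exam_score":    [52, 59, 60, 64, 70, 74, 81, 85, 91]
})
X = df[["hours_studied"]]   # features must be 2D
y = df["exam_score"]        # target is 1D

# 2. Create and fit the model
model = LinearRegression()
model.fit(X, y)

# 3. Inspect the learned equation
print(f"ŷ = {model.intercept_:.2f} + {model.coef_[0]:.2f}x")

# 4. Predict for new values
new_X = pd.DataFrame({"hours_studied": [4.5]})
print(f"Predicted score for 4.5 hours: {model.predict(new_X)[0]:.1f}")

# 5. Evaluate: R² measures how much of the variation the line explains
print(f"R² on the training data: {model.score(X, y):.3f}")
```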
Diagram: Scikit-learn Workflow
Machine Learning Pipeline Flowchart
Type: workflow
Bloom Taxonomy: Apply (L3)
Learning Objective: Help students memorize and apply the standard scikit-learn workflow
Purpose: Visual guide for the fit-predict pattern
Visual style: Horizontal flowchart with code snippets
Steps (left to right):
1. IMPORT
    - Icon: Package/box
    - Code: `from sklearn.linear_model import LinearRegression`
    - Hover text: "Import the model class you need"
    - Color: Blue
2. PREPARE DATA
    - Icon: Table/spreadsheet
    - Code: `X = df[['feature']]` and `y = df['target']`
    - Hover text: "X must be 2D, y is 1D"
    - Color: Green
    - Warning note: "X needs double brackets!"
3. CREATE MODEL
    - Icon: Gear/factory
    - Code: `model = LinearRegression()`
    - Hover text: "Instantiate the model object"
    - Color: Orange
4. FIT MODEL
    - Icon: Brain/learning
    - Code: `model.fit(X, y)`
    - Hover text: "Train on your data - learns coefficients"
    - Color: Purple
    - Output: `model.coef_`, `model.intercept_`
5. PREDICT
    - Icon: Crystal ball
    - Code: `y_pred = model.predict(X_new)`
    - Hover text: "Generate predictions for any X"
    - Color: Red
6. EVALUATE
    - Icon: Checkmark/chart
    - Code: `model.score(X, y)` or metrics
    - Hover text: "Assess model quality"
    - Color: Teal
Annotations: - Arrow from "FIT" to coefficients stored - Note: "This pattern works for ALL sklearn models!" - Common errors callout: "Forgot to reshape X?", "Wrong array shape?"
Interactive elements: - Click each step to see full code example - Hover for detailed explanation - Toggle between LinearRegression and other model examples
Implementation: SVG with JavaScript interactivity
Putting It All Together: A Complete Example
Let's work through a complete regression analysis from start to finish:
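The full worked example doesn't fit here, but the sketch below walks the same arc on the hypothetical study-hours data: explore, visualize, fit, interpret, check residuals, predict, and evaluate. All values and column names are made up for illustration.

```python
import pandas as pd
import plotly.express as px
from sklearn.linear_model import LinearRegression

# --- 1. Load and explore (hypothetical data) ---
df = pd.DataFrame({
    "hours_studied": [1, 2, 2.5, 3, 4, 5, 6, 7, 8],
    "exam_score":    [52, 59, 60, 64, 70, 74, 81, 85, 91]
})
print(df.describe())
print("Correlation:", df["hours_studied"].corr(df["exam_score"]).round(3))

# --- 2. Visualize the relationship ---
px.scatter(df, x="hours_studied", y="exam_score", trendline="ols").show()

# --- 3. Fit the model ---
X = df[["hours_studied"]]
y = df["exam_score"]
model = LinearRegression().fit(X, y)

b0, b1 = model.intercept_, model.coef_[0]
print(f"Regression equation: ŷ = {b0:.2f} + {b1:.2f}x")
print(f"Interpretation: each extra hour of study adds about {b1:.1f} points.")

# --- 4. Check the residuals (linearity and homoscedasticity eyeball test) ---
fitted = model.predict(X)
residuals = y - fitted
px.scatter(x=fitted, y=residuals,
           labels={"x": "Fitted values", "y": "Residuals"}).show()
print("Residuals sum to roughly zero:", round(residuals.sum(), 6))

# --- 5. Predict for a new student ---
new_X = pd.DataFrame({"hours_studied": [4.5]})
print(f"Predicted score for 4.5 hours: {model.predict(new_X)[0]:.1f}")

# --- 6. Evaluate fit quality ---
print(f"R²: {model.score(X, y):.3f}")
```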
Diagram: Interactive Regression Builder MicroSim
Build Your Own Regression Model
Type: microsim
Bloom Taxonomy: Create (L6)
Learning Objective: Let students build, visualize, and interpret their own regression models interactively
Canvas layout (950x700px): - Left panel (600x700): Main visualization area - Top (600x400): Scatter plot with regression line - Bottom (600x300): Residual plot - Right panel (350x700): Controls, coefficients, interpretation
Visual elements: - Interactive scatter plot - Regression line (updates with data) - Residual lines connecting points to line - Coefficient display - Equation display - R² score gauge
Data options: - Preset datasets: - "Study Hours vs Scores" (positive, strong) - "House Size vs Price" (positive, moderate) - "Car Age vs Value" (negative) - "Random Data" (no relationship) - Custom: Click to add points
Interactive controls: - Dropdown: Select dataset - Button: "Add Point" (click on plot to add) - Button: "Remove Point" (click to remove) - Button: "Fit Model" - calculates regression - Button: "Clear All" - Slider: Noise level (for preset datasets) - Toggle: Show residuals - Toggle: Show confidence band
Right panel displays: - Equation: ŷ = β₀ + β₁x (with actual values) - Interpretation text: - "For each unit increase in X, Y changes by [slope]" - "When X = 0, predicted Y = [intercept]" - Model quality: - R² score with visual gauge - RMSE value - Assumption indicators (traffic lights)
Prediction tool: - Input field: "Enter X value" - Button: "Predict" - Output: Predicted Y with confidence interval - Visual: Point added to plot at prediction
Behavior: - Adding/removing points triggers model refit - All statistics update in real-time - Interpretation text updates with coefficient values - Warning when extrapolating beyond data range
Educational features: - "What happens if you add an outlier?" - "Can you create data with R² > 0.9?" - "What does negative slope look like?"
Visual style: Professional dashboard with clean aesthetics
Implementation: p5.js with real-time OLS calculations
Common Pitfalls and Best Practices
Pitfall 1: Confusing Correlation with Causation
A strong relationship doesn't mean x CAUSES y. Ice cream sales predict drowning deaths (both increase in summer), but ice cream doesn't cause drowning!
Pitfall 2: Extrapolating Too Far
Your model is only reliable within the range of your training data. Predicting house prices for 50,000 square feet when your data only goes to 3,000 is dangerous.
Pitfall 3: Ignoring Assumptions
Always check your assumptions! A model fit on data with severe violations gives misleading results.
Pitfall 4: Forgetting to Reshape X
Scikit-learn needs X as a 2D array. The most common error:
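A sketch of the error and its fix, with hypothetical column names:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"hours_studied": [1, 2, 3, 4], "exam_score": [55, 62, 68, 74]})
model = LinearRegression()

# WRONG: single brackets give a 1D Series, and .fit() raises a ValueError
# model.fit(df["hours_studied"], df["exam_score"])

# RIGHT: double brackets keep X as a 2D DataFrame (rows x features)
model.fit(df[["hours_studied"]], df["exam_score"])

# If X is a plain NumPy array, reshape(-1, 1) does the same job
X = np.array([1, 2, 3, 4]).reshape(-1, 1)
model.fit(X, df["exam_score"])
```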
Chapter 7 Checkpoint: Test Your Understanding
Question 1: A model has equation: Price = 25000 + 150 × sqft. Interpret the slope.
Question 2: What's the predicted price for a 1,200 sqft house using this model?
Question 3: In a residual plot, you see residuals fanning out (spreading wider) as fitted values increase. Which assumption is violated?
Click to reveal answers:
Answer 1: For each additional square foot, the predicted price increases by $150. A house that's 100 sqft larger is predicted to cost $15,000 more.
Answer 2: Price = 25000 + 150 × 1200 = 25000 + 180000 = $205,000
Answer 3: Homoscedasticity is violated. The residuals should have constant spread (homoscedastic), but fanning indicates the spread changes with fitted values (heteroscedastic).
Achievement Unlocked: Prediction Pioneer
You've built your first predictive model! You can now fit lines to data, interpret what those lines mean, make predictions, and check if your model is trustworthy. This is the foundation of all machine learning—everything else builds on these concepts.
Key Takeaways
- Regression analysis models relationships between variables to make predictions.
- Simple linear regression uses one input (x) to predict one output (y) with a straight line.
- The regression equation is \(\hat{y} = \beta_0 + \beta_1 x\), where β₀ is the intercept and β₁ is the slope.
- Slope tells you how much y changes per unit increase in x. Intercept is the predicted y when x = 0.
- Residuals are prediction errors: actual minus predicted. The least squares method finds the line minimizing the sum of squared errors.
- OLS (Ordinary Least Squares) is the standard algorithm that finds the optimal coefficients.
- Fitted values are predictions for training data; predictions can be made for any new x values.
- Assumptions: linearity, independence, homoscedasticity, and normality of residuals. Check them!
- Scikit-learn provides the professional way to do regression: create model → fit(X, y) → predict(X_new).
- The LinearRegression class implements OLS. Use `.fit()` to train and `.predict()` to generate predictions.
You've now mastered the fundamentals of predictive modeling. In the next chapter, you'll learn how to evaluate whether your model is actually good—because fitting a line is easy, but knowing if it's useful is the real skill!