Non-linear Models and Regularization
title: Non-linear Models and Regularization
description: Bend the line and tame the beast - mastering curves and preventing overfitting
generated_by: chapter-content-generator skill
date: 2025-12-15
version: 0.03
Summary
This chapter expands modeling capabilities beyond linear relationships. Students will learn polynomial regression for capturing non-linear patterns, various transformation techniques, and the concept of model flexibility. The chapter introduces regularization as a technique for preventing overfitting, covering Ridge regression, Lasso regression, and Elastic Net. By the end of this chapter, students will understand how to balance model complexity with generalization and apply regularization to improve model performance.
Concepts Covered
This chapter covers the following 15 concepts from the learning graph:
- Non-linear Regression
- Polynomial Regression
- Degree of Polynomial
- Curve Fitting
- Transformation
- Log Transformation
- Feature Transformation
- Model Flexibility
- Regularization
- Ridge Regression
- Lasso Regression
- Elastic Net
- Regularization Parameter
- Lambda Parameter
- Shrinkage
Prerequisites
This chapter builds on concepts from:
- Linear regression (fitting and evaluating straight-line models)
- Feature engineering and transformation basics
Introduction: Beyond the Straight Line
You've mastered linear regression—congratulations! But here's the truth: the real world doesn't always follow straight lines. House prices don't increase linearly with size forever. Learning curves flatten out. Population growth accelerates and then stabilizes. To model these patterns, you need curves.
This chapter gives you two new superpowers:
- Bending the line: Using polynomial regression and transformations to capture curved relationships
- Taming the beast: Using regularization to prevent models from going wild with overfitting
Together, these techniques let you build models that are flexible enough to capture complex patterns yet disciplined enough to generalize to new data. It's a delicate balance—and by the end of this chapter, you'll be a master at finding it.
Non-linear Regression: When Lines Aren't Enough
Non-linear regression refers to any regression approach where the relationship between features and target isn't a simple straight line. Picture a scatter plot where the points bend into a clear curve rather than lining up along a line.
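You can generate data like that and watch a straight line struggle with it. A minimal sketch (the data is synthetic and every number is invented for illustration):

```python
# Synthetic curved data plus a straight-line fit
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
x = np.linspace(1, 10, 50)
y = 2 + 0.5 * x**2 + rng.normal(0, 2, size=x.shape)   # a clearly curved pattern

# Fit an ordinary straight line and inspect the leftover errors
line = LinearRegression().fit(x.reshape(-1, 1), y)
residuals = y - line.predict(x.reshape(-1, 1))

# The residuals aren't random: positive at both ends, negative in the middle --
# the signature of a model that isn't flexible enough
print(residuals.round(1))
```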
If you fit a straight line to this data, you'll miss the curve entirely. The model will systematically underpredict in some regions and overpredict in others. That's not a random error—it's a sign that your model isn't flexible enough.
Non-linear regression captures these curved patterns by:
- Adding polynomial terms (x², x³, etc.)
- Transforming features (log, square root, etc.)
- Using inherently non-linear models (which we'll cover in later chapters)
The key insight: even though the relationship is curved, we can still use linear regression techniques! We just need to transform our features first.
Polynomial Regression: Curves Through Linear Regression
Polynomial regression is a clever trick: we create new features by raising the original feature to different powers, then use regular linear regression on these expanded features.
For a single feature \(x\), polynomial regression of degree 3 looks like:

\[
\hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3
\]
This is still "linear" regression in the sense that it's linear in the coefficients (β values). But the resulting curve can bend and twist to fit complex patterns.
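A sketch of how this looks in scikit-learn (the cubic dataset below is synthetic, invented just for the example):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 1 - 2 * x.ravel() + 0.5 * x.ravel()**3 + rng.normal(0, 1, 100)  # cubic pattern

# Expand x into [x, x^2, x^3], then fit ordinary linear regression on the expansion
model = make_pipeline(PolynomialFeatures(degree=3, include_bias=False),
                      LinearRegression())
model.fit(x, y)

print("Training R^2:", round(model.score(x, y), 3))
print("Expanded features:",
      model.named_steps["polynomialfeatures"].get_feature_names_out(["x"]))
```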
The magic happens in PolynomialFeatures—it takes your original feature and creates new columns for each power up to the specified degree.
Degree of Polynomial: How Much Flexibility?
The degree of polynomial controls how flexible your curve can be:
- Degree 1: Straight line (regular linear regression)
- Degree 2: Parabola (one bend)
- Degree 3: S-curve possible (two bends)
- Degree 4+: Increasingly complex curves
Here's the critical trade-off:
| Degree | Flexibility | Risk of Underfitting | Risk of Overfitting |
|---|---|---|---|
| 1 | Low | High | Low |
| 2-3 | Medium | Medium | Medium |
| 5-7 | High | Low | Medium-High |
| 10+ | Very High | Very Low | Very High |
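To see the trade-off in action, fit several degrees to the same small, noisy dataset. A sketch (synthetic data; the degrees are chosen purely for contrast):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-3, 3, 20)).reshape(-1, 1)     # only 20 noisy points
y = np.sin(x).ravel() + rng.normal(0, 0.3, 20)

grid = np.linspace(-3, 3, 300).reshape(-1, 1)
plt.scatter(x, y, color="black", label="data")
for degree in (1, 3, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(x, y)
    plt.plot(grid, model.predict(grid), label=f"degree {degree}")
    print(f"degree {degree:2d}: training R^2 = {model.score(x, y):.3f}")
plt.legend()
plt.show()
```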
Notice how degree 10 goes wild, trying to pass through every data point? That's overfitting in action. The curve fits the training data perfectly but would fail miserably on new data.
Diagram: Polynomial Degree Explorer
Polynomial Degree Explorer
Type: microsim
Bloom Taxonomy: Apply, Evaluate
Learning Objective: Interactively explore how polynomial degree affects curve flexibility and the bias-variance tradeoff
Canvas Layout (850x550):
- Main area (850x400): Scatter plot with polynomial curve
- Bottom area (850x150): Controls and metrics

Main Visualization:
- Data points (20-50 points) with some noise
- Polynomial curve that updates in real-time
- Shaded confidence region showing uncertainty
- Residual lines from points to curve (optional toggle)

Interactive Controls:
- Slider: Polynomial Degree (1 to 15)
- Dropdown: Dataset type (linear, quadratic, cubic, sine wave, step function)
- Slider: Noise level (0 to high)
- Button: "Generate New Data"
- Checkbox: "Show Train/Test Split"

Metrics Panel:
- Training R²: updates live
- Test R²: updates live (when split enabled)
- Number of coefficients: degree + 1
- Visual warning when overfitting detected (train >> test)

Educational Overlays:
- At degree 1: "Underfitting: Missing the curve"
- At degree 2-4: "Good fit for this data"
- At degree 10+: "Overfitting: Chasing noise!"
- Arrow pointing to where train/test scores diverge

Animation:
- Smooth curve transition when degree changes
- Coefficients displayed with size proportional to magnitude
Implementation: p5.js with polynomial fitting
Curve Fitting: The Art and Science
Curve fitting is the process of finding the mathematical function that best describes your data. While polynomial regression is one approach, the broader goal is matching the right curve shape to your data's underlying pattern.
Good curve fitting requires:
- Visual inspection: Plot your data first! What shape does it suggest?
- Domain knowledge: Does theory predict a certain relationship?
- Validation: Does the curve generalize to new data?
- Parsimony: Prefer simpler curves when they fit adequately
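Cross-validation ties the last two points together: try a range of degrees and let held-out data judge them. A sketch (synthetic data; your own degree range will depend on the problem):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 6, 80)).reshape(-1, 1)
y = 3 * np.log1p(x).ravel() + rng.normal(0, 0.4, 80)    # smooth, gently curved pattern

# Score each candidate degree with 5-fold cross-validation
for degree in range(1, 11):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    cv_r2 = cross_val_score(model, x, y, cv=5, scoring="r2").mean()
    print(f"degree {degree:2d}: mean CV R^2 = {cv_r2:.3f}")
```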
The CV score typically rises, peaks, then falls as degree increases. The peak is your sweet spot—enough flexibility to capture the pattern, not so much that you're fitting noise.
Transformation: Changing the Shape of Data
Transformation is a broader technique for handling non-linear relationships. Instead of adding polynomial terms, we transform the original variables to make the relationship more linear.
Common transformations include:
| Transformation | Formula | Use Case |
|---|---|---|
| Log | \(\log(x)\) | Exponential growth, multiplicative effects |
| Square root | \(\sqrt{x}\) | Count data, variance stabilization |
| Reciprocal | \(1/x\) | Inverse relationships |
| Power | \(x^n\) | Accelerating/decelerating patterns |
| Box-Cox | \((x^\lambda - 1)/\lambda\) | General normalization |
The key insight: if your scatter plot curves, the right transformation can straighten it—making linear regression appropriate again.
Log Transformation: The Exponential Tamer
The log transformation is probably the most useful transformation in data science. It's perfect when:
- Your data spans several orders of magnitude (1 to 1,000,000)
- The relationship looks exponential
- You want to interpret coefficients as percentage changes
- Residuals show increasing variance (heteroscedasticity)
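A sketch of the effect on synthetic exponential-growth data (the growth rate and noise level are invented):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = np.linspace(0, 5, 100)
y = 50 * np.exp(0.8 * x) * rng.lognormal(0, 0.1, 100)   # exponential growth plus noise

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(x, y, s=10)
ax1.set_title("Original scale: strongly curved")
ax2.scatter(x, np.log(y), s=10)
ax2.set_title("log(y) vs x: nearly a straight line")
plt.tight_layout()
plt.show()
```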
Notice how the curved relationship becomes nearly linear after log transformation? Now regular linear regression will work beautifully.
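Once the target is on the log scale, plain linear regression takes over. A sketch continuing the same kind of synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
x = np.linspace(0, 5, 100).reshape(-1, 1)
y = 50 * np.exp(0.8 * x.ravel()) * rng.lognormal(0, 0.1, 100)

# Fit a line to log(y) instead of y
log_model = LinearRegression().fit(x, np.log(y))
coef = log_model.coef_[0]
print(f"coefficient on x: {coef:.3f}")
print(f"each extra unit of x multiplies y by about {np.exp(coef):.2f}")
```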
Interpreting Log-Transformed Coefficients
When your target is log-transformed, coefficients represent multiplicative effects. A coefficient of 0.4 means each unit of x multiplies y by e^0.4 ≈ 1.49, or a 49% increase.
Feature Transformation: Engineering Better Inputs
Feature transformation is the deliberate modification of input features to improve model performance. This is closely related to the feature engineering we covered earlier, but with a specific focus on mathematical transformations.
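In practice this often means building several transformed versions of a feature and checking which one relates most linearly to the target. A sketch with an invented pandas DataFrame:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
df = pd.DataFrame({"income": rng.lognormal(10, 0.6, 500)})        # right-skewed feature
df["spending"] = 2000 * np.log(df["income"]) + rng.normal(0, 500, 500)

# Candidate transformations of the skewed income feature
df["log_income"] = np.log(df["income"])
df["sqrt_income"] = np.sqrt(df["income"])
df["income_squared"] = df["income"] ** 2

# Correlation with the target shows which version is most linearly related
print(df.corr()["spending"].round(3))
```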
Scikit-learn's PowerTransformer can automatically find good transformations:
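A minimal sketch (the skewed array is randomly generated; PowerTransformer defaults to the Yeo-Johnson method):

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(5)
skewed = rng.exponential(scale=2.0, size=(500, 1))       # right-skewed feature

pt = PowerTransformer(method="yeo-johnson")              # Box-Cox is also available for positive data
transformed = pt.fit_transform(skewed)

print("learned lambda:", pt.lambdas_.round(3))
print("skewness before:", round(skew(skewed.ravel()), 2),
      "after:", round(skew(transformed.ravel()), 2))
```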
Diagram: Transformation Gallery
Transformation Gallery
Type: infographic
Bloom Taxonomy: Understand, Apply
Learning Objective: Show common transformations side-by-side with their effects on data distribution and relationships
Layout: 2x3 grid of transformation examples
Each Panel Contains:
- Original data histogram/scatter (left mini-plot)
- Transformed data histogram/scatter (right mini-plot)
- Transformation formula
- When to use it

Panels:
1. Log Transformation
   - Before: Right-skewed histogram
   - After: Symmetric histogram
   - Formula: y' = log(y)
   - Use: Exponential relationships, multiplicative effects
2. Square Root
   - Before: Count data with variance proportional to mean
   - After: Stabilized variance
   - Formula: y' = √y
   - Use: Count data, Poisson-like distributions
3. Reciprocal (1/x)
   - Before: Hyperbolic scatter
   - After: Linear scatter
   - Formula: y' = 1/y
   - Use: Inverse relationships
4. Square (x²)
   - Before: Accelerating curve
   - After: Linear relationship
   - Formula: x' = x²
   - Use: Accelerating patterns
5. Box-Cox
   - Before: Arbitrary skewed data
   - After: Approximately normal
   - Formula: y' = (y^λ - 1)/λ
   - Use: General normalization
6. Standardization
   - Before: Different scales
   - After: Mean=0, SD=1
   - Formula: z = (x - μ)/σ
   - Use: Comparing features, regularization
Interactive Elements:
- Click panel to see full-size comparison
- Slider to adjust transformation parameter
- Button: "Try on your data" - upload CSV option
Implementation: HTML/CSS/JavaScript with D3.js visualizations
Model Flexibility: The Complexity Dial
Model flexibility refers to how adaptable a model is to different patterns in data. A highly flexible model can capture intricate patterns but risks overfitting. A rigid model may miss important patterns but generalizes better.
Think of flexibility as a dial:
- Low flexibility (simple models): Few parameters, strong assumptions, high bias, low variance
- High flexibility (complex models): Many parameters, weak assumptions, low bias, high variance
The relationship between flexibility and error follows a U-shaped curve:
- Training error always decreases with more flexibility
- Test error decreases initially, then increases (the overfitting zone)
- The optimal flexibility minimizes test error
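You can watch that U-shape emerge by turning the flexibility dial yourself, using polynomial degree as the dial. A sketch on synthetic data:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(-3, 3, 120)).reshape(-1, 1)
y = x.ravel()**2 - 2 * x.ravel() + rng.normal(0, 1.5, 120)

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

# Train R^2 keeps rising with degree; test R^2 peaks and then falls off
print(f"{'degree':>6} {'train R^2':>10} {'test R^2':>9}")
for degree in (1, 2, 4, 8, 12, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_train, y_train)
    print(f"{degree:>6} {model.score(X_train, y_train):>10.3f} {model.score(X_test, y_test):>9.3f}")
```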
The gap between training and test error is your overfitting indicator. When it's large, your model has learned noise specific to the training data.
Regularization: Taming Overfitting
Here's the million-dollar question: if more flexibility leads to overfitting, but we need flexibility to capture complex patterns, what do we do?
Enter regularization—a technique that adds a penalty for model complexity. Instead of just minimizing prediction error, regularized models minimize:

\[
\text{Total Loss} = \text{Prediction Error} + \lambda \times \text{Complexity Penalty}
\]
The complexity penalty discourages large coefficients, effectively simplifying the model. This creates a controlled trade-off between fitting the data and keeping the model simple.
Regularization gives you the best of both worlds:
- Use a flexible model (high-degree polynomial)
- Let regularization automatically "turn off" unnecessary complexity
- Result: captures real patterns, ignores noise
The regularization parameter (often called \(\lambda\) or alpha in scikit-learn) controls this trade-off:
- λ = 0: No regularization (standard linear regression)
- Small λ: Light penalty; the model stays close to its flexible, unregularized fit
- Large λ: Heavy penalty; the model is pulled toward a rigid, heavily simplified fit
- λ → ∞: All coefficients shrink to zero
Ridge Regression: The L2 Penalty
Ridge regression (also called L2 regularization or Tikhonov regularization) adds a penalty proportional to the squared coefficients:

\[
\text{Loss} = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
\]
The squared penalty means:
- All coefficients are shrunk toward zero
- Large coefficients are penalized more heavily
- Coefficients never become exactly zero (just very small)
- Good for multicollinearity—it stabilizes correlated features
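A sketch of that comparison: the same degree-10 features, fit once without regularization and once with Ridge (the data and the alpha value are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(-3, 3, 40)).reshape(-1, 1)      # small, noisy dataset
y = np.sin(x).ravel() + rng.normal(0, 0.3, 40)

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=0)

for name, reg in [("plain linear", LinearRegression()), ("ridge (alpha=1)", Ridge(alpha=1.0))]:
    # degree-10 expansion, scaled so the penalty treats every term fairly
    model = make_pipeline(PolynomialFeatures(degree=10), StandardScaler(), reg)
    model.fit(X_train, y_train)
    print(f"{name:>16}: train R^2 = {model.score(X_train, y_train):.3f}, "
          f"test R^2 = {model.score(X_test, y_test):.3f}")
```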
Notice how Ridge maintains good test performance even with degree 10, while unregularized regression overfits!
Scale Your Features for Regularization
Regularization penalizes large coefficients. If features are on different scales, the penalty affects them unequally. Always standardize features before applying regularization.
Lasso Regression: The L1 Penalty
Lasso regression (Least Absolute Shrinkage and Selection Operator) uses the absolute value of coefficients as the penalty:

\[
\text{Loss} = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{p} \left|\beta_j\right|
\]
The L1 penalty has a special property: it can shrink coefficients all the way to exactly zero. This means Lasso performs automatic feature selection—useless features get eliminated entirely.
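A sketch of that behavior, on synthetic data where only three of ten features carry any signal:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 10))                 # 10 features, but only 3 matter
y = 4 * X[:, 0] - 3 * X[:, 1] + 2 * X[:, 2] + rng.normal(0, 1, 200)

X_scaled = StandardScaler().fit_transform(X)   # scale before regularizing
lasso = Lasso(alpha=0.1).fit(X_scaled, y)

print("coefficients:", lasso.coef_.round(2))
print("features kept:", int(np.sum(lasso.coef_ != 0)), "out of", X.shape[1])
```

So how do the two penalties stack up? Here's a side-by-side comparison: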
| Aspect | Ridge (L2) | Lasso (L1) |
|---|---|---|
| Penalty | Sum of squared coefficients | Sum of absolute coefficients |
| Coefficients | Shrunk toward zero | Can become exactly zero |
| Feature selection | No | Yes (automatic) |
| Multicollinearity | Handles well | Arbitrary selection |
| Best for | Many small effects | Few important features |
Diagram: Ridge vs Lasso Comparison
Ridge vs Lasso Comparison
Type: microsim
Bloom Taxonomy: Analyze, Evaluate
Learning Objective: Visualize and compare how Ridge and Lasso penalties affect coefficient shrinkage and feature selection
Canvas Layout (900x550):
- Top left (400x250): Ridge coefficient path
- Top right (400x250): Lasso coefficient path
- Bottom (900x250): Side-by-side coefficient comparison and controls

Coefficient Path Plots:
- X-axis: Log(λ) from small to large
- Y-axis: Coefficient values
- Each line represents one coefficient
- Show how coefficients shrink as λ increases
- Ridge: All lines approach zero asymptotically
- Lasso: Lines hit zero and stay there

Comparison Panel:
- Bar chart showing final coefficient values
- Ridge bars (blue): All non-zero, varying heights
- Lasso bars (orange): Some exactly zero
- Highlight eliminated features in gray

Interactive Controls:
- Slider: λ (regularization strength) - both plots update
- Dropdown: Select dataset (housing, synthetic, medical)
- Checkbox: "Show cross-validation optimal λ"
- Toggle: "Show mathematical penalty visualization"

Penalty Visualization (optional):
- 2D contour plot showing loss surface
- Ridge: Circular penalty (L2 ball)
- Lasso: Diamond penalty (L1 ball)
- Optimal point where loss contours meet penalty boundary

Key Insights Displayed:
- "Lasso zeros out X features"
- "Ridge reduces largest coefficient by Y%"
- Optimal λ marked on both paths
Implementation: p5.js with interactive plots
Elastic Net: The Best of Both Worlds
Elastic Net combines Ridge and Lasso penalties:

\[
\text{Loss} = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 + \lambda_1 \sum_{j=1}^{p} \left|\beta_j\right| + \lambda_2 \sum_{j=1}^{p} \beta_j^2
\]
Or equivalently, using a mixing parameter \(\rho\) (called l1_ratio in scikit-learn):

\[
\text{Loss} = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 + \lambda \left[ \rho \sum_{j=1}^{p} \left|\beta_j\right| + (1 - \rho) \sum_{j=1}^{p} \beta_j^2 \right]
\]
- When \(\rho = 1\): Pure Lasso
- When \(\rho = 0\): Pure Ridge
- When \(0 < \rho < 1\): Combination of the two
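In scikit-learn this is the ElasticNet estimator. Here's a sketch on the same style of synthetic data (alpha and l1_ratio are arbitrary starting values, not recommendations):

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(9)
X = rng.normal(size=(200, 10))
y = 4 * X[:, 0] - 3 * X[:, 1] + 2 * X[:, 2] + rng.normal(0, 1, 200)

X_scaled = StandardScaler().fit_transform(X)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_scaled, y)   # 50/50 mix of L1 and L2

print("coefficients:", enet.coef_.round(2))
print("non-zero coefficients:", int(np.sum(enet.coef_ != 0)))
```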
Elastic Net is particularly useful when:
- You have groups of correlated features (Lasso arbitrarily picks one; Elastic Net keeps related features together)
- You want some feature selection but not as aggressive as pure Lasso
- You're not sure whether Ridge or Lasso is better (try Elastic Net!)
Lambda/Alpha Parameter: Finding the Sweet Spot
The lambda parameter (called alpha in scikit-learn) controls regularization strength. Too small and you overfit; too large and you underfit. Finding the optimal λ is crucial.
Use cross-validation to find the best λ:
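Scikit-learn ships CV-aware versions of these models that search an alpha grid for you. A sketch (the grid and the synthetic data are just reasonable defaults):

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(10)
X = rng.normal(size=(300, 8))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(0, 1, 300)
X_scaled = StandardScaler().fit_transform(X)

alphas = np.logspace(-3, 3, 50)          # candidate regularization strengths

ridge_cv = RidgeCV(alphas=alphas).fit(X_scaled, y)
lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X_scaled, y)

print("best Ridge alpha:", round(ridge_cv.alpha_, 4))
print("best Lasso alpha:", round(lasso_cv.alpha_, 4))
```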
Visualize the regularization path:
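One way to draw the path is to refit Lasso across a grid of alphas and plot each coefficient's trajectory, as sketched below (synthetic data again):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(11)
X = StandardScaler().fit_transform(rng.normal(size=(200, 6)))
y = 4 * X[:, 0] - 3 * X[:, 1] + 2 * X[:, 2] + rng.normal(0, 1, 200)

alphas = np.logspace(-3, 1, 60)
# One row of coefficients per alpha value
coef_path = np.array([Lasso(alpha=a, max_iter=10000).fit(X, y).coef_ for a in alphas])

for j in range(X.shape[1]):
    plt.plot(alphas, coef_path[:, j], label=f"feature {j}")
plt.xscale("log")
plt.xlabel("alpha (log scale)")
plt.ylabel("coefficient value")
plt.title("Lasso path: coefficients shrink to zero as alpha grows")
plt.legend()
plt.show()
```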
Diagram: Lambda Tuning Playground
Lambda Tuning Playground
Type: microsim
Bloom Taxonomy: Apply, Evaluate
Learning Objective: Practice finding optimal regularization strength through interactive experimentation
Canvas Layout (850x600):
- Top area (850x350): Data and model fit visualization
- Bottom left (425x250): CV score vs lambda plot
- Bottom right (425x250): Coefficient magnitudes

Top Panel - Model Fit:
- Scatter plot of data
- Polynomial curve showing current fit
- Toggle between Ridge/Lasso/Elastic Net
- Curve updates as lambda changes

Bottom Left - Cross-Validation:
- X-axis: log(λ) scale
- Y-axis: CV Score (R² or MSE)
- Line showing CV performance across λ values
- Vertical marker at current λ
- Optimal λ highlighted with star

Bottom Right - Coefficients:
- Bar chart of coefficient magnitudes
- Updates in real-time as λ changes
- For Lasso: Gray out zero coefficients
- Show total number of non-zero coefficients

Interactive Controls:
- Slider: Lambda value (log scale)
- Dropdown: Regularization type (Ridge, Lasso, Elastic Net)
- Slider: Polynomial degree (2-15)
- Slider: l1_ratio (for Elastic Net, 0-1)
- Button: "Find Optimal λ" - animates search
- Button: "Generate New Data"

Metrics Display:
- Current λ value
- Train R²
- Test R²
- Cross-Validation R²
- Number of non-zero coefficients

Educational Callouts:
- When λ too small: "Overfitting warning!"
- When λ too large: "Underfitting warning!"
- At optimal: "Sweet spot found!"
Implementation: p5.js with real-time model fitting
Shrinkage: What Regularization Actually Does
Shrinkage is the technical term for what regularization does to coefficients—it pulls them toward zero. But why does shrinking coefficients help prevent overfitting?
Consider what happens when a model overfits:
- It finds patterns in noise
- These patterns require extreme coefficients
- Small changes in data cause large prediction changes
- High variance = poor generalization
Shrinkage counters this by:
- Penalizing extreme coefficients
- Forcing the model to find simpler solutions
- Reducing sensitivity to noise
- Lower variance = better generalization
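You can measure shrinkage directly by tracking the overall size of the coefficient vector as λ grows. A sketch with Ridge on synthetic data:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(12)
X = StandardScaler().fit_transform(rng.normal(size=(150, 8)))
y = X @ rng.normal(0, 3, 8) + rng.normal(0, 1, 150)

for alpha in (0.01, 0.1, 1, 10, 100, 1000):
    ridge = Ridge(alpha=alpha).fit(X, y)
    size = np.sqrt(np.sum(ridge.coef_ ** 2))        # L2 norm of the coefficient vector
    print(f"alpha = {alpha:>7}: coefficient norm = {size:.2f}")
```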
As regularization increases:
- Coefficient magnitudes shrink
- Model becomes more stable
- Test performance often improves (up to a point)
Putting It All Together: A Complete Workflow
Here's a complete workflow for building regularized non-linear models:
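A sketch of such a workflow from start to finish (the dataset is synthetic, and every grid value below is a reasonable default rather than a prescription):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split, GridSearchCV

# 1. Synthetic data standing in for a real dataset
rng = np.random.default_rng(13)
x = np.sort(rng.uniform(-3, 3, 300)).reshape(-1, 1)
y = 0.5 * x.ravel()**3 - x.ravel() + rng.normal(0, 2, 300)

# 2. Hold out a test set that is never touched during tuning
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# 3. Pipeline: expand features, scale them, then fit a regularized model
pipe = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),
    ("scale", StandardScaler()),
    ("model", ElasticNet(max_iter=10000)),
])

# 4. Cross-validated search over degree, alpha, and the L1/L2 mix
param_grid = {
    "poly__degree": [2, 3, 5, 8],
    "model__alpha": np.logspace(-3, 1, 10),
    "model__l1_ratio": [0.2, 0.5, 0.8],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

# 5. Final, honest evaluation on the untouched test set
print("best parameters:", search.best_params_)
print("cross-validated R^2:", round(search.best_score_, 3))
print("test R^2:", round(search.score(X_test, y_test), 3))
```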
Diagram: Regularization Decision Tree
Regularization Decision Tree
Type: workflow
Bloom Taxonomy: Evaluate, Apply
Learning Objective: Guide students through choosing the right regularization approach for their problem
Visual Style: Flowchart with decision diamonds and outcome rectangles
Start: "Need to Prevent Overfitting?"
Decision 1: "Linear relationship?"
- Yes → Consider if regularization is needed
- No → Add polynomial features

Decision 2: "How many features vs samples?"
- Many features, few samples → Strong regularization needed
- Balanced → Moderate regularization
- Few features, many samples → Light regularization

Decision 3: "Do you want feature selection?"
- Yes, aggressive → Use Lasso
- Yes, some → Use Elastic Net
- No, keep all features → Use Ridge

Decision 4: "Highly correlated features?"
- Yes → Use Ridge or Elastic Net (Lasso is unstable)
- No → Any method works

Decision 5: "Interpretability important?"
- Yes → Lasso (sparse solution)
- No → Ridge (often better accuracy)

Final Outcomes:
- Ridge: "Many small effects, correlated features"
- Lasso: "Few important features, interpretability"
- Elastic Net: "Best of both, groups of features"

Interactive Elements:
- Click each decision to see explanation
- Hover shows examples of each scenario
- "Take Quiz" mode walks through with your data characteristics
Implementation: HTML/CSS/JavaScript with interactive flowchart
Common Pitfalls and Best Practices
Always Scale Before Regularizing
Regularization penalizes coefficient magnitude. If features aren't scaled, the penalty treats them unequally: a feature measured in small units needs a large coefficient to have the same effect, so it gets punished far more heavily than the same information expressed in large units.
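The easiest way to guarantee this is to put the scaler inside a pipeline, as in this sketch (the placeholder data is random):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Scaling lives inside the pipeline, so it is re-fit on every CV training fold
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

rng = np.random.default_rng(14)                      # placeholder data
X, y = rng.normal(size=(50, 3)), rng.normal(size=50)
model.fit(X, y)
print(round(model.score(X, y), 3))
```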
Don't Regularize the Intercept
Scikit-learn doesn't regularize the intercept by default (which is correct). Be careful if using other implementations.

Use Cross-Validation for Lambda
Never set λ by looking at test performance. Use cross-validation to find optimal λ, then evaluate on test data.

Consider the Problem Type
- Prediction focus → Ridge often wins
- Interpretation focus → Lasso for sparsity
- Groups of related features → Elastic Net

Watch for Warning Signs
- Very large or very small λ optimal → reconsider model specification
- All coefficients near zero → λ too large
- Test performance much worse than CV → something's wrong
Summary: Your Regularization Toolkit
You now have powerful tools for handling non-linear relationships and overfitting:
- Polynomial regression captures curved patterns using powers of features
- Degree selection balances flexibility with overfitting risk
- Transformations (log, sqrt, etc.) can linearize relationships
- Model flexibility is the dial between underfitting and overfitting
- Regularization adds complexity penalties to prevent overfitting
- Ridge (L2) shrinks all coefficients, handles multicollinearity
- Lasso (L1) performs automatic feature selection
- Elastic Net combines L1 and L2 penalties
- Lambda tuning via cross-validation finds the optimal penalty strength
With these techniques, you can build models that are flexible enough to capture complex real-world patterns while remaining robust enough to generalize to new data.
Looking Ahead
In the next chapter, we'll explore machine learning more broadly—including classification problems where we predict categories instead of numbers. You'll see how the regularization concepts you learned here apply to new types of models.
Key Takeaways
- Polynomial regression captures non-linear patterns while still using linear regression techniques
- Higher polynomial degrees increase flexibility but also overfitting risk
- Log and other transformations can linearize curved relationships
- Regularization adds a penalty for complexity, balancing fit with generalization
- Ridge (L2) shrinks coefficients; Lasso (L1) can zero them out entirely
- Elastic Net combines both penalties for flexible feature selection
- Always scale features before regularizing and use CV to find optimal λ
- The goal is finding the sweet spot: flexible enough to learn patterns, constrained enough to ignore noise