Machine Learning Foundations
Summary
This chapter explicitly connects linear algebra concepts to core machine learning algorithms and techniques. You will learn how data is represented as matrices, understand covariance and correlation, master Principal Component Analysis (PCA) for dimensionality reduction, and implement linear regression with regularization. Gradient descent, the workhorse of machine learning optimization, is covered in detail.
Concepts Covered
This chapter covers the following 20 concepts from the learning graph:
- Feature Vector
- Feature Matrix
- Data Matrix
- Covariance Matrix
- Correlation Matrix
- Standardization
- PCA
- Principal Component
- Variance Explained
- Scree Plot
- Dimensionality Reduction
- Linear Regression
- Design Matrix
- Ridge Regression
- Lasso Regression
- Regularization
- Gradient Vector
- Gradient Descent
- Batch Gradient Descent
- Learning Rate
Prerequisites
This chapter builds on concepts from:
- Chapter 1: Vectors and Vector Spaces
- Chapter 2: Matrices and Matrix Operations
- Chapter 6: Eigenvalues and Eigenvectors
- Chapter 7: Matrix Decompositions
- Chapter 8: Vector Spaces and Inner Products
Introduction
Machine learning is, at its core, applied linear algebra. When you train a model, you perform matrix operations. When you reduce dimensions, you compute eigendecompositions. When you optimize, you follow gradients through high-dimensional spaces. Understanding the linear algebra behind these operations transforms you from a user of black-box algorithms into a practitioner who can debug, optimize, and innovate.
This chapter bridges abstract linear algebra and practical machine learning. We start with how data becomes matrices, develop statistical tools like covariance, build up to PCA for dimensionality reduction, implement regression with regularization, and master gradient descent for optimization. Each section reinforces that machine learning "magic" is really linear algebra in action.
Data as Matrices
In machine learning, data is organized into matrices where each row represents an observation and each column represents a feature.
Feature Vectors
A feature vector represents a single data point as a vector of measurements or attributes:
\(\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} \in \mathbb{R}^d\)
where:
- \(d\) is the number of features (dimensionality)
- Each \(x_i\) is a measurement (e.g., height, weight, pixel intensity)
Examples of feature vectors:
| Domain | Features | Dimensionality |
|---|---|---|
| Housing | bedrooms, sqft, age, location | 4+ |
| Images | pixel intensities | 784 (28×28) to millions |
| Text | word counts or embeddings | 100 to 768+ |
| Tabular | mixed numerical/categorical | varies |
Feature Matrix and Data Matrix
A feature matrix (also called data matrix) stacks \(n\) feature vectors as rows:
\(X = \begin{bmatrix} — \mathbf{x}_1^T — \\ — \mathbf{x}_2^T — \\ \vdots \\ — \mathbf{x}_n^T — \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{bmatrix}\)
where:
- \(X\) is \(n \times d\) (n samples, d features)
- Row \(i\) is sample \(\mathbf{x}_i^T\)
- Column \(j\) contains all values of feature \(j\)
Convention Alert
Some texts use columns for samples (X is d×n). We follow the more common machine learning convention where rows are samples, matching NumPy/Pandas defaults. Always check the convention when reading papers or documentation.
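To make the convention concrete, here is a minimal NumPy sketch; the small housing-style array and its values are made up purely for illustration.

```python
import numpy as np

# Hypothetical feature matrix: 4 samples (rows) x 3 features (columns)
# Illustrative features: bedrooms, square footage, age
X = np.array([
    [3, 1500, 20],
    [2,  900, 35],
    [4, 2200,  5],
    [3, 1700, 15],
], dtype=float)

n, d = X.shape            # n = 4 samples, d = 3 features
x_2 = X[1]                # feature vector of sample 2 (a row)
feature_sqft = X[:, 1]    # all values of the square-footage feature (a column)
print(n, d, x_2, feature_sqft)
```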
Diagram: Data Matrix Structure
Run the Data Matrix Structure Visualizer Fullscreen
Data Matrix Structure Visualizer
Type: infographic
Bloom Taxonomy Level: Understand
Learning Objective: Visualize the structure of data matrices and understand the relationship between rows (samples) and columns (features)
Layout: Interactive matrix representation with labeled dimensions
Visual elements: - Main panel: Color-coded matrix grid - Row labels: "Sample 1", "Sample 2", ..., "Sample n" - Column labels: "Feature 1", "Feature 2", ..., "Feature d" - Highlighted single row showing feature vector - Highlighted single column showing all values of one feature - Dimension annotations: n (rows) and d (columns)
Interactive elements: - Click a row to highlight as feature vector - Click a column to highlight as feature across all samples - Hover to see individual cell value - Toggle to show actual example data (iris, housing, etc.)
Example datasets: 1. Iris: 150 samples, 4 features (petal/sepal dimensions) 2. MNIST digit: 1 sample, 784 features (pixel values) 3. Housing: 506 samples, 13 features
Visual style: - Heat map coloring for numerical values - Clean grid lines - Responsive sizing
Implementation: HTML/CSS/JavaScript with interactive highlighting
Statistical Foundations
Before applying machine learning algorithms, we must understand the statistical structure of our data.
Standardization
Standardization transforms features to have zero mean and unit variance:
\(z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}\)
where:
- \(\mu_j = \frac{1}{n}\sum_{i=1}^n x_{ij}\) is the mean of feature \(j\)
- \(\sigma_j = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_{ij} - \mu_j)^2}\) is the standard deviation
In matrix form, if \(\boldsymbol{\mu}\) is the row vector of means:
\(Z = (X - \mathbf{1}\boldsymbol{\mu}) \text{diag}(\boldsymbol{\sigma})^{-1}\)
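A minimal NumPy sketch of this transformation; `ddof=1` matches the \(n-1\) denominator in the definition of \(\sigma_j\) above, and the synthetic data is only illustrative.

```python
import numpy as np

def standardize(X):
    """Return Z with zero column means and unit column standard deviations."""
    mu = X.mean(axis=0)              # row vector of feature means
    sigma = X.std(axis=0, ddof=1)    # sample standard deviations (n - 1 denominator)
    return (X - mu) / sigma, mu, sigma

rng = np.random.default_rng(0)
X = rng.normal(loc=[5.0, -2.0], scale=[10.0, 0.5], size=(100, 2))
Z, mu, sigma = standardize(X)
print(Z.mean(axis=0))                # approximately [0, 0]
print(Z.std(axis=0, ddof=1))         # approximately [1, 1]
```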
Why Standardize?
- Scale invariance: Features measured in different units become comparable
- Numerical stability: Prevents features with large values from dominating
- Algorithm requirements: Many algorithms (PCA, gradient descent, regularization) assume or benefit from standardized data
| Algorithm | Standardization |
|---|---|
| PCA | Required for meaningful results |
| k-Means | Recommended |
| SVM | Required (especially with RBF kernel) |
| Neural Networks | Strongly recommended |
| Decision Trees | Not necessary |
| Linear Regression | Recommended for regularization |
Covariance Matrix
The covariance matrix captures how features vary together:
\(\Sigma = \frac{1}{n-1}(X - \mathbf{1}\boldsymbol{\mu})^T(X - \mathbf{1}\boldsymbol{\mu}) = \frac{1}{n-1}\tilde{X}^T\tilde{X}\)
where:
- \(\tilde{X}\) is the centered data matrix (mean subtracted)
- \(\Sigma\) is a \(d \times d\) symmetric positive semi-definite matrix
- \(\Sigma_{jk} = \text{Cov}(X_j, X_k)\)
Covariance Formula
The covariance between features \(j\) and \(k\):
\(\text{Cov}(X_j, X_k) = \frac{1}{n-1}\sum_{i=1}^n (x_{ij} - \mu_j)(x_{ik} - \mu_k)\)
Properties:
- Diagonal entries \(\Sigma_{jj} = \text{Var}(X_j)\) are variances
- Off-diagonal entries measure linear relationships
- \(\Sigma_{jk} > 0\): features increase together
- \(\Sigma_{jk} < 0\): one increases as other decreases
- \(\Sigma_{jk} = 0\): no linear relationship (not necessarily independent)
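A short sketch comparing the matrix formula \(\frac{1}{n-1}\tilde{X}^T\tilde{X}\) with NumPy's built-in `np.cov` (which treats rows as variables unless `rowvar=False`); the synthetic two-feature dataset is illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0, 0], cov=[[2.0, 1.2], [1.2, 1.0]], size=500)

X_centered = X - X.mean(axis=0)
Sigma_manual = X_centered.T @ X_centered / (X.shape[0] - 1)
Sigma_numpy = np.cov(X, rowvar=False)           # same result

print(np.allclose(Sigma_manual, Sigma_numpy))   # True
print(Sigma_manual)                             # diagonal near [2, 1], off-diagonal near 1.2
```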
Correlation Matrix
The correlation matrix is the standardized covariance:
\(R = D^{-1}\Sigma D^{-1}\)
where:
- \(D = \text{diag}(\sigma_1, \ldots, \sigma_d)\) contains standard deviations
- \(R_{jk} = \frac{\Sigma_{jk}}{\sigma_j \sigma_k} = \frac{\text{Cov}(X_j, X_k)}{\sqrt{\text{Var}(X_j)\text{Var}(X_k)}}\)
Properties:
- Diagonal entries are 1 (features perfectly correlate with themselves)
- Off-diagonal entries satisfy \(-1 \leq R_{jk} \leq 1\)
- \(R_{jk} = \pm 1\): perfect linear relationship
- The correlation matrix is the covariance matrix of standardized data
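The last property can be verified numerically. The sketch below regenerates the same synthetic data used in the covariance example so it stands alone, then checks that \(D^{-1}\Sigma D^{-1}\), `np.corrcoef`, and the covariance of standardized data all agree.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0, 0], cov=[[2.0, 1.2], [1.2, 1.0]], size=500)

D_inv = np.diag(1.0 / X.std(axis=0, ddof=1))
Sigma = np.cov(X, rowvar=False)
R_manual = D_inv @ Sigma @ D_inv               # R = D^{-1} Sigma D^{-1}
R_numpy = np.corrcoef(X, rowvar=False)         # built-in correlation matrix

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(np.allclose(R_manual, R_numpy))                   # True
print(np.allclose(np.cov(Z, rowvar=False), R_numpy))    # correlation = covariance of standardized data
```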
Diagram: Covariance and Correlation Visualizer
Run the Covariance and Correlation Visualizer Fullscreen
Covariance and Correlation Matrix Visualizer
Type: microsim
Bloom Taxonomy Level: Analyze
Learning Objective: Understand how covariance and correlation capture relationships between features through interactive exploration
Visual elements: - Left panel: Scatter plot matrix (pairs of features) - Center panel: Covariance matrix as heatmap - Right panel: Correlation matrix as heatmap - Color scale: Blue (negative) to White (zero) to Red (positive) - Eigenvalue display for covariance matrix
Interactive controls: - Dataset selector (generated bivariate, iris, custom) - Draggable data points to modify dataset - "Standardize" toggle to see effect on covariance - Highlight cell to see corresponding scatter plot - Slider to add/remove correlation between features
Default parameters: - 2D generated data with moderate positive correlation - 100 sample points - Canvas: responsive three-panel layout
Behavior: - Real-time update of matrices as data changes - Show how correlation normalizes for scale - Highlight relationship between scatter plot shape and correlation value - Display eigenvalues/eigenvectors of covariance matrix - Demonstrate that standardized data has correlation = covariance
Implementation: p5.js with matrix visualization
Principal Component Analysis (PCA)
Principal Component Analysis is a technique for dimensionality reduction that finds the directions of maximum variance in data.
The Goal of Dimensionality Reduction
High-dimensional data presents challenges:
- Visualization: Cannot plot more than 3 dimensions
- Computation: Many algorithms scale poorly with dimensions
- Curse of dimensionality: Data becomes sparse in high dimensions
- Noise: Some dimensions may be noise rather than signal
Dimensionality reduction projects data from \(\mathbb{R}^d\) to \(\mathbb{R}^k\) where \(k < d\), preserving as much information as possible.
Principal Components
Principal components are the eigenvectors of the covariance matrix, ordered by their eigenvalues:
\(\Sigma \mathbf{v}_i = \lambda_i \mathbf{v}_i\)
where:
- \(\mathbf{v}_1\) is the first principal component (direction of maximum variance)
- \(\mathbf{v}_2\) is orthogonal to \(\mathbf{v}_1\) and captures maximum remaining variance
- \(\lambda_i\) is the variance explained by the \(i\)-th component
The principal components form an orthonormal basis aligned with the data's natural axes of variation.
PCA Algorithm
Step 1: Center the data
\(\tilde{X} = X - \mathbf{1}\boldsymbol{\mu}\)
Step 2: Compute covariance matrix
\(\Sigma = \frac{1}{n-1}\tilde{X}^T\tilde{X}\)
Step 3: Eigendecomposition
\(\Sigma = V\Lambda V^T\)
where \(V = [\mathbf{v}_1 | \cdots | \mathbf{v}_d]\) and \(\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_d)\) with \(\lambda_1 \geq \cdots \geq \lambda_d\).
Step 4: Project onto top \(k\) components
\(Z = \tilde{X}V_k\)
where \(V_k = [\mathbf{v}_1 | \cdots | \mathbf{v}_k]\) contains the first \(k\) principal components.
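The four steps translate almost line for line into NumPy. Below is a minimal sketch using `np.linalg.eigh` (appropriate for symmetric matrices); it returns eigenvalues in ascending order, so they are reversed here. The function also reports the proportion of variance explained by each component, defined in the next subsection. The synthetic 3D dataset and the choice of \(k\) are illustrative.

```python
import numpy as np

def pca(X, k):
    """Project X (n x d) onto its top-k principal components."""
    X_centered = X - X.mean(axis=0)                      # Step 1: center
    Sigma = X_centered.T @ X_centered / (len(X) - 1)     # Step 2: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)             # Step 3: eigendecomposition
    order = np.argsort(eigvals)[::-1]                    # sort descending by eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    V_k = eigvecs[:, :k]                                 # top-k principal components
    Z = X_centered @ V_k                                 # Step 4: project
    explained = eigvals / eigvals.sum()                  # proportion of variance per component
    return Z, V_k, explained

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0, 0, 0],
                            [[3, 1, 0.5], [1, 2, 0.3], [0.5, 0.3, 0.5]],
                            size=200)
Z, V_k, explained = pca(X, k=2)
print(Z.shape)       # (200, 2)
print(explained)     # variance ratios, sorted descending
```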
Variance Explained
The variance explained by each principal component is its eigenvalue:
- Total variance: \(\sum_{i=1}^d \lambda_i = \text{trace}(\Sigma)\)
- Proportion of variance explained by component \(i\): \(\frac{\lambda_i}{\sum_{j=1}^d \lambda_j}\)
- Cumulative variance explained by first \(k\) components: \(\frac{\sum_{i=1}^k \lambda_i}{\sum_{j=1}^d \lambda_j}\)
Scree Plot
A scree plot visualizes eigenvalues to help choose the number of components:
- X-axis: Component number (1, 2, 3, ...)
- Y-axis: Eigenvalue (variance explained) or proportion of variance
- Look for an "elbow" where eigenvalues drop sharply
Diagram: PCA Interactive Explorer
Run the PCA Explorer Fullscreen
PCA Step-by-Step Visualizer
Type: microsim
Bloom Taxonomy Level: Apply
Learning Objective: Understand PCA by visualizing each step from raw data to projected low-dimensional representation
Visual elements: - Panel 1: Original 2D/3D data with mean point - Panel 2: Centered data (translated to origin) - Panel 3: Principal component vectors overlaid on data - Panel 4: Projected 1D data along first PC - Scree plot showing eigenvalues - Variance explained percentage display
Interactive controls: - Data generator: cluster shape, spread, rotation - Number of points slider (20-200) - Dimension selector (2D or 3D) - Step-through buttons: "Center", "Find PCs", "Project" - Number of components to keep (k) - "Show Reconstruction" toggle
Default parameters: - Elongated 2D Gaussian cluster - 100 points - Canvas: responsive multi-panel
Behavior: - Animate centering transformation - Show eigenvectors with length proportional to eigenvalue - Demonstrate projection onto first PC - Show reconstruction error when reducing dimensions - Display scree plot updating with data changes
Implementation: p5.js with eigenvalue computation
PCA via SVD
In practice, PCA is computed from the SVD of the centered data matrix, which is more numerically stable. Writing \(S\) for the diagonal matrix of singular values (to avoid a clash with the covariance matrix \(\Sigma\)):
\(\tilde{X} = USV^T\)
The relationship to the eigendecomposition:
- The right singular vectors \(V\) are the principal components
- The singular values relate to the eigenvalues by \(\lambda_i = \frac{s_i^2}{n-1}\)
- Working directly with \(\tilde{X}\) avoids forming \(\tilde{X}^T\tilde{X}\) explicitly, which would square the condition number
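A quick numerical check of that relationship, as a sketch: `np.linalg.svd` with `full_matrices=False` returns singular values in descending order, and the data here is random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
X_centered = X - X.mean(axis=0)

# Eigendecomposition route: eigenvalues of the covariance matrix, sorted descending
eigvals = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]

# SVD route: eigenvalues are s_i^2 / (n - 1)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
eigvals_from_svd = s**2 / (len(X) - 1)

print(np.allclose(eigvals, eigvals_from_svd))   # True
# Rows of Vt (columns of V) are the principal components, up to sign
```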
Diagram: Scree Plot Interactive
Run the Scree Plot Visualizer Fullscreen
Scree Plot and Component Selection
Type: microsim
Bloom Taxonomy Level: Evaluate
Learning Objective: Use scree plots and cumulative variance to select the optimal number of principal components
Visual elements: - Left panel: Scree plot (bar chart of eigenvalues) - Right panel: Cumulative variance explained (line plot) - Threshold line for desired variance (e.g., 95%) - Elbow point detection and highlight - Reconstruction comparison at different k values
Interactive controls: - Dataset selector (synthetic, iris, digits subset) - Draggable threshold line for variance target - Number of components slider - "Show Reconstructed Data" toggle - "Compare Original vs Reconstructed" toggle
Default parameters: - Synthetic dataset with clear elbow at k=3 - 95% variance threshold line - Canvas: responsive dual-panel
Behavior: - Highlight suggested k based on elbow detection - Show which k achieves target variance - Display reconstruction error as k changes - For image data: show visual reconstruction quality - Kaiser criterion line (eigenvalue = 1 for standardized data)
Implementation: p5.js with statistical visualization
Linear Regression
Linear regression fits a linear model to predict a target variable from features.
The Model
Given features \(\mathbf{x} \in \mathbb{R}^d\) and target \(y \in \mathbb{R}\):
\(y = \mathbf{w}^T\mathbf{x} + b = w_1x_1 + w_2x_2 + \cdots + w_dx_d + b\)
where:
- \(\mathbf{w} \in \mathbb{R}^d\) is the weight vector
- \(b \in \mathbb{R}\) is the bias (intercept)
Design Matrix
The design matrix augments features with a column of ones to absorb the bias:
\(X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1d} \\ 1 & x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nd} \end{bmatrix}\)
Now the model becomes:
\(\mathbf{y} = X\boldsymbol{\theta}\)
where \(\boldsymbol{\theta} = [b, w_1, \ldots, w_d]^T\) combines bias and weights.
Ordinary Least Squares
The least squares solution minimizes:
\(J(\boldsymbol{\theta}) = \|X\boldsymbol{\theta} - \mathbf{y}\|^2 = \sum_{i=1}^n (\mathbf{x}_i^T\boldsymbol{\theta} - y_i)^2\)
The closed-form solution (normal equations):
\(\hat{\boldsymbol{\theta}} = (X^TX)^{-1}X^T\mathbf{y}\)
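A minimal sketch of the normal equations on synthetic data; in practice `np.linalg.lstsq` (or another QR/SVD-based solver) is preferred over explicitly inverting \(X^TX\). The true coefficients and noise level below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x = rng.uniform(0, 10, size=(n, 1))
y = 2.5 * x[:, 0] + 1.0 + rng.normal(scale=1.0, size=n)   # true w = 2.5, b = 1.0

X = np.column_stack([np.ones(n), x])                      # design matrix with bias column

theta_normal = np.linalg.solve(X.T @ X, X.T @ y)          # normal equations
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)       # numerically preferred solver

print(theta_normal)                        # approximately [1.0, 2.5]
print(np.allclose(theta_normal, theta_lstsq))
```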
Diagram: Linear Regression Visualizer
Run the Linear Regression Visualizer Fullscreen
Linear Regression Interactive Visualizer
Type: microsim
Bloom Taxonomy Level: Apply
Learning Objective: Understand linear regression as finding the best-fit line/plane by minimizing squared errors
Visual elements: - Main panel: Scatter plot of data points - Fitted line/plane (2D or 3D) - Residual lines from points to fitted line - Loss function surface (for 2D: 3D surface of loss vs w, b) - Current parameter values display
Interactive controls: - Drag data points to modify dataset - Manual sliders for w and b to see effect on fit and loss - "Fit OLS" button to compute optimal parameters - Toggle residual visualization - Switch between 1D (line fit) and 2D (plane fit) examples
Default parameters: - 2D scatter with linear relationship plus noise - 20 data points - Canvas: responsive
Behavior: - Real-time residual and loss computation - Show that OLS solution is at minimum of loss surface - Display R² score for goodness of fit - Highlight vertical (y) residuals vs perpendicular distance - Show normal equations computation
Implementation: p5.js with regression computation
Regularization
Regularization adds a penalty term to prevent overfitting by constraining model complexity.
Why Regularize?
Without regularization, models can:
- Overfit to noise in training data
- Have large, unstable weights
- Perform poorly on new data
- Fail when features are correlated (multicollinearity)
Ridge Regression (L2)
Ridge regression adds an L2 penalty on weights:
\(J(\boldsymbol{\theta}) = \|X\boldsymbol{\theta} - \mathbf{y}\|^2 + \alpha\|\boldsymbol{\theta}\|^2\)
where:
- \(\alpha \geq 0\) is the regularization strength
- \(\|\boldsymbol{\theta}\|^2 = \sum_j \theta_j^2\) (typically excluding bias)
The closed-form solution:
\(\hat{\boldsymbol{\theta}} = (X^TX + \alpha I)^{-1}X^T\mathbf{y}\)
Key properties:
- Always invertible (even if \(X^TX\) is singular)
- Shrinks weights toward zero
- Keeps all features (no feature selection)
- Equivalent to adding "fake" data points
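A sketch of the ridge closed form. Following the convention noted above, the bias entry of the penalty matrix is zeroed so the intercept is not shrunk; the nearly collinear synthetic data and the value \(\alpha = 10\) are illustrative.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution; assumes column 0 of X is the bias column."""
    penalty = alpha * np.eye(X.shape[1])
    penalty[0, 0] = 0.0                      # do not shrink the intercept
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

rng = np.random.default_rng(5)
n = 60
X_raw = rng.normal(size=(n, 3))
X_raw[:, 2] = X_raw[:, 1] + 0.01 * rng.normal(size=n)     # nearly collinear features
y = X_raw @ np.array([1.0, 2.0, 0.0]) + rng.normal(scale=0.1, size=n)

X = np.column_stack([np.ones(n), X_raw])
print(ridge_fit(X, y, alpha=0.0))      # OLS: unstable split between the collinear features
print(ridge_fit(X, y, alpha=10.0))     # ridge: weights shrink and stabilize
```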
Lasso Regression (L1)
Lasso regression uses an L1 penalty:
\(J(\boldsymbol{\theta}) = \|X\boldsymbol{\theta} - \mathbf{y}\|^2 + \alpha\|\boldsymbol{\theta}\|_1\)
where:
- \(\|\boldsymbol{\theta}\|_1 = \sum_j |\theta_j|\) is the L1 norm
Key properties:
- Produces sparse solutions (some weights exactly zero)
- Performs automatic feature selection
- No closed-form solution (requires iterative optimization)
- Useful when only few features are truly relevant
| Property | Ridge (L2) | Lasso (L1) |
|---|---|---|
| Penalty | \(\sum \theta_j^2\) | \(\sum \lvert\theta_j\rvert\) |
| Sparsity | No | Yes (feature selection) |
| Closed-form | Yes | No |
| Multicollinearity | Handles well | Picks one of correlated features |
| Geometry | Circular constraint | Diamond constraint |
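Because there is no closed form, lasso must be fit iteratively. The sketch below uses proximal gradient descent (ISTA) with the soft-thresholding operator, one of several standard approaches (scikit-learn, for example, uses coordinate descent). The step size rule, \(\alpha = 20\), and iteration count are illustrative choices.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the L1 norm: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, alpha, n_iters=500):
    """Proximal gradient descent for ||X theta - y||^2 + alpha * ||theta||_1."""
    eta = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)    # step size from the Lipschitz constant
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = 2 * X.T @ (X @ theta - y)           # gradient of the squared-error term
        theta = soft_threshold(theta - eta * grad, eta * alpha)
    return theta

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 10))
true_theta = np.array([3.0, -2.0] + [0.0] * 8)     # only 2 of 10 features matter
y = X @ true_theta + rng.normal(scale=0.5, size=100)

theta = lasso_ista(X, y, alpha=20.0)
print(np.round(theta, 2))          # most entries are exactly zero (sparsity)
```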
Diagram: Regularization Geometry
Run the Regularization Geometry Visualizer Fullscreen
Regularization Geometry Visualizer
Type: microsim
Bloom Taxonomy Level: Analyze
Learning Objective: Understand how L1 and L2 regularization constrain weights geometrically and why L1 produces sparsity
Visual elements: - 2D parameter space (θ₁ vs θ₂) - Contour lines of unregularized loss function - L2 constraint region (circle) - L1 constraint region (diamond) - OLS solution point - Regularized solution point - Regularization path as α varies
Interactive controls: - Slider for regularization strength α - Toggle between L1 and L2 - Drag ellipse center (changing OLS solution location) - "Show Regularization Path" toggle - Animation of solution as α increases
Default parameters: - OLS solution at (3, 2) - Moderate α - Canvas: responsive
Behavior: - Show how constraint region intersects loss contours - Demonstrate L1 hitting corners (sparse solution) - Animate solution moving toward origin as α increases - Show weight values and their evolution - Display sparsity count for L1
Implementation: p5.js with geometric visualization
Gradient-Based Optimization
When closed-form solutions don't exist or are too expensive, we use iterative gradient descent.
Gradient Vector
The gradient vector of a scalar function \(f : \mathbb{R}^d \to \mathbb{R}\) collects all partial derivatives:
\(\nabla f(\boldsymbol{\theta}) = \begin{bmatrix} \frac{\partial f}{\partial \theta_1} \\ \frac{\partial f}{\partial \theta_2} \\ \vdots \\ \frac{\partial f}{\partial \theta_d} \end{bmatrix}\)
Key properties:
- Points in the direction of steepest increase
- Magnitude indicates rate of change
- At a minimum, \(\nabla f = \mathbf{0}\)
For linear regression loss \(J(\boldsymbol{\theta}) = \|X\boldsymbol{\theta} - \mathbf{y}\|^2\):
\(\nabla J(\boldsymbol{\theta}) = 2X^T(X\boldsymbol{\theta} - \mathbf{y})\)
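A quick finite-difference check of that gradient formula; this is a sketch on random data, and the perturbation size `1e-6` is an arbitrary small value.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
theta = rng.normal(size=3)

def loss(theta):
    return np.sum((X @ theta - y) ** 2)

grad_analytic = 2 * X.T @ (X @ theta - y)

eps = 1e-6
grad_numeric = np.array([
    (loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
    for e in np.eye(3)                              # perturb one coordinate at a time
])
print(np.allclose(grad_analytic, grad_numeric, rtol=1e-4))   # True
```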
Gradient Descent Algorithm
Gradient descent iteratively moves in the negative gradient direction:
\(\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \eta \nabla J(\boldsymbol{\theta}^{(t)})\)
where:
- \(\eta > 0\) is the learning rate (step size)
- \(t\) is the iteration number
- We move against the gradient to decrease the function
Batch Gradient Descent
Batch gradient descent uses all training samples to compute the gradient at each step.
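The listing below is a minimal NumPy sketch, assuming the squared-error loss and the gradient \(2X^T(X\boldsymbol{\theta} - \mathbf{y})\) derived above; the stopping tolerance, iteration cap, and learning rate are illustrative choices.

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.001, n_iters=1000, tol=1e-8):
    """Minimize ||X theta - y||^2 using the full dataset at every step."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = 2 * X.T @ (X @ theta - y)     # gradient over all n samples
        step = eta * grad
        theta = theta - step
        if np.linalg.norm(step) < tol:       # stop when updates become negligible
            break
    return theta

rng = np.random.default_rng(8)
X = np.column_stack([np.ones(100), rng.uniform(0, 1, size=100)])
y = X @ np.array([1.0, 2.5]) + rng.normal(scale=0.1, size=100)

theta = batch_gradient_descent(X, y, eta=0.005)
print(theta)                                 # approaches the OLS solution, near [1.0, 2.5]
```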
Characteristics:
- Deterministic updates (same path from same initialization)
- Smooth convergence
- Expensive per iteration for large datasets
- May be slow for large \(n\)
Learning Rate
The learning rate \(\eta\) controls step size and critically affects convergence:
| Learning Rate | Behavior |
|---|---|
| Too small | Very slow convergence; wastes many iterations |
| Just right | Smooth, efficient convergence |
| Too large | Oscillation, overshooting |
| Way too large | Divergence (loss increases) |
Choosing the learning rate:
- Start with \(\eta = 0.01\) or \(0.001\)
- Use learning rate schedules (decay over time)
- Adaptive methods (Adam, RMSprop) adjust per-parameter
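As a small sketch of the second point, one common schedule is exponential decay; the decay constant below is an arbitrary illustrative value.

```python
import math

def exponential_decay(eta0, t, decay_rate=0.01):
    """Learning rate at iteration t: eta0 * exp(-decay_rate * t)."""
    return eta0 * math.exp(-decay_rate * t)

# Illustrative use inside a training loop: theta -= exponential_decay(0.1, t) * grad
for t in [0, 50, 100, 200]:
    print(t, round(exponential_decay(0.1, t), 4))   # 0.1, 0.0607, 0.0368, 0.0135
```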
Diagram: Gradient Descent Visualizer
Run the Gradient Descent Visualizer Fullscreen
Gradient Descent Interactive Visualizer
Type: microsim
Bloom Taxonomy Level: Apply
Learning Objective: Understand how gradient descent navigates the loss surface and how learning rate affects convergence
Visual elements: - Main panel: 3D surface plot of loss function J(θ₁, θ₂) - Contour plot view (top-down) - Current position marker - Gradient arrow at current position - Path traced by optimization - Loss vs iteration plot
Interactive controls: - Learning rate slider (0.001 to 1.0, log scale) - "Step" button for single iteration - "Run" button for continuous optimization - "Reset" button to reinitialize - Starting point selector (click on surface) - Loss function selector (quadratic, Rosenbrock, etc.)
Default parameters: - Simple quadratic loss with single minimum - Learning rate = 0.1 - Starting point away from minimum - Canvas: responsive multi-view
Behavior: - Show gradient vector at each step - Trace optimization path on contour plot - Display convergence (or divergence) in loss plot - Demonstrate oscillation with high learning rate - Show slow progress with low learning rate - Count iterations to convergence
Implementation: p5.js with 3D surface rendering (WEBGL)
Variants of Gradient Descent
Beyond batch gradient descent, several variants improve efficiency:
Stochastic Gradient Descent (SGD):
- Uses single sample per update: \(\nabla J_i(\boldsymbol{\theta})\)
- Fast iterations but noisy updates
- Can escape local minima due to noise
Mini-batch Gradient Descent:
- Uses subset of samples (batch size \(b\)): \(\frac{1}{b}\sum_{i \in B}\nabla J_i(\boldsymbol{\theta})\)
- Balances noise and efficiency
- Standard in deep learning (batch size 32-256)
Momentum:
- Accumulates velocity: \(\mathbf{v}^{(t+1)} = \beta\mathbf{v}^{(t)} + \nabla J(\boldsymbol{\theta}^{(t)})\)
- Update: \(\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \eta\mathbf{v}^{(t+1)}\)
- Accelerates through flat regions, dampens oscillations
| Method | Per-Iteration Cost | Convergence | Noise |
|---|---|---|---|
| Batch GD | \(O(nd)\) | Smooth | None |
| SGD | \(O(d)\) | Noisy | High |
| Mini-batch | \(O(bd)\) | Moderate | Moderate |
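A combined sketch of mini-batch gradient descent with momentum for the same squared-error loss; the batch size, momentum coefficient \(\beta\), epoch count, and synthetic data are all illustrative choices.

```python
import numpy as np

def minibatch_sgd_momentum(X, y, eta=0.001, beta=0.9, batch_size=32, n_epochs=50, seed=0):
    """Mini-batch gradient descent with momentum on ||X theta - y||^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    velocity = np.zeros(d)
    for _ in range(n_epochs):
        order = rng.permutation(n)                           # reshuffle samples each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = 2 * Xb.T @ (Xb @ theta - yb) / len(idx)   # average mini-batch gradient
            velocity = beta * velocity + grad                # accumulate velocity
            theta = theta - eta * velocity
    return theta

rng = np.random.default_rng(9)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.1, size=500)
print(minibatch_sgd_momentum(X, y))     # approaches [1.0, 2.0, -3.0]
```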
Diagram: Learning Rate Effect Visualizer
Run the Learning Rate Effect Visualizer Fullscreen
Learning Rate Effect on Convergence
Type: microsim
Bloom Taxonomy Level: Evaluate
Learning Objective: Understand how learning rate choice affects optimization behavior through side-by-side comparison
Visual elements: - Three parallel contour plots with different learning rates - Path traces showing optimization trajectories - Loss curves for each learning rate - Status indicators: "Converging", "Oscillating", "Diverging" - Step count to convergence
Interactive controls: - Individual learning rate sliders for each panel - Preset buttons: "Too Small", "Just Right", "Too Large" - Shared "Run All" button - "Reset All" button - Speed slider for animation
Default parameters: - Left: η = 0.01 (too small) - Center: η = 0.1 (good) - Right: η = 0.5 (too large) - Same starting point for all
Behavior: - Simultaneous animation of all three optimizations - Real-time loss comparison plot - Show oscillation in too-large case - Show slow progress in too-small case - Identify optimal learning rate region - Display final loss values
Implementation: p5.js with synchronized animations
Putting It All Together
Here is a complete machine learning pipeline that ties these concepts together.
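The listing below is a condensed NumPy sketch of such a pipeline, combining standardization, PCA, ridge regression, and evaluation. The synthetic dataset, the 95% variance target, and \(\alpha = 1\) are illustrative; in practice these would be chosen by inspection of the scree plot and by cross-validation.

```python
import numpy as np

rng = np.random.default_rng(42)

# 1. Synthetic raw data: 200 samples, 10 correlated features driven by 3 latent factors
n, d = 200, 10
latent = rng.normal(size=(n, 3))
mixing = rng.normal(size=(3, d))
X_raw = latent @ mixing + 0.1 * rng.normal(size=(n, d))
y = latent @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=n)

# 2. Train/test split
idx = rng.permutation(n)
train, test = idx[:150], idx[150:]

# 3. Standardize using training statistics only
mu = X_raw[train].mean(axis=0)
sigma = X_raw[train].std(axis=0, ddof=1)
Z_train = (X_raw[train] - mu) / sigma
Z_test = (X_raw[test] - mu) / sigma

# 4. PCA via SVD: keep enough components to explain ~95% of the variance
U, s, Vt = np.linalg.svd(Z_train, full_matrices=False)   # Z_train is already centered
explained = (s**2) / np.sum(s**2)
k = int(np.searchsorted(np.cumsum(explained), 0.95)) + 1
V_k = Vt[:k].T
P_train = Z_train @ V_k
P_test = Z_test @ V_k

# 5. Ridge regression on the reduced features (closed form, unpenalized intercept)
X_train = np.column_stack([np.ones(len(train)), P_train])
X_test = np.column_stack([np.ones(len(test)), P_test])
alpha = 1.0
penalty = alpha * np.eye(X_train.shape[1])
penalty[0, 0] = 0.0
theta = np.linalg.solve(X_train.T @ X_train + penalty, X_train.T @ y[train])

# 6. Evaluate with MSE and R^2 on the held-out test set
y_pred = X_test @ theta
mse = np.mean((y_pred - y[test]) ** 2)
r2 = 1 - np.sum((y[test] - y_pred) ** 2) / np.sum((y[test] - y[test].mean()) ** 2)
print(f"kept k={k} components, test MSE={mse:.4f}, R^2={r2:.3f}")
```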
Diagram: ML Pipeline Workflow
Run the ML Pipeline Workflow Fullscreen
Machine Learning Pipeline Workflow
Type: workflow
Bloom Taxonomy Level: Create
Learning Objective: Understand the complete ML pipeline from raw data to trained model
Visual style: Flowchart with processing stages
Steps:
1. Start: "Raw Data" Hover text: "Original features, possibly different scales and units"
2. Process: "Standardization" Hover text: "Transform to zero mean, unit variance"
    - Input: Raw features X
    - Output: Standardized Z
3. Process: "PCA (optional)" Hover text: "Reduce dimensionality while preserving variance"
    - Input: Standardized data Z
    - Output: Reduced data (k dimensions)
    - Decision: Scree plot analysis
4. Process: "Train/Test Split" Hover text: "Hold out data for evaluation"
5. Process: "Model Selection" Hover text: "Choose algorithm and hyperparameters"
    - Branch A: OLS (no regularization)
    - Branch B: Ridge (L2)
    - Branch C: Lasso (L1)
6. Process: "Optimization" Hover text: "Find optimal parameters"
    - Closed-form (Ridge) or gradient descent (Lasso, neural networks)
7. Process: "Evaluation" Hover text: "Assess on test set"
    - Metrics: MSE, R², etc.
8. End: "Trained Model"
Color coding: - Blue: Data processing - Green: Modeling - Orange: Optimization - Purple: Evaluation
Interactive: - Click nodes to see code examples - Hover for detailed explanations
Implementation: D3.js or Mermaid.js
Summary
This chapter connected linear algebra to machine learning:
Data Representation:
- Feature vectors represent samples as \(d\)-dimensional vectors
- Data matrices organize \(n\) samples as rows, \(d\) features as columns
- Consistent conventions are crucial for correct matrix operations
Statistical Foundations:
- Standardization ensures comparable scales and improves algorithm performance
- Covariance matrices capture feature relationships: \(\Sigma = \frac{1}{n-1}\tilde{X}^T\tilde{X}\)
- Correlation matrices are standardized covariances with values in \([-1, 1]\)
Dimensionality Reduction:
- PCA finds directions of maximum variance via eigendecomposition
- Principal components are eigenvectors of the covariance matrix
- Scree plots help choose the number of components to retain
- Use SVD for numerical stability in practice
Regression:
- Linear regression minimizes squared error: \(J = \|X\boldsymbol{\theta} - \mathbf{y}\|^2\)
- Design matrices incorporate the bias term
- Ridge (L2) shrinks weights, handles multicollinearity
- Lasso (L1) produces sparse solutions for feature selection
Optimization:
- Gradient vectors point in the direction of steepest increase
- Gradient descent iteratively minimizes: \(\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \eta\nabla J\)
- Learning rate is critical: too small = slow, too large = diverge
- Batch, mini-batch, and stochastic variants trade off noise vs. efficiency
Self-Check: Why does PCA use the covariance matrix of centered data rather than the original data?
Centering (subtracting the mean) is essential because PCA seeks directions of maximum variance from the data's center of mass. Without centering, the first principal component would largely capture the offset from the origin rather than the true variation structure. The covariance matrix of centered data measures how features vary around their means, which is exactly what PCA needs to find the principal directions of spread.