Machine Learning Pipeline Workflow

About This MicroSim

This interactive flowchart visualizes the complete machine learning pipeline from raw data to a trained model. Each stage is color-coded by category and includes both hover explanations and Python code examples.

Pipeline Stages (a minimal end-to-end Python sketch follows this list):

  1. Raw Data - Original features, possibly on different scales and in different units
  2. Standardization - Transform to zero mean, unit variance
  3. PCA (optional) - Reduce dimensionality while preserving variance
  4. Train/Test Split - Hold out data for evaluation
  5. Model Selection - Choose algorithm (OLS, Ridge, or Lasso)
  6. Optimization - Find optimal parameters
  7. Evaluation - Assess on test set (MSE, R-squared)
  8. Trained Model - Ready for deployment
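
Below is a minimal end-to-end sketch of these stages, assuming scikit-learn and a synthetic dataset; the choice of Ridge as the final model and all hyperparameter values are illustrative, not prescribed by the MicroSim.

```python
# Minimal end-to-end sketch of the stages above (assumes scikit-learn is installed).
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score

# 1. Raw data: synthetic features, an illustrative stand-in for a real dataset
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# 4. Train/test split: hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# 2, 3, 5, 6. Standardization, optional PCA, model selection, and fitting.
# Wrapping the steps in a Pipeline ensures the scaler and PCA are fit on
# the training data only.
model = Pipeline([
    ("scale", StandardScaler()),    # zero mean, unit variance per feature
    ("pca", PCA(n_components=5)),   # optional dimensionality reduction
    ("reg", Ridge(alpha=1.0)),      # L2-regularized linear regression
])
model.fit(X_train, y_train)

# 7. Evaluation on the held-out test set
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))

# 8. Trained model: `model` is now ready to make predictions on new data
```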

How to Use

  1. Hover over nodes to see detailed explanations of each stage
  2. Click on nodes to view Python code examples for that stage
  3. Click "Clear Selection" or click elsewhere to dismiss the code panel
  4. Follow the arrows to understand the data flow through the pipeline

Color Coding

| Color  | Category        | Stages                                           |
|--------|-----------------|--------------------------------------------------|
| Blue   | Data Processing | Raw Data, Standardization, PCA, Train/Test Split |
| Green  | Modeling        | Model Selection, OLS, Ridge, Lasso               |
| Orange | Optimization    | Optimization                                     |
| Purple | Evaluation      | Evaluation, Trained Model                        |

The ML Pipeline in Linear Algebra Terms

The machine learning pipeline relies heavily on linear algebra operations:

Standardization

\[z = \frac{x - \mu}{\sigma}\]

This transforms the data matrix \(X\) so each column has mean 0 and variance 1.
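
A minimal NumPy sketch of this column-wise transform (the small data matrix is illustrative):

```python
import numpy as np

# Toy data matrix: two features on very different scales (values are illustrative)
X = np.array([[170.0,  65.0],
              [180.0,  80.0],
              [165.0,  72.0],
              [175.0,  90.0]])

mu = X.mean(axis=0)      # column means
sigma = X.std(axis=0)    # column standard deviations
Z = (X - mu) / sigma     # z = (x - mu) / sigma, applied column-wise

print(Z.mean(axis=0))    # approximately 0 for each column
print(Z.std(axis=0))     # 1 for each column
```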

PCA (Principal Component Analysis)

Uses the singular value decomposition (SVD) to find the directions of maximum variance: \(X = U\Sigma V^T\)

The principal components are the columns of \(V\).
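
A minimal NumPy sketch of PCA via the SVD on a centered toy data matrix (the matrix size and the number of retained components are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))       # toy data matrix, one sample per row

Xc = X - X.mean(axis=0)             # center (or fully standardize) the columns first
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)   # Xc = U @ diag(S) @ Vt

V = Vt.T                            # principal components are the columns of V
k = 2                               # number of components to keep (illustrative)
scores = Xc @ V[:, :k]              # project the data onto the first k components

explained = S**2 / (Xc.shape[0] - 1)    # variance along each component
print(explained / explained.sum())      # fraction of variance explained
```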

Linear Regression Models

OLS (Ordinary Least Squares): \(\hat{\beta} = (X^TX)^{-1}X^Ty\)

Ridge Regression (L2 regularization): \(\hat{\beta} = (X^TX + \lambda I)^{-1}X^Ty\)

Lasso Regression (L1 regularization): Minimizes \(\|y - X\beta\|_2^2 + \lambda\|\beta\|_1\)
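
As a sketch, the OLS and Ridge estimators can be computed directly from their closed-form expressions with NumPy; Lasso has no closed form, so the example falls back on scikit-learn's iterative solver. The toy data and regularization values are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy regression problem (values are illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
beta_true = np.array([2.0, 0.0, -1.5])
y = X @ beta_true + rng.normal(scale=0.1, size=100)

# OLS: beta_hat = (X^T X)^{-1} X^T y
# (np.linalg.solve is preferred over forming the inverse explicitly)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: beta_hat = (X^T X + lambda * I)^{-1} X^T y
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Lasso has no closed form; scikit-learn minimizes
# (1 / (2 * n_samples)) * ||y - X beta||_2^2 + alpha * ||beta||_1 iteratively
beta_lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y).coef_

print("OLS:  ", beta_ols)
print("Ridge:", beta_ridge)
print("Lasso:", beta_lasso)
```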

Embedding

```html
<iframe src="https://dmccreary.github.io/linear-algebra/sims/ml-pipeline/main.html" height="552px" scrolling="no"></iframe>
```

Lesson Plan

Learning Objectives

Students will be able to:

  1. Describe the complete workflow of a machine learning pipeline
  2. Explain the purpose of each preprocessing and modeling stage
  3. Compare and contrast OLS, Ridge, and Lasso regression approaches
  4. Identify where linear algebra operations occur in the pipeline

Suggested Activities

  1. Trace the pipeline: Follow the data flow and explain what happens at each stage
  2. Code walkthrough: Use the code examples to implement a complete pipeline in Python
  3. Model comparison: Run all three regression types on the same dataset and compare results

Assessment Questions

  1. Why is standardization important before applying PCA?
  2. What is the key difference between Ridge and Lasso regularization?
  3. Why do we need a separate test set for evaluation?
  4. At which stage does the closed-form solution \((X^TX)^{-1}X^Ty\) get computed?

References