Machine Learning Pipeline Workflow
Run the ML Pipeline Workflow Fullscreen
Edit the MicroSim with the p5.js editor
About This MicroSim
This interactive flowchart visualizes the complete machine learning pipeline from raw data to a trained model. Each stage is color-coded by category and includes both hover explanations and Python code examples.
Pipeline Stages:
- Raw Data - Original features, possibly with different scales and units
- Standardization - Transform to zero mean, unit variance
- PCA (optional) - Reduce dimensionality while preserving variance
- Train/Test Split - Hold out data for evaluation
- Model Selection - Choose algorithm (OLS, Ridge, or Lasso)
- Optimization - Find optimal parameters
- Evaluation - Assess on test set (MSE, R-squared)
- Trained Model - Ready for deployment
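The full sequence can be sketched end to end with scikit-learn. The example below is a minimal illustration, not the MicroSim's own code: the random dataset, the 80/20 split, the choice of 5 principal components, and Ridge with alpha = 1.0 are all placeholder assumptions.

```python
# Minimal end-to-end sketch of the pipeline shown in the flowchart.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                      # raw data: placeholder random features
y = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=200)

# Train/test split: hold out 20% of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Standardization -> PCA (optional) -> model selection (Ridge chosen here as an example)
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),                   # illustrative; tune for your data
    ("model", Ridge(alpha=1.0)),                    # swap in LinearRegression() or Lasso()
])

# Optimization: fitting the pipeline estimates the model parameters
pipe.fit(X_train, y_train)

# Evaluation on the held-out test set
y_pred = pipe.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))
```

Swapping the `Ridge` step for `LinearRegression()` or `Lasso()` reproduces the model-selection branch of the flowchart.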
How to Use
- Hover over nodes to see detailed explanations of each stage
- Click on nodes to view Python code examples for that stage
- Click "Clear Selection" or click elsewhere to dismiss the code panel
- Follow the arrows to understand the data flow through the pipeline
Color Coding
| Color | Category | Stages |
|---|---|---|
| Blue | Data Processing | Raw Data, Standardization, PCA, Train/Test Split |
| Green | Modeling | Model Selection, OLS, Ridge, Lasso |
| Orange | Optimization | Optimization |
| Purple | Evaluation | Evaluation, Trained Model |
The ML Pipeline in Linear Algebra Terms
The machine learning pipeline heavily relies on linear algebra operations:
Standardization
Standardization transforms the data matrix \(X\) so that each column has mean 0 and variance 1: each entry \(x_{ij}\) is mapped to \((x_{ij} - \mu_j)/\sigma_j\), where \(\mu_j\) and \(\sigma_j\) are the mean and standard deviation of column \(j\).
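As a minimal NumPy sketch of this column-wise transform (the small matrix `X` below is a placeholder):

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])   # placeholder raw data with very different scales

mu = X.mean(axis=0)             # column means
sigma = X.std(axis=0)           # column standard deviations
Z = (X - mu) / sigma            # each column now has mean 0 and variance 1

# Equivalent using scikit-learn:
# from sklearn.preprocessing import StandardScaler
# Z = StandardScaler().fit_transform(X)
```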
PCA (Principal Component Analysis)
PCA uses the singular value decomposition (SVD) of the standardized data to find the directions of maximum variance: \(X = U\Sigma V^T\)
The principal components are the columns of \(V\).
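A short sketch of PCA via NumPy's SVD, assuming `Z` is the standardized matrix from the previous step and `k` is an illustrative number of components to keep:

```python
import numpy as np

def pca_via_svd(Z, k):
    """Project standardized data Z onto its first k principal components."""
    # Economy-size SVD: Z = U @ diag(S) @ Vt
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    components = Vt[:k]                          # first k rows of V^T = first k columns of V
    explained_variance = S[:k] ** 2 / (Z.shape[0] - 1)
    scores = Z @ components.T                    # coordinates of each sample in the new basis
    return scores, components, explained_variance

# Example: scores, V_k, var = pca_via_svd(Z, k=2)
```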
Linear Regression Models
OLS (Ordinary Least Squares): \(\hat{\beta} = (X^TX)^{-1}X^Ty\)
Ridge Regression (L2 regularization): \(\hat{\beta} = (X^TX + \lambda I)^{-1}X^Ty\)
Lasso Regression (L1 regularization): Minimizes \(\|y - X\beta\|_2^2 + \lambda\|\beta\|_1\)
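The two closed-form solutions translate directly into NumPy; Lasso has no closed form, so an iterative solver (scikit-learn's `Lasso`) is sketched instead. The inputs `X`, `y`, and `lam` are placeholders:

```python
import numpy as np
from sklearn.linear_model import Lasso

def ols(X, y):
    # Closed-form OLS: beta = (X^T X)^{-1} X^T y
    # (np.linalg.solve is preferred over forming the inverse explicitly)
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge(X, y, lam):
    # Closed-form Ridge: beta = (X^T X + lambda I)^{-1} X^T y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def lasso(X, y, lam):
    # Lasso minimizes ||y - X beta||_2^2 + lambda ||beta||_1; no closed form exists,
    # so coordinate descent is used via scikit-learn. Note: scikit-learn scales the
    # squared-error term by 1/(2 * n_samples), so alpha is not identical to lambda above.
    return Lasso(alpha=lam).fit(X, y).coef_
```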
Lesson Plan
Learning Objectives
Students will be able to:
- Describe the complete workflow of a machine learning pipeline
- Explain the purpose of each preprocessing and modeling stage
- Compare and contrast OLS, Ridge, and Lasso regression approaches
- Identify where linear algebra operations occur in the pipeline
Suggested Activities
- Trace the pipeline: Follow the data flow and explain what happens at each stage
- Code walkthrough: Use the code examples to implement a complete pipeline in Python
- Model comparison: Run all three regression types on the same dataset and compare results
Assessment Questions
- Why is standardization important before applying PCA?
- What is the key difference between Ridge and Lasso regularization?
- Why do we need a separate test set for evaluation?
- At which stage does the closed-form solution \((X^TX)^{-1}X^Ty\) get computed?