PCA Step-by-Step Visualizer

Run the PCA Explorer Fullscreen

Edit the MicroSim in the p5.js Editor

About This MicroSim

This visualization demonstrates Principal Component Analysis (PCA) step by step, showing how data is transformed from its original representation to a lower-dimensional projection.

PCA finds the directions of maximum variance in the data and projects the data onto those directions:

\[\text{PCA: } X \rightarrow X_{centered} \rightarrow \text{eigenvectors of } \frac{1}{n}X_{centered}^T X_{centered} \rightarrow \text{projection}\]
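For readers who want to see the whole pipeline in code, here is a minimal NumPy sketch of the same stages. The function and variable names are illustrative only and are not taken from the MicroSim's p5.js source.

```python
import numpy as np

def pca(X, k=1):
    """Minimal PCA sketch: center the data, eigendecompose the covariance, project.

    X is an (n, d) array whose rows are data points; k is the number of components to keep.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)                     # steps 0-1: compute the mean and center
    Xc = X - mean
    C = Xc.T @ Xc / len(X)                    # step 2: covariance matrix (1/n) Xc^T Xc
    eigvals, eigvecs = np.linalg.eigh(C)      # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]         # sort components by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    Vk = eigvecs[:, :k]                       # first k principal components
    X_reduced = Xc @ Vk                       # step 3: project onto the kept components
    return X_reduced, Vk, eigvals, mean
```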

Key Features

  • Four-Panel View: See each transformation stage simultaneously
  • Animated Transitions: Watch the data transform step by step
  • Interactive Data Generation: Control cluster shape, spread, rotation, and elongation
  • Scree Plot: Visualize eigenvalue magnitudes
  • Variance Explained: See how much information each component captures
  • Reconstruction Toggle: Visualize the error from dimensionality reduction

The PCA Steps

| Step | Operation | Geometric Effect |
|------|-----------|------------------|
| 0 | Original Data | Raw data with the mean point shown |
| 1 | Center | Translate the data so the mean is at the origin |
| 2 | Find PCs | Compute the eigenvectors of the covariance matrix |
| 3 | Project | Project onto the first k principal components |
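To make the table concrete, the same steps can be traced on a tiny, hand-checkable dataset. The sketch below (plain NumPy, with three collinear points chosen purely for illustration) prints the intermediate values at each step.

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # step 0: three collinear points

mean = X.mean(axis=0)       # mean is [3, 4]
Xc = X - mean               # step 1: centered data [[-2,-2], [0,0], [2,2]]
C = Xc.T @ Xc / len(X)      # step 2: covariance [[8/3, 8/3], [8/3, 8/3]]
eigvals, eigvecs = np.linalg.eigh(C)
print(eigvals)              # approx [0.0, 5.33]; all of the variance lies along PC1
pc1 = eigvecs[:, -1]        # PC1 is [1, 1] / sqrt(2), up to sign
print(Xc @ pc1)             # step 3: approx [-2.83, 0.0, 2.83] (sign may flip)
```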

How to Use

  1. Click "1. Center" to see the data shifted to have zero mean
  2. Click "2. Find PCs" to compute and display principal component vectors
  3. Click "3. Project" to see data projected onto the first PC
  4. Adjust k slider to keep 1 or 2 components
  5. Toggle "Show Reconstruction" to see projection error
  6. Use data controls to change cluster shape and regenerate

Understanding the Visualization

Panel 1: Original Data

  • Raw data points colored by index
  • Red dot shows the mean (center of mass)
  • Data may be offset from origin

Panel 2: Centered Data

  • Data shifted so mean is at origin
  • Same shape, just translated
  • Centering is essential before computing the covariance matrix (see the check below)
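A quick way to see why (a small NumPy check, not part of the MicroSim code): without centering, \(\frac{1}{n}X^TX\) measures spread about the origin, so a cluster far from the origin gets a badly distorted covariance estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=[5.0, -3.0], scale=1.0, size=(500, 2))  # unit-variance cluster far from the origin

uncentered = X.T @ X / len(X)        # spread about the origin: dominated by the offset
Xc = X - X.mean(axis=0)
centered = Xc.T @ Xc / len(X)        # spread about the mean: the true covariance

print(np.round(uncentered, 1))       # roughly [[26, -15], [-15, 10]]
print(np.round(centered, 1))         # close to the 2x2 identity matrix
```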

Panel 3: Principal Components

  • Red arrow (PC1): Direction of maximum variance
  • Green arrow (PC2): Direction of second-most variance (orthogonal to PC1)
  • Arrow lengths are proportional to the square roots of the eigenvalues (see the sketch below)
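The arrow geometry can be sketched in a few lines of NumPy, assuming (as stated above) that each arrow starts at the mean and has length proportional to the standard deviation along its component; the names are illustrative.

```python
import numpy as np

def pc_arrows(X, scale=2.0):
    """Return (start, end) pairs for PC arrows drawn from the mean of X."""
    mean = X.mean(axis=0)
    Xc = X - mean
    C = Xc.T @ Xc / len(X)
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]                 # PC1 first, then PC2
    arrows = []
    for lam, v in zip(eigvals[order], eigvecs[:, order].T):
        tip = mean + scale * np.sqrt(lam) * v         # length ~ sqrt(eigenvalue) = std. dev.
        arrows.append((mean, tip))
    return arrows
```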

Panel 4: Projected Data

  • Points projected onto PC1 (when k=1)
  • Red line shows the projection subspace
  • Orange dashed lines show reconstruction error (when enabled)

Scree Plot

  • Bar chart showing eigenvalue magnitudes
  • Helps decide how many components to keep
  • Large drop between bars suggests natural dimensionality

Variance Explained

  • Percentage of total variance captured by each PC
  • Sum of kept components shows information retained
  • Higher retained variance means better reconstruction quality (a minimal computation is sketched below)
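Both the scree plot bars and these percentages come directly from the eigenvalues; a minimal NumPy computation (illustrative names and data) looks like this:

```python
import numpy as np

def explained_variance(X):
    """Eigenvalues in descending order, per-component ratios, and the cumulative sum."""
    Xc = X - X.mean(axis=0)
    eigvals = np.linalg.eigvalsh(Xc.T @ Xc / len(X))[::-1]   # scree plot bar heights
    ratios = eigvals / eigvals.sum()                         # variance explained per PC
    return eigvals, ratios, np.cumsum(ratios)

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2)) * [3.0, 0.5]                   # example cluster
eigvals, ratios, cumulative = explained_variance(X)
print(ratios)      # PC1 should explain most of the variance for this elongated cluster
print(cumulative)  # keeping all components always reaches 1.0
```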

Learning Objectives

After using this MicroSim, students will be able to:

  • Explain why centering data is the first step in PCA
  • Identify principal components as eigenvectors of the covariance matrix
  • Understand that eigenvalues indicate variance along each PC direction
  • Demonstrate how projection reduces dimensionality while preserving variance
  • Interpret scree plots to choose the number of components

Observations to Make

Elongated Cluster

When the elongation is high:

  • PC1 captures most of the variance (large first eigenvalue)
  • Projection onto PC1 loses little information
  • The scree plot shows a large first bar and a small second bar

Circular Cluster (elongation ≈ 1)

When the data is roughly circular:

  • Both eigenvalues are similar
  • Neither PC direction is clearly "best"
  • Dimensionality reduction loses significant information (compare the two cases in the sketch below)
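The contrast between the two cases can be reproduced numerically. This sketch uses synthetic clusters with illustrative parameters and prints the fraction of variance captured by PC1 in each case.

```python
import numpy as np

def pc1_share(X):
    """Fraction of the total variance captured by the first principal component."""
    Xc = X - X.mean(axis=0)
    eigvals = np.linalg.eigvalsh(Xc.T @ Xc / len(X))
    return eigvals.max() / eigvals.sum()

rng = np.random.default_rng(2)
elongated = rng.normal(size=(500, 2)) * [4.0, 0.5]   # high elongation
circular = rng.normal(size=(500, 2))                 # elongation close to 1

print(pc1_share(elongated))   # close to 1: projecting onto PC1 loses little
print(pc1_share(circular))    # close to 0.5: neither direction is clearly better
```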

Effect of Rotation

  • Changing rotation changes PC directions
  • But variance explained remains the same
  • PCA finds the data's natural axes regardless of the original orientation (see the check below)
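A quick numerical check of this invariance (a NumPy sketch; the rotation angle is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 2)) * [3.0, 1.0]            # elongated cluster
theta = 0.7                                           # arbitrary rotation angle in radians
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X_rot = X @ R.T                                       # same cloud, rotated

def sorted_eigvals(X):
    Xc = X - X.mean(axis=0)
    return np.sort(np.linalg.eigvalsh(Xc.T @ Xc / len(X)))[::-1]

print(sorted_eigvals(X))       # eigenvalues before rotation
print(sorted_eigvals(X_rot))   # same eigenvalues (up to rounding), so variance explained is unchanged
```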

Mathematical Background

For a centered data matrix \(X\) with \(n\) rows (data points) and \(d\) columns (features):

  1. Covariance Matrix: \(C = \frac{1}{n}X^TX\)
  2. Eigendecomposition: \(C = V\Lambda V^T\), where \(V\) is the matrix of eigenvectors (the principal components) and \(\Lambda\) is the diagonal matrix of eigenvalues
  3. Projection: \(X_{reduced} = XV_k\), where \(V_k\) contains the first \(k\) columns of \(V\)
  4. Reconstruction: \(\hat{X} = X_{reduced}V_k^T\)
  5. Reconstruction Error: \(\frac{1}{n}\|X - \hat{X}\|^2 = \sum_{i=k+1}^{d} \lambda_i\)
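The reconstruction-error identity can be checked numerically. The sketch below (random 3-D data, illustrative names) compares the two sides for k = 1.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.multivariate_normal(mean=np.zeros(3),
                            cov=np.diag([6.0, 2.0, 0.5]),
                            size=1000)
Xc = X - X.mean(axis=0)                       # centered data matrix

C = Xc.T @ Xc / len(Xc)                       # covariance (1/n) X^T X
eigvals, V = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]

k = 1
Vk = V[:, :k]
X_hat = (Xc @ Vk) @ Vk.T                      # project, then reconstruct

lhs = np.sum((Xc - X_hat) ** 2) / len(Xc)     # (1/n) * squared norm of the residual
rhs = eigvals[k:].sum()                       # sum of the discarded eigenvalues
print(lhs, rhs)                               # the two values agree up to rounding error
```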

Lesson Plan

Introduction (5 minutes)

Ask: "How can we represent high-dimensional data with fewer dimensions while keeping the important structure?"

Introduce PCA as finding the directions where data varies most.

Demonstration (10 minutes)

  1. Generate an elongated cluster and step through all phases
  2. Point out:
     • Mean computation and centering
     • PC1 aligns with the elongated direction
     • Projection loses the "thickness" but keeps the "length"
  3. Toggle reconstruction to show what information is lost

Exploration (10 minutes)

Have students:

  1. Adjust elongation and observe scree plot changes
  2. Find settings where 90% variance is captured by PC1
  3. Find settings where both PCs capture equal variance

Assessment Questions

  1. Why must we center the data before computing covariance?
  2. What does a large gap in the scree plot indicate?
  3. When would reducing to 1 dimension be a bad choice?
  4. How does the rotation of the original data affect the variance explained?

Applications

  • Image Compression: Reduce dimensionality of face images (eigenfaces)
  • Feature Extraction: Find most informative features for machine learning
  • Data Visualization: Project high-dimensional data to 2D or 3D for plotting
  • Noise Reduction: Remove low-variance components that may be noise

References

  • Chapter 9: Dimensionality Reduction (PCA section)
  • 3Blue1Brown: PCA
  • Shlens, J. "A Tutorial on Principal Component Analysis"