LoRA Low-Rank Adaptation Visualizer
Run the LoRA Visualizer Fullscreen
Edit the MicroSim with the p5.js editor
About This MicroSim
This visualization demonstrates LoRA (Low-Rank Adaptation), a technique for efficiently fine-tuning large language models by training only a small number of additional parameters.
Instead of updating the full weight matrix \(W\), LoRA adds a trainable low-rank decomposition:

\[
W' = W + BA
\]

where:

- \(W\) is frozen (not trained)
- \(B \in \mathbb{R}^{d \times r}\) and \(A \in \mathbb{R}^{r \times k}\) are trainable
- \(r \ll \min(d, k)\) is the rank (typically 4, 8, or 16)
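To make the shapes concrete, here is a minimal NumPy sketch of the adapted forward pass \(h = Wx + BAx\). The dimensions are illustrative, and the zero initialization of \(B\) follows the convention noted in the lesson plan below:

```python
import numpy as np

rng = np.random.default_rng(42)
d, k, r = 512, 256, 8                       # r << min(d, k)

W = rng.normal(size=(d, k))                 # pretrained weight, frozen
A = rng.normal(scale=0.01, size=(r, k))     # trainable, small random init
B = np.zeros((d, r))                        # trainable, zero init => BA = 0

x = rng.normal(size=k)

# LoRA forward pass: the base path plus the low-rank bypass.
h = W @ x + B @ (A @ x)

# Because B starts at zero, the adapted model initially
# reproduces the frozen model exactly.
print(np.allclose(h, W @ x))                # True
```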
How to Use
- Rank Slider: Adjust the LoRA rank (1-16) to see parameter savings
- Dimension Slider: Change matrix size to see scaling behavior
- Observe: Watch how parameter count changes with rank
Key Insights
Parameter Efficiency
For a \(d \times k\) weight matrix:
| Method | Parameters |
|---|---|
| Full fine-tuning | \(d \times k\) |
| LoRA (rank \(r\)) | \(r(d + k)\) |
Example: with \(d = k = 4096\) and \(r = 8\):

- Full: \(4096 \times 4096 = 16{,}777{,}216\) parameters per matrix (about 16.7M)
- LoRA: \(8 \times (4096 + 4096) = 65{,}536\) parameters (about 65K, or 0.4% of full)
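These counts are easy to verify in a few lines of Python (the helper names `full_params` and `lora_params` are illustrative, not part of the MicroSim):

```python
def full_params(d: int, k: int) -> int:
    """Parameters updated by full fine-tuning of a d x k matrix."""
    return d * k

def lora_params(d: int, k: int, r: int) -> int:
    """Trainable LoRA parameters: B is d x r, A is r x k."""
    return r * (d + k)

d = k = 4096
r = 8
print(full_params(d, k))          # 16777216 (~16.7M)
print(lora_params(d, k, r))       # 65536    (~65K)
print(f"{lora_params(d, k, r) / full_params(d, k):.2%}")  # 0.39%
```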
Why Low-Rank Works
Research suggests that model adaptation often lies in a low-dimensional subspace. The "intrinsic dimension" of fine-tuning is much smaller than the total parameter count.
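One way to see the idea: if a weight update truly has rank \(r\), the SVD recovers it exactly as a product \(BA\) that stores only \(r(d + k)\) numbers. A small NumPy demonstration with arbitrary sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, true_rank = 64, 64, 4

# A sum of 4 outer products has rank at most 4.
delta_W = sum(np.outer(rng.normal(size=d), rng.normal(size=k))
              for _ in range(true_rank))

# The SVD exposes the intrinsic dimension: only 4 singular
# values are numerically nonzero.
s = np.linalg.svd(delta_W, compute_uv=False)
print(int(np.sum(s > 1e-10)))               # 4

# A rank-4 update is exactly expressible as B @ A,
# using r*(d + k) numbers instead of d*k.
U, S, Vt = np.linalg.svd(delta_W)
r = true_rank
B = U[:, :r] * S[:r]                        # d x r
A = Vt[:r, :]                               # r x k
print(np.allclose(B @ A, delta_W))          # True
```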
LoRA Benefits
- Memory efficient: Train only 0.1-1% of original parameters
- No inference latency: Can merge \(W' = W + BA\) after training (see the sketch after this list)
- Modular: Swap different LoRA adapters for different tasks
- Stable: Original model weights remain frozen
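The merge and swap claims can be checked numerically: folding \(BA\) into \(W\) leaves the output unchanged, and replacing one adapter with another never touches the base weights. A minimal sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r = 128, 64, 4

W = rng.normal(size=(d, k))                 # frozen base weight
B1 = rng.normal(size=(d, r))                # adapter for task 1 (post-training)
A1 = rng.normal(size=(r, k))
x = rng.normal(size=k)

# Merge for inference: one matmul, no extra latency.
W_merged = W + B1 @ A1
print(np.allclose(W_merged @ x, W @ x + B1 @ (A1 @ x)))  # True

# Swap adapters: unmerge task 1, merge task 2; W itself is untouched.
B2 = rng.normal(size=(d, r))
A2 = rng.normal(size=(r, k))
W_task2 = W_merged - B1 @ A1 + B2 @ A2
print(np.allclose(W_task2, W + B2 @ A2))                 # True
```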
Lesson Plan
Learning Objectives:
- Understand why low-rank approximations enable efficient fine-tuning
- Calculate parameter savings for different rank values
- Explain the LoRA forward pass: \(h = Wx + BAx\)
Activities:
- Find the rank that achieves 99% parameter savings for a 1024×1024 matrix (a worked check appears after this list)
- Compare parameter counts for different model sizes
- Discuss when LoRA might not work well (what if a task requires full-rank updates?)
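For the first activity, a brute-force check (assuming "savings" means \(1 - r(d + k)/(dk)\)) is straightforward:

```python
d = k = 1024
full = d * k
for r in range(1, 17):
    savings = 1 - r * (d + k) / full
    if savings >= 0.99:
        print(r, f"{savings:.2%}")
# Ranks 1 through 5 qualify; r = 5 is the largest, at 99.02% savings.
```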
Assessment:
- Why is \(B\) initialized to zeros and \(A\) to small random values?
- How does LoRA compare to other efficient fine-tuning methods?
- What's the computational overhead during training vs inference?