MicroSim Similarity Map
An interactive 2D visualization of MicroSim similarity. Each dot represents a MicroSim, and dots that are near each other are more similar.
The plot is built from embeddings created from the MicroSim metadata JSON file, reduced to two dimensions with Principal Component Analysis (PCA).
About This Visualization
This visualization projects 384-dimensional semantic embeddings of MicroSims into a 2D space using PCA. Each point represents a MicroSim, colored by subject area.
Key Statistics
- Total MicroSims: 868
- Subject Areas: 14 categories
How to Use
- Hover over points to see MicroSim details (title, repository)
- Click on any point to open the MicroSim in a new tab
- Use the legend to show/hide subject areas
- Zoom and pan to explore clusters
Interpreting the Map
- Clusters indicate semantically similar MicroSims
- The closer two points are, the more semantically similar the corresponding MicroSims are in the embedding space
- Points from the same subject area tend to cluster together, but cross-subject clustering reveals shared educational concepts
Technical Details
The embeddings were generated using the all-MiniLM-L6-v2 sentence transformer model on combined MicroSim metadata (title, description, learning objectives). PCA reduces the 384-dimensional vectors to 2D for visualization.
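A minimal sketch of the embedding step, assuming per-MicroSim metadata fields such as `title`, `description`, and `objectives` (the field names, sample records, and concatenation below are illustrative, not the site's actual build script):

```python
# Illustrative sketch: embed combined MicroSim metadata with sentence-transformers.
# The records and field names here are invented examples, not real site data.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

microsims = [
    {"title": "Projectile Motion", "description": "Simulate a launched ball.",
     "objectives": "Relate launch angle to range."},
    {"title": "Bubble Sort", "description": "Step through a sorting algorithm.",
     "objectives": "Trace comparisons and swaps."},
]

# Combine title, description, and learning objectives into one text per MicroSim.
texts = [f"{m['title']}. {m['description']} {m['objectives']}" for m in microsims]

embeddings = model.encode(texts)   # numpy array of shape (n_microsims, 384)
print(embeddings.shape)
```

The PCA step that reduces these 384-dimensional vectors to the 2D map coordinates is sketched in the lesson plan below.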
Files
| File | Description |
|---|---|
| main.html | Main HTML page |
| style.css | CSS styles |
| script.js | JavaScript for loading data and creating the Plotly visualization |
| data.json | Plotly data and layout configuration |
Lesson Plan
Overview
This lesson introduces fundamental concepts in machine learning and data science: embeddings, similarity, clustering, and dimensionality reduction. Students will use this interactive visualization to explore how computers can understand and organize text by meaning.
Learning Objectives
By the end of this lesson, students will be able to:
- Explain what an embedding is and why it is useful
- Describe how similarity is measured between items in an embedding space
- Identify clusters in a visualization and explain what they represent
- Understand why dimensionality reduction is necessary for visualization
- Apply these concepts to interpret real-world data visualizations
Target Audience
- High school students (grades 10-12) or college freshmen
- Adult learners interested in AI/ML fundamentals
- No programming experience required
- Basic math familiarity helpful but not required
Prerequisites
- Ability to use a web browser and interact with visualizations
- Understanding of basic concepts like "similar" and "different"
Key Concepts
What is an Embedding?
An embedding is a way to represent something (like a word, sentence, or document) as a list of numbers. Think of it like giving everything a unique "address" in a mathematical space.
Analogy: Imagine describing a fruit using numbers:
- Sweetness: 1-10
- Size: 1-10
- Color (red to green): 1-10
An apple might be [7, 5, 8] while an orange might be [8, 6, 2]. These number lists are simple embeddings!
In this visualization, each MicroSim is represented by 384 numbers that capture its meaning based on its title, description, and learning objectives.
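To make the analogy concrete, here is the fruit example written out as plain Python lists (the scores are the made-up numbers from the analogy, not real data):

```python
# The fruit analogy as tiny hand-made "embeddings": (sweetness, size, redness).
apple  = [7, 5, 8]
orange = [8, 6, 2]

# A MicroSim embedding is the same idea, except the 384 numbers are learned by
# a model from the text rather than hand-picked by a person.
```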
What is Similarity?
Similarity measures how close two items are in the embedding space. Items with similar meanings have similar number patterns.
In the visualization: Points that are close together represent MicroSims with similar educational content. Points far apart cover different topics.
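One standard way to score similarity between two embedding vectors is cosine similarity. The sketch below uses synthetic vectors as stand-ins for real MicroSim embeddings; it illustrates the idea rather than reproducing the site's own computation:

```python
# Cosine similarity: values near 1.0 mean very similar, values near 0 mean unrelated.
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
v1 = rng.normal(size=384)
v2 = v1 + 0.1 * rng.normal(size=384)   # a slightly perturbed, "similar" vector
v3 = rng.normal(size=384)              # an unrelated vector

print(cosine_similarity(v1, v2))  # close to 1.0
print(cosine_similarity(v1, v3))  # close to 0.0
```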
What is Clustering?
A cluster is a group of items that are similar to each other and different from items in other groups. Clusters emerge naturally when similar things group together.
In the visualization: You can see clusters of points with the same color (subject area), but also notice how some subjects overlap - this reveals shared concepts across disciplines.
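Clusters can also be found automatically. Here is a small sketch using the k-means algorithm from scikit-learn on made-up 2D points (the map itself colors points by subject area rather than by a clustering algorithm, and the choice of three clusters is arbitrary for the demo):

```python
# Group 2D points into clusters with k-means (synthetic points for illustration).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(30, 2)),   # cluster A
    rng.normal(loc=(3, 3), scale=0.3, size=(30, 2)),   # cluster B
    rng.normal(loc=(0, 3), scale=0.3, size=(30, 2)),   # cluster C
])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(points)
print(labels[:10])   # cluster id assigned to each of the first ten points
```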
What is Dimensionality Reduction (PCA)?
We cannot visualize 384 dimensions on a 2D screen. Principal Component Analysis (PCA) is a technique that reduces the 384 numbers to just 2 numbers (x and y coordinates) while preserving as much of the "structure" as possible.
Analogy: Imagine taking a 3D object and casting its shadow on a wall. You lose some information, but the shadow still tells you something about the object's shape.
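A minimal sketch of this step with scikit-learn, using random numbers as a stand-in for the real embedding matrix (for the actual map, the 2D projection retains roughly 11.2% of the variance, as noted in Activity 5 below):

```python
# Reduce 384-dimensional embeddings to 2D with PCA and check how much
# of the original variance the 2D "shadow" keeps.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
embeddings = rng.normal(size=(868, 384))   # stand-in for the real embedding matrix

pca = PCA(n_components=2)
coords_2d = pca.fit_transform(embeddings)        # (868, 2): the x, y map coordinates
print(coords_2d.shape)
print(pca.explained_variance_ratio_.sum())       # fraction of variance kept in 2D
```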
Activities
Activity 1: Explore the Map (10 minutes)
- Look at the visualization without clicking anything
- Observe: Which subject areas form tight clusters? Which are spread out?
- Discuss: Why might "Mathematics" MicroSims cluster together?
Activity 2: Find Surprising Neighbors (15 minutes)
- Click on a MicroSim point to open it
- Find a nearby point from a different subject area
- Open that MicroSim and compare them
- Question: What do these two MicroSims have in common that made the computer place them near each other?
- Share your findings with the class
Activity 3: Subject Area Investigation (15 minutes)
- Use the "Uncheck All" button to hide all points
- Check only ONE subject area at a time
- Observe: Does this subject form one cluster or multiple clusters?
- Hypothesize: If a subject has multiple clusters, what might explain this?
- Try this with 3-4 different subject areas
Activity 4: Cross-Disciplinary Connections (20 minutes)
- Check two subject areas that you think might be related (e.g., Physics and Mathematics)
- Find: Where do they overlap? Where are they separate?
- Check two subject areas you think are unrelated
- Verify: Are they actually far apart in the visualization?
- Write: A short paragraph explaining one surprising connection you discovered
Activity 5: Limitations Discussion (10 minutes)
- The visualization only captures 11.2% of the original information (variance explained)
- Discuss: What information might be lost in this 2D view?
- Consider: Could two MicroSims appear close in 2D but actually be far apart in the original 384 dimensions?
Discussion Questions
- Why do computers need to convert text into numbers to understand meaning?
- What are the advantages of using embeddings instead of simple keyword matching?
- If you were building a "recommended MicroSims" feature, how could you use this similarity information? (One possible approach is sketched after these questions.)
- What other types of content could be visualized using embeddings? (images, music, products?)
- What biases might exist in how the embedding model understands "similarity"?
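For the "recommended MicroSims" question, one possible approach is a nearest-neighbor lookup over the same embeddings. The sketch below uses invented titles and random vectors purely for illustration:

```python
# Toy recommender: rank other MicroSims by cosine similarity to the one being viewed.
import numpy as np

titles = ["Projectile Motion", "Pendulum Lab", "Bubble Sort", "Supply and Demand"]
rng = np.random.default_rng(3)
embeddings = rng.normal(size=(len(titles), 384))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)  # unit length

def recommend(index, k=2):
    scores = embeddings @ embeddings[index]       # cosine similarity (unit vectors)
    scores[index] = -np.inf                       # never recommend the item itself
    top = np.argsort(scores)[::-1][:k]
    return [titles[i] for i in top]

print(recommend(0))   # nearest neighbors of "Projectile Motion"
```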
Assessment
Formative Assessment
- Observe student interactions with the visualization
- Listen to partner discussions during activities
- Check for accurate use of vocabulary (embedding, similarity, cluster)
Summative Assessment Options
Option A - Written Response: Explain in your own words how this visualization was created, starting from MicroSim descriptions and ending with colored dots on a 2D map. Use the terms: embedding, similarity, dimensionality reduction.
Option B - Create an Analogy: Create your own analogy for embeddings using a real-world example (like the fruit example above). Explain how your analogy demonstrates similarity and clustering.
Option C - Analysis Report: Choose one subject area and write a 1-page analysis of its clustering patterns. Include screenshots, identify at least 2 sub-clusters, and hypothesize why they exist.
Extensions
- For advanced students: Explore the Embeddings Documentation to learn how embeddings are generated with code
- Cross-curricular: Connect to biology (taxonomy/classification), library science (cataloging), or social studies (demographic clustering)
- Project idea: Have students suggest what additional metadata would improve the embeddings (e.g., difficulty level, interactivity type)
Related
- MicroSim Search - Search MicroSims by facets
- Embeddings Documentation - How embeddings are generated
References
- Plotly.js - Open-source JavaScript graphing library used for the interactive scatter plot visualization
- Sentence Transformers - Python framework for state-of-the-art sentence, text, and image embeddings
- all-MiniLM-L6-v2 Model - Sentence transformer model that maps sentences to a 384-dimensional dense vector space
- scikit-learn PCA - Principal Component Analysis implementation used for dimensionality reduction
- Principal Component Analysis (Wikipedia) - Background on the PCA algorithm for dimensionality reduction