# Aggregation Pipeline Simulator
## About This MicroSim
This interactive simulation demonstrates how a data aggregation pipeline processes MicroSim metadata records. Watch as sample records flow through four processing stages: normalization, deduplication, validation, and enrichment.
## Key Features
- 8 sample input records with varying data quality
- 4 pipeline stages with visual processing
- Real-time statistics tracking outcomes
- Processing logs showing transformations
- Step-through or batch processing modes
## Pipeline Stages
| Stage | Purpose | Example Transformation |
|---|---|---|
| Normalize | Standardize field names | name → title, topic → subject |
| Deduplicate | Remove duplicate entries | Merge by URL/title |
| Validate | Check required fields | Reject if missing title/description |
| Enrich | Add computed fields | Quality score, timestamps |
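
A minimal sketch of these four stages as plain JavaScript functions may help make the transformations concrete. The field aliases, duplicate key, and scoring rule below are illustrative assumptions, not the simulator's actual implementation:

```javascript
// Illustrative pipeline stages; field names and the scoring rule are
// assumptions, not the simulator's exact code.
const FIELD_ALIASES = { name: "title", topic: "subject" };

// Stage 1: rename legacy fields to their canonical names.
function normalize(record) {
  const out = {};
  for (const [key, value] of Object.entries(record)) {
    out[FIELD_ALIASES[key] ?? key] = value;
  }
  return out;
}

// Stage 2: keep the first record for each URL (falling back to title).
function deduplicate(records) {
  const seen = new Set();
  return records.filter((r) => {
    const key = r.url ?? r.title;
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

// Stage 3: reject records missing the required fields.
function validate(record) {
  return Boolean(record.title && record.description);
}

// Stage 4: add computed fields to surviving records.
function enrich(record) {
  const optional = ["subject", "url", "author"];
  const present = optional.filter((f) => f in record).length;
  return {
    ...record,
    qualityScore: present / optional.length, // toy scoring rule
    processedAt: new Date().toISOString(),
  };
}

function runPipeline(records) {
  return deduplicate(records.map(normalize)).filter(validate).map(enrich);
}
```

Note that deduplication runs on already-normalized records, so two entries that differ only in a legacy `name` vs. canonical `title` key still collide on the same duplicate key; this ordering is the point of the first discussion question in the lesson plan below.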
## Sample Records
The simulator includes 8 sample records demonstrating different scenarios:
- Valid records (green): complete metadata, pass all stages
- Needs normalization (blue): use legacy field names
- Duplicates (yellow): same content from forked repositories
- Invalid (red): missing required fields
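
For reference, hypothetical records in the spirit of these four categories might look like this (the simulator ships its own eight samples). Run through the `runPipeline` sketch above, two of the four survive:

```javascript
const samples = [
  // Valid (green): complete metadata, passes every stage.
  { title: "Bouncing Ball", description: "Gravity demo",
    url: "https://example.org/ball", subject: "physics" },
  // Needs normalization (blue): legacy "name"/"topic" keys.
  { name: "Projectile Motion", topic: "physics", description: "Launch angles",
    url: "https://example.org/projectile" },
  // Duplicate (yellow): same URL as the first record, dropped at deduplication.
  { title: "Bouncing Ball (fork)", description: "Gravity demo",
    url: "https://example.org/ball" },
  // Invalid (red): missing a description, rejected at validation.
  { title: "Untitled Sim", url: "https://example.org/untitled" },
];

console.log(runPipeline(samples).length); // 2
```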
## Learning Objectives
After using this simulation, students will be able to:
- Execute a data aggregation pipeline step-by-step
- Observe how each stage transforms the data
- Identify which records pass or fail validation
## Lesson Plan
### Grade Level
Undergraduate / Graduate (Data Engineering, ETL Pipelines)
### Duration
25-30 minutes
### Materials Needed
- This simulation
- Understanding of JSON data formats
### Procedure
1. **Introduction (5 min):** Explain the challenge of combining data from multiple sources with varying formats.
2. **Step-Through Mode (10 min):**
    - Click "Process Next" for each record
    - Enable "Show Logs" to see transformations
    - Observe how different record types are handled
3. **Batch Processing (5 min):**
    - Reset the simulation
    - Click "Process All" to watch the full pipeline
    - Note the final statistics
4. **Analysis (5 min):**
    - What percentage of records succeeded?
    - Which stage rejected the most records?
    - How do quality scores vary?
5. **Discussion (5 min):**
    - Why normalize before deduplication?
    - What other validation rules might be useful?
    - How would you handle partial data?
### Assessment
Students should be able to:

- Explain the purpose of each pipeline stage
- Predict whether a given record will pass or fail
- Describe how normalization handles legacy field names
## Technical Details
- **Framework:** p5.js
- **Canvas Size:** responsive width, 560px height
- **Interaction:** button controls, speed slider, log toggle
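
The responsive width is presumably achieved with the usual p5.js pattern of sizing the canvas to its container on setup and on window resize; this is an assumption about the implementation, and `sim-container` is a hypothetical wrapper element id:

```javascript
// Fixed 560px height, width tracking a container element (hypothetical id).
function setup() {
  const holder = document.getElementById("sim-container");
  createCanvas(holder.clientWidth, 560).parent(holder);
}

function windowResized() {
  const holder = document.getElementById("sim-container");
  resizeCanvas(holder.clientWidth, 560);
}
```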