Data Pipelines and Aggregation
Summary
This chapter covers the data gathering and aggregation techniques used to build a searchable collection of MicroSims. You'll learn about web crawling approaches, using the GitHub API to discover repositories, repository mining techniques for extracting metadata, and strategies for aggregating data from multiple sources. The chapter focuses on building automated pipelines that keep MicroSim collections current. After completing this chapter, students will be able to design and implement data collection pipelines for MicroSim metadata.
Concepts Covered
This chapter covers the following concepts from the learning graph:
- Data Gathering
- Web Crawling
- GitHub API
- Repository Mining
- MicroSim Repositories
- Data Aggregation
Prerequisites
This chapter assumes only the prerequisites listed in the course description. The Data Gathering concept is foundational with no dependencies within this textbook.
Why Data Pipelines Are Your Secret Weapon
Imagine you're an educator who just discovered 500 amazing MicroSims scattered across 40 different GitHub repositories. Manually copying metadata from each one would take days—and by the time you finished, new simulations would already be published. You'd be trapped in an endless game of catch-up!
This is where data pipelines become your superpower. A well-designed pipeline automatically discovers new MicroSims, extracts their metadata, validates quality, and delivers a fresh, searchable collection—all while you're sipping coffee or teaching your next class.
In this chapter, you'll learn to build the data infrastructure that powers MicroSim search. Think of it as creating a network of friendly robots that tirelessly explore GitHub, find educational treasures, and organize them for easy discovery. By the end, you'll understand how our search system maintains an up-to-date collection of 400+ MicroSims from dozens of repositories.
Let's build some pipelines!
Understanding Data Gathering
Data gathering is the systematic collection of information from various sources. For MicroSim search, this means finding and extracting metadata from wherever simulations are published.
The Challenge of Distributed Content
MicroSims aren't stored in one central location—they're scattered across:
- Individual GitHub repositories
- Course websites and educational platforms
- Personal project pages
- Organizational collections
Each source has its own structure, format, and update schedule. Your data gathering strategy must handle this diversity gracefully.
Key Data Gathering Principles
| Principle | Description | Example |
|---|---|---|
| Automation | Minimize manual intervention | Scripts that run on schedule |
| Reliability | Handle errors gracefully | Retry failed requests, log issues |
| Efficiency | Don't waste resources | Cache responses, avoid redundant calls |
| Respect | Honor rate limits and terms | Wait between API calls |
| Freshness | Keep data current | Incremental updates, not full rebuilds |
Types of Data Sources
Different sources require different gathering techniques:
- APIs (Application Programming Interfaces): Structured endpoints that return data in predictable formats. Ideal for programmatic access. GitHub's API is our primary source.
- Web Pages: Human-readable HTML that must be parsed. Useful when APIs aren't available, but more fragile.
- File Systems: Direct access to files in known locations. Works when you control the source.
- Databases: Structured data stores with query interfaces. Relevant for enterprise deployments.
For MicroSim gathering, we focus primarily on APIs because GitHub provides excellent programmatic access to repository contents.
Web Crawling Fundamentals
Web crawling is the automated process of visiting web pages, extracting information, and following links to discover more pages. While our primary approach uses the GitHub API, understanding crawling concepts helps you adapt to sources without good API support.
Anatomy of a Web Crawler
A basic crawler performs three operations in a loop:
- Fetch: Request a URL and receive the response
- Parse: Extract useful information from the response
- Discover: Find new URLs to visit
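The sketch below illustrates that fetch-parse-discover loop in Python. It is a minimal illustration rather than production code: the choice of the `requests` and `BeautifulSoup` libraries and the politeness delay are assumptions consistent with the practices discussed later in this section.

```python
# Minimal crawler sketch: fetch, parse, discover, repeat
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=50, delay=1.0):
    """Fetch pages, extract titles, and follow links breadth-first."""
    to_visit = list(seed_urls)
    visited = set()
    results = []

    while to_visit and len(visited) < max_pages:
        url = to_visit.pop(0)
        if url in visited:
            continue
        visited.add(url)

        # 1. Fetch: request the URL
        try:
            response = requests.get(
                url, timeout=10,
                headers={"User-Agent": "microsim-crawler-demo"})
            response.raise_for_status()
        except requests.RequestException as err:
            print(f"Skipping {url}: {err}")
            continue

        # 2. Parse: extract useful information from the response
        soup = BeautifulSoup(response.text, "html.parser")
        title = soup.title.string.strip() if soup.title and soup.title.string else url
        results.append({"url": url, "title": title})

        # 3. Discover: queue new links found on the page
        for link in soup.find_all("a", href=True):
            to_visit.append(urljoin(url, link["href"]))

        time.sleep(delay)  # rate limiting: be polite to the server

    return results
```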
Crawling Best Practices
Being a "good citizen" of the web means following these practices:
- Respect robots.txt: Check what the site allows crawlers to access
- Rate limiting: Don't bombard servers with rapid requests
- User-Agent identification: Identify your crawler so administrators can contact you
- Error handling: Don't crash on bad responses
- Incremental crawling: Only fetch what's changed since last time
Why We Prefer APIs Over Crawling
| Aspect | Web Crawling | API Access |
|---|---|---|
| Reliability | Breaks when HTML changes | Stable, versioned contracts |
| Efficiency | Downloads entire pages | Returns only needed data |
| Respect | May strain servers | Designed for programmatic use |
| Structure | Must parse HTML | Returns structured JSON |
| Updates | Hard to detect changes | Often includes modification timestamps |
For MicroSim collection, the GitHub API provides everything we need—no HTML parsing required!
When Crawling Is Necessary
Sometimes APIs don't exist or don't provide the data you need. In those cases, web crawling with libraries like BeautifulSoup (Python) or Puppeteer (JavaScript) can extract information from HTML pages. Just be extra careful about rate limiting and error handling.
The GitHub API: Your Gateway to MicroSims
The GitHub API is a powerful interface that lets you programmatically access repository information, file contents, and metadata. It's the backbone of our MicroSim data pipeline.
API Basics
GitHub's REST API uses simple HTTP requests to access data:
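For example, to ask for the contents of a repository's `docs/sims` directory (the owner and repository names here are illustrative):

```
GET https://api.github.com/repos/dmccreary/microsims/contents/docs/sims
```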
This request returns a JSON array listing all items in that directory. It's like asking GitHub: "What's in this folder?"
Authentication and Rate Limits
GitHub enforces rate limits to prevent abuse:
| Access Type | Rate Limit |
|---|---|
| Unauthenticated | 60 requests/hour |
| Authenticated | 5,000 requests/hour |
| GitHub Actions | 1,000 requests/hour |
For any serious data gathering, you need authentication. A GitHub Personal Access Token gives you 83× more capacity!
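A minimal authenticated request in Python, assuming your personal access token is stored in a `GITHUB_TOKEN` environment variable:

```python
# Authenticated GitHub API request (sketch; the env variable name is a convention)
import os
import requests

token = os.environ["GITHUB_TOKEN"]            # personal access token
headers = {"Authorization": f"Bearer {token}"}

resp = requests.get("https://api.github.com/rate_limit", headers=headers)
print(resp.json()["resources"]["core"])       # shows the 5,000/hour core limit
```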
Key API Endpoints for MicroSim Discovery
| Endpoint | Purpose | Example |
|---|---|---|
| `/users/{username}/repos` | List user's repositories | Find all dmccreary/* repos |
| `/repos/{owner}/{repo}/contents/{path}` | Get directory contents | List sims in a repo |
| `/repos/{owner}/{repo}/contents/{path}` | Get file contents | Fetch metadata.json |
| `/search/code` | Search for files | Find metadata.json files |
Example: Listing Repositories
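A sketch of such a script using the `requests` library; the pagination parameters and the default username are illustrative assumptions.

```python
# Sketch: list all public repositories for a user via the GitHub API
import os
import requests

GITHUB_API = "https://api.github.com"

def list_repositories(username, token=None):
    """Return a list of repository names owned by the given user."""
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    repos = []
    page = 1
    while True:
        resp = requests.get(
            f"{GITHUB_API}/users/{username}/repos",
            headers=headers,
            params={"per_page": 100, "page": page},
            timeout=10,
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break  # no more pages
        repos.extend(repo["name"] for repo in batch)
        page += 1
    return repos

if __name__ == "__main__":
    print(list_repositories("dmccreary", os.environ.get("GITHUB_TOKEN")))
```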
This simple script discovers all repositories owned by a user—the first step in finding MicroSims.
Diagram: GitHub API Workflow
GitHub API Workflow Visualization
Type: workflow
Bloom Level: Understand (L2) Bloom Verb: explain
Learning Objective: Students will explain the sequence of GitHub API calls needed to discover MicroSims by tracing the workflow from user repositories through directory contents to metadata extraction.
Visual style: Flowchart with numbered steps and API call boxes
Steps:

1. Start: "Discover MicroSims". Hover text: "Begin the data gathering process"
2. Process: "List User Repositories". API call: `GET /users/{username}/repos`. Hover text: "Find all repositories owned by the target user"
3. Process: "Filter Course Repositories". Hover text: "Select repos likely to contain MicroSims (name contains 'course', 'microsim', etc.)"
4. Loop: "For Each Repository". Hover text: "Process each discovered repository"
5. Process: "Check for /docs/sims Directory". API call: `GET /repos/{owner}/{repo}/contents/docs/sims`. Hover text: "Verify the standard MicroSim location exists"
6. Decision: "Directory Exists?". Hover text: "Handle repos with and without MicroSims"
    - 7a. Process: "List MicroSim Directories". Hover text: "Enumerate all simulation folders"
    - 7b. Process: "Skip Repository". Hover text: "No MicroSims here, continue to next repo"
8. Loop: "For Each MicroSim Directory". Hover text: "Process each discovered simulation"
9. Process: "Fetch metadata.json". API call: `GET /repos/{owner}/{repo}/contents/docs/sims/{sim}/metadata.json`. Hover text: "Retrieve the simulation's metadata file"
10. Decision: "Metadata Valid?". Hover text: "Validate against schema, check required fields"
    - 11a. Process: "Store Metadata". Hover text: "Add to aggregated collection"
    - 11b. Process: "Log Warning". Hover text: "Record missing or invalid metadata for review"
12. End: "Aggregated Collection Complete". Hover text: "All discovered metadata ready for search indexing"

Color coding:

- Blue: API call steps
- Yellow: Decision points
- Green: Success outcomes
- Orange: Warning/alternative paths
Implementation: p5.js with interactive hover explanations for each step
Repository Mining Techniques
Repository mining is the process of extracting structured information from code repositories. For MicroSims, this means finding metadata.json files and parsing their contents.
The Standard MicroSim Structure
Our crawler expects MicroSims to follow a predictable structure:
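The tree below illustrates that layout. The simulation names and sketch file names are examples; the part the crawler relies on is the `docs/sims/{sim-name}/metadata.json` pattern.

```
repository-name/
└── docs/
    └── sims/
        ├── bouncing-ball/
        │   ├── index.html
        │   ├── bouncing-ball.js
        │   └── metadata.json
        ├── wave-interference/
        │   ├── index.html
        │   ├── wave-interference.js
        │   └── metadata.json
        └── ...
```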
The crawler navigates this structure to find all metadata.json files.
Mining Strategy
Our repository mining follows this strategy:
- Enumerate repositories: Get list of all repos for target users/organizations
- Check structure: Look for the `docs/sims/` directory
- List simulations: Get contents of the sims directory
- Extract metadata: Fetch and parse each metadata.json
- Validate: Check required fields, flag incomplete entries
- Aggregate: Combine into single searchable collection
Handling Missing Metadata
Not every simulation has metadata yet. The crawler tracks these gaps:
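One simple way to track those gaps is a set of buckets the crawler fills as it runs; the category names here are illustrative:

```python
# Sketch: track metadata coverage while mining
coverage = {
    "has_metadata": [],      # sims with a parsable metadata.json
    "missing_metadata": [],  # sims with no metadata.json at all
    "invalid_metadata": [],  # metadata.json present but failed validation
}
```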
This categorization helps prioritize metadata improvement efforts.
Example: Mining a Single Repository
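The sketch below shows one way to implement that logic with the `requests` library. The function name and the `_source` bookkeeping field are assumptions, not the project's exact code.

```python
# Sketch: mine one repository's docs/sims directory for metadata.json files
import base64
import json

import requests

GITHUB_API = "https://api.github.com"

def mine_repository(owner, repo, headers):
    """Return a list of metadata dicts found under docs/sims in one repo."""
    sims_url = f"{GITHUB_API}/repos/{owner}/{repo}/contents/docs/sims"
    resp = requests.get(sims_url, headers=headers, timeout=10)
    if resp.status_code == 404:
        return []  # repository has no docs/sims directory
    resp.raise_for_status()

    metadata_records = []
    for entry in resp.json():
        if entry["type"] != "dir":
            continue  # only simulation folders matter
        meta_url = f"{sims_url}/{entry['name']}/metadata.json"
        meta_resp = requests.get(meta_url, headers=headers, timeout=10)
        if meta_resp.status_code == 404:
            continue  # simulation has no metadata yet
        meta_resp.raise_for_status()

        # GitHub returns file contents base64-encoded; decode before parsing
        encoded = meta_resp.json()["content"]
        try:
            record = json.loads(base64.b64decode(encoded))
        except ValueError as err:
            print(f"Invalid metadata in {owner}/{repo}/{entry['name']}: {err}")
            continue
        record["_source"] = f"{owner}/{repo}/docs/sims/{entry['name']}"
        metadata_records.append(record)
    return metadata_records
```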
This function encapsulates the core mining logic for a single repository.
GitHub Returns Base64
When fetching file contents via the GitHub API, the content is base64-encoded. Don't forget to decode it before parsing as JSON!
MicroSim Repositories: Finding the Treasure
MicroSim repositories are the source of our educational simulations. Understanding how to discover and track them is essential for maintaining a comprehensive collection.
Discovery Strategies
Finding repositories with MicroSims requires multiple approaches:
| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Known Users | Track specific creators | Reliable, high quality | Limited scope |
| Code Search | Search for metadata.json files | Discovers new sources | Many false positives |
| Topic Tags | Search by repository topics | Self-organized | Inconsistent tagging |
| Fork Networks | Follow forks of template repos | Finds derivatives | May be abandoned |
Our Primary Approach: Known User Mining
For the MicroSim search system, we focus on repositories from known contributors:
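A sketch of that approach, reusing the `list_repositories` and `mine_repository` helpers from earlier; the keyword list used to pre-filter repositories is illustrative.

```python
# Sketch: mine repositories for a fixed list of known MicroSim creators
KNOWN_USERS = ["dmccreary"]

def looks_like_course_repo(name):
    """Cheap pre-filter applied to repository names before mining."""
    keywords = ("course", "tutorial", "microsim", "sim", "learning")
    return any(k in name.lower() for k in keywords)

def mine_known_users(headers):
    collection = []
    for username in KNOWN_USERS:
        for repo_name in list_repositories(username):
            if not looks_like_course_repo(repo_name):
                continue  # skip repos unlikely to contain MicroSims
            collection.extend(mine_repository(username, repo_name, headers))
    return collection
```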
Identifying MicroSim Repositories
Not every repository contains MicroSims. We use heuristics to filter candidates:
- Name patterns: Contains "course", "tutorial", "microsim", "simulation"
- Description keywords: Mentions "interactive", "visualization", "p5.js"
- File structure: Has a `docs/sims/` directory
- Recent activity: Updated within the last year
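A sketch of those heuristics as a filter over GitHub repository objects. The keyword lists mirror the bullets above, and the one-year activity window comes from the last bullet; the `docs/sims/` structure check happens later, during mining, because it requires an extra API call.

```python
# Sketch: heuristic filter for candidate MicroSim repositories
from datetime import datetime, timedelta, timezone

NAME_KEYWORDS = ("course", "tutorial", "microsim", "simulation")
DESC_KEYWORDS = ("interactive", "visualization", "p5.js")

def is_microsim_candidate(repo):
    """repo is a repository object (dict) returned by the GitHub API."""
    name = repo["name"].lower()
    description = (repo.get("description") or "").lower()

    name_match = any(k in name for k in NAME_KEYWORDS)
    desc_match = any(k in description for k in DESC_KEYWORDS)

    pushed_at = datetime.fromisoformat(repo["pushed_at"].replace("Z", "+00:00"))
    recently_active = datetime.now(timezone.utc) - pushed_at < timedelta(days=365)

    return (name_match or desc_match) and recently_active
```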
Tracking Repository Status
Maintain a registry of known repositories and their status:
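A registry can be as simple as a JSON file checked into the crawler's repository. The repository names, counts, and timestamps below are purely illustrative:

```json
{
  "repositories": [
    {
      "owner": "dmccreary",
      "name": "geometry-course",
      "last_crawled": "2025-06-01T06:00:00Z",
      "last_pushed": "2025-05-28T14:22:10Z",
      "microsim_count": 24,
      "status": "active"
    },
    {
      "owner": "dmccreary",
      "name": "signal-processing",
      "last_crawled": "2025-06-01T06:00:00Z",
      "last_pushed": "2024-03-11T09:05:41Z",
      "microsim_count": 0,
      "status": "no-sims"
    }
  ]
}
```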
This registry enables incremental updates—only crawl repositories that have changed.
Diagram: Repository Discovery Flow
Repository Discovery Flow Visualization
Type: diagram
Bloom Level: Analyze (L4) Bloom Verb: examine
Learning Objective: Students will examine the repository discovery process by tracing how candidate repositories are filtered and validated to identify MicroSim collections.
Visual layout: Funnel diagram showing filtering stages
Stages (top to bottom):

1. "All User Repositories" (100 repos)
    - Wide funnel entrance
    - Gray color
    - Tooltip: "Starting point: all public repositories for known creators"
2. "Name/Description Filter" (40 repos)
    - Narrower section
    - Light blue
    - Tooltip: "Filter by keywords: course, tutorial, microsim, simulation"
3. "Structure Check" (25 repos)
    - Narrower
    - Medium blue
    - Tooltip: "Verify docs/sims/ directory exists"
4. "Metadata Presence" (18 repos)
    - Narrower
    - Dark blue
    - Tooltip: "At least one valid metadata.json found"
5. "Active Repositories" (15 repos)
    - Narrow
    - Green
    - Tooltip: "Updated within the last year, not archived"

Visual elements:

- Repository icons flowing through funnel
- Count labels at each stage
- Arrows showing flow direction
- Side annotations explaining each filter
- Rejected repos shown fading off to sides

Interactive controls:

- Hover over each stage for detailed explanation
- Click stage to see example repositories at that level
- Toggle: "Show rejected repos" (displays what was filtered out)
- Slider: "Animate flow" (shows repos moving through funnel)

Color scheme:

- Funnel gradient from gray to green
- Rejected items: red/orange
- Selected/active: gold highlight
Implementation: p5.js with animated funnel visualization
Data Aggregation: Building the Collection
Data aggregation combines data from multiple sources into a unified collection. For MicroSim search, this means merging metadata from dozens of repositories into a single searchable JSON file.
The Aggregation Challenge
Each repository might have:
- Different metadata completeness levels
- Slightly different field naming conventions
- Overlapping or duplicate simulations (forks)
- Varying quality scores
The aggregator must normalize, deduplicate, and validate while preserving information.
Aggregation Pipeline Architecture
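The sketch below shows one way to wire the stages together in Python. The stage functions (`normalize_record`, `deduplicate`, `validate_record`, `enrich_record`) are sketched in the sections that follow; the function names and statistics fields are assumptions rather than the project's exact API.

```python
# Sketch of the aggregation pipeline: normalize -> dedupe -> validate -> enrich
def aggregate(raw_records):
    """Run raw metadata records through the four pipeline stages."""
    stats = {"input": len(raw_records), "duplicates": 0, "invalid": 0}

    # Stage 1: map every source's field names onto the standard schema
    normalized = [normalize_record(r) for r in raw_records]

    # Stage 2: collapse duplicates (forks, mirrors) to one record per URL
    deduped = deduplicate(normalized)
    stats["duplicates"] = len(normalized) - len(deduped)

    # Stage 3: keep only records that pass the quality gates
    valid = []
    for record in deduped:
        errors = validate_record(record)
        if errors:
            stats["invalid"] += 1
            print(f"Rejected {record.get('title', '<untitled>')}: {errors}")
        else:
            valid.append(record)

    # Stage 4: add computed fields such as quality score and full URL
    enriched = [enrich_record(r) for r in valid]

    stats["output"] = len(enriched)
    return enriched, stats
```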
Normalization: Speaking the Same Language
Different repositories might use different field names for the same concept:
| Source Variation | Normalized Field |
|---|---|
| `title`, `name`, `dublinCore.title` | `title` |
| `subject`, `topic`, `educational.subjectArea` | `subjectArea` |
| `library`, `framework`, `technical.framework` | `framework` |
| `creator`, `author`, `dublinCore.creator` | `creator` |
The normalizer maps all variations to our standard schema:
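A sketch of that normalizer, using an alias table that mirrors the mapping in the table above:

```python
# Sketch: map source field variations onto the standard flat schema
FIELD_ALIASES = {
    "title":       ["title", "name", ("dublinCore", "title")],
    "subjectArea": ["subject", "topic", ("educational", "subjectArea")],
    "framework":   ["library", "framework", ("technical", "framework")],
    "creator":     ["creator", "author", ("dublinCore", "creator")],
}

def normalize_record(raw):
    """Return a copy of raw with standard field names filled in."""
    normalized = dict(raw)  # keep unrecognized fields as-is
    for standard_field, aliases in FIELD_ALIASES.items():
        for alias in aliases:
            if isinstance(alias, tuple):        # nested field, e.g. dublinCore.title
                nested = raw.get(alias[0])
                value = nested.get(alias[1]) if isinstance(nested, dict) else None
            else:
                value = raw.get(alias)
            if value:
                normalized[standard_field] = value
                break  # first matching alias wins
    return normalized
```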
Deduplication: One Copy of Each
Simulations might appear in multiple repositories (forks, mirrors). We deduplicate by URL:
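A sketch of URL-based deduplication that keeps the more complete of any two copies:

```python
# Sketch: deduplicate by URL, preferring the record with more fields filled in
def completeness(record):
    """Fraction of fields that have a non-empty value."""
    if not record:
        return 0.0
    filled = sum(1 for v in record.values() if v not in (None, "", [], {}))
    return filled / len(record)

def deduplicate(records):
    by_url = {}
    for record in records:
        key = record.get("url") or record.get("_source")
        existing = by_url.get(key)
        if existing is None or completeness(record) > completeness(existing):
            by_url[key] = record
    return list(by_url.values())
```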
Validation: Quality Gates
Before including a MicroSim in the collection, validate it meets minimum requirements:
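A sketch of such a quality gate; the exact required-field list and the minimum description length are assumptions:

```python
# Sketch: validation quality gate for a normalized record
REQUIRED_FIELDS = ["title", "description", "subjectArea"]

def validate_record(record):
    """Return a list of validation errors; an empty list means the record passes."""
    errors = []
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            errors.append(f"missing required field: {field}")
    if record.get("description") and len(str(record["description"])) < 20:
        errors.append("description too short")
    return errors
```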
Enrichment: Adding Value
After normalization, enrich the data with computed fields:
- Quality score: Based on metadata completeness
- Last updated: When the metadata was last modified
- Source repository: Where this MicroSim came from
- URL construction: Build full URL from repository and path
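A sketch of an enrichment step covering those computed fields. The quality-score formula and the GitHub Pages URL pattern are assumptions (MkDocs sites typically publish `docs/` as the site root, which is why the path below drops `docs`):

```python
# Sketch: add computed fields to a validated record
from datetime import datetime, timezone

def enrich_record(record):
    enriched = dict(record)

    # Quality score: percentage of fields that are filled in
    enriched["qualityScore"] = round(completeness(record) * 100)

    # Source repository and full URL built from the crawl path,
    # which mine_repository stored as "owner/repo/docs/sims/sim-name"
    source = record.get("_source", "")
    if source:
        owner, repo, *path = source.split("/")
        enriched["sourceRepository"] = f"{owner}/{repo}"
        enriched["url"] = f"https://{owner}.github.io/{repo}/{'/'.join(path[1:])}/"

    enriched["lastUpdated"] = datetime.now(timezone.utc).isoformat()
    return enriched
```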
Diagram: Aggregation Pipeline Simulator
Data Aggregation Pipeline Simulator
Type: microsim
Bloom Level: Apply (L3) Bloom Verb: execute
Learning Objective: Students will execute a simulated data aggregation pipeline by processing sample MicroSim records through normalization, deduplication, and validation stages, observing how each stage transforms the data.
Canvas layout: - Left panel (30%): Input queue with sample records - Center panel (50%): Pipeline visualization with stages - Right panel (20%): Output and statistics
Visual elements: - Input queue showing 8-10 sample metadata records as cards: - Some with missing fields (red highlight) - Some duplicates (yellow highlight) - Some with varying field names (blue highlight) - Some complete and valid (green highlight) - Pipeline stages as connected processing boxes: - Normalize (blue) - Deduplicate (purple) - Validate (orange) - Enrich (green) - Output collection showing final records - Statistics panel: - Records processed - Duplicates removed - Validation failures - Quality score distribution
Sample input records: 1. Complete geometry MicroSim (valid) 2. Physics sim with "name" instead of "title" (needs normalization) 3. Duplicate of #1 from forked repo (will be deduplicated) 4. Chemistry sim missing description (validation failure) 5. Math sim with nested dublinCore fields (needs normalization) 6. Complete biology MicroSim (valid) 7. Empty metadata object (validation failure) 8. CS sim with legacy field names (needs normalization)
Interactive controls: - Button: "Process Next Record" (step through one at a time) - Button: "Process All" (run entire pipeline) - Button: "Reset" - Toggle: "Show detailed logs" - Speed slider: Animation speed
Behavior: - Record cards animate through pipeline stages - Each stage shows transformation applied - Failed records divert to "rejected" bin with explanation - Duplicates merge into single record - Statistics update in real-time - Detailed logs show exact transformations
Animation: - Cards slide through pipeline - Stage boxes pulse when active - Transformation details appear as overlay - Rejected records fade out or move to error bin
Color scheme: - Valid records: Green - Needs normalization: Blue - Duplicates: Yellow - Validation failures: Red - Pipeline stages: Gradient from blue to green
Implementation: p5.js with animated card flow and real-time statistics
Building the Crawler: A Practical Example
Let's walk through building a complete MicroSim crawler. This is the actual approach used for our search system.
Project Structure
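An illustrative layout; the file names are assumptions used consistently in the sketches below:

```
microsim-search/
├── src/
│   ├── crawler.py      # main crawling script
│   ├── aggregate.py    # normalization, deduplication, validation, enrichment
│   └── profiler.py     # quality metrics and reporting
├── data/
│   └── microsims.json  # aggregated collection
└── logs/
    └── crawl.jsonl     # structured crawl logs
```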
The Main Crawler Script
Here's a simplified version of our crawler:
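The sketch below ties the earlier helpers together into a runnable script. The names, paths, and output format are illustrative rather than the project's exact code.

```python
# Sketch of a simplified main crawler script; reuses list_repositories,
# mine_repository, and aggregate from the earlier sketches
import json
import os

GITHUB_API = "https://api.github.com"
KNOWN_USERS = ["dmccreary"]
OUTPUT_PATH = "data/microsims.json"

def build_headers():
    token = os.environ.get("GITHUB_TOKEN")
    return {"Authorization": f"Bearer {token}"} if token else {}

def main():
    headers = build_headers()
    raw_records = []

    for username in KNOWN_USERS:
        print(f"Listing repositories for {username} ...")
        for repo_name in list_repositories(username, os.environ.get("GITHUB_TOKEN")):
            print(f"  Mining {username}/{repo_name}")
            raw_records.extend(mine_repository(username, repo_name, headers))

    print(f"Collected {len(raw_records)} raw metadata records")
    collection, stats = aggregate(raw_records)

    os.makedirs(os.path.dirname(OUTPUT_PATH), exist_ok=True)
    with open(OUTPUT_PATH, "w", encoding="utf-8") as fp:
        json.dump({"microsims": collection}, fp, indent=2)

    print(f"Wrote {stats['output']} MicroSims to {OUTPUT_PATH} "
          f"({stats['duplicates']} duplicates removed, {stats['invalid']} invalid)")

if __name__ == "__main__":
    main()
```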
Running the Crawler
Execute the crawler from the command line:
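For example (the token value is a placeholder and the script path matches the illustrative layout above):

```bash
export GITHUB_TOKEN=ghp_your_token_here
python src/crawler.py
```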
The crawler prints a progress line for each repository it mines and finishes with a summary: how many MicroSims were collected, how many duplicates were removed, and how many records failed validation.
Automation Ready
This crawler can run on a schedule (GitHub Actions, cron job) to keep your collection automatically updated. No manual intervention required!
Incremental Updates: Staying Fresh
Rather than re-crawling everything each time, incremental updates process only what's changed.
Why Incremental?
| Full Crawl | Incremental Update |
|---|---|
| Processes all repositories | Only changed repositories |
| Uses many API calls | Minimal API calls |
| Slow (minutes to hours) | Fast (seconds to minutes) |
| Good for initial load | Good for regular updates |
Detecting Changes
GitHub provides metadata to detect changes:
- `pushed_at`: When the repository was last pushed to
- `updated_at`: When the repository was last updated
- Commit SHA: Hash of the latest commit
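A sketch of a change check based on those timestamps, using the illustrative registry structure shown earlier:

```python
# Sketch: decide whether a repository needs re-crawling
def repo_changed(repo, registry):
    """True if the repository has been pushed to since our last crawl.

    ISO-8601 timestamps in the same UTC format compare correctly as strings.
    """
    last_crawled = registry.get(repo["full_name"], {}).get("last_crawled", "")
    return repo["pushed_at"] > last_crawled
```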
Single-Repository Updates
For quick updates when you know what changed:
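For example, a crawler might accept a hypothetical `--repo` flag; adapt this to whatever interface your crawler actually exposes:

```bash
# Hypothetical single-repository update (flag name is illustrative)
python src/crawler.py --repo dmccreary/geometry-course
```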
This is perfect for:
- After publishing new MicroSims
- Fixing metadata errors
- Testing changes before full crawl
Logging and Monitoring
Good pipelines provide visibility into their operation through logging.
Log Levels
| Level | Use Case | Example |
|---|---|---|
| INFO | Normal operation | "Processing repository X" |
| WARNING | Non-critical issues | "Missing optional field Y" |
| ERROR | Failed operations | "Could not fetch metadata.json" |
| DEBUG | Detailed diagnostics | "API response: {...}" |
Structured Logging (JSONL)
Our crawler logs to JSONL (JSON Lines) format for easy analysis:
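A few illustrative entries; the field names are assumptions, and the messages mirror the log-level table above:

```json
{"timestamp": "2025-06-01T06:00:12Z", "level": "INFO", "repo": "dmccreary/geometry-course", "message": "Processing repository"}
{"timestamp": "2025-06-01T06:00:14Z", "level": "WARNING", "sim": "bouncing-ball", "message": "Missing optional field: learningObjectives"}
{"timestamp": "2025-06-01T06:00:15Z", "level": "ERROR", "sim": "wave-demo", "message": "Could not fetch metadata.json"}
```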
JSONL files are:
- Easy to append (one line per entry)
- Easy to filter (grep, jq)
- Easy to analyze (load into pandas)
Log Analysis
Find issues quickly:
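Some illustrative queries against such a log file (the file name matches the layout sketched earlier):

```bash
# Show all errors from the last crawl
grep '"level": "ERROR"' logs/crawl.jsonl

# Count warnings
grep -c '"level": "WARNING"' logs/crawl.jsonl

# With jq: list the error messages
jq -r 'select(.level == "ERROR") | .message' logs/crawl.jsonl
```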
Quality Metrics and Profiling
Data profiling helps you understand your collection's quality and completeness.
Key Metrics
| Metric | Description | Target |
|---|---|---|
| Total count | Number of MicroSims | Growing! |
| Completeness | % of fields filled | >80% |
| Required fields | % with all required fields | 100% |
| Subject coverage | Distribution across subjects | Balanced |
| Grade coverage | Distribution across levels | Balanced |
The Data Profiler
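A sketch of a profiler that computes the metrics in the table above; the field names and required-field list are assumptions consistent with the earlier validation sketch:

```python
# Sketch: profile the aggregated collection for quality metrics
import json
from collections import Counter

REQUIRED_FIELDS = ["title", "description", "subjectArea"]

def profile_collection(path="data/microsims.json"):
    with open(path, encoding="utf-8") as fp:
        records = json.load(fp)["microsims"]

    total = len(records)
    subjects = Counter(r.get("subjectArea", "Unknown") for r in records)
    grade_levels = Counter(r.get("gradeLevel", "Unspecified") for r in records)

    # Completeness: average fraction of filled fields across the collection
    completeness = sum(
        sum(1 for v in r.values() if v not in (None, "", [], {})) / len(r)
        for r in records if r
    ) / total if total else 0

    # Required-field coverage: share of records with every required field present
    fully_required = sum(
        1 for r in records if all(r.get(f) for f in REQUIRED_FIELDS)
    ) / total if total else 0

    return {
        "total_count": total,
        "avg_completeness_pct": round(completeness * 100, 1),
        "required_fields_pct": round(fully_required * 100, 1),
        "subject_distribution": dict(subjects),
        "grade_distribution": dict(grade_levels),
    }
```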
Generating Reports
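Reports can then be generated from the profiler's output, for example (the module name is an assumption matching the illustrative project layout):

```python
# Illustrative report generation using the profiler sketched above
import json
from profiler import profile_collection

report = profile_collection("data/microsims.json")
with open("data/quality-report.json", "w", encoding="utf-8") as fp:
    json.dump(report, fp, indent=2)
print(f"Profiled {report['total_count']} MicroSims")
```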
The report helps identify:
- Underrepresented subjects (need more content)
- Quality issues (need metadata enrichment)
- Missing required fields (need corrections)
Diagram: Collection Quality Dashboard
MicroSim Collection Quality Dashboard
Type: infographic
Bloom Level: Evaluate (L5) Bloom Verb: assess
Learning Objective: Students will assess MicroSim collection quality by interpreting dashboard metrics and identifying areas needing improvement.
Canvas layout: Dashboard with multiple visualization panels
Panels:
- "Collection Overview" (top-left, 25%):
- Big number: Total MicroSims count
- Trend indicator (up/down from last crawl)
-
Subtitle: "From X repositories"
-
"Subject Distribution" (top-right, 25%):
- Horizontal bar chart showing count per subject
- Subjects: Math, Physics, Chemistry, Biology, CS, Other
- Color coded by subject
-
Tooltip on hover: exact count and percentage
-
"Quality Score Distribution" (middle-left, 25%):
- Pie or donut chart
- Segments: High (80+), Medium (50-79), Low (<50)
- Colors: Green, Yellow, Red
-
Center text: Average score
-
"Grade Level Coverage" (middle-right, 25%):
- Stacked bar or area chart
- Levels: K-5, 6-8, 9-12, Undergraduate, Graduate
-
Shows balance across educational levels
-
"Field Completeness" (bottom-left, 25%):
- Grid of field names with completion percentage
- Color intensity indicates completeness
- Required fields marked with asterisk
-
Fields: title, description, subject, gradeLevel, framework, learningObjectives, etc.
-
"Repository Contributions" (bottom-right, 25%):
- Treemap showing MicroSims per repository
- Size = count, color = average quality
- Click to see repository details
Interactive controls: - Dropdown: "Time period" (current, last week, last month) - Filter by subject (checkboxes) - Toggle: "Show only issues" (highlight problem areas) - Button: "Export report"
Hover behavior: - All charts show detailed tooltips - Clicking drills down to details
Color scheme: - Quality: Red (low) → Yellow (medium) → Green (high) - Subjects: Consistent with other visualizations - Trend indicators: Red (down), Green (up), Gray (unchanged)
Data updates: - Dashboard reflects current collection state - Compare to previous crawl to show trends
Implementation: p5.js or Chart.js with multiple coordinated charts
Scheduling and Automation
Manual crawling doesn't scale. Automation keeps your collection fresh without constant attention.
Scheduling Options
| Method | Best For | Example |
|---|---|---|
| Cron job | Server-based scheduling | 0 2 * * * /path/to/crawl.sh |
| GitHub Actions | Free, integrated with repos | Workflow file in .github/ |
| Cloud Functions | Serverless, event-driven | AWS Lambda, Google Cloud Functions |
| CI/CD pipeline | Part of deployment process | Run after merge to main |
GitHub Actions Example
Our crawler runs daily via GitHub Actions:
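A sketch of such a workflow; the step names, paths, and Python version are illustrative, but the schedule, token use, and commit pattern follow the description below:

```yaml
name: Daily MicroSim Crawl

on:
  schedule:
    - cron: '0 6 * * *'        # every day at 6 AM UTC
  workflow_dispatch:            # allow manual runs from the Actions tab

jobs:
  crawl:
    runs-on: ubuntu-latest
    permissions:
      contents: write           # needed to push the updated collection
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - name: Install dependencies
        run: pip install requests

      - name: Run the crawler
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: python src/crawler.py

      - name: Commit updated collection
        run: |
          git config user.name "microsim-crawler-bot"
          git config user.email "bot@example.com"
          git add data/microsims.json
          git commit -m "Update MicroSim collection" || exit 0
          git push
```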
This workflow:
- Runs automatically every day at 6 AM
- Uses a secure GitHub token for API access
- Commits updated data back to the repository
- Can be triggered manually when needed
The || exit 0 Pattern
The git commit ... || exit 0 pattern prevents the workflow from failing when there are no changes to commit. Smart pipelines handle "nothing to do" gracefully.
Error Handling and Recovery
Robust pipelines handle failures gracefully without losing progress.
Common Failure Modes
| Failure | Cause | Recovery |
|---|---|---|
| Rate limit exceeded | Too many API calls | Wait and retry with backoff |
| Network timeout | Slow/unreliable connection | Retry with longer timeout |
| Invalid JSON | Malformed metadata.json | Log error, skip entry |
| Missing directory | Repo structure changed | Log warning, continue |
| Authentication failure | Expired/invalid token | Alert, fail fast |
Retry with Exponential Backoff
When requests fail, retry with increasing delays:
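A sketch of that pattern in Python; the status codes treated as retryable and the delay schedule are assumptions:

```python
# Sketch: retry a GET request with exponential backoff
import time

import requests

def fetch_with_retry(url, headers=None, max_retries=5, base_delay=1.0):
    """GET a URL, retrying transient failures with exponentially growing delays."""
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, headers=headers, timeout=10)

            # Rate limited or server error: worth waiting and retrying
            if resp.status_code in (429, 500, 502, 503):
                raise requests.RequestException(f"HTTP {resp.status_code}")

            resp.raise_for_status()
            return resp
        except requests.RequestException as err:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            delay = base_delay * (2 ** attempt)   # 1s, 2s, 4s, 8s, ...
            print(f"Request failed ({err}); retrying in {delay:.0f}s")
            time.sleep(delay)
```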
Checkpointing for Long Crawls
For large crawls, save progress periodically:
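A sketch of checkpointing that lets a long crawl resume where it left off; the checkpoint path and save interval are assumptions, and `mine_repository` is the helper sketched earlier:

```python
# Sketch: save crawl progress periodically so a long run can resume after a failure
import json
import os

CHECKPOINT_PATH = "data/checkpoint.json"   # path is illustrative

def load_checkpoint():
    """Return the set of repositories already processed and the records so far."""
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH, encoding="utf-8") as fp:
            state = json.load(fp)
        return set(state["completed"]), state["records"]
    return set(), []

def save_checkpoint(completed, records):
    with open(CHECKPOINT_PATH, "w", encoding="utf-8") as fp:
        json.dump({"completed": sorted(completed), "records": records}, fp)

def crawl_with_checkpoints(repositories, headers, every=5):
    """repositories is a list of (owner, repo) tuples."""
    completed, records = load_checkpoint()
    for i, (owner, repo) in enumerate(repositories, start=1):
        full_name = f"{owner}/{repo}"
        if full_name in completed:
            continue  # already done in a previous run
        records.extend(mine_repository(owner, repo, headers))
        completed.add(full_name)
        if i % every == 0:
            save_checkpoint(completed, records)   # periodic save
    save_checkpoint(completed, records)           # final save
    return records
```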
Key Takeaways
- Data gathering systematically collects information from distributed sources and is essential for building comprehensive MicroSim collections
- Web crawling automates page fetching and parsing, but APIs are preferred when available for reliability and efficiency
- The GitHub API provides structured access to repository contents, with authenticated requests enabling 5,000 calls per hour
- Repository mining extracts metadata.json files from known directory structures, tracking completeness and gaps
- MicroSim repositories are discovered through known creators, code search, and heuristic filtering of repository characteristics
- Data aggregation combines multi-source data through normalization, deduplication, validation, and enrichment stages
- Incremental updates process only changed repositories, dramatically reducing API calls and execution time
- Structured logging (JSONL format) provides visibility into pipeline operation for debugging and monitoring
- Quality metrics reveal collection health: completeness, coverage, and areas needing improvement
- Automation through GitHub Actions or cron keeps collections current without manual intervention
What's Next?
You've learned to build data pipelines that automatically discover, extract, and aggregate MicroSim metadata. Your search system can now maintain a fresh, comprehensive collection with minimal manual effort.
In the final chapter, we'll bring everything together:
- Building complete search interfaces
- Deployment strategies
- Future directions for MicroSim search
- How AI agents can leverage MicroSim search for generation
The infrastructure is ready—now let's make it shine for users!
Ready to explore educational foundations? Continue to Chapter 10: Educational Foundations.