Learning Graph Quality and Validation
Summary
This chapter focuses on validating and assessing the quality of your learning graph. You'll learn techniques for detecting circular dependencies and validating that your graph is a proper Directed Acyclic Graph (DAG). The chapter covers self-dependency checking and introduces comprehensive quality metrics including orphaned nodes, disconnected subgraphs, and linear chain detection.
You'll learn to analyze your graph using indegree and outdegree metrics, calculate average dependencies per concept, and determine the maximum dependency chain length. The chapter culminates in generating an overall learning graph quality score. Additionally, you'll explore taxonomy distribution metrics to ensure balanced category representation and avoid over-representation of any single topic area.
Concepts Covered
This chapter covers the following 16 concepts from the learning graph:
- Circular Dependency Detection
- DAG Validation
- Self-Dependency Checking
- Quality Metrics for Graphs
- Orphaned Nodes
- Disconnected Subgraphs
- Linear Chain Detection
- Indegree Analysis
- Outdegree Analysis
- Average Dependencies Per Concept
- Maximum Dependency Chain Length
- Learning Graph Quality Score
- Taxonomy Categories
- TaxonomyID Abbreviations
- Category Distribution
- Avoiding Over-Representation
Prerequisites
This chapter builds on concepts from:
Introduction to Learning Graph Quality Validation
Creating a learning graph is a significant achievement, but ensuring its quality is equally important for effective educational outcomes. A well-constructed learning graph serves as the foundation for your intelligent textbook, guiding students through concepts in a logical, dependency-aware sequence. Poor quality graphs—those with circular dependencies, orphaned concepts, or imbalanced taxonomy distributions—can confuse learners and undermine the pedagogical value of your materials.
This chapter introduces systematic approaches for validating and assessing the quality of your learning graph. You'll learn both structural validation techniques that ensure your graph is mathematically sound as a Directed Acyclic Graph (DAG), and quality metrics that measure pedagogical effectiveness. These validation techniques are essential for identifying and correcting issues before generating chapter content, as structural problems in your graph will propagate throughout your entire textbook.
The validation process combines automated analysis through Python scripts with manual review of quality reports. By the end of this chapter, you'll be able to generate comprehensive quality assessments for your learning graphs and make data-driven improvements to enhance their educational value.
Directed Acyclic Graphs and Educational Dependencies
Learning graphs must be structured as Directed Acyclic Graphs (DAGs) to represent prerequisite relationships correctly. In a DAG, directed edges point from prerequisite concepts to dependent concepts, and the graph contains no cycles—you cannot follow the dependency arrows and return to your starting concept.
This DAG structure ensures that students can learn concepts in a valid sequence. If your graph contains a cycle (Concept A depends on B, B depends on C, and C depends on A), there is no valid starting point for learning—a logical impossibility that must be detected and corrected.
DAG Validation
Validating that your learning graph is a proper DAG, and that it works as a coherent learning progression, involves checking two critical properties:
- Acyclicity: No circular dependency chains exist in the graph
- Connectivity: All concepts are reachable from foundational nodes
The analyze-graph.py Python script performs DAG validation automatically by implementing a depth-first search (DFS) algorithm with cycle detection. During traversal, the algorithm maintains three node states:
- White (unvisited): Node has not been explored
- Gray (in progress): Node is being explored, currently on the recursion stack
- Black (completed): Node and all its descendants have been fully explored
If the algorithm encounters a gray node during traversal, it has detected a back edge indicating a cycle. This validation runs in O(V + E) time complexity, where V is the number of vertices (concepts) and E is the number of edges (dependencies).
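The following is a minimal sketch of this three-color DFS, assuming the graph is held as an adjacency list mapping each ConceptID to the concepts that depend on it. It illustrates the technique rather than reproducing the analyze-graph.py source.

```python
WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on recursion stack, fully explored

def has_cycle(graph):
    """Return True if the dependency graph contains a cycle.

    `graph` maps each ConceptID to the list of ConceptIDs that depend on it
    (hypothetical structure; illustrative sketch, not the script's code).
    """
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY                       # node enters the recursion stack
        for neighbor in graph.get(node, []):
            state = color.get(neighbor, WHITE)
            if state == GRAY:                    # back edge found: cycle detected
                return True
            if state == WHITE and visit(neighbor):
                return True
        color[node] = BLACK                      # node and descendants fully explored
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)
```

If a gray node is ever reached, the function reports a cycle; otherwise every node ends black and the graph is a valid DAG.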
Diagram: DAG Validation Algorithm Visualization
MicroSim Generator Recommendations:
- vis-network (98/100) - Network graph visualization ideal for displaying DAG validation algorithm with colored nodes
- mermaid-generator (85/100) - Flowchart capabilities support algorithm visualization with decision points
- microsim-p5 (75/100) - Custom interactive visualization possible but requires more development effort
Circular Dependency Detection
Circular dependencies represent the most critical structural flaw in a learning graph. They create logical impossibilities in the learning sequence and must be identified and eliminated before proceeding with content generation.
Common sources of circular dependencies include:
- Bidirectional prerequisites: Concept A requires B, and B requires A
- Multi-hop cycles: A requires B, B requires C, C requires A
- Self-dependencies: A concept incorrectly lists itself as a prerequisite
The analyze-graph.py script reports all cycles found, displaying the complete dependency chain for each cycle. This detailed output allows you to identify which dependency link to remove to break the cycle.
Here's an illustrative example of the kind of output cycle detection produces (the exact wording depends on your version of analyze-graph.py):
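```text
ERROR: Circular dependency detected!
  Cycle: 45 -> 67 -> 82 -> 45
  Remove one of these dependency links to break the cycle.
```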
Self-Dependency Checking
Self-dependencies occur when a concept incorrectly lists its own ConceptID in its dependencies column. While technically a special case of circular dependencies, self-dependencies are so common—often resulting from copy-paste errors in CSV editing—that the validation script checks for them explicitly before running the general cycle detection algorithm.
The self-dependency check is trivial but essential. A minimal sketch, assuming the Dependencies column has already been parsed into a dictionary mapping each ConceptID to its list of prerequisite IDs:
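```python
# Illustrative check with a hypothetical `deps` dict (ConceptID -> prerequisite IDs);
# not the literal analyze-graph.py implementation.
self_deps = [cid for cid, prereqs in deps.items() if cid in prereqs]
if self_deps:
    print(f"Self-dependencies found (fix these rows in learning-graph.csv): {self_deps}")
```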
Any self-dependencies detected indicate data entry errors that should be corrected immediately in your learning-graph.csv file.
Quality Metrics for Learning Graphs
Beyond structural validation, effective learning graphs exhibit certain quality characteristics that enhance their pedagogical value. Quality metrics quantify these characteristics, providing objective measures for assessing and comparing learning graphs.
The following metrics help identify potential issues that, while not structurally invalid, may indicate pedagogical problems or opportunities for improvement.
Orphaned Nodes
An orphaned node is a concept that no other concept depends upon—it has an outdegree of zero. While terminal concepts (endpoints in the learning journey) naturally have no dependents, excessive orphaned nodes suggest concepts that may be:
- Too specialized or advanced for the course scope
- Improperly isolated from the main learning progression
- Missing their dependent concepts due to incomplete graph construction
A well-designed learning graph typically has 5-10% orphaned nodes, representing culminating concepts and specialized topics. If more than 20% of your concepts are orphaned, review them to determine whether they should be connected to later material or removed from the graph entirely.
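As an illustration, orphaned nodes can be listed by checking which ConceptIDs never appear in any other concept's Dependencies column. The sketch below assumes the same hypothetical `deps` dictionary used earlier and is not the script's literal code.

```python
def orphaned_nodes(deps):
    """Concepts that no other concept lists as a prerequisite (outdegree 0)."""
    referenced = {p for prereqs in deps.values() for p in prereqs}
    return [cid for cid in deps if cid not in referenced]

# Example: report the orphan percentage against the guideline above.
# orphans = orphaned_nodes(deps)
# print(f"{100 * len(orphans) / len(deps):.1f}% of concepts are orphaned")
```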
Diagram: Orphaned Nodes Identification Chart
MicroSim Generator Recommendations:
- chartjs-generator (97/100) - Scatter plot chart type directly supports indegree vs outdegree visualization
- bubble-chart-generator (80/100) - Could add third dimension (concept importance) via bubble size
- microsim-p5 (72/100) - Custom scatter plot possible with manual axis and point rendering
Disconnected Subgraphs
A disconnected subgraph is a cluster of concepts isolated from the main learning graph—they have no dependency paths connecting them to foundational concepts. This indicates a serious structural problem: students cannot reach these concepts through the normal learning progression.
Disconnected subgraphs typically result from:
- Copy-pasting concept blocks without establishing connections
- Incomplete dependency mapping during graph construction
- Accidental deletion of bridging concepts
The analyze-graph.py script uses a connectivity analysis algorithm to identify all disconnected components. In a valid learning graph, there should be exactly one connected component containing all concepts. Any additional components indicate isolated concept clusters that need to be integrated into the main graph.
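One common way to perform this connectivity analysis is a breadth-first search over the edges treated as undirected. The sketch below is illustrative (using the hypothetical `deps` dictionary), not the script's implementation:

```python
from collections import defaultdict, deque

def count_components(deps):
    """Number of connected components when dependency edges are treated as undirected."""
    adjacency = defaultdict(set)
    for cid, prereqs in deps.items():
        adjacency[cid]                      # ensure isolated concepts still appear
        for p in prereqs:
            adjacency[cid].add(p)
            adjacency[p].add(cid)

    seen, components = set(), 0
    for start in adjacency:
        if start in seen:
            continue
        components += 1                     # found a new, unvisited component
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node] - seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return components                       # a valid learning graph returns exactly 1
```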
Linear Chain Detection
A linear chain is a sequence of concepts where each concept depends on exactly one predecessor and is depended upon by exactly one successor, forming a single-file progression. While some linear sequences are natural (basic → intermediate → advanced), excessive linear chains indicate missed opportunities for:
- Parallel learning paths that students could explore in different orders
- Cross-concept connections that reinforce understanding
- Flexible curriculum that accommodates different learning styles
Linear chains are identified by checking each concept's indegree and outdegree: a concept sits on a linear chain when both values are exactly one. A minimal sketch, assuming indegree and outdegree counts have already been computed for each ConceptID:
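```python
# Illustrative check with hypothetical indegree/outdegree dicts keyed by ConceptID
# (see the degree-count sketch later in this chapter); not the script's code.
chain_concepts = [
    cid for cid in deps
    if indegree.get(cid, 0) == 1 and outdegree.get(cid, 0) == 1
]
```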
Quality learning graphs typically have 20-40% of concepts in linear chains, with the remainder providing branching paths and concept integration points. If more than 60% of concepts form linear chains, consider adding cross-dependencies to create a richer learning network.
Diagram: Linear Chain vs Network Structure Comparison
MicroSim Generator Recommendations:
- vis-network (95/100) - Network visualization perfectly suited for comparing linear vs networked graph structures
- mermaid-generator (82/100) - Can create side-by-side graph diagrams with different layouts
- microsim-p5 (78/100) - Force-directed graph layout possible but requires physics simulation coding
Graph Analysis Metrics
Quantitative metrics provide objective measures of graph structure and complexity. These metrics help you understand your learning graph's characteristics and compare it to best practices for educational graph design.
Indegree and Outdegree Analysis
Indegree (number of prerequisites) and outdegree (number of dependents) are fundamental graph metrics that reveal concept roles within the learning progression:
- High indegree: Advanced concepts requiring substantial prior knowledge
- Low indegree (0): Foundational concepts accessible without prerequisites
- High outdegree: Core concepts that enable many subsequent topics
- Low outdegree (0): Specialized or terminal concepts
The distribution of indegree values across your learning graph reveals its prerequisite structure:
| Indegree | Interpretation | Typical % of Concepts |
|---|---|---|
| 0 | Foundational concepts | 5-10% |
| 1-2 | Early concepts with minimal prerequisites | 30-40% |
| 3-5 | Intermediate concepts requiring solid foundation | 40-50% |
| 6+ | Advanced concepts requiring extensive background | 5-15% |
If your graph has too many high-indegree concepts (>20% with indegree ≥ 6), consider whether some prerequisites are redundant or if the course scope is too advanced. Conversely, if most concepts have indegree 0-1, you may be missing important prerequisite relationships.
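A minimal sketch for computing these degree counts from the parsed Dependencies column (hypothetical names, not the script's implementation):

```python
def degree_counts(deps):
    """Indegree (prerequisite count) and outdegree (dependent count) per ConceptID.

    `deps` maps each ConceptID to its list of prerequisite IDs, so edges run
    prerequisite -> dependent. Illustrative sketch only.
    """
    indegree = {cid: len(prereqs) for cid, prereqs in deps.items()}
    outdegree = {cid: 0 for cid in deps}
    for prereqs in deps.values():
        for prerequisite in prereqs:
            outdegree[prerequisite] = outdegree.get(prerequisite, 0) + 1
    return indegree, outdegree
```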
Average Dependencies Per Concept
The average dependencies per concept metric indicates overall graph connectivity and curriculum density. It is calculated by dividing the total number of dependency edges by the number of concepts:
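For example, with hypothetical counts:

```python
# A 200-concept graph whose Dependencies column lists 620 prerequisite links in total.
average_dependencies = 620 / 200   # 3.1 dependencies per concept
```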
For educational learning graphs, empirical research suggests optimal ranges:
- 2.0-3.0: Appropriate for introductory courses with linear progressions
- 3.0-4.0: Ideal for intermediate courses with moderate integration
- 4.0-5.0: Suitable for advanced courses with high concept integration
- >5.0: May indicate over-specification of prerequisites
The analyze-graph.py script calculates this metric and flags values outside the recommended 2.0-4.5 range. Graphs with average dependencies below 2.0 may be too linear, while those above 5.0 may impose unrealistic prerequisite burdens on learners.
Diagram: Average Dependencies Distribution Bar Chart
MicroSim Generator Recommendations:
- chartjs-generator (98/100) - Histogram/bar chart with annotations and shaded regions natively supported
- microsim-p5 (70/100) - Custom bar chart rendering with manual annotation placement required
- mermaid-generator (25/100) - Limited chart capabilities, not ideal for detailed histograms
Maximum Dependency Chain Length
The maximum dependency chain length represents the longest sequence of prerequisite concepts from any foundational node to any terminal node. This metric indicates the depth of your curriculum and affects course duration planning.
For a 200-concept learning graph, typical maximum chain lengths are:
- 8-12 concepts: Short course (4-6 weeks)
- 12-18 concepts: Standard semester course (12-15 weeks)
- 18-25 concepts: Extended course or multi-semester sequence
- >25 concepts: May indicate overly linear structure
The chain length affects student progress velocity. If your maximum chain is 20 concepts deep, students must complete at least 20 learning steps to reach the most advanced material—establishing a minimum time investment regardless of study intensity.
Critical path analysis identifies these longest chains, helping you understand pacing requirements and potential bottlenecks in the learning progression. Concepts on the critical path deserve extra attention in content development, as delays in mastering these concepts cascade through all dependent material.
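Because the validated graph is a DAG, the longest prerequisite chain can be found in linear time by visiting concepts in topological order. The sketch below uses Python's standard-library graphlib and the hypothetical `deps` dictionary from earlier; it illustrates the approach rather than the script's exact code.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def max_chain_length(deps):
    """Length, in concepts, of the longest prerequisite chain in a DAG."""
    longest = {}
    # static_order() yields prerequisites before the concepts that depend on them.
    for concept in TopologicalSorter(deps).static_order():
        prereqs = deps.get(concept, [])
        longest[concept] = 1 + max((longest[p] for p in prereqs), default=0)
    return max(longest.values(), default=0)
```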
Learning Graph Quality Score
The overall learning graph quality score provides a single metric (0-100) that aggregates multiple quality dimensions into an interpretable assessment. While individual metrics reveal specific issues, the quality score enables quick comparison and tracking of improvements over time.
The quality scoring algorithm used by analyze-graph.py weights various factors, as shown in the simplified sketch after this list:
Structural Validity (40 points):
- DAG validation passes (20 points)
- No self-dependencies (10 points)
- All concepts in single connected component (10 points)
Connectivity Quality (30 points):
- Orphaned nodes 5-15% of total (10 points, scaled for deviation)
- Average dependencies 2.5-4.0 per concept (10 points, scaled)
- Maximum chain length appropriate for scope (10 points)
Distribution Quality (20 points):
- No linear chains exceeding 20% of graph (10 points)
- Indegree distribution follows expected pattern (10 points)
Taxonomy Balance (10 points):
- No single taxonomy category exceeds 30% (5 points)
- At least 5 taxonomy categories represented (5 points)
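A highly simplified sketch of this weighting, treating each criterion as pass/fail (the actual script scales several components by how far they deviate from their target ranges):

```python
def quality_score(checks):
    """Aggregate a 0-100 score from boolean check results (hypothetical keys)."""
    weights = {
        "dag_valid": 20, "no_self_dependencies": 10, "single_component": 10,
        "orphan_ratio_ok": 10, "avg_dependencies_ok": 10, "chain_length_ok": 10,
        "linear_chains_ok": 10, "indegree_distribution_ok": 10,
        "taxonomy_balance_ok": 5, "taxonomy_breadth_ok": 5,
    }
    return sum(points for name, points in weights.items() if checks.get(name, False))
```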
Interpretation of quality scores:
| Score Range | Quality Level | Interpretation |
|---|---|---|
| 90-100 | Excellent | Publication-ready, well-structured graph |
| 75-89 | Good | Minor improvements recommended |
| 60-74 | Acceptable | Several issues to address before content generation |
| 40-59 | Poor | Significant structural or quality problems |
| 0-39 | Critical | Major revision required |
The quality score should be calculated after every significant graph revision. Track scores over time to ensure your changes improve rather than degrade graph quality.
Diagram: Learning Graph Quality Score Calculator MicroSim
MicroSim Generator Recommendations:
- microsim-p5 (92/100) - Interactive gauge and sliders are core p5.js strengths with DOM controls
- chartjs-generator (65/100) - Can create gauge charts but limited interactivity compared to p5.js
- vis-network (20/100) - Not designed for gauge visualizations or quality scoring interfaces
Taxonomy Distribution and Balance
Beyond graph structure, the distribution of concepts across taxonomy categories affects curriculum balance and learning progression. A well-balanced taxonomy distribution ensures students encounter appropriate variety across knowledge domains without over-concentration in any single area.
Taxonomy Categories
Learning graphs typically categorize concepts using a TaxonomyID field that groups related concepts into domains. Common taxonomy categories for technical courses include:
- FOUND - Foundational concepts and definitions
- BASIC - Basic principles and core ideas
- ARCH - Architecture and system design
- IMPL - Implementation and practical skills
- TOOL - Tools and technologies
- SKILL - Professional skills and practices
- ADV - Advanced topics and specializations
The number and specificity of taxonomy categories varies by subject matter. Introductory courses might use 5-8 broad categories, while specialized courses might employ 10-15 granular categories.
TaxonomyID Abbreviations
TaxonomyIDs use 3-5 letter abbreviations for compactness in CSV files and visualization color-coding. When designing your taxonomy, choose abbreviations that are:
- Distinctive: No two categories should share the same first 3 letters
- Mnemonic: Abbreviation should suggest the full category name
- Consistent: Use similar grammatical forms (nouns vs. adjectives)
Example taxonomy abbreviations:
| TaxonomyID | Full Category Name | Color Code (visualization) |
|---|---|---|
| FOUND | Foundational Concepts | Red |
| BASIC | Basic Principles | Orange |
| ARCH | Architecture & Design | Yellow |
| IMPL | Implementation | Light Green |
| DATA | Data Management | Green |
| TOOL | Tools & Technologies | Light Blue |
| QUAL | Quality Assurance | Blue |
| ADV | Advanced Topics | Purple |
Category Distribution Analysis
The category distribution metric shows what percentage of your total concepts fall into each taxonomy category. This distribution should reflect the emphasis and scope of your course.
Healthy category distributions typically exhibit:
- No single category exceeds 30%: Avoid over-concentration
- Top 3 categories contain 50-70% of concepts: Natural emphasis areas
- At least 5 categories represented: Adequate coverage breadth
- Foundational category at 5-10% of concepts: Appropriate base layer
The taxonomy-distribution.py script generates a detailed report showing both absolute counts and percentages for each category, enabling quick identification of imbalanced distributions.
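A minimal sketch of this percentage calculation, assuming the TaxonomyID column has been read into a list (illustrative only, not the script's code):

```python
from collections import Counter

def category_distribution(taxonomy_ids):
    """Percentage of total concepts in each TaxonomyID category."""
    counts = Counter(taxonomy_ids)
    total = sum(counts.values())
    return {category: round(100 * n / total, 1) for category, n in counts.items()}

# Flag any category above the 30% over-representation threshold:
# flagged = {c: pct for c, pct in category_distribution(ids).items() if pct > 30}
```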
Diagram: Taxonomy Distribution Pie Chart
MicroSim Generator Recommendations:
- chartjs-generator (98/100) - Pie chart with percentage labels and color coding is primary Chart.js use case
- microsim-p5 (68/100) - Custom pie rendering possible but Chart.js provides better built-in features
- venn-diagram-generator (15/100) - Designed for overlapping sets, not category distribution
Avoiding Over-Representation
Over-representation occurs when a single taxonomy category dominates the learning graph, consuming more than 30% of total concepts. This imbalance can result from:
- Scope creep: Course expanded in one area without proportional breadth
- Expert bias: Instructor's specialization over-emphasized
- Incomplete mapping: Other categories insufficiently developed
Over-representation in foundational or basic categories suggests the course may be too introductory, while over-representation in advanced or specialized categories indicates potential accessibility issues for learners.
To correct over-representation:
- Review over-represented category: Identify concepts that could be consolidated or removed
- Expand under-represented categories: Add concepts to balance distribution
- Reclassify borderline concepts: Move concepts to more appropriate categories
- Validate against learning outcomes: Ensure distribution aligns with stated course objectives
The taxonomy distribution report generated by taxonomy-distribution.py flags any categories exceeding the 30% threshold, enabling quick identification of balance issues.
Generating Quality Reports with Python Scripts
The learning graph quality validation process relies on three Python scripts located in the docs/learning-graph/ directory. These scripts analyze your learning-graph.csv file and generate comprehensive quality reports in Markdown format.
analyze-graph.py Script
The analyze-graph.py script performs comprehensive graph validation and quality analysis:
Usage (illustrative invocation; check the script itself for the exact arguments):
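```sh
# Illustrative invocation, run from the repository root; adjust paths as needed.
cd docs/learning-graph
python analyze-graph.py learning-graph.csv
```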
Checks performed:
- CSV format validation
- Self-dependency detection
- Cycle detection (DAG validation)
- Connectivity analysis
- Orphaned node identification
- Linear chain detection
- Indegree/outdegree statistics
- Maximum dependency chain calculation
- Overall quality score computation
Output: Generates quality-metrics.md report file containing all findings, metrics, and a final quality score. Any critical issues (cycles, disconnected subgraphs) are highlighted at the top of the report.
csv-to-json.py Script
The csv-to-json.py script converts your learning graph CSV to vis-network JSON format for visualization:
Usage (illustrative invocation; check the script itself for the exact arguments):
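```sh
# Illustrative invocation; the script writes learning-graph.json alongside the CSV.
cd docs/learning-graph
python csv-to-json.py learning-graph.csv
```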
Functionality:
- Parses CSV with ConceptID, ConceptLabel, Dependencies, TaxonomyID columns
- Generates nodes array with id, label, and group (taxonomy) fields
- Generates edges array with from and to fields (dependency arrows)
- Adds metadata section with graph statistics
- Validates JSON output format
Output: Creates learning-graph.json file that can be loaded by vis-network visualization tools to display your learning graph interactively.
taxonomy-distribution.py Script
The taxonomy-distribution.py script analyzes the distribution of concepts across taxonomy categories:
Usage (illustrative invocation; check the script itself for the exact arguments):
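```sh
# Illustrative invocation; adjust paths and arguments to match your copy of the script.
cd docs/learning-graph
python taxonomy-distribution.py learning-graph.csv
```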
Analysis performed:
- Counts concepts per taxonomy category
- Calculates percentage distribution
- Identifies over-represented categories (>30%)
- Identifies under-represented categories (<3%)
- Generates distribution table and summary statistics
Output: Creates taxonomy-distribution.md report with a table showing each category's count and percentage, plus recommendations for rebalancing if needed.
All three scripts should be run after any changes to your learning graph CSV file. Incorporate the generated reports into your MkDocs navigation to make quality metrics visible to reviewers and collaborators.
Summary and Best Practices
Validating learning graph quality ensures your intelligent textbook rests on a sound pedagogical foundation. This chapter covered both structural validation (DAG properties, connectivity) and quality metrics (orphaned nodes, dependency distribution, taxonomy balance) that collectively determine graph effectiveness.
Key takeaways for maintaining high-quality learning graphs:
- Always validate DAG structure first: Circular dependencies and disconnected subgraphs are critical errors that must be fixed before proceeding
- Target quality scores above 75: Scores in this range indicate graphs ready for content generation
- Monitor taxonomy distribution: Keep any single category below 30% and ensure at least 5 categories represented
- Aim for 2.5-4.0 average dependencies: This range balances prerequisite completeness with learner accessibility
- Accept 5-15% orphaned nodes: Terminal and specialized concepts naturally have no dependents
- Run all three Python scripts after edits: Complete quality assessment requires structural validation, format conversion, and taxonomy analysis
Learning graph validation is iterative. Your first quality score may be low, but systematic application of the techniques in this chapter will guide improvements. Track your quality scores over time, targeting incremental increases until you achieve publication-ready scores above 85.
With a validated, high-quality learning graph in hand, you're ready to proceed to the next phase: converting your graph data to visualization formats and generating the rich content that will bring your intelligent textbook to life.