Concept Taxonomy

This document defines the categorical taxonomy for organizing the 300 concepts in the learning graph.

Taxonomy Categories

Category Name	TaxonomyID	Description
Foundation Concepts	FOUND	Core data science concepts, definitions, and fundamental terminology
Python Environment	PYENV	Python installation, setup, IDE, package management, and development tools
Data Structures	DSTRC	Python and pandas data structures, including DataFrames, lists, and arrays
Data Cleaning	CLEAN	Data preprocessing, handling missing values, validation, and transformation
Visualization	VIZ	Data visualization concepts, matplotlib and seaborn, plotting techniques
Statistics	STATS	Statistical concepts, measures, distributions, and probability
Regression	REGR	Linear regression, model fitting, coefficients, and assumptions
Model Evaluation	EVAL	Model performance metrics, validation, overfitting, cross-validation
Advanced Regression	ADVR	Multiple regression, feature selection, regularization techniques
NumPy Computing	NUMPY	NumPy arrays, matrix operations, linear algebra, vectorization
Machine Learning	ML	Machine learning concepts, training, optimization, gradient descent
Neural Networks	NN	Neural network architecture, layers, activation functions, deep learning
PyTorch	TORCH	PyTorch library, tensors, autograd, training loops
Best Practices	BEST	Explainability, reproducibility, ethics, documentation, version control
Projects	PROJ	Capstone projects, end-to-end pipelines, deployment, communication
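
The table above maps naturally onto a lookup structure. The following is a minimal sketch, assuming concepts are stored as (name, TaxonomyID) pairs; the sample concept names and the `find_unknown_ids` helper are hypothetical and not part of the learning graph itself.

```python
# TaxonomyID -> category name, copied from the table above.
TAXONOMY = {
    "FOUND": "Foundation Concepts",
    "PYENV": "Python Environment",
    "DSTRC": "Data Structures",
    "CLEAN": "Data Cleaning",
    "VIZ": "Visualization",
    "STATS": "Statistics",
    "REGR": "Regression",
    "EVAL": "Model Evaluation",
    "ADVR": "Advanced Regression",
    "NUMPY": "NumPy Computing",
    "ML": "Machine Learning",
    "NN": "Neural Networks",
    "TORCH": "PyTorch",
    "BEST": "Best Practices",
    "PROJ": "Projects",
}


def find_unknown_ids(concepts):
    """Return the (name, TaxonomyID) pairs whose ID is not in TAXONOMY."""
    return [(name, tid) for name, tid in concepts if tid not in TAXONOMY]


# Hypothetical sample rows, used only to exercise the check.
sample = [
    ("Pandas DataFrame", "DSTRC"),
    ("Gradient Descent", "ML"),
    ("Mystery Concept", "XYZ"),
]
unknown = find_unknown_ids(sample)
print(unknown)
```

A check like this can catch mistyped TaxonomyIDs before the concepts are grouped by category.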

Category Descriptions

FOUND - Foundation Concepts

Core concepts that introduce data science terminology and fundamental ideas. These are typically the first concepts students encounter and form the basis for all other learning.

PYENV - Python Environment

Concepts related to setting up and managing the Python development environment, including installation, package managers, virtual environments, and IDEs.

DSTRC - Data Structures

Python native data structures (lists, dictionaries, tuples) and pandas structures (DataFrame, Series), the containers in which data is held and manipulated.

CLEAN - Data Cleaning

Techniques for preparing data for analysis including handling missing values, detecting outliers, removing duplicates, and transforming data.

VIZ - Visualization

Data visualization concepts using matplotlib and seaborn, including various plot types, customization, and best practices for visual communication.

STATS - Statistics

Statistical foundations including descriptive statistics, probability, distributions, correlation, and hypothesis testing.

REGR - Regression

Linear regression concepts from simple to multiple regression, including model fitting, interpretation, and assumptions.

EVAL - Model Evaluation

Techniques for assessing model performance, including metrics, train/test splits, cross-validation, and understanding overfitting/underfitting.
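
To make the splitting ideas concrete, here is a minimal pure-Python sketch of a train/test split and k-fold cross-validation indices. The function names, the 20% test fraction, and the fixed seed are illustrative assumptions; in practice a library such as scikit-learn provides equivalent utilities.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and return (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_indices(n_rows, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = [i for i in range(n_rows) if i < start or i >= stop]
        yield train_idx, val_idx
        start = stop


rows = list(range(10))  # stand-in for 10 data rows
train, test = train_test_split(rows)
folds = list(k_fold_indices(len(rows), k=5))
```

Each row appears in exactly one validation fold, so every observation is used for both training and validation across the k rounds.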

ADVR - Advanced Regression

Advanced modeling techniques including multiple regression, feature engineering, regularization (Ridge, Lasso), and non-linear models.

NUMPY - NumPy Computing

NumPy library concepts including array operations, broadcasting, vectorization, and linear algebra for efficient computation.
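
As a brief illustration of what this category covers, the sketch below shows broadcasting, a vectorized operation, and a small linear solve. The array values are made up for the example.

```python
import numpy as np

# A small made-up dataset: 3 rows, 2 columns.
data = np.array([[1.0, 2.0],
                 [3.0, 4.0],
                 [5.0, 6.0]])

# Column means have shape (2,); subtracting them from the (3, 2) array
# relies on broadcasting to apply the means to every row.
col_means = data.mean(axis=0)
centered = data - col_means

# Vectorization: square every element without an explicit Python loop.
squared = centered ** 2

# Linear algebra: solve the system A @ x = b.
A = np.array([[2.0, 0.0],
              [0.0, 4.0]])
b = np.array([2.0, 8.0])
x = np.linalg.solve(A, b)
```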

ML - Machine Learning

Introduction to machine learning paradigms, supervised/unsupervised learning, optimization algorithms, and gradient descent.
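
Gradient descent is the optimization idea most often used to introduce this category. The following is a minimal sketch, assuming a one-feature linear model fit by minimizing mean squared error; the function name, learning rate, step count, and data are illustrative choices.

```python
def gradient_descent(xs, ys, lr=0.05, steps=2000):
    """Fit y = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current model on every point.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Made-up data generated from y = 2x + 1 with no noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = gradient_descent(xs, ys)
```

With this noiseless data the fitted parameters approach w = 2 and b = 1. A learning rate that is too large would make the updates diverge instead.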

NN - Neural Networks

Artificial neural network concepts including architecture, activation functions, forward/backward propagation, and deep learning.

TORCH - PyTorch

PyTorch-specific concepts including tensors, autograd, neural network modules, optimizers, and training workflows.

BEST - Best Practices

Professional practices including explainability, reproducibility, documentation, version control, and ethical considerations.

PROJ - Projects

Applied concepts for real-world projects including end-to-end pipelines, model deployment, and communicating results.