Glossary of Terms
ISO Definition
A term definition is considered consistent with the ISO/IEC 11179 metadata registry standard if it meets the following criteria:
- Precise
- Concise
- Distinct
- Non-circular
- Unencumbered with business rules
Accuracy
Accuracy is a metric used to evaluate classification models, representing the proportion of correct predictions over the total number of predictions.
Example: In a spam email classifier, if the model correctly identifies 90 out of 100 emails, the accuracy is 90%.
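A minimal sketch of this calculation, assuming scikit-learn is available (the labels below are made up for illustration):

    from sklearn.metrics import accuracy_score

    y_true = [1, 0, 1, 1, 0]   # actual labels (illustrative)
    y_pred = [1, 0, 0, 1, 0]   # model predictions (illustrative)

    # accuracy = correct predictions / total predictions
    print(accuracy_score(y_true, y_pred))  # 0.8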
Algorithm
An algorithm is a step-by-step procedure or set of rules designed to perform a specific task or solve a problem.
Example: Implementing the k-means clustering algorithm to group similar data points in an unsupervised learning task.
Anaconda
Anaconda is a free, open-source distribution of the Python and R programming languages for scientific computing and data science.
Example: Using Anaconda to manage Python packages and environments for data analysis projects in the course.
Analytics
Analytics involves examining data sets to draw conclusions about the information they contain, often using specialized software and statistical techniques.
Example: Performing customer behavior analytics using Pandas and Matplotlib to improve marketing strategies.
API (Application Programming Interface)
An API is a set of protocols and tools that allow different software applications to communicate with each other.
Example: Utilizing the Twitter API to collect real-time tweets for sentiment analysis in Python.
Array
An array is a data structure that stores a collection of items at contiguous memory locations, allowing for efficient indexing.
Example: Using NumPy arrays to perform vectorized operations for faster numerical computations.
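A short sketch of a vectorized NumPy operation (the values are illustrative):

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([10.0, 20.0, 30.0])

    # element-wise operations run in compiled code, with no explicit Python loop
    print(a + b)          # [11. 22. 33.]
    print((a * b).sum())  # 140.0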
Artificial Intelligence (AI)
AI is the simulation of human intelligence processes by machines, especially computer systems, enabling them to perform tasks that typically require human intelligence.
Example: Exploring AI concepts by implementing machine learning models that can recognize images or understand natural language.
Attribute
An attribute refers to a variable or feature in a dataset that represents a characteristic of the data points.
Example: In a dataset of cars, attributes might include horsepower, weight, and fuel efficiency.
AUC (Area Under the Curve)
AUC is a performance metric for classification models, representing the area under the Receiver Operating Characteristic (ROC) curve.
Example: Comparing models by evaluating their AUC scores to determine which has better classification performance.
Bagging
Bagging, short for Bootstrap Aggregating, is an ensemble learning technique that improves model stability and accuracy by combining predictions from multiple models trained on random subsets of the data.
Example: Implementing bagging with decision trees to reduce variance and prevent overfitting in the course project.
Bar Chart
A bar chart is a graphical representation of data using rectangular bars to show the frequency or value of different categories.
Example: Creating a bar chart with Matplotlib to visualize the count of different species in an ecological dataset.
Bias
Bias refers to the error introduced by approximating a real-world problem, which may be complex, by a much simpler model.
Example: Recognizing high bias in a linear model that underfits the data during regression analysis.
Bias-Variance Tradeoff
The bias-variance tradeoff is the balance between error from overly simplistic model assumptions (bias) and error from sensitivity to fluctuations in the training data (variance); decreasing one typically increases the other.
Example: Adjusting the complexity of a model to find the point where the combined error from bias and variance is lowest.
Big Data
Big Data refers to datasets that are too large or complex for traditional data-processing software to handle efficiently.
Example: Discussing how tools like Hadoop or Spark can process big data in the context of data science.
Box Plot
A box plot is a graphical representation of data that displays the distribution through its quartiles, highlighting the median and any outliers.
Example: Using Seaborn to create box plots for visualizing the distribution of test scores across different classrooms.
Bootstrapping
Bootstrapping is a statistical resampling technique that involves repeatedly drawing samples from a dataset with replacement to estimate a population parameter.
Example: Applying bootstrapping methods to estimate confidence intervals for a sample mean in a data analysis assignment.
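A minimal sketch of a bootstrap confidence interval for a sample mean, using NumPy (the sample values are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    sample = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.2])  # illustrative data

    # draw many resamples with replacement and record each resample's mean
    boot_means = [rng.choice(sample, size=len(sample), replace=True).mean()
                  for _ in range(5000)]

    # the 2.5th and 97.5th percentiles give an approximate 95% confidence interval
    low, high = np.percentile(boot_means, [2.5, 97.5])
    print(low, high)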
Classification
Classification is a supervised learning task where the goal is to predict discrete labels or categories for given input data.
Example: Building a logistic regression model to classify emails as spam or not spam.
Clustering
Clustering is an unsupervised learning technique that groups similar data points together based on their features.
Example: Using k-means clustering to segment customers into different groups based on purchasing behavior.
Confusion Matrix
A confusion matrix is a table used to evaluate the performance of a classification model by comparing predicted and actual labels.
Example: Analyzing a confusion matrix to calculate precision and recall for a disease diagnosis model.
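A minimal sketch using scikit-learn (the labels are illustrative):

    from sklearn.metrics import confusion_matrix

    y_true = [1, 0, 1, 1, 0, 0]   # actual labels (illustrative)
    y_pred = [1, 0, 0, 1, 0, 1]   # predicted labels (illustrative)

    # rows are actual classes, columns are predicted classes
    print(confusion_matrix(y_true, y_pred))
    # [[2 1]
    #  [1 2]]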
Correlation
Correlation measures the statistical relationship between two variables, indicating the strength and direction of their association.
Example: Calculating the correlation coefficient between hours studied and exam scores to determine their relationship.
Cross-Validation
Cross-validation is a technique for assessing how a predictive model will perform on an independent dataset by partitioning the data into complementary subsets for training and validation.
Example: Using k-fold cross-validation to evaluate the generalization performance of a machine learning model.
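A minimal sketch of k-fold cross-validation with scikit-learn, using the built-in iris dataset for illustration:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)

    # 5-fold cross-validation: train on 4 folds, validate on the held-out fold
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    print(scores.mean())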
CSV (Comma-Separated Values)
CSV is a file format that uses commas to separate values, commonly used for storing tabular data.
Example: Importing a CSV file into a Pandas DataFrame to begin data analysis.
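A minimal sketch with Pandas ('data.csv' is a placeholder path, not a real file):

    import pandas as pd

    df = pd.read_csv("data.csv")   # replace "data.csv" with an actual file path
    print(df.head())               # first five rows
    df.info()                      # column names, types, and non-null counts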
DataFrame
A DataFrame is a two-dimensional labeled data structure in Pandas, similar to a spreadsheet or SQL table.
Example: Manipulating data stored in a DataFrame to clean and prepare it for analysis.
Data Mining
Data mining is the process of discovering patterns and knowledge from large amounts of data.
Example: Extracting useful information from a large customer database to identify purchasing trends.
Data Science
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract insights from structured and unstructured data.
Example: Applying data science techniques to analyze social media data for sentiment analysis.
Data Visualization
Data visualization is the graphical representation of data to help people understand complex data easily.
Example: Creating charts and plots with Matplotlib or Seaborn to present findings.
Decision Tree
A decision tree is a flowchart-like structure used for making decisions or predictions based on input features.
Example: Building a decision tree classifier to predict whether a loan application should be approved.
Deep Learning
Deep learning is a subset of machine learning involving neural networks with multiple layers that can learn representations from data.
Example: Exploring deep learning concepts by creating a neural network for image recognition tasks.
Dimensionality Reduction
Dimensionality reduction involves reducing the number of input variables in a dataset while retaining as much information as possible.
Example: Using Principal Component Analysis (PCA) to reduce features before training a model.
Distribution
A distribution describes how values of a variable are spread or dispersed.
Example: Plotting the normal distribution of test scores to analyze class performance.
Dummy Variable
A dummy variable is a binary variable created to include categorical data in regression models.
Example: Converting categorical variables like 'Gender' into dummy variables for a regression analysis.
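A minimal sketch with Pandas (the 'Gender' values are illustrative):

    import pandas as pd

    df = pd.DataFrame({"Gender": ["M", "F", "F", "M"]})  # illustrative data

    # get_dummies creates one binary column per category;
    # drop_first avoids redundant columns in regression models
    dummies = pd.get_dummies(df, columns=["Gender"], drop_first=True)
    print(dummies)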
Encoding
Encoding transforms data into a different format using a specific scheme.
Example: Applying one-hot encoding to convert categorical variables into numerical format for machine learning models.
Ensemble Learning
Ensemble learning combines predictions from multiple machine learning models to improve overall performance.
Example: Using a random forest, which is an ensemble of decision trees, to enhance prediction accuracy.
Exploratory Data Analysis (EDA)
EDA is an approach to analyzing data sets to summarize their main characteristics, often using visual methods.
Example: Performing EDA to detect anomalies and patterns before building predictive models.
Feature Engineering
Feature engineering involves creating new input features from existing ones to improve model performance.
Example: Combining 'Date of Birth' and 'Current Date' to create a new feature 'Age' for a predictive model.
Feature Scaling
Feature scaling adjusts the range of features in the data to ensure they contribute equally to the model.
Example: Applying standardization to features before using gradient descent algorithms.
Feature Selection
Feature selection is the process of selecting a subset of relevant features for model construction.
Example: Using correlation analysis to remove redundant features that do not improve the model.
Function
In programming, a function is a block of organized, reusable code that performs a single action.
Example: Defining a Python function to calculate the mean of a list of numbers in data analysis.
F1 Score
The F1 score is the harmonic mean of precision and recall, used as a measure of a test's accuracy.
Example: Evaluating a classification model with imbalanced classes using the F1 score.
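A minimal sketch showing that the F1 score equals the harmonic mean of precision and recall (labels are illustrative):

    from sklearn.metrics import f1_score, precision_score, recall_score

    y_true = [1, 1, 1, 0, 0, 0, 0, 0]   # illustrative labels
    y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

    p = precision_score(y_true, y_pred)   # TP / (TP + FP)
    r = recall_score(y_true, y_pred)      # TP / (TP + FN)
    print(f1_score(y_true, y_pred), 2 * p * r / (p + r))  # the two values match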
Gradient Boosting
Gradient Boosting is an ensemble technique that builds models sequentially, each correcting the errors of the previous one.
Example: Implementing Gradient Boosting Machines (GBM) to improve prediction accuracy on complex datasets.
Gradient Descent
Gradient descent is an optimization algorithm used to minimize the cost function in machine learning models.
Example: Using gradient descent to find the optimal weights in a linear regression model.
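A toy illustration of gradient descent for simple linear regression, written from scratch with NumPy (not a production implementation; the data and learning rate are made up):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([3.0, 5.0, 7.0, 9.0])   # underlying relationship: y = 2x + 1
    w, b, lr = 0.0, 0.0, 0.01            # initial parameters and learning rate

    for _ in range(5000):
        y_hat = w * x + b
        # gradients of the mean squared error with respect to w and b
        grad_w = 2 * ((y_hat - y) * x).mean()
        grad_b = 2 * (y_hat - y).mean()
        w -= lr * grad_w
        b -= lr * grad_b

    print(w, b)   # approaches 2 and 1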
Grid Search
Grid search is a hyperparameter optimization technique that exhaustively searches through a specified subset of hyperparameters.
Example: Applying grid search to find the best combination of parameters for a support vector machine classifier.
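A minimal sketch with scikit-learn's GridSearchCV, using the iris dataset and an illustrative parameter grid:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

    # try every combination of parameters with 5-fold cross-validation
    search = GridSearchCV(SVC(), param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)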
Histogram
A histogram is a graphical representation showing the distribution of numerical data by depicting the number of data points that fall within specified ranges.
Example: Creating a histogram to visualize the frequency distribution of ages in a dataset.
Hyperparameter Tuning
Hyperparameter tuning involves adjusting the parameters that govern the training process of a model to improve performance.
Example: Tuning the number of trees and depth in a random forest model to achieve better accuracy.
Hypothesis Testing
Hypothesis testing is a statistical method used to make decisions about the properties of a population based on sample data.
Example: Conducting a t-test to determine if there is a significant difference between two groups' means.
Imputation
Imputation is the process of replacing missing data with substituted values.
Example: Filling missing values in a dataset with the mean or median of the column.
Inferential Statistics
Inferential statistics use a random sample of data taken from a population to describe and make inferences about the population.
Example: Estimating the average height of all students in a university by sampling a subset.
Interpolation
Interpolation is a method of estimating unknown values that fall between known data points.
Example: Using interpolation to estimate missing temperature readings in a time series dataset.
JSON (JavaScript Object Notation)
JSON is a lightweight data-interchange format that is easy for humans to read and write and for machines to parse and generate.
Example: Reading data from a JSON file into a Pandas DataFrame for analysis.
Jupyter Notebook
Jupyter Notebook is an open-source web application that allows you to create and share documents containing live code, equations, visualizations, and narrative text.
Example: Using Jupyter Notebook to write Python code and document the data analysis process.
k-means Clustering
k-means clustering is an unsupervised learning algorithm that partitions data into k distinct clusters based on feature similarity.
Example: Segmenting customers into groups based on purchasing behavior using k-means clustering.
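A minimal sketch with scikit-learn (the 2-D points are made up to form two loose groups):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_)           # cluster assignment for each point
    print(kmeans.cluster_centers_)  # coordinates of the two centroids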
K-Nearest Neighbors (KNN)
KNN is a simple, supervised machine learning algorithm that classifies new cases based on the majority class of their k nearest neighbors.
Example: Implementing KNN to predict whether a patient has a certain disease based on symptoms.
Label Encoding
Label encoding converts categorical text data into numerical values by assigning a unique integer to each category.
Example: Transforming the 'Color' feature into numerical labels before model training.
Learning Rate
The learning rate is a hyperparameter that controls how much we adjust the model weights with respect to the loss gradient.
Example: Setting an appropriate learning rate in gradient descent to ensure the model converges.
Linear Regression
Linear regression is a supervised learning algorithm that models the relationship between a dependent variable and one or more independent variables.
Example: Predicting house prices based on features like size and location using linear regression.
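A minimal sketch with scikit-learn (the size and price values are illustrative):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    X = np.array([[1000], [1500], [2000], [2500]])       # house size (sq ft)
    y = np.array([200000, 290000, 410000, 500000])       # price (illustrative)

    model = LinearRegression().fit(X, y)
    print(model.coef_, model.intercept_)   # learned slope and intercept
    print(model.predict([[1800]]))         # predicted price for a new house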
Logistic Regression
Logistic regression is a classification algorithm used to predict the probability of a categorical dependent variable.
Example: Using logistic regression to determine the likelihood of a customer churning.
Machine Learning
Machine learning is a subset of AI that focuses on building systems that learn from and make decisions based on data.
Example: Implementing various machine learning algorithms to solve classification and regression problems in the course.
Matplotlib
Matplotlib is a Python library used for creating static, animated, and interactive visualizations.
Example: Plotting data trends using Matplotlib to support data analysis conclusions.
Missing Data
Missing data occurs when no value is stored for a variable in an observation, which can impact data analysis.
Example: Identifying and handling missing data in a dataset before model training.
Model Selection
Model selection involves choosing the best model from a set of candidates based on their predictive performance.
Example: Comparing different algorithms like decision trees and logistic regression to select the best model for a classification task.
Multicollinearity
Multicollinearity occurs when independent variables in a regression model are highly correlated, which can affect the model's stability.
Example: Detecting multicollinearity using the Variance Inflation Factor (VIF) and addressing it in the dataset.
Natural Language Processing (NLP)
NLP is a field of AI that gives computers the ability to understand, interpret, and generate human language.
Example: Analyzing text data for sentiment analysis using NLP techniques.
Neural Network
A neural network is a model composed of layers of interconnected nodes (neurons), loosely inspired by the human brain, that learns to recognize patterns and solve complex problems.
Example: Building a simple neural network to classify images of handwritten digits.
Normal Distribution
The normal distribution is a continuous probability distribution characterized by a symmetrical, bell-shaped curve.
Example: Assuming normal distribution of residuals in linear regression models.
Normalization
Normalization scales data to fit within a specific range, often between 0 and 1, to ensure all features contribute equally.
Example: Applying Min-Max normalization to features before training a neural network.
NumPy
NumPy is a Python library used for working with arrays and providing functions for mathematical operations on large, multi-dimensional arrays and matrices.
Example: Using NumPy arrays for efficient numerical computations in data science projects.
One-Hot Encoding
One-hot encoding converts categorical variables into a binary matrix representation.
Example: Transforming the 'Country' feature into multiple binary columns representing each country.
Optimization
Optimization involves adjusting the inputs or parameters of a model to minimize or maximize some objective function.
Example: Optimizing the weights in a neural network to reduce the loss function during training.
Outlier
An outlier is a data point that differs significantly from other observations, potentially indicating variability in measurement or experimental errors.
Example: Identifying outliers in a dataset using box plots and deciding whether to remove or transform them.
Overfitting
Overfitting occurs when a model learns the training data too well, capturing noise and details that negatively impact its performance on new data.
Example: Preventing overfitting by using regularization techniques and cross-validation.
Pandas
Pandas is a Python library providing high-performance, easy-to-use data structures and data analysis tools.
Example: Using Pandas DataFrames to manipulate and analyze tabular data in the course.
Parameter
A parameter is a configuration variable that is internal to the model and estimated from data.
Example: The coefficients in a linear regression model are parameters learned during training.
PCA (Principal Component Analysis)
PCA is a dimensionality reduction technique that transforms data into a new coordinate system, reducing the number of variables while retaining most information.
Example: Applying PCA to reduce the dimensionality of a dataset before clustering.
Pipeline
A pipeline is a sequence of data processing components or steps, where the output of one component is the input to the next.
Example: Creating a scikit-learn pipeline to standardize data and train a model in a single workflow.
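A minimal sketch of a scikit-learn pipeline, using the iris dataset for illustration:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_iris(return_X_y=True)

    # each step's output feeds the next; fit() runs the whole chain
    pipe = Pipeline([("scale", StandardScaler()),
                     ("model", LogisticRegression(max_iter=1000))])
    pipe.fit(X, y)
    print(pipe.score(X, y))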
Precision
Precision is a metric that measures the proportion of true positives among all positive predictions.
Example: Calculating precision to evaluate a model where false positives are costly, such as in fraud detection.
Predictive Modeling
Predictive modeling uses statistical techniques and historical data to build models that forecast future outcomes.
Example: Building a predictive model to forecast sales based on historical data.
Probability Distribution
A probability distribution describes how the values of a random variable are distributed.
Example: Using the normal distribution to model the heights of individuals in a population.
Python
Python is a high-level, interpreted programming language known for its readability and versatility in data science.
Example: Writing Python scripts to automate data cleaning and analysis tasks.
Random Forest
Random Forest is an ensemble learning method that constructs multiple decision trees during training and outputs the mode of the classes for classification tasks.
Example: Implementing a random forest classifier to improve accuracy over a single decision tree.
Regression
Regression is a set of statistical processes for estimating the relationships among variables.
Example: Performing linear regression to understand how the price of a house varies with its size.
Regularization
Regularization adds a penalty to the loss function to prevent overfitting by discouraging complex models.
Example: Applying Lasso regularization to reduce overfitting in a regression model.
Recall
Recall is a metric that measures the proportion of actual positives correctly identified.
Example: Evaluating recall in a medical diagnosis model where missing a positive case is critical.
ROC Curve (Receiver Operating Characteristic Curve)
An ROC curve is a graphical plot illustrating the diagnostic ability of a binary classifier as its discrimination threshold is varied.
Example: Plotting the ROC curve to select the optimal threshold for a classification model.
Root Mean Squared Error (RMSE)
RMSE is the square root of the average squared differences between values predicted by a model and the actual values.
Example: Using RMSE to assess the performance of a regression model predicting housing prices.
Sampling
Sampling involves selecting a subset of data from a larger dataset to estimate characteristics of the whole population.
Example: Drawing a random sample from a large dataset to make computations more manageable.
Scikit-learn
Scikit-learn is a Python library for machine learning that provides simple and efficient tools for data analysis and modeling.
Example: Using scikit-learn to implement machine learning algorithms like SVMs and random forests.
Seaborn
Seaborn is a Python data visualization library based on Matplotlib that provides a high-level interface for drawing attractive statistical graphics.
Example: Creating complex visualizations like heatmaps and violin plots using Seaborn.
SMOTE (Synthetic Minority Over-sampling Technique)
SMOTE is a technique used to address class imbalance by generating synthetic samples of the minority class.
Example: Applying SMOTE to balance the dataset before training a classifier on imbalanced data.
Standard Deviation
Standard deviation measures the amount of variation or dispersion in a set of values.
Example: Calculating the standard deviation to understand the spread of exam scores in a class.
StandardScaler
StandardScaler is a scikit-learn tool that standardizes features by removing the mean and scaling to unit variance.
Example: Using StandardScaler to preprocess data before feeding it into a machine learning algorithm.
Statistical Significance
Statistical significance indicates that the result of a test is unlikely to have occurred by chance alone.
Example: Interpreting p-values to determine if the difference between two groups is statistically significant.
Supervised Learning
Supervised learning is a type of machine learning where models are trained using labeled data.
Example: Training a supervised learning model to predict house prices based on historical data.
Time Series
Time series data is a sequence of data points collected or recorded at successive time intervals.
Example: Analyzing stock prices over time to forecast future market trends.
Tokenization
Tokenization is the process of breaking text into smaller units called tokens, often words or phrases.
Example: Tokenizing text data for input into a natural language processing model.
Training Set
A training set is a subset of the dataset used to train machine learning models.
Example: Splitting data into training and test sets to build and evaluate a model.
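A minimal sketch of a train/test split with scikit-learn, using the iris dataset for illustration:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)

    # hold out 20% of the data for testing; the rest is the training set
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    print(X_train.shape, X_test.shape)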
T-test
A t-test is a statistical test used to compare the means of two groups.
Example: Performing a t-test to determine if there is a significant difference in test scores between two classes.
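A minimal sketch of a two-sample t-test, assuming SciPy is installed (the scores are made up for illustration):

    from scipy import stats

    class_a = [78, 85, 90, 72, 88, 95, 81]   # illustrative exam scores
    class_b = [70, 75, 80, 68, 74, 79, 72]

    # two-sample t-test: is the difference in means statistically significant?
    t_stat, p_value = stats.ttest_ind(class_a, class_b)
    print(t_stat, p_value)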
Underfitting
Underfitting occurs when a model is too simple and fails to capture the underlying pattern of the data.
Example: Addressing underfitting by increasing the complexity of the model or adding more features.
Unsupervised Learning
Unsupervised learning involves training models on data without labeled responses, aiming to find hidden patterns.
Example: Using unsupervised learning techniques like clustering to segment customers.
Validation Set
A validation set is a subset of the dataset used to tune hyperparameters and prevent overfitting during model training.
Example: Using a validation set to adjust the learning rate and number of layers in a neural network.
Variance
Variance measures how far a set of numbers is spread out from their average value.
Example: Calculating the variance to understand the variability in a dataset.
Visualization
Visualization refers to the graphical representation of information and data.
Example: Creating line charts and scatter plots to visualize trends and relationships in the data.
Weight
In machine learning models, weights are parameters that are learned during training to map input features to outputs.
Example: Adjusting weights in a neural network during training to minimize the loss function.
Z-score
A z-score indicates how many standard deviations an element is from the mean.
Example: Calculating z-scores to identify outliers in a dataset.
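A minimal sketch of flagging outliers by z-score with NumPy (the data and the threshold of 2 are illustrative choices):

    import numpy as np

    data = np.array([50, 52, 49, 51, 50, 53, 48, 52, 51, 95])  # one obvious outlier
    z = (data - data.mean()) / data.std()

    # values more than about 2-3 standard deviations from the mean are candidate outliers
    print(z)
    print(data[np.abs(z) > 2])   # flags the value 95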
XGBoost
XGBoost is an optimized distributed gradient boosting library designed to be highly efficient and flexible.
Example: Implementing XGBoost to improve model performance on a classification task.
Hyperparameter
A hyperparameter is a configuration setting that is external to the model and is set before training rather than estimated from the data.
Example: Setting the number of neighbors in a KNN algorithm as a hyperparameter to tune.
Kernel
In machine learning, a kernel is a function used in algorithms like SVM to transform data into a higher-dimensional space.
Example: Choosing a radial basis function (RBF) kernel for an SVM to handle non-linear data.
Lasso Regression
Lasso regression is a type of linear regression that uses L1 regularization to reduce overfitting and perform feature selection.
Example: Applying lasso regression to identify the most important features in a dataset.
Mean Absolute Error (MAE)
MAE is the average of the absolute differences between predicted values and actual values.
Example: Evaluating a regression model by calculating the MAE between predicted and actual values.
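A minimal sketch with scikit-learn (the values are illustrative):

    from sklearn.metrics import mean_absolute_error

    y_true = [3.0, 5.0, 2.5, 7.0]   # actual values (illustrative)
    y_pred = [2.5, 5.0, 3.0, 8.0]   # predicted values (illustrative)

    # average of the absolute differences: (0.5 + 0 + 0.5 + 1) / 4 = 0.5
    print(mean_absolute_error(y_true, y_pred))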
Pearson Correlation Coefficient
The Pearson correlation coefficient measures the linear correlation between two variables.
Example: Calculating the Pearson coefficient to assess the strength of the relationship between two features.
R-Squared (Coefficient of Determination)
R-squared is a statistical measure that represents the proportion of the variance for a dependent variable that's explained by independent variables.
Example: Interpreting an R-squared value of 0.85 to mean that 85% of the variance in the dependent variable is predictable.
Sampling Bias
Sampling bias occurs when some members of a population are systematically more likely to be selected in a sample than others.
Example: Ensuring random sampling in data collection to avoid sampling bias.
Univariate Analysis
Univariate analysis examines each variable individually to summarize and find patterns.
Example: Performing univariate analysis on the 'Age' feature to understand its distribution.
Variance Inflation Factor (VIF)
VIF quantifies the severity of multicollinearity in regression analysis.
Example: Calculating VIF to detect multicollinearity and decide whether to remove correlated features.
White Noise
White noise refers to a time series of uncorrelated random values with constant mean and constant variance.
Example: Checking residuals for white noise to validate the assumptions of a time series model.
Cross-Entropy Loss
Cross-entropy loss measures the performance of a classification model whose output is a probability between 0 and 1.
Example: Using cross-entropy loss as the loss function in a logistic regression model.
Epoch
An epoch refers to one complete pass through the entire training dataset.
Example: Training a neural network for 10 epochs to optimize the weights.
Fitting
Fitting a model involves adjusting its parameters to best match the data.
Example: Fitting a linear regression model to the training data by minimizing the cost function.
Hyperplane
A hyperplane is a flat affine subspace of one dimension less than its ambient space, used in SVMs to separate classes.
Example: Understanding how an SVM finds the optimal hyperplane to classify data points.
Iteration
An iteration refers to one update of the model's parameters during training.
Example: Observing loss reduction after each iteration in gradient descent optimization.
Learning Curve
A learning curve plots the model's performance on the training and validation sets over time or as the training set size increases.
Example: Analyzing the learning curve to diagnose if a model is overfitting or underfitting.
Loss Function
A loss function measures how well a machine learning model performs, guiding the optimization process.
Example: Using Mean Squared Error (MSE) as the loss function in a regression model.
Mini-Batch Gradient Descent
Mini-batch gradient descent is an optimization algorithm that updates the model parameters using small batches of data.
Example: Accelerating training by using mini-batches instead of the entire dataset in each iteration.
Multivariate Analysis
Multivariate analysis examines the relationship between multiple variables simultaneously.
Example: Performing multivariate regression to understand how multiple features affect the target variable.
Optimization Algorithm
An optimization algorithm adjusts the parameters of a model to minimize the loss function.
Example: Choosing Adam optimizer for faster convergence in training a neural network.
Precision-Recall Curve
A precision-recall curve plots the trade-off between precision and recall for different threshold settings.
Example: Using the precision-recall curve to select the threshold that balances precision and recall.
Reinforcement Learning
Reinforcement learning is a type of machine learning where an agent learns to make decisions by performing actions and receiving rewards.
Example: Discussing reinforcement learning concepts as an advanced topic in the course.
Stratified Sampling
Stratified sampling involves dividing the population into subgroups and sampling from each to ensure representation.
Example: Using stratified sampling to maintain the class distribution in training and test sets.
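A minimal sketch of a stratified train/test split with scikit-learn, using the iris dataset for illustration:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)

    # stratify=y keeps the class proportions the same in the train and test splits
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0)
    print(np.bincount(y_train), np.bincount(y_test))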
Support Vector Machine (SVM)
SVM is a supervised learning algorithm that finds the hyperplane that best separates classes.
Example: Implementing an SVM classifier for a binary classification problem in the course.
Synthetic Data
Synthetic data is artificially generated data that mimics the properties of real data.
Example: Generating synthetic data to augment the dataset and improve model training.
Training Loss
Training loss measures the error on the training dataset during model training.
Example: Monitoring training loss to assess how well the model is learning from the training data.
Type I Error
A Type I error occurs when the null hypothesis is true but is incorrectly rejected.
Example: Understanding Type I errors when interpreting p-values in hypothesis testing.
Type II Error
A Type II error occurs when the null hypothesis is false but the test fails to reject it.
Example: Recognizing the implications of Type II errors in statistical testing.
Validation Loss
Validation loss measures the error on the validation dataset, used to tune model hyperparameters.
Example: Observing validation loss to detect overfitting during model training.
Weight Initialization
Weight initialization is the process of setting the initial values of the weights before training a neural network.
Example: Using random initialization methods to start training a deep learning model.
Word Embedding
Word embedding is a representation of text where words with similar meaning have similar vector representations.
Example: Implementing word embeddings like Word2Vec in NLP tasks.
Z-score Normalization
Z-score normalization rescales data by subtracting the mean and dividing by the standard deviation, so each feature has a mean of 0 and a standard deviation of 1.
Example: Applying z-score normalization to standardize features before training a model.