Generating Commonly Used Deep Learning Equations
Prompt
Create a list of the 100 most commonly used equations in a college level course in deep learning.
For each equation, return the following:
- The name of the equation in markdown level 4.
- Within the body, return the equation in LaTeX format
- Surround the equation with dollar signs
- For each variable in the equation, add a "where:" that defines each variable
- Put blank lines around the list of variables
- Wrap the variables names in dollar signs
Examples:
Slope and Intercept
\(y = m * x + b\)
where:
- \(y\) is the vertical axis
- \(x\) is the horizontal axis
- \(m\) is the slope
- \(b\) is the intercept
Area of a Circle
\(A = \pi r^2\) where:
- \(A\) is the area of the circle
- \(r\) is the radius
Response
100 Common Equations in Deep Learning
Below is a curated list of 100 frequently encountered equations in a college-level Deep Learning course. Each entry includes:
- A name (in level 4 markdown).
- The equation (in LaTeX, surrounded by single dollar signs).
- A “where:” section explaining each variable.
The equations are grouped loosely by theme (basic algebra, loss functions, optimization, activations, convolution, RNNs, attention mechanisms, etc.) but may appear in any Deep Learning curriculum.
1. Mean Squared Error (MSE)
\(L_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N}(y^{(i)} - \hat{y}^{(i)})^2\)
where:
- \(N\) is the number of samples
- \(y^{(i)}\) is the true label for the \(i\)-th sample
- \(\hat{y}^{(i)}\) is the predicted value for the \(i\)-th sample
2. Cross-Entropy Loss (Binary)
\(L_{\text{CE}} = -\frac{1}{N}\sum_{i=1}^{N}\Big[y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})\Big]\)
where:
- \(N\) is the number of samples
- \(y^{(i)} \in \{0, 1\}\) is the true binary label
- \(\hat{y}^{(i)}\) is the predicted probability of the positive class
3. Cross-Entropy Loss (Multi-Class)
\(L_{\text{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{c}^{(i)} \log(\hat{y}_{c}^{(i)})\)
where:
- \(N\) is the number of samples
- \(C\) is the number of classes
- \(y_{c}^{(i)} \in \{0,1\}\) indicates the true class of sample \(i\)
- \(\hat{y}_{c}^{(i)}\) is the predicted probability for class \(c\)
4. Softmax Function
\(\hat{y}_j = \frac{\exp(z_j)}{\sum_{k=1}^{C}\exp(z_k)}\)
where:
- \(z_j\) is the logit for class \(j\)
- \(C\) is the total number of classes
- \(\hat{y}_j\) is the probability assigned to class \(j\)
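As an illustration (an addition, not part of the original list), a minimal NumPy sketch of the softmax above; shifting the logits by their maximum is a standard numerical-stability trick, and the example logits are made up for demonstration:

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis; subtracting the max logit avoids overflow."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max(axis=-1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum(axis=-1, keepdims=True)

# Example: three class logits -> probabilities that sum to 1
print(softmax([2.0, 1.0, 0.1]))  # approx [0.659, 0.242, 0.099]
```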
5. Sigmoid (Logistic) Function
\(\sigma(z) = \frac{1}{1 + e^{-z}}\)
where:
- \(z\) is the input or logit
- \(\sigma(z)\) is the output between 0 and 1
6. Tanh Function
\(\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}\)
where:
- \(z\) is the input
- \(\tanh(z)\) outputs values in the range \([-1, 1]\)
7. Rectified Linear Unit (ReLU)
\(\text{ReLU}(z) = \max(0, z)\)
where:
- \(z\) is the input
- \(\text{ReLU}(z)\) is zero for negative \(z\) and \(z\) itself if \(z > 0\)
8. Leaky ReLU
\(\text{LeakyReLU}(z) = \begin{cases} z & \text{if } z \ge 0\\ \alpha \, z & \text{if } z < 0 \end{cases}\)
where:
- \(z\) is the input
- \(\alpha\) is a small positive slope (e.g., 0.01) for negative \(z\)
9. Weighted Sum of Inputs
\(z = \sum_{i=1}^{d} w_i x_i + b\)
where:
- \(d\) is the number of input features
- \(w_i\) is the weight for feature \(x_i\)
- \(x_i\) is the \(i\)-th input
- \(b\) is the bias term
10. Neural Network Output (One Layer)
\(\hat{y} = f\bigg(\sum_{i=1}^{d} w_i x_i + b\bigg)\)
where:
- \(f(\cdot)\) is an activation function
- \(w_i\), \(x_i\), and \(b\) are defined as above
11. Gradient Descent (Parameter Update)
\(\theta \leftarrow \theta - \eta \,\nabla_\theta L(\theta)\)
where:
- \(\theta\) represents parameters (weights, biases)
- \(\eta\) is the learning rate
- \(L(\theta)\) is the loss function
- \(\nabla_\theta L(\theta)\) is the gradient of the loss w.r.t. parameters
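A small sketch of this update applied to the MSE loss of a linear model (an illustrative addition; the toy data, learning rate, and iteration count are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # toy design matrix (assumption)
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)                               # parameters theta
eta = 0.1                                     # learning rate

for _ in range(200):
    y_hat = X @ w
    grad = 2.0 / len(y) * X.T @ (y_hat - y)   # gradient of MSE w.r.t. w
    w -= eta * grad                           # theta <- theta - eta * grad

print(w)  # should approach true_w
```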
12. Chain Rule (Single-Variable)
\(\frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x)\)
where:
- \(f\) and \(g\) are functions
- \(f'\) and \(g'\) are their derivatives
13. Chain Rule (Multivariable)
\(\frac{\partial}{\partial x_i} L = \sum_{j} \frac{\partial L}{\partial z_j} \frac{\partial z_j}{\partial x_i}\)
where:
- \(L\) is the loss function
- \(z_j\) are intermediate variables dependent on \(x_i\)
14. Momentum Update
\(v \leftarrow \beta \, v + \eta \,\nabla_\theta L(\theta), \quad \theta \leftarrow \theta - v\)
where:
- \(v\) is the velocity term
- \(\beta\) is the momentum hyperparameter (e.g., 0.9)
- \(\eta\) is the learning rate
- \(\nabla_\theta L(\theta)\) is the gradient
15. Nesterov Accelerated Gradient (NAG)
\(v \leftarrow \beta \, v + \eta \,\nabla_\theta L(\theta - \beta v), \quad \theta \leftarrow \theta - v\)
where:
- \(v\) is the velocity term
- \(\beta\) is the momentum factor
- \(\theta\) are the parameters
- \(\eta\) is the learning rate
16. RMSProp Update
\(E[g^2] \leftarrow \rho \, E[g^2] + (1-\rho)\, g^2, \quad \theta \leftarrow \theta - \frac{\eta}{\sqrt{E[g^2] + \epsilon}}\, g, \quad g = \nabla_\theta L(\theta)\)
where:
- \(E[g^2]\) is the running average of squared gradients
- \(\rho\) is the decay rate (e.g., 0.9)
- \(\eta\) is the learning rate
- \(\epsilon\) is a small constant for numerical stability
17. Adam Optimizer
\(g = \nabla_\theta L(\theta), \quad m \leftarrow \beta_1 m + (1-\beta_1)\, g, \quad v \leftarrow \beta_2 v + (1-\beta_2)\, g^2, \quad \hat{m} = \frac{m}{1-\beta_1^t}, \quad \hat{v} = \frac{v}{1-\beta_2^t}, \quad \theta \leftarrow \theta - \frac{\eta \, \hat{m}}{\sqrt{\hat{v}} + \epsilon}\)
where:
- \(m\) is the first moment estimate
- \(v\) is the second moment estimate
- \(\beta_1\) and \(\beta_2\) are exponential decay rates
- \(\eta\) is the learning rate
- \(t\) is the current time step
- \(\epsilon\) is a small constant
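A minimal NumPy sketch of one Adam step (an addition for illustration; the defaults \(\eta = 10^{-3}\), \(\beta_1 = 0.9\), \(\beta_2 = 0.999\) and the toy objective \(f(\theta) = \theta^2\) are assumptions):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns new parameters and moment estimates."""
    m = beta1 * m + (1 - beta1) * grad          # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Example: minimize f(theta) = theta^2, so grad = 2 * theta
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
print(theta)  # close to 0
```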
18. L1 Regularization
\(R_{L1} = \lambda \sum_{j} |w_j|\)
where:
- \(\lambda\) is the regularization coefficient
- \(w_j\) are the weights
19. L2 Regularization
\(R_{L2} = \frac{\lambda}{2} \sum_{j} w_j^2\)
where:
- \(\lambda\) is the regularization strength
- \(w_j\) are the model’s weights
- The factor \(\tfrac{1}{2}\) is often included by convention
20. Weighted Cross-Entropy (Class Imbalance)
\(L_{\text{WCE}} = -\frac{1}{N} \sum_{i=1}^N \Big[\alpha \, y^{(i)} \log(\hat{y}^{(i)}) + (1-\alpha)\,(1-y^{(i)})\log(1-\hat{y}^{(i)})\Big]\)
where:
- \(\alpha\) is a weight factor for the positive class
- \(y^{(i)} \in \{0,1\}\) is the true label
- \(\hat{y}^{(i)}\) is the predicted probability
21. Negative Log-Likelihood (NLL)
\(L_{\text{NLL}} = -\sum_{i=1}^N \log \, p(y^{(i)} | x^{(i)})\)
where:
- \(p(y|x)\) is the predicted probability of label \(y\) given input \(x\)
- \(N\) is the number of samples
22. Normal Equation (Linear Regression)
\(\theta = (X^\top X)^{-1} X^\top y\)
where:
- \(X\) is the design matrix
- \(y\) is the vector of targets
- \(\theta\) is the vector of parameters
23. Coefficient of Determination (\(R^2\))
\(R^2 = 1 - \frac{\sum_{i=1}^N (y^{(i)} - \hat{y}^{(i)})^2}{\sum_{i=1}^N (y^{(i)} - \bar{y})^2}\)
where:
- \(y^{(i)}\) is the actual value
- \(\hat{y}^{(i)}\) is the predicted value
- \(\bar{y}\) is the mean of \(y^{(i)}\)
24. Batch Normalization (Mean)
\(\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i\)
where:
- \(m\) is the number of samples in a mini-batch
- \(x_i\) is the \(i\)-th activation in the batch
25. Batch Normalization (Variance)
\(\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2\)
where:
- \(x_i\) is the \(i\)-th activation in the batch
- \(\mu_B\) is the batch mean
- \(\sigma_B^2\) is the batch variance
26. Batch Normalization (Forward Pass)
\(\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \quad y_i = \gamma \, \hat{x}_i + \beta\)
where:
- \(x_i\) is the input activation
- \(\mu_B\), \(\sigma_B^2\) are batch statistics
- \(\epsilon\) is a small constant
- \(\gamma\), \(\beta\) are learnable scale and shift parameters
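As a sketch (an addition, assuming a 2D batch of activations and training-mode statistics), the three batch-normalization equations above in NumPy:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Batch normalization over the batch axis using mini-batch statistics."""
    mu = x.mean(axis=0)                      # batch mean
    var = x.var(axis=0)                      # batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize
    return gamma * x_hat + beta              # scale and shift

x = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(32, 4))
out = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(6), out.std(axis=0).round(3))  # ~0 mean, ~1 std
```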
27. Dropout (Forward Pass)
\(\tilde{h}^{(l)} = \mathbf{m}^{(l)} \odot h^{(l)}\)
where:
- \(h^{(l)}\) is the activation vector at layer \(l\)
- \(\mathbf{m}^{(l)}\) is a random binary mask (each entry \(\sim \text{Bernoulli}(p)\))
- \(\odot\) denotes elementwise multiplication
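A small sketch of the mask in NumPy (an illustrative addition). It uses the common "inverted dropout" variant, which divides by the keep probability \(p\) at training time so that no rescaling is needed at test time; the shapes and the value of \(p\) are assumptions:

```python
import numpy as np

def dropout_forward(h, p=0.8, training=True, seed=0):
    """Keep each unit with probability p; rescale by 1/p at training time
    (inverted dropout) so the expected activation matches test time."""
    if not training:
        return h
    mask = (np.random.default_rng(seed).random(h.shape) < p).astype(h.dtype)
    return h * mask / p

h = np.ones(10)
print(dropout_forward(h))  # roughly a fraction p of entries survive, scaled to 1/p
```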
28. 2D Convolution (Without Stride/Padding)
\((I * K)(x,y) = \sum_{u=-r}^{r} \sum_{v=-s}^{s} I(x+u, y+v)\, K(u,v)\)
where:
- \(I\) is the input image
- \(K\) is the kernel
- \(r\), \(s\) define kernel size offsets
29. Transposed Convolution
\(\text{ConvTranspose}(z, K)(x, y) = \sum_{u}\sum_{v} z(u,v)\, K(x - u \cdot \text{stride},\; y - v \cdot \text{stride})\)
where:
- \(z(u,v)\) is an activation map
- \(K\) is the kernel
- \(\text{stride}\) is the factor by which the output is upsampled
30. Max Pooling (2D)
\(\text{MaxPool}(x,y) = \max_{(p,q)\in \,\mathcal{R}_{x,y}} I(p,q)\)
where:
- \(I\) is the input feature map
- \(\mathcal{R}_{x,y}\) is the receptive field around location \((x,y)\)
31. Average Pooling (2D)
\(\text{AvgPool}(x,y) = \frac{1}{|\mathcal{R}_{x,y}|}\sum_{(p,q)\in \,\mathcal{R}_{x,y}} I(p,q)\)
where:
- \(\mathcal{R}_{x,y}\) is the region of pooling
- \(|\mathcal{R}_{x,y}|\) is the size of that region
32. RNN Hidden State Update
\(h_t = f(W_{hh} \, h_{t-1} + W_{xh} \, x_t + b_h)\)
where:
- \(h_t\) is the hidden state at time \(t\)
- \(x_t\) is the input at time \(t\)
- \(W_{hh}, W_{xh}\) are weight matrices
- \(b_h\) is a bias vector
- \(f(\cdot)\) is an activation (e.g., tanh)
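A minimal NumPy sketch of this recurrence unrolled over a few time steps (an addition; the hidden size, input size, and random weights are assumptions):

```python
import numpy as np

def rnn_step(h_prev, x_t, W_hh, W_xh, b_h):
    """Vanilla-RNN update: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

rng = np.random.default_rng(0)
hidden, inputs = 4, 3
W_hh = rng.normal(scale=0.1, size=(hidden, hidden))
W_xh = rng.normal(scale=0.1, size=(hidden, inputs))
b_h = np.zeros(hidden)

h = np.zeros(hidden)
for x_t in rng.normal(size=(5, inputs)):   # unroll over 5 time steps
    h = rnn_step(h, x_t, W_hh, W_xh, b_h)
print(h)
```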
33. LSTM Input Gate
\(i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)\)
where:
- \(i_t\) is the input gate
- \(x_t\) is the current input
- \(h_{t-1}\) is the previous hidden state
- \(W_{xi}, W_{hi}\) are weight matrices
- \(b_i\) is the bias
- \(\sigma\) is the sigmoid function
34. LSTM Forget Gate
\(f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)\)
where:
- \(f_t\) is the forget gate
- \(W_{xf}, W_{hf}\) are weight matrices
- \(b_f\) is the bias
35. LSTM Output Gate
\(o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)\)
where:
- \(o_t\) is the output gate
- \(W_{xo}, W_{ho}\) are weight matrices
- \(b_o\) is the bias
36. LSTM Cell State
\(c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \quad \tilde{c}_t = \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)\)
where:
- \(c_t\) is the cell state
- \(f_t, i_t\) are forget and input gates
- \(\tilde{c}_t\) is the candidate cell state
- \(\odot\) is elementwise multiplication
37. GRU Update Gate
\(z_t = \sigma(W_{xz} x_t + W_{hz} h_{t-1} + b_z)\)
where:
- \(z_t\) is the update gate
- \(x_t\) is the input at time \(t\)
- \(h_{t-1}\) is the previous hidden state
- \(W_{xz}, W_{hz}\) are weight matrices
- \(b_z\) is the bias
38. GRU Reset Gate
\(r_t = \sigma(W_{xr} x_t + W_{hr} h_{t-1} + b_r)\)
where:
- \(r_t\) is the reset gate
- \(W_{xr}, W_{hr}\) are weight matrices
- \(b_r\) is the bias
39. GRU Hidden State
\(h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t, \quad \tilde{h}_t = \tanh(W_{xh} x_t + r_t \odot (W_{hh} h_{t-1}))\)
where:
- \(h_t\) is the updated hidden state
- \(z_t, r_t\) are the update and reset gates
- \(\tilde{h}_t\) is the candidate hidden state
40. Attention Score (Dot Product)
\(e_{t,s} = h_t^\top \, s_s\)
where:
- \(h_t\) is the query vector
- \(s_s\) is the key vector
- \(e_{t,s}\) is the scalar score measuring alignment
41. Attention Weights (Softmax)
\(\alpha_{t,s} = \frac{\exp(e_{t,s})}{\sum_{k=1}^{S} \exp(e_{t,k})}\)
where:
- \(e_{t,s}\) is the attention score for query \(t\) and key \(s\)
- \(S\) is the total number of key vectors
42. Context Vector
\(c_t = \sum_{s=1}^{S} \alpha_{t,s}\, v_s\)
where:
- \(\alpha_{t,s}\) are attention weights
- \(v_s\) is the value vector associated with key \(s\)
- \(c_t\) is the resulting weighted sum
43. Scaled Dot-Product Attention
\(\text{Attention}(Q,K,V) = \text{softmax}\bigg(\frac{QK^\top}{\sqrt{d_k}}\bigg)V\)
where:
- \(Q\) is the matrix of query vectors
- \(K\) is the matrix of key vectors
- \(V\) is the matrix of value vectors
- \(d_k\) is the dimensionality of keys/queries
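A single-head NumPy sketch of this formula (an addition for illustration; the matrix shapes are assumptions, and masking and batching are omitted):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with a row-wise, max-shifted softmax."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 8))   # 2 queries, d_k = 8
K = rng.normal(size=(5, 8))   # 5 keys
V = rng.normal(size=(5, 16))  # 5 values, d_v = 16
print(scaled_dot_product_attention(Q, K, V).shape)  # (2, 16)
```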
44. Multi-Head Attention
\(\text{MHA}(Q,K,V) = \big[\text{head}_1, \dots, \text{head}_h\big]W^O\)
where:
- \(\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\)
- \(W^O\) is the output projection matrix
- \(h\) is the number of heads
45. Feed-Forward (Transformer)
\(\text{FFN}(x) = \max(0, xW_1 + b_1) \, W_2 + b_2\)
where:
- \(x\) is the input
- \(W_1, b_1, W_2, b_2\) are learned parameters
- \(\max(0,\cdot)\) indicates ReLU (or another activation)
46. Positional Encoding (Sinusoidal)
\(\text{PE}(pos, 2i) = \sin\big(pos / 10000^{2i/d_{\text{model}}}\big), \quad \text{PE}(pos, 2i+1) = \cos\big(pos / 10000^{2i/d_{\text{model}}}\big)\)
where:
- \(pos\) is the position index
- \(i\) is the dimension index
- \(d_{\text{model}}\) is the model dimensionality
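A short sketch that builds the sinusoidal table above in NumPy (an addition; it assumes an even \(d_{\text{model}}\), and the sizes are arbitrary):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

print(sinusoidal_positional_encoding(max_len=50, d_model=16).shape)  # (50, 16)
```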
47. Perceptron Rule
\(\hat{y} = \begin{cases} 1 & \text{if } w^\top x + b \ge 0\\ 0 & \text{otherwise} \end{cases}\)
where:
- \(w\) is the weight vector
- \(x\) is the input vector
- \(b\) is the bias term
48. General Activation Forward
\(a = f(z)\)
where:
- \(z\) is the pre-activation (weighted sum)
- \(f(\cdot)\) is an activation function (e.g., sigmoid, ReLU)
49. Derivative of ReLU
\(\frac{d}{dz}\text{ReLU}(z) = \begin{cases} 1 & z > 0\\ 0 & z \le 0 \end{cases}\)
where:
- \(z\) is the input
50. Derivative of Sigmoid
\(\sigma'(z) = \sigma(z)\big(1-\sigma(z)\big)\)
where:
- \(\sigma(z)\) is the sigmoid function
51. Derivative of Tanh
\(\frac{d}{dz}\tanh(z) = 1 - \tanh^2(z)\)
where:
- \(\tanh(z)\) is the hyperbolic tangent function
52. Derivative of Softmax (Vector Form)
\(\frac{\partial \hat{y}_j}{\partial z_k} = \hat{y}_j \big(\delta_{jk} - \hat{y}_k\big)\)
where:
- \(\hat{y}_j\) is the softmax output for class \(j\)
- \(z_k\) is the logit for class \(k\)
- \(\delta_{jk}\) is the Kronecker delta (1 if \(j=k\), 0 otherwise)
53. Derivative of Cross-Entropy w.r.t. Logits (Softmax)
\(\frac{\partial L_{\text{CE}}}{\partial z_j} = \hat{y}_j - y_j\)
where:
- \(\hat{y}_j\) is the predicted probability (softmax output)
- \(y_j\) is the one-hot target
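As a sanity check (an illustrative addition; the logits and one-hot target are made up), the analytic gradient \(\hat{y} - y\) can be compared against a central-difference numerical gradient:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([1.0, 2.0, 0.5])
y = np.array([0.0, 1.0, 0.0])          # one-hot target

analytic = softmax(z) - y              # dL/dz = y_hat - y

eps = 1e-6
numeric = np.array([
    (cross_entropy(z + eps * np.eye(3)[j], y) -
     cross_entropy(z - eps * np.eye(3)[j], y)) / (2 * eps)
    for j in range(3)
])
print(np.allclose(analytic, numeric, atol=1e-6))  # True
```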
54. L2 Distance (Euclidean)
\(d_{\text{Euclidean}}(x,y) = \sqrt{\sum_{i=1}^{d}(x_i - y_i)^2}\)
where:
- \(x, y \in \mathbb{R}^d\) are vectors
- \(d\) is the number of dimensions
55. Manhattan Distance (L1)
\(d_{\text{Manhattan}}(x,y) = \sum_{i=1}^d |x_i - y_i|\)
where:
- \(x, y\) are vectors in \(\mathbb{R}^d\)
- \(|\,\cdot\,|\) is the absolute value
56. Cosine Similarity
\(\text{cos\_sim}(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}\)
where:
- \(x \cdot y\) is the dot product
- \(\|x\|\) is the norm of \(x\)
57. Dot Product
\(x \cdot y = \sum_{i=1}^{d} x_i \, y_i\)
where:
- \(x, y\) are vectors in \(\mathbb{R}^d\)
58. Matrix Multiplication
\((AB)_{ij} = \sum_{k=1}^{n} A_{ik}\, B_{kj}\)
where:
- \(A\) is an \((m \times n)\) matrix
- \(B\) is an \((n \times p)\) matrix
- \((AB)\) is an \((m \times p)\) matrix
59. Determinant of a 2×2 Matrix
\(\det\begin{pmatrix} a & b\\ c & d \end{pmatrix} = ad - bc\)
where:
- \(a, b, c, d\) are elements of the matrix
60. Matrix Inverse (2×2)
\(\begin{pmatrix} a & b \\ c & d \end{pmatrix}^{-1} = \frac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a\end{pmatrix}\)
where:
- \(a, b, c, d\) are elements
- \(ad - bc \neq 0\)
61. Frobenius Norm
\(\|A\|_F = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} |A_{ij}|^2}\)
where:
- \(A\) is an \(m \times n\) matrix
- \(A_{ij}\) is the element in row \(i\), column \(j\)
62. KL Divergence
\(D_{\text{KL}}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}\)
where:
- \(P\) and \(Q\) are discrete probability distributions
- The sum is over all possible outcomes \(x\)
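A small NumPy sketch of the discrete KL divergence (an addition; the two example distributions are arbitrary, and terms with \(P(x)=0\) are taken to contribute zero):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions; terms with p(x)=0 contribute 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(kl_divergence(p, q))   # approx 0.0253 nats
print(kl_divergence(q, p))   # approx 0.0258 nats: KL is not symmetric
```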
63. Jensen-Shannon Divergence
\(D_{\text{JS}}(P\|Q) = \frac{1}{2} D_{\text{KL}}\big(P \| M\big) + \frac{1}{2} D_{\text{KL}}\big(Q \| M\big)\)
where:
- \(M = \frac{1}{2}(P + Q)\)
- \(P\) and \(Q\) are probability distributions
64. Bayes’ Rule
\(P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}\)
where:
- \(A\) and \(B\) are events
- \(P(A)\), \(P(B)\), \(P(B|A)\) are known or can be inferred
65. Covariance
\(\text{Cov}(X,Y) = E[(X - E[X])(Y - E[Y])]\)
where:
- \(E[\cdot]\) denotes the expectation
- \(X, Y\) are random variables
66. Pearson Correlation
\(\rho_{X,Y} = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}\)
where:
- \(\sigma_X\) and \(\sigma_Y\) are standard deviations of \(X\) and \(Y\)
67. PCA Covariance Eigen-Decomposition
\(C = U \Lambda U^\top\)
where:
- \(C\) is the covariance matrix
- \(U\) is the matrix of eigenvectors
- \(\Lambda\) is the diagonal matrix of eigenvalues
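A sketch of PCA via this eigendecomposition (an addition; the synthetic data with three different column variances is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.1])   # toy data

X_centered = X - X.mean(axis=0)
C = np.cov(X_centered, rowvar=False)       # covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)       # C = U Lambda U^T (C is symmetric)
order = np.argsort(eigvals)[::-1]          # sort components by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

X_proj = X_centered @ eigvecs[:, :2]       # project onto top-2 principal axes
print(eigvals.round(3), X_proj.shape)
```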
68. Singular Value Decomposition (SVD)
\(A = U \Sigma V^\top\)
where:
- \(A\) is an \((m \times n)\) matrix
- \(U\) and \(V\) are orthonormal matrices
- \(\Sigma\) is the diagonal matrix of singular values
69. Gaussian Distribution (1D)
\(\mathcal{N}(x \mid \mu,\sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}}\exp\Big(-\frac{(x-\mu)^2}{2\sigma^2}\Big)\)
where:
- \(\mu\) is the mean
- \(\sigma^2\) is the variance
70. Probability Integral (Discrete)
\(\sum_{x} P(x) = 1\)
where:
- \(x\) ranges over all possible outcomes
- \(P(x)\) is a probability distribution
71. Probability Integral (Continuous)
\(\int_{-\infty}^{\infty} p(x)\, dx = 1\)
where:
- \(p(x)\) is a probability density function
72. Exponential Family (General Form)
\(p(x|\theta) = h(x) \exp\big(\eta(\theta)^\top T(x) - A(\theta)\big)\)
where:
- \(h(x)\), \(T(x)\) are known functions
- \(\eta(\theta)\) is the natural parameter
- \(A(\theta)\) is the log-partition function
73. Weighted Adjacency (Graph Neural Network)
\(h_v^{(l+1)} = \sigma\Bigg(\sum_{u \in \mathcal{N}(v)} \alpha_{vu} \, W^{(l)} h_u^{(l)}\Bigg)\)
where:
- \(h_v^{(l)}\) is the hidden representation of node \(v\) at layer \(l\)
- \(\alpha_{vu}\) is the attention or adjacency weight from node \(u\) to \(v\)
- \(W^{(l)}\) is a trainable weight matrix
- \(\sigma\) is an activation function
74. GNN Aggregation
\(h_v^{(l+1)} = \text{AGGREGATE}\big(\{\,h_u^{(l)} : u \in \mathcal{N}(v)\}\big)\)
where:
- \(\mathcal{N}(v)\) is the set of neighbors of node \(v\)
- \(\text{AGGREGATE}(\cdot)\) could be sum, mean, max, or an attention-based mechanism
75. RMSProp (Repeated for Emphasis)
\(\theta \leftarrow \theta - \frac{\eta}{\sqrt{E[g^2] + \epsilon}} \nabla_\theta L(\theta)\)
where:
- \(\theta\) is the parameter vector
- \(\eta\) is the learning rate
- \(E[g^2]\) is the running average of squared gradients
- \(\epsilon\) prevents division by zero
76. Weighted Combination for Multi-Head
\(\text{head}_i = \text{Attention}(QW_i^Q, \; KW_i^K, \; VW_i^V)\)
where:
- \(Q, K, V\) are query, key, and value matrices
- \(W_i^Q, W_i^K, W_i^V\) are parameter matrices for head \(i\)
77. Gradient of MSE w.r.t. Weights
\(\frac{\partial L_{\text{MSE}}}{\partial w_j} = \frac{1}{N}\sum_{i=1}^N 2 \,(\hat{y}^{(i)} - y^{(i)}) \, x_j^{(i)}\)
where:
- \(w_j\) is weight \(j\)
- \(x_j^{(i)}\) is the \(j\)-th feature of sample \(i\)
78. Gradient of Binary Cross-Entropy w.r.t. Weights
\(\frac{\partial L_{\text{CE}}}{\partial w_j} = \frac{1}{N} \sum_{i=1}^N (\hat{y}^{(i)} - y^{(i)})\, x_j^{(i)}\)
where:
- \(\hat{y}^{(i)}\) is the predicted probability
- \(y^{(i)}\) is the true label
79. Logistic Regression Decision Boundary
\(\hat{y} = \sigma(w^\top x + b) \implies \text{Decision} = \begin{cases} 1 & \text{if } w^\top x + b \ge 0\\ 0 & \text{otherwise} \end{cases}\)
where:
- \(w, b\) are parameters
- \(x\) is the input
- \(\sigma\) is the sigmoid function
80. Balanced Cross-Entropy
\(L_{\text{bal}} = -\frac{1}{N} \sum_{i=1}^N \Big[\beta \, y^{(i)} \log(\hat{y}^{(i)}) + (1-\beta)(1-y^{(i)}) \log(1-\hat{y}^{(i)})\Big]\)
where:
- \(\beta\) is a factor weighting the positive class
- \(y^{(i)}, \hat{y}^{(i)}\) as before
81. F1 Score
\(\text{F1} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}\)
where:
- \(\text{precision} = \frac{TP}{TP+FP}\)
- \(\text{recall} = \frac{TP}{TP+FN}\)
82. Precision
\(\text{precision} = \frac{TP}{TP + FP}\)
where:
- \(TP\) = true positives
- \(FP\) = false positives
83. Recall
\(\text{recall} = \frac{TP}{TP + FN}\)
where:
- \(TP\) = true positives
- \(FN\) = false negatives
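A short NumPy sketch computing precision, recall, and F1 (equations 81–83) from binary predictions; the labels below are made-up illustration data:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, round(f1, 3))  # 0.8 0.8 0.8
```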
84. AUC (Area Under ROC Curve)
\(\text{AUC} = \int_0^1 \text{TPR}\big(\text{FPR}^{-1}(x)\big)\, dx\)
where:
- \(TPR\) is the true positive rate
- \(FPR\) is the false positive rate
- The integral is conceptually the area under the ROC curve
85. Step Decay (Learning Rate Scheduling)
\(\eta_{\text{new}} = \eta_{\text{old}} \cdot \gamma^{\lfloor \frac{\text{epoch}}{k}\rfloor}\)
where:
- \(\eta_{\text{old}}\) is the current learning rate
- \(\eta_{\text{new}}\) is the updated learning rate
- \(\gamma \in (0,1)\) is the decay factor
- \(k\) is the step size in epochs
86. Polynomial Kernel (SVM)
\(K(x, x') = (x^\top x' + c)^p\)
where:
- \(x, x'\) are feature vectors
- \(c\) is a constant (often 1)
- \(p\) is the polynomial degree
87. Radial Basis Function Kernel (RBF)
\(K(x, x') = \exp\Big(-\frac{\|x - x'\|^2}{2\sigma^2}\Big)\)
where:
- \(x, x'\) are feature vectors
- \(\sigma\) is the kernel width
88. Hinge Loss (SVM)
\(L_{\text{hinge}} = \max\big(0, 1 - y_i (w^\top x_i + b)\big)\)
where:
- \(y_i \in \{-1, +1\}\) is the true label
- \(w\) is the weight vector
- \(x_i\) is the input
- \(b\) is the bias
89. Margin (Hard-SVM)
\(\text{Margin} = \frac{2}{\|w\|}\)
where:
- \(w\) is the normal vector to the decision boundary
- \(\|w\|\) is the Euclidean norm of \(w\)
90. Q-Learning Update (RL)
\(Q(s,a) \leftarrow Q(s,a) + \alpha \Big[r + \gamma \max_{a'} Q(s', a') - Q(s,a)\Big]\)
where:
- \(Q(s,a)\) is the current Q-value
- \(\alpha\) is the learning rate
- \(r\) is the reward
- \(\gamma\) is the discount factor
- \(s'\) is the next state
- \(a'\) is the next action
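A tabular sketch of this update rule (an addition; the state/action counts, hyperparameters, and the single transition are assumptions):

```python
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9

def q_update(Q, s, a, r, s_next):
    """Q(s,a) += alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# One illustrative transition: state 0, action 1, reward +1, next state 2
Q = q_update(Q, s=0, a=1, r=1.0, s_next=2)
print(Q[0])   # [0.  0.1]
```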
91. Bellman Equation (Value Function)
\(V^\pi(s) = \mathbb{E}_{a\sim \pi}\Big[r(s,a) + \gamma V^\pi(s')\Big]\)
where:
- \(V^\pi(s)\) is the value of state \(s\) under policy \(\pi\)
- \(r(s,a)\) is the immediate reward
- \(\gamma\) is the discount factor
- \(s'\) is the next state
92. Actor-Critic Gradient
\(\nabla_\theta J(\theta) = \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a|s)\, (Q_w(s,a) - b(s))\big]\)
where:
- \(\pi_\theta\) is the policy (actor)
- \(Q_w(s,a)\) is the critic’s estimate of action-value
- \(b(s)\) is a baseline to reduce variance
93. Softplus Activation
\(\text{Softplus}(z) = \log(1 + e^z)\)
where:
- \(z\) is the input
94. ELU (Exponential Linear Unit)
\(\text{ELU}(z) = \begin{cases} z & z \ge 0,\\ \alpha(e^z - 1) & z < 0 \end{cases}\)
where:
- \(\alpha\) is a positive constant
95. SELU (Scaled ELU)
\(\text{SELU}(z) = \lambda \begin{cases} z & z \ge 0\\ \alpha(e^z - 1) & z < 0 \end{cases}\)
where:
- \(\alpha\) and \(\lambda\) are constants (e.g., \(\alpha \approx 1.673\), \(\lambda \approx 1.051\))
96. Knowledge Distillation
\(L_{\text{KD}} = \tau^2 \cdot \text{KL}\big(p_\text{teacher}(x;\tau)\,\|\,p_\text{student}(x;\tau)\big)\)
where:
- \(\tau\) is the temperature
- \(p_\text{teacher}\) and \(p_\text{student}\) are softmax outputs of teacher and student networks, respectively
97. Threshold Function (Binary Step)
\(\theta(z) = \begin{cases} 1 & z \ge 0\\ 0 & z < 0 \end{cases}\)
where:
- \(z\) is the input
98. Weighted Hinge Loss
\(L_{\text{whinge}} = w_i \,\max\big(0, 1 - y_i (w^\top x_i + b)\big)\)
where:
- \(w_i\) is the sample weight or class weight
- \(y_i \in \{-1, +1\}\) is the true label
99. Perplexity (Language Modeling)
\(\text{PPL} = \exp\Big(\frac{1}{N}\sum_{i=1}^{N} -\log p(w_i)\Big)\)
where:
- \(w_i\) are tokens in the sequence
- \(p(w_i)\) is the predicted probability of token \(w_i\)
- \(N\) is the total number of tokens
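A tiny sketch of this formula (an addition; the per-token probabilities are made up, and they are assumed to come from the model's predictive distribution):

```python
import numpy as np

def perplexity(token_probs):
    """PPL = exp( (1/N) * sum_i -log p(w_i) ) over predicted token probabilities."""
    token_probs = np.asarray(token_probs, float)
    return float(np.exp(-np.mean(np.log(token_probs))))

print(perplexity([0.25, 0.25, 0.25, 0.25]))   # 4.0 (uniform over 4 outcomes)
print(perplexity([0.1, 0.2, 0.05, 0.1]))      # 10.0: lower probabilities -> higher perplexity
```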
100. Cycle Consistency Loss (CycleGAN)
\(L_{\text{cyc}}(G,F) = \mathbb{E}_{x \sim p(x)}\big[\|F(G(x)) - x\|\big] + \mathbb{E}_{y \sim p(y)}\big[\|G(F(y)) - y\|\big]\)
where:
- \(G\) is the generator mapping domain \(X \to Y\)
- \(F\) is the generator mapping domain \(Y \to X\)
- \(x, y\) are samples from domains \(X, Y\)
- \(\|\cdot\|\) is a distance measure (e.g., L1 norm)