• ML papers use mathematical notation that maps directly to NumPy/Python operations
• Σ → np.sum(), E[X] → np.mean(), matrix multiplication → @ operator
• This cheat sheet translates common symbols into code you can copy-paste and experiment with
Turn the maths you see in ML papers into programmer-friendly code patterns.
Why “Maths-as-Code”?
ML papers speak in symbols. Engineers speak in loops, arrays, and functions. This guide acts as the Rosetta Stone between the two:
- Notation → Meaning in plain English
- Intuition in one line
- Code Equivalent (NumPy/Python first, JS when useful)
Keep this page open when reading a paper; copy the snippets into a notebook and try them on a tiny dataset.
🧩 1. Scalars, Vectors, Matrices, and Tensors
Mental model: A tensor is just an n-dimensional array. Every ML data structure is one of these.
| Math | Meaning | Python / NumPy | ML Use |
|---|---|---|---|
| x | Scalar — single number | x = 3.14 | Learning rate, bias |
| 𝐱 = [x₁, …, xₙ] | Vector — ordered list | x = np.array([1, 2, 3]) | Features, weights |
| 𝐗 = [[xᵢⱼ]] | Matrix — 2D grid | X = np.array([[1, 2], [3, 4]]) | Weight matrices |
| 𝓧 | Tensor — n-D array | np.array([...]) | Images, embeddings |
| 𝐗ᵀ | Transpose — swap rows/cols | X.T | Switching dimensions |
```python
import numpy as np

X = np.array([[1, 2], [3, 4]])
print("Matrix:", X)
print("Transpose:", X.T)
```
```javascript
// JavaScript equivalent (using standard arrays)
const X = [[1, 2], [3, 4]];
const XT = X[0].map((_, i) => X.map(row => row[i])); // transpose
```
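To make the "tensor = n-dimensional array" idea concrete, here is a small sketch; the shapes are arbitrary illustrations, not tied to any real dataset:

```python
import numpy as np

# Hypothetical mini-batch of 2 RGB "images", 4x4 pixels, 3 channels:
# shape (batch, height, width, channels) -> a 4-D tensor
images = np.zeros((2, 4, 4, 3))
print(images.ndim, images.shape)   # 4 (2, 4, 4, 3)

# A scalar, a vector, and a matrix are just 0-D, 1-D, and 2-D tensors
print(np.array(3.14).ndim, np.array([1, 2, 3]).ndim, np.array([[1, 2], [3, 4]]).ndim)  # 0 1 2
```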
🔁 2. Summation, Product, and Averages (Σ, ∏, E[·])
Mental model: Σ means loop and add. E[X] means average (expected value).
| Math | Meaning | Code Equivalent | ML Use |
|---|---|---|---|
| Σᵢ₌₁ⁿ xᵢ | Sum elements | np.sum(x) | Total loss |
| E[X] = (1/n) Σ xᵢ | Expectation / mean | np.mean(x) | Batch average |
| ∏ᵢ₌₁ⁿ xᵢ | Product | np.prod(x) | Likelihood |
| Var(X) | Variance | np.var(x) | Feature scaling |
```python
import numpy as np

x = np.array([1, 2, 3, 4])
np.sum(x), np.mean(x), np.var(x)  # (10, 2.5, 1.25)
```
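If it helps to see Σ literally as "loop and add", the vectorised calls above can also be written as plain Python loops; a sketch using the same x:

```python
import numpy as np

x = np.array([1, 2, 3, 4])

# Σ xᵢ as an explicit loop
total = 0.0
for xi in x:
    total += xi            # same result as np.sum(x)

# E[X] = (1/n) Σ xᵢ
mean = total / len(x)      # same result as np.mean(x)

print(total, mean)         # 10.0 2.5
```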
⚙️ 3. Linear Algebra Operations
Mental model: Matrices transform vectors. Neural networks are chains of transforms + non-linearities.
| Math | Meaning | Code Equivalent | ML Use |
|---|---|---|---|
| 𝐱·𝐲 = Σ xᵢyᵢ | Dot product | np.dot(x, y) | Similarity, regression |
| 𝐗𝐘 | Matrix multiply | X @ Y or np.matmul(X, Y) | Forward pass |
| ‖𝐱‖₂ = √Σ xᵢ² | L2 norm | np.linalg.norm(x) | Normalization |
| 𝐈 | Identity matrix | np.eye(n) | Initialization |
| 𝐗⁻¹ | Inverse | np.linalg.inv(X) | Solving linear systems |
```python
import numpy as np

X = np.array([[1, 2], [3, 4]])
W = np.array([[0.5], [0.2]])
y = X @ W             # matrix multiplication
np.linalg.norm(y)     # L2 norm
```
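To connect the "chains of transforms + non-linearities" mental model to code, here is a minimal two-layer forward pass; the layer sizes, random weights, and the ReLU choice are illustrative assumptions, not taken from any particular paper:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))     # one input with 4 features
W1 = rng.normal(size=(4, 3))    # first transform: 4 -> 3
W2 = rng.normal(size=(3, 1))    # second transform: 3 -> 1

h = np.maximum(0, x @ W1)       # matrix multiply, then ReLU non-linearity
y = h @ W2                      # final linear transform
print(y.shape)                  # (1, 1)
```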
📐 4. Calculus: Gradients and Derivatives
Mental model: Gradients point in the direction of steepest increase. We descend in the opposite direction.
| Math | Meaning | Code Equivalent | ML Use |
|---|---|---|---|
| ∂f/∂x | Partial derivative | grad[i] | How one parameter affects loss |
| ∇f or ∇L | Gradient (vector of partials) | np.gradient(f) or autograd | Direction to update weights |
| ∇²f or H | Hessian (matrix of second derivatives) | autograd, e.g. torch.autograd.functional.hessian | Curvature, optimization |
```python
# With autograd libraries (PyTorch example)
import torch

x = torch.tensor([2.0], requires_grad=True)
y = x ** 2       # y = x²
y.backward()     # compute gradient
print(x.grad)    # dy/dx = 2x = 4.0
```
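As a sanity check on the autograd result, the same derivative can be approximated numerically with a finite difference; a sketch, with the step size h chosen arbitrarily:

```python
# Finite-difference approximation of df/dx for f(x) = x² at x = 2
def f(x):
    return x ** 2

h = 1e-5
x0 = 2.0
numeric_grad = (f(x0 + h) - f(x0 - h)) / (2 * h)
print(numeric_grad)   # ≈ 4.0, matching x.grad above
```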
🎲 5. Probability Notation
Mental model: P(A) is a number between 0 and 1. P(A|B) is “probability of A given that B happened.”
| Math | Meaning | Code Equivalent | ML Use |
|---|---|---|---|
| P(A) | Probability of A | count_A / total | Prior probability |
| P(A\|B) | Probability of A given B | p_b_given_a * p_a / p_b (Bayes' rule) | Posterior, classification |
| P(A, B) | Joint probability | p_a * p_b_given_a | Likelihood |
| 𝔼[X] | Expected value | np.mean(x) | Average outcome |
| argmax P(y\|x) | Most likely class | np.argmax(probs) | Prediction |
```python
import numpy as np

# Softmax: convert logits to probabilities
def softmax(logits):
    exp_logits = np.exp(logits - np.max(logits))  # numerical stability
    return exp_logits / np.sum(exp_logits)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
prediction = np.argmax(probs)  # argmax
```
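For P(A|B) and the joint P(A, B), a counting example may be clearer than formulas. The spam/word numbers below are made-up toy values purely for illustration:

```python
import numpy as np

# Toy data: 1000 emails, flags for "is spam" and "contains a trigger word"
rng = np.random.default_rng(42)
is_spam = rng.random(1000) < 0.3                   # roughly P(spam) = 0.3
contains_word = np.where(is_spam,
                         rng.random(1000) < 0.8,   # roughly P(word | spam) = 0.8
                         rng.random(1000) < 0.1)   # roughly P(word | not spam) = 0.1

p_spam = np.mean(is_spam)                            # P(A)
p_word = np.mean(contains_word)                      # P(B)
p_word_given_spam = np.mean(contains_word[is_spam])  # P(B|A)

# Bayes' rule: P(A|B) = P(B|A) P(A) / P(B)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(p_spam_given_word)
```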
🎯 6. Common ML Symbols
Mental model: These symbols appear constantly — knowing them speeds up paper reading.
| Symbol | Meaning | Typical Use |
|---|---|---|
| θ (theta) | Model parameters | Weights and biases collectively |
| L or J | Loss function | What we minimize during training |
| α or η | Learning rate | Step size for gradient descent |
| λ | Regularization strength | Controls overfitting penalty |
| ŷ (y-hat) | Prediction | Model's output |
| y | True label | Ground truth |
| ε (epsilon) | Small constant | Numerical stability (e.g., 1e-8) |
```python
# Gradient descent in one line
theta = theta - alpha * gradient  # θ ← θ - α∇L
```
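Putting the symbols together, here is a minimal gradient-descent sketch on the one-parameter loss L(θ) = (θ - 3)², whose gradient is ∇L = 2(θ - 3); the starting point and learning rate are arbitrary choices:

```python
theta = 0.0    # θ: parameter (arbitrary start)
alpha = 0.1    # α: learning rate

for step in range(50):
    gradient = 2 * (theta - 3)          # ∇L for L(θ) = (θ - 3)²
    theta = theta - alpha * gradient    # θ ← θ - α∇L

print(theta)   # ≈ 3.0, the minimiser of L
```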
🧪 Quick Reference Card
```text
Notation → NumPy/Python
─────────────────────────────────────
Σ xᵢ → np.sum(x)
∏ xᵢ → np.prod(x)
E[X] → np.mean(x)
Var(X) → np.var(x)
𝐗ᵀ → X.T
𝐗𝐘 → X @ Y
‖𝐱‖ → np.linalg.norm(x)
argmax → np.argmax(x)
∇L → loss.backward() (autograd)
θ ← θ - α∇L → theta -= alpha * grad
```
What’s Next?
This cheat sheet covers the most common notation. For deeper dives:
- Linear algebra: Gilbert Strang’s Linear Algebra and Its Applications
- Calculus for ML: 3Blue1Brown’s Essence of Calculus (YouTube)
- Probability: Think Bayes by Allen Downey (free online)
Keep this page open when reading papers — copy the snippets into a notebook and experiment with small examples.