Supervised Learning Primer
The minimum learning theory you need before any ML paper. Five short topics covering what "learning" actually means as a math problem — fitting an input-to-output mapping — and the five components every supervised model shares: the function family, the loss, the optimizer, the overfitting trap, and the regularization that keeps the model from falling into it.
Input → Output Mapping
Every supervised model is a function. Training is the search for which one.
Supervised learning is the problem of approximating an unknown function from examples. Show the model a bunch of (input, output) pairs and ask it to predict the output for new inputs it hasn't seen. The "supervised" part is that each example comes with a labeled answer: the features-and-labels idea from Section 2 of the data primer, now turned into a learning objective.
Write the mapping abstractly:
ŷ = fθ(x)
x is one input (a feature vector, an image, a sequence of tokens, anything). ŷ ("y-hat") is the model's predicted output. fθ is the model itself — a function whose behavior is controlled by the parameters θ (Greek letter "theta," collective name for every learnable knob inside the model). Training is a search for the value of θ that makes ŷ = fθ(x) match the true y as well as possible across the dataset.
Picking what fθ looks like — the function family you're searching over — is the modeling decision. Different families give you wildly different capabilities:
- Linear regression. ŷ = w · x + b: a weighted sum of the inputs. One weight per input, plus a bias. Cheap, fast, surprisingly competitive on tabular data.
- Decision tree. A nested set of "if x[3] > 1.5 then…" rules. Each leaf stores a prediction. Trees compose into random forests and gradient-boosted trees, which still win Kaggle competitions on small structured data.
- Neural network. A stack of linear layers interleaved with element-wise nonlinearities (ReLU, GELU). Millions of parameters at the small end, trillions at the frontier. Universal approximators given enough width or depth — which means in theory they can fit nearly any input-to-output relationship the data hints at.
Concrete example: predicting house price from sqft, using linear regression. One feature (x = sqft), one parameter to learn (w) plus a bias (b). The model is ŷ = w · x + b. Training looks for the combination of (w, b) whose line passes as close to the data points as possible.
Running example. The 5 housing points you just saw — and the best-fit line w ≈ 0.75, b ≈ 130 — recur throughout this primer. §2 squares the residuals of that line into MSE; §3 watches gradient descent rediscover the slope; §4 and §5 keep the same 5 points but trade the line for a far more flexible polynomial, showing how overfitting happens and how regularization fights it.
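For a single feature, the best-fit line can be computed directly rather than searched for. The sketch below is one way to do that, assuming the five (sqft, price) pairs listed in the MSE table of the loss section, with prices in thousands of dollars; the helper name fit_line is made up for this illustration. The result should land close to the rounded values quoted above.

```python
# Closed-form least squares for one feature: y_hat = w * x + b.
# Data: the five (sqft, price in $k) pairs from the primer's MSE table.
sqft = [600, 850, 1450, 2100, 3200]
price = [480, 920, 1080, 1900, 2480]

def fit_line(xs, ys):
    """Return (w, b) minimizing the mean squared error of w*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the intercept follows from the means.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    w = s_xy / s_xx
    b = mean_y - w * mean_x
    return w, b

w, b = fit_line(sqft, price)
print(f"w = {w:.3f} $k per sqft, b = {b:.1f} $k")
```

This shortcut only exists because the model is linear and the loss is squared error; the later sections cover the general-purpose search used when no such formula is available.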
Three pieces show up over and over no matter which family you pick:
- Hypothesis space. The set of all functions fθ you're willing to consider, one for every choice of θ. Big hypothesis spaces can fit more patterns but are also easier to overfit.
- Inductive bias. What the family quietly assumes about the world. Linear regression assumes the relationship is, well, linear. Convolutional networks assume translation invariance ("a cat is still a cat if you shift the photo"). Transformers assume tokens interact through attention. No model is "assumption-free"; the trick is picking a family whose assumptions fit the problem.
- Capacity. How expressive the family is. Linear regression has low capacity (one number per feature). A 70-billion-parameter LLM has enormous capacity. High capacity is necessary for hard problems but is also exactly what makes overfitting possible — see Sections 4 and 5.
The remaining four sections answer the obvious follow-up: given a function family, how do you actually find the best θ? The answer breaks into "measure how wrong the current θ is" (Section 2 — loss), "nudge θ toward less-wrong" (Section 3 — optimization), "check that you've learned a pattern, not memorized the training set" (Section 4 — generalization), and "build that check into the loss" (Section 5 — regularization).
In a Transformer: the function family is "stacked self-attention blocks with residual connections and feed-forward layers." The parameters θ are the billions of weights in the matrices inside each block, plus the token embedding table. The input x is a sequence of token IDs; the output ŷ is a probability distribution over the next token (Section 2 of the probability primer). Training searches across the parameter space for the θ that makes this distribution match what the next token actually was, averaged over trillions of training tokens. Every other piece of this primer applies; only the function family is special.
Loss Function
A single number that says how wrong the current model is.
Section 1 set up the picture: find parameters θ so that ŷ = fθ(x) matches the true y as well as possible. But "as well as possible" needs a definition you can compute, otherwise the optimizer has nothing to optimize. The loss function is that definition. It takes a prediction ŷ and a true label y and returns a single number L(ŷ, y) measuring how wrong the prediction is. Lower is better; zero means perfect.
Two losses cover most of what you'll meet in practice:
- Mean Squared Error (MSE), for regression (continuous outputs): L = mean[ (ŷ − y)² ]. Take the per-example error, square it (so positive and negative errors don't cancel and big errors hurt extra-much), then average across the dataset.
- Cross-entropy, for classification (discrete outputs): L = −mean[ log P(y) ]. The model assigns probabilities to each class via softmax; the loss is the negative log-probability of the correct class. The probability primer's log-prob section is exactly this.
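The cross-entropy bullet can be made concrete for a single example. This is a minimal sketch: the three raw scores (logits) and the choice of class 0 as the correct label are hypothetical values picked for illustration, and the function names are not taken from any library.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[label])

# Hypothetical 3-class example: raw scores for one input, true class is 0.
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
loss = cross_entropy(logits, label=0)
print([round(p, 3) for p in probs], round(loss, 3))
```

If the true class were the one the model considers least likely (index 2 here), the same function would return a larger loss, which is the behavior the definition asks for.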
Concrete example, using the same 5 housing points from §1 and the best-fit line ŷ = 0.75 · sqft + 130 the demo settled on:
sqft    truth y ($k)    prediction ŷ ($k)    error (ŷ − y)    squared error
────    ────────────    ─────────────────    ─────────────    ─────────────
 600         480               580                +100         10,000 (k²)
 850         920               768                −152         23,104 (k²)
1450        1080              1218                +138         19,044 (k²)
2100        1900              1705                −195         38,025 (k²)
3200        2480              2530                 +50          2,500 (k²)
MSE  = mean( squared errors )
     = (10000 + 23104 + 19044 + 38025 + 2500) / 5
     = 18,535   (in k², thousand-dollars squared)
RMSE = √MSE ≈ $136k   (same units as y)
Why square the residuals instead of using absolute value |ŷ − y|? Three practical reasons that show up everywhere:
- Differentiable everywhere. The square has a smooth derivative (2(ŷ − y)); the absolute value has a kink at zero where the derivative is undefined. Optimizers love smooth.
- Big errors get punished extra. Squaring an error of 100 gives 10,000; an error of 10 gives 100. The optimizer naturally focuses on whatever the model is getting most wrong, which is a feature, not a bug, in many problems.
- Tidy probabilistic story. MSE pops out as the maximum-likelihood loss when you assume the noise on y is Gaussian. Most "natural" losses come from a similar probabilistic argument.
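The arithmetic in the housing table, and the "big errors get punished extra" point, can be checked in a few lines. The sketch below recomputes the predictions from the rounded line ŷ = 0.75 · sqft + 130 without rounding them to whole numbers, so the MSE may differ from the table's figure by a few units. The mean absolute error (MAE) is included only as the absolute-value alternative discussed above.

```python
import math

sqft  = [600, 850, 1450, 2100, 3200]
truth = [480, 920, 1080, 1900, 2480]          # prices in $k

# The rounded line quoted in the text: y_hat = 0.75 * sqft + 130.
preds = [0.75 * x + 130 for x in sqft]
errors = [p - y for p, y in zip(preds, truth)]

mse = sum(e ** 2 for e in errors) / len(errors)
rmse = math.sqrt(mse)
mae = sum(abs(e) for e in errors) / len(errors)   # absolute-value alternative

# Share of the total penalty contributed by the single largest error.
worst = max(abs(e) for e in errors)
share_sq = worst ** 2 / sum(e ** 2 for e in errors)
share_abs = worst / sum(abs(e) for e in errors)

print(f"MSE={mse:.0f}  RMSE={rmse:.1f}  MAE={mae:.1f}")
print(f"largest error: {share_sq:.0%} of squared loss, {share_abs:.0%} of absolute loss")
```

The largest residual takes up a bigger share of the squared loss than of the absolute loss, which is the sense in which squaring makes the optimizer focus on the worst misses.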
Many other losses exist for special problems — Huber (robust to outliers), hinge (SVM classifiers), KL divergence (matching one distribution to another), focal loss (rare positives). But almost every one shares the same shape: per-example penalty, averaged across the dataset.
Two properties of the loss that the next section depends on:
- The loss is a function of θ. Hold the dataset fixed, vary the model parameters, and the loss is a number that depends on those parameters. We write it L(θ). Training is the search for the θ that minimizes this function.
- It usually has a gradient. Because the loss is differentiable in θ, the calculus primer's gradient ∇L(θ) exists and points uphill on the loss surface. Section 3 takes that arrow and runs with it.
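Both properties can be seen on the housing example: with the data held fixed, the loss becomes an ordinary function of θ = (w, b), and its gradient can be written down and checked numerically. This is a sketch under one assumption that is not in the original: sqft is expressed in thousands so that w and b have comparable scales. The evaluation point (500, 300) is arbitrary.

```python
sqft_k = [0.60, 0.85, 1.45, 2.10, 3.20]     # sqft in thousands (rescaled; an assumption)
price  = [480, 920, 1080, 1900, 2480]       # $k

def loss(w, b):
    """MSE of the line w*x + b on the fixed dataset: a function of (w, b) only."""
    return sum((w * x + b - y) ** 2 for x, y in zip(sqft_k, price)) / len(price)

def grad(w, b):
    """Analytic gradient (dL/dw, dL/db) of the MSE above."""
    n = len(price)
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(sqft_k, price)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(sqft_k, price)) / n
    return dw, db

def numeric_grad(w, b, eps=1e-4):
    """Central finite differences: nudge each parameter and watch the loss."""
    dw = (loss(w + eps, b) - loss(w - eps, b)) / (2 * eps)
    db = (loss(w, b + eps) - loss(w, b - eps)) / (2 * eps)
    return dw, db

theta = (500.0, 300.0)                      # an arbitrary point in parameter space
g = grad(*theta)
ng = numeric_grad(*theta)
print(g, ng)
```

The two gradients should agree to several digits, and a small step against the gradient should lower the loss, which is what "points uphill" implies.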
In a Transformer: the loss is cross-entropy, computed over the next-token distribution at every position in every training example, averaged across the batch. A single forward pass over a 4096-token sequence on a batch of 256 sequences produces ~1 million per-token cross-entropy values; the loss the optimizer actually sees is the mean of all of them. Every training step is built around that single scalar — compute it, get the gradient with respect to every parameter, take a step (Section 3). Lowering this one number, over and over, is what produces every interesting capability a modern LLM has.
Optimization
Nudge θ in the direction that lowers the loss, repeat a million times.
Section 2 turned "fit the data" into "minimize L(θ)." Now you actually have to do it. The loss is a function from a high-dimensional parameter space to a single number; finding its minimum directly is hopeless for any non-trivial model. The practical answer is the calculus primer's last section, applied at industrial scale: gradient descent.
The update rule, one line:
θ ← θ − η · ∇L(θ)
Compute the gradient: the vector of "how does the loss change if I push each parameter a tiny bit." Step against it. The learning rate η (eta) controls how big a step. Then repeat. After enough steps the parameters settle near a minimum of the loss. That's the entire algorithm. Nearly every neural network you've heard of was trained by repeating some variant of this loop.
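The update rule can be run as-is on the five housing points. This is a minimal sketch, not a tuned recipe: sqft is rescaled to thousands (an assumption made here so that one learning rate suits both parameters), and the starting point, η = 0.1, and the step count are hand-picked for this toy problem.

```python
sqft_k = [0.60, 0.85, 1.45, 2.10, 3.20]   # sqft in thousands (rescaling is an assumption)
price  = [480, 920, 1080, 1900, 2480]     # $k
n = len(price)

def grad(w, b):
    """Gradient of the MSE of w*x + b with respect to (w, b)."""
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(sqft_k, price)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(sqft_k, price)) / n
    return dw, db

w, b = 0.0, 0.0          # arbitrary starting point
eta = 0.1                # learning rate (hand-picked for this toy problem)
for step in range(2000):
    dw, db = grad(w, b)
    w -= eta * dw        # theta <- theta - eta * grad L(theta), one parameter at a time
    b -= eta * db

# Convert the slope back to "$k per sqft" for comparison with the running example.
print(f"w = {w / 1000:.3f} $k per sqft, b = {b:.1f} $k")
```

The loop should settle near the slope and intercept of the running example's best-fit line, reached here by repeated small steps instead of a formula.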
Three flavors based on how much data you look at to compute one gradient:
- Batch (full-batch) gradient descent. Use the entire dataset for every update. Most accurate gradient, slowest per step, doesn't fit in memory for big data. Almost nobody does this anymore.
- Stochastic gradient descent (SGD). Use just one example per update. Tons of noise in each gradient — but cheap, and the noise can actually help escape bad regions of the loss surface. Original recipe; pure SGD is now mostly a baseline.
- Mini-batch gradient descent. Use a small chunk — 16, 64, 256, 4096 examples per update. The best of both worlds: gradient is a reasonable estimate without requiring the whole dataset. This is what every modern model uses. When people say "SGD" in 2026, they almost always mean mini-batch SGD.
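The three flavors differ only in which examples feed the gradient. The sketch below, again on the housing points with sqft in thousands (an assumption), computes the gradient on all five examples, on each example alone, and on a random mini-batch of two. The evaluation point and the batch size are arbitrary choices.

```python
import random

sqft_k = [0.60, 0.85, 1.45, 2.10, 3.20]   # sqft in thousands (rescaled; an assumption)
price  = [480, 920, 1080, 1900, 2480]     # $k

def grad_on(indices, w, b):
    """MSE gradient (dw, db) computed on just the examples in `indices`."""
    m = len(indices)
    dw = sum(2 * (w * sqft_k[i] + b - price[i]) * sqft_k[i] for i in indices) / m
    db = sum(2 * (w * sqft_k[i] + b - price[i]) for i in indices) / m
    return dw, db

w, b = 500.0, 300.0
all_idx = list(range(len(price)))

full = grad_on(all_idx, w, b)                       # batch: every example
singles = [grad_on([i], w, b) for i in all_idx]     # one example per gradient
rng = random.Random(0)                              # fixed seed for repeatability
mini = grad_on(rng.sample(all_idx, 2), w, b)        # mini-batch of 2

# Averaging the single-example gradients recovers the full-batch gradient.
avg_dw = sum(g[0] for g in singles) / len(singles)
avg_db = sum(g[1] for g in singles) / len(singles)
print(full, (avg_dw, avg_db), mini)
```

Each single-example or mini-batch gradient is a noisy estimate of the full one, and the noise averages out: that is why cheap, small batches still move the parameters in a useful direction.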
Walking the loss surface with gradient descent looks like rolling a ball downhill, except the "hill" lives in a million-dimensional space you can't draw. The 1-D picture is the right intuition anyway.
Two practical knobs every project tunes:
- Learning rate (η). The most-tuned hyperparameter in ML. Too small and training takes forever; too big and the parameters jump past minima and the loss oscillates or explodes. Modern recipes usually schedule the learning rate: warm up from 0, then decay over training (linear, cosine, step).
- Batch size. Bigger batches give a less-noisy gradient (more samples → the variance shrinks like 1/n, so the noise shrinks like 1/√n; see Probability primer Section 4) but cost more memory per step. Most papers report the batch size they used.
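The learning-rate behavior described above is easy to reproduce on the housing example. In this sketch the three η values are chosen by hand to land in the "too small", "reasonable", and "too big" regimes for this particular problem (with sqft in thousands, an assumption); the thresholds would be different for any other dataset or model.

```python
sqft_k = [0.60, 0.85, 1.45, 2.10, 3.20]   # sqft in thousands (rescaled; an assumption)
price  = [480, 920, 1080, 1900, 2480]     # $k
n = len(price)

def loss(w, b):
    """MSE of the line w*x + b on the five points."""
    return sum((w * x + b - y) ** 2 for x, y in zip(sqft_k, price)) / n

def run_gd(eta, steps=50):
    """Run plain gradient descent from (0, 0) and return the final loss."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        dw = sum(2 * (w * x + b - y) * x for x, y in zip(sqft_k, price)) / n
        db = sum(2 * (w * x + b - y) for x, y in zip(sqft_k, price)) / n
        w, b = w - eta * dw, b - eta * db
    return loss(w, b)

start = loss(0.0, 0.0)
for eta in (0.001, 0.1, 0.3):             # too small, reasonable, too big (for this problem)
    print(f"eta={eta}: loss {start:.0f} -> {run_gd(eta):.3g}")
```

With the same 50 steps, the smallest rate has barely moved, the middle one is close to the minimum, and the largest one ends with a loss far above where it started.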
Two important improvements built on top of vanilla mini-batch SGD that you'll see everywhere:
- Momentum. Keep a running exponential average of recent gradients and step in that direction. Smooths out the noise from mini-batches; speeds up progress along consistent gradient directions; helps roll through small bumps in the loss surface that pure SGD would get stuck on.
- Adam. Adds a second running average — of squared gradients — and divides by its square root. Effect: each parameter gets its own per-coordinate learning rate, automatically scaled down on dimensions where the gradient has been big. Robust enough to be the default optimizer for almost every Transformer.
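Both update rules fit in a few lines each. The sketch below applies them to a one-parameter toy loss, L(θ) = (θ − 3)², chosen here purely for illustration. The momentum version uses the exponential-average form described above. The Adam version includes the usual bias-correction terms and a small eps in the denominator, which the description above does not mention; the hyperparameter values are common defaults, not recommendations from this primer.

```python
import math

def grad(theta):
    """Gradient of the toy loss L(theta) = (theta - 3)^2, minimized at theta = 3."""
    return 2.0 * (theta - 3.0)

def momentum_descent(theta=0.0, lr=0.1, beta=0.9, steps=500):
    m = 0.0
    for _ in range(steps):
        m = beta * m + (1 - beta) * grad(theta)   # running average of gradients
        theta -= lr * m                           # step along the averaged direction
    return theta

def adam_descent(theta=0.0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g           # first moment: average gradient
        v = beta2 * v + (1 - beta2) * g * g       # second moment: average squared gradient
        m_hat = m / (1 - beta1 ** t)              # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

print(momentum_descent(), adam_descent())
```

With one parameter the per-coordinate scaling is not visible; in a real model m and v are arrays with one entry per parameter, and the division is element-wise.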
How do you get ∇L(θ) for a model with billions of parameters? The chain rule, applied at scale — backpropagation. Forward pass: compute the loss. Backward pass: walk the chain of operations in reverse, multiplying local derivatives at every step. Calculus primer Section 3 is the entire mechanism; the optimizer in this section is whatever step rule consumes the gradient backprop hands it.
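The forward/backward description can be traced by hand on a model small enough to read in full. This sketch uses a hypothetical two-parameter network, h = tanh(w1 · x), ŷ = w2 · h, with squared-error loss; the architecture and the numeric values are assumptions made for illustration. The backward pass multiplies local derivatives in reverse order, and a finite-difference check confirms the result.

```python
import math

def forward(w1, w2, x, y):
    """Tiny two-layer model: h = tanh(w1*x), y_hat = w2*h, L = (y_hat - y)^2."""
    h = math.tanh(w1 * x)
    y_hat = w2 * h
    return h, y_hat, (y_hat - y) ** 2

def backward(w1, w2, x, y):
    """Walk the operations in reverse, multiplying local derivatives."""
    h, y_hat, _ = forward(w1, w2, x, y)
    dL_dyhat = 2.0 * (y_hat - y)          # L = (y_hat - y)^2
    dL_dw2 = dL_dyhat * h                 # y_hat = w2 * h
    dL_dh = dL_dyhat * w2
    dL_dz = dL_dh * (1.0 - h * h)         # h = tanh(z), tanh'(z) = 1 - tanh(z)^2
    dL_dw1 = dL_dz * x                    # z = w1 * x
    return dL_dw1, dL_dw2

# Hypothetical values, chosen only for illustration.
w1, w2, x, y = 0.5, -1.5, 2.0, 1.0
g1, g2 = backward(w1, w2, x, y)

# Finite-difference check of the same derivatives.
eps = 1e-6
n1 = (forward(w1 + eps, w2, x, y)[2] - forward(w1 - eps, w2, x, y)[2]) / (2 * eps)
n2 = (forward(w1, w2 + eps, x, y)[2] - forward(w1, w2 - eps, x, y)[2]) / (2 * eps)
print((g1, g2), (n1, n2))
```

Automatic-differentiation libraries do the same bookkeeping for every operation in the model, so the backward pass never has to be written by hand.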
In a Transformer: the training loop is exactly this. (1) Sample a batch of token sequences from the dataset. (2) Forward pass through the Transformer to get the loss. (3) Backward pass to get the gradient with respect to every parameter. (4) Apply one optimizer step (almost always AdamW: Adam plus a weight-decay regularizer from Section 5). Repeat for hundreds of thousands of steps or more. Modern training infrastructure (data parallel, tensor parallel, pipeline parallel, ZeRO, gradient checkpointing) exists to make this four-step loop scale to clusters of thousands of GPUs, but the loop itself is unchanged.
Overfitting & Generalization
A model that nails its homework but fails the exam has learned the wrong thing.
Imagine a student who memorizes every practice problem perfectly but freezes on a slightly different question. They've learned the specific examples, not the pattern behind them. That same failure happens in ML, and it has a name: overfitting. The opposite, performing well not just on the training data but on examples the model has never seen, is generalization. The whole point of supervised learning is generalization. Training loss alone cannot tell you whether you got it.
The diagnostic is the validation set from the data primer's Section 3. Train the model; track L_train (loss on the training set) and L_val (loss on the held-out validation set) as the optimizer runs. Three regimes show up:
- Underfitting. Both losses are high and roughly equal. The model is too small or hasn't trained long enough; it can't even fit the training data, let alone generalize. Fix: bigger model, more training, better features.
- Just right. Both losses are low; validation loss is just a bit above training loss. The model has captured the actual signal and ignored the noise.
- Overfitting. Training loss is tiny and still dropping; validation loss is much higher and starting to rise. The model has memorized training-set quirks that don't hold on new data. Fix: stop training earlier, shrink the model, add data, regularize (Section 5).
Time to expand §1's function family. Keep the same 5 housing points but stop assuming a line — fit polynomials of varying degree to the same data. The classic picture:
            train RMSE    val RMSE     interpretation
            ──────────    ─────────    ──────────────────────────────
degree 0    high          high         "just predict the mean price"
degree 1    moderate      moderate     the §1 best-fit line
degree 4    $0            very high    interpolates every housing point exactly; wiggles wildly between them
The highest-degree polynomial in the table achieves zero training loss (it can pass through every data point exactly), but its predictions between the data points are absurd. Low training loss is necessary but not sufficient for a good model. The gap between training and validation loss, sometimes called the generalization gap, is the real diagnostic.
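The training-RMSE column of the table can be reproduced on the five housing points. This sketch uses NumPy's Polynomial.fit for the least-squares fits. It covers only the training side: the primer gives no held-out housing points, so no validation RMSE is computed here.

```python
import numpy as np
from numpy.polynomial import Polynomial

sqft  = np.array([600, 850, 1450, 2100, 3200], dtype=float)
price = np.array([480, 920, 1080, 1900, 2480], dtype=float)   # $k

def train_rmse(degree):
    """Least-squares polynomial fit of the given degree; RMSE on the same points."""
    # Polynomial.fit rescales x internally, which keeps higher degrees well conditioned.
    p = Polynomial.fit(sqft, price, degree)
    return float(np.sqrt(np.mean((p(sqft) - price) ** 2)))

for degree in (0, 1, 4):
    print(f"degree {degree}: train RMSE = {train_rmse(degree):.2f} $k")
```

A degree-4 polynomial has five coefficients, exactly as many as there are data points, which is why it can drive the training error to zero (up to floating-point rounding) regardless of how noisy the prices are.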
Why does overfitting happen? A model with enough capacity has more freedom than the data constrains. There are infinitely many functions that pass through any finite set of points; "fit the training data" alone doesn't pick a single one. Some of those functions are smooth and generalize; some twist wildly between data points and don't. Without something pushing the optimizer toward the smooth ones, it'll happily settle on a wild one if that lets it drive training loss to zero.
Three levers that fight overfitting, in the order you usually pull them:
- More data. The cleanest fix. Twice the data constrains the function more tightly and makes memorization harder. This is why "scaling laws" papers consistently find that more pretraining data improves downstream accuracy in predictable ways.
- Less capacity. A smaller model has less freedom to memorize. Shrink the width, drop a layer, switch from a polynomial of degree 12 to one of degree 3.
- Regularization. Keep the model large but bias the optimizer toward simpler solutions. Section 5 is exactly this.
A subtle point that catches everyone the first time: "big model = always overfit" is wrong. Modern deep learning routinely trains models with vastly more parameters than training examples and gets fine generalization anyway. The "classical" bias-variance story (more parameters → worse generalization once you cross some threshold) breaks down for big neural networks. The phenomenon is called double descent: validation loss can rise, then fall again as you make the model even bigger. Why exactly is still an active research area, but practically: don't assume "huge model" implies "overfit."
In a Transformer: LLMs typically train for about one epoch over trillions of tokens. The training set is so big that the model cannot simply memorize all of it; generalization is the natural outcome rather than an exception. Overfitting still happens during fine-tuning, where the model is adapted to a much smaller, task-specific dataset, and exactly the diagnostics in this section (training loss dropping while validation loss rises) are how people detect it. Modern fine-tuning recipes use small learning rates, early stopping, LoRA (a parameter-efficient way to reduce effective capacity), and the regularizers from the next section to keep the generalization gap tight.
Regularization
Bake a preference for simple solutions into the loss.
Section 4 named the problem: with enough capacity, many different parameter settings achieve zero training loss, but only some of them generalize. Without a tiebreaker the optimizer can land on any of them, and the wild ones are catastrophic on new data. Regularization is the tiebreaker. It bakes a preference for simpler solutions directly into the loss — so the optimizer has to balance "fit the data" against "stay simple," and ties go to the simple side.
The single template every regularizer instantiates:
Lreg(θ) = Ldata(θ) + λ · Ω(θ)
Ldata is the original loss from Section 2. Ω(θ) ("omega") is a penalty that measures "how complicated this parameter setting is." λ ("lambda") is the regularization strength — a knob the user picks. Adding the penalty changes what the optimizer is searching for: the new minimum is the parameter setting that fits the data and is simple, with λ trading off between the two.
Four regularizers cover almost every practical case:
- L2 (weight decay). Ω(θ) = Σ θᵢ²: the sum of squared parameter values. Penalizes big weights. Tends to spread "fit-the-data" mass across many small weights instead of letting a few weights grow huge. Mechanically, the gradient update gets an extra −η · 2λ · θ term, which "decays" each weight toward zero on every step, hence the name weight decay.
- L1. Ω(θ) = Σ |θᵢ|: the sum of absolute parameter values. Has a remarkable side effect: it drives many weights all the way to exactly zero, producing a sparse model. Useful when you want feature selection built into training.
- Dropout. During training only, randomly set a fraction of the activations to zero on each forward pass. The model can't rely on any one neuron and effectively learns an ensemble of "what if these neurons were missing" sub-models. At inference time, all neurons are kept. Cheap, surprisingly effective, ubiquitous in the 2010s; partly replaced by big batches + weight decay in modern Transformers, but still in the toolbox.
- Early stopping. Watch the validation loss while training. The moment it starts rising, stop. The cheapest "regularizer" in the world (no penalty term, no extra hyperparameter beyond "how patient am I"), and it works because cutting training short leaves the parameters in a smoother, simpler state that they would have left behind if you kept going.
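The last two bullets, dropout and early stopping, are procedures rather than penalty terms, so a short sketch may help. Two assumptions are made here that the text does not specify: the dropout function uses the "inverted" variant, which rescales the surviving activations during training so nothing needs to change at inference, and early stopping is implemented with a patience counter. The activation values and the validation-loss curve are hypothetical.

```python
import random

def dropout(activations, p, training, rng):
    """Inverted dropout: zero each activation with probability p during training."""
    if not training:
        return list(activations)              # inference: keep every neuron
    keep = 1.0 - p
    # Survivors are scaled by 1/keep so the expected activation is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def early_stop_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch); stop after `patience` epochs without improvement."""
    best_epoch, best = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best_epoch, best = epoch, loss
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

rng = random.Random(0)                        # fixed seed for repeatability
acts = [1.0, 2.0, 3.0, 4.0]
train_out = dropout(acts, p=0.5, training=True, rng=rng)
eval_out = dropout(acts, p=0.5, training=False, rng=rng)
print(train_out, eval_out)

# Hypothetical validation-loss curve: improves, then starts rising.
curve = [1.00, 0.80, 0.70, 0.72, 0.75, 0.90]
print(early_stop_epoch(curve, patience=2))
```

In practice the parameters saved at the best epoch are the ones kept, which is why the function reports both the stopping epoch and the best one.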
Concrete picture. Section 4's overfit high-degree polynomial passes through every point but wiggles wildly between them. With an L2 penalty on the polynomial coefficients, the optimizer prefers small coefficients, which directly suppresses the wiggling. As λ grows, the fit shrinks from "passes through every point" toward "smooth curve that mostly ignores noise."
Notice the shape of the tradeoff. At λ = 0 we get the wild Section 4 polynomial. At very large λ the penalty dominates and the fit becomes nearly constant — the model is so simple it barely uses the data at all. Somewhere in between is a sweet spot where the validation loss is minimized. How do you find it? Validation set sweep. Try several values of λ, pick the one with the lowest validation loss. (Section 3 of the data primer: that's what the validation set is for.)
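The two ends of this tradeoff can be seen numerically with an L2-penalized (ridge) polynomial fit on the five housing points. This is a sketch under several assumptions made here: a degree-4 polynomial, sqft rescaled to [-1, 1], an unpenalized intercept (handled by centering), and a penalty added to the sum rather than the mean of squared errors, which only changes the scale of λ. It reports training error and coefficient size; picking the best λ would additionally need a validation set, which the housing example does not include.

```python
import numpy as np

sqft  = np.array([600, 850, 1450, 2100, 3200], dtype=float)
price = np.array([480, 920, 1080, 1900, 2480], dtype=float)    # $k

# Rescale sqft to [-1, 1] so the powers stay well conditioned (a choice made here).
z = (sqft - 1900.0) / 1300.0
X = np.column_stack([z ** k for k in range(1, 5)])   # degree-4 features, no constant column
X_c = X - X.mean(axis=0)                             # centering leaves the intercept unpenalized
y_c = price - price.mean()

def ridge_fit(lam):
    """Minimize sum of squared errors + lam * sum(coef^2) in closed form."""
    coef = np.linalg.solve(X_c.T @ X_c + lam * np.eye(4), X_c.T @ y_c)
    rmse = float(np.sqrt(np.mean((X_c @ coef - y_c) ** 2)))
    return coef, rmse

lams = (0.0, 0.01, 1.0, 1e6)
norms, rmses = [], []
for lam in lams:
    coef, rmse = ridge_fit(lam)
    norms.append(float(np.linalg.norm(coef)))
    rmses.append(rmse)
    print(f"lambda={lam:g}: |coef|={norms[-1]:.3g}, train RMSE={rmse:.3g} $k")
```

As λ increases, the coefficients shrink and the training error rises, ending at the error of a constant prediction when the penalty dominates.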
A useful way to think about regularization is from the probability primer: every regularizer corresponds to a prior belief about what reasonable parameter values look like. L2 corresponds to "weights are probably small and Gaussian-distributed." L1 corresponds to "weights are probably zero unless evidence forces them otherwise." The regularized loss is then the negative log-posterior — fit + prior, in the Bayesian sense. The same idea, two vocabularies.
In a Transformer: the optimizer is almost universally AdamW, which is Adam plus a weight-decay regularizer of the L2 family, applied directly to the weights instead of being folded into the gradient. The "W" is literally the weight decay. Other regularizers in heavy use: dropout inside attention and MLP layers (especially during fine-tuning); label smoothing in cross-entropy (a penalty on overconfident output distributions); gradient clipping (prevents a single huge gradient from destroying a long training run, structurally similar to a "regularize the update step itself" idea). And of course the strongest regularizer in modern LLM training is simply training on more data; see Section 4's "more data" lever. Every choice is a different point on the same tradeoff curve: how willing am I to let the model fit the training data, given how much I trust the training data to look like the real world?