Loss & Stability Primer

Four tools show up in nearly every deep-learning paper, yet almost nobody defines them because "you know what softmax is, right?" A beginner doesn't, and even experts forget the details. They fall into two natural pairs. Softmax turns raw scores into a probability distribution and cross-entropy measures how wrong that distribution is: output and loss. Dropout regularizes by randomly zeroing neurons, and BatchNorm / LayerNorm keep activations at a controlled scale so very deep networks remain trainable: the two pieces of training stability. All four are inside every Transformer; this is the last primer before that one.

Softmax

Any real-valued vector → a clean probability distribution.

A neural network's output layer spits out raw real numbers — they can be negative, bigger than 1, of arbitrary scale. To use them as a probability distribution (which is what classification needs, what attention needs, what next-token prediction needs), we need to turn them into a vector of positive numbers that sums to 1. The standard tool for this job is softmax.

One line of math:

softmax(z)_i = exp(z_i) / Σ_j exp(z_j)

Apply exp element-wise to make everything positive, then divide by the total so the result sums to 1. That's the whole recipe.

Concrete example: scores z = (2.0, 1.0, 0.5).

  exp(z)        =  (7.39,  2.72,  1.65)
  sum of exp(z) =   11.76
  softmax(z)    =  (0.629, 0.231, 0.140)   ← sums to 1.000
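
The same arithmetic can be checked in a few lines. This is a minimal sketch in plain Python with no framework; the function name `softmax` is our own choice for illustration, not a library API.

```python
import math

def softmax(z):
    """Naive softmax: exponentiate each score, then divide by the total."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.5])
print([round(p, 3) for p in probs])  # [0.629, 0.231, 0.14]
print(round(sum(probs), 6))          # 1.0
```

This version is fine for small scores like these; it is not safe for large ones, because the exponentials can overflow.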

Softmax: take any real-valued vector, return a clean probability distribution. The bigger inputs get exponentially bigger shares.

Three properties make softmax the default choice over every other "vector → distribution" recipe:

Order-preserving. If z_a > z_b then softmax(z)_a > softmax(z)_b. The argmax of the scores is also the argmax of the probabilities — softmax is "argmax that you can backprop through."
Differentiable. The pure argmax operation is piecewise constant, so its gradient is zero almost everywhere and useless for backprop. Softmax is smooth, so the optimizer can shift probability mass from one class to another in tiny steps.
Exponential separation. The bigger a score, the exponentially bigger its share of the probability mass. A score 2 units higher than the rest takes ~7× more probability; a score 5 units higher takes ~150×. This is "winner-take-most" without being "winner-take-all."
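
The first and third properties are easy to verify numerically. The sketch below assumes a toy setup of one leading score and two tied rivals; `share_ratio` is a hypothetical helper written for this illustration.

```python
import math

def softmax(z):
    """Plain softmax; fine here because the scores are small."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def share_ratio(gap):
    """Leader's probability divided by a rival's probability, when the
    leader's score is `gap` units above two tied rivals."""
    p = softmax([gap, 0.0, 0.0])
    return p[0] / p[1]

for gap in (2.0, 5.0):
    # The ratio equals exp(gap), however many rivals there are.
    print(gap, round(share_ratio(gap), 1), round(math.exp(gap), 1))

scores = [0.3, -1.2, 2.5, 0.9]
probs = softmax(scores)
# Order-preserving: the largest score is also the largest probability.
print(scores.index(max(scores)) == probs.index(max(probs)))  # True
```

The ratio between two probabilities depends only on the difference of their scores, which is where the "~7×" and "~150×" figures come from.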

Softmax has a knob most introductions skip: temperature. Replace softmax(z) with softmax(z / T) for some T > 0 and the distribution changes shape. T → 0 sharpens it toward one-hot (the maximum eats everything); T → ∞ flattens it toward uniform (every class equally likely). Standard T = 1 is the default. In LLM sampling, temperature is the knob that controls how "creative" the model is — high temperature gives weird, surprising tokens; low temperature gives the obvious one every time.
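
To see the knob in action, here is a small sketch. The name `softmax_with_temperature` and the specific temperatures are illustrative choices, not taken from any particular library.

```python
import math

def softmax_with_temperature(z, T=1.0):
    """softmax(z / T). Small T sharpens the distribution, large T flattens it."""
    if T <= 0:
        raise ValueError("temperature T must be positive")
    scaled = [v / T for v in z]
    m = max(scaled)                      # shift by the max for numerical safety
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

z = [2.0, 1.0, 0.5]
for T in (0.1, 1.0, 10.0):
    # Low T: nearly one-hot on the top score. High T: nearly uniform.
    print(T, [round(p, 3) for p in softmax_with_temperature(z, T)])
```

Dividing by T rescales the gaps between scores, and since probability ratios are exponentials of those gaps, shrinking T stretches the ratios while growing T squashes them toward 1.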

A practical note: numerical stability. exp(z_i) overflows for fairly moderate z_i: above roughly 88 in float32 and roughly 709 in float64. The trick: subtract max(z) from every z_i before exponentiating. The result is mathematically identical (the constant cancels in the ratio), but the largest exponent is now exp(0) = 1, so no term blows up. Framework softmax implementations do this for you, but if you ever write softmax by hand, remember the subtraction.
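
A sketch of the max-subtraction trick, side by side with the naive version. Python floats are 64-bit, so the scores here are chosen large enough to overflow even in double precision; the function names are ours.

```python
import math

def softmax_naive(z):
    exps = [math.exp(v) for v in z]      # math.exp raises OverflowError when v is too large
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]  # largest term is exp(0) = 1
    total = sum(exps)
    return [e / total for e in exps]

big = [1000.0, 999.0, 998.5]
try:
    softmax_naive(big)
except OverflowError:
    print("naive softmax overflowed")

# Same gaps as (2.0, 1.0, 0.5), so the same distribution comes out.
print([round(p, 3) for p in softmax_stable(big)])  # [0.629, 0.231, 0.14]
```

Only the differences between scores matter, which is why shifting every score by the same constant leaves the output unchanged.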

In a Transformer: softmax shows up in two places, both essential. Attention weights: turn the raw dot-product scores between queries and keys into a distribution over positions, so the model "spends" a fixed budget of attention across the sequence. Output: a final linear layer maps the hidden state to one raw score per vocabulary entry (~50,000 of them), and softmax turns those into next-token probabilities. The predicted probabilities literally sum to 1 over every token the model knows.

Cross-Entropy Loss

The loss that pairs with softmax in ~every classifier.

Softmax turns scores into a distribution. Cross-entropy turns "your distribution vs the right answer" into a single number — the loss. Together they are the standard classification setup, used in roughly every paper that involves discrete labels: ImageNet classifiers, language models, next-token prediction in a Transformer.

The formula, for one training example with true class c:

L = −log(p_c)

Where p_c is the softmax probability the model assigned to the correct class. That's the whole loss — one logarithm of one probability. The general form L = −Σ_i y_i log(p_i) with a one-hot target y collapses to the same thing because every term except the true class is multiplied by zero.
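
Both forms fit in a few lines of plain Python. This is a sketch with made-up probabilities; the function names are our own.

```python
import math

def cross_entropy(probs, true_class):
    """L = -log(p_c): the loss for one example with a known true class."""
    return -math.log(probs[true_class])

def cross_entropy_general(probs, target):
    """L = -sum_i y_i log(p_i). Terms with y_i = 0 are skipped,
    which treats 0 * log(0) as 0."""
    return -sum(y * math.log(p) for y, p in zip(target, probs) if y > 0)

probs = [0.7, 0.2, 0.1]                       # illustrative softmax output
print(round(cross_entropy(probs, 0), 4))      # 0.3567
print(round(cross_entropy_general(probs, [1.0, 0.0, 0.0]), 4))  # 0.3567
```

With a one-hot target, only the true-class term survives the sum, so the two functions agree.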

Reading −log(p) off a graph:

p_c = 1.00 → L = 0. Perfect prediction; loss bottoms out.
p_c = 0.95 → L ≈ 0.05. Confident and right; loss is tiny.
p_c = 0.50 → L ≈ 0.69. Uncertain; loss is bounded.
p_c = 0.10 → L ≈ 2.30. Confident and wrong; loss is large.
p_c → 0 → L → ∞. The model assigned no probability to the truth — infinite punishment. Don't do that.

Cross-entropy is the height of the −log curve at the probability you assigned to the true class. Confident-wrong is punished much harder than uncertain.

The asymmetry between confident-correct and confident-wrong is the whole point of the log. Squared error doesn't have this asymmetry: a prediction off by 0.5 always costs 0.25, whether the truth is 0 or 1, and even the worst possible miss costs only 1. Cross-entropy cares deeply about which side of "right" you fell on, and that gives the optimizer a sharp signal when the model is confidently wrong: the loss keeps growing as p_c → 0, and gradients stay large precisely where the model is most wrong instead of flattening out. That's why nearly every classifier uses cross-entropy and not MSE on probabilities.
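
A quick side-by-side makes the difference concrete. The sketch assumes the simplest setting: the target for the true class is 1, and we vary the probability the model gave it. The function names are ours.

```python
import math

def squared_error(p_true):
    """Squared error against a target of 1 for the true class."""
    return (1.0 - p_true) ** 2

def cross_entropy(p_true):
    """-log of the probability assigned to the true class."""
    return -math.log(p_true)

print("p_true  squared_error  cross_entropy")
for p_true in (0.9, 0.5, 0.1, 0.01, 0.001):
    print(f"{p_true:<7} {squared_error(p_true):<14.3f} {cross_entropy(p_true):.3f}")
```

Squared error saturates just below 1 as the prediction gets worse, while cross-entropy keeps climbing, so a confidently wrong model is never let off the hook.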

A clean mathematical bonus: when you compose softmax + cross-entropy and take the gradient with respect to the pre-softmax logits, the answer simplifies to ∂L/∂z_i = p_i − y_i. Predicted minus target, and that's it. The complicated softmax derivative and the chain-rule pieces from −log all cancel. Deep-learning frameworks exploit this by exposing one fused operation ("softmax cross-entropy with logits"), which is more numerically stable and cheaper than computing softmax explicitly and then taking −log.
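
The p − y result can be sanity-checked with finite differences. The sketch below computes the loss directly from logits with the log-sum-exp identity, which is the idea behind the fused operation; it is an illustration under our own naming, not any framework's actual implementation.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(z, true_class):
    """-log softmax(z)[true_class] = logsumexp(z) - z[true_class]."""
    m = max(z)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in z))
    return log_sum_exp - z[true_class]

def analytic_grad(z, true_class):
    """dL/dz_i = p_i - y_i, with y one-hot at true_class."""
    p = softmax(z)
    return [p_i - (1.0 if i == true_class else 0.0) for i, p_i in enumerate(p)]

def numeric_grad(z, true_class, h=1e-6):
    """Central finite differences, one logit at a time."""
    grad = []
    for i in range(len(z)):
        up, down = list(z), list(z)
        up[i] += h
        down[i] -= h
        diff = (cross_entropy_from_logits(up, true_class)
                - cross_entropy_from_logits(down, true_class))
        grad.append(diff / (2 * h))
    return grad

z = [2.0, 1.0, 0.5]
print([round(g, 4) for g in analytic_grad(z, 0)])
print([round(g, 4) for g in numeric_grad(z, 0)])
```

The two printed gradients should agree to several decimal places. Note that the loss never forms a probability and then takes its log, so a tiny p_c cannot round to zero first.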

The information-theoretic backstory. The full −Σ y_i log p_i is the cross-entropy H(y, p) between the target distribution y and the predicted distribution p. It equals H(y) + KL(y ‖ p): target entropy plus the KL divergence from y to p. For one-hot targets H(y) = 0, so minimizing cross-entropy is minimizing the KL divergence between the prediction and the answer. (For soft labels, as in distillation or label smoothing, H(y) is no longer zero, but it does not depend on the model, so cross-entropy and KL have identical gradients; KL is the part you actually care about.) This is also why minimizing cross-entropy is equivalent to maximum likelihood under a categorical model.
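
The decomposition is easy to confirm numerically. A minimal sketch with made-up distributions; all three helper names are ours.

```python
import math

def entropy(y):
    """H(y) = -sum_i y_i log y_i, skipping zero entries."""
    return -sum(yi * math.log(yi) for yi in y if yi > 0)

def cross_entropy(y, p):
    """H(y, p) = -sum_i y_i log p_i."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p) if yi > 0)

def kl_divergence(y, p):
    """KL(y || p) = sum_i y_i log(y_i / p_i)."""
    return sum(yi * math.log(yi / pi) for yi, pi in zip(y, p) if yi > 0)

p = [0.6, 0.3, 0.1]              # illustrative model prediction
one_hot = [1.0, 0.0, 0.0]
soft = [0.9, 0.05, 0.05]         # e.g. a smoothed label

for y in (one_hot, soft):
    lhs = cross_entropy(y, p)
    rhs = entropy(y) + kl_divergence(y, p)
    print(round(lhs, 6), round(rhs, 6), round(entropy(y), 6))
```

For the one-hot target the entropy column is zero and cross-entropy equals KL exactly; for the soft target the two differ by the constant H(y).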

In a Transformer: the entire training objective for a language model is "cross-entropy of the next token, summed over every position in every sequence in every batch in the dataset." Trillions of tokens, each with one cross-entropy term. That is essentially the entire loss function of GPT-3, GPT-4, Claude, Gemini, LLaMA — all of them. The architecture is fancier than this primer can fit; the loss is one line.

Dropout

A brutally simple regularizer: randomly switch off half the neurons.

Networks overfit, especially the deep ones. The supervised primer's §5 covered the gentle regularizers (weight decay, early stopping). This section is about the brutal one. The idea is unreasonable when first heard: every forward pass during training, randomly turn off half the neurons in a layer and pretend they don't exist. Inference uses every neuron as usual. Somehow this works extraordinarily well.

Mechanically:

Training forward (inverted dropout, p = keep probability):
  m   ~  Bernoulli(p)      ← independent per neuron
  a'  =  (a · m) / p       ← drop, then scale up survivors

Evaluation forward:
  a'  =  a                 ← no dropout, no scaling

Backward:
  ∂L/∂a  =  (∂L/∂a') · m / p     ← gradient only flows
                                    through the survivors

The 1/p scaling ("inverted dropout") keeps the expected activation the same as eval-time, so you don't have to do any rescaling at inference. Standard p: usually p = 0.5 for fully-connected hidden layers, p = 0.9 (light dropout) for layers near the input, and often no dropout on the output layer itself.
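
The recipe above translates almost line for line into code. This is a sketch on a flat list of activations in plain Python, with `keep_prob` playing the role of p; the function names and the `rng` parameter are our own additions for reproducibility.

```python
import random

def dropout_forward(a, keep_prob, training, rng=random):
    """Inverted dropout. Returns (output, mask); mask is None in eval mode."""
    if not training:
        return list(a), None                      # eval: no dropout, no scaling
    mask = [1.0 if rng.random() < keep_prob else 0.0 for _ in a]
    out = [x * m / keep_prob for x, m in zip(a, mask)]  # drop, then scale survivors
    return out, mask

def dropout_backward(grad_out, mask, keep_prob):
    """Gradient flows only through the survivors, with the same 1/p scale."""
    return [g * m / keep_prob for g, m in zip(grad_out, mask)]

rng = random.Random(0)                            # seeded for repeatability
activations = [1.0] * 10_000
out, mask = dropout_forward(activations, keep_prob=0.5, training=True, rng=rng)
print(sum(out) / len(out))                        # close to 1.0: expectation preserved
print(dropout_forward([1.0, 2.0], 0.5, training=False)[0])  # [1.0, 2.0]
```

Each surviving activation is doubled when p = 0.5, so the average over many neurons stays near the eval-time value.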

During training: a different random subset of neurons every forward pass. During eval: nothing is dropped. Same network, two modes.

Why this absurd recipe regularizes:

Co-adaptation breaks. Without dropout, neuron A might learn to fire only when neuron B fires too — they "memorize" features as a pair. Drop one of them randomly and the strategy fails. After enough training under dropout, every neuron has had to be useful on its own, against any random selection of teammates.
Ensemble effect. Each random mask defines a different sub-network. Training under dropout is, roughly, training an exponentially large ensemble of sub-networks that share weights. Inference (no dropout) is an approximate average over all of them, which is why it works as well as it does.
Plain noise injection. Adding random noise to activations is itself regularizing — it prevents the network from sharply memorizing exact patterns. Even if you don't buy the ensemble framing, this much always holds.

Things to know that nobody mentions until you trip on them. (1) Dropout is only active during training; forgetting to switch the model to eval mode before inference is a classic bug. (2) Dropout breaks the assumption that activations are deterministic given the input — a model with dropout returns a different prediction every time you call it during training. (3) Dropout interacts badly with BatchNorm (§4) — together they can introduce a "variance shift" between train and eval. The usual remedy: put dropout after BatchNorm, or skip dropout entirely if your network has BatchNorm.

Dropout has lost a little popularity in modern very-large models. With big enough datasets and enough regularization from other sources (weight decay, data augmentation, scale itself), the marginal benefit of dropout shrinks. But it's still everywhere in Transformer codebases (attention dropout, FFN dropout, residual dropout), applied at small drop probabilities like 0.1, which is keep probability p = 0.9 in the notation above. The ImageNet-era 50% is rare now.

In a Transformer: typically three slots, all with low drop rate (0.0 – 0.1). Attention dropout — applied to the softmax output of attention before it multiplies the values. FFN dropout — applied to the hidden activations of the FFN sublayer. Residual dropout — applied to the output of each sublayer before the residual sum. The dropout rate is one of the most commonly tuned hyperparameters, often pushed to zero for huge models that don't need extra regularization.

BatchNorm & LayerNorm

The layer that resets activations to mean 0, std 1 — every step, every layer.

The backprop primer's §3 said one thing clearly: in deep networks, the product of local derivatives across layers tends to vanish or explode. The most powerful weapon against that drift is a normalization layer dropped between every "Linear" and the next. BatchNorm (2015) was the first widely-used recipe; LayerNorm (2016) is the Transformer's favorite. Both compute mean and variance, subtract the mean, divide by the std, and apply a learnable scale + bias.

The shared recipe, applied across whichever axis the variant chose:

Given an activation tensor x:
  μ      =  mean(x, axis)
  σ²     =  var (x, axis)
  x_hat  =  (x − μ) / sqrt(σ² + ε)
  y      =  γ · x_hat + β        ← γ, β are learnable

In plain English:
  1. subtract the mean
  2. divide by the standard deviation
  3. multiply by a learnable scale, add a learnable bias

The difference between BatchNorm and LayerNorm is one line: which axis you compute μ and σ over.

BatchNorm — normalize each feature across the batch. For an activation of shape [batch, features], the mean and std are taken down each column. Every feature ends up with mean 0, std 1 across the batch. Requires a meaningful batch size to estimate the statistics; original home: convolutional networks.
LayerNorm — normalize each example across its features. For [batch, features], the mean and std are taken across each row. Every example ends up with mean 0, std 1 across its features. Doesn't care about batch size at all; original home: RNNs, then Transformers took it over.
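
The "one line" difference shows up clearly in code. This sketch works on a [batch, features] list of lists in plain Python and leaves out the learnable γ and β (equivalent to γ = 1, β = 0); ε sits inside the square root, as in most framework implementations. The function names are ours.

```python
import math

def normalize(values, eps=1e-5):
    """Subtract the mean, divide by sqrt(variance + eps)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def batch_norm(x, eps=1e-5):
    """Normalize each feature (column) across the batch."""
    columns = [normalize(list(col), eps) for col in zip(*x)]
    return [list(row) for row in zip(*columns)]   # transpose back

def layer_norm(x, eps=1e-5):
    """Normalize each example (row) across its features."""
    return [normalize(row, eps) for row in x]

x = [[1.0, 2.0, 3.0],     # example 0
     [4.0, 6.0, 8.0]]     # example 1

print([[round(v, 3) for v in row] for row in batch_norm(x)])
print([[round(v, 3) for v in row] for row in layer_norm(x)])
```

After `batch_norm` every column averages to zero; after `layer_norm` every row does. The shared helper is identical, only the axis it is applied along changes.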

Same activations, two normalization axes. BatchNorm needs a meaningful batch (CNNs love it). LayerNorm needs nothing extra (Transformers use it for that reason).

Why this helps so much:

Activations stay bounded. No matter how the weights drift during training, the next normalization layer pulls activations back to a controlled scale. That kills the "exploding signals" failure mode of deep networks.
You can use higher learning rates. Without normalization, an aggressive step can overshoot and turn activations into NaN. With it, the next normalization step soaks up the damage and rescales.
Gradients flow better. Normalization changes the loss landscape into one that's easier to walk down — empirically observed, theoretically argued about. (The original justification — "reduces internal covariate shift" — has been partly disputed, but the empirical wins are uncontroversial.)
Mild regularizer. BatchNorm in particular adds batch-dependent noise to each example's activations (because its μ, σ depend on what else is in the batch). That noise is small but consistent regularization.

Practical gotchas. BatchNorm behaves differently at train time (uses the current batch's statistics) than at eval time (uses running averages of the train-time statistics). Forgetting to set the model to eval mode produces confusing results at inference: predictions depend on what else is in the batch, and with batch size 1 the batch statistics are degenerate. BatchNorm also struggles with very small batches (estimates are noisy) and is awkward for sequence models, where variable sequence lengths and padding make per-batch statistics unreliable. LayerNorm sidesteps all of this: it never depends on the batch, so train and eval are identical.
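
The train/eval split is easier to see in a toy implementation. The sketch below handles a single scalar feature and omits γ and β; the class name, the momentum of 0.1, and the use of the biased variance for the running average are simplifying assumptions, and real framework layers differ in details.

```python
import math

class BatchNormFeature:
    """Toy batch normalization for one scalar feature (gamma = 1, beta = 0)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running_mean = 0.0
        self.running_var = 1.0
        self.training = True

    def __call__(self, batch):
        if self.training:
            n = len(batch)
            mean = sum(batch) / n
            var = sum((v - mean) ** 2 for v in batch) / n
            # Exponential moving average of the train-time statistics.
            self.running_mean += self.momentum * (mean - self.running_mean)
            self.running_var += self.momentum * (var - self.running_var)
        else:
            mean, var = self.running_mean, self.running_var
        return [(v - mean) / math.sqrt(var + self.eps) for v in batch]

bn = BatchNormFeature()
for _ in range(200):
    bn([9.0, 10.0, 11.0])          # train mode: running stats settle near mean 10
bn.training = False
print(bn([12.0]))                  # eval mode: roughly [2.45], a meaningful value

forgot_eval = BatchNormFeature()   # still in train mode
print(forgot_eval([12.0]))         # batch of 1 in train mode: [0.0] for any input
```

The last line is the classic bug in miniature: a single example normalized against its own statistics carries no information at all.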

A worthwhile mention: RMSNorm. A small simplification of LayerNorm that drops the mean-subtraction and divides only by the root-mean-square of the input. Slightly faster, slightly fewer parameters (no β), and empirically just as effective on Transformers. LLaMA, Mistral, and most modern open-weight LLMs use RMSNorm instead of LayerNorm. The principle is identical; the implementation drops two operations.
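
For comparison with the shared recipe earlier, here is RMSNorm for a single feature vector. It is a sketch in plain Python; the parameter name `gain` for the learnable scale and the default ε are our own choices.

```python
import math

def rms_norm(x, gain=None, eps=1e-6):
    """Divide by the root-mean-square of x, then apply a learnable scale.
    No mean subtraction and no bias term."""
    if gain is None:
        gain = [1.0] * len(x)
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [g * v * inv_rms for g, v in zip(gain, x)]

x = [3.0, 4.0]
y = rms_norm(x)
print([round(v, 4) for v in y])
print(round(sum(v * v for v in y) / len(y), 4))   # mean square of the output: 1.0
```

The output has unit root-mean-square but keeps its original mean direction, since nothing was subtracted.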

In a Transformer: a normalization layer (LayerNorm or RMSNorm) is applied around every sublayer. Modern blocks use the "pre-LN" pattern: y = x + Sublayer(Norm(x)) — normalize first, then apply the sublayer (attention or FFN), then add to the residual. This pattern makes very deep networks (50+ layers) trainable from scratch without exotic schedules. The original 2017 Transformer used "post-LN" (y = Norm(x + Sublayer(x))); pre-LN took over within a couple of years because it's noticeably more stable. If you crack open the source of any modern LLM, every Transformer block opens with a norm layer.
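
The two residual patterns differ only in where the norm sits. In this sketch `toy_sublayer` is a hypothetical stand-in for attention or the FFN, and the LayerNorm omits γ and β; the point is the wiring, not the sublayer.

```python
import math

def layer_norm(x, eps=1e-5):
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_ln_block(x, sublayer):
    """y = x + Sublayer(Norm(x)): the residual path is left untouched."""
    return [xi + si for xi, si in zip(x, sublayer(layer_norm(x)))]

def post_ln_block(x, sublayer):
    """y = Norm(x + Sublayer(x)): the norm sits on the residual path."""
    return layer_norm([xi + si for xi, si in zip(x, sublayer(x))])

def toy_sublayer(h):
    # Hypothetical placeholder for attention or an FFN.
    return [0.5 * v for v in h]

x = [1.0, 2.0, 3.0]
print([round(v, 3) for v in pre_ln_block(x, toy_sublayer)])
print([round(v, 3) for v in post_ln_block(x, toy_sublayer)])
```

In the pre-LN version the input x reaches the output unchanged through the addition, so stacking many blocks leaves a clean identity path for gradients; in the post-LN version every block rescales that path.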