Backprop Primer

"Compute the gradient of the loss with respect to every parameter." With millions of parameters, this sounds impossible — and would be, if you tried to do it by hand. Backpropagation is the algorithm that does it for you in one pass backwards through the network, applying the chain rule from the calculus primer to every connection. Four topics: backprop itself, why you can't initialize all weights to zero, the vanishing / exploding gradient problem that haunts deep networks, and the MLP — the simple deep network that hides inside every Transformer block.

01

Backpropagation

The chain rule, applied to every parameter, in one pass.

The gradient descent primer ended on a satisfying note: every iteration is just w ← w − η · ∇L. The gradient descent primer also quietly assumed you can compute that gradient. With a single-parameter loss, sure — calculus primer §1. With a million parameters spread across fifty layers, this is no longer a plausible thing to do by hand. Backpropagation is the algorithm that makes it possible anyway.

Backprop has two properties that, together, are why deep learning works at all:

  • It's automatic. You write down the forward pass; the framework (PyTorch, JAX, TensorFlow) gives you every gradient for free. No symbolic algebra, no manual differentiation.
  • It's linear in the size of the network. One forward pass plus one backward pass costs a small constant multiple of a single forward pass (the backward pass typically costs about twice the forward pass, so roughly three forward passes in total). For a network with a million parameters, all million gradients pop out in one extra pass, not a million separate computations.

The trick is a relentless application of the chain rule from the calculus primer (§3). Walk the computation backwards: at each step, the gradient flowing back is the gradient at the next step times the local derivative. Each parameter's gradient is the product of every local derivative on the path from that parameter to the loss.

Here's a concrete example: a 1-input, 1-hidden-neuron, 1-output network. Inputs: x = 2, target y = 5. Weights: w₁ = 1.5, b₁ = 0.5, w₂ = 0.8, b₂ = 0.1. Loss is squared error.

Forward:
  z   =  w₁·x  + b₁  =  1.5·2 + 0.5  =  3.5
  a   =  ReLU(z)     =  ReLU(3.5)    =  3.5
  ŷ   =  w₂·a  + b₂  =  0.8·3.5 + 0.1 =  2.9
  L   =  (ŷ − y)²    =  (−2.1)²      =  4.41

Backward (one local derivative per step):
  ∂L/∂ŷ   =  2 · (ŷ − y)                  =  −4.20
  ∂L/∂w₂  =  ∂L/∂ŷ · a                    =  −4.20 · 3.5  =  −14.70
  ∂L/∂a   =  ∂L/∂ŷ · w₂                   =  −4.20 · 0.8  =  −3.36
  ∂L/∂z   =  ∂L/∂a  · ReLU'(z)            =  −3.36 · 1    =  −3.36
  ∂L/∂w₁  =  ∂L/∂z  · x                   =  −3.36 · 2    =  −6.72
[Figure: the example network annotated with forward values and backward gradients. Forward pass: x = 2 → a = 3.5 → ŷ = 2.9 → L = 4.41, with target y = 5.]
Forward pass once, backward pass once. Every parameter gets its gradient via the chain rule — no symbolic math, no manual differentiation.

Look at the rhythm. Each backward line is an earlier backward result multiplied by one local derivative. ∂L/∂w₁ = ∂L/∂z · ∂z/∂w₁: that's the chain rule from the calculus primer, no more, no less. Every parameter's gradient is just the product of local derivatives stacked along the path from the parameter to the loss.
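The worked example above can be replayed in a few lines. This is a minimal sketch, not a general implementation: the function names are made up for illustration, the bias gradients (∂L/∂b₁, ∂L/∂b₂) are not in the table above but follow from the same chain rule, and the finite-difference check is added only as a sanity test.

```python
def forward(w1, b1, w2, b2, x=2.0, y=5.0):
    """Forward pass of the 1-input, 1-hidden-neuron, 1-output example."""
    z = w1 * x + b1
    a = max(z, 0.0)              # ReLU
    y_hat = w2 * a + b2
    loss = (y_hat - y) ** 2
    return z, a, y_hat, loss

def backward(w1, b1, w2, b2, x=2.0, y=5.0):
    """Backward pass: each line reuses an earlier result times one local derivative."""
    z, a, y_hat, _ = forward(w1, b1, w2, b2, x, y)
    dL_dyhat = 2.0 * (y_hat - y)
    dL_dw2 = dL_dyhat * a
    dL_db2 = dL_dyhat * 1.0
    dL_da = dL_dyhat * w2
    dL_dz = dL_da * (1.0 if z > 0 else 0.0)   # ReLU'(z)
    dL_dw1 = dL_dz * x
    dL_db1 = dL_dz * 1.0
    return {"w1": dL_dw1, "b1": dL_db1, "w2": dL_dw2, "b2": dL_db2}

params = {"w1": 1.5, "b1": 0.5, "w2": 0.8, "b2": 0.1}
grads = backward(**params)

def numeric_grad(name, eps=1e-6):
    """Central finite difference: two extra forward passes per parameter."""
    up = dict(params)
    up[name] += eps
    down = dict(params)
    down[name] -= eps
    return (forward(**up)[3] - forward(**down)[3]) / (2 * eps)

for name in params:
    print(name, round(grads[name], 4), round(numeric_grad(name), 4))
```

The finite-difference column needs two forward passes per parameter, which is exactly the cost backprop avoids: the backward function produces all four gradients in one sweep.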

The "back" in backprop matters too: gradients are computed output-to-input, not input-to-output. Why? Because once you know ∂L/∂ŷ, you can reuse it to compute every gradient at the previous layer (here: w₂, b₂, a). Then once you know ∂L/∂a, you can reuse it for everything at the layer before that. Each layer's "inherited gradient" gets multiplied by that layer's local derivatives and passed on. The total work is one walk backwards through the graph.

A tiny aside about how backprop is implemented. During the forward pass, the framework records the computation graph — every operation, plus the values it produced. During the backward pass, it walks the graph in reverse, multiplying by local derivatives whose formulas are known for every built-in op (matrix multiply, ReLU, sigmoid, softmax, layer norm, attention — every one of them has a hand-written gradient). The user never touches this. They write the forward pass; loss.backward() handles the rest.
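To make "record the graph, then walk it in reverse" concrete, here is a toy scalar autodiff sketch. The `Value` class and its methods are hypothetical names invented for this illustration; real frameworks work on tensors and are far more elaborate, so read this as the idea only, not as how PyTorch or JAX is actually written.

```python
class Value:
    """A scalar that remembers which operation produced it."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # inputs of the op that produced this value
        self._local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def relu(self):
        return Value(max(self.data, 0.0), (self,), (1.0 if self.data > 0 else 0.0,))

    def backward(self):
        # Order the graph so every node is handled after all nodes that consume it.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += node.grad * local   # chain rule, accumulated per path

# The same example network as above, written only as a forward pass.
x, y = Value(2.0), Value(5.0)
w1, b1, w2, b2 = Value(1.5), Value(0.5), Value(0.8), Value(0.1)
a = (w1 * x + b1).relu()
y_hat = w2 * a + b2
diff = y_hat + y * -1.0
loss = diff * diff
loss.backward()
print(w1.grad, w2.grad)
```

Note that the user-facing code at the bottom contains no derivatives at all. Each operation stored its own local derivative during the forward pass, and `backward()` only multiplies and accumulates them.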

In a Transformer: the same algorithm, just on a vastly bigger graph. A 70-billion-parameter model still gets all 70 billion gradients from one backward pass over the same computation graph that produced its output. Every weight in every attention head, every FFN, every layer norm — backprop computes them all, all at once, in time proportional to one extra forward pass. Without backprop, training a Transformer would not be a slow problem; it would be an impossible one.

02

Weight Initialization

Why every weight starts at a small random number — not zero.

Backprop tells the optimizer how to improve the weights. But before the first update, the weights need starting values. "Set them all to zero" is the natural first guess — and it kills training before it begins. Here is why, and what to do instead.

Imagine a layer with 100 neurons, all initialized to the same zero weights. Every neuron in that layer sees the same input and applies the same recipe to it, so every neuron computes the same output. They are, functionally, the same neuron — just copied 100 times. A "wide layer" with 100 copies of one neuron is no wider than a layer with one neuron.

It gets worse during the backward pass:

  • Since every neuron in the layer computes the same output, and under an identical init also feeds the next layer through identical outgoing weights, every neuron receives the same gradient from the layer above.
  • Same input, same gradient → same weight update.
  • After the step, all 100 neurons still have identical weights. The layer is permanently one neuron pretending to be a hundred. Training cannot break the symmetry; backprop only preserves it.

This is the symmetry problem. Any initialization that makes neurons within a layer identical traps them in lockstep forever — no matter how clever your loss, how careful your optimizer, how many epochs you wait.

The fix: break the symmetry. Initialize each weight to a small independent random number (Gaussian or uniform — either works). Now every neuron in the layer sees inputs differently; they compute different outputs, receive different gradients, and develop into different roles over training. A wide layer is actually wide.
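The symmetry argument is easy to check numerically. The sketch below trains a tiny 2 → 3 → 1 ReLU network from three starting points. Everything specific here is an assumption chosen for illustration: the bias-free architecture, the target y = 3.0, the learning rate, and the extra "constant 0.5" init, which is included because it shows the lockstep even when the gradients are not zero.

```python
import random

def train(w_hidden, w_out, x=(1.0, 2.0), y=3.0, lr=0.01, steps=20):
    """Plain gradient descent on a 2 -> 3 -> 1 ReLU net with squared-error loss."""
    w_hidden = [row[:] for row in w_hidden]   # one row of weights per hidden neuron
    w_out = w_out[:]
    for _ in range(steps):
        z = [sum(w * xi for w, xi in zip(row, x)) for row in w_hidden]
        h = [max(v, 0.0) for v in z]
        y_hat = sum(w * hi for w, hi in zip(w_out, h))
        d_yhat = 2.0 * (y_hat - y)
        d_h = [d_yhat * w for w in w_out]     # uses the pre-update output weights
        d_z = [dh * (1.0 if v > 0 else 0.0) for dh, v in zip(d_h, z)]
        w_out = [w - lr * d_yhat * hi for w, hi in zip(w_out, h)]
        w_hidden = [[w - lr * dz * xi for w, xi in zip(row, x)]
                    for row, dz in zip(w_hidden, d_z)]
    return w_hidden, w_out

zeros_hidden, zeros_out = train([[0.0, 0.0] for _ in range(3)], [0.0] * 3)
const_hidden, const_out = train([[0.5, 0.5] for _ in range(3)], [0.5] * 3)

rng = random.Random(0)
rand_hidden, rand_out = train(
    [[rng.gauss(0.0, 0.5) for _ in range(2)] for _ in range(3)],
    [rng.gauss(0.0, 0.5) for _ in range(3)],
)

for name, hidden in [("zero", zeros_hidden), ("constant", const_hidden), ("random", rand_hidden)]:
    print(name, [[round(w, 3) for w in row] for row in hidden])
```

In this particular bias-free setup the all-zero weights never move at all, the constant-init neurons move but stay identical to each other, and only the randomly initialized neurons end up with distinct weights.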

[Figure: two 2→3 networks fed the same inputs x = (1, 2). Left, all-zero init: every weight = 0. Right, random init: distinct random weights per neuron.]
All-zero init collapses every neuron into the same neuron. Random init gives each neuron a different role to specialize into.

Random init breaks symmetry, but you can't pick the random scale carelessly. Too small (variance near zero) and signals dampen layer by layer — by layer 10 every activation is essentially zero, and the network can't learn. Too big (variance near one) and signals explode layer by layer — activations saturate, the loss is NaN. There's a narrow band of init scales where signals propagate cleanly through depth, and the standard recipes are designed to land in that band:

  • Xavier (Glorot) init — scale weights so the variance of activations stays roughly constant across layers, assuming sigmoid/tanh activations. Variance ≈ 1 / n_in.
  • He init — same idea, but tuned for ReLU. Since ReLU zeros out half the inputs, you need twice the variance to compensate. Variance ≈ 2 / n_in.
  • Biases — usually start at zero. They don't suffer from the symmetry problem (only weights tie neurons together), and gradients quickly push them to useful values.

The "small" matters. If you init weights at variance 1 on a deep ReLU network, your first forward pass produces enormous activations at the output. If you init at variance 0.001, every activation past layer 5 is dust. Variance ≈ 1/n_in is the goldilocks zone, give or take a factor of 2.
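A rough way to see the three regimes is to push a random vector through a stack of ReLU layers and watch the size of the activations at the end. This is a sketch under assumptions: width 128, depth 10, no biases, a fixed seed, and a helper name (`final_mean_square`) made up for the example. The exact numbers depend on all of those choices; only the orders of magnitude matter.

```python
import math
import random

def final_mean_square(weight_variance, width=128, depth=10, seed=0):
    """Push a random vector through `depth` ReLU layers; return the mean squared activation."""
    rng = random.Random(seed)
    std = math.sqrt(weight_variance)
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        # Each output neuron draws its own fresh row of Gaussian weights.
        h = [max(sum(rng.gauss(0.0, std) * v for v in h), 0.0) for _ in range(width)]
    return sum(v * v for v in h) / width

n_in = 128
results = {
    "variance 0.001": final_mean_square(0.001),
    "He, 2 / n_in": final_mean_square(2.0 / n_in),
    "variance 1": final_mean_square(1.0),
}
for label, value in results.items():
    print(f"{label:>15}: {value:.3e}")
```

With these settings the too-small init should print something vanishingly small, the too-large init something astronomically large, and the He-style init something of order one.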

In a Transformer: linear layers (QKV projections, attention output, FFN layers) are typically initialized with a Gaussian of variance ≈ 1 / d_model. Some recipes scale the output projection's init down further (e.g., by 1 / √(2N) where N is the number of layers), to keep the residual-stream variance from inflating as more blocks are stacked. Layer norms start as identity (gain = 1, bias = 0). The exact constants vary; the principle doesn't — break symmetry, keep variance bounded.

03

Vanishing & Exploding Gradients

The product of fifty local derivatives is rarely close to 1.

Backprop computes ∂L/∂w at an early layer as a product of local derivatives along the path from that weight to the loss. In a 50-layer network, that product has ~50 terms. If every term is a little under 1, the product is essentially zero. If every term is a little over 1, the product is essentially infinity. Either way, training breaks.

Two failure modes, both caused by the same multiplication:

  • Vanishing gradients — most local derivatives are small. Their product shrinks exponentially with depth. The early layers get gradients near zero, so their weights barely update; the network behaves as if those layers are frozen even though they're technically trainable.
  • Exploding gradients — most local derivatives are big. Their product grows exponentially with depth. The early layers get enormous gradients, the optimizer takes a huge step in some random direction, weights diverge, and the loss becomes NaN within a few iterations.

Concretely: 0.5 raised to the 50th power is ≈ 8.9 × 10⁻¹⁶. 1.5 raised to the 50th power is ≈ 6.4 × 10⁸. A modest per-layer factor compounds into a catastrophe over depth.
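The arithmetic is quick to reproduce. The snippet assumes the simplest possible case, where every one of the 50 local derivatives has the same value; real networks have a different factor at every layer.

```python
depth = 50

shrink = 0.5 ** depth   # every local derivative is 0.5
grow = 1.5 ** depth     # every local derivative is 1.5

print(f"0.5^{depth} = {shrink:.2e}")
print(f"1.5^{depth} = {grow:.2e}")

# Running product, layer by layer, to see how fast the gap opens up.
product = 1.0
for layer in range(1, depth + 1):
    product *= 0.5
    if layer % 10 == 0:
        print(f"after {layer} layers at 0.5: {product:.2e}")
```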

[Figure: |∂L/∂w| at layer k, the product of k local derivatives, from k = 0 (output) to k = 9 (deepest), for three per-layer factors: vanishing (0.5), stable (1.0), exploding (1.5). In the vanishing case each layer's gradient is half the next one's, so by layer 10 it is ≈ 0.002 and the deepest weights barely move during training.]
In deep networks, the gradient at an early layer is the product of many local derivatives. Stay above 1 and it explodes; stay below 1 and it vanishes; only a narrow band trains.

The culprits, in roughly the order they show up in practice:

  • Saturating activations. Sigmoid's derivative never exceeds 0.25. A 20-layer sigmoid network multiplies 20 such terms, so the activation part of the product alone is at most 0.25²⁰ ≈ 10⁻¹², before the weights are even counted. Tanh is a little better (its derivative peaks at 1.0) but still saturates to derivative 0 in its tails. ReLU helps a lot (derivative is exactly 0 or 1, no shrinking in the "alive" half), which is one of several reasons it took over from sigmoid/tanh.
  • Bad initialization. §2 already touched on this — if you init weights at variance 4, your local derivatives are also blown up by a similar factor, and the product explodes; init at variance 0.01 and it vanishes. Xavier / He init are precisely the recipes designed to land the per-layer multiplier near 1.
  • Just being deep. Even with reasonable activations and init, stacking 100+ layers naively will drift. The product of 100 numbers near 1 still wanders away from 1.

The remedies that make deep networks trainable in 2025:

  • Better activations — ReLU, GELU, SiLU. Non-saturating in the positive half.
  • Proper initialization — Xavier for tanh, He for ReLU, 1/d_model for Transformers.
  • Normalization layers — BatchNorm, LayerNorm, RMSNorm. They re-centre and re-scale activations after every layer, so the product can't drift unboundedly.
  • Residual connections: x + sublayer(x). The local derivative of each block is 1 plus the sublayer's derivative, so the gradient from the next layer always has a direct path back through the "plus x" and is not wiped out just because the sublayer's own derivative is tiny. This is the deep-learning trick that unlocked very deep networks.
  • Gradient clipping — a cheap hack against explosion. If the gradient's norm exceeds some threshold (e.g., 1.0), rescale it down. Saves you from a NaN, occasionally at the cost of biased updates.
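Gradient clipping, the last remedy in the list, fits in a few lines. This is a minimal sketch on a flat list of numbers; the function name is made up here, and in PyTorch the analogous utility is `torch.nn.utils.clip_grad_norm_`, which operates on parameter tensors instead.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale gradients so their combined L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads), norm          # small enough: leave untouched
    scale = max_norm / norm
    return [g * scale for g in grads], norm

clipped, original_norm = clip_by_global_norm([3.0, -4.0], max_norm=1.0)
untouched, small_norm = clip_by_global_norm([0.3, -0.4], max_norm=1.0)
print(original_norm, clipped)
print(small_norm, untouched)
```

Clipping changes the length of the gradient but not its direction, which is why it rescues a run from one bad step without redirecting the update.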

In a Transformer: all five remedies, applied at once. GELU activations, careful init scaled by depth, LayerNorm before each sublayer, residual connections wrapping every sublayer, and gradient clipping at typically ‖g‖ ≤ 1. Together they make 100-layer Transformers trainable; without any of them, even the original 12-layer GPT-1 would have been a struggle. Residual connections especially are not an afterthought — they're what makes the depth possible at all.
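Why residual connections matter so much shows up even in a toy scalar chain. The sketch assumes, purely for illustration, that every sublayer has the same small local derivative (0.02) and compares the 50-layer gradient product with and without the "plus x" path. Real sublayers have varying derivatives, and normalization and init keep the residual case from growing too fast.

```python
def chain_gradient(sublayer_derivative, depth, residual):
    """Product of per-layer local derivatives along a scalar chain of `depth` layers."""
    grad = 1.0
    for _ in range(depth):
        # Without a skip: d/dx sublayer(x). With a skip: d/dx (x + sublayer(x)).
        local = sublayer_derivative + (1.0 if residual else 0.0)
        grad *= local
    return grad

plain = chain_gradient(0.02, depth=50, residual=False)
with_skip = chain_gradient(0.02, depth=50, residual=True)
print(f"no residual:   {plain:.2e}")
print(f"with residual: {with_skip:.2e}")
```

Without the skip path the gradient is numerically gone; with it, the product stays within a small factor of 1.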

04

The MLP (= the Transformer's FFN)

Stack the neuron primer's layer a few times and you have a deep network.

Take the neural-net primer's "layer of neurons," stack two or three of them with activations between, and you have an MLP, a Multi-Layer Perceptron. It is the simplest deep network there is, a staple of neural-network research long before Transformers, and (still) the FFN sublayer of every Transformer block. If you understood the neural-net primer's forward pass, you understand the MLP. It's the same operation, three times.

The recipe in one line: Linear → activation, repeated, with the last layer plain Linear (no activation, because you usually want raw logits / scalar outputs).

MLP(x):
  h₁  =  ReLU( W₁ · x  + b₁ )      ← layer 1
  h₂  =  ReLU( W₂ · h₁ + b₂ )      ← layer 2
   ŷ  =          W₃ · h₂ + b₃      ← layer 3 (no activation)
  return ŷ

That's the entire definition. A "deep" MLP just means more ReLU(W·_+b) lines in the middle. The depth (number of layers) and the widths (size of each h_i) are hyperparameters; everything else is one operation, repeated.
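The pseudocode above translates almost line for line into runnable Python. This is a dependency-free sketch: the helper names are invented for the example, the weights are random (He-style, as in §2), and the layer sizes 3 → 4 → 4 → 2 are just one choice of widths.

```python
import random

def linear(weights, bias, x):
    """Compute W · x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(v):
    return [max(t, 0.0) for t in v]

def init_layer(n_in, n_out, rng):
    """Gaussian weights with variance 2 / n_in, zero biases."""
    std = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def mlp(layers, x):
    """Linear -> ReLU for every layer except the last, which stays plain Linear."""
    h = x
    for weights, bias in layers[:-1]:
        h = relu(linear(weights, bias, h))
    weights, bias = layers[-1]
    return linear(weights, bias, h)

rng = random.Random(0)
sizes = [3, 4, 4, 2]
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
y_hat = mlp(layers, [1.0, -1.0, 0.5])
print(y_hat)
```

Making the network deeper or wider only means changing the `sizes` list; the `mlp` function itself does not change.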

[Figure: an MLP with Linear → ReLU twice, then Linear. Layers: input (3 values: 1.0, -1.0, 0.5), hidden 1, hidden 2, output (ŷ₁, ŷ₂).]
An MLP is just a stack of (linear + activation) layers — a 3 → 4 → 4 → 2 net here. The Transformer's FFN sublayer is exactly this shape, dropped into every block.

Two facts about MLPs that surprise people:

  • One hidden layer is, in principle, enough. The universal approximation theorem says: a single hidden layer with enough neurons can approximate any continuous function to arbitrary precision. In principle. In practice, "enough neurons" means an absurd number, and the result is unbearable to train. So we stack — depth costs less than width.
  • An MLP without activations is a single linear layer. Stack five linear layers with no non-linearity in between and the whole stack collapses to (W₅ · W₄ · W₃ · W₂ · W₁) · x, a single linear map. The depth gives you nothing. Non-linearity is what makes "deep" different from "wide."
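The collapse of stacked linear layers can be checked directly. The sketch uses three small hand-picked weight matrices (arbitrary values, no biases) and compares applying them one after another with applying their product once, then repeats the layered version with ReLU in between.

```python
def matmul(a, b):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def relu(v):
    return [max(t, 0.0) for t in v]

W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # 2 -> 3
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]]     # 3 -> 2
W3 = [[2.0, -1.0]]                            # 2 -> 1
x = [1.0, 2.0]

layer_by_layer = matvec(W3, matvec(W2, matvec(W1, x)))
collapsed = matmul(W3, matmul(W2, W1))        # a single 1 x 2 matrix
one_shot = matvec(collapsed, x)
with_relu = matvec(W3, relu(matvec(W2, relu(matvec(W1, x)))))

print(layer_by_layer, one_shot, with_relu)
```

The first two results agree, so the three-layer linear stack is just one matrix in disguise. Inserting ReLU changes the answer, which is the sense in which the non-linearity is what depth actually buys.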

Why "feedforward"? Because information flows in one direction — input on the left, output on the right, no loops. (Networks with loops are recurrent — RNNs and their friends.) The forward pass is one walk left-to-right; the backward pass is one walk right-to-left. Backprop loves this structure: no cycles means no fixed-point iteration, no ambiguity about gradient order.

The "multi-layer perceptron" name is partly historical baggage. A "perceptron" was Rosenblatt's 1958 single-layer linear classifier. The "multi-layer" prefix added depth and non-linearity, and that's the network we've been describing all primer long. "MLP" and "FFN" (Feed-Forward Network) refer to the same architecture — you'll see both names depending on which paper you're reading.

In a Transformer: the FFN sublayer is an MLP, full stop. The standard shape is two layers: an "up-projection" from d_model to 4 · d_model, a GELU (or SiLU) activation, and a "down-projection" back to d_model. Every Transformer block contains exactly one of these, immediately after the attention sublayer. In a model the size of GPT-3, the FFN sublayers hold roughly two-thirds of all parameters — attention does the routing, the MLP does the bulk of the actual computation. That, fundamentally, is what a 175B-parameter language model spends its weights on: 96 copies of "Linear → GELU → Linear," glued together by attention, plus a few hundred million parameters of embedding and normalization at the edges. You now know what every single one of those weights does. Time to read about the glue.
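The "two-thirds" figure can be sanity-checked with a back-of-the-envelope count. The sketch below counts only the large weight matrices and ignores biases, layer norms and embeddings. The configuration d_model = 12288 with 96 layers is an assumption taken from the published GPT-3 setup, not from this primer, and the function name is made up for the example.

```python
def block_params(d_model, ffn_multiplier=4):
    """Weight-matrix parameters in one Transformer block (biases and norms ignored)."""
    attention = 4 * d_model * d_model                 # Q, K, V and output projections
    ffn = 2 * d_model * (ffn_multiplier * d_model)    # up-projection + down-projection
    return attention, ffn

d_model, n_layers = 12288, 96    # assumed GPT-3-scale configuration
attention, ffn = block_params(d_model)
attn_total = n_layers * attention
ffn_total = n_layers * ffn
ffn_share = ffn_total / (ffn_total + attn_total)

print(f"attention: {attn_total / 1e9:.1f}B parameters")
print(f"FFN:       {ffn_total / 1e9:.1f}B parameters")
print(f"FFN share of block weights: {ffn_share:.3f}")
```

With a 4x up-projection the FFN holds 8 · d_model² weights per block against 4 · d_model² for attention, so the two-to-one split holds at any model size under these assumptions.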