Gradient Descent Primer

The optimizer that trains every modern model. Five short topics covering the one-line update rule, the single most-tuned hyperparameter (the learning rate), the variant nobody is allowed to skip (SGD with mini-batches), the vocabulary that confuses every newcomer (epoch vs iteration vs step), and the training loop those four ideas compose into.

01

The Update Rule

One line of math, repeated billions of times.

Imagine a marble sitting on a hillside. Gravity pulls it downhill, perpendicular to the contour lines, in the direction of steepest descent. Gradient descent is that marble. The marble is a vector of parameters; the hillside is the loss function's surface; "downhill" is the direction opposite the gradient. Take a small step, recompute where you are, step again. That's the entire algorithm, and every modern model you've heard of was trained by repeating this loop.

One line of math captures it:

θ ← θ − η · ∇L(θ)

Four symbols, each carrying real meaning:

  • θ — the parameter vector. For a Transformer that's billions of weights; for the supervised primer's housing line it's just (w, b).
  • L(θ) — the loss function. One scalar per choice of θ; smaller is better. We don't optimize the loss directly; we walk against its gradient.
  • ∇L(θ) — the gradient (calculus primer §4): a vector of the same shape as θ, pointing in the direction of steepest increase. The minus sign flips that into steepest decrease.
  • η (eta) — the learning rate: how big a step we take. The subject of §2.

The minus sign is doing most of the work. The gradient by itself points uphill; we want downhill, so we step against it. If we walked WITH the gradient (same formula but a plus sign), the marble would roll up the hill, the loss would explode, and training would never converge. Forgetting the minus is one of the most common bugs when hand-implementing an optimizer; frameworks hide it inside the optimizer, but it sits there in every step.
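
As a minimal sketch of the rule and of the sign bug, the snippet below runs four updates on the toy loss L(θ) = (θ − 3)² + 0.5 from θ = −1 with η = 0.3. The helper names (`loss`, `grad`, `run`) and the `sign` argument are illustrative assumptions, not part of any framework.

```python
def loss(theta):
    """Toy 1-D loss bowl with its minimum at theta = 3."""
    return (theta - 3.0) ** 2 + 0.5


def grad(theta):
    """Derivative of the loss above: dL/dtheta = 2 * (theta - 3)."""
    return 2.0 * (theta - 3.0)


def run(theta, eta, steps, sign=-1.0):
    """Apply theta <- theta + sign * eta * grad(theta); return every theta visited."""
    path = [theta]
    for _ in range(steps):
        theta = theta + sign * eta * grad(theta)
        path.append(theta)
    return path


descent = run(theta=-1.0, eta=0.3, steps=4)             # correct: minus sign
ascent = run(theta=-1.0, eta=0.3, steps=4, sign=+1.0)   # the bug: plus sign

for name, path in (("descent", descent), ("ascent", ascent)):
    print(name, [round(loss(t), 3) for t in path])
```

With the minus sign the loss falls on every step and θ closes in on 3; with the plus sign the same four updates push θ further from 3 each time and the loss grows.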

[Figure: the loss bowl L(θ) = (θ − 3)² + 0.5 with η = 0.3. The marker starts at θ = −1, where L = 16.5, far from the minimum at θ = 3. The minus sign in θ ← θ − η · ∇L is what flips the uphill gradient into a downhill step.]

The marble metaphor is exact except for one thing: in real models the hillside lives in a million-dimensional space we can't draw. The 1-D picture above — a parabolic loss bowl with a marker walking down it — is the right intuition; nothing about the shape of the algorithm changes when you scale up. Every dimension just gets its own component of the gradient, and the marble steps along all of them at once.

A subtlety worth flagging early: the gradient's direction is well-defined, but its magnitude depends on how steep the loss surface is. A steep slope produces a long gradient; a flat surface produces a short one. Combined with η, this means the step size η · ‖∇L‖ is large when training has a lot of work to do and small when training is nearly converged — automatic adaptation, built into the rule.
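
A small sketch of that shrinking step, on a hypothetical two-parameter bowl chosen only for illustration: L(θ) = (θ₀ − 3)² + 2(θ₁ + 1)², with η = 0.1 held constant. The recorded quantity is the step length η · ‖∇L‖.

```python
import math


def grad(theta):
    """Gradient of the assumed bowl (t0 - 3)^2 + 2 * (t1 + 1)^2."""
    t0, t1 = theta
    return [2.0 * (t0 - 3.0), 4.0 * (t1 + 1.0)]


eta = 0.1
theta = [-1.0, 2.0]
step_lengths = []

for _ in range(50):
    g = grad(theta)
    step_lengths.append(eta * math.hypot(g[0], g[1]))   # eta * ||grad||
    # Every coordinate moves at once, each by its own gradient component.
    theta = [t - eta * gi for t, gi in zip(theta, g)]

print("first step length:", round(step_lengths[0], 4))
print("last step length: ", step_lengths[-1])
print("final theta:      ", [round(t, 4) for t in theta])
```

η never changes, yet the step length falls on every update because the gradient itself shrinks as θ approaches the minimum at (3, −1).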

Gradient descent isn't guaranteed to find the global minimum — it only finds a minimum (or sometimes a saddle point, or a plateau). For a convex loss like the housing MSE from the supervised primer's §3, there's only one minimum and gradient descent finds it. For a deep neural network the loss surface has zillions of local minima, but a remarkable empirical fact about modern deep learning is that most of those local minima are nearly as good as the global one. So in practice, we run gradient descent, accept whatever minimum we land in, and the resulting model usually works.
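
To make "a minimum, not the minimum" concrete, here is a sketch on an assumed non-convex 1-D function, f(x) = x⁴ − 3x² + x, which has two basins. The function, the step size, and the two starting points are illustrative choices.

```python
def f(x):
    """Assumed non-convex loss with two local minima."""
    return x**4 - 3.0 * x**2 + x


def df(x):
    """Derivative of f."""
    return 4.0 * x**3 - 6.0 * x + 1.0


def descend(x, eta=0.01, steps=2000):
    """Plain gradient descent from a given starting point."""
    for _ in range(steps):
        x -= eta * df(x)
    return x


left = descend(-2.0)    # starts in the left basin
right = descend(2.0)    # starts in the right basin

print(f"from -2.0: x = {left:.4f}, f = {f(left):.4f}")
print(f"from +2.0: x = {right:.4f}, f = {f(right):.4f}")
```

Same rule, same η, different starting points: the two runs settle in different minima, and only the left one is the global minimum of this function.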

In a Transformer: the update rule runs every training step. The forward pass computes the loss; backpropagation (calculus primer §3) computes ∇L for every one of the billions of parameters simultaneously; the optimizer applies the update. The whole rest of this primer is about making that update — and the loop wrapped around it — actually work at scale.

02

Learning Rate

The most consequential knob in machine learning.

The η in θ ← θ − η · ∇L is the learning rate. It controls one thing: how big a step the optimizer takes per update. That sounds boring; in practice it's the single most-tuned hyperparameter in all of ML. A bad η can ruin a perfectly correct model.

Three regimes:

  • Too small. The marble inches toward the minimum. Training takes forever; you burn GPU hours without progress. Symptom: training loss decreases, but slowly and steadily, and the curve never seems to bend toward a plateau.
  • Just right. The marble rolls briskly to the bottom and settles. Training loss drops fast at first, then bends into a plateau as gradients shrink. This is what every paper's training curve looks like when it's working.
  • Too big. The marble overshoots the minimum, lands on the opposite side of the bowl, and bounces back — possibly even further up than where it started. Loss oscillates, or worse, blows up to NaN and training crashes.
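
The three regimes can be reproduced with a short sketch on the loss bowl L(θ) = (θ − 3)² + 0.5, starting at θ = −1 and taking 8 steps. The specific values η = 0.02, 0.3, and 1.05 are assumptions picked to land in each regime for this particular bowl.

```python
def loss(theta):
    return (theta - 3.0) ** 2 + 0.5


def grad(theta):
    return 2.0 * (theta - 3.0)


def loss_curve(eta, theta=-1.0, steps=8):
    """Return the loss before training and after each of `steps` updates."""
    losses = [loss(theta)]
    for _ in range(steps):
        theta -= eta * grad(theta)
        losses.append(loss(theta))
    return losses


# too small, about right, too big (for this bowl only)
curves = {eta: loss_curve(eta) for eta in (0.02, 0.3, 1.05)}

for eta, losses in curves.items():
    print(f"eta={eta:<5} first={losses[0]:.2f} last={losses[-1]:.4f}")
```

On this bowl each update multiplies the distance to the minimum by (1 − 2η), so any η above 1 makes that factor exceed 1 in magnitude and the loss grows on every step.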

[Figure: the same loss bowl L(θ) = (θ − 3)² + 0.5, same starting point, 8 steps each; only η differs. First panel, η = 0.02 (too small): after 8 steps the marker has barely moved away from θ = −1; reaching θ = 3 would take hundreds of steps.]

Picking η is mostly empirical. There's no closed-form rule because the right value depends on the model architecture, batch size, parameter initialization, and shape of the loss surface — all of which change between projects. The workflow:

  • Try a few orders of magnitude. If η = 1e-3 works, try 3e-4 and 3e-3. Don't fiddle with the 4th decimal place before you've looked across factors of 10.
  • Plot training loss. If it diverges, lower η. If it's a stately downhill saunter, raise it. The fastest "first-pass" loss curve is the one that comes closest to diverging without quite doing so.
  • LR finder. A trick popularized by fast.ai: run a single mini-epoch ramping η from very small to very large; pick the η just below the point where loss explodes. Cheap, surprisingly effective.

In practice η doesn't stay constant — almost every modern paper uses a learning-rate schedule: η is a function of the training step. Two pieces that show up over and over:

  • Warmup. Start with η = 0, ramp linearly to the target value over the first few hundred to few thousand steps. Why? When parameters are fresh out of random initialization, the loss surface is treacherous and large early steps can knock training into a bad region. Warmup buys time for the model to find a smooth basin before the optimizer starts taking real steps.
  • Decay. After warmup, gradually lower η over the rest of training. Common shapes: cosine (smooth ramp down following a half cosine), linear (straight ramp), step (drop by 10× at fixed milestones). All of them serve the same purpose: take big steps early when there's obvious work to do; take small steps later for fine-grained convergence near the minimum.

A typical Transformer learning-rate schedule combines both: warmup over the first 2000 steps from η = 0 to 3e-4, then cosine decay down to 3e-5 over the rest of training. Every paper reports the schedule because the schedule meaningfully changes the result.
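
A schedule like this is just a function from step to η. The sketch below uses the numbers from the paragraph above (2000 warmup steps, peak 3e-4, floor 3e-5); the total of 100,000 steps and the name `lr_at` are assumptions made for illustration.

```python
import math


def lr_at(step, total_steps=100_000, warmup_steps=2_000, peak=3e-4, floor=3e-5):
    """Linear warmup from 0 to `peak`, then half-cosine decay down to `floor`."""
    if step < warmup_steps:
        return peak * step / warmup_steps
    # Fraction of the decay phase completed, clamped so late steps stay at the floor.
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


for step in (0, 1_000, 2_000, 51_000, 100_000):
    print(f"step {step:>7}: eta = {lr_at(step):.6f}")
```

The optimizer would call something like `lr_at(step)` once per update and use the returned value as η for that step.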

Two related parameters that interact with η:

  • Batch size. Larger batches produce a more accurate gradient estimate (probability primer §4: variance shrinks like 1/n); with a more accurate gradient you can safely take a larger step. A common rule of thumb is "if you double the batch size, you can also (roughly) double η" — though this breaks down for very large batches.
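  • Momentum / Adam. Optimizers that accumulate a velocity (momentum) or apply per-parameter scaling (Adam, RMSProp) effectively rescale the step away from the bare η · ∇L. The "right" η for SGD is rarely the right η for Adam: Adam divides each coordinate's update by a running estimate of that coordinate's gradient magnitude, so its steps are roughly η in size regardless of the gradient's scale, and its η is typically much smaller (around 1e-4 to 1e-3) than a plain-SGD η (often 1e-2 to 1e-1).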
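
To see how per-parameter scaling changes what η means, the sketch below compares the size of a plain SGD step with the size of Adam's first step for one scalar parameter. It writes out the standard Adam formulas with the usual defaults (β₁ = 0.9, β₂ = 0.999, ε = 1e-8); the function names are illustrative.

```python
import math


def sgd_step(grad, lr):
    """Size of a plain SGD update: proportional to the gradient."""
    return lr * grad


def adam_first_step(grad, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of Adam's very first update for one scalar parameter."""
    m = (1.0 - beta1) * grad            # first-moment estimate after one step
    v = (1.0 - beta2) * grad * grad     # second-moment estimate after one step
    m_hat = m / (1.0 - beta1)           # bias correction (t = 1)
    v_hat = v / (1.0 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)


lr = 1e-3
for g in (100.0, 0.01):
    print(f"grad={g:>6}: sgd step={sgd_step(g, lr):.6g}, adam step={adam_first_step(g, lr):.6g}")
```

The SGD step spans four orders of magnitude across these two gradients, while Adam's first step is about η in both cases, which is one reason an η tuned for one optimizer rarely transfers to the other.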

In a Transformer: the GPT-3 paper reports peak η between 6e-5 and 6e-4, shrinking as the models get larger, and Chinchilla reports values in the same range. A short linear warmup (well under 1% of training steps) followed by cosine decay to 0.1× peak by the end of training is a common schedule in the Transformer literature, though not universal: PaLM, for example, uses a different optimizer and decay shape. When a training run mysteriously plateaus or explodes, one of the first things people change is the learning-rate schedule.

03

Stochastic Gradient Descent

Estimate the gradient from a small chunk of data — fast, noisy, and what every modern optimizer actually does.

Section 1 wrote the gradient as ∇L(θ), where L is the loss over the entire dataset. For a model trained on millions or billions of examples, that's a lie of convenience. You can't actually compute the gradient on the whole dataset every step — a single Common Crawl pass would cost weeks. Real training cheats: estimate the gradient from a small chunk of the data at each step. Cheap. Noisy. Universal.

Three variants based on chunk size:

  • Batch gradient descent. Compute ∇L over the entire training set, then take one step. The "honest" version. Slow per step, expensive per step, ridiculous for big datasets.
  • Stochastic gradient descent (SGD), strict. Compute ∇L on a single example, take one step, repeat. Fastest per step. The gradient estimate is wildly noisy — each example pulls the parameters in its own direction — but on average the noise cancels and you still walk downhill. Coined in 1951; the original "SGD."
  • Mini-batch SGD. Compute ∇L on a small group of examples, take one step, repeat. Group size is the batch size: typically 32, 64, 256, or in modern LLM training, 4M tokens per batch (which is 1000s of sequences). The best of both worlds — and what "SGD" in 2026 almost always means.
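
All three variants are the same loop with a different batch size. The sketch below assumes a toy dataset of 1000 targets drawn around 3 and a per-example loss (θ − y)², whose full-dataset minimum is the mean of the targets; the data, η = 0.05, and 200 steps are illustrative choices.

```python
import random


def make_data(n=1000, seed=0):
    """Hypothetical dataset: n targets scattered around 3."""
    rng = random.Random(seed)
    return [rng.gauss(3.0, 1.0) for _ in range(n)]


def minibatch_gd(data, batch_size, eta=0.05, steps=200, seed=1):
    """Gradient descent where each step averages the gradient over `batch_size` examples."""
    rng = random.Random(seed)
    theta = -1.0
    for _ in range(steps):
        batch = rng.sample(data, batch_size)
        # Per-example loss is (theta - y)^2, so its gradient is 2 * (theta - y).
        grad = sum(2.0 * (theta - y) for y in batch) / batch_size
        theta -= eta * grad
    return theta


data = make_data()
target = sum(data) / len(data)                    # minimizer of the full-dataset loss

full = minibatch_gd(data, batch_size=len(data))   # batch gradient descent
mini = minibatch_gd(data, batch_size=32)          # mini-batch SGD
single = minibatch_gd(data, batch_size=1)         # strict SGD

print(f"target={target:.4f} full={full:.4f} mini={mini:.4f} single={single:.4f}")
```

The full-batch run lands on the target essentially exactly; the smaller the batch, the more the final θ jitters around it.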

[Figure: the loss bowl L(θ) = (θ − 3)² + 0.5 under different batch sizes. First panel, batch = full dataset (full-batch GD): every gradient is exact, and the marker walks a smooth, monotone path into the basin. Same step count (14), same η across panels; only the noise in the gradient changes. Batch size controls that noise.]

Why does mini-batch win? Two reasons, both worth understanding.

Compute efficiency. A modern GPU computes the gradient for a batch of examples in roughly the same wall-clock time as for a single example. Hardware is SIMD-parallel; one example wastes most of the silicon. Doubling the batch from 1 to 2 almost doubles throughput. Doubling from 256 to 512 typically also improves throughput, but only by maybe 30% — eventually you saturate the GPU and bigger batches just take longer. The sweet spot depends on the hardware.

Gradient quality. The gradient from a single example is a noisy estimate of the true (full-dataset) gradient, like a random opinion. Averaging across n examples shrinks the noise variance by a factor of n (probability primer §4, again), which means the noise standard deviation shrinks by a factor of √n. So a batch of 256 has √256 = 16× less noise per coordinate than batch-1 SGD. With smoother gradients you can take bigger steps without overshooting, which often more than compensates for the extra compute.
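
The √n effect can be checked empirically. This sketch assumes a toy dataset and the per-example loss (θ − y)², draws many random batches at a fixed θ, and measures how much the batch gradient scatters; the dataset, the trial count, and the function names are illustrative.

```python
import random
import statistics

rng = random.Random(0)
targets = [rng.gauss(3.0, 1.0) for _ in range(10_000)]   # hypothetical dataset
theta = -1.0                                             # fixed point where we measure noise


def batch_gradient(batch_size):
    """Gradient of the mean of (theta - y)^2 over one random batch."""
    batch = rng.choices(targets, k=batch_size)           # sample with replacement
    return sum(2.0 * (theta - y) for y in batch) / batch_size


def gradient_std(batch_size, trials=2000):
    """Standard deviation of the batch gradient across many random batches."""
    return statistics.pstdev([batch_gradient(batch_size) for _ in range(trials)])


std_1 = gradient_std(1)
std_256 = gradient_std(256)
ratio = std_1 / std_256
print(f"std(batch=1)={std_1:.3f}  std(batch=256)={std_256:.3f}  ratio={ratio:.1f}")
```

The measured ratio should come out close to 16, the √256 predicted above, up to sampling error.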

The noise from mini-batch isn't pure overhead — it's sometimes useful. Sharp local minima are unstable to noise; gradient descent with noisy updates tends to skip past them and settle in flatter regions that often generalize better. There's a small literature suggesting some noise is good for training, which is one of the arguments against pure batch gradient descent even when it would be cheap.

Practical batch-size knobs that show up in every paper:

  • Shuffle each epoch. Without shuffling, the model sees the same mini-batches in the same order on every pass — and can memorize their position. Always shuffle the training set before each epoch.
  • Gradient accumulation. Want a logical batch of 4096 but the GPU only holds 256 in memory? Compute the gradient on 256 examples, accumulate it, repeat 16 times, then take a single update step. The optimizer sees a 4096-example gradient; the hardware never holds more than 256 at once.
  • Linear scaling rule. Doubling the batch size lets you (roughly) double the learning rate — the gradient gets more accurate, so you can step bigger. Holds up to a "critical batch size" that depends on the model and problem; beyond that, returns diminish.
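
Gradient accumulation is easy to verify on a toy problem. The sketch below assumes 4096 examples with per-example loss (θ − y)² and checks that 16 micro-batches of 256, each scaled by 1/16 and summed, reproduce the gradient of the full 4096-example batch. Names and data are illustrative.

```python
import random

rng = random.Random(0)
targets = [rng.gauss(3.0, 1.0) for _ in range(4096)]   # hypothetical logical batch
theta = -1.0


def mean_gradient(batch):
    """Gradient of the mean of (theta - y)^2 over the given examples."""
    return sum(2.0 * (theta - y) for y in batch) / len(batch)


micro_batch_size = 256
accum_steps = len(targets) // micro_batch_size         # 4096 / 256 = 16

accumulated = 0.0
for i in range(accum_steps):
    micro = targets[i * micro_batch_size:(i + 1) * micro_batch_size]
    # Scale each micro-batch gradient so the running sum is a mean, not a total.
    accumulated += mean_gradient(micro) / accum_steps

full = mean_gradient(targets)
print(f"accumulated={accumulated:.12f}  full-batch={full:.12f}")
```

Only after the last micro-batch would the optimizer take its single update step, using the accumulated value as the gradient.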

In a Transformer: modern LLMs use enormous batch sizes — measured in millions of tokens, not just hundreds of examples — but always via mini-batch + gradient accumulation, never as a single forward-and-backward pass over a million tokens. Training is fundamentally mini-batch from start to finish; the only thing that's changed in 70 years is the size of "mini." When you read "batch size 4M" in a paper, that's the gradient-accumulation total, summed across thousands of GPUs and dozens of accumulation steps.

04

Epoch · Iteration · Step

Three words for "how much training has happened" — they don't mean the same thing.

Mini-batch SGD turns training into a counting problem. The dataset gets sliced into chunks, each chunk produces one update, the updates accumulate. Three terms describe that accumulation, and they get conflated constantly — even in papers. Pinning them down is the entire job of this short section.

  • Step (sometimes iteration) — one application of the update rule. Forward pass, backward pass, optimizer.step(). The atom of training. When PyTorch prints "step 12345" or HuggingFace shows global_step=…, that's a single weight update.
  • Iteration — usually a synonym for step. Some frameworks (Keras, earlier TF) say "iteration"; others say "step"; some papers use both interchangeably. Treat them as the same unless the context says otherwise.
  • Epoch — one full pass through the training dataset. If the dataset has 50,000 examples and the batch size is 100, one epoch is 50,000 / 100 = 500 steps.

The bookkeeping in one formula:

steps per epoch = dataset size / batch size
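
A sketch of that bookkeeping, with one assumption beyond the formula: when the dataset size is not a multiple of the batch size, the last partial batch either counts as a step (round up) or is dropped (round down). The `drop_last` flag and both function names are hypothetical.

```python
import math
import random


def steps_per_epoch(dataset_size, batch_size, drop_last=False):
    """Number of optimizer steps in one pass over the dataset."""
    if drop_last:
        return dataset_size // batch_size
    return math.ceil(dataset_size / batch_size)


def count_training(dataset_size, batch_size, epochs, seed=0):
    """Walk the epoch/step counters the way a training loop would."""
    rng = random.Random(seed)
    indices = list(range(dataset_size))
    log = []
    global_step = 0
    for epoch in range(1, epochs + 1):
        rng.shuffle(indices)                  # reshuffle at every epoch boundary
        for start in range(0, dataset_size, batch_size):
            batch = indices[start:start + batch_size]
            global_step += 1                  # steps keep counting across epochs
            log.append((epoch, global_step, batch))
    return log


print(steps_per_epoch(50_000, 100))           # the example from the Epoch bullet
for epoch, step, batch in count_training(dataset_size=12, batch_size=3, epochs=2):
    print(f"epoch {epoch}  step {step}  batch {batch}")
```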

[Figure: a training set of 12 samples (indices 0 to 11, shuffled each epoch) split into 4 batches of 3. Epoch 1, step 1: the optimizer processes the first batch of 3. 12 samples ÷ 3 per batch = 4 steps per epoch. Steps keep counting across epochs; the dataset reshuffles at each epoch boundary.]

Why three words for what feels like the same idea? Because they answer subtly different questions:

  • How long has the optimizer been running? Count steps — that's the number of update applications, the thing the LR schedule is keyed to.
  • How much data has the model seen? Count epochs — that says "we've shown it the training set 7 times." Useful for thinking about data coverage and overfitting risk.
  • What compute did this cost? Multiply steps × batch size × per-example cost. Or epochs × dataset size × per-example cost. Same answer, different decomposition.

One more term that's common but easy to miss:

  • Total training tokens (or training examples): the headline number for modern LLMs. GPT-3 was trained on ~300 billion tokens; Chinchilla on ~1.4 trillion; modern frontier models on ~10-15 trillion. This is steps × batch_size_in_tokens. At scale it is the cleanest "how big was this training run" number because dataset and epoch boundaries become blurry: frontier LLMs often train for less than one epoch, on data that is deduplicated and curated rather than repeated.

Where these terms trip people up:

  • "How many epochs?" only makes sense for finite datasets you fully revisit. For an LLM on Common Crawl, you train for less than one epoch — the dataset is too big to repeat. Asking "how many epochs?" is the wrong unit; ask "how many tokens?"
  • "Train for 100 steps" is meaningless without batch size. A 100-step run on batch 1 is dramatically different from a 100-step run on batch 4096. Always report both.
  • Loss vs steps vs epochs plots. When a paper shows "loss vs steps," mentally compute what that means in data. A loss curve that drops smoothly over 100,000 steps might mean 200 epochs (small dataset, lots of revisiting) or 0.1 epochs (huge dataset, mostly new data each step) — wildly different stories about generalization.
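
The conversions behind these pitfalls are one-liners. In the sketch below the function names are illustrative, and the inputs are round numbers chosen to reproduce the 200-epoch and 0.1-epoch readings of a 100,000-step curve; they do not describe any particular model's configuration.

```python
def epochs_seen(steps, batch_size, dataset_size):
    """How many passes over the dataset a given number of steps amounts to."""
    return steps * batch_size / dataset_size


def total_tokens(steps, batch_size_in_tokens):
    """Headline 'training tokens' figure for a run."""
    return steps * batch_size_in_tokens


# Same 100,000 steps at batch size 100, two very different stories:
small = epochs_seen(steps=100_000, batch_size=100, dataset_size=50_000)
huge = epochs_seen(steps=100_000, batch_size=100, dataset_size=100_000_000)
print(f"small dataset: {small:g} epochs, huge dataset: {huge:g} epochs")

# Illustrative round numbers: 75,000 steps at 4M tokens per batch.
print(f"{total_tokens(75_000, 4_000_000):,} tokens")
```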

In a Transformer: training runs are typically logged in steps for the optimizer's internal bookkeeping (learning-rate schedule, checkpoint intervals) and in tokens for headline reporting ("trained on 2T tokens"). Epochs barely come up — frontier LLM training datasets are so big that running through them once already costs millions of dollars in GPU time. When you read "300B tokens" in a paper, that's the ground truth for compute; "epoch" and "step" are derived bookkeeping numbers around it.

05

The Training Loop

Compute loss, compute gradient, update parameters — repeat a million times.

Everything in this primer composes into one short loop with a five-line body, repeated until the loss is low enough or the budget runs out. PyTorch in pseudocode:

  for batch in dataloader:
      logits = model(batch.inputs)         # forward pass
      loss   = loss_fn(logits, batch.targets)   # one scalar
      loss.backward()                      # backprop fills .grad
      optimizer.step()                     # θ ← θ − η · ∇L
      optimizer.zero_grad()                # reset .grad to zero

Five lines of Python is genuinely the entire training algorithm for a Transformer. Every paper you'll read, every model card you'll see, comes from running roughly that loop until the budget is exhausted. Everything else — distributed training, mixed precision, ZeRO, FlashAttention, gradient checkpointing — is engineering infrastructure to make that loop run faster on bigger clusters.

Three line-by-line notes:

  • loss.backward() is calculus primer §3 — chain rule applied to the computation graph PyTorch built during the forward pass. It populates a .grad attribute on every parameter with that parameter's ∂L/∂θ.
  • optimizer.step() reads each .grad, applies the update rule from §1 (plus whatever extras the optimizer adds: momentum, Adam's per-coordinate scaling, weight decay), and writes back to the parameter tensor.
  • zero_grad() is the bookkeeping that catches everyone once. .grad accumulates across backward passes by default, a feature that enables gradient accumulation (§3). Forget to reset it and the next step uses the sum of two batches' gradients, not just the current one. Most code calls optimizer.zero_grad() once per step; since PyTorch 2.0 it defaults to set_to_none=True, which sets .grad to None instead of filling it with zeros.
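
To make the three notes concrete without depending on PyTorch, here is a dependency-free toy that mimics the same contract: a backward pass that adds into .grad, an optimizer with step() and zero_grad(), and the same loop order. `Param`, `SGD`, and `backward` are hypothetical stand-ins written for this sketch, not PyTorch's API, and the model is a two-parameter line fit to points on y = 2x + 1.

```python
class Param:
    """A scalar parameter with an accumulating .grad, like a framework tensor."""
    def __init__(self, value):
        self.value = value
        self.grad = 0.0


class SGD:
    """Bare update rule: value <- value - lr * grad."""
    def __init__(self, params, lr):
        self.params = params
        self.lr = lr

    def step(self):
        for p in self.params:
            p.value -= self.lr * p.grad

    def zero_grad(self):
        for p in self.params:
            p.grad = 0.0


def backward(w, b, xs, ys):
    """ADD the gradient of the batch's mean squared error into w.grad and b.grad."""
    n = len(xs)
    for x, y in zip(xs, ys):
        err = (w.value * x + b.value) - y
        w.grad += 2.0 * err * x / n
        b.grad += 2.0 * err / n


batches = [([0.0, 1.0], [1.0, 3.0]), ([2.0, 3.0], [5.0, 7.0])]   # points on y = 2x + 1

w, b = Param(0.0), Param(0.0)
optimizer = SGD([w, b], lr=0.05)
for _ in range(500):                  # epochs
    for xs, ys in batches:
        backward(w, b, xs, ys)        # fills .grad
        optimizer.step()              # applies the update rule
        optimizer.zero_grad()         # resets .grad for the next batch

# What happens without zero_grad(): two backward passes simply add up.
w_demo, b_demo = Param(0.0), Param(0.0)
backward(w_demo, b_demo, *batches[0])
first = w_demo.grad
backward(w_demo, b_demo, *batches[0])
doubled = w_demo.grad

print(f"w={w.value:.4f} b={b.value:.4f}  grad once={first} twice={doubled}")
```

The fitted values approach w = 2 and b = 1, and the second backward pass without a reset leaves exactly twice the gradient in .grad, which is the bug described in the last note.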

Watching that loop run is the most satisfying part of training. The loss curve dropping over millions of steps is the visible payoff of every primer's worth of math:

[Figure: loss versus step (0 to 1000), one forward + backward + update per step. Step 0: training begins, and the loss is whatever random initialization produces, usually very high. Every modern training run produces a curve in this shape: fast initial drop, long power-law tail.]

Two phases that show up in nearly every loss curve:

  • Fast initial drop. The first few hundred steps cut the loss sharply as the model leaves random-initialization-land and starts to learn basic structure. On a loss-versus-step plot this looks like a near-vertical drop.
  • Long plateau. After the easy gains, loss decays much more slowly — a long power-law-shaped tail. Most training time is spent in this plateau, eking out the last bits of performance. The shape looks like training has stopped, but loss is still dropping; just slowly.

Practical things you'll add to the loop in real projects:

  • Validation eval. Every few thousand steps, compute loss on a held-out validation set (data primer §3). If validation loss starts rising while training loss is still dropping, you're overfitting (supervised primer §4) — time to stop or regularize harder.
  • Checkpointing. Every N steps, save the model + optimizer state to disk. A crash, a power outage, a bad code change — all easier to recover from when there's a checkpoint to revert to. Frontier LLM training writes checkpoints every few hundred steps.
  • Logging. Loss, learning rate, gradient norm, throughput (tokens/sec or examples/sec), GPU utilization, occasional samples. The training loop is opaque without instrumentation; tools like Weights & Biases or TensorBoard exist to make it observable.
  • Gradient clipping. Cap the gradient's norm to some maximum (often 1.0) to prevent a single rogue mini-batch from destroying weeks of training. Three lines of code; saves countless training runs.
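  • Mixed-precision training. Most of the forward and backward pass runs in 16-bit floats; the optimizer's accumulator (and usually a master copy of the weights) stays in 32-bit. Up to ~2× speedup and substantially less activation memory, almost free if you remember to scale the loss to avoid gradient underflow in fp16 (bf16 usually does not need loss scaling).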
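
Of these additions, gradient clipping is the simplest to show in full. The sketch below treats the gradient as a flat list of floats and rescales it when its Euclidean norm exceeds a maximum of 1.0; the function name and the flat-list representation are assumptions for illustration, whereas real frameworks operate on parameter tensors.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale `grads` so their overall L2 norm is at most `max_norm`.

    Returns the (possibly rescaled) gradients and the norm before clipping,
    which is the value typically logged as "gradient norm".
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads), norm
    scale = max_norm / norm
    return [g * scale for g in grads], norm


rogue, rogue_norm = clip_by_global_norm([3.0, 4.0])     # norm 5, gets rescaled
calm, calm_norm = clip_by_global_norm([0.3, 0.4])       # norm 0.5, left alone

print(rogue, rogue_norm)
print(calm, calm_norm)
```

Clipping changes only the length of the gradient, never its direction, so a rogue batch still pushes the parameters the same way, just not as far.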

In a Transformer: the loop above is exactly the training loop for GPT, LLaMA, Claude, Mistral, Gemini — every modern LLM. The only differences are scale: thousands of GPUs running data-parallel copies of the loop, sharing gradient averages every step; sequences of thousands of tokens instead of single examples; "batch size" measured in millions of tokens via gradient accumulation. The arithmetic of the loop is identical to PyTorch on a laptop. This loop, run millions of times, is the entire reason modern AI exists.