Optimizers & Training Tricks Primer

The gradient descent primer ended on one line: w ← w − η · ∇L. Real LLM training is that line wrapped in four layers of practical machinery — every one of which is a required ingredient for a model that actually converges. Four short topics: SGD → Adam → AdamW (the optimizer everyone uses); learning-rate schedules (warmup + cosine decay); mixed-precision training (fp16 / bf16); and distributed training (data / tensor / pipeline parallelism). After this primer, the configs you see at the top of every training script make sense.

SGD → Adam → AdamW

A per-parameter adaptive step size, plus the weight-decay fix that made LLM training work.

The gradient descent primer's §1 introduced plain SGD: w ← w − η · ∇L. It works fine on simple problems. On modern deep networks it leaves a lot on the table: one global learning rate has to be small enough for the steepest direction, so it is too small for the shallow ones, and progress is slow. The history of optimization in deep learning is one long story of finding ways around that constraint.

First fix: momentum. Carry a running average of past gradients and step in that direction. The result smooths out noise and accelerates progress along consistent directions. Still one global learning rate, but the optimizer makes better use of it.

Second fix: Adam (Kingma & Ba, 2014), short for adaptive moment estimation. Keep two running averages: m_t of gradients (first moment) and v_t of squared gradients (second moment). The first one is just momentum. The second one tracks the typical squared magnitude of each parameter's gradients. Divide the step by √v_t and each parameter gets its own effective learning rate.

Adam update at step t:
  m_t  =  β₁ · m_{t-1} + (1 − β₁) · g_t          ← first moment
  v_t  =  β₂ · v_{t-1} + (1 − β₂) · g_t²         ← second moment
  m̂_t  =  m_t / (1 − β₁ᵗ)                        ← bias correction
  v̂_t  =  v_t / (1 − β₂ᵗ)
  w    ←  w  −  η · m̂_t / (√v̂_t + ε)

Typical: β₁ = 0.9, β₂ = 0.95–0.999, ε = 1e-8
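
The update above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production optimizer: the function name adam_step and the list-based state are assumptions made for the example. The demo feeds two parameters gradients that differ by 100× and shows that the first step has nearly the same size for both.

```python
import math

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on plain Python lists; t is the 1-based step count."""
    new_w, new_m, new_v = [], [], []
    for w_i, g_i, m_i, v_i in zip(w, g, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g_i          # first moment
        v_i = beta2 * v_i + (1 - beta2) * g_i * g_i    # second moment
        m_hat = m_i / (1 - beta1 ** t)                 # bias correction
        v_hat = v_i / (1 - beta2 ** t)
        new_w.append(w_i - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_w, new_m, new_v

# Two parameters whose gradients differ by 100x.
w, m, v = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]
w1, m1, v1 = adam_step(w, [100.0, 1.0], m, v, t=1, lr=0.01)
steps = [w[i] - w1[i] for i in range(2)]
print(steps)   # both entries are very close to lr = 0.01
```

Plain SGD with the same η would have moved the first parameter 100 times further than the second.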

The big idea — large recent gradients get smaller effective steps; small recent gradients get larger effective steps. On an elongated loss surface, the steep axis gets shrunk and the flat axis gets stretched, so you head toward the minimum directly rather than zigzagging.

Adam keeps a per-parameter running scale (the second moment v) and divides by it. The steep axis is auto-shrunk, the shallow one is auto-stretched.

Third fix: AdamW (Loshchilov & Hutter, 2017). Adam plus the right way to do weight decay. The usual practice with Adam was to fold L2 regularization into the gradient itself by adding λ · w to g_t. Because that term then also gets divided by √v_t, the actual amount of weight decay depended on each parameter's gradient magnitude: parameters with large gradients were barely decayed at all. AdamW decouples the two: do the Adam update, then separately shrink the weights by η · λ · w:

AdamW update at step t:
  w  ←  w  −  η · m̂_t / (√v̂_t + ε)  −  η · λ · w
                                       ─────────
                                       weight decay
                                       decoupled from adaptive scaling
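
A small sketch can make the difference concrete. The helper names (adam_core, adam_l2_step, adamw_step) are made up for this example, and the setup is deliberately extreme: a single weight with zero loss gradient, so the only force acting on it is weight decay. With L2 folded into the gradient, Adam's normalization makes the shrink roughly η whatever λ is. With the decoupled form, the shrink is exactly η · λ · w.

```python
import math

def adam_core(g, m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the normalized Adam direction and the updated moments."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps), m, v

def adam_l2_step(w, g, m, v, t, lr, wd):
    # L2 penalty folded into the gradient, then scaled by Adam.
    direction, m, v = adam_core(g + wd * w, m, v, t)
    return w - lr * direction, m, v

def adamw_step(w, g, m, v, t, lr, wd):
    # Adam step on the loss gradient only, plus a separate shrink.
    direction, m, v = adam_core(g, m, v, t)
    return w - lr * direction - lr * wd * w, m, v

lr = 1e-3
shrink_l2 = {wd: 1.0 - adam_l2_step(1.0, 0.0, 0.0, 0.0, 1, lr, wd)[0] for wd in (0.1, 0.01)}
shrink_aw = {wd: 1.0 - adamw_step(1.0, 0.0, 0.0, 0.0, 1, lr, wd)[0] for wd in (0.1, 0.01)}
print(shrink_l2)   # both values are about lr: lambda has been normalized away
print(shrink_aw)   # about lr * wd: 1e-4 and 1e-5
```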

Subtle change, big consequences. AdamW generalizes better at scale, and it is the default choice for modern LLM training runs. It ships as torch.optim.AdamW in PyTorch and as optax.adamw in the JAX ecosystem, and the HuggingFace Trainer uses AdamW by default.

What people actually do in 2026:

AdamW for almost everything. Default for LLM pretraining, fine-tuning, and most other deep-learning tasks.
SGD with momentum still wins on certain image classification tasks (ResNet, ViT in some settings), where the loss landscape is friendly enough that adaptive methods don't help and can even hurt generalization.
Lion, Sophia, Adafactor, Shampoo, Muon — newer optimizers that claim to beat AdamW on memory or convergence. None has fully replaced AdamW in production yet (as of 2026), but the field experiments constantly.

A note about Adam's memory cost: it keeps two extra fp32 tensors per parameter (m and v). For a 7B model, that's 56 GB of optimizer state on top of the 28 GB of fp32 weights — about 84 GB just for the optimizer side of training. The VRAM section of the hardware primer (§2) showed why this is the binding constraint. Memory-efficient optimizers like Adafactor and 8-bit AdamW exist specifically to address this.
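
The arithmetic behind those numbers, as a quick check. The function name is invented for the example, and 1 GB is taken as 10⁹ bytes.

```python
def adam_memory_gb(n_params, bytes_per_value=4):
    """Weights plus Adam's m and v, all stored at bytes_per_value each."""
    weights = n_params * bytes_per_value / 1e9
    optimizer_state = 2 * n_params * bytes_per_value / 1e9   # m and v
    return weights, optimizer_state, weights + optimizer_state

weights_gb, state_gb, total_gb = adam_memory_gb(7e9)
print(weights_gb, state_gb, total_gb)   # 28.0 56.0 84.0
```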

In a Transformer: the optimizer applied to every parameter in every block is typically AdamW with β₁ = 0.9, β₂ = 0.95, ε = 1e-8, weight decay λ ≈ 0.1, plus a learning-rate schedule (§2). That single sentence closely matches the published recipes for GPT-3 and LLaMA, and for most openly documented LLMs trained between 2020 and 2026. There are exceptions (PaLM reports an Adafactor variant, and closed models such as Claude and Gemini do not publish full optimizer details), but the recipe is so dominant that "AdamW" is essentially synonymous with "modern LLM training."

Learning-Rate Schedules

Warmup at the start, decay at the end. Nobody trains LLMs with a constant η.

AdamW gives you a per-parameter step. The remaining question is: how should the global step size η evolve over the course of training? It turns out a constant η is bad. Every modern LLM uses a schedule with two phases: warmup at the start, then decay over the rest.

Warmup. Start with η = 0 and ramp linearly up to the peak over the first ~0.1–2% of training. Why? Adam's second moment v_t needs to accumulate gradient-magnitude statistics before its divisor is meaningful; if you hit full η in the first step, v_t is wildly unreliable and the resulting update is huge. Warmup gives the optimizer's state a chance to stabilize before full-throttle steps land. Skipping warmup is the canonical "loss goes NaN at step 5" mistake.

Decay. After warmup, decrease η over the rest of training, ending at something tiny (often 10% of peak, sometimes 0). The intuition: early training wants big steps to find the right neighborhood; late training wants small steps to polish the model near the minimum. Several decay shapes are in use:

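Cosine decay — η_t = η_min + (η_peak − η_min) · 0.5 · (1 + cos(π · t/T)), where t counts steps since the end of warmup and T is the number of decay steps. With η_min = 0 this reduces to η_peak · 0.5 · (1 + cos(π · t/T)). Smooth, no abrupt drops. The LLM default.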
Linear decay — straight line from η_peak down to η_end. Common in fine-tuning.
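Inverse-sqrt — η_t = η_peak · √(t_w / t) for t ≥ t_w, where t_w is the number of warmup steps, so the rate equals η_peak at the end of warmup and falls off as 1/√t afterwards. Original Transformer paper used this; less common now.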
WSD (Warmup-Stable-Decay) — hold at η_peak for most of training, then decay sharply at the end. Useful when you don't know the final budget in advance (you can resume and decay later).
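
The four shapes can be written as small functions for comparison. This is a sketch under stated assumptions: t counts steps after warmup and T is the length of the decay phase, the inverse-sqrt form is written relative to the warmup length so that it equals the peak when warmup ends, and the WSD decay is taken to be linear over the last 10% of steps. Real implementations vary on all three points.

```python
import math

def cosine(t, T, peak, floor=0.0):
    return floor + (peak - floor) * 0.5 * (1 + math.cos(math.pi * t / T))

def linear(t, T, peak, end=0.0):
    return peak + (end - peak) * t / T

def inverse_sqrt(step, warmup_steps, peak):
    # step is the absolute step count and must be >= warmup_steps
    return peak * math.sqrt(warmup_steps / step)

def wsd(t, T, peak, decay_frac=0.1, end=0.0):
    """Hold at peak, then decay linearly over the last decay_frac of the T steps."""
    decay_start = T * (1 - decay_frac)
    if t <= decay_start:
        return peak
    return peak + (end - peak) * (t - decay_start) / (T - decay_start)

for t in (0, 250, 500, 750, 1000):
    print(t, cosine(t, 1000, 1.0), linear(t, 1000, 1.0), wsd(t, 1000, 1.0))
```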

Start at 0 to let Adam's variance estimate stabilize; ramp to peak; then cosine-decay back so the model can fine-tune in small steps near the end.

The values you'll see most often, for a Transformer pretraining run:

Peak LR:        3e-4 to 1e-3  (smaller for bigger models)
Warmup steps:   500 – 2000     (0.1–2% of total)
Total steps:    10k – millions (data-dependent)
Decay shape:    cosine to 10% of peak
Min LR:         3e-5 to 1e-4
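
Put together, a schedule with these values might look like the sketch below: linear warmup from 0, then cosine decay to 10% of peak. The specific defaults (3e-4 peak, 2000 warmup steps, 100k total steps) are illustrative picks from the ranges above, not a recommendation.

```python
import math

def get_lr(step, peak_lr=3e-4, min_lr=3e-5, warmup_steps=2000, total_steps=100_000):
    """Learning rate for a given step: linear warmup, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step >= total_steps:
        return min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + (peak_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * progress))

for step in (0, 1000, 2000, 51_000, 100_000):
    print(step, get_lr(step))
```

In a training loop, the returned value would be written into the optimizer's learning rate before each step.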

A practical fact about scaling: as models grow, the optimal peak LR shrinks. LLaMA, for example, reports η = 3e-4 for its 7B model and 1.5e-4 for its 65B model, and GPT-3 175B used 6e-5. The "Maximal Update Parameterization" (μP) line of research gives you scaling rules so you can tune at small scale and predict the right LR for a bigger model.

In a Transformer: the schedule from the training script is usually wired into a get_lr()-style function called every step. The single most important hyperparameter is the peak LR; the second is the warmup length; the third is the decay shape. Most openly published model reports state these three numbers. If your training diverges, the LR schedule is the first place to look.

Mixed-Precision Training

Keep activations in 2-byte floats. The math still works, the model still trains, the GPU runs twice as fast.

The hardware primer (§2) showed why VRAM matters. Mixed-precision training is the single most effective optimization people apply: keep most of the network in a 2-byte floating-point format (fp16 or bf16) instead of 4-byte fp32. Result: roughly half the memory for activations and at least double the matmul throughput on modern Tensor Core GPUs.

The cast of characters:

fp32 — 32-bit IEEE 754. 1 sign, 8 exponent, 23 mantissa. Range ~10⁻³⁸ to 10³⁸, ~7 decimal digits of precision. The safe default for everything numerical, but expensive: 4 bytes per number.
fp16 — 16-bit half precision. 1 + 5 + 10. Range ~6×10⁻⁵ to ~6.5×10⁴, ~3 decimal digits. Halves memory and (on Tensor Core hardware) doubles throughput. But: the narrow range means gradients can underflow to 0, and activations can overflow to inf.
bf16 (brain float) — also 16 bits, but 1 + 8 + 7. Same exponent range as fp32 (so no overflow / underflow problems), but only 7 mantissa bits (~2 decimal digits of precision). Built by Google for ML; now the default for large-model training on every major hardware platform (A100, H100, TPU, MI300X).
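fp8 — 8-bit float. Two variants (E4M3 and E5M2). H100-and-up hardware can run matmuls in fp8 for further speedup; widely used for inference, increasingly for training. The two formats are specified in the OCP 8-bit Floating Point (OFP8) specification; the OCP Microscaling (MX) specification adds block-scaled variants.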
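
The range and precision figures for fp32, fp16, and bf16 follow directly from the bit counts. The sketch below assumes an IEEE-style layout in which the top exponent code is reserved for inf and NaN. That holds for these three formats but not for fp8 E4M3, so fp8 is left out. The function name is made up for the example.

```python
import math

def float_format_stats(exp_bits, mantissa_bits):
    """Largest normal value, smallest normal value, and approximate decimal digits."""
    bias = 2 ** (exp_bits - 1) - 1
    max_normal = (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias
    min_normal = 2.0 ** (1 - bias)
    decimal_digits = (mantissa_bits + 1) * math.log10(2)   # +1 for the implicit leading bit
    return max_normal, min_normal, decimal_digits

for name, e, m in [("fp32", 8, 23), ("fp16", 5, 10), ("bf16", 8, 7)]:
    hi, lo, digits = float_format_stats(e, m)
    print(f"{name}: max {hi:.3e}, min normal {lo:.3e}, ~{digits:.1f} digits")
```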

Mixed-precision training keeps activations and gradients in fp16 or bf16 for speed, but maintains an fp32 master copy of weights for stable optimizer updates.
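
Why the fp32 master copy matters can be shown with a toy experiment. The sketch below emulates bf16 by rounding the upper 16 bits of an fp32 value (the helpers to_bf16 and to_fp32 are written for this example and ignore NaN and inf). It then applies 1000 updates of 0.001 to a weight of 1.0, once directly in bf16 and once through an fp32 master.

```python
import struct

def to_fp32(x):
    """Round a Python float to the nearest fp32 value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

def to_bf16(x):
    """Round to the nearest bf16 value (round-to-nearest-even on the fp32 bits)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    return struct.unpack(">f", struct.pack(">I", (bits >> 16) << 16))[0]

update = 1e-3
w_bf16_only = 1.0     # weight stored and updated in bf16
master = 1.0          # weight stored and updated in fp32
for _ in range(1000):
    w_bf16_only = to_bf16(w_bf16_only + update)
    master = to_fp32(master + update)
w_from_master = to_bf16(master)   # low-precision copy used for the forward pass

print(w_bf16_only, master, w_from_master)
```

Near 1.0 the spacing between bf16 values is about 0.008, so each 0.001 update rounds away and the bf16-only weight never moves, while the fp32 master accumulates the updates and ends near 2.0.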

The "mixed" in mixed precision matters. You don't put everything in fp16/bf16 — some operations need higher precision. The standard recipe:

Weights, activations, gradients — fp16 or bf16. Forward pass and backward pass both happen in low precision. This is where the speedup comes from.
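Master copy of weights — fp32. The optimizer state and the authoritative weight values are kept in full precision. At update time the low-precision gradients are cast up to fp32, the optimizer computes and applies the update to the fp32 master, and the result is cast back down to fp16/bf16 for the next forward pass.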
Reductions (mean, sum, especially across many values) — fp32. Accumulating many small low-precision numbers loses precision fast; reductions are usually run in fp32 even when the inputs are bf16.
Loss scaling — only for fp16. Multiply the loss by a large constant (e.g., 1024) before backward to push gradient magnitudes up out of fp16's underflow zone, then divide the gradient by that same constant after. bf16 doesn't need this (the exponent range matches fp32), which is most of why it's won.
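
Loss scaling can be illustrated with Python's struct module, which can round a value to IEEE half precision. The gradient value 1e-8 and the scale 1024 are made-up numbers for the demonstration. Real frameworks usually adjust the scale dynamically.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

true_grad = 1e-8
loss_scale = 1024.0

unscaled = to_fp16(true_grad)             # stored directly in fp16
scaled = to_fp16(true_grad * loss_scale)  # what backward produces after scaling the loss
recovered = scaled / loss_scale           # unscale in higher precision before the update

print(unscaled)    # 0.0: the gradient underflowed
print(recovered)   # close to 1e-8
```

Because of the chain rule, multiplying the loss by a constant multiplies every gradient by the same constant, which is why a single scale factor is enough.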

In code, you almost never write this by hand. PyTorch's autocast and AMP handle the casting:

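# PyTorch autocast in bf16 (model, criterion, optimizer, x, y are defined elsewhere)
import torch

optimizer.zero_grad(set_to_none=True)
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(x)                 # matmuls run in bf16
    loss = criterion(out, y)
loss.backward()                    # gradients take the parameters' dtype (fp32 here)
optimizer.step()                   # update is applied to the fp32 parameters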

What this buys you, in real numbers, on an H100 training a 7B Transformer:

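~50% less VRAM for activations (and for any gradients kept in 16-bit). The fp32 master weights and optimizer state do not shrink. Lets you roughly double the batch size or context length.
Much higher matmul throughput. H100 advertises about 989 TFLOPS for dense fp16/bf16 Tensor Core matmuls versus 67 TFLOPS for plain fp32, roughly a 15× gap on paper (closer to 2× against TF32 Tensor Core math). End-to-end speedups are smaller because not every operation is a matmul, but this is why "fp32 training" is essentially extinct in 2026.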
Same final loss, within noise. The model is functionally identical to one trained in fp32, just trained faster.

In a Transformer: nearly every reference implementation of a modern LLM trains in bf16 by default. Inference often goes further, to fp8 or int8 or even 4-bit quantization. The Transformer in the next primer (and any new LLM you might fine-tune in 2026) is overwhelmingly likely to be trained in bf16 with an fp32 master copy of weights; that combination is, at this point, the boring default.

Distributed Training

DP × TP × PP — three independent axes for splitting a model across hundreds of GPUs.

A 70B-parameter model in bf16 weighs 140 GB just for the weights — already larger than a single H100's 80 GB. Training requires gradients, optimizer states, and activations on top of that. The hardware primer (§2) showed a 7B model overshooting an H100 at training time; 70B and larger guarantees multi-GPU. So how does a model split across multiple GPUs work? Three orthogonal techniques, often combined.

1. Data parallel (DP). Each GPU has a full copy of the model. The batch is divided across GPUs. Each GPU does its own forward + backward on its slice of the batch, producing local gradients. Then an all-reduce averages the gradients across all GPUs, so every GPU's weights are updated identically. Simple, easy to scale across many GPUs — but requires the model to fit on each GPU.
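
The key property of data parallelism is that averaging the per-GPU gradients gives the same result as computing the gradient on the whole batch, provided the shards are equal in size. A toy check with a one-parameter linear model follows. The "GPUs" here are just list slices, and the average stands in for the all-reduce.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
w = 0.5
n_gpus = 4
shard = len(xs) // n_gpus

# Each "GPU" computes a gradient on its own slice of the batch.
local_grads = [
    grad_mse(w, xs[r * shard:(r + 1) * shard], ys[r * shard:(r + 1) * shard])
    for r in range(n_gpus)
]
all_reduced = sum(local_grads) / n_gpus   # stand-in for an averaging all-reduce
full_batch = grad_mse(w, xs, ys)

print(local_grads)               # [-7.5, -37.5, -91.5, -169.5]
print(all_reduced, full_batch)   # -76.5 -76.5
```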

Variants of DP shard not just the data but parts of the training state. ZeRO (Microsoft) and FSDP (PyTorch) shard optimizer states, gradients, and weights across the DP group, gathering and scattering as needed. This shrinks per-GPU memory dramatically while keeping the data-parallel programming model, even though no single GPU permanently holds a full copy any more. Many modern PyTorch training scripts use FSDP or a ZeRO-style equivalent by default.
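
A rough per-GPU memory estimate shows what each level of sharding buys. The byte counts are an assumption borrowed from the accounting commonly used for mixed-precision Adam: 2 bytes per parameter for 16-bit weights, 2 for 16-bit gradients, and 12 for fp32 optimizer state (master weights, m, and v). Activations are ignored, and the stage numbering follows ZeRO's convention of sharding optimizer state first, then gradients, then weights.

```python
def zero_memory_gb(n_params, n_gpus, stage):
    """Per-GPU GB for weights + gradients + optimizer state at a given sharding stage."""
    weights, grads, opt = 2 * n_params, 2 * n_params, 12 * n_params
    if stage >= 1:
        opt /= n_gpus       # shard optimizer state
    if stage >= 2:
        grads /= n_gpus     # also shard gradients
    if stage >= 3:
        weights /= n_gpus   # also shard weights
    return (weights + grads + opt) / 1e9

for stage in range(4):
    print(stage, zero_memory_gb(7e9, 8, stage))
# stage 0: 112.0, stage 1: 38.5, stage 2: 26.25, stage 3: 14.0
```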

2. Tensor parallel (TP). When the model itself doesn't fit on one GPU, split each layer's weight matrix across GPUs. A 4096 × 4096 weight matrix on 4 GPUs becomes four 4096 × 1024 slabs. Each GPU computes a partial matmul, and the partial results are combined with an all-reduce (or all-gather) in both the forward and backward pass of every layer. The communication cost is enormous, so TP usually lives within one node, connected by NVLink (600 to 900 GB/s of bidirectional bandwidth per GPU on A100 and H100), not across nodes connected by InfiniBand.
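
A toy version of the split, using a 4 × 4 matrix on two pretend GPUs. It shows both common layouts: a column split, where the outputs are concatenated, and a row split, where the partial outputs are summed. The matmul helper and all variable names are written for this example only.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

x = [[1.0, 2.0, 3.0, 4.0]]                                    # 1 x 4 activations
w = [[float(4 * r + c) for c in range(4)] for r in range(4)]  # 4 x 4 weights
reference = matmul(x, w)

n_gpus = 2
width = 4 // n_gpus

# Column split: each "GPU" owns a slab of columns and sees the full input.
col_slabs = [[row[p * width:(p + 1) * width] for row in w] for p in range(n_gpus)]
col_partials = [matmul(x, slab) for slab in col_slabs]
col_result = [col_partials[0][0] + col_partials[1][0]]        # concatenate (all-gather)

# Row split: each "GPU" owns a slab of rows and the matching slice of the input.
row_slabs = [w[p * width:(p + 1) * width] for p in range(n_gpus)]
x_slices = [[x[0][p * width:(p + 1) * width]] for p in range(n_gpus)]
row_partials = [matmul(xs, slab) for xs, slab in zip(x_slices, row_slabs)]
row_result = [[sum(vals) for vals in zip(*(p[0] for p in row_partials))]]  # sum (all-reduce)

print(reference, col_result, row_result)   # all three are [[80.0, 90.0, 100.0, 110.0]]
```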

3. Pipeline parallel (PP). Slice the model by depth: GPU 0 owns layers 1–8, GPU 1 owns layers 9–16, etc. The activations from GPU 0 flow into GPU 1, then GPU 2, etc. Naively, GPUs sit idle while their predecessors finish: the "pipeline bubble." Real implementations use micro-batching: split the batch into smaller chunks and pipeline them, so all stages stay busy most of the time. PP is the right tool when even TP isn't enough, that is, when the model is so big that its layers, even sharded with TP, don't all fit within one node.
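
The size of the bubble can be estimated with a simple timeline. The sketch below is a simplified model: it assumes every stage takes one time slot per micro-batch, looks at the forward pass only, and ignores communication. Under those assumptions the idle fraction works out to (stages − 1) / (micro-batches + stages − 1).

```python
def forward_schedule(n_stages, n_microbatches):
    """grid[s][t] is the micro-batch that stage s processes in slot t, or None if idle."""
    total_slots = n_microbatches + n_stages - 1
    grid = [[None] * total_slots for _ in range(n_stages)]
    for s in range(n_stages):
        for b in range(n_microbatches):
            grid[s][s + b] = b    # stage s starts micro-batch b after s earlier hops
    return grid

def bubble_fraction(grid):
    cells = [cell for row in grid for cell in row]
    return cells.count(None) / len(cells)

print(bubble_fraction(forward_schedule(4, 1)))    # 0.75: one big batch, mostly idle
print(bubble_fraction(forward_schedule(4, 12)))   # 0.2: micro-batching fills the pipe
```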

Three independent axes. Real LLM training stacks them — a 1024-GPU run might be 32-way DP × 8-way TP × 4-way PP.

Real-world LLM training uses combinations:

2D (DP + TP). Common up to ~10B. DP across many nodes, TP within each node. Each node holds a full copy of the (sharded) model; nodes coordinate via all-reduce.
3D (DP + TP + PP). Standard for 70B+ models. Imagine a 1024-GPU run: 32-way DP × 8-way TP × 4-way PP. Each axis splits work in a different dimension; the total = 32 · 8 · 4 = 1024.
Sequence parallel. A relatively new addition: split the sequence dimension across GPUs in addition to the others. Important for very long context windows where activations along the sequence axis dominate memory.
Mixture-of-experts (MoE). A different kind of "parallelism": route different tokens to different expert sub-networks. The experts are spread across GPUs (expert parallelism), and tokens reach them through all-to-all communication, so the model can be enormous while each forward pass only activates a few experts. Lets you have, for example, a 1T parameter model with the per-token compute of an 80B.
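
For the 32 × 8 × 4 example, each of the 1024 GPUs gets one coordinate on each axis. One possible mapping is sketched below. The ordering (TP varying fastest, so that each TP group of 8 lands on consecutive ranks within one 8-GPU node) is an assumption made for illustration. Real frameworks define their own rank layouts.

```python
def rank_to_coords(rank, dp=32, tp=8, pp=4):
    """Map a global rank to (dp_idx, pp_idx, tp_idx), with TP varying fastest."""
    tp_idx = rank % tp
    pp_idx = (rank // tp) % pp
    dp_idx = rank // (tp * pp)
    return dp_idx, pp_idx, tp_idx

world_size = 32 * 8 * 4
coords = [rank_to_coords(r) for r in range(world_size)]

print(world_size)             # 1024
print(coords[0], coords[7])   # (0, 0, 0) (0, 0, 7): same node, same TP group
print(coords[8])              # (0, 1, 0): next pipeline stage
print(coords[1023])           # (31, 3, 7)
```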

The hardware that runs this:

Within a node: 8 GPUs connected by NVLink/NVSwitch at 600 to 900 GB/s of bidirectional bandwidth per GPU (A100 and H100 respectively). TP communication is cheap enough to be practical here.
Between nodes: InfiniBand or RoCE at ~200–400 Gb/s per NIC, i.e. about 25–50 GB/s, often with one NIC per GPU. Fast enough for DP all-reduce, too slow for fine-grained TP.
Software: PyTorch FSDP, NVIDIA Megatron-LM, DeepSpeed, JAX with Mesh + GSPMD, Maxtext. Different stacks, same three axes underneath.

In a Transformer: the same architecture you're about to read about scales from 1 GPU (a 1B model fits) to 16 GPUs (a 70B model with TP + PP) to thousands (a frontier LLM with 3D parallelism + MoE). The recipe for going from "this works on one GPU" to "this is training on 25k H100s for six months" is the contents of this section. Most of the engineering effort behind any flagship model announcement is here, not in the architecture.

That's the prerequisite stack. Next primer: the Transformer itself — finally.