Transformer Block Primer

Attention is the star, but it doesn't work alone. Every Transformer block wraps attention in three other pieces — normalization, residual connections, and a position-wise feedforward MLP — and stacks N of these blocks into a model. Four short topics: LayerNorm and RMSNorm; residual connections and why they make deep stacks trainable; the FFN that does most of the parameter work; and how it all assembles into a complete block and a complete model.

LayerNorm and RMSNorm

Keep each token's feature vector at a stable scale so training doesn't blow up.

In a deep network, the magnitude of an activation vector can drift wildly from layer to layer. The same is true of gradients in reverse. If you don't hold the scale steady somewhere, deep stacks become impossible to train: either values explode and overflow, or shrink toward zero and kill gradients. Normalization is the answer.

LayerNorm. Introduced in 2016, used by the original Transformer. For each token's feature vector x ∈ R^d_model, compute its mean and standard deviation across the d_model dimensions, then standardize:

μ = mean(x)            // a scalar per token
σ = std(x)             // a scalar per token
LN(x) = γ · (x − μ) / σ  +  β

γ and β are learned per-dimension scale and bias parameters (vectors of length d_model). The crucial property: this happens independently for each token. No statistics flow between tokens — which is what makes it different from BatchNorm and a good fit for variable sequence lengths.
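The formula above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a library implementation: the function name layer_norm, the eps term (a small constant under the square root for numerical safety, which the formula above leaves out), and the toy inputs are all assumptions made for the example.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token's feature vector x (a list of d_model floats)."""
    d = len(x)
    mu = sum(x) / d                               # scalar mean for this token
    var = sum((v - mu) ** 2 for v in x) / d       # variance across the d_model features
    inv_std = 1.0 / math.sqrt(var + eps)          # eps is an assumed safety constant
    return [g * (v - mu) * inv_std + b for v, g, b in zip(x, gamma, beta)]

# Two tokens at very different scales come out at nearly the same scale.
d_model = 4
gamma = [1.0] * d_model    # identity scale
beta = [0.0] * d_model     # zero bias
small = layer_norm([0.1, 0.2, 0.3, 0.4], gamma, beta)
large = layer_norm([100.0, 200.0, 300.0, 400.0], gamma, beta)
```

Each call sees only one token's vector, which is the per-token independence described above.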

LayerNorm centers, then rescales to unit variance. RMSNorm drops the centering step: empirically just as good, and roughly 7–10% faster. Llama, Qwen, DeepSeek all use RMSNorm.

RMSNorm. 2019. The modern default for open LLMs. The observation is that the mean-subtraction step in LayerNorm contributes little — the “recentering” doesn't change the model's capacity in a useful way. So RMSNorm drops it:

rms = sqrt( mean(x²) )         // no centering
RMSNorm(x) = γ · x / rms       // no β either

That's it. No mean subtraction, no β bias. Fewer operations per vector, which works out to roughly 7–10% faster in practice, though the exact figure depends on model size and hardware. Empirically the model trains and generalizes about as well as with LayerNorm. The win compounds because a norm is applied twice in every block, so the per-block savings stack up across the whole forward and backward pass.
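A matching sketch for RMSNorm, under the same assumptions as before: plain Python lists, a hypothetical function name rms_norm, and an eps constant that the formula above omits.

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    """Rescale one token's vector by its root-mean-square; no centering, no bias."""
    d = len(x)
    rms = math.sqrt(sum(v * v for v in x) / d + eps)   # eps is an assumed safety constant
    return [g * v / rms for v, g in zip(x, gamma)]

x = [1.0, 2.0, 3.0, 4.0]
y = rms_norm(x, [1.0] * len(x))
out_rms = math.sqrt(sum(v * v for v in y) / len(y))    # close to 1 after normalization
out_mean = sum(y) / len(y)                             # not zero: the mean is left alone
```

The output has unit RMS but keeps a nonzero mean, which is the only behavioral difference from the LayerNorm formula.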

Who uses what.

LayerNorm: the original Transformer, BERT, GPT-2, GPT-3.
RMSNorm: T5 (which describes it as a simplified layer norm, with no mean subtraction and no bias). Llama 1, 2, 3. Qwen. DeepSeek. Mistral. Gemma. Basically every modern open LLM.

Pre-norm vs post-norm. A separate but related question: does normalization go before attention / FFN (pre-norm) or after the residual add (post-norm)? The original 2017 paper used post-norm. Around 2020, several papers showed pre-norm is much more stable: the residual path stays free of norm layers, so gradients flow straight through it instead of being rescaled by a norm at every block. Nearly every modern decoder LLM uses pre-norm or a close variant. We'll see the resulting layout in the last section.
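The difference between the two layouts is easiest to see with a sublayer that contributes nothing. The sketch below is a toy under stated assumptions: the helper names are hypothetical, the norm is a parameter-free RMSNorm, and zero_sublayer stands in for attention or the FFN.

```python
import math

def rms_norm(x, eps=1e-6):
    """Stand-in norm without learned parameters."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def post_norm_sublayer(x, sublayer):
    # Original 2017 layout: add the residual, then normalize the sum.
    return rms_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_sublayer(x, sublayer):
    # Pre-norm layout: normalize only the sublayer's input; the residual path is untouched.
    return [a + b for a, b in zip(x, sublayer(rms_norm(x)))]

def zero_sublayer(x):
    # Hypothetical sublayer that outputs zeros, to expose what the residual path does.
    return [0.0] * len(x)

x = [3.0, -4.0, 12.0, 0.5]
pre_out = pre_norm_sublayer(x, zero_sublayer)    # identical to x
post_out = post_norm_sublayer(x, zero_sublayer)  # a rescaled copy of x
```

In the pre-norm layout the input passes through exactly; in the post-norm layout even a do-nothing sublayer changes the signal, because the norm sits on the residual path.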

Residual Connections

Add the input back to the output. This single trick is what makes deep stacks trainable.

A residual connection (or “skip connection”) is the simplest-looking idea in deep learning, and one of the most consequential. Take any block that maps a vector x to some f(x), and modify its output to be:

y = f(x) + x

That's the whole change. The block does its transformation, and then the original input is added back in. Functionally, the block now learns to predict the delta from the input rather than the full output. Two things follow, and they're both important.

Adding the input back to the output creates a direct "highway" for both activations and gradients to flow through the stack. Deep Transformers don't train without it.

1. Identity is the easy default. If f outputs all zeros, the block becomes the identity function: y = x. A deep stack of zero-initialized blocks is just the identity from input to output. The optimizer then learns small, useful deltas in each layer, rather than having to learn the full output transform from scratch in every layer. This is a much friendlier starting point.

2. Gradient highway. The chain rule applied to the residual y = f(x) + x gives dy/dx = (df/dx) + 1. That “+ 1” is the residual path's contribution. Even if df/dx is tiny (vanishing in a deep stack), the gradient through the residual path is always 1, so the upstream gradient reaches x unchanged along that path. Without residuals, the end-to-end gradient across L layers is a pure product of the L local derivatives, which shrinks toward zero if each one is small. With residuals, it is a product of (df/dx + 1) factors, which always contains a term equal to 1: the path that skips every f.
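A scalar toy makes the contrast concrete. The sketch assumes every layer has the same small local derivative df/dx = 0.01 and a depth of 50; both numbers, and the function name chain_gradient, are invented for illustration.

```python
def chain_gradient(local_derivs, residual):
    """End-to-end derivative of a stack of scalar layers via the chain rule."""
    grad = 1.0
    for d in local_derivs:
        # Plain layer: factor is df/dx. Residual layer: factor is df/dx + 1.
        grad *= (d + 1.0) if residual else d
    return grad

depth = 50
local = [0.01] * depth                             # assumed tiny per-layer derivative
plain = chain_gradient(local, residual=False)      # 0.01 ** 50, effectively zero
with_skip = chain_gradient(local, residual=True)   # 1.01 ** 50, still of order 1
```

The plain stack's gradient is far too small to drive any learning, while the residual stack keeps it close to 1.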

Before residual connections (ResNet, 2015), the deepest widely used vision networks had around 20 layers (VGG-19, GoogLeNet), and the ResNet paper showed that plain networks stacked deeper than that trained worse, not better. With residuals, ResNet-152 trained without trouble. Transformers inherit this same property: GPT-3 has 96 layers, Llama 70B has 80, and they train because every sublayer is a residual block.

Two residuals per Transformer block. A Transformer block has two sublayers — attention and FFN — and each gets its own residual:

x  ←  x + Attention( Norm(x) )    // residual 1
x  ←  x + FFN( Norm(x) )          // residual 2

The x variable is reused — the block updates it in place, twice. From the residual's point of view, attention and FFN are each producing a small correction. The skeleton of the model is the identity function; every block adds a small specialized adjustment on top.

Why both norm and residual? They solve different problems. Norm controls the scale of activations at each step. Residuals preserve gradient flow and identity-as-default. Together they form the trainability backbone — neither alone is enough for a deep stack.

Position-Wise Feedforward (FFN)

A two-layer MLP applied independently to each token. Where most of the model's parameters actually live.

Attention moves information between tokens. The feedforward network — the “FFN” or “MLP” sublayer — is what each token does on its own once it has the right information. It's the second half of every Transformer block, applied position-wise: the same MLP runs at every token, with no information flow between positions.

FFN(x) = W_down · σ( W_up · x )

W_up: shape (d_model, 4 · d_model), written as (input dim, output dim). Projects the token vector into a wider “hidden” space, usually 4× the model dimension. For GPT-3 with d_model = 12288, the hidden width is 49152.
σ: a pointwise nonlinearity (unrelated to the standard deviation σ in the LayerNorm formula). ReLU in the original Transformer; GELU in BERT and GPT-2; SiLU (also called Swish) in Llama, Qwen, Mistral. The exact choice matters less than the fact that you have one.
W_down: shape (4 · d_model, d_model). Projects back down so the output has the same shape as the input, a requirement for stacking blocks.
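The three pieces above can be sketched as follows. This is a dependency-free toy, not a production layer: it uses tiny dimensions, random weights, the tanh approximation of GELU, a row-vector convention (x times W) that matches the shapes listed above, and no bias terms. The helper names vec_mat and ffn are hypothetical.

```python
import math
import random

def gelu(v):
    # Tanh approximation of GELU.
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))

def vec_mat(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns) -> list of length m."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def ffn(x, W_up, W_down):
    hidden = [gelu(h) for h in vec_mat(x, W_up)]   # d_model -> 4 * d_model
    return vec_mat(hidden, W_down)                 # 4 * d_model -> d_model

random.seed(0)
d_model, d_hidden = 8, 32
W_up = [[random.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]
W_down = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]

tokens = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(3)]   # seq_len = 3
outputs = [ffn(t, W_up, W_down) for t in tokens]   # same weights at every position
```

Because each token is processed in its own call, the result for one position cannot depend on any other position.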

Applied independently to each token, same weights everywhere. Hidden is 4× wider than d_model — this is where most of a Transformer's parameters live (~2/3 of every block).

Why position-wise? Because attention already did the cross-token work. The FFN just needs to transform each token's contextualized vector into something useful. Sharing the same MLP across positions means the FFN learns one fixed function and applies it everywhere — which is cheap, generalizes well, and works on sequences of any length.

Why 4×? It is the ratio the original 2017 Transformer used (d_model = 512, hidden width 2048), and it has held up as an empirical sweet spot. The intuition: attention's expressiveness is limited by d_model, and the FFN needs enough room to do the per-token computation that attention can't. 4× turns out to be enough headroom for most tasks; going wider can help in big models, while going narrower tends to hurt. Some modern models such as Llama use a slightly different shape: the SwiGLU variant has three matrices instead of two, with the hidden width cut to roughly 2/3 × 4 = 8/3 × d_model so the parameter count stays about the same. The overall up-then-down structure is unchanged.
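A minimal sketch of the SwiGLU shape, assuming the common formulation in which a SiLU-activated gate branch multiplies a linear up branch before the down projection. The names swiglu_ffn, W_gate, and rand_matrix are illustrative, and the sketch ignores the rounding of the hidden width to a hardware-friendly multiple that real models typically apply.

```python
import math
import random

def silu(v):
    return v / (1.0 + math.exp(-v))     # v * sigmoid(v)

def vec_mat(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def swiglu_ffn(x, W_gate, W_up, W_down):
    gate = [silu(g) for g in vec_mat(x, W_gate)]   # gating branch
    up = vec_mat(x, W_up)                          # linear branch
    hidden = [g * u for g, u in zip(gate, up)]     # elementwise product
    return vec_mat(hidden, W_down)

# Parameter check at d_model = 768: three matrices at 8/3 width match two at 4x width.
d_model = 768
classic_params = 2 * d_model * (4 * d_model)
swiglu_hidden = 8 * d_model // 3                   # 2048
swiglu_params = 3 * d_model * swiglu_hidden

# Tiny forward pass with random weights.
random.seed(0)
d, h = 6, 16
def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]
W_gate, W_up, W_down = rand_matrix(d, h), rand_matrix(d, h), rand_matrix(h, d)
y = swiglu_ffn([random.gauss(0, 1) for _ in range(d)], W_gate, W_up, W_down)
```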

Where the parameters live. Quick count for one block at d_model = 768, 4 · d_model = 3072:

Attention: W_Q, W_K, W_V, W_O — each is 768 × 768, totaling 4 · 768² ≈ 2.4M.
FFN: W_up is 768 × 3072, W_down is 3072 × 768. Total 2 · 768 · 3072 ≈ 4.7M.
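The count above is simple enough to check directly. The helper below is a hypothetical convenience that counts only the large weight matrices and ignores biases and norm parameters, as the text does.

```python
def block_matrix_params(d_model, ffn_mult=4):
    """Weight-matrix parameter counts for one block (biases and norms ignored)."""
    attn = 4 * d_model * d_model                 # W_Q, W_K, W_V, W_O
    ffn = 2 * d_model * (ffn_mult * d_model)     # W_up and W_down
    return attn, ffn

attn, ffn = block_matrix_params(768)
ffn_share = ffn / (attn + ffn)                   # fraction of the block held by the FFN
```

With these shapes the FFN always holds two thirds of the block's matrix parameters, regardless of d_model.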

FFN is about twice as big as attention. In a Transformer where you might intuitively think “attention is the model,” it's actually the FFN that holds most of the parameters and does most of the token-level computation. Interpretability research suggests that much of a model's factual knowledge is stored in the FFN weights (the “memory” of the model), with attention acting as the routing layer that decides which memories to activate for each token.

The nonlinearity matters. Without σ, the FFN would collapse: W_down · (W_up · x) = (W_down · W_up) · x, which is just a single d_model × d_model linear layer written with more parameters than it needs. The nonlinearity is what makes the up–down structure compute something more expressive than one linear layer. SiLU and GELU both pass large positive values through almost unchanged and squash large negative values toward zero: smooth relatives of ReLU that dip slightly below zero for small negative inputs, so unlike ReLU they are not strictly monotonic.
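The collapse can be checked numerically. The sketch below uses small random matrices and a row-vector convention (x times W_up times W_down, the transpose of the column form written above); the helper names are hypothetical.

```python
import random

def vec_mat(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def mat_mat(A, B):
    """Matrix product of A and B, computed row by row."""
    return [vec_mat(row, B) for row in A]

random.seed(0)
d_model, d_hidden = 4, 16
W_up = [[random.gauss(0, 1) for _ in range(d_hidden)] for _ in range(d_model)]
W_down = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_hidden)]
x = [random.gauss(0, 1) for _ in range(d_model)]

two_step = vec_mat(vec_mat(x, W_up), W_down)   # up then down, no nonlinearity
W_merged = mat_mat(W_up, W_down)               # one (d_model, d_model) matrix
one_step = vec_mat(x, W_merged)
max_diff = max(abs(a - b) for a, b in zip(two_step, one_step))
```

The two results agree up to floating-point rounding, so without σ the wide hidden layer buys nothing.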

The Full Block, and N Blocks Into a Model

Norm, attention, residual, norm, FFN, residual. Stack N copies, wrap with an embedding and a head.

The four pieces we've seen (norm, attention, residual, FFN) assemble in one fixed pattern. This is the pre-norm Transformer block, used by nearly every modern decoder-only LLM:

# one block
x  ←  x + Attention( Norm( x ) )
x  ←  x + FFN( Norm( x ) )

Two sublayers, two residuals, two norms. Read it line by line: normalize x, run attention on it, add the result back to the original x; normalize the new x, run FFN, add that result back too. The output is the same shape as the input, ready to be fed into the next block.
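The wiring can be sketched with stand-in sublayers. Everything here is a toy assumption: toy_attention replaces real multi-head attention with a causal running mean, toy_ffn replaces the MLP with a simple per-token function, and the norm has no learned parameters. Only the order of operations mirrors the two lines above.

```python
import math

def rms_norm(x, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def toy_attention(seq):
    """Stand-in for causal attention: each position averages itself and earlier positions."""
    out, running = [], [0.0] * len(seq[0])
    for t, vec in enumerate(seq, start=1):
        running = [r + v for r, v in zip(running, vec)]
        out.append([r / t for r in running])
    return out

def toy_ffn(vec):
    """Stand-in for the position-wise FFN: any per-token function shows the wiring."""
    return [0.5 * math.tanh(v) for v in vec]

def add(a, b):
    return [[p + q for p, q in zip(u, v)] for u, v in zip(a, b)]

def pre_norm_block(x):
    x = add(x, toy_attention([rms_norm(t) for t in x]))   # x <- x + Attention(Norm(x))
    x = add(x, [toy_ffn(rms_norm(t)) for t in x])         # x <- x + FFN(Norm(x))
    return x

seq = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0], [4.0, 0.0, -2.0]]   # (seq_len=3, d_model=3)
out = pre_norm_block(seq)
```

The output has the same shape as the input, so the function can be applied again to its own result, which is all that stacking requires.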

Pre-norm style (used by every modern decoder LLM): normalize first, then attention or FFN, then add the residual. Same block N times; final norm + LM head finishes the model.

Stacking N of them. A model has many copies of this block, with completely independent weights: each block learns its own attention patterns, its own FFN function, its own norm parameters. Information flows from input to output through every block in sequence; no block is skipped and none run in parallel.

GPT-2 small: N = 12 blocks, d_model = 768.
GPT-2 medium / large / XL: N = 24 / 36 / 48.
GPT-3: N = 96, d_model = 12288.
Llama 2 7B / Llama 3 8B: N = 32. Llama 2 / 3 70B: N = 80.
Other large modern models: typically several dozen to a little over a hundred blocks.

The wrapping. A complete language model is then:

x = Embedding(token_ids)            # (seq_len, d_model)
for block in blocks:                # N blocks, applied in order
    x = block(x)
x = FinalNorm(x)
logits = LM_Head(x)                 # (seq_len, vocab_size)

Embedding turns token IDs into d_model-dim vectors (in RoPE-based models, positional information is applied inside attention, not here). N blocks transform the vectors. A final norm cleans up the scale. The LM head, a single linear layer of shape (d_model, vocab_size), projects each token's vector to a score over the entire vocabulary; softmax over those scores gives the next-token probability distribution. That single linear layer is often the largest single weight tensor in the whole model: with a vocabulary of roughly 50,000, it is 50,000 × 768 ≈ 38M parameters for GPT-2 small and 50,000 × 12288 ≈ 614M for GPT-3.
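The last step, from a token's final vector to a next-token distribution, can be sketched as below. The weights, the toy sizes (d_model = 3, vocab_size = 4), and the greedy pick at the end are assumptions for illustration; real decoding usually samples rather than always taking the top score.

```python
import math

def softmax(logits):
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def lm_head(x, W):
    """x: d_model vector; W: (d_model, vocab_size) matrix; returns vocab_size logits."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

W = [[0.2, -0.1, 0.0, 0.5],
     [0.1, 0.3, -0.2, 0.0],
     [-0.3, 0.2, 0.4, 0.1]]
last_token = [1.0, 2.0, -1.0]                    # final-norm output for the last position
probs = softmax(lm_head(last_token, W))
next_token_id = max(range(len(probs)), key=probs.__getitem__)   # greedy choice

# Head sizes quoted in the text, assuming a vocabulary of 50,000.
gpt2_small_head = 50_000 * 768
gpt3_head = 50_000 * 12_288
```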

Causal masking and KV cache. For language modeling, the attention inside each block uses a causal mask (covered in the self-attention primer): each token can only attend to itself and earlier tokens. At inference, the K and V vectors of prior tokens are cached so they don't need to be recomputed for every newly generated token. This KV cache grows linearly with sequence length and is often the dominant memory cost during long-context generation, apart from the weights themselves; it is what techniques like MQA / GQA / sliding window are trying to shrink.
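A back-of-the-envelope estimate shows why the cache matters. The function below is a hypothetical helper, and the configuration (32 layers, 32 heads of dimension 128, 16-bit values, 4096 tokens) is an illustrative assumption rather than a statement about any particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2, batch=1):
    """Approximate KV cache size: one K and one V vector per token, layer, and KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value * batch

# Full multi-head attention: every head keeps its own K and V.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)

# Grouped-query attention with 8 KV heads shared across the 32 query heads.
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)

mha_gib = mha / 1024 ** 3    # 2.0
gqa_gib = gqa / 1024 ** 3    # 0.5
```

Under these assumptions a single 4096-token sequence needs about 2 GiB of cache with full multi-head attention, and sharing KV heads cuts that in proportion to the number of heads removed.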

Where this leaves us. Twenty primers in, the picture is now complete. Tokens → embeddings → N pre-norm blocks (each one = norm + multi-head attention with RoPE + residual + norm + FFN + residual) → final norm → linear head → next-token logits. That sentence — together with all the pieces we've unpacked along the way — is a modern decoder-only LLM. Training it adds a few more pieces (cross-entropy loss, AdamW, learning-rate schedule, mixed precision — all in earlier primers), but the architecture is now exactly what you'd implement in a few hundred lines of PyTorch.