Self-Attention Primer

Every token in a Transformer needs to look at every other token and decide “which ones matter for me, right now.” That decision is self-attention: the operation at the heart of every modern LLM. This primer covers four short topics: the Query / Key / Value mental model; the score matrix Q · Kᵀ; scaled dot-product attention, with the famous √d_k divisor and softmax; and the final weighted sum of V, all walked through end-to-end on a tiny example.

Query, Key, Value

Three projections from one embedding — three different roles each token can play in attention.

Every token enters a Transformer layer as one vector — its embedding plus positional information. To do attention, that single vector is fanned out into three separate vectors via three learned linear projections: Query (Q), Key (K), and Value (V). The names are borrowed from information retrieval, and the analogy holds up surprisingly well.

Query (Q). What this token is looking for. Think of it as a search query the token sends out: "I'm a verb — does any noun nearby want to be my subject?"
Key (K). What this token advertises about itself — a tag-like summary other tokens use to decide if it's relevant. Think of it as a card on a bulletin board: "I'm a noun, third person, animate."
Value (V). The content this token contributes if it turns out to be a good match. If Q matches K, the requesting token gets V mixed into its representation.

Per token: x · W_Q → Query (what I'm looking for), x · W_K → Key (what I offer), x · W_V → Value (content I'll share if matched).

The three projections are produced by multiplying the input embedding x by three learned weight matrices: Q = x · W_Q, K = x · W_K, V = x · W_V. These three weight matrices are where all the attention learning lives — gradient descent shapes them so that “what each token is looking for” and “what each token offers” line up usefully for the language modeling task.
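The three projections can be sketched in a few lines of plain Python. This is only an illustration: the 2-dimensional embeddings, the weight values, and the helper `matmul` are made up for this example (real weights are learned), and lists of rows stand in for the tensors a real framework would use.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Hypothetical 2-dim embeddings for "the", "cat", "sat" (made-up numbers).
x = [[1.0, 0.0],
     [1.0, 1.0],
     [0.0, 1.0]]

# Hypothetical projection weights, each d_model x d_k = 2 x 4.
# In a real model these values are learned by gradient descent.
W_Q = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
W_K = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
W_V = [[0.5, 0.0, 0.0, 0.5], [0.0, 2.0, 1.0, 0.0]]

# One input, three views: every token gets its own Q, K and V row.
Q = matmul(x, W_Q)
K = matmul(x, W_K)
V = matmul(x, W_V)

print(Q[1])  # Query row for "cat": [1.0, 1.0, 1.0, 1.0]
print(V[1])  # Value row for "cat": [0.5, 2.0, 1.0, 0.5]
```

The same embedding row for "cat" yields three different vectors, one per role, purely because the three weight matrices differ.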

Why split a single embedding into three views? Because the three roles are fundamentally different. The thing a verb wants to look for in its subject is different from the thing the verb wants to advertise to its own dependents, and different again from the actual information the verb contributes downstream. Bundling all three into one vector would force the model to use the same direction in embedding space for three jobs at once. Three matrices = three jobs, cleanly separated.

A subtle point about dimensions. In the original Transformer, the input embedding is 512-dimensional (d_model = 512) and each of Q, K, V is also 512-dimensional in total. In multi-head attention (the form actually used in practice, covered in the Transformer primer), Q, K, and V are each split into h heads of dimension d_k = d_model / h, and each head computes its own attention in parallel; the original model used h = 8, so d_k = 64. For this primer we'll stick to single-head attention with a tiny d_k = 4 so the matrices stay readable.
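To make the head arithmetic concrete, here is a small sketch of the split for one token's vector. The function name `split_heads` and the placeholder values are assumptions for illustration; the sizes (d_model = 512, h = 8) are those of the original Transformer's base model.

```python
def split_heads(vector, num_heads):
    """Split one d_model-sized vector into num_heads chunks of size d_k."""
    d_model = len(vector)
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_k = d_model // num_heads
    return [vector[i * d_k:(i + 1) * d_k] for i in range(num_heads)]

# Placeholder values standing in for one token's 512-dim query vector.
q = [float(i) for i in range(512)]
heads = split_heads(q, num_heads=8)

print(len(heads), len(heads[0]))  # 8 64
```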

The running example. Throughout this primer we'll use the same toy sentence: “the cat sat”. Three tokens, four dimensions per Q/K/V vector. That's 12 numbers per matrix, small enough to write on the back of an envelope, big enough to show every step of attention end-to-end.

The Score Matrix: Q · Kᵀ

Every Query meets every Key. The dot product measures how well they fit.

Now the magic step. For every token i, we want a number that says “how strongly does i want to look at every other token j?” The way to get that number is the simplest measure of similarity between two vectors in linear algebra: the dot product.

S[i, j] = Q[i] · K[j]. Take the Query row for token i, take the Key row for token j, multiply element-wise, sum. That single number is the attention score. Big positive score = good match. Near zero = no match. Negative = mismatch.
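A tiny sketch of a single score, using made-up 4-dim vectors, shows the three cases (match, no match, mismatch). The vectors and the helper name `dot` are hypothetical.

```python
def dot(u, v):
    """Multiply element-wise, then sum: the attention score for one pair."""
    return sum(a * b for a, b in zip(u, v))

query = [1.0, 1.0, 0.0, 0.0]

print(dot(query, [2.0, 1.0, 0.0, 0.0]))    # 3.0  key points the same way: good match
print(dot(query, [0.0, 0.0, 1.0, 1.0]))    # 0.0  key is orthogonal: no match
print(dot(query, [-1.0, -2.0, 0.0, 0.0]))  # -3.0 key points the other way: mismatch
```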

Every token's Query is dot-producted with every token's Key. The 3×3 result S[i,j] = how strongly token i wants to look at token j.

Doing this for every pair (i, j) gives an n × n matrix, the score matrix S. In matrix form that's the beautifully compact S = Q · Kᵀ, where Q is n × d_k (queries stacked as rows), Kᵀ is d_k × n (keys stacked as columns), and the product is n × n.

On our toy example with 3 tokens, the score matrix is 3 × 3. Row i is “what token i is looking at.” In the demo, row “cat” reads [the=2, cat=4, sat=6]: cat is mildly interested in “the,” somewhat interested in itself, and most interested in “sat.” This is exactly what a healthy language model should produce — a verb-aware noun should attend strongly to its verb.
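The full score matrix can be sketched as follows. The primer only states the "cat" row of S, so the Q and K values below are hypothetical, chosen so that the "cat" row comes out as [2, 4, 6]; the function name `score_matrix` is likewise an assumption.

```python
def score_matrix(Q, K):
    """S[i][j] = Q[i] . K[j] for every pair of tokens (i, j)."""
    return [[sum(q * k for q, k in zip(q_row, k_row)) for k_row in K]
            for q_row in Q]

tokens = ["the", "cat", "sat"]

# Hypothetical Query and Key rows (3 tokens, d_k = 4).
Q = [[1, 0, 1, 0],   # Q[the]
     [1, 1, 1, 1],   # Q[cat]
     [0, 1, 0, 1]]   # Q[sat]
K = [[1, 1, 0, 0],   # K[the]
     [1, 1, 1, 1],   # K[cat]
     [2, 2, 1, 1]]   # K[sat]

S = score_matrix(Q, K)
for token, row in zip(tokens, S):
    print(token, row)
# the [1, 2, 3]
# cat [2, 4, 6]
# sat [1, 2, 3]
```

Iterating over the rows of K directly is the same as multiplying by Kᵀ: each key row plays the part of one column of the transposed matrix.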

Why dot product? Several reasons. First, it's the cheapest meaningful similarity measure: one fused multiply-add per dimension. Second, it's differentiable, so gradients flow through it cleanly. Third, because Q and K come from learned projections, the model is not stuck with raw embedding similarity: the score is x_i · W_Q · W_Kᵀ · x_jᵀ, a learned bilinear similarity, which makes the dot product in the projected space wonderfully flexible. Fourth, dot products map onto matrix multiplication, which GPUs eat for breakfast.

Cost. This is the famous O(n²) in “attention is quadratic.” For a sequence of length n = 1000, the score matrix has a million entries. For n = 32,000 (a long context), it's a billion. Memory and compute scale with n². Everything in efficient attention — flash attention, sparse attention, sliding window — is about not materializing this matrix in full.
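The quadratic growth is easy to check with a little arithmetic. The sketch below counts entries in one score matrix and estimates its size, assuming 2 bytes per entry (a 16-bit float); that storage size and the function name `score_matrix_cost` are assumptions, and a real model holds one such matrix per head and per layer.

```python
def score_matrix_cost(n, bytes_per_entry=2):
    """Return (entries, bytes) for one n x n score matrix."""
    entries = n * n
    return entries, entries * bytes_per_entry

for n in (1_000, 32_000):
    entries, size_bytes = score_matrix_cost(n)
    print(f"n={n}: {entries:,} entries, {size_bytes / 1e9:.3f} GB")
# n=1000: 1,000,000 entries, 0.002 GB
# n=32000: 1,024,000,000 entries, 2.048 GB
```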

Scale and Softmax: From Scores to Weights

Raw scores can be any size. Two simple operations turn them into a probability distribution.

The score matrix tells us how strongly each token wants to look at each other token, but the numbers are unbounded — they could be 3, or 300, or −5. To use them as weights for a sum, we want them positive and we want them to sum to 1. Two operations get us there.

Scale by √d_k. Divide every score by the square root of the key dimension. With d_k = 4, that's a division by 2. With d_k = 64 (a real Transformer head), a division by 8. Why? If the components of a query and a key are roughly independent with zero mean and unit variance, their dot product has variance d_k, so raw scores grow with the dimension and push softmax into saturated regions where gradients vanish. Dividing by √d_k brings the variance back to about 1, whatever d_k is. Small fix, huge stability win.
Softmax. The same softmax from the probability primer. For each row independently: exp(x_i) / Σ exp(x_j). The output of each row is a probability distribution — non-negative, sums to 1.
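The variance argument behind the √d_k divisor can be checked empirically. The sketch below assumes query and key components are independent standard normal draws, which is an idealization rather than a property of trained models; the function name, sample count, and seed are arbitrary choices for illustration.

```python
import math
import random

def dot_product_variance(d_k, samples=5000, seed=0):
    """Estimate the variance of q . k before and after dividing by sqrt(d_k)."""
    rng = random.Random(seed)
    raw, scaled = [], []
    for _ in range(samples):
        # One random query/key pair with unit-variance components.
        score = sum(rng.gauss(0.0, 1.0) * rng.gauss(0.0, 1.0) for _ in range(d_k))
        raw.append(score)
        scaled.append(score / math.sqrt(d_k))

    def variance(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / len(values)

    return variance(raw), variance(scaled)

for d_k in (4, 64):
    raw_var, scaled_var = dot_product_variance(d_k)
    print(f"d_k={d_k}: raw variance ~ {raw_var:.1f}, scaled variance ~ {scaled_var:.2f}")
```

Under these assumptions the raw variance should land near d_k and the scaled variance near 1 for both sizes.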

Divide by √d_k to keep raw scores small (so softmax doesn't saturate), then softmax turns them into a probability distribution over tokens.

Put together: A = softmax(Q · Kᵀ / √d_k). That single line is the entire scaled dot-product attention formula except for one final step (the weighted sum, next section). A is the attention matrix — a row of weights for each token, telling it how to mix the values.

On our running example, the row for “cat” ends up roughly [the=0.09, cat=0.24, sat=0.67]. Read those as percentages: cat puts 67% of its attention on sat, 24% on itself, and 9% on the. The softmax sharpened the originally 2-to-4-to-6 score ratio into a much more decisive 9-24-67 weighting, because exp() rewards the biggest scores disproportionately.
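The numbers for the "cat" row can be reproduced directly. This sketch uses only the scores given above, [2, 4, 6], and d_k = 4; the helper name `softmax` is an assumption, and it subtracts the row maximum before exponentiating, which does not change the result.

```python
import math

def softmax(row):
    """Turn a row of scores into non-negative weights that sum to 1."""
    m = max(row)                            # subtracting the max leaves the result unchanged
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 4
cat_scores = [2.0, 4.0, 6.0]                # the=2, cat=4, sat=6

scaled = [s / math.sqrt(d_k) for s in cat_scores]   # [1.0, 2.0, 3.0]
weights = softmax(scaled)

print([round(w, 2) for w in weights])               # [0.09, 0.24, 0.67]
print([round(w, 2) for w in softmax(cat_scores)])   # without scaling: [0.02, 0.12, 0.87]
```

The second line shows what the scaling step buys: without the division by √d_k, the same scores give a noticeably more lopsided distribution.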

One more wrinkle: masking. For language modeling, a token shouldn't be able to attend to future tokens — that would be cheating during training. The trick: before softmax, add a large negative number (often −∞) to the entries we want to forbid. exp(−∞) = 0, so those positions get exactly zero weight. This is the causal mask, and it's the difference between an encoder (no mask, every token sees the whole sequence — used by BERT) and a decoder (causal mask, each token sees only the past — used by GPT).
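A causal mask can be sketched as follows. The scaled scores are hypothetical (the "cat" row matches the running example; the other rows are made up), and the function name `causal_softmax` is an assumption.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where token i may only attend to positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        # Forbid future positions by replacing their scores with -inf.
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        m = max(masked)                      # finite: position i itself is never masked
        exps = [math.exp(s - m) for s in masked]   # exp(-inf) is exactly 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical scaled scores for "the", "cat", "sat".
scaled_scores = [[0.5, 1.0, 1.5],
                 [1.0, 2.0, 3.0],
                 [0.5, 1.0, 1.5]]

A = causal_softmax(scaled_scores)
for row in A:
    print([round(w, 2) for w in row])
```

Under the mask, "the" can only attend to itself, and "cat" can no longer attend to "sat": its weight is redistributed over "the" and "cat". Only the last token sees the whole sentence.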

Numerical stability. Real implementations don't compute softmax as exp(x) / Σ exp(x) directly — for large scores that overflows. Instead they subtract the row max first: softmax(x) = softmax(x − max(x)), which is mathematically identical but stays in finite floating-point range. Every library does this; you almost never have to think about it.
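The overflow and its fix can be seen with deliberately large scores. This is a sketch with hypothetical values and function names; in Python's `math` module the overflow surfaces as an `OverflowError`, whereas array libraries typically produce infinities and then NaN instead.

```python
import math

def naive_softmax(row):
    """Textbook formula: overflows once a score is large enough."""
    exps = [math.exp(x) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(row):
    """Same result, but the largest exponent is always exp(0) = 1."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

big_scores = [1000.0, 1001.0, 1002.0]

try:
    naive_softmax(big_scores)
except OverflowError as err:
    print("naive softmax failed:", err)

print([round(w, 2) for w in stable_softmax(big_scores)])  # [0.09, 0.24, 0.67]
```

Shifting every score by the same constant does not change the softmax, so these large scores give the same weights as [1, 2, 3].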

The Weighted Sum of V

Take the attention weights, take the Value rows, mix. The output is a contextualized token.

We finally have everything. For each token i, attention has produced a row of weights summing to 1, telling i how much of every other token to mix in. The last step is to actually do the mixing — and what gets mixed are the Value vectors.

output[i] = Σ A[i, j] · V[j]. For the “cat” row, that's 0.09 · V[the] + 0.24 · V[cat] + 0.67 · V[sat]. The result is a new 4-dim vector — the same shape as the original V row, but blended from all the values according to attention.
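The mixing step for the "cat" row can be sketched like this. The weights are the rounded values from the running example; the Value rows are hypothetical, picked so the result is easy to read, and `weighted_sum` is an illustrative name.

```python
def weighted_sum(weights, V):
    """output = sum over j of weights[j] * V[j], computed column by column."""
    d_v = len(V[0])
    return [sum(w * v_row[c] for w, v_row in zip(weights, V)) for c in range(d_v)]

cat_weights = [0.09, 0.24, 0.67]   # attention of "cat" on the, cat, sat

# Hypothetical Value rows: one distinct slot per token, plus a shared last entry.
V = [[1.0, 0.0, 0.0, 2.0],   # V[the]
     [0.0, 1.0, 0.0, 2.0],   # V[cat]
     [0.0, 0.0, 1.0, 2.0]]   # V[sat]

out = weighted_sum(cat_weights, V)
print([round(v, 2) for v in out])  # [0.09, 0.24, 0.67, 2.0]
```

The first three entries show each token's contribution in proportion to its weight. The last entry comes through unchanged at 2.0 because every Value row agrees on it and the weights sum to 1.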

Each token's V row is scaled by its softmax weight, then summed. The result is a context-aware vector — "cat" now carries information about "sat".

In matrix form, the entire self-attention operation is a single line:

Attention(Q, K, V) = softmax( Q · Kᵀ / √d_k ) · V

That's it. That's the whole thing. Q and K are each n × d_k, and V is n × d_v (in our example d_v = d_k = 4). The score matrix Q · Kᵀ is n × n. After scale + softmax we get the n × n attention matrix. Multiplying by V gives the output, again n × d_v: a new representation for each of the n tokens. Two matrix multiplies and one softmax (five multiplies if you count the three projections that produce Q, K, and V). That single formula, with multi-head, residual connections, layer norm, and an MLP wrapped around it, is the Transformer.
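Putting the steps together, a minimal single-head sketch in plain Python might look like the following. The Q, K, and V values are hypothetical (chosen so the "cat" scores are [2, 4, 6]), the function name `attention` is an assumption, and a real implementation would use batched tensor operations rather than lists.

```python
import math

def attention(Q, K, V):
    """softmax(Q . K^T / sqrt(d_k)) . V for one head, no mask."""
    d_k = len(K[0])
    d_v = len(V[0])
    output = []
    for q in Q:
        # Scaled scores of this query against every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Row-wise softmax (max subtracted for numerical stability).
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the Value rows.
        output.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(d_v)])
    return output

# Hypothetical toy matrices for "the", "cat", "sat".
Q = [[1, 0, 1, 0], [1, 1, 1, 1], [0, 1, 0, 1]]
K = [[1, 1, 0, 0], [1, 1, 1, 1], [2, 2, 1, 1]]
V = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]

out = attention(Q, K, V)
print([round(v, 2) for v in out[1]])  # "cat" row: [0.09, 0.24, 0.67, 0.0]
```

Because each Value row here marks a single slot, the output row for "cat" simply exposes its attention weights; with realistic Value vectors the same weights would blend real content.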

What just happened to “cat”. Before attention, the representation of “cat” was based only on the embedding of the word “cat” — a static identity. After attention, its representation is roughly 0.67 · V[sat] plus smaller contributions from itself and “the.” In other words: “cat” now knows that the verb is “sat,” and it carries that information forward. This is what people mean when they say attention gives you contextual representations. The static-embeddings limitation we hit at the end of the embeddings primer? Resolved.

And every token at once. The example walked through “cat,” but every row of the score matrix and every row of the attention matrix is computed independently of the others. The output is computed for all n tokens in parallel; that's the point of expressing this as matrix multiplies. RNNs were inherently sequential (token t depended on t − 1's hidden state). Attention is inherently parallel across positions (every position in a layer is computed in one shot). This is a major reason Transformers train fast on GPUs.

Where this leaves us. One layer of self-attention takes a sequence of n input vectors and produces a sequence of n output vectors of the same length, each contextualized by the rest of the sequence. Stack 12 of these (separated by little feedforward networks) and you have the smallest GPT-2; stack 96 and you have the 175-billion-parameter GPT-3. That's the Transformer in one sentence. The next primer, the main course, fills in the “little feedforward networks,” the residual scaffolding, the multi-head structure, and the layer norm. But the operation at the heart of it all is the one we just walked through.