Positional Encoding Primer

Self-attention treats a sequence as a set: shuffle the tokens and the attention output for each one is unchanged. Language is not a set — word order matters. Transformers fix this by injecting position information into each token. Four short topics: why attention is order-blind; sinusoidal encoding (the original Transformer paper); learned position embeddings (GPT-2); and RoPE — the rotary scheme that Llama, Qwen, DeepSeek, and most modern open LLMs all use.

Why Attention Is Order-Blind

Shuffle the input tokens, get the shuffled outputs. Attention sees a bag of vectors, not a sequence.

We've been talking as if attention “knows” that cat comes before sat. Look back at the formula softmax(Q · Kᵀ / √d_k) · V. Notice what isn't in it: the position index. The dot product Q[i]·K[j] uses the contents of the two tokens. It doesn't use where they are in the sentence. As a result, attention has a property called permutation equivariance: if you shuffle the input tokens, the output tokens come out in the same shuffled order, but the actual vectors are unchanged.

Without position information, self-attention is permutation-equivariant: feed the same tokens in any order, get the same outputs (just shuffled). Order is invisible.

Concretely on “the cat sat”: if you give the model “cat sat” or “sat cat”, attention produces the exact same pair of output vectors — just associated with different tokens. For a language model, this is catastrophic. “Dog bites man” and “man bites dog” have to produce different representations, or the model can't do anything useful.
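
A minimal NumPy sketch to check this numerically. Everything in it is an illustrative assumption rather than part of any real model: random weights, a toy width of 8, a single head, no causal mask, and the helper names softmax and self_attention.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax, shifted by the row max for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    # Single-head, unmasked scaled dot-product attention.
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(0)
seq_len, d = 3, 8                      # e.g. "the cat sat", toy width
x = rng.normal(size=(seq_len, d))      # stand-in token embeddings
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

perm = np.array([2, 0, 1])             # shuffle the tokens
out = self_attention(x, W_q, W_k, W_v)
out_shuffled = self_attention(x[perm], W_q, W_k, W_v)

# Same vectors, just reordered: out_shuffled[i] equals out[perm[i]].
print(np.allclose(out_shuffled, out[perm]))
```

It should print True: shuffling the inputs only shuffles the outputs, and no output vector changes.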

The simple fix. Before the input goes into attention, add a vector that depends on each token's position. So the input to attention is

x[i] = token_embedding(token_i) + position_encoding(i)

Now the input to attention is no longer pure token content — it carries the position too, baked into the vector. The Q, K, V projections that follow inherit that information. Even though attention itself never reads a position index, the content it sees at each slot already encodes where in the sequence that slot is.
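
As a rough sketch of that sum, here is the same idea with toy lookup tables. The vocabulary, the sizes, and the random pos_table are placeholders chosen for illustration; a real model would use one of the position encodings described in the following sections.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "cat": 1, "sat": 2}   # toy vocabulary (illustrative)
d_model, max_len = 8, 4

token_table = rng.normal(size=(len(vocab), d_model))  # stand-in token embeddings
pos_table = rng.normal(size=(max_len, d_model))       # placeholder position vectors

def encode(words):
    # x[i] = token_embedding(token_i) + position_encoding(i)
    ids = np.array([vocab[w] for w in words])
    positions = np.arange(len(words))
    return token_table[ids] + pos_table[positions]

a = encode(["cat", "sat"])   # "cat" at position 0
b = encode(["sat", "cat"])   # "cat" at position 1

# Same token, different position: the vectors fed to attention now differ.
print(np.allclose(a[0], b[1]))
```

This should print False. The two "cat" vectors share their token part and differ only in the position part that was added.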

What makes a good positional encoding? A few properties people generally want:

Unique per position. No two positions should look the same — the encoding has to be informative.
Bounded values. If the encoding values grow unboundedly with position, training becomes unstable.
Distance-aware. Ideally, the encoding makes nearby positions look similar and distant positions look different — the model should be able to learn “3 tokens ago” without having to memorize that abstract idea from scratch for every absolute pair.
Extrapolation. If trained on sequences up to 1024, can the model handle a sequence of 2048 at inference time? This turns out to be the property that most distinguishes the three approaches we'll cover.

The next three sections are three answers to “how do we get a useful positional encoding” — sinusoidal (clever math, no learned parameters), learned (just look it up), and RoPE (the modern compromise that gets the best of both).

Sinusoidal Encoding (the Original Transformer)

A hand-designed pattern of sines and cosines at many frequencies. Zero learned parameters.

The 2017 “Attention Is All You Need” paper proposed a fixed, non-learned positional encoding made entirely of sines and cosines. The formula is:

PE(pos, 2i)   = sin( pos / 10000^(2i / d_model) )
PE(pos, 2i+1) = cos( pos / 10000^(2i / d_model) )

For every position pos ∈ [0, max_len) and every dimension d ∈ [0, d_model), this gives one number — sin or cos of pos · freq, where the frequency depends on the dimension index. Pairs of dimensions (2i, 2i+1) share a frequency: one is the sin, the other the cos.

Each (pos, dim) cell is sin or cos at a frequency that depends on the dim. Pairs of adjacent dims share a frequency (sin / cos). Low-index columns oscillate fast; high-index columns oscillate slowly.

The key insight is the multi-scale frequency. Low-index dimensions oscillate fast — they distinguish adjacent positions. High-index dimensions oscillate slowly — they distinguish positions only over long stretches. Together, the full d_model-dim vector at each position is a unique “multi-frequency signature” — like binary encoding for integers, but with sines and cosines instead of bits.
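
The formula translates almost directly into NumPy. This is a minimal sketch, not the paper's reference code; the function name sinusoidal_encoding, the base argument, and the sizes are choices made here, and d_model is assumed to be even.

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model, base=10000.0):
    # Returns a (max_len, d_model) table; d_model is assumed even.
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]    # 0, 2, 4, ... is the "2i" in the formula
    angles = pos / base ** (two_i / d_model)     # (max_len, d_model / 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dims: sin
    pe[:, 1::2] = np.cos(angles)                 # odd dims: cos
    return pe

pe = sinusoidal_encoding(max_len=64, d_model=16)
print(pe.shape)          # (64, 16)
print(pe[0, :4])         # position 0: sin(0) = 0 and cos(0) = 1, alternating

# The fastest pair (dims 0, 1) changes a lot between adjacent positions;
# the slowest pair (last two dims) barely moves.
print(abs(pe[1, 0] - pe[0, 0]), abs(pe[1, -2] - pe[0, -2]))
```

The last line makes the multi-scale behaviour visible: the first number is sin(1), roughly 0.84, while the second is on the order of 1e-4.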

Why this particular formula? Two nice properties:

Linear combination encodes relative position. Because of the trig identity sin(a + b) = sin a cos b + cos a sin b, the encoding at position pos + k can be written as a fixed linear transformation of the encoding at pos. So the model can, in principle, learn a layer that “shifts attention by k positions” using a single weight matrix.
No max length. The formula is well-defined for any position, including positions much larger than anything seen during training. In practice, sinusoidal encodings extrapolate to longer sequences somewhat — though not as well as you'd hope, since the model still has to learn to interpret far-out positions.
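
The first property can be checked numerically. The sketch below builds, for a fixed offset k, a block-diagonal matrix out of 2×2 blocks (one per frequency) and confirms that it maps the encoding at pos to the encoding at pos + k for several values of pos. The helper names pe_vector and shift_matrix are made up for this example.

```python
import numpy as np

def pe_vector(pos, d_model, base=10000.0):
    # Sinusoidal encoding of a single position (d_model assumed even).
    freqs = 1.0 / base ** (np.arange(0, d_model, 2) / d_model)
    v = np.empty(d_model)
    v[0::2] = np.sin(pos * freqs)
    v[1::2] = np.cos(pos * freqs)
    return v

def shift_matrix(k, d_model, base=10000.0):
    # Block-diagonal matrix M_k with PE(pos + k) = M_k @ PE(pos).
    # It depends on the offset k only, never on pos.
    freqs = 1.0 / base ** (np.arange(0, d_model, 2) / d_model)
    M = np.zeros((d_model, d_model))
    for i, w in enumerate(freqs):
        c, s = np.cos(k * w), np.sin(k * w)
        M[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, s], [-s, c]]
    return M

d_model, k = 16, 5
M = shift_matrix(k, d_model)
for pos in (0, 7, 300):
    shifted = M @ pe_vector(pos, d_model)
    print(pos, np.allclose(shifted, pe_vector(pos + k, d_model)))
```

Each 2×2 block is just the angle-addition identities for sin and cos written as a matrix, which is why a single weight matrix is enough for a given offset.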

How it's used. The encoding is just added to the token embedding before the input enters the first attention layer:

x[pos] = embed(token_pos) + PE[pos]    // same shape, element-wise sum

That's the entire mechanism. Then attention proceeds as usual. The position information rides along inside each x[pos] vector. There's no special handling inside the attention computation — it stays the same order-blind operation, but the input vectors are no longer order-blind.

The catch. Adding position to content is a slightly crude thing to do. It assumes the model can disentangle “the cat-ness of this vector” from “the position-3-ness of this vector” after they've been summed. In practice it works, but it's the kind of design choice that newer methods (RoPE, ALiBi) try to improve on.

Learned Position Embeddings (GPT-2)

Give every position its own embedding row. Let the optimizer figure out what each row should be.

BERT, GPT-2, and many other early Transformers ditched the sinusoidal formula and used a much simpler approach: treat positions exactly like tokens. Create a learned embedding table of size (max_len × d_model), where each row is one position's encoding. Look it up by position index. Done.

P = nn.Embedding(max_len, d_model)   // learned parameters
x[pos] = embed(token_pos) + P[pos]   // add it in, same as sinusoidal

Same idea as word embeddings: nn.Embedding(max_len, d_model). Each position has its own learned vector. Simple to implement. Can't extrapolate past max_len.

The data structure is identical to the word embedding table — same call, different vocabulary (positions instead of tokens). At training time, gradient descent shapes each row to encode whatever it is about position p that's useful for the language modeling task. The model figures out the “language of positions” on its own, no hand-crafted formula required.
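
To make the lookup and the training signal concrete, here is a small NumPy stand-in for the embedding table (plain arrays instead of nn.Embedding, so it runs without PyTorch). The toy loss 0.5 * ||x||^2, the learning rate, and the sizes are assumptions picked only so the gradient can be written by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
max_len, d_model = 8, 4
P = rng.normal(scale=0.02, size=(max_len, d_model))   # position table, random init

def add_positions(token_vecs, P):
    # token_vecs: (seq_len, d_model); row i gets position row P[i] added.
    positions = np.arange(token_vecs.shape[0])
    return token_vecs + P[positions]

tokens = rng.normal(size=(5, d_model))     # a 5-token sequence
x = add_positions(tokens, P)

# One toy SGD step for the loss 0.5 * ||x||^2: the gradient w.r.t. P[i] is x[i],
# so only rows 0..4 (the positions actually used) receive an update.
P_new = P.copy()
P_new[:5] -= 0.1 * x
print(np.abs(P_new - P).sum(axis=1) > 0)   # rows 5..7 are untouched
```

Rows for positions that never appear in a batch receive no gradient, which is the flip side of letting the optimizer decide what each row means.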

What it gets right. Two things, both substantial.

Simplicity. Anyone who has implemented word embeddings has already implemented position embeddings. No trig, no frequency schedules, no need to think.
Best fit for the task. The model isn't forced to accept the designer's assumption that positions need to look like sines and cosines. It can learn whatever positional pattern is most useful, given its data.

What it gets wrong. One catastrophic thing, and one smaller thing.

No extrapolation, full stop. The table has exactly max_len rows. If your model was trained with max_len = 1024 and you give it a 1025-token sequence at inference time, there is no row to look up for the last token. Position 1024 (zero-indexed) has no row in the table, so it was never trained, and indexing it is an out-of-bounds error at training time and inference time alike. This single property, the inability to handle sequences longer than the table built for training, is what pushed the field away from learned positional embeddings.
No built-in distance structure. Position 100 and position 101 are adjacent. The sinusoidal formula makes their encodings nearly identical for low-frequency dimensions. The learned table just has two arbitrary rows — there's nothing in the parameterization that says they should be similar. In practice the model usually learns smooth-ish rows, but it's an artifact of training, not a built-in inductive bias.
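
The extrapolation failure is easy to reproduce. In this sketch a zero-filled NumPy array stands in for a trained table, and lookup is a hypothetical helper; the point is only what happens when a sequence needs one more row than the table has.

```python
import numpy as np

max_len, d_model = 1024, 16
P = np.zeros((max_len, d_model))      # stand-in for a trained position table

def lookup(P, seq_len):
    # Returns position rows 0..seq_len-1; fails if seq_len exceeds the table.
    return P[np.arange(seq_len)]

print(lookup(P, 1024).shape)          # fits: positions 0..1023
try:
    lookup(P, 1025)                   # needs position 1024, which has no row
except IndexError as err:
    print("IndexError:", err)
```

Nothing about the model's weights matters here. The failure happens at the lookup, before attention ever runs.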

Who uses it today? Not many. BERT and GPT-2 used it, and plenty of fine-tunes inherit it. But every model that wants to handle long context — which now means almost every serious LLM — has moved on to either relative position encodings or RoPE.

RoPE — Rotary Position Embedding

Don't add position. Rotate Q and K by an angle that encodes position. The dot product naturally becomes relative.

Both previous methods add a position vector to the token embedding before anything else happens. RoPE (Su et al., 2021) takes a completely different angle: don't touch the token embedding at all. Instead, apply a rotation to Q and K inside the attention computation. The rotation angle depends on the token's position. That's it.

Specifically: treat each consecutive pair of dimensions (2i, 2i+1) of Q (and similarly K) as a 2-D vector. For a token at position p, rotate this 2-D vector by angle p · θ_i, where the per-pair angular frequency is

θ_i = 1 / 10000^(2i / d_k)

Here d_k is the per-head dimension of Q and K, the vectors actually being rotated, rather than d_model. Apart from that, it is the same multi-frequency schedule as sinusoidal, repurposed. The rotation matrix for one pair is just the standard 2-D rotation:

[q'_2i  ]   [cos(p·θ_i)  -sin(p·θ_i)] [q_2i  ]
[q'_2i+1] = [sin(p·θ_i)   cos(p·θ_i)] [q_2i+1]

Each (2i, 2i+1) pair of dimensions is treated as a 2D vector and rotated by angle p · θ_i. After rotation, ⟨Q_m, K_n⟩ depends only on (m − n) — relative position is baked into attention scores.
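
A minimal NumPy sketch of the rotation, using the interleaved (2i, 2i+1) pairing described above. The function name rope, the base argument, and the toy sizes are assumptions for illustration; d is whatever width the rotated vectors have and must be even.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    # x: (seq_len, d) float array with d even; positions: (seq_len,) integers.
    # Rotates each pair (x[:, 2i], x[:, 2i+1]) by the angle positions * theta_i.
    d = x.shape[-1]
    theta = 1.0 / base ** (np.arange(0, d, 2) / d)      # (d/2,)
    angles = positions[:, None] * theta[None, :]        # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))

def score(m, n):
    # Dot product of q placed at position m and k placed at position n.
    return (rope(q, np.array([m])) @ rope(k, np.array([n])).T).item()

print(np.isclose(score(3, 1), score(103, 101)))   # same offset m - n = 2
print(np.isclose(score(3, 1), score(3, 2)))       # different offset
```

The first comparison should come out True, because both placements have the same offset. The second uses a different offset, so the scores are expected to differ.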

Why this matters. The dot product of two rotated 2-D vectors has a beautiful property: rotating both by the same angle leaves their dot product unchanged, and rotating them by different angles changes the dot product as a function of the difference between the two angles. So if Q is at position m and K is at position n:

⟨ rot(Q, m·θ) , rot(K, n·θ) ⟩  =  some function of (m − n)·θ

After RoPE, position enters the attention scores only through the offset m − n, never through m or n individually. The model never has to learn what “position 4378” means in isolation; it only ever sees position differences. This is exactly the inductive bias the original Transformer's designers hoped for, now baked into the math instead of left for the model to discover.
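
For one pair of dimensions the claim can be worked out in a few lines. In the derivation below, R(α) denotes the 2-D rotation matrix by angle α, and q and k stand for the 2-D slices (q_2i, q_2i+1) and (k_2i, k_2i+1); this notation is introduced here for the derivation only.

```latex
\begin{aligned}
\langle R(m\theta_i)\,q,\; R(n\theta_i)\,k \rangle
  &= q^{\top} R(m\theta_i)^{\top} R(n\theta_i)\, k \\
  &= q^{\top} R\big((n-m)\theta_i\big)\, k \\
  &= (q_{2i} k_{2i} + q_{2i+1} k_{2i+1}) \cos\big((m-n)\theta_i\big)
   + (q_{2i} k_{2i+1} - q_{2i+1} k_{2i}) \sin\big((m-n)\theta_i\big)
\end{aligned}
```

The second step uses the fact that the transpose of a rotation is the rotation by the negative angle, and that rotation angles add. The full attention score is the sum of this expression over all pairs i, so m and n appear only through m − n.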

Why everyone uses RoPE now. Several reasons:

Relative positions for free. No extra parameters and almost no extra computation: just an element-wise rotation applied to Q and K. Each attention head automatically sees relative positions in its scores.
Long-context generalization. Because there's no fixed lookup table, RoPE works in principle at any position. With small interpolation tricks (Position Interpolation, NTK-aware scaling, YaRN), models trained on around 4K tokens have been extended to 32K, 128K, or more tokens with comparatively short fine-tuning runs.
Works on Q and K, not on V. A subtle but useful asymmetry: V is unrotated, so the “content” that gets mixed into outputs has no position information directly. Position only affects who attends to whom, not what gets passed along. This separation is cleaner than adding position into the value path.
Empirically strong. The 2021 RoPE paper showed modest gains, but the long-context era made it dominant: most widely used open LLMs released since 2022 use it. Llama 1, 2, 3. Qwen. Mistral. DeepSeek. Gemma. Google's PaLM, though not an open model, uses it too. The list goes on.
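
To give a feel for the long-context tricks mentioned above, here is a sketch of the simplest one, Position Interpolation, assuming its basic linear form: positions are multiplied by train_len / target_len so that a longer sequence is squeezed into the position range seen during training. The function names and lengths are illustrative, and the other listed methods, which adjust the frequency schedule in more elaborate ways, are not shown.

```python
import numpy as np

def interpolated_positions(seq_len, train_len, target_len):
    # Linear Position Interpolation (assumed basic form): rescale positions
    # so that 0..target_len-1 maps into the trained range [0, train_len).
    return np.arange(seq_len) * (train_len / target_len)

def rope_angles(positions, d, base=10000.0):
    # Angle per (position, pair): position * theta_i, as in plain RoPE.
    theta = 1.0 / base ** (np.arange(0, d, 2) / d)
    return positions[:, None] * theta[None, :]

train_len, target_len, d = 4096, 32768, 64

plain = np.arange(target_len, dtype=float)
squeezed = interpolated_positions(target_len, train_len, target_len)

print(plain.max(), squeezed.max())     # 32767.0 4095.875
angles = rope_angles(squeezed, d)      # these would feed the Q/K rotation
print(angles.shape)                    # (32768, 32)
```

The rescaled positions are fractional, which a rotation handles without trouble. A lookup table, by contrast, has no row 0.125.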

The full picture, one more time. Inside each attention layer:

Q = x · W_Q,   K = x · W_K,   V = x · W_V
Q ← RoPE(Q, positions)
K ← RoPE(K, positions)
A = softmax( Q · Kᵀ / √d_k )
out = A · V

Two added lines, no new parameters. The result is an attention block that understands relative position natively, has no max-length limit baked in, and extrapolates gracefully when nudged. That's why nearly every modern open LLM uses it.
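
Put together, the five lines above might look like the following in NumPy. This is a single-head, unmasked sketch with random weights. The names rope and rope_attention and all sizes are assumptions for illustration, and real implementations work on batches and multiple heads.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax, shifted by the row max for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def rope(x, positions, base=10000.0):
    # Rotate each pair (2i, 2i+1) of x by positions * theta_i; last dim must be even.
    d = x.shape[-1]
    theta = 1.0 / base ** (np.arange(0, d, 2) / d)
    ang = positions[:, None] * theta[None, :]
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * np.cos(ang) - x[:, 1::2] * np.sin(ang)
    out[:, 1::2] = x[:, 0::2] * np.sin(ang) + x[:, 1::2] * np.cos(ang)
    return out

def rope_attention(x, W_q, W_k, W_v, positions):
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    Q = rope(Q, positions)          # the two added lines
    K = rope(K, positions)
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return A @ V                    # V is not rotated

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 16, 8
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

pos = np.arange(seq_len)
out = rope_attention(x, W_q, W_k, W_v, pos)
out_shifted = rope_attention(x, W_q, W_k, W_v, pos + 1000)
print(np.allclose(out, out_shifted))      # shifting every position changes nothing

perm = np.array([1, 0, 3, 2])             # reorder tokens, keep positions 0..3
out_perm = rope_attention(x[perm], W_q, W_k, W_v, pos)
print(np.allclose(out_perm, out[perm]))   # order now matters
```

The first check should print True, since only offsets between positions enter the scores. The second is expected to print False: unlike plain attention, reordering the tokens now changes the output vectors themselves.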

Where we are. Nineteen primers in, we have built up the complete data path for one attention block in a modern Transformer: tokenization → embedding → RoPE-rotated Q/K split across heads → scaled dot-product attention → concat + W_O. The next primer is the rest of the Transformer block — layer norm, residual connections, and the feedforward MLP that wraps around attention.