Linear Algebra Primer

The minimum linear algebra you need before reading any Transformer diagram. Nine short topics covering vectors, matrices, the arithmetic on them, dot products, matrix multiplication, cosine similarity, norm, and the transpose / reshape moves used everywhere in attention. No prior math beyond high-school algebra assumed.

Vector

An ordered list of numbers that points somewhere.

You already know what a vector is — you just haven't called it that. When you give someone an address like (123 Main St, Apt 4B), you're giving an ordered list. The order matters: "Apt 4B at 123 Main St" makes sense, but rearrange the pieces and it's nonsense. The order encodes the meaning.

When you pick a color on a screen, (255, 0, 0) is bright red, but (0, 255, 0) is bright green. Same three numbers, different order, completely different color. Your GPS coordinates (37.78, −122.42) mean "San Francisco"; swap them and you get (−122.42, 37.78), which is not a valid location at all, because latitude only runs from −90 to 90. The order is the meaning.

A vector, in math, is just this same idea formalized: an ordered list of numbers where each position has a specific meaning. Each number is called a component; the count of components is the vector's dimension.

Textbook notation: a vector is written v = (v₁, v₂, …, vₙ) or v = [v₁, v₂, …, vₙ] — sometimes in bold (v). An n-dimensional vector of real numbers lives in ℝⁿ, written v ∈ ℝⁿ. Our example [3, 4] is in ℝ².

A 2-dimensional vector v = [3, 4]. Each cell is one component; the position of each cell matters.

But vectors aren't just labeled boxes — they have a geometric life. Treat each component as a coordinate, and the vector becomes a point in space. By convention, we then draw it as an arrow from the origin to that point.

From a list, to a point on a plane, to an arrow with direction and length.

The arrow representation reveals the magic. A vector is now a directional quantity: it carries both a direction (where it points) and a magnitude (how long it is). Walking "100 meters" is just a number — useless for finding you. Walking "100 meters east" is a vector — it tells someone exactly where you'll end up.

Once you start looking, vectors are everywhere in physics and daily life:

Wind: 20 mph from the southwest = a 2D velocity vector.
Force: pushing a couch 50 N toward the door = a 3D force vector.
RGB color: every pixel on your screen is a 3D vector (red, green, blue).
Your day: (hours_slept, hours_worked, hours_exercised) — a 3D vector summarizing today.
A song: 44,100 samples per second of audio = an N-dimensional vector (where N is huge).

A second vector lives on the same plane:

Two vectors, v = [3, 4] and w = [1, 2], in the same coordinate frame. Comparing them, adding them, scaling them: every operation in this primer works on pairs (or grids) of vectors like these.

Vectors really earn their keep at high dimensions. We're drawing 2D for clarity, but every operation in this primer — adding, scaling, dot product — generalizes to any number of dimensions. Linear algebra lets us reason geometrically about spaces we can't see.

In a Transformer: every token becomes a vector with hundreds to thousands of components. The word cat might become something like [0.42, −0.17, 0.88, …] (GPT-2 uses 768 numbers; the largest open models use more than 16,000). Dog gets a different vector with similar values. The model never "sees" the word; it only sees those numbers, and learns by finding patterns in them. Because cat and dog end up near each other in that high-dimensional space, the model treats them as related. Much of what an LLM knows about a token is encoded in where that token sits in this vector space.

Matrix

A 2D grid of numbers — or, equivalently, a stack of vectors.

Like vectors, you've used matrices before. Every spreadsheet is a matrix: a class roster with students as rows and test scores as columns; a weekly schedule with weekdays as rows and hours as columns. A digital photo zoomed all the way in is a matrix too, where each cell holds a pixel's brightness (or three matrices stacked for RGB).

Formally, a matrix is a 2D grid of numbers. Where a vector has one dimension (its length), a matrix has two: rows and columns. The convention is always "rows × columns" — so a 2 × 3 matrix has 2 rows and 3 columns, and not the other way around.

Textbook notation: an R × C real-valued matrix is written M ∈ ℝ^(R×C). The entry at row i, column j is written M_ij (or M[i][j] in code).

M (2 × 3) = [[1, 2, 3], [4, 5, 6]]

One matrix, two readings. Highlight a row (length 3), then a column (length 2). Same numbers, different vectors.

That dual reading — "row vector" view vs. "column vector" view — is the matrix's two faces. The same grid can mean very different things depending on which way you slice it.

Think of a small online store with this sales matrix:

                Mon   Tue   Wed
  T-shirts        4     7     2
  Mugs            1     0     5
  Books           3     2     1

Read by row, each row is a product's sales over the week (T-shirts sold 4 + 7 + 2 = 13 units). Read by column, each column is a day's sales across products (Monday sold 4 + 1 + 3 = 8 units total). One grid; both views are valid; whichever you need depends on the question you're asking.
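
The two readings can be sketched in a few lines of Python. This is only an illustration using plain nested lists; the variable names (`sales`, `products`, `days`, `per_product`, `per_day`) are chosen here and do not come from any library.

```python
# Sales matrix from the table above: rows are products, columns are days.
products = ["T-shirts", "Mugs", "Books"]
days = ["Mon", "Tue", "Wed"]
sales = [
    [4, 7, 2],  # T-shirts
    [1, 0, 5],  # Mugs
    [3, 2, 1],  # Books
]

# Row reading: one product's sales across the week.
per_product = {name: sum(row) for name, row in zip(products, sales)}

# Column reading: one day's sales across all products.
# zip(*sales) walks the matrix column by column.
per_day = {day: sum(col) for day, col in zip(days, zip(*sales))}

print(per_product)  # {'T-shirts': 13, 'Mugs': 6, 'Books': 6}
print(per_day)      # {'Mon': 8, 'Tue': 9, 'Wed': 8}
```

The data never changes between the two dictionaries; only the direction of the loop does.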

Matrices also do something vectors can't: they represent transformations. A matrix is, secretly, a function — feed it a vector, get back a different vector. We'll see exactly how in Section 06 (matrix multiplication), but the key intuition is this: a matrix can encode rotation, scaling, shearing, projection, and almost anything else you might want to do to a vector — all packed into one tidy grid of numbers.

That's why matrices are everywhere in graphics, physics, and machine learning:

Rotating a 3D model in a video game = multiplying every vertex by a 3×3 rotation matrix.
Resizing a photo = multiplying each pixel's coordinates by a scaling matrix (and interpolating to fill in the new pixels).
A neural network layer = multiplying input vectors by a learned weight matrix.
Search engines store millions of documents as rows in a giant matrix.
Markov chains use a "transition matrix" to predict the next state.

In a Transformer: almost every learned parameter sits in a matrix. The embedding matrix is V × D: one row per vocabulary token (V ≈ 50,000 in GPT-2), D learned numbers per row. The attention projection matrices W_Q, W_K, W_V are each D × D, transforming token vectors into Query / Key / Value spaces. A 7-billion-parameter model is, mechanically, a collection of large matrices and the rules for how to multiply them.

Add / Subtract

Combine two directional quantities into one.

Adding two vectors is exactly what your intuition tells you it should be.

Imagine you walk 3 blocks east, then turn and walk 4 blocks north. Where do you end up? Not 3 blocks east and 4 blocks north as two separate facts — you end up at a single new location: 3 east and 4 north of where you started. That single destination is the sum of the two walking vectors.

The rule is simple: add component by component. The two vectors must have the same dimension (you can't add a 2D walking vector to a 3D one any more than you can add latitude to a price). For u = [3, 1] and v = [1, 2], the sum is [3+1, 1+2] = [4, 3].

Textbook notation: for any index i, (u + v)_i = u_i + v_i. Both vectors must live in the same space — formally, u, v ∈ ℝⁿ implies u + v ∈ ℝⁿ.
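
As a minimal sketch, the component-wise rule fits in one small function. The name `add` is a hypothetical helper written for this example, and vectors are represented as plain Python lists.

```python
def add(u, v):
    """Component-wise sum of two vectors of the same dimension."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return [a + b for a, b in zip(u, v)]


u = [3, 1]
v = [1, 2]
print(add(u, v))  # [4, 3]
```

The explicit length check mirrors the "same dimension" requirement: without it, `zip` would silently drop the extra components of the longer vector.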

The arithmetic is the easy part; the geometry is where the intuition lives. Place the tail of v at the head of u, and the arrow from the origin to v's new tip is u + v — the "head-to-tail" rule, exactly how you'd trace your walking path on a map.

Three frames: lay out u and v; add component-wise to get [3+1, 1+2] = [4, 3]; then place v's tail at u's head. The arrow from the origin to that new tip is u + v. The two arrows and the sum form a triangle (or a parallelogram if you also draw v from the origin with a copy of u at its head).

This isn't just an abstract math game. Sailors, pilots, and weather forecasters all use vector addition every day: a sailboat moving at 10 knots northeast through a current of 3 knots south actually moves in some new direction at some new speed — and that direction is exactly the head-to-tail sum of those two vectors. Pilots call this "ground track" vs. "heading," and getting it wrong puts your plane in the wrong country.

Subtraction is the same idea with one sign flipped. u − v is "the arrow you'd add to v to get back to u" — or equivalently, "the difference between where they point."

u − v = [3−1, 1−2] = [2, −1]. Component by component, with negative numbers allowed.

Where vector subtraction quietly does heavy lifting: any time you want to know "how different are these two things?" — that's a subtraction. Two RGB colors? Subtract them and the result tells you exactly which channels differ and by how much. Two players' skill profiles? Subtract them. The length of the difference vector tells you how unlike the two players really are.
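
Here is a small sketch of subtraction as a "difference detector". The helper name `subtract` and the two example colors are illustrative assumptions, not taken from the text above.

```python
def subtract(u, v):
    """Component-wise difference u - v of two same-dimension vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return [a - b for a, b in zip(u, v)]


# The worked example from this section.
print(subtract([3, 1], [1, 2]))  # [2, -1]

# Two hypothetical RGB colors: the difference shows which channels differ.
orange = [255, 165, 0]
red = [255, 0, 0]
print(subtract(orange, red))  # [0, 165, 0]
```

In the color case the result reads directly: red and blue channels match, and the green channel differs by 165.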

In a Transformer: residual connections, arguably the single most important architectural trick in modern deep learning, are vector addition. A layer's output is added back to its input: x_out = x_in + Layer(x_in). This is exactly the element-wise add from above, and it's what lets information flow cleanly through 100+ stacked layers without getting smeared into noise. Without residual connections, very deep Transformers are extremely hard to train.

Scalar Multiplication

Stretch, flip, or shrink — same direction, different size.

A scalar is just a single number — as opposed to a vector, which is a list. The word comes from "scale": a scalar tells you how much to scale a vector by. Multiply a vector by 2 and it doubles in length (same direction). Multiply by 0.5 and it halves. Multiply by −1 and it flips around to point the other way.

The rule is straightforward: multiply every component by the scalar. For v = [2, 1]: 2v = [4, 2], −v = [−2, −1], 0.5v = [1, 0.5].

Textbook notation: for any scalar α ∈ ℝ and vector v ∈ ℝⁿ, the product is defined component-wise: (αv)_i = α · v_i.
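
The same rule as a short sketch in Python. The function name `scale` is a hypothetical helper for this example.

```python
def scale(alpha, v):
    """Multiply every component of v by the scalar alpha."""
    return [alpha * x for x in v]


v = [2, 1]
print(scale(2, v))    # [4, 2]
print(scale(-1, v))   # [-2, -1]
print(scale(0.5, v))  # [1.0, 0.5]
```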

Three frames in sequence. 2v doubles the length; −v flips direction; 0.5v halves the length. Direction is preserved (or perfectly reversed); only length changes.

You've done scalar multiplication a thousand times without naming it:

Doubling a recipe. The ingredients vector (2 eggs, 1 cup flour, 0.5 cup milk) times 2 becomes (4 eggs, 2 cups flour, 1 cup milk). Every component scales together.
Turning the volume knob. An audio waveform is a vector of sample values. Multiplying the whole vector by 0.5 = quieter; by 2 = louder (clipping aside).
Zooming a photo. Each pixel's coordinates get multiplied by the zoom factor. Zoom 2× and every pixel sits twice as far from the center.
Reversing direction. A car's velocity vector multiplied by −1 = same speed, opposite direction (i.e., backing up at the same speed).

Three things to remember:

Positive scalars preserve direction. Length scales by the scalar.
Negative scalars flip the direction. Length scales by the absolute value.
Zero collapses the vector to the origin (length 0). One leaves it unchanged.

Scalar multiplication is the simplest operation here, but don't underestimate it. Every single iteration of gradient descent (the algorithm behind the training of nearly every neural network) contains a scalar multiplication. The "learning rate" you hear about in ML talks is exactly the scalar that scales the gradient vector before it is subtracted from the weights:

new_weights = old_weights − learning_rate · gradient

Too big and training overshoots and diverges; too small and it crawls. Tuning that one scalar is one of the most common tasks in machine learning practice.
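
One update step can be sketched as follows. The weights, gradient, and learning rate are made-up numbers for illustration, and `sgd_step` is a hypothetical helper, not a function from any framework.

```python
def sgd_step(weights, gradient, learning_rate):
    """One gradient-descent update: scale the gradient, then subtract it."""
    return [w - learning_rate * g for w, g in zip(weights, gradient)]


old_weights = [1.0, -2.0]   # illustrative values
gradient = [0.5, -1.0]      # illustrative values
new_weights = sgd_step(old_weights, gradient, learning_rate=0.5)
print(new_weights)  # [0.75, -1.5]
```

The step combines the two operations seen so far: a scalar multiplication (learning rate times gradient) followed by a vector subtraction.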

In a Transformer: beyond the learning-rate update above, scalar multiplication powers a handful of other key moves. Softmax temperature scales logits before normalization (we'll meet softmax in the Transformer track). Attention scaling divides dot-product scores by √d_k to keep them stable as model dimension grows. Gradient clipping's "scale to a max norm" trick is also just a scalar multiplication. Anywhere you see "multiply this whole thing by a number" — that's scalar multiplication doing the work.

Dot Product

The most-used operation in deep learning, by a mile.

If you only learn one thing from this primer, learn the dot product. Every other concept — attention, similarity, projection, matrix multiplication — is just a dot product wearing a different hat.

Here's the recipe: take two vectors of the same dimension, multiply paired entries, and sum the results. Out comes a single number. That's the whole definition.

Textbook notation: using summation,

u · v = ∑_{i=1}^{n} u_i v_i

Some textbooks write the dot product as ⟨u, v⟩ instead of u · v — same thing, also called the inner product.

u = [3, 4], v = [2, 1]. Multiply paired entries to get [6, 4], then add: 6 + 4 = 10. So u · v = 10.

The dot product looks unremarkable. But it quietly answers a powerful question: how aligned are these two vectors?

Imagine you and a friend each fill out a "movie preference" vector — one number per genre — rating how much you enjoy each:

             Action  Comedy  Romance  Horror  Sci-Fi
  You         5       3       1        2       4
  Friend A    5       3       0        3       4   ← matches you
  Friend B    0       2       5        0       1   ← opposite of you

Compute the dot product between your vector and each friend's. Positive case — with Friend A: 5·5 + 3·3 + 1·0 + 2·3 + 4·4 = 56. Counter-example — with Friend B: 5·0 + 3·2 + 1·5 + 2·0 + 4·1 = 15.

Friend A's number is much bigger — and it's because their tastes line up with yours on the same genres (high ratings on the same rows). Friend B's is small because their high ratings are on the genres you don't care for — and their low ratings are on yours. Same operation, opposite outcome: one number tells you, in a single step, who'd enjoy the same movies as you. This is literally how recommendation systems start.
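
The movie comparison can be reproduced with a few lines of Python. The helper name `dot` is chosen for this sketch; the three vectors are the rows of the table above.

```python
def dot(u, v):
    """Multiply paired entries of two same-dimension vectors and sum them."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sum(a * b for a, b in zip(u, v))


#           Action Comedy Romance Horror Sci-Fi
you      = [5,     3,     1,      2,     4]
friend_a = [5,     3,     0,      3,     4]
friend_b = [0,     2,     5,      0,     1]

print(dot(you, friend_a))  # 56
print(dot(you, friend_b))  # 15
```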

Why does this work? The whole mechanism lives in multiplication. A product a · b is large and positive only when both numbers are large and share the same sign, so multiplication acts like an "AND-detector" per dimension. Look at the Action column: you and Friend A both scored 5, so 5 · 5 = 25 contributes heavily to the sum. Friend B scored 0, so 5 · 0 = 0 contributes nothing. Only two-sided agreement adds much.

Sum across all dimensions, and you're effectively counting the agreements. The bigger the count, the more aligned the two vectors are. That's the entire mechanism — and it generalizes to any number of dimensions, which is why the same operation that scores movie compatibility also scores LLM attention.

There's also a clean geometric reading:

u · v = ‖u‖ · ‖v‖ · cos(θ)

The double bars ‖v‖ mean "the length of v" — we'll meet this formally as the norm in Section 08.

So the dot product packages three things into one number: how long u is, how long v is, and how aligned they are in direction. Same direction → big positive number. Perpendicular → exactly zero. Opposite → big negative number.

Same direction → large positive. Perpendicular → exactly 0. Opposite → large negative.
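
The three cases can be checked numerically. In this sketch the vectors are illustrative choices (all built from [3, 4]), and `angle_degrees` is a hypothetical helper that recovers θ by rearranging u · v = ‖u‖ · ‖v‖ · cos(θ).

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def angle_degrees(u, v):
    """Angle between u and v, from cos(theta) = (u . v) / (|u| |v|)."""
    cos_theta = dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))
    # Clamp so floating-point rounding cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


u = [3, 4]
same = [6, 8]            # u scaled by 2
perpendicular = [-4, 3]  # u rotated by a quarter turn
opposite = [-3, -4]      # u scaled by -1

print(dot(u, same), round(angle_degrees(u, same)))                    # 50 0
print(dot(u, perpendicular), round(angle_degrees(u, perpendicular)))  # 0 90
print(dot(u, opposite), round(angle_degrees(u, opposite)))            # -25 180
```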

The perpendicular case is the most striking: perpendicular vectors have a dot product of exactly zero. They share no directional information at all — if your preference vector were perpendicular to a friend's, you'd have no overlap in taste whatsoever. That's a strong mathematical statement, and it pops out of just five multiplications and four additions.

Once you start seeing the dot product as "how aligned are these two things," its applications open up:

Lighting in 3D graphics. The brightness of a surface is the dot product of "direction of light" with "surface normal." Facing the light = bright. Sideways = dim. Facing away = dark (or zero).
Google search relevance. Your query is a vector; each document is a vector. The dot product ranks "how aligned with the query" each document is.
Spam filters. An email's features form a vector; the model has a learned "spam direction" vector. Dot product → big positive = likely spam.

In a Transformer: attention scores. Every "how much should this token attend to that one?" question is answered by a dot product — specifically, between a Query vector and a Key vector. Large dot product → "pay a lot of attention." Near-zero → "ignore." Negative → "actively suppress." Stack billions of these dot products across many layers and many heads, and you get GPT.

Matrix Multiplication

Many dot products at once — and the workhorse of modern AI.

Once you understand the dot product, matrix multiplication is mostly bookkeeping. When you multiply a matrix M (with R rows) by a vector v, you get a new vector with R entries — each entry is one dot product, of one row of M with v. That's really all there is.

M · v with M = [[1, 2], [3, 4]] and v = [5, 6]: row-by-row dot products. Row 0 → 1·5 + 2·6 = 17. Row 1 → 3·5 + 4·6 = 39. So Mv = [17, 39].
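
A minimal sketch of the row-by-row rule. The values M = [[1, 2], [3, 4]] and v = [5, 6] are the ones that reproduce the results 17 and 39 quoted above; treat them as an assumption about the original figure. `matvec` is a hypothetical helper name.

```python
def matvec(M, v):
    """Multiply matrix M by vector v: one dot product per row of M."""
    if any(len(row) != len(v) for row in M):
        raise ValueError("each row of M must have as many entries as v")
    return [sum(m * x for m, x in zip(row, v)) for row in M]


M = [[1, 2],
     [3, 4]]   # assumed values, see note above
v = [5, 6]     # assumed values, see note above

print(matvec(M, v))  # [17, 39]
```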

Why does that matter? This is the mechanism that lets a single matrix transform a vector — into a rotated version, a stretched version, a projected-onto-a-plane version, or any other linear transformation. Section 02 hinted that a matrix acts like a function; multiplication is exactly how.

Here's a concrete example. Suppose you run that little online store from Section 02 (T-shirts, mugs, books) and you have prices per item:

prices = [ T-shirt 20, mug 8, book 15 ]   # a vector

sales =  Mon    Tue    Wed
T-shirt    4      7      2
mug        1      0      5
book       3      2      1                # a 3×3 matrix

To compute revenue per day, multiply the prices vector (as a row) by the sales matrix. Each day's revenue is the dot product of prices with that day's column:

Mon revenue:  20·4 + 8·1 + 15·3  = 133
Tue revenue:  20·7 + 8·0 + 15·2  = 170
Wed revenue:  20·2 + 8·5 + 15·1  =  95

Three dot products. Three answers. One operation. That's matrix multiplication — it lets you do many dot products at once, whatever the rows and columns happen to represent (alignments, weighted sums, projections).
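
The revenue calculation above, sketched in Python. The variable names are chosen for this example; `zip(*sales)` is used to walk the sales matrix one day (column) at a time.

```python
prices = [20, 8, 15]  # T-shirt, mug, book

sales = [
    [4, 7, 2],  # T-shirts: Mon, Tue, Wed
    [1, 0, 5],  # Mugs
    [3, 2, 1],  # Books
]

# One dot product per day: prices against that day's column of sales.
revenue_per_day = [
    sum(price * units for price, units in zip(prices, day_column))
    for day_column in zip(*sales)
]

print(revenue_per_day)  # [133, 170, 95]
```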

Matrix times matrix. Same idea, but with multiple "v"s side by side. If M is R × K and N is K × C, then MN is R × C, computed by R · C independent dot products. The K's must match — that's the shape rule that catches every beginner.

Textbook notation: the entry at row i, column j of MN is

(MN)_{ij} = ∑_{k=1}^{K} M_{ik} N_{kj}

— exactly the dot product of row i of M with column j of N.
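
The formula translates almost directly into code. This is a plain-Python sketch for small matrices, with `matmul` as a hypothetical helper and two made-up example matrices; real workloads would use an optimized library instead.

```python
def matmul(M, N):
    """Multiply an R x K matrix by a K x C matrix, giving an R x C matrix."""
    inner = len(M[0])
    if len(N) != inner:
        raise ValueError("inner dimensions must match: (R x K) times (K x C)")
    columns = list(zip(*N))  # columns of N, so each entry is row-dot-column
    return [
        [sum(a * b for a, b in zip(row, col)) for col in columns]
        for row in M
    ]


A = [[1, 2, 3],
     [4, 5, 6]]   # 2 x 3 (illustrative values)
B = [[1, 0],
     [0, 1],
     [1, 1]]      # 3 x 2 (illustrative values)

print(matmul(A, B))  # [[4, 5], [10, 11]]
```

Calling `matmul(A, A)` raises `ValueError`, because a 2 × 3 matrix cannot be multiplied by another 2 × 3 matrix: the inner dimensions (3 and 2) do not match.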

Matrix multiplication is everywhere, often invisibly:

3D games. Every frame, every model's vertices are multiplied by a "camera matrix" to figure out where they should appear on your screen. Tens of millions of matrix multiplications per second.
PageRank. Google's original algorithm was, essentially, multiplying a vector of page scores by a giant "link matrix" over and over until the scores stopped changing.
Image filters. Sharpen, blur, edge-detection: all done by sliding tiny matrices ("kernels") across the image and taking a dot product with each patch.
Convolutional neural nets. Each layer multiplies image patches by learned filter matrices to detect edges, textures, and shapes. Billions of multiplications can go into recognizing a single photograph.

The unreasonable speed of GPUs. Modern GPUs are essentially purpose-built for matrix multiplication. They can do thousands of dot products in parallel because each cell of the output matrix is independent of the others — perfect for parallel hardware. A consumer GPU can do trillions of multiply-add operations per second. The reason we can train trillion-parameter models at all is that matrix multiplication maps beautifully onto GPU hardware.

In a Transformer: matrix multiplication is, quite literally, the Transformer. A single layer contains at least: one matmul to project tokens to Q, one to K, one to V, one to combine attended values, two for the feed-forward block. Add a final matmul at the very end to project back to vocabulary logits. Stack 32 to 80 of these layers — and that's a modern LLM. Speeding up training and inference is largely the engineering of squeezing more matmul out of the same hardware.

Cosine Similarity

Direction only, length normalized away.

Recall the dot product. It mixes two things: how long the vectors are, and how aligned they are. Often we want only the alignment, with the lengths factored out. That's what cosine similarity does.

Let's make the movie example a little more realistic. Suppose you are a cinephile who rates on a 0–10 scale, while Friends A and B still rate on the 0–5 scale from Section 05. The three vectors now look like:

              Action  Comedy  Romance  Horror  Sci-Fi
You (0–10)    10      6       2        4       8
Friend A       5      3       0        3       4
Friend B       0      2       5        0       1

Compute the dot products with this new data:

You · A = 10·5 + 6·3 + 2·0 + 4·3 + 8·4 = 112
You · B = 10·0 + 6·2 + 2·5 + 4·0 + 8·1 = 30

A is the better match, just like in Section 05 — but half of each number comes from your wider scale, not from any change in taste. (Compare to Section 05's 56 and 15: both doubled.) Raw dot products mix "how aligned are tastes?" with "how big do you rate?" and we can't tell which is which.

The fix: divide the dot product by the lengths of the two vectors.

cos_sim(u, v) = (u · v) / (‖u‖ · ‖v‖)

This strips both scales away, leaving only direction. Because u · v = ‖u‖ · ‖v‖ · cos(θ), dividing by the lengths cancels them out and leaves you with exactly cos(θ) — the cosine of the angle between the two vectors. The result always lands in [−1, 1], regardless of anyone's rating scale.

Compute the three norms (the ‖·‖ lengths previewed in Section 05 and defined formally in Section 08):

‖You‖ = √(10² + 6² + 2² + 4² + 8²) = √220 ≈ 14.83
‖A‖   = √(5²  + 3² + 0² + 3² + 4²) = √59  ≈  7.68
‖B‖   = √(0²  + 2² + 5² + 0² + 1²) = √30  ≈  5.48

Then plug in:

cos_sim(You, A) = 112 / (14.83 × 7.68) ≈ 0.98 — almost identical tastes.
cos_sim(You, B) = 30 / (14.83 × 5.48) ≈ 0.37 — some overlap, but quite different.

Now the numbers mean something on their own: 0.98 is "same taste profile," 0.37 is "weakly compatible." Even better: if you went back to a 0–5 scale (halve every entry, matching Section 05's data), the cosines would land at exactly the same 0.98 and 0.37. Cosine throws out the length / scale and keeps only direction — a property called scale-invariance.
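
The whole calculation, including the scale-invariance claim, can be checked with a short sketch. The helper names (`dot`, `norm`, `cos_sim`) are chosen for this example, and the function assumes neither vector is all zeros (otherwise the division fails).

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    return math.sqrt(dot(v, v))


def cos_sim(u, v):
    """Cosine of the angle between u and v (both must be non-zero)."""
    return dot(u, v) / (norm(u) * norm(v))


you_0_to_10 = [10, 6, 2, 4, 8]
you_0_to_5 = [5, 3, 1, 2, 4]
friend_a = [5, 3, 0, 3, 4]
friend_b = [0, 2, 5, 0, 1]

print(round(cos_sim(you_0_to_10, friend_a), 2))  # 0.98
print(round(cos_sim(you_0_to_10, friend_b), 2))  # 0.37

# Halving your scale changes the dot products but not the cosines.
print(round(cos_sim(you_0_to_5, friend_a), 2))   # 0.98
print(round(cos_sim(you_0_to_5, friend_b), 2))   # 0.37
```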

Nearly parallel → cos_sim ≈ 1. Perpendicular → 0. Opposite → −1.

Real-world uses are absolutely everywhere:

Document search. Each document and each query becomes a vector (representing its word distribution, or its semantic embedding). Cosine similarity ranks documents by topical alignment. Long documents and short queries get compared fairly because length is normalized out.
Spotify and Netflix recommendations. Your taste profile is a vector; each song or movie is a vector. Recommendations are roughly "highest cosine similarity items you haven't consumed yet."
Plagiarism detection. Two documents' vectors with cos_sim near 1 probably share their structure (the same words at the same proportions). Doesn't catch perfect rewording, but catches a lot.
Face recognition. Each face becomes a 128-dim or 512-dim "face embedding." Comparing two faces = cosine similarity between their embeddings. If above some threshold, declare a match.

In a Transformer: attention scores are technically dot products, not cosines — but the network learns to keep its attention vectors at similar scales, so a dot product on those vectors behaves a lot like a cosine similarity. That's how the model judges which earlier tokens are most relevant to each new one. Cosine similarity also powers the retrieval half of RAG (retrieval-augmented generation): an entire industry (Pinecone, Weaviate, Qdrant, pgvector, Chroma, …) has sprung up just to store billions of embedding vectors and answer "what's most similar to this query?" at massive scale.

Norm (Length)

Pythagorean theorem, generalized to any number of dimensions.

Question: if you walk 3 blocks east and then 4 blocks north, how far are you from where you started? Not 7 blocks: that is the distance you walked, not the straight-line distance back to the start. The straight line is the hypotenuse of a right triangle whose legs are 3 and 4. By the Pythagorean theorem, √(3² + 4²) = √25 = 5 blocks. That hypotenuse is the norm (or length) of the vector [3, 4], written ‖v‖ (double bars).

That's the whole idea. The norm of a vector is its geometric length — the distance from the origin to its tip. The formula is just Pythagoras applied across however many components you happen to have.

The components 3 and 4 are the legs of a right triangle; the vector is the hypotenuse, with length 5.

For D dimensions, the formula stretches gracefully:

‖v‖ = √(v₁² + v₂² + … + v_D²)

Square every component, add them up, take the square root. Done. It works for 2D, 3D, 100D, or 12288D — exactly the same procedure.

What does it mean for the movie preferences from Sections 05–07? Use the original 0–5 scale data and compute each person's norm:

‖You‖ = √(5² + 3² + 1² + 2² + 4²) = √55 ≈ 7.42
‖A‖ = √(5² + 3² + 0² + 3² + 4²) = √59 ≈ 7.68
‖B‖ = √(0² + 2² + 5² + 0² + 1²) = √30 ≈ 5.48

Read these as total opinion intensity: how strongly someone feels across all genres combined. Friend A has the biggest norm: they rate several genres strongly. Friend B has the smallest: two zeros and mostly low ratings, with one big spike on Romance, so the total magnitude is modest. You sit in between. Same vectors, looked at a different way: cosine asked "are these vectors aligned?"; norm asks "how big is this vector?"

Switch You to the 0–10 cinephile version from Section 07 and the norm exactly doubles — not because tastes changed, but because the rating scale did. Norm captures magnitude; cosine ignores it.
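
These norms are easy to verify. The sketch below uses a hypothetical helper `norm` that implements the square, sum, square-root recipe from this section.

```python
import math


def norm(v):
    """Euclidean (L2) length: square each component, sum, take the root."""
    return math.sqrt(sum(x * x for x in v))


print(norm([3, 4]))                       # 5.0

print(round(norm([5, 3, 1, 2, 4]), 2))    # 7.42  (You, 0-5 scale)
print(round(norm([5, 3, 0, 3, 4]), 2))    # 7.68  (Friend A)
print(round(norm([0, 2, 5, 0, 1]), 2))    # 5.48  (Friend B)

# Doubling every rating doubles the norm.
print(round(norm([10, 6, 2, 4, 8]), 2))   # 14.83 (You, 0-10 scale)
```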

Why so important? Because "how big is this thing?" comes up constantly, and the norm is how you answer it:

GPS distance. Over short distances, where the map is effectively flat, the straight-line distance between two positions is the norm of the difference vector between them.
Sound volume. The "loudness" of an audio signal is roughly the norm of its sample vector.
Comparing two photos. Subtract their pixel matrices; the norm of the difference tells you how visually different they are.
Physics velocity. A velocity vector's norm is the speed. A force vector's norm is the magnitude of the force.

There's also a useful trick called normalization: divide a vector by its own norm. The result is a vector pointing in the same direction with length exactly 1 (a "unit vector"). It's useful when you only care about direction; cosine similarity relies on the same idea, dividing the dot product by both norms in one step instead of normalizing each vector first.
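
Normalization as a minimal sketch. `normalize` is a hypothetical helper; it refuses the zero vector, which has no direction and cannot be scaled to length 1.

```python
import math


def normalize(v):
    """Return the unit vector pointing in the same direction as v."""
    length = math.sqrt(sum(x * x for x in v))
    if length == 0:
        raise ValueError("cannot normalize the zero vector")
    return [x / length for x in v]


unit = normalize([3, 4])
print(unit)  # [0.6, 0.8]
```

The result keeps the 3 : 4 ratio between components (the direction) while its length becomes 1, up to floating-point rounding.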

Technically what we just defined is the L2 norm (also called Euclidean norm), sometimes written ‖v‖₂ with a subscript to be explicit. There are other norms: ‖v‖₁ sums the absolute values of the components instead of squaring them; ‖v‖_∞ takes the largest absolute value among the components; and so on. But in deep learning, "the norm" almost always means L2 unless someone says otherwise.

In a Transformer: LayerNorm and RMSNorm, two normalization layers found in nearly every modern Transformer, both measure the scale of each token's vector and divide by it (RMSNorm uses the root mean square of the components, which is the L2 norm divided by √D; LayerNorm first subtracts the mean and then divides by the standard deviation). Either way, each token's vector stays at a roughly consistent scale so the math behaves. Gradient clipping (one line in many PyTorch training loops: torch.nn.utils.clip_grad_norm_) caps the gradient's norm to keep training from blowing up when the loss landscape gets steep. And cosine similarity, which you just saw, is literally "dot product divided by two norms."
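
To make the gradient-clipping idea concrete, here is a simplified pure-Python sketch for a single gradient vector. It illustrates the "cap the norm" idea only and is not PyTorch's implementation; `clip_by_norm` and the example numbers are assumptions made for this example.

```python
import math


def clip_by_norm(gradient, max_norm):
    """Rescale gradient so its L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in gradient))
    if total_norm <= max_norm:
        return list(gradient)        # already small enough: leave unchanged
    scale = max_norm / total_norm    # a single scalar, less than 1
    return [g * scale for g in gradient]


big = [6.0, 8.0]                     # norm 10 (illustrative values)
print(clip_by_norm(big, max_norm=5.0))    # [3.0, 4.0]

small = [0.3, 0.4]                   # norm about 0.5, below the cap
print(clip_by_norm(small, max_norm=5.0))  # [0.3, 0.4]
```

The clipped vector points in the same direction as the original; only its length is capped, which is a norm computation followed by a scalar multiplication.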

Transpose & Reshape

Same numbers, different shape.

Think about a spreadsheet of students × test scores. Sometimes you want one student's row across all tests; sometimes you want one test's column across all students. The data's the same — what changes is which way you read it. Transpose formalizes this: a matrix flipped along its diagonal, so the rows become columns and the columns become rows. We write it M^T.

Concrete example. Recall Section 02's store sales matrix:

              Mon   Tue   Wed
  T-shirts     4     7     2
  Mugs         1     0     5
  Books        3     2     1

Rows = products, columns = days. To answer "which day had the most total sales?" you have to scan column-by-column, summing as you go. Annoying. Transpose it and the same data becomes rows = days, columns = products:

         T-shirts  Mugs  Books
  Mon       4        1     3
  Tue       7        0     2
  Wed       2        5     1

Now "Tuesday's sales" is just one row — sum it across, done. Same nine numbers, two different access patterns. Whichever the next operation needs, transposing into that shape is one cheap move away.

Textbook notation: if M ∈ ℝ^(R×C), then M^T ∈ ℝ^(C×R), with entries (M^T)_ij = M_ji (row and column indices swap). A useful identity: (AB)^T = B^T A^T (transposing a product reverses the order).

M (2 × 3) = [[1, 2, 3], [4, 5, 6]] ⇄ M^T (3 × 2) = [[1, 4], [2, 5], [3, 6]]. Each row of M lands as a column of M^T.

Trace one entry to convince yourself: the 2 sits at row 0, column 1 of M. After transpose, that same 2 is at row 1, column 0 of M^T. Row and column indices swap — every cell makes the same move at once. Diagonal entries (where row index equals column index) sit still; everything else flips across the diagonal.
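
A short sketch makes the index swap checkable. It assumes M = [[1, 2, 3], [4, 5, 6]] (the 2 × 3 matrix used in this section); `transpose` and `matmul` are hypothetical helpers, and A and B are made-up matrices used to spot-check the identity (AB)^T = B^T A^T.

```python
def transpose(M):
    """Flip a matrix along its diagonal: rows become columns."""
    return [list(col) for col in zip(*M)]


def matmul(M, N):
    return [
        [sum(a * b for a, b in zip(row, col)) for col in zip(*N)]
        for row in M
    ]


M = [[1, 2, 3],
     [4, 5, 6]]
MT = transpose(M)

print(MT)                 # [[1, 4], [2, 5], [3, 6]]
print(M[0][1], MT[1][0])  # 2 2

# Spot-check of (AB)^T = B^T A^T on two illustrative 2 x 2 matrices.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(transpose(matmul(A, B)) == matmul(transpose(B), transpose(A)))  # True
```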

Why do we transpose so often? Two reasons.

First, shape compatibility. Matrix multiplication only works when the inner dimensions match: a (R × K) times (K × C) is fine, but (R × K) times (R × C) is not. If two matrices almost line up but the dimensions are flipped, transposing one of them fixes it.

Real example: attention scores. In a Transformer, both Q (queries) and K (keys) come out shaped (N tokens × D features). The model wants, for every query row, its dot product with every key row — an N × N grid of "how aligned." But matmul gives row · column products, not row · row. Transpose K so its rows become columns, and now Q · K^T in a single matmul produces all N × N dot products at once. Without the transpose, the shapes simply don't line up.
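
The shape argument can be seen on a toy example. The sketch below uses made-up Q and K with N = 3 tokens and D = 2 features, and hypothetical helpers `transpose` and `matmul`; it computes only the raw dot-product grid and leaves out the scaling and softmax steps that real attention applies afterwards.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(M, N):
    if len(M[0]) != len(N):
        raise ValueError("inner dimensions must match")
    return [
        [sum(a * b for a, b in zip(row, col)) for col in zip(*N)]
        for row in M
    ]


# Toy values: 3 tokens, 2 features each (both matrices are 3 x 2).
Q = [[1, 0],
     [0, 1],
     [1, 1]]
K = [[1, 2],
     [3, 4],
     [5, 6]]

# (3 x 2) times (2 x 3) gives a 3 x 3 grid: scores[i][j] = Q[i] . K[j]
scores = matmul(Q, transpose(K))
print(scores)  # [[1, 3, 5], [2, 4, 6], [3, 7, 11]]
```

Trying `matmul(Q, K)` without the transpose raises `ValueError`, since a 3 × 2 matrix cannot be multiplied by another 3 × 2 matrix.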

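Second, it's usually close to free. In most array libraries the numbers don't actually move in memory; the library just relabels which axis is "row" and which is "column," and copies data only if a later operation needs it laid out contiguously. A nearly free operation that fixes an entire class of "shape mismatch" errors is a real gift.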

Reshape is more general. It takes a chunk of numbers and reorganizes them into any shape with the same total count. A 2×3 matrix (6 numbers) can become a 1×6 row, a 6×1 column, a 3×2 matrix, or any other arrangement. The data is the same — only the addressing scheme changes.

Picture a deck of 52 cards in a fixed order. Deal it out as 4 rows of 13 and you reach a card as "row 2, position 7." Deal the same deck as 13 rows of 4 and that same card is now "row 5, position 4." The order of the cards in the deck never changed; you only changed where the rows break. Reshape does the same thing to a block of numbers.

M (2 × 3) = [[1, 2, 3], [4, 5, 6]] ⇄ M^T (3 × 2) = [[1, 4], [2, 5], [3, 6]]

Same six numbers [1, 2, 3, 4, 5, 6] in two reshape forms: a 1×6 row [[1, 2, 3, 4, 5, 6]] and a 6×1 column [[1], [2], [3], [4], [5], [6]].

Reshape vs. transpose — a subtle but important difference. Reshape preserves the linear order of numbers and just regroups them. The matrix M = [[1, 2, 3], [4, 5, 6]] reshaped into 6 elements is [1, 2, 3, 4, 5, 6] (reading row by row), and reshaped to 3×2 becomes [[1, 2], [3, 4], [5, 6]] — not M^T, which would be [[1, 4], [2, 5], [3, 6]]. Reshape regroups; transpose actually rearranges. Mixing these up is one of the most common bugs in ML code.
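
The difference is easy to see side by side. In this sketch, `reshape` is a hypothetical pure-Python helper that flattens row by row and regroups (the same row-major convention described above), and `transpose` swaps rows and columns.

```python
def reshape(M, rows, cols):
    """Regroup the entries of M, read row by row, into rows x cols."""
    flat = [x for row in M for x in row]
    if rows * cols != len(flat):
        raise ValueError("new shape must hold the same number of entries")
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


def transpose(M):
    return [list(col) for col in zip(*M)]


M = [[1, 2, 3],
     [4, 5, 6]]

print(reshape(M, 3, 2))  # [[1, 2], [3, 4], [5, 6]]
print(transpose(M))      # [[1, 4], [2, 5], [3, 6]]
print(reshape(M, 1, 6))  # [[1, 2, 3, 4, 5, 6]]
print(reshape(M, 6, 1))  # [[1], [2], [3], [4], [5], [6]]
```

Both results of the first two calls are 3 × 2, yet they hold the numbers in different positions, which is exactly the bug the paragraph above warns about.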

Reshape feels like it shouldn't be a big deal — it's just relabeling, after all. But it's the trick behind one of the most important ideas in modern AI: multi-head attention.

Here's the trick — in one line of code: x.reshape(N, 12, 64). A token in GPT-2 has a vector of D = 768 numbers. That line reinterprets those 768 as 12 groups of 64 — called "heads" — and the model runs 12 independent attentions, one per head. Each head can specialize: one might track syntax, another long-range dependencies, another plural/singular agreement. After all 12 finish in parallel, x.reshape(N, 768) puts them back together as a single vector and the model moves on.

The reshape never changes any data. It just changes how the next operation carves it up. Yet that one trick — running attention on 12 sliced-up views of the same vector instead of on the whole thing — is responsible for a huge chunk of what makes Transformers as capable as they are.
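
For a single token, the split and merge can be sketched in plain Python. The helpers `split_heads` and `merge_heads` are hypothetical names, and the token vector is filled with placeholder values 0 to 767 so the grouping is easy to inspect; a real model would do this with one array reshape over all N tokens at once.

```python
def split_heads(token_vector, num_heads):
    """View one D-dimensional token vector as num_heads groups of D / num_heads."""
    d = len(token_vector)
    if d % num_heads != 0:
        raise ValueError("dimension must divide evenly into heads")
    head_dim = d // num_heads
    return [
        token_vector[h * head_dim:(h + 1) * head_dim]
        for h in range(num_heads)
    ]


def merge_heads(heads):
    """Concatenate the heads back into a single vector."""
    return [x for head in heads for x in head]


token = list(range(768))          # placeholder values, D = 768
heads = split_heads(token, 12)

print(len(heads), len(heads[0]))  # 12 64
print(heads[1][:3])               # [64, 65, 66]
print(merge_heads(heads) == token)  # True
```

Head 0 gets components 0 to 63, head 1 gets 64 to 127, and so on: the order of the numbers is untouched, which is why merging recovers the original vector exactly.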

Where else reshape shows up:

Image processing. A 28×28 grayscale image is 784 numbers. To feed it into a fully-connected layer, you reshape it from 28 × 28 into 784 × 1. Same pixels, different layout.
Batched training. Sixteen 28×28 images get stacked into a 16 × 28 × 28 tensor (a 3D array) so the GPU can train on them in parallel.
Convolutions. A "patch" of the image is reshaped into a vector before being multiplied by the filter. Reshape, then matmul, then reshape back. Repeat.

In a Transformer: beyond multi-head attention (the headline reshape use above), transpose shows up every time a shape needs to line up for a matmul — including the Q · K^T step at the core of attention, where K's rows need to become columns so the dot products work out.