Hardware & Tensors Primer

The other primers in this track described what a network does. This one describes the substrate it runs on. Four short topics: GPU vs CPU and why every modern model lives on a GPU; VRAM, the onboard memory that is now the hardest constraint on what model you can run; tensors, the multi-dimensional arrays that are the basic data unit of deep learning; and batch / padding / mask, the practical packaging that turns ragged real-world text into GPU-friendly rectangles. You can't skip this one in practice — every "CUDA out of memory" error you ever see is a story written in this vocabulary.

GPU vs CPU

CPUs are smart but few. GPUs are dumb but many. Deep learning is the second kind of workload.

A modern CPU is the platonic ideal of "a computer." A handful of fast, flexible cores — anywhere from 8 on a laptop to 64 on a workstation, each one able to do branching logic, complex instructions, and bookkeeping at high speed. Optimized for one thread doing a varied sequence of tasks. Run an operating system, a database, a web server — that's CPU territory.

A modern GPU is the opposite. Thousands of small, dumb cores — an NVIDIA H100 has 16,896 CUDA cores. Each one is bad at branching, bad at general logic, and only fast when it's doing the same simple operation as all its neighbors. The execution model is "SIMD" — single instruction, multiple data — or its finer-grained cousin SIMT. The whole architecture is optimized for one operation applied to thousands of pieces of data in parallel.

That second pattern is exactly what neural networks are made of:

Matrix multiplication. The bottom-of-the-stack op in a Transformer. A matmul of two 4096×4096 matrices is about 69 billion multiplies (4096³), and every output element can be computed in parallel.
Element-wise activations. ReLU, GELU, softmax — apply the same function to millions of values independently.
Normalization layers. LayerNorm and RMSNorm reduce along one axis and then rescale every element — also embarrassingly parallel.

The numbers, if you want a sanity check. A 4096 × 4096 matmul takes a fast CPU about 50 ms; an H100 GPU does it in under 1 ms. The gap tends to widen as the matrices get bigger, because a 4k × 4k matmul does not yet saturate the GPU while the CPU is already maxed out. For the actual training of a frontier LLM, you typically need thousands of GPUs running in parallel for months. On CPUs, the same training would take centuries. That's why every modern model lives on a GPU.
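The sanity check can be reproduced with a few lines of arithmetic. This is a sketch that takes the 50 ms and 1 ms timings quoted above as given and only derives the throughput they imply; counting one multiply plus one add as two FLOPs is the usual convention, assumed here.

```python
# Work in one square matmul C = A @ B with n x n operands.
n = 4096
multiplies = n ** 3       # each of the n*n outputs needs n multiplies
flops = 2 * multiplies    # convention: one multiply + one add = 2 FLOPs


def implied_tflops(seconds: float) -> float:
    """Throughput (in TFLOP/s) implied by finishing the matmul in `seconds`."""
    return flops / seconds / 1e12


cpu_tflops = implied_tflops(50e-3)  # the ~50 ms CPU figure
gpu_tflops = implied_tflops(1e-3)   # the 1 ms upper bound for the GPU

print(f"multiplies:   {multiplies:,}")
print(f"CPU at 50 ms: {cpu_tflops:.2f} TFLOP/s")
print(f"GPU at 1 ms:  {gpu_tflops:.1f} TFLOP/s")
```

The multiply count comes out just under 69 billion, and "under 1 ms" on the GPU corresponds to well over 100 TFLOP/s of sustained throughput.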

What GPUs are bad at, briefly:

Branchy code. If different "cores" need to take different paths, they take turns — fast cores wait for slow cores. The deep-learning solution is to write code with no branches: pad, mask, and apply the same op everywhere.
Small workloads. Spinning up the GPU has overhead. A tiny matmul (say, 32 × 32) is faster on a CPU because the GPU's setup cost dominates. Real deep learning batches things up to amortize this.
CPU↔GPU memory transfers. Moving data between system RAM and the GPU's VRAM (§2) goes over a PCIe bus that's much slower than the GPU's internal bandwidth. Keep data on the GPU.
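The "no branches" advice can be illustrated on the CPU with NumPy, which follows the same apply-one-op-to-everything style. This is only a sketch of the coding pattern, not a GPU benchmark; the array values are made up for illustration.

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])

# Branchy version: one data-dependent decision per element.
relu_loop = np.array([v if v > 0 else 0.0 for v in x])

# Branch-free version: the same op is applied to every element.
relu_vec = np.maximum(x, 0.0)

# Mask-and-select: compute everywhere, then let a mask pick the result.
keep = x > 0
relu_masked = np.where(keep, x, 0.0)

print(relu_vec)
```

All three produce the same values; the second and third express the work as a single uniform operation, which is the shape of computation a GPU handles well.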

Briefly: TPUs are Google's purpose-built tensor accelerators — even more specialized than GPUs, optimized specifically for the matmul + activation pattern. They power most of Google's internal model training (Gemini, et al.). Functionally they're a more aggressive version of the same idea: lots of dumb, parallel arithmetic with controlled memory access patterns.

In a Transformer: every layer is matmuls (Q·K, attention·V, FFN weights) plus a handful of element-wise ops (softmax, GELU, residual add, layer norm). The entire forward pass is "feed the GPU rectangular tensors of work." Whoever designs the architecture is implicitly designing for this hardware reality; whoever trains it is implicitly competing for GPU time.

VRAM — The Hard Bottleneck

Every "CUDA out of memory" error you ever hit is a story written here.

A GPU has its own private memory, separate from the system RAM the CPU uses. It is generically called VRAM; on data-center chips the memory technology is HBM (high-bandwidth memory). An H100 has 80 GB; an A100 has 40 or 80; a 4090 has 24; a 4070 has 12. That number is the binding constraint on what model you can train or run on that GPU. If your model and its scratch space don't fit, you don't train. Period.

For a 7-billion-parameter model in fp16 (2 bytes per number), the math is:

Inference (just the model):
  weights              7 × 10⁹ × 2 bytes  =  14 GB
                       ─────────────────
                       fits on a 24 GB consumer GPU

Training (model + all the training scaffolding):
  weights              14 GB
  gradients            14 GB     ← one per parameter
  Adam momentum        14 GB     ← one per parameter
  Adam variance        14 GB     ← one per parameter
  activations          ~30 GB    ← depends on batch & seq_len
                       ─────────
                       ≈ 86 GB  →  overflows 80 GB H100

Inference fits a 7B model on one GPU easily. Training the same model needs gradients + optimizer states + activations — and overshoots the 80 GB cap.

Two surprises here. First, training takes roughly 5–10× more VRAM than inference for the same model — because the optimizer keeps multiple copies of every parameter, and the forward-pass activations are saved for the backward pass. Second, the same 7B model that runs comfortably on a single 24 GB consumer GPU at inference can't even fit on a single 80 GB H100 at training time. A 70B model is 140 GB just for the weights, period — guaranteed multi-GPU territory.
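The budget above is easy to turn into a small calculator. The sketch below uses the same simplifications as the table (every tensor stored at the same bytes-per-parameter, Adam with two states per parameter, and a rough activation figure passed in by hand); the function names are hypothetical.

```python
GB = 1e9  # decimal gigabytes, matching the arithmetic above


def inference_gb(n_params: float, bytes_per_param: float = 2) -> float:
    """Memory for the weights alone."""
    return n_params * bytes_per_param / GB


def training_gb(n_params: float, bytes_per_param: float = 2,
                activations_gb: float = 30.0) -> float:
    """Weights + gradients + Adam momentum + Adam variance + activations."""
    weights = n_params * bytes_per_param / GB
    return 4 * weights + activations_gb


print(f"7B inference:  {inference_gb(7e9):.0f} GB")
print(f"7B training:   {training_gb(7e9):.0f} GB")
print(f"70B inference: {inference_gb(70e9):.0f} GB")
```

The activation term is the least certain input: it depends on batch size and sequence length, so the 30 GB default is only the illustrative value used in the table.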

The mitigations, in roughly the order people reach for them:

Quantization. Store weights at lower precision — int8 (half the bytes), int4 (a quarter), or fancier 3-bit / 2-bit schemes. The 7B model that weighs 14 GB in fp16 weighs 3.5 GB in int4. Common for inference; tricky during training because gradients want more precision.
Mixed precision training. Keep activations and gradients in fp16 or bf16; keep the optimizer's master copy in fp32 for stability. Roughly halves the activation memory.
Gradient / activation checkpointing. Don't save all activations from the forward pass — save every few layers, and recompute the in-between ones during backward. Trades compute for memory.
Model sharding / tensor parallelism. Cut the model itself across multiple GPUs. ZeRO, FSDP, DeepSpeed, Megatron — every distributed-training framework is some variation of this.
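Flash attention. Re-implement attention so the full n × n attention matrix is never materialized in VRAM, which cuts attention's memory from quadratic to linear in sequence length. On long-context training this can save tens of gigabytes; it is now built into most modern Transformer implementations.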
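Two of these mitigations come down to simple arithmetic. The sketch below computes the weight footprint at different precisions, and the size of the attention score matrix that flash attention avoids storing. The 32-head, 32k-token configuration is a hypothetical example, and the score matrix is assumed to be held in fp16 for one layer at a time.

```python
BITS = {"fp16": 16, "int8": 8, "int4": 4}


def weights_gb(n_params: float, bits: int) -> float:
    """Weight storage in decimal GB at the given bit width."""
    return n_params * bits / 8 / 1e9


def attention_scores_gb(seq_len: int, n_heads: int, batch: int = 1,
                        bytes_per_value: int = 2) -> float:
    """Size of the full (batch, heads, seq, seq) score matrix for one layer."""
    return batch * n_heads * seq_len * seq_len * bytes_per_value / 1e9


for name, bits in BITS.items():
    print(f"7B weights in {name}: {weights_gb(7e9, bits):.1f} GB")

# Hypothetical long-context setting: 32 heads, 32,768 tokens, batch of 1.
print(f"score matrix: {attention_scores_gb(32_768, 32):.1f} GB per layer")
```

The quadratic term is what makes the naive score matrix untenable at long context: doubling the sequence length quadruples it.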

Bandwidth is the other half of "memory" that nobody mentions until they hit it. An H100's VRAM bandwidth is 3 TB/s — enormous, but still finite. Many neural-net operations are memory-bound: the GPU spends more time waiting on data to arrive from VRAM than actually computing. The job of efficient kernels (FlashAttention, fused MLP, etc.) is partly to minimize how often you have to round-trip through VRAM. There's also a memory hierarchy inside the GPU — registers, shared memory / L1 cache (256 KB per SM), L2 cache (50 MB total) — that's much faster than VRAM but much smaller. Performance work is largely "keep data in faster memory longer."
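A rough worked example of what "memory-bound" means in practice. It assumes, as a simplification, that generating one token at batch size 1 requires reading every weight from VRAM once, and it ignores KV-cache reads and any cache reuse. The 3 TB/s figure is the one quoted above.

```python
def seconds_per_token(n_params: float, bytes_per_param: float,
                      bandwidth_bytes_per_s: float) -> float:
    """Lower bound on per-token latency if weight reads are the bottleneck."""
    return n_params * bytes_per_param / bandwidth_bytes_per_s


t = seconds_per_token(7e9, 2, 3e12)  # 7B model, fp16, 3 TB/s
tokens_per_s = 1 / t

print(f"{t * 1e3:.2f} ms per token, at most ~{tokens_per_s:.0f} tokens/s")
```

Under these assumptions the ceiling is set by bandwidth alone, no matter how many FLOPs the chip can perform, which is why batching and fused kernels matter so much for throughput.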

In a Transformer: every architectural choice is partly a VRAM choice. The KV cache — keys and values from past tokens kept around for autoregressive decoding — scales as batch × seq_len × n_layers × d_kv, and for long-context models is often larger than the weights themselves. Reducing KV-cache size (Grouped-Query Attention, Multi-Query Attention, sliding window attention) is half of why modern LLMs can do 128k or 1M tokens of context.
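The KV-cache scaling can be made concrete with a small calculator. The model shape below (32 layers, 32 heads of dimension 128) is a hypothetical 7B-style configuration, and `d_kv` is written out as `n_kv_heads * head_dim`; the leading factor of 2 counts keys and values separately.

```python
def kv_cache_bytes(batch: int, seq_len: int, n_layers: int,
                   n_kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for every layer and position."""
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_value


GIB = 1024 ** 3

# Full multi-head attention: every head keeps its own keys and values.
mha = kv_cache_bytes(batch=1, seq_len=4096, n_layers=32, n_kv_heads=32, head_dim=128)

# Grouped-Query Attention with 8 KV heads shared across the 32 query heads.
gqa = kv_cache_bytes(batch=1, seq_len=4096, n_layers=32, n_kv_heads=8, head_dim=128)

# Same full-attention model at a 128k-token context.
long_ctx = kv_cache_bytes(batch=1, seq_len=131_072, n_layers=32, n_kv_heads=32, head_dim=128)

print(f"MHA, 4k context:   {mha / GIB:.1f} GiB")
print(f"GQA, 4k context:   {gqa / GIB:.1f} GiB")
print(f"MHA, 128k context: {long_ctx / GIB:.1f} GiB")
```

At 128k tokens the cache for this hypothetical model is several times larger than its 14 GB of fp16 weights, which is the situation the KV-reduction techniques are designed to avoid.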

Tensors — Multi-Dimensional Arrays

The basic data unit of deep learning. Every value flowing through a network is one.

Strip away the framework, the model, the abstractions — and what's left is tensors. A tensor is a multi-dimensional array of numbers. The whole modern ML stack (PyTorch, JAX, TensorFlow, NumPy) is just a collection of efficient ways to create, reshape, and combine tensors. Understand tensors and you understand 90% of what an LLM's code does line by line.

The dimension ladder:

0-D (scalar) — a single number. shape = (). The loss at the end of a forward pass; a learning rate; a probability.
1-D (vector) — a row of numbers. shape = (n,). One word's embedding (e.g., a 768-dim vector); a layer's biases.
2-D (matrix) — rows × columns. shape = (m, n). A weight matrix; one sentence's embeddings (sequence × d_model); a covariance matrix.
3-D (tensor) — a stack of matrices. shape = (a, b, c). The bread-and-butter shape for Transformers: (batch, seq, d_model).
Higher (n-D) — add more axes. Multi-head attention works in 4-D (batch, heads, seq, head_dim); video data is 5-D (batch, time, channel, height, width).
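The ladder can be walked in NumPy, whose arrays follow the same shape conventions as tensors in the deep-learning frameworks. The sizes are deliberately tiny and hypothetical; only the number of axes matters here.

```python
import numpy as np

scalar = np.array(0.25)               # 0-D: shape ()
vector = np.zeros(8)                  # 1-D: one embedding
matrix = np.zeros((5, 8))             # 2-D: (seq, d_model)
batch = np.zeros((2, 5, 8))           # 3-D: (batch, seq, d_model)
heads = np.zeros((2, 4, 5, 2))        # 4-D: (batch, heads, seq, head_dim)
video = np.zeros((2, 3, 3, 4, 4))     # 5-D: (batch, time, channel, height, width)

for name, t in [("scalar", scalar), ("vector", vector), ("matrix", matrix),
                ("batch", batch), ("heads", heads), ("video", video)]:
    print(f"{name:>6}: ndim={t.ndim} shape={t.shape}")
```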

A tensor is a multi-dimensional array. Scalar → vector → matrix → 3-D tensor and beyond. Every value flowing through a neural net is one of these.

Every tensor has three things you need to know about it before you can do anything with it:

Shape. A tuple of integers — (32, 512, 768) means 32 examples, 512 tokens each, 768-dim hidden state per token. Almost every neural-net bug is a shape mismatch.
Dtype. The precision: float32 (4 bytes), float16 / bfloat16 (2 bytes), int8 (1 byte), bool. Choice of dtype is half a VRAM (§2) decision, half a numerical-stability decision.
Device. Where the bytes live: cpu, cuda:0, cuda:1, etc. Operations between tensors on different devices error out; you have to .to(device) first.
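Shape and dtype together determine how many bytes a tensor occupies. The sketch below works that out for the (32, 512, 768) example without allocating anything. NumPy is used for its dtype table; it has no notion of a device, so that third attribute is not shown.

```python
import math

import numpy as np

shape = (32, 512, 768)  # (batch, seq, hidden), as in the example above


def tensor_bytes(shape: tuple, dtype) -> int:
    """Bytes needed to store a dense tensor of this shape and dtype."""
    return math.prod(shape) * np.dtype(dtype).itemsize


sizes = {name: tensor_bytes(shape, name) for name in ("float32", "float16", "int8")}

for name, n in sizes.items():
    print(f"{name}: {n / 1e6:.1f} MB")
```

Halving the bytes per element halves the footprint of every activation of this shape, which is where the dtype choice meets the VRAM budget.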

The operations you'll use 95% of the time:

Element-wise: +, *, tanh, relu. Apply per cell; output shape = input shape.
Reductions: sum, mean, max along an axis. Calling .mean(dim=-1) on a (32, 512, 768) tensor gives (32, 512), collapsing the hidden dim.
Reshapes: view, reshape, permute, transpose. Rearrange axes without changing data. The Transformer uses these constantly to swap batch / seq / head dimensions around.
Matrix multiplications: @ in PyTorch / NumPy. (32, 512, 768) @ (768, 768) → (32, 512, 768). The big op; the GPU was built for this.
Indexing / slicing: x[:, 0, :] grabs the first token of every example. x[mask] picks out positions where mask is true.
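Each family of operations can be tried on a tiny tensor. This sketch uses NumPy, where the axis argument is spelled `axis=` rather than PyTorch's `dim=`; the shapes are shrunk to (2, 3, 4) so the results are easy to inspect.

```python
import numpy as np

x = np.arange(24, dtype=np.float32).reshape(2, 3, 4)  # (batch, seq, d_model)
w = np.ones((4, 4), dtype=np.float32)                 # (d_model, d_model)

elementwise = np.tanh(x)         # same shape as x: (2, 3, 4)
reduced = x.mean(axis=-1)        # hidden dim collapsed: (2, 3)
swapped = x.transpose(1, 0, 2)   # batch and seq swapped: (3, 2, 4)
projected = x @ w                # matmul over the last axis: (2, 3, 4)
first_tok = x[:, 0, :]           # first token of every example: (2, 4)
picked = x[x > 10]               # boolean mask indexing: 1-D result

for name, t in [("elementwise", elementwise), ("reduced", reduced),
                ("swapped", swapped), ("projected", projected),
                ("first_tok", first_tok), ("picked", picked)]:
    print(f"{name:>11}: {t.shape}")
```

Note that boolean-mask indexing flattens its result: the output length depends on how many positions are true, which is why it is used sparingly inside GPU code that wants fixed shapes.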

Broadcasting is the one operation that mystifies beginners and then becomes invisible. When you write (32, 512, 768) + (768,), the smaller tensor is automatically "stretched" along the missing dimensions, so the same (768,) bias vector is added to every position of every example. No memory is copied; the broadcast is virtual. Almost every line of neural-network code uses broadcasting to write tensor ops compactly.
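Broadcasting and its "no copy" property can be observed directly in NumPy. In this sketch, `np.broadcast_to` exposes the virtual stretched view: a stride of 0 along an axis means every step along that axis lands on the same bytes. The small shapes are illustrative only.

```python
import numpy as np

x = np.zeros((2, 3, 4), dtype=np.float32)  # (batch, seq, d_model)
bias = np.arange(4, dtype=np.float32)      # (d_model,)

y = x + bias  # bias is broadcast over the batch and seq axes

# The stretched view that the addition uses conceptually.
view = np.broadcast_to(bias, x.shape)

print(y.shape)                        # (2, 3, 4)
print(view.strides)                   # 0 bytes of stride on the stretched axes
print(np.shares_memory(view, bias))   # the view reuses bias's memory
```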

In a Transformer: the input tokens get embedded into a tensor of shape (batch, seq, d_model). That shape stays throughout the network — each layer's output is another tensor of the same shape, just with different numbers. Inside attention, the tensor is briefly reshaped to (batch, heads, seq, d_head) to compute multi-head dot products in parallel, then reshaped back. The final layer projects it to (batch, seq, vocab_size) — one logit vector per position over the vocabulary. Every step is a tensor op.
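The shape bookkeeping described here can be traced in NumPy. This is a shape-only sketch: the same tensor stands in for both queries and keys, the weights are all ones, and the sizes are hypothetical toy values.

```python
import numpy as np

batch, seq, heads, d_head, vocab = 2, 5, 4, 3, 11
d_model = heads * d_head

x = np.arange(batch * seq * d_model, dtype=np.float32).reshape(batch, seq, d_model)

# Split the hidden dim into heads, then move heads next to batch.
split = x.reshape(batch, seq, heads, d_head).transpose(0, 2, 1, 3)

# Per-head dot products; `split` stands in for both Q and K here.
scores = split @ split.transpose(0, 1, 3, 2)

# Undo the split: heads back beside d_head, then merge them.
merged = split.transpose(0, 2, 1, 3).reshape(batch, seq, d_model)

# Final projection to one logit vector per position.
w_out = np.ones((d_model, vocab), dtype=np.float32)
logits = merged @ w_out

print(split.shape, scores.shape, merged.shape, logits.shape)
```

The round trip is lossless: `merged` holds exactly the same numbers as `x`, because splitting and merging heads only rearranges axes.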

Batch, Padding & Mask

How variable-length text actually gets packed into GPU-friendly rectangles.

Real text is a ragged collection of sequences. Real GPUs want rectangular tensors. Three pieces of practical machinery bridge the gap — and they show up in every line of every Transformer training loop. The text primer (§1) introduced the idea; this section nails down the mechanics.

Batch.

You don't feed one sentence at a time. You feed N sentences at once, stacked into a tensor of shape (N, max_len), and let the GPU process all N in parallel. N, the batch size, is one of the most important hyperparameters in deep learning. Bigger batches use more VRAM but keep the GPU busier, average gradients over more examples, and (within a regime) speed up training. Typical values span a wide range: on the order of 32 sequences for fine-tuning a 70B model on a few GPUs, and on the order of 4 million tokens per batch for pre-training the largest models. The exact number depends on what you can fit in VRAM.

Padding.

Sentences in a batch have different lengths. To stack them into a rectangular tensor of shape (N, max_len), you have to make them all the same length. You pad the shorter ones with a special [PAD] token, often token id 0, though the actual id depends on the tokenizer. The longest sentence in the batch sets max_len; everyone else gets PAD tokens appended.

Stacking sentences into a tensor needs uniform length. Padding fixes the shape; the attention mask tells the model which positions are real.

Two practical knobs around padding. First, the choice of max_len per batch: padding to the global maximum (across the entire dataset) wastes compute; padding to the max within each minibatch (called "dynamic padding") wastes much less. Second, the choice of which side to pad: most modern Transformers pad on the right ("right-padding") for training and on the left ("left-padding") for autoregressive generation. Left-padding at generation time keeps each sequence's last real token in the final column, so the next token is appended directly after real text rather than after a run of PADs. Get the side wrong and a decoder-only model silently produces nonsense.
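Dynamic padding and the choice of side fit in one small helper. This is a sketch in plain Python; `pad_batch` and `PAD_ID` are hypothetical names, and the pad id of 0 is an assumption that should be replaced with whatever the tokenizer actually uses.

```python
PAD_ID = 0  # assumption: check the tokenizer's real pad id


def pad_batch(seqs: list, side: str = "right", pad_id: int = PAD_ID) -> list:
    """Pad every sequence to the longest one in this batch (dynamic padding)."""
    if side not in ("right", "left"):
        raise ValueError("side must be 'right' or 'left'")
    max_len = max(len(s) for s in seqs)
    padded = []
    for s in seqs:
        pad = [pad_id] * (max_len - len(s))
        padded.append(list(s) + pad if side == "right" else pad + list(s))
    return padded


batch = [[5, 6], [7, 8, 9, 10, 11]]

right = pad_batch(batch, side="right")  # training-style layout
left = pad_batch(batch, side="left")    # generation-style layout

print(right)
print(left)
```

With left-padding the last column holds a real token for every row, which is the property autoregressive generation relies on.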

Mask.

Padding fixes the shape but creates a new problem: the attention mechanism, by default, will happily compute attention weights between real tokens and PAD tokens. The model would learn to route information through padding, which is meaningless. The fix is a mask — a parallel tensor of 1s and 0s (often as bool) telling the model which positions are real:

sentence A:  ["hi", ".", PAD, PAD, PAD, PAD, PAD]
mask A:      [  1,   1,   0,   0,   0,   0,   0]

sentence B:  ["the", "dog", "runs", "fast", ".", PAD, PAD]
mask B:      [   1,    1,     1,      1,    1,   0,   0]

Inside the attention block, the mask is applied before softmax: the attention scores at PAD positions are set to −∞, so softmax sends their weight to exactly 0. PAD tokens contribute zero to the output; the model behaves identically to running on the unpadded sentences alone (modulo the wasted compute).
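The claim that masked positions get exactly zero weight, and that the result matches the unpadded computation, can be checked numerically. This is a minimal NumPy sketch with random scores standing in for real attention scores; `masked_softmax` is a hypothetical helper, not a library function.

```python
import numpy as np


def masked_softmax(scores: np.ndarray, key_mask: np.ndarray) -> np.ndarray:
    """Softmax over the last axis, with masked-out key positions set to -inf."""
    masked = np.where(key_mask.astype(bool), scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(masked)                                  # exp(-inf) is exactly 0
    return exp / exp.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
scores = rng.normal(size=(7, 7))             # (query position, key position)
mask_a = np.array([1, 1, 0, 0, 0, 0, 0])     # sentence A: two real tokens

weights = masked_softmax(scores, mask_a)

print(weights[0].round(3))  # only the first two entries are non-zero
```

Because the PAD columns contribute exactly zero, the top-left 2 × 2 block of `weights` equals the softmax computed on the two real tokens alone.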

Two flavors of mask, both common in Transformers:

Padding mask: the one just described. Marks which positions are real vs PAD. Present in basically every Transformer.
Causal (autoregressive) mask: used in decoder-only models like GPT. Position t can only attend to positions ≤ t. Implemented as a triangular matrix applied to the attention scores, with 0 on and below the diagonal and −∞ above it. It is combined with the padding mask by a logical AND when both are boolean "may attend" masks, or by addition when both are additive −∞ masks.
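Combining the two masks is a one-liner once both are boolean. The sketch below uses the convention that True means "may attend", and then converts the result to the additive form that is added to the scores before softmax. The 5-token example with 3 real tokens is hypothetical.

```python
import numpy as np

seq = 5

# Causal: query position t (row) may see key positions <= t (columns).
causal = np.tril(np.ones((seq, seq), dtype=bool))

# Padding: three real tokens followed by two PADs (right-padded).
padding = np.array([1, 1, 1, 0, 0], dtype=bool)

# A key is visible only if it is both in the past and a real token.
allowed = causal & padding[None, :]

# Additive form: 0 where attention is allowed, -inf where it is blocked.
additive = np.where(allowed, 0.0, -np.inf)

print(allowed.astype(int))
```

With right-padding every row keeps at least one visible key (position 0), so the softmax never sees a row made entirely of −∞.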

In a Transformer: every single attention call takes a mask. During training, a typical batch has both: causal mask for the autoregressive structure plus a padding mask for the variable lengths. The output is correct, the GPU sees nice rectangular tensors, the loss only counts real tokens (you also mask the loss function — don't penalize the model for what it predicts at PAD positions). This is the unglamorous plumbing that makes batched, parallel training possible at all.
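Masking the loss works the same way as masking attention: multiply by the mask, then average over real tokens only. In this NumPy sketch the per-token loss values are made up, with deliberately large numbers at the PAD positions to show how much they would distort an unmasked mean.

```python
import numpy as np

# Hypothetical per-token losses for 2 sequences of padded length 4.
per_token_loss = np.array([[0.5, 1.5, 9.0, 9.0],
                           [1.0, 2.0, 3.0, 9.0]])

# 1 = real token, 0 = PAD.
mask = np.array([[1, 1, 0, 0],
                 [1, 1, 1, 0]])

# Average over real tokens only.
masked_mean = (per_token_loss * mask).sum() / mask.sum()

# Averaging over every position lets PAD predictions leak into the loss.
naive_mean = per_token_loss.mean()

print(masked_mean, naive_mean)
```

Dividing by `mask.sum()` rather than by the tensor size matters: it keeps the loss scale independent of how much padding a batch happens to contain.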