Memory Hierarchy Primer

Our curl process holds the URL string "https://api.example.com/user/42" somewhere in memory. The question this primer answers: how long does the CPU take to read those 35 bytes — and why does the answer span six orders of magnitude depending on where they actually live? Five sections build the picture: the latency numbers with an interactive scale comparator; why the hierarchy exists at all (the physics that forced the choice); how caches actually work (lines, sets, ways, the prefetcher); locality — the rule that decides whether your code feels fast or slow; and a quick reference for the questions worth being able to answer cold.

The latency numbers, to scale

One L1 cache hit (~1 ns) to one disk read (~10 ms) is the same ratio as one second to four months. Engineers who can recite this table from memory make systematically better performance decisions than those who can't.

The numbers below are the canonical "Latency Numbers Every Programmer Should Know" (Jeff Dean, ~2009, updated for modern hardware). They span ten orders of magnitude. The rescaled column on the right is the most useful pedagogical trick: treat an L1 hit as if it took 1 second, and every other operation scales proportionally. A DRAM access becomes 100 seconds; a cross-DC TCP packet becomes 19 days; a cross-continent round-trip becomes 16 years.

1 / 9

Memorise the order of magnitude per row. Then memorise the right-hand column once — the "DRAM is 100 seconds vs L1's 1 second" intuition is the most useful single mental tool for performance work, because it lets you reason about whether a particular operation is worth caching, batching, or moving across a network boundary.

What this table is for

Memorising these orders of magnitude is the foundation. Two specific decision shapes fall out of it:

The 100× rule between adjacent tiers. Roughly every step down the hierarchy adds a factor of 100. L1 → DRAM is 100×, DRAM → SSD is another 1000×, SSD → cross-continent is another 1000×. This means caching anything at the right tier is a 100–1000× win when it works.
One DRAM miss costs ~100 useful ALU instructions. A 3 GHz core does 3 instructions per cycle in steady state. A 100 ns miss is 300 cycles is ~900 instructions of unrelated independent work the core could have done while waiting. Out-of-order execution recovers some of that, but not all — and the waiting itself is what pushes IPC below 1 in memory-bound workloads.

Why the right-hand column matters more than the left

The raw nanosecond numbers don't map naturally to human reasoning. We don't have intuition for 1 ns vs 100 ns — both feel "instant." Rescaling to seconds-vs-months snaps the gaps into a shape the human brain handles every day. Asking "would I drive across town for this?" for a cross-DC RTT becomes a usable cost frame; asking "would I take a flight?" for a cross-continent RTT does too. Most performance design errors come from underestimating these distances by orders of magnitude.

Two numbers the table doesn't show

Throughput, not latency. The table is per-operation latency. Theaggregate bandwidth at each tier is also worth knowing: ~1 TB/s for L1, ~80 GB/s for DRAM (per channel), ~7 GB/s for NVMe, ~1 GB/s for a 10 GbE NIC. A memcpy across DRAM is bandwidth-bound, not latency-bound; an array iteration that prefetches well runs at near peak DRAM bandwidth despite the latency.

Variance. Latency is a distribution, not a number. A DRAM access averages 100 ns but tail values stretch much higher under contention. A "NVMe SSD" 100 μs latency is the median; the 99th percentile under load may be 10× higher. The right-hand column lies about variance. For tail-latency-sensitive systems, the p99 numbers are what matter, not the medians.

The takeaway. "Performance work starts with the latency table. The rescaled column — L1 hit = 1 second — turns abstract nanoseconds into human-scale time, and once it's in your head you reason about caching, batching, and network boundaries quantitatively instead of by feel. The per-operation numbers are the median; the p99 matters more for any tail-sensitive workload."

Why the hierarchy exists at all

Three physical constraints — speed of light, transistor density, dollar cost — make a single "fast and big" memory impossible. The hierarchy is the forced compromise.

Imagine you could design any memory you wanted. You'd want itfast (latency under one nanosecond), big (terabytes per chip), andcheap (a dollar per gigabyte). Each constraint individually is achievable; any two are; all three are not. The hierarchy is the engineering compromise that gives you a small amount of fast-but-expensive memory next to lots of slow-but-cheap memory, and software gets the illusion of a single contiguous address space.

Constraint 1: speed of light

Light travels about 30 cm per nanosecond in vacuum and roughly half that through silicon and copper. A signal from one corner of a 1 cm² chip to the opposite corner has to travel ~1 cm round-trip, which already costs about 0.1 ns at light speed. Real signal-propagation delays through transistors and wires are several times that. Anything more than a few millimetres from the ALU has a latency floor measured in nanoseconds. SRAM cells next to the ALU stay fast precisely because they're small enough to fit close by; DRAM banks on the other side of the package are far enough away that the round-trip is dominant.

Constraint 2: transistor density and refresh

SRAM (the technology used for caches) holds one bit in 6 transistors. DRAM (the technology used for main memory) holds one bit in 1 transistor + 1 capacitor. So DRAM packs ~10× more bits per square millimetre at the cost of needing to refresh every cell every few milliseconds (the capacitor leaks) and a 100× slower access because each read involves charging up a bit-line through the small capacitor.

Cost-wise: SRAM is ~$100/GB. DRAM is ~$5/GB. NAND flash (SSD) is ~$0.10/GB. Disk is ~$0.02/GB. Each tier is roughly 100× cheaper than the one above. If you priced a machine with 1 TB of L1 cache at SRAM rates, it would cost $100 million. With 1 TB of disk: $20. The hierarchy is what makes computers affordable.

Constraint 3: the heat budget

Faster memory means more power per byte stored, both static (idle) and dynamic (access). A modern desktop CPU can dissipate ~150 W; a server CPU 300–400 W. If you built a chip with 10 GB of cache-grade SRAM, it would draw kilowatts of static power — the silicon would melt long before any program ran on it. The hierarchy keeps the fast tier small enough that it fits in the package's thermal envelope.

How the hierarchy resolves it

Stack small-and-fast on top of big-and-slow. Each level holds a hot subset of the level below it. The CPU's memory subsystem hides the hierarchy from software: a load instruction asks for an address, the L1 cache checks first; if missing, L2; then L3; then DRAM; then (via the OS's page fault handler) disk. From the program's point of view there's just one memory, and the hardware transparently moves bytes through the levels.

The inclusion / exclusion policy decides which level holds which data: an inclusive L3 contains a superset of L1 and L2 (simplifies coherence but wastes capacity); an exclusive L3 holds only what isn't in the inner caches (more total capacity at the cost of complex coherence). Modern Intel is usually inclusive; AMD Zen and Apple are usually exclusive or non-inclusive (a weaker version).

The hierarchy doesn't solve the problem so much as make the average case fast. Worst-case access (cache cold, page table cold, disk seek) is still the catastrophe the numbers from §1 say it is. The whole game of performance work is keeping the hot path on the warm tier.

The takeaway. "Fast memory is small, dense memory is slow, all memory needs power. The hierarchy stacks small-fast SRAM on top of cheap-slow DRAM on top of vast-slow flash, and the hardware moves bytes through transparently. Every level is roughly 100× slower and 100× cheaper than the one above. The whole performance game is keeping hot data on the fast tier."

How caches actually work — lines, sets, ways, the prefetcher

The hierarchy from §2 only works because the caches are clever about whatto keep and where to put it. A handful of mechanisms — cache lines, set associativity, eviction policy, hardware prefetcher — combine to make hot data stay hot.

Cache lines — the unit of transfer

A cache does not load individual bytes from DRAM. It loads cache lines — fixed-size blocks, 64 bytes on every common modern CPU (x86, ARM64, Apple Silicon, POWER). When you read one byte that's not in cache, the hardware fetches the entire 64-byte line containing that byte. Read another byte 10 addresses later — same line, already cached, free. Read another byte 100 addresses later — different line, another full transfer.

This is why spatial locality matters: a cache miss already paid for 64 bytes; using them all is essentially free. Iterating sequentially over an array amortises one DRAM access across many element reads. Iterating in a stride larger than 64 bytes uses one byte per line and wastes the other 63.

Set associativity — where a line is allowed to live

A cache has a finite number of slots and a much larger memory to draw from. When a new line arrives, which slot does it go into? Three strategies:

Direct-mapped: each memory address maps to exactly one slot (determined by some bits of the address). Cheap, fast, but suffers from conflict misses when two hot addresses happen to map to the same slot.
Fully associative: any address can live in any slot. Best hit rate, but requires searching all slots in parallel — too expensive past tiny caches (16–32 entries, like the TLB).
N-way set-associative: the practical compromise. The cache is divided into sets; each address maps to exactly one set; within that set, it can live in any of N ways. Modern L1 is typically 8-way, L3 is 16-way or higher. Lookups compare against N tags in parallel — manageable hardware cost.

Eviction policy — what to drop when the set is full

When all N ways in a set are occupied and a new line arrives, one of them must go. True LRU (Least Recently Used) is theoretically optimal among policies that don't see the future, but tracking exact LRU order requires log₂(N!) bits per set, which is too many for hardware. Real caches approximate it:

Pseudo-LRU (PLRU): a binary tree of bits, O(log N) bits per set, approximates LRU well enough.
Random: picks a way at random. Cheap, surprisingly competitive.
NRU (Not Recently Used): each line carries a "used" bit; clear it periodically; evict among the lines with the bit clear.

For a working set that fits in cache, eviction policy barely matters. For one that doesn't, the difference between LRU and Random is single-digit percent. The win-or-lose factor is whether the working set fits at all — see the locality section.

The hardware prefetcher

Once the CPU notices a sequential access pattern (consecutive lines, or a fixed stride), it starts prefetching — issuing speculative loads for lines the program hasn't asked for yet, hoping it will. A modern prefetcher tracks multiple streams independently, can detect stride patterns up to a few cache lines apart, and can even prefetch across page boundaries with cooperation from the TLB.

The prefetcher is invisible until it isn't. A linear sweep over an array runs near peak DRAM bandwidth because the prefetcher is keeping the cache hot. A pseudo-random access pattern (linked-list traversal, hash table probing) defeats it: no stride to detect, so the prefetcher can't help. This is the technical reason "cache-friendly data structures" (arrays, struct-of-arrays) beat "cache-hostile" ones (linked lists, trees of pointers) at the same algorithmic complexity.

A brief note on the TLB

Every memory access goes through virtual → physical address translation. TheTLB (Translation Lookaside Buffer) is a tiny cache (~64 entries L1, ~1500 entries L2 on a modern Intel core) of recently-translated page-table entries. When a translation misses the TLB, the CPU has to walk the page tables in DRAM — 4 memory reads on x86-64, each potentially a cache miss. The TLB is its own performance cliff: programs touching pages outside the TLB's coverage (working sets > ~256 MB on default 4 KB pages, or any random-access pattern over a large heap) can spend most of their time in page walks rather than the data access itself. The virtual-memory primer covers this in depth.

The takeaway. "Caches transfer 64-byte lines, not bytes. They're set-associative (typically 8-way), with an LRU-approximating eviction policy. The hardware prefetcher watches for stride patterns and loads ahead speculatively — which is why arrays beat linked lists at the same complexity. Every memory access also passes through the TLB, a separate small cache for page-table entries that can be its own bottleneck."

Locality — the rule that decides everything

Hot data stays hot (temporal). Nearby data comes along for the ride (spatial). Programs that respect these two principles run 10–50× faster than ones that don't — same algorithm, same complexity, same hardware.

Caches are bets that the program's access pattern is non-uniform — that recently-used and nearby-used data is more likely to be accessed again. Two flavours of bet:

Temporal locality

If you used address X recently, you'll probably use it again soon. The loop counter, the current request struct, the file descriptor table, the head pointer of a list — all hammered repeatedly within a small time window. Cache replacement policies (LRU and approximations) bet on this directly: keep what was used most recently; evict what hasn't been touched.

Programs that violate temporal locality run anyway — they just keep missing in cache. A loop that touches one new page per iteration and never revisits is guaranteed to flush the cache continuously, no matter the size. This is the access pattern of memset, memcpy, hashing through a large Bloom filter, and any "scan everything once" workload. Bandwidth, not cache, is what limits them.

Spatial locality

If you used address X, you'll probably use the address next to X soon. Arrays, struct fields, protocol headers — adjacent items get touched together. Caches exploit this by transferring 64 bytes per miss, not one byte. So reading one byte of the URL string drags in the next 63 bytes for free; reading the first field of a struct typically loads the next several fields too.

Matrix iteration — the canonical demo

A 2-D C array M[N][N] is stored row-major: M[i][j] sits next to M[i][j+1] in memory; M[i+1][j] is N elements away. Iterating in the order for i: for j: touches sequential addresses — every miss covers the next 64 bytes (~8 doubles or 16 floats), and the prefetcher kicks in. Iterating for j: for i: jumps by N elements each step — every access lands on a different cache line.

// Row-major (cache-friendly)            // Column-major (cache-hostile)
for (int i = 0; i < N; i++)              for (int j = 0; j < N; j++)
  for (int j = 0; j < N; j++)              for (int i = 0; i < N; i++)
    sum += M[i][j];                          sum += M[i][j];

On a 4096×4096 matrix of doubles (128 MB), the wrong order is typically 5–10× slower despite executing the exact same number of instructions. Fortran defaults to column-major storage, so the same loops in Fortran have the opposite cache behaviour. This is why ports of numerical code between C and Fortran sometimes change the loop ordering for no reason a domain expert can see.

Array-of-Structs vs Struct-of-Arrays

Suppose you have 1 million particles, each with position (3 floats), velocity (3 floats), and color (3 floats). Two ways to store them:

// Array of Structs (AoS) — natural OOP layout
struct Particle { vec3 pos, vel, color; };
Particle particles[1'000'000];

// Struct of Arrays (SoA) — cache-friendly for SIMD
struct Particles {
    float pos_x[1'000'000], pos_y[1'000'000], pos_z[1'000'000];
    float vel_x[1'000'000], vel_y[1'000'000], vel_z[1'000'000];
    float color_r[1'000'000], color_g[1'000'000], color_b[1'000'000];
};

For an update step that only reads positions and velocities, SoA brings in only 6 floats per particle on each cache line, fully utilising bandwidth. AoS brings in all 9 floats per particle even though color isn't needed — 30% wasted bandwidth. SoA also lets the compiler vectorise much more easily (each component array can be processed with SIMD). Game engines, physics simulators, and the inner loops of every machine-learning framework migrate from AoS to SoA for exactly this reason.

Cache-friendly vs cache-hostile data structures

std::vector<T> beats std::list<T> for almost any access pattern, despite having identical algorithmic complexity for many operations. The vector's contiguous storage makes spatial locality automatic; the list's pointer-per-element makes every traversal step a potentially-uncached load.
B-trees beat binary search trees in databases because each B-tree node fits in one or a few cache lines (or pages), so each comparison brings in 64+ keys at once instead of 1. B-trees are basically "binary trees, but cache-aware."
Hash maps have great amortised complexity but variable cache behaviour. Open-addressing hash maps (Robin Hood, swiss tables) beat chaining hash maps in practice because all entries are in one contiguous array; chaining has a pointer-chase per collision.
Cache-oblivious algorithms (matrix multiplication via recursive subdivision, van Emde Boas trees, funnel sort) get good cache performance without knowing the cache parameters, by recursively breaking the work into pieces of all sizes. The base case eventually fits in some cache, no matter what size it is.

The working set cliff

Performance changes smoothly as long as the working set fits in some cache level, and then steps down dramatically when it spills over. A workload with a 30 MB working set on a chip with 32 MB L3 cache runs near peak; the same workload with 34 MB working set drops to DRAM speed for every access. The cliff is sharp because the cache miss rate goes from ~0% to nearly 100% over a small range of working-set sizes.

Practical implication: perf stat -e cache-misses,LLC-load-misses on Linux tells you immediately whether your hot loop fits in cache. If LLC-miss rate is high, either shrink the working set, change the access pattern, or accept DRAM-bound performance.

The takeaway. "Locality is the rule: programs that touch recently-used and adjacent data run at cache speed; programs that don't run at DRAM speed, ~100× slower. Concretely: prefer contiguous storage (vectors, B-trees) over pointer-heavy structures (lists, trees of pointers); iterate row-major over row-major arrays; switch to SoA when your hot loop only needs some fields. The working-set cliff is real and sharp — measure with perf rather than guessing."

Quick reference

Six questions worth being able to reason about cold, and five red flags to spot in a code review. Internalize the question prompts; the answer scaffolds will follow.

What's the difference between temporal and spatial locality?

Temporal: recently-used data will likely be used again soon — this is what LRU eviction bets on. Spatial: data near recently-used data will likely be used soon — this is why caches transfer 64-byte lines instead of single bytes. Most real workloads have both; programs that have neither cannot benefit from any cache.

Why is a cache line 64 bytes?

A compromise. Smaller lines waste cache by needing more tag overhead per byte stored, and underutilise the burst transfer rate of DRAM. Larger lines amortise better but waste bandwidth when only a few bytes are actually needed (false sharing also gets worse — see the cache-coherence primer). Sixty-four bytes hit the sweet spot for most workloads and has been the industry standard since ~2003.

Explain how a set-associative cache lookup works.

An address is split into three fields: offset (bottom 6 bits — selects which byte within the line), index (middle bits — selects the set),tag (top bits — identifies which line is actually stored). On a lookup, the index selects the set; the N tags in that set are compared in parallel with the request's tag; if any match, that's a hit and the data is returned from the corresponding way; if none match, it's a miss and the line is fetched from the next level.

What is a TLB miss and what's its cost?

The TLB caches recent virtual→physical address translations. When a translation isn't in the TLB, the CPU walks the page table — 4 memory loads on x86-64, each potentially a cache miss, so a TLB miss can cost 100+ cycles even when the data itself is in cache. Workloads touching pages outside the TLB's coverage (~256 MB on default 4 KB pages) spend significant time in page walks. Huge pages (2 MB or 1 GB on x86) extend the TLB's reach 512× or 262 144×.

Why does row-major vs column-major matter for matrix iteration?

A C array M[N][N] stores M[i][j] next to M[i][j+1] in memory (row-major). Iterating for i: for j:touches consecutive addresses, hits in cache, and the prefetcher kicks in. Iterating for j: for i: jumps by N elements per step, putting every access on a different cache line. On a 4K×4K double matrix the wrong order runs 5–10× slower despite executing identical instructions. Fortran defaults to column-major storage, so the same loop order has the opposite cache behaviour there.

What's the practical difference between cache-friendly and cache-hostile code?

Cache-friendly code touches data in patterns the caches expect: sequential or stride-based access, contiguous storage, locality between accesses. Cache-hostile code uses pointer chasing, random access, or strides larger than 64 bytes — every access likely misses, and the prefetcher can't help. At the same algorithmic complexity, the gap is typically 5–50× in wall-clock time. The two canonical examples: std::vector beats std::list, B-trees beat binary trees.

Red flags in code review

Linked-list traversal in a hot inner loop. Every step is a potentially-uncached pointer load. Convert to a contiguous container or accept the cost.
Array of pointers (AoP) where Array of Structs would do. std::vector<std::unique_ptr<T>> when std::vector<T> would work means every access is two cache lines instead of one.
Hash map with poor locality between probe steps. Chaining hash maps in tight loops; consider open-addressing (Robin Hood, swiss tables).
Strided access where the stride is a multiple of a cache-set size. Accessing M[i][0] for many i in a row-major matrix whose row-bytes are a multiple of the L1 set size hits the same set repeatedly and exhausts its N ways — a conflict miss. Padding the row by one cache line breaks the pattern.
Working set just over LLC size. A struct that's 5% larger than necessary can push the working set off the cliff. Pack struct fields, remove cold fields to a separate struct, or use smaller types where possible.