CPU Architecture Primer

Our curl command compiles down to a stream of machine instructions — push a register, jump to a function, read a byte from memory. Every one of those instructions visits the same small set of hardware structures inside a CPU core. Five sections build the picture: the fetch / decode / execute loop and the registers it touches; pipelining with an interactive 5-stage walker showing data hazards and stalls; branch prediction — how a CPU keeps a 20-stage pipeline busy when an if would otherwise empty it; out-of-order execution — the bookkeeping that lets a single core sustain more than one instruction per cycle; and a quick reference for the questions worth being able to answer cold.

One core, one loop

Strip away everything else and a CPU is a small finite state machine running one loop forever: fetch the next instruction, decode it, execute it. Every other piece of hardware in this primer exists to make that loop go faster.

Inside one core, three structures matter: the register file (16 named 64-bit registers on x86-64, 31 on ARM64, plus FP and vector banks), the ALU (the math circuit — addition, comparison, bitwise ops), and the control unit (the small state machine that orchestrates everything). Plus a program counter register, conventionally called PC or RIP/EIP, which holds the address of the next instruction to execute.

+------------------- one core -------------------+
|                                                |
|  PC →  Fetch → Decode → Execute → Writeback    |
|         ↑       ↑         ↑          ↑         |
|       L1-I   Decoder    ALU/FPU   Registers    |
|                                                |
|         ↓ syscall, interrupt, exception        |
|                                                |
+------------------- one core -------------------+

The loop is conceptually simple: read the byte(s) at PC from L1 instruction cache, parse them to figure out which operation they encode (decode), run the operation on operands from the register file (execute), write the result back to a register (writeback), advance PC to the next instruction, repeat. That's every operation your computer has ever done.

Clock speed and what it actually means

A 3 GHz CPU ticks 3 billion times per second, so one tick (one clock cycle) is ~0.33 ns. The marketing implication is that you should get 3 billion instructions per second per core, but the reality is messier. A single instruction takes anywhere from one cycle (an integer add on registers) to hundreds (a divide, or a load that misses all cache levels). And modern cores routinely sustain more than one instruction per cycle (IPC > 1) thanks to the tricks in the next three sections. So "3 GHz × N cores = 3N billion instructions/sec" is a bad estimate in both directions.

Registers vs memory

Registers are where the action happens. Every ALU operation reads its operands from registers and writes results to registers. To work with memory you need explicit load and store instructions that copy bytes between registers and memory addresses. Registers themselves don't have an "address" — they're named (rax, rbx, … on x86) and the instruction opcodes encode the names in their bits. Accessing a register is part of executing the instruction and takes no extra time. Accessing memory takes 1 ns (L1) to 100 ns (DRAM) to 10 ms (disk) — the memory-hierarchy primer covers the numbers in depth.

Why so few registers? Encoding space: each operand slot in an x86 instruction is 4 bits (16 choices, after some tricks 32 with the REX prefix). More registers would mean longer instructions, larger code size, and worse instruction-cache density. ARM64 chose 31 because the fixed 32-bit instruction format had room for 5-bit operand fields and the designers traded some bits to get there. The compiler then does register allocation — choosing which variables live in registers at each point and which get "spilled" back to memory — as one of its most consequential optimisations.

Compilers also do inlining aggressively because keeping a function's intermediate values in registers across a call is hard; once you cross a function boundary the calling convention forces some registers to be saved. Modern code that looks like 50 small function calls per request often executes as one big inlined block across all of them, with everything live in registers. That's where the machine-vs-source-line mismatch many engineers hit in a profiler comes from.

Where this leaves us

The naive picture — one instruction per cycle, sequential — is essentially never how a modern core actually executes. The next three sections explain three layers of machinery the CPU adds to that loop: pipelining (do 5 things at once from one sequential stream), branch prediction (guess what comes next so the pipeline stays full at every if), and out-of-order execution (reorder freely as long as the visible result looks sequential). Together they buy roughly 10–20× the throughput a naive implementation would achieve at the same clock.

The takeaway. "A core is a fetch-decode-execute loop driven by the program counter, with operands shuttled through a small register file. Clock speed sets the upper bound, but real throughput depends on how cleverly the hardware overlaps and reorders work behind that loop — the next three sections are what that cleverness looks like."

Pipelining — five things at once

A naive CPU runs one instruction to completion before starting the next, leaving four out of five hardware units idle at any moment. Pipelining is the rule that says: don't wait — start the next instruction as soon as the current one moves to the next stage.

Split the instruction loop into stages. The classic 5-stage MIPS pipeline is Fetch → Decode → Execute → Memory → Writeback. Each stage has its own hardware (instruction cache, decoder, ALU, load/store unit, register-file write port) and each stage takes one cycle. With one stage holding one instruction, only one unit is busy. The pipeline trick is to keep all stages busy simultaneously by feeding a new instruction into Fetch every cycle. After a 4-cycle warm-up, the pipeline retires one instruction per cycle — five instructions complete in nine cycles instead of twenty-five.

Modern x86 and ARM cores have 10 to 20 pipeline stages, not five. Why so many? Because each stage has to fit in one clock cycle and modern clocks are very fast. Splitting work into more, smaller stages lets each stage do less per cycle, which allows a higher clock frequency. The cost is that any hiccup that flushes the pipeline now wastes 15+ cycles instead of 4. The next section is what gets done about that.

1 / 4

IPC = 1 (one instruction per cycle) is the theoretical maximum for a single-issue in-order pipeline. Real workloads hit it less often than you'd hope. Data hazards cost a cycle each; branch mispredicts cost a flush of every fetched-but-not-retired instruction (15+ slots in a modern 20-stage pipeline); cache misses to DRAM cost ~100 cycles. The next two sections — branch prediction and out-of-order execution — are what real CPUs build to defend IPC against these costs.

The three hazards

Data hazard. Instruction B reads a register that instruction A (still in flight) hasn't written yet. The pipeline must stall B until A's result is available. Modern CPUs forward the result directly from Execute's output to the next cycle's Execute input, eliminating most stalls — but a load-use hazard (B needs the value A is loading from memory) is unavoidable: the value isn't available until A passes through Memory, which is one cycle later than Execute.
Control hazard. The next instruction's address depends on the outcome of a branch that's still in the pipeline. Without prediction, the fetcher must stall until the branch resolves — a 15-cycle bubble in a 20-stage pipeline. With prediction (next section), it guesses; if wrong, the pipeline flushes everything fetched along the wrong path. Mispredictions are the dominant cost of branchy code.
Structural hazard. Two instructions want the same hardware unit at the same time — for example, both want the single load/store port in the same cycle. Modern designs throw hardware at this (more ports, more functional units) so structural hazards rarely show up in performance analysis, but they constrain how aggressive an in-order pipeline can be.

Why pipelining doesn't give you 1 IPC for free

Two upper bounds limit the effective IPC of an in-order pipeline. First, hazards: any stall or flush inserts a bubble, dragging IPC below 1. Second, the longest stage sets the cycle time: if one stage takes 1.5 ns and the others take 1.0 ns, the clock has to run at 1.5 ns regardless. Designers spend enormous effort balancing stages, which is why a modern Apple M-series or Intel core actually has many sub-stages within each of the "5" classical ones.

The interesting thing for software engineers is that this is mostly invisible — until it isn't. A linear loop over an array runs near 1 IPC. A loop with an unpredictable branch inside drops to 0.5 IPC or worse. A loop where every iteration depends on the previous (the loop-carried dependency) is bounded by the load-use latency regardless of the pipeline. Vectorisation and OoO execution exist to break those ceilings.

The takeaway. "A pipeline overlaps 5 (or 20) instructions across hardware stages so the steady-state throughput is one instruction per cycle. Hazards — data, control, structural — insert bubbles. The cost of a flush scales linearly with pipeline depth, which is why modern 20-stage pipelines work hard to avoid flushes (branch prediction, OoO execution)."

Branch prediction — guess, then verify

Every if, every loop back-edge, every virtual-method dispatch is a branch. Waiting for each branch to resolve before fetching the next instruction would empty a 20-stage pipeline 15 cycles deep. The CPU doesn't wait — it guesses, and most of the time it's right.

Modern branch predictors hit 95–99% accuracy on typical code. The hardware that achieves this is more sophisticated than most engineers realise — modern Intel and AMD predictors use neural-style perceptron logic or TAGE (TAgged GEometric history length) predictors that track patterns across thousands of past branch outcomes. The details vary by generation; what matters for software is the shape of what they handle well and what they don't.

What predictors are good at

Loop back-edges. A loop that runs N times is taken N-1 times then not taken once. The predictor learns this after one execution and gets all subsequent ones right. This is why straight-line iteration over an array runs near peak IPC.
Highly biased branches. An error-check that fails 0.1% of the time gets predicted "not taken" permanently after one observation. The slow path is just slow; the fast path stays fast.
Pattern branches. A loop that alternates true / false in lockstep gets predicted correctly once the predictor latches onto the period. Predictors track histories long enough to catch repeating patterns up to about 16–32 outcomes deep.

What predictors are bad at

Data-dependent branches with no pattern. The famous Stack Overflow question: a loop that branches on whether each array element is > 128 runs ~5× faster on a sorted array than an unsorted one. Sorted: branch always taken for a while, then always not-taken — perfectly predictable. Unsorted: 50/50, so ~50% mispredict rate × 15-cycle flush = bad day. std::sort before a hot filter can pay for itself many times over.
Indirect branches with many targets. Virtual-method dispatch on astd::vector of polymorphic objects, switch on an enum with 50 cases, a megamorphic JIT call site. The branch target is data-dependent, not just direction. Indirect-branch predictors (BTB — Branch Target Buffer) have smaller capacity and harder problems.
Spectre-style adversarial patterns. An attacker who can repeatedly train the predictor to expect a certain outcome, then trigger the branch with different data, can leak victim memory through the cache-timing side channel of the resulting speculative execution. Mitigations (LFENCE, retpoline, etc.) cost ~5–10% in throughput on affected workloads.

What to do about it as a programmer

Most code shouldn't worry about branch prediction. For the rare hot loop where it matters:

Sort or partition the data before processing if a branch is data-dependent and you have a tight inner loop over it.
Replace branches with branchless arithmetic when possible — max as (a - b) & ((a - b) >> 31) tricks, or__builtin_expect, or conditional moves (cmov). The compiler does this automatically when it can.
Hoist invariant branches out of inner loops. The compiler usually does this, but template-heavy code with many indirections can hide branches the optimiser can't see through.
Use perf stat -e branch-misses on Linux to measure. If misses-per-instruction is below 1%, branches are not your problem.

The takeaway. "Branch predictors hit 95–99% on typical code and the pipeline stays full. The 1–5% they miss costs ~15 cycles each in a modern pipeline. The pathological cases — data-dependent branches with no pattern, indirect branches with many targets — are where std::sort before a hot loop, branchless arithmetic, or pattern-tagged dispatch tables become real wins. Otherwise the predictor is invisible."

Out-of-order execution — sustain more than one instruction per cycle

A pipelined in-order CPU caps at 1 IPC. Modern cores routinely hit 3–4. The trick: dispatch instructions out of program order as soon as their inputs are ready, then commit them back in program order so the observable behaviour is unchanged.

Consider a tight loop where every iteration reads the previous one's result. Even with perfect pipelining, you can't start iteration N+1 until N's output is computed — the loop-carried dependency serialises everything. Now consider a loop where each iteration is independent: load a[i], add b[i], store to c[i]. Pipelining alone gives 1 IPC. But the CPU has multiple ALUs, multiple load units, multiple store units — it can run several independent instructions simultaneously if they happen to be available. The challenge is finding them in the instruction stream.

Out-of-order (OoO) execution is the mechanism. Three structures cooperate:

The reorder buffer (ROB). A FIFO of in-flight instructions in program order. An instruction enters the ROB when it's decoded and leaves (commits) when its result can be made architecturally visible. Sizes range from ~150 entries on older cores to ~600 on Apple M-series.
The reservation station (RS, also called scheduler). A pool of decoded instructions waiting for their inputs. As each input becomes available, the RS broadcasts it; instructions whose inputs are now all ready get fired into the execution units, possibly out of program order. The RS is what lets the CPU find independent work — it's the "parallel discovery" engine.
Register renaming. The 16 (x86) or 31 (ARM) architectural registers are an alias for hundreds of physical registers. When an instruction writes rax, the hardware actually writes a fresh physical register and remaps rax to it. This eliminates false dependencies: two independent operations that both happen to write rax can run in parallel because each gets its own physical register.

With these three working together, a single core can dispatch 4–6 instructions per cycle, with the ROB making sure the commit order looks sequential. The famous IPC metric (instructions per cycle) measures how well this is working. On well-structured code, Apple M-series and Intel Sapphire Rapids both sustain IPCs near 3.5; on dependency-bound code (chained pointer chasing, tight reductions), they struggle to break 1.

What software can do (and not do)

The compiler does most of the work — instruction scheduling, dependency-breaking, loop unrolling — to give the CPU plenty of independent instructions to find. As a programmer, you mostly help by:

Breaking dependency chains. A loop that sums 1 million numbers into a single accumulator has a serial dependency chain of length 1 million. Sum into 4 accumulators and combine at the end — same answer, 4× the parallelism, the compiler often does it automatically if you write it cleanly.
Letting the CPU see ahead. Large ROBs can hold ~600 instructions in flight, but they need to see ahead — which means not branching to an unpredictable target every few instructions. Inlined functions, hoisted loop invariants, and predictable branches all give the OoO window more work to chew on.
Pointer chasing is the OoO killer. A linked-list traversal where each step depends on the load of the previous node bottlenecks on memory latency, not ALU width. Arrays and SoA (struct-of-arrays) layouts let the prefetcher and OoO machine actually do their jobs. This is why "cache-friendly" means "layout the data so OoO can run parallel loads."

Speculative execution and Spectre / Meltdown

The cleverness has a dark side. Speculative execution — performing operations whose results may turn out to be discarded (because of a mispredict or an exception) — was historically invisible to software: the side effects of speculation were rolled back when speculation was abandoned. Spectre and Meltdown (2018) showed that this isn't actually true: the cache state changed during speculation persists after rollback, and a carefully crafted attack can read out a victim process's memory by triggering speculative loads of attacker-controlled addresses and measuring cache-timing afterwards.

The mitigations — speculation barriers, retpolines for indirect branches, page-table isolation between user and kernel — cost real performance, typically 5–30% depending on workload. They're also incomplete: roughly every six months, a new microarchitectural side channel surfaces. The general lesson is that speculative-execution side channels are now part of the threat model for any multi-tenant system (cloud VMs, browsers, anything that runs untrusted code on shared hardware).

The takeaway. "OoO execution lets a single core sustain IPC > 1 by dispatching independent instructions early and committing them in program order. The bottlenecks are dependency chains (loop-carried, pointer chasing) and the size of the lookahead window. Speculative execution is the same mechanism that enabled Spectre / Meltdown — its observable side effects on cache state turn out not to be reversible."

Quick reference

Six questions worth being able to reason about cold, and five red flags to spot in a code review. Internalize the question prompts; the answer scaffolds will follow.

Why does a 20-stage pipeline run faster than a 5-stage one, if a branch mispredict costs more cycles?

More stages means each stage does less work, which lets the clock run at a higher frequency. The deeper pipeline retires one instruction per cycle in steady state just like the shorter one — but each cycle is shorter. The flush cost grows linearly with depth, but branch predictors compensate by being >95% accurate on typical code. The net is positive: higher clock × similar steady-state throughput > (occasional bigger flush) penalty.

Explain branch prediction in 90 seconds.

Every if, loop back-edge, and indirect call is a branch. Without prediction, the pipeline must stall until the branch resolves — a 15-cycle bubble in a 20-stage pipeline. Modern predictors use neural/TAGE algorithms to track up to 32+ outcomes of past branches; they reach 95–99% on typical code. Mispredicts flush every fetched-but-not-retired instruction. The pathological cases are data-dependent branches with no pattern (sort the data) and indirect branches with many targets (megamorphic dispatch).

What is the difference between in-order and out-of-order execution?

In-order: the CPU executes instructions strictly in the order the program lists them. The IPC ceiling is 1, less when hazards stall. Out-of-order: the CPU dispatches instructions as soon as their inputs are ready, regardless of program order, then commits results back in program order via the reorder buffer. The IPC ceiling rises to 4–6 because multiple independent instructions can run in parallel across multiple ALUs.

Why does sorted-array iteration sometimes outperform unsorted on a tight branch?

The famous Stack Overflow case: a loop that branches on arr[i] > 128. On a sorted array the branch is always-taken for the first half, always-not-taken for the second — the predictor learns it after one run. On unsorted data the outcome is ~50/50 random and prediction can't help. Each mispredict costs 15+ cycles in a deep pipeline. The 5–6× speedup from pre-sorting is real and shows up immediately in perf stat.

What does IPC measure and what limits it?

IPC = instructions per cycle, averaged over a sample. The theoretical ceiling on modern cores is 4–6 (the issue width). Real workloads are limited by, in roughly decreasing order: dependency chains (loop-carried, pointer chasing), branch mispredicts (in branchy code), cache misses (when memory-bound), and structural hazards. Memory-bound code with cache misses often runs at IPC well under 1; tight arithmetic loops with no dependencies can hit 3–4.

What is speculative execution and what made it dangerous after 2018?

Speculative execution = the CPU performing operations whose results may be thrown away (because a branch was mispredicted, or an exception will be raised). The assumption was that discarded results had no side effects. Spectre and Meltdown (2018) showed that this is false: cache state changed during speculation persists after rollback, and a carefully crafted attack can read victim memory by triggering speculative loads of attacker-controlled addresses and measuring cache-timing. Mitigations cost 5–30% performance depending on workload.

Red flags in code review

Megamorphic dispatch in a hot inner loop. A virtual method called on a heterogeneous collection where the concrete type varies. The indirect-branch predictor (BTB) has small capacity and can't keep up.
Data-dependent branches without sorting upstream. Any loop branching on input values that aren't sorted or partitioned will mispredict heavily. If you know data has structure, sort or bucket it once and reap the benefit on every subsequent pass.
Loop-carried dependency chains in a numerical reduction. A single accumulator running over a million values serialises the entire sum. Use multiple accumulators or rely on the compiler's -ffast-math mode (with all the caveats that implies for FP).
Function pointers / virtual calls in numerical kernels. Each call is an indirect branch. If the body is small, the CPU spends more time predicting the target than executing.
Single-threaded loop where each iteration loads what the previous one stored. The store-load forwarding can't cover dependencies that escape the store buffer. This shows up as IPC stuck near 0.5 even on simple arithmetic.