Cache Coherence Primer

Our server has 32 cores serving incoming HTTP requests. Each core has a private L1 (~32 KB) and L2 (~1 MB) cache; only L3 and DRAM are shared. When two cores both touch the connection-pool reference counter, what stops them from seeing each other's stale values — and what makes that "stopping" the single most expensive thing two cores can do to each other? Five sections build the answer: the contract (coherence vs consistency, the distinction most engineers conflate); MESI with an interactive state walker; the price — bounces, atomic ops, the memory model; false sharing with a throughput-cliff demo; and finally a quick reference with six cold questions and the answer scaffold for each.

The contract — coherence vs consistency

One core is easy. The instant you have two, software gets a problem split into two distinct guarantees that most engineers conflate. Getting them straight is the piece of this primer most worth nailing down before any other.

With one core, "memory" is simple: write to address X, read from X, get back what you wrote. Multi-core makes that promise harder to keep — and answers it with two layered guarantees most engineers, at every level, conflate. Pulling them apart is the single piece of this primer most worth understanding deeply — and the first thing you reach for the moment a concurrent bug hits production.

The setup. A modern server CPU may have 32 cores. Each core has a private L1 (~32 KB) and a private L2 (~1 MB). All cores share an L3 (32–64 MB) and DRAM. When core A reads address X, the cache line containing X is loaded into A's private L1. If core B reads X moments later, B's L1 fills too — now two physical copies of the same logical memory location exist. If both write X, what should either core's next read return?

The hardware answers with two distinct guarantees:

Coherence is a per-location property. Informally: "any single memory location, viewed across all cores, looks like a single FIFO of writes." Formally: (1) every write to X eventually becomes visible to every core; (2) all cores observe writes to X in the same order. Coherence is what MESI (next section) enforces. Without it, x = 1; print(x) from the same core could already misbehave.
Consistency is a cross-location property. It governs the order in which writes to different locations become visible. The canonical case is producer / consumer:

// Producer (core A):       // Consumer (core B):
data = 42;                   while (!flag) {}
flag = true;                 print(data);

Coherence guarantees core B will eventually see flag = true. Coherence says nothing about whether B observes data = 42 before, after, or interleaved with flag = true. That's the consistency question — and on hardware with a weak memory model (ARM, POWER, RISC-V) without explicit fences, core B can legally print 0.

Why people conflate them. Both are properties of the cache hierarchy. Both are enforced by hardware. The difference is what they guarantee: coherence is roughly free and total — you never reason about it explicitly in code. Consistency is what volatile (Java), std::atomic memory orderings (C++), Ordering::Acquire/Release (Rust), and the mfence / dmb ish instructions actually configure.

The takeaway. "Cache coherence means you'll never read a stale value of one address. Memory consistency means you can't necessarily count on the relative order of values across different addresses without telling the hardware you care." The protocol you'll see next (MESI) is the coherence machinery. The memory orderings in your language standard are the consistency knobs.

One thing the textbooks fudge. Even with perfect coherence, there is no global, instantaneous "real" value of an address — only what each core has cached, what's in transit on the bus, and what's eventually settled in DRAM. The protocol's job is to make sure the observable sequence of values is consistent with some single global FIFO order. The physical state at any given picosecond is much messier than that abstraction suggests, and software is intentionally not allowed to look at it.

MESI — four states, one protocol

Every cache line in every cache wears a 2-bit tag. The whole protocol is the rules for transitioning between those four tags as cores read, write, and snoop each other.

Coherence is what MESI implements. The protocol's name is the four states each cache line can be in: Modified, Exclusive, Shared, Invalid. The complete state table is small enough to memorise — and worth memorising, because every coherence-related question that comes up in production debugging lives somewhere in it.

state  copies elsewhere?  matches DRAM?  this core can…
─────────────────────────────────────────────────────────
  M    no                  no, DRAM stale  read + write freely
  E    no                  yes             read + write freely
                                            (write → M, silent)
  S    yes (other cores)   yes             read freely;
                                            write must broadcast first
  I    line is empty/stale —               next access misses

Transitions are driven by two kinds of event. Local processor events — PrRd (this core reads), PrWr (this core writes) — and bus / snoop events from other cores: BusRd (someone else is reading), BusRdX (someone else wants exclusive), BusUpgr (a sharer wants to upgrade S→M), BusWb (a writeback to DRAM). Every cache snoops the bus continuously, even while doing its own work; the protocol's correctness depends on it.

The exhaustive state-transition table is too long to fit inline, but the rules that matter are intuitive:

First access (PrRd, line in I): issue BusRd. If no one else has it, the line arrives in E. If at least one other cache has it, both end up in S.
Write to an Exclusive line (PrWr, line in E): transition silently to M. No bus traffic — no one else cares yet.
Write to a Shared line (PrWr, line in S): issue BusUpgr (or BusRdX), forcing all other sharers from S → I. The writer ends in M. This is the most expensive common case.
Foreign read of a Modified line (snoop BusRd, line in M): writeback the dirty data (so DRAM gets updated), supply the line to the requester directly, both end in S.
Foreign write of any cached line (snoop BusRdX / BusUpgr): drop to I.

1 / 8

Step through the protocol that keeps two private L1 caches consistent. Notice the last two transitions: a silent write (already M) costs nothing on the bus, but the very next foreign read forces a writeback. Repeating that pattern is the cost of true sharing — and the topic of the next section.

Snoop bus vs directory protocols

The scheme above is a snoopy protocol: every cache hears every memory transaction and decides whether to react. It works because in a low-core-count system, the bus carries every coherence message anyway. Past about 16 cores, broadcasting to every cache becomes the dominant cost.

Many-core systems (Intel Xeon, AMD EPYC, Apple M-series, ARM Neoverse) use a directory-based protocol instead: a central or distributed directory tracks, for each cache line, which cores are currently sharing it. Writes only need to be sent to the sharers, not broadcast. Logically the same four states, very different network behaviour.

MESIF (Intel) and MOESI (AMD) — what real CPUs run

Pure MESI has a weak spot: when a core asks to read a line that's in S in several caches, which sharer responds? Without a tie-breaker, all of them might — or the request might fall through to DRAM. Two real-world extensions solve this differently:

MESIF (Intel Nehalem onward, ~2008). Adds an Forward state — at most one sharer at a time is designated as the responder for incoming BusRd. S→S transitions still happen, but only one cache replies. Strictly an optimization on top of MESI semantics.
MOESI (AMD K8 onward, ~2003; ARMv8 in many configurations). Adds an Owned state: a core can hold a dirty line and share it with other cores without writing back to DRAM. Cache-to-cache transfer of modified data — faster, but with a stricter set of invariants. Useful when one core writes heavily and many cores read.

The takeaway. "MESI is the four-state coherence protocol everyone learns. Real hardware uses extensions — Intel adds an F (Forward) state so only one cache responds to a shared-line read; AMD adds an O (Owned) state so one core can hold a dirty line while others read it directly. Same four-state intuition, different fast paths."

The price — bounces, atomics, the memory model

Coherence is correct by hardware. It is also the most expensive thing two cores can do to each other. Knowing the numbers — and the patterns that avoid them — is the difference between code that scales and code that flatlines at 4 cores.

What a bounce costs

Every transition that moves the "writable" (M) right for a line from one core to another involves a snoop, an invalidate broadcast (or directory lookup), and a cache-to-cache transfer of the actual data. On modern x86 it is empirically ~30–100 ns per bounce on a single socket, and 150–300 ns across sockets. That sets a hard ceiling on how often a contended cache line can be written:

Bounce time     Effective writes/sec on one contended line
─────────────────────────────────────────────────────────
  ~30 ns         ~33 M ops/sec   (intra-socket, best case)
  ~100 ns        ~10 M ops/sec   (intra-socket, typical)
  ~300 ns        ~ 3 M ops/sec   (cross-socket NUMA)

For comparison: an uncontended write to L1 takes ~1 ns. A contended atomic counter is two orders of magnitude slower than an uncontended one. This is why "the atomic ops/sec scaling curve" — throughput vs thread count for one shared counter — is famously flat or even decreasing past a handful of writers.

CAS, LOCK CMPXCHG, LDREX/STREX

Atomic operations are coherence's most important user. Consider compare_exchange_strong in C++ (or CAS in Java, or AtomicI64::compare_exchange in Rust). On x86 it compiles to LOCK CMPXCHG: the LOCK prefix tells the cache to acquire the line in M state via the coherence protocol, execute the compare-and-swap atomically (no other core can observe an intermediate state), then release. Uncontended cost is ~10–25 cycles (single-digit ns). Contended cost is dominated by the bounce — typically 100 ns or more.

ARM does it differently. LDREX (load-exclusive) tags the cache line as monitored; STREX (store-exclusive) attempts the write and either succeeds atomically or fails if the line has been touched in the meantime. The CAS loop retries until STREX succeeds. Functionally equivalent, but the cost profile under contention is even more variable because failed retries are wasted work.

Memory ordering — separate from coherence, related in cost

Coherence gives you the per-address guarantee (Section 01). Memory ordering — what you configure via std::memory_order_*, Java volatile, Rust Ordering::* — is about cross-address visibility. The cost of consistency is closely tied to the underlying coherence protocol, but the two are different knobs.

x86 is TSO (Total Store Order). Loads are not reordered with other loads; stores are not reordered with other stores; loads can sometimes be reordered before older stores to different addresses (the store-buffer trick). In practice, most C++ atomic operations on x86 compile to ordinary MOV instructions — no explicit fence needed. Only memory_order_seq_cst stores get an MFENCE appended (~20–30 cycles).

ARM is weakly ordered. Without explicit barriers, loads and stores to different addresses can be observed in any order by other cores. An acquire load needs LDAR; a release store needs STLR; a full seq_cst fence needs DMB ISH (typically 20–80 cycles). This is why "works on Intel, breaks on ARM" is the most common cross-architecture concurrency bug.

The pattern that wins under contention

The fix for "one atomic counter, N threads" is to stop sharing the counter. Give each thread its own counter (in its own cache line, see Section 04), increment without atomics, and periodically reduce the per-thread counters into the global total. Linux per-CPU counters work this way; Go's sync/atomic stat plumbing works this way; Java's LongAdder works this way (and the JDK explicitly recommends it over AtomicLong for hot contended counters).

Throughput goes from "~10 M ops/sec total no matter how many threads" to "~300 M ops/sec per thread, linear with cores." The cost is a slightly stale total between reduces — almost always acceptable, since the reads of the total are themselves infrequent.

The takeaway. "Atomic operations are not slow because the instruction itself is slow. They are slow because, under contention, they serialise across cores through the coherence protocol. A contended atomic on a hot line costs ~100 ns per op; an uncontended one costs ~1–5 ns. So the question is never 'is this atomic' — it is always 'is this contended.'"

False sharing — the bug you can't see

Two threads writing different variables. Same cache line. ~30× slowdown that's invisible in the source code, undetectable in single-threaded profiling, and the silent killer behind most "my multi-threaded code stops scaling at 4 cores" stories in production.

Coherence operates at cache-line granularity, not byte granularity. On every common CPU (x86, ARM64, Apple Silicon, POWER, RISC-V) a cache line is 64 bytes. So if two threads each maintain their own counter — fully independent in the source, no shared variable, no lock, no atomic — but the two counter variables happen to land in the same 64-byte line, every write by either thread invalidates the other's cached copy. The line ping-pongs continuously.

This is called false sharing because no logical sharing exists. The threads believe they are isolated. Coherence, which only cares about physical lines, treats them as contending. All the cost of contention, none of the actual sharing.

1 / 6

Same workload, same code, same hardware — only the byte distance between two thread-local counters changes. The cliff at padding = 56 (the value that pushes T1 across the 64-byte boundary) is where coherence stops thrashing. The exact ops/sec numbers are illustrative; the ~30× ratio is real and reproducible.

The 64-byte boundary in code

Show a coworker the following struct and ask what's wrong:

// Common silent-killer layout
struct Counters {
    std::atomic<long> a;   // bytes  0..7  — written by thread 0
    std::atomic<long> b;   // bytes  8..15 — written by thread 1
};
Counters c;                // single struct, both fields adjacent

Functionally correct. Performance disaster. The fix is alignment:

// Aligned — each counter alone on its line
struct Counters {
    alignas(64) std::atomic<long> a;
    alignas(64) std::atomic<long> b;
};
// or, more portable since C++17:
struct Counters {
    alignas(std::hardware_destructive_interference_size) std::atomic<long> a;
    alignas(std::hardware_destructive_interference_size) std::atomic<long> b;
};

The same idea in other languages:

Java: annotate with @Contended (since JDK 8). Off by default for non-JDK classes — needs the -XX:-RestrictContended flag. Used internally by LongAdder, ConcurrentHashMap, etc.
Go: no built-in attribute; pad manually with _pad [56]byte after each hot field, or use a struct wrapper. The Go runtime itself does this for per-P (per-processor) state.
Rust: #[repr(align(64))] on a wrapper type, or wrap fields in crossbeam_utils::CachePadded.
C / kernel code: __attribute__((aligned(64))) on GCC/Clang; Linux exposes ____cacheline_aligned for this exact pattern.

Where false sharing hides in practice

The compiler doesn't warn for any of these. None of them are visible in a single-thread microbenchmark. All of them have bitten real production systems:

Per-thread queues stored in a vector<Queue>. IfQueue is small enough that several fit in a 64-byte slot, adjacent queues fight on coherence. Pad Queue or store vector<PaddedQueue>.
Producer pointer and consumer pointer in the same ring buffer struct. They are the two hottest fields, written by different threads, and inevitably adjacent in memory. Disruptor's entire performance reputation rests on putting each on its own cache line — and it's the single pattern reviewers flag most often in concurrent data structures.
Sequence number and a flag field, used by a writer thread and a reader thread.The reader is constantly checking the flag; the writer constantly bumps the sequence number. Same line → both stall.
Stats / counters embedded in a hot data structure. A read-heavy hashmap with an integral hit/miss counter inside the node will see the counter update invalidate the node itself for any thread that was about to read it.

Detection

On Linux, perf c2c is the dedicated tool: it reports cache-to-cache transfers broken down by the variables involved. Output names the addresses, the offending source lines, and the lines that bounce most. macOS has equivalents via Instruments' counters view. Without a profiler, the smell is "multi-threaded throughput stops scaling past 2–4 threads even though CPU is at 100% and there are no obvious locks."

The takeaway. "False sharing is when two threads write to two different variables that happen to share a 64-byte cache line. Coherence can't tell that the threads aren't actually sharing data, so the line ping-pongs. Fix: align hot per-thread fields to a 64-byte boundary so each gets its own line. C++17 has hardware_destructive_interference_size; Java has @Contended; Linux has ____cacheline_aligned; everywhere else, pad manually."

Quick reference

Six questions worth being able to reason about cold, and five red flags to spot in a code review. Internalize the question prompts; the answer scaffolds will follow.

What is the difference between cache coherence and memory consistency?

Coherence is a per-address guarantee: every cache eventually sees writes to a given address in the same order, so you never read a stale value of one location. Consistency is a cross-address guarantee: it specifies in what order writes to different addresses become visible to other cores. Coherence is implemented by hardware (MESI); consistency is configured by the programmer through volatile / std::atomic memory orderings / fences.

Explain MESI in 90 seconds.

Every cache line in every core's private cache carries one of four states: Modified (this core has the only copy and it's been written; DRAM is stale), Exclusive (this core has the only copy and it matches DRAM), Shared (multiple cores have read-only copies matching DRAM), Invalid. Local reads/writes drive transitions on this core; snooped bus messages (BusRd, BusRdX, BusUpgr, BusWb) drive transitions caused by other cores. Real CPUs add F (Intel) or O (AMD) for an extra optimisation, but the semantic shape is MESI.

How does compare-and-swap work at the hardware level?

On x86: LOCK CMPXCHG. The LOCK prefix asks the cache to acquire the target cache line in M state via the coherence protocol, perform the compare-and-swap atomically (no other core can observe an intermediate state), then release. On ARM: a load-linked / store-conditional pair — LDREX tags the line as monitored, STREX writes and succeeds only if the line hasn't been touched. Uncontended cost is a few ns; contended cost is dominated by the bounce.

What is false sharing and how do you fix it?

Two threads writing two different variables that happen to share a 64-byte cache line. Coherence can't tell that no real sharing exists, so the line ping-pongs. Fix: align hot per-thread fields to 64 bytes — alignas(64) in C++, @Contended in Java, ____cacheline_aligned in Linux, CachePadded in Rust via crossbeam, manual [56]byte padding in Go.

Why is std::atomic<int>::fetch_add often slower than int++?

A plain int++ on a cache-hot local variable is a single L1 access — ~1 ns. An atomic::fetch_add compiles to an instruction with the LOCK prefix (x86) or an LL/SC loop (ARM); even uncontended, it forces full ordering against memory and may flush the store buffer. Under contention, it acquires the line in M state via the coherence protocol, which means a 30–100 ns bounce per call. The instruction itself isn't slow — the coherence work it triggers is.

When would you prefer a per-thread counter + reduce over a shared atomic?

Whenever writes vastly outnumber reads of the total and contention is high enough that coherence becomes the bottleneck. Per-thread counters incur zero coherence traffic on increment — each thread writes its own cache line — and only the periodic reduce pays. Throughput scales linearly with cores instead of flatlining. Java's LongAdder is built on exactly this insight; Linux per-CPU counters too. The cost is a slightly stale aggregate between reduces, almost always acceptable for metrics and stats.

Red flags in code review

Two or more std::atomic<T> fields adjacent in a struct. They almost certainly fall on the same line. If different threads write them, you have false sharing.
A hot counter or version number embedded inside a node of a read-heavy data structure. Every counter update invalidates the whole node for every reader.
vector<Worker> or vector<Queue> whereWorker / Queue is < 64 B. Adjacent elements end up on the same cache line; contiguous storage helps cache prefetch but kills coherence.
Producer pointer and consumer pointer in the same ring-buffer struct without padding. The single most-flagged pattern in concurrent code reviews. They must each get their own cache line.
A std::atomic<bool> flag next to the data it protects. The acquire load of the flag and the load of the data hit the same line — fine in a single-writer setup, but if you ever have multiple writers, both contend the same line for no good reason.