Assembly & ISA Primer

When curl finally needs to put bytes on the network, it makes a send() call. That call compiles down to a handful of assembly instructions ending in a syscall — the single special instruction that hands control to the kernel and bridges this primer to Layer 2. Five sections build the picture: what the compiler emits with side-by-side C → assembly comparisons; x86 vs ARM — the CISC vs RISC trade-off that suddenly matters in 2025; the calling convention that lets functions hand arguments to each other through registers; the syscall instruction with a transition walk-through; and a quick reference for the questions worth being able to answer cold.

01

What the compiler emits

Source code is for humans. Assembly is what actually runs. Once you can read the middle layer, every performance question becomes concrete instead of superstitious.

A C, C++, Rust, or Go compiler turns source into assembly (text mnemonics like mov, add, call), and an assembler then converts assembly into machine code (the raw bytes the CPU decodes). Assembly is a one-to-one human-readable view of machine code, with labels, comments, and a handful of pseudo-ops like .data and .text that have no machine-code counterpart.

The instruction set architecture (ISA) is the published contract: the list of legal opcodes, the registers they use, the addressing modes they accept, the side effects each one has. The ISA is what a CPU vendor implements; software targets it. The same C source compiled for x86-64 and ARM64 produces completely different machine code that does the same thing, because the two ISAs have different opcodes.
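As a small illustration of "same source, different ISA", the sketch below reports which ISA a file was compiled for. The function names are made up for this example; the only real assumption is that the compiler is GCC or Clang, which predefine the __x86_64__ and __aarch64__ macros.

```c
#include <stdio.h>

/* Hypothetical helper: name of the ISA this file was compiled for.
   __x86_64__ and __aarch64__ are predefined by GCC and Clang. */
const char *target_isa(void) {
#if defined(__x86_64__)
    return "x86-64";
#elif defined(__aarch64__)
    return "ARM64";
#else
    return "other";
#endif
}

/* The C source is identical on every target; only the machine code
   the compiler emits underneath differs. */
void print_target_isa(void) {
    printf("compiled for %s\n", target_isa());
}
```

Calling print_target_isa() from the same source file built for two targets prints two different names, even though not one line of C changed.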

A minimal example

Take the canonical introductory function:

int add(int a, int b) {
    return a + b;
}

Compiled with gcc -O2 for x86-64 (Linux and macOS both use the System V AMD64 ABI) and shown in Intel syntax, the emitted assembly is:

add:
    lea  eax, [rdi + rsi]   ; sum the two argument registers
    ret                     ; pop the return address, jump to it

Two instructions. Arguments arrived in edi and esi, the low 32 bits of rdi and rsi, by the calling convention (the next section explains how); the return value is left in eax (the lower 32 bits of rax) by the same convention. No data memory is involved: the only stack access is ret popping the return address. Small leaf functions live entirely in registers, which is one of the major reasons modern code runs so fast. The lea (load effective address) instruction is being repurposed for addition: lea eax, [rdi + rsi] computes rdi + rsi with the address-generation logic and writes the low 32 bits to eax in one instruction. Unlike add, it can put the result in a register that is neither of its inputs, and it leaves all the flags untouched, which is why compilers reach for it so often.
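To reproduce a listing like this locally, a file along the following lines can be compiled to assembly instead of an object file. This is a sketch: add_scaled is an extra function invented here to show lea doing more work, and the exact instructions you see depend on compiler version and flags.

```c
/* add.c
   Inspect the output with:  gcc -O2 -S -masm=intel add.c -o -
   (-S stops after generating assembly; -masm=intel selects Intel
   syntax on x86 targets; "-o -" writes the result to stdout). */

int add(int a, int b) {
    return a + b;          /* typically one lea on x86-64, one add on ARM64 */
}

/* Hypothetical variant: base + index*4 + constant fits one x86 address
   expression, so compilers commonly emit a single lea for it as well. */
long add_scaled(long a, long b) {
    return a + b * 4 + 8;
}
```

Pasting the same file into Compiler Explorer shows the x86-64 and ARM64 output side by side without installing a cross-compiler.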

The same function on ARM64

add:
    add  w0, w0, w1         ; arg 0 = arg 0 + arg 1
    ret                     ; jump to the link register

Two instructions, structurally the same job, different mnemonics and register names. ARM64's standard ABI (AAPCS64) passes the first 8 integer arguments in x0–x7 (or w0–w7 for 32-bit ints), and the return value goes in x0/w0. ret on ARM jumps to the link register (lr = x30), which holds the return address set up by the caller's bl (branch with link) instruction.

Why read assembly

Three concrete moments when assembly literacy pays off:

  • Profiler attribution. perf report shows hot instructions in assembly. Knowing roughly which mnemonic costs what (a lea is about 1 cycle, an integer div runs from around ten to several tens of cycles depending on the core and operand width, a load is 4–5 cycles on an L1 cache hit and 100+ on a miss that goes to DRAM) turns a percentage on a line of source into an actionable hypothesis.
  • Compiler explanations. Did the compiler vectorise that loop? Did it eliminate that array bound check? Did std::move avoid a copy? The assembly is the ground truth; the source is wishful thinking. Compiler Explorer (godbolt.org) puts source and assembly side by side and is the best teaching tool for this.
  • Microarchitecture debugging. Branch mispredicts, cache misses, memory-ordering bugs in lock-free code, atomic operations not behaving as expected — all of these are explained at the assembly level, not the source level. Languages with strong memory models (Java, Rust) make this rarer, but in C/C++ it's essential at the senior level.

You don't need to write assembly to benefit. You need to be able to read it well enough that a 20-line function in perf report doesn't feel opaque.

The takeaway. "Source compiles to assembly compiles to machine code; assembly is just a human-readable view of what the CPU actually executes. The ISA is the contract between hardware and software. Reading assembly is what turns 'the compiler did something' from belief into knowledge — Compiler Explorer is the simplest way to start."

02

x86 vs ARM — CISC vs RISC, in practice

Two families dominate general-purpose computing. Their designers made different choices in the late 1970s and 1980s about instruction length, register count, and addressing complexity. Those choices, frozen into hardware for forty years, go a long way toward explaining why your laptop may now be ARM.

x86-64 (Intel and AMD) is a CISC design. Its instructions are variable-length — anywhere from 1 to 15 bytes each. It has 16 general-purpose 64-bit registers (originally 8 in 32-bit x86, extended with the REX prefix). Hundreds of distinct opcodes cover complex addressing modes: mov rax, [rbx + rcx*8 + 16] is a single instruction that reads memory at an address computed from one base register, one index register scaled by 8, and an immediate offset. The CPU front-end must do real work to figure out where one instruction ends and the next begins.

ARM64 (AArch64) is a RISC design. Instructions are fixed-length: exactly 4 bytes each. There are 31 general-purpose 64-bit registers. The opcode set is smaller and more regular. Addressing modes are simpler: a load can use base + immediate offset or base + scaled index, but not both at once, so the x86 operand [rbx + rcx*8 + 16] takes two instructions on ARM (an add with a shifted register, then a load with an offset). The decode hardware is correspondingly simpler and smaller, which helps lower power per instruction.
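The addressing-mode difference is easy to provoke from C. In the sketch below (load_elem is a name invented for this example), reading element i + 2 of an array of 8-byte integers needs the address base + i*8 + 16, which is exactly the shape of the x86 operand discussed above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: returns base[i + 2].
   Address needed: base + i*8 + 16.
   x86-64 compilers typically fold this into one instruction:
       mov rax, [rdi + rsi*8 + 16]
   ARM64 compilers typically need two:
       add x0, x0, x1, lsl #3
       ldr x0, [x0, #16]                                          */
int64_t load_elem(const int64_t *base, size_t i) {
    return base[i + 2];
}
```

The instruction counts in the comment are what current GCC and Clang usually produce at -O2, not a guarantee; checking the output for your own compiler is the point of the exercise.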

Why the trade-offs mattered then

  • Code density. CISC packs more meaning per byte, so binaries tend to be smaller. This mattered hugely in the 1980s when memory was expensive. ARM's fixed 4-byte instructions can make code up to roughly 30% larger than the equivalent x86 code, though the gap varies by workload and is often smaller. 32-bit ARM answered this with Thumb, a compressed encoding built on 16-bit instructions, but AArch64 has no Thumb mode.
  • Decoder complexity. An x86 decoder cannot know where the next instruction starts until it has at least partially decoded the current one, because prefixes, opcode, and operand bytes together determine the length. Wide x86 front-ends spend extra hardware on this (length pre-decoding, micro-op caches). ARM's fixed-size, 4-byte-aligned instructions let the decoder fetch and decode N instructions in parallel trivially. This is part of why ARM cores hit higher decode width per watt.
  • Register file size. More registers means more live values without spilling to the stack, and every spill adds a store and a later load to the code path. ARM's 31 GPRs give the compiler's register allocator more room than x86-64's 16, especially for register-heavy code like JIT-compiled JavaScript.

Why the trade-offs matter now

The 2020s flipped the historical landscape. Three things drove it:

  • Apple Silicon. M1 (2020) showed that an ARM core with enough R&D could match or beat Intel's best laptop parts at significantly lower power (the figures usually quoted are on the order of 5–10 W per core sustained versus 25 W or more for Intel). By mid-2023 every new Mac shipped with an ARM chip.
  • AWS Graviton. ARM-based server chips offering roughly 20–30% better performance per dollar drove a major cloud migration. AWS said in 2024 that more than half of the CPU capacity it had added over the previous two years was Graviton. Any service that wants those savings has to support multi-arch deployment.
  • Phones. ARM was always the phone ISA, but ARMv8-A (announced in 2011, in shipping phones from 2013) added the 64-bit AArch64 instruction set and put phone cores on a path toward desktop-class performance. A current flagship phone core can have higher single-thread IPC than a 2015 server CPU.

What this means for software

For an application engineer, the differences mostly show up at three points:

  • The assembly in profilers looks different. Same algorithm, but mnemonics, register names, and instruction count vary. Learning to read both requires some practice but is straightforward.
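  • Atomic memory ordering surprises. x86 is TSO (Total Store Order): loads are not reordered with other loads, stores are not reordered with other stores, and the only reordering software can observe is a later load passing an earlier store to a different address. Ordinary loads and stores therefore already behave like acquire and release operations, although they are not fully sequentially consistent. ARM is weakly ordered: the hardware may reorder almost any pair of independent accesses, so naive lock-free code that works on x86 silently breaks on ARM without explicit barriers or acquire/release atomics. This is one of the most common cross-arch porting bugs.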
  • Container image and binary distribution. Docker images must declare and build for both amd64 and arm64; docker buildx and manifest lists handle this. Pre-built binaries on GitHub releases need both architectures. The cost of multi-arch builds is real but small.
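The memory-ordering point is the one most worth seeing in code. Below is a minimal sketch of the classic "publish a value, then set a flag" pattern using C11 atomics; the names payload, ready, publish, and try_consume are invented for this example. Written with a plain (non-atomic) flag, this pattern tends to work on x86 and fail intermittently on ARM; with release/acquire it is correct on both.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;          /* ordinary data, written before the flag */
static atomic_bool ready;    /* publication flag, zero-initialised     */

/* Producer thread: write the data, then set the flag with release
   ordering so the payload store cannot be reordered after the flag. */
void publish(int value) {
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer thread: read the flag with acquire ordering; if it is set,
   the matching release guarantees the payload write is visible. */
bool try_consume(int *out) {
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```

On x86-64 the release store and acquire load typically compile to plain mov instructions; on ARM64 they typically become stlr and ldar. The source is the same, and the compiler inserts the ordering each ISA needs.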

The takeaway. "x86 is CISC (variable-length instructions, 16 registers, complex addressing); ARM64 is RISC (4-byte fixed, 31 registers, simple addressing). ARM's simpler decoder and larger register file translate to better power efficiency, which is why Apple Silicon, AWS Graviton, and all phones run ARM. The biggest software surprise is memory ordering: x86 is TSO (forgiving), ARM is weak (strict barriers required for lock-free code)."

03

The calling convention — how functions talk to each other

A function can't call another function without agreement on where arguments go, where the return value comes back, and which registers must be preserved. That agreement is the calling convention — a contract between every binary on the system.

The calling convention is the rule that lets code compiled by GCC call into code compiled by Clang, into the C library, into the kernel, all transparently. Each OS+architecture has one canonical ABI (Application Binary Interface) that everything targets. On Linux/x86-64 it's the System V AMD64 ABI. On Windows/x86-64 it's the Microsoft x64 calling convention. On Apple/ARM64 it's a customised variant of AAPCS64. They differ in the small details, agree on the structure.

SysV AMD64 in one screenful

Argument registers (integer):  rdi, rsi, rdx, rcx, r8, r9  (6 args)
Argument registers (float):    xmm0..xmm7                  (8 args)
Return value (integer):        rax  (rdx:rax for 128-bit ints)
Return value (float):          xmm0

Caller-saved registers:        rax, rcx, rdx, rsi, rdi, r8..r11
                               (the caller must save if needed across calls)

Callee-saved registers:        rbx, rbp, r12..r15
                               (the callee must restore before returning)

Stack:                         16-byte aligned at call site;
                               grows downward; red zone of 128 bytes below rsp.

For most function calls the first 6 integer arguments and first 8 float arguments all fit in registers, so passing them touches no memory; the only stack write is the return address pushed by call. Arguments beyond that limit go on the stack. The return value comes back in rax or xmm0. The stack must be 16-byte aligned at each call so that aligned SSE loads and stores (movaps, for example) can operate on stack-allocated data without faulting.
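A function with more than six integer parameters makes the register/stack split visible. The sketch below uses made-up function names; the register assignments in the comments follow the SysV AMD64 table above.

```c
/* Hypothetical example: eight integer arguments.
   Under SysV AMD64:  a=rdi  b=rsi  c=rdx  d=rcx  e=r8  f=r9
                      g and h are passed on the stack by the caller.
   The result is returned in rax. */
long sum8(long a, long b, long c, long d,
          long e, long f, long g, long h) {
    return a + b + c + d + e + f + g + h;
}

/* Integer and floating-point arguments are counted separately:
   n arrives in rdi, x in xmm0, and the double result leaves in xmm0. */
double scale(long n, double x) {
    return (double)n * x;
}
```

Compiling sum8 with -O2 and reading the output typically shows six register operands and two loads from [rsp + 8] and [rsp + 16], the two stack slots just above the return address.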

Caller-saved vs callee-saved

The split exists because callers and callees have different costs to save things. Caller-saved registers (also called "scratch" or "volatile"): the calling function knows which values it cares about across the call and saves only those. Cheap if you don't care about most of them. Callee-saved registers: the called function knows which ones it actually uses and saves only those. Cheap if the function doesn't use most callee-saved regs.

In practice, the compiler decides for each variable which type of register to store it in. Temporaries that are dead before the next call usually go in caller-saved registers, which cost nothing to use. Longer-lived values that survive across function calls go in callee-saved registers (saved once on function entry, restored once on return, which is much cheaper than re-saving across every call).
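A value that has to survive a call is the simplest way to watch this decision being made. The following sketch uses invented names (note_call, triple_then_note) and the GCC/Clang noinline attribute to stop the compiler from inlining the call away.

```c
static volatile long calls_seen;   /* side effect so the call is not removed */

/* noinline (a GCC/Clang attribute) keeps this a real call, which may
   overwrite every caller-saved register. */
__attribute__((noinline)) void note_call(void) {
    calls_seen = calls_seen + 1;
}

/* y is computed before the call and used after it, so it must live
   somewhere the callee will not destroy. Compilers typically keep such
   a value in a callee-saved register (rbx on x86-64, x19 on ARM64),
   pushing the old contents on entry and popping them before ret. */
long triple_then_note(long x) {
    long y = x * 3;
    note_call();
    return y + 1;
}
```

In the assembly for triple_then_note, a push rbx / pop rbx pair around the body is the visible price of using a callee-saved register; a compiler may equally choose to spill to the stack instead, so treat the comment as the common case rather than a rule.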

The stack frame

high addresses
   ┌──────────────────────┐
   │  caller's arg #7..#n │  (if more than 6 int args)
   ├──────────────────────┤
   │   return address     │  ← pushed by 'call'
   ├──────────────────────┤  ← rsp on entry to the callee
   │  saved rbp (optional)│
   ├──────────────────────┤  ← callee sets rbp here
   │   local variables    │
   │   spilled registers  │
   ├──────────────────────┤  ← callee's current rsp
   │   red zone (128 B)   │  ← can be used without rsp adjustment
   └──────────────────────┘
low addresses

Calling a function pushes the return address onto the stack and jumps. The callee prologue typically sets up a frame pointer (push rbp; mov rbp, rsp) and allocates space for locals (sub rsp, N). The callee epilogue undoes this (add rsp, N; pop rbp; ret). Leaf functions often skip the frame pointer to save instructions.

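The 128-byte red zone below rsp is scratch space a leaf function may use without adjusting rsp. This is safe in user space because the ABI guarantees that signal handlers will not overwrite it: the kernel skips past the red zone when it builds a signal frame. Hardware interrupts make no such promise, which is why kernel code is compiled with the red zone disabled (-mno-red-zone).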
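The "grows downward" part of the diagram can be checked from C by comparing the address of a local in a caller with the address of a local in its callee. This is a sketch with invented function names; it assumes a conventional downward-growing stack, as on x86-64 and ARM64 Linux, and uses the GCC/Clang noinline attribute.

```c
#include <stdbool.h>
#include <stdint.h>

/* Called with the address of a local in the caller's frame. The callee's
   own local lives in a newer frame, which sits at a lower address when
   the stack grows downward. */
__attribute__((noinline))
bool callee_is_below(const volatile char *caller_local) {
    volatile char callee_local = 0;
    return (uintptr_t)&callee_local < (uintptr_t)caller_local;
}

/* Returns true on platforms where the stack grows toward low addresses. */
bool stack_grows_down(void) {
    volatile char caller_local = 0;
    return callee_is_below(&caller_local);
}
```

The volatile qualifiers force both locals into memory so that each really has a stack address to compare.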

Why this matters for debugging and FFI

  • Stack-trace decoding. Every stack frame follows the convention, so a backtrace can walk frames by following rbp (with frame pointers) or by reading DWARF unwind info (without). When the convention is violated (e.g. inline assembly clobbering callee-saved registers without declaring them), backtraces break and crashes become opaque.
  • FFI (Foreign Function Interface). Calling C from Python, Rust, Go, Java, etc. always goes through the C ABI as the lingua franca. The FFI layer in each language marshals arguments into the convention's expected registers and stack slots. Mismatches (wrong type widths, missing alignment) are a common source of memory-corruption bugs.
  • Custom calling conventions. JITs (V8, LuaJIT, Java's JIT) often invent their own internal conventions that pass more args in registers or skip prologue/epilogue work. They have to switch to the ABI convention whenever crossing into C code.

The takeaway. "The calling convention is the contract that lets all the binaries on a system call each other. On Linux/x86-64 (SysV AMD64): first 6 integer args in rdi/rsi/rdx/rcx/r8/r9, return value in rax, callee-saved rbx/rbp/r12-r15, stack 16-byte aligned at call sites. FFI, stack traces, and JIT design all depend on knowing this."

04

Syscalls — the special instruction that opens the door

Of the hundreds of instructions in any ISA, one is the designated door from user mode to kernel mode. Every read, write, open, fork, mmap your program ever does flows through it. This section covers what happens during that transition.

Out of the hundreds of instructions in either x86-64 or ARM64, one is special: the one a program executes to switch the CPU from user mode (privilege ring 3 on x86, EL0 on ARM) to kernel mode (ring 0 / EL1). On x86-64 it's the syscall instruction (the older int 0x80 and sysenter entry points survive for 32-bit code); on ARM64 it's svc #0. Every other instruction your program executes (every load, store, add, branch) runs in user mode, where the CPU enforces that the program can only touch memory and devices the kernel has mapped for it. Faults and interrupts also enter the kernel, but not at the program's request.

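What changes at the boundary

On the syscall instruction, the hardware changes privilege level and instruction pointer together, in one indivisible step. That is what makes the boundary safe: user code can never run with kernel privileges, because the only place ring 0 execution can begin is an entry point the kernel chose in advance. The rest of the switch (the stack and, where needed, the page tables) is done by the first few instructions of the kernel's entry stub, with interrupts masked.

  • Privilege ring 3 → 0 (hardware). The code-segment selector is reloaded from the STAR model-specific register, which sets the current privilege level to 0. From this moment, the CPU can execute privileged instructions and access kernel memory that is mapped.
  • Instruction pointer jumps (hardware) to the kernel's syscall entry, recorded in the LSTAR model-specific register on x86-64. The user return address is saved in rcx and the user rflags in r11, which is why syscall clobbers those two registers.
  • Page-table base switches (software). On kernels running with KPTI (Kernel Page Table Isolation, the Meltdown mitigation introduced in 2018), user and kernel have separate page tables and the entry stub writes the kernel's page-table base into CR3. This is a real cost, although PCID tagging avoids a full TLB flush on most hardware. CPUs that are not vulnerable to Meltdown can run without KPTI and skip the swap.
  • Stack pointer switches (software). syscall does not touch rsp. The entry stub executes swapgs to reach per-CPU data, saves the user-mode rsp there, and loads the current thread's kernel stack pointer.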

The kernel side

The kernel reads a number out of a designated register (rax on Linux/x86-64, x8 on Linux/ARM64) to identify which syscall was requested. It indexes that number into a table (sys_call_table) of function pointers — one per syscall — and calls the corresponding handler. Argument registers (rdi/rsi/rdx/r10/r8/r9 for syscalls on Linux/x86-64; note r10 instead of rcx because the syscall instruction itself clobbers rcx) carry the arguments.
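The register assignments above can be exercised directly. The sketch below is a hypothetical three-argument helper: on Linux/x86-64 it issues the syscall instruction through GCC/Clang inline assembly, and on any other target it falls back to libc's syscall() wrapper. The clobber list names rcx and r11 because the instruction overwrites both.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_getpid and friends */
#include <unistd.h>        /* syscall(), getpid()    */

/* Hypothetical helper: make a Linux syscall with up to three arguments. */
long raw_syscall3(long nr, long a1, long a2, long a3) {
#if defined(__x86_64__) && defined(__linux__)
    long ret;
    __asm__ volatile (
        "syscall"
        : "=a"(ret)                           /* result comes back in rax    */
        : "a"(nr), "D"(a1), "S"(a2), "d"(a3)  /* rax, rdi, rsi, rdx          */
        : "rcx", "r11", "memory");            /* overwritten by the hardware */
    return ret;                               /* negative errno on failure   */
#else
    return syscall(nr, a1, a2, a3);           /* libc: -1 and errno on failure */
#endif
}
```

For example, raw_syscall3(SYS_getpid, 0, 0, 0) returns the same value as getpid(). Note the differing error conventions of the two paths; real code would normalise them.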

Figure: syscall transition, eight steps from user code through kernel and back.

USER MODE
; user-mode write(fd, buf, n)
; (rax_fd, rax_buf, rcx_n and rax_bytes are placeholders, not register names)
    mov  rdi, rax_fd        ; arg 1
    mov  rsi, rax_buf       ; arg 2
    mov  rdx, rcx_n         ; arg 3
    mov  rax, 1             ; syscall # = write
    syscall                 ; enter the kernel
    ; ... continues after kernel returns
    mov  rax_bytes, rax     ; return value

KERNEL MODE
entry_SYSCALL_64:
    swapgs                      ; load kernel gs
    mov  rsp, kernel_stack
    call sys_call_table[rax]    ; → sys_write(fd, buf, n)
                                ;   sockfs_write → tcp_sendmsg → ...
                                ;   bytes_written into rax
    swapgs                      ; restore user gs
    sysretq                     ; return to user mode

Step 1 of 8. User: load syscall arguments into argument registers (ring = 3, user page tables). SysV ABI: first 6 integer args go in rdi, rsi, rdx, rcx, r8, r9. The compiler emitted these moves while preparing the call.

The syscall instruction itself changes the privilege ring (3 → 0) and jumps to the kernel's entry point in one step; the entry stub then switches the stack pointer (user stack → per-thread kernel stack) and, on kernels running with KPTI (the Meltdown mitigation), the page-table base (user → kernel). The kernel reads rax for the syscall number, dispatches through a fixed table, runs the handler, writes the return value back to rax, and sysret reverses the transition. Every I/O call, every mmap, every fork your program does goes through this same sequence.

The cost

A syscall itself takes roughly 100–300 ns on modern hardware (KPTI overhead included where it is enabled). That's before the actual work: a read from a socket buffer is the syscall cost plus a memory copy; a read from disk is the syscall cost plus I/O latency. The syscall overhead is the floor: nothing involving a kernel transition can be faster than that.

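This is why high-performance servers go to such lengths to avoid syscalls. epoll answers N socket-readiness queries with one syscall instead of N. io_uring goes further: userspace and the kernel share ring buffers of submissions and completions, so many operations can be submitted with a single io_uring_enter call, completions can be reaped by reading shared memory, and with submission-queue polling (IORING_SETUP_SQPOLL) steady-state I/O needs no syscall at all. sendfile and splice let the kernel move bytes between file descriptors without round-tripping through userspace, replacing a read/write pair and its copies with one call.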
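The epoll half of that argument can be sketched in a few lines. The helper below (count_readable is a name invented here) registers a handful of descriptors and asks about all of them with a single epoll_wait. It is Linux-specific and, for brevity, creates and destroys the epoll instance on every call; a real server would register each descriptor once and reuse the instance, so only the epoll_wait would repeat.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: how many of fds[0..n) are readable right now?
   Returns the count (capped at 16 in this sketch) or -1 on error. */
int count_readable(const int *fds, int n, int timeout_ms) {
    int ep = epoll_create1(0);
    if (ep < 0)
        return -1;

    /* One-time registration cost: one epoll_ctl per descriptor. */
    for (int i = 0; i < n; i++) {
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = fds[i] };
        if (epoll_ctl(ep, EPOLL_CTL_ADD, fds[i], &ev) < 0) {
            close(ep);
            return -1;
        }
    }

    /* The payoff: a single syscall reports readiness for every fd. */
    struct epoll_event ready[16];
    int got = epoll_wait(ep, ready, 16, timeout_ms);
    close(ep);
    return got;
}
```

With two pipes that each hold a byte, count_readable on their read ends returns 2 from one epoll_wait, where a poll-each-socket loop would have made one syscall per descriptor.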

vDSO — the syscall that isn't

Some frequently-used "syscalls" — gettimeofday, clock_gettime, getcpu — are implemented in a shared userspace library called the vDSO (virtual Dynamic Shared Object) that the kernel maps into every process. The implementation reads kernel-maintained data from a read-only page and returns immediately, with no privilege transition. This brings the latency from ~150 ns to ~15 ns — important for time-sensitive code (logging, profiling, trace tools) that calls these constantly.
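Both paths to the clock are reachable from C, which makes the difference measurable. In this sketch (function names invented; 64-bit Linux assumed, so that the kernel's timespec layout matches libc's), now_ns_vdso goes through libc's clock_gettime, which normally resolves to the vDSO, while now_ns_syscall forces a real kernel transition with syscall().

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/syscall.h>   /* SYS_clock_gettime */
#include <time.h>          /* clock_gettime, CLOCK_MONOTONIC */
#include <unistd.h>        /* syscall() */

/* Usual path: libc calls into the vDSO, no privilege transition. */
int64_t now_ns_vdso(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000 + ts.tv_nsec;
}

/* Forced path: the same clock, read through a real syscall. */
int64_t now_ns_syscall(void) {
    struct timespec ts;
    syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000 + ts.tv_nsec;
}
```

Timing a few million calls of each in a loop is a quick way to see the gap on your own machine; the exact ratio depends on the CPU, the kernel's mitigations, and the clock source in use.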

The takeaway. "One instruction is the program's door into the kernel: syscall on x86-64, svc #0 on ARM64. The hardware flips the privilege level and jumps to a kernel-chosen entry point in a single step, and the entry stub switches stack and page tables before any real kernel code runs. The cost (~100–300 ns) is the reason high-throughput servers use epoll, io_uring, sendfile, and the vDSO to make as few syscalls as possible."

05

Quick reference

Six questions worth being able to reason about cold, and five red flags to spot in a code review. Internalize the question prompts; the answer scaffolds will follow.

What's the difference between assembly, machine code, and the ISA?

Machine code is the raw bytes the CPU decodes. Assembly is a one-to-one human-readable view of those bytes — same instructions, just mnemonics and labels. The ISA (Instruction Set Architecture) is the published contract that specifies what opcodes exist and what each does — what the CPU vendor implements.

Why is ARM more power-efficient than x86 at comparable performance?

Two structural reasons: fixed-length 4-byte instructions let the decoder run N-way in parallel trivially (vs x86's length-dependent decode, where each instruction's start depends on the one before it); and 31 registers vs 16 reduce spills to memory, which cost extra loads and stores. A third, non-architectural factor is that Apple, Qualcomm, and Arm have invested heavily in efficiency-focused core designs. Figures of 2–3× perf/watt are commonly quoted for comparable workloads, though the gap depends heavily on the chips and workload being compared.

Which registers carry function arguments in SysV AMD64?

First 6 integer args go in rdi, rsi, rdx, rcx, r8, r9; first 8 float args go in xmm0..xmm7; further args go on the stack. Return value: rax (rdx:rax for 128-bit integers) or xmm0 for floats. For syscalls, the 4th argument register is r10 instead of rcx because the syscall instruction itself clobbers rcx.

Walk through what happens on the syscall instruction.

In one hardware step: privilege ring 3 → 0 and the instruction pointer jumps to the kernel's syscall entry (recorded in LSTAR on x86-64), with the return address saved in rcx and rflags in r11. The entry stub then swaps the stack pointer from the user stack to the per-thread kernel stack and, on kernels running KPTI (the Meltdown mitigation from 2018), swaps the page-table base from user to kernel. Then the kernel reads rax for the syscall number, indexes into sys_call_table, calls the handler, writes the return value back to rax, restores the user state, and sysret drops back to ring 3.

Why are syscalls considered expensive, and what techniques reduce that cost?

A syscall is ~100–300 ns on modern hardware just for the transition (KPTI included), before any actual work. Reduction techniques: epoll answers N readiness queries with 1 syscall; io_uring shares ring buffers with the kernel so many operations are submitted per syscall, and with submission-queue polling none is needed at all; sendfile lets the kernel move bytes between fds without crossing to userspace; the vDSO implements common reads (clock_gettime) in userspace memory mapped from the kernel, avoiding the transition entirely.

What's the difference between caller-saved and callee-saved registers?

Both are conventions. Caller-saved ("volatile") registers may be overwritten by any function call; the caller must save them itself if it needs the values preserved. Callee-saved ("non-volatile") registers are preserved across calls — the callee must save them on entry and restore on return if it uses them. The split exists because callers and callees have different costs; the compiler picks which register to use based on lifetime.

Red flags in code review

  • Inline assembly without proper clobber declarations. A function using asm volatile must tell the compiler which registers and memory it modifies. Missing clobbers cause silent corruption that only manifests when the compiler decides to keep a value in the clobbered register across the asm block.
  • Lock-free code without explicit memory barriers, intended to run on ARM. Code that works on x86 (TSO) silently breaks on ARM (weak ordering). The fix is an explicit atomic_thread_fence or an appropriate memory_order on every atomic op.
  • FFI declarations with mismatched argument widths. Calling a C function from Python or Rust with the wrong type widths puts values in registers or stack slots where the callee does not expect them, so it reads garbage or corrupts neighbouring data. cffi, bindgen, and the Rust libc crate exist to generate correct bindings.
  • Tight loops making many syscalls per iteration. Each read from a socket is a syscall; reading a megabyte 1 byte at a time costs about 1,000,000 syscalls. Buffer in userspace and read several kilobytes or more per call (or use readv) to amortise.
  • Single-arch Docker images for code that runs in 2025. A Dockerfile that starts FROM alpine but is only ever built and published for amd64 fails on Graviton and runs under slow emulation, if at all, on M-series Macs and other ARM dev machines. Use docker buildx with --platform linux/amd64,linux/arm64.
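The tight-loop red flag is easy to quantify. The sketch below (count_data_reads is an invented name) drains a descriptor with a chosen chunk size and counts how many read calls actually returned data; running it over the same input with chunk sizes of 1 and 1024 shows the syscall count dropping by a factor of about a thousand.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: drain fd using reads of at most `chunk` bytes
   (1..4096 in this sketch). Returns the number of read() calls that
   returned data, or -1 on error or an out-of-range chunk size. */
long count_data_reads(int fd, size_t chunk) {
    char buf[4096];
    long calls = 0;

    if (chunk == 0 || chunk > sizeof buf)
        return -1;

    for (;;) {
        ssize_t n = read(fd, buf, chunk);   /* one syscall per iteration */
        if (n < 0)
            return -1;
        if (n == 0)
            return calls;                   /* end of file */
        calls++;
    }
}
```

For a pipe holding 4096 bytes, a chunk size of 1 takes 4096 data-returning reads and a chunk size of 1024 takes 4. At a few hundred nanoseconds per syscall, that difference is the whole cost of the loop.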