Calculus Primer

The minimum calculus you need before any deep-learning paper. Four short topics covering what a derivative is, how the partial derivative handles many variables at once, why the chain rule is the heart of backpropagation, and how the gradient bundles every partial into a single arrow that points uphill. No integrals — you can train a billion-parameter model without ever computing one.

Derivative

A single number that says how fast a function is changing right now.

Drive a car. The speedometer says 60 km/h. That number isn't the distance you've covered or the time you've spent driving — it's the rate at which distance is changing, captured at this exact instant. A derivative is just that same idea, formalized: for any function f(x), the derivative f'(x) tells you how fast f is changing as x moves.

Geometrically, the derivative is the slope of the tangent line — the straight line that just barely kisses the curve at one point and goes in the same direction the curve is heading. If the curve is climbing steeply at x = 3, the tangent there is steep and f'(3) is a big positive number. If the curve has just crested and is about to come down, the tangent is horizontal and f' is zero. If it's plunging, the slope is negative.

Three notations all mean the same thing, and ML papers use them interchangeably:

f'(x) — Lagrange's prime notation, compact.
df/dx — Leibniz's ratio notation, reads as "the change in f per change in x."
d/dx [f(x)] — when you want to emphasize the act of differentiating.

Concrete example. Take f(x) = x². The graph is a parabola climbing on the right and falling on the left. The derivative is f'(x) = 2x. At x = 3 the slope is 6 — pretty steep. At x = 0 the slope is 0 — the parabola has bottomed out. At x = −1 the slope is −2 — falling. One formula gives you the climbing-rate at every point on the curve at once.
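The slopes quoted above can be checked numerically. The sketch below approximates the derivative with a central difference and compares it with the formula f'(x) = 2x. The helper name `numerical_derivative` and the step size `h` are illustrative assumptions, not something the primer prescribes.

```python
def f(x):
    """The example function f(x) = x^2."""
    return x * x


def numerical_derivative(func, x, h=1e-5):
    """Approximate func'(x) with a central difference (hypothetical helper)."""
    return (func(x + h) - func(x - h)) / (2 * h)


for x in (3.0, 0.0, -1.0):
    approx = numerical_derivative(f, x)
    exact = 2 * x  # the formula f'(x) = 2x from the text
    print(f"x = {x:+.1f}  approx slope = {approx:+.6f}  exact slope = {exact:+.6f}")
```

The two columns should agree to several decimal places, which is the practical meaning of "slope of the tangent": rise over run across a very short run.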

f(x) = x², so f'(x) = 2x. The tangent rotates as the point moves; its slope is the derivative.

A small kit of rules covers almost everything you'll meet in ML papers:

  d/dx [c]         = 0              (a constant doesn't move)
  d/dx [x]         = 1
  d/dx [x^n]       = n · x^(n−1)     (power rule)
  d/dx [eˣ]        = eˣ              (exp is its own derivative)
  d/dx [ln x]      = 1/x
  d/dx [f + g]     = f' + g'         (linearity)
  d/dx [c · f]     = c · f'
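Each rule in the kit can be sanity-checked against a finite difference. This is a rough sketch rather than a proof: the test point x = 2, the constants 7 and 5, and the helper `numerical_derivative` are all arbitrary choices made for illustration.

```python
import math


def numerical_derivative(func, x, h=1e-6):
    """Central-difference approximation of func'(x) (hypothetical helper)."""
    return (func(x + h) - func(x - h)) / (2 * h)


x = 2.0  # arbitrary test point
# Each entry: (label, function, derivative predicted by the rule at x)
checks = [
    ("constant 7", lambda t: 7.0, 0.0),
    ("x", lambda t: t, 1.0),
    ("x^3", lambda t: t ** 3, 3 * x ** 2),
    ("e^x", math.exp, math.exp(x)),
    ("ln x", math.log, 1 / x),
    ("x^3 + e^x", lambda t: t ** 3 + math.exp(t), 3 * x ** 2 + math.exp(x)),
    ("5 * ln x", lambda t: 5 * math.log(t), 5 / x),
]

for name, func, expected in checks:
    approx = numerical_derivative(func, x)
    print(f"{name:>10}: rule gives {expected:.6f}, finite difference gives {approx:.6f}")
```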

Why does any of this matter for ML? Because training a model is a search for the bottom of a valley. The "valley" is the loss function — one number per choice of parameters, smaller is better. The derivative tells you which way is downhill at the spot you're standing on. Gradient descent takes a small step in the negative derivative direction, then repeats. After millions of repetitions, you're sitting somewhere near the bottom — a model that fits the data.
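The "step downhill, then repeat" loop fits in a few lines for a one-parameter toy problem. The loss (w − 2)², the starting point, and the learning rate below are made-up values chosen so the behaviour is easy to follow; a real model's loss would be far more complicated.

```python
def loss(w):
    """Toy one-parameter loss with its minimum at w = 2."""
    return (w - 2.0) ** 2


def loss_derivative(w):
    """Derivative of the toy loss: d/dw (w - 2)^2 = 2 (w - 2)."""
    return 2.0 * (w - 2.0)


w = -5.0             # arbitrary starting point
learning_rate = 0.1  # assumed step size

for step in range(100):
    # Step against the derivative: downhill on the loss curve.
    w = w - learning_rate * loss_derivative(w)

print(f"w after 100 steps: {w:.6f}, loss: {loss(w):.2e}")
```

With these settings the distance to the minimum shrinks by a factor of 0.8 per step, so w ends up very close to 2.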

In a Transformer: every learnable parameter — and there are billions — has a derivative of the loss with respect to itself. The optimizer reads each derivative as "if I nudge this weight up a tiny bit, the loss goes up (or down) by this much." Then it nudges every weight in the direction that lowers the loss. That nudge, scaled by the learning rate, is one training step. The next three sections build up the machinery the Transformer actually uses to compute those derivatives at scale.

Partial Derivative

A derivative for a multi-variable function — one variable at a time, the rest held still.

Section 1's derivative assumed a function of one variable: input x, output f(x). Real models have millions of inputs at once — weights, biases, inputs from the previous layer. We need a derivative concept that handles a function of many variables. Enter the partial derivative.

Start with a function of two variables: f(x, y). Picture a landscape — every latitude x and longitude y has an altitude f(x, y). Stand at the point (x₀, y₀). Two questions are now possible: "if I take a tiny step east (just x changes), how fast does the altitude change?" and "if I step north (just y changes), same question?" Those two answers are the two partial derivatives at that point.

The notation uses a curly ∂ (read "del" or "partial") instead of the straight d from Section 1, to flag that other variables are being held still:

∂f/∂x — rate of change as x moves; y held fixed.
∂f/∂y — rate of change as y moves; x held fixed.

Computing one is mechanical: treat all the other variables as if they were constants, then apply the single-variable rules from Section 1. Example with f(x, y) = x² + 3xy + y³:

  ∂f/∂x   (treat y as constant)
        = d/dx [x²] + d/dx [3xy] + d/dx [y³]
        = 2x        + 3y         + 0
        = 2x + 3y

  ∂f/∂y   (treat x as constant)
        = d/dy [x²] + d/dy [3xy] + d/dy [y³]
        = 0         + 3x         + 3y²
        = 3x + 3y²

Read ∂f/∂x = 2x + 3y as: "wherever you stand on the surface, this is how fast altitude rises if you push purely in the x-direction." Plug in a specific point, say (x, y) = (1, 2), and you get the number 8.
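The "hold the other variable still" recipe translates directly into a numerical check: wiggle one argument, leave the other untouched. The helpers `partial_x` and `partial_y` and the step size `h` are illustrative names, not part of any library.

```python
def f(x, y):
    """The running example f(x, y) = x^2 + 3xy + y^3."""
    return x ** 2 + 3 * x * y + y ** 3


def partial_x(func, x, y, h=1e-6):
    """Approximate df/dx: only x moves, y is held fixed."""
    return (func(x + h, y) - func(x - h, y)) / (2 * h)


def partial_y(func, x, y, h=1e-6):
    """Approximate df/dy: only y moves, x is held fixed."""
    return (func(x, y + h) - func(x, y - h)) / (2 * h)


x0, y0 = 1.0, 2.0
print("df/dx numeric:", partial_x(f, x0, y0), " formula 2x + 3y:", 2 * x0 + 3 * y0)
print("df/dy numeric:", partial_y(f, x0, y0), " formula 3x + 3y^2:", 3 * x0 + 3 * y0 ** 2)
```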

Holding one variable fixed turns the surface into a single curve. The partial is that curve's slope.

The "slice" picture is the right intuition. Holding y fixed at y₀ carves a single curve out of the surface — the cross-section where y = y₀. On that 2-D cross-section, x is the only variable, and the slope of the tangent there is precisely ∂f/∂x. The partial derivative is an ordinary derivative; the partial sign just records which slice you took.

Extending to D variables changes nothing about the recipe. A function f(x₁, x₂, …, x_D) has D partial derivatives, one per variable; each one is computed by holding the other D − 1 variables fixed. The mechanical difficulty doesn't grow — only the bookkeeping.

A small but central subtlety: ∂f/∂x is itself a function of every variable — change y and the slope-in-x usually changes too. The partial derivative at a point is one number, but the partial derivative as a function is another multi-variable function of the same dimensionality as f.
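A short illustration of that subtlety, using the analytic partial of the running example: x stays at 1 while y changes, and the slope in the x-direction changes with it. The function name `df_dx` is just a label for this sketch.

```python
def df_dx(x, y):
    """Analytic partial of f(x, y) = x^2 + 3xy + y^3 with respect to x."""
    return 2 * x + 3 * y


# Same x, different y: the slope in the x-direction is not the same number.
for y in (0.0, 2.0, 5.0):
    print(f"slope in x at (1, {y}) = {df_dx(1.0, y)}")
```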

In a Transformer: the loss is a single scalar — one number per training example, averaged across the batch. The model has billions of parameters w₁, w₂, …, w_n. What training needs is the partial of the loss with respect to each parameter: ∂L/∂wᵢ for every i. Each one answers "if I nudge this one weight up a hair, does the loss go up or down?" The next section, the chain rule, is what makes computing these billions of partials in one pass actually possible.

Chain Rule

When functions feed into functions, multiply the rates along the chain.

Imagine three meshed gears. Turn gear A by 1° and gear B turns by 2°; turn gear B by 1° and gear C turns by 5°. How fast does gear C turn when you turn gear A? Multiply the ratios: one degree of A produces 2 · 5 = 10 degrees of C. That's the chain rule. When changes propagate through a chain of functions, the overall rate of change is the product of the local rates along the chain.

Formally: if y = f(g(x)) — meaning you feed x into g, then feed that into f — the derivative is:

dy/dx = f'(g(x)) · g'(x)

Or in the more transparent Leibniz form, using a helper variable u = g(x):

dy/dx = dy/du · du/dx

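The Leibniz form looks like a trivial fraction cancellation. Strictly speaking, dy/du and du/dx are limits rather than ordinary fractions, but the rule behaves exactly as the cancellation suggests: derivatives compose by multiplication. Three functions chained, with y depending on u, u on v, and v on x? dy/dx = dy/du · du/dv · dv/dx. N functions chained? N factors multiplied together. The recipe never changes.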

Concrete example. Let y = sin(x²). Set u = x², then y = sin(u):

  dy/du  =  cos(u)         (derivative of sin)
  du/dx  =  2x             (derivative of x²)
  dy/dx  =  dy/du · du/dx
         =  cos(u) · 2x
         =  cos(x²) · 2x   (substitute u back in)

Plug in x = 1: cos(1) · 2 ≈ 0.54 · 2 ≈ 1.08. That's the slope of sin(x²) at x = 1. Two single-variable derivatives, multiplied — no calculus harder than Section 1's rules ever entered the picture.
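The worked example can be reproduced step by step. The sketch computes the two local rates, multiplies them, and compares the product with a finite-difference slope of sin(x²) itself. `chain_rule_slope` and `numerical_derivative` are hypothetical helper names.

```python
import math


def y(x):
    """The composite function y = sin(x^2)."""
    return math.sin(x ** 2)


def chain_rule_slope(x):
    """dy/dx assembled from the two local rates, as in the text."""
    u = x ** 2
    dy_du = math.cos(u)  # derivative of sin(u) with respect to u
    du_dx = 2 * x        # derivative of x^2 with respect to x
    return dy_du * du_dx


def numerical_derivative(func, x, h=1e-6):
    """Central-difference check that does not use the chain rule at all."""
    return (func(x + h) - func(x - h)) / (2 * h)


x = 1.0
print(f"chain rule:        {chain_rule_slope(x):.4f}")
print(f"finite difference: {numerical_derivative(y, x):.4f}")
```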

y = sin(x²). Two local rates multiply into the overall slope at x = 1.

The multivariable version is the chain rule that powers all of deep learning. If L depends on y which depends on w, and y is a vector or even a whole layer's worth of activations, the same multiplication applies — just with partials in place of ordinary derivatives, summed across the intermediate variables. The skeleton stays the same: local derivatives, multiplied along the path from input to output.

This is precisely what backpropagation is. A modern model is a deep composition: input → layer 1 → layer 2 → … → layer N → loss. To get ∂L/∂w for a weight w living inside layer 5, you multiply the chain of derivatives running from L back through layers N, N−1, …, down to 5, finally hitting w. There's no calculus deeper than Section 1 anywhere in the algorithm — just the chain rule, applied billions of times.

Backprop earns its name from the direction it traverses the chain. Going forward (input → loss) is how we compute predictions and the loss value. Going backward (loss → input) is how we compute the gradients. The chain rule works in either direction; backward turns out to be the dramatically cheaper one when there's one scalar loss and millions of parameters.
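A stripped-down picture of the two directions, assuming a chain of three scalar "layers" invented for this sketch (scale by 3, square, tanh). The forward pass records what each layer saw; the backward pass walks the layers in reverse and multiplies their local derivatives. Real networks use vectors and matrices, but the bookkeeping has the same shape.

```python
import math

# Each hypothetical "layer" is a pair: (function, derivative of that function).
layers = [
    (lambda v: 3.0 * v, lambda v: 3.0),              # scale
    (lambda v: v ** 2, lambda v: 2.0 * v),           # square
    (math.tanh, lambda v: 1.0 - math.tanh(v) ** 2),  # squash
]


def forward(x):
    """Run the chain input -> output and remember the input each layer saw."""
    inputs = []
    value = x
    for func, _ in layers:
        inputs.append(value)
        value = func(value)
    return value, inputs


def backward(inputs):
    """Multiply local derivatives, walking from the last layer to the first."""
    grad = 1.0  # d(output)/d(output)
    for (_, dfunc), layer_input in zip(reversed(layers), reversed(inputs)):
        grad *= dfunc(layer_input)
    return grad


x = 0.2
output, inputs = forward(x)
slope = backward(inputs)
print(f"output = {output:.6f}, d(output)/dx = {slope:.6f}")
```

The backward pass reuses the values stored during the forward pass, which is why frameworks keep activations in memory until the gradients have been computed.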

In a Transformer: every operation a Transformer performs — matrix multiplication, softmax, layer norm, GELU, residual add — registers itself in a computation graph as the forward pass runs. When the backward pass starts, the framework walks that graph from loss to inputs, multiplying local gradients at every node. Each weight inherits its ∂L/∂w as the chain rule cashes out the multiplications. One forward pass + one backward pass = one training step. The chain rule isn't a technicality — it is, mechanically, the entire reason deep learning is possible.
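The "operations register themselves in a graph" idea can be sketched with a tiny scalar class. This is a toy written for this primer, not the API of any real framework: the class name `Value`, its fields, and the three supported operations are all assumptions made for illustration.

```python
import math


class Value:
    """A scalar that records how it was computed (hypothetical toy class)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # Values this one was computed from
        self.local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def tanh(self):
        t = math.tanh(self.data)
        return Value(t, (self,), (1.0 - t * t,))

    def backward(self):
        """Walk the graph from this node back to the leaves, applying the chain rule."""
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)  # parents are appended before their children

        visit(self)
        self.grad = 1.0
        for node in reversed(order):  # children first, so node.grad is complete
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += local * node.grad


# Forward pass: builds the graph as a side effect of ordinary arithmetic.
w, x, b = Value(0.5), Value(2.0), Value(-0.3)
out = (w * x + b).tanh()

# Backward pass: fills in .grad on every node that fed into `out`.
out.backward()
print(f"out = {out.data:.4f}, d(out)/dw = {w.grad:.4f}, d(out)/db = {b.grad:.4f}")
```

Accumulating with `+=` matters when a value feeds into more than one later operation: its contributions along each path are summed, which is the multivariable chain rule mentioned earlier.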

Gradient

All the partial derivatives, packed into a single vector that points uphill.

A multi-variable function has many partial derivatives — one per input variable. Listing them all individually is fine, but a more useful move is to pack them into a single vector. That vector is the gradient, written ∇f (read "del f" or "grad f"):

∇f = (∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂x_D)

The gradient lives in the same space as the input. A function of 2 variables has a 2-D gradient; a function of D variables has a D-dimensional gradient. The gradient at a specific point — say, ∇f(1, 2) — is the same vector with numbers plugged in.

Example with f(x, y) = x² + 3xy + y³ from Section 2:

  ∂f/∂x   =  2x + 3y
  ∂f/∂y   =  3x + 3y²

  ∇f       =  ( 2x + 3y,  3x + 3y² )

  ∇f(1, 2) =  ( 2·1 + 3·2,  3·1 + 3·4 )
           =  ( 8, 15 )

At the point (1, 2), the gradient is the 2-D vector (8, 15). Two things make this vector special, and both matter for ML:

Direction. ∇f points in the direction along which f is increasing the fastest. A marble released on a smooth landscape starts rolling in the direction of steepest descent; the gradient points exactly the opposite way.
Magnitude. The norm ‖∇f‖ tells you how fast f climbs in that best direction. A short gradient means the surface is nearly flat; a long one means it's steep.
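Both properties can be probed numerically at the point (1, 2). The sketch below takes a tiny step of fixed length in 360 different directions, records how much f rises for each, and compares the best direction and rate with the gradient's angle and norm. The step length `eps` and the one-degree grid are arbitrary choices.

```python
import math


def f(x, y):
    """The running example f(x, y) = x^2 + 3xy + y^3."""
    return x ** 2 + 3 * x * y + y ** 3


def grad_f(x, y):
    """Gradient from the text: (2x + 3y, 3x + 3y^2)."""
    return (2 * x + 3 * y, 3 * x + 3 * y ** 2)


gx, gy = grad_f(1.0, 2.0)
norm = math.hypot(gx, gy)
gradient_angle = math.atan2(gy, gx)

# Try a small step of length eps in each whole-degree direction.
eps = 1e-4
best_angle, best_rise = None, -math.inf
for k in range(360):
    angle = math.radians(k)
    rise = f(1.0 + eps * math.cos(angle), 2.0 + eps * math.sin(angle)) - f(1.0, 2.0)
    if rise > best_rise:
        best_angle, best_rise = angle, rise

print(f"gradient = ({gx}, {gy}), norm = {norm}")
print(f"gradient direction: {math.degrees(gradient_angle):.2f} degrees")
print(f"best probed direction: {math.degrees(best_angle):.2f} degrees")
print(f"rise per unit step in best direction: {best_rise / eps:.3f}")
```

The best probed direction should land within a degree of the gradient's direction, and the rise per unit step should come out close to the gradient's norm.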

The "direction of steepest ascent" property is the whole reason gradients run ML. The loss function is a surface with billions of input dimensions; we want to find a low point. We compute the gradient — the direction of steepest ascent — and then walk in the opposite direction. That's gradient descent:

w ← w − η · ∇f(w)

Read that one line carefully: it's the update at the core of almost every neural network trained today. The current parameter vector w moves opposite the gradient. The scalar η (eta, the learning rate) controls how big each step is. Too small and training crawls; too big and it overshoots and can diverge. Pick a reasonable η, repeat the update many thousands or millions of times, and the parameters typically settle near a (local) minimum of the loss.
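The effect of the learning rate is easy to see on a two-parameter toy loss. The bowl-shaped loss below, its minimum at (3, −1), and the three values of η are invented for this sketch; the update inside the loop is the rule w ← w − η · ∇f(w) from the text.

```python
def loss(w):
    """Toy bowl with its minimum at w = (3, -1)."""
    return (w[0] - 3.0) ** 2 + 2.0 * (w[1] + 1.0) ** 2


def grad_loss(w):
    """Gradient of the toy loss, one partial per parameter."""
    return [2.0 * (w[0] - 3.0), 4.0 * (w[1] + 1.0)]


def gradient_descent(w, eta, steps):
    """Apply w <- w - eta * grad(w) repeatedly and return the final w."""
    for _ in range(steps):
        g = grad_loss(w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w


start = [0.0, 0.0]
for eta in (0.001, 0.1, 0.6):
    w = gradient_descent(start, eta, steps=50)
    print(f"eta = {eta:<5}  final loss = {loss(w):.3e}")
```

In this setup the smallest η barely moves in 50 steps, the middle one reaches the bottom, and the largest one overshoots further on every step so the loss blows up.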

Gradient points away from the minimum; descent walks the marker against it.

Three useful facts:

At a minimum (or any other flat spot, such as a maximum or a saddle point), the gradient is the zero vector (0, 0, …, 0). At a minimum specifically, no direction is uphill, because you're at the bottom. Optimizers often track convergence by watching the gradient's magnitude shrink toward 0.
The gradient is always perpendicular to the level curves of f — the lines where f stays constant. (A hiker following a level curve goes neither uphill nor downhill; the gradient, which is the uphill direction, must therefore cross those lines at 90°.)
For a function from D inputs to M outputs (instead of one scalar output), the natural generalization is a matrix called the Jacobian, with one row per output and one column per input (M × D in all). The gradient is the special case M = 1: with a single scalar output, the Jacobian collapses to a single row.
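The Jacobian layout can be made concrete with a finite-difference sketch. The helper `numerical_jacobian` and the three-output map `g` are hypothetical, chosen only to show the "one row per output, one column per input" shape; the last lines wrap the running scalar example as a one-output function to show the single-row case.

```python
def numerical_jacobian(func, point, h=1e-6):
    """Finite-difference Jacobian: one row per output, one column per input."""
    n_outputs = len(func(point))
    jac = [[0.0] * len(point) for _ in range(n_outputs)]
    for j in range(len(point)):
        plus, minus = list(point), list(point)
        plus[j] += h   # nudge input j up, all other inputs held fixed
        minus[j] -= h  # nudge input j down
        f_plus, f_minus = func(plus), func(minus)
        for i in range(n_outputs):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2 * h)
    return jac


def g(p):
    """Hypothetical map from 2 inputs to 3 outputs."""
    x, y = p
    return [x * y, x + y, x ** 2]


def f(p):
    """The running scalar example, wrapped as a one-output function."""
    x, y = p
    return [x ** 2 + 3 * x * y + y ** 3]


jac_g = numerical_jacobian(g, [1.0, 2.0])
jac_f = numerical_jacobian(f, [1.0, 2.0])
print("Jacobian of g (3 rows, 2 columns):")
for row in jac_g:
    print("  ", [round(v, 4) for v in row])
print("Jacobian of f (1 row, the gradient):", [round(v, 4) for v in jac_f[0]])
```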

In a Transformer: training reduces to repeating two steps. Forward: compute the loss L from a batch of examples. Backward: compute ∇L — a vector with one entry per parameter, billions of entries long. Then apply w ← w − η · ∇L (often dressed up as Adam or Lion, but the skeleton is identical). Every interesting deep-learning topic — momentum, weight decay, learning-rate schedules, gradient clipping, mixed-precision training — is a refinement of this one update rule. The gradient is the central object: linear algebra (Primer 1) supplies the space it lives in, probability (Primer 2) supplies the loss that defines it, and the chain rule (last section) supplies the algorithm that computes it.
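The forward, backward, update rhythm can be shown end to end on a model with just two parameters. This is a minimal sketch assuming a straight-line model, a mean-squared-error loss, hand-derived gradients, and toy data generated from y = 2x + 1; none of these come from the primer, and a Transformer would rely on automatic differentiation rather than handwritten partials.

```python
# Toy data generated from y = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = 0.0, 0.0  # the two learnable parameters
eta = 0.05       # assumed learning rate

for step in range(2000):
    # Forward: predictions and the scalar loss (mean squared error).
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]
    loss = sum(e * e for e in errors) / len(xs)

    # Backward: the gradient, one partial derivative per parameter.
    grad_w = sum(2 * e * x for e, x in zip(errors, xs)) / len(xs)
    grad_b = sum(2 * e for e in errors) / len(xs)

    # Update: step against the gradient.
    w -= eta * grad_w
    b -= eta * grad_b

print(f"w = {w:.4f}, b = {b:.4f}, final loss = {loss:.2e}")
```

The recovered parameters should land close to the slope 2 and intercept 1 that generated the data.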