Neural Net Primer
The minimum neural-network anatomy you need before any deep-learning paper. Four short topics: the single neuron at the bottom of every model, the activation functions that make non-linearity possible (and why non-linearity is required at all), the way neurons compose into layers, and the forward pass (input → layer → layer → output) that ties them together.
A Single Neuron
The atom of every modern model.
Pick up any deep-learning paper and trace its computations all the way down. Eventually you bottom out at the same thing every time: a neuron. Take a few numbers in, mix them by a recipe of weights, add a bias, squeeze the result through a non-linear function, send one number out. Billions of those, stacked the right way, is what a Transformer is. Start with one.
The math of a single neuron is one line:
output = activation(W · x + b)
Four objects, each with a job:
- Inputs x: a vector of numbers coming in. Could be raw features (pixels, sqft, a token embedding) or the outputs from a previous layer.
- Weights W: one number per input. Tells the neuron how much each input matters and in what direction. These are the things training learns.
- Bias b: a scalar added to the weighted sum. Lets the neuron shift its activation up or down independently of the inputs.
- Activation function: a non-linear function that takes the weighted sum and produces the neuron's output. The subject of §2.
Concrete example. Two inputs, one neuron, ReLU activation. Let x = (1.5, −0.7), W = (0.8, −0.3), b = 0.2:
W · x = 0.8 · 1.5 + (−0.3) · (−0.7)
= 1.20 + 0.21
= 1.41
W · x + b = 1.41 + 0.20 = 1.61
ReLU(1.61) = max(0, 1.61) = 1.61 ← neuron output
The W · x piece is exactly the dot product from the linear-algebra primer (§5). The weights vector decides what kind of "input pattern" excites this neuron; any dot product is a similarity score, so W · x measures how much the current input "looks like" what this neuron is looking for. A high dot product produces a high activation; a low (or negative) dot product produces a low one.
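The worked example above can be reproduced in a few lines. This is a minimal sketch in plain Python with no libraries, so the arithmetic stays visible; the function names `relu` and `neuron` are illustrative choices, not part of any library.

```python
def relu(z):
    """ReLU activation: pass positives through, clamp negatives to 0."""
    return max(0.0, z)


def neuron(x, w, b, activation=relu):
    """Compute activation(w . x + b) for a single neuron."""
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x))  # the dot product W . x
    return activation(weighted_sum + b)


x = [1.5, -0.7]  # inputs
w = [0.8, -0.3]  # weights, one per input
b = 0.2          # bias

output = neuron(x, w, b)
print(round(output, 2))  # 1.61
```

Flipping the sign of the bias or of a weight and re-running is a quick way to see how each of the four objects moves the output.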
The bias is just a shift — it sets the neuron's "default" output when the weighted sum is zero. Without it, every neuron would have to fire activation(0) when all inputs cancel; with it, neurons can be biased toward firing or biased toward silence. Bias is what lets a network represent more than just lines through the origin.
Where does the neuron metaphor come from? Vaguely, biology. A biological neuron sums up signals on its dendrites and fires an electrical pulse down its axon if the sum crosses a threshold. The artificial neuron in this primer is a very loose approximation: weighted sum + threshold-like non-linearity. The biology and the math diverged long ago, and modern deep-learning networks aren't models of brains in any meaningful sense. But the name stuck.
A single neuron is, on its own, a glorified linear classifier, like the linear regression from the supervised primer with one extra squashing function. The leap to "neural network" comes from putting many of them next to each other to form a layer (§3) and stacking layers so that each one feeds the next (§4). Each individual neuron stays this simple all the way up to GPT-scale models.
In a Transformer: the query, key, and value projections, the attention output projection, the feed-forward layers. Nearly every learnable operation in a Transformer is composed of neurons of this exact form, just stacked into much larger matrices and wrapped in attention and residual connections. Open a 70-billion parameter LLM and at the bottom there's nothing more exotic than this W · x + b repeated at enormous scale: the 70 billion parameters are the individual weights and biases, glued into the right network shape.
Activation Functions
The non-linearity that turns a stack of matrix multiplies into something powerful.
Section 1's neuron has a weighted sum W · x + b and an activation function wrapping it. The weighted sum is the easy part — it's just linear algebra. The activation function is where the magic happens. Without it, a deep neural network has no expressive power beyond a single linear layer.
Here's the argument: two linear layers stacked, with no activation in between, compute W₂ · (W₁ · x + b₁) + b₂. Distribute the multiplication and that becomes (W₂ · W₁) · x + (W₂ · b₁ + b₂). The two layers algebraically collapse into a single equivalent linear layer with weights W₂ · W₁ and bias W₂ · b₁ + b₂. Stack a hundred linear layers and the result is still a single linear function. No matter how deep, you've fancy-wrapped a line.
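A quick numerical check of that collapse, as a sketch in plain Python. The weight values and the helper names (`matvec`, `matmul`, `add`) are made up for illustration; any other numbers would give the same agreement.

```python
def matvec(W, x):
    """Matrix times vector: one dot product per row of W."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Matrix times matrix, with matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def add(u, v):
    """Element-wise vector addition."""
    return [a + b for a, b in zip(u, v)]


# Hypothetical weights: layer 1 maps 2 -> 3, layer 2 maps 3 -> 2.
W1 = [[0.8, -0.3], [-0.4, 0.6], [0.5, 0.9]]
b1 = [0.2, 0.1, -0.5]
W2 = [[0.6, -0.2, 0.4], [0.1, 0.3, -0.7]]
b2 = [0.1, -0.2]
x = [1.5, -0.7]

# Two linear layers applied in sequence, no activation in between.
two_layers = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The single equivalent layer: weights W2 . W1, bias W2 . b1 + b2.
W_merged = matmul(W2, W1)
b_merged = add(matvec(W2, b1), b2)
one_layer = add(matvec(W_merged, x), b_merged)

print(two_layers)
print(one_layer)  # same values, up to floating-point rounding
```

Putting a ReLU around the first layer's output breaks the equality for inputs that push any hidden unit negative, which is the whole point of the next paragraph.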
The activation function is what stops that collapse. With a non-linear function (anything that isn't a straight line) inserted between layers, the layers can no longer be combined into one, so stacking does add expressive power. Universal approximation kicks in: a large enough network with non-linear activations can approximate any continuous function on a bounded input range as closely as you like.
Four activations cover almost everything you'll meet:
- ReLU: max(0, x). Pass positives through unchanged; turn negatives into 0. Brutally simple: one comparison, no transcendental function call. Fast, gradient-friendly (the derivative is just 1 or 0), and the default for most feed-forward layers from ~2010 onward.
- Sigmoid: σ(x) = 1 / (1 + e^(−x)). Smooth S-shape from 0 to 1. The classic neural-network activation, useful when you need an output bounded between 0 and 1 (e.g., predicting a probability). Largely replaced inside hidden layers because gradients vanish at the saturated ends.
- Tanh: tanh(x). Smooth S-shape from −1 to 1, basically a rescaled sigmoid that's zero-centered. Used in many older recurrent networks (LSTMs, GRUs); still common when zero-centered output matters.
- GELU: x · Φ(x), where Φ is the standard Gaussian CDF. Looks like a smooth ReLU that lets a small negative tail through. The standard activation in Transformer feed-forward layers (used in BERT, the GPT family, and many later LLMs). Smoother gradient than ReLU around zero, at the cost of one transcendental function evaluation.
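The four definitions above translate almost directly into code. A minimal sketch using only Python's standard `math` module; the GELU here is the exact form x · Φ(x), with Φ written in terms of `math.erf` (frameworks sometimes use a tanh-based approximation instead).

```python
import math


def relu(x):
    """max(0, x)"""
    return max(0.0, x)


def sigmoid(x):
    """1 / (1 + e^(-x)), output strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Hyperbolic tangent, output strictly between -1 and 1."""
    return math.tanh(x)


def gelu(x):
    """x * Phi(x), where Phi is the standard Gaussian CDF."""
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi


# Print a small table of values to compare the four shapes.
for v in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(v, relu(v), round(sigmoid(v), 4), round(tanh(v), 4), round(gelu(v), 4))
```

The negative rows of the table show the difference between ReLU and GELU: ReLU returns exactly 0, while GELU returns a small negative number.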
Three things to notice about these four functions:
- All four are non-linear. None of them is a straight line, and that's the whole point. The "kink" in ReLU at x = 0 is enough non-linearity to unlock universal approximation; the smooth curves of sigmoid / tanh / GELU give the same property with continuous derivatives.
- Sigmoid and tanh saturate at the ends. Far away from zero, their derivatives shrink to ~0. During backpropagation (calculus primer §3) those tiny gradients multiply together through many layers, and the signal "vanishes" before it reaches the early layers. This is why ReLU and GELU dominate now: for positive inputs their derivative stays at or near 1, so the gradient survives the trip back.
- ReLU has a "dead zone." For negative inputs, both the output and the gradient are exactly zero — a neuron that's spent the whole training session in the negative zone never learns anything. Variants like Leaky ReLU, PReLU, ELU, and GELU exist partly to address this.
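The saturation and dead-zone points above can be made concrete with a little arithmetic. This is a deliberately simplified sketch: it multiplies only the activation derivatives across a hypothetical 20-layer chain and ignores the weight matrices, which also enter the real backpropagated gradient.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of sigmoid: s * (1 - s). Peaks at 0.25 when x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


depth = 20  # hypothetical number of stacked layers

# Best case for sigmoid: every unit sits at x = 0, where the slope is largest.
sigmoid_chain = sigmoid_grad(0.0) ** depth

# ReLU with every unit on its positive side: the slope is exactly 1.
relu_chain = relu_grad(1.0) ** depth

print(sigmoid_grad(0.0))   # 0.25
print(sigmoid_grad(10.0))  # tiny: the saturated end
print(sigmoid_chain)       # 0.25 ** 20, roughly 9e-13
print(relu_chain)          # 1.0
print(relu_grad(-1.0))     # 0.0, the dead zone
```

Even in sigmoid's best case the product shrinks by a factor of four per layer, while a ReLU unit either passes the gradient through untouched or blocks it completely.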
Picking an activation is rarely the highest-impact decision; modern conventional wisdom is "ReLU is fine for almost everything, GELU is the default in Transformers, use sigmoid only when you specifically want a probability." Worry about the data, the architecture, the optimizer first; come back to activations when the rest is squared away.
In a Transformer: the feed-forward sublayer inside each Transformer block applies GELU(x · W₁ + b₁) · W₂ + b₂ — two linear layers with a GELU between them. The "FFN" or "MLP" you read about in architecture diagrams is exactly this. Newer variants (SwiGLU, used in LLaMA / PaLM) replace GELU with a slightly more elaborate gated structure, but the principle is identical: insert a non-linearity or your stack collapses to a single linear map.
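A minimal sketch of that feed-forward computation in plain Python, following the row-vector convention of the formula (x · W₁ rather than W₁ · x). The sizes (model width 2, hidden width 4), the weight values, and the names `vecmat` and `ffn` are all hypothetical; real models use widths in the thousands and a framework's matrix routines.

```python
import math


def gelu(z):
    """Exact GELU: z * Phi(z), with Phi the standard Gaussian CDF."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def vecmat(x, W):
    """Row vector (length m) times matrix (m x n) gives a length-n vector."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) for j in range(len(W[0]))]


def ffn(x, W1, b1, W2, b2):
    """GELU(x . W1 + b1) . W2 + b2"""
    hidden = [gelu(h + b) for h, b in zip(vecmat(x, W1), b1)]  # expand, then GELU
    return [o + b for o, b in zip(vecmat(hidden, W2), b2)]     # project back


# Hypothetical toy weights: 2 -> 4 -> 2.
W1 = [[0.5, -0.5, 1.0, 0.0],
      [0.25, 0.75, -1.0, 0.5]]
b1 = [0.0, 0.1, -0.1, 0.2]
W2 = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5], [-1.0, 1.0]]
b2 = [0.05, -0.05]

x = [1.5, -0.7]
out = ffn(x, W1, b1, W2, b2)
print(out)  # a vector of the same width as x
```

The output has the same width as the input, which is what lets the residual connection add the two together.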
Layers
A row of neurons looking at the same input, stacked into a matrix.
Section 1's neuron takes a vector in and produces one number out. That's a bottleneck — most useful models output many numbers per step. A layer is the obvious fix: line up several neurons, give them all the same input, let each one apply its own weights and bias, and collect their outputs into a new vector. That's it. A layer is a row of neurons all looking at the same input.
Three roles every network has:
- Input layer. Not really a "layer" in the computational sense, just the input vector x the model receives. Its size is fixed by the problem (number of features, number of pixels, dimension of a token embedding).
- Hidden layers. The middle of the network. Each hidden layer takes the previous layer's output, applies its own W and b, and produces a new vector for the next layer. "Deep" learning means many of these.
- Output layer. The last layer. Its size is fixed by the task: 1 for regression, 10 for ImageNet's 10-class subset, 50,000 for an LLM's vocabulary.
When you stack neurons into a layer, the row of dot products becomes a single matrix-times-vector product, exactly as in the linear-algebra primer. If the layer has m input units and n output units:
h = activation(W · x + b)
Same formula as §1, except now W is an n × m matrix (one row per neuron), b is a length-n vector, and the output h is a length-n vector. Linear-algebra primer §6 (matrix multiplication) handles the bookkeeping. A neural-network layer is arguably the most important application of "matrix times vector" in modern computing.
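The "one row per neuron" reading of W can be sketched in plain Python. The function name `dense_layer` is an illustrative choice, and the numbers are the ones used in this section's worked example, so the two can be compared directly.

```python
def relu(z):
    return max(0.0, z)


def dense_layer(x, W, b, activation=relu):
    """h = activation(W . x + b). Each row of W is one neuron's weight vector."""
    return [
        activation(sum(w * xi for w, xi in zip(row, x)) + bias)
        for row, bias in zip(W, b)
    ]


x = [1.5, -0.7]
W = [[0.8, -0.3],   # neuron 1's weights
     [-0.4, 0.6],   # neuron 2's weights
     [0.5, 0.9]]    # neuron 3's weights
b = [0.2, 0.1, -0.5]

h = dense_layer(x, W, b)
print([round(v, 2) for v in h])  # [1.61, 0.0, 0.0]
```

The list comprehension is just the §1 neuron run once per row; a matrix library performs the same work in a single call.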
Concrete example. 2 inputs, 3 neurons in the layer, ReLU activation. Take x = (1.5, −0.7):
x = ( 1.5, −0.7 )
W (3 × 2) = ┌ 0.8 −0.3 ┐ ← neuron 1's weights
│ −0.4 0.6 │ ← neuron 2's weights
└ 0.5 0.9 ┘ ← neuron 3's weights
b = ( 0.2, 0.1, −0.5 )
W · x = ( 0.8·1.5 + (−0.3)·(−0.7),
−0.4·1.5 + 0.6 ·(−0.7),
0.5·1.5 + 0.9 ·(−0.7) )
= ( 1.41, −1.02, 0.12 )
+ b = ( 1.61, −0.92, −0.38 )
ReLU(·) = ( 1.61, 0.00, 0.00 ) ← layer output
Two dimensions describe the shape of any network built from layers:
- Width. Number of neurons in the layer — the output dimension. Bigger width = more parameters per layer = potentially more expressive power per layer.
- Depth. Number of layers stacked. More depth = more rounds of "extract features, recombine, extract more features" the model can perform.
Modern model design lives in the trade-off between width and depth. Wider layers add parameters quadratically (a layer of width w on input of width w has w² weights). Adding depth adds parameters linearly (one more layer of width w adds w² more weights, regardless of how many came before). Common shapes:
- Tiny MLP for tabular data. 2-3 hidden layers, each ~64-256 wide. A few tens of thousands of parameters total.
- ResNet-50 (image classification). ~50 layers, ~25 million parameters. Roughly 2015-era state-of-the-art for ImageNet.
- LLaMA 3 (70B). 80 Transformer blocks, each with multiple sublayers of hidden width 8192. 70 billion parameters.
Whatever the scale, the operation inside one layer is the same activation(W · x + b) as §1 — just with bigger matrices.
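The width-versus-depth arithmetic above is easy to check with a small counting function. A sketch for plain fully connected stacks only; `count_parameters` and the layer sizes are hypothetical, and real architectures add parameters this does not count (embeddings, normalization, attention).

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a fully connected stack.

    layer_sizes lists the vector width at each stage:
    [inputs, hidden_1, ..., outputs].
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # an n_out x n_in matrix and n_out biases
    return total


tiny = count_parameters([2, 3, 1])            # 2 inputs -> 3 hidden -> 1 output
base = count_parameters([64, 64, 64])         # two layers of width 64
wider = count_parameters([128, 128, 128])     # same depth, double the width
deeper = count_parameters([64, 64, 64, 64])   # same width, one more layer

print(tiny)    # 13
print(base)    # 8320
print(wider)   # 33024, roughly 4x base
print(deeper)  # 12480, base plus one more layer's worth
```

Doubling the width roughly quadruples the count, while adding a layer adds a fixed amount.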
In a Transformer: a single Transformer block contains several layers, each of this exact form. The attention sublayer projects each input vector through three learnable matrices (the Q, K, V projections; each one is a layer in this sense, with no activation applied). The feed-forward sublayer is two layers with a GELU between them (§2's footer). LayerNorm and residual connections wrap around them. A 70-billion parameter LLM like the LLaMA 3 example above is 80 of these blocks in series. Same W · x + b building block, repeated hundreds of times with very large matrices, in a very specific topology.
Forward Pass
Walk the input through every layer in order — that's what the model "does."
Section 3 gave you one layer; "deep" learning gives you many. The forward pass is the simple recipe for using them all: feed the input into the first layer, hand that layer's output to the second layer, hand the second layer's output to the third, and so on, until the final layer produces the model's prediction. The function the model represents is the composition of every layer's function.
One line of math captures it:
ŷ = f_L( … f_2( f_1(x) ) … )
where each f_i is a layer of the form activation(W_i · h_{i−1} + b_i) from §3, and h_0 = x is the input. The forward pass is just function composition, a concept from the calculus primer (§3 chain rule), now seen from the model's side rather than the optimizer's.
Concrete example. A tiny MLP: 2 inputs → 3-neuron hidden layer (ReLU) → 1-neuron output (no activation, for regression). Take x = (1.5, −0.7). Layer 1 was already computed in §3 — its output is h₁ = (1.61, 0, 0). Layer 2 takes that and produces the final prediction:
Layer 1 (already computed in §3)
────────────────────────────────
x = (1.5, −0.7)
W₁ · x + b₁ = (1.61, −0.92, −0.38)
h₁ = ReLU(·) = (1.61, 0, 0)
Layer 2 (output)
────────────────────────────────
W₂ = ( 0.6, −0.2, 0.4 ) (one row, since output is scalar)
b₂ = 0.1
W₂ · h₁ = 0.6·1.61 + (−0.2)·0 + 0.4·0 = 0.966
W₂ · h₁ + b₂ = 0.966 + 0.1 = 1.066
ŷ = 1.066 ← the model's prediction for x = (1.5, −0.7)
That's the entire model: a function that maps the 2-D input (1.5, −0.7) to the scalar output 1.066. Training is the search over all the weights and biases (every W_i and b_i) for the values that bring ŷ as close as possible to the true label across the dataset. The optimization from the supervised primer (§3) and the training loop from the gradient-descent primer do exactly that, one small update at a time.
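The whole two-layer example fits in a short loop. A sketch in plain Python using the weights from the worked example; the names `dense_layer`, `forward`, and `identity`, and the choice to store each layer as a (W, b, activation) tuple, are illustrative assumptions.

```python
def relu(z):
    return max(0.0, z)


def identity(z):
    """No activation, as used on the regression output layer."""
    return z


def dense_layer(x, W, b, activation):
    """activation(W . x + b), one row of W per neuron."""
    return [
        activation(sum(w * xi for w, xi in zip(row, x)) + bias)
        for row, bias in zip(W, b)
    ]


def forward(x, layers):
    """Feed x through each layer in order; each output is the next input."""
    h = x
    for W, b, activation in layers:
        h = dense_layer(h, W, b, activation)
    return h


layers = [
    ([[0.8, -0.3], [-0.4, 0.6], [0.5, 0.9]], [0.2, 0.1, -0.5], relu),  # layer 1
    ([[0.6, -0.2, 0.4]], [0.1], identity),                             # layer 2
]

y_hat = forward([1.5, -0.7], layers)
print(round(y_hat[0], 3))  # 1.066
```

Adding a layer means appending one more tuple to `layers`; the loop itself never changes.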
Three things to internalize about the forward pass:
- It's entirely deterministic. Once the weights are set, the forward pass is a pure function: same input always produces the same output. The randomness in modern LLMs (different completions for the same prompt) comes from sampling the output distribution (probability primer §2), not from anything random inside the network.
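- It's parallelizable. Each layer's computation is essentially one matrix multiplication, exactly what GPUs are designed to do fast. The layers still run one after another, but the work inside each one spreads across thousands of cores. Exact latency depends on hardware, numeric precision, and sequence length; for a 70B-parameter LLM spread over several modern accelerators, one forward pass for a single new token typically takes on the order of tens of milliseconds.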
- It's the same shape in inference and training. When you use a trained model (inference), all you do is run the forward pass. Training adds a backward pass on top to compute gradients (calculus primer §3); but the forward pass is identical in both cases.
How does training relate to the forward pass? Three steps, repeated millions of times:
- Forward pass: walk input through every layer, compute the loss on the prediction.
- Backward pass: walk back through every layer, multiplying local gradients (chain rule) to get ∂L / ∂W_i for every weight.
- Update: apply the gradient-descent step (gradient-descent primer §1) to every W_i and b_i.
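The three steps above can be sketched for the simplest possible model: one linear neuron with a squared-error loss, where the gradients can be written by hand. Everything specific here is an assumption made for illustration: the single training example, the target value 1.0, the learning rate 0.05, and the name `train_step`. Real training uses many examples and automatic differentiation.

```python
def train_step(w, b, x, y, lr):
    """One forward / backward / update cycle for y_hat = w . x + b."""
    # Forward pass: prediction and loss.
    y_hat = sum(wi * xi for wi, xi in zip(w, x)) + b
    loss = (y_hat - y) ** 2

    # Backward pass: chain rule by hand.
    # dL/dy_hat = 2 (y_hat - y); dy_hat/dw_i = x_i; dy_hat/db = 1.
    dloss = 2.0 * (y_hat - y)
    grad_w = [dloss * xi for xi in x]
    grad_b = dloss

    # Update: step each parameter against its gradient.
    new_w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
    new_b = b - lr * grad_b
    return new_w, new_b, loss


w, b = [0.0, 0.0], 0.0       # start from zero weights
x, y = [1.5, -0.7], 1.0      # one hypothetical training example

losses = []
for _ in range(50):
    w, b, loss = train_step(w, b, x, y, lr=0.05)
    losses.append(loss)

print(losses[0])   # 1.0, the loss before any update
print(losses[-1])  # very close to 0 after 50 steps
```

The loss recorded in each step is computed before that step's update, which is why the first entry reflects the untrained model.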
The forward pass is one third of that loop, but it's the only third that's ever run in production. Once a model is trained, every API call to ChatGPT, every image a diffusion model generates, every protein AlphaFold predicts — it's a forward pass.
In a Transformer: the forward pass walks input tokens through token embedding → a stack of Transformer blocks (80 in the 70B example from §3; each with attention, residuals, LayerNorm, and the feed-forward sublayer from §2) → a final linear projection to vocabulary size → softmax to get the next-token distribution. From your perspective, ChatGPT typing a response is just a long run of forward passes through the same network, one per output token. The Transformer is the architecture; the forward pass is what it does.
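The "one forward pass per output token" loop can be sketched without a real model. In the code below `toy_forward` is a hypothetical stand-in for the whole Transformer: it returns one score (logit) per vocabulary entry and simply favors the token after the last one. The 5-token vocabulary and the greedy "pick the most likely token" rule are also assumptions chosen to keep the example deterministic.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_forward(tokens, vocab_size=5):
    """Hypothetical stand-in for a Transformer forward pass."""
    target = (tokens[-1] + 1) % vocab_size
    return [2.0 if i == target else 0.0 for i in range(vocab_size)]


def generate(prompt, n_new_tokens):
    """Append n_new_tokens tokens, running one forward pass for each."""
    tokens = list(prompt)
    passes = 0
    for _ in range(n_new_tokens):
        probs = softmax(toy_forward(tokens))  # forward pass, then softmax
        passes += 1
        next_token = max(range(len(probs)), key=lambda i: probs[i])  # greedy pick
        tokens.append(next_token)
    return tokens, passes


tokens, passes = generate([0], 4)
print(tokens, passes)  # [0, 1, 2, 3, 4] 4
```

Replacing the greedy pick with a random draw from `probs` is where the run-to-run variation of real LLM output comes from; the forward pass itself stays deterministic.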