Probability & Statistics Primer

The minimum probability and statistics you need before reading any LLM training paper. Five short topics covering what a probability actually is, distributions and their sum-to-1 constraint, conditional probability, expectation / mean / variance, and the log probabilities that quietly run every language model under the hood. No prior math beyond high-school algebra assumed.

Probability

A single number between 0 and 1 that says how strongly we expect something.

Flip a fair coin. Before it lands, you can't say whether it'll show heads or tails — but you can say something useful: the two outcomes feel equally likely. A probability is just a number that pins down that feeling. Heads gets the number 0.5; tails also gets 0.5. Written as P(heads) = 0.5 — read as "the probability of heads is one-half."

The number always lives in the closed range [0, 1]. The endpoints are the two boundary cases everyone knows by heart, just dressed in formal clothes:

P(event) = 0 — the event cannot happen. The coin landing on its edge and growing wings.
P(event) = 1 — the event is certain. The sun rising tomorrow (close enough).
P(event) = 0.5 — totally neutral, a coin flip.
P(event) = 0.99 — almost certain, but not quite.
P(event) = 0.001 — unlikely, but not impossible.

Walking from 0 to 1 — each step lands on a canonical probability.

Two equivalent ways to read the number, and both come in handy:

As a long-run frequency. If you flipped the coin a million times, roughly half a million would land heads. Probability is the fraction the count converges to.
As a degree of belief. Before any flip — even just one — you'd bet 50/50. Probability is how strongly you commit to the outcome.

Why a single number? Because once you have it, lots of follow-up questions answer themselves with simple arithmetic. P(not heads) = 1 − P(heads) = 0.5. The chance of two independent fair flips both showing heads is 0.5 × 0.5 = 0.25. The chance of at least one heads in two flips is 1 − 0.25 = 0.75. The entire calculus of "what might happen" reduces to adding, subtracting, and multiplying numbers in [0, 1].
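The arithmetic in the paragraph above can be checked in a few lines. This is only a sketch; the variable names are illustrative and not part of any library.

```python
# Coin-flip arithmetic: complement, product of independent events, "at least one".
p_heads = 0.5

p_not_heads = 1 - p_heads               # complement rule
p_two_heads = p_heads * p_heads         # two independent flips, both heads
p_at_least_one = 1 - p_not_heads ** 2   # 1 minus P(both flips are tails)

print(p_not_heads, p_two_heads, p_at_least_one)  # 0.5 0.25 0.75
```

The "at least one" case is computed through its complement (no heads at all), which is usually easier than enumerating every way of getting one or more heads.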

Probabilities aren't reserved for coins and dice. Anything uncertain has one, even if nobody's written it down:

Weather. "70% chance of rain tomorrow" = P(rain) = 0.7.
Spam filter. Every email gets a number; above a threshold, the message moves to the spam folder.
Medical tests. A positive result doesn't mean disease — it shifts a probability.
Sports. "The favorite is at 1.4 odds" = a bookmaker's implied probability of about 0.71.
LLMs. The next token cat after "The quick brown" has some probability the model computes.

Notation worth knowing up front. P(A) is the probability of event A. P(A and B) — sometimes written P(A ∩ B) or P(A, B) — is the probability that both happen. P(A or B) — sometimes written P(A ∪ B) — is the probability that at least one happens. And P(not A), written P(¬A) or P(Aᶜ), equals 1 − P(A) by construction.

In a Transformer: the output stage of every LLM (a final layer followed by softmax, covered in the next section) produces exactly one number in [0, 1] for each possible next token: tens of thousands of these, one per word fragment in the vocabulary. To generate the next token, the model picks one according to those numbers. Every word your favorite chatbot has ever produced started life as a single probability in that range. The next four sections are about how those numbers get combined, summed, conditioned, and ultimately learned from data.

Probability Distribution

A list of probabilities — one per outcome — that has to sum to 1.

A single coin gives two probabilities: P(heads) = 0.5 and P(tails) = 0.5. A six-sided die gives six: each face at 1/6 ≈ 0.167. A 100,000-token vocabulary gives 100,000. Whenever you bundle the probabilities of every possible outcome together, you get a probability distribution. It's the same idea — one number per outcome — but viewed all at once.

Every distribution obeys two rules, and they're both worth memorizing:

Each number sits in [0, 1] — every entry is a valid probability.
The numbers sum to exactly 1 — something has to happen.

The sum-to-1 rule is the whole game. It looks pedantic until you remember why it's true: if you list every outcome that could occur, then with probability 1 (i.e., for certain) one of them does. Half the manipulations you'll ever do with probabilities — normalizing scores into probabilities, splitting a probability across sub-cases, computing marginals — all come back to forcing the numbers to add to 1.

A toy distribution. Tomorrow's weather has four mutually exclusive outcomes:

  P(sunny)   = 0.55
  P(cloudy)  = 0.25
  P(rainy)   = 0.15
  P(snowy)   = 0.05
  ─────────────────
  total      = 1.00 ✓

One bar per outcome, height = probability. Fill them all in and the total lands at 1.

Four numbers, each in [0, 1], adding to 1. That's a distribution. You can read off any single-outcome probability, and you can also bundle outcomes: P(precipitation) = P(rainy) + P(snowy) = 0.20. Bundling is just adding because the four outcomes are mutually exclusive (it can't be both rainy and snowy on the same day in this model), so each outcome's probability is counted exactly once.
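As a rough sketch, the two rules and the bundling step for the weather example could be checked like this. The helper name is_distribution is made up for illustration.

```python
import math

weather = {"sunny": 0.55, "cloudy": 0.25, "rainy": 0.15, "snowy": 0.05}

def is_distribution(probs, tol=1e-9):
    """Check the two rules: every entry in [0, 1], and the total equals 1."""
    values = list(probs.values())
    in_range = all(0.0 <= p <= 1.0 for p in values)
    sums_to_one = math.isclose(sum(values), 1.0, abs_tol=tol)
    return in_range and sums_to_one

# Bundling mutually exclusive outcomes is plain addition.
p_precipitation = weather["rainy"] + weather["snowy"]

print(is_distribution(weather))    # True
print(round(p_precipitation, 2))   # 0.2
```

The sum is compared with a small tolerance rather than with ==, because decimal fractions such as 0.55 are not stored exactly in floating point.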

Two flavors of distribution show up everywhere; both follow the same two rules, just with different bookkeeping.

Discrete. Finitely (or countably) many outcomes. A coin, a die, a token from a vocabulary, a category label. The distribution is literally a list of numbers, and the sum is a regular sum.
Continuous. Outcomes form a continuum — a height in centimeters, a temperature, a time. Instead of a list, you get a density curve p(x); probabilities live in areas under that curve, and "sums to 1" becomes "total area equals 1". Same intuition, calculus-flavored notation.

The most famous continuous distribution is the normal (or Gaussian) — the bell curve you've seen on every test-score chart. It's described by just two numbers: where the peak sits (μ, the mean) and how wide the bell is (σ, the standard deviation). Adult heights, measurement errors, and the noise random initialization adds to neural network weights are all roughly normal.

On the discrete side, the LLM workhorse is the categorical distribution — exactly the four-weather-outcomes layout above, just much wider. Every time a language model picks the next token, it's sampling from a categorical distribution over the whole vocabulary.
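Sampling from a categorical distribution can be sketched with the standard library. The example below reuses the four weather outcomes in place of a real vocabulary; the seed and the number of draws are arbitrary choices made here for reproducibility.

```python
import random

weather = {"sunny": 0.55, "cloudy": 0.25, "rainy": 0.15, "snowy": 0.05}

rng = random.Random(0)                      # fixed seed, chosen arbitrarily
outcomes = list(weather)
weights = list(weather.values())

# Draw 10,000 samples; each draw picks one outcome according to its probability.
draws = rng.choices(outcomes, weights=weights, k=10_000)

# Empirical frequencies should land close to the probabilities we sampled from.
freq = {o: draws.count(o) / len(draws) for o in outcomes}
for o in outcomes:
    print(f"{o:7s} target={weather[o]:.2f} observed={freq[o]:.3f}")
```

A language model does the same thing with a much longer list: one weight per token, one draw per generated token.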

How do you get a distribution? Two main ways, both familiar:

From counts. Roll the die 6,000 times; if 1,003 of those land on a 4, your empirical estimate is P(4) ≈ 1003/6000 ≈ 0.167. Divide every count by the total and you get probabilities that automatically sum to 1.
From scores via softmax. A single formula — softmax(zᵢ) = exp(zᵢ) / Σⱼ exp(zⱼ) — that takes any list of real numbers and spits out a valid distribution. It's worth unpacking, because every LLM uses it.
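The counting route can be sketched as follows. Only the count for face 4 (1,003 out of 6,000) comes from the text; the other five counts are invented here so that the total is 6,000.

```python
# Hypothetical die-roll tallies; only the entry for face 4 is taken from the text.
counts = {1: 998, 2: 1010, 3: 987, 4: 1003, 5: 1012, 6: 990}

total = sum(counts.values())
probs = {face: c / total for face, c in counts.items()}

print(total)                          # 6000
print(round(probs[4], 3))             # 0.167
print(round(sum(probs.values()), 6))  # 1.0
```

Dividing by the total is what forces the sum to 1: the numerators add up to the denominator.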

What softmax does. Take a list of real-numbered scores (z₁, z₂, …, z_n) — positive, negative, doesn't matter. Out comes a list the same length, every entry in [0, 1], summing to exactly 1. A valid probability distribution, every time.

Why we need it. A neural network's last layer is just a weighted sum: it can emit −5.2, 17.8, anything. To treat those scores as probabilities you need to (a) make every entry non-negative and (b) make them sum to 1. You might try a simpler recipe: shift the scores so the minimum is zero, then divide by the total. But shifting destroys information: a score of −2 and a score of −10 both become "small positive" after the shift, even though one was vastly more confident than the other. And the moment all the scores are equal, every shifted score is zero and you divide by zero. Softmax sidesteps both problems by exponentiating first.

The recipe, in three lines. Given logits (z₁, z₂, …, z_n):

Exponentiate each: exp(zᵢ). The exponential of any real number is positive.
Sum them: S = Σⱼ exp(zⱼ).
Divide each by the sum: softmax(zᵢ) = exp(zᵢ) / S.

Worked example. After the prompt "The quick brown", an LLM scores three candidate continuations:

                fox       dog       hen
  logit       4.50      3.20      0.80
  exp        90.02     24.53      2.23
  ──────────────────────────────────────
                     sum = 116.78
  softmax     0.77      0.21      0.02   ✓ sums to 1.00

Look at what the exponentials did. The gap between fox and dog in logit-space is only 1.3, but after softmax fox ends up 3.7× more likely (exp(1.3) ≈ 3.7). The gap between dog and hen is 2.4 in logits and about 11× in probabilities (exp(2.4) ≈ 11.0). Exponentials amplify differences: a small numerical edge in logits becomes a large gap in probability. That's the design intent: turning a fuzzy relative ordering into a sharp probability landscape the model can sample from.

Three logits → exp → sum → probabilities. The exponentials turn small logit gaps into large probability gaps.
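The three-step recipe can be sketched in plain Python and run on the fox/dog/hen logits. One addition not in the text: the sketch subtracts the largest logit before exponentiating, a common stability trick that leaves the result unchanged because the shift cancels in the ratio.

```python
import math

def softmax(logits):
    """Exponentiate, sum, divide. The max is subtracted first for stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["fox", "dog", "hen"]
logits = [4.50, 3.20, 0.80]
probs = softmax(logits)

for tok, p in zip(tokens, probs):
    print(f"{tok}: {p:.2f}")
# fox: 0.77
# dog: 0.21
# hen: 0.02

# Probability ratios depend only on logit gaps: ratio = exp(gap).
print(round(probs[0] / probs[1], 1))  # 3.7, i.e. exp(1.3)
print(round(probs[1] / probs[2], 1))  # 11.0, i.e. exp(2.4)
```

Because only the gaps matter, adding the same constant to every logit gives the same distribution.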

In a Transformer: the final layer of an LLM produces one real-numbered score per vocabulary token — called a logit. Softmax turns the whole vector of logits — tens of thousands of numbers — into a single categorical distribution over the vocabulary in one line of code. That distribution is what the model samples from to produce the next token. Every chat response you've ever read is thousands of draws from thousands of these distributions, one per token.

Conditional Probability

The probability of A, given that B already happened.

Pick a random adult on Earth. The probability they own a smartphone is something like P(smartphone) ≈ 0.7. Now zoom in: pick a random adult who lives in Tokyo. That probability jumps to maybe 0.95. Same question, different starting pool — and the answer changes. The second number is a conditional probability, written P(smartphone | lives in Tokyo) — "given they live in Tokyo, the probability of smartphone."

The vertical bar is the entire notational trick. Read it as the word "given": everything after the bar is something you already know to be true; everything before the bar is what you're asking about under that assumption. P(A | B) is "given B, what's the probability of A?"

The mechanical definition has one line:

P(A | B) = P(A and B) / P(B)

It looks abstract; the picture is concrete. Out of everyone in the world, only the slice where B is true matters once you condition on B. Inside that slice, you ask: what fraction also has A? That fraction is P(A | B). The denominator P(B) rescales the slice back to a valid distribution that sums to 1; the numerator P(A and B) is the overlap.

Concrete example. Suppose, out of every 1,000 random adults:

  · own smartphone, live in Tokyo        =  19  ←  A and B
  · own smartphone, don't live in Tokyo  = 681
  · no smartphone,  live in Tokyo        =   1
  · no smartphone,  don't live in Tokyo  = 299
  ────────────────────────────────────────────
  total                                  = 1000

  P(smartphone)             = 700 / 1000 = 0.70
  P(lives in Tokyo)         =  20 / 1000 = 0.02   ← P(B)
  P(smartphone AND Tokyo)   =  19 / 1000 = 0.019  ← P(A and B)

  P(smartphone | Tokyo)     = 0.019 / 0.02 = 0.95

Box the Tokyo column, drop everything else, take the ratio inside that slice — that's conditional probability.

Notice what happened. P(smartphone) = 0.7 across everyone; but once we restrict to the 20 Tokyo residents (the "given Tokyo" slice), 19 of them own a smartphone, so the conditional probability is 0.95. Conditioning is a zoom-in operation: it throws away the rows where B is false, then renormalizes what's left.
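The zoom-in can be sketched directly from the table of counts. The dictionary layout and key names below are illustrative choices, not a standard format.

```python
# Counts per 1,000 adults, keyed by (phone status, residence), from the table above.
counts = {
    ("smartphone", "tokyo"): 19,
    ("smartphone", "elsewhere"): 681,
    ("none", "tokyo"): 1,
    ("none", "elsewhere"): 299,
}
total = sum(counts.values())

# P(B): keep only the rows where the person lives in Tokyo.
p_tokyo = sum(c for (_, place), c in counts.items() if place == "tokyo") / total

# P(A and B): the overlap.
p_phone_and_tokyo = counts[("smartphone", "tokyo")] / total

# P(A | B) = P(A and B) / P(B)
p_phone_given_tokyo = p_phone_and_tokyo / p_tokyo

print(round(p_tokyo, 3))              # 0.02
print(round(p_phone_given_tokyo, 2))  # 0.95
```

The division by p_tokyo is the renormalization step: it is equivalent to computing 19 / 20 inside the Tokyo slice.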

Two events are independent if conditioning doesn't change the answer — i.e., P(A | B) = P(A). Knowing B happened told you nothing about A. Two coin flips are independent: the second is 50/50 whether the first was heads or tails. Otherwise the events are dependent: knowing B shifts your belief about A. Almost every interesting real-world pair is dependent — which is exactly why conditional probability is the lever models pull on.

Rearranging the definition gives the chain rule, which you'll see constantly in ML papers:

P(A and B) = P(A | B) · P(B) = P(B | A) · P(A)

Two ways to compute the joint probability of A-and-B, depending on which conditional you find easier. Setting the two right-hand sides equal — and dividing both sides by P(B) — produces Bayes' theorem, P(A | B) = P(B | A) · P(A) / P(B), the formula that lets you flip a conditional around. Diagnostic-test problems, spam filtering, and the entire field of Bayesian inference all hinge on this one rearrangement.

A classic gotcha. P(A | B) and P(B | A) are not the same number. Sometimes both happen to be large: P(rain | wet sidewalk) is high, since most wet sidewalks come from rain, and P(wet sidewalk | rain) is also high (and may equal 1). But they can sit far apart. Compare P(disease | positive test) to P(positive test | disease): a test with 99% sensitivity (P(positive | disease) = 0.99) can still produce a result where most positives are false alarms if the disease is rare. Same two events, conditioned the other way, wildly different conclusions. Bayes' theorem is what reconciles them.
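To make the rare-disease case concrete, here is a small sketch of Bayes' theorem. The 99% sensitivity comes from the text; the prevalence (1 in 1,000) and the 5% false-positive rate are assumptions picked purely for illustration.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """Bayes' theorem: P(disease | positive test)."""
    # Total probability of a positive result, from sick and healthy people.
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

# Assumed numbers: prevalence 0.1%, false-positive rate 5%.
p = posterior(prior=0.001, sensitivity=0.99, false_positive_rate=0.05)
print(round(p, 3))  # 0.019
```

Under these assumed numbers, P(positive | disease) is 0.99 while P(disease | positive) is about 0.02: roughly 98 of every 100 positives are false alarms, simply because healthy people vastly outnumber sick ones.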

In a Transformer: every probability an LLM produces is conditional. When you ask it to continue "The capital of France is", the model is computing P(next token | the previous tokens). Read that bar carefully: the whole prompt sits to the right of the bar; the prediction is what's to the left. Training an LLM means estimating this enormous conditional distribution from an enormous text corpus: for every span of text in the training data, the model is asked "given everything you saw so far, what comes next?" Generation is the same trick run forward, one conditional at a time. An entire essay from a chatbot is a chain of conditionals, glued together by the chain rule above.

Expectation, Mean & Variance

Two numbers that summarize a whole distribution — where it sits and how far it spreads.

A distribution can have dozens, thousands, or infinitely many entries — way too many to eyeball. So we squash it down to a handful of summary numbers. The two you cannot escape are the mean (where the distribution sits, on average) and the variance (how much it spreads around that average).

Suppose you spin a wheel that pays out one of four amounts, with probabilities listed:

  payout  $0     $10    $50    $1000
  P       0.70   0.20   0.09   0.01

What payout do you "expect" per spin? You won't actually get this value on any single spin — you'll get one of the four. But if you spun this wheel a million times and averaged the payouts, the long-run average would converge to the expectation (also called the expected value or just the mean), written E[X]. The formula is "outcome times probability, summed":

E[X] = Σ xᵢ · P(xᵢ)

For our wheel:

  E[X] = 0·0.70 + 10·0.20 + 50·0.09 + 1000·0.01
       =  0     +  2      +  4.5    +  10
       = 16.5

Four contributions added in turn — the expectation is the weighted sum.

On average, you net $16.50 per spin — even though no single spin pays exactly that. The mean is a weighted average: each outcome is weighted by how often it happens. Outcomes you almost never see (like the $1000 jackpot) contribute only their small share; common outcomes (like $0) anchor the result.
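The "outcome times probability, summed" formula is a one-liner. A minimal sketch for the wheel (the function name expectation is just a label chosen here):

```python
payouts = [0, 10, 50, 1000]
probs = [0.70, 0.20, 0.09, 0.01]

def expectation(values, probabilities):
    """E[X] = sum of x_i * P(x_i)."""
    return sum(x * p for x, p in zip(values, probabilities))

mean = expectation(payouts, probs)
print(round(mean, 2))  # 16.5
```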

One word, three uses. People say mean, average, or expectation almost interchangeably, but it's worth noticing the shade of difference:

Mean of a list of numbers — the arithmetic average. Add them all up, divide by count. mean([2, 4, 9]) = 5.
Expectation of a distribution — what that arithmetic average converges to if you sampled from the distribution forever. Same idea, viewed through the lens of probability.
Mean of a sample — your empirical estimate of the underlying expectation, based on the actual draws you saw. It approaches the true expectation as the sample grows.

The mean tells you where the distribution sits. But it doesn't tell you how spread out it is. Compare two distributions of returns, both with mean 0: a savings account that pays 0 every day, and a roulette wheel that pays −100 or +100 with equal odds. Same mean. Wildly different lived experience. We need a second number to capture spread.

That number is the variance, written Var(X). The idea: measure how far each outcome deviates from the mean, square those deviations (to make them positive and to weight big deviations harder than small ones), then take the expectation of those squared deviations:

Var(X) = E[(X − μ)²]

Here μ (Greek letter "mu") is just shorthand for the mean E[X] — statistics writes the mean as μ for brevity. For our spinning wheel (μ = 16.5):

  Var(X) = (0 − 16.5)²·0.70
         + (10 − 16.5)²·0.20
         + (50 − 16.5)²·0.09
         + (1000 − 16.5)²·0.01
         = 190.575 + 8.45 + 101.0025 + 9672.7225
         = 9972.75

A large variance, dominated by the rare $1000 outcome. Variance has the awkward property of being in squared units (dollars², in this case), so we usually report its square root — the standard deviation σ = √Var(X). For the wheel, σ ≈ $99.86. That number lives in the same units as the mean, so you can say things like "the typical payout sits at $16.50 ± $99.86" and the comparison is meaningful.
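The variance and standard deviation of the wheel can be reproduced with a short sketch that follows the definition Var(X) = E[(X − μ)²] term by term:

```python
import math

payouts = [0, 10, 50, 1000]
probs = [0.70, 0.20, 0.09, 0.01]

mu = sum(x * p for x, p in zip(payouts, probs))

# Each term is a squared deviation from the mean, weighted by its probability.
terms = [(x - mu) ** 2 * p for x, p in zip(payouts, probs)]
variance = sum(terms)
sigma = math.sqrt(variance)

print(round(variance, 2))             # 9972.75
print(round(sigma, 2))                # 99.86
print(round(terms[-1] / variance, 2)) # 0.97, the jackpot's share of the variance
```

The last line shows how lopsided the total is: the 1%-probability jackpot accounts for about 97% of the variance.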

Three useful facts that show up everywhere:

Linearity of expectation. E[X + Y] = E[X] + E[Y], always. Even if X and Y are tangled together — no independence needed. The most surprisingly powerful identity in introductory probability.
Variance and constants. Var(aX) = a² · Var(X) — variance scales by the square. Doubling every payout quadruples the variance.
Variance of a sum (independent case). Var(X + Y) = Var(X) + Var(Y) when X and Y are independent — variances add. This is why averaging many independent measurements shrinks the noise: Var(avg of n) = σ² / n, so σ shrinks like 1/√n.
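These three facts can be checked exactly on the wheel by enumerating two independent spins, with no simulation needed. This is a sketch; the helper mean_var and the pairing scheme are illustrative. Note that it only exercises linearity in the independent case, although the identity itself holds without independence.

```python
from itertools import product

payouts = [0, 10, 50, 1000]
probs = [0.70, 0.20, 0.09, 0.01]

def mean_var(dist):
    """dist is a list of (value, probability) pairs; returns (mean, variance)."""
    mu = sum(x * p for x, p in dist)
    var = sum((x - mu) ** 2 * p for x, p in dist)
    return mu, var

single = list(zip(payouts, probs))
mu, var = mean_var(single)

# Var(aX) = a^2 Var(X): double every payout.
_, var_doubled = mean_var([(2 * x, p) for x, p in single])

# Two independent spins: the probability of a pair is the product of the two.
pairs = [((x, y), px * py) for (x, px), (y, py) in product(single, repeat=2)]
mu_sum, var_sum = mean_var([(x + y, p) for (x, y), p in pairs])
_, var_avg = mean_var([((x + y) / 2, p) for (x, y), p in pairs])

print(round(var_doubled / var, 6))  # 4.0  variance scales by a^2
print(round(mu_sum / mu, 6))        # 2.0  E[X + Y] = E[X] + E[Y]
print(round(var_sum / var, 6))      # 2.0  variances add when independent
print(round(var_avg / var, 6))      # 0.5  Var(avg of n) = sigma^2 / n with n = 2
```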

In a Transformer: the loss function used to train every LLM, cross-entropy, is mechanically an expectation. We average a per-token loss across the whole training corpus, and gradient descent minimizes that average. Variance shows up twice: LayerNorm rescales each token vector to zero mean and unit variance to keep activations stable; and the noise that gradient descent sees from each mini-batch has its own variance, which is why bigger batches give smoother training curves (averaging n samples shrinks the noise's standard deviation by a factor of √n). Three of the central tricks in modern ML (minimizing an average loss, normalizing activations, batching gradients) are all applications of mean and variance.

Log Probability

Probabilities are tiny and they multiply. Logs are bigger numbers and they add.

Probabilities, in practice, have a problem: they get small fast. The chance of any one particular sentence appearing in the wild is something like 10⁻³⁰, give or take. The chance of a 100-word paragraph? 10⁻³⁰⁰. That is already at the edge of what a 64-bit float can hold: the smallest positive normal `double` is about 2.2 × 10⁻³⁰⁸, subnormal values reach down to roughly 4.9 × 10⁻³²⁴ with reduced precision, and below that everything rounds to zero. Multiply in a few more probabilities and you run straight off that cliff.

The fix is to take the logarithm. The log of a number tells you "what exponent of e (about 2.718) gives this number?" — log(0.5) ≈ −0.693, log(0.01) ≈ −4.605, log(10⁻³⁰⁰) ≈ −690.78. Tiny positive probabilities become moderate negative numbers, easy to store and compute with.

The same three probabilities clump on a linear axis; log spreads them out so the gaps are visible.
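The underflow cliff is easy to reproduce. In the sketch below, the per-token probability of 10⁻⁴ and the sequence length of 100 are assumed values picked so that the true product, 10⁻⁴⁰⁰, lies below what a 64-bit float can represent.

```python
import math

p = 1e-4          # assumed per-token probability
n_tokens = 100    # assumed sequence length

running_product = 1.0
log_sum = 0.0
for _ in range(n_tokens):
    running_product *= p       # heads toward 1e-400, which a double cannot hold
    log_sum += math.log(p)     # just adds about -9.21 each step

print(running_product)         # 0.0
print(round(log_sum, 2))       # -921.03
```

The product silently becomes 0.0, with no error or warning, while the log-space sum keeps all the information.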

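Three properties of logs make them perfect for probabilities. The first comes straight from the rule log(a · b) = log(a) + log(b); the other two come from log being a steadily increasing function that stretches out the region near zero: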

Multiplication becomes addition. The joint probability of N independent events is p₁ · p₂ · … · p_N — a product that underflows. The log of that joint probability is log p₁ + log p₂ + … + log p_N — a sum, which stays well-behaved no matter how many terms you pile on.
Order is preserved. Log is monotonically increasing — p > q if and only if log p > log q. So if you only need to compare or rank probabilities (which token is most likely? which class scores highest?), comparing logs gives identical answers. Whatever you were going to maximize, you can maximize its log instead.
Numerical range collapses. Probabilities span [0, 1], but most of the interesting action sits near 0 — 0.001 and 0.0001 are visually similar yet differ by a factor of 10 in likelihood. After log, those become −6.91 and −9.21, and the difference is plainly visible. Gradient-based optimizers see a much cleaner landscape.

Convention: natural log (base e), written log or ln, is the default in ML. Some textbooks use base 2 (so the units come out as bits) or base 10, but the qualitative behavior is the same — choosing a base is just scaling all logs by a constant.

A worked example. Compute the probability of the sentence "the cat sat" under a tiny language model, by chaining conditional probabilities (from the Conditional Probability section):

  P("the cat sat")
    = P(the) · P(cat | the) · P(sat | the cat)
    = 0.04   · 0.001        · 0.0002
    = 0.000000008   ← already painful; for 100 tokens, hopeless.

  log P("the cat sat")
    = log(0.04) + log(0.001) + log(0.0002)
    = −3.22     + −6.91      + −8.52
    = −18.65        ← three numbers added; never underflows.

Same information, two scales. The product is already down at 8 × 10⁻⁹ after three tokens; the sum is a comfortable −18.65. For a 100-token sequence the product can underflow before you finish; the sum just keeps growing into the negative hundreds. Every LLM scoring routine, every perplexity computation, every beam-search ranker works in log space for exactly this reason.
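The worked example can be reproduced in a few lines. This sketch uses math.prod, which needs Python 3.8 or later.

```python
import math

# P(the), P(cat | the), P(sat | the cat), as in the worked example.
step_probs = [0.04, 0.001, 0.0002]

prob = math.prod(step_probs)
log_prob = sum(math.log(p) for p in step_probs)

print(f"{prob:.1e}")        # 8.0e-09
print(round(log_prob, 2))   # -18.64

# Exponentiating the log-probability recovers the product.
print(math.isclose(math.exp(log_prob), prob))  # True
```

The unrounded sum is about −18.644. Adding the three terms after rounding each to two decimals gives −18.65, which is why a hand calculation differs slightly in the last digit.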

Two related quantities you'll meet immediately as you read ML papers:

Negative log-likelihood (NLL). Just −log P. We flip the sign so that "more likely" corresponds to a smaller number — which optimizers prefer (they all minimize). Maximizing likelihood and minimizing NLL are the same procedure.
Cross-entropy loss. The training objective of every LLM: L = −Σ log P(correct token | preceding tokens), averaged over the corpus. Sum of negative log-probs of the right answer, with the model's parameters as the knobs. It's NLL by another name, but the term you'll see in every paper.
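A minimal sketch of the averaged negative log-likelihood follows. The two lists of probabilities are hypothetical: they stand for the probability each of two imaginary models assigned to the correct token at three positions.

```python
import math

def cross_entropy(correct_token_probs):
    """Average negative log-probability assigned to the correct tokens."""
    nll = [-math.log(p) for p in correct_token_probs]
    return sum(nll) / len(nll)

good_model = [0.9, 0.8, 0.95]   # hypothetical: confident and right
poor_model = [0.2, 0.1, 0.3]    # hypothetical: mostly wrong

print(round(cross_entropy(good_model), 3))  # 0.127
print(round(cross_entropy(poor_model), 3))  # 1.705
```

The loss is 0 only if every correct token gets probability 1, and it grows without bound as any of those probabilities approaches 0.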

One more pattern: log-sum-exp. Summing a list of probabilities in log space (say, to combine the log-probs of several reasoning paths into one overall probability) needs care, because log(p + q) ≠ log p + log q. The trick is log(Σ exp(log pᵢ)), implemented in numerically stable form by the major frameworks as logsumexp. Softmax (from the Probability Distribution section) is just one application of it: log softmax(zᵢ) = zᵢ − logsumexp(z).
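A hand-rolled version shows what the stable form does. This is a sketch of the standard max-shift trick, not the code of any particular framework.

```python
import math

def logsumexp(log_values):
    """Stable log(sum(exp(v))): factor out the max before exponentiating."""
    m = max(log_values)
    if m == float("-inf"):      # every probability is zero
        return m
    return m + math.log(sum(math.exp(v - m) for v in log_values))

# Combine two probabilities given as logs: 0.001 + 0.0001 = 0.0011.
log_p = [math.log(0.001), math.log(0.0001)]
print(round(math.exp(logsumexp(log_p)), 4))   # 0.0011

# Log-probs this negative underflow to 0.0 under a naive exp, so the
# naive log(exp(a) + exp(b)) would fail; the shifted version is fine.
tiny = [-1000.0, -1001.0]
print(round(logsumexp(tiny), 4))              # -999.6867
```

After the shift, the largest term is exp(0) = 1, so the sum inside the log is always at least 1 and can neither underflow to zero nor overflow.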

In a Transformer: the model outputs logits, and what we actually feed into the loss is log_softmax(logits), that is, log-probabilities. The training loop computes cross_entropy = −mean(log P(correct token)) and backpropagates that. At inference time, beam search ranks candidate continuations by their cumulative log-probability, adding log-probs as it extends each candidate one token at a time. Every probability you'd normally write P(x) in this primer is, in the actual training and inference code, almost certainly stored and manipulated as log P(x). The math is identical; the floating-point register just stays alive.