Transformer Forward Pass

01 / The one-sentence essence

The whole transformer forward pass in one timeline — a chef hearing a customer say "light and tangy" and replying "shrimp ceviche with lime". The 6 phases up top mark where in the chef's brain we currently are: trained knowledge, tokenizing the order, embedding + position, attention, decoding the first reply word, and KV-cached generation of the rest.
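The phases named above can be sketched end to end in a few dozen lines. This is a toy under loud assumptions, not the page's actual model: the weights are seeded random numbers instead of trained values (so the "reply" is arbitrary), there is a single attention head and a single layer, and layer norm and the MLP block are left out. All names (`step`, `generate`, `tok_emb`, `w_q`, and so on) are hypothetical.

```python
import math
import random

VOCAB = ["light", "and", "tangy", "shrimp", "ceviche", "with", "lime"]
DIM = 4
rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Phase 01: "trained" weights. Random stand-ins here (assumption), frozen during generation.
tok_emb = rand_matrix(len(VOCAB), DIM)   # one row per vocabulary word
pos_emb = rand_matrix(16, DIM)           # one row per position, up to 16 tokens
w_q, w_k, w_v = rand_matrix(DIM, DIM), rand_matrix(DIM, DIM), rand_matrix(DIM, DIM)
w_out = rand_matrix(DIM, len(VOCAB))     # maps a hidden vector to one score per word

def matvec(x, w):
    """Multiply row vector x by matrix w (len(x) rows, any number of columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def step(token_id, position, cache):
    """Process one token and return a score (logit) for every vocabulary word."""
    # Phase 03: token embedding plus position embedding.
    x = [t + p for t, p in zip(tok_emb[token_id], pos_emb[position])]
    q = matvec(x, w_q)
    # KV cache: each token's key and value are computed once and then reused.
    cache["k"].append(matvec(x, w_k))
    cache["v"].append(matvec(x, w_v))
    # Phase 04: scaled dot-product attention over every token seen so far.
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(DIM) for k in cache["k"]]
    weights = softmax(scores)
    attended = [sum(w * v[j] for w, v in zip(weights, cache["v"])) for j in range(DIM)]
    hidden = [a + b for a, b in zip(x, attended)]  # residual connection
    return matvec(hidden, w_out)

def generate(prompt, n_new):
    cache = {"k": [], "v": []}
    ids = [VOCAB.index(word) for word in prompt.split()]  # Phase 02: tokenize
    for position, token_id in enumerate(ids):
        logits = step(token_id, position, cache)
    reply = []
    for _ in range(n_new):
        # Phase 05: greedy decoding picks the highest-scoring word.
        next_id = max(range(len(VOCAB)), key=lambda i: logits[i])
        reply.append(VOCAB[next_id])
        # Phase 06: only the new token is processed; earlier keys/values come from the cache.
        logits = step(next_id, len(cache["k"]), cache)
    return reply

print(generate("light and tangy", 4))
```

Because the cache already holds one key and one value per earlier token, each generation step does work for a single new token rather than re-running the whole sequence.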

Walkthrough: customer "light and tangy" → chef "shrimp ceviche with lime"

Trained → Tokenize

01 · vocabulary: the words the chef knows

02 · token embeddings: his feel for each word (light, and, tangy, shrimp, ceviche, with, lime)

03 · personal style: top 5 first replies after "light and tangy" (shrimp, fish, ceviche, try, our)

Picture the model as a chef who's spent years training. Before any customer walks in, three things already sit in his head: a vocabulary of food concepts, a feel for each one, and his personal style. These won't change during service; everything in Phases 02–06 is just the chef using them to answer one specific customer.
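As a rough illustration of those three pieces, the sketch below stores a vocabulary and an embedding table as read-only tuples and stands in for "personal style" with a list of scores over possible first reply words. Every number is invented for illustration, the ten-word vocabulary is an assumption, and the helper names (`tokenize`, `embed`, `top_k`) are hypothetical; in a real model the scores come out of the trained weights rather than a hand-written list.

```python
import math

# Trained knowledge, fixed before any customer arrives (all values are made up).
VOCAB = ("light", "and", "tangy", "shrimp", "fish", "ceviche", "try", "our", "with", "lime")
TOKEN_ID = {word: i for i, word in enumerate(VOCAB)}

# One short vector per word; real tables are learned and far wider.
EMBEDDINGS = tuple((0.1 * i, -0.05 * i, 0.02 * (i % 3)) for i in range(len(VOCAB)))

def tokenize(order):
    """Turn the customer's order into token ids by vocabulary lookup."""
    return [TOKEN_ID[word] for word in order.lower().split()]

def embed(token_ids):
    """Read the embedding row for each token id; nothing is written back."""
    return [EMBEDDINGS[i] for i in token_ids]

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(scores, k=5):
    """Return the k most likely next words with their probabilities."""
    probs = softmax(scores)
    ranked = sorted(zip(VOCAB, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical scores the trained weights might assign after "light and tangy".
first_reply_scores = [-2.0, -3.0, -2.5, 3.0, 2.2, 1.8, 1.0, 0.6, -1.0, 0.0]

print(tokenize("light and tangy"))  # [0, 1, 2]
for word, prob in top_k(first_reply_scores):
    print(f"{word}: {prob:.2f}")
```

Tuples are used so that the "won't change during service" property is visible in the code: serving a customer only reads from `VOCAB` and `EMBEDDINGS`.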


02 / Further Reading

video: Let's build GPT: from scratch, in code, spelled out (Andrej Karpathy). Two-hour walkthrough that codes everything you just watched, in PyTorch.

post: The Illustrated Transformer (Jay Alammar). Diagrams covering the same five-stage path. Read after watching.

code: nanoGPT, the canonical reference implementation (Karpathy). ~300 lines of PyTorch implementing what you just animated.