Transformer Forward Pass

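01 / The Essence in One Sentence

The entire transformer forward pass runs on a single timeline: a chef hears a guest say "light and tangy" and recommends "shrimp ceviche with lime" on the spot. The six phases along the top mark which part of the chef's brain we are in at each moment: learned knowledge, tokenizing the guest's words, adding positional encoding, running attention, decoding the first response word, and using the KV cache to speed up the rest of the generation.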
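The six phases can be compressed into a short, runnable sketch. This is a deliberately simplified illustration, not the page's actual model: it uses a single attention head with no residual connections, layer norm, or MLP, and every name in it (`step`, `generate`, `W_q`, `W_out`, the width `D = 4`) is hypothetical. The weights are random, so the generated words are meaningless; only the flow of data is the point.

```python
import math
import random

rng = random.Random(0)
D = 4  # assumed toy model width

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(vec, mat):
    """Row vector (length = rows) times a rows x cols matrix."""
    return [sum(v * mat[i][j] for i, v in enumerate(vec)) for j in range(len(mat[0]))]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Phase 1: learned, frozen parameters (random placeholders in this sketch).
vocab = ["light", "and", "tangy", "shrimp", "ceviche", "with", "lime"]
tok_emb = rand_matrix(len(vocab), D)
W_q, W_k, W_v = rand_matrix(D, D), rand_matrix(D, D), rand_matrix(D, D)
W_out = rand_matrix(D, len(vocab))

def positional_encoding(pos):
    """Sinusoidal encoding: sin on even dimensions, cos on odd ones."""
    return [
        math.sin(pos / 10000 ** (i / D)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / D))
        for i in range(D)
    ]

def step(token_id, pos, cache):
    """Process one token and return next-token probabilities."""
    # Phase 3: token embedding plus positional encoding.
    x = [e + p for e, p in zip(tok_emb[token_id], positional_encoding(pos))]
    q, k, v = matvec(x, W_q), matvec(x, W_k), matvec(x, W_v)
    # Phase 6: keep this token's key and value so later steps can reuse them.
    cache["k"].append(k)
    cache["v"].append(v)
    # Phase 4: attention over every cached position (causal by construction).
    scores = [sum(a * b for a, b in zip(q, kk)) / math.sqrt(D) for kk in cache["k"]]
    weights = softmax(scores)
    ctx = [sum(w * vv[j] for w, vv in zip(weights, cache["v"])) for j in range(D)]
    # Phase 5: project to vocabulary scores and normalize.
    return softmax(matvec(ctx, W_out))

def generate(prompt, n_new):
    cache = {"k": [], "v": []}
    ids = [vocab.index(t) for t in prompt.split()]  # Phase 2: toy tokenizer
    for pos, tid in enumerate(ids):
        probs = step(tid, pos, cache)
    out = []
    for _ in range(n_new):
        next_id = max(range(len(vocab)), key=lambda i: probs[i])  # greedy pick
        out.append(vocab[next_id])
        probs = step(next_id, len(cache["k"]), cache)
    return out, cache

tokens, cache = generate("light and tangy", 4)
print(tokens, len(cache["k"]))
```

Because each call to `step` appends one key and one value and then attends only over what is already cached, generating a new word never recomputes keys or values for earlier words. That reuse is the speedup the KV cache phase refers to.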

Full pipeline: the guest says "light and tangy" → the chef replies "shrimp ceviche with lime". Phase: Learned → Tokenize

01 · Vocabulary: every word the chef knows


02 · Word embeddings: the impression each word leaves in his mind

Words shown, each drawn with its own row of embedding values: light, and, tangy, shrimp, ceviche, with, lime
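A vocabulary plus an embedding table amounts to a lookup: each word maps to an integer id, and each id selects one row of numbers. The sketch below illustrates that with the seven words shown here. The width `EMBED_DIM = 8`, the random values, the whitespace tokenizer, and the names `token_to_id` and `embed` are all assumptions made for illustration; a trained model learns its embedding values and uses a real tokenizer.

```python
import random

# Hypothetical toy vocabulary: the seven words shown above, mapped to integer ids.
vocab = ["light", "and", "tangy", "shrimp", "ceviche", "with", "lime"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

EMBED_DIM = 8  # assumed width, chosen only for illustration

# One row of numbers per word. Real models learn these values during training;
# here they are random placeholders.
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for _ in vocab
]

def embed(text):
    """Split on whitespace (a stand-in for a real tokenizer) and look up each row."""
    ids = [token_to_id[tok] for tok in text.split()]
    return ids, [embedding_table[i] for i in ids]

ids, vectors = embed("light and tangy")
print(ids)  # [0, 1, 2]
```

The lookup does no arithmetic at all, which is why the table can be treated as fixed knowledge that the later phases read from.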

03 · Personal style: top 5 first words after "light and tangy"

First token after "light and tangy", top 5 candidates: shrimp, fish, ceviche, try, our
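A top-5 list like this one typically comes from turning the model's raw scores (logits) into probabilities with softmax and keeping the highest few. The sketch below shows that step. The logit values are invented for illustration, since the page does not preserve the actual probabilities; they are chosen only so the ranking matches the list, on the assumption that the list is in descending order. The helper names `softmax` and `top_k` are hypothetical.

```python
import math

# Hypothetical logits for the first reply token. The numbers are made up;
# only their ordering mirrors the top-5 list above.
logits = {
    "shrimp": 3.0, "fish": 2.5, "ceviche": 2.0, "try": 1.5,
    "our": 1.0, "lime": 0.2, "with": -0.5,
}

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - m) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def top_k(probs, k):
    """Return the k most probable (token, probability) pairs, highest first."""
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]

probs = softmax(logits)
for tok, p in top_k(probs, 5):
    print(f"{tok:8s} {p:.3f}")
```

Greedy decoding would simply take the first entry; sampling strategies draw from the distribution instead.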

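Think of the model as a chef with decades of practice. Before any guest walks in, he already carries three things in his head: the words he knows (ingredients, flavors, dish names), an "impression" of each word, and his own personal style. None of the three changes while he is serving; in Phases 02 to 06 he is simply using them to respond to one particular guest.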


02 / Further Reading

video · Let's build GPT: from scratch, in code, spelled out (Andrej Karpathy). Two-hour walkthrough that codes everything you just watched, in PyTorch.

post · The Illustrated Transformer (Jay Alammar). Diagrams covering the same five-stage path. Read after watching.

code · nanoGPT, the canonical reference implementation (Karpathy). ~300 lines of PyTorch implementing what you just animated.