Word Embeddings Primer
A neural network only consumes numbers, but text is symbols. Bridging that gap took the field years. Four short topics tell the story: one-hot encoding, the obvious first attempt that doesn't scale; word embeddings (Word2Vec, GloVe), the dense-vector idea that made every modern NLP system possible; the famous king − man + woman ≈ queen analogy that made everyone fall in love with the idea; and the limit of static embeddings — every word gets one vector, regardless of context — which is exactly what the next primer (the Transformer) finally fixes.
One-Hot Encoding
The obvious first attempt — and the one nobody uses any more.
A neural network only consumes numbers. Text is symbols. The cheapest possible way to bridge that is one-hot encoding: pick a vocabulary of size V, then represent each word as a length-V vector with a single 1 at the word's index and 0s everywhere else. Bag-of-words and TF-IDF (the text primer §1 and §2) are basically aggregates of these vectors.
Pick a tiny vocab — ["the", "cat", "sat", "on", "mat"] — and the encoding looks like:
the = [1, 0, 0, 0, 0]
cat = [0, 1, 0, 0, 0]
sat = [0, 0, 1, 0, 0]
on  = [0, 0, 0, 1, 0]
mat = [0, 0, 0, 0, 1]
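The table above can be reproduced in a few lines of Python. This is a minimal sketch; the helper name `one_hot` is illustrative and not taken from any particular library.

```python
def one_hot(word, vocab):
    """Return a length-V list with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

vocab = ["the", "cat", "sat", "on", "mat"]

# Encode every vocabulary word; each vector has exactly one nonzero entry.
encoded = {word: one_hot(word, vocab) for word in vocab}

print(encoded["sat"])  # [0, 0, 1, 0, 0]
```

The vector length equals the vocabulary size, so every new vocabulary entry adds one more dimension to every vector.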
What's wrong with this? Three things, in order of severity:
- Sparse and huge. A GPT-4-class vocabulary has roughly 100,000 tokens. Each word is a 100,000-dimensional vector with one nonzero entry. Almost all of every vector is wasted storage. Frameworks support sparse representations, but the embedding layer is still a giant V × d_embed table; doing one-hot times this table is just an index lookup, which means we never actually construct the one-hot vector in modern code. Conceptually, though, that's what each token is.
- No similarity structure. The cosine similarity between any two different one-hot vectors is exactly 0. "Cat" and "kitten" sit at the same mathematical distance from each other as "cat" and "calculator." The model has no way to know that two words are related until it has seen each one many times in similar contexts.
- No generalization. If you train a model on a million sentences containing "dog" and zero containing "puppy," the model knows everything about "dog" and nothing about "puppy" — even though the two words mean nearly the same thing. Each word has to be learned from scratch.
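The "no similarity structure" point is easy to check numerically. The sketch below assumes a hypothetical four-word vocabulary and a hand-written `cosine` helper; any two distinct one-hot vectors come out orthogonal.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vocabulary, chosen only for illustration.
vocab = ["cat", "kitten", "calculator", "dog"]

def one_hot(word):
    return [1 if w == word else 0 for w in vocab]

sim_cat_kitten = cosine(one_hot("cat"), one_hot("kitten"))
sim_cat_calculator = cosine(one_hot("cat"), one_hot("calculator"))
sim_cat_cat = cosine(one_hot("cat"), one_hot("cat"))

print(sim_cat_kitten, sim_cat_calculator, sim_cat_cat)  # 0.0 0.0 1.0
```

A related word and an unrelated word score the same, so the representation itself carries no hint about which words belong together.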
Bag-of-words and TF-IDF inherit all three problems and add their own: complete loss of word order (text primer §2). The fix that broke the field open isn't about aggregating one-hots better — it's about replacing the one-hot itself with a small dense vector. That's §2.
In a Transformer: the first thing a token does on entering a Transformer is hit an embedding lookup table — a (vocab_size, d_model) matrix where each row is the (learned, dense) embedding for one token. The "one-hot vector times embedding matrix" interpretation is still mathematically true, but in practice the framework just gathers the right rows. Either way: one-hot has been promoted to a thin layer of indexing, and what flows into the rest of the model is the dense vector from the next section.
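The equivalence between "one-hot vector times embedding matrix" and a plain row lookup can be sketched with a tiny table. The values and the 3 × 2 shape below are made up for illustration; real frameworks perform the gather directly.

```python
# Hypothetical embedding table: vocab_size = 3 rows, d_model = 2 columns.
embedding_table = [
    [0.1, 0.3],
    [0.7, -0.2],
    [-0.5, 0.4],
]

def one_hot(index, size):
    return [1 if i == index else 0 for i in range(size)]

def vec_times_matrix(vec, matrix):
    """Multiply a row vector (length V) by a V x d matrix."""
    n_cols = len(matrix[0])
    return [
        sum(vec[i] * matrix[i][j] for i in range(len(matrix)))
        for j in range(n_cols)
    ]

token_id = 1
via_matmul = vec_times_matrix(one_hot(token_id, 3), embedding_table)
via_lookup = embedding_table[token_id]  # what frameworks actually do

print(via_matmul, via_lookup)
```

The multiplication touches every row only to zero out all but one of them, which is why the lookup form is used in practice.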
Word Embeddings — Word2Vec & GloVe
Dense vectors learned so that similar words sit near each other.
What if every word were a small dense vector — say 300 dimensions of real numbers — instead of a giant sparse one-hot? And what if those vectors were learned from how words appear in real text, so that similar words ended up in similar positions? That's the word embedding idea, and it was a turning point.
Word2Vec (Mikolov et al., 2013) was the recipe that made it famous. Train a small neural network to predict context from a center word (skip-gram) or center from context (CBOW). The network has one hidden layer of size d_embed. Throw away the prediction head after training; the input-to-hidden weight matrix is the embedding table, with one row of d_embed numbers per word in the vocabulary.
Skip-gram, for word at position t:
  input = embedding[w_t]
  for each context position w_{t±j} in a small window:
    predict log-probability of w_{t±j} given w_t
The hidden layer is shared. The prediction signal pushes words used in similar contexts toward similar embeddings.
GloVe (Pennington et al., 2014) reached the same destination by a different route: factor the global word-word co-occurrence matrix to find low-rank vectors that approximate it. Different math, similar embeddings in practice.
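To make the skip-gram setup concrete, here is a minimal sketch of its two ingredients: generating (center, context) pairs from a window, and scoring one pair as a log-probability. It makes two assumptions that are not from the text: it uses a full softmax over a three-word output vocabulary (practical Word2Vec training typically replaces this with negative sampling or hierarchical softmax), and the vectors are arbitrary untrained numbers. The names `skipgram_pairs` and `log_prob` are illustrative.

```python
import math

def skipgram_pairs(tokens, window=2):
    """Return (center, context) training pairs for every position t."""
    pairs = []
    for t, center in enumerate(tokens):
        start = max(0, t - window)
        stop = min(len(tokens), t + window + 1)
        for j in range(start, stop):
            if j != t:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

def log_prob(center_vec, output_vecs, context_word):
    """log p(context | center) under a softmax over dot-product scores."""
    scores = {
        word: sum(a * b for a, b in zip(center_vec, vec))
        for word, vec in output_vecs.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    log_z = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return scores[context_word] - log_z

sentence = ["the", "cat", "sat", "on", "the", "mat"]
pairs = skipgram_pairs(sentence, window=1)
print(pairs[:3])  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]

# Hypothetical untrained vectors, only to show the scoring step.
input_vecs = {"cat": [0.5, -0.1]}
output_vecs = {"the": [0.2, 0.1], "sat": [0.4, 0.0], "mat": [-0.3, 0.2]}
print(log_prob(input_vecs["cat"], output_vecs, "sat"))
```

Training would adjust the vectors to raise this log-probability for pairs that actually occur, which is the signal that pulls words with shared contexts together.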
Both methods exploit the same linguistic fact, often attributed to John Firth: "You shall know a word by the company it keeps." Words appearing in similar contexts tend to mean similar things; if your loss function rewards predicting one from the other, the resulting vectors capture that similarity.
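The "company it keeps" idea can be seen directly in raw co-occurrence counts, the kind of statistic GloVe starts from. The sketch below is a simplification under stated assumptions: a two-sentence made-up corpus, a symmetric window, and unweighted counts (GloVe itself down-weights more distant neighbors). The function names are illustrative.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each (word, neighbor) pair occurs within the window."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

# Hypothetical two-sentence corpus.
corpus = [
    ["the", "cat", "chased", "the", "mouse"],
    ["the", "kitten", "chased", "the", "mouse"],
]
counts = cooccurrence_counts(corpus, window=1)

def context_profile(word):
    """Neighbor -> count mapping for one word."""
    return {ctx: n for (w, ctx), n in counts.items() if w == word}

print(context_profile("cat"))
print(context_profile("kitten"))
```

"cat" and "kitten" never appear together here, yet their context profiles are identical, which is exactly the regularity both methods turn into nearby vectors.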
The properties that made word embeddings the default representation for ~5 years:
- Dense. Typical sizes: 100 to 300 dimensions. Every component is meaningful and nonzero. Compare to 100,000-D one-hot — orders of magnitude less memory.
- Similar words → similar vectors. cosine(v("cat"), v("kitten")) is large; cosine(v("cat"), v("calculator")) is small. A downstream classifier that sees "cat" learns something about "kitten" for free.
- Generalization. A model trained on "the cat sat" can sometimes handle "the kitten sat" reasonably, even if "kitten" never appeared in the downstream task's training data, because the pre-trained kitten vector is near the cat vector.
- Transferable. Train embeddings once on a giant text corpus (Wikipedia, the web), download them, plug them into your task. They became one of the first pre-trained artifacts in NLP, the philosophical ancestor of all modern "pre-train then fine-tune" workflows.
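The "similar words, similar vectors" property can be illustrated with cosine similarity and a nearest-neighbor query. The three-dimensional vectors below are hand-made for the example, not real Word2Vec or GloVe output, and `most_similar` is an illustrative helper.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical dense vectors; real embeddings have 100 to 300 dimensions.
vectors = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.9, 0.05],
    "calculator": [0.1, 0.0, 0.95],
}

def most_similar(word, table):
    """Rank every other word by cosine similarity to the query word."""
    others = [w for w in table if w != word]
    return sorted(others, key=lambda w: cosine(table[word], table[w]), reverse=True)

sim_cat_kitten = cosine(vectors["cat"], vectors["kitten"])
sim_cat_calculator = cosine(vectors["cat"], vectors["calculator"])

print(round(sim_cat_kitten, 3), round(sim_cat_calculator, 3))
print(most_similar("cat", vectors))
```

With pre-trained embeddings the same query is what powers "find related words" features; only the table changes.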
Things to know about the practical side. Word2Vec/GloVe embeddings tend to encode biases straight from training data (gendered occupation associations, racial stereotypes, etc.) because the training data itself has those biases. The NLP literature of the mid-to-late 2010s has many papers about debiasing techniques. The problem didn't go away with contextual embeddings; if anything it got more subtle.
In a Transformer: the (vocab_size, d_model) embedding table at the start of a modern Transformer is functionally a learned Word2Vec-like thing — except trained jointly with the rest of the model on the language-modeling loss, not as a separate pre-step. The vectors that come out of that table become the input to attention. Word2Vec's ideas live on; only the training recipe has moved.
king − man + woman ≈ queen
The intuition that made everyone fall in love with word embeddings.
Word2Vec / GloVe embeddings encode something stranger than "similar words land near each other." They encode directions. The most famous example, almost a meme by now:
v(king) − v(man) + v(woman) ≈ v(queen)
Take the vector for "king", subtract the vector for "man", add the vector for "woman" — you land almost exactly on the vector for "queen". The vector from "man" to "woman" is a consistent direction in the embedding space, and the same direction applied to "king" lands you at "queen." Gender is a direction, not a coordinate.
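The arithmetic can be sketched with toy vectors. Everything below is an assumption made for illustration: the three-dimensional values are hand-picked so that the man-to-woman offset matches the king-to-queen offset, and the `analogy` helper excludes the three input words from the candidates, a common convention in analogy demos.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors built so the gender offset is shared by both pairs.
vectors = {
    "man": [0.2, 0.6, 0.1],
    "woman": [0.2, 0.1, 0.6],
    "king": [0.9, 0.7, 0.2],
    "queen": [0.9, 0.2, 0.7],
    "apple": [0.1, 0.1, 0.1],
}

def analogy(a, b, c, table):
    """Return the word nearest to v(a) - v(b) + v(c), skipping a, b and c."""
    target = [x - y + z for x, y, z in zip(table[a], table[b], table[c])]
    candidates = [w for w in table if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, table[w]))

result = analogy("king", "man", "woman", vectors)
print(result)  # queen
```

In real embeddings the offset is a direction spread across many coordinates and the match is approximate, so the nearest-neighbor search does real work.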
Other analogies that work surprisingly often:
- Capitals. v(Paris) − v(France) + v(Italy) ≈ v(Rome). The "is the capital of" direction is consistent across countries.
- Verb tenses. v(walking) − v(walk) + v(swim) ≈ v(swimming). The "-ing" inflection becomes a learnable direction.
- Comparatives. v(big) → v(bigger) parallels v(fast) → v(faster).
- Singular / plural. v(dogs) − v(dog) ≈ v(cats) − v(cat).
Why does this work? Word2Vec and GloVe are trained to push words used in similar contexts toward similar vectors. "King" and "queen" appear in similar contexts apart from the gender of who they refer to. "Man" and "woman" also have contexts that share most properties (subject, sentience, person-noun-ness…) and differ on gender. So in vector space, the gender axis is the direction along which the man/woman pair and the king/queen pair differ in the same way. Vector subtraction isolates that direction; vector addition transports it elsewhere.
A few honest caveats. The analogies don't always work — they tend to work best for cherry-picked pairs and on the cleanest, most frequent words. Recent critical analyses showed that some of the famous demos were partly engineered (the nearest-neighbor rules used in the original demo excluded the input words, which biases the answer). And contextual embeddings (the topic of §4 and the next primer) partially break this clean structure — because each occurrence has its own vector, there's no single "v(king)" to point at any more. But the core observation held: directions in embedding space carry meaning.
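The input-exclusion caveat can be reproduced on toy data. The vectors below are hypothetical and chosen so that the man-to-woman offset is small relative to "king" itself, which is the situation in which the effect shows up; the `exclude_inputs` flag is an illustrative parameter, not any library's API.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors: "man" and "woman" are close, so the offset is small.
vectors = {
    "king": [0.9, 0.5, 0.3],
    "man": [0.2, 0.6, 0.3],
    "woman": [0.2, 0.5, 0.4],
    "queen": [0.9, 0.3, 0.6],
    "apple": [0.1, 0.1, 0.9],
}

def analogy(a, b, c, table, exclude_inputs=True):
    """Nearest word to v(a) - v(b) + v(c), optionally skipping the inputs."""
    target = [x - y + z for x, y, z in zip(table[a], table[b], table[c])]
    candidates = [
        w for w in table
        if not (exclude_inputs and w in (a, b, c))
    ]
    return max(candidates, key=lambda w: cosine(target, table[w]))

with_exclusion = analogy("king", "man", "woman", vectors, exclude_inputs=True)
without_exclusion = analogy("king", "man", "woman", vectors, exclude_inputs=False)

print(with_exclusion)     # queen
print(without_exclusion)  # king
```

Here the raw nearest neighbor of the computed point is "king" itself; "queen" only wins once the inputs are removed from the candidate set.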
For a moment in 2013, this was world-stopping. It looked like the field had accidentally taught a computer to do analogical reasoning. We later learned the truth is more boring and more interesting: it had taught the computer to approximate co-occurrence statistics, and analogies are a side effect of that statistical regularity. Either way, the demo was the right kind of viral. It set up a decade of progress.
In a Transformer: the input embedding matrix is still trainable, still encodes word similarities, and to some extent still encodes the analogical structure you can probe with vector arithmetic. The bulk of the model's knowledge, though, has moved beyond static embeddings into the attention and FFN layers — which is §4's point and the bridge to the Transformer.
Why Static Embeddings Aren't Enough
One vector per word, no matter how many meanings it has, no matter the context.
Word2Vec and GloVe were a giant leap forward, and they sat at the top of NLP for years. But by ~2018 it was clear they had a fundamental limitation: every occurrence of a word gets the same vector, regardless of what sentence it's in. The text-in-LLMs primer (§3) called this out; here's the cleanest demonstration.
Consider the word "bank":
- "She sat by the river bank." — a sloping edge of land.
- "She walked into the bank." — a financial institution.
- "He works at the blood bank." — a medical storage facility.
Three meanings. Word2Vec gives them one vector. The training procedure has to average over all three senses (and many more), so the resulting "bank" embedding sits somewhere between the riverbank cluster and the financial cluster — useful for neither.
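A static lookup makes the problem visible in code. In this sketch the table contents are made up and `static_vector` is an illustrative helper; it accepts the sentence only to show that the sentence has no effect on the result.

```python
# Hypothetical static table with a single entry for "bank".
static_table = {"bank": [0.4, 0.5, 0.1]}

def static_vector(word, sentence, table):
    """Look up a word's vector. The sentence is accepted but never used,
    which is the defining limitation of a static embedding."""
    return table[word]

sentences = [
    "She sat by the river bank.",
    "She walked into the bank.",
    "He works at the blood bank.",
]

bank_vectors = [static_vector("bank", s, static_table) for s in sentences]
print(bank_vectors[0] == bank_vectors[1] == bank_vectors[2])  # True
```

Whatever disambiguation happens has to be done by a later component, because the lookup itself returns the same numbers every time.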
The full list of things static embeddings can't do:
- Polysemy. "bank," "bat," "Apple," "rose," "lead" — every dictionary's pages of senses-per-word are invisible to Word2Vec.
- Out-of-vocabulary words. The vocabulary is fixed at training time. A new word ("ChatGPT", a brand, a person's name) has no embedding at all. Sub-word tokenization (BPE, SentencePiece) helps by breaking unknown words into known pieces, but it's a workaround.
- Phrase composition. "New York" is a city, but v("New") + v("York") is something else. Pre-Transformer NLP needed separate phrase-detection pipelines to recover the joint meaning.
- Syntactic role. "Run" the noun and "run" the verb get the same vector. A model has to disambiguate them downstream from context, with the embedding offering no help.
- Long-range context. Even when "bank" should mean "river bank" because a paragraph back you mentioned a hike along the Mississippi, the static embedding has no way to know.
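The out-of-vocabulary item from the list above can be sketched as a dictionary lookup. The `<unk>` fallback shown here is a common convention assumed for illustration, not something the text prescribes, and the table values are made up.

```python
# Hypothetical fixed vocabulary, frozen at training time.
embeddings = {
    "<unk>": [0.0, 0.0],  # shared placeholder for every unknown word
    "new": [0.3, 0.1],
    "york": [0.2, 0.6],
}

def lookup(word, table, unk_token="<unk>"):
    """Return the word's vector, or the shared unknown-token vector."""
    return table.get(word, table[unk_token])

chatgpt_vec = lookup("chatgpt", embeddings)
brand_vec = lookup("somenewbrand", embeddings)
york_vec = lookup("york", embeddings)

print(chatgpt_vec == brand_vec)  # True
```

Every unseen word collapses onto the same placeholder vector, so two unrelated new words become indistinguishable to the model.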
The fix, briefly: don't embed words once and call it a day. Embed them once per sentence, conditioned on the sentence. Every occurrence of "bank" produces its own vector, computed from the surrounding tokens. These are contextual embeddings. ELMo (2018) did this with a bidirectional LSTM stacked on top of context-independent token representations (character-based, in ELMo's case). BERT and GPT (2018) did it with Transformers. The improvement on downstream tasks was so large that within two years the top of nearly every major NLP leaderboard was a contextual model, and static embeddings were demoted to "input layer of a contextual model."
That demotion is exactly the structure of a modern Transformer. The first layer is still a learned embedding table (basically Word2Vec, jointly trained). The dozens of layers stacked on top take that static vector and gradually contextualize it by attending to the other tokens in the sequence. By the last layer, the "bank" vector at position 47 in your input can depend on every other token in the sequence (in a causal, GPT-style model, on every earlier token); it has no fixed meaning, only a meaning-in-this-particular-sentence.
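The idea of contextualizing a static vector can be illustrated with a single attention-like mixing step. This is a deliberately stripped-down toy, not a Transformer layer: it has no learned query/key/value projections, no scaling, one head and one layer, and the two-dimensional embeddings are hand-made (axis 0 loosely "nature", axis 1 loosely "finance").

```python
import math

# Hypothetical static embeddings.
embeddings = {
    "river": [1.0, 0.0],
    "money": [0.0, 1.0],
    "bank": [0.5, 0.5],
    "the": [0.1, 0.1],
}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def contextualize(tokens, position, table):
    """Replace one token's vector with a similarity-weighted average
    of all token vectors in the sentence (including its own)."""
    vectors = [table[t] for t in tokens]
    query = vectors[position]
    weights = softmax([dot(query, v) for v in vectors])
    dim = len(query)
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

river_bank = contextualize(["the", "river", "bank"], 2, embeddings)
money_bank = contextualize(["the", "money", "bank"], 2, embeddings)

print(river_bank)
print(money_bank)
```

Both calls start from the same static "bank" vector, yet the outputs lean toward different axes depending on the neighboring word, which is the effect stacked attention layers produce at scale.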
That's the bridge. Static word embeddings were the right answer for the 2013–2018 era — dense, semantically meaningful, beautifully analogical. They failed at one specific thing — context — and the fix for that is the entire next primer.
In a Transformer: the entire purpose of stacking attention layers on top of an embedding table is to convert each token's context-free input embedding into a context-sensitive output vector. By the time data reaches the last layer, every "bank" in the input has a vector that reflects whether it's a river bank, a financial bank, or a blood bank — and the differences are stark enough that downstream layers can act differently for each.