Text in LLMs Primer

Every primer so far has worked on tidy fixed-size numeric data — pixels in an image, a few columns in a CSV, the 2D and 3D vectors of the linear-algebra primer. Real-world ML, and especially LLMs, runs on text. Text turns out to be one of the hardest data types ML deals with, for three specific reasons: it's variable length, order matters, and every token's meaning depends on context. Together these three properties broke every pre-2017 approach. The Transformer is the architecture that finally handles all three. This primer is the "why we need a Transformer" setup; the next primer is the Transformer itself.

Text Is Variable Length

"OK." is two tokens. A long Wikipedia article is tens of thousands. Both must fit in the same model.

Almost every neural network primitive in the previous primers — matrix multiplication, convolution, normalization — operates on tensors of a fixed shape. The image classifier expects 224 × 224 pixels. The housing regressor expects a 4-dimensional feature vector. A neural net is a function with a fixed-shape input and a fixed-shape output, and matrix algebra makes that assumption everywhere.

Text refuses to cooperate. The same model that needs to process "OK." (2 tokens) also needs to process a long Wikipedia article (tens of thousands of tokens), and every input length in between. There is no natural maximum and no natural minimum.

Before the Transformer, there were three ugly fixes, all still occasionally in use:

Pad. Pick a maximum length L, then for every input shorter than L, append a special [PAD] token until you hit L. Every input is now the same shape — but if your batch contains one Wikipedia article and a hundred short tweets, you're wasting 99% of your compute on padding tokens.
Truncate. Pick a maximum length L, and chop off anything past that. Now your model never sees the tail of long documents. For translation that's a sentence half-translated; for summarization the actual punchline is missing; for code completion you can't see the function header.
Bag of features. Throw out the sequence structure entirely. Treat each document as a bag of word counts (or TF-IDF scores): a fixed-length vector determined by the vocabulary, not by the input length. Fast, but loses everything the next two sections (order and context) are about.
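
The three fixes above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production tokenizer: the helper names `pad_or_truncate` and `bag_of_words`, the six-word vocabulary, and the maximum length of 4 are all assumptions made up for the example.

```python
from collections import Counter

PAD = "[PAD]"

def pad_or_truncate(tokens, max_len):
    """Force a token list to exactly max_len items."""
    clipped = tokens[:max_len]                         # truncate: drop the tail
    return clipped + [PAD] * (max_len - len(clipped))  # pad: fill what is left

def bag_of_words(tokens, vocab):
    """Fixed-length count vector with one slot per vocabulary entry."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

# Hypothetical inputs and vocabulary, chosen only for illustration.
short = ["OK", "."]
long = ["the", "dog", "chased", "the", "cat", "."]
vocab = ["the", "dog", "cat", "chased", ".", "OK"]

padded = pad_or_truncate(short, 4)      # ['OK', '.', '[PAD]', '[PAD]']
truncated = pad_or_truncate(long, 4)    # ['the', 'dog', 'chased', 'the']
bow_short = bag_of_words(short, vocab)  # [0, 0, 0, 0, 1, 1]
bow_long = bag_of_words(long, vocab)    # [2, 1, 1, 1, 1, 0]
```

Note what each fix costs: `padded` is half filler, `truncated` has lost "cat", and both count vectors have the same length but no longer record which word came first.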

Most ML models want a fixed input shape. Sentences disagree. Padding wastes compute; truncating loses information.

Modern Transformer-based systems use a combination of these tricks at the edges (you still pad mini-batches to the same length within a batch, because GPUs want rectangular tensors), but the core architecture handles each token of the input the same way regardless of how many tokens there are. There is no "fixed input size" baked into a Transformer's weights; the same model can run on length 10 or 10,000, up to whatever limit its positional scheme, training, and memory allow. The per-position computation looks the same; only the total work and memory scale with the length (notoriously, quadratically, but more on that later).
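
To make "no fixed input size" concrete, here is a small sketch of a position-wise layer: one shared weight matrix applied to each position's vector independently. The function `position_wise` and the 2-by-3 matrix `W` are hypothetical, and a real Transformer layer also contains attention and nonlinearities; the point is only that the weight shape depends on the vector width, never on the sequence length.

```python
def position_wise(vectors, weight):
    """Apply one shared linear map to each position's vector independently."""
    d_in, d_out = len(weight), len(weight[0])
    return [
        [sum(vec[i] * weight[i][j] for i in range(d_in)) for j in range(d_out)]
        for vec in vectors
    ]

# Hypothetical 2 -> 3 weight matrix; nothing in its shape mentions length.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]

short_seq = [[1.0, 2.0]] * 10       # 10 positions
long_seq = [[1.0, 2.0]] * 10_000    # 10,000 positions, same weights

out_short = position_wise(short_seq, W)
out_long = position_wise(long_seq, W)
print(len(out_short), len(out_long), out_short[0])
```

The same `W` serves both inputs; only the amount of work changes.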

Variable length used to be exotic. Now it's the entire game. Every recent improvement to "context window" (1k → 4k → 32k → 128k → 1M tokens) is a story about getting this primitive to work at scales nobody dreamed of a decade ago.

Order Matters

"The dog chased the cat" vs "The cat chased the dog" — same words, opposite meaning.

Once you have a way to ingest variable-length text, the obvious next step is to keep it cheap: throw away the order, just count which words appear. That's a bag of words. It's the simplest representation that handles variable length and gives a fixed-size output. For some tasks (spam detection, topic classification) it's nearly sufficient. For most of what we want from language models, it's catastrophic.

Word order encodes who did what to whom. The classic example:

Sentence A:  The dog chased the cat.
Sentence B:  The cat chased the dog.

BoW(A)  =  { the:2, dog:1, chased:1, cat:1, .:1 }
BoW(B)  =  { the:2, dog:1, chased:1, cat:1, .:1 }   ← identical

The same set of words. The exact same bag-of-words representation. But the two sentences describe opposite events. Any model that operates only on the bag-of-words cannot tell them apart. Bigger versions of the same problem are everywhere:

"Man bites dog" — newsworthy precisely because the order of bite-er and bite-ee is unusual.
"I never said she stole the money" — has seven distinct meanings depending on which word you emphasize. Without word order, you can't even start.
Translation. The English subject–verb–object order vs Japanese subject–object–verb, or Arabic verb–subject–object. Word order is half the grammar.
Code. x = a / b and x = b / a use the same tokens; only order says which divides what.
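
The collision is easy to reproduce. The sketch below uses `collections.Counter` as a stand-in bag-of-words, with whitespace splitting and lowercased text as simplifying assumptions.

```python
from collections import Counter

sentence_a = "the dog chased the cat .".split()
sentence_b = "the cat chased the dog .".split()

# A Counter keeps word counts and nothing else, which is a bag of words.
same_bag = Counter(sentence_a) == Counter(sentence_b)
same_sequence = sentence_a == sentence_b

# The same collision for the code example.
code_a = "x = a / b".split()
code_b = "x = b / a".split()
same_code_bag = Counter(code_a) == Counter(code_b)

print(same_bag, same_sequence, same_code_bag)  # True False True
```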

Word order changes who did what to whom. Any model that ignores order makes these two sentences identical.

The pre-Transformer compromises:

n-grams. Instead of counting individual words, count n-tuples of adjacent words. Bigrams capture two-word order, trigrams three-word order, and so on. It works locally (a bigram model sees "dog chased" in one sentence and "cat chased" in the other), but the size of the feature space explodes (vocabularyⁿ) and it still can't handle long-range order. "If A then B" with a paragraph between A and B is invisible to any n-gram with reasonable n.
RNNs. Read tokens left-to-right, carry a hidden state that summarizes what you've seen so far. Order is respected by construction. Brilliant — but sequential, so slow to train, and the hidden state has a finite capacity so distant tokens get squeezed out.
1D CNNs over text. Slide a convolution kernel over the token sequence. Captures local order within the kernel's receptive field. Parallel-friendly. But the receptive field is bounded — stacking layers grows it linearly, not enough for paragraph-long dependencies without absurd depth.
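
As a quick check on the n-gram entry above, the sketch below counts bigrams for the dog/cat pair and then computes how many possible n-grams exist for a given vocabulary. The helper `ngrams` and the vocabulary size of 50,000 are assumptions picked for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """All runs of n adjacent tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = "the dog chased the cat .".split()
b = "the cat chased the dog .".split()

bigrams_a = Counter(ngrams(a, 2))
bigrams_b = Counter(ngrams(b, 2))
distinguishable = bigrams_a != bigrams_b   # bigrams do separate the two sentences

# Number of possible n-grams is vocabulary ** n (assumed vocabulary size).
vocab_size = 50_000
feature_space = {n: vocab_size ** n for n in (1, 2, 3)}
print(distinguishable, feature_space)
```

With this assumed vocabulary, moving from unigrams to trigrams takes the feature space from 50,000 to 1.25 × 10¹⁴ possible entries, which is why n rarely goes far past 3 in practice.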

The Transformer's answer: positional encodings. Position information is injected so that a token at position 1 looks different from the same token at position 47. In the original design, each token's embedding is stamped with its position before any attention or FFN runs. Without this, attention is itself order-blind: shuffle the input tokens and each token gets the same output, just shuffled. With it, the architecture knows where every token lives. Different recipes compete on how exactly to inject position (sinusoidal and learned encodings are added to the embeddings; RoPE and ALiBi act inside attention), and nearly every Transformer in use does this in some form.
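
Here is a minimal sketch of the sinusoidal recipe from the original Transformer paper, in which even dimensions use a sine and odd dimensions a cosine at geometrically spaced frequencies. The function names and the 4-dimensional token vector are hypothetical; real models use hundreds or thousands of dimensions.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Sinusoidal position vector: sin on even dims, cos on odd dims."""
    enc = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        enc.append(math.sin(angle))
        enc.append(math.cos(angle))
    return enc[:d_model]

def add_position(embedding, position):
    """Stamp a token embedding with its position by elementwise addition."""
    pe = sinusoidal_encoding(position, len(embedding))
    return [e + p for e, p in zip(embedding, pe)]

token_vec = [0.5, -0.2, 0.1, 0.9]    # hypothetical embedding of one token
at_1 = add_position(token_vec, 1)
at_47 = add_position(token_vec, 47)
print(at_1 != at_47)   # the same token now looks different at each position
```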

Every Token Depends on Context

"Bank" is a financial institution next to "deposit" and a riverside next to "river."

Even if you handle variable length and order perfectly, you still have a third problem: a single token doesn't have a single meaning. Its meaning is decided by the tokens around it.

Polysemy. "bank" = financial institution or riverside. "bat" = nocturnal mammal or sports equipment. "Apple" = company or fruit. English alone has thousands of words like this.
Pronouns & references. "Alice asked Bob a question. He didn't know the answer." Who is he? Anyone who reads English answers "Bob" — but the model has to figure that out from context, often many tokens away.
Ellipsis. "I went to the store. Bob did too." What did Bob do? Went to the store. The verb phrase "went to the store" is omitted in the second sentence and replaced by "did"; you have to pull it forward from the first.
Tone & sarcasm. "Oh, fantastic." Means "great" or "this is the worst day of my life," determined by everything around it.

The pre-Transformer answer to "what does a word mean" was an embedding: a fixed-size vector trained per word. Word2Vec (2013) and GloVe (2014) learned a single dense vector for each token in the vocabulary. "king" went near "queen"; "Paris" went near "France." A massive leap over one-hot, and most modern NLP pipelines still start from something like an embedding lookup table.

But Word2Vec gives every occurrence of "bank" the same vector. Whether the sentence is "she sat on the river bank" or "she walked into the bank," the bank token's representation is identical at the start. Any meaning has to be recovered downstream by whatever sits on top of the embedding — and an RNN can do this, slowly, by mixing in context as it reads. But the embedding itself is context-free.
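
A static embedding is just a lookup table, which the sketch below makes explicit. The 2-dimensional vectors are invented for illustration and are not taken from Word2Vec or any trained model.

```python
# Hypothetical static embedding table: one fixed vector per word.
embedding_table = {
    "she": [0.1, 0.3], "sat": [0.4, -0.2], "on": [0.0, 0.1],
    "the": [0.05, 0.05], "river": [0.9, -0.6], "bank": [0.7, -0.3],
    "walked": [0.3, 0.5], "into": [-0.1, 0.2],
}

def embed(tokens):
    """Context-free lookup: each token maps to its row, whatever surrounds it."""
    return [embedding_table[t] for t in tokens]

river_sentence = "she sat on the river bank".split()
money_sentence = "she walked into the bank".split()

bank_in_river = embed(river_sentence)[-1]
bank_in_money = embed(money_sentence)[-1]
print(bank_in_river == bank_in_money)   # True: same vector in both contexts
```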

A word's meaning depends on its neighbors. A static embedding gives one point per word; a contextual embedding gives a different point each time.

The fix is the obvious one in hindsight: instead of fixed embeddings, compute contextualized embeddings, vectors that depend on the surrounding tokens. ELMo (2018) did this with a bidirectional LSTM. BERT and GPT (2018) did it with Transformers, and that's what almost everyone uses now. In a modern model, the "bank" in "river bank" and the "bank" in "bank deposit" start from the same lookup vector but diverge as soon as attention mixes in their neighbors, and they end up with clearly different representations. That's exactly what the Transformer is for.

Context-dependence isn't just polysemy — it's everywhere in language. Each token's meaning is a function of an arbitrary subset of the other tokens in the sequence. Sometimes the relevant context is two tokens away ("not good"); sometimes it's 200 ("he," referring to a character introduced a paragraph ago); sometimes it's the entire prior conversation. Variable-distance dependencies are the rule, not the exception. The architecture has to let any token attend to any other, cheaply enough to do it at every layer. That's what attention is.
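
A stripped-down version of attention shows how one token's vector becomes a function of its neighbors. This is a sketch under strong simplifying assumptions: queries, keys, and values are all the raw input vectors (a real Transformer learns separate projections for each), there is a single head, and the 2-dimensional word vectors are invented.

```python
import math

def softmax(scores):
    top = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (no learned weights)."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

# Hypothetical static vectors; "bank" starts out identical in both contexts.
vectors = {"river": [1.0, 0.0], "money": [0.0, 1.0], "bank": [0.5, 0.5]}

bank_near_river = self_attention([vectors["river"], vectors["bank"]])[1]
bank_near_money = self_attention([vectors["money"], vectors["bank"]])[1]
print(bank_near_river, bank_near_money)
```

After one attention step, the "bank" vector has been pulled toward whichever neighbor it sits next to, so the two occurrences no longer share a representation.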

The Pre-Transformer Landscape

Each prior approach handles some of the three properties, never all three.

With the three requirements clear — variable length, order, context — the obvious question is "how did people handle this before the Transformer?" The answer is "with a sequence of partial solutions, each addressing some properties at the cost of others." Here's the menu, roughly in chronological order:

Bag of words / TF-IDF — handles variable length (the output is fixed-size by vocabulary), but throws out both order and context. Still surprisingly competitive on shallow classification (spam, sentiment, topic).
n-grams — captures local order by counting n-tuples of consecutive words. The feature space explodes (vocabularyⁿ) and long-range order is invisible.
RNN / LSTM / GRU — read tokens left-to-right (or in both directions, with biRNNs), maintain a hidden state. Variable length and order are handled natively. Context is in principle modeled, but in practice the hidden state has a limited capacity, the gradient flow through long sequences is fragile (the backprop primer's §3 — vanishing gradients), and training is sequential rather than parallel.
1D CNNs over text — apply convolutions along the sequence axis. Parallel-friendly. Captures local order and context inside the receptive field but not beyond. Stacking layers extends the receptive field linearly with depth, not nearly fast enough for document-length reasoning.
Attention on top of RNNs — the bridge. Bahdanau et al. (2014) and Luong et al. (2015) added an attention mechanism to encoder-decoder RNNs so the decoder could look at any encoder hidden state, not just the last one. This dramatically improved translation. It also planted the seed: maybe the RNN itself isn't needed?
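
The "linear with depth" claim about 1D CNNs in the list above can be put in numbers. The sketch assumes plain stride-1 convolutions with no dilation or pooling, for which each layer adds kernel_size − 1 tokens to the receptive field; dilated or strided designs grow faster and are not covered here.

```python
import math

def receptive_field(num_layers, kernel_size):
    """Tokens visible to one output position (stride 1, no dilation)."""
    return 1 + num_layers * (kernel_size - 1)

def layers_needed(span, kernel_size):
    """Fewest stacked layers whose receptive field covers `span` tokens."""
    return math.ceil((span - 1) / (kernel_size - 1))

print(receptive_field(4, 3))      # 9 tokens after four layers of width-3 kernels
print(layers_needed(1000, 3))     # 500 layers to connect tokens 1,000 apart
```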

Vaswani et al.'s 2017 paper, titled "Attention Is All You Need," was the punchline. Drop the RNN entirely. Do nothing but attention, applied in parallel at every position, stacked into layers, on top of positional encodings.

Every pre-Transformer recipe trades off at least one property. The Transformer is the first to tick all three, at the cost of a quadratic-in-length attention block.

The Transformer scores all three:

Variable length — yes. The per-position computation looks the same regardless of sequence length; only memory and compute scale with length. Modern context windows of 128k or 1M tokens are extensions of this property, not new architectures.
Order — yes, via explicit positional encodings (sinusoidal in the original paper, learned / RoPE / ALiBi in newer models). Attention itself is order-invariant; the positional encoding stamps each token with its position before attention runs.
Context — yes, fully. Every token attends to every other token at every layer (in decoder-only models such as GPT, to every earlier token). Each layer makes a token's representation a function of all the tokens it can attend to, with the model learning what to weight high and what to ignore.
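
The claim that attention alone ignores order can be tested directly. The sketch below reuses a bare-bones self-attention (Q = K = V = input, no learned weights, a simplification) and a made-up position signal, `with_position`, that shifts the first coordinate by 0.1 per index. Both are illustrative assumptions, not any model's actual scheme.

```python
import math

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        weights = [e / sum(exps) for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

def close(u, v):
    return all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(u, v))

def with_position(X):
    """Hypothetical toy position signal: shift coordinate 0 by 0.1 * index."""
    return [[x[0] + 0.1 * i] + x[1:] for i, x in enumerate(X)]

dog, chased, cat = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
order_a = [dog, chased, cat]    # "dog chased cat"
order_b = [cat, chased, dog]    # "cat chased dog"

no_pos_a, no_pos_b = self_attention(order_a), self_attention(order_b)
pos_a = self_attention(with_position(order_a))
pos_b = self_attention(with_position(order_b))

# "dog" is at index 0 in order_a and index 2 in order_b.
dog_same_without_position = close(no_pos_a[0], no_pos_b[2])
dog_same_with_position = close(pos_a[0], pos_b[2])
print(dog_same_without_position, dog_same_with_position)   # True False
```

Without a position signal, "dog" gets the same output vector whether it is the chaser or the chased; with one, the two cases separate.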

The cost: attention is quadratic in sequence length. To let every token attend to every other token, you compute n × n pairwise similarities. For n = 1024 that's about a million, which is fine; for n = 1,000,000 it's a million million (10¹²) similarity scores per layer. The entire industry of "efficient attention" research (sparse, linear, flash, ring, etc.) exists to chip away at this. But for moderate lengths, the quadratic cost is the price of finally getting all three properties.
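
The arithmetic behind those numbers is short enough to write out. The byte count assumes 2 bytes per score (16-bit floats) and a naively materialized n × n matrix for a single head; both are assumptions for illustration, and memory-efficient kernels avoid storing the full matrix.

```python
def attention_pairs(n):
    """Number of query-key similarity scores for a length-n sequence."""
    return n * n

def score_bytes(n, bytes_per_score=2):
    """Memory to hold one full n x n score matrix (assumed 16-bit scores)."""
    return attention_pairs(n) * bytes_per_score

for n in (1024, 32_768, 1_000_000):
    print(n, attention_pairs(n), score_bytes(n))
```

At n = 1024 the matrix is about 2 MB; at n = 1,000,000 it would be 2 × 10¹² bytes, roughly 2 TB, for one head in one layer.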

That's the setup. Variable length, order, context — what the Transformer was born to solve. The next primer pulls it apart layer by layer.