RNN & LSTM Primer

This primer is optional. If you just want to read about the Transformer, skip straight to the next primer. But understanding what came before it, and the specific pain points it solved, is the cleanest way to appreciate why the Transformer was such a sharp break. Three short topics: RNN, the basic recurrent net that processes one token at a time; LSTM / GRU, the gated variants that handle long-range dependencies; and why the Transformer killed them all (parallelism, true long-range memory, and the end of the encoder-decoder bottleneck).

RNN — One Token at a Time

A fixed-size "memory" carried forward through the sequence.

The text primer (§4) listed RNN-family models in the pre-Transformer landscape. Here we open one up. The basic RNN (Recurrent Neural Network) is the simplest network that handles variable-length input: walk the sequence left to right, and at every step combine the new token with a "memory" of everything seen so far.

The recipe is one update rule, applied at every timestep:

h_t = tanh(W_h · h_{t−1} + W_x · x_t + b)

h_t is the hidden state at step t — a fixed-size vector that's the network's entire summary of the sequence so far. x_t is the new token's embedding. W_h recycles the previous hidden state, W_x projects the new input, tanh squashes them together. Crucially, W_h, W_x, and b are the same across every timestep — that's parameter sharing through time, the property that lets one set of weights process sequences of any length.

An RNN walks the sequence one token at a time. Each step's hidden state summarizes everything seen so far. Same parameters reused at every step.

For tasks that need a single output at the end (sentiment classification, next-token prediction), you read off the final hidden state and project it:

Walk forward once:
  h_0   =  tanh(  0           +  W_x · x_0  +  b )      ← "The"
  h_1   =  tanh(  W_h · h_0   +  W_x · x_1  +  b )      ← "dog"
  h_2   =  tanh(  W_h · h_1   +  W_x · x_2  +  b )      ← "runs"
  h_3   =  tanh(  W_h · h_2   +  W_x · x_3  +  b )      ← "fast"

Read the answer:
  y     =  W_o · h_3 + b_o                              ← prediction
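The walk-forward above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the sizes, the random weights, and the helper names (`matvec`, `rnn_step`, `rnn_forward`) are all assumptions made for the example.

```python
import math
import random

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(h_prev, x, W_h, W_x, b):
    """One update: h_t = tanh(W_h · h_{t-1} + W_x · x_t + b)."""
    rec = matvec(W_h, h_prev)
    inp = matvec(W_x, x)
    return [math.tanh(r + i + bias) for r, i, bias in zip(rec, inp, b)]

def rnn_forward(xs, W_h, W_x, b, W_o, b_o):
    """Walk the sequence left to right, then project the final hidden state."""
    h = [0.0] * len(b)          # the state before the first token is all zeros
    for x in xs:                # the same weights are reused at every step
        h = rnn_step(h, x, W_h, W_x, b)
    y = [o + bias for o, bias in zip(matvec(W_o, h), b_o)]
    return h, y

# Hypothetical sizes and random weights, for illustration only.
random.seed(0)
HIDDEN, EMBED, OUT = 4, 3, 2

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

W_h = rand_matrix(HIDDEN, HIDDEN)
W_x = rand_matrix(HIDDEN, EMBED)
W_o = rand_matrix(OUT, HIDDEN)
b, b_o = [0.0] * HIDDEN, [0.0] * OUT

# Four made-up embeddings standing in for "The", "dog", "runs", "fast".
tokens = [[random.uniform(-1, 1) for _ in range(EMBED)] for _ in range(4)]
h_final, y = rnn_forward(tokens, W_h, W_x, b, W_o, b_o)
```

Passing a longer list of tokens to `rnn_forward` needs no new parameters; only the loop runs longer.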

What an RNN gets right:

Variable length. Same weights, run the loop as many times as you need. A 5-token sentence and a 5000-token document use the exact same parameters.
Order is respected by construction. Token x_t can only influence the hidden state from time t onward; the recurrence enforces a strict left-to-right reading.
Parameter count is small. One matrix per "role" (W_h, W_x, W_o), reused everywhere. An RNN trained on language can be a few million parameters total.

What an RNN gets wrong:

Sequential. You cannot compute h_5 until h_4 exists. There's no way to parallelize across time within one example, which is murder on GPUs.
Vanishing / exploding gradients through time. Computing ∂L/∂h_0 means walking backward through the entire chain, multiplying by W_h (and by the tanh derivative, which is at most 1) at every step. If that per-step factor is consistently smaller than 1, the gradient shrinks toward zero; if it is consistently larger than 1, the gradient grows until it overflows to NaN. Same failure mode the backprop primer's §3 covered.
Finite-capacity memory. The hidden state is a fixed-size vector. Stuffing the meaning of a 500-token paragraph into 1024 floats is a lossy compression; older information gets overwritten by newer.

The first weakness is the GPU killer. The second and third are why even small RNNs struggle with anything longer than a few dozen tokens. LSTM and GRU (§2) attack the gradient problem head-on; the Transformer (next primer) attacks all three at once by dropping recurrence entirely.
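The gradient problem is easy to see numerically in a one-dimensional toy. The sketch below assumes a scalar recurrence (a single weight `w` instead of a matrix W_h) and made-up values for `w`, the input `x`, and the step count; the function names are hypothetical. The linear version isolates the repeated multiplication by `w`; the tanh version adds the squashing derivative on top.

```python
import math

def linear_chain_gradient(w, steps):
    """d h_T / d h_0 for the linear recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w
    return grad

def tanh_chain_gradient(w, steps, x=0.1):
    """d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x)."""
    h, grad = 0.0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)   # chain rule: tanh'(z) = 1 - tanh(z)**2
    return grad

shrinking = linear_chain_gradient(0.9, 50)   # factor below 1: heads toward zero
growing = linear_chain_gradient(1.1, 50)     # factor above 1: blows up
squashed = tanh_chain_gradient(0.9, 50)      # tanh' <= 1 only shrinks it further
```

With a real weight matrix the same story plays out along each direction the matrix stretches or shrinks, which is why both failure modes can show up in the same network.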

LSTM & GRU — Gated Memory

A separate "cell state" that flows along a highway, controlled by learnable gates.

The plain RNN's biggest weakness is what §1 called the gradient-through-time problem: by the time you've walked back twenty steps, you've multiplied by W_h twenty times. The signal's either dust or NaN. Hochreiter & Schmidhuber (1997) proposed a clever architectural fix: keep a separate "cell state" c_t that flows along the timeline largely untouched, and let learnable gates decide what to drop, what to add, and what to expose at each step. That's the LSTM (Long Short-Term Memory).

The full update for one timestep — looks scarier than it is:

Forget gate:    f_t  =  σ(W_f · [h_{t-1}, x_t] + b_f)
Input gate:     i_t  =  σ(W_i · [h_{t-1}, x_t] + b_i)
Candidate:      ĉ_t  =  tanh(W_c · [h_{t-1}, x_t] + b_c)
Update cell:    c_t  =  f_t · c_{t-1}  +  i_t · ĉ_t
Output gate:    o_t  =  σ(W_o · [h_{t-1}, x_t] + b_o)
Hidden state:   h_t  =  o_t · tanh(c_t)

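Every gate is a sigmoid, so its values lie between 0 and 1, and it is multiplied elementwise with whatever it's gating. (The · between two vectors in the "Update cell" and "Hidden state" lines is that elementwise product; the · after a W is an ordinary matrix-vector product.) Read each line as a sentence: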

Forget — "of the old cell state, keep this fraction." f = 1 means keep everything; f = 0 means wipe it.
Input + candidate — "write this much (i) of this content (ĉ) into the cell." i = 0 means write nothing; the candidate is itself a tanh-bounded blob of "what new information to add."
Output — "of the (now-updated) cell state, expose this fraction as the new hidden state h_t." The hidden state is what flows to the next timestep's gates and to whatever sits on top of the LSTM.
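The six update lines translate almost one-to-one into code. The following is a minimal sketch in plain Python, assuming vectors are lists of floats; the `params` dictionary layout and the helper names are hypothetical, chosen only for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W · v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(h_prev, c_prev, x, params):
    """One LSTM timestep following the six update lines above.

    `params` is a hypothetical dict holding (W, b) pairs under the keys
    "f", "i", "c", "o"; each W has shape hidden x (hidden + input).
    """
    hx = h_prev + x                                            # [h_{t-1}, x_t]
    f = [sigmoid(z) for z in affine(*params["f"], hx)]         # forget gate
    i = [sigmoid(z) for z in affine(*params["i"], hx)]         # input gate
    c_hat = [math.tanh(z) for z in affine(*params["c"], hx)]   # candidate
    c = [ft * cp + it * ch for ft, cp, it, ch in zip(f, c_prev, i, c_hat)]
    o = [sigmoid(z) for z in affine(*params["o"], hx)]         # output gate
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]           # exposed state
    return h, c

# Hand-set parameters (not trained): pin the gates to see the "highway".
HIDDEN, INPUT = 2, 3
zeros = [[0.0] * (HIDDEN + INPUT) for _ in range(HIDDEN)]
params = {
    "f": (zeros, [20.0] * HIDDEN),    # forget gate near 1: keep the cell
    "i": (zeros, [-20.0] * HIDDEN),   # input gate near 0: write nothing
    "c": (zeros, [0.0] * HIDDEN),
    "o": (zeros, [0.0] * HIDDEN),     # output gate at 0.5
}
h, c = [0.0] * HIDDEN, [0.7, -0.3]
for _ in range(100):
    h, c = lstm_step(h, c, [1.0, -1.0, 0.5], params)
```

With the forget gate pinned near 1 and the input gate near 0, the cell state after 100 steps is still essentially `[0.7, -0.3]`: the gates, not the passage of time, decide what survives.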

An LSTM cell carries a separate "cell state" c along a highway. Three learnable gates decide what to forget, what to add, and what to expose as the hidden output.

The reason this fixes the vanishing-gradient problem is subtle and beautiful. Look at the cell-state update: c_t = f_t · c_{t-1} + i_t · ĉ_t. Along the direct cell-state path, the derivative of c_t with respect to c_{t-1} is just f_t, applied elementwise. If the network learns to set f_t ≈ 1 for a particular cell dimension over many steps, the gradient flows backward through those steps essentially undamped: multiplying by 1 twenty times still gives 1. Compare that to the vanilla RNN, where every backward step multiplies by W_h regardless of context. The LSTM gives the network a way to choose when to forget; the vanilla RNN doesn't have that knob.
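To put numbers on that argument, here is a small sketch for a single cell dimension. It assumes the forget-gate values are given (they are made up here) and only follows the direct cell-state path, ignoring the indirect routes through h; the function name is hypothetical.

```python
import math

def cell_path_gradient(forget_values):
    """Product of forget-gate values, i.e. d c_T / d c_0 along the direct
    cell-state path for one cell dimension."""
    return math.prod(forget_values)

kept = cell_path_gradient([1.0] * 20)          # gate fully open for 20 steps
mostly_kept = cell_path_gradient([0.95] * 20)  # slight leak at every step
forgotten = cell_path_gradient([0.5] * 20)     # network chose to forget
```

A fully open gate passes the gradient through unchanged, a slightly leaky one still leaves a usable signal after 20 steps, and a half-closed one erases it. The difference from the vanilla RNN is that the network sets these values per dimension and per step.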

The GRU (Gated Recurrent Unit, 2014) is a popular simplification: only two gates (reset and update), no separate cell state. About 25% fewer parameters than an LSTM, slightly faster, and in practice often indistinguishable on accuracy. Use whichever your framework has bindings for.
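The 25% figure falls out of counting weight blocks. The sketch below assumes each gate or candidate is one weight matrix over [h_{t-1}, x_t] plus one bias vector, as in the LSTM equations above; the layer sizes are hypothetical, and some framework implementations keep an extra bias vector per block, which does not change the ratio.

```python
def gated_rnn_params(hidden, inp, blocks):
    """Each block is one weight matrix over [h_{t-1}, x_t] plus a bias vector."""
    return blocks * (hidden * (hidden + inp) + hidden)

def lstm_params(hidden, inp):
    # forget gate, input gate, candidate, output gate
    return gated_rnn_params(hidden, inp, blocks=4)

def gru_params(hidden, inp):
    # reset gate, update gate, candidate
    return gated_rnn_params(hidden, inp, blocks=3)

hidden, inp = 1024, 512   # hypothetical layer sizes
saving = 1 - gru_params(hidden, inp) / lstm_params(hidden, inp)
```

Three blocks instead of four gives a saving of one quarter regardless of the sizes chosen.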

What LSTM/GRU buy over the vanilla RNN:

Long-range dependencies become feasible. 100, 500, even 1000-token contexts can be trained. The cell state can hold information for very long stretches if the forget gate stays near 1.
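Training is much more stable. Vanishing gradients stop being the default outcome on long sequences. Vanilla RNNs needed all kinds of tricks (gradient clipping, careful init) just to train at all; LSTMs mostly just work, though gradient clipping is still commonly kept as a guard against the occasional exploding gradient.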

What they don't fix:

Still sequential. The gate values at step t depend on h_{t-1}, which depends on h_{t-2}, and so on. The GPU still hates this.
Still finite memory. The cell state is a fixed-size vector — bigger context windows just mean more information competing for the same number of floats.
Still O(n) sequential ops to process n tokens. An LSTM with a 1024-dim hidden state is fast per step but every step blocks the next.

From 2014 to 2017, LSTMs were the dominant architecture for sequence modeling: neural machine translation, language modeling, speech recognition. The Google Translate rewrite of 2016 was a deep LSTM. ELMo (2018), the first widely used contextual embedding, was a bidirectional LSTM. Then attention took over.

Why the Transformer Killed Them All

Three structural wins that the Transformer delivers and LSTMs couldn't.

From around 2017 onward, LSTMs gradually disappeared from leaderboards. By 2020 you would struggle to find a recent NLP paper using one as the main architecture. The Transformer (Vaswani et al., 2017) replaced them so completely that the entire field re-tooled within three years. Why? Three properties — each one a direct response to a specific RNN-family weakness from §1 and §2.

1. Parallelism.

An RNN's timestep t cannot start until timestep t − 1 finishes. The hidden state has to be computed in order, period. A modern GPU has thousands of parallel cores; running an LSTM on one keeps almost all of those cores idle. Even with batch parallelism (different sequences in parallel), the within-sequence work is sequential.

The Transformer's self-attention layer, by contrast, computes the output at every position in parallel. Position 1, position 5, and position 5000 are all processed in the same matrix multiplications. (Autoregressive generation still emits one token at a time, but training and prompt processing do not.) A single Transformer forward pass on a batch of 64 sequences × 2048 tokens fills the GPU completely; an LSTM on the same batch needs 2048 dependent steps per layer, each waiting on the one before, even at full GPU utilization across the batch dim. This is the single biggest reason Transformers eat LSTMs at scale.

2. True long-range memory.

Even with the LSTM's clever forget gate, information from token 1 has to pass through 999 timesteps of gating arithmetic to influence token 1000. The gates can carry it through (forget ≈ 1 at every step is theoretically possible), but in practice the network has to learn to do this for every dimension that needs to survive, and it's a hard optimization problem. Real LSTMs degrade gradually with distance.

In a Transformer, position 1 and position 1000 are directly connected by attention. The model takes the dot product of position 1000's query with position 1's key to get a similarity score, applies softmax over the scores for all positions, and uses the resulting weight to mix in position 1's value vector. Position 1's contribution to position 1000 doesn't go through 999 intermediate gates; it goes through one attention edge. The "effective path length" between any two tokens is O(1), not O(n).
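That single attention edge can be sketched for one query in plain Python. This is a minimal illustration under stated assumptions: it uses the scaled dot-product form from Vaswani et al. (2017), a single head, and hand-made two-dimensional keys and values; the function name `attend` is hypothetical.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one query over all positions."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)                         # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]        # softmax over all positions
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights

# 1000 positions; only the first one has a key that matches the query.
keys = [[8.0, 0.0]] + [[0.0, 1.0] for _ in range(999)]
values = [[1.0, 0.0]] + [[0.0, 1.0] for _ in range(999)]
query = [8.0, 0.0]                             # query of the last position
output, weights = attend(query, keys, values)
```

Nearly all of the weight lands on the first position, so its value reaches the last position in one step, no matter how many tokens sit in between.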

3. No information bottleneck.

For encoder-decoder tasks (translation, summarization), an RNN encoder reads the source sequence and produces a final hidden state. That single fixed-size vector is then handed to the decoder, which must generate the entire output sequence from it. Translate a 200-word paragraph: the entire meaning of the paragraph has to fit in a few-thousand-float vector, then the decoder has to unpack it word by word. Bahdanau et al. (2014) recognized this as a bottleneck and proposed an attention mechanism so the decoder could look at every encoder hidden state, not just the last. That was already a massive translation-quality win — and it planted the seed: maybe the recurrence is the wrong primitive entirely?

Vaswani et al. (2017) answered yes. Drop the RNN. Use attention everywhere, in both encoder and decoder, in every layer. Now every output position can attend to every input position directly: no bottleneck, no fading memory, no sequential blocker.

Parallelism, true long-range memory, and no encoder-decoder bottleneck. RNN and LSTM each missed at least one; Transformer ticks all three.

What the Transformer pays for these wins:

Quadratic memory / compute in sequence length. Pairwise attention across n tokens is n² entries. For n = 1024, fine; for n = 1,000,000, catastrophic. A large share of post-2020 ML systems research is "efficient attention" (FlashAttention, sparse attention, linear attention, sliding windows, ring attention), which tries to break the quadratic ceiling or at least make it cheaper to live under.
Bigger parameter count. A Transformer needs separate Q, K, V projections at every layer, plus the FFN sublayer. Modest Transformers blow past the parameter counts that LSTMs operated at. Some of that's good (more capacity); some of it's waste.
Order has to be injected explicitly. Self-attention by itself has no notion of position: permute the input tokens and the outputs are permuted the same way. So the Transformer needs positional encodings, covered briefly in the text primer §2 and in detail in the next primer.
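The quadratic cost in the first item is worth a quick back-of-the-envelope check. The sketch below assumes the full n × n score matrix is materialized for one head in one layer with 2-byte (16-bit) entries; fused kernels in the FlashAttention style avoid storing it, so treat this as the naive upper bound. The function name is hypothetical.

```python
def attention_score_bytes(n_tokens, bytes_per_entry=2):
    """Bytes needed to store one n x n attention score matrix."""
    return n_tokens * n_tokens * bytes_per_entry

small = attention_score_bytes(1024)        # 2 MiB
huge = attention_score_bytes(1_000_000)    # 2 * 10**12 bytes, about 2 TB
```

Growing the sequence by roughly a factor of a thousand grows the matrix by roughly a factor of a million, which is the whole problem in one line.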

For the kinds of sequences we care about (text, code, audio frames, tokenized anything in the kilotoken range), the wins comfortably outweigh the costs. Hence the last seven years.

Where RNNs are coming back, briefly: extremely long sequences (think DNA, audio over hours) where attention's n² cost becomes prohibitive. Mamba (2023) and other state-space models, and RWKV (2023), revive RNN-like recurrence with modern training tricks, and on some very-long-sequence tasks they have been reported to match or beat Transformers at lower cost. For the typical 4k–128k context window of an LLM in 2026, though, the Transformer's three wins are still the dominant story.

That's the historical context. The next primer pulls the Transformer itself apart — block by block, sublayer by sublayer — and you'll see exactly how it delivers all three properties on the same architectural slab.