Tokenization Primer

Before an LLM can do anything with text, it has to chop the text into pieces and look up an integer ID for each piece. Those pieces are called tokens, and the chopping is tokenization. Four short topics: the character / word / subword trade-off that frames the whole problem; BPE (the algorithm GPT uses); WordPiece, SentencePiece, Unigram — the major variants; and why modern LLM vocabularies sit between 50,000 and 200,000 tokens.

Character, Word, or Subword?

The choice of "unit" shapes every cost and capability that follows.

The first decision in any text-to-numbers pipeline is also the most consequential: what counts as a "token"? There were two traditional answers, characters and words, each with its own broken compromise, until subword tokenization arrived and quietly replaced them both.

Character-level. Every character is a token. Vocab is tiny (~100 for ASCII, low thousands for Unicode), and there are no OOV problems — every conceivable text can be tokenized. The catch: sequences are very long. A 100-word sentence becomes ~500 tokens. Sequence length costs everything in a Transformer: attention is O(n²), KV cache grows linearly. Doable, just slow.
Word-level. Each word is a token. A sentence of 100 words is 100 tokens. Natural unit, short sequences. The catches are big though: vocab explodes (English alone has millions of distinct word-forms once you count plurals, inflections, compound words, names), and any unseen word is "out-of-vocabulary" — a token your model has no idea what to do with.
Subword. A hybrid: frequent words stay whole, rare words split into reusable pieces. The word "unhappiness" might become "un" + "happi" + "ness." Vocab stays small (50K–200K), sequences stay short, and any unseen word can be assembled from familiar pieces. No OOV. This is the answer every modern LLM uses.


Three levels: characters (tiny vocab, very long sequences), words (huge vocab, OOV problems), subwords (the modern compromise).

Why subword wins is best seen with the failure modes of the other two. Pure character-level can represent anything, but the n² attention cost makes long-context inference impractical. Pure word-level can't represent anything outside its vocab, and the moment you train on web text you have an OOV explosion. Subword is "characters when you need them, words when you have them." Modern tokenizers are all variants of this idea.
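Both failure modes show up even on a toy example. The sketch below is only an illustration: the training text, the test sentence, and the helper names char_tokenize and word_tokenize are made up here, and the word-level tokenizer simply splits on whitespace.

```python
# Toy comparison of character-level and word-level tokenization.
# The texts and function names are invented for illustration.

def char_tokenize(text):
    """Every character (spaces included) is its own token."""
    return list(text)

def word_tokenize(text, vocab):
    """Whitespace-split words; anything outside the vocab becomes <unk>."""
    return [w if w in vocab else "<unk>" for w in text.split()]

train_text = "the cat sat on the mat"
word_vocab = set(train_text.split())

sentence = "the cat sat on the unhappiness"
char_tokens = char_tokenize(sentence)
word_tokens = word_tokenize(sentence, word_vocab)

print(len(char_tokens), len(word_tokens))  # 30 6
print(word_tokens)  # ['the', 'cat', 'sat', 'on', 'the', '<unk>']
```

Character-level needs five times as many tokens for the same sentence, while word-level loses "unhappiness" entirely because it never appeared in the training text.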

One subtlety. The "right size" of a subword token depends on the language. For English a typical token is roughly half a word (3–4 characters on average). For Chinese or Japanese it's typically a single character, or even a single UTF-8 byte of one. For code it's often a keyword, a run of whitespace, or part of an identifier: "def" + " " + "function". Tokenizers trained on multilingual or multi-format corpora have to balance these, and when they don't balance well, you see surprising token bills (Chinese text in early GPT-3 could be 3× the token count of equivalent English).
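One reason non-English text can get expensive is visible without any tokenizer at all. A byte-level tokenizer that has learned few merges for a script falls back toward raw UTF-8 bytes, and many scripts need several bytes per character. A small sketch (the sample strings and the helper name are chosen here for illustration):

```python
# UTF-8 bytes per character for a few sample strings.
# With no learned merges for a script, a byte-level tokenizer would emit
# roughly one token per byte, so this is a worst-case token count.

samples = {
    "English": "hello",
    "Chinese": "你好",
    "Japanese": "こんにちは",
}

def utf8_bytes_per_char(text):
    return len(text.encode("utf-8")) / len(text)

for name, text in samples.items():
    print(name, len(text), len(text.encode("utf-8")), utf8_bytes_per_char(text))
# English 5 5 1.0
# Chinese 2 6 3.0
# Japanese 5 15 3.0
```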

In a Transformer: the tokenizer is the first and last thing the model touches. Input text → token IDs → embeddings → attention layers → ... → output logits → next token ID → decode back to text. Every cost the Transformer pays — VRAM, compute, latency, API price — scales with how many tokens the tokenizer produced. The tokenizer is, in this sense, an architectural decision dressed up as a preprocessing step.

BPE — Byte Pair Encoding

The greedy merge algorithm GPT uses, originally invented for data compression in 1994.

Byte Pair Encoding is the dominant subword algorithm in 2026, used by every model in the GPT family from GPT-2 through GPT-4o, plus most LLaMA-family models. The algorithm is surprisingly small.

Training:
  1. Start with vocab = unique characters in your corpus.
  2. Split every word into characters.
  3. Repeat target_vocab_size − len(vocab) times:
     a. Count every adjacent pair of tokens across the corpus.
     b. Find the most frequent pair (a, b).
     c. Merge into a new token "ab". Add it to vocab.
     d. Apply this merge throughout the corpus.

Tokenizing (using the trained merges):
  1. Split the input into characters.
  2. Apply the merges in order learned, greedily.
  3. Return the resulting token sequence.

That's the entire algorithm. Run it long enough on a big corpus and you get a usable LLM tokenizer.
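The pseudocode above translates almost line for line into Python. What follows is a minimal character-level sketch, not a production tokenizer: it assumes whitespace-split words, breaks ties by taking the pair seen first, and the function names (get_pair_counts, merge_pair, train_bpe, tokenize) are chosen here for illustration.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent token pairs across the corpus, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in a symbol sequence with the merged token."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(corpus, num_merges):
    """Return the list of merges, in the order they were learned."""
    # Split every word into characters, keeping a count per distinct word.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break  # every word is already a single token
        # Most frequent pair; on a tie, max() keeps the pair seen first.
        best = max(counts, key=counts.get)
        merges.append(best)
        # Apply the merge throughout the corpus.
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges

def tokenize(word, merges):
    """Split into characters, then apply the merges in the order learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

merges = train_bpe("low low lower lowest", num_merges=3)
print(merges)                     # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
print(tokenize("lowly", merges))  # ['low', 'l', 'y']
print(tokenize("lower", merges))  # ['lowe', 'r']
```

Note that "lowly" never appeared in the training text, yet it still tokenizes: the learned piece "low" is reused and the rest falls back to characters.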


Byte Pair Encoding starts from characters and greedily merges the most frequent adjacent pair, again and again, until the vocab is the size you want.

A few subtleties hidden inside the "byte" in "byte pair encoding":

Byte-level start. GPT-2 / GPT-3 / GPT-4 use byte-level BPE — they start from the 256 raw bytes, not Unicode characters. This means the tokenizer can handle any text in any language without a separate "unknown character" code, because every byte sequence is valid input.
Pre-tokenization. Before running BPE, modern tokenizers split text on whitespace and punctuation, so BPE never merges across word boundaries. A regex in the OpenAI tokenizer handles this: it groups consecutive letters, then digits, then whitespace, etc.
Special tokens. Real vocabularies include "non-text" tokens: <|endoftext|>, chat-format markers like <|im_start|>, padding tokens. These never come from BPE merges; they're added by hand and reserve anywhere from a single ID to a few hundred, usually at the top of the vocab.
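Determinism. Once trained, the tokenizer is a deterministic function from text to token IDs. Training pipelines may add randomness (for example by subsampling the corpus), but given a fixed corpus and tie-breaking rule the merge procedure is deterministic too, and inference always is.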
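The first two subtleties can be sketched with the standard library. The pattern below is a simplified, ASCII-only stand-in written for this example; it is an assumption, not the regex OpenAI actually ships, which uses Unicode character classes. The helper names pretokenize and to_byte_ids are also made up here.

```python
import re

# Simplified stand-in for a GPT-style pre-tokenization pattern (an assumption,
# not the real one): an optional leading space followed by letters, or digits,
# or other non-space symbols; otherwise a run of whitespace.
PRETOKENIZE = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")

def pretokenize(text):
    """Split text into chunks; BPE merges would then run inside each chunk only."""
    return PRETOKENIZE.findall(text)

def to_byte_ids(chunk):
    """Byte-level start: every chunk becomes a sequence of values in 0..255."""
    return list(chunk.encode("utf-8"))

chunks = pretokenize("Hello world, 42!")
print(chunks)                 # ['Hello', ' world', ',', ' 42', '!']
print(to_byte_ids(" world"))  # [32, 119, 111, 114, 108, 100]
print(to_byte_ids("é"))       # [195, 169]
```

The chunks concatenate back to the original string, so nothing is lost, and a character outside ASCII such as "é" simply becomes two byte tokens rather than an "unknown" symbol.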

Why BPE won, in the historical sense, is a story about pragmatism. Earlier subword approaches existed (Morfessor, etc.) but BPE was simple enough to implement in 100 lines of code, fast enough to train on a corpus the size of the web, and produced tokenizers that worked. Sennrich et al. (2016) repurposed it from data compression to neural machine translation, and within two years it was the default. GPT-2 in 2019 cemented byte-level BPE as the standard for autoregressive LLMs.

In a Transformer: the BPE merges file shipped with every modern LLM is just a list of pairs in the order they were learned. The tokenizer reads them once at startup, builds a trie or hash table for fast lookup, and applies them greedily to every prompt. tiktoken (OpenAI's tokenizer) implements this loop in Rust so that it runs at many megabytes per second. The model never sees text, only the integer IDs that come out the other side.
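The "hash table" part can be sketched in a few lines. The idea, under the assumptions of this example, is to turn the ordered merge list into a pair-to-rank dictionary and then repeatedly merge whichever adjacent pair has the lowest rank. The merge list and the names build_ranks and encode_chunk are hypothetical; real implementations add caching and operate on bytes.

```python
def build_ranks(merges):
    """Map each pair to its position in the merge list (lower = learned earlier)."""
    return {pair: rank for rank, pair in enumerate(merges)}

def encode_chunk(chunk, ranks):
    """Greedily merge the earliest-learned adjacent pair until none applies."""
    symbols = list(chunk)
    while len(symbols) > 1:
        pairs = list(zip(symbols, symbols[1:]))
        # Pick the adjacent pair with the lowest rank; unknown pairs rank last.
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # nothing left to merge
        i = pairs.index(best)
        symbols[i:i + 2] = [best[0] + best[1]]
    return symbols

# A made-up merge list, in the order it might have been learned.
merges = [("l", "o"), ("lo", "w"), ("e", "r")]
ranks = build_ranks(merges)
print(encode_chunk("lower", ranks))  # ['low', 'er']
print(encode_chunk("xyz", ranks))    # ['x', 'y', 'z']
```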

WordPiece, SentencePiece, Unigram

Three more names in subword tokenization: two alternative ways of learning the vocab, and one framework that changes the preprocessing.

BPE isn't the only subword recipe. Three other approaches sit alongside it, sometimes ahead of it for specific use cases. They all produce qualitatively similar tokenizers — fixed vocab, no OOV, frequent words whole — but disagree on the details of how the vocab is learned and how text is segmented.

WordPiece (Schuster & Nakajima, 2012). Originally from Google's Japanese and Korean voice search system, later adopted by BERT. Like BPE but uses a likelihood criterion instead of raw frequency: at each step it merges the pair that maximizes the probability of the training corpus under a unigram language model. In practice the resulting vocab looks similar, but WordPiece tends to keep slightly longer subwords. Continuation pieces get a ## prefix, which is why BERT's output is full of things like ##ing and ##tion.
Unigram (Kudo, 2018). A probabilistic alternative. Start with a large initial vocab (often via BPE), then iteratively prune the least-useful tokens via EM optimization. The "best" segmentation of a word becomes the one that maximizes total probability under the unigram model. Unique feature: multiple valid tokenizations per word. Useful for data augmentation ("subword regularization") during training.
SentencePiece (Kudo & Richardson, 2018). Not a new merging algorithm but a different preprocessing: treat the input as a raw stream of Unicode characters, spaces included. The space becomes part of a token (rendered as ▁, U+2581). Result: language-agnostic, no separate word-splitting step, and it works the same way on languages without spaces (Chinese, Japanese) and on code. SentencePiece is a framework that can run BPE or Unigram inside; LLaMA 1 and 2 use SentencePiece + BPE.
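The Unigram idea of a "best" segmentation is easy to make concrete. Given a probability for every token in the vocab, a small dynamic program (Viterbi) finds the split that maximizes the product of token probabilities. The probabilities below are invented for the example, and viterbi_segment is a name chosen here; a real Unigram tokenizer learns the probabilities with EM.

```python
import math

def viterbi_segment(word, log_probs):
    """Best segmentation of `word` under a unigram model (max sum of log-probs)."""
    n = len(word)
    best_score = [0.0] + [-math.inf] * n  # best_score[i]: best log-prob of word[:i]
    back = [0] * (n + 1)                  # back[i]: where the last token of word[:i] starts
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs:
                score = best_score[start] + log_probs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    back[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("word cannot be segmented with this vocab")
    # Walk the back-pointers from the end of the word to recover the tokens.
    tokens, i = [], n
    while i > 0:
        tokens.append(word[back[i]:i])
        i = back[i]
    return tokens[::-1], best_score[n]

# Hypothetical probabilities: three subwords plus single-character fallbacks.
probs = {"un": 0.1, "happi": 0.02, "ness": 0.1}
probs.update({ch: 0.01 for ch in "unhapies"})
log_probs = {tok: math.log(p) for tok, p in probs.items()}

tokens, score = viterbi_segment("unhappiness", log_probs)
print(tokens)  # ['un', 'happi', 'ness']
```

Sampling from the lower-scoring segmentations instead of always taking the best one is what subword regularization does during training.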


Different recipes pick different merges, and they mark boundaries differently. "##" means "continues previous word"; "▁" means "starts a new word."

The "##" vs "▁" convention is one of those things that feels arbitrary until you trip on it:

BERT's WordPiece assumes input is already split into words before tokenization. ## on a piece means "this attaches to the previous word; don't put a space in front." Useful for languages that space-separate words; awkward for ones that don't.
SentencePiece assumes nothing about spaces. A piece starts with ▁ if it followed a space in the original, otherwise it doesn't. Detokenization just concatenates the pieces and turns each ▁ back into a space. Symmetric, language-neutral, no special "is this a continuation?" logic anywhere.
GPT's BPE is intermediate: spaces are part of the tokens (a leading space usually belongs to the token that follows it, so " the" is one token), but there's no continuation marker.
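The three conventions are easiest to compare from the detokenization side. The sketch below is a simplification with hand-written pieces: the splits are hypothetical rather than output from the real tokenizers, the function names are made up, and stripping the leading space in the SentencePiece case assumes a dummy space was prepended at encode time.

```python
def detok_wordpiece(pieces):
    """'##' marks a continuation: glue it on; otherwise start a new word."""
    text = ""
    for p in pieces:
        if p.startswith("##"):
            text += p[2:]
        else:
            text += (" " if text else "") + p
    return text

def detok_sentencepiece(pieces):
    """Concatenate, then turn each U+2581 marker back into a space."""
    return "".join(pieces).replace("\u2581", " ").lstrip(" ")

def detok_gpt(pieces):
    """Spaces already live inside the tokens, so plain concatenation is enough."""
    return "".join(pieces)

print(detok_wordpiece(["un", "##happi", "##ness", "is", "real"]))
print(detok_sentencepiece(["▁un", "happi", "ness", "▁is", "▁real"]))
print(detok_gpt(["un", "happiness", " is", " real"]))
# All three print: unhappiness is real
```

The WordPiece version has to guess where spaces go, so it cannot restore double spaces or the absence of a space before punctuation; the other two carry the whitespace in the tokens themselves.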

For most practical purposes you don't pick a tokenizer; you inherit whichever one the base model uses. The choices are mostly historical: GPT decided on byte-level BPE, BERT decided on WordPiece, T5 and most multilingual models decided on SentencePiece. The differences matter when you mix tokenizers (fine-tuning across model families) or when you train a new tokenizer for a non-standard domain.

In a Transformer: the tokenizer is a separate artifact shipped alongside the model weights. A typical model release is (weights, tokenizer, config), and the tokenizer is tiny next to the weights (typically a few hundred KB to a few MB). Using the wrong tokenizer with a set of weights is one of the more silent failures in deep learning: nothing crashes, and the model produces confident garbage.
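Why the failure is silent is clear from a toy case: token IDs are just indices, and any vocabulary of the right size will happily decode them. The two vocabularies and the encode and decode helpers below are invented for illustration.

```python
# Two made-up vocabularies of the same size. IDs valid in one are also valid
# in the other, so mixing them up raises no error.
vocab_a = ["<pad>", "the", " cat", " sat"]
vocab_b = ["<pad>", " dog", "the", " ran"]

def encode(tokens, vocab):
    return [vocab.index(t) for t in tokens]

def decode(ids, vocab):
    return "".join(vocab[i] for i in ids)

ids = encode(["the", " cat", " sat"], vocab_a)
print(ids)                   # [1, 2, 3]
print(decode(ids, vocab_a))  # the cat sat
print(decode(ids, vocab_b))  #  dogthe ran
```

The same thing happens on the input side: the model's embedding rows were trained for one ID assignment, so IDs from another tokenizer select embeddings for unrelated tokens.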

Modern Vocab Sizes — 50K to 200K

Why "the right vocab size" sits in this band and what it costs to move within it.

For most of the LLM era (2019–2024) "vocabulary size" was a knob nobody touched — 50K-ish was the default, set by GPT-2's influential release, and that's what most subsequent models inherited. The frontier has moved up in the last two years, and the band is now roughly 50K to 200K. Here's why.

Concrete numbers for popular models:

GPT-2 (2019): 50,257 tokens. Byte-level BPE on English-heavy web text. Became the default for ~5 years.
LLaMA 1 / 2 (2023): 32,000 tokens. Smaller than GPT for efficiency, SentencePiece BPE. Optimized for English with limited multilingual support.
GPT-4 / cl100k (2023): 100,256 tokens. Roughly doubled GPT-2's vocab, with much better coverage of code, non-English, and modern technical terms.
LLaMA 3 (2024): 128,256 tokens. Big jump from LLaMA 2, specifically to improve multilingual and code coverage.
GPT-4o (2024): ~200,000 tokens. Pushed the frontier further; notable for compressing non-English text (OpenAI's announcement reported about 1.4× fewer tokens than GPT-4 for Chinese and Japanese, and over 4× fewer for some Indic languages such as Gujarati).
Claude 3 / Gemini (2024): not publicly disclosed but rumored ~100K–200K, in the same ballpark.


Bigger vocab → fewer tokens per English sentence (cheaper inference), but bigger embedding tables and a longer tail of rarely-seen tokens.

The trade-off behind these numbers is the central design question:

Bigger vocab → fewer tokens per text. Each token covers more characters on average. A vocab of 100K can fit "unhappiness" as a single token; a vocab of 30K probably splits it. Fewer tokens means cheaper inference (less attention compute, smaller KV cache, lower API price for users).
Bigger vocab → bigger embedding tables. An embedding table is vocab_size × d_model. For d_model = 4096, going from 50K to 200K vocab grows the table from about 200M to about 800M parameters, a real cost for both training and storage. Some models share weights between the input embedding and the output projection ("tied weights") so that this cost is paid once rather than twice; many large models keep the two separate.
Bigger vocab → longer tail of rare tokens. A 200K vocab will have thousands of tokens that occur very rarely in training. Their embeddings are poorly learned; misuse can hurt quality on uncommon inputs (the "glitch token" phenomenon — search for SolidGoldMagikarp).
Bigger vocab → better non-English / code coverage. The biggest single argument for the move to 200K. A Chinese paragraph that takes 1,000 tokens in GPT-3 might take 250 tokens in GPT-4o, because the bigger vocab contains many Chinese-specific multi-character tokens. Same for code: a 100K+ vocab can keep common identifiers like printf, def, console.log as single tokens.
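The embedding-table arithmetic in the list above is worth checking by hand. The helper below is a throwaway sketch (embedding_params is a name chosen here); it counts the input table once when weights are tied and twice when the output projection is a separate matrix.

```python
def embedding_params(vocab_size, d_model, tied=True):
    """Parameters in the input embedding plus output projection."""
    table = vocab_size * d_model
    return table if tied else 2 * table

d_model = 4096
for vocab_size in (50_000, 200_000):
    tied = embedding_params(vocab_size, d_model)
    untied = embedding_params(vocab_size, d_model, tied=False)
    print(f"{vocab_size:>7,} vocab: {tied:>13,} tied, {untied:>13,} untied")
#  50,000 vocab:   204,800,000 tied,   409,600,000 untied
# 200,000 vocab:   819,200,000 tied, 1,638,400,000 untied
```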

Practical advice for picking a vocab size, if you ever have to: target the smallest vocab that produces an acceptable token-per-character ratio for the languages and formats you care about. For English-only research, 30K–50K still works. For multilingual or code-heavy use, 100K+ is the floor in 2026. Don't go below without a strong reason; the embedding table is rarely the bottleneck and the sequence-length savings compound across every token of every prompt.
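Measuring that ratio takes only a few lines once you have a tokenizer to test. The sketch below accepts any callable that maps a string to a list of tokens; tokens_per_char is a name made up here, and the two stand-in "tokenizers" (list and str.split) are only there to make the example runnable without external libraries.

```python
def tokens_per_char(texts, tokenize):
    """Average number of tokens emitted per character over a sample of texts."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_chars = sum(len(t) for t in texts)
    return total_tokens / total_chars

sample = ["the cat sat"]
print(tokens_per_char(sample, list))       # 1.0 (character-level)
print(tokens_per_char(sample, str.split))  # 3 tokens / 11 chars, about 0.27
```

In practice you would pass in each candidate tokenizer's encode function and run it over a held-out sample of every language and format you care about, then compare the ratios side by side.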

In a Transformer: the vocab size is the parameter that bookends the entire model. It sets the number of rows in the input embedding table and the width of the output classification head; everything in between is independent of it. LLaMA 2 7B (32,000-token vocab) and LLaMA 3 8B (128,256-token vocab) both use d_model = 4096, so the extra ~96K vocabulary entries add about 394M parameters to the input embedding and the same again to the untied output head: roughly 790M in total, and a large share of the jump from 7B to 8B. The tokenizer's choice shows up at both ends of the model.

That's the entire prerequisite stack — math, data, optimization, hardware, software, embeddings, tokenization. Every piece has been laid down. The next primer finally assembles them: the Transformer.