Character-Level vs BPE Tokenization
Two different approaches to turning text into tokens. Both are in the Karpathy corpus, and the contrast between them is the cleanest way to see why tokenization matters.
Character-level: the toy
In ng-video-lecture, makemore, and the Tiny Shakespeare config of nanoGPT, tokenization is character-level. Every unique character in the corpus becomes a token. For Tiny Shakespeare, that's 65 characters: lowercase letters, uppercase letters, punctuation, newlines, whitespace.
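# text: the full training corpus as one string (e.g. Tiny Shakespeare)
chars = sorted(set(text))  # every unique character, in a stable order
vocab_size = len(chars)
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer id
itos = {i: ch for i, ch in enumerate(chars)}  # integer id -> character
encode = lambda s: [stoi[c] for c in s]  # string -> list of ids
decode = lambda l: "".join(itos[i] for i in l)  # list of ids -> string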
For makemore on baby names, vocab is 27 (26 letters + 1 special end-of-name token). The names dataset has 32,000 entries; with character tokens, that's about 200,000 training tokens. Trivial to train.
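As a rough illustration of where the 27 comes from, here is a minimal sketch of a names-style vocabulary. The four names are a hypothetical stand-in for the real dataset, and using "." at index 0 as the end-of-name marker is an assumption made for this example.

```python
import string

# Hypothetical stand-in for the names dataset (the real one has ~32,000 entries).
names = ["emma", "olivia", "ava", "isabella"]

END = "."  # assumed end-of-name marker, placed at index 0
itos = {0: END, **{i + 1: ch for i, ch in enumerate(string.ascii_lowercase)}}
stoi = {ch: i for i, ch in itos.items()}
vocab_size = len(stoi)  # 26 letters + 1 special token

def encode_name(name):
    """Map a name to token ids and append the end-of-name token."""
    return [stoi[ch] for ch in name] + [stoi[END]]

encoded = [encode_name(n) for n in names]
total_tokens = sum(len(ids) for ids in encoded)

print(vocab_size)          # 27
print(encode_name("ava"))  # [1, 22, 1, 0]
print(total_tokens)        # 25 for these four names
```

Every name costs its length plus one token, which is why tens of thousands of short names add up to only a couple hundred thousand training tokens.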
Properties
- Small vocab. 27–65 tokens. The token embedding table is tiny.
- Long sequences. Average word is ~5 characters, so a 256-character context = ~50 words.
- No priors over word structure. The model has to learn that "the" and "The" share meaning, that consonant-vowel patterns matter, etc., from scratch.
- Predictable failure modes. Generates plausible-looking gibberish that often forms valid-looking but nonsense words.
The character-level model in lecture 7, trained on Tiny Shakespeare for a few minutes, produces output with the rhythm of Shakespeare but mostly invented words:
ANGELO:
And cowards it be strawn to my bed,
And thrust the gates of my threats,
Because he that ale away, and hang'd
An one with him.
The model has learned that lines end with newlines, lines are short, speaker names are uppercase, the vocabulary is archaic-flavored, and punctuation follows recognizable patterns. It hasn't really learned what words mean, because at the character level each "decision" is too local.
BPE: the serious version
GPT-2's tokenizer is byte-level BPE. Briefly (more detail in the tokenization page):
- Start with 256 byte tokens (UTF-8 covers every Unicode codepoint).
- Greedily merge the most frequent adjacent pair, repeat 50,000 times.
- End up with 50,257 tokens: bytes + merges + <|endoftext|>.
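The merge loop in the steps above can be sketched in a few lines. This is a minimal illustration rather than GPT-2's actual tokenizer code: the names `train_bpe` and `merge` are made up for this example, ties between equally frequent pairs go to whichever pair appears first, and the regex pre-splitting that GPT-2 applies before merging is left out.

```python
from collections import Counter

def merge(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn up to `num_merges` byte-pair merges from `text`."""
    ids = list(text.encode("utf-8"))             # start from raw bytes: ids 0..255
    vocab = {i: bytes([i]) for i in range(256)}  # token id -> byte string
    merges = {}                                  # (left, right) -> new token id
    for step in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))       # count adjacent pairs
        if not pairs:
            break                                # nothing left to merge
        pair = max(pairs, key=pairs.get)         # most frequent pair (first one on ties)
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
        vocab[new_id] = vocab[pair[0]] + vocab[pair[1]]
    return merges, vocab

merges, vocab = train_bpe("aaabdaaabac", num_merges=3)
print([vocab[i] for i in merges.values()])  # [b'aa', b'aaa', b'aaab']
```

With 50,000 merges on top of the 256 byte tokens, plus one special token, the same loop gives the 50,257 count.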
Properties
- Bigger vocab. 50k tokens. Embedding table for GPT-2 (124M) is (50257, 768) = ~38M params, about 30% of the model.
- Shorter sequences. Average English token is 3–4 characters, so a 1024-token context = ~4KB of text.
- Some word structure baked in. Common words and morphemes get single tokens. "preprocessing" might be [" pre", "processing"] instead of 13 character tokens.
- Foot-guns documented in tokenization. Spelling, arithmetic, non-English, trailing whitespace, SolidGoldMagikarp.
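The size and coverage figures quoted for both tokenizers are simple arithmetic. The sketch below reproduces them, taking the nominal 124M as the total parameter count and the upper end of the 3–4 characters-per-token range; both are simplifying assumptions.

```python
# GPT-2 (124M) token embedding table
vocab_size, n_embd = 50257, 768
total_params = 124_000_000  # nominal model size, not an exact count

embedding_params = vocab_size * n_embd
share = embedding_params / total_params
print(f"{embedding_params:,} params, {share:.0%} of the model")  # 38,597,376 params, 31% of the model

# How much text fits in one context window
words_in_char_context = 256 // 5   # 256 characters at ~5 characters per word
chars_in_bpe_context = 1024 * 4    # 1024 tokens at ~4 characters per token
print(words_in_char_context)       # 51
print(chars_in_bpe_context)        # 4096, i.e. ~4KB of ASCII text
```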
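Side by side
- Vocab size: 27–65 tokens (character-level) vs 50,257 (GPT-2 BPE).
- Embedding table: tiny vs (50257, 768), about 30% of GPT-2 (124M).
- Sequence length: a 256-character context holds ~50 words vs a 1024-token context holding ~4KB of text.
- Word structure: learned from scratch vs partly baked in through single-token words and morphemes.
- Failure modes: plausible-looking nonsense words vs tokenizer foot-guns such as SolidGoldMagikarp.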
When character-level is the right choice
For TinyStories, the Llama 2 reproduction (llama2.c) uses a tokenizer trained on the narrow story corpus with a vocab of 4096, a size somewhere between pure character-level and GPT-2's full BPE vocab. The point: when your domain is narrow, you don't need 50k tokens. A small, domain-specific tokenizer still compresses in-domain text well, so your context window goes further, and the embedding table shrinks with the vocab.
For toy/educational work on small corpora, character-level is fine and removes the tokenizer as a source of confusion. You see this in:
- micrograd — no tokenization at all, just numbers.
- makemore — 27-character vocab on baby names.
- ng-video-lecture — 65-character vocab on Tiny Shakespeare.
For serious LLMs, BPE (byte-level as in GPT-2, or trained with SentencePiece) is the obvious default. The pre-training corpus is so large and diverse that compression matters, and the cost in foot-guns is the price of admission.
A pragmatic middle ground
For a narrow but non-trivial domain (medical text, source code, a specific language), train your own BPE/SentencePiece tokenizer on the domain corpus rather than using GPT-2's general tokenizer. llama2.c/doc/train_llama_tokenizer.md walks through doing this for TinyStories. The result: a smaller vocab tuned to your domain, with much better compression than a general tokenizer, and fewer surprising splits.
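To make the compression claim concrete, here is a toy sketch that trains two tiny byte-level BPE tokenizers, one on a "domain" corpus and one on a "general" corpus, and encodes the same in-domain sample with each. Everything in it is hypothetical: the corpora, the 20-merge budget, and the helper names `train` and `encode`. It is a pure-Python stand-in for illustration, not the SentencePiece workflow that llama2.c documents.

```python
from collections import Counter

def merge(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train(text, num_merges):
    """Learn an ordered list of byte-pair merges from `text`."""
    ids, merges = list(text.encode("utf-8")), []
    for step in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair = max(pairs, key=pairs.get)  # most frequent adjacent pair
        ids = merge(ids, pair, 256 + step)
        merges.append(pair)
    return merges

def encode(text, merges):
    """Apply learned merges to new text, in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for step, pair in enumerate(merges):
        ids = merge(ids, pair, 256 + step)
    return ids

# Hypothetical toy corpora: source code as the "domain", prose as "general".
domain_corpus = "def f(x):\n    return x + 1\n" * 20
general_corpus = "the quick brown fox jumps over the lazy dog. " * 20
sample = "def g(y):\n    return y + 1\n"

domain_ids = encode(sample, train(domain_corpus, 20))
general_ids = encode(sample, train(general_corpus, 20))
raw = len(sample.encode("utf-8"))
print(f"raw bytes: {raw}, general: {len(general_ids)}, domain: {len(domain_ids)}")
```

The domain-trained merges shorten the sample while the prose-trained merges barely apply to it. Repetitive toy corpora exaggerate the effect, so treat the gap as directional rather than as a measurement.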
Related
- tokenization — the full story
- repos/ng-video-lecture — character-level on Tiny Shakespeare
- repos/makemore — character-level on baby names
- repos/build-nanogpt — GPT-2 BPE on FineWeb