Character-Level vs BPE Tokenization

Two different approaches to turning text into tokens. Both are in the Karpathy corpus, and the contrast between them is the cleanest way to see why tokenization matters.

Character-level: the toy

In ng-video-lecture, makemore, and the Tiny Shakespeare config of nanoGPT, tokenization is character-level. Every unique character in the corpus becomes a token. For Tiny Shakespeare, that's 65 characters: lowercase letters, uppercase letters, punctuation, newlines, whitespace.

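# text: the full training corpus as a single Python string
chars = sorted(set(text))                           # every unique character, in a stable order
vocab_size = len(chars)                             # 65 for Tiny Shakespeare
stoi = {ch: i for i, ch in enumerate(chars)}        # character -> integer id
itos = {i: ch for i, ch in enumerate(chars)}        # integer id -> character
encode = lambda s: [stoi[c] for c in s]             # string -> list of ids
decode = lambda ids: "".join(itos[i] for i in ids)  # list of ids -> string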

For makemore on baby names, vocab is 27 (26 letters + 1 special end-of-name token). The names dataset has 32,000 entries; with character tokens, that's about 200,000 training tokens. Trivial to train.
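A minimal sketch of where the 27 comes from and how the token count adds up. It assumes a "." marker for end-of-name at index 0 and uses a hypothetical five-name sample instead of the real dataset; `encode_name` is an illustrative helper, not a function from the makemore repo.

```python
# Hypothetical sample; the real dataset has about 32,000 names.
names = ["emma", "olivia", "ava", "isabella", "sophia"]

END = "."  # assumed end-of-name marker
letters = [chr(c) for c in range(ord("a"), ord("z") + 1)]
stoi = {END: 0, **{ch: i + 1 for i, ch in enumerate(letters)}}
itos = {i: ch for ch, i in stoi.items()}
vocab_size = len(stoi)  # 26 letters + 1 special token

def encode_name(name):
    """Map a name to token ids, appending the end-of-name token."""
    return [stoi[ch] for ch in name] + [stoi[END]]

# Each name contributes len(name) + 1 training tokens.
total_tokens = sum(len(encode_name(n)) for n in names)
print(vocab_size, total_tokens)
```

If names average around six characters, each one costs about seven tokens, which is how 32,000 names ends up in the neighborhood of 200,000 tokens.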

Properties

Small vocab. 27–65 tokens. The token embedding table is tiny.
Long sequences. Average word is ~5 characters plus a space, so a 256-character context covers only ~40–50 words.
No priors over word structure. The model has to learn that "the" and "The" share meaning, that consonant-vowel patterns matter, etc., from scratch.
Predictable failure modes. Output looks plausible at a glance but is full of well-formed nonsense words.

After a few minutes of training on Tiny Shakespeare, the character-level model from lecture 7 produces output with the rhythm of Shakespeare, but some of the words are invented and the sentences mean little:

ANGELO:
And cowards it be strawn to my bed,
And thrust the gates of my threats,
Because he that ale away, and hang'd
An one with him.

Character-level output on Tiny Shakespeare — rhythm without meaning.

The model has learned surface regularities: lines end with newlines, lines are short, speaker names are uppercase, the vocabulary is Elizabethan-flavored, and punctuation follows recognizable patterns. It hasn't really learned what words mean, because at the character level each prediction is too local.

BPE: the serious version

GPT-2's tokenizer is byte-level BPE. Briefly (more detail in the tokenization page):

Start with 256 byte tokens. Any Unicode text can be encoded as UTF-8 bytes, so no input is ever out-of-vocabulary.
Greedily merge the most frequent adjacent pair into a new token; repeat 50,000 times.
End up with 50,257 tokens: 256 bytes + 50,000 merges + 1 special token, <|endoftext|>.
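The merge loop is short enough to sketch directly. This is a toy version under stated assumptions: it works on raw UTF-8 bytes only and skips the regex pre-splitting step that GPT-2's real tokenizer applies before merging. The function names (`pair_counts`, `merge`, `train_bpe`) are illustrative.

```python
from collections import Counter

def pair_counts(ids):
    """Count adjacent token-id pairs."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))  # start from the 256 byte tokens
    merges = {}                       # (left_id, right_id) -> new_id
    for step in range(num_merges):
        counts = pair_counts(ids)
        if not counts:
            break                     # fewer than two tokens left
        pair = counts.most_common(1)[0][0]  # most frequent adjacent pair
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges

# 11 bytes in, 5 tokens out after three merges.
ids, merges = train_bpe("aaabdaaabac", num_merges=3)
print(ids, merges)
```

Run the same loop for 50,000 merges on a large corpus, add one special token, and the vocab size is 256 + 50,000 + 1 = 50,257.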

Properties

Bigger vocab. 50k tokens. Embedding table for GPT-2 (124M) is (50257, 768) = ~38M params, about 30% of the model.
Shorter sequences. Average English token is 3–4 characters, so a 1024-token context = ~4KB of text.
Some word structure baked in. Common words and morphemes get single tokens. "preprocessing" might be [" pre", "processing"] instead of 13 character tokens.
Foot-guns, documented on the tokenization page: spelling, arithmetic, non-English text, trailing whitespace, SolidGoldMagikarp.
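The embedding-table share can be checked by hand. A rough sketch, assuming the usual GPT-2 (124M) shape (12 layers, width 768, 1024 positions, biases on every linear layer) and counting the token embedding once because it is tied to the output head:

```python
# Assumed GPT-2 (124M) hyperparameters.
vocab_size, n_embd, n_layer, block_size = 50257, 768, 12, 1024

token_emb = vocab_size * n_embd   # token embedding table (tied with output head)
pos_emb = block_size * n_embd     # learned position embeddings

layer_norm = 2 * n_embd           # scale + bias
# Attention: fused QKV projection plus output projection, each with bias.
attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
# MLP: expand to 4x width, then project back, each with bias.
mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
per_block = 2 * layer_norm + attn + mlp

# Blocks plus one final LayerNorm.
total = token_emb + pos_emb + n_layer * per_block + layer_norm
share = token_emb / total

print(f"token embedding: {token_emb:,} params")
print(f"total:           {total:,} params")
print(f"share:           {share:.1%}")
```

At character-level vocab sizes the same table would be 27 or 65 rows instead of 50,257, which is why it barely registers in the toy models.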

Side by side

|  | Character-level | Byte-level BPE (GPT-2) |
|---|---|---|
| Vocab size | 27–65 | 50,257 |
| Sequence length | ~5 chars / word | ~3–4 chars / token |
| Embedding table | tiny | ~38M params (~30% of 124M) |
| Word-structure prior | none; learned from scratch | common words / morphemes as single tokens |
| Failure modes | plausible-looking gibberish | spelling, arithmetic, non-English, trailing whitespace, SolidGoldMagikarp |

When character-level is the right choice

For TinyStories, the Llama 2 reproduction (llama2.c) uses a tokenizer trained on the narrow story corpus with a vocab of 4,096, which in vocab size sits between pure character-level and GPT-2's 50k BPE. The point: when your domain is narrow, you don't need 50k tokens. A small, domain-specific tokenizer lets your context window go further and reduces embedding table size.

| Regime | Vocab size | Tokenizer |
|---|---|---|
| Tiny / educational | 27–65 | character-level |
| Narrow domain | 4,096 | TinyStories tokenizer |
| Serious LLM | 50,257 | GPT-2 byte-level BPE |

For toy/educational work on small corpora, character-level is fine and removes the tokenizer as a source of confusion. You see this in:

micrograd — no tokenization at all, just numbers.
makemore — 27-character vocab on baby names.
ng-video-lecture — 65-character vocab on Tiny Shakespeare.

For serious LLMs, BPE (GPT-2's byte-level variant, or the BPE mode of the SentencePiece library) is the obvious default. The pre-training corpus is so large and diverse that compression matters, and the cost in foot-guns is the price of admission.

A pragmatic middle ground

For a narrow but non-trivial domain (medical text, source code, a specific language), train your own BPE/SentencePiece tokenizer on the domain corpus rather than using GPT-2's general tokenizer. llama2.c/doc/train_llama_tokenizer.md walks through doing this for TinyStories. The result: a smaller vocab tuned to your domain, with much better compression than a general tokenizer, and fewer surprising splits.
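The compression claim is easy to see on a toy scale. The sketch below trains two tiny byte-level BPE tokenizers, one on a stand-in "domain" corpus and one on a stand-in "general" corpus, then encodes the same in-domain phrase with each. Both corpora are hypothetical, the trainer is a bare-bones illustration, and in practice you would use a library such as SentencePiece instead of hand-rolled code.

```python
from collections import Counter

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train(corpus, num_merges):
    """Return an ordered list of (pair, new_id) merges learned from `corpus`."""
    ids, merges = list(corpus.encode("utf-8")), []
    for step in range(num_merges):
        counts = Counter(zip(ids, ids[1:]))
        if not counts:
            break  # corpus has collapsed to a single token
        pair = counts.most_common(1)[0][0]
        ids = merge(ids, pair, 256 + step)
        merges.append((pair, 256 + step))
    return merges

def encode(text, merges):
    """Apply learned merges, in training order, to the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges:
        ids = merge(ids, pair, new_id)
    return ids

# Hypothetical toy corpora standing in for a domain corpus and a general one.
domain_corpus = "myocardial infarction " * 20
general_corpus = "the cat sat on the mat " * 20

domain_tok = train(domain_corpus, num_merges=30)
general_tok = train(general_corpus, num_merges=30)

sample = "myocardial infarction"
n_bytes = len(sample.encode("utf-8"))
n_domain = len(encode(sample, domain_tok))
n_general = len(encode(sample, general_tok))
print(f"bytes={n_bytes} domain={n_domain} general={n_general}")
```

The general tokenizer's merges barely apply to the medical phrase, so it stays close to one token per byte, while the domain tokenizer covers it in a handful of tokens. Real corpora are far less extreme, but the direction of the effect is the same.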

Related

tokenization — the full story
repos/ng-video-lecture — character-level on Tiny Shakespeare
repos/makemore — character-level on baby names
repos/build-nanogpt — GPT-2 BPE on FineWeb
