
Character-Level vs BPE Tokenization

Two different approaches to turning text into tokens. Both are in the Karpathy corpus, and the contrast between them is the cleanest way to see why tokenization matters.

Character-level: the toy

In ng-video-lecture, makemore, and the Tiny Shakespeare config of nanoGPT, tokenization is character-level. Every unique character in the corpus becomes a token. For Tiny Shakespeare, that's 65 characters: lowercase letters, uppercase letters, punctuation, newlines, whitespace.

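# text: the full training corpus as a single Python string
chars = sorted(set(text))                       # unique characters, in a stable order
vocab_size = len(chars)                         # 65 for Tiny Shakespeare
stoi = { ch:i for i,ch in enumerate(chars) }    # character -> integer id
itos = { i:ch for i,ch in enumerate(chars) }    # integer id -> character
encode = lambda s: [stoi[c] for c in s]         # string -> list of ids
decode = lambda l: ''.join(itos[i] for i in l)  # list of ids -> string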

For makemore on baby names, the vocabulary is 27 tokens (26 letters plus one special end-of-name token). The names dataset has about 32,000 entries; with character tokens, that is roughly 200,000 training tokens, which is trivial to train on.
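
The token count above can be checked with a small sketch. The five names below are a hypothetical stand-in for the real dataset, and using "." as the end-of-name token is an assumption made for illustration:

```python
# Hypothetical stand-in for the names dataset (the real one has ~32,000 entries).
names = ["emma", "olivia", "ava", "isabella", "sophia"]

END = "."  # assumed special end-of-name token
chars = [END] + sorted(set("".join(names)))   # id 0 is reserved for the special token
stoi = {ch: i for i, ch in enumerate(chars)}

def encode_name(name):
    """Encode one name as character ids followed by the end token."""
    return [stoi[c] for c in name] + [stoi[END]]

# Each name contributes one token per letter plus one end token.
n_tokens = sum(len(encode_name(n)) for n in names)
print(len(chars), n_tokens)
```

On the full dataset all 26 letters appear, which gives 26 + 1 = 27 tokens. Assuming roughly 6 letters per name, 32,000 names times (6 letters + 1 end token) lands on the order of 200,000 training tokens.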

Properties

The character-level model from lecture 7, trained on Tiny Shakespeare for a few minutes, produces output that has the rhythm of Shakespeare but consists mostly of invented words:

ANGELO:
And cowards it be strawn to my bed,
And thrust the gates of my threats,
Because he that ale away, and hang'd
An one with him.

Character-level output on Tiny Shakespeare: rhythm without meaning.

The model has learned surface structure: lines end with newlines, lines are short, speaker names are uppercase, the vocabulary is archaic-flavored, and punctuation follows recognizable patterns. It hasn't really learned what words mean, because at the character level each "decision" is too local: the model predicts one character at a time.

BPE: the serious version

GPT-2's tokenizer is byte-level BPE. Briefly (more detail on the tokenization page):

  1. Start with 256 byte tokens. Because UTF-8 encodes every Unicode code point as a sequence of bytes, any text can be represented.
  2. Greedily merge the most frequent adjacent pair into a new token, and repeat 50,000 times.
  3. End up with 50,257 tokens: 256 bytes + 50,000 merges + <|endoftext|>.
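
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not GPT-2's actual tokenizer: it skips the regex pre-splitting GPT-2 applies before merging, and the function name `train_bpe` and the toy input string are made up for the example.

```python
from collections import Counter

def train_bpe(text, num_merges):
    """Learn up to `num_merges` byte-pair merges from `text`."""
    ids = list(text.encode("utf-8"))   # step 1: start from raw bytes (ids 0..255)
    merges = {}                        # (left_id, right_id) -> new token id
    for step in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break                      # fewer than two tokens left, nothing to merge
        pair = max(pairs, key=pairs.get)   # step 2: most frequent adjacent pair
        new_id = 256 + step
        merges[pair] = new_id
        # Replace every non-overlapping occurrence of `pair` with `new_id`.
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return merges, ids

merges, ids = train_bpe("low lower lowest low low", num_merges=3)
vocab_size = 256 + len(merges) + 1     # step 3: bytes + merges + <|endoftext|>
print(merges, len(ids), vocab_size)
```

With 50,000 merges instead of 3, the same count gives 256 + 50,000 + 1 = 50,257.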

Side by side

|                      | Character-level                     | Byte-level BPE (GPT-2)                                                    |
|----------------------|-------------------------------------|---------------------------------------------------------------------------|
| Vocab size           | 27–65                               | 50,257                                                                    |
| Sequence length      | 1 char / token (~5 tokens per word) | ~3–4 chars / token                                                        |
| Embedding table      | tiny                                | ~38M params (~30% of 124M)                                                |
| Word-structure prior | none; learned from scratch          | common words / morphemes as single tokens                                 |
| Failure modes        | plausible-looking gibberish         | spelling, arithmetic, non-English, trailing whitespace, SolidGoldMagikarp |
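
The embedding-table row is simple arithmetic and can be reproduced directly. The embedding width of 768 is not stated above; it is assumed here as the width of the 124M GPT-2 configuration, and the total is rounded to 124M as in the table.

```python
vocab_size = 50_257        # GPT-2 vocabulary size (from the table)
n_embd = 768               # assumed: embedding width of the 124M GPT-2 configuration
total_params = 124e6       # "124M", rounded as in the table

embedding_params = vocab_size * n_embd
share = embedding_params / total_params

print(f"{embedding_params:,} embedding parameters")   # 38,597,376
print(f"{share:.0%} of the model")                     # 31%

# Character-level comparison at the same width: 65 * 768 = 49,920 parameters.
char_params = 65 * n_embd
print(f"{char_params:,} parameters for a 65-character vocabulary")
```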

When character-level is the right choice

For TinyStories, the Llama 2 reproduction (llama2.c) uses a tokenizer trained on the narrow story corpus with a vocabulary of 4,096 tokens, somewhere between pure character-level and full BPE. The point: when your domain is narrow, you don't need 50k tokens. A small, domain-specific tokenizer lets your context window cover more text and shrinks the embedding table.

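| Use case           | Vocab size | Tokenizer            |
|--------------------|------------|----------------------|
| Tiny / educational | 27–65      | character-level      |
| Narrow domain      | 4,096      | TinyStories tokenizer |
| Serious LLM        | 50,257     | GPT-2 byte-level BPE |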

For toy/educational work on small corpora, character-level is fine and removes the tokenizer as a source of confusion. You see this in ng-video-lecture, makemore, and the Tiny Shakespeare config of nanoGPT.

For serious LLMs, BPE (implemented directly or through a toolkit such as SentencePiece) is the obvious default. The pre-training corpus is so large and diverse that compression matters, and the cost in foot-guns is the price of admission.

A pragmatic middle ground

For a narrow but non-trivial domain (medical text, source code, a specific language), train your own BPE/SentencePiece tokenizer on the domain corpus rather than using GPT-2's general tokenizer. llama2.c/doc/train_llama_tokenizer.md walks through doing this for TinyStories. The result: a smaller vocab tuned to your domain, with much better compression than a general tokenizer, and fewer surprising splits.
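
The compression claim can be illustrated with a toy comparison. The sketch below is not the llama2.c training procedure; it is a minimal pure-Python byte-level BPE, and the two corpora, the sample sentence, and the merge count of 40 are all hypothetical values chosen for the example. It trains one tokenizer on "domain" text and one on unrelated "general" text, then encodes the same domain sentence with each.

```python
from collections import Counter

def merge(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train(corpus, num_merges):
    """Learn an ordered list of byte-pair merges from `corpus`."""
    ids, merges = list(corpus.encode("utf-8")), []
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair = max(pairs, key=pairs.get)          # most frequent adjacent pair
        ids = merge(ids, pair, 256 + len(merges))
        merges.append(pair)
    return merges

def encode(text, merges):
    """Apply the learned merges, in training order, to new text."""
    ids = list(text.encode("utf-8"))
    for k, pair in enumerate(merges):
        ids = merge(ids, pair, 256 + k)
    return ids

def decode(ids, merges):
    """Expand token ids back into bytes, then into a string."""
    vocab = {i: bytes([i]) for i in range(256)}
    for k, (a, b) in enumerate(merges):
        vocab[256 + k] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8")

# Hypothetical toy corpora; real ones would be far larger.
domain_corpus = "patient presents with hypertension. patient denies chest pain. " * 20
general_corpus = "the quick brown fox jumps over the lazy dog near the river bank. " * 20
sample = "patient presents with chest pain."

domain = train(domain_corpus, num_merges=40)
general = train(general_corpus, num_merges=40)

n_bytes = len(sample.encode("utf-8"))
for name, merges in [("domain", domain), ("general", general)]:
    ids = encode(sample, merges)
    print(f"{name}: {len(ids)} tokens, {n_bytes / len(ids):.2f} bytes/token")
```

The printed bytes-per-token figure is the compression measure to compare. On real data the same comparison would be run on a held-out sample of the domain corpus rather than a single sentence.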
