Tokenization
Tokenization is the layer that converts strings to integers (and back) before they hit the neural network. It is, in Karpathy's words from lecture 9, his "least favorite part of working with LLMs" — the source of most of the weird, hairy edge cases in production language models. But you can't skip it: every "the model can't spell, can't reverse strings, can't do arithmetic" complaint traces back to how text was broken into tokens.
Character-level: the simplest possible thing
In ng-video-lecture, tokenization is trivial:
chars = sorted(list(set(text)))
vocab_size = len(chars)
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])
That's the whole tokenizer. Vocab is 65 (the unique characters in Tiny Shakespeare), and the model has to learn everything about word structure from scratch. This is fine pedagogically and fine for small toy corpora, but it has two problems at scale:
- Sequence length explodes. Every character is a token, so context windows fill up fast.
- No prior over word structure. The model has to relearn that "the" and "The" and " the " are all variants of the same word.
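Both problems can be seen in a few lines. This is a minimal sketch of the same character-level scheme; the one-sentence `text` is a stand-in assumption for the Tiny Shakespeare corpus, and `encode`/`decode` are written as plain functions for readability.

```python
# Stand-in corpus (assumption); the original code reads Tiny Shakespeare instead.
text = "To be, or not to be: that is the question."

chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    """Map each character to its integer id."""
    return [stoi[c] for c in s]

def decode(ids):
    """Map integer ids back to a string."""
    return "".join(itos[i] for i in ids)

sample = "to be"
ids = encode(sample)

# One token per character, and decoding restores the input exactly.
assert len(ids) == len(sample)
assert decode(ids) == sample

# Case and surrounding spaces change the id sequence, so the model has to
# learn on its own that these are the same word.
print(encode("the"), encode("The"), encode(" the "))
```

Note that `encode` raises a `KeyError` for any character not present in the corpus, which is another limitation of building the vocabulary this way.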
Byte-level BPE (what GPT-2 uses)
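GPT-2's tokenizer is Byte Pair Encoding over raw UTF-8 bytes. The algorithm:
- Start with a base vocabulary of the 256 possible byte values.
- Count every adjacent pair of tokens in the training corpus and merge the most frequent pair into a new token.
- Repeat until 50,000 merges have been learned.
- Add the special <|endoftext|> token, for a final vocabulary of 50,257.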
The result: common substrings like the, ing, of become single tokens, while rare strings stay broken into bytes. Average token length in English is ~4 characters, so a 1024-token GPT-2 context window holds ~4KB of text.
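The merge procedure behind byte-level BPE can be sketched in plain Python. This is an illustrative toy, not GPT-2's actual training code: the function names (`get_pair_counts`, `merge`, `train_bpe`, `decode`) and the tiny corpus are assumptions, and the regex pre-splitting that GPT-2 applies before merging is left out.

```python
from collections import Counter

def get_pair_counts(ids):
    """Count occurrences of each adjacent pair of token ids."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges, starting from the 256 raw byte values."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (left_id, right_id) -> new_id, in the order learned
    for step in range(num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break  # fewer than two tokens left, nothing to merge
        pair = max(counts, key=counts.get)  # most frequent adjacent pair
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges

def decode(ids, merges):
    """Expand merged ids back to bytes, then to text."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")

# Hypothetical toy corpus, chosen so that a few pairs repeat.
corpus = "the theme of the thesis is the other thing"
ids, merges = train_bpe(corpus, num_merges=5)
print(len(corpus.encode("utf-8")), "bytes ->", len(ids), "tokens")
assert decode(ids, merges) == corpus
```

Each merge shortens the sequence wherever the chosen pair occurs, which is where the compression comes from; bytes that never take part in a frequent pair stay as single-byte tokens.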
build-nanogpt uses the GPT-2 tokenizer directly via tiktoken:
import tiktoken
enc = tiktoken.get_encoding("gpt2")
tokens = enc.encode("Hello, I'm a language model,")
llm.c ships the same tokenizer in C, loading the merge table from a .bin file.
Why tokenization is the source of every weird bug
In the lecture 9 transcript, Karpathy lists the things tokenization breaks:
- Spelling: "How many letters in 'lollipop'?" The model sees one or a few opaque tokens, not eight characters.
- String reversal: tokens don't reverse character-by-character.
- Non-English: the merge table was learned on a corpus that's mostly English, so other languages tokenize less efficiently: more tokens per word, less effective context.
- Arithmetic: "127 + 677" tokenizes as [' 12', '7', ' +', ' 67', '7'] or similar, with different splits for different numbers. The model has to learn arithmetic per token split.
- Code: Python's whitespace is meaningful; if your tokenizer eats indentation greedily, you get weird behaviors.
- Trailing whitespace: "hello", "hello " and " hello" tokenize completely differently.
- SolidGoldMagikarp: tokens that appear in the tokenizer training corpus but rarely in the LM training corpus end up with untrained embeddings, and the model behaves bizarrely when they appear.
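The spelling, reversal, and whitespace items can be made concrete with a toy tokenizer. Everything here is an assumption for illustration: the vocabulary is invented and is not GPT-2's, and greedy longest-match segmentation is a simplification of how BPE applies its learned merges.

```python
# Hypothetical toy vocabulary; NOT the GPT-2 vocabulary.
VOCAB = [" hello", "hello", "lol", "lip", "op", " ", "h", "e", "l", "o", "i", "p"]

def tokenize(s):
    """Greedy longest-match segmentation over VOCAB.

    Assumes every character of `s` is covered by VOCAB; otherwise max()
    raises ValueError on the empty candidate set.
    """
    tokens = []
    i = 0
    while i < len(s):
        match = max((t for t in VOCAB if s.startswith(t, i)), key=len)
        tokens.append(match)
        i += len(match)
    return tokens

# Spelling: eight characters arrive as three opaque tokens.
word_tokens = tokenize("lollipop")
print(word_tokens)  # ['lol', 'lip', 'op']

# Reversal: reversing the token sequence is not reversing the string.
print("".join(reversed(word_tokens)))  # opliplol
print("lollipop"[::-1])                # popillol

# Whitespace: a leading or trailing space changes the token sequence.
print(tokenize("hello"))   # ['hello']
print(tokenize(" hello"))  # [' hello']
print(tokenize("hello "))  # ['hello', ' ']
```

The model only ever sees the ids of these tokens, so any task defined over characters has to be learned indirectly, one token split at a time.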
The Llama tokenizer
Llama 2 uses SentencePiece BPE with a 32,000-token vocab, trained on a more multilingual corpus. The format is different (a SentencePiece .model file), but conceptually it's the same thing: learned subword merges. llama2.c/tokenizer.py loads the SentencePiece model and exports it to a compact binary that the C inference engine consumes.
- GPT-2 BPE: Byte Pair Encoding over raw UTF-8 bytes.
- Llama 2 SentencePiece: SentencePiece BPE, more multilingual corpus.
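To give a feel for what "exporting to a compact binary" can look like, here is a sketch of a length-prefixed vocabulary file written and read back with `struct`. The field layout (a max-token-length header, then score, length, and raw bytes per token), the function names, and the three-entry vocabulary are assumptions for illustration and may not match the file that llama2.c actually reads.

```python
import io
import struct

# Hypothetical (token_bytes, score) pairs standing in for a SentencePiece vocab.
# b"\xe2\x96\x81" is U+2581, the marker SentencePiece uses for a leading space.
toy_vocab = [(b"<unk>", 0.0), (b"\xe2\x96\x81the", -1.5), (b"ing", -2.25)]

def export_vocab(vocab, f):
    """Write a max-length header, then (score, length, bytes) for each token."""
    max_len = max(len(tok) for tok, _ in vocab)
    f.write(struct.pack("<I", max_len))
    for tok, score in vocab:
        f.write(struct.pack("<fI", score, len(tok)))
        f.write(tok)

def load_vocab(f, vocab_size):
    """Read back what export_vocab wrote; vocab_size is supplied by the caller."""
    (max_len,) = struct.unpack("<I", f.read(4))
    vocab = []
    for _ in range(vocab_size):
        score, length = struct.unpack("<fI", f.read(8))
        vocab.append((f.read(length), score))
    return max_len, vocab

# Round trip through an in-memory buffer instead of a file on disk.
buf = io.BytesIO()
export_vocab(toy_vocab, buf)
buf.seek(0)
max_len, loaded = load_vocab(buf, len(toy_vocab))
assert loaded == toy_vocab
```

A flat layout like this is easy to parse from C with a handful of `fread` calls, which is the appeal of converting away from the SentencePiece .model format for inference.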
Training your own
llama2.c/doc/train_llama_tokenizer.md walks through training a custom SentencePiece tokenizer for a narrow domain (TinyStories), which gets you a much smaller vocab (4096 tokens) and tighter compression on that domain. This is the right move for narrow-domain models — and the TinyStories paper shows you can train surprisingly capable language models with very small vocabs if the domain is narrow enough.
Related
- zero-to-hero-arc — full lecture 9 walkthrough
- character-vs-bpe — comparison in more depth
- repos/llama2-c — different tokenizer, same idea