Tokenization

Tokenization is the layer that converts strings to integers (and back) before they hit the neural network. It is, in Karpathy's words from lecture 9, his "least favorite part of working with LLMs" — the source of most of the weird, hairy edge cases in production language models. But you can't skip it: every "the model can't spell, can't reverse strings, can't do arithmetic" complaint traces back to how text was broken into tokens.

Character-level: the simplest possible thing

In ng-video-lecture, tokenization is trivial:

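# text holds the whole training corpus as one string
chars = sorted(list(set(text)))   # every unique character, in a stable order
vocab_size = len(chars)
stoi = { ch:i for i,ch in enumerate(chars) }   # character -> integer
itos = { i:ch for i,ch in enumerate(chars) }   # integer -> character
encode = lambda s: [stoi[c] for c in s]   # string -> list of integers
decode = lambda l: ''.join([itos[i] for i in l])   # list of integers -> string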

That's the whole tokenizer. Vocab is 65 (the unique characters in Tiny Shakespeare), and the model has to learn everything about word structure from scratch. This is fine pedagogically and fine for small toy corpora, but it has two problems at scale:

  1. Sequence length explodes. Every character is a token, so context windows fill up fast.
  2. No prior over word structure. The model has to relearn that "the" and "The" and " the " are all variants of the same word.
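Both problems are easy to see on a toy corpus. The sketch below rebuilds the same character-level tokenizer on a single sentence (a stand-in for Tiny Shakespeare, chosen only for illustration) and checks that the token count equals the character count and that "T" and "t" are unrelated integers. It also shows a side effect of building the vocab from the corpus: a character that never appeared cannot be encoded.

```python
# Toy corpus standing in for Tiny Shakespeare (an assumption for this sketch).
text = "To be, or not to be: that is the question."

chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

ids = encode("to be")
roundtrip = decode(ids)

# Problem 1: one token per character, so the sequence is as long as the string.
tokens_per_char = len(encode(text)) / len(text)

# Problem 2: case variants share nothing; 'T' and 't' are just different integers.
upper_id, lower_id = stoi["T"], stoi["t"]

# Characters missing from the corpus have no id and cannot be encoded.
try:
    encode("Zebra")
    unseen_ok = True
except KeyError:
    unseen_ok = False
```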

Byte-level BPE (what GPT-2 uses)

GPT-2's tokenizer is Byte Pair Encoding over raw UTF-8 bytes. The algorithm:

Step 1 (Seed): Start with 256 tokens, one per byte value. Because any Unicode text can be written as UTF-8 bytes, this base vocabulary encodes every string losslessly.
Step 2 (Count): Count every adjacent pair of tokens in the training corpus and pick the most frequent one.
Step 3 (Merge): Replace that pair everywhere with a new token, then go back to Step 2. GPT-2 repeats this 50,000 times.
Step 4 (Vocab): The result is a vocabulary of 50,257 tokens: 256 bytes + 50,000 merges + 1 special <|endoftext|> token.
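The steps above can be sketched in a few lines of Python. This is a minimal illustration, not GPT-2's actual training code: the function names (`pair_counts`, `merge`, `train_bpe`) and the toy corpus are made up here, it runs 5 merges instead of 50,000, and it skips the regex pre-splitting that GPT-2 applies before counting pairs.

```python
from collections import Counter

def pair_counts(ids):
    """Count every adjacent pair of token ids."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace each occurrence of `pair` with `new_id`, scanning left to right."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    ids = list(text.encode("utf-8"))         # Step 1: raw bytes, ids 0..255
    merges = {}                              # (left, right) -> new token id
    for step in range(num_merges):
        counts = pair_counts(ids)            # Step 2: count adjacent pairs
        if not counts:
            break                            # nothing left to merge
        pair = max(counts, key=counts.get)   # most frequent; ties go to first seen
        new_id = 256 + step
        ids = merge(ids, pair, new_id)       # Step 3: merge into a new token
        merges[pair] = new_id
    return merges, ids

corpus = "the cat and the hat and the bat"   # toy corpus, for illustration only
merges, ids = train_bpe(corpus, num_merges=5)
vocab_size = 256 + len(merges)               # Step 4: bytes + merges
```

Note that later merges can build on earlier ones: once "t" + "h" has its own id, the pair ("th", "e") becomes a candidate like any other.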

The result: common substrings like the, ing, of become single tokens, while rare strings fall back to shorter pieces, down to single bytes. Average token length in English is ~4 characters, so a 1024-token GPT-2 context window holds roughly 4KB of text.
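To make "common substrings become single tokens" concrete, here is a sketch of encoding with an already-learned merge table. The four merges below are hypothetical (a real table is learned from a corpus and is far larger), and `encode`/`decode` are illustrative helpers rather than any library's API. Encoding starts from bytes and repeatedly applies whichever applicable merge was learned earliest.

```python
def merge(ids, pair, new_id):
    """Replace each occurrence of `pair` with `new_id`, scanning left to right."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Hypothetical merge table, listed in the order the merges were learned.
merges = {
    (ord("t"), ord("h")): 256,   # "th"
    (256, ord("e")): 257,        # "the"
    (ord("i"), ord("n")): 258,   # "in"
    (258, ord("g")): 259,        # "ing"
}

# Byte string for every token id: raw bytes first, then each merge in order.
vocab = {i: bytes([i]) for i in range(256)}
for (left, right), new_id in merges.items():
    vocab[new_id] = vocab[left] + vocab[right]

def encode(text):
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        # Among the pairs present, pick the one that was learned earliest.
        pairs = set(zip(ids, ids[1:]))
        pair = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break                # no applicable merge remains
        ids = merge(ids, pair, merges[pair])
    return ids

def decode(ids):
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

common = encode("the thing")     # 9 characters collapse to 4 tokens
rare = encode("qzx")             # no merges apply: one token per byte
```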

build-nanogpt uses the GPT-2 tokenizer directly via tiktoken:

import tiktoken

enc = tiktoken.get_encoding("gpt2")
tokens = enc.encode("Hello, I'm a language model,")

llm.c ships a decode-only version of the same tokenizer in C, loading the token table from a .bin file.

Why tokenization is the source of every weird bug

In lecture 9, Karpathy lists the things tokenization breaks:

Spelling
"How many letters in 'lollipop'?" The model sees a handful of multi-character tokens (possibly just one), not eight separate characters.
String reversal
Reversing a string means reversing its characters, but the model only sees token IDs, and reversing the order of tokens is not the same thing.
Non-English
The merge table was learned on a corpus that's mostly English, so other languages tokenize less efficiently — more tokens per word, less effective context.
Arithmetic
"127 + 677" might tokenize as ['12', '7', ' +', ' 67', '7'] or similar, and a different pair of numbers can split in a different pattern. The model has to learn arithmetic separately for each split pattern.
Code
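Python's whitespace is meaningful, and GPT-2's tokenizer spends a separate token on each space of indentation, so indented code burns through the context window. Later tokenizers merge runs of spaces into single tokens.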
Trailing whitespace
"hello", " hello", and "hello " are encoded as different token sequences, so a prompt that ends in a space can look unfamiliar to the model.
SolidGoldMagikarp
Tokens that appear in the tokenizer training corpus but rarely in the LM training corpus end up with untrained embeddings, and the model behaves bizarrely when they appear.
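Two of the items above, string reversal and non-English text, can be illustrated with the standard library alone. In the sketch below the three-way split of "lollipop" is hypothetical (a real tokenizer may split it differently), and the language samples are arbitrary short words. The bytes-per-character figure is the sequence length a byte-level tokenizer starts from before any merges apply, which is why scripts with multi-byte characters start at a disadvantage.

```python
# Hypothetical split of "lollipop" into tokens; a real tokenizer may differ.
tokens = ["l", "oll", "ipop"]
assert "".join(tokens) == "lollipop"

char_reversed = "lollipop"[::-1]             # what the user asked for
token_reversed = "".join(reversed(tokens))   # what reordering tokens produces

# UTF-8 bytes per character for a few arbitrary sample words.
samples = {"English": "hello", "Greek": "αλφα", "Japanese": "こんにちは"}
bytes_per_char = {
    name: len(s.encode("utf-8")) / len(s) for name, s in samples.items()
}
```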

The Llama tokenizer

Llama 2 uses SentencePiece BPE with a 32,000-token vocab, trained on a more multilingual corpus. The format is different (a SentencePiece .model file), but conceptually it's the same thing: learned subword merges. llama2.c/tokenizer.py loads the SentencePiece model and exports it to a compact binary that the C inference engine consumes.
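As a rough picture of what "export to a compact binary" can mean, here is a sketch that writes a vocabulary as length-prefixed records and reads it back. The layout (a uint32 maximum token length, then a float32 score, a uint32 byte length, and the raw bytes for each token) is an assumption made for this example and should not be read as the exact format llama2.c uses. The function names and the three sample tokens are likewise made up.

```python
import io
import struct

def export_vocab(tokens, scores, fileobj):
    """Write max token length, then (score, byte length, bytes) per token."""
    encoded = [t.encode("utf-8") for t in tokens]
    fileobj.write(struct.pack("<I", max(len(b) for b in encoded)))
    for b, score in zip(encoded, scores):
        fileobj.write(struct.pack("<fI", score, len(b)))
        fileobj.write(b)

def load_vocab(fileobj, vocab_size):
    """Read back what export_vocab wrote; vocab_size must be known in advance."""
    (max_len,) = struct.unpack("<I", fileobj.read(4))
    tokens, scores = [], []
    for _ in range(vocab_size):
        score, n = struct.unpack("<fI", fileobj.read(8))
        tokens.append(fileobj.read(n).decode("utf-8"))
        scores.append(score)
    return max_len, tokens, scores

# Round trip through an in-memory buffer instead of a real file.
buf = io.BytesIO()
export_vocab(["▁the", "ing", "a"], [-1.0, -2.0, -3.5], buf)
buf.seek(0)
max_len, tokens, scores = load_vocab(buf, vocab_size=3)
```

A flat layout like this is convenient for a C reader, which can fill its arrays in one pass over the file without a parsing library.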

GPT-2 BPE

Byte Pair Encoding over raw UTF-8 bytes.

vocab: 50,257
base unit: UTF-8 byte
format: tiktoken / .bin

Llama 2 SentencePiece

SentencePiece BPE, trained on a more multilingual corpus.

vocab: 32,000
base unit: Unicode character, with byte fallback
format: SentencePiece .model

Training your own

llama2.c/doc/train_llama_tokenizer.md walks through training a custom SentencePiece tokenizer for a narrow domain (TinyStories), which gets you a much smaller vocab (4096 tokens) and tighter compression on that domain. This is the right move for narrow-domain models — and the TinyStories paper shows you can train surprisingly capable language models with very small vocabs if the domain is narrow enough.
