makemore
One Python file that takes a text file of "things" and generates more things like them, character by character. The substrate for five lectures of Zero-to-Hero.
After micrograd shows you autograd in scalars, makemore teaches you to use PyTorch's tensor-based autograd to build progressively bigger language models — bigram, MLP, RNN, LSTM, GRU, transformer — all in a single hackable Python file.
What it is
makemore is one Python file that takes a text file of
"things" (e.g. one name per line) and generates more things like them,
character by character. The current default is a 200K-parameter
transformer, but the file contains implementations of several
architectures, each selectable via a command-line flag.
$ python makemore.py -i names.txt -o names
The included dataset is 32,000 baby names from the SSA. Sample output:
dontell, khylum, camatena, aeriline, najlah,
sherrith, ryel, irmi, taislee, mortaz...
Plausibly name-shaped, not actually real names. The transformer learns the texture of names.
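To make "character by character" concrete, here is a minimal sketch of how a list of names could be turned into integer sequences and next-character training pairs. The helper names (`build_vocab`, `encode`, `decode`, `training_pairs`) and the choice to reserve index 0 as a combined start/stop token are assumptions for illustration, not a transcription of makemore.py.

```python
def build_vocab(words):
    """Map each distinct character to an integer, keeping 0 free for start/stop."""
    chars = sorted(set("".join(words)))
    stoi = {ch: i + 1 for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos


def encode(word, stoi):
    return [stoi[ch] for ch in word]


def decode(ids, itos):
    return "".join(itos[i] for i in ids)


def training_pairs(word, stoi):
    """Return (context_so_far, next_token) pairs; 0 marks both start and end."""
    ids = [0] + encode(word, stoi) + [0]
    return [(ids[:t], ids[t]) for t in range(1, len(ids))]


words = ["emma", "ava"]
stoi, itos = build_vocab(words)
pairs = training_pairs("ava", stoi)
for context, target in pairs:
    print(context, "->", target)
```

Every model in the file then amounts to a different way of mapping the context to a distribution over the next token.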
What's in makemore.py
A single ~700-line file with these architectures, all sharing the same input/output API:
- Bigram: (prev_char) → next_char. One lookup table of size (vocab_size, vocab_size). Roughly equivalent to a bigram (2-gram) counts model.
- MLP: concatenate block_size character embeddings, pass them through a hidden layer, predict the next char.
- Transformer: NewGELU, LayerNorm, causal multi-head attention, etc.
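The lookup-table model described above can be sketched without PyTorch as a plain table of counts. This is an illustrative version only: the function names and the use of "." as a start/stop marker are assumptions, and makemore.py itself stores the table as a learned tensor rather than as counts.

```python
import random
from collections import defaultdict


def fit_bigram(words):
    """Count how often each character follows each other character."""
    counts = defaultdict(lambda: defaultdict(int))
    for w in words:
        chars = ["."] + list(w) + ["."]      # "." is a stand-in start/stop marker
        for prev, nxt in zip(chars, chars[1:]):
            counts[prev][nxt] += 1
    return counts


def next_char_probs(counts, prev):
    """Normalize one row of the table into a probability distribution."""
    row = counts[prev]
    total = sum(row.values())
    return {ch: c / total for ch, c in row.items()}


def sample_word(counts, rng, max_len=20):
    """Walk the table from the start marker until the stop marker is drawn."""
    out, prev = [], "."
    for _ in range(max_len):
        probs = next_char_probs(counts, prev)
        chars = list(probs)
        prev = rng.choices(chars, weights=[probs[c] for c in chars])[0]
        if prev == ".":
            break
        out.append(prev)
    return "".join(out)


words = ["emma", "olivia", "ava", "mia"]
counts = fit_bigram(words)
probs_after_start = next_char_probs(counts, ".")
print(sample_word(counts, random.Random(0)))
```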
You choose architecture via --type. The code is structured
so the architectures are interchangeable behind a common
ModelConfig and a common training loop.
The pedagogical arc
makemore is the substrate for lectures 2-6 of Zero-to-Hero, which trace a progression:
- Lec 2: Bigram. Lookup table of next-character counts. Then the same thing implemented as a neural net with a single linear layer trained with cross-entropy loss. Once trained, the predictions are nearly identical to the count table, but now the model is neural.
- Lec 3: MLP. Concatenate embeddings of the last 3 chars, pass them through an MLP. Discusses the train/val split, learning rate finding, and mini-batches.
- Lec 4: Activations and gradients. The deep MLP, BatchNorm, what makes deep networks trainable. The lecture that everyone should watch.
- Lec 5: Manually deriving every backward pass. "Becoming a Backprop Ninja." Hard, optional, character-building.
- Lec 6: WaveNet-style hierarchical model. Convolutions, dilated structure, treating sequences as a tree.
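One way to see why the Lec 2 neural bigram can match the count table: a one-hot input multiplied by a weight matrix simply selects one row of logits, and if that row equals the log of the counts, softmax returns exactly the normalized counts. The sketch below uses made-up counts and plain Python; it shows the optimum the training is heading toward, not the training itself. It assumes all counts are nonzero (a real table would need smoothing before taking logs).

```python
import math


def softmax(logits):
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    return -math.log(probs[target])


# Hypothetical counts of which character follows one fixed previous character.
counts = [3, 1, 4, 2]
count_probs = [c / sum(counts) for c in counts]

# The row of weights selected by the one-hot input, set to log-counts.
logit_row = [math.log(c) for c in counts]
net_probs = softmax(logit_row)

# Average negative log-likelihood over the observed transitions.
nll = sum(c * cross_entropy(net_probs, i) for i, c in enumerate(counts)) / sum(counts)

# For comparison: an untrained, uniform row does worse on the same data.
uniform = softmax([0.0] * len(counts))
nll_uniform = sum(c * cross_entropy(uniform, i) for i, c in enumerate(counts)) / sum(counts)
print(nll, nll_uniform)
```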
What's interesting in the code
The whole file is one giant educational artifact, but a few highlights:
NewGELU
The same tanh-based GELU approximation that GPT-2 uses, defined as a tiny
nn.Module:
    class NewGELU(nn.Module):
        def forward(self, x):
            return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))
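To get a feel for how close the tanh form is to GELU defined through the Gaussian CDF, the two can be compared on scalars with only the standard library. The function names here are illustrative; `gelu_tanh` is a scalar rewrite of the formula in NewGELU above.

```python
import math


def gelu_exact(x):
    """GELU via the Gaussian CDF: x * Phi(x)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Scalar version of the tanh approximation used in NewGELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Scan a grid on [-5, 5] and report the largest disagreement.
xs = [i / 10 for i in range(-50, 51)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(f"largest gap on [-5, 5]: {max_gap:.2e}")
```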
CausalBoW
Bag-of-words as a matmul. This is the cleanest possible introduction to "communication between tokens as a learned mixing matrix" — see attention.md.
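The "bag-of-words as a matmul" idea can be sketched in a few lines of PyTorch: build a lower-triangular weight matrix whose rows average over the current and earlier positions, then multiply. This is a toy, unbatched version with made-up sizes, written for illustration rather than copied from the CausalBoW class.

```python
import torch

T, C = 4, 2                                   # sequence length, channels (toy sizes)
x = torch.arange(T * C, dtype=torch.float32).view(T, C)

# Lower-triangular mask: position t may only look at positions <= t.
mask = torch.tril(torch.ones(T, T))

# Start from all-zero scores and hide the future with -inf. Softmax then
# gives uniform weight 1/(t+1) to each visible position in row t.
scores = torch.zeros(T, T).masked_fill(mask == 0, float("-inf"))
weights = torch.softmax(scores, dim=-1)

# One matmul computes the running average of every prefix at once.
bow = weights @ x

# Reference: the same result with an explicit loop over positions.
loop = torch.stack([x[: t + 1].mean(dim=0) for t in range(T)])
print(torch.allclose(bow, loop))
```

Replacing the all-zero scores with scores computed from the data turns the fixed averaging weights into learned, input-dependent ones, which is the step attention takes.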
The ModelConfig pattern
All architectures share one config dataclass:
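    from dataclasses import dataclass

    @dataclass
    class ModelConfig:
        block_size: int = None  # context length, in characters
        vocab_size: int = None  # number of distinct tokens
        # the sizes below are interpreted slightly differently by each architecture
        n_layer: int = 4
        n_embd: int = 64
        n_embd2: int = 64
        n_head: int = 4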
Same fields drive a Transformer, an RNN, or an MLP. This is the abstraction that makes swap-the-architecture pedagogy possible.
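A rough sketch of how one config plus a `--type` flag can select among architectures. The builder functions and the `BUILDERS` registry are hypothetical stand-ins that only report which config fields they would read; the real file constructs actual model classes, and the mapping from fields to architectures shown here is an assumption.

```python
import argparse
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelConfig:
    block_size: Optional[int] = None
    vocab_size: Optional[int] = None
    n_layer: int = 4
    n_embd: int = 64
    n_embd2: int = 64
    n_head: int = 4


# Stand-in "models": each just returns the config fields it would consume.
def build_bigram(cfg):
    return ("bigram", cfg.vocab_size)


def build_mlp(cfg):
    return ("mlp", cfg.block_size, cfg.n_embd, cfg.n_embd2)


def build_transformer(cfg):
    return ("transformer", cfg.block_size, cfg.n_layer, cfg.n_head, cfg.n_embd)


BUILDERS = {"bigram": build_bigram, "mlp": build_mlp, "transformer": build_transformer}


def make_model(model_type, cfg):
    if model_type not in BUILDERS:
        raise ValueError(f"unknown model type: {model_type}")
    return BUILDERS[model_type](cfg)


parser = argparse.ArgumentParser()
parser.add_argument("--type", default="transformer", choices=sorted(BUILDERS))
args = parser.parse_args(["--type", "mlp"])   # explicit argv so the sketch runs anywhere

cfg = ModelConfig(block_size=16, vocab_size=27)
model = make_model(args.type, cfg)
print(model)
```

Because every builder takes the same config object, the training loop does not need to know which architecture it received.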
Pedagogical notes
this is one hackable file, and is mostly intended for educational purposes. PyTorch is the only requirement.
— the README's tone
This is the right size for a teaching artifact — small enough that you can hold the whole thing in your head, big enough to actually learn from.
Related
- zero-to-hero-arc: lectures 2-6
- character-vs-bpe: makemore is character-level
- gelu-and-swiglu: the NewGELU class
- attention: CausalBoW foreshadows it
- repos/ng-video-lecture: the next step after makemore