makemore

One Python file that takes a text file of "things" and generates more things like them, character by character. The substrate for five lectures of Zero-to-Hero.

Step 1: micrograd (autograd in scalars)
Step 2 (this page): makemore (tensor autograd, many architectures)
Step 3: ng-video-lecture (the next step after makemore)

After micrograd shows you autograd in scalars, makemore teaches you to use PyTorch's tensor-based autograd to build progressively bigger language models — bigram, MLP, RNN, LSTM, GRU, transformer — all in a single hackable Python file.

What it is

makemore is one Python file that takes a text file of "things" (e.g. one name per line) and generates more things like them, character by character. The current default is a 200K-parameter transformer, but the file contains implementations of multiple architectures all accessible via command-line flags.

$ python makemore.py -i names.txt -o names

The included dataset is 32,000 baby names from the SSA. Sample output:

dontell, khylum, camatena, aeriline, najlah,
sherrith, ryel, irmi, taislee, mortaz...

Plausibly name-shaped, not actually real names. The transformer learns the texture of names.
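
To make "character by character" concrete, here is a minimal sketch of how a file of names could be turned into integer sequences. The helper names (build_vocab, encode, decode) and the choice to reserve index 0 as a start/stop marker are assumptions made for this example, not a description of what makemore.py necessarily does.

```python
# Sketch only: build_vocab/encode/decode are hypothetical helpers, and reserving
# index 0 as a start/stop marker is an assumption made for this example.

def build_vocab(words):
    """Map each distinct character to an integer, keeping 0 free as a delimiter."""
    chars = sorted(set("".join(words)))
    stoi = {ch: i + 1 for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos

def encode(word, stoi):
    """Turn a word into token ids, wrapped in the 0 delimiter on both sides."""
    return [0] + [stoi[ch] for ch in word] + [0]

def decode(ids, itos):
    """Inverse of encode: drop the delimiters and join the characters."""
    return "".join(itos[i] for i in ids if i != 0)

words = ["emma", "olivia", "ava"]  # stand-in for lines read from a names file
stoi, itos = build_vocab(words)
ids = encode("ava", stoi)
print(ids)                # [0, 1, 7, 1, 0]
print(decode(ids, itos))  # ava
```

With only seven distinct letters in this toy list, the vocabulary has eight entries once the delimiter is counted. Every model in the file would then see nothing but such integer sequences.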

What's in makemore.py

A single ~700-line file with these architectures, all sharing the same input/output API:

Bigram
(prev_char) → next_char. One lookup table of size (vocab_size, vocab_size). Roughly equivalent to a table of 2-gram (bigram) counts: each row holds the next-character distribution for one previous character.
MLP
Bengio et al. 2003. Concatenate the last block_size character embeddings, pass through a hidden layer, predict next char.
CausalBoW
Average the previous tokens. Karpathy includes this with a wink ("...looks suspiciously like a CausalAttention module you'd find in a transformer, for no apparent reason at all ;)") — it's the structural precursor to attention.
RNN, LSTM, GRU
Recurrent variants. They process a sequence one step at a time, so training parallelizes less well than a transformer's. Included for completeness and historical interest.
Transformer
The same architecture as GPT-2, just at a much smaller scale. NewGELU, LayerNorm, causal multi-head attention, etc.

You choose architecture via --type. The code is structured so the architectures are interchangeable behind a common ModelConfig and a common training loop.
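
The Bigram entry is small enough to sketch without PyTorch. The version below builds the count table from a toy word list and samples from it. It is a counting stand-in for the trained lookup table, not the code in makemore.py, and the "." start/end marker, the function names and the four-word dataset are all assumptions of this example.

```python
import random
from collections import defaultdict

def train_bigram(words):
    """Count how often each character follows each other character.
    '.' marks both the start and the end of a word (an assumption of this sketch)."""
    counts = defaultdict(lambda: defaultdict(int))
    for w in words:
        seq = ["."] + list(w) + ["."]
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def sample_word(counts, rng, max_len=20):
    """Generate one word by repeatedly sampling next_char given prev_char."""
    out, prev = [], "."
    for _ in range(max_len):
        followers = counts[prev]
        chars = list(followers)
        weights = [followers[c] for c in chars]
        prev = rng.choices(chars, weights=weights, k=1)[0]
        if prev == ".":  # sampled the end marker: the word is finished
            break
        out.append(prev)
    return "".join(out)

words = ["emma", "olivia", "ava", "mia"]  # toy stand-in for names.txt
counts = train_bigram(words)
rng = random.Random(0)  # fixed seed so runs are repeatable
samples = [sample_word(counts, rng) for _ in range(5)]
print(samples)
```

Normalizing each row of counts gives the next-character probabilities that a (vocab_size, vocab_size) table of logits would be trained to reproduce.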

The pedagogical arc

makemore is the substrate for lectures 2-6 of Zero-to-Hero, which trace a progression:

  1. Lec 2
    Bigram

    Lookup table of next-character counts. Then the same model implemented as a neural net with a single linear layer trained with cross-entropy loss. It converges to essentially the same predictions, but now it's neural.

  2. Lec 3
    MLP

    Concatenate embeddings of last 3 chars, pass through an MLP. Discusses train/val split, learning rate finding, mini-batches.

  3. Lec 4
    Activations and gradients

    The deep MLP, BatchNorm, what makes deep networks trainable. The lecture that everyone should watch.

  4. Lec 5
    Manually deriving every backward pass

    "Becoming a Backprop Ninja." Hard, optional, character-building.

  5. Lec 6
    WaveNet-style hierarchical model

    Convolutions, dilated structure, treating sequences as a tree.

By the end of lecture 6 you have, conceptually, every ingredient of a transformer except attention itself. Lecture 7 ("Let's build GPT") then bolts attention onto the existing infrastructure.
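
The MLP step above ("concatenate embeddings of last 3 chars") presupposes a dataset of fixed-length contexts. Here is a minimal sketch of how such (context, target) pairs could be built. build_examples is a hypothetical helper, and using "." as both left padding and end-of-word marker is an assumption of this example.

```python
def build_examples(word, block_size=3, pad="."):
    """Return (context, target) pairs; each context is the previous block_size chars."""
    context = [pad] * block_size
    examples = []
    for ch in list(word) + [pad]:     # the trailing pad acts as the end-of-word target
        examples.append(("".join(context), ch))
        context = context[1:] + [ch]  # slide the window one character forward
    return examples

for ctx, target in build_examples("emma"):
    print(ctx, "->", target)
# ... -> e
# ..e -> m
# .em -> m
# emm -> a
# mma -> .
```

A word of length n yields n + 1 training examples, one per character plus one for the end marker.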

What's interesting in the code

The whole file is one giant educational artifact, but a few highlights:

NewGELU

The tanh approximation of GELU used by GPT-2, defined as a tiny nn.Module:

class NewGELU(nn.Module):
    def forward(self, x):
        return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))
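
To see how close that tanh formula stays to the erf-based definition of GELU, here is a small pure-Python comparison. The function names and the [-5, 5] grid are choices made for this sketch; only the formula itself comes from the class above.

```python
import math

def gelu_exact(x):
    """GELU defined through the Gaussian CDF: x * Phi(x)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Same tanh approximation as the NewGELU class, applied to a plain float."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

# Evaluate both on an evenly spaced grid and report the largest disagreement.
xs = [i / 100.0 for i in range(-500, 501)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(f"largest gap on [-5, 5]: {max_gap:.2e}")
```

The gap is small enough that the two are interchangeable for most purposes, which is presumably why the cheaper closed form caught on.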

CausalBoW

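Bag-of-words as a matmul. This is the cleanest possible introduction to "communication between tokens as a mixing matrix": here the weights are fixed, uniform averages over the current and earlier positions, and attention later makes them learned and data-dependent. See attention.md.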
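
The matmul view can be sketched in plain Python. The lower-triangular matrix below averages each position with everything before it. In this sketch the weights are fixed and uniform, and the helper names and toy input are assumptions of the example rather than code from the repo.

```python
def causal_average_weights(T):
    """Row t holds 1/(t+1) for positions 0..t and 0 for every future position."""
    return [[1.0 / (t + 1) if s <= t else 0.0 for s in range(T)] for t in range(T)]

def matmul(A, B):
    """Plain-Python matrix product of A (n x k) and B (k x m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Four tokens, each a 2-dimensional feature vector.
x = [[1.0, 0.0],
     [3.0, 2.0],
     [5.0, 4.0],
     [7.0, 6.0]]

W = causal_average_weights(len(x))
y = matmul(W, x)  # y[t] is the mean of x[0], ..., x[t]
for row in y:
    print(row)
```

Swap the uniform rows of W for rows computed from the tokens themselves and you have the shape of causal attention.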

The ModelConfig pattern

All architectures share one config dataclass:

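from dataclasses import dataclass

@dataclass
class ModelConfig:
    block_size: int = None  # length of the input sequences of integers
    vocab_size: int = None  # input integers are in the range [0, vocab_size - 1]
    # the fields below control model size, interpreted slightly differently per architecture
    n_layer: int = 4
    n_embd: int = 64
    n_embd2: int = 64
    n_head: int = 4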

Same fields drive a Transformer, an RNN, or an MLP. This is the abstraction that makes swap-the-architecture pedagogy possible.
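
As a rough illustration of one config feeding several architectures, the sketch below reuses the same field names to estimate parameter counts for two of them. ToyConfig, the two counting functions, the PARAM_COUNTERS registry and the assumed MLP layout (embedding table, one hidden layer, one output layer, both with biases) are hypothetical, chosen for this example rather than taken from makemore.py.

```python
from dataclasses import dataclass

@dataclass
class ToyConfig:
    # Mirrors some field names of ModelConfig; the values are chosen for this example.
    block_size: int = 3
    vocab_size: int = 27
    n_embd: int = 64
    n_embd2: int = 64

def bigram_params(cfg):
    """One (vocab_size, vocab_size) table of logits."""
    return cfg.vocab_size * cfg.vocab_size

def mlp_params(cfg):
    """Embedding table + hidden layer + output layer, with biases (assumed layout)."""
    embedding = cfg.vocab_size * cfg.n_embd
    hidden = cfg.block_size * cfg.n_embd * cfg.n_embd2 + cfg.n_embd2
    output = cfg.n_embd2 * cfg.vocab_size + cfg.vocab_size
    return embedding + hidden + output

# Hypothetical registry standing in for an architecture switch such as --type.
PARAM_COUNTERS = {"bigram": bigram_params, "mlp": mlp_params}

cfg = ToyConfig()
for name, counter in PARAM_COUNTERS.items():
    print(name, counter(cfg))
# bigram 729
# mlp 15835
```

The bigram ignores everything except vocab_size, while the MLP reads block_size, n_embd and n_embd2. That is the sense in which each architecture interprets the shared fields in its own way.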

Pedagogical notes

This is one hackable file, and is mostly intended for educational purposes. PyTorch is the only requirement.

— the README's tone

This is the right size for a teaching artifact — small enough that you can hold the whole thing in your head, big enough to actually learn from.

Related

zero-to-hero-arc
lectures 2-6
character-vs-bpe
makemore is character-level
gelu-and-swiglu
the NewGELU class
attention
CausalBoW foreshadows it
repos/ng-video-lecture
the next step after makemore