ng-video-lecture
The bridge from makemore to a real GPT. The companion repo for lecture 7, "Let's build GPT: from scratch, in code, spelled out".
Two files: bigram.py (the starting point) and gpt.py (the destination: a small but legitimate GPT, ~10M parameters, trained on Tiny Shakespeare).
What it is
Two self-contained training scripts:
bigram.py ~120 lines
Character-level bigram model. One embedding table maps each character to a row of logits over the next character. Trains in seconds. Output is gibberish that matches Shakespeare's character-pair statistics.
gpt.py ~225 lines
Character-level GPT. Token embedding + position embedding + 6 transformer blocks with 6-head attention + final LayerNorm + lm_head. ~10M parameters. Trains in roughly 15 minutes on an A100. Output is Shakespeare-flavored, coherent-looking text made of mostly invented words.
Both train on input.txt, the Tiny Shakespeare corpus (~1MB).
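To make "character-level" concrete, here is a minimal sketch of the kind of tokenizer both scripts rely on. The `text` string is a stand-in, not the real input.txt, and the names `stoi`, `itos`, `encode`, and `decode` are illustrative choices rather than a quote from the repo.

```python
# Minimal character-level tokenizer sketch.
# Assumption: `text` is a short stand-in for the contents of input.txt.
text = "First Citizen:\nBefore we proceed any further, hear me speak."

chars = sorted(set(text))                      # vocabulary = every distinct character
vocab_size = len(chars)
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> character


def encode(s: str) -> list[int]:
    """Map a string to a list of integer token ids, one per character."""
    return [stoi[c] for c in s]


def decode(ids: list[int]) -> str:
    """Map a list of token ids back to the original string."""
    return "".join(itos[i] for i in ids)


ids = encode("hear me")
print(vocab_size, ids, decode(ids))
```

On the full Tiny Shakespeare corpus the same construction yields the vocabulary of 65 characters mentioned later in this note.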
Why this repo matters
In a typical "let's build GPT" tutorial, you copy the code, run it, and
feel slightly smarter. Karpathy's version is different: the lecture
takes 2 hours to develop gpt.py line by line,
starting from bigram.py and adding one piece at a time,
with the loss decreasing at each step to confirm the addition helped.
The progression in the lecture
- Bigram baseline loss ~2.5 (vs ~4.2 random)
- Add a single attention head, starting from the "weighted bag of words" trick: build a lower-triangular matrix with each row normalized to sum to 1 (row t, counting from 1, holds t entries of 1/t), then matmul. This is just averaging the past; it improves loss because some context is better than none.
- Replace the uniform weights with learned, content-dependent weights via Q and K. This is now real self-attention.
- Add multi-head: parallel attention heads, concatenated.
- Add the MLP (feedforward block).
- Stack into 6 layers.
- Add residual connections — loss drops sharply
- Add LayerNorm, pre-norm placement.
- Add dropout.
- Scale up n_embd, n_head, n_layer, train longer.
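The "weighted bag of words" step in the list above can be sketched as follows. This is an illustration with toy sizes (B, T, C are arbitrary), not a copy of the lecture code: it computes the same causal average three ways, with an explicit loop, with a row-normalized lower-triangular matrix, and with a masked softmax, which is the form that later accepts learned scores from Q and K.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C = 2, 4, 3                      # toy batch, time, channel sizes (assumed)
x = torch.randn(B, T, C)

# Version 1: explicit loop. Position t gets the mean of x[b, 0..t].
xbow_loop = torch.zeros(B, T, C)
for b in range(B):
    for t in range(T):
        xbow_loop[b, t] = x[b, : t + 1].mean(dim=0)

# Version 2: one matmul with a row-normalized lower-triangular matrix.
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(dim=1, keepdim=True)   # row t (0-indexed) holds t+1 entries of 1/(t+1)
xbow_matmul = wei @ x                      # (T, T) @ (B, T, C) broadcasts to (B, T, C)

# Version 3: mask the future with -inf, then softmax.
# With all-zero scores this reproduces the uniform average; replacing the
# zeros with q @ k^T scores turns it into real self-attention.
tril = torch.tril(torch.ones(T, T))
scores = torch.zeros(T, T).masked_fill(tril == 0, float("-inf"))
wei_softmax = F.softmax(scores, dim=-1)
xbow_softmax = wei_softmax @ x

print(wei)
print(torch.allclose(xbow_loop, xbow_matmul, atol=1e-6),
      torch.allclose(xbow_matmul, xbow_softmax, atol=1e-6))
```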
By the end, the loss is around 1.0-1.5 and the output is what people quote as "look, GPT for kids":
ANGELO:
And cowards it be strawn to my bed,
And thrust the gates of my threats,
Because he that ale away, and hang'd
An one with him.
What's in gpt.py
The full transformer in 225 lines. Classes:
- Head: One attention head: Q, K, V linears, causal mask, softmax, dropout, weighted V aggregation.
- MultiHeadAttention: n_head parallel Heads, concat outputs, output projection, dropout.
- FeedFoward: Two linears + ReLU + dropout. (The typo is in the original.)
- Block: Pre-norm block: x + attn(ln1(x)) then x + ffwd(ln2(x)).
- GPTLanguageModel: Token + position embeddings, stack of Blocks, final LayerNorm, lm_head. Plus generate(), which samples the next token with torch.multinomial (stochastic sampling, not greedy decoding).
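As a rough illustration of the Head entry above, here is a sketch of a single causal attention head. It follows the description in this note (Q, K, V linears, causal mask, softmax, dropout, weighted aggregation of V), but the constructor arguments are an assumption made to keep the example self-contained; it is not presented as the repo's exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Head(nn.Module):
    """One head of causal self-attention (illustrative sketch)."""

    def __init__(self, n_embd: int, head_size: int, block_size: int, dropout: float = 0.0):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower-triangular mask stored as a buffer: moves with the module, not trained.
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        k = self.key(x)                                       # (B, T, head_size)
        q = self.query(x)                                     # (B, T, head_size)
        # Scaled dot-product scores between every pair of positions.
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5   # (B, T, T)
        # Causal mask: position t may only attend to positions <= t.
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)
        return wei @ self.value(x)                            # (B, T, head_size)


torch.manual_seed(0)
head = Head(n_embd=32, head_size=8, block_size=16).eval()
x = torch.randn(1, 10, 32)
out = head(x)

# Causality check: perturbing the last token must not change earlier outputs.
x2 = x.clone()
x2[:, -1] += 1.0
out2 = head(x2)
print(out.shape, torch.allclose(out[:, :-1], out2[:, :-1]))
```

The causality check at the end is a cheap way to confirm the mask is wired correctly before stacking heads into blocks.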
Hyperparameters
The file is configured for a GPU:
block_size=256 characters, ~10M params, 5000 iterations of
AdamW. Karpathy says this takes about 15 minutes on an A100.
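The "~10M params" figure can be checked with back-of-the-envelope arithmetic. The sketch below takes block_size=256, 6 layers, 6 heads, and vocab 65 from this note; n_embd=384, the 4x feedforward expansion, and the bias placement (no bias on Q/K/V, bias elsewhere) are assumptions not stated above.

```python
# Parameter-count sketch for the architecture described in this note.
vocab_size = 65      # from the tokenization row of the comparison table
block_size = 256     # from the hyperparameters above
n_layer = 6          # from the architecture description
n_head = 6           # from the architecture description
n_embd = 384         # ASSUMPTION: not stated in this note
head_size = n_embd // n_head

# One transformer block.
qkv = n_head * 3 * (n_embd * head_size)              # per-head Q, K, V linears, no bias (assumed)
proj = n_embd * n_embd + n_embd                      # attention output projection
ffwd = (n_embd * 4 * n_embd + 4 * n_embd) \
     + (4 * n_embd * n_embd + n_embd)                # two linears with 4x expansion (assumed)
norms = 2 * (2 * n_embd)                             # two LayerNorms, weight + bias each
per_block = qkv + proj + ffwd + norms

# Everything outside the blocks.
tok_emb = vocab_size * n_embd
pos_emb = block_size * n_embd
ln_f = 2 * n_embd
lm_head = n_embd * vocab_size + vocab_size

total = n_layer * per_block + tok_emb + pos_emb + ln_f + lm_head
print(f"per block: {per_block:,}")
print(f"total:     {total:,}  (~{total / 1e6:.1f}M)")
```

Under these assumptions almost all of the parameters sit in the six blocks; the embeddings and lm_head are small because the vocabulary has only 65 entries.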
Differences from nanoGPT
gpt.py is a self-contained pedagogical version of
nanoGPT. Differences:
| Dimension | ng-video-lecture | nanoGPT |
|---|---|---|
| Tokenization | character-level (vocab 65) | character-level and BPE |
| Attention impl | manual (explicit q @ k.T matmul, softmax, etc.) | F.scaled_dot_product_attention (flash attention) when available |
| Multi-head layout | nn.ModuleList([Head(head_size) for _ in range(num_heads)]) with separate Q, K, V Linears per head | All heads' Q/K/V fused into one Linear of size 3 * n_embd |
| Head combination | Separate attention heads concatenated | Batched as an extra dimension |
| Data | Tiny Shakespeare in-memory | OpenWebText with sharded prep |
After working through gpt.py, reading nanoGPT/model.py is a fun exercise in spotting the optimizations.
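One of those optimizations can be verified directly: the manual attention path and the fused call compute the same thing. The sketch below uses random tensors with toy shapes (assumed, not from either repo) and needs PyTorch 2.0 or later for `F.scaled_dot_product_attention`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, n_head, T, head_size = 2, 4, 8, 16          # toy shapes (assumed)
q = torch.randn(B, n_head, T, head_size)
k = torch.randn(B, n_head, T, head_size)
v = torch.randn(B, n_head, T, head_size)

# Manual path: explicit scores, causal mask, softmax, weighted sum of values.
att = (q @ k.transpose(-2, -1)) * head_size ** -0.5     # (B, n_head, T, T)
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
att = att.masked_fill(~mask, float("-inf"))
y_manual = F.softmax(att, dim=-1) @ v

# Fused path: one call; is_causal=True applies the same lower-triangular mask
# and the default scale is 1/sqrt(head_size).
y_fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

max_diff = (y_manual - y_fused).abs().max().item()
print(f"max abs difference: {max_diff:.2e}")
```

Here the heads are already batched as an extra dimension, which is the layout the "Head combination" row of the table attributes to nanoGPT.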
What's in bigram.py
The minimum-viable LM. One nn.Embedding(vocab_size,
vocab_size) directly outputs next-token logits when queried with
the current token. Training loop is 30 lines. Loss converges to about
2.5 (vs uniform 4.2). Output is character soup with Shakespeare's
character frequency.
The point isn't that bigram is good; the point is that bigram is a working LM that you can extend incrementally to GPT. Each addition has to help the loss to justify itself.
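A minimal sketch of such a bigram model is shown below, along with the uniform baseline it has to beat. The class and method names are illustrative, and the inputs are random token ids rather than Shakespeare; zeroing the table is only a trick to show that equal logits give a loss of exactly ln(65), about 4.17.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size = 65
uniform_loss = math.log(vocab_size)    # loss of a model that guesses uniformly


class BigramLanguageModel(nn.Module):
    """Each token id looks up its own row of next-token logits."""

    def __init__(self, vocab_size: int):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)           # (B, T, vocab_size)
        if targets is None:
            return logits, None
        B, T, C = logits.shape
        loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens: int):
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            probs = F.softmax(logits[:, -1, :], dim=-1)    # only the last position matters
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx


torch.manual_seed(0)
model = BigramLanguageModel(vocab_size)
idx = torch.randint(vocab_size, (4, 8))        # random stand-in batch
targets = torch.randint(vocab_size, (4, 8))
logits, loss = model(idx, targets)
sample = model.generate(torch.zeros((1, 1), dtype=torch.long), max_new_tokens=20)

# With an all-zero table every logit is equal, so the loss is exactly ln(65).
with torch.no_grad():
    model.token_embedding_table.weight.zero_()
_, loss_uniform = model(idx, targets)
print(f"uniform baseline: {uniform_loss:.3f}, zero-table loss: {loss_uniform.item():.3f}")
```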
Related
- zero-to-hero-arc — the lecture
- attention — the central concept being built
- transformer-block — the result
- repos/nanoGPT — the production sibling