The Zero-to-Hero Lecture Arc
Andrej Karpathy's "Neural Networks: Zero to Hero" YouTube series is a
9-lecture, ~30-hour course that walks from "what is a derivative" all the
way to reproducing GPT-2 (124M) from scratch. The transcripts for all 9
lectures are in sources/transcripts/. This page is a reading
map for them.
Lecture 1: Building micrograd
Foundational
The foundational lecture. Karpathy:
- Defines what a derivative is (with finite differences and a Jupyter notebook).
- Builds the Value class operator by operator. See value-class.
- Derives _backward closures by hand for +, *, tanh, etc.
- Shows the topological sort and reverse-pass algorithm. See backpropagation.
- Compares micrograd's gradients to PyTorch's on the same expression.
- Builds an MLP, trains it on a tiny dataset.
If you don't already have intuition for backprop, this lecture is the place to develop it.
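The steps listed above can be sketched in a few dozen lines. This is a minimal illustration in the spirit of micrograd, not the repository's actual code: it supports only `+`, `*`, and `tanh`, and the input values 2.0, -3.0, and 5.5 are arbitrary choices for the demonstration. The finite-difference check at the end plays the role of the "compare against a reference" step.

```python
import math


class Value:
    """A scalar that remembers how it was computed (illustrative sketch)."""

    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None  # replaced by the op that creates this node
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))

        def _backward():
            # d tanh(x)/dx = 1 - tanh(x)^2
            self.grad += (1.0 - t * t) * out.grad

        out._backward = _backward
        return out

    def backward(self):
        # Topological sort, then apply the chain rule from the output backwards.
        topo, visited = [], set()

        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)

        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()


def f(a, b, c):
    """The same expression on plain floats, used for the numeric check."""
    return math.tanh(a * b + c)


a, b, c = Value(2.0), Value(-3.0), Value(5.5)
out = (a * b + c).tanh()
out.backward()

# Central finite difference with respect to a.
h = 1e-6
numeric_da = (f(2.0 + h, -3.0, 5.5) - f(2.0 - h, -3.0, 5.5)) / (2 * h)
print(a.grad, numeric_da)
```

The two printed numbers should agree to several decimal places. Gradients are accumulated with `+=` so that a node used in more than one place receives contributions from every use.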
Lecture 2: makemore intro (bigram)
Switches from numerical functions to language modeling. Trains a bigram model two ways:
- As a counts table: count how often each character follows each other character, normalize to a probability matrix, sample from it.
- As a neural net: a single nn.Linear layer trained with cross-entropy loss.
Shows that the two approaches produce essentially identical predictions, then argues the neural-net version generalizes (you can add more layers, attention, etc.). The framing, "language modeling is just next-token prediction with cross-entropy loss," is set up here and reused for the rest of the series.
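The counts-table variant is small enough to sketch with the standard library. The three-word list below is a hypothetical stand-in for the names dataset, `.` is used as an assumed start/end marker, and the helper names `sample_name` and `avg_nll` are illustrative.

```python
import math
import random
from collections import defaultdict

words = ["emma", "olivia", "ava"]  # hypothetical stand-in for the names dataset

# Count how often each character follows each other character.
counts = defaultdict(lambda: defaultdict(int))
for w in words:
    chars = ["."] + list(w) + ["."]
    for prev, nxt in zip(chars, chars[1:]):
        counts[prev][nxt] += 1

# Normalize each row of counts into a probability distribution.
probs = {
    prev: {nxt: n / sum(row.values()) for nxt, n in row.items()}
    for prev, row in counts.items()
}


def sample_name(rng, max_len=50):
    """Sample characters one at a time until the end marker is drawn."""
    out, ch = [], "."
    while len(out) < max_len:
        row = probs[ch]
        ch = rng.choices(list(row), weights=list(row.values()))[0]
        if ch == ".":
            break
        out.append(ch)
    return "".join(out)


def avg_nll(word):
    """Average negative log-likelihood: the quantity cross-entropy training minimizes."""
    chars = ["."] + list(word) + ["."]
    pairs = list(zip(chars, chars[1:]))
    return -sum(math.log(probs[p][n]) for p, n in pairs) / len(pairs)


rng = random.Random(0)
print(sample_name(rng), avg_nll("emma"))
```

`avg_nll` only works for bigrams that appear in the counts; an unseen pair has no probability entry, which is the situation count smoothing is meant to handle.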
Lecture 3: makemore MLP
Implements Bengio et al. 2003: concatenate embeddings of the last 3 chars, pass through a hidden layer, predict next char. Topics introduced:
- Train/val/test splits: why you need them, how to pick them.
- Mini-batch training: why batches help, how to choose batch size.
- Learning rate finding: plot loss vs LR, eyeball the sweet spot.
Same names dataset, much better loss than bigram. Output is more name-like.
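As a rough sketch of the data handling described above: the code below turns words into (3-character context, next character) pairs, splits the words into train/val/test, and draws one mini-batch. The word list, the 80/10/10 ratio, the batch size of 4, and the `.` padding marker are all assumptions made for the example, and `build_dataset` is a hypothetical helper name.

```python
import random


def build_dataset(words, block_size=3):
    """Return parallel lists of context strings and next-character targets."""
    xs, ys = [], []
    for w in words:
        context = ["."] * block_size  # pad the start with the boundary marker
        for ch in w + ".":
            xs.append("".join(context))
            ys.append(ch)
            context = context[1:] + [ch]  # slide the window forward by one
    return xs, ys


# Hypothetical word list standing in for the names dataset.
words = ["emma", "olivia", "ava", "isabella", "sophia",
         "mia", "amelia", "harper", "evelyn", "luna"]

rng = random.Random(42)
rng.shuffle(words)

# Assumed 80/10/10 split, done at the word level so no word straddles splits.
n1, n2 = int(0.8 * len(words)), int(0.9 * len(words))
train, val, test = words[:n1], words[n1:n2], words[n2:]

x_train, y_train = build_dataset(train)

# One mini-batch: a handful of random example indices, sampled with replacement.
batch_idx = [rng.randrange(len(x_train)) for _ in range(4)]
batch = [(x_train[i], y_train[i]) for i in batch_idx]
print(batch)
```

For the single word "ab", `build_dataset` yields contexts `...`, `..a`, `.ab` with targets `a`, `b`, `.`.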
Lecture 4: makemore activations & gradients, BatchNorm
Most important
The most important lecture for understanding why training fails or succeeds. Builds a deep MLP and shows:
- With default init, deep layers saturate the tanh, gradients vanish, training stalls.
- Hand-tuned init fixes activation magnitudes layer-by-layer.
- BatchNorm automates the fix.
- Activation/gradient histograms per layer let you diagnose training health visually.
Even though transformers use LayerNorm instead of BatchNorm, the lecture is essential — it teaches the diagnostic skill of looking at activation statistics, which is what you do when debugging any training failure. See training-stability and weight-init.
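The saturation effect can be seen without a full network. The sketch below pushes unit-variance inputs through a single tanh layer and measures how many outputs land near ±1. The layer sizes, the 0.97 threshold, and the scaled standard deviation of (5/3)/sqrt(fan_in) are assumptions chosen for illustration, and `saturation_fraction` is a hypothetical helper.

```python
import math
import random


def saturation_fraction(weight_std, fan_in=100, fan_out=100, batch=50, seed=0):
    """Fraction of tanh outputs with |value| > 0.97 for Gaussian weights and inputs."""
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, weight_std) for _ in range(fan_out)] for _ in range(fan_in)]
    saturated = total = 0
    for _ in range(batch):
        x = [rng.gauss(0.0, 1.0) for _ in range(fan_in)]
        for j in range(fan_out):
            pre = sum(x[i] * W[i][j] for i in range(fan_in))  # pre-activation
            saturated += abs(math.tanh(pre)) > 0.97
            total += 1
    return saturated / total


naive = saturation_fraction(weight_std=1.0)
scaled = saturation_fraction(weight_std=(5 / 3) / math.sqrt(100))
print(f"naive init: {naive:.2f} saturated, scaled init: {scaled:.2f} saturated")
```

With unit-variance weights the pre-activations have a standard deviation of roughly sqrt(fan_in), so most tanh outputs sit in the flat region where the local gradient is close to zero. Dividing by sqrt(fan_in) keeps the pre-activation scale near 1, which is the kind of statistic the per-layer histograms are meant to reveal.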
Lecture 5: Becoming a Backprop Ninja
Most demanding
The most demanding lecture. Karpathy disables loss.backward()
and reimplements every backward pass by hand: the cross-entropy backward,
the softmax backward, the BatchNorm backward (notoriously tricky), the MLP
backward, the embedding backward. At each step he checks gradient
correctness against PyTorch's auto-computed gradient.
Optional but transformative — by the end you can write a CUDA backward kernel and know if it's right. See backpropagation.
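One of the hand-derived results, the gradient of cross-entropy with respect to the logits, is compact enough to check in isolation. The sketch below uses the standard identity dL/dlogits = softmax(logits) - one_hot(target) and compares it with a finite-difference estimate; in the lecture the reference is PyTorch's autograd, so the numeric check here is a stand-in. The logits and target are arbitrary example values.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def cross_entropy(logits, target):
    return -math.log(softmax(logits)[target])


def cross_entropy_backward(logits, target):
    """Analytic gradient: softmax(logits) minus the one-hot target vector."""
    grad = softmax(logits)
    grad[target] -= 1.0
    return grad


def numeric_grad(logits, target, h=1e-6):
    """Central finite differences, one logit at a time."""
    grads = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += h
        down[i] -= h
        grads.append((cross_entropy(up, target) - cross_entropy(down, target)) / (2 * h))
    return grads


logits, target = [2.0, -1.0, 0.5, 0.1], 2  # arbitrary example values
analytic = cross_entropy_backward(logits, target)
numeric = numeric_grad(logits, target)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_diff)
```

The analytic gradient sums to zero across the logits, since the softmax probabilities sum to one and exactly one is subtracted at the target index.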
Lecture 6: makemore WaveNet
Builds a hierarchical model inspired by the dilated 1D convolutions of
DeepMind's WaveNet. Treats sequence modeling as a hierarchical tree:
positions 1-2 mix into a layer-1 feature, 3-4 mix into another, then
layer-1 features mix in layer 2, and so on. Tree depth is
log2(context_length) rather than linear in the context length.
Important for understanding that "transformer" is not the only sequence model, and that the idea of hierarchical mixing pre-dates attention. The next lecture introduces attention as a generalization of these mixing patterns.
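The tree structure can be illustrated without any learned weights. In the sketch below a tuple stands in for "a layer mixed these two inputs"; `fuse_pairs` and `build_tree` are hypothetical names, and the context length is assumed to be a power of two.

```python
def fuse_pairs(features):
    """Combine adjacent pairs; the tuple is a placeholder for a learned mixing layer."""
    return [(features[i], features[i + 1]) for i in range(0, len(features), 2)]


def build_tree(context_length):
    """Fuse positions pairwise until one feature remains; return (depth, tree)."""
    level = list(range(1, context_length + 1))  # positions 1..context_length
    depth = 0
    while len(level) > 1:
        level = fuse_pairs(level)
        depth += 1
    return depth, level[0]


depth, tree = build_tree(8)
print(depth, tree)
```

For a context of 8 positions this takes 3 fusing layers, and the nesting of the resulting tuple shows which positions were mixed at each level.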
Lecture 7: Let's build GPT
Crown jewel
The crown-jewel lecture. Karpathy builds a full transformer character-level LM on Tiny Shakespeare, starting from a bigram baseline and adding one feature at a time:
- Attention. Single attention head as "weighted bag of words" matmul → real Q/K/V attention.
- Multi-head attention.
- MLP.
- Stack into 6 blocks.
- Add residual connections: big loss drop.
- Add LayerNorm, pre-norm placement.
- Scale up.
The final model is gpt.py in ng-video-lecture, ~10M params,
character-level, produces the famous Shakespeare-flavored output. This is
the lecture that turns "GPT is magic" into "GPT is six PyTorch classes."
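The step from "weighted bag of words" to Q/K/V attention can be sketched on plain lists. This is an illustrative single-head causal attention, not the code in gpt.py: the queries, keys, and values are passed in directly, whereas a real model would compute them as learned linear projections of the input. The function name `causal_attention` is an assumption of the example.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(z - m) for z in row]
    s = sum(exps)
    return [e / s for e in exps]


def causal_attention(q, k, v):
    """Single-head attention where position t only attends to positions 0..t."""
    d = len(q[0])
    out = []
    for t in range(len(q)):
        # Scaled dot-product scores against the current and earlier positions only.
        scores = [
            sum(qi * ki for qi, ki in zip(q[t], k[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        out.append([
            sum(weights[s] * v[s][j] for s in range(t + 1))
            for j in range(len(v[0]))
        ])
    return out


# With all-zero queries and keys every score is 0, so the weights are uniform
# and attention reduces to a running average of the values seen so far.
zeros = [[0.0, 0.0] for _ in range(3)]
values = [[1.0], [3.0], [5.0]]
averaged = causal_attention(zeros, zeros, values)
print(averaged)
```

The uniform case gives running averages of approximately 1.0, 2.0, and 3.0. Non-zero queries and keys make the weights data-dependent, which is the difference between plain averaging and attention.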
Lecture 9: Let's build the GPT Tokenizer
Karpathy's least-favorite topic, taught well. Walks through:
- Why character-level tokenization is bad for serious LLMs.
- The BPE algorithm from scratch.
- The byte-level twist (work on raw UTF-8 bytes).
- Foot-guns: spelling, arithmetic, non-English, trailing whitespace, SolidGoldMagikarp.
- The GPT-2 tokenizer's regex-based pre-tokenization rules.
- SentencePiece (the Llama tokenizer).
See tokenization and character-vs-bpe. The lecture is the deepest tokenizer treatment you'll find on YouTube.
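The core of byte-level BPE training fits in a short sketch: start from raw UTF-8 bytes, repeatedly find the most frequent adjacent pair, and replace it with a new token id. The function names below are illustrative, new ids are assumed to start at 256 (one past the largest byte value), and the regex pre-tokenization step is left out.

```python
from collections import Counter


def pair_counts(ids):
    """Count every adjacent pair of token ids."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`, left to right."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Return the compressed id sequence and the learned merge table."""
    ids = list(text.encode("utf-8"))  # byte-level: start from raw UTF-8 bytes
    merges = {}
    for step in range(num_merges):
        counts = pair_counts(ids)
        if not counts:
            break
        pair = counts.most_common(1)[0][0]
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


ids, merges = train_bpe("aaabdaaabac", num_merges=1)
print(ids, merges)
```

On the string "aaabdaaabac" the most frequent byte pair is (97, 97), that is "aa", so one merge shortens the sequence from 11 ids to 9 and records the mapping (97, 97) → 256.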
Lecture 10: Let's reproduce GPT-2 (124M)
Capstone
The capstone. Four hours of incremental optimization that turns a "nanoGPT-ish baseline" into a faithful GPT-2 reproduction trained on FineWeb-Edu in 6 hours on 8×A100 for ~$10 of compute. Reaches the GPT-3 paper's HellaSwag accuracy.
Each optimization is introduced, justified, measured. See repos/build-nanogpt for the full step-by-step list.
The shape of the arc
- Lecture 1, micrograd: backprop on scalars.
- Lecture 2, makemore bigram: language modeling at all.
- Lecture 3, makemore MLP: real neural net architecture.
- Lecture 4, makemore deep: how training stays stable.
- Lecture 5, makemore backprop ninja: backprop from scratch.
- Lecture 6, makemore wavenet: hierarchical mixing.
- Lecture 7, ng-video-lecture: transformer + attention.
- Lecture 8: (skipped)
- Lecture 9, tokenizer: BPE.
- Lecture 10, build-nanogpt: real GPT-2 reproduction at scale.
The progression is: small clear primitives → composed into bigger systems → optimized for real-world hardware. Each lecture is independently watchable, but the cumulative arc is what makes the series special.
Related
- All the repo pages: micrograd, makemore, ng-video-lecture, nanoGPT, build-nanogpt, llama2-c, llm-c
- All the concept pages link back here when they cite a specific lecture.