
The Zero-to-Hero Lecture Arc

Andrej Karpathy's "Neural Networks: Zero to Hero" YouTube series is a 9-lecture, ~30-hour course that walks from "what is a derivative" all the way to reproducing GPT-2 (124M) from scratch. The transcripts for all 9 lectures are in sources/transcripts/. This page is a reading map for them.

The remarkable thing about the series is how cumulative it is: each lecture builds on the previous one, and skipping lectures will cost you later. No lecture in the series is "just an overview"; every one is hands-on coding.

Lecture 1: Building micrograd

Foundational
Transcript: 01-The spelled-out intro to neural networks and backpropagation: building micrograd.en.txt
Companion: micrograd

The foundational lecture. Karpathy:

  1. Defines what a derivative is (with finite differences and a Jupyter notebook).
  2. Builds the Value class operator by operator. See value-class.
  3. Derives _backward closures by hand for +, *, tanh, etc.
  4. Shows the topological sort and reverse-pass algorithm. See backpropagation.
  5. Compares micrograd's gradients to PyTorch's on the same expression.
  6. Builds an MLP, trains it on a tiny dataset.
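
The steps above can be sketched in a few dozen lines. This is a minimal illustration in the spirit of micrograd, not a copy of the repo: it supports only +, * and tanh, and the finite-difference check at the end stands in for the PyTorch comparison from the lecture. The helper `f` and the input values are chosen here purely for illustration.

```python
import math

class Value:
    """Scalar that remembers how it was computed (a sketch, not micrograd itself)."""

    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._backward = lambda: None  # leaf nodes have nothing to propagate

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            # d tanh(x)/dx = 1 - tanh(x)^2
            self.grad += (1 - t * t) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topological sort, then apply the chain rule from the output backwards.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()


def f(a, b, c):
    """Same expression on plain floats, used for the finite-difference check."""
    return math.tanh(a * b + c)

a, b, c = Value(2.0), Value(-3.0), Value(5.5)
y = (a * b + c).tanh()
y.backward()

h = 1e-6
numeric_da = (f(2.0 + h, -3.0, 5.5) - f(2.0 - h, -3.0, 5.5)) / (2 * h)
print(a.grad, numeric_da)  # the two numbers should agree closely
```

Gradients are accumulated with `+=` so that a value used more than once in an expression receives the sum of its contributions.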

If you don't already have intuition for backprop, this lecture is the place to develop it.

Lecture 2: makemore intro (bigram)

Transcript: 02-The spelled-out intro to language modeling: building makemore.en.txt
Companion: makemore

Switches from numerical functions to language modeling. Trains a bigram model two ways:

  1. As a counts table: count how often each character follows each other character, normalize to a probability matrix, sample from it.
  2. As a neural net: a single linear layer (a 27×27 weight matrix applied to one-hot inputs) trained with cross-entropy loss.

Shows that the two versions produce essentially the same predictions, then argues that the neural-net version is the one that generalizes (you can add more layers, attention, etc.). The framing, "language modeling is just next-token prediction with cross-entropy loss", is set up here and reused for the rest of the series.
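
The counts-table version can be sketched with the standard library alone. The three-word list below is a hypothetical stand-in for the names dataset, and the average negative log-likelihood at the end is the same quantity the neural-net version minimizes as its cross-entropy loss.

```python
import math
from collections import Counter

words = ["emma", "ava", "mia"]  # hypothetical stand-in for the names dataset

# Count how often each character follows another; "." marks start and end.
counts = Counter()
for w in words:
    chars = ["."] + list(w) + ["."]
    for ch1, ch2 in zip(chars, chars[1:]):
        counts[(ch1, ch2)] += 1

# Normalize each row of the table into P(next char | current char).
row_totals = Counter()
for (ch1, _), c in counts.items():
    row_totals[ch1] += c
probs = {pair: c / row_totals[pair[0]] for pair, c in counts.items()}

# Average negative log-likelihood of the data under the bigram model.
total_nll, num_pairs = 0.0, 0
for w in words:
    chars = ["."] + list(w) + ["."]
    for pair in zip(chars, chars[1:]):
        total_nll += -math.log(probs[pair])
        num_pairs += 1
avg_nll = total_nll / num_pairs
print(f"average NLL: {avg_nll:.4f}")
```

A real run would also smooth the counts so that unseen pairs do not get probability zero; that is omitted here to keep the sketch short.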

Lecture 3: makemore MLP

Transcript: 03-Building makemore Part 2: MLP.en.txt
Companion: makemore, MLP architecture

Implements Bengio et al. 2003: concatenate embeddings of the last 3 chars, pass through a hidden layer, predict next char. Topics introduced:

Train/val/test splits: why you need them and how to pick them.
Mini-batch training: why batches help and how to choose a batch size.
Learning rate finding: plot loss vs. LR and eyeball the sweet spot.

Same names dataset, much better loss than bigram. Output is more name-like.
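
As a rough illustration of the data side of this lecture, the sketch below builds the (last 3 chars → next char) training examples and splits a list of words into train/val/test. The function names and the 80/10/10 ratio are assumptions made for this example; the embedding table and hidden layer that consume these examples are not shown.

```python
import random

def build_dataset(words, block_size=3):
    """Turn words into (context, target) pairs of character indices."""
    chars = sorted(set("".join(words)))
    stoi = {ch: i + 1 for i, ch in enumerate(chars)}
    stoi["."] = 0  # "." pads the start and marks the end of a word
    X, Y = [], []
    for w in words:
        context = [0] * block_size
        for ch in w + ".":
            ix = stoi[ch]
            X.append(list(context))
            Y.append(ix)
            context = context[1:] + [ix]  # slide the window forward by one
    return X, Y, stoi

def split_words(words, seed=42, train=0.8, val=0.1):
    """Shuffle and split into train/val/test (ratios are an assumption)."""
    words = list(words)
    random.Random(seed).shuffle(words)
    n1 = int(train * len(words))
    n2 = int((train + val) * len(words))
    return words[:n1], words[n1:n2], words[n2:]

X, Y, stoi = build_dataset(["emma"])
for context, target in zip(X, Y):
    print(context, "->", target)

train_words, val_words, test_words = split_words([f"name{i}" for i in range(10)])
```

Splitting by word before building the examples keeps every context window of a given name inside a single split.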

Lecture 4: makemore activations & gradients, BatchNorm

Most important
Transcript: 04-Building makemore Part 3: Activations & Gradients, BatchNorm.en.txt
Companion: makemore, deeper MLP

The most important lecture for understanding why training fails or succeeds. Builds a deep MLP and shows:

  1. With default init, deep layers saturate the tanh, gradients vanish, training stalls.
  2. Hand-tuned init fixes activation magnitudes layer-by-layer.
  3. BatchNorm automates the fix.
  4. Activation/gradient histograms per layer let you diagnose training health visually.

Even though transformers use LayerNorm instead of BatchNorm, the lecture is essential — it teaches the diagnostic skill of looking at activation statistics, which is what you do when debugging any training failure. See training-stability and weight-init.
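
The saturation effect is easy to reproduce without any framework. The sketch below measures what fraction of tanh outputs in one layer are saturated, first with unscaled unit-variance Gaussian weights and then with weights scaled by 1/sqrt(fan_in). The layer sizes, the 0.97 threshold and the function name are choices made for this example, and the tanh gain factor is left out for simplicity.

```python
import math
import random

def saturated_fraction(weight_scale, fan_in=100, n_neurons=50, n_inputs=50, seed=0):
    """Fraction of tanh activations with |output| > 0.97 for one linear layer."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0, 1) * weight_scale for _ in range(fan_in)]
               for _ in range(n_neurons)]
    saturated = total = 0
    for _ in range(n_inputs):
        x = [rng.gauss(0, 1) for _ in range(fan_in)]
        for w in weights:
            pre = sum(wi * xi for wi, xi in zip(w, x))  # pre-activation
            if abs(math.tanh(pre)) > 0.97:
                saturated += 1
            total += 1
    return saturated / total

naive = saturated_fraction(weight_scale=1.0)
scaled = saturated_fraction(weight_scale=1.0 / math.sqrt(100))
print(f"saturated with unscaled init: {naive:.1%}")
print(f"saturated with 1/sqrt(fan_in) init: {scaled:.1%}")
```

Where tanh is saturated its local gradient 1 - tanh² is close to zero, which is why a high saturated fraction goes hand in hand with vanishing gradients.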

Lecture 5: Becoming a Backprop Ninja

Most demanding
Transcript: 05-Building makemore Part 4: Becoming a Backprop Ninja.en.txt
Companion: makemore

The most demanding lecture. Karpathy disables loss.backward() and reimplements every backward pass by hand: the cross-entropy backward, the softmax backward, the BatchNorm backward (notoriously tricky), the MLP backward, the embedding backward. At each step he checks gradient correctness against PyTorch's auto-computed gradient.

Optional but transformative: by the end you can derive a backward pass by hand and verify it against autograd, which is the skill you need before writing something like a custom CUDA backward kernel. See backpropagation.
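
A small instance of the derive-then-check workflow: the gradient of softmax cross-entropy with respect to the logits is softmax(logits) minus the one-hot target. The sketch below codes that formula by hand and compares it with a central finite difference, which stands in here for the PyTorch autograd comparison used in the lecture. The logits and target are arbitrary example values.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, target):
    return -math.log(softmax(logits)[target])

def cross_entropy_backward(logits, target):
    """Hand-derived gradient: dloss/dlogits = softmax(logits) - one_hot(target)."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

def numerical_grad(logits, target, h=1e-5):
    """Central finite difference, one logit at a time."""
    grads = []
    for i in range(len(logits)):
        up = list(logits)
        up[i] += h
        down = list(logits)
        down[i] -= h
        grads.append((cross_entropy(up, target) - cross_entropy(down, target)) / (2 * h))
    return grads

logits, target = [2.0, -1.0, 0.5], 2
manual = cross_entropy_backward(logits, target)
numeric = numerical_grad(logits, target)
max_diff = max(abs(a - b) for a, b in zip(manual, numeric))
print("max difference:", max_diff)
```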

Lecture 6: makemore WaveNet

Transcript: 06-Building makemore Part 5: Building a WaveNet.en.txt
Companion: makemore, CNN architecture

Builds a dilated 1D convolutional model inspired by DeepMind's WaveNet. It treats sequence modeling as a hierarchical tree: positions 1-2 mix into a layer-1 feature, positions 3-4 mix into another, then layer-1 features mix in layer 2, and so on. The tree depth is log2(context_length) rather than growing linearly with the context.
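
The tree structure can be illustrated without any learned weights. In the sketch below, list concatenation stands in for the learned mixing of two neighbouring features, and the function names are made up for this example. It shows only the shape of the computation: each level halves the number of positions, so a context of 8 needs 3 levels.

```python
import math

def fuse_pairs(level):
    """Merge neighbours (0,1), (2,3), ... into one feature each."""
    return [level[i] + level[i + 1] for i in range(0, len(level), 2)]

def build_tree(context):
    """Fuse pairs repeatedly until one feature covers the whole context."""
    n = len(context)
    if n < 1 or n & (n - 1):
        raise ValueError("context length must be a power of two")
    level = [[tok] for tok in context]
    depth = 0
    while len(level) > 1:
        level = fuse_pairs(level)
        depth += 1
    return depth, level[0]

depth, root = build_tree(list("abcdefgh"))
print(depth, int(math.log2(8)))  # both are 3
print(root)                      # the top feature has seen all 8 positions
```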

Important for understanding that "transformer" is not the only sequence model, and that the idea of hierarchical mixing pre-dates attention. The next lecture introduces attention as a generalization of these mixing patterns.

Lecture 7: Let's build GPT

Crown jewel
Transcript: 07-Let's build GPT: from scratch, in code, spelled out..en.txt
Companion: ng-video-lecture

The crown-jewel lecture. Karpathy builds a full transformer character-level LM on Tiny Shakespeare, starting from a bigram baseline and adding one feature at a time:

  1. Attention. Single attention head as a "weighted bag of words" matmul → real Q/K/V attention.
  2. Multi-head attention.
  3. MLP.
  4. Stack into 6 blocks.
  5. Add residual connections, which gives a big loss drop.
  6. Add LayerNorm, pre-norm placement.
  7. Scale up.

The final model is gpt.py in ng-video-lecture, ~10M params, character-level, produces the famous Shakespeare-flavored output. This is the lecture that turns "GPT is magic" into "GPT is six PyTorch classes."
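
The first step of that list, going from a "weighted bag of words" to Q/K/V attention, can be sketched in plain Python. This is an illustration under simplifying assumptions, not the code from gpt.py: there are no learned projections (queries, keys and values are all the raw inputs), a single head, and tiny hand-written inputs. With uniform scores, the causal softmax makes every token the average of itself and its past; replacing the uniform scores with scaled query-key dot products gives attention.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def causal_softmax(scores):
    """Row-wise softmax where position t may only attend to positions <= t."""
    out = []
    for t, row in enumerate(scores):
        visible = row[: t + 1]
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        out.append([e / total for e in exps] + [0.0] * (len(row) - t - 1))
    return out

def attention(q, k, v):
    """Single-head causal attention with scaled dot-product scores."""
    d = len(q[0])
    scores = [[sum(qi * ki for qi, ki in zip(q[t], k[s])) / math.sqrt(d)
               for s in range(len(k))] for t in range(len(q))]
    return matmul(causal_softmax(scores), v)

T = 4
x = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]  # T tokens, 2 channels

# Weighted bag of words: uniform scores, so each token averages its past.
uniform = causal_softmax([[0.0] * T for _ in range(T)])
bag = matmul(uniform, x)

# Data-dependent weights: here q = k = v = x, with no learned projections.
out = attention(x, x, x)
print(bag)
print(out)
```

In both cases the first row of the output equals the first token, because position 0 has nothing earlier to attend to.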

Lecture 9: Let's build the GPT Tokenizer

(No #8 — the series skips a number.)
Transcript: 09-Let's build the GPT Tokenizer.en.txt
Companion: the minbpe repo (not in this corpus), plus the GPT-2 tokenizer in tiktoken

Karpathy's least-favorite topic, taught well. It walks through byte-pair encoding (BPE) and the GPT-2 tokenizer.

See tokenization and character-vs-bpe. The lecture is the deepest tokenizer treatment you'll find on YouTube.
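
For orientation, here is a minimal sketch of the byte-pair encoding training loop on raw UTF-8 bytes. It is written for this page rather than taken from minbpe or tiktoken, the function names are made up for the example, and it leaves out everything a production tokenizer adds (regex pre-splitting, special tokens, decoding).

```python
from collections import Counter

def pair_counts(ids):
    """Count adjacent token pairs."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Start from bytes (ids 0..255) and merge the most frequent pair each step."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        counts = pair_counts(ids)
        if not counts:
            break
        pair = max(counts, key=counts.get)
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges

ids, merges = train_bpe("aaabdaaabac", num_merges=3)
print(len("aaabdaaabac"), "bytes ->", len(ids), "tokens")
print(merges)
```

Each merge adds one entry to the vocabulary and shortens the sequence, which is the trade-off a tokenizer's vocabulary size controls.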

Lecture 10: Let's reproduce GPT-2 (124M)

Capstone
Transcript: 10-Let's reproduce GPT-2 (124M).en.txt
Companion: build-nanogpt

The capstone. Four hours of incremental optimization turn a "nanoGPT-ish baseline" into a faithful GPT-2 reproduction trained on FineWeb-Edu in 6 hours on 8×A100 for ~$10 of compute. The result reaches the HellaSwag accuracy reported in the GPT-3 paper.

Each optimization is introduced, justified, measured. See repos/build-nanogpt for the full step-by-step list.

By the end, you have actually reproduced GPT-2 — not "made a model that's GPT-shaped." That's the difference between Karpathy's pedagogy and most online tutorials.
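
As a sanity check on what "124M" means, the parameter count can be worked out by hand. The hyperparameters below (12 layers, width 768, vocabulary 50257, context 1024) are those of the smallest GPT-2 configuration and are not stated on this page. The count assumes the output head shares its weights with the token embedding, so it is not counted twice.

```python
def gpt2_param_count(n_layer=12, n_embd=768, vocab_size=50257, block_size=1024):
    """Parameter count of a GPT-2 style model with a weight-tied output head."""
    wte = vocab_size * n_embd            # token embedding (shared with the head)
    wpe = block_size * n_embd            # learned position embedding
    layernorm = 2 * n_embd               # scale and shift
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    block = 2 * layernorm + attn + mlp   # two LayerNorms per block
    return wte + wpe + n_layer * block + layernorm  # plus the final LayerNorm

total = gpt2_param_count()
print(f"{total:,} parameters")  # about 124M
```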

The shape of the arc

  1. micrograd: backprop on scalars.
  2. makemore bigram: language modeling in its simplest form.
  3. makemore MLP: real neural net architecture.
  4. makemore deep: how training stays stable.
  5. makemore backprop ninja: backprop from scratch.
  6. makemore wavenet: hierarchical mixing.
  7. ng-video-lecture: transformer + attention.
  8. (skipped)
  9. tokenizer: BPE.
  10. build-nanogpt: real GPT-2 reproduction at scale.

The progression is: small clear primitives → composed into bigger systems → optimized for real-world hardware. Each lecture is independently watchable, but the cumulative arc is what makes the series special.
