The Zero-to-Hero Lecture Arc

Andrej Karpathy's "Neural Networks: Zero to Hero" YouTube series is a 9-lecture, ~30-hour course that walks from "what is a derivative" all the way to reproducing GPT-2 (124M) from scratch. The transcripts for all 9 lectures are in sources/transcripts/. This page is a reading map for them.

The remarkable thing about the series is how cumulative it is — each lecture builds on the previous, and skipping lectures will burn you. There is no Karpathy lecture that's "just an overview." Every one is hands-on coding.

Lecture 1: Building micrograd

Foundational

Transcript: 01-The spelled-out intro to neural networks and backpropagation: building micrograd.en.txt

Companion: micrograd

The foundational lecture. Karpathy:

Defines what a derivative is (with finite differences and a Jupyter notebook).
Builds the Value class operator by operator. See value-class.
Derives _backward closures by hand for +, *, tanh, etc.
Shows the topological sort and reverse-pass algorithm. See backpropagation.
Compares micrograd's gradients to PyTorch's on the same expression.
Builds an MLP, trains it on a tiny dataset.

If you don't already have intuition for backprop, this lecture is the place to develop it.
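The pieces listed above fit together roughly as in the following sketch. It is a cut-down illustration in the spirit of micrograd, not the repo's actual code: it supports only `+`, `*`, and `tanh`, and the input values are arbitrary. The gradient is checked against a central finite difference.

```python
import math

class Value:
    """A scalar that remembers how it was computed (illustrative sketch)."""

    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._backward = lambda: None  # leaf nodes have nothing to propagate

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # d(a+b)/da = 1 and d(a+b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # d(a*b)/da = b and d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            # d tanh(x)/dx = 1 - tanh(x)^2
            self.grad += (1.0 - t * t) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

# One tanh neuron with arbitrary example values.
x, w, b = Value(0.5), Value(-1.5), Value(0.25)
y = (x * w + b).tanh()
y.backward()

# Finite-difference check of dy/dw.
def f(xv, wv, bv):
    return math.tanh(xv * wv + bv)

h = 1e-6
numeric_dw = (f(0.5, -1.5 + h, 0.25) - f(0.5, -1.5 - h, 0.25)) / (2 * h)
print(w.grad, numeric_dw)
```

The gradients accumulate with `+=` so that a value used more than once in an expression receives the sum of all its contributions.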

Lecture 2: makemore intro (bigram)

Transcript: 02-The spelled-out intro to language modeling: building makemore.en.txt

Companion: makemore

Switches from numerical functions to language modeling. Trains a bigram model two ways:

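As a counts table: count how often each character follows each other character, normalize each row to a probability distribution, sample from it.
As a neural net: a single linear layer (one weight matrix applied to one-hot encoded characters) trained by gradient descent on the negative log-likelihood, i.e. cross-entropy loss.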

Shows that the two reach essentially the same loss and the same predictions, then argues that the neural-net version is the one that generalizes (you can add more layers, attention, etc.). The framing, "language modeling is just next-token prediction with cross-entropy loss," is set up here and reused for the rest of the series.
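A minimal sketch of the counts-table approach, using only the standard library. The five-name word list is a hypothetical stand-in for the names dataset, and `.` is assumed as the start/end marker; the function names are illustrative.

```python
import math
import random
from collections import Counter

# Hypothetical stand-in for the names dataset.
words = ["emma", "olivia", "ava", "isabella", "sophia"]

# Count how often each character follows each other character.
counts = Counter()
for w in words:
    chars = ["."] + list(w) + ["."]
    for a, b in zip(chars, chars[1:]):
        counts[(a, b)] += 1

vocab = sorted({c for pair in counts for c in pair})

# Normalize each row of the table into a probability distribution.
probs = {}
for a in vocab:
    row_total = sum(counts[(a, b)] for b in vocab)
    probs[a] = {b: counts[(a, b)] / row_total for b in vocab}

def sample_name(rng):
    """Walk the table from the start marker until the end marker is drawn."""
    out, ch = [], "."
    while True:
        row = probs[ch]
        ch = rng.choices(list(row), weights=list(row.values()))[0]
        if ch == ".":
            return "".join(out)
        out.append(ch)

def average_nll(word_list):
    """Average negative log-likelihood per character transition."""
    total, n = 0.0, 0
    for w in word_list:
        chars = ["."] + list(w) + ["."]
        for a, b in zip(chars, chars[1:]):
            total += -math.log(probs[a][b])
            n += 1
    return total / n

print(sample_name(random.Random(0)), average_nll(words))
```

The average negative log-likelihood computed here is the same quantity the neural-net version minimizes, which is why the two approaches can be compared directly.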

Lecture 3: makemore MLP

Transcript: 03-Building makemore Part 2: MLP.en.txt

Companion: makemore, MLP architecture

Implements Bengio et al. 2003: concatenate embeddings of the last 3 chars, pass through a hidden layer, predict next char. Topics introduced:

Train/val/test splits: why you need them, how to pick them.
Mini-batch training: why batches help, how to choose batch size.
Learning rate finding: plot loss vs LR, eyeball the sweet spot.

Same names dataset, much better loss than bigram. Output is more name-like.
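The "last 3 chars predict the next char" setup and the data split can be sketched as below. The word list, the `.` padding marker, the 80/10/10 proportions, and the function name `build_dataset` are assumptions made for illustration; only the data preparation is shown, not the MLP itself.

```python
import random

def build_dataset(words, block_size=3):
    """Turn words into (context, next_char) pairs with a sliding window."""
    X, Y = [], []
    for w in words:
        context = ["."] * block_size          # pad the start of each word
        for ch in w + ".":                    # "." also marks the end
            X.append(tuple(context))
            Y.append(ch)
            context = context[1:] + [ch]      # slide the window by one
    return X, Y

# Hypothetical stand-in for the names dataset.
words = ["emma", "olivia", "ava", "isabella", "sophia",
         "mia", "amelia", "harper", "evelyn", "luna"]

# Shuffle once, then split by word so no word leaks across splits.
shuffled = words[:]
random.Random(42).shuffle(shuffled)
n1 = len(shuffled) * 8 // 10
n2 = len(shuffled) * 9 // 10
train_words, val_words, test_words = shuffled[:n1], shuffled[n1:n2], shuffled[n2:]

Xtr, Ytr = build_dataset(train_words)
Xval, Yval = build_dataset(val_words)
Xte, Yte = build_dataset(test_words)
print(len(Xtr), len(Xval), len(Xte))
```

Each context tuple would be mapped to embeddings and concatenated before the hidden layer; the validation split is for tuning hyperparameters and the test split is touched only at the end.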

Lecture 4: makemore activations & gradients, BatchNorm

Most important

Transcript: 04-Building makemore Part 3: Activations & Gradients, BatchNorm.en.txt

Companion: makemore, deeper MLP

The most important lecture for understanding why training fails or succeeds. Builds a deep MLP and shows:

With default init, deep layers saturate the tanh, gradients vanish, training stalls.
Hand-tuned init fixes activation magnitudes layer-by-layer.
BatchNorm automates the fix.
Activation/gradient histograms per layer let you diagnose training health visually.

Even though transformers use LayerNorm instead of BatchNorm, the lecture is essential — it teaches the diagnostic skill of looking at activation statistics, which is what you do when debugging any training failure. See training-stability and weight-init.
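The saturation diagnostic can be illustrated without any framework. The sketch below assumes "default init" means unit-variance Gaussian weights with no scaling, and compares it with weights scaled by (5/3)/sqrt(fan_in), the usual gain for tanh. The width, depth, batch size, and the 0.97 saturation threshold are arbitrary choices for illustration.

```python
import math
import random

def saturation_per_layer(weight_scale, width=64, depth=4, batch=32, seed=0):
    """Fraction of tanh outputs with |activation| > 0.97, for each layer."""
    rng = random.Random(seed)
    acts = [[rng.gauss(0, 1) for _ in range(width)] for _ in range(batch)]
    fractions = []
    for _ in range(depth):
        # W is stored as a list of columns: one weight vector per output unit.
        W = [[rng.gauss(0, 1) * weight_scale for _ in range(width)]
             for _ in range(width)]
        acts = [[math.tanh(sum(x * w for x, w in zip(row, col))) for col in W]
                for row in acts]
        saturated = sum(abs(a) > 0.97 for row in acts for a in row)
        fractions.append(saturated / (batch * width))
    return fractions

width = 64
naive = saturation_per_layer(weight_scale=1.0, width=width)
scaled = saturation_per_layer(weight_scale=(5 / 3) / math.sqrt(width), width=width)

for i, (n, s) in enumerate(zip(naive, scaled)):
    print(f"layer {i}: unscaled {n:.2f} saturated, scaled {s:.2f} saturated")
```

A saturated tanh has a local derivative near zero, so a layer where most units are saturated passes almost no gradient backward. The same per-layer statistic is what the histograms in the lecture visualize.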

Lecture 5: Becoming a Backprop Ninja

Most demanding

Transcript: 05-Building makemore Part 4: Becoming a Backprop Ninja.en.txt

Companion: makemore

The most demanding lecture. Karpathy disables loss.backward() and reimplements every backward pass by hand: the cross-entropy backward, the softmax backward, the BatchNorm backward (notoriously tricky), the MLP backward, the embedding backward. At each step he checks gradient correctness against PyTorch's auto-computed gradient.

Optional but transformative — by the end you can write a CUDA backward kernel and know if it's right. See backpropagation.
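As a small taste of the exercise, here is the hand-derived backward pass for softmax followed by cross-entropy on a single example. The lecture checks against PyTorch's autograd; this sketch swaps in a finite-difference check so it runs with the standard library alone. The logits and target are arbitrary.

```python
import math

def softmax(logits):
    m = max(logits)                       # subtract the max for stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, target):
    return -math.log(softmax(logits)[target])

def cross_entropy_backward(logits, target):
    """Hand-derived gradient: dL/dlogits = softmax(logits) - one_hot(target)."""
    grad = softmax(logits)
    grad[target] -= 1.0
    return grad

def numerical_grad(logits, target, h=1e-6):
    """Central finite differences, one logit at a time."""
    grads = []
    for i in range(len(logits)):
        up, down = logits[:], logits[:]
        up[i] += h
        down[i] -= h
        grads.append((cross_entropy(up, target) - cross_entropy(down, target)) / (2 * h))
    return grads

logits = [2.0, -1.0, 0.5, 0.1]
target = 2
manual = cross_entropy_backward(logits, target)
numeric = numerical_grad(logits, target)
max_diff = max(abs(a - b) for a, b in zip(manual, numeric))
print(max_diff)
```

The gradient sums to zero across the logits: probability mass pushed onto the target is pulled from the other classes.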

Lecture 6: makemore WaveNet

Transcript: 06-Building makemore Part 5: Building a WaveNet.en.txt

Companion: makemore, CNN architecture

Builds a dilated 1D convolutional model inspired by DeepMind's WaveNet. Treats sequence modeling as a hierarchical tree: positions 1 and 2 mix into a layer-1 feature, positions 3 and 4 mix into another, then pairs of layer-1 features mix in layer 2, and so on. The tree depth grows as log2(context_length) instead of linearly with the context length.

Important for understanding that "transformer" is not the only sequence model, and that the idea of hierarchical mixing pre-dates attention. The next lecture introduces attention as a generalization of these mixing patterns.
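The tree structure can be made concrete with a toy sketch that fuses neighbours pairwise and counts the levels. It only illustrates the mixing pattern (nested tuples stand in for learned features) and assumes the context length is a power of two; the function names are hypothetical.

```python
def fuse_pairs(items):
    """Combine neighbours (0,1), (2,3), ... into one node each."""
    return [(items[i], items[i + 1]) for i in range(0, len(items), 2)]

def build_tree(positions):
    """Fuse pairwise until one node remains; return the tree and its depth."""
    level = list(positions)
    n = len(level)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes a power-of-two length"
    depth = 0
    while len(level) > 1:
        level = fuse_pairs(level)
        depth += 1
    return level[0], depth

tree, depth = build_tree(range(1, 9))   # context length 8
print(tree)
print(depth)
```

In the real model each fusion is a learned layer applied to the concatenated features of the two children, so every layer doubles the span of context a feature has seen.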

Lecture 7: Let's build GPT

Crown jewel

Transcript: 07-Let's build GPT: from scratch, in code, spelled out..en.txt

Companion: ng-video-lecture

The crown-jewel lecture. Karpathy builds a full transformer character-level LM on Tiny Shakespeare, starting from a bigram baseline and adding one feature at a time:

Attention. Single attention head as "weighted bag of words" matmul → real Q/K/V attention.

Multi-head attention.

MLP.

Stack into 6 blocks.

Add residual connections — big loss drop.

Add LayerNorm, pre-norm placement.

Scale up.

The final model is gpt.py in ng-video-lecture, ~10M params, character-level, produces the famous Shakespeare-flavored output. This is the lecture that turns "GPT is magic" into "GPT is six PyTorch classes."
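The first step in the list, going from a "weighted bag of words" to Q/K/V attention, can be sketched in plain Python. This is a single causal head on random inputs, written with lists for readability rather than speed; the sizes, the random weights, and the helper names are illustrative assumptions, not code from gpt.py.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def causal_softmax(scores):
    """Row-wise softmax where position t only sees positions 0..t."""
    out = []
    for t, row in enumerate(scores):
        visible = row[: t + 1]
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        out.append([e / total for e in exps] + [0.0] * (len(row) - t - 1))
    return out

def attention_head(x, Wq, Wk, Wv):
    q, k, v = matmul(x, Wq), matmul(x, Wk), matmul(x, Wv)
    head_size = len(q[0])
    k_transposed = [list(col) for col in zip(*k)]
    # Scaled dot-product scores between every query and every key.
    scores = [[s / math.sqrt(head_size) for s in row]
              for row in matmul(q, k_transposed)]
    weights = causal_softmax(scores)
    return matmul(weights, v), weights

rng = random.Random(0)
T, C, head_size = 4, 8, 4   # sequence length, embedding size, head size

def rand_matrix(rows, cols):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

x = rand_matrix(T, C)
Wq, Wk, Wv = (rand_matrix(C, head_size) for _ in range(3))
out, weights = attention_head(x, Wq, Wk, Wv)

# "Weighted bag of words": with all scores equal, each position simply
# averages itself and everything before it.
uniform = causal_softmax([[0.0] * T for _ in range(T)])
print(uniform[2])
```

Real attention replaces the uniform weights with data-dependent ones: each position's query decides how much of each earlier position's value to pull in.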

Lecture 9: Let's build the GPT Tokenizer

(No #8 — the series skips a number.)

Transcript: 09-Let's build the GPT Tokenizer.en.txt

Companion: the minbpe repo (not in this corpus), plus the GPT-2 tokenizer in tiktoken

Karpathy's least-favorite topic, taught well. Walks through:

Why character-level tokenization is bad for serious LLMs.
The BPE algorithm from scratch.
The byte-level twist (work on raw UTF-8 bytes).
Foot-guns: spelling, arithmetic, non-English, trailing whitespace, SolidGoldMagikarp.
The GPT-2 tokenizer's regex-based pre-tokenization rules.
SentencePiece (the Llama tokenizer).

See tokenization and character-vs-bpe. The lecture is the deepest tokenizer treatment you'll find on YouTube.
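A minimal sketch of byte-level BPE training, covering the "from scratch" and "raw UTF-8 bytes" points above. It leaves out the regex pre-tokenization and special tokens that real tokenizers use, and the function names are illustrative rather than taken from minbpe.

```python
from collections import Counter

def pair_counts(ids):
    """Count adjacent token pairs."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of pair with new_id."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    ids = list(text.encode("utf-8"))      # start from raw bytes, ids 0..255
    merges = {}
    for step in range(num_merges):
        counts = pair_counts(ids)
        if not counts:
            break
        pair = max(counts, key=counts.get)  # most frequent adjacent pair
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges

def decode(ids, merges):
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():   # dicts keep insertion order
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8")

text = "aaabdaaabac"
ids, merges = train_bpe(text, 3)
print(ids, merges)
print(decode(ids, merges))
```

Because the base vocabulary is the 256 byte values, any string can be encoded without an unknown token, including non-English text and emoji.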

Lecture 10: Let's reproduce GPT-2 (124M)

Capstone

Transcript: 10-Let's reproduce GPT-2 (124M).en.txt

Companion: build-nanogpt

The capstone. Four hours of incremental optimization that turns a "nanoGPT-ish baseline" into a faithful GPT-2 reproduction trained on FineWeb-Edu in 6 hours on 8×A100 for ~$10 of compute. The result reaches roughly the HellaSwag accuracy that the GPT-3 paper reports for a model of the same size.

Each optimization is introduced, justified, measured. See repos/build-nanogpt for the full step-by-step list.

By the end, you have actually reproduced GPT-2 — not "made a model that's GPT-shaped." That's the difference between Karpathy's pedagogy and most online tutorials.
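For orientation, the "124M" in the title can be recovered with a few lines of arithmetic. The sketch assumes the standard GPT-2 small configuration (12 layers, 768-dimensional embeddings, a 50,257-token vocabulary, a 1,024-token context) and an output head that shares weights with the token embedding; the function name is hypothetical.

```python
def gpt2_param_count(n_layer=12, n_embd=768, vocab_size=50257, block_size=1024):
    """Parameter count for a GPT-2 style model with a weight-tied output head."""
    wte = vocab_size * n_embd                 # token embeddings (shared with the head)
    wpe = block_size * n_embd                 # learned position embeddings
    layer_norm = 2 * n_embd                   # scale and shift
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)  # QKV + output projection
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)  # expand + contract
    block = 2 * layer_norm + attn + mlp
    return wte + wpe + n_layer * block + layer_norm   # plus the final LayerNorm

total = gpt2_param_count()
print(f"{total:,}")
```

Under these assumptions roughly 31% of the parameters sit in the token embedding table, which is one reason the vocabulary size matters at this scale.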

The shape of the arc

1. micrograd: backprop on scalars.
2. makemore bigram: language modeling in its simplest form.
3. makemore MLP: real neural net architecture.
4. makemore deep: how training stays stable.
5. makemore backprop ninja: backprop from scratch.
6. makemore wavenet: hierarchical mixing.
7. ng-video-lecture: transformer + attention.
8. (skipped)
9. tokenizer: BPE.
10. build-nanogpt: real GPT-2 reproduction at scale.

The progression is: small clear primitives → composed into bigger systems → optimized for real-world hardware. Each lecture is independently watchable, but the cumulative arc is what makes the series special.

Related

All the repo pages: micrograd, makemore, ng-video-lecture, nanoGPT, build-nanogpt, llama2-c, llm-c
All the concept pages link back here when they cite a specific lecture.
