Karpathy's LLM Pedagogy
This wiki covers Andrej Karpathy's published teaching corpus on language models — seven open-source repositories and a nine-lecture YouTube series ("Neural Networks: Zero to Hero"). Together they trace the technical lineage from "what is backpropagation" through to "here is a working reproduction of GPT-2 (124M)."
The corpus is unusually coherent. The same patterns and abstractions recur across repos — Block, MultiHeadAttention, configure_optimizers, estimate_mfu, from_pretrained — at progressively bigger scales. Reading any one repo in isolation works, but reading them in order shows you the underlying ideas being refined.
Reading guide
If you're starting from zero and want the full arc, the order is:
- zero-to-hero-arc: The lecture map. Read this first.
- repos/micrograd: Scalar autograd. The conceptual root.
- backpropagation and value-class: The algorithm and its data structure.
- repos/makemore: First real LMs. Bigram → MLP → ... → Transformer.
- repos/ng-video-lecture: Character-level GPT on Tiny Shakespeare.
- repos/nanoGPT: Production-grade GPT-2 implementation.
- repos/build-nanogpt: Faithful GPT-2 reproduction with every optimization.
- repos/llama2-c: Llama 2 in PyTorch + pure C inference. The "modern" architecture.
- repos/llm-c: Same training task as build-nanogpt, in pure C/CUDA.
If you want to learn a specific concept, jump to the concept page; each one cross-references the repos that demonstrate it.
The architecture, in pieces
The transformer architecture as Karpathy teaches it, broken into independent pieces:
| Topic | Page |
|---|---|
| The repeating unit | transformer-block |
| Information mixing across positions | attention |
| Stability mechanism for deep stacks | residual-connections |
| Per-layer normalization | layernorm-vs-rmsnorm |
| Per-position nonlinearity | gelu-and-swiglu |
| Positional information (GPT-2 vs Llama) | rope |
| Vocabulary and embedding | tokenization, character-vs-bpe |
| Embedding-unembedding sharing | weight-tying |
Training, in pieces
| Topic | Page |
|---|---|
| Gradient computation | backpropagation, value-class |
| Parameter update | adamw |
| Initialization | weight-init |
| Learning rate over time | learning-rate-schedules |
| Batches and effective batch size | gradient-accumulation, dataloader |
| Numerical precision | mixed-precision-and-mfu |
| Keeping training alive | training-stability |
| Downstream evaluation | hellaswag-eval |
Inference
| Topic | Page |
|---|---|
| Token selection | sampling |
| Generation acceleration | kv-cache |
| Pure-C runtime | repos/llama2-c |
Three "model families" to compare
The corpus contains three subtly different transformer architectures, useful to compare against each other:
| Component | GPT-2 (ng-video-lecture, nanoGPT, build-nanogpt, llm.c) | Llama 2 (llama2.c) | makemore Transformer |
|---|---|---|---|
| Normalization | LayerNorm | RMSNorm | LayerNorm |
| Positional | Learned embedding | RoPE | Learned embedding |
| Activation | GELU | SwiGLU | GELU |
| Tokenizer | BPE (50257); ng-video-lecture is character-level | SentencePiece BPE (32000) | Character-level |
| Attention | Multi-head | Grouped-query | Multi-head |
Same skeleton, different organs. Once you know the skeleton (the transformer block wrapped in residuals and a stack), swapping organs is straightforward.
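To make the "skeleton versus organs" idea concrete, here is a minimal pure-Python sketch, not code from any of the repos. It assumes a pre-norm residual wrapper of the form `x + sublayer(norm(x))` and omits the learned gain and bias parameters that real LayerNorm and RMSNorm layers carry. `toy_sublayer` is a hypothetical stand-in for attention or the MLP.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """LayerNorm without learned gain/bias: center, then divide by the std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """RMSNorm without learned gain: divide by the root-mean-square, no centering."""
    mean_square = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_square + eps) for v in x]


def residual_block(
    x: Vector,
    norm: Callable[[Vector], Vector],
    sublayer: Callable[[Vector], Vector],
) -> Vector:
    """The skeleton: normalize, apply the sublayer, add the result back to x."""
    y = sublayer(norm(x))
    return [a + b for a, b in zip(x, y)]


def toy_sublayer(x: Vector) -> Vector:
    """Hypothetical stand-in for attention or the MLP; it just scales its input."""
    return [0.5 * v for v in x]


x = [1.0, 2.0, 3.0, 4.0]
gpt2_style = residual_block(x, layer_norm, toy_sublayer)   # LayerNorm organ
llama_style = residual_block(x, rms_norm, toy_sublayer)    # RMSNorm organ
```

Only the `norm` argument changes between the two calls. The same reasoning applies to the activation and positional rows of the table: the wrapper stays fixed and one component is substituted.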
What's not in this wiki
Things outside the scope of the corpus:
Cross-reference conventions
Every page in this wiki uses inline markdown links: [name](name.md) for concepts, [name](repos/name.md) for repos. The link text is usually the unqualified name; the path tells you whether the target is a concept page or a repo page.
For agents post-processing this wiki: every page is a self-contained topic that can be rendered as a single HTML page. Internal links between pages are the primary structural signal of the wiki graph. A layout with a separate concepts/ directory was rejected in favor of keeping concepts at the wiki root and repos in a subdirectory: concepts are first-class citizens, and repos are case studies that ground them.
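A post-processing agent could recover the wiki graph from those links roughly as follows. This is a sketch under stated assumptions: link paths are taken to be written relative to the wiki root, the regex covers only the simple `[name](path.md)` form described above, and `extract_links`, `build_graph`, and `sample_pages` are hypothetical names invented for illustration.

```python
import re
from typing import Dict, List, Tuple

# Matches [text](path.md); assumes no nested brackets or spaces in the path.
LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)\s]+\.md)\)")


def extract_links(markdown: str) -> List[Tuple[str, str, str]]:
    """Return (link_text, path, kind) per internal link; kind is 'repo' or 'concept'."""
    links = []
    for text, path in LINK_RE.findall(markdown):
        # The path prefix, not the link text, distinguishes repo pages from concepts.
        kind = "repo" if path.startswith("repos/") else "concept"
        links.append((text, path, kind))
    return links


def build_graph(pages: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each page to the sorted, de-duplicated list of pages it links to."""
    return {
        page: sorted({path for _, path, _ in extract_links(body)})
        for page, body in pages.items()
    }


# Hypothetical page bodies, invented for this example.
sample_pages = {
    "attention.md": "See [kv-cache](kv-cache.md) and [nanoGPT](repos/nanoGPT.md).",
    "kv-cache.md": "Used by [llama2-c](repos/llama2-c.md).",
}
graph = build_graph(sample_pages)
```

Classifying by path prefix follows the convention that the path, not the link text, says whether a target is a concept or a repo. If pages under repos/ use `../`-style relative paths, the paths would need normalizing before classification.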