build-nanogpt
The companion repo for lecture 10, "Let's reproduce GPT-2 (124M)."
Takes nanoGPT and walks through every optimization that goes into
actually reproducing GPT-2 faithfully on modern hardware. The play.ipynb notebook in
the repo is the lecture's working artifact; train_gpt2.py is the cleaned-up training script.
What's in the repo
| File | Size | Description |
| --- | --- | --- |
| README.md | | |
| train_gpt2.py | 521 lines | the full training script |
| play.ipynb | | the working notebook from the lecture |
| fineweb.py | | dataset preprocessor: HuggingFace fineweb-edu → token shards |
| hellaswag.py | | HellaSwag eval rendering |
| input.txt | | Tiny Shakespeare (a fallback) |
The full reproduction targets a 6-hour wall-clock run on 8×A100 to reach the GPT-3 paper's HellaSwag accuracy. The cost on Lambda Cloud was about $10 of compute at the time of recording.
The lecture this comes from
Lecture 10, "Let's reproduce GPT-2 (124M)", is the longest in the series (~4 hours). It's structured as a series of optimizations to a baseline:
- Start from a minimal model. Same as nanoGPT.
- Add the GPT-2 specific details. Weight init with `std=0.02`, scaled init for residual projections, `tanh` GELU approximation, AdamW with `betas=(0.9, 0.95)`, selective weight decay.
- Optimize the data pipeline. `DataLoaderLite` over `.npy` shards instead of in-memory tensors. See dataloader.
- Switch to FineWeb-Edu. The high-quality 10B-token subset. `fineweb.py` downloads and tokenizes.
- Use bfloat16 autocast and TF32 matmuls. Mixed precision gets ~5-10× speedup over fp32.
- `torch.compile`. Graph-mode execution. Another ~30% speedup.
- Flash attention via `scaled_dot_product_attention`. Saves memory and improves speed.
- Vocab padding to multiple of 64. `vocab_size=50304` (50257 padded). Aligns matmuls to nice multiples for tensor cores.
- DDP. Distribute across the 8 GPUs.
- Gradient accumulation. Hit the target 524288-token batch size.
- Cosine LR schedule with warmup, gradient clipping.
- HellaSwag eval. Verify the trained model on a real downstream eval.
Each step is shown to improve either loss or throughput. By the end, the script reproduces the GPT-3 paper's HellaSwag accuracy in one ~6-hour run.
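The cosine schedule with linear warmup from the list above can be sketched in a few lines of plain Python. This is a minimal sketch, not a copy of the script: the function borrows the name `get_lr` that the training script uses, but the constants `MAX_LR`, `MIN_LR`, `WARMUP_STEPS`, and `MAX_STEPS` are illustrative assumptions chosen for readability.

```python
import math

# Illustrative hyperparameters (assumptions, not values quoted from the lecture).
MAX_LR = 6e-4
MIN_LR = MAX_LR * 0.1
WARMUP_STEPS = 100
MAX_STEPS = 1000


def get_lr(it: int) -> float:
    """Learning rate for step `it`: linear warmup, then cosine decay to MIN_LR."""
    # 1) Linear warmup: ramp from MAX_LR / WARMUP_STEPS up to MAX_LR.
    if it < WARMUP_STEPS:
        return MAX_LR * (it + 1) / WARMUP_STEPS
    # 2) Past the end of the schedule, hold at the floor.
    if it > MAX_STEPS:
        return MIN_LR
    # 3) In between, follow half a cosine wave from MAX_LR down to MIN_LR.
    decay_ratio = (it - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes from 1 to 0
    return MIN_LR + coeff * (MAX_LR - MIN_LR)
```

In a training loop, the returned value would typically be written into each optimizer parameter group's `lr` entry before the optimizer step.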
What's in train_gpt2.py
The model code (the first ~200 lines) is almost identical to
nanoGPT/model.py, with a few faithfulness tweaks:
- `nn.GELU(approximate='tanh')` instead of plain `nn.GELU`.
- `NANOGPT_SCALE_INIT` marker on residual projections for scaled init.
- `vocab_size=50257` in the config (and `50304` at model construction for padding).
- `bias=True` everywhere by default (GPT-2 had biases).
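Two of the tweaks above come down to simple arithmetic, sketched here under stated assumptions. The helper names `pad_to_multiple` and `residual_init_std` are hypothetical. The scaling rule assumes the residual-projection std is shrunk by `1/sqrt(2 * n_layer)`, on the reasoning that each transformer block adds to the residual stream twice (attention and MLP), and `n_layer=12` is the depth of the 124M model.

```python
def pad_to_multiple(n: int, multiple: int) -> int:
    """Round n up to the nearest multiple of `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple


def residual_init_std(base_std: float, n_layer: int) -> float:
    """Shrink the init std for residual projections.

    Assumes two residual additions per block, hence the factor 2 * n_layer.
    """
    return base_std * (2 * n_layer) ** -0.5


padded_vocab = pad_to_multiple(50257, 64)
scaled_std = residual_init_std(0.02, n_layer=12)
print(padded_vocab)  # 50304
print(scaled_std)    # a bit over 0.004, versus 0.02 for the other weights
```

The extra 47 vocabulary rows are never produced by the tokenizer, so they only cost a little memory in exchange for better-aligned matrix shapes.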
The training loop (the last ~300 lines) is where the lecture-driven optimizations live:
- `DataLoaderLite` reading sharded `.npy` token files.
- `torch.set_float32_matmul_precision('high')` for TF32 matmuls.
- `with torch.autocast(device_type=device_type, dtype=torch.bfloat16):` around the forward.
- `grad_accum_steps = total_batch_size // (B * T * ddp_world_size)`: see gradient-accumulation.
- `get_lr(it)`: cosine schedule with linear warmup.
- `clip_grad_norm_(model.parameters(), 1.0)`: training stability.
- Periodic validation loss, HellaSwag eval, and sample generation every 250 steps.
- DDP setup with `init_process_group(backend='nccl')`.
The full code is dense but readable. Every line is purposeful.
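The gradient-accumulation formula from the training loop can be worked through with concrete numbers. In this sketch `total_batch_size` and `ddp_world_size` follow the figures given on this page, while `B`, `T`, and the `micro_losses` values are hypothetical choices for illustration. The second half shows why each micro-step loss is usually divided by `grad_accum_steps`: the accumulated sum then equals the mean over the full batch.

```python
total_batch_size = 524288  # target tokens per optimizer step (2**19)
B = 16                     # micro-batch size per GPU (hypothetical)
T = 1024                   # sequence length (hypothetical)
ddp_world_size = 8         # number of GPUs

tokens_per_micro_step = B * T * ddp_world_size
assert total_batch_size % tokens_per_micro_step == 0, "batch size must divide evenly"
grad_accum_steps = total_batch_size // tokens_per_micro_step
print(grad_accum_steps)  # 4

# Scaling each micro-step loss by 1 / grad_accum_steps makes the accumulated
# value match the mean loss over the whole batch (made-up loss values).
micro_losses = [2.0, 4.0, 6.0, 8.0]
accumulated = sum(loss / grad_accum_steps for loss in micro_losses)
print(accumulated)  # 5.0
```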
fineweb.py — the data prep
The training corpus is a 10B-token subset of FineWeb-Edu, a Hugging Face-curated educational subset of the FineWeb web crawl.
fineweb.py:
- Streams the dataset via `datasets.load_dataset("HuggingFaceFW/fineweb-edu", "sample-10BT")`.
- Tokenizes each document with `tiktoken.get_encoding("gpt2")`, prepending an `<|endoftext|>` token.
- Writes tokens to shard files of 100M tokens each, as `np.uint16` arrays (token IDs fit in 16 bits since vocab_size < 65536).
- First shard is `val`, remaining shards are `train`.
- Shards: 100 files, 100M tokens each
- On disk: ~20GB total
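The shard figures are easy to check by hand. The short calculation below uses only numbers stated on this page (100 shards, 100M tokens each, 16-bit token IDs, a 50257-entry vocabulary); the variable names are just for illustration.

```python
VOCAB_SIZE = 50257              # GPT-2 tokenizer vocabulary
UINT16_MAX = 2**16 - 1          # largest value a 16-bit unsigned integer can hold
NUM_SHARDS = 100
TOKENS_PER_SHARD = 100_000_000
BYTES_PER_TOKEN = 2             # one uint16 per token

# The largest token ID is VOCAB_SIZE - 1, which must fit in 16 bits.
assert VOCAB_SIZE - 1 <= UINT16_MAX

total_tokens = NUM_SHARDS * TOKENS_PER_SHARD
total_gb = total_tokens * BYTES_PER_TOKEN / 1e9
print(total_tokens)  # 10000000000
print(total_gb)      # 20.0
```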
The training script just `np.load`s them in order.
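Reading shards in order across several processes can be sketched as follows. This is a simplified stand-in, not the repo's `DataLoaderLite`: the class name `ShardReader` is hypothetical, shards are plain Python lists rather than `.npy` files, and the assumption is that each rank reads its own `B * T` chunk and then skips past the chunks belonging to the other ranks.

```python
class ShardReader:
    """Hypothetical stand-in for a sharded token loader (not the repo's class)."""

    def __init__(self, shards, B, T, rank=0, world_size=1):
        self.shards = shards      # list of token lists, one per shard
        self.B, self.T = B, T
        self.rank, self.world_size = rank, world_size
        self.shard_idx = 0
        self.pos = B * T * rank   # each rank starts at its own offset

    def next_batch(self):
        B, T = self.B, self.T
        tokens = self.shards[self.shard_idx]
        # One extra token so every input position has a next-token target.
        buf = tokens[self.pos : self.pos + B * T + 1]
        x = [buf[i * T : (i + 1) * T] for i in range(B)]          # inputs
        y = [buf[i * T + 1 : (i + 1) * T + 1] for i in range(B)]  # targets
        # Skip over the chunks consumed by the other ranks.
        self.pos += B * T * self.world_size
        # If the next read would run off the end, move to the next shard.
        if self.pos + B * T * self.world_size + 1 > len(tokens):
            self.shard_idx = (self.shard_idx + 1) % len(self.shards)
            self.pos = B * T * self.rank
        return x, y


shards = [list(range(0, 20)), list(range(100, 120))]
reader = ShardReader(shards, B=2, T=3)
x, y = reader.next_batch()
print(x)  # [[0, 1, 2], [3, 4, 5]]
print(y)  # [[1, 2, 3], [4, 5, 6]]
```

Because targets are just inputs shifted by one token, a single contiguous read of `B * T + 1` tokens is enough to build both.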
HellaSwag eval
hellaswag.py provides the eval logic. HellaSwag is a multiple-choice commonsense
reasoning benchmark: each example has a context and four possible completions, exactly one of
which is human-written. The eval scores the model by picking the completion with the lowest
average per-token loss (the model's most likely choice).

    # from train_gpt2.py
    def get_most_likely_row(tokens, mask, logits):
        # evaluate the autoregressive loss at all positions
        # ... compute per-completion-region loss ...
        pred_norm = avg_loss.argmin().item()
        return pred_norm

This is the standard way LLMs are evaluated on multiple-choice benchmarks: score every option by perplexity, pick the lowest. The lecture reaches the GPT-3 paper's reported 0.29 accuracy on HellaSwag.
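The scoring rule can be illustrated without any model at all. The sketch below assumes per-token losses have already been computed for each candidate row and that a 0/1 mask marks the completion region. The function name `most_likely_row` and all numbers are made up for illustration, and it uses plain lists where the script works on tensors.

```python
def most_likely_row(token_losses, masks):
    """Return the index of the candidate with the lowest mean completion loss.

    token_losses[i][t]: loss of candidate i at token position t
    masks[i][t]: 1 where position t belongs to the completion, else 0
    """
    avg_losses = []
    for losses, mask in zip(token_losses, masks):
        n_completion_tokens = sum(mask)
        total = sum(loss * m for loss, m in zip(losses, mask))
        # Averaging keeps long completions from being penalized for length.
        avg_losses.append(total / n_completion_tokens)
    return min(range(len(avg_losses)), key=avg_losses.__getitem__)


# Four candidates sharing a two-token context (mask 0), then a completion (mask 1).
losses = [
    [1.0, 1.0, 3.0, 3.0],  # mean completion loss 3.0
    [1.0, 1.0, 2.0, 1.0],  # mean completion loss 1.5
    [1.0, 1.0, 4.0, 0.0],  # one-token completion, last position is padding: 4.0
    [1.0, 1.0, 2.5, 2.5],  # mean completion loss 2.5
]
masks = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 1, 1],
]
print(most_likely_row(losses, masks))  # 1
```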
Why this lecture matters
There are dozens of "build a GPT" tutorials online. Most produce a small model that "works" on Tiny Shakespeare and stop there. Karpathy's reproduction is one of the few that goes the whole way: real data, real hyperparameters, real eval, and a result that lands inside the GPT-3 paper's reported numbers on a real benchmark.
The lecture series ends here, at lecture 10, because there's nothing more to add at the 124M scale: you've reproduced GPT-2. Going further means scaling up, which is mostly engineering, not new ideas.
Related
- repos/nanoGPT: The predecessor this builds on.
- zero-to-hero-arc: Lecture 10.
- Concept pages: adamw · learning-rate-schedules · mixed-precision-and-mfu · gradient-accumulation · training-stability · weight-init · dataloader · hellaswag-eval