Repo · zero-to-hero · lecture 10

build-nanogpt

The companion repo for lecture 10, "Let's reproduce GPT-2 (124M)." It starts from nanoGPT and walks through each optimization needed to reproduce GPT-2 faithfully on modern hardware. The play.ipynb notebook in the repo is the lecture's working artifact; train_gpt2.py is the cleaned-up training script.

repo gpt-2 reproduction karpathy

What's in the repo

README.md
train_gpt2.py: the full training script (521 lines)
play.ipynb: the working notebook from the lecture
fineweb.py: dataset preprocessor (Hugging Face fineweb-edu → token shards)
hellaswag.py: HellaSwag eval rendering
input.txt: Tiny Shakespeare (a fallback)

The full reproduction targets about 6 hours of wall-clock time on 8×A100 to reach the GPT-3 paper's HellaSwag accuracy. On Lambda Cloud this came to about $10 of compute at the time of recording.

Wall-clock: ~6 hours
Hardware: 8×A100
Cloud cost: ~$10

The lecture this comes from

Lecture 10, "Let's reproduce GPT-2 (124M)", is the longest in the series (~4 hours). It's structured as a series of optimizations to a baseline:

  1. Start from a minimal model. Same as nanoGPT.
  2. Add the GPT-2 specific details. Normal init with std=0.02, scaled init for residual projections, tanh GELU approximation, AdamW with betas=(0.9, 0.95), selective weight decay.
  3. Optimize the data pipeline. DataLoaderLite over .npy shards instead of in-memory tensors. See dataloader.
  4. Switch to FineWeb-Edu. The high-quality 10B-token subset. fineweb.py downloads and tokenizes.
  5. Use bfloat16 autocast and TF32 matmuls. Mixed precision gets ~5-10× speedup over fp32.
  6. torch.compile. Graph-mode execution. Another ~30% speedup.
  7. Flash attention via scaled_dot_product_attention. Memory and speed.
  8. Vocab padding to multiple of 64. vocab_size=50304 (50257 padded). Aligns matmuls to nice multiples for tensor cores.
  9. DDP. Distribute across the 8 GPUs.
  10. Gradient accumulation. Hit the target 524288-token batch size.
  11. Cosine LR schedule with warmup, gradient clipping.
  12. HellaSwag eval. Verify the trained model on a real downstream eval.

Each step is shown to improve either loss or throughput. By the end, the script reproduces the GPT-3 paper's HellaSwag accuracy in one ~6-hour run.
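The arithmetic behind the vocab padding and gradient accumulation steps can be checked directly. This is a minimal sketch: the micro-batch size B=64, the sequence length T=1024, and the helper name pad_to_multiple are assumptions for illustration, not values stated on this page.

```python
def pad_to_multiple(n: int, multiple: int) -> int:
    """Round n up to the next multiple (hypothetical helper)."""
    return ((n + multiple - 1) // multiple) * multiple

# GPT-2's 50257-token vocabulary, padded up to a multiple of 64
vocab_size = pad_to_multiple(50257, 64)

total_batch_size = 524288   # tokens per optimizer step, equal to 2**19
B, T = 64, 1024             # assumed micro-batch size and sequence length
world_size = 8              # one process per GPU under DDP

# tokens consumed by one forward/backward pass across all GPUs
tokens_per_micro_step = B * T * world_size
assert total_batch_size % tokens_per_micro_step == 0

# number of micro-steps to accumulate before each optimizer step
grad_accum_steps = total_batch_size // tokens_per_micro_step

print(vocab_size, grad_accum_steps)
```

Under these assumed settings, dropping to a single GPU (world_size = 1) would raise grad_accum_steps from 1 to 8 while keeping the effective batch size unchanged.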

What's in train_gpt2.py

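The model code (the first ~200 lines) is almost identical to nanoGPT/model.py, with a few faithfulness tweaks drawn from the GPT-2 details in the lecture outline: std=0.02 init, scaled init for residual projections, and the tanh GELU approximation.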
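Two of the GPT-2 details from the lecture outline, the scaled init for residual projections and selective weight decay, come down to a few lines of logic. The sketch below uses plain Python and a hypothetical dictionary of parameter shapes rather than a real model; the 2-D rule for weight decay is a common convention and is assumed here, not quoted from the script.

```python
n_layer = 12          # depth of GPT-2 (124M)
base_std = 0.02       # std used for most weights

# Each block adds to the residual stream twice (attention, then MLP),
# so 2 * n_layer projections accumulate into it. Scaling their init
# by 1/sqrt(2 * n_layer) keeps the stream's variance from growing with depth.
resid_std = base_std * (2 * n_layer) ** -0.5

# Hypothetical parameter names and shapes, only to illustrate the split.
param_shapes = {
    "wte.weight": (50304, 768),
    "h.0.attn.c_attn.weight": (2304, 768),
    "h.0.attn.c_attn.bias": (2304,),
    "h.0.ln_1.weight": (768,),
}

# Assumed rule: tensors with 2+ dimensions (matmul weights, embeddings)
# get weight decay; 1-D tensors (biases, LayerNorm gains) do not.
decay = [name for name, shape in param_shapes.items() if len(shape) >= 2]
no_decay = [name for name, shape in param_shapes.items() if len(shape) < 2]

print(resid_std, decay, no_decay)
```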

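The training loop (the last ~300 lines) is where the lecture-driven optimizations live: the shard-based data loader, bfloat16 autocast and TF32, torch.compile, DDP, gradient accumulation, the cosine LR schedule with warmup, gradient clipping, and the HellaSwag eval.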
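As one example of what the training loop contains, a cosine learning-rate schedule with linear warmup can be sketched as follows. The hyperparameter values (peak rate, floor, warmup length, total steps) are assumptions chosen for illustration and are not stated on this page; the function name get_lr is likewise illustrative.

```python
import math

max_lr = 6e-4            # assumed peak learning rate
min_lr = max_lr * 0.1    # assumed floor: 10% of the peak
warmup_steps = 715       # assumed warmup length
max_steps = 19073        # assumed: about 10B tokens / 524288 tokens per step

def get_lr(step: int) -> float:
    # 1) linear warmup, starting above zero so step 0 still learns
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # 2) past the end of the schedule, hold at the floor
    if step > max_steps:
        return min_lr
    # 3) in between, cosine decay from max_lr down to min_lr
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)
```

The value returned for each step would be written into every optimizer parameter group before the optimizer step.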

The full code is dense but readable. Every line is purposeful.

fineweb.py — the data prep

The training corpus is a 10B-token subset of FineWeb-Edu, a Hugging Face-curated educational subset of the FineWeb web crawl.


fineweb.py:

  1. Streams the dataset via datasets.load_dataset("HuggingFaceFW/fineweb-edu", "sample-10BT").
  2. Tokenizes each document with tiktoken.get_encoding("gpt2") and prepends an <|endoftext|> token to it.
  3. Writes tokens to shard files of 100M tokens each, as np.uint16 arrays (token IDs fit in 16 bits since vocab_size < 65536).
  4. First shard is val, remaining shards are train.
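The sharding logic in the steps above can be sketched without any dependencies. This is a simplified illustration on plain Python lists with a tiny shard size; the function name shard_tokens is hypothetical, and the real script tokenizes with tiktoken and writes np.uint16 arrays to disk rather than keeping lists in memory.

```python
EOT = 50256  # id of <|endoftext|> in the GPT-2 tokenizer

def shard_tokens(docs, shard_size):
    """Yield lists of exactly shard_size tokens, then a final partial shard.

    docs is an iterable of already-tokenized documents (lists of ints).
    """
    buf = []
    for doc in docs:
        tokens = [EOT] + doc          # delimiter goes in front of each document
        assert all(0 <= t < 2**16 for t in tokens), "token ids must fit in uint16"
        buf.extend(tokens)
        while len(buf) >= shard_size:
            yield buf[:shard_size]    # a document may straddle two shards
            buf = buf[shard_size:]
    if buf:
        yield buf                     # leftover tokens form a short last shard

docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
shards = list(shard_tokens(docs, shard_size=5))

# first shard is held out for validation, the rest are training data
splits = ["val" if i == 0 else "train" for i in range(len(shards))]
print(list(zip(splits, shards)))
```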

Shards: 100 files, 100M tokens each
On disk: ~20GB total

The training script just np.loads them in order.
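Reading shards in order while serving several DDP processes can be sketched as below. ShardReader is a hypothetical stand-in for the lecture's DataLoaderLite, written against plain lists instead of loaded .npy arrays; the strided-offset scheme is an assumption about how processes avoid reading the same tokens.

```python
class ShardReader:
    """Hypothetical sketch of a sequential shard loader using plain lists."""

    def __init__(self, shards, B, T, rank=0, world_size=1):
        self.shards, self.B, self.T = shards, B, T
        self.rank, self.world_size = rank, world_size
        self.shard_idx = 0
        self.pos = B * T * rank       # each process starts at its own offset

    def next_batch(self):
        B, T = self.B, self.T
        tokens = self.shards[self.shard_idx]
        buf = tokens[self.pos : self.pos + B * T + 1]   # +1 for shifted targets
        x = [buf[i * T : (i + 1) * T] for i in range(B)]          # inputs
        y = [buf[i * T + 1 : (i + 1) * T + 1] for i in range(B)]  # next tokens
        # skip past the chunks the other processes are reading
        self.pos += B * T * self.world_size
        # if the next read would run off the end, move to the next shard
        if self.pos + B * T * self.world_size + 1 > len(tokens):
            self.shard_idx = (self.shard_idx + 1) % len(self.shards)
            self.pos = B * T * self.rank
        return x, y

shards = [list(range(0, 20)), list(range(100, 120))]
reader = ShardReader(shards, B=2, T=3)
x, y = reader.next_batch()
print(x, y)
```

Each target row is the input row shifted by one token, which is why the buffer holds B * T + 1 tokens.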

HellaSwag eval

hellaswag.py provides the eval logic. HellaSwag is a multiple-choice commonsense reasoning benchmark: each example has a context and four possible completions, exactly one of which is the true continuation; the other three are machine-generated distractors. The eval scores the model by picking the completion with the lowest average per-token loss, averaged over the completion tokens only (the model's most likely choice).

# from train_gpt2.py
def get_most_likely_row(tokens, mask, logits):
    # evaluate the autoregressive loss at all positions
    # ... compute per-completion-region loss ...
    pred_norm = avg_loss.argmin().item()
    return pred_norm
The scoring core: argmin over per-completion average loss.

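This is a standard way to evaluate LLMs on multiple-choice benchmarks: score every option by its length-normalized loss (equivalently, its perplexity) and pick the lowest. The roughly 0.29 accuracy the lecture uses as its bar is what the released GPT-2 (124M) checkpoint scores on HellaSwag under this eval; the GPT-3 paper reports a higher figure for its comparable small model.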
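The scoring rule can be illustrated on toy numbers. This is a sketch under stated assumptions: the per-token losses are made up, the inputs are plain lists rather than tensors, and most_likely_row is a hypothetical name, not the function from train_gpt2.py.

```python
def most_likely_row(token_losses, masks):
    """Return the index of the candidate with the lowest mean completion loss.

    token_losses[i][t] is the cross-entropy of candidate i at position t.
    masks[i][t] is 1 inside the completion and 0 over the shared context
    and any padding.
    """
    avg_losses = []
    for losses, mask in zip(token_losses, masks):
        total = sum(loss * m for loss, m in zip(losses, mask))
        avg_losses.append(total / sum(mask))   # mean over completion tokens only
    # argmin over the four averaged losses
    return min(range(len(avg_losses)), key=avg_losses.__getitem__)

# made-up losses: two context positions, then up to two completion positions
token_losses = [
    [2.0, 2.0, 3.0, 3.0],
    [2.0, 2.0, 1.0, 2.0],
    [2.0, 2.0, 4.0, 0.0],
    [2.0, 2.0, 2.5, 2.5],
]
masks = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 0],   # shorter completion, last slot is padding
    [0, 0, 1, 1],
]
pred = most_likely_row(token_losses, masks)
print(pred)
```

Averaging rather than summing keeps longer completions from being penalized simply for having more tokens.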

Why this lecture matters

There are dozens of "build a GPT" tutorials online. Most produce a small model that "works" on Tiny Shakespeare and stop there. Karpathy's reproduction is one of the few that goes the whole way: real data, real hyperparameters, real eval, and a result that lands inside the GPT-3 paper's reported numbers on a real benchmark.

The reason it ends here in the lecture series (lecture 10) is that there's nothing more to add at the 124M scale — you've reproduced GPT-2. Going further means scaling up, which is mostly engineering, not new ideas.

Related

repos/nanoGPT: The predecessor this builds on.
zero-to-hero-arc: Lecture 10.
Concept pages: adamw · learning-rate-schedules · mixed-precision-and-mfu · gradient-accumulation · training-stability · weight-init · dataloader · hellaswag-eval