
build-nanogpt

The companion repo for lecture 10, "Let's reproduce GPT-2 (124M)." Takes nanoGPT and walks through every optimization that goes into actually reproducing GPT-2 faithfully on modern hardware. The play.ipynb notebook in the repo is the lecture's working artifact; train_gpt2.py is the cleaned-up training script.


What's in the repo

README.md
train_gpt2.py	521 lines	the full training script
play.ipynb		the working notebook from the lecture
fineweb.py		dataset preprocessor: HuggingFace fineweb-edu → token shards
hellaswag.py		HellaSwag eval rendering
input.txt		Tiny Shakespeare (a fallback)

The full reproduction targets roughly 6 hours of wall-clock time on 8×A100 to match the HellaSwag accuracy of the released GPT-2 (124M) checkpoint. The cost on Lambda Cloud was about $10 of compute at the time of recording.

Wall-clock: ~6 hours
Hardware: 8×A100
Cloud cost: ~$10

The lecture this comes from

Lecture 10, "Let's reproduce GPT-2 (124M)", is the longest in the series (~4 hours). It's structured as a series of optimizations to a baseline:

Start from a minimal model. Same as nanoGPT.
Add the GPT-2-specific details. Normal init with std=0.02, scaled init for residual projections, tanh GELU approximation, AdamW with betas=(0.9, 0.95), selective weight decay.
Optimize the data pipeline. DataLoaderLite over .npy shards instead of in-memory tensors. See dataloader.
Switch to FineWeb-Edu. The high-quality 10B-token subset. fineweb.py downloads and tokenizes.
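Use bfloat16 autocast and TF32 matmuls. Lower-precision matmuls cut step time by roughly 3× versus fp32 in the lecture's measurements, well short of the hardware's theoretical peak because the workload stays memory-bound.
torch.compile. Graph capture and kernel fusion. Roughly another 2× in the lecture's measurements.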
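Flash attention via F.scaled_dot_product_attention(q, k, v, is_causal=True). The fused kernel never materializes the full T×T attention matrix, which saves both memory and time.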
Vocab padding to multiple of 64. vocab_size=50304 (50257 padded). Aligns matmuls to nice multiples for tensor cores.
DDP. Distribute across the 8 GPUs.
Gradient accumulation. Hit the target 524288-token batch size.
Cosine LR schedule with warmup, gradient clipping.
HellaSwag eval. Verify the trained model on a real downstream eval.

Each step is shown to improve either loss or throughput. By the end, the script matches the released GPT-2 (124M) checkpoint's HellaSwag accuracy in one ~6-hour run.
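The vocab-padding step in the list above is plain arithmetic. A minimal sketch, where `pad_to_multiple` is a hypothetical helper name used for illustration and not something taken from the repo:

```python
def pad_to_multiple(n: int, multiple: int) -> int:
    """Round n up to the nearest multiple of `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple


gpt2_vocab = 50257                        # GPT-2 BPE vocabulary size
padded = pad_to_multiple(gpt2_vocab, 64)

print(padded)                             # 50304
print(padded - gpt2_vocab)                # 47 extra rows whose ids never occur in the data
print(padded % 128)                       # 0: also a multiple of 128
```

The extra rows cost a little memory and compute, but the padded dimension divides evenly into the tile sizes GPU matmul kernels prefer.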

What's in `train_gpt2.py`

The model code (the first ~200 lines) is almost identical to nanoGPT/model.py, with a few faithfulness tweaks:

nn.GELU(approximate='tanh') instead of plain nn.GELU.
NANOGPT_SCALE_INIT marker on residual projections for scaled init.
vocab_size=50257 in the config (and 50304 at model construction for padding).
Biases left on in the attention, MLP, and LayerNorm layers (GPT-2 had biases); only lm_head is created with bias=False.
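Two of the tweaks above can be checked numerically without PyTorch. The sketch below compares the exact GELU with its tanh approximation and computes the reduced init std for residual projections, assuming the usual rule of scaling by 1/sqrt(2 * n_layer) (two residual additions per block). The function names are hypothetical stand-ins, not identifiers from train_gpt2.py.

```python
import math


def gelu_exact(x: float) -> float:
    """GELU defined through the Gaussian CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """The tanh approximation GPT-2 used."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def residual_proj_std(n_layer: int, base_std: float = 0.02) -> float:
    """Init std for layers that write into the residual stream.

    Each block adds to the residual stream twice (attention and MLP),
    hence the factor of 2 (assumed scaling rule, see lead-in).
    """
    return base_std * (2 * n_layer) ** -0.5


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"{x:+.1f}  exact={gelu_exact(x):+.5f}  tanh={gelu_tanh(x):+.5f}")

print(residual_proj_std(12))   # about 0.0041 for the 12-layer 124M model
```

The two activations agree closely, so the tanh variant matters for faithfulness to the released weights more than for model quality.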

The training loop (the last ~300 lines) is where the lecture-driven optimizations live:

DataLoaderLite reading sharded .npy token files.
torch.set_float32_matmul_precision('high') for TF32 matmuls.
with torch.autocast(device_type=device_type, dtype=torch.bfloat16): around the forward.
grad_accum_steps = total_batch_size // (B * T * ddp_world_size) — see gradient-accumulation.
get_lr(it) — cosine schedule with linear warmup.
clip_grad_norm_(model.parameters(), 1.0) — training stability.
Periodic validation loss, HellaSwag eval, and sample generation every 250 steps.
DDP setup with init_process_group(backend='nccl').
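The grad_accum_steps formula in the list above works out as follows. The micro-batch size B = 16 is an illustrative assumption; the total batch size and the 8-GPU world size come from the text.

```python
total_batch_size = 524288    # tokens per optimizer step, 2**19
B = 16                       # sequences per micro-batch per GPU (assumed value)
T = 1024                     # tokens per sequence (GPT-2 context length)
ddp_world_size = 8           # one process per GPU

tokens_per_micro_step = B * T * ddp_world_size
assert total_batch_size % tokens_per_micro_step == 0, "batch must divide evenly"
grad_accum_steps = total_batch_size // tokens_per_micro_step
print(grad_accum_steps)      # 4

# Dividing each micro-batch loss by grad_accum_steps makes the accumulated
# gradient correspond to the mean over the full batch, not the sum.
micro_losses = [2.0, 4.0, 6.0, 8.0]          # toy per-micro-batch mean losses
accumulated = sum(loss / grad_accum_steps for loss in micro_losses)
print(accumulated)           # 5.0, the mean of the four losses
```

A larger B that fits in memory lowers grad_accum_steps without changing the effective batch size.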

The full code is dense but readable. Every line is purposeful.
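As one example of how compact these pieces are, here is a sketch of a get_lr(it) function implementing linear warmup followed by cosine decay. The constants (peak rate, 10% floor, warmup and total step counts) are assumptions chosen for illustration; check the script for the values it actually uses.

```python
import math

max_lr = 6e-4             # peak learning rate (assumed)
min_lr = max_lr * 0.1     # floor at 10% of peak (assumed)
warmup_steps = 715        # assumed
max_steps = 19073         # assumed


def get_lr(it: int) -> float:
    # 1) linear warmup; (it + 1) avoids a zero learning rate at step 0
    if it < warmup_steps:
        return max_lr * (it + 1) / warmup_steps
    # 2) past the end of the schedule, hold the floor
    if it > max_steps:
        return min_lr
    # 3) cosine decay from max_lr down to min_lr
    decay_ratio = (it - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))   # goes 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)


for step in (0, warmup_steps, max_steps // 2, max_steps):
    print(step, f"{get_lr(step):.2e}")
```

In a training loop the returned value would be written into each optimizer param group's "lr" entry before the optimizer step.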

`fineweb.py` — the data prep

The training corpus is a 10B-token subset of FineWeb-Edu, a Hugging Face-curated educational subset of the FineWeb web crawl.

fineweb-edu sample-10BT → tiktoken gpt2 → np.uint16 shards → np.load

fineweb.py:

Streams the dataset via datasets.load_dataset("HuggingFaceFW/fineweb-edu", "sample-10BT").
Tokenizes each document with tiktoken.get_encoding("gpt2"), prepends an <|endoftext|> token.
Writes tokens to shard files of 100M tokens each, as np.uint16 arrays (token IDs fit in 16 bits since vocab_size < 65536).
First shard is val, remaining shards are train.
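The packing logic in the steps above can be sketched in pure Python on toy data. This is an in-memory stand-in under stated assumptions: `shard_tokens` is a hypothetical name, documents arrive already tokenized, and shards are returned as lists where the real script writes NumPy arrays to disk.

```python
from typing import Iterable, Iterator, List, Tuple

EOT = 50256  # id of <|endoftext|> in the GPT-2 tokenizer


def shard_tokens(
    docs: Iterable[List[int]], shard_size: int
) -> Iterator[Tuple[str, List[int]]]:
    """Yield (split, tokens) shards; a document may straddle a shard boundary."""
    buf: List[int] = []
    index = 0
    for doc in docs:
        buf.append(EOT)          # delimiter goes before each document
        buf.extend(doc)
        while len(buf) >= shard_size:
            split = "val" if index == 0 else "train"
            yield split, buf[:shard_size]
            buf = buf[shard_size:]
            index += 1
    if buf:                      # final, partially filled shard
        yield ("val" if index == 0 else "train"), buf


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]      # toy pre-tokenized documents
shards = list(shard_tokens(docs, shard_size=5))
for split, toks in shards:
    print(split, toks)
```

With shard_size=5 the toy input yields one val shard followed by two train shards, the last one only partly filled.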

Shards: 100 files, 100M tokens each
On disk: ~20GB total

The training script just np.loads them in order.
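The on-disk size follows directly from the shard layout and the 16-bit token type. A quick check using only the standard library (the variable names are illustrative):

```python
import struct

bytes_per_token = struct.calcsize("<H")    # unsigned 16-bit integer: 2 bytes
max_uint16 = 2 ** 16 - 1                   # 65535
gpt2_vocab = 50257

# The largest token id is vocab_size - 1, so every id fits in uint16.
assert gpt2_vocab - 1 <= max_uint16

n_shards = 100
tokens_per_shard = 100_000_000
total_bytes = n_shards * tokens_per_shard * bytes_per_token
print(total_bytes / 1e9)                   # 20.0 (GB)
```

Storing the same tokens as 32-bit or 64-bit integers would double or quadruple that footprint.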

HellaSwag eval

hellaswag.py provides the eval logic. HellaSwag is a multiple-choice commonsense reasoning benchmark: each example has a context and four possible completions, exactly one of which is human-written. The eval scores the model by picking the completion with the lowest average per-token loss (the model's most likely choice).

# from train_gpt2.py
def get_most_likely_row(tokens, mask, logits):
    # evaluate the autoregressive loss at all positions
    # ... compute per-completion-region loss ...
    pred_norm = avg_loss.argmin().item()
    return pred_norm

The scoring core: argmin over per-completion average loss.

This is a standard way to evaluate LLMs on multiple-choice benchmarks: score every option by its length-normalized loss (equivalently, its perplexity) and pick the lowest. The lecture's run reaches the roughly 0.29 HellaSwag accuracy of the released GPT-2 (124M) checkpoint.
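To make the scoring rule concrete, here is a pure-Python stand-in that takes per-token losses as given. The repo's function works on PyTorch tensors and computes those losses from logits; `most_likely_row` and the toy numbers below are hypothetical.

```python
from typing import List


def most_likely_row(token_losses: List[List[float]], mask: List[List[int]]) -> int:
    """Return the index of the row with the lowest average loss over its
    completion region (positions where mask == 1)."""
    avg_losses = []
    for losses, m in zip(token_losses, mask):
        total = sum(loss * keep for loss, keep in zip(losses, m))
        avg_losses.append(total / sum(m))
    return min(range(len(avg_losses)), key=avg_losses.__getitem__)


# Four candidate rows: two shared context tokens, then the completion.
# Row 1 has a three-token completion; the others have two tokens plus padding.
token_losses = [
    [0.9, 1.1, 3.0, 3.0, 0.0],
    [0.9, 1.1, 2.0, 1.0, 1.5],
    [0.9, 1.1, 4.0, 2.0, 0.0],
    [0.9, 1.1, 2.0, 2.0, 0.0],
]
mask = [
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
]

prediction = most_likely_row(token_losses, mask)
print(prediction)   # 1: average loss 1.5 beats 3.0, 3.0 and 2.0
```

Averaging matters here: by summed loss, row 3 (4.0) would beat row 1 (4.5), so an unnormalized score would penalize the longer completion.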

Why this lecture matters

There are dozens of "build a GPT" tutorials online. Most produce a small model that "works" on Tiny Shakespeare and stop there. Karpathy's reproduction is one of the few that goes the whole way: real data, real hyperparameters, real eval, and a result that matches the released GPT-2 (124M) checkpoint on a real benchmark.

The reason it ends here in the lecture series (lecture 10) is that there's nothing more to add at the 124M scale — you've reproduced GPT-2. Going further means scaling up, which is mostly engineering, not new ideas.

Related

repos/nanoGPT: The predecessor this builds on.
zero-to-hero-arc: Lecture 10.
Concept pages: adamw · learning-rate-schedules · mixed-precision-and-mfu · gradient-accumulation · training-stability · weight-init · dataloader · hellaswag-eval
