nanoGPT
The production-grade version of ng-video-lecture. Roughly 300 lines of model code and 300 lines of training loop that can reproduce GPT-2 (124M) on OpenWebText in ~4 days on an 8×A100 node.
The simplest, fastest repository for training/finetuning medium-sized GPTs. It is a rewrite of minGPT that prioritizes teeth over education.
— from the README
"Teeth over education" is the key framing. ng-video-lecture and makemore optimize for clarity. nanoGPT optimizes for actually reproducing GPT-2 quickly. Differences from the pedagogical versions are all "the fast way."
What's in the repo
model.py
~300 lines: the full GPT definition, including from_pretrained, configure_optimizers, generate
train.py
~336 lines — single-file training loop with DDP, mixed precision, checkpointing
sample.py
~80 lines — load checkpoint, run generate()
data/
preprocessing scripts: openwebtext, shakespeare, shakespeare_char
config/
config files for different training runs
scaling_laws.ipynb
notebook reproducing Chinchilla-style scaling experiments
transformer_sizing.ipynb
notebook showing param count breakdowns
Training is launched by passing a config file to train.py, for example train.py config/train_gpt2.py. The config files are plain Python files that override defaults, following Karpathy's idiosyncratic "configurator" pattern.
What's in model.py
A complete from-scratch GPT-2 implementation. Reading the file end to end:
LayerNorm
A custom wrapper around F.layer_norm that supports bias=False. When the file was written, PyTorch's built-in nn.LayerNorm had no option to disable the bias, and modern GPT-2 variants want that option. See layernorm-vs-rmsnorm.
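A minimal sketch of such a wrapper, assuming PyTorch is installed. The class mirrors the idea described above rather than quoting model.py; the toy sizes at the bottom are chosen only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LayerNorm(nn.Module):
    """LayerNorm with a learnable scale and an optional learnable bias."""

    def __init__(self, ndim: int, bias: bool):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ndim))
        # With bias=False no bias parameter is registered at all.
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # F.layer_norm accepts bias=None, which is what makes the wrapper work.
        return F.layer_norm(x, self.weight.shape, self.weight, self.bias, 1e-5)


# Illustrative usage: normalize the last dimension of a (batch, time, channels) tensor.
ln = LayerNorm(8, bias=False)
out = ln(torch.randn(2, 4, 8))
n_params = sum(p.numel() for p in ln.parameters())  # only the scale vector
```

With bias=False the module carries only the scale vector, so the parameter count equals the channel width.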
CausalSelfAttention
Multi-head causal self-attention with three optimizations over ng-video-lecture:
Fused QKV
self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd). Splits into Q, K, V via .split(self.n_embd, dim=2).
One matmul instead of three.
Batched heads
q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) reshapes so heads are a batch dimension, not a Python list.
One attention matmul handles all heads at once.
Flash attention
if self.flash: y = F.scaled_dot_product_attention(...) — uses PyTorch's flash-attention backend when available.
Falls back to the manual implementation otherwise.
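The three optimizations can be sketched together on toy tensors. This assumes PyTorch 2.0 or later (for F.scaled_dot_product_attention); the sizes and variable names other than c_attn are illustrative, and dropout and the output projection are left out.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
B, T, n_embd, n_head = 2, 5, 16, 4  # toy sizes, chosen for illustration
head_dim = n_embd // n_head

c_attn = nn.Linear(n_embd, 3 * n_embd)  # fused QKV projection
x = torch.randn(B, T, n_embd)

with torch.no_grad():
    # One matmul, then split the last dimension into Q, K, V.
    q, k, v = c_attn(x).split(n_embd, dim=2)

    # Fold heads into a batch dimension: (B, T, C) -> (B, n_head, T, head_dim).
    q = q.view(B, T, n_head, head_dim).transpose(1, 2)
    k = k.view(B, T, n_head, head_dim).transpose(1, 2)
    v = v.view(B, T, n_head, head_dim).transpose(1, 2)

    # Fused path: PyTorch picks an attention backend and applies the causal mask.
    y_fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

    # Manual fallback: explicit scores, causal mask, softmax, weighted sum.
    att = (q @ k.transpose(-2, -1)) / math.sqrt(head_dim)
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    att = att.masked_fill(~causal, float("-inf"))
    y_manual = F.softmax(att, dim=-1) @ v

    # Merge heads back into the channel dimension: (B, T, C).
    y = y_fused.transpose(1, 2).contiguous().view(B, T, n_embd)
```

Both paths compute the same attention output up to floating-point error; the fused call simply avoids materializing the full T×T score matrix when a suitable backend is available.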
MLP
c_fc → GELU → c_proj → dropout. Standard.
Block
Pre-norm: x = x + attn(ln_1(x)); x = x + mlp(ln_2(x)). See transformer-block.
GPT
Token + position embeddings, stack of blocks, final LayerNorm, lm_head, weight tying, scaled init for residual projections. Plus several methods:
| Method | Purpose |
|---|---|
| from_pretrained(model_type) | Loads HuggingFace GPT-2 weights, transposes the Conv1D-shaped weights into Linear shape. Critical for finetuning runs. |
| configure_optimizers(weight_decay, learning_rate, betas, device_type) | Sets up AdamW with selective weight decay: 2D tensors decay, 1D tensors don't. |
| estimate_mfu(fwdbwd_per_iter, dt) | Computes Model Flops Utilization using the PaLM paper's formula. |
| crop_block_size(block_size) | Model surgery to shrink the context window of a loaded checkpoint. |
| generate(idx, max_new_tokens, temperature, top_k) | Autoregressive sampling with optional top-k. |
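The selective weight decay rule in configure_optimizers can be sketched as follows, assuming PyTorch. The toy model is a stand-in for the GPT, and the hyperparameter values are placeholders for illustration, not settings taken from this page.

```python
import torch
import torch.nn as nn

# Toy stand-in for the model: one embedding, one LayerNorm, one Linear.
model = nn.Sequential(nn.Embedding(10, 8), nn.LayerNorm(8), nn.Linear(8, 8))

params = [p for p in model.parameters() if p.requires_grad]
decay = [p for p in params if p.dim() >= 2]    # matmul weights and embeddings
no_decay = [p for p in params if p.dim() < 2]  # biases and LayerNorm scale/shift

optim_groups = [
    {"params": decay, "weight_decay": 0.1},    # placeholder decay value
    {"params": no_decay, "weight_decay": 0.0},
]
# Placeholder learning rate and betas; AdamW applies each group's own decay.
optimizer = torch.optim.AdamW(optim_groups, lr=6e-4, betas=(0.9, 0.95))
```

Splitting on tensor rank avoids listing parameter names by hand: anything that takes part in a matmul is 2D and decays, while per-channel vectors are left alone.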
What's in train.py
A single-file training loop with serious production features:
DDP
via torch.distributed. Detects RANK env vars from torchrun; otherwise runs single-process.
Mixed precision
via torch.autocast(device_type, dtype=ptdtype). Defaults to bf16 on supported GPUs.
Grad accumulation
via the gradient_accumulation_steps config knob.
LR schedule
Cosine LR schedule with warmup.
Grad clipping
at grad_clip = 1.0.
Checkpointing
to out_dir. Resumable runs.
wandb
integration (optional).
torch.compile
support — wraps the model for graph-mode execution.
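Of the features above, the LR schedule is the easiest to show in isolation. The following is a minimal sketch of a cosine schedule with linear warmup in plain Python; the function name get_lr and all default values are assumptions made for illustration, not settings confirmed by this page.

```python
import math


def get_lr(it, learning_rate=6e-4, min_lr=6e-5,
           warmup_iters=2000, lr_decay_iters=600000):
    """Learning rate at iteration `it` (all defaults are illustrative)."""
    # 1) Linear warmup from near zero up to the peak rate.
    if it < warmup_iters:
        return learning_rate * (it + 1) / (warmup_iters + 1)
    # 2) After the decay horizon, hold at the floor.
    if it > lr_decay_iters:
        return min_lr
    # 3) In between, follow half a cosine from the peak down to the floor.
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes 1 -> 0
    return min_lr + coeff * (learning_rate - min_lr)


# In a training loop the value would be written into every optimizer
# param group before each step.
schedule = [get_lr(it) for it in (0, 2000, 301000, 600000, 700000)]
```

The rate starts near zero, hits the peak at the end of warmup, passes the midpoint between peak and floor halfway through the decay, and then stays at the floor.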
The configurator pattern. Instead of argparse, train.py defines all defaults as module-level globals, then execs the config file you pass on the command line, which overrides specific globals. It's hacky but lets config files be pure Python (you can compute hyperparams, do conditional logic, etc.).
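A rough sketch of the pattern in plain Python. To stay self-contained it keeps the defaults in a dict rather than in module-level globals, and the helper load_config, the file name train_tiny.py, and the values are all hypothetical.

```python
import os
import tempfile

# Defaults that a training script would otherwise define as globals.
defaults = {"n_layer": 12, "n_embd": 768, "learning_rate": 6e-4, "batch_size": 12}

# A config file is ordinary Python, so it can compute and branch.
config_source = """
n_layer = 6
n_embd = 64 * n_layer
learning_rate = 1e-3 if n_layer < 12 else 6e-4
"""


def load_config(path, defaults):
    """Exec the config file on top of the defaults and return the merged values."""
    namespace = dict(defaults)
    with open(path) as f:
        exec(f.read(), namespace)  # runs arbitrary code: trusted configs only
    # Keep only the known keys, dropping anything else exec added.
    return {key: namespace[key] for key in defaults}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train_tiny.py")
    with open(path, "w") as f:
        f.write(config_source)
    config = load_config(path, defaults)
```

Here batch_size keeps its default because the config file never mentions it, while n_embd and learning_rate are derived from n_layer inside the config itself.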
The shakespeare_char baby example
The first thing the README walks through: train a tiny character-level GPT on the works of Shakespeare. A 6-layer, 6-head, 384-channel model trains for ~3 minutes on one A100, reaches validation loss 1.4697, and produces:
ANGELO:
And cowards it be strawn to my bed,
And thrust the gates of my threats,
Because he that ale away, and hang'd
An one with him.
This is comparable to what ng-video-lecture/gpt.py produces, since it is essentially the same model.
OpenWebText reproduction
Config: config/train_gpt2.py
- Model: GPT-2 (124M), with 12 layers, 12 heads, 768 channels
- Data: OpenWebText
- Hardware: ~4 days on 8×A100
- Result: Reaches GPT-2's reported perplexity on OpenAI's eval set. The plot in the repo (assets/gpt2_124M_loss.png) shows the loss curve matching the original.
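The "124M" figure can be checked with a little arithmetic. The sketch below assumes the standard GPT-2 small settings (vocabulary of 50257 tokens, context length 1024, biases enabled, a 4× MLP expansion, and an lm_head that shares weights with the token embedding); the function name is hypothetical.

```python
def gpt2_param_count(n_layer=12, n_embd=768, vocab_size=50257, block_size=1024):
    """Count parameters of a GPT-2 style model with a tied lm_head."""
    wte = vocab_size * n_embd   # token embeddings (shared with lm_head)
    wpe = block_size * n_embd   # position embeddings
    ln = 2 * n_embd             # LayerNorm scale + bias
    # Attention: fused QKV projection plus output projection, each with bias.
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # MLP: expand to 4x width, then project back, each with bias.
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    block = ln + attn + ln + mlp
    # Final LayerNorm is added once; the tied lm_head adds nothing.
    return wte + wpe + n_layer * block + ln


total = gpt2_param_count()
print(f"{total:,}")  # 124,439,808
```

The 12 blocks account for roughly 85M of the total and the token embedding for roughly 39M, which is why weight tying matters at this scale.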
Update note from the README
Update Nov 2025. nanoGPT has a new and improved cousin called nanochat. It is very likely you meant to use/find nanochat instead. nanoGPT (this repo) is now very old and deprecated but I will leave it up for posterity.
nanoGPT predates ChatGPT and focuses on pretraining. nanochat (not in this corpus) is the follow-up that also includes RLHF and chat-style fine-tuning. nanoGPT is still the right place to learn pretraining specifically.
Related
- repos/build-nanogpt — the next step: lecture-driven faithful GPT-2 reproduction
- transformer-block, attention — the model internals
- adamw, learning-rate-schedules, gradient-accumulation, mixed-precision-and-mfu — the training internals