REPO · NANOGPT

nanoGPT

The production-grade version of ng-video-lecture. Roughly 300 lines of model code and 300 lines of training loop that can reproduce GPT-2 (124M) on OpenWebText in ~4 days on an 8×A100 node.

Predecessor: minGPT (the original pedagogical rewrite)
This repo: nanoGPT (teeth over education)
Successor: nanochat (pretraining + RLHF + chat)

The simplest, fastest repository for training/finetuning medium-sized GPTs. It is a rewrite of minGPT that prioritizes teeth over education.

— from the README

"Teeth over education" is the key framing. ng-video-lecture and makemore optimize for clarity; nanoGPT optimizes for actually reproducing GPT-2 quickly. Every place it differs from the pedagogical versions, it takes "the fast way."

What's in the repo

model.py: ~330 lines. GPT class, with from_pretrained, configure_optimizers, generate.
train.py: ~336 lines. Single-file training loop with DDP, mixed precision, checkpointing.
sample.py: ~80 lines. Loads a checkpoint and runs generate().
data/: preprocessing scripts for openwebtext, shakespeare, shakespeare_char.
config/: config files for different training runs.
scaling_laws.ipynb: notebook reproducing Chinchilla-style scaling experiments.
transformer_sizing.ipynb: notebook showing param count breakdowns.

Training is launched by passing a config file to train.py, e.g. python train.py config/train_gpt2.py (run under torchrun for multi-GPU jobs). The config files are just Python files that override defaults: Karpathy's idiosyncratic "configurator" pattern.

What's in model.py

A complete from-scratch GPT-2 implementation. Reading the file end to end:

LayerNorm

A custom wrapper around F.layer_norm that supports bias=False. When model.py was written, PyTorch's built-in nn.LayerNorm always included a bias (a bias argument was added in later releases), and modern GPT variants often want the option to drop it. See layernorm-vs-rmsnorm.
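A minimal sketch of such a wrapper, assuming the usual epsilon of 1e-5; the parameter names here are illustrative and may not match model.py exactly:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LayerNorm(nn.Module):
    """LayerNorm with an optional bias term (illustrative sketch)."""

    def __init__(self, ndim: int, bias: bool):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ndim))
        # With bias=False, no bias parameter is registered at all.
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # F.layer_norm accepts bias=None, which is what makes the option cheap to support.
        return F.layer_norm(x, self.weight.shape, self.weight, self.bias, 1e-5)


ln = LayerNorm(8, bias=False)
out = ln(torch.randn(2, 4, 8))  # normalized over the last dimension
```

With bias=False the module holds a single learnable tensor (the scale), so it contributes one fewer parameter vector per norm layer.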

CausalSelfAttention

Multi-head causal self-attention with three optimizations over ng-video-lecture:

Opt 1

Fused QKV

self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd). Splits into Q, K, V via .split(self.n_embd, dim=2).

One matmul instead of three.

Opt 2

Batched heads

q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) reshapes so heads are a batch dimension, not a Python list.

One attention matmul handles all heads at once.

Opt 3

Flash attention

if self.flash: y = F.scaled_dot_product_attention(...) — uses PyTorch's flash-attention backend when available.

Falls back to the manual implementation otherwise.
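The three optimizations fit together roughly as follows. This is a simplified sketch, not the repo's code: dropout is omitted, the constructor takes plain integers instead of a config object, and the mask is built on the fly in the fallback path.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_embd, self.n_head = n_embd, n_head
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)  # Opt 1: fused Q, K, V projection
        self.c_proj = nn.Linear(n_embd, n_embd)      # output projection
        # Opt 3: scaled_dot_product_attention exists in PyTorch >= 2.0
        self.flash = hasattr(F, "scaled_dot_product_attention")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.size()
        hs = C // self.n_head  # per-head channel count
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        # Opt 2: move heads into a batch dimension -> (B, n_head, T, hs)
        q = q.view(B, T, self.n_head, hs).transpose(1, 2)
        k = k.view(B, T, self.n_head, hs).transpose(1, 2)
        v = v.view(B, T, self.n_head, hs).transpose(1, 2)
        if self.flash:
            y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        else:
            # Manual fallback: scaled scores, causal mask, softmax, weighted sum
            att = (q @ k.transpose(-2, -1)) / math.sqrt(hs)
            mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
            att = att.masked_fill(~mask, float("-inf"))
            y = F.softmax(att, dim=-1) @ v
        # Re-assemble the heads side by side and project back to n_embd
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)


torch.manual_seed(0)
attn = CausalSelfAttention(n_embd=32, n_head=4)
x = torch.randn(2, 6, 32)
y = attn(x)  # shape (2, 6, 32)
```

Both branches compute the same function; the fused path just avoids materializing the full T×T attention matrix where the backend supports it.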

MLP

c_fc → GELU → c_proj → dropout. Standard.

Block

Pre-norm: x = x + attn(ln_1(x)); x = x + mlp(ln_2(x)). See transformer-block.
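The MLP and pre-norm wiring can be sketched as below. Assumptions: the 4× hidden expansion follows the GPT-2 convention (the text above does not state it), nn.LayerNorm stands in for the custom LayerNorm, and the attention module is passed in, with nn.Identity() used as a placeholder in the demo.

```python
import torch
import torch.nn as nn


class MLP(nn.Module):
    def __init__(self, n_embd: int, dropout: float = 0.0):
        super().__init__()
        self.c_fc = nn.Linear(n_embd, 4 * n_embd)    # expand (4x is an assumption)
        self.gelu = nn.GELU()
        self.c_proj = nn.Linear(4 * n_embd, n_embd)  # project back
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dropout(self.c_proj(self.gelu(self.c_fc(x))))


class Block(nn.Module):
    def __init__(self, n_embd: int, attn: nn.Module, dropout: float = 0.0):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = attn
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = MLP(n_embd, dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-norm: normalize the input of each sublayer, never the residual stream itself
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x


torch.manual_seed(0)
block = Block(16, attn=nn.Identity())  # nn.Identity() is a placeholder for attention
x = torch.randn(2, 5, 16)
y = block(x)
```

Because the residual path is never normalized, the input can flow through the whole stack as a plain sum of sublayer outputs.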

GPT

Token + position embeddings, stack of blocks, final LayerNorm, lm_head, weight tying, scaled init for residual projections. Plus several methods:
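Weight tying and the scaled init can be illustrated on their own. The sizes below are toy values, and the 0.02 base standard deviation with a 1/sqrt(2 * n_layer) factor is the GPT-2 convention as I understand it, so treat the exact constants as an assumption rather than a quote from model.py.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)
n_layer, n_embd, vocab_size = 4, 64, 100  # toy sizes, not GPT-2's

# Weight tying: the token embedding and the output head share one Parameter.
wte = nn.Embedding(vocab_size, n_embd)
lm_head = nn.Linear(n_embd, vocab_size, bias=False)
wte.weight = lm_head.weight  # both have shape (vocab_size, n_embd)

# Scaled init: each block adds two residual branches (attention and MLP),
# so the projections feeding the residual stream get a smaller std.
c_proj = nn.Linear(n_embd, n_embd)
scaled_std = 0.02 / math.sqrt(2 * n_layer)
nn.init.normal_(c_proj.weight, mean=0.0, std=scaled_std)
```

Tying removes one vocab_size × n_embd matrix from the parameter count, which is a large share of a 124M-parameter model.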

Method: Purpose
from_pretrained(model_type): Loads HuggingFace GPT-2 weights, transposes the Conv1D-shaped weights into Linear shape. Critical for finetuning runs.
configure_optimizers(weight_decay, learning_rate, betas, device_type): Sets up AdamW with selective weight decay: 2D tensors decay, 1D tensors don't.
estimate_mfu(fwdbwd_per_iter, dt): Computes Model Flops Utilization using the PaLM paper's formula.
crop_block_size(block_size): Model surgery to shrink the context window of a loaded checkpoint.
generate(idx, max_new_tokens, temperature, top_k): Autoregressive sampling with optional top-k.
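The selective weight decay rule in configure_optimizers can be sketched as a small helper. The name make_param_groups and the hyperparameter values are hypothetical, chosen for illustration only.

```python
import torch
import torch.nn as nn


def make_param_groups(model: nn.Module, weight_decay: float):
    """Split trainable parameters by dimensionality (hypothetical helper)."""
    params = [p for p in model.parameters() if p.requires_grad]
    # Matrices (Linear weights, embeddings) are decayed ...
    decay = [p for p in params if p.dim() >= 2]
    # ... while vectors (biases, LayerNorm scales) are not.
    no_decay = [p for p in params if p.dim() < 2]
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]


model = nn.Sequential(nn.Linear(8, 16), nn.LayerNorm(16), nn.Linear(16, 4))
groups = make_param_groups(model, weight_decay=0.1)
# lr and betas are illustrative values, not taken from the text above
optimizer = torch.optim.AdamW(groups, lr=6e-4, betas=(0.9, 0.95))
```

Splitting on tensor dimensionality avoids maintaining a list of parameter names: anything shaped like a matrix is regularized, anything shaped like a vector is left alone.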

What's in train.py

A single-file training loop with serious production features:

DDP via torch.distributed. Detects RANK env vars from torchrun; otherwise runs single-process.
Mixed precision via torch.autocast(device_type, dtype=ptdtype). Defaults to bf16 on supported GPUs.
Grad accumulation via the gradient_accumulation_steps config knob.
LR schedule: cosine decay with warmup.
Grad clipping at grad_clip = 1.0.
Checkpointing to out_dir. Resumable runs.
wandb integration (optional).
torch.compile support: wraps the model for graph-mode execution.
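The cosine schedule with warmup from the list above can be written as a small pure function. This is a sketch: the default values are illustrative assumptions, not settings quoted from the repo.

```python
import math


def get_lr(it: int,
           learning_rate: float = 6e-4,
           min_lr: float = 6e-5,
           warmup_iters: int = 2000,
           lr_decay_iters: int = 600_000) -> float:
    """Learning rate at iteration `it` (all defaults are illustrative)."""
    # 1) Linear warmup from near zero up to learning_rate
    if it < warmup_iters:
        return learning_rate * (it + 1) / (warmup_iters + 1)
    # 2) Past the decay window, hold at the floor
    if it > lr_decay_iters:
        return min_lr
    # 3) In between, cosine-decay from learning_rate down to min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes 1 -> 0
    return min_lr + coeff * (learning_rate - min_lr)


schedule = [get_lr(i) for i in (0, 1000, 2000, 300_000, 600_000, 700_000)]
```

In a training loop the returned value would be written into each optimizer param group's "lr" entry before the step.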

The configurator pattern. Instead of argparse, train.py defines all defaults as module-level globals, then execs the config file you pass on the command line, which overrides specific globals. It's hacky but lets config files be pure Python (you can compute hyperparams, do conditional logic, etc.).
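The idea can be sketched in a few lines. To keep the example self-contained, a dict stands in for train.py's module-level globals and a string stands in for the config file; load_config and all the values are hypothetical.

```python
# Defaults, as train.py would define them at module level (illustrative values).
defaults = {
    "batch_size": 12,
    "learning_rate": 6e-4,
    "max_iters": 5000,
    "out_dir": "out",
}

# A hypothetical config file. Because it is plain Python, it can compute values.
config_source = """
batch_size = 64
learning_rate = 1e-3 * (batch_size / 64)
out_dir = "out-shakespeare-char"
"""


def load_config(source: str, defaults: dict) -> dict:
    """Run config code on top of the defaults and return the merged settings."""
    namespace = dict(defaults)  # copy, so the defaults stay untouched
    exec(source, namespace)     # assignments in the config overwrite matching keys
    return {key: namespace[key] for key in defaults}  # drop __builtins__ and extras


config = load_config(config_source, defaults)
print(config)
```

Keys the config does not mention (max_iters here) keep their defaults. The usual caveat applies: exec runs arbitrary code, so this pattern only suits config files you trust.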

The shakespeare_char baby example

The first thing the README walks through: train a tiny character-level GPT on the works of Shakespeare. A 6-layer, 6-head, 384-channel model trains for ~3 minutes on one A100, reaches validation loss 1.4697, and produces:

ANGELO:
And cowards it be strawn to my bed,
And thrust the gates of my threats,
Because he that ale away, and hang'd
An one with him.
Sample output after ~3 minutes of training on one A100. It is the same kind of output that ng-video-lecture/gpt.py produces, since it's essentially the same model.

OpenWebText reproduction

Config: config/train_gpt2.py

Model: GPT-2 (124M), with 12 layers, 12 heads, 768 channels
Data: OpenWebText
Hardware: ~4 days on 8×A100
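Result: Reaches a validation loss of about 2.85 on OpenWebText. Per the README, the released GPT-2 (124M) checkpoint scores about 3.11 on the same data as-is and about 2.85 after finetuning on it, so the two models roughly match. The plot in the repo (assets/gpt2_124M_loss.png) shows the run's loss curve.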

Update note from the README

Update Nov 2025. nanoGPT has a new and improved cousin called nanochat. It is very likely you meant to use/find nanochat instead. nanoGPT (this repo) is now very old and deprecated but I will leave it up for posterity.

nanoGPT focuses on pretraining, plus finetuning from GPT-2 checkpoints. nanochat (not in this corpus) is the follow-up that also includes RLHF and chat-style fine-tuning. nanoGPT is still the right place to learn pretraining specifically.

Related