Learning Rate Schedules
The learning rate is often the single most consequential hyperparameter in deep learning. A constant learning rate is rarely optimal: most modern large-model training runs use a schedule that warms up, then decays. Karpathy's GPT-2 reproduction in build-nanogpt is a canonical example.
Cosine with linear warmup
The schedule described in the GPT-3 and Llama papers and adopted by build-nanogpt's GPT-2 reproduction:
linear warmup to max_lr, cosine decay to min_lr, then a floor.
Constants
| Constant | Value |
|---|---|
| max_lr | 6e-4 |
| min_lr | max_lr * 0.1 |
| warmup_steps | 715 |
| max_steps | 19073 (~1 epoch on 10B tokens at a batch of 0.5M tokens) |
# from build-nanogpt/train_gpt2.py
# max_lr, min_lr, warmup_steps and max_steps are the constants listed above
import math

def get_lr(it):
    # 1) linear warmup for warmup_steps steps
    if it < warmup_steps:
        return max_lr * (it+1) / warmup_steps
    # 2) if it > max_steps, return min learning rate
    if it > max_steps:
        return min_lr
    # 3) in between, use cosine decay down to min learning rate
    decay_ratio = (it - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes from 1 to 0
    return min_lr + coeff * (max_lr - min_lr)
Three phases:
Warmup (linear)
LR climbs linearly from max_lr / warmup_steps at step 0 to max_lr = 6e-4 at the last warmup step. The model is fragile at init: a full learning rate can destabilize the first few steps. Warmup gives the optimizer state (momentum and variance estimates in AdamW) time to populate before it is used aggressively.
Cosine decay
LR follows min_lr + 0.5 * (max_lr - min_lr) * (1 + cos(π * progress)), where progress = (step - warmup_steps) / (max_steps - warmup_steps) runs from 0 to 1. Smooth, monotonic, no abrupt cliff. Lands at min_lr = 0.1 * max_lr = 6e-5.
Floor
Anything past max_steps stays at min_lr. In practice you stop training at max_steps.
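To make the three phases concrete, the sketch below restates the schedule with the constants from this section and prints the learning rate at each phase boundary. The `midpoint` variable and the choice of probe steps are illustrative additions, not part of build-nanogpt.

```python
import math

# Constants as listed in this section.
max_lr = 6e-4
min_lr = max_lr * 0.1
warmup_steps = 715
max_steps = 19073

def get_lr(it):
    """Three-phase schedule: linear warmup, cosine decay, then a floor."""
    if it < warmup_steps:
        return max_lr * (it + 1) / warmup_steps
    if it > max_steps:
        return min_lr
    decay_ratio = (it - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (max_lr - min_lr)

# Hypothetical probe points: start, end of warmup, middle and end of decay, past the end.
midpoint = (warmup_steps + max_steps) // 2
for step in (0, warmup_steps - 1, warmup_steps, midpoint, max_steps, max_steps + 1000):
    print(f"step {step:>6d}  lr {get_lr(step):.3e}")
```

The midpoint of the cosine phase sits halfway between max_lr and min_lr, and every step past max_steps returns min_lr.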
Why cosine, not exponential
Exponential decay: used in older deep learning literature. It has no natural endpoint, and its shape depends entirely on the decay rate you choose.
Cosine decay: has a natural shape. The LR falls slowly at first (so it stays high and the model can still make big jumps in the loss landscape), falls fastest in the middle, and flattens out at the end (where the model is fine-tuning).
It's also reproducible: the only hyperparameters are max_lr, min_lr, warmup_steps, and max_steps.
When the total number of steps is fixed in advance (as it is in build-nanogpt), cosine is fine and well-understood.
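The contrast between the two shapes can be sketched numerically. The example below is an illustration under one stated assumption: the exponential decay factor `gamma` is chosen so that the exponential curve also reaches min_lr at the end of the decay phase. The function names are hypothetical.

```python
import math

max_lr, min_lr = 6e-4, 6e-5
decay_steps = 19073 - 715  # length of the decay phase (max_steps - warmup_steps)

def cosine_lr(t):
    """Cosine decay from max_lr to min_lr; t is counted from the end of warmup."""
    progress = min(t / decay_steps, 1.0)  # clamp so the curve stops at min_lr
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Assumption: gamma is tuned so the exponential also hits min_lr at decay_steps.
gamma = (min_lr / max_lr) ** (1.0 / decay_steps)

def exponential_lr(t):
    """Exponential decay max_lr * gamma**t, with no built-in endpoint or floor."""
    return max_lr * gamma ** t

for frac in (0.0, 0.25, 0.5, 0.75, 1.0, 1.5):
    t = int(frac * decay_steps)
    print(f"{frac:4.2f}  cosine {cosine_lr(t):.2e}  exponential {exponential_lr(t):.2e}")
```

With this choice of gamma, the cosine curve holds a higher LR early in the decay phase, and the exponential curve keeps shrinking past the end of the run unless a floor is added by hand.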
Warmup is essential, not optional
Without warmup, AdamW's variance estimate v starts at zero, so
1/sqrt(v + eps) blows up, so the first updates are enormous. With warmup, the LR
multiplier is tiny while v is still warming up, so the early-step instability is
hidden. You can skip warmup if you use a much smaller initial learning rate (the smaller LR
≈ effective warmup), but explicit warmup is cleaner.
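As a rough illustration of why scaling the LR down early matters, the sketch below computes the first bias-corrected Adam update for a single scalar parameter. It ignores weight decay, and the beta and eps defaults are assumed values rather than anything stated in this section. Under these assumptions the first step comes out at about lr in magnitude whatever the gradient scale, so the warmup multiplier directly bounds how far the parameters move.

```python
import math

def first_adam_update(grad, lr, beta1=0.9, beta2=0.95, eps=1e-8):
    """Size of the very first bias-corrected Adam step for one scalar parameter.

    beta1, beta2 and eps are assumed defaults for illustration; weight decay is omitted.
    """
    m = (1 - beta1) * grad          # first moment after one step (state starts at 0)
    v = (1 - beta2) * grad * grad   # second moment after one step (state starts at 0)
    m_hat = m / (1 - beta1)         # bias correction at step t = 1
    v_hat = v / (1 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

max_lr, warmup_steps = 6e-4, 715
for grad in (1e-4, 1.0, 1e3):
    no_warmup = first_adam_update(grad, lr=max_lr)
    with_warmup = first_adam_update(grad, lr=max_lr / warmup_steps)
    print(f"grad {grad:8.1e}  step without warmup {no_warmup:.2e}  with warmup {with_warmup:.2e}")
```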
Warmup tokens across reproductions
| Run | Warmup tokens |
|---|---|
| GPT-2 paper | 250k |
| GPT-3 paper | 375M |
| build-nanogpt (124M on 10B-token corpus, batch 524288 tokens, 715 steps) | ≈ 375M |
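The build-nanogpt row follows from simple arithmetic on the step counts and batch size given in this section. A small check, with variable names chosen here for illustration:

```python
tokens_per_step = 524288   # batch size in tokens (2**19), from the table row above
warmup_steps = 715
max_steps = 19073

warmup_tokens = warmup_steps * tokens_per_step
total_tokens = max_steps * tokens_per_step

print(f"warmup tokens: {warmup_tokens:,}")   # 374,865,920
print(f"total tokens:  {total_tokens:,}")    # 9,999,745,024
print(f"warmup share:  {warmup_tokens / total_tokens:.2%}")
```

So 715 steps is about 375M tokens of warmup, and 19073 steps covers just under 10B tokens.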
Gradient clipping
Closely paired with the LR schedule:
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0). If the global L2 norm of all
gradients exceeds 1.0, this rescales them in place so the norm is 1.0, before each optimizer
step. This is the safety net for the rare bad batch that produces a huge gradient (numerical
instability in attention, a weird outlier in the data). Without clipping, one bad gradient
can destabilize the optimizer state for hundreds of steps.
In build-nanogpt/train_gpt2.py:
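# clip the global gradient norm to 1.0; the return value is the norm before clipping
norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
# set this step's learning rate on every parameter group
lr = get_lr(step)
for param_group in optimizer.param_groups:
    param_group['lr'] = lr
optimizer.step()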
norm (the total gradient norm before clipping) is logged each step. Watching it decay over
training is a good health check; big spikes during training usually indicate a problem.
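What global-norm clipping does can be sketched in plain Python. This is an illustration of the idea, not PyTorch's implementation: the function name, the nested-list representation of gradients, and the small `eps` guard against division by zero are all assumptions made for the example.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale a list of gradient lists so their combined L2 norm is at most max_norm.

    Returns (clipped_grads, total_norm), where total_norm is measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    scale = min(1.0, max_norm / (total_norm + eps))  # never scales gradients up
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm

# Two hypothetical "parameter tensors"; global norm = sqrt(9 + 16 + 144) = 13.
grads = [[3.0, 4.0], [12.0]]
clipped, norm = clip_by_global_norm(grads)
new_norm = math.sqrt(sum(g * g for tensor in clipped for g in tensor))
print(f"norm before: {norm:.3f}  norm after: {new_norm:.3f}")

# A gradient already inside the limit passes through unchanged.
small, small_norm = clip_by_global_norm([[0.3, 0.4]])
print(small, small_norm)
```

All gradients share one scale factor, so clipping changes the length of the update but not its direction.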
Related
- adamw — what consumes the learning rate
- gradient-accumulation — coexists with LR scheduling
- training-stability — clipping, warmup, init, all together