Learning Rate Schedules

The learning rate is often the single most consequential hyperparameter in deep learning. Constant learning rates are rarely optimal: most modern large-model training runs use a schedule that warms up, then decays. Karpathy's GPT-2 reproduction in build-nanogpt is a canonical example.

Cosine with linear warmup

The schedule described in the GPT-3 paper, adopted by Llama, and used by most GPT-2 reproductions since:

Linear warmup to max_lr, cosine decay to min_lr, then a floor.

Constants

max_lr: 6e-4
min_lr: max_lr * 0.1
warmup_steps: 715
max_steps: 19073 # ~1 epoch on 10B tokens at batch 0.5M tokens
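
The max_steps value follows from the token budget and the batch size. A quick check of the arithmetic, assuming exactly 10B tokens per epoch and a batch of 2**19 = 524288 tokens per step (the variable names here are illustrative):

```python
total_tokens = 10_000_000_000   # assumed: exactly 10B tokens in one epoch
tokens_per_step = 2 ** 19       # 524288 tokens per optimizer step (~0.5M)

# Whole optimizer steps that fit in one pass over the data
max_steps = total_tokens // tokens_per_step
print(max_steps)  # 19073
```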

# from build-nanogpt/train_gpt2.py
# (uses max_lr, min_lr, warmup_steps and max_steps from the constants above)
import math

def get_lr(it):
    # 1) linear warmup for the first warmup_steps steps
    if it < warmup_steps:
        return max_lr * (it+1) / warmup_steps
    # 2) if it > max_steps, return min learning rate
    if it > max_steps:
        return min_lr
    # 3) in between, use cosine decay down to min learning rate
    decay_ratio = (it - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes from 1 down to 0
    return min_lr + coeff * (max_lr - min_lr)

Three phases:

Phase 1

Warmup (linear)

step 0 → 715

LR climbs linearly from max_lr / 715 at step 0 to max_lr = 6e-4 at step 714; because the code uses (it+1), the LR is never exactly 0. The model is fragile at init, and applying the full learning rate to the first few steps can make the loss diverge. Warmup gives the optimizer state (momentum and variance estimates in AdamW) time to populate before it is used aggressively.

Phase 2

Cosine decay

step 715 → 19073

LR follows min_lr + 0.5 * (max_lr - min_lr) * (1 + cos(π * progress)), where progress = (step - warmup_steps) / (max_steps - warmup_steps) runs from 0 to 1. Smooth, monotonic, no abrupt cliff. Lands at min_lr = 0.1 * max_lr = 6e-5.

Phase 3

Floor

step > 19073

Anything past max_steps stays at min_lr. In practice you stop training at max_steps.
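
The three phases can be checked numerically. The sketch below repeats the schedule with the constants above so that it runs on its own, then prints the LR at the phase boundaries. The probe steps are arbitrary choices for illustration.

```python
import math

max_lr = 6e-4
min_lr = max_lr * 0.1
warmup_steps = 715
max_steps = 19073

def get_lr(it):
    # Phase 1: linear warmup
    if it < warmup_steps:
        return max_lr * (it + 1) / warmup_steps
    # Phase 3: floor
    if it > max_steps:
        return min_lr
    # Phase 2: cosine decay from max_lr down to min_lr
    decay_ratio = (it - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (max_lr - min_lr)

# First step, end of warmup, start of decay, midpoint of decay, end, past the end
for step in (0, 714, 715, 9894, 19073, 25000):
    print(f"step {step:>5}: lr = {get_lr(step):.3e}")
```

Step 9894 is the midpoint of the decay phase, where the LR sits halfway between max_lr and min_lr.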

Why cosine, not exponential

Exponential decay

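Used in older deep learning literature. It never reaches a fixed endpoint: where the LR ends up depends entirely on the decay rate you choose, which becomes one more value to tune against the length of the run.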

Cosine

Has a natural shape: the decay is slow at first (the LR stays high, so the model can still make big jumps in the loss landscape), steepest in the middle, and slow again at the end (the LR is low and the model is fine-tuning).

It's also reproducible: the only hyperparameters are max_lr, min_lr, warmup_steps, and max_steps.
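
To make the difference in shape concrete, the sketch below compares the cosine decay with a hypothetical exponential schedule over the same decay window. The exponential's rate gamma is an assumption: it is chosen here so that both curves start at max_lr and end at min_lr, which is exactly the extra tuning step cosine avoids.

```python
import math

max_lr, min_lr = 6e-4, 6e-5
decay_steps = 19073 - 715  # length of the decay phase

def cosine(t):
    # t counts steps since the end of warmup
    progress = t / decay_steps
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical exponential schedule: gamma picked to land on min_lr at the end
gamma = (min_lr / max_lr) ** (1.0 / decay_steps)

def exponential(t):
    return max_lr * gamma ** t

for frac in (0.0, 0.1, 0.5, 0.9, 1.0):
    t = int(frac * decay_steps)
    print(f"{frac:4.0%}  cosine = {cosine(t):.3e}  exponential = {exponential(t):.3e}")
```

Under these assumptions the exponential curve sheds most of its learning rate early, while cosine holds the LR near max_lr for the first part of the run.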

WSD (warmup-stable-decay). Some recent work argues for keeping LR flat for most of training and decaying sharply at the end. The intuition is that you can extend training easily by appending more flat steps. For one-shot fixed-budget runs (which is what build-nanogpt is), cosine is fine and well-understood.

Warmup is essential, not optional

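Early in training, AdamW's moment estimates m and v are built from only a handful of noisy gradients. With bias correction, the first update is roughly lr * sign(g) for every parameter, so each weight moves by about the full learning rate no matter how small its gradient is, and at max_lr that can destabilize a freshly initialized model. With warmup, the LR multiplier is tiny while m and v are still settling, so these unreliable early steps do little damage. You can skip warmup if you use a much smaller initial learning rate (the smaller LR ≈ effective warmup), but explicit warmup is cleaner.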
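
To see how large the very first AdamW step actually is, the sketch below implements a single scalar Adam update with bias correction. It assumes the PyTorch default beta1, beta2 and eps, omits weight decay, and uses a hypothetical helper name. Under these assumptions the first step has a magnitude close to lr for any gradient scale, which is why the LR itself is the lever that warmup controls.

```python
import math

def adam_first_step(grad, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the first bias-corrected Adam update for one scalar parameter."""
    # Moment estimates start at zero, so after one gradient:
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad * grad
    # Bias correction at step t = 1
    m_hat = m / (1 - beta1 ** 1)
    v_hat = v / (1 - beta2 ** 1)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

# Gradients that differ by five orders of magnitude give nearly the same step size
for g in (1e-3, 1.0, 100.0, -1.0):
    print(f"grad = {g:>8}: step at max_lr = {adam_first_step(g, 6e-4):.3e}, "
          f"at first warmup LR = {adam_first_step(g, 6e-4 / 715):.3e}")
```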

Warmup tokens across reproductions

Run	Warmup tokens
GPT-2 paper	250k
GPT-3 paper	375M
build-nanogpt (124M on 10B-token corpus, batch 524288 tokens, 715 steps)	≈ 375M
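
The build-nanogpt figure in the table is just warmup steps times tokens per step. A short check of that arithmetic (variable names are illustrative):

```python
warmup_steps = 715
max_steps = 19073
tokens_per_step = 524288  # batch size in tokens

warmup_tokens = warmup_steps * tokens_per_step
print(f"{warmup_tokens:,}")  # 374,865,920

# Share of the run spent in warmup
print(f"{warmup_steps / max_steps:.2%}")
```

Warmup covers a little under 4% of the total steps in this configuration.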

Gradient clipping

Closely paired with the LR schedule: torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0). Clips the global gradient norm to 1.0 before each optimizer step. This is the safety net for the rare bad batch that produces a huge gradient (numerical instability in attention, a weird outlier in the data). Without clipping, one bad gradient can destabilize the optimizer state for hundreds of steps.

In build-nanogpt/train_gpt2.py:

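# rescale gradients so their global L2 norm is at most 1.0; returns the pre-clip norm
norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
# set this step's learning rate on every parameter group
lr = get_lr(step)
for param_group in optimizer.param_groups:
    param_group['lr'] = lr
optimizer.step()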

The value returned by clip_grad_norm_ is the global gradient norm measured before clipping, and it is logged each step. Watching it settle over training is a good health check; large spikes usually indicate a problem.
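
What global-norm clipping does can be sketched in plain Python, without PyTorch. The helper below is a hypothetical stand-in that works on nested lists rather than tensors. Like clip_grad_norm_, it computes one norm across all gradients, scales everything by the same factor only when that norm exceeds max_norm, and returns the norm measured before clipping.

```python
import math

def clip_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Illustrative global-norm clipping over a list of gradient lists."""
    # One L2 norm across every gradient value of every parameter
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    # Scale down only if the norm exceeds max_norm; never scale up
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm

grads = [[3.0, 4.0], [12.0]]          # global norm = sqrt(9 + 16 + 144) = 13
clipped, norm = clip_global_norm(grads)
print(norm)                           # 13.0, the pre-clip norm
print(clipped)                        # same direction, norm just under 1.0
```

Because every gradient is scaled by the same factor, clipping changes the length of the update but not its direction.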

Related

adamw: what consumes the learning rate
gradient-accumulation: coexists with LR scheduling
training-stability: clipping, warmup, init, all together
