Optimizer

AdamW

AdamW is the de facto standard optimizer for training LLMs. "Adam" is the per-parameter adaptive-learning-rate optimizer from Kingma & Ba 2014; the "W" stands for "decoupled weight decay" (Loshchilov & Hutter 2019), which separates weight decay from the gradient update. Every Karpathy training script (bigram, MLP, makemore, nanoGPT, build-nanogpt, llama2.c, llm.c) uses AdamW.

What Adam does

For each parameter p, Adam keeps two running statistics:

m

Exponential moving average of the gradient — the first moment, "momentum".

v

Exponential moving average of the squared gradient — the second moment, "RMSprop".

Each step:

m = beta1 * m + (1 - beta1) * grad
v = beta2 * v + (1 - beta2) * grad^2
m_hat = m / (1 - beta1^t)    # bias correction
v_hat = v / (1 - beta2^t)
p = p - lr * m_hat / (sqrt(v_hat) + eps)

The intuition: m_hat is the smoothed direction of the gradient. sqrt(v_hat) is the smoothed magnitude. Dividing one by the other gives an update that's roughly scale-free in the gradient — each parameter gets an update of roughly fixed magnitude regardless of whether its gradient is huge or tiny. This is what "adaptive" means.
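The scale-free behaviour can be checked with a minimal sketch of the update rule for a single scalar parameter. The function names `adam_step` and `total_movement` are illustrative, not taken from any of the training scripts, and the constant gradient is an assumption chosen to keep the arithmetic easy to follow.

```python
import math

def adam_step(p, grad, m, v, t, lr=6e-4, beta1=0.9, beta2=0.95, eps=1e-8):
    """One Adam update for a single scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias correction
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

def total_movement(grad_value, steps=10):
    """Apply `steps` updates with a constant gradient and return how far p moved."""
    p, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        p, m, v = adam_step(p, grad_value, m, v, t)
    return abs(p)

small = total_movement(1e-3)  # tiny gradient
large = total_movement(1e3)   # gradient a million times larger
print(small, large)
```

Both runs move the parameter by roughly `steps * lr`, even though the gradients differ by six orders of magnitude.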

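The bias correction (the division by 1 - beta1^t and 1 - beta2^t) accounts for the fact that m and v start at zero, so their early values underestimate the true running averages. Without it, the early steps are mis-scaled by a factor of (1 - beta1^t) / sqrt(1 - beta2^t): damped for betas like (0.9, 0.95), inflated for the PyTorch default (0.9, 0.999).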
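How far off an uncorrected step would be depends on the betas. The sketch below evaluates the ratio of the uncorrected update to the corrected one, which follows directly from the formulas above when eps is ignored. The helper name `uncorrected_step_ratio` is illustrative.

```python
import math

def uncorrected_step_ratio(t, beta1, beta2):
    """Uncorrected update divided by the bias-corrected update at step t (eps ignored)."""
    return (1 - beta1 ** t) / math.sqrt(1 - beta2 ** t)

for betas in [(0.9, 0.95), (0.9, 0.999)]:
    ratios = [round(uncorrected_step_ratio(t, *betas), 3) for t in (1, 10, 100, 1000)]
    print(betas, ratios)
```

A ratio below 1 means the uncorrected step would be too small, above 1 too large. In both cases the ratio drifts toward 1 as t grows, which is why the correction only matters early in training.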

AdamW: decoupled weight decay

Classic Adam implements weight decay as an L2 penalty on the loss: loss + 0.5 * wd * ||p||^2. The gradient of that penalty is wd * p, which gets added into grad and then runs through the adaptive scaling. This is not what weight decay is meant to do: the decay should pull each weight toward zero in proportion to the weight itself, not be rescaled by 1/sqrt(v).

AdamW fixes this by applying weight decay directly to the parameter:

Classic Adam coupled

Weight decay flows through the adaptive scaling — parameters with large gradients effectively get less decay.

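loss = loss + 0.5 * wd * ||p||^2            # L2 penalty in the loss ...
grad = grad + wd * p                        # ... equivalently, wd * p added to the gradient
p = p - lr * m_hat / (sqrt(v_hat) + eps)    # m_hat and v_hat now include wd * p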

AdamW decoupled

Weight decay is applied directly to the parameter, outside the adaptive scaling.

p = p - lr * (m_hat / (sqrt(v_hat) + eps)
              + wd * p)

In Karpathy's CUDA implementation:

// from llm.c/llmc/adamw.cuh
float param = old_param - (learning_rate * (m / (sqrtf(v) + eps) + weight_decay * old_param));

This decoupling matters: with classic Adam, parameters with larger gradients effectively got less weight decay. With AdamW, every decayed parameter shrinks by the same factor, lr * wd, per step, regardless of its gradient history.
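The difference can be seen in a toy experiment, sketched here under simplifying assumptions: a single scalar weight starting at 1.0, and a zero-mean gradient that alternates between +amplitude and -amplitude. The function names and the setup are illustrative, not taken from any of the training scripts. The experiment measures how much smaller the weight ends up with decay than without it.

```python
import math

def run(amplitude, wd, decoupled, steps=200, lr=1e-3, beta1=0.9, beta2=0.95, eps=1e-8):
    """Train one scalar weight on a gradient alternating +amplitude, -amplitude."""
    p, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = amplitude if t % 2 else -amplitude
        if not decoupled:
            grad += wd * p  # coupled: the L2 gradient enters the moment estimates
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad * grad
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        step = m_hat / (math.sqrt(v_hat) + eps)
        if decoupled:
            step += wd * p  # decoupled: decay bypasses the adaptive scaling
        p -= lr * step
    return p

def shrink_from_decay(amplitude, decoupled, wd=0.1):
    """How much smaller the weight ends up with decay than without it."""
    return run(amplitude, 0.0, decoupled) - run(amplitude, wd, decoupled)

results = {
    (name, amp): shrink_from_decay(amp, decoupled)
    for name, decoupled in [("coupled", False), ("decoupled", True)]
    for amp in (0.01, 10.0)
}
for key, value in results.items():
    print(key, round(value, 4))
```

In the decoupled runs the shrinkage is essentially the same for both gradient amplitudes. In the coupled runs the weight with small gradients shrinks far more than the weight with large gradients, because the decay term is divided by sqrt(v_hat).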

Selective weight decay

A standard trick in nanoGPT/model.py, llama2.c/model.py, and build-nanogpt/train_gpt2.py:

# any parameter that is 2D or higher is weight decayed; everything else is not.
# i.e. all weight tensors in matmuls + embeddings decay, all biases and layernorms don't.
decay_params = [p for n, p in param_dict.items() if p.dim() >= 2]
nodecay_params = [p for n, p in param_dict.items() if p.dim() < 2]
optim_groups = [
    {'params': decay_params, 'weight_decay': weight_decay},
    {'params': nodecay_params, 'weight_decay': 0.0}
]

Group           Examples                             Weight decay
p.dim() >= 2    matmul weight tensors, embeddings    weight_decay
p.dim() < 2     biases, LayerNorm gains/biases       0.0

Biases and LayerNorm gains/biases (1D tensors) don't get weight decay. Matrix weights and embeddings (2D tensors) do. The intuition: you want to regularize the bulk of the model, but pulling LayerNorm weights toward zero would suppress the entire layer's contribution to the residual stream.
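To get a feel for how the split falls out, here is a small sketch that applies the same rank rule to plain shape tuples instead of real tensors. The parameter names and shapes are hypothetical, loosely modelled on a GPT-2-sized block, and `len(shape)` stands in for `p.dim()`.

```python
from math import prod

# Hypothetical parameter shapes; names are illustrative only.
param_shapes = {
    "wte.weight": (50257, 768),          # token embedding, 2D
    "attn.c_attn.weight": (2304, 768),   # matmul weight, 2D
    "attn.c_attn.bias": (2304,),         # bias, 1D
    "ln_1.weight": (768,),               # LayerNorm gain, 1D
    "ln_1.bias": (768,),                 # LayerNorm bias, 1D
}

def split_by_rank(shapes):
    """Mirror the p.dim() >= 2 rule using shape tuples instead of tensors."""
    decay = [name for name, shape in shapes.items() if len(shape) >= 2]
    no_decay = [name for name, shape in shapes.items() if len(shape) < 2]
    return decay, no_decay

decay, no_decay = split_by_rank(param_shapes)
num_decay = sum(prod(param_shapes[name]) for name in decay)
num_no_decay = sum(prod(param_shapes[name]) for name in no_decay)
print(f"decayed: {len(decay)} tensors, {num_decay:,} parameters")
print(f"not decayed: {len(no_decay)} tensors, {num_no_decay:,} parameters")
```

In this example the 1D tensors are a tiny fraction of the parameter count, so exempting them costs almost nothing in regularization coverage.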

A common value for LLM pretraining is weight_decay = 0.1. Karpathy uses this throughout.

Hyperparameters

The defaults from build-nanogpt that reproduce GPT-2 (124M):

optimizer = torch.optim.AdamW(
    optim_groups,
    lr=6e-4,
    betas=(0.9, 0.95),
    eps=1e-8,
    fused=use_fused,
)

betas = (0.9, 0.95)
The PyTorch default is (0.9, 0.999). The lower beta2 follows the GPT-3 paper and adapts faster to changing gradient magnitudes. Worth knowing, because the PyTorch default is a poor fit for LLM training.

eps = 1e-8
Same as the PyTorch default.

fused = True (CUDA)
PyTorch has a fused AdamW kernel; on CUDA it is essentially a free speedup.

lr = 6e-4
The peak learning rate; the schedule (see learning-rate-schedules) sets the value used at each step.
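To illustrate what "adapts faster" means for beta2, the sketch below counts how many steps the moving average of the squared gradient needs to cover 90% of a sudden jump in gradient scale. The helper `steps_to_adapt` and the jump from 1.0 to 100.0 are assumptions made for illustration.

```python
def steps_to_adapt(beta2, old_sq=1.0, new_sq=100.0, fraction=0.9):
    """Steps for the EMA of grad^2 to cover `fraction` of a jump from old_sq to new_sq."""
    v = old_sq  # assume v had already settled at the old level
    target = old_sq + fraction * (new_sq - old_sq)
    steps = 0
    while v < target:
        v = beta2 * v + (1 - beta2) * new_sq
        steps += 1
    return steps

fast = steps_to_adapt(0.95)   # beta2 used for LLM training here
slow = steps_to_adapt(0.999)  # PyTorch default
print(fast, slow)
```

With beta2 = 0.95 the estimate catches up within tens of steps; with 0.999 it takes a couple of thousand, during which the update is scaled by a stale magnitude.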

Memory cost

AdamW stores m and v per parameter, both in fp32. For a model with N parameters in fp32, that's 8N bytes of optimizer state on top of 4N bytes of weights and 4N bytes of gradients: the optimizer state alone is 2× the size of the weights, and the total is 16N bytes, 4× the weights.

Tensor               Size (bytes)
weights              4N
gradients            4N
m (first moment)     4N
v (second moment)    4N

Per-parameter memory in fp32. Optimizer state (m + v) is 8N, twice the size of the weights themselves.

This is why optimizer state is the dominant memory cost in LLM training (more than gradients, often more than weights) and why "ZeRO" / FSDP optimizer-state sharding is a big win.
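As a worked example of the arithmetic, the sketch below plugs in N = 124,000,000, an approximation of the GPT-2 (124M) parameter count. It assumes everything is stored in fp32 and ignores activations and other buffers; the helper name is illustrative.

```python
def fp32_training_bytes(num_params):
    """Bytes per component when weights, gradients, m, and v are all stored in fp32."""
    per_tensor = 4 * num_params  # 4 bytes per fp32 value
    return {"weights": per_tensor, "grads": per_tensor, "m": per_tensor, "v": per_tensor}

n = 124_000_000  # roughly GPT-2 (124M); activations and buffers are not counted
mem = fp32_training_bytes(n)
optimizer_state = mem["m"] + mem["v"]
total = sum(mem.values())

for name, size in mem.items():
    print(f"{name:8s} {size / 1e9:.3f} GB")
print(f"optimizer state: {optimizer_state / 1e9:.3f} GB ({optimizer_state // mem['weights']}x weights)")
print(f"total: {total / 1e9:.3f} GB ({total // mem['weights']}x weights)")
```

Sharding only the optimizer state across devices therefore removes half of this fixed footprint from each device, before touching weights or gradients.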

llm.c keeps a master copy of params in fp32 plus an fp16/bfloat16 working copy, so the cost is even higher — but this is what lets mixed precision training maintain accuracy.

Related

learning-rate-schedules: the LR is what gets passed in
backpropagation: produces the gradients AdamW consumes
mixed-precision-and-mfu: why master params exist