AdamW
AdamW is the default optimizer for training LLMs. "Adam" is the per-parameter adaptive-learning-rate optimizer from Kingma & Ba 2014; the "W" is for "decoupled weight decay" (Loshchilov & Hutter 2019), which applies weight decay directly to the parameters instead of folding it into the gradient. Every Karpathy training script (bigram, MLP, makemore, nanoGPT, build-nanogpt, llama2.c, llm.c) uses AdamW.
What Adam does
For each parameter p, Adam keeps two running statistics:
m
Exponential moving average of the gradient — the first moment, "momentum".
v
Exponential moving average of the squared gradient — the second moment, "RMSprop".
Each step:
m = beta1 * m + (1 - beta1) * grad
v = beta2 * v + (1 - beta2) * grad^2
m_hat = m / (1 - beta1^t) # bias correction
v_hat = v / (1 - beta2^t)
p = p - lr * m_hat / (sqrt(v_hat) + eps)
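The update rule above can be written as a small runnable sketch. This is plain Python on a single scalar parameter, not the code from any of the repositories mentioned here; the function names `adam_step` and `total_movement` and the default hyperparameters are assumptions chosen for illustration. It feeds two constant gradients that differ by six orders of magnitude through the same update and records how far each parameter moves.

```python
import math

def adam_step(p, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

def total_movement(grad_scale, steps=100):
    """Distance travelled by a parameter whose gradient is constant at grad_scale."""
    p, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        p, m, v = adam_step(p, grad_scale, m, v, t)
    return abs(p)

tiny = total_movement(1e-3)  # gradient of 0.001 at every step
huge = total_movement(1e3)   # gradient of 1000 at every step
print(tiny, huge)            # both land close to steps * lr = 0.1
```

Plain SGD with the same learning rate would move the second parameter a million times further than the first; here both travel about `steps * lr`.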
The intuition: m_hat is the smoothed direction of the
gradient. sqrt(v_hat) is the smoothed magnitude. Dividing
one by the other gives an update that's roughly scale-free
in the gradient — each parameter gets an update of roughly fixed magnitude
regardless of whether its gradient is huge or tiny. This is what
"adaptive" means.
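The bias correction (/ (1 - beta1^t)) accounts for the fact
that m and v start at zero and need a few steps
to reach their true running averages. Without it, both estimates would be
biased toward zero in the early steps, and the step size would be distorted:
too small or too large, depending on beta1 and beta2.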
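A short numeric sketch makes the zero-initialization bias visible. It assumes a constant gradient of 1.0, so the true running average is 1.0 at every step; the variable names are illustrative only.

```python
beta1 = 0.9
grad = 1.0  # constant gradient, so the true running average is 1.0

m = 0.0
raw, corrected = [], []
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * grad
    raw.append(m)                            # uncorrected estimate, starts near 0.1
    corrected.append(m / (1 - beta1 ** t))   # bias-corrected estimate

for t, (r, c) in enumerate(zip(raw, corrected), start=1):
    print(f"t={t}  raw={r:.5f}  corrected={c:.5f}")
```

The raw average climbs slowly from 0.1 toward 1.0, while the corrected value sits at 1.0 from the first step. The same correction applies to v with beta2 in place of beta1.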
AdamW: decoupled weight decay
The classic way to add weight decay to Adam is an L2 penalty on the loss:
loss + 0.5 * wd * ||p||^2. The gradient of that penalty is
wd * p, which gets added into grad and then runs
through the adaptive scaling. This is the problem: weight decay should be a
pull toward zero proportional to the weight itself, not modulated by
1/sqrt(v).
AdamW fixes this by applying weight decay directly to the parameter:
Classic Adam (coupled)
Weight decay flows through the adaptive scaling, so parameters with large gradients effectively get less decay.
loss = loss + 0.5 * wd * ||p||^2
grad = grad + wd * p
p = p - lr * m_hat / (sqrt(v_hat) + eps)
AdamW (decoupled)
Weight decay is applied directly to the parameter, outside the adaptive scaling.
p = p - lr * (m_hat / (sqrt(v_hat) + eps) + wd * p)
In Karpathy's CUDA implementation:
// from llm.c/llmc/adamw.cuh
float param = old_param - (learning_rate * (m / (sqrtf(v) + eps) + weight_decay * old_param));
This decoupling matters: with classic Adam, parameters with larger gradients effectively got less weight decay. With AdamW, every decayed parameter shrinks by the same relative amount, lr * wd, on each step.
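The difference can be checked with a little arithmetic. The sketch below is an illustration, not code from llm.c. It assumes a steady state with a constant gradient, so that m_hat equals the total gradient and sqrt(v_hat) equals its absolute value. The helper names and the numbers (lr, wd, the two gradient sizes) are made up for the example.

```python
def coupled_decay_term(p, grad, wd, lr, eps=1e-8):
    """Share of one Adam + L2 step that comes from the wd * p term.

    Steady-state assumption: m_hat == g_total and sqrt(v_hat) == |g_total|.
    """
    g_total = grad + wd * p  # the L2 penalty gradient is mixed into grad
    return lr * (wd * p) / (abs(g_total) + eps)

def decoupled_decay_term(p, wd, lr):
    """AdamW: the decay term bypasses the adaptive scaling entirely."""
    return lr * wd * p

lr, wd, p = 1e-3, 0.1, 1.0

coupled_big = coupled_decay_term(p, grad=10.0, wd=wd, lr=lr)    # large loss gradient
coupled_small = coupled_decay_term(p, grad=0.01, wd=wd, lr=lr)  # small loss gradient
decoupled = decoupled_decay_term(p, wd=wd, lr=lr)               # same for both

print(coupled_big, coupled_small, decoupled)
```

Under these assumptions, the parameter with the large gradient receives roughly a hundredth of the decay that the small-gradient parameter gets in the coupled form, while the decoupled term is identical for both.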
Selective weight decay
A standard trick in nanoGPT/model.py,
llama2.c/model.py, and
build-nanogpt/train_gpt2.py:
# any parameters that is 2D will be weight decayed, otherwise no.
# i.e. all weight tensors in matmuls + embeddings decay, all biases and layernorms don't.
decay_params = [p for n, p in param_dict.items() if p.dim() >= 2]
nodecay_params = [p for n, p in param_dict.items() if p.dim() < 2]
optim_groups = [
    {'params': decay_params, 'weight_decay': weight_decay},
    {'params': nodecay_params, 'weight_decay': 0.0}
]
| Group | Examples | Weight decay |
|---|---|---|
| p.dim() >= 2 | matmul weight tensors, embeddings | weight_decay |
| p.dim() < 2 | biases, LayerNorm gains/biases | 0.0 |
Biases and LayerNorm gains/biases (1D tensors) don't get weight decay. Matrix weights and embeddings (2D tensors) do. The intuition: you want to regularize the bulk of the model, but pulling LayerNorm weights toward zero would suppress the entire layer's contribution to the residual stream.
The standard value is weight_decay = 0.1 for LLM pretraining.
Karpathy uses this throughout.
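To see how the dimensionality rule splits a model, here is a dependency-free sketch that applies the same test to shape tuples instead of real tensors. The function `split_decay_groups` and the parameter names and shapes are hypothetical, loosely modeled on a GPT-2-sized block; they are not taken from any of the scripts above.

```python
import math

def split_decay_groups(param_shapes, weight_decay):
    """Group parameter names by dimensionality.

    param_shapes maps name -> shape tuple, a stand-in for real tensors
    where len(shape) plays the role of p.dim().
    """
    decay = [n for n, shape in param_shapes.items() if len(shape) >= 2]
    no_decay = [n for n, shape in param_shapes.items() if len(shape) < 2]
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# Hypothetical names and shapes for illustration.
shapes = {
    "wte.weight": (50257, 768),              # embedding table, 2D
    "h.0.ln_1.weight": (768,),               # LayerNorm gain, 1D
    "h.0.ln_1.bias": (768,),                 # LayerNorm bias, 1D
    "h.0.attn.c_attn.weight": (2304, 768),   # matmul weight, 2D
    "h.0.attn.c_attn.bias": (2304,),         # bias, 1D
}

groups = split_decay_groups(shapes, weight_decay=0.1)
decayed = sum(math.prod(shapes[n]) for n in groups[0]["params"])
total = sum(math.prod(s) for s in shapes.values())
print(groups[0]["params"], groups[1]["params"])
print(f"{decayed / total:.4%} of scalar parameters are decayed")
```

Even though three of the five tensors are exempt, almost all of the scalar parameters sit in the 2D tensors, so the exemption changes very little about how much of the model is regularized.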
Hyperparameters
The defaults from build-nanogpt that reproduce GPT-2 (124M):
optimizer = torch.optim.AdamW(
    optim_groups,
    lr=6e-4,
    betas=(0.9, 0.95),
    eps=1e-8,
    fused=use_fused,
)
Memory cost
AdamW stores m and v per parameter, both in
fp32. For a model with N parameters at fp32, that's 8N bytes
of optimizer state on top of 4N bytes of weights and
4N of gradients. The optimizer state alone is 2× the size of the
weights, and the total of 16N bytes is 4× the size of the weights.
llm.c keeps a master copy of params in fp32 plus an
fp16/bfloat16 working copy, so the cost is even higher — but this is what
lets mixed precision training
maintain accuracy.
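The pure-fp32 accounting from the start of this section is easy to reproduce. The sketch below uses a rounded parameter count of 124 million as a stand-in for GPT-2 (124M) and ignores activations, the mixed-precision working copy, and any framework overhead; `fp32_training_bytes` is a hypothetical helper name.

```python
def fp32_training_bytes(n_params):
    """Bytes for weights, gradients, and AdamW state when everything is fp32."""
    bytes_per_value = 4
    weights = bytes_per_value * n_params
    gradients = bytes_per_value * n_params
    optimizer_state = 2 * bytes_per_value * n_params  # m and v
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer_state": optimizer_state,
        "total": weights + gradients + optimizer_state,
    }

n = 124_000_000  # rounded for illustration
usage = fp32_training_bytes(n)
for name, nbytes in usage.items():
    print(f"{name:16s} {nbytes / 1e9:.3f} GB")
```

For this parameter count the weights take about 0.5 GB, and the total before activations is about 2 GB.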
Related
- learning-rate-schedules: the LR is what gets passed in
- backpropagation: produces the gradients AdamW consumes
- mixed-precision-and-mfu: why master params exist