Weight Initialization
The weights you start with determine whether training works at all. Bad init can make gradients vanish, activations saturate, or the loss diverge in the first few steps. Karpathy's lecture 4 is mostly a tour of what bad init does and how to fix it. By the end of that lecture you can read the per-layer histograms of activations and gradients and diagnose a model at a glance.
The GPT-2 init recipe
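From build-nanogpt/train_gpt2.py, following the GPT-2 initialization (the 0.02 std matches OpenAI's released GPT-2 code; the depth-dependent scaling of residual layers is described in the GPT-2 paper):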
def _init_weights(self, module):
    if isinstance(module, nn.Linear):
        std = 0.02
        # Residual projections are tagged with this attribute and get a smaller std
        if hasattr(module, 'NANOGPT_SCALE_INIT'):
            std *= (2 * self.config.n_layer) ** -0.5
        torch.nn.init.normal_(module.weight, mean=0.0, std=std)
        if module.bias is not None:
            torch.nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
Three pieces:
- std = 0.02 for all linear layers and embeddings. This is the baseline GPT-2 uses. Karpathy notes in the lecture 10 transcript that this is somewhat hand-tuned and roughly matches 1/sqrt(n_embd) for typical GPT-2 sizes (n_embd=768, so 1/sqrt(768) ≈ 0.036, the same order of magnitude).
- Scaled init for residual projections. Any linear layer that writes back into the residual stream (the output projection of attention, the second linear of the MLP) gets its std multiplied by 1/sqrt(2 * n_layer). The factor of 2 is there because each layer adds to the residual stream twice, once after attention and once after the MLP, so a model with n_layer blocks sums 2 * n_layer contributions. Without the scaling, the variance of that sum grows linearly with depth; with it, residual stream variance stays approximately constant.
- Biases initialized to zero.
The NANOGPT_SCALE_INIT attribute marker is just a clean way
to opt specific linears into the scaled init without complicated
name-matching logic.
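The effect of the 1/sqrt(2 * n_layer) factor can be checked with a toy simulation. The sketch below is an illustration and is not code from build-nanogpt: it uses plain Python instead of torch, and it assumes every attention or MLP block contributes independent unit-variance noise to the residual stream. The function name `residual_stream_std` and its parameters are made up for this example.

```python
import math
import random
import statistics

def residual_stream_std(n_layer, scale_contributions, width=2048, seed=0):
    """Std of a toy residual stream after 2 * n_layer additive updates.

    Each update is unit-variance Gaussian noise standing in for the output
    of an attention or MLP block (an assumption of this toy model).
    """
    rng = random.Random(seed)
    n_adds = 2 * n_layer  # one add after attention, one after the MLP
    scale = n_adds ** -0.5 if scale_contributions else 1.0
    stream = [0.0] * width
    for _ in range(n_adds):
        stream = [s + scale * rng.gauss(0.0, 1.0) for s in stream]
    return statistics.pstdev(stream)

n_layer = 12  # depth of GPT-2 (base)
unscaled = residual_stream_std(n_layer, scale_contributions=False)
scaled = residual_stream_std(n_layer, scale_contributions=True)
scaled_init_std = 0.02 * (2 * n_layer) ** -0.5

print(f"unscaled stream std: {unscaled:.2f} (theory: {math.sqrt(2 * n_layer):.2f})")
print(f"scaled stream std:   {scaled:.2f} (theory: 1.00)")
print(f"std for tagged projections: {scaled_init_std:.5f}")
```

In this toy model the unscaled stream ends up with a std near sqrt(24), while the scaled one stays near 1. Real block outputs are not independent noise, so treat this as intuition for the scaling rule, not as a prediction for a trained network.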
Why 0.02 specifically
The intuition (from the lecture): a Linear(n_in, n_out)
followed by a typical activation should have output variance similar to
input variance. If weights are sampled independently with zero mean and
std σ, and inputs are independent with zero mean and unit variance, each
element of W @ x is a sum of n_in terms of variance σ^2, so it has
variance n_in * σ^2. For unit output variance,
σ = 1/sqrt(n_in). This is the reasoning behind the Xavier/He family of inits.
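The n_in * σ^2 rule is easy to confirm numerically. The following is a minimal sketch in plain Python (no torch), assuming independent Gaussian weights and inputs; `linear_output_variance` is a hypothetical helper written for this note, and n_in = 128 is an arbitrary choice.

```python
import random
import statistics

def linear_output_variance(n_in, sigma, trials=2000, seed=0):
    """Empirical variance of one output unit of W @ x.

    x has n_in independent N(0, 1) entries; the weight row has n_in
    independent N(0, sigma^2) entries, resampled on every trial.
    """
    rng = random.Random(seed)
    outputs = []
    for _ in range(trials):
        y = sum(rng.gauss(0.0, sigma) * rng.gauss(0.0, 1.0) for _ in range(n_in))
        outputs.append(y)
    return statistics.pvariance(outputs)

n_in = 128
var_unit = linear_output_variance(n_in, sigma=n_in ** -0.5)  # theory: 1
var_small = linear_output_variance(n_in, sigma=0.02)         # theory: n_in * 0.02**2

print(f"sigma = 1/sqrt(n_in): variance {var_unit:.3f} (theory 1.000)")
print(f"sigma = 0.02:         variance {var_small:.4f} (theory {n_in * 0.02 ** 2:.4f})")
```

The empirical values are sample estimates, so they land near the theoretical ones and not exactly on them.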
| Model | n_embd | 1 / √n_embd | GPT-2 chose |
|---|---|---|---|
| GPT-2 (base) | 768 | ≈ 0.036 | 0.02 |
| GPT-2 XL | 1600 | 0.025 | 0.02 |
0.02 is a single value that works across all GPT-2 sizes
without per-size tuning. It is conservative: it sits below 1/√n_embd at
every size, by almost half for the base model and by about 20% for XL.
That is fine in practice, since small init plus warmup is the safe
combination.
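To see how 0.02 compares with the fan-in scale across the whole GPT-2 family, the arithmetic can be spelled out. This is a small illustrative script; `fan_in_std` and the dictionary name are hypothetical, and the widths are the n_embd values of the four released GPT-2 sizes.

```python
import math

GPT2_INIT_STD = 0.02
# n_embd for the four released GPT-2 sizes
GPT2_WIDTHS = {
    "GPT-2 (base)": 768,
    "GPT-2 medium": 1024,
    "GPT-2 large": 1280,
    "GPT-2 XL": 1600,
}

def fan_in_std(n_embd):
    """The 1 / sqrt(n_embd) reference scale."""
    return 1.0 / math.sqrt(n_embd)

for name, n_embd in GPT2_WIDTHS.items():
    ref = fan_in_std(n_embd)
    ratio = GPT2_INIT_STD / ref
    print(f"{name:13s} n_embd={n_embd:5d}  1/sqrt(n_embd)={ref:.4f}  0.02 / ref = {ratio:.2f}")
```

The ratio in the last column stays below 1 for every size and moves toward 1 as the model gets wider.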
Kaiming for ReLU/GELU
Strictly, for ReLU-family activations the right init is Kaiming/He:
σ = sqrt(2/n_in), which compensates for the fact that ReLU
zeros out half the activations and so halves their second moment. PyTorch's
default for nn.Linear is Kaiming uniform with a=sqrt(5), which works out to
a uniform distribution on ±1/sqrt(n_in), not the ReLU-gain version. GPT-2
uses a different (simpler) scheme; both work in practice, with LayerNorm
and warmup smoothing over any rough edges.
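The factor of 2 in Kaiming init can be checked the same way. The sketch below is a hedged illustration in plain Python and does not use PyTorch's init functions; it assumes Gaussian pre-activations and a plain ReLU (not GELU), and `preact_variance_after_relu` is a name invented for this example.

```python
import random
import statistics

def relu(v):
    return v if v > 0.0 else 0.0

def preact_variance_after_relu(n_in, sigma, trials=2000, seed=0):
    """Variance of one unit of W @ relu(z), with z ~ N(0, 1) and W ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(trials):
        y = sum(rng.gauss(0.0, sigma) * relu(rng.gauss(0.0, 1.0)) for _ in range(n_in))
        outputs.append(y)
    return statistics.pvariance(outputs)

n_in = 128
var_fan_in = preact_variance_after_relu(n_in, sigma=(1.0 / n_in) ** 0.5)   # theory: 0.5
var_kaiming = preact_variance_after_relu(n_in, sigma=(2.0 / n_in) ** 0.5)  # theory: 1.0

print(f"sigma = sqrt(1/n_in): next-layer variance {var_fan_in:.3f}")
print(f"sigma = sqrt(2/n_in): next-layer variance {var_kaiming:.3f}")
```

With the 1/sqrt(n_in) scale, each ReLU layer loses about half the variance, which compounds with depth; the extra factor of 2 cancels that loss.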
Activation statistics (lecture 4 in one paragraph)
In a deep tanh MLP without normalization, the wrong init causes one of two failures:
- Std too small: activations shrink toward zero layer by layer, gradients shrink with them, no learning.
- Std too large: activations saturate at ±1 for tanh (with ReLU, units instead get stuck in the zero half), gradients vanish in the saturated region, no learning.
Karpathy shows this with activation histograms per layer in the BatchNorm lecture. Either failure mode is recoverable with LayerNorm, which forces each layer's activations back to a unit-variance distribution. But starting close to the right scale lets training begin productively from step 0 instead of step 1000.
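Both failure modes show up in a few lines of simulation. This is a toy sketch and not the code from the lecture: it uses plain Python lists instead of torch tensors, a small random tanh MLP with no biases or normalization, and a saturation threshold of 0.97 chosen for this example. `tanh_mlp_stats` and its parameters are hypothetical.

```python
import math
import random
import statistics

def tanh_mlp_stats(weight_std, width=48, depth=5, batch=24, seed=0):
    """Per-layer (activation std, saturated fraction) for a random tanh MLP.

    No biases, no normalization; 'saturated' means |activation| > 0.97,
    a threshold picked for this sketch.
    """
    rng = random.Random(seed)
    xs = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(batch)]
    stats = []
    for _ in range(depth):
        W = [[rng.gauss(0.0, weight_std) for _ in range(width)] for _ in range(width)]
        xs = [[math.tanh(sum(w * x for w, x in zip(row, vec))) for row in W] for vec in xs]
        flat = [a for vec in xs for a in vec]
        saturated = sum(abs(a) > 0.97 for a in flat) / len(flat)
        stats.append((statistics.pstdev(flat), saturated))
    return stats

width = 48
for label, std in [("too small", 0.01), ("fan-in", width ** -0.5), ("too large", 1.0)]:
    last_std, last_sat = tanh_mlp_stats(std, width=width)[-1]
    print(f"{label:10s} std={std:.3f}  last-layer act std={last_std:.6f}  saturated={last_sat:.0%}")
```

Printing the full list returned by `tanh_mlp_stats` gives the per-layer view, which is the text equivalent of the histograms in the lecture.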
Llama init
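In llama2.c/model.py, the init is the
same std=0.02 Gaussian, with the same
1/sqrt(2 * n_layers) scaling applied to parameters whose names end in
w3.weight and wo.weight. wo is the attention output projection. In the
SwiGLU feed-forward block, the layer that projects back into the residual
stream is conventionally w2 (w1 and w3 are the two input-side projections),
so check the source to see which feed-forward weight the w3 match actually
scales before relying on it. Different model family,
same trick.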
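Selecting the scaled parameters by name, as opposed to tagging modules with an attribute, might look like the following. This is a sketch of the idea, not code copied from llama2.c: `init_std_for` is a hypothetical helper, the parameter names are illustrative, and the suffixes are the ones named in the text above.

```python
import math

BASE_STD = 0.02
SCALED_SUFFIXES = ("w3.weight", "wo.weight")  # suffixes named in the text above

def init_std_for(param_name, n_layers):
    """Std for a parameter, chosen purely from its name (hypothetical helper)."""
    if param_name.endswith(SCALED_SUFFIXES):
        return BASE_STD / math.sqrt(2 * n_layers)
    return BASE_STD

# Illustrative parameter names, not copied from llama2.c
names = [
    "layers.0.attention.wq.weight",
    "layers.0.attention.wo.weight",
    "layers.0.feed_forward.w1.weight",
    "layers.0.feed_forward.w3.weight",
]
n_layers = 6
for name in names:
    print(f"{name:34s} std={init_std_for(name, n_layers):.5f}")
```

Name matching needs no changes to the module definitions, but it silently depends on the naming scheme; the attribute marker used in build-nanogpt makes the opt-in explicit at the place where the layer is created.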
Related
- residual-connections — why the scaled init exists
- layernorm-vs-rmsnorm — what cleans up init mistakes
- zero-to-hero-arc — the lecture on activation statistics