Weight Initialization

The weights you start with determine whether training works at all. Bad init can make gradients vanish, activations saturate, or the loss diverge in the first few steps. Karpathy's lecture 4 is mostly a tour of what bad init does and how to fix it. By the end of that lecture you can read the per-layer histograms of activations and gradients and diagnose a model at a glance.

The GPT-2 init recipe

From build-nanogpt/train_gpt2.py, following the original GPT-2 release (the 0.02 comes from OpenAI's GPT-2 code, the depth scaling from the paper):

def _init_weights(self, module):
    if isinstance(module, nn.Linear):
        std = 0.02
        if hasattr(module, 'NANOGPT_SCALE_INIT'):
            std *= (2 * self.config.n_layer) ** -0.5
        torch.nn.init.normal_(module.weight, mean=0.0, std=std)
        if module.bias is not None:
            torch.nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

Three pieces:

1. std = 0.02 for all linear layers and embeddings. This is the baseline that GPT-2 landed on. Karpathy notes in the lecture 10 transcript that this is somewhat hand-tuned and roughly matches 1/sqrt(n_embd) for typical GPT-2 sizes (n_embd=768 gives 1/sqrt(768) ≈ 0.036, the same order of magnitude).

2. Scaled init for residual projections. Any linear layer that writes back into the residual stream (the output projection of attention, the second linear of the MLP) gets its std multiplied by 1/sqrt(2 * n_layer). The factor of 2 is there because each layer makes two residual adds. Summing N independent contributions multiplies the stream's variance by N, so shrinking each contribution's std by 1/sqrt(N) keeps residual stream variance approximately constant with depth.

3. Biases initialized to zero.

The NANOGPT_SCALE_INIT attribute marker is just a clean way to opt specific linears into the scaled init without complicated name-matching logic.
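The 1/sqrt(2 * n_layer) factor from piece 2 can be checked with a toy simulation. This is a minimal pure-Python sketch, not code from the lecture or the repo: it assumes each residual add is an independent Gaussian contribution whose std is proportional to the std of the layer that wrote it, and the names residual_std and n_trials are made up for illustration.

```python
import math
import random
import statistics

def residual_std(n_layer, contribution_std, n_trials=4000, seed=0):
    """Std of one residual-stream coordinate after 2 * n_layer independent adds.

    Toy model: every block adds two independent Gaussian terms (attention
    output and MLP output) to a stream that starts at zero.
    """
    rng = random.Random(seed)
    finals = []
    for _ in range(n_trials):
        x = 0.0
        for _ in range(2 * n_layer):
            x += rng.gauss(0.0, contribution_std)
        finals.append(x)
    return statistics.pstdev(finals)

n_layer = 12  # GPT-2 (base) depth
unscaled = residual_std(n_layer, 1.0)
scaled = residual_std(n_layer, 1.0 * (2 * n_layer) ** -0.5)

print(f"unscaled: {unscaled:.2f}  (theory: {math.sqrt(2 * n_layer):.2f})")
print(f"scaled:   {scaled:.2f}  (theory: 1.00)")
```

Without the scaling, the stream's std grows like sqrt(2 * n_layer); with it, the total added std stays near that of a single contribution, whatever the depth.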

Why 0.02 specifically

The intuition (from the lecture): a Linear(n_in, n_out) followed by a typical activation should have output variance similar to input variance. If weights are sampled with std σ and inputs have unit variance, the output of W @ x has variance n_in * σ^2. For unit output variance, σ = 1/sqrt(n_in). This is the Xavier/He family of inits.
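The n_in * σ^2 rule is easy to verify numerically. The sketch below is an illustration under stated assumptions (zero-mean Gaussian weights and inputs, all independent, fresh samples on every trial); output_variance is a hypothetical helper, and n_in = 128 is chosen only to keep the pure-Python loop fast.

```python
import random
import statistics

def output_variance(n_in, sigma, n_trials=2000, seed=0):
    """Empirical variance of one output unit y = sum_i w_i * x_i.

    Weights are N(0, sigma^2), inputs are N(0, 1), all independent.
    """
    rng = random.Random(seed)
    ys = []
    for _ in range(n_trials):
        y = 0.0
        for _ in range(n_in):
            y += rng.gauss(0.0, sigma) * rng.gauss(0.0, 1.0)
        ys.append(y)
    return statistics.pvariance(ys)

n_in = 128
var_unit = output_variance(n_in, n_in ** -0.5)  # theory: n_in * sigma^2 = 1
var_small = output_variance(n_in, 0.02)         # theory: 128 * 0.02^2 = 0.0512

print(f"sigma = 1/sqrt(n_in): variance {var_unit:.3f}")
print(f"sigma = 0.02:         variance {var_small:.4f}")
```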

| Model | n_embd | 1 / √n_embd | GPT-2 chose |
| --- | --- | --- | --- |
| GPT-2 (base) | 768 | ≈ 0.036 | 0.02 |
| GPT-2 XL | 1600 | 0.025 | 0.02 |

0.02 is a single value that works across all GPT-2 sizes without per-size tuning. It is conservative: it sits below 1/√n_embd at every size, with the largest gap for the smaller models. That is fine, since small init plus warmup is the safe combination.
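To put numbers on the comparison, the short sketch below applies the n_in * σ^2 rule to the two widths in the table. It assumes unit-variance inputs; fan_in_std and output_std are illustrative helper names, not functions from any library.

```python
import math

GPT2_WIDTHS = {"GPT-2 (base)": 768, "GPT-2 XL": 1600}
INIT_STD = 0.02

def fan_in_std(n_embd):
    """The std that would give unit output variance: 1 / sqrt(n_embd)."""
    return 1.0 / math.sqrt(n_embd)

def output_std(n_embd, sigma=INIT_STD):
    """Std of W @ x for unit-variance inputs: sqrt(n_embd * sigma^2)."""
    return math.sqrt(n_embd) * sigma

for name, n_embd in GPT2_WIDTHS.items():
    print(
        f"{name:13s} n_embd={n_embd:5d}  "
        f"1/sqrt(n_embd)={fan_in_std(n_embd):.3f}  "
        f"output std at 0.02: {output_std(n_embd):.2f}"
    )
```

At std 0.02, the output std of a width-768 layer comes out at about 0.55 and that of a width-1600 layer at 0.8, both below 1.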

Kaiming for ReLU/GELU

Strictly, for ReLU-family activations the right init is Kaiming/He: σ = sqrt(2/n_in). The factor of 2 compensates for ReLU zeroing out roughly half the pre-activations, which halves the mean square of the output. PyTorch's default for nn.Linear uses Kaiming uniform (with a = sqrt(5), which works out to weights uniform in ±1/sqrt(n_in)). GPT-2 chose a different (simpler) scheme; both work in practice, with LayerNorm and warmup smoothing over any rough edges.
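The effect of that factor of 2 shows up after only a few layers. The following is a rough pure-Python sketch, assuming bias-free Linear + ReLU layers with Gaussian weights and no normalization; relu_stack_mean_square and its gain parameter are made up for this illustration, and the exact numbers vary with the seed.

```python
import math
import random

def relu_stack_mean_square(gain, width=256, depth=5, seed=0):
    """Mean square activation after `depth` random Linear + ReLU layers.

    Weights are drawn from N(0, gain / width); the input is N(0, 1).
    """
    rng = random.Random(seed)
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]
    std = math.sqrt(gain / width)
    for _ in range(depth):
        # One output unit per pass: fresh weight row, dot product, ReLU.
        h = [
            max(0.0, sum(rng.gauss(0.0, std) * v for v in h))
            for _ in range(width)
        ]
    return sum(v * v for v in h) / width

fan_in_only = relu_stack_mean_square(gain=1.0)  # sigma = sqrt(1 / n_in)
kaiming = relu_stack_mean_square(gain=2.0)      # sigma = sqrt(2 / n_in)

print(f"sigma = sqrt(1/n_in): mean square {fan_in_only:.4f}")
print(f"sigma = sqrt(2/n_in): mean square {kaiming:.4f}")
```

With gain 1 the mean square roughly halves at every layer, so after five layers it is near 1/32 of the input; with gain 2 it stays on the order of 1.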

Activation statistics (lecture 4 in one paragraph)

In a deep tanh MLP without normalization, the wrong init causes one of two failures:

Std too small

Activations shrink toward zero, gradients shrink, no learning.

Std too large

Activations saturate at ±1 (tanh) or units get stuck in the flat zero half (ReLU). The local gradient is near zero in the saturated region, so nothing flows back and there is no learning.

Karpathy shows this with per-layer activation histograms in the BatchNorm lecture. Normalization layers largely rescue either failure mode (BatchNorm in that lecture, LayerNorm in GPT-2) by forcing each layer's activations back toward a unit-variance distribution. But starting close to the right scale lets training begin productively from step 0 instead of step 1000.
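Both failure modes can be reproduced in a few lines. This is a minimal sketch, not the lecture's code: it assumes a bias-free tanh MLP with Gaussian weights and no normalization, the 0.97 saturation threshold is an arbitrary cutoff, and tanh_stack_stats is a hypothetical helper.

```python
import math
import random

def tanh_stack_stats(weight_std, width=128, depth=5, seed=0):
    """Return (activation std, saturated fraction) after `depth` tanh layers."""
    rng = random.Random(seed)
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        h = [
            math.tanh(sum(rng.gauss(0.0, weight_std) * v for v in h))
            for _ in range(width)
        ]
    mean = sum(h) / width
    std = math.sqrt(sum((v - mean) ** 2 for v in h) / width)
    # Fraction of units pinned close to -1 or +1.
    saturated = sum(abs(v) > 0.97 for v in h) / width
    return std, saturated

width = 128
for label, scale in [("too small", 0.1), ("fan-in", 1.0), ("too large", 10.0)]:
    std, saturated = tanh_stack_stats(scale / math.sqrt(width), width=width)
    print(f"{label:9s} std={std:.4f} saturated={saturated:.0%}")
```

With the small scale the last layer's activations collapse toward zero; with the large scale most units sit in the saturated tails, where tanh has almost no gradient.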

Llama init

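In llama2.c/model.py, the init is the same std=0.02 Gaussian, with the same 1/sqrt(2 * n_layers) scaling applied to the parameters whose names end in w3.weight and wo.weight. wo is the attention output projection. Note that in that file's SwiGLU feed-forward, w2 is the layer that projects back to the model dimension, while w3 is one of the two input-side projections, so the name match does not line up exactly with "layers that write into the residual stream". Different model family, same trick.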