NORMALIZATION

LayerNorm vs RMSNorm

Normalization layers keep activations in a stable range as data flows through a deep network. They are a large part of what makes very deep networks trainable in practice: without them (or very careful initialization), activations tend to explode or saturate within a few layers, and gradients vanish or diverge. The two normalizations you see in Karpathy's repos are LayerNorm (GPT family) and RMSNorm (Llama family).

LayerNorm (GPT family)

Subtract mean, divide by std, then learned scale + bias.

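mean = x.mean(-1, keepdim=True)                  # per token, over channels
var  = x.var(-1, keepdim=True, unbiased=False)   # biased variance, as in F.layer_norm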
y = (x - mean) / sqrt(var + eps)
y = y * weight + bias    # learned per-channel

RMSNorm (Llama family)

No mean, no bias. Divide by root-mean-square, then learned scale.

def _norm(self, x):
    return x * torch.rsqrt(
        x.pow(2).mean(-1, keepdim=True) + self.eps
    )

# forward: output * self.weight

LayerNorm

For each token's activation vector x of dimension C:

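mean = x.mean(-1, keepdim=True)                  # per token, over the C channels
var  = x.var(-1, keepdim=True, unbiased=False)   # biased variance, as in F.layer_norm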
y = (x - mean) / sqrt(var + eps)
y = y * weight + bias    # learned per-channel

The normalization is over the last dimension only — each token is normalized independently. This is different from BatchNorm, which normalizes across the batch dimension and was tried and abandoned for transformers because batch dependencies are awful in sequence models (variable lengths, very small batches per device, autoregressive generation where batch dim doesn't even make sense).
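As a small worked example of the per-token behaviour, here is a dependency-free sketch in plain Python. The function name layer_norm and the toy inputs are illustrative assumptions, not code from nanoGPT. It normalizes two tokens whose scales differ by a factor of 100 and shows that each one is normalized on its own.

```python
import math

def layer_norm(x, weight, bias, eps=1e-5):
    """Normalize one token's activation vector x (a list of C floats)."""
    c = len(x)
    mean = sum(x) / c
    var = sum((v - mean) ** 2 for v in x) / c   # biased variance (divide by C)
    rstd = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * rstd * w + b for v, w, b in zip(x, weight, bias)]

# Two tokens with very different scales, C = 4 (toy values)
tokens = [[1.0, 2.0, 3.0, 4.0], [100.0, 200.0, 300.0, 400.0]]
weight = [1.0] * 4   # identity scale
bias = [0.0] * 4     # no shift

normed = [layer_norm(t, weight, bias) for t in tokens]
for row in normed:
    m = sum(row) / len(row)
    v = sum((r - m) ** 2 for r in row) / len(row)
    print([round(r, 3) for r in row], "mean:", round(m, 6), "var:", round(v, 4))
```

Both rows come out as roughly [-1.342, -0.447, 0.447, 1.342]: the statistics are taken per token, so the second token's larger scale has no effect on the first, and no batch statistics are involved.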

The wrapper in nanoGPT/model.py is a thin class around F.layer_norm:

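import torch
import torch.nn as nn
from torch.nn import functional as F

class LayerNorm(nn.Module):
    """ LayerNorm but with an optional bias. PyTorch doesn't support simply bias=False """

    def __init__(self, ndim, bias):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ndim))                   # learned per-channel scale
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None  # optional per-channel shift

    def forward(self, input):
        # normalize over the last dimension (shape = weight.shape), eps = 1e-5
        return F.layer_norm(input, self.weight.shape, self.weight, self.bias, 1e-5)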

Karpathy makes the bias optional: the original GPT-2 included bias terms in LayerNorm, so bias=True reproduces it, while modern practice often drops them (bias=False is "a bit better and faster," per the GPTConfig comment).

RMSNorm

RMSNorm strips out the mean subtraction:

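# from llama2.c/model.py
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-channel scale, no bias

    def _norm(self, x):
        # divide by the root-mean-square over the last dimension
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    def forward(self, x):
        # normalize in float32, then cast back to the input dtype
        output = self._norm(x.float()).type_as(x)
        return output * self.weight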

No mean, no bias. Just divide by the root-mean-square of the activations, then scale by a learned weight. The motivation is partly empirical (it works about as well as LayerNorm with fewer operations) and partly theoretical (the RMSNorm paper argues that LayerNorm's benefit comes from its invariance to re-scaling, and that the re-centering step contributes little).
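To make the difference from LayerNorm concrete, here is a minimal plain-Python sketch. The helper names rms_norm and layer_norm_no_affine and the input values are assumptions for illustration only. It checks two things: RMSNorm leaves the mean in place, and on an input that already has zero mean it gives the same result as LayerNorm without its affine step.

```python
import math

def rms_norm(x, weight, eps=1e-5):
    """Divide by the root-mean-square of x, then apply a per-channel scale."""
    ms = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(ms + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def layer_norm_no_affine(x, eps=1e-5):
    """LayerNorm without weight and bias, for comparison."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    rstd = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * rstd for v in x]

x = [3.0, 4.0, 5.0, 6.0]          # toy activation vector with a nonzero mean
ones = [1.0] * len(x)

out = rms_norm(x, ones)
out_rms = math.sqrt(sum(v * v for v in out) / len(out))
out_mean = sum(out) / len(out)
print("RMS after RMSNorm: ", round(out_rms, 4))    # close to 1
print("mean after RMSNorm:", round(out_mean, 4))   # not removed

# On a zero-mean input the two normalizations coincide.
mean_x = sum(x) / len(x)
centered = [v - mean_x for v in x]
diff = max(abs(a - b) for a, b in zip(rms_norm(centered, ones), layer_norm_no_affine(x)))
print("max difference on centered input:", diff)
```

So the only thing RMSNorm gives up relative to LayerNorm is the re-centering (and the bias); the re-scaling is identical.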

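RMSNorm is also cheaper. How much depends on the model and the kernel: the RMSNorm paper reports running-time reductions of roughly 7% to 64% across the models it tests. It also uses fewer parameters, since dropping the bias saves C parameters per layer. Llama, Mistral, Gemma, and most modern open-weight LLMs use it. GPT-2 is now the historical outlier.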
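To put the parameter saving in perspective, here is a quick count. It assumes a GPT-2 small shape (12 layers, width 768) with two norms per block plus one final norm; norm_param_count is a hypothetical helper written for this example.

```python
def norm_param_count(n_layer, n_embd, use_bias):
    """Parameters held by all normalization layers in a GPT-style stack."""
    n_norms = 2 * n_layer + 1                    # two per block, plus the final norm
    per_norm = n_embd * (2 if use_bias else 1)   # weight, and optionally bias
    return n_norms * per_norm

with_bias = norm_param_count(12, 768, use_bias=True)     # LayerNorm with bias
without_bias = norm_param_count(12, 768, use_bias=False)  # RMSNorm-style, weight only
print(with_bias, without_bias, with_bias - without_bias)
```

Under these assumptions the difference is 19,200 parameters in a model of roughly 124M, so the parameter saving is real but tiny; the practical gain is in compute, not in model size.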

Pre-norm vs post-norm

Independent of LN vs RMS: the placement of the normalization in the transformer block. The original "Attention is All You Need" used post-norm. GPT-2 switched to pre-norm.

Post-norm — original transformer
x = LN(x + sublayer(x))

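The normalization sits on the main path, so the stream is re-normalized after every sublayer and gradients have to pass through every LN on the way back. Deep stacks train unstably and need careful learning-rate warmup just to stay alive.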

Pre-norm — GPT-2 onward
x = x + sublayer(LN(x))

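The standard choice for deep transformers. The residual path stays a clean identity, so gradients reach early layers directly, and every sublayer sees normalized input. The stream itself is left unnormalized and can grow with depth, which is why a final norm (ln_f in nanoGPT) sits before the output head. Trains much more reliably at scale.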
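The two wirings can be compared with a toy sketch in plain Python. Everything here is an assumption made for illustration: norm is an RMS normalization without learned weights standing in for LN, and sublayer is a fixed scaling by 0.5 standing in for attention or the MLP. The script prints the RMS of the stream after each block for both placements.

```python
import math

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

def norm(x, eps=1e-5):
    """RMS normalization without a learned weight; stands in for LN or RMSNorm."""
    inv = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv for v in x]

def sublayer(x):
    """Hypothetical stand-in for attention or the MLP: a fixed scaling by 0.5."""
    return [0.5 * v for v in x]

def post_norm_block(x):
    # x = LN(x + sublayer(x))
    return norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_block(x):
    # x = x + sublayer(LN(x))
    return [a + b for a, b in zip(x, sublayer(norm(x)))]

x0 = [1.0, 2.0, 3.0, 4.0]
n_blocks = 6

x_post, x_pre = x0, x0
post_rms, pre_rms = [], []
for _ in range(n_blocks):
    x_post = post_norm_block(x_post)
    x_pre = pre_norm_block(x_pre)
    post_rms.append(rms(x_post))
    pre_rms.append(rms(x_pre))

print("post-norm stream RMS:", [round(r, 3) for r in post_rms])
print("pre-norm stream RMS: ", [round(r, 3) for r in pre_rms])
```

With this toy sublayer the post-norm stream is pinned to RMS 1 after every block, while the pre-norm stream gains 0.5 per block because each sublayer adds a unit-scale update onto an untouched identity path. Real sublayers are not a fixed scaling, so the numbers will differ; the point is only where the normalization sits relative to the residual add.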

The backward pass

LayerNorm has one of the trickiest hand-derived backward passes you'll meet: the mean and variance both depend on every element of x, so the gradient of every output flows back to every input of the same token. Karpathy has you derive the closely related BatchNorm backward pass from scratch in lecture 5, "Becoming a Backprop Ninja"; the LayerNorm version has the same structure, with the statistics taken over the channels instead of over the batch. The final formula has three terms and is famously easy to get wrong.

The three terms: (1) scale, (2) mean, (3) variance.
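A minimal sketch of that backward pass for a single token, in plain Python, checked against central finite differences. The function names, the toy values, and the upstream gradient dy are assumptions for illustration; this is not the lecture's code or the llm.c kernel. With xhat = (x - mean) * rstd and dxhat = dy * weight, the input gradient is rstd * (dxhat - mean(dxhat) - xhat * mean(dxhat * xhat)).

```python
import math

def layernorm_forward(x, weight, bias, eps=1e-5):
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n        # biased variance
    rstd = 1.0 / math.sqrt(var + eps)
    xhat = [(v - mean) * rstd for v in x]
    y = [h * w + b for h, w, b in zip(xhat, weight, bias)]
    return y, (xhat, rstd)

def layernorm_backward(dy, weight, cache):
    xhat, rstd = cache
    n = len(dy)
    dxhat = [g * w for g, w in zip(dy, weight)]                     # term 1: scale
    mean_dxhat = sum(dxhat) / n                                     # term 2: mean
    mean_dxhat_xhat = sum(g * h for g, h in zip(dxhat, xhat)) / n   # term 3: variance
    dx = [rstd * (g - mean_dxhat - h * mean_dxhat_xhat)
          for g, h in zip(dxhat, xhat)]
    dweight = [g * h for g, h in zip(dy, xhat)]
    dbias = list(dy)
    return dx, dweight, dbias

# Toy values for one token with C = 4
x = [0.5, -1.2, 2.0, 0.3]
weight = [1.0, 0.5, 2.0, 1.5]
bias = [0.0, 0.1, -0.1, 0.2]
dy = [0.3, -0.7, 1.1, 0.2]      # pretend upstream gradient dL/dy

y, cache = layernorm_forward(x, weight, bias)
dx, dweight, dbias = layernorm_backward(dy, weight, cache)

def loss(xv):
    # scalar whose gradient with respect to y is exactly dy
    yv, _ = layernorm_forward(xv, weight, bias)
    return sum(a * b for a, b in zip(yv, dy))

h = 1e-5
dx_numeric = []
for j in range(len(x)):
    xp = list(x); xp[j] += h
    xm = list(x); xm[j] -= h
    dx_numeric.append((loss(xp) - loss(xm)) / (2 * h))

max_err = max(abs(a - b) for a, b in zip(dx, dx_numeric))
print("max abs error vs finite differences:", max_err)
```

A useful sanity check falls out of the formula: the entries of dx sum to zero, because adding a constant to every channel of x leaves the LayerNorm output unchanged.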

In CUDA the LayerNorm forward and backward are explicit kernels: see llm.c/llmc/layernorm.cuh for a warp-parallel reduction that computes mean and rstd in one pass and then applies the affine transform in a second pass. RMSNorm is the same but drops the mean reduction, so it needs one fewer reduction over the data.
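The two-pass structure can be sketched sequentially in plain Python. This is an illustration of the idea only, not a transcription of the llm.c kernel: row_stats and layernorm_rows are hypothetical names, and the statistics pass here accumulates a sum and a sum of squares in a single sweep, which is one common way to get both values in one pass.

```python
import math

def row_stats(row, eps=1e-5):
    """Pass 1: a single sweep that accumulates the sum and the sum of squares."""
    s = 0.0
    sq = 0.0
    for v in row:
        s += v
        sq += v * v
    n = len(row)
    mean = s / n
    var = max(sq / n - mean * mean, 0.0)   # E[x^2] - E[x]^2, clamped against rounding
    return mean, 1.0 / math.sqrt(var + eps)

def layernorm_rows(x, weight, bias, eps=1e-5):
    out = []
    for row in x:   # each row is one token; a GPU kernel would process rows in parallel
        mean, rstd = row_stats(row, eps)
        # Pass 2: normalize and apply the affine transform
        out.append([(v - mean) * rstd * w + b for v, w, b in zip(row, weight, bias)])
    return out

x = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]]
weight = [1.0, 1.0, 1.0, 1.0]
bias = [0.0, 0.0, 0.0, 0.0]
out = layernorm_rows(x, weight, bias)
print([[round(v, 3) for v in row] for row in out])
```

The E[x^2] - E[x]^2 form can lose precision when the mean is large relative to the spread, which is why the variance is clamped at zero here. For an RMSNorm variant of this sketch, the s accumulator and the mean subtraction would simply be dropped.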

Activation statistics — the BatchNorm lecture

Karpathy's lecture 4 ("Activations & Gradients, BatchNorm") motivates normalization layers by showing what happens without them: in a deep MLP with bad init, activations saturate the tanh nonlinearity within 3 layers and gradients vanish. The lecture builds up from "get the init gain right" to "scale per layer" to BatchNorm, with histograms of activations and gradients at each step. It's the best ground-up explanation of why normalization layers exist that I've seen.
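The saturation effect is easy to reproduce in a few lines of plain Python. This is a rough sketch under assumed settings, not the lecture's code: a 3-layer tanh MLP of width 50, Gaussian inputs, a saturation threshold of 0.97, and a plain 1/sqrt(fan_in) scaling for the well-behaved case are all choices made for this example.

```python
import math
import random

def saturation_after(n_layers, weight_std, width=50, n_inputs=20,
                     seed=0, threshold=0.97):
    """Fraction of tanh outputs with |h| > threshold after n_layers layers."""
    rng = random.Random(seed)
    # a small batch of inputs drawn from N(0, 1)
    h = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(n_inputs)]
    for _ in range(n_layers):
        w = [[rng.gauss(0.0, weight_std) for _ in range(width)]
             for _ in range(width)]
        h = [[math.tanh(sum(wi * xi for wi, xi in zip(row, x))) for row in w]
             for x in h]
    flat = [abs(v) for x in h for v in x]
    return sum(v > threshold for v in flat) / len(flat)

width = 50
bad = saturation_after(3, weight_std=1.0, width=width)                        # no fan-in scaling
scaled = saturation_after(3, weight_std=1.0 / math.sqrt(width), width=width)  # 1/sqrt(fan_in)
print("saturated fraction, unscaled init:", round(bad, 3))
print("saturated fraction, scaled init:  ", round(scaled, 3))
```

With unscaled weights most units sit in the flat tails of tanh, where the local gradient is close to zero; with the scaled init almost none do. Normalization layers aim for the same outcome without depending on the init being exactly right.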
