Residual Connections

A residual connection is y = x + f(x). The f(x) part is some learned function (attention, MLP, conv block); the x is a straight wire from input to output. In a transformer block, you see two of them per layer — one around attention, one around the MLP — and they're load-bearing in a way that's easy to under-appreciate.
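
A minimal sketch of the two residual additions in one block, using plain Python lists in place of tensors. The names `residual`, `block`, `toy_attn`, `toy_mlp`, and `zero` are hypothetical stand-ins rather than any library's API; real sublayers would be learned modules.

```python
from typing import Callable, List

Vector = List[float]
Sublayer = Callable[[Vector], Vector]


def residual(f: Sublayer) -> Sublayer:
    """Wrap f so that the result is y = x + f(x)."""
    def wrapped(x: Vector) -> Vector:
        return [xi + fi for xi, fi in zip(x, f(x))]
    return wrapped


def block(x: Vector, attn: Sublayer, mlp: Sublayer) -> Vector:
    """One transformer-style block: two residual additions in sequence."""
    x = residual(attn)(x)  # first residual, around attention
    x = residual(mlp)(x)   # second residual, around the MLP
    return x


# Toy stand-ins for the learned sublayers (assumptions for illustration only).
def toy_attn(x: Vector) -> Vector:
    return [0.5 * v for v in x]


def toy_mlp(x: Vector) -> Vector:
    return [max(0.0, v) - 1.0 for v in x]


def zero(x: Vector) -> Vector:
    return [0.0 for _ in x]


identity_out = block([1.0, -2.0], zero, zero)
toy_out = block([1.0, -2.0], toy_attn, toy_mlp)
```

When both sublayers output zeros, the block returns its input exactly: that is the straight wire doing its job.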

What the residual stream actually is

The unbroken x channel that runs from the embedding through every block to the final unembedding is called the residual stream. Every block reads from it (via a LayerNorm), computes something, and writes the result back via the +. Nothing in the standard architecture replaces the residual stream — only adds to it.

Mathematically, this means a deep transformer is really:

x_L = x_0 + ∑_{i=1}^{L} f_i(LN(x_{i-1}))
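
The unrolled form can be checked numerically. The sketch below uses a scalar stream and hypothetical toy layers (`toy_layers` is made up for illustration, and the LayerNorm is folded into each layer function). It records what every layer writes and confirms that the final value is the input plus the sum of those writes.

```python
import math
from typing import Callable, List, Tuple


def run_stream(x0: float, layers: List[Callable[[float], float]]) -> Tuple[float, List[float]]:
    """Apply x_i = x_{i-1} + f_i(x_{i-1}) and record every f_i output."""
    x = x0
    deltas = []
    for f in layers:
        delta = f(x)   # what this layer writes into the stream
        deltas.append(delta)
        x = x + delta  # the residual addition
    return x, deltas


# Hypothetical toy layers; any functions would do here.
toy_layers = [math.tanh, lambda v: 0.1 * v * v, math.sin, lambda v: -0.5 * v]

x_final, deltas = run_stream(0.7, toy_layers)

# x_L equals x_0 plus the sum of per-layer contributions (up to float rounding).
matches = math.isclose(x_final, 0.7 + sum(deltas), rel_tol=1e-12, abs_tol=1e-12)
```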

The output is the input plus a sum of per-layer contributions. This has two huge consequences:

Gradients have a free path

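During backprop, each + passes the gradient through to x unchanged: the Jacobian of x + f_i(x) with respect to x is the identity plus the branch's Jacobian. Attenuation or amplification can only come from the f_i branches, never from the skip path. This is a large part of why a 96-layer model like GPT-3 can be trained without the vanishing- and exploding-gradient problems that plagued deep networks before residual connections.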
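
A scalar toy makes the argument concrete. For x_i = x_{i-1} + f_i(x_{i-1}) the chain rule gives dx_L/dx_0 = ∏ (1 + f_i'(x_{i-1})), while a plain stack x_i = f_i(x_{i-1}) gives ∏ f_i'(x_{i-1}). The sketch assumes f_i(x) = w * tanh(x) with a small hypothetical weight w = 0.1; it is an illustration only, not a model of a real transformer.

```python
import math


def input_gradient(x0: float, depth: int, w: float, use_residual: bool) -> float:
    """Return d x_depth / d x_0 for a stack of scalar layers f(x) = w * tanh(x)."""
    x, grad = x0, 1.0
    for _ in range(depth):
        t = math.tanh(x)
        f_prime = w * (1.0 - t * t)  # derivative of w * tanh(x)
        if use_residual:
            grad *= 1.0 + f_prime    # identity path plus branch
            x = x + w * t
        else:
            grad *= f_prime          # branch only
            x = w * t
    return grad


plain = input_gradient(0.5, depth=96, w=0.1, use_residual=False)
resid = input_gradient(0.5, depth=96, w=0.1, use_residual=True)
```

With these settings the plain stack's gradient falls below 1e-50, while the residual stack's gradient stays at or above 1 because every factor contains the identity term. The skip path does not by itself stop the branches from amplifying the gradient; it only guarantees that the signal cannot be multiplied away.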

Information has a free path

Early-layer features can survive to late layers without being mangled. Late layers can read early-layer information directly. This decouples "depth" from "destruction of information."

The mechanistic-interpretability framing

The residual stream view is also the foundation of Anthropic's mechanistic interpretability work: each layer adds a delta into a shared vector space, and because those deltas add, you can sometimes find low-rank circuits that span multiple layers by tracing their additive contributions. The architecture invites this framing in a way that, say, an RNN does not.
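
Because contributions add, any linear readout of the final stream splits into one term per layer. The sketch below is a toy version of that bookkeeping: `readout` is a hypothetical direction in the stream space, `deltas` are made-up layer outputs, and the final LayerNorm is ignored, which a real model would have to account for.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def attribute(readout: Vector, x0: Vector, deltas: List[Vector]) -> List[float]:
    """Split readout . x_L into one term for x_0 and one term per layer delta."""
    return [dot(readout, x0)] + [dot(readout, d) for d in deltas]


# Made-up numbers for illustration.
x0 = [1.0, 0.0, -1.0]
deltas = [[0.5, 0.5, 0.0], [0.0, -2.0, 1.0], [0.25, 0.0, 0.0]]
readout = [2.0, 1.0, 0.0]

# Final stream: x_L = x_0 + sum of deltas, computed coordinate-wise.
x_final = [x0[j] + sum(d[j] for d in deltas) for j in range(len(x0))]
terms = attribute(readout, x0, deltas)

# The per-layer terms add up to the readout of the final stream.
consistent = math.isclose(sum(terms), dot(readout, x_final))
```

Each entry of `terms` says how much one source (the input or one layer) pushed the readout up or down; here the second layer's term is negative while the others are positive.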

Special initialization for residual projections

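Because every block adds into the residual stream, the variance of the residual stream grows with depth. If the input has unit variance and each block adds an independent contribution with unit variance, then after n_layer blocks the residual stream has variance n_layer + 1. Left unchecked, the scale of the stream at init depends on how deep the model is.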
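
The variance arithmetic is easy to check by simulation. The sketch assumes the input and every block's contribution are independent draws from a unit-variance Gaussian, which is an idealization; `stream_variance` is a hypothetical helper, not part of any library.

```python
import random
import statistics


def stream_variance(n_layer: int, n_samples: int = 20000, seed: int = 0) -> float:
    """Estimate Var(x_0 + sum of n_layer independent unit-variance terms)."""
    rng = random.Random(seed)
    finals = []
    for _ in range(n_samples):
        x = rng.gauss(0.0, 1.0)       # x_0 with unit variance
        for _ in range(n_layer):
            x += rng.gauss(0.0, 1.0)  # one block's contribution
        finals.append(x)
    return statistics.pvariance(finals)


estimate = stream_variance(n_layer=12)
```

For n_layer = 12 the estimate lands close to 13, matching n_layer + 1.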

GPT-2 and all of Karpathy's GPT implementations fix this by scaling the initialization of the residual output projections (the projection at the end of attention and the end of the MLP) by 1/sqrt(2 * n_layer):

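# from build-nanogpt/train_gpt2.py (excerpt; a method of the GPT nn.Module)
# assumes: import torch; import torch.nn as nn
def _init_weights(self, module):
    if isinstance(module, nn.Linear):
        std = 0.02  # base std for every linear layer
        # residual output projections carry the NANOGPT_SCALE_INIT marker
        if hasattr(module, 'NANOGPT_SCALE_INIT'):
            # 2 residual additions per block -> 2 * n_layer additions in total
            std *= (2 * self.config.n_layer) ** -0.5
        torch.nn.init.normal_(module.weight, mean=0.0, std=std)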

The NANOGPT_SCALE_INIT marker is attached to the residual-stream output projections (the second linear in MLP, the output projection in attention). The factor of 2 accounts for the two residual additions per block. This keeps the residual stream variance approximately constant with depth at init.
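
To see the effect of the scaling, a simulation can run 2 * n_layer additions, each shrunk by (2 * n_layer) ** -0.5. This is a sketch under the idealized assumption that the input and each unscaled contribution are independent unit-variance Gaussians; `residual_proj_std` and `scaled_stream_variance` are hypothetical helper names.

```python
import random
import statistics


def residual_proj_std(n_layer: int, base_std: float = 0.02) -> float:
    """Std for residual output projections under the 1/sqrt(2 * n_layer) scaling."""
    return base_std * (2 * n_layer) ** -0.5


def scaled_stream_variance(n_layer: int, n_samples: int = 5000, seed: int = 0) -> float:
    """Estimate the stream variance after 2 * n_layer scaled additions."""
    rng = random.Random(seed)
    scale = (2 * n_layer) ** -0.5
    finals = []
    for _ in range(n_samples):
        x = rng.gauss(0.0, 1.0)               # unit-variance input
        for _ in range(2 * n_layer):          # attention + MLP in each block
            x += scale * rng.gauss(0.0, 1.0)  # scaled contribution
        finals.append(x)
    return statistics.pvariance(finals)


variances = {n: scaled_stream_variance(n) for n in (6, 12, 48)}
```

Under these assumptions the estimate stays near 2 (the input's unit variance plus one unit added in total) for every depth tried, instead of growing with n_layer.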

See weight-init for the full story. Karpathy notes in lecture 10, "Let's reproduce GPT-2", that this scaling matches the GPT-2 paper's initialization scheme.

Pre-norm makes residuals possible at depth

Residual connections are old (ResNet, 2015). What makes them work well for deep transformers specifically is putting the LayerNorm inside the residual branch (pre-norm, x + f(LN(x))) rather than after the addition (post-norm, LN(x + f(x))). Pre-norm keeps the input to each sublayer well-scaled while leaving the residual stream itself unnormalized, which is essential for the gradient and information arguments above.
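
The difference between the two placements shows up even with a sublayer that writes nothing. A minimal sketch, with plain lists in place of tensors and a LayerNorm without learned gain or bias (both simplifications are assumptions for illustration, as are all function names):

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_norm_step(x: Vector, f: Callable[[Vector], Vector]) -> Vector:
    """x + f(LN(x)): the norm sits inside the branch, the stream is untouched."""
    return [xi + fi for xi, fi in zip(x, f(layer_norm(x)))]


def post_norm_step(x: Vector, f: Callable[[Vector], Vector]) -> Vector:
    """LN(x + f(x)): the norm sits on the stream itself."""
    return layer_norm([xi + fi for xi, fi in zip(x, f(x))])


def zero(x: Vector) -> Vector:
    return [0.0 for _ in x]


x = [10.0, 20.0, 30.0]
pre = pre_norm_step(x, zero)    # stream passes through unchanged
post = post_norm_step(x, zero)  # stream is rescaled even though f wrote nothing
```

With pre-norm the identity path is intact, so `pre` equals `x`. With post-norm every layer rewrites the stream, so even a silent sublayer changes what later layers see.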
