Architecture / Building Blocks

The Transformer Block

A transformer is n_layer copies of the same building block stacked on top of each other. Each block is the unit of "communication followed by computation": the attention layer mixes information across positions, then the MLP applies a per-position nonlinearity. Both are wrapped in residual connections and preceded by LayerNorm.

The canonical pre-norm block

Every Karpathy GPT implementation uses the same block, modulo small details:

# from ng-video-lecture/gpt.py
class Block(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x
x ln1 self-attention + ln2 ffwd (MLP) + read mix across positions residual add read per-position compute residual add
Pre-norm block: the residual stream (left spine) is read via LayerNorm, processed, then added back.

Two things to notice.

Residuals are unconditional. The residual connections (x + ...) are not gated, not weighted, just adds.

Pre-norm, not post-norm. The LayerNorm is applied before attention and MLP, not after. This is "pre-norm," and it's what stabilizes deep training. The original "Attention is All You Need" paper used post-norm; GPT-2 switched to pre-norm; everyone has used pre-norm since.

The MLP / feedforward

The MLP is two linear layers with a nonlinearity in between, and a 4× expansion in the hidden dimension:

# from nanoGPT/model.py
class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc    = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
        self.gelu    = nn.GELU()
        self.c_proj  = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
        self.dropout = nn.Dropout(config.dropout)

So the MLP holds ~8 * n_embd^2 parameters per block, compared to ~4 * n_embd^2 for attention (Q, K, V, and output projections).

MLP — ~8 · n_embd²
Attn — ~4 · n_embd²
MLP (~2/3) Attention (~1/3)

For typical model sizes, the MLP is where most of the parameters live — about 2/3 of every block. The GELU activation is GPT's choice; Llama uses SwiGLU instead, which trades a third linear for a multiplicative gate.

What lives outside the block

Everything that isn't repeated n_layer times: token + position embeddings at the input, a final LayerNorm, and the unembedding (lm_head) at the output. In GPT-2 the embedding and unembedding share weights (weight tying).

# from nanoGPT/model.py
self.transformer = nn.ModuleDict(dict(
    wte = nn.Embedding(config.vocab_size, config.n_embd),
    wpe = nn.Embedding(config.block_size, config.n_embd),
    drop = nn.Dropout(config.dropout),
    h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
    ln_f = LayerNorm(config.n_embd, bias=config.bias),
))
self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
self.transformer.wte.weight = self.lm_head.weight  # weight tying

The residual stream as central data structure

The residual stream — the running x that gets added into through every block — is the central data structure. Think of each block as "reading from" the residual stream via the LayerNorm and "writing to" it via the residual add. Attention writes information that mixes across positions; the MLP writes information that's computed per-position. The whole transformer is n_layer rounds of read → process → add-back.

read ln1(x) from stream
process sa(...) mixes positions
add-back x = x + sa(ln1(x)) into stream
read ln2(x) from stream
process ffwd(...) per-position
add-back x = x + ffwd(ln2(x)) into stream

Llama 2 variant

In llama2.c/model.py, the block looks nearly identical, with three substitutions:

Learned positional embedding
RoPE (applied inside attention)
GELU MLP
SwiGLU MLP

These changes are independent of each other and orthogonal to the overall architecture. The "transformer block" abstraction is robust enough that you can swap parts in and out without changing anything else.

Related