The Transformer Block

A transformer is n_layer copies of the same building block stacked on top of each other. Each block is the unit of "communication followed by computation": the attention layer mixes information across positions, then the MLP applies a per-position nonlinearity. Both are wrapped in residual connections and preceded by LayerNorm.

The canonical pre-norm block

Every Karpathy GPT implementation uses the same block, modulo small details:

# from ng-video-lecture/gpt.py
class Block(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

Pre-norm block: the residual stream (left spine) is read via LayerNorm, processed, then added back.

Two things to notice.

Residuals are unconditional. The residual connections (x + ...) are not gated or weighted; they are plain additions.

Pre-norm, not post-norm. The LayerNorm is applied before attention and the MLP, not after. This ordering is called "pre-norm," and it helps stabilize the training of deep stacks. The original "Attention Is All You Need" paper used post-norm; GPT-2 switched to pre-norm, and most GPT-style models have used pre-norm since.
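To make the difference in ordering concrete, here is a minimal pure-Python sketch on a single vector. The `sublayer` function is a hypothetical stand-in for attention or the MLP, and `layer_norm` omits the learned scale and shift; neither comes from the Karpathy code. It only shows where the normalization sits, not why training is more stable.

```python
import math

def layer_norm(v, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def sublayer(v):
    """Hypothetical stand-in for attention or the MLP: any vector-to-vector function."""
    return [0.5 * a + 1.0 for a in v]

def pre_norm_step(x):
    # x = x + f(LN(x)): only the copy fed to f is normalized, the stream is not.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def post_norm_step(x):
    # x = LN(x + f(x)): the stream itself is re-normalized after every sublayer.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

x = [1.0, 2.0, 4.0, 8.0]
pre, post = x, x
for _ in range(6):
    pre = pre_norm_step(pre)
    post = post_norm_step(post)

pre_mean = sum(pre) / len(pre)     # drifts: the stream accumulates every write
post_mean = sum(post) / len(post)  # stays at ~0: the stream is reset each step
```

In the pre-norm loop the input reaches the output through additions only, so the identity path is never rescaled. In the post-norm loop every step passes the whole stream through a normalization.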

The MLP / feedforward

The MLP is two linear layers with a nonlinearity in between, and a 4× expansion in the hidden dimension:

# from nanoGPT/model.py
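class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc    = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
        self.gelu    = nn.GELU()
        self.c_proj  = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        x = self.c_fc(x)     # expand: n_embd -> 4 * n_embd
        x = self.gelu(x)     # per-position nonlinearity
        x = self.c_proj(x)   # project back: 4 * n_embd -> n_embd
        x = self.dropout(x)
        return x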

So the MLP holds ~8 * n_embd^2 parameters per block (two matrices of n_embd × 4 * n_embd each), compared to ~4 * n_embd^2 for attention (four n_embd × n_embd matrices: the Q, K, V, and output projections).

Per-block parameter split: MLP (~2/3), attention (~1/3).

For typical model sizes, the MLP is where most of a block's parameters live: about 2/3 of every block. GELU is GPT's choice of activation; Llama uses SwiGLU instead, which adds a third linear layer to form a multiplicative gate.
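The 2/3 figure can be checked with a few lines of arithmetic. The sketch below counts only the weight matrices by default and assumes n_embd = 768 (the GPT-2 small width) for illustration. The two LayerNorms in each block are left out because they add only 2 * n_embd parameters each. The helper names are made up for this example.

```python
def attention_params(n_embd, bias=False):
    # Q, K, V and output projections: four n_embd x n_embd matrices.
    weights = 4 * n_embd * n_embd
    biases = 4 * n_embd if bias else 0
    return weights + biases

def mlp_params(n_embd, bias=False):
    # c_fc maps n_embd -> 4 * n_embd, c_proj maps 4 * n_embd -> n_embd.
    weights = 2 * n_embd * (4 * n_embd)
    biases = (4 * n_embd + n_embd) if bias else 0
    return weights + biases

n_embd = 768  # assumed width, for illustration only
attn = attention_params(n_embd)
mlp = mlp_params(n_embd)
mlp_share = mlp / (attn + mlp)

print(attn, mlp, round(mlp_share, 3))  # 2359296 4718592 0.667
```

Without biases the ratio is exactly 8 : 4, whatever the value of n_embd. Turning biases on changes the share only slightly.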

What lives outside the block

Everything that isn't repeated n_layer times: token + position embeddings at the input, a final LayerNorm, and the unembedding (lm_head) at the output. In GPT-2 the embedding and unembedding share weights (weight tying).

# from nanoGPT/model.py
self.transformer = nn.ModuleDict(dict(
    wte = nn.Embedding(config.vocab_size, config.n_embd),
    wpe = nn.Embedding(config.block_size, config.n_embd),
    drop = nn.Dropout(config.dropout),
    h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
    ln_f = LayerNorm(config.n_embd, bias=config.bias),
))
self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
self.transformer.wte.weight = self.lm_head.weight  # weight tying
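To get a feel for how much lives outside the blocks, and what weight tying saves, here is a small parameter count. It assumes the GPT-2 small sizes (vocab_size = 50257, block_size = 1024, n_embd = 768) and a final LayerNorm with both scale and bias; the variable names mirror the module names above but the script is only an illustration.

```python
vocab_size, block_size, n_embd = 50257, 1024, 768  # assumed GPT-2 small sizes

wte = vocab_size * n_embd      # token embedding table
wpe = block_size * n_embd      # learned position embedding table
ln_f = 2 * n_embd              # final LayerNorm: scale and bias
lm_head = vocab_size * n_embd  # unembedding matrix, no bias

untied = wte + wpe + ln_f + lm_head
tied = wte + wpe + ln_f        # lm_head reuses the wte matrix
saved = untied - tied

print(f"{tied:,} parameters outside the blocks, {saved:,} saved by tying")
```

Under these assumptions the token embedding table is by far the largest item outside the blocks, so sharing it with lm_head removes one whole vocab_size × n_embd matrix.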

The residual stream as central data structure

The residual stream — the running x that gets added into through every block — is the central data structure. Think of each block as "reading from" the residual stream via the LayerNorm and "writing to" it via the residual add. Attention writes information that mixes across positions; the MLP writes information that's computed per-position. The whole transformer is n_layer rounds of read → process → add-back.

read      ln1(x)                  from stream
process   sa(...)                 mixes positions
add-back  x = x + sa(ln1(x))      into stream
read      ln2(x)                  from stream
process   ffwd(...)               per-position
add-back  x = x + ffwd(ln2(x))    into stream
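The read, process, add-back loop can be sketched without any framework. In the toy version below, `toy_attention` (a causal running average) and `toy_mlp` (a per-position ReLU) are hypothetical stand-ins chosen only so that one step mixes positions and the other does not; they are not the real sublayers. The point is the bookkeeping: the final stream equals the input plus the sum of every write.

```python
import math

def layer_norm(v, eps=1e-5):
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def toy_attention(seq):
    """Stand-in for attention: each position averages itself and all earlier positions."""
    out = []
    for t in range(len(seq)):
        prefix = seq[: t + 1]
        out.append([sum(col) / len(prefix) for col in zip(*prefix)])
    return out

def toy_mlp(seq):
    """Stand-in for the MLP: a ReLU applied to each position independently."""
    return [[max(0.0, a) for a in v] for v in seq]

def add(seq_a, seq_b):
    return [[a + b for a, b in zip(va, vb)] for va, vb in zip(seq_a, seq_b)]

def run_blocks(x, n_layer):
    writes = []
    for _ in range(n_layer):
        for process in (toy_attention, toy_mlp):
            read = [layer_norm(v) for v in x]  # read from the stream
            delta = process(read)              # process
            x = add(x, delta)                  # add back into the stream
            writes.append(delta)
    return x, writes

x0 = [[1.0, -2.0, 0.5], [0.0, 3.0, -1.0]]  # 2 positions, 3 channels
x_final, writes = run_blocks(x0, n_layer=3)

# Replay the writes on top of the input: this reproduces the final stream.
total = x0
for delta in writes:
    total = add(total, delta)
```

Because the stand-in attention is causal, position 0 never sees position 1. Changing the second input vector leaves the first output vector untouched.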

Llama 2 variant

In llama2.c/model.py, the block looks nearly identical, with three substitutions:

LayerNorm → RMSNorm
Learned positional embedding → RoPE (applied inside attention)
GELU MLP → SwiGLU MLP

These changes are independent of each other and orthogonal to the overall architecture. The "transformer block" abstraction is robust enough that you can swap parts in and out without changing anything else.
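One reason the SwiGLU swap is so self-contained is that its hidden width is usually shrunk so the MLP keeps roughly the same parameter budget despite the third matrix. The sketch below assumes a common Llama-style sizing rule: take 2/3 of the 4× width, then round up to a multiple of `multiple_of`. Treat the rule, the default of 32, and the helper names as assumptions for illustration rather than a quote from llama2.c.

```python
def swiglu_hidden_dim(dim, multiple_of=32):
    # Assumed sizing rule: 2/3 of the usual 4x width, rounded up to a multiple.
    hidden = int(2 * (4 * dim) / 3)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)

def gelu_mlp_weights(dim):
    # Up-projection and down-projection with a 4x hidden width.
    return 2 * dim * (4 * dim)

def swiglu_mlp_weights(dim, multiple_of=32):
    # Gate, up and down projections with the reduced hidden width.
    return 3 * dim * swiglu_hidden_dim(dim, multiple_of)

dim = 768  # assumed width, for illustration only
print(swiglu_hidden_dim(dim))                          # 2048
print(gelu_mlp_weights(dim), swiglu_mlp_weights(dim))  # 4718592 4718592
```

At this width the two MLPs come out to the same weight count, so the MLP's share of the block is unchanged. At widths where the rounding kicks in, the SwiGLU version ends up slightly larger.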

Related

attention: the "communication" half
residual-connections: why the + matters
layernorm-vs-rmsnorm: what normalization actually does here
weight-init: special scaling for residual projections
