The Transformer Block
A transformer is n_layer copies of the same building block stacked on top of
each other. Each block is the unit of "communication followed by computation": the
attention layer mixes information across positions, then the
MLP applies a per-position nonlinearity. Both are wrapped in residual connections and
preceded by LayerNorm.
The canonical pre-norm block
Every Karpathy GPT implementation uses the same block, modulo small details:
# from ng-video-lecture/gpt.py
class Block(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))    # communication: attention over normalized input
        x = x + self.ffwd(self.ln2(x))  # computation: per-position MLP
        return x
Two things to notice.
Residuals are unconditional. The residual connections (x + ...) are not gated or weighted; they are plain additions.
Pre-norm, not post-norm. The LayerNorm is applied before attention and the MLP, not after. This is "pre-norm," and it helps stabilize the training of deep stacks. The original "Attention Is All You Need" paper used post-norm; GPT-2 switched to pre-norm, and most GPT-style models since have followed.
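To make the pre-norm/post-norm distinction concrete, here is a minimal sketch in plain Python (not PyTorch, so it runs without dependencies). The function names and the zero sublayer are hypothetical, chosen only to isolate what happens to the residual path; the norm has no learned scale or shift.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one position's vector to zero mean, unit variance (no learned parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_norm_step(x, sublayer):
    """x + sublayer(norm(x)): the residual path is an untouched identity."""
    update = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, update)]

def post_norm_step(x, sublayer):
    """norm(x + sublayer(x)): the residual path itself passes through the norm."""
    update = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, update)])

def zero_sublayer(v):
    """Hypothetical sublayer that writes nothing, to isolate the residual path."""
    return [0.0] * len(v)

x = [1.0, 2.0, 4.0]
pre_out = pre_norm_step(x, zero_sublayer)    # x passes through unchanged
post_out = post_norm_step(x, zero_sublayer)  # x is rescaled even though the sublayer wrote nothing
```

In the pre-norm form the input reaches the output untouched whenever a sublayer contributes nothing, which is one common way to explain why stacking many such blocks is easier to train.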
The MLP / feedforward
The MLP is two linear layers with a nonlinearity in between, and a 4× expansion in the hidden dimension:
# from nanoGPT/model.py
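class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
        self.gelu = nn.GELU()
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        x = self.c_fc(x)     # expand to 4 * n_embd
        x = self.gelu(x)
        x = self.c_proj(x)   # project back to n_embd
        x = self.dropout(x)
        return x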
So the MLP holds ~8 * n_embd^2 parameters per block, compared to ~4 * n_embd^2 for attention (Q, K, V, and output projections).
Ignoring biases and LayerNorm, the MLP is where most of the parameters live: about 2/3 of every block. The GELU activation is GPT's choice; Llama uses SwiGLU instead, which adds a third linear layer to form a multiplicative gate.
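The arithmetic behind the 2/3 figure can be checked with a small sketch. It counts only the weight matrices (biases and LayerNorm parameters are ignored), and the helper name is hypothetical; 768 is GPT-2 small's embedding width, used here purely as an example.

```python
def block_matrix_params(n_embd):
    """Count weight-matrix parameters in one block, ignoring biases and LayerNorm."""
    attn = 4 * n_embd * n_embd                           # Q, K, V, and output projections
    mlp = n_embd * (4 * n_embd) + (4 * n_embd) * n_embd  # c_fc and c_proj
    return attn, mlp

attn, mlp = block_matrix_params(768)
mlp_share = mlp / (attn + mlp)

print(attn)       # 2359296
print(mlp)        # 4718592
print(mlp_share)  # about 0.667
```

Because both counts scale with n_embd^2, the ratio does not depend on the model width under these assumptions.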
What lives outside the block
Everything that isn't repeated n_layer times lives outside the block: token + position embeddings at the input, a final LayerNorm, and the unembedding (lm_head) at the output. In GPT-2 the embedding and unembedding share weights (weight tying).
# from nanoGPT/model.py
self.transformer = nn.ModuleDict(dict(
    wte = nn.Embedding(config.vocab_size, config.n_embd),
    wpe = nn.Embedding(config.block_size, config.n_embd),
    drop = nn.Dropout(config.dropout),
    h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
    ln_f = LayerNorm(config.n_embd, bias=config.bias),
))
self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
self.transformer.wte.weight = self.lm_head.weight  # weight tying
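To get a feel for what weight tying saves, the sketch below counts the parameters that live outside the blocks, with and without a shared embedding/unembedding matrix. The helper is hypothetical, and the sizes are GPT-2 small's (vocabulary 50257, context 1024, width 768), used here as an illustrative assumption.

```python
def outside_block_params(vocab_size, block_size, n_embd, tied=True, ln_bias=True):
    """Count parameters outside the repeated blocks."""
    wte = vocab_size * n_embd                     # token embedding table
    wpe = block_size * n_embd                     # learned position embedding table
    ln_f = n_embd * (2 if ln_bias else 1)         # final LayerNorm weight (+ optional bias)
    lm_head = 0 if tied else vocab_size * n_embd  # tied: reuses the wte matrix, no bias
    return wte + wpe + ln_f + lm_head

tied = outside_block_params(50257, 1024, 768, tied=True)
untied = outside_block_params(50257, 1024, 768, tied=False)

print(tied)           # 39385344
print(untied - tied)  # 38597376 parameters saved by tying
```

At this size the token embedding table dominates everything outside the blocks, which is why sharing it with lm_head is a meaningful saving.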
The residual stream as central data structure
The residual stream — the running x that gets added into through every block —
is the central data structure. Think of each block as "reading from" the residual stream
via the LayerNorm and "writing to" it via the residual add. Attention writes information
that mixes across positions; the MLP writes information that's computed per-position. The
whole transformer is n_layer rounds of read → process → add-back.
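The read → process → add-back loop can be sketched with toy stand-ins in plain Python. Everything here is hypothetical scaffolding: causal_mix is a crude substitute for attention (a causal running mean), per_position is a ReLU standing in for the MLP, and the norm has no learned parameters. The stream is a list of per-position vectors.

```python
import math

def layer_norm(x, eps=1e-5):
    """Per-position normalization: the 'read' from the residual stream."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def causal_mix(stream):
    """Toy attention: each position gets the mean of itself and all earlier positions."""
    out = []
    for t in range(len(stream)):
        visible = stream[: t + 1]
        out.append([sum(col) / len(visible) for col in zip(*visible)])
    return out

def per_position(stream):
    """Toy MLP: the same nonlinearity applied to every position independently."""
    return [[max(0.0, v) for v in vec] for vec in stream]

def add_back(stream, update):
    """The 'write': a plain elementwise add into the residual stream."""
    return [[a + b for a, b in zip(vec, upd)] for vec, upd in zip(stream, update)]

def run_blocks(stream, n_layer):
    for _ in range(n_layer):
        read = [layer_norm(vec) for vec in stream]
        stream = add_back(stream, causal_mix(read))    # communication across positions
        read = [layer_norm(vec) for vec in stream]
        stream = add_back(stream, per_position(read))  # computation per position
    return stream

stream = [[1.0, -2.0], [0.5, 3.0], [-1.0, 0.0]]
out = run_blocks(stream, n_layer=2)
```

The stream keeps the same shape through every round, and because the toy mixer is causal, changing a later position never alters what was written at earlier ones.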
Llama 2 variant
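In llama2.c/model.py, the block looks nearly identical, with three substitutions: RMSNorm in place of LayerNorm, a SwiGLU feedforward in place of the GELU MLP, and rotary position embeddings (RoPE) applied inside attention in place of learned position embeddings.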
These changes are independent of each other and orthogonal to the overall architecture. The "transformer block" abstraction is robust enough that you can swap parts in and out without changing anything else.
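As one illustration of swapping a part, the sketch below builds a pre-norm block from interchangeable pieces and exchanges LayerNorm for RMSNorm, the normalization used in Llama-style models. It is plain Python with hypothetical names (make_block, identity_mixer, relu_mlp), no learned parameters, and toy stand-ins for attention and the MLP; it is not how llama2.c structures its code.

```python
import math

def layer_norm(x, eps=1e-5):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-5):
    """Divide by the root mean square, without subtracting the mean."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def make_block(norm, mixer, mlp):
    """Build a pre-norm block over a list of position vectors from swappable parts."""
    def block(stream):
        update = mixer([norm(vec) for vec in stream])
        stream = [[a + b for a, b in zip(v, u)] for v, u in zip(stream, update)]
        update = [mlp(norm(vec)) for vec in stream]
        return [[a + b for a, b in zip(v, u)] for v, u in zip(stream, update)]
    return block

def identity_mixer(stream):
    """Hypothetical stand-in for attention that mixes nothing."""
    return stream

def relu_mlp(vec):
    """Hypothetical stand-in for the MLP."""
    return [max(0.0, v) for v in vec]

gpt_style = make_block(layer_norm, identity_mixer, relu_mlp)
llama_style = make_block(rms_norm, identity_mixer, relu_mlp)

stream = [[3.0, 4.0], [1.0, -1.0]]
gpt_out = gpt_style(stream)
llama_out = llama_style(stream)
```

Only the norm argument differs between the two blocks; the residual wiring in make_block is untouched, which is the sense in which the parts are independent.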
Related
- attention — the "communication" half
- residual-connections — why the + matters
- layernorm-vs-rmsnorm — what does normalization actually do here
- weight-init — special scaling for residual projections