GELU and SwiGLU
The MLP in a transformer block needs a nonlinearity sandwiched between its two linear layers. Without one, the two linears would collapse to a single linear and you'd lose the model's only per-position nonlinear capacity. The choice of nonlinearity is a small architecture decision with a real quality impact.
GELU: GPT's choice
GELU (Gaussian Error Linear Unit) is the activation used by BERT, GPT-2, GPT-3, and most other transformers trained between 2018 and 2022. The function:
GELU(x) = x * Phi(x)
where Phi is the standard Gaussian CDF. It looks like a smoothed
ReLU: slightly negative for small negative inputs (with a small gradient there),
close to zero for large negative inputs, and approximately linear
for large positive inputs.
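As a minimal sketch of the definition above, the exact GELU can be evaluated with only the standard library by writing Phi in terms of the error function. The function name gelu_exact is chosen here for illustration and does not come from any of the repos discussed.

```python
import math

def gelu_exact(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    phi_cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi_cdf

# Small negative inputs give small negative outputs; large positive
# inputs pass through almost unchanged.
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  GELU(x)={gelu_exact(x):+.4f}")
```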
build-nanogpt/train_gpt2.py uses GPT-2's specific GELU variant:
self.gelu = nn.GELU(approximate='tanh')
The approximate='tanh' flag selects the cheap approximation:
GELU(x) ≈ 0.5 * x * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x^3)))
This is what GPT-2 actually used. The exact erf-based GELU differs from it
only by a tiny amount, mostly at moderate |x|; the two agree closely near zero
and as |x| grows. makemore.py
defines this approximation explicitly as NewGELU and uses it
inside its Block:
class NewGELU(nn.Module):
    def forward(self, x):
        return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))
Compared with ReLU, which outputs zero with zero gradient for every
x < 0, GELU has a small negative dip and a smooth transition
through zero, giving nonzero gradients almost everywhere and slightly better
optimization dynamics.
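To make the gradient behaviour for negative inputs concrete, here is a small sketch that evaluates the analytic derivative of the exact GELU, Phi(x) + x * pdf(x), next to ReLU's derivative. The helper names are illustrative assumptions, and the value of ReLU's derivative at exactly x = 0 is a convention.

```python
import math

def gelu_grad(x: float) -> float:
    """d/dx [x * Phi(x)] = Phi(x) + x * pdf(x) for the standard normal."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf

def relu_grad(x: float) -> float:
    """ReLU derivative: 0 for x < 0, 1 for x > 0 (0 chosen at x == 0)."""
    return 1.0 if x > 0 else 0.0

for x in (-2.0, -1.0, -0.5, 0.0, 0.5):
    print(f"x={x:+.1f}  dGELU={gelu_grad(x):+.4f}  dReLU={relu_grad(x):+.1f}")
```

For negative inputs ReLU's gradient is exactly zero, while GELU's is small but nonzero. It changes sign near x ≈ -0.75, at the bottom of the dip, so it is zero only at that single point.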
nanoGPT/model.py uses the exact nn.GELU() (no
approximate flag), which computes the erf form. Training works fine either
way; build-nanogpt uses the tanh form to be more faithful to OpenAI's
original.
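One way to see why either form trains fine is to measure how far apart the two are. The following is a standard-library sketch, not code from either repo. The names gelu_exact and gelu_tanh and the sampling grid over [-6, 6] are assumptions made for illustration.

```python
import math

def gelu_exact(x: float) -> float:
    """Exact GELU via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    """Tanh approximation, matching the formula used by GPT-2."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Sample a grid of inputs and report the largest absolute gap.
xs = [i / 100.0 for i in range(-600, 601)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(f"max |exact - tanh| on [-6, 6]: {max_gap:.2e}")
```

On this grid the gap stays well under 1e-2, which is small compared with typical activation magnitudes.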
SwiGLU: Llama's choice
Llama replaces the GELU-MLP with SwiGLU — Swish (a.k.a. SiLU) combined with a Gated Linear Unit. The MLP becomes:
# from llama2.c/model.py
class FeedForward(nn.Module):
    def __init__(self, dim, hidden_dim, multiple_of, dropout):
        super().__init__()
        if hidden_dim is None:
            hidden_dim = 4 * dim
            hidden_dim = int(2 * hidden_dim / 3)
            hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.dropout(self.w2(F.silu(self.w1(x)) * self.w3(x)))
Three linears instead of two. The forward pass is:
gate = silu(w1(x)) # SiLU = x * sigmoid(x), similar to GELU
value = w3(x)
out = w2(gate * value)
So the "MLP" is w2(silu(w1(x)) * w3(x)). The
silu(w1(x)) * w3(x) is the gated linear unit —
silu(w1(x)) acts as a learnable gate that multiplicatively
modulates the projection w3(x). The intuition: instead of a
fixed nonlinearity, give the model a learned multiplicative gate.
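The gating can be traced by hand on a toy example. This is a dependency-free sketch of the same computation, w2(silu(w1(x)) * w3(x)), using plain lists in place of tensors. The helper names and the toy weights are hypothetical, and dropout is omitted.

```python
import math

def silu(h: float) -> float:
    """SiLU / Swish: h * sigmoid(h)."""
    return h / (1.0 + math.exp(-h))

def matvec(weight, vec):
    """Apply a bias-free linear layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def swiglu_mlp(x, w1, w3, w2):
    gate = [silu(h) for h in matvec(w1, x)]        # silu(w1(x))
    value = matvec(w3, x)                          # w3(x)
    hidden = [g * v for g, v in zip(gate, value)]  # elementwise gating
    return matvec(w2, hidden)                      # w2(gate * value)

# Hypothetical toy weights: dim = 2, hidden_dim = 3.
x = [1.0, -2.0]
w1 = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]]    # hidden_dim x dim
w3 = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]   # hidden_dim x dim
w2 = [[1.0, 1.0, 1.0], [0.0, 1.0, -1.0]]     # dim x hidden_dim
out = swiglu_mlp(x, w1, w3, w2)
print(out)
```

If w1 maps an input to zero, the gate silu(0) is zero and the corresponding value is suppressed no matter what w3 produces. That is the sense in which the gate modulates the projection.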
MLP shapes, side by side
GELU MLP (2 linears)
    x
    ↓ w1
    h = w1(x)
    ↓ GELU
    h' = GELU(h)
    ↓ w2
    out
Fixed nonlinearity between two linears. Hidden width 4 * dim.

SwiGLU MLP (3 linears)
    x
    ↓ w1                     ↓ w3
    gate = silu(w1(x))       value = w3(x)
              ↓ × (elementwise)
         gate * value
              ↓ w2
             out
Learned multiplicative gate. Hidden width (8/3) * dim, rounded.
The 2/3 hidden-dim trick
To keep the parameter count comparable between a 2-linear GELU MLP and a 3-linear SwiGLU MLP, Llama shrinks the hidden dimension by 2/3:
hidden_dim = 4 * dim
hidden_dim = int(2 * hidden_dim / 3) # shrink to keep param count similar
hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)
So instead of a 4 * dim hidden width, it's about
(8/3) * dim, rounded up to a multiple of multiple_of (for example 256) for hardware
efficiency. The total MLP parameter count is then
3 * dim * hidden_dim ≈ 3 * dim * (8/3) * dim = 8 * dim^2, roughly the same as the
GELU MLP's 2 * dim * (4 * dim) = 8 * dim^2 (ignoring biases).
GELU MLP: 2 linears × (dim × 4·dim) = 8·dim²
SwiGLU MLP: 3 linears × (dim × (8/3)·dim) ≈ 8·dim²
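The arithmetic can be checked directly. This sketch reuses the three hidden_dim lines from the FeedForward constructor and counts weights only (no biases). The function names and the example values dim = 4096 and multiple_of = 256 are assumptions chosen for illustration.

```python
def swiglu_hidden_dim(dim: int, multiple_of: int) -> int:
    """Hidden width after the 2/3 shrink and round-up."""
    hidden_dim = 4 * dim
    hidden_dim = int(2 * hidden_dim / 3)
    return multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)

def gelu_mlp_params(dim: int) -> int:
    """Two linears of shape dim x 4*dim (weights only)."""
    return 2 * dim * (4 * dim)

def swiglu_mlp_params(dim: int, multiple_of: int) -> int:
    """Three linears of shape dim x hidden_dim (weights only)."""
    return 3 * dim * swiglu_hidden_dim(dim, multiple_of)

dim, multiple_of = 4096, 256  # illustrative values
hidden = swiglu_hidden_dim(dim, multiple_of)
gelu_count = gelu_mlp_params(dim)
swiglu_count = swiglu_mlp_params(dim, multiple_of)
print(hidden)                      # 11008
print(gelu_count, swiglu_count)    # 134217728 135266304
print(swiglu_count / gelu_count)   # 1.0078125
```

The round-up to a multiple of multiple_of is why the SwiGLU count lands slightly above 8 * dim^2 rather than exactly on it.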
In like-for-like comparisons, SwiGLU gives a small but real quality improvement (Shazeer 2020, "GLU Variants Improve Transformer").
Which one matters?
In practice the activation choice is a 1-2% quality knob, well below the noise floor of most LLM evaluations. Most new models pick SwiGLU anyway because small wins compound and the cost is essentially zero. GPT-2 used GELU because SwiGLU had not been published yet; had it been available, GPT-2 would likely have used it.
Related
- transformer-block — where the MLP sits
- repos/nanoGPT — uses GELU
- repos/llama2-c — uses SwiGLU
- repos/makemore — has the explicit NewGELU class