Architecture note

GELU and SwiGLU

The MLP in a transformer block needs a nonlinearity sandwiched between its two linear layers. Without one, the two linear layers would collapse into a single linear map, and the block would lose its main source of per-position nonlinear capacity. The choice of nonlinearity is a small architecture decision with a real quality impact.
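The collapse is easy to check by hand. A minimal sketch in plain Python, with made-up matrices W1, W2 and input x (none of these values come from the repos discussed here): applying two linear maps back to back gives the same result as applying their product once.

```python
# Sketch: two stacked linear maps with no activation between them
# behave exactly like one linear map. All values are illustrative.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # "up" projection, 2 -> 3
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]]     # "down" projection, 3 -> 2
x = [0.5, -2.0]

two_layers = matvec(W2, matvec(W1, x))       # W2 (W1 x)
one_layer = matvec(matmul(W2, W1), x)        # (W2 W1) x

print(two_layers, one_layer)
```

Putting any elementwise nonlinearity between W1 and W2 breaks this identity, which is the whole point of the activation.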

GELU: GPT's choice

GELU (Gaussian Error Linear Unit) is the activation used by BERT, GPT-2, GPT-3, and most other transformers trained between 2018 and 2022. The function:

GELU(x) = x * Phi(x)

where Phi is the standard Gaussian CDF. It looks like a smoothed ReLU: slightly negative for moderately negative inputs (with a small gradient there), close to zero for very negative inputs, and approximately linear for large positive inputs.
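To make the definition concrete, here is a minimal sketch of the exact form in plain Python, writing Phi in terms of math.erf. The name gelu_exact is illustrative, not taken from any of the repos.

```python
import math

def gelu_exact(x: float) -> float:
    """GELU(x) = x * Phi(x), with Phi the standard Gaussian CDF."""
    phi_cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi_cdf

# Negative dip for moderately negative x, ~0 far left, ~x far right.
for x in (-3.0, -1.0, -0.5, 0.0, 1.0, 3.0):
    print(f"GELU({x:+.1f}) = {gelu_exact(x):+.4f}")
```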

build-nanogpt/train_gpt2.py uses GPT-2's specific GELU variant:

self.gelu = nn.GELU(approximate='tanh')

The approximate='tanh' flag selects the cheap approximation:

GELU(x) ≈ 0.5 * x * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x^3)))

This is what GPT-2 actually used. The exact erf-based GELU differs from it only slightly, with the largest gap at moderate |x|; the two agree at 0 and converge again for large |x|. makemore.py defines this approximation explicitly as NewGELU and uses it inside its Block:

import math
import torch
import torch.nn as nn

class NewGELU(nn.Module):
    def forward(self, x):
        return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))

Intuition. ReLU's hard zero kills gradients for x < 0. GELU has a small negative dip and a smooth transition through zero, so its gradient is nonzero for almost every input (it still decays toward zero for very negative x), which gives slightly better optimization dynamics.
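The gradient picture can be checked numerically. A minimal sketch in plain Python, using the closed-form derivative GELU'(x) = Phi(x) + x * phi(x), where phi is the Gaussian density. The helper names relu_grad and gelu_grad are illustrative. For exact GELU the gradient is nonzero at almost every input; it crosses zero only at the bottom of the dip, near x ≈ -0.75.

```python
import math

def relu_grad(x: float) -> float:
    """Derivative of ReLU: hard 0 on the negative side."""
    return 1.0 if x > 0 else 0.0

def gelu_grad(x: float) -> float:
    """Derivative of exact GELU: Phi(x) + x * phi(x)."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf

for x in (-3.0, -1.0, -0.1, 0.0, 0.1, 1.0):
    print(f"x={x:+.1f}  relu'={relu_grad(x):.1f}  gelu'={gelu_grad(x):+.4f}")
```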

nanoGPT/model.py uses the exact nn.GELU() (no approximate flag), which computes the erf form. Training works fine either way; build-nanogpt uses the tanh form to stay faithful to OpenAI's original.
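How different are the two forms? A rough check in plain Python, scanning a grid over [-6, 6]. The grid, the range, and the function names are arbitrary choices for illustration. On this grid the gap stays below 1e-3 and peaks at moderate |x|, which fits with either form training fine.

```python
import math

def gelu_exact(x: float) -> float:
    """Exact GELU via the error function."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    """Tanh approximation, the form GPT-2 used."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

def gap(x: float) -> float:
    """Absolute difference between the two forms at x."""
    return abs(gelu_exact(x) - gelu_tanh(x))

xs = [i / 100.0 for i in range(-600, 601)]   # grid over [-6, 6], step 0.01
worst_x = max(xs, key=gap)
max_gap = gap(worst_x)
print(f"largest gap on the grid: {max_gap:.2e} at x = {worst_x:+.2f}")
```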

SwiGLU: Llama's choice

Llama replaces the GELU-MLP with SwiGLU — Swish (a.k.a. SiLU) combined with a Gated Linear Unit. The MLP becomes:

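# from llama2.c/model.py (imports added so the excerpt runs on its own)
import torch.nn as nn
import torch.nn.functional as F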
class FeedForward(nn.Module):
    def __init__(self, dim, hidden_dim, multiple_of, dropout):
        super().__init__()
        if hidden_dim is None:
            hidden_dim = 4 * dim
            hidden_dim = int(2 * hidden_dim / 3)
            hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.dropout(self.w2(F.silu(self.w1(x)) * self.w3(x)))

Three linears instead of two. The forward pass is:

gate = silu(w1(x))   # SiLU = x * sigmoid(x), similar to GELU
value = w3(x)
out = w2(gate * value)

So the "MLP" is w2(silu(w1(x)) * w3(x)). The silu(w1(x)) * w3(x) is the gated linear unit — silu(w1(x)) acts as a learnable gate that multiplicatively modulates the projection w3(x). The intuition: instead of a fixed nonlinearity, give the model a learned multiplicative gate.
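To see the gating at work without PyTorch, here is a toy version in plain Python with dim = 2 and hidden width 3. The weights and input are made up for illustration, and swiglu_mlp is a hypothetical helper rather than code from llama2.c.

```python
import math

def silu(z: float) -> float:
    """SiLU / Swish: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def swiglu_mlp(x, W1, W2, W3):
    """Compute w2(silu(w1(x)) * w3(x)) for a single position."""
    gate = [silu(h) for h in matvec(W1, x)]        # learned gate
    value = matvec(W3, x)                          # linear "value" path
    hidden = [g * v for g, v in zip(gate, value)]  # elementwise product
    return matvec(W2, hidden)

# Illustrative weights: dim = 2, hidden_dim = 3, no biases.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W3 = [[1.0, 1.0], [1.0, -1.0], [0.0, 2.0]]
W2 = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
x = [1.0, -1.0]

out = swiglu_mlp(x, W1, W2, W3)
print(out)
```

With this input, the first hidden unit has value 0 and the third has gate 0, so only the second hidden unit reaches the output. Each path can switch the other off.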

MLP shapes, side by side

GELU MLP (2 linears)

x
↓ w1
h = w1(x)
↓ GELU
h' = GELU(h)
↓ w2
out

Fixed nonlinearity between two linears. Hidden width 4 * dim.

SwiGLU MLP (3 linears)

x
↓ w1   ↓ w3
gate = silu(w1(x))   value = w3(x)
↓ × (elementwise)
gate * value
↓ w2
out

Learned multiplicative gate. Hidden width (8/3) * dim, rounded.

The 2/3 hidden-dim trick

To keep the parameter count comparable between a 2-linear GELU MLP and a 3-linear SwiGLU MLP, Llama shrinks the hidden dimension by 2/3:

hidden_dim = 4 * dim
hidden_dim = int(2 * hidden_dim / 3)   # shrink to keep param count similar
hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)

So instead of a hidden width of 4 * dim, it's (8/3) * dim, rounded up to a multiple of multiple_of (256 here) for hardware efficiency. The total MLP parameter count, 3 * dim * hidden_dim, then comes to roughly 3 * dim * (8/3) * dim = 8 * dim^2, the same as the GELU MLP's 2 * dim * (4 * dim) = 8 * dim^2.

GELU MLP

2 linears × (dim × 4·dim)

≈ 2 · dim · (4 · dim) = 8 · dim²

SwiGLU MLP

3 linears × (dim × (8/3)·dim)

≈ 3 · dim · ((8/3) · dim) = 8 · dim²
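The arithmetic is quick to verify. A small sketch in plain Python that reuses the three hidden_dim lines above and counts weights, ignoring biases. The helper names, and the choice of dim = 768 and dim = 4096 as test sizes, are assumptions for illustration.

```python
def swiglu_hidden_dim(dim: int, multiple_of: int = 256) -> int:
    """Hidden width after the 2/3 shrink and round-up to multiple_of."""
    hidden_dim = 4 * dim
    hidden_dim = int(2 * hidden_dim / 3)
    return multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)

def gelu_mlp_params(dim: int) -> int:
    """Two bias-free linears: dim -> 4*dim and 4*dim -> dim."""
    return 2 * dim * (4 * dim)

def swiglu_mlp_params(dim: int, multiple_of: int = 256) -> int:
    """Three bias-free linears, each dim x hidden_dim."""
    return 3 * dim * swiglu_hidden_dim(dim, multiple_of)

for dim in (768, 4096):
    h = swiglu_hidden_dim(dim)
    ratio = swiglu_mlp_params(dim) / gelu_mlp_params(dim)
    print(f"dim={dim}: hidden_dim={h}, SwiGLU/GELU params = {ratio:.4f}")
```

The round-up means the match is only approximate in general: the SwiGLU MLP can come out slightly larger than 8 * dim^2, never smaller than the unrounded (8/3) * dim width would give.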

In like-for-like comparisons, SwiGLU gives a small but real quality improvement (Shazeer 2020, "GLU Variants Improve Transformer").

Which one matters?

In practice the activation choice is roughly a 1-2% quality knob, often within the noise of individual LLM evaluations. Most new models pick SwiGLU anyway because small wins compound and the cost is essentially zero. GPT-2 used GELU because SwiGLU hadn't been published yet; had it been available, GPT-2 might well have used it.
