Transformer internals

Self-Attention

Attention is the operator that lets a transformer mix information across positions in a sequence. It's the one piece of the architecture that is genuinely sequence-aware — everything else (the MLP, LayerNorm, embeddings) operates on each token independently.

Karpathy builds up to it carefully in lecture 7, "Let's build GPT", and the same code shows up in slightly different forms across the repos.

The "weighted communication" framing

Karpathy's pedagogical trick is to introduce attention as a generalization of the "previous tokens' average" trick. If you start with a tensor x of shape (B, T, C) — batch, time, channels — the simplest way to give each position information about its past is to average the previous tokens. You can express that average as a matrix multiplication: build a lower-triangular matrix of ones, normalize each row to sum to 1, and matmul against x.

That's a causal bag-of-words, and sources/karpathy-repos/makemore/makemore.py includes an explicit CausalBoW class that does exactly this, "for no apparent reason at all," as the comment winks.
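The averaging trick can be sketched in a few lines of PyTorch. This is a minimal illustration, not the makemore code: the toy sizes and the variable names (xbow, xbow_loop) are assumptions chosen for the example.

```python
import torch

torch.manual_seed(0)
B, T, C = 2, 4, 3                      # batch, time, channels (toy sizes)
x = torch.randn(B, T, C)

# Lower-triangular ones: row t has ones at positions 0..t.
tril = torch.tril(torch.ones(T, T))
# Normalize each row to sum to 1, so row t averages positions 0..t.
wei = tril / tril.sum(dim=1, keepdim=True)

# (T, T) @ (B, T, C) broadcasts over the batch -> (B, T, C)
xbow = wei @ x

# Reference: the same average computed with an explicit loop over positions.
xbow_loop = torch.stack(
    [x[:, : t + 1].mean(dim=1) for t in range(T)], dim=1
)
```

Both xbow and xbow_loop hold, at position t, the mean of x over positions 0 through t; the matmul version simply does it in one shot.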

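The leap. Self-attention has the same structure, a row-normalized lower-triangular weight matrix multiplied against the sequence, except that the weights are content-dependent: they are computed from the tokens themselves through learned projections.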

Q, K, V

Every token emits three vectors via three linear projections:

Q (Query): q = W_q @ x, "what am I looking for?"
K (Key): k = W_k @ x, "what do I contain?"
V (Value): v = W_v @ x, "what do I broadcast if you attend to me?"

Attention scores are q @ k.T. High score = high alignment between what one token wants and what another offers. The scores are divided by sqrt(head_dim) to keep the softmax in a reasonable regime — at high dimension, raw dot products have variance that grows with head_dim, which would saturate the softmax and kill gradients. Karpathy explains this scaling carefully in the GPT video.
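The variance argument is easy to check numerically. The sketch below assumes q and k entries are independent with unit variance (an idealization of what happens at initialization); the sample count and head_dim of 64 are arbitrary choices for the demo.

```python
import torch

torch.manual_seed(0)
n, head_dim = 10_000, 64               # sample count and head size are arbitrary choices
q = torch.randn(n, head_dim)           # unit-variance entries
k = torch.randn(n, head_dim)

raw = (q * k).sum(dim=-1)              # n independent dot products
scaled = raw * head_dim ** -0.5        # the 1/sqrt(head_dim) scaling

raw_var = raw.var().item()             # close to head_dim
scaled_var = scaled.var().item()       # close to 1
```

Without the scaling, the scores entering the softmax have a standard deviation of roughly sqrt(head_dim), so a few entries dominate each row; with it, the spread stays near 1 regardless of head size.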

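# from ng-video-lecture/gpt.py (excerpt from the attention head's forward pass)
# q, k, v: (B, T, head_size); tril: lower-triangular matrix of ones
wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5    # scaled scores, (B, T, T)
wei = wei.masked_fill(tril[:T, :T] == 0, float('-inf'))    # hide future positions
wei = F.softmax(wei, dim=-1)    # each row sums to 1
out = wei @ v    # weighted sum of values, (B, T, head_size)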
The four-line core of causal self-attention.
q @ k.T → scale by 1/√d → causal mask → softmax → @ v
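To run those four lines end to end, they need projections and a mask around them. The module below is a self-contained sketch modeled on that excerpt; the class name, constructor arguments, and toy sizes are assumptions for illustration rather than a copy of the lecture code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One head of causal self-attention (illustrative sketch)."""

    def __init__(self, n_embd: int, head_size: int, block_size: int):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Not a parameter: a fixed lower-triangular mask of ones.
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.query(x), self.key(x), self.value(x)    # each (B, T, head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5     # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)                            # rows sum to 1
        return wei @ v                                          # (B, T, head_size)

torch.manual_seed(0)
head = Head(n_embd=32, head_size=16, block_size=8)
x = torch.randn(4, 8, 32)
out = head(x)

# Causality check: perturbing the last token must not change earlier outputs.
x2 = x.clone()
x2[:, -1] += 1.0
out2 = head(x2)
```

Comparing out and out2 position by position is a cheap sanity test for any causal attention implementation: everything before the perturbed token should match.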

Causal masking

For autoregressive language modeling, a token at position t must not see positions > t — otherwise the model would just learn to copy the answer. The mask is a lower-triangular matrix of zeros and ones; positions above the diagonal get filled with -inf before the softmax, which sends them to 0 after.

1  −∞  −∞  −∞  −∞
1   1  −∞  −∞  −∞
1   1   1  −∞  −∞
1   1   1   1  −∞
1   1   1   1   1
Lower-triangular mask for T = 5. Cells above the diagonal become -inf before the softmax and 0 after.
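A small sketch makes the "-inf before, 0 after" behavior concrete. It uses all-zero scores (an assumption chosen so the result is easy to read): with no content signal, masked softmax collapses to the uniform average over the past.

```python
import torch
import torch.nn.functional as F

T = 5
tril = torch.tril(torch.ones(T, T))

# With all-zero scores, every allowed position gets equal weight.
scores = torch.zeros(T, T)
masked = scores.masked_fill(tril == 0, float("-inf"))
wei = F.softmax(masked, dim=-1)

# Row t is uniform over positions 0..t: the "average the past" matrix.
uniform = tril / tril.sum(dim=1, keepdim=True)
```

Here wei and uniform coincide, which is the link back to the causal bag-of-words: attention with uninformative scores is exactly the running average.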

Multi-head attention

A single attention "head" projects into a lower-dimensional space of size head_size (typically n_embd / n_head), does the QKV dance, and outputs head_size channels. Multi-head attention runs n_head of these in parallel, with independent projections, then concatenates the outputs back to n_embd channels and applies a final projection.

The intuition: different heads can learn to attend to different kinds of relationships (syntactic, positional, lexical). In practice you don't see clean specialization; the heads ensemble noisily.

In production code (build-nanogpt/train_gpt2.py), the three QKV projections are fused into one nn.Linear(n_embd, 3*n_embd) for efficiency. The fused output is then split into q, k, and v with .split(n_embd, dim=2), and each is reshaped (a view followed by a transpose) to (B, n_head, T, head_size) so the head dimension acts like a batch dimension during the matmul.
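The fused layout can be sketched as follows. This is an illustrative module written for this page, assuming a registered tril buffer for the mask and toy sizes; it follows the split-and-reshape pattern described above but should not be read as the train_gpt2.py source.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head causal attention with a fused QKV projection (sketch)."""

    def __init__(self, n_embd: int, n_head: int, block_size: int):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head, self.n_embd = n_head, n_embd
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)   # fused Q, K, V projection
        self.c_proj = nn.Linear(n_embd, n_embd)       # final output projection
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        hs = C // self.n_head
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)      # each (B, T, C)
        # (B, T, C) -> (B, n_head, T, hs): heads behave like a batch dimension.
        q = q.view(B, T, self.n_head, hs).transpose(1, 2)
        k = k.view(B, T, self.n_head, hs).transpose(1, 2)
        v = v.view(B, T, self.n_head, hs).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) * hs ** -0.5            # (B, n_head, T, T)
        att = att.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        y = att @ v                                             # (B, n_head, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, C)        # concatenate heads
        return self.c_proj(y)

torch.manual_seed(0)
attn = CausalSelfAttention(n_embd=48, n_head=4, block_size=16)
x = torch.randn(2, 8, 48)
y = attn(x)                                                     # same shape as x
```

One matmul for all of Q, K, and V means one kernel launch instead of three, and the transpose puts n_head next to B so a single batched matmul covers every head.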

Flash attention

F.scaled_dot_product_attention (PyTorch ≥ 2.0) computes the same math, and on supported GPUs it can dispatch to a FlashAttention kernel that tiles the computation to keep the (B, n_head, T, T) matrix out of HBM. The naive implementation allocates that matrix; Flash attention fuses the matmul, mask, softmax, and second matmul into one kernel that streams tiles through SRAM. For long contexts this can be the difference between training and running out of memory.
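A back-of-the-envelope calculation shows why the T×T matrix is the problem. The helper below is hypothetical, and the configuration (12 heads, fp32, batch size 1) is an assumption picked for round numbers; it counts one attention matrix for one layer in the forward pass only.

```python
def attn_matrix_bytes(B: int, n_head: int, T: int, bytes_per_el: int = 4) -> int:
    """Bytes needed to materialize one (B, n_head, T, T) attention matrix."""
    return B * n_head * T * T * bytes_per_el

MiB, GiB = 2**20, 2**30
small = attn_matrix_bytes(B=1, n_head=12, T=1024)   # 48 MiB in fp32
large = attn_matrix_bytes(B=1, n_head=12, T=8192)   # 3 GiB in fp32
```

Going from T = 1024 to T = 8192 multiplies the cost by 64, since it grows with the square of the context length, and that is per layer and per sequence in the batch.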

naive: Materialize the T×T matrix. Allocates the (B, n_head, T, T) attention matrix in HBM.

flash: Stream tiles through SRAM. Fuses matmul, mask, softmax, and the second matmul into one kernel.
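The two paths can be compared directly. This sketch assumes PyTorch 2.0 or later and runs on CPU, where the fused call typically falls back to a non-flash backend; the point is only that the results agree, not that it demonstrates the memory savings.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, n_head, T, hs = 2, 4, 16, 8         # toy sizes
q = torch.randn(B, n_head, T, hs)
k = torch.randn(B, n_head, T, hs)
v = torch.randn(B, n_head, T, hs)

# Naive: materializes the full (B, n_head, T, T) matrix.
tril = torch.tril(torch.ones(T, T))
att = (q @ k.transpose(-2, -1)) * hs ** -0.5
att = att.masked_fill(tril == 0, float("-inf"))
y_naive = F.softmax(att, dim=-1) @ v

# Fused: PyTorch picks a backend; the default scale is 1/sqrt(hs).
y_fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```

The two outputs match up to floating-point rounding, so swapping the four-line core for the fused call is a drop-in change.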

The CUDA implementation in llm.c can use cuDNN's flash attention (an opt-in at build time via USE_CUDNN=1) and otherwise uses the manual implementation in llmc/attention.cuh, which does the QKV permutation and the matmuls explicitly.

RoPE: a different positional story

Llama 2 does attention with no learned positional embeddings: positions get baked into Q and K via Rotary Positional Embedding (RoPE) just before the matmul. GPT-2 instead uses an additive learned positional embedding before the first block. Both work. RoPE makes the attention scores depend on relative position, which tends to make extending the context easier than with a learned table capped at the training length, and it is now the default in most new LLMs.

GPT-2: Additive learned positional embedding. Added to token embeddings before the first block.

Llama 2: RoPE baked into Q and K. No learned positional embeddings; rotation applied just before the matmul.
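A minimal RoPE sketch shows what "baked into Q and K" means. The function below is hypothetical: the pairing of adjacent channels and the base of 10000 follow a common convention but are assumptions here, and real implementations differ in layout and caching.

```python
import torch

def rope(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate adjacent channel pairs of x (..., T, hs) by position-dependent angles."""
    hs = x.shape[-1]
    assert hs % 2 == 0
    # One frequency per channel pair: base^(-2i/hs) for i = 0 .. hs/2 - 1.
    inv_freq = base ** (-torch.arange(0, hs, 2, dtype=torch.float32) / hs)
    angles = positions[:, None].float() * inv_freq[None, :]    # (T, hs/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                         # even / odd channels
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin                        # 2D rotation per pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

torch.manual_seed(0)
T, hs = 6, 8
q, k = torch.randn(T, hs), torch.randn(T, hs)
pos = torch.arange(T)

# Scores depend only on relative offsets: shifting every position leaves them unchanged.
scores = rope(q, pos) @ rope(k, pos).T
scores_shifted = rope(q, pos + 7) @ rope(k, pos + 7).T
```

Because each pair is rotated, not shifted, the vector norms are preserved and the dot product between a rotated query and a rotated key depends only on the distance between their positions.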
