Self-Attention
Attention is the operator that lets a transformer mix information across positions in a sequence. It's the one piece of the architecture that is genuinely sequence-aware — everything else (the MLP, LayerNorm, embeddings) operates on each token independently.
Karpathy builds up to it carefully in lecture 7, "Let's build GPT", and the same code shows up in slightly different forms across the repos.
The "weighted communication" framing
Karpathy's pedagogical trick is to introduce attention as a generalization
of the "previous tokens' average" trick. If you start with a tensor
x of shape (B, T, C) — batch, time, channels —
the simplest way to give each position information about its past is to
average the previous tokens. You can express that average as a matrix
multiplication: build a lower-triangular matrix of ones, normalize each row
to sum to 1, and matmul against x.
That's a causal bag-of-words, and
sources/karpathy-repos/makemore/makemore.py includes an
explicit CausalBoW class that does exactly this,
"for no apparent reason at all," as the comment winks.
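The averaging trick can be sketched in a few lines. This is a minimal illustration, not the CausalBoW class itself; the shapes (B, T, C) = (2, 4, 3) and the names xbow and xbow_loop are assumptions chosen for the example.

```python
import torch

torch.manual_seed(0)
B, T, C = 2, 4, 3                      # illustrative sizes (assumed)
x = torch.randn(B, T, C)

# Lower-triangular matrix of ones, each row normalized to sum to 1.
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(dim=1, keepdim=True)

# (T, T) @ (B, T, C) broadcasts over the batch -> (B, T, C).
# Row t of the result is the mean of x[:, 0..t].
xbow = wei @ x

# Reference: the same average computed with an explicit loop.
xbow_loop = torch.zeros_like(x)
for t in range(T):
    xbow_loop[:, t] = x[:, : t + 1].mean(dim=1)

print(torch.allclose(xbow, xbow_loop, atol=1e-6))
```

Attention keeps the matmul and replaces the fixed uniform weights with data-dependent ones.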
Q, K, V
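Every token emits three vectors via three linear projections: a query (what it is looking for), a key (what it offers), and a value (what it passes on to the tokens that attend to it).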
Attention scores are q @ k.T. High score = high alignment
between what one token wants and what another offers. The scores are
divided by sqrt(head_dim) to keep the softmax in a reasonable
regime — at high dimension, raw dot products have variance that grows with
head_dim, which would saturate the softmax and kill gradients.
Karpathy explains this scaling carefully in the GPT video.
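The variance argument is easy to check numerically. The sketch below assumes unit-variance random q and k with head_size = 64 (both choices are illustrative, not taken from the repos) and compares raw and scaled scores.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
head_size = 64                          # illustrative value (assumed)
q = torch.randn(1000, head_size)        # unit-variance queries
k = torch.randn(1000, head_size)        # unit-variance keys

raw = q @ k.T                           # variance grows like head_size
scaled = raw * head_size ** -0.5        # variance brought back to about 1

raw_var = raw.var().item()
scaled_var = scaled.var().item()

# Average largest softmax weight per row: a rough measure of saturation.
peak_raw = F.softmax(raw, dim=-1).max(dim=-1).values.mean().item()
peak_scaled = F.softmax(scaled, dim=-1).max(dim=-1).values.mean().item()

print(f"variance: raw={raw_var:.1f}, scaled={scaled_var:.2f}")
print(f"mean peak softmax weight: raw={peak_raw:.3f}, scaled={peak_scaled:.3f}")
```

Without the scaling, each softmax row concentrates most of its mass on a handful of keys, which is the saturation the text describes.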
# from ng-video-lecture/gpt.py
wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, T)
wei = wei.masked_fill(tril[:T, :T] == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
out = wei @ v
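To run the excerpt on its own, q, k, v and tril need definitions. The sketch below supplies them with bias-free linear projections; the sizes (B, T, C, head_size) and the random input are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C, head_size = 2, 8, 32, 16       # illustrative sizes (assumed)
x = torch.randn(B, T, C)

# Three independent linear projections of the same input.
query = nn.Linear(C, head_size, bias=False)
key = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
tril = torch.tril(torch.ones(T, T))

q, k, v = query(x), key(x), value(x)    # each (B, T, head_size)

wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5      # (B, T, T)
wei = wei.masked_fill(tril[:T, :T] == 0, float("-inf"))  # hide the future
wei = F.softmax(wei, dim=-1)                             # rows sum to 1
out = wei @ v                                            # (B, T, head_size)

print(out.shape)
```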
Causal masking
For autoregressive language modeling, a token at position t
must not see positions > t — otherwise the model would just
learn to copy the answer. The mask is a lower-triangular matrix of zeros
and ones; positions above the diagonal get filled with -inf
before the softmax, which sends them to 0 after.
Figure: causal mask for T = 5. Cells above the diagonal become -inf before the softmax and 0 after.
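The T = 5 case can be reproduced directly. As a simplifying assumption, the sketch uses all-zero scores so that only the effect of the mask is visible; with real scores the allowed entries would differ, but the zeros above the diagonal would not.

```python
import torch
import torch.nn.functional as F

T = 5
scores = torch.zeros(T, T)              # placeholder scores (assumed all equal)
tril = torch.tril(torch.ones(T, T))     # ones on and below the diagonal

# Positions above the diagonal (the future) are set to -inf ...
masked = scores.masked_fill(tril == 0, float("-inf"))
# ... so the softmax assigns them exactly zero weight.
wei = F.softmax(masked, dim=-1)

print(masked)
print(wei)
```

With equal scores, row t spreads its weight uniformly over positions 0..t, so the result is the same row-normalized triangular matrix used for the causal average.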
Multi-head attention
A single attention "head" projects into a lower-dimensional space of size
head_size (typically n_embd / n_head), does the QKV dance, and outputs
head_size channels. Multi-head attention runs
n_head of these in parallel, with independent projections,
then concatenates the outputs back to n_embd channels and applies a final projection.
The intuition: different heads can learn to attend to different kinds of relationships (syntactic, positional, lexical). In practice you don't see clean specialization; the heads ensemble noisily.
In production code (build-nanogpt/train_gpt2.py), the three
QKV projections are fused into one
nn.Linear(n_embd, 3*n_embd) for efficiency. The QKV is then
split with .split(n_embd, dim=2) and reshaped to
(B, n_head, T, head_size) so the head dimension acts like a
batch dimension during the matmul.
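The fused projection and reshape can be sketched as follows. This is an illustration of the pattern described above, not a copy of train_gpt2.py: the function name fused_causal_attention, the sizes, and the explicit (non-flash) attention math are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fused_causal_attention(x, c_attn, c_proj, n_head):
    """Multi-head causal attention with one fused QKV projection (sketch)."""
    B, T, n_embd = x.shape
    head_size = n_embd // n_head
    # One matmul produces Q, K and V; split them along the channel dim.
    q, k, v = c_attn(x).split(n_embd, dim=2)             # each (B, T, n_embd)
    # Move heads next to the batch dim: (B, n_head, T, head_size).
    q = q.view(B, T, n_head, head_size).transpose(1, 2)
    k = k.view(B, T, n_head, head_size).transpose(1, 2)
    v = v.view(B, T, n_head, head_size).transpose(1, 2)
    att = (q @ k.transpose(-2, -1)) * head_size ** -0.5  # (B, n_head, T, T)
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
    att = att.masked_fill(~mask, float("-inf"))
    att = F.softmax(att, dim=-1)
    y = att @ v                                          # (B, n_head, T, head_size)
    # Undo the head split: this is the "concatenate the heads" step.
    y = y.transpose(1, 2).contiguous().view(B, T, n_embd)
    return c_proj(y)

torch.manual_seed(0)
B, T, n_embd, n_head = 2, 6, 32, 4      # illustrative sizes (assumed)
c_attn = nn.Linear(n_embd, 3 * n_embd)  # fused Q, K, V projection
c_proj = nn.Linear(n_embd, n_embd)      # final output projection
x = torch.randn(B, T, n_embd)
out = fused_causal_attention(x, c_attn, c_proj, n_head)
print(out.shape)
```

Because the heads sit in a batch-like dimension, a single batched matmul handles all of them at once, and the final transpose and view play the role of concatenation.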
Flash attention
F.scaled_dot_product_attention (PyTorch ≥ 2.0) computes the same
math but, when a fused flash-attention backend is available, tiles the
computation to keep the (B, n_head, T, T) matrix out of
HBM. The naive implementation allocates that matrix; flash attention fuses
the matmul, mask, softmax, and second matmul into one kernel that streams
tiles through SRAM. For long contexts this can be the difference between
training and OOM.
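Some back-of-the-envelope arithmetic shows why the materialized matrix hurts. The helper attn_matrix_bytes is hypothetical, and the settings (batch size 1, 12 heads, 2-byte elements, one layer, forward pass only) are assumptions picked for illustration.

```python
def attn_matrix_bytes(B, n_head, T, bytes_per_elem=2):
    """Size of one (B, n_head, T, T) attention matrix in bytes."""
    return B * n_head * T * T * bytes_per_elem

# Assumed setup: B=1, n_head=12, 16-bit elements, a single layer.
for T in (1024, 8192, 32768):
    gib = attn_matrix_bytes(1, 12, T) / 2**30
    print(f"T={T:>6}: {gib:.3f} GiB")
```

The cost is quadratic in T: going from 1024 to 8192 tokens multiplies the matrix size by 64, and that is before counting multiple layers or the copies kept for the backward pass.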
- Naive attention: materializes the T×T matrix, allocating the full (B, n_head, T, T) attention matrix in HBM.
- Flash attention: streams tiles through SRAM, fusing the matmul, mask, softmax, and the second matmul into one kernel.
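That the two paths compute the same thing can be checked numerically. The sketch below compares a hand-written masked softmax against F.scaled_dot_product_attention with is_causal=True; the tensor sizes are assumptions, and which backend PyTorch actually picks (flash or otherwise) depends on the device and build, so this only verifies the math, not the memory behavior.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, n_head, T, head_size = 2, 4, 16, 8   # illustrative sizes (assumed)
q, k, v = (torch.randn(B, n_head, T, head_size) for _ in range(3))

# Naive path: materializes the full (B, n_head, T, T) matrix.
att = (q @ k.transpose(-2, -1)) * head_size ** -0.5
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
att = att.masked_fill(~mask, float("-inf"))
naive = F.softmax(att, dim=-1) @ v

# Fused path: same math; the default scale is 1/sqrt(head_size).
fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(torch.allclose(naive, fused, atol=1e-5))
```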
The CUDA implementation in llm.c can use
cuDNN's flash attention (an opt-in build option) and otherwise runs the manual path in
llmc/attention.cuh, which does the QKV permutation, the matmuls, and the softmax
explicitly.
RoPE: a different positional story
Llama 2 does attention with no learned positional embeddings: positions get baked into Q and K via Rotary Positional Embedding (RoPE) just before the matmul. GPT-2 instead uses an additive learned positional embedding before the first block. Both work; RoPE encodes relative position directly in the attention scores, tends to extend better to contexts longer than those seen in training, and is now the default in most new LLMs.
- GPT-2: additive learned positional embedding, added to the token embeddings before the first block.
- Llama 2: RoPE baked into Q and K. No learned positional embeddings; the rotation is applied just before the matmul.
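A minimal RoPE sketch makes the "baked into Q and K" idea concrete. It follows the common formulation (rotate consecutive channel pairs by an angle proportional to position, with base 10000) and is not taken from the Llama 2 code; the function name rope and the sizes are assumptions. To isolate the positional effect, the same query and key vector are placed at every position.

```python
import torch

def rope(x, base=10000.0):
    """Rotate consecutive channel pairs of x (T, head_size) by position (sketch)."""
    T, head_size = x.shape
    assert head_size % 2 == 0
    # One rotation frequency per channel pair.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_size, 2, dtype=x.dtype) / head_size))
    angles = torch.arange(T, dtype=x.dtype)[:, None] * inv_freq[None, :]  # (T, head_size/2)
    cos, sin = angles.cos(), angles.sin()
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    # 2-D rotation of each (even, odd) pair, then interleave back.
    rotated = torch.stack((x_even * cos - x_odd * sin,
                           x_even * sin + x_odd * cos), dim=-1)
    return rotated.flatten(-2)          # (T, head_size)

torch.manual_seed(0)
T, head_size = 8, 16                    # illustrative sizes (assumed)
q0 = torch.randn(head_size, dtype=torch.float64)
k0 = torch.randn(head_size, dtype=torch.float64)

q = rope(q0.repeat(T, 1))               # same query content at every position
k = rope(k0.repeat(T, 1))               # same key content at every position
scores = q @ k.T                        # (T, T)

# Entries with the same offset i - j match: only relative position matters.
print(scores[3, 1].item(), scores[5, 3].item())
```

Since each pair is only rotated, vector norms are unchanged, and the score between positions i and j depends on i - j rather than on the absolute positions.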
Related
- transformer-block — what wraps attention
- rope — positional info inside attention, not outside
- kv-cache — what attention looks like at inference time
- mixed-precision-and-mfu — why flash attention matters for hardware utilization