RoPE: Rotary Positional Embedding
Rotary Positional Embedding (Su et al. 2021) is how Llama, Mistral, Gemma,
and most modern open-weight LLMs encode position. It's a complete
replacement for the learned positional embedding that
GPT-2 uses.
The llama2.c reimplementation of Llama 2 implements it explicitly in
llama2.c/model.py.
The problem RoPE solves
Without positional information, a transformer's attention operation has no notion of token order: if you shuffled the tokens in your input, each token would get exactly the same output, just moved to its new slot (the operation is permutation-equivariant). You have to inject positional information from somewhere, or the model can't tell "the cat sat on the mat" from "the mat sat on the cat."
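A minimal sketch of that claim, assuming unmasked single-head attention with identity projections (Q = K = V = the inputs) and made-up toy vectors. It checks that shuffling the tokens only shuffles the outputs, so nothing in the result reveals the original order.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(xs):
    """Unmasked single-head attention with identity projections (Q = K = V = xs)."""
    d = len(xs[0])
    out = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, xs)) for j in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 2.0], [1.5, -1.0]]  # three toy token vectors
perm = [2, 0, 1]                                 # a shuffle of the positions
shuffled = [tokens[p] for p in perm]

out = self_attention(tokens)
out_shuffled = self_attention(shuffled)

# Each token gets the same output vector wherever it sits in the sequence.
same = all(
    math.isclose(a, b, abs_tol=1e-9)
    for i, p in enumerate(perm)
    for a, b in zip(out_shuffled[i], out[p])
)
print(same)
```

A causal mask breaks this symmetry somewhat, but the scores themselves still carry no explicit position signal.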
GPT's solution: add a learned position embedding wpe[pos] to
the token embedding wte[tok] at the input. Simple and
effective, but two limitations:
Limitation 1 — Absolute, not relative
The model has to learn on its own that "two tokens apart" means the same thing at every absolute position. It works, but it's not parameter-efficient.
Limitation 2 — Doesn't extrapolate
The position embedding table has block_size rows. If you
try to use the model on longer contexts than training, you have no
embedding for those positions. You can interpolate, but it's hacky.
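To make the second limitation concrete, here is a toy sketch of the GPT-style lookup. The sizes and the random tables are invented for illustration; only the names wte, wpe, and block_size come from the text above.

```python
import random

block_size = 4   # hypothetical training context length
n_embd = 3       # hypothetical embedding width
vocab_size = 5   # hypothetical vocabulary size

random.seed(0)
# Stand-ins for the learned tables wte (token) and wpe (position).
wte = [[random.gauss(0, 0.02) for _ in range(n_embd)] for _ in range(vocab_size)]
wpe = [[random.gauss(0, 0.02) for _ in range(n_embd)] for _ in range(block_size)]

def embed(tok, pos):
    """GPT-style input embedding: wte[tok] + wpe[pos]."""
    return [t + p for t, p in zip(wte[tok], wpe[pos])]

x = embed(tok=2, pos=3)  # fine: position 3 is inside the table

try:
    embed(tok=2, pos=block_size)  # one past the end of the table
    out_of_range = False
except IndexError:
    out_of_range = True  # there is simply no row for this position
```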
RoPE: encode relative position by rotating the query and
key vectors in 2D subspaces by angles proportional to position. Then the
attention dot product q @ k.T naturally depends only on the
difference between positions.
The math, briefly
Treat the head dimension as pairs of features:
(x0, x1), (x2, x3), .... Each pair gets rotated by an angle
θ_i * pos, where
θ_i = 1 / (10000^(2i/d)) for the i-th pair — same base
frequency as sinusoidal positional embeddings, just used differently.
The rotation:
x'_0 = x_0 * cos(θ_i * pos) - x_1 * sin(θ_i * pos)
x'_1 = x_0 * sin(θ_i * pos) + x_1 * cos(θ_i * pos)
This is just a 2D rotation. Do it for every pair, get a rotated vector of the same shape. Apply to Q and K before the attention matmul.
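The rotation can be sketched in a few lines of plain Python. The helper names rope_freqs and rope_rotate are hypothetical, and the input vector is made up. This follows the formulas above rather than any particular library's implementation.

```python
import math

def rope_freqs(head_dim, base=10000.0):
    """theta_i = 1 / base^(2i / head_dim) for each of the head_dim // 2 pairs."""
    return [1.0 / base ** (2 * i / head_dim) for i in range(head_dim // 2)]

def rope_rotate(x, pos, base=10000.0):
    """Rotate consecutive pairs (x[2i], x[2i+1]) by the angle theta_i * pos."""
    out = list(x)
    for i, theta in enumerate(rope_freqs(len(x), base)):
        c, s = math.cos(theta * pos), math.sin(theta * pos)
        x0, x1 = x[2 * i], x[2 * i + 1]
        out[2 * i] = x0 * c - x1 * s
        out[2 * i + 1] = x0 * s + x1 * c
    return out

def norm(v):
    return math.sqrt(sum(a * a for a in v))

x = [1.0, 0.0, 0.5, -2.0]
rotated = rope_rotate(x, pos=7)

# A rotation changes direction but not length, and position 0 is a no-op.
print(norm(x), norm(rotated))
print(rope_rotate(x, pos=0))
```

Because each pair is only rotated, the vector keeps its norm. RoPE changes how Q and K line up with each other, not how large they are.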
For two positions p1 and p2, the dot product of rotated Q at p1 with
rotated K at p2 depends on position only through
cos((p1 - p2) * θ_i) and sin((p1 - p2) * θ_i):
purely relative.
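A quick numerical check of the relative-position property, using a hypothetical rope_rotate helper that follows the rotation formulas above and arbitrary toy vectors. Shifting both positions by the same amount should leave the score unchanged, and changing the offset should change it.

```python
import math

def rope_rotate(x, pos, base=10000.0):
    """Rotate pairs (x[2i], x[2i+1]) by pos / base^(2i/d)."""
    d = len(x)
    out = list(x)
    for i in range(d // 2):
        angle = pos / base ** (2 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        out[2 * i] = x[2 * i] * c - x[2 * i + 1] * s
        out[2 * i + 1] = x[2 * i] * s + x[2 * i + 1] * c
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.1, 0.4, -0.7, 0.9]

score_a = dot(rope_rotate(q, 5), rope_rotate(k, 2))      # positions 5 and 2
score_b = dot(rope_rotate(q, 105), rope_rotate(k, 102))  # same offset, shifted by 100
score_c = dot(rope_rotate(q, 5), rope_rotate(k, 3))      # different offset

print(math.isclose(score_a, score_b, abs_tol=1e-9))  # same offset, same score
print(math.isclose(score_a, score_c, abs_tol=1e-6))  # different offset, different score
```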
Llama 2's implementation
llama2.c/model.py precomputes the cos and sin tables for all
positions up to max_seq_len:
def precompute_freqs_cis(dim, end, theta=10000.0):
    # one frequency per feature pair: 1 / theta^(2i/dim)
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
    # positions 0 .. end-1
    t = torch.arange(end, device=freqs.device)
    # angle table of shape (end, dim // 2): angle[pos, i] = pos * freq_i
    freqs = torch.outer(t, freqs).float()
    freqs_cos = torch.cos(freqs)
    freqs_sin = torch.sin(freqs)
    return freqs_cos, freqs_sin
Then apply_rotary_emb applies them inside the attention block:
xq, xk = apply_rotary_emb(xq, xk, freqs_cos, freqs_sin)
Critical detail: RoPE is applied to Q and K only, not V.
The point is that the attention score q @ k.T is what should
depend on relative position. The value vectors v carry
content, and you want their interpretation to be position-independent so
they can be aggregated across positions cleanly.
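As a rough plain-Python analogue of those two steps (not the repo's code), the sketch below builds the cos and sin tables once and then rotates each position's query and key with its own table row. The names precompute_tables and apply_rotary, the sizes, and the toy vectors are all assumptions for illustration. The value vectors are never passed through the rotation.

```python
import math

def precompute_tables(head_dim, max_seq_len, base=10000.0):
    """cos/sin tables with max_seq_len rows and head_dim // 2 columns."""
    freqs = [1.0 / base ** (i / head_dim) for i in range(0, head_dim, 2)]
    cos = [[math.cos(t * f) for f in freqs] for t in range(max_seq_len)]
    sin = [[math.sin(t * f) for f in freqs] for t in range(max_seq_len)]
    return cos, sin

def apply_rotary(seq, cos, sin):
    """Rotate every vector in seq; seq[t] uses row t of the tables."""
    out = []
    for t, x in enumerate(seq):
        y = []
        for i in range(len(x) // 2):
            x0, x1 = x[2 * i], x[2 * i + 1]
            y.append(x0 * cos[t][i] - x1 * sin[t][i])
            y.append(x0 * sin[t][i] + x1 * cos[t][i])
        out.append(y)
    return out

cos, sin = precompute_tables(head_dim=4, max_seq_len=8)

xq = [[1.0, 0.0, 1.0, 0.0] for _ in range(3)]  # same toy query at positions 0, 1, 2
xk = [[0.0, 1.0, 0.0, 1.0] for _ in range(3)]  # same toy key at positions 0, 1, 2
xv = [[0.5, 0.5, 0.5, 0.5] for _ in range(3)]  # values: left untouched

xq_rot = apply_rotary(xq, cos, sin)
xk_rot = apply_rotary(xk, cos, sin)
# xv is used as-is in the weighted sum that follows the softmax.
```

Precomputing the tables trades a little memory for avoiding repeated trig calls on every forward pass.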
RoPE in C
In llama2.c/run.c, RoPE is just a loop over head pairs after
the QK projections:
// RoPE relative positional encoding: complex-valued rotate q and k in each head
for (int i = 0; i < dim; i+=2) {
    int head_dim = i % head_size;
    float freq = 1.0f / powf(10000.0f, head_dim / (float)head_size);
    float val = pos * freq;
    float fcr = cosf(val);
    float fci = sinf(val);
    int rotn = i < kv_dim ? 2 : 1;
    for (int v = 0; v < rotn; v++) {
        float* vec = v == 0 ? s->q : s->k;
        float v0 = vec[i];
        float v1 = vec[i+1];
        vec[i] = v0 * fcr - v1 * fci;
        vec[i+1] = v0 * fci + v1 * fcr;
    }
}
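The rotn variable accounts for grouped-query attention, where there may
be fewer K heads than Q heads, so kv_dim can be smaller than dim. While
i < kv_dim, both q and k are rotated (rotn = 2); past that point only q
is rotated (rotn = 1).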
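To see the grouped-query case in isolation, here is a Python port of the loop above. The function name rope_inplace and the sizes are hypothetical: two query heads share one key head, so q is twice as long as k.

```python
import math

def rope_inplace(q, k, pos, head_size):
    """Python port of the C loop: q has length dim, k has length kv_dim <= dim."""
    dim, kv_dim = len(q), len(k)
    for i in range(0, dim, 2):
        head_dim = i % head_size  # offset of this pair within its head
        freq = 1.0 / 10000.0 ** (head_dim / head_size)
        fcr, fci = math.cos(pos * freq), math.sin(pos * freq)
        vecs = (q, k) if i < kv_dim else (q,)  # rotn = 2 or rotn = 1
        for vec in vecs:
            v0, v1 = vec[i], vec[i + 1]
            vec[i] = v0 * fcr - v1 * fci
            vec[i + 1] = v0 * fci + v1 * fcr

# Hypothetical sizes: 2 query heads sharing 1 key head, head_size = 4.
q = [1.0, 0.0] * 4  # dim = 8
k = [1.0, 0.0] * 2  # kv_dim = 4
rope_inplace(q, k, pos=3, head_size=4)

# Pairs at i = 0 and i = 2 rotate both q and k; pairs at i = 4 and i = 6 exist only in q.
print(q)
print(k)
```

Without the i < kv_dim guard, the loop would index past the end of k once i reaches kv_dim.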
Length extrapolation
RoPE can be evaluated at positions beyond the training context: a model
trained on 2048 tokens has never seen pos = 5000, but the rotation
pattern is just continuous trig, so the angles are still well defined.
That does not mean quality holds up. In practice, naive RoPE
extrapolation degrades sharply past training context, but several tricks
(RoPE base scaling, YaRN, dynamic NTK) extend the effective context
window much further. Techniques of this kind are how Llama 2 7B got
extended to 32k context with relatively cheap fine-tuning.
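As a rough illustration of base scaling only, the sketch below raises the RoPE base so that the slowest-rotating pair reaches the same maximum angle at the longer context as it did at the training length. The rule base * s^(d / (d - 2)) is the one commonly cited for NTK-aware scaling and is treated here as an assumption. The context lengths and head size are made up, and YaRN and dynamic NTK involve more than this.

```python
import math

def rope_freqs(head_dim, base=10000.0):
    """theta_i = 1 / base^(2i / head_dim) for each feature pair."""
    return [1.0 / base ** (2 * i / head_dim) for i in range(head_dim // 2)]

head_dim = 8
train_len, target_len = 2048, 8192  # hypothetical context lengths
scale = target_len / train_len      # s = 4

# Assumed NTK-aware style rule: new_base = base * s^(d / (d - 2)).
new_base = 10000.0 * scale ** (head_dim / (head_dim - 2))

original = rope_freqs(head_dim)
scaled = rope_freqs(head_dim, base=new_base)

for i, (f0, f1) in enumerate(zip(original, scaled)):
    # Largest angle each pair reaches: original at train_len vs scaled at target_len.
    print(i, f0 * train_len, f1 * target_len)
```

The fastest pair (i = 0) is untouched, while the slowest pair is slowed by exactly the context ratio. That is the intuition behind stretching the context without retraining from scratch.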
Related
- attention: RoPE applies inside attention
- repos/llama2-c: full PyTorch + C implementation
- transformer-block: GPT-2's positional alternative