RoPE: Rotary Positional Embedding

Rotary Positional Embedding (Su et al. 2021) is how Llama, Mistral, Gemma, and most modern open-weight LLMs encode position. It replaces the learned positional embedding that GPT-2 uses. llama2.c, which implements the Llama 2 architecture, spells it out explicitly in model.py.

The problem RoPE solves

A transformer's attention operation is permutation-equivariant: if you shuffle the tokens in your input, each token's output is the same as before, just moved to its new slot. Nothing in the computation depends on where a token sits. You have to inject positional information from somewhere, or the model can't tell "the cat sat on the mat" from "the mat sat on the cat."
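A small sketch makes the point concrete. It runs single-head self-attention with no positional signal, using identity projections (q = k = v = x) as a simplifying assumption, and the toy vectors are made up for illustration. Shuffling the input only shuffles the output rows the same way.

```python
import math

def attention(x):
    """Single-head self-attention with identity projections (q = k = v = x)."""
    d = len(x[0])
    out = []
    for q in x:
        # Scaled dot-product scores of this query against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]
shuffled = [tokens[i] for i in perm]

out = attention(tokens)
out_shuffled = attention(shuffled)
# Row j of out_shuffled matches row perm[j] of out: same vectors, new order.
```

Because each token ends up with the same output vector wherever it is placed, word order is invisible to the model unless position is injected somewhere.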

GPT's solution: add a learned position embedding wpe[pos] to the token embedding wte[tok] at the input. Simple and effective, but two limitations:

Limitation 1 — Absolute, not relative

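Every position gets its own independent vector, so the model has to learn from data that "two tokens apart" means the same thing at positions 5 and 7 as it does at positions 500 and 502. It works, but it's not parameter-efficient.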

Limitation 2 — Doesn't extrapolate

The position embedding table has block_size rows. If you try to use the model on longer contexts than it was trained on, you have no embedding for those positions. You can interpolate the table to stretch it, but that's a hack the model was never trained for.
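Both limitations show up in a toy version of the GPT-style input layer. The sizes and random tables below are illustrative assumptions, not values from any real model; only the names wte, wpe, and block_size come from the text.

```python
import random

random.seed(0)
vocab_size, block_size, n_embd = 10, 8, 4  # toy sizes, chosen for illustration

# Learned tables (random here; in a real model these are trained parameters).
wte = [[random.gauss(0, 0.02) for _ in range(n_embd)] for _ in range(vocab_size)]
wpe = [[random.gauss(0, 0.02) for _ in range(n_embd)] for _ in range(block_size)]

def embed(tok, pos):
    """GPT-style input: token embedding plus absolute position embedding."""
    return [t + p for t, p in zip(wte[tok], wpe[pos])]

x = embed(tok=3, pos=7)  # fine: 7 is the last row of wpe

try:
    embed(tok=3, pos=block_size)  # one past the end of the table
    out_of_range = False
except IndexError:
    out_of_range = True  # there is simply no row for this position
```

Each row of wpe is a separate parameter vector, so nothing ties row 5 to row 6 until training does, and there is no row at all past block_size - 1.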

RoPE: encode relative position by rotating the query and key vectors in 2D subspaces by angles proportional to position. Then the attention dot product q @ k.T naturally depends only on the difference between positions.

The math, briefly

Treat the head dimension as pairs of features: (x0, x1), (x2, x3), .... Each pair gets rotated by an angle θ_i * pos, where θ_i = 1 / (10000^(2i/d)) for the i-th pair — same base frequency as sinusoidal positional embeddings, just used differently.
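As a quick sketch of the frequency schedule, the helper below (a hypothetical name, not from any library) evaluates θ_i for a small head dimension so the spread of rotation speeds is visible.

```python
def rope_frequencies(d, base=10000.0):
    """theta_i = 1 / base^(2i/d) for pair index i = 0 .. d/2 - 1."""
    return [1.0 / (base ** (2 * i / d)) for i in range(d // 2)]

thetas = rope_frequencies(8)
# For d = 8 this is roughly [1.0, 0.1, 0.01, 0.001]:
# the first pair turns 1 radian per position, the last only 0.001.
```

The first pairs rotate quickly and distinguish neighbouring tokens; the last pairs rotate slowly and only change noticeably over long distances.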

The rotation:

x'_0 = x_0 * cos(θ * pos) - x_1 * sin(θ * pos)
x'_1 = x_0 * sin(θ * pos) + x_1 * cos(θ * pos)

This is just a 2D rotation. Do it for every pair, get a rotated vector of the same shape. Apply to Q and K before the attention matmul.
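The rotation can be sketched in a few lines of plain Python. The function name rope_rotate and the sample vector are assumptions for illustration; the formula is the one given above.

```python
import math

def rope_rotate(x, pos, base=10000.0):
    """Rotate each consecutive pair (x[i], x[i+1]) by angle theta_i * pos."""
    d = len(x)
    out = [0.0] * d
    for i in range(0, d, 2):
        # i is twice the pair index, so base^(i/d) equals base^(2*pair/d).
        theta = 1.0 / (base ** (i / d))
        c, s = math.cos(theta * pos), math.sin(theta * pos)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out

def norm(v):
    return math.sqrt(sum(a * a for a in v))

x = [1.0, 0.0, 0.5, -2.0]
y = rope_rotate(x, pos=3)
# Same length and same norm as x: rotation changes direction, not magnitude.
```

At pos = 0 every angle is zero and the vector comes back unchanged, so the first token is left as is.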

Key property. For two positions p1 and p2, the dot product of rotated Q at p1 with rotated K at p2 depends on cos((p1 - p2) * θ_i) and sin((p1 - p2) * θ_i) — purely relative.
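The relative-position property is easy to check numerically. This sketch treats each feature pair as a complex number, so that rotating by an angle is multiplication by e^(i·angle); the helper names and test vectors are made up for illustration.

```python
import cmath

def rope_complex(x, pos, base=10000.0):
    """View each pair (x[2i], x[2i+1]) as a complex number and rotate it by theta_i * pos."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = 1.0 / (base ** (i / d))
        out.append(complex(x[i], x[i + 1]) * cmath.exp(1j * theta * pos))
    return out

def score(q, p_q, k, p_k):
    """Dot product of rotated q and rotated k, written as a sum of Re(q' * conj(k'))."""
    qr, kr = rope_complex(q, p_q), rope_complex(k, p_k)
    return sum((a * b.conjugate()).real for a, b in zip(qr, kr))

q = [0.3, -1.2, 0.7, 0.1]
k = [1.1, 0.4, -0.5, 0.9]

s_near = score(q, 5, k, 2)        # offset 3
s_far = score(q, 1005, k, 1002)   # same offset 3, both shifted by 1000
s_other = score(q, 5, k, 4)       # offset 1
# s_near and s_far agree up to floating-point error; s_other differs.
```

In the product q' · conj(k') the two rotations combine into a single factor e^(i·θ·(p1 − p2)), which is why only the offset survives.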

Llama 2's implementation

llama2.c/model.py precomputes the cos and sin tables for all positions up to max_seq_len:

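import torch

def precompute_freqs_cis(dim, end, theta=10000.0):
    # One frequency per feature pair: theta^(-2i/dim) for i = 0 .. dim/2 - 1
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
    # Positions 0 .. end-1
    t = torch.arange(end, device=freqs.device)
    # Angle table of shape (end, dim // 2): entry [pos, i] is pos * freq_i
    freqs = torch.outer(t, freqs).float()
    freqs_cos = torch.cos(freqs)
    freqs_sin = torch.sin(freqs)
    return freqs_cos, freqs_sin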

Then apply_rotary_emb applies them inside the attention block:

xq, xk = apply_rotary_emb(xq, xk, freqs_cos, freqs_sin)
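To see how the precomputed tables get used, here is a dependency-free analogue of the two steps. It is only a sketch: precompute_tables and apply_rotary are hypothetical stand-ins written with plain lists, not the PyTorch functions from model.py, and it assumes dim is the per-head dimension.

```python
import math

def precompute_tables(dim, end, theta=10000.0):
    """Cos/sin tables with `end` rows (positions) and dim // 2 columns (pairs)."""
    freqs = [1.0 / (theta ** (i / dim)) for i in range(0, dim, 2)]
    cos = [[math.cos(t * f) for f in freqs] for t in range(end)]
    sin = [[math.sin(t * f) for f in freqs] for t in range(end)]
    return cos, sin

def apply_rotary(seq, cos, sin):
    """Rotate the vector at each position using that position's table row."""
    out = []
    for pos, x in enumerate(seq):
        y = []
        for j in range(len(x) // 2):
            a, b = x[2 * j], x[2 * j + 1]
            y += [a * cos[pos][j] - b * sin[pos][j],
                  a * sin[pos][j] + b * cos[pos][j]]
        out.append(y)
    return out

head_dim, seq_len = 4, 3
cos, sin = precompute_tables(head_dim, end=16)   # built once, up to the max length
xq = [[1.0, 0.0, 1.0, 0.0] for _ in range(seq_len)]
xq_rot = apply_rotary(xq, cos[:seq_len], sin[:seq_len])  # slice to the current length
```

Building the tables once and slicing them per forward pass keeps the trig out of the hot path; the attention block only does multiplies and adds.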

Critical detail: RoPE is applied to Q and K only, not V. The point is that the attention score q @ k.T is what should depend on relative position. The value vectors v carry content, and you want their interpretation to be position-independent so they can be aggregated across positions cleanly.

rotate(Q), rotate(K) → softmax(Q @ K.T) → weights @ V (V unrotated) → attention out
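The consequence of rotating only Q and K can be sketched with a single query attending over three keys. All vectors and helper names here are illustrative assumptions. Shifting every position by the same amount leaves the output unchanged, because the scores see only offsets and the values are never rotated.

```python
import math

def rotate(x, pos, base=10000.0):
    """RoPE rotation of consecutive feature pairs by theta_i * pos."""
    d = len(x)
    out = [0.0] * d
    for i in range(0, d, 2):
        theta = 1.0 / (base ** (i / d))
        c, s = math.cos(theta * pos), math.sin(theta * pos)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out

def attend(q, q_pos, keys, values, key_positions):
    """One query over several keys: RoPE on q and k, values left untouched."""
    d = len(q)
    qr = rotate(q, q_pos)
    scores = [sum(a * b for a, b in zip(qr, rotate(k, p))) / math.sqrt(d)
              for k, p in zip(keys, key_positions)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

keys = [[1.0, 0.5, -0.3, 0.8], [0.2, -1.0, 0.6, 0.1], [0.9, 0.9, -0.4, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
q = [0.4, -0.7, 1.0, 0.2]

out_a = attend(q, 2, keys, values, [0, 1, 2])
out_b = attend(q, 102, keys, values, [100, 101, 102])  # same window, shifted by 100
# out_a and out_b agree up to floating-point error.
```

Had the values been rotated as well, the aggregated output would carry an absolute-position-dependent rotation and the two results would no longer match.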

RoPE in C

In llama2.c/run.c, RoPE is just a loop over the feature pairs of every head, run right after the Q and K projections:

// RoPE relative positional encoding: complex-valued rotate q and k in each head
for (int i = 0; i < dim; i+=2) {
    int head_dim = i % head_size;
    float freq = 1.0f / powf(10000.0f, head_dim / (float)head_size);
    float val = pos * freq;
    float fcr = cosf(val);
    float fci = sinf(val);
    int rotn = i < kv_dim ? 2 : 1;
    for (int v = 0; v < rotn; v++) {
        float* vec = v == 0 ? s->q : s->k;
        float v0 = vec[i];
        float v1 = vec[i+1];
        vec[i]   = v0 * fcr - v1 * fci;
        vec[i+1] = v0 * fci + v1 * fcr;
    }
}

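rotn is 2 while index i still falls inside the key vector (i < kv_dim), so both q and k are rotated, and 1 once i runs past it, so only q is rotated. This accounts for grouped-query attention, where there are fewer KV heads than query heads and the key vector is therefore shorter than the query vector.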
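A line-by-line Python port of the C loop shows the effect of rotn on a toy grouped-query shape. The function name rope_inplace and the sizes (two query heads, one KV head, head_size 4) are assumptions chosen for illustration.

```python
import math

def rope_inplace(q, k, pos, head_size):
    """Python port of the C loop: len(q) is dim, len(k) is kv_dim <= dim."""
    dim, kv_dim = len(q), len(k)
    for i in range(0, dim, 2):
        head_dim = i % head_size  # index within the current head
        freq = 1.0 / (10000.0 ** (head_dim / head_size))
        fcr, fci = math.cos(pos * freq), math.sin(pos * freq)
        # rotn = 2 (q and k) while i is inside k, else 1 (q only).
        vecs = (q, k) if i < kv_dim else (q,)
        for vec in vecs:
            v0, v1 = vec[i], vec[i + 1]
            vec[i] = v0 * fcr - v1 * fci
            vec[i + 1] = v0 * fci + v1 * fcr

# Toy GQA shape: 2 query heads, 1 KV head, head_size 4 -> dim 8, kv_dim 4.
q = [1.0, 0.0] * 4
k = [1.0, 0.0] * 2
rope_inplace(q, k, pos=1, head_size=4)
# q[4:8] repeats q[0:4]: the frequency schedule restarts at each head boundary.
```

Without the i < kv_dim guard, the loop would index past the end of k as soon as i reached kv_dim.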

Length extrapolation

RoPE is often said to extrapolate beyond the training context: a model trained on 2048 tokens has never seen pos = 5000, but the rotation is just continuous trig, so it can still be computed for any position. Computable is not the same as well-behaved, though. In practice, naive RoPE extrapolation degrades sharply past the training context. Several tricks (RoPE base scaling, YaRN, dynamic NTK) extend the effective context window much further. Methods in this family are how Llama 2 7B was extended to 32k context with relatively cheap fine-tuning.
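As a rough sketch of what base scaling does, the snippet below applies one commonly used "NTK-aware" rule, base' = base · scale^(d / (d − 2)). The rule, the helper names, and the numbers are assumptions for illustration; the exact formula varies between implementations.

```python
def rope_frequencies(d, base=10000.0):
    """theta_i = base^(-2i/d) for pair index i = 0 .. d/2 - 1."""
    return [base ** (-2 * i / d) for i in range(d // 2)]

def ntk_scaled_base(base, scale, d):
    """Assumed NTK-aware rule: enlarge the base so the slowest pair slows by `scale`."""
    return base * scale ** (d / (d - 2))

d, scale = 64, 4.0  # e.g. aiming for a 4x longer context (illustrative numbers)
orig = rope_frequencies(d)
scaled = rope_frequencies(d, ntk_scaled_base(10000.0, scale, d))
# The fastest pair is untouched (theta_0 stays 1.0), while the slowest pair's
# frequency is divided by `scale`; pairs in between are slowed progressively.
```

The idea is that high-frequency pairs, which resolve nearby tokens, keep their trained behaviour, while low-frequency pairs are stretched so that longer distances map back into the range of angles seen during training.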

Related

attention: RoPE applies inside attention
repos/llama2-c: full PyTorch + C implementation
transformer-block: GPT-2's positional alternative
