
llama2.c

The inference-side counterpart. While the GPT-2 line of repos focuses on training, llama2.c focuses on inference: train the Llama 2 architecture in PyTorch, then inference it from a single C file of under 1,000 lines. No CUDA, no dependencies, just gcc.

Tags: inference · Llama 2 · pure C · mmap · GQA · int8 quant

Have you ever wanted to inference a baby Llama 2 model in pure C? No? Well, now you can!

— From the README

What's in the repo

model.py: Llama 2 in PyTorch (RMSNorm, RoPE, SwiGLU), ~343 lines
train.py: PyTorch training loop, adapted from nanoGPT
export.py: converts a PyTorch checkpoint to the llama2.c binary format
run.c: full C inference engine, ~973 lines
runq.c: quantized (int8) inference
tokenizer.py / .bin / .model: SentencePiece BPE
tinystories.py: data prep for the TinyStories corpus
doc/: stories260K.md, train_llama_tokenizer.md

The whole repo is "fullstack train + inference" for Llama 2 at small scale: train a model on TinyStories, export it to a .bin file, run it with ./run.

Why Llama 2's architecture differs from GPT-2

Same general shape as GPT-2 (a stack of pre-norm transformer blocks, autoregressive next-token prediction), but with three swaps:

GPT-2 → Llama 2
LayerNorm → RMSNorm
Learned positional embedding → RoPE (applied inside attention)
GELU MLP (2 linears) → SwiGLU MLP (3 linears)

Plus one addition that GPT-2 doesn't strictly need but that matters at any production scale, because it shrinks the KV cache at inference time: grouped-query attention (GQA).

Each of these is a small change; together they define the "modern" LLM architecture (Llama 2, Llama 3, Mistral, Gemma, Qwen, and others use this shape or a close variant of it).

model.py: Llama 2 in PyTorch

A clean PyTorch implementation. Key classes:

RMSNorm: the normalization layer (see layernorm-vs-rmsnorm).
precompute_freqs_cis(dim, end, theta=10000): precomputes the cos/sin tables for RoPE.
apply_rotary_emb(xq, xk, freqs_cos, freqs_sin): applies the 2D rotation to Q and K vectors.
Attention: multi-head attention with grouped KV heads.
FeedForward: SwiGLU MLP with 3 linears and the 2/3 hidden-dim trick.
TransformerBlock: pre-norm with RMSNorm.
Transformer: full model, with weight tying (tok_embeddings.weight = output.weight).
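To make the RoPE pair of functions concrete, here is a minimal pure-Python sketch of the same idea: precompute cos/sin tables, then rotate consecutive pairs of a vector by a position-dependent angle. The helper names (precompute_freqs, apply_rotary, dot) and the toy vectors are illustrative assumptions, not the repo's code, which works on batched PyTorch tensors.

```python
import math

def precompute_freqs(head_dim, end, theta=10000.0):
    """Return (cos, sin) tables, each of shape [end][head_dim // 2]."""
    inv_freq = [1.0 / (theta ** (2 * i / head_dim)) for i in range(head_dim // 2)]
    cos = [[math.cos(pos * f) for f in inv_freq] for pos in range(end)]
    sin = [[math.sin(pos * f) for f in inv_freq] for pos in range(end)]
    return cos, sin

def apply_rotary(vec, cos_row, sin_row):
    """Rotate each consecutive pair (vec[2i], vec[2i+1]) by the angle for pair i."""
    out = []
    for i in range(len(vec) // 2):
        a, b = vec[2 * i], vec[2 * i + 1]
        out.append(a * cos_row[i] - b * sin_row[i])
        out.append(a * sin_row[i] + b * cos_row[i])
    return out

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

cos, sin = precompute_freqs(head_dim=4, end=16)
q = [1.0, 0.0, 0.5, -0.5]  # toy query vector (hypothetical values)
k = [0.3, 0.7, -0.2, 0.1]  # toy key vector (hypothetical values)

# Position 0 has angle 0 for every pair, so it leaves the vector unchanged.
q0 = apply_rotary(q, cos[0], sin[0])

# Both pairs of positions are 3 apart, so the rotated q.k scores should agree.
score_a = dot(apply_rotary(q, cos[5], sin[5]), apply_rotary(k, cos[2], sin[2]))
score_b = dot(apply_rotary(q, cos[13], sin[13]), apply_rotary(k, cos[10], sin[10]))
```

The property worth noticing is the last one: after rotation, the attention score between a query and a key depends on how far apart they are, not on where they sit in the sequence. That is why RoPE can replace a learned positional embedding.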

The model also includes scaled init for w3.weight and wo.weight — the SwiGLU value projection and the attention output projection. Same scaling as nanoGPT, applied to the Llama-equivalent projections.
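As a rough illustration of what "same scaling as nanoGPT" means, the sketch below assumes the nanoGPT convention of dividing a base standard deviation of 0.02 by sqrt(2 * n_layers). The function name residual_init_std is hypothetical, and the exact constants should be checked against model.py.

```python
import math

def residual_init_std(n_layers, base_std=0.02):
    """Init std under the assumed nanoGPT-style rule: base_std / sqrt(2 * n_layers)."""
    return base_std / math.sqrt(2 * n_layers)

# Deeper stacks get a smaller init, so the sum of many residual updates keeps a similar scale.
stds = {n_layers: residual_init_std(n_layers) for n_layers in (6, 8, 12)}
```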

Grouped-query attention

In standard multi-head attention, you have n_heads query, key, and value heads: the same number of each. Llama 2 (and most modern LLMs) decouples this, keeping n_heads query heads but only n_kv_heads ≤ n_heads key/value heads. Each KV head is shared across n_heads / n_kv_heads query heads via repeat_kv:

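# from llama2.c/model.py
import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """torch.repeat_interleave(x, dim=2, repeats=n_rep)"""
    bs, slen, n_kv_heads, head_dim = x.shape
    if n_rep == 1:
        return x  # plain multi-head attention: nothing to repeat
    return (
        x[:, :, :, None, :]  # add a repeat axis after the KV-head axis
        .expand(bs, slen, n_kv_heads, n_rep, head_dim)  # broadcast view, no copy yet
        .reshape(bs, slen, n_kv_heads * n_rep, head_dim)  # merge into n_kv_heads * n_rep heads
    )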

Why? The KV cache size scales with n_kv_heads * head_dim, not n_heads * head_dim. Cutting n_kv_heads by 4× cuts the KV cache by 4×, which means roughly 4× more sequences (or 4× longer contexts) fit in the memory available for the cache at inference. Llama 2 70B uses n_heads=64, n_kv_heads=8: an 8× reduction in cache size.
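The arithmetic is easy to check. The sketch below counts cache bytes for a 70B-style shape; n_heads=64 and n_kv_heads=8 come from the text above, while n_layers=80, head_dim=128, a 4096-token context, and 2-byte (fp16) cache entries are assumptions made for illustration. kv_cache_bytes is a hypothetical helper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to cache K and V for one sequence of seq_len tokens."""
    # Two tensors (K and V) per layer, each of shape seq_len x n_kv_heads x head_dim.
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem

# Assumed shape: 80 layers, head_dim 128, 4096-token context, fp16 cache entries.
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=4096)

print(f"multi-head:    {mha / 2**30:.2f} GiB per sequence")
print(f"grouped-query: {gqa / 2**30:.2f} GiB per sequence")
print(f"reduction:     {mha // gqa}x")
```

Under these assumptions the cache drops from 10 GiB to 1.25 GiB per sequence. Note that run.c itself keeps its cache in fp32, so the absolute numbers there differ even though the ratio is the same.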

In the smaller Llama 2 models (7B and 13B), n_kv_heads = n_heads (no GQA, plain multi-head attention). GQA kicks in for the larger models.

run.c: pure C inference

The whole inference loop in C with no dependencies beyond libc and mmap. Structure:

Config: model hyperparameters (dim, n_layers, n_heads, etc.)
TransformerWeights: pointers into the mmap'd weight file
RunState: scratch buffers + KV cache
forward(transformer, token, pos): one transformer forward pass for one position
sample(sampler, logits): greedy, multinomial, and top-p (nucleus) sampling in C
generate(transformer, tokenizer, sampler, prompt, steps): autoregressive loop
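The "pointers into the mmap'd weight file" idea can be sketched in a few lines of Python: map the file read-only, parse a small header, and view the rest as floats without copying. The seven-int header layout and the helper names below are assumptions for illustration; check export.py and run.c for the layout the repo really uses.

```python
import mmap
import os
import struct
import tempfile

# Assumed header: seven native int32 fields, followed by raw float32 weights.
FIELDS = ("dim", "hidden_dim", "n_layers", "n_heads", "n_kv_heads", "vocab_size", "seq_len")
HEADER = struct.Struct("=7i")

def write_toy_checkpoint(path, config, weights):
    """Write a toy file in the assumed layout (a stand-in for a real export step)."""
    with open(path, "wb") as f:
        f.write(HEADER.pack(*(config[name] for name in FIELDS)))
        f.write(struct.pack(f"={len(weights)}f", *weights))

def open_checkpoint(path):
    """Map the file read-only; return (config dict, float32 view of the weights, mmap)."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    config = dict(zip(FIELDS, HEADER.unpack_from(mm, 0)))
    weights = memoryview(mm)[HEADER.size:].cast("f")  # zero-copy view, like a float* in C
    return config, weights, mm

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.bin")
    toy_config = dict(dim=8, hidden_dim=32, n_layers=2, n_heads=4,
                      n_kv_heads=2, vocab_size=16, seq_len=32)
    write_toy_checkpoint(path, toy_config, [0.5, -1.0, 2.0, 0.25])

    config, weights, mm = open_checkpoint(path)
    loaded_config = dict(config)
    loaded_weights = list(weights)  # copy out before unmapping
    weights.release()               # the view must be released before the map can close
    mm.close()
```

The design point is that nothing is read eagerly. The operating system pages weights in as the forward pass touches them, which keeps startup fast and lets several processes share the same pages.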

The forward pass is just n_layers rounds of the same thing:

RMSNorm → QKV matmul → RoPE → KV cache update → attention scores → softmax → V aggregation → output proj → residual add → RMSNorm → SwiGLU → residual add
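The middle of that pipeline (KV cache update, attention scores, softmax, V aggregation) can be sketched for a single head in plain Python. This is a simplified illustration, not a port of run.c: the names softmax and attend are hypothetical, the cache is a growing list rather than a preallocated buffer, and RoPE and the projections are left out.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, k_new, v_new, key_cache, value_cache):
    """Append this position's K/V to the cache, then attend over everything cached so far."""
    key_cache.append(k_new)
    value_cache.append(v_new)
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in key_cache]
    weights = softmax(scores)
    out = [0.0] * len(v_new)
    for w, v in zip(weights, value_cache):
        for i, x in enumerate(v):
            out[i] += w * x  # weighted sum of cached value vectors
    return out, weights

# Three decoding steps with toy (q, k, v) vectors; causality falls out of the cache,
# because a position can only see the entries appended up to and including itself.
key_cache, value_cache = [], []
steps = [
    ([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
    ([0.0, 1.0], [0.0, 1.0], [3.0, 4.0]),
    ([1.0, 1.0], [1.0, 1.0], [5.0, 6.0]),
]
outputs = []
for q, k, v in steps:
    outputs.append(attend(q, k, v, key_cache, value_cache))
```

Each step does work proportional to the number of cached positions, which is why a single-token forward pass stays cheap: earlier keys and values are never recomputed.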

M1 MacBook Air, ./run stories15M.bin: ~110 tok/s
With -O3 and OpenMP: ~700 tok/s

The 110M model is interactive on a laptop with no GPU.

TinyStories and small-domain LLMs

The included models (stories15M.bin, stories42M.bin, stories110M.bin) are trained on TinyStories, a synthetic dataset of simple short stories generated by GPT-3.5 and GPT-4 to teach narrow but coherent storytelling. Karpathy's small Llama 2 models can generate fluent, mostly coherent paragraphs in the style of simple children's stories:

Once upon a time, there was a little girl named Lily. She loved playing with her toys on top of her bed. One day, she decided to have a tea party with her stuffed animals...

The deeper lesson: a 15M-parameter model trained on a narrow, well-curated corpus can be more useful in its domain than a 1B-parameter generalist model. This is the case for narrow LLMs: if you can scope a problem tightly, you don't need a frontier model.

Quantization

runq.c is a separate, int8-quantized version of the inference engine. Weights are stored in int8 with one float scale factor per group of values, and the matmuls apply those scales on the fly. Relative to fp32 this cuts the model size by roughly 4×, so bigger models fit in less RAM with little or no loss in speed.

This is a "real" quantization implementation (group-wise int8) you can read end to end. The PyTorch-side quantization happens in export.py.
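For readers who want the idea before opening the C, here is a minimal sketch of symmetric group-wise int8 quantization in pure Python. It illustrates the general technique only: the function names, the toy weights, and the group sizes are assumptions, and export.py and runq.c may differ in details such as rounding and group size.

```python
def quantize_groups(values, group_size):
    """Symmetric int8 quantization with one float scale per group of values."""
    assert len(values) % group_size == 0
    q, scales = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0  # map the largest value to +/-127
        scales.append(scale)
        q.extend(int(round(v / scale)) for v in group)
    return q, scales

def dequantize_groups(q, scales, group_size):
    """Recover approximate floats: each int8 value times its group's scale."""
    return [qi * scales[i // group_size] for i, qi in enumerate(q)]

# Toy weights: one group of large values, one group of small ones.
weights = [0.1, -0.5, 0.25, 1.0, 0.003, -0.002, 0.001, 0.004]

q, scales = quantize_groups(weights, group_size=4)
restored = dequantize_groups(q, scales, group_size=4)

# With a single scale for all eight values, the small weights round toward zero.
q_coarse, scales_coarse = quantize_groups(weights, group_size=8)
restored_coarse = dequantize_groups(q_coarse, scales_coarse, group_size=8)
```

Comparing restored with restored_coarse shows why the scales are kept per group: a single large weight no longer dictates the resolution for every other value in the tensor.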
