llama2.c
The inference-side counterpart. While the GPT-2 line of repos focuses on training, llama2.c focuses on inference. Train the Llama 2 architecture in PyTorch, then run inference on it from a single ~700-line C file. No CUDA, no dependencies, just gcc.
Have you ever wanted to inference a baby Llama 2 model in pure C? No? Well, now you can!
— From the README
What's in the repo
The whole repo is "fullstack train + inference" for Llama 2 at small scale: train a model on TinyStories, export it to a .bin file, run it with ./run.
Why Llama 2's architecture differs from GPT-2
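Same general shape as GPT-2 (a stack of pre-norm transformer blocks, autoregressive next-token prediction), but with three swaps: RMSNorm replaces LayerNorm, rotary position embeddings (RoPE) replace learned absolute position embeddings, and a SwiGLU MLP replaces the GELU MLP.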
Plus one addition aimed at inference, which GPT-2 doesn't strictly need but which is essential at any production scale: grouped-query attention (GQA).
Each of these is a small change; together they define the "modern" LLM architecture (Llama 2, Llama 3, Mistral, Gemma, Qwen, etc. all use this shape).
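Two of those swaps, RMSNorm in place of LayerNorm and SwiGLU in place of the GELU MLP, are small enough to sketch in plain Python. This is a minimal illustration on lists of floats, not the repo's PyTorch code. The helper names, the `eps` default, and the `multiple_of` rounding parameter are assumptions made for the example.

```python
import math

def rmsnorm(x, weight, eps=1e-5):
    """Scale x by the reciprocal of its root-mean-square, then by a learned gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for w, v in zip(weight, x)]

def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(W, x):
    """Multiply a matrix, stored as a list of rows, by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def swiglu_ffn(x, w1, w2, w3):
    """SwiGLU MLP: w2 @ (silu(w1 @ x) * (w3 @ x)), three linears in total."""
    gate = [silu(v) for v in matvec(w1, x)]
    up = matvec(w3, x)
    return matvec(w2, [g * u for g, u in zip(gate, up)])

def swiglu_hidden_dim(dim, multiple_of):
    """Hidden width: 2/3 of 4*dim, rounded up to a multiple of multiple_of."""
    hidden = int(2 * (4 * dim) / 3)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)
```

Unlike LayerNorm, `rmsnorm` subtracts no mean and adds no bias. The 2/3 factor in `swiglu_hidden_dim` keeps the parameter count of the three-matrix MLP close to that of a two-matrix MLP with a 4x hidden width.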
model.py: Llama 2 in PyTorch
A clean PyTorch implementation. Key classes:
- RMSNorm: see layernorm-vs-rmsnorm.
- precompute_freqs_cis(dim, end, theta=10000): precomputes the cos/sin tables for RoPE.
- apply_rotary_emb(xq, xk, freqs_cos, freqs_sin): applies the 2D rotation to Q and K vectors.
- Attention: multi-head attention with grouped KV heads.
- FeedForward: SwiGLU MLP with 3 linears and the 2/3 hidden-dim trick.
- TransformerBlock: pre-norm with RMSNorm.
- Transformer: full model, with weight tying (tok_embeddings.weight = output.weight).
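The two RoPE helpers in the list above can be pictured with a small pure-Python sketch. It assumes the rotation acts on consecutive pairs of dimensions within one head, and the function names `precompute_cos_sin` and `rotate` are hypothetical stand-ins rather than the repo's functions.

```python
import math

def precompute_cos_sin(head_dim, end, theta=10000.0):
    """Return cos and sin tables, each of shape [end][head_dim // 2]."""
    freqs = [1.0 / (theta ** (i / head_dim)) for i in range(0, head_dim, 2)]
    cos = [[math.cos(pos * f) for f in freqs] for pos in range(end)]
    sin = [[math.sin(pos * f) for f in freqs] for pos in range(end)]
    return cos, sin

def rotate(vec, cos_row, sin_row):
    """Rotate each pair (vec[2i], vec[2i+1]) by the angle stored for pair i."""
    out = []
    for i in range(0, len(vec), 2):
        c, s = cos_row[i // 2], sin_row[i // 2]
        a, b = vec[i], vec[i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

cos, sin = precompute_cos_sin(head_dim=4, end=8)
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]
# Both pairs of positions are 2 apart, so the two scores agree up to rounding.
score_a = dot(rotate(q, cos[3], sin[3]), rotate(k, cos[1], sin[1]))
score_b = dot(rotate(q, cos[7], sin[7]), rotate(k, cos[5], sin[5]))
```

The point of the rotation is that the attention score between a query and a key depends only on how far apart their positions are, which is why no separate position embedding table is needed.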
The model also includes scaled init for w3.weight and wo.weight — the SwiGLU value projection and the attention output projection. Same scaling as nanoGPT, applied to the Llama-equivalent projections.
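A sketch of what that scaling amounts to, assuming the GPT-2-style rule of dividing a base standard deviation of 0.02 by sqrt(2 * n_layers). The function names and the pure-Python matrix are illustrative only.

```python
import math
import random

def residual_proj_std(n_layers, base_std=0.02):
    """Std for projections that write into the residual stream.

    Each layer adds two residual contributions (attention and MLP),
    hence the factor of 2 under the square root.
    """
    return base_std / math.sqrt(2 * n_layers)

def init_matrix(rows, cols, std, rng):
    """Fill a rows x cols matrix with zero-mean Gaussian samples."""
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
std = residual_proj_std(n_layers=6)   # 0.02 / sqrt(12), about 0.0058
wo = init_matrix(4, 4, std, rng)
```

The deeper the stack, the smaller the initial contribution of each scaled projection.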
Grouped-query attention
In standard multi-head attention, you have n_head query, key, and value heads — same number of each. Llama 2 (and most modern LLMs) decouples this: n_heads query heads but n_kv_heads ≤ n_heads key/value heads. Each KV head is shared across multiple query heads via repeat_kv:
    # from llama2.c/model.py
    import torch

    def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
        """torch.repeat_interleave(x, dim=2, repeats=n_rep)"""
        bs, slen, n_kv_heads, head_dim = x.shape
        if n_rep == 1:
            return x
        # Add a repeat axis, broadcast it, then fold it into the head axis.
        return (
            x[:, :, :, None, :]
            .expand(bs, slen, n_kv_heads, n_rep, head_dim)
            .reshape(bs, slen, n_kv_heads * n_rep, head_dim)
        )
Why? The KV cache size scales with n_kv_heads * head_dim, not n_heads * head_dim. Cutting n_kv_heads by 4× cuts the KV cache by 4×, so roughly 4× more cached tokens (longer contexts or more concurrent sequences) fit in the memory left over after the weights. Llama 2 70B uses n_heads=64, n_kv_heads=8, an 8× reduction in cache size.
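The arithmetic can be made concrete with a small helper. The figures n_layers=80, head_dim=128, a 4096-token context, and 2-byte (fp16) cache entries are assumptions added here for a 70B-sized model; only the head counts come from the text above.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The leading 2 counts the two cached tensors per layer: keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=4096)
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=4096)

print(gqa / 2**30)  # 1.25 GiB per sequence with 8 KV heads
print(mha / 2**30)  # 10.0 GiB per sequence with 64 KV heads
```

Under these assumptions, a single full-length sequence needs 10 GiB of cache with plain multi-head attention and 1.25 GiB with grouped KV heads.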
In the smaller Llama 2 models (7B and 13B), n_kv_heads = n_heads (no GQA, plain multi-head attention). Of the released models, only the 70B uses GQA.
run.c: pure C inference
The whole inference loop in C with no dependencies beyond libc and mmap. Structure:
- Config: model hyperparameters (dim, n_layers, n_heads, etc.)
- TransformerWeights: pointers into the mmap'd weight file
- RunState: scratch buffers + KV cache
- forward(transformer, token, pos): one transformer forward pass for one position
- sample(sampler, logits): greedy (argmax), temperature, and top-p sampling in C
- generate(transformer, tokenizer, sampler, prompt, steps): autoregressive loop
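The sampling step in the list above can be sketched in Python. This is a simplified illustration of greedy and top-p (nucleus) sampling with a temperature, not a transcription of run.c. The function names and the `rng` parameter are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_top_p(logits, temperature, top_p, rng):
    """Return a token index: greedy if temperature is 0, else nucleus sampling."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax([v / temperature for v in logits])
    # Sort token indices by probability, highest first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Keep the smallest prefix whose cumulative probability exceeds top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative > top_p:
            break
    # Sample inside the kept set, renormalised by its total mass.
    r = rng.random() * cumulative
    running = 0.0
    for i in kept:
        running += probs[i]
        if r < running:
            return i
    return kept[-1]  # guard against floating-point rounding

rng = random.Random(0)
token = sample_top_p([0.1, 2.0, -1.0], temperature=0.8, top_p=0.9, rng=rng)
```

Truncating to the nucleus drops the long tail of unlikely tokens, which tends to keep small models from wandering off topic.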
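The forward pass is just n_layers rounds of the same thing: RMSNorm, the Q/K/V matmuls, RoPE on Q and K, a write into the KV cache, attention over all cached positions, the output projection and a residual add, then RMSNorm, the SwiGLU feed-forward, and a second residual add.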
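One such round, for a single token, might look like the following pure-Python sketch. It is a structural illustration under several assumptions: weights live in a dict with hypothetical key names, the cache is a growing list indexed by position, and the caller feeds positions in order starting at 0. The C code works on flat float arrays and a preallocated cache instead.

```python
import math

def rmsnorm(x, w, eps=1e-5):
    inv = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [wi * v * inv for wi, v in zip(w, x)]

def matvec(W, x):
    return [sum(a * b for a, b in zip(row, x)) for row in W]

def softmax(xs):
    m = max(xs)
    e = [math.exp(v - m) for v in xs]
    s = sum(e)
    return [v / s for v in e]

def rope(vec, pos, head_dim, theta=10000.0):
    """Rotate consecutive pairs within each head by position-dependent angles."""
    out = list(vec)
    for i in range(0, len(vec), 2):
        angle = pos / (theta ** ((i % head_dim) / head_dim))
        c, s = math.cos(angle), math.sin(angle)
        out[i] = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out

def layer_forward(x, pos, w, cache, n_heads, n_kv_heads):
    """One layer for one token; cache holds keys/values of earlier positions."""
    head_dim = len(x) // n_heads
    # Attention half: norm, project, rotate, cache, attend, project, residual.
    h = rmsnorm(x, w["attn_norm"])
    q = rope(matvec(w["wq"], h), pos, head_dim)
    k = rope(matvec(w["wk"], h), pos, head_dim)
    v = matvec(w["wv"], h)
    cache["k"].append(k)  # list index equals position
    cache["v"].append(v)
    attn = []
    for head in range(n_heads):
        kv = head // (n_heads // n_kv_heads)  # query heads share a KV head
        lo, hi = kv * head_dim, (kv + 1) * head_dim
        qh = q[head * head_dim:(head + 1) * head_dim]
        scores = [sum(a * b for a, b in zip(qh, kt[lo:hi])) / math.sqrt(head_dim)
                  for kt in cache["k"]]
        probs = softmax(scores)
        for d in range(head_dim):
            attn.append(sum(p * vt[lo + d] for p, vt in zip(probs, cache["v"])))
    x = [a + b for a, b in zip(x, matvec(w["wo"], attn))]
    # Feed-forward half: norm, SwiGLU, residual.
    h = rmsnorm(x, w["ffn_norm"])
    gate = [g / (1.0 + math.exp(-g)) for g in matvec(w["w1"], h)]
    up = matvec(w["w3"], h)
    ffn = matvec(w["w2"], [g * u for g, u in zip(gate, up)])
    return [a + b for a, b in zip(x, ffn)]
```

Because only the current token is processed and earlier keys and values come from the cache, no causal mask is needed: the cache never contains future positions.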
Approximate throughput:
- ./run stories15M.bin: ~110 tok/s
- with -O3 and OpenMP: ~700 tok/s
TinyStories and small-domain LLMs
The included models (stories15M.bin, stories42M.bin, stories110M.bin) are trained on TinyStories, a synthetic dataset of simple short stories generated by GPT-3.5 and GPT-4 to teach narrow but coherent storytelling. Karpathy's small Llama 2 models generate fluent, coherent paragraphs in the style of simple children's stories:
Once upon a time, there was a little girl named Lily. She loved playing with her toys on top of her bed. One day, she decided to have a tea party with her stuffed animals...
The deeper lesson: a 15M-parameter model trained on a narrow, well-curated corpus can be more useful in its domain than a 1B-parameter generalist model. This is the case for narrow LLMs: if you can scope a problem tightly, you don't need a frontier model.
Quantization
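runq.c is a separate, int8-quantized version of the inference engine. Weights are stored as int8 in small groups, each group with its own float scale factor; activations are quantized the same way before each matmul, so the inner loop accumulates int8 products in integers and rescales once per group. Relative to float32 this cuts the model size by roughly 4×, so bigger models fit in less RAM, without slowing inference down.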
This is a "real" quantization implementation (group-wise int8) you can read end to end. The PyTorch-side quantization happens in export.py.
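A minimal sketch of group-wise symmetric int8 quantization, assuming one scale per group chosen so that the largest magnitude in the group maps to 127. The function names are hypothetical, and plain Python lists stand in for the packed int8 buffers a C implementation would use.

```python
def quantize_groupwise(values, group_size):
    """Return (int8-range integers, one float scale per group)."""
    assert len(values) % group_size == 0
    q, scales = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        scale = max(abs(v) for v in group) / 127.0
        scales.append(scale)
        # An all-zero group has scale 0; store zeros to avoid dividing by it.
        q += [round(v / scale) if scale > 0 else 0 for v in group]
    return q, scales

def dequantize_groupwise(q, scales, group_size):
    """Recover approximate floats: each integer times its group's scale."""
    return [qi * scales[i // group_size] for i, qi in enumerate(q)]

def quantized_dot(xq, xs, wq, ws, group_size):
    """Integer accumulation inside each group, one float rescale per group."""
    total = 0.0
    for g in range(len(xs)):
        lo = g * group_size
        acc = sum(a * b for a, b in zip(xq[lo:lo + group_size],
                                        wq[lo:lo + group_size]))
        total += acc * xs[g] * ws[g]
    return total

w = [0.5, -1.0, 0.25, 2.0, 0.0, 0.0, -0.01, 0.02]
x = [1.0, -0.5, 0.25, 0.75, 0.1, -0.2, 0.3, -0.4]
wq, ws = quantize_groupwise(w, group_size=4)
xq, xs = quantize_groupwise(x, group_size=4)
approx = quantized_dot(xq, xs, wq, ws, group_size=4)
exact = sum(a * b for a, b in zip(x, w))
```

Using a scale per group rather than per tensor means one outlier weight only degrades the precision of its own group. In the example, the second group of `w` holds much smaller values than the first and gets a correspondingly finer scale.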
Related
- layernorm-vs-rmsnorm, rope, gelu-and-swiglu — the architectural differences
- kv-cache — central to inference
- sampling — top-p in run.c
- weight-tying — Llama also ties
- repos/llm-c — pure-C/CUDA for training, sibling of llama2.c