llm.c
The most ambitious of the Karpathy repos: it reproduces GPT-2 (and GPT-3) pretraining in pure C/CUDA, with no PyTorch dependency.
LLMs in simple, pure C/CUDA with no need for 245MB of PyTorch or 107MB of cPython. Current focus is on pretraining, in particular reproducing the GPT-2 and GPT-3 miniseries.
— from the README
What's in the repo
| File | Purpose |
|---|---|
| train_gpt2.c | ~1000-line CPU reference, fp32, single file; OpenMP pragmas parallelize the hot loops across cores |
| train_gpt2.cu | the production CUDA version, bf16/fp16, multi-GPU |
| train_gpt2_fp32.cu | legacy fp32 CUDA, simpler and easier to read |
| train_gpt2.py | parallel PyTorch reference for unit-test parity |
| train_llama3.py | Llama 3 architecture training |
| llmc/ | individual CUDA kernels: attention, layernorm, matmul, adamw, etc. |
| dev/ | scripts: data prep, downloading checkpoints, profiling |
| test_gpt2.c | unit test: runs one step and compares activations + gradients to a PyTorch dump |
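The parity check that test_gpt2.c performs boils down to an element-wise comparison against reference values with a tolerance. The sketch below illustrates that idea only; `tensors_match` and its signature are hypothetical names chosen for this note, not the repo's actual helper.

```c
#include <math.h>
#include <stdio.h>

// Hypothetical helper (not the repo's API): returns 1 if every element of
// `ours` is within `tol` of the reference value, 0 otherwise, and prints
// the worst absolute deviation it saw.
int tensors_match(const float* ours, const float* ref, int n, float tol, const char* label) {
    int ok = 1;
    float max_diff = 0.0f;
    for (int i = 0; i < n; i++) {
        float diff = fabsf(ours[i] - ref[i]);
        if (diff > max_diff) { max_diff = diff; }
        if (!(diff <= tol)) { ok = 0; }  // written this way so NaN also fails
    }
    printf("%-12s %s (max abs diff %.3e)\n", label, ok ? "OK" : "MISMATCH", max_diff);
    return ok;
}
```

A looser tolerance is usually needed for gradients than for activations, and looser still when comparing a bf16 run against an fp32 reference.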
llmc/ is the headline. Most transformer operations get their own .cuh file
containing the forward and backward kernels, plus a small launcher; the
remaining headers hold host-side helpers (tokenizer, data loader, sampler).
Reading them in sequence is the best CUDA-for-deep-learning tutorial I've
encountered.
The kernels
The kernel files in llmc/:
- encoder.cuh: token + position embedding lookup
- layernorm.cuh: LayerNorm forward + backward (also handles residual)
- attention.cuh: manual attention forward + backward (used when the cuDNN path isn't compiled in)
- cudnn_att.cpp: cuDNN flash attention wrapper (opt-in at build time via USE_CUDNN=1)
- matmul.cuh: cublasLt-backed matmul, with explicit bias-add fusion
- gelu.cuh: GELU forward + backward
- fused_classifier.cuh: final logits + softmax + cross-entropy + backward, fused into one pass over the vocab
- adamw.cuh: AdamW update with stochastic rounding into the bf16 working params (fp32 master params optional)
- global_norm.cuh: gradient norm computation (used by gradient clipping)
- zero.cuh: ZeRO-1 optimizer state sharding across GPUs
- schedulers.h: learning-rate schedules
- sampler.h: token sampling from the output distribution in C
- tokenizer.h: GPT-2 tokenizer in C (decoding only)
- dataloader.h: sharded data loader over pre-tokenized files
- outlier_detector.h: keeps moving statistics of a monitored value to flag anomalies (loss-spike safety net)
Each kernel comes with hand-written forward and backward. There is no autograd; the training loop is hardcoded to call them in the right order.
The training loop
train_gpt2.cu has the full loop in plain C with macros. Conceptual structure:
```c
for (step = 0; step < max_steps; step++) {
    // forward
    for (layer = 0; layer < n_layer; layer++) {
        layernorm_forward(...);
        attention_forward(...);
        residual_forward(...);
        layernorm_forward(...);
        matmul_forward(...);   // c_fc
        gelu_forward(...);
        matmul_forward(...);   // c_proj
        residual_forward(...);
    }
    layernorm_forward(...);
    matmul_forward(...);       // lm_head
    fused_classifier(...);     // softmax + cross-entropy + backward

    // backward (mirror, in reverse)
    matmul_backward(...);      // lm_head_backward
    layernorm_backward(...);
    for (layer = n_layer-1; layer >= 0; layer--) {
        // ... reverse of forward ...
    }

    // optimizer
    global_norm_squared(...);  // for grad clipping
    adamw_update(...);

    // logging, eval, etc.
}
```
This is the C version of what PyTorch's autograd machinery does for you implicitly.
Once you've stared at this code for an hour, the magic of loss.backward()
is fully demystified — it's mechanical, just written explicitly.
CPU reference: train_gpt2.c
A 1000-line C file that does fp32 GPT-2 training on CPU only, single-file, no dependencies. Slow but pedagogically perfect: every matmul is a triple-nested for-loop, every backward is hand-written, every memory allocation is visible.
```
step 0: train loss 5.356189 (took 1452.121000 ms)
step 1: train loss 4.301069 (took 1288.673000 ms)
...
step 39: train loss 3.970751 (took 1323.779000 ms)
val loss 4.107781
```
Slow, but it trains. The point isn't speed; it's that you can read every line.
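As an illustration of what "every matmul is a for-loop" looks like, here is a naive linear-layer forward pass in the same spirit. The name `matmul_forward_naive_ref`, the row-major layouts, and the flattening of batch and time into a single `N` dimension are assumptions of this sketch rather than a transcription of the file.

```c
#include <stddef.h>

// out[n][o] = bias[o] + sum_i inp[n][i] * weight[o][i]
// inp:    (N, C)   row-major, N = batch * sequence length
// weight: (OC, C)  row-major, one row per output channel
// bias:   (OC) or NULL
// out:    (N, OC)  row-major
void matmul_forward_naive_ref(float* out, const float* inp, const float* weight,
                              const float* bias, int N, int C, int OC) {
    for (int n = 0; n < N; n++) {
        for (int o = 0; o < OC; o++) {
            float val = (bias != NULL) ? bias[o] : 0.0f;
            for (int i = 0; i < C; i++) {
                val += inp[n * C + i] * weight[o * C + i];
            }
            out[n * OC + o] = val;
        }
    }
}
```

Because every output element is independent, the outer loops are a natural place for an OpenMP `parallel for`.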
Mixed precision and stochastic rounding
The CUDA version keeps bf16 working params and (optionally) fp32 master params.
The optimizer computes each update in fp32 and uses stochastic rounding when
writing the result back to bf16. See mixed-precision-and-mfu.
The implementation in llmc/adamw.cuh:
```c
float param = old_param - (learning_rate * (m / (sqrtf(v) + eps) + weight_decay * old_param));
stochastic_rounding(param, &params_memory[idx], seed);
if (master_params_memory != NULL) { master_params_memory[idx] = param; }
```
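Stochastic rounding: instead of rounding to nearest, round up with probability equal to the fractional distance to the next representable value, and down otherwise. The rounded value is then an unbiased estimate of the fp32 parameter, so updates smaller than one bf16 step still accumulate on average instead of being rounded away on every step.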
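A CPU sketch of the idea, assuming bf16 is stored as the top 16 bits of an IEEE fp32 value. The helper names and the xorshift generator are choices made for this illustration; the CUDA kernel uses its own noise function and intrinsics, and this sketch does not special-case NaN.

```c
#include <stdint.h>
#include <string.h>

// Small xorshift RNG; the state must be nonzero.
static uint32_t xorshift32(uint32_t* state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

// Round an fp32 value to bf16 (returned as raw 16 bits).
// The low 16 bits of the fp32 pattern are the part bf16 cannot hold.
// We bump the kept part by one step with probability low / 65536, so the
// expected result equals the input. For negative inputs the bump increases
// the magnitude, which keeps the estimate unbiased in both directions.
uint16_t stochastic_round_bf16(float x, uint32_t* rng_state) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    uint32_t low = bits & 0xFFFFu;
    uint32_t high = bits >> 16;
    uint32_t threshold = xorshift32(rng_state) & 0xFFFFu;
    if (low > threshold) { high += 1; }  // a mantissa carry rolls into the exponent
    return (uint16_t)high;
}

// Widen raw bf16 bits back to fp32 by zero-filling the low half.
float bf16_to_float(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Rounding `1.0f + 2^-9` many times yields `1.0` about three quarters of the time and `1.0078125` about one quarter of the time, so the average lands on the original value; round-to-nearest would return `1.0` every time.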
Multi-GPU
llm.c supports DDP-style multi-GPU training via NCCL, and ZeRO-1 optimizer state
sharding via the zero.cuh kernels. There is no model parallelism (yet); the model
fits in one GPU's memory at the sizes targeted (GPT-2 series + small GPT-3 variants).
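The bookkeeping behind ZeRO-1 is mostly index arithmetic: each rank owns one contiguous slice of the optimizer state and updates only that slice. The sketch below shows one way to compute the slice; `ShardInfo` and `zero1_shard` are hypothetical names, and the ceil-division split is an assumption of this note, not necessarily how the repo partitions or pads its tensors.

```c
#include <stddef.h>

typedef struct {
    size_t offset;  // first parameter index owned by this rank
    size_t size;    // number of parameters owned by this rank
} ShardInfo;

// Split num_params into world_size contiguous slices of (almost) equal size.
// The last ranks may get a shorter or empty slice when the split is uneven.
ShardInfo zero1_shard(size_t num_params, int rank, int world_size) {
    size_t ws = (size_t)world_size;
    size_t per_rank = (num_params + ws - 1) / ws;   // ceil division
    size_t offset = per_rank * (size_t)rank;
    if (offset > num_params) { offset = num_params; }
    size_t remaining = num_params - offset;
    ShardInfo shard;
    shard.offset = offset;
    shard.size = (remaining < per_rank) ? remaining : per_rank;
    return shard;
}
```

With this layout each rank stores the AdamW moments only for its own slice, so optimizer memory per GPU shrinks roughly by a factor of the world size, while the parameters and gradients remain fully replicated.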
Performance
The README claims llm.c is 7% faster than PyTorch Nightly on GPT-2 (124M). The reasons:
- Custom fused kernels (fused_classifier handles softmax+ce+backward in one pass).
- Direct cublasLt access (no PyTorch dispatcher overhead).
- No Python (no GIL, no autograd-graph build/destroy, no per-op overhead).
- Hand-tuned launch parameters.
PyTorch is a moving target and keeps getting faster, but the comparison suggests that 90%+ of the speed already lives in the underlying libraries (cublasLt, cuDNN); the remaining 7-10% comes from kernel fusion and from dispatch/Python overhead that you can claw back by going to C.
Related
- repos/build-nanogpt: the PyTorch counterpart, same target model
- backpropagation: the algorithm llm.c spells out
- mixed-precision-and-mfu: master params and stochastic rounding
- adamw: the kernel in llmc/adamw.cuh
- layernorm-vs-rmsnorm, attention: kernels in llmc/