llm.c

The most ambitious of the Karpathy repos: it reproduces GPT-2 (and GPT-3) pretraining in pure C/CUDA, with no PyTorch dependency.

~1000 lines of C (CPU reference)
A few thousand lines of CUDA (GPU version)
~7% faster than PyTorch Nightly on GPT-2 124M (late 2024)

LLMs in simple, pure C/CUDA with no need for 245MB of PyTorch or 107MB of cPython. Current focus is on pretraining, in particular reproducing the GPT-2 and GPT-3 miniseries.

— from the README

What's in the repo

File                 Purpose
train_gpt2.c         ~1000-line CPU reference: fp32, parallelized with OpenMP when available
train_gpt2.cu        the production CUDA version: bf16/fp16, multi-GPU
train_gpt2_fp32.cu   legacy fp32 CUDA, simpler and easier to read
train_gpt2.py        parallel PyTorch reference for unit-test parity
train_llama3.py      Llama 3 architecture training
llmc/                individual CUDA kernels: attention, layernorm, matmul, adamw, etc.
dev/                 scripts: data prep, downloading checkpoints, profiling
test_gpt2.c          unit test: compares logits, loss, and gradients against a PyTorch dump, then checks that the loss matches over several training steps
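
The parity check that test_gpt2.c performs can be pictured as an element-wise comparison against the PyTorch dump with a tolerance. The sketch below is illustrative only: `tensors_match` and its `tol` argument are hypothetical names chosen here, not the repo's actual helper.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not from the repo): returns 1 if every element of
 * `ours` is within `tol` absolute difference of the reference, else 0. */
int tensors_match(const float *ours, const float *ref, size_t n, float tol) {
    int ok = 1;
    float max_diff = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float diff = ours[i] - ref[i];
        if (diff < 0.0f) diff = -diff;      /* absolute value */
        if (diff > max_diff) max_diff = diff;
        if (!(diff <= tol)) ok = 0;         /* written this way so NaN also fails */
    }
    printf("max abs diff: %e -> %s\n", max_diff, ok ? "OK" : "MISMATCH");
    return ok;
}
```

In a test like this, the tolerance is the main design choice: fp32 against fp32 can be tight, while a bf16 run compared with an fp32 reference needs a looser bound.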

llmc/ is the headline. Each transformer operation is a separate .cuh file containing the forward and backward kernels, plus a small launcher. Reading them in sequence is the best CUDA-for-deep-learning tutorial I've encountered.

The kernels

The main files in llmc/ (kernels plus a few host-side helpers):

encoder.cuh

token + position embedding lookup

layernorm.cuh

LayerNorm forward + backward (also handles residual)

attention.cuh

manual attention forward + backward (fallback when cuDNN flash isn't available)

cudnn_att.cpp

cuDNN flash attention wrapper (default path)

matmul.cuh

cublasLt-backed matmul, with explicit bias-add fusion

gelu.cuh

GELU forward + backward

fused_classifier.cuh

final logits + softmax + cross-entropy + backward, fused into one pass over the vocab

adamw.cuh

AdamW update, with stochastic rounding when writing the bf16 params (an fp32 master copy is kept alongside)

global_norm.cuh

gradient norm computation (used by gradient clipping)

zero.cuh

ZeRO-1 optimizer state sharding across GPUs

schedulers.h

LR schedules

sampler.h

top-k/top-p sampling in C

tokenizer.h

GPT-2 BPE tokenizer in C

dataloader.h

sharded mmap-based data loader

outlier_detector.h

keeps a moving window of loss and gradient-norm values and flags outliers by z-score (loss-spike safety net)

Each kernel comes with hand-written forward and backward. There is no autograd; the training loop is hardcoded to call them in the right order.

The training loop

train_gpt2.cu has the full loop in plain C with macros. Conceptual structure:

for (int step = 0; step < max_steps; step++) {
    // forward
    encoder_forward(...);            // token + position embeddings
    for (int layer = 0; layer < n_layer; layer++) {
        layernorm_forward(...);      // ln_1
        matmul_forward(...);         // QKV projection
        attention_forward(...);
        matmul_forward(...);         // attention output projection
        residual_forward(...);
        layernorm_forward(...);      // ln_2
        matmul_forward(...);         // c_fc
        gelu_forward(...);
        matmul_forward(...);         // c_proj
        residual_forward(...);
    }
    layernorm_forward(...);          // final layernorm
    matmul_forward(...);             // lm_head
    fused_classifier(...);           // softmax + cross-entropy; also writes the logit gradients

    // backward (mirror, in reverse)
    matmul_backward(...);            // lm_head
    layernorm_backward(...);         // final layernorm
    for (int layer = n_layer - 1; layer >= 0; layer--) {
        // ... reverse of forward ...
    }
    encoder_backward(...);

    // optimizer
    global_norm_squared(...);        // for grad clipping
    adamw_update(...);

    // logging, eval, etc.
}

This is the C version of what PyTorch's autograd machinery does for you implicitly. Once you've stared at this code for an hour, the magic of loss.backward() is fully demystified: it is mechanical bookkeeping, just written out explicitly.

CPU reference: train_gpt2.c

A single ~1000-line C file that trains GPT-2 in fp32 on the CPU, with no dependencies. Slow but pedagogically perfect: every matmul is a triple-nested for-loop, every backward is hand-written, every memory allocation is visible.

step 0: train loss 5.356189 (took 1452.121000 ms)
step 1: train loss 4.301069 (took 1288.673000 ms)
...
step 39: train loss 3.970751 (took 1323.779000 ms)
val loss 4.107781

Sample timing from the README (M3 Max MacBook): about 1.3 seconds per step for the 124M model at batch size 4 and context length 64.

Slow, but it trains. The point isn't speed; it's that you can read every line.
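
As an example of that readability, a naive matmul in this style fits in a dozen lines. The sketch below assumes a row-major layout with the weight stored as (output channels, input channels) and flattens batch and time into one dimension; the name `matmul_forward_naive` and the exact signature are assumptions for illustration.

```c
#include <stddef.h>

/* out[bt, o] = bias[o] + sum_i inp[bt, i] * weight[o, i]
 * inp is (BT, C), weight is (OC, C), bias is (OC) or NULL, out is (BT, OC). */
void matmul_forward_naive(float *out, const float *inp, const float *weight,
                          const float *bias, int BT, int C, int OC) {
    for (int bt = 0; bt < BT; bt++) {            /* each (batch, time) row */
        for (int o = 0; o < OC; o++) {           /* each output channel */
            float val = (bias != NULL) ? bias[o] : 0.0f;
            for (int i = 0; i < C; i++) {        /* dot product over inputs */
                val += inp[bt * C + i] * weight[o * C + i];
            }
            out[bt * OC + o] = val;
        }
    }
}
```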

Mixed precision and stochastic rounding

The CUDA version uses bf16 working params and fp32 master params, with stochastic rounding applied when the updated fp32 value is written back to bf16. See mixed-precision-and-mfu. The implementation in llmc/adamw.cuh:

float param = old_param - (learning_rate * (m / (sqrtf(v) + eps) + weight_decay * old_param));
stochastic_rounding(param, &params_memory[idx], seed);
if (master_params_memory != NULL) { master_params_memory[idx] = param; }

Stochastic rounding: instead of rounding to nearest, round up with probability equal to how far the value sits between its two bf16 neighbors, and down otherwise. The rounded value then equals the fp32 value in expectation, so updates smaller than bf16's resolution still accumulate on average instead of being rounded away at every step.
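
A minimal CPU sketch of the idea, assuming the common bit trick of adding 16 random bits to the fp32 bit pattern and truncating. This is not the repo's kernel, which generates its own random numbers from the seed; here the caller supplies them, and the function names are hypothetical. bf16 keeps the top 16 bits of an fp32 value, so a carry out of the low 16 bits happens with probability low16 / 65536, which rounds the magnitude up with exactly that probability.

```c
#include <stdint.h>
#include <string.h>

/* Round an fp32 value to bf16, returned as its 16 raw bits.
 * rnd16 should be uniform in [0, 65535]. Finite inputs are assumed;
 * values near the fp32 maximum are not special-cased in this sketch. */
uint16_t f32_to_bf16_stochastic(float x, uint32_t rnd16) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);     /* bit-cast float -> uint32 */
    bits += (rnd16 & 0xFFFFu);          /* carry into bit 16 with prob low16/65536 */
    return (uint16_t)(bits >> 16);      /* drop the 16 low mantissa bits */
}

/* Widen bf16 bits back to fp32 (exact: the low 16 bits become zero). */
float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float x;
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

A value that bf16 represents exactly, such as 1.0f, is returned unchanged for every random draw; a value halfway between two bf16 neighbors rounds up for half of the possible draws.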

Multi-GPU

llm.c supports DDP-style multi-GPU training via NCCL, and ZeRO-1 optimizer state sharding via the zero.cuh kernels. There is no model parallelism (yet); the model fits in one GPU's memory at the sizes targeted (GPT-2 series + small GPT-3 variants).
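
Under ZeRO-1 every rank still holds the full parameters and gradients, but the AdamW moment buffers are split so that each rank stores and updates only its own slice. As a sketch of the bookkeeping, assuming a simple contiguous partition (the helper `zero1_shard_range` is hypothetical, and the repo may partition differently, for instance by requiring an even split):

```c
#include <stddef.h>

/* Hypothetical helper: half-open range [*begin, *end) of parameter indices
 * whose optimizer state (AdamW m and v) is owned by `rank`. */
void zero1_shard_range(size_t num_params, int rank, int world_size,
                       size_t *begin, size_t *end) {
    /* ceil(num_params / world_size) elements per rank */
    size_t shard = (num_params + (size_t)world_size - 1) / (size_t)world_size;
    size_t b = shard * (size_t)rank;
    size_t e = b + shard;
    if (b > num_params) b = num_params;   /* trailing ranks may own less */
    if (e > num_params) e = num_params;
    *begin = b;
    *end = e;
}
```

Each rank then runs the optimizer update only over its range and the updated parameters are gathered across ranks, which cuts per-GPU optimizer memory by roughly the number of GPUs.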

Performance

The README claims llm.c is about 7% faster than PyTorch Nightly on GPT-2 (124M). PyTorch is a moving target and keeps getting faster too, but the comparison suggests that 90%+ of the speed already comes from the underlying libraries (cublasLt, cuDNN), and that the remaining 7-10% is largely dispatch and Python overhead that you can claw back by going to C.

Related