Algorithm · Autodiff

Backpropagation

Backpropagation is the algorithm that lets you efficiently compute the gradient of a scalar loss with respect to every parameter in a neural network. Once you have those gradients, SGD or AdamW can take a step that reduces the loss. Every modern deep learning framework — PyTorch, JAX, TensorFlow — is at its core a system for doing backprop fast on big tensors. The clearest way to understand it is to see it built up from scalars, which is exactly what micrograd does.

The same thing in different costumes. Across implementations, the algorithm doesn't change — only how the graph is represented and how the backward pass is dispatched.
Scalar micrograd Value nodes with _backward closures.
Tensor PyTorch autograd Same idea, each node holds a tensor; backward is tensor ops.
Hand-rolled llm.c No graph, no closures — kernels called in reverse order.
Pedagogy Backprop Ninja Reimplement every backward in an MLP by hand.

The scalar view: micrograd's Value class

In sources/karpathy-repos/micrograd/micrograd/engine.py, every number in the computation graph is a Value object that stores three things: its data (the scalar), its grad (initially zero), and a closure _backward that knows how to push gradients from this node into its children.

data
The scalar value at this node.
grad
The accumulated gradient, initially zero.
_backward
A closure that pushes gradients from this node into its children.

When you do c = a + b, the __add__ method allocates a new Value with data = a.data + b.data and a _backward that does:

def _backward():
    self.grad += out.grad
    other.grad += out.grad

That's the chain rule for addition: the local derivative of a+b with respect to each input is 1, so the upstream gradient out.grad passes through unchanged. For multiplication, the local derivative of a*b with respect to a is b, so _backward does self.grad += other.data * out.grad. Karpathy walks through these one by one in lecture 1 of Zero-to-Hero.

The += matters. A node can be used in multiple places downstream, and gradients from different paths must accumulate. This is just the multivariate chain rule. Forgetting the += is one of the most common bugs when reimplementing backprop.

Topological sort, then reverse

The forward pass builds a DAG implicitly: each Value records _prev (its inputs) and _op (the operation). The backward() method does a topological sort of the DAG so that every node is processed only after all of its downstream consumers have already deposited their gradients:

self.grad = 1
for v in reversed(topo):
    v._backward()

You seed the loss with grad = 1 (the derivative of the loss with respect to itself), then visit nodes in reverse topological order, each calling its closure. After the loop, every leaf parameter has the correct gradient and you can call optimizer.step().

Why "reverse-mode"

You could compute gradients by pushing forward: for every parameter, propagate its directional derivative through the graph. That's forward-mode autodiff, and it costs one forward pass per parameter. With billions of parameters, that's hopeless.

Forward-mode

One forward pass per parameter.

With billions of parameters, that's hopeless.

Reverse-mode

One forward pass to build the graph and cache activations, one backward pass to compute every parameter's gradient at once.

Roughly 2× the forward pass — independent of parameter count.

This is the whole reason deep learning is tractable.

From scalars to tensors

PyTorch's autograd is the same idea, but each node holds a tensor instead of a scalar and each _backward is written in terms of tensor ops (matmul, broadcast, sum). The local derivative of a matmul Y = X @ W is:

w.r.t. Local derivative
X dL/dX = dL/dY @ W.T
W dL/dW = X.T @ dL/dY

Karpathy hand-derives these in lecture 5, "Becoming a Backprop Ninja", where he reimplements every backward in the MLP by hand without loss.backward().

The C view: llm.c

sources/karpathy-repos/llm.c/train_gpt2.c writes the whole forward and backward pass by hand. There's no autograd — each kernel (matmul_backward, attention_backward, layernorm_backward, etc.) is its own function, and the training loop calls them in reverse order of the forward pass. It's the same algorithm as micrograd, just with no graph object and no closures: the structure is hardcoded into the source.

This is the payoff of understanding backprop at the scalar level: once you do, every implementation — _backward closures, PyTorch's autograd engine, hand-rolled C kernels — looks like the same thing in different costumes.

Related