Backpropagation
Backpropagation is the algorithm that lets you efficiently compute the gradient of a scalar loss with respect to every parameter in a neural network. Once you have those gradients, SGD or AdamW can take a step that reduces the loss. Every modern deep learning framework — PyTorch, JAX, TensorFlow — is at its core a system for doing backprop fast on big tensors. The clearest way to understand it is to see it built up from scalars, which is exactly what micrograd does.
Value nodes with _backward closures.
The scalar view: micrograd's Value class
In sources/karpathy-repos/micrograd/micrograd/engine.py, every
number in the computation graph is a Value object that stores
three things: its data (the scalar), its grad
(initially zero), and a closure _backward that knows how to
push gradients from this node into its children.
data- The scalar value at this node.
grad- The accumulated gradient, initially zero.
_backward- A closure that pushes gradients from this node into its children.
When you do c = a + b, the __add__ method
allocates a new Value with
data = a.data + b.data and a _backward that does:
def _backward():
self.grad += out.grad
other.grad += out.grad
That's the chain rule for addition: the local derivative of
a+b with respect to each input is 1, so the
upstream gradient out.grad passes through unchanged. For
multiplication, the local derivative of a*b with respect to
a is b, so _backward does
self.grad += other.data * out.grad. Karpathy walks through
these one by one in
lecture 1 of Zero-to-Hero.
+= matters. A node can be used in
multiple places downstream, and gradients from different paths must
accumulate. This is just the multivariate chain rule. Forgetting the
+= is one of the most common bugs when reimplementing
backprop.
Topological sort, then reverse
The forward pass builds a DAG implicitly: each Value records
_prev (its inputs) and _op (the operation). The
backward() method does a topological sort of the DAG so that
every node is processed only after all of its downstream consumers have
already deposited their gradients:
self.grad = 1
for v in reversed(topo):
v._backward()
You seed the loss with grad = 1 (the derivative of the loss
with respect to itself), then visit nodes in reverse topological order,
each calling its closure. After the loop, every leaf parameter has the
correct gradient and you can call optimizer.step().
Why "reverse-mode"
You could compute gradients by pushing forward: for every parameter, propagate its directional derivative through the graph. That's forward-mode autodiff, and it costs one forward pass per parameter. With billions of parameters, that's hopeless.
Forward-mode
One forward pass per parameter.
With billions of parameters, that's hopeless.
Reverse-mode
One forward pass to build the graph and cache activations, one backward pass to compute every parameter's gradient at once.
Roughly 2× the forward pass — independent of parameter count.
This is the whole reason deep learning is tractable.
From scalars to tensors
PyTorch's autograd is the same idea, but each node holds a tensor instead
of a scalar and each _backward is written in terms of tensor
ops (matmul, broadcast, sum). The local derivative of a matmul
Y = X @ W is:
| w.r.t. | Local derivative |
|---|---|
X |
dL/dX = dL/dY @ W.T |
W |
dL/dW = X.T @ dL/dY |
Karpathy hand-derives these in
lecture 5, "Becoming a Backprop Ninja",
where he reimplements every backward in the
MLP by hand without loss.backward().
The C view: llm.c
sources/karpathy-repos/llm.c/train_gpt2.c writes the whole
forward and backward pass by hand. There's no autograd — each kernel
(matmul_backward, attention_backward,
layernorm_backward, etc.) is its own function, and the
training loop calls them in reverse order of the forward pass. It's the
same algorithm as micrograd, just with no graph object and no closures:
the structure is hardcoded into the source.
_backward closures, PyTorch's
autograd engine, hand-rolled C kernels — looks like the same thing in
different costumes.
Related
- adamw — what consumes the gradients
- value-class — the data structure that holds a node
- gradient-accumulation — when one backward isn't enough
- layernorm-vs-rmsnorm — a normalization whose backward is famously tricky to derive