Value: the micrograd autograd primitive
Value is the entire computation graph data structure in
micrograd. It holds one scalar
number, knows the operation that produced it, and carries a closure
that knows how to propagate gradients to its inputs. Reading the ~100
lines of engine.py is the single best way to understand
what PyTorch's autograd is doing under the hood — same data structure,
just with tensor-shaped nodes instead of scalar ones.
The data
class Value:
def __init__(self, data, _children=(), _op=''):
self.data = data # the scalar value
self.grad = 0 # the gradient (will be filled in during backward)
self._backward = lambda: nil # closure to propagate gradients to children
self._prev = set(_children) # the inputs to the op that produced this node
self._op = _op # the op name, for graphviz / debugging
Four pieces of state:
| Field | Role |
|---|---|
data |
the actual scalar value (forward result). |
grad |
the gradient of the final loss with respect to this node. Filled in during backward(). |
_backward |
a closure (function with no args) that knows how to take this node's grad and add contributions to the grads of its children. Built when the op runs. |
_prev |
pointer set to the children. Used to do topological sort. |
Operator overloads
Each Python operator is overridden to produce a new Value
and attach a _backward closure for that op. The pattern
is unvarying:
def __mul__(self, other):
other = other if isinstance(other, Value) else Value(other)
out = Value(self.data * other.data, (self, other), '*')
def _backward():
self.grad += other.data * out.grad
other.grad += self.data * out.grad
out._backward = _backward
return out
Three lines of logic:
- Compute forward.
- Construct the output node with the right children.
- Define and attach the backward closure that uses the chain rule on this specific op.
The local derivatives of a*b with respect to a
and b are b and a. The closure
does self.grad += other.data * out.grad, which is
"downstream gradient (out.grad) times local derivative
(other.data)" — exactly the chain rule.
Same pattern for the other ops:
__add__
local derivatives 1 and 1
__pow__
local derivative c * x^(c-1)
relu
local derivative 1 if x > 0 else 0
The += is load-bearing
Every closure does self.grad += ..., not
self.grad = .... Reason: a single Value
might be used in multiple downstream expressions. For example:
a = Value(2.0)
b = a + a
Both children of b are the same node a. The
_backward for + runs twice (one for each
child), and a.grad needs to accumulate both contributions.
If you used = instead of +=, you'd lose the
first contribution. This is the multivariate chain rule: when a
variable appears multiple times, its total gradient is the sum across
appearances.
This is also the reason for optimizer.zero_grad() in
PyTorch — gradients persist across backward calls and you have to
explicitly clear them between training iterations.
backward()
def backward(self):
topo = []
visited = set()
def build_topo(v):
if v not in visited:
visited.add(v)
for child in v._prev:
build_topo(child)
topo.append(v)
build_topo(self)
self.grad = 1
for v in reversed(topo):
v._backward()
Two phases:
Topological sort
DFS from the output (the loss). Each node is added to
topo after all its predecessors. This gives
an ordering where every node appears before any of its consumers.
Reverse pass
Seed the loss's grad to 1
(d loss / d loss = 1). Iterate topo in
reverse — for each node, run its _backward, which
deposits gradients into the children. By the time we visit a
node, all its consumers have already deposited their contributions
into its grad.
After backward(), every leaf Value (the
inputs / parameters) has the correct gradient.
From Value to Tensor
PyTorch's autograd is the same algorithm with two changes:
Change 1 — payload
Each node holds a tensor instead of a scalar. The data
is an n-dimensional array; the grad is
too.
Change 2 — closures
The _backward closures are written in terms of tensor
ops. The backward for matmul(X, W) involves transposes
and another matmul.
Otherwise it's the same DAG, the same topo sort, the same reverse
iteration with += gradients. If you can read
engine.py, you can read autograd's C++ source (well,
slowly, but conceptually).
Related
- backpropagation — the algorithm
Valueimplements - repos/micrograd — the full ~100-line library
- zero-to-hero-arc — lecture 1 walks through
Valueline by line