Primitive · micrograd

Value: the micrograd autograd primitive

Value is the entire computation graph data structure in micrograd. It holds one scalar number, knows the operation that produced it, and carries a closure that knows how to propagate gradients to its inputs. Reading the ~100 lines of engine.py is the single best way to understand what PyTorch's autograd is doing under the hood — same data structure, just with tensor-shaped nodes instead of scalar ones.

The data

class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data            # the scalar value
        self.grad = 0               # the gradient (will be filled in during backward)
        self._backward = lambda: nil  # closure to propagate gradients to children
        self._prev = set(_children)   # the inputs to the op that produced this node
        self._op = _op                # the op name, for graphviz / debugging

Four pieces of state:

FieldRole
data the actual scalar value (forward result).
grad the gradient of the final loss with respect to this node. Filled in during backward().
_backward a closure (function with no args) that knows how to take this node's grad and add contributions to the grads of its children. Built when the op runs.
_prev pointer set to the children. Used to do topological sort.

Operator overloads

Each Python operator is overridden to produce a new Value and attach a _backward closure for that op. The pattern is unvarying:

def __mul__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    out = Value(self.data * other.data, (self, other), '*')

    def _backward():
        self.grad += other.data * out.grad
        other.grad += self.data * out.grad
    out._backward = _backward

    return out

Three lines of logic:

  1. Compute forward.
  2. Construct the output node with the right children.
  3. Define and attach the backward closure that uses the chain rule on this specific op.

The local derivatives of a*b with respect to a and b are b and a. The closure does self.grad += other.data * out.grad, which is "downstream gradient (out.grad) times local derivative (other.data)" — exactly the chain rule.

Same pattern for the other ops:

__add__

local derivatives 1 and 1

__pow__

local derivative c * x^(c-1)

relu

local derivative 1 if x > 0 else 0

The += is load-bearing

Every closure does self.grad += ..., not self.grad = .... Reason: a single Value might be used in multiple downstream expressions. For example:

a = Value(2.0)
b = a + a

Both children of b are the same node a. The _backward for + runs twice (one for each child), and a.grad needs to accumulate both contributions. If you used = instead of +=, you'd lose the first contribution. This is the multivariate chain rule: when a variable appears multiple times, its total gradient is the sum across appearances.

This is also the reason for optimizer.zero_grad() in PyTorch — gradients persist across backward calls and you have to explicitly clear them between training iterations.

backward()

def backward(self):
    topo = []
    visited = set()
    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                build_topo(child)
            topo.append(v)
    build_topo(self)

    self.grad = 1
    for v in reversed(topo):
        v._backward()

Two phases:

1

Topological sort

DFS from the output (the loss). Each node is added to topo after all its predecessors. This gives an ordering where every node appears before any of its consumers.

2

Reverse pass

Seed the loss's grad to 1 (d loss / d loss = 1). Iterate topo in reverse — for each node, run its _backward, which deposits gradients into the children. By the time we visit a node, all its consumers have already deposited their contributions into its grad.

After backward(), every leaf Value (the inputs / parameters) has the correct gradient.

From Value to Tensor

PyTorch's autograd is the same algorithm with two changes:

Change 1 — payload

Each node holds a tensor instead of a scalar. The data is an n-dimensional array; the grad is too.

Change 2 — closures

The _backward closures are written in terms of tensor ops. The backward for matmul(X, W) involves transposes and another matmul.

Otherwise it's the same DAG, the same topo sort, the same reverse iteration with += gradients. If you can read engine.py, you can read autograd's C++ source (well, slowly, but conceptually).

Related