Primitive · micrograd

`Value`: the micrograd autograd primitive

Value is the entire computation graph data structure in micrograd. It holds one scalar number, knows the operation that produced it, and carries a closure that knows how to propagate gradients to its inputs. Reading the ~100 lines of engine.py is the single best way to understand what PyTorch's autograd is doing under the hood — same data structure, just with tensor-shaped nodes instead of scalar ones.

The data

class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data            # the scalar value
        self.grad = 0               # the gradient (will be filled in during backward)
        self._backward = lambda: nil  # closure to propagate gradients to children
        self._prev = set(_children)   # the inputs to the op that produced this node
        self._op = _op                # the op name, for graphviz / debugging

Four pieces of state:

Field	Role
`data`	the actual scalar value (forward result).
`grad`	the gradient of the final loss with respect to this node. Filled in during `backward()`.
`_backward`	a closure (function with no args) that knows how to take this node's `grad` and add contributions to the `grad`s of its children. Built when the op runs.
`_prev`	pointer set to the children. Used to do topological sort.

Operator overloads

Each Python operator is overridden to produce a new Value and attach a _backward closure for that op. The pattern is unvarying:

def __mul__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    out = Value(self.data * other.data, (self, other), '*')

    def _backward():
        self.grad += other.data * out.grad
        other.grad += self.data * out.grad
    out._backward = _backward

    return out

Three lines of logic:

Compute forward.
Construct the output node with the right children.
Define and attach the backward closure that uses the chain rule on this specific op.

The local derivatives of a*b with respect to a and b are b and a. The closure does self.grad += other.data * out.grad, which is "downstream gradient (out.grad) times local derivative (other.data)" — exactly the chain rule.

Same pattern for the other ops:

add

local derivatives 1 and 1

pow

local derivative c * x^(c-1)

relu

local derivative 1 if x > 0 else 0

The `+=` is load-bearing

Every closure does self.grad += ..., not self.grad = .... Reason: a single Value might be used in multiple downstream expressions. For example:

a = Value(2.0)
b = a + a

Both children of b are the same node a. The _backward for + runs twice (one for each child), and a.grad needs to accumulate both contributions. If you used = instead of +=, you'd lose the first contribution. This is the multivariate chain rule: when a variable appears multiple times, its total gradient is the sum across appearances.

This is also the reason for optimizer.zero_grad() in PyTorch — gradients persist across backward calls and you have to explicitly clear them between training iterations.

`backward()`

def backward(self):
    topo = []
    visited = set()
    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                build_topo(child)
            topo.append(v)
    build_topo(self)

    self.grad = 1
    for v in reversed(topo):
        v._backward()

Two phases:

Topological sort

DFS from the output (the loss). Each node is added to topo after all its predecessors. This gives an ordering where every node appears before any of its consumers.

Reverse pass

Seed the loss's grad to 1 (d loss / d loss = 1). Iterate topo in reverse — for each node, run its _backward, which deposits gradients into the children. By the time we visit a node, all its consumers have already deposited their contributions into its grad.

After backward(), every leaf Value (the inputs / parameters) has the correct gradient.

From `Value` to `Tensor`

PyTorch's autograd is the same algorithm with two changes:

Change 1 — payload

Each node holds a tensor instead of a scalar. The data is an n-dimensional array; the grad is too.

Change 2 — closures

The _backward closures are written in terms of tensor ops. The backward for matmul(X, W) involves transposes and another matmul.

Otherwise it's the same DAG, the same topo sort, the same reverse iteration with += gradients. If you can read engine.py, you can read autograd's C++ source (well, slowly, but conceptually).

backpropagation — the algorithm Value implements
repos/micrograd — the full ~100-line library
zero-to-hero-arc — lecture 1 walks through Value line by line

Value: the micrograd autograd primitive

The data

Operator overloads

__add__

__pow__

relu

The += is load-bearing

backward()

Topological sort

Reverse pass

From Value to Tensor

Change 1 — payload

Change 2 — closures

Related

`Value`: the micrograd autograd primitive

add

pow

The `+=` is load-bearing

`backward()`

From `Value` to `Tensor`