micrograd
Role in the lineage: the foundation. Karpathy's smallest educational repo, ~150 lines total, implementing reverse-mode autodiff over scalar values plus a tiny MLP library on top. Everything else in the Zero-to-Hero corpus is built on the same intuitions developed here.
What it is
From the README:
A tiny Autograd engine (with a bite! :)). Implements backpropagation (reverse-mode autodiff) over a dynamically built DAG and a small neural networks library on top of it with a PyTorch-like API. Both are tiny, with about 100 and 50 lines of code respectively.
Two files of substance:
- engine.py: the Value class and its operators.
- nn.py: Neuron, Layer, MLP.
Plus a demo notebook (demo.ipynb) that trains a 2-layer MLP to do binary classification on the moons dataset, and a graphviz visualization notebook.
Scalar-only
The defining limitation: every node in the computation graph holds a single scalar, not a tensor. So a neuron with 2 inputs has 2 weight Values and 1 bias Value, and the forward pass is:
    class Neuron(Module):

        def __init__(self, nin, nonlin=True):
            self.w = [Value(random.uniform(-1,1)) for _ in range(nin)]
            self.b = Value(0)
            self.nonlin = nonlin

        def __call__(self, x):
            act = sum((wi*xi for wi,xi in zip(self.w, x)), self.b)
            return act.relu() if self.nonlin else act
sum((wi*xi for wi,xi in zip(...)), self.b) builds a tree of + and * operations, each of which creates a new Value with an attached _backward closure. By the time you get to .relu(), you have a small DAG. After computing a loss and calling loss.backward(), gradients flow through the DAG and end up in w[0].grad, w[1].grad, b.grad.
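To make the closure mechanism concrete, here is a minimal sketch of a scalar Value with only +, * and relu. It follows the shape described above but is a simplified illustration, not the repo's exact code; the example weights and inputs are made up.

```python
class Value:
    """Scalar graph node (simplified sketch, not micrograd's exact source)."""

    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)      # nodes this value was computed from
        self._backward = lambda: None    # leaves have nothing to propagate

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # d(out)/d(self) = d(out)/d(other) = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))

        def _backward():
            # d(out)/d(self) = other.data, and vice versa
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def relu(self):
        out = Value(max(0.0, self.data), (self,))

        def _backward():
            self.grad += (1.0 if out.data > 0 else 0.0) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the DAG, then run closures from output to inputs.
        topo, visited = [], set()

        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()


# A 2-input neuron with hand-picked (illustrative) weights.
w = [Value(0.5), Value(-1.0)]
b = Value(0.25)
x = [2.0, 0.5]
out = sum((wi * xi for wi, xi in zip(w, x)), b).relu()
out.backward()
print(out.data, w[0].grad, w[1].grad, b.grad)
```

Because the pre-activation is positive here, the ReLU passes the gradient through unchanged, so each weight's gradient is just its input and the bias gradient is 1.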
It's deliberately the slow version of the right algorithm. A tensor-based version is identical in structure but uses tensor ops in the closures: a layer's n_in * n_out scalar multiplies collapse into a single matmul node. PyTorch's autograd is "this, but with tensors and a C++ engine."
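As a rough illustration of "same structure, bigger nodes", the sketch below records a whole matrix-vector product as one node with one backward step. MatVec is a hypothetical name invented for this example, and plain Python lists stand in for tensors; it is not how PyTorch implements the op.

```python
class MatVec:
    """One graph node for y = W @ x (hypothetical helper; lists stand in for tensors)."""

    def __init__(self, W, x):
        self.W, self.x = W, x
        # Forward: n_out * n_in multiplies, but recorded as a single node.
        self.y = [sum(wij * xj for wij, xj in zip(row, x)) for row in W]
        self.W_grad = [[0.0] * len(x) for _ in W]
        self.x_grad = [0.0] * len(x)

    def backward(self, y_grad):
        # dL/dW[i][j] = y_grad[i] * x[j]             (outer product)
        # dL/dx[j]    = sum_i W[i][j] * y_grad[i]    (W transposed times y_grad)
        for i, row in enumerate(self.W):
            for j, wij in enumerate(row):
                self.W_grad[i][j] += y_grad[i] * self.x[j]
                self.x_grad[j] += wij * y_grad[i]


W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # n_out = 3, n_in = 2
x = [1.0, -1.0]
node = MatVec(W, x)
node.backward([1.0, 1.0, 1.0])             # pretend upstream gradient of all ones
print(node.y, node.W_grad, node.x_grad)
```

The scalar engine would have created six multiply nodes and their adds for this layer; here the same arithmetic sits behind a single node and a single backward call.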
What you learn from it
- The autograd algorithm is small: the whole engine is about 100 lines. Once you've read it, the idea that loss.backward() computes gradients stops being magic.
- The += in closures and the topological sort are the non-obvious parts. Both come from the multivariate chain rule.
- PyTorch's API is a direct generalization. Module.parameters(), zero_grad(), the optimizer pattern: micrograd's nn.py mimics PyTorch's API so closely that you can switch between them mentally without effort.
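The link between += and the multivariate chain rule can be written out. Suppose a value a feeds into k downstream nodes u_1, ..., u_k on the way to the loss L (these symbols are introduced here for illustration only):

```latex
\frac{\partial L}{\partial a}
  = \sum_{i=1}^{k} \frac{\partial L}{\partial u_i}\,\frac{\partial u_i}{\partial a}
```

Each _backward closure contributes exactly one term of this sum, so it has to add to a.grad rather than assign to it. Each term also needs a finished value of \partial L / \partial u_i, which is what visiting nodes in reverse topological order guarantees. A quick check: for L = a \cdot a + a the sum gives 2a + 1; with = in place of +=, later contributions would overwrite earlier ones and the result would be wrong.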
In the lecture
Lecture 1, "The spelled-out intro to neural networks and backpropagation: building micrograd," is over 2 hours of Karpathy building micrograd from scratch in a Jupyter notebook. He starts by deriving partial derivatives by hand, then writes the Value class operator by operator, then builds the MLP, then trains it.
Anyone serious about understanding neural networks should watch this lecture once. It's the only place where backprop is built so slowly that there's nowhere for the magic to hide.
What it doesn't have
- No batching: each example is forwarded through the MLP individually.
- No GPU: CPU scalar math, slow.
- No real models: the demo is a 2-layer MLP on a 2D toy dataset.
- No tensor ops: you can't write Conv2d or self-attention in this framework. Well, you can, but every multiplication is its own Value, and it would take days.

That's by design. micrograd isn't useful as a deep learning library. It's useful as a complete, minimal artifact you can read in one sitting.
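To put the tensor-ops limitation in numbers, the sketch below counts the scalar multiplications in one forward pass of a single convolution layer on a single image. The function name and the layer shape are illustrative assumptions, not something taken from micrograd; in a scalar engine each of these multiplies would become its own Value, with a comparable number of adds on top.

```python
def conv2d_scalar_multiplies(c_in, c_out, k, h_out, w_out):
    """Multiplies for one k x k convolution on one image.

    Stride and padding only matter through the output size (h_out, w_out).
    """
    return c_out * h_out * w_out * c_in * k * k


# Illustrative shape: 3 -> 16 channels, 3x3 kernel, 32x32 output.
n = conv2d_scalar_multiplies(c_in=3, c_out=16, k=3, h_out=32, w_out=32)
print(n)  # 442368
```

That is for one small layer and one example, before the backward pass visits every node again.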
Related
- value-class: detailed walkthrough of the Value class
- backpropagation: the algorithm
- zero-to-hero-arc: the lecture that builds this