Repo · Zero-to-Hero lineage

micrograd

Role in the lineage: the foundation. Karpathy's smallest educational repo, ~150 lines total, implementing reverse-mode autodiff over scalar values plus a tiny MLP library on top. Everything else in the Zero-to-Hero corpus is built on the same intuitions developed here.

Total size: ~150 lines
Engine: ~100 lines (reverse-mode autodiff)
NN library: ~50 lines (PyTorch-like API)
Granularity: Scalar-only — one Value per number

What it is

From the README:

A tiny Autograd engine (with a bite! :)). Implements backpropagation (reverse-mode autodiff) over a dynamically built DAG and a small neural networks library on top of it with a PyTorch-like API. Both are tiny, with about 100 and 50 lines of code respectively.

Two files of substance:

micrograd/engine.py (~100 lines): the Value class and the autograd engine.

micrograd/nn.py (~50 lines): Neuron, Layer, MLP.

Plus a demo notebook (demo.ipynb) that trains a 2-layer MLP to do binary classification on the moons dataset, and a graphviz visualization notebook.

Scalar-only

The defining limitation: every node in the computation graph holds a single scalar, not a tensor. So a neuron with 2 inputs has 2 weight Values and 1 bias Value, and the forward pass is:

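import random
from micrograd.engine import Value

# Module is the small base class defined alongside Neuron in micrograd/nn.py.
class Neuron(Module):
    def __init__(self, nin, nonlin=True):
        # One Value per weight, plus one Value for the bias.
        self.w = [Value(random.uniform(-1,1)) for _ in range(nin)]
        self.b = Value(0)
        self.nonlin = nonlin

    def __call__(self, x):
        # Weighted sum of the inputs; the sum starts from the bias.
        act = sum((wi*xi for wi,xi in zip(self.w, x)), self.b)
        return act.relu() if self.nonlin else act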

sum((wi*xi for wi,xi in zip(...)), self.b) builds a tree of + and * operations, each of which creates a new Value with an attached _backward closure. By the time you get to .relu(), you have a small DAG. After computing a loss and calling loss.backward(), gradients flow through the DAG and end up in w[0].grad, w[1].grad, b.grad.
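The mechanism described above can be sketched in a few dozen lines. This is a simplified illustration, not micrograd's actual source: it supports only +, * and relu, and the weights and inputs are made-up numbers chosen so the arithmetic is easy to follow.

```python
class Value:
    """Stripped-down scalar autograd node (illustrative sketch only)."""

    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None   # leaf nodes have nothing to propagate
        self._prev = set(_children)

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # d(out)/d(self) = d(out)/d(other) = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))

        def _backward():
            # d(out)/d(self) = other.data, and vice versa
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def relu(self):
        out = Value(max(0.0, self.data), (self,))

        def _backward():
            # Gradient passes through only where the output is positive.
            self.grad += (out.data > 0) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Order nodes so each one appears after all of its inputs.
        topo, visited = [], set()

        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)

        self.grad = 1.0
        for v in reversed(topo):
            v._backward()


# A 2-input neuron with hand-picked (hypothetical) weights and inputs.
w = [Value(0.5), Value(-1.0)]
b = Value(0.25)
x = [2.0, -3.0]

act = sum((wi * xi for wi, xi in zip(w, x)), b)
out = act.relu()
out.backward()

print(out.data, w[0].grad, w[1].grad, b.grad)
```

Because the pre-activation is positive here, relu passes the gradient straight through, so each weight's gradient is just its input and the bias gradient is 1.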

It's deliberately the slow version of the right algorithm. A tensor-based version has the same structure but uses tensor ops in the closures: a single matrix multiply replaces the n_in * n_out scalar multiplies of a layer. PyTorch's autograd is "this, but with tensors and a C++ engine."
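To make the scalar-versus-tensor comparison concrete, here is a rough sketch using NumPy (my choice for illustration; micrograd itself does not use it). It computes one layer both ways and writes out the backward pass for the tensor version by hand. The layer sizes, the variable names, and the loss (the sum of the outputs) are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 3, 2
W = rng.uniform(-1, 1, size=(n_out, n_in))  # one row of weights per neuron
x = rng.uniform(-1, 1, size=n_in)
b = np.zeros(n_out)

# Scalar view: n_in * n_out individual multiplies, as in micrograd.
z_scalar = [sum(W[j, i] * x[i] for i in range(n_in)) + b[j]
            for j in range(n_out)]

# Tensor view: the same numbers from a single matrix-vector product.
z = W @ x + b
out = np.maximum(z, 0.0)  # relu

# Hand-written backward pass for loss = out.sum() (an assumed toy loss).
dout = np.ones(n_out)
dz = dout * (z > 0)       # relu gate
dW = np.outer(dz, x)      # dW[j, i] = dz[j] * x[i]
db = dz
dx = W.T @ dz

print(np.allclose(z_scalar, z))
```

Each entry of dW is exactly what the corresponding scalar multiply's closure would have accumulated into its weight; the tensor version computes them all in one outer product.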

What you learn from it

The autograd algorithm is small. Less code than a typical sorting algorithm. Once you've read it, the concept "loss.backward() computes gradients" stops being magic.
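The += in closures and the topological sort are the non-obvious parts. Both come from the multivariate chain rule: the += accumulates one contribution for every place a Value is used, and the topological sort ensures a node's gradient is complete before it is propagated further back.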
PyTorch's API is a direct generalization. Module.parameters(), zero_grad(), the optimizer pattern — micrograd's nn.py mimics PyTorch's API so closely that you can switch between them mentally without effort.
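The chain-rule origin of the += can be written out explicitly. The notation is introduced here for illustration: L is the loss, a is a Value, and u_1, ..., u_k are the nodes that take a as an input.

```latex
\frac{\partial L}{\partial a}
  = \sum_{i=1}^{k} \frac{\partial L}{\partial u_i}\,\frac{\partial u_i}{\partial a}
```

Each _backward closure contributes one term of this sum, which is why it must add to a.grad rather than overwrite it. For example, with c = a * a both operands are the same Value, so the two contributions a and a must sum to 2a; plain assignment would leave only a. The topological sort guarantees that every \partial L / \partial u_i is final before the closure that reads it runs.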

In the lecture

Lecture 1, "The spelled-out intro to neural networks and backpropagation: building micrograd," is over 2 hours of Karpathy building micrograd from scratch in a Jupyter notebook. He starts by deriving partial derivatives by hand, then writes the Value class operator by operator, then builds the MLP, then trains it.

Anyone serious about understanding neural networks should watch this lecture once. It's the only place where backprop is built so slowly that there's nowhere for the magic to hide.

What it doesn't have

The omissions below are by design. micrograd isn't useful as a deep learning library. It's useful as a complete, minimal artifact you can read in one sitting.

No batching

Each example is forwarded through the MLP individually.

No GPU

All computation is scalar math in Python on the CPU, which is slow.

No real models

The demo is a 2-layer MLP on a 2D toy dataset.

No tensor ops

You can't write Conv2d or self-attention in this framework — well, you can, but every multiplication is its own Value, and it would take days.
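A back-of-the-envelope count shows the scale of the problem. The helper below is hypothetical, and the layer shape (3 input channels, 64 output channels, 3x3 kernel, 224x224 output) is an illustrative assumption rather than anything taken from micrograd.

```python
def conv2d_scalar_multiplies(out_h, out_w, c_in, c_out, k):
    """Count the scalar multiplies in one Conv2d forward pass.

    Every output element is a dot product over c_in * k * k inputs.
    In a scalar-only engine each multiply would create its own Value,
    and the additions would create roughly as many again.
    """
    return out_h * out_w * c_out * c_in * k * k


# Assumed example layer: 3 -> 64 channels, 3x3 kernel, 224x224 output.
n_mul = conv2d_scalar_multiplies(224, 224, 3, 64, 3)
print(f"{n_mul:,} multiply Values for one image, one layer")
```

That is on the order of tens of millions of Python objects, each with its own closure, for a single forward pass of a single layer.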

Related

value-class: detailed walkthrough of the Value class
backpropagation: the algorithm
zero-to-hero-arc: the lecture that builds this
