Repo Pedagogical character-level tiny-shakespeare ~10M params

ng-video-lecture

The bridge from makemore to real GPT. The companion repo for lecture 7, "Let's build GPT: from scratch, in code, spelled out". Two files: bigram.py (the starting point) and gpt.py (the destination — a small but legitimate GPT, ~10M parameters, trained on Tiny Shakespeare).

What it is

Two self-contained training scripts:

bigram.py ~120 lines

Character-level bigram model. One embedding table maps each character to a logit distribution over the next character. Trains in seconds. Output is gibberish that has Shakespeare's character distribution.

gpt.py ~225 lines

Character-level GPT. Token embedding + position embedding + 6 transformer blocks with 6-head attention + final LayerNorm + lm_head. ~10M parameters. Trains in about 15 minutes on an A100. Output is Shakespeare-flavored text that looks coherent at a glance but is full of invented words.

Both train on input.txt, the Tiny Shakespeare corpus (~1MB).
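"Character-level" means the vocabulary is just the set of distinct characters in the corpus (65 for Tiny Shakespeare). A minimal sketch of that encoding follows, assuming a short stand-in string in place of input.txt; the helper names build_vocab, encode, and decode are illustrative and not quoted from the repo.

```python
def build_vocab(text):
    """Map each distinct character to an integer id, and back."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for i, ch in enumerate(chars)}
    return stoi, itos


def encode(s, stoi):
    """String -> list of integer ids."""
    return [stoi[c] for c in s]


def decode(ids, itos):
    """List of integer ids -> string."""
    return "".join(itos[i] for i in ids)


# Stand-in corpus (assumption); the real scripts read input.txt instead.
sample = "First Citizen:\nBefore we proceed any further, hear me speak."
stoi, itos = build_vocab(sample)
ids = encode("hear me", stoi)
print(len(stoi), ids, decode(ids, itos))
```

Both scripts feed integer ids like these to the model, so nothing downstream needs to know about text at all.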

Why this repo matters

In a typical "let's build GPT" tutorial, you copy the code, run it, and feel slightly smarter. Karpathy's version is different: the lecture takes 2 hours to develop gpt.py line by line, starting from bigram.py and adding one piece at a time, with the loss decreasing at each step to confirm the addition helped.

The progression in the lecture

Bigram baseline loss ~2.5 (vs ~4.2 random)
Add the "weighted bag of words" trick: multiply by a lower-triangular matrix whose row t holds 1/(t+1) in columns 0..t (a row-normalized torch.tril). This just averages each position with its past. It is crude, but it is the first time a position sees any context at all.
Replace the uniform weights with learned, content-dependent weights via Q and K. This is now real self-attention, in the form of a single head.
Add multi-head: parallel attention heads, concatenated.
Add the MLP (feedforward block).
Stack into 6 layers.
Add residual connections — loss drops sharply
Add LayerNorm, pre-norm placement.
Add dropout.
Scale up n_embd, n_head, n_layer, train longer.
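The averaging step near the top of this list can be checked numerically. The sketch below, on toy tensor sizes chosen for illustration, computes the same running mean three ways: an explicit loop, a row-normalized lower-triangular matmul, and a softmax over scores whose future positions are masked to -inf. Variable names are illustrative.

```python
import torch

torch.manual_seed(0)
B, T, C = 2, 4, 3                      # batch, time, channels (toy sizes)
x = torch.randn(B, T, C)

# Version 1: explicit running mean over positions 0..t.
xbow_loop = torch.zeros(B, T, C)
for t in range(T):
    xbow_loop[:, t] = x[:, : t + 1].mean(dim=1)

# Version 2: lower-triangular matrix, each row normalized to sum to 1,
# so row t holds 1/(t+1) in columns 0..t and zeros elsewhere.
tril = torch.tril(torch.ones(T, T))
avg = tril / tril.sum(dim=1, keepdim=True)
xbow_matmul = avg @ x                  # (T, T) @ (B, T, C) -> (B, T, C)

# Version 3: all-zero scores, future masked to -inf, then softmax.
scores = torch.zeros(T, T).masked_fill(tril == 0, float("-inf"))
wei = torch.softmax(scores, dim=-1)
xbow_softmax = wei @ x

print(torch.allclose(xbow_loop, xbow_matmul, atol=1e-6),
      torch.allclose(xbow_matmul, xbow_softmax, atol=1e-6))
```

The third form is the useful one: replacing the all-zero scores with query-key dot products turns the uniform average into learned, content-dependent weights, which is the next step in the list.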

By the end, the loss is around 1.0-1.5 and the output is what people quote as "look, GPT for kids":

ANGELO:
And cowards it be strawn to my bed,
And thrust the gates of my threats,
Because he that ale away, and hang'd
An one with him.

What's in `gpt.py`

The full transformer in 225 lines. Classes:

Head: One attention head: Q, K, V linears, causal mask, softmax, dropout, weighted V aggregation.
MultiHeadAttention: n_head parallel Heads, concat outputs, output projection, dropout.
FeedFoward: Two linears + ReLU + dropout. (The typo is in the original.)
Block: Pre-norm Block: x + attn(ln1(x)) then x + ffwd(ln2(x)).
GPTLanguageModel: Token + position embeddings, stack of Blocks, final LayerNorm, lm_head. Plus generate(), which samples each next character from the softmax distribution with torch.multinomial (stochastic sampling, not greedy argmax).
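As a rough illustration of the Head entry above, here is a sketch of one causal self-attention head. It is not a copy of the repo's class: the constructor takes n_embd, block_size, and dropout as arguments (an assumption made so the snippet runs standalone), and the sizes in the usage lines are toy values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Head(nn.Module):
    """One causal self-attention head (illustrative sketch)."""

    def __init__(self, n_embd, head_size, block_size, dropout=0.0):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Buffer, not a parameter: the causal mask is fixed and never trained.
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)                                   # (B, T, head_size)
        q = self.query(x)                                 # (B, T, head_size)
        # Scaled dot-product scores; transpose only the last two dims.
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5   # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)
        return wei @ self.value(x)                        # (B, T, head_size)


torch.manual_seed(0)
head = Head(n_embd=32, head_size=8, block_size=16).eval()
x = torch.randn(1, 10, 32)
out = head(x)

# Causality check: perturbing the last token must not change earlier outputs.
x2 = x.clone()
x2[:, -1] += 1.0
out2 = head(x2)
print(out.shape, torch.allclose(out[:, :-1], out2[:, :-1], atol=1e-6))
```

The causality check at the end is a handy sanity test for any hand-written attention: if editing a later token changes an earlier output, the mask is wrong.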

Hyperparameters

The file is configured for a GPU:

Hyperparameter	Value
batch_size	64
block_size	256
max_iters	5000
learning_rate	3e-4
n_embd	384
n_head	6
n_layer	6
dropout	0.2

In short: a 256-character context, ~10M parameters, and 5000 iterations of AdamW. Karpathy says this takes about 15 minutes on an A100.
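The "~10M" figure can be reproduced by hand from these hyperparameters. The arithmetic below is a sketch under stated assumptions about layer shapes: bias-free Q/K/V projections, a 4x expansion in the MLP, biases on the attention output projection, MLP layers, and lm_head, and no weight tying. The helper linear() is hypothetical and only counts weights.

```python
vocab_size, block_size = 65, 256
n_embd, n_head, n_layer = 384, 6, 6
head_size = n_embd // n_head


def linear(n_in, n_out, bias=True):
    """Parameter count of a dense layer (hypothetical helper)."""
    return n_in * n_out + (n_out if bias else 0)


# Token table plus learned position table.
embeddings = vocab_size * n_embd + block_size * n_embd
# Per block: n_head heads with bias-free Q, K, V, then an output projection.
attention = n_head * 3 * linear(n_embd, head_size, bias=False) + linear(n_embd, n_embd)
# Per block: MLP with an assumed 4x hidden expansion.
mlp = linear(n_embd, 4 * n_embd) + linear(4 * n_embd, n_embd)
# Per block: two LayerNorms, each with a scale and a shift vector.
layernorms = 2 * (2 * n_embd)
block = attention + mlp + layernorms

# Blocks, final LayerNorm, and the lm_head projection back to the vocabulary.
total = embeddings + n_layer * block + 2 * n_embd + linear(n_embd, vocab_size)
print(f"{total / 1e6:.2f}M parameters")   # 10.79M parameters
```

Nearly all of the total sits in the six blocks; the embeddings and lm_head are small because the vocabulary has only 65 entries.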

Differences from `nanoGPT`

gpt.py is a self-contained pedagogical version of nanoGPT. Differences:

Dimension	ng-video-lecture	nanoGPT
Tokenization	character-level (vocab 65)	character-level and BPE
Attention impl	manual (explicit `q @ k.transpose(-2, -1)` matmul, causal mask, softmax)	`F.scaled_dot_product_attention` (flash attention) when available
Multi-head layout	`nn.ModuleList([Head(head_size) for _ in range(num_heads)])` with separate Q, K, V Linears per head	All heads' Q/K/V fused into one Linear of size `3 * n_embd`
Head combination	Separate attention heads concatenated	Batched as an extra dimension
Data	Tiny Shakespeare in-memory	OpenWebText with sharded prep

The educational version trades efficiency for clarity. Once you understand gpt.py, reading nanoGPT/model.py is a fun exercise in spotting the optimizations.
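One of those optimizations, the "Multi-head layout" and "Head combination" rows of the table, can be sketched directly. The snippet below checks on toy sizes that looping over heads and concatenating gives the same result as projecting once and treating heads as an extra batch dimension. It uses three separate random weight matrices rather than one fused `3 * n_embd` Linear, purely for readability; all names are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, n_embd, n_head = 2, 5, 12, 3     # toy sizes
hs = n_embd // n_head
x = torch.randn(B, T, n_embd)

# One full-width weight per projection; head h owns rows h*hs:(h+1)*hs.
Wq, Wk, Wv = (torch.randn(n_embd, n_embd) for _ in range(3))
mask = torch.tril(torch.ones(T, T)) == 0


def attend(q, k, v):
    """Causal scaled dot-product attention over the last two dims."""
    wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
    wei = wei.masked_fill(mask, float("-inf"))
    return F.softmax(wei, dim=-1) @ v


def head_slice(W, h):
    """Weights belonging to head h, transposed for right-multiplication."""
    return W[h * hs:(h + 1) * hs].T


# Lecture style: one head at a time, concatenated along the channel dim.
per_head = torch.cat(
    [attend(x @ head_slice(Wq, h), x @ head_slice(Wk, h), x @ head_slice(Wv, h))
     for h in range(n_head)],
    dim=-1,
)


def split(t):
    """(B, T, n_embd) -> (B, n_head, T, hs)."""
    return t.view(B, T, n_head, hs).transpose(1, 2)


# Batched style: project once, heads become an extra batch dimension.
batched = attend(split(x @ Wq.T), split(x @ Wk.T), split(x @ Wv.T))
batched = batched.transpose(1, 2).contiguous().view(B, T, n_embd)

print(torch.allclose(per_head, batched, atol=1e-4))
```

The two paths compute the same numbers; the batched one simply replaces a Python loop over small matmuls with a few large ones.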

What's in `bigram.py`

The minimum-viable LM. One nn.Embedding(vocab_size, vocab_size) directly outputs next-token logits when queried with the current token. Training loop is 30 lines. Loss converges to about 2.5 (vs uniform 4.2). Output is character soup with Shakespeare's character frequency.
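A minimal sketch of that idea, assuming a vocabulary of 65 characters and random stand-in batches instead of real data. The class name BigramLM is illustrative, not the name used in the repo. The last lines compute the uniform-guess baseline, ln(65), which is where the 4.2 figure comes from.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class BigramLM(nn.Module):
    """Next-character logits read straight out of an embedding table."""

    def __init__(self, vocab_size):
        super().__init__()
        # Row i holds the logits for "which character follows character i".
        self.table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.table(idx)                      # (B, T, vocab_size)
        if targets is None:
            return logits, None
        B, T, V = logits.shape
        loss = F.cross_entropy(logits.view(B * T, V), targets.view(B * T))
        return logits, loss


torch.manual_seed(0)
vocab_size = 65
model = BigramLM(vocab_size)
idx = torch.randint(vocab_size, (4, 8))        # random stand-in batch
targets = torch.randint(vocab_size, (4, 8))
logits, loss = model(idx, targets)

# A model that spreads probability evenly over 65 characters scores ln(65).
uniform_loss = math.log(vocab_size)
print(logits.shape, round(uniform_loss, 2))
```

Because each row of the table is looked up independently, the model cannot use anything but the current character, which is exactly the limitation the attention layers in gpt.py remove.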

The point isn't that bigram is good; the point is that bigram is a working LM that you can extend incrementally to GPT. Each addition has to help the loss to justify itself.

Related

zero-to-hero-arc: the lecture
attention: the central concept being built
transformer-block: the result
repos/nanoGPT: the production sibling
