Repo Pedagogical character-level tiny-shakespeare ~10M params

ng-video-lecture

The bridge from makemore to real GPT. The companion repo for lecture 7, "Let's build GPT: from scratch, in code, spelled out". Two files: bigram.py (the starting point) and gpt.py (the destination — a small but legitimate GPT, ~10M parameters, trained on Tiny Shakespeare).

What it is

Two self-contained training scripts:

bigram.py ~120 lines

Character-level bigram model. One embedding table maps each character to a vector of logits over the next character. Trains in seconds. Output is gibberish that has Shakespeare's character distribution.

gpt.py ~225 lines

Character-level GPT. Token embedding + position embedding + 6 transformer blocks with 6-head attention + final LayerNorm + lm_head. ~10M parameters. Trains in about 15 minutes on an A100 GPU. Output is Shakespeare-flavored, coherent-looking text with mostly invented words.

Both train on input.txt, the Tiny Shakespeare corpus (~1MB).
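"Character-level" means the vocabulary is just the set of distinct characters in the corpus. A minimal sketch of that kind of tokenizer follows; the names `stoi`, `itos`, `encode` and `decode` are illustrative choices, and a short stand-in string replaces input.txt so the snippet runs on its own.

```python
# Stand-in text; the real scripts would read the whole of input.txt instead.
text = "First Citizen:\nBefore we proceed any further, hear me speak."

# The vocabulary is every distinct character, sorted so ids are deterministic.
chars = sorted(set(text))
vocab_size = len(chars)

stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer id
itos = {i: ch for ch, i in stoi.items()}      # integer id -> character


def encode(s):
    """Turn a string into a list of integer ids, one per character."""
    return [stoi[c] for c in s]


def decode(ids):
    """Turn a list of integer ids back into a string."""
    return "".join(itos[i] for i in ids)


ids = encode("hear me")
roundtrip = decode(ids)
print(vocab_size, ids, repr(roundtrip))
```

On the full Tiny Shakespeare file the same construction yields the 65-character vocabulary mentioned later on this page.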

Why this repo matters

In a typical "let's build GPT" tutorial, you copy the code, run it, and feel slightly smarter. Karpathy's version is different: the lecture takes 2 hours to develop gpt.py line by line, starting from bigram.py and adding one piece at a time, with the loss decreasing at each step to confirm the addition helped.

The progression in the lecture

  1. Bigram baseline: loss ~2.5 (vs ~4.2 for uniform random guessing).
  2. Add a single attention head, starting from the "weighted bag of words" trick: build a lower-triangular matrix whose row t holds 1/(t+1) in its first t+1 columns, then matmul. This is just averaging the past; it improves loss because some context is better than none.
  3. Replace the uniform weights with learned, content-dependent weights via Q and K. This is now real self-attention.
  4. Add multi-head: parallel attention heads, concatenated.
  5. Add the MLP (feedforward block).
  6. Stack into 6 layers.
  7. Add residual connections; loss drops sharply.
  8. Add LayerNorm, pre-norm placement.
  9. Add dropout.
  10. Scale up n_embd, n_head, n_layer, train longer.
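Steps 2 and 3 above hinge on one identity: averaging the past can be written as a matmul with a row-normalized lower-triangular matrix, and that matrix is also what a causally masked softmax over all-zero scores produces. The sketch below checks this on toy tensors; the sizes and variable names are illustrative, not taken from the repo.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C = 2, 4, 3  # batch, time, channels (toy sizes)
x = torch.randn(B, T, C)

# Version 1: explicit loop. Position t receives the mean of x[0..t].
xbow_loop = torch.zeros(B, T, C)
for b in range(B):
    for t in range(T):
        xbow_loop[b, t] = x[b, : t + 1].mean(dim=0)

# Version 2: lower-triangular matrix whose row t holds 1/(t+1) in columns 0..t.
tril = torch.tril(torch.ones(T, T))
wei = tril / tril.sum(dim=1, keepdim=True)
xbow_matmul = wei @ x  # (T, T) @ (B, T, C) broadcasts to (B, T, C)

# Version 3: mask the future with -inf, then softmax over all-zero scores.
scores = torch.zeros(T, T).masked_fill(tril == 0, float("-inf"))
wei_softmax = F.softmax(scores, dim=-1)
xbow_softmax = wei_softmax @ x

print(torch.allclose(xbow_loop, xbow_matmul, atol=1e-6))
print(torch.allclose(xbow_loop, xbow_softmax, atol=1e-6))
```

Version 3 is the useful form: replacing the all-zero `scores` with query-key dot products turns uniform averaging into learned, content-dependent weighting, which is the move step 3 describes.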

By the end, the loss is around 1.0-1.5 and the output is what people quote as "look, GPT for kids":

ANGELO:
And cowards it be strawn to my bed,
And thrust the gates of my threats,
Because he that ale away, and hang'd
An one with him.

What's in gpt.py

The full transformer in 225 lines. Classes:

Head
One attention head: Q, K, V linears, causal mask, softmax, dropout, weighted V aggregation.
MultiHeadAttention
n_head parallel Heads, concat outputs, output projection, dropout.
FeedFoward
Two linears + ReLU + dropout. (The typo is in the original.)
Block
Pre-norm Block: x + attn(ln1(x)) then x + ffwd(ln2(x)).
GPTLanguageModel
Token + position embeddings, stack of Blocks, final LayerNorm, lm_head. Plus generate(), which samples each next character from the softmax distribution with torch.multinomial (stochastic sampling, not greedy argmax decoding).
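To make the Head description concrete, here is a minimal sketch of a single causal self-attention head written from the list above. It is an illustration rather than a copy of gpt.py: the constructor arguments, the bias-free projections and the toy sizes are assumptions made for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Head(nn.Module):
    """One causal self-attention head (illustrative sketch)."""

    def __init__(self, n_embd, head_size, block_size, dropout=0.0):
        super().__init__()
        # Assumption: projections without bias.
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # A buffer, not a parameter: the causal mask is fixed, not learned.
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)
        # Scaled dot-product scores between every pair of positions.
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # (B, T, T)
        # Causal mask: position t may only attend to positions <= t.
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)
        v = self.value(x)  # (B, T, head_size)
        return wei @ v     # weighted aggregation of the values


torch.manual_seed(0)
head = Head(n_embd=32, head_size=8, block_size=16).eval()
x = torch.randn(1, 10, 32)
out = head(x)

# Causality check: changing the last token must not change earlier outputs.
x2 = x.clone()
x2[0, -1] = torch.randn(32)
out2 = head(x2)
print(out.shape, torch.allclose(out[0, :-1], out2[0, :-1], atol=1e-6))
```

MultiHeadAttention, as described above, would run several such heads in parallel, concatenate their outputs along the last dimension and apply an output projection.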

Hyperparameters

The file is configured for a GPU:

batch_size      64
block_size      256
max_iters       5000
learning_rate   3e-4
n_embd          384
n_head          6
n_layer         6
dropout         0.2

block_size=256 characters, ~10M params, 5000 iterations of AdamW. Karpathy says this takes about 15 minutes on an A100.
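The "~10M params" figure can be sanity-checked with plain arithmetic from the hyperparameters above. The sketch below assumes a vocabulary of 65 characters, bias-free key/query/value projections, a feedforward hidden width of 4 * n_embd, and biases on every other Linear; those layer details are assumptions of this example, not facts stated on this page.

```python
vocab_size = 65  # distinct characters in Tiny Shakespeare
block_size = 256
n_embd = 384
n_head = 6
n_layer = 6
head_size = n_embd // n_head  # 64


def linear(n_in, n_out, bias=True):
    """Parameter count of a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


# Token embedding table plus position embedding table.
embeddings = vocab_size * n_embd + block_size * n_embd

# One transformer block (assumed layout, see the paragraph above).
attention = n_head * 3 * linear(n_embd, head_size, bias=False) + linear(n_embd, n_embd)
feedforward = linear(n_embd, 4 * n_embd) + linear(4 * n_embd, n_embd)
layernorms = 2 * (2 * n_embd)  # two LayerNorms, each with weight and bias
block = attention + feedforward + layernorms

# Final LayerNorm plus the lm_head projection back to the vocabulary.
final = 2 * n_embd + linear(n_embd, vocab_size)

total = embeddings + n_layer * block + final
print(f"{total / 1e6:.2f}M parameters")  # prints "10.79M parameters"
```

Almost all of the budget sits in the six blocks, with the feedforward layers accounting for about two thirds of each block; the embeddings and lm_head are small because the vocabulary is only 65 characters.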

Differences from nanoGPT

gpt.py is a self-contained pedagogical version of nanoGPT. Differences:

Dimension | ng-video-lecture | nanoGPT
Tokenization | character-level (vocab 65) | character-level and BPE
Attention impl | manual (explicit q @ k.transpose(-2, -1) matmul, mask, softmax) | F.scaled_dot_product_attention (flash attention) when available
Multi-head layout | nn.ModuleList([Head(head_size) for _ in range(num_heads)]), with separate key/query/value Linears per head | all heads' Q/K/V fused into one Linear with output size 3 * n_embd
Head combination | separate attention heads, outputs concatenated | heads batched as an extra tensor dimension
Data | Tiny Shakespeare, in memory | OpenWebText with sharded prep

The educational version trades efficiency for clarity. Once you understand gpt.py, reading nanoGPT/model.py is a fun exercise in spotting the optimizations.
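One of those optimizations is easy to check directly: the manual masked-softmax attention and PyTorch's fused `F.scaled_dot_product_attention` with `is_causal=True` should produce the same numbers. The sketch below assumes PyTorch 2.0 or newer (where that function exists) and uses toy tensor sizes; heads are already laid out as an extra batch dimension, as in the nanoGPT column of the table.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, n_head, T, head_size = 2, 6, 8, 16  # toy sizes
q = torch.randn(B, n_head, T, head_size)
k = torch.randn(B, n_head, T, head_size)
v = torch.randn(B, n_head, T, head_size)

# Manual path: explicit scores, causal mask, softmax, weighted sum of values.
wei = q @ k.transpose(-2, -1) * head_size ** -0.5  # (B, n_head, T, T)
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
wei = wei.masked_fill(~mask, float("-inf"))
manual = F.softmax(wei, dim=-1) @ v

# Fused path: one call; the default scale is 1/sqrt(head_size).
fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

same = torch.allclose(manual, fused, atol=1e-5)
print(manual.shape, same)
```

The fused call avoids materializing the full (T, T) attention matrix when a flash-attention kernel is available, which matters far more at long context lengths than at these toy sizes.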

What's in bigram.py

The minimum-viable LM. One nn.Embedding(vocab_size, vocab_size) directly outputs next-token logits when queried with the current token. Training loop is 30 lines. Loss converges to about 2.5 (vs uniform 4.2). Output is character soup with Shakespeare's character frequency.
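A minimal sketch of that idea follows: the embedding table is the whole model, and the "uniform 4.2" reference point is ln(65). The class and attribute names here are illustrative and may differ from bigram.py; random token ids stand in for real data.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class BigramLanguageModel(nn.Module):
    """Next-character logits come straight from a lookup table."""

    def __init__(self, vocab_size):
        super().__init__()
        # Row i holds the logits for "which character follows character i".
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)  # (B, T, vocab_size)
        if targets is None:
            return logits, None
        B, T, C = logits.shape
        # cross_entropy expects (N, C) logits and (N,) integer targets.
        loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss


torch.manual_seed(0)
vocab_size = 65
model = BigramLanguageModel(vocab_size)
idx = torch.randint(vocab_size, (4, 8))      # random stand-in for a data batch
targets = torch.randint(vocab_size, (4, 8))
logits, loss = model(idx, targets)

# Loss of a model that assigns equal probability to all 65 characters.
uniform_loss = math.log(vocab_size)
zero_logit_loss = F.cross_entropy(torch.zeros(1, vocab_size), torch.tensor([0]))
print(logits.shape, round(uniform_loss, 2), round(zero_logit_loss.item(), 2))
```

A freshly initialized table usually starts somewhat above the uniform figure, because its random logits are confidently wrong rather than flat; training is what pulls the loss down toward the ~2.5 quoted above.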

The point isn't that bigram is good; the point is that bigram is a working LM that you can extend incrementally to GPT. Each addition has to help the loss to justify itself.
