Training mechanics

Gradient Accumulation

Gradient accumulation is the trick that lets you train with a large effective batch size on hardware that can't fit a large batch in memory. The GPT-3 paper trained its largest model with a batch size of 3.2M tokens; an A100 with 40GB of memory can fit maybe 16k tokens at a time for a 1.5B model. Accumulation closes that gap.

The pattern

# from build-nanogpt/train_gpt2.py
total_batch_size = 524288   # 2**19, ~0.5M tokens
B = 64                       # micro batch size
T = 1024                     # sequence length
grad_accum_steps = total_batch_size // (B * T * ddp_world_size)

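model.train()
optimizer.zero_grad()        # start the big step with .grad == 0 everywhere
loss_accum = 0.0             # running mean loss, kept for logging only
for micro_step in range(grad_accum_steps):
    x, y = train_loader.next_batch()
    x, y = x.to(device), y.to(device)
    with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
        logits, loss = model(x, y)
    loss = loss / grad_accum_steps   # turn the SUM over microbatches into a MEAN
    loss_accum += loss.detach()
    loss.backward()                  # adds this microbatch's gradient into .grad
norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip the accumulated gradient
optimizer.step()             # one update per grad_accum_steps microbatches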

What's happening: each microbatch does one forward pass and one backward pass. loss.backward() adds to .grad on every parameter; it does not overwrite. So after grad_accum_steps microbatches, the value held in each parameter's .grad is the sum of the per-microbatch gradients.

zero_grad: .grad on every parameter is reset to 0
micro 1: forward → backward → .grad += g₁
micro 2: forward → backward → .grad += g₂
…: repeat for grad_accum_steps microbatches
micro N: forward → backward → .grad += gₙ
clip: clip_grad_norm_(..., 1.0) on the accumulated .grad
step: one optimizer.step() applies the summed gradient

Result: .grad = g₁ + g₂ + … + gₙ → one update

A single optimizer.step() then applies the accumulated gradient, and optimizer.zero_grad() resets for the next big step.

The loss / grad_accum_steps divisor

This is the easy thing to get wrong, and Karpathy explicitly comments on it in the source:

we have to scale the loss to account for gradient accumulation, because the gradients just add on each successive backward(). addition of gradients corresponds to a SUM in the objective, but instead of a SUM we want MEAN.

What summing gives

Backprop on a sum yields gradients N times larger than backprop on a mean, where N is grad_accum_steps.

What we want

Cross-entropy loss is typically reported as the mean loss per token. Backprop on a mean produces gradients scaled by 1/N relative to backprop on the sum.

So you divide each microbatch's loss by grad_accum_steps before .backward() to compensate.
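
The compensation can be written in one line. The notation is an assumption for illustration, not taken from the source: L_i is the mean loss on microbatch i, N is grad_accum_steps, and θ stands for the model parameters.

```latex
\nabla_\theta \left( \frac{1}{N} \sum_{i=1}^{N} L_i \right)
  = \sum_{i=1}^{N} \nabla_\theta \left( \frac{L_i}{N} \right)
```

The left side is the gradient we want (the mean over all microbatches). The right side is what repeated backward() calls accumulate when each loss is divided by N first. Because every microbatch holds the same B × T tokens, the mean of the microbatch means equals the mean over the full batch.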

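Silent failure mode. If you forget the divisor, nothing errors out: your gradients are simply grad_accum_steps× too large. With plain SGD that acts like a learning rate that much too large, and training can blow up or behave weirdly. Gradient clipping and Adam-style normalization can hide part of the effect, which makes the bug harder to spot.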
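
A minimal sketch of that factor, in pure Python so it runs without PyTorch. The toy model (prediction = w * x with squared error) and the helper names are assumptions made for illustration; the running total `grad` only imitates how .grad accumulates.

```python
# Toy model: prediction = w * x, per-example loss = (w * x - y) ** 2.
# The gradient of the MEAN loss with respect to w is mean(2 * x * (w * x - y)).

def mean_loss_grad(w, xs, ys):
    """Gradient of the mean squared error with respect to w."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, micro_size, scale_loss):
    """Accumulate gradients over equal-sized microbatches, like repeated backward()."""
    n_micro = len(xs) // micro_size
    grad = 0.0  # plays the role of .grad right after zero_grad()
    for i in range(n_micro):
        lo, hi = i * micro_size, (i + 1) * micro_size
        g = mean_loss_grad(w, xs[lo:hi], ys[lo:hi])
        if scale_loss:
            g /= n_micro  # same effect as loss = loss / grad_accum_steps
        grad += g  # accumulation adds, it never overwrites
    return grad

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
w = 0.5

full = mean_loss_grad(w, xs, ys)  # one big batch of 8
scaled = accumulated_grad(w, xs, ys, micro_size=2, scale_loss=True)
unscaled = accumulated_grad(w, xs, ys, micro_size=2, scale_loss=False)

print(full, scaled, unscaled)
```

With four microbatches, `scaled` matches the full-batch gradient and `unscaled` is four times larger, which is the grad_accum_steps× factor described above.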

Effective batch size and DDP

When training on multiple GPUs with DDP (DistributedDataParallel), each rank processes its own microbatches independently. PyTorch's DDP averages gradients across ranks at each .backward() call. So if you have:

B = 64
T = 1024
ddp_world_size = 8
grad_accum_steps = 1

effective batch = 64 × 1024 × 8 = 524,288 tokens

In that configuration the GPUs supply the whole batch between them, and no accumulation loop is needed. If you want a larger effective batch than fits across all your GPUs in one pass, set grad_accum_steps > 1 to scale further. The formula in build-nanogpt covers both cases uniformly:

grad_accum_steps = total_batch_size // (B * T * ddp_world_size)
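
As a worked example, the same formula can be wrapped in a small helper. The function name `compute_grad_accum_steps` and the divisibility check are assumptions added here for illustration; the numbers are the ones used in this section.

```python
def compute_grad_accum_steps(total_batch_size, B, T, ddp_world_size):
    """Microbatches each rank must accumulate to reach total_batch_size tokens."""
    tokens_per_micro_step = B * T * ddp_world_size  # tokens seen by all ranks in one microbatch
    if total_batch_size % tokens_per_micro_step != 0:
        # Added safeguard (assumption): integer division would otherwise
        # silently shrink the effective batch.
        raise ValueError("total_batch_size must be divisible by B * T * ddp_world_size")
    return total_batch_size // tokens_per_micro_step

single_gpu = compute_grad_accum_steps(524288, B=64, T=1024, ddp_world_size=1)
eight_gpu = compute_grad_accum_steps(524288, B=64, T=1024, ddp_world_size=8)
print(single_gpu, eight_gpu)  # 8 1
```

One GPU needs 8 microbatches per optimizer step to reach 524,288 tokens; eight GPUs reach it in a single microbatch each.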

DDP sync optimization

When you're accumulating gradients, the per-microbatch all-reduce across ranks is wasted work for all but the last microbatch. Gradients only need to agree across ranks once, right before optimizer.step(), so each rank can accumulate locally and sync on the final backward. PyTorch lets you skip the sync for non-final microbatches:

if ddp:
    model.require_backward_grad_sync = (micro_step == grad_accum_steps - 1)

This is a noticeable speedup when grad_accum_steps is large and the cross-GPU all-reduce is non-trivial.
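
PyTorch also documents a context manager for this, DistributedDataParallel.no_sync(), which as far as I know toggles the same flag. Below is a hedged sketch of how it could be wired into the accumulation loop. The helper `grad_sync_context` and the `CountingModel` stand-in are hypothetical names introduced here so the block runs without GPUs; neither is part of build-nanogpt.

```python
import contextlib

def grad_sync_context(model, micro_step, grad_accum_steps, ddp):
    """Skip DDP's gradient all-reduce on every microbatch except the last."""
    is_last = micro_step == grad_accum_steps - 1
    if ddp and not is_last:
        # no_sync() is the context manager exposed by DistributedDataParallel
        return model.no_sync()
    return contextlib.nullcontext()

class CountingModel:
    """Hypothetical stand-in for a DDP-wrapped model; it only counts skipped syncs."""
    def __init__(self):
        self.skipped_syncs = 0

    @contextlib.contextmanager
    def no_sync(self):
        self.skipped_syncs += 1
        yield

model = CountingModel()
grad_accum_steps = 4
for micro_step in range(grad_accum_steps):
    with grad_sync_context(model, micro_step, grad_accum_steps, ddp=True):
        pass  # the forward pass and loss.backward() would both go here

print(model.skipped_syncs)  # 3
```

With a real DDP model, the PyTorch documentation notes that the forward pass should sit inside the context as well as the backward pass, otherwise gradients are still synchronized.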

When to use it

The original GPT-3 paper used effective batch sizes of ~3.2M tokens. Modern LLM training uses similar or larger batches. On any hardware setup short of a massive cluster, you need accumulation. Even on a big cluster, accumulation can let you hit a target batch size with less inter-node communication per step (one all-reduce per grad_accum_steps microbatches instead of one per microbatch).

For makemore, ng-video-lecture, and micrograd — small models, small datasets — accumulation isn't necessary. It first appears in build-nanogpt where the goal is a faithful GPT-2 reproduction at GPT-2's batch size.

Related

adamw: what runs once per grad_accum_steps microbatches
learning-rate-schedules: schedules are based on optimizer steps, not microbatches
dataloader: what produces the microbatches
training-stability: gradient clipping happens after accumulation
