Gradient Accumulation
Gradient accumulation is the trick that lets you train with a large effective batch size on hardware that can't fit a large batch in memory. The GPT-3 paper trained with batch size 3.2M tokens; an A100 with 40GB of memory can fit maybe 16k tokens at a time for a 1.5B model. Accumulation closes that gap.
The pattern
# from build-nanogpt/train_gpt2.py
total_batch_size = 524288 # 2**19, ~0.5M tokens
B = 64 # micro batch size
T = 1024 # sequence length
grad_accum_steps = total_batch_size // (B * T * ddp_world_size)

model.train()
optimizer.zero_grad()
loss_accum = 0.0
for micro_step in range(grad_accum_steps):
    x, y = train_loader.next_batch()
    x, y = x.to(device), y.to(device)
    with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
        logits, loss = model(x, y)
    loss = loss / grad_accum_steps
    loss_accum += loss.detach()
    loss.backward()
norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
What's happening: each microbatch does one forward and one backward pass.
loss.backward() adds to .grad on every parameter; it does not overwrite.
So after grad_accum_steps microbatches, the accumulated gradient on each
parameter is the sum of the per-microbatch gradients.
- .grad on every parameter is reset to 0
- .grad += g₁
- .grad += g₂
- ... (grad_accum_steps microbatches)
- .grad += gₙ
- clip_grad_norm_(..., 1.0) on the accumulated .grad
- optimizer.step() applies the summed gradient
A single optimizer.step() then applies the accumulated gradient,
and optimizer.zero_grad() resets for the next big step.
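The accumulate-rather-than-overwrite behaviour of .grad is easy to check on a single scalar parameter. The following is a minimal sketch, not code from build-nanogpt: the tensor w and the two toy losses are made up for illustration, and the manual w.grad.zero_() stands in for the reset that optimizer.zero_grad() performs in the training loop.

```python
import torch

# A single scalar parameter keeps the arithmetic easy to follow (illustrative only).
w = torch.tensor(2.0, requires_grad=True)

(3.0 * w).backward()               # d(3w)/dw = 3, so .grad becomes 3
grad_after_first = w.grad.item()

(5.0 * w).backward()               # d(5w)/dw = 5 is ADDED to the existing .grad
grad_after_second = w.grad.item()

w.grad.zero_()                     # manual reset, standing in for optimizer.zero_grad()
grad_after_reset = w.grad.item()

print(grad_after_first, grad_after_second, grad_after_reset)  # 3.0 8.0 0.0
```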
The loss / grad_accum_steps divisor
This is the easy thing to get wrong, and Karpathy explicitly comments on it in the source:
we have to scale the loss to account for gradient accumulation, because the gradients just add on each successive backward(). addition of gradients corresponds to a SUM in the objective, but instead of a SUM we want MEAN.
What summing gives
Backprop on a sum of N per-microbatch losses yields gradients N times
larger than backprop on their mean. Here N is grad_accum_steps.
What we want
Cross-entropy loss is typically reported as the mean loss per token.
Backprop on a mean over N microbatches scales each microbatch's gradient
by 1/N.
So you divide each microbatch's loss by grad_accum_steps
before .backward() to compensate. If you skip the division, the
accumulated gradient is grad_accum_steps× too large. With plain SGD that is
the same as a learning rate that much too large, and training blows up or
behaves erratically.
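The effect of the divisor can be checked numerically. This is a minimal sketch under stated assumptions: the linear model, random data, and MSELoss are hypothetical stand-ins for the GPT-2 model and per-token cross-entropy, and the helper accumulated_grad is invented for this example. It relies on all microbatches having the same size, which is what makes the mean of microbatch means equal the full-batch mean.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy setup: 8 samples split into 4 equal microbatches.
model = torch.nn.Linear(3, 1)
x = torch.randn(8, 3)
y = torch.randn(8, 1)
loss_fn = torch.nn.MSELoss()  # mean reduction, analogous to mean per-token loss
grad_accum_steps = 4

def accumulated_grad(scale_loss):
    """Run every microbatch and return the accumulated weight gradient."""
    model.zero_grad()
    for xb, yb in zip(x.chunk(grad_accum_steps), y.chunk(grad_accum_steps)):
        loss = loss_fn(model(xb), yb)
        if scale_loss:
            loss = loss / grad_accum_steps  # the divisor discussed above
        loss.backward()
    return model.weight.grad.clone()

# Reference: a single backward pass over the full batch.
model.zero_grad()
loss_fn(model(x), y).backward()
full_batch_grad = model.weight.grad.clone()

scaled = accumulated_grad(scale_loss=True)
unscaled = accumulated_grad(scale_loss=False)

# With the divisor, accumulation reproduces the full-batch gradient.
matches_full_batch = torch.allclose(scaled, full_batch_grad, atol=1e-6)
# Without it, the gradient is grad_accum_steps times too large.
is_n_times_too_large = torch.allclose(
    unscaled, grad_accum_steps * full_batch_grad, atol=1e-5
)
print(matches_full_batch, is_n_times_too_large)
```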
Effective batch size and DDP
When training on multiple GPUs with DDP (DistributedDataParallel), each rank
processes its own microbatches independently, and PyTorch's DDP averages
gradients across ranks during each .backward() call. So if
B * T * ddp_world_size already equals total_batch_size (for example, 8 ranks
with B = 64 and T = 1024 give 8 × 64 × 1024 = 524,288 tokens),
grad_accum_steps comes out to 1 and the GPUs do the accumulation across
themselves. If you want a larger effective batch than fits even across all
your GPUs, grad_accum_steps > 1 scales you further. The math in
build-nanogpt covers both cases uniformly:
grad_accum_steps = total_batch_size // (B * T * ddp_world_size)
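To see how the formula trades GPUs against accumulation steps, here is a small worked sketch using the values quoted earlier (total_batch_size = 524288, B = 64, T = 1024). The helper name compute_grad_accum_steps and its divisibility check are assumptions added for illustration, not something this page quotes from the source.

```python
def compute_grad_accum_steps(total_batch_size, B, T, ddp_world_size):
    """Hypothetical helper wrapping the formula above."""
    tokens_per_micro_step = B * T * ddp_world_size
    # Added safeguard (an assumption of this sketch): refuse batch sizes that
    # integer division would silently round down.
    if total_batch_size % tokens_per_micro_step != 0:
        raise ValueError(
            "total_batch_size must be divisible by B * T * ddp_world_size"
        )
    return total_batch_size // tokens_per_micro_step

total_batch_size = 524288  # 2**19 tokens
steps_by_world_size = {
    world_size: compute_grad_accum_steps(total_batch_size, 64, 1024, world_size)
    for world_size in (1, 2, 4, 8)
}
print(steps_by_world_size)  # {1: 8, 2: 4, 4: 2, 8: 1}
```

One GPU needs 8 accumulation steps to reach the target batch, while 8 GPUs reach it with a single microbatch each.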
DDP sync optimization
When you're accumulating gradients, the per-microbatch all-reduce across ranks is wasted work for all but the last microbatch — you'd be summing partial gradients that get summed again on the next microbatch. PyTorch lets you skip the sync for non-final microbatches:
if ddp:
    model.require_backward_grad_sync = (micro_step == grad_accum_steps - 1)
This is a noticeable speedup when grad_accum_steps is large and
the cross-GPU all-reduce is non-trivial.
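PyTorch also exposes this switch through a documented context manager, DistributedDataParallel.no_sync(). The sketch below is one way to use it, not the build-nanogpt implementation: the helpers micro_step_context and accumulate_one_step and the toy model are hypothetical. The PyTorch documentation says the forward pass should sit inside the context as well as the backward pass, so both are wrapped here. For a plain, non-DDP model the context is a no-op, which lets the sketch run in a single process.

```python
import contextlib

import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def micro_step_context(model, micro_step, grad_accum_steps):
    """Context that suppresses DDP's all-reduce on non-final microbatches."""
    is_last = micro_step == grad_accum_steps - 1
    if isinstance(model, DDP) and not is_last:
        return model.no_sync()
    return contextlib.nullcontext()  # plain model, or final microbatch: sync as usual

def accumulate_one_step(model, loss_fn, micro_batches):
    """Forward and backward over all microbatches; returns the mean loss."""
    grad_accum_steps = len(micro_batches)
    loss_accum = 0.0
    for micro_step, (x, y) in enumerate(micro_batches):
        # Forward AND backward are both inside the context, as no_sync() expects.
        with micro_step_context(model, micro_step, grad_accum_steps):
            loss = loss_fn(model(x), y) / grad_accum_steps
            loss.backward()
        loss_accum += loss.item()
    return loss_accum

# Single-process usage with a plain (non-DDP) toy model.
torch.manual_seed(0)
toy_model = torch.nn.Linear(3, 1)
toy_batches = [(torch.randn(2, 3), torch.randn(2, 1)) for _ in range(4)]
mean_loss = accumulate_one_step(toy_model, torch.nn.MSELoss(), toy_batches)
print(mean_loss)
```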
When to use it
The original GPT-3 paper used effective batch sizes of ~3.2M tokens, and
modern LLM training uses similar or larger batches. On any hardware setup
short of a large cluster, you need accumulation to reach batches of that
size. Even on a big cluster, accumulation can let you hit a target batch
size with less inter-node communication per step (one all-reduce per
grad_accum_steps microbatches instead of one per microbatch).
For makemore, ng-video-lecture, and micrograd — small models, small datasets — accumulation isn't necessary. It first appears in build-nanogpt where the goal is a faithful GPT-2 reproduction at GPT-2's batch size.
Related
- adamw: what runs once per grad_accum_steps microbatches
- learning-rate-schedules: schedules are based on optimizer steps, not microbatches
- dataloader: what produces the microbatches
- training-stability: gradient clipping happens after accumulation