DataLoader Patterns
Feeding tokens into a transformer training loop is conceptually simple — sample random sequences from a giant token stream — but the engineering matters a lot at scale. Karpathy's repos progressively reveal what a good LLM dataloader looks like, from a one-liner in ng-video-lecture to the sharded streaming loader in build-nanogpt.
The three tiers
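- Tier 1: in-memory random crop (ng-video-lecture, makemore). The whole token stream sits in one tensor and batches are cut at random offsets.
- Tier 2: DataLoader over a Dataset (makemore). Batching, shuffling, and multi-worker prefetching for free.
- Tier 3: sharded streaming (build-nanogpt). .npy shards, sequential reads, DDP rank slicing.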
Tier 1 — in-memory random crop (ng-video-lecture, makemore)
The simplest possible dataloader, from ng-video-lecture/gpt.py:
data = torch.tensor(encode(text), dtype=torch.long)  # everything in RAM
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]

def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y
The entire token stream is one tensor in CPU memory. To make a batch:
- Pick batch_size random starting positions in the stream.
- Slice out block_size tokens starting from each.
- The inputs x are tokens i .. i+block_size (end-exclusive, matching the slice data[i:i+block_size]).
- The targets y are tokens i+1 .. i+block_size+1 (also end-exclusive): the same window, shifted right by one. Each input token's target is the next token.
The shift-by-one window, visualized
x spans i .. i+block_size; y is the same window shifted right by one. Every position contributes one next-token-prediction signal.
Each block_size-token sample produces block_size independent next-token-prediction supervision signals.
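To make the shifted window concrete, here is a minimal plain-Python sketch that expands one window into its per-position training pairs. It uses lists instead of tensors, and the toy `data` stream, `block_size = 4`, and offset `i = 2` are illustrative assumptions, not values from gpt.py.

```python
# Toy token stream; in the real loader this is the encoded text tensor.
data = list(range(100, 110))   # assumed toy values: 100, 101, ..., 109
block_size = 4                 # illustrative only
i = 2                          # one sampled starting offset

x = data[i : i + block_size]          # inputs
y = data[i + 1 : i + block_size + 1]  # targets: same window shifted by one

# Position t uses x[:t+1] as context and y[t] as the token to predict.
pairs = [(x[: t + 1], y[t]) for t in range(block_size)]

for context, target in pairs:
    print(f"context {context} -> target {target}")
```

One window of 4 tokens yields 4 (context, target) pairs, which is why a single batch of shape (batch_size, block_size) carries batch_size * block_size prediction targets.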
Memory. Tiny Shakespeare is ~1MB. Names is ~200KB. Both fit comfortably in RAM.
Tier 2 — PyTorch DataLoader over a Dataset (makemore)
makemore.py wraps a custom Dataset and uses torch.utils.data.DataLoader:
from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader
# ... define CharDataset that yields (x, y) pairs ...
train_loader = DataLoader(train_dataset, ...)
This gives you batching, shuffling, and multi-worker prefetching for free. Worth using when:
- Data preprocessing per-sample is non-trivial (decoding images, parsing JSON).
- You want CPU prefetching to overlap with GPU compute.
- Datasets are too big to keep in one tensor.
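As a rough sketch of this tier, the block below wraps a token tensor in a `Dataset` and hands it to `DataLoader`. The `WindowDataset` class, the toy `tokens` tensor, and the batch and block sizes are hypothetical stand-ins for illustration; makemore's own `CharDataset` is organized differently.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class WindowDataset(Dataset):
    """Hypothetical dataset: every offset in a 1-D token tensor is one sample."""

    def __init__(self, tokens, block_size):
        self.tokens = tokens
        self.block_size = block_size

    def __len__(self):
        # The last valid start must leave room for the shifted target window.
        return len(self.tokens) - self.block_size

    def __getitem__(self, i):
        x = self.tokens[i : i + self.block_size]
        y = self.tokens[i + 1 : i + self.block_size + 1]
        return x, y


tokens = torch.arange(1000, dtype=torch.long)  # stand-in for encoded text
train_dataset = WindowDataset(tokens, block_size=8)

# num_workers=0 loads in the main process; a larger value prefetches in worker processes.
train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True, num_workers=0)

xb, yb = next(iter(train_loader))
print(xb.shape, yb.shape)
```

The default collate function stacks the per-sample `(x, y)` pairs into two tensors of shape (batch_size, block_size), so the training loop sees the same interface as `get_batch` in Tier 1.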
Tier 3 — sharded streaming (build-nanogpt)
For multi-billion-token pretraining, you can't keep everything in RAM. build-nanogpt/train_gpt2.py has DataLoaderLite:
class DataLoaderLite:
    def __init__(self, B, T, process_rank, num_processes, split):
        self.B = B
        self.T = T
        self.process_rank = process_rank
        self.num_processes = num_processes
        data_root = "edu_fineweb10B"
        shards = os.listdir(data_root)
        shards = [s for s in shards if split in s]
        shards = sorted(shards)
        shards = [os.path.join(data_root, s) for s in shards]
        self.shards = shards
        self.reset()

    def reset(self):
        self.current_shard = 0
        self.tokens = load_tokens(self.shards[self.current_shard])
        self.current_position = self.B * self.T * self.process_rank

    def next_batch(self):
        B, T = self.B, self.T
        buf = self.tokens[self.current_position : self.current_position+B*T+1]
        x = (buf[:-1]).view(B, T)
        y = (buf[1:]).view(B, T)
        self.current_position += B * T * self.num_processes
        if self.current_position + (B * T * self.num_processes + 1) > len(self.tokens):
            self.current_shard = (self.current_shard + 1) % len(self.shards)
            self.tokens = load_tokens(self.shards[self.current_shard])
            self.current_position = B * T * self.process_rank
        return x, y
What's going on
- Shards. The 10B-token FineWeb-Edu corpus is split into many .npy files (100M tokens each, typically). load_tokens loads one shard into a tensor.
- Sequential, not random. Each batch advances current_position by B * T * num_processes. When you fall off the end of a shard, advance to the next one. There is no random sampling: the model sees the corpus in a fixed (but data-prep-shuffled) order.
- DDP rank slicing. Each DDP rank starts at offset B * T * rank within the shard and advances by B * T * num_processes. This guarantees no two ranks see the same tokens in the same step.
DDP rank slicing, visualized
Each rank starts at B · T · rank and advances by B · T · num_processes. No two ranks see the same tokens in the same step.
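The offset-and-stride rule can be checked with a small pure-Python simulation. This is a sketch under simplifying assumptions: the helper name `input_ranges` is hypothetical, the values of B, T, and the rank count are toy numbers, and shard rollover is ignored.

```python
def input_ranges(B, T, num_processes, steps):
    """Return, per step, the [start, end) input-token range read by each rank."""
    # Starting offsets, mirroring reset(): B * T * rank.
    positions = [B * T * rank for rank in range(num_processes)]
    schedule = []
    for _ in range(steps):
        schedule.append([(p, p + B * T) for p in positions])
        # Stride, mirroring next_batch(): B * T * num_processes.
        positions = [p + B * T * num_processes for p in positions]
    return schedule


schedule = input_ranges(B=2, T=4, num_processes=3, steps=2)
for step, ranges in enumerate(schedule):
    print(step, ranges)
# 0 [(0, 8), (8, 16), (16, 24)]
# 1 [(24, 32), (32, 40), (40, 48)]
```

Within a step the input ranges tile the shard without gaps or overlap. The one extra token each rank reads for its last target (the `+1` in `next_batch`) is the first input token of the neighboring rank, so it is shared only as a target.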
The fineweb.py preprocessor
build-nanogpt/fineweb.py is the script that tokenizes the FineWeb-Edu dataset and writes the shards. Two-step pattern:
- Stream the raw dataset (HuggingFace datasets.load_dataset).
- Tokenize each document with tiktoken, write tokens to shard files of fixed size (100M tokens each).
Each token is stored as a uint16, which fits any GPT-2 token ID (they max out at 50256, below the uint16 limit of 65535). 10B tokens × 2 bytes = 20GB of binary data. Manageable.
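The storage arithmetic is easy to verify. The sketch below assumes the corpus is exactly 10B tokens and every shard is exactly 100M tokens, which are the round figures used in the text; the constant names are illustrative.

```python
UINT16_MAX = 2**16 - 1        # 65535
GPT2_MAX_TOKEN_ID = 50256     # largest GPT-2 token ID, as stated in the text
BYTES_PER_TOKEN = 2           # one uint16
TOTAL_TOKENS = 10 * 10**9     # 10B tokens (assumed round figure)
SHARD_TOKENS = 100 * 10**6    # 100M tokens per shard

# Every GPT-2 token ID must be representable in 16 bits.
assert GPT2_MAX_TOKEN_ID <= UINT16_MAX

total_gb = TOTAL_TOKENS * BYTES_PER_TOKEN / 10**9   # decimal gigabytes
shard_mb = SHARD_TOKENS * BYTES_PER_TOKEN / 10**6   # decimal megabytes
num_shards = TOTAL_TOKENS // SHARD_TOKENS

print(total_gb, shard_mb, num_shards)  # 20.0 200.0 100
```

Under these assumptions each shard is about 200MB of token data, small enough to load whole while the full corpus stays on disk.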
llm.c's dataloader.h
llm.c/llmc/dataloader.h is the C version of the same thing: it opens shard files, seeks to the current position, reads the next batch of tokens into a buffer, and hands it to the training loop. Same pattern, no PyTorch, just file handles and offset arithmetic.
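The same idea can be sketched in Python without tensors: treat the shard as a flat file of uint16 values, convert a token index into a byte offset, and read B*T+1 tokens. This is only an illustration of the pattern. The `read_batch` helper and the headerless toy file are assumptions, and the sketch ignores whatever file layout llm.c actually uses.

```python
import os
import tempfile
from array import array


def read_batch(path, position, B, T):
    """Read B*T+1 uint16 tokens starting at token index `position`."""
    buf = array("H")                     # "H" = unsigned 16-bit on common platforms
    with open(path, "rb") as f:
        f.seek(position * buf.itemsize)  # token index -> byte offset
        buf.fromfile(f, B * T + 1)
    tokens = buf.tolist()
    inputs = [tokens[b * T : (b + 1) * T] for b in range(B)]
    targets = [tokens[b * T + 1 : (b + 1) * T + 1] for b in range(B)]
    return inputs, targets


# Write a tiny headerless toy shard so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "toy_shard.bin")
with open(path, "wb") as f:
    array("H", range(50)).tofile(f)

inputs, targets = read_batch(path, position=6, B=2, T=3)
print(inputs)   # [[6, 7, 8], [9, 10, 11]]
print(targets)  # [[7, 8, 9], [10, 11, 12]]
```

Only the requested slice is read from disk, so memory use depends on the batch size rather than the shard size.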
Comparing the three tiers
| | Tier 1 · in-memory | Tier 2 · DataLoader | Tier 3 · sharded |
|---|---|---|---|
| seen in | ng-video-lecture, makemore | makemore | build-nanogpt, llm.c |
| storage | one CPU tensor | custom Dataset | .npy shards on disk |
| sampling | random starting offsets | batching + shuffling for free | sequential, advances pointer |
| scale fits | Tiny Shakespeare ~1MB, Names ~200KB | datasets too big for one tensor | multi-billion-token pretraining |
| DDP-aware | no | not by itself (workers only parallelize loading) | yes: rank offset + stride |
Related
- tokenization — what produces the tokens being streamed
- gradient-accumulation — consumes batches from the loader
- repos/build-nanogpt — DataLoaderLite + FineWeb script
- repos/llm-c — C dataloader for the same data