Concept · Training infra

DataLoader Patterns

Feeding tokens into a transformer training loop is conceptually simple — sample random sequences from a giant token stream — but the engineering matters a lot at scale. Karpathy's repos progressively reveal what a good LLM dataloader looks like, from a one-liner in ng-video-lecture to the sharded streaming loader in build-nanogpt.

The three tiers
Tier 1: In-memory random crop (ng-video-lecture, makemore). Entire stream in one CPU tensor, random starting offsets per batch.
Tier 2: PyTorch DataLoader over a Dataset (makemore). Batching, shuffling, and multi-worker prefetching for free.
Tier 3: Sharded streaming (build-nanogpt). Many .npy shards, sequential reads, DDP rank slicing.

Tier 1 — in-memory random crop (ng-video-lecture, makemore)

The simplest possible dataloader, from ng-video-lecture/gpt.py:

data = torch.tensor(encode(text), dtype=torch.long)  # everything in RAM
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]

def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

The entire token stream is one tensor in CPU memory. To make a batch:

  1. Pick batch_size random starting positions in the stream.
  2. Slice out block_size tokens starting from each.
  3. The inputs x are the slice data[i : i+block_size], i.e. tokens i through i+block_size-1.
  4. The targets y are the slice data[i+1 : i+block_size+1], i.e. tokens i+1 through i+block_size: the same window, shifted right by one. Each input token's target is the next token.
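
The four steps above can be checked on a toy stream without PyTorch. The following is a minimal sketch, not the repo's code: it uses plain Python lists in place of tensors, and the stream contents, the `rng` argument, and the list-based return type are assumptions made for illustration.

```python
import random

def get_batch(data, block_size, batch_size, rng):
    """Return (x, y) as lists of token lists; y is x shifted right by one."""
    # Valid starts are 0 .. len(data) - block_size - 1, which leaves room
    # for block_size inputs plus the one extra token the targets need.
    starts = [rng.randrange(len(data) - block_size) for _ in range(batch_size)]
    x = [data[i : i + block_size] for i in starts]
    y = [data[i + 1 : i + block_size + 1] for i in starts]
    return x, y

# Stand-in token IDs (consecutive integers), purely illustrative.
stream = list(range(100, 120))
x, y = get_batch(stream, block_size=6, batch_size=4, rng=random.Random(0))

for xi, yi in zip(x, y):
    # The two windows overlap in all but one token.
    assert xi[1:] == yi[:-1]
    print(xi, "->", yi)
```

Because the toy stream is consecutive integers, every target is its input plus one, which makes the shift easy to see in the printed rows.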

The shift-by-one window, visualized

stream  t0  t1  t2  t3  t4  t5  t6  t7
x =     t0  t1  t2  t3  t4  t5  ·   ·
y =     ·   t1  t2  t3  t4  t5  t6  ·
x is the slice data[i : i+block_size]; y is the same window shifted right by one. Every position contributes one next-token-prediction signal.

Teacher forcing + autoregressive training. At every position in the sequence, the loss asks "given the tokens up to here, predict the next one." A single block_size-token sample therefore produces block_size next-token-prediction signals, one per position.
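
To make the "one signal per position" point concrete, here is a small sketch that unrolls a single (x, y) sample into its (context, target) pairs. The token IDs and the helper name `training_pairs` are made up for illustration; a real model computes all of these predictions together rather than in a Python loop.

```python
def training_pairs(x, y):
    """List the (context, target) pair that each position of one sample supervises."""
    return [(x[: t + 1], y[t]) for t in range(len(x))]

# Illustrative token IDs; y is x shifted right by one.
x = [18, 47, 56, 57, 58, 1]
y = [47, 56, 57, 58, 1, 15]

pairs = training_pairs(x, y)
for context, target in pairs:
    print(f"given {context} predict {target}")
```

With block_size = 6, the sample yields six pairs, with contexts of length 1 through 6.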

Memory. Tiny Shakespeare is ~1MB. Names is ~200KB. Both fit comfortably in RAM.

Tier 2 — PyTorch DataLoader over a Dataset (makemore)

makemore.py wraps a custom Dataset and uses torch.utils.data.DataLoader:

from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader

# ... define CharDataset that yields (x, y) pairs ...

train_loader = DataLoader(train_dataset, ...)

This gives you batching, shuffling, and multi-worker prefetching for free. It is worth using when the dataset does not fit in a single in-memory tensor.

For pure-token LM data, the in-memory tensor approach is simpler and faster (no per-sample Python overhead).

Tier 3 — sharded streaming (build-nanogpt)

For multi-billion-token pretraining, you can't keep everything in RAM. build-nanogpt/train_gpt2.py has DataLoaderLite:

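import os

# load_tokens(path) is defined elsewhere in train_gpt2.py; it loads one shard into a tensor.

class DataLoaderLite: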
    def __init__(self, B, T, process_rank, num_processes, split):
        self.B = B
        self.T = T
        self.process_rank = process_rank
        self.num_processes = num_processes

        data_root = "edu_fineweb10B"
        shards = os.listdir(data_root)
        shards = [s for s in shards if split in s]
        shards = sorted(shards)
        shards = [os.path.join(data_root, s) for s in shards]
        self.shards = shards
        self.reset()

    def reset(self):
        self.current_shard = 0
        self.tokens = load_tokens(self.shards[self.current_shard])
        self.current_position = self.B * self.T * self.process_rank

    def next_batch(self):
        B, T = self.B, self.T
        buf = self.tokens[self.current_position : self.current_position+B*T+1]
        x = (buf[:-1]).view(B, T)
        y = (buf[1:]).view(B, T)
        self.current_position += B * T * self.num_processes
        if self.current_position + (B * T * self.num_processes + 1) > len(self.tokens):
            self.current_shard = (self.current_shard + 1) % len(self.shards)
            self.tokens = load_tokens(self.shards[self.current_shard])
            self.current_position = B * T * self.process_rank
        return x, y

What's going on

Shards
The 10B-token FineWeb-Edu sample is split into many .npy files (typically 100M tokens each). load_tokens loads one shard into a tensor.
Sequential, not random
Each batch advances current_position by B * T * num_processes. When you fall off the end of a shard, advance to the next one. No random sampling — the model sees the corpus in a fixed (but data-prep-shuffled) order.
DDP rank slicing
Each DDP rank starts at offset B * T * rank within the shard and advances by B * T * num_processes. This guarantees no two ranks see the same tokens in the same step.
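
The position arithmetic can be traced by hand with a pure-Python mirror of the loader. This is a sketch under stated assumptions, not the repo's class: `ListShardLoader` is a hypothetical name, shards are plain lists instead of .npy files, batches stay flat instead of being viewed as (B, T), and the extra `span` return value exists only so the read ranges can be inspected.

```python
class ListShardLoader:
    """Pure-Python mirror of the sequential, rank-strided read pattern."""

    def __init__(self, shards, B, T, process_rank, num_processes):
        self.shards = shards
        self.B, self.T = B, T
        self.process_rank = process_rank
        self.num_processes = num_processes
        self.current_shard = 0
        self.tokens = self.shards[0]
        self.current_position = B * T * process_rank

    def next_batch(self):
        B, T = self.B, self.T
        start = self.current_position
        buf = self.tokens[start : start + B * T + 1]
        x, y = buf[:-1], buf[1:]  # flat here; the real loader reshapes to (B, T)
        span = (self.current_shard, start, start + B * T)
        # Skip over the chunks that belong to the other ranks.
        self.current_position += B * T * self.num_processes
        # If the next read would run past the shard, move to the next shard.
        if self.current_position + (B * T * self.num_processes + 1) > len(self.tokens):
            self.current_shard = (self.current_shard + 1) % len(self.shards)
            self.tokens = self.shards[self.current_shard]
            self.current_position = B * T * self.process_rank
        return x, y, span


def simulate(num_steps):
    # Two toy shards of 48 tokens each; two ranks; B*T = 8 tokens per batch.
    shards = [list(range(0, 48)), list(range(1000, 1048))]
    loaders = [
        ListShardLoader(shards, B=2, T=4, process_rank=r, num_processes=2)
        for r in range(2)
    ]
    history = []
    for _ in range(num_steps):
        step_spans = []
        for loader in loaders:
            x, y, span = loader.next_batch()
            assert x[1:] == y[:-1]  # targets are inputs shifted by one
            step_spans.append(span)
        history.append(step_spans)
    return history


history = simulate(3)
for step, spans in enumerate(history):
    print(step, spans)
```

Each span is (shard index, start, end) for the inputs of one rank. In this toy setup, tokens 32..47 of the first shard are never read: the boundary check moves to the next shard as soon as a full round of batches no longer fits, so the tail of a shard is skipped.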

DDP rank slicing, visualized

rank 0   B·T   ·    B·T   ·    B·T   ·    B·T   ·
rank 1   ·    B·T   ·    B·T   ·    B·T   ·    B·T
Each rank starts at offset B · T · rank and advances by B · T · num_processes. No two ranks see the same tokens in the same step.

Sequential ordering is fine for LM pretraining because the corpus is so large that any given token is seen once or twice per training run, not many times. You can avoid random shuffling at the dataloader by doing it once at data prep time.

The fineweb.py preprocessor

build-nanogpt/fineweb.py is the script that tokenizes the FineWeb-Edu dataset and writes the shards. Two-step pattern:

  1. Stream the raw dataset (HuggingFace datasets.load_dataset).
  2. Tokenize each document with tiktoken, write tokens to shard files of fixed size (100M tokens each).
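
The second step, packing a stream of tokenized documents into fixed-size shards, can be sketched as follows. This is not the fineweb.py code: `shard_token_stream` and `toy_tokenize` are hypothetical names, the character-code tokenizer stands in for tiktoken, the `EOT` delimiter value is an assumption of this sketch, and shards are yielded as lists rather than written to disk.

```python
EOT = 0  # hypothetical document-delimiter token ID, for illustration only

def toy_tokenize(text):
    """Stand-in tokenizer: a delimiter followed by one token per character."""
    return [EOT] + [ord(c) for c in text]

def shard_token_stream(docs, tokenize, shard_size):
    """Yield lists of exactly shard_size tokens, then one final partial shard."""
    buf = []
    for doc in docs:
        buf.extend(tokenize(doc))
        # A document may straddle a shard boundary; the remainder carries over.
        while len(buf) >= shard_size:
            yield buf[:shard_size]
            buf = buf[shard_size:]
    if buf:
        yield buf

docs = ["abc", "de", "fghij"]
shards = list(shard_token_stream(docs, toy_tokenize, shard_size=4))
for i, shard in enumerate(shards):
    print(i, shard)
```

Shard boundaries fall wherever the token count dictates, not at document boundaries, so every shard except possibly the last has the same size.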

Each token is stored as a uint16, which fits any GPT-2 token ID (the largest is 50256, below the uint16 limit of 65535). 10B tokens × 2 bytes = 20GB of binary data. Manageable.
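
The storage arithmetic is easy to verify. The sketch below assumes exactly 10B tokens and exactly 100M tokens per shard, which are the round figures quoted above rather than measured values.

```python
import struct

GPT2_MAX_TOKEN_ID = 50256
UINT16_MAX = 2**16 - 1  # 65535

# Size of one unsigned 16-bit integer in bytes (standard size, no padding).
bytes_per_token = struct.calcsize("<H")

num_tokens = 10 * 10**9          # 10B tokens (round figure)
tokens_per_shard = 100 * 10**6   # 100M tokens per shard (round figure)

total_gb = num_tokens * bytes_per_token / 10**9
shard_mb = tokens_per_shard * bytes_per_token / 10**6
num_shards = num_tokens // tokens_per_shard

assert GPT2_MAX_TOKEN_ID <= UINT16_MAX
print(f"{bytes_per_token} bytes/token, {total_gb} GB total, "
      f"{shard_mb} MB per shard, {num_shards} shards")
```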

llm.c's dataloader.h

llm.c/llmc/dataloader.h is the C version of the same thing: it opens shard files, seeks to the current position, and reads batches into buffers for the training loop. Same pattern, no PyTorch, just file handles and offset arithmetic.
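
The "no framework, just offsets into a binary file" idea can be illustrated in Python with plain file I/O. This is a sketch of the general pattern only, not a port of dataloader.h: the file here is a headerless run of little-endian uint16 values, and `write_tokens`, `read_batch`, and the file name are hypothetical.

```python
import os
import struct
import tempfile

BYTES_PER_TOKEN = struct.calcsize("<H")  # 2 bytes per uint16 token

def write_tokens(path, tokens):
    """Write tokens as raw little-endian uint16 values (no header in this sketch)."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(tokens)}H", *tokens))

def read_batch(f, position, B, T):
    """Read B*T+1 tokens starting at token index `position`; return (x, y) rows."""
    n = B * T + 1
    f.seek(position * BYTES_PER_TOKEN)  # token index -> byte offset
    buf = struct.unpack(f"<{n}H", f.read(n * BYTES_PER_TOKEN))
    x = [list(buf[r * T : (r + 1) * T]) for r in range(B)]
    y = [list(buf[r * T + 1 : (r + 1) * T + 1]) for r in range(B)]
    return x, y

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_shard.bin")
    write_tokens(path, list(range(100)))
    with open(path, "rb") as f:
        x0, y0 = read_batch(f, position=0, B=2, T=3)
        x1, y1 = read_batch(f, position=6, B=2, T=3)

print(x0, y0)
print(x1, y1)
```

Advancing `position` by B * T after each call (or by B * T * num_processes under DDP) reproduces the sequential pattern described for Tier 3.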

Comparing the three tiers

          | Tier 1 · in-memory                       | Tier 2 · DataLoader             | Tier 3 · sharded
seen in   | ng-video-lecture, makemore               | makemore                        | build-nanogpt, llm.c
storage   | one CPU tensor                           | custom Dataset                  | .npy shards on disk
sampling  | random starting offsets                  | batching + shuffling for free   | sequential, advances pointer
scale     | fits Tiny Shakespeare ~1MB, Names ~200KB | datasets too big for one tensor | multi-billion-token pretraining
DDP-aware | no                                       | via workers                     | yes (rank offset + stride)
