DataLoader Patterns

Feeding tokens into a transformer training loop is conceptually simple: sample random sequences from a giant token stream. At scale the engineering matters, because the corpus no longer fits in RAM and multiple GPUs must not read the same tokens. Karpathy's repos progressively reveal what a good LLM dataloader looks like, from a one-liner in ng-video-lecture to the sharded streaming loader in build-nanogpt.

The three tiers

Tier 1: In-memory random crop (ng-video-lecture, makemore). Entire stream in one CPU tensor, random starting offsets per batch.
Tier 2: PyTorch DataLoader over a Dataset (makemore). Batching, shuffling, and multi-worker prefetching for free.
Tier 3: Sharded streaming (build-nanogpt). Many .npy shards, sequential reads, DDP rank slicing.

Tier 1 — in-memory random crop (ng-video-lecture, makemore)

The simplest possible dataloader, from ng-video-lecture/gpt.py:

data = torch.tensor(encode(text), dtype=torch.long)  # everything in RAM
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]

def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

The entire token stream is one tensor in CPU memory. To make a batch:

Pick batch_size random starting positions in the stream.
Slice out block_size tokens starting from each.
The inputs x are tokens i .. i+block_size-1 (the slice data[i:i+block_size]).
The targets y are tokens i+1 .. i+block_size: the same window shifted right by one, so each input token's target is the next token in the stream.

The shift-by-one window, visualized

stream: t₀ t₁ t₂ t₃ t₄ t₅ t₆ t₇
x = t₀ t₁ t₂ t₃ t₄ t₅
y = t₁ t₂ t₃ t₄ t₅ t₆

x is the slice data[i:i+block_size]; y is the same window shifted right by one. Every position contributes one next-token-prediction signal.

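Teacher forcing + autoregressive training. At every position in the sequence, the loss asks "given the tokens up to here, predict the next one." A single block_size-token sample therefore yields block_size next-token-prediction targets, one per position, with context lengths running from 1 up to block_size. The targets are not statistically independent, since they share the same underlying text.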
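To make the per-position supervision concrete, the sketch below enumerates the (context, target) pairs hidden inside one sample. The toy token IDs and the helper name training_pairs are illustrative assumptions, not code from the repos, and plain Python lists stand in for tensors.

```python
def training_pairs(stream, i, block_size):
    """List the (context, target) pairs inside one sample starting at offset i."""
    x = stream[i : i + block_size]          # inputs
    y = stream[i + 1 : i + block_size + 1]  # targets: same window, one step later
    # At position t the model sees x[:t+1] and must predict y[t].
    return [(x[: t + 1], y[t]) for t in range(block_size)]


stream = [10, 11, 12, 13, 14, 15, 16, 17]  # stand-in token IDs
pairs = training_pairs(stream, i=0, block_size=4)
for context, target in pairs:
    print(context, "->", target)
# [10] -> 11
# [10, 11] -> 12
# [10, 11, 12] -> 13
# [10, 11, 12, 13] -> 14
```

In a real model these pairs are never materialized; the causal attention mask lets one forward pass over x score all block_size positions at once.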

Memory. Tiny Shakespeare is ~1MB. Names is ~200KB. Both fit comfortably in RAM.

Tier 2 — PyTorch `DataLoader` over a `Dataset` (makemore)

makemore.py wraps a custom Dataset and uses torch.utils.data.DataLoader:

from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader

# ... define CharDataset that yields (x, y) pairs ...

train_loader = DataLoader(train_dataset, ...)

This gives you batching, shuffling, and multi-worker prefetching for free. Worth using when:

Data preprocessing per-sample is non-trivial (decoding images, parsing JSON).
You want CPU prefetching to overlap with GPU compute.
Datasets are too big to keep in one tensor.

For pure-token LM data, the in-memory tensor approach is simpler and usually faster, since it skips the per-sample indexing and collation work a DataLoader does in Python.
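For comparison with Tier 1, here is a minimal sketch of the same shift-by-one windows served through a Dataset and DataLoader. TokenWindowDataset is a hypothetical class written for this page; it is not makemore's CharDataset, and the integer range stands in for an encoded corpus.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class TokenWindowDataset(Dataset):
    """Hypothetical dataset: each item is one (x, y) window over a token tensor."""

    def __init__(self, tokens, block_size):
        self.tokens = tokens
        self.block_size = block_size

    def __len__(self):
        # The last valid start index must leave room for the shifted target.
        return len(self.tokens) - self.block_size

    def __getitem__(self, i):
        x = self.tokens[i : i + self.block_size]
        y = self.tokens[i + 1 : i + self.block_size + 1]
        return x, y


tokens = torch.arange(100, dtype=torch.long)  # stand-in for an encoded corpus
dataset = TokenWindowDataset(tokens, block_size=8)
loader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=0)

xb, yb = next(iter(loader))
print(xb.shape, yb.shape)  # torch.Size([4, 8]) torch.Size([4, 8])
```

Raising num_workers above 0 is what buys the CPU prefetching mentioned above; with data this cheap to slice, the extra processes are unlikely to pay for themselves.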

Tier 3 — sharded streaming (build-nanogpt)

For multi-billion-token pretraining, you can't keep everything in RAM. build-nanogpt/train_gpt2.py has DataLoaderLite:

class DataLoaderLite:
    def __init__(self, B, T, process_rank, num_processes, split):
        self.B = B
        self.T = T
        self.process_rank = process_rank
        self.num_processes = num_processes

        data_root = "edu_fineweb10B"
        shards = os.listdir(data_root)
        shards = [s for s in shards if split in s]
        shards = sorted(shards)
        shards = [os.path.join(data_root, s) for s in shards]
        self.shards = shards
        self.reset()

    def reset(self):
        self.current_shard = 0
        self.tokens = load_tokens(self.shards[self.current_shard])
        self.current_position = self.B * self.T * self.process_rank

    def next_batch(self):
        B, T = self.B, self.T
        buf = self.tokens[self.current_position : self.current_position+B*T+1]
        x = (buf[:-1]).view(B, T)
        y = (buf[1:]).view(B, T)
        self.current_position += B * T * self.num_processes
        if self.current_position + (B * T * self.num_processes + 1) > len(self.tokens):
            self.current_shard = (self.current_shard + 1) % len(self.shards)
            self.tokens = load_tokens(self.shards[self.current_shard])
            self.current_position = B * T * self.process_rank
        return x, y

What's going on

Shards: The 10B-token FineWeb-Edu sample is split into many .npy files (100M tokens each, typically). load_tokens loads one shard into a tensor.
Sequential, not random: Each batch advances current_position by B * T * num_processes. When the next round of batches would run past the end of a shard, the loader moves to the next shard, wrapping around to the first after the last. There is no random sampling, so the model sees the corpus in a fixed order set at data-prep time; any shuffling has to happen there.
DDP rank slicing: Each DDP rank starts at offset B * T * rank within the shard and advances by B * T * num_processes. This guarantees that no two ranks read the same input tokens in the same step. Adjacent ranks share only the single boundary token that is one rank's last target and the next rank's first input.

DDP rank slicing, visualized

[ rank 0: B·T tokens ][ rank 1: B·T tokens ]

Each rank starts at offset B · T · rank and advances by B · T · num_processes. No two ranks see the same tokens in the same step.
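The offset-and-stride arithmetic can be checked with a small simulation. This is a sketch under simplifying assumptions: it tracks only the input-token ranges, ignores shard boundaries, and the function name rank_windows is made up for illustration.

```python
def rank_windows(B, T, num_processes, num_steps):
    """Return {rank: [(start, end), ...]} of half-open input-token ranges per step."""
    windows = {}
    for rank in range(num_processes):
        position = B * T * rank                      # initial offset for this rank
        spans = []
        for _ in range(num_steps):
            spans.append((position, position + B * T))
            position += B * T * num_processes        # stride over all ranks
        windows[rank] = spans
    return windows


windows = rank_windows(B=2, T=4, num_processes=2, num_steps=3)
for rank, spans in windows.items():
    print(f"rank {rank}: {spans}")
# rank 0: [(0, 8), (16, 24), (32, 40)]
# rank 1: [(8, 16), (24, 32), (40, 48)]
```

Read column by column, each step covers one contiguous stretch of num_processes * B * T tokens, split evenly between the ranks with no gaps and no overlap in the inputs.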

Sequential ordering is fine for LM pretraining because the corpus is so large that any given token is seen once or twice per training run, not many times. You can avoid random shuffling at the dataloader by doing it once at data prep time.
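A one-time shuffle at data-prep time could look roughly like the sketch below: permute the document order with a fixed seed, then flatten into the stream that gets sharded. The function name, the seed, and the use of 0 as a document delimiter are assumptions for illustration, not what any of the repos do.

```python
import random


def shuffled_token_stream(documents, seed, eot_token):
    """Shuffle document order once, then flatten into a single token list."""
    order = list(range(len(documents)))
    random.Random(seed).shuffle(order)   # fixed seed keeps the order reproducible
    stream = []
    for idx in order:
        stream.append(eot_token)         # delimiter in front of every document
        stream.extend(documents[idx])
    return stream


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # stand-in tokenized documents
stream = shuffled_token_stream(docs, seed=0, eot_token=0)
print(stream)
```

Shuffling whole documents rather than individual tokens keeps each document's text contiguous, so the sequential loader still reads coherent sequences.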

The `fineweb.py` preprocessor

build-nanogpt/fineweb.py is the script that tokenizes the FineWeb-Edu dataset and writes the shards. Two-step pattern:

Stream the raw dataset (HuggingFace datasets.load_dataset).
Tokenize each document with tiktoken, write tokens to shard files of fixed size (100M tokens each).
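The fixed-size sharding step can be sketched without any of the real dependencies. The version below assumes documents are already tokenized into lists of ints and simply yields shards; the tiktoken call and the write to disk are omitted, and iter_shards is a hypothetical name.

```python
def iter_shards(token_docs, shard_size):
    """Yield lists of exactly shard_size tokens, then one final partial shard."""
    buffer = []
    for doc_tokens in token_docs:
        buffer.extend(doc_tokens)
        while len(buffer) >= shard_size:
            yield buffer[:shard_size]    # full shard; a document may straddle two
            buffer = buffer[shard_size:]
    if buffer:
        yield buffer                     # leftover tokens form the last shard


docs = [[1, 2, 3], [4, 5, 6, 7], [8, 9]]  # stand-in for tokenized documents
shards = list(iter_shards(docs, shard_size=4))
print(shards)  # [[1, 2, 3, 4], [5, 6, 7, 8], [9]]
```

Note that documents are split wherever the shard boundary happens to fall, which is harmless for a loader that treats the corpus as one long stream.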

Each token is stored as a uint16, which fits any GPT-2 token ID (they max out at 50256, below the uint16 limit of 65535). 10B tokens at 2 bytes each is 20GB of binary data. Manageable.
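The storage numbers are easy to verify. The constants below come straight from the figures quoted on this page (10B tokens, 100M tokens per shard, GPT-2's top token ID of 50256); the variable names are just labels.

```python
UINT16_MAX = 2**16 - 1            # 65535
GPT2_MAX_TOKEN_ID = 50256         # highest GPT-2 token ID
BYTES_PER_TOKEN = 2               # one uint16
NUM_TOKENS = 10_000_000_000       # 10B tokens
SHARD_TOKENS = 100_000_000        # 100M tokens per shard

fits = GPT2_MAX_TOKEN_ID <= UINT16_MAX
total_gb = NUM_TOKENS * BYTES_PER_TOKEN / 1e9
shard_mb = SHARD_TOKENS * BYTES_PER_TOKEN / 1e6
num_shards = NUM_TOKENS // SHARD_TOKENS

print(fits, total_gb, shard_mb, num_shards)  # True 20.0 200.0 100
```

So each shard is about 200MB on disk, and roughly 100 of them cover the corpus.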

llm.c's `dataloader.h`

llm.c/llmc/dataloader.h is the C version of the same thing: it opens shard files, seeks to the current position, reads each batch into a buffer, and hands it to the training loop. Same pattern, no PyTorch, just file I/O and offset arithmetic.
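The file-level pattern translates to a few lines of standard-library Python. This is a sketch of the idea using plain seek-and-read, not a port of dataloader.h: it assumes a raw, headerless file of uint16 tokens, which is not the actual llm.c shard format, and read_batch is a hypothetical helper.

```python
import os
import tempfile
from array import array


def read_batch(path, position, B, T):
    """Read B*T+1 uint16 tokens starting at token index `position`."""
    buf = array("H")                        # "H" = unsigned short (2 bytes on common platforms)
    with open(path, "rb") as f:
        f.seek(position * buf.itemsize)     # token index -> byte offset
        buf.fromfile(f, B * T + 1)
    tokens = buf.tolist()
    x = [tokens[r * T : (r + 1) * T] for r in range(B)]          # inputs
    y = [tokens[r * T + 1 : (r + 1) * T + 1] for r in range(B)]  # targets, shifted by one
    return x, y


# Write a tiny stand-in shard of raw uint16 tokens (no header).
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    array("H", range(100)).tofile(tmp)
    shard_path = tmp.name

x, y = read_batch(shard_path, position=8, B=2, T=4)
os.remove(shard_path)
print(x)  # [[8, 9, 10, 11], [12, 13, 14, 15]]
print(y)  # [[9, 10, 11, 12], [13, 14, 15, 16]]
```

Only B*T+1 tokens are ever in memory at once, which is why this style of loader scales to corpora far larger than RAM.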

Comparing the three tiers

	Tier 1 · in-memory	Tier 2 · DataLoader	Tier 3 · sharded
seen in	ng-video-lecture, makemore	makemore	build-nanogpt, llm.c
storage	one CPU tensor	custom `Dataset`	shard files on disk (`.npy` in build-nanogpt)
sampling	random starting offsets	shuffled batches from the DataLoader	sequential, advances pointer
scale fits	Tiny Shakespeare ~1MB, Names ~200KB	datasets too big for one tensor	multi-billion-token pretraining
DDP-aware	no	not by itself (workers only parallelize loading; needs a DistributedSampler)	yes, via rank offset + stride

Related

tokenization: what produces the tokens being streamed
gradient-accumulation: consumes batches from the loader
repos/build-nanogpt: DataLoaderLite + FineWeb script
repos/llm-c: C dataloader for the same data
