Sampling Strategies

Once a language model is trained, generating text from it is a sampling problem: at each step, the model outputs a probability distribution over the vocabulary, and you have to pick one token. The choice of sampling strategy affects diversity, coherence, and whether the model gets stuck in loops.

Greedy sampling

Pick the argmax. Deterministic, always the most likely token. In Karpathy's code this is the temperature == 0 branch:

# from llama2.c/model.py
if temperature == 0.0:
    _, idx_next = torch.topk(logits, k=1, dim=-1)

Greedy decoding picks the locally most likely token at each step, which does not guarantee the most likely sequence overall. It is famously bad for open-ended generation: it gets stuck in loops ("the the the the") and produces bland, repetitive text. It's fine for tasks where there's a single correct answer (closed-domain QA, deterministic translation), but not for "tell me a story."
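
To make the looping failure concrete, here is a minimal sketch on a made-up bigram table. The `BIGRAM` dictionary and `greedy_generate` are hypothetical illustrations, not code from llama2.c, and a real model conditions on the whole context rather than only the previous token, so its loops are less mechanical than this.

```python
# Hypothetical toy "model": next-token probabilities given only the current token.
BIGRAM = {
    "the": {"cat": 0.5, "dog": 0.3, "the": 0.2},
    "cat": {"sat": 0.6, "the": 0.4},
    "sat": {"on": 0.7, "the": 0.3},
    "on": {"the": 0.9, "cat": 0.1},
    "dog": {"sat": 0.5, "the": 0.5},
}


def greedy_generate(start, steps):
    """Always append the argmax next token; no randomness anywhere."""
    tokens = [start]
    for _ in range(steps):
        dist = BIGRAM[tokens[-1]]
        tokens.append(max(dist, key=dist.get))
    return tokens


out = greedy_generate("the", 8)
print(" ".join(out))  # the cat sat on the cat sat on the
```

Because the choice is deterministic, the sequence enters a cycle as soon as it revisits a state, and "dog" is never produced even though it has 30% probability after "the".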

Temperature

The simplest stochastic strategy: divide the logits by a temperature T, apply softmax, sample multinomially.

logits = logits / temperature
probs = F.softmax(logits, dim=-1)
idx_next = torch.multinomial(probs, num_samples=1)

T = 1.0: sample from the raw distribution.
T → 0: peaky distribution, approaches greedy.
T > 1: flatter distribution, more diverse / weirder samples.
T < 1: sharper distribution, fewer wild words.

Caveat. Temperature alone has a problem: even at T = 1, the long tail of low-probability tokens collectively has non-trivial mass, and you can sample a junk token. So temperature is usually combined with a truncation strategy.
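
The effect of T, and the long-tail caveat, can be checked numerically. The sketch below uses only the standard library instead of PyTorch. The helper name `softmax_with_temperature` and all logit values are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by T, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0, -1.0]  # made-up logits for a 4-token vocabulary
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")

# Long-tail caveat: one strong token plus 50,000 individually unlikely tokens.
# Each tail token is e^10 (about 22,000) times less likely than the head token,
# yet together they hold most of the probability mass at T = 1.
tail_logits = [10.0] + [0.0] * 50_000
tail_probs = softmax_with_temperature(tail_logits, 1.0)
tail_mass = sum(tail_probs[1:])
print(f"mass outside the top token: {tail_mass:.3f}")
```

In this toy setup roughly 69% of the mass sits in the tail, which is the situation truncation strategies are meant to handle.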

Top-k sampling

Keep only the k highest-logit tokens, mask everything else to -inf (so it gets zero probability after the softmax), and sample from what remains. From nanoGPT/model.py:

if top_k is not None:
    v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
    logits[logits < v[:, [-1]]] = -float('Inf')
probs = F.softmax(logits, dim=-1)
idx_next = torch.multinomial(probs, num_samples=1)

Typical k: 40 or 50. The Hugging Face generation pipeline defaults to top_k=50. This guarantees you never sample from the long tail of garbage. The downside: k is fixed regardless of context. Sometimes the model is genuinely uncertain and you want to sample from 200 plausible tokens; sometimes it's extremely confident and even 5 tokens is too many.

Top-p (nucleus) sampling

Holtzman et al. 2020. Sort tokens by descending probability; take the smallest set whose cumulative probability exceeds p; sample from that set. Adaptive: the set shrinks when the model is confident and grows when it's not.

llama2.c/run.c implements top-p in C (the function sample_topp). The recommended Llama defaults are temperature = 1.0, top_p = 0.9.
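
As a rough illustration of the sort-and-accumulate procedure described above, here is a minimal pure-Python sketch. It is not a port of `sample_topp` from run.c. The names `softmax`, `top_p_candidates`, and `sample_top_p` are hypothetical, and the example probabilities are made up.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_candidates(probs, p):
    """Return (token_id, prob) pairs for the smallest top-ranked set whose mass exceeds p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for token_id, prob in ranked:
        kept.append((token_id, prob))
        cumulative += prob
        if cumulative > p:
            break
    return kept  # if the threshold is never crossed (e.g. p = 1.0), every token is kept


def sample_top_p(logits, p, rng=random):
    kept = top_p_candidates(softmax(logits), p)
    ids = [token_id for token_id, _ in kept]
    weights = [prob for _, prob in kept]
    # random.choices treats weights as relative, so no explicit renormalization is needed.
    return rng.choices(ids, weights=weights, k=1)[0]


probs = [0.5, 0.3, 0.15, 0.05]
print([token_id for token_id, _ in top_p_candidates(probs, 0.9)])  # [0, 1, 2]
```

With p = 0.9 the first three tokens are kept (cumulative mass 0.95) and the last token can never be sampled.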

Top-k (fixed window): k is fixed regardless of context. Same size cut whether the model is confident or uncertain.

Top-p (adaptive window): The set shrinks when the model is confident and grows when it's not, because cumulative probability adapts to the distribution's shape.
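
The difference shows up when counting how many tokens survive each cut. This sketch compares the two on a made-up confident distribution and a made-up uniform one. The function names, k = 5, and p = 0.85 are arbitrary choices for illustration.

```python
def top_k_size(probs, k):
    """Number of candidates top-k keeps: always k (capped by vocabulary size)."""
    return min(k, len(probs))


def top_p_size(probs, p):
    """Number of candidates top-p keeps: smallest prefix whose mass exceeds p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative > p:
            return count
    return len(probs)


confident = [0.91] + [0.01] * 9  # one dominant token in a 10-token vocabulary
uncertain = [0.1] * 10           # uniform over the same vocabulary

for name, probs in (("confident", confident), ("uncertain", uncertain)):
    print(name, "top-k:", top_k_size(probs, 5), "top-p:", top_p_size(probs, 0.85))
# confident top-k: 5 top-p: 1
# uncertain top-k: 5 top-p: 9
```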

Top-k and top-p together (or just one)

Common practice is to apply temperature, then either top-k or top-p (sometimes both). Karpathy's llama2.c README puts it bluntly:

to control the diversity of samples use either the temperature (i.e. vary -t between 0 and 1 and keep top-p off with -p 0) or the top-p value (i.e. vary -p between 0 and 1 and keep -t 1), but not both.

This is good practical advice. Tuning both at once is fiddly and the parameters interact in confusing ways.
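
For reference, one plausible ordering of the steps (temperature, then optional top-k, then optional top-p) can be sketched in pure Python as below. The function `sample_next` and its parameters are hypothetical. It assumes top_k >= 1 when given, and it measures top-p mass on the distribution renormalized after the top-k cut, which is a design choice on which implementations may differ.

```python
import math
import random


def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick one token id: greedy if temperature == 0, otherwise truncated sampling."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])

    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    # (weight, token_id) pairs, most likely first
    ranked = sorted(zip(weights, range(len(weights))), reverse=True)

    if top_k is not None:
        ranked = ranked[:top_k]

    if top_p is not None:
        total = sum(w for w, _ in ranked)
        kept, cumulative = [], 0.0
        for w, token_id in ranked:
            kept.append((w, token_id))
            cumulative += w / total
            if cumulative > top_p:
                break
        ranked = kept

    ids = [token_id for _, token_id in ranked]
    kept_weights = [w for w, _ in ranked]
    # Weights are relative, so the surviving tokens are implicitly renormalized.
    return rng.choices(ids, weights=kept_weights, k=1)[0]


logits = [3.0, 2.0, 1.0, 0.0, -1.0]  # made-up logits
rng = random.Random(0)
print(sample_next(logits, temperature=0.8, rng=rng))  # vary temperature, truncation off
print(sample_next(logits, top_p=0.9, rng=rng))        # vary top-p, temperature left at 1
```

The two calls at the end follow the README's advice of tuning one knob at a time.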

Build-nanogpt's pipeline

build-nanogpt/train_gpt2.py draws samples mid-training like this:

# do top-k sampling of 50 (huggingface pipeline default)
topk_probs, topk_indices = torch.topk(probs, 50, dim=-1)
ix = torch.multinomial(topk_probs, 1, generator=sample_rng)
xcol = torch.gather(topk_indices, -1, ix)

Implementation note. torch.multinomial doesn't need a normalized input. The top-k probs sum to less than 1 (because the rest of the distribution was discarded), and multinomial just treats them as relative weights. The code skips the explicit "zero out, renormalize, then sample" steps, but the resulting sampling distribution is the same. Because ix indexes into the 50 kept entries rather than the vocabulary, torch.gather maps it back to a token id.
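
The equivalence is easy to check empirically. In this sketch Python's `random.choices` stands in for torch.multinomial, since it also treats its input as relative weights. The three probabilities are made up and deliberately sum to 0.8.

```python
import random
from collections import Counter

# Made-up top-3 probabilities left after truncation; they sum to about 0.8, not 1.
topk_probs = [0.5, 0.2, 0.1]
total = sum(topk_probs)
renormalized = [p / total for p in topk_probs]  # roughly 0.625, 0.25, 0.125

# Sample directly from the unnormalized weights.
rng = random.Random(0)
n = 20_000
counts = Counter(rng.choices(range(3), weights=topk_probs, k=n))
empirical = [counts[i] / n for i in range(3)]

print("renormalized:", [round(p, 3) for p in renormalized])
print("empirical:   ", [round(p, 3) for p in empirical])
```

The empirical frequencies from the unnormalized weights land close to the renormalized probabilities, because each token is drawn with probability weight / sum(weights) either way.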

Other strategies

Strategies that the Karpathy corpus doesn't cover but that are worth knowing about:

Beam search (sequence-level): Maintain k candidate sequences, expand each by one token, keep the top k by joint probability. Standard for translation; rarely used in open-ended LLM generation because it produces low-diversity, bland text.

Min-p (truncation): Like top-p but parameterized by a minimum relative probability: keep tokens with prob ≥ p * max_prob. Cleaner than top-p in some regimes.

Repetition penalty (post-hoc adjustment): Lower the logits of tokens that appeared recently. A common form (from the CTRL paper) uses a penalty > 1, dividing positive logits by it and multiplying negative logits by it. Simply multiplying every logit by a factor < 1 would make tokens with negative logits more likely, not less. Hacky but effective against degenerate repetition.
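
Min-p and the repetition penalty are both small enough to sketch in a few lines. The function names below are hypothetical. The penalty follows the CTRL-style convention (penalty > 1, applied so that a penalized token always becomes less likely whatever the sign of its logit), which is one common formulation rather than the only one.

```python
def min_p_filter(probs, min_p):
    """Keep tokens with prob >= min_p * max(probs), then renormalize."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


def apply_repetition_penalty(logits, recent_token_ids, penalty):
    """CTRL-style penalty (> 1): shrink positive logits, push negative ones lower."""
    out = list(logits)
    for token_id in set(recent_token_ids):
        if out[token_id] > 0:
            out[token_id] /= penalty
        else:
            out[token_id] *= penalty
    return out


filtered = min_p_filter([0.5, 0.3, 0.15, 0.05], min_p=0.2)  # threshold is 0.1
print([round(p, 3) for p in filtered])

penalized = apply_repetition_penalty([2.0, -1.0, 0.5], recent_token_ids=[0, 1], penalty=2.0)
print(penalized)  # [1.0, -2.0, 0.5]
```

Like top-p, min-p adapts to the shape of the distribution: a confident model has a large max_prob, so the threshold rises and fewer tokens survive.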

Related