Sampling Strategies
Once a language model is trained, generating text from it is a sampling problem: at each step, the model outputs a probability distribution over the vocabulary, and you have to pick one token. The choice of sampling strategy affects diversity, coherence, and whether the model gets stuck in loops.
Greedy sampling
Pick the argmax. Deterministic, always the most likely token. In
Karpathy's code this is the temperature == 0 branch:
# from llama2.c/model.py
if temperature == 0.0:
    _, idx_next = torch.topk(logits, k=1, dim=-1)
Greedy decoding picks the locally most likely token at every step, which does not guarantee the most likely sequence overall. It is famously bad for open-ended generation: it gets stuck in loops ("the the the the") and produces bland, repetitive text. It's fine for tasks where there's a single correct answer (closed-domain QA, deterministic translation), but not for "tell me a story."
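The looping behavior is easy to reproduce with a toy example. The sketch below is not from any of the repos discussed here: `TOY_LOGITS` is a made-up lookup table standing in for a model whose next-token scores depend only on the previous token, and `greedy_next` / `greedy_decode` are hypothetical helper names.

```python
# Hypothetical toy "model": next-token scores depend only on the previous token.
TOY_LOGITS = {
    "the": {"cat": 2.0, "dog": 1.5, "the": 0.1},
    "cat": {"sat": 1.0, "and": 1.2, "the": 0.3},
    "and": {"the": 2.5, "cat": 0.2, "dog": 0.1},
    "dog": {"sat": 1.1, "and": 0.9, "the": 0.2},
    "sat": {"and": 1.4, "the": 1.0, "cat": 0.1},
}

def greedy_next(prev):
    """Return the highest-scoring next token (the argmax)."""
    scores = TOY_LOGITS[prev]
    return max(scores, key=scores.get)

def greedy_decode(start, steps):
    """Generate `steps` tokens after `start`, always taking the argmax."""
    tokens = [start]
    for _ in range(steps):
        tokens.append(greedy_next(tokens[-1]))
    return tokens

sample = greedy_decode("the", 9)
print(" ".join(sample))  # the cat and the cat and the cat and the
```

Because every choice is deterministic, once the sequence revisits a state it has seen before it repeats forever. A real model conditions on the whole context rather than one token, but the same failure shows up in practice.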
Temperature
The simplest stochastic strategy: divide the logits by a temperature
T, apply softmax, sample multinomially.
logits = logits / temperature
probs = F.softmax(logits, dim=-1)
idx_next = torch.multinomial(probs, num_samples=1)
- T = 1.0: sample from the raw distribution.
- T → 0: peaky distribution, approaches greedy.
- T > 1: flatter distribution, more diverse / weirder samples.
- T < 1: sharper distribution, fewer wild words.
Even at T = 1, the long tail of low-probability tokens collectively
has non-trivial mass, and you can sample a junk token. So temperature is
usually combined with a truncation strategy.
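To make the tail-mass point concrete, here is a small sketch in plain Python (no PyTorch). The logits are invented for illustration: two strong candidates plus 1000 weak tokens. `softmax_with_temperature` is a hypothetical helper, not a function from any of the repos.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits: one strong candidate, one runner-up, and a tail of 1000 weak tokens.
logits = [8.0, 7.0] + [0.0] * 1000

tail_mass = {}
for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    tail_mass[t] = sum(probs[2:])  # total probability of the 1000 weak tokens
    print(f"T={t}: p(top)={probs[0]:.3f}, tail mass={tail_mass[t]:.4f}")
```

With these numbers the tail holds roughly 0.01% of the mass at T = 0.5, about 20% at T = 1, and about 76% at T = 1.5. Each tail token is individually unlikely, but together they are sampled often.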
Top-k sampling
Keep only the top k highest-logit tokens; zero out
everything else; renormalize and sample. From nanoGPT/model.py:
if top_k is not None:
    v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
    logits[logits < v[:, [-1]]] = -float('Inf')
probs = F.softmax(logits, dim=-1)
idx_next = torch.multinomial(probs, num_samples=1)
Typical k: 40 or 50. The Hugging Face generation pipeline
defaults to top_k=50. This guarantees you never sample
from the long tail of garbage. The downside: k is fixed
regardless of context. Sometimes the model is genuinely uncertain and
you want to sample from 200 plausible tokens; sometimes it's extremely
confident and even 5 tokens is too many.
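The same masking idea can be sketched for a single list of logits in plain Python. This is an illustration under assumptions, not the nanoGPT code: `top_k_probs` is a hypothetical helper, the logits are made up, and `random.choices` stands in for `torch.multinomial`.

```python
import math
import random

def top_k_probs(logits, k):
    """Softmax over the k largest logits; every other token gets probability 0."""
    k = max(1, min(k, len(logits)))
    cutoff = sorted(logits, reverse=True)[k - 1]  # k-th largest logit
    # Anything strictly below the cutoff is masked to -inf, so ties at the
    # cutoff are all kept (the same comparison the tensor version uses).
    kept = [x if x >= cutoff else -math.inf for x in logits]
    m = max(kept)
    exps = [math.exp(x - m) for x in kept]  # math.exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # made-up scores for a 5-token vocabulary
probs = top_k_probs(logits, k=2)

rng = random.Random(0)
idx_next = rng.choices(range(len(probs)), weights=probs, k=1)[0]
print(probs, idx_next)
```

Only the two highest-scoring tokens keep non-zero probability, so `idx_next` can only ever be 0 or 1 here, whatever the seed.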
Top-p (nucleus) sampling
Holtzman et al. 2020. Sort tokens by descending probability; take the
smallest set whose cumulative probability exceeds p;
sample from that set. Adaptive: the set shrinks when
the model is confident and grows when it's not.
llama2.c/run.c implements top-p in C (the function
sample_topp). run.c defaults to temperature = 1.0 and
top-p = 0.9.
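A minimal Python sketch of the nucleus selection follows. It is not a port of sample_topp, which is written in C and organized differently. `top_p_filter` and the two example distributions are invented for illustration.

```python
def top_p_filter(probs, p):
    """Return {token_index: renormalized_prob} for the smallest set of tokens
    whose cumulative probability exceeds p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative > p:
            break
    # Renormalize so the kept probabilities sum to 1.
    return {i: probs[i] / cumulative for i in nucleus}

confident = [0.95, 0.02, 0.01, 0.01, 0.01]  # model is nearly sure
uncertain = [0.25, 0.22, 0.20, 0.18, 0.15]  # model is spread out

small = top_p_filter(confident, p=0.9)
large = top_p_filter(uncertain, p=0.9)
print(len(small), len(large))  # 1 5
```

With the same p = 0.9, the nucleus is a single token for the confident distribution and all five tokens for the uncertain one. Sampling from the result is then a weighted draw over the surviving indices.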
- Top-k: k is fixed regardless of context. Same size cut whether the model is confident or uncertain.
- Top-p: the set shrinks when the model is confident and grows when it's not, because cumulative probability adapts to the distribution's shape.
Top-k and top-p together (or just one)
Common practice is to apply temperature, then either top-k or top-p
(sometimes both). Karpathy's llama2.c README puts it
bluntly:
to control the diversity of samples use either the temperature (i.e. vary
-t between 0 and 1 and keep top-p off with -p 0) or the top-p value (i.e. vary -p between 0 and 1 and keep -t 1), but not both.
This is good practical advice. Tuning both at once is fiddly and the parameters interact in confusing ways.
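For cases where all the knobs are exposed anyway, the sketch below shows one plausible way to chain them: greedy when temperature is 0, otherwise temperature, then top-k, then top-p. The function name `sample_next`, its signature, and the ordering are assumptions made for illustration. Libraries differ on the order and on whether they renormalize between steps, so this is not a description of any particular implementation.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick a token index: greedy if temperature == 0, else temperature -> top-k -> top-p."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature-scaled softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Candidate indices, most probable first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:max(1, top_k)]  # keeps exactly k; ties broken by sort order
    if top_p is not None and 0.0 < top_p < 1.0:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative > top_p:
                break
        order = kept

    # random.choices treats weights as relative, so no explicit renormalization is needed.
    return rng.choices(order, weights=[probs[i] for i in order], k=1)[0]

# Example: only indices 2 and 3 can come out once top_k=2 is applied.
print(sample_next([0.0, 0.0, 5.0, 4.0], temperature=0.8, top_k=2, rng=random.Random(0)))
```

Passing `top_k=None` and `top_p=None` reduces this to plain temperature sampling, which makes it easy to follow the README's advice and vary only one knob at a time.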
Build-nanogpt's pipeline
build-nanogpt/train_gpt2.py draws its mid-training samples like this:
# do top-k sampling of 50 (huggingface pipeline default)
topk_probs, topk_indices = torch.topk(probs, 50, dim=-1)
ix = torch.multinomial(topk_probs, 1, generator=sample_rng)
xcol = torch.gather(topk_indices, -1, ix)
torch.multinomial doesn't need a normalized input. The top-k probs sum to less than 1
(because the rest of the distribution was discarded), and multinomial
just treats them as weights. The mechanics differ slightly from "zero out,
renormalize, then sample," but the resulting distribution is the same.
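The equivalence can be checked empirically. The sketch below uses Python's `random.choices` as a stand-in for `torch.multinomial` (both accept non-negative weights that need not sum to 1). The 5-token distribution is made up for illustration.

```python
import random
from collections import Counter

# Hypothetical full distribution over a 5-token vocabulary.
probs = [0.5, 0.3, 0.1, 0.06, 0.04]

# Keep the top 2 without renormalizing; these weights sum to 0.8, not 1.
k = 2
topk_indices = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
topk_weights = [probs[i] for i in topk_indices]

# What "zero out and renormalize" would give for the same two tokens
# (about 0.625 and 0.375).
renormalized = [w / sum(topk_weights) for w in topk_weights]

# Draw with the unnormalized weights and compare observed frequencies.
rng = random.Random(0)
draws = rng.choices(topk_indices, weights=topk_weights, k=50_000)
counts = Counter(draws)
empirical = [counts[i] / len(draws) for i in topk_indices]

print("renormalized:", renormalized)
print("empirical:   ", empirical)
```

The observed frequencies land close to the renormalized probabilities, because a weighted draw only depends on the ratios between the weights.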
Other strategies
Strategies the Karpathy corpus doesn't cover but that are worth knowing about:
- Beam search: maintain k candidate sequences, expand each by one token, keep the top k by joint probability. Standard for translation; not used in open-ended LLM generation because it produces low-diversity, bland text.
- Min-p sampling: keep only tokens whose probability is at least p * max_prob. Cleaner than top-p in some regimes.
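The p * max_prob threshold rule (commonly called min-p sampling) is short enough to sketch directly. `min_p_filter` and the example distributions below are hypothetical, written only to illustrate the rule as described above.

```python
def min_p_filter(probs, p):
    """Keep tokens with probability >= p * max(probs); renormalize the survivors."""
    threshold = p * max(probs)
    kept = {i: q for i, q in enumerate(probs) if q >= threshold}
    total = sum(kept.values())
    return {i: q / total for i, q in kept.items()}

confident = [0.90, 0.05, 0.03, 0.02]  # threshold at p=0.1 is 0.09
uncertain = [0.30, 0.28, 0.22, 0.20]  # threshold at p=0.1 is 0.03

print(min_p_filter(confident, p=0.1))  # only token 0 survives
print(min_p_filter(uncertain, p=0.1))  # all four tokens survive
```

Like top-p, the cut adapts to the shape of the distribution, but it is defined relative to the single most likely token rather than to a cumulative sum.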
Related
- transformer-block — produces the logits
- kv-cache — makes sampling fast
- repos/llama2-c — top-p in pure C
- repos/nanoGPT — top-k in PyTorch