HellaSwag and Multiple-Choice LLM Evals
build-nanogpt is the only repo in this corpus that includes a real downstream benchmark, HellaSwag, and uses it as the success criterion for the GPT-2 reproduction. The eval methodology is worth understanding because it's the same shape as nearly every multiple-choice LLM eval (MMLU, ARC, BBH, etc.).
What HellaSwag is
HellaSwag (Zellers et al. 2019) is a commonsense reasoning multiple-choice dataset. Each example has:
- A context sentence describing a situation.
- Four completion candidates, exactly one of which is the actual continuation (the others are adversarially generated to fool weaker models).
The task: pick the right completion.
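As a rough sketch of what one record looks like, assuming the field names `ctx`, `endings`, and `label` used in the published HellaSwag JSONL files. The class name, helper, and example text below are invented for illustration and are not a real dataset row.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HellaSwagExample:
    ctx: str            # context describing the situation
    endings: List[str]  # four candidate completions
    label: int          # index (0-3) of the true continuation


# Invented text for illustration only; not a real dataset row.
example = HellaSwagExample(
    ctx="A man is kneeling on a frozen lake. He",
    endings=[
        "starts juggling flaming torches.",
        "drills a hole in the ice with an auger.",
        "dives into a swimming pool.",
        "paints the fence behind his house.",
    ],
    label=1,
)


def is_correct(ex: HellaSwagExample, predicted: int) -> bool:
    """The task reduces to: does the predicted index equal the label?"""
    return predicted == ex.label
```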
How language models eval on multiple-choice
The standard trick: instead of asking the model "which answer is right?", score each completion under the model and pick the lowest-loss one.
```python
# from build-nanogpt/train_gpt2.py
import torch.nn.functional as F  # provides F.cross_entropy used below

def get_most_likely_row(tokens, mask, logits):
    # evaluate the autoregressive loss at all positions
    shift_logits = (logits[..., :-1, :]).contiguous()
    shift_tokens = (tokens[..., 1:]).contiguous()
    flat_shift_logits = shift_logits.view(-1, shift_logits.size(-1))
    flat_shift_tokens = shift_tokens.view(-1)
    shift_losses = F.cross_entropy(flat_shift_logits, flat_shift_tokens, reduction='none')
    shift_losses = shift_losses.view(tokens.size(0), -1)
    # now get the average loss just for the completion region (where mask == 1), in each row
    shift_mask = (mask[..., 1:]).contiguous()
    masked_shift_losses = shift_losses * shift_mask
    sum_loss = masked_shift_losses.sum(dim=1)
    avg_loss = sum_loss / shift_mask.sum(dim=1)
    # now we have a loss for each of the 4 completions
    pred_norm = avg_loss.argmin().item()
    return pred_norm
```
What's happening
- Tokenize each completion as `[context_tokens, completion_tokens]`. Each of the 4 options is a row in a batch.
- A `mask` marks which tokens are the completion (the part the model is being scored on) vs the context (which is "given").
- Run the model forward once. The logits at position `i` predict token `i+1`.
- Compute cross-entropy loss at every position, but mask out the context positions, since we only care about the completion's loss.
- Average the per-token loss over the completion region of each row.
- The lowest average loss wins. That's the model's "answer."
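The first two steps above (one row per option, plus a completion mask) can be sketched in plain Python. This is a minimal illustration, not the repo's code: `toy_tokenize` is a hypothetical stand-in for a real tokenizer, and right-padding both tokens and mask with 0 is an assumption of this sketch.

```python
from typing import List, Tuple


def toy_tokenize(text: str) -> List[int]:
    """Hypothetical stand-in for a real tokenizer: one id per whitespace-separated word."""
    return [sum(ord(c) for c in word) % 1000 + 1 for word in text.split()]


def build_rows(ctx: str, endings: List[str]) -> Tuple[List[List[int]], List[List[int]]]:
    ctx_tokens = toy_tokenize(ctx)
    rows, masks = [], []
    for ending in endings:
        end_tokens = toy_tokenize(ending)
        # Row = shared context followed by this option's completion.
        rows.append(ctx_tokens + end_tokens)
        # Mask = 0 over the context, 1 over the completion being scored.
        masks.append([0] * len(ctx_tokens) + [1] * len(end_tokens))
    # Right-pad to a common length so the rows stack into one batch.
    # Padded positions get mask 0, so they never contribute to the loss.
    max_len = max(len(r) for r in rows)
    tokens = [r + [0] * (max_len - len(r)) for r in rows]
    mask = [m + [0] * (max_len - len(m)) for m in masks]
    return tokens, mask


# Invented example text, for illustration only.
tokens, mask = build_rows(
    "A man is kneeling on a frozen lake. He",
    ["drills a hole in the ice.", "dives into a pool.", "sings.", "paints a long wooden fence."],
)
```

Because every row shares the same context prefix, the only thing that differs between rows is the masked completion region, which is exactly what gets scored.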
This is sometimes called "normalized" or "length-normalized" log-likelihood scoring. The normalization (dividing by `shift_mask.sum(dim=1)`, the number of completion tokens in each row) is important: completions have different lengths, and a summed loss grows with every extra token, so without normalization the model would systematically prefer shorter completions.
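A small worked example of the length bias, using made-up per-token losses (the numbers and variable names are hypothetical, chosen only to make the effect visible):

```python
from typing import List


def sum_loss(token_losses: List[float]) -> float:
    return sum(token_losses)


def avg_loss(token_losses: List[float]) -> float:
    return sum(token_losses) / len(token_losses)


# Hypothetical per-token cross-entropy losses for two completions.
short_bad = [3.0, 3.0]                  # 2 tokens, each fairly unlikely
long_good = [1.5, 1.5, 1.5, 1.5, 1.5]   # 5 tokens, each fairly likely

candidates = [short_bad, long_good]

# Summed loss: 6.0 vs 7.5, so the short completion wins despite worse tokens.
pick_by_sum = min(range(len(candidates)), key=lambda i: sum_loss(candidates[i]))

# Per-token average: 3.0 vs 1.5, so the better-predicted completion wins.
pick_by_avg = min(range(len(candidates)), key=lambda i: avg_loss(candidates[i]))
```

Here `pick_by_sum` selects the short completion and `pick_by_avg` selects the long one, even though the model finds every token of the long completion more plausible.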
Why not just ask the model?
You could ask "Question: ...? A) ..., B) ..., C) ..., D) ..., Answer:" and have the model generate A/B/C/D. But:
- Base models (pretrained, not chat-tuned) aren't necessarily fluent at multiple-choice format.
- The completion-scoring method works on any language model, no instruction-following needed.
- Greedy generation can be wrong even when the model "knows" the answer — the right token might be the second-most-likely.
Completion scoring is the standard method for base models. It's what lm-eval-harness uses, and it's how most LLM papers produce the benchmark numbers they report.
The _norm suffix
Karpathy reports acc_norm — the accuracy when picking the completion with lowest normalized (per-token-average) loss. Other versions:
| Metric | Loss used | Note |
|---|---|---|
| `acc` | lowest unnormalized (sum) loss | Biased toward shorter completions. |
| `acc_norm` | lowest normalized (per-token-average) loss | What's reported. |
For HellaSwag specifically, acc_norm is the headline number.
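To make the two metrics concrete, here is a minimal sketch that accumulates both over a tiny set of examples. It assumes the per-token completion losses have already been computed for each option. The function names and all numbers are hypothetical.

```python
from typing import List, Tuple


def score_example(option_losses: List[List[float]]) -> Tuple[int, int]:
    """Return (pred, pred_norm): argmin of summed loss and of per-token average loss."""
    sums = [sum(losses) for losses in option_losses]
    avgs = [sum(losses) / len(losses) for losses in option_losses]
    return sums.index(min(sums)), avgs.index(min(avgs))


def evaluate(examples: List[Tuple[List[List[float]], int]]) -> Tuple[float, float]:
    """Return (acc, acc_norm) over (option_losses, label) pairs."""
    num_total = num_correct = num_correct_norm = 0
    for option_losses, label in examples:
        pred, pred_norm = score_example(option_losses)
        num_total += 1
        num_correct += int(pred == label)
        num_correct_norm += int(pred_norm == label)
    return num_correct / num_total, num_correct_norm / num_total


# Hypothetical per-token losses for the four options of two examples.
examples = [
    # Summed loss picks option 1 (short); averaged loss picks option 2 (the label).
    ([[2.0, 2.0, 2.0, 2.0], [3.0, 3.0], [1.2] * 6, [4.0, 4.0, 4.0]], 2),
    # Both metrics pick option 0 (the label).
    ([[1.0, 1.0], [2.0, 2.0, 2.0], [2.5, 2.5], [3.0, 3.0, 3.0]], 0),
]

acc, acc_norm = evaluate(examples)
```

On this toy data `acc` is 0.5 and `acc_norm` is 1.0. The gap comes entirely from the first example, where the shortest option has the lowest summed loss but not the lowest per-token loss.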
Karpathy's build-nanogpt run reaches ~0.29 on HellaSwag, slightly below the GPT-3 paper's 124M number (different training data, slightly different hyperparameters).
Why HellaSwag
HellaSwag was chosen by Karpathy because:
- It's commonsense reasoning, not domain knowledge. Easier to interpret loss numbers.
- It's not saturated for 124M-scale models — there's signal at low accuracy levels.
- It has a standard eval harness widely used in the literature.
- The dataset is online and free.
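For frontier models, HellaSwag is now near-saturated (>95% for GPT-4). At that scale, you switch to harder evals such as MMLU, GSM8K, and HumanEval. MMLU keeps the multiple-choice shape; GSM8K and HumanEval instead grade generated output (a final numeric answer, or code run against unit tests), so the scoring method changes along with the difficulty.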
What this doesn't measure
HellaSwag completion accuracy is a coarse signal. It does not measure:
- Long-context capabilities: the prompts are short.
- Instruction following: it's pretrained-style scoring.
- Calibration: the model could be right with high or low confidence.
- Out-of-distribution robustness: HellaSwag's contexts come from just two sources (ActivityNet captions and WikiHow).
For build-nanogpt's purposes — "did we reproduce GPT-2 well enough that downstream accuracy matches?" — HellaSwag is enough. For deciding whether a model is "good," you need a much larger eval suite.
Related
- repos/build-nanogpt — where HellaSwag eval lives
- sampling — at-train sampling is different from eval scoring
- training-stability — eval is the validation signal training stability cares about