EVAL METHODOLOGY

HellaSwag and Multiple-Choice LLM Evals

build-nanogpt is the only repo in this corpus that includes a real downstream benchmark, HellaSwag, and uses it as the success criterion for the GPT-2 reproduction. The eval methodology is worth understanding because it's the same shape as nearly every multiple-choice LLM eval (MMLU, ARC, BBH, etc.).

What HellaSwag is

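HellaSwag (Zellers et al. 2019) is a commonsense reasoning multiple-choice dataset. Each example has a context (the opening of a short scenario), four candidate completions, and a label identifying the correct one.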

The task: pick the right completion.
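
To make the shape of one example concrete, here is a minimal sketch of a record as a Python dataclass. The field names `ctx`, `endings`, and `label` are assumed to mirror the keys in the published dataset files; the class name `HellaSwagExample`, the helper `is_correct`, and the example text are made up for illustration and are not taken from the dataset.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HellaSwagExample:
    ctx: str            # shared context: the opening of the scenario
    endings: List[str]  # the four candidate completions
    label: int          # index (0-3) of the correct completion


# Made-up toy example for illustration; NOT a real dataset entry.
example = HellaSwagExample(
    ctx="A man pours water into a kettle and switches it on. He",
    endings=[
        "waits for the water to boil.",
        "throws the kettle into the ocean.",
        "paints the kettle bright green.",
        "starts to juggle the boiling water.",
    ],
    label=0,
)


def is_correct(ex: HellaSwagExample, predicted_index: int) -> bool:
    """A prediction is correct when it matches the labeled ending."""
    return predicted_index == ex.label
```

Accuracy on the benchmark is then simply the fraction of examples for which `is_correct` holds.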

Humans: >95% (easy)
Pre-LLM models: ~50% (hard)
Frontier LLMs: ~95% (getting saturated)

How language models eval on multiple-choice

The standard trick: instead of asking the model "which answer is right?", score each completion under the model and pick the lowest-loss one.

Row 0: [context tokens] + completion A -> avg loss
Row 1: [context tokens] + completion B -> avg loss (lowest, so argmin picks this row)
Row 2: [context tokens] + completion C -> avg loss
Row 3: [context tokens] + completion D -> avg loss

One forward pass over a 4-row batch. Loss is computed only over completion tokens; the argmin row is the model's "answer."

# from build-nanogpt/train_gpt2.py
import torch.nn.functional as F  # provides F.cross_entropy, used below

def get_most_likely_row(tokens, mask, logits):
    # evaluate the autoregressive loss at all positions
    shift_logits = (logits[..., :-1, :]).contiguous()
    shift_tokens = (tokens[..., 1:]).contiguous()
    flat_shift_logits = shift_logits.view(-1, shift_logits.size(-1))
    flat_shift_tokens = shift_tokens.view(-1)
    shift_losses = F.cross_entropy(flat_shift_logits, flat_shift_tokens, reduction='none')
    shift_losses = shift_losses.view(tokens.size(0), -1)
    # now get the average loss just for the completion region (where mask == 1), in each row
    shift_mask = (mask[..., 1:]).contiguous()
    masked_shift_losses = shift_losses * shift_mask
    sum_loss = masked_shift_losses.sum(dim=1)
    avg_loss = sum_loss / shift_mask.sum(dim=1)
    # now we have a loss for each of the 4 completions
    pred_norm = avg_loss.argmin().item()
    return pred_norm

What's happening

  1. Tokenize each completion as [context_tokens, completion_tokens]. Each of the 4 options is a row in a batch.
  2. A mask marks which tokens are the completion (the part the model is being scored on) vs the context (which is "given").
  3. Run the model forward once. The logits at position i predict token i+1.
  4. Compute cross-entropy loss at every position, but mask out the context positions — we only care about the completion's loss.
  5. Average the per-token loss over the completion region of each row.
  6. The lowest average loss wins. That's the model's "answer."
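
Steps 1 and 2 above (building the batch rows and the completion mask) can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the repo's code: `render_rows` and `toy_tokenize` are hypothetical names, the whitespace tokenizer stands in for the GPT-2 BPE tokenizer, and padding with id 0 is an arbitrary choice that is harmless because padded positions get mask 0.

```python
_vocab = {}


def toy_tokenize(text):
    # Hypothetical stand-in tokenizer: one integer id per whitespace-separated
    # word, starting at 1 so that 0 stays free for padding.
    return [_vocab.setdefault(word, len(_vocab) + 1) for word in text.split()]


def render_rows(ctx, endings, tokenize, pad_id=0):
    """Return (tokens, mask): one row per ending, right-padded to equal length."""
    ctx_tokens = tokenize(ctx)
    rows, masks = [], []
    for ending in endings:
        end_tokens = tokenize(ending)
        rows.append(ctx_tokens + end_tokens)
        # 0 over the context (given), 1 over the completion (scored)
        masks.append([0] * len(ctx_tokens) + [1] * len(end_tokens))
    max_len = max(len(row) for row in rows)
    tokens = [row + [pad_id] * (max_len - len(row)) for row in rows]
    mask = [m + [0] * (max_len - len(m)) for m in masks]
    return tokens, mask


# Two endings instead of four, to keep the output short.
tokens, mask = render_rows("the cat", ["sat down", "flew to the moon"], toy_tokenize)
print(tokens)  # [[1, 2, 3, 4, 0, 0], [1, 2, 5, 6, 1, 7]]
print(mask)    # [[0, 0, 1, 1, 0, 0], [0, 0, 1, 1, 1, 1]]
```

Because the padded positions carry mask 0, they contribute nothing to the per-row sums or to the token counts used for averaging.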

This is sometimes called "normalized" or "length-normalized" log-likelihood scoring. The normalization (dividing by shift_mask.sum(dim=1)) is important: completions are different lengths, and without normalization the model would systematically prefer shorter completions.
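
Written out as a formula, the scoring rule looks like this. The notation is introduced here for illustration and does not come from the repo: the summed loss corresponds to `sum_loss` in the code above and the normalized loss to `avg_loss`.

```latex
% Notation (assumed for this sketch):
%   x^{(k)}  token sequence of row k (context followed by completion k)
%   C_k      set of positions in row k that belong to the completion (mask == 1)
\ell_k^{\text{sum}} = \sum_{t \in C_k} -\log p_\theta\!\left(x^{(k)}_t \mid x^{(k)}_{<t}\right),
\qquad
\ell_k^{\text{norm}} = \frac{\ell_k^{\text{sum}}}{|C_k|},
\qquad
\hat{k} = \arg\min_{k \in \{0,1,2,3\}} \ell_k^{\text{norm}}
```

Every term in the sum is non-negative, so the summed loss can only grow as a completion gets longer. Dividing by the number of completion tokens removes that length penalty.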

Why not just ask the model?

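You could ask "Question: ...? A) ..., B) ..., C) ..., D) ..., Answer:" and have the model generate A/B/C/D. But that format depends on the model linking each letter to its option, which small base models such as a 124M-parameter GPT-2 tend not to do reliably.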

Completion scoring is the standard method for base models. It is what lm-eval-harness uses for multiple-choice tasks, and it is how most LLM papers produce their reported benchmark numbers.

The _norm suffix

Karpathy reports acc_norm: the accuracy when picking the completion with the lowest normalized (per-token-average) loss. The two variants:

Metric     Loss used                                    Note
acc        lowest unnormalized (sum) loss               Biased toward shorter completions.
acc_norm   lowest normalized (per-token-average) loss   What's reported.

For HellaSwag specifically, acc_norm is the headline number.
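
The difference between the two metrics shows up in a small worked example. This is a sketch with made-up per-token losses; `pick` and `accuracy` are hypothetical helper names, not functions from build-nanogpt or any eval harness.

```python
def pick(losses_per_row, normalize):
    """Return the index of the row with the lowest (sum or mean) loss."""
    scores = []
    for token_losses in losses_per_row:
        total = sum(token_losses)
        scores.append(total / len(token_losses) if normalize else total)
    return min(range(len(scores)), key=scores.__getitem__)


def accuracy(examples, normalize):
    correct = sum(pick(ex["losses"], normalize) == ex["label"] for ex in examples)
    return correct / len(examples)


# Made-up per-token completion losses for two examples, four completions each.
examples = [
    # The correct row (1) is the longest: it loses on summed loss, wins on mean.
    {"label": 1, "losses": [[3.0, 3.0], [2.0, 2.0, 2.0, 2.0], [4.0, 4.0, 4.0], [5.0, 5.0]]},
    # The correct row (0) wins under both rules.
    {"label": 0, "losses": [[1.0, 1.0], [3.0, 3.0], [2.5, 2.5, 2.5], [4.0]]},
]

acc = accuracy(examples, normalize=False)
acc_norm = accuracy(examples, normalize=True)
print(acc, acc_norm)  # 0.5 1.0
```

In the first example the summed losses are 6, 8, 12, and 10, so the unnormalized rule picks the short row 0, while the per-token means are 3, 2, 4, and 5, so the normalized rule picks the correct row 1.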

GPT-3 125M: 0.337 (GPT-3 paper)
GPT-3 175B: 0.789 (GPT-3 paper)
build-nanogpt: ~0.29 (below GPT-3's 125M number)

Karpathy's build-nanogpt run reaches ~0.29 on HellaSwag, below the 0.337 that the GPT-3 paper reports for its 125M model (different training data, slightly different hyperparameters). For reference, random guessing among four options would score 0.25.

Why HellaSwag

HellaSwag was chosen by Karpathy because:

  1. It tests commonsense reasoning rather than domain knowledge, which makes the numbers easier to interpret.
  2. It's not saturated for 124M-scale models — there's signal at low accuracy levels.
  3. It has a standard eval harness widely used in the literature.
  4. The dataset is online and free.

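For frontier models, HellaSwag is now near-saturated (>95% for GPT-4). At that scale, you switch to harder evals such as MMLU, GSM8K, and HumanEval. Multiple-choice benchmarks like MMLU can be scored the same way; GSM8K and HumanEval instead check generated output (a final numeric answer, or code run against unit tests).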

What this doesn't measure

HellaSwag completion accuracy is a coarse signal. It does not measure:

Long-context capabilities: the prompts are short.
Instruction following: it's pretrained-style scoring.
Calibration: the model could be right with high or low confidence.
Out-of-distribution robustness: HellaSwag draws its contexts from a narrow set of sources (ActivityNet video captions and WikiHow articles).

For build-nanogpt's purposes — "did we reproduce GPT-2 well enough that downstream accuracy matches?" — HellaSwag is enough. For deciding whether a model is "good," you need a much larger eval suite.
