EVAL METHODOLOGY

HellaSwag and Multiple-Choice LLM Evals

build-nanogpt is the only repo in this corpus that includes a real downstream benchmark, HellaSwag, and uses it as the success criterion for the GPT-2 reproduction. The eval methodology is worth understanding because it's the same shape as nearly every multiple-choice LLM eval (MMLU, ARC, BBH, etc.).

What HellaSwag is

HellaSwag (Zellers et al. 2019) is a commonsense reasoning multiple-choice dataset. Each example has:

A context sentence describing a situation.
Four completion candidates, exactly one of which is the actual continuation (the others are adversarially generated to fool weaker models).

The task: pick the right completion.
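
To make the shape of one example concrete, here is a small sketch of a HellaSwag-style record. The field names (`ctx`, `endings`, `label`) are assumed to match the released JSONL files. The sentence text is invented for illustration, and `candidate_texts` is a hypothetical helper, not part of any library.

```python
# Hypothetical HellaSwag-style record; the text below is invented for illustration.
example = {
    "ctx": "A man is standing on a ladder next to a house. He",
    "endings": [
        "jumps into a swimming pool.",
        "starts cleaning leaves out of the gutter.",
        "plays a violin for the crowd.",
        "bakes a cake in the oven.",
    ],
    "label": 1,  # index of the true continuation
}

def candidate_texts(ex):
    """Return the four full strings (context + ending) the model will score."""
    return [ex["ctx"] + " " + ending for ending in ex["endings"]]

candidates = candidate_texts(example)
correct = candidates[example["label"]]
```

All four candidates share the same context, so any difference in score comes from the ending alone.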

Solver	Accuracy	Status
Humans	>95%	easy
Pre-LLM models	~50%	hard
Frontier LLMs	~95%	getting saturated

How language models eval on multiple-choice

The standard trick: instead of asking the model "which answer is right?", score each completion under the model and pick the lowest-loss one.

Row	Input	Avg completion loss
0	[context tokens] + completion A	higher
1	[context tokens] + completion B	lowest (argmin)
2	[context tokens] + completion C	higher
3	[context tokens] + completion D	higher

One forward pass over a 4-row batch. Loss is computed only over completion tokens; the argmin row is the model's "answer."

# from build-nanogpt/train_gpt2.py
import torch.nn.functional as F  # needed for F.cross_entropy below

def get_most_likely_row(tokens, mask, logits):
    # evaluate the autoregressive loss at all positions
    shift_logits = (logits[..., :-1, :]).contiguous()
    shift_tokens = (tokens[..., 1:]).contiguous()
    flat_shift_logits = shift_logits.view(-1, shift_logits.size(-1))
    flat_shift_tokens = shift_tokens.view(-1)
    shift_losses = F.cross_entropy(flat_shift_logits, flat_shift_tokens, reduction='none')
    shift_losses = shift_losses.view(tokens.size(0), -1)
    # now get the average loss just for the completion region (where mask == 1), in each row
    shift_mask = (mask[..., 1:]).contiguous()
    masked_shift_losses = shift_losses * shift_mask
    sum_loss = masked_shift_losses.sum(dim=1)
    avg_loss = sum_loss / shift_mask.sum(dim=1)
    # now we have a loss for each of the 4 completions
    pred_norm = avg_loss.argmin().item()
    return pred_norm

What's happening

Tokenize each completion as [context_tokens, completion_tokens]. Each of the 4 options is a row in a batch.
A mask marks which tokens are the completion (the part the model is being scored on) vs the context (which is "given").
Run the model forward once. The logits at position i predict token i+1.
Compute cross-entropy loss at every position, but mask out the context positions — we only care about the completion's loss.
Average the per-token loss over the completion region of each row.
The lowest average loss wins. That's the model's "answer."
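
The first steps above (building the batch, the mask, and the one-position shift) can be sketched in plain Python. This is an illustrative sketch, not the repo's code: `build_rows` and `shifted_targets` are hypothetical helpers, the integer token ids are made up, and padding with 0 is an assumption.

```python
def build_rows(ctx_tokens, ending_tokens_list, pad_id=0):
    """Return (tokens, mask): one row per candidate, right-padded to equal length."""
    rows = [ctx_tokens + end for end in ending_tokens_list]
    # mask is 0 over the context and 1 over the completion
    masks = [[0] * len(ctx_tokens) + [1] * len(end) for end in ending_tokens_list]
    max_len = max(len(r) for r in rows)
    tokens = [r + [pad_id] * (max_len - len(r)) for r in rows]
    mask = [m + [0] * (max_len - len(m)) for m in masks]  # padding is never scored
    return tokens, mask

def shifted_targets(tokens_row, mask_row):
    """Pair prediction position i with target token i+1 and that target's mask bit."""
    return [(i, tokens_row[i + 1], mask_row[i + 1])
            for i in range(len(tokens_row) - 1)]

ctx = [11, 12, 13]                                # made-up context token ids
endings = [[21, 22], [31], [41, 42, 43], [51, 52]]  # made-up completion token ids
tokens, mask = build_rows(ctx, endings)
scored = [t for t in shifted_targets(tokens[1], mask[1]) if t[2] == 1]
```

For row 1, the only scored entry is position 2 predicting token 31: the logits at the last context position are what predict the first completion token, which is why the mask is shifted together with the tokens.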

This is sometimes called "normalized" or "length-normalized" log-likelihood scoring. The normalization (dividing by shift_mask.sum(dim=1)) matters because completions differ in length: a summed loss grows with every extra token, so without normalization the scoring would systematically favor shorter completions.
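
A toy example shows the length bias directly. The per-token losses below are invented numbers, and `pick` is a hypothetical helper; the point is only that sum and mean can select different rows.

```python
def pick(per_token_losses, normalize):
    """Return the index of the completion with the lowest (sum or mean) loss."""
    scores = []
    for losses in per_token_losses:
        total = sum(losses)
        scores.append(total / len(losses) if normalize else total)
    return min(range(len(scores)), key=lambda i: scores[i])

# Invented per-token losses for four completions of different lengths.
losses = [
    [2.0, 2.0],                 # sum 4.0,  mean 2.0 (short, mediocre fit)
    [1.0, 1.0, 1.0, 1.0, 1.0],  # sum 5.0,  mean 1.0 (long, best per-token fit)
    [3.0, 3.0, 3.0],            # sum 9.0,  mean 3.0
    [2.5, 2.5, 2.5, 2.5],       # sum 10.0, mean 2.5
]

unnormalized_choice = pick(losses, normalize=False)
normalized_choice = pick(losses, normalize=True)
```

Here the summed loss picks row 0 simply because it has fewer tokens, while the per-token average picks row 1, the completion the model fits best token for token.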

Why not just ask the model?

You could ask "Question: ...? A) ..., B) ..., C) ..., D) ..., Answer:" and have the model generate A/B/C/D. But:

Base models (pretrained, not chat-tuned) aren't necessarily fluent at multiple-choice format.
The completion-scoring method works on any language model, no instruction-following needed.
Greedy generation can be wrong even when the model "knows" the answer — the right token might be the second-most-likely.

Completion scoring is the standard method for evaluating base models. It's what lm-eval-harness uses for tasks like HellaSwag, and it's how most LLM papers produce their multiple-choice benchmark numbers.

The `_norm` suffix

Karpathy reports acc_norm — the accuracy when picking the completion with lowest normalized (per-token-average) loss. Other versions:

Metric	Loss used	Note
`acc`	lowest unnormalized (sum) loss	Biased toward shorter completions.
`acc_norm`	lowest normalized (per-token-average) loss	What's reported.

For HellaSwag specifically, acc_norm is the headline number.
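
As a rough sketch of how the two metrics are accumulated over a dataset, the function below counts both in one pass. It assumes the per-token completion losses for each example have already been computed; `evaluate` and the toy numbers are hypothetical and only illustrate the bookkeeping.

```python
def argmin(values):
    """Index of the smallest value (first one wins on ties)."""
    return min(range(len(values)), key=lambda i: values[i])

def evaluate(examples):
    """examples: list of (per_token_losses_for_4_completions, label) pairs."""
    num_correct = 0
    num_correct_norm = 0
    for per_token_losses, label in examples:
        sums = [sum(l) for l in per_token_losses]            # used for acc
        means = [sum(l) / len(l) for l in per_token_losses]  # used for acc_norm
        num_correct += int(argmin(sums) == label)
        num_correct_norm += int(argmin(means) == label)
    n = len(examples)
    return {"acc": num_correct / n, "acc_norm": num_correct_norm / n}

# Two toy examples with invented losses.
toy = [
    ([[2.0, 2.0], [1.0] * 5, [3.0] * 3, [2.5] * 4], 1),  # sum picks 0, mean picks 1
    ([[1.0], [2.0, 2.0], [3.0], [4.0]], 0),              # both pick 0
]
result = evaluate(toy)
```

On this toy data the two metrics disagree on the first example, which is the kind of case where the reported number depends on which one is chosen.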

Model	HellaSwag accuracy	Note
GPT-3 125M	0.337	GPT-3 paper
GPT-3 175B	0.789	GPT-3 paper
build-nanogpt	~0.29	slightly below GPT-3's 125M number

Karpathy's build-nanogpt run reaches ~0.29 on HellaSwag, slightly below the GPT-3 paper's 125M number (different training data, slightly different hyperparameters).

Why HellaSwag

HellaSwag was chosen by Karpathy because:

It's commonsense reasoning, not domain knowledge. Easier to interpret loss numbers.
It's not saturated for 124M-scale models — there's signal at low accuracy levels.
It has a standard eval harness widely used in the literature.
The dataset is online and free.

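For frontier models, HellaSwag is now near-saturated (>95% for GPT-4). At that scale, you switch to harder evals such as MMLU, GSM8K, and HumanEval. For multiple-choice evals like MMLU the methodology is the same and only the question difficulty changes; GSM8K and HumanEval instead score generated answers (a final number, or code run against tests).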

What this doesn't measure

HellaSwag completion accuracy is a coarse signal. It does not measure:

Long-context capabilities: The prompts are short.
Instruction following: It's pretrained-style scoring.
Calibration: The model could be right with high or low confidence.
Out-of-distribution robustness: HellaSwag contexts come from just two sources (ActivityNet captions and WikiHow).

For build-nanogpt's purposes — "did we reproduce GPT-2 well enough that downstream accuracy matches?" — HellaSwag is enough. For deciding whether a model is "good," you need a much larger eval suite.

Related

repos/build-nanogpt: where the HellaSwag eval lives
sampling: sampling during training is different from eval scoring
training-stability: eval is the validation signal training stability cares about
