Is That Improvement Real? A visual guide to eval statistics
You changed the prompt and the score went up. Should you ship it? An interactive companion to Anthropic's eval guide that walks through variance, standard error, and paired comparisons on a concrete example, assuming less stats background.