Evals on Shane Caldwell

Recent content in Evals on Shane Caldwell
https://hackbot.dad/tags/evals/
Shane Caldwell, https://hackbot.dad/
Sun, 17 Aug 2025 00:00:00 +0000

GPT-5 is Good, Actually: The Agony and Ecstasy of Public Benchmarks
https://hackbot.dad/writing/agony-and-ecstasy-evals/
Sun, 17 Aug 2025 00:00:00 +0000
An attempt to explain why benchmarks are either bad or secret, and why the bar charts don't matter so much.