The Tests All Pass
METR’s SWE-bench analysis shows us taste isn’t verifiable.
If you contribute a public benchmark, are you giving free capability to your competitors?
An attempt to explain why benchmarks are either bad or secret, and why the bar charts don’t matter so much.