No Autonomy Without Scalable Oversight
What to expect as we enter the Year of The Judge.
What to expect as we enter the Year of The Judge.
Towards measuring alignment with human taste in autoformalization with judge agents.
METR’s SWE-bench analysis shows us taste isn’t verifiable.
If you contribute a public benchmark, are you giving free capability to your competitors?
An attempt to explain why benchmarks are either bad or secret, and why the bar charts don’t matter so much.