ProofJudge: Can we align vibe-proving with human taste?
Towards measuring alignment with human taste in autoformalization with judge agents.
METR’s SWE-bench analysis shows us taste isn’t verifiable.
If you contribute a public benchmark, are you giving free capability to your competitors?
An attempt to explain why benchmarks are either bad or secret, and why the bar charts don’t matter so much.