Evals

ProofJudge: Can we align vibe-proving with human taste?

Towards measuring how well autoformalization aligns with human taste, using judge agents.

METR’s SWE-bench analysis shows that taste isn’t verifiable.

If you contribute a public benchmark, are you giving free capability to your competitors?

An attempt to explain why benchmarks are either bad or secret, and why the bar charts don’t matter so much.