Evals

No Autonomy Without Scalable Oversight

What to expect as we enter the Year of The Judge.

Towards measuring alignment with human taste in autoformalization with judge agents.

METR’s SWE-bench analysis shows us taste isn’t verifiable.

If you contribute a public benchmark, are you giving free capability to your competitors?

An attempt to explain why benchmarks are either bad or secret, and why the bar charts don’t matter so much.