Hackbot R&D

Towards using judge agents to measure alignment with human taste in autoformalization.
METR’s SWE-bench analysis shows us taste isn’t verifiable.
Getting comfortable with the hardware on a quest for more MFU.
An attempt to explain why benchmarks are either bad or secret, and why the bar charts don’t matter so much.
If you contribute a public benchmark, are you giving free capability to your competitors?
Optimizing the training of a Llama 3.2 1B model so we can pretrain in a day without going broke.