No Autonomy Without Scalable Oversight
What to expect as we enter the Year of The Judge.
Towards measuring alignment with human taste in autoformalization with judge agents.
METR’s SWE-bench analysis shows us taste isn’t verifiable.
Getting comfortable with the hardware on a quest for more MFU.
The practical realities of devastatingly high communication cost in training.
Looking at the data and letting it look back at us.
Optimizing the training of a Llama 3.2 1B model so we can pretrain in a day without going broke.
If you contribute a public benchmark, are you giving free capability to your competitors?
Distributed training sans datacenter.
We tried RL once. It didn’t work. I’m confident it will this time.