Pretraining at home: 20B tokens from 222 hours to 12
Optimizing the training of a Llama 3.2 1B model so we can pretrain it in a day without going broke.