Twenty Billion Tokens of What, Exactly?

Looking at the data and letting it look back at us.

December 1, 2025 · 23 min · 4761 words · Shane Caldwell

Pretraining at home: 20B tokens from 222 hours to 12

Optimizing the training of a Llama 3.2 1B model so we can pretrain in a day without going broke.

November 23, 2025 · 21 min · 4329 words · Shane Caldwell

DiLoCo: Data Parallelism for the Datacenter Poor

Distributed training sans datacenter.

October 3, 2025 · 25 min · 5275 words · Shane Caldwell