Hackbot R&D

I’m a researcher working at the intersection of artificial intelligence and computer security. I work at Dreadnode, training and evaluating the hacking capabilities of agents. I’ll be working in this field until we see hacking’s move 37.
Shane Caldwell

ProofJudge: Can we align vibe-proving with human taste?

Towards measuring alignment with human taste in autoformalization with judge agents.

March 29, 2026 · 8 min · 1553 words · Shane Caldwell

The Tests All Pass

METR’s SWE-bench analysis shows us taste isn’t verifiable.

March 14, 2026 · 12 min · 2533 words · Shane Caldwell

Intro to GPUs For the Researcher

Getting comfortable with the hardware on a quest for more MFU.

January 5, 2026 · 31 min · 6553 words · Shane Caldwell

GPT-5 is Good, Actually: The Agony and Ecstasy of Public Benchmarks

An attempt to explain why benchmarks are either bad or secret, and why the bar charts don’t matter so much.

August 17, 2025 · 17 min · 3488 words · Shane Caldwell

Offsec Evals: Growing Up In The Dark Forest

If you contribute a public benchmark, are you giving free capability to your competitors?

October 28, 2025 · 11 min · 2258 words · Shane Caldwell

Pretraining at home: 20B tokens from 222 hours to 12

Optimizing the training of a Llama 3.2 1B model so we can pretrain in a day without going broke.

November 23, 2025 · 21 min · 4329 words · Shane Caldwell