Hackbot R&D

I’m a researcher working at the intersection of artificial intelligence and computer security. I work at Dreadnode, training and evaluating the hacking capabilities of agents. I’ll be working in this field until we see hacking’s move 37.

ProofJudge: Can we align vibe-proving with human taste?

Towards using judge agents to measure alignment with human taste in autoformalization.

The Tests All Pass

METR’s SWE-bench analysis shows us taste isn’t verifiable.

Intro to GPUs For the Researcher

Getting comfortable with the hardware on a quest for more MFU.

GPT-5 is Good, Actually: The Agony and Ecstasy of Public Benchmarks

An attempt to explain why benchmarks are either bad or secret, and why the bar charts don’t matter so much.

Offsec Evals: Growing Up In The Dark Forest

If you contribute a public benchmark, are you giving free capability to your competitors?

Pretraining at home: 20B tokens from 222 hours to 12

Optimizing the training of a Llama 3.2 1B model so we can pretrain in a day without going broke.