Papers

PentestJudge: Judging Agent Behavior Against Operational Requirements

Shane Caldwell, Max Harley, Michael Kouremetis, Vincent Abruzzo, Will Pearce

August 4th, 2025

We introduce PentestJudge, a system for evaluating the operations of penetration testing agents.

AIRTBench: Measuring Autonomous AI Red Teaming Capabilities in Language Models

Ads Dawson, Rob Mulla, Nick Landers, Shane Caldwell

June 17th, 2025

We introduce AIRTBench, an AI red teaming benchmark for evaluating language models’ ability to autonomously discover and exploit Artificial Intelligence and Machine Learning (AI/ML) security vulnerabilities.