PentestJudge: Judging Agent Behavior Against Operational Requirements

Shane Caldwell, Max Harley, Michael Kouremetis, Vincent Abruzzo, Will Pearce
August 4th, 2025
We introduce PentestJudge, a system for evaluating the operations of penetration testing agents.

AIRTBench: Measuring Autonomous AI Red Teaming Capabilities in Language Models

Ads Dawson, Rob Mulla, Nick Landers, Shane Caldwell
June 17th, 2025
We introduce AIRTBench, an AI red teaming benchmark for evaluating language models’ ability to autonomously discover and exploit Artificial Intelligence and Machine Learning (AI/ML) security vulnerabilities.