We introduce PentestJudge, a system for evaluating the operations of penetration testing agents.
We introduce AIRTBench, an AI red teaming benchmark for evaluating language models’ ability to autonomously discover and exploit Artificial Intelligence and Machine Learning (AI/ML) security vulnerabilities.