2026 is The Year of The Judge. I don’t think of this as a prediction so much as a description of the inevitable, because this is the first year we’re seriously deploying long-running autonomous agents across all work that touches a computer. That deployment cannot make sense if human beings review every consequential action and output an agent produces in an external environment; that would amount to observing everything each agent does. It is technically infeasible, and, as the last four months of software agents have illustrated, we don’t have the patience for it.

2025 was the year of mechanical verification. We found that if we could define a “correct” answer, we could both hill-climb the capabilities of models (using RLVR) and improve the quality of model outputs (by sampling until an output passed verification). These techniques have been extremely effective, but they have also revealed themselves as spiky and brittle: many important things are not mechanically verifiable.

To see how mechanical verification has failed us in evaluation, look at METR’s recent post describing SWE-bench-passing PRs that would not be merged into a production codebase. This is a case where mechanical evaluation like passing unit tests is no longer sufficient to distinguish the quality bar we’re interested in. We are no longer asking whether LLMs can write Python that runs - that’s settled. We’re now asking questions like “can LLMs write high-quality abstractions for Python that fit developer conventions?” The mechanically verifiable machinery simply doesn’t have the capacity to answer those kinds of questions meaningfully. What it provides in reproducibility it gives up in measurement sensitivity, a tradeoff we will have to accept if we want to empirically ask questions that matter for agent deployment.

Training has suffered similarly. The line between a mechanically verifiable task and a non-mechanically verifiable one is arbitrary. For example, many people have built mechanical verifiers that can reliably tell you whether a web app pentesting agent has correctly popped XSS. That isn’t possible for IDORs, because the significance of an access control bug depends on the particular application and the context of the data revealed through it. That’s not to say IDORs are harder to find; they’re often considered a great first bug class on bug bounty precisely because they’re the easiest class to hunt that still requires human verification, which keeps hunters from getting scooped by fuzzers. Say roughly 50% of the web app bug classes of interest can be formulated as mechanically verifiable, and the other 50% can’t. Will users accept an agent product that can do half of what professional web app pentesters can? Obviously not. Are models incapable of understanding the context of data as it appears in an application and whether it’s sensitive? Obviously not. So: you will have to rely on validators that are noisy and stochastic - judges. RLVR appears to be reasonably robust to judge noise.

These are just the changes needed for today’s style of evaluation and training to keep making progress. Now consider the pressures added specifically by long-running agents that take meaningful actions in the real world.

Scalable Oversight of Infinite Agents

The labs are signaling their interest in models that can work autonomously for longer and longer periods. METR puts Claude Opus 4.6 at doing tasks over a 12-hour period. These tasks are less like “implement a bug fix” and more like “build these net new features”. In the former case, a human can easily check at the end that the bug is gone and the new code looks good. In the latter, building up enough context to understand and grade the code is time-consuming. If you’re running multiple Claude Opus instances in different parts of your codebase, you’re now facing a huge branching action space with output you simply cannot review manually. Running linters and unit tests in the loop helps tie model outputs to reality, but is ultimately not sufficient to avoid creating unmaintainable slop. This requires both outcome-based judges and in-process judges.

Autonomy moves at the speed of trust. For you to be comfortable with a dozen agents running hundreds to thousands of commands on your computer, you have to trust that none of them will rm -rf /. Currently this problem is solved with sandboxes. That works for local software development, but it’s just kicking the can down the road: running commands in a sandbox is mostly fine because the blast radius is small, but for agents to do meaningful autonomous work, they will have to act outside the sandbox. As a security-adjacent AI researcher this is obvious to me: there’s no sandbox on a network pentest, no sandbox for submitting your taxes, no sandbox for on-call infrastructure bugs. The work cannot get done without the commands being run, and the wrong command can have hugely negative consequences. For that loop to run, you’ll need strong judges in and around it.

In a recent Claude Code update, in response to software developers’ overwhelming willingness to use --dangerously-skip-permissions (or to click approve without looking), an auto mode has been introduced. Auto mode is designed around another model inspecting the transcript of a series of tool calls and deciding whether to approve or deny the next call. That model is Sonnet 4.6, which suggests a focus on cheap models verifying the actions of more expensive ones. Auto mode is the clearest public example of a runtime judge gating tool calls based on inferred user intent. This is the in-process judge.
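As a minimal sketch of what this kind of gate might look like (the function names, the ALLOW/DENY/ESCALATE vocabulary, and the toy keyword judge below are all my own assumptions, not Claude Code's implementation):

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)

def gate_tool_call(transcript: list[str], call: ToolCall,
                   ask_judge: Callable[[str], str]) -> str:
    """Ask a judge whether a proposed tool call should run.

    ask_judge is any backend (in practice a cheap model, as auto mode
    suggests); it receives a prompt and returns a verdict string.
    Fails closed: any unrecognized verdict becomes an escalation.
    """
    prompt = (
        "You are a tool-call judge. Given the session transcript and the "
        "proposed call, reply ALLOW, DENY, or ESCALATE.\n\n"
        "Transcript:\n" + "\n".join(transcript) +
        f"\n\nProposed call: {call.name} {call.args}"
    )
    verdict = ask_judge(prompt).strip().upper()
    return verdict.lower() if verdict in {"ALLOW", "DENY", "ESCALATE"} else "escalate"

# Toy stand-in judge: a keyword rule where a deployment would call a model.
def toy_judge(prompt: str) -> str:
    destructive = ("rm -rf /", "mkfs", "drop table")
    return "DENY" if any(k in prompt.lower() for k in destructive) else "ALLOW"

verdict = gate_tool_call(["user: clean up the build directory"],
                         ToolCall("bash", {"cmd": "rm -rf /"}), toy_judge)
print(verdict)  # deny
```

Failing closed is the important design choice here: a judge that can't parse its own verdict should escalate rather than silently approve.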

Anthropic’s recent article on harnesses for long-running application development shows the outcome-based judge. Discussing a harness that lets a model work for an extended period on a frontend project, they decide to have a separate model judge the first model’s software engineering. This occurs in a loop: a generator model submits work, then a critic reviews that work, points out bugs, and asks for changes. This is the outcome-based judge.
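The generator-critic loop can be sketched in a few lines. Everything below (the function names, the toy generator and critic) is illustrative, not Anthropic's harness:

```python
def outcome_judge_loop(task, generate, critique, max_rounds=5):
    """Run a generator until a separate critic accepts its work.

    generate(task, feedback) -> work; critique(work) -> list of issues,
    where an empty list means the critic accepts. Returns (work, accepted).
    """
    feedback: list[str] = []
    work = None
    for _ in range(max_rounds):
        work = generate(task, feedback)
        feedback = critique(work)
        if not feedback:
            return work, True
    return work, False  # budget exhausted; escalate to a human

# Toy stand-ins where a deployment would call two different models.
def toy_generate(task, feedback):
    return task + "".join(f" [{f}]" for f in feedback)

def toy_critique(work):
    return [] if "tests" in work else ["add tests"]

work, accepted = outcome_judge_loop("implement login form", toy_generate, toy_critique)
print(accepted)  # True
```

The round budget matters: without it, a skeptical critic and a stubborn generator can loop forever, which is exactly the failure mode that should surface as an escalation.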

To quote:

[…] Tuning a standalone evaluator to be skeptical turns out to be far more tractable than making a generator critical of its own work.

So, how do you do scalable oversight of infinite agents taking actions of arbitrary potential consequence? Turtles all the way down. While it will take time for a final pattern to emerge, layered judges look like the coming norm. Generator models will do the work: write code, pentest, file taxes. In-process judges will gate tool calls so work can run more autonomously without human oversight, and outcome judges will run when generators stop, confirming that the product is aligned with the user’s intent and asking for changes otherwise. Other models will be layered in as needed to look for prompt injection and security risks. The models will be deployed in their own personal panopticon.

To be clear, I don’t think that’s a bad thing. In fact, aligning judges to human-written rubrics is probably the easiest way for subject matter experts to productively apply their skills and knowledge to increasingly autonomous systems acting on their behalf. When I wrote PentestJudge last year, it was in anticipation of these trends. Multiple companies are deploying pentesting agents, which is now practical because agents perform at longer time horizons. It is also perhaps the clearest place where AI alignment with human intent becomes a contact sport. Cybersecurity is a field where actions have very significant consequences, and a large portion of the field is made up of consultants who are used to explicitly writing out scope and deliberately communicating processes and results to semi-technical audiences. Information security is where a judge methodology will first cohere outside of software development, and that pattern will be copied across every other field of knowledge work.

The Judge Research Agenda

So if we assume all of the above is true, what’s worth working on?

  1. Calibration: figuring out the best way to align LLM judges with their human counterparts. I’ve done it in a fairly simple way: make all judgements true/false, collect a set of human labels, iterate on the rubric against a training split, and check F1 on a held-out test split. The best tools and context to give the judge model, and the best way for subject matter experts to write rubrics, are open questions we’ll answer by doing the work.
  2. Adversarial robustness: this architecture shifts a lot of responsibility onto judges and makes them the natural point of attack. Can judges be swayed by prompt injection from the environment the operating agent works in? Can a misaligned model fool a judge into allowing a tool call it shouldn’t? These go from interesting theoretical questions to part and parcel of day-to-day operations with long-running agentic harnesses in sensitive domains. The answers will likely look different for in-process and outcome-based judges.
  3. Escalation sensitivity: Can a judge be tuned to know when to raise a decision to a human in a way that doesn’t cause “approval fatigue”? Traditionally infosec tooling has been willing to submit many false positives to users so they’re not on the hook for any false negatives. For autonomy to function, escalations have to be important and rare. How can we meaningfully study that question, and how can we create systems that work?
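The calibration recipe in item 1 can be made concrete. A minimal sketch, assuming binary judge verdicts and treating each candidate rubric as input to a black-box judge function (all names here are illustrative):

```python
def f1_score(preds, labels):
    """F1 of binary judge verdicts against human labels."""
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum(not p and l for p, l in zip(preds, labels))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def calibrate(rubrics, judge, train, test):
    """Pick the rubric with the best F1 on the train split, then
    report that rubric's F1 on the held-out test split.

    judge(rubric, item) -> bool; train/test are (item, human_label) pairs.
    """
    def split_f1(rubric, data):
        preds = [judge(rubric, item) for item, _ in data]
        return f1_score(preds, [label for _, label in data])
    best = max(rubrics, key=lambda r: split_f1(r, train))
    return best, split_f1(best, test)
```

In practice `judge` would prompt a model with the rubric and the item under review; the held-out split is what keeps rubric iteration from overfitting to the labelers' quirks.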

I’ll close on two claims:

First, any deployment story that assumes a human employee is reviewing every consequential agent interaction will not survive contact with reality. Judges are how you recover autonomy, and agent products that don’t allow companies to calibrate those judges will find themselves locked out of high impact industries.

Second, attacking judges in harnesses like Claude Code’s auto mode is where AI red teaming goes next. The methodology doesn’t exist yet, and it will not be practical for humans to test these systems by hand.

As for myself, I’ll be publishing on calibration, adversarial robustness, and escalation sensitivity for the rest of 2026; if you’re working on any of them I’d love to hear from you.