Source: wiki synthesis of Measuring AI Agent Autonomy in Practice (Anthropic), MirrorCode (Epoch AI + METR), and The Production Class Ladder (Nate B Jones)

The Verification Frontier explains why autonomy gates on verification; this article is the measurement companion — how to locate where your agents actually are. Three independent instruments answer the same question from different angles: Anthropic’s first-party field study measures what deployed agents actually do, the MirrorCode benchmark measures what they can do unattended at real budgets, and Nate Jones’s production class ladder classifies what standard the output must meet given who relies on it. Read together, they compose into a usable self-assessment: which rung you’re on, what evidence earns the next one, and where autonomy actually breaks.

Key Takeaways

  • Three instruments, three questions.^[inferred — the do/can/must-meet framing is this article’s synthesis] Field telemetry (Clio analysis of Claude Code + public API) = autonomy exercised; MirrorCode (rebuild 25 real programs, no human, no source access) = autonomy possible; the production class ladder (4 rungs with explicit requirements) = autonomy warranted by who depends on the output.
  • The field headline is the deployment overhang. Models can handle more autonomy than they’re granted: ~5 hours of METR-measured capability (Opus 4.5, 50% success) vs ~42 minutes actually exercised at Claude Code’s 99.9th percentile. The two numbers are not directly comparable, but the roughly sevenfold gap is the load-bearing observation. Longest sessions doubled in three months (<25 min → >45 min) smoothly across model releases, so the constraint is granted latitude, not capability jumps.
  • The ceiling is 56%. Even with no human in the loop and a budget big enough for a real attempt (one task: $2,600, 19 days unattended), the leading model (Claude Opus 4.7) solves 56% of MirrorCode’s rebuild-from-scratch tasks. Nearly half of hard long-horizon builds fail at today’s frontier — anchor any “fully autonomous large project” claim against that number.
  • Autonomy is granted through a trust shift, and the data shows its shape: auto-approve usage rises 20% → 40%+ with experience while interrupt rate also rises 5% → ~9%. Not a paradox — experienced operators stop approving every action and instead monitor and intervene when it matters. “Effective oversight doesn’t require approving every action but being in a position to intervene when it matters.”
  • Risk and autonomy are largely independent axes in Anthropic’s cluster data (autonomous crypto trading: autonomy 7.7, risk 2.2). The ladder supplies the third axis the telemetry can’t: reliance — a personal tool and a customer-facing feature can have identical autonomy and risk scores but need entirely different requirements (owner, failure plan, monitoring, AI-specific evals).
  • Where autonomy breaks, measured. In the field, humans interrupt mostly to supply missing context or corrections (32%) or because the agent was slow, hanging, or excessive (17%): context supply and runaway behavior, not catastrophes. On complex tasks Claude stops itself more than 2× as often as humans interrupt, mostly to present a choice between approaches (35%). At the long-horizon ceiling, failure is budget-and-wall-clock-shaped: capability only appears when you stop capping inference at $10 per task.
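
The headline ratios in these takeaways are simple arithmetic. A minimal Python sketch, using only the figures quoted above (the variable names are this sketch's own), makes the sizes explicit:

```python
# Figures quoted in the takeaways above; variable names are illustrative only.
capability_minutes = 5 * 60  # ~5 hours of METR-measured capability (Opus 4.5, 50% success)
exercised_minutes = 42       # ~42 minutes at Claude Code's 99.9th percentile

# The two numbers are not directly comparable, so treat the ratio
# as a rough size of the gap rather than a precise measurement.
overhang_ratio = capability_minutes / exercised_minutes

mirrorcode_solve_rate = 0.56  # leading model's MirrorCode solve rate
mirrorcode_failure_rate = 1 - mirrorcode_solve_rate

print(f"overhang ratio: about {overhang_ratio:.1f}x")
print(f"unattended build failure rate: {mirrorcode_failure_rate:.0%}")
```

Run as written, this prints a ratio of about 7.1x and a failure rate of 44%.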

The three instruments side by side

| Instrument | What it measures | Unit of measurement | Headline reading |
| --- | --- | --- | --- |
| Anthropic field study (Feb 2026, Clio, 500k interrupts + ~1M tool calls) | autonomy exercised in real deployments | session length, auto-approve %, interrupt %, per-tool-call risk/autonomy/human-involvement | 99.9th-pct sessions <25 → >45 min in 3 months; 80% of API tool calls have safeguards, 73% human-in-loop, 0.8% irreversible |
| MirrorCode (Epoch AI + METR, June 2026) | largest software project completable unattended | solve rate at realistic budgets (up to $2,600 / 19 days per task) | Opus 4.7 leads at 56%: “significant room for further improvement” |
| Production class ladder (Nate B Jones, May 2026) | reliance placed on AI-built output | rung 1-4, each with explicit requirements | requirements scale with dependence, up to AI-specific evals + governance at rung 4; demotion is as important as promotion |

A lineage note: METR appears on both capability instruments — its task-length benchmark supplies the “what models can do” side of Anthropic’s overhang comparison, and it co-built MirrorCode. The exercised side (field telemetry) is currently Anthropic-only data.

A usable self-assessment

The ladder gives you the rung; the telemetry gives you the oversight posture appropriate to it; the benchmark caps what you may promise at the top. The rung-to-posture mapping below is this article’s construction^[inferred] — each ingredient traces to a source:

  • Rung 1 — personal tool. Jones: keep it away from sensitive data. Evidence to advance: instrument yourself. Log every tool call with approval requested/granted (Anthropic’s replicable schema: tool name, parsed input, timestamp, elapsed time, approval outcome) so the next rung’s claims are measurable rather than vibes.
  • Rung 2 — team beta. Jones: owner + backup owner, a systems-touched list, a failure plan. Evidence to advance: your own 80/73/0.8 numbers — % of tool calls with safeguards, % with a human in the loop, % irreversible (keep irreversible well under 1%, the fleet-wide baseline being 0.8%).
  • Rung 3 — supported internal product. Jones: product ownership, access management, monitoring, auditability, a change process. Matching oversight posture from the field data: drop per-action approval in favor of monitoring plus targeted interrupts — the experienced-operator pattern — provided the intervention path is fast and clear (“switching from approval to monitoring is not a safety regression if your intervention path is fast and clear”).
  • Rung 4 — customer-facing. Jones: usual product standards plus AI-specific evals and governance. Calibrate the promise against the ceiling: the leading model solves 56% of hard unattended builds at a $2,600/19-day budget — “treat ‘fully autonomous large-build’ claims above that as needing evidence.”
  • The down-staircase applies to agents too. “A ladder that only moves upward becomes a junk drawer” — schedule demotion reviews, because the quiet cost is dead software (or a dead agent workflow) the company keeps paying support cost on.^[inferred — Jones states this for AI-built software; extending it to agent deployments is this article’s bridge]
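
One way to make the rungs above checkable is a small registry that pairs each workflow with its rung's requirements and a demotion-review date. The sketch below is illustrative only: the requirement strings paraphrase the list above, while `Workflow`, `missing_requirements`, and `review_overdue` are hypothetical names. It also assumes each rung is checked against its own list only, since the sources as summarized here do not say whether requirements accumulate across rungs.

```python
from dataclasses import dataclass, field
from datetime import date

# Requirements per rung, paraphrased from the list above.
# Assumption: each rung is checked against its own list (not cumulative).
RUNG_REQUIREMENTS = {
    1: ["kept away from sensitive data"],
    2: ["owner", "backup owner", "systems-touched list", "failure plan"],
    3: ["product ownership", "access management", "monitoring",
        "auditability", "change process"],
    4: ["usual product standards", "AI-specific evals", "governance"],
}

@dataclass
class Workflow:
    """Hypothetical record for one agent workflow."""
    name: str
    rung: int
    next_demotion_review: date
    satisfied: set[str] = field(default_factory=set)

def missing_requirements(wf: Workflow) -> list[str]:
    """Requirements of the workflow's rung that are not yet satisfied."""
    return [r for r in RUNG_REQUIREMENTS[wf.rung] if r not in wf.satisfied]

def review_overdue(wf: Workflow, today: date) -> bool:
    """True when the scheduled demotion review date has passed."""
    return today > wf.next_demotion_review

# Usage: a rung-2 team beta that has an owner and a failure plan so far.
wf = Workflow(
    name="release-notes-agent",  # made-up example workflow
    rung=2,
    next_demotion_review=date(2026, 1, 1),
    satisfied={"owner", "failure plan"},
)
gaps = missing_requirements(wf)
overdue = review_overdue(wf, today=date(2026, 2, 1))
print(gaps, overdue)
```

Storing the review date next to the rung keeps demotion on the same footing as promotion: a workflow with an overdue review is flagged the same way as one with missing requirements.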

How this differs from adjacent connections: the verification frontier is the law (no cheap verification, no autonomy); the L0→L3 readiness ladder in The Loop Is the Unit of Work is design-time readiness for one loop you’re building. This article is measurement: three instruments for locating an existing deployment — what it does, what it could do, what it must meet.

What the measurements say about where autonomy breaks

  • The binding constraint in practice is latitude, not capability. The overhang gap plus the smooth (not release-driven) session-length doubling both point the same way: existing models can handle more autonomy than operators grant. Anthropic’s internal numbers agree — success on its most challenging tasks doubled Aug → Dec 2025 while human interventions per session fell 5.4 → 3.3.
  • Interrupts are mundane, and that’s the finding. The top human-interrupt reasons are missing context/corrections (32%) and slowness/excess (17%) — supply-side problems an operator can fix with better context and tighter scoping, not exotic failures.
  • Agent-initiated stops are a real oversight channel. Claude asks for clarification more than 2× as often as humans interrupt on complex tasks — agent-side calibration complements external safeguards, and it’s why Anthropic argues against mandating per-action approval (“friction without necessarily producing safety benefits”).
  • At the frontier, budget is the variable. MirrorCode’s design point: long-horizon capability only shows up when inference isn’t capped at $10 and minutes-to-hours of wall-clock time. If you’re deciding whether to “let it run,” the real levers are budget and wall-clock (days), then model choice.
  • The instruments don’t yet cover everything. Anthropic can’t link public-API calls into sessions (its strongest findings are Claude Code-only, overwhelmingly software engineering); software is uniquely amenable to supervisory oversight because output is testable — the autonomy-trust curve in law, medicine, or finance may look completely different.

Try It

  1. Instrument before you assess. Adopt the tool-call logging schema above; Anthropic’s PDF appendix (linked in the source article) contains the reusable classification prompts for risk, autonomy, and human involvement.
  2. Compute your three numbers — % safeguarded, % human-in-loop, % irreversible — and keep irreversible actions well under 1% of tool calls.
  3. Assign every agent workflow a rung and attach that rung’s requirements; put a demotion review on the calendar at the same time.
  4. Shift oversight posture deliberately, not by drift: move from per-action approval to monitoring + targeted interrupts only once your intervention path is fast — the 5% → 9% interrupt-rate rise among experienced users is the trust-building signal to emulate.
  5. Price the ceiling into plans: for unattended long-horizon work, anchor on 56% at $2,600/19 days, and pull the full MirrorCode table (epoch.ai/mirrorcode) before citing per-model gaps.
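
Steps 1 and 2 can be sketched together. The record below carries the logging-schema fields named earlier (tool name, parsed input, timestamp, elapsed time, approval outcome). The three boolean labels, the `approval_outcome` string values, and the function name `three_numbers` are assumptions of this sketch; in practice the labels would come from your own classification step, not from the log itself.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class ToolCall:
    # Fields from the logging schema described in this article.
    tool_name: str
    parsed_input: dict[str, Any]
    timestamp: float                 # seconds since epoch (unit is an assumption)
    elapsed_seconds: float
    approval_outcome: Optional[str]  # e.g. "granted"; None if no approval was requested
    # Hypothetical labels attached by a separate classification step.
    has_safeguard: bool = False
    human_in_loop: bool = False
    irreversible: bool = False

def three_numbers(calls: list[ToolCall]) -> dict[str, float]:
    """Percent of tool calls that are safeguarded, human-in-loop, irreversible."""
    if not calls:
        raise ValueError("no tool calls to summarize")
    total = len(calls)
    return {
        "pct_safeguarded": 100.0 * sum(c.has_safeguard for c in calls) / total,
        "pct_human_in_loop": 100.0 * sum(c.human_in_loop for c in calls) / total,
        "pct_irreversible": 100.0 * sum(c.irreversible for c in calls) / total,
    }

# Made-up sample of 200 calls, purely to exercise the function.
sample = (
    [ToolCall("edit_file", {"path": "a.py"}, 0.0, 0.2, "granted", True, True, False)] * 150
    + [ToolCall("read_file", {"path": "a.py"}, 1.0, 0.1, None, True, False, False)] * 49
    + [ToolCall("send_email", {"to": "x"}, 2.0, 0.5, None, False, False, True)] * 1
)
stats = three_numbers(sample)
under_ceiling = stats["pct_irreversible"] < 1.0  # "well under 1%" target from step 2
print(stats, under_ceiling)
```

For this sample the function reports 99.5% safeguarded, 75% human-in-loop, and 0.5% irreversible, so the irreversible share sits under the 1% line.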

Open Questions

  • All field data is single-provider. Anthropic measures Anthropic; no public equivalent exists for GPT-5 / Gemini / Grok agent deployments, and the Claude Code findings may not generalize beyond software engineering.
  • The full MirrorCode leaderboard was not fetched — only the Opus 4.7 top line reached the wiki via the Epoch newsletter; per-model gaps and the exact pass criterion live at the primary.
  • The rung-to-oversight-posture mapping is constructed here. No organization has published results applying Jones’s ladder as written, let alone the combined self-assessment.^[inferred]
  • Non-testable domains are unmeasured. All three instruments lean on software; how to measure warranted autonomy where output can’t be cheaply verified (law, medicine, finance) is open — the same boundary the verification frontier draws.