Measuring AI Agent Autonomy in Practice — Anthropic First-Party Data

Source: ai-research/anthropic-measuring-agent-autonomy-2026-02-18.md — Anthropic research blog, Feb 18, 2026. 20 authors, Clio privacy-preserving analysis across Claude Code + public API.

Anthropic’s first first-party empirical study of agent deployment behavior in the wild. Spans Claude Code (full session visibility) plus the public API (broad cross-customer breadth at tool-call granularity). The companion to How We Contain Claude — that piece was the engineering blog on the containment design, this is the research blog on what people actually do with deployed agents. Provides the empirical basis for auto-mode adoption discipline + the Anthropic-as-90%-approval-rate-driving-defaults framing.

Key Takeaways — top-line findings

Claude Code’s longest turns nearly doubled in three months: <25 min → >45 min at the 99.9th percentile of turn duration, Oct 2025 → Jan 2026. The rise is smooth across model releases rather than a discontinuous jump with each new launch, which suggests existing models can handle more autonomy than they exercise in practice. Anthropic calls this the deployment overhang.
Median Claude Code turn = 45 seconds (stable, 40-55s range). The story is in the tail, not the median. Nearly every percentile below the 99th has remained relatively stable.
Auto-approve usage rises 20% → 40%+ as users gain experience (new users <50 sessions → users at 750+ sessions). Default settings require manual approval per action; the rise reflects users configuring the product as they build trust.
Interrupt rate ALSO rises with experience: 5% (10 sessions) → ~9% (experienced). Apparent contradiction with auto-approve growth, but not a paradox. Reflects a strategy shift: new users approve every action and rarely interrupt; experienced users let Claude run autonomously and step in when it matters. “Effective oversight doesn’t require approving every action but being in a position to intervene when it matters.”
Claude Code asks for clarification MORE than humans interrupt on complex tasks — more than 2× as often. Claude exercises some calibration on its own uncertainty. Agent-initiated stops are a real form of oversight, complementing external safeguards.
Internal Anthropic data: Claude Code success rate on most-challenging-tasks DOUBLED Aug → Dec 2025, while average human interventions per session dropped 5.4 → 3.3. More autonomy AND better outcomes AND fewer interventions, simultaneously.
80% of public API tool calls have safeguards. 73% have a human in the loop. 0.8% are irreversible (e.g., sending an email to a customer).
Software engineering = ~50% of all agentic activity on the public API. Beyond coding: business intelligence, customer service, sales, finance, e-commerce — each at a few percentage points of traffic.
Risk and autonomy are largely independent dimensions at the cluster level. Most clusters are low-risk; autonomy varies more widely. High-autonomy / low-risk: system health monitoring, crypto trading (autonomy 7.7 / risk 2.2), meeting reminders. Highest-risk clusters tend to be security-related, financial, or medical — and many of those are evaluations / red-team exercises, not production.
Public API ≠ full sessions. Anthropic has no reliable way to link API requests into agent sessions. Many of the strongest findings come from Claude Code (Anthropic’s own product, with full session visibility) and may not generalize. This is the headline methodology limitation.
Related first-party autonomy datapoint — Project Fetch Phase 2 (2026-06-18). [X signal — @AnthropicAI Frontier Red Team] Using Claude (Opus 4.7, largely unaided) to program a robodog, the model was reported ~20× faster than the prior year’s best human team aided by Opus 4.1 (the robodogs still failed to fetch a beach ball). A concrete embodied-task autonomy result complementing this study’s software-task findings. Blog: anthropic.com/research/project-fetch-phase-two. ^[inferred — figures from the X synthesis in raw/x-account-anthropicai-2067651699486200091.md, not yet cross-checked against the blog]

Why this matters — the deployment-overhang frame

The headline metric is not model capability. It’s the gap between what models can do (~5 hours per METR’s benchmark for Opus 4.5 at 50% success) and what they actually do in practice (~42 minutes at the 99.9th percentile in Claude Code). The two numbers are not directly comparable — METR measures idealized task difficulty, Anthropic measures actual elapsed time including human interruptions and Claude’s own clarification stops — but the order-of-magnitude gap is the load-bearing observation.

“The latitude granted to models in practice lags behind what they can handle.”

This is also why Anthropic argues against premature regulatory mandates on agent oversight: “Oversight requirements that prescribe specific interaction patterns, such as requiring humans to approve every action, will create friction without necessarily producing safety benefits.” As agents and the science of measurement mature, focus should be on whether humans are in a position to effectively monitor and intervene, not on requiring particular forms of involvement.

Methodology — Clio + tool-call analysis

Definition adopted: “An agent is an AI system equipped with tools that allow it to take actions” (consistent with Russell & Norvig 1995 and Simon Willison’s “system that runs tools in a loop to achieve a goal”).
Two data sources, complementary tradeoffs:
- Public API — broad visibility across thousands of customers, but only at individual tool-call granularity (cannot reconstruct full agent sessions).
- Claude Code — full session visibility, but only one product (overwhelmingly software engineering).
Privacy-preserving infrastructure: Clio (Anthropic’s privacy-preserving analysis tool). All classifications generated by Claude with opt-out categories validated against internal data; no manual inspection due to privacy constraints.
Sample sizes: 500k human interruptions + 500k completed turns for Claude Code interrupt-reason clustering. 998,481 random tool calls from public API for risk/autonomy scoring.

Why Claude stops itself vs. why humans interrupt Claude

Why Claude stops itself	Why humans interrupt Claude
Present a choice between proposed approaches (35%)	Provide missing technical context or corrections (32%)
Gather diagnostic information or test results (21%)	Claude was slow, hanging, or excessive (17%)
Clarify vague or incomplete requests (13%)	They received enough help to proceed independently (7%)
Request missing credentials, tokens, or access (12%)	They want to take the next step themselves (7%)
Get approval/confirmation before acting (11%)	Change requirements mid-task (5%)

The 35% “present a choice between approaches” is the load-bearing pattern that pairs with Troubleshooting Claude’s framing-vs-prohibition discipline — Claude actively presenting options is preferable to acting silently on ambiguity.

Risk × autonomy clusters

Higher average risk	Higher average autonomy
API key exfiltration backdoors (risk 6.0, autonomy 8.0)	Red team privilege escalation (autonomy 8.3, risk 3.3)
Reactive-chemical relocation in labs (risk 4.8, autonomy 2.9)	Heartbeat health checks (autonomy 8.0, risk 1.1)
Patient medical records retrieval (risk 4.4, autonomy 3.2)	Autonomous crypto trades (autonomy 7.7, risk 2.2)
Fire emergency response (risk 3.6, autonomy 5.2)	Auto meeting reminders (autonomy 7.6, risk 1.7)
Production bug deployment (risk 3.6, autonomy 4.8)	Email + business-message alerting (autonomy 7.5, risk 1.7)

(Many high-risk clusters are evaluations or red-team exercises, not production. The framing notes that even rare actions have outsize consequences if mishandled.)
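The cluster scores in the table above can be bucketed into quadrants to make the "risk and autonomy are largely independent" observation easier to see. The following is a minimal sketch in Python: the scores are copied from the table, while the 5.0 cut-offs and the `quadrant` and `group_by_quadrant` helpers are illustrative assumptions, not part of Anthropic’s methodology.

```python
from collections import defaultdict

# (cluster name, average risk, average autonomy), copied from the table above.
CLUSTERS = [
    ("API key exfiltration backdoors", 6.0, 8.0),
    ("Reactive-chemical relocation in labs", 4.8, 2.9),
    ("Patient medical records retrieval", 4.4, 3.2),
    ("Fire emergency response", 3.6, 5.2),
    ("Production bug deployment", 3.6, 4.8),
    ("Red team privilege escalation", 3.3, 8.3),
    ("Heartbeat health checks", 1.1, 8.0),
    ("Autonomous crypto trades", 2.2, 7.7),
    ("Auto meeting reminders", 1.7, 7.6),
    ("Email + business-message alerting", 1.7, 7.5),
]

# Hypothetical cut-offs chosen for illustration only; pick your own.
RISK_CUT = 5.0
AUTONOMY_CUT = 5.0


def quadrant(risk: float, autonomy: float,
             risk_cut: float = RISK_CUT,
             autonomy_cut: float = AUTONOMY_CUT) -> str:
    """Label a cluster by which side of each cut-off its scores fall on."""
    risk_label = "high-risk" if risk >= risk_cut else "low-risk"
    autonomy_label = "high-autonomy" if autonomy >= autonomy_cut else "low-autonomy"
    return f"{risk_label}/{autonomy_label}"


def group_by_quadrant(clusters):
    """Return {quadrant label: [cluster names]} for the given clusters."""
    groups = defaultdict(list)
    for name, risk, autonomy in clusters:
        groups[quadrant(risk, autonomy)].append(name)
    return dict(groups)


if __name__ == "__main__":
    for label, names in sorted(group_by_quadrant(CLUSTERS).items()):
        print(f"{label}: {', '.join(names)}")
```

With these assumed cut-offs, only the API key exfiltration cluster lands in the high-risk/high-autonomy quadrant. The table lists the top clusters on each axis rather than a representative sample, so the quadrant counts say nothing about overall traffic shares.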

Where this lands in the wiki

Companion piece to How We Contain Claude — that May 2026 engineering post is the containment side, this Feb 2026 research post is the deployment patterns side. They explicitly cross-reference each other in the Anthropic Engineering / Research blog series.
Empirical basis for auto-mode discipline — the 20% → 40%+ auto-approve rise as users gain experience is the data behind why Plan Mode and auto-mode are the configured defaults for experienced operators.
Empirical basis for clarifying-questions discipline — Anthropic explicitly trains Claude to ask clarifying questions; this paper confirms the behavior plays out in practice. Pairs with the troubleshooting discipline of giving Claude explicit room to ask.
Bridge to Ramp’s marketing-to-AI-agents experiment — Ramp’s finding that Anthropic’s crawler is more aggressive than all other named AI crawlers combined is independently corroborated by Anthropic’s own usage-side measurements (rising session length + rising auto-approve + doubling user base). Two angles, same direction.

Implementation

Tool/Service: Clio (Anthropic’s privacy-preserving analysis infrastructure) is internal-only. The methodology (tool-call-level classification of risk, autonomy, and human involvement) can be replicated by any operator with logging access to their own agent’s tool calls.
Setup: Log every tool call with: (1) tool name, (2) parsed input, (3) timestamp, (4) elapsed time from prior call, (5) whether human approval was requested, (6) whether the human granted/denied/modified it. Append the system prompt + conversation history available at time of action.
Cost: Self-hosted logging is cheap; Claude-based classification is per-call API cost.
Integration notes: Anthropic’s PDF appendix at https://cdn.sanity.io/files/4zrzovbb/website/55e4d2de6eb39b3a9259c3f74843f86b1a12e265.pdf includes the full classification prompts for risk, autonomy, and human involvement. Worth reading the appendix before designing your own monitoring schema.
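The six-field logging setup described above could be sketched as a small JSONL logger. This is an illustration under stated assumptions, not Anthropic’s schema: the `ToolCallRecord` and `ToolCallLogger` names, the field names, and the JSONL file format are all hypothetical choices.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any, Optional

# Allowed values for field (6); None means no approval was requested.
APPROVAL_OUTCOMES = {"granted", "denied", "modified"}


@dataclass
class ToolCallRecord:
    tool_name: str                          # (1) tool name
    tool_input: dict[str, Any]              # (2) parsed input
    timestamp: float                        # (3) Unix timestamp of the call
    elapsed_since_prior_s: Optional[float]  # (4) None for the first call
    approval_requested: bool                # (5) was a human asked?
    approval_outcome: Optional[str]         # (6) granted / denied / modified
    # System prompt + conversation history available at the time of action.
    context: dict[str, Any] = field(default_factory=dict)


class ToolCallLogger:
    """Appends one JSON object per tool call to a JSONL file (hypothetical helper)."""

    def __init__(self, path: str) -> None:
        self.path = path
        self._last_ts: Optional[float] = None

    def log(self, tool_name: str, tool_input: dict[str, Any], *,
            approval_requested: bool = False,
            approval_outcome: Optional[str] = None,
            context: Optional[dict[str, Any]] = None,
            now: Optional[float] = None) -> ToolCallRecord:
        if approval_outcome is not None and approval_outcome not in APPROVAL_OUTCOMES:
            raise ValueError(f"unknown approval outcome: {approval_outcome!r}")
        ts = time.time() if now is None else now
        elapsed = None if self._last_ts is None else ts - self._last_ts
        self._last_ts = ts
        record = ToolCallRecord(tool_name, tool_input, ts, elapsed,
                                approval_requested, approval_outcome,
                                context or {})
        with open(self.path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(asdict(record)) + "\n")
        return record
```

A call such as `ToolCallLogger("calls.jsonl").log("send_email", {"to": "..."}, approval_requested=True, approval_outcome="granted")` writes one line per action. Storing the full conversation history on every record can get large, so a real deployment might store a reference to it instead.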

Related

Claude AI topic
How We Contain Claude — Anthropic Engineering blog companion piece (containment design ↔ deployment-patterns data)
Claude Code CLI reference — Plan Mode + auto-approve flags
Claude Code best practices — autonomy + oversight patterns
Troubleshooting Claude — clarifying-questions discipline this paper confirms in practice
Managed Agents — adjacent primitive frame
Cowork — explicitly cited in the Anthropic post as the product making agents more accessible
Ramp’s marketing-to-AI-agents experiment — independent corroboration of Anthropic crawler aggressiveness

Try It

Read the PDF appendix before designing your own agent telemetry: https://cdn.sanity.io/files/4zrzovbb/website/55e4d2de6eb39b3a9259c3f74843f86b1a12e265.pdf. The classification prompts are reusable.
Adopt the 80/73/0.8 frame for your own deployments. Aim to keep irreversible-action rate well under 1% of tool calls.
Stop requiring per-action approval once you trust your agent. Anthropic’s data confirms experienced users shift to monitoring + targeted interrupts. Switching from approval to monitoring is not a safety regression if your intervention path is fast and clear.
Train your operators to interrupt rather than approve. 5% → 9% interrupt rate as users gain experience is the trust-building signal. Active monitoring + selective intervention beats passive per-action approval.
Calibrate against the 99.9th percentile. The median turn is ~45 seconds; the 99.9th percentile is where autonomy actually lives, and that’s where your tooling and oversight need to scale.
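The 80/73/0.8 frame and the tail-percentile check above can both be computed from your own logs. A minimal sketch in Python, assuming each logged tool call has already been classified into three boolean fields (`has_safeguard`, `human_in_loop`, `irreversible`) and that turn durations are available in seconds. Those field names and the helper functions are hypothetical, and the nearest-rank percentile is one common definition, not necessarily the one Anthropic used.

```python
import math
from fractions import Fraction
from typing import Mapping, Sequence


def rate(calls: Sequence[Mapping[str, bool]], key: str) -> float:
    """Fraction of tool calls whose boolean field `key` is true."""
    if not calls:
        raise ValueError("no tool calls to summarize")
    return sum(1 for call in calls if call[key]) / len(calls)


def percentile_nearest_rank(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("no values to summarize")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    # Exact rational arithmetic avoids float rounding pushing the rank up by one.
    rank = math.ceil(Fraction(str(pct)) * len(ordered) / 100)
    return ordered[rank - 1]


def oversight_summary(calls: Sequence[Mapping[str, bool]],
                      turn_durations_s: Sequence[float]) -> dict:
    """Summarize safeguard coverage and the turn-duration tail for one deployment."""
    return {
        "safeguard_rate": rate(calls, "has_safeguard"),
        "human_in_loop_rate": rate(calls, "human_in_loop"),
        "irreversible_rate": rate(calls, "irreversible"),
        "median_turn_s": percentile_nearest_rank(turn_durations_s, 50),
        "p99_9_turn_s": percentile_nearest_rank(turn_durations_s, 99.9),
    }
```

Comparing `irreversible_rate` against a budget such as 0.01 gives a simple alert condition. A 99.9th percentile needs on the order of thousands of turns before it means much, so small deployments may want to track a lower tail percentile alongside it.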

Open Questions

Has the 99.9th percentile turn duration kept climbing past 45 min in the May 2026 timeframe? Anthropic noted a decline since mid-January (user base doubled, hobby projects shifted to work tasks). Worth a refresh-cycle check against subsequent reporting.
Does the deployment-overhang frame generalize beyond coding? Software is uniquely amenable to supervisory oversight (output is testable). In domains like law, medicine, or finance, the autonomy-trust curve may look completely different.
Effect of Mnemosyne / Honcho / MemoryKit on the autonomy curve. Memory-provider adoption in Hermes-style agents (see MemoryKit) could change the autonomy-trust relationship: agents that remember prior approvals may build trust faster. Open empirical question.
Cross-model comparison. This data is single-provider (Anthropic). The agent-autonomy curve for GPT-5 / Gemini 2.5 / Grok agents may differ structurally. No public equivalent yet.

Jonathon's AI Wiki
