Source: raw/Tool_skill_or_subagent_Decomposing_an_agent_that_outgrew_its_prompt.md — YouTube https://www.youtube.com/watch?v=mWvtOHlZM-I — Will, engineer on Anthropic’s Applied AI team — Code with Claude London 2026.

A workshop walkthrough on what to do when an agent that worked on day one has had capability bolted on for months, the system prompt has swollen past 400 lines, the tool count has crept past a dozen with three of them wrapping sub-agents, and the eval has started regressing. Will runs the audience through a sample inventory-management agent called “Stock Pilot” and modernizes the agentic primitives — using skills for progressive disclosure of business logic, replacing custom tools with Claude-Code-style primitives (bash, read, write, code execution, file system), and using Claude Managed Agents’ (CMA) native callable-agents API for the one sub-agent that survives the cut. The eval climbs from a re-measured 62% baseline (was thought to be 83%) up to 92%, with lower token usage, lower cost, and lower latency. The technique he names for the whole exercise is “hill climbing” on evals using Claude Code itself as the triage engine.

Key Takeaways

  • The decomposition decision is tool vs. skill vs. subagent — picking the right agentic primitive for each capability rather than bolting more code onto whatever shape the agent already has.
  • Stock Pilot started as: 1 orchestrator + 400-line system prompt + 12 tools (3 of them wrappers around sub-agents with isolated context windows) → eval at 62% (it was assumed to be 83% based on stale measurements).
  • Stock Pilot ended as: 1 orchestrator running on Claude Managed Agents + 15-line system prompt + 3 primitive tools (bash / read / write) + 1 native callable-agent (forecasting) → eval at 92%, with lower token usage, lower cost, and lower latency.
  • Three observed root-cause patterns when an agent regresses under bolted-on capability — surfaced by having Claude Code triage the eval failures: (1) the model is doing reasoning that should be done by a tool; (2) output-structure enforcement breaks down between orchestrator and sub-agents (the sub-agent gets the task right, but the orchestrator can’t read the answer); (3) policy conflicts inside a long system prompt — two rules that lived in different parts of the prompt end up contradicting each other.
  • Skills beat stuffing the system prompt. The system prompt is for what Claude needs in mind regardless of task; skills are for what Claude needs sometimes. Progressive disclosure via skills protects the context window and lets Claude reach for information only when a task actually requires it.
  • Build agents on the same human-like primitives Claude Code uses — file system, code execution, web search, to-do list — and add custom tools only when those primitives demonstrably can’t cover the job. CMA exposes these primitives natively; the builder doesn’t have to write them.
  • Code execution often beats wrapping a tool. Instead of giving an agent a CSV-reader tool, give it bash and let it write a Python script. The agent uses far fewer tokens when it can write code, run code, and read results — versus loading the whole CSV into context and reasoning over the raw bytes. Will’s before/after on a single Stock Pilot task: token usage drops from over 200,000 to a small fraction, cost down, latency down.
  • MCP order matters. Start with Claude Code primitives → add custom tools specific to this agent → publish as an MCP server only when multiple clients need the same standardized governed tool set. A common customer anti-pattern is running straight to MCP first and ending up with a chaotic ecosystem of overlapping MCP servers.
  • Two valid sub-agent use cases: (1) parallelize many Claudes on the same problem — deep research, codebase exploration; (2) need a fresh-context reviewer separated from the writer — code-review pass, or in Stock Pilot’s case, forecasting (you don’t want the customer-facing Claude to also be the one running the forecast).
  • Don’t expose sub-agents as tools when CMA has a native callable-agents API for them. Tool-wrapping makes orchestrator-to-sub-agent communication brittle (the F2 eval failure in Stock Pilot was exactly this) and logging across multiple agents hard. CMA’s native API gives observability and metrics across the multi-agent transcript at the same fidelity as the orchestrator.
  • Frontier-model trend: collapse sub-agents into the main orchestrator as models get stronger. Customers increasingly need fewer sub-agents because the main agent handles more, and Will explicitly recommends scrapping the sub-agent entirely when feasible.
  • Eval is the decomposition compass. “Hill climbing” on evals is the technique name — run the suite, triage failures (Will uses Claude Code itself with Opus 4.7 at extra-high effort), make one architectural change, re-run, repeat. Always expand the eval suite as product capability expands so it covers what you actually care about.

Details

The Stock Pilot before-state

Stock Pilot is a fictional inventory-management agent for a mid-size retailer that the workshop uses to mirror real customer trajectories. The agent flags low stock, forecasts demand, picks suppliers, files purchase orders, and writes weekly reports. None of these capabilities are individually complex. The problem is the trajectory — each capability was added later, each time as bolt-on code, without revisiting architecture.

Architecture at the start of the session:

  • A single orchestrator at the top.
  • A 400-line system prompt that has accumulated policy after policy as new business requirements landed.
  • 12 tools — three of which are wrappers around sub-agents with completely isolated context windows (one of them is a forecaster, one is a report writer, one is an ordering helper).
  • Built directly on the Anthropic Messages API with a hand-rolled agent harness (this lives in the workshop repo under before/).

Eval results before the session: the suite has 12 tasks, with single-turn regression tasks carrying IDs prefixed R and multi-turn failure-mode tasks carrying IDs prefixed F. It uses five grader types, a mix of deterministic graders (turn count, latency, token usage) and LLM-as-judge graders (output quality, tone, personality). The assumed pass rate was 83% (Will calls 17% failure “really expensive” for a manufacturing/inventory context). When he actually ran the suite live in front of the audience, it came back at 62%, well below the assumed baseline, which itself shows how easily an agent drifts when no one is hill-climbing.
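As a rough illustration of how deterministic graders roll up into a single pass rate, here is a minimal sketch. The `RunResult` fields, the grader factories, the thresholds, and the sample numbers are all hypothetical and are not taken from the workshop repo. An LLM-as-judge grader would be one more function with the same signature that calls a model instead of comparing numbers.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunResult:
    """Measurements captured from one eval task run (hypothetical fields)."""
    task_id: str
    turns: int
    latency_s: float
    tokens: int


# A grader takes a run result and returns True on pass.
Grader = Callable[[RunResult], bool]


def max_turns(limit: int) -> Grader:
    return lambda r: r.turns <= limit


def max_latency(limit_s: float) -> Grader:
    return lambda r: r.latency_s <= limit_s


def max_tokens(limit: int) -> Grader:
    return lambda r: r.tokens <= limit


def pass_rate(results: list[RunResult], graders: list[Grader]) -> float:
    """A task passes only if every grader passes; returns the fraction passed."""
    if not results:
        return 0.0
    passed = sum(all(g(r) for g in graders) for r in results)
    return passed / len(results)


# Illustrative numbers only, not measurements from the talk.
results = [
    RunResult("R1", turns=4, latency_s=12.0, tokens=18_000),
    RunResult("F1", turns=15, latency_s=95.0, tokens=210_000),
]
graders = [max_turns(10), max_latency(60.0), max_tokens(100_000)]
rate = pass_rate(results, graders)
print(f"pass rate: {rate:.0%}")
```

Re-running a function like this on every change, rather than quoting a remembered number, is what keeps the baseline from going stale.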

Eval as architecture compass — three failure-mode patterns surfaced

Will opens Claude Code (running Opus 4.7 at extra-high effort — his standing default), uses its bash capability to run uv run evals-agent before, and then asks Claude to triage the failures. Claude surfaces three thematic root causes that map cleanly onto the three decomposition primitives the workshop is about:

  • F1 — daily low-stock sweep. The agent gets the right answer but takes a winding path. Claude’s diagnosis: the model is doing reasoning that should be a tool. It’s burning turns and tokens reasoning across inventory data that a code-execution primitive could just process directly.
  • F2 — ordering under a promotion package. The sub-agent gets the task right, but the orchestrator can’t interpret the response — a communication breakdown between sub-agent and orchestrator. Will calls this one of the most common patterns customers hit with sub-agent-heavy systems: the output structure between a sub-agent and its parent isn’t enforced tightly enough.
  • R8 — forecasting during a promotion month. The agent pulls the right forecast baseline (12 units/day) and the right promotion multiplier (3.1×), but somewhere in the calculation it ends up using 1.35× instead of 3.1×. This looks like a hallucination, but the cause isn’t the model: two policies live in different parts of the 400-line system prompt and contradict each other, and the model gets pulled between them.
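To make the size of the R8 error concrete, the arithmetic can be worked through as below. The formula (daily promotional demand equals baseline times multiplier) is an assumption; the talk gives the baseline and the two multipliers but not the exact calculation.

```python
BASELINE_UNITS_PER_DAY = 12   # from the talk
PROMO_MULTIPLIER = 3.1        # the multiplier the agent retrieved correctly
WRONG_MULTIPLIER = 1.35       # the multiplier that ended up in the calculation


def promo_forecast(baseline: float, multiplier: float) -> float:
    """Assumed formula: daily promotional demand = baseline * multiplier."""
    return baseline * multiplier


expected = promo_forecast(BASELINE_UNITS_PER_DAY, PROMO_MULTIPLIER)
actual = promo_forecast(BASELINE_UNITS_PER_DAY, WRONG_MULTIPLIER)
shortfall = 1 - actual / expected

print(f"expected {expected:.1f} units/day, got {actual:.1f} units/day")
print(f"forecast is {shortfall:.0%} too low")
```

Under that assumption the agent forecasts about 16.2 units/day where 37.2 was correct, an under-forecast of roughly 56%, which is why a conflicting policy buried in the prompt is costly for an ordering workflow.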

The three root causes line up one-to-one with the three primitives the rest of the talk explores: F1 with tools, R8 with skills, and F2 with sub-agents.

Replace system-prompt bloat with skills (progressive disclosure)

Will’s first prompt to Claude Code is essentially “look at agent.py, my system prompt has gotten too long, can we use skills instead for progressive disclosure?” The mental model he repeats:

Leave the system prompt only for the information that Claude needs in its mind regardless of the task that you give it. Skills are fantastic for packaging information that Claude is going to need some of the time, not all of the time.

A skill in the agent context is the same primitive that ships in Claude Code: a packaged, composable bundle of instructions that Claude pulls into context on its own when it recognizes a task needs it. If a user asks Stock Pilot to build a forecast, the forecasting skill gets pulled in. If they don’t, it doesn’t pollute context.

Beyond context-window efficiency, the architectural win is resolution of policy conflicts: when each policy lives in its own skill, two contradictory rules can’t compete in a single system prompt because only one is in context for any given task. This is the structural fix for the R8 failure.
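The progressive-disclosure idea can be sketched in a few lines. This is a toy model, not the actual skills implementation: the `Skill` class, the skill names, and the policy text are invented for illustration, and in a real agent the model itself decides when to pull a skill in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    description: str  # short; always visible to the model
    body: str         # full policy text; loaded only on demand


# Hypothetical skills; the policy text is a placeholder.
SKILLS = {
    "forecasting": Skill(
        "forecasting",
        "How to build demand forecasts, including promotion months.",
        "(full forecasting policy text)",
    ),
    "purchase-orders": Skill(
        "purchase-orders",
        "How to pick suppliers and file purchase orders.",
        "(full ordering policy text)",
    ),
}

BASE_PROMPT = "You are Stock Pilot, an inventory-management agent."


def initial_context() -> str:
    """Base prompt plus a one-line index of skills; no skill bodies yet."""
    index = "\n".join(f"- {s.name}: {s.description}" for s in SKILLS.values())
    return f"{BASE_PROMPT}\n\nAvailable skills:\n{index}"


def load_skill(context: str, name: str) -> str:
    """Append one skill's full body to the context."""
    skill = SKILLS[name]
    return f"{context}\n\n[skill: {skill.name}]\n{skill.body}"


ctx = initial_context()
ctx_forecast = load_skill(ctx, "forecasting")
```

After a forecasting request, only the forecasting policy body is in context; the ordering policy stays out, so the two cannot contradict each other within a single task.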

After the refactor: system prompt drops from ~400 lines to ~50 lines (and ultimately to 15 lines by the end of the session). Pre-built skills get activated to cover the policies that previously lived inline. The agent is then redeployed to Claude Managed Agents.

Replace custom tools with Claude-Code primitives (and code-execution)

The core principle Will articulates for tools:

When we build agents, we lean into the same primitives that we as humans have access to.

A human showing up to work has a computer, a file system, a browser, and (if they’re an engineer) the ability to write and execute code. Claude Code as an agent has exactly the same primitives — file system, code execution, web search, to-do list. This is why dropping a stronger model into Claude Code makes the whole product better without changing tools: the primitives compose with intelligence rather than competing with it.

The Stock Pilot before-state had bespoke tools for every operation — retrieve data, analyze data, look up suppliers, etc. Will’s diagnostic question for each one: could code execution replace this? For an inventory-management agent reading Excel and CSV data, the answer is almost always yes — bash + a Python script is more flexible than a hand-rolled CSV-reader tool, and (load-bearing point) uses dramatically fewer tokens because the agent doesn’t have to read the raw CSV bytes into its context window.

Concrete before/after Will shows on screen for a single task:

  • Before (custom tools, CSV loaded into context): over 200,000 tokens per task.
  • After (file system + code execution primitives): token usage drops dramatically (Will doesn’t read out the exact number but the bar visibly collapses; cost drops; latency drops).

He’s careful to note: this won’t always be a win. Sometimes replacing a tool with code execution will regress a specific task. But for data-heavy work on this agent it was clearly the right move.
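A script of the kind the agent might write for itself could look like the sketch below. The column names and sample rows are hypothetical, and the CSV is inlined here only to keep the example self-contained; in the sandbox it would be a file on disk.

```python
import csv
import io

# Hypothetical inventory extract; in the agent this would be a file in the sandbox.
INVENTORY_CSV = """sku,on_hand,reorder_point
A-100,4,10
B-200,50,20
C-300,0,5
"""


def low_stock(csv_text: str) -> list[dict]:
    """Return rows whose on-hand quantity is below the reorder point."""
    flagged = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        on_hand = int(row["on_hand"])
        reorder_point = int(row["reorder_point"])
        if on_hand < reorder_point:
            flagged.append(
                {"sku": row["sku"], "on_hand": on_hand, "reorder_point": reorder_point}
            )
    return flagged


flagged = low_stock(INVENTORY_CSV)

# Only this short summary is printed back into the model's context.
for item in flagged:
    print(f"{item['sku']}: {item['on_hand']} on hand, reorder at {item['reorder_point']}")
```

The token saving comes from the last step: the model reads a few summary lines instead of every row of the file.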

Critical implementation note: when building on Claude Managed Agents, these Claude-Code-equivalent primitives are included by default. The builder does not have to write a bash tool, a file system tool, or a code-execution tool. CMA exposes them natively.

The MCP order: primitives → custom → MCP only when shared

Will gives an explicit ordering for tool decisions:

  1. Start with Claude Code primitives — file system, code execution, web search, to-do list. Remove any you don’t need (e.g., if your agent never needs the web, drop web search). This is the foundation.
  2. Add custom tools specific to your agent’s domain. Only the ones the primitives genuinely can’t cover.
  3. Publish to MCP only when you have a common, standardized, governed set of tools that multiple clients need to access (multiple agents, multiple Claude Code clients, multiple downstream products).
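The ordering can be read as a small decision function. The function name, its parameters, and the example capabilities are hypothetical; it is only meant to show that MCP is the last branch, not the first.

```python
def tool_home(covered_by_primitives: bool, client_count: int, governed: bool = False) -> str:
    """Decide where a capability should live, following the ordering above."""
    # Step 1: anything the built-in primitives can do stays with the primitives.
    if covered_by_primitives:
        return "primitives"
    # Step 3: MCP only for a governed tool set that several clients share.
    if client_count > 1 and governed:
        return "mcp"
    # Step 2: otherwise a custom tool specific to this one agent.
    return "custom"


# Illustrative capabilities, not an inventory of the real Stock Pilot tools.
print(tool_home(covered_by_primitives=True, client_count=1))                  # CSV reading
print(tool_home(covered_by_primitives=False, client_count=1))                 # agent-specific action
print(tool_home(covered_by_primitives=False, client_count=4, governed=True))  # shared tool set
```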

The anti-pattern he repeatedly sees with customers: running to MCP first. What that produces is “a chaotic ecosystem of MCP servers” with overlap, no central governance, and substantial context-pollution overhead from MCP-tool metadata sitting in every conversation.

He also surfaces an emerging industry pattern that goes one step further: using code execution to invoke APIs and CLIs directly, rather than wrapping them as tools at all. One of MCP’s drawbacks is the context cost — tool schemas have to be loaded, descriptions have to be parsed. For some integrations, just giving Claude bash and letting it call the API with curl or use the vendor’s CLI is more flexible and uses less context.

Sub-agent: when to keep, when to collapse, how to wire

After the system prompt is sliced into skills and most custom tools are collapsed into primitives, three sub-agents remain to evaluate. Will applies the two-criterion test:

  • Parallel-many-Claudes use case. Use a sub-agent when you want to throw a lot of Claude at a problem in parallel — deep research, codebase exploration, multi-document review. None of the three Stock Pilot sub-agents fit this — the workload is sequential, not embarrassingly parallel.
  • Fresh-mind reviewer use case. Use a sub-agent when the task genuinely needs a Claude with a different context than the orchestrator. The Claude Code archetype is the code-review sub-agent: you don’t want the same instance of Claude that wrote the code to review it. In Stock Pilot, forecasting fits this pattern — the customer-facing Claude shouldn’t also be the forecaster, because anything in the conversation context could distort the forecast.

The ordering helper and the report writer fail both tests — they get collapsed into the main orchestrator and replaced with skills + primitives. Only the forecaster survives.
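Applied to the three Stock Pilot sub-agents, the two-criterion test reduces to a short check. The `SubAgent` class and the function name are hypothetical; the flag values follow the assessment described above.

```python
from dataclasses import dataclass


@dataclass
class SubAgent:
    name: str
    runs_in_parallel: bool     # many Claudes working on the same problem
    needs_fresh_context: bool  # must be isolated from the parent's context


def keep_as_sub_agent(agent: SubAgent) -> bool:
    """Keep a sub-agent only if it meets at least one of the two criteria."""
    return agent.runs_in_parallel or agent.needs_fresh_context


STOCK_PILOT_SUB_AGENTS = [
    SubAgent("forecaster", runs_in_parallel=False, needs_fresh_context=True),
    SubAgent("report writer", runs_in_parallel=False, needs_fresh_context=False),
    SubAgent("ordering helper", runs_in_parallel=False, needs_fresh_context=False),
]

kept = [a.name for a in STOCK_PILOT_SUB_AGENTS if keep_as_sub_agent(a)]
collapsed = [a.name for a in STOCK_PILOT_SUB_AGENTS if not keep_as_sub_agent(a)]
print("keep:", kept)
print("collapse into orchestrator:", collapsed)
```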

The wiring change for the surviving sub-agent matters as much as the keep/cut decision:

  • Before: the forecaster was exposed to the orchestrator as a tool — the orchestrator called it like any other function. This is the pattern that produced the F2 communication-breakdown failure: brittle output structure between two Claudes, and logging that has to be reconstructed by stitching transcripts.
  • After: the forecaster is wired through CMA’s native callable-agents API. This gives session-level observability and metrics across the multi-agent run at the same fidelity as the orchestrator itself. Logging is unified, output structure is managed by the platform.
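To show why the tool-wrapped wiring is brittle, here is a sketch of the validation a builder has to hand-roll when a sub-agent's reply comes back as a plain tool result. This is not CMA's API. The reply contract (`sku`, `units_per_day`) and the function name are hypothetical.

```python
import json

# Hypothetical contract between the orchestrator and a tool-wrapped sub-agent.
REQUIRED_FIELDS = {"sku": str, "units_per_day": (int, float)}


def parse_sub_agent_reply(raw: str) -> dict:
    """Parse and validate a sub-agent reply; raise ValueError if off-contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        value = data[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise ValueError(f"wrong type for field: {field}")
    return data


structured = parse_sub_agent_reply('{"sku": "A-100", "units_per_day": 37.2}')

# A correct answer in free prose still fails: the orchestrator cannot read it.
try:
    parse_sub_agent_reply("The forecast for A-100 is about 37 units per day.")
    prose_accepted = True
except ValueError:
    prose_accepted = False
```

The second call is the F2 pattern in miniature: the sub-agent's answer is right, but without an enforced structure the parent has nothing reliable to parse.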

Will closes the section with the frontier-model trend that’s pulling sub-agent counts down rather than up:

What we have a lot of customers doing is actually just consuming capability into their main orchestrator because frontier models have gotten intelligent enough to manage across more information where you just don’t need as many sub-agents.

In other words: the sub-agent decision keeps getting more conservative each model release. Default to fewer sub-agents and reach for them only when the two criteria above are clearly met.

Final architecture + results (92% eval, lower tokens, lower cost)

End-state Stock Pilot:

  • 1 orchestrator, deployed on Claude Managed Agents (off-loads infrastructure, scaling, security, memory management).
  • System prompt down from 400 lines to 15 lines — only what Claude needs regardless of task.
  • 3 primitive tools — bash, read, write. Data is synced into the CMA sandbox environment when the agent starts so it can be reasoned across with code execution.
  • 1 native callable-agent — the forecaster, wired through CMA’s API (not exposed as a tool).
  • Business logic moved entirely into skills, pulled into context on demand.

Eval result: 92% pass rate, up from the actual baseline of 62%. Token usage down (driven mostly by code execution replacing context-loaded data). Cost down. Latency down on most tasks; held roughly flat on a few of the high-intelligence tasks like forecasting, where Will is comfortable trading a small amount of latency for the quality and cost improvements. Turn count stayed roughly the same, and Will is explicitly fine with that: tokens per turn fell enough that even a task taking a few more turns is not a cost regression.

Closing technique the talk wants to be remembered for: hill climbing on evals. Run the suite, get a number, change one architectural primitive, re-run, watch the number move. Use Claude Code (on Opus 4.7 at extra-high effort) as the triage engine for the failures. Expand the eval suite as product capability expands so it always covers what you actually care about.
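The loop itself is simple enough to sketch. Everything here is a stand-in: `fake_run_suite` replaces a real eval run, and the change names and score deltas are invented for illustration, not results from the talk.

```python
from typing import Callable


def hill_climb(
    baseline: float,
    changes: list[str],
    run_suite: Callable[[list[str]], float],
) -> tuple[list[str], float]:
    """Try one change at a time; keep it only if the pass rate improves."""
    kept: list[str] = []
    best = baseline
    for change in changes:
        score = run_suite(kept + [change])
        if score > best:
            kept.append(change)
            best = score
    return kept, best


# Stand-in for a real eval run; the deltas are invented for illustration.
FAKE_GAIN = {"skills": 0.10, "primitives": 0.12, "extra-mcp-server": -0.05}


def fake_run_suite(applied: list[str]) -> float:
    return round(0.60 + sum(FAKE_GAIN[c] for c in applied), 2)


kept, best = hill_climb(0.60, list(FAKE_GAIN), fake_run_suite)
print(f"kept changes: {kept}, pass rate: {best:.0%}")
```

Changing one thing per iteration is what makes each movement in the number attributable to a specific architectural decision.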

Try It

  • Inventory your agent’s tools. For each tool, ask: could code execution replace this? If the tool is reading a CSV, parsing a spreadsheet, transforming data between formats, or invoking a CLI, the answer is almost always yes. Code execution + the file system primitive is a more flexible substitute that uses far less context.
  • Compute your system prompt’s line count. If it’s over ~100 lines, you’ve crossed the bolt-on threshold. Identify sections that are task-specific (a forecasting policy, an output-formatting policy, a domain-specific procedure) and refactor each into a separate skill. Leave the system prompt with only what Claude needs in mind for every task — identity, tone, hard rules.
  • For every sub-agent, run the two-criterion test. Does this sub-agent (a) parallelize many Claudes on the same problem, or (b) genuinely need a fresh context separated from the parent? If neither, collapse it into the main orchestrator and replace its capability with a skill plus the right primitives. Frontier models are absorbing more of these jobs into the main agent each release.
  • Wire eval as a Claude Code workflow. Put your eval suite behind a single command. Have Claude Code (Opus 4.7 at extra-high effort is Will’s default) run it, then triage the failures and propose root causes. This is the loop that turns architecture decisions into evidence-based ones instead of intuition-based ones — what Anthropic internally calls hill climbing.
  • If you’re using CMA, swap sub-agents-as-tools to the native callable-agents API. The native path gives unified observability and metrics across the multi-agent transcript, fixes the communication-breakdown failure mode, and removes a class of brittle output-structure bugs.
  • Default to Claude Code primitives, then custom tools, then MCP — never the other order. MCP only when multiple clients need the same standardized governed tool set. Resist the pull to stand up an MCP server before the agent has even stabilized — that’s the path to a chaotic overlapping-MCP ecosystem.