Source: The expanding toolkit (Anthropic / Lucas — Research PM, Code with Claude 2026, May 7 2026 conference talk, YouTube KLCuxMDZSDg)
Lucas (Research PM, Anthropic) frames the 2026 model story as “the scaffolding you had to build last year now ships with the model.” A year ago, building an agent meant writing routers, retry decorators, output validators, context compactors, image-coordinate math for computer use — hundreds of lines of code before any product logic. In 2026, that scaffolding has moved into the model itself. The talk walks through four capability areas — tool use, context management, code execution, computer use — showing the before/after, and ships a quick Claude Code tip with each one (pre/post tool hooks, /context for live grid breakdown, /schedule for cron-triggered autonomous runs, Claude in Chrome for human-like in-browser testing). Closes with a sharp rule for builders: “any code you’re writing that compensates for model unreliability has a half-life of just months. Code that connects the model to your world tends to compound.”
Key Takeaways
- The thesis: model + integrated capabilities, not just an input-output LLM. Tool use, context management, code execution, computer use all baked in. Don’t think of the LLM as a box you wrap with scaffolding; think of it as an expanding toolkit that absorbs the scaffolding over time.
- Tool use — model-driven routing replaces handwritten routers.
- Before: brittle if/else routers with string-matching heuristics (“if model mentioned SQL, give it the database tool”), retry decorators because tools failed often.
- After: the model searches across the tool set itself. Tool selection accuracy is now high enough that routers and prefiltering usually make things worse, not better. When a tool errors, Claude sees the error, recovers, calls again — no harness retry logic needed.
- Quick tip: describe the output schema in the tool definition, not just inputs. Example: "This search_docs tool returns id, title, snippet, and score." Lets Claude rank or filter outputs without an extra round trip back to the harness.
- Claude Code tip: pre/post tool-use hooks in `~/.claude/settings.json` — block specific tool calls in specific situations, or analyze/log outputs after a call.
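A minimal sketch of the output-schema tip as a Messages API tool definition. The `search_docs` name and its fields mirror the talk's example; everything beyond the standard `name`/`description`/`input_schema` shape is an illustrative choice, not something the talk specifies:

```python
# Tool definition for the Anthropic Messages API. The trick from the talk:
# the prose description covers the *output* fields too, so Claude can rank
# or filter results without a round trip back to the harness.
search_docs_tool = {
    "name": "search_docs",
    "description": (
        "Full-text search over the documentation corpus. "
        "Returns a JSON list of results, each with: "
        "id (string), title (string), snippet (string), "
        "and score (float, 0-1 relevance). "
        "Results are unsorted; rank or filter them as needed."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search query"},
            "limit": {"type": "integer", "description": "Max results"},
        },
        "required": ["query"],
    },
}
```

The only change from a conventional definition is the middle of the description string; the input schema is untouched.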
- Context management — built-in beats hand-rolled memory systems.
- Before: chunking, RAG, summarization-after-N-turns calls to a smaller model, manually moving cache breakpoints across turns to save cost — homegrown memory scaffolding.
- After: 1M context length at flat pricing + server-side compaction + context editing turns most of that into a few lines of config. Closer to “the feeling of an infinite context window” the morning keynote referenced.
- Quick tip: clear tool results every N turns. Pruning stale tool outputs (old screenshots, file reads, search dumps) saves enormous tokens while keeping Claude's decision trail in the transcript intact.
- Claude Code tip: `/context` — live colored grid breakdown of what's filling your context window (messages / tool results / system / MCP definitions). Try it now if you have Claude Code open. Surfaces optimization suggestions in the same view.
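The pruning tip above can be sketched as a harness-side helper. This is a hypothetical implementation, not something the talk shows: `keep_last_n` and the placeholder string are arbitrary choices, and `messages` follows the Messages-API convention that tool results arrive in user-role turns:

```python
def prune_tool_results(messages, keep_last_n=4, placeholder="[tool result pruned]"):
    """Replace tool_result content older than the last N messages with a stub.

    Keeps the transcript's decision trail (assistant text, tool *calls*)
    intact while dropping bulky stale outputs: old screenshots, file
    reads, search dumps.
    """
    cutoff = len(messages) - keep_last_n
    pruned = []
    for i, msg in enumerate(messages):
        content = msg.get("content")
        if i < cutoff and msg.get("role") == "user" and isinstance(content, list):
            # Stub out only tool_result blocks; leave other blocks alone.
            content = [
                {**block, "content": placeholder}
                if block.get("type") == "tool_result"
                else block
                for block in content
            ]
            pruned.append({**msg, "content": content})
        else:
            pruned.append(msg)
    return pruned
```

With server-side context editing enabled this logic lives in the API instead; the sketch is only useful if you still manage the transcript yourself.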
- Code execution — the write/run/fix loop now happens server-side in one API turn.
- Before: find a VM provider, spin up a sandbox, send code, run, parse trace back, feed back into the model, loop. Multiple API round-trips per iteration.
- After: the code execution tool gives Claude a hosted server-side sandbox. The whole write→run→fix loop happens inside a single API turn. No harness round-trips between Claude and a VM.
- Mental model (not tip): Claude with code execution gets its own computer to do stuff on — a separate-from-your-local-bash environment for stateless compute, data analysis, library installs, scratch work. When Claude needs your repo or your Python venv, it switches to local bash. Claude knows which to use for what.
- Claude Code tip: `/schedule` — cron-triggered autonomous runs. The same self-iteration loop, on a timer, completely autonomous.
- Computer use — Opus 4.7 native 1440p screenshots eliminate the scaling math pile.
- Before: 1080p laptop screenshots had to be downscaled to fit Claude’s pixel limit, scaling factor tracked, sampled clicks scaled back up to original resolution, all wrapped in retries and verify statements. Lots of image glue.
- After: Opus 4.7 takes native-resolution screenshots and returns 1:1 pixel coordinates up to 1440p, capturing the vast majority of display resolutions. Scaling math gone. Just send the image; trust the click. (For 4K, still downscale on your side. For ≤1440p, just experiment with resolutions and image formats — JPEG/PNG/WebP each compress differently and produce different artifacts; pick what works for your specific UIs.)
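For the 4K case, the remaining harness-side bookkeeping is pure coordinate math. A hypothetical helper, assuming a 2560×1440 pixel ceiling for "up to 1440p" (the talk does not state exact dimensions):

```python
MAX_W, MAX_H = 2560, 1440  # assumed native-coordinate ceiling ("up to 1440p")

def fit_scale(width, height, max_w=MAX_W, max_h=MAX_H):
    """Scale factor that fits a screenshot inside the model's pixel ceiling.

    Returns 1.0 for displays at or below the ceiling (no scaling needed).
    """
    return min(1.0, max_w / width, max_h / height)

def to_model(x, y, scale):
    """Map a native-resolution point into the downscaled image sent to Claude."""
    return round(x * scale), round(y * scale)

def to_native(x, y, scale):
    """Map a click Claude returns on the downscaled image back to native pixels."""
    return round(x / scale), round(y / scale)
```

For any display at or below 1440p, `fit_scale` returns 1.0 and both mappings become identity — which is exactly the "just send the image; trust the click" path.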
- OSWorld benchmark trajectory: Anthropic’s headline computer-use eval. <12 months ago Claude was scoring <50%. Opus 4.7 reports 78%, on track to hit 80%. Computer use is “at the cusp of broad usability.”
- Claude Code tip: Claude in Chrome (claude.ai/chrome) lets a Claude Code session navigate the web and run computer use against the live browser, including local development. Demo: Claude Code works on a Trello-like board, uses Claude in Chrome to reproduce a user-reported bug (typing into the form, watching it fail), goes back into the source, fixes the wiring, retries, and succeeds; then tests drag-and-drop, hits a second bug (card lands in the wrong column), diagnoses and fixes that too, and recaps the changes. Closes the human-software-needs-human-like-testing loop: Claude finds and fixes bugs without the developer hand-holding it to either the bug or the fix.
- The closing rule. “Any code you’re writing that is compensating for model unreliability will have a half-life of just months. Leave that work to us — we’ll keep absorbing retry logic, routers, planners, verification loops into the model. Code that connects your model to your world — your custom tools, your data, your auth, your specific context — that compounds. The model can’t absorb what it can’t see; giving it that is what’s valuable.”
- Forward direction. “Every agent, every piece of software will get a front door for agents in the near future. The interesting work is no longer making the model more reliable. The interesting work is what you put on the other side of your agent front door that nobody else can.” Restates the keynote’s “build for agents, not just users” framing in builder-side language.
Where it fits in the wiki
- Sister talk to Mahes’ Memory + Dreaming talk — both are Code with Claude 2026 deep-dives by Anthropic platform/research staff. Mahes covers Memory + Dreaming in Managed Agents API; Lucas covers the four agentic capabilities below the orchestration layer (tool use, context management, code execution, computer use). They compose: Memory + Dreaming sit on top of these primitives.
- Reframes the computer use and extended thinking articles. This talk is the canonical 2026-state explainer for those primitives, with named numbers (OSWorld <50% → 78%) and a clear before/after for each. Tool use itself doesn’t have a dedicated wiki article yet; this talk is the de-facto primary reference for that primitive’s 2026 shape.
- Pairs with Karpathy’s Claude Code techniques / the broader Karpathy AIOS line — Lucas’ “scaffolding moves into the model” thesis is the model-vendor-side view of what Karpathy frames as “give the LLM the right tools, then trust it.” Same direction, opposite-end framing.
- Cross-references the Week 19 release digest for the Claude Code features the talk's tips lean on (`/context`, `/schedule`, `/v1/models` plugin discovery, hooks).
- Confirms the keynote's "infinite context window" framing is real — server-side compaction + 1M flat pricing + context editing.
Implementation
- Tool/Service: Anthropic API + Claude Code, all 2026 features.
- Setup: Each capability is opt-in via API parameter or Claude Code config.
- Tool use schema-with-output: include the output schema in the tool description, not just input parameters.
- Pre/post tool hooks: `~/.claude/settings.json` → `hooks.preToolUse`, `hooks.postToolUse`.
- Server-side context features: 1M context (flat pricing), server-side compaction, context editing — all single API parameters.
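A minimal settings sketch for the hook wiring. The key names follow the talk's `hooks.preToolUse` / `hooks.postToolUse` naming, and the matcher syntax and script paths are assumptions; the real schema in your Claude Code version may differ, so check the hooks docs:

```json
{
  "hooks": {
    "preToolUse": [
      {
        "matcher": "Bash",
        "command": "python ~/.claude/block_dangerous.py"
      }
    ],
    "postToolUse": [
      {
        "matcher": "*",
        "command": "python ~/.claude/log_tool_call.py"
      }
    ]
  }
}
```

The pre hook can block a matching call; the post hook sees the output after the fact, which makes it the natural place for audit logging.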
- Tool-results pruning: prune every N turns to keep the decision trail in the transcript and drop the stale bulk content.
- `/context`: Claude Code slash command for live context-window breakdown.
- Code execution: API tool. No setup; Claude gets its own sandbox automatically.
- `/schedule`: Claude Code slash command for cron-triggered autonomous runs.
- Computer use: Opus 4.7 native screenshots up to 1440p — just send the image, no scaling math.
- Claude in Chrome: install from claude.ai/chrome → Claude Code can drive the browser for testing/debugging.
- Cost: Tools use the model’s normal token billing. Code execution adds a sandbox layer (no separately quoted cost in the talk). 1M context at “flat pricing” — no per-token markup beyond the headline rate.
- Integration notes: Claude knows when to use code-execution sandbox vs local bash and switches automatically. Pre/post hooks let you block dangerous tool calls or audit-log every call. Computer use is now reliable enough to do live UI testing inside an agentic coding session — Lucas demoed bug-find-and-fix with Claude in Chrome.
Open Questions
- Token cost of server-side compaction. Flat 1M pricing covers it, but at what point does the compaction itself reach a budget the user should know about? Not addressed.
- Output-schema field — semantics. Lucas puts the output description in the tool's prose description ("this tool returns id, title, snippet, score"). Is there a structured `outputSchema` field landing in the API spec, or is this still prose-in-description?
- Claude in Chrome auth/session model. When Claude Code drives a Chrome session, whose cookies/auth are loaded? Per-tab, per-session, persistent? Not detailed in the talk.
- Pre/post hook reliability. Hooks can block tool calls — what happens on hook failure (timeout, exception)? Fail-closed or fail-open? Default behavior?
- Computer-use throughput at 1440p native. Image bandwidth is the typical computer-use bottleneck. Native 1440p means more bytes; what’s the effective click-to-click latency vs the previous downscale-track-rescale flow?
Try It
- Watch the talk (YouTube `KLCuxMDZSDg`) — concrete demos for each capability, especially the Claude-in-Chrome bug-find-and-fix at the end.
- Open Claude Code right now and run `/context`. See exactly what's filling your window. Lucas explicitly invited the audience to try this live.
- Add output-schema text to one tool definition in your current project. Re-run a task that uses that tool and watch whether Claude makes fewer redundant follow-up calls.
- Add a pre-tool hook that just logs every tool call to a file. Read the log after a session — useful both for debugging and for understanding how Claude actually uses your tools.
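One way to sketch that logging hook. Assumptions to note: Claude Code hooks pass the tool-call event as JSON on stdin, the `tool_name`/`tool_input` field names follow the hook event convention, and the log path is an arbitrary choice — verify all three against the hooks docs for your version:

```python
import json
import sys
from datetime import datetime, timezone

LOG_PATH = "/tmp/claude_tool_calls.jsonl"  # arbitrary choice

def log_tool_call(event, path=LOG_PATH):
    """Append one tool-call event to a JSON-lines log with a timestamp."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "tool": event.get("tool_name", "unknown"),
        "input": event.get("tool_input", {}),
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

if __name__ == "__main__":
    # Claude Code passes the hook event as JSON on stdin.
    log_tool_call(json.load(sys.stdin))
```

Register it under `postToolUse` with a match-everything matcher, run a session, then read the `.jsonl` file to see which tools Claude actually reached for and with what arguments.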
- Try `/schedule` for a recurring task you do daily. The cron + autonomous-run combination is the lowest-effort first taste of "Claude works while I'm not watching."
- Install Claude in Chrome at claude.ai/chrome. Use it once to reproduce a UI bug in a project you're working on. The "Claude finds the bug, edits the code, retests" loop is the demo's centerpiece.
- Re-read the closing rule. Walk through the scaffolding code in your current project. What would have been “compensating for model unreliability” a year ago and is now redundant? Cut it.
Related
- Code with Claude 2026 — Opening Keynote — umbrella talk
- Memory and Dreaming for Self-Learning Agents (Mahes) — sister deep-dive
- The Thinking Lever (Matt Bleifer) — sister deep-dive on test-time compute
- Computer Use — Opus 4.7’s 1440p native screenshots
- Extended Thinking API reference
- Claude Code hooks — pre/post tool-use hook surface
- Claude Code CLI reference — `/context`, `/schedule`, and the slash-command surface
- Karpathy techniques for Claude Code
- Opus 4.7 Best Practices — model-side context for the OSWorld 78% claim
- Week 19 release digest