Source: Code with Claude London 2026 — Opening Keynote (YouTube 6amLO7I9xdg), Anthropic, May 21 2026. Author-provided captions.

Anthropic’s first Code with Claude outside San Francisco — same three-layer structure as the May 7 SF keynote (model layer → platform layer → Claude Code layer), but with two genuinely new Managed Agents primitives (self-hosted sandboxes + MCP tunnels), a fresh Mythos demonstration (read OpenBSD source, found a 27-year-old vulnerability that every prior reviewer/fuzzer/static-analyzer missed), and updated platform-scale numbers (API volume up 17× year-over-year, average developer running Claude Code 20+ hours per week). Opens with Boris Cherny; Lisa from Research PM covers the model layer; Angela Jiang + Caitlyn Burke cover the platform layer with a live Counter (fictional commerce SaaS) growth-agent demo; Cat Wu + Boris return for Claude Code with an AcmePay refunds demo plus the new Cloud Agents View in CLI.

Key Takeaways

  • Two new Managed Agents primitives launched today. Both detailed in Self-Hosted Sandboxes + MCP Tunnels. Self-hosted sandboxes — Cloud Managed Agents can now execute work in the customer’s own infrastructure with first-class support for Daytona, Cloudflare, Vercel, and Modal. MCP tunnels — internal MCP servers stay behind the customer’s firewall on a private network; Managed Agents reach them via a secure tunnel through tunnel.anthropic.com, no public-internet exposure required. Both configured in the Claude Developer Console.
  • Mythos found a 27-year-old OpenBSD vulnerability. Boris’s opening anecdote: Mythos read the entire OpenBSD source tree and surfaced a flaw that survived every human reviewer, every fuzzer, and every static analyzer thrown at the codebase for almost three decades. Framed as concrete evidence that “the jumps keep getting bigger and the intervals keep getting shorter.”
  • 17× year-over-year API growth. 20+ hours/week per developer on Claude Code. Boris: “API volume is up nearly 17× on the Claude platform. On Claude Code, the average developer is now spending over 20 hours a week running Claude.”
  • 8 frontier models in the last 12 months. Lisa (Research PM): Anthropic has shipped 17 versions of Claude since Claude 3 — eight of them frontier releases in the past year. Lineage callouts: Opus 3 (long-form code), Sonnet 3.5/3.6 (computer use safely), Sonnet 3.7 (think-before-answer), Opus 4 (first to write complex Excel + PowerPoint), Opus 4.7 + Mythos preview (own outcomes end-to-end, judgment in ambiguity).
  • Opus 4.7 customer wins. AMP moved its full smart-mode to Opus 4.7 and simplified its tooling (model no longer needed scaffolding). Rakuten: 3× more production-engineering tasks resolved than the previous model. Intuit: Opus 4.7 catches its own logical faults in the planning phase. Claude Design (Anthropic Labs) launched on Opus 4.7 — “the model has a taste for visual design.”
  • Build for the next model, not the current one. Lisa: “Make upgrades easy by automating evals. Build ambitious prototypes that don’t quite work today — that’s how you’ll notice when the exponential moves under you. The teams winning are the ones whose architecture absorbs the next big jump.”
  • Scaffolding will hold smarter models back. Lisa: “As models get smarter, the scaffolding that used to help can hold Claude back. More intelligent models often get further with generalized primitives like a filesystem and a sandbox computing environment.” Foreshadows leaner orchestration layers as model capability rises.
  • Advisor strategy as cost lever. Same primitive announced at SF — split execution from advising via the Messages API tools array. Sonnet executor + Opus advisor runs better and cheaper than Sonnet alone. Eve Legal: “frontier-model quality at 5× lower cost.”
  • Claude Code surfaces, layered. CLI (power users) → IDE plugin (follow edits live) → Claude Code Desktop (full-screen GUI, image rendering, sidebar control plane for parallel agents — local + remote). Plus the newest surface: Cloud Agents View in the CLI (“what’s running / waiting / done in one glance”). All four sit on the Claude Agent SDK same as third-party builders.
  • Anthropic wall-to-wall Claude Code → 200% increase in PRs/engineer. Held constant through substantial engineering-org scaling.
  • MercadoLibre at scale. 23,000 engineers all on Claude Code. 500,000+ PRs reviewed with human oversight. 9,000+ apps modernized. Oscar Mowen targeting 90% autonomous coding in a fully agent-driven PR loop by Q3 2026. Boris’s favorite detail: managers and VPs who haven’t committed code in years are shipping again — “Claude Code is putting coding back in the hands of people who’ve spent the last decade in reviews and roadmap sessions.”
  • Shopify cross-functional adoption. Engineering + non-engineers (PMs, designers, data scientists) all on Claude Code. Director of applied AI Andrew McNamara: “the speed is just crazy.”
  • Spotify migration tooling at scale. Niklas Gustafsson’s team merging 1,000+ PRs/month via Honk, cutting migration time by 90%+. Standalone deep dive: Spotify — Coding is no longer the constraint.
  • Bindi case study (mission impact, not just efficiency). Felicia Coruru’s foster-care SaaS used the Claude API to give caseworkers their paperwork hours back and took 20 days off the foster-family licensing process. Boris’s frame: “Not just an efficiency metric — that’s a kid connecting with a family.”
  • Routines reframed: higher-order prompts. “I’m not the one doing the prompting. I’m the one who creates a routine that does the prompting.” Schedule, webhook, or arbitrary API trigger; local or remote-Cloud-Compute execution. See Routines.
  • Code review (the product). Deploys a fleet of agents to traverse code changes + auxiliary files looking for critical bugs. “Thousands of companies use this every day, including every internal Anthropic team.”
  • Mobile remote control. Claude Code added to iOS and Android Claude apps. “Go to a park, touch grass, still get your tasks done.”
  • AutoFix — green PRs by default. Listens for code-review comments, flaky-CI events, and merge conflicts; proactively fixes each. In the Claude Code codebase itself, AutoFix is configured to fix the root cause of flaky CI, not just retry.
  • Claude Security. Overnight whole-codebase scan; ranks vulnerabilities by severity; lets you kick off Claude Code sessions to address each finding.
  • Three layers, one story. Boris’s closing frame: “The capability is already here. The remaining gap is how fast we put it to work.”

The three layers

Layer 1 — Model intelligence (Lisa, Research PM)

  • 17 Claude versions since Claude 3. Eight frontier in the past 12 months.
  • Lineage: Opus 3 (long-form code), Sonnet 3.5/3.6 (computer use safely), Sonnet 3.7 (think-before-answer), Opus 4 (Excel + PowerPoint authoring — “we didn’t know at the time”), Opus 4.7 + Mythos preview (end-to-end outcome ownership, judgment in ambiguity).
  • Three product-direction commitments: (a) higher judgment + code taste — trustable autonomous engineering, (b) context windows that feel effectively infinite when paired with high-quality memory, (c) multi-agent coordination for goals too big for one agent.
  • Task horizon is the metric to watch. Last year: minutes. Today: hours. Future: continuous, proactive, always-on agents responsible for high-level goals. Lisa’s framing example: not “Claude, write a project update” but “Claude, keep the project on track this week”; not “Claude, build a forecast” but “Claude, own and update the forecast over time.”
  • Builder advice — four parts: (1) Design for the next version, not the current one; (2) Lean on generalized primitives (filesystem + sandbox) — scaffolding from the prior model generation will hold the next generation back; (3) Keep making harder evals + ambitious prototypes — they’re how you notice the exponential moving; (4) Treat model upgrades as business opportunities — automate evals + testing so you can drop in a new model fast, hands-on-test every release.

Layer 2 — Claude Platform agents (Angela Jiang + Caitlyn Burke)

Two problems the platform exists to solve. (1) Getting the right outcome — too much prompt-opt, tool construction, harness engineering. (2) Ship-fast-but-scalably — easy to prototype, hard to scale in production.

Platform composition. API primitives tuned to Claude models + infrastructure to build/scale agentic systems + controls to operate those systems.

Advisor strategy (recap from SF). Same primitive. Eve Legal cite: 5× cost reduction at frontier-model quality. Good for premium-product cost containment + high-volume agentic workloads.

Cloud Managed Agents (productionised since SF). Recently shipped: multi-agent orchestration (Claude clones itself + delegates to specialist agents), outcomes (rubric-graded iteration to success), memory (public beta — persistent stores between sessions), dreaming (Claude reflects on past sessions, codifies learnings as new memories). All four are the foundation the London announcements build on.

The two London announcements:

  1. Self-hosted sandboxes — agents execute on customer infrastructure. First-class support for Daytona, Cloudflare, Vercel, Modal. Work-item queue pattern: Managed Agent emits a work item; customer’s chosen provider picks it up and spins a sandbox in the customer’s own account.
  2. MCP tunnels — internal MCP servers stay on the private network. Customer sets up a gateway in their own network → establishes a secure connection to Anthropic → any Managed Agent can hit those MCP servers via tunnel.anthropic.com URLs without exposing them to the public internet. Configurable in the Claude Developer Console.

Both detailed in Self-Hosted Sandboxes + MCP Tunnels.

Counter demo (live). Fictional commerce SaaS hosting small-business storefronts. Growth-experiment workflow on merchant onboarding. The Growthbot Managed Agent: posts in Slack (“there’s a clear winner”), proactively calls the A/B experiment via a feature-flags MCP server (accessed through an MCP tunnel — internal to Counter), opens a cleanup PR (executed on Counter’s own Vercel sandbox), screenshots before/after, and on What next? proactively surfaces a 46% drop-off opportunity it spotted querying Counter’s data warehouse (also via MCP tunnel). Three Slack-collaborative + sandbox-isolated + tunnel-secured behaviors in one session.

Layer 3 — Claude Code primitives (Cat Wu + Boris Cherny)

Four surfaces, same SDK.

  1. CLI (power users / minimal text / max customization).
  2. IDE plugin (same agent + follow code changes as they happen).
  3. Claude Code Desktop (full-screen GUI, built-in previews, sidebar control plane, image + rich-output rendering, single view across both local and Cloud sessions with visual run/blocked/needs-input indicators).
  4. Cloud Agents View in the CLI (newest, for people who prefer to stay in the terminal — same run/wait/done indicators, inline reply-to-unblock, jump in and out of sessions without losing place).

All four built on the Claude Agent SDK — same one external builders use.

Five primitives shown:

  • Code review (product). Agent fleet traverses code + auxiliary files for critical bugs. Used internally by every Anthropic team + thousands of external companies.
  • Remote control (Claude Code on iOS + Android). Fire-and-forget mobile triggering.
  • AutoFix. Listens for code-review comments, CI failures, merge conflicts; writes fixes proactively. In CC’s own codebase: root-cause-fix flaky CI rather than retry.
  • Routines. Configure once; trigger by schedule, webhook, API. Local or remote-Cloud-Compute execution. Boris: “The default isn’t I’m going to prompt Claude Code. The default is I will have Claude prompt Claude Code.”
  • Claude Security. Overnight whole-codebase scan, severity ranking, spawn Claude Code sessions to fix each finding.

AcmePay refunds demo (live). Boris shipped a refunds feature on a fictional payments SaaS inside one Claude Code session. Adds idempotency for duplicate webhooks, multicurrency handling across regions, audit logging for compliance. Claude pulls the merchant dashboard in the browser, triggers a refund, watches a race condition where the modal closes before the success toast appears, traces it to an optimistic-update race condition, fixes it, re-verifies in the browser, then declares done. Then zooms out — the desktop app shows that session as one of many running/blocked/PR-merged sessions. A teammate filed a GitHub issue overnight; a routine watching the repo picked it up async and started a new working clock. AutoFix is babysitting the resulting PR through code review + flaky CI + merge conflicts. CI flicked on a network timeout — the routine woke up, diagnosed a known infra issue, retried the job (in Claude Code’s own repo it’d fix the root cause). Engineer never sees the red X.

Customer scale callouts

OrgDetail
Anthropic (internal)Claude Code wall-to-wall; 200% increase in PRs/engineer at constant quality bar through substantial org scaling
MercadoLibre23,000 engineers all on Claude Code; 500,000+ PRs reviewed w/ human oversight; 9,000+ apps modernized; targeting 90% autonomous coding by Q3 2026 (Oscar Mowen). VPs + managers shipping again
ShopifyEngineering + non-engineering (PMs, designers, data scientists) all on Claude Code; “the speed is just crazy” (Andrew McNamara, dir. applied AI)
SpotifyHonk + Fleet Shift merging 1,000+ PRs/month; migration time cut 90%+; latest Java migration 3 days instead of weeks. Full breakdown: Spotify deep dive
BindiFoster-care SaaS; Claude API gave caseworkers their hours back and took 20 days off foster-family licensing process (Felicia Coruru)
AMPSmart-mode migrated to Opus 4.7; simplified tooling
Rakuten3× more production-engineering tasks resolved with Opus 4.7 vs prior
IntuitOpus 4.7 self-corrects logical faults in planning phase
Eve LegalAdvisor strategy — 5× lower cost at frontier-model quality

Speakers

  • Boris Cherny — Head of Claude Code (intro + AcmePay demo + close)
  • Lisa ^[ambiguous] — Research PM (model layer; Lisa Nealon per Anthropic team page — full name unconfirmed against transcript)
  • Angela Jiang — Claude Platform Head of PM (platform layer)
  • Caitlyn Burke ^[ambiguous] (whisper transcribed as “Caitlyn” — likely Katelyn Lesse per the same name resolution in SF keynote Open Questions) — Claude Platform Head of Engineering
  • Cat Wu — Claude Code / Cowork Head of PM (CC primitives intro)

Try It

  1. Test self-hosted sandboxes on your own Vercel/Cloudflare/Modal/Daytona account. Concrete next step: deploy a hello-world Managed Agent that writes a file in your own sandbox, observe how the work-item queue spins up infrastructure in your account. The compliance / data-residency unlock is the headline; verify it on your stack.
  2. Wire an MCP tunnel for a sensitive internal MCP server. Pick the most-internal MCP server you currently want an agent to call but can’t expose publicly. Set up a gateway in your network → establish a tunnel.anthropic.com URL → reconfigure the agent to use the tunneled URL. The Counter demo’s data-warehouse + feature-flag pattern is the template.
  3. Watch a flaky-CI test get root-cause-fixed by AutoFix in your repo. Stand up a routine + AutoFix on a small repo that has a known flake. Configure AutoFix to fix the cause, not retry. Boris’s claim is empirical — replicate or refute it on your CI surface.
  4. Replicate the AcmePay refund demo on your stack. Pick a feature with idempotency + multicurrency + audit-log concerns. Run a single Claude Code session in Desktop or via Cloud Agents View. Time it. Compare to your typical “build a refund feature” timeline.
  5. Stage an eval that doesn’t pass today but might next quarter. Pick a workflow that fails on Opus 4.7. Instrument it so it alerts when Mythos (or the next frontier model) closes the gap. Lisa’s “build ambitious prototypes that don’t quite work today” advice rendered as an alerting policy.
  6. Try the Cloud Agents View on CLI. If you’ve been on the desktop app, try the terminal-native alternative. Both surface the same primitives — but the keyboard-driven flow is what Cat called out as new today.

Open Questions

  • Self-hosted sandbox pricing. Is the sandbox-hour bill paid to the customer’s provider (Vercel/Cloudflare/Daytona/Modal) directly, with no Anthropic surcharge — or is there a Managed Agents per-session orchestration fee on top? The transcript doesn’t address it. See Managed Agents cost-model section once updated.
  • MCP tunnel limits. Number of concurrent tunnels per org, throughput, region availability, latency vs direct public-internet MCP. Worth checking the Developer Console docs.
  • Mythos availability. Boris cited it as “Mythos read OpenBSD and found a 27-year-old vuln” without naming an availability date. Mythos Preview System Card gates access tightly — does London expand the cohort?
  • Multi-agent orchestration vs Outcomes vs Subagents — same boundary question as the SF keynote surfaced. Three overlapping primitives. When does coordinator-managed delegation win vs outcome-graded iteration vs subagent fan-out?
  • Cloud Agents View — keyboard-only mode parity with Desktop. Cat noted the new CLI view; does it cover the full Desktop feature surface (image rendering, rich outputs, preview iframe) or only the session-management slice?
  • MercadoLibre’s “90% autonomous coding by Q3” — what’s the unit? PRs merged without human review? PRs merged with light human review? Stories closed? The headline is shareable but the operational definition matters.