Source: How to get to production faster with Claude Managed Agents (YouTube E9gaQHrw_rg), Jess Ann (PM, Managed Agents) + Lance Martin (devx team), Code with Claude London 2026 (May 21 2026). Transcript via local Whisper fallback.
The primitive-by-primitive companion to the London keynote platform layer. Jess and Lance walk through the mental model (Agent / Environment / Session / Events), the four event categories (User / Agent / Session / Span), two live demos (Pascal — single-session JIT grocery analytics; Boss Agent — outcomes-driven AGI-Pilled CEO dashboard with multi-agent + outer-loop refinement), and the practical on-ramps (the /cloud-API skill built into Claude Code; CLI for YAML configs and programmatic session capture; the Cookbook; the Quick Start interactive walkthrough). Reinforces the keynote framing: the bottleneck is no longer intelligence — it’s infrastructure. Pre-launch research showed 1 in 3 developers struggled with context management, ~50% cited infra as their #1 production blocker, and a majority were running agents with no formal observability.
Key Takeaways
- The AI exponential changes what we delegate. Old delegation pattern: build a single component or debug a flaky test suite; minutes to an hour of focused work with heavy steering. New pattern: agents run overnight while you sleep, and you wake up to a closed Linear backlog. Near future: coordinated multi-agent teams running quarter-long workstreams (e.g., a full M&A pipeline end-to-end).
- As tasks evolve from prompts to hours-and-days of work, scaffolding alone is insufficient — you need a true agentic runtime. Jess + Lance position Managed Agents as exactly that runtime.
- Two production problems Managed Agents solves: reliability + security. Long-horizon agents have to stay reliable across hours / days / weeks and stay secure (credentials, access boundaries, human-in-the-loop checkpoints).
- New interaction modes for long-horizon agents. Outcome-oriented tasks (specify a rubric instead of a turn-by-turn script). Start + resume (“the most human-like interaction pattern — nothing more human than procrastination”).
- Three crisis points from pre-launch research. (1) ~1/3 of devs struggled with context management — when context helps vs distracts. (2) ~50% cited credential management / security / access / human-in-the-loop as their #1 production blocker. (3) Majority were running agents without formal observability — “agents running off probabilistic outputs with no way to know if they’re doing something good.”
- Managed Agents = harness + foundational building blocks + observability. Tool permissioning, tool execution, automatic context management, checkpointing, retries — bundled. Rich console observability layered on top.
- Mental model — the four objects. Agent (model + prompt + tools + skills configuration) → Environment (networking, packages, code-write surface) → Session (an execution; carries resources like git repos + an Outcome rubric) → Events (the observability stream, four categories below).
- Event topology — four categories. User events (steer, guide, interrupt, define exit criteria). Agent events (tools invoked, context compactions, delegations). Session events (lifecycle — running, idle, waiting). Span events (broader instrumentation that groups related events).
- Pascal demo (single-session). Hypothetical grocery store “Just-in-Time” — agent runs on the company’s dataset in a Python-preloaded container, produces (1) product analysis (“bananas are popular”), (2) shopper analysis (“Sunday morning peak”), (3) customer-reorder-probability predictive model. Events stream live to the Console; debug-agent post-mortem identifies bottlenecks → recommended fixes can be applied directly via Claude Code.
- The /cloud-API skill in Claude Code is the primary on-ramp. Typing /cloud-API opens a built-in skill that understands Managed Agents primitives end-to-end. Lance: “I don’t write a lot of Managed Agent code myself; I have Claude Code do it.” Includes session-log grabbing via the CLI.
- CLI is YAML-first. Define agents as YAML files that can be checked into source control. Sessions can be pulled programmatically for analysis (especially useful for agent-on-agent code workflows).
- Cookbook + Quick Start. Two QR codes on the closing slide — developer docs + a rich interactive walkthrough that builds an agent in minutes.
- Advanced primitives (shipped between SF + London). Multi-agent orchestration — Claude clones itself and delegates to pre-configured additional agents; decomposes complex tasks into smaller units. Outcomes — Claude iterates until pre-defined exit criteria are satisfied (you specify the goal; Claude keeps going). Memory (public beta) — Claude reads + writes to persistent stores instead of starting each session fresh. Dreaming — Claude reflects, codifies new learnings into new memories; agents literally improve between every single run.
- The inner loop / outer loop pattern. Inner loop: Outcomes + Managed Agents iterate against a rubric, produce an output. Outer loop: you look at the output, give feedback to Claude Code, Claude Code uses the CLI to pull the session log, reflect on the session, look at the rubric and agent instructions, update them, and kick off a new session. Two loops working together — automated iteration inside, human-in-the-loop refinement outside.
- Boss Agent demo: outcomes + multi-agent end-to-end. “What would the AGI-Pilled CEO have at his hardware disposal?” prompted by Angela. Input: a question. Output: an interactive visualization (artifact-style; Claude produces SVG, renders in a browser). One custom tool: render code to browser.
- Outcomes rubric drove auto-optimization on Boss Agent. Rubric: “produce timing + take a screenshot + do an analysis + send analysis back to main agent.” Sub-agent spins up at session-end, evaluates artifacts (page screenshot), feeds analysis to main agent. The session was instructed to use Outcomes specifically to make the dashboard rendering faster.
- Autonomous performance wins from the Outcomes loop. Boss Agent went from ~37 sec → ~10 sec for rendering — agent discovered (1) parallelize tool calls, (2) switch to fast mode, (3) prompt optimization, (4) use multi-agent for multi-chart inputs to save ~7 sec. All four discovered autonomously via the rubric.
- Multi-agent enables simultaneous artifact rendering. Boss Agent’s multi-chart dashboard renders three visualizations simultaneously via multi-agent.
- Asana + Notion partner cohort. Mentioned by Jess: “innovative agentic partners who are trying to use Claude to extend the capabilities of their platforms.” See Asana AI Teammates talk from the SF event.
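Two of the Boss Agent speedups listed above (parallel tool calls, simultaneous chart rendering) come down to running independent slow calls concurrently instead of one after another. The sketch below illustrates that effect with plain `asyncio`; it is not Managed Agents code. `render_chart` is a hypothetical stand-in for a tool call, and the sleeps are simulated latencies, not measurements from the talk.

```python
import asyncio
import time


async def render_chart(name: str, seconds: float) -> str:
    """Hypothetical stand-in for one slow tool call (e.g. rendering a chart)."""
    await asyncio.sleep(seconds)  # simulated latency, not a real render
    return f"{name}: rendered"


async def render_sequential(charts: dict[str, float]) -> list[str]:
    """Run the calls one after another; total time is the sum of latencies."""
    return [await render_chart(name, secs) for name, secs in charts.items()]


async def render_parallel(charts: dict[str, float]) -> list[str]:
    """Run the calls concurrently; total time is roughly the slowest call."""
    tasks = [render_chart(name, secs) for name, secs in charts.items()]
    return list(await asyncio.gather(*tasks))


def timed(coro) -> tuple[list[str], float]:
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Three charts, mirroring the three-visualization dashboard in the demo.
    charts = {"revenue": 0.2, "headcount": 0.2, "compute": 0.2}
    _, seq = timed(render_sequential(charts))
    _, par = timed(render_parallel(charts))
    print(f"sequential {seq:.2f}s, parallel {par:.2f}s")
```

`asyncio.gather` returns results in the order the tasks were passed, so the parallel version produces the same list as the sequential one, only sooner.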
Mental model — the four objects
Agent     Environment    Session            Events
─────     ───────────    ───────            ──────
model     networking     resources          User events
prompt    packages       (repos, etc.)      Agent events
tools     code-write     Outcome rubric     Session events
skills    (sandbox)      ┌─ runs            Span events
                         └─ emits events
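One way to hold the four objects in your head is as plain data types. The sketch below is a local model of the diagram, assuming nothing about the real Managed Agents SDK: every class, field, and method name here is hypothetical and chosen only to mirror the columns above.

```python
from dataclasses import dataclass, field
from enum import Enum


class EventCategory(Enum):
    USER = "user"        # steer, guide, interrupt, define exit criteria
    AGENT = "agent"      # tool calls, context compactions, delegations
    SESSION = "session"  # lifecycle: running, idle, waiting
    SPAN = "span"        # instrumentation that groups related events


@dataclass
class Agent:
    model: str
    prompt: str
    tools: list[str] = field(default_factory=list)
    skills: list[str] = field(default_factory=list)


@dataclass
class Environment:
    networking: str
    packages: list[str] = field(default_factory=list)


@dataclass
class Event:
    category: EventCategory
    detail: str


@dataclass
class Session:
    """One execution: an agent running in an environment, emitting events."""
    agent: Agent
    environment: Environment
    resources: list[str] = field(default_factory=list)
    outcome_rubric: str = ""
    events: list[Event] = field(default_factory=list)

    def emit(self, category: EventCategory, detail: str) -> Event:
        """Append an event to this session's stream and return it."""
        event = Event(category, detail)
        self.events.append(event)
        return event

    def by_category(self, category: EventCategory) -> list[Event]:
        """Filter the stream down to one of the four categories."""
        return [e for e in self.events if e.category is category]


if __name__ == "__main__":
    session = Session(
        agent=Agent(model="placeholder-model", prompt="Analyze the dataset"),
        environment=Environment(networking="restricted", packages=["pandas"]),
        outcome_rubric="three analyses delivered",
    )
    session.emit(EventCategory.SESSION, "running")
    session.emit(EventCategory.AGENT, "tool call: run_python")
    print([e.detail for e in session.by_category(EventCategory.AGENT)])
```

The point of the shape is that Agent and Environment are reusable configuration, while Session is the thing that actually runs and owns the event stream.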
The inner loop / outer loop
Inner loop (runs inside Managed Agents):
    Outcome rubric
         ↓
    Agent iterates → produces artifact → grader checks against rubric
         ↑                                       │
         └────────── (rubric not met) ───────────┘
                                                 ↓ (met)
                                         Session completes
Outer loop (you + Claude Code modulate the rubric):
You see the output
↓
"I don't like this" → Claude Code
↓
CLI: pull session log
↓
Reflect on session
↓
Look at rubric + agent instructions
↓
Update them
↓
Kick off new session
These two loops compose: outcomes-driven autonomy inside, human-judgment-driven refinement outside.
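The two loops can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not how Managed Agents implements Outcomes: the rubric is a dict of hypothetical named checks, the "agent" is a closure that simply adds whatever the grader flagged, and the outer loop is simulated by a human adding one criterion and starting a fresh run.

```python
from typing import Callable

# Hypothetical rubric shape: criterion name -> check run against the artifact.
Rubric = dict[str, Callable[[str], bool]]


def unmet(rubric: Rubric, artifact: str) -> list[str]:
    """Grader: return the names of criteria the artifact does not satisfy."""
    return [name for name, check in rubric.items() if not check(artifact)]


def inner_loop(
    produce: Callable[[list[str]], str], rubric: Rubric, max_iters: int = 5
) -> tuple[str, int, bool]:
    """Iterate until the rubric is met or the iteration budget runs out."""
    feedback: list[str] = []
    artifact = ""
    for iteration in range(1, max_iters + 1):
        artifact = produce(feedback)
        feedback = unmet(rubric, artifact)
        if not feedback:
            return artifact, iteration, True
    return artifact, max_iters, False


def make_toy_agent() -> Callable[[list[str]], str]:
    """Toy agent: starts with a title and adds every section the grader flags."""
    sections = ["title"]

    def produce(feedback: list[str]) -> str:
        sections.extend(feedback)
        return " | ".join(sections)

    return produce


if __name__ == "__main__":
    rubric: Rubric = {
        "chart": lambda a: "chart" in a,
        "summary": lambda a: "summary" in a,
    }
    print(inner_loop(make_toy_agent(), rubric))

    # Outer loop: a human dislikes the output, tightens the rubric, re-runs.
    rubric["legend"] = lambda a: "legend" in a
    print(inner_loop(make_toy_agent(), rubric))
```

The `max_iters` cap is a design choice of this sketch: without some budget, an unsatisfiable rubric would loop forever.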
On-ramps
- /cloud-API skill in Claude Code. Built into CC globally. /cloud-API opens the skill: Claude Code understands Managed Agents primitives, can write Managed Agents code on your behalf, and can grab session logs for analysis.
- CLI. YAML-first agent configs. Programmatic session pulls, especially useful for agent-on-agent code workflows.
- Cookbook. Hand-curated patterns including multi-agent + outcomes recipes.
- Interactive Quick Start. Rich walkthrough that builds an agent in minutes — link via the QR code on the closing keynote slide.
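Once a session log has been pulled, the bottleneck hunt described for the debug agent can start as a simple aggregation. The sketch below assumes the log is available as JSON Lines with `kind`, `tool`, and `duration_ms` fields; that schema is invented for illustration and is not the documented export format, so adjust the field names to whatever your pulled log actually contains.

```python
import json
from collections import defaultdict

# Hypothetical log excerpt; the field names are assumptions, not a real schema.
SAMPLE_LOG = """\
{"kind": "tool_call", "tool": "run_python", "duration_ms": 4200}
{"kind": "lifecycle", "status": "running"}
{"kind": "tool_call", "tool": "read_file", "duration_ms": 300}
{"kind": "tool_call", "tool": "run_python", "duration_ms": 6100}
"""


def tool_time_ms(log_text: str) -> dict[str, int]:
    """Sum the recorded duration per tool across all events that name a tool."""
    totals: dict[str, int] = defaultdict(int)
    for line in log_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines in the export
        event = json.loads(line)
        if "tool" in event:
            totals[event["tool"]] += int(event.get("duration_ms", 0))
    return dict(totals)


def slowest_first(totals: dict[str, int]) -> list[tuple[str, int]]:
    """Rank tools by total time so the biggest bottleneck comes first."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for tool, ms in slowest_first(tool_time_ms(SAMPLE_LOG)):
        print(f"{tool}: {ms} ms")
```

A ranking like this is a reasonable first input to hand back to Claude Code when asking it to reflect on a session.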
Try It
- Run the Pascal pattern on your own data. Pick a single CSV / database / S3-bucket source. Stand up a single-session Managed Agent with a Python-loaded environment. Have it produce three outputs (e.g., distribution analysis, segmentation, predictive model). Watch the event stream in Console. Run the debug-agent on the session afterwards.
- Try outcomes on a performance optimization. Pick a workflow with a clear timing target. Configure an Outcomes rubric (“produce X + verify Y < N seconds”). Let the agent autonomously optimize. Compare the agent’s chosen optimizations to a human first-pass.
- Test the inner-loop / outer-loop pattern. Build a small dashboard or report. Let outcomes iterate inside. Look at the result. Use Claude Code to pull the session log via CLI, reflect, update the rubric or system prompt. Re-run. Measure: how many outer-loop iterations to get a satisfactory dashboard.
- Replace a turn-by-turn scaffold with an Outcomes spec. Pick the longest, most-step-by-step Claude prompt you’ve written. Replace the steps with a rubric. Run it. Measure: token cost, latency, quality vs the original.
- Wire /cloud-API into your CC sessions. Just open the skill once. Lance’s claim is that it’s a much faster path than writing Managed Agents code by hand.
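Several of the exercises above end in "measure", so it helps to decide up front how two runs will be compared. The sketch below is one minimal way to do that; `RunMetrics`, its fields, and the quality score are hypothetical bookkeeping you would fill in yourself, and the numbers in the example are placeholders rather than results.

```python
from dataclasses import dataclass


@dataclass
class RunMetrics:
    label: str
    tokens: int
    latency_s: float
    quality: float  # your own 0-1 score, e.g. from a manual review


def relative_change(baseline: float, candidate: float) -> float:
    """Signed fractional change from baseline to candidate (negative = lower)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline


def meets_target(latency_s: float, target_s: float) -> bool:
    """Timing criterion of the form 'verify Y < N seconds'."""
    return latency_s < target_s


def compare(baseline: RunMetrics, candidate: RunMetrics) -> dict[str, float]:
    """Relative change per metric, candidate versus baseline."""
    return {
        "tokens": relative_change(baseline.tokens, candidate.tokens),
        "latency_s": relative_change(baseline.latency_s, candidate.latency_s),
        "quality": relative_change(baseline.quality, candidate.quality),
    }


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    scaffold = RunMetrics("turn-by-turn scaffold", tokens=1000, latency_s=40.0, quality=0.8)
    outcomes = RunMetrics("outcomes rubric", tokens=800, latency_s=30.0, quality=0.8)
    for metric, change in compare(scaffold, outcomes).items():
        print(f"{metric}: {change:+.0%}")
```

Applied to the rendering times quoted for Boss Agent, `relative_change(37, 10)` is about -0.73, i.e. roughly a 73% reduction.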
Open Questions
- /cloud-API skill: depth of its platform knowledge. The talk frames the skill as understanding the platform “extremely well” but doesn’t show its prompts/skills folder. Worth diffing the skill’s manifest against the public Managed Agents docs.
- Multi-agent fan-out semantics: when does outcomes route to multi-agent vs sequential delegation? Boss Agent multi-agent kicked in for multi-chart inputs specifically. The agent learned this autonomously from the rubric. What’s the trigger heuristic? Is it model-internal or platform-mediated?
- Memory + Dreaming + Outer-loop refinement overlap. All three are “the agent gets better over time” patterns. Memory = persistent stores. Dreaming = self-reflection writes to memory. Outer-loop = human reviews + updates rubric. Where do they compose vs replace one another? A connection article candidate.
- Outcomes rubric expressiveness. Markdown rubrics (“produce timing + take screenshot + analyze”) look simple. What’s the limit? Can rubrics include scoring weights? Probabilistic acceptance (“≥80% on this dimension”)? Pull from external sources at grading time?
- Span events. Four event categories named; user / agent / session are intuitive; span “groups related events together” — but the grouping semantics aren’t shown. Are spans nestable? Tied to OpenTelemetry conventions?
- Boss Agent music. Lance: “Claude made the music.” Was that audio generated as part of the Managed Agent session itself, or pre-generated? The transcript doesn’t say.
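To make the rubric-expressiveness question concrete, here is what "scoring weights" plus a per-dimension acceptance threshold could look like if you graded locally. This is purely an illustration of the question: nothing in the talk says Outcomes rubrics support weights or thresholds, and `Dimension`, `weighted_score`, and `accepted` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Dimension:
    name: str
    weight: float     # relative importance in the overall score
    threshold: float  # minimum acceptable score on this dimension, 0-1


def weighted_score(dimensions: list[Dimension], scores: dict[str, float]) -> float:
    """Weight-normalized average of the per-dimension scores."""
    total_weight = sum(d.weight for d in dimensions)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(d.weight * scores[d.name] for d in dimensions) / total_weight


def accepted(
    dimensions: list[Dimension], scores: dict[str, float], overall_min: float
) -> bool:
    """Accept only if every dimension clears its own bar and the overall bar."""
    per_dimension_ok = all(scores[d.name] >= d.threshold for d in dimensions)
    return per_dimension_ok and weighted_score(dimensions, scores) >= overall_min


if __name__ == "__main__":
    rubric = [
        Dimension("timing", weight=2.0, threshold=0.8),
        Dimension("analysis", weight=1.0, threshold=0.5),
    ]
    scores = {"timing": 0.9, "analysis": 0.6}
    print(accepted(rubric, scores, overall_min=0.75))
```

Requiring both checks means a strong overall average cannot hide a failure on a single must-have dimension.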
Related
- Code with Claude London 2026 Keynote
- Claude Managed Agents — base entity page
- Self-Hosted Sandboxes + MCP Tunnels — same-day launch
- Claude Dreaming — the “agents improve between runs” primitive Jess introduced
- Cookbook — Multi-Agent + Outcomes
- Asana AI Teammates on Managed Agents (SF event)
- Code with Claude 2026 SF Keynote
- Spotify — Honk + Fleet Shift