Source: raw/Inside_Anthropic_s_Bet_on_Claude_Agents_that_Work_While_You_Sleep_Jess_Yan.md Creator: Jess Yan (Anthropic) | URL: https://www.youtube.com/watch?v=Xu5gz2qsaz8 | Platform: YouTube
Jess Yan, product lead for Claude Managed Agents at Anthropic, talks with Peter Yang about how agents have shifted from simple prompting loops to autonomous, self-discovering, long-running actors — and what that unlocks when you set them tasks overnight. The conversation covers the model/harness pairing, a live demo of a data-analyst agent, how Anthropic uses agents internally, eval and prompting practice, enterprise rollout advice, and where Jess thinks the space is heading. Interviewer captions normalized: “Cloud Code” → Claude Code, “Cloud Managed Agents” → Claude Managed Agents, “platform.cloud.com” → platform.claude.com.
Key Takeaways
- The core thesis: agents have “evolved from agents being prompting loops to agents being autonomous, self-discovering and long-running actors” with access to third-party systems and a need for permissioning, observability, and steering (not just question → answer).
- Work while you sleep is already partly real: per Jess, “we set them tasks overnight and then we wake up and backlog is resolved and bugs are squashed.” She frames the future limiter as delegation, not headcount: “the limits of what we can achieve will really be based off of how much we can delegate at once more so than what our personal capacities are.”
- “10,000 times easier”: Jess’s framing for how much internal agents have changed her PM work — from filling out customer-RFP security checklists to diagnosing field problems and interrogating the codebase, “all of that is 10,000 times easier because of all the agents that we have internally.”
- Model and harness are paired. The underlying components are still the model, the system prompt / behavioral instructions, and the harness driving the loop — but rising task sophistication forces harness sophistication. Anthropic always tests models in conjunction with the harnesses it ships (e.g., Claude Code, Claude Cowork), because “every single model distribution now is through a harness.”
- Harness defined: the scaffolding around the model that lets it run tools, call memory, and know when to ask for human-in-the-loop input vs. keep executing — “what elevates us from the random sampling of tokens in and tokens out to actual actionable products.”
- Claude Managed Agents is a pre-built harness plus companion infrastructure for running complex tasks at scale. The motivating principle: return on effort for building an agent should be “extremely extremely high” — easy-to-stack primitives, flexible developer APIs, out-of-the-box infra — so low-effort setup can still delegate work that “might have taken you days, months, weeks.”
- vs. Claude Code: Managed Agents are “really long-running cloud-hosted sessions.” Claude Code is bound by your laptop’s constraints and whether it’s on; Managed Agents push that to the cloud, increasing both capacity and longevity. You only need an API key; Jess says there’s heavy individual (not just enterprise) usage.
From Prompting Loops to Self-Recovering Agents
- A raw prompting loop is “highly synchronous” — each step depends on the prior request completing successfully. Fine for haiku-writing, “increasingly unscalable” for large delegated tasks: if the first message drops or drifts, your ability to pivot gracefully is low.
- The shift is toward self-running, self-recovering agent loops that recover from errors, re-steer after going off course, and keep you informed as they do.
- Self-recovery in practice: hit an error, debug it, run searches, figure it out — and, more basically, recognize when an output doesn’t match what “good” looks like and revise course. Hard to wire by hand in a raw loop.
The Demo: A Reusable Data-Analyst Agent
- Built in the Claude console from: model selection (the intelligence layer), a system prompt (behavior, guardrails, high-level task awareness), a built-in tool set shipped with every Managed Agent, plus file-system access. Skills are optionally grantable (none granted here).
- Permissions were set to “always allow” per tool, but each can be configured as “ask” to keep a human in the loop.
- Task: analyze a fictitious grocery store (“just in time”). The initial prompt supplies the data schema up front (front-loading exploration) and breaks work into discrete steps so outputs are predictable — “these agents are randomized actors,” so be prescriptive when you want predictable output.
- Split of responsibilities: the system prompt holds general performance optimizations (so the agent is reusable across datasets); the initial prompt holds the specific schema discovery and the step-by-step task description.
- Output: three browser-renderable HTML artifacts — (1) product analysis (common shopping-cart order patterns), (2) shopper analysis with heat maps and radar charts, and (3) Jess’s “personal favorite,” a predictive model that, from customer and product attributes, predicts whether a customer will return — “able to produce this really rich level of insight in just minutes,” using only simple prompting plus Python packages in the agent’s environment.
- External data via MCP: the standardized way to connect an agent to a third-party database or system, with an authentication layer in front for safe access to internal services.
- A debug agent in the Claude console analyzes the full session history of another agent, surfacing where the agent could be improved.
Evals and Outcome-Based Prompting
- Evals are “the toughest part about building agents today.” Anthropic uses a mix: binary pass/fail, scoring (LLM-as-judge / letter-grading), and triggering evals (confirm a given action actually fires — e.g., verifying skills trigger at the right time, since progressive disclosure is the point of skills).
- More advanced patterns: replays of multi-step interactions and A/B testing versions by replaying the same user-interaction string and comparing responses.
- Built-in eval loop: Managed Agents can have the agent grade its own outputs inside the session, ideally in separate context windows to avoid bias — Jess: “when you can have agents evaluating their own work… you’re always going to get a better output.”
- Outcome as the new structured output: instead of rigidly specifying output format and gluing JSON blobs into something rich, you skip ahead to “let’s build this rich and interactive thing” and supply a “tastemaker’s assessment of what good would look like,” letting the autonomous agent self-correct toward it. Examples: slide/content generation and visual/editorial artifacts; or telling a predictive-model agent to hit a target accuracy benchmark (e.g., a 90% score) and letting it iterate until it gets there.^[inferred: the 90% figure was given as an illustrative target, not a reported benchmark result]
How Anthropic Uses Agents Internally
- Codebase access is the biggest unlock for Jess: managing state by tracking PRs directly (merged / deployed) instead of poking engineers, and understanding the product more deeply by prototyping agents or interrogating the codebase.
- Scheduled runs summarize activity (e.g., a Monday digest), with ad-hoc deep dives driven by upcoming pitches and customer conversations.
- Agents monitor Slack channels she shares with customers, summarizing activity she can’t follow live.
- Always-on, proactive coworker model: you should be able to tag agents anywhere, but they should also “proactively surface things… in the way that a co-worker truly would.” Two power levers: the level of data access, and a humanlike, proactive (not just reactive) interaction style. Proactivity is driven by triggered events and cron jobs plus continuously refreshing data so the agent is “as up-to-date as you are.”
- Talks to Claude more than coworkers: spends “therapy time” with Claude on thorny concepts, which uplevels human conversations because she arrives with a real opinion and baseline research. Anthropic even uses an API-review Claude as a neutral judge to break design impasses — “agent to agent communication.”
- Throwaway agents: spinning one up takes ~half an hour; Jess keeps “a different one going every couple of weeks.” Example — a 4,000-organization waitlist for advanced features, full of invalid/duplicate entries, only relevant for a few weeks: she built an agent (wired to internal systems and databases) to parse out invalid entries, score who’s most likely to convert and give high-quality feedback, and pick who to pull off the waitlist daily. “No point in building something super shiny for it.”
- Personal use (anecdotal): a new-parent friend has her child’s hourly schedule (feedings, tummy time) managed by Claude, plus fridge monitors and grocery-management agents.
Rollout Best Practices
- Start with the individual, not the mega-process. Enterprises jump to automating “crazy 20-team workflows” with multi-month, cross-cutting coordination. More immediate value comes from making each individual “exponentially more powerful” — fewer dependency requests, more work they can ship in isolation, “a bunch of one-person startups inside a large company.” Raise individual creative ceilings first, then tackle multi-team processes.
- Templates over blank pages: give people templates plus spotlight examples from people who know what they’re doing, then let them iterate freely.
- Ship to users fast; vibe-test before eval. Get the agent into beta users’ hands quickly — “the vibe testing is honestly the most important first step.” You outgrow vibe testing once you can’t aggregate vibe signals at scale; that’s when you build formal evals.
- Resist over-engineering: Jess’s bias is to build one agent, see if it works and gets used, before reaching for orchestrators and elaborate structure.
Where It’s Going
- Verticalization of agents: as models get smarter, broad domain expertise becomes a given, so value shifts to “incredibly specific and niche use cases” (e.g., an accounting agent for solopreneurs rather than a general accounting agent). The shared, reusable layer becomes context patterns and task-orchestration patterns rather than a canonical “how to build a finance/healthcare agent.”
- Survivors meet users where their workflows are: hyper-specific to the task yet adaptable, and present exactly where you need them — increasingly inside Claude Code and chat. Jess invokes the “everything is chat now” view (citing Vercel’s chat SDK), with the chat connected to your personal context/contacts.
- Open question she flags herself: if anyone can build a hyper-specific vertical agent, what makes such a product durable? Jess admits she doesn’t have a strong answer beyond “meet users where their workflows are.”
Try It
- Spin up a Managed Agent for a throwaway task. With an Anthropic API key, build a single-purpose agent (Jess targets ~30 minutes): pick a model, write a system prompt with general guardrails, grant the built-in tool set + file access, and set tool permissions to “ask” while you build trust. Use it for something time-boxed (clean a messy list, triage a backlog) rather than a grand workflow.
- Split system prompt vs. initial prompt the way the demo did: put reusable performance guidance in the system prompt; put the dataset schema and the step-by-step task in the initial prompt for predictable output.
- Add a self-eval step. Have the agent grade its own output against a “what good looks like” description — ideally in a separate context window — and iterate until it passes, before you review.
- Decide cloud vs. local: use Claude Managed Agents (cloud-hosted, long-running) when a task must outlive your laptop session; keep short, interactive work in Claude Code.
- Learn more: use the Claude Code skill to explore Claude Managed Agents, and read the docs at platform.claude.com.
Open Questions
- No pricing for Claude Managed Agents is given beyond “you need an API key”; cost structure and limits are not stated in the transcript.
- The “10,000 times” and overnight backlog/bug-squashing claims are the speaker’s qualitative framing, not benchmarked figures.
- Jess gives a Twitter/X handle verbally, but the auto-captioned rendering is garbled, so it is omitted here rather than guessed.
- Durability of hyper-specific vertical agents as products is left explicitly unresolved by the speaker.