Stop Babysitting Your Agents — Three-Layer Autonomy Stack (Sid Bidasaria, Anthropic Founding Engineer)

Source: raw/Stop_babysitting_your_agents.md. Sid Bidasaria (founding engineer, Claude Code), Code with Claude conference talk, youtube.com/watch?v=wI0ptqCSL0I, fetched 2026-05-20.

A Code with Claude 2026 talk explicitly framed as “a Claude Code 301 type university class” — for users who already have CLAUDE.md, connected tools, and Claude Code on the web set up, and want to push further into autonomy. The 30-minute talk lays out a three-layer stack: verification loops (Claude checks its own work) → multi-clauding (run several in parallel) → background loops (Claude works while you’re away from the keyboard). Companion to the keynote, Dixon’s What’s New talk, and Lucas’s Expanding Toolkit.

Key Takeaways

The prerequisite stack (Sid’s three table-stakes) for getting value from this talk:
1. High-quality CLAUDE.md: “the single highest-leverage thing you can do to improve your Claude Code experience.”
2. Connected tools — Slack, Asana, Linear, Datadog, BigQuery. “A good rule of thumb is if a tool is useful for you in your day-to-day life, it will also be useful for Claude. It’s able to perform much better if you give it access to these tools.”
3. Claude Code on the web, set up: this decouples compute from your laptop. Close your laptop or spill coffee on it, and your Claude Code sessions keep running in the cloud. In an audience show of hands, roughly 50% had #1 and #2; far fewer had all three.
Why tooling needs to change. Most existing developer tooling (linters, IDEs, formatters such as Prettier, type-checkers, compilers) was built with humans in mind. But per Sid, humans aren’t writing most code anymore; agents are. The good news: most of these tools translate well to agent use. The bad news: there are blind spots, assumptions humans make about their toolchain that Claude doesn’t. The talk’s framing question: “What does an agent need from your codebase that a human takes for granted?”
Verification = the entire ratchet. Sid’s brainstorm: think about the last feature you worked on. How did you check your work? The human verification playbook = the same playbook Claude can use:
1. Design and write code
2. Build it / run compilers / type-checkers (loop on failures)
3. Run executable (Docker / CLI / web server)
4. Check side effects (browser UI, logs, database state)
5. Run unit tests (regression check + add new test for the work)
6. Deploy to staging (or “if you’re really brave, straight to prod”)
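The playbook above is essentially an ordered list of checks where each step gates the next. A minimal sketch of that idea, assuming each check can be expressed as a shell command that exits non-zero on failure; the step names and placeholder commands are illustrative and not from the talk:

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class StepResult:
    name: str
    ok: bool
    output: str


def run_checks(steps: list[tuple[str, list[str]]]) -> list[StepResult]:
    """Run each (name, command) in order and stop at the first failure."""
    results = []
    for name, command in steps:
        proc = subprocess.run(command, capture_output=True, text=True)
        ok = proc.returncode == 0
        results.append(StepResult(name, ok, (proc.stdout + proc.stderr).strip()))
        if not ok:
            break  # later steps depend on earlier ones, so fail fast
    return results


# Placeholder commands (assumptions): swap in your real build, test, and deploy commands.
PY = sys.executable
demo_steps = [
    ("build", [PY, "-c", "print('build ok')"]),
    ("unit tests", [PY, "-c", "import sys; print('1 test failed'); sys.exit(1)"]),
    ("deploy to staging", [PY, "-c", "print('deployed')"]),
]

if __name__ == "__main__":
    for result in run_checks(demo_steps):
        print(f"{'PASS' if result.ok else 'FAIL'} {result.name}: {result.output}")
```

The captured output of the failing step is the useful part: it is what a human would read before going back to step 1, and it is what an agent needs to see to do the same.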
The core abstraction: a verification loop. Sid calls this “arguably the most important slide in this presentation.” A loop is an autonomous circuit Claude can complete on a given task, hill-climbing on a success criterion. Give Claude tools both to write code and to verify its own work. Claude writes some code → checks whether it works → debugs → writes more → repeats until it succeeds. The PR Claude finally sends you is higher quality and actually works.
Worked example: the signup button. Sid’s personal website had a broken signup button. He told Claude “make the signup button work.” Claude wrote code → built the app → opened a browser → clicked the button → saw nothing happened → read logs → found the bug → fixed the code → reloaded → repeated until working PR.
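The write, check, debug, repeat cycle can be sketched as a small control loop. This is an illustration of the pattern only, not how Claude Code is implemented; `verification_loop`, `toy_attempt`, and `toy_verify` are hypothetical names, and the toy functions stand in for "write code" and "click the button in a browser":

```python
from typing import Callable, Optional


def verification_loop(
    attempt: Callable[[Optional[str]], str],
    verify: Callable[[str], Optional[str]],
    max_rounds: int = 5,
) -> tuple[str, int]:
    """Call attempt, check it with verify, and feed failures back until it passes.

    attempt(feedback) returns a candidate; verify(candidate) returns None on
    success or a failure message that is passed to the next attempt.
    """
    feedback = None
    for round_number in range(1, max_rounds + 1):
        candidate = attempt(feedback)
        feedback = verify(candidate)
        if feedback is None:
            return candidate, round_number
    raise RuntimeError(f"still failing after {max_rounds} rounds: {feedback}")


def toy_attempt(feedback: Optional[str]) -> str:
    # First draft has no click handler; the second draft reacts to the feedback.
    if feedback is None:
        return "<button>Sign up</button>"
    return '<button onclick="signup()">Sign up</button>'


def toy_verify(candidate: str) -> Optional[str]:
    # Stand-in for driving a real browser and observing the result.
    return None if "onclick" in candidate else "clicked the button, nothing happened"


if __name__ == "__main__":
    fixed, rounds = verification_loop(toy_attempt, toy_verify)
    print(f"passed after {rounds} round(s): {fixed}")
```

The `max_rounds` cap is a design choice added here: a loop that hill-climbs on a success criterion needs some stopping rule for the case where the criterion is never met.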
Verification has multiple flavors that all merge into one. UX (browser-driven), backend (API), end-to-end including infra. Once you give Claude the tools and instructions for any flavor, the loop pattern is universal — you don’t have to be very specific about the kind of verification.
Four concrete instructions for a UX verification loop.
1. Run your application — npm run start or your dev-server command spins up local environment.
2. Use the web server via Claude in Chrome MCP (Sid’s preferred tool). Activate with /chrome in Claude Code. Alternative: Playwright or other browser-control MCPs.
3. Prove something works — take screenshots before and after the fix, confirm the right state.
4. Unblock yourself — handle auth (Claude needs an identity to log in) and state (pre-configure databases / inventory so the app is usable). Both are standard end-to-end-test patterns — the only difference for Claude is making the state-setup scripts dynamic, not too prescriptive.
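Between step 1 (start the app) and step 2 (drive it in a browser) there is an implicit wait for the dev server to come up. A minimal sketch of a readiness probe using only the standard library; the helper name, timeouts, and the throwaway demo server are assumptions for illustration:

```python
import time
import urllib.request


def wait_until_ready(url: str, timeout_s: float = 30.0, interval_s: float = 0.5) -> bool:
    """Poll url until a GET succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval_s + 1):
                return True
        except OSError:
            # Covers connection refused, timeouts, and HTTP error statuses
            # (urllib's URLError and HTTPError both derive from OSError).
            pass
        time.sleep(interval_s)
    return False


if __name__ == "__main__":
    # Demo against a throwaway stdlib server instead of a real dev server.
    import http.server
    import threading

    server = http.server.HTTPServer(("127.0.0.1", 0), http.server.SimpleHTTPRequestHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_address[1]}/"
    print("ready:", wait_until_ready(url, timeout_s=5))
    server.shutdown()
    server.server_close()
```

In a real recipe you would point this at whatever address your dev server prints on startup, then hand control to the browser tool.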
Skills are the packaging format for verification loops. A skill is “just a way to store some arbitrary context about a specific topic.” For verification, that topic is the verification recipe itself. Self-improving twist: put an instruction in the skill “improve the skill every time Claude hits a blocker” — and you get a self-documenting, self-improving skill the whole team contributes to. This is what the Claude Code team itself does internally — one single verification skill, explicitly told to keep documenting itself, edited whenever someone hits a blocker.
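In the talk, the self-improvement happens because Claude edits the skill file itself when instructed to. The sketch below only illustrates the bookkeeping that instruction implies: append a note per blocker and skip exact duplicates so the file does not bloat. The function name, the "Known blockers" heading, and the entry format are all hypothetical:

```python
from pathlib import Path


def record_blocker(skill_path: Path, blocker: str, fix: str) -> bool:
    """Append a blocker/fix note to the skill file unless it is already there.

    Returns True if the file changed. Assumes the blockers section is the
    last section in the file, so new entries are appended at the end.
    """
    entry = f"- Blocker: {blocker.strip()} | Fix: {fix.strip()}"
    text = skill_path.read_text(encoding="utf-8") if skill_path.exists() else ""
    if entry in text.splitlines():
        return False  # exact duplicate, nothing to add
    heading = "## Known blockers"
    if heading not in text.splitlines():
        separator = "\n\n" if text.strip() else ""
        text = text.rstrip("\n") + separator + heading + "\n"
    skill_path.write_text(text.rstrip("\n") + "\n" + entry + "\n", encoding="utf-8")
    return True


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        skill = Path(tmp) / "SKILL.md"
        record_blocker(skill, "port already in use", "stop the stale dev server first")
        print(skill.read_text(encoding="utf-8"))
```

Exact-match deduplication is the simplest possible guard; it does not catch reworded or contradictory notes, which still need a human or a periodic cleanup pass.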
Live demo: MonkeyType verification loop. Sid uses MonkeyType (an open-source TypeScript typing test with an Express + MongoDB + Redis backend) as a representative full-stack app. He tells Claude to spin up the dev server and enable the Chrome MCP via /chrome, then prompts: “use the Chrome MCP to make sure that the front end is working,” “can you try typing and make sure everything works,” and “can you also use the settings and change something.” Claude drives the browser via the Chrome MCP, navigates to localhost, types, changes settings, and verifies persistence. Then Sid says “take everything we learned and put it into a skill file in .claude/demo/verification.” Claude creates a fairly large skill.md with three parts: (1) bring up the stack, (2) load Chrome MCP tools, (3) smoke test via browser tools.
Then the test of the verification loop: Sid asks Claude “every time I mistype, please show me a confetti animation, and use the skill we just created to verify your work.” Claude writes the feature, hits 2 lint errors, fixes them itself, re-verifies, hits a good state. The verification skill executed end-to-end without Sid in the loop.
Multi-clauding (parallel sessions) is gated by verification reliability. Once Claude can self-verify, you can run many in parallel and trust them. Sid’s personal cap: “more than 4-5 sessions open simultaneously takes a big load on my brain and I can’t really function beyond that.” Attention is the scarce resource.
Four ways to multi-claude (mid-2026 stack):
1. Claude Code Desktop App — sidebar with all sessions across all surfaces (terminal, cloud, multiple git repos). Pin, rename, color sessions. The central control plane for managing attention.
2. Claude Agents (terminal-native) — Sid himself prefers staying in terminal. claude agents opens a view like the desktop sidebar. Released “I think a week ago.” Sorts sessions by attention required: blocked-on-permission-prompt at the top, completed lower down. Pin / rename / reorder.
3. Claude Code on the web — cloud-running sessions, decoupled from laptop. claude.ai/code to get started.
4. Remote Control (Sid’s favorite feature): control any session running on any surface from your phone. Running /remote-control from a Claude Code session enables this. The phone app sends notifications when Claude needs input. “You could be in your car. You could be doing whatever you want. And you could just give Claude the input that it needs.”
Sid’s prior multi-claude setup was tmux + git worktrees: “works honestly, but it’s a lot to manage.” Claude Agents is the convenience replacement; Sid still uses both but recommends Claude Agents for newcomers.
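The attention-based ordering described for the agents view (blocked on a permission prompt at the top, completed lower down) is a plain sort on a priority key. A sketch under stated assumptions: the `Session` type, the three statuses, and the pinned-first rule are hypothetical and not taken from the actual tool:

```python
from dataclasses import dataclass
from enum import IntEnum


class Status(IntEnum):
    # Lower value = needs your attention sooner.
    BLOCKED_ON_PERMISSION = 0
    RUNNING = 1
    COMPLETED = 2


@dataclass
class Session:
    name: str
    status: Status
    pinned: bool = False


def by_attention(sessions: list[Session]) -> list[Session]:
    """Order sessions: pinned first, then by status priority, then by name."""
    return sorted(sessions, key=lambda s: (not s.pinned, s.status, s.name))


if __name__ == "__main__":
    demo = [
        Session("docs", Status.COMPLETED),
        Session("api", Status.RUNNING),
        Session("auth", Status.BLOCKED_ON_PERMISSION),
    ]
    for session in by_attention(demo):
        print(session.status.name, session.name)
```

The point of the ordering is that the top of the list always answers "where is my input needed next?", which is what makes four or five parallel sessions manageable.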
Background loops = take your keyboard out of the hot path entirely. Sid frames this as the third progression past multi-clauding. Even with multi-clauding you still have to spin up sessions with a goal. Background loops remove that step too.
Software engineering tasks that don’t need you in the loop:
- Babysitting PRs (more PRs now thanks to Claude → reviewing comments, merge conflicts, CI failures, all the bookkeeping)
- Updating docs (velocity outpaces doc maintenance)
- Triaging issues / monitoring feedback (every day, no novel thinking each time)
- Keeping CI green (each fix is unique but the loop is generic)
/loop = the in-session background-loop primitive. Run a prompt at a specific interval. /loop 10 minutes babysit my open PRs — the session wakes every 10 minutes, re-runs the prompt, uses your CLAUDE.md + connected tools to figure out what to do.
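Conceptually, an interval loop is just "run the same task, wait, run it again." The sketch below shows that shape with an injectable sleep so it can be exercised without waiting; it is not the implementation of /loop, and `parse_interval` and `run_every` are hypothetical helpers that handle only simple "N unit" strings:

```python
import time
from typing import Callable

UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600}


def parse_interval(text: str) -> int:
    """Turn a string like '10 minutes' or '1 hour' into seconds."""
    amount, unit = text.split()
    return int(amount) * UNIT_SECONDS[unit.rstrip("s")]


def run_every(
    interval_s: int,
    task: Callable[[], str],
    iterations: int,
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Run task, wait interval_s, and repeat for a fixed number of iterations."""
    results = []
    for i in range(iterations):
        results.append(task())
        if i < iterations - 1:
            sleep(interval_s)  # no wait after the final run
    return results


if __name__ == "__main__":
    waits = []
    runs = run_every(parse_interval("10 minutes"), lambda: "checked open PRs", 3, sleep=waits.append)
    print(runs, waits)
```

Each wake-up re-runs the same prompt from scratch, which is why the surrounding context (CLAUDE.md and connected tools) has to carry enough information for the task to figure out what changed since the last run.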
Routines = /loop running remotely. Same primitive, but running in Anthropic’s cloud containers (the same containers as Claude Code on the web). Set up via the Routines tab in the web or desktop app. Two trigger types: time-based or event-based. Both spawn a new Claude Code session with a specified prompt. Claude Code team examples:
- A routine that updates docs every day.
- A routine that looks at issues + feedback and posts to Slack every 6 hours.
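The two trigger types can be modeled as one record with either an interval or an event name attached. This is a hypothetical data model for reasoning about routines, not the Routines API; the field names and the "issue.opened" event string are invented for illustration:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Routine:
    name: str
    prompt: str
    every: Optional[timedelta] = None    # time-based trigger
    on_event: Optional[str] = None       # event-based trigger (hypothetical event name)
    last_run: Optional[datetime] = None


def due_routines(
    routines: list[Routine], now: datetime, event: Optional[str] = None
) -> list[Routine]:
    """Return routines whose interval has elapsed or whose event just fired."""
    due = []
    for routine in routines:
        time_due = routine.every is not None and (
            routine.last_run is None or now - routine.last_run >= routine.every
        )
        event_due = event is not None and routine.on_event == event
        if time_due or event_due:
            due.append(routine)
    return due


if __name__ == "__main__":
    now = datetime(2026, 5, 20, 12, 0)
    routines = [
        Routine("docs", "update the docs", every=timedelta(days=1)),
        Routine("digest", "post issues and feedback to Slack", every=timedelta(hours=6),
                last_run=now - timedelta(hours=2)),
    ]
    print([r.name for r in due_routines(routines, now)])
```

Either way the outcome is the same: a due routine starts a fresh session with its stored prompt, so the prompt has to be self-contained.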
The stack composes (the closing slide). Verification ratchets reliability → multi-clauding scales horizontal throughput → background loops remove the keyboard from the hot path. “That really is the ultimate goal: you can spend your attention on the tasks you care about. Everything else can be delegated to Claude with high reliability and a high degree of confidence.”

The three-layer stack at a glance

Layer	Primitive	Purpose	When	Who’s in the loop
1. Verification	Self-checking skill	Make Claude reliable enough to trust	One-time setup per app	You set it up; Claude self-verifies forever
2. Multi-Clauding	Agent view / Desktop / Remote Control	Run 4-5 sessions in parallel	When verification is reliable	You manage attention across sessions
3. Background Loops	`/loop` + Routines	Take keyboard out of hot path entirely	When ongoing maintenance jobs accrue	Nobody — runs unattended

The verification skill recipe (Sid’s actual demo skill structure)

.claude/skills/verification/
└── SKILL.md
    ## Section 1: Bring up the stack
    — concrete commands (npm run dev, docker compose up, etc.)
    
    ## Section 2: Load tools
    — Chrome MCP enable, Playwright, etc.
    
    ## Section 3: Smoke test
    — open browser, navigate, click key UI elements
    — take screenshots before/after for proof
    
    ## Self-improvement loop
    — "Whenever you hit a blocker, edit this skill to document the fix"
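If you would rather start from a skeleton than have Claude generate the file, the recipe above can be scaffolded with a few lines. This is a sketch, not Sid's actual skill: the template wording is paraphrased from the recipe, the `name` and `description` frontmatter fields are an assumption to check against the skills documentation, and `scaffold_skill` is a hypothetical helper:

```python
from pathlib import Path

SKILL_TEMPLATE = """\
---
name: verification
description: How to bring up this app and verify changes end to end.
---

## Bring up the stack
- {start_command}

## Load tools
- Enable a browser-control MCP (for example Claude in Chrome via /chrome).

## Smoke test
- Open {app_url} and click through the key UI elements.
- Take screenshots before and after the change as proof.

## Self-improvement
- Whenever you hit a blocker, edit this skill to document the fix.
"""


def scaffold_skill(root: Path, start_command: str, app_url: str) -> Path:
    """Create <root>/.claude/skills/verification/SKILL.md and return its path."""
    skill_file = root / ".claude" / "skills" / "verification" / "SKILL.md"
    skill_file.parent.mkdir(parents=True, exist_ok=True)
    skill_file.write_text(
        SKILL_TEMPLATE.format(start_command=start_command, app_url=app_url),
        encoding="utf-8",
    )
    return skill_file


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        # The command and URL are placeholders for your own project.
        created = scaffold_skill(Path(tmp), "npm run dev", "http://localhost:3000")
        print(created.read_text(encoding="utf-8"))
```

A skeleton like this is only a starting point; the value comes from the entries that accumulate under the self-improvement section as real blockers are hit.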

Try It

Confirm you have the prerequisites. CLAUDE.md, connected tools, Claude Code on the web. If not, start there — none of this works without them.
Build ONE verification skill for ONE app. Use Sid’s recipe: bring-up-stack → load-tools → smoke-test → self-improvement clause. Tell Claude “take everything we just did and put it into a skill in .claude/skills/verification.”
Add /chrome MCP for any web-facing app. /chrome enables Claude in Chrome MCP. Sid’s preferred path. Fallback: Playwright MCP.
Make your verification skill self-improving from day one. Include the line “improve this skill every time you hit a blocker” in the skill body. The team’s compounding loop matters more than the initial quality.
Try claude agents for parallel sessions instead of tmux, even if you love the terminal. Sid still uses both but recommends Claude Agents for newcomers.
Set up /remote-control on one important Claude Code session. “This is my favorite feature.” It lets the agent buzz your phone when stuck.
Pick ONE recurring bookkeeping task and convert it to a Routine. Updating docs, weekly issue triage, daily Slack digest of feedback. Routines run in the cloud — no laptop required.
Don’t open more than 4-5 Claudes simultaneously. Sid’s personal attention budget. Past this you stop being effective regardless of how good verification is.

Related

Code with Claude 2026 — Opening Keynote — Conference parent.
What’s New in Claude Code (Dixon) — Companion conference talk. Lots of overlap on the agent-view, remote-control, routines, /loop primitives (Dixon’s talk goes deeper on the harness-layer details).
The Expanding Toolkit (Lucas) — Companion talk explicitly noting verification loops as one of the things “absorbed into the model.”
The Thinking Lever (Bleifer) — Sister talk on test-time compute. Verification loops + extra-high effort = the two compounding levers.
Claude Code Routines — Primary docs for the cloud-routine surface Sid demos.
Claude Code Scheduled Tasks — Primary docs for /loop + CronCreate.
Claude Code Agent Teams — Adjacent: explicit multi-agent coordination (vs. Sid’s “many parallel solo sessions” framing).
Claude Code CLI Reference — claude agents flags, /remote-control, /chrome.
Computer Use (Desktop + CLI) — Companion to Chrome MCP for non-web-app verification.
Anthropic Engineers’ Four Skill Rules — Self-improving skills is the rule Sid embodies in his verification-skill demo.
skills — The skill open standard the verification skill ships against.
Claude Code Best Practices — Where the “verification is the single highest-leverage thing” quote originates (Cal Rueb’s earlier May 2025 talk).
/goal Walkthrough — Companion long-running-loop primitive (different shape from /loop).

Open Questions

What’s the exact upper bound on /loop interval frequency? Sid demos /loop 10 minutes but doesn’t specify limits. The wiki’s scheduled tasks article has the official answer — worth cross-checking that Sid’s framing matches.
Routines vs /loop cost model — Sid doesn’t address whether running a Routine for 6 hours via the cloud container racks up the same usage as a 6-hour interactive session. The Claude Code team example (docs-update routine + every-6-hour Slack routine) doesn’t quote a cost. Worth checking against the W19 weekly-limits documentation.
Concrete verification recipes per stack — Sid demos a TypeScript+Express+Mongo+Redis recipe. Would benefit from documented recipes for Python+FastAPI, Rails, Go, and a pure-CLI app. Likely candidate for a future “verification-skill cookbook” article.
How does the self-improving skill avoid degenerating? A skill that edits itself every time something goes wrong could accumulate redundant or contradictory advice. Sid frames it as positive-only but doesn’t address pruning. Worth pairing with the four skill rules which would recommend periodic pruning.
What’s the cap on attention for unattended sessions? Sid says 4-5 parallel for attended multi-clauding. For Routines (zero attention required), the cap should be much higher — but does running 30 Routines exhaust shared cloud quota?

Jonathon's AI Wiki

