Source: Live coding session with Boris Cherny and Jarred Sumner (Code with Claude London 2026, YouTube DlTCu_pNDHE)

Boris Cherny (head of Claude Code at Anthropic) and Jarred Sumner (creator of Bun, at Anthropic) do a live-coding session showing how the Bun runtime uses Claude Code to maintain itself. The centerpiece is “Robun” (also “Robo Bun”) — a bot that auto-reproduces every GitHub issue the moment it’s filed and opens a test-gated PR before any human looks at it. The talk is a worked example of a fully-closed agentic loop on a systems codebase, and the recurring theme is that the work shifts from “fix the bug” to “is this the right thing to merge?”

Key Takeaways

  • Robun auto-reproduces every issue, before anyone looks at it. Every time someone files an issue in the Bun repo, a Claude-powered bot (“Robun” / “Robo Bun”) automatically spins up and tries to reproduce it, then submits a PR — all before a human triages the issue.
  • Test-gated PR with a hard fail-on-old/pass-on-new requirement. Every Robun PR must include tests — “one of the actual hard requirements before it can submit a PR.” The specific gate: the test must fail on the previous version of Bun and pass on the fix branch. The bot literally cannot submit a PR unless that’s true. This is the verification mechanism that makes the loop trustworthy.
  • Robun out-contributes Jarred. Per GitHub Insights → Contributors → last 3 months (commits to main), Robun is now a bigger contributor to Bun than Jarred Sumner himself, even though not all of its PRs get merged.
  • The work shifts from debugging to merge judgment. Because repro + fix + tests are automated, the challenge moves from “fix and debug the issue” to “is this the right fix? How good is it? Is this the right thing to merge?” Verifying that changes are correct is now the bottleneck, not writing code.
  • Adversarial / multi-agent code review in the loop. Automatic code-review bots run on Robun’s PRs and iterate with it. Code Rabbit leaves a comment, Robun responds, and they go back and forth (one PR had ~30 comments); the bots mark comments resolved when done. Cherny floated calling the pattern “adversarial code review” (the name was proposed live, not settled).
  • Division of labor between review bots. Code Rabbit is good for stylistic issues and CLAUDE.md conformance; Claude’s code review is good at subtle edge cases that need full-codebase context and control-flow tracing — surfacing bugs not visible in the diff. Claude code review is wrong only ~10% of the time, vs older review products where “you had to ignore most of what it said.”
  • Code review must be in the loop, not just commenting. The automation only works because review bots actually fix (reply-and-edit), not merely comment. This removes the old switching cost — checking out a branch locally, fixing a lint error, running the linter, pushing back up.
  • CLAUDE.md is the load-bearing prerequisite. “If you just do this then it doesn’t quite work.” The first requirement is a correctly set-up dev environment documented in CLAUDE.md, or the bot submits PRs that don’t make sense to merge.
  • Bun’s CLAUDE.md encodes a special build-and-run command. Because Bun must be compiled, CLAUDE.md mandates one command that both builds and runs (forwarding arguments) so the agent tests the actual changes, not a stale debug build. It also documents how to run tests, how to write them, where to put them, and a log of previously-encountered issues.
  • Compound engineering: every repeated correction goes into CLAUDE.md. The pattern: when you catch the agent making the same mistake once or twice (e.g., writing a bad test), tell it to add the rule to CLAUDE.md so it gets it right the first time on the next run. A concrete example: print the error message before less-informative conditions so Claude always sees it.
  • The agent must run the full loop, including CI. Set the agent up to write code → test it → check CI → monitor CI → read all build/CI errors. The ideal is that by the time a PR reaches a human, everything is set up so the reviewer can merge with high confidence.
  • Self-verification is the prerequisite for scaling to hundreds of parallel agents. Sumner now runs “hundreds of agents every single night” in auto mode — realizing Cherny’s earlier vision. Autonomous parallel agents only work if they can self-verify.
  • Opus 4.7 is the inflection model. Sumner: “47 [Opus 4.7] is the first model where it’s really felt like it’s able to do this.” Previously it took heavy scaffolding (“throw a bunch of tokens at it”); now it’s efficient enough to use day-to-day. None of this would have worked several months ago; the quoted timeline (“3 months ago this is doable”) makes the capability very recent.
  • Hill climbing on a metric + a verifier. Anthropic’s internal term for the loop: give the model a metric and a way to verify its result, and let it iterate in auto mode until it hits the target. Worked example — Sumner told Claude to make Bun’s new image-processing library “faster than sharp,” gave it a few ideas (e.g., read JavaScriptCore source to avoid cloning the typed array when unnecessary), ran the benchmark on a separate Linux box, and Claude hill-climbed to beat it. This was “one prompt” that “ran for 30 minutes.”
  • Robun’s evolution. Earlier it was just a Discord bot you @-mentioned to spin up a container — no CI integration, no code-review integration. The current pipeline (CI + review bots + Opus 4.7) is “so much better now.”
  • PRs become suggestions. Because there’s no human whose work you’d feel bad rejecting, the merge bar can rise — you can simply not merge a wrong/low-value PR from Claude without the social cost of rejecting a coworker. This paradoxically raises the bar for what gets merged.
  • Live demo result. During the ~25-minute talk, the ad-hoc agents produced 3 PRs (a 4th landing as tests finished), while Robun kept generating more from incoming issues. One issue Robun fixed had 20 upvotes.
  • The bottleneck keeps moving. Writing code was the bottleneck and no longer is. Verification (running tests) was the bottleneck and no longer is. Now the bottleneck is a deeper layer of verification (communicating sufficient proof that a change is correct, or making rollback easier), and after that, planning (“what should we do, what’s the right way to fix this”).
  • Why Bun is an easy case. It’s systems code and a CLI tool — easy to repro/verify against a particular architecture, and no browser needed to test. For products that aren’t open source, the generalizable starting point is a customer-support ticket instead of a GitHub issue: auto-pass tickets to a bot to repro → submit PR → review-bot loop.
  • Feature requests are deliberately gated by taste. Robun does not auto-build feature requests from issues (though @-mentioning it in Discord/Slack can make it implement a feature, sometimes producing a PR about an hour later). Sumner is hesitant to auto-implement everything anyone asks for: adding, e.g., an image-processing library to Bun is an engineering-taste call, and it’s unclear whether Claude’s taste yet matches his.
  • Larger PRs are increasingly Claude-driven too. Recent Claude-built work in Bun: a built-in image-processing library (plus follow-up PRs), an HTTP/3 server, an HTTP/2 server PR, fetch support for HTTP/3 and HTTP/2, and an ongoing (possibly-not-shipping) Rust rewrite — “the most ambitious one.”
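
The fail-on-old/pass-on-new gate from the takeaways can be sketched as a small decision function. This is an illustration only, not Robun's actual implementation, which the talk does not show. The names RunResult and may_open_pr are hypothetical, and the sketch assumes the bot has already run the new regression test against both builds and recorded the exit codes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Outcome of running the new regression test against one build."""
    build: str      # hypothetical label, e.g. "previous" or "fix-branch"
    exit_code: int  # 0 means the test passed


def may_open_pr(previous: RunResult, fix_branch: RunResult) -> tuple[bool, str]:
    """Allow a PR only if the test fails on the old build and passes on the fix."""
    if previous.exit_code == 0:
        # A test that already passes on the old build does not reproduce the bug.
        return False, f"test passes on {previous.build}; it does not reproduce the issue"
    if fix_branch.exit_code != 0:
        # The fix is incomplete, or the test is wrong.
        return False, f"test still fails on {fix_branch.build}"
    return True, "test fails on the old build and passes on the fix"


allowed, reason = may_open_pr(RunResult("previous", 1), RunResult("fix-branch", 0))
print(allowed, reason)
```

Checking the old build first matters: a test that passes everywhere proves nothing about the fix, so that case is rejected before the fix branch is even considered.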

Implementation

Tool/Service: Claude Code (CLI) + a custom GitHub-issue bot (“Robun”/“Robo Bun”) + Code Rabbit + Claude code review, running on the Bun repo.

Setup:

  • Trigger: on every new GitHub issue, fire the bot to auto-reproduce. (Generalizable analog for closed-source products: trigger off a customer-support ticket.) The bot can also be @-mentioned in Discord/Slack to implement a feature on demand.
  • Hard PR gate: the PR must include a test that fails on the previous version and passes on the fix branch — the bot cannot open a PR otherwise.
  • CLAUDE.md must encode the dev environment. For Bun specifically: a single build-and-run command that compiles then runs with forwarded args (so the agent never tests a stale debug build); how/where to write and place tests; a running log of past gotchas; folder/architecture overview; dependency notes; and the rule to print error messages before less-informative conditions so Claude always sees them.
  • Full agent loop: write code → run tests → check CI → monitor CI → read all build/CI errors, so a human reviewer arrives at a high-confidence merge decision.
  • Review-bot loop: Code Rabbit (style + CLAUDE.md conformance) and Claude code review (edge cases, control-flow tracing, diff-external bugs) iterate with the bot and mark comments resolved.
  • Permissions: auto mode, used in place of --dangerously-skip-permissions (which Cherny notes he’s “not supposed to recommend”). Auto mode lets Claude run for hours unattended without stalling on a permission prompt. Sumner also runs no flicker CLI mode (set the env var; Cherny: “no flicker equals 1 quad,” and floated making it the default), a virtualized-scrolling renderer with constant memory/CPU and working mouse/click events in the terminal.
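
The single build-and-run command from the setup list can be sketched as a thin wrapper. The talk does not show Bun's actual command, so BUILD_CMD and BINARY below are hypothetical placeholders; substitute your project's real build step and binary path. The point of the sketch is the ordering: always rebuild first, then run the fresh binary with the caller's arguments forwarded unchanged.

```python
import subprocess
from typing import Callable, Sequence

# Hypothetical placeholders, not Bun's real commands.
BUILD_CMD = ["make", "debug"]
BINARY = "./build/debug/app"

Runner = Callable[[Sequence[str]], int]


def _run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def build_and_run(args: Sequence[str], run: Runner = _run) -> int:
    """Rebuild, then run the fresh binary with the forwarded arguments."""
    build_status = run(BUILD_CMD)
    if build_status != 0:
        # Never fall back to a stale binary when the build fails.
        return build_status
    return run([BINARY, *args])
```

An agent would call something like build_and_run(["test", "some.test.ts"]) instead of invoking the binary directly, so it cannot accidentally test an old build. The run parameter exists only so the wrapper can be exercised without a real toolchain.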

Cost: Not stated. (Sumner runs “hundreds of agents every single night.”)

Integration notes:

  • The compound-engineering loop (repeated mistake → write the rule into CLAUDE.md) is what makes running many agents maintainable — “to do that it needs to be written down.”
  • Hill climbing pattern: give a metric + a verifier (e.g., a benchmark on a dedicated box) and let Claude iterate in auto mode until it hits the target. Best with Opus 4.7.
  • Bun’s loop is eased by being a compiled CLI / systems codebase (cheap repro, no browser); UI-heavy products would add a screenshot/video-capture step to the verification loop.
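
The hill-climbing pattern in the notes above (a metric plus a verifier, iterated until the target is hit) can be sketched as a generic loop. This is an assumption-laden illustration: propose stands in for the agent making one change, measure stands in for the benchmark run on the dedicated box, and higher scores are assumed to be better. None of these names come from the talk.

```python
from typing import Callable, TypeVar

C = TypeVar("C")


def hill_climb(
    initial: C,
    propose: Callable[[C], C],
    measure: Callable[[C], float],
    target: float,
    max_iterations: int = 50,
) -> tuple[C, float, bool]:
    """Keep a candidate only if the verifier scores it higher; stop at the target."""
    best, best_score = initial, measure(initial)
    for _ in range(max_iterations):
        if best_score >= target:
            return best, best_score, True
        candidate = propose(best)   # hypothetical: the agent edits the code
        score = measure(candidate)  # hypothetical: the benchmark verifies it
        if score > best_score:
            # Accept only verified improvements; discard regressions.
            best, best_score = candidate, score
    return best, best_score, best_score >= target


# Toy stand-ins: each "change" adds 1 and the "benchmark" is the value itself.
result = hill_climb(0, lambda x: x + 1, float, target=5.0)
print(result)
```

The iteration cap is a design choice for the sketch: an unattended loop needs some stopping rule besides reaching the target, and the returned flag tells the caller which one fired.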

Try It

  • Add a fail-on-old/pass-on-new test gate to any agent that opens fix PRs: require a test that fails on the pre-fix version and passes on the branch. This is the single most reusable, falsifiable rule from the talk.
  • Audit your CLAUDE.md against the loop. Document a one-command build-and-run that guarantees the agent tests real changes (not stale builds), where/how to write tests, folder/architecture map, dependency notes, and a running log of past gotchas.
  • Adopt compound engineering: each time you correct an agent twice on the same thing, have it append the rule to CLAUDE.md so it’s right the first time next run.
  • Put code review in the loop, not just commenting. Pair a style/conformance reviewer (e.g., Code Rabbit) with a context-tracing reviewer (Claude code review) and let them iterate with the fix bot.
  • Give the agent the full CI loop — read build logs, monitor CI, read all errors — so a human only ever sees high-confidence-to-merge PRs.
  • For closed-source products, wire customer-support tickets (not GitHub issues) into a repro → PR → review-bot pipeline.
  • Run in auto mode for long unattended runs instead of --dangerously-skip-permissions; try no flicker CLI mode for fast scrolling and working mouse events.
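
The compound-engineering step (append a rule to CLAUDE.md after a repeated correction) can be sketched as a small helper. In the talk the agent itself is told to write the rule, so this is only an illustration of the bookkeeping. The function name add_rule and the "## Lessons learned" heading are hypothetical; the talk does not describe how Bun's CLAUDE.md is organized.

```python
from pathlib import Path


def add_rule(claude_md: Path, rule: str, heading: str = "## Lessons learned") -> bool:
    """Record `rule` as a bullet under `heading`; return False if already present."""
    lines = claude_md.read_text(encoding="utf-8").splitlines() if claude_md.exists() else []
    bullet = f"- {rule.strip()}"
    if bullet in lines:
        return False  # already recorded; keep the file free of duplicates
    if heading not in lines:
        if lines and lines[-1] != "":
            lines.append("")  # blank line before the new section
        lines.append(heading)
    # Insert directly below the heading so the rule lands in the right section.
    lines.insert(lines.index(heading) + 1, bullet)
    claude_md.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return True
```

Calling add_rule(Path("CLAUDE.md"), "Print the error message before less-informative conditions") twice writes the rule once; the duplicate check keeps the file from growing when the same correction comes up again.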

Open Questions

  • Cost/economics of running “hundreds of agents every single night” — not disclosed.
  • The exact name for the multi-agent review pattern (“adversarial code review” was floated live, not settled).
  • What tooling would let Bun trust auto-merge without a human — Sumner says “deeper verification” and “easier rollback,” but the concrete missing capability isn’t specified.
  • The Rust rewrite of Bun “may not ship” — status genuinely open.
  • No quantified time-savings figure is given beyond “saves so much developer time.”
  • Robun’s exact infrastructure (orchestration, sandbox/container model, container provider) beyond “it spins up a container” is not detailed.