Source: Autobrowse: The Mythos moment for Browser Agents is here (Kyle Jeong, Browserbase) · 2026-05 engineering post · backed by ai-research/browserbase-autobrowse-2026-05-09.md
Browserbase’s Autobrowse is a self-improving harness that runs a browser agent against a real task on a real site, iterates the strategy with a strategy.md scratchpad until the workflow is reliable rather than lucky, then graduates the converged approach into a markdown SKILL.md plus deterministic helper scripts. The article frames Autobrowse as the Karpathy autoresearch loop applied to browser-agent skill discovery — the first run is expensive on purpose; every subsequent run pays an order of magnitude less because the agent loads the graduated skill instead of re-deriving it. The thesis: the real bottleneck for browser agents in production is not reasoning but memory, in a form humans and agents can both read and trust. Same diagnosis as Anthropic’s Platform team interview and the Memory & Dreaming talk, applied specifically to web automation.
Key Takeaways
- Autobrowse = AI improving AI for browser skills. Hand the agent an objective on a real site → it runs end-to-end → studies its own trace → iterates strategy → converges → graduates a reusable SKILL.md. Cap iterations low (~3–5) and short-circuit aggressively. Article explicitly compares this to Karpathy’s autoresearch.
- strategy.md is the outer-loop scratchpad. Each iteration the agent dumps observations (what worked, what broke, what to try next, what to stop doing) into strategy.md. The next iteration reads strategy.md first as context, so improvements compound instead of resetting every run.
- The artifact IS the point. Output is a small readable markdown file with frontmatter (name, title, description, website, category, tags, status, source-trace, recommended_method, alternative_methods) plus deterministic glue: CLI calls (browse fetch, browse search), helper scripts (helpers/amazon.py, helpers/opentable.py), and any selector/wait logic. No transcript, no embedding vector, no screenshot reel — just markdown a human can read, edit, and commit.
- Concrete benchmarks (Browserbase-reported).
  - Craigslist: traditional Claude Code loop at a ~$0.22 baseline; the graduated skill runs at $0.12 / 27s (~45% cost reduction, ~38% latency reduction).
  - Form-fill experiment: converged to $0.24/run in 4 iterations, just by removing parts of the agent’s own approach that weren’t pulling weight.
  - Federal grants portal: 28-page scrape collapsed into a single browse fetch after Autobrowse surfaced an undocumented JSON endpoint. Discovery is baked into the graduated skill.
- Recurring win pattern: “an agent tries something a person never would, and finds something a person would never see” — undocumented APIs visible in network traffic but not in the rendered page; heavy client-side rendering where the real content only appears after a sequence of interactions; multi-step login or wizard flows where the right path isn’t obvious from the first screen.
- Honest failure mode — deterministic-parsing tasks. A 167-row static HTML state catalog cost ~$24 across 4 iterations without ever returning all rows in a single output (the per-turn output cap kept truncating). Pivoted to ~200 lines of Python with browse fetch + BeautifulSoup → sub-second runtime, zero inference cost (see the sketch after this list). The lesson is written into the skill itself: probe with browse fetch first; if the data comes back cleanly, write the parser; only escalate to Autobrowse if the response is empty / dynamic / gated.
- Agency-level discipline. Browser agents span a spectrum from static script (no LLM in the loop) → router-style → tool-using → fully autonomous loops that spawn other agents and define their own tools. Autobrowse is the high-agency end and should be reached for only after cheaper, more deterministic options have been ruled out. The discipline of not using Autobrowse is part of using it well.
- Same pattern as Browserbase’s internal generalist agent (bb). Their internal workflow agent (feature requests, session investigations, PRs, sales triage) loads small markdown skills on demand. Autobrowse extends that idea by letting the agent write its own skill, learned by actually doing the task. Hand-written and Autobrowse-graduated skills are the same artifact — once a skill exists, nothing about how an agent loads or runs it cares whether a human or another agent wrote it.
- Skills as customer handoff. Today when an agent succeeds, the customer gets a trace, a session replay, or a paragraph of reasoning — none legible to the people who own the workflow. A skill is durable, debuggable, human-auditable, ownable. Engineers can read and edit it; non-engineers (a technical PM, a VP of tech, a grants manager who knows their portal) can also read it and roughly understand what the agent is doing without ever touching code. Shifts the contract from “trust the agent’s output” to “read the agent’s playbook.”
- Public Browse CLI ecosystem. Graduated skills land in a public skills repo accessible to anyone running a browser agent. A growing public directory of skills is positioned as the actual prize, not any single skill. (Echoes the curated-marketplace shape of skills, but for browser-specific skills.)
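The deterministic fallback from the state-catalog bullet is small enough to sketch. A minimal version, assuming a static HTML table at a placeholder URL (the post names neither the portal nor its column layout):

```python
# Minimal version of the "~200 lines of Python" fallback: plain fetch plus
# BeautifulSoup for static HTML. URL and table layout are placeholders,
# not the actual state catalog from the post.
import requests
from bs4 import BeautifulSoup

CATALOG_URL = "https://example.gov/state-catalog"  # hypothetical

def scrape_catalog(url: str = CATALOG_URL) -> list[list[str]]:
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    rows = []
    for tr in soup.select("table tr")[1:]:  # skip the header row
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)
    return rows  # all rows in one pass: sub-second, zero inference cost
```

There is no per-turn output cap to fight here; the full result materializes in one pass.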
How Autobrowse Works
The article documents the seven-step loop:
- Objective. Hand the agent a real task on a real site (e.g., “book a 7pm dinner reservation at this restaurant on OpenTable”).
- Run. Let the agent attempt the task end-to-end against a live browser.
- Study. The agent reads its own trace. Where did it stall? Where did it guess? Where did it spend tokens it didn’t need to?
- Strategy. The outer loop maintains strategy.md — a scratchpad the agent dumps observations into after each iteration (what worked, what broke, what to try next, what to stop doing). The next iteration reads strategy.md first as context (a minimal sketch of this loop follows the list).
- Iterate. Refine the strategy based on the notes. Drop steps that didn’t pull weight. Lean on deterministic helpers (browse fetch, browse search, custom Python) wherever possible.
- Converge. Once consecutive iterations stop yielding meaningful improvements in cost or turn count, short-circuit. The goal is “reliable and cheap enough to be reused,” deliberately short of any theoretical global optimum.
- Graduate. Write out a SKILL.md plus any helper files into the public skills repo.
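A minimal sketch of the loop above, assuming a run_agent callable that executes one end-to-end attempt and returns (cost, turns, notes); the function and its signature are illustrative, not Browserbase’s actual API:

```python
# Illustrative outer loop for the seven steps above. run_agent() stands in
# for one end-to-end attempt against the live site; its signature is an
# assumption, not the real Autobrowse interface.
from pathlib import Path

STRATEGY = Path("strategy.md")
MAX_ITERS = 5           # cap iterations low (~3-5)
MIN_IMPROVEMENT = 0.10  # short-circuit when gains drop below 10%

def autobrowse(objective: str, run_agent) -> None:
    best_cost = float("inf")
    for i in range(1, MAX_ITERS + 1):
        # Run + Study: attempt the task with prior strategy notes as context.
        context = STRATEGY.read_text() if STRATEGY.exists() else ""
        cost, turns, notes = run_agent(objective, context)
        # Strategy: append observations so improvements compound across runs.
        with STRATEGY.open("a") as f:
            f.write(f"\n## Iteration {i}: ${cost:.2f}, {turns} turns\n{notes}\n")
        # Converge: short-circuit once improvement stalls.
        if best_cost - cost < MIN_IMPROVEMENT * best_cost:
            break
        best_cost = cost
    # Graduate: writing SKILL.md + helpers happens outside this sketch.
```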
The Graduated SKILL.md Format
The post embeds a real graduated skill — craigslist-search-listings — as a worked example. Notable fields and conventions:
- Frontmatter captures name, title, description, website, category, tags, status (e.g. "autobrowsed-run-004 + prod-validated-002"), source-trace ("autobrowse + browser-trace · 4 iters · converged 2026-04-30 · cross-region prod validation 2026-05-01"), an updated date, recommended_method (api/browser), and alternative_methods (each with method + rationale for when to fall back).
- Body sections: Purpose · When to Use · Workflow (numbered, with concrete URLs/headers/params) · Site-Specific Gotchas (the highest-density section — IP-based geo-redirects, undocumented enums, response decode tables that vary per cache TTL, etc.).
- Discovered behaviour goes into the gotchas list. The Craigslist skill documents that the API geolocates by source IP unless postal=<zip>&search_distance=<mi> is supplied, that data.decode.locations indexing is per-response and must never be cached, that categoryId → cat3 is an undocumented enum (with observed mappings like 68 → bik, 93 → spo, 122 → pts, 197 → bop), and that item[0] is an offset from data.decode.minPostingId, not the postingId itself. All of this discovery cost is paid once, then read by every subsequent agent for free. (A reconstructed skeleton of this skill file is sketched below.)
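Pulling the listed fields together, a graduated skill plausibly looks like this skeleton. It is reconstructed from the fields and quoted strings above, not copied from the embedded example; angle-bracket values are placeholders:

```markdown
---
name: craigslist-search-listings
title: <human-readable title>
description: <one-line summary>
website: <site URL>
category: <category>
tags: [<tags>]
status: "autobrowsed-run-004 + prod-validated-002"
source-trace: "autobrowse + browser-trace · 4 iters · converged 2026-04-30 · cross-region prod validation 2026-05-01"
updated: <date>
recommended_method: <api | browser>
alternative_methods:
  - method: <method>
    rationale: <when to fall back>
---

## Purpose
## When to Use
## Workflow
1. <numbered steps with concrete URLs/headers/params>
## Site-Specific Gotchas
- API geolocates by source IP unless postal=<zip>&search_distance=<mi> is supplied
- data.decode.locations indexing is per-response; never cache it
- categoryId → cat3 is an undocumented enum (68 → bik, 93 → spo, 122 → pts, 197 → bop)
- item[0] is an offset from data.decode.minPostingId, not the postingId itself
```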
Where Autobrowse Breaks
The post is unusually honest about failure modes — worth quoting the structure here because it informs when not to reach for the harness:
- Deterministic-parsing tasks. Plain HTML, no JS, no auth, no anti-bot — the data is right there in the markup. The “let the agent figure it out” framing is seductive but wrong; per-turn output caps will truncate the agent’s own reasoning before it ever returns a clean result. ~200 lines of Python beat $24 of iteration on the 167-row state catalog.
- Tasks that don’t reward cleverness. When the shortest reliable path is already obvious from the first network trace, Autobrowse is paying for exploration the task doesn’t need.
- Anywhere a browse fetch probe answers the question. Embedded in the lesson: probe with fetch first, escalate to Autobrowse only when the response is empty / dynamic / gated (sketched below).
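The probe-first rule is easy to encode. A sketch, assuming the browse CLI takes a URL after its fetch subcommand (the post names the primitive but not its exact invocation):

```python
# Probe with fetch first; escalate to Autobrowse only when the response is
# empty, dynamic, or gated. The `browse fetch <url>` invocation shape is an
# assumption based on how the post names the primitive.
import subprocess

def probe_then_decide(url: str, expected_marker: str) -> str:
    """expected_marker: a string known to appear in the target data."""
    probe = subprocess.run(
        ["browse", "fetch", url],
        capture_output=True, text=True, timeout=60,
    )
    # Data is right there in the markup: write a deterministic parser.
    if probe.returncode == 0 and expected_marker in probe.stdout:
        return "write-deterministic-parser"
    # Empty, JS-rendered shell, or gated: worth the exploration cost.
    return "escalate-to-autobrowse"
```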
Roadmap
- Smarter stopping. Current heuristic is “cap iterations + short-circuit on cost/turn-count convergence.” Next step: let the agent reason about its own convergence more explicitly, comparing the structure of its trace across runs — not just cost and turns. Don’t optimize the variance away too aggressively, since some of Autobrowse’s most useful wins come from the agent randomly varying its approach and stumbling onto a much shorter path.
- Better priors about how to explore. Push the agent to reach for fetch and search primitives before spawning a full browser session. For more advanced tasks, let the agent inspect browser traces, network events, and CDP logs so it can discover internal APIs by watching network requests rather than guessing from the rendered DOM (sketched below).
- Recursive Autobrowse. Autobrowse improving Autobrowse — graduating improvements to its own harness. Better prompts for the iteration step, better priors for which primitives to reach for first, better SKILL.md templates per task class.
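For the network-watching idea, a sketch of discovering candidate internal APIs from traffic rather than the DOM, using Playwright as an illustrative driver (the post mentions network events and CDP logs, not this specific stack):

```python
# Discover candidate internal APIs by watching network responses instead of
# guessing from the rendered DOM. Playwright is an illustrative choice here,
# not necessarily what Autobrowse uses.
from playwright.sync_api import sync_playwright

def discover_json_endpoints(url: str) -> list[str]:
    endpoints: list[str] = []

    def on_response(response):
        # JSON responses are candidate undocumented endpoints, like the one
        # that collapsed the 28-page grants scrape into a single fetch.
        if "application/json" in response.headers.get("content-type", ""):
            endpoints.append(response.url)

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.on("response", on_response)
        page.goto(url, wait_until="networkidle")
        browser.close()
    return endpoints
```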
The Bigger-Picture Thesis
A dominant story about browser agents right now is that they’ll get good when the underlying models get good: we’re one Anthropic or OpenAI release away from agents that just work on the web. The post doesn’t entirely buy that.
Even a perfect model still has to discover, on every new site, what it would already know if it had been there before. Without a place to put what the agent learns, every run is a fresh start.
The real bottleneck is memory, in a form humans and agents can both understand and trust.
This is the same diagnosis articulated by Anthropic’s Angela & Caitlin Platform team interview (path-dependence in primitives — file systems + skills as Claude’s deliberate memory format) and the Memory & Dreaming talk (file-system memory + out-of-band consolidation across sessions). Browserbase is mounting it on the browser-agent surface specifically.
Related
- Karpathy AutoResearch — self-improving coding-agent ratchet loop — the explicit progenitor pattern Autobrowse cites. Same converge-then-graduate shape, different domain (coding vs. browsing).
- TinyFish — Web Infrastructure APIs for AI Agents — direct competitor in the browser-infra space; TinyFish’s launch coverage names Browserbase explicitly. Different bet: TinyFish ships the primitives + agent endpoint as APIs; Browserbase ships the harness that produces artifacts on top of primitives.
- Browser Harness — CDP Browser Control Skill for Claude Code — sister browser-control harness aimed at Claude Code; Autobrowse is the analogue for the Browserbase ecosystem. Both share the “agent edits its own helpers mid-run” / self-healing design pattern.
- skills — the marketplace-shape precedent for a public skill directory; Browserbase’s “growing public directory of browser skills” claim follows the same playbook.
- Building Agents with Skills — Anthropic blog post that lists Browserbase as a Partner Skills builder. Autobrowse is the production engine behind those partner skills.
- Memory and Dreaming for Self-Learning Agents — same “memory as the bottleneck, not reasoning” thesis at the platform layer. Autobrowse is the browser-domain instantiation.
- Inside Claude’s Agent Platform — Angela & Caitlin — opinionated-primitives + verifiable-outcome design philosophy. Autobrowse is verifiable-outcome (cost + turn count) compressed onto a browser-agent loop.
- video-use — browser-use conversational video editor — different domain (video editing in a browser), same self-healing-agent / agent-writes-its-own-skill pattern from the same browser-use lineage.
- Crabbox — Remote Testbox for OpenClaw — sibling infrastructure: Crabbox solves the “where does the agent run” infra-wall problem; Autobrowse solves the “what artifact does the agent leave behind” memory-wall problem. Both target the post-Mac-mini-prototype phase Anthropic’s Platform team flags.
- Computer Use — the broader category that browser agents sit inside; Autobrowse is what production-ization looks like at one slice.
Try It
- Read the embedded craigslist-search-listings skill in full — it’s the most concrete example of what an Autobrowse-graduated artifact looks like, including the gotchas section (geo-redirect, undocumented enums, per-response decode tables). It’s a near-template for any new browser skill you author by hand.
- Adopt strategy.md as a scratchpad in your own browser-agent loops, even without the rest of the harness. The cheap-to-implement piece: append observations after each iteration, read it first on the next. The compounding effect is most of the win.
- Probe with fetch before browser sessions. Write that as a checked-in rule for your own agent harness before reaching for full browser automation. Skip-the-browser-when-fetch-works is the cheapest discipline change with the largest cost-per-run impact.
- Watch the public Browse CLI skills directory for graduated skills covering sites you also need agents to work against. Even hand-built agents can load the markdown directly — the format is the format.
Open Questions
- Where exactly does the public skills directory live? The article references “the public Browse CLI ecosystem” and a “public skills repo” but doesn’t link the canonical URL. Worth fetching when the repo surfaces (likely github.com/browserbase/*).
- Pricing model for Autobrowse runs. First-run cost is paid on purpose; the article cites Craigslist convergence at ~$0.22 baseline plus iteration overhead but doesn’t disclose how Browserbase prices the harness vs. the inference. Open until the product/pricing page is checked.
- Compatibility with non-Browserbase browser stacks. Skills are markdown + helper scripts and the article frames them as portable, but the deterministic glue references browse fetch / browse search primitives — open whether those run on TinyFish / browser-harness / Stagehand / Playwright unmodified.
- Recursive Autobrowse status. Listed as “what we’re working on next.” Not yet shipped at time of the post.