Source: raw/gh-star-microsoft-webwright.md (gh-star metadata, fetched 2026-05-26) + raw/x-bookmark-2059026191646945515.md (X bookmark by @mr_r0b0t, 2,084 likes at capture) + ai-research/microsoft-webwright-github-readme-2026-05-27.md (README refresh, fetched 2026-05-27). Repo: github.com/microsoft/Webwright. Homepage: microsoft.github.io/Webwright. MSR blog: Webwright: A Terminal Is All You Need For Web Agents. Authors: Lu, Yadong; Xu, Lingrui; Huang, Chao; Awadallah, Ahmed (Microsoft Research). Stars: 143 (correcting the 1,106 figure captured at first ingest — see refresh note below). Forks: 7. License: MIT. Languages: Python 91.2%, HTML 8.7%, Shell 0.1%. Codebase: ~1.5k LoC total.
Microsoft Research browser-agent framework with a load-bearing inversion: separate the agent from the browser, and treat the browser as something the agent can launch, inspect, and discard while developing a program. The persistent artifact is the code and logs in the local workspace, not the browser session. Reports SOTA results on two real-website benchmarks (Online-Mind2Web 86.7% with GPT-5.4; Odysseys 60.1% with GPT-5.4, +15.6 points over Opus 4.6), and now installs as a first-class plugin in Claude Code, OpenAI Codex, OpenClaw, and Hermes Agent from one shared skills/webwright/ directory.
Refresh note (2026-05-27)
The first ingest on 2026-05-26 captured `stars: 1106` from the gh-stars puller; the current GitHub count is 143 stars + 7 forks. The 1,106 figure almost certainly came from the puller reading the wrong field. This refresh corrects the count and adds the SOTA benchmark numbers, architecture details, four-host plugin model, MSR blog citation, and the comparison table, all absent from the first ingest because the gh-star metadata alone was thin.
Key Takeaways
- Microsoft Research-published, not just Microsoft-org. Authors are MSR (Lu, Xu, Huang, Awadallah); the MSR blog post explicitly frames Webwright as a research artifact. First-party big-co + research-org maintainer signal stays the load-bearing prior versus community-maintained browser agents like Browserbase Autobrowse, TinyFish, and CloakBrowser.
- Code-as-action inversion is the load-bearing claim. Traditional browser agents predict one action per step (click, type, eval) and treat the browser session as state. Webwright instead has the model write free-form Python that drives Playwright, with the workspace as state and the browser as a disposable environment. “Write code → execute → inspect screenshots → repair” replaces “observe → predict → execute → repeat.” This is the same SWE-style decomposition philosophy as Tool, Skill, or Subagent? (Will, Applied AI) applied to web automation.
- SOTA results with concrete numbers (100-step budget): Online-Mind2Web 86.7% with GPT-5.4 — highest among open-sourced harnesses in the AutoEval category; Claude Opus 4.7 reaches 84.7% and is stronger on the hard split (80.5% vs 76.6% for GPT-5.4 at N=100). Odysseys 60.1% with GPT-5.4 (avg. 76.1 steps) — +15.6 points over prior SOTA (Opus 4.6 at 44.5%, vision-based + persistent browser) and +26.6 points over base GPT-5.4 (33.5%, xy-coordinate prediction + persistent browser).
- Small models + reusable tools punch above their weight. Generated scripts can be packaged as parameterized CLI tools — even Qwen-3.5-9B completes tasks well on Online-Mind2Web sites when 5+ tools are pre-available. This is the leverage flywheel: a strong frontier model writes the script once, then the script becomes a tool any smaller model can invoke.
- Lightweight by design. Core agent loop in a single ~450-line file, Playwright environment in ~570 lines, CLI in ~150 lines, model backends (OpenAI / Anthropic / OpenRouter) each ~150-200 lines. Zero hidden frameworks, just `httpx`, `pydantic`, `playwright`, `typer`. The entire repo is ~1.5k LoC. Engineered for forking and reading, not abstracting.
- Workspace-as-state, not browser-as-state. Every run writes trajectories and screenshots to disk for inspection. The agent can write exploratory scripts, spawn fresh browser sessions, and decide when to capture screenshots and inspect failures, the way a human engineer iterating on an RPA script behaves.
- Ships as a plugin to four hosts from one skill folder. `skills/webwright/` loads across Claude Code (`.claude-plugin/plugin.json`), OpenAI Codex (`.codex-plugin/plugin.json`), OpenClaw, and Hermes Agent (agentskills.io compatible; symlink the folder, no manifest needed). The host drives the loop natively; no extra LLM API key or cost beyond the host subscription. Hosts that read PNG screenshots natively skip the OpenAI-backed `image_qa` / `self_reflection` tools.
- Two slash-command modes in Claude Code / Codex. `/webwright:run <task>` produces a one-shot `final_script.py` for the literal task values. `/webwright:craft <task>` produces a reusable CLI tool: `final_script.py` becomes a parameterized function with a Google-style `Args:` docstring and an `argparse` wrapper whose flags default to the concrete task values, so the same script can be rerun later with different arguments (e.g., different dates, different origins). The `:craft` mode is the script-becomes-tool flywheel in operator-callable form.
- Task2UI mode (2026-05-11). Webwright completes the task and renders task results into an HTML-based web app that’s easily viewed and reused. The `assets/task_showcase/` Flask app consolidates runs for repeatable odyssey tasks (deals, inventory, listings, job boards, weather) into a single dashboard; each task ships only `task.json` (metadata) + `report.json` (curated structured output), and templates render generically.
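The `:craft` output shape described in the slash-command bullet above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not actual Webwright output: the function name `search_flights`, its parameters, and the flag names are hypothetical, the defaults mirror the flight example used in this note, and the function body is a placeholder where the generated Playwright steps would go.

```python
import argparse


def search_flights(origin: str, destination: str, depart: str, ret: str) -> dict:
    """Search round-trip flights (placeholder body).

    Args:
        origin: Departure airport code.
        destination: Arrival airport code.
        depart: Outbound date, YYYY-MM-DD.
        ret: Return date, YYYY-MM-DD.

    Returns:
        A dict echoing the query; a real script would drive the browser here.
    """
    # Hypothetical stand-in for the generated Playwright steps.
    return {"origin": origin, "destination": destination,
            "depart": depart, "return": ret}


def build_parser() -> argparse.ArgumentParser:
    """Expose each parameter as a flag whose default is the original task value."""
    summary = search_flights.__doc__.splitlines()[0]
    parser = argparse.ArgumentParser(description=summary)
    parser.add_argument("--origin", default="SEA")
    parser.add_argument("--destination", default="JFK")
    parser.add_argument("--depart", default="2026-08-15")
    parser.add_argument("--ret", default="2026-08-20")
    return parser


def main(argv=None) -> dict:
    """Parse flags (or an explicit argv list) and run the parameterized task."""
    args = build_parser().parse_args(argv)
    return search_flights(args.origin, args.destination, args.depart, args.ret)
```

Calling `main([])` replays the original task, while `main(["--origin", "LAX"])` reruns it with one value changed. That is the property that lets a smaller model invoke the tool without rewriting the script.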
Architectural comparison (from README)
| | Stagehand (Browserbase) | agent-browser (Vercel) | browser-use | Webwright |
|---|---|---|---|---|
| Paradigm | Hybrid: code + NL primitives (act / extract / agent) | CLI tool that another agent (Claude Code, Codex, etc.) calls | Autonomous LLM agent loop over DOM/AX snapshots | Coding agent with a terminal; browser is just an environment it spawns |
| Action space | Playwright code, or NL → LLM-translated Playwright | Discrete subcommands (open, click @e2, snapshot, eval) | Indexed click/type actions selected by the LLM | Free-form Python (writes Playwright scripts itself) |
| What is “state”? | The browser session | The browser session (held by daemon across CLI calls) | The browser session | The local workspace — code, screenshots, logs. Browser is disposable. |
| Loop shape | Imperative; agent() does multi-step when needed | One CLI invocation per micro-step | observe → predict next action → execute → repeat | write code → execute → inspect screenshots → repair (code-as-action) |
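The Webwright loop shape in the last row can be sketched with the standard library alone. This is a toy under loud assumptions, not the code in `agents/default.py`: `stub_model` is a hypothetical stand-in for the LLM, the "inspection" step reads the process exit code and stderr instead of screenshots, and the file names `attempt_<n>.py` / `attempt_<n>.log` are invented for illustration. What it does show is that every attempt leaves code and logs in the workspace while the execution environment is thrown away each round.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Optional


def stub_model(task: str, feedback: Optional[str]) -> str:
    """Hypothetical LLM stand-in: buggy first draft, repaired once it sees feedback."""
    if feedback is None:
        return "print(undefined_name)\n"   # first draft raises NameError
    return f"print({task!r})\n"            # repaired draft


def run_task(task: str, workspace: Path, max_attempts: int = 3) -> str:
    """Write code, execute it, inspect the result, and repair until it succeeds."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        script = workspace / f"attempt_{attempt}.py"
        script.write_text(stub_model(task, feedback))            # write code
        proc = subprocess.run([sys.executable, str(script)],     # execute in a fresh process
                              capture_output=True, text=True, timeout=60)
        log = workspace / f"attempt_{attempt}.log"
        log.write_text(proc.stdout + proc.stderr)                # persist evidence on disk
        if proc.returncode == 0:                                 # inspect
            return proc.stdout.strip()
        feedback = proc.stderr                                   # feed the failure into the repair
    raise RuntimeError(f"no working script after {max_attempts} attempts")


with tempfile.TemporaryDirectory() as tmp:
    print(run_task("hello", Path(tmp)))
```

In the other three columns the long-lived object is the browser session. Here the only things that survive an iteration are files in `workspace`.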
Where this fits
| Topic surface | Description | Relationship |
|---|---|---|
| Browserbase Autobrowse | Managed, ethics-curated browser-use platform | Different trust model — Browserbase ships managed runtime; Webwright is OSS you self-host |
| TinyFish | Managed, full-Chromium agent platform | Different trust model — same self-hosted vs managed split |
| CloakBrowser | Stealth Chromium with fingerprint patches | Different problem framing — CloakBrowser optimizes for detection-evasion; Webwright optimizes for reusable workflows |
| Hermes Agent topic landing | Nous Research stateful agent | Direct integration — Webwright loads as a Hermes skill from the shared skills/webwright/ folder |
| Tool, Skill, or Subagent? (Will, Applied AI) | SWE-style decomposition methodology | Same philosophy — code-as-source-of-truth applied to browser automation |
| Managed Agents Self-Hosted Sandboxes | Anthropic’s self-hosted agent execution | Adjacent surface — Webwright is browser-specific self-hosted; Managed Agents is the general case |
| Claude Code skills overview | Skill format Webwright ships in | Plugin host — Webwright is one of the more substantive single-skill installs |
Project map (from README)
webwright/
├── pyproject.toml              # package: webwright
├── src/webwright/
│   ├── run/cli.py              # CLI entrypoint (`webwright`)
│   ├── agents/default.py       # core agent loop (~450 LoC)
│   ├── environments/           # Playwright browser workspace (~570 LoC)
│   ├── tools/                  # image_qa, self_reflection
│   ├── models/                 # openai_model, anthropic_model, base (each ~150-200 LoC)
│   ├── config/                 # base.yaml, model_openai.yaml, model_claude.yaml
│   └── utils/
├── assets/
│   └── task_showcase/          # Flask dashboard for repeatable runs
│       ├── app.py
│       ├── templates/          # dashboard.html, task.html
│       └── tasks/<task-id>/    # task.json + report.json per task
├── tests/
└── outputs/                    # run artifacts (trajectories, screenshots)
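The `tasks/<task-id>/` layout above, with one `task.json` and one `report.json` per folder, suggests a loader along these lines. This is a hedged sketch, not the contents of `app.py`: the real JSON schema is not given in this note, so the code deliberately assumes no field names and renders whatever structure it finds. The function names `load_tasks` and `render_generic` are hypothetical.

```python
import json
from pathlib import Path


def load_tasks(tasks_dir: Path) -> dict:
    """Map task-id -> {"task": ..., "report": ...} for every complete task folder."""
    tasks = {}
    for folder in sorted(p for p in tasks_dir.iterdir() if p.is_dir()):
        task_file = folder / "task.json"
        report_file = folder / "report.json"
        if task_file.is_file() and report_file.is_file():
            tasks[folder.name] = {
                "task": json.loads(task_file.read_text(encoding="utf-8")),
                "report": json.loads(report_file.read_text(encoding="utf-8")),
            }
    return tasks


def render_generic(value, indent: int = 0) -> list:
    """Render arbitrary parsed JSON as indented text lines, schema-agnostic."""
    pad = "  " * indent
    lines = []
    if isinstance(value, dict):
        for key, item in value.items():
            if isinstance(item, (dict, list)):
                lines.append(f"{pad}{key}:")
                lines.extend(render_generic(item, indent + 1))
            else:
                lines.append(f"{pad}{key}: {item}")
    elif isinstance(value, list):
        for item in value:
            if isinstance(item, (dict, list)):
                lines.append(f"{pad}-")
                lines.extend(render_generic(item, indent + 1))
            else:
                lines.append(f"{pad}- {item}")
    else:
        lines.append(f"{pad}{value}")
    return lines
```

Because nothing here depends on specific keys, a new task folder shows up without any per-task rendering code, which is one plausible reading of "templates render generically."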
Try It
- Run as a CLI directly. Python 3.10+, Chromium via Playwright, set `OPENAI_API_KEY` or `ANTHROPIC_API_KEY`, then `webwright -t "Search for flights from SEA to JFK on 2026-08-15 to 2026-08-20" --start-url <initial page> --task-id <output subfolder> -o <output dir>`. Stack configs with `-c`.
- Install as a Claude Code plugin. `/plugin marketplace add microsoft/Webwright` then `/plugin install webwright@webwright`. Restart Claude Code. Either ask in plain English (the skill auto-activates) or use `/webwright:run <task>` for one-shot or `/webwright:craft <task>` for a reusable parameterized CLI tool. Each run scaffolds a workspace with `plan.md`, instrumented Playwright scripts under `final_runs/run_<id>/`, and self-verifies against saved screenshots.
- Install as a Codex plugin. `codex plugin marketplace add microsoft/Webwright`, restart Codex, invoke with `@webwright <task>` in a new thread.
- Install as an OpenClaw plugin. From a local checkout, with path/archive/npm-spec/git-repo/`clawhub:` install paths. Then `openclaw plugins list` should show `webwright | ... | ✓ ready`. Invoke via natural language or `/webwright run` and friends.
- Install as a Hermes Agent skill. Symlink `skills/webwright/` into the Hermes user-skills directory; no Hermes-specific manifest is needed (only `SKILL.md` is loaded). Start `hermes` and ask it to drive a web task. The named subcommands (`/webwright:run`, `/webwright:craft`) are inert in Hermes; plain English works end-to-end.
- Reproduce one of the SOTA benchmarks. The Online-Mind2Web AutoEval suite is the highest-leverage falsification target. If Webwright’s 86.7% with GPT-5.4 reproduces on your model+infra, the code-as-action thesis holds beyond MSR’s lab.
- Inspect the comparison-table claims. Stagehand, agent-browser, browser-use are all sitting beside Webwright in real production today. Pick one task the README implies Webwright wins on, run it through your existing harness, then run it through Webwright — the operator delta is the load-bearing claim worth testing.
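For scripted or batch runs, the direct-CLI invocation from the first bullet above can be assembled as an argument list. A minimal sketch using only the flags quoted in this note (`-t`, `--start-url`, `--task-id`, `-o`, `-c`); the helper name `build_webwright_argv` is hypothetical, and how `-c` stacks is an assumption here (repeating the flag once per config), so check `webwright --help` before relying on it.

```python
import shlex
from typing import List, Sequence


def build_webwright_argv(task: str, start_url: str, task_id: str,
                         output_dir: str, configs: Sequence[str] = ()) -> List[str]:
    """Build the argv list for one `webwright` run without executing it."""
    argv = ["webwright", "-t", task, "--start-url", start_url,
            "--task-id", task_id, "-o", output_dir]
    for config in configs:
        # ASSUMPTION: "stack configs with -c" means one -c per config file.
        argv += ["-c", config]
    return argv


argv = build_webwright_argv(
    task="Search for flights from SEA to JFK on 2026-08-15 to 2026-08-20",
    start_url="https://example.com",   # placeholder start page
    task_id="flights-demo",            # hypothetical output subfolder
    output_dir="outputs",
    configs=["base.yaml", "model_openai.yaml"],
)
print(shlex.join(argv))  # shell-safe command line for logging or copy-paste
```

Passing the list to `subprocess.run(argv)` avoids shell quoting problems with the free-text task string; `shlex.join` is only used here to display the command.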
Open Questions
- Reproducing 86.7% on Online-Mind2Web requires GPT-5.4. Does the score fall off for Sonnet 4.6 / Haiku 4.5 / Qwen2-Coder / Gemini 2.5 Flash? The README only spotlights GPT-5.4 and Opus 4.7. Model coverage matters for operators choosing a backend on cost.
- Task2UI’s output schema. `task.json` + `report.json` are described as “curated, structured output: sources + result sections like tables, lists, summaries” but the actual JSON schema isn’t in the README. Worth a follow-up read of `assets/task_showcase/README.md`.
- `/webwright:craft` parameterization heuristics. How does the model decide which task values become CLI flags vs hardcoded? Are there cases where the wrong field gets parameterized (e.g., a destination city becomes a flag but the date doesn’t)? Operator failure-mode worth knowing.
- Cross-host parity. Slash commands `/webwright:run` and `/webwright:craft` are a Claude Code / Codex convention; the README says they’re “inert in Hermes.” Are they also inert in OpenClaw, or does OpenClaw’s slash-command system pick them up via `skills/webwright/commands/`? README is ambiguous.
- Image-QA tool routing on hosts that natively read screenshots. README says hosts that read PNG natively “skip the OpenAI-backed `image_qa` / `self_reflection` tools.” Does the host’s vision quality match what GPT-4-Vision-style `image_qa` would have surfaced? Or is there a real-world regression?
- Browser-detection signature. Webwright drives Playwright with Chromium: does it carry the standard automation-detected fingerprint, or has MSR modified it? Relevant for any cross-comparison with CloakBrowser’s stealth approach.
- Microsoft roadmap. This is MSR. Is Webwright a research prototype, a productization preview, or a long-supported open-source surface? Commit cadence and the 2026-05-04 → 2026-05-11 release tempo suggest active work, but MSR projects sometimes stall after the paper ships.
- License vs Microsoft policy. Repo is MIT. Microsoft has become more careful about the terms attached to its OSS contributions; check whether the MIT license is genuinely standalone or constrained by a parent CLA / patent-grant pattern.
Related
- Agents & Agentic Systems topic landing
- Browserbase Autobrowse: closest peer in the managed-browser-agent space; different trust model
- TinyFish: full-Chromium managed agent platform
- CloakBrowser: stealth Chromium with fingerprint patches; different problem framing
- Hermes Agent topic landing: Webwright ships a Hermes-compatible skill via the shared `skills/webwright/` folder
- Hermes Skill Bundles: convention reference for the Webwright Hermes integration
- Tool, Skill, or Subagent? (Will, Applied AI): same SWE-style code-as-source-of-truth philosophy
- Claude Code skills overview: Webwright is a substantive single-skill install
- Managed Agents Self-Hosted Sandboxes: adjacent self-hosted-agent-runtime surface
- Claude Code plugins and marketplaces: install context for `/plugin marketplace add microsoft/Webwright`