Microsoft Webwright — Coding-Agent-with-a-Terminal Browser Framework (MSR, MIT, ~1.5k LoC, SOTA on Online-Mind2Web + Odysseys)

Source: raw/gh-star-microsoft-webwright.md (gh-star metadata, fetched 2026-05-26) + raw/x-bookmark-2059026191646945515.md (X bookmark by @mr_r0b0t, 2,084 likes at capture) + ai-research/microsoft-webwright-github-readme-2026-05-27.md (README refresh, fetched 2026-05-27). Repo: github.com/microsoft/Webwright. Homepage: microsoft.github.io/Webwright. MSR blog: Webwright: A Terminal Is All You Need For Web Agents. Authors: Lu, Yadong; Xu, Lingrui; Huang, Chao; Awadallah, Ahmed (Microsoft Research). Stars: 143 (correcting the 1,106 figure captured at first ingest — see refresh note below). Forks: 7. License: MIT. Languages: Python 91.2%, HTML 8.7%, Shell 0.1%. Codebase: ~1.5k LoC total.

Microsoft Research browser-agent framework with a load-bearing inversion: separate the agent from the browser, and treat the browser as something the agent can launch, inspect, and discard while developing a program. The persistent artifact is the code and logs in the local workspace, not the browser session. Ships SOTA results on two real-website benchmarks (Online-Mind2Web 86.7% with GPT-5.4; Odysseys 60.1% with GPT-5.4, +15.6 points over the prior SOTA set by Opus 4.6), and now installs as a first-class plugin in Claude Code, OpenAI Codex, OpenClaw, and Hermes Agent from one shared skills/webwright/ directory.

Refresh note (2026-05-27)

The first ingest on 2026-05-26 captured stars: 1106 from the gh-stars puller; the current GitHub count is 143 stars and 7 forks. The 1,106 figure almost certainly came from the puller reading the wrong field. This refresh corrects the count and adds the SOTA benchmark numbers, architecture details, four-host plugin model, MSR blog citation, and the comparison table, all of which were absent from the first ingest because the gh-star metadata alone was thin.

Key Takeaways

Microsoft Research-published, not just Microsoft-org. Authors are MSR (Lu, Xu, Huang, Awadallah); the MSR blog post explicitly frames Webwright as a research artifact. That combined first-party and research-org maintainer signal is the main prior distinguishing it from browser agents such as Browserbase Autobrowse, TinyFish, and CloakBrowser.
Code-as-action inversion is the load-bearing claim. Traditional browser agents predict one action per step (click, type, eval) and treat the browser session as state. Webwright instead has the model write free-form Python that drives Playwright, with the workspace as state and the browser as a disposable environment. “Write code → execute → inspect screenshots → repair” replaces “observe → predict → execute → repeat.” This is the same SWE-style decomposition philosophy as Tool, Skill, or Subagent? (Will, Applied AI) applied to web automation.
SOTA results with concrete numbers (100-step budget): Online-Mind2Web 86.7% with GPT-5.4 — highest among open-sourced harnesses in the AutoEval category; Claude Opus 4.7 reaches 84.7% and is stronger on the hard split (80.5% vs 76.6% for GPT-5.4 at N=100). Odysseys 60.1% with GPT-5.4 (avg. 76.1 steps) — +15.6 points over prior SOTA (Opus 4.6 at 44.5%, vision-based + persistent browser) and +26.6 points over base GPT-5.4 (33.5%, xy-coordinate prediction + persistent browser).
Small models + reusable tools punch above their weight. Generated scripts can be packaged as parameterized CLI tools — even Qwen-3.5-9B completes tasks well on Online-Mind2Web sites when 5+ tools are pre-available. This is the leverage flywheel: a strong frontier model writes the script once, then the script becomes a tool any smaller model can invoke.
Lightweight by design. Core agent loop in a single ~450-line file, Playwright environment in ~570 lines, CLI in ~150 lines, model backends (OpenAI / Anthropic / OpenRouter) each ~150-200 lines. Zero hidden frameworks — just httpx, pydantic, playwright, typer. The entire repo is ~1.5k LoC. Engineered for forking and reading, not abstracting.
Workspace-as-state, not browser-as-state. Every run writes trajectories and screenshots to disk for inspection. The agent can write exploratory scripts, spawn fresh browser sessions, and decide when to capture screenshots and inspect failures — the way a human engineer iterating on an RPA script behaves.
Ships as a plugin to four hosts from one skill folder. skills/webwright/ loads across Claude Code (.claude-plugin/plugin.json), OpenAI Codex (.codex-plugin/plugin.json), OpenClaw, and Hermes Agent (agentskills.io compatible — symlink the folder, no manifest needed). Host drives the loop natively; no extra LLM API key or cost beyond the host subscription. Hosts that read PNG screenshots natively skip the OpenAI-backed image_qa / self_reflection tools.
Two slash-command modes in Claude Code / Codex. /webwright:run <task> produces a one-shot final_script.py for the literal task values. /webwright:craft <task> produces a reusable CLI tool — final_script.py becomes a parameterized function with a Google-style Args: docstring and argparse wrapper whose flags default to the concrete task values, so the same script can be rerun later with different arguments (e.g., different dates, different origins). The :craft mode is the script-becomes-tool flywheel in operator-callable form.
Task2UI mode (2026-05-11). Webwright completes the task and renders task results into an HTML-based web app that’s easily viewed and reused. The assets/task_showcase/ Flask app consolidates runs for repeatable odyssey tasks (deals, inventory, listings, job boards, weather) into a single dashboard — each task ships only task.json (metadata) + report.json (curated structured output) and templates render generically.
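The :craft mode described in the list above (a parameterized function with a Google-style Args: docstring and an argparse wrapper whose flags default to the task values) could look roughly like the sketch below. This shows the shape only and is not Webwright's actual generated output: the function name `search_flights`, its parameters, the flag names, and the returned dict are all hypothetical, and the Playwright body is replaced by a placeholder so the example runs without a browser.

```python
import argparse
from typing import Optional, Sequence


def search_flights(origin: str = "SEA", destination: str = "JFK",
                   depart_date: str = "2026-08-15",
                   return_date: str = "2026-08-20") -> dict:
    """Search for round-trip flights (hypothetical example task).

    Args:
        origin: Departure airport code.
        destination: Arrival airport code.
        depart_date: Outbound date, YYYY-MM-DD.
        return_date: Return date, YYYY-MM-DD.

    Returns:
        A dict echoing the query parameters.
    """
    # Placeholder: a real crafted script would drive Playwright here.
    return {"origin": origin, "destination": destination,
            "depart_date": depart_date, "return_date": return_date}


def build_parser() -> argparse.ArgumentParser:
    # Flag defaults mirror the concrete values of the original task, so
    # running with no arguments reproduces the one-shot run.
    parser = argparse.ArgumentParser(description="Hypothetical crafted tool")
    parser.add_argument("--origin", default="SEA")
    parser.add_argument("--destination", default="JFK")
    parser.add_argument("--depart-date", default="2026-08-15")
    parser.add_argument("--return-date", default="2026-08-20")
    return parser


def main(argv: Optional[Sequence[str]] = None) -> dict:
    args = build_parser().parse_args(argv)
    return search_flights(args.origin, args.destination,
                          args.depart_date, args.return_date)
```

Calling `main([])` reruns the original task values, while `main(["--origin", "SFO"])` reuses the same script for a different origin, which is the reuse path the :craft mode is meant to enable.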

Architectural comparison (from README)

	Stagehand (Browserbase)	agent-browser (Vercel)	browser-use	Webwright
Paradigm	Hybrid: code + NL primitives (`act` / `extract` / `agent`)	CLI tool that another agent (Claude Code, Codex, etc.) calls	Autonomous LLM agent loop over DOM/AX snapshots	Coding agent with a terminal; browser is just an environment it spawns
Action space	Playwright code, or NL → LLM-translated Playwright	Discrete subcommands (`open`, `click @e2`, `snapshot`, `eval`)	Indexed click/type actions selected by the LLM	Free-form Python (writes Playwright scripts itself)
What is “state”?	The browser session	The browser session (held by daemon across CLI calls)	The browser session	The local workspace — code, screenshots, logs. Browser is disposable.
Loop shape	Imperative; `agent()` does multi-step when needed	One CLI invocation per micro-step	observe → predict next action → execute → repeat	write code → execute → inspect screenshots → repair (code-as-action)
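
The loop shape in the Webwright column (write code, execute, inspect, repair) can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions, not Webwright's agent loop: `write_code` is a hypothetical stand-in for the model call, the file names `attempt_N.py` and `attempt_N.log` are invented, and captured stderr stands in for the screenshot inspection step so the example needs no browser.

```python
import subprocess
import sys
from pathlib import Path
from typing import Callable, Optional


def run_script(path: Path, timeout: float = 60.0) -> subprocess.CompletedProcess:
    # Execute the generated script in a fresh interpreter and capture output.
    return subprocess.run([sys.executable, str(path)],
                          capture_output=True, text=True, timeout=timeout)


def code_as_action_loop(write_code: Callable[[str, Optional[str]], str],
                        task: str, workspace: Path,
                        max_attempts: int = 3) -> Optional[Path]:
    """Write a script, run it, and feed failures back for repair.

    write_code(task, feedback) is a hypothetical model call that returns
    Python source; feedback is None on the first attempt.
    """
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        script = workspace / f"attempt_{attempt}.py"
        script.write_text(write_code(task, feedback), encoding="utf-8")
        result = run_script(script)
        # The workspace, not the process, holds the state: every attempt
        # leaves its source and its log on disk.
        log = workspace / f"attempt_{attempt}.log"
        log.write_text(result.stdout + result.stderr, encoding="utf-8")
        if result.returncode == 0:
            return script
        feedback = result.stderr  # inspected output drives the repair
    return None
```

Each attempt starts a fresh process, mirroring the disposable-browser idea: nothing carries over between attempts except what was written to the workspace.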

Where this fits

Topic surface	Description	Relationship
Browserbase Autobrowse	Managed, ethics-curated browser-use platform	Different trust model — Browserbase ships managed runtime; Webwright is OSS you self-host
TinyFish	Managed, full-Chromium agent platform	Different trust model — same self-hosted vs managed split
CloakBrowser	Stealth Chromium with fingerprint patches	Different problem framing — CloakBrowser optimizes for detection-evasion; Webwright optimizes for reusable workflows
Hermes Agent topic landing	Nous Research stateful agent	Direct integration — Webwright loads as a Hermes skill from the shared `skills/webwright/` folder
Tool, Skill, or Subagent? (Will, Applied AI)	SWE-style decomposition methodology	Same philosophy — code-as-source-of-truth applied to browser automation
Managed Agents Self-Hosted Sandboxes	Anthropic’s self-hosted agent execution	Adjacent surface — Webwright is browser-specific self-hosted; Managed Agents is the general case
Claude Code skills overview	Skill format Webwright ships in	Plugin host — Webwright is one of the more substantive single-skill installs

Project map (from README)

webwright/
├── pyproject.toml                # package: webwright
├── src/webwright/
│   ├── run/cli.py                # CLI entrypoint (`webwright`)
│   ├── agents/default.py         # core agent loop (~450 LoC)
│   ├── environments/             # Playwright browser workspace (~570 LoC)
│   ├── tools/                    # image_qa, self_reflection
│   ├── models/                   # openai_model, anthropic_model, base (each ~150-200 LoC)
│   ├── config/                   # base.yaml, model_openai.yaml, model_claude.yaml
│   └── utils/
├── assets/
│   └── task_showcase/            # Flask dashboard for repeatable runs
│       ├── app.py
│       ├── templates/            # dashboard.html, task.html
│       └── tasks/<task-id>/      # task.json + report.json per task
├── tests/
└── outputs/                      # run artifacts (trajectories, screenshots)
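
The tasks/<task-id>/ layout in the map (one task.json plus one report.json per task) lends itself to generic loading. The sketch below is an assumption about how such a folder could be read, not the code in assets/task_showcase/app.py: the function name `load_tasks` is hypothetical, and it deliberately treats both files as opaque JSON because their schema is not documented here.

```python
import json
from pathlib import Path


def load_tasks(tasks_dir: Path) -> dict:
    """Map task-id to its parsed task.json and report.json.

    Folders missing either file are skipped rather than treated as errors.
    """
    loaded = {}
    for folder in sorted(p for p in tasks_dir.iterdir() if p.is_dir()):
        task_file = folder / "task.json"
        report_file = folder / "report.json"
        if not (task_file.is_file() and report_file.is_file()):
            continue
        loaded[folder.name] = {
            "task": json.loads(task_file.read_text(encoding="utf-8")),
            "report": json.loads(report_file.read_text(encoding="utf-8")),
        }
    return loaded
```

Because nothing in the loader depends on task-specific fields, adding a new task is just a matter of dropping a new folder with the two files, which matches the "templates render generically" description.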

Try It

Run as a CLI directly. Requires Python 3.10+ and Chromium via Playwright. Set OPENAI_API_KEY or ANTHROPIC_API_KEY, then run webwright -t "Search for flights from SEA to JFK on 2026-08-15 to 2026-08-20" --start-url <initial page> --task-id <output subfolder> -o <output dir>. Stack configs with -c.
Install as a Claude Code plugin. /plugin marketplace add microsoft/Webwright then /plugin install webwright@webwright. Restart Claude Code. Either ask in plain English (the skill auto-activates) or use /webwright:run <task> for one-shot or /webwright:craft <task> for a reusable parameterized CLI tool. Each run scaffolds a workspace with plan.md, instrumented Playwright scripts under final_runs/run_<id>/, and self-verifies against saved screenshots.
Install as a Codex plugin. codex plugin marketplace add microsoft/Webwright, restart Codex, invoke with @webwright <task> in a new thread.
Install as an OpenClaw plugin. Install from a local checkout; path, archive, npm-spec, git-repo, and clawhub: install paths are also listed. Afterwards, openclaw plugins list should show webwright | ... | ✓ ready. Invoke via natural language or /webwright run and the related subcommands.
Install as a Hermes Agent skill. Symlink skills/webwright/ into the Hermes user-skills directory — no Hermes-specific manifest needed (only SKILL.md is loaded). Start hermes and ask it to drive a web task. The named subcommands (/webwright:run, /webwright:craft) are inert in Hermes; plain English works end-to-end.
Reproduce one of the SOTA benchmarks. The Online-Mind2Web AutoEval suite is the highest-leverage falsification target. If Webwright’s 86.7% with GPT-5.4 reproduces on your model+infra, the code-as-action thesis holds beyond MSR’s lab.
Inspect the comparison-table claims. Stagehand, agent-browser, browser-use are all sitting beside Webwright in real production today. Pick one task the README implies Webwright wins on, run it through your existing harness, then run it through Webwright — the operator delta is the load-bearing claim worth testing.
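
For scripted or repeated runs of the direct CLI path in the first item above, the invocation can be assembled programmatically. The flags `-t`, `--start-url`, `--task-id`, `-o`, and `-c` come from the command quoted above; the helper name `build_webwright_argv`, the example URL and paths, and the assumption that stacking configs means repeating `-c` once per file are illustrative only.

```python
import shlex
from typing import List, Sequence


def build_webwright_argv(task: str, start_url: str, task_id: str,
                         output_dir: str,
                         configs: Sequence[str] = ()) -> List[str]:
    """Build the argument vector for one webwright CLI run."""
    argv = ["webwright", "-t", task, "--start-url", start_url,
            "--task-id", task_id, "-o", output_dir]
    for cfg in configs:
        # Assumption: configs are stacked by repeating -c per file.
        argv += ["-c", cfg]
    return argv


argv = build_webwright_argv(
    task="Search for flights from SEA to JFK on 2026-08-15 to 2026-08-20",
    start_url="https://example.com",   # hypothetical start page
    task_id="flights-demo",            # hypothetical output subfolder
    output_dir="outputs",
)
# shlex.join quotes the task string so the line can be pasted into a shell.
command_line = shlex.join(argv)
```

Passing `argv` to `subprocess.run` as a list avoids shell quoting problems with the free-text task description.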

Open Questions

Reproducing 86.7% on Online-Mind2Web requires GPT-5.4. Does the score fall off for Sonnet 4.6 / Haiku 4.5 / Qwen2-Coder / Gemini 2.5 Flash? The README only spotlights GPT-5.4 and Opus 4.7. Model coverage matters for operators choosing a backend on cost.
Task2UI’s output schema. task.json + report.json are described as “curated, structured output: sources + result sections like tables, lists, summaries” but the actual JSON schema isn’t in the README. Worth a follow-up read of assets/task_showcase/README.md.
/webwright:craft parameterization heuristics. How does the model decide which task values become CLI flags vs hardcoded? Are there cases where the wrong field gets parameterized (e.g., a destination city becomes a flag but the date doesn’t)? Operator failure-mode worth knowing.
Cross-host parity. Slash commands /webwright:run and /webwright:craft are a Claude Code / Codex convention; the README says they’re “inert in Hermes.” Are they also inert in OpenClaw, or does OpenClaw’s slash-command system pick them up via skills/webwright/commands/? README is ambiguous.
Image-QA tool routing on hosts that natively read screenshots. README says hosts that read PNG natively “skip the OpenAI-backed image_qa / self_reflection tools.” Does the host’s vision quality match what GPT-4-Vision-style image_qa would have surfaced? Or is there a real-world regression?
Browser-detection signature. Webwright drives Playwright with Chromium — does it carry the standard automation-detected fingerprint, or has MSR modified it? Relevant for any cross-comparison with CloakBrowser’s stealth approach.
Microsoft roadmap. This is MSR. Is Webwright a research prototype, a productization preview, or a long-supported open-source surface? Commit cadence and the 2026-05-04 → 2026-05-11 release tempo suggest active work, but MSR projects sometimes stall after the paper ships.
License vs Microsoft policy. Repo is MIT. Microsoft has become increasingly careful about the terms attached to its OSS contributions; check whether the MIT license is genuinely standalone or constrained by a parent CLA / patent-grant pattern.
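
One way to start on the browser-detection question above is to read a few properties that pages commonly inspect. The sketch below is a starting point under stated assumptions, not a statement about what Webwright configures: the helper names are hypothetical, the choice of signals is illustrative rather than a full fingerprint, and `probe_chromium` launches stock Playwright Chromium rather than Webwright's own environment. The Playwright import is lazy so the pure helper can be exercised without Playwright installed.

```python
from typing import Any, Callable


def automation_signals(evaluate: Callable[[str], Any]) -> dict:
    """Collect a few automation hints via a page.evaluate-style callable."""
    user_agent = str(evaluate("() => navigator.userAgent"))
    return {
        "webdriver": bool(evaluate("() => navigator.webdriver")),
        "user_agent": user_agent,
        "headless_ua": "HeadlessChrome" in user_agent,
    }


def probe_chromium(url: str = "about:blank") -> dict:
    # Imported lazily so automation_signals stays usable without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url)
            return automation_signals(page.evaluate)
        finally:
            browser.close()
```

Running the same probe inside a Webwright-spawned session and comparing the two dicts would show whether the defaults have been changed.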

Related

Agents & Agentic Systems topic landing
Browserbase Autobrowse — closest peer in the managed-browser-agent space; different trust model
TinyFish — full-Chromium managed agent platform
CloakBrowser — stealth Chromium with fingerprint patches; different problem framing
Hermes Agent topic landing — Webwright ships a Hermes-compatible skill via shared skills/webwright/ folder
Hermes Skill Bundles — convention reference for the Webwright Hermes integration (internal; see the Hermes Agent topic landing above)
Tool, Skill, or Subagent? (Will, Applied AI) — same SWE-style code-as-source-of-truth philosophy
Claude Code skills overview — Webwright is a substantive single-skill install
Managed Agents Self-Hosted Sandboxes — adjacent self-hosted-agent-runtime surface
Claude Code plugins and marketplaces — install context for /plugin marketplace add microsoft/Webwright
