Tool use, planning, multi-agent patterns, agent frameworks, and practical agent deployments. Covers both Claude-specific agent features and general agentic architecture patterns.

Articles

  • Canva AI 2.0 — Building a Production Agentic System with Claude (Danny Wu) — First-party production case study (Code with Claude Tokyo 2026). Four reusable lessons: define success as steerability + latency not one-shot quality (avg design edited ~110×); the harness is disposable but evals are the durable asset (rewrote the harness 3× in 3 years); cost at scale via token budgets + real tool-cost tracking + a cache-preserving Sonnet-orchestrator → Opus/Haiku routing (~half the cost, high-80s/90s cache hit); and a feedback-to-evals loop that folds user complaints back into the eval set. Companion to the Tokyo digest.

  • Maintain the Harness, Don’t Pile On Tools (Nate B Jones) — The next phase of agent work is maintenance, not construction: the durable, ownable layer is the harness (workbench) around the model — what it reads, remembers, can touch, must prove, and what stops it. Anchor claim: Vercel made its sales-inbox agent better by deleting ~80% of its tools. Introduces the falsifiable framing that agents break in two directions (world-drift and model-improvement — a guardrail that protected you from a clumsy model can trap a better one) and a 5-point maintenance audit (what’s it eating / test its reach / check its job / check the proof / check the value). Productive tension with Canva’s “harness is disposable, evals are durable.” Same author as the Production Class Ladder.

  • Claude Agent Hierarchy — When to Use Which — Comparison of Claude’s three agent tiers (Managed Agents, Agent Teams, Subagents) with decision framework for choosing the right one.

  • Agent Workflow Patterns — Sequential, Parallel, Evaluator-Optimizer — Anthropic’s official taxonomy of the three workflow shapes that keep showing up in production, plus the decision framework. Default to sequential. “Start with the simplest pattern that solves your problem.”

  • AI Agents Unleashed — 2026 Playbook (Mindstream × Futurepedia) — Platform-agnostic implementation guide: chatbot-vs-agent reframe, precision framework, “Is this an agent job?” decision tree, 4-phase roadmap, 7 pitfalls, human-AI relationship timeline, 7 training competencies. Authors: Adam Biddlecombe + Kevin Hutson.

  • Nous Research Hermes Agent — Self-hosted autonomous agent with persistent memory, auto-generated skills, 47+ tools, 6 sandbox backends, 15+ messaging platforms, and MCP integration. Model-agnostic (Nous Portal, OpenRouter, OpenAI, or any OAI-compatible endpoint). MIT, 97K+ stars.

  • Adaline — End-to-End AI Agent Platform — Single platform for the four-stage agent lifecycle: iterate, evaluate, deploy, monitor. Provider-agnostic prompt management, multi-modal + dynamic-variable testing, AI-assisted test-suite generation, multi-environment deployments with smart diffing and instant rollbacks, full traces/spans, human-annotation loop tied directly to monitoring. Recently went GA with $1MM API-credit promotion. Customers: McKinsey (Lilli), Discord, Coframe, Reforge. Stats claimed: 200M+ API calls/day, 5B+ tokens/day, 300+ models, 99.998% uptime. Sits alongside LangSmith / PromptLayer / Helicone / Braintrust / Galileo in the LLMOps space.

  • TinyFish — Web Infrastructure APIs for AI Agents — Four-product platform under one API key: Search, Fetch, Browser, Agent. Search + Fetch went free May 4 2026 across REST / MCP / SDKs / CLI / Skill (free-tier 5 q/min Search, 25 URLs/min Fetch). Custom Chromium fleet with 28 C++-level anti-bot mechanisms; sub-250ms browser cold start, P50 488ms search. Vendor-reported 87% token reduction and 2× completion rate when using CLI + Skill over MCP — concrete data on context-window economics. $47M Series A from ICONIQ; customers Google / DoorDash / Cigna / Volkswagen / Grubhub / NEC; integrates with Hermes / OpenClaw / Cline / Goose / Antigravity / n8n / Dify / LangChain / CrewAI. Direct competitors named in launch coverage: Browserbase (uses Exa for search), Firecrawl (agent reliability issues).

  • ScrapeCreators — Social Media Scraping API for AI Pipelines — Adrian Horning’s (Austin, TX) social-scraping API across 20+ platforms (TikTok 20 endpoints, Instagram 12, YouTube 12, Facebook 9 + Ad Library, X/Twitter 6, LinkedIn 4 + Ad Library, Reddit 5, Pinterest 4, Threads 5; plus Bluesky/Truth Social/Twitch/Spotify/Snapchat/Kick + 4 ad libraries + 5 link-in-bio platforms). 100 free credits no-card; pay-as-you-go from 497/500k credits → enterprise. Single x-api-key header, no rate limits, JSON-only, ~3.1s avg response, claimed 1M+ req/day at 98.2% success. Ships official MCP server (@scrape-creators/mcp) + CLI + first-party Claude Code skill. Sister to TinyFish (web infra) — ScrapeCreators is the social-platform-deep counterpart. Karpathy last30days SessionStart hook calls it out by name as the gap-filler for Reddit comments + TikTok + Instagram (note: hook quotes 10k free credits, landing page shows 100 — flagged for refresh).

  • Crabbox — Remote Testbox for OpenClaw Maintainers and AI Agents: github.com/openclaw/crabbox (MIT, Go, 299★ at 10 days old, created 2026-04-30, last push same-day as ingest). Short-lived Linux box for every run on shared cloud capacity: lease, sync, run, release. CLI (Go binary on the laptop) + Broker (Cloudflare Worker + 1 Durable Object) + Runner (Hetzner / AWS Spot / Azure / static-SSH / Blacksmith-testbox). Brokered mode keeps provider creds off laptops; CLI carries only a bearer token. Cost guardrails first-class — TTL caps + monthly spend caps + per-user/org/provider tracking via crabbox usage. Ships as standalone CLI (brew install openclaw/tap/crabbox) AND native OpenClaw plugin exposing 5 agent tools (crabbox_run / _warmup / _status / _list / _stop). The OpenClaw answer to the infrastructure-was-the-wall thesis Anthropic’s Platform team articulates — open + self-hosted + multi-cloud counterpart to Anthropic’s Managed Agents. Notable design choice: crabbox actions hydrate reuses existing GitHub Actions setup steps so local Crabbox runs land in the same hydrated workspace as CI (no duplicate local + CI bootstrap config). Same loop for agents and humans.

  • Paperclip — Multi-Agent Company Orchestration Platform — Paperclip frames AI agent management as running an AI company rather than configuring a single coding assistant. Heartbeat system (9-step protocol per agent: receive task → check budget → load skills → plan → execute → log → checkpoint → return → sleep), goal cascade (org-level → team-level → agent-level), full org-chart UI showing reporting structure and inter-agent message volume, five agent configuration areas (Instructions / Configuration / Skills / Budget / Runs), 16 pre-built example “companies” including Agency Agents and Fullstack Forge. Native Claude Code REST API integration on localhost:3100 — Paperclip can dispatch work to a local Claude Code instance instead of going through the Anthropic API directly. Closes the “I want one agent that runs the whole company while I sleep” loop that single-agent frameworks struggle with. AIS+ resource bundle entry; companion course to Codex 1-Hour and Hermes 1-Hour from the same operator.

  • Autobrowse — Self-Improving Browser-Agent Harness (Browserbase) — Browserbase’s harness that runs a browser agent against a real task on a real site, iterates the strategy via a strategy.md scratchpad until the workflow converges, then graduates the winning approach into a markdown SKILL.md plus deterministic helper scripts. Frames the loop as the Karpathy autoresearch ratchet applied to browser-skill discovery. Concrete benchmarks (Browserbase-reported): Craigslist task 0.12/27s graduated; form-fill 0.24 in 4 iterations; federal grants portal collapsed a 28-page scrape into a single browse fetch after Autobrowse surfaced an undocumented JSON endpoint. Cap iterations low (~3-5), short-circuit aggressively. Honest failure mode: deterministic-parsing tasks (167-row static HTML state catalog cost ~$24 across 4 iters before pivoting to 200 lines of Python with browse fetch + BeautifulSoup) — lesson written into the skill itself: probe with fetch first, escalate to Autobrowse only if the response is empty / dynamic / gated. Output is small readable markdown (frontmatter with recommended_method + alternative_methods + source trace listing iters/convergence date/cross-region prod-validation; body sections for Purpose / When to Use / Workflow / Site-Specific Gotchas) — same format Browserbase’s internal generalist agent bb already loads on demand for feature requests / session investigations / PRs / sales triage. Skills as customer handoff — durable, debuggable, human-auditable, ownable; both engineers and non-engineers (technical PM, VP of tech, grants manager) can read them. Same memory-as-bottleneck thesis as Memory & Dreaming and the Platform team interview, applied to browser agents specifically. Roadmap: smarter stopping (let the agent reason about own convergence by trace structure, not just cost/turns), better priors (push the agent toward fetch/search primitives before browser sessions, and inspect network events / CDP logs to discover internal APIs), recursive Autobrowse (improving the harness itself).

  • Shopify Review Scraper (mikefutia) — Free local Node + Playwright web app for pulling Shopify product-page reviews (up to 250 per request) as CSV or JSON. Provider-aware adapters for Okendo / Junip / Judge.me, plus generic fallbacks for JSON-LD review schema, rendered review markup, and review-shaped network payloads. Browser UI at localhost:3000 + REST API at POST /api/scrape. No API keys, MIT, runs entirely on the operator’s laptop. 9 commits / 2 stars / 0 issues at ingest — narrow-purpose tool from the Scale AI Skool community. Architectural counterpart to ScrapeCreators’s paid social-platform-deep API and Apify’s per-actor marketplace: free, self-hosted, single-platform-deep. Useful template for the “local Playwright app exposing a REST API for one scraping task” pattern. Pairs with Meta Ads CLI for product-launch analysis (creative + customer-sentiment triangulation) and with Hermes as a registered tool. Premise — “Shopify does not provide reviews through a standard product-page API” — remains true; native review surface fragmented across third-party widgets so per-provider adapter pattern is the structural answer.

  • Reflexio — Self-Improvement Harness for AI Agents (ReflexioAI): github.com/ReflexioAI/reflexio (Apache 2.0, Python ≥3.12, 200★ at 5 weeks old). External harness that sits next to an agent, reads completed runs, and extracts user profiles (per-user facts) + agent playbooks (procedural recipes — trigger/instruction/pitfall SOPs) for retrieval next run. Versioning workflow (current → pending → archived) with approval gate. Expert mode compares agent vs. expert responses and writes playbooks from substantive deltas. Drop-in integrations for Claude Code, LangChain, OpenClaw. Headline benchmark claim: −81% planning steps / −72% tokens on Hermes running MiniMax-M2.7 across 4 of 5 GDPVal knowledge-work tasks, on top of the warm baseline (same agent re-running with its own native self-improvement active). Cross-host aggregate is more conservative: −50% / −57%. Caveats worth flagging: N=5 tasks, brand-new 5-week-old solo-author GitHub identity (yilu331 created same day as repo, zero outside contributors), Reflexio sees the cold run of the same task before extracting the recipe (task-specific memoization with retrieval, not transfer learning across tasks). Honest discussion section earns trust — failure case (Police legal reference) documented with two distinct failure modes named. Architecturally a sibling to Browserbase Autobrowse (graduates successful browser strategies into a SKILL.md) — same “harness extracts a reusable recipe from a successful run” pattern, different domain. 57 ms p50 retrieval at ~3,000 indexed rows.

  • Ryan Carson’s Clawd Chief — Solo Founder Executive-Assistant Pattern (OpenClaw + Codex + Devin) — 5x founder Ryan Carson (ex-Treehouse / ex-YC partner) walks through his open-source “Clawd Chief” stack: OpenClaw instance (“R2”) on a MacBook Pro in his closet + VS Code over Tailscale SSH + Codex as the configurator (taking advantage of OpenAI’s subsidized ChatGPT Pro tokens) + Claude Code + Devin in parallel. Load-bearing framing: “Agents are cron jobs and markdown files.” The two load-bearing markdown files in Clawd Chief: priority-map (named projects + people in current rotation) and auto-resolver (decision rules for autonomous vs escalate). R2’s job: schedules meetings via Calendly parsing, sweeps inbox/calendar every 15min and pings Carson in Slack, proactively follows up on outgoing emails, runs daily business-development outreach. Architectural inversion of startup advice: “In startups we used to say just do the bare minimum… that’s literally reverse now” — documentation + cron jobs + skill files ARE the productive work that unlocks the 10x output multiplier. 10+ PRs/day claim anchors the Platform team thesis at solo-founder scale. Sister architecture to Hermes and Claude Code Routines — same long-running + markdown-config + messaging-channel pattern, different tradeoffs.

  • AutoAgent — Autonomous Harness Engineering (kevinrgu): github.com/kevinrgu/autoagent (MIT, Python 100%, 4,500★ / 499 forks / 29 watchers). Meta-agent that hill-climbs on a benchmark of Docker-isolated tasks; tests write a score (0.0-1.0) to /logs/reward.txt and the meta-agent uses it as the loss function for the next iteration. Built on harbor for task execution (uv run harbor run -p tasks/ --agent-import-path agent:AutoAgent); default concurrency 4, README shows 100-wide sweeps. Task format is portable: instruction.md + tests/{test.sh, test.py} + environment/Dockerfile + files/. The stated performance lever isn’t a fine-tune or RL pipeline — it’s equipping the agent with Agent Skills for Context Engineering + context7 skills. Aligns with the skills-as-capability-layer pattern in Tool, Skill, or Subagent?. Sister to Reflexio (retrieval over playbooks) and Browserbase Autobrowse (browser-specific graduation) — different mechanisms, same hill-climbing-on-criterion north star. Author posture not yet vetted; high star count for a sole-developer Python repo warrants caution before deep adoption.

  • RoboNuggets) — Beginner-audience primer covering ~20 OpenClaw concepts in 60-second explanations: agent-as-employee framing, dedicated-machine deploy hygiene, OAuth-vs-API-key cost gotchas (incl. provider posture as of May 2026 — OpenAI explicitly allows OAuth post-creator-acquisition, Anthropic is a gray area, Google has documented Gmail bans), the agentic loop, the Gateway as “always-on engine,” channels as “phone lines plugged into the switchboard,” multi-agent vs sub-agent, the seven-file mental model (identity.md / soul.md / agents.md / user.md / tools.md / memory.md / heartbeat.md + daily memory folder), the cost engine (every message re-injects ALL core MD files as system prompt), model-agnostic via OpenAI/Anthropic/Ollama, skills + clawhub.ai (with vetting caveat), MCP servers, plugins as code-level extensions (every channel is itself a plugin), nodes as paired devices (smart glasses, iPad), and openclaw.json allow/deny lists. Companion to the Alex Krantz UC Berkeley architectural deep-dive (same system, different audience).

  • NVIDIA NemoClaw — Reference Stack for Running OpenClaw Securely in OpenShell — NVIDIA’s first-party open-source hardening layer for OpenClaw (github.com/NVIDIA/NemoClaw, Apache 2.0, TypeScript, 20,575★ at ingest, created 2026-03-15, last push 2026-05-21, alpha software / early preview since March 16 2026). Bundles the NVIDIA OpenShell runtime (part of NVIDIA Agent Toolkit) + a hardened sandbox blueprint (Landlock + seccomp + netns by default) + guided nemoclaw onboard wizard + OpenShell-managed channel messaging + experimental Model Router (NVIDIA LLM Router v3 prefill engine on LiteLLM, sandbox calls https://inference.local/v1 via gateway, never sees raw API keys) behind a single curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash. Default model nvidia/nemotron-3-super-120b-a12b via NVIDIA Endpoints; pool also includes Nemotron-3-Nano-30B-A3B (0.10/M in) with prefill router defaulting to tolerance: 0.20. Hardware: GeForce RTX PCs/laptops, RTX PRO workstations, DGX Station, DGX Spark (with a Spark playbook for end-to-end local Ollama inference). Min 4 vCPU / 8GB RAM / 20GB disk; OOM-killer warning documented for <8GB hosts. The structural answer to Alex Krantz’s observation that baseline OpenClaw security is “not a particularly secure system.” Same-week launch family with Anthropic Managed Agents self-hosted sandboxes (2026-05-19) — same agent-infrastructure cluster, opposite trust model (NVIDIA ships orchestration as OSS; Anthropic ships orchestration as managed plane). 20.5k stars in two months reads as OpenClaw graduating from community project to infrastructure hyperscalers wrap their first-party stack around.

  • CloakBrowser — Stealth Chromium with Source-Level Fingerprint Patches (CloakHQ): github.com/CloakHQ/CloakBrowser (MIT, Python, 19,458★ at ingest, created 2026-02-22). Drop-in Playwright replacement with source-level Chromium fingerprint patches; “30/30 bot detection tests passed” — falsification candidate: reproduce against bot.sannysoft.com / creepjs / fingerprintjs / Cloudflare-protected canaries. Dual-use caveat made explicit in-article — anti-detect / Cloudflare-bypass / captcha-bypass sits on the security gray zone; legitimate-use slots called out (internal bot-detection QA against your own site, accessibility regression, mirror-content scraping you own). Strict-bar verification flags raised: high-star young repo, ToS-vary-by-target ethics layer, dual-use marketing context. Infrastructure-aware comparison (not endorsement) to Browserbase Autobrowse (managed, ethics-curated) and TinyFish (managed, full-Chromium fetch).

  • Microsoft Webwright — Coding-Agent-with-a-Terminal Browser Framework: github.com/microsoft/Webwright (MIT, Python, 143★ as of 2026-05-27 — corrected from 1,106 first-ingest figure; ~1.5k LoC total). Microsoft Research-authored (Lu, Xu, Huang, Awadallah) browser-agent framework whose load-bearing inversion is to separate the agent from the browser — the workspace (code, screenshots, logs) is the state, not the browser session. SOTA at 100-step budget: Online-Mind2Web 86.7% with GPT-5.4 / 84.7% with Opus 4.7; Odysseys 60.1%, +15.6 points over the prior Opus 4.6 vision-based SOTA. Ships as a first-class plugin to four hosts from one skills/webwright/ directory: Claude Code (/webwright:run one-shot vs /webwright:craft reusable CLI tool), OpenAI Codex (@webwright), OpenClaw, Hermes Agent. Comparison table vs Stagehand/agent-browser/browser-use included in the article. MSR blog: A Terminal Is All You Need For Web Agents. Same SWE-style code-as-source-of-truth philosophy as Tool, Skill, or Subagent? (Will, Applied AI), applied to web automation.

  • WebMCP Directory — Sites Exposing Tools to AI Agents (nekuda.ai): webmcp.cool, maintained by nekuda.ai. Live curated directory of websites that expose typed tools via navigator.modelContext, so AI agents running in the browser can list and invoke them. Built on the W3C webmachinelearning/webmcp proposal. The load-bearing inversion: sites stop being read-only documents and start being callable surfaces with schemas. 18 sites in directory at fetch (2026-05-28). Two distribution paths: (a) Claude Code skill — one-line install npx skills add nekuda-ai/webmcp adds discover-introspect-invoke via Playwright; (b) Ask nekuda Chrome extension for end-users. Read-only JSON API at webmcp.cool/api/v1/{lookup,sites,stats} with OpenAPI 3.1 spec. Closest peer to EmDash CMS (CMS-with-built-in-MCP-server) and Microsoft Webwright (browser-agent framework). Action-side counterpart to Ramp’s content-side agent-readable-web experiment. Repo: github.com/nekuda-ai/webmcp.

  • Microsoft Agent Governance Toolkit: github.com/microsoft/agent-governance-toolkit (MIT, Python, 2,941★, Public Preview, Microsoft-signed releases). Policy enforcement + zero-trust identity + execution sandboxing + SRE for autonomous AI agents. “One pip install, any framework.” Multi-language SDK distribution (PyPI agent-governance-toolkit + npm @microsoft/agent-governance-sdk + NuGet Microsoft.AgentGovernance). Covers 10/10 OWASP Agentic Top 10 with a published architecture-mapping doc. The README’s three operator questions (is this action allowed / which agent did this / can you prove what happened) are the load-bearing framing — prompt-level safety is “a polite request to a stochastic system.” Sibling to Microsoft Webwright (same Microsoft org, different layer — Webwright is the agent, this is the governance for the agent) and to NVIDIA NemoClaw (sibling third-party big-co security/hardening effort, OpenClaw-specific vs framework-agnostic). Composes with Anthropic — How We Contain Claude — that post covers the runtime-sandbox slice; this toolkit covers the policy + identity + audit slice. OpenSSF Scorecard + OpenSSF Best Practices badged.

  • Principles for Autonomous System Design — OpenClaw Architectural Deep Dive (Alex Krantz, UC Berkeley) — 1-hour talk by a UC Berkeley networking-systems PhD student (advised by Scott Shenker + Sylvia Ratnasamy, also Ion Stoica’s Sky Lab) reverse-engineering OpenClaw after a month of use + several weeks deep in the code. Four-phase LLM evolution (Phase 0 next-token predictors → Phase 1 fine-tuned assistants → Phase 2 LLMs with static orchestration → Phase 3 autonomous agents with dynamic tool discovery). Matryoshka-doll model of “loopiness” = transformer → repeated calls = sentences → wrapped = chat → tools = scoped agents → full env ownership + self-modification = OpenClaw. Three-layer architecture (Connectors / Gateway Controller / Agent Runtime) walked code-level. Sessions-as-processes + agents-as-threads OS mapping. The two-ways-to-interact-with-time pattern as the load-bearing OpenClaw innovation (heartbeat = unpredictable monitoring; cron = predictable scheduled actions; together = agency over the dimension of time = liveliness). Soul.md grounding-of-tone observation. CLI > MCP claim (“MCP was everything 6-8 months ago, agents have gotten really good at CLI”). Skills > MCP for personalization. exc.dev recommendation over Mac mini ($20/mo, 50 persistent VMs, Shelly setup by Tailscale co-founder). Discord-channel-per-project setup pattern (credit Mehdi Qazi). The YouTube channel autonomy demo — Alex authenticated OpenClaw with a Google account, told it “make a YouTube channel,” and 30 minutes of feedback later it had self-discovered Manim, OpenAI TTS API, FFmpeg, and a YouTube upload skill, then autonomously generated 31 educational videos including one explaining his advisor’s CAN paper that the advisor herself approved. “Code quality is dead — design matters more than implementation” meta-observation. Closes with Hofstadter strange-loops framing: agent-as-interface-for-reconfiguring-itself is “a flywheel takeoff moment.”

  • The Production Class Ladder — Governing AI-Built Software (Nate B Jones) — When generating software is nearly free, the bottleneck shifts from “should we build this?” to classifying the software that already exists. Jones’s framework: a 4-rung production class ladder (personal tool → team beta → supported internal product → customer-facing), each rung with explicit requirements; a prototype commons + open-discovery intake posture (“everybody’s job is to prototype”); and promotion and demotion governance (“a ladder that only moves up becomes a junk drawer” — unsupported internal software is the new tech debt). Data spine: Microsoft’s >1M Power Platform assets governed by inventory/telemetry/permission-review; GitGuardian’s 1.2M AI-secret leaks (+81% YoY). The build-side classification layer that sits above Microsoft’s Agent Governance Toolkit (runtime policy/identity/audit).

  • Organizational Singularity — Salim Ismail’s ExO 3.0 REWRITE Methodology — Org-scale counterpart to the Production Class Ladder: restructure a whole company around agentic AI instead of bolting it onto a legacy org chart. Coase’s “nature of the firm” breaks (“building the feature is cheaper than the meeting about it”); the firm survives as a “fiduciary wedge.” Deliverables: a 6-layer intelligence stack (purpose→sensing→interpretation→decision→orchestration→learning, OODA-style, wrapped by a govern-and-assure harness with a human gate per layer); per-agent “passport” governance (what it may/may-not do + policy-controlled APIs + searchable log + rollback; many worker-agents checked by many overseer-agents); and the REWRITE methodology (backcast → 7-dim score → map workflows → cut org drag → digital-twin-at-the-edge → rewire). Load-bearing tactic: never retrofit — copy a workflow into an edge AI-native twin, run parallel, deprecate the old. Claim-heavy futurism on headcount (~10-25% of current) but the methodology + governance primitives are concrete. Cites the 44% Gen Z sabotage “immune system” figure; ships the ExO 3.0 book as a Claude skill.

  • DeepMind’s AI for Science (Demis Hassabis) — Domain-edge (frontier science) but two transferable ideas: the “AI as hypothesis-generation sparring partner” workflow (narrow the question, let it run long — an ~8-hr run produced usable ray-tracing research ideas — and treat output as hypotheses to validate, not answers) and the recursive-self-improvement boundary (self-improving loops compound in code/math where verification is cheap, but stall in physics/chem/bio where the verifier is a physical experiment — the same hill-climbing logic as AutoAgent / Reflexio, with Hassabis naming why it generalizes to software not atoms). Frontier context: AI Co-Scientist (fine-tuned Gemini), Isomorphic Labs’ drug-discovery model suite, automated materials labs + ~200K untested material designs.

  • agentmemory — Persistent Memory for AI Coding Agents (rohitg00) — Off-the-shelf persistent-memory server for coding agents (github.com/rohitg00/agentmemory, Apache-2.0, ~19.8K★). Four memory tiers (working/episodic/semantic/procedural) with decay; hybrid retrieval (BM25 + vector + knowledge-graph, fused via Reciprocal Rank Fusion, injected at SessionStart); Capture→Compress→Index→Retrieve pipeline on an in-house iii primitive (SQLite-only, no external DB). Benchmarked: 95.2% R@5 on LongMemEval-S (ICLR 2025) vs grep’s 86.2%, ~170K vs ~650K tokens/year, 14ms p50; 950+ tests, reproducible harness. npm i -g @agentmemory/agentmemory → connect via MCP to 15+ agents. Clears the strict repo bar (license + tests + peer-reviewed benchmark) where similar high-star projects get deferred. Sibling to Reflexio / AutoAgent (extract reusable artifact from past runs) and Hermes MemoryKit (RRF router + tiers); the buy-not-build counterpart to memory architecture comparison. Also tool #4 in Five OSS Tools.

  • Venice AI — Private LLM Inference with Verifiable TEE Attestation — Walkthrough + live cryptographic proof (Tonbi’s AI Garage, the Hermes Masterclass creator) of Venice AI’s four escalating privacy tiers: Anonymous (metadata-stripping proxy, frontier models) → Private (contractual zero-retention GPUs, default) → TEE (Intel TDX + Nvidia confidential-GPU enclaves, operator cannot read prompts, provable via attestation) → E2EE beta (on-device ECDH encryption to the enclave key). Attestation chain verified on camera with a nonce-fresh Python script against Intel/Nvidia root keys; Phala Network + NEAR AI cloud orchestrate the decentralized TEE fleet (keys/trust on-chain, compute off-chain — creator’s reading, vendor-unconfirmed). Honest trade-offs: TEE/E2EE disable web search + memory; consistent price premium vs OpenRouter (Llama 3.3 0.10/M input) — “you’re buying attestation, not just tokens.” Demo wires Venice into Hermes Agent as a custom OpenAI-compatible endpoint (85 models). The verifiable middle path between “trust our policy” cloud APIs and quality-capped local models.

  • OpenAI Codex Sites — Building Autonomous Self-Updating Apps — Greg Isenberg (Startup Ideas) walkthrough of OpenAI Codex Sites (invoked @sites), the app-builder aimed at autonomous self-updating apps an agent keeps operating after launch. Ships bare (no DB/payments/email/analytics/secrets — prompt them in; Cloudflare D1 or Convex for storage); internal/random-URL-only, no custom domains yet. The reusable discipline is a 4-pattern build — memory → safe actions (a named-mutation boundary so the agent can’t run arbitrary SQL) → skills (a reusable operation manual) → save gates → prove-the-loop in a fresh chat — transferable to any agentic-app workflow. vs Replit/Lovable: autonomy over turnkey simplicity. Single creator-demo source (confidence medium).

  • 12-Factor Agents — HumanLayer’s Framework for Reliable LLM Applications — 12 principles for reliable LLM apps (homaging Heroku’s 12-factor). Load-bearing thesis: “LLMs are stateless functions” — own your control flow, own your context window, own your prompts. Context engineering (Factors 2+3) has superseded prompt engineering as the core discipline. Includes Anthropic’s Five Workflow Patterns (prompt chaining / routing / parallelization / orchestrator-workers / evaluator-optimizer) and convergent principles from Weng/Huyen agent surveys. Apache-2.0 + CC BY-SA 4.0. github.com/humanlayer/12-factor-agents. Source: Agent Wikis compiled from official repo + Anthropic essays.

  • Council — Native macOS App for Multi-Model Blind Deliberation: albertofettucini/Council (MIT, Swift/SwiftUI, 81★). Puts one question to up to nine seats (any of twelve backends — Claude/GPT/Gemini/DeepSeek/Grok/Mistral/Perplexity/OpenRouter/Ollama/Apple Intelligence/two custom OAI-compatible endpoints), they answer in parallel and critique each other blind, then a 0–100 divergence score + optional bounded debate round + a synthesis that preserves the dissent (the outlier spotlighted, “because the majority can be confidently wrong together”). BYOK, keys in macOS Keychain only, no server/telemetry. Distinct personas (Analyst/Practitioner/Skeptic) + a Devil’s Advocate seat; running token/$ tally + pre-run estimate (a 4-stage × 3-model run is ~12+ calls). Ships a council CLI with structured JSON (council.cli.v1) and a CI divergence gate (--fail-above 40). The desktop, multi-model embodiment of the judge-panel / perspective-diverse-verify pattern — concrete tooling for The Verification Frontier; multi-model cousin of the single-model /decide memo in Seven Claude Skills. Honest limit: the divergence score “measures agreement, not correctness.” Flagged in the GitHub-Trending Weekly 36 roundup.

  • Miro Canvas — A Shared, Agent-Readable Context Layer for Teams — Miro’s relaunched Canvas (debuted at “Canvas ‘26”) repositioned from whiteboard to a cloud-hosted, team-shared context layer that AI agents read via MCP. Instead of each teammate keeping their own local markdown, the canvas holds many context types — markdown docs, mermaid diagrams, images, flowcharts, clickable HTML prototypes, code — that every team member’s agent connects to, so context stays identical across a team. Two agents can act on it: Miro’s built-in AI sidekick and your own Cursor or Codex agent through the Miro MCP (@miro in Cursor). Demoed round-trip: drop a PRD onto the canvas → sidekick researches best practice as a new doc → generate an onboarding doc → generate an interactive HTML prototype on the canvas → pull it back into a Cursor code project to build. Comments + sticky notes become part of what the sidekick reads. Selection-scoping quirk: the sidekick only “sees” objects you explicitly select, while an MCP-connected Cursor/Codex agent reads canvas docs without manual selection. New and quirky (sponsored demo, medium confidence) — the team-shared, multi-modal counterpart to The Agent-Readable Web and WebMCP Directory.

  • crawl4ai — Open-Source LLM-Friendly Web Crawler & Scraper — unclecode’s Apache-2.0 Python crawler (~68.7k★, the most-starred web crawler on GitHub) that turns arbitrary pages into clean/Fit Markdown for RAG and agents. No API keys or rate limits; two extraction paths (JsonCssExtractionStrategy CSS-schema/no-LLM vs LLMExtractionStrategy); Playwright browser control (sessions/proxy/anti-bot/screenshots/PDF); ships as a pip library, a Docker server (REST API + dashboard + JWT), and an MCP server for Claude Code. The self-hosted OSS counterpart to TinyFish Fetch / Browserbase / Firecrawl, and the client-side inverse of the agent-readable web (it makes the human web agent-readable without the site cooperating). Commercial Cloud API in closed beta.
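
As an illustration of the three workflow shapes named in the Agent Workflow Patterns entry above (sequential, parallel, evaluator-optimizer), here is a minimal Python sketch. It is not Anthropic's reference code: the function names run_sequential, run_parallel, and run_evaluator_optimizer are hypothetical, and plain callables stand in for model calls so the example runs without any API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

# A "step" is any callable from text to text; in practice it would wrap a model call.
Step = Callable[[str], str]


def run_sequential(steps: Sequence[Step], task: str) -> str:
    """Sequential: each step consumes the previous step's output."""
    result = task
    for step in steps:
        result = step(result)
    return result


def run_parallel(workers: Sequence[Step], task: str,
                 merge: Callable[[List[str]], str]) -> str:
    """Parallel: independent workers see the same input; outputs are merged."""
    with ThreadPoolExecutor(max_workers=max(1, len(workers))) as pool:
        # pool.map returns results in the same order as `workers`.
        outputs = list(pool.map(lambda worker: worker(task), workers))
    return merge(outputs)


def run_evaluator_optimizer(generate: Callable[[str, str], str],
                            evaluate: Callable[[str], Tuple[bool, str]],
                            task: str, max_rounds: int = 3) -> str:
    """Evaluator-optimizer: draft, critique, revise until accepted or out of rounds."""
    draft, feedback = "", ""
    for _ in range(max_rounds):
        draft = generate(task, feedback)
        accepted, feedback = evaluate(draft)
        if accepted:
            break
    return draft


if __name__ == "__main__":
    # Stub steps standing in for model calls (assumptions for the demo only).
    print(run_sequential([str.strip, str.title], "  draft the release notes "))
```

Sequential is the shortest of the three to write and to debug, which is consistent with the entry's advice to default to it and reach for the other shapes only when the task needs them.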

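The Reciprocal Rank Fusion step mentioned in the agentmemory entry (BM25 + vector + knowledge-graph results fused into one ranking) can be sketched as follows. This is the textbook form of RRF, where each document scores the sum of 1 / (k + rank) over the input rankings, with the commonly used constant k = 60. It is not taken from agentmemory's code; the function name and the sample ids are hypothetical.

```python
from collections import defaultdict
from typing import Hashable, Iterable, List, Sequence


def reciprocal_rank_fusion(rankings: Iterable[Sequence[Hashable]],
                           k: int = 60) -> List[Hashable]:
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based; items missing from a list contribute nothing.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], str(d)))


if __name__ == "__main__":
    bm25 = ["a", "b", "c"]    # hypothetical keyword ranking
    vector = ["b", "c", "a"]  # hypothetical embedding ranking
    graph = ["b", "a", "d"]   # hypothetical knowledge-graph ranking
    print(reciprocal_rank_fusion([bm25, vector, graph]))
    # prints ['b', 'a', 'c', 'd']
```

Because RRF uses only rank positions, it can combine retrievers whose raw scores are on incomparable scales, which is presumably why it suits a hybrid keyword, vector, and graph setup.
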
Adjacent: long-running agent showcases

  • ClaudePlaysPokemon [Reddit signal — r/ClaudeCode 2026-05-07] Opus 4.7 run currently streaming live at twitch.tv/claudeplayspokemon. Passion project by David Hershey (Anthropic Applied AI team), started June 2024 to learn agent development; went public when Sonnet 3.7 launched February 2025. Anthropic doesn’t own it but promotes it and subsidizes the API costs since Claude is the model. Useful as a publicly-observable benchmark of long-horizon agent capability — what the model does on a single complex environment given multi-day continuous compute. Source: raw/reddit-1t5y55h.md (r/ClaudeCode, 41 upvotes).

33 items under this folder.