Tool use, planning, multi-agent patterns, agent frameworks, and practical agent deployments. Covers both Claude-specific agent features and general agentic architecture patterns.
Articles
-
Canva AI 2.0 — Building a Production Agentic System with Claude (Danny Wu) — First-party production case study (Code with Claude Tokyo 2026). Four reusable lessons: define success as steerability + latency not one-shot quality (avg design edited ~110×); the harness is disposable but evals are the durable asset (rewrote the harness 3× in 3 years); cost at scale via token budgets + real tool-cost tracking + a cache-preserving Sonnet-orchestrator → Opus/Haiku routing (~half the cost, high-80s/90s cache hit); and a feedback-to-evals loop that folds user complaints back into the eval set. Companion to the Tokyo digest.
-
Maintain the Harness, Don’t Pile On Tools (Nate B Jones) — The next phase of agent work is maintenance, not construction: the durable, ownable layer is the harness (workbench) around the model — what it reads, remembers, can touch, must prove, and what stops it. Anchor claim: Vercel made its sales-inbox agent better by deleting ~80% of its tools. Introduces the falsifiable framing that agents break in two directions (world-drift and model-improvement — a guardrail that protected you from a clumsy model can trap a better one) and a 5-point maintenance audit (what’s it eating / test its reach / check its job / check the proof / check the value). Productive tension with Canva’s “harness is disposable, evals are durable.” Same author as the Production Class Ladder.
-
Claude Agent Hierarchy — When to Use Which — Comparison of Claude’s three agent tiers (Managed Agents, Agent Teams, Subagents) with decision framework for choosing the right one.
-
Agent Workflow Patterns — Sequential, Parallel, Evaluator-Optimizer — Anthropic’s official taxonomy of the three workflow shapes that keep showing up in production, plus the decision framework. Default to sequential. “Start with the simplest pattern that solves your problem.”
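The three shapes named in this entry can be sketched with plain Python callables standing in for model or tool calls. This is a minimal illustration, not code from Anthropic's article: `Step`, `sequential`, `parallel`, `evaluator_optimizer`, and `max_rounds` are hypothetical names chosen here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

# Hypothetical stand-in for one model or tool call: text in, text out.
Step = Callable[[str], str]


def sequential(steps: List[Step], task: str) -> str:
    """Each step consumes the previous step's output."""
    result = task
    for step in steps:
        result = step(result)
    return result


def parallel(workers: List[Step], task: str) -> List[str]:
    """Independent workers see the same input; results keep worker order."""
    with ThreadPoolExecutor(max_workers=max(1, len(workers))) as pool:
        return list(pool.map(lambda worker: worker(task), workers))


def evaluator_optimizer(
    generate: Callable[[str, str], str],
    evaluate: Callable[[str], Tuple[bool, str]],
    task: str,
    max_rounds: int = 3,
) -> str:
    """Regenerate with evaluator feedback until accepted or rounds run out."""
    feedback = ""
    draft = ""
    for _ in range(max_rounds):
        draft = generate(task, feedback)
        accepted, feedback = evaluate(draft)
        if accepted:
            break
    return draft
```

Under these assumptions, `sequential([str.strip, str.upper], "  draft  ")` chains two toy steps. The evaluator loop is capped by `max_rounds` so a draft that is never accepted still terminates.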
-
AI Agents Unleashed — 2026 Playbook (Mindstream × Futurepedia) — Platform-agnostic implementation guide: chatbot-vs-agent reframe, precision framework, “Is this an agent job?” decision tree, 4-phase roadmap, 7 pitfalls, human-AI relationship timeline, 7 training competencies. Authors: Adam Biddlecombe + Kevin Hutson.
-
Nous Research Hermes Agent — Self-hosted autonomous agent with persistent memory, auto-generated skills, 47+ tools, 6 sandbox backends, 15+ messaging platforms, and MCP integration. Model-agnostic (Nous Portal, OpenRouter, OpenAI, or any OAI-compatible endpoint). MIT, 97K+ stars.
-
Adaline — End-to-End AI Agent Platform — Single platform for the four-stage agent lifecycle: iterate, evaluate, deploy, monitor. Provider-agnostic prompt management, multi-modal + dynamic-variable testing, AI-assisted test-suite generation, multi-environment deployments with smart diffing and instant rollbacks, full traces/spans, human-annotation loop tied directly to monitoring. Recently went GA with $1MM API-credit promotion. Customers: McKinsey (Lilli), Discord, Coframe, Reforge. Stats claimed: 200M+ API calls/day, 5B+ tokens/day, 300+ models, 99.998% uptime. Sits alongside LangSmith / PromptLayer / Helicone / Braintrust / Galileo in the LLMOps space.
-
TinyFish — Web Infrastructure APIs for AI Agents — Four-product platform under one API key: Search, Fetch, Browser, Agent. Search + Fetch went free May 4 2026 across REST / MCP / SDKs / CLI / Skill (free-tier 5 q/min Search, 25 URLs/min Fetch). Custom Chromium fleet with 28 C++-level anti-bot mechanisms; sub-250ms browser cold start, P50 488ms search. Vendor-reported 87% token reduction and 2× completion rate when using CLI + Skill over MCP — concrete data on context-window economics. $47M Series A from ICONIQ; customers Google / DoorDash / Cigna / Volkswagen / Grubhub / NEC; integrates with Hermes / OpenClaw / Cline / Goose / Antigravity / n8n / Dify / LangChain / CrewAI. Direct competitors named in launch coverage: Browserbase (uses Exa for search), Firecrawl (agent reliability issues).
-
ScrapeCreators — Social Media Scraping API for AI Pipelines — Adrian Horning’s (Austin, TX) social-scraping API across 20+ platforms (TikTok 20 endpoints, Instagram 12, YouTube 12, Facebook 9 + Ad Library, X/Twitter 6, LinkedIn 4 + Ad Library, Reddit 5, Pinterest 4, Threads 5; plus Bluesky/Truth Social/Twitch/Spotify/Snapchat/Kick + 4 ad libraries + 5 link-in-bio platforms). 100 free credits no-card; pay-as-you-go from 497/500k credits → enterprise. Single `x-api-key` header, no rate limits, JSON-only, ~3.1s avg response, claimed 1M+ req/day at 98.2% success. Ships official MCP server (`@scrape-creators/mcp`) + CLI + first-party Claude Code skill. Sister to TinyFish (web infra) — ScrapeCreators is the social-platform-deep counterpart. Karpathy `last30days` SessionStart hook calls it out by name as the gap-filler for Reddit comments + TikTok + Instagram (note: hook quotes 10k free credits, landing page shows 100 — flagged for refresh).
-
Crabbox — Remote Testbox for OpenClaw Maintainers and AI Agents — `github.com/openclaw/crabbox` (MIT, Go, 299★ at 10 days old, created 2026-04-30, last push same-day as ingest). Short-lived Linux box for every run on shared cloud capacity: lease, sync, run, release. CLI (Go binary on the laptop) + Broker (Cloudflare Worker + 1 Durable Object) + Runner (Hetzner / AWS Spot / Azure / static-SSH / Blacksmith-testbox). Brokered mode keeps provider creds off laptops; CLI carries only a bearer token. Cost guardrails first-class — TTL caps + monthly spend caps + per-user/org/provider tracking via `crabbox usage`. Ships as standalone CLI (`brew install openclaw/tap/crabbox`) AND native OpenClaw plugin exposing 5 agent tools (`crabbox_run/_warmup/_status/_list/_stop`). The OpenClaw answer to the infrastructure-was-the-wall thesis Anthropic’s Platform team articulates — open + self-hosted + multi-cloud counterpart to Anthropic’s Managed Agents. Notable design choice: `crabbox actions hydrate` reuses existing GitHub Actions setup steps so local Crabbox runs land in the same hydrated workspace as CI (no duplicate local + CI bootstrap config). Same loop for agents and humans.
-
Paperclip — Multi-Agent Company Orchestration Platform — Paperclip frames AI agent management as running an AI company rather than configuring a single coding assistant. Heartbeat system (9-step protocol per agent: receive task → check budget → load skills → plan → execute → log → checkpoint → return → sleep), goal cascade (org-level → team-level → agent-level), full org-chart UI showing reporting structure and inter-agent message volume, five agent configuration areas (Instructions / Configuration / Skills / Budget / Runs), 16 pre-built example “companies” including Agency Agents and Fullstack Forge. Native Claude Code REST API integration on `localhost:3100` — Paperclip can dispatch work to a local Claude Code instance instead of going through the Anthropic API directly. Closes the “I want one agent that runs the whole company while I sleep” loop that single-agent frameworks struggle with. AIS+ resource bundle entry; companion course to Codex 1-Hour and Hermes 1-Hour from the same operator.
-
Autobrowse — Self-Improving Browser-Agent Harness (Browserbase) — Browserbase’s harness that runs a browser agent against a real task on a real site, iterates the strategy via a `strategy.md` scratchpad until the workflow converges, then graduates the winning approach into a markdown SKILL.md plus deterministic helper scripts. Frames the loop as the Karpathy autoresearch ratchet applied to browser-skill discovery. Concrete benchmarks (Browserbase-reported): Craigslist task 0.12/27s graduated; form-fill 0.24 in 4 iterations; federal grants portal collapsed a 28-page scrape into a single `browse fetch` after Autobrowse surfaced an undocumented JSON endpoint. Cap iterations low (~3-5), short-circuit aggressively. Honest failure mode: deterministic-parsing tasks (167-row static HTML state catalog cost ~$24 across 4 iters before pivoting to 200 lines of Python with `browse fetch` + BeautifulSoup) — lesson written into the skill itself: probe with `fetch` first, escalate to Autobrowse only if the response is empty / dynamic / gated. Output is small readable markdown (frontmatter with `recommended_method` + `alternative_methods` + `source` trace listing iters/convergence date/cross-region prod-validation; body sections for Purpose / When to Use / Workflow / Site-Specific Gotchas) — same format Browserbase’s internal generalist agent `bb` already loads on demand for feature requests / session investigations / PRs / sales triage. Skills as customer handoff — durable, debuggable, human-auditable, ownable; both engineers and non-engineers (technical PM, VP of tech, grants manager) can read them. Same memory-as-bottleneck thesis as Memory & Dreaming and the Platform team interview, applied to browser agents specifically. Roadmap: smarter stopping (let the agent reason about own convergence by trace structure, not just cost/turns), better priors (push the agent toward `fetch`/`search` primitives before browser sessions, and inspect network events / CDP logs to discover internal APIs), recursive Autobrowse (improving the harness itself).
-
Shopify Review Scraper (mikefutia) — Free local Node + Playwright web app for pulling Shopify product-page reviews (up to 250 per request) as CSV or JSON. Provider-aware adapters for Okendo / Junip / Judge.me, plus generic fallbacks for JSON-LD review schema, rendered review markup, and review-shaped network payloads. Browser UI at `localhost:3000` + REST API at `POST /api/scrape`. No API keys, MIT, runs entirely on the operator’s laptop. 9 commits / 2 stars / 0 issues at ingest — narrow-purpose tool from the Scale AI Skool community. Architectural counterpart to ScrapeCreators’s paid social-platform-deep API and Apify’s per-actor marketplace: free, self-hosted, single-platform-deep. Useful template for the “local Playwright app exposing a REST API for one scraping task” pattern. Pairs with Meta Ads CLI for product-launch analysis (creative + customer-sentiment triangulation) and with Hermes as a registered tool. Premise — “Shopify does not provide reviews through a standard product-page API” — remains true; native review surface fragmented across third-party widgets so per-provider adapter pattern is the structural answer.
-
Reflexio — Self-Improvement Harness for AI Agents (ReflexioAI) — `github.com/ReflexioAI/reflexio` (Apache 2.0, Python ≥3.12, 200★ at 5 weeks old). External harness that sits next to an agent, reads completed runs, and extracts user profiles (per-user facts) + agent playbooks (procedural recipes — trigger/instruction/pitfall SOPs) for retrieval next run. Versioning workflow (current → pending → archived) with approval gate. Expert mode compares agent vs. expert responses and writes playbooks from substantive deltas. Drop-in integrations for Claude Code, LangChain, OpenClaw. Headline benchmark claim: −81% planning steps / −72% tokens on Hermes running MiniMax-M2.7 across 4 of 5 GDPVal knowledge-work tasks, on top of the warm baseline (same agent re-running with its own native self-improvement active). Cross-host aggregate is more conservative: −50% / −57%. Caveats worth flagging: N=5 tasks, brand-new 5-week-old solo-author GitHub identity (`yilu331` created same day as repo, zero outside contributors), Reflexio sees the cold run of the same task before extracting the recipe (task-specific memoization with retrieval, not transfer learning across tasks). Honest discussion section earns trust — failure case (Police legal reference) documented with two distinct failure modes named. Architecturally a sibling to Browserbase Autobrowse (graduates successful browser strategies into a `SKILL.md`) — same “harness extracts a reusable recipe from a successful run” pattern, different domain. 57 ms p50 retrieval at ~3,000 indexed rows.
-
Ryan Carson’s Clawd Chief — Solo Founder Executive-Assistant Pattern (OpenClaw + Codex + Devin) — 5x founder Ryan Carson (ex-Treehouse / ex-YC partner) walks through his open-source “Clawd Chief” stack: OpenClaw instance (“R2”) on a MacBook Pro in his closet + VS Code over Tailscale SSH + Codex as the configurator (taking advantage of OpenAI’s subsidized ChatGPT Pro tokens) + Claude Code + Devin in parallel. Load-bearing framing: “Agents are cron jobs and markdown files.” The two load-bearing markdown files in Clawd Chief: `priority-map` (named projects + people in current rotation) and `auto-resolver` (decision rules for autonomous vs escalate). R2’s job: schedules meetings via Calendly parsing, sweeps inbox/calendar every 15min and pings Carson in Slack, proactively follows up on outgoing emails, runs daily business-development outreach. Architectural inversion of startup advice: “In startups we used to say just do the bare minimum… that’s literally reverse now” — documentation + cron jobs + skill files ARE the productive work that unlocks the 10x output multiplier. 10+ PRs/day claim anchors the Platform team thesis at solo-founder scale. Sister architecture to Hermes and Claude Code Routines — same long-running + markdown-config + messaging-channel pattern, different tradeoffs.
-
AutoAgent — Autonomous Harness Engineering (kevinrgu) — `github.com/kevinrgu/autoagent` (MIT, Python 100%, 4,500★ / 499 forks / 29 watchers). Meta-agent that hill-climbs on a benchmark of Docker-isolated tasks; tests write a score (0.0-1.0) to `/logs/reward.txt` and the meta-agent uses it as the loss function for the next iteration. Built on `harbor` for task execution (`uv run harbor run -p tasks/ --agent-import-path agent:AutoAgent`); default concurrency 4, README shows 100-wide sweeps. Task format is portable: `instruction.md + tests/{test.sh, test.py} + environment/Dockerfile + files/`. The stated performance lever isn’t a fine-tune or RL pipeline — it’s equip the agent with Agent Skills for Context Engineering + context7 skills. Skills-as-capability-layer pattern alignment with Tool, Skill, or Subagent?. Sister to Reflexio (retrieval over playbooks) and Browserbase Autobrowse (browser-specific graduation) — different mechanisms, same hill-climbing-on-criterion north star. Author posture not yet vetted; high star count for a sole-developer Python repo warrants caution before deep adoption.
-
RoboNuggets) — Beginner-audience primer covering ~20 OpenClaw concepts in 60-second explanations: agent-as-employee framing, dedicated-machine deploy hygiene, OAuth-vs-API-key cost gotchas (incl. provider posture as of May 2026 — OpenAI explicitly allows OAuth post-creator-acquisition, Anthropic is a gray area, Google has documented Gmail bans), the agentic loop, the Gateway as “always-on engine,” channels as “phone lines plugged into the switchboard,” multi-agent vs sub-agent, the seven-file mental model (`identity.md`/`soul.md`/`agents.md`/`user.md`/`tools.md`/`memory.md`/`heartbeat.md` + daily memory folder), the cost engine (every message re-injects ALL core MD files as system prompt), model-agnostic via OpenAI/Anthropic/Ollama, skills + `clawhub.ai` (with vetting caveat), MCP servers, plugins as code-level extensions (every channel is itself a plugin), nodes as paired devices (smart glasses, iPad), and `openclaw.json` allow/deny lists. Companion to the Alex Krantz UC Berkeley architectural deep-dive (same system, different audience).
-
NVIDIA NemoClaw — Reference Stack for Running OpenClaw Securely in OpenShell — NVIDIA’s first-party open-source hardening layer for OpenClaw (`github.com/NVIDIA/NemoClaw`, Apache 2.0, TypeScript, 20,575★ at ingest, created 2026-03-15, last push 2026-05-21, alpha software / early preview since March 16 2026). Bundles the NVIDIA OpenShell runtime (part of NVIDIA Agent Toolkit) + a hardened sandbox blueprint (Landlock + seccomp + netns by default) + guided `nemoclaw onboard` wizard + OpenShell-managed channel messaging + experimental Model Router (NVIDIA LLM Router v3 prefill engine on LiteLLM, sandbox calls `https://inference.local/v1` via gateway, never sees raw API keys) behind a single `curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash`. Default model `nvidia/nemotron-3-super-120b-a12b` via NVIDIA Endpoints; pool also includes `Nemotron-3-Nano-30B-A3B` (0.10/M in) with prefill router defaulting to `tolerance: 0.20`. Hardware: GeForce RTX PCs/laptops, RTX PRO workstations, DGX Station, DGX Spark (with a Spark playbook for end-to-end local Ollama inference). Min 4 vCPU / 8GB RAM / 20GB disk; OOM-killer warning documented for <8GB hosts. The structural answer to Alex Krantz’s observation that baseline OpenClaw security is “not a particularly secure system.” Same-week launch family with Anthropic Managed Agents self-hosted sandboxes (2026-05-19) — same agent-infrastructure cluster, opposite trust model (NVIDIA ships orchestration as OSS; Anthropic ships orchestration as managed plane). 20.5k stars in two months reads as OpenClaw graduating from community project to infrastructure hyperscalers wrap their first-party stack around.
-
CloakBrowser — Stealth Chromium with Source-Level Fingerprint Patches (CloakHQ) — `github.com/CloakHQ/CloakBrowser` (MIT, Python, 19,458★ at ingest, created 2026-02-22). Drop-in Playwright replacement with source-level Chromium fingerprint patches; “30/30 bot detection tests passed” — falsification candidate: reproduce against bot.sannysoft.com / creepjs / fingerprintjs / Cloudflare-protected canaries. Dual-use caveat made explicit in-article — anti-detect / Cloudflare-bypass / captcha-bypass sits on the security gray zone; legitimate-use slots called out (internal bot-detection QA against your own site, accessibility regression, mirror-content scraping you own). Strict-bar verification flags raised: high-star young repo, ToS-vary-by-target ethics layer, dual-use marketing context. Infrastructure-aware comparison (not endorsement) to Browserbase Autobrowse (managed, ethics-curated) and TinyFish (managed, full-Chromium fetch).
-
Microsoft Webwright — Coding-Agent-with-a-Terminal Browser Framework — `github.com/microsoft/Webwright` (MIT, Python, 143★ as of 2026-05-27 — corrected from 1,106 first-ingest figure; ~1.5k LoC total). Microsoft Research-authored (Lu, Xu, Huang, Awadallah) browser-agent framework whose load-bearing inversion is separate the agent from the browser — the workspace (code, screenshots, logs) is the state, not the browser session. SOTA at 100-step budget: Online-Mind2Web 86.7% with GPT-5.4 / 84.7% with Opus 4.7; Odysseys 60.1% +15.6 points over the prior Opus 4.6 vision-based SOTA. Ships as a first-class plugin to four hosts from one `skills/webwright/` directory: Claude Code (`/webwright:run` one-shot vs `/webwright:craft` reusable CLI tool), OpenAI Codex (`@webwright`), OpenClaw, Hermes Agent. Comparison table vs Stagehand/agent-browser/browser-use included in the article. MSR blog: A Terminal Is All You Need For Web Agents. Same SWE-style code-as-source-of-truth philosophy as Tool, Skill, or Subagent? (Will, Applied AI), applied to web automation.
-
WebMCP Directory — Sites Exposing Tools to AI Agents (nekuda.ai) — `webmcp.cool`, maintained by nekuda.ai. Live curated directory of websites that expose typed tools via `navigator.modelContext`, so AI agents running in the browser can list and invoke them. Built on the W3C `webmachinelearning/webmcp` proposal. The load-bearing inversion: sites stop being read-only documents and start being callable surfaces with schemas. 18 sites in directory at fetch (2026-05-28). Two distribution paths: (a) Claude Code skill — one-line install `npx skills add nekuda-ai/webmcp` adds discover-introspect-invoke via Playwright; (b) Ask nekuda Chrome extension for end-users. Read-only JSON API at `webmcp.cool/api/v1/{lookup,sites,stats}` with OpenAPI 3.1 spec. Closest peer to EmDash CMS (CMS-with-built-in-MCP-server) and Microsoft Webwright (browser-agent framework). Action-side counterpart to Ramp’s content-side agent-readable-web experiment. Repo: github.com/nekuda-ai/webmcp.
-
Microsoft Agent Governance Toolkit — `github.com/microsoft/agent-governance-toolkit` (MIT, Python, 2,941★, Public Preview, Microsoft-signed releases). Policy enforcement + zero-trust identity + execution sandboxing + SRE for autonomous AI agents. “One `pip install`, any framework.” Multi-language SDK distribution (PyPI `agent-governance-toolkit` + npm `@microsoft/agent-governance-sdk` + NuGet `Microsoft.AgentGovernance`). Covers 10/10 OWASP Agentic Top 10 with a published architecture-mapping doc. The README’s three operator questions (is this action allowed / which agent did this / can you prove what happened) are the load-bearing framing — prompt-level safety is “a polite request to a stochastic system.” Sibling to Microsoft Webwright (same Microsoft org, different layer — Webwright is the agent, this is the governance for the agent) and to NVIDIA NemoClaw (sibling third-party big-co security/hardening effort, OpenClaw-specific vs framework-agnostic). Composes with Anthropic — How We Contain Claude — that post covers the runtime-sandbox slice; this toolkit covers the policy + identity + audit slice. OpenSSF Scorecard + OpenSSF Best Practices badged.
-
Principles for Autonomous System Design — OpenClaw Architectural Deep Dive (Alex Krantz, UC Berkeley) — 1-hour talk by a UC Berkeley networking-systems PhD student (advised by Scott Shenker + Sylvia Ratnasamy, also Ion Stoica’s Sky Lab) reverse-engineering OpenClaw after a month of use + several weeks deep in the code. Four-phase LLM evolution (Phase 0 next-token predictors → Phase 1 fine-tuned assistants → Phase 2 LLMs with static orchestration → Phase 3 autonomous agents with dynamic tool discovery). Matryoshka-doll model of “loopiness” = transformer → repeated calls = sentences → wrapped = chat → tools = scoped agents → full env ownership + self-modification = OpenClaw. Three-layer architecture (Connectors / Gateway Controller / Agent Runtime) walked code-level. Sessions-as-processes + agents-as-threads OS mapping. The two-ways-to-interact-with-time pattern as the load-bearing OpenClaw innovation (heartbeat = unpredictable monitoring; cron = predictable scheduled actions; together = agency over the dimension of time = liveliness). Soul.md grounding-of-tone observation. CLI > MCP claim (“MCP was everything 6-8 months ago, agents have gotten really good at CLI”). Skills > MCP for personalization. exc.dev recommendation over Mac mini ($20/mo, 50 persistent VMs, Shelly setup by Tailscale co-founder). Discord-channel-per-project setup pattern (credit Mehdi Qazi). The YouTube channel autonomy demo — Alex authenticated OpenClaw with a Google account, told it “make a YouTube channel,” and 30 minutes of feedback later it had self-discovered Manim, OpenAI TTS API, FFmpeg, and a YouTube upload skill, then autonomously generated 31 educational videos including one explaining his advisor’s CAN paper that the advisor herself approved. “Code quality is dead — design matters more than implementation” meta-observation. Closes with Hofstadter strange-loops framing: agent-as-interface-for-reconfiguring-itself is “a flywheel takeoff moment.”
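The two-ways-to-interact-with-time pattern described in this entry can be illustrated with a small simulated clock. This is a sketch under stated assumptions, not OpenClaw's implementation: `CronJob`, `TimeLoop`, `heartbeat_check`, and the minute-based tick are hypothetical names and units. Cron entries fire on a fixed schedule, while the heartbeat only wakes the agent to decide whether anything needs doing.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class CronJob:
    name: str
    every: int  # fire when now % every == 0 (hypothetical unit: minutes)


@dataclass
class TimeLoop:
    cron_jobs: List[CronJob]
    heartbeat_every: int
    # Hypothetical hook: inspects the world, returns an action name or None.
    heartbeat_check: Callable[[int], Optional[str]]
    log: List[str] = field(default_factory=list)

    def tick(self, now: int) -> None:
        # Cron: predictable, scheduled actions.
        for job in self.cron_jobs:
            if now % job.every == 0:
                self.log.append(f"{now}: cron {job.name}")
        # Heartbeat: periodic wake-up whose outcome is not known in advance.
        if now % self.heartbeat_every == 0:
            action = self.heartbeat_check(now)
            if action is not None:
                self.log.append(f"{now}: heartbeat {action}")


# Simulate one hour in 15-minute steps; an email "arrives" at minute 30.
loop = TimeLoop(
    cron_jobs=[CronJob("daily-report", every=60)],
    heartbeat_every=15,
    heartbeat_check=lambda now: "reply-to-email" if now == 30 else None,
)
for minute in range(0, 61, 15):
    loop.tick(minute)
print(loop.log)
```

The cron job appears in the log at minutes 0 and 60 regardless of what happened in between, while the heartbeat only logs at minute 30, the one wake-up where its check found work.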
-
The Production Class Ladder — Governing AI-Built Software (Nate B Jones) — When generating software is nearly free, the bottleneck shifts from “should we build this?” to classifying the software that already exists. Jones’s framework: a 4-rung production class ladder (personal tool → team beta → supported internal product → customer-facing), each rung with explicit requirements; a prototype commons + open-discovery intake posture (“everybody’s job is to prototype”); and promotion and demotion governance (“a ladder that only moves up becomes a junk drawer” — unsupported internal software is the new tech debt). Data spine: Microsoft’s >1M Power Platform assets governed by inventory/telemetry/permission-review; GitGuardian’s 1.2M AI-secret leaks (+81% YoY). The build-side classification layer that sits above Microsoft’s Agent Governance Toolkit (runtime policy/identity/audit).
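A ladder that moves in both directions can be sketched as a small classification function. The four rung names follow the entry; everything else is an assumption made for illustration: the requirement sets in `REQUIREMENTS`, the evidence labels, and the one-rung-at-a-time promotion rule are hypothetical and do not reproduce the article's actual per-rung requirements.

```python
from enum import IntEnum
from typing import Dict, Set


class Rung(IntEnum):
    PERSONAL_TOOL = 1
    TEAM_BETA = 2
    SUPPORTED_INTERNAL = 3
    CUSTOMER_FACING = 4


# Hypothetical requirements per rung, for illustration only.
REQUIREMENTS: Dict[Rung, Set[str]] = {
    Rung.PERSONAL_TOOL: set(),
    Rung.TEAM_BETA: {"named_owner"},
    Rung.SUPPORTED_INTERNAL: {"named_owner", "telemetry"},
    Rung.CUSTOMER_FACING: {"named_owner", "telemetry", "security_review"},
}


def classify(evidence: Set[str]) -> Rung:
    """Highest rung whose requirements are all present in the evidence."""
    return max(rung for rung in Rung if REQUIREMENTS[rung] <= evidence)


def review(current: Rung, evidence: Set[str]) -> Rung:
    """Promote one rung at a time; demote straight to what the evidence supports."""
    supported = classify(evidence)
    if supported > current:
        return Rung(current + 1)
    return supported
```

With these assumed requirements, a customer-facing asset whose evidence has shrunk to `{"named_owner"}` is demoted to `Rung.TEAM_BETA` on the next review, which is the demotion half of the governance loop the entry describes.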
-
Organizational Singularity — Salim Ismail’s ExO 3.0 REWRITE Methodology — Org-scale counterpart to the Production Class Ladder: restructure a whole company around agentic AI instead of bolting it onto a legacy org chart. Coase’s “nature of the firm” breaks (“building the feature is cheaper than the meeting about it”); the firm survives as a “fiduciary wedge.” Deliverables: a 6-layer intelligence stack (purpose→sensing→interpretation→decision→orchestration→learning, OODA-style, wrapped by a govern-and-assure harness with a human gate per layer); per-agent “passport” governance (what it may/may-not do + policy-controlled APIs + searchable log + rollback; many worker-agents checked by many overseer-agents); and the REWRITE methodology (backcast → 7-dim score → map workflows → cut org drag → digital-twin-at-the-edge → rewire). Load-bearing tactic: never retrofit — copy a workflow into an edge AI-native twin, run parallel, deprecate the old. Claim-heavy futurism on headcount (~10-25% of current) but the methodology + governance primitives are concrete. Cites the 44% Gen Z sabotage “immune system” figure; ships the ExO 3.0 book as a Claude skill.
-
DeepMind’s AI for Science (Demis Hassabis) — Domain-edge (frontier science) but two transferable ideas: the “AI as hypothesis-generation sparring partner” workflow (narrow the question, let it run long — an ~8-hr run produced usable ray-tracing research ideas — and treat output as hypotheses to validate, not answers) and the recursive-self-improvement boundary (self-improving loops compound in code/math where verification is cheap, but stall in physics/chem/bio where the verifier is a physical experiment — the same hill-climbing logic as AutoAgent / Reflexio, with Hassabis naming why it generalizes to software not atoms). Frontier context: AI Co-Scientist (fine-tuned Gemini), Isomorphic Labs’ drug-discovery model suite, automated materials labs + ~200K untested material designs.
-
agentmemory — Persistent Memory for AI Coding Agents (rohitg00) — Off-the-shelf persistent-memory server for coding agents (`github.com/rohitg00/agentmemory`, Apache-2.0, ~19.8K★). Four memory tiers (working/episodic/semantic/procedural) with decay; hybrid retrieval (BM25 + vector + knowledge-graph, fused via Reciprocal Rank Fusion, injected at SessionStart); Capture→Compress→Index→Retrieve pipeline on an in-house `iii` primitive (SQLite-only, no external DB). Benchmarked: 95.2% R@5 on LongMemEval-S (ICLR 2025) vs grep’s 86.2%, ~170K vs ~650K tokens/year, 14ms p50; 950+ tests, reproducible harness. `npm i -g @agentmemory/agentmemory` → connect via MCP to 15+ agents. Clears the strict repo bar (license + tests + peer-reviewed benchmark) where similar high-star projects get deferred. Sibling to Reflexio / AutoAgent (extract reusable artifact from past runs) and Hermes MemoryKit (RRF router + tiers); the buy-not-build counterpart to memory architecture comparison. Also tool #4 in Five OSS Tools.
-
Venice AI — Private LLM Inference with Verifiable TEE Attestation — Walkthrough + live cryptographic proof (Tonbi’s AI Garage, the Hermes Masterclass creator) of Venice AI’s four escalating privacy tiers: Anonymous (metadata-stripping proxy, frontier models) → Private (contractual zero-retention GPUs, default) → TEE (Intel TDX + Nvidia confidential-GPU enclaves, operator cannot read prompts, provable via attestation) → E2EE beta (on-device ECDH encryption to the enclave key). Attestation chain verified on camera with a nonce-fresh Python script against Intel/Nvidia root keys; Phala Network + NEAR AI cloud orchestrate the decentralized TEE fleet (keys/trust on-chain, compute off-chain — creator’s reading, vendor-unconfirmed). Honest trade-offs: TEE/E2EE disable web search + memory; consistent price premium vs OpenRouter (Llama 3.3 0.10/M input) — “you’re buying attestation, not just tokens.” Demo wires Venice into Hermes Agent as a custom OpenAI-compatible endpoint (85 models). The verifiable middle path between “trust our policy” cloud APIs and quality-capped local models.
-
OpenAI Codex Sites — Building Autonomous Self-Updating Apps — Greg Isenberg (Startup Ideas) walkthrough of OpenAI Codex Sites (invoked `@sites`), the app-builder aimed at autonomous self-updating apps an agent keeps operating after launch. Ships bare (no DB/payments/email/analytics/secrets — prompt them in; Cloudflare D1 or Convex for storage); internal/random-URL-only, no custom domains yet. The reusable discipline is a 4-pattern build — memory → safe actions (a named-mutation boundary so the agent can’t run arbitrary SQL) → skills (a reusable operation manual) → save gates → prove-the-loop in a fresh chat — transferable to any agentic-app workflow. vs Replit/Lovable: autonomy over turnkey simplicity. Single creator-demo source (confidence medium).
-
12-Factor Agents — HumanLayer’s Framework for Reliable LLM Applications — 12 principles for reliable LLM apps (homaging Heroku’s 12-factor). Load-bearing thesis: “LLMs are stateless functions” — own your control flow, own your context window, own your prompts. Context engineering (Factors 2+3) has superseded prompt engineering as the core discipline. Includes Anthropic’s Five Workflow Patterns (prompt chaining / routing / parallelization / orchestrator-workers / evaluator-optimizer) and convergent principles from Weng/Huyen agent surveys. Apache-2.0 + CC BY-SA 4.0. `github.com/humanlayer/12-factor-agents`. Source: Agent Wikis compiled from official repo + Anthropic essays.
-
Council — Native macOS App for Multi-Model Blind Deliberation — `albertofettucini/Council` (MIT, Swift/SwiftUI, 81★). Puts one question to up to nine seats (any of twelve backends — Claude/GPT/Gemini/DeepSeek/Grok/Mistral/Perplexity/OpenRouter/Ollama/Apple Intelligence/two custom OAI-compatible endpoints), they answer in parallel and critique each other blind, then a 0–100 divergence score + optional bounded debate round + a synthesis that preserves the dissent (the outlier spotlighted, “because the majority can be confidently wrong together”). BYOK, keys in macOS Keychain only, no server/telemetry. Distinct personas (Analyst/Practitioner/Skeptic) + a Devil’s Advocate seat; running token/$ tally + pre-run estimate (a 4-stage × 3-model run is ~12+ calls). Ships a `council` CLI with structured JSON (`council.cli.v1`) and a CI divergence gate (`--fail-above 40`). The desktop, multi-model embodiment of the judge-panel / perspective-diverse-verify pattern — concrete tooling for The Verification Frontier; multi-model cousin of the single-model `/decide` memo in Seven Claude Skills. Honest limit: the divergence score “measures agreement, not correctness.” Flagged in the GitHub-Trending Weekly 36 roundup.
-
Miro Canvas — A Shared, Agent-Readable Context Layer for Teams — Miro’s relaunched Canvas (debuted at “Canvas ‘26”) repositioned from whiteboard to a cloud-hosted, team-shared context layer that AI agents read via MCP. Instead of each teammate keeping their own local markdown, the canvas holds many context types — markdown docs, mermaid diagrams, images, flowcharts, clickable HTML prototypes, code — that every team member’s agent connects to, so context stays identical across a team. Two agents can act on it: Miro’s built-in AI sidekick and your own Cursor or Codex agent through the Miro MCP (`@miro` in Cursor). Demoed round-trip: drop a PRD onto the canvas → sidekick researches best practice as a new doc → generate an onboarding doc → generate an interactive HTML prototype on the canvas → pull it back into a Cursor code project to build. Comments + sticky notes become part of what the sidekick reads. Selection-scoping quirk: the sidekick only “sees” objects you explicitly select, while an MCP-connected Cursor/Codex agent reads canvas docs without manual selection. New and quirky (sponsored demo, medium confidence) — the team-shared, multi-modal counterpart to The Agent-Readable Web and WebMCP Directory.
-
crawl4ai — Open-Source LLM-Friendly Web Crawler & Scraper — unclecode’s Apache-2.0 Python crawler (~68.7k★, the most-starred web crawler on GitHub) that turns arbitrary pages into clean/Fit Markdown for RAG and agents. No API keys or rate limits; two extraction paths (`JsonCssExtractionStrategy` CSS-schema/no-LLM vs `LLMExtractionStrategy`); Playwright browser control (sessions/proxy/anti-bot/screenshots/PDF); ships as a pip library, a Docker server (REST API + dashboard + JWT), and an MCP server for Claude Code. The self-hosted OSS counterpart to TinyFish Fetch / Browserbase / Firecrawl, and the client-side inverse of the agent-readable web (it makes the human web agent-readable without the site cooperating). Commercial Cloud API in closed beta.
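The agentmemory entry above says its BM25, vector, and knowledge-graph retrievers are fused via Reciprocal Rank Fusion. A minimal sketch of that general technique follows. It is not agentmemory's code: the function name, the document IDs, and the constant `k = 60` (a commonly used default for RRF) are assumptions made here. Each document scores the sum of `1 / (k + rank)` over every ranking it appears in.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several best-first rankings into a single best-first list."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs from three retrievers over the same memory store.
bm25_hits = ["a", "b", "c"]
vector_hits = ["b", "c", "a"]
graph_hits = ["b", "a", "d"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits, graph_hits])
print(fused)
```

Because only ranks are used, the retrievers' raw scores never need to be put on a common scale, which is the usual reason for choosing RRF when mixing lexical and vector search.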
Adjacent: long-running agent showcases
- ClaudePlaysPokemon — `[Reddit signal — r/ClaudeCode 2026-05-07]` Opus 4.7 run currently streaming live at twitch.tv/claudeplayspokemon. Passion project by David Hershey (Anthropic Applied AI team), started June 2024 to learn agent development; went public when Sonnet 3.7 launched February 2025. Anthropic doesn’t own it but promotes it and subsidizes the API costs since Claude is the model. Useful as a publicly-observable benchmark of long-horizon agent capability — what the model does on a single complex environment given multi-day continuous compute. Source: `raw/reddit-1t5y55h.md` (r/ClaudeCode, 41 upvotes).