Loop Engineering — Cobus Greyling's Cross-Tool Pattern Reference + CLIs

Source: ai-research/loop-engineering-cobusgreyling.md (github.com/cobusgreyling/loop-engineering — MIT, 162★/25 forks, created 2026-06-09, last push 2026-06-13)

cobusgreyling/loop-engineering is a practical, tool-aware reference for loop engineering — the discipline of replacing yourself as the person who prompts the agent by designing the system that does it instead. It turns the abstract idea (named in essays by Addy Osmani and Cobus Greyling, and in Boris Cherny’s “my job is to write loops” framing) into copyable assets: a six-primitive taxonomy, seven production loop patterns, an L0→L3 readiness ladder, an incident-style failure-mode catalog, side-by-side examples across Grok / Claude Code / Codex / GitHub Actions, and three published npm CLIs (loop-init, loop-audit, loop-cost). The repo dogfoods itself — it runs its own daily-triage and validate-patterns/audit loops on every push.

Key Takeaways

Definition. A loop is a recursive goal: you define a purpose and the agent iterates (with sub-agents, verification, and external state) until done or until it escalates to a human. “Loop engineering is replacing yourself as the prompter.”
The leverage shift. From crafting individual prompts → designing the control system that orchestrates agents over time. Boris Cherny (Head of Claude Code): “I don’t prompt Claude anymore. I have loops running that prompt Claude… My job is to write loops.” Peter Steinberger and Addy Osmani make the same call.
Six primitives (+ Memory). Automations/Scheduling · Worktrees · Skills · Plugins & Connectors (MCP) · Sub-agents (maker/checker) · + Memory/State — the durable spine outside any conversation.
Seven patterns, each with cadence, risk, starter kit, and week-one readiness level: Daily Triage, PR Babysitter, Issue Triage, CI Sweeper, Post-Merge Cleanup, Dependency Sweeper, Changelog Drafter.
Readiness ladder. L0 Draft → L1 Report-only → L2 Assisted (small auto-fixes + verifier) → L3 Unattended (with denylist, budget, metrics, human gates). Never skip L1 for a new pattern on a production repo.
Honest about cost and risk. Token burn, comprehension debt, and “cognitive surrender” are named failure modes, not footnotes. Two people can run the same loop and get opposite results — the loop doesn’t know; you do.
The CLIs are thin but real: loop-audit (0–100 Loop Readiness Score, v1.4.1), loop-init (scaffolder, v1.2.1), loop-cost (token-spend estimator, v1.0.2). All on npm; no clone required.

The Six Primitives (+ Memory)

The building blocks every loop assembles from. A minimal viable loop is scheduling + one triage skill + a state file; you add the rest only once the prior version proves its value (and its failure modes).

Primitive	Job in the loop	Realizations
Automations / Scheduling	The heartbeat — discovery + triage on a cadence	`/loop` (Grok/Claude Code), Claude Code scheduled tasks/cron, GitHub Actions + dispatch, `/goal` (run until a verifiable condition)
Worktrees	Safe parallel execution without merge hell	`isolation: "worktree"` on spawned sub-agents; one worktree per fix, deleted after verify/handoff
Skills	Persistent memory of intent (conventions, build/test commands, “we don’t do it this way”)	`SKILL.md` + scripts; the cure for intent debt
Plugins & Connectors (MCP)	Reach into real tools — Linear/Jira, Slack, GitHub PRs, DBs, deploys	MCP as the common substrate; scope read+comment until trusted
Sub-agents (maker/checker)	The single most important reliability pattern — the implementer never grades its own homework	Explorer→Implementer→Verifier; verifier on a stronger model for unattended loops
+ Memory / State	Durable spine across sessions — “what are we working on, what did we try, what’s waiting on a human”	`STATE.md` / `LOOP-STATE.json` / a Linear section; often the single most important artifact the loop produces
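
The "scheduling + one triage skill + a state file" minimum can be sketched as a single report-only pass. This is an illustration under stated assumptions, not code from the repo: `triage` is a hypothetical callable standing in for the triage skill, the `open_items` and `last_run` field names are invented for the example, and scheduling is left to whatever runs the function (cron, `/loop`, a GitHub Actions workflow).

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable


def run_once(state_path: Path, triage: Callable[[dict], list[str]]) -> dict:
    """One report-only pass: load state, triage, record findings, save state."""
    # Load prior state, or start fresh on the first run.
    if state_path.exists():
        state = json.loads(state_path.read_text())
    else:
        state = {"open_items": []}

    # `triage` is a hypothetical stand-in for the triage skill; it only reports.
    findings = triage(state)

    # Merge findings into tracked items, keeping order and dropping duplicates.
    state["open_items"] = list(dict.fromkeys(state["open_items"] + findings))
    state["last_run"] = datetime.now(timezone.utc).isoformat()

    # The state file is the durable spine: it outlives any one session.
    state_path.write_text(json.dumps(state, indent=2))
    return state
```

Calling `run_once(Path("LOOP-STATE.json"), my_triage)` on a schedule gives an L1-style loop: it writes findings to state and takes no other action.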

The Seven Patterns

Each pattern ships with cadence, risk rating, a clone-and-run starter (Grok/Claude Code/Codex), week-one readiness level, and tool-specific notes. Machine-readable index lives in patterns/registry.yaml.

Pattern	Cadence	Risk	Week 1	Token cost
Daily Triage	1d–2h	Low	L1 report	Low
PR Babysitter	5–15m	Medium	L1 watch	High
CI Sweeper	5–15m	Medium	L2 cautious	Very high
Dependency Sweeper	6h–1d	Medium	L2 patch-only	Medium
Changelog Drafter	1d / on-tag	Low	L1 draft	Low
Post-Merge Cleanup	1d–6h	Low	L1 off-peak	Low
Issue Triage	2h–1d	Low	L1 propose-only	Low

Best first loops are the low-risk, low-cost, L1 ones (Daily Triage, Changelog Drafter, Post-Merge Cleanup) — high value, report-only, hard to do damage. The repo’s own pattern-picker.md and an interactive web picker route you to one.

Readiness Ladder + Upgrade Path

The spine of the operating discipline (the loop-design-checklist.md rubric scores 10 sections against these):

L0 — Draft: documented intent only.
L1 — Report: triage → state file, no auto-action. Run here 1–2 weeks before anything writes code.
L2 — Assisted: small auto-fixes, each gated by a separate verifier + worktree + max-attempts cap.
L3 — Unattended: runs without you watching — only with denylist, token budget, metrics, and explicit human gates.

loop-audit enforces this: L3 is capped until a project has a verifier, state, cost observability (budget + run-log + LOOP.md budget section), and proven loop activity (real “Last run” timestamps / loop commits / scheduled workflows — not just files on disk).
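
The gating idea (a level is capped until its evidence exists) can be sketched as a small function. This is a simplified illustration of the ladder as described above, not loop-audit's actual scoring; the `LoopEvidence` fields and the one-week threshold are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class LoopEvidence:
    has_state_file: bool
    has_verifier: bool
    has_budget_and_run_log: bool  # cost observability
    has_recent_activity: bool     # e.g. real "Last run" timestamps
    weeks_at_l1: int = 0          # how long the loop has run report-only


def max_allowed_level(e: LoopEvidence) -> int:
    """Return the highest ladder level (0-3) that the evidence supports."""
    if not e.has_state_file:
        return 0  # L0: documented intent only
    if not e.has_verifier or e.weeks_at_l1 < 1:
        return 1  # L1: report-only until a verifier exists and L1 has run
    if e.has_budget_and_run_log and e.has_recent_activity:
        return 3  # L3: verifier, state, cost observability, proven activity
    return 2      # L2: assisted, with a separate verifier
```

A full gate would also check the denylist, metrics, and human gates that L3 requires; they are omitted here to keep the sketch short.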

A Minimum-Viable First Loop for a Non-Engineer Operator (research-agenda drain, 2026-07-03)

The topic’s research agenda asked what a minimum-viable first loop looks like end-to-end for a non-engineer operator (its own example: a WEO marketing surface), and how that maps onto the L0→L3 ladder above. The wiki already has a fully worked answer — it just lived outside this topic, in the WEO Intermediate course’s Automation Primitives module, and hadn’t been cross-linked back here.

The worked example: a weekly Smile Springs competitive-intel sweep, built as a Routine. No code, no terminal, no .mcp.json — set up once in the claude.ai Routines pane. It runs Sunday 11pm on a cron trigger, sweeps three competitor sites (budget-capped at 6 Tavily searches per run), diffs against a saved, date-stamped JSON snapshot from the prior week, and posts a Slack-ready digest plus a “top 3 takeaways” section. A human (the marketing director) reads it Monday morning and decides what to act on. Nothing publishes, merges, or ships without that read.

Where it sits on the ladder — and why the ladder needs a footnote for this case:

It maps to L1 (report-only) on the dimension that matters for risk: the loop takes no consequential external action. It doesn’t post to a client, doesn’t publish content, doesn’t spend money — it produces a digest a human reads and acts on manually.
It’s slightly ahead of a pure-L0/L1 reading on one axis: it autonomously writes state (the snapshot file) every run without a human gate. The Automation Primitives module’s own idempotency rule — write to a date-stamped path so re-runs don’t clobber or duplicate — is exactly the mechanical safeguard that makes an unattended state write compatible with L1 risk tolerance even though a pure reading of “L1 = no auto-action” would flag it. Memory writes that are idempotent and reversible are a different risk class than actions with external consequence.
Graduating to L2 for a non-engineer operator, per the same module’s “Builder basic” variant, means adding exactly one narrow auto-action with its own separate check — a PostRoutineRun hook that pushes the digest into a CRM note. Still not a decision (no judgment call, no publish), just data plumbing with a deterministic trigger. That’s the correct shape of a first L2 step: mechanical, reversible, and gated by something other than the same model that produced the content.
Full L3 (denylist + budget + metrics + proven unattended activity) is explicitly out of reach for a no-code Operator track alone — it requires a Builder-track collaborator to wire the verifier, the cost observability, and the escalation path this article’s own ladder demands before anything runs unattended and consequential.
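
The idempotency rule discussed above (write state to a date-stamped path, then diff against the prior snapshot) can be sketched as follows. A Routine would express this in its prompt rather than in code; the file naming scheme, function names, and snapshot shape (a flat dict keyed by site) are assumptions made for illustration.

```python
import json
from datetime import date
from pathlib import Path


def write_snapshot(root: Path, run_date: date, data: dict) -> Path:
    """Write to a date-stamped path; a re-run on the same date overwrites it."""
    root.mkdir(parents=True, exist_ok=True)
    path = root / f"snapshot-{run_date.isoformat()}.json"
    path.write_text(json.dumps(data, indent=2, sort_keys=True))
    return path


def previous_snapshot(root: Path, run_date: date) -> dict:
    """Load the newest snapshot dated before run_date, or {} if none exists."""
    current = f"snapshot-{run_date.isoformat()}.json"
    # ISO dates sort lexicographically, so a name comparison orders by date.
    older = sorted(p for p in root.glob("snapshot-*.json") if p.name < current)
    return json.loads(older[-1].read_text()) if older else {}


def diff(old: dict, new: dict) -> dict:
    """Report keys added, removed, or changed between two snapshots."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in new.keys() & old.keys() if new[k] != old[k]),
    }
```

Because the path is derived from the run date, running the sweep twice on the same Sunday leaves one file rather than two, which is what keeps the unattended write within L1 risk tolerance.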

The readiness-gate caveat this surfaces: this article’s own gate above — “if you can’t already comfortably run 2-3 parallel agent sessions yourself, building a loop is a bad idea” — was written with a Claude-Code / engineer audience in mind, where you are assembling the scheduler, the state file, and the verifier from primitives. A no-code Routine collapses almost all of that engineering into “write a good, bounded prompt,” because Anthropic’s Routines infrastructure already supplies the scheduling, retry, and isolation primitives. The gate still applies to anyone building the loop mechanism — it does not apply to an Operator authoring a well-scoped Routine prompt on top of a mechanism someone else already engineered. That distinction is the actual answer to “what’s the minimum floor for a non-engineer’s first loop”: not engineering competence, but the discipline to write an idempotent, budget-capped, human-reviewed prompt.

Failure-Mode Catalog (the most reusable part)

An incident-style catalog rated S1 (annoying) / S2 (harmful) / S3 (critical) — usable as a debugging checklist for any loop, not just ones built from this repo:

Infinite Fix Loop (S2) — same PR/CI gets 5+ fix attempts, never converges. Mitigate: hard cap (≈3) → escalate; separate/stronger verifier; classify flakes in triage; record attempt count in state.
Verifier Theater (S2) — verifier “approves” but CI fails. Mitigate: verifier must actually run tests/lint and report output; different instructions (“find reasons to reject”); different model+context from the implementer.
State Rot (S1→S2) — STATE.md references merged PRs / closed tickets; loop acts on ghosts. Mitigate: prune every run; validate IDs against live API; one state file per pattern.
Token Burn (S1) — sub-minute cadence runs full sub-agent chains on empty triage. Mitigate: cheap triage-only pass first, spawn sub-agents only when state says actionable; early-exit on empty watchlist (<5k tokens); daily budget → pause.
Notification Fatigue (S1→S2) — pings every run, team mutes the bot, real escalations missed. Mitigate: notify only when a human decision is required; digest mode.
Over-Reach / Wrong Scope (S2→S3) — loop refactors unrelated modules or touches denylisted paths. Mitigate: enforced path denylist; “smallest possible diff”; triage = signal only, no invention.
Comprehension Debt Spiral & Cognitive Surrender (S2, long-term/cultural) — velocity up but nobody can explain recent changes; “the loop handles it.” Mitigate: mandatory human review for non-trivial PRs; weekly loop-digest; success metric = time saved with the quality bar held.
Parallel Collision (S2) & Escalation Failure (S2) — sub-agents edit the same files; or loop retries forever and never pings a human. Mitigate: isolation: worktree for all code-editing sub-agents; connector ping on escalation + alert if an item waits >24h.
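
Three of the mitigations above (the attempt cap, a verifier separate from the implementer, and an enforced path denylist) compose into one guard around each fix attempt. The sketch below is illustrative only: `implement` and `verify` are hypothetical callables standing in for the maker and checker sub-agents, and the denylisted prefixes are invented examples.

```python
from typing import Callable

MAX_ATTEMPTS = 3  # the hard cap (≈3) named in the catalog
DENYLIST = ("infra/", "migrations/")  # hypothetical protected path prefixes


def touches_denylist(paths: list[str]) -> bool:
    """True if any changed path falls under a denylisted prefix."""
    return any(p.startswith(DENYLIST) for p in paths)


def bounded_fix(
    item: str,
    state: dict,
    implement: Callable[[str], list[str]],
    verify: Callable[[list[str]], bool],
) -> str:
    """Try one fix for `item`; return 'fixed', 'retry', or 'escalate'."""
    attempts = state.setdefault("attempts", {})
    if attempts.get(item, 0) >= MAX_ATTEMPTS:
        return "escalate"  # cap reached: hand to a human instead of looping

    attempts[item] = attempts.get(item, 0) + 1  # record the attempt in state
    changed = implement(item)  # maker: returns the paths it changed

    if touches_denylist(changed):
        return "escalate"  # over-reach: never auto-apply denylisted changes
    # Checker: a separate callable, so the maker never grades its own work.
    return "fixed" if verify(changed) else "retry"
```

Keeping the attempt count in `state` rather than in a local variable is what lets the cap survive across scheduled runs.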

Vocabulary It Standardizes

Glossary linking loop engineering to the surrounding agentic-dev ideas (mostly Osmani’s): Agent Harness Engineering (the single-session sandbox; the loop = harness + schedule + state + verification), the Factory Model (the system that builds the software), Intent Debt (cold-start guesses; paid down by skills), Comprehension Debt (gap between what exists and what you understand), Orchestration Tax (human cost of coordinating parallel agents), and Code Agent Orchestra / Adversarial Code Review (different agents for explore/implement/verify).

Examples by Tool

The repo’s examples-by-tool section, the most directly useful comparison, implements the same patterns four ways:

Grok — native /loop, isolation: "worktree", scheduler_create/scheduler_delete; the repo’s primary target (the Grok Build TUI has strong native primitive support).
Claude Code — /loop 1d Run $loop-triage. Read STATE.md...; /goal for one-shot “get main green” with a fresh model checking the stop condition; verifier as a .claude/agents/loop-verifier.md sub-agent; isolation: worktree.
Codex — Automations + .codex/agents/verifier.toml.
GitHub Actions — cron/dispatch workflows (the repo’s own daily-triage.yml is the live example).

An MCP cookbook (examples/mcp/) supplies read-only and safe-write connector configs (GitHub read-only, GitHub propose, Linear, Slack-read) plus a “safe write pattern.”

Implementation

Tool/Service: cobusgreyling/loop-engineering (MIT). Three npm packages + a clone-and-copy reference repo + a GitHub-Pages interactive showcase. Setup (no clone needed):

npx @cobusgreyling/loop-init . --pattern daily-triage --tool grok   # scaffold starter + loop-budget.md + loop-run-log.md
npx @cobusgreyling/loop-cost  --pattern daily-triage --level L1      # estimate daily token spend
npx @cobusgreyling/loop-audit . --suggest                           # 0–100 Loop Readiness Score + next steps

Cost: free/OSS. The point of loop-cost is that the real cost is your loops’ token spend: cadence is a linear multiplier (a 5m cadence runs 288× as often per day as a 1d cadence), and a 15m CI-sweeper running full sub-agent chains every time is ~5M tokens/day (“avoid”). loop-audit exits with code 2 if score < 40 (CI-gate friendly).

Integration notes: loop-audit scores presence of state file, triage skill, verifier, LOOP.md, AGENTS.md/CLAUDE.md, safety docs, workflows, MCP config, worktree evidence, registry.yaml, budget + run-log, and (v1.4) dynamic loop activity. The CLIs are intentionally thin wrappers around the registry metadata; the durable value is the docs, patterns, and checklist.^[inferred, based on reading the tool source vs the docs]
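
The cadence arithmetic behind those cost figures can be checked with a few lines. This is a back-of-envelope sketch, not loop-cost's formula: the 52,000 tokens-per-run figure is an assumption back-derived from "~5M tokens/day at 15m", and the function names are invented for the example.

```python
MINUTES_PER_DAY = 24 * 60


def runs_per_day(cadence_minutes: float) -> float:
    """Cadence is a linear multiplier: halve the interval, double the runs."""
    return MINUTES_PER_DAY / cadence_minutes


def tokens_per_day(cadence_minutes: float, tokens_per_run: float) -> float:
    """Daily spend when every run costs the same (full chain every time)."""
    return runs_per_day(cadence_minutes) * tokens_per_run


def tokens_per_day_with_early_exit(
    cadence_minutes: float,
    full_run_tokens: float,
    triage_only_tokens: float,
    actionable_fraction: float,
) -> float:
    """Daily spend when only actionable runs spawn the full sub-agent chain."""
    per_run = (
        actionable_fraction * full_run_tokens
        + (1 - actionable_fraction) * triage_only_tokens
    )
    return runs_per_day(cadence_minutes) * per_run


print(runs_per_day(5) / runs_per_day(24 * 60))  # 288.0
print(tokens_per_day(15, 52_000))               # 4992000.0 (assumed per-run cost)
```

Under the same assumptions, if only 10% of runs are actionable and an empty triage pass costs 5k tokens, `tokens_per_day_with_early_exit(15, 52_000, 5_000, 0.1)` comes to roughly 0.93M tokens/day, which shows why the early-exit mitigation matters more than tuning the prompt.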

Try It

Read the failure-mode catalog and the design checklist even if you adopt nothing else — they’re the highest-signal, tool-agnostic parts and double as a review rubric for loops you already run.
Score an existing loop: npx @cobusgreyling/loop-audit . --suggest on a repo where you run /loop or scheduled tasks, and see where it lands on L0–L3.
Map it onto Claude Code’s native primitives you already have: /loop, /schedule, /goal, ScheduleWakeup, scheduled cloud agents (Routines), worktree isolation, and sub-agents — this repo is the discipline layer on top of those mechanisms, not a replacement for them.
Start a new loop at L1 report-only (Daily Triage or Changelog Drafter) for a week before letting anything write code.

Related

12-Factor Agents (HumanLayer) — sibling framework; “LLMs are stateless functions, own your control flow / context / prompts” is the same thesis loop engineering operationalizes over time.
Dynamic Workflows (Claude Code) — the Claude-Code-native loop mechanism (/loop, agent-loop genius-or-hype) this reference maps patterns onto.
Loop Engineering — Addy Osmani’s Essay — the canonical primary essay this repo turns into tooling (the repo was created 2026-06-09, the day after the essay); the five building blocks, the worked example, and the named risks originate there.
Loop Engineering: Getting Started with Loops (Anthropic, first-party) — the Claude Code team’s own four-type taxonomy (turn-based/goal-based/time-based/proactive); a narrower, first-party counterpart to this six-primitive catalog.
Verifier-First Loops — the verifier-first pre-flight that prevents this catalog’s Verifier Theater / Infinite Fix Loop failures.
Should You Build a Loop? — the four-condition decision test + 30-day security checklist that complement this repo’s loop-cost economics and failure catalog.
agent-skills (Addy Osmani) — Osmani’s companion essay; his spec-to-ship skill lifecycle is the “skills pay down intent debt” primitive made concrete.
Reflecting on a Year of Claude Code (Boris Cherny & Cat Wu) — primary source for the “my job is to write loops” framing and the /loop 30m /slack-feedback-style usage.
Ryan Carson’s Clawd Chief — “agents are cron jobs and markdown files” — the same loop thesis at solo-founder scale.
Browserbase Autobrowse and Reflexio — harness-self-improvement siblings: graduate a successful run into a reusable skill/playbook (maker/checker + skills-as-memory overlap).
2026 Claude Code AIOS Pattern — the markdown-config + scheduled-task “agent OS” framing loops slot into.
The Loop Is the Unit of Work — the cross-topic synthesis placing this catalog, Claude Code’s /loop/Routines, and the verification frontier as one pattern (prompt → harness → loop).
Agent Loops (topic) — the wiki’s learning hub for loops; start with the Write Loops, Not Prompts explainer (lineage + 3 starter loops), then this catalog as the reference layer.
WEO Intermediate Course — Module 5: Automation Primitives — the fully worked non-engineer minimum-viable loop (the Smile Springs weekly competitive-intel Routine) mapped onto this article’s L0→L3 ladder above.
Claude Automation Primitives — Routines, Managed Agents, Dispatch — the canonical decision tree for which Claude-native primitive a non-engineer reaches for.

Open Questions

The three CLIs are days old (loop-cost v1.0.2) and lightly tested; treat scores/estimates as directional, not authoritative.
No independent production-loop case studies yet beyond the author’s own stories/ (which honestly include why-we-killed-ci-sweeper.md); adopter evidence is thin at 162★.
