Source: raw/7_INSANE_loops_you_need_to_try_right_now.md — Matthew Berman, “7 INSANE loops you need to try right now” (YouTube, https://www.youtube.com/watch?v=F4a8aMLb678, 2026-06-19). Cross-posted to X as “LOOPS — everything you need to know” (https://x.com/MatthewBerman/status/2068027152164139196, same 14:37 video; on-screen Loop Library state confirmed from the X video’s frames).

Matthew Berman’s dedicated explainer on agent loops — the cleanest single-source articulation of his mental model and a screen-tour of seven concrete, copy-pasteable loops from his Loop Library. The reproducible payload is the trigger × goal framework and the seven verbatim loop prompts below; treat the macro framing (“loops are the single biggest unlock in AI software right now”) as creator opinion. This is the third Berman source in the topic and is positioned to be the canonical framework + full catalog reference: The Best Way to Vibe Code covers his broader workflow (cloud-vs-local, the flywheel, the “wait until you see X” automation trick) and Loop Library documents the resource itself — this article pulls the loop theory and the exact prompts into one place.

Key Takeaways

  • A loop = a trigger + a goal. That is the whole model. The trigger kicks it off; the goal is the stop condition the agent works toward autonomously, “removing the human” from the prompt-review-reprompt cycle.
  • Three trigger types: manual (you tell the agent to go), scheduled (a time-of-day or recurring run), and action (fires on an event like “pull request opened”). Full autonomy means not triggering manually — but sometimes manual is still required.
  • Two goal types: verifiable (a concrete, deterministic test — “100% test coverage”, “every page loads under 50 ms”) or LLM-as-judge (the model decides when the goal is met — “refactor until satisfied”). Verifiable goals make a loop easy and robust; LLM-as-judge goals make it brittle because taste and judgment are handed to the model.
  • Seven worked loops (full catalog + verbatim prompts below): sub-50ms page-load · overnight docs sweep · architecture satisfaction · logging coverage · production error sweep · SEO/GEO visibility · full product evaluation.
  • You run a loop with /goal. Berman uses the /goal command in Codex (and notes Claude Code also has a /goal feature): paste the prompt, append /goal, and the agent continues until the condition is met — “might run for 10 minutes, might run for 10 hours.” Scheduled loops use the agent’s automations tab.
  • Two hard caveats: (1) not every problem fits a loop yet — designing the goal is the hard part, and Berman has not found a way to build net-new features with loops; (2) loops are expensive — they churn tokens autonomously until the goal is hit, so they favor “token maxers” over anyone on a tight budget.
  • The Loop Library is the live catalog these come from — on-screen it shows “running 13 loops,” last updated June 17 2026, each entry numbered (#01–#13) with a copy button and an explicit “verify / stop” condition. Berman built and deployed the site via here.now (the video’s stated partner) under the Forward Future brand.

What a loop is — the trigger × goal framework

Berman’s definition is deliberately minimal: “A loop is a way to allow your AI coding agent to work autonomously toward a specified goal.” You need exactly two things.

The trigger (how it starts) — three kinds:

  • Manual — you literally tell the agent “go do this loop.”
  • Scheduled — run at a set time or on a repeat (e.g. every night).
  • Action — fire on an event, such as a PR being opened.

The goal (when it stops) — two kinds:

  • Verifiable — a concrete number or deterministic check. Example: 100% test coverage. You know for sure when it’s true, and there’s a clean way to test it. This is the ideal — “a very concrete, well-defined goal really makes building a loop easier.”
  • LLM-as-judge — the model decides when the goal is reached. Example: refactor until satisfied — “you, as the LLM, get to determine when we are satisfactorily refactored enough.” Useful when no deterministic test exists, but more brittle because “we are leaving taste and judgment up to the model.”

This 3 × 2 grid is the article’s load-bearing contribution to the topic: it sharpens the looser trigger/action/goal language used elsewhere in the wiki (see Write Loops, Not Prompts) into a clean classifier you can apply to any candidate loop before building it.
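One way to make the grid concrete is to encode it as data, so every candidate loop has to declare its trigger and goal type before anyone builds it. This is a minimal sketch, not anything shown in the video: `Trigger`, `GoalType`, `LoopSpec`, and `is_brittle` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Trigger(Enum):
    MANUAL = "manual"        # you tell the agent to go
    SCHEDULED = "scheduled"  # a set time or a recurring run
    ACTION = "action"        # fires on an event, e.g. a PR being opened


class GoalType(Enum):
    VERIFIABLE = "verifiable"      # a deterministic check decides when to stop
    LLM_AS_JUDGE = "llm-as-judge"  # the model decides when to stop


@dataclass(frozen=True)
class LoopSpec:
    """Hypothetical record describing one loop on the trigger x goal grid."""
    name: str
    trigger: Trigger
    goal_type: GoalType
    stop_condition: str

    def is_brittle(self) -> bool:
        # Per the framework, handing judgment to the model is the brittle case.
        return self.goal_type is GoalType.LLM_AS_JUDGE


page_load = LoopSpec(
    "Sub-50ms page-load", Trigger.MANUAL, GoalType.VERIFIABLE,
    "Every page loads under 50 ms",
)
docs_sweep = LoopSpec(
    "Overnight docs sweep", Trigger.SCHEDULED, GoalType.LLM_AS_JUDGE,
    "Docs match current implementation; PR opened",
)
```

Classifying a loop this way surfaces the brittle cases early: anything where `is_brittle()` returns `True` is a candidate for explicit judge criteria.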

The 7-loop catalog

Every loop below is a real entry in the Loop Library. The table classifies each by goal type and trigger; the verbatim prompts follow (these are the copy-paste payload).

| # | Loop | Goal type | Typical trigger | Stop condition |
| --- | --- | --- | --- | --- |
| 03 | Sub-50ms page-load | Verifiable | Manual (or PR-open / scheduled) | Every page loads under 50 ms |
| 01 | Overnight docs sweep | LLM-as-judge | Scheduled (nightly) | Docs match current implementation; PR opened |
| 02 | Architecture satisfaction | LLM-as-judge | Manual or nightly | Architecture satisfactory + checks pass |
|  | Logging coverage | LLM-as-judge | Manual | Every important path has useful, tested logs |
|  | Production error sweep | Verifiable-ish | Scheduled (nightly) | No actionable errors remain; PR + Slack ping |
|  | SEO/GEO visibility | Verifiable | Scheduled (weekly) | No critical technical issues remain |
| 13 | Full product evaluation | LLM-as-judge | Manual | Every scenario meets the original quality bar |

The verbatim prompts (lift these directly):

  • Sub-50ms page-load (Berman’s favorite; verifiable) — “Continue optimizing the code for speed. After each significant change, measure page-load performance across every page under the same repeatable test conditions. Continue until every page loads in under 50 ms.” He ran it as a production goal for ~50 minutes; it walked every page/window/modal and optimized until each was under the threshold.
  • Overnight docs sweep (LLM-as-judge) — “Each night, review the codebase in full and make sure all documentation reflects the latest changes from the previous day. Update the documentation as needed, then open a pull request with those changes.”
  • Architecture satisfaction (LLM-as-judge; credited to Peter Steinberger) — “Refactor until you are happy with the architecture. After each significant step, live-test the system, run autoreview, and commit. Track progress in /tmp/refactor-[projectname].md.” You can sharpen the judge with guidance like “be very strict about simplicity” or “make sure every line of code is DRY.” The progress markdown file means the loop tracks itself as it loops.
  • Logging coverage (LLM-as-judge) — “Review the system’s logging and add missing coverage until every important path produces useful, tested logs.” “Important” is non-deterministic — the LLM decides what matters. This loop is the prerequisite for the next one.
  • Production error sweep (verifiable-ish) — “Every night, review our production logs for errors. If you find an actionable issue, trace it to its root cause, fix it, verify the fix, and open a pull request. Then ping me in Slack with the findings and PR link. If no actionable errors are present, ping me with that result instead.” Goal: no more unaddressed errors in the logs — and it only works if the logging-coverage loop has run first.
  • SEO/GEO visibility (verifiable) — “Run an SEO/GEO audit across crawlability, indexation, page intent, titles, internal links, structured data, source citations, and answer-first content. Rank the gaps. Fix the highest-leverage issues. Rerun the same crawl. Repeat until no critical technical issues remain.” Berman suggests running it weekly.
  • Full product evaluation (LLM-as-judge; “my most hand-wavy loop, but it really works”) — “Create N realistic scenarios covering every major capability. Before testing, define clear success criteria and choose a consistent evaluation method such as pass/fail checks or a scoring rubric. Run every scenario under the same conditions and record evidence for each outcome. Fix the underlying cause of anything that does not meet the criteria, rerun the affected scenarios, and then rerun the complete test. Continue until every scenario meets the original quality bar.” It’s “kind of like a test suite, but non-deterministic” — the model walks every use case and decides if it’s good enough. Can run 12+ hours. Customizable per app (his example: “come up with 100 wide-ranging use cases for asking the LLM questions and judge whether the response is good enough; if not, iterate”).
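What makes the sub-50ms loop "verifiable" is that its stop condition reduces to a deterministic check over measurements. The sketch below shows that check under stated assumptions: the page paths and timings are invented for illustration, and `slow_pages` and `goal_met` are hypothetical helpers, not part of Codex, Claude Code, or the Loop Library. Collecting the timings under repeatable conditions is left to whatever measurement tool you already use.

```python
from typing import Mapping

THRESHOLD_MS = 50.0  # the threshold named in the prompt


def slow_pages(timings_ms: Mapping[str, float],
               threshold_ms: float = THRESHOLD_MS) -> list[str]:
    """Return pages that still violate the goal, slowest first."""
    # "Under 50 ms" is strict, so a page at exactly the threshold still fails.
    over = [page for page, ms in timings_ms.items() if ms >= threshold_ms]
    return sorted(over, key=lambda page: timings_ms[page], reverse=True)


def goal_met(timings_ms: Mapping[str, float],
             threshold_ms: float = THRESHOLD_MS) -> bool:
    """True only when at least one page was measured and none is too slow."""
    # An empty measurement set must not count as success.
    return bool(timings_ms) and not slow_pages(timings_ms, threshold_ms)


# Illustrative numbers only.
timings = {"/": 31.0, "/pricing": 72.5, "/docs": 50.0}
print(slow_pages(timings))  # pages the next iteration should target
print(goal_met(timings))
```

Returning the offending pages, rather than a bare boolean, gives the agent something to act on in the next iteration.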

Loop composition

The logging-coverage and production-error-sweep loops are explicitly designed to chain — “these two loops together, you start to see how loops become so powerful.” First guarantee logs exist everywhere, then run a loop that consumes those logs to fix errors. This is the same flywheel idea documented in The Best Way to Vibe Code.
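The chaining described above is a dependency ordering: the error sweep consumes what the logging loop produces, so logging must run first. Here is a small sketch using the standard library's `graphlib`; the loop identifiers and the `PREREQUISITES` mapping are illustrative assumptions, not a Loop Library format.

```python
from graphlib import TopologicalSorter

# Each loop maps to the set of loops that must have run before it.
PREREQUISITES = {
    "logging-coverage": set(),
    "production-error-sweep": {"logging-coverage"},
}


def run_order(prereqs: dict[str, set[str]]) -> list[str]:
    """Return loop names ordered so every prerequisite comes first."""
    return list(TopologicalSorter(prereqs).static_order())


print(run_order(PREREQUISITES))
```

With only two loops the ordering is obvious, but the same mapping scales if more loops start feeding each other, and `TopologicalSorter` raises `CycleError` if two loops are ever declared as each other's prerequisite.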

How you actually run one — /goal and automations

The mechanics Berman demonstrates on screen:

  • Manual / verifiable → /goal. Copy the prompt from the Loop Library, paste it into the agent, and append the /goal command. “As soon as you have this /goal, it’s telling Codex to continue working until the condition is met.” It’s a Codex feature; Berman notes Claude Code also has a /goal feature. (See the wiki’s [[claude-ai/claude-code-goal-command-walkthrough|/goal walkthrough]] for the Claude Code completion-condition loop.) You hit go and watch the token budget — “it might run for 10 minutes, it might run for 10 hours.”
  • Scheduled → automations tab. For the overnight docs sweep, Berman opens Codex’s automations tab, “create via chat,” pastes the prompt, and sets it up as a recurring nightly automation. This is the [[claude-ai/scheduled-tasks|/loop & scheduled tasks]] / Routines equivalent — the scheduling heartbeat that turns a one-shot loop into a standing one.
  • Action-triggered. Wire a loop to an event like “PR opened” (e.g. re-run the sub-50ms loop on every new PR so a change can’t regress page load). The automation-event mechanics and the “wait until you see X” sequencing trick are detailed in The Best Way to Vibe Code.
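Conceptually, "continue working until the condition is met" is a while-loop around a goal check. The sketch below illustrates that control flow only and makes no claim about how Codex or Claude Code implement /goal. `run_until`, `step`, and `goal_met` are hypothetical names, and the iteration cap is an added safety assumption that the video does not describe.

```python
from typing import Callable


def run_until(step: Callable[[], None],
              goal_met: Callable[[], bool],
              max_iterations: int = 100) -> int:
    """Call step() until goal_met() is true; return how many steps ran."""
    iterations = 0
    while not goal_met():
        if iterations >= max_iterations:
            raise RuntimeError(f"goal not met after {max_iterations} iterations")
        step()  # one unit of agent work, e.g. an optimization pass
        iterations += 1
    return iterations


# Toy stand-in for an agent: each pass shaves 10 ms off the slowest page.
state = {"slowest_ms": 85.0}


def optimize() -> None:
    state["slowest_ms"] -= 10.0


steps_taken = run_until(optimize, lambda: state["slowest_ms"] < 50.0)
print(steps_taken, state["slowest_ms"])
```

The goal is checked before each step, so a loop whose condition already holds does no work at all.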

The two caveats

Berman is unusually candid about where loops break — a useful counterweight to the hype framing.

Caveat 1 — loops are not for every problem (yet). Designing a loop “isn’t always easy; specifically, coming up with the goal is not easy.” Verifiable goals (“every page loads under 50 ms”) are perfect. LLM-as-judge goals are “more brittle because we are leaving taste and judgment up to the model.” This gets much harder for building features:

  • “I’ve not really found a way to build features with loops. You cannot say ‘loop until we build a full permissioning system.’” The problem is direction: you don’t know what the AI will build, when, or how it’ll decide which features are worthwhile. So loops are not good for day-zero feature building.
  • The one from-scratch example he tried: a goal to clone Excel to feature parity. The agent used computer use to open Excel on his machine and click through verifying parity — and “it was running for days and days and days until I finally stopped it. I do not recommend doing that.”

This directly informs the topic’s open question on runtime ambiguity and where the human stays in the loop (see research agenda and The Verification Frontier): the verifiable/LLM-judge split is the practical answer to “against what standard does the loop check itself” — and feature-building currently has no good standard.

Caveat 2 — loops are expensive. They “churn through tokens autonomously until they hit the goal. Some agents run for 10 minutes, some run for days.” His framing: “For token maxers, loops are fantastic. For those without an unlimited token budget, this might not work for you today.” This is the same economics gate quantified in Should You Build a Loop? — keep a close eye on any running /goal loop if you’re budget-constrained.
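If you are budget-constrained, one defensive pattern is to give every loop a hard token cap and stop when it is exceeded. The sketch below is an assumption-heavy illustration: `TokenBudget` and `BudgetExceeded` are hypothetical names, the numbers are invented, and how you obtain per-iteration token counts depends on your agent and is not covered in the video.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a loop has spent more tokens than it was allowed."""


class TokenBudget:
    """Hypothetical spend tracker for one running loop."""

    def __init__(self, limit: int) -> None:
        if limit <= 0:
            raise ValueError("limit must be positive")
        self.limit = limit
        self.spent = 0

    @property
    def remaining(self) -> int:
        return max(self.limit - self.spent, 0)

    def charge(self, tokens: int) -> None:
        """Record one iteration's usage; raise once the cap is exceeded."""
        self.spent += tokens
        if self.spent > self.limit:
            raise BudgetExceeded(f"spent {self.spent} of {self.limit} tokens")


# Illustrative numbers only.
budget = TokenBudget(limit=1_000_000)
budget.charge(400_000)
budget.charge(400_000)
print(budget.remaining)  # tokens left before the loop must stop
```

A cap like this turns "watch the token spend" from a manual habit into a second stop condition alongside the goal itself.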

Already covered elsewhere (pointers, not repetition)

Three of these loops and most of the operational scaffolding are already documented — read those for depth instead of re-reading here:

  • The three nightly loops in prose (overnight docs sweep, sub-50ms, production error sweep), the test/docs/logging flywheel, cloud-vs-local decision criteria, the “wait until you see X” automation trick, and parallel-merge-still-unsolved → The Best Way to Vibe Code.
  • The Loop Library resource itself (catalog, submission path, the architecture-satisfaction loop as a verifier-first example) → Loop Library (Forward Future).
  • The verification discipline the LLM-as-judge loops lean on (write the verifier first; keep proof outside the agent) → Verifier-First Loops.

Sponsor disclosure

The YouTube cut carries a paid mid-roll for DigitalOcean (production-inference infrastructure), and Berman promotes his team’s free loop-consulting sessions and the Loop Library. None of that is part of the loop technique; treat it as sponsorship/self-promotion. The reusable content is the framework, the seven prompts, and the /goal mechanics.

Related

  • Agent Loops (topic index) — the learning path this slots into as the canonical framework + catalog reference.
  • The Best Way to Vibe Code (Matthew Berman) — Berman’s broader workflow video; the prose writeups of the nightly loops, the flywheel, and cloud-vs-local live there.
  • Loop Library (Forward Future) — the resource these seven loops are pulled from.
  • Write Loops, Not Prompts — the one-sentence loop definition and three beginner starter loops; the trigger × goal grid here sharpens it.
  • Verifier-First Loops — the verification discipline behind every LLM-as-judge loop above.
  • Agent Loops, Clearly Explained (Nate Herk) — a grounded sibling explainer (Reason-Act-Observe; “most tasks don’t need fleets”) that runs two of this Library’s loops as worked /goal demos.
  • Should You Build a Loop? — the cost/security decision layer behind Caveat 2.
  • [[claude-ai/claude-code-goal-command-walkthrough|/goal Walkthrough]] — the Claude Code completion-condition command Berman references.
  • [[claude-ai/scheduled-tasks|/loop & scheduled tasks]] — the scheduling mechanism behind the nightly automations.

Try It

  1. Classify before you build. Take any task you want to automate and place it on the grid: pick a verifiable goal if you possibly can (a number, a passing check). If the only goal is LLM-as-judge, expect brittleness and write the judge criteria explicitly.
  2. Steal the lowest-risk prompt first. The production-error-sweep or overnight-docs-sweep is the safest start — copy the verbatim prompt above, paste it into Codex or Claude Code, and set it up in the automations tab as a nightly run.
  3. Run a verifiable loop with /goal. Try the sub-50ms loop (or substitute your own number) and append /goal; watch the first run end-to-end and keep an eye on token spend.
  4. Chain logging → error sweep. Run the logging-coverage loop once to guarantee logs exist on every important path, then schedule the production-error-sweep loop that consumes them.
  5. Do not loop on feature-building. Per Berman, keep loops on bounded, checkable goals (perf, docs, logging, errors, audits, refactors, eval) — not “build me feature X” — until the goal-design problem is better solved.

Open Questions

  • Where exactly is the brittleness line for LLM-as-judge goals? Berman flags taste/judgment as the failure point but gives no rule for when an LLM-judge loop is trustworthy vs. when it drifts.^[inferred — the video states the brittleness exists but not how to bound it]
  • Is feature-building-by-loop genuinely impossible or just unsolved? The Excel-clone-via-computer-use anecdote shows it runs but never converges — open whether a vision.md-style constraint (see the research agenda) would give it a stop condition.
  • /goal scope across tools — Berman uses it in Codex and says Claude Code has it too, but the precise differences vs. /loop, Routines, and Dynamic Workflows remain the topic’s standing boundary question.