Source: ai-research/omarsar0-alphabatcher-verifier-first-loops-2026-06-14.md — Authors: Alpha Batcher (@alphabatcher), amplifying Elvis Saravia (@omarsar0, DAIR.ai) · URLs: https://x.com/alphabatcher/status/2066151044581634540 · https://x.com/omarsar0/status/2065491829760651328 · Posted: 2026-06-13/14. The @alphabatcher tweet is extracted verbatim; the @omarsar0 quotes are from his thread posts (his long-form X Article body was login-gated and not extracted — see Open Questions). The model-role split is @omarsar0’s stated personal formula, not a benchmark.

The load-bearing, still-underdeveloped piece of any loop is verification: against what standard do you accept the loop’s output? This article collects the sharpest 2026-06 statements of the verifier-first discipline: write the verifier before you launch /goal or /loop, and make the proof live outside the agent’s own explanation. The anchor is Karpathy’s one-liner, quoted by @alphabatcher: “If you can’t evaluate then you can’t auto research it, right?” — no evaluator, no autonomy.

Key Takeaways

  • Write the verifier first. Before you launch a long-running loop, @alphabatcher’s checklist says decide: (1) what counts as done, (2) which checks run every pass, (3) which artifact gets saved, (4) which failure sends it back into the loop. Then let the agent run.
  • Proof sits outside the agent. “The loop can keep going because proof sits outside the agent’s own explanation” — tests, screenshots, benchmark curves, browser runs, changed files. This is how you “get autonomy without babysitting a transcript for 6 hours.” A verifier that only reads the agent’s self-report is “two optimists agreeing” (cf. the Verifier Theater failure mode in the Cobus reference).
  • Split the model roles, not just the agents. @omarsar0’s current formula: Opus 4.8 to plan carefully, GPT-5.5 to execute, and a different model family to evaluate via /goal (DeepSeek / Qwen / Kimi / MiniMax). A separate evaluator model is harder for the maker to fool.^[the specific model assignments are @omarsar0’s stated personal setup, not a measured result]
  • Multimodal goals beat text goals. “A multimodal goal is a much stronger goal than a plain text one” — give the agent strong visual cues to compare against (a target screenshot, a reference render), and “use agents to help you set clear goals.”
  • What you’re defending against: models “pause the work early,” make “lots of mistakes,” and take “weird shortcuts (reward hacking).” The cure is extreme goal clarity — explicit dos and don’ts, “eliminate any assumptions you think the model would make.”
  • The task is the unit, not the agent. A community framing from the thread: “the ‘agent’ isn’t the long-running thing — the task is. Each /goal spawns and dies.” The manageable shape is goal → runtime job → transcript → verifier → result → approval or retry → task history.
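
The task-as-unit shape in the last takeaway can be sketched as a small record that outlives each job. This is a minimal illustration, not anything taken from the thread: `TaskRecord`, `Attempt`, and `run_task` are hypothetical names, the runtime job and verifier are plain callables, and a dictionary stands in for the real workspace. The verifier reads the workspace rather than the transcript, so an optimistic self-report does not count as proof.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Workspace = Dict[str, int]  # stand-in for a repo or build directory (assumption)


@dataclass
class Attempt:
    """One short-lived job: it spawns, leaves a transcript, and dies."""
    transcript: str
    passed: bool
    evidence: str


@dataclass
class TaskRecord:
    """The long-lived unit: a goal plus the history of every attempt."""
    goal: str
    history: List[Attempt] = field(default_factory=list)

    @property
    def approved(self) -> bool:
        return bool(self.history) and self.history[-1].passed


def run_task(
    goal: str,
    workspace: Workspace,
    job: Callable[[str, Workspace], str],                 # hypothetical: edits workspace, returns transcript
    verifier: Callable[[Workspace], Tuple[bool, str]],    # hypothetical: inspects workspace only
    max_attempts: int = 3,
) -> TaskRecord:
    record = TaskRecord(goal=goal)
    for _ in range(max_attempts):
        transcript = job(goal, workspace)        # a fresh job per attempt
        passed, evidence = verifier(workspace)   # verdict comes from outside the job
        record.history.append(Attempt(transcript, passed, evidence))
        if passed:
            break                                # approval: stop retrying
    return record


# Stand-in job: fixes one failing test per attempt but always claims success.
def fake_job(goal: str, workspace: Workspace) -> str:
    workspace["failing_tests"] = max(0, workspace["failing_tests"] - 1)
    return "I fixed everything."


def fake_verifier(workspace: Workspace) -> Tuple[bool, str]:
    remaining = workspace["failing_tests"]
    return remaining == 0, f"{remaining} failing tests"


record = run_task("make the test suite green", {"failing_tests": 2}, fake_job, fake_verifier)
print(len(record.history), record.approved)  # prints: 2 True
```

The first attempt is rejected even though its transcript claims success, because the recorded evidence still shows one failing test. The history keeps both attempts, which is the task-level audit trail.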

The Verifier-First Checklist (alphabatcher, verbatim)

Before you launch /goal or /loop, write the verifier:

  • what counts as done
  • which checks run every pass
  • which artifact gets saved
  • which failure sends it back into the loop

Then let the agent run.

The discipline inverts the usual order: most people launch the loop and then wonder how to tell if it worked. Verifier-first makes “done” an objective gate (a saved artifact + a pass/fail check) before any tokens are spent — which is also what /goal enforces structurally, since a fresh model checks the stop condition instead of the model that did the work (see the /goal Walkthrough).
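
The four checklist items map onto a small harness. The sketch below is one possible shape under stated assumptions: `VerifierSpec`, `run_loop`, and the `agent_step` callable are hypothetical names and are not part of /goal or /loop, and the checks are ordinary Python callables standing in for tests, screenshot diffs, or benchmark runs.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, List

Check = Callable[[Path], bool]  # a check inspects the workspace, never the agent's explanation


@dataclass
class VerifierSpec:
    checks: Dict[str, Check]   # which checks run every pass
    artifact_dir: Path         # where each pass's evidence is saved
    max_passes: int = 5        # budget so a failing check cannot loop forever

    def run_pass(self, workspace: Path, pass_no: int) -> List[str]:
        """Run every check, save the results, and return the names of failed checks."""
        results = {name: bool(check(workspace)) for name, check in self.checks.items()}
        self.artifact_dir.mkdir(parents=True, exist_ok=True)
        artifact = self.artifact_dir / f"pass_{pass_no:03d}.json"
        artifact.write_text(json.dumps(results, indent=2))
        return [name for name, ok in results.items() if not ok]


def run_loop(spec: VerifierSpec, workspace: Path,
             agent_step: Callable[[Path, List[str]], None]) -> bool:
    """Done means every check passes; any failure is fed back into the next pass."""
    failures: List[str] = []
    for pass_no in range(1, spec.max_passes + 1):
        agent_step(workspace, failures)   # hypothetical agent call; receives the last failures
        failures = spec.run_pass(workspace, pass_no)
        if not failures:
            return True                   # what counts as done
    return False                          # budget exhausted: hand back to a human


def demo() -> bool:
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp) / "repo"
        workspace.mkdir()

        def stand_in_agent(ws: Path, failures: List[str]) -> None:
            # Writes a draft first, then the accepted value once it is told about a failure.
            (ws / "out.txt").write_text("ok" if failures else "draft")

        spec = VerifierSpec(
            checks={"out_is_ok": lambda ws: (ws / "out.txt").read_text() == "ok"},
            artifact_dir=Path(tmp) / "artifacts",
        )
        done = run_loop(spec, workspace, stand_in_agent)
        saved = sorted(p.name for p in spec.artifact_dir.iterdir())
        return done and saved == ["pass_001.json", "pass_002.json"]


print(demo())
```

The spec is written before `run_loop` is ever called, which is the point of the ordering. The per-pass JSON files are the saved artifact, so a later reader can see why pass 1 was rejected without reading any transcript.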

Why Karpathy’s Rule Is the Whole Game

“If you can’t evaluate then you can’t auto research it.” Evaluation is the prerequisite for automation — you can only hand a task to an unattended loop if you can mechanically tell pass from fail. This is the same boundary the wiki tracks as the verification frontier: tasks on the cheap-to-verify side (tests pass, build compiles, screenshot matches) can be looped; tasks where “done is a feel” still need a human in the chair. Verifier-first loop design is, in effect, moving a task across that frontier on purpose by manufacturing an objective gate for it.
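
One way to make the frontier concrete is a pre-flight gate that refuses to start a loop until the task has an objective check. The sketch below is an illustration only: `LoopPlan` and `preflight` are hypothetical names, and the four fields simply mirror the checklist questions.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class LoopPlan:
    done_condition: str   # what counts as done
    per_pass_check: str   # which check runs every pass
    saved_artifact: str   # which artifact gets saved
    retry_trigger: str    # which failure sends it back into the loop


def preflight(plan: LoopPlan) -> List[str]:
    """Return the names of unanswered questions; an empty list means loop-ready."""
    return [f.name for f in fields(plan) if not getattr(plan, f.name).strip()]


ready = LoopPlan(
    done_condition="all unit tests pass",
    per_pass_check="run the test suite",
    saved_artifact="test report for each pass",
    retry_trigger="any failing test",
)
not_ready = LoopPlan(
    done_condition="the landing page feels polished",  # a feel, with no mechanical check behind it
    per_pass_check="",
    saved_artifact="",
    retry_trigger="",
)

print(preflight(ready))      # []
print(preflight(not_ready))  # ['per_pass_check', 'saved_artifact', 'retry_trigger']
```

A task that fails the gate stays a manual prompt until someone manufactures a check for it. Note that this only confirms the questions were answered, not that the answers are good.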

Try It

  • Adopt the four-line checklist as a pre-flight. Don’t run /goal or /loop on a repo until you can name the done-condition, the per-pass check, the saved artifact, and the failure→retry path. If you can’t, the task isn’t loop-ready yet — keep it a manual prompt.
  • Give your evaluator a different model + no context from the maker. Pair this with the checker rule (verifier never sees the implementer’s reasoning, never has Write access).
  • Make the goal multimodal where you can: attach a target screenshot or reference image and have the loop diff against it, not just a prose spec.
  • Save the artifact every pass (test output, screenshot, benchmark CSV, changed-files list) so “done” is a thing you can inspect later — this is also your audit trail when a gate rots.

Related

  • Loop Engineering — Addy Osmani’s Essay — the primary thesis; verification is the residual risk it names (“‘done’ is a claim and not a proof”).
  • Loop Engineering — Cobus Greyling’s Reference — the Verifier Theater / Infinite Fix Loop failure modes this discipline prevents.
  • Should You Build a Loop? — the objective-gate argument extended to the Ralph Wiggum / agentic-laziness / goal-drift failure family.
  • The Verification Frontier — the cross-topic synthesis on cheap-vs-expensive verification that this operationalizes.
  • /goal Walkthrough — the Claude Code primitive that bakes a separate-checker stop condition into the loop.
  • Write Loops, Not Prompts — the topic entry point; its verification section is the beginner version of this.
  • Andrej Karpathy — origin of the “if you can’t evaluate, you can’t auto-research it” framing the article hangs on.

Open Questions

  • @omarsar0’s underlying long-form X Article (“Autonomous Long-Running Coding Agents”, id 2065876120965111808) is login-gated; its writer-agent summary body was not extractable. This article is built from his thread posts + the @alphabatcher amplification, not the full piece — refresh when the article or his promised follow-up write-up becomes accessible.
  • The exact source/context of the Karpathy quote (which talk/post) isn’t pinned down here — it’s quoted secondhand by @alphabatcher. Verify the original on refresh.^[inferred]
  • The plan/execute/evaluate model split is one practitioner’s 2026-06 setup; no comparative eval backs the specific assignments.