Verifier-First Loops — Proof Outside the Agent (omarsar0 · alphabatcher · Karpathy)

Source: ai-research/omarsar0-alphabatcher-verifier-first-loops-2026-06-14.md · raw/x-account-bcherny-2067272742253232211.md · raw/reddit-1uz0hgd.md (human-verifier anecdote, r/ClaudeAI) · raw/reddit-1v28snk.md (CI-as-verifier field report, r/ClaudeAI) — Authors: Alpha Batcher (@alphabatcher), amplifying Elvis Saravia (@omarsar0, DAIR.ai) · URLs: https://x.com/alphabatcher/status/2066151044581634540 · https://x.com/omarsar0/status/2065491829760651328 · Posted: 2026-06-13/14. The @alphabatcher tweet is extracted verbatim; the @omarsar0 quotes are from his thread posts (his long-form X Article body was login-gated and not extracted — see Open Questions). The model-role split is @omarsar0’s stated personal formula, not a benchmark.

The load-bearing, still-underdeveloped piece of any loop is verification — against what standard do you accept the loop’s output? This article collects the sharpest 2026-06 statements of the verifier-first discipline: write the verifier before you launch /goal or /loop, and make the proof live outside the agent’s own explanation. The anchor is Karpathy’s one-liner, quoted by @alphabatcher: “If you can’t evaluate then you can’t auto research it, right?” — no evaluator, no autonomy.

Key Takeaways

Write the verifier first. Before you launch a long-running loop, @alphabatcher’s checklist says decide: (1) what counts as done, (2) which checks run every pass, (3) which artifact gets saved, (4) which failure sends it back into the loop. Then let the agent run.
Proof sits outside the agent. “The loop can keep going because proof sits outside the agent’s own explanation” — tests, screenshots, benchmark curves, browser runs, changed files. This is how you “get autonomy without babysitting a transcript for 6 hours.” A verifier that only reads the agent’s self-report is “two optimists agreeing” (cf. the Verifier Theater failure mode in the Cobus reference).
Split the model roles, not just the agents. @omarsar0’s current formula: Opus 4.8 to plan carefully, GPT-5.5 to execute, and a different model family to evaluate via /goal (DeepSeek / Qwen / Kimi / MiniMax). A separate evaluator model is harder for the maker to fool.^[the specific model assignments are @omarsar0’s stated personal setup, not a measured result]
Multimodal goals beat text goals. “A multimodal goal is a much stronger goal than a plain text one” — give the agent strong visual cues to compare against (a target screenshot, a reference render), and “use agents to help you set clear goals.”
What you’re defending against: models “pause the work early,” make “lots of mistakes,” and take “weird shortcuts (reward hacking).” The cure is extreme goal clarity — explicit dos and don’ts, “eliminate any assumptions you think the model would make.”
The task is the unit, not the agent. A community framing from the thread: “the ‘agent’ isn’t the long-running thing — the task is. Each /goal spawns and dies.” The manageable shape is goal → runtime job → transcript → verifier → result → approval or retry → task history.
The tool’s creator frames the loop the same way. [X signal — @bcherny 2026-06] Boris Cherny (Claude Code’s creator, Anthropic) describes the next era of code as running “Claude Code + an advanced model + a verifier in a loop” and feeding it tasks (or data to generate its own tasks), with the human job reduced to setting guardrails and finding/removing bottlenecks. First-party confirmation — from the designer of the tool the practitioners here are looping — that the verifier-in-the-loop is the intended canonical workflow, not just a community technique. ^[Quote is from a reply with no external link; anchored to the permalink https://x.com/bcherny/status/2067272742253232211]
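
The task-as-unit shape from the takeaways above (goal → runtime job → transcript → verifier → result → approval or retry → task history) can be sketched as a small data structure. This is a minimal illustration, not anything from the thread: `Task`, `Attempt`, and `Outcome` are hypothetical names, and the only point it makes is that approval is decided by the verifier result while the transcript is stored but never trusted.

```python
from dataclasses import dataclass, field
from enum import Enum


class Outcome(Enum):
    APPROVED = "approved"
    RETRY = "retry"


@dataclass
class Attempt:
    """One spawn-and-die run of the agent against the goal."""
    transcript: str          # what the agent said it did (kept, not trusted as proof)
    verifier_passed: bool    # result of the external check
    outcome: Outcome


@dataclass
class Task:
    """The long-lived unit: the goal plus the history of every attempt."""
    goal: str
    history: list[Attempt] = field(default_factory=list)

    def record(self, transcript: str, verifier_passed: bool) -> Outcome:
        # Approval depends only on the verifier result, never on the transcript.
        outcome = Outcome.APPROVED if verifier_passed else Outcome.RETRY
        self.history.append(Attempt(transcript, verifier_passed, outcome))
        return outcome

    @property
    def done(self) -> bool:
        # The task is done once its most recent attempt was approved.
        return bool(self.history) and self.history[-1].outcome is Outcome.APPROVED
```

Each call to `record` corresponds to one agent run that has already died; the `Task` object is the thing that persists.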

The Verifier-First Checklist (alphabatcher, verbatim)

Before you launch /goal or /loop, write the verifier:

what counts as done

which checks run every pass

which artifact gets saved

which failure sends it back into the loop

Then let the agent run.

The discipline inverts the usual order: most people launch the loop and then wonder how to tell if it worked. Verifier-first makes “done” an objective gate (a saved artifact + a pass/fail check) before any tokens are spent. /goal enforces the same thing structurally, since a fresh model checks the stop condition instead of the model that did the work (see the /goal Walkthrough).
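
One way to picture the four lines as something written down before the loop starts is the sketch below. It is an assumption-laden illustration, not @alphabatcher's code or how /goal works internally: `VerifierSpec`, `run_loop`, and the `max_passes` cap are hypothetical, and the artifact is modelled as plain text for brevity.

```python
from dataclasses import dataclass
from typing import Callable

# An artifact is whatever a pass produced that can be inspected later
# (test output, changed-files list, ...). Modelled here as plain text.
Artifact = str
Check = Callable[[Artifact], bool]


@dataclass
class VerifierSpec:
    done: Check                 # 1. what counts as done
    every_pass: list[Check]     # 2. which checks run every pass
    max_passes: int = 5         # hypothetical safety cap, not from the source


def run_loop(agent_pass: Callable[[int], Artifact], spec: VerifierSpec) -> list[Artifact]:
    """Run passes until the done-check holds; return every saved artifact."""
    saved: list[Artifact] = []
    for n in range(1, spec.max_passes + 1):
        artifact = agent_pass(n)
        saved.append(artifact)                      # 3. artifact saved every pass
        if not all(check(artifact) for check in spec.every_pass):
            continue                                # 4. a failed check sends it back into the loop
        if spec.done(artifact):
            return saved
    raise RuntimeError(f"not done after {spec.max_passes} passes")
```

The spec has to exist before `run_loop` can be called at all, which is the ordering the checklist asks for. In a real setup `agent_pass` would invoke the agent and return real evidence such as captured test output.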

Why Karpathy’s Rule Is the Whole Game

“If you can’t evaluate then you can’t auto research it.” Evaluation is the prerequisite for automation — you can only hand a task to an unattended loop if you can mechanically tell pass from fail. This is the same boundary the wiki tracks as the verification frontier: tasks on the cheap-to-verify side (tests pass, build compiles, screenshot matches) can be looped; tasks where “done is a feel” still need a human in the chair. Verifier-first loop design is, in effect, moving a task across that frontier on purpose by manufacturing an objective gate for it.

The Canonical Standard, Resolved (research-agenda drain, 2026-07-03)

This topic’s oldest and most-cited open question — against what standard does a loop review its own output? — has a convergent answer once three sources in this topic are read together, even though no single one states it end-to-end:

There is no one canonical checklist item — there are two paths, and the choice between them is the whole game. Berman’s trigger × goal framework names them: a verifiable goal (a concrete, deterministic check — test coverage, page-load time under a threshold, a build that passes) is “the ideal,” because you know for certain when it’s true. An LLM-as-judge goal (“refactor until satisfied”) is real but structurally more brittle, because taste and judgment are handed to the model.
Both paths compile to the same four-line checklist. Whether the check is a test assertion or an LLM rubric, @alphabatcher’s pre-flight (what counts as done · which checks run every pass · which artifact gets saved · which failure retries) is identical in shape. The checklist doesn’t pick the standard for you — it forces you to state whichever standard you picked before spending tokens.
When the goal must be LLM-as-judge, the fix is a second, independent evaluator — never the implementer grading itself. Nate Herk’s two pillars (an objective goal + a way to check) converge with @omarsar0’s model-role split here: spin up a dedicated scorer sub-agent, validate its scoring against known-good/known-bad examples before trusting its grades, and keep it on a different model family than the one that did the work.
This isn’t an arbitrary design choice — it’s downstream of the verification frontier. Verification cost, not generation quality, is what gates whether a loop can even be built. A verifiable goal is cheap to verify, so the loop compounds. An LLM-as-judge goal is manufacturing a proxy for something otherwise expensive to verify (taste, “satisfied”) — the proxy (a scored rubric + a separate evaluator) is how you drag an expensive-to-verify task across the frontier on purpose, not a workaround for skipping verification.

The decision rule this resolves to: before starting any loop, ask “can I state a pass/fail check a machine — or a second model — can run?” If yes: that’s your standard, wire it into /goal or the loop’s stop condition, and it will be cheap and robust. If no: don’t skip verification — build the proxy (a rubric + a separate evaluator model, validated against examples first) and accept the brittleness that comes with it. There is no third option where the loop checks itself without an external standard; that configuration is what the failure-mode catalog in Loop Engineering calls Verifier Theater.
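
The "validated against examples first" step can be sketched as a small harness that scores a candidate judge on labelled examples before its grades are trusted. This is a minimal sketch under stated assumptions: `Judge`, `judge_agreement`, `trust_judge`, and the `threshold` default are hypothetical, the sources give no numeric bar, and a real judge would be a call to a separate model rather than a lambda.

```python
from typing import Callable

Judge = Callable[[str], bool]   # returns True when the judge accepts the output


def judge_agreement(judge: Judge, known_good: list[str], known_bad: list[str]) -> float:
    """Fraction of labelled examples the judge classifies correctly."""
    total = len(known_good) + len(known_bad)
    if total == 0:
        raise ValueError("need at least one labelled example")
    # Correct means: accepts a known-good example, rejects a known-bad one.
    correct = sum(judge(x) for x in known_good) + sum(not judge(x) for x in known_bad)
    return correct / total


def trust_judge(judge: Judge, known_good: list[str], known_bad: list[str],
                threshold: float = 1.0) -> bool:
    # threshold is a hypothetical knob; the source gives no number.
    return judge_agreement(judge, known_good, known_bad) >= threshold
```

A judge that accepts everything scores only 0.5 on a balanced set and is rejected, which is the "two optimists agreeing" case expressed as a number.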

The human-verifier corollary — reading the diff out loud (2026-07-17)

[Reddit signal — r/ClaudeAI 2026-07-17] Source: raw/reddit-1uz0hgd.md (133 score, 57 comments, OP u/Mysterious_Value3059, “Workaround” flair). Where the rest of this article puts proof outside the agent, one widely-upvoted anecdote is a reminder that the verifier is sometimes the human — and that skimming isn’t verifying. After a year of skim-and-accept (“the model writes a diff, it looks reasonable, I skim it, I accept”), the OP adopted one rule: read every changed line out loud before accepting it — actually voicing it, not skimming. The reported effect is a metric, not a law: accept rate dropped from “basically everything” to about 40%, with fewer bugs downstream. The value wasn’t the model writing worse code — the other 60% was the human catching “a wrong variable that’s spelled almost right, an edge case handled confidently and incorrectly, a function that works fine and solves a problem I didn’t have,” then asking for a better approach.^[single self-reported anecdote; the ~40% figure is one practitioner’s experience, not a measured study] The OP’s own framing names the failure mode it defends against: “the model got good enough to turn me into a rubber stamp, and the only fix I’ve found is a dumb manual habit that puts me back in the loop on purpose.” This is the manual, human-side complement to the article’s automated gates: when a loop’s stop condition is still a person clicking accept, an objective ritual (read aloud, explain each line to the empty room) is what keeps Verifier Theater from relocating inside the human reviewer.

CI as the verifier — a field report (2026-07-21)

[Reddit signal — r/ClaudeAI 2026-07-21] Source: raw/reddit-1v28snk.md (198 score, 137 comments, OP u/big_like_a_pickle, self-described 30-year software engineer running two Max accounts, “Claude Code” flair). Single practitioner, self-reported, no measured defect numbers — but it is the clearest description in the wiki so far of what the four-line checklist looks like when it stops being a per-loop ritual and becomes standing infrastructure. The OP stopped feature work for three days to build a PR-based GitHub Actions pipeline and reports it as the highest-leverage change he made.

The topology. master deploys to production, dev to a development server. Every Claude session branches off dev into an isolated worktree, does its work, and opens a PR back into dev. That PR triggers a fast GHA job (~1 minute: linting, leaked-secret scanning). Once a batch of work has accumulated on dev, Claude opens a PR into master, which triggers a much deeper workflow — full database migrations plus an automated click-through of the running site in a real headless browser via Playwright. Only after those gates pass does code deploy. Mapped onto @alphabatcher’s checklist: what counts as done = the PR’s gates go green; which checks run every pass = the fast job; which artifact gets saved = the PR and its CI run; which failure sends it back into the loop = the OP reports that when a gate fails, Claude sees the failure, fixes it, and re-merges without human intervention.

Two gates specifically aimed at LLM failure modes. Beyond the usual test suite (the OP cites over 2,000 regression tests and ~10 code-quality checks), he added deterministic checks tuned to how models write code rather than how humans do: a banned-LLM-slop-word check, and a cyclomatic-complexity ceiling on the observation that LLMs “tend to write very long functions, instead of properly organizing code into units.” This is a category the rest of this article doesn’t cover — verifiers whose purpose is not “is it correct?” but “was it written by a model in a way that will rot?” ^[inferred — the OP describes the checks; framing them as a distinct verifier category is this wiki’s synthesis]
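
Both gates are deterministic, so they can be approximated with the Python standard library. The sketch below is an illustration, not the OP's pipeline: the banned-word list and the ceiling of 10 are hypothetical (the post does not give either), and the complexity score is a rough branch count rather than a full cyclomatic-complexity implementation.

```python
import ast
import re

BANNED = ("delve", "seamless", "leverage")   # hypothetical list; the OP's is not given
MAX_COMPLEXITY = 10                           # hypothetical ceiling

# Node types that add a decision point. Counting them is an approximation
# of cyclomatic complexity, not a complete implementation.
_BRANCHES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
             ast.IfExp, ast.BoolOp, ast.comprehension)


def complexity(func: ast.AST) -> int:
    """1 plus the number of branching nodes inside the function."""
    return 1 + sum(isinstance(node, _BRANCHES) for node in ast.walk(func))


def gate(source: str) -> list[str]:
    """Return a list of violations; an empty list means the gate passes."""
    problems = []
    # Word check runs on the raw text so comments and docstrings are covered.
    for word in BANNED:
        if re.search(rf"\b{re.escape(word)}\b", source, re.IGNORECASE):
            problems.append(f"banned word: {word}")
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            score = complexity(node)
            if score > MAX_COMPLEXITY:
                problems.append(f"{node.name}: complexity {score} > {MAX_COMPLEXITY}")
    return problems
```

In CI, a non-empty result would fail the job, which is what sends the change back to the agent.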

GitHub Issues as the cross-session memory. The reported before-state is familiar: copy-pasting context between sessions and “nearly a hundred random markdown docs” littering the repo (the OP archives sessions at context saturation rather than compacting). The after-state moves all cross-session communication into GitHub Issues — Claude writes notes to itself under LLM-ingestion instructions kept in CLAUDE.md, opens and closes bug reports, and writes dense session hand-off docs. The verifier and the memory ended up being the same substrate: the forge already stores state, runs checks, and gates merges, so the loop’s scaffolding is infrastructure the project needed anyway.

Scale and honest limits. The OP runs 3–5 concurrent Claude sessions and says the remaining bottleneck is his own context-switching, not the model. Caveats he states: the product is a locally-hosted open-source app, so “production” is a Docker host on his home LAN — a real SaaS would need substantially more infrastructure in the pipeline. No before/after defect rate is reported, so the claim is “this leveled me up,” not a measured improvement.^[single self-reported field report]

Audit the decisions, not the diff (Victor Taelin, 2026-07-23)

Source: raw/newsletter-theneurondaily-com-300a6b0314.md (The Neuron Daily, 2026-07-23), relaying Victor Taelin.

The human-verifier corollary above says read every diff line. This is the counter-proposal for when that stops scaling, and it resolves the tension by changing what gets reviewed rather than how carefully. The framing: an agent can write thousands of lines in minutes; reviewing every line defeats the point, but merging blind is how you get “digital lasagna.”

The claim underneath it is the useful part: coding agents usually execute a concrete plan well — the risk concentrates where the task was underspecified and the agent quietly picked an architecture, shortcut, or assumption for you. So the diff is the wrong artifact. The decisions are.

Lock the load-bearing decisions before execution — desired behavior, constraints, architecture, and what “fixed” actually means.
After the agent finishes, ask it to enumerate every meaningful choice it made, flagging anything it was unsure about.
Review that short list instead of the full diff, correct weak assumptions, then have the agent revise.
Run a “pride gate” (from the top commenter on Taelin’s post): ask whether the agent is proud of the branch and would stand behind it in production.

The reusable prompt, quoted from the source:

Before I merge this work, audit the decisions you made:

1. List every meaningful decision, assumption, shortcut, or interpretation
   you made while completing the task.
2. For each one, explain why you chose it, what alternatives you considered,
   and how confident you are.
3. Flag any choice that fixes the immediate example but may not solve the
   underlying problem generally.
4. Identify any edge cases, technical debt, or temporary fixes that remain.

Where this sits against the rest of the article. It is self-report, so it inherits the exact weakness the verification frontier and the Verifier Theater warning describe: an agent grading its own work. Point 3 is the mitigation that matters most (“fixes the immediate example but may not solve the underlying problem” is the reward-hacking shape stated in reviewable terms), but it is still the generator answering. Treat it as a triage layer that tells you where to spend real scrutiny, not as a replacement for the independent verifier or the CI gate described above.

Try It

Adopt the four-line checklist as a pre-flight. Don’t run /goal or /loop on a repo until you can name the done-condition, the per-pass check, the saved artifact, and the failure→retry path. If you can’t, the task isn’t loop-ready yet — keep it a manual prompt.
Give your evaluator a different model + no context from the maker. Pair this with the checker rule (verifier never sees the implementer’s reasoning, never has Write access).
Make the goal multimodal where you can: attach a target screenshot or reference image and have the loop diff against it, not just a prose spec.
Save the artifact every pass (test output, screenshot, benchmark CSV, changed-files list) so “done” is a thing you can inspect later — this is also your audit trail when a gate rots.
Promote the verifier into CI once the project outgrows per-loop checks. A PR-gated pipeline makes the four-line checklist standing infrastructure instead of something you re-decide each run — and the forge doubles as the cross-session memory (issues, hand-off docs) that agent sessions otherwise scatter across markdown files.
Add model-specific gates, not just correctness gates. A cyclomatic-complexity ceiling and a banned-phrase check catch the way models degrade a codebase (giant functions, slop wording) — failures a passing test suite will happily wave through.
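
The "save the artifact every pass" step can be as small as the sketch below, which writes each pass's evidence to its own folder and appends a hashed entry to an index file. The directory layout, the `index.jsonl` name, and the `save_artifact` helper are all hypothetical choices made for illustration.

```python
import hashlib
import json
from pathlib import Path


def save_artifact(run_dir: Path, pass_no: int, name: str,
                  content: bytes, passed: bool) -> Path:
    """Write one pass's evidence to disk and append it to an index file."""
    pass_dir = run_dir / f"pass-{pass_no:03d}"
    pass_dir.mkdir(parents=True, exist_ok=True)
    path = pass_dir / name
    path.write_bytes(content)
    entry = {
        "pass": pass_no,
        "file": path.relative_to(run_dir).as_posix(),
        "sha256": hashlib.sha256(content).hexdigest(),  # detects later tampering or drift
        "passed": passed,
    }
    # One JSON object per line, so the trail can be appended to and searched.
    with (run_dir / "index.jsonl").open("a", encoding="utf-8") as index:
        index.write(json.dumps(entry) + "\n")
    return path
```

Because every pass is kept, including failed ones, the index doubles as the audit trail for working out when a gate started passing for the wrong reason.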

Related

Harness Design for Long-Running Agents (Anthropic, two-part) — the first-party engineering deep-dive behind this article’s checklist-style principle: why agent self-evaluation fails (models “confidently praise” mediocre work) and how a separate, skeptically-tuned evaluator agent fixes it, with real cost/duration data.
How Anthropic Runs Large-Scale Code Migrations with Claude Code — a concrete, production-scale instance of this discipline: judge-building and stress-testing come before translation, and behavior-matching gates the merge.
Loop Engineering — Addy Osmani’s Essay — the primary thesis; verification is the residual risk it names (“‘done’ is a claim and not a proof”).
Loop Engineering — Cobus Greyling’s Reference — the Verifier Theater / Infinite Fix Loop failure modes this discipline prevents.
Should You Build a Loop? — the objective-gate argument extended to the Ralph Wiggum / agentic-laziness / goal-drift failure family.
The Verification Frontier — the cross-topic synthesis on cheap-vs-expensive verification that this operationalizes.
/goal Walkthrough — the Claude Code primitive that bakes a separate-checker stop condition into the loop.
Write Loops, Not Prompts — the topic entry point; its verification section is the beginner version of this.
Andrej Karpathy — origin of the “If you can’t evaluate then you can’t auto research it” framing the article hangs on.
LOOPS — Everything You Need to Know (Matthew Berman) — the verifiable/LLM-as-judge grid that sharpens “against what standard” into a binary choice.
Agent Loops, Clearly Explained (Nate Herk) — the two-pillars framing (objective goal + a way to check) this article’s resolved-standard section leans on.
Reward-Hacking and the Verification Frontier — the connection article: GLM-5.2’s anti-hack module is this discipline (proof outside the agent, separate evaluator) productionized at the RL-training layer, with evidence that reward-hacking scales with capability.
A True-Scale Universe Atlas Built with Fable 5 — production-scale field proof of this discipline: 92 AI-authored PRs merged only because CI checked every claim (planet positions vs. JPL ephemeris, rendered frames vs. saved baselines) rather than the model’s self-report.
Agentic Misalignment in Summer 2026 — direct empirical pressure against this article’s “separate model as evaluator” prescription: LLM judges shifted verdicts on identical transcripts once told the training consequence of their label, with Claude models the worst offenders tested — a robust verifier needs more than just being a different model.

Open Questions

@omarsar0’s underlying long-form X Article (“Autonomous Long-Running Coding Agents”, id 2065876120965111808) is login-gated; its writer-agent summary body was not extractable. This article is built from his thread posts + the @alphabatcher amplification, not the full piece — refresh when the article or his promised follow-up write-up becomes accessible.
The exact source/context of the Karpathy quote (which talk/post) isn’t pinned down here — it’s quoted secondhand by @alphabatcher. Verify the original on refresh.
The plan/execute/evaluate model split is one practitioner’s 2026-06 setup; no comparative eval backs the specific assignments.

Jonathon's AI Wiki
