Source: I Used Karpathy’s Autoresearch to Train an LLM (Thu Vu) · YouTube XXR0zZ0_16M · uploaded 2026-04-24 · 15:38
Speaker: Thu Vu (Thu Vu data analytics channel)
Subject: Andrej Karpathy’s open-sourced AutoResearch project
Sponsor: Mistral AI (Mistral Vibe / Le Chat Pro)
Hardware: Apple Silicon M1 MacBook Pro using a community macOS port of AutoResearch
A third-party walkthrough of Karpathy’s AutoResearch — an autonomous self-improving program loop where the AI coding agent runs experiments, evaluates them against an automatic metric, and ratchets the codebase forward without human prompting. Thu Vu uses it to train a small GPT-style language model on a folklore-and-mythology dataset from Hugging Face, with Mistral Vibe as the autonomous coding agent. Filed in the Karpathy topic because the underlying tool and design pattern are Karpathy’s; complements the Sequoia AI Ascent talk by operationalizing the next-step-beyond-agentic-engineering idea Karpathy hints at.
Key Takeaways
- AutoResearch is Karpathy’s open-source self-improving research loop. Backstory per the video: Karpathy had a ~630-line GPT training script he’d been manually optimizing for months (hyperparameters, architectures, learning rates — “the usual ML research grind”). At some point he asked himself, “Why am I doing this myself? Why don’t I just let an AI coding agent do this loop for me?” He built AutoResearch and open-sourced it; it picked up tens of thousands of GitHub stars within days^[ambiguous — Thu Vu’s report at fetch time].
- Three-tier human/AI evolution framing. Thu Vu names the progression in the cold-open: (1) vibe coding — human prompts, AI writes code, human reviews; (2) agentic engineering — human orchestrates agents in real time as a director; (3) AutoResearch — human doesn’t even orchestrate. Human writes a markdown file describing what good research looks like and walks away. Human role is research advisor. This sits one step beyond the framework Karpathy himself articulates in the Sequoia AI Ascent talk.
- The AutoResearch contract is three files. Per Thu Vu’s design walkthrough:
  - `prepare.py` — data prep plus the validation-metric definition (in Karpathy’s original: training-data download + val BPB, validation bits per byte).
  - `train.py` — the agent’s sandbox: ~600 lines containing the GPT training loop. This is the only file the AI agent can edit. Architecture, hyperparameters, optimizer, batch size — all editable.
  - `program.md` — the agent’s instructions, written, edited, and iterated by the human. Defines goals, constraints, boundaries (what the agent may and may not modify), time budget per experiment, output format, and logging conventions. Where the human’s research judgment lives.
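The scoring metric in Karpathy’s original is validation bits per byte: summed cross-entropy converted from nats to bits, normalized by the byte count of the raw validation text. A minimal sketch of such a metric (the function name and bookkeeping here are illustrative, not taken from the actual `prepare.py`):

```python
import math

def val_bpb(total_nll_nats: float, total_bytes: int) -> float:
    """Bits per byte of the validation text, given the summed
    cross-entropy loss in nats over the whole validation set.
    Lower is better: fewer bits per byte means the model predicts
    (i.e., compresses) the held-out text more accurately."""
    total_bits = total_nll_nats / math.log(2)  # nats -> bits
    return total_bits / total_bytes

# 1,000,000 bytes of validation text, summed NLL of 900,000 nats:
print(round(val_bpb(900_000, 1_000_000), 2))  # → 1.3
```

The per-byte normalization is what makes runs comparable even if an experiment changes the tokenizer or sequence length: the denominator is raw bytes, not tokens.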
- The “ratchet loop” is the core mechanism. For each experiment:
  - Agent reads `program.md` for current research priorities + constraints.
  - Agent examines the current `train.py` (the baseline).
  - Agent proposes a hypothesis (architecture change, optimizer swap, etc.).
  - Agent commits to a Git branch.
  - Agent runs training for exactly the time budget (5 min default — adjustable). Equal time budgets keep experiments comparable; the agent can’t cheat by training longer.
  - Agent evaluates against the scoring metric.
  - If the metric improved → the commit stays. If not → `git reset` reverts to the previous version.
  - Result: “the codebase can only move forward. Each successful experiment adds a commit and each failure gets reverted. Improvements accumulate one at a time and you can never slide backward.”
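The keep-or-revert step can be sketched in a few lines of Python around plain Git commands. This is a reconstruction of the mechanic as described in the video, not Karpathy’s code; `propose_change` and `train_and_evaluate` stand in for the agent’s edit to `train.py` and the time-boxed training run plus val BPB evaluation:

```python
import subprocess

def git(*args: str) -> str:
    """Run a git command in the current repo; raise on failure."""
    return subprocess.run(["git", *args], check=True,
                          capture_output=True, text=True).stdout

def ratchet_step(propose_change, train_and_evaluate,
                 best_metric: float) -> float:
    """One ratchet iteration: edit, commit the candidate, run it,
    keep the commit if the metric improved, otherwise hard-revert.
    Lower metric is better (as with val BPB)."""
    propose_change()                       # agent edits the one allowed file
    git("commit", "-am", "experiment: candidate change")
    metric = train_and_evaluate()          # time-boxed run + evaluation
    if metric < best_metric:               # improvement: commit stays
        return metric
    git("reset", "--hard", "HEAD~1")       # regression: roll back
    return best_metric
```

Because every candidate lives in exactly one commit, `git log` doubles as the experiment history and a single `git reset --hard` is the only cleanup a failure needs.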
- `program.md` includes a “never stop” directive. A snippet Thu Vu reads aloud: “Once the experiment loop has begun after the initial setup, do not pause or ask the human if you should continue. Do not ask ‘Should I keep going?’ The human might be asleep or away from the computer and expects you to continue working indefinitely until manually stopped. The loop runs until the human interrupts you, period.”
- The pattern generalizes well beyond LLM training. Three conditions for any AutoResearch-style loop, per Thu Vu: (1) a clear automatic metric (ideally one number, machine-measurable); (2) one file the agent edits; (3) a time-boxed experiment loop. She names domains where this fits: website-design speed optimization, trading-strategy backtesting, marketing-asset optimization (email subject lines, landing-page copy). The constraint is the metric, not the domain.
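Those three conditions reduce to a small, domain-agnostic harness: a single-number score, one editable artifact, and a wall-clock box per experiment. A sketch under those assumptions (all names here are hypothetical, not AutoResearch’s actual API):

```python
import time

def research_loop(propose_change, run_and_score,
                  time_budget_s: float = 300.0,
                  max_experiments=None) -> float:
    """Domain-agnostic AutoResearch-style loop.

    propose_change() edits the one file the agent may touch;
    run_and_score(deadline) runs the experiment until the wall-clock
    deadline and returns a single number (lower is better).
    With max_experiments=None the loop runs until interrupted,
    matching the "never stop" directive."""
    best = run_and_score(time.monotonic() + time_budget_s)  # baseline score
    done = 0
    while max_experiments is None or done < max_experiments:
        propose_change()
        score = run_and_score(time.monotonic() + time_budget_s)
        if score < best:
            best = score   # keep: the commit stays
        # a real loop would git-revert the change here on regression
        done += 1
    return best
```

The domain only enters through `run_and_score`: swap in page-load milliseconds, backtest drawdown, or CTR and the loop is unchanged.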
- “Simpler is better” is baked into Karpathy’s `program.md`. Per Thu Vu, the original includes guidance like “all else being equal, simpler is better. A small improvement that adds ugly complexity is not worth it.” Aligns with Karpathy’s micro-GPT-simplification frustration in the Sequoia talk (“the models hate this, they can’t do it… you feel like you’re outside of the RL circuits”) and his broader complaint that LLM-generated code is “very bloaty, lots of copy-paste, awkward brittle abstractions.”
- Dangerous Mode (auto-approve) is required to run the loop. Mistral Vibe’s auto-approve mode lets the agent execute file edits + bash commands without per-step human approval — “honestly, in no normal circumstances should you enable the auto-approve mode, but here in this project we’re going to use this mode because that’s the whole point.” This is the actual cost of human-as-research-advisor: you’re handing execute permission to an agent loop that won’t stop until you stop it.
- Real result on Thu Vu’s run. She runs out of tokens overnight; the agent completes 11 total experiments. Validation metric (val BPB) improves visibly across experiments. Sample from baseline model is grammatical garbage; sample from the final iteration is “still doesn’t make much sense, but a little bit better — sentences are more complete.” Realistic outcome for a tiny model on a small dataset, but the loop mechanic is the demonstration, not the model quality.
- The closing reframe — “judgment behind the research agenda is still the human’s.” Quoting the DataCamp AutoResearch guide via Thu Vu: “Writing a good `program.md` requires having done the research yourself. You need to know which directions are worth trying, what ‘better’ means for your problem, and when incremental gains have run their course. And honestly, that might be the most valuable skill for the next decade.” Same shape as Karpathy’s “you can outsource your thinking but you can’t outsource your understanding” closing in the Sequoia talk — different sentence, same point.
Setup Walkthrough (concise)
For replicating on Apple Silicon, per Thu Vu’s terminal session:
- Install `uv` (the Python package manager AutoResearch uses); verify with `uv --version`.
- Install Mistral Vibe; on first launch it prompts for a Mistral API key from `console.mistral.ai`.
- Clone the community macOS port of AutoResearch into a project folder.
- Run `uv sync` to install dependencies (PyTorch et al.).
- Edit `prepare.py` to swap in your dataset (Thu Vu had Vibe do this in chat — “I want to use AutoResearch to train an LLM on this dataset; please review the codebase, download the dataset, and configure prepare.py”).
- Manually run `uv run prepare.py` once to verify the data pipeline.
- Manually run a single training experiment to verify training works.
- Toggle Mistral Vibe to auto-approve mode (Shift-Tab cycles default → plan → accept-edits → auto-approve).
- Hand the agent the AutoResearch kickoff prompt: “Have a look at the `program.md` file, and let’s kick off a new experiment. Let’s do the setup first.” The agent then runs the loop indefinitely until you interrupt it or run out of tokens.
Implementation
Tool/Service: AutoResearch (Andrej Karpathy, open source) + Mistral Vibe (Mistral AI; Le Chat Pro CLI; sponsored)
Setup time: Per the video, all command-line steps take a few minutes; the agent loop itself runs overnight.
Cost: Auto-approve loop will burn through your API tokens steadily — Thu Vu ran out overnight. Budget accordingly. Mistral Vibe is included in Le Chat Pro / Team plans; the OSS Vibe binary works with any model API.
Integration notes:
- The macOS port is community-maintained, not Karpathy-maintained. If you’re on Apple Silicon, use that fork; if you have access to dedicated GPUs, run Karpathy’s original.
- You can swap the agent. Thu Vu uses Mistral Vibe but the AutoResearch contract (three files, ratchet loop) is agent-agnostic. Claude Code or Codex with auto-approve / Dangerous Mode would work the same way. (Verifying what auto-approve looks like in your specific agent is the prerequisite.)
- The 5-minute time budget per experiment is a `program.md` parameter — bump it up for compute-heavier domains (e.g., trading backtests over years of market data).
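Whatever the budget, it is wall-clock time, not step count: equal time is what keeps experiments comparable across architectures of different cost. A sketch of deadline-bounded training (the `train_step` hook is a placeholder, not AutoResearch’s API):

```python
import time

def train_for_budget(train_step, budget_seconds: float) -> int:
    """Run training steps until the wall-clock budget is spent and
    return the number of completed steps. A heavier model simply
    completes fewer steps; it can never consume more than the budget
    plus the tail of its final step."""
    deadline = time.monotonic() + budget_seconds
    steps = 0
    while time.monotonic() < deadline:
        train_step()
        steps += 1
    return steps
```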
Try It
- Pick a domain with a clear automatic metric. Three Thu-Vu-suggested non-obvious starting points: site-load-speed optimization, trading-strategy backtesting, marketing-copy CTR optimization. Anything with a single-number objective and a < 5-minute eval loop is in scope.
- Write `program.md` first, before any code. This is where your research judgment lives. Specify: goal metric, time budget per experiment, files the agent may and may not modify, output/logging format, and a “never stop” directive if you want the loop to run unattended.
- Constrain `train.py` (or your equivalent) to one file. The AutoResearch contract is “one file the agent edits.” Resist the urge to let the agent touch `prepare.py` or your evaluation harness — the metric must be uncheatable.
- Use Git as the ratchet. Each experiment = one commit. Successful → kept. Failed → `git reset --hard`. This is the load-bearing safety mechanism; don’t skip it.
- Cross-reference with the Sequoia talk for the strategic framing of why the human-as-research-advisor tier matters; this video provides the operational mechanics that the talk only gestures at.
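Pulling that advice together, a skeletal `program.md` might look like the following. Every line here is illustrative, reconstructed from the constraints the video mentions, not quoted from Karpathy’s actual file:

```markdown
# Research Program

## Goal
Minimize validation bits per byte (val BPB) on the held-out split.

## Boundaries
- You may edit train.py only; never touch prepare.py or the eval harness.
- Each experiment trains for exactly 5 minutes of wall-clock time.
- All else being equal, simpler is better; a small improvement that
  adds ugly complexity is not worth it.

## Loop
- One experiment per Git commit; revert the commit on any regression.
- Log each experiment's hypothesis, the change made, and the resulting val BPB.
- Once the loop has begun, do not pause or ask whether to continue;
  run until the human interrupts you.
```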
Related
- From Vibe Coding to Agentic Engineering — Karpathy’s own articulation of the vibe-coding → agentic-engineering progression that AutoResearch sits one step beyond
- Karpathy topic — sibling content
- Karpathy Pattern — community implementations of his LLM-wiki idea (different pattern, same author)
- Agent Skills overview — `program.md` as a research-advisor skill is conceptually adjacent to the Agent Skills format
- Opus 4.7 best practices — for the analogous Claude Code auto-approve / Dangerous Mode considerations
- 2026 Claude Code AIOS Pattern — convergent-evidence synthesis on self-improving agent architectures
Open Questions
- AutoResearch repo URL + license. Thu Vu cites it as Karpathy’s open-source project but the exact GitHub path was not extracted; “tens of thousands of stars within days” not independently verified at fetch time.
- macOS port maintainer. The community Apple Silicon fork is referenced but not named — flag for follow-up before recommending it for production use.
- DataCamp guide. The closing quote is sourced from a DataCamp AutoResearch guide; URL not extracted.
- Karpathy’s own follow-up content. Whether Karpathy has published a paper, blog post, or video explaining AutoResearch directly (vs the project README) is not surfaced in this third-party walkthrough.
- Failure modes. What happens when the agent gets stuck in a local optimum, or proposes the same kind of change repeatedly? `program.md` constraints presumably handle this, but the design-pattern documentation around exploration vs. exploitation is not covered in this video.