Source: ai-research/kevinrgu-autoagent-2026-05-28.md
Repo: https://github.com/kevinrgu/autoagent
Stars: 4,500 | Forks: 499 | Watchers: 29
Language: Python 100% | License: MIT
Tagline: autonomous harness engineering
A meta-agent harness that builds and iteratively improves agents by running them against a benchmark of Docker-isolated tasks and hill-climbing on a reward score produced by deterministic checks or an LLM-as-judge. Built on Harbor for task execution; each task is a self-contained directory with instruction.md + tests/ (test.sh + test.py) + environment/Dockerfile + files/. Tests write a score (0.0-1.0) to /logs/reward.txt; the meta-agent uses that score as the loss function for the next iteration. The README's recommended way to improve performance is to equip the agent with Agent Skills for Context Engineering and context7 skills, so capability is added at the skill layer and the same harness can keep climbing without a custom training pipeline.
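The reward contract described above can be sketched as a small verifier helper. This is an illustration, not the repo's actual tests/test.py: the names `score_output` and `write_reward`, the "fraction of checks passed" scoring rule, and the clamping are all assumptions. Only the 0.0-1.0 range and the /logs/reward.txt path come from the README.

```python
from pathlib import Path

# Path taken from the README; everything else in this sketch is hypothetical.
REWARD_PATH = Path("/logs/reward.txt")


def score_output(checks: list[bool]) -> float:
    """Return the fraction of checks that passed (0.0 when there are none)."""
    if not checks:
        return 0.0
    return sum(checks) / len(checks)


def write_reward(score: float, path: Path = REWARD_PATH) -> float:
    """Clamp score into [0.0, 1.0], write it to path, and return the clamped value."""
    clamped = max(0.0, min(1.0, score))
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(f"{clamped}\n")
    return clamped
```

A deterministic test would call `write_reward(score_output([...]))` as its last step; an LLM-as-judge test would presumably write the judge's score through the same file.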
Key Takeaways
- Hill-climbs on reward.txt, not on RL gradients. The meta-agent reads task verdicts (deterministic shell + Python verification or LLM-as-judge) and proposes the next iteration of the agent itself. Reward signal is a single float per task per run.
- Docker-isolated tasks are the unit of evaluation. Every task runs in its own container whose Dockerfile starts `FROM autoagent-base`; the base image is built once via `docker build -f Dockerfile.base -t autoagent-base .` and reused across all task containers. Reference files placed in `files/` are available inside the container at runtime.
- `uv run harbor run` is the loop driver. Parallelism is set with the `-n` flag (default 4; the README shows 100 for full-benchmark sweeps). `--agent-import-path agent:AutoAgent` is how the meta-agent class plugs into Harbor’s task runner. Outputs land in `jobs/<job-name>/`; the latest run log is in `run.log`.
- Operator escape hatches for Docker. `docker system prune -a -f` (heavy) and `killall Docker && open -a Docker` (recovery) are documented in the README itself: Harbor plus many parallel containers can overwhelm the Docker daemon, and the README acknowledges it.
- Skills as the performance lever. The README’s “Improving performance” section is one paragraph: equip the agent with Agent Skills for Context Engineering and context7 skills. The harness assumes Skills/context7 are the right layer to bolt capability on, not custom training or fine-tuning. This aligns with the pattern in Tool, Skill, or Subagent?: skills as the cheap composable layer.
- Task format is portable. `instruction.md + tests/ + environment/Dockerfile + files/` looks transferable across harnesses: the same task directory could in principle drive a different agent or framework with no rewrites.
- Sister to but distinct from prior wiki entries on self-improving harnesses. Compare Reflexio (extracts playbooks from runs, drop-in for Claude Code / LangChain / OpenClaw) and Browserbase Autobrowse (browser-specific, graduates SKILL.md from convergent strategies). AutoAgent’s slot is benchmark-driven meta-improvement: the agent IS the artifact, the benchmark is the loss function, Docker tasks are the substrate.
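The "hill-climb on reward, not on gradients" idea from the takeaways can be sketched as a generic greedy loop. The source does not document how AutoAgent actually proposes or accepts changes, so this is a swapped-in textbook technique under stated assumptions: `propose` and `evaluate` are hypothetical callables, a candidate is any value that describes an agent version, and the acceptance rule (strictly higher mean reward across tasks) is an assumption.

```python
from statistics import mean
from typing import Callable, TypeVar

Candidate = TypeVar("Candidate")


def hill_climb(
    initial: Candidate,
    propose: Callable[[Candidate], Candidate],
    evaluate: Callable[[Candidate], list[float]],
    iterations: int,
) -> tuple[Candidate, float]:
    """Greedy hill-climbing: keep a proposal only if its mean reward improves.

    evaluate must return at least one per-task reward (one float per task).
    """
    best = initial
    best_score = mean(evaluate(best))
    for _ in range(iterations):
        candidate = propose(best)  # hypothetical: e.g. an LLM-suggested edit
        score = mean(evaluate(candidate))
        if score > best_score:  # strict improvement; ties keep the incumbent
            best, best_score = candidate, score
    return best, best_score
```

In this framing, `evaluate` would stand in for one benchmark sweep that returns the per-task values read from reward.txt, and `propose` for whatever the meta-agent does to produce the next agent revision.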
How it compares to existing harness articles
| Harness | Slot | What gets improved | Substrate |
|---|---|---|---|
| Reflexio | Cross-domain | Per-user profiles + per-task playbooks (retrieval over recipes) | Real agent runs, retrieved next time |
| Browserbase Autobrowse | Browser-specific | SKILL.md graduated from convergent strategy iterations | Real-site browser sessions, capped at ~3-5 iters |
| Verification-loop skills (Sid Benesaria) | Cross-domain (Claude Code) | Self-improving verification skills that hill-climb on a criterion | Claude Code session loop |
| AutoAgent | Cross-domain (benchmark-shaped) | The agent itself, evaluated against a Docker-isolated task set | Docker task containers, hill-climbing on reward.txt |
The architectural lesson across all four: the harness — not the base model — is where the design choices compound. AutoAgent’s specific take is that a tight task → reward → iterate loop, fully containerized for reproducibility, is the cleanest way to drive that compounding.
Related
- Reflexio — sibling self-improvement harness; different mechanism (retrieval over playbooks) but same north star
- Browserbase Autobrowse — domain-specific sibling (browser strategies → graduated skills)
- Memory Stores + Dreaming — Anthropic’s first-party version of the same idea (multi-session memory + asynchronous batch consolidation)
- Tool, Skill, or Subagent? (Will, Applied AI) — skills as the cheap composable layer for capability addition — same lever AutoAgent recommends
- Stop Babysitting Your Agents (Sid Benesaria) — self-improving verification skills, same hill-climbing-on-criterion shape
- Karpathy autoresearch ratchet — conceptual root of the iterate-against-a-criterion pattern adopted across these harnesses
- 2026 Claude Code AIOS Pattern — broader pattern context: the agent OS where each loop compounds
Try It
# 1. Clone
git clone https://github.com/kevinrgu/autoagent.git
cd autoagent
# 2. Build the base image
docker build -f Dockerfile.base -t autoagent-base .
# 3. Add task directories under tasks/
# Each task = instruction.md + tests/test.sh + tests/test.py +
# environment/Dockerfile (FROM autoagent-base) + files/
# 4. Single task (verify the loop works)
rm -rf jobs && mkdir -p jobs && \
uv run harbor run -p tasks/ \
--task-name "<task-name>" -l 1 -n 1 \
--agent-import-path agent:AutoAgent \
-o jobs --job-name latest > run.log 2>&1
# 5. Parallel sweep (100 concurrent runs)
rm -rf jobs && mkdir -p jobs && \
uv run harbor run -p tasks/ -n 100 \
--agent-import-path agent:AutoAgent \
-o jobs --job-name latest > run.log 2>&1
# Reset Docker if it goes catatonic mid-sweep (macOS Docker Desktop):
killall Docker && open -a Docker

To improve performance: equip the agent with Agent Skills for Context Engineering and context7 skills (README’s stated lever).
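Before launching a sweep, it can help to check that each directory under tasks/ matches the layout listed in step 3. The following is a minimal sketch, not part of the repo or of Harbor: `validate_task` is a hypothetical helper, and treating files/ as mandatory and requiring a `FROM autoagent-base` line are assumptions drawn from the task description above.

```python
from pathlib import Path

# Layout from the task description; treating every entry as required is an assumption.
REQUIRED_FILES = ("instruction.md", "tests/test.sh", "tests/test.py", "environment/Dockerfile")
REQUIRED_DIRS = ("files",)
BASE_IMAGE = "autoagent-base"


def validate_task(task_dir: Path) -> list[str]:
    """Return human-readable problems; an empty list means the layout looks right."""
    problems = []
    for rel in REQUIRED_FILES:
        if not (task_dir / rel).is_file():
            problems.append(f"missing file: {rel}")
    for rel in REQUIRED_DIRS:
        if not (task_dir / rel).is_dir():
            problems.append(f"missing directory: {rel}/")
    dockerfile = task_dir / "environment" / "Dockerfile"
    if dockerfile.is_file():
        # Collect the tokens of every FROM line and compare the image name (tag ignored).
        from_lines = [
            line.split()
            for line in dockerfile.read_text().splitlines()
            if line.strip().upper().startswith("FROM ")
        ]
        if not any(len(p) > 1 and p[1].split(":")[0] == BASE_IMAGE for p in from_lines):
            problems.append(f"Dockerfile does not build FROM {BASE_IMAGE}")
    return problems
```

Looping `validate_task` over `Path("tasks").iterdir()` and printing any non-empty result gives a cheap pre-flight check before spending container time on a full run.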
Open Questions
- What does “the agent” actually look like? `--agent-import-path agent:AutoAgent` references an `AutoAgent` class; its constructor signature, tool interface, and model selection aren’t documented in the extracted snippets. A repo read of `agent.py` would close the gap.
- How does the meta-agent propose iteration steps? The README mentions hill-climbing on `reward.txt` but the mechanism (gradient-free search, LLM-proposed mutations, RL signal) isn’t surfaced. Worth an `agent.py` + Harbor-docs read.
- Is there a published benchmark set? The task format is documented but no canonical task suite is referenced in the extracted content. Curated benchmark availability would inform whether AutoAgent ships as a framework + suite or framework-only.
- Relationship to upstream Harbor docs. The README defers to “Harbor docs” for task-writing details — discovering the canonical Harbor reference would let the wiki article cross-link the task-format spec.
- Star pop signal. 4.5k stars on a one-developer Python repo is high; check creation date, contributor count, and recent commit cadence on next visit to calibrate whether this is a stable foundation or a viral-week artifact.