AutoAgent — Autonomous Harness Engineering (kevinrgu)

Source: ai-research/kevinrgu-autoagent-2026-05-28.md

Repo: https://github.com/kevinrgu/autoagent
Stars: 4,500 | Forks: 499 | Watchers: 29
Language: Python 100% | License: MIT
Tagline: autonomous harness engineering

AutoAgent is a meta-agent harness that builds and iteratively improves agents by running them against a benchmark of Docker-isolated tasks and hill-climbing on a reward score, produced either by deterministic checks or by an LLM-as-judge. It is built on Harbor for task execution. Each task is a self-contained directory containing instruction.md, tests/ (test.sh + test.py), environment/Dockerfile, and files/. Tests write a score between 0.0 and 1.0 to /logs/reward.txt, and the meta-agent uses that score as the objective for the next iteration. Per the README, performance is improved by equipping the agent with Agent Skills for Context Engineering and context7 skills: capability is added at the skills layer, so the same harness can keep climbing without a custom training pipeline.
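
The reward contract described above (a test writes a float between 0.0 and 1.0 to /logs/reward.txt) can be illustrated with a minimal sketch. This is not taken from the repo: the function names and the "fraction of passed checks" scoring rule are assumptions chosen for illustration, and a real tests/test.py would inspect whatever the agent actually produced.

```python
from pathlib import Path


def clamp_score(raw: float) -> float:
    """Clamp a raw score into the 0.0-1.0 range the reward file uses."""
    return max(0.0, min(1.0, float(raw)))


def write_reward(passed: int, total: int, reward_path: Path) -> float:
    """Write the fraction of passed checks to reward_path and return it.

    Hypothetical scoring rule: the source only says that a score in
    [0.0, 1.0] is written, not how it is computed.
    """
    score = clamp_score(passed / total) if total > 0 else 0.0
    reward_path.parent.mkdir(parents=True, exist_ok=True)
    reward_path.write_text(f"{score:.4f}\n")
    return score


# Inside a task container the call would target the documented path, e.g.:
#   write_reward(passed=2, total=3, reward_path=Path("/logs/reward.txt"))
```

Keeping the score computation separate from the file write makes the scoring rule easy to test outside a container.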

Key Takeaways

Hill-climbs on reward.txt, not on RL gradients. The meta-agent reads task verdicts (deterministic shell + Python verification or LLM-as-judge) and proposes the next iteration of the agent itself. Reward signal is a single float per task per run.
Docker-isolated tasks are the unit of evaluation. Every task runs in its own container whose Dockerfile starts FROM autoagent-base; the base image is built once via docker build -f Dockerfile.base -t autoagent-base . and reused across all task containers. Reference files placed in files/ are available inside the container at runtime.
uv run harbor run is the loop driver. Parallelism via -n flag (default 4, README shows 100 for full-benchmark sweeps). --agent-import-path agent:AutoAgent is how the meta-agent class plugs into Harbor’s task runner. Outputs land in jobs/<job-name>/; latest run log in run.log.
Operator escape hatches for Docker. docker system prune -a -f (heavy) and killall Docker && open -a Docker (recovery) are documented in the README itself — Harbor + parallel containers eats Docker daemons, and the author knows it.
Skills as the performance lever. The README’s “Improving performance” section is one paragraph: equip the agent with Agent Skills for Context Engineering and context7 skills. The harness treats Skills and context7 as the right layer for adding capability, rather than custom training or fine-tuning. This matches the pattern in Tool, Skill, or Subagent?, which treats skills as the cheap composable layer.
Task format is portable. The instruction.md + tests/ + environment/Dockerfile + files/ layout looks transferable across harnesses: the same task directory could in principle drive a different agent or framework with no rewrites.
Sister to but distinct from prior wiki entries on self-improving harnesses. Compare Reflexio (extracts playbooks from runs, drop-in for Claude Code / LangChain / OpenClaw) and Browserbase Autobrowse (browser-specific, graduates SKILL.md from convergent strategies). AutoAgent’s slot is benchmark-driven meta-improvement: the agent IS the artifact, the benchmark is the loss function, Docker tasks are the substrate.
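
The first takeaway (hill-climbing on a single float per task per run) can be sketched as a plain accept-if-better loop. This is a generic illustration under stated assumptions, not AutoAgent's implementation: how the meta-agent actually proposes the next iteration is not documented in the extracted README, so propose and evaluate below are hypothetical callables supplied by the caller, and averaging the per-task rewards is an assumed way of collapsing them into one score.

```python
from typing import Callable, Dict, List, Tuple, TypeVar

Candidate = TypeVar("Candidate")


def mean_reward(rewards: Dict[str, float]) -> float:
    """Average the per-task rewards; an empty run scores 0.0."""
    return sum(rewards.values()) / len(rewards) if rewards else 0.0


def hill_climb(
    initial: Candidate,
    propose: Callable[[Candidate, Dict[str, float]], Candidate],
    evaluate: Callable[[Candidate], Dict[str, float]],
    iterations: int,
) -> Tuple[Candidate, float, List[float]]:
    """Keep a proposed candidate only if its mean reward strictly improves.

    propose(best, best_rewards) returns a new candidate (hypothetical hook);
    evaluate(candidate) returns a {task_name: reward} mapping (hypothetical hook).
    """
    best = initial
    best_rewards = evaluate(best)
    best_score = mean_reward(best_rewards)
    history = [best_score]
    for _ in range(iterations):
        candidate = propose(best, best_rewards)
        rewards = evaluate(candidate)
        score = mean_reward(rewards)
        if score > best_score:
            # Accept the improvement; otherwise the previous best is retained.
            best, best_rewards, best_score = candidate, rewards, score
        history.append(best_score)
    return best, best_score, history
```

Because a candidate is only accepted on strict improvement, the recorded history never decreases, which is the ratchet behaviour the takeaway describes.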

How it compares to existing harness articles

Harness	Slot	What gets improved	Substrate
Reflexio	Cross-domain	Per-user profiles + per-task playbooks (retrieval over recipes)	
Browserbase Autobrowse	Browser-specific	SKILL.md graduated from convergent strategy iterations	
Verification-loop skills (Sid Benesaria)	Cross-domain (Claude Code)	Self-improving verification skills that hill-climb on a criterion	
AutoAgent	Cross-domain (benchmark-shaped)	The agent itself, evaluated against a Docker-isolated task set	Docker task containers, hill-climbing on reward.txt

The architectural lesson across all four: the harness — not the base model — is where the design choices compound. AutoAgent’s specific take is that a tight task → reward → iterate loop, fully containerized for reproducibility, is the cleanest way to drive that compounding.
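
The reward half of the task → reward → iterate loop can be summarised per run by gathering every reward file a job produced. The sketch below assumes reward.txt files can be found somewhere beneath the job directory; that layout is an assumption for illustration, since the directory structure Harbor writes under jobs/<job-name>/ is not described in the source. The function names are hypothetical.

```python
from pathlib import Path
from typing import Dict, Tuple


def collect_rewards(job_dir: Path) -> Dict[str, float]:
    """Map each directory that holds a reward.txt to its parsed score.

    Assumes reward files live somewhere under job_dir (layout not confirmed).
    """
    rewards: Dict[str, float] = {}
    for reward_file in sorted(job_dir.rglob("reward.txt")):
        try:
            score = float(reward_file.read_text().strip())
        except ValueError:
            continue  # Skip empty or malformed reward files.
        key = reward_file.parent.relative_to(job_dir).as_posix()
        rewards[key] = score
    return rewards


def summarize(rewards: Dict[str, float]) -> Tuple[int, float]:
    """Return (number of scored tasks, mean reward); empty input gives (0, 0.0)."""
    if not rewards:
        return 0, 0.0
    return len(rewards), sum(rewards.values()) / len(rewards)
```

Comparing the mean from summarize across successive runs gives one number per iteration to track.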

Related

Reflexio: sibling self-improvement harness; different mechanism (retrieval over playbooks) but same north star
Browserbase Autobrowse: domain-specific sibling (browser strategies → graduated skills)
Memory Stores + Dreaming: Anthropic’s first-party version of the same idea (multi-session memory + asynchronous batch consolidation)
Tool, Skill, or Subagent? (Will, Applied AI): skills as the cheap composable layer for capability addition, the same lever AutoAgent recommends
Stop Babysitting Your Agents (Sid Benesaria): self-improving verification skills, same hill-climbing-on-criterion shape
Karpathy autoresearch ratchet: conceptual root of the iterate-against-a-criterion pattern adopted across these harnesses
2026 Claude Code AIOS Pattern: broader pattern context, the agent OS where each loop compounds

Try It

# 1. Clone
git clone https://github.com/kevinrgu/autoagent.git
cd autoagent
 
# 2. Build the base image
docker build -f Dockerfile.base -t autoagent-base .
 
# 3. Add task directories under tasks/
#    Each task = instruction.md + tests/test.sh + tests/test.py +
#    environment/Dockerfile (FROM autoagent-base) + files/
 
# 4. Single task (verify the loop works)
rm -rf jobs && mkdir -p jobs && \
uv run harbor run -p tasks/ \
  --task-name "<task-name>" -l 1 -n 1 \
  --agent-import-path agent:AutoAgent \
  -o jobs --job-name latest > run.log 2>&1
 
# 5. Parallel sweep (100 concurrent runs)
rm -rf jobs && mkdir -p jobs && \
uv run harbor run -p tasks/ -n 100 \
  --agent-import-path agent:AutoAgent \
  -o jobs --job-name latest > run.log 2>&1
 
# Reset Docker Desktop (macOS) if it stops responding mid-sweep:
killall Docker && open -a Docker
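
Before running step 4, it can help to confirm that each task directory matches the layout listed in step 3. The checker below is a minimal sketch, not part of the repo: the function names are hypothetical, and treating files/ as mandatory and requiring a FROM autoagent-base line are assumptions drawn from the task description above rather than rules Harbor is known to enforce.

```python
from pathlib import Path
from typing import List

REQUIRED_FILES = (
    "instruction.md",
    "tests/test.sh",
    "tests/test.py",
    "environment/Dockerfile",
)
REQUIRED_DIRS = ("files",)  # Assumed mandatory for this sketch.
BASE_IMAGE = "autoagent-base"


def base_images(dockerfile_text: str) -> List[str]:
    """Return the image names (without tags) named on FROM lines."""
    images = []
    for line in dockerfile_text.splitlines():
        tokens = line.split()
        if len(tokens) >= 2 and tokens[0].upper() == "FROM":
            # Skip flags such as --platform=... that may precede the image.
            args = [t for t in tokens[1:] if not t.startswith("--")]
            if args:
                images.append(args[0].split(":")[0])
    return images


def check_task_dir(task_dir: Path) -> List[str]:
    """Return a list of layout problems; an empty list means the task looks complete."""
    problems = []
    for rel in REQUIRED_FILES:
        if not (task_dir / rel).is_file():
            problems.append(f"missing file: {rel}")
    for rel in REQUIRED_DIRS:
        if not (task_dir / rel).is_dir():
            problems.append(f"missing directory: {rel}/")
    dockerfile = task_dir / "environment" / "Dockerfile"
    if dockerfile.is_file() and BASE_IMAGE not in base_images(dockerfile.read_text()):
        problems.append(f"environment/Dockerfile does not build FROM {BASE_IMAGE}")
    return problems
```

Running check_task_dir over every subdirectory of tasks/ and printing any non-empty result catches missing files before a container is built.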

To improve performance: equip the agent with Agent Skills for Context Engineering and context7 skills (README’s stated lever).

Open Questions

What does “the agent” actually look like? --agent-import-path agent:AutoAgent references an AutoAgent class — its constructor signature, tool interface, and model selection aren’t documented in the extracted snippets. A repo read of agent.py would close the gap.
How does the meta-agent propose iteration steps? The README mentions hill-climbing on reward.txt but the mechanism (gradient-free search, LLM-proposed mutations, RL signal) isn’t surfaced. Worth an agent.py + Harbor-docs read.
Is there a published benchmark set? The task format is documented but no canonical task suite is referenced in the extracted content. Curated benchmark availability would inform whether AutoAgent ships as a framework + suite or framework-only.
Relationship to upstream Harbor docs. The README defers to “Harbor docs” for task-writing details — discovering the canonical Harbor reference would let the wiki article cross-link the task-format spec.
Star pop signal. 4.5k stars on a one-developer Python repo is high; check creation date, contributor count, and recent commit cadence on next visit to calibrate whether this is a stable foundation or a viral-week artifact.

Jonathon's AI Wiki
