Source: ai-research/kevinrgu-autoagent-2026-05-28.md

Repo: https://github.com/kevinrgu/autoagent
Stars: 4,500 | Forks: 499 | Watchers: 29
Language: Python 100% | License: MIT
Tagline: autonomous harness engineering

A meta-agent harness that builds and iteratively improves agents by running them against a benchmark of Docker-isolated tasks and hill-climbing on a reward score produced by deterministic checks or an LLM-as-judge. It is built on Harbor for task execution; each task is a self-contained directory with instruction.md + tests/ (test.sh + test.py) + environment/Dockerfile + files/. Tests write a score (0.0-1.0) to /logs/reward.txt, and the meta-agent uses that score as its loss function (here, a reward to maximize) for the next iteration. Per the README, performance is improved by equipping the agent with Agent Skills for Context Engineering and context7 skills: capability is added at the skills layer, so the same harness can keep climbing without a custom training pipeline.
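
The reward contract described above (a float between 0.0 and 1.0 written to /logs/reward.txt) can be sketched as a small helper that a task's tests/test.py might call. This is a minimal illustration, not code from the repo: write_reward and score_checks are hypothetical names, and clamping plus "fraction of passed checks" scoring are assumptions about how a deterministic verifier could produce its score.

```python
from pathlib import Path


def score_checks(checks: list[bool]) -> float:
    """Hypothetical scoring rule: fraction of passed checks (empty list scores 0.0)."""
    return sum(checks) / len(checks) if checks else 0.0


def write_reward(score: float, reward_path: Path = Path("/logs/reward.txt")) -> float:
    """Clamp the score to [0.0, 1.0] and write it as a single float.

    The default path is the one named in the article; clamping is an
    assumption added so an out-of-range score cannot leak into the loop.
    """
    clamped = max(0.0, min(1.0, float(score)))
    reward_path.parent.mkdir(parents=True, exist_ok=True)
    reward_path.write_text(f"{clamped}\n")
    return clamped
```

Inside a task container, a verifier would end with something like write_reward(score_checks([output_exists, output_matches])), where the two booleans stand for whatever checks the task defines.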

Key Takeaways

  • Hill-climbs on reward.txt, not on RL gradients. The meta-agent reads task verdicts (deterministic shell + Python verification or LLM-as-judge) and proposes the next iteration of the agent itself. Reward signal is a single float per task per run.
  • Docker-isolated tasks are the unit of evaluation. Every task runs in its own container whose environment/Dockerfile builds FROM autoagent-base; the base image is built once via docker build -f Dockerfile.base -t autoagent-base . and reused across all task containers. Reference files in files/ are available inside the container at runtime.
  • uv run harbor run is the loop driver. Parallelism via -n flag (default 4, README shows 100 for full-benchmark sweeps). --agent-import-path agent:AutoAgent is how the meta-agent class plugs into Harbor’s task runner. Outputs land in jobs/<job-name>/; latest run log in run.log.
  • Operator escape hatches for Docker. docker system prune -a -f (heavy) and killall Docker && open -a Docker (recovery, macOS Docker Desktop syntax) are documented in the README itself. Harbor plus many parallel containers can overwhelm the Docker daemon, and the author knows it.
  • Skills as the performance lever. The README’s “Improving performance” section is one paragraph: equip the agent with Agent Skills for Context Engineering and context7 skills. The harness assumes Skills/context7 are the right layer to bolt capability on, not custom training or fine-tuning. This aligns with the pattern in “Tool, Skill, or Subagent?”: skills as the cheap composable layer.
  • Task format is portable. instruction.md + tests/ + environment/Dockerfile + files/ looks transferable across harnesses: the same task directory could in principle drive a different agent or framework with no rewrites.
  • Sister to but distinct from prior wiki entries on self-improving harnesses. Compare Reflexio (extracts playbooks from runs, drop-in for Claude Code / LangChain / OpenClaw) and Browserbase Autobrowse (browser-specific, graduates SKILL.md from convergent strategies). AutoAgent’s slot is benchmark-driven meta-improvement: the agent IS the artifact, the benchmark is the loss function, Docker tasks are the substrate.
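
The "benchmark is the loss function" idea in the takeaways can be illustrated with a generic hill-climbing loop. The source does not document how AutoAgent actually proposes or accepts changes, so everything here is an assumption: propose and evaluate are hypothetical callables, the acceptance rule is plain greedy "keep it only if the mean reward strictly improves", and the candidate can be any object that stands in for an agent version.

```python
from statistics import mean
from typing import Callable, TypeVar

Candidate = TypeVar("Candidate")


def hill_climb(
    initial: Candidate,
    propose: Callable[[Candidate], Candidate],
    evaluate: Callable[[Candidate], list[float]],
    iterations: int,
) -> tuple[Candidate, float]:
    """Greedy hill-climbing over per-task rewards.

    evaluate returns one reward in [0.0, 1.0] per task and must return at
    least one value; the benchmark score is the mean of those rewards.
    """
    best = initial
    best_score = mean(evaluate(best))
    for _ in range(iterations):
        candidate = propose(best)  # hypothetical: e.g. an edited agent
        score = mean(evaluate(candidate))  # hypothetical: one benchmark run
        if score > best_score:  # accept strict improvements only
            best, best_score = candidate, score
    return best, best_score
```

In a real setup, evaluate would wrap a full benchmark run and collect each task's reward float, which makes every iteration as expensive as one sweep; that cost is one reason a strict greedy rule is a plausible default for a sketch like this.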

How it compares to existing harness articles

| Harness | Slot | What gets improved | Substrate |
| --- | --- | --- | --- |
| Reflexio | Cross-domain | Per-user profiles + per-task playbooks (retrieval over recipes) | Real agent runs, retrieved next time |
| Browserbase Autobrowse | Browser-specific | SKILL.md graduated from convergent strategy iterations | Real-site browser sessions, capped at ~3-5 iters |
| Verification-loop skills (Sid Benesaria) | Cross-domain (Claude Code) | Self-improving verification skills that hill-climb on a criterion | Claude Code session loop |
| AutoAgent | Cross-domain (benchmark-shaped) | The agent itself, evaluated against a Docker-isolated task set | Docker task containers, hill-climbing on reward.txt |

The architectural lesson across all four: the harness — not the base model — is where the design choices compound. AutoAgent’s specific take is that a tight task → reward → iterate loop, fully containerized for reproducibility, is the cleanest way to drive that compounding.

Try It

# 1. Clone
git clone https://github.com/kevinrgu/autoagent.git
cd autoagent
 
# 2. Build the base image
docker build -f Dockerfile.base -t autoagent-base .
 
# 3. Add task directories under tasks/
#    Each task = instruction.md + tests/test.sh + tests/test.py +
#    environment/Dockerfile (FROM autoagent-base) + files/
 
# 4. Single task (verify the loop works)
rm -rf jobs && mkdir -p jobs && \
uv run harbor run -p tasks/ \
  --task-name "<task-name>" -l 1 -n 1 \
  --agent-import-path agent:AutoAgent \
  -o jobs --job-name latest > run.log 2>&1
 
# 5. Parallel sweep (100 concurrent runs)
rm -rf jobs && mkdir -p jobs && \
uv run harbor run -p tasks/ -n 100 \
  --agent-import-path agent:AutoAgent \
  -o jobs --job-name latest > run.log 2>&1
 
# Reset Docker if it goes catatonic mid-sweep (macOS Docker Desktop):
killall Docker && open -a Docker

To improve performance: equip the agent with Agent Skills for Context Engineering and context7 skills (README’s stated lever).
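
Before launching a sweep, it can help to confirm that each task directory matches the layout listed in step 3. The check below is a minimal sketch: missing_task_entries is a hypothetical helper, and treating every listed entry (including files/) as mandatory is an assumption, since the source only lists the layout and does not say which parts are optional.

```python
from pathlib import Path

# Layout as listed in the article; whether each entry is strictly
# required by Harbor is an assumption of this sketch.
REQUIRED_FILES = (
    "instruction.md",
    "tests/test.sh",
    "tests/test.py",
    "environment/Dockerfile",
)
REQUIRED_DIRS = ("files",)


def missing_task_entries(task_dir: Path) -> list[str]:
    """Return the expected entries that are absent from a task directory."""
    missing = [rel for rel in REQUIRED_FILES if not (task_dir / rel).is_file()]
    missing += [f"{rel}/" for rel in REQUIRED_DIRS if not (task_dir / rel).is_dir()]
    return missing
```

Calling missing_task_entries(Path("tasks") / "<task-name>") for each subdirectory of tasks/ gives a quick pre-flight report; an empty list means the directory has the expected shape, not that its tests are correct.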

Open Questions

  • What does “the agent” actually look like? --agent-import-path agent:AutoAgent references an AutoAgent class — its constructor signature, tool interface, and model selection aren’t documented in the extracted snippets. A repo read of agent.py would close the gap.
  • How does the meta-agent propose iteration steps? The README mentions hill-climbing on reward.txt but the mechanism (gradient-free search, LLM-proposed mutations, RL signal) isn’t surfaced. Worth an agent.py + Harbor-docs read.
  • Is there a published benchmark set? The task format is documented but no canonical task suite is referenced in the extracted content. Curated benchmark availability would inform whether AutoAgent ships as a framework + suite or framework-only.
  • Relationship to upstream Harbor docs. The README defers to “Harbor docs” for task-writing details — discovering the canonical Harbor reference would let the wiki article cross-link the task-format spec.
  • Star pop signal. 4.5k stars on a one-developer Python repo is high; check creation date, contributor count, and recent commit cadence on next visit to calibrate whether this is a stable foundation or a viral-week artifact.