Source: raw/newsletter-epoch-ai-b3167bb772.md (The Epoch Brief, June 26 2026, Elliot Stewart). Primary: epoch.ai/mirrorcode — the newsletter is the fetched secondary; the full per-model results table and methodology live at the primary and were not directly fetched.

MirrorCode is a new long-horizon coding benchmark co-developed by Epoch AI and METR, built to answer one question: what’s the largest software project AI can complete on its own? It tasks models with rebuilding 25 real-world programs from scratch — no source code, no human in the loop — and crucially gives them an inference budget large enough for a serious attempt (one run cost $2,600 and ran 19 days unattended). At launch, Claude Opus 4.7 leads with a 56% solve rate, which Epoch frames as “significant room for further improvement.” It is a yardstick for autonomous, multi-day coding — directly relevant to anyone betting on agents finishing large software tasks without supervision.

Key Takeaways

  • What it is: a long-horizon coding benchmark from Epoch AI + METR, launched in the June 26 2026 Epoch Brief. Tagline question: “What’s the largest software project AI can complete on its own?”
  • The task: rebuild 25 real-world programs spanning bioinformatics, Unix utilities, cryptography, interpreters, and more — with no access to source code and no human in the loop.
  • Difficulty: the hardest programs would take a human engineer weeks to months to build unaided.
  • Big-budget by design: most SWE benchmarks cap inference at ~$10 per task and run for minutes to, at most, hours. MirrorCode instead funds a real attempt: one of the largest tasks cost $2,600 for a single run, with the model working 19 days without human intervention.
  • Leaderboard headline: Claude Opus 4.7 leads at a 56% solve rate — Epoch reads the sub-60% top score as evidence of “significant room for further improvement.”
  • The practical signal: $2,600 and 19 days for a single run is roughly today’s frontier of “just let the agent run.” That is the current ceiling on unattended, long-horizon coding, and a 56% top solve rate means that even with budgets on that scale available, 44% of the benchmark goes unsolved.

What MirrorCode measures

MirrorCode targets autonomous, long-horizon software engineering — the ability of a model to take a large spec and rebuild a working program end-to-end on its own. It deliberately departs from the typical SWE-benchmark setup in two ways:

  • No human in the loop and no source access. The model rebuilds each of the 25 target programs from behavior/spec alone, not by editing an existing repo or with a human steering between steps.
  • Long horizon. The hardest targets are estimated at weeks to months of unaided human-engineer effort, so a successful run reflects sustained multi-day autonomy, not a quick patch.

This makes it a measure of “largest project completable unattended” rather than the “can the model fix this bug” question that most existing SWE evals reward.

Methodology

  • 25 real-world programs across domains: bioinformatics, Unix utilities, cryptography, interpreters, and more.
  • Inference budget is the key design choice. Where many SWE benchmarks cap spend at ~$10/task and runs at minutes-to-hours, MirrorCode provisions enough budget for a genuine attempt. Epoch’s worked example: one of the largest tasks cost $2,600 for a single run, with the model working 19 days without human intervention. [The $2,600 / 19-day figure describes one of the largest individual tasks, not every task in the suite, per the newsletter.]
  • Solve rate is the headline metric (programs successfully rebuilt). The full per-model table and methodology are at the primary, epoch.ai/mirrorcode; the newsletter quotes only the top line.
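
To put the budget gap in concrete terms, here is a minimal arithmetic sketch using only the figures quoted in the newsletter. The ~$10 cap is approximate, the $2,600 / 19-day run is a single large task rather than a suite-wide average, and the function and constant names are illustrative, not part of MirrorCode.

```python
# Figures quoted in the newsletter (the ~$10 cap is approximate).
TYPICAL_CAP_USD = 10       # rough per-task cap on many SWE benchmarks
LARGEST_RUN_USD = 2_600    # one of the largest MirrorCode tasks, single run
LARGEST_RUN_DAYS = 19      # unattended wall-clock time for that run


def budget_multiple(run_cost_usd: float, typical_cap_usd: float) -> float:
    """How many times larger a run's cost is than a typical per-task cap."""
    return run_cost_usd / typical_cap_usd


def days_to_hours(days: float) -> float:
    """Convert wall-clock days to hours."""
    return days * 24


multiple = budget_multiple(LARGEST_RUN_USD, TYPICAL_CAP_USD)
hours = days_to_hours(LARGEST_RUN_DAYS)
print(f"Budget multiple vs. typical cap: {multiple:.0f}x")
print(f"Unattended wall-clock time: {hours:.0f} hours")
```

On these numbers, the largest quoted run spent roughly 260 times a typical per-task cap and ran for 456 hours. That describes the upper end of the suite only; the per-task distribution is not in the newsletter.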

Results

  • Claude Opus 4.7 — 56% solve rate, leading all other models tested. Epoch’s framing: a sub-60% top score means “significant room for further improvement.”
  • Other models’ scores and the full leaderboard are not in the newsletter — see the primary for the complete table.
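
As a rough illustration of how a solve rate over a 25-program suite could be computed, the sketch below assumes a binary pass/fail outcome per program. That assumption is not confirmed by the newsletter, which does not state the pass criterion or whether results are aggregated across multiple runs. The outcome list is hypothetical and chosen only to show the arithmetic; it is not reported data.

```python
from typing import Sequence


def solve_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of tasks marked solved, assuming one pass/fail flag per task."""
    if not outcomes:
        raise ValueError("no task outcomes supplied")
    return sum(outcomes) / len(outcomes)


# Hypothetical outcomes for a 25-program suite: 14 solved, 11 unsolved.
# Illustrative only; the newsletter does not give per-program results.
hypothetical_outcomes = [True] * 14 + [False] * 11

rate = solve_rate(hypothetical_outcomes)
print(f"Solve rate: {rate:.0%}")
```

Under this binary assumption, 14 of 25 programs works out to 56%, which matches the headline figure arithmetically. Whether the reported 56% was actually derived this way needs to be checked against the methodology at epoch.ai/mirrorcode.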

Also in the June 26 Epoch Brief (context, not part of the benchmark): Epoch projects hyperscaler capex will overtake operating cash flows by end of 2026, and published two Gradient Updates — an analysis of 1,604 Chinese AI-lab job postings and a proposed AI-R&D taxonomy.

Try It

  • Use 56% as your “unattended ceiling” prior. When someone pitches an agent that completes large software projects with no supervision, anchor on the fact that the leading model solves ~56% of MirrorCode’s tasks, even with budgets that reached $2,600 and 19 days on one of the largest tasks. Treat claims of fully autonomous large builds that go beyond that as needing evidence.
  • Right-size the budget question. MirrorCode’s contribution is showing that long-horizon coding capability only appears when you stop capping inference at ~$10 per task. If you’re evaluating whether to “let it run,” budget and wall-clock time (days, not minutes) are the real variables, not just model choice.
  • Cross-read against first-party SWE numbers. Pair this with the Opus/Fable SWE-bench Verified/Pro figures already tracked in the wiki (see Related) to separate “fix-a-bug” benchmarks from “build-the-whole-thing” ones — they reward very different things.
  • Pull the full table before citing per-model gaps. The newsletter only gives the leader. For any model-vs-model comparison, read epoch.ai/mirrorcode directly.

Open Questions

  • Which other models, and at what scores? The newsletter gives only the Opus-4.7 leader (56%). The full leaderboard, including any frontier models newer than Opus 4.7, is at the primary and was not fetched here.
  • What exactly counts as “solved”? The pass criterion (functional equivalence? test-suite pass? partial credit?) is not stated in the newsletter; the methodology lives at epoch.ai/mirrorcode.
  • Will the numbers shift? Solve rates may move as labs tune for long-horizon autonomy and as Epoch/METR add or re-score tasks. This article reflects the launch-day top line only.
  • How representative is $2,600 / 19 days? It describes one of the largest single tasks — the budget/time distribution across all 25 programs is not in the secondary source.