Source: raw/newsletter-epoch-ai-b3167bb772.md (The Epoch Brief, June 26 2026, Elliot Stewart). Primary: epoch.ai/mirrorcode — the newsletter is the fetched secondary; the full per-model results table and methodology live at the primary and were not directly fetched.
MirrorCode is a new long-horizon coding benchmark co-developed by Epoch AI and METR, built to answer one question: what’s the largest software project AI can complete on its own? It tasks models with rebuilding 25 real-world programs from scratch — no source code, no human in the loop — and crucially gives them an inference budget large enough for a serious attempt (one run cost $2,600 and ran 19 days unattended). At launch, Claude Opus 4.7 leads with a 56% solve rate, which Epoch frames as “significant room for further improvement.” It is a yardstick for autonomous, multi-day coding — directly relevant to anyone betting on agents finishing large software tasks without supervision.
Key Takeaways
- What it is: a long-horizon coding benchmark from Epoch AI + METR, launched in the June 26 2026 Epoch Brief. Tagline question: “What’s the largest software project AI can complete on its own?”
- The task: rebuild 25 real-world programs spanning bioinformatics, Unix utilities, cryptography, interpreters, and more — with no access to source code and no human in the loop.
- Difficulty: the hardest programs would take a human engineer weeks to months to build unaided.
- Big-budget by design: most SWE benchmarks cap inference at ~$10 per task and run for minutes to, at most, hours. MirrorCode instead funds a real attempt: one of the largest tasks cost $2,600 for a single run, with the model working 19 days without human intervention.
- Leaderboard headline: Claude Opus 4.7 leads at a 56% solve rate — Epoch reads the sub-60% top score as evidence of “significant room for further improvement.”
- The practical signal: $2,600 and 19 days for one of the largest tasks shows roughly what a serious “just let the agent run” attempt costs today. It is a data point from a single run, not a measured ceiling on unattended, long-horizon coding. The 56% top solve rate means the leading model still fails 44% of the suite’s programs even with generous inference budgets.
What MirrorCode measures
MirrorCode targets autonomous, long-horizon software engineering — the ability of a model to take a large spec and rebuild a working program end-to-end on its own. It deliberately departs from the typical SWE-benchmark setup in two ways:
- No human in the loop and no source access. The model rebuilds each of the 25 target programs from behavior/spec alone, not by editing an existing repo or with a human steering between steps.
- Long horizon. The hardest targets are estimated at weeks to months of unaided human-engineer effort, so a successful run reflects sustained multi-day autonomy, not a quick patch.
This makes it a measure of “largest project completable unattended” rather than “can the model fix this bug,” which is what most existing SWE evals reward.
Methodology
- 25 real-world programs across domains: bioinformatics, Unix utilities, cryptography, interpreters, and more.
- Inference budget is the key design choice. Where many SWE benchmarks cap spend at ~$10/task and runs at minutes-to-hours, MirrorCode provisions enough budget for a genuine attempt. Epoch’s worked example: one of the largest tasks cost $2,600 for a single run, with the model working 19 days without human intervention. [The $2,600 / 19-day figure describes one of the largest individual tasks, not every task in the suite, per the newsletter.]
- Solve rate is the headline metric (programs successfully rebuilt). The full per-model table and methodology are at the primary, epoch.ai/mirrorcode; the newsletter quotes only the top line.
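Because the newsletter quotes only a single top-line percentage, it may help to see how much uncertainty a 25-program suite carries. The sketch below assumes (the newsletter does not confirm this) that solve rate is simply solved programs divided by 25, with one scored attempt per program, so that 56% corresponds to 14 of 25. It then computes a Wilson score interval as a rough 95% range. The function names are illustrative, not part of any Epoch or METR tooling.

```python
import math


def solve_rate(solved: int, total: int) -> float:
    """Fraction of target programs successfully rebuilt."""
    if total <= 0:
        raise ValueError("total must be positive")
    return solved / total


def wilson_interval(solved: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    p = solve_rate(solved, total)
    z2 = z * z
    denom = 1 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    return center - half, center + half


# ASSUMPTION: 56% of 25 programs means 14 solved, one scored attempt each.
TOTAL_PROGRAMS = 25
SOLVED = 14

rate = solve_rate(SOLVED, TOTAL_PROGRAMS)
low, high = wilson_interval(SOLVED, TOTAL_PROGRAMS)
print(f"solve rate: {rate:.0%}")
print(f"approx. 95% interval: {low:.0%} to {high:.0%}")
```

Under these assumptions the interval spans roughly 37% to 73%, so small gaps between models on a suite this size should be read with caution until the per-model table and scoring rules at the primary are checked.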
Results
- Claude Opus 4.7 — 56% solve rate, leading all other models tested. Epoch’s framing: a sub-60% top score means “significant room for further improvement.”
- Other models’ scores and the full leaderboard are not in the newsletter — see the primary for the complete table.
Also in the June 26 Epoch Brief (context, not part of the benchmark): Epoch projects hyperscaler capex will overtake operating cash flows by end of 2026, and published two Gradient Updates — an analysis of 1,604 Chinese AI-lab job postings and a proposed AI-R&D taxonomy.
Try It
- Use 56% as your “unattended ceiling” prior. When someone pitches an agent that completes large software projects with no supervision, anchor on the fact that the leading model solves about 56% of MirrorCode’s programs, on a benchmark where a single large run cost $2,600 and took 19 days. Treat “fully autonomous large-build” claims above that as needing evidence.
- Right-size the budget question. MirrorCode’s design premise is that long-horizon coding capability cannot be measured while inference is capped at ~$10 per task. If you’re evaluating whether to “let it run,” budget and wall-clock time (days, not minutes) are the real variables, not just model choice.
- Cross-read against first-party SWE numbers. Pair this with the Opus/Fable SWE-bench Verified/Pro figures already tracked in the wiki (see Related) to separate “fix-a-bug” benchmarks from “build-the-whole-thing” ones — they reward very different things.
- Pull the full table before citing per-model gaps. The newsletter only gives the leader. For any model-vs-model comparison, read epoch.ai/mirrorcode directly.
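The budget comparison in the list above comes down to simple arithmetic, sketched here. The $2,600 and 19-day figures are from the newsletter and describe one of the largest tasks; the roughly $10 cap is the newsletter's characterization of typical SWE benchmarks, read here as US dollars. The `expected_cost_per_success` helper is a hypothetical planning aid that assumes independent retries at a fixed per-run cost and success probability, which real agent runs may not satisfy, and its example inputs are made up rather than taken from MirrorCode.

```python
# Figures quoted in the newsletter for one of the largest MirrorCode tasks.
LARGE_RUN_COST_USD = 2_600
LARGE_RUN_DAYS = 19
# Rough per-task cap the newsletter attributes to most SWE benchmarks.
TYPICAL_CAP_USD = 10


def budget_multiple(run_cost: float, cap: float) -> float:
    """How many times larger one run's cost is than a per-task cap."""
    if cap <= 0:
        raise ValueError("cap must be positive")
    return run_cost / cap


def expected_cost_per_success(run_cost: float, p_success: float) -> float:
    """Expected spend until the first success, assuming independent retries.

    With success probability p per run, the expected number of runs is 1/p
    (geometric distribution), so expected spend is run_cost / p.
    """
    if not 0 < p_success <= 1:
        raise ValueError("p_success must be in (0, 1]")
    return run_cost / p_success


print(budget_multiple(LARGE_RUN_COST_USD, TYPICAL_CAP_USD))  # 260.0
print(LARGE_RUN_DAYS * 24)  # 456 hours of wall-clock time

# HYPOTHETICAL inputs for illustration only: $500 per run, 50% success.
print(expected_cost_per_success(500.0, 0.5))  # 1000.0
```

The 260x gap is the point of the benchmark's design. The retry helper is included only to show why per-run cost and success probability need to be considered together when deciding whether to let an agent run.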
Related
- Are Mythos’ Cyber Capabilities Overhyped? (Epoch AI) — sibling Epoch AI benchmark analysis; same “control for spend before reading a capability jump” discipline applies to both MirrorCode’s big-budget design and the CVE-spike confound.
- Measuring AI Agent Autonomy in Practice (Anthropic) — the first-party counterpart to MirrorCode’s question; both try to quantify how far an agent gets unattended.
- The Verification Frontier — a 56% solve rate at multi-day budgets is exactly the regime where verifying autonomous output (rather than trusting it) becomes load-bearing.
- AI Industry Research — topic hub; Epoch AI + METR are named independent-benchmark anchors here.
Open Questions
- Which other models, and at what scores? The newsletter names only the leader, Claude Opus 4.7 (56%). The full leaderboard, including any frontier models newer than Opus 4.7, is at the primary and was not fetched here.
- What exactly counts as “solved”? The pass criterion (functional equivalence? test-suite pass? partial credit?) is not stated in the newsletter; the methodology lives at epoch.ai/mirrorcode.
- Will the numbers shift? Solve rates may move as labs tune for long-horizon autonomy and as Epoch/METR add or re-score tasks. This article reflects the launch-day top line only.
- How representative is $2,600 / 19 days? It describes one of the largest single tasks — the budget/time distribution across all 25 programs is not in the secondary source.