MirrorCode (Epoch AI + METR long-horizon coding benchmark)

Source: raw/newsletter-epoch-ai-b3167bb772.md (The Epoch Brief, June 26 2026, Elliot Stewart). Primary: epoch.ai/mirrorcode — the newsletter is the fetched secondary; the full per-model results table and methodology live at the primary and were not directly fetched.

MirrorCode is a new long-horizon coding benchmark co-developed by Epoch AI and METR, built to answer one question: what’s the largest software project AI can complete on its own? It tasks models with rebuilding 25 real-world programs from scratch — no source code, no human in the loop — and crucially gives them an inference budget large enough for a serious attempt (one run cost $2,600 and ran 19 days unattended). At launch, Claude Opus 4.7 leads with a 56% solve rate, which Epoch frames as “significant room for further improvement.” It is a yardstick for autonomous, multi-day coding — directly relevant to anyone betting on agents finishing large software tasks without supervision.

Key Takeaways

What it is: a long-horizon coding benchmark from Epoch AI + METR, launched in the June 26 2026 Epoch Brief. Tagline question: “What’s the largest software project AI can complete on its own?”
The task: rebuild 25 real-world programs spanning bioinformatics, Unix utilities, cryptography, interpreters, and more — with no access to source code and no human in the loop.
Difficulty: the hardest programs would take a human engineer weeks to months to build unaided.
Big-budget by design: most SWE benchmarks cap inference at ~$1-$10 per task and run for minutes to at most hours. MirrorCode instead funds a real attempt: one of the largest tasks cost $2,600 for a single run, with the model working 19 days without human intervention.
Leaderboard headline: Claude Opus 4.7 leads at a 56% solve rate — Epoch reads the sub-60% top score as evidence of “significant room for further improvement.”
The practical signal: $2,600 / 19 days is roughly today’s frontier of “just let the agent run,” though that figure describes one of the largest single tasks rather than the whole suite. A 56% top solve rate means roughly 44% of the tasks go unsolved even when runs are funded at this scale.
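The budget figures above can be put on a common scale with a little arithmetic. The sketch below is illustrative only: the $2,600 and 19-day values are the newsletter's figures for one of the largest tasks, the $1 and $10 caps are its approximate description of typical SWE benchmarks, and the function names are hypothetical helpers, not part of MirrorCode.

```python
# Figures quoted in the newsletter for one of the largest MirrorCode tasks.
RUN_COST_USD = 2600
RUN_DAYS = 19

# Approximate per-task caps the newsletter attributes to typical SWE benchmarks.
TYPICAL_CAP_LOW_USD = 1
TYPICAL_CAP_HIGH_USD = 10


def cost_per_day(total_usd: float, days: float) -> float:
    """Average spend per day of unattended running."""
    return total_usd / days


def budget_multiple(total_usd: float, cap_usd: float) -> float:
    """How many times larger the run budget is than a per-task cap."""
    return total_usd / cap_usd


if __name__ == "__main__":
    print(f"Average spend: ${cost_per_day(RUN_COST_USD, RUN_DAYS):.2f} per day")
    print(f"Versus a $10 cap: {budget_multiple(RUN_COST_USD, TYPICAL_CAP_HIGH_USD):.0f}x")
    print(f"Versus a $1 cap: {budget_multiple(RUN_COST_USD, TYPICAL_CAP_LOW_USD):.0f}x")
```

On these numbers the run averages about $137 per day and is 260 to 2,600 times a typical per-task cap. Because the figure describes a single large task, it should not be read as the average cost across the suite.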

What MirrorCode measures

MirrorCode targets autonomous, long-horizon software engineering — the ability of a model to take a large spec and rebuild a working program end-to-end on its own. It deliberately departs from the typical SWE-benchmark setup in two ways:

No human in the loop and no source access. The model rebuilds each of the 25 target programs from behavior/spec alone, not by editing an existing repo or with a human steering between steps.
Long horizon. The hardest targets are estimated at weeks to months of unaided human-engineer effort, so a successful run reflects sustained multi-day autonomy, not a quick patch.

This makes it a measure of “largest project completable unattended” rather than “can the model fix this bug,” which is what most existing SWE evals reward.

Methodology

25 real-world programs across domains: bioinformatics, Unix utilities, cryptography, interpreters, and more.
Inference budget is the key design choice. Where many SWE benchmarks cap spend at ~$1-$10 per task and runs at minutes to hours, MirrorCode provisions enough budget for a genuine attempt. Epoch’s worked example: one of the largest tasks cost $2,600 for a single run and involved the model working 19 days without human intervention. [The $2,600 / 19-day figure describes one of the largest individual tasks, not every task in the suite, per the newsletter.]
Solve rate is the headline metric (programs successfully rebuilt). The full per-model table and methodology are at the primary, epoch.ai/mirrorcode; the newsletter quotes only the top line.
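As a rough illustration of the headline metric, a solve rate can be computed as the fraction of target programs marked solved. This is a sketch under stated assumptions, not Epoch's scoring code: it assumes one binary pass/fail outcome per program, which the newsletter does not confirm, and the program names and the 14-of-25 split are hypothetical placeholders chosen only because 14/25 equals 56%.

```python
from typing import Mapping


def solve_rate(results: Mapping[str, bool]) -> float:
    """Fraction of programs marked solved (assumes binary pass/fail per program)."""
    if not results:
        raise ValueError("results must contain at least one program")
    return sum(results.values()) / len(results)


# Hypothetical placeholder data: 25 programs, the first 14 marked solved.
# These are NOT real MirrorCode results or program names.
example_results = {f"program_{i:02d}": i < 14 for i in range(25)}

if __name__ == "__main__":
    print(f"Solve rate: {solve_rate(example_results):.0%}")
```

If the real metric averages over repeated runs or awards partial credit, the published 56% would not map onto a whole number of programs in this way; the methodology at the primary is the place to check.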

Results

Claude Opus 4.7 — 56% solve rate, leading all other models tested. Epoch’s framing: a sub-60% top score means “significant room for further improvement.”
Other models’ scores and the full leaderboard are not in the newsletter — see the primary for the complete table.
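With only 25 programs in the suite, a single top-line percentage carries wide statistical uncertainty. The sketch below computes a Wilson score interval, a standard interval for a binomial proportion, under the assumption (not confirmed by the newsletter) that 56% corresponds to 14 of 25 independent pass/fail outcomes. It is an illustration of the uncertainty, not an analysis Epoch or METR published.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Assumption: 14 solved out of 25, chosen because 14/25 = 56%.
    low, high = wilson_interval(14, 25)
    print(f"Approximate 95% interval: {low:.2f} to {high:.2f}")
```

Under that assumption the interval spans roughly 0.37 to 0.73, which is one reason to read the full table before drawing conclusions from small gaps between models.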

Also in the June 26 Epoch Brief (context, not part of the benchmark): Epoch projects hyperscaler capex will overtake operating cash flows by end of 2026, and published two Gradient Updates — an analysis of 1,604 Chinese AI-lab job postings and a proposed AI-R&D taxonomy.

Try It

Use 56% as your “unattended ceiling” prior. When someone pitches an agent that completes large software projects with no supervision, anchor on the fact that the leading model solves ~56% of MirrorCode’s tasks even with budgets as large as $2,600 and 19 days for a single task. Treat claims of fully autonomous large builds beyond that as needing evidence.
Right-size the budget question. MirrorCode’s contribution is showing that long-horizon coding capability only appears when you stop capping inference at $1-$10. If you’re evaluating whether to “let it run,” budget and wall-clock (days, not minutes) are the real variables, not just model choice.
Cross-read against first-party SWE numbers. Pair this with the Opus/Fable SWE-bench Verified/Pro figures already tracked in the wiki (see Related) to separate “fix-a-bug” benchmarks from “build-the-whole-thing” ones — they reward very different things.
Pull the full table before citing per-model gaps. The newsletter only gives the leader. For any model-vs-model comparison, read epoch.ai/mirrorcode directly.

Related

Are Mythos’ Cyber Capabilities Overhyped? (Epoch AI): sibling Epoch AI benchmark analysis; the same “control for spend before reading a capability jump” discipline applies to both MirrorCode’s big-budget design and the CVE-spike confound.
Measuring AI Agent Autonomy in Practice (Anthropic) — the first-party counterpart to MirrorCode’s question; both try to quantify how far an agent gets unattended.
The Verification Frontier — a 56% solve rate at multi-day budgets is exactly the regime where verifying autonomous output (rather than trusting it) becomes load-bearing.
AI Industry Research — topic hub; Epoch AI + METR are named independent-benchmark anchors here.

Open Questions

Which other models, and at what scores? The newsletter gives only the Opus-4.7 leader (56%). The full leaderboard — including frontier models newer than Opus 4.7 — is at the primary and was not fetched here.
What exactly counts as “solved”? The pass criterion (functional equivalence? test-suite pass? partial credit?) is not stated in the newsletter; the methodology lives at epoch.ai/mirrorcode.
Will the numbers shift? Solve rates may move as labs tune for long-horizon autonomy and as Epoch/METR add or re-score tasks. This article reflects the launch-day top line only.
How representative is $2,600 / 19 days? It describes one of the largest single tasks — the budget/time distribution across all 25 programs is not in the secondary source.

Jonathon's AI Wiki
