Source: ai-research/epoch-ai-mythos-cyber-capabilities-overhyped-2026-06-11.md
Authors: Timothée Chauvin, Alexander Barry, JSD, Anson Ho (Epoch AI)
URL: https://epochai.substack.com/p/are-mythos-cyber-capabilities-overhyped
Published: 2026-06-11 (Epoch AI — Gradient Updates series)
Epoch AI compiled all the public evidence on the cyber capabilities of Anthropic’s Mythos family — Mythos Preview and the newly released Fable 5 — to answer whether Anthropic’s “leap in cyber skills” claim is real or hype. Their verdict: the jump in exploit development is real and large (Mythos Preview sits ~7 months ahead of the trend line and well past GPT-5.5), but the real-world gain in vulnerability discovery is genuinely unclear because Project Glasswing’s CVE spike is confounded by a surge in spending. Net: not “just hype,” but the leap is concentrated in one of the two cyber sub-skills. This is the independent third-party counter-read to the first-party Anthropic system-card numbers the wiki already tracks. (Note: Gradient Updates is Epoch’s explicitly opinionated/informal series — the data aggregation is rigorous and fully sourced in a methodological appendix; the conclusions are the authors’ own, not an Epoch institutional position.)
Key Takeaways
- Two cyber sub-skills must not be conflated. Vulnerability discovery = finding weaknesses in software (e.g. spotting a buffer overflow). Exploit development = turning a known weakness into unauthorized behavior (e.g. crafting inputs that achieve arbitrary code execution). A real attack needs both. Anthropic claimed a leap in both; the public evidence supports the second far more clearly than the first.
- Exploit development: a real, large jump. Epoch aggregated ~15 cyber benchmarks into a domain-specific Cyber-ECI (a cyber cut of their Epoch Capabilities Index). Mythos Preview (April) lands ~7 months ahead of the linear trend since early 2025 (90% CI 3–13 months) — versus GPT-5.5’s “only” ~2–3 months ahead (90% CI 1–5 months). Mythos 5 is “modestly above” Mythos Preview on cyber (per Anthropic’s own card).
- The “on par with GPT-5.5” skeptics were half-right: they compared against the wrong checkpoint. There are two Mythos Preview versions: an “Early” internal checkpoint (genuinely close to GPT-5.5) and a much stronger “April” version (released to Project Glasswing on April 7). Microsoft CTI-REALM, UK AISI’s first eval, and METR’s time-horizons all used the Early version, so the results looked like parity. UK AISI’s later eval of the April version flipped the picture.
- Saturated benchmarks hid the gap. Many early head-to-heads were near-saturated, so even a real Mythos>GPT-5.5 exploit-dev gap was hard to see. New unsaturated benchmarks (ExploitBench, ExploitGym) now expose it.
- Vulnerability discovery: the gain is unclear on a fixed budget. CVEs from 21 notable orgs spiked +142% over the 2025 baseline in April and +262% in May, coinciding with Mythos Preview’s release. But Project Glasswing came with up to $100M in API credits (plus OpenAI’s Daybreak) — so the spike may reflect a spending surge, not a capability jump.
- Prior AIs were already very good at finding vulnerabilities. AISLE reports even some small open models recognize several of the vulns Anthropic showcased. On curl (one of the most heavily audited codebases, already scanned by multiple AI tools), Mythos found just one low-severity vuln + four false positives — the maintainer: “I see no evidence that this setup finds issues to any particular higher or more advanced degree than the other tools have done before Mythos.”
- Where Mythos genuinely outshines prior models on discovery: better severity assessment / prioritization and a lower false-positive rate (echoed by curl and Cloudflare) — partly because it is so good at building exploits, which verifies a flagged weakness is real. That alone can have practical impact: far less human time to triage AI-found vulns.
- Bottom line: can’t claim a leap “across the board,” but the Mythos family’s cyber capabilities aren’t just hype. If made widely available, they’d push cybersecurity into a regime where vulnerabilities must be patched much faster to avoid a jump in successful attacks.
The two capabilities: discovery vs exploitation
Anthropic’s Project Glasswing announcement claimed Mythos Preview shows “a striking ability to spot vulnerabilities and work out ways to exploit them.” Epoch’s central move is to split that into two measurable skills and check them separately:
- Vulnerability discovery: inspecting a codebase to find the weakness (a buffer overflow exists, and here is where).
- Exploit development — given the weakness, craft the precise inputs that weaponize it (corrupt memory in just the right way to crash the program or run attacker code).
An attacker needs both. Most existing cyber benchmarks measure the second; almost none cleanly measure the first.
Exploit development: a real, large jump (Cyber-ECI)
- Epoch gathered ~15 cyber benchmarks (mostly exploit-construction) and aggregated them into a Cyber-ECI using a modified Epoch Capabilities Index methodology.
- Plotted over time, Mythos Preview sits far above the early-2025 linear trend — ~7 months ahead (90% CI 3–13 months). GPT-5.5 was ~2–3 months ahead (90% CI 1–5 months).
- Most of the lift comes from big jumps on ExploitGym, ExploitBench, AISI’s Cyber Ranges, and SCONE-Bench; Mythos Preview essentially saturates Cybench and CyberGym.
- Corroborated by Anthropic’s own real-world analysis (red.anthropic.com/2026/n-days/): Mythos Preview is much better at developing arbitrary-code-execution (ACE) exploits than prior models — earlier models rarely achieved ACE; Mythos Preview often does, even with minimal vulnerability information.
- On SCONE-Bench (405 historically-exploited Ethereum smart contracts) Mythos Preview reportedly exploited every vulnerability tested (100%). On UK AISI’s “Cooling Tower” cyber range, Mythos Preview (April) fully completed 3/10 attempts while every other model scored 0/10.
Epoch’s read here is unambiguous: on constructing exploits, Mythos Preview was a genuine step change, and Mythos 5 is modestly better still.
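The "months ahead of trend" figure can be read as a simple inversion of a fitted line: fit a linear trend to (release date, index score) pairs, find the date at which the trend would reach the new model's score, and subtract the actual release date. The sketch below is a minimal illustration under that assumption. The data points, the function names `fit_line` and `months_ahead`, and the use of plain least squares are all hypothetical; Epoch's actual Cyber-ECI methodology and confidence intervals are described in their appendix and are not reproduced here.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def months_ahead(release_month: float, score: float,
                 slope: float, intercept: float) -> float:
    """Months between a model's release and the date the trend reaches its score."""
    trend_month = (score - intercept) / slope  # when the trend line hits this score
    return trend_month - release_month


# Hypothetical (month index, index score) pairs for earlier models; not Epoch's data.
months = [0, 3, 6, 9, 12]
scores = [100, 106, 112, 118, 124]
slope, intercept = fit_line(months, scores)

# A hypothetical model released at month 15 with a score the trend reaches at month 22.
lead = months_ahead(release_month=15, score=144, slope=slope, intercept=intercept)
print(f"{lead:.1f} months ahead of trend")
```

A positive result means the model arrived earlier than the trend predicted. The width of the reported interval (3 to 13 months) is a reminder that the answer is sensitive to which earlier models define the trend.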
Vulnerability discovery: the spending confound
- There are no unsaturated benchmarks for finding vulnerabilities in source code, so Epoch falls back to real-world CVE counts from Glasswing participants.
- The CVE data shows a gigantic spike at Mythos Preview’s release: High/Critical CVEs from 21 notable orgs +142% (April) / +262% (May) vs the 2025 baseline — and likely to grow, since disclosure lags discovery.
- The catch: Glasswing involved up to $100M in API credits. A spike in spending on vulnerability-hunting can produce a spike in found vulnerabilities without any underlying capability jump.
- Evidence prior models were already strong finders:
  - AISLE claims even some small open models recognize several Anthropic-showcased vulns, and discovery parallelizes well across many defenders.
  - curl (continuously audited for two decades, already running multiple AI scanners): Mythos surfaced 1 low-severity vuln + 4 false positives; the maintainer saw no evidence it beat prior tools.
- Glasswing partner reports (Mozilla, Palo Alto Networks, Cloudflare, AWS) are positive — “as good as elite security researchers,” “a full year of pentesting in under three weeks,” chaining low-severity bugs into high-severity exploits — but these partners received free credits and are not neutral.
- Mythos’s real discovery edge: fewer false positives + better severity prioritization (a flagged weakness it can also exploit is verified-real). That can matter practically — much less human triage time — even if raw find-rate isn’t a leap.
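The spending confound above is easy to see with arithmetic. The sketch below computes a percent change over baseline (the form in which the +142% and +262% figures are reported) and then a spend-normalized find rate. Every number in it is hypothetical, chosen only so the count spike matches the reported percentages; the actual monthly CVE counts and the month-by-month Glasswing spend are not given in this summary.

```python
def pct_change(count: float, baseline: float) -> float:
    """Percent change of a count relative to a baseline (+142 means 2.42x baseline)."""
    return (count - baseline) * 100.0 / baseline


def finds_per_million_usd(count: float, spend_usd: float) -> float:
    """Vulnerabilities found per $1M spent: a crude fixed-budget proxy."""
    return count / (spend_usd / 1_000_000)


# Hypothetical monthly figures, NOT the real Glasswing data.
baseline = {"cves": 100, "spend_usd": 1_000_000}
may = {"cves": 362, "spend_usd": 4_000_000}

spike = pct_change(may["cves"], baseline["cves"])
rate_before = finds_per_million_usd(baseline["cves"], baseline["spend_usd"])
rate_after = finds_per_million_usd(may["cves"], may["spend_usd"])

print(f"count change: {spike:+.0f}%")
print(f"finds per $1M: {rate_before:.1f} before vs {rate_after:.1f} after")
```

In this made-up scenario the raw count rises 262% while the per-dollar rate actually falls, which is why a count spike alone cannot be attributed to capability without knowing the spend.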
Benchmark data points worth keeping
CyScenarioBench (Irregular; end-to-end cyber tasks, confirmed comparable across labs, pass@1 fully-complete-run rate):
| Model | Score |
|---|---|
| Mythos 5 | 36.7% |
| Mythos Preview | 29.2% |
| GPT-5.5 | 26% |
| Opus 4.8 | 16.6% |
| GPT-5.4 | 9% |
| GPT-5.2 / 5.3 | 0% |
| Meta Muse Spark | 0% |
| Gemini 3 Pro | 0% (v2 third-party bench) |
OSS-Fuzz (Anthropic’s closest-to-discovery benchmark, from the Mythos 5 card — but it also tests exploitation; crash as base case): Mythos 5 triggered a crash 80% of the time vs Mythos Preview 76.7% and Opus 4.8 61.5%.
The ~15-benchmark Cyber-ECI suite (see source for full methodology): UK AISI CTF Suites (4 tiers) + Cyber Ranges, Microsoft CTI-REALM (the only defense benchmark), CVE-Bench, Cybench, CyberGym, CyScenarioBench, ExploitBench, ExploitGym, InterCode-CTF, NL2Bash, OpenAI CTF + Cyber Ranges, Anthropic SCONE-Bench, XBOW-Web.
Why the “on par with GPT-5.5” take was half-right
The widely shared skeptic argument (e.g. pointestimate.substack.com), that “GPT-5.5 matches Mythos on cyber benchmarks and didn’t cause a catastrophe”, was not so much wrong as outdated:
- It compared against Mythos Preview (Early), the weak internal checkpoint, not the April version shipped to Glasswing.
- The benchmarks it used were near-saturated, masking the real April-version gap.
Once unsaturated benchmarks (ExploitBench, ExploitGym) and the April checkpoint are used, the gap reappears clearly in the Cyber-ECI.
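Why saturation hides a real gap can be shown with a toy model. The sketch below assumes a simple logistic item-response curve, where the chance of solving a task depends on model ability minus task difficulty. This is an illustrative assumption, not Epoch's Cyber-ECI procedure, and the ability and difficulty values are invented.

```python
import math
from typing import Sequence


def solve_prob(ability: float, difficulty: float) -> float:
    """Logistic item-response curve: P(solve) rises with ability minus difficulty."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def benchmark_score(ability: float, difficulties: Sequence[float]) -> float:
    """Expected fraction of tasks solved on a benchmark."""
    return sum(solve_prob(ability, d) for d in difficulties) / len(difficulties)


# Invented values: two models with a fixed ability gap, two benchmarks.
weaker, stronger = 3.0, 4.5
easy_tasks = [-2.0, -1.0, 0.0]   # near-saturated benchmark
hard_tasks = [3.0, 4.0, 5.0]     # unsaturated benchmark

gap_easy = benchmark_score(stronger, easy_tasks) - benchmark_score(weaker, easy_tasks)
gap_hard = benchmark_score(stronger, hard_tasks) - benchmark_score(weaker, hard_tasks)

print(f"gap on easy benchmark: {gap_easy:.3f}")
print(f"gap on hard benchmark: {gap_hard:.3f}")
```

The same ability difference produces a gap of about two points on the easy benchmark and about thirty points on the hard one, which mirrors how the near-saturated head-to-heads could look like parity while ExploitBench and ExploitGym separate the models.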
Try It
- Use the discovery-vs-exploitation split as a literacy filter. When any vendor claims an AI “cyber leap,” ask which sub-skill — finding weaknesses or weaponizing them. Epoch shows the two can move very differently, and the scarier framing (autonomous attacks) needs both.
- Discount single-source CVE/vuln-count spikes. A +262% CVE jump that coincides with a $100M credit program is confounded by spend. Ask for find-rate on a fixed budget before reading a count spike as a capability jump — the same caveat applies to Anthropic’s “10,000+ critical vulns” Glasswing figure (see Mythos Preview).
- For WEO / client security work: the practical near-term value of frontier models is lower false-positive rate + better severity triage, not necessarily finding vulns nothing else could. That’s a “save analyst time” pitch, not a “replace the scanner” pitch.
- Track the unsaturated benchmarks. ExploitBench (V8 ACE), ExploitGym (V8/Linux-kernel/userspace), and Irregular’s CyScenarioBench are now the load-bearing cyber evals — saturated ones (Cybench, CyberGym) no longer separate frontier models.
- Cross-read against the first-party card. Pair this with the Mythos 5 article (Firefox-147 working-exploit 88.4% vs Opus 4.8’s 8.8%; ExploitBench ACE 78% vs 40%) to see independent and first-party numbers side by side.
Related
- Claude Mythos Preview — the internal frontier model and the Project Glasswing program whose CVE figures Epoch reinterprets; this is the first-party source of the “10,000+ critical vulns” and curl-scan claims Epoch contextualizes.
- Claude Fable 5 and Claude Mythos 5 — the released model; its card supplies the CyScenarioBench / OSS-Fuzz / Firefox-147 numbers and the “modestly above Mythos Preview on cyber” framing Epoch cites.
- Mapping a Year of AI-Enabled Cyber Threats (MITRE ATT&CK) — the attacker-side mirror: how generally-available models are actually misused today, versus where frontier capability is heading.
- When AI Builds Itself — Recursive Self-Improvement — the same “capability scales faster than we can measure” theme, applied to AI-R&D rather than cyber.
- Stanford HAI AI Index 2026 — sibling independent-benchmark report; its “safety benchmarks lag capability” finding is exactly the dynamic Epoch documents in cyber.
- AI Industry Research — topic hub; Epoch AI is a named source-quality anchor here.
Open Questions
- What is Mythos’s vulnerability-discovery find-rate on a fixed budget? The single most important unknown — until someone controls for spend, the CVE spike can’t be cleanly attributed to capability. A controlled find-rate study would resolve the core debate.
- Will real-world Mythos 5 usage reports match Mythos Preview? Epoch’s evidence is mostly Mythos Preview; they project the conclusions forward to Mythos 5 but flag they’re waiting on usage data.
- Does the lower false-positive rate hold outside hardened codebases? curl and Cloudflare report few false positives, but both are unusually well-audited environments. It remains open whether the FP advantage generalizes to messy enterprise code.
- How will an unsaturated discovery benchmark change the picture? Today there’s no clean one. If a good source-code-vuln-finding benchmark ships, it could confirm or refute the “prior models were already very good at discovery” claim.