Are Mythos' Cyber Capabilities Overhyped? (Epoch AI Cyber-ECI Analysis)

Source: ai-research/epoch-ai-mythos-cyber-capabilities-overhyped-2026-06-11.md, raw/reddit-1u7hmvw.md, raw/newsletter-epoch-ai-1add73aa6f.md, raw/newsletter-epoch-ai-436d457208.md (the recovered Epoch newsletter delivery of this same Gradient Update; verified 2026-07-24 to contain no detail beyond the 2026-06-11 web capture already ingested here; added for provenance completeness)
Authors: Timothée Chauvin, Alexander Barry, JSD, Anson Ho (Epoch AI)
URL: https://epochai.substack.com/p/are-mythos-cyber-capabilities-overhyped
Published: 2026-06-11 (Epoch AI, Gradient Updates series)

Epoch AI compiled all the public evidence on the cyber capabilities of Anthropic’s Mythos family — Mythos Preview and the newly released Fable 5 — to answer whether Anthropic’s “leap in cyber skills” claim is real or hype. Their verdict: the jump in exploit development is real and large (Mythos Preview sits ~7 months ahead of the trend line and well past GPT-5.5), but the real-world gain in vulnerability discovery is genuinely unclear because Project Glasswing’s CVE spike is confounded by a surge in spending. Net: not “just hype,” but the leap is concentrated in one of the two cyber sub-skills. This is the independent third-party counter-read to the first-party Anthropic system-card numbers the wiki already tracks. (Note: Gradient Updates is Epoch’s explicitly opinionated/informal series — the data aggregation is rigorous and fully sourced in a methodological appendix; the conclusions are the authors’ own, not an Epoch institutional position.)

Key Takeaways

Two cyber sub-skills must not be conflated. Vulnerability discovery = finding weaknesses in software (e.g. spotting a buffer overflow). Exploit development = turning a known weakness into unauthorized behavior (e.g. crafting inputs that achieve arbitrary code execution). A real attack needs both. Anthropic claimed a leap in both; the public evidence supports the second far more clearly than the first.
Exploit development: a real, large jump. Epoch aggregated ~15 cyber benchmarks into a domain-specific Cyber-ECI (a cyber cut of their Epoch Capabilities Index). Mythos Preview (April) lands ~7 months ahead of the linear trend since early 2025 (90% CI 3–13 months) — versus GPT-5.5’s “only” ~2–3 months ahead (90% CI 1–5 months). Mythos 5 is “modestly above” Mythos Preview on cyber (per Anthropic’s own card).
The “on par with GPT-5.5” skeptics were half-right — they benchmarked the wrong checkpoint. There are two Mythos Preview versions: an “Early” internal checkpoint (genuinely close to GPT-5.5) and a much stronger “April” version (released to Project Glasswing on April 7). Microsoft CTI-REALM, UK AISI’s first eval, and METR’s time-horizons all used the Early version → looked like parity. UK AISI’s later eval of the April version flipped the picture.
Saturated benchmarks hid the gap. Many early head-to-heads were near-saturated, so even a real Mythos>GPT-5.5 exploit-dev gap was hard to see. New unsaturated benchmarks (ExploitBench, ExploitGym) now expose it.
Vulnerability discovery: the gain is unclear on a fixed budget. CVEs from 21 notable orgs spiked +142% over the 2025 baseline in April and +262% in May, coinciding with Mythos Preview’s release. But Project Glasswing came with up to $100M in API credits (plus OpenAI’s Daybreak) — so the spike may reflect a spending surge, not a capability jump.
Prior AIs were already very good at finding vulnerabilities. AISLE reports even some small open models recognize several of the vulns Anthropic showcased. On curl (one of the most heavily audited codebases, already scanned by multiple AI tools), Mythos found just one low-severity vuln + four false positives — the maintainer: “I see no evidence that this setup finds issues to any particular higher or more advanced degree than the other tools have done before Mythos.”
Where Mythos genuinely outshines prior models on discovery: better severity assessment / prioritization and a lower false-positive rate (echoed by curl and Cloudflare) — partly because it is so good at building exploits, which verifies a flagged weakness is real. That alone can have practical impact: far less human time to triage AI-found vulns.
Bottom line: can’t claim a leap “across the board,” but the Mythos family’s cyber capabilities aren’t just hype. If made widely available, they’d push cybersecurity into a regime where vulnerabilities must be patched much faster to avoid a jump in successful attacks.
Real-world data point (June 2026). The US federal shutdown of Fable 5 / Mythos 5 was reportedly triggered by a benign “fix this code” prompt surfacing only previously-known minor vulnerabilities — characterized by a hired reviewer (Katie Moussouris) as “not a jailbreak per se” (The Register / WIRED) — which bears directly on the cyber-capabilities-overhyped thesis.

The two capabilities: discovery vs exploitation

Anthropic’s Project Glasswing announcement claimed Mythos Preview shows “a striking ability to spot vulnerabilities and work out ways to exploit them.” Epoch’s central move is to split that into two measurable skills and check them separately:

Vulnerability discovery — inspecting a codebase to find the weakness (the buffer overflow exists, here).
Exploit development — given the weakness, craft the precise inputs that weaponize it (corrupt memory in just the right way to crash the program or run attacker code).

An attacker needs both. Most existing cyber benchmarks measure the second; almost none cleanly measure the first.

Exploit development: a real, large jump (Cyber-ECI)

Epoch gathered ~15 cyber benchmarks (mostly exploit-construction) and aggregated them into a Cyber-ECI using a modified Epoch Capabilities Index methodology.
Plotted over time, Mythos Preview sits far above the early-2025 linear trend — ~7 months ahead (90% CI 3–13 months). GPT-5.5 was ~2–3 months ahead (90% CI 1–5 months).
Most of the lift comes from big jumps on ExploitGym, ExploitBench, AISI’s Cyber Ranges, and SCONE-Bench; Mythos Preview essentially saturates Cybench and CyberGym.
Corroborated by Anthropic’s own real-world analysis (red.anthropic.com/2026/n-days/): Mythos Preview is much better at developing arbitrary-code-execution (ACE) exploits than prior models — earlier models rarely achieved ACE; Mythos Preview often does, even with minimal vulnerability information.
On SCONE-Bench (405 historically-exploited Ethereum smart contracts) Mythos Preview reportedly exploited every vulnerability tested (100%). On UK AISI’s “Cooling Tower” cyber range, Mythos Preview (April) fully completed 3/10 attempts while every other model scored 0/10.

Epoch’s read here is unambiguous: on constructing exploits, Mythos Preview was a genuine step change, and Mythos 5 is modestly better still.
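The "months ahead of trend" figure can be read as a horizontal distance: fit a line to score versus release date, find the date at which that line would reach the new model's score, and subtract the actual release date. The sketch below shows only that arithmetic, assuming an ordinary least-squares fit. The dates and scores are hypothetical placeholders, not Epoch's Cyber-ECI values, and Epoch's actual method (including how the 90% CIs are derived) is in their appendix and is not reproduced here.

```python
# Hypothetical illustration only: these scores are NOT Epoch's Cyber-ECI data.

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def months_ahead(release_month, score, slope, intercept):
    """Months between the actual release and the date the trend reaches `score`."""
    trend_month = (score - intercept) / slope
    return trend_month - release_month


# Release dates (months since January 2025) and index scores for earlier,
# hypothetical models that define the trend.
months = [0, 3, 6, 9, 12]
scores = [100.0, 103.0, 106.0, 109.0, 112.0]
slope, intercept = fit_line(months, scores)

# A hypothetical new model released at month 15 with an index score of 122.
lead = months_ahead(15, 122.0, slope, intercept)
print(f"{lead:.1f} months ahead of trend")
```

A positive result means the model arrived earlier than the trend would predict for its score; a negative one means it lags the trend.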

Vulnerability discovery: the spending confound

There are no unsaturated benchmarks for finding vulnerabilities in source code, so Epoch falls back to real-world CVE counts from Glasswing participants.
The CVE data shows a gigantic spike at Mythos Preview’s release: High/Critical CVEs from 21 notable orgs +142% (April) / +262% (May) vs the 2025 baseline — and likely to grow, since disclosure lags discovery.
The catch: Glasswing involved up to $100M in API credits. A spike in spending on vulnerability-hunting can produce a spike in found vulnerabilities without any underlying capability jump.
June 2026 update (extends the series). A follow-on Epoch Data Insight (Luke Emberson, epoch.ai/data-insights/cve-severity-spike) reports the same 21 notable orgs disclosed ~1,500 high- and critical-severity CVEs in June, more than 3.5× the monthly record before Mythos’ release (“AI appears to be finding software vulnerabilities at scale”), corroborating the “likely to grow” note above as disclosures compound. The $100M Glasswing credit-spend confound still applies: more spend on vulnerability-hunting yields more found vulns without proving a capability jump, so this remains a count spike, not a fixed-budget find-rate result.
Evidence prior models were already strong finders:
- AISLE claims even some small open models recognize several Anthropic-showcased vulns — and discovery parallelizes well across many defenders.
- curl (continuously audited for two decades, already running multiple AI scanners): Mythos surfaced 1 low-severity vuln + 4 false positives; the maintainer saw no evidence it beat prior tools.
Glasswing partner reports (Mozilla, Palo Alto Networks, Cloudflare, AWS) are positive — “as good as elite security researchers,” “a full year of pentesting in under three weeks,” chaining low-severity bugs into high-severity exploits — but these partners received free credits and are not neutral.
Mythos’s real discovery edge: fewer false positives + better severity prioritization (a flagged weakness it can also exploit is verified-real). That can matter practically — much less human triage time — even if raw find-rate isn’t a leap.
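The spending confound is easiest to see as arithmetic: a raw count and a per-dollar find-rate can move in opposite directions. In the sketch below, the +262% increase is the May figure from the text. The baseline count and both spend levels are hypothetical, because no per-month spend data is reported here.

```python
# Count spike vs. fixed-budget find-rate. Only the +262% figure is from the
# article; every other number is a hypothetical placeholder.

def multiplier(pct_increase):
    """Convert '+X% over baseline' into a multiple of the baseline."""
    return 1 + pct_increase / 100


def finds_per_million(cve_count, spend_usd):
    """CVEs found per $1M spent on vulnerability hunting."""
    return cve_count / (spend_usd / 1_000_000)


BASELINE_CVES = 100          # hypothetical monthly baseline count
BASELINE_SPEND = 1_000_000   # hypothetical monthly spend before the program
SURGE_SPEND = 5_000_000      # hypothetical monthly spend during the program

may_cves = BASELINE_CVES * multiplier(262)   # +262% is about 3.62x the baseline

rate_before = finds_per_million(BASELINE_CVES, BASELINE_SPEND)
rate_after = finds_per_million(may_cves, SURGE_SPEND)

print(f"count:     {BASELINE_CVES} -> {may_cves:.0f}")
print(f"per $1M:   {rate_before:.1f} -> {rate_after:.1f}")
```

With these made-up inputs the count more than triples while the per-dollar rate falls. Different spend assumptions would give a different answer, which is exactly why the count alone cannot settle the capability question.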

Benchmark data points worth keeping

CyScenarioBench (Irregular; end-to-end cyber tasks, confirmed comparable across labs, pass@1 fully-complete-run rate):

Model	Score
Mythos 5	36.7%
Mythos Preview	29.2%
GPT-5.5	26%
Opus 4.8	16.6%
GPT-5.4	9%
GPT-5.2 / 5.3	0%
Meta Muse Spark	0%
Gemini 3 Pro	0% (v2 third-party bench)

OSS-Fuzz (Anthropic’s closest-to-discovery benchmark, from the Mythos 5 card, though it also tests exploitation, with a triggered crash as the base case): Mythos 5 triggered a crash 80% of the time vs Mythos Preview 76.7% and Opus 4.8 61.5%.
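When comparing these scores it helps to keep percentage points and relative gains separate, since the same gap looks very different in each. A small sketch using the CyScenarioBench and OSS-Fuzz figures quoted above; the helper names are illustrative only.

```python
# Scores copied from the figures above (percent of runs).
cyscenariobench = {
    "Mythos 5": 36.7,
    "Mythos Preview": 29.2,
    "GPT-5.5": 26.0,
    "Opus 4.8": 16.6,
    "GPT-5.4": 9.0,
}
oss_fuzz_crash = {
    "Mythos 5": 80.0,
    "Mythos Preview": 76.7,
    "Opus 4.8": 61.5,
}


def point_gap(table, a, b):
    """Absolute gap between models a and b, in percentage points."""
    return table[a] - table[b]


def relative_gain(table, a, b):
    """Gap between a and b as a percentage of b's score."""
    return (table[a] - table[b]) / table[b] * 100


for name, table in [("CyScenarioBench", cyscenariobench),
                    ("OSS-Fuzz crash", oss_fuzz_crash)]:
    pts = point_gap(table, "Mythos 5", "Mythos Preview")
    rel = relative_gain(table, "Mythos 5", "Mythos Preview")
    print(f"{name}: Mythos 5 vs Mythos Preview = {pts:+.1f} pts ({rel:+.1f}% relative)")
```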

The ~15-benchmark Cyber-ECI suite (see source for full methodology): UK AISI CTF Suites (4 tiers) + Cyber Ranges, Microsoft CTI-REALM (the only defense benchmark), CVE-Bench, Cybench, CyberGym, CyScenarioBench, ExploitBench, ExploitGym, InterCode-CTF, NL2Bash, OpenAI CTF + Cyber Ranges, Anthropic SCONE-Bench, XBOW-Web.

Why the “on par with GPT-5.5” take was half-right

The widely-shared skeptic argument (e.g. pointestimate.substack.com), “GPT-5.5 matches Mythos on cyber benchmarks and didn’t cause a catastrophe,” wasn’t wrong so much as outdated:

It compared against Mythos Preview (Early), the weaker internal checkpoint, not the April version shipped to Glasswing.
The benchmarks it used were near-saturated, masking the real April-version gap.

Once unsaturated benchmarks (ExploitBench, ExploitGym) and the April checkpoint are used, the gap reappears clearly in the Cyber-ECI.
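Why saturation hides a real gap can be shown with a toy model. The sketch below assumes a logistic pass-rate curve in which the score depends on the difference between a model's ability and a benchmark's difficulty. That functional form and all four numbers are assumptions chosen for illustration, not Epoch's fitted values.

```python
import math

# Toy model (an assumption): pass rate is a logistic function of
# ability minus difficulty.

def pass_rate(ability, difficulty):
    return 1 / (1 + math.exp(-(ability - difficulty)))


weaker, stronger = 3.0, 5.0   # hypothetical model abilities
easy, hard = 0.0, 4.0         # hypothetical benchmark difficulties

# On the easy (near-saturated) benchmark both models score close to 100%.
gap_easy = pass_rate(stronger, easy) - pass_rate(weaker, easy)

# On the hard (unsaturated) benchmark the same ability gap is clearly visible.
gap_hard = pass_rate(stronger, hard) - pass_rate(weaker, hard)

print(f"easy benchmark gap: {gap_easy * 100:.1f} points")
print(f"hard benchmark gap: {gap_hard * 100:.1f} points")
```

The ability difference is identical in both cases. Only the benchmark's difficulty relative to the models changes how much of it shows up in the score.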

Try It

Use the discovery-vs-exploitation split as a literacy filter. When any vendor claims an AI “cyber leap,” ask which sub-skill — finding weaknesses or weaponizing them. Epoch shows the two can move very differently, and the scarier framing (autonomous attacks) needs both.
Discount single-source CVE/vuln-count spikes. A +262% CVE jump that coincides with a $100M credit program is confounded by spend. Ask for find-rate on a fixed budget before reading a count spike as a capability jump — the same caveat applies to Anthropic’s “10,000+ critical vulns” Glasswing figure (see Mythos Preview).
For WEO / client security work: the practical near-term value of frontier models is lower false-positive rate + better severity triage, not necessarily finding vulns nothing else could. That’s a “save analyst time” pitch, not a “replace the scanner” pitch.
Track the unsaturated benchmarks. ExploitBench (V8 ACE), ExploitGym (V8/Linux-kernel/userspace), and Irregular’s CyScenarioBench are now the load-bearing cyber evals — saturated ones (Cybench, CyberGym) no longer separate frontier models.
Cross-read against the first-party card. Pair this with the Mythos 5 article (Firefox-147 working-exploit 88.4% vs Opus 4.8’s 8.8%; ExploitBench ACE 78% vs 40%) to see independent and first-party numbers side by side.
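The "save analyst time" pitch above can be made concrete with a back-of-envelope model: if a fraction `precision` of a tool's reports are real, confirming N real vulnerabilities means reviewing about N / precision reports. Every number in this sketch is hypothetical, and it assumes each report costs the same review time whether or not it is real.

```python
import math

# Back-of-envelope triage cost. All inputs are hypothetical placeholders.

def reports_to_triage(true_vulns_needed, precision):
    """Expected reports an analyst must review to confirm N real vulns."""
    if not 0 < precision <= 1:
        raise ValueError("precision must be in (0, 1]")
    return math.ceil(true_vulns_needed / precision)


def analyst_hours(reports, minutes_per_report):
    """Total review time, assuming a flat cost per report."""
    return reports * minutes_per_report / 60


MINUTES_PER_REPORT = 30   # hypothetical review time per report
TARGET_REAL_VULNS = 20    # hypothetical number of real vulns to confirm

noisy_reports = reports_to_triage(TARGET_REAL_VULNS, precision=0.25)
clean_reports = reports_to_triage(TARGET_REAL_VULNS, precision=0.5)

noisy_hours = analyst_hours(noisy_reports, MINUTES_PER_REPORT)
clean_hours = analyst_hours(clean_reports, MINUTES_PER_REPORT)

print(f"precision 0.25: {noisy_reports} reports, {noisy_hours:.0f} analyst hours")
print(f"precision 0.50: {clean_reports} reports, {clean_hours:.0f} analyst hours")
```

Under these assumptions, doubling precision halves the triage load for the same number of confirmed vulnerabilities, without the tool finding anything new.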

Related

Claude Mythos Preview — the internal frontier model and the Project Glasswing program whose CVE figures Epoch reinterprets; this is the first-party source of the “10,000+ critical vulns” and curl-scan claims Epoch contextualizes.
Claude Fable 5 and Claude Mythos 5 — the released model; its card supplies the CyScenarioBench / OSS-Fuzz / Firefox-147 numbers and the “modestly above Mythos Preview on cyber” framing Epoch cites.
Mythos 5 Federal Shutdown (June 2026) — the regulatory episode where this cyber-capability question became a national-security flashpoint.
Mozilla’s Firefox Security Harness (Claude Mythos + Agent SDK) — a first-party operator account that bears on the open question below: Mozilla’s ~500 fixes/month came from a fuzzer-verified agentic harness, and the engineer estimates the unlock as “50/50 model vs harness” (bugs surfaced even with non-frontier models). In other words, the spike reflects harness + spend + model, which sharpens Epoch’s spending-confound read rather than supporting a clean Mythos-capability leap.
Mapping a Year of AI-Enabled Cyber Threats (MITRE ATT&CK) — the attacker-side mirror: how generally-available models are actually misused today, versus where frontier capability is heading.
When AI Builds Itself — Recursive Self-Improvement — the same “capability scales faster than we can measure” theme, applied to AI-R&D rather than cyber.
Stanford HAI AI Index 2026 — sibling independent-benchmark report; its “safety benchmarks lag capability” finding is exactly the dynamic Epoch documents in cyber.
Hugging Face Sandbox-Escape Incident (July 2026) — the exploit-development capability documented here escaping a test harness in the wild. Epoch’s follow-up Gradient Updates piece argues the incident was predictable from exactly the benchmark set aggregated in this article’s Cyber-ECI.
The Future of AI Benchmarks (Epoch AI) — the Cyber-ECI’s “aggregate many benchmarks, watch for saturation” method is the same discipline this Epoch sibling applies across domains.
The Compute Economics of the AI Buildout (Epoch AI) — the $100M-spend confound at the heart of this article’s vulnerability-discovery caveat is a compute-economics point; sibling from the same recovered Epoch batch.
AI Industry Research — topic hub; Epoch AI is a named source-quality anchor here.

Open Questions

What is Mythos’s vulnerability-discovery find-rate on a fixed budget? The single most important unknown — until someone controls for spend, the CVE spike can’t be cleanly attributed to capability. A controlled find-rate study would resolve the core debate.
Will real-world Mythos 5 usage reports match Mythos Preview? Epoch’s evidence is mostly Mythos Preview; they project the conclusions forward to Mythos 5 but flag they’re waiting on usage data. Partial first-party signal (2026-06-22): Mozilla’s Firefox security harness (see Related) attributes its bug-finding spike ~50/50 to harness vs model — and found bugs even with non-frontier models — complicating any clean Mythos-capability read of real-world deployment results.
Does the lower false-positive rate hold outside hardened codebases? curl and Cloudflare report few false positives, but both are unusually well-audited environments. Open question whether the FP advantage generalizes to messy enterprise code.
How will an unsaturated discovery benchmark change the picture? Today there’s no clean one. If a good source-code-vuln-finding benchmark ships, it could confirm or refute the “prior models were already very good at discovery” claim.
