Source: raw/I_Battle_Tested_Sakana_Fugu_s_Fable_Killer.md (primary, independent hands-on test, YouTube), raw/GPT-5.6_is_Lurking_ChatGPT_Voice_Overhaul_Sonnet_5_Rumor_Fugu_Surprise.md (MattVidPro AI), raw/AI_News_-_The_New_Model_That_s_As_Good_As_Fable.md (Matt Wolfe roundup)

Sakana Fugu (Sakana AI, Japan) is shipped as a single model API that is actually a multi-agent orchestration system underneath: one API call hits a small “manager”/“conductor” model that decomposes the task, delegates sub-tasks to a pool of frontier models (Opus 4.8, GPT-5.5, Gemini, “probably more”), and then has a combiner LLM synthesize the result — model selection, delegation, verification, and synthesis happen automatically and “never reach your code.” Both independent creators who covered it stress the same caveat: it is not its own frontier model, it’s a manager (“a really smart API,” “the same way Claude Code spins up subagents, except across vendors”). It comes in two tiers — base Fugu and Fugu Ultra (marketed as matching Fable 5 / Mythos). The honest verdict from hands-on testing: it works inside Claude Code and produces good output, but in one 38-task head-to-head it was roughly tied with Opus 4.8 on quality while being ~4.5× slower and ~5× more expensive.

Key Takeaways

  • A single-agent API over a multi-agent core. You call one endpoint; behind it a manager model splits the task (“who does each part?”) and routes pieces to specialist frontier models, then a separate combiner LLM merges the responses into one answer. The multi-agent complexity is hidden — it behaves like one model to your code.
  • Explicitly not a smarter model — it’s routing. Both creators de-hype it hard: “really nothing new,” “smart API wrapper,” “not a smarter model, it’s just a manager.” The value (if any) is mixture-of-experts across vendors — using GPT’s coding strengths and Claude’s writing strengths in one call — not a new capability frontier. Named delegate pool: Opus 4.8, GPT-5.5, Gemini, “probably more,” possibly instances of itself ^[inferred — primary creator says “maybe some other frontier models,” exact roster not disclosed by Sakana].
  • Two tiers; profile is “workhorse.” Base Fugu (~GPT-5.5 class on Sakana’s charts) and Fugu Ultra (Fable-5 class on Sakana’s charts). Built for coding / agentic work, not conversation.
  • Runs inside Claude Code, and notably does NOT fill the context window. The primary creator ran “Fugu Ultra 1 million” inside Claude Code — skills, research, and /goal all worked, and the Claude Code context stayed near zero across 20-30 rounds because the work was routed to Sakana’s server. Unlike GLM-5.2 (swap endpoint + key), it is not a simple endpoint swap.
  • Slow and expensive for roughly tied quality (creator-tested). In a 38-task head-to-head vs Opus 4.8: 36 of 38 tied, Opus won 2, Fugu won 0; Fugu was ~4.5× slower (357 min vs 80 min) and ~5× more expensive (Opus 4.8’s run cost about $10). The primary creator’s conclusion: he won’t use it for knowledge work.
  • Where it might be worth it (both creators agree): large teams on a shared codebase who want a “GPT reviewer + Claude planner” orchestrated automatically inside one API instead of wiring it by hand.
  • Contrast with OpenRouter Fusion: Fusion runs ~3 models in parallel and a judge merges; Fugu decomposes the task and delegates pieces to specialists. Different orchestration shape, similar “one API, many models” promise.

What it is

Sakana AI markets Fugu as a model, but architecturally it is an automatic multi-model orchestrator exposed behind one API call. The primary creator describes the loop precisely: “We ask the question to the conductor, and the conductor outsources to specialists” — Claude for writing, GPT for coding/bug-fixes, Gemini for research/facts, plus other models chosen by task complexity. Once the specialists answer, “another LLM combines everything together and presents that answer back to us.” The explicit analogy is Claude Code spinning up Haiku/Sonnet/Opus subagents or dynamic workflows — except the orchestration spans different vendors instead of one model family, which is where the mixture-of-experts upside theoretically comes from.
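
The conductor / specialist / combiner loop described above can be sketched in a few lines. This is only an illustration of the pattern as the primary creator describes it, not Sakana's implementation: the names `orchestrate`, `SubTask`, and the stub functions are hypothetical, and the "models" are plain local functions so the sketch runs without any API access.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A "model" here is just a function from prompt text to answer text.
Model = Callable[[str], str]


@dataclass
class SubTask:
    kind: str    # e.g. "writing", "coding", "research"
    prompt: str


def orchestrate(question: str,
                conductor: Callable[[str], List[SubTask]],
                specialists: Dict[str, Model],
                combiner: Callable[[str, List[str]], str]) -> str:
    """Decompose the question, delegate each piece, then merge the answers."""
    subtasks = conductor(question)
    partials = [specialists[t.kind](t.prompt) for t in subtasks]
    return combiner(question, partials)


# Hypothetical stand-ins; a real system would call vendor APIs here.
def stub_conductor(question: str) -> List[SubTask]:
    return [SubTask("coding", f"Write code for: {question}"),
            SubTask("writing", f"Explain the solution to: {question}")]


stub_specialists: Dict[str, Model] = {
    "coding": lambda p: f"[coding specialist] {p}",
    "writing": lambda p: f"[writing specialist] {p}",
}


def stub_combiner(question: str, partials: List[str]) -> str:
    # Simplest possible merge: concatenate the specialist answers in order.
    return "\n".join(partials)


answer = orchestrate("reverse a linked list", stub_conductor,
                     stub_specialists, stub_combiner)
print(answer)
```

From the caller's side only `orchestrate` is visible, which mirrors the claim that the multi-agent machinery behaves like one model to your code.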

The announcement tweet “went viral” (~15M views, creator-claimed). Both independent creators frame the pattern as familiar rather than novel: anyone already running Codex + Claude Code on the same codebase is doing this manually; Fugu just automates the delegation.

Access & pricing

  • Sign-up: sakana.ai. Offered both as a subscription (ChatGPT-style tiers, including a $200/mo plan; creator-reported) and as a pay-as-you-go API.
  • Inside Claude Code: setup is a markdown file handed to Claude Code plus an API key (the primary creator gated the exact file behind a free Skool community; not reproduced in the transcript). No codex-fugu-style command appears. It is not a drop-in endpoint swap — work routes to Sakana’s server in a way that keeps the local Claude Code context window near-empty.
  • API pricing (Sakana-listed unless noted):
    • Input ≤272K context: 10 / 1M
    • Output ≤272K context: 45 / 1M
  • Real spend datapoints (creator-tested): the primary creator’s 38-task run cost about $10 on Opus 4.8, with Fugu Ultra at ~5× that; Mark Santos’s Crossy-Road clone ≈ […]; the $200/mo subscription plan filled a 5-hour window and hit ~34% of the weekly limit in his testing. MattVidPro’s read: “competitive with Fable 5, more than GPT-5.5, not blow-your-mind cheap.”
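
For pay-as-you-go budgeting, per-call spend follows directly from per-1M-token rates. The helper below is a minimal sketch: `estimate_cost` is a hypothetical name, the token counts are made up, and the rates (10 input, 45 output per 1M tokens) are taken from the list above as written, assuming they are USD and apply to your context tier.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_rate_per_m: float, output_rate_per_m: float) -> float:
    """Return the spend for one call, given rates quoted per 1M tokens."""
    return ((input_tokens / 1_000_000) * input_rate_per_m
            + (output_tokens / 1_000_000) * output_rate_per_m)


# Hypothetical workload: 200K input tokens and 50K output tokens.
cost = estimate_cost(200_000, 50_000,
                     input_rate_per_m=10.0,    # assumed: listed input rate
                     output_rate_per_m=45.0)   # assumed: listed output rate
print(f"{cost:.2f}")  # prints 4.25
```

Because an orchestrator may fan one request out to several models, the billed token counts can differ from what a single-model call would use; check the provider's usage report rather than estimating from your prompt size alone.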

Benchmarks (Sakana-claimed)

All figures below are read off Sakana’s announcement charts and are NOT independently reproduced — treat as direction, not measurement.

  • Fugu Ultra beats Fable 5 on LiveCodeBench.
  • ~Tie with Fable 5 on GPQA Diamond.
  • SWE-bench Pro: Fable 5 wins; Fugu Ultra close behind and beats Opus 4.8; base Fugu ≈ GPT-5.5.
  • SciCode: Fugu Ultra beats GPT-5.5.
  • Humanity’s Last Exam: Fugu Ultra beats GPT-5.5, just under Fable 5.
  • Terminal-Bench + “Realm”: Sakana says “looking great.”
  • Profile: a “workhorse” tuned for coding / agentic tasks rather than conversational use.

Hands-on findings

Primary creator — 38-task head-to-head, Fugu Ultra vs Opus 4.8 (tasks generated by Codex and graded pass/fail by Codex):

  • 36 of 38 tied; Opus 4.8 won 2; Fugu won 0.
  • ~4.5× slower: 357 minutes total for Fugu vs 80 minutes for Opus across the suite.
  • ~5× more expensive: Opus 4.8’s run cost about $10, Fugu Ultra’s roughly five times that.
  • Built an impressive one-shot live YouTube dashboard via /goal (~1 hour) — it worked, but was “really, really slow.”
  • Quirk worth noting: it did not consume the Claude Code context window over 20-30 rounds (work routed to Sakana’s server).
  • Conclusion: no quality lift for his knowledge-work tasks; he won’t use it.
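
The headline ratios can be checked from the raw figures in the list above. The snippet only restates that arithmetic (357 vs 80 minutes, 36 ties out of 38 tasks); the variable names are illustrative.

```python
fugu_minutes, opus_minutes = 357, 80
tasks, ties, opus_wins, fugu_wins = 38, 36, 2, 0

# The reported outcomes should account for every task.
assert ties + opus_wins + fugu_wins == tasks

slowdown = fugu_minutes / opus_minutes   # 4.4625, reported as "~4.5x"
tie_rate = ties / tasks                  # about 0.947

print(f"slowdown: {slowdown:.2f}x")      # slowdown: 4.46x
print(f"tie rate: {tie_rate:.1%}")       # tie rate: 94.7%
print(f"avg min/task: Fugu {fugu_minutes / tasks:.1f}, "
      f"Opus {opus_minutes / tasks:.1f}")  # Fugu 9.4, Opus 2.1
```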

Third party (Mark Santos, Crossy-Road clone): Fugu Ultra finished in 22 min / ~40. Fugu’s output looked better (colors, lighting, detail) but was missing sound effects. Note that this comparison was vs Opus 4.8, not Fable 5.

Limitations

  • Not a new or smarter model. The most consistent point across all three sources: this is orchestration/routing, not a capability frontier. “Really nothing new”; “smart API wrapper.”
  • Slow and expensive relative to a single strong model for roughly tied quality (creator-tested, see above).
  • No EU availability at time of testing.
  • Testing was partial. The primary creator tested only Fugu Ultra (not base Fugu), on AI-created and AI-graded assessments, with no heavy refactors — he flags his own results “with a grain of salt.” MattVidPro did not test it himself and advised “stay away from subscribing for now.”
  • Mandatory data-retention / privacy posture is undocumented in these sources — relevant if routing client-sensitive work through a third-party orchestrator that fans out to multiple vendors ^[inferred — not stated in sources; flagged as an unknown, not a claim].

Related

  • running-the-agentic-loop — Fugu is the “hosted, vendor-agnostic” corner of the orchestration spectrum: one API call that runs a manager-delegate-combiner loop server-side instead of in your process.
  • council-multi-model-deliberation — the explicit-and-local cousin: Council fans a prompt to multiple vendors and merges with preserved dissent; Fugu hides the same multi-model machinery behind a single opaque API.
  • vendor-direct-tool-calls — Fugu is the opposite trade-off: instead of calling vendors directly, you delegate vendor selection to Sakana’s router.
  • glm-5-series-zai — the contrast case for access: GLM-5.2 is a true endpoint swap (change URL + key); Fugu is not.
  • claude-fable-5-mythos-5 — the model Fugu Ultra’s marketing benchmarks itself against (and the “Fable killer” framing in the primary video’s title).

Try It

  • Only reach for it on a shared-codebase team. If you’re a solo operator, the creators’ verdict is to skip it — a single strong model is faster and cheaper at tied quality. The plausible win is a team that wants automatic “GPT reviewer + Claude planner” routing without hand-wiring it.
  • If you do trial it, run your own head-to-head before committing: pick 5-10 representative tasks, run them through Fugu Ultra and your current default (Opus 4.8 / Fable 5), and compare cost, wall-clock time, and a blind quality grade. The public benchmarks are all Sakana-claimed and unreproduced.
  • Watch the meter. Output tokens are the expensive side of the bill (see Access & pricing), so keep a running cost tally, exactly as you would for any multi-call orchestration.
  • Don’t route EU or client-sensitive work through it until availability and data-handling are confirmed (no EU access at test time; retention undocumented in these sources).
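
A trial along the lines suggested above could be organized with a small harness like the following. It is a sketch under stated assumptions: `head_to_head` and the runner names are hypothetical, the runners are offline stubs standing in for real API calls, and cost is not measured here because it has to come from each provider's billing or usage data.

```python
import random
import time
from typing import Callable, Dict, List

Runner = Callable[[str], str]  # task prompt -> model output


def head_to_head(tasks: List[str], runners: Dict[str, Runner], seed: int = 0):
    """Run every task through every runner, timing each call.

    Returns (timings, blind_sheet, key):
      timings     maps runner name to total wall-clock seconds,
      blind_sheet lists outputs under anonymous labels for grading,
      key         maps (task_index, label) back to the runner name.
    """
    rng = random.Random(seed)
    timings = {name: 0.0 for name in runners}
    blind_sheet, key = [], {}
    for i, task in enumerate(tasks):
        names = list(runners)
        rng.shuffle(names)  # hide which system produced which output
        for label, name in zip("ABCDEFGH", names):  # supports up to 8 runners
            start = time.perf_counter()
            output = runners[name](task)
            timings[name] += time.perf_counter() - start
            blind_sheet.append({"task": i, "label": label, "output": output})
            key[(i, label)] = name
    return timings, blind_sheet, key


# Stub runners so the sketch executes offline; swap in real API calls.
runners = {
    "fugu_ultra": lambda t: f"answer to {t}",
    "current_default": lambda t: f"reply for {t}",
}
timings, sheet, key = head_to_head(["task 1", "task 2"], runners)
print(len(sheet), sorted(timings))
```

The grader sees only `sheet`; `key` is consulted after grades are recorded, which keeps the quality comparison blind.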

Open Questions

  • Exact benchmark deltas are unverified. Every Fugu vs Fable-5 / GPT-5.5 / Opus-4.8 number here is read off Sakana’s own announcement charts; none has been independently reproduced. Confidence the product exists and behaves as described is high; confidence in the specific deltas is medium.
  • Primary creator identity is unconfirmed. Signals point to Nate Herk / an agent-loops creator (video: youtube.com/watch?v=GpSqBjW6hR4), but this has not been verified.
  • Base Fugu (non-Ultra) is untested by any source here — all hands-on results are Fugu Ultra only.
  • EU availability was absent at test time; whether/when it lands is unknown.
  • The exact delegate roster and self-delegation (“possibly instances of itself,” “probably more”) are inferred from the primary creator’s hedged description, not disclosed by Sakana.
  • Data retention / privacy posture for the orchestrator is undocumented in these sources.