Source: raw/I_Field_Tested_Gemini_3.5_Flash_-_Fast_Boi_Smol_Brain..md (YouTube field-test video by Matt Wolfe a.k.a. MattVidPro, youtube.com/watch?v=U1VMhxccPYU, fetched 2026-05-27 via yt-podcast auto-pull from the inbox-refresh mattvidpro feed).
A hands-on field test of Google’s Gemini 3.5 Flash (released ~one week prior at Google I/O alongside Google Omni) running three concrete test prompts side-by-side with GPT 5.5, Claude Opus 4.7, and Google’s two agentic surfaces (Gemini Spark + Antigravity 2.0). Useful as a single operator-perspective data point on the Opus 4.7 vs GPT 5.5 vs Gemini 3.5 Flash decision space as of late May 2026. Matt’s headline read: 3.5 Flash is “3.1 Pro level intelligence at a cheaper, faster price” — intelligence per second up, intelligence per dollar maybe not. Honest broader observation: GPT 5.5 came out looking best across all three tests, and Opus 4.7 felt “sluggish… trading blows and often losing to 5.5 thinking.”
Key Takeaways
- Pricing for 3.5 Flash: $9/M output tokens. Roughly 3× the cost of the previous Flash generation, so mid-range rather than dirt cheap. The Grok-style cheap-tier slot is no longer Gemini’s. Its lead is in intelligence per second, not intelligence per dollar.
- Speed claim verified. All three tests confirm 3.5 Flash is fast: the Voxel village HTML was generated in under 30 seconds on standard thinking, and “if any thinking actually occurred, I don’t see any UI that suggests it.” Response time felt comparable to GPT 5.5 in instant mode.
- Three-test composite scorecard (Matt’s hands-on read, not a benchmark):
| Test | Gemini 3.5 Flash | GPT 5.5 | Opus 4.7 | Spark / Antigravity 2.0 |
|---|---|---|---|---|
| Creative cinematic scene (necklace-artifact concept) | Fast, generic feel, leans into sci-fi cliché even when subtly steered away | Same cliché-leaning behavior at default | (not tested for this prompt) | (not tested) |
| Voxel village market (fruit NPCs, HTML/JS) | ~30s, basic but works, ~37 fruit NPCs, eyes-only NPC bodies | ~2 min, ~50% more detail (real animations, dialogue, lighting, character interactions, real shops) | “Infinite errors” — never produced a working result | — |
| 3D water-physics globe sim (rotatable, real water, growable lemon trees) | Standard thinking: broken (no water, no trees). Extended thinking: failed entirely (“couldn’t load the chat — doesn’t exist or was deleted”) | Working first try in 41 seconds with water particles, lemon trees, click-to-grow, viscosity controls | Still producing infinite errors after 7 min via Codex (smoke-tested) | Spark: produced a result via dedicated VM that writes HTML to Google Drive, but “seizure-warning”-tier render. Antigravity 2.0: ran 7 min Codex-style; final result had lemons-as-projectiles but missing surface water |
- Gemini Spark (new agentic mode) runs in a dedicated virtual machine, can write outputs to Google Drive automatically (Google Workspace tie-in), and progresses through visible build steps. Matt: “you can actually see all of the progress to build this thing out… if you want to build complex code projects, Google Antigravity 2.0, AI Studio, or this new Spark.” Open: no summarized chain-of-thought visible in the new UI.
- Antigravity 2.0 is an architectural copy of Codex. “They basically copied the Codex interface, both have projects up here… general consensus on this AI street is that Antigravity 2.0 is a little bit worse than Codex, and Codex is kind of the king of agentic work.” It has a side panel for custom skills and MCPs (“why not enable all of them”). Google temporarily tripled rate limits for free users in Antigravity 2.0 as an adoption push. The agentic IDE code editor is a separate download in 2.0 (“a lot of users aren’t going to need it or want it”).
- Codex’s load-bearing differentiator vs Antigravity 2.0 = smoke tests. Codex “activates browser use to check, notices small errors, little things that need to be tweaked, goes back and edits that main HTML file, conducting a polish pass. You don’t see anything like that going on on Antigravity. In fact, I don’t even see like a code smoke test in here.” Browser-use-on-its-own-output is the verification primitive Antigravity is missing.
- Default-to-image-tool antipattern in Gemini’s new UI. Sending a writing prompt to 3.5 Flash, the model defaulted to generating an image instead of writing. Matt had to explicitly add “don’t make an image” to the prompt. Same antipattern Matt notes on ChatGPT lately. This is a UI/orchestration layer bug, not a model-capability problem — but it’s a real operator gotcha for anyone running Gemini through the consumer surface.
- Bench observations (from Matt’s read of the official Google blog post): MCP Atlas is a “pretty big win”; Terminal Bench is still a loss; multimodal is trading blows with 5.5. A 1M context window is advertised, but 8-Needle scores aren’t amazing: even at 128k tokens there is a regression compared to 3.1 Pro, while GPT 5.5 is “stunningly good in this area.” On Arc AGI 2 it is “about matching Opus 4.7, still a little bit behind in Humanity’s Last Exam.” The speed graph is “definitely missing some models” (no extra-high GPT 5.5).
- Opus 4.7 read. Direct quote: “I haven’t been using Claude that much. Opus 4.7 feels more sluggish compared to both ChatGPT and Gemini. It is very intelligent, but it’s trading blows and often losing to 5.5 thinking.” On the Voxel village + 3D globe tests, Opus produced “infinite errors” and didn’t reach a working output. Pair this with the Margin Lab degradation signal (15% Opus 4.7 pass-rate drop May 22-26) and the wiki has two independent observations of Opus 4.7 underperforming in late May 2026.
- 3.5 Pro roadmap. Coming “down the line.” Matt’s bet: “3.5 Pro is going to land maybe just a touch better than 5.5” with cost reflecting that — possibly priced a smidgen lower than 5.5. Speed is the open variable.
Three reproducible test prompts
Matt’s prompts are explicit enough to re-run against any model. The wiki can use them as a falsification surface for future model claims.
- Cinematic scene (creative writing):
“Original cinematic scene — believable, unique, creative. Fictional + sci-fi + anime. Seedling: a necklace artifact splits into four fragments. Individually, each fragment subtly biases reality toward the holder during conflict. All four fragments together complete causal convergence — reality increasingly resolves in favor of the holder’s survival and victory.”
- Voxel village market (HTML/JS):
“Create an intricate and lively Voxel downtown market village filled with fruit creature NPCs. Accurate relative sizing — a watermelon is going to be way bigger than a strawberry. First-person exploratory experience.”
- 3D water-physics globe (hard):
“Real water physics that actually work when I move and flick the globe around. I want to be able to plant trees that grow and naturally produce lemons on the surface.”
For each test, capture: time-to-result, working-first-try yes/no, qualitative detail level, model-specific failure modes. The third test is the hardest — Matt’s strongest model signal came from there.
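The capture list above can be kept as a small record per run. The following is a minimal Python sketch, not a tool from the video: the `FieldTestRun` class, its field names, and the `fastest_working` helper are all hypothetical, and the two sample rows only restate what the scorecard reports for the globe test.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FieldTestRun:
    """One model's attempt at one prompt (all names here are illustrative)."""
    model: str
    prompt: str
    seconds_to_result: Optional[float]  # None when no time was recorded
    worked_first_try: bool
    detail_notes: str = ""
    failure_modes: List[str] = field(default_factory=list)


def fastest_working(runs: List[FieldTestRun]) -> Optional[FieldTestRun]:
    """Return the quickest run that worked first try, or None if none did."""
    working = [
        r for r in runs
        if r.worked_first_try and r.seconds_to_result is not None
    ]
    return min(working, key=lambda r: r.seconds_to_result, default=None)


# Sample rows restating the 3D globe results from the scorecard.
runs = [
    FieldTestRun(
        model="GPT 5.5",
        prompt="3D water-physics globe",
        seconds_to_result=41.0,
        worked_first_try=True,
        detail_notes="water particles, lemon trees, click-to-grow, viscosity controls",
    ),
    FieldTestRun(
        model="Gemini 3.5 Flash (standard thinking)",
        prompt="3D water-physics globe",
        seconds_to_result=None,
        worked_first_try=False,
        failure_modes=["no water", "no trees"],
    ),
]
```

Keeping failure modes as free-text strings is a deliberate choice here: the observations are qualitative, so forcing them into a fixed enum would lose detail.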
Operator takeaway (Matt’s own conclusion)
“If you ask me, I think OpenAI is doing the best job at creating a model that checks all of the boxes. It’s definitely more expensive, but on more difficult tasks, on agentic tasks, I’m throwing it into Codex first. I’m opening up ChatGPT first. If I need a quick Google-style result, quick research, or answering a simpler question, 3.5 Flash is definitely going to be my go-to. But anything that dips into the more advanced and complex territory, I’m instantly switching.”
Routing heuristic the wiki can borrow:
- Quick research / simple Q&A → Gemini 3.5 Flash (speed advantage)
- Difficult / agentic tasks → Codex + GPT 5.5
- Anything visual or creative → GPT 5.5 by default, validate against alternatives
- Production HTML/JS where correctness matters → smoke-test-capable surface (Codex) over surfaces that don’t smoke-test (Antigravity 2.0)
This is a single operator’s hands-on read. Treat as one data point, not a benchmark. The reproducible prompts above are the falsifiable layer.
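The routing heuristic above can be written down as a lookup table. This is a minimal sketch under stated assumptions: the `Task` enum, `ROUTES` mapping, and `route` function are hypothetical names, and the string values are human-readable labels taken from the bullets, not real API model identifiers.

```python
from enum import Enum


class Task(Enum):
    """Task categories from the routing bullets (names are illustrative)."""
    QUICK_QA = "quick research / simple Q&A"
    AGENTIC = "difficult / agentic"
    CREATIVE = "visual or creative"
    PRODUCTION_WEB = "production HTML/JS where correctness matters"


# Mirrors the four bullets; values are labels, not API model IDs.
ROUTES = {
    Task.QUICK_QA: "Gemini 3.5 Flash",
    Task.AGENTIC: "Codex + GPT 5.5",
    Task.CREATIVE: "GPT 5.5",
    Task.PRODUCTION_WEB: "Codex",
}

# The creative route is only a default and should be checked against alternatives.
VALIDATE_AGAINST_ALTERNATIVES = {Task.CREATIVE}


def route(task: Task) -> str:
    """Return the first-choice surface for a task category."""
    return ROUTES[task]


def needs_validation(task: Task) -> bool:
    """True when the heuristic says to compare the default against other models."""
    return task in VALIDATE_AGAINST_ALTERNATIVES
```

For example, `route(Task.QUICK_QA)` returns `"Gemini 3.5 Flash"`. Because the table encodes one operator's read, it is best treated as a starting default to be re-checked with the reproducible prompts.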
Open Questions
- Voxel village + 3D globe — Sonnet 4.6 or 4.7 (not Opus) result? Matt only ran Opus 4.7 (and Opus failed). Sonnet 4.6+ for HTML/JS code-generation may have been the right Claude pick for these tests. Worth re-running.
- `/code-review --fix` from CC v2.1.152. Could Claude Code’s new auto-fix loop close the “infinite errors” gap Opus showed on the Voxel + globe tests? See Week 22 release digest.
- Spark’s chain-of-thought visibility. Matt notes “no way to track and see at least summarized chain of thought” in Spark’s new UI. Does the underlying API expose it? Operator-relevant.
- Antigravity 2.0 smoke-test addition timeline. If Google adds Codex-style browser-use verification to Antigravity, the qualitative gap Matt observed closes. Watch for it.
- 3.5 Flash + MCP Atlas big-win claim. Matt didn’t directly test MCP-tool-use in this video. Worth pairing 3.5 Flash with MCP servers in real workflows to verify the bench claim.
- Code-quality smoke testing as a wiki concept. “Codex runs smoke tests, Antigravity doesn’t” is the load-bearing differentiator Matt names. The wiki doesn’t yet have a dedicated article on AI-code-verification primitives (smoke test, lint, type-check, browser-use-self-verify). Connection candidate.
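As a starting point for the verification-primitives idea in the last question, the cheapest possible smoke test is a static check on the generated HTML file. The sketch below is an assumption-heavy illustration, not what Codex does: Codex is described as using browser use on its own output, whereas this only parses the markup with Python's standard `html.parser`. The `smoke_check` function and `_SmokeParser` class are hypothetical names.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_TAGS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class _SmokeParser(HTMLParser):
    """Tracks open tags so mismatched or unclosed elements can be reported."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.problems = []
        self.seen = set()

    def handle_starttag(self, tag, attrs):
        self.seen.add(tag)
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.problems.append(f"unexpected </{tag}>")


def smoke_check(html: str) -> list:
    """Return a list of problems; an empty list means the static check passed.

    Deliberately strict: optional closing tags (e.g. </p>, </li>) are
    required, which is an assumption that suits machine-generated pages.
    """
    parser = _SmokeParser()
    parser.feed(html)
    parser.close()
    problems = list(parser.problems)
    problems += [f"unclosed <{tag}>" for tag in parser.stack]
    if "script" not in parser.seen:
        problems.append("no <script> tag")
    return problems
```

A check like this cannot catch runtime failures such as missing water in a rendered simulation; it only filters out files that are structurally broken before a heavier browser-based pass.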
Related
- AI Podcasts topic landing
- Claude AI topic landing — Claude-specific content the field test references
- Troubleshooting Claude — the Margin Lab Opus 4.7 degradation signal from May 22-26 pairs with Matt’s “Opus 4.7 sluggish, trading blows and often losing to 5.5 thinking” observation
- Claude Code Week 22 Release Digest: the `/code-review --fix` auto-fix loop landed May 27 and may close some of the “Opus produced infinite errors” gap
- Anthropic — How We Contain Claude: adjacent first-party Anthropic context for the same week
- AI Industry Research — for benchmark-derived (vs operator-perspective) model-comparison content