Gemini 3.5 Flash Field Test vs GPT 5.5 / Opus 4.7 / Spark / Antigravity 2.0 / Codex (MattVidPro)

Source: raw/I_Field_Tested_Gemini_3.5_Flash_-_Fast_Boi_Smol_Brain..md (YouTube field-test video by Matt Wolfe a.k.a. MattVidPro, youtube.com/watch?v=U1VMhxccPYU, fetched 2026-05-27 via yt-podcast auto-pull from the inbox-refresh mattvidpro feed).

A hands-on field test of Google’s Gemini 3.5 Flash (released ~one week prior at Google I/O alongside Google Omni) running three concrete test prompts side-by-side with GPT 5.5, Claude Opus 4.7, and Google’s two agentic surfaces (Gemini Spark + Antigravity 2.0). Useful as a single operator-perspective data point on the Opus 4.7 vs GPT 5.5 vs Gemini 3.5 Flash decision space as of late May 2026. Matt’s headline read: 3.5 Flash is “3.1 Pro level intelligence at a cheaper, faster price” — intelligence per second up, intelligence per dollar maybe not. Honest broader observation: GPT 5.5 came out looking best across all three tests, and Opus 4.7 felt “sluggish… trading blows and often losing to 5.5 thinking.”

Key Takeaways

Pricing for 3.5 Flash: $1.50/M input + $9/M output. Roughly 3× the cost of the previous Flash generation: mid-range, not dirt cheap. The Grok-style cheap-tier slot is no longer Gemini’s. Cost-leadership is intelligence-per-second, not intelligence-per-dollar.
Speed claim verified. All three tests confirm 3.5 Flash is fast: the Voxel village HTML generated in under 30 seconds on standard thinking; “if any thinking actually occurred, I don’t see any UI that suggests it.” Response time felt comparable to GPT 5.5 in instant mode.
Three-test composite scorecard (Matt’s hands-on read, not a benchmark):

Test	Gemini 3.5 Flash	GPT 5.5	Opus 4.7	Spark / Antigravity 2.0
Creative cinematic scene (necklace-artifact concept)	Fast, generic feel, leans into sci-fi cliché when subtly steered away	Same cliché-leaning behavior at default	(not tested for this prompt)	—
Voxel village market (fruit NPCs, HTML/JS)	~30s, basic but works, ~37 fruit NPCs, eyes-only NPC bodies	~2 min, ~50% more detail (real animations, dialogue, lighting, character interactions, real shops)	“Infinite errors” — never produced a working result	—
3D water-physics globe sim (rotatable, real water, growable lemon trees)	Standard thinking: broken (no water, no trees). Extended thinking: failed entirely (“couldn’t load the chat — doesn’t exist or was deleted”)	Working first try in 41 seconds with water particles, lemon trees, click-to-grow, viscosity controls	Still producing infinite errors after 7 min via Codex (smoke-tested)	Spark: produced a result via dedicated VM that writes HTML to Google Drive, but “seizure-warning”-tier render. Antigravity 2.0: ran 7 min Codex-style; final result had lemons-as-projectiles but missing surface water

Gemini Spark (new agentic mode) runs in a dedicated virtual machine, can write outputs to Google Drive automatically (Google Workspace tie-in), and progresses through visible build steps. Matt: “you can actually see all of the progress to build this thing out… if you want to build complex code projects, Google Antigravity 2.0, AI Studio, or this new Spark.” Open: no summarized chain-of-thought visible in the new UI.
Antigravity 2.0 architectural copy of Codex. “They basically copied the Codex interface — both have projects up here… general consensus on this AI street is that Antigravity 2.0 is a little bit worse than Codex, and Codex is kind of the king of agentic work.” Side panel for custom skills and MCPs (“why not enable all of them”). Temporarily tripled rate limits for free users in Antigravity 2.0 as an adoption push. Agentic IDE code editor is a separate download in 2.0 (“a lot of users aren’t going to need it or want it”).
Codex’s load-bearing differentiator vs Antigravity 2.0 = smoke tests. Codex “activates browser use to check, notices small errors, little things that need to be tweaked, goes back and edits that main HTML file, conducting a polish pass. You don’t see anything like that going on on Antigravity. In fact, I don’t even see like a code smoke test in here.” Browser-use-on-its-own-output is the verification primitive Antigravity is missing.
Default-to-image-tool antipattern in Gemini’s new UI. Sending a writing prompt to 3.5 Flash, the model defaulted to generating an image instead of writing. Matt had to explicitly add “don’t make an image” to the prompt. Same antipattern Matt notes on ChatGPT lately. This is a UI/orchestration layer bug, not a model-capability problem — but it’s a real operator gotcha for anyone running Gemini through the consumer surface.
Bench observations (from Matt’s read of the official Google blog post): MCP Atlas: “pretty big win.” Terminal Bench: still losing. Multimodal: trading blows with 5.5. 1M context is advertised, but 8-Needle scores aren’t amazing: even at 128k tokens there’s a regression compared to 3.1 Pro, while GPT 5.5 is “stunningly good in this area.” ARC-AGI-2: “about matching Opus 4.7, still a little bit behind in Humanity’s Last Exam.” Speed graph “definitely missing some models” (no extra-high GPT 5.5).
Opus 4.7 read. Direct quote: “I haven’t been using Claude that much. Opus 4.7 feels more sluggish compared to both ChatGPT and Gemini. It is very intelligent, but it’s trading blows and often losing to 5.5 thinking.” On the Voxel village + 3D globe tests, Opus produced “infinite errors” and didn’t reach a working output. Pair this with the Margin Lab degradation signal (15% Opus 4.7 pass-rate drop May 22-26) and the wiki has two independent observations of Opus 4.7 underperforming in late May 2026.
3.5 Pro roadmap. Coming “down the line.” Matt’s bet: “3.5 Pro is going to land maybe just a touch better than 5.5” with cost reflecting that — possibly priced a smidgen lower than 5.5. Speed is the open variable. Update 2026-07-24: still unreleased two months later — Google shipped 3.6 Flash first, and Logan Kilpatrick says 3.5 Pro is “testing with partners.” See Where the Gemini line went next below.

Three reproducible test prompts

Matt’s prompts are explicit enough to re-run against any model. The wiki can use them as a falsification surface for future model claims.

Cinematic scene (creative writing):

“Original cinematic scene — believable, unique, creative. Fictional + sci-fi + anime. Seedling: a necklace artifact splits into four fragments. Individually, each fragment subtly biases reality toward the holder during conflict. All four fragments together complete causal convergence — reality increasingly resolves in favor of the holder’s survival and victory.”

Voxel village market (HTML/JS):

“Create an intricate and lively Voxel downtown market village filled with fruit creature NPCs. Accurate relative sizing — a watermelon is going to be way bigger than a strawberry. First-person exploratory experience.”

3D water-physics globe (hard):

“Real water physics that actually work when I move and flick the globe around. I want to be able to plant trees that grow and naturally produce lemons on the surface.”

For each test, capture: time-to-result, working-first-try yes/no, qualitative detail level, model-specific failure modes. The third test is the hardest — Matt’s strongest model signal came from there.
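
The capture list above can be sketched as a small record type. This is a minimal sketch, not anything Matt used: the `FieldTestRun` schema, the test slugs, and the `summarize` helper are hypothetical names, and the sample values are transcribed from the scorecard in Key Takeaways (times are Matt’s approximate figures).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FieldTestRun:
    """One model's attempt at one prompt (hypothetical schema, not from the video)."""
    test: str
    model: str
    seconds_to_result: Optional[float]  # None when no working result or time was recorded
    worked_first_try: bool
    detail_notes: str = ""
    failure_modes: list[str] = field(default_factory=list)


def summarize(runs: list[FieldTestRun]) -> dict[str, str]:
    """Map each test to the fastest model that worked first try ('none' if no model did)."""
    best: dict[str, FieldTestRun] = {}
    for run in runs:
        if not run.worked_first_try or run.seconds_to_result is None:
            continue
        current = best.get(run.test)
        if current is None or run.seconds_to_result < current.seconds_to_result:
            best[run.test] = run
    tests = {run.test for run in runs}
    return {t: (best[t].model if t in best else "none") for t in tests}


# Sample rows taken from the scorecard; slugs like "voxel-village" are made up here.
runs = [
    FieldTestRun("voxel-village", "Gemini 3.5 Flash", 30, True, "basic but works, ~37 fruit NPCs"),
    FieldTestRun("voxel-village", "GPT 5.5", 120, True, "animations, dialogue, lighting, real shops"),
    FieldTestRun("voxel-village", "Opus 4.7", None, False, failure_modes=["infinite errors"]),
    FieldTestRun("water-globe", "Gemini 3.5 Flash", None, False, failure_modes=["no water", "no trees"]),
    FieldTestRun("water-globe", "GPT 5.5", 41, True, "water particles, lemon trees, click-to-grow"),
]

print(summarize(runs))
```

Note that "fastest working result" is only one axis: on the Voxel village test the faster Gemini output was also the less detailed one, which is why the record keeps `detail_notes` and `failure_modes` as free text alongside the timing.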

Operator takeaway (Matt’s own conclusion)

“If you ask me, I think OpenAI is doing the best job at creating a model that checks all of the boxes. It’s definitely more expensive, but on more difficult tasks, on agentic tasks, I’m throwing it into Codex first. I’m opening up ChatGPT first. If I need a quick Google-style result, quick research, or answering a simpler question, 3.5 Flash is definitely going to be my go-to. But anything that dips into the more advanced and complex territory, I’m instantly switching.”

Routing heuristic the wiki can borrow:

Quick research / simple Q&A → Gemini 3.5 Flash (speed advantage)
Difficult / agentic tasks → Codex + GPT 5.5
Anything visual or creative → GPT 5.5 by default, validate against alternatives
Production HTML/JS where correctness matters → smoke-test-capable surface (Codex) over surfaces that don’t smoke-test (Antigravity 2.0)

This is a single operator’s hands-on read. Treat as one data point, not a benchmark. The reproducible prompts above are the falsifiable layer.
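
The routing heuristic above can be written down as a lookup table. This is a sketch under stated assumptions: `TaskKind`, `ROUTES`, and `route` are hypothetical names, and the route values are free-text labels copied from the list, not real API model identifiers.

```python
from enum import Enum


class TaskKind(Enum):
    QUICK_RESEARCH = "quick research"
    SIMPLE_QA = "simple Q&A"
    DIFFICULT_OR_AGENTIC = "difficult / agentic"
    VISUAL_OR_CREATIVE = "visual or creative"
    PRODUCTION_HTML_JS = "production HTML/JS"


# Labels mirror the heuristic's wording; they are not API model IDs.
ROUTES = {
    TaskKind.QUICK_RESEARCH: "Gemini 3.5 Flash",
    TaskKind.SIMPLE_QA: "Gemini 3.5 Flash",
    TaskKind.DIFFICULT_OR_AGENTIC: "Codex + GPT 5.5",
    TaskKind.VISUAL_OR_CREATIVE: "GPT 5.5",
    TaskKind.PRODUCTION_HTML_JS: "Codex (smoke-test-capable surface)",
}

# The heuristic says to validate visual/creative output against alternatives.
VALIDATE_AGAINST_ALTERNATIVES = {TaskKind.VISUAL_OR_CREATIVE}


def route(kind: TaskKind) -> tuple[str, bool]:
    """Return (default surface, whether to cross-check against other models)."""
    return ROUTES[kind], kind in VALIDATE_AGAINST_ALTERNATIVES


print(route(TaskKind.VISUAL_OR_CREATIVE))
```

Keeping the table as data rather than branching logic makes it cheap to revise when a re-run of the three prompts changes the picture.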

Where the Gemini line went next (2026-07-21)

Source: raw/Is_Google_Gemini_In_Trouble.md — the AI For Humans podcast, a conversational news show. Two of the facts below are direct quotes from Google AI Studio’s Logan Kilpatrick on X, which is why this update is worth recording; the rest is the hosts’ framing.

Gemini 3.6 Flash shipped — but 3.5 Pro still hasn’t. Google released 3.6 Flash (plus 3.5 Flash Lite and a cybersecurity-tuned 3.5 Flash Cyber), skipping past the 3.5 Pro that this article’s test was waiting on. 3.6 Flash is faster and roughly two-thirds the price of 3.5 Flash — continuing the intelligence-per-second-not-per-dollar positioning Matt identified, but reversing the price direction.
3.5 Pro is not cancelled. Asked directly on X whether 3.5 Pro was headed “out to the woodshed,” Logan Kilpatrick replied: “3.5 Pro is testing with partners and will hopefully land soon.” This partially resolves this article’s 3.5-Pro-roadmap thread: still coming, ~2 months later than the “down the line” framing above, and now arriving after two Flash generations rather than before.
Gemini 4 pre-training has started. Same day, Kilpatrick: “We have started our most ambitious pre-training run yet for Gemini 4 and are excited for the progress.” The hosts read the combination — Flash releases shipping, Pro slipping, Gemini 4 in pre-training — as Google deliberately stepping back from the frontier-model race for a cycle and optimizing for cheap, fast, and on-device instead.
3.6 Flash lands at #12 on the front-end Code Arena — below Meta’s Muse 1.1 Spark and below Opus 4.6 thinking. Not a frontier play; the hosts’ argument is that this is plausibly the model that ships on an Android phone, generating UI or a throwaway app on demand.
The cyber variant is the interesting one. Run in tandem (multiple instances at once), 3.5 Flash Cyber reportedly punches around Mythos Preview / GPT-5.6 Sol level on its target tasks. ^[the hosts’ read of Google’s blog post, not a benchmark they ran] The practical shape they name: swarms of small cheap agents continuously monitoring for threats or running pen tests — a workload where a 12-on-the-leaderboard model at a third of the price may be the correct choice.

Hard numbers (added 2026-07-24, raw/newsletter-theneurondaily-com-a7d7eca282.md). The podcast segment above described the split qualitatively; a same-week newsletter supplies the figures, and they sharpen the “three specialised models, not one generalist” reading:

Model	Positioning	Stated numbers
Gemini 3.6 Flash	All-purpose workhorse; better coding + document analysis	$1.50/$7.50 per MTok; uses 17% fewer output tokens than its predecessor
Gemini 3.5 Flash-Lite	High-volume throughput (document scanning, search)	350 output tokens/sec
Gemini 3.5 Flash Cyber	Locked down — governments and vetted partners only; find and patch security holes	Access-restricted; no public pricing

Two things worth carrying: the 17% output-token reduction is a second, compounding discount on top of the headline price (you are billed on tokens, so fewer tokens per answer is a real cost lever the sticker price hides), and the Cyber variant’s access restriction is the same gated-distribution pattern Anthropic uses for Mythos-class cyber work — see Claude Security Plugin and the Cyber Verification Program.
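
The compounding effect can be checked with a few lines of arithmetic. This is a rough sketch resting on several assumptions: that the two table figures are input/output prices per million tokens, that the "predecessor" is 3.5 Flash at the $1.50/$9 pricing quoted in Key Takeaways, that the 17% reduction applies uniformly to every answer, and that the 2,000-in / 1,000-out workload is a made-up example.

```python
def cost_usd(input_tokens: float, output_tokens: float,
             in_price_per_m: float, out_price_per_m: float) -> float:
    """Cost of one request given per-million-token prices."""
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1_000_000


# Hypothetical workload: 2,000 input tokens, 1,000 output tokens on 3.5 Flash.
input_tokens = 2_000
old_output_tokens = 1_000
new_output_tokens = old_output_tokens * (1 - 0.17)  # 17% fewer output tokens

old_cost = cost_usd(input_tokens, old_output_tokens, 1.50, 9.00)  # assumed 3.5 Flash pricing
new_cost = cost_usd(input_tokens, new_output_tokens, 1.50, 7.50)  # 3.6 Flash pricing from the table

# Output-side ratio combines the price cut (7.50 / 9.00) with the token cut (0.83).
output_ratio = (7.50 / 9.00) * (1 - 0.17)

print(f"old request cost:  ${old_cost:.6f}")
print(f"new request cost:  ${new_cost:.6f}")
print(f"output-side ratio: {output_ratio:.3f}")
print(f"whole-request saving: {1 - new_cost / old_cost:.1%}")
```

Under these assumptions the output side costs about 0.69 of what it did, which is in the same ballpark as the podcast’s "roughly two-thirds the price" description, while the whole-request saving is smaller because the input price did not change.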

New silicon in the works: “Frozen.” Google has announced a TPU successor claimed at 6–10× more efficient than current TPUs, reportedly by freezing parts of the model architecture directly into the chip so repetitive/redundant computation isn’t re-performed. ^[podcast-sourced, no primary Google source in the transcript]

Net read for the routing heuristic above: the “quick research / simple Q&A → Gemini Flash” lane is now cheaper and faster than when this test ran, and the “difficult / agentic → Codex + GPT” lane is unchanged — Google shipped nothing in this window that competes for it.

Open Questions

Voxel village + 3D globe — Sonnet 4.6 or 4.7 (not Opus) result? Matt only ran Opus 4.7 (and Opus failed). Sonnet 4.6+ for HTML/JS code-generation may have been the right Claude pick for these tests. Worth re-running.
/code-review --fix from CC v2.1.152. Could Claude Code’s new auto-fix loop close the “infinite errors” gap Opus showed on the Voxel + globe tests? See Week 22 release digest.
Spark’s chain-of-thought visibility. Matt notes “no way to track and see at least summarized chain of thought” in Spark’s new UI. Does the underlying API expose it? Operator-relevant.
Antigravity 2.0 smoke-test addition timeline. If Google adds Codex-style browser-use verification to Antigravity, the qualitative gap Matt observed closes. Watch for it.
3.5 Flash + MCP Atlas big-win claim. Matt didn’t directly test MCP-tool-use in this video. Worth pairing 3.5 Flash with MCP servers in real workflows to verify the bench claim.
Code-quality smoke testing as a wiki concept. “Codex runs smoke tests, Antigravity doesn’t” is the load-bearing differentiator Matt names. The wiki doesn’t yet have a dedicated article on AI-code-verification primitives (smoke test, lint, type-check, browser-use-self-verify). Connection candidate.
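
As a starting point for that article, the weakest verification primitive, a static smoke test on generated HTML, fits in a few lines of standard-library Python. This is a sketch only and is not how Codex works: Codex’s check, as Matt describes it, drives a real browser, whereas this stand-in never executes the page. The `ScriptCounter` and `smoke_test_html` names and the specific checks are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

FENCE = "`" * 3  # a markdown code fence, a common leftover in model output


class ScriptCounter(HTMLParser):
    """Counts <script> elements that load a file or contain inline code."""

    def __init__(self) -> None:
        super().__init__()
        self.usable_scripts = 0
        self._in_inline_script = False

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        if dict(attrs).get("src"):
            self.usable_scripts += 1  # external script
        else:
            self._in_inline_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_inline_script = False

    def handle_data(self, data):
        if self._in_inline_script and data.strip():
            self.usable_scripts += 1
            self._in_inline_script = False  # count each inline script once


def smoke_test_html(html: str) -> list[str]:
    """Return a list of problems; an empty list means the static checks passed."""
    if not html.strip():
        return ["empty document"]
    problems = []
    if FENCE in html:
        problems.append("markdown code fence left in output")
    counter = ScriptCounter()
    counter.feed(html)
    counter.close()
    if counter.usable_scripts == 0:
        problems.append("no script with a src or inline code")
    return problems


good = "<!DOCTYPE html><html><body><canvas id='c'></canvas><script>console.log('hi')</script></body></html>"
print(smoke_test_html(good))
```

A check like this cannot catch the runtime failures described in the globe test (missing water, missing trees); it only filters out outputs that could not possibly run, which is why browser-use self-verification sits at the strong end of the list.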

Related

AI Podcasts topic landing
Claude AI topic landing — Claude-specific content the field test references
Troubleshooting Claude — the Margin Lab Opus 4.7 degradation signal from May 22-26 pairs with Matt’s “Opus 4.7 sluggish, trading blows and often losing to 5.5 thinking” observation
Claude Code Week 22 Release Digest — /code-review --fix auto-fix loop landed May 27 and may close some of the “Opus produced infinite errors” gap
Anthropic — How We Contain Claude — adjacent first-party Anthropic context for the same week
AI Industry Research — for benchmark-derived (vs operator-perspective) model-comparison content
