Image Models Feed the Video Pipeline — The Still Comes First

Source: wiki synthesis: GPT Image 2, Nano Banana, FREE Seedance 2.0 Claude Skill, Animated Short Film Pipeline

The AI video pipeline begins with a still image, not with video. Before any image-to-video model animates a shot, a text-to-image model has to generate the character sheet, the storyboard grid, or the first frame that the motion model reads as its reference. The two rival “thinking” image models this wiki tracks, OpenAI’s GPT Image 2 and Google’s Nano Banana Pro, are interchangeable front-ends bolted onto the same downstream workflow (Seedance 2 via LTX Studio, Higgsfield, or a Codex/Claude pre-production loop), which is why the tutorials covered here A/B test them in the same run. Picking the image model is therefore an upstream decision that shapes everything downstream: whatever character consistency you lock at the sheet is what the motion model has to work with in every animated frame.

Key Takeaways

Image-to-video is reference-hungry. Seedance 2 “works best fed image references” and reads a character sheet or a 3×3 storyboard grid as its visual + motion anchor, so the still is not a nice-to-have; it is the load-bearing input.
The still comes first, in stages. The canonical order across tutorials: (1) character sheet (GPT Image 2 / Nano Banana Pro) to lock identity → (2) 3×3 cinematic storyboard grid with per-panel choreography → (3) Seedance 2 shot that reads the grid panels as sequential shots inside a 10-15s clip.
The two image models are interchangeable front-ends, A/B’d per job. LTX Studio and Higgsfield MCP both prompt Nano Banana Pro and GPT Image 2 side-by-side in one run (“Nano Banana 2 for one, GPT Image 2 for the other”); the animated-short-film pipeline ran a style bake-off and GPT Image 2 won for detailed hand-drawn 2D. No permanent winner — test both on your actual style.
Character consistency is the property that propagates. GPT Image 2 held tattoos / piercings / hairstyle across six references where Nano Banana Pro drifted; Nano Banana Pro blends up to 14 references and keeps 5 people consistent. Whatever consistency you win at the image stage is what Seedance inherits downstream — and the signature Seedance failure (duplicate / cloned characters from a multi-instance reference sheet) is fixed at the prompt, not the model.
The chain beats either model alone. GPT Image 2’s 50-example field test names the image→video chain explicitly: “GPT Image 2 gets even stronger when combined with Seedance 2.”
Provenance diverges at the source. Nano Banana stamps every output with SynthID + C2PA; GPT Image 2’s provenance stack isn’t documented in this wiki’s coverage — a downstream consideration if the finished video needs an AI-disclosure trail. ^[inferred]
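
The 3×3 grid with per-panel choreography can be drafted as structured text before it goes to the image model. The sketch below is illustrative only: `grid_prompt` and `seconds_per_panel` are hypothetical helpers, the prompt wording is invented rather than taken from any of the sources, and the even split of clip time across panels is an assumption (the sources do not say how Seedance 2 paces the panels).

```python
def grid_prompt(beats, rows=3, cols=3):
    """Build a storyboard-grid prompt with one choreography beat per panel.

    The wording is a placeholder; adapt it to whatever your image model expects.
    """
    if len(beats) != rows * cols:
        raise ValueError(f"expected {rows * cols} beats, got {len(beats)}")
    lines = [
        f"{rows}x{cols} cinematic storyboard grid, "
        "panels read left to right, top to bottom."
    ]
    for i, beat in enumerate(beats):
        row, col = divmod(i, cols)
        lines.append(f"Panel {i + 1} (row {row + 1}, column {col + 1}): {beat}")
    return "\n".join(lines)


def seconds_per_panel(clip_seconds, panels=9):
    """Average screen time per panel, ASSUMING the clip is split evenly."""
    return clip_seconds / panels


if __name__ == "__main__":
    beats = [f"beat {n}" for n in range(1, 10)]  # placeholder choreography
    print(grid_prompt(beats))
    print(round(seconds_per_panel(15), 2), "seconds per panel in a 15 s clip")
```

Under the even-split assumption, a 10-15 s clip leaves a little over one second per panel, so each beat has to be a single readable action.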

The pipeline, stage by stage

The four articles describe the same assembly line from different seats — a free Claude Skill, a Google image model, an OpenAI image model, and a full one-person production breakdown — but the ordered stages are identical: ^[inferred — the unified stage numbering is this article’s framing; each stage traces to a specific source]

Stage 0 — pre-production writes the prompts. Codex (or Claude Code) brainstorms the story, writes every generation prompt, and calls the image model’s API to produce backgrounds, character sheets, and prop specs before a single video credit is spent.
Stage 1 — character sheet locks identity. GPT Image 2 or Nano Banana Pro generates the character sheet. This is where multi-image consistency earns its keep — GPT Image 2’s shared-state batch (magazines, manga, room-by-room) and Nano Banana’s 14-reference / 5-person blend both exist to make this stage hold.
Stage 2 — storyboard grid. A 3×3 cinematic grid is generated with the character refs attached (the animated-short-film pipeline used GPT Image 2 here). The grid-as-reference trick is load-bearing: Seedance 2 accepts the whole grid as a single image reference and reads the panels as sequential shots.
Stage 3 — Seedance animates. ~30-50 image-to-video clips, 10-15s each, up to 1080p; feed references and expect the model to lean on them heavily.
The failure mode that traces to the still: the duplicate-character bug — Seedance clones the multiple instances it sees on a reference sheet. The fix (“singular / one / closeup of a single main character”) is a change to the image prompt, not the video model — proof that the upstream still governs the downstream shot.
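
The stage order and the clone-bug guard above can be sketched as plain data. This is a minimal sketch under assumptions: `Stage`, `STAGES`, `SINGULAR_CUES`, `needs_single_character_fix`, and `footage_range_seconds` are hypothetical names invented for illustration, and no real image or video API is called. The cue words come from the fix quoted above; the word match is a crude heuristic, not how any of the tools detect the problem.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    number: int
    name: str
    tool: str
    output: str


# Stage numbering follows this article's framing of the four sources.
STAGES = [
    Stage(0, "pre-production", "Codex or Claude Code", "generation prompts"),
    Stage(1, "character sheet", "GPT Image 2 or Nano Banana Pro", "identity reference"),
    Stage(2, "storyboard grid", "GPT Image 2 (in the short-film source)", "3x3 grid image"),
    Stage(3, "animation", "Seedance 2", "10-15 s clips"),
]

# Cue words taken from the quoted fix: "singular / one / closeup of a single ..."
SINGULAR_CUES = {"singular", "one", "single"}


def needs_single_character_fix(image_prompt: str) -> bool:
    """True when the image prompt contains none of the singular cue words."""
    words = set(re.findall(r"[a-z]+", image_prompt.lower()))
    return words.isdisjoint(SINGULAR_CUES)


def footage_range_seconds(min_clips=30, max_clips=50, min_len=10, max_len=15):
    """Total raw footage implied by the Stage 3 figures (clips x seconds)."""
    return min_clips * min_len, max_clips * max_len


if __name__ == "__main__":
    for stage in STAGES:
        print(stage.number, stage.name, "->", stage.output)
    print(needs_single_character_fix("character sheet, front and side views"))
    print(footage_range_seconds())
```

With the figures from Stage 3, the footage range works out to 300-750 seconds, or roughly 5 to 12.5 minutes of raw clips before editing.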

Choosing the front-end

This pipeline is neither editing nor composition. It is distinct from The Edit Is Text (cutting existing footage as code) and from HTML Is the Canvas (rendering video deterministically from HTML/CSS). Both of those operate on footage or markup you already have; this pipeline generates footage from a still that a reasoning/diffusion image model generated first. ^[inferred]
Decision heuristics from the sources: reach for GPT Image 2 when you need strict identity detail held across many references, multi-image batch output, or detailed hand-drawn 2D (it won the bake-off). Reach for Nano Banana Pro when you need legible in-image text, Search-grounded factual accuracy, SynthID provenance, or you already live inside Google’s stack. The durable move is to A/B both on your own style in one LTX / Higgsfield run and let the bake-off decide, rather than trusting a time-stamped leaderboard.
The upstream choice is sticky. Because Seedance reads the sheet/grid as its anchor, a consistency win or loss at Stage 1 is amplified across 30-50 clips: the cheapest place to fix a character problem is the still, and the most expensive is regenerating video. ^[inferred]
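
The decision heuristics above can be written down as a small lookup, mainly to make the trade-offs explicit. This is a sketch, not a recommendation engine: the need labels, `HEURISTICS`, and `suggest_front_end` are hypothetical names, and the mapping only restates the heuristics in this section. A tie or an empty list falls back to the A/B run, which the sources treat as the durable move anyway.

```python
from collections import Counter

GPT_IMAGE_2 = "GPT Image 2"
NANO_BANANA_PRO = "Nano Banana Pro"
AB_BOTH = "A/B both"

# Need labels are invented for this sketch; the mapping mirrors the section text.
HEURISTICS = {
    "identity_detail_across_references": GPT_IMAGE_2,
    "multi_image_batch": GPT_IMAGE_2,
    "hand_drawn_2d": GPT_IMAGE_2,
    "in_image_text": NANO_BANANA_PRO,
    "search_grounded_accuracy": NANO_BANANA_PRO,
    "synthid_provenance": NANO_BANANA_PRO,
    "google_stack": NANO_BANANA_PRO,
}


def suggest_front_end(needs):
    """Return a starting candidate, or AB_BOTH when the needs do not settle it."""
    unknown = [n for n in needs if n not in HEURISTICS]
    if unknown:
        raise ValueError(f"unknown needs: {unknown}")
    votes = Counter(HEURISTICS[n] for n in needs)
    ranked = votes.most_common()
    if not ranked:
        return AB_BOTH
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return AB_BOTH  # tie: let the bake-off decide
    return ranked[0][0]


if __name__ == "__main__":
    print(suggest_front_end(["hand_drawn_2d"]))
    print(suggest_front_end(["in_image_text", "synthid_provenance"]))
    print(suggest_front_end(["hand_drawn_2d", "in_image_text"]))
```

Even when the lookup returns a single model, treat it as the first candidate to test rather than a verdict.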

Try It

Run the three-stage flow smallest-first with the Seedance Claude Skill: character-sheet prompt → paste into LTX Studio → 3×3 grid → Seedance 2 prompt. Confirm the chain end-to-end before scaling to 30+ clips.
A/B the front-end in one run: generate the same character sheet with GPT Image 2 and Nano Banana Pro, carry both into the storyboard-grid stage, and pick by which identity holds through the animated shot, not by leaderboard Elo.
Prompt for a single character on the reference sheet to dodge Seedance’s clone bug before you spend video credits.
Pick the video platform by queue speed once the storyboard tells you how many clips you need (Polo AI ~3-4 min/gen vs Runway ~8-10 min/gen at ~50 clips).
If the finished video needs an AI-disclosure trail, prefer Nano Banana’s SynthID + C2PA outputs at the image stage — the provenance is set upstream, not in the edit.
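
The queue-speed comparison above is simple arithmetic once the clip count is known. The sketch below assumes generations run one after another with no retries or parallel slots, which the sources do not confirm; `PLATFORMS`, `queue_minutes`, and `compare_platforms` are hypothetical names, and the per-generation times are the approximate figures quoted above.

```python
# Approximate minutes per generation, as quoted in this section.
PLATFORMS = {
    "Polo AI": (3, 4),
    "Runway": (8, 10),
}


def queue_minutes(clips, min_per_gen, max_per_gen):
    """Low and high total wait, ASSUMING sequential generations and no retries."""
    return clips * min_per_gen, clips * max_per_gen


def compare_platforms(clips):
    """Map each platform name to its (low, high) total minutes for `clips` clips."""
    return {
        name: queue_minutes(clips, low, high)
        for name, (low, high) in PLATFORMS.items()
    }


if __name__ == "__main__":
    for name, (low, high) in compare_platforms(50).items():
        print(f"{name}: {low}-{high} min ({low / 60:.1f}-{high / 60:.1f} h)")
```

At 50 clips this gives 150-200 minutes on the faster queue against 400-500 minutes on the slower one, so the gap is several hours of waiting under these assumptions.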

Related

GPT Image 2 launch coverage — the OpenAI front-end; multi-image batch + the photorealism unlock.
Nano Banana — the Google front-end; in-image text, 14-reference blend, SynthID provenance.
FREE Seedance 2.0 Claude Skill — the three-prompt-language skill that stages character sheet → grid → shot.
Animated Short Film Pipeline — the full one-person breakdown with the style bake-off and the clone-bug fix.
The Edit Is Text — Agentic Video Editing — the downstream neighbor: editing footage as code, not generating it.
HTML Is the Canvas — the other video-authoring path: deterministic render from HTML, not a generated still.
Higgsfield MCP Tutorial — the side-by-side Nano Banana 2 vs GPT Image 2 brand-book run.
ChatGPT Image (GPT Image 2) — the image-model topic landing page.

Open Questions

Does the front-end choice change once past Stage 1? The sources A/B the image models at the character-sheet stage, but none benchmark whether mixing (e.g. Nano Banana sheet → GPT Image 2 grid) helps or hurts the Seedance read. ^[inferred]
Provenance survival through the video render. SynthID is set on the still; whether the watermark survives Seedance’s image-to-video transform into the finished MP4 is not addressed in the sources.

Jonathon's AI Wiki
