Source: ai-research/claude-vision-mike-futia-readme-2026-05-06.md and ai-research/mikefutia-claude-vision-refresh-2026-05-11.md — README + SKILL.md + GitHub metadata for mikefutia/claude-vision (48 stars / 12 forks at refresh, MIT in README, fetched 2026-05-11; up from 21/8 at first ingest 2026-05-06). Live demo context: raw/Higgsfield_MCP_+Claude_Code=_AI_Ad_Agency_Full_Tutorial.md (Mike Futia / SCALE AI YouTube tutorial 2026-05-01) and raw/Claude_Code_Can_Now_Watch_ANY_Video_Free_Claude_Skill.md (dedicated standalone demo, YouTube SP-8tFnb4g0, 2026-05).
2026-05-11 — Tier-1 refresh
Stars 21 → 48 (+27 in 5 days), forks 8 → 12, SCALE AI community 500+ → 550+. New dedicated demo video (SP-8tFnb4g0) walks through the skill’s `.skill` upload path inside claude.ai (customize → skills → + → create skill → upload a skill) — a different install vector from the `git clone` + `mv` README path, useful for non-CLI users. Cost claim from the demo: “about 30 cents for every 30 minutes of video” (Gemini API metering). Demo shows two analysis modes: an Ad/UGC mode with a 12-section structured output and a general mode for non-ad content.
A Claude Code skill that gives Claude the ability to “watch” videos by routing the file through Google Gemini’s native video-understanding API, then returning a structured markdown report — top-level summary, MM:SS scene-by-scene breakdown, audio transcript (or an honest “silent” note), visual details, and 3–7 key moments. Sister to [[claude-ai/claude-video|claude-video (Brad Brown’s /watch skill)]], which solves the same problem via a different architecture (yt-dlp + ffmpeg frames + Whisper) — see the comparison table below before picking. Author Mike Futia uses this skill himself in his Higgsfield ad-agency tutorial to grade Seedance video clips against a brand brief; this ingest closes the “Gemini Vision API + Claude integration” Open Question that article previously listed.
Key Takeaways
- One job: video → structured markdown report. No URL fetching, no clip extraction, no audio-only path. Hand it a local video file and Gemini does the rest. Output is opinionated and structured — Top-Level Summary, Scene-by-Scene Breakdown (MM:SS), Audio (verbatim or “silent” note), Visual Details, 3–7 Key Moments.
- Backend is Google Gemini’s native video understanding (`gemini-3-flash-preview` default, `gemini-2.5-flash` fallback for regional availability). Gemini ingests the video natively — no client-side frame extraction. Inline upload for files ≤18MB; Files API for larger uploads with up-to-300s polling for the ACTIVE state (see the sketch after this list).
- Strong anti-hallucination guardrails. The default prompt explicitly forbids inventing narrators, voiceovers, or speaker names. Audio section returns a verbatim transcript OR an honest “no audio / silent / ambient only” note. This is the differentiator versus naive vision-API skills that confabulate plausible-sounding narration.
- `disable-model-invocation: true` in SKILL.md — the skill must be explicitly invoked (`/video-analyzer ...` or “use the video-analyzer skill on…”). Claude won’t auto-select it from descriptions of other tasks. Reduces accidental triggering on tangential mentions of video.
- Two free dependencies, ten supported formats. Free Gemini API key from Google AI Studio; `pip install google-genai` (verified at SDK 1.64.0). Formats: mp4, mov, webm, avi, mpeg, mpg, flv, wmv, 3gpp, 3gp.
- Optional flags for power use. `--prompt "..."` overrides the default structured-report prompt with anything (e.g., “extract every UI element shown” or “grade this against [brand brief]”); `--fps N` raises the sampling rate for fast-cut content (default 1 fps); `--model gemini-2.5-flash` swaps to the fallback when the preview model 404s in your region.
- Allowed tools: `Bash, Read`. Minimal blast radius — runs the analyze_video.py script and reads files. No write/edit, no network beyond what Python does, no destructive surface.
- Scoped-skill pattern — install via `git clone` + `mv claude-vision ~/.claude/skills/video-analyzer`. Folder name is load-bearing: must be `video-analyzer` (the SKILL.md `name`) for Claude Code to find it. Repo is named `claude-vision` for marketing; install name is `video-analyzer` for the runtime.
- The “Set my GEMINI_API_KEY” trick — Mike’s install instructions tell users to ask Claude Code itself to write the export to `~/.zshrc` rather than editing it themselves. Idiomatic Claude-Code-as-its-own-installer move and a reusable pattern for any skill that needs an env var. Pairs with the skill-design-patterns discussion of bootstrapping ergonomics.
- Origin context. Mike Futia ships this as the companion skill to his Higgsfield + Claude Code ad-agency workflow tutorial. In that workflow, Step 5 (Seedance 2.0 hero animation) and Step 7 (Seedance UGC clips) both end with Claude grading the video against the brand brief — and `/video-analyzer` is the surface that lets Claude actually see the motion. Without it, Claude can grade still images but not the video output.
- MIT license (declared in README; no LICENSE file in the repo root), 48 stars / 12 forks as of 2026-05-11 (up from 21/8 on 2026-05-06 — +27 stars in 5 days; the dedicated demo video SP-8tFnb4g0 likely drove most of it). Solo-maintained (mikefutia, SCALE AI community, now 550+ members per the 2026-05-11 README). Single-purpose skill — small surface area, low maintenance burden.
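For orientation, here is a minimal sketch of what that two-path upload logic could look like with the `google-genai` SDK. The skill's actual `scripts/analyze_video.py` is the authority; the helper name, hard-coded MIME type, and polling cadence below are illustrative assumptions.

```python
import os
import time

from google import genai
from google.genai import types

INLINE_LIMIT = 18 * 1024 * 1024  # the README's 18MB inline-upload threshold

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

def analyze_video(path: str, prompt: str, model: str = "gemini-3-flash-preview") -> str:
    """Hypothetical helper mirroring the skill's documented two-path behavior."""
    if os.path.getsize(path) <= INLINE_LIMIT:
        # Small file: send the bytes inline with the request (the faster path).
        with open(path, "rb") as f:
            video = types.Part.from_bytes(data=f.read(), mime_type="video/mp4")  # MIME type assumed
    else:
        # Large file: upload via the Files API, then poll until Gemini finishes
        # processing (the skill waits up to 300s for the ACTIVE state).
        video = client.files.upload(file=path)
        deadline = time.time() + 300
        while video.state.name != "ACTIVE":
            if video.state.name == "FAILED":
                raise RuntimeError("Gemini failed to process the upload")
            if time.time() > deadline:
                raise TimeoutError("file never reached ACTIVE within 300s")
            time.sleep(5)  # polling cadence is an assumption
            video = client.files.get(name=video.name)
    response = client.models.generate_content(model=model, contents=[video, prompt])
    return response.text  # the structured markdown report
```

The real script layers the documented `--prompt`, `--fps`, and `--model` flags on top of this flow; treat the sketch as a reading aid, not a drop-in replacement.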
How it differs from claude-video (/watch)
The wiki has a sibling skill — claude-video by Brad Brown — that also gives Claude “video watching” capability. Both are MIT-licensed, both ship as scoped Claude Code skills, both single-author. They solve the same problem with very different architectures.
| | claude-vision (Mike Futia) | claude-video (Brad Brown) |
|---|---|---|
| Backend | Google Gemini API (native video understanding) | Local frame extraction + transcription |
| Frame strategy | Gemini ingests the video natively | ffmpeg extracts duration-aware frame samples (~30 frames ≤30s, ~80 for 3–10 min, hard cap 100) |
| Audio strategy | Gemini handles natively (or “silent” note) | Native captions when available (free), Whisper via Groq/OpenAI fallback |
| Source types | Local files only (mp4, mov, webm, avi, …) | URLs (300+ sources via yt-dlp) AND local files |
| Slicing | Whole video (no --start/--end) | --start MM:SS --end MM:SS for long content |
| Customization | --prompt, --fps, --model | Per-skill prompt template |
| API keys needed | Gemini (free tier on Google AI Studio) | None for native captions; Groq/OpenAI key for Whisper fallback |
| Output shape | Structured markdown report (5 fixed sections) | Frame samples + transcript handed to Claude for free-form analysis |
| Anti-hallucination | Explicit guardrail in default prompt | Relies on Claude’s general grounding |
| Stars / forks | 48 / 12 (2026-05-11) | 117 / — |
| Repo | mikefutia/claude-vision | bradautomates/claude-video |
Pick claude-vision when: you want a structured report (especially for ad teardowns and SOPs), the video is local, the file is short-to-medium length, and you want the anti-hallucination guardrail on the audio section. The skill supplies the prompt and model defaults — you stay out of the prompt details.
Pick claude-video when: the source is a URL (YouTube, TikTok, Vimeo, Instagram, X), you need start/end slicing on long content (podcast snippets, lecture sections), or you want the frames + transcript handed to Claude for an open-ended analytical pass rather than a fixed-shape report.
Pick both when running serious creative work — claude-video for sourcing and pre-processing remote URLs, claude-vision for the structured grading pass on the local output. They don’t conflict; they’re complementary skills under different namespaces (/watch vs /video-analyzer).
Real-world usage — how Mike uses his own skill
In the Higgsfield ad-agency tutorial, Mike calls `/video-analyzer` as the closing critic on every video-generation step. The pattern that repeats:
- Higgsfield MCP generates a Seedance 2.0 clip → file lands in the project folder.
- Claude is prompted to “evaluate the clip against the brand brief.”
- Claude invokes `/video-analyzer` on the local file → Gemini returns the structured report.
- Claude reads the report and writes a one-paragraph fidelity grade (“the hoodie holds shape through the camera push-in; voiceover quality not graded — synthesized audio outside skill scope”).
- Claude recommends ship-or-redo to the operator.
Quote from the transcript (≈05:35): “I have the Gemini Vision API hooked up to my Claude account. That’s a topic for another video, but I [have] basically given Claude the ability to watch and analyze video content. So, you can see Claude is doing its evaluation of the video here in terms of the fidelity to the still… it’s grading it against the brand brief checklist.”
The skill is what turns Claude from a generation operator into a creative director — same observation Sam Witteveen makes in his six-agentic-patterns teardown of Claude Design (pattern 4: self-QA via vision). Both surfaces converge on the same loop: model generates → vision grades → model iterates.
Implementation
Tool/Service: mikefutia/claude-vision Claude Code skill (MIT). Backend: Google Gemini (gemini-3-flash-preview default, gemini-2.5-flash fallback).
Setup:
```bash
git clone https://github.com/mikefutia/claude-vision.git
mv claude-vision ~/.claude/skills/video-analyzer  # name is load-bearing — keep "video-analyzer"
# Get a free key from https://aistudio.google.com/apikey, then ask Claude Code itself:
# "Set my GEMINI_API_KEY to <your_key> so it's available in every new shell."
pip install google-genai
# If pip complains about externally-managed environment:
# pip install google-genai --break-system-packages
```
Cost:
- Skill itself: Free (MIT, source-available).
- Gemini API: Free tier on Google AI Studio is “generous and fine for personal use” per the README. Paid tier metered separately if exceeded. Per-call cost not specified by the skill — depends on Gemini’s published pricing for the model selected and the video size. (The demo’s “about 30 cents for every 30 minutes of video” claim works out to roughly a cent per minute.)
- No Claude tokens beyond the prompt + report read-back. The video is sent to Gemini, not to Claude. Claude only reads the resulting markdown report — small token footprint per analysis even on long videos.
Integration notes:
- Folder name is load-bearing. Install path must be `~/.claude/skills/video-analyzer`. The repo is named `claude-vision` for branding; the SKILL.md `name: video-analyzer` is what Claude Code’s skill loader matches.
- `disable-model-invocation: true` means Claude won’t auto-route to it from a description like “watch this video.” Invoke explicitly: `/video-analyzer <path>` or “use the video-analyzer skill on `<path>`.” This keeps the skill scoped — useful if you also have claude-video installed (which has different invocation semantics).
- API key visibility. If `GEMINI_API_KEY` is set in `~/.zshrc` but not exported in your current shell, open a fresh terminal or `source ~/.zshrc` before retrying. The error messages flag this clearly.
- File size threshold = 18MB. Below: inline upload (faster). Above: Files API with polling for ACTIVE state (up to 300s). Plan for the polling delay on long screen recordings.
- Allowed tools: `Bash, Read`. No write/edit surface. Reports are returned to Claude’s chat context — to persist them, the operator (or Claude in a follow-up turn) writes the report to disk.
- Default model is `gemini-3-flash-preview`. Fall back to `--model gemini-2.5-flash` if the preview model isn’t available in your region. Watch for newer Gemini releases to update the default — solo-maintained skill, the default may lag.
- Custom prompts via `--prompt` unlock domain-specific grading (a combined invocation sketch follows this list). Examples that match Mike’s workflow: `--prompt "Grade this clip against this brand brief: <paste brief>. Score fidelity, motion quality, brand match. Recommend ship or redo."` Or: `--prompt "Extract every UI interaction shown — clicks, scrolls, keyboard inputs — with timestamps."`
- Frame sampling rate via `--fps` matters for fast-cut UGC. The default 1 fps catches most narrative content; UGC reels with sub-second cuts may need `--fps 2` or `--fps 3` to register every cut in the scene-by-scene breakdown.
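To make the flag surface concrete, here is a hypothetical direct call to the underlying script from outside Claude Code. It assumes the documented flags (`--prompt`, `--fps`, `--model`) map one-to-one onto the script's CLI and that the report goes to stdout; verify both against the repo before relying on it.

```python
import subprocess
from pathlib import Path

SCRIPT = Path.home() / ".claude/skills/video-analyzer/scripts/analyze_video.py"

# Assumption: the script takes <path> plus the documented flags and prints the
# markdown report to stdout. GEMINI_API_KEY must be exported in this shell.
result = subprocess.run(
    [
        "python", str(SCRIPT),
        "ugc_reel.mp4",                 # hypothetical fast-cut vertical ad
        "--fps", "2",                   # sub-second cuts outrun the 1 fps default
        "--model", "gemini-2.5-flash",  # fallback when the preview model 404s
        "--prompt", "Grade this clip against the brand brief. Recommend ship or redo.",
    ],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```

Inside Claude Code, the equivalent is simply `/video-analyzer ugc_reel.mp4 --fps 2 --model gemini-2.5-flash --prompt "..."`.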
Try It
- Drop in any local video and run `/video-analyzer ~/path/to/video.mp4` — start with a sub-18MB clip so the inline upload path runs end-to-end fast. Read the structured report; check whether the Audio section gets the “silent” note correctly when there’s no narration.
- Pair with Mike’s Higgsfield ad-agency workflow — install both `/video-analyzer` and the Higgsfield MCP, then run a 7-step DTC campaign end-to-end. Step 5 and Step 7 grading work as Mike demonstrates only when this skill is installed.
- Ad teardown in 90 seconds. Save a competitor’s UGC ad locally (screen-record from the Meta Ads Library or download a TikTok/Reel via yt-dlp — a pre-fetch sketch follows this list). Run `/video-analyzer` with `--prompt "Beat-by-beat teardown — hook, pain, transformation, proof, CTA. Note timestamp for each beat."` Use the output to build your own creative-brief variants. Pairs with Meta Ads CLI for the upload side once you have variants drafted.
- Loom recording → SOP. Run `/video-analyzer` on a Loom export with `--prompt "Convert this tutorial into a numbered step-by-step SOP. Each step: action, click target, expected result. Skip anything not actionable."` Store the output as `sops/<task>.md` and iterate. Strong fit for the Cowork Jarvis pattern of converting tribal knowledge into reusable skills.
- Meeting recap from a Zoom recording. `--prompt "Extract decisions and action items only. For each, name the owner and the deadline if stated. Ignore small talk and background discussion."` Honest “ambient only” returns when the audio is unintelligible or muted — the anti-hallucination guardrail keeps fake decisions out of the recap.
- Sister-skill comparison test. If you also have [[claude-ai/claude-video|`/watch` (claude-video)]] installed, run both on the same local file and compare the report shapes: claude-vision returns the fixed five-section structured report; claude-video hands frames + transcript to Claude for an open-ended pass. Pick the surface that fits your usage pattern as the default; keep the other for edge cases (URLs → claude-video; structured-report grading → claude-vision).
- For WEO Marketly creative review. Install on the marketing-team workstation. Use as the QC pass on every Higgsfield-generated video before it ships to a client. The structured Audio section catches one specific failure mode the team has hit before — Higgsfield-synthesized voiceover that sounds plausible but isn’t intelligible — because the skill marks unintelligible audio as such rather than confabulating a transcript.
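For the ad-teardown bullet above, a minimal pre-fetch sketch using yt-dlp's Python API, since claude-vision only accepts local files. The URL, format selector, and output path are placeholder assumptions; any yt-dlp invocation that lands a supported file locally works.

```python
from yt_dlp import YoutubeDL

url = "https://www.tiktok.com/@brand/video/1234567890"  # placeholder URL
opts = {
    "format": "mp4",               # prefer an mp4 stream, one of the skill's supported formats
    "outtmpl": "teardown/ad.mp4",  # fixed local path to hand to /video-analyzer
}
with YoutubeDL(opts) as ydl:
    ydl.download([url])

# Then, inside Claude Code:
#   /video-analyzer teardown/ad.mp4 --prompt "Beat-by-beat teardown ..."
```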
Related
- [[claude-ai/claude-video|claude-video — `/watch` Skill (Brad Brown, yt-dlp + ffmpeg + Whisper)]] — sibling skill, same problem solved with a different architecture; comparison table above
- Higgsfield + Claude Code Ad-Agency Workflow — Mike’s tutorial that this skill powers (Steps 5 and 7 grading)
- Higgsfield MCP (Entity) — the generation surface in the same workflow
- Skill Design Patterns — the bootstrapping-ergonomics pattern (asking Claude to set its own env vars)
- Claude Code Skills Ecosystem — broader skill-library context
- Six Agentic Patterns (Sam Witteveen on Claude Design) — same self-QA-via-vision loop, different surface
- yt-dlp — pair to fetch URL-based video before passing to claude-vision (since claude-vision is local-files-only)
- AI Video Tools — topic index where the Higgsfield workflow lives
- Banned AI Patterns — the anti-hallucination guardrail aligns with WEO’s banned-pattern discipline
- Meta Ads CLI — natural pairing for the ad-teardown → upload loop
Open Questions
- No URL support. Local files only. URL workflow requires a pre-fetch via yt-dlp or claude-video. Will Mike add URL support, or is it a deliberate scope decision? Repo issues are quiet — no signal yet.
- No `--start`/`--end` slicing. Whole-video only. For long screen recordings or podcasts, this means uploading the full file each time. Open whether `--start`/`--end` (like claude-video has) would be a worthwhile addition.
- Default model lag. `gemini-3-flash-preview` is the README default; Gemini’s release cadence outpaces solo-maintained skills. Watch for new Gemini releases — the operator may want to override `--model` until the maintainer updates the default.
- Per-video Gemini cost. README calls the free tier “generous” without specifying a per-video credit estimate. For agency-scale runs (50+ videos/month) the math matters — open question worth answering with a small benchmark.
- Anti-hallucination prompt is in the script, not exposed. The default prompt with the “no inventing narrators” guardrail lives in `scripts/analyze_video.py`. To audit it precisely, read the script directly. (Worth a follow-up to extract the verbatim default prompt for the wiki.)
- No multi-video batch mode. Each call processes one video. For batch workflows (e.g., grade all 25 outputs from a 9:16 Higgsfield run), the operator must loop externally — see the sketch below. Could be a useful skill enhancement.
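As a stopgap for the missing batch mode, a hedged sketch of the external loop. Like the invocation sketch in Integration notes, it assumes `scripts/analyze_video.py` accepts a file path plus `--prompt` and prints the report to stdout; paths and the prompt are placeholders.

```python
import subprocess
from pathlib import Path

SCRIPT = Path.home() / ".claude/skills/video-analyzer/scripts/analyze_video.py"
reports = Path("reports")
reports.mkdir(exist_ok=True)

# Assumption: one call per clip, report on stdout. Check the script to confirm.
for clip in sorted(Path("higgsfield_run").glob("*.mp4")):
    result = subprocess.run(
        ["python", str(SCRIPT), str(clip),
         "--prompt", "Grade fidelity to the brand brief. Recommend ship or redo."],
        capture_output=True, text=True, check=True,
    )
    (reports / f"{clip.stem}.md").write_text(result.stdout)
    print(f"graded {clip.name}")
```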