Source: ai-research/browser-use-video-use-github-repo.md
Repo: https://github.com/browser-use/video-use · Stars: 3,151 (captured 2026-04-22) · Language: Python · Created: 2026-04-12 · Org: browser-use (same team as the browser-use web-automation library)
License: None declared in repo metadata; README states “100% open source.”

video-use is a Claude Code skill that edits video by conversation. Drop raw footage in a folder, chat with Claude Code, get final.mp4 back. It handles filler-word trimming, color grading, subtitle burning, overlay animation generation, and a self-evaluating render loop — all without presets or menus. The design insight is that the LLM never watches the video; it reads a packed phrase-level transcript and drills into short PNG timeline composites only at decision points — the same “structured DOM over screenshot” trick browser-use applies to web pages, now applied to video.

Key Takeaways

  • Skill lives under ~/.claude/skills/video-use/ via symlink from a clone. Outputs always land in <videos_dir>/edit/ — never inside the skill folder (Hard Rule 12).
  • Two-layer reading model replaces frame-by-frame analysis. Layer 1: ElevenLabs Scribe word-level transcript + diarization + audio events, packed into a ~12KB takes_packed.md. Layer 2: on-demand timeline_view PNGs (filmstrip + waveform + word labels) only at decision points.
  • Token economics are the whole point. README’s framing: naive frame-dump = 30,000 frames × 1,500 tokens = 45M tokens. video-use = 12KB text + a handful of PNGs.
  • 12 Hard Rules are non-negotiable (correctness, not taste). Examples: subtitles LAST in filter chain, 30ms audio fades at every cut, word-boundary snapping, setpts=PTS-STARTPTS+T/TB for overlays, per-segment extract + -c copy concat (not single-pass filtergraph).
  • Animation overlays via parallel sub-agents — one sub-agent per animation slot, spawned via Claude Code’s Agent tool. Sequential sub-agents are explicitly an anti-pattern.
  • Three animation backends: PIL + PNG sequence + ffmpeg (fastest), Manim (formal diagrams — vendored at skills/manim-video/), Remotion (typography, brand layouts).
  • Two shipped color grades: warm_cinematic (teal/orange split, desaturated — safe for talking heads), neutral_punch (contrast + S-curve, no hue). Plus --filter '<raw ffmpeg>' for arbitrary chains. Mental model is ASC CDL: out = (in * slope + offset) ** power (see the sketch after this list).
  • Subtitle default (bold-overlay style): 2-word UPPERCASE chunks, Helvetica 18 Bold, white-on-black-outline, MarginV=35. Tuned for fast-paced short-form / tech-launch footage (chunking sketched after this list).
  • Self-eval render loop runs timeline_view on the rendered output at every cut boundary (±1.5s window) plus first 2s / last 2s / 2–3 midpoints — catches visual jumps, audio pops, hidden subtitles, misaligned overlays. Capped at 3 passes; flags unresolved issues instead of looping forever.
  • Session memory in project.md — appended every session (Strategy / Decisions / Reasoning log / Outstanding). On startup, the skill reads project.md and summarizes the last session before asking whether to continue.
  • Word-level verbatim ASR only. No Whisper SRT/phrase mode (loses sub-second gap data). No normalized fillers (umm/uh/false starts are the editorial signal the cuts are built from).
  • Stack: Python + ffmpeg + ElevenLabs Scribe + yt-dlp (optional). Manim and Remotion installed only on first use.
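
The ASC CDL mental model above reduces to a few lines. A minimal sketch in plain Python; the slope/offset/power values are illustrative placeholders, not the skill's shipped warm_cinematic numbers.

def cdl(value, slope, offset, power):
    """ASC CDL primary transform on one normalized channel (0.0-1.0)."""
    graded = value * slope + offset
    graded = max(0.0, min(1.0, graded))  # clamp before the power term
    return graded ** power

# Hypothetical teal/orange-leaning split: warm the red channel, cool the blue.
# These numbers are placeholders, not the shipped warm_cinematic grade.
r, g, b = 0.42, 0.40, 0.38
r_out = cdl(r, slope=1.05, offset=0.01, power=0.95)
g_out = cdl(g, slope=1.00, offset=0.00, power=1.00)
b_out = cdl(b, slope=0.97, offset=0.02, power=1.05)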
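
The bold-overlay subtitle default is just as mechanical. A sketch of the 2-word UPPERCASE chunking from word-level timestamps into SRT cues; this is an assumed shape, not the skill's subtitle code. Font, outline, and MarginV would be applied at burn time (e.g. ffmpeg's subtitles filter with force_style), keeping subtitles last per the hard rules.

def srt_timestamp(seconds):
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def two_word_cues(words):
    """words: [{'text': 'hello', 'start': 0.12, 'end': 0.33}, ...] -> SRT text."""
    cues = []
    for i in range(0, len(words), 2):
        chunk = words[i:i + 2]
        cues.append((chunk[0]["start"], chunk[-1]["end"],
                     " ".join(w["text"] for w in chunk).upper()))
    lines = []
    for n, (start, end, text) in enumerate(cues, 1):
        lines += [str(n), f"{srt_timestamp(start)} --> {srt_timestamp(end)}", text, ""]
    return "\n".join(lines)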

Pipeline

Transcribe ──> Pack ──> LLM Reasons ──> EDL ──> Render ──> Self-Eval
                                                              │
                                                              └─ issue? fix + re-render (max 3)

Eight-step process: Inventory (ffprobe + transcribe_batch.py + pack_transcripts.py) → Pre-scan for verbal slips → Converse (shaped by material, no fixed checklist) → Propose strategy (4–8 sentences, wait for confirmation) → Execute (EDL via editor sub-agent, parallel animation sub-agents, per-segment grade, render.py) → Preview → Self-eval (≤3 passes) → Iterate + persist.
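
The self-eval pass has a simple shape. A pseudocode-level sketch: render_edl, find_issues, and fix_edl stand in for the skill's actual render/inspect/repair steps, and edl.cut_times / edl.duration are assumed fields; only the inspection windows and the 3-pass cap come from the description above.

def self_eval_windows(cut_times, duration, pad=1.5):
    """Windows to inspect with timeline_view: each cut boundary +/- pad,
    plus the first 2s, the last 2s, and a couple of midpoints."""
    windows = [(max(0.0, t - pad), min(duration, t + pad)) for t in cut_times]
    windows += [(0.0, 2.0), (max(0.0, duration - 2.0), duration)]
    step = duration / 4
    windows += [(i * step, i * step + 2.0) for i in (1, 2)]
    return windows

def render_with_self_eval(edl, max_passes=3):
    for _ in range(max_passes):
        out = render_edl(edl)                # hypothetical render step
        issues = find_issues(out, self_eval_windows(edl.cut_times, edl.duration))
        if not issues:
            return out, []
        edl = fix_edl(edl, issues)           # hypothetical repair, re-render next pass
    return out, issues                       # cap hit: flag unresolved issues, stop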

Helpers

  • transcribe.py <video> — single-file Scribe call, cached.
  • transcribe_batch.py <videos_dir> — 4-worker parallel transcription.
  • pack_transcripts.py --edit-dir <dir> — JSON transcripts → takes_packed.md (phrase-level, break on silence ≥ 0.5s; grouping rule sketched after this list).
  • timeline_view.py <video> <start> <end> — filmstrip + waveform PNG. Decision-point tool, not a scanner.
  • render.py <edl.json> -o <out> — per-segment extract → concat → PTS-shifted overlays → subtitles LAST.
  • grade.py <in> -o <out> — ffmpeg filter-chain color grade.
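
The grouping rule behind pack_transcripts.py, sketched assuming Scribe-style word dicts with start/end fields. Not the repo's code, just the break-on-silence idea that keeps takes_packed.md small while preserving fillers verbatim.

SILENCE_BREAK_S = 0.5

def pack_phrases(words):
    """Group word-level ASR output into phrases, breaking on gaps >= 0.5s.
    words: [{'text': 'so', 'start': 0.10, 'end': 0.25}, ...]"""
    phrases, current = [], []
    for word in words:
        if current and word["start"] - current[-1]["end"] >= SILENCE_BREAK_S:
            phrases.append(current)
            current = []
        current.append(word)
    if current:
        phrases.append(current)
    # One compact line per phrase: [start-end] text (fillers kept verbatim).
    return "\n".join(
        f"[{p[0]['start']:.2f}-{p[-1]['end']:.2f}] " + " ".join(w["text"] for w in p)
        for p in phrases
    )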

EDL format (the sub-agent’s output contract)

{
  "version": 1,
  "sources": {"C0103": "/abs/path/C0103.MP4"},
  "ranges": [
    {"source": "C0103", "start": 2.42, "end": 6.85,
     "beat": "HOOK", "quote": "...", "reason": "Cleanest delivery, stops before slip."}
  ],
  "grade": "warm_cinematic",
  "overlays": [{"file": "edit/animations/slot_1/render.mp4", "start_in_output": 0.0, "duration": 5.0}],
  "subtitles": "edit/master.srt",
  "total_duration_s": 87.4
}
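
The hard rules about per-segment extract, stream-copy concat, and 30ms audio fades imply a render shape roughly like the sketch below. This is an assumption about the approach, not the repo's render.py; paths and encode settings (libx264, CRF 18) are placeholders. Overlays (PTS-shifted with setpts) and the subtitle burn would follow the concat, with subtitles applied last.

import json
import pathlib
import subprocess

FADE_S = 0.03  # 30ms audio fade in/out at every cut

def extract_segment(src, start, end, out_path):
    """Re-encode one EDL range to a uniform intermediate, fading audio at both ends."""
    dur = end - start
    afade = f"afade=t=in:st=0:d={FADE_S},afade=t=out:st={dur - FADE_S:.3f}:d={FADE_S}"
    subprocess.run([
        "ffmpeg", "-y", "-ss", str(start), "-i", src, "-t", str(dur),
        "-af", afade, "-c:v", "libx264", "-crf", "18", "-c:a", "aac",
        str(out_path),
    ], check=True)

def concat_copy(segment_paths, out_path):
    """Stream-copy concat of the uniform intermediates (no second encode)."""
    listing = pathlib.Path("edit/concat.txt")
    listing.write_text("".join(f"file '{p.resolve()}'\n" for p in segment_paths))
    subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                    "-i", str(listing), "-c", "copy", str(out_path)], check=True)

edl = json.loads(pathlib.Path("edit/edl.json").read_text())
segments = []
for i, r in enumerate(edl["ranges"]):
    seg = pathlib.Path(f"edit/seg_{i:03d}.mp4")
    extract_segment(edl["sources"][r["source"]], r["start"], r["end"], seg)
    segments.append(seg)
concat_copy(segments, "edit/cut.mp4")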

Implementation

Tool/Service: video-use (Claude Code skill) + ElevenLabs Scribe (hosted ASR) + ffmpeg + optional yt-dlp / Manim / Remotion

Setup:

  1. git clone https://github.com/browser-use/video-use && cd video-use
  2. ln -s "$(pwd)" ~/.claude/skills/video-use
  3. pip install -e .
  4. brew install ffmpeg (required), brew install yt-dlp (optional)
  5. cp .env.example .env then set ELEVENLABS_API_KEY
  6. cd /path/to/your/videos && claude then say “edit these into a [type] video”

Cost:

  • ElevenLabs Scribe: hosted per-minute pricing (current ElevenLabs ASR rates apply; not stated in the README).
  • Claude Code: standard subscription or API usage for the orchestrating session + parallel sub-agents.
  • ffmpeg / yt-dlp / Manim / Remotion / PIL: free.
  • No proprietary cloud service beyond ElevenLabs.

Integration notes:

  • Skill outputs live in <videos_dir>/edit/. The skill directory stays clean.
  • Manim support is vendored (skills/manim-video/). Read its SKILL.md when building Manim animation slots.
  • Transcripts cached per source file; re-transcription only when the source file hash changes (immutable inputs yield reusable outputs; caching sketched after this list).
  • Each animation = one parallel sub-agent via the Agent tool — sequential sub-agents are a hard anti-pattern.
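
What hash-keyed transcript caching can look like: an assumed layout (one JSON per source, named by content hash) with a hypothetical call_scribe stand-in, not the repo's cache scheme.

import hashlib
import json
import pathlib

def file_sha256(path, chunk=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def transcribe_cached(video_path, cache_dir="edit/transcripts"):
    """Return the cached transcript if the source file is unchanged,
    otherwise transcribe once and cache the result keyed by content hash."""
    cache = pathlib.Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    key = file_sha256(video_path)
    hit = cache / f"{pathlib.Path(video_path).stem}.{key[:12]}.json"
    if hit.exists():
        return json.loads(hit.read_text())
    transcript = call_scribe(video_path)  # hypothetical ElevenLabs Scribe call
    hit.write_text(json.dumps(transcript))
    return transcript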

Design Principles

  1. Text + on-demand visuals. No frame-dumping. Transcript is the surface.
  2. Audio is primary, visuals follow. Cuts from speech boundaries and silence gaps.
  3. Ask → confirm → execute → self-eval → persist. Never cut without strategy approval.
  4. Zero assumptions about content type. Look, ask, then edit.
  5. 12 hard rules, artistic freedom elsewhere. Correctness is non-negotiable; taste isn’t.

Anti-patterns (from SKILL.md)

  • Hierarchical pre-computed tone/shot metadata — over-engineering, derive from transcript at decision time.
  • Hand-tuned moment-scoring heuristics — the LLM picks better.
  • Whisper SRT / phrase-level output — loses sub-second gap data.
  • Running Whisper locally on CPU — slow, normalizes fillers.
  • Burning subtitles into base before compositing overlays — overlays hide them (Rule 1).
  • Single-pass filtergraph when overlays exist — double encode.
  • Linear animation easing — looks robotic; always cubic.
  • Hard audio cuts — audible pops (Rule 3).
  • Typing text centered on partial-string width — text slides left during reveal (fix sketched after this list).
  • Sequential animation sub-agents — always parallel (Rule 10).
  • Editing before confirming strategy.
  • Re-transcribing cached sources.
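
The fix for the typing-text anti-pattern above is to measure the final string once and keep the left edge fixed while revealing substrings. A minimal PIL sketch; the font path is an assumption and this is not the skill's animation code.

from PIL import Image, ImageDraw, ImageFont

def typing_frames(text, size=(1280, 200), font_path="Helvetica.ttc", font_size=48):
    """Reveal `text` one character per frame without the left-slide artifact:
    the anchor x comes from the FULL string's width, measured once, and every
    partial string is drawn from that same fixed left edge."""
    font = ImageFont.truetype(font_path, font_size)
    probe = ImageDraw.Draw(Image.new("RGB", size))
    full_width = probe.textlength(text, font=font)
    left_x = (size[0] - full_width) / 2  # fixed; never recomputed per frame
    frames = []
    for i in range(1, len(text) + 1):
        frame = Image.new("RGBA", size, (0, 0, 0, 0))
        ImageDraw.Draw(frame).text((left_x, size[1] / 2), text[:i],
                                   font=font, fill="white", anchor="lm")
        frames.append(frame)
    return frames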

Try It

  • Smoke test on an existing clip. Clone the repo, symlink into ~/.claude/skills/, set the ElevenLabs key, point claude at a folder with 1–2 raw takes, say “edit these into a 30s demo.” First run forces the full pipeline and produces a takes_packed.md you can review before approving the strategy.
  • Multi-take recording for a WEO script. Record 5–10 takes of each beat of a [[ai-video-content/_index|OmniPresence]]-style script directly into a raw/ folder. Let video-use pick the best take per beat via the editor sub-agent brief. Compare against the current OmniPresence single-take-then-manual-edit workflow for speed.
  • Drop-in cleanup of existing OmniPresence / Avatar V output. Feed finished [[ai-video-content/heygen-avatar-v|Avatar V]] MP4s plus any B-roll to video-use and have it produce a montage — tests whether video-use adds polish on top of an avatar pipeline, not just replaces it.
  • Benchmark against [[ai-video-content/heygen-studio-automation|HeyGen Studio Automation]]. HeyGen Studio Automation generates from scripts; video-use edits existing footage. Running both on the same dental-marketing script (generate with HeyGen, polish with video-use) may produce higher-fidelity output than either alone.
  • Try a Manim explainer slot. For a dental-marketing explainer, commission one Manim animation slot (e.g., how Google Search prioritizes local dental sites) via video-use’s parallel sub-agent pattern. Gives a reusable asset plus a test of the Manim backend.

Related

  • HeyGen Studio Automation with Claude Code — complementary: HeyGen Studio generates videos from scripts (ElevenLabs → HeyGen Avatar V → Remotion). video-use edits existing footage. Both orchestrated by Claude Code; both use ElevenLabs. Stacking them = script-to-avatar-to-polished-cut.
  • Claude Code Video Toolkit (Digital Samba) — 10-skill + 13-command production workspace using open-source models on Modal/RunPod. Overlapping goal (Claude Code owns video production) with different tradeoff: Video Toolkit ships a full OSS model stack for generation; video-use focuses narrowly on editing-by-conversation with hosted ASR as the only paid dependency.
  • Remotion Motion Graphics — Remotion is one of video-use’s three animation backends (alongside PIL and Manim). Useful when brand-aligned typography matters.
  • HeyGen Hyperframes — another HTML-based composition framework for Claude Code / Cursor / Codex. Different shape (Hyperframes = deterministic HTML composition; video-use = non-linear editing of recorded footage) but overlapping audience.
  • HeyGen Avatar V — natural upstream input. Avatar V produces unlimited-duration talking-head footage; video-use is the natural next step for trimming, grading, and subtitling that footage.
  • Building Skills for Claude — video-use is a canonical example of the Anthropic Skill format: SKILL.md at root with name / description frontmatter, progressive disclosure into helpers, vendored sub-skill (skills/manim-video/).
  • Skill Design Patterns — video-use illustrates several patterns: progressive-disclosure helpers, sub-skill vendoring, text-first context economy, self-evaluation loop with bounded retries.
  • Claude Code Subagents — video-use’s parallel-sub-agent pattern for animations is a concrete production case of subagent orchestration.

Open Questions

  • License. README says “100% open source” but the GitHub repo has no LICENSE file and no license field in metadata. Worth confirming before shipping derivative work.
  • Scribe cost per hour of source footage. README doesn’t state current ElevenLabs Scribe per-minute rates. For a WEO dental-marketing project with 10+ takes of 30-minute recordings, this is the cost-dominating line item.
  • Resolution / codec defaults for render.py. README documents “1080p default scale,” but the full extract/encode recipe (bitrate, CRF, pix_fmt) would need to be read from the helpers themselves.
  • How video-use composes with [[ai-video-content/heygen-avatar-v|Avatar V]] in practice. Avatar V output is already edited (no fillers, no retakes). The editing value from video-use on Avatar V output is grading + subtitles + overlays, not cutting. Hands-on trial would confirm whether that’s worth the plumbing.
  • Behavior on vertical / social aspect ratios. Output spec lists 1080×1920@30 as supported but README-level examples are all landscape. Worth testing for short-form dental Reels / TikToks.
  • Comparison to traditional NLE-style AI editors (e.g., Descript, Adobe Firefly Video). video-use is conversational + Claude Code-native; most AI-editor competitors are GUI-first. Tradeoff profile hasn’t been benchmarked.