Source: ai-research/browser-use-video-use-github-repo.md
Repo: https://github.com/browser-use/video-use · Stars: 3,151 (captured 2026-04-22) · Language: Python · Created: 2026-04-12 · Org: browser-use (same team as the browser-use web-automation library)
License: None declared in repo metadata; README states “100% open source.”

video-use is a Claude Code skill that edits video by conversation. Drop raw footage in a folder, chat with Claude Code, get final.mp4 back. It handles filler-word trimming, color grading, subtitle burning, overlay animation generation, and a self-evaluating render loop — all without presets or menus. The design insight is that the LLM never watches the video; it reads a packed phrase-level transcript and drills into short PNG timeline composites only at decision points — the same “structured DOM over screenshot” trick browser-use applies to web pages, now applied to video.

Key Takeaways

  • Skill lives under ~/.claude/skills/video-use/ via symlink from a clone. Outputs always land in <videos_dir>/edit/ — never inside the skill folder (Hard Rule 12).
  • Two-layer reading model replaces frame-by-frame analysis. Layer 1: ElevenLabs Scribe word-level transcript + diarization + audio events, packed into a ~12KB takes_packed.md. Layer 2: on-demand timeline_view PNGs (filmstrip + waveform + word labels) only at decision points.
  • Token economics are the whole point. README’s framing: naive frame-dump = 30,000 frames × 1,500 tokens = 45M tokens. video-use = 12KB text + a handful of PNGs.
  • 12 Hard Rules are non-negotiable (correctness, not taste). Examples: subtitles LAST in filter chain, 30ms audio fades at every cut, word-boundary snapping, setpts=PTS-STARTPTS+T/TB for overlays, per-segment extract + -c copy concat (not single-pass filtergraph).
  • Animation overlays via parallel sub-agents — one sub-agent per animation slot, spawned via Claude Code’s Agent tool. Sequential sub-agents are explicitly an anti-pattern.
  • Three animation backends: PIL + PNG sequence + ffmpeg (fastest), Manim (formal diagrams — vendored at skills/manim-video/), Remotion (typography, brand layouts).
  • Two shipped color grades: warm_cinematic (teal/orange split, desaturated — safe for talking heads), neutral_punch (contrast + S-curve, no hue). Plus --filter '<raw ffmpeg>' for arbitrary chains. Mental model is ASC CDL: out = (in * slope + offset) ** power (see the sketch after this list).
  • Subtitle default (bold-overlay style): 2-word UPPERCASE chunks, Helvetica 18 Bold, white-on-black-outline, MarginV=35. Tuned for fast-paced short-form / tech-launch footage (chunking sketched after this list).
  • Self-eval render loop runs timeline_view on the rendered output at every cut boundary (±1.5s window) plus first 2s / last 2s / 2–3 midpoints — catches visual jumps, audio pops, hidden subtitles, misaligned overlays. Capped at 3 passes; flags unresolved issues instead of looping forever.
  • Session memory in project.md — appended every session (Strategy / Decisions / Reasoning log / Outstanding). On startup, the skill reads project.md and summarizes the last session before asking whether to continue.
  • Word-level verbatim ASR only. No Whisper SRT/phrase mode (loses sub-second gap data). No normalized fillers (umm/uh/false starts are the editorial signal the cuts are built from).
  • Stack: Python + ffmpeg + ElevenLabs Scribe + yt-dlp (optional). Manim and Remotion installed only on first use.
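
The ASC CDL mental model above reduces to a few lines. A minimal sketch in plain Python; the slope/offset/power values are illustrative placeholders, not the skill's shipped warm_cinematic numbers.

def cdl(value, slope, offset, power):
    """ASC CDL primary transform on one normalized channel (0.0-1.0)."""
    graded = value * slope + offset
    graded = max(0.0, min(1.0, graded))  # clamp before the power term
    return graded ** power

# Hypothetical teal/orange-leaning split: warm the red channel, cool the blue.
# These numbers are placeholders, not the shipped warm_cinematic grade.
r, g, b = 0.42, 0.40, 0.38
r_out = cdl(r, slope=1.05, offset=0.01, power=0.95)
g_out = cdl(g, slope=1.00, offset=0.00, power=1.00)
b_out = cdl(b, slope=0.97, offset=0.02, power=1.05)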
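
The bold-overlay subtitle default is just as mechanical. A sketch of the 2-word UPPERCASE chunking from word-level timestamps into SRT cues; this is an assumed shape, not the skill's subtitle code. Font, outline, and MarginV would be applied at burn time (e.g. ffmpeg's subtitles filter with force_style), keeping subtitles last per the hard rules.

def srt_timestamp(seconds):
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def two_word_cues(words):
    """words: [{'text': 'hello', 'start': 0.12, 'end': 0.33}, ...] -> SRT text."""
    cues = []
    for i in range(0, len(words), 2):
        chunk = words[i:i + 2]
        cues.append((chunk[0]["start"], chunk[-1]["end"],
                     " ".join(w["text"] for w in chunk).upper()))
    lines = []
    for n, (start, end, text) in enumerate(cues, 1):
        lines += [str(n), f"{srt_timestamp(start)} --> {srt_timestamp(end)}", text, ""]
    return "\n".join(lines)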

Pipeline

Transcribe ──> Pack ──> LLM Reasons ──> EDL ──> Render ──> Self-Eval
                                                              │
                                                              └─ issue? fix + re-render (max 3)

Eight-step process: Inventory (ffprobe + transcribe_batch.py + pack_transcripts.py) → Pre-scan for verbal slips → Converse (shaped by material, no fixed checklist) → Propose strategy (4–8 sentences, wait for confirmation) → Execute (EDL via editor sub-agent, parallel animation sub-agents, per-segment grade, render.py) → Preview → Self-eval (≤3 passes) → Iterate + persist.
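
The self-eval pass has a simple shape. A pseudocode-level sketch: render_edl, find_issues, and fix_edl stand in for the skill's actual render/inspect/repair steps, and edl.cut_times / edl.duration are assumed fields; only the inspection windows and the 3-pass cap come from the description above.

def self_eval_windows(cut_times, duration, pad=1.5):
    """Windows to inspect with timeline_view: each cut boundary +/- pad,
    plus the first 2s, the last 2s, and a couple of midpoints."""
    windows = [(max(0.0, t - pad), min(duration, t + pad)) for t in cut_times]
    windows += [(0.0, 2.0), (max(0.0, duration - 2.0), duration)]
    step = duration / 4
    windows += [(i * step, i * step + 2.0) for i in (1, 2)]
    return windows

def render_with_self_eval(edl, max_passes=3):
    for _ in range(max_passes):
        out = render_edl(edl)                # hypothetical render step
        issues = find_issues(out, self_eval_windows(edl.cut_times, edl.duration))
        if not issues:
            return out, []
        edl = fix_edl(edl, issues)           # hypothetical repair, re-render next pass
    return out, issues                       # cap hit: flag unresolved issues, stop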

Helpers

  • transcribe.py <video> — single-file Scribe call, cached.
  • transcribe_batch.py <videos_dir> — 4-worker parallel transcription.
  • pack_transcripts.py --edit-dir <dir> — JSON transcripts → takes_packed.md (phrase-level, break on silence ≥ 0.5s; grouping rule sketched after this list).
  • timeline_view.py <video> <start> <end> — filmstrip + waveform PNG. Decision-point tool, not a scanner.
  • render.py <edl.json> -o <out> — per-segment extract → concat → PTS-shifted overlays → subtitles LAST.
  • grade.py <in> -o <out> — ffmpeg filter-chain color grade.
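
The grouping rule behind pack_transcripts.py, sketched assuming Scribe-style word dicts with start/end fields. Not the repo's code, just the break-on-silence idea that keeps takes_packed.md small while preserving fillers verbatim.

SILENCE_BREAK_S = 0.5

def pack_phrases(words):
    """Group word-level ASR output into phrases, breaking on gaps >= 0.5s.
    words: [{'text': 'so', 'start': 0.10, 'end': 0.25}, ...]"""
    phrases, current = [], []
    for word in words:
        if current and word["start"] - current[-1]["end"] >= SILENCE_BREAK_S:
            phrases.append(current)
            current = []
        current.append(word)
    if current:
        phrases.append(current)
    # One compact line per phrase: [start-end] text (fillers kept verbatim).
    return "\n".join(
        f"[{p[0]['start']:.2f}-{p[-1]['end']:.2f}] " + " ".join(w["text"] for w in p)
        for p in phrases
    )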

EDL format (the sub-agent’s output contract)

{
  "version": 1,
  "sources": {"C0103": "/abs/path/C0103.MP4"},
  "ranges": [
    {"source": "C0103", "start": 2.42, "end": 6.85,
     "beat": "HOOK", "quote": "...", "reason": "Cleanest delivery, stops before slip."}
  ],
  "grade": "warm_cinematic",
  "overlays": [{"file": "edit/animations/slot_1/render.mp4", "start_in_output": 0.0, "duration": 5.0}],
  "subtitles": "edit/master.srt",
  "total_duration_s": 87.4
}
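
The hard rules about per-segment extract, stream-copy concat, and 30ms audio fades imply a render shape roughly like the sketch below. This is an assumption about the approach, not the repo's render.py; paths and encode settings (libx264, CRF 18) are placeholders. Overlays (PTS-shifted with setpts) and the subtitle burn would follow the concat, with subtitles applied last.

import json
import pathlib
import subprocess

FADE_S = 0.03  # 30ms audio fade in/out at every cut

def extract_segment(src, start, end, out_path):
    """Re-encode one EDL range to a uniform intermediate, fading audio at both ends."""
    dur = end - start
    afade = f"afade=t=in:st=0:d={FADE_S},afade=t=out:st={dur - FADE_S:.3f}:d={FADE_S}"
    subprocess.run([
        "ffmpeg", "-y", "-ss", str(start), "-i", src, "-t", str(dur),
        "-af", afade, "-c:v", "libx264", "-crf", "18", "-c:a", "aac",
        str(out_path),
    ], check=True)

def concat_copy(segment_paths, out_path):
    """Stream-copy concat of the uniform intermediates (no second encode)."""
    listing = pathlib.Path("edit/concat.txt")
    listing.write_text("".join(f"file '{p.resolve()}'\n" for p in segment_paths))
    subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                    "-i", str(listing), "-c", "copy", str(out_path)], check=True)

edl = json.loads(pathlib.Path("edit/edl.json").read_text())
segments = []
for i, r in enumerate(edl["ranges"]):
    seg = pathlib.Path(f"edit/seg_{i:03d}.mp4")
    extract_segment(edl["sources"][r["source"]], r["start"], r["end"], seg)
    segments.append(seg)
concat_copy(segments, "edit/cut.mp4")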

Implementation

Tool/Service: video-use (Claude Code skill) + ElevenLabs Scribe (hosted ASR) + ffmpeg + optional yt-dlp / Manim / Remotion

Setup:

  1. git clone https://github.com/browser-use/video-use && cd video-use
  2. ln -s "$(pwd)" ~/.claude/skills/video-use
  3. pip install -e .
  4. brew install ffmpeg (required), brew install yt-dlp (optional)
  5. cp .env.example .env then set ELEVENLABS_API_KEY
  6. cd /path/to/your/videos && claude then say “edit these into a [type] video”

Cost:

  • ElevenLabs Scribe: hosted per-minute pricing (current ElevenLabs ASR rates apply; not stated in the README).
  • Claude Code: standard subscription or API usage for the orchestrating session + parallel sub-agents.
  • ffmpeg / yt-dlp / Manim / Remotion / PIL: free.
  • No proprietary cloud service beyond ElevenLabs.

Integration notes:

  • Skill outputs live in <videos_dir>/edit/. The skill directory stays clean.
  • Manim support is vendored (skills/manim-video/). Read its SKILL.md when building Manim animation slots.
  • Transcripts cached per source file; re-transcription only when the source file hash changes (immutable inputs yield reusable outputs; caching sketched after this list).
  • Each animation = one parallel sub-agent via the Agent tool — sequential sub-agents are a hard anti-pattern.
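
What hash-keyed transcript caching can look like: an assumed layout (one JSON per source, named by content hash) with a hypothetical call_scribe stand-in, not the repo's cache scheme.

import hashlib
import json
import pathlib

def file_sha256(path, chunk=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def transcribe_cached(video_path, cache_dir="edit/transcripts"):
    """Return the cached transcript if the source file is unchanged,
    otherwise transcribe once and cache the result keyed by content hash."""
    cache = pathlib.Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    key = file_sha256(video_path)
    hit = cache / f"{pathlib.Path(video_path).stem}.{key[:12]}.json"
    if hit.exists():
        return json.loads(hit.read_text())
    transcript = call_scribe(video_path)  # hypothetical ElevenLabs Scribe call
    hit.write_text(json.dumps(transcript))
    return transcript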

Design Principles

  1. Text + on-demand visuals. No frame-dumping. Transcript is the surface.
  2. Audio is primary, visuals follow. Cuts from speech boundaries and silence gaps.
  3. Ask → confirm → execute → self-eval → persist. Never cut without strategy approval.
  4. Zero assumptions about content type. Look, ask, then edit.
  5. 12 hard rules, artistic freedom elsewhere. Correctness is non-negotiable; taste isn’t.

Anti-patterns (from SKILL.md)

  • Hierarchical pre-computed tone/shot metadata — over-engineering, derive from transcript at decision time.
  • Hand-tuned moment-scoring heuristics — the LLM picks better.
  • Whisper SRT / phrase-level output — loses sub-second gap data.
  • Running Whisper locally on CPU — slow, normalizes fillers.
  • Burning subtitles into base before compositing overlays — overlays hide them (Rule 1).
  • Single-pass filtergraph when overlays exist — double encode.
  • Linear animation easing — looks robotic; always cubic.
  • Hard audio cuts — audible pops (Rule 3).
  • Typing text centered on partial-string width — text slides left during reveal (fix sketched after this list).
  • Sequential animation sub-agents — always parallel (Rule 10).
  • Editing before confirming strategy.
  • Re-transcribing cached sources.
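
The fix for the typing-text anti-pattern above is to measure the final string once and keep the left edge fixed while revealing substrings. A minimal PIL sketch; the font path is an assumption and this is not the skill's animation code.

from PIL import Image, ImageDraw, ImageFont

def typing_frames(text, size=(1280, 200), font_path="Helvetica.ttc", font_size=48):
    """Reveal `text` one character per frame without the left-slide artifact:
    the anchor x comes from the FULL string's width, measured once, and every
    partial string is drawn from that same fixed left edge."""
    font = ImageFont.truetype(font_path, font_size)
    probe = ImageDraw.Draw(Image.new("RGB", size))
    full_width = probe.textlength(text, font=font)
    left_x = (size[0] - full_width) / 2  # fixed; never recomputed per frame
    frames = []
    for i in range(1, len(text) + 1):
        frame = Image.new("RGBA", size, (0, 0, 0, 0))
        ImageDraw.Draw(frame).text((left_x, size[1] / 2), text[:i],
                                   font=font, fill="white", anchor="lm")
        frames.append(frame)
    return frames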

Try It

  • Smoke test on an existing clip. Clone the repo, symlink into ~/.claude/skills/, set the ElevenLabs key, point claude at a folder with 1–2 raw takes, say “edit these into a 30s demo.” First run forces the full pipeline and produces a takes_packed.md you can review before approving the strategy.
  • Multi-take recording for a WEO script. Record 5–10 takes of each beat of a [[ai-video-content/_index|OmniPresence]]-style script directly into a raw/ folder. Let video-use pick the best take per beat via the editor sub-agent brief. Compare against the current OmniPresence single-take-then-manual-edit workflow for speed.
  • Drop-in cleanup of existing OmniPresence / Avatar V output. Feed finished [[ai-video-content/heygen-avatar-v|Avatar V]] MP4s plus any B-roll to video-use and have it produce a montage — tests whether video-use adds polish on top of an avatar pipeline, not just replaces it.
  • Benchmark against [[ai-video-content/heygen-studio-automation|HeyGen Studio Automation]]. HeyGen Studio Automation generates from scripts; video-use edits existing footage. Running both on the same dental-marketing script (generate with HeyGen, polish with video-use) may produce higher-fidelity output than either alone.
  • Try a Manim explainer slot. For a dental-marketing explainer, commission one Manim animation slot (e.g., how Google Search prioritizes local dental sites) via video-use’s parallel sub-agent pattern. Gives a reusable asset plus a test of the Manim backend.

Related

  • HeyGen Studio Automation with Claude Code — complementary: HeyGen Studio generates videos from scripts (ElevenLabs → HeyGen Avatar V → Remotion). video-use edits existing footage. Both orchestrated by Claude Code; both use ElevenLabs. Stacking them = script-to-avatar-to-polished-cut.
  • Claude Code Video Toolkit (Digital Samba) — 10-skill + 13-command production workspace using open-source models on Modal/RunPod. Overlapping goal (Claude Code owns video production) with different tradeoff: Video Toolkit ships a full OSS model stack for generation; video-use focuses narrowly on editing-by-conversation with hosted ASR as the only paid dependency.
  • Remotion Motion Graphics — Remotion is one of video-use’s three animation backends (alongside PIL and Manim). Useful when brand-aligned typography matters.
  • HeyGen Hyperframes — another HTML-based composition framework for Claude Code / Cursor / Codex. Different shape (Hyperframes = deterministic HTML composition; video-use = non-linear editing of recorded footage) but overlapping audience.
  • HeyGen Avatar V — natural upstream input. Avatar V produces unlimited-duration talking-head footage; video-use is the natural next step for trimming, grading, and subtitling that footage.
  • Building Skills for Claude — video-use is a canonical example of the Anthropic Skill format: SKILL.md at root with name / description frontmatter, progressive disclosure into helpers, vendored sub-skill (skills/manim-video/).
  • Skill Design Patterns — video-use illustrates several patterns: progressive-disclosure helpers, sub-skill vendoring, text-first context economy, self-evaluation loop with bounded retries.
  • Claude Code Subagents — video-use’s parallel-sub-agent pattern for animations is a concrete production case of subagent orchestration.

Open Questions

  • License. README says “100% open source” but the GitHub repo has no LICENSE file and no license field in metadata. Worth confirming before shipping derivative work.
  • Scribe cost per hour of source footage. README doesn’t state current ElevenLabs Scribe per-minute rates. For a WEO dental-marketing project with 10+ takes of 30-minute recordings, this is the cost-dominating line item.
  • Resolution / codec defaults for render.py. README documents “1080p default scale,” but the full extract/encode recipe (bitrate, CRF, pix_fmt) would need to be read from the helpers themselves.
  • How video-use composes with [[ai-video-content/heygen-avatar-v|Avatar V]] in practice. Avatar V output is already edited (no fillers, no retakes). The editing value from video-use on Avatar V output is grading + subtitles + overlays, not cutting. Hands-on trial would confirm whether that’s worth the plumbing.
  • Behavior on vertical / social aspect ratios. Output spec lists 1080×1920@30 as supported but README-level examples are all landscape. Worth testing for short-form dental Reels / TikToks.
  • Comparison to traditional NLE-style AI editors (e.g., Descript, Adobe Firefly Video). video-use is conversational + Claude Code-native; most AI-editor competitors are GUI-first. Tradeoff profile hasn’t been benchmarked.