Source: ai-research/browser-use-video-use-github-repo.md
Repo: https://github.com/browser-use/video-use
Stars: 3,151 (captured 2026-04-22) · Language: Python · Created: 2026-04-12 · Org: browser-use (same team as the browser-use web-automation library)
License: none declared in repo metadata; README states “100% open source.”
video-use is a Claude Code skill that edits video by conversation. Drop raw footage in a folder, chat with Claude Code, get final.mp4 back. It handles filler-word trimming, color grading, subtitle burning, overlay animation generation, and a self-evaluating render loop — all without presets or menus. The design insight is that the LLM never watches the video; it reads a packed phrase-level transcript and drills into short PNG timeline composites only at decision points — the same “structured DOM over screenshot” trick browser-use applies to web pages, now applied to video.
## Key Takeaways
- Skill lives under `~/.claude/skills/video-use/` via symlink from a clone. Outputs always land in `<videos_dir>/edit/` — never inside the skill folder (Hard Rule 12).
- Two-layer reading model replaces frame-by-frame analysis. Layer 1: ElevenLabs Scribe word-level transcript + diarization + audio events, packed into a ~12KB `takes_packed.md`. Layer 2: on-demand `timeline_view` PNGs (filmstrip + waveform + word labels) only at decision points.
- Token economics are the whole point. README’s framing: naive frame-dump = 30,000 frames × 1,500 tokens = 45M tokens. video-use = 12KB text + a handful of PNGs.
- 12 Hard Rules are non-negotiable (correctness, not taste). Examples: subtitles LAST in filter chain, 30ms audio fades at every cut, word-boundary snapping, `setpts=PTS-STARTPTS+T/TB` for overlays, per-segment extract + `-c copy` concat (not single-pass filtergraph).
- Animation overlays via parallel sub-agents — one sub-agent per animation slot, spawned via Claude Code’s `Agent` tool. Sequential sub-agents are explicitly an anti-pattern.
- Three animation backends: PIL + PNG sequence + ffmpeg (fastest), Manim (formal diagrams — vendored at `skills/manim-video/`), Remotion (typography, brand layouts).
- Two shipped color grades: `warm_cinematic` (teal/orange split, desaturated — safe for talking heads), `neutral_punch` (contrast + S-curve, no hue shift). Plus `--filter '<raw ffmpeg>'` for arbitrary chains. Mental model is ASC CDL: `out = (in * slope + offset) ** power`.
- Subtitle default (`bold-overlay` style): 2-word UPPERCASE chunks, Helvetica 18 Bold, white-on-black-outline, `MarginV=35`. Tuned for fast-paced short-form / tech-launch footage.
- Self-eval render loop runs `timeline_view` on the rendered output at every cut boundary (±1.5s window) plus first 2s / last 2s / 2–3 midpoints — catches visual jumps, audio pops, hidden subtitles, misaligned overlays. Capped at 3 passes; flags unresolved issues instead of looping forever.
- Session memory in `project.md` — appended every session (Strategy / Decisions / Reasoning log / Outstanding). On startup, the skill reads `project.md` and summarizes the last session before asking whether to continue.
- Word-level verbatim ASR only. No Whisper SRT/phrase mode (loses sub-second gap data). No normalized fillers (loses editorial signal — *umm*/*uh*/false starts are the editing signal).
- Stack: Python + ffmpeg + ElevenLabs Scribe + yt-dlp (optional). Manim and Remotion installed only on first use.
## Pipeline

```
Transcribe ──> Pack ──> LLM Reasons ──> EDL ──> Render ──> Self-Eval
                                                  ▲             │
                                                  └─ issue? fix + re-render (max 3)
```
Eight-step process: Inventory (ffprobe + transcribe_batch.py + pack_transcripts.py) → Pre-scan for verbal slips → Converse (shaped by material, no fixed checklist) → Propose strategy (4–8 sentences, wait for confirmation) → Execute (EDL via editor sub-agent, parallel animation sub-agents, per-segment grade, render.py) → Preview → Self-eval (≤3 passes) → Iterate + persist.
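The capped self-eval at the end of the pipeline amounts to a bounded retry loop: render, inspect, patch, and give up gracefully after three passes. A schematic sketch — the function names (`render`, `evaluate`, `fix`) are placeholders, not the repo’s actual helpers:

```python
MAX_PASSES = 3

def render_with_self_eval(edl, render, evaluate, fix):
    """Render, inspect the output, and re-render at most MAX_PASSES times.

    `render(edl)` returns an output path, `evaluate(output)` returns a list
    of issues (empty = clean), `fix(edl, issues)` returns a patched EDL.
    Unresolved issues are returned to the caller instead of looping forever.
    """
    for _ in range(MAX_PASSES):
        output = render(edl)
        issues = evaluate(output)
        if not issues:
            return output, []
        edl = fix(edl, issues)
    return output, issues  # flag what's left rather than retry again
```

The key design choice is the second return value: an imperfect render plus a list of flagged issues is more useful than an infinite loop chasing a perfect one.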
## Helpers

- `transcribe.py <video>` — single-file Scribe call, cached.
- `transcribe_batch.py <videos_dir>` — 4-worker parallel transcription.
- `pack_transcripts.py --edit-dir <dir>` — JSON transcripts → `takes_packed.md` (phrase-level, break on silence ≥ 0.5s).
- `timeline_view.py <video> <start> <end>` — filmstrip + waveform PNG. Decision-point tool, not a scanner.
- `render.py <edl.json> -o <out>` — per-segment extract → concat → PTS-shifted overlays → subtitles LAST.
- `grade.py <in> -o <out>` — ffmpeg filter-chain color grade.
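The packing step’s phrase segmentation (break on silence ≥ 0.5s) can be sketched from word-level timestamps. This is an illustrative reconstruction under that one rule, not the repo’s actual `pack_transcripts.py`:

```python
SILENCE_BREAK_S = 0.5

def pack_phrases(words):
    """Group word-level ASR entries into phrases, splitting whenever the
    gap between one word's end and the next word's start is >= 0.5s.

    Each word is a dict: {"text": str, "start": float, "end": float}.
    Returns a list of (start, end, text) phrase tuples.
    """
    phrases, current = [], []
    for word in words:
        if current and word["start"] - current[-1]["end"] >= SILENCE_BREAK_S:
            phrases.append((current[0]["start"], current[-1]["end"],
                            " ".join(w["text"] for w in current)))
            current = []
        current.append(word)
    if current:  # flush the trailing phrase
        phrases.append((current[0]["start"], current[-1]["end"],
                        " ".join(w["text"] for w in current)))
    return phrases
```

This is also why the note rejects Whisper’s phrase-level SRT output: once words are pre-grouped, the sub-second gaps this function keys on are gone.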
## EDL format (the sub-agent’s output contract)
```json
{
  "version": 1,
  "sources": {"C0103": "/abs/path/C0103.MP4"},
  "ranges": [
    {"source": "C0103", "start": 2.42, "end": 6.85,
     "beat": "HOOK", "quote": "...", "reason": "Cleanest delivery, stops before slip."}
  ],
  "grade": "warm_cinematic",
  "overlays": [{"file": "edit/animations/slot_1/render.mp4", "start_in_output": 0.0, "duration": 5.0}],
  "subtitles": "edit/master.srt",
  "total_duration_s": 87.4
}
```

## Implementation
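Before handing an EDL to the renderer, the contract above can be sanity-checked in a few lines. A hypothetical validator (not shipped with the repo) covering the fields the note documents:

```python
def validate_edl(edl: dict) -> list[str]:
    """Return a list of problems with an EDL dict; empty means it looks sane."""
    problems = []
    if edl.get("version") != 1:
        problems.append("unsupported version")
    sources = edl.get("sources", {})
    for i, r in enumerate(edl.get("ranges", [])):
        if r["source"] not in sources:
            problems.append(f"range {i}: unknown source {r['source']!r}")
        if not r["start"] < r["end"]:
            problems.append(f"range {i}: start must be < end")
    # total_duration_s should roughly equal the sum of the kept ranges
    cut_total = sum(r["end"] - r["start"] for r in edl.get("ranges", []))
    declared = edl.get("total_duration_s")
    if declared is not None and abs(cut_total - declared) > 1.0:
        problems.append(f"total_duration_s {declared} != sum of ranges {cut_total:.2f}")
    return problems
```

Catching a bad range here is much cheaper than discovering it as a visual jump in the self-eval pass after a full render.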
Tool/Service: video-use (Claude Code skill) + ElevenLabs Scribe (hosted ASR) + ffmpeg + optional yt-dlp / Manim / Remotion
Setup:

```shell
git clone https://github.com/browser-use/video-use && cd video-use
ln -s "$(pwd)" ~/.claude/skills/video-use
pip install -e .
brew install ffmpeg            # required
brew install yt-dlp            # optional
cp .env.example .env           # then set ELEVENLABS_API_KEY
cd /path/to/your/videos && claude   # then say “edit these into a [type] video”
```
Cost:
- ElevenLabs Scribe: hosted per-minute pricing (current ElevenLabs ASR rates apply; not stated in the README).
- Claude Code: standard subscription or API usage for the orchestrating session + parallel sub-agents.
- ffmpeg / yt-dlp / Manim / Remotion / PIL: free.
- No proprietary cloud service beyond ElevenLabs.
Integration notes:
- Skill outputs live in `<videos_dir>/edit/`. The skill directory stays clean.
- Manim support is vendored (`skills/manim-video/`). Read its `SKILL.md` when building Manim animation slots.
- Transcripts cached per source file; re-transcription only when the source file hash changes (immutable outputs of immutable inputs).
- Each animation = one parallel sub-agent via the `Agent` tool — sequential sub-agents are a hard anti-pattern.
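The “immutable outputs of immutable inputs” caching rule amounts to keying the cached transcript by a content hash of the source file. A hypothetical sketch of that idea, not the repo’s code (a real implementation might hash only file size + mtime to avoid reading multi-GB videos):

```python
import hashlib
import json
from pathlib import Path

def transcript_cache_path(source: Path, cache_dir: Path) -> Path:
    """Key the cached transcript by the source file's content hash, so a
    re-export (new bytes) re-transcribes and an untouched file never does."""
    digest = hashlib.sha256(source.read_bytes()).hexdigest()[:16]
    return cache_dir / f"{source.stem}.{digest}.json"

def get_transcript(source: Path, cache_dir: Path, transcribe) -> dict:
    cached = transcript_cache_path(source, cache_dir)
    if cached.exists():
        return json.loads(cached.read_text())
    result = transcribe(source)  # e.g. a hosted Scribe API call
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached.write_text(json.dumps(result))
    return result
```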
## Design Principles
- Text + on-demand visuals. No frame-dumping. Transcript is the surface.
- Audio is primary, visuals follow. Cuts from speech boundaries and silence gaps.
- Ask → confirm → execute → self-eval → persist. Never cut without strategy approval.
- Zero assumptions about content type. Look, ask, then edit.
- 12 hard rules, artistic freedom elsewhere. Correctness is non-negotiable; taste isn’t.
## Anti-patterns (from `SKILL.md`)
- Hierarchical pre-computed tone/shot metadata — over-engineering, derive from transcript at decision time.
- Hand-tuned moment-scoring heuristics — the LLM picks better.
- Whisper SRT / phrase-level output — loses sub-second gap data.
- Running Whisper locally on CPU — slow, normalizes fillers.
- Burning subtitles into base before compositing overlays — overlays hide them (Rule 1).
- Single-pass filtergraph when overlays exist — double encode.
- Linear animation easing — looks robotic; always cubic.
- Hard audio cuts — audible pops (Rule 3).
- Typing text centered on partial-string width — text slides left during reveal.
- Sequential animation sub-agents — always parallel (Rule 10).
- Editing before confirming strategy.
- Re-transcribing cached sources.
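The easing anti-pattern is easy to show in code. A minimal sketch of cubic ease-out versus the linear function the skill forbids (illustrative, not the repo’s animation code):

```python
def ease_out_cubic(t: float) -> float:
    """Cubic ease-out: fast start, gentle settle. `t` is progress in [0, 1]."""
    return 1.0 - (1.0 - t) ** 3

def linear(t: float) -> float:
    return t  # the anti-pattern: constant velocity reads as robotic

# Near the end of the animation, cubic has nearly settled while linear
# is still travelling at full speed — that abrupt stop is the "robotic" look.
# Similarly, a typing reveal should anchor text at a fixed x computed from
# the FULL string's width, not re-center each partial string, or the text
# slides left as characters appear.
```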
## Try It
- Smoke test on an existing clip. Clone the repo, symlink into `~/.claude/skills/`, set the ElevenLabs key, point `claude` at a folder with 1–2 raw takes, say “edit these into a 30s demo.” First run forces the full pipeline and produces a `takes_packed.md` you can review before approving the strategy.
- Multi-take recording for a WEO script. Record 5–10 takes of each beat of an [[ai-video-content/_index|OmniPresence]]-style script directly into a `raw/` folder. Let video-use pick the best take per beat via the editor sub-agent brief. Compare against the current OmniPresence single-take-then-manual-edit workflow for speed.
- Drop-in cleanup of existing OmniPresence / Avatar V output. Feed finished [[ai-video-content/heygen-avatar-v|Avatar V]] MP4s plus any B-roll to video-use and have it produce a montage — tests whether video-use adds polish on top of an avatar pipeline, not just replaces it.
- Benchmark against [[ai-video-content/heygen-studio-automation|HeyGen Studio Automation]]. HeyGen Studio Automation generates from scripts; video-use edits existing footage. Running both on the same dental-marketing script (generate with HeyGen, polish with video-use) may produce higher-fidelity output than either alone.
- Try a Manim explainer slot. For a dental-marketing explainer, commission one Manim animation slot (e.g., how Google Search prioritizes local dental sites) via video-use’s parallel sub-agent pattern. Gives a reusable asset plus a test of the Manim backend.
## Related
- HeyGen Studio Automation with Claude Code — complementary: HeyGen Studio generates videos from scripts (ElevenLabs → HeyGen Avatar V → Remotion). video-use edits existing footage. Both orchestrated by Claude Code; both use ElevenLabs. Stacking them = script-to-avatar-to-polished-cut.
- Claude Code Video Toolkit (Digital Samba) — 10-skill + 13-command production workspace using open-source models on Modal/RunPod. Overlapping goal (Claude Code owns video production) with different tradeoff: Video Toolkit ships a full OSS model stack for generation; video-use focuses narrowly on editing-by-conversation with hosted ASR as the only paid dependency.
- Remotion Motion Graphics — Remotion is one of video-use’s three animation backends (alongside PIL and Manim). Useful when brand-aligned typography matters.
- HeyGen Hyperframes — another HTML-based composition framework for Claude Code / Cursor / Codex. Different shape (Hyperframes = deterministic HTML composition; video-use = non-linear editing of recorded footage) but overlapping audience.
- HeyGen Avatar V — natural upstream input. Avatar V produces unlimited-duration talking-head footage; video-use is the natural next step for trimming, grading, and subtitling that footage.
- Building Skills for Claude — video-use is a canonical example of the Anthropic Skill format: `SKILL.md` at root with `name`/`description` frontmatter, progressive disclosure into helpers, vendored sub-skill (`skills/manim-video/`).
- Skill Design Patterns — video-use illustrates several patterns: progressive-disclosure helpers, sub-skill vendoring, text-first context economy, self-evaluation loop with bounded retries.
- Claude Code Subagents — video-use’s parallel-sub-agent pattern for animations is a concrete production case of subagent orchestration.
## Open Questions
- License. README says “100% open source” but the GitHub repo has no `LICENSE` file and no `license` field in metadata. Worth confirming before shipping derivative work.
- Scribe cost per hour of source footage. README doesn’t state current ElevenLabs Scribe per-minute rates. For a WEO dental-marketing project with 10+ takes of 30-minute recordings, this is the cost-dominating line item.
- Resolution / codec defaults for `render.py`. README documents “1080p default scale,” but the full extraction ffmpeg recipe, bitrate, CRF, and pix_fmt would need to be read from the helpers themselves.
- How video-use composes with [[ai-video-content/heygen-avatar-v|Avatar V]] in practice. Avatar V output is already edited (no fillers, no retakes), so the editing value from video-use on that footage is grading + subtitles + overlays, not cutting. A hands-on trial would confirm whether that’s worth the plumbing.
- Behavior on vertical / social aspect ratios. Output spec lists 1080×1920@30 as supported, but README-level examples are all landscape. Worth testing for short-form dental Reels / TikToks.
- Comparison to traditional NLE-style AI editors (e.g., Descript, Adobe Firefly Video). video-use is conversational + Claude Code-native; most AI-editor competitors are GUI-first. Tradeoff profile hasn’t been benchmarked.