Source: Website to Video — HeyGen HyperFrames docs
The /website-to-video skill is HyperFrames’ URL-to-video path: hand an AI agent a live URL plus a one-line creative direction, and it captures the site, extracts the brand identity, writes a script and storyboard, generates voiceover, builds animated HTML compositions, and delivers a renderable MP4. It is the “warm start” workflow from the HeyGen HyperFrames hub made concrete: the capture step does the scraping, and the standard 7-step HyperFrames pipeline does the rest. This article is the deep dive on that one path; see the hub for what HyperFrames is overall.
Key Takeaways
- One prompt, full pipeline. A URL + a creative direction triggers capture, design, script, storyboard, voiceover, build, and validate — no manual steps in between.
- The agent self-triggers on a URL + a video request. Once the skill is installed (`npx skills add heygen-com/hyperframes`), the agent loads it automatically when it sees a link and a “make a video” intent; no slash command strictly required.
- Capture is step one and runs automatically. You don’t call `npx hyperframes capture` by hand in the normal flow, but it exists as a standalone command for pre-caching, debugging, or harvesting site data outside video production.
- Branding is pulled from the page, not invented. Pixel-sampled colors, downloaded woff2 fonts, semantically-named assets, page sections, and CTAs are extracted into a `capture/` dir, then distilled into a `DESIGN.md` brand reference the compositions obey.
- Scenes are “beats.” The storyboard breaks the video into per-beat creative direction; each beat becomes one animated HTML composition (`compositions/*.html`), so you can rebuild a single beat without re-running the pipeline.
- Creative direction outweighs format. “Apple keynote energy” or “dark, developer-focused, show code” shapes every visual decision more than the duration/type label does.
- Optional vision enrichment. A Gemini or OpenRouter API key upgrades asset descriptions from DOM-context guesses to actual image descriptions (~$0.04 per 40-image capture on the paid tier).
How It Works
The skill runs the canonical HyperFrames pipeline. Its seven steps each emit a named artifact that feeds the next:
| Step | Output | What happens |
|---|---|---|
| Capture | capture/ | Screenshots, design tokens, fonts, assets, animations extracted from the live site |
| Design | DESIGN.md | Brand reference — colors, typography, do’s and don’ts |
| Script | SCRIPT.md | Narration text with hook, story, proof, CTA |
| Storyboard | STORYBOARD.md | Per-beat creative direction — mood, assets, animations, transitions |
| VO + Timing | narration.wav + transcript.json | TTS audio with word-level timestamps |
| Build | compositions/*.html | Animated HTML compositions, one per beat |
| Validate | Snapshot PNGs | Visual verification before delivery |
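Because each step leaves a named artifact behind, a quick file check can show how far a run got. The sketch below is a hypothetical helper, not part of HyperFrames. It assumes the artifacts sit at the project root under the names in the table, which the source does not state explicitly, and it skips the Validate step because the location of the snapshot PNGs is not given.

```shell
# Hypothetical helper: report the first pipeline step whose artifact is missing.
# Assumption: artifacts live at the project root under the names in the table.
next_step() {
  local dir="${1:-.}" entry step artifact
  for entry in \
    "Capture:capture" "Design:DESIGN.md" "Script:SCRIPT.md" \
    "Storyboard:STORYBOARD.md" "VO + Timing:narration.wav" \
    "VO + Timing:transcript.json" "Build:compositions"; do
    step=${entry%%:*}       # text before the first colon
    artifact=${entry#*:}    # text after the first colon
    if [ ! -e "$dir/$artifact" ]; then
      echo "next step: $step (missing $artifact)"
      return 0
    fi
  done
  echo "all artifacts present; ready to validate"
}
```

Calling `next_step my-project` on a directory that holds only `capture/` and `DESIGN.md` would report Script as the next step.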
The capture step (the part that makes this a URL workflow). A headless browser loads the page, scrolls through it, and extracts:
- Screenshots — viewport captures at every scroll depth; the count is dynamic based on page height.
- Colors — pixel-sampled dominant colors plus computed styles (including oklch/lab conversion).
- Fonts — CSS font families plus the downloaded woff2 files.
- Assets — images, SVGs with semantic names, Lottie animations, video previews.
- Text — all visible text in DOM order.
- Animations — Web Animations API, scroll-triggered animations, WebGL shaders.
- Sections — page structure with headings, types, and background colors.
- CTAs — buttons and links detected by class names and text patterns.
That raw capture/ is what turns into a composition: the Design step compresses it into DESIGN.md (the palette + type + brand rules the build obeys), the Script and Storyboard steps decide what to say and how to pace it, and the Build step writes one animated HTML file per beat using the captured assets. Nothing about the brand is invented — it is sampled off the page.
Vision enrichment (optional)
By default the capture describes each asset from DOM context alone — alt text, nearby headings, CSS classes. Adding a vision key upgrades those to real descriptions, which lets the agent make better storyboard decisions:
- Without vision: `hero-bg.png — 582KB, section: "Hero", above fold` (knows it exists, not what it shows).
- With vision: `hero-bg.png — 582KB, A gradient wave in purple and blue sweeps across a dark background, creating an aurora-like effect.`
Drop a key in a project-root `.env`: either `GEMINI_API_KEY` or `OPENROUTER_API_KEY` (OpenRouter wins if both are set; default model `google/gemini-3.1-flash-lite`, overridable via `HYPERFRAMES_OPENROUTER_MODEL` / `HYPERFRAMES_GEMINI_MODEL`). Cost is ~$0.04 per 40-image capture on the paid tier.
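As a rough illustration of the setup described above, the sketch below appends a placeholder key to a project-root `.env` and estimates cost from the quoted figure. The variable names come from the docs; the `replace-me` value, the `ENV_FILE` variable, the `estimate_cost` helper, and the assumption that cost scales linearly per image are all illustrative. It also assumes the file is parsed like a conventional dotenv file, where `#` lines are comments.

```shell
# Append to .env rather than overwrite it (running this twice duplicates the lines).
env_file="${ENV_FILE:-.env}"   # ENV_FILE is a hypothetical override for this sketch
cat >> "$env_file" <<'EOF'
# Set ONE of the two keys; per the docs, OpenRouter wins if both are set.
OPENROUTER_API_KEY=replace-me
# GEMINI_API_KEY=replace-me
# Optional model override (docs list google/gemini-3.1-flash-lite as the default):
# HYPERFRAMES_OPENROUTER_MODEL=google/gemini-3.1-flash-lite
EOF

# Rough cost estimate, ASSUMING the quoted ~$0.04 per 40 images scales linearly.
estimate_cost() {
  awk -v n="$1" 'BEGIN { printf "%.3f\n", n * 0.04 / 40 }'
}
estimate_cost 100   # prints 0.100
```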
Invocation / The Prompt
1. Install the skill once (persists across sessions; works with Claude Code, Cursor, Gemini CLI, and Codex CLI):

       npx skills add heygen-com/hyperframes

2. Describe the video in any directory, giving a URL plus a duration and creative direction:

       Create a 25-second product launch video from https://example.com.
       Bold, cinematic, dark theme energy.

   The agent loads the skill on seeing a URL + a video request and runs the whole pipeline. For the most reliable trigger, lead with “Use the /website-to-video skill.”

3. Preview live (opens in the browser; edits auto-reload):

       npx hyperframes preview

4. Render to a file:

       npx hyperframes render --output my-video.mp4
       # ✓ Captured 750 frames in 12.4s
       # ✓ Encoded to my-video.mp4 (25.0s, 1920×1080, 6.8MB)

The capture command (advanced)
The skill captures automatically as step one, so you rarely call this — but it is exposed for pre-caching, debugging a bad capture, or using site data outside video:
    npx hyperframes capture https://stripe.com
    # ◇ Captured Stripe | Financial Infrastructure → capture
    # Screenshots: 12 · Assets: 45 · Sections: 15 · Fonts: sohne-var

| Flag | Default | Description |
|---|---|---|
| `-o, --output` | `./capture` | Output dir (auto-suffixes `./capture-2/`, `./capture-3/`… if taken) |
| `--timeout` | `120000` | Page-load timeout (ms) |
| `--skip-assets` | `false` | Skip downloading images and fonts |
| `--max-screenshots` | `24` | Maximum screenshot count |
| `--json` | `false` | Output structured JSON for programmatic use |
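The auto-suffix rule for `--output` can make it unclear where a repeat capture will land. The sketch below is an illustration of the naming pattern the table describes (`./capture`, then `./capture-2/`, `./capture-3/`), not HyperFrames' actual code; `next_capture_dir` is a hypothetical helper and the exact collision logic inside the CLI is an assumption.

```shell
# Hypothetical helper: predict the directory a capture would use,
# assuming suffixes start at -2 and increase until a free name is found.
next_capture_dir() {
  local base="${1:-./capture}" n=2 candidate
  candidate="$base"
  while [ -e "$candidate" ]; do
    candidate="${base}-${n}"
    n=$((n + 1))
  done
  echo "$candidate"
}
```

With `./capture` and `./capture-2` already present, `next_capture_dir ./capture` would print `./capture-3`.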
Iterating without a full re-run
- Edit the storyboard: `STORYBOARD.md` is the creative north star; change a beat’s mood or assets and ask the agent to rebuild just that beat.
- Edit a composition directly: open `compositions/beat-3-proof.html` and tweak animations, colors, or layout by hand.
- Rebuild one beat: “Rebuild beat 2 with more energy. Use the product screenshot as full-bleed background.”
- Snapshot to verify without a full render: `npx hyperframes snapshot my-project --at 2.9,10.4,18.7` emits key-frame PNGs (flags: `--frames` default 5, `--at` timestamps, `--timeout` default 5000ms).
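The snapshot command above takes explicit timestamps through `--at`. One way to produce such a list is sketched below: a hypothetical `at_list` helper that returns the midpoints of equal-length segments of the video. How `--frames` picks its own timestamps is not described in the source, so this is only one reasonable choice, not a reproduction of the CLI's behavior.

```shell
# Hypothetical helper: N evenly spaced timestamps (segment midpoints)
# for a video of the given duration in seconds, joined with commas.
at_list() {
  awk -v d="$1" -v n="$2" 'BEGIN {
    for (i = 0; i < n; i++) {
      printf "%s%.1f", (i ? "," : ""), d * (i + 0.5) / n
    }
    printf "\n"
  }'
}

at_list 25 5   # prints 2.5,7.5,12.5,17.5,22.5
# Possible use with the documented flag:
#   npx hyperframes snapshot my-project --at "$(at_list 25 5)"
```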
What You Get
- A multi-beat video whose scenes are derived from the site’s own content — hook, story, proof, CTA narration arc, one animated HTML composition per beat.
- On-brand visuals pulled from the page: the actual palette, fonts, and assets, governed by `DESIGN.md` so the output matches the source brand rather than a generic template.
- Voiceover with word-level timing: `narration.wav` + `transcript.json`, which is what lets captions and asset reveals sync to the narration.
- Validation snapshots: PNG key frames generated before delivery so you can eyeball compositions without a full encode.
- A renderable MP4 (example render: 25.0s, 1920×1080, 6.8MB) plus all intermediate artifacts checked into the project for re-editing.
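The sample render log in the Invocation section reports 750 frames for a 25.0s, 6.8MB file, which implies a frame rate and an average bitrate. The sketch below works those out; `render_stats` is a hypothetical helper, and the bitrate assumes 1 MB = 10^6 bytes, which the log does not specify.

```shell
# Hypothetical helper: derive frame rate and average bitrate from a render log.
# $1 = frame count, $2 = duration in seconds, $3 = file size in MB (assumed 10^6 bytes)
render_stats() {
  awk -v f="$1" -v d="$2" -v mb="$3" 'BEGIN {
    printf "%.0f fps, %.2f Mbit/s\n", f / d, mb * 8 / d
  }'
}

render_stats 750 25.0 6.8   # prints 30 fps, 2.18 Mbit/s
```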
The prompt determines the format — include a duration and a direction:
| Type | Duration | Example prompt |
|---|---|---|
| Social ad | 10–15s | “15-second Instagram reel. Energetic, fast cuts.” |
| Product launch | 20–30s | “25-second product launch. Apple keynote energy.” |
| Product tour | 30–60s | “45-second tour showing the top 3 features.” |
| Brand reel | 15–30s | “20-second brand video. Celebrate the design.” |
| Feature announcement | 15–25s | “Feature announcement highlighting the new AI agents.” |
| Teaser | 8–15s | “10-second teaser. Super minimal. Just the hook.” |
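When prompts are generated by a script, it can help to check the requested duration against the table before sending it. The sketch below is a hypothetical pre-flight check, not something the skill does itself; the ranges are copied from the table, while the type slugs (`social-ad`, `product-launch`, and so on) are invented for this example.

```shell
# Duration ranges from the table above; the slugs are hypothetical labels.
duration_range() {
  case "$1" in
    social-ad)            echo "10 15" ;;
    product-launch)       echo "20 30" ;;
    product-tour)         echo "30 60" ;;
    brand-reel)           echo "15 30" ;;
    feature-announcement) echo "15 25" ;;
    teaser)               echo "8 15" ;;
    *) return 1 ;;
  esac
}

# Returns 0 if the duration fits, 1 if it does not, 2 for an unknown type.
fits_format() {
  local range min max
  range=$(duration_range "$1") || return 2
  min=${range% *}
  max=${range#* }
  [ "$2" -ge "$min" ] && [ "$2" -le "$max" ]
}

fits_format product-launch 25 && echo "25s fits a product launch"
```

Since the docs say creative direction outweighs the format label, a mismatch here is better treated as a warning than as an error.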
Use Cases
- Product launch / showcase — turn a marketing or product page into a 20–30s keynote-style announcement (the docs’ lead example: a Linear-style launch with “Apple keynote” framing).
- Site / product tour — a 30–60s walkthrough of a site’s top features, scenes built from the captured sections.
- Social clip — a 10–15s reel or an 8–15s teaser cut from the site’s hero and key assets for Instagram/TikTok-style distribution.
- Brand reel — a 15–30s piece that celebrates a site’s design using its own palette and type.
- Capture-only data harvest: run `npx hyperframes capture --json` purely to extract a site’s colors, fonts, and assets for use outside video production.
Limitations
- Heavy client-side rendering needs a longer timeout. Sites behind Cloudflare or with heavy CSR can time out; bump `--timeout` (e.g. `--timeout 180000`). The capture handles dynamic sites; it just may need more load time.^[inferred: the docs prescribe a longer timeout rather than declaring such sites unsupported; an empty-shell SPA that renders nothing without interaction is the residual risk]
- Lazy-loaded images on very long pages can be missed. Framer-style sites that lazy-load via IntersectionObserver are handled by the capture scrolling the page, but images near the bottom of very long pages may not all load. A vision key improves asset descriptions but does not increase the count.
- Color accuracy depends on sampling. The palette comes from pixel sampling plus DOM computed styles; if colors look wrong, inspect the scroll screenshots in `capture/screenshots/` to see what the capture actually saw.
- Vision enrichment requires an external API key. Richer asset descriptions need a Gemini or OpenRouter key and incur (small) per-image cost; without one, the agent works from DOM context only.
- Trigger can be unreliable if the skill is not installed or the intent is ambiguous — verify the install and lead with “Use the /website-to-video skill.”
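The timeout advice in the first limitation can be automated with a simple retry ladder. The sketch below is a hypothetical wrapper: 120000 is the documented default and 180000 is the docs' example, while the 240000 step, the `HF` override variable, and the assumption that a timed-out capture exits non-zero are not confirmed by the source.

```shell
# Hypothetical wrapper: retry a capture with progressively longer timeouts.
# ASSUMPTION: the capture command exits non-zero when the page load times out.
capture_with_retry() {
  local url="$1" t
  for t in 120000 180000 240000; do
    # HF lets a test or script substitute the command; defaults to the real CLI.
    if ${HF:-npx hyperframes} capture "$url" --timeout "$t"; then
      echo "captured with --timeout $t"
      return 0
    fi
  done
  echo "capture failed at every timeout" >&2
  return 1
}
```

Usage would look like `capture_with_retry https://example.com`; if every attempt fails, the residual causes listed above (for example a shell page that renders nothing without interaction) are worth checking before raising the timeout further.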
Try It
- `npx skills add heygen-com/hyperframes` in a throwaway directory, then prompt your agent: “Create a 20-second product launch video from [your site]. Apple keynote energy.” Let it run the full pipeline.
- `npx hyperframes preview` to watch it in the browser, then iterate with one-line beat edits (“rebuild beat 2 with more energy”).
- Inspect the artifacts: open `DESIGN.md` to see the brand it pulled, and `STORYBOARD.md` to see the beat breakdown, then `npx hyperframes render --output launch.mp4`.
- For a pure data pull, run `npx hyperframes capture https://stripe.com --json` and look at the captured colors, fonts, and assets.