/install senseaudio-video-gen
SenseAudio Video Gen
Author videos as HTML compositions, preview them in a browser, render them locally through Chrome screenshots plus FFmpeg, plan scripts/storyboards with AudioClaw by default, and generate supporting media through SenseAudio APIs. Treat HTML as the editable source of truth, SenseAudio as the media engine, and AudioClaw as the default LLM route.
Default Setup
On a new machine, configure the media API and LLM API separately:
export SENSEAUDIO_API_KEY="..."
SENSEAUDIO_API_KEY powers only SenseAudio media APIs: TTS, ASR, image, video, and music.
AudioClaw LLM planning uses a separate OpenAI-compatible route. If running inside AudioClaw, no extra LLM env is needed when the local AudioClaw config file exists. Otherwise set the LLM env explicitly:
export AUDIOCLAW_CONFIG_PATH="config/audioclaw.json"
export AUDIOCLAW_LLM_MODEL="doubao-seed-2-0-pro-260215"
export AUDIOCLAW_LLM_BASE_URL="https://platform.senseaudio.cn/v1"
export AUDIOCLAW_LLM_API_KEY="..."
LLM config precedence is CLI flags, then AUDIOCLAW_LLM_*, then AUDIOCLAW_CONFIG_PATH or the local AudioClaw config file. The CLI deliberately does not reuse SENSEAUDIO_API_KEY as an AudioClaw LLM key. Use --llm none for deterministic heuristic planning, or --offline to skip live media calls.
Core Loop
Start from a brief when the user wants a complete project shell:
python3 scripts/senseaudio_video_gen.py compose \
--project my-video \
--brief "Make a premium launch film for a new AI research assistant." \
--duration 12 \
--style-preset executive-film
compose is the general video path for product launch films, feature explainers, report summaries, technical walkthroughs, title cards, social cuts, and branded motion pieces. It defaults to executive-film, a restrained cinematic style with large typography, letterbox framing, low ornament, and non-web chapter IDs. Use --offline when drafting without live API calls, and add --render when the project should be rendered immediately.
compose now defaults to --llm audioclaw. If the default LLM route is unavailable, --llm-fallback keeps the project moving with heuristic planning and records a warning in senseframe.json. Pass --no-llm-fallback when an LLM failure should stop the run.
For the closest HyperFrames-style website workflow, prefer the one-pass site-video pipeline:
python3 scripts/senseaudio_video_gen.py site-video \
--url https://www.anthropic.com/ \
--project anthropic-site-video \
--brief "用中文介绍 Anthropic 官网的 Claude、安全 AI、研究与企业能力。" \
--duration 14 \
--fps 30 \
--llm audioclaw
site-video defaults to editorial-pro, layered beats, cinematic motion, GSAP-compatible timing, real website screenshots, AudioClaw LLM planning, SenseAudio narration/ASR when live media is enabled, audio-reactive data, local rendering, inspect frames, local frame-quality audit, and motion audits. If LLM planning fails, --llm-fallback retries with heuristic planning and records the warning. Use --offline --no-render for a safe draft that writes the same editable project structure and pipeline-report.json.
Add --music --music-poll when the site video should request a SenseAudio music bed, download it, mix it under narration as assets/final-audio.m4a, and render with that mixed track. If SenseAudio accepts the task but does not return audio_url in time, --music-fallback creates a local ambient bed so the video still ships with background music while preserving the task manifest. Use --music-dry-run or --offline to inspect the /music/song/create payload without spending credits. Add --auto-repair when the project should run a second pass after motion/vision audits, tighten real screenshot crops, damp busy overlays, and rerender the repaired composition.
Website capture follows the useful parts of the HyperFrames loop: warm the live page, dismiss common cookie/modals, scroll to trigger lazy assets, record assets/site-capture-quality.json, capture renders/inspect frames, and write renders/inspect/contact-sheet.html for review. Use --vision-audit when a live VL model should judge the rendered frames; the local frame-quality-audit still runs by default when rendering.
For gated, cookie-sensitive, or region-personalized sites, keep browser state explicit:
python3 scripts/senseaudio_video_gen.py site-video \
--url https://example.com/ \
--project example-site-video \
--browser-profile profiles/example-capture \
--cookie-file cookies/example.json
Cookies often make screenshots closer to what a real user sees, but the clean temporary browser remains the default to avoid leaking private account pages into generated videos.
For URL-to-video work, site-ingest classifies real page material into semantic roles such as hero, product, research, safety, developer, enterprise, customer, pricing, and CTA. These roles drive story_evidence, shot choice, composition mode, camera path, and data-material-role markers in the rendered HTML.
Use source-ingest for first-stage non-web inputs. It converts local Markdown/text files or a GitHub repository README into the same site-profile.json shape used by website projects, so compose --site-file \x3Cprofile.json> can reuse storyboard, narration, semantic role, and production-spec logic without a separate document pipeline:
python3 scripts/senseaudio_video_gen.py source-ingest \
--file product-notes.md \
--output product-notes.site.json \
--json
python3 scripts/senseaudio_video_gen.py source-ingest \
--github-url heygen-com/hyperframes \
--output hyperframes-readme.site.json
Use site-vision-plan when screenshot crops need to be planned before rendering. The default heuristic provider derives crop center, zoom, pan, and focus from DOM highlights and semantic roles. --provider openrouter builds an OpenRouter-compatible vision request so a VL model can inspect screenshots first; keep --fallback enabled so rendering degrades to deterministic crops if the model route is unavailable.
Use the music and repair commands directly when tuning an existing project:
python3 scripts/senseaudio_video_gen.py music-create \
--prompt "Instrumental premium website explainer bed, subtle pulse, no vocals" \
--duration 16 \
--poll \
--download my-video/assets/background-music.mp3 \
--project my-video
python3 scripts/senseaudio_video_gen.py mix-audio \
--project my-video \
--voice my-video/assets/narration.mp3 \
--music my-video/assets/background-music.mp3 \
--output my-video/assets/final-audio.m4a \
--duration 16
python3 scripts/senseaudio_video_gen.py repair --project my-video --json
Use the default AudioClaw route for creative plans when the brief needs LLM-written copy and storyboard:
python3 scripts/senseaudio_video_gen.py llm-plan \
--brief "Make a concise webpage intro for SenseAudio's sound library." \
--duration 9 \
--output my-plan.json
python3 scripts/senseaudio_video_gen.py compose \
--project my-video \
--brief "Make a concise webpage intro for SenseAudio's sound library." \
--generate-images \
--generate-broll \
--asset-dry-run \
--offline
llm-plan defaults to --provider audioclaw. The skill strips LiteLLM-style provider prefixes such as volcengine/ for platform.senseaudio.cn and retries without response_format for models that do not support JSON-mode requests.
DeepSeek remains available with --provider deepseek or --llm deepseek; set DEEPSEEK_API_KEY, DEEPSEEK_MODEL, or DEEPSEEK_BASE_URL when using it.
If the AudioClaw configured model is not strong enough for dense product research, switch planning to OpenRouter with --provider openrouter or --llm openrouter, and choose a capable model via --model, --llm-model, OPENROUTER_LLM_MODEL, or OPENROUTER_MODEL.
Build an existing project as a local pipeline:
python3 scripts/senseaudio_video_gen.py build --project my-video --dry-run
python3 scripts/senseaudio_video_gen.py build --project my-video --output my-video/renders/final.mp4
Or scaffold a blank composition:
python3 scripts/senseaudio_video_gen.py init my-video --duration 6 --fps 24
cd my-video
python3 ../scripts/senseaudio_video_gen.py preview .
python3 ../scripts/senseaudio_video_gen.py inspect . --samples 5
python3 ../scripts/senseaudio_video_gen.py render . --output renders/final.mp4
Use SenseAudio assets inside the same project:
python3 scripts/senseaudio_video_gen.py tts \
--text "让声音、字幕和画面在一个视频项目里完成。" \
--voice-id male_0028_a \
--output my-video/assets/narration.mp3
python3 scripts/senseaudio_video_gen.py asr \
--file my-video/assets/narration.mp3 \
--timestamps word \
--output my-video/assets/transcript.json
python3 scripts/senseaudio_video_gen.py captions \
--project my-video \
--transcript my-video/assets/transcript.json \
--output my-video/assets/captions.json
python3 scripts/senseaudio_video_gen.py captions-export \
--captions my-video/assets/captions.json \
--format srt \
--output my-video/renders/final.srt
python3 scripts/senseaudio_video_gen.py render my-video \
--audio my-video/assets/narration.mp3 \
--parallel 4 \
--resume \
--output my-video/renders/final-with-voice.mp4
python3 scripts/senseaudio_video_gen.py lint --project my-video --json
python3 scripts/senseaudio_video_gen.py asset-report --project my-video --json
python3 scripts/senseaudio_video_gen.py generate-assets \
--project my-video \
--image-prompt "clean product UI hero image for a sound library" \
--video-prompt "short b-roll of creators choosing voices" \
--dry-run
python3 scripts/senseaudio_video_gen.py timeline \
--project my-video \
--preset cinematic
Composition Contract
- Root scene uses
data-composition-id,data-width,data-height,data-duration. - Timed clips use
data-start,data-duration, optionaldata-media-start, and optionaldata-scene. - Timeline DSL uses
assets/timeline.json,data-timeline-source, and optionaldata-effectpresets such asfade-up,slide-left,zoom-in,spotlight, andparallax. - Visual styles are selected through the local registry with
stylesandcompose --style-preset; tokens are embedded as CSS variables, written toassets/style-preset.json, and recorded insenseframe.json. - The optional
gsap-compattimeline engine writeslabelsandtracksand uses a localcreateGsapCompatTimelineadapter; never load external GSAP/CDN code for deterministic renders. - Transition presets use
transition_presetplustransitions[]inassets/timeline.json; supported presets includeeditorial,glass,ribbon,iris, andluma. - Storyboard ids must bind to real DOM scenes:
composemaps each storyboard item to matchingdata-sceneanddata-timeline-idelements rather than fixed template beats. - Beat composition uses
assets/beats.json,.beat-layer, anddata-beatmarkers.compose --beat-mode layeredsplits each storyboard scene into hook/proof/detail/cta overlays so a single scene can carry multiple timed visual arguments. - Beat pacing is readability-first: defaults use 3 requested beats per scene, automatically clamp short scenes to keep each beat around one second or longer, and
motion-mapreports flashiness risk when beat/transition rates are too high. - Website explainers rotate dedicated shot layouts (
hero-overview,nav-scan,feature-zoom,trust-message,cta-summary) so adjacent scenes do not reuse the same visual structure. - Composition templates use
data-composition-modevalues such asfull-bleed,split-scan,zoom-callout,evidence-board, andcta-lockup; every website scene should also set adata-camera-pathso the renderer can apply distinct camera motion instead of a repeated left/right card. - Brand extraction uses
brand-extractorcompose --brand-urlto createassets/brand.json; brand name, description, nav labels, colors, logos/icons, social images, typography, keywords, and inferred voice should influence website explainer shots. - Site ingestion uses
site-ingestorcompose --site-urlto createassets/site-profile.json; headings, sections, CTA labels, and evidence snippets should drive storyboard scenes and visual cards before generic brief fallback. - Source ingestion uses
source-ingest --file \x3Cnotes.md|notes.txt>orsource-ingest --github-url \x3Cowner/repo>to create the same site-profile shape from Markdown, plain text, or GitHub README content. - Website screenshots use
site-captureorcompose --site-screenshotsto capture scroll positions intoassets/site-screenshots/; these screenshots should appear in website explainer shots as visual evidence with deterministic pan/zoom and evidence highlight boxes, not decorative stock media. - Video copy defaults to Chinese (
zh-CN) for narration, captions, beat text, and storyboard intent unless the user explicitly requests another language. - Seekable motion uses
window.__timelines["main"]; the runtime seeks registered timelines during frame capture so entrances, breathing motion, exits, chips, waveforms, and focus highlights render deterministically. - Audio-reactive motion uses smoothed
assets/audio-data.jsonfromaudio-data; the runtime loadsdata-audio-sourceand maps RMS/bands to local mesh intensity, card glow, waveform motion, and transition light. Do not drive the global camera directly from raw audio. - Caption containers use
data-caption-source="./assets/captions.json"or inlinewindow.__sfCaptions;captions --include-wordsenables active word highlighting with.sf-word[data-sf-active="true"]. - Word captions use kinetic karaoke states: active words scale, emphasized terms receive
sf-word-emphasis, and styling must remain deterministic with no CSS animation loops. - Final renders are silent unless
render --audio \x3Cfile>is provided orbuildfinds a registered audio asset such asnarration;lintwarns when narration text exists but no audio asset is registered. - Put deterministic frame logic in
window.renderFrame(time). - Run
motion-audit --project \x3Cdir> --strictafter composing to catch storyboard/DOM/timeline mismatches and legacy fixed-template markers. - Run
motion-map --project \x3Cdir> --strictbefore expensive renders to score motion density, scene coverage, low-motion zones, transitions, and audio-reactive binding. - The runtime sets
window.__senseframes.time, CSS variable--sf-time,data-sf-active,data-scene-active, and dispatchessf-seek. - Use CSS for final layout first, then animate from
renderFrame(time); avoid wall-clock animation for rendered output.
Workflow
- Plan — define aspect ratio, duration, audience, scenes, copy, voice, and output target.
- Scaffold — run
composefor brief-to-project orinitfor a blank source; use--beat-mode layeredwhen the video needs dense HyperFrames-style scene internals. - Generate media — use
voices,tts,asr,captions,image-sync, orvideo-createto produce assets. - Register assets — use
asset-addor command manifests soassets/asset-manifest.jsonandsenseframe.jsonstay current. - Lint and audit — run
lint,motion-audit, andmotion-mapto catch missing assets, mismatched scenes, and flat motion. - Inspect frames — run
inspectbefore rendering to catch layout, legibility, and timing issues. - Render — run
renderorbuild; mux narration with--audiowhen needed. - Deliver — return the MP4, render report JSON, subtitles,
senseframe.json, transcripts, prompts, and asset manifests.
Command Map
| Task | Command | Purpose |
|---|---|---|
| One-pass website video | site-video --url \x3Csite> |
Ingest, plan, capture, narrate, bind audio data, render, and audit in one pipeline |
| LLM plan | llm-plan / llm-plan --provider openrouter |
Generate title, narration, visual style, and storyboard JSON; defaults to AudioClaw |
| Brief to project | compose --project \x3Cdir> --brief ... |
Create storyboard, narration script, caption scaffold, HTML, and manifests; defaults to AudioClaw with heuristic fallback |
| Brand extraction | brand-extract --url \x3Csite> |
Extract brand identity, colors, nav, logos/icons, typography, keywords, and voice |
| Site ingestion | site-ingest --url \x3Csite> |
Extract real headings, sections, CTA labels, and evidence snippets for URL-to-video |
| Source ingestion | source-ingest --file \x3Cmd/txt> / --github-url \x3Crepo> |
Convert Markdown, text, or GitHub README content into a reusable site-profile.json |
| Site screenshots | site-capture --url \x3Csite> |
Capture real scroll screenshots with Chrome, warm lazy content, clean overlays, and register visual evidence |
| Frame quality | frame-quality-audit --project \x3Cdir> |
Check inspect/site frames for blank captures and leaked planning copy |
| Visual crop plan | site-vision-plan --project \x3Cdir> |
Plan screenshot crop, zoom, pan, and focus before rendering |
| Beat layers | beats --project \x3Cdir> |
Split storyboard scenes into hook/proof/detail/cta timed overlays |
| Local pipeline | build --project \x3Cdir> |
Run lint, create captions when a transcript exists, and render |
| Generated assets | generate-assets --project \x3Cdir> |
Plan or call SenseAudio image/video generation and register results |
| Project validation | lint --project \x3Cdir> |
Check entry HTML, runtime, caption sources, timing, and asset existence |
| Style registry | styles --json |
List built-in visual presets and recommended motion defaults |
| Motion audit | motion-audit --project \x3Cdir> |
Check storyboard scene binding, beat layers, transition layer, audio-reactive hooks, timeline registry, and legacy markers |
| Motion map | motion-map --project \x3Cdir> |
Score motion density, scene/beat coverage, flashiness risk, transition coverage, dead zones, and audio-reactive binding |
| Audio data | audio-data --audio \x3Cfile> --output assets/audio-data.json |
Extract frame-level RMS/band data and bind it with data-audio-source |
| Scaffold | init \x3Cdir> |
Create index.html, runtime, manifest, assets, renders |
| Preview | preview \x3Cdir> |
Serve project for browser review |
| Inspect | inspect \x3Cdir> |
Capture timestamped sample frames |
| Timeline | timeline --project \x3Cdir> --timeline-engine gsap-compat |
Generate animation tracks, labels, transitions, and bind them to the runtime |
| Render | render \x3Cdir> |
Convert HTML frames to MP4 locally with optional --parallel, --resume, and --frame-dir |
| Voiceover | tts |
Generate narration from SenseAudio TTS |
| Transcript | asr --timestamps word |
Produce transcript timing for captions |
| Captions | captions --transcript ... |
Convert ASR JSON into assets/captions.json |
| Subtitle files | captions-export |
Export captions JSON to .srt or .vtt |
| Asset registry | asset-add |
Register local/generated assets in the project manifest |
| Asset inventory | asset-report |
List registered assets and missing files |
| Still assets | image-sync |
Generate first frames, backdrops, thumbnails |
| Model clips | video-create / video-status |
Generate AI video clips through SenseAudio |
| Voices | voices --voice-type all |
Discover usable voice_id values |
Non-Negotiable Rules
- Never rely on CSS
animation,setInterval, or wall-clock playback for final renders; tie motion totime. - Do not use remote video generation when an HTML composition can express the exact UI/layout; use SenseAudio video generation for generative inserts, references, or b-roll.
- Do not invent voice IDs. Query
voicesor use a user-provided voice. - Keep
senseframe.jsonupdated with generated assets, task IDs, transcripts, and final output paths. - For subtitles, use
captionsto convert ASR words into grouped caption cues before authoring caption elements. - If a local audio/video file must guide SenseAudio model video generation, upload it somewhere first; model video content fields require URLs.
References
references/authoring.md— HTML composition patterns and timing rules.references/renderer.md— local renderer requirements and troubleshooting.references/media-pipeline.md— SenseAudio asset pipeline.references/api.md— endpoint and model parameter summary.examples/starter-html-video— minimal editable composition project.
- 确保已安装 OpenClaw(本地或 Docker 部署)
- 在对话框中输入安装命令:
/install senseaudio-video-gen - 安装完成后,直接呼叫该 Skill 的名称或使用
/senseaudio-video-gen触发 - 根据 Skill 的参数说明提供必要输入,即可获得结构化输出
senseaudio-video-gen 是什么?
Use when the user asks to create, inspect, render, or repair an HTML-authored video from a brief, website, Markdown/text file, or GitHub repository; needs ca... 它是一个面向 Claude Code / OpenClaw 的 AI Agent Skill 插件,目前累计下载 18 次。
如何安装 senseaudio-video-gen?
在 OpenClaw 或 Claude Code 对话框中运行命令「/install senseaudio-video-gen」即可一键安装,无需额外配置。
senseaudio-video-gen 是免费的吗?
是的,senseaudio-video-gen 完全免费,采用 MIT-0 许可证,可自由下载、安装和使用。
senseaudio-video-gen 支持哪些平台?
senseaudio-video-gen 跨平台运行,可在任意部署了 OpenClaw / Claude Code 的环境中使用(cross-platform)。
谁开发了 senseaudio-video-gen?
由 Li Fan(@fridaylifan)开发并维护,当前版本 v1.0.0。