Byted Mediakit Voiceover Editing

Name: Byted Mediakit Voiceover Editing
Author: volc-ai-mediakit

Description

Volcano Engine AI MediaKit talking-head video editing Skill: a one-stop workflow from environment setup through media management, audio processing, talking-h...

README (SKILL.md)

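1. Modes and Credentials

1.1 Three execution modes

Mode	Description	Required environment variables	ASR method
apig	SkillHub gateway proxy, Bearer Token authentication	`ARK_SKILL_API_BASE` + `ARK_SKILL_API_KEY` (injected by the container) + `VOLC_SPACE_NAME` + `ASR_API_KEY` + `ASR_BASE_URL`	Doubao speech large model
cloud	Direct connection to the Volcano Engine OpenAPI, HMAC signing	`VOLC_ACCESS_KEY_ID` + `VOLC_ACCESS_KEY_SECRET` + `VOLC_SPACE_NAME` + `ASR_API_KEY` + `ASR_BASE_URL`	Doubao speech large model
local	Runs entirely locally, no cloud services required	None (optionally `EXECUTION_MODE=local`)	Qwen3-ASR local inference

Priority: apig > cloud > local. Auto-detection checks the environment variables in this order. When a mode's variables are missing, the script prints the .env path and the list of missing variables, then downgrades automatically to the next mode.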
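The priority rule can be illustrated with a minimal sketch. The variable lists come from the table above; the function name `detect_mode` and its return shape are hypothetical and not taken from the skill's scripts, and the sketch ignores any explicit `EXECUTION_MODE` override.

```python
import os

# Required variables per mode, copied from the mode table.
REQUIRED_VARS = {
    "apig": ["ARK_SKILL_API_BASE", "ARK_SKILL_API_KEY", "VOLC_SPACE_NAME",
             "ASR_API_KEY", "ASR_BASE_URL"],
    "cloud": ["VOLC_ACCESS_KEY_ID", "VOLC_ACCESS_KEY_SECRET", "VOLC_SPACE_NAME",
              "ASR_API_KEY", "ASR_BASE_URL"],
    "local": [],
}
PRIORITY = ["apig", "cloud", "local"]


def detect_mode(env):
    """Return (mode, skipped); skipped maps each rejected mode to its missing variables."""
    skipped = {}
    for mode in PRIORITY:
        missing = [name for name in REQUIRED_VARS[mode] if not env.get(name)]
        if not missing:
            return mode, skipped
        skipped[mode] = missing
    # Not reached in practice: local has no required variables.
    return "local", skipped


if __name__ == "__main__":
    mode, skipped = detect_mode(dict(os.environ))
    for name, missing in skipped.items():
        print(f"{name}: missing {', '.join(missing)}")
    print(f"selected mode: {mode}")
```

Because local has no required variables, the loop always ends with a usable mode, which matches the "downgrade instead of block" behavior described above.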

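1.2 Credential configuration

.env file location: <SKILL_DIR>/.env
Scripts read the process environment first, then use .env to fill in only the variables that are still unset (container-injected values are never overridden)
ARK_SKILL_* variables are normally injected by the deployment container and do not need to be written into .env by hand
Missing variables do not block: scripts never prompt through terminal input(); they print a hint and downgrade automatically to an available mode
The Agent should recommend that users configure variables by editing the .env file or through the Agent's file-write tool, which avoids terminal paste problems
Security: create keys in the console with only the permissions required; use a separate VOD space for testing; never commit .env to a repository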
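The "process environment first, .env fills the gaps" rule can be sketched as follows. This is an illustration only: `parse_env_file` and `merge_env` are hypothetical helper names, and the parser handles only simple KEY=VALUE lines, which may be narrower than what the real scripts accept.

```python
from pathlib import Path


def parse_env_file(path):
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def merge_env(process_env, file_values):
    """Start from the .env values, then let every process variable win."""
    merged = dict(file_values)
    merged.update(process_env)
    return merged
```

With this ordering, a container-injected `ARK_SKILL_API_KEY` survives even if .env contains a different value for the same key.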

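1.3 Mode intent recognition (required reading for the Agent)

When the user expresses an intent to switch modes during the conversation, the Agent should recognize it and act on it:

User says	Recognized as	Action
"用本地模式" / "不走云端" / "离线处理" (use local mode / don't go through the cloud / process offline)	`EXECUTION_MODE=local`	Write to `.env` or pass `--mode local`
"用云端" / "用火山引擎" / "走 AK/SK" (use the cloud / use Volcano Engine / go through AK/SK)	`EXECUTION_MODE=cloud`	Write to `.env` or pass `--mode cloud`
"走网关" / "用 apig" / "用 SkillHub" (go through the gateway / use apig / use SkillHub)	`EXECUTION_MODE=apig`	Write to `.env` or pass `--mode apig`

Isolation requirement: mode selection is independent for each task. Switching to local in one task must not affect the mode of another task. At the start of every task the Agent should either re-run detection or follow the mode the user specified for that task.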
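As a rough illustration of the mapping in the table, the sketch below uses plain keyword matching. A real Agent would recognize intent semantically; the keyword lists, the extra English keywords, and the function names are assumptions made for this example.

```python
# Chinese phrases are the ones listed in the table; English keywords are illustrative additions.
MODE_KEYWORDS = {
    "local": ["用本地模式", "不走云端", "离线处理", "local mode", "offline"],
    "cloud": ["用云端", "用火山引擎", "ak/sk"],
    "apig": ["走网关", "apig", "skillhub"],
}


def detect_mode_intent(message):
    """Return the requested mode, or None when the message expresses no mode switch."""
    lowered = message.lower()
    for mode, keywords in MODE_KEYWORDS.items():
        if any(keyword.lower() in lowered for keyword in keywords):
            return mode
    return None


def mode_cli_args(message):
    """Translate a recognized intent into the --mode flag for the current task only."""
    mode = detect_mode_intent(message)
    return ["--mode", mode] if mode else []
```

Passing `--mode` per invocation, rather than writing `.env`, is one way to keep the choice scoped to a single task, in line with the isolation requirement.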

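2. Execution Constraints (STRICT, NO EXCEPTIONS)

Use only the scripts listed below; creating new scripts is forbidden
Run every step in order; skipping a step means the task has failed
Always cd <SKILL_DIR>/scripts before running any script
At every CHECKPOINT, verify the artifacts before continuing
If any step fails, stop immediately and report; do not continue
Step 4/5/6 output must not be simplified, use placeholders or samples, or omit content
The final output must be kept exactly as produced, with no modifications
Whether the review page opens automatically is determined by TALKING_VIDEO_AUTO_EDIT_REVIEW_AUTO_OPEN (mandatory): the Agent must not additionally run commands such as open/start/xdg-open
Timeout: audio and video processing takes a long time, so the host timeout should be set to 60 minutes

Division of responsibilities

Component	Responsible for	Does not
Scripts (rule engine)	ASR optimization, candidate generation (marked position + rule confidence + deletion suggestion)	Make the final delete/keep decision
Host Agent (you)	Semantic sentence segmentation, confirmation of verbal tics, candidate review, final delete/keep decision	Modify the scripts

Core principle: the scripts supply candidates (including deleted_parts + cleaned_text); the Agent makes the final decision.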

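3. Path Rules

3.1 SKILL_DIR and Output-dir

SKILL_DIR: path of the byted-mediakit-voiceover-editing directory
PROJECT_ROOT: derived by scripts/project_paths.py:
1. If the environment variable VOICEOVER_EDITING_PROJECT_ROOT is set, use it
2. Otherwise use parents[2] of <SKILL_DIR> (three levels up the parent chain, regardless of how the intermediate directories are named)
Output-dir: <PROJECT_ROOT>/output/<material name>/
On startup the scripts print a path-derivation log, which makes it easy to confirm the result when debugging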
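The two derivation rules can be sketched with `pathlib`. This is not the content of scripts/project_paths.py; the function names are hypothetical, and only the environment variable name and the `parents[2]` rule come from the text above.

```python
import os
from pathlib import Path


def resolve_project_root(skill_dir, env=None):
    """Rule 1: explicit override. Rule 2: three levels above SKILL_DIR."""
    env = os.environ if env is None else env
    override = env.get("VOICEOVER_EDITING_PROJECT_ROOT")
    if override:
        return Path(override)
    # parents[0] is the direct parent, so parents[2] is three levels up.
    return Path(skill_dir).resolve().parents[2]


def resolve_output_dir(skill_dir, material_name, env=None):
    """Output-dir is <PROJECT_ROOT>/output/<material name>/."""
    return resolve_project_root(skill_dir, env) / "output" / material_name
```

For a skill directory at `/work/agents/skills/byted-mediakit-voiceover-editing`, rule 2 yields `/work`, whatever the two intermediate directories are called.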

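3.2 Deriving the material name

Source	Derivation rule	Example
URL	Take the last path segment and drop the extension	`https://x.com/video.mp4` → `video`
Local file	Take the file name and drop the extension	`/path/Test_Video_720p.mp4` → `Test_Video_720p`
DirectUrl	Take the FileName and drop the extension	`test.mp4` → `test`
Vid	Use the Vid value	`v0xxx` → `v0xxx`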
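The derivation rules in the table map directly onto the standard library. The sketch below is an illustration with hypothetical function names; dropping the URL query string before taking the last segment is an assumption, and only the final extension is removed.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def material_name_from_url(url):
    """Last path segment of the URL without its extension (query string ignored)."""
    return PurePosixPath(urlparse(url).path).stem


def material_name_from_file(path_or_filename):
    """File name without its extension; also fits a DirectUrl FileName."""
    return PurePosixPath(path_or_filename).stem


def material_name_from_vid(vid):
    """A Vid is used unchanged."""
    return vid


if __name__ == "__main__":
    print(material_name_from_url("https://x.com/video.mp4"))
    print(material_name_from_file("/path/Test_Video_720p.mp4"))
```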

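3.3 Deriving output-dir from context

Derivation priority (try in order):
1. --output-dir output/<subdirectory> was passed explicitly in the conversation history or command arguments → reuse it directly
2. It cannot be obtained from the conversation history → ask the user to specify it
The Agent must not scan the repository to infer output-dir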

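3.4 Handling existing outputs

Before writing any output file or directory, if the target already exists, the user must be prompted:

Directory already exists: "Delete the original directory? [Delete / Keep and create new (01)]"
File already exists: "Delete / overwrite / keep?"
If there is no answer within 20 seconds, the default is "Keep and create new (01)"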
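The "Keep and create new (01)" default can be sketched as picking the first free sibling directory. The `_01`, `_02` suffix format and the function name are assumptions for illustration; the source only shows "(01)" and does not specify how the new directory is named. The 20-second prompt itself is left out.

```python
from pathlib import Path


def next_available_dir(target):
    """Return target if it is free, else the first free sibling <name>_01, <name>_02, ..."""
    target = Path(target)
    if not target.exists():
        return target
    for index in range(1, 100):
        candidate = target.with_name(f"{target.name}_{index:02d}")
        if not candidate.exists():
            return candidate
    raise RuntimeError(f"no free directory name found next to {target}")
```

The existing directory is never touched, so this choice is safe as a timeout default.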

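4. Script List

Always cd <SKILL_DIR>/scripts before running anything

Script	Purpose
`./scripts/setup.sh`	Environment check and dependency installation
`./scripts/pipeline_url_to_asr.py`	Step 3: URL → ASR pipeline (supports `--mode local/cloud/apig`)
`./scripts/merge_asr_words.py`	Merges words from the raw ASR output when the Step 4 output lacks them
`./scripts/prepare_export_data.py`	Step 6a: data preprocessing (`--width` `--height` `--write-step6`)
`./scripts/serve_review_page.py`	Step 6b: static server for the review page + data saving + export proxy
`./scripts/export_server.py`	Export service (separate process that receives POSTs from the review page)
`./scripts/vod_direct_export.py`	Step 6c: VOD export task submission and query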

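5. Mandatory Steps

The complete checklist for each Step is in the per-step documents under references/执行步骤/.

Step	Description	Document
Step 1	Environment check and dependency installation	1. 环境检查.md
Step 2	Confirmation of filler/stutter words and rule update	2. 语气词提示与用户行为更新.md
Step 3	URL → ASR pipeline and candidate generation	3. URL到ASR流水线与候选生成.md
Step 4	ASR semantic correction (performed by the Agent)	4. ASR语义纠错.md
Step 5	Voiceover editing (performed by the Agent)	5. 口播剪辑.md
Step 5.5	Review logic confirmation	5.5 审核逻辑确认.md
Step 6a	Data preprocessing	6a. 数据预处理.md
Step 6b	Review and export	6b. 审核与导出.md
Step 6c	VOD export task submission and query	6c. VOD导出任务提交与查询.md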

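6. Artifact Reference Table

Artifact file	Produced in	Description
`step1_preuploaded.json`	Step 3	Material upload/registration result (includes `_execution_mode`)
`step3_voice_separation_result.json`	Step 3	Voice separation result
`step5_asr_raw_*.json`	Step 3	Raw ASR transcription
`step5_asr_optimized.json`	Step 4	ASR after semantic correction
`step6_speech_cut.json`	Step 5	Voiceover editing decisions
`review_import_data.json`	Step 6a	Review page data (includes `_execution_mode`, `track`, `sentences`)
`export_request.json`	Step 6a / review save	Export request (updated in sync when "Save" is clicked on the review page)
`export_submit_*.json`	Step 6b/6c	Export data as finally submitted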
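A CHECKPOINT verification could be as simple as checking that the files from this table exist in the output directory. The sketch below only tests for existence; the helper name is hypothetical, and whether every file is produced in every mode is not stated in the source, so treat the per-step lists as an assumption.

```python
from pathlib import Path

# File patterns per step, taken from the artifact table.
EXPECTED_ARTIFACTS = {
    "Step 3": ["step1_preuploaded.json",
               "step3_voice_separation_result.json",
               "step5_asr_raw_*.json"],
    "Step 4": ["step5_asr_optimized.json"],
    "Step 5": ["step6_speech_cut.json"],
    "Step 6a": ["review_import_data.json", "export_request.json"],
}


def missing_artifacts(output_dir, step):
    """Return the patterns for this step that match no file in output_dir."""
    out = Path(output_dir)
    return [pattern for pattern in EXPECTED_ARTIFACTS[step]
            if not any(out.glob(pattern))]
```

An empty list means the checkpoint passes; a non-empty list is the report to give the user before stopping.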

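7. Review Page and Data Synchronization

7.1 Mode awareness

The review page identifies the current mode from the _execution_mode field in review_import_data.json and adapts the interface accordingly:

Shows a mode badge (APIG blue / cloud green / local orange)
Adjusts the export button label (local mode shows "本地导出视频", i.e. "Export video locally")
Adjusts the export success message (local mode shows the output file path; cloud shows OutputVid + PlayURL)

7.2 Review page in local mode

Local mode fully supports the review page. The Source field uses the format http://127.0.0.1:<port>/local-media/<absolute path>, and the /local-media/ route of serve_review_page.py proxies access to the local file.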

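7.3 Data synchronization (review edits ↔ direct export)

The review page provides two action buttons:

Button	Function	Data flow
💾 保存审核 (Save review)	Persists the edits to disk	POST `/api/save-review` → updates `review_import_data.json` + regenerates `export_request.json`
导出 (Export)	Triggers the video export directly	POST `/export` → `apply_review_to_export` → `export_submit_*.json` → ffmpeg/VOD

Key point: after editing on the review page, the user clicks "💾 保存审核" to sync the edits to disk. From then on, even if the review page is closed, a direct export by the Agent through vod_direct_export.py --output-dir <output directory> submit --wait reads the updated export_request.json, so the data stays consistent.

⚠️ Key constraint: when calling vod_direct_export.py, --output-dir must come before the submit/query subcommand. One-line invocation format:
cd <SKILL_DIR>/scripts && source .venv/bin/activate && python vod_direct_export.py --output-dir <absolute path> submit --wait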
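An ordering constraint like this is typical of a parser that defines a global option on the top-level command and the actions as subcommands. The sketch below reproduces that behavior with `argparse`; it assumes, without confirmation from the source, that vod_direct_export.py is structured this way, and the parser layout is illustrative only.

```python
import argparse


def build_parser():
    """Top-level --output-dir plus submit/query subcommands (assumed layout)."""
    parser = argparse.ArgumentParser(prog="vod_direct_export.py")
    parser.add_argument("--output-dir", required=True)
    subparsers = parser.add_subparsers(dest="command", required=True)
    submit = subparsers.add_parser("submit")
    submit.add_argument("--wait", action="store_true")
    subparsers.add_parser("query")
    return parser


def parse_export_args(argv):
    """Parse argv; argparse exits with an error if --output-dir follows the subcommand."""
    return build_parser().parse_args(argv)


if __name__ == "__main__":
    args = parse_export_args(["--output-dir", "/abs/out", "submit", "--wait"])
    print(args.output_dir, args.command, args.wait)
```

With this layout, everything after `submit` is handed to the subcommand's parser, which does not know `--output-dir`, so the reversed order fails to parse.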

7.4 Review page service endpoints

Endpoint	Method	Description
`/`	GET	Review page HTML
`/api/review-data`	GET	Returns `review_import_data.json`
`/api/mode`	GET	Returns the current execution mode
`/api/save-review`	POST	Saves review edits (writes back review_import_data + regenerates export_request)
`/export`	POST	Triggers the export (local: ffmpeg; cloud/apig: vod_direct_export)
`/local-media/<path>`	GET	Media file proxy for local mode
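For scripted access to these endpoints, requests can be built with the standard library. The paths and methods come from the table; the JSON body, the Content-Type header, the port in the usage note, and the helper names are assumptions, since the source does not document the payload format.

```python
import json
from urllib.request import Request


def build_mode_request(base_url):
    """GET /api/mode on the review page server."""
    return Request(url=f"{base_url.rstrip('/')}/api/mode", method="GET")


def build_save_review_request(base_url, review_data):
    """POST /api/save-review with review_data serialized as JSON (assumed format)."""
    body = json.dumps(review_data, ensure_ascii=False).encode("utf-8")
    return Request(
        url=f"{base_url.rstrip('/')}/api/save-review",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

While serve_review_page.py is running, a request built this way can be sent with `urllib.request.urlopen(build_mode_request("http://127.0.0.1:8000"))`, where 8000 stands in for the actual port.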

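8. Troubleshooting

Symptom	Resolution
A local file was processed in DirectUrl mode	A local file must be passed as the first positional argument; `--directurl` is only for a FileName that already exists in the VOD space
Writing step5 fails	It must be written to `output/<file name>/step5_asr_optimized.json`; writing to the output root is forbidden
A concat rule marks content for deletion but the audio still plays	actionTime must be looked up from the step5 words, using the millisecond range of only the part that is kept
No prompt for an existing file	Check whether the target exists before writing and handle it according to rule 3.4
step6 corrections do not take effect	Make sure the top level of step6 is `optimized_segments` or `sentences`; run with `--write-step6` to write the result back
Segment start/end times are inaccurate	Step 6a corrects them from the step5 words
A delete is missing from deleted_parts	Every `action: delete` segment must have a matching entry in deleted_parts
Review page edits are lost after closing	Click "💾 保存审核" before closing to persist the edits to disk
Review page returns 404 for local resources	Confirm that the Source field uses the `/local-media/` URL format; check that `serve_review_page.py` is running normally
Blocked after a missing-variable prompt	`input()` is no longer used; when variables are missing the script downgrades automatically and prints a hint with the `.env` path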

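9. Subtitle Visibility (Alpha)

Field: textElement.Extra[transform].Alpha (0 to 1)
Meaning: 0 hides the subtitle (it is not rendered to the canvas), 1 shows it
Deleted state: set Alpha to 0; restore: set Alpha to 1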
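Toggling visibility then amounts to writing 0 or 1 into that field. The sketch below assumes that `Extra` is a list of dictionaries and that the transform entry is identified by `"Type": "transform"`; the source only gives the path textElement.Extra[transform].Alpha, so this structure and the function name are assumptions.

```python
def set_subtitle_visible(text_element, visible):
    """Set Alpha on the transform entry of Extra; return True if an entry was updated."""
    updated = False
    for entry in text_element.get("Extra", []):
        if entry.get("Type") == "transform":
            # 1 shows the subtitle, 0 hides it (deleted state).
            entry["Alpha"] = 1 if visible else 0
            updated = True
    return updated


if __name__ == "__main__":
    element = {"Text": "hello", "Extra": [{"Type": "transform", "Alpha": 1}]}
    set_subtitle_visible(element, False)
    print(element["Extra"][0]["Alpha"])
```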

Usage Guidance

This skill appears coherent for talking-head/video editing: it will call ASR endpoints and VOD upload APIs and may install heavy Python packages (torch, demucs, ffmpeg helpers). Before installing or running:

1) Do not put high-privilege credentials in the skill .env; use a dedicated minimal-permission ASR/VOD account or a test VOD space.
2) Run setup/install only in an isolated environment (container/VM), because pip will install large native packages.
3) Review requirements.txt and the scripts if you must comply with internal policy (they use requests and perform network uploads to ASR/VOD endpoints, which is expected).
4) Be aware that the skill may write EXECUTION_MODE into .env automatically during mode auto-downgrade.
5) Avoid pointing the skill at sensitive local files; it reads and writes files under project_root/output.

If you need higher assurance, run the pipeline in local mode (EXECUTION_MODE=local), which avoids sending media to cloud services but requires the local ASR and separation dependencies to be installed and available.

Capability Analysis

Type: OpenClaw Skill
Name: byted-mediakit-voiceover-editing
Version: 1.0.9

The skill bundle is a professional-grade tool for automated video editing using Volcano Engine (VOD) APIs and local AI models like Qwen3-ASR and Demucs. It implements a complete pipeline for audio separation, denoising, ASR transcription, and video assembly. While the bundle utilizes high-risk capabilities such as starting local HTTP servers (serve_review_page.py, export_server.py), executing shell commands via FFmpeg (ffmpeg_utils.py), and handling sensitive API credentials, these actions are strictly necessary for and aligned with the stated purpose of media processing. The code is well-structured, and no evidence of malicious intent, data exfiltration, or harmful prompt injection was identified.

Capability Assessment

✓ Purpose & Capability

Name/description (Volcano/Byted MediaKit talking‑head editing) match what the code and SKILL.md implement: ASR submission, candidate generation, review UI, VOD upload/export and local processing. The declared environment variables (VOLC_*, ASR_*, ARK_SKILL_*) and required permissions (network, file read/write, temp storage) are appropriate for those tasks.

ℹ Instruction Scope

Runtime instructions require the agent to run a controlled sequence of scripts (cd into scripts/, run setup.sh, pipeline_xxx, prepare_export_data, etc.), read process env and a skill .env, and write outputs under project output/. These steps are within the editing/export workflow. Note: setup.sh and the scripts may write/update the skill .env (e.g., EXECUTION_MODE), and the SKILL.md encourages the agent to edit .env via a file-write tool — that's expected for configuration but is a point to be conscious of (secrets stored in .env must be handled carefully).

✓ Install Mechanism

There is no external arbitrary binary download; setup.sh creates a Python venv and installs pinned packages from requirements.txt/requirements-local.txt on PyPI. This is normal for Python tooling but means the environment will install substantial packages (torch, demucs, ffmpeg wrapper etc.), so run in an isolated environment and review requirements if you need reproducible/secure installs.

✓ Credentials

Required secrets (ASR_API_KEY, VOLC_SPACE_NAME, optional VOLC_ACCESS_KEY_ID/SECRET or ARK_SKILL_API_*) map directly to ASR and VOD functions the skill performs. There are no unrelated credential requests. The skill reads both process env and .env; it may write EXECUTION_MODE into .env during automatic downgrade — this is plausible but users should avoid placing unrelated high‑privilege secrets in that .env.

✓ Persistence & Privilege

always:false and no requests to modify other skills or system-wide agent settings. The skill writes its own .env and creates a local virtualenv under scripts/.venv and output files under the project output/ — expected persistence for a local tool. Autonomous invocation (model calls) is allowed by default but not in itself a red flag.

Version History

v1.0.9

Version 1.0.9 adds full support for multi-mode (local/cloud/apig) execution of talking-head video editing, with flexible credential handling and path management.
- Added new environment variables and logic to support 3 execution modes: apig (SkillHub), cloud (Volcano Engine OpenAPI), and local (offline, no cloud needed).
- Introduced scripts for local processing: ASR, denoise, AV separation, media handling, and subtitle processing.
- Enhanced mode auto-detection with credential fallback and explicit .env control; "mode switch" commands by user are now supported.
- Revised project root and output directory resolution via new script, with clearer user prompts before overwriting outputs.
- Step-by-step flow, script list, and product file mapping have been updated in documentation.
- Security and separation of credential scope further emphasized.

v1.0.8

byted-mediakit-voiceover-editing 1.0.8
- Updated documentation in SKILL.md to clarify workflow, environment variable configuration, output directory rules, and repeat processing.
- Expanded and detailed execution steps, including step-by-step checkpoint requirements and error-handling mandates.
- Added strict rules on script usage, review page logic, output overwriting prompts, and timeout defaults.
- Clarified host Agent and script responsibilities, especially regarding candidate deletion suggestions and decision-making.
- Improved instructions for deducing output directories and handling duplicate outputs to avoid unintended overwrites.

Metadata

Slug byted-mediakit-voiceover-editing

Version 1.0.9

License MIT-0

All-time Installs 2

Active Installs 2

Total Versions 2

Frequently Asked Questions

What is Byted Mediakit Voiceover Editing?

Volcano Engine AI MediaKit talking-head video editing Skill: a one-stop workflow from environment setup through media management, audio processing, talking-h... It is an AI Agent Skill for Claude Code / OpenClaw, with 278 downloads so far.

How do I install Byted Mediakit Voiceover Editing?

Run "/install byted-mediakit-voiceover-editing" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Byted Mediakit Voiceover Editing free?

Yes, Byted Mediakit Voiceover Editing is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Byted Mediakit Voiceover Editing support?

Byted Mediakit Voiceover Editing is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Byted Mediakit Voiceover Editing?

It is built and maintained by Volc-AI-MediaKit (@volc-ai-mediakit); the current version is v1.0.9.
