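Description

Generates a range of AI video and speech content using the Alibaba Cloud DashScope API and Alibaba Cloud LingMou (灵眸). Seven capabilities: ① LivePortrait talking portrait (image + audio → talking video, two-step pipeline) ② EMO talking portrait ③ AA/AnimateAnyone full-body animation (three-step pipeline) ④ T2I text-to-image (Wan 2.x / 万相, default wan2.2-...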

Usage (SKILL.md)

Human Avatar: Alibaba Cloud AI video & speech generation

Name: Human Avatar
Author: davideuler

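Capability overview

Capability	Script	Model / API	Region	Summary
LivePortrait	`live_portrait.py`	`liveportrait`	cn-beijing	Portrait image + audio/video → talking video, two-step pipeline
EMO	`portrait_animate.py`	`emo-v1`	cn-beijing	Portrait image + audio → talking video, two steps: detect + generate
AA (AnimateAnyone)	`animate_anyone.py`	`animate-anyone-gen2`	cn-beijing	Full-body animation, three steps: image detection → motion template → video generation
T2I text-to-image	`text_to_image.py`	`wan2.x-t2i`	multi-region	Text description → image, default wan2.2-t2i-flash
I2V image-to-video	`image_to_video.py`	`wan2.x-i2v`	multi-region	Image → video, supports an end-to-end T2I→I2V run, default wan2.6-i2v-flash
Qwen TTS	`qwen_tts.py`	`qwen3-tts-*`	cn-beijing / Singapore	Text → speech, picks model and voice by scene, default qwen3-tts-vd-realtime-2026-01-15
LingMou digital human	`avatar_video.py`	LingMou SDK	cn-beijing	Template-based digital-human broadcast video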

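Quick selection guide

Talking portrait (audio/video already available)      → LivePortrait
Talking portrait (no audio; generate speech first)    → Qwen TTS → LivePortrait
Full-body dance / motion                              → AA (AnimateAnyone)
Image from a text prompt                              → T2I (text_to_image)
Video from an image                                   → I2V (image_to_video)
Text to video from scratch (end to end)               → T2I → I2V (image_to_video --t2i-prompt)
Enterprise digital human / template broadcast         → LingMou (avatar_video)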

Environment setup

pip install requests dashscope oss2 scipy numpy
# Extra packages for LingMou:
pip install alibabacloud-lingmou20250527 alibabacloud-tea-openapi

export DASHSCOPE_API_KEY=sk-xxxx               # API key for the Beijing region
export ALIBABA_CLOUD_ACCESS_KEY_ID=xxx         # used for OSS uploads
export ALIBABA_CLOUD_ACCESS_KEY_SECRET=xxx
export OSS_BUCKET=your-bucket
export OSS_ENDPOINT=oss-cn-beijing.aliyuncs.com

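⚠️ API keys for cn-beijing and the Singapore region are not interchangeable; make sure you use the key for the correct region.
OSS_ENDPOINT may be given with or without the https:// prefix; the scripts normalize it automatically.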
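
As a rough illustration of the setup above, the following sketch checks that the exported variables are present and normalizes an endpoint given with or without a scheme. The function names are hypothetical and this is not the scripts' actual code; the variable list simply mirrors the exports shown, and not every script necessarily needs all of them.

```python
import os

# Variable names taken from the export lines above.
REQUIRED_VARS = (
    "DASHSCOPE_API_KEY",
    "ALIBABA_CLOUD_ACCESS_KEY_ID",
    "ALIBABA_CLOUD_ACCESS_KEY_SECRET",
    "OSS_BUCKET",
    "OSS_ENDPOINT",
)


def missing_env_vars(env=None):
    """Return the names of variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def normalize_oss_endpoint(endpoint):
    """Accept an endpoint with or without a scheme; default to https://."""
    endpoint = endpoint.strip().rstrip("/")
    if endpoint.startswith(("http://", "https://")):
        return endpoint
    return "https://" + endpoint
```

Calling `missing_env_vars()` before running a script gives a quick list of what still needs to be exported.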

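1. LivePortrait: talking-portrait video

Use case: you have a portrait photo plus speech content and want to quickly generate a video of the person talking.

Pipeline:

Step 1: liveportrait-detect (sync)   → pass=true
  ↓
Step 2: liveportrait        (async)  → video_url

Image requirements: single-person frontal portrait, face clear and unobstructed
Audio requirements: wav/mp3, < 15 MB, 1 s to 3 min
Video input: audio is extracted automatically (ffmpeg)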
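
The audio limits above can be checked locally before uploading anything. This is a minimal sketch under stated assumptions: `audio_problems` is a hypothetical helper, the caller supplies the file size and duration (for example from ffprobe), and "15 MB" is read as 15 MiB.

```python
from pathlib import Path

MAX_AUDIO_BYTES = 15 * 1024 * 1024      # "< 15 MB", read as 15 MiB (assumption)
MIN_SECONDS, MAX_SECONDS = 1.0, 180.0   # 1 s to 3 min
ALLOWED_SUFFIXES = {".wav", ".mp3"}


def audio_problems(filename, size_bytes, duration_seconds):
    """Return a list of human-readable problems; empty means the audio fits the limits."""
    problems = []
    if Path(filename).suffix.lower() not in ALLOWED_SUFFIXES:
        problems.append("format must be wav or mp3")
    if size_bytes >= MAX_AUDIO_BYTES:
        problems.append("file must be smaller than 15 MB")
    if not MIN_SECONDS <= duration_seconds <= MAX_SECONDS:
        problems.append("duration must be between 1 s and 3 min")
    return problems
```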

# Image + audio file
python scripts/live_portrait.py \
  --image ./portrait.jpg \
  --audio ./speech.mp3 \
  --template normal --download

# Image + video (audio is extracted automatically)
python scripts/live_portrait.py \
  --image ./portrait.jpg \
  --video ./speech_video.mp4 \
  --template active --download

# Use public URLs directly
python scripts/live_portrait.py \
  --image-url "https://..." \
  --audio-url "https://..." \
  --mouth-strength 1.2 --download

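Motion templates:

normal (default, moderate motion)
calm (calm, suited to news broadcasts / storytelling)
active (lively, suited to singing / event hosting)

2. Qwen TTS: text-to-speech

Use case: you need to generate a speech file from text (for use with LivePortrait, EMO, and so on).

Default model: qwen3-tts-vd-realtime-2026-01-15

Automatic model selection by scene

Scene `--scene`	Recommended model	Recommended voice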
`default` / `brand`	`qwen3-tts-vd-realtime-2026-01-15`	Cherry
`news` / `documentary` / `advertising`	`qwen3-tts-instruct-flash-realtime`	Serena / Ethan
`audiobook` / `drama`	`qwen3-tts-instruct-flash-realtime`	Cherry / Dylan
`customer_service` / `chatbot` / `education`	`qwen3-tts-flash-realtime`	Anna / Ethan
`ecommerce` / `short_video`	`qwen3-tts-flash-realtime`	Cherry / Chelsie
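
The table above amounts to a lookup from scene to model and voice. The sketch below shows one way to express it; `pick_model_and_voice` is a hypothetical name, and two behaviours are assumptions rather than documented facts: the first listed voice is used when a row names two, and unknown scenes fall back to `default`.

```python
# Rows copied from the scene table: (scenes, model, recommended voices).
_GROUPS = [
    (("default", "brand"), "qwen3-tts-vd-realtime-2026-01-15", ("Cherry",)),
    (("news", "documentary", "advertising"), "qwen3-tts-instruct-flash-realtime", ("Serena", "Ethan")),
    (("audiobook", "drama"), "qwen3-tts-instruct-flash-realtime", ("Cherry", "Dylan")),
    (("customer_service", "chatbot", "education"), "qwen3-tts-flash-realtime", ("Anna", "Ethan")),
    (("ecommerce", "short_video"), "qwen3-tts-flash-realtime", ("Cherry", "Chelsie")),
]

# Flatten the rows into one dict keyed by scene name.
SCENE_PRESETS = {
    scene: (model, voices)
    for scenes, model, voices in _GROUPS
    for scene in scenes
}


def pick_model_and_voice(scene="default", voice=None):
    """Return (model, voice); an explicit voice overrides the table's suggestion."""
    model, voices = SCENE_PRESETS.get(scene, SCENE_PRESETS["default"])
    return model, voice or voices[0]
```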

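Available voices

Voice	Characteristics
`Cherry`	Lively, sweet female voice; ads / audiobooks / dubbing
`Serena`	Mature, intellectual female voice; news / explainers / corporate image
`Ethan`	Steady, warm male voice; education / documentaries / training
`Dylan`	Expressive male voice; radio drama / game voice-over
`Anna`	Gentle, friendly female voice; customer service / assistants / everyday use
`Chelsie`	Young, fresh female voice; short video / e-commerce
`Thomas`	Deep, magnetic male voice; brand promotion / ads
`Luna`	Warm, soft female voice; meditation / storytelling

# Default generation (qwen3-tts-vd-realtime + Cherry)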
python scripts/qwen_tts.py --text "你好，欢迎使用千问语音" --download

# Automatic matching by scene
python scripts/qwen_tts.py --text "今日股市..." --scene news --download
python scripts/qwen_tts.py --text "从前有个..." --scene audiobook --download

# Control tone / style with instructions
python scripts/qwen_tts.py \
  --text "亲爱的同学们..." \
  --model qwen3-tts-instruct-flash-realtime \
  --instructions "语调温和，节奏平稳，适合教学场景" \
  --download

# Show all options
python scripts/qwen_tts.py --list-voices
python scripts/qwen_tts.py --list-models

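3. T2I text-to-image: Wan 2.x (万相)

Use case: generate high-quality images from a text description (optionally followed by I2V to produce a video).

# Default model (wan2.2-t2i-flash, fast)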
python scripts/text_to_image.py \
  --prompt "一位穿汉服的女性站在桃花林中，电影感，4K，柔和光线" \
  --size 960*1696 --download

# High-quality model
python scripts/text_to_image.py \
  --prompt "..." --model wan2.2-t2i-plus --size 1280*1280 --download

# Latest model (Wan 2.6)
python scripts/text_to_image.py \
  --prompt "..." --model wan2.6-t2i --size 1280*1280 --n 1 --download

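Model selection:

wan2.2-t2i-flash (default, fast, good for testing)
wan2.2-t2i-plus (higher quality)
wan2.6-t2i (latest, supports more aspect ratios, synchronous call)

Common sizes: 1280*1280 (1:1) / 960*1696 (9:16 portrait) / 1696*960 (16:9 landscape)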
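
For convenience, the common sizes listed above can be kept in a small lookup so a caller can pass an aspect ratio instead of remembering pixel values. This is only a sketch: the helper names are hypothetical, and the `width*height` string format is taken from the `--size` examples in this section.

```python
# Aspect ratio -> value for --size, as listed under "Common sizes".
COMMON_SIZES = {
    "1:1": "1280*1280",
    "9:16": "960*1696",
    "16:9": "1696*960",
}


def size_for_ratio(ratio):
    """Return the preset size string for an aspect ratio such as '9:16'."""
    try:
        return COMMON_SIZES[ratio]
    except KeyError:
        raise ValueError(
            f"no preset for {ratio!r}; choose one of {sorted(COMMON_SIZES)}"
        ) from None


def parse_size(size):
    """Split a 'width*height' string into two integers."""
    width, height = (int(part) for part in size.split("*"))
    return width, height
```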

4. I2V image-to-video: Wan 2.x (万相)

Use case: turn an image into a moving video; an end-to-end run from text to video is also supported.

# Local image → video
python scripts/image_to_video.py \
  --image ./portrait.jpg \
  --prompt "她缓缓转身微笑，裙摆飘动，花瓣轻轻飞舞" \
  --model wan2.6-i2v-flash \
  --resolution 720P --duration 5 --download

# 🔥 End to end: text → image → video
python scripts/image_to_video.py \
  --t2i-prompt "一位穿汉服的女性站在桃花林中" \
  --prompt "她缓缓转身，花瓣飘落，唯美意境" \
  --download --output result.mp4

# With background music
python scripts/image_to_video.py \
  --image ./portrait.jpg \
  --audio-url "https://..." \
  --prompt "..." --download

Model selection:

wan2.6-i2v-flash (default, includes sound effects, supports 5 s / 10 s)
wan2.5-i2v-preview (high-quality preview release)
wan2.2-i2v-plus (silent, faster)

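5. AA AnimateAnyone: full-body animation

Use case: you have a full-body photo of a person plus a reference motion video and want a video of that person dancing or moving.

Requirements:

Image: single person, full body, frontal, complete from head to toe, aspect ratio 0.5 to 2.0
Video: full body in frame and visible from the first frame onward, mp4/avi/mov, fps ≥ 24, 2 to 60 s

Three-step pipeline:

Step 1: animate-anyone-detect-gen2   (sync)   → check_pass=true
  ↓
Step 2: animate-anyone-template-gen2 (async)  → template_id (about 3 to 5 minutes)
  ↓
Step 3: animate-anyone-gen2          (async)  → video_url (about 3 to 5 minutes)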
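
Because Steps 2 and 3 are asynchronous and each takes minutes, a caller has to poll until the task finishes. The sketch below shows that pattern in isolation. It is not the script's code: `fetch_status` is a hypothetical callable that returns a dict with a `task_status` key, and the status names are assumed to follow DashScope's usual async task states.

```python
import time

# Assumed terminal states of an async task.
TERMINAL_OK = {"SUCCEEDED"}
TERMINAL_FAILED = {"FAILED", "CANCELED", "UNKNOWN"}


def wait_for_task(fetch_status, interval=10.0, timeout=900.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until the task succeeds, fails, or times out.

    fetch_status is a caller-supplied function (hypothetical) that queries
    the task once and returns a dict containing "task_status".
    """
    deadline = clock() + timeout
    while True:
        result = fetch_status()
        status = result.get("task_status")
        if status in TERMINAL_OK:
            return result
        if status in TERMINAL_FAILED:
            raise RuntimeError(f"task ended with status {status}")
        if clock() >= deadline:
            raise TimeoutError("task did not finish in time")
        sleep(interval)
```

With each async step taking roughly 3 to 5 minutes, a 900-second timeout per step leaves some headroom; both the timeout and the 10-second interval are illustrative defaults.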

# Local files (automatic format conversion + OSS upload)
python scripts/animate_anyone.py \
  --image ./portrait_fullbody.jpg \
  --video ./dance.mp4 \
  --download --output result.mp4

# Generate using the image as the background
python scripts/animate_anyone.py \
  --image ./portrait.jpg --video ./dance.mp4 \
  --use-ref-img-bg --video-ratio 9:16 --download

# Skip Step 2 (template_id already available)
python scripts/animate_anyone.py \
  --image ./portrait.jpg \
  --template-id "AACT.xxx.xxx" --download

Automatic format conversion: video webm/mkv/flv → mp4; image webp/heic → jpg; fps < 24 → 24 fps
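
The conversion rules and input requirements for this section can be summarized as a few pure functions. This is a sketch of the decision logic only, with hypothetical names. It does not invoke ffmpeg and is not taken from `animate_anyone.py`; rejecting any other video extension is an assumption.

```python
from pathlib import Path

ACCEPTED_VIDEO = {".mp4", ".avi", ".mov"}      # usable as-is
CONVERTIBLE_VIDEO = {".webm", ".mkv", ".flv"}  # converted to mp4
CONVERTIBLE_IMAGE = {".webp", ".heic"}         # converted to jpg
MIN_FPS = 24


def plan_video_conversion(filename, fps):
    """List the conversions a video needs (empty list if none)."""
    suffix = Path(filename).suffix.lower()
    steps = []
    if suffix in CONVERTIBLE_VIDEO:
        steps.append("container -> mp4")
    elif suffix not in ACCEPTED_VIDEO:
        raise ValueError(f"unsupported video format: {suffix}")
    if fps < MIN_FPS:
        steps.append("fps -> 24")
    return steps


def plan_image_conversion(filename):
    """List the conversions an image needs (empty list if none)."""
    suffix = Path(filename).suffix.lower()
    return ["image -> jpg"] if suffix in CONVERTIBLE_IMAGE else []


def input_problems(image_width, image_height, video_seconds):
    """Check the aspect-ratio (0.5 to 2.0) and duration (2 to 60 s) limits."""
    problems = []
    if not 0.5 <= image_width / image_height <= 2.0:
        problems.append("image aspect ratio must be between 0.5 and 2.0")
    if not 2 <= video_seconds <= 60:
        problems.append("video must be 2 to 60 s long")
    return problems
```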

6. EMO: talking portrait (legacy)

Note: LivePortrait is recommended as the first choice; EMO suits scenarios that demand high lip-sync accuracy.

python scripts/portrait_animate.py \
  --image ./portrait.jpg \
  --audio ./speech.mp3 \
  --download

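7. LingMou digital human: enterprise template video

Use case: enterprise digital-human broadcasts, templated news videos, and template broadcast videos generated from an uploaded portrait combined with a spoken script.

New workflow (prefers running without a `template_id`)

If the user provides a template_id: generate directly with that template
If the user does not provide a template_id:
1. First list the broadcast templates already in the account
2. If there are templates, pick one at random to work with
3. If there are none, try to fetch the public templates and copy up to 3 of them into the current account
4. Pick one of the copies at random and continue generating
Note, however: after a public template is copied successfully, the copy is not necessarily a mature template that can generate video right away; some copies are still drafts and may lack valid clips, assets, or variable bindings, which have to be completed on the LingMou side
If the user only provides an image and a request like "make a talking video" with no explicit script: confirm the script with the user first, then continue generating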
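
The template-selection order described above can be sketched as a single function. All three callables (`list_own`, `list_public`, `copy_public`) are hypothetical stand-ins supplied by the caller, not LingMou SDK functions, and the sketch covers only the choice of template, not video generation.

```python
import random

MAX_PUBLIC_COPIES = 3  # "copy up to 3 public templates"


def choose_template(template_id, list_own, list_public, copy_public, rng=random):
    """Pick a template ID following the workflow's priority order.

    list_own()       -> list of template IDs already in the account
    list_public()    -> list of public template IDs
    copy_public(pid) -> ID of the copy created in the account
    (all three are caller-supplied, hypothetical callables)
    """
    if template_id:                       # user gave one: use it directly
        return template_id
    own = list_own()
    if own:                               # otherwise prefer existing templates
        return rng.choice(own)
    public = list_public()[:MAX_PUBLIC_COPIES]
    copied = [copy_public(pid) for pid in public]
    if not copied:
        raise LookupError("no template in the account and none could be copied")
    return rng.choice(copied)
```

Since a copied public template may still be a draft, a caller would still need to report a generation failure clearly after this step rather than assume the chosen template is usable.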

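Current script capabilities

scripts/avatar_video.py now supports:

--list-templates: list the templates already in the account
--list-public-templates: list public templates (SDK 1.7.0+)
--copy-public-templates: copy up to 3 public templates (SDK 1.7.0+)
Omitting --template-id: pick an existing template at random
When the account has no templates: automatically try copying public templates as a fallback
--show-template-detail: show template details and replaceable variables
Automatically fills the input script into the template's text variable (preferring text_content / test_text)
When generation fails right after copying a public template, reports a clear error telling the user the template still needs to be completed, rather than failing silently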
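
The "fill the text variable, preferring text_content / test_text" behaviour listed above could look roughly like the following. This is a sketch under loud assumptions: the real template detail structure is not shown in this document, so `variables` is a hypothetical dict of variable name to type string, and both helper names are made up for illustration.

```python
# Preferred variable names, in the order given in the capability list.
PREFERRED_TEXT_VARS = ("text_content", "test_text")


def pick_text_variable(variables):
    """Return the name of the text variable to fill, or None if there is none.

    variables: dict mapping variable name -> type string such as "text"
    (hypothetical shape, not the SDK's actual response format).
    """
    for name in PREFERRED_TEXT_VARS:
        if variables.get(name) == "text":
            return name
    for name, kind in variables.items():   # otherwise take the first text variable
        if kind == "text":
            return name
    return None


def fill_text(variables, script):
    """Build the {variable: script} mapping, failing loudly when nothing fits."""
    name = pick_text_variable(variables)
    if name is None:
        raise ValueError("template has no text variable to fill")
    return {name: script}
```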

# List existing templates
python scripts/avatar_video.py --list-templates

# List public templates (SDK 1.7.0+)
python scripts/avatar_video.py --list-public-templates

# Manually copy up to 3 public templates (SDK 1.7.0+)
python scripts/avatar_video.py --copy-public-templates

# No template_id given: pick an existing template at random for the broadcast
python scripts/avatar_video.py \
  --text "大家好，欢迎收看今天的科技新闻。" \
  --download

# With an explicit template_id
python scripts/avatar_video.py \
  --template-id "BS1b2WNnRMu4ouRzT4clY9Jhg" \
  --text "大家好，欢迎收看今天的科技新闻。" \
  --download

# Show details of the randomly selected template
python scripts/avatar_video.py \
  --show-template-detail \
  --text "这是一段测试播报文案"

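Conversational usage conventions

When the user says something like:

"Make a talking video from this image"
"Make me a digital-human talking video"
"Upload an image and make a broadcast video"

follow this flow:

Determine whether the user has already provided a script that can be broadcast as-is
If not, ask a follow-up question first: "What exactly should the script say? You can also give me just the key points and I will turn them into a script suited to broadcasting."
Once you have the script, run the LingMou flow: prefer a random existing template; if the account has none, try copying public templates
If the user uploaded a portrait but the template-based LingMou API does not need it, tell the user plainly: this path relies mainly on templates; to force the use of the user's own image for a talking portrait, switch to LivePortrait / EMO

API reference documentation

LivePortrait: https://help.aliyun.com/zh/model-studio/liveportrait-api
EMO (emo-detect + emo-v1): references/emo-api.md
AA (Animate Anyone): references/aa-api.md
T2I (text-to-image V2): https://help.aliyun.com/zh/model-studio/text-to-image-v2-api-reference
I2V (image-to-video): https://help.aliyun.com/zh/model-studio/image-to-video-api-reference/
Qwen TTS: https://help.aliyun.com/zh/model-studio/qwen-tts-realtime
LingMou (灵眸): references/lingmou-api.md
OSS upload: references/oss-upload.md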

Security recommendations

This skill appears internally coherent and does what it claims: it converts media (ffmpeg), uploads user media to your Alibaba OSS bucket, and calls DashScope / LingMou APIs using the credentials you provide. Before installing or running it:
1) Only provide DASHSCOPE_API_KEY and AK/SK that you trust; prefer a dedicated/test Alibaba account and a dedicated OSS bucket with lifecycle rules and limited permissions.
2) Be aware that uploaded files are sent to Alibaba endpoints (DashScope/LingMou) and their generated signed URLs are used for processing.
3) Rotate keys after testing and avoid using high-privilege or production credentials.
4) Review and run the scripts in an isolated environment first (non-production account) to confirm behavior and billing implications.
5) If you need stronger assurance, run a line-by-line review or sandboxed execution. The code uses subprocess for ffmpeg, base64 decoding for audio streams, and standard Alibaba SDK calls, all of which are expected for this skill.

Functional analysis

Type: OpenClaw Skill
Name: human-avatar
Version: 1.6.0

The skill bundle is a legitimate toolset for generating AI video and audio content using Alibaba Cloud's DashScope and LingMou APIs. It utilizes subprocess calls to 'ffmpeg' and 'ffprobe' for media format conversion and audio extraction, which are handled safely without shell execution. The scripts correctly manage Alibaba Cloud credentials (AK/SK) and API keys via environment variables for OSS uploads and API authentication, and the bundle includes a 'SECURITY.md' file that transparently explains these technical choices. No evidence of data exfiltration, malicious prompt injection, or unauthorized execution was found across the scripts (e.g., live_portrait.py, qwen_tts.py, avatar_video.py).

Capability assessment

✓ Purpose & Capability

Name/description (Human Avatar using DashScope/LingMou) match what the files and SKILL.md do: call DashScope APIs, use Qwen TTS, call LingMou SDK, and upload media to the user's OSS bucket. Required binaries (ffmpeg/ffprobe) and env vars (DASHSCOPE_API_KEY, ALIBABA_CLOUD AK/SK, OSS_BUCKET/OSS_ENDPOINT) are appropriate and expected for these operations.

✓ Instruction Scope

SKILL.md and the scripts explicitly instruct uploading local media to the user's OSS, converting media with ffmpeg, and calling DashScope/LingMou endpoints. The instructions do not request unrelated files, system secrets, or external endpoints beyond Alibaba Cloud and user OSS. The scripts perform polling, uploading, and signed-URL generation only — behavior stays within the stated scope.

✓ Install Mechanism

There is no automated install that downloads arbitrary code; SKILL.md recommends pip packages from standard registries. Code execution is local via provided scripts. No suspicious external download URLs or archive extraction are present in the manifest.

ℹ Credentials

The skill requires multiple sensitive credentials (DashScope API key, Alibaba AK/SK and OSS info). These are proportionate to the advertised features (DashScope TTS/vision APIs require DASHSCOPE_API_KEY; LingMou and OSS uploads require AK/SK and bucket). However, granting AK/SK + OSS bucket gives the skill the ability to upload files to your bucket, generate signed URLs, and use your account resources — this is expected but high-sensitivity. Use least-privilege credentials and a dedicated/test bucket when possible.

✓ Persistence & Privilege

The skill is not marked always:true and does not request persistent registry-level privileges. It does not modify other skills or system-wide agent settings. Some scripts reference creating a local virtualenv for SDK testing (optional), but there is no installation-time persistence or self-enabling present.

Version history

v1.6.0

Clean release: add explicit registry metadata for required env vars and ffmpeg/ffprobe, align security scan expectations, restore clean description, and include the latest LingMou template workflow improvements.

v1.1.2

Publish test for summary freshness.

v1.1.1

Upgrade LingMou avatar_video: support random existing template selection, public template listing/copy fallback, template variable auto-detection, verified copied public-template video generation, and dialog flow for confirming script before generation.

v1.5.4

Big update: adds multiple new AI video, image, and speech generation features. The skill is now a comprehensive toolkit for Aliyun-based multimedia content creation.
- Added 5 new scripts: image-to-video, live-portrait, qwen TTS (text-to-speech), text-to-image, and a security policy.
- Now supports seven capabilities including: LivePortrait talking head, EMO, AnimateAnyone full-body animation, Text2Image, Image2Video, Qwen TTS, and LingMou template videos.
- Extended documentation with quick selection guide, detailed usage examples, and updated API/model options.
- Improved environment setup instructions and added new supported models, templates, and parameter options.
- The skill now enables end-to-end workflows from text to synthesized video or speech.

v1.5.3

Include scripts/ and references/ in package: live_portrait.py, qwen_tts.py, text_to_image.py, image_to_video.py, animate_anyone.py, portrait_animate.py, avatar_video.py + aa-api.md, emo-api.md, oss-upload.md, lingmou-api.md

v1.5.2

Security: add SECURITY.md explaining all scanner-flagged patterns (subprocess=ffmpeg only, base64=audio decode only, OSS creds=user's own bucket only); remove __pycache__; add .clawhubignore

v1.5.1

Update skill description: 7 capabilities (LivePortrait talking portrait, Qwen TTS text-to-speech, T2I text-to-image, I2V image-to-video, AA full-body animation, EMO talking portrait, LingMou digital human)

v1.5.0

Complete SKILL.md rewrite: full docs for all 7 capabilities (LivePortrait, Qwen TTS, T2I, I2V, AA AnimateAnyone, EMO, LingMou digital human); quick-select guide; model comparison tables; voice guide; API references

v1.4.0

Add Qwen TTS: text-to-speech with auto scene→model→voice selection (default: qwen3-tts-vd-realtime-2026-01-15); supports 8 scenes, 8 voices, instructions control, PCM→WAV conversion

v1.3.0

Add LivePortrait: portrait image + audio/video → animated portrait video; auto audio extraction from video; ffmpeg format conversion; 2-step pipeline (detect→generate)

v1.2.0

Add T2I (wan2.x text-to-image, default: wan2.2-t2i-flash) + I2V (wan2.x image-to-video, default: wan2.6-i2v-flash); T2I→I2V one-shot pipeline; update SKILL.md

v1.1.0

Fix AA to 3-step Gen2 pipeline (detect→template→generate); auto ffmpeg format conversion; signed OSS URLs; updated docs

v0.1.0

Initial release: EMO/AA/LingMou APIs + demo pipeline

Metadata

Slug human-avatar

Version 1.6.0

License MIT-0

Total installs 1

Current installs 1

Number of versions 13

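FAQ

What is Human Avatar?

It generates a range of AI video and speech content using the Alibaba Cloud DashScope API and Alibaba Cloud LingMou (灵眸). Seven capabilities: ① LivePortrait talking portrait (image + audio → talking video, two-step pipeline) ② EMO talking portrait ③ AA/AnimateAnyone full-body animation (three-step pipeline) ④ T2I text-to-image (Wan 2.x / 万相, default wan2.2-... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 382 cumulative downloads so far.

How do I install Human Avatar?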

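Run the command "/install human-avatar" in the OpenClaw or Claude Code chat box to install it in one step. Installation itself needs no extra configuration; running the scripts still requires the environment variables listed under Environment setup.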

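Is Human Avatar free?

Yes. Human Avatar is completely free and released under the MIT-0 license, so you can download, install, and use it freely.

Which platforms does Human Avatar support?

Human Avatar is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who develops Human Avatar?

It is developed and maintained by david l euler (@davideuler); the current version is v1.6.0.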
