Description

Bring any still photo to life with ai-talking-photo, the skill that syncs facial animation to audio and makes portraits speak, sing, or narrate. Upload a fac...

Usage Instructions (SKILL.md)

Getting Started

Welcome! With AI Talking Photo, you can turn any portrait into a speaking, animated video in moments. Upload your photo and audio clip (or tell me what you'd like the subject to say) and let's bring it to life!

Try saying:

"Animate this portrait with my audio"
"Make my headshot say this script"
"Create talking photo for social reel"

Quick Start Setup

This skill connects to a cloud processing backend. On first use, set up the connection automatically and let the user know ("Connecting...").

Token check: Look for NEMO_TOKEN in the environment. If found, skip to session creation. Otherwise:

Generate a UUID as client identifier
POST https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token with X-Client-Id header
Extract data.token from the response — this is your NEMO_TOKEN (100 free credits, 7-day expiry)
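The token check and anonymous fallback above can be sketched in Python. This is a minimal sketch under stated assumptions: the helper names (`build_anonymous_token_request`, `extract_token`, `resolve_token`) are hypothetical, the request is assumed to carry an empty body, and the injectable `send` parameter exists only so the flow can be exercised without a network call.

```python
import json
import os
import urllib.request
import uuid

AUTH_URL = "https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token"


def build_anonymous_token_request(client_id: str) -> urllib.request.Request:
    """Build (but do not send) the anonymous-token POST request."""
    return urllib.request.Request(
        AUTH_URL,
        data=b"",  # assumption: the endpoint needs no request body
        headers={"X-Client-Id": client_id},
        method="POST",
    )


def extract_token(response_body: str) -> str:
    """Pull data.token out of the JSON response body."""
    return json.loads(response_body)["data"]["token"]


def resolve_token(env=os.environ, send=None) -> str:
    """Return NEMO_TOKEN from the environment, else fetch an anonymous token."""
    token = env.get("NEMO_TOKEN")
    if token:
        return token
    request = build_anonymous_token_request(str(uuid.uuid4()))
    if send is None:
        # Default transport: a real HTTP call via the standard library.
        def send(req):
            with urllib.request.urlopen(req, timeout=30) as resp:
                return resp.read().decode("utf-8")
    return extract_token(send(request))
```

Passing a stub for `send` keeps the decision logic testable; in normal use `resolve_token()` is called with no arguments.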

Session: POST https://mega-api-prod.nemovideo.ai/api/tasks/me/with-session/nemo_agent with Bearer auth and body {"task_name":"project"}. Keep the returned session_id for all operations.
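A minimal sketch of the session request, assuming the standard library's `urllib` as the HTTP client. The function names are hypothetical, and the location of `session_id` in the response (top level or under `data`) is an assumption, since the text only says the value is returned.

```python
import json
import urllib.request

SESSION_URL = "https://mega-api-prod.nemovideo.ai/api/tasks/me/with-session/nemo_agent"


def build_session_request(token: str, task_name: str = "project") -> urllib.request.Request:
    """Build (but do not send) the create-session POST with Bearer auth."""
    body = json.dumps({"task_name": task_name}).encode("utf-8")
    return urllib.request.Request(
        SESSION_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_session_id(response_body: str) -> str:
    """Return session_id from the response body.

    Assumption: the field sits either at the top level or under "data".
    """
    payload = json.loads(response_body)
    container = payload.get("data", payload)
    return container["session_id"]
```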

Let the user know with a brief "Ready!" when setup is complete. Don't expose tokens or raw API output.

Make Any Photo Speak With One Upload

Name: Ai Talking Photo
Author: whitejohnk-26

Static photos hold stories that never get told. AI Talking Photo changes that by turning a single still image into an animated, speaking portrait — no camera, no studio, no video shoot required. Whether it's a historical figure, a product mascot, a family member, or your own headshot, this skill breathes voice and movement into the image in seconds.

The process is straightforward: provide a clear face photo and either an audio file or a text script you want spoken. The skill analyzes the facial geometry, maps lip movements to the audio waveform, and generates a short video where the subject appears to genuinely speak. Subtle head motion, eye blinks, and micro-expressions are layered in to avoid the uncanny stiffness of early deepfake tools.

Creators use this for memorial tribute videos, branded spokesperson content, educational history lessons, social media reels, and interactive storytelling. If you can photograph a face, you can give it a voice — that's the core promise of AI Talking Photo.

Routing Animate Portrait Requests

When a user submits a still image with a voice or script input, the skill parses the facial detection parameters and animation style preferences before dispatching the job to the appropriate talking photo pipeline endpoint.

User says...	Action	Skip SSE?
"export" / "导出" / "download" / "send me the video"	→ §3.5 Export	✅
"credits" / "积分" / "balance" / "余额"	→ §3.3 Credits	✅
"status" / "状态" / "show tracks"	→ §3.4 State	✅
"upload" / "上传" / user sends file	→ §3.2 Upload	✅
Everything else (generate, edit, add BGM…)	→ §3.1 SSE	❌
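The routing table can be read as a keyword lookup. The sketch below is one possible reading, not the skill's actual matcher: it assumes case-insensitive substring matching and checks keywords before the file attachment, both of which are illustrative choices. The `route` function and the action names are hypothetical.

```python
# Each entry: (trigger phrases, action name, whether to skip the SSE call).
ROUTES = [
    (("export", "导出", "download", "send me the video"), "export", True),
    (("credits", "积分", "balance", "余额"), "credits", True),
    (("status", "状态", "show tracks"), "state", True),
    (("upload", "上传"), "upload", True),
]


def route(message: str, has_file: bool = False) -> tuple[str, bool]:
    """Return (action, skip_sse) for a user message."""
    text = message.lower()
    for keywords, action, skip_sse in ROUTES:
        if any(keyword in text for keyword in keywords):
            return action, skip_sse
    if has_file:
        # A bare file attachment is treated as an upload.
        return "upload", True
    # Everything else (generate, edit, add BGM...) goes through SSE.
    return "sse", False
```

Substring matching is deliberately loose here; a word such as "rebalance" would also hit the credits route, so a production matcher would likely want word boundaries or intent classification.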

Talking Photo API Reference

The cloud processing backend handles facial landmark mapping, lip-sync synthesis, and expression blending on remote GPU clusters, meaning heavy rendering never touches the local device. Completed animated portrait outputs are returned as video streams or downloadable clips once the synthesis job finalizes.

Skill attribution — read from this file's YAML frontmatter at runtime:

X-Skill-Source: ai-talking-photo
X-Skill-Version: from frontmatter version
X-Skill-Platform: detect from install path (~/.clawhub/ → clawhub, ~/.cursor/skills/ → cursor, else unknown)

All requests must include: Authorization: Bearer <NEMO_TOKEN>, X-Skill-Source, X-Skill-Version, X-Skill-Platform. Missing attribution headers will cause export to fail with 402.
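As an illustration of the header rules above, the following sketch maps an install path to a platform value and assembles the four required headers. The function names are hypothetical, the version is passed in rather than parsed from the frontmatter, and POSIX-style paths are assumed.

```python
from pathlib import PurePosixPath


def detect_platform(install_path: str, home: str) -> str:
    """Map the skill's install path to an X-Skill-Platform value."""
    path = PurePosixPath(install_path)
    root = PurePosixPath(home)
    if path.is_relative_to(root / ".clawhub"):
        return "clawhub"
    if path.is_relative_to(root / ".cursor" / "skills"):
        return "cursor"
    return "unknown"


def build_headers(token: str, version: str, platform: str) -> dict:
    """Assemble the headers every request must carry."""
    return {
        "Authorization": f"Bearer {token}",
        "X-Skill-Source": "ai-talking-photo",
        "X-Skill-Version": version,
        "X-Skill-Platform": platform,
    }
```

`PurePosixPath.is_relative_to` requires Python 3.9 or later.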

API base: https://mega-api-prod.nemovideo.ai

Create session: POST /api/tasks/me/with-session/nemo_agent with body {"task_name":"project","language":"<lang>"}; returns task_id, session_id.

Send message (SSE): POST /run_sse with body {"app_name":"nemo_agent","user_id":"me","session_id":"<sid>","new_message":{"parts":[{"text":"<msg>"}]}} and header Accept: text/event-stream. Max timeout: 15 minutes.

Upload: POST /api/upload-video/nemo_agent/me/<sid>. File: multipart -F "files=@/path"; or URL: {"urls":["<url>"],"source_type":"url"}

Credits: GET /api/credits/balance/simple — returns available, frozen, total

Session state: GET /api/state/nemo_agent/me/<sid>/latest. Key fields: data.state.draft, data.state.video_infos, data.state.generated_media

Export (free, no credits): POST /api/render/proxy/lambda with body {"id":"render_<ts>","sessionId":"<sid>","draft":<json>,"output":{"format":"mp4","quality":"high"}}. Poll GET /api/render/proxy/lambda/<id> every 30s until status = completed. Download URL at output.url.

Supported formats: mp4, mov, avi, webm, mkv, jpg, png, gif, webp, mp3, wav, m4a, aac.
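The export step described in this reference is a submit-then-poll loop. Here is a minimal sketch of that loop with the HTTP call injected as `fetch_status`, so it can be run offline. Several details are assumptions: the `<ts>` placeholder is treated as a Unix timestamp, `status` and `output.url` are read from the top level of the polled object, the `max_polls` cap is illustrative, and failure statuses are not handled because the text does not name any.

```python
import time
from typing import Callable


def build_render_body(session_id: str, draft: dict, timestamp: int) -> dict:
    """Assemble the export request body."""
    return {
        "id": f"render_{timestamp}",  # assumption: <ts> is a Unix timestamp
        "sessionId": session_id,
        "draft": draft,
        "output": {"format": "mp4", "quality": "high"},
    }


def poll_render(
    fetch_status: Callable[[], dict],
    interval_s: float = 30.0,
    max_polls: int = 40,  # hypothetical cap so the loop always terminates
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until status is "completed", then return the download URL."""
    for attempt in range(max_polls):
        result = fetch_status()
        if result.get("status") == "completed":
            return result["output"]["url"]
        if attempt < max_polls - 1:
            sleep(interval_s)
    raise TimeoutError("render did not complete within the polling budget")
```

In practice `fetch_status` would wrap a GET to /api/render/proxy/lambda/<id> with the required headers.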

SSE Event Handling

Event	Action
Text response	Apply GUI translation (§4), present to user
Tool call/result	Process internally, don't forward
`heartbeat` / empty `data:`	Keep waiting. Every 2 min: "⏳ Still working..."
Stream closes	Process final response

~30% of editing operations return no text in the SSE stream. When this happens: poll session state to verify the edit was applied, then summarize changes to the user.
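The event handling above can be sketched as a small parser over the raw stream lines. The request body follows the /run_sse shape given in the API reference; everything about the event payloads is an assumption. In particular, the sketch guesses that heartbeats arrive either as `event: heartbeat` or as a literal `heartbeat` data value, and that text events are JSON objects with `content.parts[].text`. All function names are hypothetical.

```python
import itertools
import json
from typing import Iterable, Iterator


def build_run_sse_body(session_id: str, text: str) -> dict:
    """Request body for POST /run_sse."""
    return {
        "app_name": "nemo_agent",
        "user_id": "me",
        "session_id": session_id,
        "new_message": {"parts": [{"text": text}]},
    }


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield each event's data payload, skipping heartbeats and empty data."""
    event_name, data_lines = "", []
    # A trailing blank line flushes the last event when the stream closes.
    for raw in itertools.chain(lines, [""]):
        line = raw.rstrip("\r\n")
        if line:
            field, _, value = line.partition(":")
            value = value[1:] if value.startswith(" ") else value
            if field == "event":
                event_name = value
            elif field == "data":
                data_lines.append(value)
            continue
        payload = "\n".join(data_lines).strip()
        is_heartbeat = event_name == "heartbeat" or payload == "heartbeat"
        if payload and not is_heartbeat:
            yield payload
        event_name, data_lines = "", []


def collect_text(payloads: Iterable[str]) -> list:
    """Gather user-facing text; an empty result means: poll session state."""
    texts = []
    for payload in payloads:
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            continue
        if not isinstance(event, dict):
            continue
        parts = (event.get("content") or {}).get("parts") or []
        texts.extend(p["text"] for p in parts if isinstance(p, dict) and p.get("text"))
    return texts
```

When `collect_text` returns an empty list, the caller would fall back to the session-state endpoint to confirm the edit and summarize the changes.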

Backend Response Translation

The backend assumes a GUI exists. Translate these into API actions:

Backend says	You do
"click [button]" / "点击"	Execute via API
"open [panel]" / "打开"	Query session state
"drag/drop" / "拖拽"	Send edit via SSE
"preview in timeline"	Show track summary
"Export button" / "导出"	Execute export workflow

Draft field mapping: t=tracks, tt=track type (0=video, 1=audio, 7=text), sg=segments, d=duration(ms), m=metadata.

Timeline (3 tracks):
1. Video: city timelapse (0-10s)
2. BGM: Lo-fi (0-10s, 35%)
3. Title: "Urban Dreams" (0-3s)
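To show how the draft field mapping might be turned into a track summary, here is a minimal sketch. The nesting is an assumption not stated in the text: it treats `t` as a list of tracks, each with a `tt` type code and an `sg` list of segments, and each segment with a duration `d` in milliseconds. The `summarize_draft` function is hypothetical.

```python
# Track type codes from the draft field mapping.
TRACK_TYPES = {0: "video", 1: "audio", 7: "text"}


def summarize_draft(draft: dict) -> list:
    """Return one summary line per track in the draft."""
    lines = []
    for index, track in enumerate(draft.get("t", []), start=1):
        kind = TRACK_TYPES.get(track.get("tt"), "unknown")
        segments = track.get("sg", [])
        # Assumption: d lives on each segment and is given in milliseconds.
        total_ms = sum(segment.get("d", 0) for segment in segments)
        lines.append(f"{index}. {kind}: {len(segments)} segment(s), {total_ms / 1000:.1f}s")
    return lines
```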

Error Handling

Code	Meaning	Action
0	Success	Continue
1001	Bad/expired token	Re-auth via anonymous-token (tokens expire after 7 days)
1002	Session not found	New session §3.0
2001	No credits	Anonymous: show registration URL with `?bind=<id>` (get `<id>` from create-session or state response when needed). Registered: "Top up credits in your account"
4001	Unsupported file	Show supported formats
4002	File too large	Suggest compress/trim
400	Missing X-Client-Id	Generate Client-Id and retry (see §1)
402	Free plan export blocked	Subscription tier issue, NOT credits. "Register or upgrade your plan to unlock export."
429	Rate limit (1 token/client/7 days)	Retry in 30s once
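The error table lends itself to a lookup plus a single guarded retry for rate limits. The sketch below is illustrative only: the action names are hypothetical labels for the rows above, and `call` stands in for any request function that returns a status code.

```python
import time
from typing import Callable

# Code -> action label, following the error table.
ACTIONS = {
    0: "continue",
    1001: "reauthenticate",
    1002: "create_session",
    2001: "show_credit_options",
    4001: "show_supported_formats",
    4002: "suggest_compress_or_trim",
    400: "generate_client_id_and_retry",
    402: "prompt_register_or_upgrade",
    429: "retry_once_after_30s",
}


def action_for(code: int) -> str:
    """Look up the action for a response code."""
    return ACTIONS.get(code, "report_unknown_error")


def call_with_rate_limit_retry(
    call: Callable[[], int],
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run call(); on 429 wait 30 seconds and retry exactly once."""
    code = call()
    if code == 429:
        sleep(30)
        code = call()
    return code
```

Keeping 402 separate from 2001 mirrors the table's point that a blocked export is a plan issue rather than a credit shortage.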

Best Practices

The quality of your output is almost entirely determined by the quality of your input photo. Avoid images with heavy filters, strong side-lighting, or partial face occlusion — these confuse the facial landmark detection and produce jittery or misaligned lip movements. A neutral expression in the source photo gives the animation engine the most flexibility to map a wide range of speech sounds accurately.

Keep audio clips clean and free of background music during the lip-sync generation phase. If your final video needs music, add it as a separate layer after the talking photo is rendered. This prevents the model from misreading musical frequencies as speech phonemes.

For emotional impact — especially in memorial or tribute videos — choose audio that is paced naturally and not too fast. Rapid speech compresses lip movements and reduces the realism of the animation. A speaking rate of 120–150 words per minute tends to yield the most convincing results. Finally, always review the generated video before publishing; small manual trims at the start and end of the clip can remove any initialization frames where the face hasn't yet settled into the animation.
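The 120–150 words-per-minute guideline is easy to check before submitting a clip. A minimal sketch, assuming whitespace-separated words are a good enough count; the function names are hypothetical.

```python
def words_per_minute(script: str, duration_s: float) -> float:
    """Speaking rate implied by a script and the clip's duration."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return len(script.split()) / (duration_s / 60.0)


def pacing_advice(wpm: float, low: float = 120.0, high: float = 150.0) -> str:
    """Compare a speaking rate with the suggested range."""
    if wpm < low:
        return "slower than the suggested range"
    if wpm > high:
        return "faster than the suggested range"
    return "within the suggested range"
```

For example, a 30-word script read over 15 seconds works out to 30 / (15 / 60) = 120 words per minute, the lower edge of the range.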

Integration Guide

Getting started with AI Talking Photo requires just two inputs: a face image and an audio source. For best results, use a front-facing photo where the subject's mouth and eyes are clearly visible, unobstructed by hands, masks, or extreme angles. JPEG and PNG formats work well; aim for at least 512×512 pixels to preserve animation quality.

For audio, you can supply an MP3, WAV, or M4A file up to 60 seconds, or simply paste a text script and select a voice style — the skill will synthesize the speech internally before animating the photo. If you're embedding the output in a website or presentation, request the export in MP4 format with a transparent-background option for overlay use.
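The input guidelines in this section can be turned into a pre-flight check. This is a sketch under stated assumptions: the caller already knows the image dimensions and audio duration, the guidelines are treated as warnings rather than hard server-side limits, and `check_inputs` is a hypothetical helper.

```python
from pathlib import PurePath

IMAGE_EXTENSIONS = {"jpg", "jpeg", "png"}
AUDIO_EXTENSIONS = {"mp3", "wav", "m4a"}
MIN_SIDE_PX = 512
MAX_AUDIO_SECONDS = 60.0


def extension(name: str) -> str:
    """Lower-cased file extension without the leading dot."""
    return PurePath(name).suffix.lower().lstrip(".")


def check_inputs(image_name, width, height,
                 audio_name=None, audio_seconds=None, script=None) -> list:
    """Return a list of guideline warnings; an empty list means no issues."""
    problems = []
    if extension(image_name) not in IMAGE_EXTENSIONS:
        problems.append("image should be JPEG or PNG")
    if width < MIN_SIDE_PX or height < MIN_SIDE_PX:
        problems.append("image should be at least 512x512 pixels")
    if audio_name is None and not script:
        problems.append("provide an audio file or a text script")
    if audio_name is not None:
        if extension(audio_name) not in AUDIO_EXTENSIONS:
            problems.append("audio should be MP3, WAV, or M4A")
        if audio_seconds is not None and audio_seconds > MAX_AUDIO_SECONDS:
            problems.append("audio should be 60 seconds or shorter")
    return problems
```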

When building workflows — such as auto-generating spokesperson videos from a CMS or producing personalized video messages at scale — pass the image URL and script text as variables. The skill returns a video URL or file you can route directly into your delivery pipeline, email platform, or social scheduler.

Security Recommendations

This skill appears to do what it says: it uploads photos and audio to a nemovideo.ai backend to produce animated videos. Before installing, consider:

1) Privacy: any media you upload will be sent to an external service; do not upload private or sensitive images without checking the service's terms.
2) Token behavior: the skill accepts NEMO_TOKEN but can also obtain an anonymous token automatically; decide whether you prefer using your own token or an anonymous session.
3) Metadata mismatch: the skill's frontmatter lists a config path (~/.config/nemovideo/) while registry metadata did not; confirm whether the agent will read that directory and what it may contain.
4) Attribution headers: the skill sets X-Skill-* headers based on install path (the agent may inspect certain home dirs); if that concerns you, ask the skill author for clarification.

If you rely on sensitive data or need stricter privacy, avoid installing, or require a review of where tokens and media are stored and transmitted.

Functional Analysis

Type: OpenClaw Skill
Name: ai-talking-photo
Version: 1.0.0

The ai-talking-photo skill is a legitimate integration for the NemoVideo AI service, allowing users to animate still photos into videos. It provides detailed instructions for the agent to manage API authentication (including automated anonymous token generation), file uploads, and video rendering via the mega-api-prod.nemovideo.ai backend. The skill includes standard telemetry for attribution and robust error handling, with no evidence of malicious data exfiltration, unauthorized execution, or harmful prompt injection.

Capability Assessment

✓ Purpose & Capability

The skill claims to animate still photos via a remote GPU-backed API (nemovideo) and only asks for a single API token (NEMO_TOKEN) and optional config path for nemovideo—both of which are reasonable for that purpose.

ℹ Instruction Scope

SKILL.md instructs the agent to create a session, upload images/audio, call SSE endpoints, and poll render status on the nemovideo API — all expected for this feature. It also instructs the agent to detect install path and read the skill's YAML frontmatter for attribution headers; this requires some filesystem inspection but is narrowly scoped. The runtime will upload user media to an external service, so user data leaves the device (expected, but privacy-relevant).

✓ Install Mechanism

There is no install spec and no code files — instruction-only skills carry lower installation risk because nothing is downloaded or written by an installer.

ℹ Credentials

The skill declares a single credential (NEMO_TOKEN), which fits the cloud API usage. However, SKILL.md also describes an anonymous-token flow if NEMO_TOKEN is absent (POST to /api/auth/anonymous-token) and the frontmatter includes a configPath (~/.config/nemovideo/) while registry metadata reported none — this inconsistency should be clarified. The anonymous-token flow means the skill can operate without a pre-provided secret, and it will request or create a token at runtime.

✓ Persistence & Privilege

The skill is not always-enabled, does not request elevated platform privileges, and does not claim to modify other skills or system-wide settings.

Version History

v1.0.0

ai-talking-photo 1.0.0: Animate still images into lifelike, speaking portraits.
- Launches the skill to turn any portrait photo and audio/script into a realistic, animated talking video.
- Handles all authentication, session setup, and cloud backend interaction automatically, with no manual config needed.
- Supports uploads, export, balance checking, and editing via simple commands.
- Provides clear user feedback on job status and error conditions.
- Compatible with common media formats (photos, audio, video exports).
- Ideal for memorial videos, social posts, education, marketing, and storytelling.

Metadata

Slug ai-talking-photo

Version 1.0.0

License MIT-0

Total installs 0

Current installs 0

Versions released 1

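FAQ

What is Ai Talking Photo?

Bring any still photo to life with ai-talking-photo, the skill that syncs facial animation to audio and makes portraits speak, sing, or narrate. Upload a fac... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 112 cumulative downloads to date.

How do I install Ai Talking Photo?

Run the command "/install ai-talking-photo" in the OpenClaw or Claude Code chat box to install it in one step; no additional configuration is needed.

Is Ai Talking Photo free?

Yes. Ai Talking Photo is completely free and released under the MIT-0 license; you can download, install, and use it freely.

Which platforms does Ai Talking Photo support?

Ai Talking Photo is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.

Who developed Ai Talking Photo?

It is developed and maintained by whitejohnk-26 (@whitejohnk-26); the current version is v1.0.0.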