
midasheng-audio-generate

Name: midasheng-audio-generate
Author: jimbozhang

Author: Junbo Zhang · GitHub ↗ · v1.1.5 · MIT-0

cross-platform ✓ Security check passed

Total downloads: 221

Install in OpenClaw

/install midasheng-audio-generate

Description

Generate immersive audio scenes, complete with speech, sound effects, music, and ambient sounds, from text descriptions.

Usage Instructions (SKILL.md)

midasheng-audio-generate

Audio scene generation from text descriptions. Generates WAV audio with speech, sound effects, music, and environmental sounds.

1. Trigger

Use this skill when the user requests audio, sound effects, or music generation based on a text description.

2. Execution Steps

Step 1: Design the Audio Scene (Prompt Refinement)

Before calling the API, you must act as an expert Audio Scene Architect and Foley Designer. Deeply understand the user's natural language input (which may be in any language) and translate it into a highly structured tagged string based on real-world acoustic logic and scene realism.

Prompt Tag Definition:

<|caption|>: The overall, comprehensive description of the audio scene.
<|speech|>: Speaker identity (e.g., middle-aged man, energetic girl) and speaking style.
<|asr|>: The actual transcript / spoken dialogue.
<|sfx|>: Specific sound effects present in the audio (e.g., footsteps, doorbell, dog barking).
<|music|>: Description of background music (e.g., soft jazz, tense orchestral).
<|env|>: Environmental or ambient background noise (e.g., city bustle, forest wind and crickets).
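As a rough illustration of how these tags might be combined into one string, the sketch below joins tag/content pairs. The source does not specify the tag order or the separator, so the ordering, the single-space separator, and the helper name `build_prompt` are all assumptions.

```python
# Hypothetical helper. The tag names come from the definitions above; the
# ordering and the single-space separator are assumptions, not documented API.
TAG_ORDER = ("caption", "speech", "asr", "sfx", "music", "env")


def build_prompt(fields):
    """Join the non-empty fields into one tagged string."""
    parts = []
    for tag in TAG_ORDER:
        value = fields.get(tag, "").strip()
        if value:  # skip tags that have no content
            parts.append(f"<|{tag}|>{value}")
    return " ".join(parts)


print(build_prompt({
    "caption": "a quiet cafe on a rainy afternoon",
    "sfx": "cups clinking, espresso machine hissing",
    "music": "soft jazz",
    "env": "rain against the window",
}))
```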

Crucial Generation Rules:

Scene Enrichment: Do not merely copy the user's input! Act as a sound designer and logically enrich the scene.
Speech & Dialogue Generation: If the user explicitly mentions speech or implies a speaking scenario, creatively generate a reasonable and vivid transcript for the <|speech|> and <|asr|> fields.
Strict ASR Formatting: For the <|asr|> tag, output only the raw spoken text. Do not include any speaker labels or narration, such as “man:”, “speaker1:”, or “a man says”.
Omit Missing Elements: If any element is not relevant, directly omit its corresponding tag.
Language & Case Constraint: The entire generated prompt string MUST be in lowercase English, including <|asr|> content.
Strict Output: Output ONLY the formatted tagged string internally for the next step.
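Some of these rules can be checked mechanically before the prompt is sent. The following is a minimal sketch of such a check, not part of the skill itself: the function name `check_prompt` and the speaker-label regular expression are hypothetical, and the label check is only a heuristic that may flag legitimate transcripts containing a colon.

```python
import re

KNOWN_TAGS = {"caption", "speech", "asr", "sfx", "music", "env"}
# Heuristic (an assumption): a short word or two followed by a colon at the
# start of the transcript, e.g. "man:" or "speaker1:".
SPEAKER_LABEL = re.compile(r"^\s*[a-z0-9 ]{1,20}:")


def check_prompt(prompt):
    """Return a list of rule violations found in a tagged prompt string."""
    problems = []
    if prompt != prompt.lower():
        problems.append("prompt contains uppercase characters")
    for tag, content in re.findall(r"<\|(\w+)\|>([^<]*)", prompt):
        if tag not in KNOWN_TAGS:
            problems.append(f"unknown tag: {tag}")
        if not content.strip():
            problems.append(f"empty tag should be omitted: {tag}")
        if tag == "asr" and SPEAKER_LABEL.match(content):
            problems.append("asr appears to start with a speaker label")
    return problems
```

An empty list means none of the checked rules were violated; rules such as scene enrichment still need human or model judgment.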

Step 2: Execute Command

curl -X POST "https://llmplus.ai.xiaomi.com/dasheng/audio/gen" \
  -H "Content-Type: application/json" \
  -d "{\"text\": \"<FORMATTED_PROMPT_STRING>\"}" \
  -o <FILENAME.wav>
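For readers who prefer to script the call, the curl command could be mirrored in Python roughly as follows. This is a sketch using only the standard library: the endpoint and JSON body shape are taken from the command above, while the function names `build_request` and `generate` and the 120-second timeout are assumptions. Building the body with `json.dumps` also avoids the manual quote escaping that the shell version needs when a transcript contains double quotes.

```python
import json
import urllib.request

API_URL = "https://llmplus.ai.xiaomi.com/dasheng/audio/gen"  # from the curl command


def build_request(prompt):
    """Build the same POST request as the curl command, without sending it."""
    body = json.dumps({"text": prompt}).encode("utf-8")  # escapes quotes safely
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, filename, timeout_s=120.0):
    """Send the request and write the response body to filename.

    Assumes, as the curl -o option implies, that the body is the WAV data.
    The timeout value is an arbitrary choice for this sketch.
    """
    with urllib.request.urlopen(build_request(prompt), timeout=timeout_s) as resp:
        with open(filename, "wb") as out:
            out.write(resp.read())
```

Calling `generate("<|caption|>a quiet cafe <|music|>soft jazz", "cafe.wav")` would then correspond to the command above with those two placeholders filled in.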

3. Queue Status

Query Command

curl -X POST "https://llmplus.ai.xiaomi.com/metrics?path=/dasheng/audio/gen"

Returned Fields

active: Number of currently active requests
avg_latency_ms: Average processing latency (milliseconds)
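Estimated wait time = active × avg_latency_ms (the result is in milliseconds; divide by 1000 to compare it with the thresholds under Status Levels, which are in seconds)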
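The wait-time formula can be worked through in a few lines. The sketch below assumes the metrics endpoint returns a JSON object with `active` and `avg_latency_ms` as top-level keys; the source names the fields but does not show the response layout, and the sample values are made up.

```python
import json


def estimated_wait_seconds(metrics_json):
    """Apply active x avg_latency_ms and convert the result to seconds."""
    metrics = json.loads(metrics_json)
    wait_ms = metrics["active"] * metrics["avg_latency_ms"]
    return wait_ms / 1000.0


# Made-up sample response, for illustration only.
sample = '{"active": 3, "avg_latency_ms": 4000}'
print(estimated_wait_seconds(sample))  # 3 * 4000 ms = 12000 ms -> 12.0
```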

When to Call

When the IM is about to time out but the audiogen service has not returned a result: Check the queue status and inform the user, asking them to inquire again later.
When the user asks about task progress later but the service still hasn't returned: Check the latest queue status and report it back to the user.

Status Levels

🟢 active=0 or estimated wait <5s → Service idle
🟡 Estimated wait 5-30s → Slight queue
🔴 Estimated wait >30s → Queue is long, recommend trying again later
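The three levels translate into a small classification function. This is a sketch with a hypothetical function name and return labels; the thresholds come from the list above, and an estimated wait of exactly 30 seconds is treated as a slight queue because only waits above 30 seconds are listed as long.

```python
def queue_status(active, wait_s):
    """Map the active count and estimated wait (in seconds) to a status level."""
    if active == 0 or wait_s < 5:
        return "idle"
    if wait_s <= 30:
        return "slight queue"
    return "long queue"


print(queue_status(3, 12.0))  # slight queue
```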

Security Recommendations

This skill behaves like a thin adapter that sends your text to an external Xiaomi-hosted audio-generation service. Before installing: (1) Do not send PII or confidential text — SKILL.md explicitly warns data may be retained and is sent offsite; (2) verify you are comfortable with the endpoint host (llmplus.ai.xiaomi.com) and its privacy/retention policy — the skill does not document retention or authentication; (3) note the small metadata mismatch: registry metadata omitted 'curl' while SKILL.md requires it; (4) test with non-sensitive prompts first to confirm the service accepts unauthenticated requests and to observe latency/quality; (5) if you require stronger privacy, consider using a local or self-hosted alternative (the SKILL.md links include demos and a GitHub repo you can review). If you want higher assurance, ask the publisher for explicit data-retention and authentication details or prefer a skill with documented privacy guarantees.

Functional Analysis

Type: OpenClaw Skill
Name: midasheng-audio-generate
Version: 1.1.5

The skill is a legitimate interface for the Dasheng AudioGen service developed by Xiaomi and Shanghai Jiao Tong University. It uses standard curl commands to interact with official-looking endpoints (llmplus.ai.xiaomi.com) for audio generation and queue status monitoring. The documentation in SKILL.md includes proactive privacy warnings regarding PII and correctly aligns with the stated purpose of transforming text into audio scenes without any evidence of malicious intent, data exfiltration, or obfuscation.

Capability Assessment

✓ Purpose & Capability

The skill claims to convert text into immersive audio and its runtime instructions perform exactly that: craft a structured prompt and POST it to a remote audio-generation API. The required functionality (prompt engineering + curl call to the service) matches the described purpose. Minor inconsistency: the registry metadata lists no required binaries while SKILL.md lists 'requirements: curl' (the curl command is used in the instructions).

ℹ Instruction Scope

Instructions are narrowly focused: they direct the agent to build a structured, lowercased prompt and send it to the specified API endpoint, and to optionally check a queue-status endpoint. The instructions do not ask the agent to read local files, other env vars, or system state. Important privacy note in SKILL.md: user-provided prompts are transmitted to an external endpoint and data retention is unknown — the skill explicitly warns not to include PII or sensitive content.

✓ Install Mechanism

This is an instruction-only skill with no install spec and no code to write to disk, which is the lowest-risk install model. The only runtime requirement is using curl (per SKILL.md), which is a normal CLI tool for making HTTP requests.

✓ Credentials

The skill does not request any environment variables, credentials, or config paths. That is proportional to its described purpose. Note: the SKILL.md declares the remote endpoint accepts unauthenticated requests (authentication: none); if the endpoint actually requires credentials, the skill might fail or prompt for extra setup not declared here.

✓ Persistence & Privilege

The skill is not marked always:true and does not request persistent privileges or attempt to modify other skills or system-wide config. It can be invoked by the agent normally (default).

How to Use

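Make sure OpenClaw is installed (locally or via Docker).
Enter the install command in the chat box: /install midasheng-audio-generate
Once installed, invoke the skill by name or trigger it with /midasheng-audio-generate
Provide the required input as described in the skill's parameter notes to receive structured output.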

Version History

v1.1.5

- Minor update to the skill description: now lists Xiaomi and Shanghai Jiao Tong University as developers at the beginning, improving attribution.
- No functional or technical changes; behavior and API usage remain unchanged.

v1.1.4

- Clarified and strengthened the privacy policy, explicitly prohibiting inclusion or invention of any personally identifiable or sensitive information in prompts.
- Updated the privacy recommendations to highlight limitations of data protection and that all agent-generated enhancements are sent externally.
- Adjusted ASR formatting instructions to ban any speaker labels or narration.
- Improved language around queue status instructions for greater clarity.

v1.1.3

Initial release.
- Generate high-quality audio scenes from text input, including speech, sound effects, music, and ambiance
- Expert-designed prompt structuring with specific tags for scene realism
- Returns only relevant tags; elements not needed are omitted
- Output enforces lowercase English and strict tag formatting
- Queue status querying and guidance based on wait times for user experience

v1.1.2

- Added service provider details: developed by Xiaomi and Shanghai Jiao Tong University.
- Listed explicit API endpoints for audio generation and queue status.
- Updated privacy section with data handling and safety recommendations.
- Minor update to description for clarity.

v1.1.1

- Updated the description to be more concise and include a direct demo link.
- No file or functional changes detected; documentation remains consistent.

v1.1.0

- Updated skill description to emphasize immersive audio scene generation.
- Added links for Demo Page, Hugging Face Demo, and Code Repository.
- Clarified language and case constraint: all prompt strings, including <|asr|> content, must be lowercase English.
- No changes to API usage or overall workflow.

v1.0.0

- Initial release of midasheng-audio-generate skill.
- Generate WAV audio from text prompts, including speech, sound effects, music, and ambient environments using MiLM Plus API.
- Transforms user text into a structured, tagged prompt string based on scene, speech, SFX, music, and environment.
- Includes detailed prompt formatting and language requirements to ensure accurate and enriched audio generation.
- Provides a queue status check to inform users of processing times and system load.

Metadata

Slug midasheng-audio-generate

Version 1.1.5

License MIT-0

Total installs 1

Current installs 1

Version count 7

FAQ