jojowillwater

Smart Audio Analyzer

Author: JoJowillwater · GitHub · v1.2.1 · MIT-0
cross-platform ⚠ suspicious
405 total downloads · 0 favorites · 1 current install · 4 versions
Install in OpenClaw
/install audio-analyzer
Description
All-in-one audio analysis: transcribe, identify speakers by voiceprint, auto-detect scene (meeting/interview/training/talk), generate structured notes. The O...
Usage (SKILL.md)

Smart Audio Analyzer

The only audio skill with persistent voice profiles. Beyond transcription — it knows WHO is speaking, detects the scene, and generates structured notes.

What Makes This Different

| Feature | This Skill | Others |
| --- | --- | --- |
| Transcription | ✅ AssemblyAI (default) + Whisper + Gemini | ✅ Usually one engine |
| Speaker ID by voiceprint | ✅ Persistent profiles across sessions | ❌ None |
| Scene auto-detection | ✅ 5 built-in scenes + extensible | ❌ One-size-fits-all |
| Structured output | ✅ Scene-specific templates | ⚠️ Generic summary |
| Multi-language | ✅ Chinese + English | Varies |

Quick Start

# 1. Install
cd skills/audio-analyzer/scripts && npm install

# 2. Configure (pick ONE — AssemblyAI recommended)
cp .env.example .env
# Edit .env: set ASSEMBLYAI_API_KEY

# 3. Run
node analyze.js /path/to/recording.m4a

Fallback: if ASSEMBLYAI_API_KEY is not set, the script attempts Gemini (which needs its own key) or local Whisper (which needs a local install).

Installation

# 1. Copy the skill into workspace/skills/
cp -r audio-analyzer /path/to/.openclaw/workspace/skills/

# 2. Install dependencies
cd skills/audio-analyzer/scripts && npm install

# 3. Configure an ASR engine (one is enough; AssemblyAI recommended)
cp .env.example .env
# Edit .env and set ASSEMBLYAI_API_KEY

# 4. Multi-agent setups: each agent's workspace needs its own copy

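Bootstrap Snippet

Add the following to your agent's bootstrap.md:

## Audio file handling
When you receive an audio file (.m4a/.mp3/.wav/.ogg/.flac), you **must** follow this procedure:
1. Run `cd <workspace>/skills/audio-analyzer/scripts && node analyze.js <absolute path to audio file>` to transcribe and separate speakers
2. Read the transcript and determine the scene from its content (or use the scene the user specified)
3. Read skills/audio-analyzer/references/scenes/<scene>.md to load the template
4. Read skills/audio-analyzer/references/voice-profiles.md and compare speakers against the voice profiles
5. Generate structured notes from the template
6. Confirm speaker identities with the user, then update the voice profiles

Do **not** try to process audio files with tools such as summarize, pdf, or image.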

Core Pipeline

Audio File → Transcribe + Speaker Separation → Voice Profile Matching
→ Scene Detection → Load Template → Generate Notes → Update Profiles

Step 1: Transcribe

cd scripts && node analyze.js <file path>

ASR Engine Priority:

  1. AssemblyAI (default, best quality) — needs ASSEMBLYAI_API_KEY
  2. Gemini — needs GEMINI_API_KEY or OpenRouter key
  3. Whisper (local) — needs whisper installed locally
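
The priority order above can be sketched as a small selection function. This is an illustration in Python, not the logic in analyze.js: the environment variable name OPENROUTER_API_KEY and the check for a `whisper` executable on PATH are assumptions made for the example.

```python
import shutil
from typing import Mapping, Optional


def pick_asr_engine(env: Mapping[str, str],
                    whisper_on_path: Optional[bool] = None) -> Optional[str]:
    """Return the first usable engine in the documented priority order."""
    if whisper_on_path is None:
        # Assumption: a local install exposes a `whisper` executable on PATH.
        whisper_on_path = shutil.which("whisper") is not None

    if env.get("ASSEMBLYAI_API_KEY"):
        return "assemblyai"
    # OPENROUTER_API_KEY is a hypothetical name for the "OpenRouter key".
    if env.get("GEMINI_API_KEY") or env.get("OPENROUTER_API_KEY"):
        return "gemini"
    if whisper_on_path:
        return "whisper"
    # Nothing usable: the caller should list setup instructions per engine.
    return None
```

Calling `pick_asr_engine(os.environ)` would then report which engine a given environment resolves to before any audio is uploaded.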

Output:

  • <filename>_transcript.txt: timestamped dialogue with speaker labels
  • <filename>_raw.json: raw JSON with speaker metadata

Step 2: Speaker Identification

Cross-references references/voice-profiles.md:

  1. Read all known voice profiles (speech patterns, content patterns)
  2. Analyze each speaker against profiles
  3. Match rules:
    • High confidence → auto-label with name
    • Partial match → label as "possibly XXX" with evidence
    • No match → label as "Unknown Speaker"
  4. Ask user to confirm
  5. Update profiles after confirmation
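
The three match rules can be expressed as a labeling function. In the skill the agent makes this judgment by reading the profiles, so the numeric score and both thresholds below are hypothetical stand-ins chosen only to make the tiers concrete.

```python
from typing import Optional


def label_speaker(name: Optional[str], score: float, evidence: str = "",
                  high: float = 0.8, partial: float = 0.5) -> str:
    """Map a candidate profile match to one of the three documented labels.

    `score`, `high`, and `partial` are illustrative assumptions; the source
    does not define a numeric confidence scale.
    """
    if name is not None and score >= high:
        return name  # high confidence: auto-label with the name
    if name is not None and score >= partial:
        suffix = f" ({evidence})" if evidence else ""
        return f"possibly {name}{suffix}"  # partial match, with evidence
    return "Unknown Speaker"  # no match
```

Whatever label comes back, the user is still asked to confirm before any profile is updated.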

Step 3: Scene Detection

Auto-detects based on transcript content:

| Scene | Typical Keywords | Template |
| --- | --- | --- |
| 🚣 Rowing Training | stroke rate, pace, catch, drive | scenes/rowing.md |
| 💼 Work Meeting | project, deadline, requirements, bug | scenes/meeting.md |
| 🎤 Interview | user pain points, use case, feedback | scenes/interview.md |
| 🎓 Talk/Lecture | welcome, today's topic, Q&A | scenes/talk.md |
| 📝 General | (fallback) | scenes/general.md |

Override manually: node analyze.js file.m4a meeting
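
A keyword-count heuristic gives a rough picture of how the scene could be inferred from the table. This is a simplified sketch: in the skill the agent decides by reading the transcript, and the naive substring counting here is an assumption of the example (it would, for instance, count "pace" inside "space").

```python
from typing import Optional

# Keywords taken from the scene table; scene names match the template files.
SCENE_KEYWORDS = {
    "rowing": ["stroke rate", "pace", "catch", "drive"],
    "meeting": ["project", "deadline", "requirements", "bug"],
    "interview": ["user pain points", "use case", "feedback"],
    "talk": ["welcome", "today's topic", "q&a"],
}


def detect_scene(transcript: str, override: Optional[str] = None) -> str:
    """Pick the scene with the most keyword hits, else fall back to general."""
    if override:
        return override  # mirrors `node analyze.js file.m4a meeting`
    text = transcript.lower()
    scores = {
        scene: sum(text.count(keyword) for keyword in keywords)
        for scene, keywords in SCENE_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "general"
```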

Step 4-5: Generate Structured Notes

Loads scene-specific template → generates structured output with key points, action items, and insights.

Step 6: Update Voice Profiles

After user confirms speaker identities, updates references/voice-profiles.md:

  • New person → add entry (role, speech patterns, content patterns)
  • Known person → refine description
  • Shared across all scenes and future recordings

Extending Scenes

Add a new .md file in references/scenes/:

references/scenes/
├── rowing.md      # 🚣 Rowing Training
├── meeting.md     # 💼 Work Meeting
├── interview.md   # 🎤 Interview
├── talk.md        # 🎓 Talk/Lecture
└── general.md     # 📝 General (fallback)
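
Because every scene is just a file in this directory, discovery and fallback can be sketched in a few lines. The function names below are hypothetical and only illustrate the lookup; they are not taken from the skill's code.

```python
from pathlib import Path
from typing import List

FALLBACK_SCENE = "general"


def available_scenes(scenes_dir: str) -> List[str]:
    """List scene names, one per .md file in the scenes directory."""
    return sorted(p.stem for p in Path(scenes_dir).glob("*.md"))


def template_path(scenes_dir: str, scene: str) -> Path:
    """Return the template for `scene`, or general.md if it does not exist."""
    base = Path(scenes_dir)
    # Reject names with path separators so a scene name cannot leave the directory.
    if Path(scene).name == scene:
        candidate = base / f"{scene}.md"
        if candidate.is_file():
            return candidate
    return base / f"{FALLBACK_SCENE}.md"
```

With this shape, dropping a new file such as `standup.md` into the directory is enough for `available_scenes` to report it.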

Requirements

  • Node.js 18+
  • At least ONE of: AssemblyAI key, Gemini key, or local Whisper
  • cd scripts && npm install

Error Handling

| Situation | Response |
| --- | --- |
| API quota exceeded | "Transcription service unavailable, check API quota" |
| File > 100MB | Warn user: estimated 5-10 min processing |
| Empty transcript | "No speech detected in audio" |
| Network error | "Connection error, please retry" |
| No ASR engine available | List setup instructions for each engine |
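
Two of these checks need no network access and can be sketched directly. The helper names are hypothetical, and treating 100MB as 100 × 1024 × 1024 bytes is an assumption of this example.

```python
from typing import Optional

# Assumption: "100MB" is interpreted as 100 MiB.
LARGE_FILE_BYTES = 100 * 1024 * 1024


def large_file_warning(size_bytes: int) -> Optional[str]:
    """Return the documented warning when a file exceeds the size limit."""
    if size_bytes > LARGE_FILE_BYTES:
        return "Large file: estimated 5-10 min processing"
    return None


def empty_transcript_error(transcript: str) -> Optional[str]:
    """Return the documented message when the transcript has no content."""
    if not transcript.strip():
        return "No speech detected in audio"
    return None
```

A caller would pass `os.path.getsize(path)` to the first helper before uploading and the transcript text to the second afterwards.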

Advanced: Voiceprint Extraction (Optional)

The skill includes an optional voiceprint.py tool for embedding-based speaker identification using ONNX neural models. This is separate from the text-based voice profile matching in the core pipeline.

What it does

  • Extracts speaker audio segments using ffmpeg
  • Computes 256-dim speaker embeddings via WeSpeaker ONNX model
  • Stores embeddings locally in references/voice-db.json
  • Matches new speakers against stored embeddings (cosine similarity)
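
The matching step can be illustrated with plain cosine similarity. This sketch assumes the database is a mapping from speaker name to one embedding vector and uses an arbitrary threshold of 0.6; the actual layout of voice-db.json and the threshold used by voiceprint.py are not stated in the source.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(embedding: Sequence[float],
               db: Dict[str, Sequence[float]],
               threshold: float = 0.6) -> Tuple[Optional[str], float]:
    """Return (name, score) for the closest stored speaker.

    The name is None when no stored embedding reaches `threshold`
    (a hypothetical value for this sketch).
    """
    best_name, best_score = None, -1.0
    for name, stored in db.items():
        score = cosine_similarity(embedding, stored)
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= threshold:
        return best_name, best_score
    return None, best_score
```

The same function works unchanged for 256-dimensional vectors; small vectors are enough to try it out.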

Setup (optional — core skill works without this)

# 1. Install Python dependencies
pip install numpy librosa onnxruntime

# 2. Install ffmpeg
apt install ffmpeg  # or: brew install ffmpeg

# 3. Download WeSpeaker model
mkdir -p ~/.openclaw/models/wespeaker
# Download cnceleb_resnet34_LM.onnx from:
# https://github.com/wenet-e2e/wespeaker/releases
# Set: export WESPEAKER_MODEL=~/.openclaw/models/wespeaker/cnceleb_resnet34_LM.onnx

Usage

# Extract voiceprints from a transcribed recording
python3 voiceprint.py extract recording.m4a recording_raw.json

# Enroll a known speaker
python3 voiceprint.py enroll "JoJo" jojo_sample.m4a

# Identify speaker in new audio
python3 voiceprint.py identify unknown.m4a

Privacy Notice

  • All voice embeddings are stored locally in references/voice-db.json
  • Voice embeddings are never sent externally
  • Audio files ARE uploaded to cloud ASR (AssemblyAI/Gemini) for transcription. For fully offline operation, use local Whisper
  • Speaker identity updates require explicit user confirmation
  • To delete all voiceprint data: rm references/voice-db.json

Voice Profiles (Text-Based)

See references/voice-profiles.md. Shared across all scenes — same person is recognized regardless of context. This is the lightweight alternative that works without the ONNX model.

Security Recommendations
This skill appears to do what it claims: transcribe audio, match speaker voiceprints locally, detect scene, and produce structured notes. Before installing, note two things:

  1. The registry metadata provided to the platform did not list the API keys the SKILL.md and code actually require (ASSEMBLYAI_API_KEY / GEMINI_API_KEY / OPENAI_API_KEY). Confirm the skill's metadata and required env vars with the publisher so you don't accidentally omit or misconfigure credentials.
  2. By default audio will be uploaded to third‑party ASR/summarization services. If that is sensitive data, set ASR_ENGINE=whisper and install local Whisper/ffmpeg, or avoid providing cloud API keys.

Verify where voice embeddings are stored (references/voice-db.json) and ensure the permissions/backup policy meets your privacy requirements.

Additional precautions: review the included scripts before running npm install, run the skill in a sandboxed agent or separate environment first, limit API keys to accounts with minimal privileges/billing limits, and only enroll/confirm speaker identities after you understand how updates to voice-profiles are performed.
Functional Analysis
Type: OpenClaw Skill
Name: audio-analyzer
Version: 1.2.1

The skill contains a critical shell injection vulnerability in 'scripts/analyze.js' within the 'transcribeWhisper' function, where the 'audioPath' variable is directly interpolated into a shell command string via 'execSync'. While the skill's stated purpose of audio analysis and speaker identification is legitimate, and the requested permissions (shell_exec, filesystem, network) are consistent with using tools like FFmpeg and Whisper, the lack of input sanitization allows for potential RCE if a maliciously named file is processed. No evidence of intentional malice or data exfiltration was identified.
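
The general remedy for this class of bug is to pass the file name as a separate argument instead of splicing it into a shell string. The sketch below shows the idea in Python with `subprocess`; it is not a patch for analyze.js, and the bare `whisper <path>` invocation is an assumption about how the local engine is called. In Node the analogous change would be to use `child_process.execFileSync` with an argument array.

```python
import os
import subprocess
from typing import List


def build_whisper_argv(audio_path: str) -> List[str]:
    """Build an argument vector in which the path is exactly one argument."""
    # An absolute path cannot start with "-", so it cannot be read as an option.
    # Assumption: the local engine is invoked as `whisper <path>`; any extra
    # flags would be appended as further separate list items.
    return ["whisper", os.path.abspath(audio_path)]


def transcribe_with_whisper(audio_path: str) -> str:
    """Run the command without a shell, so metacharacters are never interpreted."""
    result = subprocess.run(
        build_whisper_argv(audio_path),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Because no shell parses the command, a file named `a"; rm -rf ~; ".m4a` is handed to the program as an ordinary, if odd, file name.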
Capability Assessment
Purpose & Capability
The skill's name and description (transcribe, speaker identification, scene detection, structured notes) align with included files (scripts/analyze.js and scripts/voiceprint.py) and the declared npm/python dependencies. However the registry metadata at the top of the report lists no required env vars/credentials while SKILL.md clearly requires ASSEMBLYAI_API_KEY, GEMINI_API_KEY or OPENAI_API_KEY (env_any_of). This metadata mismatch is an incoherence that should be resolved before trusting automated installs.
Instruction Scope
SKILL.md instructs the agent to run node analyze.js on incoming audio, read/write references/voice-profiles.md and references/voice-db.json, load scene templates, and update profiles after user confirmation. The code indeed reads/writes those files. It also uploads audio to third‑party ASR/summarization services (AssemblyAI/Gemini/OpenAI/OpenRouter) unless a local Whisper fallback is used. These behaviors are all within the stated purpose, but they have privacy implications (audio is sent off‑host). The bootstrap snippet also instructs agents to automatically invoke the script on audio files — that is expected for a processing skill but worth noting.
Install Mechanism
There is no exotic install: SKILL.md instructs running 'cd scripts && npm install' which matches the provided package.json and package-lock.json (assemblyai, dotenv, openai). No downloads from personal servers are required by default. An optional ONNX model download is referenced (github.com/wenet-e2e/wespeaker/releases) — that is a well-known release host but it is optional and would write a model file to disk if used.
Credentials
SKILL.md requires one of ASSEMBLYAI_API_KEY, GEMINI_API_KEY, or OPENAI_API_KEY and optionally WESPEAKER_MODEL/ASR_ENGINE; the code reads ASSEMBLYAI_API_KEY, GEMINI_API_KEY, OPENAI_API_KEY, OPENAI_BASE_URL/OPENAI_API_KEY for summarization. The registry metadata shown earlier reported no required env vars or primary credential — that's inconsistent and concerning because the skill will not function (and will upload audio) without those keys. Also audio upload to cloud ASR is intrinsic to the design; users should consider whether they are comfortable sending recordings to external services. The voice embeddings are stored locally by voiceprint.py (references/voice-db.json), which matches the privacy claim in SKILL.md.
Persistence & Privilege
The skill does not request 'always: true' and will not modify other skills. It writes and updates files in its own workspace (references/voice-profiles.md and references/voice-db.json) to persist speaker profiles, which is expected for this functionality. The agent-autonomous invocation setting is default (allowed) but not elevated here.
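How to Use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install audio-analyzer
  3. Once installed, invoke the Skill by name or trigger it with /audio-analyzer
  4. Provide the required inputs as described by the Skill's parameters to get structured output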
Version History
v1.2.1: Fix privacy claim (audio IS sent to cloud ASR, embeddings are not); replace hardcoded model path with a relative path
v1.2.0: Include voiceprint.py with full dependency/permission/privacy declarations. Multi-engine ASR. All deps explicitly declared.
v1.1.1: Fix security scan: declare all deps, permissions, and credentials
v1.1.0: Multi-engine ASR
Metadata
Slug: audio-analyzer
Version: 1.2.1
License: MIT-0
Total installs: 1
Current installs: 1
Versions: 4
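FAQ

What is Smart Audio Analyzer?

All-in-one audio analysis: transcribe, identify speakers by voiceprint, auto-detect scene (meeting/interview/training/talk), generate structured notes. The O... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 405 total downloads to date.

How do I install Smart Audio Analyzer?

Run `/install audio-analyzer` in the OpenClaw or Claude Code chat box to install it in one step. Transcription still requires an ASR engine to be set up as described in SKILL.md (an AssemblyAI or Gemini key, or local Whisper).

Is Smart Audio Analyzer free?

Yes. Smart Audio Analyzer is completely free and released under the MIT-0 license; you can download, install, and use it freely.

Which platforms does Smart Audio Analyzer support?

Smart Audio Analyzer is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.

Who develops Smart Audio Analyzer?

It is developed and maintained by JoJowillwater (@jojowillwater); the current version is v1.2.1.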
