Description

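Uses Faster-Whisper for high-accuracy speech recognition and Feishu's built-in TTS to transcribe voice messages and reply by voice, enabling two-way voice conversation.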

README (SKILL.md)

Feishu Whisper + TTS Voice Interaction Skill

Name: feishu-whisper-voice
Author: 15071664

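When to trigger this skill

Use this skill when:

A user sends a voice or audio message that needs to be transcribed and answered, or wants to chat by voice
High-accuracy speech-to-text is needed (the author cites Whisper accuracy above 98%; actual accuracy varies with model size, language, and audio quality)
AI replies need to be converted to natural-sounding speech
The user mentions keywords such as "voice interaction" (语音交互), "speak" (说话), "Faster-Whisper", or "TTS"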

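Faster-Whisper + TTS architecture

User voice → download audio → Faster-Whisper transcription → AI processing → TTS conversion → voice reply

Key advantages

Faster-Whisper: an open-source, CTranslate2-based reimplementation of the Whisper speech recognition model; multilingual and highly accurate
TTS: Feishu's built-in text-to-speech tool, with natural and fluent output
Two-way interaction: understands what the user says and can also reply by voice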

Tool integration

1. Download the voice file

Prefer the bot identity (no authorization required):

feishu_im_bot_image(
    message_id="om_xxx",
    file_key="file_xxx",
    type="audio"
)

User identity (requires OAuth authorization):

feishu_im_user_fetch_resource(
    message_id="om_xxx",
    file_key="file_xxx",
    type="audio"
)
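
The bot-first, user-fallback order described above can be sketched as follows. This is only an illustration under stated assumptions: `fetch_as_bot` and `fetch_as_user` are hypothetical callables standing in for feishu_im_bot_image and feishu_im_user_fetch_resource, and the sketch assumes that a failed download surfaces as an exception, which the source does not confirm.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical type: an async callable taking (message_id, file_key)
# and returning a local path to the downloaded audio.
Fetcher = Callable[[str, str], Awaitable[str]]


async def download_audio(
    message_id: str,
    file_key: str,
    fetch_as_bot: Fetcher,
    fetch_as_user: Fetcher,
    timeout_s: float = 30.0,
) -> str:
    """Try the bot identity first; fall back to the user identity on failure."""
    try:
        return await asyncio.wait_for(fetch_as_bot(message_id, file_key), timeout_s)
    except Exception as bot_error:  # assumption: failures (and timeouts) raise
        try:
            return await asyncio.wait_for(
                fetch_as_user(message_id, file_key), timeout_s
            )
        except Exception as user_error:
            raise RuntimeError(
                f"bot download failed ({bot_error!r}); "
                f"user download failed ({user_error!r})"
            ) from user_error
```

Passing the two fetchers in as parameters keeps the fallback logic independent of how the platform exposes its tools.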

2. Whisper speech recognition

Use the faster-whisper library for high-accuracy speech-to-text:

from faster_whisper import WhisperModel

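# Initialize the model (the base model is downloaded automatically on first use)
model = WhisperModel("base", device="cpu")

# Transcribe the audio file; audio_file is the path of the downloaded audio.
# segments is a lazy generator, so decoding runs as it is iterated.
segments, info = model.transcribe(audio_file)

print(f"Detected language: {info.language}, probability: {info.language_probability:.4f}")
for segment in segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.text}")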
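
Because the segments iterator can only be consumed once, it can help to collect both the plain transcript and the timestamped lines in a single pass. The helper below is a minimal sketch; `collect_transcript` is a hypothetical name, and the demo uses stand-in objects that merely carry the same `start`, `end`, and `text` attributes as faster-whisper segments.

```python
from types import SimpleNamespace
from typing import Iterable, List, Tuple


def collect_transcript(segments: Iterable) -> Tuple[str, List[str]]:
    """Consume the segment iterator once; return (full_text, timestamped_lines)."""
    texts: List[str] = []
    lines: List[str] = []
    for seg in segments:  # iterating is what drives the actual decoding
        texts.append(seg.text)
        stripped = seg.text.strip()
        if stripped:
            lines.append(f"[{seg.start:.2f}s - {seg.end:.2f}s] {stripped}")
    # Whisper segment text normally carries its own leading space,
    # so the pieces are joined without an extra separator.
    return "".join(texts).strip(), lines


# Stand-in objects (not real faster-whisper segments) for a quick check
demo = (
    SimpleNamespace(start=s, end=e, text=t)
    for s, e, t in [(0.0, 1.5, " Hello there."), (1.5, 3.0, " How are you?")]
)
full_text, lines = collect_transcript(demo)
print(full_text)  # Hello there. How are you?
```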

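Model options:

base: about 142MB, CPU-friendly, recommended for beginners
small: about 466MB, balances speed and accuracy
medium: about 1.5GB (769M parameters), GPU recommended (use when an NVIDIA GPU is available)
large: about 3GB (roughly 1.5B parameters), highest accuracy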
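
The guidance in the list above could be encoded as a small selection helper. This is a sketch only: `pick_model_size` is a hypothetical function, and the memory thresholds are illustrative assumptions rather than measured requirements.

```python
def pick_model_size(has_nvidia_gpu: bool, free_memory_gb: float) -> str:
    """Return a Whisper model name following the model-options list.

    The thresholds below are illustrative assumptions, not measured limits.
    """
    if has_nvidia_gpu:
        # GPU: medium is the recommended choice; large only with ample memory
        return "large" if free_memory_gb >= 10 else "medium"
    # CPU: stay with the lighter models
    return "small" if free_memory_gb >= 4 else "base"
```

The returned name can be passed directly as the first argument of WhisperModel.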

3. TTS text-to-speech

Use Feishu's built-in tts() tool:

await tts(text="Hello, I am your AI assistant")

Return format:

Success: audio file path (Base64) or audio_url
Failure: error message
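
Since the return format is only loosely specified, calling code may want to normalize the result before using it. The sketch below assumes three shapes that the source does not confirm: a dict carrying "audio_url", a string that is a path to an existing file, and anything else (including an error-message string) meaning failure. `extract_audio_ref` is a hypothetical helper name.

```python
from pathlib import Path
from typing import Any, Optional


def extract_audio_ref(tts_result: Any) -> Optional[str]:
    """Return a usable audio reference (URL or file path), or None on failure.

    Assumed shapes (not confirmed by the platform):
      - dict with a non-empty "audio_url"  -> success, returns the URL
      - str naming an existing file        -> success, returns the path
      - anything else (e.g. error message) -> failure, returns None
    """
    if isinstance(tts_result, dict):
        url = tts_result.get("audio_url")
        return url if isinstance(url, str) and url else None
    if isinstance(tts_result, str):
        try:
            if Path(tts_result).is_file():
                return tts_result
        except (OSError, ValueError):
            # Strings that are not valid paths (too long, odd characters)
            return None
    return None
```

Checking that a string actually names a file is what separates a returned path from a returned error message, since both would arrive as plain strings under these assumptions.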

4. Complete voice interaction flow

async def handle_voice_message(message_id: str, file_key: str) -> None:
    # Step 1: download the audio file
    audio_path = await feishu_im_bot_image(
        message_id=message_id,
        file_key=file_key,
        type="audio"
    )
    
    # Step 2: Whisper transcription
    model = WhisperModel("base", device="cpu")
    segments, info = model.transcribe(audio_path)
    # Segment text normally carries its own leading space, so join without a separator
    transcript = "".join(seg.text for seg in segments).strip()
    
    print(f"User said: {transcript}")
    
    # Step 3: AI processing (generate_reply is a placeholder for the reply logic)
    reply_text = generate_reply(transcript)
    
    # Step 4: convert the reply to speech with TTS
    audio_result = await tts(text=reply_text)
    
    print(f"AI reply: {reply_text}")

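Dependencies

Python libraries

faster-whisper >= 1.0.0 - Whisper speech recognition engine
openai-whisper (optional) - OpenAI's reference Whisper implementation

FFmpeg (recommended)

Used for audio format conversion and quality optimization: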

# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt-get update && sudo apt-get install -y ffmpeg
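
Once FFmpeg is installed, a voice message can be normalized to 16 kHz mono WAV before transcription. The sketch below is one way to do that; the function names are hypothetical, and the step is optional, since faster-whisper decodes common audio formats itself.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_ffmpeg_cmd(src: str, dst: str) -> List[str]:
    """Build an ffmpeg command that converts src to 16 kHz mono audio at dst."""
    return [
        "ffmpeg",
        "-y",            # overwrite dst if it already exists
        "-i", src,       # input file (e.g. the downloaded voice message)
        "-ac", "1",      # one audio channel (mono)
        "-ar", "16000",  # 16 kHz, the sample rate Whisper models operate on
        dst,
    ]


def convert_to_wav(src: str) -> str:
    """Convert src to a .wav file beside it and return the new path.

    Returns src unchanged if it is already .wav or if ffmpeg is not on PATH.
    """
    if Path(src).suffix.lower() == ".wav" or shutil.which("ffmpeg") is None:
        return src
    dst = str(Path(src).with_suffix(".wav"))
    subprocess.run(build_ffmpeg_cmd(src, dst), check=True, capture_output=True)
    return dst
```

Returning the original path when FFmpeg is missing keeps the pipeline working on machines where only the Python dependencies are installed.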

Usage examples

Scenario 1: Voice message transcription

The user sends a voice message; the AI transcribes it and replies with text:

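from faster_whisper import WhisperModel

message_id = "om_xxx"
file_key = "file_xxx"

# Download the audio
audio_path = await feishu_im_bot_image(
    message_id=message_id,
    file_key=file_key,
    type="audio"
)

# Transcribe the speech
model = WhisperModel("base", device="cpu")
segments, info = model.transcribe(audio_path)
# Segment text normally carries its own leading space, so join without a separator
transcript = "".join(seg.text for seg in segments).strip()

# Build the reply
reply = f"I heard: {transcript}"

# Send a text message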
await message.send(
    to=current_channel,
    message=reply
)

Scenario 2: Two-way voice conversation

The user speaks Chinese and the AI replies by voice:

async def voice_dialogue(message_id: str) -> None:
    # Download and transcribe (download_audio and transcribe are placeholder helpers)
    audio_path = await download_audio(message_id)
    transcript = transcribe(audio_path)
    
    # AI processing
    reply_text = generate_response(transcript)
    
    # TTS conversion
    audio_result = await tts(text=reply_text)
    
    # Send the voice message
    await send_voice_message(
        to=current_channel,
        audio_url=audio_result["audio_url"]
    )

Performance optimization

CPU vs GPU

CPU mode (recommended for beginners):

model = WhisperModel("base", device="cpu")
# Expected speed: 2-4x faster than real time (Apple Silicon)

GPU mode (NVIDIA):

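# Author's setup step; faster-whisper itself runs on CTranslate2 and needs the NVIDIA cuBLAS and cuDNN libraries for CUDA
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118

model = WhisperModel("medium", device="cuda")
# Expected speed: 5-10x faster than real time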

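Apple Silicon (M1/M2/M3):

model = WhisperModel("base", device="cpu", compute_type="int8")
# faster-whisper (CTranslate2) accepts "cpu", "cuda", or "auto" and has no MPS/Metal backend,
# so Apple Silicon runs in CPU mode; int8 quantization keeps it fast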

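Model cache

Whisper models are downloaded automatically on first use:

Location: ~/.cache/huggingface/hub/
Size: about 142MB for the base model
Management: a deleted model is downloaded again automatically the next time it is needed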
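
To see what is currently cached and how much disk space it uses, the cache directory can be scanned directly. This is a minimal sketch: `cached_model_sizes` is a hypothetical helper, it assumes the default cache location given above (the location can differ if Hugging Face cache environment variables are set), and it relies on the hub's `models--*` directory naming.

```python
from pathlib import Path
from typing import Dict, Optional


def cached_model_sizes(cache_dir: Optional[Path] = None) -> Dict[str, int]:
    """Map each cached repo directory (models--*) to its size on disk in bytes."""
    if cache_dir is None:
        cache_dir = Path.home() / ".cache" / "huggingface" / "hub"
    sizes: Dict[str, int] = {}
    if not cache_dir.is_dir():
        return sizes
    for repo in sorted(cache_dir.glob("models--*")):
        # Count regular files only; snapshot entries are usually symlinks
        # into blobs/, and counting them would double the total.
        sizes[repo.name] = sum(
            f.stat().st_size
            for f in repo.rglob("*")
            if f.is_file() and not f.is_symlink()
        )
    return sizes


if __name__ == "__main__":
    for name, size in cached_model_sizes().items():
        print(f"{name}: {size / 1_000_000:.1f} MB")
```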

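Troubleshooting

Whisper model download fails

Symptom: ConnectError: [Errno 65] No route to host

Fix: point the HF_ENDPOINT environment variable at a Hugging Face mirror:

export HF_ENDPOINT=https://hf-mirror.com

Or set it in Python code, before importing faster_whisper so that the Hugging Face client picks it up: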

import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

GPU not detected

Symptom: RuntimeError: CUDA not available

Fix:

Check that the NVIDIA driver is installed
Fall back to CPU mode: device="cpu"
On Apple Silicon, also use device="cpu" (faster-whisper has no MPS backend)
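
The CPU fallback can also be automated so that a missing GPU does not stop the skill. The sketch below tries each device in order; `load_with_fallback` is a hypothetical helper, and catching RuntimeError and ValueError is an assumption about how a failed CUDA initialization is reported.

```python
from typing import Any, Callable, Optional, Sequence, Tuple


def load_with_fallback(
    factory: Callable[..., Any],
    model_size: str = "base",
    devices: Sequence[str] = ("cuda", "cpu"),
) -> Tuple[Any, str]:
    """Try each device in order; return (model, device) for the first that loads."""
    last_error: Optional[Exception] = None
    for device in devices:
        try:
            return factory(model_size, device=device), device
        except (RuntimeError, ValueError) as err:  # assumed failure types
            last_error = err
    raise RuntimeError(f"no usable device among {list(devices)}") from last_error
```

With faster-whisper installed, this would be called as `model, device = load_with_fallback(WhisperModel)`.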

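Best practices

Prefer the base model - performs well enough on CPU and starts quickly
Cache model files - avoid downloading them on every start
Process voice messages in batches - reduces the overhead of reloading the model
Set reasonable timeouts - Whisper transcription can take from a few seconds to tens of seconds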
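
Two of these practices, reusing a loaded model and bounding transcription time, can be sketched in a few lines. Both helper names are hypothetical; `factory` stands for a model constructor such as WhisperModel, and `transcribe` for any blocking function that maps an audio path to text.

```python
import asyncio
from functools import lru_cache
from typing import Any, Callable


@lru_cache(maxsize=4)
def get_model(factory: Callable[..., Any], size: str = "base", device: str = "cpu") -> Any:
    """Build the model once per (factory, size, device) and reuse it afterwards."""
    return factory(size, device=device)


async def transcribe_with_timeout(
    transcribe: Callable[[str], str], audio_path: str, timeout_s: float = 60.0
) -> str:
    """Run a blocking transcribe function in a worker thread with a time limit.

    Raises asyncio.TimeoutError when the limit is exceeded.
    """
    return await asyncio.wait_for(
        asyncio.to_thread(transcribe, audio_path), timeout=timeout_s
    )
```

A timeout here only stops waiting for the result; the worker thread itself keeps running until the transcription finishes, so the limit protects responsiveness rather than CPU time.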

Further reading

Created: 2026-03-16
Maintainer: zhou (码农zhou)
Version: v1.0

Usage Guidance

What to consider before installing/running this skill:

- The skill's functionality (Whisper transcription + Feishu TTS) matches the files, but the package does not declare required Feishu credentials even though Feishu OAuth/app credentials will be needed to fetch user resources or send TTS messages. Expect to need Feishu app credentials; confirm where/how to provide them safely.
- The included install.sh will create a virtualenv, pip-install packages (faster-whisper, torch), optionally clone/compile whisper.cpp, and append OPENAI_API_KEY lines to ~/.bashrc and ~/.zshrc. Do NOT run the installer unreviewed; inspect and run it in a sandbox or isolated environment if you want to test.
- The scripts perform pip installs at runtime in some cases; prefer installing dependencies manually into an isolated venv you control rather than letting scripts run pip automatically.
- The SKILL.md recommends using an HF mirror (HF_ENDPOINT). Using non-official mirrors can be risky; validate the mirror's trustworthiness before setting it system-wide.
- Check for missing files the register script expects (DEPENDENCIES.md, CONFIG.md, check_dependencies.py) and ask the maintainer for them; their absence is an inconsistency.
- If you must install: (1) review install.sh and all scripts line-by-line, (2) run in a disposable VM/container, (3) avoid inserting real API keys into shell rc files and instead use a secrets manager or export variables in-session, (4) limit network access if you don't want automatic model downloads, and (5) verify the origin/maintainer and prefer packages from official sources.
- If you want help auditing specific lines (install.sh modifications, or any script), provide which file and I can point out exact commands to review or sanitize.

Capability Analysis

Type: OpenClaw Skill
Name: feishu-whisper-voice
Version: 1.0.3

The skill bundle provides a Feishu (Lark) integration for voice interaction using Faster-Whisper and TTS. It includes an interactive installation script (install.sh) that manages Python dependencies and optionally configures environment variables for API keys. While the bundle is cluttered with multiple redundant transcription scripts (e.g., transcribe_latest.py, transcribe_newest.py) containing hardcoded temporary paths and one developer-specific local path (/Users/jurry/), these appear to be non-malicious artifacts of development and testing. The core logic is transparent and aligns with the stated purpose of audio processing and skill registration within the OpenClaw environment.

Capability Assessment

⚠ Purpose & Capability

The description and SKILL.md consistently describe a Feishu (Lark) voice-transcribe + TTS workflow using faster-whisper; the included scripts implement transcription using faster-whisper and reference Feishu helper calls in examples. However the skill does not declare or require Feishu/OAuth credentials even though SKILL.md explicitly differentiates bot vs user download flows (bot identity vs OAuth-required user identity). The registry metadata shows no required env vars or primary credential despite the real need to interact with Feishu APIs and potentially TTS services. Also register_skill.py expects files (DEPENDENCIES.md, CONFIG.md, check_dependencies.py) that are not present — an internal inconsistency.

ℹ Instruction Scope

SKILL.md instructs the agent to download audio from Feishu, run faster-whisper transcriptions, and call the platform tts() to produce audio — these are in-scope for the stated purpose. The instructions also suggest setting HF_ENDPOINT to a mirror and optionally using external TTS providers (Azure, ElevenLabs) and OpenAI API keys. Several shipped scripts perform pip installs at runtime or call subprocess to install faster-whisper, and some write transcription output to /tmp or /tmp/voice_result.txt. The instructions do not request unrelated files, but they do allow installing packages and writing to home and shell rc files (via the included install.sh), which broadens the scope beyond pure in-memory processing.

⚠ Install Mechanism

Registry lists no formal install spec, but the bundle includes an install.sh that: creates a virtualenv, pip-installs faster-whisper and torch, optionally clones/compiles whisper.cpp from GitHub, optionally installs Azure/ElevenLabs SDKs, and appends OPENAI_API_KEY lines to ~/.bashrc and ~/.zshrc. The install targets (PyPI and GitHub) are common/traceable, but the script makes persistent modifications to user shell rc files without strong justification and prompts for optional installs. Several runtime scripts also call pip install dynamically. These behaviors increase attack surface and persistence risk if run unreviewed.

⚠ Credentials

The published skill declares no required environment variables, yet SKILL.md and install.sh reference multiple credentials/services: FEISHU OAuth flows are discussed (but not declared), HF_ENDPOINT (a Hugging Face mirror) is suggested, and the installer prompts to add OPENAI_API_KEY and mentions AZURE_SUBSCRIPTION_KEY and ElevenLabs. Requesting OpenAI/Azure keys or using third-party HF mirrors is plausible for optional features, but the absence of explicit required env var declarations for Feishu authentication (which is essential for downloading user resources or calling Feishu TTS) is a notable mismatch and increases the chance of unexpected credential requests or misconfiguration.

⚠ Persistence & Privilege

always:false and normal autonomous invocation are fine. However the included install.sh will modify ~/.bashrc and ~/.zshrc to add an OPENAI_API_KEY line (even if placeholder), create a .venv_whisper directory, and may install system packages (apt-get, brew) or clone/compile code. The register script references ~/.openclaw extension paths and may create or rely on files in the user's home under ~/.openclaw. These on-disk and shell-rc changes create persistence beyond ephemeral skill execution and should be reviewed before running.

Version History

v1.0.3

feishu-whisper-voice v1.0.2
- Added a short metadata block (name/description) to the top of SKILL.md.
- No major feature or logic changes; documentation structure is unchanged.
- This update improves skill metadata for platform compatibility.

v1.0.1

feishu-whisper-voice v1.0.1
- Refactored script layout: all script files moved from scripts/ to scripts-/ directory.
- Removed duplicate or legacy files from previous script structure.
- No user-facing behavior or feature changes.

v1.0.0

feishu-whisper-voice 1.0.0
- Initial release with full Feishu voice interaction capabilities.
- Integrates Faster-Whisper for high-accuracy speech-to-text (STT) in multiple languages.
- Supports TTS (text-to-speech) for AI voice replies, enabling two-way voice conversations.
- Includes example workflows for receiving, transcribing, and responding to audio messages.
- Provides troubleshooting tips, performance guidance, and best practices for CPU/GPU/Apple Silicon environments.

Metadata

Slug feishu-whisper-voice

Version 1.0.3

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 3

Frequently Asked Questions

What is feishu-whisper-voice?

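It uses Faster-Whisper for high-accuracy speech recognition and Feishu's built-in TTS to transcribe voice messages and reply by voice, enabling two-way voice conversation. It is an AI Agent Skill for Claude Code / OpenClaw, with 291 downloads so far.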

How do I install feishu-whisper-voice?

Run "/install feishu-whisper-voice" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is feishu-whisper-voice free?

Yes, feishu-whisper-voice is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does feishu-whisper-voice support?

feishu-whisper-voice is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created feishu-whisper-voice?

It is built and maintained by 15071664 (@15071664); the current version is v1.0.3.
