功能描述

Offline voice toolkit for speech-to-text, text-to-speech, and language detection supporting 25 languages with no API keys or cloud usage.

使用说明 (SKILL.md)

kesha-voice-kit

Name: Kesha Voice Kit
Author: drakulavich

Local voice toolkit: transcribe voice messages to text, synthesize speech, detect language of audio or text. Fully offline after kesha install. No API keys, no per-minute billing.

Trigger keywords for when to use this skill: voice message, voice memo, .ogg, .wav, .mp3, audio file, transcribe, transcription, speech-to-text, STT, text-to-speech, TTS, synthesize speech, say, multilingual voice, multilingual ASR, language detection, offline voice, privacy, Apple Silicon, CoreML.

When to use

Voice memo arrived (Telegram, WhatsApp, Slack, Signal .ogg/.opus/.m4a): transcribe with kesha --json \x3Cpath> and branch on the detected language.
Need to reply with audio: synthesize with kesha say "\x3Ctext>" > reply.wav. Auto-routes by detected language (Kokoro-82M for English, Piper for Russian). For other languages and ~180 more voices use --voice macos-* on macOS (zero model download).
Need to detect what language a file is in before choosing a pipeline: kesha --json audio.ogg returns both audio-based and text-based language detection with confidence scores.

STT: transcribe audio

# JSON output with language detection (recommended for automation)
kesha --json voice.ogg

[{
  "file": "voice.ogg",
  "text": "Привет, как дела?",
  "lang": "ru",
  "audioLanguage": { "code": "ru", "confidence": 0.98 },
  "textLanguage": { "code": "ru", "confidence": 0.99 }
}]

Use lang (or the more detailed audioLanguage/textLanguage) to decide how to respond.

Formats: .ogg, .opus, .mp3, .m4a, .wav, .flac, .webm — decoded via symphonia, no ffmpeg required.

Other output modes:

kesha audio.ogg — plain transcript on stdout
kesha --format transcript audio.ogg — transcript + [lang: ru, confidence: 0.99] footer
kesha --verbose audio.ogg — human-readable with language info
kesha --lang en audio.ogg — warn if detected language differs (useful sanity check)

TTS: synthesize speech

kesha say "Hello, world" > hello.wav               # auto-routes en → Kokoro-82M
kesha say "Привет, мир" > privet.wav              # auto-routes ru → Piper
kesha say --voice macos-de-DE "Guten Tag" > de.wav # any macOS system voice — German, French, Italian, ...
kesha say --list-voices                            # Kokoro + Piper + ~180 macos-* voices

Output: WAV mono float32. --out \x3Cpath> writes to a file instead of stdout.

Language detection standalone

kesha --json audio.ogg includes both audio-based (audioLanguage) and text-based (textLanguage) detection. Use audio detection to identify the language before running language-specific logic.

Install

bun add --global @drakulavich/kesha-voice-kit    # or: npm i -g @drakulavich/kesha-voice-kit
kesha install                                    # downloads engine (~350 MB)
kesha install --tts                              # adds Kokoro + Piper RU + ONNX G2P (~490 MB more, for TTS)

No system deps — G2P runs as ONNX alongside Kokoro/Piper. macos-* voices need no install either — they use voices already on the Mac.

Supported languages

Speech-to-text (25): Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, Ukrainian.

Text-to-speech: English (Kokoro-82M, ~70 voices), Russian (Piper ru-denis), plus any macOS system voice via --voice macos-*.

Performance

ASR: ~19× faster than OpenAI Whisper on Apple Silicon (CoreML via FluidAudio), ~2.5× on CPU (ONNX via ort).
TTS: sub-second latency for short utterances on Apple Silicon.

Why local

No API keys to manage. No per-minute billing. Voice data never leaves the machine — important for regulated industries, personal messaging, and anything that shouldn't be in a third-party log.

Links

安全使用建议

This skill appears coherent with its stated purpose (local STT/TTS/language detection). Before installing: 1) Inspect the npm package and GitHub repo (verify owner, recent commits, issues) and prefer installing from the official npm/GitHub links shown in the README. 2) Do not run curl | bash blindly — if you need Bun, install it via your platform package manager or inspect the install script first. 3) Audit what `kesha install` downloads (check URLs, hashes if provided) or run the install in a sandbox/VM/container if you want to avoid any unexpected network activity. 4) Be prepared for large downloads and disk usage (~350–500MB+). 5) If you need to be air-gapped, follow the project’s model-mirroring docs before using. Finally, because the registry metadata omitted an explicit install spec, confirm the exact install steps and binary names (kesha) on the upstream repository before proceeding.

功能分析

Type: OpenClaw Skill Name: kesha-voice-kit Version: 1.4.4 The kesha-voice-kit is a local voice processing toolkit for STT and TTS, utilizing models like NVIDIA Parakeet and Kokoro. The skill bundle (SKILL.md, README.md) describes standard installation via npm/bun and a model download step (kesha install). No indicators of data exfiltration, malicious execution, or prompt injection were identified; the behavior is consistent with the stated purpose of offline audio processing.

能力评估

✓ Purpose & Capability

The name/description (STT/TTS/language detection, offline) lines up with the runtime instructions: install the @drakulavich/kesha-voice-kit npm package and run the provided `kesha` CLI which downloads an engine and models. The README/SKILL.md reference legitimate hosts (GitHub, npm, HuggingFace). Minor inconsistency: registry metadata listed no install spec while SKILL.md includes an install block and a `requires: bins: [kesha]` frontmatter; this is plausible (instruction-only skill) but worth noting.

ℹ Instruction Scope

The SKILL.md instructs the agent to run only kesha commands on audio files and to use `kesha install` for offline setup — actions are within the voice-processing scope. There are no instructions to read unrelated system files, env vars, or to exfiltrate data. However, the README includes a convenience line to run `curl -fsSL https://bun.sh/install | bash` which instructs executing a remote install script — this is a high-risk, out-of-band action and should be avoided or audited before running. The instructions also require network access to download engines/models during install (consistent with 'offline after install').

ℹ Install Mechanism

There is no platform-level install spec in the registry (the skill is instruction-only), but SKILL.md/README tell users to install the package globally from npm (bun/npm) and then run `kesha install`, which downloads engine and models (~350–500MB). Download sources referenced are standard (npm, GitHub, HuggingFace) which is expected for an ML tool; this is moderate-risk (third-party code & large model downloads). The README's recommended curl|bash bun installer is a higher-risk convenience instruction and should be treated cautiously.

✓ Credentials

The skill declares no required environment variables or credentials and the instructions do not ask for secrets. The only external needs are network access during install and substantial disk space for models. No unrelated credentials or config paths are requested.

✓ Persistence & Privilege

The skill does not request always:true and does not modify other skills or system-wide agent configuration. Installing the npm package and running `kesha install` will write binaries and large model files to disk (expected for a local ML toolkit) — consider this normal but note the storage footprint and that the CLI runs locally thereafter.

版本历史

v1.4.4

v1.4.4: include LICENSE in bundle (1.4.3 dropped it accidentally).

v1.4.3

v1.4.3: ONNX G2P replaces espeak-ng (no system deps); README trimmed; SKILL.md reframed as multilingual.

v1.4.1

v1.4.1: ONNX G2P replaces espeak-ng — no system deps on any platform. VAD auto-trigger for audio ≥120s. SSML <phoneme alphabet='ipa' ph='...'> override. AVSpeechSynthesizer for ~180 macOS system voices. Fixes displayName (was 'Parakeet Cli Ssml').

v1.3.2

- Updated description and documentation to clarify multilingual support for both STT and TTS features. - Improved wording and examples for text-to-speech, highlighting usage of additional macOS system voices in multiple languages. - Revised trigger keywords and usage sections to emphasize multilingual voice and language detection capabilities. - No code changes — documentation update only.

v1.3.1

- Removed 94 files, including documentation, plans, specs, benchmarks, and various assets. - No changes to functionality, install process, or user-facing features. - Skill remains focused on local speech-to-text, text-to-speech, and language detection.

v1.3.0

- Added detailed documentation clarifying usage, supported features, and installation steps. - Explained offline operation, wide audio format support, and privacy benefits. - Listed supported languages for both STT (speech-to-text) and TTS (text-to-speech). - Described performance improvements over OpenAI Whisper and introduced automatic language detection and voice routing. - Provided clear platform-specific installation and voice usage instructions.

元数据

Slug kesha-voice-kit

版本 1.4.4

许可证 MIT-0

累计安装 0

当前安装数 0

历史版本数 6

常见问题

Kesha Voice Kit 是什么？

Offline voice toolkit for speech-to-text, text-to-speech, and language detection supporting 25 languages with no API keys or cloud usage. 它是一个面向 Claude Code / OpenClaw 的 AI Agent Skill 插件，目前累计下载 192 次。

如何安装 Kesha Voice Kit？

在 OpenClaw 或 Claude Code 对话框中运行命令「/install kesha-voice-kit」即可一键安装，无需额外配置。

Kesha Voice Kit 是免费的吗？

是的，Kesha Voice Kit 完全免费，采用 MIT-0 许可证，可自由下载、安装和使用。

Kesha Voice Kit 支持哪些平台？

Kesha Voice Kit 跨平台运行，可在任意部署了 OpenClaw / Claude Code 的环境中使用（cross-platform）。

谁开发了 Kesha Voice Kit？

由 Anton Yakutovich（@drakulavich）开发并维护，当前版本 v1.4.4。

Kesha Voice Kit