
Audio Recognition

Name: Audio Recognition
Author: zzhimin

by zzhimin · GitHub ↗ · v1.0.0 · MIT-0

cross-platform ✓ Security Clean

Downloads 127

Install in OpenClaw

/install audio-recognition

Description

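An audio speech recognition (Speech-to-Text) service. It is triggered when a user uploads an audio file and needs the speech converted to text, or needs specific information in the audio identified (such as keywords or song titles). Suited to: (1) meeting recording transcription, (2) audio content extraction, (3) voice command recognition, (4) subtitle generation for audio and video.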

README (SKILL.md)

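Audio Speech Recognition (Audio Recognition)

Accurately transcribes the speech in an audio file to text and can distinguish between different speakers.

Core Capabilities

Speech-to-text
Speaker diarization
Punctuation and sentence segmentation
Multi-language recognition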

Workflow

1. Audio Preprocessing

Noise reduction
Format conversion (normalize to 16 kHz / 16-bit PCM or MP3)
Volume normalization
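
The format conversion and volume normalization steps above can be sketched in plain Python. This is a minimal illustration, not the skill's implementation (the skill ships no code): the function names peak_normalize, resample_linear, and to_pcm16 are hypothetical, samples are assumed to be floats nominally in [-1.0, 1.0], and the resampler is a naive linear interpolation with no anti-aliasing filter. Noise reduction is left out.

```python
import struct

def peak_normalize(samples, target_peak=0.9):
    """Scale float samples so the loudest one reaches target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # pure silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]

def resample_linear(samples, src_rate, dst_rate=16000):
    """Naive linear-interpolation resampler (assumption: no anti-aliasing)."""
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate      # fractional index into the input
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

def to_pcm16(samples):
    """Convert float samples to little-endian signed 16-bit PCM bytes, clipping overflow."""
    ints = [max(-32768, min(32767, round(s * 32767))) for s in samples]
    return struct.pack("<%dh" % len(ints), *ints)
```

A real pipeline would more likely delegate resampling and denoising to a dedicated audio tool; the sketch only shows what "16 kHz / 16-bit PCM" and "volume normalization" mean at the sample level.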

2. Acoustic Feature Extraction

Extract acoustic features such as MFCC and FBANK
These serve as input to the downstream ASR model
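
Both MFCC and FBANK features are computed over short overlapping frames and a bank of filters spaced on the mel scale. The sketch below shows only those two building blocks. It assumes the commonly used mel formula mel = 2595 * log10(1 + f / 700), a 25 ms window with a 10 ms hop, and 40 filters; none of these values come from the skill, and all function names are hypothetical.

```python
import math

def hz_to_mel(hz):
    """Convert a frequency in Hz to mels (common 2595 * log10 formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_center_frequencies(n_filters=40, sample_rate=16000, f_min=0.0):
    """Center frequencies (Hz) of n_filters filters spaced evenly on the mel scale."""
    f_max = sample_rate / 2  # Nyquist frequency
    mel_lo, mel_hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (mel_hi - mel_lo) / (n_filters + 1)
    return [mel_to_hz(mel_lo + step * (i + 1)) for i in range(n_filters)]

def frame_count(n_samples, sample_rate=16000, win_ms=25, hop_ms=10):
    """Number of full analysis frames that fit in n_samples."""
    win = sample_rate * win_ms // 1000
    hop = sample_rate * hop_ms // 1000
    if n_samples < win:
        return 0
    return 1 + (n_samples - win) // hop
```

With these assumed settings, one second of 16 kHz audio yields 98 full frames, each of which would be mapped to one 40-dimensional FBANK vector.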

3. ASR Speech Recognition

Run speech recognition with an ASR model
Produce a first-draft transcript
Supported models: Whisper, WeNet, Paraformer, and others

4. Post-processing

Text error correction
Sentence segmentation and punctuation insertion
Speaker diarization labeling
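
One common way to attach speaker labels to a transcript is to give each ASR segment the speaker whose diarization turn overlaps it the longest. The sketch below assumes that approach; the skill does not say how labeling is done. The data shapes are hypothetical: segments are dicts with start, end, and text keys (times in seconds), and turns are (start, end, speaker) tuples from some diarization step.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def label_speakers(segments, turns):
    """Return copies of segments with a 'speaker' key chosen by maximum overlap."""
    labeled = []
    for seg in segments:
        totals = {}  # speaker -> total overlapping seconds
        for t_start, t_end, speaker in turns:
            dur = overlap(seg["start"], seg["end"], t_start, t_end)
            if dur > 0:
                totals[speaker] = totals.get(speaker, 0.0) + dur
        # None marks a segment that no diarization turn covers
        speaker = max(totals, key=totals.get) if totals else None
        labeled.append({**seg, "speaker": speaker})
    return labeled
```

Segments that straddle a speaker change get a single label under this scheme; splitting them at the turn boundary would need word-level timestamps.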

5. Output

Final recognized text
Speaker labels (if needed)
Timestamps (if needed)
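
The three output elements above, with speaker labels and timestamps being optional, could be rendered as plain text roughly as follows. This is a sketch under assumptions: the skill defines no output schema, so the segment dicts (start, end, text, and an optional speaker key), the HH:MM:SS.mmm timestamp style, and the function names are all illustrative.

```python
def format_timestamp(seconds):
    """Render a time in seconds as HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"

def render_transcript(segments, with_speakers=False, with_timestamps=False):
    """Join segments into text, adding timestamps and speaker labels on request."""
    lines = []
    for seg in segments:
        prefix = ""
        if with_timestamps:
            prefix += f"[{format_timestamp(seg['start'])} - {format_timestamp(seg['end'])}] "
        if with_speakers and seg.get("speaker"):
            prefix += f"{seg['speaker']}: "
        lines.append(prefix + seg["text"])
    return "\n".join(lines)
```

Calling render_transcript(segments) with both flags off returns only the recognized text, which matches the default case where neither speaker labels nor timestamps are requested.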

Quality Targets

Accuracy: 95% or higher (standard Mandarin recordings)
Latency: supports both real-time and offline modes
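
The skill does not say how the 95% accuracy figure is measured. One plausible reading for Mandarin is character accuracy, defined as 1 minus the character error rate (CER), where CER is the edit distance between the reference and the recognized text divided by the reference length. The sketch below assumes that definition; the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def character_accuracy(reference, hypothesis):
    """Return 1 - CER; can go negative if the hypothesis has many insertions."""
    if not reference:
        raise ValueError("reference must not be empty")
    cer = edit_distance(reference, hypothesis) / len(reference)
    return 1.0 - cer
```

Under this definition, a 20-character reference recognized with a single wrong character scores 0.95, right at the stated target. Checking the claim against your own recordings means running this over a hand-corrected reference transcript.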

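Limitations

Noise: recognition quality degrades when background noise is too loud
Accents: heavy dialects or accents may reduce accuracy
Privacy: audio uploaded by the user is used only for the current recognition request and must not be used for model training or any other purpose
Semantics: the skill only converts speech to text and does not interpret the meaning of the text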

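Reference Services

讯飞听见 (iFLYTEK)
Google Speech-to-Text
Alibaba Cloud speech recognition (阿里云语音识别)
Tencent Cloud speech recognition (腾讯云语音识别)

Use Cases

Automatic generation of meeting minutes
Subtitle production for audio and video
Spoken-content search
Organizing recorded files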

Usage Guidance

This skill is a high-level spec for an audio speech-to-text pipeline and appears coherent, but it does not implement anything by itself. Before installing or enabling: (1) confirm which runtime or service the agent will actually use (local model vs cloud provider); (2) if it uses third-party cloud APIs, expect to need API keys and verify how audio is uploaded and stored — the SKILL.md's privacy promise is descriptive but not enforceable; (3) verify accuracy claims (95% for Mandarin) against your expected audio conditions; and (4) ensure you have legal/consent coverage for processing any sensitive audio.

Capability Analysis

Type: OpenClaw Skill
Name: audio-recognition
Version: 1.0.0

The skill bundle contains only metadata and documentation describing a standard audio-to-text (Speech-to-Text) workflow. There is no executable code, and the instructions in SKILL.md are purely descriptive of a legitimate service process without any signs of prompt injection, data exfiltration, or malicious intent.

Capability Assessment

✓ Purpose & Capability

The name/description (speech-to-text, diarization, punctuation, multi-language) aligns with the SKILL.md content. The skill does not ask for unrelated credentials, binaries, or config paths.

ℹ Instruction Scope

The SKILL.md outlines preprocessing, feature extraction, ASR models (Whisper/WeNet/Paraformer), and postprocessing at a high level. It does not tell the agent to read unrelated files or exfiltrate data, but it is high-level and leaves implementation choices unspecified (e.g., which model/service to call), so runtime behavior depends on the agent environment and any integrations the agent has.

✓ Install Mechanism

No install specification or code files are present — the skill is instruction-only, so nothing will be written or executed by default.

ℹ Credentials

The skill requires no environment variables or credentials as declared, which is coherent for a descriptive spec. However, real implementations often require API keys or local model binaries; the SKILL.md does not request them or describe secure handling, so users should verify how the agent will implement model calls.

✓ Persistence & Privilege

always:false and default model invocation settings are used. The skill does not request persistent presence or system-level configuration changes.

How to Use

Make sure OpenClaw is installed (local or Docker)
Run the install command in chat: /install audio-recognition
After installation, invoke the skill by name or use /audio-recognition
Provide required inputs per the skill's parameter spec and get structured output

Version History

v1.0.0

- Initial release of the audio-recognition skill for speech-to-text tasks.
- Supports accurate transcription of audio to text with speaker diarization.
- Includes noise reduction, audio normalization, and multi-language recognition.
- Features robust post-processing: punctuation, error correction, and optional timestamps.
- Suitable for meeting transcriptions, subtitle generation, voice command recognition, and content extraction.

Metadata

Slug audio-recognition

Version 1.0.0

License MIT-0

All-time Installs 1

Active Installs 1

Total Versions 1

Frequently Asked Questions

What is Audio Recognition?

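Audio Recognition is an audio speech recognition (Speech-to-Text) service. It is triggered when a user uploads an audio file and needs the speech converted to text, or needs specific information in the audio identified (such as keywords or song titles). It is suited to: (1) meeting recording transcription, (2) audio content extraction, (3) voice command recognition, and (4) subtitle generation for audio and video. It is an AI Agent Skill for Claude Code / OpenClaw, with 127 downloads so far.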

How do I install Audio Recognition?

Run "/install audio-recognition" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Audio Recognition free?

Yes, Audio Recognition is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Audio Recognition support?

Audio Recognition is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Audio Recognition?

It is built and maintained by zzhimin (@zzhimin); the current version is v1.0.0.
