
Gipformer ASR

Author: AI-GGroup · GitHub · v1.0.0 · MIT-0
cross-platform ✓ Security check passed
190 total downloads · 0 favorites · 0 current installs · 1 version
Install in OpenClaw
/install gipformer
Description
Vietnamese speech-to-text using Gipformer ASR (65M params, Zipformer-RNNT). Accepts audio of any length — the server handles VAD chunking, batching, and retu...
Usage (SKILL.md)

Gipformer ASR

Vietnamese speech recognition — send audio of any length, get transcript.

Hugging Face model: g-group-ai-lab/gipformer-65M-rnnt (65M params, int8/fp32 ONNX)

Architecture

flowchart TD
    A[Audio file] -->|base64 encode| B[POST /transcribe]
    B --> C[Decode & resample to 16kHz]
    C --> D[VAD chunking ≤ 20s]
    D --> E[Batch inference — sherpa-onnx]
    E --> F[Merge chunk texts]
    F --> G["{ transcript, chunks }"]

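The client sends base64-encoded audio of any length, in any of the formats listed under Audio Format. The server decodes it, splits it into chunks of at most 20 s using VAD, runs inference in batches, and returns the full transcript together with per-chunk text.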
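The chunk-and-merge steps in the flowchart can be illustrated with a small sketch. The server's actual chunking code is not shown here, so the function names (`cap_segments`, `merge_texts`) and the equal-split strategy are assumptions made for illustration only; the 20 s limit comes from the diagram.

```python
import math
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_s, end_s) of one speech region from a VAD


def cap_segments(segments: List[Segment], max_len_s: float = 20.0) -> List[Segment]:
    """Split any speech segment longer than max_len_s into equal-length pieces.

    Hypothetical helper: the real server may cut at silence boundaries instead.
    """
    chunks: List[Segment] = []
    for start, end in segments:
        duration = end - start
        if duration <= max_len_s:
            chunks.append((start, end))
            continue
        # Smallest number of pieces that keeps every piece within the limit.
        pieces = math.ceil(duration / max_len_s)
        step = duration / pieces
        for i in range(pieces):
            chunks.append((start + i * step, start + (i + 1) * step))
    return chunks


def merge_texts(texts: List[str]) -> str:
    """Join per-chunk transcripts in order, skipping empty results."""
    return " ".join(t.strip() for t in texts if t.strip())
```

Splitting evenly rather than cutting at exactly 20 s avoids leaving a very short trailing piece, though a real implementation would more likely prefer pauses in the speech as cut points.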

Quick Start

1. Install dependencies

pip install -r {baseDir}/requirements.txt

System dependency: ffmpeg (required for M4A support).

2. Start the server

python {baseDir}/scripts/serve.py
# or with options:
python {baseDir}/scripts/serve.py --port 8910 --quantize int8 --max-batch-size 32

The server downloads the ASR model + VAD model on first run and listens on http://127.0.0.1:8910.

3. Transcribe audio

# Single file (any format)
python {baseDir}/scripts/transcribe.py audio.wav
python {baseDir}/scripts/transcribe.py recording.mp3

# Multiple files
python {baseDir}/scripts/transcribe.py *.wav

# JSON output with chunk details
python {baseDir}/scripts/transcribe.py audio.wav --json

# Save results
python {baseDir}/scripts/transcribe.py audio.wav -o results.json

4. Direct API call (curl)

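# Transcribe (any length, any supported format)
# The base64 payload must be a single line; GNU base64 wraps output at 76 columns by default, so strip newlines.
curl -X POST http://127.0.0.1:8910/transcribe \
  -H "Content-Type: application/json" \
  -d "{\"audio_b64\": \"$(base64 < audio.wav | tr -d '\n')\"}"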

# Response:
# { "transcript": "full text...", "duration_s": 120.5, "process_time_s": 5.2,
#   "chunks": [{"text": "...", "start_s": 0.0, "end_s": 8.7}, ...] }

# Health check
curl http://127.0.0.1:8910/health
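
For long recordings the base64 payload can exceed what a shell command line comfortably carries, so calling the endpoint from Python may be more practical. The following is a minimal sketch using only the standard library; the helper names `build_payload` and `transcribe` are hypothetical, while the URL, the `audio_b64` field, and the `transcript` key are taken from the curl example above.

```python
import base64
import json
import urllib.request


def build_payload(audio_bytes: bytes) -> bytes:
    """Wrap raw audio bytes in the JSON body expected by POST /transcribe."""
    body = {"audio_b64": base64.b64encode(audio_bytes).decode("ascii")}
    return json.dumps(body).encode("utf-8")


def transcribe(path: str, url: str = "http://127.0.0.1:8910/transcribe") -> dict:
    """Send one audio file to a running server and return the parsed JSON response."""
    with open(path, "rb") as f:
        payload = build_payload(f.read())
    req = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With the server running, `transcribe("audio.wav")["transcript"]` should give the full text, and the `chunks` list carries the per-chunk `start_s` and `end_s` values.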

Audio Format

| Format | Extension | Support |
|---|---|---|
| WAV | .wav | Native (soundfile) |
| FLAC | .flac | Native (soundfile) |
| OGG | .ogg | Native (soundfile) |
| MP3 | .mp3 | Native (soundfile) |
| M4A/AAC | .m4a | Via ffmpeg |

All formats are converted to WAV 16-bit PCM mono 16kHz internally.
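
As an illustration of that target format, the sketch below builds an ffmpeg command line that produces mono 16 kHz 16-bit PCM WAV. How the server actually invokes ffmpeg is not documented here, so `ffmpeg_to_wav_cmd` is a hypothetical helper; the flags themselves (`-ac`, `-ar`, `-c:a pcm_s16le`) are standard ffmpeg options.

```python
from typing import List


def ffmpeg_to_wav_cmd(src: str, dst: str, sample_rate: int = 16000) -> List[str]:
    """Return an ffmpeg argument list converting src to mono 16-bit PCM WAV."""
    return [
        "ffmpeg",
        "-y",                      # overwrite dst without prompting
        "-i", src,                 # input file in any format ffmpeg can decode
        "-ac", "1",                # downmix to one channel (mono)
        "-ar", str(sample_rate),   # resample to the target rate
        "-c:a", "pcm_s16le",       # 16-bit signed little-endian PCM
        dst,
    ]
```

Passing the list to `subprocess.run(cmd, check=True)` runs the conversion without going through a shell, which avoids quoting problems with unusual file names.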

Server Tuning

| Flag | Default | Effect |
|---|---|---|
| --quantize | int8 | fp32 for accuracy, int8 for speed/size |
| --max-batch-size | 16 | Higher = more throughput, more latency |
| --max-wait-ms | 100 | How long to wait before flushing a partial batch |
| --num-threads | 4 | ONNX runtime threads |
| --decoding-method | modified_beam_search | Use greedy_search for faster decoding |
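
The interaction between `--max-batch-size` and `--max-wait-ms` is easier to see in code. The sketch below shows one common way to implement this kind of dynamic batching with a queue; it is an assumption about the general technique, not the server's actual code, and `collect_batch` is a hypothetical name.

```python
import queue
import time
from typing import Any, List


def collect_batch(q: "queue.Queue[Any]", max_batch_size: int = 16,
                  max_wait_ms: int = 100) -> List[Any]:
    """Block for the first item, then gather more until the batch is full
    or the wait window has elapsed, whichever comes first."""
    batch = [q.get()]  # wait indefinitely for at least one request
    deadline = time.monotonic() + max_wait_ms / 1000.0
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # window expired: flush the partial batch
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived in time
    return batch
```

Under this scheme a larger batch size only helps when requests arrive fast enough to fill it, and the wait window bounds the extra latency a lone request can incur.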

API Reference

See references/api.md for full endpoint documentation.

Security recommendations
This skill appears coherent for running a local Vietnamese ASR server, but review the following before installing:
  1. It downloads model files from Hugging Face on first run; verify that the REPO_ID (g-group-ai-lab/gipformer-65M-rnnt) is trusted.
  2. You must install Python packages (sherpa-onnx, onnxruntime, silero-vad, fastapi, etc.) and system dependencies such as ffmpeg and possibly libsndfile; these can be large and may require system package installs.
  3. The server executes ffmpeg via subprocess and writes temporary files while decoding uploaded audio; run it in a sandbox, virtualenv, or container if you want isolation.
  4. No secrets are requested by the skill, but huggingface_hub may automatically use a Hugging Face token found in your environment (HF_TOKEN, or the legacy HUGGING_FACE_HUB_TOKEN); a token is only needed for private models.
  5. If you plan to expose the server beyond localhost, review network/security settings (authentication is not implemented).
If uncertain, run the code in a controlled environment and inspect the repository on Hugging Face before use.
Functional analysis
Type: OpenClaw Skill
Name: gipformer
Version: 1.0.0

The gipformer skill provides Vietnamese speech-to-text functionality using the Gipformer ASR model. The bundle includes a FastAPI server (serve.py) that handles model inference via sherpa-onnx, an audio chunking utility (chunk_audio.py) using Silero VAD, and a client script (transcribe.py) for interacting with the API. The code follows standard practices for machine learning services, such as downloading models from Hugging Face and using subprocess safely for audio conversion with ffmpeg. No indicators of malicious intent, data exfiltration, or harmful prompt injection were found.
Capability assessment
Purpose & Capability
Name/description (Vietnamese ASR) align with the included code and requirements: scripts implement VAD chunking, ONNX-based inference (sherpa-onnx), a FastAPI server, and a client. Required packages in requirements.txt are consistent with the functionality.
Instruction Scope
SKILL.md instructs installing dependencies, running a local server, and sending base64 audio to /transcribe. The runtime instructions and code operate on provided audio files and do not read unrelated system files or env vars. The server decodes audio, chunks it, runs inference, and returns transcripts as described.
Install Mechanism
There is no automated install spec in the registry; SKILL.md expects the user to pip install -r requirements.txt. Model files are downloaded at first run from Hugging Face (hf_hub_download). Network downloads and heavy native/system deps (ffmpeg, libsndfile) are required — expected for this use-case but worth noting before install.
Credentials
The skill does not request environment variables, credentials, or configuration paths. It uses huggingface_hub to download public model files; if a private repo were used, the library would pick up a Hugging Face token from the environment (HF_TOKEN, or the legacy HUGGING_FACE_HUB_TOKEN), but no token is required by this package.
Persistence & Privilege
Skill is not always-enabled and does not modify other skills or system-wide agent settings. It runs a local server when started; no privileged or persistent platform-level presence is requested by the skill metadata.
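How to use
  1. Make sure OpenClaw is installed (locally or via Docker).
  2. Enter the install command in the chat box: /install gipformer
  3. Once installed, invoke the skill by name or trigger it with /gipformer
  4. Provide the inputs described in the skill's parameter documentation to get structured output.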
Version history
v1.0.0
Initial release of Vietnamese speech-to-text using Gipformer ASR.
- Supports speech recognition for Vietnamese audio using a 65M parameter Zipformer-RNNT model.
- Accepts audio in WAV, FLAC, OGG, MP3, and M4A formats; any duration.
- Handles VAD chunking, batching, and provides full transcript with chunk metadata.
- Server and CLI tools provided for both API and script-based transcription.
- Configurable for quantization, batch size, decoding method, and format support (ffmpeg required for M4A).
- Includes health check and comprehensive API documentation.
Metadata
Slug: gipformer
Version: 1.0.0
License: MIT-0
Total installs: 0
Current installs: 0
Version count: 1
FAQ

What is Gipformer ASR?

Vietnamese speech-to-text using Gipformer ASR (65M params, Zipformer-RNNT). Accepts audio of any length — the server handles VAD chunking, batching, and retu... It is an AI agent skill plugin for Claude Code / OpenClaw, with 190 total downloads to date.

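How do I install Gipformer ASR?

Run the command "/install gipformer" in the OpenClaw or Claude Code chat box to install the skill in one step. The skill itself needs no further configuration, but the local server still requires the Python packages and ffmpeg described in the Quick Start.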

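Is Gipformer ASR free?

Yes. Gipformer ASR is completely free and released under the MIT-0 license; it can be downloaded, installed, and used freely.

Which platforms does Gipformer ASR support?

Gipformer ASR is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed Gipformer ASR?

It is developed and maintained by AI-GGroup (@ai-ggroup); the current version is v1.0.0.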
