
Gipformer ASR

by AI-GGroup · GitHub ↗ · v1.0.0 · MIT-0
Cross-platform · Security: Clean · Downloads: 190 · Stars: 0 · Active Installs: 0 · Versions: 1
Install in OpenClaw
/install gipformer
Description
Vietnamese speech-to-text using Gipformer ASR (65M params, Zipformer-RNNT). Accepts audio of any length — the server handles VAD chunking, batching, and retu...
README (SKILL.md)

Gipformer ASR

Vietnamese speech recognition — send audio of any length, get transcript.

Hugging Face model: g-group-ai-lab/gipformer-65M-rnnt (65M params, int8/fp32 ONNX)

Architecture

flowchart TD
    A[Audio file] -->|base64 encode| B[POST /transcribe]
    B --> C[Decode & resample to 16kHz]
    C --> D[VAD chunking ≤ 20s]
    D --> E[Batch inference — sherpa-onnx]
    E --> F[Merge chunk texts]
    F --> G["{ transcript, chunks }"]

The client sends base64-encoded audio of any length, in any of the formats listed under Audio Format. The server decodes it, splits it into chunks with VAD, runs batched inference, and returns the full transcript.
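The client side of this flow can be sketched in Python using only the standard library. This is an illustrative sketch, not the bundled transcribe.py: the `audio_b64` field and the default address come from the curl example in the Quick Start, while the function names and the timeout value are assumptions.

```python
import base64
import json
import urllib.request

# Default address from the Quick Start; change it if the server runs elsewhere.
SERVER_URL = "http://127.0.0.1:8910/transcribe"


def build_payload(audio_bytes: bytes) -> bytes:
    """Wrap raw audio bytes in a JSON body with a single audio_b64 field."""
    audio_b64 = base64.b64encode(audio_bytes).decode("ascii")
    return json.dumps({"audio_b64": audio_b64}).encode("utf-8")


def transcribe(path: str, url: str = SERVER_URL, timeout_s: float = 600.0) -> dict:
    """POST one audio file to a running server and return the parsed JSON reply."""
    with open(path, "rb") as f:
        body = build_payload(f.read())
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # The timeout is an assumed value; long recordings may need more.
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `transcribe("audio.wav")["transcript"]` would return the merged text.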

Quick Start

1. Install dependencies

pip install -r {baseDir}/requirements.txt

System dependency: ffmpeg (required for M4A support).

2. Start the server

python {baseDir}/scripts/serve.py
# or with options:
python {baseDir}/scripts/serve.py --port 8910 --quantize int8 --max-batch-size 32

On first run the server downloads the ASR and VAD models, then listens on http://127.0.0.1:8910.
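Because the first start includes a model download, a script that launches the server may want to wait before sending audio. The sketch below polls the /health endpoint until it answers. It assumes that /health returns HTTP 200 once the server is usable, which the source does not state; the function names and defaults are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """Return True if a GET to `url` answers with HTTP 200 (assumed to mean ready)."""
    try:
        with urllib.request.urlopen(url, timeout=2.0) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: treat as not ready yet.
        return False


def wait_until_healthy(
    url: str = "http://127.0.0.1:8910/health",
    timeout_s: float = 300.0,
    interval_s: float = 1.0,
    probe: Callable[[str], bool] = http_probe,
) -> bool:
    """Poll `probe(url)` until it succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

The `probe` parameter is injectable so the loop can be exercised without a live server.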

3. Transcribe audio

# Single file (any format)
python {baseDir}/scripts/transcribe.py audio.wav
python {baseDir}/scripts/transcribe.py recording.mp3

# Multiple files
python {baseDir}/scripts/transcribe.py *.wav

# JSON output with chunk details
python {baseDir}/scripts/transcribe.py audio.wav --json

# Save results
python {baseDir}/scripts/transcribe.py audio.wav -o results.json

4. Direct API call (curl)

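# Transcribe (any length, any supported format)
# tr strips the line wraps that GNU base64 adds, which would otherwise break the JSON string
curl -X POST http://127.0.0.1:8910/transcribe \
  -H "Content-Type: application/json" \
  -d "{\"audio_b64\": \"$(base64 < audio.wav | tr -d '\n')\"}"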

# Response:
# { "transcript": "full text...", "duration_s": 120.5, "process_time_s": 5.2,
#   "chunks": [{"text": "...", "start_s": 0.0, "end_s": 8.7}, ...] }

# Health check
curl http://127.0.0.1:8910/health
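The response shown above can be post-processed on the client. The following sketch prints one line per chunk with its time range and computes a real-time factor from `process_time_s` and `duration_s`. The field names are taken from the sample response; the helper names are hypothetical, and full field semantics are documented in references/api.md rather than here.

```python
def format_chunks(response: dict) -> str:
    """Render each chunk as '[start - end] text', one chunk per line."""
    lines = []
    for chunk in response.get("chunks", []):
        lines.append(
            f"[{chunk['start_s']:.1f}s - {chunk['end_s']:.1f}s] {chunk['text']}"
        )
    return "\n".join(lines)


def real_time_factor(response: dict) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    duration = response["duration_s"]
    if duration <= 0:
        raise ValueError("duration_s must be positive")
    return response["process_time_s"] / duration


# Sample data shaped like the response in the curl example.
sample = {
    "transcript": "xin chào",
    "duration_s": 120.5,
    "process_time_s": 5.2,
    "chunks": [{"text": "xin chào", "start_s": 0.0, "end_s": 8.7}],
}
print(format_chunks(sample))
```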

Audio Format

| Format | Extension | Support |
|---|---|---|
| WAV | .wav | Native (soundfile) |
| FLAC | .flac | Native (soundfile) |
| OGG | .ogg | Native (soundfile) |
| MP3 | .mp3 | Native (soundfile) |
| M4A/AAC | .m4a | Via ffmpeg |

All formats are converted to WAV 16-bit PCM mono 16kHz internally.
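For reference, the same target format (mono, 16 kHz, 16-bit PCM WAV) can be produced ahead of time with ffmpeg. The sketch below builds the argument list and runs it. It is an assumption about how such a conversion might look, not the server's actual ffmpeg invocation, which the source does not show.

```python
import subprocess
from typing import List


def ffmpeg_to_wav_args(src: str, dst: str) -> List[str]:
    """Build an ffmpeg command that converts `src` to 16 kHz mono 16-bit PCM WAV."""
    return [
        "ffmpeg",
        "-y",                  # overwrite dst if it already exists
        "-i", src,             # input file, any format ffmpeg can read
        "-ac", "1",            # downmix to one channel (mono)
        "-ar", "16000",        # resample to 16 kHz
        "-c:a", "pcm_s16le",   # 16-bit signed little-endian PCM
        dst,
    ]


def convert_to_wav(src: str, dst: str) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(ffmpeg_to_wav_args(src, dst), check=True, capture_output=True)
```

Passing the arguments as a list avoids shell quoting problems with unusual file names.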

Server Tuning

| Flag | Default | Effect |
|---|---|---|
| --quantize | int8 | fp32 for accuracy, int8 for speed/size |
| --max-batch-size | 16 | Higher = more throughput, more latency |
| --max-wait-ms | 100 | How long to wait before flushing a partial batch |
| --num-threads | 4 | ONNX runtime threads |
| --decoding-method | modified_beam_search | Use greedy_search for faster decoding |
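The interaction between --max-batch-size and --max-wait-ms can be illustrated with a small batching loop: a batch is flushed when it is full or when the wait expires, whichever comes first. This is a generic sketch of that pattern under assumed semantics, not the code in serve.py; `collect_batch` is a hypothetical name, and only the default values come from the table above.

```python
import queue
import time
from typing import Any, List


def collect_batch(
    pending: "queue.Queue[Any]",
    max_batch_size: int = 16,
    max_wait_ms: float = 100.0,
) -> List[Any]:
    """Block for the first item, then gather more until full or the wait expires."""
    batch = [pending.get()]  # wait indefinitely for the first item
    deadline = time.monotonic() + max_wait_ms / 1000.0
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # wait expired: flush a partial batch
        try:
            batch.append(pending.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived in time
    return batch


# Five queued chunks with a batch size of three yield one full and one partial batch.
q: "queue.Queue[int]" = queue.Queue()
for i in range(5):
    q.put(i)
first = collect_batch(q, max_batch_size=3, max_wait_ms=50)
second = collect_batch(q, max_batch_size=3, max_wait_ms=50)
```

A larger batch size lets more chunks share one inference call, while a longer wait raises the latency of partially filled batches, which matches the trade-off described in the table.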

API Reference

See references/api.md for full endpoint documentation.

Usage Guidance
This skill appears coherent for running a local Vietnamese ASR server, but review and be prepared for the following before installing:
  1. It will download model files from Hugging Face at first run. Verify that the REPO_ID (g-group-ai-lab/gipformer-65M-rnnt) is trusted.
  2. You must install Python packages (sherpa-onnx, onnxruntime, silero-vad, fastapi, etc.) and system dependencies like ffmpeg and possibly libsndfile. These can be large and may require system package installs.
  3. The server executes ffmpeg via subprocess and writes temporary files while decoding uploaded audio; run in a sandbox/virtualenv or container if you want isolation.
  4. No secrets are requested by the skill, but huggingface_hub may use your HUGGINGFACE_HUB_TOKEN automatically if present (only needed for private models).
  5. If you plan to expose the server beyond localhost, review network/security settings (authentication is not implemented).
If uncertain, run the code in a controlled environment and inspect the repository on Hugging Face before use.
Capability Analysis
Type: OpenClaw Skill
Name: gipformer
Version: 1.0.0

The gipformer skill provides Vietnamese speech-to-text functionality using the Gipformer ASR model. The bundle includes a FastAPI server (serve.py) that handles model inference via sherpa-onnx, an audio chunking utility (chunk_audio.py) using Silero VAD, and a client script (transcribe.py) for interacting with the API. The code follows standard practices for machine learning services, such as downloading models from Hugging Face and using subprocess safely for audio conversion with ffmpeg. No indicators of malicious intent, data exfiltration, or harmful prompt injection were found.
Capability Assessment
Purpose & Capability
Name/description (Vietnamese ASR) align with the included code and requirements: scripts implement VAD chunking, ONNX-based inference (sherpa-onnx), a FastAPI server, and a client. Required packages in requirements.txt are consistent with the functionality.
Instruction Scope
SKILL.md instructs installing dependencies, running a local server, and sending base64 audio to /transcribe. The runtime instructions and code operate on provided audio files and do not read unrelated system files or env vars. The server decodes audio, chunks it, runs inference, and returns transcripts as described.
Install Mechanism
There is no automated install spec in the registry; SKILL.md expects the user to pip install -r requirements.txt. Model files are downloaded at first run from Hugging Face (hf_hub_download). Network downloads and heavy native/system deps (ffmpeg, libsndfile) are required — expected for this use-case but worth noting before install.
Credentials
The skill does not request environment variables, credentials, or configuration paths. It uses huggingface_hub to download public model files; if a private repo were used the huggingface token (HUGGINGFACE_HUB_TOKEN) would be used by the library but is not required by this package.
Persistence & Privilege
Skill is not always-enabled and does not modify other skills or system-wide agent settings. It runs a local server when started; no privileged or persistent platform-level presence is requested by the skill metadata.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install gipformer
  3. After installation, invoke the skill by name or use /gipformer
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
Initial release of Vietnamese speech-to-text using Gipformer ASR.
- Supports speech recognition for Vietnamese audio using a 65M parameter Zipformer-RNNT model.
- Accepts audio in WAV, FLAC, OGG, MP3, and M4A formats; any duration.
- Handles VAD chunking, batching, and provides full transcript with chunk metadata.
- Server and CLI tools provided for both API and script-based transcription.
- Configurable for quantization, batch size, decoding method, and format support (ffmpeg required for M4A).
- Includes health check and comprehensive API documentation.
Metadata
Slug gipformer
Version 1.0.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is Gipformer ASR?

Vietnamese speech-to-text using Gipformer ASR (65M params, Zipformer-RNNT). Accepts audio of any length — the server handles VAD chunking, batching, and retu... It is an AI Agent Skill for Claude Code / OpenClaw, with 190 downloads so far.

How do I install Gipformer ASR?

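Run "/install gipformer" in the OpenClaw or Claude Code chat to install the skill. Before transcribing, you still need the Python dependencies, ffmpeg (for M4A), and a running local server, as described in the Quick Start.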

Is Gipformer ASR free?

Yes, Gipformer ASR is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Gipformer ASR support?

Gipformer ASR is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Gipformer ASR?

It is built and maintained by AI-GGroup (@ai-ggroup); the current version is v1.0.0.
