mozi1924

Local Stt Workflow

by Mozi Arasaka · GitHub ↗ · v1.0.2 · MIT-0
cross-platform ✓ Security Clean
138 Downloads · 0 Stars · 1 Active Installs · 3 Versions
Install in OpenClaw
/install local-stt-workflow
Description
Local speech-to-text workflow for an OpenAI-compatible STT server, typically on http://127.0.0.1:8000/v1. Use when configuring, testing, debugging, or valida...
README (SKILL.md)

Local STT Workflow

Use this skill to debug the full transcription path, not just the model.

Default assumption: the local STT server lives at http://127.0.0.1:8000/v1.

Model-path fallback to keep in mind: if the server did not pull a model by registry name, it may be loading one directly from a local path such as ./models/Qwen3-ASR-0.6B-bf16.

When exact route shape matters, the local OpenAPI document is available at:

  • http://localhost:8000/openapi.json

Use this OpenAPI doc as a schema/reference source to compare this local mlx-audio server against OpenAI’s API. Do not treat it as a health check.
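One way to use the OpenAPI document as a schema reference is to pull the multipart field list for the transcription route out of it. The sketch below is a minimal example under assumptions: it assumes the document declares the POST request body under multipart/form-data and may point at a component schema through a single local $ref. The function name transcription_form_fields is illustrative and is not part of this skill or of the server.

```python
def transcription_form_fields(openapi_doc, path="/v1/audio/transcriptions"):
    """Return (required, optional) multipart field names for POST `path`.

    Assumes the request body schema sits under "multipart/form-data" and
    is either inline or a single local "$ref" into the same document.
    """
    content = openapi_doc["paths"][path]["post"]["requestBody"]["content"]
    schema = content["multipart/form-data"]["schema"]

    # Follow one local reference such as "#/components/schemas/SomeBody".
    if "$ref" in schema:
        node = openapi_doc
        for key in schema["$ref"].split("/")[1:]:  # drop the leading "#"
            node = node[key]
        schema = node

    required = set(schema.get("required", []))
    optional = set(schema.get("properties", {})) - required
    return required, optional
```

To run it against the live server, load the document first, for example with json.load(urllib.request.urlopen("http://localhost:8000/openapi.json")), and pass the resulting dict in.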

Workflow

1. Verify the server before blaming OpenClaw

Check the basics first:

curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/v1/models

Confirm that the intended STT model exists, usually qwen3-asr.

If the model does not appear by pulled registry name, do not assume STT is broken — this server may be running a local-path model such as ./models/Qwen3-ASR-0.6B-bf16.

If the server is task-gated, ensure STT is enabled:

MLX_AUDIO_SERVER_TASKS=stt uv run python server.py

If the model is missing, register it before testing clients — but first check whether the server is intentionally loading from a local path and verify the exact accepted model IDs through /v1/models or http://localhost:8000/openapi.json.
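The model check above can be scripted once the /v1/models response body is in hand. This is a hedged sketch: it assumes the server returns an OpenAI-style list body of the form {"data": [{"id": ...}, ...]}, which this document does not confirm, and has_model is a hypothetical helper name. It returns the full ID list as well, so a local-path ID remains visible even when the expected name is absent.

```python
import json


def has_model(models_json, wanted="qwen3-asr"):
    """Return (found, ids) for a /v1/models response body.

    Assumption: the body looks like {"data": [{"id": "..."}, ...]}.
    """
    payload = json.loads(models_json)
    ids = [entry.get("id") for entry in payload.get("data", [])]
    return wanted in ids, ids
```

If found is False, inspect ids before concluding that STT is broken: an entry that looks like a filesystem path suggests the local-path case described above.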

2. Prove the raw STT endpoint works

Always isolate the server from the client stack.

Minimal direct transcription test:

curl -X POST http://127.0.0.1:8000/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F model=qwen3-asr \
  -F response_format=json

Useful richer test:

curl -X POST http://127.0.0.1:8000/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F model=qwen3-asr \
  -F response_format=verbose_json \
  -F 'timestamp_granularities[]=segment' \
  -F 'timestamp_granularities[]=word'

If direct curl works but OpenClaw does not, the bug is probably in the message ingestion or routing layer, not the STT backend.
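When reproducing the direct test from a script rather than curl, it helps to see what a multipart upload looks like on the wire. The following is a minimal sketch of a multipart/form-data encoder using only the standard library; build_multipart and its parameters are illustrative names, and the octet-stream content type for the file part is an assumption rather than something this server is known to require.

```python
import uuid


def build_multipart(fields, file_field, filename, file_bytes, boundary=None):
    """Encode text fields plus one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"'.encode(),
            b"",
            str(value).encode(),
        ]
    parts += [
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"'.encode(),
        b"Content-Type: application/octet-stream",  # assumed generic type
        b"",
        file_bytes,
        f"--{boundary}--".encode(),  # closing delimiter
        b"",
    ]
    return b"\r\n".join(parts), f"multipart/form-data; boundary={boundary}"
```

The returned body and header value can be passed to urllib.request.Request(url, data=body, headers={"Content-Type": content_type}) to send the same shape that curl -F produces.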

3. Distinguish server failure from routing failure

Apply this rule strictly:

  • Direct curl fails → fix the local STT server first
  • Direct curl works, but OpenClaw shows no transcript → inspect OpenClaw audio pipeline / attachment routing
  • OpenClaw sends requests, but fields are wrong → inspect request shape compatibility

This distinction saves a shitload of time.
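The three-way rule can be written down as a small decision function. This is only a sketch of the rule as stated above; the function name and boolean parameters are hypothetical, and the caller is expected to fill them in from manual observations.

```python
def classify_failure(direct_curl_ok, openclaw_sends_requests,
                     fields_ok, transcript_shown):
    """Map observations onto the next thing to inspect."""
    if not direct_curl_ok:
        return "fix the local STT server first"
    if openclaw_sends_requests and not fields_ok:
        return "inspect request shape compatibility"
    if not transcript_shown:
        return "inspect OpenClaw audio pipeline / attachment routing"
    return "no failure isolated by these checks"
```

The server check comes first on purpose: none of the later observations are meaningful until the direct curl succeeds.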

4. Check the request shape

This server is designed around OpenAI-style multipart form upload.

Expected core fields for /v1/audio/transcriptions from the current local OpenAPI schema:

  • required: file, model
  • optional: language, verbose, max_tokens, chunk_duration, frame_threshold, stream, context, prefill_step_size, text

This means the local server is not exposing the same form shape as OpenAI Whisper-style docs. Do not blindly assume response_format, prompt, or timestamp_granularities[] exist just because OpenAI supports them.

If a client is suspected of sending the wrong shape, inspect traffic with a temporary dump proxy or server logs.
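Once the field names a client sends are known, they can be compared against the required and optional lists above. The sketch below hard-codes those lists as given in this section; check_fields is a hypothetical helper, and how the server actually reacts to unknown fields is not stated here, so the result is a hint rather than a verdict.

```python
# Field lists copied from the local OpenAPI schema as quoted in this section.
REQUIRED = {"file", "model"}
OPTIONAL = {
    "language", "verbose", "max_tokens", "chunk_duration", "frame_threshold",
    "stream", "context", "prefill_step_size", "text",
}


def check_fields(sent):
    """Return (missing_required, not_in_schema) as sorted lists."""
    sent = set(sent)
    missing = sorted(REQUIRED - sent)
    unknown = sorted(sent - REQUIRED - OPTIONAL)
    return missing, unknown
```

For example, a client that sends file, model, and response_format has no missing required fields but one field, response_format, that the local schema does not list.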

5. Use the reference doc when exact fields matter

Read references/stt-api.md when you need exact behavior for:

  • response_format=json|text|verbose_json|srt|vtt
  • stream=true SSE events
  • timestamp_granularities[]
  • include[]
  • translation endpoint semantics
  • error envelope shape
  • current compatibility limits

Do not guess field support from generic OpenAI docs when this local server may intentionally differ.

Current notable mismatch: the local schema exposes context and text, plus chunking/prefill controls like chunk_duration, frame_threshold, and prefill_step_size, which are not the usual OpenAI STT field set.

6. OpenClaw-specific debugging pattern

When OpenClaw STT appears broken:

  1. Confirm tools.media.audio is configured, not messages.stt
  2. Confirm base URL points at http://127.0.0.1:8000/v1
  3. Confirm the chosen model exists in /v1/models
  4. Send the exact inbound audio file directly to /v1/audio/transcriptions
  5. Inspect gateway logs for any sign of transcription dispatch
  6. If there is no /audio/transcriptions request at all, the problem is upstream of STT

If OpenClaw never hits the server, stop tweaking model params. That would be cargo-cult debugging.
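A quick way to tell whether any transcription request reached the server is to scan captured log lines for the route. This is a rough sketch: the log format of the gateway and of the STT server is not documented here, so it only assumes that the request path appears somewhere in the line, and transcription_requests is a hypothetical name.

```python
def transcription_requests(log_lines, route="/audio/transcriptions"):
    """Return the log lines that mention the transcription route."""
    return [line.rstrip("\n") for line in log_lines if route in line]
```

An empty result after sending a test voice message suggests the problem is upstream of STT, in which case model parameters are not worth touching.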

7. Preferred test ladder

Use this order:

  1. GET /health
  2. GET /v1/models
  3. direct curl transcription with the same audio file
  4. compare request fields against http://localhost:8000/openapi.json
  5. OpenAI client compatibility test
  6. OpenClaw integration test
  7. dump-proxy / log inspection only if still ambiguous
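The ladder is an ordered list of checks that stops at the first failure. A minimal sketch of that control flow follows; run_ladder and the (name, check) pair shape are illustrative, and each check is whatever callable you supply, such as a function that issues the corresponding request and returns True on success.

```python
def run_ladder(steps):
    """Run (name, check) pairs in order; return the first failing name.

    A check fails if it returns a falsy value or raises. Returns None
    when every step passes.
    """
    for name, check in steps:
        try:
            ok = bool(check())
        except Exception:
            ok = False
        if not ok:
            return name
    return None
```

Stopping at the first failing rung keeps attention on the lowest broken layer instead of on symptoms further up the stack.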

8. Common conclusions

Niche input container bug

Typical signs:

  • direct upload of a less-common container like .m4a returns 500
  • server logs mention unsupported format handling during temp write or normalization
  • converting the same source audio to mp3 or wav makes transcription succeed immediately

Conclusion: treat this as an input-container compatibility bug, not an ASR-quality failure. For now, transcode niche formats to mp3 or wav before testing recognition quality.
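The transcoding workaround can be scripted. This sketch assumes ffmpeg is installed and on PATH, which this document does not mention; transcode_command is a hypothetical helper that only builds the argument list, leaving execution to the caller.

```python
from pathlib import Path


def transcode_command(src, target_ext=".wav"):
    """Build an ffmpeg argument list that converts `src` to `target_ext`.

    The output is written next to the input with the new extension.
    Assumes the ffmpeg CLI is available.
    """
    src = Path(src)
    dst = src.with_suffix(target_ext)
    return ["ffmpeg", "-i", str(src), str(dst)]
```

Run the result with subprocess.run(cmd, check=True), then repeat the direct curl test with the converted file. If that succeeds where the original container returned 500, the container is the likely culprit.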

Server good, client bad

Typical signs:

  • manual curl returns { "text": ... }
  • OpenClaw logs show no transcription request
  • changing model/language does nothing

Conclusion: fix routing, not inference.

Multipart mismatch

Typical signs:

  • server is up
  • model exists
  • client gets 400 errors
  • direct curl works but app client does not

Conclusion: compare multipart field names and values.

Feature mismatch

Typical signs:

  • client expects diarization, logprobs, or richer streaming fields
  • local server only implements a smaller compatible subset

Conclusion: align expectations with references/stt-api.md.

Resources

references/

  • references/stt-api.md — exact local API behavior, schema, response formats, SSE events, limits, and compatibility notes
Usage Guidance
This skill is a local troubleshooting guide — it tells you to curl localhost, read the server's openapi.json, and inspect logs. Those are reasonable when debugging a local STT server. Before running commands, ensure you understand any curl/file commands you paste into a shell and avoid sending private audio to unknown remote endpoints. If your environment restricts access to system logs or localhost ports, run these steps on a trusted machine where the STT server is intentionally hosted.
Capability Analysis
Type: OpenClaw Skill
Name: local-stt-workflow
Version: 1.0.2
The skill bundle provides a structured workflow and documentation for debugging a local Speech-to-Text (STT) server (typically at 127.0.0.1:8000). It uses standard diagnostic tools like curl to verify server health and API compatibility (SKILL.md, references/stt-api.md). No indicators of data exfiltration, malicious execution, or harmful prompt injection were found; the instructions are focused entirely on troubleshooting local audio processing pipelines.
Capability Assessment
Purpose & Capability
The name/description (local STT debug workflow) matches the content: step-by-step curl tests, OpenAPI checks, and OpenClaw integration guidance. There are no unrelated requirements (no cloud creds, no extraneous binaries).
Instruction Scope
SKILL.md contains only diagnostic steps: curl against localhost endpoints, read local openapi.json, check server logs, and use a dump proxy for traffic inspection. These actions are appropriate for local STT debugging; nothing instructs reading unrelated system secrets or exfiltrating data to remote endpoints.
Install Mechanism
No install spec or code is included. This is instruction-only, so nothing is written to disk or pulled from external URLs.
Credentials
No required environment variables, credentials, or config paths are declared. The document mentions MLX_AUDIO_SERVER_TASKS as an example runtime flag (contextual, not required). No disproportionate secret access is requested.
Persistence & Privilege
The skill is not always-enabled and is user-invocable. It does not request permanent presence or modification of other skills or system-wide settings.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install local-stt-workflow
  3. After installation, invoke the skill by name or use /local-stt-workflow
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.2
local-stt-workflow 1.0.2
  • Added guidance for handling transcription failures when using less-common audio containers like `.m4a`
  • Clarified that container incompatibility should be treated as an input-compatibility issue, not an ASR-quality issue
  • Updated "Common conclusions" section with troubleshooting advice for input-container bugs
v1.0.1
  • Clarified that the local STT server may load models from a local path if registry-pulled models are missing.
  • Added information about referencing the local OpenAPI schema at `http://localhost:8000/openapi.json` to verify exact request/response shape.
  • Updated documentation to highlight differences between this server’s API and OpenAI Whisper’s API, especially regarding supported request fields.
  • Emphasized not to assume OpenAI field compatibility; notable mention of unique local fields (`context`, `text`, chunking/prefill controls).
  • Adjusted debugging/test workflow steps to prioritize verifying actual supported schema and added OpenAPI comparison as a diagnostic step.
v1.0.0
Initial release of local-stt-workflow.
  • Introduces a detailed workflow for debugging and validating speech-to-text servers compatible with OpenAI endpoints.
  • Guides users through verifying server setup, isolating server vs. client routing failures, and checking request payload compatibility.
  • Provides troubleshooting advice specific to OpenClaw audio pipelines and multipart/form-data issues.
  • Emphasizes use of direct `curl` commands for diagnosis and references documentation for exact API behaviors.
  • Outlines common error patterns and troubleshooting steps to streamline local audio transcription debugging.
Metadata
Slug local-stt-workflow
Version 1.0.2
License MIT-0
All-time Installs 1
Active Installs 1
Total Versions 3
Frequently Asked Questions

What is Local Stt Workflow?

Local speech-to-text workflow for an OpenAI-compatible STT server, typically on http://127.0.0.1:8000/v1. Use when configuring, testing, debugging, or valida... It is an AI Agent Skill for Claude Code / OpenClaw, with 138 downloads so far.

How do I install Local Stt Workflow?

Run "/install local-stt-workflow" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Local Stt Workflow free?

Yes, Local Stt Workflow is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Local Stt Workflow support?

Local Stt Workflow is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Local Stt Workflow?

It is built and maintained by Mozi Arasaka (@mozi1924); the current version is v1.0.2.
