Description

Local speech-to-text workflow for an OpenAI-compatible STT server, typically on http://127.0.0.1:8000/v1. Use when configuring, testing, debugging, or valida...

Usage (SKILL.md)

Local STT Workflow

Name: Local Stt Workflow
Author: mozi1924

Use this skill to debug the full transcription path, not just the model.

Default assumption: the local STT server lives at http://127.0.0.1:8000/v1.

Current local model-path fallback worth remembering: if the server did not pull a model by name, it may be loading directly from a local path such as ./models/Qwen3-ASR-0.6B-bf16.

When exact route shape matters, the local OpenAPI document is available at:

http://localhost:8000/openapi.json

Use this OpenAPI doc as a schema/reference source to compare this local mlx-audio server against OpenAI’s API. Do not treat it as a health check.

Workflow

1. Verify the server before blaming OpenClaw

Check the basics first:

curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/v1/models

Confirm that the intended STT model exists, usually qwen3-asr.

If the model does not appear by pulled registry name, do not assume STT is broken — this server may be running a local-path model such as ./models/Qwen3-ASR-0.6B-bf16.

If the server is task-gated, ensure STT is enabled:

MLX_AUDIO_SERVER_TASKS=stt uv run python server.py

If the model is missing, register it before testing clients — but first check whether the server is intentionally loading from a local path and verify the exact accepted model IDs through /v1/models or http://localhost:8000/openapi.json.
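The model check above can be scripted. The following is a minimal sketch, assuming /v1/models returns an OpenAI-style body of the form {"data": [{"id": ...}]}; that shape is an assumption, so confirm it against http://localhost:8000/openapi.json. The helper names (model_ids, has_model, fetch_json) are hypothetical and not part of the server.

```python
import json
import urllib.request

BASE = "http://127.0.0.1:8000"  # default local server address from this guide


def model_ids(payload: dict) -> list[str]:
    """Extract model IDs, assuming an OpenAI-style {"data": [{"id": ...}]} body."""
    return [m["id"] for m in payload.get("data", []) if "id" in m]


def has_model(payload: dict, wanted: str) -> bool:
    """True if `wanted` equals an ID or appears inside one (e.g. a local-path ID)."""
    w = wanted.lower()
    return any(w in mid.lower() for mid in model_ids(payload))


def fetch_json(path: str, timeout: float = 5.0) -> dict:
    """GET a JSON document from the local server; raises on connection or HTTP errors."""
    with urllib.request.urlopen(BASE + path, timeout=timeout) as resp:
        return json.load(resp)
```

With the server running, has_model(fetch_json("/v1/models"), "qwen3-asr") reports whether the intended model is visible. The case-insensitive substring match is a deliberate choice so that a local-path ID such as ./models/Qwen3-ASR-0.6B-bf16 still counts as a match.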

2. Prove the raw STT endpoint works

Always isolate the server from the client stack.

Minimal direct transcription test:

curl -X POST http://127.0.0.1:8000/v1/audio/transcriptions \
  -F file=@/path/to/audio.wav \
  -F model=qwen3-asr \
  -F response_format=json

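Richer test, useful only if the local schema accepts response_format and timestamp_granularities[] (step 4 lists the fields the server currently exposes):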

curl -X POST http://127.0.0.1:8000/v1/audio/transcriptions \
  -F file=@/path/to/audio.wav \
  -F model=qwen3-asr \
  -F response_format=verbose_json \
  -F 'timestamp_granularities[]=segment' \
  -F 'timestamp_granularities[]=word'

If direct curl works but OpenClaw does not, the bug is probably in the message ingestion or routing layer, not the STT backend.
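When curl is not available, the same direct upload can be sketched in Python with only the standard library. This is an illustration rather than the server's own client: build_multipart and transcribe are hypothetical helpers, and only the two required fields (file, model) are sent.

```python
import urllib.request
import uuid


def build_multipart(fields: dict[str, str], filename: str, audio: bytes,
                    boundary: str = "") -> tuple[bytes, str]:
    """Return (body, content_type) for a multipart/form-data upload."""
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        # Plain text form fields such as model=qwen3-asr
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    # The audio goes in the part named "file", matching the required field
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\nContent-Type: application/octet-stream'
        f"\r\n\r\n".encode() + audio + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(path: str, model: str = "qwen3-asr",
               url: str = "http://127.0.0.1:8000/v1/audio/transcriptions") -> bytes:
    """POST the file and return the raw response body; raises on HTTP errors."""
    with open(path, "rb") as fh:
        body, ctype = build_multipart({"model": model}, path.rsplit("/", 1)[-1], fh.read())
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": ctype}, method="POST")
    with urllib.request.urlopen(req, timeout=120) as resp:
        return resp.read()
```

Keeping build_multipart separate from the network call makes it possible to print the exact bytes a client would send, which helps when comparing field names against a misbehaving client.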

3. Distinguish server failure from routing failure

Apply this rule strictly:

Direct curl fails → fix the local STT server first
Direct curl works, but OpenClaw shows no transcript → inspect OpenClaw audio pipeline / attachment routing
OpenClaw sends requests, but fields are wrong → inspect request shape compatibility

This distinction saves a lot of time.
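The three-way rule can be written down as a small decision function. This is one reading of the rule, not an official tool; the function name triage and its parameters are hypothetical.

```python
def triage(curl_ok: bool, openclaw_has_transcript: bool,
           openclaw_fields_wrong: bool = False) -> str:
    """Map the observations from step 3 to the layer that should be inspected."""
    if not curl_ok:
        # The server itself fails, so nothing downstream can be trusted yet
        return "fix the local STT server first"
    if openclaw_fields_wrong:
        # Requests arrive at the server but do not match its schema
        return "inspect request shape compatibility"
    if not openclaw_has_transcript:
        # Server is fine, yet no transcript comes back through OpenClaw
        return "inspect OpenClaw audio pipeline / attachment routing"
    return "no fault isolated by this rule"
```

The order of the checks matters: a failing direct curl always wins, because client-side observations mean little until the server is known to work.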

4. Check the request shape

This server is designed around OpenAI-style multipart form upload.

Expected core fields for /v1/audio/transcriptions from the current local OpenAPI schema:

required: file, model
optional: language, verbose, max_tokens, chunk_duration, frame_threshold, stream, context, prefill_step_size, text

This means the local server does not expose the same form fields as OpenAI's Whisper-style API. Do not assume response_format, prompt, or timestamp_granularities[] exist just because OpenAI supports them.

If a client is suspected of sending the wrong shape, inspect traffic with a temporary dump proxy or server logs.
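Once the field names a client sends are known, they can be compared against the lists above. A minimal sketch follows, with the required and optional sets copied from this section; they may drift from the live schema, so treat http://localhost:8000/openapi.json as the authority. check_fields is a hypothetical helper.

```python
# Field names copied from the schema summary in this section
REQUIRED = {"file", "model"}
OPTIONAL = {
    "language", "verbose", "max_tokens", "chunk_duration", "frame_threshold",
    "stream", "context", "prefill_step_size", "text",
}


def check_fields(sent) -> dict[str, list[str]]:
    """Report required fields that are absent and fields the schema does not list."""
    sent = set(sent)
    return {
        "missing": sorted(REQUIRED - sent),
        "unknown": sorted(sent - REQUIRED - OPTIONAL),
    }
```

For example, check_fields(["file", "model", "response_format"]) flags response_format as unknown, which is the kind of OpenAI-only field this step warns about.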

5. Use the reference doc when exact fields matter

Read references/stt-api.md when you need exact behavior for:

response_format=json|text|verbose_json|srt|vtt
stream=true SSE events
timestamp_granularities[]
include[]
translation endpoint semantics
error envelope shape
current compatibility limits

Do not guess field support from generic OpenAI docs when this local server may intentionally differ.

Current notable mismatch: the local schema exposes context and text, plus chunking/prefill controls like chunk_duration, frame_threshold, and prefill_step_size, which are not the usual OpenAI STT field set.

6. OpenClaw-specific debugging pattern

When OpenClaw STT appears broken:

Confirm tools.media.audio is configured, not messages.stt
Confirm base URL points at http://127.0.0.1:8000/v1
Confirm the chosen model exists in /v1/models
Send the exact inbound audio file directly to /v1/audio/transcriptions
Inspect gateway logs for any sign of transcription dispatch
If there is no /audio/transcriptions request at all, the problem is upstream of STT

If OpenClaw never hits the server, stop tweaking model params. That would be cargo-cult debugging.
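Whether any transcription request reached the server can be checked by scanning a log for the route. The sketch below assumes only that the log is plain text and that request lines contain the path; the log location and line format depend on your setup and are not specified here. transcription_requests is a hypothetical helper.

```python
def transcription_requests(log_lines, needle: str = "/audio/transcriptions") -> list[str]:
    """Return the log lines that mention the transcription route."""
    return [line.rstrip("\n") for line in log_lines if needle in line]
```

Pass it an open file object, for example the server or gateway log. An empty result means the problem is upstream of STT, so model parameters are not worth touching yet.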

7. Preferred test ladder

Use this order:

GET /health
GET /v1/models
direct curl transcription with the same audio file
compare request fields against http://localhost:8000/openapi.json
OpenAI client compatibility test
OpenClaw integration test
dump-proxy / log inspection only if still ambiguous
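The ladder is an ordered list of checks where a failure makes later rungs meaningless. A small runner can enforce that order; run_ladder is a hypothetical helper, and each check is any zero-argument callable you supply that returns a truthy value on success.

```python
from typing import Callable


def run_ladder(steps: list[tuple[str, Callable[[], bool]]]) -> list[tuple[str, str]]:
    """Run checks in order; after the first failure, mark the rest as skipped."""
    results = []
    failed = False
    for name, check in steps:
        if failed:
            results.append((name, "skipped"))
            continue
        try:
            ok = bool(check())
        except Exception:
            # A check that raises (e.g. connection refused) counts as a failure
            ok = False
        results.append((name, "pass" if ok else "fail"))
        failed = not ok
    return results
```

Stopping at the first failing rung keeps attention on the lowest broken layer instead of on symptoms further up the stack.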

8. Common conclusions

Niche input container bug

Typical signs:

direct upload of a less-common container like .m4a returns 500
server logs mention unsupported format handling during temp write or normalization
converting the same source audio to mp3 or wav makes transcription succeed immediately

Conclusion: treat this as an input-container compatibility bug, not an ASR-quality failure. For now, transcode niche formats to mp3 or wav before testing recognition quality.
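The transcoding workaround can be scripted. The sketch below assumes ffmpeg is installed and on PATH; ffmpeg is not mentioned by this skill and is only one possible converter. transcode_cmd and transcode are hypothetical helpers.

```python
import subprocess
from pathlib import Path


def transcode_cmd(src: str, ext: str = "wav") -> list[str]:
    """Build an ffmpeg command that writes src next to itself with a new extension."""
    dst = str(Path(src).with_suffix("." + ext))
    # -y overwrites an existing output file; -i names the input file
    return ["ffmpeg", "-y", "-i", src, dst]


def transcode(src: str, ext: str = "wav") -> str:
    """Run the conversion and return the output path; raises if ffmpeg fails."""
    cmd = transcode_cmd(src, ext)
    subprocess.run(cmd, check=True)
    return cmd[-1]
```

If the converted file transcribes cleanly while the original returns 500, that supports the container-compatibility conclusion rather than a recognition-quality problem.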

Server good, client bad

Typical signs:

manual curl returns { "text": ... }
OpenClaw logs show no transcription request
changing model/language does nothing

Conclusion: fix routing, not inference.

Multipart mismatch

Typical signs:

server is up
model exists
client gets 400 errors
direct curl works but app client does not

Conclusion: compare multipart field names and values.

Feature mismatch

Typical signs:

client expects diarization, logprobs, or richer streaming fields
local server only implements a smaller compatible subset

Conclusion: align expectations with references/stt-api.md.

Resources

references/

references/stt-api.md — exact local API behavior, schema, response formats, SSE events, limits, and compatibility notes

Safe usage advice

This skill is a local troubleshooting guide — it tells you to curl localhost, read the server's openapi.json, and inspect logs. Those are reasonable when debugging a local STT server. Before running commands, ensure you understand any curl/file commands you paste into a shell and avoid sending private audio to unknown remote endpoints. If your environment restricts access to system logs or localhost ports, run these steps on a trusted machine where the STT server is intentionally hosted.

Functional analysis

Type: OpenClaw Skill
Name: local-stt-workflow
Version: 1.0.2

The skill bundle provides a structured workflow and documentation for debugging a local Speech-to-Text (STT) server (typically at 127.0.0.1:8000). It uses standard diagnostic tools like curl to verify server health and API compatibility (SKILL.md, references/stt-api.md). No indicators of data exfiltration, malicious execution, or harmful prompt injection were found; the instructions are focused entirely on troubleshooting local audio processing pipelines.

Capability assessment

✓ Purpose & Capability

The name/description (local STT debug workflow) matches the content: step-by-step curl tests, OpenAPI checks, and OpenClaw integration guidance. There are no unrelated requirements (no cloud creds, no extraneous binaries).

✓ Instruction Scope

SKILL.md contains only diagnostic steps: curl against localhost endpoints, read local openapi.json, check server logs, and use a dump proxy for traffic inspection. These actions are appropriate for local STT debugging; nothing instructs reading unrelated system secrets or exfiltrating data to remote endpoints.

✓ Install Mechanism

No install spec or code is included. This is instruction-only, so nothing is written to disk or pulled from external URLs.

✓ Credentials

No required environment variables, credentials, or config paths are declared. The document mentions MLX_AUDIO_SERVER_TASKS as an example runtime flag (contextual, not required). No disproportionate secret access is requested.

✓ Persistence & Privilege

The skill is not always-enabled and is user-invocable. It does not request permanent presence or modification of other skills or system-wide settings.

Version history

v1.0.2

local-stt-workflow 1.0.2
- Added guidance for handling transcription failures when using less-common audio containers like `.m4a`
- Clarified that container incompatibility should be treated as an input compatibility, not ASR quality, issue
- Updated "Common conclusions" section with troubleshooting advice for input-container bugs

v1.0.1

- Clarified that the local STT server may load models from a local path if registry-pulled models are missing.
- Added information about referencing the local OpenAPI schema at `http://localhost:8000/openapi.json` to verify exact request/response shape.
- Updated documentation to highlight differences between this server’s API and OpenAI Whisper’s API, especially regarding supported request fields.
- Emphasized not to assume OpenAI field compatibility; notable mention of unique local fields (`context`, `text`, chunking/prefill controls).
- Adjusted debugging/test workflow steps to prioritize verifying actual supported schema and added OpenAPI comparison as a diagnostic step.

v1.0.0

Initial release of local-stt-workflow.
- Introduces a detailed workflow for debugging and validating speech-to-text servers compatible with OpenAI endpoints.
- Guides users through verifying server setup, isolating server vs. client routing failures, and checking request payload compatibility.
- Provides troubleshooting advice specific to OpenClaw audio pipelines and multipart/form-data issues.
- Emphasizes use of direct `curl` commands for diagnosis and references documentation for exact API behaviors.
- Outlines common error patterns and troubleshooting steps to streamline local audio transcription debugging.

Metadata

Slug local-stt-workflow

Version 1.0.2

License MIT-0

Total installs 1

Current installs 1

Number of versions 3

FAQ