← Back to Skills Marketplace
psyb0t

talkies

by Ciprian Mandache · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ✓ Security Clean
45
Downloads
0
Stars
0
Active Installs
1
Versions
Install in OpenClaw
/install talkies
Description
Self-hosted OpenAI-compatible /v1/audio/transcriptions fronting seven open ASR models (Whisper, Parakeet, Canary). Same wire format as OpenAI — change the ba...
README (SKILL.md)

talkies

Self-hosted speech-to-text. OpenAI-compatible /v1/audio/transcriptions wire shape — point an OpenAI client at it, change the model slug, done.

Seven backends behind the same endpoint: whisper-large-v3, whisper-large-v3-turbo, distil-whisper-large-v3, parakeet-tdt-0.6b-v3, canary-180m-flash, canary-1b-flash, canary-qwen-2.5b. Stereo diarization, URL file_path fetching, server-side file staging, MCP endpoint with 6 tools, optional bearer-token auth.

For installation, configuration, and container setup, see references/setup.md.

When To Use

  • Transcribe audio files (any format ffmpeg decodes — WAV, MP3, M4A, FLAC, OGG, WebM, Opus, MP4 audio).
  • Generate SRT/VTT subtitles for video.
  • Transcribe podcasts, lectures, interviews, voicemails, calls.
  • Stereo two-mic recordings → per-speaker diarized output (L: / R: channel tagging).
  • German/French/Spanish ↔ English speech-to-text translation via Canary-1B-Flash.
  • Drop-in replacement for api.openai.com/v1/audio/transcriptions in existing client code.

When NOT To Use

  • Real-time / streaming transcription (this is request/response only — buffer and POST).
  • TTS (text-to-speech) — talkies is ASR-only. Use speaches if you need TTS.
  • Speaker identification from voice (only stereo-channel diarization is supported, not voice clustering).
  • Per-request prompt or temperature injection (fields accepted for compat, ignored).
  • arm64 hosts — linux/amd64 only.

Setup

The container should already be running. Set the base URL:

export TALKIES_URL=http://localhost:8000

If the server has TALKIES_AUTH_TOKEN set, export it too:

export TALKIES_AUTH_TOKEN=\x3Cyour-token>
# every request below needs: -H "Authorization: Bearer $TALKIES_AUTH_TOKEN"

Verify: curl $TALKIES_URL/healthz returns {"ok": true, "device": "...", "models": [...]}.

For install / configuration / env vars / CPU vs CUDA images / custom model registry, see references/setup.md.

Quick Start

# Discover what's available.
curl -s $TALKIES_URL/v1/models | jq

# Simplest transcribe — file upload, JSON response.
curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "[email protected]" \
  -F "model=whisper-large-v3-turbo" | jq

# Same call, but the audio lives at a URL — talkies downloads + caches it.
curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file_path=https://example.com/podcasts/ep-042.mp3" \
  -F "model=whisper-large-v3-turbo" | jq

# Full Whisper-shape JSON with per-segment + per-word timestamps.
curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "[email protected]" \
  -F "model=whisper-large-v3-turbo" \
  -F "response_format=verbose_json" | jq

# SRT subtitles.
curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "[email protected]" \
  -F "model=whisper-large-v3" \
  -F "response_format=srt" > lecture.srt

Supported Models

Slug Family CPU CUDA Languages Strength
whisper-large-v3 faster-whisper yes yes 99 auto-detect best accuracy, slowest
whisper-large-v3-turbo faster-whisper yes yes 99 auto-detect sweet spot — fast, accurate
distil-whisper-large-v3 faster-whisper yes yes English only fastest Whisper variant
parakeet-tdt-0.6b-v3 NeMo TDT no yes English only very fast on GPU
canary-180m-flash NeMo Canary yes yes English only (small) smallest, runs anywhere
canary-1b-flash NeMo Canary no yes en/de/fr/es + translation multilingual, translation
canary-qwen-2.5b NeMo SALM no yes English only best English accuracy (no timestamps)

Pick by use case:

  • General-purpose: whisper-large-v3-turbo.
  • English-only, max speed on CPU: distil-whisper-large-v3.
  • English-only, max accuracy on GPU: canary-qwen-2.5b (but no per-segment timestamps).
  • Translation EN↔DE/FR/ES: canary-1b-flash (requires custom model registry — see Translation).

canary-qwen-2.5b produces no segment/word timestamps — verbose_json.segments and .words come back empty, srt/vtt collapse to a single full-duration cue. Transcription itself is whole-file. Use a Whisper or Canary multitask slug if you need timing.

API — POST /v1/audio/transcriptions

Multipart form. Same field names as OpenAI's transcription endpoint where they overlap.

Request Fields

Field Required Default Notes
file one of file/file_path Audio file. Capped at TALKIES_MAX_UPLOAD_BYTES (default 100 MB).
file_path one of file/file_path Either a path under the staging area (/v1/files) or an http(s):// URL (downloaded + cached server-side). Not subject to the 100 MB upload cap; URL downloads capped by TALKIES_MAX_DOWNLOAD_BYTES (default 1 GiB).
model yes One of the configured slugs (see GET /v1/models). Unknown → 404.
language no model default ISO-639-1 code. Whisper auto-detects when omitted; Canary uses its default_source_lang.
response_format no json json / text / verbose_json / srt / vtt.
timestamp_granularities[] no Accepted for OpenAI compat; ignored — verbose_json always emits both segment + word.
prompt no Accepted, ignored.
temperature no Accepted, ignored.
diarization no false Stereo-channel diarization. Requires 2-channel input — mono returns 400.

Exactly one of file or file_path must be set — passing both or neither returns 400.

Response Formats

response_format Content-Type Shape
json (default) application/json {"text": "..."} — just the transcript.
text text/plain The transcript as plain text.
verbose_json application/json Full Whisper shape — task, language, duration, text, segments[], words[].
srt application/x-subrip SubRip subtitle file, one cue per VAD-segmented chunk.
vtt text/vtt WebVTT subtitle file, one cue per VAD-segmented chunk.

json shape:

{ "text": " full transcript as a single string" }

verbose_json shape — segments and words are always present (empty arrays for backends with no alignment output):

{
  "task": "transcribe",
  "language": "en",
  "duration": 6.42,
  "text": " full transcript",
  "segments": [{ "id": 0, "start": 0.0, "end": 2.31, "text": " ...", "tokens": [], "temperature": 0.0, "avg_logprob": null, "compression_ratio": null, "no_speech_prob": null }],
  "words": [{ "word": " the", "start": 0.0, "end": 0.12 }]
}

Whisper-only confidence fields (avg_logprob, compression_ratio, no_speech_prob) are emitted as null regardless of backend so clients reading them don't crash. tokens is always [].

Stereo Diarization

Pass diarization=true and upload a 2-channel file. Left channel = speaker L, right channel = speaker R. Each channel is transcribed independently, the two timelines are merged chronologically by segment start time.

curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "[email protected]" \
  -F "model=whisper-large-v3-turbo" \
  -F "diarization=true" \
  -F "response_format=verbose_json" | jq

What changes:

  • verbose_json — every segment/word gets "channel": "L" or "R". Segments re-numbered after merge.
  • text / response_format=text — rebuilt as alternating turn lines: L: ...\ R: ...\ .... Consecutive same-channel segments collapsed into one line per turn.
  • srt / vtt — each cue prefixed with L: / R:.

Caveats:

  • Exactly 2 channels required. Mono → 400. >2 channels → 400.
  • Latency ~2× the mono case (model runs sequentially on each channel).
  • The technique is exact for true two-mic setups (interview rigs, podcast splits). It does NOT magically separate speakers from a single-mic recording that's been rendered to stereo.

Translation

Canary multitask models can translate speech → text in a non-source language. canary-1b-flash covers en↔de, en↔fr, en↔es. The task is baked into the model slug, not passed per-request — you add a translation-specific slug via custom models.json (see Customizing the model registry):

{
  "models": {
    "canary-1b-flash-de2en": {
      "repo": "nvidia/canary-1b-flash",
      "executor": "canary_multitask",
      "default_source_lang": "de",
      "default_target_lang": "en",
      "default_task": "s2t_translation",
      "languages": ["de"]
    }
  }
}

Then call it normally — text carries the English translation:

curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "[email protected]" \
  -F "model=canary-1b-flash-de2en" | jq

canary-180m-flash is English-ASR-only — don't point a translation slug at it. canary-qwen-2.5b is English ASR only too.

Long Files + VAD Chunking

Audio longer than 30 s (TALKIES_VAD_CHUNK_THRESHOLD) gets sliced through Silero VAD into ≤28 s speech regions before being handed to the backend. Timestamps are re-assembled by offsetting each chunk's segment/word timings — you get one continuous segments list spanning the whole file.

No client-side change. Long files just work. Verify by checking duration in verbose_json.

Error Contract

Status Shape When
200 per response_format success
400 {"detail": "..."} bad audio, mono+diarization, >2 ch+diarization, both/neither of file/file_path, invalid file_path, URL download failure (DNS, HTTP error, size exceeded, SSRF blocked)
401 {"detail": "..."} only when TALKIES_AUTH_TOKEN is set: missing/wrong bearer. Includes WWW-Authenticate: Bearer.
404 {"detail": "..."} unknown model slug, file_path references missing file, DELETE /api/ps/{slug} on unloaded model, /v1/files/{path} GET/DELETE on missing
413 {"detail": "..."} upload exceeded TALKIES_MAX_UPLOAD_BYTES (multipart file and PUT /v1/files/{path} only — not file_path URL)
422 {"detail": [...]} Pydantic validation (missing fields, wrong types)
500 {"detail": "..."} unhandled backend failure

Resource-Management Endpoints (Ollama-Style)

talkies mirrors a subset of speaches / Ollama, so a LiteLLM proxy can drive both.

Endpoint Behavior
GET /healthz Unauthenticated liveness. Returns {ok, device, models}.
GET /v1/models OpenAI-style list of configured slugs.
GET /api/ps Currently-loaded models with per-model idle_seconds.
DELETE /api/ps/{model_id} Evict one model. Slug can be URL-encoded (/%2F). 404 if not loaded.
POST /unload Evict every loaded model. Returns the list actually unloaded.

Behind these: an idle sweeper runs every TALKIES_SWEEPER_INTERVAL s (default 60) and unloads anything not used in TALKIES_MODEL_TTL s (default 600). Set TALKIES_MODEL_TTL=0 to disable.

There's also sibling eviction at transcription time — every request evicts other loaded models so VRAM doesn't get split. One model resident at a time, per container. If you need two models simultaneously, run two containers.

# Which models are loaded right now.
curl -s $TALKIES_URL/api/ps | jq

# Free VRAM after a job — evict one model.
curl -s -X DELETE "$TALKIES_URL/api/ps/whisper-large-v3-turbo"

# Or evict everything.
curl -s -X POST $TALKIES_URL/unload | jq

Server-Side File Staging (/v1/files)

For repeated transcribes of the same file (different response_format, different model, iterating on params), stage the file once and reference it by path. Files land under ${TALKIES_DATA_DIR}/files/\x3Cpath>.

Endpoint Behavior
GET /v1/files List every staged file. Returns {"files": [{"path", "size", "modified"}]}.
PUT /v1/files/{path} Upload raw bytes (--data-binary @local-file). Capped at TALKIES_MAX_UPLOAD_BYTES. Atomic write (.part → rename).
GET /v1/files/{path} Streams file back. Content-Type guessed by extension. 404 if missing.
DELETE /v1/files/{path} Removes file and prunes empty parent dirs. 404 if missing.
# Stage once.
curl -X PUT --data-binary @lecture.mp3 \
  -H "Content-Type: audio/mpeg" \
  $TALKIES_URL/v1/files/lectures/2026-03-15/lecture.mp3

# Reuse across multiple transcribe calls.
curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file_path=lectures/2026-03-15/lecture.mp3" \
  -F "model=whisper-large-v3-turbo" \
  -F "response_format=verbose_json" | jq

# Cleanup.
curl -X DELETE $TALKIES_URL/v1/files/lectures/2026-03-15/lecture.mp3

Path safety: null bytes, backslashes, . / .. segments and double slashes are rejected (400). Symlinks pointing outside the root are refused. Leading / is stripped — /foo/bar.mp3 and foo/bar.mp3 resolve identically.

URL file_path (Download + Cache)

file_path also accepts http:// / https:// URLs. First request downloads to ${TALKIES_DATA_DIR}/files/downloads/\x3Csha256(url)[:16]>-\x3Cbasename>, subsequent requests with the same URL hit the cache.

# First call: downloads, transcribes off the cached copy.
curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file_path=https://example.com/podcasts/ep-042.mp3" \
  -F "model=whisper-large-v3-turbo" | jq

# Second call: same URL → cache hit, no re-download.
curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file_path=https://example.com/podcasts/ep-042.mp3" \
  -F "model=canary-1b-flash" \
  -F "response_format=srt" > ep-042.srt

Downloads appear in GET /v1/files listings under downloads/. Invalidate a single cached URL with DELETE /v1/files/downloads/\x3Ckey>.

Constraints applied during download:

  • Size capped by TALKIES_MAX_DOWNLOAD_BYTES (default 1 GiB).
  • 5 redirect hops max; SSRF guard re-applied at every hop.
  • 10 s connect, 300 s per-chunk read timeout.
  • SSRF off by default. Set TALKIES_BLOCK_PRIVATE_DOWNLOADS=true to reject URLs whose hostname resolves to private/loopback/link-local/multicast/reserved IPs.

MCP Endpoint (/v1/mcp)

talkies exposes a Model Context Protocol server over Streamable HTTP at /v1/mcp. Same FastAPI process, same BACKENDS / REGISTRY, same auth middleware — a model loaded by the MCP transcribe tool is the same instance the HTTP endpoint sees.

Tool What it does
list_models Discover ASR slugs. Returns [{slug, executor, default_source_lang, default_target_lang, default_task, loaded}].
transcribe Run ASR on a file_path (URL or staged path). Args: model, language?, response_format? (json/verbose_json/text/srt/vtt), diarization?. JSON formats return a JSON-encoded string; text/srt/vtt return raw.
list_files Same payload as GET /v1/files.
put_file Upload to staging. Body is base64 (content_base64). Decoded size capped at TALKIES_MAX_UPLOAD_BYTES. For big files, prefer PUT /v1/files/{path} over HTTP — JSON-RPC + base64 chews token budget.
get_file Read a staged file as base64. Same size cap. Same advice — for big bytes, hit GET /v1/files/{path} over HTTP.
delete_file Remove a staged file, prune empty parents.

The transport requires Accept: application/json, text/event-stream. Wire it into Claude Code:

claude mcp add --transport http talkies $TALKIES_URL/v1/mcp

With auth:

claude mcp add --transport http talkies $TALKIES_URL/v1/mcp \
  --header "Authorization: Bearer $TALKIES_AUTH_TOKEN"

Note: the canonical mount path is /v1/mcp/ (trailing slash). Bare /v1/mcp is rewritten internally to /v1/mcp/ so clients that don't follow Starlette's 307 redirect work too.

Raw JSON-RPC

For debugging or non-MCP-aware callers, hit it as JSON-RPC over HTTP POST:

# tools/list
curl -s $TALKIES_URL/v1/mcp/ \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d '{"jsonrpc": "2.0", "id": 1, "method": "tools/list"}'

# tools/call
curl -s $TALKIES_URL/v1/mcp/ \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d '{
    "jsonrpc": "2.0", "id": 2, "method": "tools/call",
    "params": {
      "name": "transcribe",
      "arguments": {
        "file_path": "https://example.com/clip.mp3",
        "model": "whisper-large-v3-turbo",
        "response_format": "json"
      }
    }
  }'

Bearer-Token Auth

If TALKIES_AUTH_TOKEN is set on the server, every route except /healthz and CORS preflight (OPTIONS) requires Authorization: Bearer \x3Ctoken>. Wrong/missing token returns 401 with WWW-Authenticate: Bearer. Compared with hmac.compare_digest (constant-time).

curl -H "Authorization: Bearer $TALKIES_AUTH_TOKEN" $TALKIES_URL/v1/models

Empty / unset token = wide open. For untrusted networks, combine the token with a reverse proxy doing TLS + rate limiting.

Typical Workflows

Quick one-off transcribe

curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "[email protected]" \
  -F "model=whisper-large-v3-turbo" | jq -r .text

Generate subtitles for a video

ffmpeg -i video.mp4 -vn -acodec libmp3lame audio.mp3
curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "[email protected]" \
  -F "model=whisper-large-v3" \
  -F "response_format=srt" > video.srt
# burn in:  ffmpeg -i video.mp4 -vf subtitles=video.srt -c:a copy video-subbed.mp4

Iterate on the same file with different settings

# Stage once.
curl -X PUT --data-binary @lecture.mp3 \
  -H "Content-Type: audio/mpeg" \
  $TALKIES_URL/v1/files/work/lecture.mp3

# Try different models / formats without re-uploading.
for fmt in json verbose_json srt; do
  curl -s $TALKIES_URL/v1/audio/transcriptions \
    -F "file_path=work/lecture.mp3" \
    -F "model=whisper-large-v3-turbo" \
    -F "response_format=$fmt" > "lecture.$fmt"
done

# Cleanup.
curl -X DELETE $TALKIES_URL/v1/files/work/lecture.mp3

Diarized interview transcript

curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "[email protected]" \
  -F "model=whisper-large-v3-turbo" \
  -F "diarization=true" \
  -F "response_format=text"
# stdout:
#   L: hi how's it going
#   R: not bad you
#   L: cool man

Free VRAM after a job

curl -s -X POST $TALKIES_URL/unload | jq

Bulk transcribe from URLs

for url in $(cat urls.txt); do
  curl -s $TALKIES_URL/v1/audio/transcriptions \
    -F "file_path=$url" \
    -F "model=whisper-large-v3-turbo" \
    -F "response_format=text"
  echo "---"
done

The first hit on each URL downloads + caches; re-running the loop is free.

For a fuller bulk-transcribe driver (mix of local paths + URLs, per-input output files, error reporting, optional diarization) see scripts/bulk_transcribe.sh:

TALKIES_URL=http://localhost:8000 \
TALKIES_MODEL=whisper-large-v3-turbo \
TALKIES_FORMAT=srt \
TALKIES_OUTDIR=./subs \
  bash scripts/bulk_transcribe.sh inputs.txt

Tips

  1. Use whisper-large-v3-turbo as your default — it's the speed/quality sweet spot for general-purpose ASR. Switch to whisper-large-v3 only when you need the last few % of accuracy on hard audio.
  2. URL file_path over multipart upload — if the audio is already at a URL, send the URL. Saves bandwidth (the file isn't going up and then back down), gets cached server-side, no upload size cap.
  3. Stage repeated files via PUT /v1/files/{path} and call with file_path= to avoid re-uploading on every retry/iteration.
  4. response_format=text for the "just give me the string" case — no jq -r .text needed, content-type is text/plain.
  5. One model at a time — every transcribe request evicts other loaded models. Don't try to fan out two calls against two different models on the same container; the second one evicts the first and reloads. Use two containers if you actually need concurrency on different models.
  6. POST /unload after a job — explicit eviction frees VRAM/RAM faster than waiting for the 10-min idle sweeper. Useful in CI / batch scripts.
  7. canary-qwen-2.5b has no timestampsverbose_json.segments / .words come back empty, srt/vtt collapse to one cue. Use a Whisper or Canary multitask slug if you need timing data.
  8. Diarization requires true stereo — if your "stereo" file is the same mono signal copied to both channels, diarization won't separate speakers. The technique is exact for two-mic setups, useless otherwise.
  9. Long files just work — VAD chunking happens transparently. Don't pre-split. Send the whole file.
  10. prompt and temperature are ignored even though the form accepts them. Don't expect them to do anything.
  11. Watch /api/ps to see what's resident. A request that hangs at "loading model" is doing the first cold load — subsequent calls are fast.
  12. Customizing the model registry for translation slugs or to restrict the served set — see references/setup.md.
Usage Guidance
Install only if you intend to run or connect to a talkies server. Treat uploaded audio, staged files, transcripts, and cached remote downloads as potentially sensitive server-side data. Keep the service bound to localhost or a trusted network when possible; if exposed, set TALKIES_AUTH_TOKEN, use TLS/rate limiting, and enable TALKIES_BLOCK_PRIVATE_DOWNLOADS for untrusted clients. Clean up staged files and cached downloads when they are no longer needed.
Capability Tags
requires-oauth-tokenrequires-sensitive-credentials
Capability Assessment
Purpose & Capability
The documented capabilities match the stated ASR purpose: transcribing uploaded or URL-based audio, producing subtitles/JSON/text, managing loaded models, and exposing an MCP interface for the same operations.
Instruction Scope
The skill clearly documents file upload, URL file_path downloads, server-side staging, file listing/retrieval/deletion, auth behavior, and deployment hardening; these are broad within the talkies data directory but purpose-aligned and disclosed.
Install Mechanism
Installation is Docker-based using the publisher's talkies images, with a helper shell script that sends user-specified files or URLs to TALKIES_URL; no hidden post-install hook, obfuscation, or unrelated command execution was found.
Credentials
Docker, curl, network access, model downloads, GPU/CPU resources, bearer-token configuration, and persisted model/data directories are proportionate for a self-hosted transcription server. Public deployments require explicit hardening.
Persistence & Privilege
The service persists model weights plus staged uploads and URL download caches under TALKIES_DATA_DIR and exposes cleanup endpoints for those files. This is expected for the feature set, but access should be controlled on shared or public servers.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install talkies
  3. After installation, invoke the skill by name or use /talkies
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
Initial release of talkies — self-hosted, OpenAI-compatible speech-to-text API. - Compatible with OpenAI’s `/v1/audio/transcriptions` wire format; supports seven open ASR models (Whisper, Parakeet, Canary). - Features stereo diarization, server-side URL file fetching, multi-model backend, and optional bearer authentication. - Supports various audio formats and can generate plain transcripts, subtitles (SRT/VTT), or verbose JSON outputs with timestamps. - Drop-in replacement for OpenAI’s API; just change the endpoint URL and model slug. - See documentation for setup, supported models, and usage examples.
Metadata
Slug talkies
Version 1.0.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is talkies?

Self-hosted OpenAI-compatible /v1/audio/transcriptions fronting seven open ASR models (Whisper, Parakeet, Canary). Same wire format as OpenAI — change the ba... It is an AI Agent Skill for Claude Code / OpenClaw, with 45 downloads so far.

How do I install talkies?

Run "/install talkies" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is talkies free?

Yes, talkies is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does talkies support?

talkies is cross-platform and runs anywhere OpenClaw / Claude Code is available (cross-platform).

Who created talkies?

It is built and maintained by Ciprian Mandache (@psyb0t); the current version is v1.0.0.

💬 Comments