talkies
/install talkies
talkies
Self-hosted speech-to-text. OpenAI-compatible /v1/audio/transcriptions wire shape — point an OpenAI client at it, change the model slug, done.
Seven backends behind the same endpoint: whisper-large-v3, whisper-large-v3-turbo, distil-whisper-large-v3, parakeet-tdt-0.6b-v3, canary-180m-flash, canary-1b-flash, canary-qwen-2.5b. Stereo diarization, URL file_path fetching, server-side file staging, MCP endpoint with 6 tools, optional bearer-token auth.
For installation, configuration, and container setup, see references/setup.md.
When To Use
- Transcribe audio files (any format ffmpeg decodes — WAV, MP3, M4A, FLAC, OGG, WebM, Opus, MP4 audio).
- Generate SRT/VTT subtitles for video.
- Transcribe podcasts, lectures, interviews, voicemails, calls.
- Stereo two-mic recordings → per-speaker diarized output (
L:/R:channel tagging). - German/French/Spanish ↔ English speech-to-text translation via Canary-1B-Flash.
- Drop-in replacement for
api.openai.com/v1/audio/transcriptionsin existing client code.
When NOT To Use
- Real-time / streaming transcription (this is request/response only — buffer and POST).
- TTS (text-to-speech) — talkies is ASR-only. Use speaches if you need TTS.
- Speaker identification from voice (only stereo-channel diarization is supported, not voice clustering).
- Per-request
promptortemperatureinjection (fields accepted for compat, ignored). - arm64 hosts —
linux/amd64only.
Setup
The container should already be running. Set the base URL:
export TALKIES_URL=http://localhost:8000
If the server has TALKIES_AUTH_TOKEN set, export it too:
export TALKIES_AUTH_TOKEN=\x3Cyour-token>
# every request below needs: -H "Authorization: Bearer $TALKIES_AUTH_TOKEN"
Verify: curl $TALKIES_URL/healthz returns {"ok": true, "device": "...", "models": [...]}.
For install / configuration / env vars / CPU vs CUDA images / custom model registry, see references/setup.md.
Quick Start
# Discover what's available.
curl -s $TALKIES_URL/v1/models | jq
# Simplest transcribe — file upload, JSON response.
curl -s $TALKIES_URL/v1/audio/transcriptions \
-F "[email protected]" \
-F "model=whisper-large-v3-turbo" | jq
# Same call, but the audio lives at a URL — talkies downloads + caches it.
curl -s $TALKIES_URL/v1/audio/transcriptions \
-F "file_path=https://example.com/podcasts/ep-042.mp3" \
-F "model=whisper-large-v3-turbo" | jq
# Full Whisper-shape JSON with per-segment + per-word timestamps.
curl -s $TALKIES_URL/v1/audio/transcriptions \
-F "[email protected]" \
-F "model=whisper-large-v3-turbo" \
-F "response_format=verbose_json" | jq
# SRT subtitles.
curl -s $TALKIES_URL/v1/audio/transcriptions \
-F "[email protected]" \
-F "model=whisper-large-v3" \
-F "response_format=srt" > lecture.srt
Supported Models
| Slug | Family | CPU | CUDA | Languages | Strength |
|---|---|---|---|---|---|
whisper-large-v3 |
faster-whisper | yes | yes | 99 auto-detect | best accuracy, slowest |
whisper-large-v3-turbo |
faster-whisper | yes | yes | 99 auto-detect | sweet spot — fast, accurate |
distil-whisper-large-v3 |
faster-whisper | yes | yes | English only | fastest Whisper variant |
parakeet-tdt-0.6b-v3 |
NeMo TDT | no | yes | English only | very fast on GPU |
canary-180m-flash |
NeMo Canary | yes | yes | English only (small) | smallest, runs anywhere |
canary-1b-flash |
NeMo Canary | no | yes | en/de/fr/es + translation | multilingual, translation |
canary-qwen-2.5b |
NeMo SALM | no | yes | English only | best English accuracy (no timestamps) |
Pick by use case:
- General-purpose:
whisper-large-v3-turbo. - English-only, max speed on CPU:
distil-whisper-large-v3. - English-only, max accuracy on GPU:
canary-qwen-2.5b(but no per-segment timestamps). - Translation EN↔DE/FR/ES:
canary-1b-flash(requires custom model registry — see Translation).
canary-qwen-2.5b produces no segment/word timestamps — verbose_json.segments and .words come back empty, srt/vtt collapse to a single full-duration cue. Transcription itself is whole-file. Use a Whisper or Canary multitask slug if you need timing.
API — POST /v1/audio/transcriptions
Multipart form. Same field names as OpenAI's transcription endpoint where they overlap.
Request Fields
| Field | Required | Default | Notes |
|---|---|---|---|
file |
one of file/file_path |
— | Audio file. Capped at TALKIES_MAX_UPLOAD_BYTES (default 100 MB). |
file_path |
one of file/file_path |
— | Either a path under the staging area (/v1/files) or an http(s):// URL (downloaded + cached server-side). Not subject to the 100 MB upload cap; URL downloads capped by TALKIES_MAX_DOWNLOAD_BYTES (default 1 GiB). |
model |
yes | — | One of the configured slugs (see GET /v1/models). Unknown → 404. |
language |
no | model default | ISO-639-1 code. Whisper auto-detects when omitted; Canary uses its default_source_lang. |
response_format |
no | json |
json / text / verbose_json / srt / vtt. |
timestamp_granularities[] |
no | — | Accepted for OpenAI compat; ignored — verbose_json always emits both segment + word. |
prompt |
no | — | Accepted, ignored. |
temperature |
no | — | Accepted, ignored. |
diarization |
no | false |
Stereo-channel diarization. Requires 2-channel input — mono returns 400. |
Exactly one of file or file_path must be set — passing both or neither returns 400.
Response Formats
response_format |
Content-Type | Shape |
|---|---|---|
json (default) |
application/json |
{"text": "..."} — just the transcript. |
text |
text/plain |
The transcript as plain text. |
verbose_json |
application/json |
Full Whisper shape — task, language, duration, text, segments[], words[]. |
srt |
application/x-subrip |
SubRip subtitle file, one cue per VAD-segmented chunk. |
vtt |
text/vtt |
WebVTT subtitle file, one cue per VAD-segmented chunk. |
json shape:
{ "text": " full transcript as a single string" }
verbose_json shape — segments and words are always present (empty arrays for backends with no alignment output):
{
"task": "transcribe",
"language": "en",
"duration": 6.42,
"text": " full transcript",
"segments": [{ "id": 0, "start": 0.0, "end": 2.31, "text": " ...", "tokens": [], "temperature": 0.0, "avg_logprob": null, "compression_ratio": null, "no_speech_prob": null }],
"words": [{ "word": " the", "start": 0.0, "end": 0.12 }]
}
Whisper-only confidence fields (avg_logprob, compression_ratio, no_speech_prob) are emitted as null regardless of backend so clients reading them don't crash. tokens is always [].
Stereo Diarization
Pass diarization=true and upload a 2-channel file. Left channel = speaker L, right channel = speaker R. Each channel is transcribed independently, the two timelines are merged chronologically by segment start time.
curl -s $TALKIES_URL/v1/audio/transcriptions \
-F "[email protected]" \
-F "model=whisper-large-v3-turbo" \
-F "diarization=true" \
-F "response_format=verbose_json" | jq
What changes:
verbose_json— every segment/word gets"channel": "L"or"R". Segments re-numbered after merge.text/response_format=text— rebuilt as alternating turn lines:L: ...\ R: ...\ .... Consecutive same-channel segments collapsed into one line per turn.srt/vtt— each cue prefixed withL:/R:.
Caveats:
- Exactly 2 channels required. Mono → 400. >2 channels → 400.
- Latency ~2× the mono case (model runs sequentially on each channel).
- The technique is exact for true two-mic setups (interview rigs, podcast splits). It does NOT magically separate speakers from a single-mic recording that's been rendered to stereo.
Translation
Canary multitask models can translate speech → text in a non-source language. canary-1b-flash covers en↔de, en↔fr, en↔es. The task is baked into the model slug, not passed per-request — you add a translation-specific slug via custom models.json (see Customizing the model registry):
{
"models": {
"canary-1b-flash-de2en": {
"repo": "nvidia/canary-1b-flash",
"executor": "canary_multitask",
"default_source_lang": "de",
"default_target_lang": "en",
"default_task": "s2t_translation",
"languages": ["de"]
}
}
}
Then call it normally — text carries the English translation:
curl -s $TALKIES_URL/v1/audio/transcriptions \
-F "[email protected]" \
-F "model=canary-1b-flash-de2en" | jq
canary-180m-flash is English-ASR-only — don't point a translation slug at it. canary-qwen-2.5b is English ASR only too.
Long Files + VAD Chunking
Audio longer than 30 s (TALKIES_VAD_CHUNK_THRESHOLD) gets sliced through Silero VAD into ≤28 s speech regions before being handed to the backend. Timestamps are re-assembled by offsetting each chunk's segment/word timings — you get one continuous segments list spanning the whole file.
No client-side change. Long files just work. Verify by checking duration in verbose_json.
Error Contract
| Status | Shape | When |
|---|---|---|
| 200 | per response_format |
success |
| 400 | {"detail": "..."} |
bad audio, mono+diarization, >2 ch+diarization, both/neither of file/file_path, invalid file_path, URL download failure (DNS, HTTP error, size exceeded, SSRF blocked) |
| 401 | {"detail": "..."} |
only when TALKIES_AUTH_TOKEN is set: missing/wrong bearer. Includes WWW-Authenticate: Bearer. |
| 404 | {"detail": "..."} |
unknown model slug, file_path references missing file, DELETE /api/ps/{slug} on unloaded model, /v1/files/{path} GET/DELETE on missing |
| 413 | {"detail": "..."} |
upload exceeded TALKIES_MAX_UPLOAD_BYTES (multipart file and PUT /v1/files/{path} only — not file_path URL) |
| 422 | {"detail": [...]} |
Pydantic validation (missing fields, wrong types) |
| 500 | {"detail": "..."} |
unhandled backend failure |
Resource-Management Endpoints (Ollama-Style)
talkies mirrors a subset of speaches / Ollama, so a LiteLLM proxy can drive both.
| Endpoint | Behavior |
|---|---|
GET /healthz |
Unauthenticated liveness. Returns {ok, device, models}. |
GET /v1/models |
OpenAI-style list of configured slugs. |
GET /api/ps |
Currently-loaded models with per-model idle_seconds. |
DELETE /api/ps/{model_id} |
Evict one model. Slug can be URL-encoded (/ → %2F). 404 if not loaded. |
POST /unload |
Evict every loaded model. Returns the list actually unloaded. |
Behind these: an idle sweeper runs every TALKIES_SWEEPER_INTERVAL s (default 60) and unloads anything not used in TALKIES_MODEL_TTL s (default 600). Set TALKIES_MODEL_TTL=0 to disable.
There's also sibling eviction at transcription time — every request evicts other loaded models so VRAM doesn't get split. One model resident at a time, per container. If you need two models simultaneously, run two containers.
# Which models are loaded right now.
curl -s $TALKIES_URL/api/ps | jq
# Free VRAM after a job — evict one model.
curl -s -X DELETE "$TALKIES_URL/api/ps/whisper-large-v3-turbo"
# Or evict everything.
curl -s -X POST $TALKIES_URL/unload | jq
Server-Side File Staging (/v1/files)
For repeated transcribes of the same file (different response_format, different model, iterating on params), stage the file once and reference it by path. Files land under ${TALKIES_DATA_DIR}/files/\x3Cpath>.
| Endpoint | Behavior |
|---|---|
GET /v1/files |
List every staged file. Returns {"files": [{"path", "size", "modified"}]}. |
PUT /v1/files/{path} |
Upload raw bytes (--data-binary @local-file). Capped at TALKIES_MAX_UPLOAD_BYTES. Atomic write (.part → rename). |
GET /v1/files/{path} |
Streams file back. Content-Type guessed by extension. 404 if missing. |
DELETE /v1/files/{path} |
Removes file and prunes empty parent dirs. 404 if missing. |
# Stage once.
curl -X PUT --data-binary @lecture.mp3 \
-H "Content-Type: audio/mpeg" \
$TALKIES_URL/v1/files/lectures/2026-03-15/lecture.mp3
# Reuse across multiple transcribe calls.
curl -s $TALKIES_URL/v1/audio/transcriptions \
-F "file_path=lectures/2026-03-15/lecture.mp3" \
-F "model=whisper-large-v3-turbo" \
-F "response_format=verbose_json" | jq
# Cleanup.
curl -X DELETE $TALKIES_URL/v1/files/lectures/2026-03-15/lecture.mp3
Path safety: null bytes, backslashes, . / .. segments and double slashes are rejected (400). Symlinks pointing outside the root are refused. Leading / is stripped — /foo/bar.mp3 and foo/bar.mp3 resolve identically.
URL file_path (Download + Cache)
file_path also accepts http:// / https:// URLs. First request downloads to ${TALKIES_DATA_DIR}/files/downloads/\x3Csha256(url)[:16]>-\x3Cbasename>, subsequent requests with the same URL hit the cache.
# First call: downloads, transcribes off the cached copy.
curl -s $TALKIES_URL/v1/audio/transcriptions \
-F "file_path=https://example.com/podcasts/ep-042.mp3" \
-F "model=whisper-large-v3-turbo" | jq
# Second call: same URL → cache hit, no re-download.
curl -s $TALKIES_URL/v1/audio/transcriptions \
-F "file_path=https://example.com/podcasts/ep-042.mp3" \
-F "model=canary-1b-flash" \
-F "response_format=srt" > ep-042.srt
Downloads appear in GET /v1/files listings under downloads/. Invalidate a single cached URL with DELETE /v1/files/downloads/\x3Ckey>.
Constraints applied during download:
- Size capped by
TALKIES_MAX_DOWNLOAD_BYTES(default 1 GiB). - 5 redirect hops max; SSRF guard re-applied at every hop.
- 10 s connect, 300 s per-chunk read timeout.
- SSRF off by default. Set
TALKIES_BLOCK_PRIVATE_DOWNLOADS=trueto reject URLs whose hostname resolves to private/loopback/link-local/multicast/reserved IPs.
MCP Endpoint (/v1/mcp)
talkies exposes a Model Context Protocol server over Streamable HTTP at /v1/mcp. Same FastAPI process, same BACKENDS / REGISTRY, same auth middleware — a model loaded by the MCP transcribe tool is the same instance the HTTP endpoint sees.
| Tool | What it does |
|---|---|
list_models |
Discover ASR slugs. Returns [{slug, executor, default_source_lang, default_target_lang, default_task, loaded}]. |
transcribe |
Run ASR on a file_path (URL or staged path). Args: model, language?, response_format? (json/verbose_json/text/srt/vtt), diarization?. JSON formats return a JSON-encoded string; text/srt/vtt return raw. |
list_files |
Same payload as GET /v1/files. |
put_file |
Upload to staging. Body is base64 (content_base64). Decoded size capped at TALKIES_MAX_UPLOAD_BYTES. For big files, prefer PUT /v1/files/{path} over HTTP — JSON-RPC + base64 chews token budget. |
get_file |
Read a staged file as base64. Same size cap. Same advice — for big bytes, hit GET /v1/files/{path} over HTTP. |
delete_file |
Remove a staged file, prune empty parents. |
The transport requires Accept: application/json, text/event-stream. Wire it into Claude Code:
claude mcp add --transport http talkies $TALKIES_URL/v1/mcp
With auth:
claude mcp add --transport http talkies $TALKIES_URL/v1/mcp \
--header "Authorization: Bearer $TALKIES_AUTH_TOKEN"
Note: the canonical mount path is /v1/mcp/ (trailing slash). Bare /v1/mcp is rewritten internally to /v1/mcp/ so clients that don't follow Starlette's 307 redirect work too.
Raw JSON-RPC
For debugging or non-MCP-aware callers, hit it as JSON-RPC over HTTP POST:
# tools/list
curl -s $TALKIES_URL/v1/mcp/ \
-H "Content-Type: application/json" \
-H "Accept: application/json, text/event-stream" \
-d '{"jsonrpc": "2.0", "id": 1, "method": "tools/list"}'
# tools/call
curl -s $TALKIES_URL/v1/mcp/ \
-H "Content-Type: application/json" \
-H "Accept: application/json, text/event-stream" \
-d '{
"jsonrpc": "2.0", "id": 2, "method": "tools/call",
"params": {
"name": "transcribe",
"arguments": {
"file_path": "https://example.com/clip.mp3",
"model": "whisper-large-v3-turbo",
"response_format": "json"
}
}
}'
Bearer-Token Auth
If TALKIES_AUTH_TOKEN is set on the server, every route except /healthz and CORS preflight (OPTIONS) requires Authorization: Bearer \x3Ctoken>. Wrong/missing token returns 401 with WWW-Authenticate: Bearer. Compared with hmac.compare_digest (constant-time).
curl -H "Authorization: Bearer $TALKIES_AUTH_TOKEN" $TALKIES_URL/v1/models
Empty / unset token = wide open. For untrusted networks, combine the token with a reverse proxy doing TLS + rate limiting.
Typical Workflows
Quick one-off transcribe
curl -s $TALKIES_URL/v1/audio/transcriptions \
-F "[email protected]" \
-F "model=whisper-large-v3-turbo" | jq -r .text
Generate subtitles for a video
ffmpeg -i video.mp4 -vn -acodec libmp3lame audio.mp3
curl -s $TALKIES_URL/v1/audio/transcriptions \
-F "[email protected]" \
-F "model=whisper-large-v3" \
-F "response_format=srt" > video.srt
# burn in: ffmpeg -i video.mp4 -vf subtitles=video.srt -c:a copy video-subbed.mp4
Iterate on the same file with different settings
# Stage once.
curl -X PUT --data-binary @lecture.mp3 \
-H "Content-Type: audio/mpeg" \
$TALKIES_URL/v1/files/work/lecture.mp3
# Try different models / formats without re-uploading.
for fmt in json verbose_json srt; do
curl -s $TALKIES_URL/v1/audio/transcriptions \
-F "file_path=work/lecture.mp3" \
-F "model=whisper-large-v3-turbo" \
-F "response_format=$fmt" > "lecture.$fmt"
done
# Cleanup.
curl -X DELETE $TALKIES_URL/v1/files/work/lecture.mp3
Diarized interview transcript
curl -s $TALKIES_URL/v1/audio/transcriptions \
-F "[email protected]" \
-F "model=whisper-large-v3-turbo" \
-F "diarization=true" \
-F "response_format=text"
# stdout:
# L: hi how's it going
# R: not bad you
# L: cool man
Free VRAM after a job
curl -s -X POST $TALKIES_URL/unload | jq
Bulk transcribe from URLs
for url in $(cat urls.txt); do
curl -s $TALKIES_URL/v1/audio/transcriptions \
-F "file_path=$url" \
-F "model=whisper-large-v3-turbo" \
-F "response_format=text"
echo "---"
done
The first hit on each URL downloads + caches; re-running the loop is free.
For a fuller bulk-transcribe driver (mix of local paths + URLs, per-input output files, error reporting, optional diarization) see scripts/bulk_transcribe.sh:
TALKIES_URL=http://localhost:8000 \
TALKIES_MODEL=whisper-large-v3-turbo \
TALKIES_FORMAT=srt \
TALKIES_OUTDIR=./subs \
bash scripts/bulk_transcribe.sh inputs.txt
Tips
- Use
whisper-large-v3-turboas your default — it's the speed/quality sweet spot for general-purpose ASR. Switch towhisper-large-v3only when you need the last few % of accuracy on hard audio. - URL
file_pathover multipart upload — if the audio is already at a URL, send the URL. Saves bandwidth (the file isn't going up and then back down), gets cached server-side, no upload size cap. - Stage repeated files via
PUT /v1/files/{path}and call withfile_path=to avoid re-uploading on every retry/iteration. response_format=textfor the "just give me the string" case — nojq -r .textneeded, content-type istext/plain.- One model at a time — every transcribe request evicts other loaded models. Don't try to fan out two calls against two different models on the same container; the second one evicts the first and reloads. Use two containers if you actually need concurrency on different models.
POST /unloadafter a job — explicit eviction frees VRAM/RAM faster than waiting for the 10-min idle sweeper. Useful in CI / batch scripts.canary-qwen-2.5bhas no timestamps —verbose_json.segments/.wordscome back empty,srt/vttcollapse to one cue. Use a Whisper or Canary multitask slug if you need timing data.- Diarization requires true stereo — if your "stereo" file is the same mono signal copied to both channels, diarization won't separate speakers. The technique is exact for two-mic setups, useless otherwise.
- Long files just work — VAD chunking happens transparently. Don't pre-split. Send the whole file.
promptandtemperatureare ignored even though the form accepts them. Don't expect them to do anything.- Watch
/api/psto see what's resident. A request that hangs at "loading model" is doing the first cold load — subsequent calls are fast. - Customizing the model registry for translation slugs or to restrict the served set — see references/setup.md.
- Make sure OpenClaw is installed (local or Docker)
- Run the install command in chat:
/install talkies - After installation, invoke the skill by name or use
/talkies - Provide required inputs per the skill's parameter spec and get structured output
What is talkies?
Self-hosted OpenAI-compatible /v1/audio/transcriptions fronting seven open ASR models (Whisper, Parakeet, Canary). Same wire format as OpenAI — change the ba... It is an AI Agent Skill for Claude Code / OpenClaw, with 45 downloads so far.
How do I install talkies?
Run "/install talkies" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.
Is talkies free?
Yes, talkies is completely free, licensed under MIT-0. You can download, install and use it at no cost.
Which platforms does talkies support?
talkies is cross-platform and runs anywhere OpenClaw / Claude Code is available (cross-platform).
Who created talkies?
It is built and maintained by Ciprian Mandache (@psyb0t); the current version is v1.0.0.