talkies

Name: talkies
Author: psyb0t

by Ciprian Mandache · GitHub ↗ · v1.0.0 · MIT-0

cross-platform ✓ Security Clean

Install in OpenClaw

/install talkies

Description

Self-hosted OpenAI-compatible /v1/audio/transcriptions fronting seven open ASR models (Whisper, Parakeet, Canary). Same wire format as OpenAI — change the ba...

README (SKILL.md)

talkies

Self-hosted speech-to-text. OpenAI-compatible /v1/audio/transcriptions wire shape — point an OpenAI client at it, change the model slug, done.

Seven backends behind the same endpoint: whisper-large-v3, whisper-large-v3-turbo, distil-whisper-large-v3, parakeet-tdt-0.6b-v3, canary-180m-flash, canary-1b-flash, canary-qwen-2.5b. Stereo diarization, URL file_path fetching, server-side file staging, MCP endpoint with 6 tools, optional bearer-token auth.

For installation, configuration, and container setup, see references/setup.md.

When To Use

Transcribe audio files (any format ffmpeg decodes — WAV, MP3, M4A, FLAC, OGG, WebM, Opus, MP4 audio).
Generate SRT/VTT subtitles for video.
Transcribe podcasts, lectures, interviews, voicemails, calls.
Stereo two-mic recordings → per-speaker diarized output (L: / R: channel tagging).
German/French/Spanish ↔ English speech-to-text translation via Canary-1B-Flash.
Drop-in replacement for api.openai.com/v1/audio/transcriptions in existing client code.

When NOT To Use

Real-time / streaming transcription (this is request/response only — buffer and POST).
TTS (text-to-speech) — talkies is ASR-only. Use speaches if you need TTS.
Speaker identification from voice (only stereo-channel diarization is supported, not voice clustering).
Per-request prompt or temperature injection (fields accepted for compat, ignored).
arm64 hosts — linux/amd64 only.

Setup

The container should already be running. Set the base URL:

export TALKIES_URL=http://localhost:8000

If the server has TALKIES_AUTH_TOKEN set, export it too:

export TALKIES_AUTH_TOKEN=<your-token>
# every request below needs: -H "Authorization: Bearer $TALKIES_AUTH_TOKEN"

Verify: curl $TALKIES_URL/healthz returns {"ok": true, "device": "...", "models": [...]}.

For install / configuration / env vars / CPU vs CUDA images / custom model registry, see references/setup.md.

Quick Start

# Discover what's available.
curl -s $TALKIES_URL/v1/models | jq

# Simplest transcribe — file upload, JSON response.
curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F "model=whisper-large-v3-turbo" | jq

# Same call, but the audio lives at a URL — talkies downloads + caches it.
curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file_path=https://example.com/podcasts/ep-042.mp3" \
  -F "model=whisper-large-v3-turbo" | jq

# Full Whisper-shape JSON with per-segment + per-word timestamps.
curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F "model=whisper-large-v3-turbo" \
  -F "response_format=verbose_json" | jq

# SRT subtitles.
curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file=@lecture.mp3" \
  -F "model=whisper-large-v3" \
  -F "response_format=srt" > lecture.srt

Supported Models

Slug	Family	CPU	CUDA	Languages	Strength
`whisper-large-v3`	faster-whisper	yes	yes	99 auto-detect	best accuracy, slowest
`whisper-large-v3-turbo`	faster-whisper	yes	yes	99 auto-detect	sweet spot — fast, accurate
`distil-whisper-large-v3`	faster-whisper	yes	yes	English only	fastest Whisper variant
`parakeet-tdt-0.6b-v3`	NeMo TDT	no	yes	English only	very fast on GPU
`canary-180m-flash`	NeMo Canary	yes	yes	English only (small)	smallest, runs anywhere
`canary-1b-flash`	NeMo Canary	no	yes	en/de/fr/es + translation	multilingual, translation
`canary-qwen-2.5b`	NeMo SALM	no	yes	English only	best English accuracy (no timestamps)

Pick by use case:

General-purpose: whisper-large-v3-turbo.
English-only, max speed on CPU: distil-whisper-large-v3.
English-only, max accuracy on GPU: canary-qwen-2.5b (but no per-segment timestamps).
Translation EN↔DE/FR/ES: canary-1b-flash (requires custom model registry — see Translation).

canary-qwen-2.5b produces no segment/word timestamps — verbose_json.segments and .words come back empty, srt/vtt collapse to a single full-duration cue. Transcription itself is whole-file. Use a Whisper or Canary multitask slug if you need timing.

API — `POST /v1/audio/transcriptions`

Multipart form. Same field names as OpenAI's transcription endpoint where they overlap.

Request Fields

Field	Required	Default	Notes
`file`	one of `file`/`file_path`	—	Audio file. Capped at `TALKIES_MAX_UPLOAD_BYTES` (default 100 MB).
`file_path`	one of `file`/`file_path`	—	Either a path under the staging area (`/v1/files`) or an `http(s)://` URL (downloaded + cached server-side). Not subject to the 100 MB upload cap; URL downloads capped by `TALKIES_MAX_DOWNLOAD_BYTES` (default 1 GiB).
`model`	yes	—	One of the configured slugs (see `GET /v1/models`). Unknown → 404.
`language`	no	model default	ISO-639-1 code. Whisper auto-detects when omitted; Canary uses its `default_source_lang`.
`response_format`	no	`json`	`json` / `text` / `verbose_json` / `srt` / `vtt`.
`timestamp_granularities[]`	no	—	Accepted for OpenAI compat; ignored — `verbose_json` always emits both segment + word.
`prompt`	no	—	Accepted, ignored.
`temperature`	no	—	Accepted, ignored.
`diarization`	no	`false`	Stereo-channel diarization. Requires 2-channel input — mono returns 400.

Exactly one of file or file_path must be set — passing both or neither returns 400.
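For callers without curl, the same multipart request can be assembled with the Python standard library. This is a minimal sketch, not part of talkies: `encode_multipart` and `build_transcription_request` are hypothetical helper names, and the exactly-one check simply mirrors the server rule on the client so a bad call fails before anything is sent.

```python
import urllib.request
import uuid


def encode_multipart(fields, file_part=None):
    """Encode text fields (and optionally one file) as multipart/form-data."""
    boundary = uuid.uuid4().hex
    chunks = []
    for name, value in fields.items():
        chunks.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    if file_part is not None:
        filename, data = file_part  # (str, bytes)
        head = (
            f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
            f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'
        )
        chunks.append(head.encode() + data + b"\r\n")
    chunks.append(f"--{boundary}--\r\n".encode())
    return b"".join(chunks), f"multipart/form-data; boundary={boundary}"


def build_transcription_request(base_url, model, *, file_part=None, file_path=None,
                                response_format="json", token=None):
    """Build (but do not send) a POST to /v1/audio/transcriptions."""
    # Mirror the documented rule: exactly one of file / file_path.
    if (file_part is None) == (file_path is None):
        raise ValueError("pass exactly one of file_part or file_path")
    fields = {"model": model, "response_format": response_format}
    if file_path is not None:
        fields["file_path"] = file_path
    body, content_type = encode_multipart(fields, file_part)
    headers = {"Content-Type": content_type}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/audio/transcriptions",
        data=body, headers=headers, method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends it; keeping construction separate from sending makes the request easy to inspect or log first.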

Response Formats

`response_format`	Content-Type	Shape
`json` (default)	`application/json`	`{"text": "..."}` — just the transcript.
`text`	`text/plain`	The transcript as plain text.
`verbose_json`	`application/json`	Full Whisper shape — `task`, `language`, `duration`, `text`, `segments[]`, `words[]`.
`srt`	`application/x-subrip`	SubRip subtitle file, one cue per VAD-segmented chunk.
`vtt`	`text/vtt`	WebVTT subtitle file, one cue per VAD-segmented chunk.

json shape:

{ "text": " full transcript as a single string" }

verbose_json shape — segments and words are always present (empty arrays for backends with no alignment output):

{
  "task": "transcribe",
  "language": "en",
  "duration": 6.42,
  "text": " full transcript",
  "segments": [{ "id": 0, "start": 0.0, "end": 2.31, "text": " ...", "tokens": [], "temperature": 0.0, "avg_logprob": null, "compression_ratio": null, "no_speech_prob": null }],
  "words": [{ "word": " the", "start": 0.0, "end": 0.12 }]
}

Whisper-only confidence fields (avg_logprob, compression_ratio, no_speech_prob) are emitted as null regardless of backend so clients reading them don't crash. tokens is always [].
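Because verbose_json already carries segment timings, a client can derive subtitles locally instead of issuing a second request. The sketch below assumes only the documented `segments[].start`, `.end`, and `.text` fields; `srt_timestamp` and `verbose_json_to_srt` are illustrative names, and the cue boundaries it produces may not match what the server emits for response_format=srt.

```python
def srt_timestamp(seconds):
    """Format seconds as the SRT HH:MM:SS,mmm timestamp."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def verbose_json_to_srt(doc):
    """Render one SRT cue per entry in doc['segments'] (empty list -> '')."""
    cues = []
    for number, seg in enumerate(doc.get("segments", []), start=1):
        cues.append(
            f"{number}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(cues)
```

For backends without alignment output, `segments` is an empty array and this helper returns an empty string, so callers may want to fall back to the plain `text` field.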

Stereo Diarization

Pass diarization=true and upload a 2-channel file. Left channel = speaker L, right channel = speaker R. Each channel is transcribed independently, then the two timelines are merged chronologically by segment start time.

curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file=@stereo.wav" \
  -F "model=whisper-large-v3-turbo" \
  -F "diarization=true" \
  -F "response_format=verbose_json" | jq

What changes:

verbose_json — every segment/word gets "channel": "L" or "R". Segments re-numbered after merge.
text / response_format=text — rebuilt as alternating turn lines: L: ...\nR: ...\n.... Consecutive same-channel segments are collapsed into one line per turn.
srt / vtt — each cue prefixed with L: / R:.

Caveats:

Exactly 2 channels required. Mono → 400. >2 channels → 400.
Latency ~2× the mono case (model runs sequentially on each channel).
The technique is exact for true two-mic setups (interview rigs, podcast splits). It does NOT magically separate speakers from a single-mic recording that's been rendered to stereo.
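The merge described in this section can be illustrated in a few lines of Python. This is a sketch of the documented behavior, not the server's code: `merge_channels` and `to_turn_lines` are hypothetical names, and the inputs are assumed to be per-channel segment lists with `start`, `end`, and `text` keys.

```python
def merge_channels(left, right):
    """Tag each segment with its channel, merge by start time, renumber ids."""
    tagged = [dict(seg, channel="L") for seg in left]
    tagged += [dict(seg, channel="R") for seg in right]
    tagged.sort(key=lambda seg: seg["start"])  # stable: ties keep L before R
    for new_id, seg in enumerate(tagged):
        seg["id"] = new_id
    return tagged


def to_turn_lines(segments):
    """Collapse consecutive same-channel segments into one 'L: ...' line per turn."""
    turns = []  # list of (channel, [texts])
    for seg in segments:
        text = seg["text"].strip()
        if turns and turns[-1][0] == seg["channel"]:
            turns[-1][1].append(text)
        else:
            turns.append((seg["channel"], [text]))
    return "\n".join(f"{channel}: {' '.join(texts)}" for channel, texts in turns)
```

The same two helpers also work client-side if you transcribe each channel yourself and want the turn-by-turn layout afterwards.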

Translation

Canary multitask models can translate speech → text in a non-source language. canary-1b-flash covers en↔de, en↔fr, en↔es. The task is baked into the model slug, not passed per-request — you add a translation-specific slug via custom models.json (see Customizing the model registry):

{
  "models": {
    "canary-1b-flash-de2en": {
      "repo": "nvidia/canary-1b-flash",
      "executor": "canary_multitask",
      "default_source_lang": "de",
      "default_target_lang": "en",
      "default_task": "s2t_translation",
      "languages": ["de"]
    }
  }
}

Then call it normally — text carries the English translation:

curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file=@german.mp3" \
  -F "model=canary-1b-flash-de2en" | jq

canary-180m-flash is English-ASR-only — don't point a translation slug at it. canary-qwen-2.5b is English ASR only too.
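If you need several translation directions, the registry entries can be generated rather than hand-written. The sketch below generalizes from the single de→en example above, so treat two things as assumptions: the `<source>2<target>` slug suffix is just a naming pattern copied from that example, and `languages` is assumed to list the source language. `translation_entry` and `build_registry` are hypothetical helper names.

```python
import json

# Pairs the text above lists for canary-1b-flash: en<->de, en<->fr, en<->es.
SUPPORTED_PAIRS = {
    ("en", "de"), ("de", "en"),
    ("en", "fr"), ("fr", "en"),
    ("en", "es"), ("es", "en"),
}


def translation_entry(source, target):
    """Return (slug, config) for one translation direction."""
    if (source, target) not in SUPPORTED_PAIRS:
        raise ValueError(f"unsupported pair: {source}->{target}")
    slug = f"canary-1b-flash-{source}2{target}"  # naming pattern is an assumption
    config = {
        "repo": "nvidia/canary-1b-flash",
        "executor": "canary_multitask",
        "default_source_lang": source,
        "default_target_lang": target,
        "default_task": "s2t_translation",
        "languages": [source],  # assumed to mean the accepted source language
    }
    return slug, config


def build_registry(pairs):
    """Build the {"models": {...}} document for the given (source, target) pairs."""
    return {"models": dict(translation_entry(s, t) for s, t in pairs)}


if __name__ == "__main__":
    print(json.dumps(build_registry([("de", "en"), ("en", "fr")]), indent=2))
```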

Long Files + VAD Chunking

Audio longer than 30 s (TALKIES_VAD_CHUNK_THRESHOLD) gets sliced through Silero VAD into ≤28 s speech regions before being handed to the backend. Timestamps are re-assembled by offsetting each chunk's segment/word timings — you get one continuous segments list spanning the whole file.

No client-side change. Long files just work. Verify by checking duration in verbose_json.
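The re-assembly step amounts to adding each chunk's start offset to its chunk-relative timings. A minimal sketch of that idea follows; it is not the server's implementation, and `reassemble` plus the `(offset, segments)` input shape are assumptions made for illustration.

```python
def reassemble(chunks):
    """Shift chunk-relative segment timings onto the whole-file timeline.

    chunks: list of (offset_seconds, segments), in file order, where each
    segment has 'start', 'end' and 'text' relative to its own chunk.
    """
    merged = []
    for offset, segments in chunks:
        for seg in segments:
            merged.append({
                "id": len(merged),  # continuous numbering across chunks
                "start": seg["start"] + offset,
                "end": seg["end"] + offset,
                "text": seg["text"],
            })
    return merged
```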

Error Contract

Status	Shape	When
200	per `response_format`	success
400	`{"detail": "..."}`	bad audio, mono+diarization, >2 ch+diarization, both/neither of `file`/`file_path`, invalid file_path, URL download failure (DNS, HTTP error, size exceeded, SSRF blocked)
401	`{"detail": "..."}`	only when `TALKIES_AUTH_TOKEN` is set: missing/wrong bearer. Includes `WWW-Authenticate: Bearer`.
404	`{"detail": "..."}`	unknown model slug, `file_path` references missing file, `DELETE /api/ps/{slug}` on unloaded model, `/v1/files/{path}` GET/DELETE on missing
413	`{"detail": "..."}`	upload exceeded `TALKIES_MAX_UPLOAD_BYTES` (multipart `file` and `PUT /v1/files/{path}` only — not `file_path` URL)
422	`{"detail": [...]}`	Pydantic validation (missing fields, wrong types)
500	`{"detail": "..."}`	unhandled backend failure
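Since `detail` is a string for most errors but a list for 422, client code usually needs one branch for each. The helper below is a hedged sketch: `describe_error` is a hypothetical name, and the `loc` / `msg` keys inside 422 entries are assumed from the usual FastAPI/Pydantic validation shape rather than confirmed by this document.

```python
import json


def describe_error(status, body):
    """Turn a talkies error response (status code + body text) into one line."""
    try:
        detail = json.loads(body).get("detail")
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object: show the raw body.
        return f"HTTP {status}: {body!r}"
    if isinstance(detail, list):  # 422: one entry per validation problem
        parts = []
        for item in detail:
            if isinstance(item, dict):
                loc = ".".join(str(p) for p in item.get("loc", []))
                msg = str(item.get("msg", item))
                parts.append(f"{loc}: {msg}" if loc else msg)
            else:
                parts.append(str(item))
        return f"HTTP {status}: " + "; ".join(parts)
    return f"HTTP {status}: {detail}"
```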

Resource-Management Endpoints (Ollama-Style)

talkies mirrors a subset of speaches / Ollama, so a LiteLLM proxy can drive both.

Endpoint	Behavior
`GET /healthz`	Unauthenticated liveness. Returns `{ok, device, models}`.
`GET /v1/models`	OpenAI-style list of configured slugs.
`GET /api/ps`	Currently-loaded models with per-model `idle_seconds`.
`DELETE /api/ps/{model_id}`	Evict one model. Slug can be URL-encoded (`/` → `%2F`). 404 if not loaded.
`POST /unload`	Evict every loaded model. Returns the list actually unloaded.

Behind these: an idle sweeper runs every TALKIES_SWEEPER_INTERVAL s (default 60) and unloads anything not used in TALKIES_MODEL_TTL s (default 600). Set TALKIES_MODEL_TTL=0 to disable.

There's also sibling eviction at transcription time — every request evicts other loaded models so VRAM doesn't get split. One model resident at a time, per container. If you need two models simultaneously, run two containers.

# Which models are loaded right now.
curl -s $TALKIES_URL/api/ps | jq

# Free VRAM after a job — evict one model.
curl -s -X DELETE "$TALKIES_URL/api/ps/whisper-large-v3-turbo"

# Or evict everything.
curl -s -X POST $TALKIES_URL/unload | jq
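The two eviction rules described in this section (sibling eviction on every request, plus the idle sweeper) can be modeled with a toy class to reason about what will be resident at a given time. This is an illustration only, not the server's code: `ModelPool` is a hypothetical name, and treating "exactly TTL seconds idle" as stale is an assumption.

```python
class ModelPool:
    """Toy model of the documented eviction rules."""

    def __init__(self, ttl=600):
        self.ttl = ttl        # stands in for TALKIES_MODEL_TTL (0 disables sweeping)
        self.last_used = {}   # slug -> timestamp of last use

    def use(self, slug, now):
        """Sibling eviction: a request for `slug` unloads every other model."""
        evicted = [s for s in self.last_used if s != slug]
        self.last_used = {slug: now}
        return evicted

    def sweep(self, now):
        """Idle sweep: unload models idle for at least ttl seconds."""
        if self.ttl == 0:
            return []
        stale = [s for s, t in self.last_used.items() if now - t >= self.ttl]
        for slug in stale:
            del self.last_used[slug]
        return stale
```

Stepping through `use("a", 0)`, `use("b", 10)`, `sweep(610)` shows why alternating between two slugs on one container reloads a model on every call.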

Server-Side File Staging (`/v1/files`)

For repeated transcribes of the same file (different response_format, different model, iterating on params), stage the file once and reference it by path. Files land under ${TALKIES_DATA_DIR}/files/<path>.

Endpoint	Behavior
`GET /v1/files`	List every staged file. Returns `{"files": [{"path", "size", "modified"}]}`.
`PUT /v1/files/{path}`	Upload raw bytes (`--data-binary @local-file`). Capped at `TALKIES_MAX_UPLOAD_BYTES`. Atomic write (`.part` → rename).
`GET /v1/files/{path}`	Streams file back. Content-Type guessed by extension. 404 if missing.
`DELETE /v1/files/{path}`	Removes file and prunes empty parent dirs. 404 if missing.

# Stage once.
curl -X PUT --data-binary @lecture.mp3 \
  -H "Content-Type: audio/mpeg" \
  $TALKIES_URL/v1/files/lectures/2026-03-15/lecture.mp3

# Reuse across multiple transcribe calls.
curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file_path=lectures/2026-03-15/lecture.mp3" \
  -F "model=whisper-large-v3-turbo" \
  -F "response_format=verbose_json" | jq

# Cleanup.
curl -X DELETE $TALKIES_URL/v1/files/lectures/2026-03-15/lecture.mp3

Path safety: null bytes, backslashes, . / .. segments and double slashes are rejected (400). Symlinks pointing outside the root are refused. Leading / is stripped — /foo/bar.mp3 and foo/bar.mp3 resolve identically.
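A client can pre-validate staging paths with the same rules to avoid a round trip that ends in a 400. The sketch below covers only the string-level checks listed above; the symlink check needs filesystem access and is left to the server. `normalize_staged_path` is a hypothetical name, and rejecting a trailing slash (an empty final segment) is an assumption.

```python
def normalize_staged_path(path):
    """Return the path as the staging area would see it, or raise ValueError."""
    if "\x00" in path or "\\" in path:
        raise ValueError("null bytes and backslashes are not allowed")
    if path.startswith("/"):
        path = path[1:]  # a single leading slash is stripped, not rejected
    segments = path.split("/")
    # An empty segment means a double slash (or an empty path).
    if any(seg in ("", ".", "..") for seg in segments):
        raise ValueError("empty, '.' and '..' segments are not allowed")
    return path
```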

URL `file_path` (Download + Cache)

file_path also accepts http:// / https:// URLs. The first request downloads to ${TALKIES_DATA_DIR}/files/downloads/<sha256(url)[:16]>-<basename>; subsequent requests with the same URL hit the cache.

# First call: downloads, transcribes off the cached copy.
curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file_path=https://example.com/podcasts/ep-042.mp3" \
  -F "model=whisper-large-v3-turbo" | jq

# Second call: same URL → cache hit, no re-download.
curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file_path=https://example.com/podcasts/ep-042.mp3" \
  -F "model=canary-1b-flash" \
  -F "response_format=srt" > ep-042.srt

Downloads appear in GET /v1/files listings under downloads/. Invalidate a single cached URL with DELETE /v1/files/downloads/<key>.
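To invalidate a cached URL without listing files first, the key can be recomputed from the URL. This sketch follows the `sha256(url)[:16]-basename` naming stated above, with three assumptions that are not confirmed here: the digest is hex-encoded, the URL is hashed as UTF-8 exactly as sent, and the basename comes from the URL path without the query string. `cache_relpath` is a hypothetical name; check the result against GET /v1/files before relying on it.

```python
import hashlib
import posixpath
from urllib.parse import urlsplit


def cache_relpath(url):
    """Guess the staged path (relative to /v1/files) for a downloaded URL."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]  # assumed hex
    basename = posixpath.basename(urlsplit(url).path)  # assumed: no query string
    return f"downloads/{digest}-{basename}"
```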

Constraints applied during download:

Size capped by TALKIES_MAX_DOWNLOAD_BYTES (default 1 GiB).
5 redirect hops max; SSRF guard re-applied at every hop.
10 s connect, 300 s per-chunk read timeout.
The SSRF guard is off by default. Set TALKIES_BLOCK_PRIVATE_DOWNLOADS=true to reject URLs whose hostname resolves to private/loopback/link-local/multicast/reserved IPs.
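The address classes named in the last constraint map directly onto properties of Python's `ipaddress` module. The following is a sketch of that kind of check, not the server's actual guard; `is_blocked_address` and `host_is_blocked` are hypothetical names, and the injectable `resolve` parameter exists only so the logic can be exercised without DNS.

```python
import ipaddress
import socket


def is_blocked_address(ip_text):
    """True if the IP is private, loopback, link-local, multicast or reserved."""
    ip = ipaddress.ip_address(ip_text)
    return (ip.is_private or ip.is_loopback or ip.is_link_local
            or ip.is_multicast or ip.is_reserved)


def host_is_blocked(hostname, resolve=socket.getaddrinfo):
    """True if ANY address the hostname resolves to is in a blocked class."""
    infos = resolve(hostname, None)
    # Each getaddrinfo entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the textual IP address.
    return any(is_blocked_address(info[4][0]) for info in infos)
```

Checking every resolved address, and repeating the check on each redirect hop, matters because a public hostname can resolve or redirect to an internal address.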

MCP Endpoint (`/v1/mcp`)

talkies exposes a Model Context Protocol server over Streamable HTTP at /v1/mcp. Same FastAPI process, same BACKENDS / REGISTRY, same auth middleware — a model loaded by the MCP transcribe tool is the same instance the HTTP endpoint sees.

Tool	What it does
`list_models`	Discover ASR slugs. Returns `[{slug, executor, default_source_lang, default_target_lang, default_task, loaded}]`.
`transcribe`	Run ASR on a `file_path` (URL or staged path). Args: `model`, `language?`, `response_format?` (`json`/`verbose_json`/`text`/`srt`/`vtt`), `diarization?`. JSON formats return a JSON-encoded string; text/srt/vtt return raw.
`list_files`	Same payload as `GET /v1/files`.
`put_file`	Upload to staging. Body is base64 (`content_base64`). Decoded size capped at `TALKIES_MAX_UPLOAD_BYTES`. For big files, prefer `PUT /v1/files/{path}` over HTTP — JSON-RPC + base64 chews token budget.
`get_file`	Read a staged file as base64. Same size cap. Same advice — for big bytes, hit `GET /v1/files/{path}` over HTTP.
`delete_file`	Remove a staged file, prune empty parents.

The transport requires Accept: application/json, text/event-stream. Wire it into Claude Code:

claude mcp add --transport http talkies $TALKIES_URL/v1/mcp

With auth:

claude mcp add --transport http talkies $TALKIES_URL/v1/mcp \
  --header "Authorization: Bearer $TALKIES_AUTH_TOKEN"

Note: the canonical mount path is /v1/mcp/ (trailing slash). Bare /v1/mcp is rewritten internally to /v1/mcp/ so clients that don't follow Starlette's 307 redirect work too.

Raw JSON-RPC

For debugging or non-MCP-aware callers, hit it as JSON-RPC over HTTP POST:

# tools/list
curl -s $TALKIES_URL/v1/mcp/ \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d '{"jsonrpc": "2.0", "id": 1, "method": "tools/list"}'

# tools/call
curl -s $TALKIES_URL/v1/mcp/ \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d '{
    "jsonrpc": "2.0", "id": 2, "method": "tools/call",
    "params": {
      "name": "transcribe",
      "arguments": {
        "file_path": "https://example.com/clip.mp3",
        "model": "whisper-large-v3-turbo",
        "response_format": "json"
      }
    }
  }'
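The same JSON-RPC payloads can be built programmatically. In this sketch `tool_call` and `put_file_arguments` are hypothetical helpers; `content_base64` is the argument named in the tool table above, while the `path` argument name for put_file is an assumption, so confirm it against the tools/list output before use.

```python
import base64
import json

# Headers the transport expects, as shown in the curl examples above.
MCP_HEADERS = {
    "Content-Type": "application/json",
    "Accept": "application/json, text/event-stream",
}


def tool_call(request_id, name, arguments):
    """Serialize a JSON-RPC 2.0 tools/call request body."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })


def put_file_arguments(path, data):
    """Arguments for the put_file tool; the 'path' key name is assumed."""
    return {
        "path": path,
        "content_base64": base64.b64encode(data).decode("ascii"),
    }
```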

Bearer-Token Auth

If TALKIES_AUTH_TOKEN is set on the server, every route except /healthz and CORS preflight (OPTIONS) requires Authorization: Bearer <token>. A wrong or missing token returns 401 with WWW-Authenticate: Bearer. The token is compared with hmac.compare_digest (constant-time).

curl -H "Authorization: Bearer $TALKIES_AUTH_TOKEN" $TALKIES_URL/v1/models

Empty / unset token = wide open. For untrusted networks, combine the token with a reverse proxy doing TLS + rate limiting.
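The check described above can be sketched as follows. This illustrates the documented behavior (open when no token is configured, constant-time comparison otherwise) and is not the server's middleware; `is_authorized` is a hypothetical name, and requiring the exact `Bearer ` prefix is an assumption.

```python
import hmac


def is_authorized(authorization_header, expected_token):
    """Return True if the request may proceed."""
    if not expected_token:
        return True  # empty / unset token: the server is wide open
    prefix = "Bearer "
    if not authorization_header or not authorization_header.startswith(prefix):
        return False
    presented = authorization_header[len(prefix):]
    # compare_digest takes time independent of where the first mismatch is.
    return hmac.compare_digest(presented.encode(), expected_token.encode())
```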

Typical Workflows

Quick one-off transcribe

curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F "model=whisper-large-v3-turbo" | jq -r .text

Generate subtitles for a video

ffmpeg -i video.mp4 -vn -acodec libmp3lame audio.mp3
curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F "model=whisper-large-v3" \
  -F "response_format=srt" > video.srt
# burn in:  ffmpeg -i video.mp4 -vf subtitles=video.srt -c:a copy video-subbed.mp4

Iterate on the same file with different settings

# Stage once.
curl -X PUT --data-binary @lecture.mp3 \
  -H "Content-Type: audio/mpeg" \
  $TALKIES_URL/v1/files/work/lecture.mp3

# Try different models / formats without re-uploading.
for fmt in json verbose_json srt; do
  curl -s $TALKIES_URL/v1/audio/transcriptions \
    -F "file_path=work/lecture.mp3" \
    -F "model=whisper-large-v3-turbo" \
    -F "response_format=$fmt" > "lecture.$fmt"
done

# Cleanup.
curl -X DELETE $TALKIES_URL/v1/files/work/lecture.mp3

Diarized interview transcript

curl -s $TALKIES_URL/v1/audio/transcriptions \
  -F "file=@interview.wav" \
  -F "model=whisper-large-v3-turbo" \
  -F "diarization=true" \
  -F "response_format=text"
# stdout:
#   L: hi how's it going
#   R: not bad you
#   L: cool man

Free VRAM after a job

curl -s -X POST $TALKIES_URL/unload | jq

Bulk transcribe from URLs

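# Read one URL per line; avoids word-splitting and globbing on the URLs.
while IFS= read -r url; do
  [ -n "$url" ] || continue   # skip blank lines
  curl -s "$TALKIES_URL/v1/audio/transcriptions" \
    -F "file_path=$url" \
    -F "model=whisper-large-v3-turbo" \
    -F "response_format=text"
  echo "---"
done < urls.txt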

The first hit on each URL downloads + caches; re-running the loop is free.

For a fuller bulk-transcribe driver (mix of local paths + URLs, per-input output files, error reporting, optional diarization) see scripts/bulk_transcribe.sh:

TALKIES_URL=http://localhost:8000 \
TALKIES_MODEL=whisper-large-v3-turbo \
TALKIES_FORMAT=srt \
TALKIES_OUTDIR=./subs \
  bash scripts/bulk_transcribe.sh inputs.txt

Tips

Use whisper-large-v3-turbo as your default — it's the speed/quality sweet spot for general-purpose ASR. Switch to whisper-large-v3 only when you need the last few % of accuracy on hard audio.
URL file_path over multipart upload — if the audio is already at a URL, send the URL. Saves bandwidth (the file isn't going up and then back down), gets cached server-side, no upload size cap.
Stage repeated files via PUT /v1/files/{path} and call with file_path= to avoid re-uploading on every retry/iteration.
response_format=text for the "just give me the string" case — no jq -r .text needed, content-type is text/plain.
One model at a time — every transcribe request evicts other loaded models. Don't try to fan out two calls against two different models on the same container; the second one evicts the first and reloads. Use two containers if you actually need concurrency on different models.
POST /unload after a job — explicit eviction frees VRAM/RAM faster than waiting for the 10-min idle sweeper. Useful in CI / batch scripts.
canary-qwen-2.5b has no timestamps — verbose_json.segments / .words come back empty, srt/vtt collapse to one cue. Use a Whisper or Canary multitask slug if you need timing data.
Diarization requires true stereo — if your "stereo" file is the same mono signal copied to both channels, diarization won't separate speakers. The technique is exact for two-mic setups, useless otherwise.
Long files just work — VAD chunking happens transparently. Don't pre-split. Send the whole file.
prompt and temperature are ignored even though the form accepts them. Don't expect them to do anything.
Watch /api/ps to see what's resident. A request that hangs at "loading model" is doing the first cold load — subsequent calls are fast.
Customizing the model registry for translation slugs or to restrict the served set — see references/setup.md.

Usage Guidance

Install only if you intend to run or connect to a talkies server. Treat uploaded audio, staged files, transcripts, and cached remote downloads as potentially sensitive server-side data. Keep the service bound to localhost or a trusted network when possible; if exposed, set TALKIES_AUTH_TOKEN, use TLS/rate limiting, and enable TALKIES_BLOCK_PRIVATE_DOWNLOADS for untrusted clients. Clean up staged files and cached downloads when they are no longer needed.

Capability Tags

requires-oauth-token
requires-sensitive-credentials

Capability Assessment

✓ Purpose & Capability

The documented capabilities match the stated ASR purpose: transcribing uploaded or URL-based audio, producing subtitles/JSON/text, managing loaded models, and exposing an MCP interface for the same operations.

ℹ Instruction Scope

The skill clearly documents file upload, URL file_path downloads, server-side staging, file listing/retrieval/deletion, auth behavior, and deployment hardening; these are broad within the talkies data directory but purpose-aligned and disclosed.

ℹ Install Mechanism

Installation is Docker-based using the publisher's talkies images, with a helper shell script that sends user-specified files or URLs to TALKIES_URL; no hidden post-install hook, obfuscation, or unrelated command execution was found.

ℹ Credentials

Docker, curl, network access, model downloads, GPU/CPU resources, bearer-token configuration, and persisted model/data directories are proportionate for a self-hosted transcription server. Public deployments require explicit hardening.

ℹ Persistence & Privilege

The service persists model weights plus staged uploads and URL download caches under TALKIES_DATA_DIR and exposes cleanup endpoints for those files. This is expected for the feature set, but access should be controlled on shared or public servers.

How to Use

Make sure OpenClaw is installed (local or Docker)
Run the install command in chat: /install talkies
After installation, invoke the skill by name or use /talkies
Provide required inputs per the skill's parameter spec and get structured output

Version History

v1.0.0

Initial release of talkies: self-hosted, OpenAI-compatible speech-to-text API.
- Compatible with OpenAI’s `/v1/audio/transcriptions` wire format; supports seven open ASR models (Whisper, Parakeet, Canary).
- Features stereo diarization, server-side URL file fetching, multi-model backend, and optional bearer authentication.
- Supports various audio formats and can generate plain transcripts, subtitles (SRT/VTT), or verbose JSON outputs with timestamps.
- Drop-in replacement for OpenAI’s API; just change the endpoint URL and model slug.
- See documentation for setup, supported models, and usage examples.

Metadata

Slug talkies

Version 1.0.0

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 1

Frequently Asked Questions

What is talkies?

Self-hosted OpenAI-compatible /v1/audio/transcriptions fronting seven open ASR models (Whisper, Parakeet, Canary). Same wire format as OpenAI — change the ba... It is an AI Agent Skill for Claude Code / OpenClaw, with 45 downloads so far.

How do I install talkies?

Run "/install talkies" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is talkies free?

Yes, talkies is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does talkies support?

talkies is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created talkies?

It is built and maintained by Ciprian Mandache (@psyb0t); the current version is v1.0.0.
