peand-rover

Audio To Subtitle Generator

by peandrover adam · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ✓ Security Clean
Downloads: 140 · Stars: 0 · Active Installs: 0 · Versions: 1
Install in OpenClaw
/install audio-to-subtitle-generator
Description
Tell me what you need and I'll turn your spoken audio into clean, time-synced subtitles in minutes. This audio-to-subtitle-generator skill transcribes dialog...
README (SKILL.md)

Getting Started

Welcome! I'm here to help you generate accurate, time-synced subtitles from your video's audio track. Upload your video file and tell me your preferred subtitle format or any specific requirements — let's get your captions ready to go!

Try saying:

  • "Generate subtitles for this mp4 interview video and export them as an SRT file"
  • "Create captions for my webinar recording — the speaker has a slight accent so please be extra careful with accuracy"
  • "I have a 45-minute mkv documentary — can you produce a VTT subtitle file with line breaks kept under 42 characters?"

Quick Start Setup

This skill connects to NemoVideo's cloud backend. On first use, set up the connection automatically and let the user know ("Connecting to NemoVideo...").

Token check: Look for NEMO_TOKEN in the environment. If found, skip to session creation. Otherwise:

  • Read ~/.config/nemovideo/client_id or generate a UUID and save it there
  • POST https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token with X-Client-Id header
  • Extract data.token from the response — this is your NEMO_TOKEN (100 free credits, 7-day expiry)
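
The token check above can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: the function names are hypothetical, the request is only built (not sent), and the response is assumed to be JSON with the token at data.token as described above.

```python
import json
import uuid
from pathlib import Path
from urllib.request import Request

API_BASE = "https://mega-api-prod.nemovideo.ai"


def load_or_create_client_id(path: Path) -> str:
    """Return the saved client ID, or generate a UUID and save it."""
    if path.exists():
        saved = path.read_text(encoding="utf-8").strip()
        if saved:
            return saved
    client_id = str(uuid.uuid4())
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(client_id, encoding="utf-8")
    return client_id


def build_anonymous_token_request(client_id: str) -> Request:
    """Build (but do not send) the POST request for an anonymous token."""
    return Request(
        f"{API_BASE}/api/auth/anonymous-token",
        method="POST",
        headers={"X-Client-Id": client_id},
    )


def extract_token(response_body: str) -> str:
    """Pull data.token out of a JSON response body (assumed shape)."""
    return json.loads(response_body)["data"]["token"]
```

In real use the path would be `Path.home() / ".config/nemovideo/client_id"`, and the built request would be passed to `urllib.request.urlopen` before calling `extract_token` on the decoded body.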

Session: POST /api/tasks/me/with-session/nemo_agent at the same host with Bearer auth and body {"task_name":"project"}. Keep the returned session_id for all operations.

Let the user know with a brief "Ready!" when setup is complete. Don't expose tokens or raw API output.

Turn Every Word Spoken Into Readable, Synced Subtitles

Whether you're publishing a YouTube tutorial, captioning a corporate training video, or making a documentary accessible to deaf and hard-of-hearing audiences, getting subtitles right matters. This skill listens to the audio in your video file and converts every spoken word into a properly timed subtitle file — no manual typing, no tedious timestamp adjustments, and no expensive transcription services required.

The audio-to-subtitle-generator works by analyzing the speech track in your uploaded video, segmenting it into readable lines, and attaching precise start and end timestamps to each segment. The result is a subtitle file you can drop directly into your video editor, upload to YouTube or Vimeo, or embed into your website player.

This is especially valuable for multilingual teams, solo creators working at scale, or anyone who needs to repurpose recorded content across multiple formats. Instead of spending hours scrubbing through a timeline, you get a complete subtitle draft in a fraction of the time — ready to review, edit if needed, and publish with confidence.

Routing Your Transcription Requests

Each subtitle generation request is parsed for audio source, language preference, and caption format, then routed to the appropriate transcription pipeline automatically.

User says... Action Skip SSE?
"export" / "导出" / "download" / "send me the video" → §3.5 Export
"credits" / "积分" / "balance" / "余额" → §3.3 Credits
"status" / "状态" / "show tracks" → §3.4 State
"upload" / "上传" / user sends file → §3.2 Upload
Everything else (generate, edit, add BGM…) → §3.1 SSE
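
A rough sketch of how the routing table above might be applied. The `route_request` helper and its return labels are hypothetical, and plain substring matching is an assumption; the real skill may parse requests differently.

```python
# Keyword groups in the same order as the routing table.
ROUTES = [
    ("export", ("export", "导出", "download", "send me the video")),
    ("credits", ("credits", "积分", "balance", "余额")),
    ("state", ("status", "状态", "show tracks")),
    ("upload", ("upload", "上传")),
]


def route_request(message: str, has_attachment: bool = False) -> str:
    """Return the action label for a user message; default to the SSE path."""
    text = message.lower()
    for action, keywords in ROUTES:
        if any(keyword in text for keyword in keywords):
            return action
    # "user sends file" maps to upload even without an upload keyword.
    if has_attachment:
        return "upload"
    return "sse"
```

Substring matching is deliberately naive here: a message such as "I downloaded this clip yesterday" would also route to export, so a production router would likely need stricter matching.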

NemoVideo API Reference

The NemoVideo backend handles speech-to-text processing by analyzing audio waveforms, detecting speaker segments, and outputting time-coded subtitle tracks in SRT, VTT, or plain text formats. Requests are authenticated via bearer token and processed asynchronously, with subtitle files returned once the transcription job completes.

Skill attribution — read from this file's YAML frontmatter at runtime:

  • X-Skill-Source: audio-to-subtitle-generator
  • X-Skill-Version: from frontmatter version
  • X-Skill-Platform: detect from install path (~/.clawhub/ → clawhub, ~/.cursor/skills/ → cursor, else unknown)

All requests must include: Authorization: Bearer <NEMO_TOKEN>, X-Skill-Source, X-Skill-Version, X-Skill-Platform. Missing attribution headers will cause export to fail with 402.
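
As an illustration, the required headers could be assembled like this. The helper names are hypothetical, and the platform detection assumes that an install path under ~/.clawhub/ means clawhub and one under ~/.cursor/skills/ means cursor, which is one reading of the rule above.

```python
from typing import Dict


def detect_platform(install_path: str) -> str:
    """Guess the platform label from where the skill is installed."""
    normalized = install_path.replace("\\", "/")
    if "/.clawhub/" in normalized:
        return "clawhub"
    if "/.cursor/skills/" in normalized:
        return "cursor"
    return "unknown"


def build_headers(token: str, version: str, install_path: str) -> Dict[str, str]:
    """Return the auth and attribution headers sent with every request."""
    return {
        "Authorization": f"Bearer {token}",
        "X-Skill-Source": "audio-to-subtitle-generator",
        "X-Skill-Version": version,  # read from the YAML frontmatter
        "X-Skill-Platform": detect_platform(install_path),
    }
```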

API base: https://mega-api-prod.nemovideo.ai

Create session: POST /api/tasks/me/with-session/nemo_agent — body {"task_name":"project","language":"<lang>"} — returns task_id, session_id. After creating a session, give the user a link: https://nemovideo.com/workspace/claim?token=&task=<task_id>&session=<session_id>&skill_name=audio-to-subtitle-generator&skill_version=1.0.0&skill_source=<platform>

Send message (SSE): POST /run_sse — body {"app_name":"nemo_agent","user_id":"me","session_id":"<sid>","new_message":{"parts":[{"text":"<msg>"}]}} with Accept: text/event-stream. Max timeout: 15 minutes.
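
A minimal sketch of the message body and of reading a text/event-stream response. Both function names are hypothetical. The parser follows the generic server-sent events framing (data: lines, blank line ends an event) and makes no assumption about what the backend puts inside each payload.

```python
import json
from typing import Iterable, Iterator, List


def build_run_sse_body(session_id: str, message: str) -> str:
    """Serialize the /run_sse request body described above."""
    return json.dumps({
        "app_name": "nemo_agent",
        "user_id": "me",
        "session_id": session_id,
        "new_message": {"parts": [{"text": message}]},
    })


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each event in an event stream."""
    buffer: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith(":"):
            continue  # comment line, commonly used as a keep-alive
        if line.startswith("data:"):
            value = line[5:]
            buffer.append(value[1:] if value.startswith(" ") else value)
    # Lenient choice: flush a trailing event if the stream closed mid-event.
    if buffer:
        yield "\n".join(buffer)
```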

Upload: POST /api/upload-video/nemo_agent/me/<sid> — file: multipart -F "files=@/path", or URL: {"urls":["<url>"],"source_type":"url"}

Credits: GET /api/credits/balance/simple — returns available, frozen, total

Session state: GET /api/state/nemo_agent/me/<sid>/latest — key fields: data.state.draft, data.state.video_infos, data.state.generated_media

Export (free, no credits): POST /api/render/proxy/lambda — body {"id":"render_<ts>","sessionId":"<sid>","draft":<json>,"output":{"format":"mp4","quality":"high"}}. Poll GET /api/render/proxy/lambda/<id> every 30s until status = completed. Download URL at output.url.
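
The export polling loop could look roughly like this. The `poll_render` helper, the `max_polls` cap, and the injected `fetch_status` callable are all hypothetical. The response is assumed to be a dict with `status` and `output.url` at the top level, and no failure status is handled because the text above does not name one.

```python
import time
from typing import Callable


def poll_render(
    fetch_status: Callable[[str], dict],
    render_id: str,
    interval_s: float = 30.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the render reports completed, then return the download URL."""
    for attempt in range(max_polls):
        result = fetch_status(render_id)
        if result.get("status") == "completed":
            return result["output"]["url"]
        if attempt < max_polls - 1:
            sleep(interval_s)  # wait 30s between polls by default
    raise TimeoutError(f"render {render_id} did not complete after {max_polls} polls")
```

Passing `fetch_status` and `sleep` in as arguments keeps the loop testable without network access or real delays.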

Supported formats: mp4, mov, avi, webm, mkv, jpg, png, gif, webp, mp3, wav, m4a, aac.

SSE Event Handling

| Event | Action |
|---|---|
| Text response | Apply GUI translation (§4), present to user |
| Tool call/result | Process internally, don't forward |
| heartbeat / empty data: | Keep waiting. Every 2 min: "⏳ Still working..." |
| Stream closes | Process final response |

~30% of editing operations return no text in the SSE stream. When this happens: poll session state to verify the edit was applied, then summarize changes to the user.
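
One way to sketch the two behaviours above: the two-minute progress notice and the fallback to session state when the stream carried no text. The helper names and the returned dict shape are hypothetical, and `fetch_state` stands in for a call to the session state endpoint.

```python
from typing import Callable, List, Optional

HEARTBEAT_NOTICE = "⏳ Still working..."


def heartbeat_notice(now_s: float, last_notice_s: float,
                     interval_s: float = 120.0) -> Optional[str]:
    """Return the progress notice once interval_s has passed since the last one."""
    if now_s - last_notice_s >= interval_s:
        return HEARTBEAT_NOTICE
    return None


def finish_stream(text_parts: List[str], fetch_state: Callable[[], dict]) -> dict:
    """Decide what to present once the SSE stream has closed."""
    text = "".join(text_parts).strip()
    if text:
        return {"source": "stream", "text": text}
    # No text arrived: poll session state so the edit can be verified
    # and summarised for the user instead.
    return {"source": "state", "state": fetch_state()}
```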

Backend Response Translation

The backend assumes a GUI exists. Translate these into API actions:

| Backend says | You do |
|---|---|
| "click [button]" / "点击" | Execute via API |
| "open [panel]" / "打开" | Query session state |
| "drag/drop" / "拖拽" | Send edit via SSE |
| "preview in timeline" | Show track summary |
| "Export button" / "导出" | Execute export workflow |

Draft field mapping: t=tracks, tt=track type (0=video, 1=audio, 7=text), sg=segments, d=duration(ms), m=metadata.

Timeline (3 tracks):
  1. Video: city timelapse (0-10s)
  2. BGM: Lo-fi (0-10s, 35%)
  3. Title: "Urban Dreams" (0-3s)
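
A hedged sketch of turning a draft into a track summary using the field mapping above. The nesting is an assumption (a top-level `t` list of tracks, each with `tt` and an `sg` list whose items carry `d` in milliseconds); the real draft may be shaped differently, and `summarize_draft` is a hypothetical name.

```python
# Track type codes from the draft field mapping.
TRACK_TYPES = {0: "video", 1: "audio", 7: "text"}


def summarize_draft(draft: dict) -> str:
    """Build a short per-track summary from abbreviated draft fields."""
    tracks = draft.get("t", [])
    lines = [f"Timeline ({len(tracks)} tracks):"]
    for index, track in enumerate(tracks, start=1):
        kind = TRACK_TYPES.get(track.get("tt"), "unknown")
        segments = track.get("sg", [])
        # d is in milliseconds; report the track total in seconds.
        total_s = sum(seg.get("d", 0) for seg in segments) / 1000
        lines.append(f"{index}. {kind}: {len(segments)} segment(s), {total_s:g}s")
    return "\n".join(lines)
```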

Error Handling

| Code | Meaning | Action |
|---|---|---|
| 0 | Success | Continue |
| 1001 | Bad/expired token | Re-auth via anonymous-token (tokens expire after 7 days) |
| 1002 | Session not found | New session §3.0 |
| 2001 | No credits | Anonymous: show registration URL with ?bind=<id> (get <id> from create-session or state response when needed). Registered: "Top up at nemovideo.ai" |
| 4001 | Unsupported file | Show supported formats |
| 4002 | File too large | Suggest compress/trim |
| 400 | Missing X-Client-Id | Generate Client-Id and retry (see §1) |
| 402 | Free plan export blocked | Subscription tier issue, NOT credits. "Register at nemovideo.ai to unlock export." |
| 429 | Rate limit (1 token/client/7 days) | Retry in 30s once |
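
The error table can be expressed as a simple lookup. The action labels below are hypothetical names chosen for illustration; only the codes and their meanings come from the table.

```python
# Error code -> recovery action, mirroring the table above.
ERROR_ACTIONS = {
    0: "continue",
    1001: "reauthenticate",            # bad or expired token
    1002: "create_session",            # session not found
    2001: "show_credit_options",       # no credits
    4001: "show_supported_formats",    # unsupported file
    4002: "suggest_compress_or_trim",  # file too large
    400: "generate_client_id_and_retry",
    402: "prompt_registration",        # subscription tier issue, not credits
    429: "retry_once_after_30s",
}


def action_for(code: int) -> str:
    """Return the recovery action for a code, with a fallback for unknown codes."""
    return ERROR_ACTIONS.get(code, "report_unknown_error")
```

Keeping 402 and 2001 as separate actions matters because the table treats them as different problems: one is a plan restriction, the other a credit shortage.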

Best Practices

For the most accurate subtitle output, start with the cleanest audio possible. Videos with minimal background noise, consistent microphone placement, and clear speech will produce subtitles that need little to no manual correction after generation.

If your video features technical jargon, brand names, or industry-specific terminology, mention key terms upfront so they can be handled with greater care during transcription. This is particularly useful for medical, legal, or technology-focused content where a misheard word can change meaning significantly.

Keep subtitle line lengths readable — aim for no more than two lines on screen at a time and avoid breaking sentences mid-thought when possible. When reviewing your generated subtitles, pay special attention to speaker transitions and moments with overlapping dialogue, as these are the most common areas where timing may need a small manual nudge before publishing.

Quick Start Guide

Getting started with the audio-to-subtitle-generator is straightforward. Begin by uploading your video file in one of the supported formats: mp4, mov, avi, webm, or mkv. Once uploaded, specify your preferred output format — SRT is the most universally compatible, while VTT works best for web-based players and HTML5 video.
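
To make the SRT/VTT difference concrete, here is a small sketch that writes the same cues in both formats. The function names and the `(start_ms, end_ms, text)` tuple layout are illustrative choices, not part of the skill. SRT numbers each cue and uses a comma before the milliseconds, while WebVTT starts with a WEBVTT header and uses a period.

```python
from typing import List, Tuple

Cue = Tuple[int, int, str]  # (start_ms, end_ms, text)


def format_timestamp(ms: int, separator: str) -> str:
    """Format milliseconds as HH:MM:SS<separator>mmm."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}{separator}{millis:03d}"


def to_srt(cues: List[Cue]) -> str:
    """Render cues as SubRip text: numbered blocks, comma before milliseconds."""
    blocks = [
        f"{i}\n{format_timestamp(start, ',')} --> {format_timestamp(end, ',')}\n{text}"
        for i, (start, end, text) in enumerate(cues, start=1)
    ]
    return "\n\n".join(blocks) + "\n"


def to_vtt(cues: List[Cue]) -> str:
    """Render cues as WebVTT: header line, period before milliseconds."""
    blocks = ["WEBVTT"] + [
        f"{format_timestamp(start, '.')} --> {format_timestamp(end, '.')}\n{text}"
        for start, end, text in cues
    ]
    return "\n\n".join(blocks) + "\n"
```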

If your video contains multiple speakers, mention that upfront so subtitles can be segmented clearly between voices. You can also specify a maximum characters-per-line limit if your platform has display constraints — 42 characters per line is a common broadcast standard.
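
A characters-per-line limit can be sketched with Python's standard `textwrap` module. The `wrap_cue` helper is hypothetical: it wraps text to 42 characters and groups the result into chunks of at most two lines, leaving the question of how to split timing across chunks to the caller.

```python
import textwrap
from typing import List


def wrap_cue(text: str, max_chars: int = 42, max_lines: int = 2) -> List[List[str]]:
    """Wrap text to max_chars and group lines into on-screen chunks."""
    # break_on_hyphens=False keeps hyphenated words together on one line.
    lines = textwrap.wrap(text, width=max_chars, break_on_hyphens=False)
    return [lines[i:i + max_lines] for i in range(0, len(lines), max_lines)]
```

Each inner list is one on-screen subtitle. A long sentence therefore becomes several consecutive cues rather than one cue with three or more lines.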

Once processing is complete, you'll receive your subtitle file ready for download. You can import it directly into Adobe Premiere Pro, DaVinci Resolve, Final Cut Pro, or upload it alongside your video on YouTube, Vimeo, or any streaming platform that accepts external caption files.

Use Cases

The audio-to-subtitle-generator serves a wide range of real-world workflows. Content creators on YouTube and TikTok use it to add captions that boost watch time and reach viewers who watch without sound, a habit that some industry reports have put as high as 85% of views on certain social platforms.

Educators and e-learning developers rely on it to make course videos ADA and WCAG compliant, ensuring students with hearing impairments have full access to lecture content. Legal and medical professionals use it to transcribe recorded depositions, patient consultations, or training sessions where accuracy and timestamping are critical for documentation.

Journalists and podcast producers convert recorded interviews into subtitle files that double as searchable transcripts. Corporate communications teams use it to caption internal town halls, product demos, and onboarding videos — making content reusable across global teams regardless of language or hearing ability.

Usage Guidance
This skill appears to be what it claims: it uploads your video to NemoVideo's cloud API, creates or uses an API token (NEMO_TOKEN), and may save a client_id/token under ~/.config/nemovideo/. Before installing or using it, consider:
  1. Privacy: your audio/video (and resulting transcripts) will be sent to https://mega-api-prod.nemovideo.ai and processed off your machine; do not upload sensitive data unless you trust NemoVideo's policies.
  2. Token storage: the skill may generate and persist an anonymous token locally; if you prefer, set NEMO_TOKEN yourself as an environment variable.
  3. Source and trust: the skill's homepage and repo are provided, but the registry owner is unknown; verify NemoVideo's terms/privacy if this matters.
If any of these are unacceptable, do not install the skill, or use it only with non-sensitive files.
Capability Analysis
Type: OpenClaw Skill
Name: audio-to-subtitle-generator
Version: 1.0.0
The audio-to-subtitle-generator skill is a functional integration for the NemoVideo service, designed to automate speech-to-text transcription and subtitle generation. It manages its own authentication via the NEMO_TOKEN environment variable or an anonymous token flow (storing a client ID in ~/.config/nemovideo/), and communicates exclusively with the mega-api-prod.nemovideo.ai backend. The instructions in SKILL.md provide detailed logic for API orchestration, SSE stream handling, and error recovery, all of which are consistent with the stated purpose of processing media files for subtitles.
Capability Assessment
Purpose & Capability
Name/description (audio → subtitles) align with the runtime instructions: the SKILL.md describes creating sessions, uploading video, requesting transcriptions, and exporting SRT/VTT via nemo's API. Declared primary credential (NEMO_TOKEN) and config path (~/.config/nemovideo/) are consistent with a cloud-backed transcription service.
Instruction Scope
Instructions are focused on the transcription workflow (token check/creation, create session, upload file, poll state, export). They instruct reading/writing a small config file (~/.config/nemovideo/client_id) and sending user files to nemo's API. This is expected, but it means user media and derived transcripts are sent off-box and a token may be persisted locally — a privacy consideration that users should be aware of.
Install Mechanism
No install spec and no code files — instruction-only skill. Nothing is downloaded or extracted to disk beyond the skill's suggested local config file, which reduces installation risk.
Credentials
Primary credential is NEMO_TOKEN which is appropriate for an API-backed transcription service. Minor metadata mismatch: 'requires.env' is empty while 'primaryEnv' is set to NEMO_TOKEN; SKILL.md handles the case by generating/storing an anonymous token if none is present. No unrelated credentials are requested.
Persistence & Privilege
always:false (no forced global enable). The skill writes/reads its own config (~/.config/nemovideo/) and may persist an anonymous token — reasonable for a client that maintains sessions. It does not request system-wide privileges or modify other skills.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install audio-to-subtitle-generator
  3. After installation, invoke the skill by name or use /audio-to-subtitle-generator
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
Initial release of Audio to Subtitle Generator.
  • Instantly converts spoken audio from video files (mp4, mov, avi, webm, mkv) into accurate, time-synced subtitles (SRT, VTT).
  • Cloud-backed transcription with automatic setup and streamlined authentication (includes 100 free credits on first sign-in).
  • Simple file upload and subtitle export process with support for user requests and format preferences.
  • Real-time status updates and built-in error handling for common issues (authentication, file size, unsupported formats).
  • Designed for content creators, educators, accessibility, and fast video workflows.
Metadata
Slug audio-to-subtitle-generator
Version 1.0.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is Audio To Subtitle Generator?

Tell me what you need and I'll turn your spoken audio into clean, time-synced subtitles in minutes. This audio-to-subtitle-generator skill transcribes dialog... It is an AI Agent Skill for Claude Code / OpenClaw, with 140 downloads so far.

How do I install Audio To Subtitle Generator?

Run "/install audio-to-subtitle-generator" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Audio To Subtitle Generator free?

Yes, Audio To Subtitle Generator is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Audio To Subtitle Generator support?

Audio To Subtitle Generator is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Audio To Subtitle Generator?

It is built and maintained by peandrover adam (@peand-rover); the current version is v1.0.0.
