Speech-to-text, 3x faster than Whisper, remote FREE GPU

by speech2srt · GitHub ↗ · v1.3.1 · MIT-0

cross-platform ⚠ suspicious

Install in OpenClaw

/install speech-transcribe

Description

3x Faster than Whisper, Speech-to-text transcription with sentence-level timestamps on remote (FREE) L4 GPU. Trigger when user says: transcribe, speech to te...

Usage Guidance

What to consider before installing or running this skill:

- The skill uses Modal (your Modal CLI/token) and will create and use Modal volumes to upload audio and store models/results. Uploaded audio and generated transcripts live in those volumes under your Modal account; treat that as remote storage.
- The code contains a step that removes ~/.cache and symlinks it to the models volume when the model loads. If you accidentally run transcribe.py locally rather than via 'modal run', this could delete your local ~/.cache (which may contain other cached credentials or valuable caches). Do NOT run the Python file directly on your machine unless you inspect and modify that behavior first.
- Dependencies are installed from PyPI (faster-whisper, stable-ts). That is expected for model inference, but you must trust those packages. The image also apt-installs ffmpeg.
- The README mentions HF_TOKEN as optional for higher rate limits; set it only if you trust the skill to access Hugging Face on your behalf.
- There is at least one apparent bug/typo in transcribe.py (truncated 'jso' usage), so expect that the code may need fixing before reliable use.

Recommendations:

1) Inspect transcribe.py and remove or modify the code that deletes ~/.cache (or ensure it runs only in an isolated container).
2) Run the skill in an isolated Modal account or project where persistent volumes and billing are acceptable.
3) Back up your local ~/.cache before trying local experimentation.
4) If you need stronger assurance, run the container build in an isolated environment and review all third-party dependencies (PyPI packages) before executing on real data.
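Recommendation 3 can be done with a few lines of standard-library Python. The following is only a sketch: the helper name `backup_dir` and the timestamped folder naming are illustrative choices, not part of the skill.

```python
import shutil
import time
from pathlib import Path


def backup_dir(src: Path, dest_root: Path) -> Path:
    """Copy src into a timestamped folder under dest_root and return the copy's path.

    Hypothetical helper for illustration; it is not shipped with the skill.
    """
    if not src.is_dir():
        raise FileNotFoundError(f"nothing to back up at {src}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    # ".cache" becomes "cache-backup-<timestamp>" so the copy is not hidden
    dest = dest_root / f"{src.name.lstrip('.')}-backup-{stamp}"
    # symlinks=True copies links as links instead of following them
    shutil.copytree(src, dest, symlinks=True)
    return dest
```

For example, `backup_dir(Path.home() / ".cache", Path.home() / "cache-backups")` would copy the cache before any local experiment. The original directory is left untouched, so the copy can simply be deleted afterwards.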

Capability Analysis

Type: OpenClaw Skill
Name: speech-transcribe
Version: 1.3.1

The skill provides a legitimate speech-to-text transcription pipeline using the Modal platform and Whisper models. It includes well-structured logic in `transcribe.py` for parallelizing ffmpeg conversion and model loading, and the `SKILL.md` instructions correctly guide the AI agent through the workflow of uploading files, running the remote GPU task, and retrieving results. No indicators of data exfiltration, malicious execution, or prompt injection were found.

Capability Assessment

ℹ Purpose & Capability

The name/description (remote L4 GPU speech-to-text) aligns with the code and SKILL.md: it uses Modal, a CUDA image, faster-whisper/stable-whisper, and Modal volumes. Minor mismatches: SKILL.md advertises a “FREE L4 GPU” which is a marketing claim (Modal provisioning may be free or billable depending on account), and config.PYTHON_VERSION = '3.11' whereas the image requests add_python='3.12' (inconsequential but inconsistent). Overall the requested resources (Modal, GPU, volumes) are reasonable for the stated purpose.

⚠ Instruction Scope

SKILL.md instructs users to use the Modal CLI and modal run, and to upload files to Modal volumes — that is coherent. However, the runtime code (_load_model) forcibly replaces ~/.cache with a symlink to the models volume: if run outside the intended Modal container (e.g., if someone runs transcribe.py locally), this could delete the user's ~/.cache directory (shutil.rmtree) and then symlink it. The instructions do not explicitly warn about this destructive behavior or require running only inside the Modal container. There is also a partial/truncated bug in the provided transcribe.py (a bare 'jso' token in the truncated portion) indicating the code may not be robust as-is.
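One way to remove the risk described above is to make the cache replacement refuse to run unless the caller explicitly confirms it is inside the container. The sketch below is not the skill's actual `_load_model` code: the function names and the `SPEECH_TRANSCRIBE_IN_CONTAINER` variable are hypothetical, and the image definition would have to set that variable itself. Modal may expose its own runtime indicator, which is not assumed here.

```python
import os
import shutil
from pathlib import Path


def running_in_container() -> bool:
    # Hypothetical opt-in flag; the container image would need to set it.
    return os.environ.get("SPEECH_TRANSCRIBE_IN_CONTAINER") == "1"


def link_cache_to_volume(cache_dir: Path, volume_dir: Path, in_container: bool) -> bool:
    """Point cache_dir at volume_dir, but only when in_container is True.

    Returns True if the symlink is in place, False if the call was refused.
    """
    if not in_container:
        # Refuse to touch a local cache: deleting it cannot be undone.
        return False
    if cache_dir.is_symlink():
        if cache_dir.resolve() == volume_dir.resolve():
            return True  # already linked, nothing to do
        cache_dir.unlink()  # stale link pointing somewhere else
    elif cache_dir.exists():
        shutil.rmtree(cache_dir)  # destructive step, reached only in the container
    volume_dir.mkdir(parents=True, exist_ok=True)
    cache_dir.parent.mkdir(parents=True, exist_ok=True)
    cache_dir.symlink_to(volume_dir, target_is_directory=True)
    return True
```

A caller would pass `in_container=running_in_container()`, so a local run of the file leaves `~/.cache` alone and only reports that the link was skipped.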

ℹ Install Mechanism

There is no external install spec; the code builds a Modal image that apt_installs ffmpeg and pip_installs 'faster-whisper' and 'stable-ts' — typical for this use case and traceable to PyPI. No arbitrary URL downloads or shorteners are used. Building a custom container image is expected for GPU inference, but pip-installed dependencies mean you will execute third-party packages from PyPI inside the image (normal but requires trust in those packages).

✓ Credentials

The skill declares no required environment variables and does not request unrelated credentials. The error-handling doc notes HF_TOKEN as optional for higher Hugging Face rate limits; that is reasonable and optional. The skill will, however, operate against the user's Modal account (Modal token) and create volumes under that account — expected for remote GPU runs.

ℹ Persistence & Privilege

always is false (good). The skill creates Modal volumes (create_if_missing=True) and mounts them into the job image — this is expected for caching models and storing outputs but does grant persistent remote storage of uploaded audio and downloaded models in the user's Modal account. The code's symlink attempts to make ~/.cache point to the persistent models volume inside the container; the destructive cache replacement behavior is the main persistence/privilege risk if the code is run outside the container context.

How to Use

Make sure OpenClaw is installed (local or Docker)
Run the install command in chat: /install speech-transcribe
After installation, invoke the skill by name or use /speech-transcribe
Provide required inputs per the skill's parameter spec and get structured output

Version History

v1.3.1

- Updated to version 1.0.1 with documentation changes only.
- Consolidated and shortened the Model Options section for easier reference.
- No code or functionality changes; workflow and setup remain the same.

v1.0.0

- Initial release of speech-transcribe skill.
- Provides fast speech-to-text transcription with sentence-level timestamps using a GPU-powered Whisper pipeline.
- Supports multiple audio/video formats and outputs plain text (.txt) and subtitle (.srt) files.
- Allows choice of model size, with large-v3 as default.
- Offers streamlined file handling, real-time streaming, and easy Modal integration.
- Includes clear setup instructions and user-friendly result reporting.

Metadata

Slug speech-transcribe

Version 1.3.1

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 2

Frequently Asked Questions

What is Speech-to-text, 3x faster than Whisper, remote FREE GPU?

3x Faster than Whisper, Speech-to-text transcription with sentence-level timestamps on remote (FREE) L4 GPU. Trigger when user says: transcribe, speech to te... It is an AI Agent Skill for Claude Code / OpenClaw, with 98 downloads so far.

How do I install Speech-to-text, 3x faster than Whisper, remote FREE GPU?

Run "/install speech-transcribe" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Speech-to-text, 3x faster than Whisper, remote FREE GPU free?

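Yes, the skill itself is free and licensed under MIT-0, so you can download, install and use it at no cost. It runs on your own Modal account, however, and as noted in the capability assessment, Modal GPU provisioning may be free or billable depending on the account.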

Which platforms does Speech-to-text, 3x faster than Whisper, remote FREE GPU support?

Speech-to-text, 3x faster than Whisper, remote FREE GPU is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Speech-to-text, 3x faster than Whisper, remote FREE GPU?

It is built and maintained by speech2srt (@speech2srt); the current version is v1.3.1.
