Speech-to-text, 3x faster than Whisper, remote FREE GPU
by speech2srt · GitHub · v1.3.1 · MIT-0
98 Downloads · 1 Star · 0 Active Installs · 2 Versions
Install in OpenClaw
/install speech-transcribe
Description
3x Faster than Whisper, Speech-to-text transcription with sentence-level timestamps on remote (FREE) L4 GPU. Trigger when user says: transcribe, speech to te...
Usage Guidance
What to consider before installing/running this skill:
- The skill uses Modal (your Modal CLI/token) and will create and use Modal volumes to upload audio and store models/results. Uploaded audio and generated transcripts live in those volumes under your Modal account — treat that as remote storage.
- The code contains a step that removes ~/.cache and symlinks it to the models volume when the model loads. If you accidentally run transcribe.py locally rather than via 'modal run', this could delete your local ~/.cache (which may contain other cached credentials or valuable caches). Do NOT run the Python file directly on your machine unless you inspect and modify that behavior first.
- Dependencies are installed from PyPI (faster-whisper, stable-ts). That is expected for model inference, but you must trust those packages. The image also apt-installs ffmpeg.
- The README mentions HF_TOKEN as optional for higher rate limits; you should only set that if you trust the skill to access Hugging Face on your behalf.
- There is at least one apparent bug/typo in transcribe.py (truncated 'jso' usage) — expect the code may need fixing before reliable use.
Recommendations:
1) Inspect transcribe.py and remove or modify the code that deletes ~/.cache (or ensure it runs only in an isolated container).
2) Run the skill in an isolated Modal account or project where persistent volumes and billing are acceptable.
3) Back up your local ~/.cache before trying local experimentation.
4) If you need stronger assurance, run the container build in an isolated environment and review all third-party dependencies (PyPI packages) before executing on real data.
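For the backup recommendation, a minimal sketch using only the Python standard library might look like the following. The function name `backup_cache` and the archive naming scheme are illustrative assumptions, not part of the skill.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_cache(cache_dir: Path, dest_dir: Path) -> Path:
    """Archive cache_dir into dest_dir as a timestamped .tar.gz and return its path.

    Hypothetical helper: dest_dir should live outside cache_dir.
    """
    if not cache_dir.is_dir():
        raise FileNotFoundError(f"no cache directory at {cache_dir}")
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    # Use an absolute base name so the archive lands in dest_dir
    # regardless of the current working directory.
    base = (dest_dir / f"cache-backup-{stamp}").resolve()
    archive = shutil.make_archive(str(base), "gztar", root_dir=str(cache_dir))
    return Path(archive)
```

A call such as `backup_cache(Path.home() / ".cache", Path.home() / "cache-backups")` would write the archive next to, not inside, the directory being backed up.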
Capability Analysis
Type: OpenClaw Skill
Name: speech-transcribe
Version: 1.3.1
The skill provides a legitimate speech-to-text transcription pipeline using the Modal platform and Whisper models. It includes well-structured logic in `transcribe.py` for parallelizing ffmpeg conversion and model loading, and the `SKILL.md` instructions correctly guide the AI agent through the workflow of uploading files, running the remote GPU task, and retrieving results. No indicators of data exfiltration, malicious execution, or prompt injection were found.
Capability Assessment
Purpose & Capability
The name/description (remote L4 GPU speech-to-text) aligns with the code and SKILL.md: it uses Modal, a CUDA image, faster-whisper/stable-whisper, and Modal volumes. Minor mismatches: SKILL.md advertises a “FREE L4 GPU” which is a marketing claim (Modal provisioning may be free or billable depending on account), and config.PYTHON_VERSION = '3.11' whereas the image requests add_python='3.12' (inconsequential but inconsistent). Overall the requested resources (Modal, GPU, volumes) are reasonable for the stated purpose.
Instruction Scope
SKILL.md instructs users to use the Modal CLI and modal run, and to upload files to Modal volumes; that is coherent. However, the runtime code (_load_model) forcibly replaces ~/.cache with a symlink to the models volume: if run outside the intended Modal container (e.g., if someone runs transcribe.py locally), it would delete the user's ~/.cache directory (shutil.rmtree) and then symlink it. The instructions do not explicitly warn about this destructive behavior or require running only inside the Modal container. There is also an apparent bug in the provided transcribe.py (a bare 'jso' token in the truncated portion), indicating the code may not be robust as-is.
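One way to make the cache replacement safe is to refuse it unless the code is known to be running in the remote container. The sketch below is not the skill's `_load_model`; the function names and the `RUNNING_IN_REMOTE_CONTAINER` marker variable are hypothetical, and the marker would have to be set explicitly in the image definition.

```python
import os
import shutil
from pathlib import Path


def running_in_container(marker_env: str = "RUNNING_IN_REMOTE_CONTAINER") -> bool:
    # Hypothetical marker variable: set it yourself when building the image.
    return os.environ.get(marker_env) == "1"


def link_cache_to_volume(cache_dir: Path, volume_dir: Path, in_container: bool) -> bool:
    """Point cache_dir at volume_dir, but only inside the remote container.

    Returns True if the symlink is in place, False if the call was refused.
    """
    if not in_container:
        # Refuse on a developer machine: deleting a real ~/.cache is destructive.
        return False
    volume_dir.mkdir(parents=True, exist_ok=True)
    if cache_dir.is_symlink():
        if cache_dir.resolve() == volume_dir.resolve():
            return True  # already linked, nothing to do
        cache_dir.unlink()  # remove a stale link without touching its target
    elif cache_dir.exists():
        shutil.rmtree(cache_dir)  # only reached inside the container
    cache_dir.symlink_to(volume_dir, target_is_directory=True)
    return True
```

Passing `in_container=running_in_container()` keeps the destructive branch unreachable on a local machine unless the marker has been set deliberately.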
Install Mechanism
There is no external install spec; the code builds a Modal image that apt_installs ffmpeg and pip_installs 'faster-whisper' and 'stable-ts' — typical for this use case and traceable to PyPI. No arbitrary URL downloads or shorteners are used. Building a custom container image is expected for GPU inference, but pip-installed dependencies mean you will execute third-party packages from PyPI inside the image (normal but requires trust in those packages).
Credentials
The skill declares no required environment variables and does not request unrelated credentials. The error-handling doc notes HF_TOKEN as optional for higher Hugging Face rate limits; that is reasonable and optional. The skill will, however, operate against the user's Modal account (Modal token) and create volumes under that account — expected for remote GPU runs.
Persistence & Privilege
`always` is false (good). The skill creates Modal volumes (create_if_missing=True) and mounts them into the job image. This is expected for caching models and storing outputs, but it does mean uploaded audio and downloaded models persist in remote storage under the user's Modal account. Inside the container, the code symlinks ~/.cache to the persistent models volume; this destructive cache replacement is the main persistence/privilege risk if the code is run outside the container context.
How to Use
- Make sure OpenClaw is installed (local or Docker)
- Run the install command in chat: /install speech-transcribe
- After installation, invoke the skill by name or use /speech-transcribe
- Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.3.1
- Updated to version 1.0.1 with documentation changes only.
- Consolidated and shortened the Model Options section for easier reference.
- No code or functionality changes; workflow and setup remain the same.
v1.0.0
- Initial release of speech-transcribe skill.
- Provides fast speech-to-text transcription with sentence-level timestamps using a GPU-powered Whisper pipeline.
- Supports multiple audio/video formats and outputs plain text (.txt) and subtitle (.srt) files.
- Allows choice of model size, with large-v3 as default.
- Offers streamlined file handling, real-time streaming, and easy Modal integration.
- Includes clear setup instructions and user-friendly result reporting.
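To illustrate the subtitle (.srt) output with sentence-level timestamps listed above, here is a minimal sketch of how timed sentences can be rendered as SRT cues. It is not taken from the skill's code; the `(start, end, text)` segment shape and both function names are assumptions made for the example.

```python
from typing import Iterable, Tuple


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (the SRT timestamp form)."""
    total_ms = max(0, round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: Iterable[Tuple[float, float, str]]) -> str:
    """Render (start, end, text) triples as numbered SRT cues."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        cues.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(cues)
```

For example, `srt_timestamp(3661.5)` yields `01:01:01,500`, and each sentence becomes one numbered cue.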
Frequently Asked Questions
What is Speech-to-text, 3x faster than Whisper, remote FREE GPU?
3x Faster than Whisper, Speech-to-text transcription with sentence-level timestamps on remote (FREE) L4 GPU. Trigger when user says: transcribe, speech to te... It is an AI Agent Skill for Claude Code / OpenClaw, with 98 downloads so far.
How do I install Speech-to-text, 3x faster than Whisper, remote FREE GPU?
Run "/install speech-transcribe" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.
Is Speech-to-text, 3x faster than Whisper, remote FREE GPU free?
Yes, Speech-to-text, 3x faster than Whisper, remote FREE GPU is completely free, licensed under MIT-0. You can download, install and use it at no cost.
Which platforms does Speech-to-text, 3x faster than Whisper, remote FREE GPU support?
Speech-to-text, 3x faster than Whisper, remote FREE GPU is cross-platform and runs anywhere OpenClaw / Claude Code is available.
Who created Speech-to-text, 3x faster than Whisper, remote FREE GPU?
It is built and maintained by speech2srt (@speech2srt); the current version is v1.3.1.