

Speech-to-text, 3x faster than Whisper, remote FREE GPU

Author: speech2srt

By speech2srt · GitHub · v1.3.1 · MIT-0

cross-platform ⚠ suspicious

Install in OpenClaw

/install speech-transcribe

Description

3x Faster than Whisper, Speech-to-text transcription with sentence-level timestamps on remote (FREE) L4 GPU. Trigger when user says: transcribe, speech to te...

Safe usage recommendations

What to consider before installing/running this skill:
- The skill uses Modal (your Modal CLI/token) and will create and use Modal volumes to upload audio and store models/results. Uploaded audio and generated transcripts live in those volumes under your Modal account; treat that as remote storage.
- The code contains a step that removes ~/.cache and symlinks it to the models volume when the model loads. If you accidentally run transcribe.py locally rather than via 'modal run', this could delete your local ~/.cache (which may contain other cached credentials or valuable caches). Do NOT run the Python file directly on your machine unless you inspect and modify that behavior first.
- Dependencies are installed from PyPI (faster-whisper, stable-ts). That is expected for model inference, but you must trust those packages. The image also apt-installs ffmpeg.
- The README mentions HF_TOKEN as optional for higher rate limits; you should only set that if you trust the skill to access Hugging Face on your behalf.
- There is at least one apparent bug/typo in transcribe.py (truncated 'jso' usage), so expect that the code may need fixing before reliable use.

Recommendations:
1) Inspect transcribe.py and remove/modify the code that deletes ~/.cache (or ensure it runs only in an isolated container).
2) Run the skill in an isolated Modal account or project where persistent volumes and billing are acceptable.
3) Back up your local ~/.cache before trying local experimentation.
4) If you need stronger assurance, run the container build in an isolated environment and review all third-party dependencies (PyPI packages) before executing on real data.
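To make the "inspect transcribe.py" recommendation concrete, the sketch below scans a Python source string for calls that delete or relink filesystem paths. It is a rough aid, not a security audit: the function name `find_risky_calls` and the `RISKY_CALLS` set are hypothetical choices made for illustration, and the scan only matches call names, so it can miss aliased calls and flag unrelated methods that share a name.

```python
import ast

# Hypothetical watch list: call names that remove or relink filesystem paths.
RISKY_CALLS = {"rmtree", "symlink", "symlink_to", "remove", "unlink"}


def find_risky_calls(source: str) -> list[tuple[int, str]]:
    """Return (line number, call name) for every call whose name is in RISKY_CALLS."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # shutil.rmtree(...) is an Attribute node; a bare rmtree(...) is a Name node.
        if isinstance(func, ast.Attribute):
            name = func.attr
        else:
            name = getattr(func, "id", None)
        if name in RISKY_CALLS:
            hits.append((node.lineno, name))
    return sorted(hits)
```

Reading the file with `pathlib.Path("transcribe.py").read_text()` and passing the result to `find_risky_calls` lists the lines worth reviewing by hand before the file is ever executed.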

Functional analysis

Type: OpenClaw Skill
Name: speech-transcribe
Version: 1.3.1

The skill provides a legitimate speech-to-text transcription pipeline using the Modal platform and Whisper models. It includes well-structured logic in `transcribe.py` for parallelizing ffmpeg conversion and model loading, and the `SKILL.md` instructions correctly guide the AI agent through the workflow of uploading files, running the remote GPU task, and retrieving results. No indicators of data exfiltration, malicious execution, or prompt injection were found.

Capability assessment

ℹ Purpose & Capability

The name/description (remote L4 GPU speech-to-text) aligns with the code and SKILL.md: it uses Modal, a CUDA image, faster-whisper/stable-whisper, and Modal volumes. Minor mismatches: SKILL.md advertises a “FREE L4 GPU” which is a marketing claim (Modal provisioning may be free or billable depending on account), and config.PYTHON_VERSION = '3.11' whereas the image requests add_python='3.12' (inconsequential but inconsistent). Overall the requested resources (Modal, GPU, volumes) are reasonable for the stated purpose.

⚠ Instruction Scope

SKILL.md instructs users to use the Modal CLI and modal run, and to upload files to Modal volumes; that is coherent. However, the runtime code (_load_model) forcibly replaces ~/.cache with a symlink to the models volume: it removes the existing directory with shutil.rmtree and then creates the symlink. If the code runs outside the intended Modal container (e.g., if someone runs transcribe.py locally), this could delete the user's own ~/.cache directory. The instructions do not explicitly warn about this destructive behavior or require running only inside the Modal container. The provided (truncated) copy of transcribe.py also contains a bare 'jso' token, which suggests the code may not be robust as-is.

ℹ Install Mechanism

There is no external install spec; the code builds a Modal image that apt_installs ffmpeg and pip_installs 'faster-whisper' and 'stable-ts' — typical for this use case and traceable to PyPI. No arbitrary URL downloads or shorteners are used. Building a custom container image is expected for GPU inference, but pip-installed dependencies mean you will execute third-party packages from PyPI inside the image (normal but requires trust in those packages).

✓ Credentials

The skill declares no required environment variables and does not request unrelated credentials. The error-handling doc notes HF_TOKEN as optional for higher Hugging Face rate limits; that is reasonable and optional. The skill will, however, operate against the user's Modal account (Modal token) and create volumes under that account — expected for remote GPU runs.

ℹ Persistence & Privilege

always is false (good). The skill creates Modal volumes (create_if_missing=True) and mounts them into the job image — this is expected for caching models and storing outputs but does grant persistent remote storage of uploaded audio and downloaded models in the user's Modal account. The code's symlink attempts to make ~/.cache point to the persistent models volume inside the container; the destructive cache replacement behavior is the main persistence/privilege risk if the code is run outside the container context.

How to use

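Make sure OpenClaw is installed (locally or via Docker).
Enter the install command in the chat box: /install speech-transcribe
Once installed, invoke the Skill by name or trigger it with /speech-transcribe
Provide the required inputs according to the Skill's parameter description to get structured output.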

Version history

v1.3.1

- Updated to version 1.0.1 with documentation changes only.
- Consolidated and shortened the Model Options section for easier reference.
- No code or functionality changes; workflow and setup remain the same.

v1.0.0

- Initial release of speech-transcribe skill.
- Provides fast speech-to-text transcription with sentence-level timestamps using a GPU-powered Whisper pipeline.
- Supports multiple audio/video formats and outputs plain text (.txt) and subtitle (.srt) files.
- Allows choice of model size, with large-v3 as default.
- Offers streamlined file handling, real-time streaming, and easy Modal integration.
- Includes clear setup instructions and user-friendly result reporting.

Metadata

Slug speech-transcribe

Version 1.3.1

License MIT-0

Total installs 0

Current installs 0

Number of versions 2

FAQ

What is Speech-to-text, 3x faster than Whisper, remote FREE GPU?

3x Faster than Whisper, Speech-to-text transcription with sentence-level timestamps on remote (FREE) L4 GPU. Trigger when user says: transcribe, speech to te... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 98 total downloads so far.

How do I install Speech-to-text, 3x faster than Whisper, remote FREE GPU?

Run the command "/install speech-transcribe" in the OpenClaw or Claude Code chat box for a one-click install; no additional configuration is required.

Is Speech-to-text, 3x faster than Whisper, remote FREE GPU free?

Yes. Speech-to-text, 3x faster than Whisper, remote FREE GPU is completely free. It is released under the MIT-0 license and can be freely downloaded, installed, and used.

Which platforms does Speech-to-text, 3x faster than Whisper, remote FREE GPU support?

Speech-to-text, 3x faster than Whisper, remote FREE GPU is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed Speech-to-text, 3x faster than Whisper, remote FREE GPU?

It is developed and maintained by speech2srt (@speech2srt); the current version is v1.3.1.