Description

High-performance local speech-to-text transcription using Faster Whisper with NVIDIA GPU acceleration. Transcribe audio files locally without sending data to...

README (SKILL.md)

🎙️ Faster Whisper GPU

Name: Faster Whisper Gpu
Author: felipeoff

High-performance local speech-to-text transcription using Faster Whisper with NVIDIA GPU acceleration.

✨ Features

🚀 GPU Accelerated: Uses NVIDIA CUDA for blazing-fast transcription
🔒 100% Local: No data leaves your machine. Complete privacy.
💰 Free Forever: No API costs. Run unlimited transcriptions.
🌍 Multilingual: Supports 99 languages with automatic detection
📁 Multiple Formats: Input: MP3, WAV, FLAC, OGG, M4A. Output: TXT, SRT, JSON
🎯 Multiple Models: From tiny (fast) to large-v3 (most accurate)
🎬 Subtitle Generation: Create SRT files with word-level timestamps

📋 Requirements

Hardware

NVIDIA GPU with CUDA support (recommended: 4GB+ VRAM)
Or CPU-only mode (slower but works on any machine)

Software

Python 3.8+
NVIDIA drivers (for GPU support)
CUDA Toolkit 11.8+ or 12.x

🚀 Quick Start

Installation

# Install dependencies
pip install faster-whisper torch

# Verify GPU is available
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"

Basic Usage

# Transcribe an audio file (auto-detects GPU)
python transcribe.py audio.mp3

# Specify language explicitly
python transcribe.py audio.mp3 --language pt

# Output as SRT subtitles
python transcribe.py audio.mp3 --format srt --output subtitles.srt

# Use larger model for better accuracy
python transcribe.py audio.mp3 --model large-v3

🔧 Advanced Usage

Command Line Options

python transcribe.py \x3Caudio_file> [options]

Options:
  --model {tiny,base,small,medium,large-v1,large-v2,large-v3}
                        Model size to use (default: base)
  --language LANG       Language code (e.g., 'pt', 'en', 'es'). Auto-detect if not specified.
  --format {txt,srt,json,vtt}
                        Output format (default: txt)
  --output FILE         Output file path (default: stdout)
  --device {cuda,cpu}   Device to use (default: cuda if available)
  --compute_type {int8,int8_float16,int16,float16,float32}
                        Computation precision (default: float16)
  --task {transcribe,translate}
                        Task: transcribe or translate to English (default: transcribe)
  --vad_filter          Enable voice activity detection filter
  --vad_parameters MIN_DURATION_ON,MIN_DURATION_OFF
                        VAD parameters as comma-separated values
  --condition_on_previous_text
                        Condition on previous text (default: True)
  --initial_prompt PROMPT
                        Initial prompt to guide transcription
  --word_timestamps     Include word-level timestamps (for SRT/JSON)
  --hotwords WORDS      Comma-separated hotwords to boost recognition

Examples

Portuguese Transcription with SRT Output

python transcribe.py meeting.mp3 --language pt --format srt --output meeting.srt

English Translation from Any Language

python transcribe.py japanese_audio.mp3 --task translate --format txt

High-Accuracy Mode with Large Model

python transcribe.py podcast.mp3 --model large-v3 --vad_filter --word_timestamps

CPU-Only Mode (no GPU)

python transcribe.py audio.mp3 --device cpu --compute_type int8

🐍 Python API

from faster_whisper import WhisperModel

# Load model
model = WhisperModel("base", device="cuda", compute_type="float16")

# Transcribe
segments, info = model.transcribe("audio.mp3", language="pt")

print(f"Detected language: {info.language} (probability: {info.language_probability:.2f})")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

📊 Model Sizes & VRAM Requirements

Model	Parameters	VRAM Required	Relative Speed	Accuracy
tiny	39 M	~1 GB	~32x	Basic
base	74 M	~1 GB	~16x	Good
small	244 M	~2 GB	~6x	Better
medium	769 M	~5 GB	~2x	Great
large-v3	1550 M	~10 GB	1x	Best

Benchmarks measured on NVIDIA RTX 4090

🔍 Supported Languages

Faster Whisper supports 99 languages including:

Portuguese (pt)
English (en)
Spanish (es)
French (fr)
German (de)
Italian (it)
Japanese (ja)
Chinese (zh)
Russian (ru)
And 90+ more...

🛠️ Troubleshooting

CUDA Out of Memory

# Use smaller model
python transcribe.py audio.mp3 --model tiny

# Or use CPU
python transcribe.py audio.mp3 --device cpu

# Or reduce precision
python transcribe.py audio.mp3 --compute_type int8

Model Download Issues

Models are automatically downloaded on first use to ~/.cache/huggingface/hub/. If behind a proxy, set:

export HF_HOME=/path/to/custom/cache

Slow Transcription

Ensure GPU is being used: check nvidia-smi during transcription
Use smaller model for faster results
Enable VAD filter to skip silent parts

🤝 Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch
Submit a pull request

📜 License

MIT License - See LICENSE for details.

Faster Whisper is developed by SYSTRAN and based on OpenAI's Whisper.

🙏 Acknowledgments

OpenAI Whisper - Original model
Faster Whisper - Optimized implementation
CTranslate2 - Fast inference engine

Made with ❤️ for the OpenClaw community

Usage Guidance

This skill appears to be what it says: a local Faster-Whisper transcription tool. Before installing, consider the following: (1) pip installing torch can be large and may require specific GPU/CUDA builds — prefer installing inside a virtualenv or conda environment. (2) Model weights are downloaded from the Hugging Face hub on first use (network access and several GBs of disk may be required for large models); if you require full offline operation, pre-download and place models in the HF cache or set HF_HOME. (3) Review the GitHub homepage/repo and the transcribe.py file if you need additional assurance about code behavior and licensing. (4) The skill does not request credentials or attempt to send your audio elsewhere, but network activity for model downloads is expected. If those behaviors are acceptable, the skill is coherent and reasonable to use.

Capability Analysis

Type: OpenClaw Skill Name: faster-whisper-gpu Version: 0.1.0 The skill bundle is classified as suspicious due to a potential path traversal vulnerability in `transcribe.py`. The script uses `Path(args.output).write_text()` to save transcription results, where `args.output` is directly taken from user input without sanitization. This allows an attacker to specify an arbitrary file path (e.g., `../../../../tmp/malicious.txt`) to write content outside the intended directory, which is a risky capability without clear malicious intent from the developer. All other files and instructions appear benign and aligned with the stated purpose of local speech-to-text transcription.

Capability Assessment

✓ Purpose & Capability

Name/description, SKILL.md, requirements.txt and transcribe.py all align: the skill implements local Faster-Whisper transcription with optional GPU support. Required binary is only python3; listed Python packages match the stated capability.

ℹ Instruction Scope

SKILL.md and transcribe.py limit actions to local transcription, writing outputs to stdout or local files. The only external activity is the expected download of model weights from the Hugging Face hub (~ ~/.cache/huggingface/hub) on first use; this is documented in SKILL.md. No instructions read unrelated system credentials or post audio to third-party endpoints.

ℹ Install Mechanism

There is no registry install spec; the skill is instruction-only and instructs users to pip install faster-whisper and torch. Using pip is expected but may pull large binary wheels (torch) and GPU-specific builds. Model weights are downloaded at runtime from Hugging Face — an expected but noteworthy network activity and storage use.

✓ Credentials

The skill requests no environment variables or credentials. SKILL.md mentions HF_HOME as an optional way to change the Hugging Face cache directory (documented as a troubleshooting tip) — reasonable and proportional.

✓ Persistence & Privilege

The skill is not always: true and does not request persistent system privileges. It does not modify other skills or agent-wide configs. It writes outputs and caches model files to the user's home cache directory (standard behavior).

Version History

v0.1.0

Initial release of faster-whisper-gpu. - Local speech-to-text transcription powered by Faster Whisper with NVIDIA GPU acceleration. - Transcribe audio into text, subtitles (SRT/VTT), or JSON with support for 99 languages. - Features multiple model sizes, word-level timestamps, hotword boosting, and voice activity detection. - All data processing is 100% local for maximum privacy. - Includes a command-line interface and Python API for flexible usage.

Metadata

Slug faster-whisper-gpu

Version 0.1.0

License —

All-time Installs 0

Active Installs 0

Total Versions 1

Frequently Asked Questions

What is Faster Whisper Gpu?

High-performance local speech-to-text transcription using Faster Whisper with NVIDIA GPU acceleration. Transcribe audio files locally without sending data to... It is an AI Agent Skill for Claude Code / OpenClaw, with 565 downloads so far.

How do I install Faster Whisper Gpu?

Run "/install faster-whisper-gpu" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Faster Whisper Gpu free?

Yes, Faster Whisper Gpu is completely free (open-source). You can download, install and use it at no cost.

Which platforms does Faster Whisper Gpu support?

Faster Whisper Gpu is cross-platform and runs anywhere OpenClaw / Claude Code is available (cross-platform).

Who created Faster Whisper Gpu?

It is built and maintained by Felipe Oliveira (@felipeoff); the current version is v0.1.0.

More Skills

Faster Whisper Gpu