功能描述

Fully offline, CUDA-accelerated local voice assistant pipeline for NVIDIA Jetson. Wake word (openWakeWord) → real-time VAD → whisper.cpp GPU STT → LLM → Pipe...

使用说明 (SKILL.md)

Jetson CUDA Voice Pipeline

Name: Jetson CUDA Voice Pipeline
Author: nikil511

Fully offline, GPU-accelerated local voice assistant for NVIDIA Jetson devices. No cloud for STT or TTS — only the LLM call uses the internet (OpenRouter or any OpenAI-compatible endpoint).

Architecture

ReSpeaker mic (hw:Array,0, S24_3LE, 16kHz)
    ↓ arecord raw stream — never restarted mid-conversation
openWakeWord — "Hey Jarvis" detection (~32ms chunks)
    ↓ wake word triggered → two-tone beep
_measure_ambient() — 480ms median RMS → dynamic VAD thresholds
    ↓
transcribe_stream() — VAD + whisper.cpp CUDA HTTP (~2-4s per utterance)
    ↓
ask_llm() — OpenRouter or local OpenAI-compatible API (~1-2s)
    ↓
Piper TTS — offline neural TTS, hot-loaded at startup → aplay
    ↓
ReSpeaker LEDs: 🔵 blue=listening  🩵 cyan=thinking  ⚫ off=done  🔴 red=error

Total latency: ~5-8 seconds from wake word to first spoken word.

Key Features

Zero mic-restart gap — same arecord pipe feeds wake word detection and STT
Dynamic ambient calibration — measures room noise floor on every wake word trigger (adapts to fans, AC, time of day)
Conversation history — 20-turn rolling context for natural follow-ups
Auto language detection — whisper -l auto, works multilingual
ReSpeaker LED ring — visual state feedback (silent no-op if device not present)
Fully configurable — all paths and thresholds via environment variables

Hardware Requirements

Component	Tested	Notes
Jetson Xavier NX	✅	ARM64, sm_72, 8GB, JetPack 5.1.4
ReSpeaker USB Mic Array v1.0	✅	2886:0007, S24_3LE, 16kHz
Any ALSA speaker	✅	tested with Creative MUVO 2c
Other Jetson models	✅	change `CMAKE_CUDA_ARCHITECTURES`

Quick Start

# 1. Install Python deps
pip install openwakeword piper-tts numpy requests pyusb

# 2. Build whisper.cpp with CUDA (see BUILD.md — ~45 min, one-time)
#    Then place binary at ~/.local/bin/whisper-server-gpu

# 3. Download Piper voice model
mkdir -p ~/.local/share/piper/voices && cd ~/.local/share/piper/voices
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx.json

# 4. Install and start services
export OPENROUTER_API_KEY=your-key-here
bash pipeline/setup.sh
bash pipeline/manage.sh start

# Say "Hey Jarvis" — blue LED = listening

Setup Details

Build whisper.cpp with CUDA

See BUILD.md for full instructions. Critical flag:

cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=72 -DCMAKE_BUILD_TYPE=Release
make -j4   # ~45 min — detach with nohup if needed

⚠️ CMAKE_CUDA_ARCHITECTURES=72 (sm_72 = Xavier NX) is critical. Default multi-arch compilation OOMs on 8GB Jetson.

Architecture map:

Xavier NX / AGX Xavier → 72
Orin → 87
TX2 → 62
Nano → 53

Piper Voice Models

mkdir -p ~/.local/share/piper/voices && cd "$_"

# English (required)
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx.json

# Greek (optional — any language from huggingface.co/rhasspy/piper-voices works)
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/el/el_GR/rapunzelina/medium/el_GR-rapunzelina-medium.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/el/el_GR/rapunzelina/medium/el_GR-rapunzelina-medium.onnx.json

Service Install

setup.sh writes and enables the systemd user services automatically:

bash pipeline/setup.sh [/path/to/voice_pipeline.py] [API_KEY]

Or with env var:

OPENROUTER_API_KEY=sk-... bash pipeline/setup.sh

Re-run to update an existing install.

ReSpeaker Mic Gain & USB Autosuspend

# Optimal gain (no clipping, RMS ~180 ambient)
amixer -c 0 set Mic 90

# Prevent USB autosuspend (mic sleeps after 2s idle without this)
sudo tee /etc/udev/rules.d/99-usb-audio-nosuspend.rules \x3C\x3C 'EOF'
ACTION=="add", SUBSYSTEM=="usb", ATTR{idVendor}=="2886", ATTR{idProduct}=="0007", \
  ATTR{power/control}="on", ATTR{power/autosuspend}="-1"
EOF
sudo udevadm control --reload-rules

Management

bash pipeline/manage.sh start     # start both services
bash pipeline/manage.sh stop      # stop both services
bash pipeline/manage.sh restart   # restart both
bash pipeline/manage.sh status    # systemd status
bash pipeline/manage.sh logs      # tail live log
bash pipeline/manage.sh test-mic  # record 4s + play back
bash pipeline/manage.sh test-stt  # record 4s + transcribe
bash pipeline/manage.sh test-tts  # speak a test phrase

Environment Variables

Variable	Default	Description
`OPENROUTER_API_KEY`	(required)	API key for OpenRouter (or any OpenAI-compatible provider)
`VOICE_MIC`	`hw:Array,0`	ALSA mic device name
`VOICE_SPEAKER`	`hw:C2c,0`	ALSA speaker device name
`VOICE_LLM_URL`	OpenRouter	LLM API endpoint
`VOICE_LLM_MODEL`	`anthropic/claude-3.5-haiku`	Model name
`VOICE_WAKE_THRESHOLD`	`0.5`	Wake word confidence (0.0–1.0)
`VOICE_SPEECH_RMS`	`400`	Fallback speech RMS threshold
`VOICE_SILENCE_RMS`	`250`	Fallback silence RMS threshold
`VOICE_UTC_OFFSET`	`0`	Timezone offset hours for LLM context
`PIPER_VOICES_DIR`	`~/.local/share/piper/voices`	Piper voice models directory
`WHISPER_URL`	`http://127.0.0.1:8181/inference`	whisper-server endpoint
`WHISPER_BIN`	`~/.local/bin/whisper-server-gpu`	whisper-server binary (used by setup.sh)
`WHISPER_MODEL`	`~/.local/share/whisper/models/ggml-base.bin`	Whisper model (used by setup.sh)

Troubleshooting

Mic records silence

Check gain: amixer -c 0 set Mic 90
Use card name not number (hw:Array,0 not hw:0,0) — numbers shift on reboot
ReSpeaker requires S24_3LE format, not S16_LE
Disable USB autosuspend (see setup above)

Records full 6s timeout, never cuts off

Room ambient noise > VOICE_SILENCE_RMS fallback. Dynamic calibration handles this automatically.
If still an issue, set VOICE_SILENCE_RMS slightly above your measured ambient floor.

[BEEPING] or (bell dings) in transcript

Speaker beep being picked up by mic. The 0.3s drain buffer after beep handles this.
Check speaker/mic distance and speaker volume.

Whisper OOM during build

Must use -DCMAKE_CUDA_ARCHITECTURES=72 — default multi-arch build exhausts 8GB RAM.
Use -j4 not -j6.

LED not lighting up

Install pyusb: pip install pyusb
Only supported on ReSpeaker USB Mic Array v1.0 (2886:0007)
All LED errors are silent — pipeline continues without it.

Wake word triggers constantly (false positives)

Lower VOICE_WAKE_THRESHOLD to 0.7 or higher.
Ensure no TV/radio playing phrases close to "Hey Jarvis".

File Structure

jetson-cuda-voice/
├── SKILL.md                  ← this file
├── BUILD.md                  ← whisper.cpp CUDA build guide
└── pipeline/
    ├── voice_pipeline.py     ← main pipeline
    ├── led.py                ← ReSpeaker LED control (optional)
    ├── setup.sh              ← one-command service installer
    └── manage.sh             ← start/stop/status/test

安全使用建议

This skill appears to do what it says (local STT/TTS with a networked LLM). Before installing, consider: (1) Your speech is transcribed locally but the resulting text is sent to whatever LLM endpoint you configure (default openrouter.ai). Only install if you trust that provider or change VOICE_LLM_URL to a local/self-hosted endpoint. (2) setup.sh writes Environment="OPENROUTER_API_KEY=..." into a user systemd unit file (~/.config/systemd/user) — that stores your API key in plain text; consider using a systemd EnvironmentFile with restricted permissions or another secret mechanism instead of embedding the key. (3) The optional udev fix requires sudo (writes /etc/udev/rules.d). (4) Building whisper.cpp on a Jetson is time- and resource-intensive; follow BUILD.md and ensure you have adequate swap/free memory. (5) Inspect the scripts yourself (they're included) before running them. If you want stronger privacy, run a local/air-gapped LLM-compatible server and set VOICE_LLM_URL accordingly or avoid providing an API key.

功能分析

Type: OpenClaw Skill Name: jetson-cuda-voice Version: 1.1.0 The skill is classified as suspicious due to several vulnerabilities and risky operations, though without clear evidence of intentional malice. Key concerns include: 1) Potential for shell injection in `pipeline/manage.sh` and `pipeline/setup.sh` if environment variables or script arguments are manipulated by an attacker (e.g., via prompt injection to the OpenClaw agent). 2) The `OPENROUTER_API_KEY` is stored in plain text within the systemd service file (`~/.config/systemd/user/voice-pipeline.service`), posing an information disclosure risk. 3) The `SKILL.md` and `setup.sh` (as a tip) instruct the user to execute `sudo` commands to modify system-wide udev rules, which is a privileged operation, even if for a stated hardware fix. While the skill performs remote downloads and network calls, these are from legitimate sources (Hugging Face, OpenRouter) and for the stated purpose of the voice assistant.

能力评估

✓ Purpose & Capability

The name/description (Jetson CUDA voice pipeline) match the code and SKILL.md. Required binaries (arecord, aplay, python3) and dependencies (openwakeword, piper-tts, whisper.cpp) are appropriate for the stated functionality. Required env var OPENROUTER_API_KEY is used by the code to call an LLM and is consistent with the stated 'only the LLM uses the internet' claim.

ℹ Instruction Scope

Runtime instructions and scripts stick to the stated pipeline. The code captures microphone audio, runs local STT/TTS, and sends transcriptions to the LLM_URL (defaults to openrouter.ai). This is within scope, but it does mean user speech (transcriptions) are transmitted off-device to the configured LLM provider — the SKILL.md does disclose this, but users should be aware of the data flow and privacy implications.

✓ Install Mechanism

No opaque download/install spec in skill registry; build and download steps are explicit in SKILL.md/BUILD.md (git clone github.com/ggerganov/whisper.cpp, wget from huggingface, pip installs). These are standard sources for this workload; no shorteners or personal servers are used. Building whisper.cpp on-device is heavy but expected.

ℹ Credentials

Only one required env var (OPENROUTER_API_KEY) is requested and it is justified by the LLM call. However, setup.sh embeds the API key directly into the user systemd unit file (Environment=...), which persists the secret in plain text in ~/.config/systemd/user — a practical security concern to consider (see guidance).

ℹ Persistence & Privilege

setup.sh installs and enables user-level systemd services (whisper-server and voice-pipeline) so the pipeline persists for the user session; always:false so it is not force-included. The optional udev rule in instructions requires root to write /etc/udev/rules.d (expected for USB device handling). The service persistence combined with storing the API key in the unit increases the impact of a compromised account or machine.

版本历史

v1.1.0

v1.1.0: Add setup.sh one-command installer (embeds systemd services inline). Fix manage.sh hardcoded devices — now uses VOICE_MIC/VOICE_SPEAKER env vars. Remove unused json import. Fix fragile test-tts heredoc. Remove cmake from runtime requires. Clean up SKILL.md: quick start section, fixed file structure, removed missing systemd/ dir reference.

v1.0.0

Initial release — offline wake word + whisper.cpp GPU STT + Piper TTS + ReSpeaker LED feedback + dynamic ambient noise calibration. Tested on Jetson Xavier NX sm_72 JetPack 5.1.4.

元数据

Slug jetson-cuda-voice

版本 1.1.0

许可证 —

累计安装 0

当前安装数 0

历史版本数 2

常见问题