功能描述

Deploy and manage Clack, a voice relay server for OpenClaw. Bridges voice input (WebSocket) through STT → OpenClaw agent → TTS, enabling real-time voice conv...

使用说明 (SKILL.md)

Clack

Name: Clack
Author: fbn3799

WebSocket relay server that enables real-time voice conversations with an OpenClaw agent.

Flow: Client audio (PCM 16kHz/16-bit/mono) → STT → OpenClaw Gateway → TTS → PCM audio back to client.

Per-session provider selection: The client can independently choose STT and TTS providers per call — any combination of on-device (Apple speech frameworks) and server-side providers (ElevenLabs, OpenAI, Deepgram). The server auto-detects all available providers based on configured API keys and exposes them via /info.

Prerequisites

Python 3.10+
API key for at least one provider (ElevenLabs, OpenAI, or Deepgram) — not needed for local speech mode
OpenClaw Gateway with chatCompletions endpoint enabled
Root/sudo access (for systemd)
Secure connection: Domain with SSL (recommended) or Tailscale

Setup

Run the setup script. It creates a venv, installs deps, prompts for API keys, configures a systemd service, and optionally sets up SSL.

sudo bash scripts/setup.sh

The script auto-detects your OpenClaw gateway config and interactively prompts for provider API keys (ElevenLabs, OpenAI, Deepgram — all optional). On re-runs, existing keys can be kept, updated, or deleted.

Options

bash scripts/setup.sh [--port 9878] [--domain clack.example.com]

Flag	Default	Description
`--port`	`9878`	Relay server port
`--domain`	(none)	Domain for SSL setup (enables WSS)

Connection modes

All connections are encrypted. The app supports two modes:

Domain with SSL (recommended):

bash scripts/setup.sh --domain clack.yourdomain.com
# → wss://clack.yourdomain.com/voice

Requires a DNS A record pointing the domain to your server IP. The setup script auto-configures SSL via Caddy. You can use a free domain from DuckDNS or your own.

Tailscale:

# Install Tailscale on your server, then connect from the app using your Tailscale IP
# → ws://100.x.x.x:9878/voice (encrypted at network level)

No domain or SSL setup needed. Tailscale encrypts all traffic at the network layer. Install Tailscale on both your server and phone, then use the server's Tailscale IP in the app.

Security note: Port 9878 should be firewalled from the public internet. Only allow access via localhost (for Caddy reverse proxy) and Tailscale. The app does not support unencrypted public connections.

Enable OpenClaw Gateway endpoint

The gateway must have chatCompletions enabled. Apply this config patch:

{"http": {"endpoints": {"chatCompletions": {"enabled": true}}}}

Management

clack status     # Check service status
clack restart    # Restart the server
clack logs       # Tail logs
clack pair       # Generate a new pairing code
clack update     # Pull latest code and restart
clack setup      # Re-run interactive setup (add SSL later, update keys, etc.)
clack uninstall  # Remove service and venv

Client App

📱 iOS — Available on the App Store (or build from source at github.com/fbn3799/clack-app) 🤖 Android — Coming soon!

Security

Authentication

All endpoints except GET /health and POST /pair require a valid auth token (RELAY_AUTH_TOKEN). Tokens are verified using constant-time HMAC comparison to prevent timing attacks.

Pairing System

6-character alphanumeric one-time codes (~2.1 billion combinations)
Codes expire after 5 minutes (TTL) and are single-use
Rate limited: 5 attempts per IP per 5 minutes — returns HTTP 429 after
2-second delay on failed attempts to slow brute force
Generating a code requires the admin auth token (GET /pair)
Redeeming a code is public but rate-limited (POST /pair)

Encrypted Connections

Domain mode: WSS (WebSocket Secure) via Caddy with automatic SSL certificates
Tailscale mode: WireGuard encryption at the network layer
The app enforces encrypted connections — no unencrypted public access
Port 9878 should be firewalled; only accessible via localhost and Tailscale

Input Sanitization

All user-facing text inputs are sanitized before processing:

Voice transcripts: Capped at 300 characters (CLACK_MAX_INPUT_CHARS), echo detection filters feedback loops, hallucination detection discards nonsense STT output
User context: Stripped to natural-language characters only (letters, numbers, common punctuation, whitespace). Control characters, escape sequences, and non-printable characters are removed. Capped at 1000 characters. Context is wrapped in explicit delimiters before injection into the system prompt.
No shell execution: All external communication uses structured HTTP/WebSocket APIs. No user input is ever passed to a shell.

Data Privacy

No analytics, tracking, or telemetry
Voice audio goes to your server and only to the providers you choose
The iOS app stores only settings locally (server address, token, preferences)
Third-party API usage depends on your provider config (ElevenLabs, OpenAI, Deepgram)

Session Routing

Each voice call creates a clack:\x3Cuuid> session in OpenClaw. These are small, isolated sessions — one per call — so voice conversations don't pollute your main agent context.

Session Picker

The session picker in the iOS app provides context injection only. When you select a session key, it is added as text context to the LLM prompt — it does not change routing. All voice calls still create their own clack:\x3Cuuid> session.

User Context

Users can provide persistent context that gets injected into the system prompt for every voice call. This lets the AI know about the user's preferences, notes, or any background information.

How to set context

App text field: In the Clack app under Settings → Context, enter free-form text
Session picker: Select an OpenClaw session to inject its content as context
WebSocket message: Send {"type": "set_context", "text": "..."} during a voice session
HTTP API: PUT /context?token=...&text=... or POST /context with JSON body {"text": "..."}

Context is sanitized before saving — only natural-language characters are kept (letters, numbers, common punctuation). IP addresses and domains are stripped. The server returns the sanitized text in the response so the app can show the user exactly what will be sent as context.

Context persists across calls and server restarts. Clear it via DELETE /context or by sending an empty set_context message.

Conversation History

The relay maintains a shared history file across calls for continuity. History is stored as JSON in CLACK_HISTORY_DIR (default: /var/lib/clack/history).

Max messages: 50 (configurable via CLACK_MAX_HISTORY)
History persists across calls and server restarts
Viewable via GET /history, clearable via DELETE /history

Echo Test Mode

For testing audio round-trips without using LLM credits:

Server-wide: Set CLACK_ECHO_MODE=true environment variable
Per-session: Send {"type":"start","config":{"echo":true}} from the client

In echo mode, transcribed text is echoed back through TTS instead of being sent to the LLM. Audio is peak-normalized with capped gain to ensure consistent playback volume.

Provider Selection

STT and TTS providers can be configured independently per session. The server auto-detects all available providers at startup based on which API keys are set (ELEVENLABS_API_KEY, OPENAI_API_KEY, DEEPGRAM_API_KEY).

Available modes per direction (STT / TTS):

On-device (local): Uses Apple's built-in speech frameworks. Zero API costs.
Server provider: ElevenLabs, OpenAI, or Deepgram — whichever keys are configured.

How it works:

App fetches GET /info to discover available providers
User picks STT and TTS providers independently in Settings → Voice
On call start, the app sends sttProvider and ttsProvider in the session config
Server creates the appropriate provider instances per session

Example combinations:

STT	TTS	Use case
ElevenLabs	ElevenLabs	Full cloud — best quality
On-device	ElevenLabs	Save STT costs, keep premium voices
On-device	On-device	Fully local — zero API usage, works offline
OpenAI	Deepgram	Mix providers freely

Cost optimization: Use on-device STT (free, unlimited) with a premium cloud TTS voice — get great output quality while eliminating transcription costs entirely. Or go fully on-device for zero API spend.

Text input mode

When STT is set to on-device, the client sends transcribed text instead of audio:

{"type": "text_input", "text": "What's the weather like?"}

When TTS is set to on-device, the server returns response_text only and skips audio synthesis.

AI Response Rules

Responses are enforced to 1–3 sentences for natural voice conversation
Server-side max_tokens: 150 to prevent runaway responses
Server-side max input: 300 characters (CLACK_MAX_INPUT_CHARS) — transcripts exceeding this are truncated

HTTP Endpoints

Endpoint	Method	Auth	Description
`GET /health`	GET	No	Health check — returns service status
`POST /pair`	POST	No	Redeem pairing code → get auth token (rate-limited)
`GET /pair`	GET	Yes	Generate one-time pairing code
`GET /info`	GET	Yes	Server info: agent name, available STT/TTS providers
`GET /voices`	GET	Yes	List available TTS voices
`GET /sessions`	GET	Yes	List active sessions
`GET /history`	GET	Yes	Get conversation history
`DELETE /history`	DELETE	Yes	Clear conversation history
`GET /context`	GET	Yes	Get current user context
`PUT /context`	PUT	Yes	Set user context (query param `text`)
`POST /context`	POST	Yes	Set user context (JSON body `{"text": "..."}`)
`DELETE /context`	DELETE	Yes	Clear user context
`WebSocket /voice`	WS	Yes	Voice relay connection

WebSocket Protocol

Endpoint: ws://\x3Chost>:\x3Cport>/voice?token=\x3CRELAY_AUTH_TOKEN>

Client → Server

Message	Format	Description
`{"type":"start","config":{...}}`	JSON	Start session. Config: `voice`, `systemPrompt`, `echo`, `sttProvider`, `ttsProvider`
Binary frames	bytes	Raw PCM audio (16kHz, 16-bit, mono)
`{"type":"text_input","text":"..."}`	JSON	Local speech mode — send text directly
`{"type":"end_speech"}`	JSON	Signal end of speech, triggers processing
`{"type":"interrupt"}`	JSON	Cancel current TTS playback
`{"type":"ping"}`	JSON	Keepalive
`{"type":"set_context","text":"..."}`	JSON	Set user context (sanitized before saving)
`{"type":"auth","token":"..."}`	JSON	Authenticate (alternative to query param)

Server → Client

Message	Format	Description
`{"type":"ready"}`	JSON	Session ready
`{"type":"auth_ok"}` / `{"type":"auth_failed"}`	JSON	Auth result
`{"type":"processing","stage":"..."}`	JSON	Stage: `transcribing`, `thinking`, `speaking`, `filtered`
`{"type":"transcript","text":"...","final":true}`	JSON	STT result
`{"type":"response_text","text":"..."}`	JSON	LLM text response
`{"type":"response_start","format":"pcm_16000"}`	JSON	Audio stream starting
Binary frames	bytes	TTS audio (PCM 16kHz, 16-bit, mono)
`{"type":"response_end"}`	JSON	Audio stream done
`{"type":"tts_cancelled"}`	JSON	TTS playback was interrupted
`{"type":"context_updated","text":"..."}`	JSON	Context saved — `text` contains the sanitized version
`{"type":"context_cleared"}`	JSON	Context was cleared

Features

Multi-provider STT/TTS: ElevenLabs, OpenAI, and Deepgram support
Independent voice input/output configuration: Choose STT and TTS providers separately — full control over how your voice is transcribed and how the AI speaks back
On-device speech: Apple speech frameworks for STT and/or TTS — zero API costs, mix with cloud providers freely
Cost optimization: Use free on-device transcription with premium cloud voices, or go fully local for zero spend
Voice response rules: AI responses enforced short (1-3 sentences, max_tokens 150)
Input length limiting: Configurable max transcript length (default 300 chars)
Confidence filtering: Low-confidence STT results are discarded
Echo detection: Prevents feedback loops (TTS → mic → STT)
Echo test mode: Test audio pipeline without LLM (server-wide or per-session)
Audio normalization: Peak normalization with capped gain for echo mode playback
Audio chunking: Long recordings auto-split for reliable transcription
Hallucination detection: Filters repetitive/nonsense STT output
Interrupt/TTS cancellation: Cancel in-progress TTS for all providers
Pairing system: Rate-limited one-time codes for secure device pairing
Session isolation: Each call gets its own clack:\x3Cuuid> session
Conversation history: Shared across calls, 50 messages max, persistent
Token auth: Constant-time HMAC verification
Keepalive pings: Prevents client timeout during long LLM responses
Silence detection: Default threshold 220, configurable range 20–1000
Auto-restart: systemd restarts on crash

Voice Configuration

20 built-in ElevenLabs voices available. Default: Will. Pass voice name or ID in session config:

{"type": "start", "config": {"voice": "aria"}}

Available aliases: will, aria, roger, sarah, laura, charlie, george, callum, river, liam, charlotte, alice, matilda, jessica, eric, chris, brian, daniel, lily, bill.

Environment Variables

Variable	Default	Description
`RELAY_AUTH_TOKEN`	—	Required. Client auth token (32-char)
`OPENCLAW_GATEWAY_URL`	`http://127.0.0.1:18789`	OpenClaw Gateway URL
`OPENCLAW_GATEWAY_TOKEN`	—	Gateway bearer token
`STT_PROVIDER`	`elevenlabs`	STT provider (`elevenlabs`, `openai`, `deepgram`)
`TTS_PROVIDER`	`elevenlabs`	TTS provider (`elevenlabs`, `openai`, `deepgram`)
`TTS_VOICE`	`Will`	Default voice (name or ID)
`VOICE_RELAY_PORT`	`9878`	Server port
`CLACK_ECHO_MODE`	`false`	Enable echo test mode server-wide
`CLACK_MAX_INPUT_CHARS`	`300`	Max transcript length (chars)
`CLACK_HISTORY_DIR`	`/var/lib/clack/history`	History file storage directory
`CLACK_MAX_HISTORY`	`50`	Max conversation history messages
`CLACK_AGENT_NAME`	`Storm`	Agent name shown in the iOS app

Provider API keys (ELEVENLABS_API_KEY, OPENAI_API_KEY, DEEPGRAM_API_KEY) are stored in config.json with restricted file permissions, not as environment variables. The setup script manages these interactively.

安全使用建议

This skill appears to be what it claims: a self-hosted voice relay that integrates STT/TTS providers and the OpenClaw gateway. Before installing: (1) verify the upstream repository/author (SKILL metadata lists a GitHub URL in docs but the registry Source is 'unknown'); (2) audit scripts/setup.sh and the generated systemd service to see exactly what files and env vars will be written (it will store provider keys in config.json and put a service file under /etc/systemd); (3) be aware it requires sudo to install packages (apt, Tailscale, Caddy) and will modify network-facing config (reverse proxy, firewall suggestions) — consider installing in an isolated VM or container if you don't want system-wide changes; (4) ensure config.json and service files contain only keys you expect and that config.json has restrictive permissions (chmod 600); (5) if you rely on a private OpenClaw gateway, confirm enabling chatCompletions is acceptable. If you want more assurance, provide the full setup.sh/service file contents and I can point to the exact lines where keys are written, commands that contact external hosts, and the systemd unit that will be created.

功能分析

Type: OpenClaw Skill Name: clack Version: 1.5.3 The skill bundle is classified as benign. The `SKILL.md` and `server.py` implement strong defenses against prompt injection, including explicit AI safety rules and robust input sanitization (`sanitize_context` function in `server.py`) that strips potentially harmful characters, IP addresses, and domains from user context. While the `scripts/setup.sh` script performs extensive system modifications (e.g., installing packages, configuring systemd, Caddy/Nginx, Certbot) and makes an external network call (`curl ifconfig.me`), these actions are clearly documented, interactive, and necessary for the stated purpose of setting up a secure, self-hosted voice relay server. There is no evidence of intentional harmful behavior, data exfiltration (beyond public IP for setup validation), or obfuscation.

能力评估

✓ Purpose & Capability

Name/description match the files and instructions: the skill implements a WebSocket voice relay, needs an OpenClaw gateway token and a local relay auth token, and requires python3/systemctl to create a systemd service. Optional provider API keys (ElevenLabs/OpenAI/Deepgram) are present in code and documented as optional. Nothing requested appears unrelated to a voice relay.

ℹ Instruction Scope

SKILL.md and scripts instruct the agent to read the user's OpenClaw config (~/.openclaw/openclaw.json), prompt for and persist provider keys into config.json, modify the gateway config (enable chatCompletions) if requested, install system packages (Tailscale/Caddy), and create a systemd service. These are in-scope for a server setup but are privileged operations that modify local system and gateway configuration — review them before running.

ℹ Install Mechanism

This is an instruction-and-script skill (no centralized install spec). setup.sh runs apt installs and curls to add official APT sources (Tailscale, and likely Caddy via cloudsmith). The script uses known package hosts (pkgs.tailscale.com, cloudsmith) rather than opaque shorteners, which is typical but still a non-trivial change to the system. Expect packages and service files to be created under /etc and /usr/local.

✓ Credentials

Declared required env variables (OPENCLAW_GATEWAY_TOKEN, RELAY_AUTH_TOKEN) match the functionality: gateway bearer token for LLM routing and a relay auth token for client authentication. Provider API keys are referenced in code but are optional and appear only to enable STT/TTS providers. No unrelated credentials are requested.

ℹ Persistence & Privilege

The skill requires and uses root/sudo during setup to install packages, create a systemd service (/etc/systemd/system/clack.service), add CLI symlink, and alter reverse proxy configs (Caddy). always:false (not force-included). These privileges are expected for a long-running server but give the skill system-wide presence — review service files and created configs before trusting.

版本历史

v1.5.3

Clack 1.5.3 (includes unreleased 1.5.1 and 1.5.2): - Setup script now interactively prompts for ElevenLabs, OpenAI, and Deepgram API keys instead of requiring env vars up front. Existing keys can be kept, changed, or deleted on re-run. - New management CLI: `clack status`, `clack restart`, `clack logs`, `clack pair`, `clack update`, `clack setup`, and `clack uninstall` for easier service control. - Improved input sanitization: user-provided context is strictly filtered for valid natural-language text, with hard character and length caps. Shell input is never possible. - Expanded documentation and setup guidance, including full description of new input sanitization and updated example commands. - No functional code changes; documentation and setup tooling only.

v1.5.1

- Removed the install script (scripts/install.sh) from the skill. - No functional or feature changes.

v1.5.0

**Version 1.5.0 summary:** Updated environment variable requirements and metadata for clarity and consistency. - Removed commented-out environment variables for optional STT/TTS providers from metadata. - Only required env vars (`OPENCLAW_GATEWAY_TOKEN`, `RELAY_AUTH_TOKEN`) are listed in metadata. - No changes to functionality or configuration steps. - Documentation and metadata are now more accurate and streamlined.

v1.4.2

- Added documentation for new environment variables: RELAY_AUTH_TOKEN (auto-generated if not set), ELEVENLABS_API_KEY, OPENAI_API_KEY, and DEEPGRAM_API_KEY. - Enhanced clarity in the metadata on optional environment variables for STT/TTS provider setup. - No code or functionality changes in this version.

v1.4.1

Metadata fixes for accuracy

v1.4.0

Version 1.4.0 Security fixes

v1.0.0

Clack enables you to do voice communication with your openclaw via the Clack app for iOS and Android (both apps pending).

元数据

Slug clack

版本 1.5.3

许可证 —

累计安装 1

当前安装数 1

历史版本数 7

常见问题

Clack 是什么？

Deploy and manage Clack, a voice relay server for OpenClaw. Bridges voice input (WebSocket) through STT → OpenClaw agent → TTS, enabling real-time voice conv... 它是一个面向 Claude Code / OpenClaw 的 AI Agent Skill 插件，目前累计下载 398 次。

如何安装 Clack？

在 OpenClaw 或 Claude Code 对话框中运行命令「/install clack」即可一键安装，无需额外配置。

Clack 是免费的吗？

是的，Clack 完全免费（开源免费），可自由下载、安装和使用。

Clack 支持哪些平台？

Clack 跨平台运行，可在任意部署了 OpenClaw / Claude Code 的环境中使用（linux）。

谁开发了 Clack？

由 fbn3799（@fbn3799）开发并维护，当前版本 v1.5.3。

Clack

Clack

Prerequisites

Setup

Options

Connection modes

Enable OpenClaw Gateway endpoint

Management

Client App

Security

Authentication

Pairing System

Encrypted Connections

Input Sanitization

Data Privacy

Session Routing

Session Picker

User Context

How to set context

Conversation History

Echo Test Mode

Provider Selection

Available modes per direction (STT / TTS):

How it works:

Example combinations:

Text input mode

AI Response Rules

HTTP Endpoints

WebSocket Protocol

Client → Server

Server → Client

Features

Voice Configuration

Environment Variables

Clack 是什么？

如何安装 Clack？

Clack 是免费的吗？

Clack 支持哪些平台？

谁开发了 Clack？

💬 留言讨论