
VoxCPM2 — Tokenizer-Free Multilingual TTS

by OpenLark · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ⚠ suspicious
Downloads: 18 · Stars: 0 · Active Installs: 0 · Versions: 1
Install in OpenClaw
/install voxcpm
Description
VoxCPM2 — Tokenizer-Free TTS model guide. Covers installation, Python/CLI API (TTS/Voice Design/Controllable Cloning/Ultimate Cloning/Streaming), vLLM-Omni d...
README (SKILL.md)

VoxCPM2 — Tokenizer-Free Multilingual TTS

VoxCPM2 is a tokenizer-free TTS model from OpenBMB based on a diffusion autoregressive architecture. It has 2B parameters, was trained on 2M+ hours of audio, supports 30 languages, outputs 48kHz audio, and is built on MiniCPM-4.

Architecture: LocEnc → TSLM → RALM → LocDiT, AudioVAE V2 asymmetric 16kHz→48kHz.

Installation

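pip install voxcpm  # requires Python ≥3.10, PyTorch ≥2.5, CUDA ≥12

# In Python: device="auto" tries cuda, then mps, then cpu
from voxcpm import VoxCPM
model = VoxCPM.from_pretrained("openbmb/VoxCPM2", device="auto")
# If torch.compile causes issues, set optimize=False.
# To download through a Hugging Face mirror: export HF_ENDPOINT=https://hf-mirror.com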
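
The cuda, then mps, then cpu order noted for device="auto" can be pictured with a short sketch. This is not VoxCPM's internal code: `pick_device` and `detect_device` are hypothetical helpers, and the availability flags are passed in explicitly so the fallback logic can be read and tested without a GPU.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the first available device in the order cuda, mps, cpu."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Probe PyTorch for backends; report cpu if torch is not installed."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # torch.backends.mps may be absent on older builds, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    return pick_device(torch.cuda.is_available(), bool(mps and mps.is_available()))
```

Passing the result explicitly, for example `device=detect_device()`, would make the chosen backend visible in logs, assuming `from_pretrained` accepts the same device strings that "auto" resolves to.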

Models

| Model | V2 (2B) | 1.5 (0.8B) | 0.5B |
| --- | --- | --- | --- |
| Sample Rate | 48kHz | 44.1kHz | 16kHz |
| Languages | 30 | 2 (zh/en) | 2 (zh/en) |
| Voice Design | | | |
| VRAM/RTF | ~8GB/~0.30 | ~6GB/~0.15 | ~5GB/~0.17 |
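
To read the RTF figures listed above: real-time factor is conventionally the time spent synthesizing divided by the duration of the audio produced, so values below 1 mean faster than real time. The sketch below assumes the listing uses that usual definition; the hardware behind the numbers is not stated, and `synthesis_seconds` is a hypothetical helper.

```python
# Approximate RTF values copied from the model comparison above.
RTF = {"V2 (2B)": 0.30, "1.5 (0.8B)": 0.15, "0.5B": 0.17}


def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimated compute time needed to generate `audio_seconds` of speech."""
    return audio_seconds * rtf


for name, rtf in RTF.items():
    estimate = synthesis_seconds(60.0, rtf)
    print(f"{name}: ~{estimate:.1f} s to generate 60 s of audio")
```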

30 languages: Chinese, English, Japanese, Korean, French, German, Spanish, Italian, Russian, Arabic, Hindi, Thai, Vietnamese, Turkish, Dutch, Finnish, Norwegian, Swedish, Danish, Polish, Portuguese, Greek, Hebrew, Indonesian, Malay, Burmese, Khmer, Lao, Swahili, Tagalog + 9 Chinese dialects (Sichuan, Cantonese, Wu, Northeastern, Henan, Shaanxi, Shandong, Tianjin, Minnan)

Python API

from voxcpm import VoxCPM; import soundfile as sf
model = VoxCPM.from_pretrained("openbmb/VoxCPM2", load_denoiser=False)

# Basic TTS
sf.write("out.wav", model.generate("Hello!", cfg_value=2.0, inference_timesteps=10), model.tts_model.sample_rate)

# Voice Design (text description → voice, no reference audio needed)
wav = model.generate("(A young woman, gentle voice)Hello!")

# Controllable Cloning (reference audio + style control)
wav = model.generate("Hello.", reference_wav_path="voice.wav")
wav = model.generate("(faster, cheerful)Hi.", reference_wav_path="voice.wav")

# Ultimate Cloning (reference audio + transcript for full detail reproduction)
wav = model.generate("Text.", prompt_wav_path="ref.wav", prompt_text="transcript", reference_wav_path="ref.wav")

# Streaming
import numpy as np
wav = np.concatenate([c for c in model.generate_streaming("Streaming!")])
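
The Voice Design and Controllable Cloning calls above place the style description in parentheses directly before the text to be spoken. A small helper can keep that convention in one place. `with_control` is a hypothetical name, not part of the voxcpm package, and the sketch assumes the prefix format is exactly as shown in the examples.

```python
def with_control(text: str, description: str = "") -> str:
    """Prefix `text` with a parenthesized style description, if one is given."""
    description = description.strip()
    if not description:
        return text
    return f"({description}){text}"


prompt = with_control("Hello!", "A young woman, gentle voice")
print(prompt)
# The result would then be passed as the text argument, e.g.
# model.generate(with_control("Hi.", "faster, cheerful"), reference_wav_path="voice.wav")
```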

generate() params: text (required), reference_wav_path, prompt_wav_path, prompt_text, cfg_value=2.0 (1-3), inference_timesteps=10 (4-30), normalize=False, denoise=False, retry_badcase=True
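
One way to guard the two numeric parameters before calling generate() is sketched below. It assumes the parenthesized numbers (1-3 and 4-30) are the intended ranges; whether the library itself enforces them is not stated here. `generate_kwargs` is a hypothetical helper.

```python
def generate_kwargs(cfg_value: float = 2.0, inference_timesteps: int = 10, **extra) -> dict:
    """Validate the numeric settings and return keyword arguments for generate()."""
    if not 1.0 <= cfg_value <= 3.0:
        raise ValueError(f"cfg_value {cfg_value} is outside the listed 1-3 range")
    if not 4 <= inference_timesteps <= 30:
        raise ValueError(f"inference_timesteps {inference_timesteps} is outside the listed 4-30 range")
    return {"cfg_value": cfg_value, "inference_timesteps": inference_timesteps, **extra}


kwargs = generate_kwargs(cfg_value=2.5, inference_timesteps=20, reference_wav_path="voice.wav")
# wav = model.generate("Hello.", **kwargs)
```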

CLI

voxcpm design --text "Hello" --control "Young female warm voice" --output out.wav --device auto
voxcpm clone --text "Hi" --reference-audio voice.wav --prompt-audio ref.wav --prompt-text "txt" --output out.wav
voxcpm batch --input examples/input.txt --output-dir outs
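
When the CLI is driven from a script, building the argument list in Python avoids shell-quoting mistakes. The sketch below uses only the flags shown in the `voxcpm clone` example above. It assumes the prompt audio and transcript pair is optional, mirroring the Python API where controllable cloning needs only reference audio. `clone_command` is a hypothetical helper.

```python
import shlex


def clone_command(text, reference_audio, output, prompt_audio=None, prompt_text=None):
    """Build the argument list for `voxcpm clone` using the flags shown above."""
    cmd = ["voxcpm", "clone", "--text", text, "--reference-audio", reference_audio]
    # Assumption: the prompt pair is only passed when both parts are available.
    if prompt_audio and prompt_text:
        cmd += ["--prompt-audio", prompt_audio, "--prompt-text", prompt_text]
    cmd += ["--output", output]
    return cmd


cmd = clone_command("Hi there, it's me", "voice.wav", "out.wav")
print(shlex.join(cmd))  # shell-safe rendering for logging
# To execute: subprocess.run(cmd, check=True)
```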

Web Demo

git clone https://github.com/OpenBMB/VoxCPM.git && cd VoxCPM && pip install -e .
python app.py --port 8808 --device auto

Deployment

vLLM-Omni (recommended, OpenAI-compatible)

uv pip install vllm==0.19.0 --torch-backend=auto
git clone https://github.com/vllm-project/vllm-omni.git && cd vllm-omni && uv pip install -e .
vllm serve openbmb/VoxCPM2 --omni --port 8000
curl http://localhost:8000/v1/audio/speech -H "Content-Type:application/json" -d '{"model":"openbmb/VoxCPM2","input":"Hello!","voice":"default"}' --output out.wav
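
The same request can be issued from Python with only the standard library. This sketch mirrors the curl call above (same endpoint path and JSON fields) and assumes a server is already listening on localhost:8000. `build_speech_request` and `synthesize` are hypothetical helper names.

```python
import json
import urllib.request


def build_speech_request(text, base_url="http://localhost:8000",
                         model="openbmb/VoxCPM2", voice="default"):
    """Build a POST request for /v1/audio/speech with the fields from the curl example."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path="out.wav", **kwargs):
    """Send the request and write the returned audio bytes to `out_path`."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as f:
        f.write(response.read())
```

Calling `synthesize("Hello!")` would then write out.wav, as the curl command does.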

Nano-vLLM: pip install nano-vllm-voxcpm (RTF ~0.13 vs standard ~0.30)

Fine-tuning

# LoRA (recommended)
python scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v2/voxcpm_finetune_lora.yaml
# Full fine-tuning
python scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v2/voxcpm_finetune_all.yaml
# WebUI
python lora_ft_webui.py  # http://localhost:7860

Data format (JSONL, one object per line): {"audio":"path","text":"transcript","ref_audio":"path"}. Including ref_audio in 30-50% of samples is recommended. LoRA parameters: r=32, alpha=16. Adapters are hot-swappable via load_lora, unload_lora, and set_lora_enabled. A speaker can be adapted with as little as 5-10 minutes of audio.
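
A training manifest in this JSONL layout can be assembled with the standard library. The sketch below is illustrative only: the file paths are made up, and it assumes that samples without a reference clip simply omit the ref_audio key (the source does not say whether null is expected instead). `to_jsonl` and `ref_audio_share` are hypothetical helpers.

```python
import json


def to_jsonl(samples):
    """Serialize samples to JSONL, one object per line."""
    lines = []
    for sample in samples:
        row = {"audio": sample["audio"], "text": sample["text"]}
        # Assumption: ref_audio is left out entirely when a sample has none.
        if sample.get("ref_audio"):
            row["ref_audio"] = sample["ref_audio"]
        lines.append(json.dumps(row, ensure_ascii=False))
    return "\n".join(lines) + "\n"


def ref_audio_share(samples):
    """Fraction of samples that carry a ref_audio entry."""
    return sum(1 for s in samples if s.get("ref_audio")) / len(samples)


samples = [
    {"audio": "clips/001.wav", "text": "First sentence.", "ref_audio": "clips/ref.wav"},
    {"audio": "clips/002.wav", "text": "Second sentence."},
    {"audio": "clips/003.wav", "text": "Third sentence."},
]
manifest = to_jsonl(samples)
share = ref_audio_share(samples)
print(f"{share:.0%} of samples include ref_audio")
```

Checking the share before training makes it easy to see whether a dataset falls inside the recommended 30-50% band.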

License

Apache 2.0 — free for commercial use

Usage Guidance
Install only if you will use it with your own voice or voices you are explicitly authorized to reproduce. Do not use it to impersonate people, create deceptive audio, or clone a speaker without documented consent; handle any uploaded voice samples as sensitive personal data.
Capability Assessment
Purpose & Capability
Voice cloning, controllable cloning, and speaker adaptation are coherent with a TTS skill, but they are high-impact identity-replication capabilities and need explicit consent and non-impersonation limits.
Instruction Scope
The documented runtime guidance appears to enable cloning-style use without clearly scoping it to the user's own voice or authorized speakers.
Install Mechanism
No malicious install behavior was evidenced in the supplied scan context; the concern is the capability and under-disclosed safety posture rather than installation.
Credentials
A TTS or voice-cloning workflow may reasonably need audio inputs, model/provider access, and generated audio outputs, but users should treat source voice samples as sensitive biometric data.
Persistence & Privilege
No artifact-backed evidence of hidden persistence, privilege escalation, destructive behavior, or exfiltration was provided.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install voxcpm
  3. After installation, invoke the skill by name or use /voxcpm
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
- Initial release of voxcpm: a tokenizer-free, multilingual TTS model guide based on VoxCPM2.
- Features installation steps, Python and CLI usage for TTS, voice design, voice cloning, and streaming.
- Includes instructions for vLLM-Omni OpenAI-compatible deployment and fine-tuning (SFT/LoRA).
- Supports 30 languages, high-quality 48kHz output, and advanced voice control features.
- Provides model comparisons, sample commands, and Web demo setup.
- Licensed under Apache 2.0 for free commercial use.
Metadata
Slug voxcpm
Version 1.0.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is VoxCPM2 — Tokenizer-Free Multilingual TTS?

VoxCPM2 — Tokenizer-Free TTS model guide. Covers installation, Python/CLI API (TTS/Voice Design/Controllable Cloning/Ultimate Cloning/Streaming), vLLM-Omni d... It is an AI Agent Skill for Claude Code / OpenClaw, with 18 downloads so far.

How do I install VoxCPM2 — Tokenizer-Free Multilingual TTS?

Run "/install voxcpm" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is VoxCPM2 — Tokenizer-Free Multilingual TTS free?

Yes, VoxCPM2 — Tokenizer-Free Multilingual TTS is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does VoxCPM2 — Tokenizer-Free Multilingual TTS support?

VoxCPM2 — Tokenizer-Free Multilingual TTS is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created VoxCPM2 — Tokenizer-Free Multilingual TTS?

It is built and maintained by OpenLark (@openlark); the current version is v1.0.0.
