Music Craft — MiniMax

Name: Music Craft — MiniMax
Author: luischarro

By LuisCharro · GitHub · v1.0.1 · MIT-0

cross-platform ⚠ suspicious

Install in OpenClaw

/install music-craft-minimax

Description

Advanced music generation for OpenClaw, using the MiniMax Music 2.6 token plan. Use for cover and style transfer, two-song mashup, lyrics generation API, emo...

Usage (SKILL.md)

Music Craft — MiniMax

This is the power-user upgrade of music-craft. It does everything that skill does, plus the features that require the MiniMax Music 2.6 token plan:

Cover and style transfer from a reference audio file or YouTube URL (preserves melody)
Two-song mashup (Song A's content and emotion + Song B's style)
Lyrics generation via the MiniMax API endpoint (with edit mode for iteration)
Emotion analysis on input audio to drive prompt construction (vocal speed, intensity curve, pitch bends)
Fine control over generation parameters (BPM, key, structure, avoid list as separate flags via mmx)

For everything else (standard song generation, instrumentation, anti-sparse prompt engineering, structure tags, user preference flow), this skill uses the same workflow as music-craft. Read that skill first to understand the base, then come back here for the MiniMax-specific extensions.

Routing and Blocker Checks

Classify the request before analysis or generation:

Text-only style reference means the user gave a song name, artist, era, or genre cue without source audio. Treat it as style inference, not cover analysis.
Reference audio or YouTube means the user provided a file or playable source that should be analyzed.
Cover preserves melody and usually needs a source file plus a target style decision.
Style transfer uses a reference track or analyzed audio as style input, then changes the production direction.
Mashup needs Song A and Song B, plus a decision about which one contributes content and which one contributes style.
Emotion prompt means the user wants analysis turned into descriptive prompt language, not a full cover.

The scripts/lint_music_request.py helper emits one of these routes:

Route	When
`base_prompt`	Standard generation, no MiniMax-specific feature needed.
`minimax_cover`	Melody-preserving cover from audio or YouTube.
`minimax_mashup`	Two-song mashup (A + B, both identified).
`minimax_style_transfer`	Style transfer that does not preserve the source melody.
`minimax_emotion_prompt`	Emotion analysis, or precision `mmx` flag usage.
`needs_clarification`	At least one blocker is unresolved; ask the user first.

Surface blockers before analysis:

no source file or usable URL
unclear which track is Song A versus Song B
missing target style
missing lyrics decision, such as original, translated, rewritten, or instrumental
conflicting cover/style-transfer intent: the user asked for both "cover" (preserve melody) and "style transfer" (reproduce style) at once. These are mutually exclusive. Ask the user to pick one.
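The routing and blocker logic above can be sketched in a few lines of Python. This is not the code of scripts/lint_music_request.py, whose internals are not shown here. The `Request` fields, the function names, and the choice of which blockers apply to which route are all assumptions made for illustration; only the route names and blocker wording come from this document.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Request:
    """Hypothetical intake record; field names are illustrative only."""
    wants_cover: bool = False
    wants_style_transfer: bool = False
    wants_mashup: bool = False
    wants_emotion_prompt: bool = False
    has_source: bool = False            # local file or usable URL
    songs_a_b_identified: bool = False  # only meaningful for mashups
    target_style: Optional[str] = None
    lyrics_decision: Optional[str] = None  # original / translated / rewritten / instrumental


def find_blockers(req: Request) -> List[str]:
    """Collect unresolved blockers, using the wording of the list above."""
    blockers = []
    uses_audio = (req.wants_cover or req.wants_style_transfer
                  or req.wants_mashup or req.wants_emotion_prompt)
    if req.wants_cover and req.wants_style_transfer:
        blockers.append("conflicting cover/style-transfer intent")
    if uses_audio and not req.has_source:
        blockers.append("no source file or usable URL")
    if req.wants_mashup and not req.songs_a_b_identified:
        blockers.append("unclear which track is Song A versus Song B")
    if (req.wants_cover or req.wants_style_transfer or req.wants_mashup) and not req.target_style:
        blockers.append("missing target style")
    # Assumption: the lyrics decision only blocks covers and mashups.
    if (req.wants_cover or req.wants_mashup) and not req.lyrics_decision:
        blockers.append("missing lyrics decision")
    return blockers


def classify(req: Request) -> Tuple[str, List[str]]:
    """Return (route, blockers); any blocker forces needs_clarification."""
    blockers = find_blockers(req)
    if blockers:
        return "needs_clarification", blockers
    if req.wants_mashup:
        return "minimax_mashup", []
    if req.wants_cover:
        return "minimax_cover", []
    if req.wants_style_transfer:
        return "minimax_style_transfer", []
    if req.wants_emotion_prompt:
        return "minimax_emotion_prompt", []
    return "base_prompt", []
```

Checking blockers before choosing a route keeps the "ask the user first" rule in one place, so no MiniMax route is returned while a blocker is open.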

After you have prompt text and mmx flags, lint them together before generation:

compare prompt BPM with --bpm
compare prompt key with --key
compare prompt structure line with --structure
compare prompt duration with --duration (or implicit length expectation)
compare prompt vocal mode with --vocals
compare prompt language with --language
compare prompt avoid language with --avoid
stop when the prompt says one thing and the flags say another

If the user only has a text reference, route to the free-tool path in references/free-tool-inputs.md first. If the user has audio, analyze first and only then build the prompt. The linter returns a retry_guidance array with one hint per conflict so the operator can re-align prompt and flags on the next attempt.
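As a rough illustration of the prompt-versus-flags comparison, the sketch below checks only BPM and key. It is not the real linter: the function name, the `mmx_flags` dictionary shape, the regular expressions, and the hint wording are assumptions. The real linter's output schema is not reproduced here beyond the fact that it returns one retry hint per conflict.

```python
import re
from typing import Dict, List


def _normalize_key(text: str) -> str:
    """'d Minor' -> 'D minor', so spelling differences are not reported as conflicts."""
    note, mode = text.split()
    return f"{note[0].upper()}{note[1:]} {mode.lower()}"


def lint_prompt_vs_flags(prompt: str, mmx_flags: Dict[str, object]) -> List[str]:
    """Return one hint per BPM/key conflict between prompt text and flag values."""
    hints = []

    # A two- or three-digit number followed by "BPM" counts as the prompt tempo.
    bpm = re.search(r"\b(\d{2,3})\s*BPM\b", prompt, flags=re.IGNORECASE)
    if bpm and "bpm" in mmx_flags and int(bpm.group(1)) != int(mmx_flags["bpm"]):
        hints.append(
            f"prompt says {bpm.group(1)} BPM but --bpm is {mmx_flags['bpm']}; "
            "change one so they agree"
        )

    # The note letter must be upper case so that "a minor detail" is not read as a key.
    key = re.search(r"\b([A-G][#b]?)\s+((?i:major|minor))\b", prompt)
    if key and "key" in mmx_flags:
        prompt_key = _normalize_key(f"{key.group(1)} {key.group(2)}")
        flag_key = _normalize_key(str(mmx_flags["key"]))
        if prompt_key != flag_key:
            hints.append(
                f"prompt says {prompt_key} but --key is {flag_key}; "
                "change one so they agree"
            )
    return hints
```

An empty list means the two sources agree on the fields checked; a non-empty list is the signal to stop before generation.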

When To Use

Use this skill when the task involves:

generating a cover of an existing song with a different style (chanson version of a rock track, reggaeton version of a pop hit, and so on)
style transfer from a YouTube URL or audio file to a target genre
two-song mashup where Song A's lyrics and emotional arc are kept, but Song B's style is applied
emotion analysis on input audio to extract intensity curves, vocal speed, pitch bends, and emotion classifications
generating lyrics in a specific language and theme via the MiniMax lyrics_generation API
editing existing lyrics to match a target style or emotional arc (MiniMax lyrics_generation edit mode)
using mmx CLI directly for fine control over --avoid, --bpm, --key, --structure, --vocals, --instruments as separate flags
accessing MiniMax's music-cover or music-cover-free models for melody preservation

Request Intake (adapted for MiniMax features)

After the Routing and Blocker Checks classify the request, run this 2-pass intake to extract the full set of fields the user cares about. Label each field's confidence: clear (user said it), inferred (sensible default), missing (need to ask), or conflicting (user said two incompatible things — pause to resolve).

Fields checklist (MiniMax-specific)

#	Field	What to look for	MiniMax-specific notes
1	Route	Cover / style transfer / mashup / standard / emotion prompt	From the Routing and Blocker Checks section. Determines which MiniMax features to use.
2	Source audio or URL	File path or playable YouTube URL	Required for cover, mashup, style transfer. For standard, optional (text-only style reference is also fine).
3	Song A identity	Name, artist, audio	For mashup: needed. For cover: this is the source.
4	Song B identity	Name, artist, audio	For mashup only.
5	Target style	Genre / mood / reference	The destination of the cover or style transfer. If user says "like Rosalía", that's clear. If user says "something good", that's missing.
6	Lyrics decision	Original / translated / new / instrumental	For cover, default to original (translated if user requests it). For standard, default to new (or user-provided).
7	Vocal mode	Solo / duet / choir / instrumental	Drives `--vocals` and `--language` flags.
8	Language	BCP-47 code (en, fr, es, etc.)	For lyrics language AND vocal language.
9	Duration	Approximate length (jingle ~30s, standard ~3min, epic ~6min)	mmx has no native duration control (see "Song length" section). Length is driven by lyrics + structure, so the intake needs the lyrics to control length.
10	BPM, key, structure	Exact values if user wants `--bpm`/`--key`/`--structure`	Optional. If provided, the prompt AND flags must agree (lint them).
11	Emotion arc	For emotion-prompt workflows: which emotions to emphasize	Drives the analysis-to-prompt translation.
12	Output location	Where the audio and analysis files go	Same as the base skill: per-song subfolder in `~/Music mix/<project>/<song-slug>/`.

Confidence map example (MiniMax-specific)

Request: "Hazme un cover del 'Bizcochito' de Rosalía pero en reggaetón" (Spanish: "Make me a cover of Rosalía's 'Bizcochito', but in reggaeton")

clear:     source_audio=path, song_a=Bizcochito, target_style=reggaeton
inferred:  language=es, vocal_mode=solo_female, lyrics_decision=original
missing:   output_location (which project folder? per-song subfolder?)
            vocal_register (full chest, head voice, whisper? — affects --vocals flag)

Request: "I have a YouTube link of an old rock song and want it as a dreamy shoegaze ballad, with English lyrics because the original is in French"

clear:     source_url=URL, song_a=old_rock_song, target_style=shoegaze
            lyrics_decision=translated, target_language=en
inferred:  vocal_mode=duet or solo (depends on original), ~3min
missing:   audio source for source audio analysis (YouTube needs to be downloaded first)
            BPM/key from analysis output (will be filled in after analysis)
            output_location

If any field is missing or conflicting, that's a question to ask. The Ambiguity Questions section below has specific patterns for each route. If everything is clear or inferred, the request is ready to translate.
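One way to hold a confidence map in code is a dictionary from field name to a (confidence, value) pair. The sketch below uses the first example request; the data structure and function names are assumptions, not part of the skill's scripts.

```python
from typing import Dict, List, Optional, Tuple

# field name -> (confidence label, value); labels follow the intake section above
ConfidenceMap = Dict[str, Tuple[str, Optional[str]]]

bizcochito: ConfidenceMap = {
    "source_audio": ("clear", "path"),
    "song_a": ("clear", "Bizcochito"),
    "target_style": ("clear", "reggaeton"),
    "language": ("inferred", "es"),
    "vocal_mode": ("inferred", "solo_female"),
    "lyrics_decision": ("inferred", "original"),
    "output_location": ("missing", None),
    "vocal_register": ("missing", None),
}


def fields_to_ask(cmap: ConfidenceMap) -> List[str]:
    """Fields labeled missing or conflicting each become a question for the user."""
    return [name for name, (level, _value) in cmap.items()
            if level in ("missing", "conflicting")]


def ready_to_translate(cmap: ConfidenceMap) -> bool:
    """True only when every field is clear or inferred."""
    return not fields_to_ask(cmap)
```

For the example above, `fields_to_ask(bizcochito)` yields the two missing fields, so the request is not yet ready to translate.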

User Preference Flow (message patterns → action)

The skill does not start with a questionnaire. It starts by reading and inferring from the user's natural-language request.

User says...	Skill does...
"Haz un cover de X en Y" ("Make a cover of X in Y")	Route: `minimax_cover`. Ask: source audio file (or download from YouTube), target language for lyrics, vocal register.
"Make this song sound like Rosalía"	Route: `minimax_style_transfer`. Ask: source audio, which album/era of Rosalía.
"I have audio of A, mash with B, keep A's melody"	Route: `minimax_mashup`. Ask: A vs B confirmation, source audio for A, B can be name or audio.
"Analyze the emotion curve of this track"	Route: `minimax_emotion_prompt` (analysis-only). Run `analysis_orchestrator.py --audio` first, then read the JSON.
"I want the lyrics to be about X, in French, melancholic"	Route: `base_prompt` (standard). Use the lyrics API to generate, then pass to `mmx music generate --lyrics-file`. Ask: target BPM/key/structure or derive from analysis.
"Recreate the song but in 90 BPM D minor"	Route: `base_prompt` with `mmx` flags. Lint prompt vs flags before generation. Verify BPM/key consistency.
"I don't know, surprise me"	Pick a coherent default (e.g. upbeat indie pop, EN, ~3min, auto-lyrics, standard generation) and confirm with the user before generating.
"Same song again but as a reggaeton version"	Route: `minimax_cover` with the existing song as source. Use the same project/song subfolder, suffix the MP3 (`M1_original.mp3` + `M2_reggaeton.mp3`).

This table summarizes references/user-preference-flow.md (which lives in the base skill). For cases not covered here, defer to the base skill's table and combine it with this skill's route mapping.

Output File Layout (Per-Song Subfolders)

MiniMax-specific additions (drop these into the per-song subfolder alongside the base items):

File	Source	Notes
`<song-slug>_analysis.json`	`analysis_orchestrator.py --output`	MiniMax-specific analysis results (emotion, BPM, key, segments)
`<song-slug>_lyrics.txt`	`mmx music generate --lyrics-file`	Optional if user provided lyrics inline
`<song-slug>_<style>_prompt.txt`	The exact text passed to `--prompt`	For reproducibility

The LLM should aim for the base skill's layout by default. The MiniMax-specific files are added on top when MiniMax features are used (cover workflow, mashup, analysis, etc.).

Quick Start with the Orchestrator

For any input combination, the analysis_orchestrator.py script is the single entry point:

# Audio file
python3 scripts/analysis_orchestrator.py --audio /tmp/song.wav

# Two songs (mashup) - gets BPM + key compatibility scoring for free
python3 scripts/analysis_orchestrator.py --audio /tmp/song_a.wav --audio /tmp/song_b.wav

# Video - extracts audio + visual features (scenes, color, motion)
python3 scripts/analysis_orchestrator.py --video /tmp/clip.mp4

# Image (album art) - color palette + style hints
python3 scripts/analysis_orchestrator.py --image /tmp/album_art.jpg

# YouTube URL - downloads then analyzes
python3 scripts/analysis_orchestrator.py --youtube "https://youtube.com/watch?v=..."

# Combination: audio + image
python3 scripts/analysis_orchestrator.py --audio /tmp/song.wav --image /tmp/art.jpg

# Demucs source separation — for TIMBRE/PITCH analysis of an isolated vocal, NOT for lyrics
python3 scripts/analysis_orchestrator.py --audio /tmp/song.wav --use-demucs

# Whisper lyrics extraction — run on the FULL mix (do NOT pre-separate with Demucs)
python3 scripts/analysis_orchestrator.py --audio /tmp/song.wav --lyrics

# VLM captioning for images (calls mmx vision describe / MiniMax 3.0 — cloud, skip if MiniMax is blocked)
python3 scripts/analysis_orchestrator.py --image /tmp/album_art.jpg --vlm

The orchestrator dispatches to the right analysis scripts and produces a unified JSON. Optional packages (CLAP, autochord, allin1, pyloudnorm, pylette, scenedetect, demucs, beat_this, basic-pitch, transformers/MERT, open_clip) are detected at runtime and used when available.
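Runtime detection of optional packages can be done with the standard library alone. The sketch below shows the general technique, not the orchestrator's actual code. The function names and the shape of the error dictionary are assumptions, and the names passed in must be importable module names (for example `basic_pitch`, not the pip name).

```python
import importlib.util
from typing import Callable, Dict, Iterable


def detect_optional(modules: Iterable[str]) -> Dict[str, bool]:
    """Report which optional modules can be imported, without importing them."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


def run_step(step_name: str, required_module: str, step: Callable[[], object]) -> dict:
    """Run one analysis step, or return an error entry if its dependency is absent.

    Returning a dictionary instead of raising lets the remaining steps continue.
    """
    if importlib.util.find_spec(required_module) is None:
        return {
            "step": step_name,
            "error": f"optional dependency '{required_module}' is not installed",
        }
    return {"step": step_name, "result": step()}


if __name__ == "__main__":
    print(detect_optional(["librosa", "demucs", "scenedetect"]))
```

Because `find_spec` only locates a module, the check itself stays cheap even for heavy packages.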

Extraction guidance (what actually improves the output)

These are the rules that make the extracted data useful to the downstream generator. They are tool-agnostic — they apply whether the backend is MiniMax cloud or a local model.

Lyrics: transcribe the FULL mix, do not Demucs-first. Feeding Demucs-isolated vocals into Whisper measurably worsens transcription word-error-rate in most configurations. Run the transcriber on the original mix. Use faster-whisper over vanilla whisper (same accuracy, much lower latency/VRAM), and prefer the large-v2 model for sung lyrics — large-v3 is reliably worse on singing. Use medium/base only as a speed compromise.
Use Demucs only for timbre/pitch. Source separation helps when you want clean vocal-stem features (breathiness, pitch range, vocal brightness) or per-instrument detection — never as a lyrics pre-step.
Prioritise the high-value features. For driving a generation prompt, the features that matter most (in order) are tempo/BPM, key/scale, beats/downbeats, chords, then structure (section boundaries). Energy/RMS and spectral centroid map to texture words (punchy, airy, sparse, dense) and to dynamic tags. Spend analysis budget there first.
Give key detection a long window. Estimate key/chroma over ~120s of audio (not a short clip) for a stable result; BPM is stable from ~60s.
Carry confidence through to the prompt. Hedge low/medium detections ("around 128 BPM", "likely D minor") and never inject missing values as facts — see Analysis Quality below.
Map structure boundaries to actions. Detected section boundaries become the [Verse]/[Chorus]/[Bridge] tag roadmap, and (for backends that support it) the repaint windows for fixing one bad section instead of regenerating the whole track.
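The "carry confidence through to the prompt" rule lends itself to a small helper. This is a minimal sketch under stated assumptions: the function names and the (value, confidence) input shape are hypothetical, and only the hedging phrases ("around", "likely") come from the guidance above.

```python
from typing import Dict, Optional, Tuple

HEDGED_LEVELS = ("low", "medium")


def describe_tempo(bpm: Optional[int], confidence: str) -> Optional[str]:
    """Hedge low/medium tempo detections; omit missing ones entirely."""
    if bpm is None or confidence == "missing":
        return None
    return f"around {bpm} BPM" if confidence in HEDGED_LEVELS else f"{bpm} BPM"


def describe_key(key: Optional[str], confidence: str) -> Optional[str]:
    """Hedge low/medium key detections; omit missing ones entirely."""
    if key is None or confidence == "missing":
        return None
    return f"likely {key}" if confidence in HEDGED_LEVELS else key


def prompt_fragment(analysis: Dict[str, Tuple[object, str]]) -> str:
    """Join the available descriptions; a missing value never reaches the prompt."""
    tempo, tempo_conf = analysis.get("tempo", (None, "missing"))
    key, key_conf = analysis.get("key", (None, "missing"))
    parts = [describe_tempo(tempo, tempo_conf), describe_key(key, key_conf)]
    return ", ".join(p for p in parts if p)
```

A missing detection returns `None` rather than a placeholder string, which makes it hard to inject a guessed value into the prompt by accident.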

Output file layout (per-song subfolders)

Every generation should be saved into a per-song subfolder that bundles the audio with its analysis, prompt, and lyrics. The LLM should ask the user for the project root and song slug up front (default: ~/Music mix/<project>/<song-slug>/), then run the full chain of commands below.

# Example: DBC - Two Paths, two versions

# 1. Make the subfolder
mkdir -p ~/Music\ mix/dbc/two-paths

# 2. Run the analysis and save JSON into the subfolder
python3 scripts/analysis_orchestrator.py \
  --audio /tmp/two_paths.wav \
  --use-demucs --lyrics --lyrics-source auto \
  --output ~/Music\ mix/dbc/two-paths/two_paths_analysis.json

# 3. Build the prompt from the analysis, save it next to the JSON
python3 scripts/emotion_to_prompt.py \
  --emotion ~/Music\ mix/dbc/two-paths/two_paths_analysis.json \
  --output ~/Music\ mix/dbc/two-paths/two_paths_synthwave_prompt.txt

# 4. Generate each version, save the MP3 into the subfolder with a
#    versioned filename so multiple takes stack cleanly
mmx music generate \
  --prompt "$(cat ~/Music\ mix/dbc/two-paths/two_paths_synthwave_prompt.txt)" \
  --lyrics-file ~/Music\ mix/dbc/two-paths/two_paths_lyrics.txt \
  --out ~/Music\ mix/dbc/two-paths/M1_two_paths_synthwave.mp3

The result is a self-contained song folder that the user can review, archive, share, or re-generate from without losing any context.
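The folder and filename conventions from the example can be sketched with `pathlib`. The helper names are hypothetical, and the rule "next take is one above the highest existing `M<n>_` prefix" is an assumption inferred from the `M1_`/`M2_` filenames in this document. The base skill's own slug and version rules take precedence.

```python
import re
import tempfile
from pathlib import Path
from typing import List


def song_folder(root: str, project: str, song_slug: str) -> Path:
    """Per-song subfolder: <root>/<project>/<song-slug>/."""
    return Path(root) / project / song_slug


def next_take_name(folder: Path, stem: str, style: str) -> str:
    """Return 'M<n>_<stem>_<style>.mp3' with n one above the highest existing take."""
    taken = [
        int(m.group(1))
        for p in folder.glob("M*_*.mp3")
        if (m := re.match(r"M(\d+)_", p.name))
    ]
    return f"M{max(taken, default=0) + 1}_{stem}_{style}.mp3"


def demo() -> List[str]:
    """Exercise the helpers in a throwaway directory instead of ~/Music mix."""
    with tempfile.TemporaryDirectory() as tmp:
        folder = song_folder(tmp, "dbc", "two-paths")
        folder.mkdir(parents=True)
        first = next_take_name(folder, "two_paths", "synthwave")
        (folder / first).touch()
        second = next_take_name(folder, "two_paths", "chanson")
        return [first, second]


if __name__ == "__main__":
    print(demo())
```

Deriving the take number from the files already in the folder means repeated runs stack new versions next to the old ones instead of overwriting them.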

What's New in v1.0.0

v1.0.0 is the first stable release. It builds on the v0.x series (v0.3.0 / v0.4.0 dev line) with stronger preflight routing, wider prompt/flag consistency, and explicit post-generation verification:

Preflight routing:

lint_music_request.py now emits one of six routes: base_prompt, minimax_cover, minimax_mashup, minimax_style_transfer, minimax_emotion_prompt, or needs_clarification
New blockers: missing Song B, missing lyrics decision, and conflicting cover/style-transfer intent
A retry_guidance array on every conflict so the operator can re-align prompt and flags

Prompt and flag consistency:

Linter now detects conflicts in BPM, key, structure, duration, vocal mode, language, and avoid list
The canonical mmx prompt schema is documented in references/examples.md

Analysis quality:

All analysis scripts converge on a compact summary (tempo, key, sections, instrumentation, vocal traits, energy curve, hook points, mix notes)
Confidence levels (clear, high, medium, low, inferred, missing) attached to every detection
Missing optional dependencies fall back to a JSON error block instead of failing the whole workflow

Output verification:

Post-generation verification checklists for covers, mashups, style transfer, and emotion prompts
Eight failure signatures (copied too closely, lost melody, wrong tempo, wrong key, muddy mix, weak chorus, style mismatch, neutral vocals) with matching fixes
Revision prompt templates that preserve source identity while fixing one specific dimension

Tests and portability:

Smoke tests now cover all new linter routes, the new conflict types, and the stdlib-only import guarantee
Windows is documented as partial support; scripts stay POSIX-safe, audio tools may need platform install

What's New in v0.3.0

v0.3.0 builds on v0.2.0 with a substantially richer analysis pipeline:

New analysis scripts (8):

extract_stems.py — Demucs source separation (vocal/drums/bass/other)
track_beats.py — beat_this beat + downbeat tracking (ISMIR 2024 SOTA)
extract_melody.py — Spotify Basic Pitch polyphonic AMT → MIDI + key/scale
compute_audio_embedding.py — MERT v1-330M music embeddings (vibe similarity)
classify_instruments.py — MIT AST 527-class AudioSet tagging
extract_video_features.py — extended with camera motion + VLM captioning
analyze_image.py — extended with OpenCLIP, OCR, face detection, VLM caption
analysis_orchestrator.py — single entry point, --use-demucs, --vlm, --ocr flags

New prompt slots (consumed in emotion_to_prompt.py):

beat grid: 4/4 at 150 BPM (confidence 0.80) from beat_this
melodic key from MIDI: E minor; interval motion: mostly leaps; modal character: pentatonic, blues from Basic Pitch
AST-detected sound palette: rock music (0.16), punk rock (0.14), grunge (0.20) from MIT AST
emotion signature from analysis: intense, passionate, dramatic, triumphant (expanded to 25-emotion classifier)
vocal texture in verse: breathier / more intimate than average (per-section aggregation)
tempo: tight, on-beat delivery (from tempo_consistency)
tonal character: dark warm tone, rolled-off highs (from brightness)
instruments detected: electronic / synthetic textures (from instrument_hints)
natural dramatic pauses detected at: 2s (11.7s pause), 20s (3.3s pause) (from Demucs vocal-stem)
style direction: ... (from analyze_two_songs mashup_plan)

Bug fixes:

parselmouth 0.4.x API (get_value_at_time / get_value_at_xy)
ffmpeg 8.x image2 muxer workaround (per-frame extraction)
pylette 5.1+ capital-P import + Pylette fallback
open_clip 3.3 3-tuple return + get_tokenizer() for tokenizer
demucs 4.x apply_model API

Prompting wins (verified end-to-end with DBC Woodstock 2013):

Mix: 0 silence gaps, 35 pitch bends
Vocal stem: 19 silence gaps, 49 pitch bends, 2.32 syll/sec
BPM 150 (4/4) from beat_this, E minor (MIDI-confirmed G# minor)
AST: "Rock music", "Punk rock", "Heavy metal", "Grunge" — matches the actual band

When NOT To Use

Do not use this skill when:

the user only needs standard song generation without cover, mashup, or analysis — use music-craft instead (lighter, no MiniMax dependency)
the runtime does not expose a music_generate tool and there is no MINIMAX_API_KEY configured — both skills need the runtime
the user wants deterministic, single-shot generation with no iteration; this skill is overkill for that
the user wants to mutate a specific existing audio file (pitch shift, time stretch, stem split) — that is post-production, not generation
the user is not on a MiniMax Token Plan — the advanced features (cover, mmx per-flag control, lyrics API, emotion-driven prompts) require the plan

Decision Tree

Use the base skill unless one of these MiniMax-specific needs is present:

melody-preserving cover or style transfer from audio or YouTube
two-song mashup
lyrics API preview/edit flow
emotion analysis that feeds the prompt
exact mmx control for BPM, key, structure, or avoid lists

If the user wants a new song that only borrows a style, stay in music-craft unless they also need exact flag control or lyrics API iteration.

If the source is a YouTube URL and download is blocked, ask for a local file before changing the workflow.

First Response Defaults

Use these defaults on the first pass:

Cover from audio or YouTube: start with the one-step cover path. Switch to two-step only if the user wants translated lyrics, edited ASR lyrics, or custom lyrics.
Style transfer only: do not use cover unless melody preservation matters. Use standard generation plus mmx flags if exact BPM/key/structure matter.
Two-song mashup: anchor on Song A. If Song A has audio, default to the cover two-step workflow; if Song B is only named, ask for a short style description or fetch more context if free tools are available.
Lyrics API generation or edit: use write_full_song for blank-page generation and edit for revisions.
Emotion-analysis-to-prompt: run analysis first, then convert to a prompt; only ask whether the output should be cover, mashup, or standard generation, plus the target language if missing.
Exact BPM/key/structure control: make mmx flags the source of truth and keep the prompt descriptive but non-conflicting.
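When the flags are the source of truth, it can help to assemble the `mmx music generate` call from structured values rather than by string concatenation. The sketch below uses only flag names that appear in this document. Whether they can all be combined in a single invocation has not been verified here, and the function name and parameters are illustrative. The command is built, not executed.

```python
import shlex
from typing import List, Optional, Sequence


def build_mmx_command(
    prompt: str,
    out: str,
    lyrics_file: Optional[str] = None,
    bpm: Optional[int] = None,
    key: Optional[str] = None,
    structure: Optional[str] = None,
    avoid: Optional[Sequence[str]] = None,
) -> List[str]:
    """Build an argv list; optional flags are added only when a value is given."""
    cmd = ["mmx", "music", "generate", "--prompt", prompt]
    if lyrics_file:
        cmd += ["--lyrics-file", lyrics_file]
    if bpm is not None:
        cmd += ["--bpm", str(bpm)]
    if key:
        cmd += ["--key", key]
    if structure:
        cmd += ["--structure", structure]
    if avoid:
        cmd += ["--avoid", ", ".join(avoid)]  # comma-separated, as in the flag table
    cmd += ["--out", out]
    return cmd


if __name__ == "__main__":
    argv = build_mmx_command(
        "dreamy shoegaze ballad", "M1_song_shoegaze.mp3", bpm=90, key="D minor"
    )
    print(shlex.join(argv))  # shell-quoted for review before running
```

An argv list avoids quoting problems with paths such as `~/Music mix/...` that contain spaces.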

Ambiguity Questions

Ask at most three questions. Separate blockers from quality tweaks:

Required blockers first: source file or URL, which song is A vs B, whether lyrics already exist, whether the output must preserve melody.
Optional quality after blockers: target language, target style, BPM, key, structure, instruments, vocal color, avoid list.

Use these exact patterns when clarification is needed:

Cover: "Which source should I use?" "Do you want the original lyrics, translated lyrics, or new lyrics?" "Any target style, or should I derive it from the source?"
Mashup: "Which song is A and which is B?" "Do you have audio for Song B, or only the name?" "Should the lyrics stay the same or be rewritten?"
Lyrics API: "Write from scratch or edit existing lyrics?" "What language should I target?" "Any hard structure requirements?"
Emotion prompt: "Do you want cover, mashup, or standard generation?" "What language should the output use?" "Should I prioritize tenderness, energy, or structure?"
mmx precision: "Which values are mandatory: BPM, key, structure, or avoid list?" "Any instruments or vocals that must stay in or stay out?"
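The "blockers first, at most three questions" rule is easy to express as a small selection function. This sketch assumes the caller has already decided which questions are blockers and which are quality tweaks; the function name is hypothetical.

```python
from typing import List, Sequence

MAX_QUESTIONS = 3  # upper bound stated in the Ambiguity Questions section


def pick_questions(
    blockers: Sequence[str],
    quality_tweaks: Sequence[str],
    limit: int = MAX_QUESTIONS,
) -> List[str]:
    """Ask blocker questions first; quality questions only fill the remaining slots."""
    return (list(blockers) + list(quality_tweaks))[:limit]


if __name__ == "__main__":
    print(pick_questions(
        ["Which song is A and which is B?"],
        ["What language should I target?", "Any hard structure requirements?",
         "Any target style, or should I derive it from the source?"],
    ))
```

With four or more open blockers the quality questions are dropped entirely for this round, which matches the priority order above.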

Relationship to `music-craft`

This skill extends the base skill; it does not replace it. The shared concepts are:

Concept	Where it lives
Pre-Flight Check (platform detection)	This skill (extended required list)
Anti-sparse rules (canonical text)	Base skill, referenced from here
Prompt formula (production sheet)	Base skill, referenced from here
Structure tags (14 tags)	Base skill, referenced from here
User preference flow (auto-detect + ask)	Base skill, referenced from here
Output file layout (per-song subfolders, slug rules, version prefix)	Base skill, referenced from here; MiniMax adds analysis.json and lyrics.txt
Rate limits (generic)	Base skill
Quality verification checklist	Base skill, extended here for MiniMax
Operating rules (6-step loop)	Base skill, summarized here with MiniMax-specific extensions

The MiniMax-specific additions are:

MiniMax concept	Where it lives
`mmx` CLI quick reference	This skill
`mmx` full flag reference	This skill, `references/mmx-flags-reference.md`
Cover workflow (one-step, two-step)	This skill, `references/cover-workflow.md`
Lyrics generation API	This skill, `references/lyrics-generation.md`
Mashup workflow (A + B)	This skill, `references/mashup-workflow.md`
Emotion analysis (vocal speed, intensity, pitch)	This skill, `references/emotion-analysis.md`
MiniMax-specific error handling	This skill, `references/error-handling.md`
Audio analysis scripts	This skill, `scripts/`
Free tool inputs (web, image, memory)	Both skills — base layer in `music-craft`, MiniMax layer here in `references/free-tool-inputs.md`

Pre-Flight Check (extended)

The platform detection block is the same as music-craft (run it first). The required and optional lists are extended for MiniMax.

Platform Notes

macOS/Linux are the primary targets: use python3, command -v, and normal shell export/PATH checks.
Windows is partial support only: prefer PowerShell, use python or py -3, and verify env vars with Get-ChildItem Env:MINIMAX_API_KEY or Test-Path Env:MINIMAX_API_KEY.
On Windows, ffmpeg, yt-dlp, and mmx are PATH-sensitive; if Get-Command/where.exe cannot find them, restart the shell or add the install directory to PATH.
If Windows path/dependency issues keep blocking analysis, use WSL for the script-heavy parts instead of claiming full native support. For the full WSL2 setup (including the corporate items below), follow the base skill's references/windows-wsl-setup.md.
Corporate machines (TLS-inspecting proxy): pip/HuggingFace/model downloads inside WSL fail with CERTIFICATE_VERIFY_FAILED until the corporate root CA is installed in the distro (and REQUESTS_CA_BUNDLE/SSL_CERT_FILE point at the system bundle). A proxy env var (HTTP_PROXY) can also hijack 127.0.0.1 calls to a local API — unset it / set no_proxy=127.0.0.1,localhost. Both are covered in the base reference above.
If MiniMax itself is blocked (corporate firewall) the cloud features here — mmx music cover, the lyrics API, and any analysis script that calls MiniMax (e.g. emotion_to_prompt.py) — will fail. In that case use only the local-capable tools (yt-dlp, ffmpeg, librosa, Whisper) for analysis and the local ACE-Step backend in music-craft for generation.

Required (skill will not work without these)

Check	What it is	How to verify	If missing
`music_generate` tool	The runtime's built-in music generation tool	Inspect the active runtime's tool list	Tell the user: "This skill needs a `music_generate` tool, but the active runtime does not expose one. Configure a music provider in OpenClaw and try again." Stop.
`MINIMAX_API_KEY` env var	API key for the MiniMax Music 2.6 plan	`test -n "$MINIMAX_API_KEY" && echo "OK"`	Tell the user: "This skill needs the `MINIMAX_API_KEY` environment variable. Get one from your MiniMax account and export it. If you do not have a MiniMax Token Plan, use `music-craft` instead — it works with any provider." Stop.
`mmx` CLI	The MiniMax CLI for fine-flag control	`command -v mmx && mmx --version` (macOS/Linux) or `Get-Command mmx; mmx --version` (PowerShell)	Ask the user: install via the MiniMax install guide, or skip mmx-specific features and use the `music_generate` tool with prompts. Do not block — `mmx` is optional if the runtime has MiniMax configured, but Windows support is only partial and depends on PATH visibility.
`python3`	Required for the analysis scripts	`command -v python3` (macOS/Linux) or `python` / `py -3` (Windows PowerShell)	Tell the user: "The analysis pipeline (emotion analysis, mashup) needs Python 3.9+." Propose an install command for the active shell. Block emotion analysis if missing.

Optional (skill works without these, but quality improves with them)

Tool	What it unlocks	Install per platform
`ffmpeg`	Audio conversion (WAV for analysis, MP3 export, trimming)	`apt install ffmpeg` · `brew install ffmpeg` · `winget install Gyan.FFmpeg` (restart PowerShell after install so PATH updates apply)
`yt-dlp`	YouTube audio download for cover and mashup	`pip install -U yt-dlp` or `py -3 -m pip install -U yt-dlp` on Windows; ensure the CLI is on `PATH`
`librosa`	Audio analysis (BPM, key, energy, structure)	`pip install librosa numpy scipy`
`parselmouth`	Better pitch tracking (Praat under the hood)	`pip install praat-parselmouth`
`scikit-learn`	Audio clustering (segment detection)	`pip install scikit-learn`

The full per-platform install table is in the Pre-Flight Check section of the base skill (music-craft).
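The parts of the pre-flight check that are visible from a script (the API key and the command-line tools) can be sketched with the standard library. The function names and report shape are assumptions. The `music_generate` tool is not covered because it lives in the runtime's tool list, not on `PATH`.

```python
import os
import shutil
from typing import Callable, Dict, Mapping, Optional

CLI_TOOLS = ("mmx", "ffmpeg", "yt-dlp")  # per the tables above, none of these block the skill


def preflight(
    env: Optional[Mapping[str, str]] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Dict[str, bool]:
    """Report whether the API key is set and which CLI tools are on PATH."""
    env = os.environ if env is None else env
    report = {"MINIMAX_API_KEY": bool(env.get("MINIMAX_API_KEY"))}
    for tool in CLI_TOOLS:
        report[tool] = which(tool) is not None
    return report


def must_stop(report: Dict[str, bool]) -> bool:
    """Only the missing API key stops the skill; missing tools lead to a question."""
    return not report["MINIMAX_API_KEY"]


if __name__ == "__main__":
    result = preflight()
    print(result, "stop" if must_stop(result) else "continue")
```

Passing `env` and `which` as parameters keeps the check testable without touching the real environment. Nothing is installed automatically, in line with the "never auto-install" rule.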

The "ask the user" pattern

Same as the base skill: for each missing optional tool, present three options — install (propose exact command, let user approve), skip (use the simple path), or cancel. Never auto-install.

If MINIMAX_API_KEY is missing, the redirect is to the base skill, not "install MiniMax" — the user may not have a Token Plan at all.

Local analysis memory (separate from generation)

Generation runs on MiniMax's cloud — your laptop just sends the prompt and downloads the MP3, so generation itself uses negligible local memory.

However, this skill's local analysis scripts run on your machine and can use real memory. Before running the full analysis pipeline, check available memory:

Script	Models loaded	Approx peak RAM
`analyze_vocal_emotion.py`	`parselmouth` (Praat) + `scipy`	~500 MB
`analyze_audio.py`	`librosa` + `transformers` (MERT or MIT AST)	2–4 GB
`extract_lyrics_whisper.py`	`whisper` model (tiny/base/medium)	1–5 GB depending on model size
`extract_stems.py`	`Demucs` (htdemucs)	2–4 GB
`emotion_to_prompt.py`	calls MiniMax API; negligible local memory	<100 MB
`compute_audio_embedding.py`	MERT model	1–2 GB
`classify_instruments.py`	MIT AST	1–2 GB

Combined (full analysis pipeline on a 4-min song): ~6–10 GB peak on top of OS and other apps. On unified-memory systems (Apple Silicon, integrated graphics), this competes directly with the operating system and your other applications. On dedicated-GPU systems (NVIDIA, AMD), model memory comes out of system RAM unless the models actually run on the GPU (for example via CUDA).

Recommendations:

Close heavy apps (browser with many tabs, IDE, Docker) before running the full pipeline
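For extract_lyrics_whisper.py, use the tiny model by default when memory is the constraint; base/medium are 2-5x heavier. If transcription accuracy matters more than memory, follow the Extraction guidance section instead (large-v2 for sung lyrics)
For extract_stems.py, the --quality flag selects the Demucs model. The default htdemucs is the lighter choice; in upstream Demucs, htdemucs_ft is the fine-tuned variant and takes roughly four times as long to run, so avoid it when resources are tight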
If you run out of memory, run analysis steps individually rather than via analysis_orchestrator.py (which loads everything)

The scripts/smoke_test.py script verifies the environment is set up; it does not test memory headroom. Run your own memory check before running a full analysis.

Free Tool Augmentation (Input Enrichment)

The OpenClaw runtime exposes several free tools (web_fetch, web_search, image analysis, memory, browser) that enrich the music generation workflow. The base layer is documented in music-craft → Free Tool Augmentation and references/free-tool-inputs.md. This section shows how they compose with MiniMax-specific features.

Quick recap of free tools

Tool	Purpose
`web_fetch`	Fetch URL content (lyrics pages, YouTube metadata, Wikipedia)
`web_search`	Find lyrics, artist info, genre descriptions
`image` / `MiniMax__understand_image`	Analyze album art, concert photos, music video screenshots
`memory_search` / `memory_get`	Recall user's prior music preferences
`browser`	JS-heavy site fallback (last resort)

MiniMax compositions (high-value combos)

web_fetch + lyrics_generation: fetch the user's draft from a URL, run it through edit mode for cleanup, generate.
web_search + cover workflow: find covers in the target style, extract their characteristics, apply to the user's track.
image + mmx per-flag control: analyze album art, translate to --instruments, --bpm, --key, --structure for fine-grained style matching.
memory + emotion analysis: combine the user's prior preferences with deep audio analysis of a reference track.

For the full worked examples, parameter recommendations, and MiniMax-specific edge cases, see references/free-tool-inputs.md.

Operating Rules

Same 6-step loop as music-craft, with MiniMax-specific extensions:

Read and auto-detect — same
Ask only the ambiguous parts — same, plus ask if the user wants cover / mashup / standard
Translate to a production-sheet prompt — same, but consider whether to use mmx flags (see references/mmx-flags-reference.md) instead of packing everything into the prompt
Structure the lyrics — same, plus consider lyrics API for generation or edit (see references/lyrics-generation.md)
Generate and verify — same, plus the music-cover model for melody preservation
Iterate — same, plus emotion analysis to inform the next prompt adjustment

For the full 6-step detail, see music-craft → Operating Rules.

Song length (no native duration control in the MiniMax API)

Unlike music-craft's ACE-Step backend (which takes audio_duration as a parameter), the MiniMax Music 2.6 API has no official duration parameter. Output length is determined by:

Lyrics length (primary): each [Verse]/[Chorus] section takes ~15-30 seconds depending on word count and singing pace. A typical 3:30 song has ~150-200 lyrics words across 2 verses + 2 choruses + bridge.
Structure tags: [Intro], [Instrumental Break], [Outro] add silent/sparse sections that extend total length without lyrics.
Prompt hints (secondary): phrases like "3 minute song" or "4 minute track" nudge the model toward that length.
BPM and section count (minor effect): faster BPMs and more sections tend to produce slightly longer outputs.

Practical recipe for a full 3:30 song:

Lyrics: ~150-200 words with [Verse 1], [Pre-Chorus], [Chorus], [Verse 2], [Bridge], [Outro] tags (full song structure, not just one chorus)
Prompt: include structure hints like "full 3-minute song with intro, 2 verses, 2 choruses, bridge, and outro" or use --structure "intro-verse-pre_chorus-chorus-verse-chorus-bridge-chorus-outro"
Check output length — if it's a 1-minute hook, the lyrics are probably too short
If output is too short: regenerate with longer lyrics (the model can't add sections that aren't in the lyrics)
If output is too long: trim lyrics to ~120 words or remove [Instrumental Break] tags, since each one adds a lyric-free section

Don't expect mmx to hit 3:30 exactly. Output length varies by ±20-30s depending on the model. If you need precise length, ACE-Step is the right tool (it has audio_duration). If you want MiniMax's vocal quality and the song length is flexible, mmx is fine.
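The length heuristics above can be turned into a rough pre-flight estimate. The sketch below is not part of mmx: `estimate_duration_range` and `word_count` are hypothetical helper names, and the only input is the ~15-30 seconds-per-section rule of thumb quoted above, so treat the result as a sanity check rather than a prediction.

```python
import re

# Rule of thumb from the text above: each tagged section runs roughly 15-30 s.
SECONDS_PER_SECTION = (15, 30)

# A section tag is a line that contains only [Something], e.g. [Verse 1].
TAG_LINE = re.compile(r"^\s*\[[^\]]+\]\s*$", flags=re.MULTILINE)

def estimate_duration_range(lyrics: str) -> tuple[int, int, int]:
    """Return (section_count, min_seconds, max_seconds) for tagged lyrics."""
    n = len(TAG_LINE.findall(lyrics))
    return n, n * SECONDS_PER_SECTION[0], n * SECONDS_PER_SECTION[1]

def word_count(lyrics: str) -> int:
    """Count sung words, ignoring the section-tag lines."""
    return len(TAG_LINE.sub("", lyrics).split())
```

If the estimate tops out well below the target length, the lyrics are probably too short for a full song, which matches the "1-minute hook" symptom described in the recipe.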

mmx CLI Quick Reference

The mmx CLI exposes MiniMax Music 2.6 parameters as separate flags. This gives finer control than packing everything into a single prompt string.

The most useful flags:

Flag	Effect	Example
`--avoid`	Elements to avoid (comma-separated)	`--avoid "sparse, a cappella, electronic sounds"`
`--bpm`	Exact BPM	`--bpm 80`
`--key`	Musical key	`--key "E minor"`
`--structure`	Song structure	`--structure "intro-verse-pre_chorus-chorus-verse-chorus-bridge-chorus-outro"`
`--vocals`	Vocal style	`--vocals "passionate French male vocal"`
`--instruments`	Featured instruments	`--instruments "accordion, upright bass, strings, piano"`
`--genre`	Genre	`--genre "french chanson"`
`--mood`	Mood	`--mood "melancholic romantic dramatic"`
`--lyrics-optimizer`	Auto-generate lyrics from prompt	(flag only)
`--model`	Model name	`--model music-2.6` (paid, highest RPM) or `--model music-2.6-free` (default, free tier)
`--cover-feature-id`	Use a preprocessed cover (two-step workflow)	(from preprocess call)

Full reference with all flags and examples: references/mmx-flags-reference.md.

When to use mmx vs music_generate:

mmx: when you need fine control over specific parameters (BPM, key, structure as separate flags)
music_generate: when the prompt-only path is enough, and you want to keep the workflow provider-agnostic

Both produce equivalent results if the prompt and flags are aligned.
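When flags are assembled programmatically, building an argument list avoids shell-quoting mistakes in long comma-separated values. The following is a minimal sketch, assuming only the flags listed in the table above; `build_mmx_command` is a hypothetical helper, not something shipped with mmx.

```python
import shlex

def build_mmx_command(prompt: str, out: str, **flags) -> list[str]:
    """Assemble an argv list for `mmx music generate` from keyword flags."""
    argv = ["mmx", "music", "generate", "--prompt", prompt]
    for name, value in flags.items():
        flag = "--" + name.replace("_", "-")   # lyrics_optimizer -> --lyrics-optimizer
        if value is True:                      # boolean switch, no argument
            argv.append(flag)
        elif value is not None and value is not False:
            argv += [flag, str(value)]
    argv += ["--out", out]
    return argv

cmd = build_mmx_command(
    "French chanson, accordion, strings, passionate French vocal",
    "chanson.mp3",
    bpm=80,
    key="E minor",
    avoid="sparse, a cappella, electronic sounds",
    lyrics_optimizer=True,
)
print(shlex.join(cmd))  # shell-safe rendering for logging or copy-paste
```

Passing the list directly to `subprocess.run(cmd)` keeps multi-word values such as the `--avoid` list intact without manual escaping.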

mmx Music Generation — verified patterns (June 9, 2026)

End-to-end verified invocations from this session (M5_idkw_dreampop_shoegaze + M5_idkw_opera_metal in ~/Music mix/hello_cleveland/i_dont_know_why/):

Pattern A — full song with detailed prompt + 6 metas (production-grade output)

mmx music generate \
  --prompt "dream pop reimagining, shoegaze-influenced indie rock turned ethereal and cinematic.
My Bloody Valentine meets Slowdive meets Radiohead.
Male lead vocal, breathy and vulnerable, double-tracked with slight detuning and tape warmth.
Wall of clean electric guitars with heavy chorus pedal and tremolo picking.
Shimmering washes of reverb, sub-bass synth pad foundation, soft brushed electronic drums.
Glockenspiel and celesta melody line high above the mix.
Organ pads swelling at choruses, reversed guitar samples between sections.
Heavy reverb and analog warmth throughout, lo-fi texture.
Emotional arc: hazy drifting opening building wave confusion overwhelming beautiful climax fading dreamlike denouement outro.
Avoid: sharp percussive agresivo distortion clear upfront vocals minimal sparse.
Tempo 96 BPM in D major, dreamlike half-time feel.
Suitable as a slow-burn alt-pop anthem, melodic and textural, intimate verses and soaring choruses.
Modern production, polished mix, atmospheric vocal production where vocals sit among the instruments rather than above them." \
  --lyrics-file gen1_lyrics.txt \
  --model music-2.6 \
  --vocals "breathy vulnerable male lead, double-tracked with slight detuning" \
  --genre "dream pop, shoegaze-influenced indie" \
  --mood "hazy confusion building to overwhelming beautiful release, then dreamlike fade" \
  --instruments "wall of clean electric guitars with heavy chorus pedal, sub-bass synth pad, soft brushed electronic drums, glockenspiel, celesta, organ pads, reversed guitar samples" \
  --bpm 96 \
  --key "D major" \
  --structure "intro-verse-pre_chorus-chorus-post_chorus-verse2-chorus-repeat-outro" \
  --use-case "slow-burn alt-pop anthem, suitable for late-night listening" \
  --avoid "sharp percussive agresivo distortion, clear upfront vocals, minimal sparse arrangement" \
  --references "My Bloody Valentine, Slowdive, Radiohead" \
  --out M5_idkw_dreampop_shoegaze.mp3

Output: 167.9s MP3, 5.4 MB, -8.8 LUFS, 5.7 LRA (good dynamics).

Pattern B — crazy combo: opera vocals + heavy metal music (for fun experiments)

mmx music generate \
  --prompt "extreme dramatic contrast: powerful operatic tenor vocals over heavy metal instrumentation.
Like Freddie Mercury fronting Metallica. Epic, theatrical, over the top.
Thunderous double bass drums, distorted electric guitars with palm-muted chugging,
guttural rhythm section, blast beats, tremolo picking, minor key riffing.
Operatic vocals soaring above the metal wall of sound, belting high notes with vibrato.
Gothic theatrical atmosphere, dramatic dynamic shifts from whisper-quiet verses
to explosive metal choruses. Anthem-like, stadium-ready." \
  --lyrics-file gen2_lyrics.txt \
  --model music-2.6 \
  --vocals "operatic tenor, powerful Freddie Mercury style, vibrato, theatrical belting" \
  --genre "symphonic metal" \
  --mood "dramatic, theatrical, anthemic, intense" \
  --instruments "distorted electric guitars, double bass drums, blast beats, orchestral strings" \
  --tempo "fast" \
  --bpm 160 \
  --key "D minor" \
  --structure "verse-pre_chorus-chorus-verse-pre_chorus-chorus-outro" \
  --use-case "epic music experiment" \
  --avoid "pop, soft, gentle, acoustic, slow" \
  --out M5_idkw_opera_metal.mp3

Output: 155.8s MP3, 5.0 MB, -9.6 LUFS, 4.3 LRA (compressed but still has dynamics).

Model selection

Model	When to use	Cost
`music-2.6`	Production work, full quality	Token Plan / paid
`music-2.6-free` (default)	Free tier, lower RPM, "unlimited" quota for some plans	Free
`music-2.5+`	Older model, still good quality	Token Plan / paid
`music-2.5`	Legacy	Token Plan / paid
`music-cover`	Cover/re-interpretation of source audio (one-step)	Token Plan / paid
`music-cover-free`	Free cover variant	Free

music-2.6-free is the default for most users — same model, free tier. The mmx CLI uses it as the default when no --model is specified.

`is_instrumental` and `lyrics_optimizer` flags (MiniMax-specific paths)

The mmx CLI exposes two important flags that bypass the --lyrics requirement:

Flag	What it does	When to use
`--instrumental`	Generate music without vocals (no lyrics needed)	When user wants BGM, intro, soundtrack, loop
`--lyrics-optimizer`	Auto-generate lyrics from the prompt (no `--lyrics` needed)	When user says "make me a song about X" but doesn't have lyrics

Examples:

# Pure instrumental (no vocals)
mmx music generate \
  --prompt "Instrumental only, no vocals, no lyrics. Loopable coffee shop background, soft piano, brushed drums, 90 BPM, C major" \
  --instrumental \
  --length 180000 \
  --out coffee_bgm.mp3

# Auto-generated lyrics from prompt
mmx music generate \
  --prompt "Upbeat indie folk, melancholic but hopeful, male vocal, acoustic guitar, 100 BPM" \
  --lyrics-optimizer \
  --out indie_folk.mp3

Note: `--length` takes milliseconds (`--length 180000` is 3 minutes). The flag is mmx-specific; the underlying MiniMax API has no official duration parameter, so output length can still vary as described under Song length.

URL expiration warning

mmx music generate returns a saved: path. If you ever use --output-format url (the official API default), the URL expires after 24 hours. Download immediately. The mmx CLI auto-downloads to --out so this is not a problem when using --out directly.

Cover Workflow

Two cover backends exist — pick by what's available:

MiniMax cloud cover (this skill): mmx music cover, melody-preserving via MiniMax's music-cover model. Needs MINIMAX_API_KEY and network access to MiniMax.

Local ACE-Step cover (in music-craft): task_type=cover with the source audio uploaded (multipart) and audio_cover_strength controlling how far to restyle. Fully local, no cloud, follows the source melody/structure. Caveat: a full-length cover is slow and VRAM-heavy on a ~12 GB GPU and can hit the server's 600 s generation timeout — cover a shorter segment or raise ACESTEP_GENERATION_TIMEOUT. See music-craft's "ACE-Step Audio-Conditioned Generation" section.

So if MiniMax is unavailable (no key, or blocked on your network), you can still do a melody-aware cover locally with ACE-Step — it is not cloud-only. Only pure text-prompt generation (no source audio) is a "reimagining" rather than a cover.

Cover workflow preserves the original song's melody while applying a different style. Two paths:

One-step (quick):

mmx music cover \
  --prompt "French chanson, accordion, strings, passionate French vocal, 80 BPM" \
  --audio-file /tmp/original.ogg \
  --out /tmp/cover.mp3

MiniMax extracts lyrics via ASR and applies the new style.

Two-step (more control):

Preprocess the audio to extract features and structure
Edit the lyrics (correct ASR errors, add section tags)
Generate with the edited lyrics

The two-step path gives better results when the original lyrics need correction or when the user wants different lyrics in the new style.

Full detail with payload examples, error handling, and use cases: references/cover-workflow.md.

Lyrics Generation

MiniMax has a dedicated lyrics_generation endpoint that produces structured lyrics (with [Verse], [Chorus], etc. tags) from a theme prompt. Two modes:

write_full_song — create new lyrics from a theme
edit — modify existing lyrics (e.g., make the chorus stronger, shift to a hopeful ending)

The output is structured lyrics that can be passed directly to music_generate or mmx music generate.

Full detail with API examples, parameters, and use cases: references/lyrics-generation.md.

Web Lyrics Lookup (LRCLib)

As an optional complement to Whisper transcription, the orchestrator can look up song lyrics from LRCLib (open, no auth, JSON API at https://lrclib.net/api) when the song is a known mainstream track. This is a graceful fallback — Whisper is the primary source, LRCLib is a quality boost for the right song.

Coverage reality check: LRCLib has good coverage for mainstream vocal music (pop, rock, hip-hop, R&B, country) and is poor or empty for:

Instrumental tracks (Joe Satriani, King Crimson, much jazz, classical)
Obscure bands / friend bands
Live / bootleg / unofficial releases
Non-English lyrics for English titles (and vice versa)

When LRCLib is empty (the expected case for instrumentals), the script returns no_web_lyrics and the caller silently uses Whisper. This is the designed path, not a failure.

CLI usage:

# Standalone lookup
python3 scripts/fetch_lyrics_web.py \
  --artist "Coldplay" --title "Yellow" \
  --whisper-transcript "look at the stars..." \
  --min-match 0.6 --json

Orchestrator integration via the --lyrics-source flag:

Value	Behavior
`whisper` (default)	Always use Whisper, never touch the web
`web`	Always try LRCLib, never run Whisper
`auto`	Whisper first; if the song is recognized AND LRCLib returns a confident match (>60% word overlap), use LRCLib; otherwise fall back to Whisper
`off`	Skip lyrics extraction entirely

The orchestrator auto-detects artist and title from the audio path stem (e.g. Coldplay - Yellow.wav → artist="Coldplay", title="Yellow"). Pass --name-a "Artist - Title" to override.

The result includes a web_lookup sub-dict with status, match_score, and the plain lyrics (when matched), so you can inspect what was used and why.

Full detail with scoring heuristic and exit codes: see scripts/fetch_lyrics_web.py docstring.
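To make the `auto` policy concrete, here is a minimal sketch of the two steps it relies on: splitting a file stem into artist and title, and comparing the Whisper transcript with web lyrics. It is not the code in scripts/fetch_lyrics_web.py; the `" - "` separator, the set-based overlap metric, and all function names are assumptions for illustration.

```python
import re
from pathlib import Path

def parse_artist_title(audio_path: str):
    """Split 'Artist - Title.ext' into (artist, title); None if no separator."""
    stem = Path(audio_path).stem
    if " - " not in stem:
        return None
    artist, title = stem.split(" - ", 1)
    return artist.strip(), title.strip()

def _words(text: str) -> set:
    """Lowercased set of word tokens (letters, digits, apostrophes)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def word_overlap(whisper_text: str, web_lyrics: str) -> float:
    """Fraction of distinct Whisper words that also appear in the web lyrics."""
    heard = _words(whisper_text)
    return len(heard & _words(web_lyrics)) / len(heard) if heard else 0.0

def choose_lyrics(whisper_text: str, web_lyrics, min_match: float = 0.6):
    """Prefer web lyrics only on a confident match; otherwise keep Whisper."""
    if web_lyrics and word_overlap(whisper_text, web_lyrics) > min_match:
        return "web", web_lyrics
    return "whisper", whisper_text
```

An empty web result (the expected case for instrumentals) falls straight through to the Whisper branch, mirroring the no_web_lyrics behavior described above.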

Mashup Workflow

The signature MiniMax-specific feature: combine Song A (content + emotion) with Song B (style).

Workflow:

Get Song A (audio file, YouTube URL, or song name)
Get Song B (audio file, YouTube URL, or song name)
Run emotion analysis on Song A (if audio available) to extract the emotional arc
Build a prompt that applies Song B's style to Song A's content and emotion
Generate using the cover workflow (preserves melody) or standard generation (creative reimagining)

This is the most powerful feature in this skill. The output preserves what makes Song A recognizable (lyrics, melody, emotion) while applying Song B's production style.

Full detail with the emotion-to-prompt conversion and the two-song analysis script: references/mashup-workflow.md and references/emotion-analysis.md.

Emotion Analysis

Emotion analysis extracts per-section features from input audio:

Intensity (loudness) — drives dynamic range
Pitch (Hz range, trend) — drives vocal intensity
Vocal effort (low / medium / high) — drives delivery style
Breathiness — drives intimacy vs full voice
Spectral centroid (brightness) — drives timbre matching
Emotion classification (list of 30+ emotions) — drives mood keywords for the prompt
Repetitive intensification — drives chorus build
Emotional shifts (sudden vs gradual) — drives transitions
Vocal speed (syllables per second) — drives elongation cues
Pitch bends at phrase endings — drives emotional emphasis

The analysis outputs JSON that the emotion_to_prompt.py script converts into a ready-to-use production-sheet prompt.

Local-only path (when MiniMax is unavailable): emotion_to_prompt.py calls the MiniMax cloud, so it fails when MiniMax is blocked or no key is set. In that case build the prompt locally from the analysis JSON without that script: take the extracted BPM and key/scale as explicit metadata fields; turn the energy curve and spectral brightness into texture words; turn the emotion classification and intensity curve into mood words and dynamic section tags; and feed transcribed lyrics (full-mix Whisper) as the lyric body. This is the same data, assembled by the agent instead of the cloud helper, and it feeds any backend (including a local model).
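The local assembly described above could look roughly like the sketch below. The field names (`bpm`, `key`, `spectral_centroid_hz`, `emotions`) are hypothetical stand-ins for whatever the analysis JSON actually contains, and the 2000 Hz brightness cut-off is purely illustrative.

```python
def build_local_prompt(analysis: dict, lyrics: str) -> dict:
    """Assemble generation inputs from analysis JSON without the cloud helper."""
    parts = []
    centroid = analysis.get("spectral_centroid_hz")   # hypothetical field name
    if centroid is not None:
        # 2000 Hz is an illustrative cut-off, not a calibrated threshold.
        parts.append("bright, airy timbre" if centroid >= 2000 else "warm, dark timbre")
    emotions = analysis.get("emotions") or []          # hypothetical field name
    if emotions:
        parts.append("mood: " + ", ".join(emotions))
    return {
        "bpm": analysis.get("bpm"),    # explicit metadata field
        "key": analysis.get("key"),    # explicit metadata field
        "prompt": ". ".join(parts),    # texture and mood words
        "lyrics": lyrics,              # e.g. the full-mix Whisper transcript
    }
```

Keeping BPM and key as separate fields rather than folding them into the prompt string lets the same dictionary feed either per-flag backends or a local model.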

Scripts: scripts/analyze_vocal_emotion.py, scripts/analyze_audio.py, scripts/emotion_to_prompt.py.

Full detail: references/emotion-analysis.md.

For the generation side — how to use the analysis to evoke emotion in the OUTPUT, the 21 emotion recipes (joy, desperation, melancholy, triumph, yearning, anger, vulnerability, confidence, nostalgia, anxious, hopeful, tragic, heroic, tender, sensual, lonely, playful, haunting, serene, celebratory, bittersweet), the iteration loop, and common mistakes — see references/emotion-delivery.md.

Analysis Quality (Summary Format, Confidence, Fallbacks)

Analysis scripts in scripts/ produce different views (emotion, beats, melody, structure, instrumentation). The skill expects them to converge on a single compact summary so downstream code and humans can read the same shape regardless of which scripts ran.

Compact Analysis Summary

Every analysis result should include a summary object with these keys:

Key	Type	Meaning
`tempo`	string	BPM value with confidence, e.g. `120 BPM (confidence 0.92)`
`key`	string	Detected key, e.g. `E minor (confidence 0.71)`
`sections`	list	Section labels with timing, e.g. `[{"label": "verse", "start": 0.0, "end": 28.5}, ...]`
`instrumentation`	list	Detected instrument palette, e.g. `["electric guitar", "drums", "bass"]`
`vocal_traits`	dict	Breathiness, intensity, pitch range, e.g. `{"breathiness": "high", "intensity": "medium"}`
`energy_curve`	list	Per-section energy values, e.g. `[{"t": 0, "energy": 0.6}, ...]`
`hook_points`	list	Timestamps of detected hooks, e.g. `[12.4, 48.0]`
`mix_notes`	list	Short strings, e.g. `["vocal upfront", "wide stereo drums", "rolled-off highs"]`

Scripts may add their own fields, but every script must return at least the keys above (use empty list / unknown string when a key has no data).
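One way to enforce the "at least these keys" rule is a small normalizer that each script calls before returning. This is a sketch, not code from the skill: `normalize_summary` is a hypothetical name, and using an empty dict as the default for `vocal_traits` is an assumption (the text only specifies empty list or an unknown string).

```python
REQUIRED_KEYS = {
    "tempo": "unknown",
    "key": "unknown",
    "sections": [],
    "instrumentation": [],
    "vocal_traits": {},   # assumed default; the text names only list/unknown
    "energy_curve": [],
    "hook_points": [],
    "mix_notes": [],
}

def normalize_summary(summary: dict) -> dict:
    """Return a copy of `summary` with every required key present."""
    out = dict(summary)
    for key, default in REQUIRED_KEYS.items():
        if out.get(key) is None:
            # Copy mutable defaults so callers never share one list or dict.
            out[key] = default.copy() if hasattr(default, "copy") else default
    return out
```

Script-specific extra fields pass through untouched, so the normalizer only ever adds keys and never removes data.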

Confidence Levels

Every numeric or categorical detection in the analysis must carry a confidence value so weak detections do not get treated as facts.

Confidence	Numeric range	Interpretation
`clear`	n/a	The detection is unambiguous (e.g. user-supplied text, MIDI-confirmed key).
`high`	`>= 0.75`	Strong evidence from multiple sources or models.
`medium`	`0.5 - 0.74`	Reasonable evidence but alternative interpretations exist.
`low`	`< 0.5`	Weak signal; treat as a hint, not a fact.
`inferred`	n/a	Not measured directly; derived from context (e.g. lyrics from a YouTube URL).
`missing`	n/a	Not available; the analysis did not run or did not find evidence.

When feeding analysis into a prompt, prefix any low or medium detection with a hedge like "around" or "approximately", and never include missing values as if they were facts.
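The numeric rows of the table and the hedging rule can be expressed in a few lines. This is a minimal sketch with hypothetical function names; it covers only the numeric levels (`high`, `medium`, `low`), not the `clear`, `inferred`, or `missing` labels, which have no numeric range.

```python
def confidence_label(score: float) -> str:
    """Map a numeric confidence to the numeric levels in the table above."""
    if score >= 0.75:
        return "high"
    if score >= 0.5:
        return "medium"
    return "low"

def hedge(value: str, score: float) -> str:
    """Prefix low or medium detections with a hedge word before prompting."""
    return value if confidence_label(score) == "high" else f"around {value}"

print(hedge("120 BPM", 0.92))
print(hedge("E minor", 0.71))
```

With the example values from the summary table, the tempo passes through unchanged while the key detection is hedged.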

Fallback Behavior for Missing Optional Dependencies

The advanced analysis scripts depend on optional packages (librosa, parselmouth, transformers, demucs, beat_this, basic_pitch, etc.). Each script must:

Try to import the optional dependency at the top of the function.
On ImportError, return a JSON object that includes {"error": "install with pip install X", "summary": {}} instead of raising.
Never let a missing optional dependency crash the whole workflow.
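The three rules above amount to a guarded import. Below is a minimal sketch using the standard library's `importlib`; `run_with_optional` and the `analyze` callback are hypothetical names, and the hint assumes the pip package name equals the import name, which is not true for every package.

```python
import importlib

def run_with_optional(module_name: str, analyze):
    """Call analyze(module) if the package imports; else return an error object."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Assumption: pip name == import name. Adjust the hint where they differ.
        return {"error": f"install with pip install {module_name}", "summary": {}}
    return analyze(module)
```

Because the failure is returned as data rather than raised, a caller can collect results from several scripts and keep going when one dependency is absent.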

The orchestrator at scripts/analysis_orchestrator.py collects per-script results and continues even if some scripts failed. The combined summary simply omits keys whose underlying analysis could not run. The linter, prompt builder, and generation step all read the summary and skip missing keys without erroring.

This means a user without demucs installed can still get tempo, key, and structure analysis from the base pipeline. The only loss is the per-stem vocal analysis, which is opt-in via --use-demucs.

Rate Limits (MiniMax-specific)

The MiniMax Music 2.6 documented limits are:

RPM: 120 requests per minute
Concurrent connections: 20
Output URL expiry: 24 hours (download the audio promptly)
Cover feature ID validity: 24 hours (use the preprocess output within a day)

Under the Token Plan 3.0 (June 2026+), the actual quota is credit-based rather than RPM-based:

A unified general credit pool covers M3, M2.7, and M2.7-highspeed
A 5-hour rolling window resets continuously
A weekly window runs Monday 02:00 CEST → next Monday 02:00 CEST
Weekly status may be inactive on Plus plan (no weekly cap enforced, but the schema is there)

Practical implication: the documented 120 RPM is the API limit, but the Token Plan 3.0 quota determines your real ceiling. If you send 4500 requests in 5 hours on Plus, which averages only 15 requests per minute, you will still be rate-limited even though you never approached 120 RPM.

Before submitting a batch, check the active plan:

# Check current Token Plan usage
curl -s -H "Authorization: Bearer $MINIMAX_API_KEY" \
  https://www.minimax.io/v1/token_plan/remains | jq .

If a call fails with 429 (rate limit):

Wait at least 60 seconds.
Check the Token Plan usage endpoint.
If 5h window is exhausted, wait for the reset.
Reduce concurrency if running a batch.
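The first step of the 429 procedure can be automated with a small wrapper. This is a sketch under stated assumptions: `send` is a hypothetical callable that performs one request and returns a `(status, body)` pair, and the quota check in step 2 is left to the caller.

```python
import time

def call_with_backoff(send, max_attempts: int = 3, wait_seconds: float = 60.0,
                      sleep=time.sleep):
    """Call send() until it returns a non-429 status or attempts run out."""
    status, body = send()
    for _ in range(max_attempts - 1):
        if status != 429:
            break
        sleep(wait_seconds)   # the steps above recommend waiting at least 60 s
        status, body = send()
    return status, body
```

The `sleep` parameter is injectable so the wrapper can be exercised in tests without real delays. If the final status is still 429, the 5-hour window is the likely culprit and retrying sooner will not help.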

Anti-Sparse (MiniMax-Specific Deep Dive)

The base skill's anti-sparse rules apply, but the failure mode is more severe on MiniMax than on other providers.

MiniMax interprets "sparse" or "minimal" as "remove all instruments". The model has been observed to:

Remove all instruments in quiet sections when the prompt uses the word "quiet"
Drop percussion entirely when the prompt uses "intimate"
Go a cappella on build-up sections when the prompt uses "build"

Mitigation:

Never use the words "sparse", "minimal", "stripped back", "quiet" in a MiniMax prompt without pairing them with explicit instruments.
Always add: "ALL instruments ALWAYS playing throughout, NEVER go a cappella or silent at any point".
Always list every instrument you want to hear.
For quiet sections, use the explicit form: "quiet sections: reduced to accordion and bass only, still fully played, NOT silent".

If a generation comes back sparse despite these rules, retry once with an even more explicit instrument list. If it fails again, the prompt has a structural issue — try a different style.

For the canonical anti-sparse text and worked examples, see the base skill's Anti-Sparse Rules section.
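The mitigation rules lend themselves to a quick automated check before submitting a prompt. The sketch below is not the skill's scripts/lint_music_request.py; `anti_sparse_warnings` is a hypothetical helper that only flags the risky words and the missing guard phrase, and cannot verify that instruments were actually paired with them.

```python
import re

RISKY_WORDS = ("sparse", "minimal", "stripped back", "quiet")
GUARD = "ALL instruments ALWAYS playing throughout"

def anti_sparse_warnings(prompt: str) -> list[str]:
    """Return human-readable warnings for MiniMax anti-sparse risks."""
    lowered = prompt.lower()
    warnings = []
    for word in RISKY_WORDS:
        # Whole-word match so "minimalist" or "quietly" are not flagged.
        if re.search(rf"\b{re.escape(word)}\b", lowered):
            warnings.append(f"risky word '{word}': pair it with explicit instruments")
    if GUARD.lower() not in lowered:
        warnings.append("missing guard phrase about all instruments always playing")
    return warnings
```

A warning is a prompt to review, not a hard failure: the explicit quiet-section form shown above still contains the word "quiet" and is intentionally allowed once instruments are named.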

Quality Verification Checklist

Same 8-point checklist as the base skill, plus 4 MiniMax-specific items:

Cover preserves melody recognisably. If the user asked for Song X in the style of Song Y, the new version should be recognisable as Song X's melody with Song Y's production style.
Emotion curve matches Song A (for mashups). The dynamic arc of the output should follow the original's intensity, not flatten to a single energy.
--avoid flags are respected. If the user said "no electronic sounds", the output should not have synths.
Per-flag control worked (BPM, key, structure). If the user asked for 80 BPM in E minor, the output should measure at that tempo and in that key, not merely "close enough".

Output Verification (Covers, Mashups, Style Transfer)

After generation, run a post-generation check that is specific to the route. Use the analysis orchestrator's output on the generated file when possible.

Verification Checklist per Route

Cover (minimax_cover)

Melody recognisable as the source (basic-pitch MIDI compare or ear-test)
Target style is clearly audible (genre/mood keywords present)
Source BPM is within ±10 BPM
Source key is preserved (or user agreed to shift)
Lyrics decision respected (original / translated / new / instrumental)
--avoid flags respected

Mashup (minimax_mashup)

Song A's lyrics and emotional arc recognisable
Song B's style is dominant in the production
Vocal intensity matches Song A's emotion curve
Section structure feels coherent (not random)
--avoid flags respected for Song B's style

Style Transfer (minimax_style_transfer)

Source style (the reference track) is reproduced in timbre, instrumentation, and feel
Output melody is NOT recognisable as the source (it is a new composition in the source style)
Target genre/mood keywords audible
BPM and key reasonable for the new style (not forced from source)

Emotion Prompt / Precision (minimax_emotion_prompt)

Output matches the per-flag values that were passed (BPM, key, structure, avoid)
Prompt language and flags do not contradict each other (linter clean)
Lyrics reflect the requested theme and language
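A prompt/flag contradiction check for tempo can be as simple as comparing the number in the prompt with the `--bpm` value. This is an illustrative sketch only; `bpm_conflict` is a hypothetical name and the real linter in scripts/lint_music_request.py may work differently.

```python
import re

def bpm_conflict(prompt: str, bpm_flag: int):
    """Return the prompt's BPM if it disagrees with --bpm, else None."""
    match = re.search(r"(\d{2,3})\s*BPM", prompt, flags=re.IGNORECASE)
    if match and int(match.group(1)) != bpm_flag:
        return int(match.group(1))
    return None
```

A non-None result means the prompt and the flag name different tempos; resolving that before generation avoids the "wrong tempo" failure caused by a disagreeing prompt and `--bpm`.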

Failure Signatures and Fixes

When the generated track does not match the request, identify the failure signature and apply the matching fix. The most common signatures:

Failure signature	Likely cause	Fix
Copied too closely (cover sounds like a remaster, not a new style)	Prompt did not specify the new style firmly enough, or `--avoid` list left the original instrumentation unguarded.	Add explicit target style language, list new instruments, expand `--avoid` with the source's dominant sounds. Re-run.
Lost source melody (cover no longer recognisable)	`--prompt` overrode the cover model, or the source audio was too noisy / clipped.	Switch to the two-step cover workflow (preprocess + generate with `cover-feature-id`); reduce style strength in the prompt.
Wrong tempo (BPM noticeably off)	Prompt and `--bpm` disagreed, or vocal delivery speed misled the detector.	Lint prompt + flags first. Re-run with the linter-clean pair. If still off, set `--bpm` explicitly and drop the BPM number from the prompt.
Wrong key (key shifted up/down)	Prompt mentioned a key but flags used another.	Lint the pair. Use the same key in both. If MIDI confirms a different source key, trust MIDI over prompt.
Muddy mix (low clarity, washed out)	Overly dense instrumentation, lack of anti-sparse guard, or too many `--avoid` exclusions.	Reduce instrument count, raise `--bpm` for tightness, add explicit "all instruments clearly audible".
Vocals too neutral (no emotion)	Emotion analysis not run, or intensity curve not transferred.	Run `analyze_vocal_emotion.py` on the source and feed `intensity_curve` into the prompt. Add explicit "vocal intensity: ..." clause.
Weak chorus (chorus does not lift)	Structure line lacks a build cue, or the prompt was a single energy.	Add structure with explicit build cues: "verse: intimate, chorus: soaring, all instruments louder in chorus".
Style mismatch (output does not match the requested genre)	Prompt used vague genre words or the wrong dominant instrument.	Replace vague words with concrete genre + instrument list. Use the canonical `mmx` prompt schema in `references/examples.md`.

Revision Prompt Templates

When a generation comes back with one of the failure signatures above, build a revision prompt that preserves the source identity while changing the failing dimension.

Template: stronger style change (cover too close)

Same melody and lyrics as before. Re-imagine the production as [TARGET_STYLE] with [INSTRUMENT_LIST].
ALL instruments always playing throughout, never go a cappella.
Avoid: [STYLE_CONTRADICTING_WORDS from previous run].

Template: keep the melody (cover lost it)

Re-apply the original melody from the source audio. Keep the recognizable hook at [HOOK_TIME].
Use a softer production in [TARGET_STYLE] but DO NOT change the melodic contour.
Avoid: [WORDS_THAT_PUSHED_TOO_FAR].

Template: fix tempo drift

Keep the source BPM (use --bpm [SOURCE_BPM]). Do not slow down or speed up the vocal delivery.
Avoid: rubato, half-time, double-time, slowing down, speeding up.

Template: fix key shift

Stay in [SOURCE_KEY]. Do not transpose. Use the same chord progression as the source.
Avoid: key change, modulation, transpose.

Template: fix muddy mix

Make every instrument clearly audible. Reduce instrument count to [N].
Add contrast: quieter verses, louder choruses. Keep vocals upfront in the mix.
Avoid: dense layering, atmospheric washes, sustained pads throughout.

Template: lift the chorus

Chorus: soaring, all instruments louder than the verse, fuller chords, more reverb on the lead vocal.
Verse: intimate, single voice, soft drums, breathy delivery.
Bridge: build tension, add a melodic lift before the final chorus.

These templates pair with the failure-signature table. After the revision, re-run the verification checklist above.

Lyrics Optimizer Behavior

Same as the base skill — when music_generate is called without explicit lyrics, MiniMax auto-generates. With this skill, you can also call the lyrics_generation API directly to preview the lyrics before generation, or to iterate via the edit mode.

If the user wants specific words, the lyrics_generation API's edit mode lets you modify auto-generated lyrics to match the user's intent without regenerating the whole song.

Reference Map

references/mmx-flags-reference.md — full mmx CLI flag reference with worked examples
references/examples.md — practical MiniMax examples with routing, first questions, workflow shapes, and prompt/flag lint catches
references/cover-workflow.md — one-step and two-step cover workflow with payloads, error handling, use cases
references/lyrics-generation.md — the lyrics_generation API endpoint, both modes, examples
references/mashup-workflow.md — two-song mashup workflow, emotion-to-prompt conversion, decision tree
references/emotion-analysis.md — 25+ emotion classifications + per-emotion detection cookbook + emotion combinations + the analysis pipeline
references/emotion-delivery.md — 21 emotion recipes for the OUTPUT + iteration loop + common mistakes
references/advanced-audio-analysis.md — advanced free tools (Essentia, Demucs, Basic Pitch, Music21, CREPE) for deeper analysis when basic librosa/parselmouth is not enough
references/error-handling.md — MiniMax-specific error table, recovery patterns, anti-sparse failure recovery
scripts/check_environment.py — lightweight preflight diagnostic for Python, env vars, CLI tools, and optional packages
scripts/lint_music_request.py — standard-library helper for routing, blocker, missing-field, prompt, and mmx flag conflict checks
scripts/smoke_test.py — standard-library smoke tests for pure helper behavior
scripts/ — Python helpers for audio analysis (download, segment, analyze, convert emotion to prompt)
music-craft — base skill with shared concepts (Pre-Flight, anti-sparse, prompt formula, structure tags, Request Intake, User Preference Flow)
music-craft → references/free-tool-inputs.md — base layer for free tool inputs (web_fetch, web_search, image, memory)
references/free-tool-inputs.md — MiniMax layer: free-tool routing, blocker checks, and prompt/flag conflict lint before analysis

Safe Usage Recommendations

Review before installing. Use this only if you are comfortable sending music prompts, lyrics, URLs, images, and audio to MiniMax or related services. Run it in a virtual environment or sandbox, preinstall dependencies yourself, avoid the auto-install YouTube path, keep MINIMAX_API_KEY scoped and rotated if exposed, and choose output paths carefully because some helpers can overwrite files.

Capability Tags

crypto, requires-sensitive-credentials

Capability Assessment

ℹ Purpose & Capability

The declared purpose covers MiniMax music generation, covers, mashups, lyrics APIs, YouTube inputs, emotion analysis, and mmx CLI use. Image/video analysis and VLM captioning are broad but mostly documented as style-enrichment paths for music generation.

⚠ Instruction Scope

The documentation says optional tools should be installed only after user approval, but scripts/download_youtube.py automatically invokes pip install yt-dlp with --break-system-packages when yt-dlp is missing. That is a concrete mismatch between runtime behavior and stated operator control.

⚠ Install Mechanism

There is no separate hidden installer, but the runtime YouTube helper performs unpinned package installation from pip and mutates the Python environment automatically. That is high-impact environment authority for a skill helper script.

⚠ Credentials

Cloud MiniMax use, LRCLib lookup, YouTube download, and optional mmx vision calls are largely purpose-aligned and disclosed. However, compute_audio_embedding.py uses AutoModel.from_pretrained(..., trust_remote_code=True), which can execute Hugging Face repository code without a clear user-facing warning.

ℹ Persistence & Privilege

The skill writes analysis JSON, prompts, generated audio, temporary media, and cache files under user-selected paths, /tmp, and ~/.cache/openclaw. No background persistence or credential harvesting was found, but ffmpeg -y and copy operations can overwrite destination files.

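How to Use

Make sure OpenClaw is installed (locally or via Docker)
Enter the install command in the chat box: /install music-craft-minimax
Once installed, invoke the Skill by name or trigger it with /music-craft-minimax
Provide the required inputs described in the Skill's parameter documentation to get structured output

Version History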

v1.0.1

v1.0.1: Rename display name from 'OpenClaw Music Workflow — MiniMax' to 'Music Craft — MiniMax' for slug consistency. Bundle body unchanged from v1.0.0 (999-line SKILL.md, 10 reference docs, 21 scripts, 34 files, 587 KB).

v1.0.0

First stable release. MiniMax Music 2.6 power-user upgrade: cover/style transfer, two-song mashup, lyrics generation API, emotion-driven prompt engineering, fine mmx flag control, 26-test smoke suite. Inherits shared content (output file layout, rate limits, anti-sparse rules) from music-craft via cross-references.

Metadata

Slug music-craft-minimax

Version 1.0.1

License MIT-0

Total installs 1

Current installs 1

Number of versions 2

FAQ