Music Craft — MiniMax
/install music-craft-minimax
This is the power-user upgrade of music-craft. It does everything that skill does, plus the features that require the MiniMax Music 2.6 token plan:
- Cover and style transfer from a reference audio file or YouTube URL (preserves melody)
- Two-song mashup (Song A's content and emotion + Song B's style)
- Lyrics generation via the MiniMax API endpoint (with edit mode for iteration)
- Emotion analysis on input audio to drive prompt construction (vocal speed, intensity curve, pitch bends)
- Fine control over generation parameters (BPM, key, structure, avoid list as separate flags via mmx)
For everything else (standard song generation, instrumentation, anti-sparse prompt engineering, structure tags, user preference flow), this skill uses the same workflow as music-craft. Read that skill first to understand the base, then come back here for the MiniMax-specific extensions.
Routing and Blocker Checks
Classify the request before analysis or generation:
- Text-only style reference means the user gave a song name, artist, era, or genre cue without source audio. Treat it as style inference, not cover analysis.
- Reference audio or YouTube means the user provided a file or playable source that should be analyzed.
- Cover preserves melody and usually needs a source file plus a target style decision.
- Style transfer uses a reference track or analyzed audio as style input, then changes the production direction.
- Mashup needs Song A and Song B, plus a decision about which one contributes content and which one contributes style.
- Emotion prompt means the user wants analysis turned into descriptive prompt language, not a full cover.
The scripts/lint_music_request.py helper emits one of these routes:
| Route | When |
|---|---|
| base_prompt | Standard generation, no MiniMax-specific feature needed. |
| minimax_cover | Melody-preserving cover from audio or YouTube. |
| minimax_mashup | Two-song mashup (A + B, both identified). |
| minimax_style_transfer | Style transfer that does not preserve the source melody. |
| minimax_emotion_prompt | Emotion analysis, or precision mmx flag usage. |
| needs_clarification | At least one blocker is unresolved; ask the user first. |
Surface blockers before analysis:
- no source file or usable URL
- unclear which track is Song A versus Song B
- missing target style
- missing lyrics decision, such as original, translated, rewritten, or instrumental
- conflicting cover/style-transfer intent: the user asked for both "cover" (preserve melody) and "style transfer" (reproduce style) at once. These are mutually exclusive. Ask the user to pick one.
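The routing and blocker rules above can be sketched as a small classifier. This is an illustration only, not the logic of scripts/lint_music_request.py: the Request record, its field names, and the order in which routes are tested are assumptions made for the example. Only the six route names and the blocker messages come from this document.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    """Hypothetical intake record; field names are illustrative only."""
    wants_cover: bool = False
    wants_style_transfer: bool = False
    wants_mashup: bool = False
    wants_emotion_prompt: bool = False
    has_source: bool = False                # local file or usable URL
    song_a: Optional[str] = None
    song_b: Optional[str] = None
    target_style: Optional[str] = None
    lyrics_decision: Optional[str] = None   # original / translated / rewritten / instrumental


def find_blockers(req: Request) -> List[str]:
    """Collect every unresolved blocker before any analysis runs."""
    blockers = []
    needs_source = req.wants_cover or req.wants_style_transfer or req.wants_mashup
    if needs_source and not req.has_source:
        blockers.append("no source file or usable URL")
    if req.wants_mashup and not (req.song_a and req.song_b):
        blockers.append("unclear which track is Song A versus Song B")
    if (req.wants_cover or req.wants_style_transfer) and not req.target_style:
        blockers.append("missing target style")
    if needs_source and req.lyrics_decision is None:
        blockers.append("missing lyrics decision")
    if req.wants_cover and req.wants_style_transfer:
        blockers.append("conflicting cover/style-transfer intent")
    return blockers


def classify(req: Request) -> str:
    """Return one of the six route names listed in the table above."""
    if find_blockers(req):
        return "needs_clarification"
    if req.wants_mashup:
        return "minimax_mashup"
    if req.wants_cover:
        return "minimax_cover"
    if req.wants_style_transfer:
        return "minimax_style_transfer"
    if req.wants_emotion_prompt:
        return "minimax_emotion_prompt"
    return "base_prompt"


cover = Request(wants_cover=True, has_source=True,
                target_style="reggaeton", lyrics_decision="original")
print(classify(cover))
```

Collecting all blockers at once, rather than stopping at the first, lets the follow-up question cover everything in a single message.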
After you have prompt text and mmx flags, lint them together before generation:
- compare prompt BPM with --bpm
- compare prompt key with --key
- compare prompt structure line with --structure
- compare prompt duration with --duration (or implicit length expectation)
- compare prompt vocal mode with --vocals
- compare prompt language with --language
- compare prompt avoid language with --avoid
- stop when the prompt says one thing and the flags say another
If the user only has a text reference, route to the free-tool path in references/free-tool-inputs.md first. If the user has audio, analyze first and only then build the prompt. The linter returns a retry_guidance array with one hint per conflict so the operator can re-align prompt and flags on the next attempt.
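A minimal sketch of the prompt-versus-flag check, covering only BPM and key. The function name, the regular expressions, and the hint wording are assumptions for illustration; the real linter's parsing and its retry_guidance format may differ.

```python
import re


def lint_prompt_vs_flags(prompt, mmx_flags):
    """Return one retry hint per conflict between prompt text and flag values."""
    hints = []

    # BPM: look for a number directly followed by "BPM" in the prompt.
    bpm_match = re.search(r"(\d{2,3})\s*BPM", prompt, re.IGNORECASE)
    if bpm_match and "--bpm" in mmx_flags:
        prompt_bpm = int(bpm_match.group(1))
        flag_bpm = int(mmx_flags["--bpm"])
        if prompt_bpm != flag_bpm:
            hints.append(f"BPM conflict: prompt says {prompt_bpm}, --bpm says {flag_bpm}")

    # Key: an uppercase note letter, optional accidental, then major/minor.
    key_match = re.search(r"\b([A-G][#b]?)\s+((?i:major|minor))\b", prompt)
    if key_match and "--key" in mmx_flags:
        prompt_key = f"{key_match.group(1)} {key_match.group(2)}".lower()
        flag_key = " ".join(str(mmx_flags["--key"]).split()).lower()
        if prompt_key != flag_key:
            hints.append(
                f"Key conflict: prompt says {key_match.group(0)}, --key says {mmx_flags['--key']}"
            )
    return hints


retry_guidance = lint_prompt_vs_flags(
    "Dreamy shoegaze ballad, 90 BPM, D minor",
    {"--bpm": 80, "--key": "D minor"},
)
print(retry_guidance)
```

An empty list means prompt and flags agree on the dimensions that were checked; it says nothing about structure, vocals, language, or the avoid list, which this sketch does not parse.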
When To Use
Use this skill when the task involves:
- generating a cover of an existing song with a different style (chanson version of a rock track, reggaeton version of a pop hit, and so on)
- style transfer from a YouTube URL or audio file to a target genre
- two-song mashup where Song A's lyrics and emotional arc are kept, but Song B's style is applied
- emotion analysis on input audio to extract intensity curves, vocal speed, pitch bends, and emotion classifications
- generating lyrics in a specific language and theme via the MiniMax lyrics_generation API
- editing existing lyrics to match a target style or emotional arc (MiniMax lyrics_generation edit mode)
- using mmx CLI directly for fine control over --avoid, --bpm, --key, --structure, --vocals, --instruments as separate flags
- accessing MiniMax's music-cover or music-cover-free models for melody preservation
Request Intake (adapted for MiniMax features)
After the Routing and Blocker Checks classify the request, run this 2-pass intake to extract the full set of fields the user cares about. Label each field's confidence: clear (user said it), inferred (sensible default), missing (need to ask), or conflicting (user said two incompatible things — pause to resolve).
Fields checklist (MiniMax-specific)
| # | Field | What to look for | MiniMax-specific notes |
|---|---|---|---|
| 1 | Route | Cover / style transfer / mashup / standard / emotion prompt | From the Routing and Blocker Checks section. Determines which MiniMax features to use. |
| 2 | Source audio or URL | File path or playable YouTube URL | Required for cover, mashup, style transfer. For standard, optional (text-only style reference is also fine). |
| 3 | Song A identity | Name, artist, audio | For mashup: needed. For cover: this is the source. |
| 4 | Song B identity | Name, artist, audio | For mashup only. |
| 5 | Target style | Genre / mood / reference | The destination of the cover or style transfer. If user says "like Rosalía", that's clear. If user says "something good", that's missing. |
| 6 | Lyrics decision | Original / translated / new / instrumental | For cover, default to original (translated if user requests it). For standard, default to new (or user-provided). |
| 7 | Vocal mode | Solo / duet / choir / instrumental | Drives --vocals and --language flags. |
| 8 | Language | BCP-47 code (en, fr, es, etc.) | For lyrics language AND vocal language. |
| 9 | Duration | Approximate length (jingle ~30s, standard ~3min, epic ~6min) | mmx has no native duration control (see "Song length" section). Length is driven by lyrics + structure, so the intake needs the lyrics to control length. |
| 10 | BPM, key, structure | Exact values if user wants --bpm/--key/--structure | Optional. If provided, the prompt AND flags must agree (lint them). |
| 11 | Emotion arc | For emotion-prompt workflows: which emotions to emphasize | Drives the analysis-to-prompt translation. |
| 12 | Output location | Where the audio and analysis files go | Same as the base skill: per-song subfolder in ~/Music mix/<project>/<song-slug>/. |
Confidence map example (MiniMax-specific)
Request: "Hazme un cover del 'Bizcochito' de Rosalía pero en reggaetón" (Spanish: "Make me a cover of Rosalía's 'Bizcochito', but in reggaeton")
clear: source_audio=path, song_a=Bizcochito, target_style=reggaeton
inferred: language=es, vocal_mode=solo_female, lyrics_decision=original
missing: output_location (which project folder? per-song subfolder?)
vocal_register (full chest, head voice, whisper? — affects --vocals flag)
Request: "I have a YouTube link of an old rock song and want it as a dreamy shoegaze ballad, with English lyrics because the original is in French"
clear: source_url=URL, song_a=old_rock_song, target_style=shoegaze
lyrics_decision=translated, target_language=en
inferred: vocal_mode=duet or solo (depends on original), ~3min
missing: audio source for source audio analysis (YouTube needs to be downloaded first)
BPM/key from analysis output (will be filled in after analysis)
output_location
If any field is missing or conflicting, that's a question to ask. The Ambiguity Questions section below has specific patterns for each route. If everything is clear or inferred, the request is ready to translate.
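One way to hold the confidence map in code is a mapping from field name to a (value, label) pair. The layout below and the pending_fields helper are hypothetical; the values mirror the first example request above.

```python
NEEDS_ANSWER = {"missing", "conflicting"}


def pending_fields(confidence_map):
    """Return the fields that still require a question to the user."""
    return [name for name, (_, label) in confidence_map.items() if label in NEEDS_ANSWER]


bizcochito = {
    "song_a": ("Bizcochito", "clear"),
    "target_style": ("reggaeton", "clear"),
    "language": ("es", "inferred"),
    "lyrics_decision": ("original", "inferred"),
    "output_location": (None, "missing"),
    "vocal_register": (None, "missing"),
}

print(pending_fields(bizcochito))  # ['output_location', 'vocal_register']
```

When the returned list is empty, every field is either clear or inferred and the request is ready to translate.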
User Preference Flow (message patterns → action)
The skill does not start with a questionnaire. It starts by reading and inferring from the user's natural-language request.
| User says... | Skill does... |
|---|---|
| "Haz un cover de X en Y" ("Make a cover of X in Y") | Route: minimax_cover. Ask: source audio file (or download from YouTube), target language for lyrics, vocal register. |
| "Make this song sound like Rosalía" | Route: minimax_style_transfer. Ask: source audio, which album/era of Rosalía. |
| "I have audio of A, mash with B, keep A's melody" | Route: minimax_mashup. Ask: A vs B confirmation, source audio for A, B can be name or audio. |
| "Analyze the emotion curve of this track" | Route: minimax_emotion_prompt (analysis-only). Run analysis_orchestrator.py --audio first, then read the JSON. |
| "I want the lyrics to be about X, in French, melancholic" | Route: base_prompt (standard). Use the lyrics API to generate, then pass to mmx music generate --lyrics-file. Ask: target BPM/key/structure or derive from analysis. |
| "Recreate the song but in 90 BPM D minor" | Route: base_prompt with mmx flags. Lint prompt vs flags before generation. Verify BPM/key consistency. |
| "I don't know, surprise me" | Pick a coherent default (e.g. upbeat indie pop, EN, ~3min, auto-lyrics, standard generation) and confirm with the user before generating. |
| "Same song again but as a reggaeton version" | Route: minimax_cover with the existing song as source. Use the same project/song subfolder, suffix the MP3 (M1_original.mp3 + M2_reggaeton.mp3). |
This table is the abstract of references/user-preference-flow.md (which lives in the base skill). If you want a more detailed case, defer to the base skill's table and combine with this skill's route mapping.
Output File Layout (Per-Song Subfolders)
MiniMax-specific additions (drop these into the per-song subfolder alongside the base items):
| File | Source | Notes |
|---|---|---|
| <song-slug>_analysis.json | analysis_orchestrator.py --output | MiniMax-specific analysis results (emotion, BPM, key, segments) |
| <song-slug>_lyrics.txt | mmx music generate --lyrics-file | Optional if user provided lyrics inline |
| <song-slug>_<style>_prompt.txt | The exact text passed to --prompt | For reproducibility |
The LLM should aim for the base skill's layout by default. The MiniMax-specific files are added on top when MiniMax features are used (cover workflow, mashup, analysis, etc.).
Quick Start with the Orchestrator
For any input combination, the analysis_orchestrator.py script is the single entry point:
# Audio file
python3 scripts/analysis_orchestrator.py --audio /tmp/song.wav
# Two songs (mashup) - gets BPM + key compatibility scoring for free
python3 scripts/analysis_orchestrator.py --audio /tmp/song_a.wav --audio /tmp/song_b.wav
# Video - extracts audio + visual features (scenes, color, motion)
python3 scripts/analysis_orchestrator.py --video /tmp/clip.mp4
# Image (album art) - color palette + style hints
python3 scripts/analysis_orchestrator.py --image /tmp/album_art.jpg
# YouTube URL - downloads then analyzes
python3 scripts/analysis_orchestrator.py --youtube "https://youtube.com/watch?v=..."
# Combination: audio + image
python3 scripts/analysis_orchestrator.py --audio /tmp/song.wav --image /tmp/art.jpg
# Demucs source separation — for TIMBRE/PITCH analysis of an isolated vocal, NOT for lyrics
python3 scripts/analysis_orchestrator.py --audio /tmp/song.wav --use-demucs
# Whisper lyrics extraction — run on the FULL mix (do NOT pre-separate with Demucs)
python3 scripts/analysis_orchestrator.py --audio /tmp/song.wav --lyrics
# VLM captioning for images (calls mmx vision describe / MiniMax 3.0 — cloud, skip if MiniMax is blocked)
python3 scripts/analysis_orchestrator.py --image /tmp/album_art.jpg --vlm
The orchestrator dispatches to the right analysis scripts and produces a unified JSON. Optional packages (CLAP, autochord, allin1, pyloudnorm, pylette, scenedetect, demucs, beat_this, basic-pitch, transformers/MERT, open_clip) are detected at runtime and used when available.
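Runtime detection of optional packages can be done without importing them. The sketch below uses importlib.util.find_spec from the standard library. The module list is a small subset chosen for the example, and the shape of the "skipped" block is an assumption; the orchestrator's real detection code and JSON keys are not shown in this document.

```python
import importlib.util
import json


def detect_optional(modules):
    """Map each top-level module name to True if it is importable here."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


def missing_dependency_block(step, module):
    """Hypothetical error shape; the real scripts may use different keys."""
    return {
        "step": step,
        "status": "skipped",
        "error": f"optional dependency '{module}' is not installed",
    }


available = detect_optional(["librosa", "demucs", "pyloudnorm"])
skipped = [missing_dependency_block(name, name) for name, ok in available.items() if not ok]
print(json.dumps({"available": available, "skipped": skipped}, indent=2))
```

Because find_spec only locates a module, the check stays cheap even for packages that load large models on import.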
Extraction guidance (what actually improves the output)
These are the rules that make the extracted data useful to the downstream generator. They are tool-agnostic — they apply whether the backend is MiniMax cloud or a local model.
- Lyrics: transcribe the FULL mix, do not Demucs-first. Feeding Demucs-isolated vocals into Whisper measurably worsens transcription word-error-rate in most configurations. Run the transcriber on the original mix. Use faster-whisper over vanilla whisper (same accuracy, much lower latency/VRAM), and prefer the large-v2 model for sung lyrics; large-v3 is reliably worse on singing. Use medium/base only as a speed compromise.
- Use Demucs only for timbre/pitch. Source separation helps when you want clean vocal-stem features (breathiness, pitch range, vocal brightness) or per-instrument detection, never as a lyrics pre-step.
- Prioritise the high-value features. For driving a generation prompt, the features that matter most (in order) are tempo/BPM, key/scale, beats/downbeats, chords, then structure (section boundaries). Energy/RMS and spectral centroid map to texture words (punchy, airy, sparse, dense) and to dynamic tags. Spend analysis budget there first.
- Give key detection a long window. Estimate key/chroma over ~120s of audio (not a short clip) for a stable result; BPM is stable from ~60s.
- Carry confidence through to the prompt. Hedge low/medium detections ("around 128 BPM", "likely D minor") and never inject missing values as facts (see Analysis Quality below).
- Map structure boundaries to actions. Detected section boundaries become the [Verse]/[Chorus]/[Bridge] tag roadmap, and (for backends that support it) the repaint windows for fixing one bad section instead of regenerating the whole track.
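The confidence rule above can be sketched as two small formatters. The function names are hypothetical, and the split between hedged labels (low, medium) and unhedged ones is an assumption based on the wording of the rule.

```python
HEDGED = {"low", "medium"}


def describe_tempo(bpm, confidence):
    """Return prompt text for a tempo detection, or None if it must be omitted."""
    if bpm is None or confidence == "missing":
        return None  # never inject a missing value as a fact
    text = f"{round(bpm)} BPM"
    return f"around {text}" if confidence in HEDGED else text


def describe_key(key, confidence):
    """Return prompt text for a key detection, or None if it must be omitted."""
    if not key or confidence == "missing":
        return None
    return f"likely {key}" if confidence in HEDGED else key


parts = [describe_tempo(127.6, "medium"), describe_key("D minor", "low"), describe_key(None, "missing")]
print(", ".join(part for part in parts if part))  # around 128 BPM, likely D minor
```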
Output file layout (per-song subfolders)
Every generation should be saved into a per-song subfolder that bundles the audio with its analysis, prompt, and lyrics. The LLM should ask the user for the project root and song slug up front (default: ~/Music mix/<project>/<song-slug>/), then run the full chain of commands below.
# Example: DBC - Two Paths, two versions
# 1. Make the subfolder
mkdir -p ~/Music\ mix/dbc/two-paths
# 2. Run the analysis and save JSON into the subfolder
python3 scripts/analysis_orchestrator.py \
--audio /tmp/two_paths.wav \
--use-demucs --lyrics --lyrics-source auto \
--output ~/Music\ mix/dbc/two-paths/two_paths_analysis.json
# 3. Build the prompt from the analysis, save it next to the JSON
python3 scripts/emotion_to_prompt.py \
--emotion ~/Music\ mix/dbc/two-paths/two_paths_analysis.json \
--output ~/Music\ mix/dbc/two-paths/two_paths_synthwave_prompt.txt
# 4. Generate each version, save the MP3 into the subfolder with a
# versioned filename so multiple takes stack cleanly
mmx music generate \
--prompt "$(cat ~/Music\ mix/dbc/two-paths/two_paths_synthwave_prompt.txt)" \
--lyrics-file ~/Music\ mix/dbc/two-paths/two_paths_lyrics.txt \
--out ~/Music\ mix/dbc/two-paths/M1_two_paths_synthwave.mp3
The result is a self-contained song folder that the user can review, archive, share, or re-generate from without losing any context.
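To keep multiple takes stacking cleanly, the next version prefix can be derived from the files already in the song folder. The helper below is a sketch: the function name is hypothetical, and it assumes the M<n>_ prefix convention used in the example above.

```python
import re
import tempfile
from pathlib import Path


def next_version_name(song_dir: Path, song_slug: str, style: str) -> str:
    """Return the next unused M<n>_<slug>_<style>.mp3 name inside song_dir."""
    used = [
        int(match.group(1))
        for path in song_dir.glob("M*.mp3")
        if (match := re.match(r"M(\d+)_", path.name))
    ]
    return f"M{max(used, default=0) + 1}_{song_slug}_{style}.mp3"


# Usage against a throwaway folder, so nothing under ~/Music mix is touched.
with tempfile.TemporaryDirectory() as tmp:
    song_dir = Path(tmp) / "dbc" / "two-paths"
    song_dir.mkdir(parents=True)
    (song_dir / "M1_two_paths_synthwave.mp3").touch()
    print(next_version_name(song_dir, "two_paths", "reggaeton"))
```

Taking the maximum existing number plus one, rather than counting files, avoids reusing a prefix after a take has been deleted.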
What's New in v1.0.0
v1.0.0 is the first stable release. It builds on the v0.x series (v0.3.0 / v0.4.0 dev line) with stronger preflight routing, wider prompt/flag consistency, and explicit post-generation verification:
Preflight routing:
- lint_music_request.py now emits one of six routes: base_prompt, minimax_cover, minimax_mashup, minimax_style_transfer, minimax_emotion_prompt, or needs_clarification
- New blockers: missing Song B, missing lyrics decision, and conflicting cover/style-transfer intent
- A retry_guidance array on every conflict so the operator can re-align prompt and flags
Prompt and flag consistency:
- Linter now detects conflicts in BPM, key, structure, duration, vocal mode, language, and avoid list
- The canonical mmx prompt schema is documented in references/examples.md
Analysis quality:
- All analysis scripts converge on a compact summary (tempo, key, sections, instrumentation, vocal traits, energy curve, hook points, mix notes)
- Confidence levels (clear, high, medium, low, inferred, missing) attached to every detection
- Missing optional dependencies fall back to a JSON error block instead of failing the whole workflow
Output verification:
- Post-generation verification checklists for covers, mashups, style transfer, and emotion prompts
- Eight failure signatures (copied too closely, lost melody, wrong tempo, wrong key, muddy mix, weak chorus, style mismatch, neutral vocals) with matching fixes
- Revision prompt templates that preserve source identity while fixing one specific dimension
Tests and portability:
- Smoke tests now cover all new linter routes, the new conflict types, and the stdlib-only import guarantee
- Windows is documented as partial support; scripts stay POSIX-safe, audio tools may need platform install
What's New in v0.3.0
v0.3.0 builds on v0.2.0 with a substantially richer analysis pipeline:
New analysis scripts (8):
- extract_stems.py: Demucs source separation (vocal/drums/bass/other)
- track_beats.py: beat_this beat + downbeat tracking (ISMIR 2024 SOTA)
- extract_melody.py: Spotify Basic Pitch polyphonic AMT → MIDI + key/scale
- compute_audio_embedding.py: MERT v1-330M music embeddings (vibe similarity)
- classify_instruments.py: MIT AST 527-class AudioSet tagging
- extract_video_features.py: extended with camera motion + VLM captioning
- analyze_image.py: extended with OpenCLIP, OCR, face detection, VLM caption
- analysis_orchestrator.py: single entry point, --use-demucs, --vlm, --ocr flags
New prompt slots (consumed in emotion_to_prompt.py):
- "beat grid: 4/4 at 150 BPM (confidence 0.80)" from beat_this
- "melodic key from MIDI: E minor; interval motion: mostly leaps; modal character: pentatonic, blues" from Basic Pitch
- "AST-detected sound palette: rock music (0.16), punk rock (0.14), grunge (0.20)" from MIT AST
- "emotion signature from analysis: intense, passionate, dramatic, triumphant" (expanded to 25-emotion classifier)
- "vocal texture in verse: breathier / more intimate than average" (per-section aggregation)
- "tempo: tight, on-beat delivery" (from tempo_consistency)
- "tonal character: dark warm tone, rolled-off highs" (from brightness)
- "instruments detected: electronic / synthetic textures" (from instrument_hints)
- "natural dramatic pauses detected at: 2s (11.7s pause), 20s (3.3s pause)" (from Demucs vocal-stem)
- "style direction: ..." (from analyze_two_songs mashup_plan)
Bug fixes:
- parselmouth 0.4.x API (get_value_at_time / get_value_at_xy)
- ffmpeg 8.x image2 muxer workaround (per-frame extraction)
- pylette 5.1+ capital-P import + Pylette fallback
- open_clip 3.3 3-tuple return + get_tokenizer() for tokenizer
- demucs 4.x apply_model API
Prompting wins (verified end-to-end with DBC Woodstock 2013):
- Mix: 0 silence gaps, 35 pitch bends
- Vocal stem: 19 silence gaps, 49 pitch bends, 2.32 syll/sec
- BPM 150 (4/4) from beat_this, E minor (MIDI-confirmed G# minor)
- AST: "Rock music", "Punk rock", "Heavy metal", "Grunge" — matches the actual band
When NOT To Use
Do not use this skill when:
- the user only needs standard song generation without cover, mashup, or analysis: use music-craft instead (lighter, no MiniMax dependency)
- the runtime does not expose a music_generate tool and there is no MINIMAX_API_KEY configured; both skills need the runtime
- the user wants deterministic, single-shot generation with no iteration, which makes this skill overkill
- the user wants to mutate a specific existing audio file (pitch shift, time stretch, stem split) — that is post-production, not generation
- the user is not on a MiniMax Token Plan — the advanced features (cover, mmx per-flag control, lyrics API, emotion-driven prompts) require the plan
Decision Tree
Use the base skill unless one of these MiniMax-specific needs is present:
- melody-preserving cover or style transfer from audio or YouTube
- two-song mashup
- lyrics API preview/edit flow
- emotion analysis that feeds the prompt
- exact mmx control for BPM, key, structure, or avoid lists
If the user wants a new song that only borrows a style, stay in music-craft unless they also need exact flag control or lyrics API iteration.
If the source is a YouTube URL and download is blocked, ask for a local file before changing the workflow.
First Response Defaults
Use these defaults on the first pass:
- Cover from audio or YouTube: start with the one-step cover path. Switch to two-step only if the user wants translated lyrics, edited ASR lyrics, or custom lyrics.
- Style transfer only: do not use cover unless melody preservation matters. Use standard generation plus mmx flags if exact BPM/key/structure matter.
- Two-song mashup: anchor on Song A. If Song A has audio, default to the cover two-step workflow; if Song B is only named, ask for a short style description or fetch more context if free tools are available.
- Lyrics API generation or edit: use write_full_song for blank-page generation and edit for revisions.
- Emotion-analysis-to-prompt: run analysis first, then convert to a prompt; only ask whether the output should be cover, mashup, or standard generation, plus the target language if missing.
- Exact BPM/key/structure control: make mmx flags the source of truth and keep the prompt descriptive but non-conflicting.
Ambiguity Questions
Ask at most three questions. Separate blockers from quality tweaks:
- Required blockers first: source file or URL, which song is A vs B, whether lyrics already exist, whether the output must preserve melody.
- Optional quality after blockers: target language, target style, BPM, key, structure, instruments, vocal color, avoid list.
Use these exact patterns when clarification is needed:
- Cover: "Which source should I use?" "Do you want the original lyrics, translated lyrics, or new lyrics?" "Any target style, or should I derive it from the source?"
- Mashup: "Which song is A and which is B?" "Do you have audio for Song B, or only the name?" "Should the lyrics stay the same or be rewritten?"
- Lyrics API: "Write from scratch or edit existing lyrics?" "What language should I target?" "Any hard structure requirements?"
- Emotion prompt: "Do you want cover, mashup, or standard generation?" "What language should the output use?" "Should I prioritize tenderness, energy, or structure?"
- mmx precision: "Which values are mandatory: BPM, key, structure, or avoid list?" "Any instruments or vocals that must stay in or stay out?"
Relationship to music-craft
This skill extends the base skill; it does not replace it. The shared concepts are:
| Concept | Where it lives |
|---|---|
| Pre-Flight Check (platform detection) | This skill (extended required list) |
| Anti-sparse rules (canonical text) | Base skill, referenced from here |
| Prompt formula (production sheet) | Base skill, referenced from here |
| Structure tags (14 tags) | Base skill, referenced from here |
| User preference flow (auto-detect + ask) | Base skill, referenced from here |
| Output file layout (per-song subfolders, slug rules, version prefix) | Base skill, referenced from here; MiniMax adds analysis.json and lyrics.txt |
| Rate limits (generic) | Base skill |
| Quality verification checklist | Base skill, extended here for MiniMax |
| Operating rules (6-step loop) | Base skill, summarized here with MiniMax-specific extensions |
The MiniMax-specific additions are:
| MiniMax concept | Where it lives |
|---|---|
| mmx CLI quick reference | This skill |
| mmx full flag reference | This skill, references/mmx-flags-reference.md |
| Cover workflow (one-step, two-step) | This skill, references/cover-workflow.md |
| Lyrics generation API | This skill, references/lyrics-generation.md |
| Mashup workflow (A + B) | This skill, references/mashup-workflow.md |
| Emotion analysis (vocal speed, intensity, pitch) | This skill, references/emotion-analysis.md |
| MiniMax-specific error handling | This skill, references/error-handling.md |
| Audio analysis scripts | This skill, scripts/ |
| Free tool inputs (web, image, memory) | Both skills — base layer in music-craft, MiniMax layer here in references/free-tool-inputs.md |
Pre-Flight Check (extended)
The platform detection block is the same as music-craft (run it first). The required and optional lists are extended for MiniMax.
Platform Notes
- macOS/Linux are the primary targets: use python3, command -v, and normal shell export/PATH checks.
- Windows is partial support only: prefer PowerShell, use python or py -3, and verify env vars with Get-ChildItem Env:MINIMAX_API_KEY or Test-Path Env:MINIMAX_API_KEY.
- On Windows, ffmpeg, yt-dlp, and mmx are PATH-sensitive; if Get-Command/where.exe cannot find them, restart the shell or add the install directory to PATH.
- If Windows path/dependency issues keep blocking analysis, use WSL for the script-heavy parts instead of claiming full native support. For the full WSL2 setup (including the corporate items below), follow the base skill's references/windows-wsl-setup.md.
- Corporate machines (TLS-inspecting proxy): pip/HuggingFace/model downloads inside WSL fail with CERTIFICATE_VERIFY_FAILED until the corporate root CA is installed in the distro (and REQUESTS_CA_BUNDLE/SSL_CERT_FILE point at the system bundle). A proxy env var (HTTP_PROXY) can also hijack 127.0.0.1 calls to a local API: unset it or set no_proxy=127.0.0.1,localhost. Both are covered in the base reference above.
- If MiniMax itself is blocked (corporate firewall), the cloud features here (mmx music cover, the lyrics API, and any analysis script that calls MiniMax, e.g. emotion_to_prompt.py) will fail. In that case use only the local-capable tools (yt-dlp, ffmpeg, librosa, Whisper) for analysis and the local ACE-Step backend in music-craft for generation.
Required (skill will not work without these)
| Check | What it is | How to verify | If missing |
|---|---|---|---|
| music_generate tool | The runtime's built-in music generation tool | Inspect the active runtime's tool list | Tell the user: "This skill needs a music_generate tool, but the active runtime does not expose one. Configure a music provider in OpenClaw and try again." Stop. |
| MINIMAX_API_KEY env var | API key for the MiniMax Music 2.6 plan | test -n "$MINIMAX_API_KEY" && echo "OK" | Tell the user: "This skill needs the MINIMAX_API_KEY environment variable. Get one from your MiniMax account and export it. If you do not have a MiniMax Token Plan, use music-craft instead; it works with any provider." Stop. |
| mmx CLI | The MiniMax CLI for fine-flag control | command -v mmx && mmx --version (macOS/Linux) or Get-Command mmx; mmx --version (PowerShell) | Ask the user: install via the MiniMax install guide, or skip mmx-specific features and use the music_generate tool with prompts. Do not block: mmx is optional if the runtime has MiniMax configured, but Windows support is only partial and depends on PATH visibility. |
| python3 | Required for the analysis scripts | command -v python3 (macOS/Linux) or python / py -3 (Windows PowerShell) | Tell the user: "The analysis pipeline (emotion analysis, mashup) needs Python 3.9+." Propose an install command for the active shell. Block emotion analysis if missing. |
Optional (skill works without these, but quality improves with them)
| Tool | What it unlocks | Install per platform |
|---|---|---|
| ffmpeg | Audio conversion (WAV for analysis, MP3 export, trimming) | apt install ffmpeg · brew install ffmpeg · winget install Gyan.FFmpeg (restart PowerShell after install so PATH updates apply) |
| yt-dlp | YouTube audio download for cover and mashup | pip install -U yt-dlp or py -3 -m pip install -U yt-dlp on Windows; ensure the CLI is on PATH |
| librosa | Audio analysis (BPM, key, energy, structure) | pip install librosa numpy scipy |
| parselmouth | Better pitch tracking (Praat under the hood) | pip install praat-parselmouth |
| scikit-learn | Audio clustering (segment detection) | pip install scikit-learn |
The full per-platform install table is in the base skill's music-craft Pre-Flight Check.
The "ask the user" pattern
Same as the base skill: for each missing optional tool, present three options — install (propose exact command, let user approve), skip (use the simple path), or cancel. Never auto-install.
If MINIMAX_API_KEY is missing, the redirect is to the base skill, not "install MiniMax" — the user may not have a Token Plan at all.
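The key and CLI checks can be sketched in Python with os.environ and shutil.which. The function names and the returned strings are illustrative assumptions. The music_generate tool check is left out because it depends on inspecting the runtime's tool list, which cannot be done from a standalone script.

```python
import os
import shutil


def preflight(env=None, which=shutil.which):
    """Report which prerequisites are visible; 'which' is injectable for testing."""
    env = os.environ if env is None else env
    return {
        "MINIMAX_API_KEY": bool(env.get("MINIMAX_API_KEY")),
        "mmx": which("mmx") is not None,
        "ffmpeg": which("ffmpeg") is not None,
        "yt-dlp": which("yt-dlp") is not None,
    }


def next_step(report):
    """Turn the report into the action described in this section."""
    if not report["MINIMAX_API_KEY"]:
        return "redirect to music-craft"
    missing = [tool for tool in ("mmx", "ffmpeg", "yt-dlp") if not report[tool]]
    if missing:
        return "ask user (install, skip, or cancel): " + ", ".join(missing)
    return "ready"


print(next_step(preflight()))
```

Nothing here installs anything: the sketch only reports, which matches the rule that the user approves every install.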
Local analysis memory (separate from generation)
Generation runs on MiniMax's cloud — your laptop just sends the prompt and downloads the MP3, so generation itself uses negligible local memory.
However, this skill's local analysis scripts run on your machine and can use real memory. Before running the full analysis pipeline, check available memory:
| Script | Models loaded | Approx peak RAM |
|---|---|---|
| analyze_vocal_emotion.py | parselmouth (Praat) + scipy | ~500 MB |
| analyze_audio.py | librosa + transformers (MERT or MIT AST) | 2–4 GB |
| extract_lyrics_whisper.py | whisper model (tiny/base/medium) | 1–5 GB depending on model size |
| extract_stems.py | Demucs (htdemucs) | 2–4 GB |
| emotion_to_prompt.py | calls MiniMax API, negligible local | <100 MB |
| compute_audio_embedding.py | MERT model | 1–2 GB |
| classify_instruments.py | MIT AST | 1–2 GB |
Combined (full analysis pipeline on a 4-min song): ~6–10 GB peak on top of OS and other apps. On unified-memory systems (Apple Silicon, integrated graphics), this competes directly with macOS/Windows and your other applications. On dedicated-GPU systems (NVIDIA, AMD), model memory is taken from system RAM unless you have CUDA acceleration.
Recommendations:
- Close heavy apps (browser with many tabs, IDE, Docker) before running the full pipeline
- For extract_lyrics_whisper.py, use the tiny model by default; base/medium are 2-5x heavier with marginal quality gain for most songs
- For extract_stems.py, the --quality flag controls Demucs model size; default htdemucs is the heaviest; htdemucs_ft is the lightest
- If you run out of memory, run analysis steps individually rather than via analysis_orchestrator.py (which loads everything)
The scripts/smoke_test.py script verifies the environment is set up; it does not test memory headroom. Run your own memory check before running a full analysis.
Free Tool Augmentation (Input Enrichment)
The OpenClaw runtime exposes several free tools (web_fetch, web_search, image analysis, memory, browser) that enrich the music generation workflow. The base layer is documented in music-craft → Free Tool Augmentation and references/free-tool-inputs.md. This section shows how they compose with MiniMax-specific features.
Quick recap of free tools
| Tool | Purpose |
|---|---|
| web_fetch | Fetch URL content (lyrics pages, YouTube metadata, Wikipedia) |
| web_search | Find lyrics, artist info, genre descriptions |
| image / MiniMax__understand_image | Analyze album art, concert photos, music video screenshots |
| memory_search / memory_get | Recall user's prior music preferences |
| browser | JS-heavy site fallback (last resort) |
MiniMax compositions (high-value combos)
- web_fetch + lyrics_generation: fetch the user's draft from a URL, run it through edit mode for cleanup, generate.
- web_search + cover workflow: find covers in the target style, extract their characteristics, apply to the user's track.
- image + mmx per-flag control: analyze album art, translate to --instruments, --bpm, --key, --structure for fine-grained style matching.
- memory + emotion analysis: combine the user's prior preferences with deep audio analysis of a reference track.
For the full worked examples, parameter recommendations, and MiniMax-specific edge cases, see references/free-tool-inputs.md.
Operating Rules
Same 6-step loop as music-craft, with MiniMax-specific extensions:
- Read and auto-detect: same
- Ask only the ambiguous parts: same, plus ask if the user wants cover / mashup / standard
- Translate to a production-sheet prompt: same, but consider whether to use mmx flags (see references/mmx-flags-reference.md) instead of packing everything into the prompt
- Structure the lyrics: same, plus consider lyrics API for generation or edit (see references/lyrics-generation.md)
- Generate and verify: same, plus the music-cover model for melody preservation
- Iterate: same, plus emotion analysis to inform the next prompt adjustment
For the full 6-step detail, see music-craft → Operating Rules.
Song length (mmx has no native duration control)
Unlike music-craft's ACE-Step backend (which takes audio_duration as a parameter), MiniMax Music 2.6 has no explicit duration flag. Output length is determined by:
- Lyrics length (primary): each [Verse]/[Chorus] section takes ~15-30 seconds depending on word count and singing pace. A typical 3:30 song has ~150-200 lyrics words across 2 verses + 2 choruses + bridge.
- Structure tags: [Intro], [Instrumental Break], [Outro] add silent/sparse sections that extend total length without lyrics.
- Prompt hints (secondary): phrases like "3 minute song" or "4 minute track" nudge the model toward that length.
- BPM and section count (minor effect): faster BPMs and more sections tend to produce slightly longer outputs.
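The section-timing rule of thumb above can be turned into a quick pre-flight estimate. The sketch below is illustrative only: the function names are hypothetical, and applying the 15-30 second range to every bracketed tag (including [Intro] or [Outro]) is a simplifying assumption, not documented model behavior.

```python
import re

# Rule of thumb from the list above: roughly 15-30 seconds per tagged section.
SECONDS_PER_SECTION = (15, 30)


def estimate_duration_range(lyrics: str) -> tuple[int, int]:
    """Return a rough (min, max) duration in seconds from the section tags."""
    # A section tag is a line that holds only a bracketed label, e.g. [Verse 1].
    sections = re.findall(r"^\s*\[[^\]]+\]\s*$", lyrics, flags=re.MULTILINE)
    low, high = SECONDS_PER_SECTION
    return len(sections) * low, len(sections) * high


def word_count(lyrics: str) -> int:
    """Count sung words, ignoring the bracketed structure tags."""
    body = re.sub(r"\[[^\]]+\]", " ", lyrics)
    return len(body.split())


sample = "[Verse 1]\nhello world here\n[Chorus]\nsing it loud\n"
print(estimate_duration_range(sample), word_count(sample))
```

If the upper bound is well under the target length, the lyrics are probably too short for a full song.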
Practical recipe for a full 3:30 song:
- Lyrics: ~150-200 words with [Verse 1], [Pre-Chorus], [Chorus], [Verse 2], [Bridge], [Outro] tags (full song structure, not just one chorus)
- Prompt: include structure hints like "full 3-minute song with intro, 2 verses, 2 choruses, bridge, and outro" or use --structure "intro-verse-pre_chorus-chorus-verse-chorus-bridge-chorus-outro"
- Check output length: if it's a 1-minute hook, the lyrics are probably too short
- If output is too short: regenerate with longer lyrics (the model can't add sections that aren't in the lyrics)
- If output is too long: trim lyrics to ~120 words or add [Instrumental Break] tags to control pacing
Don't expect mmx to hit 3:30 exactly. Output length varies by ±20-30s depending on the model. If you need precise length, ACE-Step is the right tool (it has audio_duration). If you want MiniMax's vocal quality and the song length is flexible, mmx is fine.
mmx CLI Quick Reference
The mmx CLI exposes MiniMax Music 2.6 parameters as separate flags. This gives finer control than packing everything into a single prompt string.
The most useful flags:
| Flag | Effect | Example |
|---|---|---|
| --avoid | Elements to avoid (comma-separated) | --avoid "sparse, a cappella, electronic sounds" |
| --bpm | Exact BPM | --bpm 80 |
| --key | Musical key | --key "E minor" |
| --structure | Song structure | --structure "intro-verse-pre_chorus-chorus-verse-chorus-bridge-chorus-outro" |
| --vocals | Vocal style | --vocals "passionate French male vocal" |
| --instruments | Featured instruments | --instruments "accordion, upright bass, strings, piano" |
| --genre | Genre | --genre "french chanson" |
| --mood | Mood | --mood "melancholic romantic dramatic" |
| --lyrics-optimizer | Auto-generate lyrics from prompt | (flag only) |
| --model | Model name | --model music-2.6 (paid, highest RPM) or --model music-2.6-free (default, free tier) |
| --cover-feature-id | Use a preprocessed cover (two-step workflow) | (from preprocess call) |
Full reference with all flags and examples: references/mmx-flags-reference.md.
When to use mmx vs music_generate:
- mmx: when you need fine control over specific parameters (BPM, key, structure as separate flags)
- music_generate: when the prompt-only path is enough, and you want to keep the workflow provider-agnostic
Both produce equivalent results if the prompt and flags are aligned.
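When the flag values come from analysis output rather than being typed by hand, it can help to assemble the command as an argument list. The following is a minimal sketch, not part of the skill: build_mmx_command is a hypothetical helper, and it only knows the single-word value flags from the quick-reference table plus --prompt and --out.

```python
import shlex

# Value-taking flags copied from the quick-reference table. Anything else is
# rejected rather than guessed, since this sketch assumes nothing more
# about the CLI.
VALUE_FLAGS = ("avoid", "bpm", "key", "structure", "vocals",
               "instruments", "genre", "mood", "model")


def build_mmx_command(prompt: str, out: str, **flags) -> list[str]:
    """Assemble an argument list for `mmx music generate`."""
    args = ["mmx", "music", "generate", "--prompt", prompt]
    for name, value in flags.items():
        if name not in VALUE_FLAGS:
            raise ValueError(f"flag not in the quick-reference table: {name}")
        args += [f"--{name}", str(value)]
    args += ["--out", out]
    return args


cmd = build_mmx_command(
    "French chanson, accordion, strings", "chanson.mp3",
    bpm=80, key="E minor", avoid="sparse, a cappella",
)
print(shlex.join(cmd))  # shell-quoted form, ready to paste or log
```

Passing the list to subprocess.run avoids shell quoting problems with long multi-line prompts.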
mmx Music Generation — verified patterns (June 9, 2026)
End-to-end verified invocations from this session (M5_idkw_dreampop_shoegaze + M5_idkw_opera_metal in ~/Music mix/hello_cleveland/i_dont_know_why/):
Pattern A — full song with detailed prompt + 6 metas (production-grade output)
```bash
mmx music generate \
--prompt "dream pop reimagining, shoegaze-influenced indie rock turned ethereal and cinematic.
My Bloody Valentine meets Slowdive meets Radiohead.
Male lead vocal, breathy and vulnerable, double-tracked with slight detuning and tape warmth.
Wall of clean electric guitars with heavy chorus pedal and tremolo picking.
Shimmering washes of reverb, sub-bass synth pad foundation, soft brushed electronic drums.
Glockenspiel and celesta melody line high above the mix.
Organ pads swelling at choruses, reversed guitar samples between sections.
Heavy reverb and analog warmth throughout, lo-fi texture.
Emotional arc: hazy drifting opening building wave confusion overwhelming beautiful climax fading dreamlike denouement outro.
Avoid: sharp percussive agresivo distortion clear upfront vocals minimal sparse.
Tempo 96 BPM in D major, dreamlike half-time feel.
Suitable as a slow-burn alt-pop anthem, melodic and textural, intimate verses and soaring choruses.
Modern production, polished mix, atmospheric vocal production where vocals sit among the instruments rather than above them." \
--lyrics-file gen1_lyrics.txt \
--model music-2.6 \
--vocals "breathy vulnerable male lead, double-tracked with slight detuning" \
--genre "dream pop, shoegaze-influenced indie" \
--mood "hazy confusion building to overwhelming beautiful release, then dreamlike fade" \
--instruments "wall of clean electric guitars with heavy chorus pedal, sub-bass synth pad, soft brushed electronic drums, glockenspiel, celesta, organ pads, reversed guitar samples" \
--bpm 96 \
--key "D major" \
--structure "intro-verse-pre_chorus-chorus-post_chorus-verse2-chorus-repeat-outro" \
--use-case "slow-burn alt-pop anthem, suitable for late-night listening" \
--avoid "sharp percussive agresivo distortion, clear upfront vocals, minimal sparse arrangement" \
--references "My Bloody Valentine, Slowdive, Radiohead" \
--out M5_idkw_dreampop_shoegaze.mp3
```
Output: 167.9s MP3, 5.4 MB, -8.8 LUFS, 5.7 LRA (good dynamics).
Pattern B — crazy combo: opera vocals + heavy metal music (for fun experiments)
```bash
mmx music generate \
--prompt "extreme dramatic contrast: powerful operatic tenor vocals over heavy metal instrumentation.
Like Freddie Mercury fronting Metallica. Epic, theatrical, over the top.
Thunderous double bass drums, distorted electric guitars with palm-muted chugging,
guttural rhythm section, blast beats, tremolo picking, minor key riffing.
Operatic vocals soaring above the metal wall of sound, belting high notes with vibrato.
Gothic theatrical atmosphere, dramatic dynamic shifts from whisper-quiet verses
to explosive metal choruses. Anthem-like, stadium-ready." \
--lyrics-file gen2_lyrics.txt \
--model music-2.6 \
--vocals "operatic tenor, powerful Freddie Mercury style, vibrato, theatrical belting" \
--genre "symphonic metal" \
--mood "dramatic, theatrical, anthemic, intense" \
--instruments "distorted electric guitars, double bass drums, blast beats, orchestral strings" \
--tempo "fast" \
--bpm 160 \
--key "D minor" \
--structure "verse-pre_chorus-chorus-verse-pre_chorus-chorus-outro" \
--use-case "epic music experiment" \
--avoid "pop, soft, gentle, acoustic, slow" \
--out M5_idkw_opera_metal.mp3
```
Output: 155.8s MP3, 5.0 MB, -9.6 LUFS, 4.3 LRA (compressed but still has dynamics).
Model selection
| Model | When to use | Cost |
|---|---|---|
| music-2.6 | Production work, full quality | Token Plan / paid |
| music-2.6-free (default) | Free tier, lower RPM, "unlimited" quota for some plans | Free |
| music-2.5+ | Older model, still good quality | Token Plan / paid |
| music-2.5 | Legacy | Token Plan / paid |
| music-cover | Cover/re-interpretation of source audio (one-step) | Token Plan / paid |
| music-cover-free | Free cover variant | Free |
music-2.6-free is the default for most users — same model, free tier. The mmx CLI uses it as the default when no --model is specified.
is_instrumental and lyrics_optimizer flags (MiniMax-specific paths)
The mmx CLI exposes two important flags that bypass the --lyrics requirement:
| Flag | What it does | When to use |
|---|---|---|
| --instrumental | Generate music without vocals (no lyrics needed) | When user wants BGM, intro, soundtrack, loop |
| --lyrics-optimizer | Auto-generate lyrics from the prompt (no --lyrics needed) | When user says "make me a song about X" but doesn't have lyrics |
Examples:
```bash
# Pure instrumental (no vocals)
mmx music generate \
--prompt "Instrumental only, no vocals, no lyrics. Loopable coffee shop background, soft piano, brushed drums, 90 BPM, C major" \
--instrumental \
--length 180000 \
--out coffee_bgm.mp3

# Auto-generated lyrics from prompt
mmx music generate \
--prompt "Upbeat indie folk, melancholic but hopeful, male vocal, acoustic guitar, 100 BPM" \
--lyrics-optimizer \
--out indie_folk.mp3
```
Note: mmx music generate with --length uses milliseconds (the example shows --length 180000 for 3 minutes). This is mmx-specific; the underlying MiniMax API has no official duration parameter.
URL expiration warning
mmx music generate returns a saved: path. If you ever use --output-format url (the official API default), the URL expires after 24 hours. Download immediately. The mmx CLI auto-downloads to --out so this is not a problem when using --out directly.
Cover Workflow
Two cover backends exist — pick by what's available:
- MiniMax cloud cover (this skill): mmx music cover, melody-preserving via MiniMax's music-cover model. Needs MINIMAX_API_KEY and network access to MiniMax.
- Local ACE-Step cover (in music-craft): task_type=cover with the source audio uploaded (multipart) and audio_cover_strength controlling how far to restyle. Fully local, no cloud, follows the source melody/structure. Caveat: a full-length cover is slow and VRAM-heavy on a ~12 GB GPU and can hit the server's 600 s generation timeout; cover a shorter segment or raise ACESTEP_GENERATION_TIMEOUT. See music-craft's "ACE-Step Audio-Conditioned Generation" section.

So if MiniMax is unavailable (no key, or blocked on your network), you can still do a melody-aware cover locally with ACE-Step; it is not cloud-only. Only pure text-prompt generation (no source audio) is a "reimagining" rather than a cover.
Cover workflow preserves the original song's melody while applying a different style. Two paths:
One-step (quick):
```bash
mmx music cover \
--prompt "French chanson, accordion, strings, passionate French vocal, 80 BPM" \
--audio-file /tmp/original.ogg \
--out /tmp/cover.mp3
```
MiniMax extracts lyrics via ASR and applies the new style.
Two-step (more control):
- Preprocess the audio to extract features and structure
- Edit the lyrics (correct ASR errors, add section tags)
- Generate with the edited lyrics
The two-step path gives better results when the original lyrics need correction or when the user wants different lyrics in the new style.
Full detail with payload examples, error handling, and use cases: references/cover-workflow.md.
Lyrics Generation
MiniMax has a dedicated lyrics_generation endpoint that produces structured lyrics (with [Verse], [Chorus], etc. tags) from a theme prompt. Two modes:
- write_full_song: create new lyrics from a theme
- edit: modify existing lyrics (e.g., make the chorus stronger, shift to a hopeful ending)
The output is structured lyrics that can be passed directly to music_generate or mmx music generate.
Full detail with API examples, parameters, and use cases: references/lyrics-generation.md.
Web Lyrics Lookup (LRCLib)
As an optional complement to Whisper transcription, the orchestrator can look up song lyrics from LRCLib (open, no auth, JSON API at https://lrclib.net/api) when the song is a known mainstream track. This is a graceful fallback — Whisper is the primary source, LRCLib is a quality boost for the right song.
Coverage reality check: LRCLib has good coverage for mainstream vocal music (pop, rock, hip-hop, R&B, country) and is poor or empty for:
- Instrumental tracks (Joe Satriani, King Crimson, much jazz, classical)
- Obscure bands / friend bands
- Live / bootleg / unofficial releases
- Non-English lyrics for English titles (and vice versa)
When LRCLib is empty (the expected case for instrumentals), the script returns no_web_lyrics and the caller silently uses Whisper. This is the designed path, not a failure.
CLI usage:
```bash
# Standalone lookup
python3 scripts/fetch_lyrics_web.py \
--artist "Coldplay" --title "Yellow" \
--whisper-transcript "look at the stars..." \
--min-match 0.6 --json
```
Orchestrator integration via the --lyrics-source flag:
| Value | Behavior |
|---|---|
| whisper (default) | Always use Whisper, never touch the web |
| web | Always try LRCLib, never run Whisper |
| auto | Whisper first; if the song is recognized AND LRCLib returns a confident match (>60% word overlap), use LRCLib; otherwise fall back to Whisper |
| off | Skip lyrics extraction entirely |
The orchestrator auto-detects artist and title from the audio path stem (e.g. Coldplay - Yellow.wav → artist="Coldplay", title="Yellow"). Pass --name-a "Artist - Title" to override.
The result includes a web_lookup sub-dict with status, match_score, and the plain lyrics (when matched), so you can inspect what was used and why.
Full detail with scoring heuristic and exit codes: see scripts/fetch_lyrics_web.py docstring.
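To make the auto mode concrete, here is a rough sketch of the two pieces described above: splitting "Artist - Title" out of the file stem, and accepting web lyrics only when the word overlap with the Whisper transcript clears the threshold. The function names and the exact overlap formula are assumptions for illustration; the real heuristic lives in scripts/fetch_lyrics_web.py and may differ.

```python
import re
from pathlib import Path
from typing import Optional


def split_artist_title(audio_path: str) -> tuple[str, str]:
    """Split 'Artist - Title.wav' into (artist, title) using the path stem."""
    artist, sep, title = Path(audio_path).stem.partition(" - ")
    if not sep:
        return "", artist.strip()  # no separator: treat the stem as the title
    return artist.strip(), title.strip()


def _words(text: str) -> set[str]:
    """Lowercase word set, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def word_overlap(whisper_text: str, web_text: str) -> float:
    """Fraction of distinct Whisper words that also appear in the web lyrics."""
    heard, found = _words(whisper_text), _words(web_text)
    return len(heard & found) / len(heard) if heard else 0.0


def pick_lyrics(whisper_text: str, web_text: Optional[str],
                min_match: float = 0.6) -> str:
    """Auto mode: use web lyrics only when the overlap clears the threshold."""
    if web_text and word_overlap(whisper_text, web_text) > min_match:
        return web_text
    return whisper_text  # includes the no_web_lyrics case
```

Falling back to the Whisper text on any miss matches the designed path: an empty LRCLib result is not treated as an error.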
Mashup Workflow
The signature MiniMax-specific feature: combine Song A (content + emotion) with Song B (style).
Workflow:
- Get Song A (audio file, YouTube URL, or song name)
- Get Song B (audio file, YouTube URL, or song name)
- Run emotion analysis on Song A (if audio available) to extract the emotional arc
- Build a prompt that applies Song B's style to Song A's content and emotion
- Generate using the cover workflow (preserves melody) or standard generation (creative reimagining)
This is the most powerful feature in this skill. The output preserves what makes Song A recognizable (lyrics, melody, emotion) while applying Song B's production style.
Full detail with the emotion-to-prompt conversion and the two-song analysis script: references/mashup-workflow.md and references/emotion-analysis.md.
Emotion Analysis
Emotion analysis extracts per-section features from input audio:
- Intensity (loudness) — drives dynamic range
- Pitch (Hz range, trend) — drives vocal intensity
- Vocal effort (low / medium / high) — drives delivery style
- Breathiness — drives intimacy vs full voice
- Spectral centroid (brightness) — drives timbre matching
- Emotion classification (list of 30+ emotions) — drives mood keywords for the prompt
- Repetitive intensification — drives chorus build
- Emotional shifts (sudden vs gradual) — drives transitions
- Vocal speed (syllables per second) — drives elongation cues
- Pitch bends at phrase endings — drives emotional emphasis
The analysis outputs JSON that the emotion_to_prompt.py script converts into a ready-to-use production-sheet prompt.
Local-only path (when MiniMax is unavailable):
emotion_to_prompt.py calls the MiniMax cloud, so it fails when MiniMax is blocked or no key is set. In that case build the prompt locally from the analysis JSON without that script: take the extracted BPM and key/scale as explicit metadata fields; turn the energy curve and spectral brightness into texture words; turn the emotion classification and intensity curve into mood words and dynamic section tags; and feed transcribed lyrics (full-mix Whisper) as the lyric body. This is the same data, assembled by the agent instead of the cloud helper, and it feeds any backend (including a local model).
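A minimal sketch of that local assembly follows. Everything in it is an assumption made for illustration: the field names (bpm, key, energy_curve, brightness, emotions), the 0.5 brightness cut-off, and the texture wording are hypothetical, not the schema the analysis scripts emit.

```python
def build_local_prompt(analysis: dict, lyrics: str) -> dict:
    """Assemble prompt pieces from analysis JSON without the cloud helper."""
    # Energy curve -> one coarse texture word (rising vs not rising).
    energy = [point["energy"] for point in analysis.get("energy_curve", [])]
    texture = []
    if energy:
        rising = energy[-1] > energy[0]
        texture.append("building dynamics" if rising else "steady or fading dynamics")

    # Spectral brightness -> timbre word; the 0.5 threshold is arbitrary.
    brightness = analysis.get("brightness")
    if brightness is not None:
        texture.append("bright, airy timbre" if brightness >= 0.5 else "warm, dark timbre")

    return {
        "bpm": analysis.get("bpm"),          # explicit metadata field
        "key": analysis.get("key"),          # explicit metadata field
        "texture": ", ".join(texture),
        "mood": ", ".join(analysis.get("emotions", [])),
        "lyrics": lyrics,                    # full-mix Whisper transcript
    }


example = build_local_prompt(
    {"bpm": 96, "key": "D major", "brightness": 0.7,
     "energy_curve": [{"t": 0, "energy": 0.3}, {"t": 60, "energy": 0.8}],
     "emotions": ["yearning", "hopeful"]},
    "[Verse]\nla la",
)
print(example["texture"], "|", example["mood"])
```

The returned dict can then be mapped onto whichever backend is available, for example separate flags for one tool or a single prompt string for another.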
Scripts: scripts/analyze_vocal_emotion.py, scripts/analyze_audio.py, scripts/emotion_to_prompt.py.
Full detail: references/emotion-analysis.md.
For the generation side — how to use the analysis to evoke emotion in the OUTPUT, the 21 emotion recipes (joy, desperation, melancholy, triumph, yearning, anger, vulnerability, confidence, nostalgia, anxious, hopeful, tragic, heroic, tender, sensual, lonely, playful, haunting, serene, celebratory, bittersweet), the iteration loop, and common mistakes — see references/emotion-delivery.md.
Analysis Quality (Summary Format, Confidence, Fallbacks)
Analysis scripts in scripts/ produce different views (emotion, beats, melody, structure, instrumentation). The skill expects them to converge on a single compact summary so downstream code and humans can read the same shape regardless of which scripts ran.
Compact Analysis Summary
Every analysis result should include a summary object with these keys:
| Key | Type | Meaning |
|---|---|---|
| tempo | string | BPM value with confidence, e.g. 120 BPM (confidence 0.92) |
| key | string | Detected key, e.g. E minor (confidence 0.71) |
| sections | list | Section labels with timing, e.g. [{"label": "verse", "start": 0.0, "end": 28.5}, ...] |
| instrumentation | list | Detected instrument palette, e.g. ["electric guitar", "drums", "bass"] |
| vocal_traits | dict | Breathiness, intensity, pitch range, e.g. {"breathiness": "high", "intensity": "medium"} |
| energy_curve | list | Per-section energy values, e.g. [{"t": 0, "energy": 0.6}, ...] |
| hook_points | list | Timestamps of detected hooks, e.g. [12.4, 48.0] |
| mix_notes | list | Short strings, e.g. ["vocal upfront", "wide stereo drums", "rolled-off highs"] |
Scripts may add their own fields, but every script must return at least the keys above (use empty list / unknown string when a key has no data).
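One way a script could guarantee the required keys is to merge its partial result over a table of defaults. This is a sketch under assumptions: normalize_summary is a hypothetical helper, and using an empty dict as the "no data" value for vocal_traits is a guess, since the text above only names the empty list and the unknown string.

```python
import copy

# Defaults for the eight required keys from the table above.
REQUIRED_SUMMARY_DEFAULTS = {
    "tempo": "unknown",
    "key": "unknown",
    "sections": [],
    "instrumentation": [],
    "vocal_traits": {},  # assumption: empty dict stands in for "no data"
    "energy_curve": [],
    "hook_points": [],
    "mix_notes": [],
}


def normalize_summary(partial: dict) -> dict:
    """Return a summary that always contains every required key."""
    summary = copy.deepcopy(REQUIRED_SUMMARY_DEFAULTS)  # never share the lists
    summary.update(partial)  # script-specific extra fields are kept as-is
    return summary


summary = normalize_summary({"tempo": "120 BPM (confidence 0.92)", "beats": 412})
print(sorted(summary))
```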
Confidence Levels
Every numeric or categorical detection in the analysis must carry a confidence value so weak detections do not get treated as facts.
| Confidence | Numeric range | Interpretation |
|---|---|---|
| clear | n/a | The detection is unambiguous (e.g. user-supplied text, MIDI-confirmed key). |
| high | >= 0.75 | Strong evidence from multiple sources or models. |
| medium | 0.5 - 0.74 | Reasonable evidence but alternative interpretations exist. |
| low | < 0.5 | Weak signal; treat as a hint, not a fact. |
| inferred | n/a | Not measured directly; derived from context (e.g. lyrics from a YouTube URL). |
| missing | n/a | Not available; the analysis did not run or did not find evidence. |
When feeding analysis into a prompt, prefix any low or medium detection with a hedge like "around" or "approximately", and never include missing values as if they were facts.
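The hedging rule can be sketched as a small mapping from score to label to prompt text. The helper names are hypothetical, and treating a score of None as "missing" is an assumption of this sketch; the non-numeric labels (clear, inferred) are left out because they are not derived from a score.

```python
from typing import Optional


def confidence_label(score: Optional[float]) -> str:
    """Map a numeric score onto the numeric rows of the table above."""
    if score is None:
        return "missing"
    if score >= 0.75:
        return "high"
    if score >= 0.5:
        return "medium"
    return "low"


def prompt_fragment(value: str, score: Optional[float]) -> Optional[str]:
    """Return prompt text for a detection, hedged when confidence is weak."""
    label = confidence_label(score)
    if label == "missing":
        return None  # never state a missing value as if it were a fact
    if label in ("low", "medium"):
        return f"around {value}"
    return value


print(prompt_fragment("120 BPM", 0.92))
print(prompt_fragment("E minor", 0.71))
```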
Fallback Behavior for Missing Optional Dependencies
The advanced analysis scripts depend on optional packages (librosa, parselmouth, transformers, demucs, beat_this, basic_pitch, etc.). Each script must:
- Try to import the optional dependency at the top of the function.
- On ImportError, return a JSON object that includes {"error": "install with pip install X", "summary": {}} instead of raising.
- Never let a missing optional dependency crash the whole workflow.
The orchestrator at scripts/analysis_orchestrator.py collects per-script results and continues even if some scripts failed. The combined summary simply omits keys whose underlying analysis could not run. The linter, prompt builder, and generation step all read the summary and skip missing keys without erroring.
This means a user without demucs installed can still get tempo, key, and structure analysis from the base pipeline. The only loss is the per-stem vocal analysis, which is opt-in via --use-demucs.
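The import-or-report pattern described above might look like the sketch below. run_with_optional is a hypothetical wrapper, not a function from the skill's scripts, and it assumes the pip package name equals the module name, which is not always true.

```python
import importlib
import json


def run_with_optional(module_name: str, analyze) -> dict:
    """Import an optional package and run `analyze(module)`.

    A missing package produces an error object instead of an exception,
    so the orchestrator can keep going with the remaining scripts.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return {"error": f"install with pip install {module_name}", "summary": {}}
    return {"summary": analyze(module)}


# A package that is not installed yields an error object, not a crash.
result = run_with_optional("a_package_that_is_not_installed", lambda m: {})
print(json.dumps(result))
```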
Rate Limits (MiniMax-specific)
The MiniMax Music 2.6 documented limits are:
- RPM: 120 requests per minute
- Concurrent connections: 20
- Output URL expiry: 24 hours (download the audio promptly)
- Cover feature ID validity: 24 hours (use the preprocess output within a day)
Under the Token Plan 3.0 (June 2026+), the actual quota is credit-based rather than RPM-based:
- A unified general credit pool covers M3, M2.7, and M2.7-highspeed
- A 5-hour rolling window resets continuously
- A weekly window runs Monday 02:00 CEST → next Monday 02:00 CEST
- Weekly status may be inactive on Plus plan (no weekly cap enforced, but the schema is there)
Practical implication: the documented 120 RPM is the API limit, but the Token Plan 3.0 quota is what determines your real ceiling. If you generate 4500 requests in 5 hours on Plus, you will be rate-limited regardless of RPM.
Before submitting a batch, check the active plan:
```bash
# Check current Token Plan usage
curl -s -H "Authorization: Bearer $MINIMAX_API_KEY" \
https://www.minimax.io/v1/token_plan/remains | jq .
```
If a call fails with 429 (rate limit):
- Wait at least 60 seconds.
- Check the Token Plan usage endpoint.
- If 5h window is exhausted, wait for the reset.
- Reduce concurrency if running a batch.
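For batch scripts, the first step of that list can be sketched as a small retry wrapper. This is illustrative only: send is a hypothetical zero-argument callable that returns (status_code, body), and the sleep function is injectable so the logic can be exercised without actually waiting.

```python
import time


def call_with_429_backoff(send, max_attempts: int = 3,
                          wait_seconds: float = 60.0, sleep=time.sleep):
    """Call `send()` until it returns a non-429 status or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts:
            sleep(wait_seconds)  # wait at least 60 seconds before retrying
    return status, body  # still rate-limited: check the Token Plan window


# Dry run with canned responses and a recording sleep function.
responses = iter([(429, ""), (200, "ok")])
waits = []
outcome = call_with_429_backoff(lambda: next(responses), sleep=waits.append)
print(outcome, waits)
```

If the wrapper still returns 429 after the last attempt, the 5-hour window is probably exhausted and retrying sooner will not help.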
Anti-Sparse (MiniMax-Specific Deep Dive)
The base skill's anti-sparse rules apply. The MiniMax-specific failure mode is more severe than other providers:
MiniMax interprets "sparse" or "minimal" as "remove all instruments", even more aggressively than other providers. The model has been observed to:
- Remove all instruments in quiet sections when the prompt uses the word "quiet"
- Drop percussion entirely when the prompt uses "intimate"
- Go a cappella on build-up sections when the prompt uses "build"
Mitigation:
- Never use the words "sparse", "minimal", "stripped back", "quiet" in a MiniMax prompt without pairing them with explicit instruments.
- Always add: "ALL instruments ALWAYS playing throughout, NEVER go a cappella or silent at any point".
- Always list every instrument you want to hear.
- For quiet sections, use the explicit form: "quiet sections: reduced to accordion and bass only, still fully played, NOT silent".
If a generation comes back sparse despite these rules, retry once with an even more explicit instrument list. If it fails again, the prompt has a structural issue — try a different style.
For the canonical anti-sparse text and worked examples, see the base skill's Anti-Sparse Rules section.
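The mitigation rules lend themselves to a simple pre-submit check. The sketch below is a hypothetical helper, not the skill's linter: it flags the four risky words named above and appends the guard sentence plus an explicit instrument list.

```python
import re

RISKY_WORDS = ("sparse", "minimal", "stripped back", "quiet")
GUARD = ("ALL instruments ALWAYS playing throughout, "
         "NEVER go a cappella or silent at any point")


def risky_words_in(prompt: str) -> list[str]:
    """Return the risky words that appear as whole words in the prompt."""
    lowered = prompt.lower()
    return [w for w in RISKY_WORDS
            if re.search(rf"\b{re.escape(w)}\b", lowered)]


def guard_prompt(prompt: str, instruments: list[str]) -> str:
    """Append the anti-sparse guard (once) and an explicit instrument list."""
    parts = [prompt.rstrip(". ")]
    if GUARD.lower() not in prompt.lower():
        parts.append(GUARD)
    parts.append("Instruments: " + ", ".join(instruments))
    return ". ".join(parts) + "."


guarded = guard_prompt("French chanson, 80 BPM", ["accordion", "bass"])
print(risky_words_in("Quiet, intimate verse with minimal drums"))
print(guarded)
```

A non-empty result from risky_words_in is a cue to rewrite the phrase in the explicit "reduced to X and Y only" form rather than to delete the word blindly.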
Quality Verification Checklist
Same 8-point checklist as the base skill, plus 4 MiniMax-specific items:
- Cover preserves melody recognisably. If the user said "make it sound like Song X", the new version should be recognisable as Song X's melody with Song Y's style.
- Emotion curve matches Song A (for mashups). The dynamic arc of the output should follow the original's intensity, not flatten to a single energy.
- --avoid flags are respected. If the user said "no electronic sounds", the output should not have synths.
- Per-flag control worked (BPM, key, structure). If the user asked for 80 BPM in E minor, the output should be in that range, not "close enough".
Output Verification (Covers, Mashups, Style Transfer)
After generation, run a post-generation check that is specific to the route. Use the analysis orchestrator's output on the generated file when possible.
Verification Checklist per Route
Cover (minimax_cover)
- Melody recognisable as the source (basic-pitch MIDI compare or ear-test)
- Target style is clearly audible (genre/mood keywords present)
- Source BPM is within ±10 BPM
- Source key is preserved (or user agreed to shift)
- Lyrics decision respected (original / translated / new / instrumental)
- --avoid flags respected
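The two numeric items in the cover checklist (BPM within ±10, key preserved unless the user agreed to a shift) can be checked mechanically once both files have been analyzed. A minimal sketch, assuming hypothetical dicts with bpm and key fields; the melody and style items still need a MIDI comparison or an ear test.

```python
def check_cover(source: dict, output: dict,
                key_shift_agreed: bool = False) -> list[str]:
    """Return the failed numeric checks for a cover; empty list means pass."""
    failures = []
    if abs(output["bpm"] - source["bpm"]) > 10:
        failures.append("bpm drifted more than 10 BPM from the source")
    if output["key"] != source["key"] and not key_shift_agreed:
        failures.append("key changed without user agreement")
    return failures


print(check_cover({"bpm": 80, "key": "E minor"},
                  {"bpm": 95, "key": "G major"}))
```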
Mashup (minimax_mashup)
- Song A's lyrics and emotional arc recognisable
- Song B's style is dominant in the production
- Vocal intensity matches Song A's emotion curve
- Section structure feels coherent (not random)
- --avoid flags respected for Song B's style
Style Transfer (minimax_style_transfer)
- Source style (the reference track) is reproduced in timbre, instrumentation, and feel
- Output melody is NOT recognisable as the source (it is a new composition in the source style)
- Target genre/mood keywords audible
- BPM and key reasonable for the new style (not forced from source)
Emotion Prompt / Precision (minimax_emotion_prompt)
- Per-flag values (BPM, key, structure, avoid) match the flags
- Prompt language and flags are not contradicting (linter clean)
- Lyrics reflect the requested theme and language
Failure Signatures and Fixes
When the generated track does not match the request, identify the failure signature and apply the matching fix. The most common signatures:
| Failure signature | Likely cause | Fix |
|---|---|---|
| Copied too closely (cover sounds like a remaster, not a new style) | Prompt did not specify the new style firmly enough, or --avoid list left the original instrumentation unguarded. | Add explicit target style language, list new instruments, expand --avoid with the source's dominant sounds. Re-run. |
| Lost source melody (cover no longer recognisable) | --prompt overrode the cover model, or the source audio was too noisy / clipped. | Switch to the two-step cover workflow (preprocess + generate with cover-feature-id); reduce style strength in the prompt. |
| Wrong tempo (BPM noticeably off) | Prompt and --bpm disagreed, or vocal delivery speed misled the detector. | Lint prompt + flags first. Re-run with the linter-clean pair. If still off, set --bpm explicitly and drop the BPM number from the prompt. |
| Wrong key (key shifted up/down) | Prompt mentioned a key but flags used another. | Lint the pair. Use the same key in both. If MIDI confirms a different source key, trust MIDI over prompt. |
| Muddy mix (low clarity, washed out) | Overly dense instrumentation, lack of anti-sparse guard, or too many --avoid exclusions. | Reduce instrument count, raise --bpm for tightness, add explicit "all instruments clearly audible". |
| Vocals too neutral (no emotion) | Emotion analysis not run, or intensity curve not transferred. | Run analyze_vocal_emotion.py on the source and feed intensity_curve into the prompt. Add explicit "vocal intensity: ..." clause. |
| Weak chorus (chorus does not lift) | Structure line lacks a build cue, or the prompt was a single energy. | Add structure with explicit build cues: "verse: intimate, chorus: soaring, all instruments louder in chorus". |
| Style mismatch (output does not match the requested genre) | Prompt used vague genre words or the wrong dominant instrument. | Replace vague words with concrete genre + instrument list. Use the canonical mmx prompt schema in references/examples.md. |
Revision Prompt Templates
When a generation comes back with one of the failure signatures above, build a revision prompt that preserves the source identity while changing the failing dimension.
Template: stronger style change (cover too close)
```text
Same melody and lyrics as before. Re-imagine the production as [TARGET_STYLE] with [INSTRUMENT_LIST].
ALL instruments always playing throughout, never go a cappella.
Avoid: [STYLE_CONTRADICTING_WORDS from previous run].
```
Template: keep the melody (cover lost it)
```text
Re-apply the original melody from the source audio. Keep the recognizable hook at [HOOK_TIME].
Use a softer production in [TARGET_STYLE] but DO NOT change the melodic contour.
Avoid: [WORDS_THAT_PUSHED_TOO_FAR].
```
Template: fix tempo drift
```text
Keep the source BPM (use --bpm [SOURCE_BPM]). Do not slow down or speed up the vocal delivery.
Avoid: rubato, half-time, double-time, slowing down, speeding up.
```
Template: fix key shift
```text
Stay in [SOURCE_KEY]. Do not transpose. Use the same chord progression as the source.
Avoid: key change, modulation, transpose.
```
Template: fix muddy mix
```text
Make every instrument clearly audible. Reduce instrument count to [N].
Add contrast: quieter verses, louder choruses. Keep vocals upfront in the mix.
Avoid: dense layering, atmospheric washes, sustained pads throughout.
```
Template: lift the chorus
```text
Chorus: soaring, all instruments louder than the verse, fuller chords, more reverb on the lead vocal.
Verse: intimate, single voice, soft drums, breathy delivery.
Bridge: build tension, add a melodic lift before the final chorus.
```
These templates pair with the failure-signature table. After the revision, re-run the verification checklist above.
Lyrics Optimizer Behavior
Same as the base skill — when music_generate is called without explicit lyrics, MiniMax auto-generates. With this skill, you can also call the lyrics_generation API directly to preview the lyrics before generation, or to iterate via the edit mode.
If the user wants specific words, the lyrics_generation API's edit mode lets you modify auto-generated lyrics to match the user's intent without regenerating the whole song.
Reference Map
- references/mmx-flags-reference.md: full mmx CLI flag reference with worked examples
- references/examples.md: practical MiniMax examples with routing, first questions, workflow shapes, and prompt/flag lint catches
- references/cover-workflow.md: one-step and two-step cover workflow with payloads, error handling, use cases
- references/lyrics-generation.md: the lyrics_generation API endpoint, both modes, examples
- references/mashup-workflow.md: two-song mashup workflow, emotion-to-prompt conversion, decision tree
- references/emotion-analysis.md: 25+ emotion classifications + per-emotion detection cookbook + emotion combinations + the analysis pipeline
- references/emotion-delivery.md: 21 emotion recipes for the OUTPUT + iteration loop + common mistakes
- references/advanced-audio-analysis.md: advanced free tools (Essentia, Demucs, Basic Pitch, Music21, CREPE) for deeper analysis when basic librosa/parselmouth is not enough
- references/error-handling.md: MiniMax-specific error table, recovery patterns, anti-sparse failure recovery
- scripts/check_environment.py: lightweight preflight diagnostic for Python, env vars, CLI tools, and optional packages
- scripts/lint_music_request.py: standard-library helper for routing, blocker, missing-field, prompt, and mmx flag conflict checks
- scripts/smoke_test.py: standard-library smoke tests for pure helper behavior
- scripts/: Python helpers for audio analysis (download, segment, analyze, convert emotion to prompt)
- music-craft: base skill with shared concepts (Pre-Flight, anti-sparse, prompt formula, structure tags, Request Intake, User Preference Flow)
- music-craft → references/free-tool-inputs.md: base layer for free tool inputs (web_fetch, web_search, image, memory)
- references/free-tool-inputs.md: MiniMax layer: free-tool routing, blocker checks, and prompt/flag conflict lint before analysis
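- Make sure OpenClaw is installed (locally or via Docker).
- Enter the install command in the chat box: /install music-craft-minimax
- Once installation finishes, invoke the Skill by name or trigger it with /music-craft-minimax
- Provide the inputs described in the Skill's parameter notes to get structured output.
What is Music Craft — MiniMax?
Advanced music generation for OpenClaw, using the MiniMax Music 2.6 token plan. Use for cover and style transfer, two-song mashup, lyrics generation API, emo... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 30 downloads to date.
How do I install Music Craft — MiniMax?
Run the command "/install music-craft-minimax" in the OpenClaw or Claude Code chat box to install it in one step; no extra configuration is needed.
Is Music Craft — MiniMax free?
Yes. Music Craft — MiniMax is completely free under the MIT-0 license and can be freely downloaded, installed, and used.
Which platforms does Music Craft — MiniMax support?
Music Craft — MiniMax is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.
Who develops Music Craft — MiniMax?
It is developed and maintained by LuisCharro (@luischarro); the current version is v1.0.1.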