
Music Craft — MiniMax

Author: LuisCharro · GitHub · v1.0.1 · MIT-0
cross-platform ⚠ suspicious
Install in OpenClaw
/install music-craft-minimax
Description
Advanced music generation for OpenClaw, using the MiniMax Music 2.6 token plan. Use for cover and style transfer, two-song mashup, lyrics generation API, emo...
Usage (SKILL.md)

Music Craft — MiniMax

This is the power-user upgrade of music-craft. It does everything that skill does, plus the features that require the MiniMax Music 2.6 token plan:

  • Cover (melody-preserving) and style transfer from a reference audio file or YouTube URL
  • Two-song mashup (Song A's content and emotion + Song B's style)
  • Lyrics generation via the MiniMax API endpoint (with edit mode for iteration)
  • Emotion analysis on input audio to drive prompt construction (vocal speed, intensity curve, pitch bends)
  • Fine control over generation parameters (BPM, key, structure, avoid list as separate flags via mmx)

For everything else (standard song generation, instrumentation, anti-sparse prompt engineering, structure tags, user preference flow), this skill uses the same workflow as music-craft. Read that skill first to understand the base, then come back here for the MiniMax-specific extensions.

Routing and Blocker Checks

Classify the request before analysis or generation:

  • Text-only style reference means the user gave a song name, artist, era, or genre cue without source audio. Treat it as style inference, not cover analysis.
  • Reference audio or YouTube means the user provided a file or playable source that should be analyzed.
  • Cover preserves melody and usually needs a source file plus a target style decision.
  • Style transfer uses a reference track or analyzed audio as style input, then changes the production direction.
  • Mashup needs Song A and Song B, plus a decision about which one contributes content and which one contributes style.
  • Emotion prompt means the user wants analysis turned into descriptive prompt language, not a full cover.

The scripts/lint_music_request.py helper emits one of these routes:

| Route | When |
| --- | --- |
| base_prompt | Standard generation, no MiniMax-specific feature needed. |
| minimax_cover | Melody-preserving cover from audio or YouTube. |
| minimax_mashup | Two-song mashup (A + B, both identified). |
| minimax_style_transfer | Style transfer that does not preserve the source melody. |
| minimax_emotion_prompt | Emotion analysis, or precision mmx flag usage. |
| needs_clarification | At least one blocker is unresolved; ask the user first. |

Surface blockers before analysis:

  • no source file or usable URL
  • unclear which track is Song A versus Song B
  • missing target style
  • missing lyrics decision, such as original, translated, rewritten, or instrumental
  • conflicting cover/style-transfer intent: the user asked for both "cover" (preserve melody) and "style transfer" (reproduce style) at once. These are mutually exclusive. Ask the user to pick one.
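
The routing and blocker rules above can be sketched as a small decision function. This is only an illustration under stated assumptions: `MusicRequest`, its field names, and `classify` are hypothetical and are not the interface of scripts/lint_music_request.py. Which blockers apply to which route (for example, that a mashup does not need a separate target style) is also an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class MusicRequest:
    """Hypothetical intake record; every field name here is illustrative."""
    source_count: int = 0                  # usable audio files or playable URLs
    wants_cover: bool = False              # preserve the source melody
    wants_style_transfer: bool = False     # reproduce a style, not the melody
    wants_mashup: bool = False
    wants_emotion_prompt: bool = False
    roles_known: bool = False              # Song A vs Song B decided (mashup)
    target_style: Optional[str] = None
    lyrics_decision: Optional[str] = None  # original/translated/rewritten/instrumental


def classify(req: MusicRequest) -> Tuple[str, List[str]]:
    """Return (route, blockers); any blocker forces needs_clarification."""
    if not (req.wants_cover or req.wants_style_transfer
            or req.wants_mashup or req.wants_emotion_prompt):
        return "base_prompt", []

    blockers: List[str] = []
    if req.wants_cover and req.wants_style_transfer:
        blockers.append("conflicting cover/style-transfer intent")
    if req.source_count == 0:
        blockers.append("no source file or usable URL")
    if req.wants_mashup and not req.roles_known:
        blockers.append("unclear which track is Song A versus Song B")
    if (req.wants_cover or req.wants_style_transfer) and not req.target_style:
        blockers.append("missing target style")
    if (req.wants_cover or req.wants_mashup) and not req.lyrics_decision:
        blockers.append("missing lyrics decision")
    if blockers:
        return "needs_clarification", blockers

    if req.wants_mashup:
        return "minimax_mashup", []
    if req.wants_cover:
        return "minimax_cover", []
    if req.wants_style_transfer:
        return "minimax_style_transfer", []
    return "minimax_emotion_prompt", []
```

Collecting every blocker before returning, rather than stopping at the first one, lets the operator ask all the required questions in one round.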

After you have prompt text and mmx flags, lint them together before generation:

  • compare prompt BPM with --bpm
  • compare prompt key with --key
  • compare prompt structure line with --structure
  • compare prompt duration with any requested duration (mmx has no native duration flag, so this is an implicit length expectation; see "Song length")
  • compare prompt vocal mode with --vocals
  • compare prompt language with --language
  • compare prompt avoid language with --avoid
  • stop when the prompt says one thing and the flags say another

If the user only has a text reference, route to the free-tool path in references/free-tool-inputs.md first. If the user has audio, analyze first and only then build the prompt. The linter returns a retry_guidance array with one hint per conflict so the operator can re-align prompt and flags on the next attempt.
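
As a rough illustration of the prompt-versus-flags lint, the sketch below checks only BPM and key and returns one hint per conflict, in the spirit of the retry_guidance array. The function name, the regular expressions, and the shape of the `flags` dict are assumptions, not the real linter's behavior.

```python
import re
from typing import Dict, List

# Assumed prompt conventions: "90 BPM" and "D minor" style phrases.
BPM_RE = re.compile(r"(\d{2,3})\s*BPM", re.IGNORECASE)
KEY_RE = re.compile(r"\b([A-G][#b]?)\s+([Mm]ajor|[Mm]inor)\b")


def lint_prompt_against_flags(prompt: str, flags: Dict[str, object]) -> List[str]:
    """Return one retry hint per prompt/flag conflict (empty list means aligned)."""
    retry_guidance: List[str] = []

    bpm = BPM_RE.search(prompt)
    if bpm and "bpm" in flags and int(bpm.group(1)) != int(flags["bpm"]):
        retry_guidance.append(
            f"prompt says {bpm.group(1)} BPM but --bpm is {flags['bpm']}; pick one"
        )

    key = KEY_RE.search(prompt)
    if key and "key" in flags:
        prompt_key = f"{key.group(1)} {key.group(2)}".lower()
        flag_key = " ".join(str(flags["key"]).split()).lower()
        if prompt_key != flag_key:
            retry_guidance.append(
                f"prompt says {key.group(1)} {key.group(2)} but --key is {flags['key']}; pick one"
            )

    return retry_guidance
```

The same pattern would extend to structure, vocal mode, language, and avoid list; an empty result means generation can proceed.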

When To Use

Use this skill when the task involves:

  • generating a cover of an existing song with a different style (chanson version of a rock track, reggaeton version of a pop hit, and so on)
  • style transfer from a YouTube URL or audio file to a target genre
  • two-song mashup where Song A's lyrics and emotional arc are kept, but Song B's style is applied
  • emotion analysis on input audio to extract intensity curves, vocal speed, pitch bends, and emotion classifications
  • generating lyrics in a specific language and theme via the MiniMax lyrics_generation API
  • editing existing lyrics to match a target style or emotional arc (MiniMax lyrics_generation edit mode)
  • using mmx CLI directly for fine control over --avoid, --bpm, --key, --structure, --vocals, --instruments as separate flags
  • accessing MiniMax's music-cover or music-cover-free models for melody preservation

Request Intake (adapted for MiniMax features)

After the Routing and Blocker Checks classify the request, run this 2-pass intake to extract the full set of fields the user cares about. Label each field's confidence: clear (user said it), inferred (sensible default), missing (need to ask), or conflicting (user said two incompatible things — pause to resolve).

Fields checklist (MiniMax-specific)

| # | Field | What to look for | MiniMax-specific notes |
| --- | --- | --- | --- |
| 1 | Route | Cover / style transfer / mashup / standard / emotion prompt | From the Routing and Blocker Checks section. Determines which MiniMax features to use. |
| 2 | Source audio or URL | File path or playable YouTube URL | Required for cover, mashup, style transfer. For standard, optional (text-only style reference is also fine). |
| 3 | Song A identity | Name, artist, audio | For mashup: needed. For cover: this is the source. |
| 4 | Song B identity | Name, artist, audio | For mashup only. |
| 5 | Target style | Genre / mood / reference | The destination of the cover or style transfer. If user says "like Rosalía", that's clear. If user says "something good", that's missing. |
| 6 | Lyrics decision | Original / translated / new / instrumental | For cover, default to original (translated if user requests it). For standard, default to new (or user-provided). |
| 7 | Vocal mode | Solo / duet / choir / instrumental | Drives --vocals and --language flags. |
| 8 | Language | BCP-47 code (en, fr, es, etc.) | For lyrics language AND vocal language. |
| 9 | Duration | Approximate length (jingle ~30s, standard ~3min, epic ~6min) | mmx has no native duration control (see "Song length" section). Length is driven by lyrics + structure, so the intake needs the lyrics to control length. |
| 10 | BPM, key, structure | Exact values if user wants --bpm/--key/--structure | Optional. If provided, the prompt AND flags must agree (lint them). |
| 11 | Emotion arc | For emotion-prompt workflows: which emotions to emphasize | Drives the analysis-to-prompt translation. |
| 12 | Output location | Where the audio and analysis files go | Same as the base skill: per-song subfolder in ~/Music mix/<project>/<song-slug>/. |

Confidence map example (MiniMax-specific)

Request: "Hazme un cover del 'Bizcochito' de Rosalía pero en reggaetón" ("Make me a cover of Rosalía's 'Bizcochito', but as reggaeton")

clear:     source_audio=path, song_a=Bizcochito, target_style=reggaeton
inferred:  language=es, vocal_mode=solo_female, lyrics_decision=original
missing:   output_location (which project folder? per-song subfolder?)
            vocal_register (full chest, head voice, whisper? — affects --vocals flag)

Request: "I have a YouTube link of an old rock song and want it as a dreamy shoegaze ballad, with English lyrics because the original is in French"

clear:     source_url=URL, song_a=old_rock_song, target_style=shoegaze
            lyrics_decision=translated, target_language=en
inferred:  vocal_mode=duet or solo (depends on original), duration=~3min
missing:   local audio file for analysis (the YouTube audio must be downloaded first)
            BPM/key from analysis output (will be filled in after analysis)
            output_location

If any field is missing or conflicting, that's a question to ask. The Ambiguity Questions section below has specific patterns for each route. If everything is clear or inferred, the request is ready to translate.
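
One way to hold a confidence map in code is a plain dict from field name to a (value, confidence) pair. The sketch below is an assumption about representation only; `fields_to_ask` and the dict layout are hypothetical, and the values mirror the first example above.

```python
from typing import Dict, List, Optional, Tuple

# The four intake labels used in this section.
CONFIDENCE_LEVELS = ("clear", "inferred", "missing", "conflicting")

ConfidenceMap = Dict[str, Tuple[Optional[str], str]]


def fields_to_ask(confidence_map: ConfidenceMap) -> List[str]:
    """Fields that need a question: anything missing or conflicting."""
    for name, (_, level) in confidence_map.items():
        if level not in CONFIDENCE_LEVELS:
            raise ValueError(f"unknown confidence level for {name}: {level}")
    return [
        name
        for name, (_, level) in confidence_map.items()
        if level in ("missing", "conflicting")
    ]


# Mirrors the 'Bizcochito' example: two fields still need an answer.
bizcochito: ConfidenceMap = {
    "song_a": ("Bizcochito", "clear"),
    "target_style": ("reggaeton", "clear"),
    "language": ("es", "inferred"),
    "lyrics_decision": ("original", "inferred"),
    "output_location": (None, "missing"),
    "vocal_register": (None, "missing"),
}
```

When `fields_to_ask` returns an empty list, the request is ready to translate into a prompt.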

User Preference Flow (message patterns → action)

The skill does not start with a questionnaire. It starts by reading and inferring from the user's natural-language request.

| User says... | Skill does... |
| --- | --- |
| "Haz un cover de X en Y" ("Make a cover of X in Y") | Route: minimax_cover. Ask: source audio file (or download from YouTube), target language for lyrics, vocal register. |
| "Make this song sound like Rosalía" | Route: minimax_style_transfer. Ask: source audio, which album/era of Rosalía. |
| "I have audio of A, mash with B, keep A's melody" | Route: minimax_mashup. Ask: A vs B confirmation, source audio for A, B can be name or audio. |
| "Analyze the emotion curve of this track" | Route: minimax_emotion_prompt (analysis-only). Run analysis_orchestrator.py --audio first, then read the JSON. |
| "I want the lyrics to be about X, in French, melancholic" | Route: base_prompt (standard). Use the lyrics API to generate, then pass to mmx music generate --lyrics-file. Ask: target BPM/key/structure or derive from analysis. |
| "Recreate the song but in 90 BPM D minor" | Route: base_prompt with mmx flags. Lint prompt vs flags before generation. Verify BPM/key consistency. |
| "I don't know, surprise me" | Pick a coherent default (e.g. upbeat indie pop, EN, ~3min, auto-lyrics, standard generation) and confirm with the user before generating. |
| "Same song again but as a reggaeton version" | Route: minimax_cover with the existing song as source. Use the same project/song subfolder, suffix the MP3 (M1_original.mp3 + M2_reggaeton.mp3). |

This table summarizes references/user-preference-flow.md (which lives in the base skill). For a more detailed case, defer to the base skill's table and combine it with this skill's route mapping.

Output File Layout (Per-Song Subfolders)

MiniMax-specific additions (drop these into the per-song subfolder alongside the base items):

| File | Source | Notes |
| --- | --- | --- |
| <song-slug>_analysis.json | analysis_orchestrator.py --output | MiniMax-specific analysis results (emotion, BPM, key, segments) |
| <song-slug>_lyrics.txt | mmx music generate --lyrics-file | Optional if user provided lyrics inline |
| <song-slug>_<style>_prompt.txt | The exact text passed to --prompt | For reproducibility |

The LLM should aim for the base skill's layout by default. The MiniMax-specific files are added on top when MiniMax features are used (cover workflow, mashup, analysis, etc.).

Quick Start with the Orchestrator

For any input combination, the analysis_orchestrator.py script is the single entry point:

# Audio file
python3 scripts/analysis_orchestrator.py --audio /tmp/song.wav

# Two songs (mashup) - gets BPM + key compatibility scoring for free
python3 scripts/analysis_orchestrator.py --audio /tmp/song_a.wav --audio /tmp/song_b.wav

# Video - extracts audio + visual features (scenes, color, motion)
python3 scripts/analysis_orchestrator.py --video /tmp/clip.mp4

# Image (album art) - color palette + style hints
python3 scripts/analysis_orchestrator.py --image /tmp/album_art.jpg

# YouTube URL - downloads then analyzes
python3 scripts/analysis_orchestrator.py --youtube "https://youtube.com/watch?v=..."

# Combination: audio + image
python3 scripts/analysis_orchestrator.py --audio /tmp/song.wav --image /tmp/art.jpg

# Demucs source separation — for TIMBRE/PITCH analysis of an isolated vocal, NOT for lyrics
python3 scripts/analysis_orchestrator.py --audio /tmp/song.wav --use-demucs

# Whisper lyrics extraction — run on the FULL mix (do NOT pre-separate with Demucs)
python3 scripts/analysis_orchestrator.py --audio /tmp/song.wav --lyrics

# VLM captioning for images (calls mmx vision describe / MiniMax 3.0 — cloud, skip if MiniMax is blocked)
python3 scripts/analysis_orchestrator.py --image /tmp/album_art.jpg --vlm

The orchestrator dispatches to the right analysis scripts and produces a unified JSON. Optional packages (CLAP, autochord, allin1, pyloudnorm, pylette, scenedetect, demucs, beat_this, basic-pitch, transformers/MERT, open_clip) are detected at runtime and used when available.
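
Runtime detection of optional packages can be done without importing them. The following is a minimal sketch using the standard library; the function names and the shape of the error block are assumptions and may differ from what the orchestrator actually emits.

```python
import importlib.util
import json
from typing import Dict, Iterable


def detect_optional(packages: Iterable[str]) -> Dict[str, bool]:
    """Map each top-level import name to whether it is installed.

    find_spec locates a module without importing it, so heavy packages
    are not loaded just to check that they exist.
    """
    return {name: importlib.util.find_spec(name) is not None for name in packages}


def missing_dependency_block(step: str, package: str) -> str:
    """Hypothetical JSON error block for a step whose package is absent."""
    return json.dumps(
        {
            "step": step,
            "status": "skipped",
            "error": f"optional package '{package}' is not installed",
        }
    )
```

Usage would look like `detect_optional(["demucs", "pyloudnorm"])`, passing import names rather than pip names (for example `basic_pitch`, not `basic-pitch`).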

Extraction guidance (what actually improves the output)

These are the rules that make the extracted data useful to the downstream generator. They are tool-agnostic — they apply whether the backend is MiniMax cloud or a local model.

  • Lyrics: transcribe the FULL mix, do not Demucs-first. Feeding Demucs-isolated vocals into Whisper measurably worsens transcription word-error-rate in most configurations. Run the transcriber on the original mix. Use faster-whisper over vanilla whisper (same accuracy, much lower latency/VRAM), and prefer the large-v2 model for sung lyrics — large-v3 is reliably worse on singing. Use medium/base only as a speed compromise.
  • Use Demucs only for timbre/pitch. Source separation helps when you want clean vocal-stem features (breathiness, pitch range, vocal brightness) or per-instrument detection — never as a lyrics pre-step.
  • Prioritise the high-value features. For driving a generation prompt, the features that matter most (in order) are tempo/BPM, key/scale, beats/downbeats, chords, then structure (section boundaries). Energy/RMS and spectral centroid map to texture words (punchy, airy, sparse, dense) and to dynamic tags. Spend analysis budget there first.
  • Give key detection a long window. Estimate key/chroma over ~120s of audio (not a short clip) for a stable result; BPM is stable from ~60s.
  • Carry confidence through to the prompt. Hedge low/medium detections ("around 128 BPM", "likely D minor") and never inject missing values as facts — see Analysis Quality below.
  • Map structure boundaries to actions. Detected section boundaries become the [Verse]/[Chorus]/[Bridge] tag roadmap, and (for backends that support it) the repaint windows for fixing one bad section instead of regenerating the whole track.
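
The "carry confidence through to the prompt" rule can be sketched as a small helper that hedges low and medium detections and drops missing ones. The helper name and the decision to hedge "inferred" values as well are assumptions made for this illustration.

```python
from typing import Optional

# Assumption: inferred values are hedged the same way as low/medium detections.
HEDGED_LEVELS = {"low", "medium", "inferred"}


def hedge_detection(text: Optional[str], confidence: str, hedge_word: str) -> Optional[str]:
    """Return prompt wording for a detection, or None if it must be omitted."""
    if text is None or confidence == "missing":
        return None  # never inject a missing value as a fact
    if confidence in HEDGED_LEVELS:
        return f"{hedge_word} {text}"
    return text  # clear/high detections are stated plainly


def build_prompt_fragment(parts: list) -> str:
    """Join the surviving phrases, skipping omitted detections."""
    return ", ".join(p for p in parts if p)
```

For example, `build_prompt_fragment([hedge_detection("128 BPM", "medium", "around"), hedge_detection("D minor", "low", "likely")])` yields a fragment that keeps both hedges visible to the generator.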

Output file layout (per-song subfolders)

Every generation should be saved into a per-song subfolder that bundles the audio with its analysis, prompt, and lyrics. The LLM should ask the user for the project root and song slug up front (default: ~/Music mix/<project>/<song-slug>/), then run the full chain of commands below.

# Example: DBC - Two Paths, two versions

# 1. Make the subfolder
mkdir -p ~/Music\ mix/dbc/two-paths

# 2. Run the analysis and save JSON into the subfolder
python3 scripts/analysis_orchestrator.py \
  --audio /tmp/two_paths.wav \
  --use-demucs --lyrics --lyrics-source auto \
  --output ~/Music\ mix/dbc/two-paths/two_paths_analysis.json

# 3. Build the prompt from the analysis, save it next to the JSON
python3 scripts/emotion_to_prompt.py \
  --emotion ~/Music\ mix/dbc/two-paths/two_paths_analysis.json \
  --output ~/Music\ mix/dbc/two-paths/two_paths_synthwave_prompt.txt

# 4. Generate each version, save the MP3 into the subfolder with a
#    versioned filename so multiple takes stack cleanly
mmx music generate \
  --prompt "$(cat ~/Music\ mix/dbc/two-paths/two_paths_synthwave_prompt.txt)" \
  --lyrics-file ~/Music\ mix/dbc/two-paths/two_paths_lyrics.txt \
  --out ~/Music\ mix/dbc/two-paths/M1_two_paths_synthwave.mp3

The result is a self-contained song folder that the user can review, archive, share, or re-generate from without losing any context.
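
To keep multiple takes stacking cleanly, the next versioned filename can be derived from what is already in the song folder. This sketch assumes the `M<n>_` prefix seen in the examples; the helper name is hypothetical and the base skill's actual slug and version rules may differ.

```python
import re
from typing import Iterable

# Assumed convention from the examples: M1_..., M2_..., and so on.
TAKE_RE = re.compile(r"^M(\d+)_")


def next_take_name(existing: Iterable[str], stem: str, ext: str = ".mp3") -> str:
    """Return the next 'M<n>_<stem><ext>' name given the folder's filenames."""
    numbers = [int(m.group(1)) for m in map(TAKE_RE.match, existing) if m]
    return f"M{max(numbers, default=0) + 1}_{stem}{ext}"
```

In practice `existing` could come from `os.listdir` on the per-song subfolder, so a second version never overwrites the first.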

What's New in v1.0.0

v1.0.0 is the first stable release. It builds on the v0.x series (v0.3.0 / v0.4.0 dev line) with stronger preflight routing, wider prompt/flag consistency, and explicit post-generation verification:

Preflight routing:

  • lint_music_request.py now emits one of six routes: base_prompt, minimax_cover, minimax_mashup, minimax_style_transfer, minimax_emotion_prompt, or needs_clarification
  • New blockers: missing Song B, missing lyrics decision, and conflicting cover/style-transfer intent
  • A retry_guidance array on every conflict so the operator can re-align prompt and flags

Prompt and flag consistency:

  • Linter now detects conflicts in BPM, key, structure, duration, vocal mode, language, and avoid list
  • The canonical mmx prompt schema is documented in references/examples.md

Analysis quality:

  • All analysis scripts converge on a compact summary (tempo, key, sections, instrumentation, vocal traits, energy curve, hook points, mix notes)
  • Confidence levels (clear, high, medium, low, inferred, missing) attached to every detection
  • Missing optional dependencies fall back to a JSON error block instead of failing the whole workflow

Output verification:

  • Post-generation verification checklists for covers, mashups, style transfer, and emotion prompts
  • Eight failure signatures (copied too closely, lost melody, wrong tempo, wrong key, muddy mix, weak chorus, style mismatch, neutral vocals) with matching fixes
  • Revision prompt templates that preserve source identity while fixing one specific dimension

Tests and portability:

  • Smoke tests now cover all new linter routes, the new conflict types, and the stdlib-only import guarantee
  • Windows is documented as partial support; scripts stay POSIX-safe, audio tools may need platform install

What's New in v0.3.0

v0.3.0 builds on v0.2.0 with a substantially richer analysis pipeline:

New analysis scripts (8):

  • extract_stems.py — Demucs source separation (vocal/drums/bass/other)
  • track_beats.py — beat_this beat + downbeat tracking (ISMIR 2024 SOTA)
  • extract_melody.py — Spotify Basic Pitch polyphonic AMT → MIDI + key/scale
  • compute_audio_embedding.py — MERT v1-330M music embeddings (vibe similarity)
  • classify_instruments.py — MIT AST 527-class AudioSet tagging
  • extract_video_features.py — extended with camera motion + VLM captioning
  • analyze_image.py — extended with OpenCLIP, OCR, face detection, VLM caption
  • analysis_orchestrator.py — single entry point, --use-demucs, --vlm, --ocr flags

New prompt slots (consumed in emotion_to_prompt.py):

  • beat grid: 4/4 at 150 BPM (confidence 0.80) from beat_this
  • melodic key from MIDI: E minor; interval motion: mostly leaps; modal character: pentatonic, blues from Basic Pitch
  • AST-detected sound palette: rock music (0.16), punk rock (0.14), grunge (0.20) from MIT AST
  • emotion signature from analysis: intense, passionate, dramatic, triumphant (expanded to 25-emotion classifier)
  • vocal texture in verse: breathier / more intimate than average (per-section aggregation)
  • tempo: tight, on-beat delivery (from tempo_consistency)
  • tonal character: dark warm tone, rolled-off highs (from brightness)
  • instruments detected: electronic / synthetic textures (from instrument_hints)
  • natural dramatic pauses detected at: 2s (11.7s pause), 20s (3.3s pause) (from Demucs vocal-stem)
  • style direction: ... (from analyze_two_songs mashup_plan)

Bug fixes:

  • parselmouth 0.4.x API (get_value_at_time / get_value_at_xy)
  • ffmpeg 8.x image2 muxer workaround (per-frame extraction)
  • pylette 5.1+ capital-P import + Pylette fallback
  • open_clip 3.3 3-tuple return + get_tokenizer() for tokenizer
  • demucs 4.x apply_model API

Prompting wins (verified end-to-end with DBC Woodstock 2013):

  • Mix: 0 silence gaps, 35 pitch bends
  • Vocal stem: 19 silence gaps, 49 pitch bends, 2.32 syll/sec
  • BPM 150 (4/4) from beat_this, E minor (MIDI-confirmed)
  • AST: "Rock music", "Punk rock", "Heavy metal", "Grunge" — matches the actual band

When NOT To Use

Do not use this skill when:

  • the user only needs standard song generation without cover, mashup, or analysis — use music-craft instead (lighter, no MiniMax dependency)
  • the runtime does not expose a music_generate tool and no MINIMAX_API_KEY is configured; both skills need a runtime that can generate music
  • the user wants deterministic, single-shot generation with no iteration; this skill is overkill for that
  • the user wants to mutate a specific existing audio file (pitch shift, time stretch, stem split) — that is post-production, not generation
  • the user is not on a MiniMax Token Plan — the advanced features (cover, mmx per-flag control, lyrics API, emotion-driven prompts) require the plan

Decision Tree

Use the base skill unless one of these MiniMax-specific needs is present:

  • melody-preserving cover or style transfer from audio or YouTube
  • two-song mashup
  • lyrics API preview/edit flow
  • emotion analysis that feeds the prompt
  • exact mmx control for BPM, key, structure, or avoid lists

If the user wants a new song that only borrows a style, stay in music-craft unless they also need exact flag control or lyrics API iteration.

If the source is a YouTube URL and download is blocked, ask for a local file before changing the workflow.

First Response Defaults

Use these defaults on the first pass:

  • Cover from audio or YouTube: start with the one-step cover path. Switch to two-step only if the user wants translated lyrics, edited ASR lyrics, or custom lyrics.
  • Style transfer only: do not use cover unless melody preservation matters. Use standard generation plus mmx flags if exact BPM/key/structure matter.
  • Two-song mashup: anchor on Song A. If Song A has audio, default to the cover two-step workflow; if Song B is only named, ask for a short style description or fetch more context if free tools are available.
  • Lyrics API generation or edit: use write_full_song for blank-page generation and edit for revisions.
  • Emotion-analysis-to-prompt: run analysis first, then convert to a prompt; only ask whether the output should be cover, mashup, or standard generation, plus the target language if missing.
  • Exact BPM/key/structure control: make mmx flags the source of truth and keep the prompt descriptive but non-conflicting.

Ambiguity Questions

Ask at most three questions. Separate blockers from quality tweaks:

  • Required blockers first: source file or URL, which song is A vs B, whether lyrics already exist, whether the output must preserve melody.
  • Optional quality after blockers: target language, target style, BPM, key, structure, instruments, vocal color, avoid list.

Use these exact patterns when clarification is needed:

  • Cover: "Which source should I use?" "Do you want the original lyrics, translated lyrics, or new lyrics?" "Any target style, or should I derive it from the source?"
  • Mashup: "Which song is A and which is B?" "Do you have audio for Song B, or only the name?" "Should the lyrics stay the same or be rewritten?"
  • Lyrics API: "Write from scratch or edit existing lyrics?" "What language should I target?" "Any hard structure requirements?"
  • Emotion prompt: "Do you want cover, mashup, or standard generation?" "What language should the output use?" "Should I prioritize tenderness, energy, or structure?"
  • mmx precision: "Which values are mandatory: BPM, key, structure, or avoid list?" "Any instruments or vocals that must stay in or stay out?"

Relationship to music-craft

This skill extends the base skill; it does not replace it. The shared concepts are:

| Concept | Where it lives |
| --- | --- |
| Pre-Flight Check (platform detection) | This skill (extended required list) |
| Anti-sparse rules (canonical text) | Base skill, referenced from here |
| Prompt formula (production sheet) | Base skill, referenced from here |
| Structure tags (14 tags) | Base skill, referenced from here |
| User preference flow (auto-detect + ask) | Base skill, referenced from here |
| Output file layout (per-song subfolders, slug rules, version prefix) | Base skill, referenced from here; MiniMax adds analysis.json and lyrics.txt |
| Rate limits (generic) | Base skill |
| Quality verification checklist | Base skill, extended here for MiniMax |
| Operating rules (6-step loop) | Base skill, summarized here with MiniMax-specific extensions |

The MiniMax-specific additions are:

| MiniMax concept | Where it lives |
| --- | --- |
| mmx CLI quick reference | This skill |
| mmx full flag reference | This skill, references/mmx-flags-reference.md |
| Cover workflow (one-step, two-step) | This skill, references/cover-workflow.md |
| Lyrics generation API | This skill, references/lyrics-generation.md |
| Mashup workflow (A + B) | This skill, references/mashup-workflow.md |
| Emotion analysis (vocal speed, intensity, pitch) | This skill, references/emotion-analysis.md |
| MiniMax-specific error handling | This skill, references/error-handling.md |
| Audio analysis scripts | This skill, scripts/ |
| Free tool inputs (web, image, memory) | Both skills: base layer in music-craft, MiniMax layer here in references/free-tool-inputs.md |

Pre-Flight Check (extended)

The platform detection block is the same as music-craft (run it first). The required and optional lists are extended for MiniMax.

Platform Notes

  • macOS/Linux are the primary targets: use python3, command -v, and normal shell export/PATH checks.
  • Windows is partial support only: prefer PowerShell, use python or py -3, and verify env vars with Get-ChildItem Env:MINIMAX_API_KEY or Test-Path Env:MINIMAX_API_KEY.
  • On Windows, ffmpeg, yt-dlp, and mmx are PATH-sensitive; if Get-Command/where.exe cannot find them, restart the shell or add the install directory to PATH.
  • If Windows path/dependency issues keep blocking analysis, use WSL for the script-heavy parts instead of claiming full native support. For the full WSL2 setup (including the corporate items below), follow the base skill's references/windows-wsl-setup.md.
  • Corporate machines (TLS-inspecting proxy): pip/HuggingFace/model downloads inside WSL fail with CERTIFICATE_VERIFY_FAILED until the corporate root CA is installed in the distro (and REQUESTS_CA_BUNDLE/SSL_CERT_FILE point at the system bundle). A proxy env var (HTTP_PROXY) can also hijack 127.0.0.1 calls to a local API — unset it / set no_proxy=127.0.0.1,localhost. Both are covered in the base reference above.
  • If MiniMax itself is blocked (corporate firewall) the cloud features here — mmx music cover, the lyrics API, and any analysis script that calls MiniMax (e.g. emotion_to_prompt.py) — will fail. In that case use only the local-capable tools (yt-dlp, ffmpeg, librosa, Whisper) for analysis and the local ACE-Step backend in music-craft for generation.

Required (skill will not work without these)

| Check | What it is | How to verify | If missing |
| --- | --- | --- | --- |
| music_generate tool | The runtime's built-in music generation tool | Inspect the active runtime's tool list | Tell the user: "This skill needs a music_generate tool, but the active runtime does not expose one. Configure a music provider in OpenClaw and try again." Stop. |
| MINIMAX_API_KEY env var | API key for the MiniMax Music 2.6 plan | test -n "$MINIMAX_API_KEY" && echo "OK" | Tell the user: "This skill needs the MINIMAX_API_KEY environment variable. Get one from your MiniMax account and export it. If you do not have a MiniMax Token Plan, use music-craft instead; it works with any provider." Stop. |
| mmx CLI | The MiniMax CLI for fine-flag control | command -v mmx && mmx --version (macOS/Linux) or Get-Command mmx; mmx --version (PowerShell) | Ask the user: install via the MiniMax install guide, or skip mmx-specific features and use the music_generate tool with prompts. Do not block: mmx is optional if the runtime has MiniMax configured, but Windows support is only partial and depends on PATH visibility. |
| python3 | Required for the analysis scripts | command -v python3 (macOS/Linux) or python / py -3 (Windows PowerShell) | Tell the user: "The analysis pipeline (emotion analysis, mashup) needs Python 3.9+." Propose an install command for the active shell. Block emotion analysis if missing. |
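
The environment-variable and CLI checks in this table can also be run from Python when a shell is inconvenient, for example on Windows. This is a sketch only: the `preflight` function and its report layout are hypothetical, and it cannot inspect the runtime's tool list, so the music_generate check still has to happen in the runtime itself.

```python
import os
import shutil
from typing import Callable, Dict, Mapping, Optional

# CLI names taken from the required and optional tables in this section.
CLI_TOOLS = ("mmx", "ffmpeg", "yt-dlp")


def preflight(
    env: Optional[Mapping[str, str]] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Dict[str, bool]:
    """Report which prerequisites are visible to the current process."""
    env = os.environ if env is None else env
    report = {"MINIMAX_API_KEY": bool(env.get("MINIMAX_API_KEY"))}
    for tool in CLI_TOOLS:
        # shutil.which returns None when the executable is not on PATH.
        report[tool] = which(tool) is not None
    return report
```

The `env` and `which` parameters exist so the check can be exercised without touching the real environment; calling `preflight()` with no arguments inspects the live process.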

Optional (skill works without these, but quality improves with them)

| Tool | What it unlocks | Install per platform |
| --- | --- | --- |
| ffmpeg | Audio conversion (WAV for analysis, MP3 export, trimming) | apt install ffmpeg · brew install ffmpeg · winget install Gyan.FFmpeg (restart PowerShell after install so PATH updates apply) |
| yt-dlp | YouTube audio download for cover and mashup | pip install -U yt-dlp or py -3 -m pip install -U yt-dlp on Windows; ensure the CLI is on PATH |
| librosa | Audio analysis (BPM, key, energy, structure) | pip install librosa numpy scipy |
| parselmouth | Better pitch tracking (Praat under the hood) | pip install praat-parselmouth |
| scikit-learn | Audio clustering (segment detection) | pip install scikit-learn |

The full per-platform install table is in the Pre-Flight Check of the base skill (music-craft).

The "ask the user" pattern

Same as the base skill: for each missing optional tool, present three options — install (propose exact command, let user approve), skip (use the simple path), or cancel. Never auto-install.

If MINIMAX_API_KEY is missing, the redirect is to the base skill, not "install MiniMax" — the user may not have a Token Plan at all.

Local analysis memory (separate from generation)

Generation runs on MiniMax's cloud — your laptop just sends the prompt and downloads the MP3, so generation itself uses negligible local memory.

However, this skill's local analysis scripts run on your machine and can use real memory. Before running the full analysis pipeline, check available memory:

| Script | Models loaded | Approx peak RAM |
| --- | --- | --- |
| analyze_vocal_emotion.py | parselmouth (Praat) + scipy | ~500 MB |
| analyze_audio.py | librosa + transformers (MERT or MIT AST) | 2–4 GB |
| extract_lyrics_whisper.py | whisper model (tiny/base/medium) | 1–5 GB depending on model size |
| extract_stems.py | Demucs (htdemucs) | 2–4 GB |
| emotion_to_prompt.py | none locally (calls the MiniMax API) | <100 MB |
| compute_audio_embedding.py | MERT model | 1–2 GB |
| classify_instruments.py | MIT AST | 1–2 GB |

Combined (full analysis pipeline on a 4-min song): ~6–10 GB peak on top of OS and other apps. On unified-memory systems (Apple Silicon, integrated graphics), this competes directly with macOS/Windows and your other applications. On dedicated-GPU systems (NVIDIA, AMD), model memory is taken from system RAM unless you have CUDA acceleration.

Recommendations:

  • Close heavy apps (browser with many tabs, IDE, Docker) before running the full pipeline
  • For extract_lyrics_whisper.py, fall back to the tiny model when memory is tight; base/medium are 2-5x heavier. When memory allows, the extraction guidance above still prefers large-v2 for sung lyrics
  • For extract_stems.py, the --quality flag selects the Demucs model; the default htdemucs is the lighter choice, while htdemucs_ft is a fine-tuned bag of models that takes roughly four times longer to run
  • If you run out of memory, run analysis steps individually rather than via analysis_orchestrator.py (which loads everything)

The scripts/smoke_test.py script verifies the environment is set up; it does not test memory headroom. Run your own memory check before running a full analysis.
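
As a planning aid for running steps individually, the table above can be turned into a lookup that lists which scripts fit within a given amount of free memory. The figures are the approximate upper bounds from the table; the function name is hypothetical, and measuring the free memory itself is left to the operator.

```python
from typing import Dict, List, Tuple

# (low, high) approximate peak RAM in GB, copied from the table above.
PEAK_RAM_GB: Dict[str, Tuple[float, float]] = {
    "analyze_vocal_emotion.py": (0.5, 0.5),
    "analyze_audio.py": (2.0, 4.0),
    "extract_lyrics_whisper.py": (1.0, 5.0),
    "extract_stems.py": (2.0, 4.0),
    "emotion_to_prompt.py": (0.1, 0.1),
    "compute_audio_embedding.py": (1.0, 2.0),
    "classify_instruments.py": (1.0, 2.0),
}


def steps_that_fit(available_gb: float) -> List[str]:
    """Scripts whose upper-bound peak RAM fits in the available memory."""
    return [
        script
        for script, (_, high) in PEAK_RAM_GB.items()
        if high <= available_gb
    ]
```

Using the upper bound is deliberately conservative: a step listed here should fit even with the heaviest model choice for that script.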

Free Tool Augmentation (Input Enrichment)

The OpenClaw runtime exposes several free tools (web_fetch, web_search, image analysis, memory, browser) that enrich the music generation workflow. The base layer is documented in music-craft → Free Tool Augmentation and references/free-tool-inputs.md. This section shows how they compose with MiniMax-specific features.

Quick recap of free tools

| Tool | Purpose |
| --- | --- |
| web_fetch | Fetch URL content (lyrics pages, YouTube metadata, Wikipedia) |
| web_search | Find lyrics, artist info, genre descriptions |
| image / MiniMax__understand_image | Analyze album art, concert photos, music video screenshots |
| memory_search / memory_get | Recall user's prior music preferences |
| browser | JS-heavy site fallback (last resort) |

MiniMax compositions (high-value combos)

  • web_fetch + lyrics_generation: fetch the user's draft from a URL, run it through edit mode for cleanup, generate.
  • web_search + cover workflow: find covers in the target style, extract their characteristics, apply to the user's track.
  • image + mmx per-flag control: analyze album art, translate to --instruments, --bpm, --key, --structure for fine-grained style matching.
  • memory + emotion analysis: combine the user's prior preferences with deep audio analysis of a reference track.

For the full worked examples, parameter recommendations, and MiniMax-specific edge cases, see references/free-tool-inputs.md.

Operating Rules

Same 6-step loop as music-craft, with MiniMax-specific extensions:

  1. Read and auto-detect — same
  2. Ask only the ambiguous parts — same, plus ask if the user wants cover / mashup / standard
  3. Translate to a production-sheet prompt — same, but consider whether to use mmx flags (see references/mmx-flags-reference.md) instead of packing everything into the prompt
  4. Structure the lyrics — same, plus consider lyrics API for generation or edit (see references/lyrics-generation.md)
  5. Generate and verify — same, plus the music-cover model for melody preservation
  6. Iterate — same, plus emotion analysis to inform the next prompt adjustment

For the full 6-step detail, see music-craft → Operating Rules.

Song length (mmx has no native duration control)

Unlike music-craft's ACE-Step backend (which takes audio_duration as a parameter), MiniMax Music 2.6 has no explicit duration flag. Output length is determined by:

  • Lyrics length (primary): each [Verse]/[Chorus] section takes ~15-30 seconds depending on word count and singing pace. A typical 3:30 song has ~150-200 words of lyrics across 2 verses + 2 choruses + bridge.
  • Structure tags: [Intro], [Instrumental Break], [Outro] add silent/sparse sections that extend total length without lyrics.
  • Prompt hints (secondary): phrases like "3 minute song" or "4 minute track" nudge the model toward that length.
  • BPM and section count (minor effect): faster BPMs and more sections tend to produce slightly longer outputs.

Practical recipe for a full 3:30 song:

  1. Lyrics: ~150-200 words with [Verse 1], [Pre-Chorus], [Chorus], [Verse 2], [Bridge], [Outro] tags (full song structure, not just one chorus)
  2. Prompt: include structure hints like "full 3-minute song with intro, 2 verses, 2 choruses, bridge, and outro" or use --structure "intro-verse-pre_chorus-chorus-verse-chorus-bridge-chorus-outro"
  3. Check output length — if it's a 1-minute hook, the lyrics are probably too short
  4. If output is too short: regenerate with longer lyrics (the model can't add sections that aren't in the lyrics)
  5. If output is too long: trim lyrics to ~120 words or add [Instrumental Break] tags to control pacing

Don't expect mmx to hit 3:30 exactly. Output length varies by ±20-30s depending on the model. If you need precise length, ACE-Step is the right tool (it has audio_duration). If you want MiniMax's vocal quality and the song length is flexible, mmx is fine.
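A rough pre-flight check can catch lyrics that are too short before spending a generation. The sketch below is illustrative only: estimate_length is a hypothetical helper, not part of the skill, and it assumes every bracketed tag on its own line counts as one section lasting the ~15-30 seconds quoted above. Instrumental tags such as [Intro] are treated the same way, which is a simplification.

```python
import re

# Assumed per-section range taken from the text above: ~15-30 s per section.
SECONDS_PER_SECTION = (15, 30)
# A section tag is a line that contains only "[Something]".
SECTION_TAG = re.compile(r"^\[([^\]]+)\]\s*$")


def estimate_length(lyrics: str) -> dict:
    """Count section tags and lyric words, then return a rough duration range."""
    sections = 0
    words = 0
    for line in lyrics.splitlines():
        line = line.strip()
        if not line:
            continue
        if SECTION_TAG.match(line):
            sections += 1
        else:
            words += len(line.split())
    low, high = SECONDS_PER_SECTION
    return {
        "sections": sections,
        "words": words,
        "min_seconds": sections * low,
        "max_seconds": sections * high,
    }


if __name__ == "__main__":
    demo = "[Verse 1]\nlook at the stars\n[Chorus]\nthey were all yellow\n"
    print(estimate_length(demo))
```

If the estimated maximum is well under the target length, adding sections to the lyrics is likely more effective than adding length hints to the prompt.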

mmx CLI Quick Reference

The mmx CLI exposes MiniMax Music 2.6 parameters as separate flags. This gives finer control than packing everything into a single prompt string.

The most useful flags:

| Flag | Effect | Example |
| --- | --- | --- |
| --avoid | Elements to avoid (comma-separated) | --avoid "sparse, a cappella, electronic sounds" |
| --bpm | Exact BPM | --bpm 80 |
| --key | Musical key | --key "E minor" |
| --structure | Song structure | --structure "intro-verse-pre_chorus-chorus-verse-chorus-bridge-chorus-outro" |
| --vocals | Vocal style | --vocals "passionate French male vocal" |
| --instruments | Featured instruments | --instruments "accordion, upright bass, strings, piano" |
| --genre | Genre | --genre "french chanson" |
| --mood | Mood | --mood "melancholic romantic dramatic" |
| --lyrics-optimizer | Auto-generate lyrics from prompt | (flag only) |
| --model | Model name | --model music-2.6 (paid, highest RPM) or --model music-2.6-free (default, free tier) |
| --cover-feature-id | Use a preprocessed cover (two-step workflow) | (from preprocess call) |

Full reference with all flags and examples: references/mmx-flags-reference.md.

When to use mmx vs music_generate:

  • mmx: when you need fine control over specific parameters (BPM, key, structure as separate flags)
  • music_generate: when the prompt-only path is enough, and you want to keep the workflow provider-agnostic

Both produce equivalent results if the prompt and flags are aligned.
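When the parameters come from an analysis step rather than from a human, it can help to assemble the mmx invocation programmatically. The following is a minimal sketch, assuming only the flags listed in the quick-reference table; build_mmx_command is a hypothetical helper and does not validate values against what mmx actually accepts.

```python
import shlex

# Value-taking flags from the quick-reference table above, in a fixed order.
VALUE_FLAGS = ["model", "vocals", "genre", "mood", "instruments",
               "bpm", "key", "structure", "avoid"]


def build_mmx_command(prompt: str, out: str, **params) -> list:
    """Return an argv list for `mmx music generate` from keyword parameters."""
    argv = ["mmx", "music", "generate", "--prompt", prompt]
    for name in VALUE_FLAGS:
        value = params.get(name)
        if value is not None:
            # Values are passed through unchanged; mmx does the validation.
            argv += ["--" + name, str(value)]
    argv += ["--out", out]
    return argv


if __name__ == "__main__":
    argv = build_mmx_command("french chanson", "song.mp3", bpm=80, key="E minor")
    # shlex.join quotes values with spaces so the line can be pasted into a shell.
    print(shlex.join(argv))
```

Returning an argv list rather than a single string keeps values with spaces or commas intact when the command is later passed to subprocess.run.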

mmx Music Generation — verified patterns (June 9, 2026)

End-to-end verified invocations (M5_idkw_dreampop_shoegaze and M5_idkw_opera_metal, in ~/Music mix/hello_cleveland/i_dont_know_why/):

Pattern A — full song with detailed prompt + 6 metas (production-grade output)

mmx music generate \
  --prompt "dream pop reimagining, shoegaze-influenced indie rock turned ethereal and cinematic.
My Bloody Valentine meets Slowdive meets Radiohead.
Male lead vocal, breathy and vulnerable, double-tracked with slight detuning and tape warmth.
Wall of clean electric guitars with heavy chorus pedal and tremolo picking.
Shimmering washes of reverb, sub-bass synth pad foundation, soft brushed electronic drums.
Glockenspiel and celesta melody line high above the mix.
Organ pads swelling at choruses, reversed guitar samples between sections.
Heavy reverb and analog warmth throughout, lo-fi texture.
Emotional arc: hazy drifting opening building wave confusion overwhelming beautiful climax fading dreamlike denouement outro.
Avoid: sharp percussive aggressive distortion clear upfront vocals minimal sparse.
Tempo 96 BPM in D major, dreamlike half-time feel.
Suitable as a slow-burn alt-pop anthem, melodic and textural, intimate verses and soaring choruses.
Modern production, polished mix, atmospheric vocal production where vocals sit among the instruments rather than above them." \
  --lyrics-file gen1_lyrics.txt \
  --model music-2.6 \
  --vocals "breathy vulnerable male lead, double-tracked with slight detuning" \
  --genre "dream pop, shoegaze-influenced indie" \
  --mood "hazy confusion building to overwhelming beautiful release, then dreamlike fade" \
  --instruments "wall of clean electric guitars with heavy chorus pedal, sub-bass synth pad, soft brushed electronic drums, glockenspiel, celesta, organ pads, reversed guitar samples" \
  --bpm 96 \
  --key "D major" \
  --structure "intro-verse-pre_chorus-chorus-post_chorus-verse2-chorus-repeat-outro" \
  --use-case "slow-burn alt-pop anthem, suitable for late-night listening" \
  --avoid "sharp percussive aggressive distortion, clear upfront vocals, minimal sparse arrangement" \
  --references "My Bloody Valentine, Slowdive, Radiohead" \
  --out M5_idkw_dreampop_shoegaze.mp3

Output: 167.9s MP3, 5.4 MB, -8.8 LUFS, 5.7 LRA (good dynamics).

Pattern B — crazy combo: opera vocals + heavy metal music (for fun experiments)

mmx music generate \
  --prompt "extreme dramatic contrast: powerful operatic tenor vocals over heavy metal instrumentation.
Like Freddie Mercury fronting Metallica. Epic, theatrical, over the top.
Thunderous double bass drums, distorted electric guitars with palm-muted chugging,
guttural rhythm section, blast beats, tremolo picking, minor key riffing.
Operatic vocals soaring above the metal wall of sound, belting high notes with vibrato.
Gothic theatrical atmosphere, dramatic dynamic shifts from whisper-quiet verses
to explosive metal choruses. Anthem-like, stadium-ready." \
  --lyrics-file gen2_lyrics.txt \
  --model music-2.6 \
  --vocals "operatic tenor, powerful Freddie Mercury style, vibrato, theatrical belting" \
  --genre "symphonic metal" \
  --mood "dramatic, theatrical, anthemic, intense" \
  --instruments "distorted electric guitars, double bass drums, blast beats, orchestral strings" \
  --tempo "fast" \
  --bpm 160 \
  --key "D minor" \
  --structure "verse-pre_chorus-chorus-verse-pre_chorus-chorus-outro" \
  --use-case "epic music experiment" \
  --avoid "pop, soft, gentle, acoustic, slow" \
  --out M5_idkw_opera_metal.mp3

Output: 155.8s MP3, 5.0 MB, -9.6 LUFS, 4.3 LRA (compressed but still has dynamics).

Model selection

| Model | When to use | Cost |
| --- | --- | --- |
| music-2.6 | Production work, full quality | Token Plan / paid |
| music-2.6-free (default) | Free tier, lower RPM, "unlimited" quota for some plans | Free |
| music-2.5+ | Older model, still good quality | Token Plan / paid |
| music-2.5 | Legacy | Token Plan / paid |
| music-cover | Cover/re-interpretation of source audio (one-step) | Token Plan / paid |
| music-cover-free | Free cover variant | Free |

music-2.6-free is the default for most users — same model, free tier. The mmx CLI uses it as the default when no --model is specified.

is_instrumental and lyrics_optimizer flags (MiniMax-specific paths)

The mmx CLI exposes two important flags that bypass the --lyrics requirement:

| Flag | What it does | When to use |
| --- | --- | --- |
| --instrumental | Generate music without vocals (no lyrics needed) | When user wants BGM, intro, soundtrack, loop |
| --lyrics-optimizer | Auto-generate lyrics from the prompt (no --lyrics needed) | When user says "make me a song about X" but doesn't have lyrics |

Examples:

# Pure instrumental (no vocals)
mmx music generate \
  --prompt "Instrumental only, no vocals, no lyrics. Loopable coffee shop background, soft piano, brushed drums, 90 BPM, C major" \
  --instrumental \
  --length 180000 \
  --out coffee_bgm.mp3

# Auto-generated lyrics from prompt
mmx music generate \
  --prompt "Upbeat indie folk, melancholic but hopeful, male vocal, acoustic guitar, 100 BPM" \
  --lyrics-optimizer \
  --out indie_folk.mp3

Note: the --length flag of mmx music generate takes milliseconds (--length 180000 in the example above is 3 minutes). The flag is mmx-specific; the underlying MiniMax API has no official duration parameter.

URL expiration warning

mmx music generate returns a saved: path. If you ever use --output-format url (the official API default), the URL expires after 24 hours. Download immediately. The mmx CLI auto-downloads to --out so this is not a problem when using --out directly.

Cover Workflow

Two cover backends exist — pick by what's available:

  • MiniMax cloud cover (this skill): mmx music cover, melody-preserving via MiniMax's music-cover model. Needs MINIMAX_API_KEY and network access to MiniMax.
  • Local ACE-Step cover (in music-craft): task_type=cover with the source audio uploaded (multipart) and audio_cover_strength controlling how far to restyle. Fully local, no cloud, follows the source melody/structure. Caveat: a full-length cover is slow and VRAM-heavy on a ~12 GB GPU and can hit the server's 600 s generation timeout — cover a shorter segment or raise ACESTEP_GENERATION_TIMEOUT. See music-craft's "ACE-Step Audio-Conditioned Generation" section.

So if MiniMax is unavailable (no key, or blocked on your network), you can still do a melody-aware cover locally with ACE-Step — it is not cloud-only. Only pure text-prompt generation (no source audio) is a "reimagining" rather than a cover.

Cover workflow preserves the original song's melody while applying a different style. Two paths:

One-step (quick):

mmx music cover \
  --prompt "French chanson, accordion, strings, passionate French vocal, 80 BPM" \
  --audio-file /tmp/original.ogg \
  --out /tmp/cover.mp3

MiniMax extracts lyrics via ASR and applies the new style.

Two-step (more control):

  1. Preprocess the audio to extract features and structure
  2. Edit the lyrics (correct ASR errors, add section tags)
  3. Generate with the edited lyrics

The two-step path gives better results when the original lyrics need correction or when the user wants different lyrics in the new style.

Full detail with payload examples, error handling, and use cases: references/cover-workflow.md.

Lyrics Generation

MiniMax has a dedicated lyrics_generation endpoint that produces structured lyrics (with [Verse], [Chorus], etc. tags) from a theme prompt. Two modes:

  • write_full_song — create new lyrics from a theme
  • edit — modify existing lyrics (e.g., make the chorus stronger, shift to a hopeful ending)

The output is structured lyrics that can be passed directly to music_generate or mmx music generate.

Full detail with API examples, parameters, and use cases: references/lyrics-generation.md.

Web Lyrics Lookup (LRCLib)

As an optional complement to Whisper transcription, the orchestrator can look up song lyrics from LRCLib (open, no auth, JSON API at https://lrclib.net/api) when the song is a known mainstream track. This is a graceful fallback — Whisper is the primary source, LRCLib is a quality boost for the right song.

Coverage reality check: LRCLib has good coverage for mainstream vocal music (pop, rock, hip-hop, R&B, country) and is poor or empty for:

  • Instrumental tracks (Joe Satriani, King Crimson, much jazz, classical)
  • Obscure bands / friend bands
  • Live / bootleg / unofficial releases
  • Non-English lyrics for English titles (and vice versa)

When LRCLib is empty (the expected case for instrumentals), the script returns no_web_lyrics and the caller silently uses Whisper. This is the designed path, not a failure.

CLI usage:

# Standalone lookup
python3 scripts/fetch_lyrics_web.py \
  --artist "Coldplay" --title "Yellow" \
  --whisper-transcript "look at the stars..." \
  --min-match 0.6 --json

Orchestrator integration via the --lyrics-source flag:

| Value | Behavior |
| --- | --- |
| whisper (default) | Always use Whisper, never touch the web |
| web | Always try LRCLib, never run Whisper |
| auto | Whisper first; if the song is recognized AND LRCLib returns a confident match (>60% word overlap), use LRCLib; otherwise fall back to Whisper |
| off | Skip lyrics extraction entirely |
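The auto behavior hinges on a word-overlap score between the Whisper transcript and the web lyrics. The sketch below shows one plausible way to compute such a score; it is an assumption for illustration, and the actual heuristic lives in the scripts/fetch_lyrics_web.py docstring and may differ. The names tokenize, word_overlap, and pick_source are hypothetical.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase the text and return its set of distinct word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def word_overlap(whisper_text: str, web_text: str) -> float:
    """Fraction of Whisper's distinct words that also appear in the web lyrics."""
    whisper_words = tokenize(whisper_text)
    if not whisper_words:
        return 0.0
    return len(whisper_words & tokenize(web_text)) / len(whisper_words)


def pick_source(whisper_text: str, web_text, min_match: float = 0.6) -> str:
    """Return "web" only for a confident match, otherwise "whisper"."""
    if web_text and word_overlap(whisper_text, web_text) > min_match:
        return "web"
    return "whisper"


if __name__ == "__main__":
    print(pick_source("look at the stars",
                      "Look at the stars, look how they shine for you"))
```

Scoring against Whisper's words rather than the web lyrics' words means a short, partial transcript can still match a full lyric sheet.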

The orchestrator auto-detects artist and title from the audio path stem (e.g. Coldplay - Yellow.wav → artist="Coldplay", title="Yellow"). Pass --name-a "Artist - Title" to override.
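As an illustration of the path-stem convention, the following sketch splits a file name on the first " - " separator. It is a guess at the behavior, not the orchestrator's code; artist_title_from_path is a hypothetical name, and the real script may handle more separators or edge cases.

```python
from pathlib import Path


def artist_title_from_path(path: str):
    """Split 'Artist - Title.ext' into (artist, title).

    Returns (None, stem) when the stem has no ' - ' separator.
    """
    stem = Path(path).stem
    artist, sep, title = stem.partition(" - ")
    if not sep:
        return None, stem.strip()
    return artist.strip(), title.strip()


if __name__ == "__main__":
    print(artist_title_from_path("Coldplay - Yellow.wav"))
```

When the stem has no separator, passing --name-a "Artist - Title" as described above is the safer route.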

The result includes a web_lookup sub-dict with status, match_score, and the plain lyrics (when matched), so you can inspect what was used and why.

Full detail with scoring heuristic and exit codes: see scripts/fetch_lyrics_web.py docstring.

Mashup Workflow

The signature MiniMax-specific feature: combine Song A (content + emotion) with Song B (style).

Workflow:

  1. Get Song A (audio file, YouTube URL, or song name)
  2. Get Song B (audio file, YouTube URL, or song name)
  3. Run emotion analysis on Song A (if audio available) to extract the emotional arc
  4. Build a prompt that applies Song B's style to Song A's content and emotion
  5. Generate using the cover workflow (preserves melody) or standard generation (creative reimagining)

This is the most powerful feature in this skill. The output preserves what makes Song A recognizable (lyrics, melody, emotion) while applying Song B's production style.

Full detail with the emotion-to-prompt conversion and the two-song analysis script: references/mashup-workflow.md and references/emotion-analysis.md.

Emotion Analysis

Emotion analysis extracts per-section features from input audio:

  • Intensity (loudness) — drives dynamic range
  • Pitch (Hz range, trend) — drives vocal intensity
  • Vocal effort (low / medium / high) — drives delivery style
  • Breathiness — drives intimacy vs full voice
  • Spectral centroid (brightness) — drives timbre matching
  • Emotion classification (list of 30+ emotions) — drives mood keywords for the prompt
  • Repetitive intensification — drives chorus build
  • Emotional shifts (sudden vs gradual) — drives transitions
  • Vocal speed (syllables per second) — drives elongation cues
  • Pitch bends at phrase endings — drives emotional emphasis

The analysis outputs JSON that the emotion_to_prompt.py script converts into a ready-to-use production-sheet prompt.

Local-only path (when MiniMax is unavailable): emotion_to_prompt.py calls the MiniMax cloud, so it fails when MiniMax is blocked or no key is set. In that case build the prompt locally from the analysis JSON without that script: take the extracted BPM and key/scale as explicit metadata fields; turn the energy curve and spectral brightness into texture words; turn the emotion classification and intensity curve into mood words and dynamic section tags; and feed transcribed lyrics (full-mix Whisper) as the lyric body. This is the same data, assembled by the agent instead of the cloud helper, and it feeds any backend (including a local model).
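A minimal sketch of the local-only assembly follows. The field names bpm, key, brightness, and emotions are hypothetical placeholders, since the real analysis JSON schema is documented in references/emotion-analysis.md, and the brightness thresholds are arbitrary illustrative values.

```python
def build_local_prompt(analysis: dict, lyrics: str) -> dict:
    """Assemble metadata, texture words, and mood words from an analysis dict."""
    # Hypothetical field names; adapt them to the actual analysis JSON.
    brightness = analysis.get("brightness", 0.5)
    if brightness >= 0.6:
        texture = "bright, airy"
    elif brightness <= 0.4:
        texture = "warm, dark"
    else:
        texture = "balanced"
    moods = ", ".join(analysis.get("emotions", [])) or "unspecified mood"
    return {
        "prompt": f"{moods}; {texture} texture",
        "bpm": analysis.get("bpm"),   # explicit metadata field
        "key": analysis.get("key"),   # explicit metadata field
        "lyrics": lyrics,             # transcribed lyric body
    }


if __name__ == "__main__":
    demo = {"bpm": 96, "key": "D major", "brightness": 0.8,
            "emotions": ["yearning", "hopeful"]}
    print(build_local_prompt(demo, "[Verse]\nhello"))
```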

Scripts: scripts/analyze_vocal_emotion.py, scripts/analyze_audio.py, scripts/emotion_to_prompt.py.

Full detail: references/emotion-analysis.md.

For the generation side — how to use the analysis to evoke emotion in the OUTPUT, the 21 emotion recipes (joy, desperation, melancholy, triumph, yearning, anger, vulnerability, confidence, nostalgia, anxious, hopeful, tragic, heroic, tender, sensual, lonely, playful, haunting, serene, celebratory, bittersweet), the iteration loop, and common mistakes — see references/emotion-delivery.md.

Analysis Quality (Summary Format, Confidence, Fallbacks)

Analysis scripts in scripts/ produce different views (emotion, beats, melody, structure, instrumentation). The skill expects them to converge on a single compact summary so downstream code and humans can read the same shape regardless of which scripts ran.

Compact Analysis Summary

Every analysis result should include a summary object with these keys:

| Key | Type | Meaning |
| --- | --- | --- |
| tempo | string | BPM value with confidence, e.g. 120 BPM (confidence 0.92) |
| key | string | Detected key, e.g. E minor (confidence 0.71) |
| sections | list | Section labels with timing, e.g. [{"label": "verse", "start": 0.0, "end": 28.5}, ...] |
| instrumentation | list | Detected instrument palette, e.g. ["electric guitar", "drums", "bass"] |
| vocal_traits | dict | Breathiness, intensity, pitch range, e.g. {"breathiness": "high", "intensity": "medium"} |
| energy_curve | list | Per-section energy values, e.g. [{"t": 0, "energy": 0.6}, ...] |
| hook_points | list | Timestamps of detected hooks, e.g. [12.4, 48.0] |
| mix_notes | list | Short strings, e.g. ["vocal upfront", "wide stereo drums", "rolled-off highs"] |

Scripts may add their own fields, but every script must return at least the keys above (use empty list / unknown string when a key has no data).
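One way to enforce that contract is a small normalizer that fills any absent key with its empty default. This is a sketch under stated assumptions: normalize_summary is a hypothetical helper, and using an empty dict for vocal_traits is a guess, since the text only names empty lists and the unknown string.

```python
# Defaults for the required keys in the table above.
REQUIRED_DEFAULTS = {
    "tempo": "unknown",
    "key": "unknown",
    "sections": [],
    "instrumentation": [],
    "vocal_traits": {},   # assumption: empty dict for the dict-typed key
    "energy_curve": [],
    "hook_points": [],
    "mix_notes": [],
}


def normalize_summary(raw: dict) -> dict:
    """Return a copy of raw with every required key present; extra fields are kept."""
    summary = dict(raw)
    for key, default in REQUIRED_DEFAULTS.items():
        if summary.get(key) is None:
            # Copy mutable defaults so callers never share one list or dict.
            summary[key] = default.copy() if isinstance(default, (list, dict)) else default
    return summary


if __name__ == "__main__":
    print(normalize_summary({"tempo": "120 BPM (confidence 0.92)"}))
```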

Confidence Levels

Every numeric or categorical detection in the analysis must carry a confidence value so weak detections do not get treated as facts.

| Confidence | Numeric range | Interpretation |
| --- | --- | --- |
| clear | n/a | The detection is unambiguous (e.g. user-supplied text, MIDI-confirmed key). |
| high | >= 0.75 | Strong evidence from multiple sources or models. |
| medium | 0.5 - 0.74 | Reasonable evidence but alternative interpretations exist. |
| low | < 0.5 | Weak signal; treat as a hint, not a fact. |
| inferred | n/a | Not measured directly; derived from context (e.g. lyrics from a YouTube URL). |
| missing | n/a | Not available; the analysis did not run or did not find evidence. |

When feeding analysis into a prompt, prefix any low or medium detection with a hedge like "around" or "approximately", and never include missing values as if they were facts.
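The hedging rule can be sketched as two small functions. Both names are hypothetical, and the sketch treats any score below 0.75 and at or above 0.5 as medium, which closes the small gap between 0.74 and 0.75 in the table.

```python
def confidence_label(score: float) -> str:
    """Map a numeric confidence to the high / medium / low bands."""
    if score >= 0.75:
        return "high"
    if score >= 0.5:
        return "medium"
    return "low"


def hedge(value, score) -> str:
    """Prefix weak detections with a hedge; return '' for missing values."""
    if value is None or score is None:
        return ""  # missing: leave it out of the prompt entirely
    if confidence_label(score) == "high":
        return value
    return "around " + value


if __name__ == "__main__":
    print(hedge("120 BPM", 0.92))
    print(hedge("E minor", 0.71))
```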

Fallback Behavior for Missing Optional Dependencies

The advanced analysis scripts depend on optional packages (librosa, parselmouth, transformers, demucs, beat_this, basic_pitch, etc.). Each script must:

  1. Try to import the optional dependency at the top of the function.
  2. On ImportError, return a JSON object that includes {"error": "install with pip install X", "summary": {}} instead of raising.
  3. Never let a missing optional dependency crash the whole workflow.
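The three rules above can be sketched with a lazy import wrapper. run_with_optional is a hypothetical helper, not taken from the skill's scripts; note also that the pip package name sometimes differs from the import name, so the install hint here is only as accurate as the name passed in.

```python
import importlib


def run_with_optional(package: str, analyze) -> dict:
    """Import package lazily and call analyze(module).

    On ImportError, return an error object with an empty summary
    instead of raising, so the caller's workflow keeps going.
    """
    try:
        module = importlib.import_module(package)
    except ImportError:
        return {"error": f"install with pip install {package}", "summary": {}}
    return {"summary": analyze(module)}


if __name__ == "__main__":
    # A package name that is assumed not to exist, to show the fallback shape.
    print(run_with_optional("definitely_not_installed_pkg_xyz", lambda m: {}))
```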

The orchestrator at scripts/analysis_orchestrator.py collects per-script results and continues even if some scripts failed. The combined summary simply omits keys whose underlying analysis could not run. The linter, prompt builder, and generation step all read the summary and skip missing keys without erroring.

This means a user without demucs installed can still get tempo, key, and structure analysis from the base pipeline. The only loss is the per-stem vocal analysis, which is opt-in via --use-demucs.

Rate Limits (MiniMax-specific)

The MiniMax Music 2.6 documented limits are:

  • RPM: 120 requests per minute
  • Concurrent connections: 20
  • Output URL expiry: 24 hours (download the audio promptly)
  • Cover feature ID validity: 24 hours (use the preprocess output within a day)

Under the Token Plan 3.0 (June 2026+), the actual quota is credit-based rather than RPM-based:

  • A unified general credit pool covers M3, M2.7, and M2.7-highspeed
  • A 5-hour rolling window resets continuously
  • A weekly window runs Monday 02:00 CEST → next Monday 02:00 CEST
  • Weekly status may be inactive on Plus plan (no weekly cap enforced, but the schema is there)

Practical implication: the documented 120 RPM is the API limit, but the Token Plan 3.0 quota is what determines your real ceiling. If you generate 4500 requests in 5 hours on Plus, you will be rate-limited regardless of RPM.

Before submitting a batch, check the active plan:

# Check current Token Plan usage
curl -s -H "Authorization: Bearer $MINIMAX_API_KEY" \
  https://www.minimax.io/v1/token_plan/remains | jq .

If a call fails with 429 (rate limit):

  1. Wait at least 60 seconds.
  2. Check the Token Plan usage endpoint.
  3. If 5h window is exhausted, wait for the reset.
  4. Reduce concurrency if running a batch.
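The waiting step can be wrapped in a small retry helper. This is a sketch, not the skill's implementation: call_with_backoff and the (status, body) return shape of the request callable are assumptions, and it does not check the Token Plan endpoint between attempts.

```python
import time


def call_with_backoff(request, max_attempts=3, wait_seconds=60, sleep=time.sleep):
    """Call request() until it returns a non-429 status or attempts run out.

    request is any callable returning a (status_code, body) tuple.
    """
    status, body = request()
    attempts = 1
    while status == 429 and attempts < max_attempts:
        sleep(wait_seconds)  # the steps above recommend at least 60 seconds
        status, body = request()
        attempts += 1
    return status, body


if __name__ == "__main__":
    responses = iter([(429, None), (200, "ok")])
    # Inject a no-op sleep so the demo finishes immediately.
    print(call_with_backoff(lambda: next(responses), sleep=lambda s: None))
```

Passing sleep as a parameter keeps the helper testable without real waiting.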

Anti-Sparse (MiniMax-Specific Deep Dive)

The base skill's anti-sparse rules apply. The MiniMax-specific failure mode is more severe than with other providers:

MiniMax interprets "sparse" or "minimal" as "remove all instruments", even more aggressively than other providers. The model has been observed to:

  • Remove all instruments in quiet sections when the prompt uses the word "quiet"
  • Drop percussion entirely when the prompt uses "intimate"
  • Go a cappella on build-up sections when the prompt uses "build"

Mitigation:

  • Never use the words "sparse", "minimal", "stripped back", "quiet" in a MiniMax prompt without pairing them with explicit instruments.
  • Always add: "ALL instruments ALWAYS playing throughout, NEVER go a cappella or silent at any point".
  • Always list every instrument you want to hear.
  • For quiet sections, use the explicit form: "quiet sections: reduced to accordion and bass only, still fully played, NOT silent".
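The mitigation rules lend themselves to a simple pre-submission lint. The sketch below is illustrative: lint_prompt is a hypothetical helper, it only flags the risky words and a missing guard sentence, and it cannot tell whether a risky word is already paired with explicit instruments.

```python
import re

# Words the mitigation list above says to avoid using on their own.
RISKY_WORDS = ["sparse", "minimal", "stripped back", "quiet"]
GUARD = ("ALL instruments ALWAYS playing throughout, "
         "NEVER go a cappella or silent at any point")


def lint_prompt(prompt: str) -> list:
    """Return warnings for risky words and a missing anti-sparse guard sentence."""
    warnings = []
    lowered = prompt.lower()
    for word in RISKY_WORDS:
        if re.search(r"\b" + re.escape(word) + r"\b", lowered):
            warnings.append(f'risky word "{word}": pair it with explicit instruments')
    if GUARD.lower() not in lowered:
        warnings.append("missing anti-sparse guard sentence")
    return warnings


if __name__ == "__main__":
    for warning in lint_prompt("quiet piano ballad"):
        print(warning)
```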

If a generation comes back sparse despite these rules, retry once with an even more explicit instrument list. If it fails again, the prompt has a structural issue — try a different style.

For the canonical anti-sparse text and worked examples, see the base skill's Anti-Sparse Rules section.

Quality Verification Checklist

Same 8-point checklist as the base skill, plus 4 MiniMax-specific items:

  1. Cover preserves melody recognisably. If the user asked for Song X in the style of Song Y, the new version should be recognisable as Song X's melody with Song Y's style.
  2. Emotion curve matches Song A (for mashups). The dynamic arc of the output should follow the original's intensity, not flatten to a single energy.
  3. --avoid flags are respected. If the user said "no electronic sounds", the output should not have synths.
  4. Per-flag control worked (BPM, key, structure). If the user asked for 80 BPM in E minor, the output should be in that range, not "close enough".

Output Verification (Covers, Mashups, Style Transfer)

After generation, run a post-generation check that is specific to the route. Use the analysis orchestrator's output on the generated file when possible.

Verification Checklist per Route

Cover (minimax_cover)

  • Melody recognisable as the source (basic-pitch MIDI compare or ear-test)
  • Target style is clearly audible (genre/mood keywords present)
  • Source BPM is within ±10 BPM
  • Source key is preserved (or user agreed to shift)
  • Lyrics decision respected (original / translated / new / instrumental)
  • --avoid flags respected

Mashup (minimax_mashup)

  • Song A's lyrics and emotional arc recognisable
  • Song B's style is dominant in the production
  • Vocal intensity matches Song A's emotion curve
  • Section structure feels coherent (not random)
  • --avoid flags respected for Song B's style

Style Transfer (minimax_style_transfer)

  • Source style (the reference track) is reproduced in timbre, instrumentation, and feel
  • Output melody is NOT recognisable as the source (it is a new composition in the source style)
  • Target genre/mood keywords audible
  • BPM and key reasonable for the new style (not forced from source)

Emotion Prompt / Precision (minimax_emotion_prompt)

  • Output matches the per-flag values (BPM, key, structure, avoid)
  • Prompt language and flags do not contradict each other (linter clean)
  • Lyrics reflect the requested theme and language
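The measurable items of the cover checklist (BPM within ±10 and key preserved unless the user agreed to a shift) can be checked mechanically once both files have been analyzed. A minimal sketch, assuming each analysis is reduced to a dict with numeric bpm and string key fields; check_cover and those field names are hypothetical.

```python
def check_cover(source: dict, output: dict, key_shift_agreed: bool = False) -> dict:
    """Return pass/fail flags for the BPM and key items of the cover checklist."""
    bpm_ok = abs(output["bpm"] - source["bpm"]) <= 10
    # Compare keys case-insensitively; a user-approved shift always passes.
    key_ok = key_shift_agreed or output["key"].lower() == source["key"].lower()
    return {"bpm_within_10": bpm_ok, "key_preserved": key_ok}


if __name__ == "__main__":
    print(check_cover({"bpm": 80, "key": "E minor"},
                      {"bpm": 88, "key": "E minor"}))
```

The remaining items (melody recognisability, audible target style) still need an ear-test or a MIDI comparison.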

Failure Signatures and Fixes

When the generated track does not match the request, identify the failure signature and apply the matching fix. The most common signatures:

| Failure signature | Likely cause | Fix |
| --- | --- | --- |
| Copied too closely (cover sounds like a remaster, not a new style) | Prompt did not specify the new style firmly enough, or --avoid list left the original instrumentation unguarded. | Add explicit target style language, list new instruments, expand --avoid with the source's dominant sounds. Re-run. |
| Lost source melody (cover no longer recognisable) | --prompt overrode the cover model, or the source audio was too noisy / clipped. | Switch to the two-step cover workflow (preprocess + generate with cover-feature-id); reduce style strength in the prompt. |
| Wrong tempo (BPM noticeably off) | Prompt and --bpm disagreed, or vocal delivery speed misled the detector. | Lint prompt + flags first. Re-run with the linter-clean pair. If still off, set --bpm explicitly and drop the BPM number from the prompt. |
| Wrong key (key shifted up/down) | Prompt mentioned a key but flags used another. | Lint the pair. Use the same key in both. If MIDI confirms a different source key, trust MIDI over prompt. |
| Muddy mix (low clarity, washed out) | Overly dense instrumentation, lack of anti-sparse guard, or too many --avoid exclusions. | Reduce instrument count, raise --bpm for tightness, add explicit "all instruments clearly audible". |
| Vocals too neutral (no emotion) | Emotion analysis not run, or intensity curve not transferred. | Run analyze_vocal_emotion.py on the source and feed intensity_curve into the prompt. Add explicit "vocal intensity: ..." clause. |
| Weak chorus (chorus does not lift) | Structure line lacks a build cue, or the prompt was a single energy. | Add structure with explicit build cues: "verse: intimate, chorus: soaring, all instruments louder in chorus". |
| Style mismatch (output does not match the requested genre) | Prompt used vague genre words or the wrong dominant instrument. | Replace vague words with concrete genre + instrument list. Use the canonical mmx prompt schema in references/examples.md. |

Revision Prompt Templates

When a generation comes back with one of the failure signatures above, build a revision prompt that preserves the source identity while changing the failing dimension.

Template: stronger style change (cover too close)

Same melody and lyrics as before. Re-imagine the production as [TARGET_STYLE] with [INSTRUMENT_LIST].
ALL instruments always playing throughout, never go a cappella.
Avoid: [STYLE_CONTRADICTING_WORDS from previous run].

Template: keep the melody (cover lost it)

Re-apply the original melody from the source audio. Keep the recognizable hook at [HOOK_TIME].
Use a softer production in [TARGET_STYLE] but DO NOT change the melodic contour.
Avoid: [WORDS_THAT_PUSHED_TOO_FAR].

Template: fix tempo drift

Keep the source BPM (use --bpm [SOURCE_BPM]). Do not slow down or speed up the vocal delivery.
Avoid: rubato, half-time, double-time, slowing down, speeding up.

Template: fix key shift

Stay in [SOURCE_KEY]. Do not transpose. Use the same chord progression as the source.
Avoid: key change, modulation, transpose.

Template: fix muddy mix

Make every instrument clearly audible. Reduce instrument count to [N].
Add contrast: quieter verses, louder choruses. Keep vocals upfront in the mix.
Avoid: dense layering, atmospheric washes, sustained pads throughout.

Template: lift the chorus

Chorus: soaring, all instruments louder than the verse, fuller chords, more reverb on the lead vocal.
Verse: intimate, single voice, soft drums, breathy delivery.
Bridge: build tension, add a melodic lift before the final chorus.

These templates pair with the failure-signature table. After the revision, re-run the verification checklist above.
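Filling a template's bracketed slots can be automated. The sketch below uses Python's string.Template with the tempo-drift template as an example; the $source_bpm placeholder syntax and the revision_prompt helper are illustrative choices, not part of the skill.

```python
from string import Template

# The "fix tempo drift" template, with [SOURCE_BPM] rewritten as $source_bpm.
TEMPO_FIX = Template(
    "Keep the source BPM (use --bpm $source_bpm). "
    "Do not slow down or speed up the vocal delivery.\n"
    "Avoid: rubato, half-time, double-time, slowing down, speeding up."
)


def revision_prompt(template: Template, **values) -> str:
    """Fill every placeholder; substitute() raises KeyError if one is missing."""
    return template.substitute(**values)


if __name__ == "__main__":
    print(revision_prompt(TEMPO_FIX, source_bpm=96))
```

Using substitute rather than safe_substitute means an unfilled slot fails loudly instead of reaching the model as literal placeholder text.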

Lyrics Optimizer Behavior

Same as the base skill — when music_generate is called without explicit lyrics, MiniMax auto-generates. With this skill, you can also call the lyrics_generation API directly to preview the lyrics before generation, or to iterate via the edit mode.

If the user wants specific words, the lyrics_generation API's edit mode lets you modify auto-generated lyrics to match the user's intent without regenerating the whole song.

Reference Map

Safe-use recommendations
Review before installing. Use this only if you are comfortable sending music prompts, lyrics, URLs, images, and audio to MiniMax or related services. Run it in a virtual environment or sandbox, preinstall dependencies yourself, avoid the auto-install YouTube path, keep MINIMAX_API_KEY scoped and rotated if exposed, and choose output paths carefully because some helpers can overwrite files.
Capability tags
crypto · requires-sensitive-credentials
Capability assessment
Purpose & Capability
The declared purpose covers MiniMax music generation, covers, mashups, lyrics APIs, YouTube inputs, emotion analysis, and mmx CLI use. Image/video analysis and VLM captioning are broad but mostly documented as style-enrichment paths for music generation.
Instruction Scope
The documentation says optional tools should be installed only after user approval, but scripts/download_youtube.py automatically invokes pip install yt-dlp with --break-system-packages when yt-dlp is missing. That is a concrete mismatch between runtime behavior and stated operator control.
Install Mechanism
There is no separate hidden installer, but the runtime YouTube helper performs unpinned package installation from pip and mutates the Python environment automatically. That is high-impact environment authority for a skill helper script.
Credentials
Cloud MiniMax use, LRCLib lookup, YouTube download, and optional mmx vision calls are largely purpose-aligned and disclosed. However, compute_audio_embedding.py uses AutoModel.from_pretrained(..., trust_remote_code=True), which can execute Hugging Face repository code without a clear user-facing warning.
Persistence & Privilege
The skill writes analysis JSON, prompts, generated audio, temporary media, and cache files under user-selected paths, /tmp, and ~/.cache/openclaw. No background persistence or credential harvesting was found, but ffmpeg -y and copy operations can overwrite destination files.
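How to use
  1. Make sure OpenClaw is installed (local or Docker deployment)
  2. Enter the install command in the chat box: /install music-craft-minimax
  3. After installation, invoke the Skill by name or trigger it with /music-craft-minimax
  4. Provide the required inputs according to the Skill's parameter documentation to get structured output
Version history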
v1.0.1
v1.0.1: Rename display name from 'OpenClaw Music Workflow — MiniMax' to 'Music Craft — MiniMax' for slug consistency. Bundle body unchanged from v1.0.0 (999-line SKILL.md, 10 reference docs, 21 scripts, 34 files, 587 KB).
v1.0.0
First stable release. MiniMax Music 2.6 power-user upgrade: cover/style transfer, two-song mashup, lyrics generation API, emotion-driven prompt engineering, fine mmx flag control, 26-test smoke suite. Inherits shared content (output file layout, rate limits, anti-sparse rules) from music-craft via cross-references.
Metadata
Slug music-craft-minimax
Version 1.0.1
License MIT-0
Total installs 1
Current installs 1
Versions 2
FAQ

What is Music Craft — MiniMax?

Advanced music generation for OpenClaw, using the MiniMax Music 2.6 token plan. Use for cover and style transfer, two-song mashup, lyrics generation API, emo... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 30 total downloads so far.

How do I install Music Craft — MiniMax?

Run the command "/install music-craft-minimax" in the OpenClaw or Claude Code chat box to install it in one step, with no extra configuration.

Is Music Craft — MiniMax free?

Yes. Music Craft — MiniMax is completely free under the MIT-0 license and can be freely downloaded, installed, and used.

Which platforms does Music Craft — MiniMax support?

Music Craft — MiniMax is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.

Who develops Music Craft — MiniMax?

Developed and maintained by LuisCharro (@luischarro); the current version is v1.0.1.
