Multimodal Base
Author: yuyonghao-123 · v0.1.0 · MIT-0
Total downloads: 134
Favorites: 0
Current installs: 0
Versions: 1
Install in OpenClaw
/install yuyonghao-multimodal-base
Description
Supports image understanding, OCR, speech-to-text, and multi-voice text-to-speech through a unified multimodal pipeline built on OpenAI and Edge TTS.
Security recommendations
This skill largely does what it says (image understanding, OCR, STT and TTS), but there are several red flags you should consider before installing:
- It requires an OpenAI API key (used by multiple modules) even though the registry metadata lists no required env vars — confirm you are willing to provide that key. Limit the key's scope if possible.
- The README asks you to pip install the Python 'edge-tts' CLI, but package.json also lists an npm 'edge-tts' package — clarify which implementation is intended. The code spawns a system 'edge-tts' command, so you must install the Python CLI or otherwise provide that executable.
- The speech recognizer can run locally; in that mode it attempts to download a Whisper model from Hugging Face at runtime and save it to disk. Downloading and executing third-party binaries carries risk; review the model URL and consider running in a sandbox or verifying checksums.
- The code spawns external executables ('whisper' / whisper.cpp, 'edge-tts', 'ffprobe') and writes output/temp files. Ensure you trust the package source and run it in an environment where those binaries and filesystem writes are acceptable.
- If you want to proceed: ask the author to (1) update registry metadata to declare required env vars (OPENAI_API_KEY), (2) clarify install steps (npm vs pip edge-tts), and (3) document where the Whisper model is stored and whether checksums/verifications are provided. Otherwise run the skill in an isolated container or VM and avoid giving it high-privilege credentials.
Functional analysis
Type: OpenClaw Skill
Name: yuyonghao-multimodal-base
Version: 0.1.0
The skill bundle provides a standard multimodal integration for image processing (GPT-4o/Tesseract), speech recognition (Whisper), and text-to-speech (Edge-TTS). The code uses legitimate APIs and follows standard patterns for handling external binaries via `child_process.spawn` with argument arrays, which mitigates basic shell injection risks. No evidence of data exfiltration, malicious persistence, or prompt injection was found across the source files (src/image-processor.js, src/speech-recognizer.js, etc.).
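To illustrate why passing arguments as an array matters, here is a minimal sketch. It is not the skill's actual code. The helper names `buildEdgeTtsArgs` and `runEdgeTts` are hypothetical, the `--text`, `--voice`, and `--write-media` flags are those of the Python edge-tts CLI, and the voice name is only an example. Because `spawn` runs without a shell, each array element reaches the program as one argument, so shell metacharacters in user text are never interpreted.

```javascript
const { spawn } = require('node:child_process');

// Hypothetical helper: build the argument vector for the edge-tts CLI.
// The user-supplied text stays a single array element, whatever it contains.
function buildEdgeTtsArgs(text, voice, outFile) {
  return ['--text', text, '--voice', voice, '--write-media', outFile];
}

// Hypothetical helper: run edge-tts without a shell.
// Defined but not invoked here, since the CLI may not be installed.
function runEdgeTts(text, voice, outFile) {
  return new Promise((resolve, reject) => {
    const child = spawn('edge-tts', buildEdgeTtsArgs(text, voice, outFile), {
      shell: false, // the default, stated explicitly for clarity
    });
    child.on('error', reject); // e.g. the executable is not on PATH
    child.on('close', (code) => {
      if (code === 0) resolve(outFile);
      else reject(new Error(`edge-tts exited with code ${code}`));
    });
  });
}

// Text that would be dangerous if interpolated into a shell command string.
const hostile = 'hello"; rm -rf ~ #';
const args = buildEdgeTtsArgs(hostile, 'en-US-AriaNeural', 'out.mp3');
console.log(args.length);
console.log(args[1] === hostile);
```

An argument array guards against shell injection only. Text that begins with `-` could still be parsed as an option by the target CLI, so input validation remains worthwhile.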
Capability assessment
Purpose & Capability
Code and SKILL.md implement image understanding (OpenAI GPT-4V), OCR (tesseract.js), Whisper-based speech-to-text (API and local), and Edge TTS — which matches the skill description. However, registry metadata declared no required env vars while the code and docs rely on OPENAI_API_KEY. Also SKILL.md asks to pip install Python edge-tts while package.json lists an npm 'edge-tts' dependency — this mismatch is unexplained.
Instruction Scope
Runtime instructions and code perform file reads/writes (images, audio, temp files, output directory), call external network APIs (OpenAI endpoints, Hugging Face model URL), and spawn local executables ('whisper' / whisper.cpp, 'edge-tts', and 'ffprobe'). The pipeline also implements an automatic model download from a Hugging Face URL. Those actions go beyond pure in-process computation and require user awareness and filesystem/network permissions.
Install Mechanism
There is no automated install spec in the registry (instruction-only), but SKILL.md instructs npm install and pip install edge-tts. The code will download a binary model from a Hugging Face URL at runtime (extract/write to disk). Downloading/extracting model binaries and depending on external CLI tools increases risk and should be reviewed; the pip vs npm edge-tts ambiguity is also an installation coherence issue.
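One way to reduce the risk of a runtime model download is to compare the file's SHA-256 digest against a value pinned in advance. The sketch below assumes such a pinned digest exists; the listing does not say whether the author publishes one. The helpers `sha256Hex` and `verifyDownload` are hypothetical names, and the bytes of the string `abc` stand in for the model file.

```javascript
const crypto = require('node:crypto');

// Return the lowercase hex SHA-256 digest of a Buffer or string.
function sha256Hex(data) {
  return crypto.createHash('sha256').update(data).digest('hex');
}

// Hypothetical helper: accept the download only if its digest matches
// the pinned value. Malformed or wrong-length digests are rejected.
function verifyDownload(data, expectedHex) {
  const actual = Buffer.from(sha256Hex(data), 'hex');
  const expected = Buffer.from(String(expectedHex).trim().toLowerCase(), 'hex');
  return actual.length === expected.length &&
    crypto.timingSafeEqual(actual, expected);
}

// Stand-in payload; a real check would hash the downloaded model file.
const payload = Buffer.from('abc');
const pinned = 'ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad';

console.log(verifyDownload(payload, pinned));            // true
console.log(verifyDownload(Buffer.from('abd'), pinned)); // false
```

For a large model file it would be preferable to feed the hash from `fs.createReadStream` chunk by chunk instead of loading the whole file into memory.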
Credentials
The code and documentation require an OpenAI API key (process.env.OPENAI_API_KEY) for image and audio API calls, but the registry metadata lists no required environment variables. The skill also expects system binaries (whisper executable, edge-tts CLI, ffprobe) which are not declared in metadata. The requested access (OpenAI key + ability to write model and audio files + spawn executables) is significant and should be clearly declared and limited to what the user expects.
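Since the registry metadata does not declare these requirements, a user could check them before running the skill. The following is a minimal sketch of such a preflight check, not part of the skill. The names `REQUIRED_ENV`, `REQUIRED_BINARIES`, `isOnPath`, and `preflight` are hypothetical, and the requirement lists are taken from the assessment above.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Assumed requirement lists, based on the assessment above.
const REQUIRED_ENV = ['OPENAI_API_KEY'];
const REQUIRED_BINARIES = ['whisper', 'edge-tts', 'ffprobe'];

// Return true if a file named `name` exists in any PATH directory.
// This checks existence only, not the executable permission bit.
function isOnPath(name, env = process.env, exists = fs.existsSync) {
  const dirs = (env.PATH || '').split(path.delimiter).filter(Boolean);
  const exts = process.platform === 'win32'
    ? (env.PATHEXT || '.EXE;.CMD;.BAT').split(';')
    : [''];
  return dirs.some((dir) =>
    exts.some((ext) => exists(path.join(dir, name + ext))));
}

// Report which environment variables and binaries are missing.
function preflight(env = process.env, exists = fs.existsSync) {
  return {
    missingEnv: REQUIRED_ENV.filter((key) => !env[key]),
    missingBinaries: REQUIRED_BINARIES.filter((bin) => !isOnPath(bin, env, exists)),
  };
}

// Demo with a fake environment in which only ffprobe is present.
const fakeEnv = { PATH: ['/opt/tools', '/usr/bin'].join(path.delimiter) };
const fakeExists = (p) => path.basename(p).startsWith('ffprobe');
const report = preflight(fakeEnv, fakeExists);
console.log(report);
```

Calling `preflight()` with no arguments checks the real environment. The `env` and `exists` parameters exist so the check can be exercised without touching the filesystem.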
Persistence & Privilege
The skill does not request permanent inclusion (always:false) and does not modify other skills or global agent settings. It stores output and temporary files within its own directories but does not claim elevated privileges.
How to use
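- Make sure OpenClaw is installed (locally or via Docker).
- Enter the install command in the chat box: /install yuyonghao-multimodal-base
- Once installation completes, invoke the Skill by name or trigger it with /yuyonghao-multimodal-base
- Provide the inputs described in the Skill's parameter documentation to get structured output.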
Version history
v0.1.0
Initial release of Multimodal Base Skill.
- Provides unified pipeline for image understanding (GPT-4V, OCR), speech recognition (Whisper), and speech synthesis (Edge TTS)
- Supports flexible multimodal input/output handling in one interface
- Includes separate ImageProcessor, SpeechRecognizer, and SpeechSynthesizer modules
- Offers configurable options including model, API key, language, and voice settings
- Out-of-the-box support for 8 TTS voice types across 4 languages
- Documentation and code examples for quick setup and API usage
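FAQ
What is Multimodal Base?
Multimodal Base supports image understanding, OCR, speech-to-text, and multi-voice text-to-speech through a unified multimodal pipeline built on OpenAI and Edge TTS. It is an AI agent Skill plugin for Claude Code / OpenClaw, with 134 total downloads to date.
How do I install Multimodal Base?
Run "/install yuyonghao-multimodal-base" in the OpenClaw or Claude Code chat box for a one-step install. The listing states that no further configuration is needed, but the capability assessment above notes that the code relies on OPENAI_API_KEY and on the edge-tts, whisper, and ffprobe executables.
Is Multimodal Base free?
Yes. Multimodal Base is free and released under the MIT-0 license; you can download, install, and use it freely.
Which platforms does Multimodal Base support?
Multimodal Base is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.
Who develops Multimodal Base?
It is developed and maintained by yuyonghao-123 (@yuyonghao-123); the current version is v0.1.0.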